# Introduction 

In this notebook we'll be looking at two issues:
- Missing data
- Converting categorical data to numerical data

# Missing Data

- Missing values are common in real-world problems (e.g., data collected over long periods or from multiple sources).
- ML models require careful handling of missing data.
- One strategy is to impute the missing values. Strategies include:
    - simple statistics, e.g., mean, median or mode
    - matrix factorization methods like SVD
    - statistical models like Kalman filters
    - etc

In this notebook we'll be looking at: 
- Mean -  average of all values in a set
- Median - the "middle" number in a set of numbers sorted by size
- Mode - the most common category / numerical value.
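
As a quick illustration of these three statistics, here is a minimal sketch on a small made-up series (`toy_values` is a hypothetical example, not part of the Titanic data):

```python
import pandas as pd

# Hypothetical toy values, used only to illustrate the three statistics
toy_values = pd.Series([1, 2, 2, 3, 10])

toy_mean = toy_values.mean()      # (1 + 2 + 2 + 3 + 10) / 5
toy_median = toy_values.median()  # middle value of the sorted series
toy_mode = toy_values.mode()[0]   # .mode() returns a Series, since ties are possible

print("mean:", toy_mean, "median:", toy_median, "mode:", toy_mode)
```

Note how the single large value (10) pulls the mean upwards, while the median and the mode stay at 2.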

In addition, basic pandas DataFrame operations will be used.

On Kaggle you can find a competition whose goal is to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.). See https://www.kaggle.com/competitions/titanic/ for full details.

Let us start building our own solution by loading the dataset with the pandas library.

In [None]:
import pandas as pd

titanic_df = pd.read_csv("data/titanic_train.csv")
titanic_df.head(5) # show the first 5 lines

We can see immediately that:
- there are categorical and numerical features
- there are some missing values

This can also be checked by running the `.info()` method and looking at the _Non-Null Count_ column:

In [None]:
titanic_df.info()

For example, _Age_ is numerical (float64) and is defined for only 714 of the 891 rows.
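
The missing values can also be counted directly. The sketch below uses a small made-up dataframe (`toy_df` and its values are assumptions for illustration only); the same calls should work on `titanic_df`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe with a numerical and a categorical column
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

# .isna() marks each missing cell with True; summing counts them per column
missing_per_column = toy_df.isna().sum()

# the mean of the True/False marks gives the fraction of missing values
missing_fraction = toy_df.isna().mean()

print(missing_per_column)
print(missing_fraction)
```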

As many algorithms do not accept `nan` values, several strategies can be taken to solve this, e.g.:
- remove rows with _nan_
- numerical data: replace the _nan_ values by, e.g., 0, mean, median etc.
- categorical data: replace the _nan_ values by, e.g., mode, "" (empty string) etc.

## Remove rows with _nan_

Rows containing `nan` can be removed with the `.dropna()` method, but this can mean a major loss of the original data.

In [None]:
# remove incomplete rows (rows with at least one nan)
titanic_df_with_nan_dropped = titanic_df.dropna()
titanic_df_with_nan_dropped.info()

The original dataset has 891 entries, which would be reduced to 183. Maybe this is not a good idea!
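
A less drastic variant is to drop a row only when specific columns are missing, using the `subset` parameter of `.dropna()`. This is a minimal sketch on made-up data (`toy_passengers` and its values are hypothetical), not a recommendation for the Titanic dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "Cabin" is missing more often than "Age"
toy_passengers = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# drop every row that has at least one nan in any column
all_complete = toy_passengers.dropna()

# drop only the rows where "Age" is nan, ignoring the other columns
age_known = toy_passengers.dropna(subset=["Age"])

print(len(toy_passengers), len(all_complete), len(age_known))
```

Here only one row survives the plain `.dropna()`, while three rows are kept when only _Age_ is required.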

## Numerical data: replace the nan values by, e.g., 0, mean, median etc.

The example has both numerical (_Age_) and categorical (_Cabin_) missing data. Let us look at the numerical column first.

First, look at the distribution of the data:

In [None]:
titanic_df["Age"].plot(kind="box")

In [None]:
titanic_df["Age"].plot(kind="hist")

- Replace missing values with the mean
    - when the data is "centered", i.e., roughly symmetrically distributed around the mean value

- Replace missing values with the median
    - when the data is "skewed", i.e., not symmetrically distributed around the mean value (the median is less sensitive to extreme values)
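
The difference between the two cases can be sketched with two small made-up series (`symmetric` and `skewed` are hypothetical values chosen for illustration):

```python
import pandas as pd

# Hypothetical values: the second series has one extreme value
symmetric = pd.Series([18, 20, 22, 24, 26])
skewed = pd.Series([18, 20, 22, 24, 80])

for name, values in [("symmetric", symmetric), ("skewed", skewed)]:
    # .skew() is close to 0 for symmetric data and positive for a long right tail
    print(name, "mean:", values.mean(), "median:", values.median(), "skew:", values.skew())
```

For the symmetric series the mean and the median coincide, so either is a reasonable replacement value. For the skewed series the mean is pulled towards the extreme value, while the median stays with the bulk of the data.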


First compute the `mean` / `median` (we could also define our own constant)

In [None]:
# compute the value that will replace the nan entries
value_to_replace_nan = titanic_df["Age"].median()  # also try 0 or .mean()
print("value_to_replace_nan: ", value_to_replace_nan)

Then use the `.fillna()` method applied to the _Age_ column with the computed value

In [None]:
titanic_df_age_imputed = titanic_df.copy()  # make a copy so we do not lose the original dataset for later experiments!

titanic_df_age_imputed["Age"] = titanic_df["Age"].fillna(value_to_replace_nan)

You can check that the dataframe no longer has missing values in the _Age_ column:

In [None]:
titanic_df_age_imputed.info()

## Categorical data: replace the nan values by, e.g., mode, "" (empty string) etc.

This case is somewhat different from others, since the _Cabin_ column has many distinct values with few repetitions.
In other cases the values come from more limited sets (e.g., a color, True/False, animal species, etc.).

In [None]:
print(set(titanic_df["Cabin"]))

So it may be more useful to extract the deck (the 1st letter of the cabin) and compute the mode from there:

In [None]:
def transf(x):
    # non-empty strings are reduced to their 1st letter (the deck)
    if isinstance(x, str) and x:
        return x[0]
    return x  # anything else (e.g., nan) is returned unchanged

titanic_cabin_df = titanic_df.copy()  # make a copy so we do not lose the original dataset for later experiments!
titanic_cabin_df["Cabin"] = titanic_df["Cabin"].apply(transf)

titanic_cabin_df

We can now see that, for the known cabins/"decks", the mode should be "C" by plotting a bar chart of the value counts:

In [None]:
titanic_cabin_df["Cabin"].value_counts().plot(kind='bar')

Or by computing the mode of the "Cabin" column

In [None]:
value_to_replace_nan = titanic_cabin_df["Cabin"].mode()
value_to_replace_nan

As before we can use the `.fillna()` method

In [None]:
titanic_cabin_df["Cabin"] = titanic_cabin_df["Cabin"].fillna(value_to_replace_nan[0])
titanic_cabin_df
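
Filling with the mode assumes that the unknown cabins resemble the most common known one. An alternative, similar to the empty-string option mentioned earlier, is to keep the missing entries as their own category. The sketch below compares both on made-up data (`toy_decks` and the `"Unknown"` label are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical deck letters with two missing entries
toy_decks = pd.Series(["C", np.nan, "B", "C", np.nan])

# option 1: replace nan with the most common known value
filled_with_mode = toy_decks.fillna(toy_decks.mode()[0])

# option 2: replace nan with an explicit placeholder category
filled_with_label = toy_decks.fillna("Unknown")

print(filled_with_mode.tolist())
print(filled_with_label.tolist())
```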

# Converting categorical to numerical 

Most ML algorithms do not accept categorical (string) values as input, so categorical data must somehow be converted to numerical data.
Solutions include:
- categories mapping (e.g., True->1 / False->0 or S->0 / M->1 / L->2 / etc.)
- one hot encoding of data
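
The first option, category mapping, can be sketched with the `.map()` method and a dictionary. The dataframe `toy_shirts` and the mapping `size_order` below are hypothetical examples:

```python
import pandas as pd

# Hypothetical data with an ordered category and a boolean column
toy_shirts = pd.DataFrame({
    "Size": ["S", "M", "L", "M"],
    "InStock": [True, False, True, True],
})

# explicit mapping that preserves the natural order S < M < L
size_order = {"S": 0, "M": 1, "L": 2}
toy_shirts["SizeCode"] = toy_shirts["Size"].map(size_order)

# booleans can be converted directly: True -> 1, False -> 0
toy_shirts["InStockCode"] = toy_shirts["InStock"].astype(int)

print(toy_shirts)
```

A mapping like this implies an order between the categories, so it fits best when such an order really exists.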

Here we'll be looking at the latter case (one-hot encoding).

For a simpler example, consider the following pandas dataframe

In [None]:
df = pd.DataFrame({"Col1": ["a", "b","a", "a", "c"]})
df

pandas provides the `pd.get_dummies()` function, which results in:

In [None]:
pd.get_dummies(df)

For the Titanic example we could simply run the following, which applies `pd.get_dummies()` to all categorical columns:

In [None]:
pd.get_dummies(titanic_df)

In our case, this operation should be combined with the previous ones, as 1371 columns were returned (for instance, a column was created for each name in the dataframe):

In [None]:
print(set(pd.get_dummies(titanic_df).columns))
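
One way to keep the number of columns under control is to drop high-cardinality columns first and encode only a chosen set of columns. This is a minimal sketch on made-up data (`toy_titanic`, its values and the choice of columns are assumptions, not the notebook's final solution):

```python
import pandas as pd

# Hypothetical miniature of the Titanic data
toy_titanic = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Age": [22.0, 38.0, 26.0],
})

encoded = pd.get_dummies(
    toy_titanic.drop(columns=["Name"]),  # a nearly unique column carries no reusable category
    columns=["Sex", "Embarked"],         # encode only these columns
    dtype=int,                           # 0/1 integers instead of True/False
)

print(encoded.columns.tolist())
```

The result has one column per category of _Sex_ and _Embarked_, plus the untouched _Age_ column, instead of one column per passenger name.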