### Dealing with missing data

We typically see missing values as blank spaces in our data table or as a placeholder string such as NaN, which stands for "not a number", or NULL. Most computational tools are unable to handle such missing values or will produce unpredictable results if we simply ignore them. Therefore, it is crucial that we take care of those missing values before we proceed with further analyses. 
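
As a small illustration of why ignoring missing values is risky, the sketch below shows a plain NumPy mean silently propagating a missing value, while the NaN-aware variant skips it. The array is made up for this example.

```python
import numpy as np

# Hypothetical feature column with one missing entry
values = np.array([1.0, np.nan, 3.0])

plain_mean = np.mean(values)         # NaN propagates through the computation
nan_aware_mean = np.nanmean(values)  # the missing entry is skipped

print(plain_mean)
print(nan_aware_mean)
```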

### Identifying missing values in tabular data

In [1]:
import pandas as pd
from io import StringIO

# no spaces after the commas, so the column names are exactly 'A', 'B', 'C', 'D'
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

df = pd.read_csv(StringIO(csv_data))
df

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN


We can use the `isnull` method to return a `DataFrame` of Boolean values that indicate whether a cell contains a numeric value (`False`) or whether data is missing (`True`). Chaining the `sum` method then returns the number of missing values per column.

In [4]:
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64
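
The same pattern works row-wise. The following sketch rebuilds the small table under the hypothetical name `df_demo` (so that it runs on its own), counts the missing values per row by passing `axis=1` to `sum`, and selects the rows that contain at least one missing value.

```python
import numpy as np
import pandas as pd

# Self-contained copy of the example table (df_demo is a name used only here)
df_demo = pd.DataFrame({
    'A': [1.0, 5.0, 10.0],
    'B': [2.0, 6.0, 11.0],
    'C': [3.0, np.nan, 12.0],
    'D': [4.0, 8.0, np.nan],
})

# Number of missing values in each row
missing_per_row = df_demo.isnull().sum(axis=1)

# Rows that contain at least one missing value
rows_with_missing = df_demo[df_demo.isnull().any(axis=1)]

print(missing_per_row)
print(rows_with_missing)
```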

### Eliminating training examples or features with missing values

One of the easiest ways to deal with missing data is to remove the corresponding features (columns) or training examples (rows) from the dataset entirely. Rows with missing values can easily be dropped via the `dropna` method.

In [5]:
# drop rows that contain at least one NaN
df.dropna(axis=0)

     A    B    C    D
0  1.0  2.0  3.0  4.0


Similarly, we can drop columns that have at least one `NaN` in any row by setting the `axis` argument to 1:

In [6]:
df.dropna(axis=1)

      A     B
0   1.0   2.0
1   5.0   6.0
2  10.0  11.0


The `dropna` method supports several additional parameters that can come in handy:

In [8]:
# drop rows that have fewer than 4 real values
df.dropna(thresh=4)

     A    B    C    D
0  1.0  2.0  3.0  4.0


In [14]:
# only drop rows where NaN appears in specific columns (here: 'A'; it has none, so every row is kept)
df.dropna(subset=['A'])

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN
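
Two further variants are sketched below on a self-contained copy of the table (the name `df_demo` is used only for this illustration): `how='all'` drops only rows in which every value is missing, and `subset=['C']` drops only rows with a missing value in column `C`.

```python
import numpy as np
import pandas as pd

# Self-contained copy of the example table (df_demo is a name used only here)
df_demo = pd.DataFrame({
    'A': [1.0, 5.0, 10.0],
    'B': [2.0, 6.0, 11.0],
    'C': [3.0, np.nan, 12.0],
    'D': [4.0, 8.0, np.nan],
})

# Drop only rows where all columns are NaN (no row qualifies here)
all_nan_dropped = df_demo.dropna(how='all')

# Drop only rows where column C is NaN (removes the row with index 1)
c_present = df_demo.dropna(subset=['C'])

print(all_nan_dropped)
print(c_present)
```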


## Imputing missing values

Often the removal of training examples with missing features is not feasible, because we might lose too much valuable information. In this case, we can use interpolation techniques to estimate the missing values from the other training examples in the dataset.
One of the most common interpolation techniques is **mean imputation**, where we replace each missing value with the mean of its entire feature column.

In [15]:
# Imputing missing values with the column means
from sklearn.impute import SimpleImputer
import numpy as np

imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])
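
Besides `'mean'`, the `strategy` parameter of `SimpleImputer` also accepts `'median'` and `'most_frequent'`. The sketch below uses a made-up array with one outlier to show how the mean and median strategies can differ; the data is an assumption for illustration and not part of the example above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: the first column contains an outlier (100.0)
X = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [7.0, 6.0],
    [100.0, np.nan],
])

mean_imputed = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(X)
median_imputed = SimpleImputer(missing_values=np.nan, strategy='median').fit_transform(X)

print(mean_imputed)    # the missing value in column 0 becomes the column mean
print(median_imputed)  # the missing value in column 0 becomes the column median
```

The median is less sensitive to the outlier, which can matter when a feature column is heavily skewed.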

A more convenient way to impute the missing values is to use pandas' `fillna` method and pass the replacement values (here, the column means returned by `df.mean()`) as an argument.

In [16]:
df.fillna(df.mean())

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   7.5  8.0
2  10.0  11.0  12.0  6.0
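
`fillna` also accepts a dictionary that maps column names to replacement values, which allows a different fill value per column. The following sketch assumes, purely for illustration, that column `C` should receive its median and column `D` a constant of 0.0; `df_demo` is a self-contained copy of the table.

```python
import numpy as np
import pandas as pd

# Self-contained copy of the example table (df_demo is a name used only here)
df_demo = pd.DataFrame({
    'A': [1.0, 5.0, 10.0],
    'B': [2.0, 6.0, 11.0],
    'C': [3.0, np.nan, 12.0],
    'D': [4.0, 8.0, np.nan],
})

# Per-column fill values: median for C, a constant for D (illustrative choices)
filled = df_demo.fillna({'C': df_demo['C'].median(), 'D': 0.0})
print(filled)
```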


## Handling categorical data

When talking about categorical data, we have to further distinguish between **ordinal** and **nominal** features.
Ordinal features can be understood as categorical values that can be sorted or ordered. For example, t-shirt size would be an ordinal feature, because we can define an order: XL > L > M.
In contrast, nominal features don't imply any order. To continue with the previous example, we could think of t-shirt color as a nominal feature, since it typically doesn't make sense to say that red is larger than blue.
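
One way to make this distinction explicit in pandas is the `Categorical` type, which can be declared as ordered or left unordered. This is a minimal sketch with made-up values; it is an alternative to the dictionary-based mapping and not part of the workflow shown in this section.

```python
import pandas as pd

# Ordinal feature: categories are listed from smallest to largest
sizes = pd.Series(pd.Categorical(
    ['M', 'XL', 'L'],
    categories=['M', 'L', 'XL'],
    ordered=True,
))

# Nominal feature: no order is declared
colors = pd.Series(pd.Categorical(['green', 'red', 'blue']))

# Order comparisons are meaningful for the ordered categorical
larger_than_m = sizes > 'M'

print(larger_than_m.tolist())
print(sizes.max())
print(colors.cat.ordered)
```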


In [17]:
# Categorical data encoding with pandas
import pandas as pd
df = pd.DataFrame([
    ['green', 'M', 10.1, 'class2'],
    ['red', 'L', 13.5, 'class1'],
    ['blue', 'XL', 15.3, 'class2']
])

df.columns = ['color', 'size', 'price', 'classlabel']
df

   color size  price classlabel
0  green    M   10.1     class2
1    red    L   13.5     class1
2   blue   XL   15.3     class2


In [18]:
# Mapping ordinal features

size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)
df

   color  size  price classlabel
0  green     1   10.1     class2
1    red     2   13.5     class1
2   blue     3   15.3     class2
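
One caveat with a hand-written mapping: `map` returns `NaN` for any value that is not a key in the dictionary. The sketch below uses a hypothetical size `'S'` that is absent from the mapping and shows one way to spot such entries.

```python
import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1}

# 'S' is a hypothetical value that the mapping does not cover
sizes = pd.Series(['M', 'L', 'S'])
mapped = sizes.map(size_mapping)

# Values without a mapping show up as NaN and can be listed like this
unmapped = sizes[mapped.isnull()]

print(mapped)
print(unmapped.tolist())
```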


In [20]:
# Transform the integers back to the original string representation
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)


0     M
1     L
2    XL
Name: size, dtype: object

### Encoding class labels


In [21]:
import numpy as np

class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
class_mapping

{'class1': 0, 'class2': 1}

We can use the mapping dictionary to transform the class labels into integers:

In [22]:
df['classlabel'] = df['classlabel'].map(class_mapping)
df

   color  size  price  classlabel
0  green     1   10.1           1
1    red     2   13.5           0
2   blue     3   15.3           1


In [24]:
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'].map(inv_class_mapping)

0    class2
1    class1
2    class2
Name: classlabel, dtype: object

Alternatively, there is a convenient `LabelEncoder` class directly implemented in scikit-learn to achieve this:

In [25]:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
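# df['classlabel'] holds integers at this point, so recover the string labels first
y = class_le.fit_transform(df['classlabel'].map(inv_class_mapping).values)
y

array([1, 0, 1])

In [27]:
class_le.inverse_transform(y)

array(['class2', 'class1', 'class2'], dtype=object)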
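
After fitting, `LabelEncoder` stores the sorted unique labels in its `classes_` attribute, so the integer it assigns to a label is that label's position in the sorted list. The following self-contained sketch (with the label array typed in by hand) suggests why this agrees with a dictionary built via `enumerate(np.unique(...))`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Labels typed in by hand so that the sketch runs on its own
labels = np.array(['class2', 'class1', 'class2'])

le = LabelEncoder()
encoded = le.fit_transform(labels)
decoded = le.inverse_transform(encoded)

# Manual mapping built the same way as in the dictionary-based approach
manual_mapping = {label: idx for idx, label in enumerate(np.unique(labels))}
manual_encoded = [manual_mapping[label] for label in labels]

print(le.classes_)
print(encoded)
print(decoded)
```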

In [None]:
### Performing one-hot encoding on nominal features
