### Dealing with Missing data

We typically see missing values as blank spaces in our data table or as a placeholder string such as NaN, which stands for "not a number", or NULL. Most computational tools are unable to handle such missing values or will produce unpredictable results if we simply ignore them. Therefore, it is crucial that we take care of those missing values before we proceed with further analyses. 

### Identifying missing values in tabular data

In [37]:
import pandas as pd
from io import StringIO

csv_data = \
    '''A, B, C, D
        1.0,2.0,3.0,4.0
        5.0,6.0,,8.0
        10.0,11.0,12.0,'''

df = pd.read_csv(StringIO(csv_data))
df

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


We can use the `isnull` method to return a `DataFrame` of Boolean values that indicate whether a cell contains a numeric value (`False`) or whether data is missing (`True`). Chaining the `sum` method then returns the number of missing values per column.

In [38]:
df.isnull().sum()

A     0
 B    0
 C    1
 D    1
dtype: int64
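
As a complement to the per-column counts, the following sketch also reports missing values per row and as a fraction per column. It rebuilds a small frame with the same values as the example so that it runs on its own; the names `df_demo`, `missing_per_row`, `missing_fraction` and `rows_with_nan` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same values as the example table, built directly so the sketch is self-contained
df_demo = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, np.nan, 8.0],
     [10.0, 11.0, 12.0, np.nan]],
    columns=['A', 'B', 'C', 'D'],
)

# Number of missing cells in each row (axis=1 sums across the columns)
missing_per_row = df_demo.isnull().sum(axis=1)

# Share of missing cells per column (the mean of a Boolean column is a fraction)
missing_fraction = df_demo.isnull().mean()

# Rows that contain at least one missing value
rows_with_nan = df_demo[df_demo.isnull().any(axis=1)]

print(missing_per_row.tolist())
print(missing_fraction.round(2).to_dict())
print(rows_with_nan.index.tolist())
```

Looking at the fraction rather than the raw count can make it easier to judge whether dropping a column would discard much information.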

### Eliminating training examples or features with missing values

- One of the easiest ways to deal with missing data is to remove the corresponding features (columns) or training examples (rows) from the dataset. Rows with missing values can easily be dropped via the `dropna` method.

In [39]:
# drop rows that contain at least one NaN
df.dropna(axis=0)

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


Similarly, we can drop columns that have at least one `NaN` in any row by setting the `axis` argument to 1:

In [40]:
df.dropna(axis=1)

Unnamed: 0,A,B
0,1.0,2.0
1,5.0,6.0
2,10.0,11.0


The `dropna` method supports several additional parameters that can come in handy:

In [41]:
# drop rows that have fewer than 4 real values
df.dropna(thresh=4)

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


In [42]:
# only drop rows where NaN appears in specific columns (here: A, which has no NaN, so no rows are dropped)
df.dropna(subset=['A'])

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,
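
The `how` parameter of `dropna` is another option that is not shown above. The sketch below contrasts `how='all'`, `how='any'` and `thresh` on a small hypothetical frame (`df_demo`) that includes one completely empty row; the frame and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one complete row, one partially missing row, one empty row
df_demo = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, np.nan, 8.0],
     [np.nan, np.nan, np.nan, np.nan]],
    columns=['A', 'B', 'C', 'D'],
)

# how='all' drops a row only if every value in it is NaN
only_empty_removed = df_demo.dropna(how='all')

# how='any' (the default) drops a row as soon as one value is NaN
any_nan_removed = df_demo.dropna(how='any')

# thresh=3 keeps rows that have at least 3 non-NaN values
at_least_three = df_demo.dropna(thresh=3)

print(only_empty_removed.index.tolist())
print(any_nan_removed.index.tolist())
print(at_least_three.index.tolist())
```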


## Imputing missing values

Often the removal of training examples or entire feature columns is not feasible, because we might lose too much valuable information. In this case we can use various interpolation techniques to estimate the missing values from the other training examples.
One of the most common interpolation techniques is **mean imputation**, where we replace the missing value with the mean value of the entire feature column.

In [43]:
# Impute missing values with the column mean
from sklearn.impute import SimpleImputer
import numpy as np

imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])
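
Besides `'mean'`, `SimpleImputer` also accepts `'median'` and `'most_frequent'` for its `strategy` parameter. The following is a minimal sketch on a hypothetical array (`X_demo`) chosen so that the two strategies give different fill values; it is meant only to illustrate the parameter, not to recommend one strategy over another.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: each column has one missing entry
X_demo = np.array([
    [1.0, 2.0],
    [np.nan, 2.0],
    [3.0, np.nan],
    [8.0, 5.0],
    [8.0, 9.0],
])

# Median of the observed values in each column: [5.5, 3.5]
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
X_median = median_imputer.fit_transform(X_demo)

# Most frequent observed value in each column: [8.0, 2.0]
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X_mode = mode_imputer.fit_transform(X_demo)

# The fitted fill values are stored in the statistics_ attribute
print(median_imputer.statistics_)
print(mode_imputer.statistics_)
```

The `'most_frequent'` strategy can be useful for categorical feature columns, where a mean or median is not defined.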

A more convenient way to impute the missing values is to use pandas' `fillna` method and pass the replacement values, here the column means, as an argument.

In [44]:
df.fillna(df.mean())

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.5,8.0
2,10.0,11.0,12.0,6.0


## Handling Categorical data

When talking about categorical data, we have to further distinguish between **ordinal** and **nominal** features.
Ordinal features can be understood as categorical values that can be sorted or ordered. For example, t-shirt size would be an ordinal feature, because we can define an order: XL > L > M.
In contrast, nominal features don't imply any order. To continue with the previous example, we could think of t-shirt color as a nominal feature, since it typically doesn't make sense to say that red is larger than blue.


In [45]:
# Categorical data encoding with pandas
import pandas as pd
df = pd.DataFrame([
    ['green', 'M', 10.1, 'class2'],
    ['red', 'L', 13.5, 'class1'],
    ['blue', 'XL', 15.3, 'class2']
])

df.columns = ['color', 'size', 'price', 'classlabel']
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class2
1,red,L,13.5,class1
2,blue,XL,15.3,class2


In [46]:
# Mapping ordinal features

size_mapping = {'XL': 3, 'L': 2, 'M' : 1}
df['size'] = df['size'].map(size_mapping)
df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,class2
1,red,2,13.5,class1
2,blue,3,15.3,class2


In [47]:
# Transform the integers back to the original string representation
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)


0     M
1     L
2    XL
Name: size, dtype: object
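
An alternative to a hand-written mapping dictionary is pandas' ordered categorical dtype, which stores the order alongside the data. The sketch below is an illustration under the assumption that the sizes are ordered M < L < XL; the names `sizes`, `size_dtype` and `sizes_cat` are hypothetical. Note that the resulting codes start at 0, whereas `size_mapping` above starts at 1.

```python
import pandas as pd

sizes = pd.Series(['M', 'L', 'XL', 'M'])

# Declare the order explicitly; pandas does not infer it from the strings
size_dtype = pd.CategoricalDtype(categories=['M', 'L', 'XL'], ordered=True)
sizes_cat = sizes.astype(size_dtype)

# .cat.codes gives the zero-based position of each value in the declared order
codes = sizes_cat.cat.codes  # M -> 0, L -> 1, XL -> 2

# Comparisons respect the declared order
larger_than_m = sizes_cat > 'M'

print(codes.tolist())
print(larger_than_m.tolist())
```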

### Encoding class labels


In [48]:
import numpy as np

class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
class_mapping

{'class1': 0, 'class2': 1}

We can use the mapping dictionary to transform the class labels into integers:

In [49]:
df['classlabel'] = df['classlabel'].map(class_mapping)
df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,1
1,red,2,13.5,0
2,blue,3,15.3,1


In [50]:
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'].map(inv_class_mapping)

0    class2
1    class1
2    class2
Name: classlabel, dtype: object

Alternatively, there is a convenient `LabelEncoder` class directly implemented in scikit-learn to achieve this:

In [51]:
from sklearn.preprocessing import LabelEncoder
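class_le = LabelEncoder()
# 'classlabel' was already mapped to integers above, so restore the
# original string labels before encoding them
y = class_le.fit_transform(df['classlabel'].map(inv_class_mapping).values)
y

array([1, 0, 1])

In [52]:
class_le.inverse_transform(y)

array(['class2', 'class1', 'class2'], dtype=object)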

### Performing one-hot encoding on nominal features

One-hot encoding is the process by which categorical data are converted into numerical data for use in machine learning. Categorical features are turned into binary features that are "one-hot" encoded, meaning that a column receives `1` if the example belongs to the category that the column represents and `0` otherwise.

The idea behind this approach is to create a new dummy feature for each unique value in the nominal feature column. Here, we would convert the `color` feature into three new features: `blue`, `green` and `red`. Binary values can then be used to indicate the particular `color` of an example; for instance, a `blue` example can be encoded as `blue=1, green=0, red=0`. To perform this transformation, we can use the `OneHotEncoder` that is implemented in scikit-learn's preprocessing module.


In [53]:

from sklearn.preprocessing import OneHotEncoder

X = df[['color', 'size', 'price']].values
color_ohe = OneHotEncoder()
color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()


array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])
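
To see which output column corresponds to which color, the fitted encoder's `categories_` attribute can be inspected. The sketch below also shows the `handle_unknown='ignore'` option, which is not used in the cell above; the color `'purple'` and the variable names are hypothetical and only serve to illustrate a category that was not seen during fitting.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same three colors as in the example, as a single-column object array
colors = np.array([['green'], ['red'], ['blue']], dtype=object)

# handle_unknown='ignore' encodes unseen categories as all zeros instead of raising
ohe = OneHotEncoder(handle_unknown='ignore')
encoded = ohe.fit_transform(colors).toarray()

# categories_ holds the learned values per input column, in output-column order
print(list(ohe.categories_[0]))

# A color that was not present during fit maps to a row of zeros
unseen = ohe.transform(np.array([['purple']], dtype=object)).toarray()
print(unseen)
```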

If we want to selectively transform columns in a multi-feature array, we can use the `ColumnTransformer`, which accepts a list of `(name, transformer, column(s))` tuples:

In [54]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = df[['color', 'size', 'price']].values
c_trasf = ColumnTransformer(
    [
        ('onehot', OneHotEncoder(), [0] ),
        ('nothing', 'passthrough', [1,2])
    ]
)

c_trasf.fit_transform(X).astype(float)

array([[ 0. ,  1. ,  0. ,  1. , 10.1],
       [ 0. ,  0. ,  1. ,  2. , 13.5],
       [ 1. ,  0. ,  0. ,  3. , 15.3]])

An even more convenient way to create those dummy features via one-hot encoding is to use the `get_dummies` function implemented in pandas. Applied to a `DataFrame`, `get_dummies` will only convert string columns and leave all other columns unchanged.

In [55]:
pd.get_dummies(df[['price', 'color', 'size']])

Unnamed: 0,price,size,color_blue,color_green,color_red
0,10.1,1,0,1,0
1,13.5,2,0,0,1
2,15.3,3,1,0,0


To reduce the correlation among variables, we can simply remove one feature column from the one-hot encoded array. Note that we do not lose any important information by removing a feature column. For example, if we remove the column `color_blue`, the feature information is still preserved: if we observe `color_green=0` and `color_red=0`, the observation must be `blue`.
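
The redundancy can be checked numerically. The following sketch, using a small hypothetical one-hot array, shows that the full set of dummy columns is linearly dependent once an intercept column is present, and that the dropped column can be reconstructed from the remaining ones. The array and variable names are assumptions made for illustration.

```python
import numpy as np

# One-hot columns in the order blue, green, red (rows: green, red, blue, green)
onehot = np.array([
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
], dtype=float)

# Every row sums to 1, so the three columns add up to the intercept column
intercept = np.ones((onehot.shape[0], 1))
full = np.hstack([intercept, onehot])            # 4 columns
reduced = np.hstack([intercept, onehot[:, 1:]])  # 3 columns, blue dropped

rank_full = np.linalg.matrix_rank(full)        # lower than the number of columns
rank_reduced = np.linalg.matrix_rank(reduced)  # equal to the number of columns

# The dropped column is recoverable: blue = 1 - green - red
blue_recovered = 1 - onehot[:, 1:].sum(axis=1)

print(rank_full, full.shape[1])
print(rank_reduced, reduced.shape[1])
print(blue_recovered.tolist())
```

A rank-deficient design matrix is what causes trouble for methods that require a matrix inversion, which is why one column is usually dropped.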

If we use the `get_dummies` function, we can drop the first column by passing `True` to the `drop_first` parameter, as shown in the following example:


In [56]:
pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)

Unnamed: 0,price,size,color_green,color_red
0,10.1,1,1,0
1,13.5,2,0,1
2,15.3,3,0,0


In order to drop a redundant column via the `OneHotEncoder`, we need to set `drop='first'` and `categories='auto'` as follows:

In [57]:
color_ohe = OneHotEncoder(categories='auto', drop='first')
c_trasf = ColumnTransformer(
    [
        ('onehot', color_ohe, [0] ),
        ('nothing', 'passthrough', [1,2])
    ]
)

c_trasf.fit_transform(X).astype(float)


array([[ 1. ,  0. ,  1. , 10.1],
       [ 0. ,  1. ,  2. , 13.5],
       [ 0. ,  0. ,  3. , 15.3]])

In [58]:
# Partitioning a dataset into separate training and test datasets
# Load the Wine dataset

df_wine = pd.read_csv('../data/wine.data', header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins','Color intensity','Hue','OD280/OD315 of diluted wines', 'Proline']

In [59]:
df_wine

Unnamed: 0,Class label,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050
2,1,13.16,2.36,2.67,18.6,101,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.50,16.8,113,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,3,13.71,5.65,2.45,20.5,95,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740
174,3,13.40,3.91,2.48,23.0,102,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750
175,3,13.27,4.28,2.26,20.0,120,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835
176,3,13.17,2.59,2.37,20.0,120,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840


In [60]:
from sklearn.model_selection import train_test_split

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
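
The effect of the `stratify` argument can be illustrated on a small synthetic example. The sketch below uses hypothetical, imbalanced labels (`y_demo`, 80 examples of class 0 and 20 of class 1) and checks that both subsets keep the same class proportions as the full dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 examples of class 0, 20 of class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo asks for the class proportions to be preserved in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Class proportions in each subset
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)

print(len(y_tr), len(y_te))
print(train_share, test_share)
```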

## Bringing Features on to the same scale

**Feature Scaling** is a crucial step in our preprocessing pipeline that can easily be forgotten. Decision trees and random forests are two of the very few machine learning algorithms where we do not need to think about feature scaling, as they are scale invariant.
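
The scale invariance of tree-based models can be checked empirically. The sketch below trains two decision trees on synthetic data, once on the raw features and once on rescaled features, and compares their predictions. The data, the rescaling factors (powers of two, chosen so that the rescaling itself introduces no rounding error) and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a simple nonlinear decision rule
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] ** 2 > 0.5).astype(int)

# Rescale each feature by a different positive factor
factors = np.array([1024.0, 2.0, 64.0])
X_scaled = X_demo * factors

# Same hyperparameters and seed, different feature scales
tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_scaled, y_demo)

# Compare predictions on new points (rescaled the same way for the second tree)
X_new = rng.normal(size=(50, 3))
same = np.array_equal(tree_raw.predict(X_new), tree_scaled.predict(X_new * factors))
print(same)
```

Because a tree only compares each feature against a threshold, rescaling a feature shifts the threshold but leaves the resulting partition of the data unchanged.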

There are two common approaches to bringing different features onto the same scale: **normalization** and **standardization**.
Normalization refers to the rescaling of the features to the range [0, 1], which is a special case of **min-max scaling**. To normalize our data, we can simply apply min-max scaling to each feature column, where the new value, $x_{norm}^{(i)}$, of an example, $x^{(i)}$, can be calculated as follows:

$ x_{norm} ^ {(i)} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}} $ <br/>

Here, $x^{(i)}$ is a particular example, $x_{min}$ is the smallest value in a feature column, and $x_{max}$ is the largest value.

The min-max scaling procedure is implemented in scikit-learn and can be used as follows:

In [61]:
# Min-max scaling: bring the features onto the same scale
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()

X_train_norm = mms.fit_transform(X_train)
# reuse the minimum and maximum learned from the training data
X_test_norm = mms.transform(X_test)

X_train_norm


array([[0.64619883, 0.83201581, 0.4248366 , ..., 0.45744681, 0.28571429,
        0.19400856],
       [0.6871345 , 0.15612648, 0.65359477, ..., 0.81914894, 0.63369963,
        0.68259629],
       [0.67836257, 0.15019763, 0.65359477, ..., 0.75531915, 0.52747253,
        0.71825963],
       ...,
       [0.72222222, 0.84980237, 0.34640523, ..., 0.10638298, 0.02197802,
        0.09771755],
       [0.16081871, 0.06916996, 0.39215686, ..., 0.54255319, 0.68131868,
        0.43366619],
       [0.37719298, 0.61857708, 0.45751634, ..., 0.75531915, 0.68131868,
        0.13195435]])
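
The scaling parameters should be estimated on the training data only and then reused for the test data. The following sketch works through the min-max formula by hand on two small hypothetical arrays (`train` and `test`) and cross-checks the result against `MinMaxScaler`; it illustrates that test values outside the training range end up outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
train = np.array([2.0, 4.0, 6.0, 10.0])
test = np.array([1.0, 6.0, 12.0])

# x_min and x_max are estimated from the training data only
x_min, x_max = train.min(), train.max()

train_norm = (train - x_min) / (x_max - x_min)
test_norm = (test - x_min) / (x_max - x_min)

# Cross-check: fit on the training data, then transform the test data
mms_demo = MinMaxScaler().fit(train.reshape(-1, 1))
matches = np.allclose(mms_demo.transform(test.reshape(-1, 1)).ravel(), test_norm)

print(train_norm)
print(test_norm)
print(matches)
```

Calling `fit_transform` on the test data instead would rescale it with its own minimum and maximum, so the same raw value would map to different scaled values in the two subsets.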

Using standardization, we center the feature columns at mean 0 with standard deviation 1, so that the feature columns have the same parameters as a standard normal distribution (zero mean and unit variance).
Furthermore, standardization maintains useful information about outliers and makes the algorithm less sensitive to them, in contrast to min-max scaling, which scales the data to a limited range of values.

The procedure for standardization can be expressed by the following equation:

$x_{std}^{(i)} = \frac{x^{(i)} - \mu_x}{\sigma_x}$ <br/>

Here, $\mu_x$ is the sample mean of the particular feature column, and $\sigma_x$ is the corresponding standard deviation.


You can perform standardization and normalization manually, as in the following code example:

In [62]:
ex = np.array([0, 1, 2, 3, 4, 5])
print('standardized:', (ex - ex.mean()) / ex.std())
print('normalized:', (ex - ex.min()) / (ex.max() - ex.min()))

standardized: [-1.46385011 -0.87831007 -0.29277002  0.29277002  0.87831007  1.46385011]
normalized: [0.  0.2 0.4 0.6 0.8 1. ]


scikit-learn also implements a class for standardization:

In [63]:
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
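
To connect the class to the equation, the sketch below reproduces `StandardScaler` by hand on a small hypothetical array. It relies on the fact that `StandardScaler` uses the population standard deviation, which corresponds to NumPy's default `ddof=0`; the arrays and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test data with two features
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_te = np.array([[4.0, 30.0]])

# Fit on the training data only
scaler = StandardScaler().fit(X_tr)

# Manual version: per-column mean and population standard deviation (ddof=0)
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)

manual_train = (X_tr - mu) / sigma
manual_test = (X_te - mu) / sigma

train_matches = np.allclose(scaler.transform(X_tr), manual_train)
test_matches = np.allclose(scaler.transform(X_te), manual_test)
print(train_matches, test_matches)
```

Note that pandas' `std` method defaults to `ddof=1`, so a manual standardization written with pandas would differ slightly unless `ddof=0` is passed explicitly.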