### Missing value treatment

Let's create some data to demonstrate missing value treatment.

In [1]:
import pandas as pd
from io import StringIO

In [2]:
csv_data = \
'''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

In [3]:
df = pd.read_csv(StringIO(csv_data))

In [4]:
df

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [21]:
#  Getting the number of missing values in each column
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64

In [20]:
#  Getting the percentage of missing values in each column
df.isnull().mean()*100

A     0.000000
B     0.000000
C    33.333333
D    33.333333
dtype: float64

In [8]:
# Deleting the rows with null values
df.dropna(axis=0)

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


In [9]:
# Deleting the columns with null values
df.dropna(axis=1)

Unnamed: 0,A,B
0,1.0,2.0
1,5.0,6.0
2,10.0,11.0


In [11]:
# Drop only the rows in which all values are null
df.dropna(how='all')

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [12]:
# Drop rows that have fewer than 4 non-null values
df.dropna(thresh=4)

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


In [13]:
# Only drop rows where NaN appears in specific columns (here: 'C')
df.dropna(subset=['C'])

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
2,10.0,11.0,12.0,


- Although removing missing data seems to be a convenient approach, it also comes with certain __disadvantages__:
> a. We may end up removing too many samples, which will make a reliable analysis impossible.<br>
> b. If we remove too many feature columns, we run the risk of losing valuable information that our classifier needs to discriminate between classes.
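
To make the cost of deletion concrete, the following minimal sketch counts how much of the toy data survives each strategy. The name `df_demo` is introduced here only for illustration; it rebuilds the same values as `df` so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Same toy data as above, rebuilt so the snippet is self-contained
df_demo = pd.DataFrame({
    'A': [1.0, 5.0, 10.0],
    'B': [2.0, 6.0, 11.0],
    'C': [3.0, np.nan, 12.0],
    'D': [4.0, 8.0, np.nan],
})

rows_kept = len(df_demo.dropna(axis=0))      # rows without any NaN
cols_kept = df_demo.dropna(axis=1).shape[1]  # columns without any NaN

print(f"rows kept: {rows_kept} of {len(df_demo)}")
print(f"columns kept: {cols_kept} of {df_demo.shape[1]}")
```

With only two missing cells, dropping rows discards two of the three samples and dropping columns discards half of the features.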

### Imputing missing values
- We can use different interpolation techniques to estimate the missing values from the other training samples in our dataset.
- __One__ of the most common interpolation techniques is __mean imputation__, where we replace each missing value with the mean value of the entire feature column.

In [24]:
# fillna has no 'mean' method (only forward/backward fill), so this call raises a ValueError
df.fillna(method='mean', axis=1)

ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got mean
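
The `method` argument of `fillna` only selects forward or backward filling, which is why the call above fails. One way to do mean imputation directly in pandas is to pass the column means as the fill value. The sketch below assumes the same toy data, rebuilt under the illustrative name `df_demo`.

```python
import numpy as np
import pandas as pd

# Same toy data as above, rebuilt so the snippet is self-contained
df_demo = pd.DataFrame({
    'A': [1.0, 5.0, 10.0],
    'B': [2.0, 6.0, 11.0],
    'C': [3.0, np.nan, 12.0],
    'D': [4.0, 8.0, np.nan],
})

# df_demo.mean() ignores NaN and returns one mean per column;
# fillna then uses each column's mean for that column's missing cells
df_mean_filled = df_demo.fillna(df_demo.mean())
print(df_mean_filled)
```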

In [25]:
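# Note: Imputer was removed in scikit-learn 0.22; later releases provide sklearn.impute.SimpleImputer
from sklearn.preprocessing import Imputer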

In [47]:
# axis=1 imputes along rows: each NaN is replaced by the mean of its own row
imr = Imputer(strategy='mean', axis=1, verbose=2)
imr = imr.fit(df)

imputed_values = imr.transform(df)
imputed_values

array([[  1.        ,   2.        ,   3.        ,   4.        ],
       [  5.        ,   6.        ,   6.33333333,   8.        ],
       [ 10.        ,  11.        ,  12.        ,  11.        ]])

__Note:__ With `axis=0` (the default), each missing value is replaced by the mean of its column instead.

In [48]:
imr = Imputer(strategy='mean', verbose=2)
imr = imr.fit(df.values)

imputed_values = imr.transform(df.values)
imputed_values

array([[  1. ,   2. ,   3. ,   4. ],
       [  5. ,   6. ,   7.5,   8. ],
       [ 10. ,  11. ,  12. ,   6. ]])
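
In scikit-learn 0.22 and later the `Imputer` class is no longer available. A minimal sketch of the same two imputations with `SimpleImputer` follows. `SimpleImputer` works column-wise and has no `axis` argument, so the row-wise variant here transposes the data, which is one possible workaround rather than an official equivalent. The name `df_demo` is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same toy data as above, rebuilt so the snippet is self-contained
df_demo = pd.DataFrame({
    'A': [1.0, 5.0, 10.0],
    'B': [2.0, 6.0, 11.0],
    'C': [3.0, np.nan, 12.0],
    'D': [4.0, 8.0, np.nan],
})

# Column-wise mean imputation (corresponds to the old axis=0 behaviour)
col_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
col_imputed = col_imputer.fit_transform(df_demo.values)

# Row-wise mean imputation (corresponds to the old axis=1 behaviour):
# transpose so rows become columns, impute, then transpose back
row_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
row_imputed = row_imputer.fit_transform(df_demo.values.T).T

print(col_imputed)
print(row_imputed)
```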





## Handling Categorical Data
- Categorical data can be broadly divided into __ordinal__ features, whose values can be sorted or ordered, and __nominal__ features, whose values have no inherent order.

In [90]:
df = pd.DataFrame([['green','M',10.1,'class1'],
                   ['Red','L', 13.4,'class2'],
                   ['blue', 'XL',15.3,'class1']], 
                  columns= ['color','size','price','classlabel'])
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class1
1,Red,L,13.4,class2
2,blue,XL,15.3,class1


Our dataset has a _nominal feature_ (color), an _ordinal feature_ (size), and a numerical feature (price).

### Mapping ordinal features

To make sure that the learning algorithm interprets the ordinal features correctly, we need to convert the categorical string values into numbers.
- There is no convenient function that can automatically derive the correct order of the labels, so we have to define the mapping manually.

In [91]:
size_mapping = {'M':1.0,
               'L': 2.0,
               'XL':3.0}

In [92]:
df['size'] = df['size'].map(size_mapping)

df

Unnamed: 0,color,size,price,classlabel
0,green,1.0,10.1,class1
1,Red,2.0,13.4,class2
2,blue,3.0,15.3,class1


In [93]:
# Reverse mapping, used to recover the original string labels from the numeric codes
inv_size_mapping = {v:k for k,v in size_mapping.items()}

In [94]:
df['size'].map(inv_size_mapping)

0     M
1     L
2    XL
Name: size, dtype: object
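
As an alternative to a hand-written dictionary, the order can also be declared with an ordered pandas categorical. This is only a sketch: the order in `size_order` still has to be supplied by hand, and the names `sizes`, `size_order`, `size_cat` and `size_codes` are introduced here for illustration.

```python
import pandas as pd

sizes = pd.Series(['M', 'L', 'XL'], name='size')

# The order (smallest to largest) is still specified manually
size_order = ['M', 'L', 'XL']
size_cat = pd.Categorical(sizes, categories=size_order, ordered=True)

# .codes gives the zero-based position of each value in size_order;
# adding 1 reproduces the 1, 2, 3 coding of the dictionary approach
size_codes = pd.Series(size_cat.codes, name='size') + 1
print(size_codes.tolist())

# An ordered categorical also supports comparisons against a category
smaller_than_xl = size_cat < 'XL'
print(list(smaller_than_xl))
```

The categorical keeps the original labels, so no separate inverse mapping is needed to get the strings back.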

### Encoding class labels

Although most classifiers in _scikit-learn_ convert class labels to integers internally, it is considered good practice to provide class labels as integer arrays to avoid technical glitches.

We can use an approach similar to the mapping of ordinal features. Because class labels are not ordinal, it does not matter which integer is assigned to which label.

In [95]:
class_mapping = {label: idx for idx, label in enumerate(df['classlabel'].unique())}

In [96]:
class_mapping

{'class1': 0, 'class2': 1}

In [97]:
df['classlabel'] = df['classlabel'].map(class_mapping)
df

Unnamed: 0,color,size,price,classlabel
0,green,1.0,10.1,0
1,Red,2.0,13.4,1
2,blue,3.0,15.3,0


In [98]:
# Reverse the mapping to restore the original string class labels
inv_class_mapping = {v:k for k,v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df

Unnamed: 0,color,size,price,classlabel
0,green,1.0,10.1,class1
1,Red,2.0,13.4,class2
2,blue,3.0,15.3,class1


- Alternatively, there is a convenient _LabelEncoder_ class directly implemented in _scikit-learn_

In [99]:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()

In [103]:
y  = class_le.fit_transform(df.classlabel)

y

array([0, 1, 0])

In [102]:
class_le.inverse_transform(y)

array(['class1', 'class2', 'class1'], dtype=object)

### Performing one-hot encoding on nominal features

- The idea is to create a new dummy feature for each unique value in the nominal feature column.
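
To show what "one dummy feature per unique value" means, here is a minimal hand-rolled sketch. The names `colors` and `dummies` are illustrative, and the color values match the ones in the dataset above.

```python
import pandas as pd

colors = pd.Series(['green', 'Red', 'blue'], name='color')

# One 0/1 column per unique value: 1 where the row has that color, else 0
dummies = pd.DataFrame({
    f'color_{value}': (colors == value).astype(int)
    for value in sorted(colors.unique())
})
print(dummies)
```

Each row contains exactly one 1, so no artificial order is imposed on the colors.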

In [120]:
X = df[['color','size','price']]

In [116]:
from sklearn.preprocessing import OneHotEncoder

In [117]:
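# Note: the categorical_features argument was removed in scikit-learn 0.22
ohe = OneHotEncoder(categorical_features=[0])

In [119]:
# Fails: this OneHotEncoder expects numeric input, but 'color' still holds strings
ohe.fit_transform(X)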

ValueError: could not convert string to float: 'blue'
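
In scikit-learn 0.20 and later, `OneHotEncoder` accepts string columns, and the column selection can be handled by `ColumnTransformer`. The sketch below assumes such a version. `X_demo` is an illustrative copy of `X`, and `sparse_threshold=0` is set only to force a dense array that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same features as X above, rebuilt so the snippet is self-contained
X_demo = pd.DataFrame({
    'color': ['green', 'Red', 'blue'],
    'size': [1.0, 2.0, 3.0],
    'price': [10.1, 13.4, 15.3],
})

ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['color'])],  # encode only the nominal column
    remainder='passthrough',                   # keep size and price unchanged
    sparse_threshold=0,                        # always return a dense array
)
X_encoded = ct.fit_transform(X_demo)

# Columns: color_Red, color_blue, color_green, size, price
print(X_encoded)
```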

In [108]:
X

Unnamed: 0,color,size,price
0,green,1.0,10.1
1,Red,2.0,13.4
2,blue,3.0,15.3


In [122]:
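# Only string columns are converted; numeric columns pass through unchanged.
# In pandas 2.0 and later the dummy columns are booleans unless dtype=int is passed.
pd.get_dummies(df[['color','size','price']])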

Unnamed: 0,size,price,color_Red,color_blue,color_green
0,1.0,10.1,0,0,1
1,2.0,13.4,1,0,0
2,3.0,15.3,0,1,0


In [121]:
# drop_first=True removes the first dummy column (color_Red), which is redundant given the others
pd.get_dummies(df[['color','size','price']],drop_first=True)

Unnamed: 0,size,price,color_blue,color_green
0,1.0,10.1,0,1
1,2.0,13.4,0,0
2,3.0,15.3,1,0
