## Building good Training Examples - Preprocessing Data

In [68]:
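# Let's read some data. Note that the spaces after the commas end up in the
# column names (' B', ' C', ' D'); pass skipinitialspace=True to read_csv to strip them.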
import pandas as pd
from io import StringIO

csv_data = \
    '''A, B, C, D
    1.0, 2.0, 3.0, 4.0
    5.0, 6.0,, 8.0
    10.0, 11.0, 12.0,'''
df = pd.read_csv(StringIO(csv_data))


## Dealing with missing values

In [69]:
df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 | 4.0 |
| 1 | 5.0 | 6.0 | NaN | 8.0 |
| 2 | 10.0 | 11.0 | 12.0 | NaN |


In [70]:
df.isnull().sum()

A     0
 B    0
 C    1
 D    1
dtype: int64

In [71]:
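from sklearn.impute import SimpleImputer
import numpy as np

# Replace each NaN with the mean of its column
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)                 # learn the column means
imputed_data = imr.transform(df.values)  # fill in the missing values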


In [72]:
imputed_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])

In [73]:
imr.fit_transform(df)

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])
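To see what the mean strategy computes, the sketch below reproduces the imputed values by hand with NumPy: each missing entry is replaced by the mean of the observed values in its column. It rebuilds the same array inline so that it runs on its own; the names `data`, `col_means`, and `filled` are illustrative and not part of the notebook.

```python
import numpy as np

# Same values as the DataFrame above, with NaN marking the missing entries.
data = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, np.nan, 8.0],
    [10.0, 11.0, 12.0, np.nan],
])

# Column means computed over the observed (non-NaN) values only.
col_means = np.nanmean(data, axis=0)

# Replace each NaN with the mean of its column (col_means broadcasts over rows).
filled = np.where(np.isnan(data), col_means, data)

print(col_means)
print(filled)
```

The missing value in column C becomes (3 + 12) / 2 = 7.5 and the one in column D becomes (4 + 8) / 2 = 6, matching the imputer output above.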

## Dealing with categorical data
- Categorical features can be nominal or ordinal. Ordinal values have a natural order (for example, the size of a shirt).
- Nominal values have no inherent order (for example, the color of a flower: red, blue, green).


In [74]:

df = pd.DataFrame([
    ['green', 'M', 10.1, 'class1'], 
    ['red', 'L', 13.5, 'class2'],
    ['blue', 'XL', 15.3, 'class1']])


In [75]:
df.columns = ['color', 'size', 'price', 'classLabel']

In [76]:
df

|   | color | size | price | classLabel |
|---|---|---|---|---|
| 0 | green | M | 10.1 | class1 |
| 1 | red | L | 13.5 | class2 |
| 2 | blue | XL | 15.3 | class1 |


- In machine learning, we need to convert categorical features into numerical values.
- size is an ordinal categorical feature: its values have an implied order (XL > L > M).

In [11]:

size_mapping = {'XL': 3, 'L': 2, 'M': 1}


In [12]:
df['size'].map(size_mapping)

0    1
1    2
2    3
Name: size, dtype: int64

In [26]:
df['size']

0     M
1     L
2    XL
Name: size, dtype: object
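Because map returns a new Series, the original column stays unchanged until the result is assigned back. If the integer codes later need to be turned back into size labels, a reverse dictionary can be built from size_mapping. A minimal sketch, where `sizes` and `inv_size_mapping` are illustrative names that do not appear elsewhere in this notebook:

```python
import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1}
sizes = pd.Series(['M', 'L', 'XL'], name='size')

# Encode: map returns a new Series and leaves `sizes` unchanged.
encoded = sizes.map(size_mapping)

# Decode: invert the dictionary and map the integers back to labels.
inv_size_mapping = {v: k for k, v in size_mapping.items()}
decoded = encoded.map(inv_size_mapping)

print(encoded.tolist())
print(decoded.tolist())
```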

In [13]:
label_mapping = {label: idx for idx, label in enumerate(df['classLabel'].unique())}

In [14]:
df['classLabel'].map(label_mapping)

0    0
1    1
2    0
Name: classLabel, dtype: int64

In [15]:
df

|   | color | size | price | classLabel |
|---|---|---|---|---|
| 0 | green | M | 10.1 | class1 |
| 1 | red | L | 13.5 | class2 |
| 2 | blue | XL | 15.3 | class1 |


In [17]:
# scikit-learn's built-in LabelEncoder is a more convenient way to encode class labels
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classLabel'])

In [18]:
y

array([0, 1, 0])

In [19]:
class_le.inverse_transform(y)

array(['class1', 'class2', 'class1'], dtype=object)

In [52]:
# Let's try encoding the color feature
color_le = LabelEncoder()
X = df[['color', 'size', 'price']].values

In [53]:
X

array([['green', 'M', 10.1],
       ['red', 'L', 13.5],
       ['blue', 'XL', 15.3]], dtype=object)

In [47]:
X[:, 0] = color_le.fit_transform(X[:, 0])

In [48]:
X

array([[1, 'M', 10.1],
       [2, 'L', 13.5],
       [0, 'XL', 15.3]], dtype=object)

Notice that the nominal categorical feature color has been converted into the integers 1, 2, and 0 (green, red, blue). A learning algorithm may treat these integers as ordered and conclude that red is larger than green and green is larger than blue, which is incorrect because colors have no inherent order.

- Use one-hot encoding instead: each unique value $v_j$ of the feature $f$ becomes a separate boolean feature.

In [54]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
    transformers=[
        ("Color",
        OneHotEncoder(),
        [0]) # the column to apply transformation on
    ], remainder='passthrough'
)
X = ct.fit_transform(X)

In [58]:
pd.DataFrame(X, columns=['Blue', 'Green', 'Red', 'Size', "Price"])

|   | Blue | Green | Red | Size | Price |
|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | M | 10.1 |
| 1 | 0.0 | 0.0 | 1.0 | L | 13.5 |
| 2 | 1.0 | 0.0 | 0.0 | XL | 15.3 |


- One-hot encoding introduces multicollinearity into the dataset, which can be a problem for methods that have to invert a matrix.
- To reduce it, remove one of the one-hot columns, such as blue. No information is lost: if we observe 0 for green and 0 for red, the color must be blue.
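Dropping one of the one-hot columns, as suggested above, can be sketched with the `drop` parameter of scikit-learn's OneHotEncoder. This assumes a scikit-learn version that supports `drop` (0.21 or later); the array `colors` is an illustrative stand-in for the color column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative stand-in for the color column (one feature, three samples).
colors = np.array([['green'], ['red'], ['blue']])

# drop='first' removes the first category in sorted order ('blue' here),
# so only the 'green' and 'red' indicator columns remain.
ohe = OneHotEncoder(drop='first')
encoded = ohe.fit_transform(colors).toarray()

print(ohe.categories_)
print(encoded)
```

The row for blue is encoded as all zeros, so the dropped category remains recoverable from the two columns that are kept.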

In [77]:
df['size'] = df['size'].map(size_mapping)

In [78]:
df

|   | color | size | price | classLabel |
|---|---|---|---|---|
| 0 | green | 1 | 10.1 | class1 |
| 1 | red | 2 | 13.5 | class2 |
| 2 | blue | 3 | 15.3 | class1 |


- One can also use pd.get_dummies for one-hot encoding. It converts only the string columns and leaves the numeric columns unchanged.

In [80]:
pd.get_dummies(df[['price', 'color', 'size']])

|   | price | size | color_blue | color_green | color_red |
|---|---|---|---|---|---|
| 0 | 10.1 | 1 | 0 | 1 | 0 |
| 1 | 13.5 | 2 | 0 | 0 | 1 |
| 2 | 15.3 | 3 | 1 | 0 | 0 |
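pd.get_dummies can also leave out the redundant column itself through its `drop_first` argument. The sketch below rebuilds the relevant columns under the illustrative name `example_df` so that it runs on its own. Depending on the pandas version, the dummy columns are returned as integers or booleans.

```python
import pandas as pd

# Illustrative copy of the price, color, and (already mapped) size columns.
example_df = pd.DataFrame({
    'price': [10.1, 13.5, 15.3],
    'color': ['green', 'red', 'blue'],
    'size': [1, 2, 3],
})

# drop_first=True omits the first dummy column of each encoded feature,
# which is 'color_blue' here.
dummies = pd.get_dummies(example_df, drop_first=True)

print(dummies.columns.tolist())
print(dummies)
```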


Continue with the remaining preprocessing material in ml_best_practices.