# Chapter 4: building good training sets

## Handling missing values

Let's start by constructing a simple dataset with some missing values.

In [1]:
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

df = pd.read_csv(StringIO(csv_data))
df

    A   B     C    D
0   1   2   3.0  4.0
1   5   6   NaN  8.0
2  10  11  12.0  NaN


A pandas `DataFrame` has some helpful methods for seeing which values are null:

In [4]:
df.isnull()

       A      B      C      D
0  False  False  False  False
1  False  False   True  False
2  False  False  False   True


In [5]:
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64

Side note: although our data lives in a `DataFrame`, we can always get the underlying NumPy array out if we'd like to:

In [6]:
df.values

array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,  nan,   8.],
       [ 10.,  11.,  12.,  nan]])

So, as the `df.isnull().sum()` output above shows, we can quickly summarize the number of missing values for each feature.
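The same boolean mask can be summarized in other ways. The snippet below is a minimal sketch that rebuilds the toy data under the hypothetical name `df_demo` and computes the fraction of missing values per column and the count per row.

```python
import numpy as np
import pandas as pd

# Same toy data as above, rebuilt so the snippet is self-contained.
# The name df_demo is only for illustration.
df_demo = pd.DataFrame({'A': [1.0, 5.0, 10.0],
                        'B': [2.0, 6.0, 11.0],
                        'C': [3.0, np.nan, 12.0],
                        'D': [4.0, 8.0, np.nan]})

# Fraction of missing values per column (the mean of a boolean mask).
missing_fraction = df_demo.isnull().mean()

# Number of missing values per row.
missing_per_row = df_demo.isnull().sum(axis=1)

print(missing_fraction)
print(missing_per_row)
```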

How do we deal with these missing values before passing the data into a model? There are a few options.

### Filtering out missing data

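One option is to simply remove rows or columns that contain missing data. The `dropna` method has several options that control what gets dropped: by default it drops every row with a missing value, `axis=1` drops columns instead, and `subset` only checks the listed columns: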

In [7]:
df.dropna()

   A  B  C  D
0  1  2  3  4


In [9]:
df.dropna(axis=1)

    A   B
0   1   2
1   5   6
2  10  11


In [10]:
df.dropna(subset=['C'])

    A   B   C    D
0   1   2   3  4.0
2  10  11  12  NaN


Ultimately we'll want a dataset with no missing values. Short of removing every row or column that has a missing value, how else can we massage the data? 

Note: we could also drop only the rows where most values are missing, e.g. with `df.dropna(thresh=2)`, which keeps rows that have at least two non-missing values, and *then* correct the remaining rows with only one or two missing items using another mechanism so as not to throw out too much of our dataset.
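To make the `how` and `thresh` options of `dropna` concrete, here is a small sketch on a hypothetical four-row frame (`df_demo` is not part of the notebook's data). `thresh` counts the non-missing values a row needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows with 0, 2, 3 and 4 missing values.
df_demo = pd.DataFrame([[1.0, 2.0, 3.0, 4.0],
                        [5.0, np.nan, np.nan, 8.0],
                        [np.nan, np.nan, np.nan, 12.0],
                        [np.nan, np.nan, np.nan, np.nan]],
                       columns=list('ABCD'))

# Drop only rows in which every value is missing (row 3).
all_missing_dropped = df_demo.dropna(how='all')

# Keep rows with at least two non-missing values (rows 0 and 1).
at_least_two = df_demo.dropna(thresh=2)

print(all_missing_dropped)
print(at_least_two)
```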

### Imputing missing values

A common way to correct missing data without removing the associated rows is to replace each missing value with the mean of that variable (column). scikit-learn makes this easy:

In [13]:
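import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer (scikit-learn 0.20+) imputes column by column,
# so no axis argument is needed.
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr.fit(df)                          # learn the mean of each column
corrected_data = imr.transform(df)   # replace NaNs with those means
corrected_data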

array([[  1. ,   2. ,   3. ,   4. ],
       [  5. ,   6. ,   7.5,   8. ],
       [ 10. ,  11. ,  12. ,   6. ]])

Other `strategy` options include `'median'` and `'most_frequent'`.
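To see how the choice of statistic changes the result, the sketch below uses plain pandas (`fillna`) rather than the scikit-learn imputer, on a hypothetical single-column frame where the mean and median differ.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry and one large value.
df_demo = pd.DataFrame({'x': [1.0, 2.0, np.nan, 9.0]})

# Fill the gap with the column mean (here 4.0) ...
mean_filled = df_demo.fillna(df_demo.mean())

# ... or with the column median (here 2.0), which is less
# sensitive to the large value.
median_filled = df_demo.fillna(df_demo.median())

print(mean_filled)
print(median_filled)
```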

### The estimator API

Note that imputer objects like the one above are similar to the models we used for supervised learning in chapter 3: there is a fitting step and an application step, and the application step generalizes to new data. This means we can learn the column means from the training set with `fit` and then apply those same replacement values to the test set with `transform`, without re-fitting on the test data.
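The fit-then-transform pattern can be sketched as follows. This assumes a scikit-learn version that provides `sklearn.impute.SimpleImputer` (0.20 or later); the arrays `X_train` and `X_test` are made-up illustrations, not data from this chapter.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and test sets with missing entries.
X_train = np.array([[1.0, 2.0],
                    [3.0, np.nan],
                    [5.0, 6.0]])
X_test = np.array([[np.nan, 10.0],
                   [7.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X_train)                       # learns the training column means

X_train_filled = imputer.transform(X_train)
X_test_filled = imputer.transform(X_test)  # reuses the training means

print(imputer.statistics_)   # the learned per-column means
print(X_test_filled)
```

The test set gaps are filled with the training means (3.0 and 4.0 here), not with statistics computed from the test set itself.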

## Mapping categorical data to numbers

Categorical data, both nominal and ordinal, needs to be mapped to numbers before we can fit a model to it.


In [14]:
df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])

df.columns = ['color', 'size', 'price', 'classlabel']
df

   color size  price classlabel
0  green    M   10.1     class1
1    red    L   13.5     class2
2   blue   XL   15.3     class1


### Hand mapping ordinal data

Only we know the meaning and order of the labels of an ordinal variable like 'size', so we have to define the mapping by hand and then apply it:

In [15]:
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1}

df['size'] = df['size'].map(size_mapping)
df

   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1
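A hand-written mapping is easy to reverse, which helps when we want to report the original labels later. The sketch below is one way to do it; `inv_size_mapping` and the small `sizes` series are names introduced only for this illustration.

```python
import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1}

# Invert the dictionary: integer code -> original label.
inv_size_mapping = {v: k for k, v in size_mapping.items()}

sizes = pd.Series(['M', 'L', 'XL'])
encoded = sizes.map(size_mapping)        # labels -> integers
decoded = encoded.map(inv_size_mapping)  # integers -> labels

print(encoded.tolist())
print(decoded.tolist())
```

Keep in mind that `map` yields `NaN` for any label missing from the dictionary, so the mapping should cover every value that occurs in the column.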


### Mapping nominal features

There are two options: build the mapping by hand, or use a built-in helper (`LabelEncoder`):

In [17]:
import numpy as np

class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
class_mapping


{'class1': 0, 'class2': 1}

In [18]:
df['classlabel'] = df['classlabel'].map(class_mapping)
df

   color  size  price  classlabel
0  green     1   10.1           0
1    red     2   13.5           1
2   blue     3   15.3           0


In [19]:
from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y

array([0, 1, 0])
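`LabelEncoder` also remembers the labels it saw, so the integer codes can be turned back into the original strings. The sketch below starts from the string labels (the `labels` array is rebuilt here for illustration) rather than from the already-mapped column.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['class1', 'class2', 'class1'])

le = LabelEncoder()
y = le.fit_transform(labels)        # strings -> integer codes

# classes_ holds the sorted unique labels; a label's position is its code.
print(le.classes_)

# Map the integer codes back to the original strings.
original = le.inverse_transform(y)
print(original)
```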

### One-hot encoding

Mapping nominal data with more than two possible values to integers is not a great idea, because it tells our model that, e.g., 'blue' is of greater value than 'green' (assuming we mapped red, green, blue to 0, 1, 2).

The way to work around this is to transform the feature into multiple binary features, one per category, using one-hot encoding:

In [21]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(categorical_features=[0])
ohe.fit_transform(df).toarray()

ValueError: could not convert string to float: 'blue'
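
The call above fails because the `color` column still contains strings, which the encoder tries to convert to floats. Below is a minimal sketch of two ways around this, not necessarily the approach the chapter goes on to use. It assumes a recent pandas and scikit-learn 0.20 or later, where `OneHotEncoder` accepts string columns directly; `df_demo` is a rebuilt copy of the feature columns, named only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df_demo = pd.DataFrame({'color': ['green', 'red', 'blue'],
                        'size': [1, 2, 3],
                        'price': [10.1, 13.5, 15.3]})

# pandas: expand 'color' into one indicator column per category,
# leaving the numeric columns untouched.
dummies = pd.get_dummies(df_demo, columns=['color'])
print(dummies.columns.tolist())

# scikit-learn: encode just the 'color' column. Categories are sorted
# alphabetically, so the output columns are blue, green, red.
ohe = OneHotEncoder()
color_onehot = ohe.fit_transform(df_demo[['color']]).toarray()
print(color_onehot)
```

Each row now has exactly one active indicator among the color columns, so no artificial ordering between colors is implied.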