# Preprocessing

In this notebook, we will see how you can:
- replace specific values in a column
- replace missing values with an Imputer
- create categorical values
- create dummy variables

In [102]:
import pandas as pd
import numpy as np

First we will read the realestate dataset and fix some of the problems we identified earlier.

In [103]:
df = pd.read_csv('/data/datasets/realestate.csv', na_values=['na', '--'])
df[df.NUM_BATH == 'HURLEY'] = np.nan
df['NUM_BATH'] = pd.to_numeric(df.NUM_BATH)

In [104]:
df = df[['NUM_BEDROOMS', 'NUM_BATH', 'SQ_FT', 'OWN_OCCUPIED']]

In [105]:
df.dropna(how='all', inplace=True)

In [106]:
df

Unnamed: 0,NUM_BEDROOMS,NUM_BATH,SQ_FT,OWN_OCCUPIED
0,3.0,1.0,1000.0,Y
1,3.0,1.5,,N
2,,1.0,850.0,N
3,1.0,,700.0,12
4,3.0,2.0,1600.0,Y
5,,1.0,800.0,Y
7,1.0,1.0,,Y
8,,2.0,1800.0,Y


### Replace specific values

Replace the 12 in OWN_OCCUPIED with the most frequently occurring value, 'Y'.

In [107]:
df.loc[(df.OWN_OCCUPIED == '12'), 'OWN_OCCUPIED'] = 'Y'

In [108]:
df

Unnamed: 0,NUM_BEDROOMS,NUM_BATH,SQ_FT,OWN_OCCUPIED
0,3.0,1.0,1000.0,Y
1,3.0,1.5,,N
2,,1.0,850.0,N
3,1.0,,700.0,Y
4,3.0,2.0,1600.0,Y
5,,1.0,800.0,Y
7,1.0,1.0,,Y
8,,2.0,1800.0,Y
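
The cell above hard-codes 'Y' as the replacement. As a minimal sketch, assuming 'Y' and 'N' are the only valid labels, the most frequent value can also be computed with `mode()`. The `toy` frame and the `valid` mask are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy column mirroring OWN_OCCUPIED above (illustrative data).
toy = pd.DataFrame({'OWN_OCCUPIED': ['Y', 'N', 'N', '12', 'Y', 'Y', 'Y', 'Y']})

# Assumption: only 'Y' and 'N' are valid labels.
valid = toy['OWN_OCCUPIED'].isin(['Y', 'N'])

# mode() returns a Series of the most frequent value(s); take the first.
most_common = toy.loc[valid, 'OWN_OCCUPIED'].mode()[0]

# Overwrite every invalid entry with the most frequent valid label.
toy.loc[~valid, 'OWN_OCCUPIED'] = most_common
```

This avoids having to read the most frequent value off the table by hand.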


### Replace missing values with an imputer
An imputer is a function that replaces missing values. There are several options:
- SimpleImputer: by default, replaces missing values with the mean of the column. The mean strategy only works on numerical values, so we have to leave out the categorical column. There are other imputers as well.
- fillna: fills the missing values in a column with a value such as a constant or the mean of the column
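
SimpleImputer also accepts a `strategy` argument. The sketch below, on a small made-up frame, uses `'median'` for a numerical column and `'most_frequent'` for a text column; the frame `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative data, not the realestate dataset.
toy = pd.DataFrame({
    'SQ_FT': [1000.0, np.nan, 850.0, 700.0],
    'OWN_OCCUPIED': ['Y', 'N', np.nan, 'Y'],
})

# The median is less sensitive to outliers than the default mean.
median_imp = SimpleImputer(strategy='median')
toy[['SQ_FT']] = median_imp.fit_transform(toy[['SQ_FT']])

# 'most_frequent' also works on non-numerical columns.
freq_imp = SimpleImputer(strategy='most_frequent')
toy[['OWN_OCCUPIED']] = freq_imp.fit_transform(toy[['OWN_OCCUPIED']])
```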

In [109]:
from sklearn.impute import SimpleImputer

In [110]:
imp = SimpleImputer()

In [111]:
np.set_printoptions(suppress=True)
df[['NUM_BEDROOMS', 'NUM_BATH', 'SQ_FT']] = imp.fit_transform(df[['NUM_BEDROOMS', 'NUM_BATH', 'SQ_FT']])

or with fillna(), per column. Instead of the mean of a column you can also use the median or another function or value. fillna() returns a new Series that you must assign back to the column, as with the SimpleImputer. Calling it with inplace=True on a single column selected from the dataframe may not update the dataframe in recent pandas versions, because the selected column can be a copy.

In [112]:
df['NUM_BEDROOMS'] = df['NUM_BEDROOMS'].fillna(df['NUM_BEDROOMS'].mean())
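
fillna() on a whole dataframe also accepts a dictionary that maps column names to fill values, which handles several columns in one call. A minimal sketch on made-up data (the frame `toy` and the choice of fill values are assumptions):

```python
import numpy as np
import pandas as pd

# Illustrative data, not the realestate dataset.
toy = pd.DataFrame({
    'NUM_BEDROOMS': [3.0, np.nan, 1.0],
    'SQ_FT': [1000.0, 850.0, np.nan],
})

# One fill value per column: the median for NUM_BEDROOMS, a constant for SQ_FT.
filled = toy.fillna({
    'NUM_BEDROOMS': toy['NUM_BEDROOMS'].median(),
    'SQ_FT': 0,
})
```

The result is a new dataframe; `toy` itself keeps its missing values until the result is assigned back.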

### Replacing values with categorical values

Categorical values may simplify models and may help when a numerical value is unlikely to be linearly related to the target value (e.g. pH-value).

In [113]:
def label(value):
    if value > 1500:
        return 'Extreme'
    if value > 1000:
        return 'Big'
    return 'Small'

df['expensive'] = df.SQ_FT.map(label)

df

Unnamed: 0,NUM_BEDROOMS,NUM_BATH,SQ_FT,OWN_OCCUPIED,expensive
0,3.0,1.0,1000.0,Y,Small
1,3.0,1.5,1125.0,N,Big
2,2.2,1.0,850.0,N,Small
3,1.0,1.357143,700.0,Y,Small
4,3.0,2.0,1600.0,Y,Extreme
5,2.2,1.0,800.0,Y,Small
7,1.0,1.0,1125.0,Y,Big
8,2.2,2.0,1800.0,Y,Extreme
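
As an alternative to a hand-written label function, `pd.cut` can bin a numerical column directly. The sketch below is assumed to reproduce the same thresholds: with the default `right=True`, each bin includes its upper edge, so 1000 falls in 'Small' and 1500 in 'Big', just as with the `>` comparisons above. The series `sq_ft` copies the imputed SQ_FT values from the table.

```python
import numpy as np
import pandas as pd

# SQ_FT values after imputation, copied from the table above.
sq_ft = pd.Series([1000.0, 1125.0, 850.0, 700.0, 1600.0, 800.0, 1125.0, 1800.0])

# Bins are (-inf, 1000], (1000, 1500], (1500, inf).
size = pd.cut(
    sq_ft,
    bins=[-np.inf, 1000, 1500, np.inf],
    labels=['Small', 'Big', 'Extreme'],
)
```

The result has the pandas `category` dtype, which also records the order of the labels.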


### Binary labels

However, most machine learning algorithms work with numbers, so we have to convert categorical labels to numbers. A binary label is easy:

In [114]:
df['OWN_OCCUPIED'] = (df['OWN_OCCUPIED'] == 'Y').astype(int)

In [115]:
df

Unnamed: 0,NUM_BEDROOMS,NUM_BATH,SQ_FT,OWN_OCCUPIED,expensive
0,3.0,1.0,1000.0,1,Small
1,3.0,1.5,1125.0,0,Big
2,2.2,1.0,850.0,0,Small
3,1.0,1.357143,700.0,1,Small
4,3.0,2.0,1600.0,1,Extreme
5,2.2,1.0,800.0,1,Small
7,1.0,1.0,1125.0,1,Big
8,2.2,2.0,1800.0,1,Extreme
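
Comparing against 'Y' silently turns every other value into 0. A possible alternative, sketched here on made-up data, is to `map` an explicit dictionary: values that are not in the dictionary become NaN, so unexpected labels stay visible. The series `owner` and the mapping are illustrative assumptions.

```python
import pandas as pd

# Illustrative labels, including one unexpected value.
owner = pd.Series(['Y', 'N', '12'])

# Values missing from the dictionary are mapped to NaN.
encoded = owner.map({'Y': 1, 'N': 0})

# Count the labels that could not be encoded.
n_unknown = int(encoded.isna().sum())
```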


### Dummy variables

Alternatively, we can transform non-binary categorical values into so-called 'dummy variables'. Every category results in a new Boolean variable that indicates whether the value is that label or not. 

In [116]:
pd.get_dummies(df, columns=['expensive'])

Unnamed: 0,NUM_BEDROOMS,NUM_BATH,SQ_FT,OWN_OCCUPIED,expensive_Big,expensive_Extreme,expensive_Small
0,3.0,1.0,1000.0,1,0,0,1
1,3.0,1.5,1125.0,0,1,0,0
2,2.2,1.0,850.0,0,0,0,1
3,1.0,1.357143,700.0,1,0,0,1
4,3.0,2.0,1600.0,1,0,1,0
5,2.2,1.0,800.0,1,0,0,1
7,1.0,1.0,1125.0,1,1,0,0
8,2.2,2.0,1800.0,1,0,1,0


However, a caveat is that the columns should not be collinear. This would be the case if we convert every category label to a variable. For example, if we know whether expensive_Small and expensive_Big are True or False, we can derive what the value of expensive_Extreme should be. Therefore we should always leave out one category. You can do that with drop_first=True, which drops the first category (here expensive_Big). Note that the original column is also dropped.

In [117]:
df = pd.get_dummies(df, columns=['expensive'], drop_first=True)

In [118]:
df

Unnamed: 0,NUM_BEDROOMS,NUM_BATH,SQ_FT,OWN_OCCUPIED,expensive_Extreme,expensive_Small
0,3.0,1.0,1000.0,1,0,1
1,3.0,1.5,1125.0,0,0,0
2,2.2,1.0,850.0,0,0,1
3,1.0,1.357143,700.0,1,0,1
4,3.0,2.0,1600.0,1,1,0
5,2.2,1.0,800.0,1,0,1
7,1.0,1.0,1125.0,1,0,0
8,2.2,2.0,1800.0,1,1,0
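
The collinearity can be checked directly: with all categories kept, the dummy columns of each row sum to exactly 1, so any one column follows from the others. A small sketch on made-up labels (the series `labels` is illustrative; `dtype=int` is passed because recent pandas versions return Boolean columns by default):

```python
import pandas as pd

# Illustrative labels, not the full dataset.
labels = pd.Series(['Small', 'Big', 'Small', 'Extreme'], name='expensive')

# All categories kept: every row has exactly one 1.
full = pd.get_dummies(labels, prefix='expensive', dtype=int)
row_sums = full.sum(axis=1)

# drop_first=True removes the first category, here expensive_Big.
reduced = pd.get_dummies(labels, prefix='expensive', dtype=int, drop_first=True)
```

In `reduced`, a row with zeros in all remaining columns stands for the dropped category 'Big'.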
