## Preprocessing data for machine learning models
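This notebook walks through general, reusable steps for preparing data for machine learning algorithms: loading a dataset, filling in missing values, encoding categorical columns, splitting into training and test sets, and scaling features. A small dataset is used for convenience.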

In [23]:
# import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [15]:
# load the dataset
data = pd.read_csv('Data.csv')
data

,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [16]:
# select features and target
columns = data.columns
target = data['Purchased']
features = data[columns.drop('Purchased')]
features

,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,
5,France,35.0,58000.0
6,Spain,,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0


In [17]:
# replace NaN values with the column mean
# Note: filling every NaN column in one call, e.g. features[NaN_cols].fillna(...), put values in the wrong entries when first tried, so each column is filled separately
# work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
features = features.copy()
NaN_cols = ['Age', 'Salary']
for col in NaN_cols:
    features[col] = features[col].fillna(features[col].mean())
features


,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,63777.777778
5,France,35.0,58000.0
6,Spain,38.777778,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0
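
The note in the imputation cell reports trouble when filling several columns at once. As a minimal sketch, assuming the columns are selected by name and the result is assigned back to the same columns, `fillna` can take the Series of column means directly. The names `demo`, `nan_cols`, and `filled` are illustrative and not part of the notebook; the values are the `Age` and `Salary` columns shown above.

```python
import numpy as np
import pandas as pd

# Age and Salary values from the dataset above, rebuilt inline so the sketch runs on its own
demo = pd.DataFrame({
    'Age': [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000],
})

nan_cols = ['Age', 'Salary']

# fillna accepts a Series keyed by column name, so each column
# is filled with its own mean and no value crosses columns
filled = demo.copy()
filled[nan_cols] = filled[nan_cols].fillna(filled[nan_cols].mean())

# rows 4 and 6 are the ones that had missing values
print(filled.loc[[4, 6]])
```

In this sketch the filled entries match the loop's output (about 63777.78 for the missing salary and 38.78 for the missing age), so the loop and the single call should be interchangeable here.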


In [18]:
# one-hot encode the categorical column, then drop the original column
features = pd.concat([features, pd.get_dummies(features['Country'])], axis=1)
features = features.drop('Country', axis=1)
features


,Age,Salary,France,Germany,Spain
0,44.0,72000.0,1,0,0
1,27.0,48000.0,0,0,1
2,30.0,54000.0,0,1,0
3,38.0,61000.0,0,0,1
4,40.0,63777.777778,0,1,0
5,35.0,58000.0,1,0,0
6,38.777778,52000.0,0,0,1
7,48.0,79000.0,1,0,0
8,50.0,83000.0,0,1,0
9,37.0,67000.0,1,0,0
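
The 0/1 columns above come from an older pandas release. From pandas 2.0 onward, `get_dummies` returns boolean columns by default, so the same cell may display `True`/`False`. A small sketch, assuming integer indicators are wanted regardless of version, passes `dtype=int` explicitly. The `country` Series below is a shortened stand-in for the `Country` column.

```python
import pandas as pd

# first four Country values from the dataset above
country = pd.Series(['France', 'Spain', 'Germany', 'Spain'], name='Country')

# dtype=int requests integer 0/1 indicator columns instead of the library default
dummies = pd.get_dummies(country, dtype=int)
print(dummies)
```

The columns are created in sorted label order (`France`, `Germany`, `Spain`), which matches the layout of the table above.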


In [19]:
# encode the target column as a binary indicator (1 = 'Yes', 0 = 'No')
target = pd.get_dummies(target)['Yes']
target

0    0
1    1
2    0
3    0
4    1
5    1
6    0
7    1
8    0
9    1
Name: Yes, dtype: uint8
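
For a two-class target, a direct comparison is an alternative to `get_dummies`. The sketch below is one possible approach rather than the notebook's method; `purchased` holds the first five labels of the `Purchased` column and `encoded` is an illustrative name.

```python
import pandas as pd

# first five labels of the Purchased column
purchased = pd.Series(['No', 'Yes', 'No', 'No', 'Yes'], name='Purchased')

# the comparison gives True where the label is 'Yes'; astype(int) turns that into 1/0
encoded = (purchased == 'Yes').astype(int)
print(encoded.tolist())  # [0, 1, 0, 0, 1]
```

Unlike selecting the `'Yes'` column from `get_dummies`, this keeps the original Series name (`Purchased`).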

In [20]:
# split the dataset into training and testing data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=0)

In [21]:
X_train

,Age,Salary,France,Germany,Spain
9,37.0,67000.0,1,0,0
1,27.0,48000.0,0,0,1
6,38.777778,52000.0,0,0,1
7,48.0,79000.0,1,0,0
3,38.0,61000.0,0,0,1
0,44.0,72000.0,1,0,0
5,35.0,58000.0,1,0,0
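
With `test_size=0.3` and ten rows, three rows go to the test set and seven to the training set, as the table above shows. On a dataset this small, a random split can leave the classes unevenly represented. The sketch below shows the optional `stratify` argument of `train_test_split`, which the notebook does not use; `X` is a stand-in feature frame and `y` repeats the encoded target values.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# encoded target values from above (five 0s and five 1s); X is a placeholder feature frame
y = pd.Series([0, 1, 0, 0, 1, 1, 0, 1, 0, 1], name='Purchased')
X = pd.DataFrame({'row': range(10)})

# stratify=y asks for roughly the same class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(len(X_tr), len(X_te))  # 7 3
print(y_tr.value_counts().to_dict())
```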


Whether to scale the encoded categorical columns in the training set depends on context. If these columns are scaled, they lose their direct 0/1 interpretation as category labels. The target column is not scaled when it is categorical. Note that some machine learning implementations scale or normalize their inputs internally, so it is worth checking the documentation of the model before scaling by hand.
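
If the category indicators should keep their 0/1 values, one option is to scale only the numeric columns. The sketch below uses scikit-learn's `ColumnTransformer` for this; it is an illustration under that assumption, not the approach taken in the next cell, which scales every column. The `train` frame reuses four rows of `X_train`, and `numeric_cols` and `ct` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# four rows in the same layout as X_train (values taken from the table above)
train = pd.DataFrame({
    'Age': [37.0, 27.0, 48.0, 38.0],
    'Salary': [67000.0, 48000.0, 79000.0, 61000.0],
    'France': [1, 0, 1, 0],
    'Germany': [0, 0, 0, 0],
    'Spain': [0, 1, 0, 1],
})

numeric_cols = ['Age', 'Salary']

# scale only the numeric columns; remainder='passthrough' appends
# the untouched indicator columns after the scaled ones
ct = ColumnTransformer(
    [('scale', StandardScaler(), numeric_cols)],
    remainder='passthrough',
)
scaled = ct.fit_transform(train)
print(np.round(scaled, 3))
```

The output is a NumPy array whose first two columns are the standardized `Age` and `Salary`, followed by `France`, `Germany`, and `Spain` unchanged.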

In [34]:
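# fit the scaler on the training data only, then apply the same mean and standard deviation to the test data
scaler_X = StandardScaler()
X_train = scaler_X.fit_transform(X_train)
X_test = scaler_X.transform(X_test)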

In [35]:
X_train

array([[-0.2029809 ,  0.44897083,  0.8660254 ,  0.        , -0.8660254 ],
       [-1.82168936, -1.41706417, -1.15470054,  0.        ,  1.15470054],
       [ 0.08478949, -1.0242147 , -1.15470054,  0.        ,  1.15470054],
       [ 1.5775984 ,  1.62751925,  0.8660254 ,  0.        , -0.8660254 ],
       [-0.04111006, -0.14030338, -1.15470054,  0.        ,  1.15470054],
       [ 0.93011502,  0.94003267,  0.8660254 ,  0.        , -0.8660254 ],
       [-0.52672259, -0.43494049,  0.8660254 ,  0.        , -0.8660254 ]])
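
The fourth column of the array is zero throughout because none of the seven training rows has `Germany` set, so that column has no variance to scale. As a small check of what `StandardScaler` computes, the sketch below rebuilds the `Age` and `Germany` columns of the unscaled `X_train` and compares the scaler's output with the formula z = (x - mean) / std, where std is the population standard deviation. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Germany columns of X_train before scaling, copied from the table above
age = np.array([37.0, 27.0, 38.777778, 48.0, 38.0, 44.0, 35.0])
germany = np.zeros(7)
X = np.column_stack([age, germany])

# manual standardization; np.std uses ddof=0 (population standard deviation) by default
manual_age = (age - age.mean()) / age.std()

scaled = StandardScaler().fit_transform(X)

print(np.allclose(scaled[:, 0], manual_age))  # True
print(scaled[:, 1])  # the zero-variance column stays at 0
```

The first entry of `manual_age` is about -0.203, which agrees with the first value in the array above.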