# Scikit-Learn (sklearn)

In [1]:
what_we_cover = [
    "0. An End to End Scikit-Learn workflow",
    "1. Getting the data ready",
    "2. Choose the right estimator / algorithm for our problems",
    "3. Fit the model / algorithm and use it to make predictions on our data",
    "4. Evaluate the model",
    "5. Improve the model",
    "6. Save and load the trained model",
    "7. Putting it all together"
]

# 1. Getting our Data ready to be used with machine learning

Three main things we have to do:
1. Split the data into features and labels (usually `X` and `y`)
2. Filling (also called imputing) or disregarding missing values
3. Converting non-numerical values into numerical values (also called feature encoding)

## 1.1) Split the data into Features and Labels

In [2]:
# Standard Import
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

In [20]:
car_sales = pd.read_csv('../00.datasets/car-sales-extended.csv')
car_sales.head()

| | Make | Colour | Odometer (KM) | Doors | Price |
|---|---|---|---|---|---|
| 0 | Honda | White | 35431 | 4 | 15323 |
| 1 | BMW | Blue | 192714 | 5 | 19943 |
| 2 | Honda | White | 84714 | 4 | 28343 |
| 3 | Toyota | White | 154365 | 4 | 13434 |
| 4 | Nissan | Blue | 181577 | 3 | 14043 |


In [50]:
X = car_sales.drop('Price', axis=1)
y = car_sales['Price']

In [51]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [52]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((800, 4), (200, 4), (800,), (200,))

In [53]:
X.shape, y.shape

((1000, 4), (1000,))

----------------
--------------

## 1.2) Handling missing values

### What if there are missing values?

1. Fill them with some values (Imputation)
2. Remove the samples with missing data altogether.
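The two options can be compared side by side on a tiny frame. This is a minimal sketch on hypothetical toy data (the `toy` frame and its values are made up for illustration, not taken from the car sales dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Make": ["Honda", None, "BMW", "Toyota"],
    "Odometer (KM)": [35431.0, 192714.0, np.nan, 154365.0],
})

# Option 1: imputation keeps every row and fills the gaps
imputed = toy.copy()
imputed["Make"] = imputed["Make"].fillna("missing")
imputed["Odometer (KM)"] = imputed["Odometer (KM)"].fillna(
    imputed["Odometer (KM)"].mean()
)

# Option 2: removal drops every row that has at least one missing value
dropped = toy.dropna()

print(len(imputed), len(dropped))  # 4 rows kept vs. 2 rows kept
```

Imputation keeps all the samples at the cost of inventing values; removal keeps only real values at the cost of losing samples.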

In [131]:
# import the car sales dataset sample with missing values
import pandas as pd
car_sales_missing = pd.read_csv("../00.datasets/car-sales-extended-missing-data.csv")

In [132]:
car_sales_missing.isnull().sum()

# same as below
# car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

Every column has missing values: 49 in Make and 50 in each of Colour, Odometer (KM), Doors and Price.

### Option 1) Fill missing data with Pandas

In [133]:
car_sales_missing.head(2)

| | Make | Colour | Odometer (KM) | Doors | Price |
|---|---|---|---|---|---|
| 0 | Honda | White | 35431.0 | 4.0 | 15323.0 |
| 1 | BMW | Blue | 192714.0 | 5.0 | 19943.0 |


In [134]:
car_sales_missing.columns

Index(['Make', 'Colour', 'Odometer (KM)', 'Doors', 'Price'], dtype='object')

In [135]:
# fill Make column
car_sales_missing['Make'] = car_sales_missing['Make'].fillna('missing')

In [136]:
# fill  Colour
car_sales_missing['Colour'] = car_sales_missing['Colour'].fillna('missing')

In [137]:
# fill Odometer (KM)
car_sales_missing['Odometer (KM)'] = car_sales_missing['Odometer (KM)'].fillna(car_sales_missing['Odometer (KM)'].mean())

In [138]:
# fill Doors with the most common value; check the value counts first
car_sales_missing['Doors'].value_counts()

4.0    811
5.0     75
3.0     64
Name: Doors, dtype: int64

In [139]:
car_sales_missing['Doors'] = car_sales_missing['Doors'].fillna(4)

In [140]:
car_sales_missing.isnull().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [141]:
# as Price is the label column, we remove the rows where it is missing
car_sales_missing = car_sales_missing.dropna()

In [142]:
car_sales_missing.isnull().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

#### Split features and labels, encode columns and train the model

In [152]:
X = car_sales_missing.drop('Price', axis=1)
y = car_sales_missing['Price']

In [153]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [156]:
# encoding
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

one_hot_encoder = OneHotEncoder()
 
categorical_features = ['Make', 'Colour', 'Doors']
transformer = ColumnTransformer([('one_hot_encoder', one_hot_encoder, categorical_features)], remainder='passthrough')

In [157]:
X_train = transformer.fit_transform(X_train)
X_test = transformer.transform(X_test)

In [158]:
# model training
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor()
rf_model.fit(X_train, y_train)

rf_model.score(X_test, y_test)

0.19307721691177393
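Note that Option 1 fills the gaps before the train/test split, so the Odometer mean is computed from rows that later end up in the test set. A minimal sketch of one alternative is shown here, assuming it is acceptable to learn the fill values from the training split only; the `toy` frame and the `fill_values` name are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with gaps in the feature columns
toy = pd.DataFrame({
    "Odometer (KM)": [10000.0, np.nan, 30000.0, 50000.0, np.nan,
                      70000.0, 90000.0, 20000.0, np.nan, 60000.0],
    "Doors": [4.0, 4.0, np.nan, 5.0, 4.0, 3.0, 4.0, np.nan, 4.0, 4.0],
    "Price": [15000, 9000, 12000, 11000, 8000, 7000, 5000, 14000, 10000, 6000],
})

X = toy.drop("Price", axis=1)
y = toy["Price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn the fill values from the training rows only
fill_values = {
    "Odometer (KM)": X_train["Odometer (KM)"].mean(),
    "Doors": X_train["Doors"].mode()[0],  # most common door count
}

# Apply the same training-derived values to both splits
X_train_filled = X_train.fillna(fill_values)
X_test_filled = X_test.fillna(fill_values)

print(X_train_filled.isnull().sum().sum(), X_test_filled.isnull().sum().sum())
```

This mirrors what `fit_transform` on the training data followed by `transform` on the test data does in Scikit-Learn.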

### Option 2) Fill missing data with Scikit-Learn, using SimpleImputer

In [159]:
car_sales_missing = pd.read_csv("../00.datasets/car-sales-extended-missing-data.csv")
car_sales_missing.isnull().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [161]:
# drop the rows that have no Price label
car_sales_missing = car_sales_missing.dropna(subset=['Price'])
car_sales_missing.isnull().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [162]:
# Split features and labels
X = car_sales_missing.drop('Price', axis=1)
y = car_sales_missing['Price']

In [163]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### fill missing values using SimpleImputer

In [165]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# fill categorical features with "missing" & numerical features with "mean"
categorical_imputer = SimpleImputer(strategy='constant', fill_value='missing') # fill with constant value of 'missing'
door_imputer = SimpleImputer(strategy='constant', fill_value=4) # fill missing doors with default value of 4 doors
numerical_imputer = SimpleImputer(strategy='mean') # fill with mean value

# define column
categorical_features = ['Make', 'Colour']
door_features = ['Doors']
numerical_features = ['Odometer (KM)']

# create imputer 
imputer = ColumnTransformer([
    ('categorical_imputer', categorical_imputer, categorical_features),
    ('door_imputer', door_imputer, door_features),
    ('numerical_imputer', numerical_imputer, numerical_features)
])

# transform data
filled_X_train = imputer.fit_transform(X_train)
filled_X_test = imputer.transform(X_test)

In [171]:
# confirm whether null values still exist or not
car_sales_filled_X_train = pd.DataFrame(filled_X_train, columns=['Make', 'Colour', 'Doors', 'Odometer (KM)'])
car_sales_filled_X_test = pd.DataFrame(filled_X_test,  columns=['Make', 'Colour', 'Doors', 'Odometer (KM)'])

In [172]:
car_sales_filled_X_train.isnull().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [173]:
car_sales_filled_X_test.isnull().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

**As there are no more missing values, we can continue with encoding and model training.**

In [174]:
# encoding
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

one_hot_encoder = OneHotEncoder()
 
categorical_features = ['Make', 'Colour', 'Doors']
transformer = ColumnTransformer([('one_hot_encoder', one_hot_encoder, categorical_features)], remainder='passthrough')

X_train = transformer.fit_transform(car_sales_filled_X_train)
X_test = transformer.transform(car_sales_filled_X_test)

In [175]:
# model training
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor()
rf_model.fit(X_train, y_train)

rf_model.score(X_test, y_test)

0.21054379092677078
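The imputing, encoding and training steps above can also be chained into a single estimator. The following is a rough sketch using Scikit-Learn's `Pipeline`, run on randomly generated toy data rather than the car sales file; the step names, the toy values and `handle_unknown="ignore"` are choices made for this illustration, not part of the workflow above:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data shaped like the car sales columns
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "Make": rng.choice(["Honda", "BMW", "Toyota"], size=n).astype(object),
    "Colour": rng.choice(["White", "Blue"], size=n).astype(object),
    "Odometer (KM)": rng.integers(10_000, 200_000, size=n).astype(float),
    "Doors": rng.choice([3.0, 4.0, 5.0], size=n),
    "Price": rng.integers(5_000, 30_000, size=n).astype(float),
})
# Knock out a few feature values to imitate missing data
toy.loc[[1, 7], "Make"] = np.nan
toy.loc[[3, 11], "Odometer (KM)"] = np.nan
toy.loc[[5], "Doors"] = np.nan

# Impute, then one-hot encode, the text columns
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
# Impute Doors with 4, then one-hot encode it as a category
door_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=4)),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("categorical", categorical_pipe, ["Make", "Colour"]),
    ("doors", door_pipe, ["Doors"]),
    ("numerical", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=42)),
])

X = toy.drop("Price", axis=1)
y = toy["Price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() learns the fill values and categories from the training split only
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions.shape)
```

Because the toy prices are random, the score of this sketch is meaningless; the point is only that one `fit` call applies every preprocessing step in order.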

-------------------

------------------

## 1.3) Handling Categorical Data Encoding

- Convert the data to numbers: every feature must be numerical before training.
- Most Scikit-Learn estimators, including the `RandomForestRegressor` used here, expect numerical input, so categorical columns need to be converted into numerical ones.
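To make the conversion concrete, here is a small sketch of what one-hot encoding produces for a single column. The `colours` frame is hypothetical toy data, and `handle_unknown="ignore"` is an optional setting added here to show how unseen categories can be handled:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column
colours = pd.DataFrame({"Colour": ["White", "Blue", "White", "Red"]})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(colours).toarray()

# Categories are sorted, so the output columns are Blue, Red, White
print(list(encoder.categories_[0]))
print(encoded)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]

# A colour that was not present during fit becomes a row of zeros
unseen = encoder.transform(pd.DataFrame({"Colour": ["Green"]})).toarray()
print(unseen)  # [[0. 0. 0.]]
```

Each category gets its own 0/1 column, so no artificial ordering between the categories is introduced.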

In [57]:
car_sales = pd.read_csv('../00.datasets/car-sales-extended.csv')
car_sales.head(2)

| | Make | Colour | Odometer (KM) | Doors | Price |
|---|---|---|---|---|---|
| 0 | Honda | White | 35431 | 4 | 15323 |
| 1 | BMW | Blue | 192714 | 5 | 19943 |


In [58]:
X = car_sales.drop('Price', axis=1)
y = car_sales['Price']

In [60]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [61]:
# we one-hot encode Make, Colour and Doors
# Doors is treated as categorical because it only takes a few distinct values, so each door count (e.g. 4) can be treated as its own category
car_sales['Doors'].value_counts()

4    856
5     79
3     65
Name: Doors, dtype: int64

In [62]:
# convert using encoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

one_hot_encoder = OneHotEncoder()

### Option 1) Using ColumnTransformer

In [63]:

categorical_features = ['Make', 'Colour', 'Doors']

transformer = ColumnTransformer([('one_hot_encoder', one_hot_encoder, categorical_features)], remainder='passthrough')

In [64]:
X_train = transformer.fit_transform(X_train)
X_test = transformer.transform(X_test)

### Option 2) Using pandas get_dummies

In [65]:
dummies = pd.get_dummies(car_sales[['Make', 'Colour', 'Doors']])
dummies.head(5)

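| | Doors | Make_BMW | Make_Honda | Make_Nissan | Make_Toyota | Colour_Black | Colour_Blue | Colour_Green | Colour_Red | Colour_White |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 5 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 3 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |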
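In the output above, `Doors` stays a single numeric column: by default `pd.get_dummies` only encodes object, string and category columns. If Doors should be one-hot encoded as well, one way is to name the columns explicitly through the `columns` parameter. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", "Nissan"],
    "Doors": [4, 5, 3],
})

# Default behaviour: the numeric Doors column is passed through unchanged
default_dummies = pd.get_dummies(toy)
print(list(default_dummies.columns))
# ['Doors', 'Make_BMW', 'Make_Honda', 'Make_Nissan']

# Naming the columns forces Doors to be encoded too
all_dummies = pd.get_dummies(toy, columns=["Make", "Doors"])
print(list(all_dummies.columns))
# ['Make_BMW', 'Make_Honda', 'Make_Nissan', 'Doors_3', 'Doors_4', 'Doors_5']
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so applying it separately to training and test data can produce different columns.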


#### Model Training

In [66]:
from sklearn.ensemble import RandomForestRegressor

In [67]:
rf_model = RandomForestRegressor()

In [68]:
rf_model.fit(X_train, y_train)

RandomForestRegressor()

In [69]:
rf_model.score(X_test, y_test)

0.3247903260869359
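For regressors, `score` returns the coefficient of determination (R^2) on the data passed in, where 1.0 is a perfect fit and 0.0 is no better than always predicting the mean. The sketch below checks that against `sklearn.metrics.r2_score` on synthetic data; the data and the model settings are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic data: a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=100)

# Train on the first 80 samples, evaluate on the last 20
model = RandomForestRegressor(n_estimators=20, random_state=42)
model.fit(X[:80], y[:80])

score = model.score(X[80:], y[80:])
manual = r2_score(y[80:], model.predict(X[80:]))

# Both calls compute the same R^2 value
print(np.isclose(score, manual))
```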

-----------