# Introduction to Scikit-Learn (sklearn)

This notebook demonstrates some of the most useful functions of the Scikit-Learn library.

What we're going to cover:

0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choose the right estimator/algorithm for our problems
3. Fit the model/algorithm and use it to make predictions on our data
4. Evaluate a model
5. Improve a model
6. Save and load a trained model
7. Putting it all together

In [1]:
what_were_covering = [
    '0. An end-to-end Scikit-Learn workflow',
    '1. Getting the data ready',
    '2. Choose the right estimator/algorithm for our problems',
    '3. Fit the model/algorithm and use it to make predictions on our data',
    '4. Evaluate a model',
    '5. Improve a model',
    '6. Save and load a trained model',
    '7. Putting it all together',
]
what_were_covering

['0. An end-to-end Scikit-Learn workflow',
 '1. Getting the data ready',
 '2. Choose the right estimator/algorithm for our problems',
 '3. Fit the model/algorithm and use it to make predictions on our data',
 '4. Evaluate a model',
 '5. Improve a model',
 '6. Save and load a trained model',
 '7. Putting it all together']

In [2]:
import numpy as np

## 0. An end-to-end Scikit-Learn workflow

In [3]:
# 1. Get the data ready
import pandas as pd
heart_disease = pd.read_csv('../data/heart-disease.csv')
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [4]:
# Create X - features matrix
X = heart_disease.drop('target', axis=1)

# Create Y - labels
Y = heart_disease['target']

In [5]:
# 2. Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
# clf = classifier/model & RandomForestClassifier = a classification model
clf = RandomForestClassifier(n_estimators=20)

# View the model's hyperparameters (all defaults except n_estimators=20)
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 20,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [6]:
# 3. Fit the model to the training data
from sklearn.model_selection import train_test_split

# Creates training and testing sets, training = 80% testing = 20%
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

In [7]:
# Find the patterns in our training data sets
clf.fit(X_train, Y_train);

In [8]:
# Make a prediction on a single made-up sample
# DOES NOT WORK => the input must be 2-D with the same number of feature columns as X_train
# y_label = clf.predict(np.array([0,20,3,4]))
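
The failing call above comes down to input shape. A minimal sketch of what `predict` accepts, assuming a synthetic 13-feature dataset from `make_classification` as a stand-in for the heart-disease data (the names `X_demo`, `demo_clf` and `single_sample` are illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the heart-disease data: 100 samples, 13 features
X_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# A 1-D array with the wrong number of values is rejected
try:
    demo_clf.predict(np.array([0, 20, 3, 4]))
except ValueError as err:
    print("ValueError:", str(err)[:60])

# A single sample must be 2-D: one row, 13 columns
single_sample = X_demo[0].reshape(1, -1)
single_pred = demo_clf.predict(single_sample)
print(single_sample.shape, single_pred.shape)
```

Reshaping with `reshape(1, -1)` turns one sample into a one-row matrix, which is the shape the fitted model expects.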

In [9]:
# Make predictions using the testing dataset
# y_preds is a standard name for the model's predicted labels
y_preds = clf.predict(X_test)
y_preds

array([1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0,
       1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0])

In [10]:
# 4. Evaluate the model on the training data and test data
# Returns the mean accuracy on the given test data and labels
clf.score(X_train, Y_train)
# Nearly 100% accuracy on the training data (the model has already seen these samples)

0.9958677685950413

In [11]:
# on test data
clf.score(X_test, Y_test)
# About 80% correct on unseen data (the exact figure varies between runs because no random seed is set here)

0.8032786885245902

In [12]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(Y_test, y_preds))

              precision    recall  f1-score   support

           0       0.65      0.85      0.74        20
           1       0.91      0.78      0.84        41

    accuracy                           0.80        61
   macro avg       0.78      0.82      0.79        61
weighted avg       0.83      0.80      0.81        61



In [13]:
confusion_matrix(Y_test, y_preds)

array([[17,  3],
       [ 9, 32]])

In [14]:
accuracy_score(Y_test, y_preds)

0.8032786885245902
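
The numbers in the classification report can be recomputed by hand from the confusion matrix. A short sketch using the matrix printed above (in scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes):

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[17, 3],
               [9, 32]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / actually that class

print(f"accuracy:  {accuracy:.4f}")
print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
```

The accuracy works out to 49/61, the same value that `clf.score` and `accuracy_score` return.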

In [15]:
# 5. Improve a model
# Try a different number of n_estimators
np.random.seed(42) # make this reproducible
for i in range(10, 100, 10):
    print(f'Trying model with {i} estimators...')
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, Y_train)
    print(f'Model accuracy on test set: {clf.score(X_test, Y_test) * 100:.2f}%') # want to evaluate on test data (learns on training)
    print('-----------------------------')
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 75.41%
-----------------------------

Trying model with 20 estimators...
Model accuracy on test set: 83.61%
-----------------------------

Trying model with 30 estimators...
Model accuracy on test set: 81.97%
-----------------------------

Trying model with 40 estimators...
Model accuracy on test set: 80.33%
-----------------------------

Trying model with 50 estimators...
Model accuracy on test set: 83.61%
-----------------------------

Trying model with 60 estimators...
Model accuracy on test set: 80.33%
-----------------------------

Trying model with 70 estimators...
Model accuracy on test set: 81.97%
-----------------------------

Trying model with 80 estimators...
Model accuracy on test set: 83.61%
-----------------------------

Trying model with 90 estimators...
Model accuracy on test set: 81.97%
-----------------------------



In [16]:
# 6. Save a model and load it
import pickle

# Use a context manager so the file is closed after writing
with open('random_forest_model_1.pkl', 'wb') as f: # wb = write binary
    pickle.dump(clf, f)

In [17]:
with open('random_forest_model_1.pkl', 'rb') as f: # rb = read binary
    loaded_model = pickle.load(f)
loaded_model.score(X_test, Y_test)

0.819672131147541

In [18]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Getting data ready to be used with machine learning

Three main things we have to do:
1. Split the data into features and labels (usually `X` and `y`)
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values to numerical values (also called feature encoding)
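
The three steps can be sketched on a tiny table before applying them to real data. The DataFrame `toy` and its column names are hypothetical, and filling with the mean and encoding with `pd.get_dummies` are just one possible choice for steps 2 and 3:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value and one text column
toy = pd.DataFrame({
    "colour": ["red", "blue", "red", "green"],
    "km": [10_000.0, np.nan, 30_000.0, 20_000.0],
    "price": [5_000, 7_000, 3_000, 4_000],
})

# 1. Split into features (X) and labels (y)
X_toy = toy.drop("price", axis=1)
y_toy = toy["price"]

# 2. Fill the missing numerical value with the column mean
X_toy["km"] = X_toy["km"].fillna(X_toy["km"].mean())

# 3. Convert the text column to numbers (one column per category)
X_toy = pd.get_dummies(X_toy, columns=["colour"])

print(X_toy)
```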

In [19]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


### 1. Split the data into features (X) and labels (Y)

In [20]:
# Want to use the features columns to predict y
# drop the target column on axis 1 (columns in pandas df)
X = heart_disease.drop('target', axis=1)
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [21]:
# Get y label
y = heart_disease['target']
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

### 1.0 Split the data into training and test sets

In [22]:
from sklearn.model_selection import train_test_split
# pass in features X and labels y. Set the test size
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [23]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))
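
The 242/61 split follows from the 303 rows in the data: with a fractional `test_size`, scikit-learn rounds the test count up and gives the rest to training. A small sketch with placeholder arrays (`X_idx` and `y_idx` are stand-ins, not the real features) and a fixed `random_state` so the shuffle is repeatable:

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 303  # rows in the heart-disease data
test_size = 0.2

X_idx = np.arange(n_samples).reshape(-1, 1)  # placeholder features
y_idx = np.arange(n_samples)                 # placeholder labels

# random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_idx, y_idx, test_size=test_size, random_state=42
)

n_test = math.ceil(test_size * n_samples)  # test count is rounded up
print(len(X_tr), len(X_te), n_test)
```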

### 1.1 Make sure it's all numerical

We want to build a model that predicts the price of a car from its Make, Colour, Odometer (KM) and Doors.

In [24]:
# let's import car-sales-extended data
car_sales = pd.read_csv('../data/scikit-learn-data/car-sales-extended.csv')
car_sales.head(), len(car_sales)

(     Make Colour  Odometer (KM)  Doors  Price
 0   Honda  White          35431      4  15323
 1     BMW   Blue         192714      5  19943
 2   Honda  White          84714      4  28343
 3  Toyota  White         154365      4  13434
 4  Nissan   Blue         181577      3  14043,
 1000)

In [25]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [26]:
# Split the dataset into X/y, X = feature matrix, y = label
X = car_sales.drop('Price', axis=1)
y = car_sales['Price']

# Split into training and test sets, sklearn.model_selection.train_test_split()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [27]:
# Build machine learning model
from sklearn.ensemble import RandomForestRegressor

Note: Classifier vs Regressor
- Classifier: predicts a class (the possible outputs are known in advance). Examples: heart disease or not, spam or not spam (or the probability of spam).
- Regressor: predicts a value (the possible outputs are not all known in advance). Examples: the price of a car, future income.
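
To make the difference concrete, here is a minimal sketch on synthetic data (the arrays and the names `clf_demo` and `reg_demo` are made up for illustration): the classifier returns one of the known labels, optionally with class probabilities, while the regressor returns a continuous number.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 1))

y_class = (X_demo[:, 0] > 5).astype(int)                   # two possible labels: 0 or 1
y_value = 3.0 * X_demo[:, 0] + rng.normal(0, 1, size=200)  # any real number

clf_demo = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_class)
reg_demo = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_value)

new_points = np.array([[2.0], [8.0]])
class_preds = clf_demo.predict(new_points)        # class labels
class_probs = clf_demo.predict_proba(new_points)  # probability of each class
value_preds = reg_demo.predict(new_points)        # continuous values
print(class_preds, class_probs, value_preds)
```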

In [28]:
# This will error, all values must be numerical 

# model = RandomForestRegressor() # Create model
# model.fit(X_train, y_train) # Train model
# model.score(X_test, y_test) # Test model

# ValueError: could not convert string to float: 'Toyota'

In [29]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Get all categorical features: Make, Colour and Doors (cars with 3, 4 or 5 doors can be treated as categories)
# car_sales['Doors'].value_counts()
categorical_features = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
# ColumnTransformer accepts a list of tuples (name, transformer, columns) and applies each transformer to its listed columns
transformer = ColumnTransformer([('one_hot',# name
                                   one_hot, # transformer we want to use
                                   categorical_features)], # list of features to transform
                                   remainder='passthrough') # passthrough the remaining columns that don't match
# transform X (feature matrix)
transformed_X = transformer.fit_transform(X) # np.ndarray type
#turn into a dataframe

pd.DataFrame(transformed_X), X


(       0    1    2    3    4    5    6    7    8    9   10   11        12
 0    0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0   35431.0
 1    1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  192714.0
 2    0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0   84714.0
 3    0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0  154365.0
 4    0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  181577.0
 ..   ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...       ...
 995  0.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0   35820.0
 996  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  155144.0
 997  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0   66604.0
 998  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0  215883.0
 999  0.0  0.0  0.0  1.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0  248360.0
 
 [1000 rows x 13 columns],
        Make Colour  Odometer (KM)  Doors
 0     Honda  White          

<img src='OneHotEncoding.png'>
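
One-hot encoding is easier to see on a single small column. A sketch with a made-up `colours` DataFrame; `OneHotEncoder` returns a sparse matrix by default, so `.toarray()` is used here to print it:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colours = pd.DataFrame({"Colour": ["White", "Blue", "White", "Red"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(colours).toarray()  # sparse by default, so densify

print(encoder.categories_)  # categories are sorted: Blue, Red, White
print(encoded)              # one column per category, a single 1 per row
```

Each row has exactly one 1, in the column of that row's category.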

In [30]:
# Another way to encode is pandas' get_dummies (by default it only encodes object/string columns, so the numerical Doors column is left as-is)
dummies = pd.get_dummies(car_sales[['Make', 'Colour', 'Doors']])
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,0,1,0,0,0,0,0,0,1
1,5,1,0,0,0,0,1,0,0,0
2,4,0,1,0,0,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,1
4,3,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
995,4,0,0,0,1,1,0,0,0,0
996,3,0,0,1,0,0,0,0,0,1
997,4,0,0,1,0,0,1,0,0,0
998,4,0,1,0,0,0,0,0,0,1
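
In the output above `Doors` is still a single numeric column. If it should be treated as a category too, one option is the `columns` parameter of `pd.get_dummies`, which encodes the listed columns whatever their dtype. A sketch on a small made-up `cars` DataFrame:

```python
import pandas as pd

cars = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda"],
    "Doors": [4, 5, 4],
})

default_dummies = pd.get_dummies(cars)                         # Doors stays numeric
all_dummies = pd.get_dummies(cars, columns=["Make", "Doors"])  # Doors encoded too

print(list(default_dummies.columns))
print(list(all_dummies.columns))
```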


In [31]:
# Let's try to refit the model
np.random.seed(42) # keep it consistent
X_train, X_test, y_train, y_test = train_test_split(transformed_X, # split data with transformed_x
                                                    y,
                                                    test_size=0.2)

model = RandomForestRegressor() # Create model
model.fit(X_train, y_train) # fit/train the model

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

In [32]:
model.score(X_test, y_test)

0.3235867221569877
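
For a regressor, `score` returns the coefficient of determination (R^2), not accuracy. A value of 1.0 is a perfect fit, and 0.0 is no better than always predicting the mean price. A sketch of the calculation with made-up prices and predictions, checked against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical prices and predictions, for illustration only
y_true = np.array([15000.0, 20000.0, 28000.0, 13000.0, 14000.0])
y_pred = np.array([16000.0, 18000.0, 25000.0, 15000.0, 13000.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # squared error left after the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```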

### 1.2 What if there are missing values?

1. Fill them with some value (aka imputation)
2. Remove the samples with missing data altogether

In [33]:
# Import car sales missing data
car_sales_missing = pd.read_csv('../data/scikit-learn-data/car-sales-extended-missing-data.csv')
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [34]:
# check missing values 
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [35]:
# Create X & y
X = car_sales_missing.drop('Price', axis=1)
y = car_sales_missing['Price']

In [36]:
# Convert data to numbers
# Turn categories into numbers 
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot',
                                   one_hot,
                                   categorical_features)],
                                   remainder='passthrough')

transformed_X = transformer.fit_transform(X)
# transformed_X
# ValueError: Input contains NaN

ValueError: Input contains NaN

In [None]:
car_sales_missing

#### Option 1: Fill missing data with Pandas
Fill the categorical columns with the string `'missing'`, `Odometer (KM)` with its mean and `Doors` with 4.

In [None]:
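# Assign the result back instead of calling fillna(inplace=True) on a column,
# which newer pandas versions warn about and may not apply to the DataFrame

# Fill the 'Make' column
car_sales_missing['Make'] = car_sales_missing['Make'].fillna('missing')

# Fill the 'Colour' column
car_sales_missing['Colour'] = car_sales_missing['Colour'].fillna('missing')

# Fill the 'Odometer' column
car_sales_missing['Odometer (KM)'] = car_sales_missing['Odometer (KM)'].fillna(car_sales_missing['Odometer (KM)'].mean())

# Fill the 'Doors' column
car_sales_missing['Doors'] = car_sales_missing['Doors'].fillna(4)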

In [None]:
car_sales_missing.isna().sum()

In [None]:
# Remove rows with missing Price value
car_sales_missing.dropna(inplace=True)

In [None]:
car_sales_missing.isna().sum()

In [None]:
len(car_sales_missing)

In [None]:
# Split data into X & y
X = car_sales_missing.drop('Price', axis=1)
y = car_sales_missing['Price']

# Convert data to numbers
# Turn categories into numbers 
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot',
                                   one_hot,
                                   categorical_features)],
                                   remainder='passthrough')

# Transform the feature matrix X only, so the Price label is not passed through into the features
transformed_X = transformer.fit_transform(X)
transformed_X

#### Option 2: Fill missing values with scikit-learn

In [37]:
car_sales_missing = pd.read_csv('../data/scikit-learn-data/car-sales-extended-missing-data.csv')
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [38]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [43]:
# Drop rows with a missing label (Price)
# We can't train or evaluate on samples that have no label
car_sales_missing.dropna(subset=['Price'], inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [44]:
# Split into X and y
X = car_sales_missing.drop('Price', axis=1)
y = car_sales_missing['Price']

In [45]:
# Split data into train and test
from sklearn.model_selection import train_test_split

np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2)

In [47]:
# Fill missing values (imputation) with Scikit-Learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with 'missing' & numerical values with mean
cat_imputer = SimpleImputer(strategy='constant', fill_value='missing')
door_imputer = SimpleImputer(strategy='constant', fill_value=4)
num_imputer = SimpleImputer(strategy='mean')

# Define columns
cat_features = ['Make', 'Colour']
door_feature = ['Doors']
num_features = ['Odometer (KM)']

# Create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ('cat_imputer', cat_imputer, cat_features),
    ('door_imputer', door_imputer, door_feature),
    ('num_imputer', num_imputer, num_features)
])

# Transform the data 
# Fill train and test values separately
# fit_transform learns the fill values (e.g. the Odometer mean) from the training set and fills it in one step
filled_X_train = imputer.fit_transform(X_train)
# transform reuses the fill values learned from the training set on the test set
filled_X_test = imputer.transform(X_test)
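
Fitting on the training set only means the test set is filled with statistics learned from the training data, so no information from the test set leaks into the model. A minimal sketch with made-up odometer readings (`train_km`, `test_km` and `km_imputer` are illustrative names); the learned values are stored in the imputer's `statistics_` attribute:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical odometer readings with gaps
train_km = np.array([[10000.0], [np.nan], [30000.0]])
test_km = np.array([[np.nan], [50000.0]])

km_imputer = SimpleImputer(strategy="mean")
filled_train = km_imputer.fit_transform(train_km)  # learns the mean of the training column
filled_test = km_imputer.transform(test_km)        # reuses that training mean

print(km_imputer.statistics_)  # mean learned from the training set
print(filled_test.ravel())     # the gap is filled with the training mean
```

The gap in `test_km` is filled with the training mean, not the test mean. Values such as `130319.03314917127` in a filled test set are presumably the training-set mean applied the same way.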

In [49]:
filled_X_train, filled_X_test

(array([['Honda', 'White', 4.0, 71934.0],
        ['Toyota', 'Red', 4.0, 162665.0],
        ['Honda', 'White', 4.0, 42844.0],
        ...,
        ['Toyota', 'White', 4.0, 196225.0],
        ['Honda', 'Blue', 4.0, 133117.0],
        ['Honda', 'missing', 4.0, 150582.0]], dtype=object),
 array([['Toyota', 'Blue', 4.0, 99761.0],
        ['Toyota', 'Black', 4.0, 17975.0],
        ['Honda', 'Blue', 4.0, 197664.0],
        ['Nissan', 'Green', 4.0, 235589.0],
        ['Honda', 'Black', 4.0, 231659.0],
        ['Toyota', 'Blue', 4.0, 247601.0],
        ['Toyota', 'Green', 4.0, 110078.0],
        ['missing', 'White', 4.0, 155383.0],
        ['Nissan', 'White', 4.0, 26634.0],
        ['Honda', 'White', 4.0, 130319.03314917127],
        ['Honda', 'Green', 4.0, 238825.0],
        ['Honda', 'Green', 4.0, 37606.0],
        ['Toyota', 'Blue', 4.0, 230908.0],
        ['Toyota', 'Red', 4.0, 159925.0],
        ['Toyota', 'Blue', 4.0, 181466.0],
        ['Toyota', 'Blue', 4.0, 140465.0],
        ['Toyota

#### Get our transformed data arrays back into DataFrames
Now that we've filled our missing values, let's check how many are missing from each set.

In [50]:
car_sales_filled_train = pd.DataFrame(filled_X_train,
                                      columns=['Make', 'Colours', 'Doors', 'Odometer (KM)'])
car_sales_filled_test = pd.DataFrame(filled_X_test,
                                     columns=['Make', 'Colours', 'Doors', 'Odometer (KM)'])

# Check missing data in training set
car_sales_filled_train.isna().sum()

Make             0
Colours          0
Doors            0
Odometer (KM)    0
dtype: int64