# Introduction to Scikit-Learn (sklearn)

This notebook illustrates some of the most useful functions in the Scikit-Learn (sklearn) library.

Content:

0. End-to-end
1. Prepare the data
2. Choose the right estimator/model/algorithm for our problem
3. Fit the model to the data and use it to make predictions
4. Evaluate the model
5. Improve the model
6. Save and load a trained model
7. Put it all together

# 0. End-to-end Scikit-Learn workflow

In [82]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [17]:
# features matrix
X = heart_disease.drop("target", axis=1)
# labels
y = heart_disease["target"]

In [18]:
X.shape, y.shape

((303, 13), (303,))

In [19]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

In [20]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

In [21]:
# Choose the model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

In [22]:
# Check out the hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [25]:
# Fit the model to the data
clf.fit(X_train, y_train)

In [26]:
# Use the model to make a prediction
y_predictions = clf.predict(X_test)
y_predictions

array([1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1])

In [42]:
# type(y_test), type(y_predictions)
# pd.DataFrame({"Test": y_test, "Predictions": y_predictions}).sort_index()

In [43]:
# Evaluate the model
# 1. On the training set
clf.score(X_train, y_train)


1.0

In [45]:
# 2. On the testing set
clf.score(X_test, y_test)

0.819672131147541

In [47]:
# Another method to evaluate
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(classification_report(y_test, y_predictions))


              precision    recall  f1-score   support

           0       0.83      0.80      0.81        30
           1       0.81      0.84      0.83        31

    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61



In [51]:
conf_mat = confusion_matrix(y_test, y_predictions)
conf_mat
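# Layout returned by scikit-learn for labels [0, 1]
# (rows = actual class, columns = predicted class):
#
#                 Predicted 0            Predicted 1
#            |----------------------|----------------------|
# Actual 0   |  True Negative (TN)  |  False Positive (FP) |
# Actual 1   |  False Negative (FN) |  True Positive (TP)  |
#            |----------------------|----------------------|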


array([[24,  6],
       [ 5, 26]])
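
The figures in the classification report can be recovered by hand from this matrix. The sketch below assumes scikit-learn's default layout for a binary problem (rows are actual classes, columns are predicted classes, labels sorted as 0 then 1) and hard-codes the counts from the output above; the variable names are illustrative only.

```python
# Counts copied from the confusion matrix output above.
# Assumed layout (scikit-learn default): rows = actual class, columns = predicted class.
conf_mat_counts = [[24, 6],
                   [5, 26]]

tn, fp = conf_mat_counts[0]  # actual 0: predicted 0, predicted 1
fn, tp = conf_mat_counts[1]  # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)  # of everything predicted as 1, the share that really is 1
recall_1 = tp / (tp + fn)     # of everything that really is 1, the share that was found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy:            {accuracy:.4f}")
print(f"precision (class 1): {precision_1:.4f}")
print(f"recall (class 1):    {recall_1:.4f}")
print(f"f1-score (class 1):  {f1_1:.4f}")
```

Rounded to two decimals, these values should line up with the class 1 row and the accuracy row of the report printed earlier.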

In [52]:
accuracy_score(y_test, y_predictions)

0.819672131147541

In [62]:
# Experiment to improve
# trying different numbers of estimators (trees) - no cross-validation
np.random.seed(42)
for i in range(1, 111):
    model = RandomForestClassifier(n_estimators=i)
    model.fit(X_train, y_train)
    print(f"Trying model with {i} estimators...")
    print(f"Model accuracy on test set: {model.score(X_test, y_test) * 100:.2f}%", end="\n\n")

Trying model with 1 estimators...
Model accuracy on test set: 83.61%

Trying model with 2 estimators...
Model accuracy on test set: 72.13%

Trying model with 3 estimators...
Model accuracy on test set: 72.13%

Trying model with 4 estimators...
Model accuracy on test set: 73.77%

Trying model with 5 estimators...
Model accuracy on test set: 72.13%

Trying model with 6 estimators...
Model accuracy on test set: 73.77%

Trying model with 7 estimators...
Model accuracy on test set: 77.05%

Trying model with 8 estimators...
Model accuracy on test set: 78.69%

Trying model with 9 estimators...
Model accuracy on test set: 73.77%

Trying model with 10 estimators...
Model accuracy on test set: 81.97%

Trying model with 11 estimators...
Model accuracy on test set: 72.13%

Trying model with 12 estimators...
Model accuracy on test set: 77.05%

Trying model with 13 estimators...
Model accuracy on test set: 72.13%

Trying model with 14 estimators...
Model accuracy on test set: 81.97%

Trying model wi
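
Scoring every candidate on the same test set, as the loop above does, ties the comparison to one particular split. One hedged alternative is k-fold cross-validation. The sketch below uses `cross_val_score` on a synthetic dataset from `make_classification` as a stand-in for the heart-disease data (an assumption, so the block runs on its own), and the candidate values for `n_estimators` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the heart-disease features and labels (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)

cv_results = {}
for n in [10, 50, 100]:
    candidate = RandomForestClassifier(n_estimators=n, random_state=42)
    # 5-fold cross-validation: each fold is held out once while the rest is used for fitting.
    fold_scores = cross_val_score(candidate, X_demo, y_demo, cv=5)
    cv_results[n] = fold_scores.mean()
    print(f"{n} estimators: mean accuracy {cv_results[n] * 100:.2f}% over 5 folds")
```

With the real data, the same loop would be run on `X_train` and `y_train`, keeping the test set aside for a single final check.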

In [79]:
# Save the model

import pickle

# Use a context manager so the file is closed after writing
with open("models/random_forest_model_1.pkl", "wb") as f:
    pickle.dump(model, f)

In [80]:
# Load the model

# Use a context manager so the file is closed after reading
with open("models/random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)

In [81]:
loaded_model.score(X_test, y_test)

0.7868852459016393

# 1. Prepare the data

Three main things we have to do:

1. Split the data into features and labels (usually `X` & `y`)
2. Fill (impute) or disregard missing values
3. Convert non-numerical values to numerical values (a.k.a. feature encoding)

In [85]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [87]:
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

In [88]:
# splitting the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

In [89]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

## 1.1 Make sure it is all numerical

In [90]:
car_sales = pd.read_csv("data/car-sales-extended.csv")

In [92]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [93]:
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [103]:
from sklearn.ensemble import RandomForestRegressor
np.random.seed(42)
X = car_sales.drop(["Price"], axis=1)
y = car_sales["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: could not convert string to float: 'Toyota'

In [117]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                 remainder="passthrough")
transformed_X = transformer.fit_transform(X)
pd.DataFrame(transformed_X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


In [115]:
(len(car_sales["Make"].value_counts()),
 len(car_sales["Doors"].value_counts()),
 len(car_sales["Colour"].value_counts()) )

(4, 3, 5)

In [118]:
np.random.seed(42)

X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                    y,
                                                    test_size=.2)
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.3235867221569877

In [119]:
pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,False,True,False,False,False,False,False,False,True
1,5,True,False,False,False,False,True,False,False,False
2,4,False,True,False,False,False,False,False,False,True
3,4,False,False,False,True,False,False,False,False,True
4,3,False,False,True,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...
995,4,False,False,False,True,True,False,False,False,False
996,3,False,False,True,False,False,False,False,False,True
997,4,False,False,True,False,False,True,False,False,False
998,4,False,True,False,False,False,False,False,False,True


In [120]:
# The top-level package has to be imported before its version can be read
import sklearn
print(sklearn.__version__)

## 1.2 Missing values

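There are two main ways to handle missing values:

1. Imputation: fill the missing values with some value
2. Removal: drop the samples that contain missing values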

In [208]:
# Import car sales missing data
car_sales_missing = pd.read_csv("data/car-sales-extended-missing-data.csv")

In [209]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [210]:
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [211]:
# Convert data to numbers
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                    one_hot,
                                    categorical_features)],
                                  remainder="passthrough")
X = transformer.fit_transform(X)

In [212]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

In [213]:
model.fit(X_train, y_train)

ValueError: Input X contains NaN.
RandomForestRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

## Option 1: Fill missing data with Pandas

In [None]:
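# Assign the filled columns back: calling fillna(..., inplace=True) on a selected
# column may leave the DataFrame unchanged in recent pandas versions
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("Missing")
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("Missing")
odometer_mean = car_sales_missing["Odometer (KM)"].mean()
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(odometer_mean)
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)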

In [None]:
# Remove rows with missing Price value
car_sales_missing.dropna(inplace=True)

In [None]:
car_sales_missing.isna().sum()

In [214]:
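# Requires the fill cells above to have been run; otherwise X still contains NaN
# and model.fit() raises the ValueError shown below
X = car_sales_missing.drop("Price", axis=1)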
y = car_sales_missing["Price"]
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                    one_hot,
                                    categorical_features)],
                                  remainder="passthrough")
X = transformer.fit_transform(X)
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
model.fit(X_train, y_train);
model.score(X_test, y_test)

ValueError: Input X contains NaN.
RandomForestRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

## Option 2: Fill missing data with scikit-learn

In [249]:
car_sales_missing = pd.read_csv("data/car-sales-extended-missing-data.csv")
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [250]:
car_sales_missing.dropna(subset=["Price"], inplace=True)

In [251]:
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [254]:
X = car_sales_missing.drop(["Price"], axis=1)
y = car_sales_missing["Price"]

np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)


In [263]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with 'missing' and numerical values with the mean
category_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
numeric_imputer = SimpleImputer(strategy="mean")

# Define columns
category_features = ["Make", "Colour"]
door_feature = ["Doors"]
numeric_features = ["Odometer (KM)"]

# Create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ("category_imputer", category_imputer, category_features),
    ("door_imputer", door_imputer, door_feature),
    ("numeric_imputer", numeric_imputer, numeric_features)
])

# Transform the data

filled_X_train = imputer.fit_transform(X_train)
filled_X_test = imputer.transform(X_test)

# One-hot encode the categorical columns

one_hot = OneHotEncoder()
categorical_features = ["Make", "Colour", "Doors"]
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                remainder="passthrough")
# The imputer returns NumPy arrays whose columns follow the order the imputers were listed in
filled_columns = ["Make", "Colour", "Doors", "Odometer (KM)"]
car_sales_filled_trained = pd.DataFrame(filled_X_train, columns=filled_columns)
car_sales_filled_test = pd.DataFrame(filled_X_test, columns=filled_columns)
filled_X_train = transformer.fit_transform(car_sales_filled_trained)
filled_X_test = transformer.transform(car_sales_filled_test)


In [266]:
model = RandomForestRegressor()
model.fit(filled_X_train, y_train)
model.score(filled_X_test, y_test)

0.2129179053329857
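
The fill-then-encode steps above can also be expressed as a single scikit-learn `Pipeline`, which keeps the rule that imputers and encoders are fitted on the training split only. This is a sketch rather than the notebook's own method. It runs on a small made-up DataFrame (an assumption, so the block is self-contained), and `handle_unknown="ignore"` is added so that categories unseen during fitting do not raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data with the same column names as the car sales file (assumption).
rng = np.random.default_rng(42)
n_rows = 200
demo = pd.DataFrame({
    "Make": rng.choice(["BMW", "Honda", "Nissan", "Toyota"], size=n_rows),
    "Colour": rng.choice(["Black", "Blue", "Green", "Red", "White"], size=n_rows),
    "Odometer (KM)": rng.integers(10_000, 250_000, size=n_rows).astype(float),
    "Doors": rng.choice([3.0, 4.0, 5.0], size=n_rows),
    "Price": rng.integers(5_000, 50_000, size=n_rows).astype(float),
})
# Knock out some feature values to imitate missing data.
demo.loc[demo.index[::10], "Make"] = np.nan
demo.loc[demo.index[5::10], "Odometer (KM)"] = np.nan
demo.loc[demo.index[3::10], "Doors"] = np.nan

# Text columns: fill with "missing", then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
# Doors: fill with 4, then one-hot encode.
doors = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=4)),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
# Odometer: fill with the training mean.
numeric = SimpleImputer(strategy="mean")

preprocessor = ColumnTransformer([
    ("categorical", categorical, ["Make", "Colour"]),
    ("doors", doors, ["Doors"]),
    ("numeric", numeric, ["Odometer (KM)"]),
])

pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
])

X_demo = demo.drop("Price", axis=1)
y_demo = demo["Price"]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipeline.fit(X_tr, y_tr)  # imputers and encoders learn from the training split only
demo_predictions = pipeline.predict(X_te)
print(demo_predictions.shape)
```

Because the prices here are random, the fitted model's score carries no meaning; the point is only that one `fit` and one `predict` call cover imputation, encoding, and modelling.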