# Introduction to Scikit-Learn (sklearn)

*This notebook documents my practical exploration of the scikit-learn library for Machine Learning.*

## Topics to cover in this notebook
0. End-to-end Scikit-Learn Workflow
1. Getting our data ready
2. Choose the right estimator/algorithm for our problems
3. Fit the model/algorithm and use it to make predictions on our data
4. Evaluating the model
5. Improving the model
6. Save and load our trained model
7. Putting it all together


In [6]:
# A list of what we are going to cover, so we can always refer back to it

our_workflow = [
    "0. End-to-end Scikit-Learn Workflow",
    "1. Getting our data ready",
    "2. Choose the right estimator/algorithm for our problems",
    "3. Fit the model/algorithm and use it to make predictions on our data",
    "4. Evaluating the model",
    "5. Improving the model",
    "6. Save and load our trained model",
    "7. Putting it all together"]

### 0. An end-to-end Scikit-learn Workflow

In [7]:
# 1 Getting our data ready
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt


heart_disease = pd.read_csv("heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [8]:
# Create X, the features matrix (every column except the target)
X = heart_disease.drop("target", axis=1)

# Create y, the labels
y = heart_disease["target"]


In [9]:
# 2. Choose the right model and hyperparameters
# Our problem here is a classification problem
# Hyperparameters are like dials on a model that you can adjust to make it perform better or worse

# Let's try Random Forest

from sklearn.ensemble import RandomForestClassifier # a classification model capable of learning patterns in data
clf = RandomForestClassifier() # clf is short for classifier

# We'll keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [10]:
# 3. Fit the model to the training data
from sklearn.model_selection import train_test_split

# test_size=0.2 holds out 20% of the data for testing and leaves 80% for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


In [11]:
clf.fit(X_train, y_train); #fitting the model to the training data

In [12]:
# make a prediction
# still step 3

# This raises an error: predict() expects a 2D array of shape (n_samples, n_features), not a 1D array
y_label = clf.predict(np.array([0, 2, 3, 4]))



ValueError: Expected 2D array, got 1D array instead:
array=[0. 2. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
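
The error arises because `predict` expects a 2D array with one row per sample and one column per feature the model was fitted on. A minimal sketch of predicting on a single sample, using synthetic data from `make_classification` as a stand-in for the heart disease features (an assumption made so the example runs on its own; `X_demo`, `y_demo` and `demo_clf` are illustrative names):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 100 samples, 13 features (same width as the heart disease X)
X_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# A single sample is a 1D array of 13 values...
single_sample = X_demo[0]
print(single_sample.shape)  # (13,)

# ...so reshape it to one row with 13 columns before predicting
single_pred = demo_clf.predict(single_sample.reshape(1, -1))
print(single_pred.shape)  # (1,)
```

`reshape(1, -1)` means "one row, as many columns as needed", which is the form the error message suggests for a single sample.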

In [15]:
y_preds = clf.predict(X_test)
y_preds

array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0])

In [16]:
y_test

142    1
140    1
225    0
93     1
262    0
      ..
186    0
107    1
116    1
65     1
276    0
Name: target, Length: 61, dtype: int64

In [17]:
# 4. Evaluate the model on the training data and test data

clf.score(X_train, y_train) # this will give us the accuracy of the model on the training data


1.0

In [18]:
clf.score(X_test, y_test) # this will give us the accuracy of the model on the test data

0.819672131147541

In [19]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(classification_report(y_test, y_preds))


              precision    recall  f1-score   support

           0       0.83      0.80      0.81        30
           1       0.81      0.84      0.83        31

    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61



In [20]:
confusion_matrix(y_test, y_preds)


array([[24,  6],
       [ 5, 26]])

In [21]:
accuracy_score(y_test, y_preds)

0.819672131147541
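
The numbers in the classification report can be recomputed by hand from the confusion matrix. The sketch below copies the matrix from the output above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels:

```python
import numpy as np

# Confusion matrix copied from the output above: rows are true labels, columns are predictions
cm = np.array([[24, 6],
               [5, 26]])

# Accuracy = correct predictions (the diagonal) / all predictions
accuracy = np.trace(cm) / cm.sum()

# Precision: of everything predicted as a class, how much was right (column-wise)
precision = np.diag(cm) / cm.sum(axis=0)

# Recall: of everything truly in a class, how much was found (row-wise)
recall = np.diag(cm) / cm.sum(axis=1)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision.round(2)}")
print(f"recall:    {recall.round(2)}")
```

Accuracy works out to 50 correct predictions out of 61, which matches both `clf.score` and `accuracy_score`.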

In [22]:
# 5. Improve a model

# Try different values of n_estimators
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%")
    print("")


Trying model with 10 estimators...
Model accuracy on test set: 85.25%

Trying model with 20 estimators...
Model accuracy on test set: 83.61%

Trying model with 30 estimators...
Model accuracy on test set: 81.97%

Trying model with 40 estimators...
Model accuracy on test set: 85.25%

Trying model with 50 estimators...
Model accuracy on test set: 81.97%

Trying model with 60 estimators...
Model accuracy on test set: 83.61%

Trying model with 70 estimators...
Model accuracy on test set: 83.61%

Trying model with 80 estimators...
Model accuracy on test set: 85.25%

Trying model with 90 estimators...
Model accuracy on test set: 83.61%
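
With only 61 test samples, each sample is worth about 1.6 percentage points, so the differences above come down to one or two predictions. A rough sketch of the same search with the randomness pinned through `random_state` and the scores collected in a dictionary; the data here is synthetic (an assumption, not the real heart disease dataset) and names such as `scores` and `best_n` are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

scores = {}
for n in range(10, 100, 10):
    # random_state pins the randomness of each forest, independent of np.random.seed
    model_n = RandomForestClassifier(n_estimators=n, random_state=42).fit(X_tr, y_tr)
    scores[n] = model_n.score(X_te, y_te)

best_n = max(scores, key=scores.get)
print(f"Best n_estimators on this split: {best_n} ({scores[best_n] * 100:.2f}%)")
```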



In [14]:
# 6. Save a model and load it

import pickle # pickle is a Python module for serializing objects, so it can save a trained model to disk

# Save an existing model to file
with open("random_forest_model_1.pkl", "wb") as f: # "wb" stands for write binary
    pickle.dump(clf, f)

In [15]:
# Load a saved model
with open("random_forest_model_1.pkl", "rb") as f: # "rb" stands for read binary
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

0.8360655737704918
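
A quick way to confirm that a saved model survives the round trip is to compare its predictions with those of the original. The sketch below uses synthetic data and a temporary directory so it runs on its own; the file name `demo_model.pkl` and the variable names are hypothetical:

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model to save
X_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Hypothetical file location inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")

    # The with-blocks close the file handles automatically
    with open(model_path, "wb") as f:
        pickle.dump(demo_clf, f)

    with open(model_path, "rb") as f:
        restored_clf = pickle.load(f)

# The restored model should predict exactly what the original predicts
same_predictions = bool((restored_clf.predict(X_demo) == demo_clf.predict(X_demo)).all())
print(same_predictions)
```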

### 1. Getting the data ready to be used with machine learning

1. Splitting the data into features and labels (usually `X` and `y`)
2. Filling (also called imputing) or disregarding missing values
3. Converting non-numerical values to numerical values (also called feature encoding)
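
The three steps above can be sketched on a tiny table. The values below are made up for illustration and `toy`, `X_toy` and `y_toy` are hypothetical names; this is a pandas-only sketch, not the only way to do it:

```python
import pandas as pd

# Tiny made-up table (hypothetical values) with a missing number and a text column
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota", "Honda"],
    "Odometer (KM)": [35431.0, None, 84714.0, 154365.0],
    "Price": [15323, 19943, 28343, 13434],
})

# 1. Split into features (X) and labels (y)
X_toy = toy.drop("Price", axis=1)
y_toy = toy["Price"]

# 2. Fill (impute) the missing odometer reading with the column mean
X_toy["Odometer (KM)"] = X_toy["Odometer (KM)"].fillna(X_toy["Odometer (KM)"].mean())

# 3. Convert the text column to numbers (one-hot encoding)
X_toy = pd.get_dummies(X_toy, columns=["Make"])

print(X_toy)
```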

In [16]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [17]:
# Using pandas

X = heart_disease.drop("target", axis=1) # X is going to be every single column except the target column
X.head()


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [18]:
y = heart_disease["target"] # y is going to be the target column
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [20]:
#split the data into training and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


In [21]:
# We'll check the shapes of the data to make sure the shapes are correct for the model

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

In [22]:
# To verify that the data is split 80% for training and 20% for testing,
# we can check the lengths of the data

len(heart_disease), len(X_train), len(X_test)

(303, 242, 61)

In [23]:
# We can check by multiplying the number of rows by 0.8

X.shape[0] * 0.8

242.4

In [24]:
242 + 61 # this should be equal to the length of the data

303
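
The training set has 242 rows rather than 242.4 because sample counts must be whole numbers. As far as I understand `train_test_split`, a float `test_size` is converted to a count by rounding up, and the training set gets the remainder. A small sketch that checks this on a dummy array of the same length:

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 303
test_size = 0.2

# With a float test_size, the test set size is rounded up...
n_test = math.ceil(n_samples * test_size)
# ...and the training set gets whatever is left
n_train = n_samples - n_test
print(n_train, n_test)  # 242 61

# Check against train_test_split on a dummy array of the same length
train_idx, test_idx = train_test_split(np.arange(n_samples), test_size=test_size, random_state=0)
print(len(train_idx), len(test_idx))
```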

### 1.1 Make sure it's all numerical


In [25]:
# We'll import our car-sales-extended.csv file for our dataset

car_sales = pd.read_csv("car-sales-extended.csv")
car_sales.head()


Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [26]:
# Let's check the length of the data
len(car_sales)

1000

In [27]:
# Let's check the data types of the columns
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [28]:
# Split the data into X and y
X = car_sales.drop("Price", axis=1) # X is the features matrix
y = car_sales["Price"] # y is the labels

#Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [30]:
# We use RandomForestRegressor because we are predicting a number: the price of a car, based on its make, colour, odometer reading and number of doors
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor() # create the model
model.fit(X_train, y_train) # fit the model to the training data
model.score(X_test, y_test) # for a regressor, score() returns R^2 (the coefficient of determination) on the test data, not accuracy

ValueError: could not convert string to float: 'Honda'

In [33]:
X.head() #checking the value of X variable

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


In [35]:
# Turn the categories into numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"] # columns to treat as categories (Doors is numeric but only takes a few distinct values)
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", 
                                  one_hot, 
                                  categorical_features)], 
                                  remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]], shape=(1000, 13))

In [36]:
# put the transformed data back into a DataFrame

pd.DataFrame(transformed_X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


![one-hot-encoding](one-hot-encoding.png)

In [37]:
# Another way to do one hot encoding is to use pandas dummies

dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,False,True,False,False,False,False,False,False,True
1,5,True,False,False,False,False,True,False,False,False
2,4,False,True,False,False,False,False,False,False,True
3,4,False,False,False,True,False,False,False,False,True
4,3,False,False,True,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...
995,4,False,False,False,True,True,False,False,False,False
996,3,False,False,True,False,False,False,False,False,True
997,4,False,False,True,False,False,True,False,False,False
998,4,False,True,False,False,False,False,False,False,True
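
Notice that `Doors` stayed as a single numeric column in the output above: by default `pd.get_dummies` only encodes text and categorical columns. A short sketch, on made-up rows, of how the `columns` argument forces `Doors` to be encoded as well (`toy_cars`, `default_dummies` and `all_dummies` are illustrative names):

```python
import pandas as pd

# Made-up rows in the same layout as the columns used above
toy_cars = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota"],
    "Colour": ["White", "Blue", "White"],
    "Doors": [4, 5, 4],
})

# Default: only text/categorical columns are encoded, so Doors stays numeric
default_dummies = pd.get_dummies(toy_cars)

# Naming the columns explicitly encodes Doors as a category too
all_dummies = pd.get_dummies(toy_cars, columns=["Make", "Colour", "Doors"])

print(default_dummies.columns.tolist())
print(all_dummies.columns.tolist())
```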


In [40]:
# Now that the categorical columns are one-hot encoded and every value is numeric, we can try to fit the model again

np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

model.fit(X_train, y_train)
model.score(X_test, y_test)

# The score is low (R^2 is at most 1.0, and higher is better), possibly because make, colour, doors and odometer alone aren't enough to predict the price well

0.3235867221569877
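
For a regressor, `score` returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the true values from their mean. A small sketch with made-up prices and predictions (hypothetical numbers, not model output) that computes it by hand and compares with `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up prices and predictions (hypothetical numbers, for illustration only)
y_true = np.array([15323, 19943, 28343, 13434, 14043])
y_pred = np.array([16000, 18500, 24000, 15000, 14500])

# R^2 = 1 - (sum of squared residuals) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))

# Always predicting the mean of the true values gives a score of 0
r2_mean = r2_score(y_true, np.full(len(y_true), y_true.mean()))
print(r2_mean)
```

A score of about 0.32 therefore means the model explains roughly a third of the variance in price that a constant mean prediction leaves unexplained.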

### 1.2 What if there are missing values in the data?

There are two main ways to deal with missing data:

1. Fill them with some values, also known as imputation.
2. Remove the samples with missing data altogether.

In [42]:
# Import car sales missing data

car_sales_missing = pd.read_csv("car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [43]:
car_sales_missing.isna().sum() # number of missing values in each column
# isna() is a pandas method that flags missing values in a DataFrame

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [44]:
# Create new X and y

X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [45]:
# As we can see above, there are missing values in each column

# Let's try converting the data to numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"] # columns to treat as categories (Doors is numeric but only takes a few distinct values)
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", 
                                  one_hot, 
                                  categorical_features)], 
                                  remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 4000 stored elements and shape (1000, 16)>
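
The output above is a SciPy sparse matrix rather than a NumPy array. One-hot encoded data is mostly zeros, so a sparse format stores only the non-zero entries. A minimal sketch, on a made-up `makes` column, of how to recognise a sparse result and convert it to a dense array for inspection:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Made-up single-column input (hypothetical values)
makes = np.array([["Honda"], ["BMW"], ["Honda"], ["Toyota"]])

# By default OneHotEncoder returns a SciPy sparse matrix, which stores only the non-zero entries
encoded_sparse = OneHotEncoder().fit_transform(makes)
print(sparse.issparse(encoded_sparse))  # True

# .toarray() converts it to a regular dense NumPy array for inspection
encoded_dense = encoded_sparse.toarray()
print(encoded_dense)
```

Converting to dense is fine for small data like this; for very wide encodings, keeping the sparse form saves memory.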

In [48]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [52]:
# Option 1: Fill missing data with pandas

# Fill the "Make" column
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

# Fill the "Colour" column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Fill the "Odometer (KM)" column; we can't just put in any value, so we use the column mean
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

# Fill the "Doors" column with 4 because most of the cars have 4 doors
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)

# Note: the results are assigned back to each column instead of calling fillna(..., inplace=True) on it.
# pandas warns that the chained inplace form acts on a copy and that its behavior will change in pandas 3.0.


In [53]:
# Check out the dataframe again to see if the missing values are filled
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [54]:
# Remove rows with missing Price value

car_sales_missing.dropna(inplace=True)

In [56]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [57]:
len(car_sales_missing)

950
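
The fill-then-drop steps above can also be written as two calls: one `fillna` with a column-to-value mapping, and one `dropna` restricted to the label column. A sketch on made-up rows (`toy_missing`, `toy_filled` and `toy_clean` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps (hypothetical values)
toy_missing = pd.DataFrame({
    "Make": ["Honda", None, "Toyota", "BMW"],
    "Odometer (KM)": [35431.0, 192714.0, np.nan, 154365.0],
    "Doors": [4.0, np.nan, 4.0, 5.0],
    "Price": [15323.0, 19943.0, np.nan, 13434.0],
})

# One fillna call with a column -> fill value mapping; no inplace needed
toy_filled = toy_missing.fillna({
    "Make": "missing",
    "Odometer (KM)": toy_missing["Odometer (KM)"].mean(),
    "Doors": 4,
})

# Drop only the rows whose label (Price) is missing
toy_clean = toy_filled.dropna(subset=["Price"])

print(toy_clean.isna().sum().sum(), len(toy_clean))
```

Using `subset=["Price"]` makes the intent explicit: feature gaps are filled, and only rows without a label are removed.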

In [58]:
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]


In [60]:
# Let's try converting the data to numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"] # columns to treat as categories (Doors is numeric but only takes a few distinct values)
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", 
                                  one_hot, 
                                  categorical_features)], 
                                  remainder="passthrough")

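# Note: this call passes the original car_sales DataFrame (1000 rows, Price included) rather than the filled X,
# which is why the output below has 1000 rows and 14 columns. To encode the filled data, pass X instead.
transformed_X = transformer.fit_transform(car_sales)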
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]], shape=(1000, 14))

### Option 2: Fill missing values with Scikit-Learn
