# Introduction to Scikit-learn

This notebook is an introduction to the scikit-learn library.

## 1. Scikit-learn workflow

1. Scikit-learn Workflow
2. Getting the data ready
3. Choosing the right algorithm/estimator for our problem
4. Fitting the model and using it to make predictions on our data
5. Evaluating a model
6. Improving a model
7. Saving and loading a trained model

Steps 1 to 6 use the heart-disease data set to walk through the basic steps of ML modeling. The sub-steps, such as 2.1 or 3.1, use different data sets and cover details of the ML modeling process.
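
The steps listed above can be sketched end to end in a few lines. This is a minimal illustration, not the notebook's actual run: it assumes a synthetic data set from `make_classification` in place of the heart-disease CSV, and the names `X_demo`, `y_demo`, and `demo_clf` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real data set (assumption: 300 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)

# Get the data ready: hold out 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Choose an estimator
demo_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model and make predictions
demo_clf.fit(X_tr, y_tr)
demo_preds = demo_clf.predict(X_te)

# Evaluate: fraction of correct predictions on the test set
demo_accuracy = demo_clf.score(X_te, y_te)
print(f"Test accuracy: {demo_accuracy:.2f}")
```

Fixing `random_state` is a choice made here so that repeated runs give the same split and the same forest.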


# 2. Getting the data ready

In [136]:
import pandas as pd
import numpy as np

#the whole process took about 20 minutes
heart_disease = pd.read_csv('heart-disease.csv')
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [137]:
# Create X (features matrix)
# Rows are axis 0, columns are axis 1
# Drop the target column (axis=1), keeping every other column as a feature
X = heart_disease.drop("target", axis=1)

# Create Y (labels)
Y = heart_disease["target"]

X.head()



Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [138]:
Y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [139]:
# Three main things we have to do with data before it can be used for ML:
## 1. Split the data into features and labels (X and Y)
## 2. Fill (impute) or disregard missing values
## 3. Convert non-numerical values to numerical ones (called feature encoding)

## Data wrangling (munging): clean, transform, and reduce data to make it useful.

# Let's convert the sex column to strings (in this data set 1 = male, 0 = female)
X_categorical = pd.read_csv('heart-disease.csv').drop("target", axis=1)
X_categorical['sex'] = X_categorical['sex'].apply(lambda v: 'Male' if v == 1 else 'Female')
X_categorical


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,Male,3,145,233,1,0,150,0,2.3,0,0,1
1,37,Male,2,130,250,0,1,187,0,3.5,0,0,2
2,41,Female,1,130,204,0,0,172,0,1.4,2,0,2
3,56,Male,1,120,236,0,1,178,0,0.8,2,0,2
4,57,Female,0,120,354,0,1,163,1,0.6,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,Female,0,140,241,0,1,123,1,0.2,1,0,3
299,45,Male,3,110,264,0,1,132,0,1.2,1,0,3
300,68,Male,0,144,193,1,1,141,0,3.4,1,2,3
301,57,Male,0,130,131,0,1,115,1,1.2,1,1,3


## 2.1 Convert categorical values to numerical, one hot encoding

In [140]:
# Let's use one hot encoding/engineering with scikit learn
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_feature = ['sex']

one_hot = OneHotEncoder()

transformer = ColumnTransformer([('one_hot', one_hot, categorical_feature)], remainder='passthrough') 

transformed_X = transformer.fit_transform(X_categorical)
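
To see what one-hot encoding produces, here is a small sketch on a hypothetical three-row frame (the `toy` data and its values are invented for illustration). It uses the same `OneHotEncoder` and `ColumnTransformer` classes as the cell above; `sparse_threshold=0` is added only so the result prints as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical three-row frame with one categorical and one numeric column
toy = pd.DataFrame({"sex": ["Male", "Female", "Male"], "age": [63, 41, 37]})

toy_transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(), ["sex"])],
    remainder="passthrough",  # keep 'age' unchanged
    sparse_threshold=0,       # always return a dense array
)
toy_encoded = toy_transformer.fit_transform(toy)

# Columns: sex=Female, sex=Male, age (categories are sorted alphabetically)
print(toy_encoded)
```

Each category becomes its own 0/1 column, and the passthrough column is appended after the encoded ones.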




## 2.2 Handling missing values


In [141]:
# Many machine learning models don't work well when there are missing values in the data.

# There are two main options when dealing with missing values:

# 1. Fill them with some given value, imputing data
#    For example, you might fill missing values of a numerical column with the mean of all the other values. 
#    The practice of filling missing values is often referred to as imputation.
# 2. Remove them. If a row has missing values, you may opt to remove it from your sample completely.
#    However, this potentially results in using less data to build your model.

# How to deal with missing values depends on the problem, and there's often no single best way to do it.
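
The two options described above can be compared on a tiny example. This is only a sketch on a hypothetical four-row frame (`toy_cars` is made up here), using plain pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing odometer reading
toy_cars = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota", "Nissan"],
    "Odometer": [30000.0, np.nan, 90000.0, 60000.0],
})

# Option 1: impute, filling the gap with the mean of the observed values
imputed = toy_cars.copy()
imputed["Odometer"] = imputed["Odometer"].fillna(imputed["Odometer"].mean())

# Option 2: drop every row that has a missing value
dropped = toy_cars.dropna()

print(len(imputed), "rows after imputing;", len(dropped), "rows after dropping")
```

Imputing keeps all four rows, while dropping leaves three, which is the trade-off mentioned in the comments above.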

In [142]:
# Load the data set that contains missing values
car_sales_missing = pd.read_csv("car-sales-missing-data.csv")
car_sales_missing


# Fill missing values with pandas (assign the result back instead of using inplace=True on a column)
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")
car_sales_missing["Odometer"] = car_sales_missing["Odometer"].fillna(car_sales_missing["Odometer"].mean())
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)




## 2.2.1 Fill missing values with Scikit-Learn

In [143]:
car_sales_missing = pd.read_csv("car-sales-extended-missing-data.csv")
car_sales_missing.head()


Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [144]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [145]:
car_sales_missing.dropna(subset = ["Price"], inplace=True)
car_sales_missing.isna().sum()


Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [146]:
car_sales_missing


Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [147]:
#Fill missing values with Scikit-Learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with 'missing' and numerical values with the mean
# Categorical features
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
# Doors feature, which can be treated as numerical or categorical
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
# Numerical feature
num_imputer = SimpleImputer(strategy="mean")

cat_features = ["Make", "Colour"]
door_feature = ["Doors"]
num_features = ["Odometer (KM)"]

#Create an imputer 
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, num_features),

])

# Transform the data
car_sales_missing_filled= imputer.fit_transform(car_sales_missing)
car_sales_missing_filled



array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [148]:
car_sales_filled = pd.DataFrame(car_sales_missing_filled, columns =["Make", "Colour", "Doors", "Odometer"])
car_sales_filled

Unnamed: 0,Make,Colour,Doors,Odometer
0,Honda,White,4.0,35431.0
1,BMW,Blue,5.0,192714.0
2,Honda,White,4.0,84714.0
3,Toyota,White,4.0,154365.0
4,Nissan,Blue,3.0,181577.0
...,...,...,...,...
945,Toyota,Black,4.0,35820.0
946,missing,White,3.0,155144.0
947,Nissan,Blue,4.0,66604.0
948,Honda,White,4.0,215883.0


In [149]:
car_sales_filled.isna().sum()


Make        0
Colour      0
Doors       0
Odometer    0
dtype: int64
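
As a small illustration of what `SimpleImputer` does during `fit_transform`, the sketch below uses a hypothetical single-column array (the values are invented). `fit` learns the fill value and `transform` writes it into the gaps.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical odometer readings with one missing entry
odometer = np.array([[10000.0], [np.nan], [30000.0]])

mean_imputer = SimpleImputer(strategy="mean")
filled = mean_imputer.fit_transform(odometer)

# fit() learned the column mean; transform() wrote it into the gap
print(mean_imputer.statistics_)  # learned fill value per column
print(filled.ravel())
```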

# 3. Choose the right model/estimator and hyperparameters
### To choose a model for your problem, refer to the scikit-learn ML map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

### Classification problem - predicting a category (heart disease or not)
### Regression problem - predicting a number (selling price of a car)

In [150]:


X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [151]:
# We have a classification problem, because we want to classify whether someone has heart disease or not.

# Import a classification model: it learns patterns in the data and classifies whether a sample (row)
# is one thing or another.
from sklearn.ensemble import RandomForestClassifier
# clf is short for classifier in sklearn; the name model works too.
clf = RandomForestClassifier(n_estimators=100)

# We keep the default hyperparameters; get_params() lists them.
clf.get_params()



{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [152]:
# Split the data into training and test sets; the model is trained on the training set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape



((242, 13), (61, 13), (242,), (61,))
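
The shapes above follow from `test_size=0.2` applied to 303 rows. The sketch below reproduces the sizes on a stand-in array of row indices; `random_state=42` is an addition made here (it is not in the cell above) to show how a split can be made reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for a data set with 303 rows (values are just row indices)
rows = np.arange(303)

# random_state fixes the shuffle so the split is reproducible
train_a, test_a = train_test_split(rows, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))
```

Without `random_state`, each call shuffles differently, so scores from separate runs are not directly comparable.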

In [153]:
# Fit the model: the classification model finds the patterns in the training data

clf.fit(X_train, Y_train)

RandomForestClassifier()

In [154]:
# Make a prediction, y_preds is a conventional name
# y_label = clf.predict(np.array([0, 2, 3, 4])) - doesn't work as shape is not like X_train or X_test

y_preds = clf.predict(X_test)
y_preds




array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0,
       1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1])
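
`predict()` expects a 2-D input with the same number of columns as the training data, which is why a flat array fails. A minimal sketch of predicting for a single sample, assuming synthetic data with 13 features (`X_syn`, `y_syn`, and `syn_clf` are hypothetical names):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 13 features, standing in for the heart-disease features
X_syn, y_syn = make_classification(n_samples=200, n_features=13, random_state=0)
syn_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_syn, y_syn)

one_sample = X_syn[0]                      # shape (13,): a 1-D array
one_sample_2d = one_sample.reshape(1, -1)  # shape (1, 13): one row, 13 columns

single_pred = syn_clf.predict(one_sample_2d)  # array holding a single label
print(one_sample.shape, one_sample_2d.shape, single_pred.shape)
```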

In [155]:
Y_test

231    0
301    0
104    1
292    0
159    1
      ..
233    0
8      1
184    0
296    0
40     1
Name: target, Length: 61, dtype: int64

## 3.1 Picking a machine learning model for a regression problem.

In [156]:
## use California housing dataset

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
housing_df['target'] = housing["target"]
housing_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [157]:
#Split to train and test sets
from sklearn.model_selection import train_test_split


np.random.seed(42)

X_housing = housing_df.drop("target", axis=1)
Y_housing = housing_df["target"] # median house value in units of $100,000


X_housing_train, X_housing_test, Y_housing_train, Y_housing_test = train_test_split(X_housing, Y_housing, test_size=0.2)


In [158]:
# Ridge regression model
from sklearn.linear_model import Ridge

model = Ridge()
model.fit(X_housing_train, Y_housing_train)
model.score(X_housing_test, Y_housing_test)

# Regression evaluation metric: the default score() metric is the coefficient of determination (R squared),
# which measures how much of the variance in the target variable the model explains from the features.



0.5758549611440128
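
To make the R squared score concrete, the sketch below computes it by hand on a few hypothetical values and compares the result with `sklearn.metrics.r2_score`. The arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions
y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.5, 5.0, 3.0, 6.5])

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# Predicting the mean for every sample scores 0
r2_mean = r2_score(y_true_demo, np.full_like(y_true_demo, y_true_demo.mean()))

# Predictions worse than the mean score below 0
r2_bad = r2_score(y_true_demo, y_true_demo[::-1])

print(r2_manual, r2_mean, r2_bad)
```

A perfect model scores 1.0, always predicting the mean scores 0.0, and a model that does worse than the mean gets a negative score.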

## 3.2 Picking another ML model for regression problem

In [159]:
np.random.seed(42)
X_housing_train, X_housing_test, Y_housing_train, Y_housing_test = train_test_split(X_housing, Y_housing, test_size=0.2)

# Try an ensemble model (a combination of smaller models)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_housing_train, Y_housing_train)
model.score(X_housing_test, Y_housing_test)


0.8057322392488782

## 3.3 Picking a ML model for a classification problem

In [175]:
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(iris["data"], columns=('sepal length (cm)','sepal width (cm)','petal length (cm)', 'petal width (cm)'))
iris_df["target"] = iris["target"]
iris_df.head()



X_iris = iris_df.drop("target", axis=1)
Y_iris = iris_df["target"] # iris species encoded as 0, 1 or 2


X_iris_train, X_iris_test, Y_iris_train, Y_iris_test = train_test_split(X_iris, Y_iris, test_size=0.2)

from sklearn.svm import LinearSVC
np.random.seed(42)

clf_iris = LinearSVC()
clf_iris.fit(X_iris_train, Y_iris_train)
clf_iris.score(X_iris_test, Y_iris_test)




0.9666666666666667

## 3.4 Recommendation:
1.  for structured data use ensemble methods
2.  for unstructured data use deep learning or transfer learning

# 4. Fit/train a model and use it to make predictions on our data

In [163]:
np.random.seed(42)
from sklearn.ensemble import RandomForestClassifier
# Heart disease is a classification problem, so use a classifier rather than a regressor
clf_disease = RandomForestClassifier()
# Make the data
X_disease = heart_disease.drop("target", axis=1)
Y_disease = heart_disease["target"]


# Split the data (train_test_split returns X_train, X_test, Y_train, Y_test, in that order)
X_train_disease, X_test_disease, Y_train_disease, Y_test_disease = train_test_split(X_disease, Y_disease, test_size=0.2)

# Fit the model to the data, find patterns in the data
clf_disease.fit(X_train_disease, Y_train_disease)

# Evaluate how well the patterns the model has learnt carry over to the test data
clf_disease.score(X_test_disease, Y_test_disease)


## 4.1 Make predictions using a machine learning model

Two ways to make predictions:
1. predict()
2. predict_proba()


### 4.1.1 Make prediction on a classification model

In [None]:
y_preds = clf.predict(X_test)

array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0,
       1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1])

In [164]:
np.array(Y_test)

array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1])

In [166]:
# Compare predictions to truth labels to evaluate the model
np.mean(y_preds == Y_test)

0.819672131147541

In [167]:
clf.score(X_test, Y_test)

0.819672131147541

In [169]:
# predict_proba() returns probability estimates for each classification label.
clf.predict_proba(X_test)


array([[0.85, 0.15],
       [0.91, 0.09],
       [0.03, 0.97],
       [0.89, 0.11],
       [0.24, 0.76],
       [0.02, 0.98],
       [0.64, 0.36],
       [0.5 , 0.5 ],
       [0.34, 0.66],
       [0.55, 0.45],
       [0.95, 0.05],
       [0.  , 1.  ],
       [0.01, 0.99],
       [0.04, 0.96],
       [0.1 , 0.9 ],
       [0.4 , 0.6 ],
       [0.47, 0.53],
       [0.77, 0.23],
       [0.27, 0.73],
       [0.55, 0.45],
       [0.1 , 0.9 ],
       [0.57, 0.43],
       [0.47, 0.53],
       [0.72, 0.28],
       [0.03, 0.97],
       [0.45, 0.55],
       [0.84, 0.16],
       [0.  , 1.  ],
       [0.08, 0.92],
       [0.82, 0.18],
       [0.91, 0.09],
       [0.68, 0.32],
       [0.15, 0.85],
       [0.04, 0.96],
       [0.34, 0.66],
       [0.9 , 0.1 ],
       [0.37, 0.63],
       [0.76, 0.24],
       [0.63, 0.37],
       [0.04, 0.96],
       [0.11, 0.89],
       [0.64, 0.36],
       [0.32, 0.68],
       [0.07, 0.93],
       [0.72, 0.28],
       [0.35, 0.65],
       [0.52, 0.48],
       [0.93,
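
The relationship between `predict()` and `predict_proba()` can be checked directly. This is a sketch on synthetic data (the names `X_p`, `y_p`, and `proba_clf` are hypothetical): each row of probabilities sums to 1, and the predicted label is the class with the highest probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_p, y_p = make_classification(n_samples=200, n_features=13, random_state=1)
proba_clf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_p, y_p)

proba = proba_clf.predict_proba(X_p[:5])  # shape (5, 2): one column per class
labels = proba_clf.predict(X_p[:5])

# Map the most probable column back to its class label
from_proba = proba_clf.classes_[np.argmax(proba, axis=1)]
print(proba)
print(labels, from_proba)
```

The probabilities are useful when a prediction close to 0.5 should be treated differently from a confident one.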

### 4.1.2 Make predictions on a regression model

In [171]:
# Use predict() for regression model
y_pred = model.predict(X_housing_test)
# compare y_pred and Y_housing_test
y_pred

array([0.4943   , 0.7642   , 4.9346864, ..., 4.8447587, 0.71681  ,
       1.65057  ])

In [173]:
np.array(Y_housing_test)

array([0.477  , 0.458  , 5.00001, ..., 5.00001, 0.723  , 1.515  ])

In [174]:
#compare predictions to the truth
from sklearn.metrics import mean_absolute_error

#average difference between Y_housing_test values and y_pred
mean_absolute_error(Y_housing_test, y_pred)

0.32670527078488387

# 5. Evaluate the model

Three ways to evaluate a Scikit-learn model/estimator:
1. Estimator built-in 'score()' method
2. The scoring parameter
3. Problem specific metric functions

For more information about model evaluation, see: https://scikit-learn.org/stable/modules/model_evaluation.html

### 5.1 Evaluating a classification model with the score method (mean accuracy of the predictions against the true labels)

In [None]:
# Evaluate the model on the training data; score() returns the mean accuracy on the given data and labels.
# The highest possible score is 1.0 and the lowest is 0.0.
# These results were obtained with n_estimators = 10
clf.score(X_train, Y_train)

1.0

In [None]:
# Evaluate the model on the test data
clf.score(X_test, Y_test)

0.819672131147541

In [None]:
#use other metrics to evaluate the model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

#metrics to compare test labels with prediction labels.
print(classification_report(Y_test, y_preds))

              precision    recall  f1-score   support

           0       0.88      0.72      0.79        29
           1       0.78      0.91      0.84        32

    accuracy                           0.82        61
   macro avg       0.83      0.82      0.82        61
weighted avg       0.83      0.82      0.82        61



In [None]:
#metrics to compare test labels with prediction labels.
confusion_matrix(Y_test,y_preds)

array([[21,  8],
       [ 3, 29]])
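
The numbers in the classification report can be recovered from the confusion matrix above. The sketch below does the arithmetic for class 1 with NumPy only, assuming scikit-learn's layout (rows are true labels, columns are predicted labels).

```python
import numpy as np

# Confusion matrix from the cell above: rows are true labels, columns are predictions
cm = np.array([[21, 8],
               [3, 29]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # 50 / 61
precision_1 = tp / (tp + fp)     # 29 / 37
recall_1 = tp / (tp + fn)        # 29 / 32
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
# prints: accuracy=0.82 precision=0.78 recall=0.91 f1=0.84
```

These match the row for class 1 and the accuracy line in the report printed earlier.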

In [None]:
#metrics to compare test labels with prediction labels.
accuracy_score(Y_test,y_preds)

0.819672131147541

### 5.1.2 Score for regression problem, the coefficient of determination, R squared.

In [178]:
# The score method for a regression problem
# returns the coefficient of determination of the prediction.
# Best possible score = 1.0; it can be negative if the model does worse than always predicting the mean

model.score(X_housing_train, Y_housing_train)


0.9736766533866273

In [179]:
model.score(X_housing_test, Y_housing_test)

0.8057322392488782

### 5.1.3 Evaluating a model using cross-validation and the different scoring parameter

In [187]:
from sklearn.model_selection import cross_val_score
cv_score = cross_val_score(clf, X, Y, cv=5)

# clf - RandomForestClassifier

# Compare the single train/test split score with the mean of the 5-fold cross-validation scores
clf.score(X_test, Y_test), np.mean(cv_score)

# The scoring parameter is None by default.
# If it is None, cross-validation uses the estimator's default score() metric;
# for a classification problem that is mean accuracy.
# Store the result under a new name so the cross_val_score function is not overwritten.
cv_score_default = cross_val_score(clf, X, Y, cv=5, scoring=None)

### 5.1.4 Classification model evaluation metrics
1. Accuracy
2. Area under ROC curve
3. Confusion matrix
4. Classification report

In [189]:
# cv_score_default was computed with scoring=None, clf - RandomForestClassifier
# Heart Disease Classifier cross-validated accuracy
np.mean(cv_score_default)

0.811639344262295
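
A short sketch of how the `scoring` parameter behaves, run on the bundled iris data rather than the heart-disease CSV (an assumption made so the example is self-contained). With a fixed `random_state`, `scoring=None` and `scoring="accuracy"` give the same fold scores for a classifier.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_cv, y_cv = load_iris(return_X_y=True)
cv_clf = RandomForestClassifier(n_estimators=50, random_state=42)

# scoring=None falls back to the estimator's own score(), which is accuracy for classifiers
scores_default = cross_val_score(cv_clf, X_cv, y_cv, cv=5, scoring=None)
scores_accuracy = cross_val_score(cv_clf, X_cv, y_cv, cv=5, scoring="accuracy")

print(scores_default, scores_default.mean())
```

Other strings such as `"precision"` or `"recall"` can be passed to `scoring` to cross-validate a different metric.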


### 5.1.5 Regression model evaluation metrics
1. R² - coefficient of determination
2. MAE - mean absolute error
3. MSE - mean squared error
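
MAE and MSE from the list above can be written out by hand and checked against `sklearn.metrics`. The targets and predictions below are hypothetical values chosen to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and predictions
y_true_reg = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_reg = np.array([1.5, 2.0, 2.0, 6.0])

errors = y_pred_reg - y_true_reg      # [0.5, 0.0, -1.0, 2.0]
mae_manual = np.mean(np.abs(errors))  # (0.5 + 0 + 1 + 2) / 4 = 0.875
mse_manual = np.mean(errors ** 2)     # (0.25 + 0 + 1 + 4) / 4 = 1.3125

print(mae_manual, mean_absolute_error(y_true_reg, y_pred_reg))
print(mse_manual, mean_squared_error(y_true_reg, y_pred_reg))
```

MSE squares each error, so the single large miss (2.0) dominates it far more than it dominates MAE.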

udemy 134 video

### 5.2 Evaluating ML model with scoring parameter
udemy 138 video

### 5.3 Evaluating ML model with Scikit-learn Functions
udemy 139 video 

# 6. Improve the model
udemy 140 video


In [None]:
# Try different hyperparameter values: n_estimators
best_score, best_n = 0, None
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, Y_train)
    score = clf.score(X_test, Y_test)
    print(f"Model accuracy on test set: {score*100:.2f} %")
    print("")
    # Keep the first setting that reaches the highest test accuracy
    if score > best_score:
        best_score, best_n = score, i
print(f"Best score: {best_score*100:.2f} % with {best_n} estimators")

Trying model with 10 estimators...
Model accuracy on test set: 81.97 %

Trying model with 20 estimators...
Model accuracy on test set: 83.61 %

Trying model with 30 estimators...
Model accuracy on test set: 81.97 %

Trying model with 40 estimators...
Model accuracy on test set: 81.97 %

Trying model with 50 estimators...
Model accuracy on test set: 83.61 %

Trying model with 60 estimators...
Model accuracy on test set: 83.61 %

Trying model with 70 estimators...
Model accuracy on test set: 83.61 %

Trying model with 80 estimators...
Model accuracy on test set: 81.97 %

Trying model with 90 estimators...
Model accuracy on test set: 81.97 %

Best score: 83.61 % with 20 estimators


# 7. Save and load a trained model

udemy 146 video

In [None]:
import pickle

# Save the model to a file; "wb" means write binary
with open("random_forest_model1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [None]:
# Load a model from a file; "rb" means read binary
with open("random_forest_model1.pkl", "rb") as f:
    loaded_model = pickle.load(f)

# Check model accuracy on test data
loaded_model.score(X_test, Y_test)

0.819672131147541
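
The same round trip can be checked without touching the disk. This sketch assumes the bundled iris data and a small forest (`original` and `restored` are hypothetical names); `pickle.dumps` and `pickle.loads` serialise to and from bytes in memory.

```python
import pickle

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_pk, y_pk = load_iris(return_X_y=True)
original = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_pk, y_pk)

# dumps/loads serialise to bytes in memory instead of a file on disk
restored = pickle.loads(pickle.dumps(original))

same_predictions = np.array_equal(original.predict(X_pk), restored.predict(X_pk))
print(same_predictions)  # prints: True
```

A pickled model should be loaded with the same scikit-learn version that saved it, and pickle files from untrusted sources should not be loaded at all.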

# 8. Put everything together
udemy 148 video