*Prepared by*

*Asif Newaz*

*Lecturer, EEE, IUT*

This notebook introduces the Scikit-learn library and shows how to use it to build an ML model.

# Import necessary libraries

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [26]:
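# explicit imports make it clear where each name comes from
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer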

Details of these modules can be found here:

https://scikit-learn.org/stable/modules/preprocessing.html

https://scikit-learn.org/stable/modules/impute.html

https://scikit-learn.org/stable/modules/model_evaluation.html

In [27]:
# importing classifiers

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

# Data

In [28]:
data= pd.read_csv('titanic_train.csv')
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


So, the data contains 891 samples and 12 attributes.

In [29]:
data.size
# total number of entries

10692

In [30]:
data.shape

(891, 12)
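The two outputs above are related: size is the number of rows times the number of columns (891 × 12 = 10692). A minimal sketch on a small made-up DataFrame (the name toy and its values are illustrative, not taken from the Titanic file):

```python
import pandas as pd

# Toy frame (hypothetical values) with 3 rows and 2 columns
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = toy.shape      # shape is a (rows, columns) tuple
print(toy.shape, toy.size)      # size counts every cell in the frame
```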

In [31]:
data.dtypes

Unnamed: 0,0
PassengerId,int64
Survived,int64
Pclass,int64
Name,object
Sex,object
Age,float64
SibSp,int64
Parch,int64
Ticket,object
Fare,float64
Cabin,object
Embarked,object


# Target Variable

In [32]:
y=data['Survived']
y

Unnamed: 0,Survived
0,0
1,1
2,1
3,1
4,0
...,...
886,0
887,1
888,0
889,1


In [33]:
y.value_counts()

Survived,count
0,549
1,342


So, 342 passengers (about 38%) survived and 549 (about 62%) died. The two classes are moderately imbalanced.
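To read the class balance as proportions rather than counts, value_counts accepts normalize=True. The sketch below rebuilds a stand-in target from the counts reported above (y_demo is a hypothetical name; the real y would work the same way):

```python
import pandas as pd

# Rebuild a stand-in target from the counts reported above (549 died, 342 survived)
y_demo = pd.Series([0] * 549 + [1] * 342, name="Survived")

# normalize=True returns class proportions instead of raw counts
proportions = y_demo.value_counts(normalize=True)
print(proportions.round(3))
```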

# Feature Variables

In [34]:
x= data.drop(['Survived'],axis=1)
x

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [35]:
col_names= list(data.columns)
col_names

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

# Remove features

You may need to remove some features from the data.

*   unnecessary features
*   features with too many missing entries

'PassengerId', 'Name', and 'Ticket' are unnecessary for building your ML model. They mostly act as identifiers, so you need to remove them.

In [36]:
# Check for missing entries
data.isnull().sum()

Unnamed: 0,0
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0
Cabin,687
Embarked,2


Out of 891 samples, the Cabin variable has 687 missing entries (about 77%). It's better to remove variables with this many missing values (>50%).
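The 50% rule of thumb can also be applied programmatically. This is a minimal sketch on a made-up frame (toy, its column names, and its values are assumptions for illustration); the same two lines would work on the real data:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 'cabin' is mostly missing, 'age' only partly
toy = pd.DataFrame({
    "age":   [22.0, np.nan, 26.0, 35.0],
    "cabin": [np.nan, "C85", np.nan, np.nan],
    "fare":  [7.25, 71.28, 7.93, 53.10],
})

# Fraction of missing entries per column (mean of a boolean mask)
missing_frac = toy.isnull().mean()

# Columns whose missing fraction exceeds the 50% threshold
too_sparse = missing_frac[missing_frac > 0.5].index.tolist()
print(missing_frac)
print(too_sparse)
```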

In [37]:
drop_col= ['PassengerId', 'Name', 'Ticket', 'Cabin']

In [38]:
x= x.drop(drop_col,axis=1)
x

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,22.0,1,0,7.2500,S
1,1,female,38.0,1,0,71.2833,C
2,3,female,26.0,0,0,7.9250,S
3,1,female,35.0,1,0,53.1000,S
4,3,male,35.0,0,0,8.0500,S
...,...,...,...,...,...,...,...
886,2,male,27.0,0,0,13.0000,S
887,1,female,19.0,0,0,30.0000,S
888,3,female,,1,2,23.4500,S
889,1,male,26.0,0,0,30.0000,C


Now we have 7 feature variables. However, some of them ('Sex' and 'Embarked') are not numeric, so they need to be encoded as numbers.

# Preprocessing

In [39]:
x['Sex'].value_counts()

Sex,count
male,577
female,314


In [40]:
x['Embarked'].value_counts()

Embarked,count
S,644
C,168
Q,77


In [41]:
x['Sex'] = x['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
x.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,0,22.0,1,0,7.25,S
1,1,1,38.0,1,0,71.2833,C
2,3,1,26.0,0,0,7.925,S
3,1,1,35.0,1,0,53.1,S
4,3,0,35.0,0,0,8.05,S


In [42]:
x['Embarked'] = x['Embarked'].map( {'S': 1, 'C': 0, 'Q':2} )
x.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,0,22.0,1,0,7.25,1.0
1,1,1,38.0,1,0,71.2833,0.0
2,3,1,26.0,0,0,7.925,1.0
3,1,1,35.0,1,0,53.1,1.0
4,3,0,35.0,0,0,8.05,1.0
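Note that 'Embarked' now shows 1.0 and 0.0 rather than integers. The two missing entries stay NaN after the mapping, which forces a float column. The sketch below reproduces this on a made-up column, and also shows one-hot encoding with pd.get_dummies as a possible alternative when the integer codes should not imply an order (this alternative is an assumption, not what the notebook uses):

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with one missing port
embarked = pd.Series(["S", "C", np.nan, "Q"])

# Same mapping as above: missing entries stay NaN, so the result is float
codes = embarked.map({"S": 1, "C": 0, "Q": 2})
print(codes.dtype)

# Alternative (not the notebook's choice): one indicator column per category
one_hot = pd.get_dummies(embarked, prefix="Embarked")
print(one_hot.columns.tolist())
```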


# Missing Entries

In [43]:
x.isnull().sum()

Unnamed: 0,0
Pclass,0
Sex,0
Age,177
SibSp,0
Parch,0
Fare,0
Embarked,2


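You need to impute the missing entries (if there are not too many). Be careful while imputing not to cause "data leakage", that is, letting information from the test data influence preprocessing or training. For this reason, split the data first and compute the imputation values from the training part only.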

# Train-Test Split

The train_test_split function comes from sklearn.model_selection:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [44]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=10)
# An 80:20 split ratio is a common choice.
# Fix random_state to reproduce the same split.
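With 891 samples and test_size=0.2, the split yields 712 training and 179 test rows. Because the classes are imbalanced, an optional extra is stratify, which keeps the class ratio the same in both parts. The notebook does not use it; the sketch below is only an illustration on stand-in data with the same size and class counts (X_demo and y_demo are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same size and class counts as above (features are random)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(891, 3))
y_demo = np.array([0] * 549 + [1] * 342)

# stratify=y_demo keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())   # share of class 1 in each part
```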

In [45]:
x_train

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
57,3,0,28.5,0,0,7.2292,0.0
717,2,1,27.0,0,0,10.5000,1.0
431,3,1,,1,0,16.1000,1.0
633,1,0,,0,0,0.0000,1.0
163,3,0,17.0,0,0,8.6625,1.0
...,...,...,...,...,...,...,...
369,1,1,24.0,0,0,69.3000,0.0
320,3,0,22.0,0,0,7.2500,1.0
527,1,0,,0,0,221.7792,1.0
125,3,0,12.0,1,0,11.2417,0.0


In [46]:
x_test

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
590,3,0,35.0,0,0,7.1250,1.0
131,3,0,20.0,0,0,7.0500,1.0
628,3,0,26.0,0,0,7.8958,1.0
195,1,1,58.0,0,0,146.5208,0.0
230,1,1,35.0,1,0,83.4750,1.0
...,...,...,...,...,...,...,...
456,1,0,65.0,0,0,26.5500,1.0
191,2,0,19.0,0,0,13.0000,1.0
603,3,0,44.0,0,0,8.0500,1.0
94,3,0,59.0,0,0,7.2500,1.0


In [47]:
column_names= x.columns.values
column_names

array(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'],
      dtype=object)

# Imputation

There are many ways to impute (replace) missing entries. A simple option is to replace them with a column statistic such as the mean. More advanced methods, such as the MICE algorithm, also exist.

In [48]:
# to impute, first create an imputer object
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

In [49]:
# fit the imputer object on the train data [not the test data]
# it will calculate the mean of each column
imp.fit(x_train)

In [50]:
# fill the missing entries using the imputer object, which now holds the column means
# (it is already fitted above, so transform is enough)
x_train_imp= imp.transform(x_train)
x_train_imp

array([[  3.        ,   0.        ,  28.5       , ...,   0.        ,
          7.2292    ,   0.        ],
       [  2.        ,   1.        ,  27.        , ...,   0.        ,
         10.5       ,   1.        ],
       [  3.        ,   1.        ,  29.66854419, ...,   0.        ,
         16.1       ,   1.        ],
       ...,
       [  1.        ,   0.        ,  29.66854419, ...,   0.        ,
        221.7792    ,   1.        ],
       [  3.        ,   0.        ,  12.        , ...,   0.        ,
         11.2417    ,   0.        ],
       [  2.        ,   0.        ,  36.        , ...,   0.        ,
         10.5       ,   1.        ]])

In [51]:
# using the same imputer object fitted on the train data, update the entries of the test data [to avoid data leakage]
x_test_imp= imp.transform(x_test)

In [None]:
x_test_imp

array([[ 3.        ,  0.        , 35.        , ...,  0.        ,
         7.125     ,  1.        ],
       [ 3.        ,  0.        , 20.        , ...,  0.        ,
         7.05      ,  1.        ],
       [ 3.        ,  0.        , 26.        , ...,  0.        ,
         7.8958    ,  1.        ],
       ...,
       [ 3.        ,  0.        , 44.        , ...,  0.        ,
         8.05      ,  1.        ],
       [ 3.        ,  0.        , 59.        , ...,  0.        ,
         7.25      ,  1.        ],
       [ 1.        ,  0.        , 29.66854419, ...,  0.        ,
        39.6       ,  0.        ]])
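The fit/transform separation can be checked on a tiny example. In this sketch (toy values, with imp_demo as a hypothetical name), the missing test entry is filled with the mean learned from the training data, not with anything computed from the test data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-column data (hypothetical values)
train = np.array([[1.0], [3.0], [np.nan]])
test = np.array([[np.nan], [10.0]])

imp_demo = SimpleImputer(missing_values=np.nan, strategy="mean")
imp_demo.fit(train)                   # learns the train mean: (1 + 3) / 2 = 2

print(imp_demo.statistics_)           # per-column fill values learned by fit
test_filled = imp_demo.transform(test)
print(test_filled)                    # the NaN in test is filled with the train mean
```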

The imputed values are returned as NumPy arrays, so the column names are lost. You can convert them back to pandas DataFrames.

In [52]:
x_train_2= pd.DataFrame(x_train_imp)
x_train_2

Unnamed: 0,0,1,2,3,4,5,6
0,3.0,0.0,28.500000,0.0,0.0,7.2292,0.0
1,2.0,1.0,27.000000,0.0,0.0,10.5000,1.0
2,3.0,1.0,29.668544,1.0,0.0,16.1000,1.0
3,1.0,0.0,29.668544,0.0,0.0,0.0000,1.0
4,3.0,0.0,17.000000,0.0,0.0,8.6625,1.0
...,...,...,...,...,...,...,...
707,1.0,1.0,24.000000,0.0,0.0,69.3000,0.0
708,3.0,0.0,22.000000,0.0,0.0,7.2500,1.0
709,1.0,0.0,29.668544,0.0,0.0,221.7792,1.0
710,3.0,0.0,12.000000,1.0,0.0,11.2417,0.0


In [53]:
x_train_2= pd.DataFrame(x_train_imp, columns=column_names)
x_train_2

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3.0,0.0,28.500000,0.0,0.0,7.2292,0.0
1,2.0,1.0,27.000000,0.0,0.0,10.5000,1.0
2,3.0,1.0,29.668544,1.0,0.0,16.1000,1.0
3,1.0,0.0,29.668544,0.0,0.0,0.0000,1.0
4,3.0,0.0,17.000000,0.0,0.0,8.6625,1.0
...,...,...,...,...,...,...,...
707,1.0,1.0,24.000000,0.0,0.0,69.3000,0.0
708,3.0,0.0,22.000000,0.0,0.0,7.2500,1.0
709,1.0,0.0,29.668544,0.0,0.0,221.7792,1.0
710,3.0,0.0,12.000000,1.0,0.0,11.2417,0.0


In [54]:
x_test_2= pd.DataFrame(x_test_imp, columns=column_names)
x_test_2

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3.0,0.0,35.000000,0.0,0.0,7.1250,1.0
1,3.0,0.0,20.000000,0.0,0.0,7.0500,1.0
2,3.0,0.0,26.000000,0.0,0.0,7.8958,1.0
3,1.0,1.0,58.000000,0.0,0.0,146.5208,0.0
4,1.0,1.0,35.000000,1.0,0.0,83.4750,1.0
...,...,...,...,...,...,...,...
174,1.0,0.0,65.000000,0.0,0.0,26.5500,1.0
175,2.0,0.0,19.000000,0.0,0.0,13.0000,1.0
176,3.0,0.0,44.000000,0.0,0.0,8.0500,1.0
177,3.0,0.0,59.000000,0.0,0.0,7.2500,1.0


# Normalization

Some ML algorithms, such as KNN, SVM, and (regularized) LR, are sensitive to the scale of the features. It's better to standardize the data before training.

In [55]:
sc= StandardScaler()
# define the scaler object

Other scalers include MinMaxScaler.

In [56]:
sc.fit(x_train_2)
# learns the mean and standard deviation of each feature from the train data
# (the z-score itself is applied later by transform)

In [57]:
# the scaler is already fitted above, so transform is enough
x_train_sc= sc.transform(x_train_2)
x_train_sc

array([[ 8.25233303e-01, -7.53844771e-01, -9.00475598e-02, ...,
        -4.77849481e-01, -4.96597757e-01, -1.79410807e+00],
       [-3.76333388e-01,  1.32653305e+00, -2.05636975e-01, ...,
        -4.77849481e-01, -4.33020011e-01,  1.93125518e-01],
       [ 8.25233303e-01,  1.32653305e+00,  2.73770730e-16, ...,
        -4.77849481e-01, -3.24167322e-01,  1.93125518e-01],
       ...,
       [-1.57790008e+00, -7.53844771e-01,  2.73770730e-16, ...,
        -4.77849481e-01,  3.67382090e+00,  1.93125518e-01],
       [ 8.25233303e-01, -7.53844771e-01, -1.36153112e+00, ...,
        -4.77849481e-01, -4.18602861e-01, -1.79410807e+00],
       [-3.76333388e-01, -7.53844771e-01,  4.87899515e-01, ...,
        -4.77849481e-01, -4.33020011e-01,  1.93125518e-01]])

In [58]:
x_test_sc= sc.transform(x_test_2)
x_test_sc

array([[ 8.25233303e-01, -7.53844771e-01,  4.10839905e-01, ...,
        -4.77849481e-01, -4.98623195e-01,  1.93125518e-01],
       [ 8.25233303e-01, -7.53844771e-01, -7.45054244e-01, ...,
        -4.77849481e-01, -5.00081043e-01,  1.93125518e-01],
       [ 8.25233303e-01, -7.53844771e-01, -2.82696585e-01, ...,
        -4.77849481e-01, -4.83640400e-01,  1.93125518e-01],
       ...,
       [ 8.25233303e-01, -7.53844771e-01,  1.10437639e+00, ...,
        -4.77849481e-01, -4.80643063e-01,  1.93125518e-01],
       [ 8.25233303e-01, -7.53844771e-01,  2.26027054e+00, ...,
        -4.77849481e-01, -4.96193447e-01,  1.93125518e-01],
       [-1.57790008e+00, -7.53844771e-01,  2.73770730e-16, ...,
        -4.77849481e-01,  1.32625214e-01, -1.79410807e+00]])
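What transform does can be verified by hand: each value is replaced by (value - train mean) / train standard deviation. A minimal sketch on toy values (sc_demo is a hypothetical name), showing that the test data is scaled with the training statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values)
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

sc_demo = StandardScaler().fit(train)       # stores mean_ and scale_ from train only

# Manual z-score with the train statistics (population std, ddof=0)
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(sc_demo.mean_, sc_demo.scale_)
print(sc_demo.transform(test), manual)
```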

In [59]:
x_train_3= pd.DataFrame(x_train_sc, columns=column_names)
x_train_3

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0.825233,-0.753845,-9.004756e-02,-0.458505,-0.477849,-0.496598,-1.794108
1,-0.376333,1.326533,-2.056370e-01,-0.458505,-0.477849,-0.433020,0.193126
2,0.825233,1.326533,2.737707e-16,0.412044,-0.477849,-0.324167,0.193126
3,-1.577900,-0.753845,2.737707e-16,-0.458505,-0.477849,-0.637119,0.193126
4,0.825233,-0.753845,-9.762331e-01,-0.458505,-0.477849,-0.468737,0.193126
...,...,...,...,...,...,...,...
707,-1.577900,1.326533,-4.368158e-01,-0.458505,-0.477849,0.709933,-1.794108
708,0.825233,-0.753845,-5.909350e-01,-0.458505,-0.477849,-0.496193,0.193126
709,-1.577900,-0.753845,2.737707e-16,-0.458505,-0.477849,3.673821,0.193126
710,0.825233,-0.753845,-1.361531e+00,0.412044,-0.477849,-0.418603,-1.794108


In [60]:
x_test_3= pd.DataFrame(x_test_sc, columns=column_names)
x_test_3

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0.825233,-0.753845,4.108399e-01,-0.458505,-0.477849,-0.498623,0.193126
1,0.825233,-0.753845,-7.450542e-01,-0.458505,-0.477849,-0.500081,0.193126
2,0.825233,-0.753845,-2.826966e-01,-0.458505,-0.477849,-0.483640,0.193126
3,-1.577900,1.326533,2.183211e+00,-0.458505,-0.477849,2.210950,-1.794108
4,-1.577900,1.326533,4.108399e-01,0.412044,-0.477849,0.985467,0.193126
...,...,...,...,...,...,...,...
174,-1.577900,-0.753845,2.722628e+00,-0.458505,-0.477849,-0.121040,0.193126
175,-0.376333,-0.753845,-8.221139e-01,-0.458505,-0.477849,-0.384425,0.193126
176,0.825233,-0.753845,1.104376e+00,-0.458505,-0.477849,-0.480643,0.193126
177,0.825233,-0.753845,2.260271e+00,-0.458505,-0.477849,-0.496193,0.193126


The data is now ready for training.

# Training & testing

In [61]:
# define classifier objects

knn= KNeighborsClassifier(n_neighbors=5,n_jobs=-1)
lr = LogisticRegression()
dt= DecisionTreeClassifier(random_state=10)
svm = SVC()

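Some algorithms (such as the decision tree) have a stochastic component, so the results can change with a different random state. Fixing random_state makes them reproducible.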

In [62]:
# fit the classifier object on the train data
knn.fit(x_train_3, y_train)

In [63]:
knn.score(x_train_3, y_train)
# The score function provides the accuracy

0.8609550561797753

In [64]:
knn.score(x_test_3, y_test)

0.8324022346368715

In [67]:
lr.fit(x_train_3, y_train)
print('training:', lr.score(x_train_3, y_train))
print('testing:', lr.score(x_test_3, y_test))

training: 0.797752808988764
testing: 0.8268156424581006


In [68]:
dt.fit(x_train_3, y_train)
print('training:', dt.score(x_train_3, y_train))
print('testing:', dt.score(x_test_3, y_test))

training: 0.9859550561797753
testing: 0.7597765363128491


In [70]:
svm.fit(x_train_3, y_train)
print('training:', svm.score(x_train_3, y_train))
print('testing:', svm.score(x_test_3, y_test))

training: 0.8328651685393258
testing: 0.8435754189944135
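The same fit-and-score steps can be collected in a loop so that training and testing accuracy sit side by side. The sketch below uses synthetic stand-in data from make_classification (not the Titanic set), and the gap column is an added convenience, not part of the notebook:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic set)
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

models = {"lr": LogisticRegression(), "dt": DecisionTreeClassifier(random_state=10)}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    rows.append({"model": name, "train": train_acc, "test": test_acc,
                 "gap": train_acc - test_acc})

summary = pd.DataFrame(rows).set_index("model")
print(summary)
```

A large positive gap between training and testing accuracy is the usual symptom to look for when answering the question below.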


**Identify Issues**

Q: Which models are overfitting, which are underfitting?

# Making predictions

In [71]:
x_test_3.iloc[1]

Unnamed: 0,1
Pclass,0.825233
Sex,-0.753845
Age,-0.745054
SibSp,-0.458505
Parch,-0.477849
Fare,-0.500081
Embarked,0.193126


In [72]:
x_test_2.iloc[1]

Unnamed: 0,1
Pclass,3.0
Sex,0.0
Age,20.0
SibSp,0.0
Parch,0.0
Fare,7.05
Embarked,1.0


In [73]:
y_test.iloc[1]

0

In [74]:
knn.predict([x_test_3.iloc[1]])




array([0])

In [81]:
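import warnings
# Passing a plain list to a model fitted on a DataFrame raises a feature-name warning;
# silence warnings to keep the outputs clean
warnings.filterwarnings('ignore')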

In [83]:
print(lr.predict_proba([x_test_3.iloc[1]]))
print (lr.predict([x_test_3.iloc[1]]))

[[0.84650607 0.15349393]]
[0]


In [84]:
dt.predict_proba([x_test_3.iloc[1]])


array([[1., 0.]])

In [85]:
svm.predict([x_test_3.iloc[1]])


array([0])
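An alternative to wrapping a row in a list is to select it with double brackets, which returns a one-row DataFrame that keeps the column names. This is a sketch on toy data (X_demo, y_demo, and knn_demo are hypothetical names), not the notebook's fitted models:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy training frame (hypothetical values)
X_demo = pd.DataFrame({"f1": [0.0, 0.1, 1.0, 1.1], "f2": [0.0, 0.2, 1.0, 0.9]})
y_demo = pd.Series([0, 0, 1, 1])

knn_demo = KNeighborsClassifier(n_neighbors=1).fit(X_demo, y_demo)

# iloc[[1]] (double brackets) returns a one-row DataFrame, so column names are kept
row = X_demo.iloc[[1]]
pred = knn_demo.predict(row)
print(row.shape)
print(pred)
```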

## Class Task

Train the algorithms with a different random state in the train-test split step.
Write the code in the code box below.

# Using Pipeline

You can define a pipeline to apply all the steps (imputation, scaling, and the classifier) sequentially. This makes the code more straightforward.

In [86]:
from sklearn.pipeline import Pipeline

In [87]:
pipe = Pipeline([('imp',SimpleImputer()),('scaler', StandardScaler()), ('lr', lr)])
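A pipeline is used like a single estimator. The sketch below, on made-up data (pipe_demo and the toy frames are hypothetical), shows that one fit call learns the imputation and scaling statistics from the training data and that predict reuses them on new data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values) with a missing entry in each part
X_tr = pd.DataFrame({"age": [20.0, np.nan, 40.0, 60.0], "fare": [5.0, 8.0, 50.0, 80.0]})
y_tr = pd.Series([0, 0, 1, 1])
X_te = pd.DataFrame({"age": [np.nan], "fare": [70.0]})

pipe_demo = Pipeline([
    ("imp", SimpleImputer()),          # default strategy is the column mean
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression()),
])

# fit() fits and applies each preprocessing step in order, then fits the classifier;
# predict() reuses the statistics learned from the training data
pipe_demo.fit(X_tr, y_tr)
pred = pipe_demo.predict(X_te)
train_means = pipe_demo.named_steps["imp"].statistics_
print(pred)
print(train_means)                     # train means used for imputation
```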

# Using Cross-validation

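To reduce the variance of the performance estimate, it is better to use cross-validation. It splits the data into 'k' folds, trains on k-1 folds and tests on the remaining one, and repeats this 'k' times so that every fold serves as the test set once. The 'k' scores are then averaged.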

There are several types of cross-validation.


*   K-fold cross-validation
*   Stratified K-fold cross-validation
*   Repeated Stratified K-fold cross-validation
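Each of these schemes has a splitter class in sklearn.model_selection that can be passed to cross_validate through the cv argument. This is a minimal sketch on synthetic stand-in data (not the Titanic set); the numbers of splits and repeats are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, StratifiedKFold,
                                     RepeatedStratifiedKFold, cross_validate)

# Synthetic stand-in data (not the Titanic set)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

splitters = {
    "kfold": KFold(n_splits=5, shuffle=True, random_state=10),
    "stratified": StratifiedKFold(n_splits=5, shuffle=True, random_state=10),
    "repeated": RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=10),
}

n_scores = {}
for name, cv in splitters.items():
    out = cross_validate(LogisticRegression(), X_demo, y_demo, cv=cv)
    n_scores[name] = len(out["test_score"])   # one score per train/test split
    print(name, n_scores[name], out["test_score"].mean())
```

The repeated variant returns n_splits × n_repeats scores, which gives a more stable average at the cost of more training runs.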



In [88]:
result= cross_validate(pipe, x,y, cv = 10)

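# Run CV on the original x, y (before the train-test split): the pipeline re-fits the
# imputer and scaler on the training part of each fold, so no information leaks from the test part.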

In [89]:
res= pd.DataFrame(result)
res

Unnamed: 0,fit_time,score_time,test_score
0,0.038666,0.014579,0.777778
1,0.028139,0.003374,0.764045
2,0.011082,0.003344,0.741573
3,0.011256,0.004064,0.831461
4,0.011068,0.003211,0.786517
5,0.01074,0.003078,0.764045
6,0.01105,0.009584,0.786517
7,0.020609,0.010308,0.775281
8,0.042536,0.004362,0.786517
9,0.02536,0.00308,0.820225


In [90]:
res.mean(axis=0)

Unnamed: 0,0
fit_time,0.02105
score_time,0.005898
test_score,0.783396
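Alongside the mean, the spread of the fold scores indicates how much the accuracy depends on the particular split. A small sketch using the ten test_score values copied from the table above (reporting the sample standard deviation is an added suggestion):

```python
import numpy as np

# Fold accuracies copied from the test_score column above
scores = np.array([0.777778, 0.764045, 0.741573, 0.831461, 0.786517,
                   0.764045, 0.786517, 0.775281, 0.786517, 0.820225])

print(f"mean accuracy: {scores.mean():.4f}")
print(f"std across folds: {scores.std(ddof=1):.4f}")   # sample standard deviation
```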


In [91]:
# Decision tree (DT) with the same preprocessing steps
pipe = Pipeline([('imp',SimpleImputer()),('scaler', StandardScaler()), ('dt', dt)])
result= cross_validate(pipe, x,y, cv = 10)
res= pd.DataFrame(result)
res.mean(axis=0)

Unnamed: 0,0
fit_time,0.007896
score_time,0.002392
test_score,0.793508
