# ASL ML Workshop - Review of Day 3
- Introduction to scikit-learn https://scikit-learn.org/
- Prep data with pandas
- Building a logistic regression model using scikit-learn
- Predicting with the logistic regression model
- Scoring the model
- Model evaluation (evaluation metrics)
- Train - test split
- Build the model with the split dataset
- ROC foundation (specificity & sensitivity)
- Adjusting the threshold for logistic regression
- The ROC curve
- Area under curve
- Comparison of two models

# What is Jupyter Notebook?
Jupyter Notebook is an interactive computing environment that allows users to create and share documents that contain live code, equations, visualizations, and narrative text. It is an open-source web application that supports over 100 programming languages, including Python, R, and Julia.

# Scikit-learn

Scikit-learn is a popular Python library for machine learning that provides a range of tools for various tasks such as classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. It is built on top of NumPy, SciPy, and matplotlib, which are also popular scientific computing libraries in Python.

In [None]:
# pip install scikit-learn

# Prep data with pandas

Prepare the training data with pandas

In [None]:
import pandas as pd

df = pd.read_csv('../titanic.csv')

In [None]:

df # our dataset

In [None]:
# create a new feature column

df['Male'] = df['Sex'] == 'male'

In [None]:
df

In [None]:
# create numpy array for training

X = df[['Pclass','Male','Age','Siblings/Spouses','Parents/Children','Fare']].values

y = df['Survived'].values

print(X.shape)

# Building logistic regression model

Logistic regression is a popular statistical model used for binary classification problems, where the goal is to predict a binary outcome (e.g., yes/no, 0/1) based on a set of input features. It is a type of linear model that uses a logistic function to model the probability of the binary outcome.
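
The logistic function mentioned above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not scikit-learn's implementation; the coefficients and intercept below are hypothetical values chosen for the example, not fitted ones.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, coefficients, intercept):
    """Probability of the positive class for a single datapoint."""
    # Linear part: intercept plus the weighted sum of the features.
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return sigmoid(z)

# Hypothetical coefficients for two features (assumed, not fitted values).
example_coefficients = [-1.0, 0.5]
example_intercept = 0.25

p = predict_probability([1.0, 2.0], example_coefficients, example_intercept)
print(p)           # a probability between 0 and 1
print(p > 0.5)     # class prediction with the default 0.5 threshold
```

A linear score of 0 maps to a probability of exactly 0.5, which is why 0.5 is the natural default threshold.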

In [None]:
from sklearn.linear_model import LogisticRegression as LgR

model = LgR() # create an instance of LogisticRegression 

In [None]:
model

# Training the model with our dataset
Train the model on the Titanic dataset

In [None]:
model.fit(X,y) # passing the feature array X and the target y

# fit() has now estimated the parameters (coefficients and intercept) that best fit our classification dataset

In [None]:
print(model.coef_,model.intercept_)

# Predicting with logistic regression model
Making predictions with the trained model

In [None]:
# prediction for a hypothetical passenger:
# 3rd class, male, age 22, 0 siblings/spouses, 1 parent/child, fare 7.25
print(model.predict([[3,True,22,0,1,7.25]]))

In [None]:
# predictions for the first 5 entries in the dataset
print(model.predict(X[:5]))

In [None]:
# actual survival data
# for the first 5 entries in the dataset
print(y[:5])

# Scoring the model
Scoring the trained model

In [None]:
y_pred = model.predict(X)

In [None]:
print(y_pred)

In [None]:
(y == y_pred).sum() # number of correct predictions

In [None]:
# score (accuracy): fraction of correct predictions
print((y == y_pred).sum()/y.shape[0])

In [None]:
print(model.score(X,y))
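
As a small sketch of what the score means, accuracy can be computed by hand on a toy example. The labels below are made up for illustration and are not taken from the Titanic data.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Toy labels (assumed for illustration only).
toy_actual = [1, 0, 1, 1, 0]
toy_predicted = [1, 0, 0, 1, 1]

print(accuracy(toy_actual, toy_predicted))  # 3 of the 5 predictions match
```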

# Model evaluation
Evaluation metrics
- Accuracy
- Precision
- Recall
- F1 score

In [None]:
# necessary imports
from sklearn.metrics import accuracy_score as acc
from sklearn.metrics import precision_score as prec
from sklearn.metrics import recall_score as rec
from sklearn.metrics import f1_score as f1
from sklearn.metrics import confusion_matrix as cfm

In [None]:
# accuracy score
print("Accuracy :",acc(y,y_pred))

# Confusion Matrix

A 2 x 2 matrix
- 0th row: actually positive
- 1st row: actually negative
- 0th column: predicted positive
- 1st column: predicted negative
- (0,0) => True positives, TP
- (0,1) => False negatives, FN
- (1,0) => False positives, FP
- (1,1) => True negatives, TN
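
The four cells can be counted by hand, which may help when reading the output of `confusion_matrix`. The sketch below uses made-up toy labels and hypothetical helper names. It also prints the counts in the order scikit-learn uses when the labels are 0 and 1, where the negative class comes first.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN), treating 1 as the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1   # positive, predicted positive
        elif a == 0 and p == 1:
            fp += 1   # negative, predicted positive
        elif a == 1 and p == 0:
            fn += 1   # positive, predicted negative
        else:
            tn += 1   # negative, predicted negative
    return tp, fp, fn, tn

# Toy labels (assumed for illustration only).
toy_actual = [1, 1, 1, 0, 0, 0, 0, 1]
toy_predicted = [1, 0, 0, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(toy_actual, toy_predicted)

# Layout with the positive class first: [[TP, FN], [FP, TN]]
print([[tp, fn], [fp, tn]])
# Layout with label 0 first, as scikit-learn prints it: [[TN, FP], [FN, TP]]
sklearn_layout = [[tn, fp], [fn, tp]]
print(sklearn_layout)
```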


In [None]:
print("Confusion matrix :\n",cfm(y,y_pred))
# scikit-learn prints the matrix in the reverse order: [[TN, FP], [FN, TP]]
# this is because the labels are sorted, so 0 (negative) comes before 1 (positive)

In [None]:
# precision = TP / (TP + FP)
print("Precision :",prec(y,y_pred))

In [None]:
# recall = TP / (TP + FN)
print("Recall :",rec(y,y_pred))

In [None]:
# f1 score = 2 * precision * recall / (precision + recall)
print("F1 score :",f1(y,y_pred))
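
The three formulas in the comments above can be checked by hand from raw counts. This is a minimal sketch with made-up counts and hypothetical function names, and it assumes the denominators are non-zero.

```python
def precision_from_counts(tp, fp):
    """Of everything predicted positive, how much was truly positive."""
    return tp / (tp + fp)

def recall_from_counts(tp, fn):
    """Of everything truly positive, how much was found."""
    return tp / (tp + fn)

def f1_from_scores(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Toy counts (assumed for illustration only).
tp, fp, fn = 30, 10, 20

precision = precision_from_counts(tp, fp)   # 30 / 40
recall = recall_from_counts(tp, fn)         # 30 / 50
f1_value = f1_from_scores(precision, recall)

print(precision, recall, f1_value)
```

Because F1 is a harmonic mean, it sits between precision and recall and is pulled toward the lower of the two.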

# Train - test split
Use separate data for training and testing

In [None]:
# necessary imports
from sklearn.model_selection import train_test_split as split

In [None]:
# split the dataset
X_train,X_test,y_train,y_test = split(X,y,train_size=0.8)

In [None]:
print("Full dataset :",X.shape,y.shape)

In [None]:
print("Training dataset: ",X_train.shape,y_train.shape)

In [None]:
print("Test dataset: ",X_test.shape,y_test.shape)
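
Conceptually, a train-test split shuffles the row indices and cuts them at the requested fraction. The sketch below illustrates that idea in plain Python. The helper name and the fixed seed are assumptions for the example, and scikit-learn's own implementation may round the sizes differently.

```python
import random

def split_indices(n_samples, train_fraction, seed):
    """Shuffle row indices and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # seeded so the split is repeatable
    cut = int(n_samples * train_fraction)
    return indices[:cut], indices[cut:]

train_idx, test_idx = split_indices(10, 0.8, seed=0)

print("Train rows:", train_idx)
print("Test rows :", test_idx)
```

Every row lands in exactly one of the two parts, so the model is never scored on rows it was trained on.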

# Rebuild the model with the split dataset
Evaluating on data the model has not seen during training gives a more realistic estimate of its performance

In [None]:
model_new = LgR()

In [None]:
model_new.fit(X_train,y_train)
y_pred = model_new.predict(X_test)

In [None]:
# accuracy score
print("Accuracy :",acc(y_test,y_pred))

In [None]:
# precision score
print("Precision :",prec(y_test,y_pred))

In [None]:
# recall score
print("Recall :",rec(y_test,y_pred))

In [None]:
# f1 score
print("F1 score :",f1(y_test,y_pred))

# Specificity & Sensitivity
Foundation of the ROC (Receiver Operating Characteristic) curve

In [None]:
# necessary imports
from sklearn.metrics import precision_recall_fscore_support as prfs

In [None]:
sensitivity = rec # sensitivity is another term for recall

In [None]:
# specificity = TN / (TN + FP)
# that is also the recall of the negative class

def specificity(y,y_pred):
    p,r,f,s = prfs(y,y_pred)
    return r[0] # r[0] is the recall of the negative class here

In [None]:
print(sensitivity(y_test,y_pred))

In [None]:
print(specificity(y_test,y_pred))
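
Both quantities can also be written directly in terms of confusion-matrix counts. The sketch below uses made-up counts and hypothetical function names. Sensitivity looks only at the actual positives, and specificity looks only at the actual negatives.

```python
def sensitivity_from_counts(tp, fn):
    """True positive rate: share of actual positives that were found."""
    return tp / (tp + fn)

def specificity_from_counts(tn, fp):
    """True negative rate: share of actual negatives that were found."""
    return tn / (tn + fp)

# Toy counts (assumed for illustration only).
tp, fn = 30, 20   # 50 actual positives
tn, fp = 45, 5    # 50 actual negatives

toy_sensitivity = sensitivity_from_counts(tp, fn)
toy_specificity = specificity_from_counts(tn, fp)

print(toy_sensitivity, toy_specificity)
print(1 - toy_specificity)  # false positive rate, the x-axis of a ROC curve
```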

# Adjusting the threshold
Tweaking the default threshold can be beneficial when you want to boost precision or recall

model_new.predict_proba(X_test)

The result is a NumPy array with 2 values for each datapoint: the first value is the probability that the datapoint is in the 0 class, and the second value is the probability that it is in the 1 class.

In [None]:
print(model_new.predict_proba(X_test))

In [None]:
threshold = 0.75
# set the new threshold
# default is 0.5

In [None]:
y_pred = model_new.predict_proba(X_test)[:,1] > threshold

In [None]:
print((y_test == y_pred).sum())

In [None]:
print("Precision :",prec(y_test,y_pred))
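
The effect of moving the threshold can be seen on a tiny hand-made example. The probabilities and labels below are assumed for illustration, and the helper name is hypothetical. The sketch also assumes at least one datapoint is predicted positive, so that precision is defined.

```python
def precision_recall_at(actual, probabilities, threshold):
    """Precision and recall when predicting 1 for probability > threshold."""
    predicted = [1 if p > threshold else 0 for p in probabilities]
    tp = sum(1 for a, y in zip(actual, predicted) if a == 1 and y == 1)
    fp = sum(1 for a, y in zip(actual, predicted) if a == 0 and y == 1)
    fn = sum(1 for a, y in zip(actual, predicted) if a == 1 and y == 0)
    return tp / (tp + fp), tp / (tp + fn)

# Toy positive-class probabilities and true labels (assumed values).
toy_probabilities = [0.9, 0.8, 0.6, 0.55, 0.4, 0.2]
toy_actual = [1, 1, 0, 1, 0, 0]

default_result = precision_recall_at(toy_actual, toy_probabilities, 0.5)
strict_result = precision_recall_at(toy_actual, toy_probabilities, 0.75)

print("threshold 0.50:", default_result)
print("threshold 0.75:", strict_result)
```

In this toy case, raising the threshold removes the one false positive, so precision goes up, but one true positive is lost as well, so recall goes down.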

# ROC curve
Graphical evaluation of models

In [None]:
# imports
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve as roc
from sklearn.metrics import roc_auc_score as auc

In [None]:
# obtain the false positive rate (1 - specificity) and
# the true positive rate (sensitivity) for every possible threshold
fpr,tpr,thresholds = roc(y_test,model_new.predict_proba(X_test)[:,1])

In [None]:
plt.ylabel('sensitivity')
plt.xlabel('1 - specificity')
plt.plot(fpr,tpr)
plt.plot([0,1],[0,1],linestyle = '--')
plt.show()
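
The points on a ROC curve can be produced by sweeping the threshold over every distinct predicted probability. The following is a simplified sketch of that idea, not scikit-learn's implementation. The toy probabilities and labels are assumed values, and the helper name is hypothetical.

```python
def roc_points(actual, probabilities):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(probabilities), reverse=True):
        tp = sum(1 for a, p in zip(actual, probabilities) if p >= threshold and a == 1)
        fp = sum(1 for a, p in zip(actual, probabilities) if p >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Toy positive-class probabilities and true labels (assumed values).
toy_probabilities = [0.9, 0.8, 0.6, 0.55, 0.4, 0.2]
toy_actual = [1, 1, 0, 1, 0, 0]

points = roc_points(toy_actual, toy_probabilities)
for fpr_value, tpr_value in points:
    print(round(fpr_value, 3), round(tpr_value, 3))
```

The curve always starts at (0, 0) and ends at (1, 1). A better model climbs toward the top-left corner faster.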

# Area under curve
The area under the ROC curve (AUC) summarizes the curve as a single number. The maximum is 1, and a model that guesses at random scores about 0.5.

In [None]:
y_pred_proba = model_new.predict_proba(X_test)

In [None]:
# auc score
print(auc(y_test,y_pred_proba[:,1]))
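
One way to picture the AUC is as the sum of trapezoids under the ROC points. The sketch below applies the trapezoidal rule to a hand-written list of (false positive rate, true positive rate) points. The points are assumed toy values, and the function name is hypothetical.

```python
def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy ROC points (assumed values), ordered from (0, 0) to (1, 1).
toy_points = [(0.0, 0.0), (0.0, 1 / 3), (0.0, 2 / 3), (1 / 3, 2 / 3),
              (1 / 3, 1.0), (2 / 3, 1.0), (1.0, 1.0)]
diagonal = [(0.0, 0.0), (1.0, 1.0)]   # the dashed "random guess" line

toy_auc = area_under_curve(toy_points)
diagonal_auc = area_under_curve(diagonal)

print(toy_auc)
print(diagonal_auc)
```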

Comparison of two different models

In [None]:
model2 = LgR()
# creating a new model with fewer features (only the first two columns: Pclass and Male)
model2.fit(X_train[:,0:2],y_train)
y_pred_proba2 = model2.predict_proba(X_test[:,0:2])

In [None]:
print(auc(y_test,y_pred_proba2[:,1]))

In [None]:
fpr2,tpr2,thresholds2 = roc(y_test,y_pred_proba2[:,1])
plt.ylabel('sensitivity')
plt.xlabel('1 - specificity')
plt.plot(fpr,tpr,color = 'green')
plt.plot(fpr2,tpr2,color = 'red')
plt.plot([0,1],[0,1],linestyle = '--')
plt.show()

# K-Fold cross validation
Instead of doing a single train/test split, we split the data into a training set and a test set multiple times, so that every datapoint is used for testing exactly once.

In [None]:
# necessary imports
from sklearn.model_selection import KFold as kf

In [None]:
# we're doing 3 splits
kfold = kf(n_splits = 3,shuffle = True)

In [None]:
# using a smaller dataset
X_small = df[['Age','Fare']].values[:6]
y_small = df['Survived'].values[:6]

In [None]:
for train,test in kfold.split(X_small):
    print(train,test)
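
The index pairs printed above can be reproduced conceptually with a few lines of plain Python. This sketch uses a hypothetical helper and does not shuffle, unlike the `shuffle = True` setting used in the notebook, so the folds come out in order.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for each fold, without shuffling."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so the sizes differ by at most 1.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(kfold_indices(6, 3))
for train, test in folds:
    print(train, test)
```

Each datapoint appears in a test set exactly once and in the training set for the remaining folds.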

In [None]:
# now take the full dataset
splits = list(kfold.split(X))

In [None]:
first_split = splits[0]

In [None]:
print("Test datapoints :",first_split[1].shape)

In [None]:
train_indices,test_indices = first_split
print("Test set indices :\n",test_indices)

Let's do 3-fold cross-validation

In [None]:
scores = []
for train_index,test_index in splits:
    X_train = X[train_index]
    X_test = X[test_index]
    y_train = y[train_index]
    y_test = y[test_index]
    model = LgR()
    model.fit(X_train,y_train)
    scores.append(model.score(X_test,y_test))

In [None]:
print(scores)

In [None]:
import numpy as np
print(np.mean(scores))

Model finalization: after cross-validation, train the final model on the full dataset

In [None]:
final_model = LgR()
final_model.fit(X,y)
print(final_model.score(X,y))