
# Lab 2: Introduction to Scikit-Learn and Classification Tasks

During this Lab, we aim to achieve the following:


*   Become familiar with <a href="https://scikit-learn.org/stable/"> scikit-learn </a>, an essential Python library for data science;
*   Learn how to approach a classification task with scikit-learn.

In this notebook, we first learn to use scikit-learn through a practical example; in the second part, we test our knowledge with some exercises.



# Part 1: A Classification Example With Scikit-Learn

We start our lab by reimplementing the *homemade perceptron* using scikit-learn.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

In [None]:
X_toy, y_toy = datasets.make_blobs(n_samples=150,n_features=2,
                           centers=2,cluster_std=1.05,
                           random_state=2)
y_toy[y_toy==0]=-1

We can now define our classifier: a perceptron <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html"> [link] </a>.

In [None]:
from sklearn.linear_model import Perceptron

# we define two perceptrons: one with and one without intercept estimation (our theta_0)
clf1 = Perceptron(fit_intercept = False)
clf2 = Perceptron(fit_intercept = True)

Sklearn models share a standard interface, with methods like *fit* and *predict*.

In [None]:
#train phase
clf1.fit(X_toy, y_toy)
clf2.fit(X_toy, y_toy)

#estimation (y_hat)
y_pred_clf1 = clf1.predict(X_toy)
y_pred_clf2 = clf2.predict(X_toy)
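
To see what *fit* actually learned, we can read the parameters back from the model. The sketch below is only illustrative: it regenerates the same toy blobs so that it runs on its own, and the names `theta`, `theta_0` and `manual_pred` are ours, not part of scikit-learn. It checks that *predict* agrees with the sign of $\theta \cdot x + \theta_0$.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import Perceptron

# Same toy data as above, with labels mapped to {-1, +1}.
X, y = make_blobs(n_samples=150, n_features=2, centers=2,
                  cluster_std=1.05, random_state=2)
y[y == 0] = -1

clf = Perceptron(fit_intercept=True).fit(X, y)

theta = clf.coef_[0]         # learned weight vector, shape (2,)
theta_0 = clf.intercept_[0]  # learned intercept (bias term)

# Recompute the predictions by hand from the linear score.
scores = X @ theta + theta_0
manual_pred = np.where(scores > 0, 1, -1)

print(theta, theta_0)
print(np.array_equal(manual_pred, clf.predict(X)))
```

With `fit_intercept=False` the decision boundary is forced through the origin, which is a plausible reason for a gap in accuracy between the two perceptrons.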


How do we evaluate our models' performance? <br>
Scikit-learn offers a broad set of evaluation functions already implemented <a href = "https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics">[link]</a>.

In [None]:
from sklearn.metrics import accuracy_score

print(f"CLF1 -- no intercept.\tACC: {accuracy_score(y_toy, y_pred_clf1)}")
print(f"CLF2 -- with intercept.\tACC: {accuracy_score(y_toy, y_pred_clf2)}")

CLF1 -- no intercept.	ACC: 0.7733333333333333
CLF2 -- with intercept.	ACC: 1.0
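
Accuracy is simply the fraction of predictions that match the true labels. As a small sanity check, using made-up label vectors rather than the lab data, the value can be reproduced with plain NumPy:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels, for illustration only.
y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, -1, -1, 1, 1])

# Fraction of positions where prediction and ground truth agree (3 out of 5).
manual_acc = np.mean(y_true == y_pred)

print(manual_acc, accuracy_score(y_true, y_pred))
```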


## Model Selection
When defining or training a model, we must set the so-called *hyperparameters*, i.e., the settings that configure the model and its training strategy. <br>
The question is: *how can we choose the best configuration for the task?* <br>
The answer is to use separate *training* and *validation* partitions: we fit each candidate on the training set and compare the candidates on the validation set. <br>
We can use sklearn to create the split: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X_toy, y_toy, 
                                                  train_size = 0.8, random_state = 123)

# an f-string lets us print text and variables together
print(f"Original size = {X_toy.shape[0]}\tTrain size = {X_train.shape[0]}\tVal size = {X_val.shape[0]}")

Original size = 150	Train size = 120	Val size = 30
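
A plain random split does not guarantee that both partitions keep the same class proportions. *train_test_split* accepts an optional `stratify` argument for this purpose; the lab itself does not use it. The sketch below shows it on a made-up imbalanced label vector:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 80 of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both partitions.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, train_size=0.8, random_state=123, stratify=y)

print(f"Train size = {len(y_tr)}\tfraction of class 1 = {y_tr.mean()}")
print(f"Val size = {len(y_va)}\tfraction of class 1 = {y_va.mean()}")
```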


# Part 2: Exercises

### Ex 2.1 Logistic Regression

Use Scikit-Learn to train a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">Logistic Regression</a> classifier with default parameters on the previously defined toy dataset (see Part 1). <br>
Compute the accuracy on both the training and validation sets.

In [None]:
#
# Ex 2.1: complete here
#

In [None]:
#--------SOLUTIONS-------------
from sklearn.linear_model import LogisticRegression

clf_lr = LogisticRegression()
clf_lr.fit(X_train, y_train)

#estimation (y_hat)
y_train_pred_lr = clf_lr.predict(X_train)
y_val_pred_lr = clf_lr.predict(X_val)

print(f"Logistic Regression.\tTrain ACC: {accuracy_score(y_train, y_train_pred_lr)}")
print(f"Logistic Regression.\tVal ACC: {accuracy_score(y_val, y_val_pred_lr)}")

Logistic Regression.	Train ACC: 0.8714285714285714
Logistic Regression.	Val ACC: 0.75


### Ex 2.2 Logistic Regression (2)

We ask you again to work on a classification task. <br>
This time, the classification is more challenging.
The dataset is called *sonar*. 

In [None]:
import pandas as pd
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
ds = pd.read_csv(url, header = None)

# split into input and output elements
data = ds.values
X_sonar, y_sonar = data[:, :-1], data[:, -1]

**EX 2.2.1** Print the shapes of our arrays $X$ and $y$.

In [None]:
#--------SOLUTIONS-------------
print(X_sonar.shape, y_sonar.shape)

(208, 60) (208,)


It's time to partition our dataset. <br>
We ask you to create three partitions:


*   *train set*: a set of samples used to train a model.
*   *val set*: a set of samples used to decide the best model.
*   *test set*: a set of samples used to estimate the performance of our best model.

We first separate the samples that we can use during training (train and val) from the samples that we must not touch until the end (test). <br>
**EX 2.2.2** Create a split between train_val and test, keeping 25% of the samples in the test set.

In [None]:
# X_train_val, X_test, y_train_val, y_test = 

In [None]:
#--------SOLUTIONS-------------
X_train_val, X_test, y_train_val, y_test = train_test_split(X_sonar, y_sonar, 
                                                  train_size = 0.75, random_state = 123)

**EX 2.2.3** From the train_val variables, split train and validation sets. Keep 10% of the samples in the validation set.


In [None]:
# X_train, X_val, y_train, y_val = 

In [None]:
#--------SOLUTIONS-------------
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, 
                                                  train_size = 0.9, random_state = 123)

Sklearn uses a different name for the hyperparameter $\lambda$. Can you recognise it in the documentation from its description? What is the default value of the parameter? What is its relationship with $\lambda$?

Solution: the parameter is *C*, the inverse of the regularization strength, so $\lambda=\frac{1}{C}$. Its default value is $C=1.0$.
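
Because *C* is the inverse of $\lambda$, a small *C* means strong regularization and therefore smaller weights. The sketch below illustrates this on synthetic data generated with `make_classification` (an assumption made for the example, not the lab dataset) by comparing the norm of the learned weight vector for a few values of *C*:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, noisy data used only for this illustration.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_redundant=0, flip_y=0.1, random_state=0)

norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = np.linalg.norm(clf.coef_)  # size of the weight vector
    print(f"C = {C}\tlambda = {1 / C}\t||theta|| = {norms[C]:.3f}")
```

The weight norm should grow as *C* increases, i.e., as $\lambda$ decreases.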



**EX 2.2.4** Train and evaluate (using accuracy) a logistic regression with the default value for the hyperparameter. Do the evaluation **only** on the training and validation partitions.

In [None]:
#--------SOLUTION-------------
clf_lr = LogisticRegression()
clf_lr.fit(X_train, y_train)

#estimation (y_hat)
y_train_pred_lr = clf_lr.predict(X_train)
y_val_pred_lr = clf_lr.predict(X_val)

print(f"Logistic Regression.\tTrain ACC: {accuracy_score(y_train, y_train_pred_lr)}")
print(f"Logistic Regression.\tVal ACC: {accuracy_score(y_val, y_val_pred_lr)}")

Logistic Regression.	Train ACC: 0.8714285714285714
Logistic Regression.	Val ACC: 0.75


This time we do not reach 100% accuracy on either the training or the validation set. <br>
A good strategy is to apply a grid search, i.e., to try a set of candidate hyperparameter values and keep the one that performs best on the validation set. <br>
We now ask you to manually implement a grid search for *C*, a hyperparameter of the model. <br>
See the documentation <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html"> [link] </a>. <br>
**EX 2.2.5**  We ask you to find the best *C* among the following: $C = [0.001, 0.01, 0.1, 1., 10, 100]$.

In [None]:
C = [0.001, 0.01, 0.1, 1., 10, 100]

In [None]:
#--------SOLUTION-------------
C = [0.001, 0.01, 0.1, 1., 10, 100]
for c in C:
    clf_lr = LogisticRegression(C = c, max_iter = 200)
    clf_lr.fit(X_train, y_train)

    #estimation (y_hat)
    y_train_pred_lr = clf_lr.predict(X_train)
    y_val_pred_lr = clf_lr.predict(X_val)
    tr_acc = accuracy_score(y_train, y_train_pred_lr)
    val_acc= accuracy_score(y_val, y_val_pred_lr)

    print(f"LR. C= {c}.\tTrain ACC: {tr_acc}\tVal Acc: {val_acc}")

LR. C= 0.001.	Train ACC: 0.55	Val Acc: 0.3125
LR. C= 0.01.	Train ACC: 0.5857142857142857	Val Acc: 0.3125
LR. C= 0.1.	Train ACC: 0.75	Val Acc: 0.5
LR. C= 1.0.	Train ACC: 0.8714285714285714	Val Acc: 0.75
LR. C= 10.	Train ACC: 0.8785714285714286	Val Acc: 0.75
LR. C= 100.	Train ACC: 0.9357142857142857	Val Acc: 0.75


The default value ($C=1$) seems to work fine. <br>
We might want to refine the search over a narrower range: $C=[0.1, 0.5, 1, 5, 10, 15, 20]$

In [None]:
C=[0.1, 0.5, 1, 5, 10, 15, 20]

In [None]:
#--------SOLUTION-------------
for c in C:
    clf_lr = LogisticRegression(C = c)
    clf_lr.fit(X_train, y_train)

    #estimation (y_hat)
    y_train_pred_lr = clf_lr.predict(X_train)
    y_val_pred_lr = clf_lr.predict(X_val)
    tr_acc = accuracy_score(y_train, y_train_pred_lr)
    val_acc= accuracy_score(y_val, y_val_pred_lr)

    print(f"LR. C= {c}.\tTrain ACC: {tr_acc}\tVal Acc: {val_acc}")

LR. C= 0.1.	Train ACC: 0.75	Val Acc: 0.5
LR. C= 0.5.	Train ACC: 0.8428571428571429	Val Acc: 0.6875
LR. C= 1.	Train ACC: 0.8714285714285714	Val Acc: 0.75
LR. C= 5.	Train ACC: 0.8714285714285714	Val Acc: 0.75
LR. C= 10.	Train ACC: 0.8785714285714286	Val Acc: 0.75
LR. C= 15.	Train ACC: 0.8928571428571429	Val Acc: 0.75
LR. C= 20.	Train ACC: 0.9	Val Acc: 0.75


There is not much difference: the best validation accuracy is reached with $C = [1, 5, 10, 15, 20]$. <br>
Note that while the training accuracy varies, the validation accuracy is the same for all five values.

In the official documentation, we find several other hyperparameters we can tune. <br>
For example, you might want to see what happens when we do not fit the intercept. To do so, we should try both settings of this parameter (i.e., true and false), combined with every value of *C* considered so far. <br>
Since we use 7 possible values for $C$ and 2 for *fit_intercept*, the total number of trials is $7 \times 2 = 14$. <br>
In Python, this translates into a nested loop, i.e., a loop inside a loop:

    for c in [0.1, 0.5, 1, 5, 10, 15, 20]:
        for fi in [True, False]:
            # here we train and evaluate our model
            pass

If we want to find the best combination among three hyperparameters, we must add another inner loop. If we have 10 hyperparameters, we'll have 10 nested loops! <br>
For now, we can do it manually. Find the best values using the validation performance.
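
As a side note, the nesting can be avoided by enumerating all combinations up front. One possible way, sketched here under the assumption of a synthetic stand-in dataset rather than *sonar*, is scikit-learn's `ParameterGrid`, which yields one dictionary per combination so that a single flat loop is enough:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in data (an assumption, not the sonar dataset).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=0.8,
                                          random_state=123)

grid = ParameterGrid({"C": [0.1, 0.5, 1, 5, 10, 15, 20],
                      "fit_intercept": [True, False]})
print(len(grid))  # number of combinations: 7 values of C times 2 settings

best_params, best_val_acc = None, -1.0
for params in grid:  # one flat loop instead of nested loops
    clf = LogisticRegression(max_iter=1000, **params).fit(X_tr, y_tr)
    val_acc = accuracy_score(y_va, clf.predict(X_va))
    if val_acc > best_val_acc:
        best_params, best_val_acc = params, val_acc

print(best_params, best_val_acc)
```

Adding a third hyperparameter then only requires a new key in the dictionary, not another level of indentation.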

In [None]:
C=[0.1, 0.5, 1, 5, 10, 15, 20]
FI = [True, False]

In [None]:
#--------SOLUTION-------------
best_C = None
best_fi = None
best_train_acc = 0.
best_val_acc = 0.

for c in C:
    for fi in FI:
        clf_lr = LogisticRegression(C = c, fit_intercept= fi)
        clf_lr.fit(X_train, y_train)

        #estimation (y_hat)
        y_train_pred_lr = clf_lr.predict(X_train)
        y_val_pred_lr = clf_lr.predict(X_val)

        tr_acc = accuracy_score(y_train, y_train_pred_lr)
        val_acc= accuracy_score(y_val, y_val_pred_lr)

        if val_acc > best_val_acc:
            best_C = c
            best_fi = fi
            best_train_acc = tr_acc
            best_val_acc = val_acc

print(f"Found the best model with C:{best_C}\tFit Intercept:{best_fi}")
print(f"Best training acc:{best_train_acc}\tBest val acc:{best_val_acc}")

Found the best model with C:1	Fit Intercept:True
Best training acc:0.8714285714285714	Best val acc:0.75


We don't get much improvement, since we obtain a score similar to that of the "default" Logistic Regression. Don't worry about it: it is just a toy-ish dataset. In the coming weeks we are going to see more realistic tasks where proper hyperparameter selection can make a difference.

### Ex 2.3 Grid Search Cross-Validation

In the previous exercise, we implemented a grid search manually. <br>
The more hyper-parameters we tune, the harder it is to implement. <br>
Scikit-learn eases our pain by offering a grid-search cross-validation class that does everything for us! <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html"> [link]</a>.<br> 
We now look together at an example of a grid-search implementation.

In [None]:
from sklearn.model_selection import GridSearchCV # import the grid-search class

The first thing to do is to define a dictionary containing the hyper-parameters with the range of values we want to search on. 

In [None]:
param_grid_test = {
    'C': [0.1, 0.5, 1, 5, 10, 15, 20],
    'fit_intercept': [True, False]
}

Then, we create the grid-search object.

In [None]:
#target classifier
lr = LogisticRegression()

#grid-search object
clf = GridSearchCV(estimator= lr, param_grid=param_grid_test, 
                   cv = 5, scoring = "accuracy")

Finally, we can find the best model. <br>
Note that, once the search is over, the tool refits the best model found on the entire dataset passed to *fit* (i.e., train and validation folds together). <br> 
This is the default behaviour of the grid-search CV (parameter *refit*, see the documentation).


In [None]:
#fit the model
clf.fit(X_toy, y_toy) #we do not use the train - validation split strategy since it is included in the CV procedure

#see the best parameters and performance
print(clf.best_params_)
print(clf.best_score_)


{'C': 0.1, 'fit_intercept': True}
1.0
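
Beyond the single best configuration, the fitted search object also keeps the cross-validated score of every combination it tried, in its `cv_results_` attribute. The sketch below assumes a reduced grid and regenerates the toy blobs so that it runs on its own; `search` is just an illustrative name.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Same toy blobs as in Part 1.
X, y = make_blobs(n_samples=150, n_features=2, centers=2,
                  cluster_std=1.05, random_state=2)

search = GridSearchCV(estimator=LogisticRegression(),
                      param_grid={"C": [0.1, 1, 10],
                                  "fit_intercept": [True, False]},
                      cv=5, scoring="accuracy")
search.fit(X, y)

# One entry per combination: its parameters and mean validation accuracy.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, score)

# With refit=True (the default), the best configuration is retrained on all
# the data passed to fit and stored in best_estimator_.
print(search.best_estimator_)
```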


Now let's implement a grid-search CV with 10 folds for the *sonar* dataset.

In [None]:
#--------SOLUTION-------------
#define the parameters grid
param_grid = {
    'C': [0.1, 0.5, 1, 5, 10, 15, 20],
    'fit_intercept': [True, False]
}

#target classifier
lr = LogisticRegression(max_iter= 200)

#grid-search object
clf = GridSearchCV(estimator= lr, param_grid=param_grid, cv = 10, scoring = "accuracy")

#fit the model
clf.fit(X_train_val, y_train_val) #we do not use the train - validation split strategy since it is included in the CV procedure

#see the best parameters and performance
print(clf.best_params_)
print(clf.best_score_)

{'C': 15, 'fit_intercept': False}
0.7883333333333333


The grid-search CV returns a different Logistic Regression model, with $C=15$ and *fit_intercept* set to false.  <br>

It's time to evaluate this best model on the test set. Use accuracy as the evaluation metric.

In [None]:
#--------SOLUTION-------------
y_test_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_test_pred))

0.7884615384615384
