# 2. Model Selection

In this notebook, we apply several ways of splitting a dataset and use them to compare models and select the best one.

__`Step 1`__ Import the needed libraries

In [3]:
import pandas as pd
import numpy as np

In [4]:
import warnings
warnings.filterwarnings('ignore')

__`Step 2`__ Read the dataset __tugas.xlsx__

In [5]:
tugas = pd.read_excel(r'./Datasets/tugas.xlsx')
tugas

Unnamed: 0,Custid,Year_Birth,Gender,Education,Marital_Status,Dependents,Income,Dt_Customer,Rcn,Frq,...,Kitchen,SmallAppliances,HouseKeeping,Toys,NetPurchase,CatPurchase,Recomendation,CostPerContact,RevenuePerPositiveAnswer,DepVar
0,1003,1991,M,Graduation,,1,29761.20,2014-05-27,69,11,...,19,24,1,24,59,41,3,2,15,0
1,1004,1956,M,Master,Married,1,98249.55,2013-07-21,10,26,...,10,19,6,5,35,65,5,2,15,0
2,1006,1983,F,PhD,Together,1,23505.30,2013-10-30,65,14,...,2,48,2,1,67,33,4,2,15,0
3,1007,1970,F,Graduation,Single,1,72959.25,2012-12-06,73,18,...,7,13,1,8,46,54,4,2,15,0
4,1009,1941,F,Graduation,Married,0,114973.95,2013-10-30,75,30,...,9,35,9,9,17,83,5,2,15,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2495,10989,1996,F,Basic,Single,1,29551.20,2013-03-20,41,10,...,40,24,22,2,59,41,3,2,15,0
2496,10991,1940,F,Graduation,Married,0,132566.70,2013-03-26,36,46,...,4,47,9,8,22,78,6,2,15,0
2497,10993,1955,F,Graduation,Together,0,91768.95,2013-08-04,1,25,...,8,27,8,1,47,53,4,2,15,0
2498,10994,1961,F,Basic,Married,1,99085.35,2012-09-23,1,28,...,5,21,3,4,55,45,5,2,15,0


__`Step 3`__ Create an object named __data__ that contains the independent variables and another object named __target__ that contains the dependent variable / target (the last column in the dataset).

In [6]:
# DO IT
data = tugas.drop(columns="DepVar")
target = tugas["DepVar"]

## 1.1. The train-test split

In this approach, we randomly split the complete data into a training set and a test set. We train the model on the training set and use the test set for validation, typically with a 70:30 or 80:20 split. With limited data, there is a risk of high bias, because the model misses the information contained in the rows that were not used for training. If the dataset is large and the training and test samples follow the same distribution, this approach is acceptable.

In this exercise, we are going to split our dataset into train, validation and test sets. <br> <br>
sklearn provides a function named `train_test_split` that splits a dataset into two different datasets.

__`Step 4`__ Import the library `train_test_split` from `sklearn.model_selection`

In [7]:
# DO IT
from sklearn.model_selection import train_test_split

__`Step 5`__ Divide the `data` into `X_train_val` and `X_test`, the `target` into `y_train_val` and `y_test`, and define the following arguments: `test_size = 0.2`, `random_state = 15`, `shuffle = True` and `stratify = target`

In [8]:
X_train_val, X_test, y_train_val, y_test = train_test_split(data, 
                                                    target, 
                                                    test_size=0.2, 
                                                    random_state=15, 
                                                    shuffle=True, 
                                                    stratify=target
                                                   )

This will allow me to create two different datasets, one for train (80% of the data) and one for test (20% of the data). <br>
The stratification will allow me to have the same proportion of each label of the dependent variable in both datasets.
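To see what stratification changes, the sketch below compares a stratified and an unstratified split on synthetic, imbalanced labels. The arrays `X_demo` and `y_demo` and the 90/10 class ratio are assumptions made for illustration; they are not taken from `tugas.xlsx`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): 900 rows of class 0 and 100 rows of class 1.
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Stratified split: the share of class 1 is preserved in both parts.
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=15, shuffle=True, stratify=y_demo
)

# Unstratified split: the share of class 1 can drift between the parts.
_, _, y_tr_r, y_te_r = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=15, shuffle=True
)

print("stratified   train/test share of class 1:", y_tr_s.mean(), y_te_s.mean())
print("unstratified train/test share of class 1:", y_tr_r.mean(), y_te_r.mean())
```

The difference matters most when one class is rare, since an unlucky random split could leave very few positive rows in the test set.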


### How to create the three datasets: train, validation and test?
To create three datasets (train, validation and test) we are going to use the function train_test_split twice. <br><br>
First we are going to create two sets of datasets, one for test (X_test and y_test) and another one that includes the data for training and validation (X_train_val and y_train_val).

__`Step 6`__ Divide the `X_train_val` into `X_train` and `X_val`, the `y_train_val` into `y_train` and `y_val`, and define the following arguments: `test_size = 0.25`, `random_state = 15`, `shuffle = True` and `stratify = y_train_val`.

In [9]:
# DO IT
X_train, X_val, y_train, y_val = train_test_split(X_train_val,
                                                  y_train_val,
                                                  test_size = 0.25,
                                                  random_state = 15,
                                                  shuffle=True,
                                                  stratify=y_train_val
)

__`Step 7`__ Check the proportion of data for each dataset. _(written for you)_

In [8]:
print('train:{:.0%} | validation:{:.0%} | test:{:.0%}'.format(len(y_train)/len(target),
                                                              len(y_val)/len(target),
                                                              len(y_test)/len(target)
                                                             ))

train:60% | validation:20% | test:20%


Now we have three different datasets, namely:
- Training dataset, with 60% of the data, that will allow me to build the model;
- Validation dataset, with 20% of the data, that will allow me to fine tune the model and check some problems like overfitting;
- Test dataset, with 20% of the data, that will allow me to evaluate the performance of the final model.
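The second `test_size` is 0.25 rather than 0.2 because it applies to the 80% that remains after the first split: 0.2 / 0.8 = 0.25. A minimal sketch of the two-step split as a reusable function follows; the name `three_way_split` and the synthetic arrays are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_frac=0.2, test_frac=0.2, random_state=15):
    """Hypothetical helper: stratified train/validation/test split."""
    # First split: hold out the test set.
    X_tv, X_te, y_tv, y_te = train_test_split(
        X, y, test_size=test_frac, random_state=random_state,
        shuffle=True, stratify=y,
    )
    # Second split: rescale the validation fraction to the remaining data.
    rel_val = val_frac / (1.0 - test_frac)  # 0.2 / 0.8 = 0.25
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_tv, y_tv, test_size=rel_val, random_state=random_state,
        shuffle=True, stratify=y_tv,
    )
    return X_tr, X_va, X_te, y_tr, y_va, y_te


# Synthetic data (assumption): 1000 rows, 30% positive labels.
y_demo = np.array([0] * 700 + [1] * 300)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X_demo, y_demo)
print(len(y_tr), len(y_va), len(y_te))
```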

## 1.2. K-Fold

The different techniques we are going to check in this step are commonly used in applied machine learning to compare and select a model for a given predictive modeling problem.

In the following cases, we are going to check the performance of a Logistic Regression using those different techniques.

In the following examples we are going to use the dataset `Diabetes.csv`

__`Step 8`__ Read the dataset __Diabetes.csv__ and define the independent variables as X and the dependent variable as y (the last column in the dataset).

In [10]:
data = pd.read_csv(r'./Datasets/Diabetes.csv')
# DO IT
X = data.iloc[:,:-1]
y = data.iloc[:,-1]

__`Step 9`__ Import __KFold__ from __sklearn.model_selection__

In [11]:
# DO IT
from sklearn.model_selection import KFold

__`Step 10`__ Import __LogisticRegression__ from __sklearn.linear_model__

In [12]:
# DO IT
from sklearn.linear_model import LogisticRegression

__`Step 11`__ Create a function named __run_model_LR__ that receives as parameters the independent variables and the dependent variable, and returns a Logistic Regression model fitted to the data.

In [13]:
# DO IT
def run_model_LR(X,y):
    model = LogisticRegression().fit(X,y)
    return model

__`Step 12`__ Create a function named __evaluate_model__ that receives as parameters the independent variables, the dependent variable and the model, and returns the result of the model's `score` method.

In [15]:
def evaluate_model(X,y, model):
    return model.score(X,y) # .score = accuracy

__`Step 13`__ Create a function named __avg_score_LR__ that prints the average score for the train and the test sets across all splits. Its parameters are the splitting technique you are going to use, your independent variables and your dependent variable.

In [16]:
def avg_score_LR(method,X,y):
    score_train = []
    score_test = []
    for train_index, test_index in method.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        model = run_model_LR(X_train, y_train)
        value_train = evaluate_model(X_train, y_train, model)
        value_test = evaluate_model(X_test,y_test, model)
        score_train.append(value_train)
        score_test.append(value_test)

    print('Train:', np.mean(score_train))
    print('Test:', np.mean(score_test))

__`Step 14`__ Create a KFold Instance where the number of splits is 10 (*n_splits*) and name it as __kf__

In [17]:
# DO IT
kf = KFold(n_splits=10)

__`Step 15`__ Call the function __avg_score_LR__ and check the average score for the train and the test sets using __kf__

In [23]:
avg_score_LR(kf, X, y)

Train: 0.7832744284483408
Test: 0.7669343814080655
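To make the mechanics of `KFold` visible, the sketch below prints the indices produced for a tiny array. The array `X_small` and the choice of 5 splits are assumptions for illustration. Without shuffling, `KFold` cuts the rows into consecutive blocks, so each row lands in the test fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten rows, one feature (illustrative data only).
X_small = np.arange(10).reshape(-1, 1)

kf_demo = KFold(n_splits=5)  # shuffle=False by default
test_folds = []
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(X_small)):
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
    test_folds.append(test_idx.tolist())
```

If the rows of a dataset are sorted (for example by target or by date), consecutive blocks may not be representative; passing `shuffle=True` together with a `random_state` is one way to address that.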


## 1.3. Repeated K-Fold

__`Step 16`__ Import __RepeatedKFold__ from __sklearn.model_selection__

In [24]:
# DO IT
from sklearn.model_selection import RepeatedKFold

__`Step 17`__ Create a RepeatedKFold instance named __rkf__ with 6 splits (`n_splits=6`) and 2 repetitions of the cross-validation (`n_repeats=2`).

In [25]:
# DO IT
rkf = RepeatedKFold(n_splits=6, n_repeats=2)

__`Step 18`__ Call the function __avg_score_LR__ and check the average score for the train and the test sets using __rkf__

In [26]:
# DO IT
avg_score_LR(rkf, X, y)

Train: 0.7799479166666666
Test: 0.7740885416666666
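`RepeatedKFold` runs K-Fold several times with a different shuffle in each repetition, so the total number of train/test pairs is `n_splits * n_repeats`. The sketch below checks this on a tiny array; `X_small` and the `random_state=15` value are assumptions for illustration. Fixing `random_state` should make the splits, and therefore the averaged scores, reproducible between runs.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Twelve rows, one feature (illustrative data only).
X_small = np.arange(12).reshape(-1, 1)

rkf_demo = RepeatedKFold(n_splits=6, n_repeats=2, random_state=15)

# Collect the test indices of every split, twice, to check reproducibility.
splits_a = [test_idx.tolist() for _, test_idx in rkf_demo.split(X_small)]
splits_b = [test_idx.tolist() for _, test_idx in rkf_demo.split(X_small)]

print("number of splits:", len(splits_a))
print("same splits on a second call:", splits_a == splits_b)

# Within one repetition, every row appears in a test fold exactly once.
first_repeat = sorted(i for fold in splits_a[:6] for i in fold)
print("first repetition covers all rows:", first_repeat == list(range(12)))
```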


## 1.4. Leave One Out

__`Step 19`__ Do the same steps you applied on the previous techniques, but this time using the Leave One Out. For that, you need to import __LeaveOneOut__ from __sklearn.model_selection__

In [20]:
# DO IT
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
avg_score_LR(loo, X, y)

Train: 0.7816659197088223
Test: 0.7786458333333334
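Leave One Out is the extreme case of K-Fold in which every test fold contains a single row, so the number of model fits equals the number of rows. The sketch below verifies this on a tiny array; `X_small` is illustrative data, not the Diabetes dataset.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Eight rows, one feature (illustrative data only).
X_small = np.arange(8).reshape(-1, 1)

loo_demo = LeaveOneOut()
n_fits = loo_demo.get_n_splits(X_small)
sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in loo_demo.split(X_small)]

print("number of splits:", n_fits)
print("(train size, test size) per split:", sizes)
```

Because each test fold holds one row, the accuracy of a single fold is either 0 or 1, and the averaged test score is the share of rows predicted correctly when left out. The cost grows with the dataset, since one model is trained per row.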


## 1.5. Stratified k-fold and others

scikit-learn offers several other splitting strategies for model selection, and they are applied in much the same way as the ones we saw previously.

<img src="model_selection.png" alt="Drawing" style="width: 800px;"/> <br>
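As one example, `StratifiedKFold` keeps the class proportions roughly equal across folds. A minimal sketch on synthetic labels follows; `X_demo`, `y_demo` and the 16/4 class counts are assumptions for illustration. Note that its `split` method needs the labels as well, so a helper that calls `method.split(X)` with only the features, like the ones defined earlier, would have to be adapted to pass `y`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels (assumption): 16 rows of class 0 and 4 rows of class 1.
y_demo = np.array([0] * 16 + [1] * 4)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=15)

# split() takes both X and y so that it can balance the classes per fold.
pos_per_fold = [
    int(y_demo[test_idx].sum()) for _, test_idx in skf.split(X_demo, y_demo)
]
print("class-1 rows in each test fold:", pos_per_fold)
```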

## Comparing models

Don't forget that the purpose of this notebook is to compare different models. In this step, you are also going to fit a Decision Tree model to your data and use __RepeatedKFold__ to compare its performance with that of the Logistic Regression.

__`Step 21`__ Import __DecisionTreeClassifier__ from __sklearn.tree__

In [21]:
# DO IT
from sklearn.tree import DecisionTreeClassifier

__`Step 22`__ Similarly to Step 11, create a function named __run_model_DT__ that receives as parameters the independent variables and the dependent variable, and returns a Decision Tree Classifier model fitted to the data.

In [22]:
# DO IT
def run_model_DT(X,y):
    model = DecisionTreeClassifier().fit(X,y)
    return model

__`Step 23`__ Similarly to Step 13, create a function named __avg_score_DT__ that prints the average score for the train and the test sets across all splits. Its parameters are the splitting technique you are going to use, your independent variables and your dependent variable.

In [23]:
# DO IT
def avg_score_DT(method,X,y):
    score_train = []
    score_test = []
    for train_index, test_index in method.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        model = run_model_DT(X_train, y_train)
        value_train = evaluate_model(X_train, y_train, model)
        value_test = evaluate_model(X_test,y_test, model)
        score_train.append(value_train)
        score_test.append(value_test)

    print('Train:', np.mean(score_train))
    print('Test:', np.mean(score_test))

__`Step 24`__ Apply RepeatedKFold to the data using `n_splits = 6` and `n_repeats = 2` and check the performance of the DecisionTree you created by calling the function __avg_score_DT__

In [24]:
# DO IT
rkf2 = RepeatedKFold(n_splits=6, n_repeats=2)
avg_score_DT(rkf2, X, y)

Train: 1.0
Test: 0.70703125
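The two `avg_score_*` functions differ only in the model they fit. One possible way to avoid the duplication, sketched below, is scikit-learn's `cross_validate`, which accepts any estimator and can return train scores as well. The synthetic data from `make_classification`, the `max_iter=1000` setting and the `random_state` values are assumptions for illustration and do not reproduce the Diabetes results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption, not Diabetes.csv).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=15)

# A fixed random_state gives both models the same train/test splits.
cv = RepeatedKFold(n_splits=6, n_repeats=2, random_state=15)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=15),
}

results = {}
for name, estimator in models.items():
    scores = cross_validate(
        estimator, X_demo, y_demo, cv=cv,
        scoring="accuracy", return_train_score=True,
    )
    results[name] = (scores["train_score"].mean(), scores["test_score"].mean())
    print(f"{name} | train: {results[name][0]:.3f} | test: {results[name][1]:.3f}")
```

A train score of 1.0 next to a clearly lower test score, as in the Decision Tree output above, is the usual sign of overfitting: an unconstrained tree can memorise the training rows. Limiting the tree (for example with `max_depth`) is one way to narrow that gap.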
