<a href="https://colab.research.google.com/github/mkjubran/MachineLearningNotebooks/blob/master/KFoldCrossValidation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clone the Source GitHub Repository
We need to clone some source files from a GitHub repository; they are used throughout this tutorial.

In [0]:
!rm -rf ./MachineLearning
!git clone https://github.com/mkjubran/MachineLearning.git

(Optional) You may also disable future warnings by running the following code

In [0]:
# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

# K-Fold Cross Validation

**Readings and Resources**

[1] https://medium.com/datadriveninvestor/k-fold-cross-validation-6b8518070833

[2] https://scikit-learn.org/stable/modules/cross_validation.html

[3] https://machinelearningmastery.com/k-fold-cross-validation/

# Case #1: Studying Hours and Passing Exams

In this section, we will compare the performance of different ML techniques applied to the "Studying Hours and Passing Exams" case using the **K-Fold Cross Validation** (KCV) Method. We assume you have some experience or have practiced each of these ML techniques using our tutorials.

**Implementation** *(you may run the first few cells quickly if you have done this problem in the previous tutorials)*

Read the input data (number of study hours and exam pass or fail) from the csv file (HoursPassExam.csv). Use the pandas library (https://pandas.pydata.org/) to read the data from the file.

In [0]:
import pandas as pd
df = pd.read_csv("./MachineLearning/6_KFold_Cross_Validation/HoursPassExam.csv")
df.head()

As can be seen, the output (pass) is binary: 0 for failing the exam and 1 for passing the exam. Next, we will divide the dataset into K training and testing datasets using the **K-Fold Cross Validation** (KCV) method. However, before doing that, let us show a simple example of using the **KCV** method.

Let us assume a dataset of integers from 0 to 10. We will use **KCV** to split this dataset into 4 training and testing datasets as follows:

In [0]:
from sklearn.model_selection import KFold
kf = KFold( n_splits = 4 )
for train_index, test_index in kf.split([0,1,2,3,4,5,6,7,8,9,10]):
  print(train_index,test_index)

As can be observed, the dataset is split into 4 non-overlapping test datasets. The sizes of the training and test splits depend on the requested number of splits (n_splits).
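To make the dependence on n_splits concrete, the short sketch below collects the size of each test fold for the same 11 integers and checks that the test folds together cover every sample exactly once. The variable names (data, test_sizes, covered) are only illustrative.

```python
from sklearn.model_selection import KFold

data = list(range(11))  # the same 11 integers (0 to 10) used above

test_sizes = []   # number of test samples in each fold
covered = []      # every index that appears in some test fold
for train_index, test_index in KFold(n_splits=4).split(data):
    test_sizes.append(len(test_index))
    covered.extend(test_index.tolist())

print("test fold sizes:", test_sizes)
print("each sample tested exactly once:", sorted(covered) == data)
```

Since 11 is not divisible by 4, the folds cannot all have the same size; KFold gives the first folds one extra sample.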

Let us split the same dataset into 3 splits (n_splits = 3)


In [0]:
kf = KFold( n_splits = 3 )
for train_index, test_index in kf.split([0,1,2,3,4,5,6,7,8,9,10]):
  print(train_index,test_index)

Let us now get the **KCV** splits of the students pass/fail dataset with n_splits = 4

In [0]:
from sklearn.model_selection import KFold
kf = KFold( n_splits = 4 )
for train_index, test_index in kf.split(range(df.shape[0])):
  print(train_index,test_index)

To determine the training and testing datasets, we will define x and y, and then define x_train, x_test, y_train, and y_test based on train_index and test_index as follows

In [0]:
x = df[['hours']] # two dimension array
y = df['pass']
from sklearn.model_selection import KFold
kf = KFold( n_splits = 4 )

from sklearn.linear_model import LogisticRegression
model_lr = LogisticRegression()
ACC_train_lr = []
ACC_test_lr = []

import numpy as np
for train_index,test_index in kf.split(range(df.shape[0])):
   y_train = [y[i] for i in train_index]
   y_test = [y[i] for i in test_index]
   x_train = [np.array(x)[i,:] for i in train_index]
   x_test = [np.array(x)[i,:] for i in test_index]
   print('Size of dataset: x_train = {}, x_test = {}, y_train = {}, y_test = {}'.format(len(x_train),len(x_test),len(y_train),len(y_test)))
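The list comprehensions above work, but pandas can select the same rows directly by position. The sketch below shows this alternative with .iloc; the small df_demo table is a made-up stand-in for HoursPassExam.csv so that the cell runs on its own.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical stand-in for HoursPassExam.csv (values are made up for illustration)
df_demo = pd.DataFrame({"hours": [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0],
                        "pass":  [0, 0, 0, 1, 0, 1, 1, 1]})
x_demo = df_demo[["hours"]]  # two-dimensional (DataFrame)
y_demo = df_demo["pass"]     # one-dimensional (Series)

for train_index, test_index in KFold(n_splits=4).split(x_demo):
    # .iloc selects rows by position, which is what KFold returns
    x_train, x_test = x_demo.iloc[train_index], x_demo.iloc[test_index]
    y_train, y_test = y_demo.iloc[train_index], y_demo.iloc[test_index]
    print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```

Positional indexing also keeps working when the DataFrame index is not 0, 1, 2, ..., for example after filtering rows.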



Let us now apply logistic regression to the **KCV** train and test datasets

In [0]:
x = df[['hours']] # two dimension array
y = df['pass']
from sklearn.model_selection import KFold
kf = KFold( n_splits = 4 )

from sklearn.linear_model import LogisticRegression
model_lr = LogisticRegression()
ACC_train_lr = []
ACC_test_lr = []

import numpy as np
for train_index,test_index in kf.split(range(df.shape[0])):
   y_train = [y[i] for i in train_index]
   y_test = [y[i] for i in test_index]
   x_train = [np.array(x)[i,:] for i in train_index]
   x_test = [np.array(x)[i,:] for i in test_index]
   model_lr.fit(x_train, y_train)
   ACC_train_lr.append(model_lr.score(x_train, y_train))
   ACC_test_lr.append(model_lr.score(x_test, y_test))

ACC_train_lr = np.mean(ACC_train_lr)
ACC_test_lr = np.mean(ACC_test_lr)

print(ACC_train_lr)
print(ACC_test_lr)


To compare different ML techniques efficiently, we will wrap the model fitting and score computation in a function as follows

In [0]:
def ACC_ML(model, x_train,x_test,y_train,y_test, ACC_train, ACC_test):
   model.fit(x_train, y_train)
   ACC_train.append(model.score(x_train, y_train))
   ACC_test.append(model.score(x_test, y_test))
   return ACC_train,ACC_test



Now we repeat the model evaluation using the ACC_ML() function

In [0]:
x = df[['hours']] # two dimension array
y = df['pass']
from sklearn.model_selection import KFold
kf = KFold( n_splits = 4 )

from sklearn.linear_model import LogisticRegression
ACC_train_lr = []
ACC_test_lr = []

import numpy as np
for train_index,test_index in kf.split(range(df.shape[0])):
   y_train = [y[i] for i in train_index]
   y_test = [y[i] for i in test_index]
   x_train = [np.array(x)[i,:] for i in train_index]
   x_test = [np.array(x)[i,:] for i in test_index]

   ## logistic regression
   ACC_train_lr, ACC_test_lr = ACC_ML(LogisticRegression(), x_train,x_test,y_train,y_test, ACC_train_lr, ACC_test_lr)

ACC_train_lr = np.mean(ACC_train_lr)
ACC_test_lr = np.mean(ACC_test_lr)

print(ACC_train_lr)
print(ACC_test_lr)

Let us compare the Logistic Regression (**LR**), Decision Tree (**DT**), Support Vector Machine (**SVM**), Random Forest (**RF**), and Naive Bayes (**NB**) ML techniques

In [0]:
x = df[['hours']] # two dimension array
y = df['pass']
from sklearn.model_selection import KFold
kf = KFold( n_splits = 4 )

##LR
from sklearn.linear_model import LogisticRegression
ACC_train_lr = []; ACC_test_lr = []

##DT
from sklearn.tree import DecisionTreeClassifier
ACC_train_dt = []; ACC_test_dt = []

##SVM
from sklearn.svm import SVC
ACC_train_svm = []; ACC_test_svm = []

##RF
from sklearn.ensemble import RandomForestClassifier
ACC_train_rf = []; ACC_test_rf = []

##NB
from sklearn.naive_bayes import GaussianNB
ACC_train_nb = []; ACC_test_nb = []

import numpy as np
for train_index,test_index in kf.split(range(df.shape[0])):
   y_train = [y[i] for i in train_index]
   y_test = [y[i] for i in test_index]
   x_train = [np.array(x)[i,:] for i in train_index]
   x_test = [np.array(x)[i,:] for i in test_index]

   ## LR
   ACC_train_lr, ACC_test_lr = ACC_ML(LogisticRegression(), x_train,x_test,y_train,y_test, ACC_train_lr, ACC_test_lr)

   ##DT
   ACC_train_dt, ACC_test_dt = ACC_ML(DecisionTreeClassifier(), x_train,x_test,y_train,y_test, ACC_train_dt, ACC_test_dt)


   ##SVM
   ACC_train_svm, ACC_test_svm = ACC_ML(SVC(), x_train,x_test,y_train,y_test, ACC_train_svm, ACC_test_svm)
 
   ##RF
   ACC_train_rf, ACC_test_rf = ACC_ML(RandomForestClassifier(), x_train,x_test,y_train,y_test, ACC_train_rf, ACC_test_rf)

   ##NB
   ACC_train_nb, ACC_test_nb = ACC_ML(GaussianNB(), x_train,x_test,y_train,y_test, ACC_train_nb, ACC_test_nb)
  
## compute the mean of accuracy of the K-Fold datasets
ACC_train_lr_mean = np.mean(ACC_train_lr); ACC_test_lr_mean = np.mean(ACC_test_lr);
ACC_train_dt_mean = np.mean(ACC_train_dt); ACC_test_dt_mean = np.mean(ACC_test_dt);
ACC_train_svm_mean = np.mean(ACC_train_svm); ACC_test_svm_mean = np.mean(ACC_test_svm);
ACC_train_rf_mean = np.mean(ACC_train_rf); ACC_test_rf_mean = np.mean(ACC_test_rf);
ACC_train_nb_mean = np.mean(ACC_train_nb); ACC_test_nb_mean = np.mean(ACC_test_nb);

from prettytable import PrettyTable
t = PrettyTable(['Accuracy', 'Logistic (%)' , 'DT (%)' , 'SVM (%)' , 'RF (%)', 'NB (%)'])
t.add_row(['Training', ACC_train_lr_mean*100, ACC_train_dt_mean*100, ACC_train_svm_mean*100, ACC_train_rf_mean*100, ACC_train_nb_mean*100])
t.add_row(['Testing', ACC_test_lr_mean*100, ACC_test_dt_mean*100, ACC_test_svm_mean*100, ACC_test_rf_mean*100, ACC_test_nb_mean*100])
print(t)
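The cell above repeats the same lines for every model. One possible way to shorten it is to keep the models in a dictionary and loop over them inside each fold. The sketch below assumes synthetic data from make_classification and only three of the models, so that it runs on its own; the names models and results are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 80 samples, 2 features, 2 classes
x_demo, y_demo = make_classification(n_samples=80, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

models = {"LR": LogisticRegression(),
          "DT": DecisionTreeClassifier(random_state=0),
          "NB": GaussianNB()}
results = {name: {"train": [], "test": []} for name in models}

for train_index, test_index in KFold(n_splits=4).split(x_demo):
    x_train, x_test = x_demo[train_index], x_demo[test_index]
    y_train, y_test = y_demo[train_index], y_demo[test_index]
    for name, model in models.items():
        model.fit(x_train, y_train)  # fit() retrains the model from scratch on each fold
        results[name]["train"].append(model.score(x_train, y_train))
        results[name]["test"].append(model.score(x_test, y_test))

for name, acc in results.items():
    print("{}: train = {:.1f} %, test = {:.1f} %".format(
        name, 100 * np.mean(acc["train"]), 100 * np.mean(acc["test"])))
```

Adding another technique then only requires one more entry in the models dictionary.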

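Thanks to the sklearn library, we do not have to write all of this code to get the **KCV** score. We can use the cross_val_score function instead. Note that for classifiers it uses stratified folds, and the default number of folds is 5 in recent scikit-learn releases, so the results may differ slightly from the table above.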

In [0]:
x = df[['hours']] # two dimension array
y = df['pass']
from sklearn.model_selection import cross_val_score

##LR
from sklearn.linear_model import LogisticRegression
ACC_test_lr = cross_val_score(LogisticRegression(),x,y)

##DT
from sklearn.tree import DecisionTreeClassifier
ACC_test_dt = cross_val_score(DecisionTreeClassifier(),x,y)

##SVM
from sklearn.svm import SVC
ACC_test_svm = cross_val_score(SVC(),x,y)

##RF
from sklearn.ensemble import RandomForestClassifier
ACC_test_rf = cross_val_score(RandomForestClassifier(),x,y)

##NB
from sklearn.naive_bayes import GaussianNB
ACC_test_nb = cross_val_score(GaussianNB(),x,y)

ACC_test_lr_mean = np.mean(ACC_test_lr);
ACC_test_dt_mean = np.mean(ACC_test_dt);
ACC_test_svm_mean = np.mean(ACC_test_svm);
ACC_test_rf_mean = np.mean(ACC_test_rf);
ACC_test_nb_mean = np.mean(ACC_test_nb);

from prettytable import PrettyTable
t = PrettyTable(['Accuracy', 'Logistic (%)' , 'DT (%)' , 'SVM (%)' , 'RF (%)', 'NB (%)'])
t.add_row(['Testing', ACC_test_lr_mean*100, ACC_test_dt_mean*100, ACC_test_svm_mean*100, ACC_test_rf_mean*100, ACC_test_nb_mean*100])
print(t)
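When cv is left at its default, cross_val_score chooses the splitting strategy itself. If the goal is to reproduce the manual 4-fold loop from earlier, a KFold object can be passed through the cv parameter. The sketch below assumes synthetic data from make_classification so that it runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption)
x_demo, y_demo = make_classification(n_samples=80, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

# Default behaviour: stratified folds for a classifier
scores_default = cross_val_score(LogisticRegression(), x_demo, y_demo)

# Explicit plain K-Fold with 4 splits, matching the manual loop
scores_kfold = cross_val_score(LogisticRegression(), x_demo, y_demo, cv=KFold(n_splits=4))

print("default cv:", len(scores_default), "scores, mean =", scores_default.mean())
print("KFold(4):  ", len(scores_kfold), "scores, mean =", scores_kfold.mean())
```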


To get the training accuracy, training time, and score time you may use the following code

In [0]:
x = df[['hours']] # two dimension array
y = df['pass']

import numpy as np
from sklearn.model_selection import cross_validate

##LR
from sklearn.linear_model import LogisticRegression
Score_lr = cross_validate(LogisticRegression(),x,y,return_train_score=True)
ACC_test_lr = np.mean(Score_lr['test_score'])
ACC_train_lr = np.mean(Score_lr['train_score'])
fit_time_lr = np.mean(Score_lr['fit_time'])
score_time_lr = np.mean(Score_lr['score_time'])

##DT
from sklearn.tree import DecisionTreeClassifier
Score_dt = cross_validate(DecisionTreeClassifier(),x,y,return_train_score=True)
ACC_test_dt = np.mean(Score_dt['test_score'])
ACC_train_dt = np.mean(Score_dt['train_score'])
fit_time_dt = np.mean(Score_dt['fit_time'])
score_time_dt = np.mean(Score_dt['score_time'])

##SVM
from sklearn.svm import SVC
Score_svm = cross_validate(SVC(),x,y,return_train_score=True)
ACC_test_svm = np.mean(Score_svm['test_score'])
ACC_train_svm = np.mean(Score_svm['train_score'])
fit_time_svm = np.mean(Score_svm['fit_time'])
score_time_svm = np.mean(Score_svm['score_time'])

##RF
from sklearn.ensemble import RandomForestClassifier
Score_rf = cross_validate(RandomForestClassifier(),x,y,return_train_score=True)
ACC_test_rf = np.mean(Score_rf['test_score'])
ACC_train_rf = np.mean(Score_rf['train_score'])
fit_time_rf = np.mean(Score_rf['fit_time'])
score_time_rf = np.mean(Score_rf['score_time'])

##NB
from sklearn.naive_bayes import GaussianNB
Score_nb = cross_validate(GaussianNB(),x,y,return_train_score=True)
ACC_test_nb = np.mean(Score_nb['test_score'])
ACC_train_nb = np.mean(Score_nb['train_score'])
fit_time_nb = np.mean(Score_nb['fit_time'])
score_time_nb = np.mean(Score_nb['score_time'])

from prettytable import PrettyTable
t = PrettyTable(['Accuracy', 'Logistic' , 'DT' , 'SVM' , 'RF', 'NB'])
t.add_row(['Training (%)', ACC_train_lr*100, ACC_train_dt*100, ACC_train_svm*100, ACC_train_rf*100, ACC_train_nb*100])
t.add_row(['Testing (%)', ACC_test_lr*100, ACC_test_dt*100, ACC_test_svm*100, ACC_test_rf*100,  ACC_test_nb*100])
t.add_row(['fit_time', fit_time_lr, fit_time_dt, fit_time_svm, fit_time_rf, fit_time_nb])
t.add_row(['score_time', score_time_lr, score_time_dt, score_time_svm, score_time_rf, score_time_nb])
print(t)
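The dictionary returned by cross_validate holds one value per fold for each key, so it can be turned into a pandas DataFrame for a quick per-fold view. This is only a sketch on synthetic data; scores_df is an illustrative name.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
x_demo, y_demo = make_classification(n_samples=80, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

scores = cross_validate(DecisionTreeClassifier(random_state=0), x_demo, y_demo,
                        cv=4, return_train_score=True)

scores_df = pd.DataFrame(scores)  # one row per fold, one column per key
print(scores_df)
print(scores_df.mean())           # the same averages computed with np.mean above
```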

The k (n_splits) value must be chosen carefully for your datasets. A poorly chosen value for k may result in a mis-representative idea of the skill of the model, such as a score with a high variance (that may change a lot based on the data used to fit the model), or a high bias, (such as an overestimate of the skill of the model). $^{[1]}$

Three common tactics for choosing a value for k are as follows $^{[1]}$:

1- Representative: The value for k is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.

2- k=10: The value for k is fixed to 10, a value that has been found through experimentation to generally result in a model skill estimate with low bias and a modest variance.

3- k=n: The value for k is fixed to n, where n is the size of the dataset to give each test sample an opportunity to be used in the hold out dataset. This approach is called leave-one-out cross-validation.

[1] https://machinelearningmastery.com/k-fold-cross-validation/
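One way to see how k influences the estimate is to look at the spread of the per-fold scores, not only their mean. The sketch below does this for a few values of k on synthetic data (an assumption, used so the cell runs on its own); the numbers it prints say nothing about the pass/fail dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption)
x_demo, y_demo = make_classification(n_samples=100, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

summary = {}
for k in (2, 5, 10):
    scores = cross_val_score(LogisticRegression(), x_demo, y_demo, cv=k)
    summary[k] = (scores.mean(), scores.std())  # mean and spread over the k folds
    print("k = {:2d}: mean = {:.3f}, std = {:.3f}".format(k, scores.mean(), scores.std()))
```

With larger k each test fold is smaller, so individual fold scores tend to fluctuate more even when the mean stays similar.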

Let us try **KCV** with k=10 

In [0]:
x = df[['hours']] # two dimension array
y = df['pass']

from sklearn.model_selection import cross_val_score
cv_value = 10
##LR
from sklearn.linear_model import LogisticRegression
ACC_test_lr = cross_val_score(LogisticRegression(),x,y,cv = cv_value)

##DT
from sklearn.tree import DecisionTreeClassifier
ACC_test_dt = cross_val_score(DecisionTreeClassifier(),x,y,cv = cv_value)

##SVM
from sklearn.svm import SVC
ACC_test_svm = cross_val_score(SVC(),x,y,cv = cv_value)

##RF
from sklearn.ensemble import RandomForestClassifier
ACC_test_rf = cross_val_score(RandomForestClassifier(),x,y,cv = cv_value)

##NB
from sklearn.naive_bayes import GaussianNB
ACC_test_nb = cross_val_score(GaussianNB(),x,y,cv = cv_value)

ACC_test_lr_mean = np.mean(ACC_test_lr);
ACC_test_dt_mean = np.mean(ACC_test_dt);
ACC_test_svm_mean = np.mean(ACC_test_svm);
ACC_test_rf_mean = np.mean(ACC_test_rf);
ACC_test_nb_mean = np.mean(ACC_test_nb);

from prettytable import PrettyTable
t = PrettyTable(['Accuracy', 'Logistic (%)' , 'DT (%)' , 'SVM (%)' , 'RF (%)', 'NB (%)'])
t.add_row(['Testing', ACC_test_lr_mean*100, ACC_test_dt_mean*100, ACC_test_svm_mean*100, ACC_test_rf_mean*100, ACC_test_nb_mean*100])
print(t)

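Let us try to approach **leave-one-out cross-validation** by using the largest practical number of folds. For a classifier, cross_val_score with an integer cv uses stratified folds, so instead of k = n the code below sets k to the size of the smaller class, which still lets every test fold contain both classes. This is close to, but not the same as, true leave-one-out.

In [0]:
## compute the maximum possible cv value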
x = df[['hours']] # two dimension array
y = df['pass']
y_0 =  df[df['pass'] == 0]
y_1 =  df[df['pass'] == 1]
cv_value = int(min(y_0.shape[0], y_1.shape[0]))

Use cross_val_score with this cv_value

In [0]:
x = df[['hours']] # two dimension array
y = df['pass']

from sklearn.model_selection import cross_val_score

##LR
from sklearn.linear_model import LogisticRegression
ACC_test_lr = cross_val_score(LogisticRegression(),x,y,cv = cv_value)

##DT
from sklearn.tree import DecisionTreeClassifier
ACC_test_dt = cross_val_score(DecisionTreeClassifier(),x,y,cv = cv_value)

##SVM
from sklearn.svm import SVC
ACC_test_svm = cross_val_score(SVC(),x,y,cv = cv_value)

##RF
from sklearn.ensemble import RandomForestClassifier
ACC_test_rf = cross_val_score(RandomForestClassifier(),x,y,cv = cv_value)


##NB
from sklearn.naive_bayes import GaussianNB
ACC_test_nb = cross_val_score(GaussianNB(),x,y,cv = cv_value)

ACC_test_lr_mean = np.mean(ACC_test_lr);
ACC_test_dt_mean = np.mean(ACC_test_dt);
ACC_test_svm_mean = np.mean(ACC_test_svm);
ACC_test_rf_mean = np.mean(ACC_test_rf);
ACC_test_nb_mean = np.mean(ACC_test_nb);

from prettytable import PrettyTable
t = PrettyTable(['Accuracy', 'Logistic (%)' , 'DT (%)' , 'SVM (%)' , 'RF (%)', 'NB (%)'])
t.add_row(['Testing', ACC_test_lr_mean*100, ACC_test_dt_mean*100, ACC_test_svm_mean*100, ACC_test_rf_mean*100, ACC_test_nb_mean*100])
print(t)

Comparing the two options of k values above emphasizes the importance of selecting the right value of k.
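For comparison, scikit-learn also provides a LeaveOneOut splitter that implements k = n directly: every sample is used once as a single-sample test set. The sketch below assumes a small synthetic dataset so that it runs quickly on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in data (assumption): 30 samples
x_demo, y_demo = make_classification(n_samples=30, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(), x_demo, y_demo, cv=loo)

print("number of folds:", loo.get_n_splits(x_demo))  # equals the number of samples
print("mean accuracy:  ", scores.mean())             # each fold score is either 0 or 1
```

Because the model is fitted once per sample, this approach becomes expensive for large datasets.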

The **KCV** can also be used during parameter tuning of any ML technique. Let us use the **KCV** to tune the number of trees in the Random Forest Classifier (the default value is "n_estimators = 100").

In [0]:
x = df[['hours']] # two dimension array
y = df['pass']
from sklearn.model_selection import cross_validate

from prettytable import PrettyTable
t = PrettyTable(['n_estimators','Accuracy of RF'])

##RF (cv = 10)
from sklearn.ensemble import RandomForestClassifier
for n_trees in range(10,200,10):
  Score_rf = cross_validate(RandomForestClassifier(n_estimators = n_trees),x,y,return_train_score=True, cv = 10)
  ACC_test_rf = np.mean(Score_rf['test_score'])
  t.add_row([n_trees, ACC_test_rf*100])

print(t)
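The manual loop above can also be expressed with GridSearchCV from sklearn.model_selection, which runs the cross-validation for every candidate value and records the best one. This is a sketch on synthetic data with a short list of candidate values (both are assumptions made to keep the cell fast and self-contained).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption)
x_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

param_grid = {"n_estimators": [10, 50, 100]}  # candidate numbers of trees
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(x_demo, y_demo)

print("best n_estimators:", search.best_params_["n_estimators"])
print("best mean test accuracy (%):", search.best_score_ * 100)
print("mean test accuracy per candidate:", search.cv_results_["mean_test_score"])
```

Fixing random_state in the classifier makes the comparison between candidates repeatable from run to run.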

# Case #2: HR Analysis

In this section, we will compare the performance of different ML techniques applied to the "HR Analysis" case using the **K-Fold Cross Validation** (KCV) method. We assume you have some experience or have practiced each of these ML techniques using our tutorials.

**Implementation** *(you may run the first few cells quickly if you have done this problem in the previous tutorials)*

Read the input data from the csv file (HR_comma_sep.csv). The dataset was downloaded from Kaggle. Link: https://www.kaggle.com/giripujar/hr-analytics. Use the pandas library (https://pandas.pydata.org/) to read the data from the file.

In [0]:
import pandas as pd
HR = pd.read_csv('./MachineLearning/5_random_forest/HR_comma_sep.csv')
HR.head()

To get some information about the loaded dataset, use the pandas info method

In [0]:
HR.info()

Before applying classification to the data, we will explore and analyze the data to determine the features that influence the decision of the employee to remain or leave the company.

In [0]:
left = HR[HR.left==1] ## employees who left the company 
No_left= left.shape[0]
remain = HR[HR.left==0] ## employees who remain at the company 
No_remain = remain.shape[0]
Per_left = No_left / (No_left + No_remain)

print('No_left = {}, No_remain = {} , Percentage of left = {} %'.format(No_left,No_remain,Per_left*100))


About $23\%$ of the employees left the company. Now, let us check which features most affect the decision of employees to leave or remain in the company. To do this, we will compute the average of each numeric feature separately for employees who remained and for those who left.

In [0]:
HR.groupby('left').mean(numeric_only=True)  # numeric_only skips the text columns (Department, salary)

We may conclude the following from the table above:

1- Employees who remain in the company have a higher satisfaction_level, so it is a good indicator for our classifier (good feature).

2- The last_evaluation, number of projects, and time_spend_company values are almost independent of whether employees remain or leave the company.

3- The average_montly_hours for employees who left the company are higher than for those who remained, which could be an indicator (good feature).

4- The promotion_last_5years average for employees remaining in the company is much higher than for those who left (good feature).

5- Work_accident is also an indicator, so it is a good feature.




Let us also check the quality of the categorical features.

In [0]:
pd.crosstab(HR.salary,HR.left)

The salary table shows that employees with high salaries are more likely to stay in the company, so salary is a good feature. To visualize this, we make a bar plot as follows:

In [0]:
pd.crosstab(HR.salary,HR.left).plot(kind='bar')
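Raw counts can be hard to compare when the salary groups differ in size. A possible complement is to normalize each row of the crosstab so that it shows the fraction of employees who stayed or left within each group. The sketch below uses a tiny made-up table (hr_demo) in place of the HR dataset so that it runs on its own.

```python
import pandas as pd

# Hypothetical miniature of the HR table (values are made up for illustration)
hr_demo = pd.DataFrame({
    "salary": ["low", "low", "low", "low", "medium", "medium", "medium",
               "high", "high", "high"],
    "left":   [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# normalize="index" divides every row by its total, so each row sums to 1
rates = pd.crosstab(hr_demo.salary, hr_demo.left, normalize="index")
print(rates)
```

On the real data, the same call would be pd.crosstab(HR.salary, HR.left, normalize='index').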

We also need to investigate the department feature as follows

In [0]:
pd.crosstab(HR.Department,HR.left).plot(kind='bar')

The department type has a minor effect on the decision of employees to stay or leave the company. It does not look like a major factor, and thus we will ignore this feature.

Based on the above analysis, we will create the following table, which includes only the good (important, major) features affecting employees' decisions to stay or leave the company

In [0]:
HR_GF = HR[['left','satisfaction_level','average_montly_hours','Work_accident','promotion_last_5years', 'salary']]
HR_GF.head()

Let us plot this data for better visualization

In [0]:
import matplotlib.pyplot as plt
HR_GF_0 = HR_GF[HR_GF['left'] == 0]
HR_GF_1 = HR_GF[HR_GF['left'] == 1]

fig, axes = plt.subplots(2, 4,figsize = (20,10))
axes[0,0].scatter(HR_GF_0['satisfaction_level'], HR_GF_0['average_montly_hours'], color = 'blue', marker ='+')
axes[0,0].set_xlabel('satisfaction_level')
axes[0,0].set_ylabel('average_montly_hours')

axes[0,1].scatter(HR_GF_0['satisfaction_level'], HR_GF_0['Work_accident'], color = 'blue', marker ='+')
axes[0,1].set_xlabel('satisfaction_level')
axes[0,1].set_ylabel('Work_accident')

axes[0,2].scatter(HR_GF_0['satisfaction_level'], HR_GF_0['promotion_last_5years'], color = 'blue', marker ='+')
axes[0,2].set_xlabel('satisfaction_level')
axes[0,2].set_ylabel('promotion_last_5years')

axes[0,3].scatter(HR_GF_0['satisfaction_level'], HR_GF_0['salary'], color = 'blue', marker ='+')
axes[0,3].set_xlabel('satisfaction_level')
axes[0,3].set_ylabel('salary')

axes[1,0].scatter(HR_GF_1['satisfaction_level'], HR_GF_1['average_montly_hours'], color = 'orange', marker ='s')
axes[1,0].set_xlabel('satisfaction_level')
axes[1,0].set_ylabel('average_montly_hours')

axes[1,1].scatter(HR_GF_1['satisfaction_level'], HR_GF_1['Work_accident'], color = 'orange', marker ='s')
axes[1,1].set_xlabel('satisfaction_level')
axes[1,1].set_ylabel('Work_accident')

axes[1,2].scatter(HR_GF_1['satisfaction_level'], HR_GF_1['promotion_last_5years'], color = 'orange', marker ='s')
axes[1,2].set_xlabel('satisfaction_level')
axes[1,2].set_ylabel('promotion_last_5years')

axes[1,3].scatter(HR_GF_1['satisfaction_level'], HR_GF_1['salary'], color = 'orange', marker ='s')
axes[1,3].set_xlabel('satisfaction_level')
axes[1,3].set_ylabel('salary')

By comparing the top and bottom figures, the dataset is separable with respect to the left feature.

**For the ML algorithms**, we need to convert categorical features into numbers. We will use the label encoder from the sklearn library to encode the categorical feature (salary) as follows

In [0]:
from sklearn.preprocessing import LabelEncoder
le_salary = LabelEncoder()
HR_GF_LE = HR_GF.copy()
HR_GF_LE['salary'] = le_salary.fit_transform(HR_GF_LE['salary'])
HR_GF_LE.head()
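It helps to check which number the encoder assigned to each category. LabelEncoder sorts the labels, so the codes follow alphabetical order and not the natural low < medium < high order. The sketch below shows the mapping on a short made-up list, followed by an explicit ordinal mapping (salary_order is a hypothetical name) for cases where the order matters.

```python
from sklearn.preprocessing import LabelEncoder

salaries = ["low", "medium", "high", "low"]  # made-up sample values

le_demo = LabelEncoder()
codes = le_demo.fit_transform(salaries)

print(list(le_demo.classes_))  # position in this list is the assigned code
print(list(codes))

# Alternative (assumption): an explicit mapping that preserves the salary order
salary_order = {"low": 0, "medium": 1, "high": 2}
ordinal_codes = [salary_order[s] for s in salaries]
print(ordinal_codes)
```

With the label encoder, 'high' is encoded as 0, 'low' as 1, and 'medium' as 2.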

Let us define input (x) and output (y) of the model

In [0]:
x = HR_GF_LE.drop('left',axis=1)
y = HR_GF_LE.left

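Use one-hot encoding for salary so that it can be used by the logistic regression algorithm

In [0]:
## add one-hot columns for logistic regression (training)
dm = pd.get_dummies(x.salary, prefix='salary')  # string column names: salary_0, salary_1, salary_2
x_lr = pd.concat([x,dm],axis=1)
x_lr = x_lr.drop(['salary','salary_2'],axis=1)  # drop the original column and one redundant dummy column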

Now, we are ready to apply **KCV** to compare the performance of the different ML techniques

In [0]:
import numpy as np
from sklearn.model_selection import cross_validate

##LR
from sklearn.linear_model import LogisticRegression
Score_lr = cross_validate(LogisticRegression(),x_lr,y,return_train_score=True) ##x is set to x_lr using one-hot-coding instead of label encoding
ACC_test_lr = np.mean(Score_lr['test_score'])
ACC_train_lr = np.mean(Score_lr['train_score'])
fit_time_lr = np.mean(Score_lr['fit_time'])
score_time_lr = np.mean(Score_lr['score_time'])

##DT
from sklearn.tree import DecisionTreeClassifier
Score_dt = cross_validate(DecisionTreeClassifier(),x,y,return_train_score=True)
ACC_test_dt = np.mean(Score_dt['test_score'])
ACC_train_dt = np.mean(Score_dt['train_score'])
fit_time_dt = np.mean(Score_dt['fit_time'])
score_time_dt = np.mean(Score_dt['score_time'])

##SVM
from sklearn.svm import SVC
Score_svm = cross_validate(SVC(),x,y,return_train_score=True)
ACC_test_svm = np.mean(Score_svm['test_score'])
ACC_train_svm = np.mean(Score_svm['train_score'])
fit_time_svm = np.mean(Score_svm['fit_time'])
score_time_svm = np.mean(Score_svm['score_time'])

##RF
from sklearn.ensemble import RandomForestClassifier
Score_rf = cross_validate(RandomForestClassifier(),x,y,return_train_score=True)
ACC_test_rf = np.mean(Score_rf['test_score'])
ACC_train_rf = np.mean(Score_rf['train_score'])
fit_time_rf = np.mean(Score_rf['fit_time'])
score_time_rf = np.mean(Score_rf['score_time'])

##NB
from sklearn.naive_bayes import GaussianNB
Score_nb = cross_validate(GaussianNB(),x,y,return_train_score=True)
ACC_test_nb = np.mean(Score_nb['test_score'])
ACC_train_nb = np.mean(Score_nb['train_score'])
fit_time_nb = np.mean(Score_nb['fit_time'])
score_time_nb = np.mean(Score_nb['score_time'])

from prettytable import PrettyTable
t = PrettyTable(['Accuracy', 'Logistic' , 'DT' , 'SVM' , 'RF', 'NB'])
t.add_row(['Training (%)', ACC_train_lr*100, ACC_train_dt*100, ACC_train_svm*100, ACC_train_rf*100, ACC_train_nb*100])
t.add_row(['Testing (%)', ACC_test_lr*100, ACC_test_dt*100, ACC_test_svm*100, ACC_test_rf*100, ACC_test_nb*100])
t.add_row(['fit_time per fold', fit_time_lr, fit_time_dt, fit_time_svm, fit_time_rf, fit_time_nb])
t.add_row(['score_time per fold', score_time_lr, score_time_dt, score_time_svm, score_time_rf, score_time_nb])
print(t)

**Comment on the training and testing accuracies in the table above.**

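Let us try to approach **leave-one-out cross-validation** as in Case #1, by computing the largest number of stratified folds that the dataset allows (the size of the smaller class)

In [0]:
## compute the maximum possible cv value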
y_0 =  HR_GF[HR_GF['left'] == 0]
y_1 =  HR_GF[HR_GF['left'] == 1]
cv_value = int(min(y_0.shape[0], y_1.shape[0]))
cv_value

This is a large number, and running that many folds for all ML techniques would take a lot of time. Thus, we will use a cv value of 10

In [0]:
cv_value = 10

import numpy as np
from sklearn.model_selection import cross_validate

##LR
from sklearn.linear_model import LogisticRegression
Score_lr = cross_validate(LogisticRegression(),x_lr,y,return_train_score=True, cv = cv_value) ##x is set to x_lr using one-hot-coding instead of label encoding
ACC_test_lr = np.mean(Score_lr['test_score'])
ACC_train_lr = np.mean(Score_lr['train_score'])
fit_time_lr = np.sum(Score_lr['fit_time'])
score_time_lr = np.sum(Score_lr['score_time'])

##DT
from sklearn.tree import DecisionTreeClassifier
Score_dt = cross_validate(DecisionTreeClassifier(),x,y,return_train_score=True, cv = cv_value)
ACC_test_dt = np.mean(Score_dt['test_score'])
ACC_train_dt = np.mean(Score_dt['train_score'])
fit_time_dt = np.sum(Score_dt['fit_time'])
score_time_dt = np.sum(Score_dt['score_time'])

##SVM
from sklearn.svm import SVC
Score_svm = cross_validate(SVC(),x,y,return_train_score=True, cv = cv_value)
ACC_test_svm = np.mean(Score_svm['test_score'])
ACC_train_svm = np.mean(Score_svm['train_score'])
fit_time_svm = np.sum(Score_svm['fit_time'])
score_time_svm = np.sum(Score_svm['score_time'])

##RF
from sklearn.ensemble import RandomForestClassifier
Score_rf = cross_validate(RandomForestClassifier(),x,y,return_train_score=True, cv = cv_value)
ACC_test_rf = np.mean(Score_rf['test_score'])
ACC_train_rf = np.mean(Score_rf['train_score'])
fit_time_rf = np.sum(Score_rf['fit_time'])
score_time_rf = np.sum(Score_rf['score_time'])

##NB
from sklearn.naive_bayes import GaussianNB
Score_nb = cross_validate(GaussianNB(),x,y,return_train_score=True, cv = cv_value)
ACC_test_nb = np.mean(Score_nb['test_score'])
ACC_train_nb = np.mean(Score_nb['train_score'])
fit_time_nb = np.sum(Score_nb['fit_time'])
score_time_nb = np.sum(Score_nb['score_time'])

from prettytable import PrettyTable
t = PrettyTable(['Accuracy', 'Logistic' , 'DT' , 'SVM' , 'RF', 'NB'])
t.add_row(['Training (%)', ACC_train_lr*100, ACC_train_dt*100, ACC_train_svm*100, ACC_train_rf*100, ACC_train_nb*100])
t.add_row(['Testing (%)', ACC_test_lr*100, ACC_test_dt*100, ACC_test_svm*100, ACC_test_rf*100, ACC_test_nb*100])
t.add_row(['fit_time (total)', fit_time_lr, fit_time_dt, fit_time_svm, fit_time_rf, fit_time_nb])
t.add_row(['score_time (total)', score_time_lr, score_time_dt, score_time_svm, score_time_rf, score_time_nb])
print(t)

**1- Comment on the training and testing accuracies in the table above.**

**2- Use KCV to tune the number of trees parameter in RF.**

# Case #3: Recognition of Handwritten Digits

In this section, we will fine-tune the parameters of **RandomForest** (RF) using the **K-Fold Cross Validation** to recognize handwritten digits. We will be using a standard dataset available through the sklearn library, loaded with "load_digits".$^{[1][2]}$

[1] https://scikit-learn.org/stable/tutorial/basic/tutorial.html#introduction

[2] https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html

**Implementation** *(you may run the first few cells quickly if you have done this problem in the previous tutorials)*

In the beginning, we will load the dataset as follows

In [0]:
from sklearn.datasets import load_digits
digits = load_digits()

A dataset is a dictionary-like object that holds all the data and some metadata about the data. Let us explore the content of the digits dataset

In [0]:
dir(digits)

The digits.data contains the features that will be used to classify the digits samples

In [0]:
print(digits.data)

The digits.images contains the images of the digits samples. They can be viewed using the following code

In [0]:
import matplotlib.pyplot as plt
plt.gray()
plt.matshow(digits.images[0])

The ground truth of the dataset is stored in digits.target

In [0]:
print(digits.target)

Let us use Principal Component Analysis to view the digits dataset. We will plot a projection onto the first two principal axes

In [0]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
proj = pca.fit_transform(digits.data)
plt.scatter(proj[:, 0], proj[:, 1], c=digits.target, cmap="Paired")
plt.colorbar()
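A two-dimensional projection keeps only part of the information in the 64 pixel features. The sketch below repeats the PCA fit and reads explained_variance_ratio_ to see which fraction of the total variance each of the two components captures; the _demo variable names are only illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits_demo = load_digits()

pca_demo = PCA(n_components=2)
proj_demo = pca_demo.fit_transform(digits_demo.data)

ratio = pca_demo.explained_variance_ratio_  # one value per principal component
print("projected shape:", proj_demo.shape)
print("variance captured by each component:", ratio)
print("variance captured in total:", ratio.sum())
```

If the total is well below 1, overlapping digit clusters in the scatter plot do not necessarily mean that the classes overlap in the full feature space.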

After exploring the content of the digits dataset, we will design a classifier using **RF**. First, we decide the input feature vector (x) and the ground truth (y)

In [0]:
x = digits.data
y = digits.target

Here we will train the **RF** model and compute the accuracies

In [0]:
cv_value = 5 ## default value

import numpy as np
from sklearn.model_selection import cross_validate

##RF
from sklearn.ensemble import RandomForestClassifier
Score_rf = cross_validate(RandomForestClassifier(),x,y,return_train_score=True, cv = cv_value)
ACC_test_rf = np.mean(Score_rf['test_score'])
ACC_train_rf = np.mean(Score_rf['train_score'])
fit_time_rf = np.sum(Score_rf['fit_time'])
score_time_rf = np.sum(Score_rf['score_time'])

from prettytable import PrettyTable
t = PrettyTable(['Accuracy','RF '])
t.add_row(['Training (%)', ACC_train_rf*100])
t.add_row(['Testing (%)',  ACC_test_rf*100])
t.add_row(['fit_time (total)', fit_time_rf])
t.add_row(['score_time (total)', score_time_rf])
print(t)
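Random forests are randomized, so the accuracies above can change slightly from run to run. When repeatable numbers are needed, one option is to fix random_state both in the classifier and in a shuffled StratifiedKFold splitter. The sketch below assumes a smaller forest (20 trees) to keep it fast; run_once is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

x_demo, y_demo = load_digits(return_X_y=True)

def run_once():
    """Run one 5-fold cross-validation with fixed seeds and return the fold scores."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    return cross_val_score(model, x_demo, y_demo, cv=cv)

scores_a = run_once()
scores_b = run_once()
print("fold scores:", scores_a)
print("identical across runs:", np.array_equal(scores_a, scores_b))
```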


We will then tune the number of trees in **RF**


In [0]:
x = digits.data
y = digits.target

from sklearn.model_selection import cross_validate
from prettytable import PrettyTable
t = PrettyTable(['n_estimators','Accuracy of RF'])

##RF (cv = 10)
from sklearn.ensemble import RandomForestClassifier
for n_trees in range(10,200,10):
  Score_rf = cross_validate(RandomForestClassifier(n_estimators = n_trees),x,y,return_train_score=True, cv = 10)
  ACC_test_rf = np.mean(Score_rf['test_score'])
  t.add_row([n_trees, ACC_test_rf*100])

print(t)

**1- Comment on the training and testing accuracies in the table above**

**2- Check how the number of folds (k) affects the accuracy of RF**

# Exercises

**Exercise #1**

**Exercise #2**