## Cross Validation
**Cross validation** is a technique used to evaluate the performance of a predictive model and assess its accuracy. It works by dividing the dataset into training and testing sets, then using the training set to build the model and the testing set to evaluate the model’s performance. This process is repeated multiple times with different combinations of training and testing sets, and the results are averaged to get an overall accuracy score. Because the score no longer depends on one particular split, cross validation gives a more reliable estimate of how the model generalizes and makes overfitting easier to spot.

In [3]:
# Required libraries...
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

In [4]:
# To split the dataset into train and test samples:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size = 0.2)

In [5]:
# Applying Logistic Regression Classifier:
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9722222222222222
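
The warning above suggests raising max_iter or scaling the data. A minimal sketch of both, assuming a StandardScaler pipeline is acceptable here; max_iter=1000 and random_state=0 are illustrative choices, not values from the original notebook:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0
)

# Standardize the features, then fit logistic regression with a higher
# iteration cap (both values below are assumptions for illustration).
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_train, y_train)
scaled_lr_score = scaled_lr.score(X_test, y_test)
print(scaled_lr_score)
```

Putting the scaler inside a pipeline means it is fitted on the training data only, which also keeps it safe to use later with cross validation.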

In [6]:
# Applying Support Vector Machine Classifier: 
svm = SVC()
svm.fit(X_train, y_train)
svm.score(X_test, y_test)

0.9861111111111112

In [8]:
# Applying Random Forest Classifier:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.9722222222222222

* These are quick results for the three algorithms. We split the data so that 80% is used for training and 20% for testing. However, train_test_split picks the samples at random, so running it again produces a different split and therefore different scores. This is the key problem with a single train/test split: you cannot run the models once and judge which one performs better than the others. You have to evaluate them on several different splits.
* The next part looks at K-fold cross validation, which does this systematically.
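
The variability described above can be made visible by repeating the split with different seeds. This is only a sketch: the seeds 0 to 4 and the choice of SVC are arbitrary, and the exact numbers will depend on the installed scikit-learn version.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()

split_scores = []
for seed in range(5):  # seeds 0..4 are arbitrary illustrative values
    X_tr, X_te, y_tr, y_te = train_test_split(
        digits.data, digits.target, test_size=0.2, random_state=seed
    )
    # Same model, same data, different random split.
    split_scores.append(SVC().fit(X_tr, y_tr).score(X_te, y_te))

print(split_scores)
print("spread:", max(split_scores) - min(split_scores))
```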

In [9]:
# Let's call K-Fold:
# n_splits specifies how many folds you want.
from sklearn.model_selection import KFold
kf = KFold(n_splits = 3)
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [11]:
# Now the folds are created. The way we use this K-fold on the datasets is:
for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]


* As we can see, the sample dataset [1-9] is split into three folds. In the first iteration, the first fold [0 1 2] is used for testing and the remaining [3 4 5 6 7 8] for training. In the second iteration, [3 4 5] is used for testing and [0 1 2 6 7 8] for training, and similarly in the third iteration. Note that the printed values are indices into the list, not the list elements themselves.
* Now we apply K-fold to our Digits example.
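
A small check of the property that makes K-fold useful: within a split the training and test indices never overlap, and across all splits every sample is tested exactly once. The sketch below reuses the toy values 1 to 9 from above; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(1, 10)  # the same toy values 1..9 used above
kf = KFold(n_splits=3)

test_indices = []
for train_index, test_index in kf.split(data):
    # No index is in both the training and the test part of one split.
    assert set(train_index).isdisjoint(test_index)
    test_indices.extend(test_index.tolist())

# Across the three splits, each index 0..8 appears in a test fold once.
print(sorted(test_indices))
```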

In [13]:
# To simplify things, we create a generic function that takes a model, X_train, X_test, y_train and y_test as input
# and trains the given model.
# Once the model is trained, the function returns the model's score on the test data.
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [15]:
# We can also use this function to check the performance of the three algorithms above. We just pass a different model and the
# function returns the corresponding score:
score = get_score(RandomForestClassifier(), X_train, X_test, y_train, y_test)
score

0.9694444444444444

In [16]:
# Now that we have this function, we want to apply K-fold to our Digits dataset. Here we create a StratifiedKFold. It is
# similar to KFold, but a little better in that, when separating the folds, it keeps the proportion of each class
# roughly the same in every fold. This property can be very helpful.
from sklearn.model_selection import StratifiedKFold
folds = StratifiedKFold(n_splits = 3)
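
To see what stratification changes, one can count how many test samples of each digit land in each fold for both splitters. This is a rough sketch; the helper class_counts_per_fold is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, StratifiedKFold

digits = load_digits()
X, y = digits.data, digits.target

def class_counts_per_fold(splitter):
    """Return one row per fold: number of test samples in each digit class."""
    return np.array([
        np.bincount(y[test_index], minlength=10)
        for _, test_index in splitter.split(X, y)
    ])

plain_counts = class_counts_per_fold(KFold(n_splits=3))
stratified_counts = class_counts_per_fold(StratifiedKFold(n_splits=3))

print("KFold:\n", plain_counts)
print("StratifiedKFold:\n", stratified_counts)
```

Each column is one digit class. With StratifiedKFold the counts in a column should be nearly equal across the three rows, while plain KFold simply follows the order of the samples in the dataset.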

In [29]:
# Now that the folds are ready, we create lists to store the scores of the different models:
scores_lg = []
scores_svm = []
scores_rf = []

for train_index, test_index in kf.split(digits.data):
    X_train, X_test, y_train, y_test = digits.data[train_index], digits.data[test_index], \
                                       digits.target[train_index], digits.target[test_index]

    # We have three folds, so this loop iterates three times. Each time it takes X_train, X_test, y_train and y_test,
    # measures the performance of the three models and appends the scores to the lists above.
    # Note: the loop splits with kf (the plain KFold from In [9]); to use the stratified folds created in In [16],
    # iterate over folds.split(digits.data, digits.target) instead.
        
    scores_lg.append(get_score(LogisticRegression(), X_train, X_test, y_train, y_test))
    scores_svm.append(get_score(SVC(), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score( RandomForestClassifier(), X_train, X_test, y_train, y_test))
        

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [30]:
# Logistic Regression Scores:
scores_lg

[0.9232053422370617, 0.9415692821368948, 0.9148580968280468]

In [31]:
# Support Vector Machine Scores:
scores_svm

[0.9666110183639399, 0.9816360601001669, 0.9549248747913188]

In [32]:
# Random Forest Scores:
scores_rf

[0.9298831385642737, 0.9649415692821369, 0.9215358931552587]

* As a result, we can average the scores of each model and decide which one performs better.
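
The averaging step can be sketched as follows, using the per-fold scores printed above (copied in by hand, so a fresh run may give slightly different values):

```python
import numpy as np

# Per-fold scores copied from the outputs above.
scores_lg = [0.9232053422370617, 0.9415692821368948, 0.9148580968280468]
scores_svm = [0.9666110183639399, 0.9816360601001669, 0.9549248747913188]
scores_rf = [0.9298831385642737, 0.9649415692821369, 0.9215358931552587]

mean_scores = {
    "LogisticRegression": np.mean(scores_lg),
    "SVC": np.mean(scores_svm),
    "RandomForestClassifier": np.mean(scores_rf),
}
for name, mean in mean_scores.items():
    print(f"{name}: {mean:.4f}")

# Pick the model with the highest mean score.
best_model = max(mean_scores, key=mean_scores.get)
print("best:", best_model)
```

On these numbers SVC has the highest mean (about 0.968), followed by random forest (about 0.939) and logistic regression (about 0.927).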

In [33]:
# Instead of writing the code above, sklearn provides a function called cross_val_score that does the same thing we
# did in [29]. The code above was only meant to explain the concept. In practice, you don't need to write
# that much code.
# Here is the function:
from sklearn.model_selection import cross_val_score

In [41]:
# The function takes three main arguments: 1) the model 2) your 'X' 3) your 'y' or 'target'.
# By default it runs 5-fold cross validation, which is why five scores are returned:
cross_val_score(RandomForestClassifier(n_estimators = 100), digits.data, digits.target)

array([0.93333333, 0.91388889, 0.94428969, 0.97214485, 0.91922006])

* Trying different parameter values for a model (for example n_estimators above) to improve its performance is called **parameter tuning.**
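
A minimal sketch of parameter tuning with cross_val_score. The candidate values 5, 20 and 50 are arbitrary illustrative choices, and random_state is fixed only so that repeated runs give the same numbers.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

digits = load_digits()

tuning_results = {}
for n_estimators in (5, 20, 50):  # hypothetical candidate values
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    fold_scores = cross_val_score(model, digits.data, digits.target, cv=5)
    # Keep the mean accuracy over the five folds for this setting.
    tuning_results[n_estimators] = fold_scores.mean()

for n_estimators, mean_score in tuning_results.items():
    print(n_estimators, round(mean_score, 4))
```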

In [37]:
# For SVM:
cross_val_score(SVC(), digits.data, digits.target)

array([0.96111111, 0.94444444, 0.98328691, 0.98885794, 0.93871866])
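
To connect cross_val_score with the manual loop from earlier, the sketch below passes a KFold object through the cv parameter and compares the result with a hand-written loop over the same splits. SVC is used because its fit is deterministic, so both approaches should give the same scores.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target
kf = KFold(n_splits=3)

# Manual loop: fit on the training part of each split, score on the rest.
manual_scores = []
for train_index, test_index in kf.split(X):
    model = SVC().fit(X[train_index], y[train_index])
    manual_scores.append(model.score(X[test_index], y[test_index]))

# The same splitter handed to cross_val_score through its cv parameter.
auto_scores = cross_val_score(SVC(), X, y, cv=kf)

print(manual_scores)
print(auto_scores)
```

When cv is left at its default for a classifier, scikit-learn uses stratified 5-fold splitting, which is why the calls above return five scores that differ from the three-fold loop.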

* As we can see, cross validation is very useful. It not only lets you compare different algorithms, it also lets you compare the same algorithm with different parameters to see how they affect performance.
* Machine learning does not offer a fixed rule saying that a given problem calls for this or that model. It is based on trial and error: for a given problem and dataset, you need to try various models with various parameters and then figure out which one works best for your use case.

### Exercise
Use the iris flower dataset from the sklearn library and apply cross_val_score to the following models to measure the performance of each. In the end, figure out which model performs best:

    1. Logistic Regression
    2. SVM
    3. Decision Tree
    4. Random Forest

In [42]:
# Call for the dataset:
from sklearn.datasets import load_iris
iris = load_iris()

In [43]:
# Describes the dataset:
dir(iris)

['DESCR',
 'data',
 'data_module',
 'feature_names',
 'filename',
 'frame',
 'target',
 'target_names']

In [44]:
# Call to cross_val_score:
from sklearn.model_selection import cross_val_score

In [45]:
# Logistic Regression():
cross_val_score(LogisticRegression(), iris.data, iris.target)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


array([0.96666667, 1.        , 0.93333333, 0.96666667, 1.        ])

In [46]:
# Support Vector Machine():
cross_val_score(SVC(), iris.data, iris.target)

array([0.96666667, 0.96666667, 0.96666667, 0.93333333, 1.        ])

In [47]:
# Decision Tree():
from sklearn.tree import DecisionTreeClassifier
cross_val_score(DecisionTreeClassifier(), iris.data, iris.target)

array([0.96666667, 0.96666667, 0.9       , 0.96666667, 1.        ])

In [48]:
# Random Forest Classifier():
cross_val_score(RandomForestClassifier(), iris.data, iris.target)

array([0.96666667, 0.96666667, 0.93333333, 0.96666667, 1.        ])

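* **Averaging the five fold scores above, Logistic Regression (about 0.973) comes out slightly ahead of SVM and Random Forest (about 0.967 each) and Decision Tree (0.960) on this dataset. The differences are small, and the scores of the tree-based models can change from run to run.**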
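
To compare the four models on a single number, the per-fold scores can be averaged in one loop. This is a sketch under a few assumptions: max_iter=1000 is set only to avoid the convergence warning, and random_state=0 is fixed so the tree-based models give repeatable results.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Mean accuracy over the default five folds for each model.
iris_means = {
    name: cross_val_score(model, iris.data, iris.target).mean()
    for name, model in models.items()
}

# Print the models from best to worst mean score.
for name, mean_score in sorted(iris_means.items(), key=lambda item: -item[1]):
    print(f"{name}: {mean_score:.4f}")
```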

That was all about cross validation.