# Cross-Validation and Grid Search

- A validation dataset is a sample of data held back from training that is used to estimate model skill while tuning the model's hyperparameters.
- A test dataset is also held back from training, but is used to give an unbiased estimate of the skill of the final tuned model when comparing or selecting between final models.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
import numpy as np
from scipy.stats import uniform, randint 

In [3]:
df = pd.read_csv("wine.csv")
df.head()

|   | alcohol | sugar | pH | class |
|---|---|---|---|---|
| 0 | 9.4 | 1.9 | 3.51 | 0.0 |
| 1 | 9.8 | 2.6 | 3.20 | 0.0 |
| 2 | 9.8 | 2.3 | 3.26 | 0.0 |
| 3 | 9.8 | 1.9 | 3.16 | 0.0 |
| 4 | 9.4 | 1.9 | 3.51 | 0.0 |


In [4]:
x = df.drop("class", axis=1)
y = df["class"]

In [5]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=14)

In [6]:
x_sub, x_val, y_sub, y_val = train_test_split(x_train, y_train, test_size=0.2, stratify = y_train, random_state=14)

In [7]:
print(x_sub.shape, x_val.shape)

(4157, 3) (1040, 3)


In [8]:
print(x_train.shape, x_test.shape)

(5197, 3) (1300, 3)


## Model Training

In [9]:
dt = DecisionTreeClassifier(random_state=14)
dt.fit(x_sub, y_sub)
print(dt.score(x_sub, y_sub))
print(dt.score(x_val, y_val))

0.9980755352417608
0.8480769230769231


- The training score (about 0.998) is much higher than the validation score (about 0.848) -> the model is overfitting the training data
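
To make the train/validation gap concrete, here is a minimal sketch on synthetic data. `make_classification` is only a stand-in for wine.csv, and the names `X_demo` and `scores_by_depth` are illustrative assumptions. It compares a shallow tree with an unrestricted one; the exact numbers will differ from the wine results above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (assumption, not the wine data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=14,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=14
)

# Map max_depth -> (training accuracy, validation accuracy)
scores_by_depth = {}
for depth in (2, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=14)
    tree.fit(X_tr, y_tr)
    scores_by_depth[depth] = (tree.score(X_tr, y_tr), tree.score(X_va, y_va))

for depth, (train_acc, val_acc) in scores_by_depth.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, val={val_acc:.3f}")
```

The unrestricted tree memorizes the training subset, so its training accuracy reaches 1.0 while its validation accuracy stays lower. A large gap of this kind is the usual sign of overfitting.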

## Cross Validation

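- Problem: the more data we set aside for validation, the less remains for training, and vice versa
- Cross-validation addresses this by rotating the validation fold, so every sample is used for both training and validation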
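
The idea can be sketched by hand: split the training data into k folds, train on k-1 of them, validate on the remaining fold, and average the k scores. The snippet below is a minimal illustration on synthetic data (an assumption, not the wine data), using `StratifiedKFold` only to generate the fold indices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=14
)

fold_scores = []
splitter = StratifiedKFold(n_splits=5)
for train_idx, val_idx in splitter.split(X_demo, y_demo):
    # Train a fresh model on 4 folds, score it on the held-out fold
    model = DecisionTreeClassifier(random_state=14)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))

print(fold_scores)
print(np.mean(fold_scores))  # the cross-validation score
```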

In [10]:
scores = cross_validate(dt, x_train, y_train)
print(scores)

{'fit_time': array([0.00921321, 0.00720477, 0.00641298, 0.00565004, 0.00564933]), 'score_time': array([0.00109982, 0.00103807, 0.000736  , 0.00065899, 0.00064373]), 'test_score': array([0.85961538, 0.84326923, 0.86814244, 0.85948027, 0.84600577])}


- fit_time: time taken to train the model on each fold
- score_time: time taken to score the model on each fold
- test_score: validation score for each fold
    - The final cross-validation score is the mean of test_score

In [11]:
print(np.mean(scores["test_score"]))

0.8553026208632561


Things to Note
- cross_validate does not shuffle the training set before splitting it into folds
    - In the current example, the entire data was already shuffled by train_test_split, so there is no need to shuffle it separately.
    - If you need to shuffle the training set during cross-validation, you must specify a splitter.

- Splitter
    - Decides how the folds are divided in cross-validation
    - By default, cross_validate uses KFold for regression models and StratifiedKFold for classification models, so that each fold keeps roughly the same class proportions as the target variable.
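
As a quick check of this default, the sketch below compares `cross_validate` with no `cv` argument against an explicit `StratifiedKFold(n_splits=5)` on a classifier, and prints the share of the positive class in each validation fold. The imbalanced synthetic data and the names `default_scores`, `explicit_scores`, and `fold_pos_rates` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data (roughly 80% / 20%), an assumption for illustration
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=14,
)
tree = DecisionTreeClassifier(random_state=14)

# Default splitter vs. an explicit, unshuffled StratifiedKFold
default_scores = cross_validate(tree, X_demo, y_demo)["test_score"]
explicit_scores = cross_validate(
    tree, X_demo, y_demo, cv=StratifiedKFold(n_splits=5)
)["test_score"]
same_scores = np.allclose(default_scores, explicit_scores)
print(same_scores)

# Share of class 1 in each validation fold: stratification keeps it nearly constant
fold_pos_rates = [
    y_demo[val_idx].mean()
    for _, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo)
]
print(fold_pos_rates)
```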

In [12]:
from sklearn.model_selection import StratifiedKFold


splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=14)
scores = cross_validate(dt, x_train, y_train, cv=splitter)
print(np.mean(scores["test_score"]))


0.8585841855639543


## Grid Search

- parameters: values the machine learning model learns during training
- hyperparameters: values the model cannot learn, so a human must choose them

Hyperparameter tuning order
1. Train the model with default values
2. Adjust the hyperparameters based on validation set scores or cross-validation.
    - Depending on the model, there may be as few as 1 or 2 hyperparameters to tune, or 5 or 6 or more.

- When tuning hyperparameters, the optimal values of multiple hyperparameters must be found simultaneously.
    - Example: even if you find the optimal value of parameter A, that optimum can shift when the value of parameter B changes.
    - As the number of hyperparameters increases, the process of finding the optimal values becomes more complicated.
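
A hand-written search makes the cost of tuning several hyperparameters together visible. The following is a minimal sketch, assuming synthetic data and two arbitrarily chosen hyperparameters (`max_depth` and `min_samples_split`); every combination is scored with 5-fold cross-validation.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=14
)

depths = [3, 5, 10]
min_splits = [2, 20]

# Score every (max_depth, min_samples_split) pair with 5-fold cross-validation
results = {}
for depth, min_split in product(depths, min_splits):
    tree = DecisionTreeClassifier(
        max_depth=depth, min_samples_split=min_split, random_state=14
    )
    results[(depth, min_split)] = cross_val_score(tree, X_demo, y_demo, cv=5).mean()

best_combo = max(results, key=results.get)
for combo, score in results.items():
    print(combo, round(score, 4))
print("best:", best_combo)
```

If the best `min_samples_split` differs from one `max_depth` to another, that is the interaction described above. On some datasets it may not show up, and each added hyperparameter multiplies the number of combinations to loop over.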

- To handle this, use scikit-learn's GridSearchCV.
    - It performs the hyperparameter search and cross-validation in one step

In [13]:
params = {"min_impurity_decrease": [0.0001, 0.0002, 0.0003, 0.0004, 0.0005]}

In [14]:
gs = GridSearchCV(DecisionTreeClassifier(random_state=14), params)

Grid search trains the given model repeatedly while changing its hyperparameter values.
- In the current example, there are 5 candidate values of min_impurity_decrease.
- Since the default value of the cv parameter of GridSearchCV is 5, 5-fold cross-validation is performed for each candidate.
- Therefore, 5 candidate values * 5 folds = 25 model fits
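
The arithmetic can be checked with `ParameterGrid`, which enumerates the candidates a grid search would try. This is a small sketch; the second, two-parameter grid (`params_two`) is a hypothetical example added only to show how the count multiplies.

```python
from sklearn.model_selection import ParameterGrid

params = {"min_impurity_decrease": [0.0001, 0.0002, 0.0003, 0.0004, 0.0005]}
n_folds = 5  # default cv of GridSearchCV

n_candidates = len(ParameterGrid(params))
total_fits = n_candidates * n_folds
print(n_candidates, total_fits)  # 5 25

# Hypothetical grid with a second hyperparameter: candidates multiply
params_two = {
    "min_impurity_decrease": [0.0001, 0.0002, 0.0003, 0.0004, 0.0005],
    "max_depth": [3, 5, 10],
}
n_candidates_two = len(ParameterGrid(params_two))
print(n_candidates_two, n_candidates_two * n_folds)  # 15 75
```

With the default `refit=True`, GridSearchCV performs one additional fit on the entire training set after the search.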

In [15]:
gs.fit(x_train, y_train)

- After the search, GridSearchCV automatically retrains the model on the entire training set using the parameter combination with the highest mean cross-validation score among the 5 candidates.
- The retrained model is stored in the best_estimator_ attribute.

In [16]:
dt = gs.best_estimator_
print(dt.score(x_train, y_train))

0.9318837791033289


In [17]:
print(gs.best_params_)
# best parameters found by gridsearch

{'min_impurity_decrease': 0.0002}


The mean cross-validation score for each parameter candidate is stored under the "mean_test_score" key of the cv_results_ attribute.

In [18]:
print(gs.cv_results_["mean_test_score"])

[0.86703931 0.87435219 0.86607926 0.86357796 0.86376657]
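
The best parameters can also be recovered from `cv_results_` directly, since its "params" key lists the candidates in the same order as "mean_test_score". The sketch below runs a small grid search on synthetic data (an assumption; `gs_demo` and `params_demo` are illustrative names) and confirms that the argmax lookup agrees with `best_params_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine training set (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=14
)

params_demo = {"min_impurity_decrease": [0.0001, 0.0002, 0.0003, 0.0004, 0.0005]}
gs_demo = GridSearchCV(DecisionTreeClassifier(random_state=14), params_demo)
gs_demo.fit(X_demo, y_demo)

# Position of the highest mean cross-validation score
best_index = int(np.argmax(gs_demo.cv_results_["mean_test_score"]))
# Look up the candidate at that position
best_from_results = gs_demo.cv_results_["params"][best_index]

print(best_from_results)
print(best_from_results == gs_demo.best_params_)
```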


Steps grid search follows to find the optimal parameters
1. Specify the parameters to search over
2. Perform a grid search on the training set to find the parameter combination that yields the best mean validation score.
3. Train the final model on the entire training set with the optimal parameters found in step 2.
4. Save the model created as a result of step 3

## Random Search
- When a hyperparameter is numeric, it can be hard to choose its range or step size in advance, and when there are many hyperparameter combinations, grid search may take a long time.
- RandomizedSearchCV is not given lists of candidate values. Instead, it is given probability distribution objects from which values can be sampled.

Random search: define a search space as a bounded domain of hyperparameter values and randomly sample points in that domain.

In [19]:
randint(0, 10).rvs(10)

array([2, 2, 1, 5, 7, 4, 1, 1, 5, 2])

In [20]:
uniform(0, 1).rvs(10)

array([0.40268776, 0.17494014, 0.77175371, 0.39494119, 0.01339984,
       0.80023552, 0.22655798, 0.19757489, 0.10737044, 0.65655829])
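
One detail worth checking is how scipy parameterizes these distributions: `randint(low, high)` excludes `high`, and `uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, not from `loc` to `scale`. The sketch below draws many samples and prints the observed bounds; the sample size and seed are arbitrary choices.

```python
from scipy.stats import randint, uniform

# Integers from 0 up to 9 (the upper bound 10 is excluded)
int_draws = randint(0, 10).rvs(1000, random_state=14)
print(int_draws.min(), int_draws.max())

# uniform(loc, scale): floats between loc and loc + scale,
# so this covers 0.0001 to 0.0011 rather than 0.0001 to 0.001
float_draws = uniform(0.0001, 0.001).rvs(1000, random_state=14)
print(float_draws.min(), float_draws.max())
```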

In [21]:
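# uniform(loc, scale) samples floats between loc and loc + scale (0.0001 to 0.0011 here)
# randint(low, high) samples integers from low up to high - 1
params = {"min_impurity_decrease": uniform(0.0001, 0.001),
          "max_depth": randint(20, 50),
          "min_samples_split": randint(2, 25),
          "min_samples_leaf": randint(1, 25)}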

In [22]:
rs = RandomizedSearchCV(DecisionTreeClassifier(random_state = 14), params, n_iter = 100, n_jobs = -1, random_state=14)
rs.fit(x_train, y_train)

- With n_iter=100, a total of 100 parameter combinations are sampled from the ranges defined in params above, and each one is evaluated with cross-validation.
- This searches a wide area effectively while needing fewer cross-validation runs than a comparably fine grid search
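
To see what kind of candidates are drawn, `ParameterSampler` can be used on the same kind of search space. This is a minimal sketch; drawing only 5 candidates and the name `candidates` are choices made here for brevity, not something the search above does.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

params = {"min_impurity_decrease": uniform(0.0001, 0.001),
          "max_depth": randint(20, 50),
          "min_samples_split": randint(2, 25),
          "min_samples_leaf": randint(1, 25)}

# Draw 5 random parameter combinations from the distributions above
candidates = list(ParameterSampler(params, n_iter=5, random_state=14))
for candidate in candidates:
    print(candidate)
```

With n_iter=100 and the default 5-fold cross-validation, the search fits 100 * 5 = 500 models, plus one final refit on the entire training set.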

In [23]:
print(rs.best_params_)

{'max_depth': 46, 'min_impurity_decrease': 0.00023308405138174716, 'min_samples_leaf': 2, 'min_samples_split': 13}


In [24]:
print(np.max(rs.cv_results_["mean_test_score"]))

0.8680039979270008


In [25]:
print(gs.best_estimator_.score(x_test, y_test))
print(rs.best_estimator_.score(x_test, y_test))

0.8738461538461538
0.8723076923076923
