
# Grid Search CV

by Emil Vassev

October 3, 2022, October 5, 2022
<br><br>
Copyright (C) 2022 - All rights reserved, do not copy or distribute without permission of the author.
***

## Optimization Problem 
* The process of maximizing or minimizing a function by systematically choosing input values from an allowed set in order to compute the optimal value of that function.
* In machine learning: the process of adjusting **hyperparameters** to minimize the cost (loss) function using one of the optimization techniques.
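
As a minimal illustration of the first definition, the sketch below minimizes a toy function over a small candidate set. The function `cost` and the candidate values are made up for illustration and are not part of the startup example.

```python
# Toy optimization problem: pick the input from a finite set that minimizes a function.
def cost(x):
    """Hypothetical cost function whose minimum is at x = 3."""
    return (x - 3) ** 2

candidates = [0, 1, 2, 3, 4, 5]       # the set of input values to choose from
best_x = min(candidates, key=cost)    # evaluate every candidate and keep the cheapest one
best_cost = cost(best_x)
print(best_x, best_cost)
```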

## Hyperparameters 
* describe the structure of the model
* need to be set before starting to train the model
* two kinds: model hyperparameters and algorithm hyperparameters
* examples: model (topology and size of a neural network), algorithm (learning rate)
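
The distinction between a hyperparameter (set before training) and a parameter (learned during training) can be sketched with scikit-learn's `LinearRegression`. The tiny data set below is invented for illustration; only `fit_intercept`, `get_params`, `coef_` and `intercept_` come from the library.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data following y = 2x + 1 exactly.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = 2.0 * X_toy.ravel() + 1.0

model = LinearRegression(fit_intercept=True)   # hyperparameter: chosen before training
hyperparams = model.get_params()               # available before fit() is ever called

model.fit(X_toy, y_toy)                        # parameters: learned from the data
print(hyperparams['fit_intercept'], model.coef_, model.intercept_)
```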

## Exhaustive Search (Brute-Force Search)
* the process of looking for the best combination of hyperparameters by evaluating every candidate combination
* advantage: every single combination is checked, so the best solution among the candidates is guaranteed to be found
* disadvantage: required time is proportional to the total number of all possible solutions
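
To make the disadvantage concrete, the sketch below counts how many candidates an exhaustive search has to check as more hyperparameters are added. It assumes, purely for illustration, that each hyperparameter has five candidate values.

```python
from itertools import product

values_per_hyperparameter = [1, 2, 3, 4, 5]   # hypothetical: five candidates each

sizes = []
for n_hyperparameters in range(1, 5):
    # Every combination of one value per hyperparameter is a separate candidate.
    candidates = list(product(values_per_hyperparameter, repeat=n_hyperparameters))
    sizes.append(len(candidates))

print(sizes)   # the count is multiplied by 5 with every added hyperparameter
```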

## Grid Search
* an Exhaustive Search through a manually specified subset of the hyperparameter space of a learning algorithm; it defines the search space as a grid of hyperparameter values and evaluates every position in the grid
* *search space*: 
    * volume to be searched where each dimension represents a hyperparameter and each point represents one model configuration
    * a point in the search space is a vector with a specific value for each hyperparameter
* *objective*: find a vector that results in the best performance of the model after learning, such as maximum accuracy or minimum loss (error) 
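
The idea can be sketched without any library: each point in the grid is a dictionary with one value per hyperparameter, and the search keeps the point with the best score. The names `grid_search`, `toy_score`, `depth` and `rate` are hypothetical, and the scoring function stands in for "train a model and measure its performance".

```python
from itertools import product

def grid_search(objective, grid):
    """Evaluate every point of the grid and return the best point and its score."""
    names = list(grid)
    best_point, best_score = None, float('-inf')
    for values in product(*(grid[name] for name in names)):
        point = dict(zip(names, values))   # one vector in the search space
        score = objective(**point)
        if score > best_score:
            best_point, best_score = point, score
    return best_point, best_score

def toy_score(depth, rate):
    """Hypothetical performance measure, highest at depth=3 and rate=0.1."""
    return -((depth - 3) ** 2) - (rate - 0.1) ** 2

grid = {'depth': [1, 2, 3, 4], 'rate': [0.01, 0.1, 1.0]}
best_point, best_score = grid_search(toy_score, grid)
print(best_point, best_score)
```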

## Example: Real-Life Scenario
The example data we’re going to use for Exhaustive Search Optimization is a real-life dataset on the profit of 50 startups operating in the US. We analyze several expenses, namely R&D Spend, Administration and Marketing Spend, together with the State in which the startup is located (e.g., California). The Profit is the target.
<br><br>
Data taken from: https://raw.githubusercontent.com/arib168/data/main/50_Startups.csv
<br><br>
*Objective*: Train and test regression models without and with optimization to predict the Profit of a startup if it has calculated expenses such as R&D Spend, Administration expenses, and Marketing Spend. The regression algorithms to be tested are:
* Multiple Linear Regression
* Random Forest Regression

### Step 1. Load Data

In [1]:
import pandas as pd
df = pd.read_csv('data\\50_startups.csv')
df.head(10)

Unnamed: 0,R_D_Spend,Administration,Marketing_Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94
5,131876.9,99814.71,362861.36,New York,156991.12
6,134615.46,147198.87,127716.82,California,156122.51
7,130298.13,145530.06,323876.68,Florida,155752.6
8,120542.52,148718.95,311613.29,New York,152211.77
9,123334.88,108679.17,304981.62,California,149759.96


### Step 2. Remove the 'State' Feature

In [2]:
df_numeric_ftrs = df.select_dtypes(include=['float64','int64'])

In [3]:
df_numeric_ftrs.head(10)

Unnamed: 0,R_D_Spend,Administration,Marketing_Spend,Profit
0,165349.2,136897.8,471784.1,192261.83
1,162597.7,151377.59,443898.53,191792.06
2,153441.51,101145.55,407934.54,191050.39
3,144372.41,118671.85,383199.62,182901.99
4,142107.34,91391.77,366168.42,166187.94
5,131876.9,99814.71,362861.36,156991.12
6,134615.46,147198.87,127716.82,156122.51
7,130298.13,145530.06,323876.68,155752.6
8,120542.52,148718.95,311613.29,152211.77
9,123334.88,108679.17,304981.62,149759.96


### Step 3. Prepare the Datasets - Train and Test

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
Y = df_numeric_ftrs.iloc[:, df_numeric_ftrs.columns.get_loc('Profit')]
X = df_numeric_ftrs.loc[:, df_numeric_ftrs.columns != 'Profit']

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2, random_state=0)

### Step 4. Import the scikit-learn Libraries

In [6]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit

### Step 5. Multiple Linear Regression without Optimization

In [7]:
lr = LinearRegression()
lr.fit(X_train, Y_train)

LinearRegression()

In [8]:
Y_pred = lr.predict(X_test)

In [9]:
mlr_score_without_optimization = metrics.r2_score(Y_test,Y_pred)
print(mlr_score_without_optimization)

0.9393955917820569
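
The value printed above is the coefficient of determination R². As a rough sketch of what `metrics.r2_score` reports for a single target, assuming the standard definition R² = 1 - SS_res / SS_tot, the helper below recomputes it by hand. The function name `r2_score_manual` and the sample values are made up for illustration.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true_demo = [3.0, 5.0, 7.0, 9.0]    # hypothetical targets
y_pred_demo = [2.5, 5.0, 7.5, 9.0]    # hypothetical predictions
r2_demo = r2_score_manual(y_true_demo, y_pred_demo)
print(r2_demo)
```

A score of 1 means perfect predictions, while 0 corresponds to always predicting the mean of the targets.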


### Step 6. Linear Regression with Optimization

#### GridSearchCV
* Exhaustive search over specified hyperparameter values for an estimator.
* Important methods are fit, predict.
* Important arguments:
    * estimator - the model to be optimized
    * param_grid - the search space: a dictionary mapping hyperparameter names to lists of candidate values
    * cv (cross-validation), given as either:
        * the number of folds, or
        * a configured cross-validation object
        
Note: Cross-validation is a technique for evaluating ML models by training several models on subsets of the available input data and evaluating each of them on the complementary subset. Use cross-validation to detect overfitting, i.e., failing to generalize a pattern.
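
The "subset and complementary subset" idea can be sketched in plain Python. The helper `k_fold_indices` is hypothetical and mimics an unshuffled k-fold split; it is not the scikit-learn implementation.

```python
def k_fold_indices(n_samples, n_folds):
    """Return (train_indices, test_indices) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        stop = start + fold_size + (1 if i < remainder else 0)
        test_idx = indices[start:stop]                  # held-out subset
        train_idx = indices[:start] + indices[stop:]    # complementary subset
        folds.append((train_idx, test_idx))
        start = stop
    return folds

folds = k_fold_indices(n_samples=10, n_folds=5)
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```

A model would be trained on each `train_idx` and scored on the matching `test_idx`; a large gap between training and test scores hints at overfitting.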

#### ShuffleSplit
* random permutation cross-validator
* yields indices to split data into training and test sets

Note: Contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
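
A small sketch of what `ShuffleSplit` yields, assuming a placeholder array with 40 rows (the size of the training split when 20% of 50 rows are held out). The array `X_demo` contains dummy values; only the index arrays matter here.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_demo = np.zeros((40, 3))   # placeholder data: 40 samples, 3 features

splitter = ShuffleSplit(n_splits=2, test_size=0.3, random_state=0)
splits = list(splitter.split(X_demo))

for train_idx, test_idx in splits:
    # Each split is an independent random permutation cut into train and test parts.
    print(len(train_idx), len(test_idx))
```

Because each split is drawn independently, the test sets of different splits may share samples, which is the behaviour the note above refers to.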

In [10]:
cv = ShuffleSplit(n_splits=2, test_size=0.3, random_state=0)

#### LinearRegression Hyperparameters

* **fit_intercept** (default=True) - If True, the intercept for the model will be calculated. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).

* **copy_X** (default=True) - If True, X will be copied; else, it may be overwritten.

* **n_jobs** (default=None) - Number of jobs to use for the computation. This will only provide speedup in case of sufficiently large problems. **None** means 1 job and **-1** means using all available processors.

* **positive** (default=False) - When set to True, this hyperparameter forces the coefficients, i.e. the *slopes* (𝛽<sub>1</sub>....𝛽<sub>n</sub>), to be positive; the *intercept* (𝛽<sub>0</sub>) is not constrained.
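
The effect of the positivity constraint in the last bullet can be sketched on invented data with a negative trend. The arrays below are hypothetical; in practice the constrained coefficient ends up non-negative rather than strictly positive.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data with a clearly negative slope: y = 12 - 2x.
X_neg = np.array([[1.0], [2.0], [3.0], [4.0]])
y_neg = np.array([10.0, 8.0, 6.0, 4.0])

free = LinearRegression().fit(X_neg, y_neg)                      # unconstrained fit
constrained = LinearRegression(positive=True).fit(X_neg, y_neg)  # slope not allowed below zero

print(free.coef_, constrained.coef_)
```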

In [11]:
lm_parameters = {'fit_intercept':[True,False], 'copy_X':[True,False]}

In [12]:
lm = GridSearchCV(LinearRegression(),
                  param_grid=lm_parameters,
                  cv=cv,            
                  return_train_score=True)

In [13]:
result = lm.fit(X_train, Y_train)

In [14]:
result.best_params_

{'copy_X': True, 'fit_intercept': True}
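
Besides `best_params_`, the fitted search object keeps the score of every candidate in `cv_results_`. The sketch below runs the same kind of search on synthetic data (generated here only so the block is self-contained; it is not the startup dataset) and prints the mean validation score per combination.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Synthetic stand-in data: 40 samples, 3 features, linear target with an offset of 50.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(40, 3))
y_demo = X_demo @ np.array([0.8, 0.1, 0.05]) + 50 + rng.normal(0, 1, size=40)

search = GridSearchCV(LinearRegression(),
                      param_grid={'fit_intercept': [True, False],
                                  'copy_X': [True, False]},
                      cv=ShuffleSplit(n_splits=2, test_size=0.3, random_state=0))
search.fit(X_demo, y_demo)

# One entry per grid point: 2 x 2 = 4 candidates.
for params, mean_score in zip(search.cv_results_['params'],
                              search.cv_results_['mean_test_score']):
    print(params, round(mean_score, 4))
print(search.best_params_, search.best_score_)
```

Because `refit=True` is the default, the search object retrains the best combination on all data passed to `fit`, which is why `predict` can be called on it directly. When the best combination equals the estimator defaults, as in the output above, the refit model matches the unoptimized one.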

In [15]:
Y_pred = lm.predict(X_test)

In [16]:
mlr_score_with_optimization = metrics.r2_score(Y_test,Y_pred)
print(mlr_score_with_optimization)

0.9393955917820569


In [17]:
print(f"Prediction score without optimization: {mlr_score_without_optimization}")
print(f"Prediction score with optimization: {mlr_score_with_optimization}")

Prediction score without optimization: 0.9393955917820569
Prediction score with optimization: 0.9393955917820569


### Step 7. Random Forest Regression without Optimization

In [18]:
rf_regressor = RandomForestRegressor(n_estimators=100)
rf_regressor.fit(X_train, Y_train)

RandomForestRegressor()

In [19]:
Y_pred = rf_regressor.predict(X_test)

In [20]:
rfr_score_without_optimization = metrics.r2_score(Y_test,Y_pred)
print(rfr_score_without_optimization)

0.9595224380380424


### Step 8. Random Forest Regression with Optimization

In [21]:
cv = ShuffleSplit(n_splits=2, test_size=0.3, random_state=0)

#### RandomForestRegressor Hyperparameters

* **n_estimators** (default=100) - the number of trees in the forest

* **criterion** {“squared_error”, “absolute_error”, “poisson”} (default=“squared_error”) - the function to measure the quality of a split

* **max_depth** (default=None) - the maximum depth of the tree; if None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples

* **min_samples_split** int or float (default=2) - the minimum number of samples required to split an internal node:
    * if int, then consider min_samples_split as the minimum number.
    * if float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split

* **min_samples_leaf** int or float (default=1) - the minimum number of samples required to be at a leaf node. 

* **min_weight_fraction_leaf** (default=0.0) - the minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node

* **max_features** {“sqrt”, “log2”, None}, int or float (default=1.0) - the number of features to consider when looking for the best split

* **max_leaf_nodes** (default=None) - grow trees with at most max_leaf_nodes leaves in best-first fashion; if None, the number of leaf nodes is unlimited

* **min_impurity_decrease** (default=0.0) - a node will be split if this split induces a decrease of the impurity greater than or equal to this value

* **bootstrap** (default=True) - tells whether bootstrap samples are used when building trees; if False, the whole dataset is used to build each tree

* **oob_score** (default=False) - tells whether to use out-of-bag samples to estimate the generalization score; only available if bootstrap=True

* **n_jobs** (default=None) - the number of jobs to run in parallel; fit, predict, decision_path and apply are all parallelized over the trees; None means 1 and -1 means using all processors

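* **random_state** - int, RandomState instance or None (default=None) - controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features)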

* **verbose** (default=0) - controls the verbosity when fitting and predicting

* **warm_start** (default=False) - when set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, just fit a whole new forest

* **ccp_alpha** non-negative float (default=0.0) - a complexity parameter used for Minimal Cost-Complexity Pruning

* **max_samples** int or float (default=None) - if bootstrap is True, the number of samples to draw from X to train each base estimator
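
For the fractional form of **min_samples_split**, the arithmetic in the list above can be checked directly. The values below are hypothetical: 40 samples and a fraction of 0.12.

```python
import math

n_samples = 40              # hypothetical number of training samples
min_samples_split = 0.12    # hypothetical fractional setting

# 0.12 * 40 is about 4.8, which is rounded up to the next whole sample count.
threshold = math.ceil(min_samples_split * n_samples)
print(threshold)
```

With these values a node needs at least 5 samples before it is considered for splitting.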

In [22]:
lm_parameters = {} #an empty dict signifies default parameters

lm = GridSearchCV(RandomForestRegressor(),
                  param_grid=lm_parameters,
                  cv=cv,            
                  return_train_score=True)

In [23]:
result = lm.fit(X_train, Y_train)

In [24]:
result.best_params_

{}

In [25]:
Y_pred = lm.predict(X_test)

In [26]:
rfr_score_with_optimization_empty_param_grid = metrics.r2_score(Y_test,Y_pred)
print(rfr_score_with_optimization_empty_param_grid)

0.9607149523722232


In [27]:
print(rfr_score_without_optimization)
print(rfr_score_with_optimization_empty_param_grid)

0.9595224380380424
0.9607149523722232


In [28]:
'''
lm_parameters = {
    'n_estimators': [2, 3, 4, 5],
    'criterion': ['squared_error', 'absolute_error', 'poisson'],
    'random_state': [0, 1, 2],
    'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}
'''

lm_parameters = {
    'n_estimators': [2, 3, 4, 5],
    'criterion': ['squared_error', 'absolute_error', 'poisson'],
    'random_state': [0, 1, 2]
}

'''
lm_parameters = {
    'max_features':[0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95],
    'n_estimators': [2, 3, 4, 5, 6, 7, 8, 9, 10],
    'criterion': ['squared_error', 'absolute_error', 'poisson'],
    'random_state': [0, 1, 2]
}
'''

lm = GridSearchCV(RandomForestRegressor(),
                  param_grid=lm_parameters,
                  cv=cv,            
                  return_train_score=True)
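
Before running the search it can help to estimate how much work the active grid implies. The sketch below counts the candidates and the resulting number of model fits, assuming the two shuffle splits configured earlier and GridSearchCV's default final refit of the best candidate.

```python
import math

# Grid copied from the active (uncommented) definition above.
lm_parameters = {
    'n_estimators': [2, 3, 4, 5],
    'criterion': ['squared_error', 'absolute_error', 'poisson'],
    'random_state': [0, 1, 2],
}
n_splits = 2   # from ShuffleSplit(n_splits=2, ...)

n_candidates = math.prod(len(values) for values in lm_parameters.values())  # 4 * 3 * 3
n_cv_fits = n_candidates * n_splits       # every candidate is fitted once per split
n_total_fits = n_cv_fits + 1              # plus one refit of the best candidate
print(n_candidates, n_cv_fits, n_total_fits)
```

The larger commented-out grids multiply these numbers accordingly; for instance, adding ten `max_depth` values would multiply the candidate count by ten.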

In [29]:
result = lm.fit(X_train, Y_train)

In [30]:
result.best_params_

{'criterion': 'absolute_error', 'n_estimators': 2, 'random_state': 1}

In [31]:
Y_pred = lm.predict(X_test)

In [32]:
rfr_score_with_optimization = metrics.r2_score(Y_test,Y_pred)
print(rfr_score_with_optimization)

0.9626849262899466


In [33]:
print(rfr_score_without_optimization)
print(rfr_score_with_optimization)
print(rfr_score_with_optimization_empty_param_grid)

0.9595224380380424
0.9626849262899466
0.9607149523722232
