<a href="https://colab.research.google.com/github/hucarlos08/GEO-ML/blob/main/CrossValidation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GridSearchCV
GridSearchCV is a technique for tuning hyperparameters, seeking to find the optimal hyperparameters for a model in a systematic way. The idea is to define a grid of hyperparameters, where each point in the grid is a specific configuration of hyperparameters. The model is then trained and evaluated for each of these configurations, and the configuration that yields the best performance according to some metric (e.g., accuracy for classification problems, mean squared error for regression problems) is chosen as the optimal one.

Let's break it down:

- **Grid**: This refers to the combinations of hyperparameter values that you'd like to test. For example, if you're tuning a support vector machine, you might want to test different values for the cost parameter C and the kernel coefficient gamma. You could define C to take on values from the set {1, 10, 100} and gamma from the set {0.1, 0.01}. Your grid would then consist of 3 × 2 = 6 combinations: (C=1, gamma=0.1), (C=1, gamma=0.01), (C=10, gamma=0.1), etc.

- **Search**: This is the process of training and evaluating a model for each point in the grid. Typically, this is done using some form of cross-validation to get a reliable estimate of the model's performance.

- **CV**: Stands for Cross-Validation. In k-fold cross-validation, the training set is split into k smaller sets, or "folds". The model is trained on k-1 of these folds and evaluated on the remaining fold. This process is repeated k times, with a different fold held out each time, and the reported performance is the average over the k folds. Averaging gives a more robust estimate than a single train/validation split and makes it less likely that the chosen hyperparameters simply overfit one particular split.
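The three pieces above can be sketched by hand before reaching for `GridSearchCV`. The snippet below is a minimal illustration, assuming the Iris data and an RBF-kernel `SVC` (neither is prescribed by the text above): it enumerates the six-point grid from the SVM example with `ParameterGrid` and scores each point with 5-fold cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid: 3 values of C x 2 values of gamma = 6 candidate configurations
grid = ParameterGrid({"C": [1, 10, 100], "gamma": [0.1, 0.01]})

# Search: evaluate every grid point
scores = {}
for params in grid:
    model = SVC(kernel="rbf", **params)
    # CV: 5 folds; the candidate's score is the mean over the folds
    fold_scores = cross_val_score(model, X, y, cv=5)
    scores[(params["C"], params["gamma"])] = fold_scores.mean()

# Pick the configuration with the highest mean cross-validated accuracy
best_C, best_gamma = max(scores, key=scores.get)
print(len(grid), "candidates; best C and gamma:", best_C, best_gamma)
```

`GridSearchCV` automates this loop and additionally refits the best configuration on the full training data.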

By systematically exploring the grid of hyperparameters with cross-validation, **GridSearchCV** can find a good combination of hyperparameters, balancing model complexity and model performance in a principled way. The main drawback of this method is that it can be computationally expensive, especially if the grid is large and the model takes a long time to train.
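To make the cost concrete, the number of model fits is the number of grid points multiplied by the number of folds. The sketch below counts the fits for the parameter grid used later in this notebook (four penalties, 20 values of `C`, 20 values of `l1_ratio`, 5 folds); the variable names `n_folds`, `n_candidates`, and `n_cv_fits` are introduced here only for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as the tuning cell later in this notebook
param_grid = {
    "clf__penalty": ["l1", "l2", None, "elasticnet"],
    "clf__C": np.logspace(-2, 0, 20),
    "clf__l1_ratio": np.logspace(-2, 0, 20),
}
n_folds = 5

n_candidates = len(ParameterGrid(param_grid))  # 4 * 20 * 20 grid points
n_cv_fits = n_candidates * n_folds             # one fit per grid point per fold

print(n_candidates, n_cv_fits)  # 1600 8000
```

With the default `refit=True`, `GridSearchCV` performs one more fit of the winning configuration on the whole training set.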

In [29]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Load the data
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

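# A simple pipeline with two steps; the step names ('normalizer', 'clf')
# are the prefixes used later to address hyperparameters, e.g. 'clf__C'
pipeline = Pipeline([
    ('normalizer', StandardScaler()),  # Step 1: standardize features (zero mean, unit variance)
    ('clf', LogisticRegression(max_iter=10000, solver='saga'))  # Step 2: classifier
])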

# Train and evaluate the model using the Pipeline
# Train the model
pipeline.fit(X_train, y_train)

# Predict the classes
y_pred = pipeline.predict(X_test)

# Evaluate the model performance
print('Accuracy: {:.2f}'.format(metrics.accuracy_score(y_test, y_pred)))

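# Hyperparameter tuning in a Pipeline
# Create a parameter grid: 4 penalties x 20 values of C x 20 values of l1_ratio.
# Note: l1_ratio only affects penalty='elasticnet'; for the other penalties it
# is ignored, so those grid points repeat the same fit and trigger a warning.
param_grid = {
    'clf__penalty': ['l1', 'l2', None, 'elasticnet'],
    'clf__C': np.logspace(-2, 0, 20),
    'clf__l1_ratio': np.logspace(-2, 0, 20)
}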

# Create a GridSearch object
grid_search = GridSearchCV(pipeline, param_grid, cv=5, verbose=0, n_jobs=4)

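# Fit the GridSearch object. fit() returns the fitted GridSearchCV itself, so
# best_model is the search object; its predict() delegates to the best
# estimator, refit on the full training set
best_model = grid_search.fit(X_train, y_train)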

# Get the results into a dataframe
results = pd.DataFrame(best_model.cv_results_)

# Print the best parameters
print('Best Parameters: ', best_model.best_params_)

# Predict the classes using the best model
y_pred_best = best_model.predict(X_test)

# Evaluate the best model performance
print('Accuracy of Best Model: {:.2f}'.format(metrics.accuracy_score(y_test, y_pred_best)))


Accuracy: 1.00
Best Parameters:  {'clf__C': 0.615848211066026, 'clf__l1_ratio': 0.01, 'clf__penalty': 'l2'}
Accuracy of Best Model: 1.00



l1_ratio parameter is only used when penalty is 'elasticnet'. Got (penalty=l2)
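The warning above appears because the flat grid varies `l1_ratio` for every penalty, although only `'elasticnet'` uses it. One way to avoid the redundant fits, sketched here under the assumption that the scikit-learn version matches the one used in this notebook, is to pass `GridSearchCV` a list of dictionaries so that each penalty is paired only with the parameters it uses. The names `Cs`, `ratios`, `conditional_grid`, and `conditional_search` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('normalizer', StandardScaler()),
    ('clf', LogisticRegression(max_iter=10000, solver='saga')),
])

Cs = np.logspace(-2, 0, 20)
ratios = np.logspace(-2, 0, 20)

# One sub-grid per group of penalties, each with only the relevant parameters
conditional_grid = [
    {'clf__penalty': ['l1', 'l2'], 'clf__C': Cs},      # C only
    {'clf__penalty': ['elasticnet'], 'clf__C': Cs,
     'clf__l1_ratio': ratios},                         # C and l1_ratio
    {'clf__penalty': [None]},                          # no regularization: nothing to tune
]

n_candidates = len(ParameterGrid(conditional_grid))
print(n_candidates)  # 441 candidates instead of 1600

# Built but not fitted here; fit it on the training data as in the cell above
conditional_search = GridSearchCV(pipeline, conditional_grid, cv=5, n_jobs=4)
```

The sub-grids contribute 40 + 400 + 1 candidates, so the search covers the same distinct models with roughly a quarter of the fits.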



In [30]:
results.head()

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_clf__C,param_clf__l1_ratio,param_clf__penalty,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.008903,0.00282,0.0012,0.00013,0.01,0.01,l1,"{'clf__C': 0.01, 'clf__l1_ratio': 0.01, 'clf__...",0.375,0.333333,0.333333,0.333333,0.333333,0.341667,0.016667,1511
1,0.011304,0.005162,0.001381,0.000663,0.01,0.01,l2,"{'clf__C': 0.01, 'clf__l1_ratio': 0.01, 'clf__...",0.75,0.916667,0.833333,0.875,0.875,0.85,0.056519,1428
2,0.20025,0.085036,0.00263,0.00278,0.01,0.01,,"{'clf__C': 0.01, 'clf__l1_ratio': 0.01, 'clf__...",0.958333,0.958333,0.833333,1.0,1.0,0.95,0.061237,112
3,0.009315,0.005135,0.001451,0.000997,0.01,0.01,elasticnet,"{'clf__C': 0.01, 'clf__l1_ratio': 0.01, 'clf__...",0.75,0.916667,0.833333,0.875,0.875,0.85,0.056519,1428
4,0.006531,0.002201,0.002119,0.002006,0.01,0.012743,l1,"{'clf__C': 0.01, 'clf__l1_ratio': 0.0127427498...",0.291667,0.333333,0.333333,0.333333,0.333333,0.325,0.016667,1573
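`results.head()` lists candidates in grid order, so the first rows shown above are the smallest values of `C` and are ranked far from the top. Sorting by `rank_test_score` surfaces the best configurations instead. The snippet below is a self-contained sketch on a deliberately small, hypothetical grid (three values of `C`, default solver) so that it runs quickly; the same `sort_values` call applies to the `results` dataframe above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

demo_pipeline = Pipeline([
    ('normalizer', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Small illustrative grid, not the one used in this notebook
demo_search = GridSearchCV(demo_pipeline, {'clf__C': [0.01, 0.1, 1.0]}, cv=5)
demo_search.fit(X, y)

demo_results = pd.DataFrame(demo_search.cv_results_)

# Best candidates first, keeping only the columns needed for comparison
columns = ['param_clf__C', 'mean_test_score', 'std_test_score', 'rank_test_score']
top = demo_results.sort_values('rank_test_score')[columns]
print(top.to_string(index=False))
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the gap between two candidates is larger than the fold-to-fold variation.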


In [33]:
import plotly.express as px

# Select the columns we are interested in
subset_results = results[['param_clf__C', 'param_clf__penalty', 'param_clf__l1_ratio', 'mean_test_score']]

# Rename columns for a more meaningful plot
subset_results = subset_results.rename(columns={
    'param_clf__C': 'C',
    'param_clf__penalty': 'Penalty',
    'param_clf__l1_ratio': 'Ratios',
    'mean_test_score': 'Mean Test Score'
})

# Convert columns to appropriate dtypes for plotting
subset_results['C'] = subset_results['C'].astype(float)
subset_results['Mean Test Score'] = subset_results['Mean Test Score'].astype(float)
subset_results['Penalty'] = subset_results['Penalty'].map({'l1': 1, 'l2': 2, None:3, "elasticnet":4})
subset_results['Ratios'] = subset_results['Ratios'].astype(float)


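# Make the parallel coordinates plot. Lines are colored by mean CV score so
# that the color midpoint below (the average score) refers to the same column.
# Penalty codes: 1 = l1, 2 = l2, 3 = None, 4 = elasticnet
fig = px.parallel_coordinates(
    subset_results,
    color="Mean Test Score",
    labels={"C": "C", "Penalty": "Penalty", "Ratios": "l1_ratio", "Mean Test Score": "Mean Test Score"},
    color_continuous_scale=px.colors.diverging.Tealrose,
    color_continuous_midpoint=subset_results['Mean Test Score'].mean()
)
fig.show()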


In [34]:
subset_results.head(100)

,C,Penalty,Ratios,Mean Test Score
0,0.010000,1,0.010000,0.341667
1,0.010000,2,0.010000,0.850000
2,0.010000,3,0.010000,0.950000
3,0.010000,4,0.010000,0.850000
4,0.010000,1,0.012743,0.325000
...,...,...,...,...
95,0.012743,4,0.020691,0.858333
96,0.012743,1,0.026367,0.333333
97,0.012743,2,0.026367,0.858333
98,0.012743,3,0.026367,0.950000
