## Hyperparameter Tuning  
**Definition:** The process of choosing the hyperparameters of a machine learning model (settings fixed before training, such as tree depth or the number of estimators) to improve its performance.  

### Types of Hyperparameter Tuning:  
1. **Grid Search:** Exhaustively evaluates every combination of a predefined set of hyperparameter values.  
2. **Random Search:** Evaluates a fixed number of randomly sampled hyperparameter combinations.  
3. **Bayesian Optimization:** Fits a probabilistic model to the results seen so far and uses it to choose the next hyperparameters to try.  
4. **Gradient-Based Optimization:** Adjusts continuous hyperparameters using gradients of the validation loss.  
5. **Evolutionary Algorithms:** Evolves a population of hyperparameter settings through selection, mutation, and crossover.  
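
To make the difference between the first two strategies concrete, here is a minimal sketch that only enumerates candidates without training anything. It uses scikit-learn's `ParameterGrid` and `ParameterSampler` helpers; the search space mirrors the one used in the grid-search cell of this notebook, and `n_iter=10` and `random_state=0` are illustrative choices.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Search space: lists of allowed values per hyperparameter.
param_grid = {
    "n_estimators": [50, 100, 200, 300, 400, 500],
    "max_depth": [4, 5, 6, 7, 8, 9, 10],
    "criterion": ["gini", "entropy"],
    "bootstrap": [True, False],
}

# Grid search evaluates every combination: 6 * 7 * 2 * 2 candidates.
grid_candidates = list(ParameterGrid(param_grid))
print(len(grid_candidates))  # 168

# Random search evaluates only n_iter sampled combinations.
random_candidates = list(ParameterSampler(param_grid, n_iter=10, random_state=0))
print(len(random_candidates))  # 10
```

The cost of grid search grows multiplicatively with every added hyperparameter, while random search keeps a fixed budget.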

---

## Cross-Validation  
**Definition:** A technique used to evaluate a machine learning model's performance by splitting the dataset into multiple subsets for training and testing.  

### Types of Cross-Validation:  
1. **K-Fold Cross-Validation:** Divides data into *k* subsets and iteratively trains on *k-1* while testing on the remaining fold.  
2. **Stratified K-Fold:** Ensures each fold has a proportional class distribution, useful for imbalanced datasets.  
3. **Leave-One-Out Cross-Validation (LOO-CV):** Uses a single data point for testing and the rest for training, repeated for all points.  
4. **Time Series Cross-Validation:** Uses rolling or expanding windows for sequential data.  
5. **Holdout Method:** Splits the dataset into separate training and testing sets without multiple iterations.  


## Hyperparameter Tuning  

In [1]:
# import libraries 
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split , GridSearchCV , RandomizedSearchCV
from sklearn.metrics import confusion_matrix , classification_report

In [2]:
# load the data of iris from sklearn.datasets
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

In [5]:
# define the model 
model = RandomForestClassifier()

# Create the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200, 300, 400, 500],
    'max_depth' : [4,5,6,7,8,9,10] ,
    'criterion' :['gini' , 'entropy'],
    'bootstrap' : [True , False]
}

# set up the grid search
grid_search = GridSearchCV(model,
                            param_grid,
                            cv=5,
                            scoring='accuracy',
                            verbose=1,
                            n_jobs=-1
                        )
grid_search.fit(X, y)

# print the best parameters found by the grid search
print(f"Best parameters: {grid_search.best_params_}")

Fitting 5 folds for each of 168 candidates, totalling 840 fits
Best parameters: {'bootstrap': True, 'criterion': 'gini', 'max_depth': 4, 'n_estimators': 50}


In [8]:
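# refit a model on the full dataset with the best parameters found
model = RandomForestClassifier(**grid_search.best_params_)
model.fit(X, y)

# predict on the same data the model was trained on
y_pred = model.predict(X)

# evaluate the model; these are training-set scores, so they overstate performance on unseen data
print(confusion_matrix(y, y_pred))
print(classification_report(y, y_pred))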

[[50  0  0]
 [ 0 50  0]
 [ 0  0 50]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00        50
           2       1.00      1.00      1.00        50

    accuracy                           1.00       150
   macro avg       1.00      1.00      1.00       150
weighted avg       1.00      1.00      1.00       150
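
The report above is computed on the same 150 samples the model was trained on, so a perfect score says little about generalization. A hedged alternative is to hold out a test set before tuning, as sketched below. The small grid, the 80/20 split, and `random_state=42` are assumptions chosen to keep the example fast, not values from the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data; stratify keeps class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# A deliberately small grid so the sketch runs quickly (illustrative values).
small_grid = {"n_estimators": [25, 50], "max_depth": [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42), small_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)  # tuning only sees the training portion

# With the default refit=True, best_estimator_ is already refit on all of X_train.
test_accuracy = accuracy_score(y_test, search.best_estimator_.predict(X_test))
print(f"Best parameters: {search.best_params_}")
print(f"Mean CV accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {test_accuracy:.3f}")
```

The held-out accuracy is the number to report, because neither the tuning nor the final fit ever saw `X_test`.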



In [10]:
# define the model 
model = RandomForestClassifier()

# Create the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200, 300, 400, 500],
    'max_depth' : [4,5,6,7,8,9,10] ,
    'criterion' :['gini' , 'entropy'],
    'bootstrap' : [True , False]
}

# set up the randomized search (n_iter defaults to 10 sampled candidates)
grid_search = RandomizedSearchCV(model,
                            param_grid,
                            cv=5,
                            scoring='accuracy',
                            verbose=1,
                            n_jobs=-1
                        )
grid_search.fit(X, y)

# print the best parameters found by the randomized search
print(f"Best parameters: {grid_search.best_params_}")

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters: {'n_estimators': 200, 'max_depth': 4, 'criterion': 'entropy', 'bootstrap': True}
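
Only 10 candidates were evaluated because `n_iter` defaults to 10, and the result changes between runs because no `random_state` was set. The sketch below shows both knobs, plus a `scipy.stats.randint` distribution in place of a fixed list. The ranges and seeds are assumptions picked for a quick run.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Lists are sampled uniformly; scipy distributions are sampled via their rvs method.
param_distributions = {
    "n_estimators": randint(20, 101),  # integers from 20 to 100 inclusive
    "max_depth": [4, 5, 6, 7, 8, 9, 10],
    "criterion": ["gini", "entropy"],
    "bootstrap": [True, False],
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,          # number of sampled candidates (the default is 10)
    cv=5,
    scoring="accuracy",
    random_state=0,    # makes the sampled candidates reproducible
)
random_search.fit(X, y)

print(f"Candidates evaluated: {len(random_search.cv_results_['params'])}")
print(f"Best parameters: {random_search.best_params_}")
```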


In [14]:
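# refit a model with the best parameters from the randomized search (stored in grid_search)
model = RandomForestClassifier(**grid_search.best_params_)
model.fit(X, y)

# predict on the same data the model was trained on
y_pred = model.predict(X)

# evaluate the model; these are training-set scores, so they overstate performance on unseen data
print(confusion_matrix(y, y_pred))
print(classification_report(y, y_pred))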

[[50  0  0]
 [ 0 48  2]
 [ 0  0 50]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      0.96      0.98        50
           2       0.96      1.00      0.98        50

    accuracy                           0.99       150
   macro avg       0.99      0.99      0.99       150
weighted avg       0.99      0.99      0.99       150



# Cross-Validation  

## Definition  
Cross-validation is a technique used to evaluate the performance of a machine learning model by splitting the dataset into multiple subsets for training and testing. It helps assess how well the model generalizes to unseen data and makes overfitting easier to detect; it does not reduce overfitting by itself.  

## Why Use Cross-Validation?  
- Checks whether the model performs consistently across different subsets of the data.  
- Helps detect overfitting or underfitting.  
- Provides a more reliable estimate of model performance compared to a single train-test split.  

## Types of Cross-Validation  

### 1. **K-Fold Cross-Validation**  
- Divides the data into *k* equal parts (folds).  
- Trains the model *k* times, using *k-1* folds for training and the remaining fold for testing in each iteration.  
- The final performance is the average of all iterations.  
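
The three steps above can be sketched as an explicit loop. This is a minimal illustration on the Iris data with `GaussianNB`; `shuffle=True` and `random_state=0` are assumptions added here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# shuffle=True matters here: the Iris rows are sorted by class.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    model = GaussianNB()
    model.fit(X[train_idx], y[train_idx])                      # train on k-1 folds
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # test on the remaining fold

print("Per-fold accuracy:", np.round(fold_scores, 3))
print("Mean accuracy:", np.mean(fold_scores))
```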

### 2. **Stratified K-Fold Cross-Validation**  
- Similar to K-Fold but ensures each fold has a proportional representation of class labels.  
- Useful for imbalanced datasets.  
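
A small sketch of what "proportional representation" means in practice, using toy labels invented for this example (8 samples of class 0 and 4 of class 1):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 8 samples of class 0 and 4 of class 1.
y = np.array([0] * 8 + [1] * 4)
X = np.zeros((len(y), 1))  # features are irrelevant for splitting

skf = StratifiedKFold(n_splits=4)
fold_counts = [
    np.bincount(y[test_idx], minlength=2).tolist()
    for _, test_idx in skf.split(X, y)
]
print(fold_counts)  # [[2, 1], [2, 1], [2, 1], [2, 1]]
```

Every test fold keeps the 2:1 class ratio of the full dataset.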

### 3. **Leave-One-Out Cross-Validation (LOO-CV)**  
- Each data point is used as a test set once, while the rest are used for training.  
- Computationally expensive (one model fit per data point); the estimate has low bias but can have high variance.  
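
As a rough illustration, scikit-learn's `LeaveOneOut` splitter can be passed to `cross_val_score`. On Iris this means 150 separate fits, each scored on a single sample, so every per-split score is either 0 or 1.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
loo_scores = cross_val_score(GaussianNB(), X, y, cv=loo, scoring="accuracy")

print("Number of fits:", len(loo_scores))
print("LOO accuracy:", loo_scores.mean())
```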

### 4. **Time Series Cross-Validation**  
- Used for sequential data where the order matters.  
- Training is done on past data, and testing is performed on future data using rolling or expanding windows.  
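
A minimal sketch with scikit-learn's `TimeSeriesSplit` on twelve toy observations (the data is invented for illustration). By default the training window expands; setting the `max_train_size` parameter caps it, which gives a rolling window instead.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # twelve time-ordered toy observations

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for train_idx, test_idx in splits:
    # Training indices always precede the test indices.
    print(f"train={train_idx.tolist()}  test={test_idx.tolist()}")
```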

### 5. **Holdout Method**  
- The dataset is split into a training set and a testing set.  
- Uses a single split, so it is computationally cheap, but the estimate depends heavily on which points land in the test set, especially for small datasets.  
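
One way to see how much a single split can vary is to repeat the holdout with different random seeds. The sketch below does this on Iris with `GaussianNB`; the ten seeds and the 80/20 split are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

holdout_scores = []
for seed in range(10):
    # Each seed produces a different 80/20 train/test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    holdout_scores.append(GaussianNB().fit(X_train, y_train).score(X_test, y_test))

print("Min / max holdout accuracy:", min(holdout_scores), max(holdout_scores))
print("Std across splits:", np.std(holdout_scores))
```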


In [1]:
# cross validation
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# load the Iris data set
from sklearn.datasets import load_iris
iris = load_iris()  

# define the model
nb = GaussianNB()

# perform 5-fold cross validation (for a classifier, an integer cv uses stratified k-fold)
scores = cross_val_score(nb, iris.data, iris.target, cv=5 , scoring='accuracy')


# print the scores for each fold and the mean score
print("Cross-validation scores:", scores)
print("Mean cross-validation score:", scores.mean())
print(f'Standard Deviation : {scores.std()}')

Cross-validation scores: [0.93333333 0.96666667 0.93333333 0.93333333 1.        ]
Mean cross-validation score: 0.9533333333333334
Standard Deviation : 0.02666666666666666
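
For a classifier, passing `cv=5` makes scikit-learn build an unshuffled `StratifiedKFold` internally. The sketch below writes that splitter out explicitly and checks that both calls give the same scores.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
nb = GaussianNB()

# Integer cv: the splitter is chosen by scikit-learn.
scores_int = cross_val_score(nb, iris.data, iris.target, cv=5, scoring="accuracy")

# The same splitter written out explicitly.
explicit_cv = StratifiedKFold(n_splits=5)
scores_explicit = cross_val_score(
    nb, iris.data, iris.target, cv=explicit_cv, scoring="accuracy"
)

print("Identical scores:", np.allclose(scores_int, scores_explicit))
```

Building the splitter yourself is useful when you want shuffled folds, for example `StratifiedKFold(n_splits=5, shuffle=True, random_state=0)`.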
