# Tutorial 8: Tuning Hyperparameters for Optimal Model Performance


In this tutorial, we will discuss **hyperparameter tuning**, a process in machine learning and deep learning that can enhance model performance. Hyperparameters are the settings of a model that you configure before training, such as the number of trees in a random forest. Choosing the right hyperparameter values is often the difference between a poor model and an effective one.

This session covers how hyperparameter tuning works, the methods available, and practical examples that apply them. By the end, you will be equipped with techniques like **Grid Search** and **Random Search**, and you'll also gain insights into advanced approaches for optimization.

---

**Importance of Hyperparameter Tuning**
Hyperparameter tuning is essential because:
- **Improved Model Accuracy:** Properly tuned hyperparameters can drastically improve model accuracy and generalization.
- **Reduced Overfitting/Underfitting:** Fine-tuning helps balance model complexity, avoiding both overfitting (too complex) and underfitting (too simple).
- **Better Generalization:** Well-tuned hyperparameters ensure the model performs well on unseen data, which is the ultimate goal of machine learning.

Hyperparameter tuning is not just about achieving the best possible accuracy but about making informed decisions that result in robust, reliable models.

---

**Session Outline**
This tutorial is divided into two sections:


**Section 1: Cross-Validation and Its Role in Tuning**
In this section, we’ll discuss:
- The **importance of cross-validation** in hyperparameter tuning.
- Why **test sets should not be used** during the tuning process.
- **Data leakage prevention**:
  - When tuning hyperparameters using the test set, we introduce bias because the chosen hyperparameters are influenced by test performance.
  - Cross-validation allows us to evaluate models on multiple splits of the data, ensuring that the test set remains unseen until the final evaluation.
- Types of cross-validation methods:
  - Hold-out validation.
  - K-Fold cross-validation.
  - Stratified K-Fold cross-validation.
  - Leave-One-Out Cross-Validation.

This section will emphasize the need for unbiased evaluation of models and highlight how cross-validation ensures that we select the best model while preserving the integrity of the test set.

---

**Section 2: Hyperparameter Tuning Algorithms**
In this section, we’ll explore:
- **Basic Tuning Algorithms:**
  - **Grid Search:** An exhaustive search over a predefined set of hyperparameters.
  - **Random Search:** A more efficient method that randomly samples hyperparameter values.
  - **Nested Grid and Random Search:** A hybrid approach where critical hyperparameters are tuned with Grid Search and less critical ones with Random Search.

- **Regular Machine Learning Models:**
  - Applying Grid and Random Search to common models like Logistic Regression, Random Forest, Support Vector Machine (SVM), and XGBoost.

- **Neural Networks:**
  - Tuning hyperparameters of neural networks using Grid Search and Random Search.
  - Defining architecture and training parameters such as the number of layers, neurons, activation functions, learning rate, and batch size.

---

**Advanced Techniques (Introduction)**
While this tutorial focuses on basic techniques, there are also advanced methods such as:
- Bayesian Optimization
- Tree-structured Parzen Estimators (TPE)
- Hyperband
- Evolutionary Algorithms
- Automated Machine Learning (AutoML)

---

**What You Will Learn**
1. The importance of cross-validation in model evaluation and hyperparameter tuning.
2. Practical implementation of Grid Search and Random Search for regular models.
3. How to tune hyperparameters of neural networks.

---

Let’s get started!

We will use the Titanic dataset that was preprocessed in the first session.

## Data Preprocessing

### Importing Some Basic Libraries

In [1]:
import numpy as np  # for doing numerical operations
import pandas as pd  # for data analysis
import matplotlib.pyplot as plt  # for data visualization
import seaborn as sns  # for data visualization

### Importing the Dataset: Reading the CSV File into a Pandas DataFrame

In [2]:
dataset = pd.read_csv('Titanic_Data.csv')

# Display floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format

# Display the first few rows as sample
dataset.head()

| | PassengerId | Pclass | Name | Sex | Age | Number of Siblings/Spouses Aboard | Number of Parents/Children Aboard | Fare | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.25 | S | 0 |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.28 | C | 1 |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.92 | S | 1 |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1 | S | 1 |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.05 | S | 0 |


### Dropping Irrelevant Input Data

In [3]:
dataset.drop(['PassengerId', 'Name'], axis=1, inplace=True)

### Taking Care of Missing Embarked Data

In [4]:
# Drop rows where the 'Embarked' column has missing values
dataset = dataset.dropna(subset=['Embarked'])

### Encoding Categorical Data

In [5]:
# One-hot encoding for 'Sex' and 'Embarked' columns
dataset = pd.get_dummies(dataset, columns=['Sex', 'Embarked'])

In [6]:
# Label encoding for 'Survived' column

from sklearn.preprocessing import LabelEncoder

dataset['Survived'] = LabelEncoder().fit_transform(dataset['Survived'])

### Separate the Input and Output

In [7]:
# Select the target by name rather than by position, which breaks if the column order changes
y = dataset['Survived']
X = dataset.drop(['Survived'], axis=1, inplace=False)

### Splitting the Dataset into the Training Set and Test Set

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

### Taking Care of Missing Age Data

In [9]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

### Scaling the Features

In [10]:
# Scale the age and fare
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

# Fit the scaler on the training data only, then transform the training data
X_train[:, [1,4]] = sc.fit_transform(X_train[:, [1,4]])

# Reuse the already fitted scaler on the test data (transform only, no refit)
X_test[:, [1,4]] = sc.transform(X_test[:, [1,4]])

In [11]:
# Ensure both X_train and y_train are NumPy arrays

X_train = np.array(X_train)  
y_train = np.array(y_train)  

## Cross-Validation

Cross-validation is crucial in machine learning because it ensures a reliable and unbiased evaluation of model performance during the training and hyperparameter tuning processes. 


<img src="grid_search_cross_validation.png" alt="Cross-validation vs. Test" width="600">

<img src="1_lOZqYqwmuW1lg6fitwqXxA.png" alt="Cross-validation performance calculation" width="600">


---
Here's why cross-validation is important:

1. **Prevents Data Leakage**
- The **test set** represents unseen data and should only be used once—at the very end of the modeling process—to evaluate the final model.
- If you use the test set during hyperparameter tuning, the model selection process becomes biased toward the test set. This leads to **data leakage**, where the model inadvertently "learns" about the test data, making the evaluation unreliable.

2. **Reliable Performance Estimates**
- Cross-validation splits the data into multiple training and validation sets, ensuring that every data point is used for both training and validation at some point.
- This reduces the variability associated with a single train-validation split, leading to a more robust estimate of model performance.

3. **Ensures Generalization**
- Cross-validation evaluates the model on multiple validation sets, simulating its behavior on unseen data. This helps ensure the model generalizes well to real-world scenarios.
- It avoids overfitting to a single validation set, where the model might perform well just because it happens to align with the validation split.

4. **Maximizes Data Utilization**
- By splitting the data into training and validation sets multiple times (e.g., in K-Fold Cross-Validation), cross-validation makes efficient use of the dataset, especially when it's small.
- Every data point is eventually used for training and validation, leading to a more comprehensive evaluation.

5. **Model Comparison and Selection**
- When tuning hyperparameters or comparing multiple models, cross-validation ensures that the selection process is not biased toward a particular data split.
- It helps you confidently choose the model that performs best across different validation sets, rather than one that performs well on a specific split.

---

**Why Not Use the Test Set for Tuning?**

1. **The Role of the Test Set**
- The **test set** is a proxy for real-world data, used to measure how well the final model generalizes to unseen scenarios.
- Using it during training or hyperparameter tuning contaminates this "unseen" nature, making the evaluation meaningless.

2. **Risk of Overfitting to the Test Set**
- When you tune hyperparameters based on test set performance, you’re inadvertently optimizing the model for that specific test set.
- This results in overfitting to the test set and gives a false sense of confidence in the model's ability to handle truly unseen data.

3. **Unrealistic Expectations**
- In real-world scenarios, you won’t know the characteristics of the test data in advance. Using the test set during tuning creates an unrealistic scenario where you "peek" at future data.

---

**Summary**

| **Cross-Validation**                | **Using Test Set**                     |
|-------------------------------------|----------------------------------------|
| Used for tuning and evaluating models during the training process. | Should only be used for final model evaluation. |
| Provides unbiased performance estimates. | Leads to biased results if used repeatedly. |
| Prevents data leakage and ensures generalization. | Causes data leakage if used during tuning. |
| Simulates performance on unseen data. | Cannot provide a fair evaluation after being used for tuning. |




**Cross-Validation Techniques**

**1. K-Fold Cross-Validation**
- K-Fold Cross-Validation is a method to evaluate a model's performance by splitting the **training** dataset into **K equal-sized subsets** (folds).
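- The model is trained on \(K-1\) folds and evaluated on the remaining fold. This process is repeated \(K\) times, with each fold serving as the held-out set exactly once. Because the folds come from the training data, the held-out fold acts as a validation set, and the separate test set stays untouched.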
- The final performance metric is the **average of the scores** across all \(K\) iterations.

**How does it work?**
1. Divide the dataset into \(K\) parts (folds).
2. For each fold:
   - Use \(K-1\) folds as the training set.
   - Use the remaining fold as the test set.
3. Calculate a performance metric (e.g., accuracy) for each fold.
4. Average the metrics to get the final score.
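
The four steps above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the toy data, the `k_fold_indices` helper, and the threshold "model" in `fit_threshold` are all assumptions made up for the example.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

# Toy data (assumption): one evenly spaced feature, label is 1 when it exceeds 0.5
X = [i / 49 for i in range(50)]
y = [int(x > 0.5) for x in X]

def fit_threshold(xs, ys):
    """Hypothetical 'model': the midpoint between the two class means."""
    pos = [x for x, t in zip(xs, ys) if t == 1]
    neg = [x for x, t in zip(xs, ys) if t == 0]
    return (mean(pos) + mean(neg)) / 2

fold_scores = []
for train_idx, val_idx in k_fold_indices(len(X), k=5):
    # Steps 2a/2b: fit on K-1 folds, evaluate on the held-out fold
    threshold = fit_threshold([X[i] for i in train_idx], [y[i] for i in train_idx])
    correct = sum(int(X[i] > threshold) == y[i] for i in val_idx)
    fold_scores.append(correct / len(val_idx))  # Step 3: one metric per fold

cv_score = mean(fold_scores)  # Step 4: average across folds
print(fold_scores, cv_score)
```

Each sample lands in the held-out fold exactly once, so the averaged score draws on the whole dataset even though no single fit ever sees its own validation fold.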

**Advantages**
- Reduces bias: Each data point is used for both training and testing.
- Helps identify overfitting or underfitting by testing on multiple splits.

**Disadvantages**
- Computational cost: Training the model \(K\) times can be expensive, especially for large datasets or complex models.

**When to use?**
- Use K-Fold Cross-Validation when you want a reliable performance estimate, and computational resources are not a major constraint.

---

**2. Stratified K-Fold Cross-Validation**
- A variation of K-Fold Cross-Validation that ensures **class proportions in each fold** are similar to the original dataset.
- Particularly useful for **imbalanced datasets**, where one class may dominate.

**How does it work?**
1. Divide the dataset into \(K\) folds.
2. Ensure that each fold has approximately the same proportion of each class as the entire dataset.
3. Follow the same training and testing process as K-Fold Cross-Validation.
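
One way to picture the stratification step is to deal each class out across the folds separately. The sketch below is a simplified illustration under that assumption; the helper `stratified_fold_assignment` and the 80/20 toy labels are invented for the example and do not reproduce scikit-learn's exact algorithm.

```python
import random
from collections import Counter, defaultdict

def stratified_fold_assignment(labels, k, seed=0):
    """Return a list giving the fold number (0..k-1) of every sample."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_of = [None] * len(labels)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal each class out round-robin so every fold gets its share
        for position, idx in enumerate(idxs):
            fold_of[idx] = position % k
    return fold_of

# Imbalanced toy labels (assumption): 80 negatives, 20 positives
labels = [0] * 80 + [1] * 20
fold_of = stratified_fold_assignment(labels, k=5)

for fold in range(5):
    counts = Counter(labels[i] for i, f in enumerate(fold_of) if f == fold)
    print(f"fold {fold}: {dict(counts)}")
```

With these labels every fold ends up with 16 negatives and 4 positives, the same 80/20 ratio as the full dataset, whereas a plain random split could leave a fold with very few positives.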

**Advantages**
- Maintains the class distribution, which leads to more reliable evaluation metrics, especially for classification tasks with imbalanced data.

**Disadvantages**
- Slightly more complex to implement (though most modern libraries like Scikit-learn handle it automatically).

**When to use?**
- Use Stratified K-Fold Cross-Validation for classification problems, especially when the dataset has an **imbalanced class distribution**.

---

**3. Leave-One-Out Cross-Validation (LOOCV)**
- A special case of K-Fold Cross-Validation where \(K = n\) (the number of data points in the dataset).
- Each iteration uses \(n-1\) data points for training and the **single remaining data point** for testing.

**How does it work?**
1. For each data point:
   - Use all other points as the training set.
   - Use the current point as the test set.
2. Train and evaluate the model \(n\) times.
3. Average the results of all \(n\) iterations to get the final performance metric.
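
The loop above can be written out directly. This is a small sketch under stated assumptions: the eight toy points and the 1-nearest-neighbour stand-in `predict_1nn` are hypothetical and were chosen only to keep the example dependency-free.

```python
def predict_1nn(train_x, train_y, query):
    """Hypothetical stand-in model: return the label of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[nearest]

# Toy 1-D data (assumption) with two points that sit on the "wrong" side
X = [0.1, 0.2, 0.3, 0.45, 0.55, 0.7, 0.8, 0.9]
y = [0, 0, 0, 1, 0, 1, 1, 1]

hits = []
for i in range(len(X)):
    # Step 1: hold out point i, train on the remaining n-1 points
    train_x = X[:i] + X[i + 1:]
    train_y = y[:i] + y[i + 1:]
    # Step 2: each iteration yields a single 0/1 result
    hits.append(int(predict_1nn(train_x, train_y, X[i]) == y[i]))

# Step 3: average the n individual results
loocv_accuracy = sum(hits) / len(hits)
print(f"{len(hits)} fits, LOOCV accuracy = {loocv_accuracy:.2f}")
```

Note that every per-iteration score is either 0 or 1, which is why individual LOOCV results are noisy and only their average is informative.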

**Advantages**
- No randomness: Every data point is used for testing exactly once.
- Ideal for very small datasets, as it maximizes the amount of data used for training.

**Disadvantages**
- Extremely computationally expensive for large datasets, as the model must be trained \(n\) times.
- May lead to high variance in the evaluation metric because each test set contains only one data point.

**When to use?**
- Use LOOCV for very **small datasets** where each fit should train on as much data as possible and \(n\) model fits are affordable.

---

**Comparison of Techniques**
| **Technique**           | **Best For**                 | **Computational Cost**  | **Key Feature**                                   |
|--------------------------|------------------------------|--------------------------|--------------------------------------------------|
| K-Fold Cross-Validation  | General use cases            | Medium                   | Divides data into \(K\) folds.                  |
| Stratified K-Fold        | Imbalanced classification    | Medium                   | Preserves class proportions in each fold.       |
| LOOCV                   | Small datasets               | High                     | Uses every point for testing exactly once.      |


**4. Hold-Out Validation**
- You take a portion of your data (e.g., 80%) for training the model and reserve the remaining portion (e.g., 20%) as a validation set.
- The model is trained on the training data and evaluated on the validation set to estimate its performance.

**How Does It Work?**
1. Split the Data:
   - Randomly divide the dataset into two subsets: training and validation.
   - For example:
     - Training set: 80% of the data.
     - Validation set: 20% of the data.

2. Train the Model:
   - Use the training set to fit the model.

3. Evaluate on Validation Set:
   - Use the validation set to compute performance metrics (e.g., accuracy, precision, recall, etc.).
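
The three steps can be sketched without any library. The helper `hold_out_split`, the toy labels, and the majority-class "model" below are assumptions chosen purely for illustration.

```python
import random

def hold_out_split(n_samples, val_fraction=0.2, seed=42):
    """Shuffle the indices once and cut them into (train_idx, val_idx)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]

# Toy labels (assumption): roughly one third positives
labels = [1 if i % 3 == 0 else 0 for i in range(100)]

# Step 1: split the data 80/20
train_idx, val_idx = hold_out_split(len(labels))

# Step 2: "train" a hypothetical baseline that always predicts the majority class
train_labels = [labels[i] for i in train_idx]
majority = max(set(train_labels), key=train_labels.count)

# Step 3: evaluate on the validation set only
val_accuracy = sum(labels[i] == majority for i in val_idx) / len(val_idx)
print(len(train_idx), len(val_idx), val_accuracy)
```

Changing `seed` changes which 20 samples are held out and therefore the reported accuracy, which is the split-dependence that K-Fold averaging is meant to reduce.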

**Advantages**
- Simple and fast: Only one split, so it’s computationally efficient.
- No need for complex logic: Easy to implement with minimal coding.

**Disadvantages**
- Risk of bias: The performance metric might depend on how the data is split. A poorly chosen split might not represent the true performance of the model.
- Inefficient use of data: Not all data points contribute to both training and validation, which can be problematic for small datasets.
- High variance: Results can vary significantly based on the random split.


**When to Use It?**
- When you have a large dataset and can afford to reserve a portion for validation without worrying about losing valuable training data.
- When you need quick, approximate results and don’t want the overhead of K-Fold or other advanced methods.


**Comparison with K-Fold Cross-Validation**
| **Hold-Out Validation**               | **K-Fold Cross-Validation**           |
|---------------------------------------|---------------------------------------|
| Simple and quick to implement         | More complex but robust               |
| May introduce bias or high variance   | Reduces bias and variance             |
| Efficient for large datasets          | Better for small or medium datasets   |
| Less computational overhead           | More computationally expensive        |


### Hold-Out Validation

In [12]:
# Dictionary to store test accuracies for each method
results = {}

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split training data into train and validation sets
X_train_cv, X_val_cv, y_train_cv, y_val_cv = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Train model
model = LogisticRegression(max_iter=500)
model.fit(X_train_cv, y_train_cv)

# Validate model
y_val_pred = model.predict(X_val_cv)
val_accuracy = accuracy_score(y_val_cv, y_val_pred)
print("Hold-Out Validation Accuracy:", val_accuracy)

y_test_pred = model.predict(X_test)
results["Hold-Out"] = accuracy_score(y_test, y_test_pred)

Hold-Out Validation Accuracy: 0.7832167832167832


### K-Fold Cross-Validation

In [13]:
from sklearn.base import clone
from sklearn.model_selection import KFold, cross_val_score

# Initialize K-Fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
best_kf_model = None
best_kf_val_acc = 0

for train_index, val_index in kf.split(X_train):
    X_train_kf, X_val_kf = X_train[train_index], X_train[val_index]
    y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]

    # Fit a fresh copy per fold; refitting `model` in place would overwrite
    # the model saved from an earlier fold
    fold_model = clone(model)
    fold_model.fit(X_train_kf, y_train_kf)
    val_acc = accuracy_score(y_val_kf, fold_model.predict(X_val_kf))

    if val_acc > best_kf_val_acc:
        best_kf_val_acc = val_acc
        best_kf_model = fold_model  # Save the best model

# Predict on test set using the best model
kf_test_pred = best_kf_model.predict(X_test)
results["K-Fold"] = accuracy_score(y_test, kf_test_pred)

### Stratified K-Fold Cross-Validation

In [14]:
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

# Initialize Stratified K-Fold cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
best_skf_model = None
best_skf_val_acc = 0

for train_index, val_index in skf.split(X_train, y_train):
    X_train_skf, X_val_skf = X_train[train_index], X_train[val_index]
    y_train_skf, y_val_skf = y_train[train_index], y_train[val_index]

    # Fit a fresh copy per fold; refitting `model` in place would overwrite
    # the model saved from an earlier fold
    fold_model = clone(model)
    fold_model.fit(X_train_skf, y_train_skf)
    val_acc = accuracy_score(y_val_skf, fold_model.predict(X_val_skf))

    if val_acc > best_skf_val_acc:
        best_skf_val_acc = val_acc
        best_skf_model = fold_model  # Save the best model

# Predict on test set using the best model
skf_test_pred = best_skf_model.predict(X_test)
results["Stratified K-Fold"] = accuracy_score(y_test, skf_test_pred)


### Leave-One-Out Cross-Validation (LOOCV)

In [15]:
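from sklearn.base import clone
from sklearn.model_selection import LeaveOneOut

# Initialize LOOCV
loo = LeaveOneOut()
best_loo_model = None
best_loo_val_acc = 0

for train_index, val_index in loo.split(X_train):
    X_train_loo, X_val_loo = X_train[train_index], X_train[val_index]
    y_train_loo, y_val_loo = y_train[train_index], y_train[val_index]

    # Fit a fresh copy per fold; refitting `model` in place would overwrite
    # the model saved from an earlier fold
    fold_model = clone(model)
    fold_model.fit(X_train_loo, y_train_loo)
    val_acc = accuracy_score(y_val_loo, fold_model.predict(X_val_loo))

    # Each fold scores 0 or 1, so this keeps the first fold that is classified correctly
    if val_acc > best_loo_val_acc:
        best_loo_val_acc = val_acc
        best_loo_model = fold_model  # Save the best model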

# Predict on test set using the best model
loo_test_pred = best_loo_model.predict(X_test)
results["LOOCV"] = accuracy_score(y_test, loo_test_pred)

### Comparison

In [16]:
# Create a DataFrame to display the results
results_df = pd.DataFrame(list(results.items()), columns=["Cross-Validation Method", "Test Accuracy"])

results_df

| | Cross-Validation Method | Test Accuracy |
|---|---|---|
| 0 | Hold-Out | 0.81 |
| 1 | K-Fold | 0.83 |
| 2 | Stratified K-Fold | 0.83 |
| 3 | LOOCV | 0.84 |


**Summary of Methods**

| **Method**               | **Description**                                                                 | **Use Case**                                                                 |
|---------------------------|---------------------------------------------------------------------------------|------------------------------------------------------------------------------|
| **Hold-Out Validation**   | Simple train-validation split.                                                  | Fast and suitable for large datasets.                                       |
| **K-Fold Cross-Validation** | Splits data into \(k\) folds, each used as validation set once.                 | General-purpose, reliable for most datasets.                                |
| **Stratified K-Fold**     | Similar to K-Fold but preserves class distribution.                             | Best for imbalanced datasets.                                               |
| **Leave-One-Out (LOOCV)** | Each sample is used as a validation set once.                                   | Best for small datasets but computationally expensive for large datasets.   |
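
For routine use, scikit-learn's `cross_val_score` runs the fold loop and returns one score per fold. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in (an assumption, not the Titanic features above), and the names `X_demo`, `y_demo`, and `demo_model` are introduced just for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption; not the Titanic features used above)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_model = LogisticRegression(max_iter=500)
splitters = {
    "K-Fold": KFold(n_splits=5, shuffle=True, random_state=42),
    "Stratified K-Fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
}

cv_summary = {}
for name, splitter in splitters.items():
    # cross_val_score fits a clone of the estimator on each training split
    # and returns an array with one validation score per fold
    scores = cross_val_score(demo_model, X_demo, y_demo, cv=splitter, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean together with the spread across folds gives a sense of how stable the estimate is, which a single hold-out score cannot show.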


## Hyperparameter Tuning

**Grid Search and Randomized Search**

**1. Grid Search**
Grid Search is a hyperparameter tuning method that systematically searches through a predefined grid of hyperparameter values. For each combination of parameters, the model is trained, validated, and its performance is recorded.

**Advantages**
- **Exhaustive Search:** Grid Search explores all possible combinations of hyperparameter values, ensuring that the best combination within the predefined grid is identified.
- **Structured:** Easy to understand and implement, especially when the search space is small and clearly defined.
- **Deterministic:** Provides consistent results, as it evaluates every possible combination in the grid.

**Disadvantages**
- **Computationally Expensive:** As the number of hyperparameters and their possible values increases, the search space grows exponentially, leading to a significant computational burden.
- **Inefficient for Large Spaces:** Grid Search evaluates combinations that may not contribute significantly to performance improvement, wasting computational resources.
- **Limited Exploration:** Restricted to the predefined grid, it might miss optimal values outside the grid.
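
To make the cost concrete, the grid can be enumerated by hand. The sketch below uses `itertools.product` on the same Random Forest grid that appears in the Grid Search code later in this tutorial; it only counts combinations and does not train anything.

```python
from itertools import product

# Random Forest grid (same values as the param_grids entry used in this tutorial)
rf_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
}

# Cartesian product of all value lists: one dict per candidate configuration
names = list(rf_grid)
combinations = [dict(zip(names, values)) for values in product(*rf_grid.values())]

n_folds = 5
print(len(combinations))            # 3 * 3 * 3 = 27 combinations
print(len(combinations) * n_folds)  # 27 * 5 = 135 cross-validation fits
```

Adding a fourth hyperparameter with three values would triple both numbers, which is the exponential growth described above.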



**2. Randomized Search**
Randomized Search randomly samples combinations of hyperparameter values from a defined distribution. It evaluates a fixed number of random combinations rather than exploring all possible options.

**Advantages**
- **Computationally Efficient:** By sampling a fixed number of random combinations, Randomized Search saves time and computational resources.
- **Better Coverage in Large Spaces:** Randomized Search can explore a wider range of hyperparameters, potentially identifying better configurations in large or continuous spaces.
- **Scalable:** Suitable for problems with many hyperparameters or a large range of values.

**Disadvantages**
- **Not Exhaustive:** Randomized Search may miss the optimal combination of parameters because it only evaluates a subset of possible values.
- **Non-Deterministic:** Results may vary depending on the random seed and number of samples, requiring careful control for reproducibility.
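
The sampling idea can be sketched with the standard library alone. Everything here is an assumption for illustration: `sample_params` draws from hand-picked ranges, and `toy_score` is a made-up stand-in for a cross-validated accuracy, so no model is actually trained.

```python
import random

def sample_params(rng):
    """Draw one random configuration from the search space."""
    return {
        'n_estimators': rng.randint(50, 199),      # integer, both ends inclusive
        'max_depth': rng.choice([None, 10, 20]),   # categorical choice
        'min_samples_split': rng.randint(2, 19),
    }

def toy_score(params):
    """Hypothetical stand-in for a cross-validated accuracy (higher is better)."""
    depth_penalty = 0.0 if params['max_depth'] == 10 else 0.02
    return 0.85 - depth_penalty - abs(params['min_samples_split'] - 8) * 0.002

rng = random.Random(42)  # fixed seed so the same trials are drawn on every run
n_iter = 20              # the budget: how many configurations to evaluate
trials = [sample_params(rng) for _ in range(n_iter)]

best_params = max(trials, key=toy_score)
print(best_params, round(toy_score(best_params), 4))
```

The cost is set by `n_iter` rather than by the size of the search space, and fixing the seed addresses the reproducibility concern noted above.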

---

**Why Use Both Grid Search and Randomized Search in a Nested Approach?**

Combining Grid Search and Randomized Search leverages the strengths of both methods, mitigating their individual limitations:

1. **Focused and Efficient Exploration:**
   - Use **Grid Search** to exhaustively tune the most critical hyperparameters, which are expected to have the greatest impact on model performance (e.g., learning rate, regularization strength).
   - Use **Randomized Search** for less impactful or secondary hyperparameters, allowing a broader but less exhaustive exploration.

2. **Balanced Computational Cost:**
   - By limiting Grid Search to a small number of critical parameters, the computational overhead is reduced.
   - Randomized Search can then explore larger spaces efficiently without sacrificing too much precision.

3. **Improved Coverage:**
   - Grid Search ensures that critical parameters are thoroughly explored within the specified range.
   - Randomized Search introduces stochasticity, helping to identify configurations that might have been missed by the predefined grid.

4. **Adaptability:**
   - Nested Grid and Randomized Search is adaptable to both small and large datasets or search spaces, providing flexibility in hyperparameter tuning.

---

**Example Use Case**
For a Random Forest model:
- **Grid Search:** Tune `n_estimators` (number of trees) and `max_depth` (tree depth) using a small, predefined grid.
- **Randomized Search:** Tune `min_samples_split` and `min_samples_leaf` with broader, random sampling.

For a Neural Network:
- **Grid Search:** Focus on architectural parameters like the number of layers and neurons.
- **Randomized Search:** Explore training-related parameters like learning rate, batch size, and dropout rates.
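
One possible reading of the nested approach for the Random Forest case is an exhaustive outer loop over the critical parameters with a few random draws of the secondary ones at each grid point. The sketch below follows that reading; the value ranges, the helper `sample_secondary`, and the scoring function `toy_score` are hypothetical stand-ins, so treat it as an outline rather than a reference implementation.

```python
import random
from itertools import product

# Critical hyperparameters: searched exhaustively (values are assumptions)
critical_grid = {'n_estimators': [50, 100, 200], 'max_depth': [10, 20]}

def sample_secondary(rng):
    """Secondary hyperparameters: sampled at random (ranges are assumptions)."""
    return {
        'min_samples_split': rng.randint(2, 19),
        'min_samples_leaf': rng.randint(1, 9),
    }

def toy_score(params):
    """Hypothetical stand-in for a cross-validated score (higher is better)."""
    return (0.80
            + 0.0001 * params['n_estimators']
            - 0.001 * abs(params['max_depth'] - 10)
            - 0.002 * abs(params['min_samples_split'] - 8)
            - 0.003 * abs(params['min_samples_leaf'] - 3))

rng = random.Random(0)
n_random_per_point = 5
best_score, best_params, n_evaluated = float('-inf'), None, 0

for values in product(*critical_grid.values()):   # exhaustive outer grid
    critical = dict(zip(critical_grid, values))
    for _ in range(n_random_per_point):            # random inner samples
        candidate = {**critical, **sample_secondary(rng)}
        score = toy_score(candidate)
        n_evaluated += 1
        if score > best_score:
            best_score, best_params = score, candidate

print(n_evaluated, best_params, round(best_score, 4))
```

Here the budget is 6 grid points times 5 random draws, i.e. 30 evaluations, while a full grid over all four parameters with the same ranges would need far more.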

---

**Summary Table: Grid Search vs. Randomized Search**

| **Feature**               | **Grid Search**                          | **Randomized Search**                     |
|---------------------------|------------------------------------------|-------------------------------------------|
| **Exploration**            | Exhaustive within the grid              | Random sampling in a defined space        |
| **Computational Cost**     | High for large grids                    | Lower and more scalable                   |
| **Efficiency**             | Inefficient for large or continuous spaces | Efficient for high-dimensional spaces     |
| **Coverage**               | Limited to the predefined grid          | Covers a broader range of values          |
| **Deterministic**          | Yes                                     | No (depends on random sampling)           |
| **Best Use Case**          | Small, critical hyperparameter spaces   | Large, less critical or continuous spaces |



### Grid Search

In [17]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Define models and their parameter grids
models = {
    "Logistic Regression": LogisticRegression(max_iter=500),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Support Vector Machine": SVC(),
    "K-Nearest Neighbors": KNeighborsClassifier()
}

param_grids = {
    "Logistic Regression": {
        'C': [0.01, 0.1, 1, 10],
        'solver': ['liblinear', 'lbfgs']
    },
    "Random Forest": {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10]
    },
    "Support Vector Machine": {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf']
    },
    "K-Nearest Neighbors": {
        'n_neighbors': [3, 5, 7],
        'weights': ['uniform', 'distance']
    }
}

# List to store results
results = []

# Reset the best-model tracker so a value left over from an earlier run is not reused
Best_Model = None

# Perform Grid Search with 5-Fold Cross-Validation for each model
for model_name, model in models.items():
    print(f"Running Grid Search for {model_name}...")
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=param_grids[model_name],
        cv=5,  # 5-fold cross-validation (stratified by default for classifiers)
        scoring='accuracy',
        n_jobs=-1
    )
    grid_search.fit(X_train, y_train)

    # Store results
    for params, mean_score in zip(grid_search.cv_results_['params'], grid_search.cv_results_['mean_test_score']):
        results.append({
            'Model': model_name,
            'Parameters': params,
            'Validation Accuracy': mean_score
        })

    # print(f"Best Parameters for {model_name}: {grid_search.best_params_}")
    # print(f"Best Validation Accuracy for {model_name}: {grid_search.best_score_}")

    # Update the best model and parameters
    if Best_Model is None or grid_search.best_score_ > Best_Model['Validation Accuracy']:
        Best_Model = {
            'Model': model_name,
            'Parameters': grid_search.best_params_,
            'Validation Accuracy': grid_search.best_score_
        }

# Convert results to a DataFrame
results_df = pd.DataFrame(results)

# Print overall best model and parameters
print("\nBest Overall Model and Parameters:")
print(f"Model: {Best_Model['Model']}")
print(f"Parameters: {Best_Model['Parameters']}")
print(f"Validation Accuracy: {Best_Model['Validation Accuracy']}")

Running Grid Search for Logistic Regression...
Running Grid Search for Random Forest...
Running Grid Search for Support Vector Machine...
Running Grid Search for K-Nearest Neighbors...

Best Overall Model and Parameters:
Model: Random Forest
Parameters: {'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 50}
Validation Accuracy: 0.8241307987786861


In [18]:
results_df

| | Model | Parameters | Validation Accuracy |
|---|---|---|---|
| 0 | Logistic Regression | {'C': 0.01, 'solver': 'liblinear'} | 0.74 |
| 1 | Logistic Regression | {'C': 0.01, 'solver': 'lbfgs'} | 0.75 |
| 2 | Logistic Regression | {'C': 0.1, 'solver': 'liblinear'} | 0.78 |
| 3 | Logistic Regression | {'C': 0.1, 'solver': 'lbfgs'} | 0.79 |
| 4 | Logistic Regression | {'C': 1, 'solver': 'liblinear'} | 0.79 |
| 5 | Logistic Regression | {'C': 1, 'solver': 'lbfgs'} | 0.79 |
| 6 | Logistic Regression | {'C': 10, 'solver': 'liblinear'} | 0.79 |
| 7 | Logistic Regression | {'C': 10, 'solver': 'lbfgs'} | 0.79 |
| 8 | Random Forest | {'max_depth': None, 'min_samples_split': 2, 'n... | 0.8 |
| 9 | Random Forest | {'max_depth': None, 'min_samples_split': 2, 'n... | 0.81 |


In [19]:
from sklearn.metrics import accuracy_score

# Train the best model with the best parameters on the full training data
final_model = models[Best_Model['Model']]
final_model.set_params(**Best_Model['Parameters'])  # Apply the best parameters
final_model.fit(X_train, y_train)  # Train on the full training data

# Predict on the test set
y_test_pred = final_model.predict(X_test)

# Evaluate performance on the test set
test_accuracy = accuracy_score(y_test, y_test_pred)

# Evaluate performance on the train set
y_train_pred = final_model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)

# Print the results
print("\nBest Model Performance on Test Set:")
print(f"Model: {Best_Model['Model']}")
print(f"Train Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")



Best Model Performance on Test Set:
Model: Random Forest
Train Accuracy: 0.8917
Test Accuracy: 0.8315


### Randomized Search

In [20]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint


# Define models and their hyperparameter distributions
models = {
    "Logistic Regression": LogisticRegression(max_iter=500),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Support Vector Machine": SVC(),
    "K-Nearest Neighbors": KNeighborsClassifier()
}

param_distributions = {
    "Logistic Regression": {
        'C': uniform(0.01, 10),  # uniform(loc, scale): continuous on [0.01, 10.01]
        'solver': ['liblinear', 'lbfgs']
    },
    "Random Forest": {
        'n_estimators': randint(50, 200),  # integers 50..199 (upper bound is exclusive)
        'max_depth': [None, 10, 20],
        'min_samples_split': randint(2, 20)  # integers 2..19
    },
    "Support Vector Machine": {
        'C': uniform(0.1, 10),  # uniform(loc, scale): continuous on [0.1, 10.1]
        'kernel': ['linear', 'rbf']
    },
    "K-Nearest Neighbors": {
        'n_neighbors': randint(3, 20),  # integers 3..19
        'weights': ['uniform', 'distance']
    }
}

# Dictionary to store results
results = []

# Perform Randomized Search with 5-Fold Cross-Validation for each model
for model_name, model in models.items():
    print(f"Running Randomized Search for {model_name}...")
    
    random_search = RandomizedSearchCV(
        estimator=model,
        param_distributions=param_distributions[model_name],
        n_iter=20,  # Number of parameter combinations to try
        cv=5,  # K-Fold Cross-Validation with k=5
        scoring='accuracy',
        n_jobs=-1,
        random_state=42
    )
    random_search.fit(X_train, y_train)

    # Store results
    for params, mean_score in zip(random_search.cv_results_['params'], random_search.cv_results_['mean_test_score']):
        results.append({
            'Model': model_name,
            'Parameters': params,
            'Validation Accuracy': mean_score
        })

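    # Update the best model and parameters.
    # Note: Best_Model persists across notebook cells, so a result from an
    # earlier search is kept unless this search scores higher.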
    if 'Best_Model' not in locals() or random_search.best_score_ > Best_Model['Validation Accuracy']:
        Best_Model = {
            'Model': model_name,
            'Parameters': random_search.best_params_,
            'Validation Accuracy': random_search.best_score_
        }

# Convert results to a DataFrame
results_df = pd.DataFrame(results)

# Print overall best model and parameters
print("\nBest Overall Model and Parameters:")
print(f"Model: {Best_Model['Model']}")
print(f"Parameters: {Best_Model['Parameters']}")
print(f"Validation Accuracy: {Best_Model['Validation Accuracy']}")


# Train the best model with the best parameters on the full training data
final_model = models[Best_Model['Model']]
final_model.set_params(**Best_Model['Parameters'])  # Apply the best parameters
final_model.fit(X_train, y_train)  # Train on the full training data

# Predict on the test set
y_test_pred = final_model.predict(X_test)

# Evaluate performance on the test set
test_accuracy = accuracy_score(y_test, y_test_pred)

# Evaluate performance on the train set
y_train_pred = final_model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)

# Print the results
print("\nBest Model Performance on Test Set:")
print(f"Model: {Best_Model['Model']}")
print(f"Train Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")


Running Randomized Search for Logistic Regression...
Running Randomized Search for Random Forest...
Running Randomized Search for Support Vector Machine...
Running Randomized Search for K-Nearest Neighbors...

Best Overall Model and Parameters:
Model: Random Forest
Parameters: {'max_depth': 10, 'min_samples_split': 7, 'n_estimators': 179}
Validation Accuracy: 0.8241504973899341

Best Model Performance on Test Set:
Model: Random Forest
Train Accuracy: 0.9072
Test Accuracy: 0.8371
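
The `uniform` and `randint` objects used in this cell follow SciPy's conventions: `uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, and `randint(low, high)` excludes `high`. The sketch below imitates that sampling with the standard library only, to make the ranges concrete. `sample_params` and the `("uniform", loc, scale)` tuple format are hypothetical helpers written for illustration; they are not part of scikit-learn or SciPy, and the samples they draw will not match those of `RandomizedSearchCV`.

```python
import random


def sample_params(space, n_iter, seed=42):
    """Draw n_iter random parameter combinations from a search space.

    Hypothetical helper (not a scikit-learn API). Each entry in `space` is:
      ("uniform", loc, scale) -> float between loc and loc + scale
      ("randint", low, high)  -> integer from low to high - 1
      a list                  -> one element chosen at random
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_iter):
        params = {}
        for name, spec in space.items():
            if isinstance(spec, list):
                params[name] = rng.choice(spec)
            elif spec[0] == "uniform":
                _, loc, scale = spec
                params[name] = loc + scale * rng.random()
            elif spec[0] == "randint":
                _, low, high = spec
                params[name] = rng.randrange(low, high)  # high is excluded
            else:
                raise ValueError(f"Unknown spec for {name}: {spec}")
        samples.append(params)
    return samples


# Search spaces mirroring two of the distributions defined in the cell above
rf_space = {
    "n_estimators": ("randint", 50, 200),
    "max_depth": [None, 10, 20],
    "min_samples_split": ("randint", 2, 20),
}
svm_space = {"C": ("uniform", 0.1, 10), "kernel": ["linear", "rbf"]}

rf_samples = sample_params(rf_space, n_iter=20)
svm_samples = sample_params(svm_space, n_iter=20)
print(rf_samples[0])
print(svm_samples[0])
```

With `n_iter=20`, only 20 combinations are evaluated per model regardless of how large the search space is, which is the main practical difference from an exhaustive grid.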


### Nested Grid and Random Search

In [21]:
# Define models and their hyperparameters for Grid and Random Search
models = {
    "Logistic Regression": LogisticRegression(max_iter=500),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Support Vector Machine": SVC(),
    "K-Nearest Neighbors": KNeighborsClassifier()
}

grid_params = {
    "Logistic Regression": {
        'solver': ['liblinear', 'lbfgs'],  # Critical hyperparameter for Grid Search
    },
    "Random Forest": {
        'n_estimators': [50, 100, 200],  # Critical hyperparameter for Grid Search
    },
    "Support Vector Machine": {
        'kernel': ['linear', 'rbf'],  # Critical hyperparameter for Grid Search
    },
    "K-Nearest Neighbors": {
        'n_neighbors': [3, 5, 7],  # Critical hyperparameter for Grid Search
    }
}

# Less critical hyperparameters, tuned with Random Search
random_params = {
    "Logistic Regression": {
        'C': uniform(0.01, 10),
    },
    "Random Forest": {
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': randint(2, 50),
    },
    "Support Vector Machine": {
        'C': uniform(0.1, 50),
    },
    "K-Nearest Neighbors": {
        'weights': ['uniform', 'distance'],
    }
}

# List to store results
results = []



# Perform Nested Grid and Random Search for each model
for model_name, model in models.items():
    print(f"Running Nested Search for {model_name}...")
    
    # Step 1: Perform Grid Search for critical hyperparameters
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=grid_params[model_name],
        cv=5,  # K-Fold Cross-Validation with k=5
        scoring='accuracy',
        n_jobs=-1
    )
    
    grid_search.fit(X_train, y_train)
    best_grid_params = grid_search.best_params_
    
    # Step 2: Perform Random Search for less critical hyperparameters
    
    # Update model with best parameters from Grid Search
    model.set_params(**best_grid_params)
    random_search = RandomizedSearchCV(
        estimator=model,
        param_distributions=random_params[model_name],
        n_iter=20,  # Number of parameter combinations to try
        cv=5,  # K-Fold Cross-Validation with k=5
        scoring='accuracy',
        n_jobs=-1,
        random_state=42
    )
    
    random_search.fit(X_train, y_train)
    best_random_params = random_search.best_params_
    
    # Combine best parameters from Grid and Random Search
    combined_best_params = {**best_grid_params, **best_random_params}

    # Store results
    results.append({
        'Model': model_name,
        'Grid Search Parameters': best_grid_params,
        'Random Search Parameters': best_random_params,
        'Combined Parameters': combined_best_params,
        'Validation Accuracy': random_search.best_score_
    })

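    # Update the best model and parameters.
    # Note: Best_Model persists from earlier notebook cells; it is replaced
    # only if this search scores higher than the stored result.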
    if 'Best_Model' not in locals() or random_search.best_score_ > Best_Model['Validation Accuracy']:
        Best_Model = {
            'Model': model_name,
            'Parameters': combined_best_params,
            'Validation Accuracy': random_search.best_score_
        }

# Convert results to a DataFrame
results_df = pd.DataFrame(results)

# Print overall best model and parameters
print("\nBest Overall Model and Parameters:")
print(f"Model: {Best_Model['Model']}")
print(f"Parameters: {Best_Model['Parameters']}")
print(f"Validation Accuracy: {Best_Model['Validation Accuracy']}")

# Train the best model with the best parameters on the full training data
final_model = models[Best_Model['Model']]
final_model.set_params(**Best_Model['Parameters'])  # Apply the combined best parameters
final_model.fit(X_train, y_train)  # Train on the full training data

# Predict on the test set
y_test_pred = final_model.predict(X_test)

# Evaluate performance on the test set
test_accuracy = accuracy_score(y_test, y_test_pred)

# Evaluate performance on the train set
y_train_pred = final_model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)

# Print the results
print("\nBest Model Performance on Test Set:")
print(f"Train Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")


Running Nested Search for Logistic Regression...
Running Nested Search for Random Forest...
Running Nested Search for Support Vector Machine...
Running Nested Search for K-Nearest Neighbors...

Best Overall Model and Parameters:
Model: Random Forest
Parameters: {'max_depth': 10, 'min_samples_split': 7, 'n_estimators': 179}
Validation Accuracy: 0.8241504973899341





Best Model Performance on Test Set:
Train Accuracy: 0.9072
Test Accuracy: 0.8371
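
The parameters printed above include `n_estimators: 179`, a value that is not in the grid `[50, 100, 200]` and that matches the Randomized Search cell exactly. This suggests that the `Best_Model` dictionary from the earlier cell was carried over, because the loop only checks `'Best_Model' not in locals()`. One way to avoid this, sketched below under the assumption that each search should be judged on its own, is to reset the tracker before the loop. `update_best` and the candidate triples are hypothetical and stand in for the results of a real search.

```python
def update_best(best, model_name, params, score):
    """Return whichever of `best` and the new candidate has the higher score."""
    if best is None or score > best["Validation Accuracy"]:
        return {
            "Model": model_name,
            "Parameters": params,
            "Validation Accuracy": score,
        }
    return best


# Hypothetical (model, parameters, score) triples standing in for search results
candidates = [
    ("Model A", {"alpha": 1}, 0.70),
    ("Model B", {"beta": 2}, 0.80),
    ("Model C", {"gamma": 3}, 0.75),
]

best_model = None  # Reset at the start of every search
for name, params, score in candidates:
    best_model = update_best(best_model, name, params, score)

print(best_model["Model"], best_model["Validation Accuracy"])
```

Starting from `None` makes the result depend only on the current loop, so rerunning cells in a different order cannot change which model is reported.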


### Nested Grid and Random Search on ANN

In [23]:
from scikeras.wrappers import KerasClassifier
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

# Define a function to create the ANN model
def create_model(n_layers=1, n_neurons=32, dropout_rate=0.0, learning_rate=0.01):
    model = Sequential()
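    model.add(Dense(n_neurons, activation='relu', input_shape=(X_train.shape[1],)))  # First hidden layer; also sets the input shape
    for _ in range(n_layers - 1):  # Remaining hidden layers, each followed by dropout (none when n_layers=1)
        model.add(Dense(n_neurons, activation='relu'))
        model.add(Dropout(dropout_rate))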
    model.add(Dense(len(np.unique(y_train)), activation='softmax'))  # Output layer
    model.compile(
        optimizer=Adam(learning_rate=learning_rate),
        loss='sparse_categorical_crossentropy',  # Use sparse loss for label-based targets
        metrics=['accuracy']
    )
    return model

# Wrap the Keras model for compatibility with scikit-learn
model_wrapper = KerasClassifier(
    model=create_model,
    verbose=0  # Suppress training logs
)

# Step 1: Grid Search for critical hyperparameters
grid_params = {
    'model__n_layers': [1, 2, 3],  # Number of hidden layers
    'model__n_neurons': [16, 32],  # Number of neurons per layer
    'batch_size': [8, 16]  # Batch size
}

grid_search = GridSearchCV(
    estimator=model_wrapper,
    param_grid=grid_params,
    cv=3,
    scoring='accuracy',
    n_jobs=2
)

print("Running Grid Search...")
grid_search.fit(X_train, y_train)

# Get the best parameters from Grid Search
best_grid_params = grid_search.best_params_
print(f"Grid Search Best Parameters: {best_grid_params}")
print(f"Grid Search Best Validation Accuracy: {grid_search.best_score_}")

# Step 2: Random Search for less critical hyperparameters
# Update model with best parameters from Grid Search
model_wrapper.set_params(**best_grid_params)

random_params = {
    'model__learning_rate': uniform(0.0001, 0.01),  # uniform(loc, scale): learning rate in [0.0001, 0.0101]
    'model__dropout_rate': uniform(0, 0.5),  # Dropout rate in [0, 0.5]
    'epochs': randint(10, 100)  # Integer number of epochs from 10 to 99
}

random_search = RandomizedSearchCV(
    estimator=model_wrapper,
    param_distributions=random_params,
    n_iter=10,  # Number of random combinations to try
    cv=3,
    scoring='accuracy',
    n_jobs=2,
    random_state=42
)

print("Running Random Search...")
random_search.fit(X_train, y_train)

# Get the best parameters from Random Search
best_random_params = random_search.best_params_
print(f"Random Search Best Parameters: {best_random_params}")
print(f"Random Search Best Validation Accuracy: {random_search.best_score_}")

# Combine best parameters from both searches
combined_best_params = {**best_grid_params, **best_random_params}
print(f"Combined Best Parameters: {combined_best_params}")

# Train the final model with the best parameters
final_model = create_model(
    n_layers=combined_best_params['model__n_layers'],
    n_neurons=combined_best_params['model__n_neurons'],
    dropout_rate=combined_best_params['model__dropout_rate'],
    learning_rate=combined_best_params['model__learning_rate']
)

final_model.fit(
    X_train, y_train,
    epochs=combined_best_params['epochs'],
    batch_size=combined_best_params['batch_size'],
    verbose=0
)

# Predict on both train and test sets
y_train_pred = final_model.predict(X_train)
y_test_pred = final_model.predict(X_test)

# Convert predictions to class labels
y_train_pred = np.argmax(y_train_pred, axis=1)
y_test_pred = np.argmax(y_test_pred, axis=1)

# Evaluate performance on train and test sets
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print("\nBest ANN Model Performance:")
print(f"Train Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")


Running Grid Search...


Grid Search Best Parameters: {'batch_size': 8, 'model__n_layers': 3, 'model__n_neurons': 32}
Grid Search Best Validation Accuracy: 0.7988748241912799
Running Random Search...


Random Search Best Parameters: {'epochs': 61, 'model__dropout_rate': 0.4753571532049581, 'model__learning_rate': 0.007419939418114052}
Random Search Best Validation Accuracy: 0.80028129395218
Combined Best Parameters: {'batch_size': 8, 'model__n_layers': 3, 'model__n_neurons': 32, 'epochs': 61, 'model__dropout_rate': 0.4753571532049581, 'model__learning_rate': 0.007419939418114052}



Best ANN Model Performance:
Train Accuracy: 0.8495
Test Accuracy: 0.8146
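
Training a neural network for every candidate is expensive, so it helps to count the cross-validation fits before launching a search. The sketch below does that arithmetic for the two stages used in this cell, using the same grid, `cv=3`, and `n_iter=10`. The variable names are chosen for illustration, and the count covers cross-validation fits only.

```python
from itertools import product

# Same grid and settings as the ANN search above
grid_params = {
    "model__n_layers": [1, 2, 3],
    "model__n_neurons": [16, 32],
    "batch_size": [8, 16],
}
cv_folds = 3   # cv=3 in both searches
n_iter = 10    # n_iter=10 in the Random Search stage

# Grid Search evaluates every combination on every fold
grid_combinations = list(product(*grid_params.values()))
grid_fits = len(grid_combinations) * cv_folds

# Random Search evaluates n_iter sampled combinations on every fold
random_fits = n_iter * cv_folds

print(f"Grid combinations: {len(grid_combinations)}")
print(f"Grid Search CV fits: {grid_fits}")
print(f"Random Search CV fits: {random_fits}")
print(f"Total CV fits: {grid_fits + random_fits}")
```

By default, each search also refits the best configuration once on the full training data, so the actual number of training runs is slightly higher than the total printed here.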
