# Model Selection & Tuning: Grid Search

## Context
Machine learning algorithms have settings called **hyperparameters** that you choose before training, rather than values the model learns from the data. For a Decision Tree, one example is `max_depth`. For a Support Vector Machine, examples are `C` and the `kernel`.

Choosing a good configuration by hand is largely guesswork. Instead, we use **Grid Search** to automatically train and evaluate the model on every combination in a predefined grid of hyperparameter values, then keep the combination with the best cross-validated score. The result is the best setup within that grid, not a guaranteed global optimum. Here we apply it to a model that predicts infrastructure events.
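As a small illustration of what "set before training" means, the sketch below passes the hyperparameters named above to two scikit-learn estimators. The specific values (`max_depth=3`, `C=1.0`, `kernel='rbf'`) are arbitrary choices for the example, not recommendations.

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hyperparameters are constructor arguments, fixed before .fit() is called
tree = DecisionTreeClassifier(max_depth=3)
svm = SVC(C=1.0, kernel='rbf')

# get_params() returns the current hyperparameter settings as a dict
print(tree.get_params()['max_depth'])
print(svm.get_params()['C'], svm.get_params()['kernel'])
```

Grid search automates the step of choosing these constructor arguments.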

## Objectives
- Generate a synthetic dataset for predicting "API Timeout Alerts" from concurrency and DB query load.
- Use `GridSearchCV` to exhaustively test hyperparameters for a Random Forest classifier.
- Safely pick the best model for continuous integration/deployment.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

### 1. Generating API Telemetry Data

In [None]:
np.random.seed(42)
n_samples = 600

X = pd.DataFrame({
    'Concurrent_Users': np.random.normal(500, 200, n_samples),
    'DB_Queries_Per_Sec': np.random.normal(2000, 500, n_samples)
})

# Ground truth: a timeout alert fires when Users > 700 AND DB queries > 2200
y = ((X['Concurrent_Users'] > 700) & (X['DB_Queries_Per_Sec'] > 2200)).astype(int)

# Flip 50 randomly chosen labels to simulate noisy alerts
noise = np.random.choice(n_samples, size=50, replace=False)
y.iloc[noise] = 1 - y.iloc[noise].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
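Because alerts only fire when both thresholds are exceeded, the positive class is likely to be a minority, which matters when reading accuracy numbers later. The sketch below rebuilds the same kind of data under separate names (`X_demo`, `y_demo`, and `rng` are illustrative, not part of the notebook) and checks the class balance. The `stratify=y_demo` argument is an optional variation on the split above, shown here as an assumption rather than what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Local random generator so the notebook's global seed is left untouched
rng = np.random.RandomState(42)
n = 600

X_demo = pd.DataFrame({
    'Concurrent_Users': rng.normal(500, 200, n),
    'DB_Queries_Per_Sec': rng.normal(2000, 500, n),
})

# Clean rule first, then flip 50 labels as noise
rule = ((X_demo['Concurrent_Users'] > 700) & (X_demo['DB_Queries_Per_Sec'] > 2200)).astype(int)
y_demo = rule.copy()
flip = rng.choice(n, size=50, replace=False)
y_demo.iloc[flip] = 1 - y_demo.iloc[flip].to_numpy()

# Share of each class (0 = no alert, 1 = alert)
class_share = y_demo.value_counts(normalize=True).sort_index()
print(class_share)

# stratify keeps the alert rate roughly equal in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print("Alert rate (train / test):", round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

If the alert share is small, a model that always predicts "no alert" can still score a high accuracy, so the per-class numbers in a classification report are worth reading alongside it.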

### 2. The Baseline Model (Guessing Hyperparameters)
Let's see how our model performs with scikit-learn's default hyperparameters (for example, `n_estimators=100` and `max_depth=None`), without any tuning.

In [None]:
baseline_rf = RandomForestClassifier(random_state=42)
baseline_rf.fit(X_train, y_train)

baseline_pred = baseline_rf.predict(X_test)
print("Baseline Accuracy: {:.2f}%".format(accuracy_score(y_test, baseline_pred) * 100))

### 3. Defining the Hyperparameter Grid
We will tell scikit-learn to test every combination of the following values:
- `n_estimators` (number of trees): 50, 100, or 200.
- `max_depth` (maximum depth of each tree): None (no limit), 5, or 10.
- `min_samples_split` (minimum samples required to split a node): 2, 5, or 10.

This gives $3 \times 3 \times 3 = 27$ candidate configurations. With 5-fold cross-validation, each candidate is trained 5 times, for a total of **135 model fits**, plus one final refit of the best candidate on the full training set.

In [None]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

print("Grid search will evaluate every combination of these parameters:", param_grid)
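The candidate count can be checked in code rather than by hand. This sketch uses scikit-learn's `ParameterGrid` on a copy of the grid; `param_grid_demo`, `n_candidates`, `n_folds`, and `n_cv_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the grid, kept under a separate name for this illustration
param_grid_demo = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
}

# ParameterGrid expands the dict into every combination of values
candidates = list(ParameterGrid(param_grid_demo))
n_candidates = len(candidates)

n_folds = 5
n_cv_fits = n_candidates * n_folds

print("Candidates:", n_candidates)           # 27
print("Cross-validation fits:", n_cv_fits)   # 135
print("Example candidate:", candidates[0])
```

Adding a fourth hyperparameter with three values would triple both numbers, which is why grids are usually kept small.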

### 4. Executing `GridSearchCV`

In [None]:
# Initialize GridSearchCV
# n_jobs=-1 tells scikit-learn to use all available CPU cores to fit models in parallel
grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                           param_grid=param_grid,
                           cv=5,
                           n_jobs=-1,
                           verbose=1)  # verbose=1 prints a progress summary

# Fit it: 27 candidates x 5 folds = 135 fits, then one refit of the best candidate
grid_search.fit(X_train, y_train)
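After fitting, the per-candidate scores are available in the `cv_results_` attribute, which is useful for seeing how close the runners-up were. The sketch below is a reduced illustration: it uses a stand-in dataset from `make_classification` and a deliberately tiny grid (`small_grid`) so it runs quickly. Neither is part of this notebook's data or grid.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data and a 2 x 2 grid, chosen only to keep the example fast
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
small_grid = {'n_estimators': [10, 20], 'max_depth': [None, 3]}

search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays; a DataFrame makes it easier to read
results = pd.DataFrame(search.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))
```

The same `pd.DataFrame(grid_search.cv_results_)` call should work on the fitted `grid_search` object from the cell above. When several candidates score within one standard deviation of each other, the simpler one is often a reasonable pick.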

### 5. Reviewing Results
Let's see what the "winning" combination was and evaluate it.

In [None]:
print("Best Hyperparameters Found: ", grid_search.best_params_)
print("Best CV Accuracy: {:.2f}%\n".format(grid_search.best_score_ * 100))

# With the default refit=True, grid_search refits the best configuration on the
# full training set and exposes it as best_estimator_, ready to use
best_model = grid_search.best_estimator_
optimized_pred = best_model.predict(X_test)

print("Optimized Test Accuracy: {:.2f}%".format(accuracy_score(y_test, optimized_pred) * 100))
print("\nClassification Report:")
print(classification_report(y_test, optimized_pred))

### Summary
In automated ML pipelines (such as retraining a model weekly on fresh Logstash data), `GridSearchCV` lets the script re-select the best hyperparameters from the grid on every run, without a data scientist tuning them by hand. The search only covers the values listed in the grid, so the grid itself still deserves an occasional review.
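One way such a script might guard against regressions is to promote the tuned model only when it beats the default model on held-out data. The sketch below is a hypothetical gate, not a scikit-learn feature: `pick_model`, `min_gain`, the stand-in dataset, and the small grid are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split


def pick_model(X_fit, y_fit, X_val, y_val, grid, min_gain=0.0):
    """Hypothetical gate: return the tuned model only if it beats the baseline."""
    baseline = RandomForestClassifier(random_state=42).fit(X_fit, y_fit)

    search = GridSearchCV(RandomForestClassifier(random_state=42), grid, cv=3)
    search.fit(X_fit, y_fit)

    # Compare both models on data neither of them was trained on
    base_acc = accuracy_score(y_val, baseline.predict(X_val))
    tuned_acc = accuracy_score(y_val, search.best_estimator_.predict(X_val))

    if tuned_acc - base_acc > min_gain:
        return 'tuned', search.best_estimator_, base_acc, tuned_acc
    return 'baseline', baseline, base_acc, tuned_acc


# Stand-in data; a real pipeline would load fresh telemetry here
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

small_grid = {'n_estimators': [10, 20], 'max_depth': [None, 3]}
choice, model, base_acc, tuned_acc = pick_model(X_fit, y_fit, X_val, y_val, small_grid)
print(choice, round(base_acc, 3), round(tuned_acc, 3))
```

Keeping the baseline as the fallback means a weekly retrain cannot ship a model that scores worse than the untuned default on the validation split.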