### Notation Used in This Notebook

In Linear Models, the hypothesis (prediction function) is traditionally written as:

$$
y(x) = mx + c
$$

However, in Machine Learning, especially for **Logistic Regression and Linear Regression**, we use **vectorized parameter notation**:

$$
h_\theta(x) = \theta_0 + \theta_1 x_1
$$

where  
- $ \theta_0 $ → intercept / bias term  
- $ \theta_1 $ → weight associated with feature $ x_1 $

This notation makes the model easier to **extend to multiple features (dimensions)** and **express in matrix form** later.
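As a rough illustration of why the $\theta$ notation extends cleanly, the sketch below evaluates $h_\theta(x) = \theta_0 + \theta_1 x_1$ both term by term and as a matrix product. The sample values and the helper name `add_bias_column` are made up for illustration.

```python
import numpy as np

def add_bias_column(X):
    """Prepend a column of ones so that theta_0 acts as the intercept."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

# Three samples with a single feature x_1 (illustrative values)
X = np.array([[1.0], [2.0], [3.0]])
theta = np.array([0.5, 2.0])  # [theta_0, theta_1]

# Scalar form: theta_0 + theta_1 * x_1
h_scalar = theta[0] + theta[1] * X[:, 0]

# Matrix form: X_b @ theta, which works unchanged for any number of features
h_matrix = add_bias_column(X) @ theta

print(h_scalar)
print(h_matrix)
```

Both forms give the same predictions; only the matrix form stays the same line of code when more feature columns are added.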


---



# Logistic Regression: Binary Classification

Logistic Regression is a supervised machine learning algorithm used when the **target variable has only two possible classes** (0 or 1).  
It is one of the most popular and foundational algorithms for **classification problems**.

Examples of binary classification:
- Spam (1) vs Not Spam (0)
- Disease (1) vs No Disease (0)
- Fraud (1) vs Legit (0)
- Login Success (1) vs Failure (0)

---

## Why Logistic Regression?
Linear Regression is not suitable for classification because:
- It predicts unbounded continuous values, which can fall below 0 or above 1
- Its output therefore cannot be read as a probability
- It is **very sensitive to outliers**: a few extreme points can tilt the fitted line
- Most importantly, that tilt moves the point where the line crosses 0.5, so **the decision boundary becomes unreliable for binary outcomes**

So instead of predicting any numeric value, logistic regression predicts the **probability that a sample belongs to class 1**.

---

## The Need for an Activation Function
If we apply linear regression directly:
$$
z = \theta^T x
$$

or 

$$
h_\theta(x) = \theta_0 + \theta_1 x_1
$$
the output can be any value from $-\infty$ to $+\infty$, but for classification we need a probability between **0 and 1**.

To convert linear output into a probability, we apply the **Sigmoid Activation Function**.

---

## Sigmoid Function
$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

- $ z  = \theta_0 + \theta_1 x_1 $
- Compresses $ z $ into the open interval **(0, 1)**
- Helps interpret the output as a **probability**
- Creates a **smooth S-shaped curve**
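A minimal NumPy sketch of the sigmoid, assuming moderate input values (very large negative inputs would trigger an overflow warning in `np.exp`). The input array `z` is illustrative.

```python
import numpy as np

def sigmoid(z):
    """Map any real z into (0, 1) using 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
p = sigmoid(z)
print(p)
```

The output is 0.5 at $z = 0$, approaches 0 for very negative $z$, and approaches 1 for very positive $z$, which is the S-shape described in the list.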

---

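## Cost Function

$$
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
$$

We know that $ h_\theta(x^{(i)}) = \frac{1}{1 + e^{-z}} \quad \bigg|_{z=\theta_0 + \theta_1 x^{(i)}} $

Hence the per-sample term $ \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 $ can be denoted as follows

$$ 
\text{Cost}(h_\theta(x^{(i)}), y^{(i)})
$$

Therefore,

$$
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \text{Cost}(h_\theta(x^{(i)}), y^{(i)})
$$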

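With the sigmoid inside $h_\theta$, the squared-error cost above is not convex in $\theta$. To obtain a **convex function** (so that Gradient Descent can reach the **global minimum**), we use the **Log Loss (Binary Cross-Entropy Loss)** as the per-sample cost.

$$
\text{Cost}(h_\theta(x^{(i)}), y^{(i)})
= -\, y^{(i)} \, \log(h_\theta(x^{(i)})) \;-\; (1 - y^{(i)}) \, \log(1 - h_\theta(x^{(i)}))
$$

$$
\text{Cost}(h_\theta(x^{(i)}), y^{(i)}) =
\begin{cases}
-\log(h_\theta(x^{(i)})) & \text{if } y^{(i)} = 1 \\
-\log(1 - h_\theta(x^{(i)})) & \text{if } y^{(i)} = 0
\end{cases}
$$

This per-sample cost is what we call Log Loss.

Therefore, the cost function can be denoted as 
$$
J(\theta_0, \theta_1) = -\frac{1}{2m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]
$$

The factor $\frac{1}{2m}$ is carried over from the squared-error form; many references average with $\frac{1}{m}$ instead, which rescales $J$ without changing the minimizing $\theta$.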
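To make the behaviour of Log Loss concrete, here is a small NumPy sketch. It averages with the conventional $\frac{1}{m}$ (a different constant only rescales the value), and the function name `log_loss_cost`, the clipping constant `eps`, and the sample probabilities are assumptions made for illustration.

```python
import numpy as np

def log_loss_cost(y, p, eps=1e-12):
    """Mean binary cross-entropy; eps keeps log() away from 0."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    per_sample = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return per_sample.mean()

y_true = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.8, 0.2])  # probabilities close to the labels
confident_wrong = np.array([0.1, 0.9, 0.2, 0.8])  # probabilities far from the labels

cost_right = log_loss_cost(y_true, confident_right)
cost_wrong = log_loss_cost(y_true, confident_wrong)
print(cost_right, cost_wrong)
```

Confident wrong predictions are penalised far more heavily than confident correct ones, which is the property that makes this cost useful for probabilities.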

---

## Logistic Regression Hypothesis
$$
h_\theta(x) = \sigma(\theta^T x)
$$

or

$$
h_\theta(x) = \sigma(\theta_0 + \theta_1 x_1)
$$

---

## Why the Loss Function Cannot Be Mean Squared Error
If we use **MSE (Mean Squared Error)** for logistic regression:
- The optimization surface becomes **non-convex**
- Gradient Descent may get stuck in **local minima** or stall in flat regions
- Training becomes **unstable and the fitted model less accurate**

This happens because the sigmoid is a **non-linear transformation**: squaring the error of a sigmoid output gives a cost that is no longer convex in $\theta$.
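The non-convexity can be checked numerically on the smallest possible case. The sketch below assumes a single training sample with $x = 1$, $y = 1$ and one parameter $\theta$, so $h_\theta(x) = \sigma(\theta)$; it uses second differences as a stand-in for curvature. This is an illustration, not a proof.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One training sample with x = 1 and y = 1, so h_theta(x) = sigmoid(theta)
theta = np.linspace(-6.0, 6.0, 121)
h = sigmoid(theta)

mse_cost = (h - 1.0) ** 2   # squared error for y = 1
log_cost = -np.log(h)       # log loss for y = 1

# Second differences approximate curvature; a convex curve never has a negative one
mse_curvature = np.diff(mse_cost, n=2)
log_curvature = np.diff(log_cost, n=2)

print("MSE has negative curvature somewhere:", bool((mse_curvature < 0).any()))
print("Log loss has negative curvature somewhere:", bool((log_curvature < 0).any()))
```

On this grid the squared-error curve bends downward for sufficiently negative $\theta$, while the log-loss curve bends upward everywhere.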

---

## Solution: Log Loss
To ensure a **convex curve** (smooth bowl-shape) that Gradient Descent can minimize efficiently, logistic regression uses **Log Loss (Binary Cross-Entropy Loss)**.

$$
\text{Cost}(h_\theta(x^{(i)}), y^{(i)})
= -\, y^{(i)} \, \log(h_\theta(x^{(i)})) \;-\; (1 - y^{(i)}) \, \log(1 - h_\theta(x^{(i)}))
$$

$$
\text{Cost}(h_\theta(x^{(i)}), y^{(i)}) =
\begin{cases}
-\log(h_\theta(x^{(i)})) & \text{if } y^{(i)} = 1 \\
-\log(1 - h_\theta(x^{(i)})) & \text{if } y^{(i)} = 0
\end{cases}
$$

Therefore, the cost function can be denoted as 
$$
J(\theta_0, \theta_1) = -\frac{1}{2m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]
$$

---

## Summary
| Concept | Meaning |
|--------|---------|
| Goal | Classify input into 0 or 1 |
| Output | Probability of being class 1 |
| Activation | Sigmoid to squash values into 0–1 |
| Best Loss Function | Binary Cross-Entropy (Convex) |
| Optimization | Gradient Descent |

---

Final takeaway:  
Even though the name says **"Regression"**, Logistic Regression is actually used for **classification**, and it works by predicting **probabilities**, not continuous values.


In [181]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [182]:
# make_classification is used to generate a dummy dataset with input features and class labels
# We can control number of samples, features, informative features, redundant features, and number of classes
from sklearn.datasets import make_classification

In [183]:
# Generate synthetic data for classification:
# n_samples = total rows (1000)
# n_features = total input features per row (10)
# n_classes = output classes (binary: 0 or 1)
# random_state = for reproducibility of results
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)

In [184]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets:
# X  = input features
# y  = target labels
# test_size = 0.30 → 30% of the data will be used for testing and 70% for training
# random_state = 15 → ensures the split is always the same (reproducible results)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=15)


In [185]:
## Model training

from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
# LogisticRegression() creates a classifier that predicts the probability of a sample
# belonging to class 1 using the sigmoid function
logistic_model = LogisticRegression()

# Train (fit) the model on the training data
# X_train = input features for training
# y_train = actual labels for those samples
# During training, the model learns the optimal weights (θ) that minimize the log-loss cost function
logistic_model.fit(X_train, y_train)


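LogisticRegression parameters:

| Parameter | Value |
|-----------|-------|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | |
| random_state | |
| solver | 'lbfgs' |
| max_iter | 100 |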


In [186]:
# Make predictions on the test dataset using the trained model
# X_test = input features that the model has never seen before
# y_pred = predicted class labels (0 or 1) for each sample in X_test
y_pred = logistic_model.predict(X_test)

# Print the prediction outputs in a clean and readable format
print("-" * 75)
print("Prediction:")
print("-" * 75)
print(y_pred)    # Displays the array of predicted labels
print("-" * 75)


---------------------------------------------------------------------------
Prediction:
---------------------------------------------------------------------------
[0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 1 1 0 0 0 0 0 0
 0 0 0 1 1 0 1 0 0 1 1 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 0 1 1 1 0 0 0 0 1
 1 1 1 1 0 0 0 0 1 0 0 1 0 1 1 0 0 1 0 0 0 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 0
 0 1 0 1 0 1 0 0 0 0 0 1 0 1 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 1 0 1 1 0 0
 1 0 0 0 1 1 0 1 0 0 1 0 0 0 0 1 1 0 1 1 1 1 1 0 1 1 1 0 0 0 1 0 0 0 1 0 1
 0 0 1 1 0 1 0 0 1 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 1 0 0 1 0 0 1 0 0 1 0 1 1
 1 0 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 1 1 1 1 0 1 0 1 0 0 0 1 1 1 1 0 0 1 0 1
 0 0 1 1 1 0 0 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1
 0 0 1 1]
---------------------------------------------------------------------------


In [187]:
# Get the predicted probabilities for each class
# logistic_model.predict_proba(X_test) returns a 2-column array:
#   - Column 0 → probability that the sample belongs to class 0
#   - Column 1 → probability that the sample belongs to class 1
# These probabilities come from the sigmoid function (for binary logistic regression)
# and are useful when we need confidence scores instead of just class predictions.
logistic_model.predict_proba(X_test)


array([[9.89982061e-01, 1.00179394e-02],
       [9.81564732e-01, 1.84352676e-02],
       [1.74722738e-03, 9.98252773e-01],
       [9.48052078e-01, 5.19479217e-02],
       [6.79745877e-01, 3.20254123e-01],
       [8.57637307e-01, 1.42362693e-01],
       [9.83706289e-01, 1.62937109e-02],
       [9.90902788e-01, 9.09721193e-03],
       [8.75802329e-01, 1.24197671e-01],
       [8.65452066e-01, 1.34547934e-01],
       [9.83544149e-01, 1.64558514e-02],
       [1.48175976e-01, 8.51824024e-01],
       [4.66110623e-01, 5.33889377e-01],
       [9.08202083e-03, 9.90917979e-01],
       [1.34569854e-01, 8.65430146e-01],
       [7.97765698e-01, 2.02234302e-01],
       [9.50394674e-01, 4.96053262e-02],
       [9.87121968e-01, 1.28780318e-02],
       [3.78895329e-01, 6.21104671e-01],
       [1.35492989e-01, 8.64507011e-01],
       [3.31601568e-01, 6.68398432e-01],
       [9.38937899e-01, 6.10621013e-02],
       [8.46108695e-02, 9.15389130e-01],
       [8.86959048e-01, 1.13040952e-01],
       [5.496784
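The link between `predict_proba`, the sigmoid, and `predict` can be checked on a small model. The sketch below uses a separate synthetic dataset (`X_demo`, `y_demo`, made up for this check) and assumes the default binary `LogisticRegression` setup used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

z = model.decision_function(X_demo)    # linear score theta^T x for each sample
proba = model.predict_proba(X_demo)    # columns: P(class 0), P(class 1)
sigmoid_z = 1.0 / (1.0 + np.exp(-z))

# Column 1 should be the sigmoid of the linear score
print(np.allclose(proba[:, 1], sigmoid_z))
# predict() should match thresholding P(class 1) at 0.5
print(np.array_equal(model.predict(X_demo), (proba[:, 1] > 0.5).astype(int)))
```

Raising or lowering the 0.5 threshold on `proba[:, 1]` is one way to trade false positives against false negatives without retraining.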

In [188]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Evaluate the model performance on test data

# 1. Accuracy Score
# Measures the fraction of correctly predicted samples:
# (TP + TN) / (TP + TN + FP + FN)
score = accuracy_score(y_test, y_pred)

# 2. Confusion Matrix
# Summarizes the prediction results in a 2×2 matrix.
# scikit-learn puts actual classes on rows and predicted classes on columns,
# with labels sorted ascending, so for labels 0 and 1 the layout is:
# [[TN, FP],
#  [FN, TP]]
# The result is stored as `cm` so the imported function is not overwritten
cm = confusion_matrix(y_test, y_pred)

# 3. Classification Report
# Provides precision, recall, F1-score, and support for each class
report = classification_report(y_test, y_pred)

# Print evaluation results in a readable format
print(f"Accuracy score: {score}")
print(f"Confusion matrix: \n{cm}")
print(f"Classification Report: \n{report}")


Accuracy score: 0.8333333333333334
Confusion matrix: 
[[128  30]
 [ 20 122]]
Classification Report: 
              precision    recall  f1-score   support

           0       0.86      0.81      0.84       158
           1       0.80      0.86      0.83       142

    accuracy                           0.83       300
   macro avg       0.83      0.83      0.83       300
weighted avg       0.84      0.83      0.83       300




---
## Confusion Matrix

A confusion matrix is a table used to evaluate the performance of a classification model (mainly in supervised learning). It is built by comparing the predicted labels with the actual labels of the dataset.

$$
\begin{array}{c|cc}
 & \text{Predicted Positive} & \text{Predicted Negative} \\
\hline
\text{Actual Positive} & TP & FN \\
\text{Actual Negative} & FP & TN
\end{array}
$$

Note that scikit-learn's `confusion_matrix` sorts the labels in ascending order, so for labels 0 and 1 it returns the layout `[[TN, FP], [FN, TP]]`.


---

## Meaning of each term

| Term                    | Meaning                                                                        |
| ----------------------- | ------------------------------------------------------------------------------ |
| **TP – True Positive**  | Model predicted **Positive** and it **was actually Positive**                  |
| **TN – True Negative**  | Model predicted **Negative** and it **was actually Negative**                  |
| **FP – False Positive** | Model predicted **Positive**, but it was **actually Negative** (Type-I error)  |
| **FN – False Negative** | Model predicted **Negative**, but it was **actually Positive** (Type-II error) |
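A short sketch of reading the four counts out of scikit-learn's confusion matrix. The label lists are made up for illustration; the unpacking order relies on scikit-learn returning `[[TN, FP], [FN, TP]]` for labels 0 and 1.

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels: 1 = positive, 0 = negative
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 0, 0, 1]

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(tn, fp, fn, tp)
```

Unpacking the counts by name avoids mixing up which cell is which when computing metrics by hand.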

---


## Accuracy Score

Accuracy score can be represented as 

$$
Accuracy = \frac{TP + TN}{TP + FP + FN + TN}
$$

Accuracy is misleading on an imbalanced dataset: a model that always predicts the majority class scores high while never detecting the minority class. In such cases, Precision and Recall give a more reliable picture.
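The weakness of accuracy on imbalanced data is easy to reproduce. The sketch below uses a made-up dataset of 95 negatives and 5 positives and a hypothetical "model" that always predicts class 0.

```python
# 95 negatives and 5 positives; the "model" always predicts the majority class 0
y_true_imb = [0] * 95 + [1] * 5
y_pred_imb = [0] * 100

pairs = list(zip(y_true_imb, y_pred_imb))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn)

print(accuracy)  # 0.95
print(recall)    # 0.0
```

An accuracy of 95% hides the fact that not a single positive sample was detected.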

---

## Precision

Precision can be represented as

$$
Precision = \frac{TP}{TP + FP} \quad \bigg | \text{ Out of all the predicted positives, how many are actually positive}
$$

---

## Recall

Recall can be represented as 

$$
Recall = \frac{TP}{TP + FN} \quad \bigg| \text{ Out of all the actual positives, how many are correctly predicted}
$$
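As a worked example, the counts below are read from the baseline confusion matrix printed earlier in this notebook (`[[128, 30], [20, 122]]`), assuming scikit-learn's `[[TN, FP], [FN, TP]]` layout with class 1 as the positive class.

```python
# Counts from the baseline confusion matrix, layout [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 128, 30, 20, 122

precision = tp / (tp + fp)   # of the samples predicted positive, how many were right
recall = tp / (tp + fn)      # of the actually positive samples, how many were found

print(round(precision, 2), round(recall, 2))  # 0.8 0.86
```

These values agree with the precision and recall reported for class 1 in the classification report above.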

---

## F-$\beta$ Score

F-$\beta$ score can be represented as 
$$
F_\beta = (1 + \beta^2) \frac{Precision \times Recall}{\beta^2 \cdot Precision + Recall}
$$

Use cases:

- If $ FP $ and $ FN $ are both important, then $\beta = 1$

Therefore,
$$
F_1 = (1 + 1^2) \frac{Precision \times Recall}{1^2 \cdot Precision + Recall} = \frac{2 \times Precision \times Recall}{Precision + Recall}
$$

This is the harmonic mean of Precision and Recall.

- If $ FP $ is more important than $ FN $, then $\beta = 0.5$

Therefore,

$$
F_{0.5} = (1 + 0.5^2) \frac{Precision \times Recall}{0.5^2 \cdot Precision + Recall}
$$

- If $ FN $ is more important than $ FP $, then $\beta = 2$

Therefore,

$$
F_2 = (1 + 2^2) \frac{Precision \times Recall}{2^2 \cdot Precision + Recall}
$$
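A small sketch of how $\beta$ shifts the score, implementing $F_\beta = (1+\beta^2)\,\frac{P \cdot R}{\beta^2 P + R}$. The helper name `f_beta` is hypothetical, and the counts are taken from the baseline confusion matrix printed earlier, assuming class 1 is the positive class.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# TP, FP, FN for class 1 from the baseline confusion matrix
tp, fp, fn = 122, 30, 20
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f_half = f_beta(precision, recall, 0.5)  # leans toward precision
f1 = f_beta(precision, recall, 1.0)      # balances both
f2 = f_beta(precision, recall, 2.0)      # leans toward recall

print(round(f_half, 3), round(f1, 3), round(f2, 3))
```

Because recall is higher than precision for this model, the score rises as $\beta$ grows and more weight moves onto recall.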

# Grid Search: Hyperparameter Tuning

Grid Search is a technique used to **find the best hyperparameters** for a machine learning model.  
Instead of training once with default parameters, Grid Search **trains the model multiple times** with different combinations of hyperparameter values and selects the best one based on performance metrics.

---

## Why do we need Grid Search?

Machine learning models have **hyperparameters** (settings chosen before training), such as:
- `C` (regularization strength)
- `penalty` (L1 / L2)
- `max_iter` (maximum iterations)

Choosing the wrong hyperparameters can:
- Reduce model performance
- Cause underfitting / overfitting

Grid Search helps by **systematically trying multiple possible values** and finding the **optimal combination**.

---

## How Grid Search Works

1. Define a dictionary of hyperparameters with possible values
2. Train the model for every combination (exhaustive search)
3. Evaluate performance using cross-validation
4. Return the **best model** and **best hyperparameters**
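Steps 1 to 3 can be sketched without any library to see how quickly the number of fits grows. The grid values are illustrative and `n_folds` is an assumed cross-validation setting.

```python
from itertools import product

# Step 1: a dictionary of hyperparameters with candidate values (illustrative)
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
}
n_folds = 5  # assumed number of cross-validation folds

# Step 2: exhaustive search means the Cartesian product of all value lists
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Step 3: each combination is trained and scored once per fold
total_fits = len(combinations) * n_folds

print(len(combinations))  # 10
print(total_fits)         # 50
print(combinations[0])
```

Adding one more hyperparameter with five values would multiply both numbers by five, which is why exhaustive search gets expensive quickly.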

---

## Example workflow
```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Define the model
log_reg = LogisticRegression()

# Hyperparameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

# Grid Search with 5-fold cross validation
grid = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')

# Train the grid search
grid.fit(X_train, y_train)

```

## Extracting the Best Model

```python
grid.best_params_
best_model = grid.best_estimator_
```

In [189]:
# Initialize the Logistic Regression model (base estimator for Grid Search)
log_reg = LogisticRegression()

# Define hyperparameter options to search over during Grid Search

# Type of regularization (penalty terms)
#  l1         → Lasso (feature selection)
#  l2         → Ridge (most commonly used)
#  elasticnet → Combination of L1 + L2
penalty = ['l1', 'l2', 'elasticnet']

# C is the inverse of the regularization strength
# Higher C  → weaker regularization (model fits more to training data)
# Lower C   → stronger regularization (helps prevent overfitting)
c_values = [100, 10, 1.0, 0.1, 0.01]

# Optimization solvers for logistic regression
# Different solvers support different penalties:
#  - liblinear  → supports l1 & l2
#  - saga       → supports l1, l2, elasticnet
#  - newton-cg, lbfgs, sag → support l2 only
# Unsupported (penalty, solver) pairs make the fit fail; GridSearchCV scores them as nan
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

# Combine all parameters into a dictionary format required by GridSearchCV
# Grid Search will try every possible combination of:
# (penalty × C × solver)
param_grid = dict(penalty=penalty, C=c_values, solver=solvers)

In [190]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# StratifiedKFold ensures that each fold has the same proportion of class labels
# (important for classification problems with imbalanced classes)
# By default it uses 5 folds → splits data into 5 parts for cross-validation
cv = StratifiedKFold()

# Perform Grid Search to find the best hyperparameters for Logistic Regression
grid = GridSearchCV(
    estimator=log_reg,          # base model to tune
    param_grid=param_grid,      # dictionary of hyperparameters to search
    scoring='accuracy',         # evaluation metric for selecting best model
    cv=cv,                      # cross-validation strategy (Stratified K-Fold)
    n_jobs=-1                   # run computations in parallel (-1 → use all CPU cores)
)
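To see what stratification does, the sketch below splits a made-up imbalanced label vector (80 zeros, 20 ones) and reports the share of positives in each validation fold. `X_stub` is a placeholder feature matrix, since only the labels matter for how the folds are formed.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative imbalanced labels: 80% class 0, 20% class 1
y_imbalanced = np.array([0] * 80 + [1] * 20)
X_stub = np.zeros((100, 1))  # placeholder features

skf = StratifiedKFold(n_splits=5)
fold_positive_rates = [
    y_imbalanced[test_idx].mean()
    for _, test_idx in skf.split(X_stub, y_imbalanced)
]
print(fold_positive_rates)
```

Every fold keeps the same 20% share of positives as the full dataset, so no fold is scored on a distorted class mix.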

In [191]:
print(grid)

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=None, shuffle=False),
             estimator=LogisticRegression(), n_jobs=-1,
             param_grid={'C': [100, 10, 1.0, 0.1, 0.01],
                         'penalty': ['l1', 'l2', 'elasticnet'],
                         'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag',
                                    'saga']},
             scoring='accuracy')


In [192]:
grid.fit(X_train, y_train)

200 fits failed out of a total of 375.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/psundara/learn/python/python-series/.conda/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/psundara/learn/python/python-series/.conda/lib/python3.12/site-packages/sklearn/base.py", line 1365, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/psundara/learn/python/python-series/.conda/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py", line 1218, in fit
    solver

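GridSearchCV parameters:

| Parameter | Value |
|-----------|-------|
| estimator | LogisticRegression() |
| param_grid | {'C': [100, 10, ...], 'penalty': ['l1', 'l2', ...], 'solver': ['newton-cg', 'lbfgs', ...]} |
| scoring | 'accuracy' |
| n_jobs | -1 |
| refit | True |
| cv | StratifiedKFo...shuffle=False) |
| verbose | 0 |
| pre_dispatch | '2*n_jobs' |
| error_score | |
| return_train_score | False |

Best estimator (LogisticRegression) parameters:

| Parameter | Value |
|-----------|-------|
| penalty | 'l1' |
| dual | False |
| tol | 0.0001 |
| C | 0.1 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | |
| random_state | |
| solver | 'saga' |
| max_iter | 100 |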
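The "200 fits failed out of a total of 375" warning can be accounted for by counting incompatible combinations. The sketch below uses the solver support listed in the grid definition comments and additionally assumes that `elasticnet` fails when no `l1_ratio` is supplied; that last point is an assumption about the installed scikit-learn version. `valid_param_grid` and its `l1_ratio` value are illustrative.

```python
from itertools import product

penalties = ["l1", "l2", "elasticnet"]
c_values = [100, 10, 1.0, 0.1, 0.01]
solvers = ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]
n_folds = 5

# Penalties each solver accepts, as listed in the grid definition comments
supported = {
    "newton-cg": {"l2"},
    "lbfgs": {"l2"},
    "sag": {"l2"},
    "liblinear": {"l1", "l2"},
    "saga": {"l1", "l2", "elasticnet"},
}

def can_fit(penalty, solver, l1_ratio=None):
    if penalty not in supported[solver]:
        return False
    # Assumption: elasticnet needs an explicit l1_ratio, which the grid never sets
    if penalty == "elasticnet" and l1_ratio is None:
        return False
    return True

combos = list(product(penalties, c_values, solvers))
failing = [c for c in combos if not can_fit(c[0], c[2])]
print(len(combos) * n_folds, len(failing) * n_folds)  # 375 200

# A list of dicts restricts the search to compatible combinations only;
# it can be passed as param_grid to GridSearchCV in place of the single dict
valid_param_grid = [
    {"penalty": ["l2"], "solver": solvers, "C": c_values},
    {"penalty": ["l1"], "solver": ["liblinear", "saga"], "C": c_values},
    {"penalty": ["elasticnet"], "solver": ["saga"], "C": c_values, "l1_ratio": [0.5]},
]
```

Under these assumptions the count matches the warning exactly, and restricting the grid removes the wasted fits.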


In [193]:
# Display the best hyperparameter combination found by Grid Search
# grid.best_params_ returns the exact set of hyperparameters that produced the highest accuracy
print("Best params:", grid.best_params_)

# Display the cross-validated score corresponding to those best hyperparameters
# grid.best_score_ represents the average accuracy across all cross-validation folds
print("Best score:", grid.best_score_)
# Print the complete model configured with the best hyperparameters
# grid.best_estimator_ returns the LogisticRegression instance trained using the
# best combination of parameters found during Grid Search
print("Best estimator:", grid.best_estimator_)



Best params: {'C': 0.1, 'penalty': 'l1', 'solver': 'saga'}
Best score: 0.8742857142857143
Best estimator: LogisticRegression(C=0.1, penalty='l1', solver='saga')


In [194]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
y_pred = grid.predict(X_test)

# Evaluate the model performance on test data

# 1. Accuracy Score
# Measures the fraction of correctly predicted samples:
# (TP + TN) / (TP + TN + FP + FN)
score = accuracy_score(y_test, y_pred)

# 2. Confusion Matrix
# Summarizes the prediction results in a 2×2 matrix.
# For labels 0 and 1, scikit-learn returns (rows = actual, columns = predicted):
# [[TN, FP],
#  [FN, TP]]
# The result is stored as `cm` so the imported function is not overwritten
cm = confusion_matrix(y_test, y_pred)

# 3. Classification Report
# Provides precision, recall, F1-score, and support for each class
report = classification_report(y_test, y_pred)

# Print evaluation results in a readable format
print(f"Accuracy score: {score}")
print(f"Confusion matrix: \n{cm}")
print(f"Classification Report: \n{report}")

Accuracy score: 0.85
Confusion matrix: 
[[130  28]
 [ 17 125]]
Classification Report: 
              precision    recall  f1-score   support

           0       0.88      0.82      0.85       158
           1       0.82      0.88      0.85       142

    accuracy                           0.85       300
   macro avg       0.85      0.85      0.85       300
weighted avg       0.85      0.85      0.85       300



In [195]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Retrieve the best-performing model discovered during Grid Search
# This model already contains the optimal combination of hyperparameters
best_model = grid.best_estimator_

# Use the best model to make predictions on the unseen test set
# y_pred_best contains the final class labels (0 or 1) predicted for X_test
y_pred_best = best_model.predict(X_test)


# --------------------------------------------------------------
# Evaluate the model performance on the test dataset
# --------------------------------------------------------------

# 1. Accuracy Score
# Represents overall correctness of the model
# Formula: (TP + TN) / (TP + TN + FP + FN)
# Higher accuracy indicates better prediction performance across both classes
score = accuracy_score(y_test, y_pred_best)

# 2. Confusion Matrix
# Shows the detailed breakdown of classification results:
#  TP → Correct predictions of class 1 (positive)
#  TN → Correct predictions of class 0 (negative)
#  FP → Incorrectly classified negative samples as positive
#  FN → Incorrectly classified positive samples as negative
# This matrix gives insights into the kinds of errors the model is making
# (stored as `cm` so the imported function is not overwritten)
cm = confusion_matrix(y_test, y_pred_best)

# 3. Classification Report
# Displays multiple evaluation metrics per class:
#  • Precision → Out of all predicted positives, how many were actually positive?
#  • Recall    → Out of all actual positives, how many were correctly detected?
#  • F1-score  → Harmonic mean of precision and recall (balances both)
#  • Support   → Number of occurrences of each class in the dataset
# This report helps assess model behavior beyond simple accuracy
report = classification_report(y_test, y_pred_best)


# --------------------------------------------------------------
# Print results in an organized format
# --------------------------------------------------------------
print(f"Accuracy score: {score}")
print(f"Confusion matrix: \n{cm}")
print(f"Classification Report: \n{report}")


Accuracy score: 0.85
Confusion matrix: 
[[130  28]
 [ 17 125]]
Classification Report: 
              precision    recall  f1-score   support

           0       0.88      0.82      0.85       158
           1       0.82      0.88      0.85       142

    accuracy                           0.85       300
   macro avg       0.85      0.85      0.85       300
weighted avg       0.85      0.85      0.85       300



# Randomized Search CV: Hyperparameter Tuning

Randomized Search CV is a hyperparameter optimization technique that **searches randomly across the hyperparameter space** instead of exhaustively checking every possible combination (as done in Grid Search).

Instead of testing *all* hyperparameter combinations, Randomized Search **selects a fixed number of random combinations** and evaluates them using cross-validation.  
This makes it **much faster and often just as effective as Grid Search**, especially when the search space is large.

---

## Why Randomized Search?

Grid Search is good for small parameter spaces but becomes **slow and expensive** when:
- There are many hyperparameters
- Each hyperparameter has many values
- The dataset is large

Randomized Search solves this by:
- Sampling only a **subset** of the parameter combinations
- Allowing us to set **how many searches** to run (via `n_iter`)

---

## How Randomized Search Works
1. Define a parameter distribution (range of values)
2. Specify how many random combinations to test (`n_iter`)
3. Train and evaluate using cross-validation
4. Select the best hyperparameters based on scoring metric
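The sampling idea in steps 1 to 3 can be sketched with the standard library. Python's `random` module is used here as a stand-in for the sampler inside `RandomizedSearchCV`, so the specific combinations drawn will differ from scikit-learn's; the candidate lists and `n_folds` are illustrative.

```python
import random
from itertools import product

penalty = ["l1", "l2", "elasticnet"]
c_values = [100, 10, 1.0, 0.1, 0.01]
solvers = ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]

# Step 1: the full space of candidate combinations
all_combinations = list(product(penalty, c_values, solvers))

# Step 2: draw n_iter of them at random, without repeats
n_iter = 50
rng = random.Random(42)  # fixed seed so the draw is reproducible
sampled = rng.sample(all_combinations, k=n_iter)

# Step 3: each sampled combination is evaluated once per fold
n_folds = 5
print(len(all_combinations), len(sampled), len(sampled) * n_folds)  # 75 50 250
```

The cost is set by `n_iter` and the fold count, not by the size of the search space.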

---

## Example workflow
```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from scipy.stats import uniform

log_reg = LogisticRegression(max_iter=1000)

# Parameter distributions (random ranges, not fixed lists)
param_dist = {
    'C': uniform(0.001, 100),                   # continuous uniform on [0.001, 100.001] (loc, scale)
    'penalty': ['l1', 'l2'],                    # both are supported by liblinear and saga
    'solver': ['liblinear', 'saga']
}

# Randomized Search with 50 random combinations
random_search = RandomizedSearchCV(
    estimator=log_reg,
    param_distributions=param_dist,
    n_iter=50,
    scoring='accuracy',
    cv=5,
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)
```
---

## Extracting the Best Model

```python
random_search.best_params_
best_model = random_search.best_estimator_
```
---

## Randomized Search vs Grid Search
| Feature                     | Grid Search CV                 | Randomized Search CV                                |
| --------------------------- | ------------------------------ | --------------------------------------------------- |
| Search type                 | Exhaustive (all combinations)  | Random sampling of combinations                     |
| Speed                       | Slow for large search spaces   | Fast                                                |
| Best for                    | Small hyperparameter grids     | Large / continuous search spaces                    |
| Guarantees best combination | ✔ Yes (given grid)             | ❌ Not guaranteed, but likely with enough iterations |
| Supports distributions      | ❌                              | ✔                                                   |
| Control over compute cost   | ❌                              | ✔ via `n_iter`                                      |
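The "Supports distributions" row matters most for scale-type parameters such as `C`. The sketch below compares sampling `C` from a uniform and from a log-uniform distribution over the same range; `loguniform` is not used elsewhere in this notebook and is shown only as one possible choice.

```python
import numpy as np
from scipy.stats import loguniform, uniform

low, high = 1e-3, 1e2

# uniform(loc, scale) covers [loc, loc + scale]
c_uniform = uniform(low, high - low).rvs(size=2000, random_state=0)
# loguniform(a, b) spreads samples evenly across orders of magnitude
c_log = loguniform(low, high).rvs(size=2000, random_state=0)

frac_uniform_below_1 = np.mean(c_uniform < 1.0)
frac_log_below_1 = np.mean(c_log < 1.0)
print(frac_uniform_below_1, frac_log_below_1)
```

A uniform draw almost never tries strong regularization (`C` below 1), whereas a log-uniform draw spends a large share of its samples there. Either object can be used as a value in `param_distributions`.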

---
## Takeaway
> Randomized Search CV is often a good first tuning technique because it is much faster than Grid Search and often finds equally good or better hyperparameters, especially with large parameter spaces.


In [196]:
from sklearn.model_selection import RandomizedSearchCV

log_reg = LogisticRegression()

# Define hyperparameter options to sample from during Randomized Search

# Type of regularization (penalty terms)
#  l1         → Lasso (feature selection)
#  l2         → Ridge (most commonly used)
#  elasticnet → Combination of L1 + L2
penalty = ['l1', 'l2', 'elasticnet']

# C is the inverse of the regularization strength
# Higher C  → weaker regularization (model fits more to training data)
# Lower C   → stronger regularization (helps prevent overfitting)
c_values = [100, 10, 1.0, 0.1, 0.01]

# Optimization solvers for logistic regression
# Different solvers support different penalties:
#  - liblinear  → supports l1 & l2
#  - saga       → supports l1, l2, elasticnet
#  - newton-cg, lbfgs, sag → support l2 only
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

# Combine all parameters into a dictionary format required by RandomizedSearchCV
# Randomized Search samples n_iter combinations from:
# (penalty × C × solver)
param_dist = dict(penalty=penalty, C=c_values, solver=solvers)


# StratifiedKFold ensures that each fold has the same proportion of class labels
# (important for classification problems with imbalanced classes)
# By default it uses 5 folds → splits data into 5 parts for cross-validation
cv = StratifiedKFold()

random_search = RandomizedSearchCV(
    estimator=log_reg,
    param_distributions=param_dist,
    n_iter=50,
    scoring='accuracy',
    cv=cv,
    n_jobs=-1,
    random_state=42
)

In [197]:
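# NOTE: the search is fit on the test split here, so the scores reported below are
# measured on data the model has already seen (data leakage) and are optimistic.
# Fitting on (X_train, y_train) instead keeps X_test unseen for the final evaluation.
random_search.fit(X_test, y_test)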

130 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/psundara/learn/python/python-series/.conda/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/psundara/learn/python/python-series/.conda/lib/python3.12/site-packages/sklearn/base.py", line 1365, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/psundara/learn/python/python-series/.conda/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py", line 1218, in fit
    solver

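RandomizedSearchCV parameters:

| Parameter | Value |
|-----------|-------|
| estimator | LogisticRegression() |
| param_distributions | {'C': [100, 10, ...], 'penalty': ['l1', 'l2', ...], 'solver': ['newton-cg', 'lbfgs', ...]} |
| n_iter | 50 |
| scoring | 'accuracy' |
| n_jobs | -1 |
| refit | True |
| cv | StratifiedKFo...shuffle=False) |
| verbose | 0 |
| pre_dispatch | '2*n_jobs' |
| random_state | 42 |

Best estimator (LogisticRegression) parameters:

| Parameter | Value |
|-----------|-------|
| penalty | 'l1' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | |
| random_state | |
| solver | 'saga' |
| max_iter | 100 |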


In [198]:
# Display the best hyperparameter combination found by RandomizedSearchCV
# random_search.best_params_ returns the exact set of hyperparameters that produced the highest accuracy
print("Best params:", random_search.best_params_)

# Display the cross-validated score corresponding to those best hyperparameters
# random_search.best_score_ represents the average accuracy across all cross-validation folds
print("Best score:", random_search.best_score_)
# Print the complete model configured with the best hyperparameters
# random_search.best_estimator_ returns the LogisticRegression instance trained using the
# best combination of parameters found during RandomizedSearchCV
print("Best estimator:", random_search.best_estimator_)

Best params: {'solver': 'saga', 'penalty': 'l1', 'C': 1.0}
Best score: 0.8566666666666667
Best estimator: LogisticRegression(penalty='l1', solver='saga')


In [199]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Retrieve the best-performing model discovered during RandomizedSearchCV
# This model is configured with the best combination among the sampled hyperparameters
best_model = random_search.best_estimator_

# Use the best model to make predictions on the test set
# y_pred_best contains the final class labels (0 or 1) predicted for X_test
# Because the search above was fit on X_test, these are not predictions on unseen data
y_pred_best = best_model.predict(X_test)


# --------------------------------------------------------------
# Evaluate the model performance on the test dataset
# --------------------------------------------------------------

# 1. Accuracy Score
# Represents overall correctness of the model
# Formula: (TP + TN) / (TP + TN + FP + FN)
# Higher accuracy indicates better prediction performance across both classes
score = accuracy_score(y_test, y_pred_best)

# 2. Confusion Matrix
# Shows the detailed breakdown of classification results:
#  TP → Correct predictions of class 1 (positive)
#  TN → Correct predictions of class 0 (negative)
#  FP → Incorrectly classified negative samples as positive
#  FN → Incorrectly classified positive samples as negative
# This matrix gives insights into the kinds of errors the model is making
# (stored as `cm` so the imported function is not overwritten)
cm = confusion_matrix(y_test, y_pred_best)

# 3. Classification Report
# Displays multiple evaluation metrics per class:
#  • Precision → Out of all predicted positives, how many were actually positive?
#  • Recall    → Out of all actual positives, how many were correctly detected?
#  • F1-score  → Harmonic mean of precision and recall (balances both)
#  • Support   → Number of occurrences of each class in the dataset
# This report helps assess model behavior beyond simple accuracy
report = classification_report(y_test, y_pred_best)


# --------------------------------------------------------------
# Print results in an organized format
# --------------------------------------------------------------
print(f"Accuracy score: {score}")
print(f"Confusion matrix: \n{cm}")
print(f"Classification Report: \n{report}")


Accuracy score: 0.8733333333333333
Confusion matrix: 
[[141  17]
 [ 21 121]]
Classification Report: 
              precision    recall  f1-score   support

           0       0.87      0.89      0.88       158
           1       0.88      0.85      0.86       142

    accuracy                           0.87       300
   macro avg       0.87      0.87      0.87       300
weighted avg       0.87      0.87      0.87       300

