# What are we doing?

## Objectives

+ Construct a cross-validation pipeline.
+ Use cross-validation to evaluate different hyperparameter performance.
+ Perform grid search for systematic evaluation.
+ Store and manage results.

## Procedure

The diagram below, taken from scikit-learn's documentation, shows the procedure that we will follow:

![](./images/05_grid_search_workflow.png)


+ System requirements:

    - Automation: the system should operate automatically with minimal supervision.
    - Replicability: changes to code and (arguably) data should be logged and controlled. Randomness should also be controlled (random seeds, etc.).
    - Persistence: persist results for later analysis.


## What is a Hyperparameter?

+ Generally speaking, hyperparameters are parameters that control the learning process and are set before training rather than learned from the data: regularization weights, learning rate, split criterion (entropy or Gini), etc.
+ Hyperparameters drive the behaviour and performance of a model, so model selection is closely related to hyperparameter tuning.
+ Selection criteria are based on performance evaluation and, to get more reliable performance estimates, we use cross-validation.
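
As a rough illustration of what cross-validation does with the data, the sketch below splits row indices into k contiguous folds and uses each fold once as the validation set. The helper `k_fold_indices` is a name made up for this example; it mirrors unshuffled K-fold splitting in spirit and is not the splitter that scikit-learn applies in the rest of this lab.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every row is used for validation exactly once, and the performance estimate is the average of the k validation scores.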

## Searching the Hyperparameter Grid

+ To address the automation requirement, we could use `GridSearchCV`, a self-contained scikit-learn class that performs a grid search over a hyperparameter space.
+ To search the hyperparameter grid exhaustively means that we consider every possible combination of hyperparameter values in the search space and evaluate the model under each one. For example, if we explore two parameters, `kernel` (with values "rbf" and "poly") and `C` (with values 1.0 and 0.5), then the grid consists of the combinations:

    + (rbf, 1.0)
    + (rbf, 0.5)
    + (poly, 1.0)
    + (poly, 0.5)

+ Under each combination, we perform CV and evaluate the model's performance.
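
The four combinations listed above are the Cartesian product of the value lists. A minimal sketch using only the standard library (the variable names are illustrative):

```python
from itertools import product

# Search space from the example: two hyperparameters, two values each
search_space = {"kernel": ["rbf", "poly"], "C": [1.0, 0.5]}

names = list(search_space)
combinations = [
    dict(zip(names, values)) for values in product(*search_space.values())
]

for combo in combinations:
    print(combo)
```

The number of combinations is the product of the list lengths, so each additional hyperparameter multiplies the number of models that must be cross-validated.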

# Setup

We start with [Give me some credit](https://www.kaggle.com/c/GiveMeSomeCredit) data that we used in the previous session.

In [1]:
# Load environment variables
from dotenv import load_dotenv
import os

# Load .env file
load_dotenv(dotenv_path='.env')

# System & OS utilities
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Add project source directory
sys.path.append(os.getenv('SRC_DIR'))
ft_path = os.getenv("CREDIT_DATA")
df_raw = pd.read_csv(ft_path)

# Scikit-learn: Preprocessing, Pipelines, and Model Selection
from sklearn import set_config
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import (
    train_test_split, 
    GridSearchCV, 
    ParameterGrid
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import (
    classification_report, 
    roc_auc_score, 
    roc_curve, 
    confusion_matrix
)

# Parallel processing & Progress Bar
from tqdm import tqdm
from joblib import Parallel, delayed

# Debugging & Logging
import logging

# Read logging, database, and ticker settings from the environment
log_dir = os.getenv('LOG_DIR')
db_url = os.getenv('DB_URL')
tickers_file = os.getenv('TICKERS')

print(f"Log Directory: {log_dir}")
print(f"Tickers File: {tickers_file}")
print(f"Database URL: {db_url}")
print("Current working directory:", os.getcwd())
print("DEBUG: DB_URL =", os.getenv("DB_URL"))

set_config(transform_output="pandas")  # Forces transformers to return DataFrames


Log Directory: ../../07_logs/
Tickers File: ../../05_src/data/tickers/sp500_wiki.csv
Database URL: postgresql://postgres:HumanAfterAll@localhost:5432/model_db
Current working directory: c:\Users\AC\Documents\GitHub\Programs\UofT-DSI\production\01_materials\labs
DEBUG: DB_URL = postgresql://postgres:HumanAfterAll@localhost:5432/model_db


In [2]:
df = df_raw.drop(columns = ["Unnamed: 0"]).rename(
    columns = {
        'SeriousDlqin2yrs': 'delinquency',
        'RevolvingUtilizationOfUnsecuredLines': 'revolving_unsecured_line_utilization', 
        'age': 'age',
        'NumberOfTime30-59DaysPastDueNotWorse': 'num_30_59_days_late', 
        'DebtRatio': 'debt_ratio', 
        'MonthlyIncome': 'monthly_income',
        'NumberOfOpenCreditLinesAndLoans': 'num_open_credit_loans', 
        'NumberOfTimes90DaysLate':  'num_90_days_late',
        'NumberRealEstateLoansOrLines': 'num_real_estate_loans', 
        'NumberOfTime60-89DaysPastDueNotWorse': 'num_60_89_days_late',
        'NumberOfDependents': 'num_dependents'
    }
).assign(
    high_debt_ratio = lambda x: (x['debt_ratio'] > 1)*1,
    missing_monthly_income = lambda x: x['monthly_income'].isna()*1,
    missing_num_dependents = lambda x: x['num_dependents'].isna()*1, 
)


Use a simple pipeline composed of:

+ Preprocessing steps.
+ Logistic Regression classifier.

We will explore the hyperparameter space by evaluating different regularization strategies and parameters.

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV, ParameterGrid
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import classification_report, roc_auc_score, roc_curve, confusion_matrix
from tqdm import tqdm
from joblib import Parallel, delayed
import matplotlib.pyplot as plt

import pandas as pd


In [4]:
class DataFrameWrapper(BaseEstimator, TransformerMixin):
    """Ensures transformed NumPy arrays are converted back to Pandas DataFrames while keeping expected column names"""

    def __init__(self):
        self.feature_names = None  

    def fit(self, X, y=None):
        """Store feature names if input is a DataFrame"""
        if hasattr(X, "columns"):
            self.feature_names = list(X.columns)
        return self

    def transform(self, X):
        """Convert array back to DataFrame with stored feature names"""
        if self.feature_names is not None:
            return pd.DataFrame(X, columns=self.feature_names)
        else:
            return pd.DataFrame(X)


In [5]:
# Define feature matrix X and target variable Y
X = df.drop(columns='delinquency')
Y = df['delinquency']

# Hold out 20% of the data for testing; the fixed seed keeps the split replicable
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# train_test_split already returns DataFrames; these calls only make that explicit
X_train = pd.DataFrame(X_train, columns=X.columns)
X_test = pd.DataFrame(X_test, columns=X.columns)


In [6]:
num_cols = ['revolving_unsecured_line_utilization', 'age',
       'num_30_59_days_late', 'debt_ratio', 'monthly_income',
       'num_open_credit_loans', 'num_90_days_late', 'num_real_estate_loans',
       'num_60_89_days_late', 'num_dependents', 
       # Although expressed as numbers, these columns are boolean:
       # 'high_debt_ratio',
       # 'missing_monthly_income', 
       # 'missing_num_dependents' 
       ]

pipe_num_simple = Pipeline([
    ('imputer', SimpleImputer(strategy = 'median')),
    ('standardizer', StandardScaler())
])

ctransform_simple = ColumnTransformer(
    [('numeric_simple', pipe_num_simple, num_cols)],
    remainder='passthrough',  # Boolean indicator columns pass through unchanged
    force_int_remainder_cols=False
)

pipe_lr = Pipeline([
    ("scaler", StandardScaler()),  # Standardizes every column, indicators included (NaNs are kept)
    ("preprocess", ctransform_simple),  # Imputes and re-standardizes the numeric columns
    ("clf", LogisticRegression())  # Model
])


pipe_lr


Obtain the parameters of the pipeline with `.get_params()`.

In [7]:
pipe_lr.get_params()


{'memory': None,
 'steps': [('scaler', StandardScaler()),
  ('preprocess',
   ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                     transformers=[('numeric_simple',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(strategy='median')),
                                                    ('standardizer',
                                                     StandardScaler())]),
                                    ['revolving_unsecured_line_utilization', 'age',
                                     'num_30_59_days_late', 'debt_ratio',
                                     'monthly_income', 'num_open_credit_loans',
                                     'num_90_days_late', 'num_real_estate_loans',
                                     'num_60_89_days_late', 'num_dependents'])])),
  ('clf', LogisticRegression())],
 'transform_input': None,
 'verbose': False,
 'scaler': Stand

## Set Up the Splitting Strategy

In [8]:
# X = df.drop(columns = 'delinquency')
# Y = df['delinquency']

# scoring = ['neg_log_loss', 'roc_auc', 'f1', 'accuracy', 'precision', 'recall']

# X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)



To perform the Grid Search we need to define a parameter grid:

- A parameter grid defines all of the combinations of parameters that we need to explore.
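- The class `GridSearchCV` performs an exhaustive search of parameter combinations.
- The parameter grid is defined as a dictionary of lists, or as a list of such dictionaries when some values are only valid together (for example, a solver and the penalties it supports):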

    * Each entry's key is the name of the parameter.
    * Each entry's value is the list of values that we would like to explore.
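
The keys of the grid must match the parameter names that the pipeline exposes through `.get_params()`, which follow a `<step name>__<parameter>` pattern. A minimal sketch on a throwaway two-step pipeline (the name `demo_pipe` is illustrative and not part of the lab code):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Throwaway pipeline, used only to show the naming convention
demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Parameters of the "clf" step are exposed with the prefix "clf__"
clf_param_names = sorted(
    name for name in demo_pipe.get_params() if name.startswith("clf__")
)
print(clf_param_names)

# The same names are accepted by set_params, which is how a search sets them
demo_pipe.set_params(clf__C=0.5)
print(demo_pipe.named_steps["clf"].C)
```

A grid key that does not appear among these names would be rejected when the search tries to set it.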

In [9]:
solver_penalty_map = {
    "lbfgs": {"penalty": ["l2"]},  # lbfgs only supports L2
    "liblinear": {"penalty": ["l1", "l2"]},  # Supports L1 and L2
    "saga": {"penalty": ["l1", "l2", "elasticnet"], "l1_ratio": [0.1, 0.5, 0.9]},  # Needs l1_ratio for elasticnet
    "newton-cg": {"penalty": ["l2"]},  # Only supports L2
    "sag": {"penalty": ["l2"]},  # Only supports L2
}

param_grid = []
for solver, params in solver_penalty_map.items():
    grid = {
        "clf__solver": [solver], 
        "clf__penalty": params["penalty"],
        "clf__C": [0.001, 0.01, 0.1, 0.5, 1.0]
    }
    if "l1_ratio" in params:
        grid["clf__l1_ratio"] = params["l1_ratio"]
    param_grid.append(grid)


In [10]:
# param_grid = {
#     "clf__C": [0.001, 0.01, 0.1, 0.5, 1.0],  # Regularization strength
#     "clf__penalty": ["l1", "l2"]#,  # Penalty type
#     #"clf__solver": ["liblinear", "saga"],  # Only solvers that support both l1 and l2
# }


Some key inputs to [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) are:

+ `estimator`: the pipeline or classifier that we are tuning.
+ `param_grid`: the parameter grid, defined as a dictionary of lists (or a list of such dictionaries) as described above.
+ `n_jobs`: settings for parallel computation.
+ `refit`: options for refitting the model using the best-performing configuration.

In [11]:
scoring = ['neg_log_loss', 'roc_auc', 'f1', 'accuracy', 'precision', 'recall']


In [12]:
grid_cv = GridSearchCV(
    estimator=pipe_lr, 
    param_grid=param_grid, 
    scoring=scoring, 
    cv=5,
    refit="neg_log_loss",
    n_jobs=-1,
    verbose=3  # Enables built-in progress updates
)

grid_cv.fit(X_train, Y_train)


Fitting 5 folds for each of 70 candidates, totalling 350 fits
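
The figure of 70 candidates can be checked by hand: each sub-grid contributes the product of its list lengths, and the sub-grids are summed. The sketch below rebuilds the same solver and penalty map and counts without fitting anything; `counts`, `n_candidates`, and `n_folds` are names introduced only for this check.

```python
from math import prod

solver_penalty_map = {
    "lbfgs": {"penalty": ["l2"]},
    "liblinear": {"penalty": ["l1", "l2"]},
    "saga": {"penalty": ["l1", "l2", "elasticnet"], "l1_ratio": [0.1, 0.5, 0.9]},
    "newton-cg": {"penalty": ["l2"]},
    "sag": {"penalty": ["l2"]},
}
c_values = [0.001, 0.01, 0.1, 0.5, 1.0]

counts = {}
for solver, params in solver_penalty_map.items():
    sub_grid = {
        "clf__solver": [solver],
        "clf__penalty": params["penalty"],
        "clf__C": c_values,
    }
    if "l1_ratio" in params:
        sub_grid["clf__l1_ratio"] = params["l1_ratio"]
    # Candidates in one sub-grid = product of the lengths of its value lists
    counts[solver] = prod(len(values) for values in sub_grid.values())

n_candidates = sum(counts.values())
n_folds = 5
print(counts)
print(n_candidates, "candidates,", n_candidates * n_folds, "fits")
```

Most of the candidates come from `saga`, because `l1_ratio` is crossed with all three penalties even though it should only matter for `elasticnet`. Restricting `l1_ratio` to the `elasticnet` penalty in its own sub-grid would likely remove the redundant fits.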


Access the cross-validation results using the attribute `.cv_results_`:

In [13]:
res = grid_cv.cv_results_
res = pd.DataFrame(res)
res.columns

res[['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time',
       'param_clf__C', 'param_clf__penalty', 'param_clf__solver', 'params',
       'mean_test_neg_log_loss',
       'std_test_neg_log_loss', 'rank_test_neg_log_loss']].sort_values('rank_test_neg_log_loss')


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_clf__C | param_clf__penalty | param_clf__solver | params | mean_test_neg_log_loss | std_test_neg_log_loss | rank_test_neg_log_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 1.050852 | 0.199232 | 0.210751 | 0.073393 | 1.000 | l2 | lbfgs | {'clf__C': 1.0, 'clf__penalty': 'l2', 'clf__so... | -0.225366 | 0.000481 | 1 |
| 13 | 18.019614 | 0.882370 | 0.074761 | 0.013371 | 1.000 | l1 | liblinear | {'clf__C': 1.0, 'clf__penalty': 'l1', 'clf__so... | -0.225373 | 0.000493 | 2 |
| 64 | 0.934518 | 0.240172 | 0.119844 | 0.060976 | 1.000 | l2 | newton-cg | {'clf__C': 1.0, 'clf__penalty': 'l2', 'clf__so... | -0.225373 | 0.000489 | 3 |
| 11 | 18.931579 | 1.147549 | 0.125953 | 0.034203 | 0.500 | l1 | liblinear | {'clf__C': 0.5, 'clf__penalty': 'l1', 'clf__so... | -0.225373 | 0.000489 | 4 |
| 14 | 3.368281 | 0.400845 | 0.527001 | 0.239820 | 1.000 | l2 | liblinear | {'clf__C': 1.0, 'clf__penalty': 'l2', 'clf__so... | -0.225374 | 0.000488 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21 | 2.412699 | 0.692579 | 0.059038 | 0.005239 | 0.001 | l1 | saga | {'clf__C': 0.001, 'clf__l1_ratio': 0.9, 'clf__... | -0.238117 | 0.000401 | 66 |
| 15 | 9.774046 | 1.143187 | 0.107029 | 0.051437 | 0.001 | l1 | saga | {'clf__C': 0.001, 'clf__l1_ratio': 0.1, 'clf__... | -0.238117 | 0.000401 | 67 |
| 18 | 2.486464 | 0.383048 | 0.057827 | 0.003051 | 0.001 | l1 | saga | {'clf__C': 0.001, 'clf__l1_ratio': 0.5, 'clf__... | -0.238117 | 0.000401 | 68 |
| 5 | 0.740302 | 0.152482 | 0.160786 | 0.036136 | 0.001 | l1 | liblinear | {'clf__C': 0.001, 'clf__penalty': 'l1', 'clf__... | -0.239394 | 0.000359 | 69 |


In [14]:
res = pd.DataFrame(grid_cv.cv_results_)
print(res.columns)  # Verify available columns


Index(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time',
       'param_clf__C', 'param_clf__penalty', 'param_clf__solver',
       'param_clf__l1_ratio', 'params', 'split0_test_neg_log_loss',
       'split1_test_neg_log_loss', 'split2_test_neg_log_loss',
       'split3_test_neg_log_loss', 'split4_test_neg_log_loss',
       'mean_test_neg_log_loss', 'std_test_neg_log_loss',
       'rank_test_neg_log_loss', 'split0_test_roc_auc', 'split1_test_roc_auc',
       'split2_test_roc_auc', 'split3_test_roc_auc', 'split4_test_roc_auc',
       'mean_test_roc_auc', 'std_test_roc_auc', 'rank_test_roc_auc',
       'split0_test_f1', 'split1_test_f1', 'split2_test_f1', 'split3_test_f1',
       'split4_test_f1', 'mean_test_f1', 'std_test_f1', 'rank_test_f1',
       'split0_test_accuracy', 'split1_test_accuracy', 'split2_test_accuracy',
       'split3_test_accuracy', 'split4_test_accuracy', 'mean_test_accuracy',
       'std_test_accuracy', 'rank_test_accuracy', 'split0_test_precision',

In [15]:
res


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_clf__C,param_clf__penalty,param_clf__solver,param_clf__l1_ratio,params,split0_test_neg_log_loss,...,std_test_precision,rank_test_precision,split0_test_recall,split1_test_recall,split2_test_recall,split3_test_recall,split4_test_recall,mean_test_recall,std_test_recall,rank_test_recall
0,0.380031,0.099674,0.160748,0.105342,0.001,l2,lbfgs,,"{'clf__C': 0.001, 'clf__penalty': 'l2', 'clf__...",-0.235929,...,0.055570,23,0.009913,0.021685,0.014250,0.014870,0.013011,0.014746,0.003867,30
1,0.969333,0.261290,0.208822,0.080677,0.010,l2,lbfgs,,"{'clf__C': 0.01, 'clf__penalty': 'l2', 'clf__s...",-0.229737,...,0.046403,59,0.010533,0.022305,0.014250,0.016109,0.013631,0.015366,0.003907,19
2,1.010386,0.278845,0.291628,0.049857,0.100,l2,lbfgs,,"{'clf__C': 0.1, 'clf__penalty': 'l2', 'clf__so...",-0.226058,...,0.033793,2,0.034077,0.044610,0.037794,0.031599,0.037794,0.037175,0.004399,10
3,1.089450,0.190647,0.192163,0.062830,0.500,l2,lbfgs,,"{'clf__C': 0.5, 'clf__penalty': 'l2', 'clf__so...",-0.225792,...,0.020925,5,0.039653,0.049566,0.047088,0.040892,0.045229,0.044486,0.003726,6
4,1.050852,0.199232,0.210751,0.073393,1.000,l2,lbfgs,,"{'clf__C': 1.0, 'clf__penalty': 'l2', 'clf__so...",-0.225750,...,0.016801,10,0.040273,0.050186,0.047088,0.042751,0.048947,0.045849,0.003759,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,3.292957,0.510526,0.055569,0.004005,0.001,l2,sag,,"{'clf__C': 0.001, 'clf__penalty': 'l2', 'clf__...",-0.235919,...,0.055570,23,0.009913,0.021685,0.014250,0.014870,0.013011,0.014746,0.003867,30
66,4.717235,0.340193,0.050032,0.002074,0.010,l2,sag,,"{'clf__C': 0.01, 'clf__penalty': 'l2', 'clf__s...",-0.230232,...,0.052272,55,0.009913,0.021685,0.013631,0.014870,0.013631,0.014746,0.003847,30
67,4.427697,0.382469,0.046854,0.003234,0.100,l2,sag,,"{'clf__C': 0.1, 'clf__penalty': 'l2', 'clf__so...",-0.227775,...,0.032369,65,0.014250,0.022305,0.015489,0.017348,0.015489,0.016976,0.002842,16
68,4.032418,0.145087,0.044684,0.001879,0.500,l2,sag,,"{'clf__C': 0.5, 'clf__penalty': 'l2', 'clf__so...",-0.227536,...,0.028059,56,0.016729,0.021685,0.015489,0.017348,0.016109,0.017472,0.002196,14


In [16]:
res = pd.DataFrame(grid_cv.cv_results_)

# Drop invalid results (failed fits)
res_filtered = res.dropna(subset=["mean_test_neg_log_loss"])

# Display best results
print("\n🔹 Best Results:")
print(res_filtered.nsmallest(5, "rank_test_neg_log_loss"))



🔹 Best Results:
    mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
4        1.050852      0.199232         0.210751        0.073393   
13      18.019614      0.882370         0.074761        0.013371   
64       0.934518      0.240172         0.119844        0.060976   
11      18.931579      1.147549         0.125953        0.034203   
14       3.368281      0.400845         0.527001        0.239820   

    param_clf__C param_clf__penalty param_clf__solver  param_clf__l1_ratio  \
4            1.0                 l2             lbfgs                  NaN   
13           1.0                 l1         liblinear                  NaN   
64           1.0                 l2         newton-cg                  NaN   
11           0.5                 l1         liblinear                  NaN   
14           1.0                 l2         liblinear                  NaN   

                                               params  \
4   {'clf__C': 1.0, 'clf__penalty': 'l2', 'clf__

Access the best-performing configuration:

In [17]:
grid_cv.best_params_


{'clf__C': 1.0, 'clf__penalty': 'l2', 'clf__solver': 'lbfgs'}

The best-performing classifier (pipeline), refit on the complete training set, is available as `.best_estimator_`:

In [18]:
grid_cv.best_estimator_

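
A natural next step is to score the refit pipeline on data that played no part in the search. The sketch below shows the pattern on a small synthetic dataset so that it runs on its own; the `demo_` names, the reduced grid, and the choice of ROC AUC are assumptions made for illustration, and the same calls would apply to `grid_cv`, `X_test`, and `Y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
demo_search = GridSearchCV(
    estimator=demo_pipe,
    param_grid={"clf__C": [0.1, 1.0]},
    scoring="roc_auc",
    cv=3,
    refit=True,  # refit the best configuration on all of X_tr
)
demo_search.fit(X_tr, y_tr)

# best_estimator_ is a fitted pipeline, so it can score unseen rows directly
best_model = demo_search.best_estimator_
test_auc = roc_auc_score(y_te, best_model.predict_proba(X_te)[:, 1])
print(demo_search.best_params_, round(test_auc, 3))
```

Because the test rows were never seen during the search, this score is a less optimistic estimate than the cross-validation mean that selected the configuration.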

# Tracking GridSearchCV Experiments

+ We can expand our infrastructure for hyperparameter tuning across various models.
+ The plan:

    - Create a model ingredient to obtain the classifier object.
    - Store the experiment parameter grids in JSON files to keep them organized.
    - Schedule the experiments.
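
To make the JSON idea concrete, here is one possible layout: a top-level object keyed by model name, where each value is a list of sub-grids in the dictionary-of-lists form used earlier. The layout, the key names, and the helper `load_param_grid` are assumptions for illustration; the files used by the course code may be organized differently.

```python
import json

# Hypothetical file content; the key names and layout are assumptions
grid_json = """
{
  "LogisticRegression": [
    {"clf__solver": ["lbfgs"], "clf__penalty": ["l2"], "clf__C": [0.01, 0.1, 1.0]},
    {"clf__solver": ["liblinear"], "clf__penalty": ["l1", "l2"], "clf__C": [0.01, 0.1, 1.0]}
  ]
}
"""


def load_param_grid(text, model_name):
    """Return the list of sub-grids stored under model_name."""
    grids = json.loads(text)
    if model_name not in grids:
        raise KeyError(f"No parameter grid defined for {model_name!r}")
    return grids[model_name]


param_grid_from_json = load_param_grid(grid_json, "LogisticRegression")
print(len(param_grid_from_json), "sub-grids loaded")
```

Keeping grids in data files rather than in code means that a new experiment can be defined without editing the experiment script.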


## The Design

<div>
<img src=./images/05_experiment_setup.png width="75%">
</div>

Explore the code in `./05_src/credit_experiment.py` and `./05_src/credit_model_ingredient.py`:

+ `credit_model_ingredient.py` implements a function that returns a model given a string. This way, we can parametrize models in the experiment.
+ `credit_experiment.py` is a modularized version of our previous file, `credit_experiment_nb.py`, which only worked with a Naive Bayes classifier.
+ The experiment is now further *modularized*: there are ingredients for most components and it can be broken down even more depending on the evolution of the model.
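
The idea of returning a model from a string can be sketched with a small registry. This is not the content of `credit_model_ingredient.py`: the names `MODEL_REGISTRY` and `get_model`, and the set of supported models, are hypothetical and only show the pattern.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Hypothetical registry: model names mapped to classifier classes
MODEL_REGISTRY = {
    "LogisticRegression": LogisticRegression,
    "NaiveBayes": GaussianNB,
    "RandomForest": RandomForestClassifier,
}


def get_model(name):
    """Return a new, unfitted classifier for the given model name."""
    try:
        return MODEL_REGISTRY[name]()
    except KeyError:
        valid = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(f"Unknown model {name!r}; expected one of: {valid}") from None


clf = get_model("LogisticRegression")
print(type(clf).__name__)
```

With a lookup like this, the model becomes an ordinary configuration value that can be changed without touching the experiment code.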

## Running Experiments from the Command Line

Access the experiment through the [Command Line Interface](https://sacred.readthedocs.io/en/stable/command_line.html).

```
cd src  # if required
python credit_experiment.py
```

We can also change the parameters of the experiment. For instance, using the same code, we can run an experiment with a logistic regression classifier using a basic (not power) preprocessing pipeline:

```
python .\credit_experiment.py with 'preprocessing="basic"' 'model="LogisticRegression"'
```

# A Few Notes About Sacred

+ Sacred is a powerful tool, but it is only the beginning.
+ Sacred is useful for keeping track of experiments within a limited scope: it is not a project management tool.
+ It works well in SQL environments, but handling hyperparameters can be painful.
+ Its natural backend is MongoDB; however, not all workplaces have a running instance.


## Experiment Schema

The database schema implemented by Sacred is shown below. The schema is a useful representation of the code and setup of an experiment. The package offers a [metrics API](https://sacred.readthedocs.io/en/stable/collected_information.html#metrics-api), but we have decided to extend the framework with a few ad hoc tables that store performance metrics.

The database backend is a database like any other: you can query it with Python, R, or Power BI.

+ The server runs on localhost, port 5432.
+ The user and password are in the .env file in `./05_src/db/`.

<div>
<img src=./images/05_sacred_sql_schema.png width="40%">
</div>