# Logistic Regression Model Training

In this section, we will:
1. Load the training data (`x_train_woe.csv` and `y_train_proxy.csv`).
2. Standardize the features.
3. Train a Logistic Regression model with balanced class weights.
4. Train a Decision Tree and tune its hyperparameters using Grid Search.
5. Evaluate each model's performance.

Let's begin by loading the necessary libraries.

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

## 1. Load Data

We load the WoE-transformed features (`x_train_woe.csv`) and the target labels (`y_train_proxy.csv`).
The target variable (`y_train_proxy`) is assumed to be in the `Label` column.

In [2]:
# Load the WoE-transformed data and labels
x_train = pd.read_csv('../data/processed/x_train_woe.csv')
y_train = pd.read_csv('../data/processed/y_train_proxy.csv')['Label']

# Preview the data
print(x_train.head())

   Transaction_Hour_woe  Transaction_Day_woe  Recency_woe  Stability_woe  \
0              0.543092             0.467015     1.751994            0.0   
1              1.123482             0.467015     1.751994            0.0   
2             -0.700449            -0.222864     1.751994            0.0   
3             -0.700449             0.147976     1.751994            0.0   
4              0.543092             0.467015     1.751994            0.0   

   Monetary_woe  Frequency_woe  Transaction_Month_woe  Transaction_Year_woe  \
0      1.771599       1.816266                    0.0                   0.0   
1      1.771599       1.816266                    0.0                   0.0   
2      1.771599       1.816266                    0.0                   0.0   
3      1.771599       1.816266                    0.0                   0.0   
4      1.771599       1.816266                    0.0                   0.0   

   AvgTransactionInterval_woe  
0                    0.599176  
1   

## 2. Data Preprocessing

We standardize the features with `StandardScaler` so that each one has a mean of 0 and a standard deviation of 1. This matters for Logistic Regression because scikit-learn applies L2 regularization by default, and the penalty on each coefficient depends on the scale of its feature. Standardized inputs also tend to help the solver converge.

In [3]:
# Standardize the features
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
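
As a quick check of what `StandardScaler` does, the sketch below uses a small made-up array (not the project data) and confirms that each column ends up with mean 0 and unit standard deviation. It also shows the usual pattern of reusing the fitted scaler on new rows. The names `x_demo`, `scaler_demo`, and `x_new` are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (illustrative values only, not the WoE data)
x_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0],
                   [4.0, 40.0]])

scaler_demo = StandardScaler()
x_demo_scaled = scaler_demo.fit_transform(x_demo)

# Each column now has mean 0 and (population) standard deviation 1
col_means = x_demo_scaled.mean(axis=0)
col_stds = x_demo_scaled.std(axis=0)
print(col_means, col_stds)

# New rows are transformed with the statistics learned during fit
x_new = np.array([[2.5, 25.0]])
x_new_scaled = scaler_demo.transform(x_new)
print(x_new_scaled)
```

Because the scaler stores the training means and standard deviations, any later validation or scoring data should go through `transform` rather than a fresh `fit_transform`.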

## 3. Train Logistic Regression Model

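We initialize the Logistic Regression model with `class_weight='balanced'` and train it on the standardized training data. The report below is computed on the same data the model was trained on, so it is likely to be optimistic.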

In [7]:
# Initialize Logistic Regression with balanced class weights
log_reg = LogisticRegression(random_state=42, class_weight='balanced')

# Train the model
log_reg.fit(x_train_scaled, y_train)

# Evaluate performance on the training data (no held-out set is used here)
y_train_pred = log_reg.predict(x_train_scaled)
print("Logistic Regression with Balanced Class Weights:")
print(classification_report(y_train, y_train_pred))

Logistic Regression with Balanced Class Weights:
              precision    recall  f1-score   support

           0       1.00      0.87      0.93      3574
           1       0.02      1.00      0.03         8

    accuracy                           0.87      3582
   macro avg       0.51      0.94      0.48      3582
weighted avg       1.00      0.87      0.93      3582
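
To make the numbers in this report easier to read, the sketch below reproduces two pieces of arithmetic. First, it computes the weights that `class_weight='balanced'` assigns, using scikit-learn's formula `n_samples / (n_classes * count_per_class)` and the class counts from the support column. Second, it gives a rough reconstruction of the class-1 precision from the rounded recall values. The false-positive count is an estimate derived from rounded figures, not a value taken from the notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the support column of the report: 3574 vs 8
y_counts = np.array([0] * 3574 + [1] * 8)

# 'balanced' weights follow n_samples / (n_classes * count_per_class)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_counts
)
print(dict(zip([0, 1], weights.round(3))))

# Rough reconstruction of class-1 precision from the rounded report values
true_pos = 8                          # recall of 1.00 on 8 positives
false_pos = round((1 - 0.87) * 3574)  # class-0 recall is rounded, so this is approximate
precision_1 = true_pos / (true_pos + false_pos)
print(false_pos, round(precision_1, 3))
```

Under these assumptions each positive example is weighted several hundred times more heavily than a negative one, which is consistent with the model catching all 8 positives at the cost of several hundred false alarms.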



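The next cell repeats the training step above and additionally saves the fitted model to `logistic_regression_model.pkl`. The step-by-step notebook code for training Decision Trees follows after it.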

In [31]:
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

# Initialize Logistic Regression with balanced class weights
log_reg = LogisticRegression(random_state=42, class_weight='balanced')

# Train the model
log_reg.fit(x_train_scaled, y_train)

# Evaluate performance on the training data (no held-out set is used here)
y_train_pred = log_reg.predict(x_train_scaled)
print("Logistic Regression with Balanced Class Weights:")
print(classification_report(y_train, y_train_pred))

# Save the model to a .pkl file
with open('logistic_regression_model.pkl', 'wb') as f:
    pickle.dump(log_reg, f)

Logistic Regression with Balanced Class Weights:
              precision    recall  f1-score   support

           0       1.00      0.87      0.93      3574
           1       0.02      1.00      0.03         8

    accuracy                           0.87      3582
   macro avg       0.51      0.94      0.48      3582
weighted avg       1.00      0.87      0.93      3582
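
The saved model expects standardized inputs, so the fitted scaler needs to be available wherever the model is reloaded. One possible approach, sketched below on synthetic data, is to bundle the scaler and the classifier in a scikit-learn `Pipeline` and pickle them together. The file name `log_reg_pipeline_demo.pkl` and all `*_demo` names are hypothetical, and this is not the approach used in the notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the project data)
rng = np.random.default_rng(42)
x_demo = rng.normal(size=(200, 3))
y_demo = (x_demo[:, 0] + 0.5 * x_demo[:, 1] > 0).astype(int)

# Bundle scaler and classifier so both are persisted together
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=42, class_weight="balanced"),
)
pipe.fit(x_demo, y_demo)

# Write to a temporary file (hypothetical name), then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "log_reg_pipeline_demo.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(pipe, f)
    with open(model_path, "rb") as f:
        pipe_loaded = pickle.load(f)

# The reloaded pipeline reproduces the original predictions
same_predictions = np.array_equal(pipe.predict(x_demo), pipe_loaded.predict(x_demo))
print(same_predictions)
```

Pickled estimators are generally only safe to load with the same scikit-learn version that wrote them, and only from trusted sources.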



In [13]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix, roc_curve
import matplotlib.pyplot as plt

In [14]:
x_train_woe = pd.read_csv("../data/processed/x_train_woe.csv")
y_train_proxy = pd.read_csv("../data/processed/y_train_proxy.csv")

In [15]:
# Keep only the label column; y_train_proxy also contains CustomerId
y_train = y_train_proxy['Label'].values.ravel()

In [17]:
print(f"x_train_woe shape: {x_train_woe.shape}")
print(f"y_train_proxy shape: {y_train_proxy.shape}")

x_train_woe shape: (3582, 9)
y_train_proxy shape: (3582, 2)
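
The `(3582, 2)` shape matters when the labels are converted to an array: flattening a two-column frame interleaves both columns instead of returning one label per row. The sketch below illustrates this with a toy frame that has the same column names as `y_train_proxy` but made-up values.

```python
import pandas as pd

# Toy frame with the same two columns as y_train_proxy (values are made up)
y_demo = pd.DataFrame({"CustomerId": [101, 102, 103], "Label": [0, 1, 0]})

flattened = y_demo.values.ravel()     # mixes IDs and labels, length 2 * n_rows
labels = y_demo["Label"].to_numpy()   # labels only, length n_rows

print(flattened.shape, labels.shape)
```

Selecting the `Label` column before converting keeps the target the same length as the feature matrix.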


In [23]:
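# Keep rows of x_train_woe whose index also appears in y_train_proxy.
# Both frames use the default RangeIndex from read_csv, so no rows are dropped here.
x_train_woe = x_train_woe.loc[x_train_woe.index.isin(y_train_proxy.index)]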

In [24]:
print(f"Aligned shapes: x_train_woe = {x_train_woe.shape}, y_train_proxy = {y_train_proxy.shape}")

Aligned shapes: x_train_woe = (3582, 9), y_train_proxy = (3582, 2)


In [26]:
print(y_train_proxy.columns)

Index(['CustomerId', 'Label'], dtype='object')


In [27]:
# Extract the target column 'Label' from y_train_proxy
y_train = y_train_proxy['Label']

# Split Data into Training and Validation Sets
x_train, x_val, y_train_split, y_val = train_test_split(
    x_train_woe, y_train, test_size=0.2, random_state=42, stratify=y_train
)

# Train Decision Tree Classifier
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(x_train, y_train_split)

# Evaluate the Model
y_val_pred = dt_model.predict(x_val)
print("Classification Report for Decision Tree:")
print(classification_report(y_val, y_val_pred))

Classification Report for Decision Tree:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       715
           1       0.00      0.00      0.00         2

    accuracy                           1.00       717
   macro avg       0.50      0.50      0.50       717
weighted avg       0.99      1.00      1.00       717
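
The support column shows how little the validation split can say about class 1. The sketch below rebuilds a label vector with the class counts reported earlier (3574 negatives, 8 positives), applies the same stratified 80/20 split, and counts the positives that land in the validation part. The feature column is a placeholder and the `*_demo` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the class counts reported earlier: 3574 negatives, 8 positives
y_demo = np.array([0] * 3574 + [1] * 8)
x_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

_, x_val_demo, _, y_val_demo = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Count how many rows and how many positives land in the validation split
n_val = len(y_val_demo)
n_val_pos = int(y_val_demo.sum())
print(n_val, n_val_pos)

# With so few positives, a single prediction moves class-1 recall by a large step
recall_step = 1 / n_val_pos
print(recall_step)
```

With only a couple of positive examples held out, class-1 precision and recall can only take a few distinct values, so a score of 0.00 here is weak evidence on its own.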





In [28]:
# Train Decision Tree Classifier with class weights
dt_model_weighted = DecisionTreeClassifier(class_weight='balanced', random_state=42)
dt_model_weighted.fit(x_train, y_train_split)

# Evaluate the Model
y_val_pred_weighted = dt_model_weighted.predict(x_val)
print("Classification Report for Decision Tree (with Class Weights):")
print(classification_report(y_val, y_val_pred_weighted))

Classification Report for Decision Tree (with Class Weights):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       715
           1       0.00      0.00      0.00         2

    accuracy                           1.00       717
   macro avg       0.50      0.50      0.50       717
weighted avg       0.99      1.00      1.00       717



In [29]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for Decision Tree
param_grid = {
    'max_depth': [5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'criterion': ['gini', 'entropy']
}

# Initialize DecisionTreeClassifier
dt_model = DecisionTreeClassifier(class_weight='balanced', random_state=42)

# Perform Grid Search with cross-validation
grid_search = GridSearchCV(estimator=dt_model, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(x_train, y_train_split)

# Best parameters and the best score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

# Evaluate the best model from GridSearch
best_dt_model = grid_search.best_estimator_
y_val_pred_best = best_dt_model.predict(x_val)
print("Classification Report for Best Decision Tree (with GridSearchCV):")
print(classification_report(y_val, y_val_pred_best))

Best Parameters: {'criterion': 'gini', 'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best Score: 0.9947643979057592
Classification Report for Best Decision Tree (with GridSearchCV):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       715
           1       0.00      0.00      0.00         2

    accuracy                           1.00       717
   macro avg       0.50      0.50      0.50       717
weighted avg       0.99      1.00      1.00       717
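
A best score near 0.995 alongside zero recall for class 1 suggests that `scoring='accuracy'` is rewarding trees that mostly predict the majority class. One possible alternative, sketched below on synthetic imbalanced data, is to select hyperparameters with `scoring='balanced_accuracy'` and stratified folds. The reduced grid and the `*_demo` names are assumptions made for illustration, and the results say nothing about the project data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 5% positives); not the project data
x_demo, y_demo = make_classification(
    n_samples=1000, n_features=9, n_informative=4,
    weights=[0.95, 0.05], random_state=42,
)

# Reduced grid to keep the sketch quick
param_grid_demo = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5]}

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_demo = GridSearchCV(
    estimator=DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid=param_grid_demo,
    cv=cv_demo,
    scoring="balanced_accuracy",  # mean of per-class recall, so the minority class counts
)
grid_demo.fit(x_demo, y_demo)

best_params_demo = grid_demo.best_params_
best_score_demo = grid_demo.best_score_
print(best_params_demo, round(best_score_demo, 3))
```

Other built-in scorers such as `'recall'`, `'f1'`, or `'average_precision'` could be substituted in the same way, depending on which error type matters more.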



In [30]:
from sklearn.metrics import classification_report, roc_auc_score, matthews_corrcoef, recall_score
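
The metrics imported here are well suited to imbalanced labels. As a minimal sketch of how they behave, the example below scores a classifier that always predicts class 0, using made-up labels and scores (8 negatives, 2 positives). All values and names are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    matthews_corrcoef,
    recall_score,
    roc_auc_score,
)

# Made-up ground truth: 8 negatives followed by 2 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Hard predictions from a model that never predicts the positive class
y_pred = np.zeros_like(y_true)

# Made-up predicted probabilities for class 1
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.35, 0.3])

acc = accuracy_score(y_true, y_pred)      # looks high despite missing every positive
rec = recall_score(y_true, y_pred)        # recall for class 1
mcc = matthews_corrcoef(y_true, y_pred)   # 0 when predictions carry no information
auc = roc_auc_score(y_true, y_score)      # ranking quality, independent of threshold

print(acc, rec, mcc, auc)
```

Accuracy alone would make this model look acceptable, while recall and the Matthews correlation coefficient both expose that it never identifies a positive case.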

In [33]:
import pickle
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Define the parameter grid for Decision Tree
param_grid = {
    'max_depth': [5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'criterion': ['gini', 'entropy']
}

# Initialize DecisionTreeClassifier
dt_model = DecisionTreeClassifier(class_weight='balanced', random_state=42)

# Perform Grid Search with cross-validation
grid_search = GridSearchCV(estimator=dt_model, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(x_train, y_train_split)

# Best parameters and the best score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

# Evaluate the best model from GridSearch
best_dt_model = grid_search.best_estimator_
y_val_pred_best = best_dt_model.predict(x_val)
print("Classification Report for Best Decision Tree (with GridSearchCV):")
print(classification_report(y_val, y_val_pred_best))

# Save the best model from GridSearchCV
with open('best_decision_tree_model.pkl', 'wb') as f:
    pickle.dump(best_dt_model, f)

Best Parameters: {'criterion': 'gini', 'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best Score: 0.9947643979057592
Classification Report for Best Decision Tree (with GridSearchCV):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       715
           1       0.00      0.00      0.00         2

    accuracy                           1.00       717
   macro avg       0.50      0.50      0.50       717
weighted avg       0.99      1.00      1.00       717

