# Logistic Regression Model Training

This notebook demonstrates a structured workflow for **Logistic Regression model training, tuning, and evaluation**.

### Overview of Iterations
| Iteration | Description |
|------------|--------------|
| **1** | Evaluate model performance on different data fractions |
| **2** | Hyperparameter tuning on 10% of training data |
| **3** | Train on full data (max_iter = 1000) |
| **4** | Train on full data (max_iter = 2500) for convergence |

Models trained in Iterations 2–4 are saved as `.pkl` files in the `../models/` directory for future evaluation. Iteration 1 only reports cross-validation scores and saves nothing.

In [1]:
import joblib
import pandas as pd
import time
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X_train = joblib.load('../data/processed/X_train_processed.pkl')
y_train = pd.read_csv('../data/processed/y_train.csv').squeeze()

print(f"Training shape: {X_train.shape}")

Training shape: (761791, 5108)


## **Iteration 1: Test Data Fractions**

**Goal:**  
Evaluate how model performance (AUC) scales with different fractions of the training data using 3-fold cross-validation.

In [2]:
fractions = [0.1, 0.25, 0.5, 0.75, 1.0]
print("=== Iteration 1: Test Data Fractions ===")

for f in fractions:
    start = time.time()
    if f == 1.0:
        X_sub, y_sub = X_train, y_train
    else:
        X_sub, _, y_sub, _ = train_test_split(
            X_train, y_train, train_size=f, stratify=y_train, random_state=42
        )
    model = LogisticRegression(max_iter=500, solver='saga', random_state=42)
    auc = cross_val_score(model, X_sub, y_sub, cv=3, scoring='roc_auc', n_jobs=-1).mean()
    print(f"Fraction {f:.2f} | AUC: {auc:.4f} | Time: {time.time()-start:.2f}s")

=== Iteration 1: Test Data Fractions ===
Fraction 0.10 | AUC: 0.7141 | Time: 16.86s
Fraction 0.25 | AUC: 0.7315 | Time: 44.20s
Fraction 0.50 | AUC: 0.7441 | Time: 97.52s
Fraction 0.75 | AUC: 0.7491 | Time: 156.14s
Fraction 1.00 | AUC: 0.7520 | Time: 233.60s


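**Insights:**
- Model performance improves with more data, but the gains shrink: AUC rises by 0.030 from 10% to 50% of the data and by only 0.008 from 50% to 100%.  
- Using **10% of data** gives an AUC of 0.7141, about 0.04 below the full-data value of 0.7520, in roughly 7% of the computation time (16.86s vs. 233.60s). That makes it a reasonable proxy for comparing hyperparameters.  
- Training time grows slightly faster than linearly with data fraction: 10× the data takes about 14× as long.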
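The trade-off described above can be checked with a little arithmetic on the printed results. The following is a minimal sketch that only re-uses the figures from the Iteration 1 output; the variable names (`runs`, `summary`) are illustrative and not part of the notebook.

```python
# Figures copied from the Iteration 1 output: (fraction, mean AUC, seconds).
runs = [
    (0.10, 0.7141, 16.86),
    (0.25, 0.7315, 44.20),
    (0.50, 0.7441, 97.52),
    (0.75, 0.7491, 156.14),
    (1.00, 0.7520, 233.60),
]

# The last run uses the full training set and serves as the reference.
full_auc, full_time = runs[-1][1], runs[-1][2]

summary = []
for frac, auc, secs in runs:
    summary.append({
        "fraction": frac,
        # How much AUC is lost by using only this fraction.
        "auc_gap_to_full": round(full_auc - auc, 4),
        # Share of the full-data wall time this fraction needed.
        "time_share": round(secs / full_time, 3),
        # Constant under perfectly linear scaling; rising values mean superlinear cost.
        "secs_per_unit_fraction": round(secs / frac, 1),
    })

for row in summary:
    print(row)
```

The seconds-per-unit-fraction column climbs from about 169 to about 234, which is why the cost is better described as slightly superlinear than as strictly linear.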

## **Iteration 2: Hyperparameter Tuning on 10% Data**

**Goal:**  
Tune regularization type (`l1`, `l2`) and strength (`C`) using 5-fold cross-validation on 10% of the training data.

In [3]:
X_sub, _, y_sub, _ = train_test_split(
    X_train, y_train, train_size=0.1, stratify=y_train, random_state=42
)
params = [('l1', 0.01), ('l1', 0.1), ('l1', 1), ('l1', 10),
          ('l2', 0.01), ('l2', 0.1), ('l2', 1), ('l2', 10)]
results = []

print("=== Iteration 2: Hyperparameter Tuning (10%) ===")
for p, c in params:
    start = time.time()
    model = LogisticRegression(penalty=p, C=c, solver='saga', max_iter=500, random_state=42)
    auc = cross_val_score(model, X_sub, y_sub, cv=5, scoring='roc_auc', n_jobs=-1).mean()
    model.fit(X_sub, y_sub)
    filename = f"../models/lr_iter2_{p}_C{c}.pkl"
    joblib.dump(model, filename)
    results.append({'penalty': p, 'C': c, 'AUC': auc, 'time': time.time()-start, 'file': filename})
    print(f"{p.upper()} C={c:<4} | AUC: {auc:.4f} | Time: {results[-1]['time']:.2f}s | Saved: {filename}")

results_df = pd.DataFrame(results)
best = results_df.loc[results_df['AUC'].idxmax()]
results_df.to_csv('../models/hyperparam_tuning_summary_iter2.csv', index=False)

=== Iteration 2: Hyperparameter Tuning (10%) ===
L1 C=0.01 | AUC: 0.6085 | Time: 53.65s | Saved: ../models/lr_iter2_l1_C0.01.pkl
L1 C=0.1  | AUC: 0.6910 | Time: 146.32s | Saved: ../models/lr_iter2_l1_C0.1.pkl
L1 C=1    | AUC: 0.7160 | Time: 876.32s | Saved: ../models/lr_iter2_l1_C1.pkl
L1 C=10   | AUC: 0.7194 | Time: 1461.37s | Saved: ../models/lr_iter2_l1_C10.pkl
L2 C=0.01 | AUC: 0.6760 | Time: 31.03s | Saved: ../models/lr_iter2_l2_C0.01.pkl
L2 C=0.1  | AUC: 0.7155 | Time: 41.50s | Saved: ../models/lr_iter2_l2_C0.1.pkl
L2 C=1    | AUC: 0.7193 | Time: 40.21s | Saved: ../models/lr_iter2_l2_C1.pkl
L2 C=10   | AUC: 0.7197 | Time: 42.73s | Saved: ../models/lr_iter2_l2_C10.pkl

**Summary:**
| Penalty | C | Mean AUC | Observation |
|----------|---|-----------|-------------|
| L1 | 0.01–10 | 0.61–0.72 | Slower convergence, lower AUC at small `C` |
| L2 | 0.01–10 | 0.68–0.72 | Stable convergence, better AUC |
| **L2 (best)** | **10** | **0.7197** | **Highest mean AUC** |

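**Insights:**
- **Best model:** L2 penalty with `C = 10` (mean AUC 0.7197).  
- L1 penalty took longer to converge (54–1461s per setting vs. 31–43s for L2) and occasionally failed to converge within `max_iter = 500`.  
- L2 penalty provided more stable performance and faster convergence.  
- Larger `C` (weaker regularization) improves validation AUC for both penalties, but the gain flattens: for L2, going from `C = 1` to `C = 10` adds only 0.0004. Mild regularization is therefore sufficient on this subset.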
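To make the link between `C` and regularization strength concrete, here is a minimal sketch on a small synthetic dataset (not the notebook's data). It assumes scikit-learn's default L2 penalty and default solver. `X_demo`, `y_demo`, and `coef_norm` are hypothetical names introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the real notebook uses X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

def coef_norm(C):
    """Fit an L2-penalized model (the default penalty) and return the coefficient norm."""
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_demo, y_demo)
    return float(np.linalg.norm(model.coef_))

# Same C grid as in Iteration 2.
norms = {C: coef_norm(C) for C in (0.01, 0.1, 1, 10)}
for C, norm in norms.items():
    print(f"C={C:<5} | ||coef||_2 = {norm:.3f}")
```

Because `C` is the inverse of the regularization strength, small values shrink the coefficients hard, while large values let them grow toward the unpenalized solution.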

## **Iteration 3: Train Model on Full Data (max_iter = 1000)**

**Goal:**  
Train the best model (L2 penalty, C = 10) using 5-fold cross-validation on the full training dataset.

In [4]:
print("=== Iteration 3: Train Model on Full Data (max_iter=1000) ===")
start = time.time()
model_iter3 = LogisticRegression(penalty=best['penalty'], C=best['C'], solver='saga', max_iter=1000, random_state=42)
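# Fit on the full training set; this fitted model is the one saved below.
model_iter3.fit(X_train, y_train)
# cross_val_score clones the estimator and refits it on each fold, so despite its name
# `train_auc` holds the mean 5-fold cross-validated AUC, not the AUC on the training set.
train_auc = cross_val_score(model_iter3, X_train, y_train, cv=5, scoring='roc_auc', n_jobs=-1).mean()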
joblib.dump(model_iter3, '../models/lr_iter3.pkl')
print(f"Model Iter3 | Penalty={best['penalty']} | C={best['C']} | Train AUC: {train_auc:.4f} | Time: {time.time()-start:.2f}s")

=== Iteration 3: Train Model on Full Data (max_iter=1000) ===
Model Iter3 | Penalty=l2 | C=10.0 | Train AUC: 0.7582 | Time: 1166.80s


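**Insights:**
- Training on the full data raised the 5-fold cross-validated AUC from 0.7197 (10% subset) to 0.7582. The value printed as "Train AUC" is this cross-validated mean, not the AUC on the training set.  
- The model likely requires more iterations to converge fully.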
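One way to check whether a fit actually stopped because it ran out of iterations is to compare the fitted `n_iter_` attribute with `max_iter` and to record any `ConvergenceWarning`. The sketch below does this on synthetic data; `fit_and_check`, `X_demo`, and `y_demo` are hypothetical names, and the budgets are chosen only for illustration.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the real notebook uses X_train / y_train.
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=0)

def fit_and_check(max_iter):
    """Fit a saga model and report whether it exhausted its iteration budget."""
    model = LogisticRegression(C=10, solver="saga", max_iter=max_iter, random_state=42)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure the warning is recorded, not suppressed
        model.fit(X_demo, y_demo)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    epochs = int(model.n_iter_.max())  # epochs actually run by the solver
    return {
        "max_iter": max_iter,
        "epochs_run": epochs,
        "hit_limit": epochs >= max_iter,
        "warned": warned,
    }

# Compare a very tight budget with a generous one.
for budget in (5, 500):
    print(fit_and_check(budget))
```

If `hit_limit` is true, the solver used its whole budget. Raising `max_iter` or scaling the features are the usual next steps.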

## **Iteration 4: Train Model on Full Data (max_iter = 2500)**

**Goal:**  
Increase `max_iter` to ensure full convergence and observe any improvement in AUC.

In [5]:
print("=== Iteration 4: Train Model on Full Data (max_iter=2500) ===")
start = time.time()
model_iter4 = LogisticRegression(penalty=best['penalty'], C=best['C'], solver='saga', max_iter=2500, random_state=42)
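# Fit on the full training set; this fitted model is the one saved below.
model_iter4.fit(X_train, y_train)
# As in Iteration 3, `train_auc` is the mean 5-fold cross-validated AUC (the estimator is
# cloned and refit per fold), not the AUC measured on the training set.
train_auc = cross_val_score(model_iter4, X_train, y_train, cv=5, scoring='roc_auc', n_jobs=-1).mean()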
joblib.dump(model_iter4, '../models/lr_iter4.pkl')
print(f"Model Iter4 | Penalty={best['penalty']} | C={best['C']} | Train AUC: {train_auc:.4f} | Time: {time.time()-start:.2f}s")

=== Iteration 4: Train Model on Full Data (max_iter=2500) ===
Model Iter4 | Penalty=l2 | C=10.0 | Train AUC: 0.7604 | Time: 2996.58s


**Insights:**
- Convergence was still not perfect, and AUC improved only slightly, from 0.7582 to 0.7604 (+0.002), while run time grew from 1166.80s to 2996.58s (about 2.6×).  
- This indicates that model performance has **stabilized** and that further iterations would yield **diminishing returns**.

## **Final Insights**

The best-performing model is a **Logistic Regression** with **L2 regularization (C = 10)** and `max_iter = 2500`.  
It achieved a **5-fold cross-validated AUC of approximately 0.7604**, which changed by only 0.002 between `max_iter = 1000` and `max_iter = 2500`, indicating stable performance.  
The **L2 regularization** improves generalization by penalizing large coefficients, helping prevent overfitting.  
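The notebook prints its cross-validated score under the label "Train AUC", so it helps to see how the two quantities differ. The following is a minimal sketch on synthetic data, assuming the default L2 penalty and default solver. `X_demo`, `y_demo`, and the printed values are illustrative and unrelated to the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the real notebook uses X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=50, n_informative=5, random_state=0
)
model = LogisticRegression(C=10, max_iter=1000)

# Cross-validated AUC: each fold is scored on data the model did not see while fitting.
cv_auc = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc").mean()

# Training AUC: the model is scored on the same rows it was fitted on.
model.fit(X_demo, y_demo)
train_auc = roc_auc_score(y_demo, model.predict_proba(X_demo)[:, 1])

print(f"5-fold CV AUC: {cv_auc:.4f} | Training-set AUC: {train_auc:.4f}")
```

The cross-validated figure is the one to quote when estimating performance on unseen data. A training-set AUC is typically more optimistic.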

To load and use the trained model:
```python
import joblib
model = joblib.load('../models/lr_iter4.pkl')
```