# Hyperparameter Optimization

### Load Dataset

We use the preprocessed dataset that already has:
*    Encoded categorical variables (one-hot encoding)
*    Scaled numerical features
*    Feature engineering applied

In [None]:
import numpy as np
import pandas as pd
import lightgbm as lgb
import optuna
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

In [None]:
df = pd.read_csv('../data/processed/main_data_processed2.csv')

### Splitting Dataset

In [None]:
X = df.drop('Exited', axis=1)
y = df['Exited']

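X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,       # hold out 20% of rows for the final test
    random_state=42,     # reproducible split
    stratify=y           # keep the churn ratio equal in both splits
)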

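**Key Insights:**
- *stratify=y*: Keeps the class distribution of `Exited` the same in the train and test splits.
- This matters for imbalanced data: only about 20% of customers churn.
- Without it, a random split could leave one set with noticeably fewer churners than the other.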
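
To make the effect of stratification concrete, here is a minimal pure-Python sketch of the idea. It is not scikit-learn's implementation: the helper names `stratified_split` and `positive_rate` and the synthetic labels (20% positives) are assumptions made for illustration.

```python
import random

def positive_rate(labels):
    """Share of positive (1) labels in a list."""
    return sum(labels) / len(labels)

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling test_size from each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# Synthetic labels: 200 positives out of 1,000 rows (20%), mirroring the churn rate
labels = [1] * 200 + [0] * 800
train_idx, test_idx = stratified_split(labels)

train_rate = positive_rate([labels[i] for i in train_idx])
test_rate = positive_rate([labels[i] for i in test_idx])
print(f"train: {train_rate:.3f}, test: {test_rate:.3f}")
# train: 0.200, test: 0.200
```

Because each class is split separately, both sets end up with the same 20% positive rate regardless of the seed.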

### Setting Hyperparameter Options

In [3]:
def objective(trial):
    """Return the mean 5-fold F1 for one sampled LightGBM configuration."""
    param_grid = {
        # Fixed settings
        'objective': 'binary',
        'metric': 'auc',          # LightGBM's internal metric; the CV below scores with F1
        'verbosity': -1,
        'boosting_type': 'gbdt',
        'random_state': 42,
        'is_unbalance': True,     # reweight classes for the imbalanced target

        # Search space
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.1, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 20, 150),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'min_child_samples': trial.suggest_int('min_child_samples', 10, 100),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 1.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.0, 1.0),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),  # no effect unless subsample_freq > 0
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0)
    }

    model = lgb.LGBMClassifier(**param_grid)

    # Stratified folds keep the churn ratio consistent across splits
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='f1')

    return scores.mean()
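
The `log=True` flag on `learning_rate` tells Optuna to search that range on a logarithmic scale, so the interval from 0.005 to about 0.022 is treated as being as wide as the interval from about 0.022 to 0.1. The sketch below only illustrates plain log-uniform sampling with the standard library. It is not Optuna's sampler, and the helper name `sample_log_uniform` is made up for this example.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
low, high = 0.005, 0.1
samples = [sample_log_uniform(low, high, rng) for _ in range(10_000)]

# On a log scale the midpoint of the range is the geometric mean (~0.0224)
geometric_mean = math.sqrt(low * high)
share_below = sum(s < geometric_mean for s in samples) / len(samples)
print(f"geometric mean: {geometric_mean:.4f}")
print(f"share of samples below it: {share_below:.2f}")
```

Roughly half of the draws fall below the geometric mean, whereas linear sampling would place only about 18% of draws there.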

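**Key parameters explained:**
- *n_estimators*: Number of boosting rounds. More rounds can improve the fit, especially with a low learning rate, but they increase training time and the risk of overfitting.
- *learning_rate*: Shrinkage applied to each tree's contribution. Lower values learn more cautiously and usually need more rounds.
- *num_leaves*: Maximum number of leaves per tree. More leaves make each tree more flexible.
- *max_depth*: Maximum tree depth. Limits how deep a tree can grow, which helps control overfitting.
- *min_child_samples*: Minimum number of samples per leaf. Larger values stop the model from fitting leaves to a handful of rows.
- *reg_alpha* / *reg_lambda*: L1 / L2 regularization, which reduces overfitting.
- *subsample*: Fraction of rows sampled for each tree. In LightGBM it only takes effect when `subsample_freq` is greater than 0, which is not set here.
- *colsample_bytree*: Fraction of features sampled for each tree, which adds randomness.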
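
The `num_leaves` and `max_depth` ranges interact: a binary tree of depth `d` can have at most `2**d` leaves, so for shallow trees part of the `num_leaves` range (20 to 150) cannot be reached. A small sketch of that arithmetic, with `max_reachable_leaves` as a hypothetical helper name:

```python
def max_reachable_leaves(num_leaves, max_depth):
    """Leaves a tree can actually have given both limits (depth d allows at most 2**d)."""
    return min(num_leaves, 2 ** max_depth)

for depth in (3, 5, 7, 8):
    print(depth, max_reachable_leaves(150, depth))
# 3 8
# 5 32
# 7 128
# 8 150
```

With `max_depth=3`, any `num_leaves` value in the search range is capped at 8 leaves, so trials that sample a shallow depth mostly differ in their other parameters.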


**Interpreting Results:**

- *Best F1 Score*: The highest mean F1 across the 5 CV folds
- *Best Parameters*: The combination that achieved this score
- *Trial Process*: Optuna's default sampler (TPE) proposes each new combination based on the scores of earlier trials

In [None]:
study = optuna.create_study(direction='maximize')

print("Starting optimization...")
study.optimize(objective, n_trials=50)

print("-" * 50)
print(f"Best F1 Score: {study.best_value:.4f}")
print("Best Parameters:")
print(study.best_params)

[I 2025-12-08 17:34:38,830] A new study created in memory with name: no-name-e6ef6aa4-3180-4ecb-99dc-32209b0b3b1e


Starting optimization...


[I 2025-12-08 17:34:42,618] Trial 0 finished with value: 0.6042470012859436 and parameters: {'n_estimators': 295, 'learning_rate': 0.015066364369955908, 'num_leaves': 108, 'max_depth': 5, 'min_child_samples': 52, 'reg_alpha': 0.9560358705748271, 'reg_lambda': 0.29720717702943933, 'subsample': 0.5418915113899205, 'colsample_bytree': 0.9497656434546202}. Best is trial 0 with value: 0.6042470012859436.
[I 2025-12-08 17:35:04,449] Trial 1 finished with value: 0.6091198781735351 and parameters: {'n_estimators': 613, 'learning_rate': 0.02149573275382093, 'num_leaves': 72, 'max_depth': 10, 'min_child_samples': 63, 'reg_alpha': 0.5942321870749928, 'reg_lambda': 0.09234229836571217, 'subsample': 0.9820647659529835, 'colsample_bytree': 0.5224323331696423}. Best is trial 1 with value: 0.6091198781735351.
[I 2025-12-08 17:35:12,188] Trial 2 finished with value: 0.6058303586049208 and parameters: {'n_estimators': 620, 'learning_rate': 0.06543517965590022, 'num_leaves': 55, 'max_depth': 5, 'min_chil

--------------------------------------------------
Best F1 Score: 0.6187
Best Parameters:
{'n_estimators': 109, 'learning_rate': 0.02675719739765303, 'num_leaves': 103, 'max_depth': 11, 'min_child_samples': 64, 'reg_alpha': 0.6955488155903736, 'reg_lambda': 0.4871752440349787, 'subsample': 0.7102664732940085, 'colsample_bytree': 0.5088658536687106}


### Training Model With Best Parameters

In [10]:
best_params = study.best_params

best_params['objective'] = 'binary'
best_params['is_unbalance'] = True
best_params['random_state'] = 42
best_params['verbosity'] = -1

final_lgbm = lgb.LGBMClassifier(**best_params)

final_lgbm.fit(X_train, y_train)

print("Final model successfully trained.")

Final model successfully trained.


In [9]:
from sklearn.metrics import f1_score

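# Probabilities on the same training data the model was fitted on,
# so the F1 found below is optimistic
y_pred_proba = final_lgbm.predict_proba(X_train)[:, 1]

# Candidate thresholds from 0.10 up to, but not including, 0.90 in steps of 0.05
thresholds = np.arange(0.1, 0.9, 0.05)
best_f1 = 0
best_threshold = 0.5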

for thr in thresholds:
    y_pred_binary = (y_pred_proba > thr).astype(int)
    current_f1 = f1_score(y_train, y_pred_binary)
    
    if current_f1 > best_f1:
        best_f1 = current_f1
        best_threshold = thr

print(f"F1 Score After Optuna: {study.best_value:.4f}")
print(f"Threshold Adjusted F1: {best_f1:.4f}")
print(f"Optimal Threshold: {best_threshold}")

F1 Score After Optuna: 0.6187
Threshold Adjusted F1: 0.7453
Optimal Threshold: 0.5500000000000002


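**Why Threshold Tuning?**

*    The default threshold (0.5) may not be optimal for imbalanced data.
*    We therefore sweep thresholds from 0.10 up to (but not including) 0.90 in steps of 0.05 and keep the one with the highest F1.
*    The sweep above scores predictions on the training data the model was fitted on, so the 0.7453 F1 is optimistic and not directly comparable to the cross-validated 0.6187. The held-out test set gives the more reliable estimate.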
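
One way to choose the threshold without scoring the model on rows it was trained on is to use out-of-fold probabilities from `cross_val_predict`. The sketch below is self-contained and rests on several assumptions: synthetic data from `make_classification` stands in for the churn dataset, `LogisticRegression` stands in for the tuned LightGBM model, and `best_f1_threshold` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def best_f1_threshold(y_true, proba, thresholds):
    """Return (threshold, f1) for the threshold with the highest F1."""
    scores = [f1_score(y_true, (proba > t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return float(thresholds[best]), float(scores[best])

# Synthetic imbalanced data (~20% positives) standing in for X_train, y_train
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000, class_weight='balanced')

# Each row's probability comes from a model that never saw that row
oof_proba = cross_val_predict(
    model, X_demo, y_demo, cv=cv, method='predict_proba'
)[:, 1]

thresholds = np.linspace(0.10, 0.85, 16)  # 0.10, 0.15, ..., 0.85
threshold, oof_f1 = best_f1_threshold(y_demo, oof_proba, thresholds)
print(f"Out-of-fold threshold: {threshold:.2f}, F1: {oof_f1:.4f}")
```

In the notebook, the same pattern would use an `LGBMClassifier` with the best parameters on `X_train` and `y_train`, and the chosen threshold would then be applied once to the test set.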

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

y_test_proba = final_lgbm.predict_proba(X_test)[:, 1]
best_threshold = 0.60  # set by hand; note that the sweep above reported 0.55
y_test_pred = (y_test_proba > best_threshold).astype(int)

print(f"TEST SET RESULT (Threshold: {best_threshold})")
print("-" * 40)
print(classification_report(y_test, y_test_pred))
print("-" * 40)

cm = confusion_matrix(y_test, y_test_pred)
print("Confusion Matrix:\n", cm)

TEST SET RESULT (Threshold: 0.6)
----------------------------------------
              precision    recall  f1-score   support

           0       0.90      0.91      0.91      1593
           1       0.64      0.61      0.63       407

    accuracy                           0.85      2000
   macro avg       0.77      0.76      0.77      2000
weighted avg       0.85      0.85      0.85      2000

----------------------------------------
Confusion Matrix:
 [[1455  138]
 [ 157  250]]


**Test Set - Final Results**

**Metrics to focus on:**
- *Precision (Class 1)*: Of predicted churns, how many actually churned?
- *Recall (Class 1)*: Of actual churns, how many did we catch?
- *F1-Score (Class 1)*: Harmonic mean of precision and recall
- *Confusion Matrix*: Shows true/false positives and negatives
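
As a check on the report above, the class 1 metrics can be recomputed by hand from the confusion matrix. Only the four counts come from the notebook output; the variable names are just for illustration.

```python
# Counts from the confusion matrix above: rows = actual, columns = predicted
tn, fp = 1455, 138
fn, tp = 157, 250

precision = tp / (tp + fp)   # of predicted churners, share that actually churned
recall = tp / (tp + fn)      # of actual churners, share that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.64 recall=0.61 f1=0.63
```

In absolute terms, the model catches 250 of the 407 churners in the test set and raises 138 false alarms.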

### Saving Artifacts for the Pipeline

In [14]:
import joblib

joblib.dump(final_lgbm, '../models/lgbm_final_model.pkl')

model_columns = list(X_train.columns)
joblib.dump(model_columns, '../models/model_columns.pkl')

print("Model and column list saved to '../models/'.")

Model and column list saved to '../models/'.
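
The saved column list lets an inference pipeline rebuild the exact feature layout the model was trained on. This matters with one-hot encoding, because a new batch may lack some dummy columns or arrive in a different order. A minimal sketch, assuming pandas and using made-up column names (the real ones would come from `model_columns.pkl`):

```python
import pandas as pd

# Hypothetical training columns; in the pipeline these would be loaded with
# joblib.load('../models/model_columns.pkl')
model_columns = ['Age', 'Balance', 'Geography_Germany', 'Geography_Spain']

# A new batch that is missing one dummy column and has a different column order
new_data = pd.DataFrame({
    'Geography_Germany': [1, 0],
    'Balance': [0.3, -1.2],
    'Age': [0.5, 1.1],
})

# Reorder to the training layout and fill absent columns with 0
aligned = new_data.reindex(columns=model_columns, fill_value=0)
print(list(aligned.columns))
print(aligned['Geography_Spain'].tolist())
```

The `aligned` frame can then be passed to the loaded model's `predict_proba`.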
