04- Model Optimization - Cancer Risk Prediction
============================================
Several models are tested with hyperparameter tuning in order to obtain the best performance.

Models to be tested:
1. Logistic Regression (Baseline)
2. Random Forest
3. XGBoost
4. LightGBM
5. Gradient Boosting

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import xgboost as xgb
import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')

In [6]:
try:
    df = pd.read_csv('../data/processed/cancer_data_feature_engineered.csv')
    print("Feature engineered data loaded!")
except FileNotFoundError:
    # Fall back to the raw dataset when the previous step's output is missing.
    df = pd.read_csv('../data/raw/cancer-patient-data-sets.csv')
    print("Using original data (feature engineered data not found!)")

print(f"Data Shape: {df.shape}")

# Particularly in larger ML projects, this structure applies the logic "if the output of the previous
# step is missing, at least try the raw data", which makes the code more robust and reproducible.
# Loading the data this way does not optimize model performance; it optimizes the health and professionalism of the ML pipeline.

Feature engineered data loaded!
Data Shape: (1000, 44)
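
The fallback logic described in the comments above can also be written as a small helper that walks a list of candidate files. This is only a sketch: `load_first_available` is a hypothetical name, and the demo uses a temporary directory with a made-up two-row CSV rather than the project's data folder.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_first_available(paths):
    """Return (DataFrame, path) for the first CSV in `paths` that exists."""
    candidates = [Path(p) for p in paths]
    for path in candidates:
        if path.is_file():
            return pd.read_csv(path), path
    raise FileNotFoundError(f"None of the candidate files exist: {candidates}")


# Demo: a temporary directory stands in for the project's data folder.
with tempfile.TemporaryDirectory() as tmp:
    raw_path = Path(tmp) / "raw.csv"
    pd.DataFrame({"Age": [33, 17], "Level": ["Low", "Medium"]}).to_csv(raw_path, index=False)

    missing_path = Path(tmp) / "feature_engineered.csv"  # intentionally absent
    demo_df, used_path = load_first_available([missing_path, raw_path])

print(f"Loaded {used_path.name} with shape {demo_df.shape}")
```

Checking for the file explicitly, instead of catching every exception, keeps unrelated errors such as a malformed CSV visible.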


In [None]:
feature_cols = [col for col in df.columns if col not in ['index', 'Patient Id', 'Level']]
X = df[feature_cols]
y = df['Level']

print(f"Feature Count: {len(feature_cols)}")
print(f"Class Distribution:")
print(y.value_counts())

# The feature columns and the target variable were defined, and the columns that are not used were excluded.

Feature Count: 41
Class Distribution:
Level
High      365
Medium    332
Low       303
Name: count, dtype: int64


In [9]:
# The data was split into train and test sets.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [12]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\nTrain: {X_train.shape[0]}, Test: {X_test.shape[0]}")

# The features were scaled (standardized).


Train: 800, Test: 200
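
The scaler above is fitted once on the full training set, so the folds used later in cross-validation have already seen each other's statistics. One possible alternative, sketched here under the assumption of synthetic data from `make_classification` rather than the cancer dataset, is to put the scaler inside a `Pipeline` so that it is refitted on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the cancer dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler becomes a pipeline step, so each CV split fits it on its own training fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# The "clf__" prefix routes the parameter to the classifier step.
param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}

pipe_grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
pipe_grid.fit(X_tr, y_tr)

print(f"Best Parameters: {pipe_grid.best_params_}")
print(f"Test Accuracy:  {pipe_grid.score(X_te, y_te):.4f}")
```

With this layout the fitted pipeline can be saved as a single object, since it carries its own scaler.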


In [13]:

#  MODEL 1: LOGISTIC REGRESSION

lr_params = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l2'],
    'solver': ['lbfgs'],
    'max_iter': [1000]
}

print("🔧 Hyperparameter Tuning with GridSearchCV...")
lr_grid = GridSearchCV(
    LogisticRegression(random_state=42),
    lr_params,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
lr_grid.fit(X_train_scaled, y_train)

lr_best = lr_grid.best_estimator_
lr_train_score = lr_best.score(X_train_scaled, y_train)
lr_test_score = lr_best.score(X_test_scaled, y_test)
lr_cv_score = cross_val_score(lr_best, X_train_scaled, y_train, cv=5).mean()

print(f"Best Parameters: {lr_grid.best_params_}")
print(f"Train Accuracy: {lr_train_score:.4f}")
print(f"Test Accuracy:  {lr_test_score:.4f}")
print(f"CV Score:       {lr_cv_score:.4f}")

🔧 Hyperparameter Tuning with GridSearchCV...
Best Parameters: {'C': 0.1, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'lbfgs'}
Train Accuracy: 1.0000
Test Accuracy:  1.0000
CV Score:       1.0000


In [None]:
# MODEL 2: RANDOM FOREST

In [14]:
rf_params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 15, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

print("🔧 Hyperparameter Tuning with GridSearchCV...")
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    rf_params,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
rf_grid.fit(X_train, y_train)  # RF doesn't require scaling

rf_best = rf_grid.best_estimator_
rf_train_score = rf_best.score(X_train, y_train)
rf_test_score = rf_best.score(X_test, y_test)
rf_cv_score = cross_val_score(rf_best, X_train, y_train, cv=5).mean()

print(f"Best Parameters: {rf_grid.best_params_}")
print(f"Train Accuracy: {rf_train_score:.4f}")
print(f"Test Accuracy:  {rf_test_score:.4f}")
print(f"CV Score:       {rf_cv_score:.4f}")

🔧 Hyperparameter Tuning with GridSearchCV...
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Parameters: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Train Accuracy: 1.0000
Test Accuracy:  1.0000
CV Score:       1.0000
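
The "108 candidates, totalling 540 fits" line in the output follows directly from the grid: 3 × 4 × 3 × 3 combinations, each fitted on 5 folds. A short sketch that reproduces the count with scikit-learn's `ParameterGrid` (the grid is copied from the cell above):

```python
from sklearn.model_selection import ParameterGrid

rf_params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 15, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

n_candidates = len(ParameterGrid(rf_params))  # 3 * 4 * 3 * 3
n_fits = n_candidates * 5                     # cv=5 folds per candidate

print(f"{n_candidates} candidates, {n_fits} fits")
```

Running the same calculation before launching a search gives a quick estimate of its cost.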


In [16]:
# MODEL 3: XGBOOST

# Encode target for XGBoost
y_train_encoded = y_train.map({'Low': 0, 'Medium': 1, 'High': 2})
y_test_encoded = y_test.map({'Low': 0, 'Medium': 1, 'High': 2})

xgb_params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.8, 1.0]
}

print("🔧 Hyperparameter Tuning with GridSearchCV...")
xgb_grid = GridSearchCV(
    xgb.XGBClassifier(random_state=42, eval_metric='mlogloss'),
    xgb_params,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
xgb_grid.fit(X_train_scaled, y_train_encoded)

xgb_best = xgb_grid.best_estimator_
xgb_train_score = xgb_best.score(X_train_scaled, y_train_encoded)
xgb_test_score = xgb_best.score(X_test_scaled, y_test_encoded)
xgb_cv_score = cross_val_score(xgb_best, X_train_scaled, y_train_encoded, cv=5).mean()

print(f" Best Parameters: {xgb_grid.best_params_}")
print(f" Train Accuracy: {xgb_train_score:.4f}")
print(f" Test Accuracy:  {xgb_test_score:.4f}")
print(f" CV Score:       {xgb_cv_score:.4f}")

🔧 Hyperparameter Tuning with GridSearchCV...
Fitting 5 folds for each of 54 candidates, totalling 270 fits
 Best Parameters: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.8}
 Train Accuracy: 1.0000
 Test Accuracy:  1.0000
 CV Score:       1.0000
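
The target encoding used for XGBoost has to be reversed later when predictions are reported. A minimal sketch of keeping both directions in one place, using toy labels; `label_to_code` and `code_to_label` are illustrative names that do not appear in the notebook:

```python
import pandas as pd

label_to_code = {'Low': 0, 'Medium': 1, 'High': 2}
# Build the inverse mapping from the forward one so the two cannot drift apart.
code_to_label = {code: label for label, code in label_to_code.items()}

y_demo = pd.Series(['High', 'Low', 'Medium', 'Low'])  # toy labels
y_demo_encoded = y_demo.map(label_to_code)
y_demo_decoded = y_demo_encoded.map(code_to_label)

# Series.map turns an unmapped label into NaN, so check before training.
assert not y_demo_encoded.isna().any()

print(y_demo_encoded.tolist())
print(y_demo_decoded.tolist())
```

An explicit dictionary keeps the ordinal order Low < Medium < High, whereas scikit-learn's `LabelEncoder` would assign codes in alphabetical order of the labels.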


In [17]:
# MODEL 4: LIGHTGBM

lgb_params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7, -1],
    'learning_rate': [0.01, 0.1, 0.3],
    'num_leaves': [31, 50, 70]
}

print("🔧 Hyperparameter Tuning with GridSearchCV...")
lgb_grid = GridSearchCV(
    lgb.LGBMClassifier(random_state=42, verbose=-1),
    lgb_params,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
lgb_grid.fit(X_train_scaled, y_train_encoded)

lgb_best = lgb_grid.best_estimator_
lgb_train_score = lgb_best.score(X_train_scaled, y_train_encoded)
lgb_test_score = lgb_best.score(X_test_scaled, y_test_encoded)
lgb_cv_score = cross_val_score(lgb_best, X_train_scaled, y_train_encoded, cv=5).mean()

print(f"Best Parameters: {lgb_grid.best_params_}")
print(f"Train Accuracy: {lgb_train_score:.4f}")
print(f"Test Accuracy:  {lgb_test_score:.4f}")
print(f"CV Score:       {lgb_cv_score:.4f}")

🔧 Hyperparameter Tuning with GridSearchCV...
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Parameters: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 200, 'num_leaves': 31}
Train Accuracy: 1.0000
Test Accuracy:  1.0000
CV Score:       1.0000


In [18]:
# MODEL 5: GRADIENT BOOSTING

gb_params = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.8, 1.0]
}

print("🔧 Hyperparameter Tuning with GridSearchCV...")
gb_grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    gb_params,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
gb_grid.fit(X_train_scaled, y_train)

gb_best = gb_grid.best_estimator_
gb_train_score = gb_best.score(X_train_scaled, y_train)
gb_test_score = gb_best.score(X_test_scaled, y_test)
gb_cv_score = cross_val_score(gb_best, X_train_scaled, y_train, cv=5).mean()

print(f" Best Parameters: {gb_grid.best_params_}")
print(f" Train Accuracy: {gb_train_score:.4f}")
print(f" Test Accuracy:  {gb_test_score:.4f}")
print(f" CV Score:       {gb_cv_score:.4f}")


🔧 Hyperparameter Tuning with GridSearchCV...
Fitting 5 folds for each of 24 candidates, totalling 120 fits
 Best Parameters: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.8}
 Train Accuracy: 1.0000
 Test Accuracy:  1.0000
 CV Score:       1.0000


In [None]:
# 7. MODEL COMPARISON
# The models were compared with each other.

results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost', 'LightGBM', 'Gradient Boosting'],
    'Train_Accuracy': [lr_train_score, rf_train_score, xgb_train_score, lgb_train_score, gb_train_score],
    'Test_Accuracy': [lr_test_score, rf_test_score, xgb_test_score, lgb_test_score, gb_test_score],
    'CV_Score': [lr_cv_score, rf_cv_score, xgb_cv_score, lgb_cv_score, gb_cv_score]
})

results['Overfitting'] = results['Train_Accuracy'] - results['Test_Accuracy']
results = results.sort_values('Test_Accuracy', ascending=False)

print("\n" + results.to_string(index=False))

# Best model
best_model_name = results.iloc[0]['Model']
best_model_test_acc = results.iloc[0]['Test_Accuracy']
best_model_cv = results.iloc[0]['CV_Score']

print(f"\n BEST MODEL: {best_model_name}")
print(f"   Test Accuracy: {best_model_test_acc:.4f}")
print(f"   CV Score: {best_model_cv:.4f}")


              Model  Train_Accuracy  Test_Accuracy  CV_Score  Overfitting
Logistic Regression             1.0            1.0       1.0          0.0
      Random Forest             1.0            1.0       1.0          0.0
            XGBoost             1.0            1.0       1.0          0.0
           LightGBM             1.0            1.0       1.0          0.0
  Gradient Boosting             1.0            1.0       1.0          0.0

 BEST MODEL: Logistic Regression
   Test Accuracy: 1.0000
   CV Score: 1.0000
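
Because all five models tie on every metric, the "best" model is decided only by row order after sorting. One way to make the choice explicit is to sort on several keys and end with a deliberate tie-breaker. The sketch below assumes a hypothetical `Complexity_Rank` column (1 = simplest model) that is not part of the notebook's results table.

```python
import pandas as pd

results_demo = pd.DataFrame({
    'Model': ['Random Forest', 'XGBoost', 'Logistic Regression'],
    'Test_Accuracy': [1.0, 1.0, 1.0],
    'CV_Score': [1.0, 1.0, 1.0],
    # Hypothetical ranking: 1 = simplest model to deploy and explain.
    'Complexity_Rank': [2, 3, 1],
})

# Higher accuracy and CV score first; among ties, the simpler model wins.
results_demo = results_demo.sort_values(
    by=['Test_Accuracy', 'CV_Score', 'Complexity_Rank'],
    ascending=[False, False, True],
).reset_index(drop=True)

print(results_demo.iloc[0]['Model'])
```

With unique ranks the winner no longer depends on the order in which the models were listed.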


In [None]:
# DETAILED REPORT FOR BEST MODEL
# A detailed report was produced for the best model.

# Get best model predictions
if best_model_name == 'Logistic Regression':
    best_model = lr_best
    y_pred = best_model.predict(X_test_scaled)
elif best_model_name == 'Random Forest':
    best_model = rf_best
    y_pred = best_model.predict(X_test)
elif best_model_name == 'XGBoost':
    best_model = xgb_best
    y_pred_encoded = best_model.predict(X_test_scaled)
    y_pred = pd.Series(y_pred_encoded).map({0: 'Low', 1: 'Medium', 2: 'High'})
elif best_model_name == 'LightGBM':
    best_model = lgb_best
    y_pred_encoded = best_model.predict(X_test_scaled)
    y_pred = pd.Series(y_pred_encoded).map({0: 'Low', 1: 'Medium', 2: 'High'})
else:  # Gradient Boosting
    best_model = gb_best
    y_pred = best_model.predict(X_test_scaled)

print("\n Classification Report:")
print(classification_report(y_test, y_pred))

print("\n Confusion Matrix:")
cm = confusion_matrix(y_test, y_pred, labels=['Low', 'Medium', 'High'])
cm_df = pd.DataFrame(
    cm,
    index=['True: Low', 'True: Medium', 'True: High'],
    columns=['Pred: Low', 'Pred: Medium', 'Pred: High']
)
print(cm_df)


 Classification Report:
              precision    recall  f1-score   support

        High       1.00      1.00      1.00        73
         Low       1.00      1.00      1.00        61
      Medium       1.00      1.00      1.00        66

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200


 Confusion Matrix:
              Pred: Low  Pred: Medium  Pred: High
True: Low            61             0           0
True: Medium          0            66           0
True: High            0             0          73


In [24]:
# Save comparison results
results.to_csv('model_comparison_results.csv', index=False)
print(" Model comparison saved: model_comparison_results.csv")

# Save best model (using pickle or joblib)
import pickle

with open('best_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)
print(f" Best model saved: best_model.pkl ({best_model_name})")

with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)
print(" Scaler saved: scaler.pkl")

print("\n" + "="*80)
print("MODEL OPTIMIZATION COMPLETED! ")
print("="*80)
print(f"\n Winner: {best_model_name} with {best_model_test_acc:.4f} test accuracy")

 Model comparison saved: model_comparison_results.csv
 Best model saved: best_model.pkl (Logistic Regression)
 Scaler saved: scaler.pkl

MODEL OPTIMIZATION COMPLETED! 

 Winner: Logistic Regression with 1.0000 test accuracy
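
The saved artifacts are meant to be loaded again at prediction time. The sketch below shows that round trip under stated assumptions: the data is random toy data, the files are written to a temporary directory, and the `*_demo` names are illustrative rather than taken from the notebook.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the training set (an assumption).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 4))
y_demo = np.where(X_demo[:, 0] > 0, 'High', 'Low')

scaler_demo = StandardScaler().fit(X_demo)
model_demo = LogisticRegression(max_iter=1000).fit(scaler_demo.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / 'best_model.pkl'
    scaler_path = Path(tmp) / 'scaler.pkl'
    with open(model_path, 'wb') as f:
        pickle.dump(model_demo, f)
    with open(scaler_path, 'wb') as f:
        pickle.dump(scaler_demo, f)

    # Later, for example in a separate script: load both artifacts.
    with open(model_path, 'rb') as f:
        loaded_model = pickle.load(f)
    with open(scaler_path, 'rb') as f:
        loaded_scaler = pickle.load(f)

# Apply the same scaling that was used at training time before predicting.
X_new = rng.normal(size=(3, 4))
preds = loaded_model.predict(loaded_scaler.transform(X_new))
print(preds)
```

In this notebook the scaler applies to every model except Random Forest, which was trained on unscaled features; if that model had been saved, the inputs would need to be passed in unscaled.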
