# Model Optimization

## Goal

Apply hyperparameter optimization to the selected clustering models and find the best parameters for each.

### Optimization Strategy
- **K-Means**: n_clusters, init, n_init, max_iter
- **DBSCAN**: eps, min_samples
- **Hierarchical**: n_clusters, linkage
- Find the best parameters with **Grid Search / Random Search**
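
The cells in this notebook use an exhaustive grid. As a rough illustration of the random-search alternative named in the list above, the sketch below samples a subset of the same K-Means space with scikit-learn's `ParameterSampler`. The synthetic `X_demo` matrix, the budget of 8 draws, and the fixed `random_state=42` are assumptions made for the example, not values taken from the project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterSampler

# Synthetic stand-in for the scaled feature matrix (assumption, not the real data)
X_demo, _ = make_blobs(n_samples=500, centers=4, n_features=5, random_state=42)

# Same search space as the K-Means grid: 5 * 2 * 3 * 2 = 60 combinations
param_space = {
    'n_clusters': [3, 4, 5, 6, 7],
    'init': ['k-means++', 'random'],
    'n_init': [10, 20, 30],
    'max_iter': [300, 500],
}

# Draw 8 distinct combinations instead of evaluating all 60
sampled = list(ParameterSampler(param_space, n_iter=8, random_state=42))

random_results = []
for params in sampled:
    labels = KMeans(random_state=42, **params).fit_predict(X_demo)
    random_results.append({
        **params,
        'silhouette_score': silhouette_score(X_demo, labels),
    })

# Keep the sampled configuration with the highest silhouette score
best = max(random_results, key=lambda r: r['silhouette_score'])
print(best)
```

Random search trades completeness for speed: it may miss the best grid point, but it keeps the cost fixed when the space grows.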


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.model_selection import ParameterGrid
import warnings
import sys
from pathlib import Path

sys.path.append(str(Path('..').resolve()))
from src.config import RAW_DATA_DIR, TRAIN_FILE, MODEL_CONFIG
from src.data_loader import load_gaming_dataset, create_sample_gaming_dataset

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported!")


## Data Loading and Preparation


In [None]:
# Load dataset
try:
    df = load_gaming_dataset(RAW_DATA_DIR)
    if df is None or len(df) == 0:
        raise FileNotFoundError("Dataset not found")
except FileNotFoundError:
    df = create_sample_gaming_dataset(n_samples=20000, save_path=TRAIN_FILE)

# Feature engineering (the feature set produced after the baseline)
from src.pipeline import UserSegmentationPipeline
pipeline_temp = UserSegmentationPipeline()
df_processed = pipeline_temp.preprocess(df, is_training=True)

# Select numerical features
numerical_cols = df_processed.select_dtypes(include=[np.number]).columns.tolist()
X = df_processed[numerical_cols].fillna(df_processed[numerical_cols].median())

# Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Data prepared: {X_scaled.shape}")
print(f"Features: {len(numerical_cols)}")


## K-Means Hyperparameter Optimization


In [None]:
# K-Means parameter optimization
param_grid = {
    'n_clusters': [3, 4, 5, 6, 7],
    'init': ['k-means++', 'random'],
    'n_init': [10, 20, 30],
    'max_iter': [300, 500]
}

results = []
print("Running K-Means optimization...")

for params in ParameterGrid(param_grid):
    kmeans = KMeans(random_state=MODEL_CONFIG['random_state'], **params)
    labels = kmeans.fit_predict(X_scaled)
    
    sil_score = silhouette_score(X_scaled, labels)
    db_score = davies_bouldin_score(X_scaled, labels)
    ch_score = calinski_harabasz_score(X_scaled, labels)
    
    results.append({
        **params,
        'silhouette_score': sil_score,
        'davies_bouldin': db_score,
        'calinski_harabasz': ch_score,
        'inertia': kmeans.inertia_
    })
    print(f"n_clusters={params['n_clusters']}, init={params['init']}, n_init={params['n_init']}: Silhouette={sil_score:.4f}")

results_df = pd.DataFrame(results)
print("\nTop 5 parameter sets by silhouette score:")
print(results_df.nlargest(5, 'silhouette_score')[['n_clusters', 'init', 'n_init', 'max_iter', 'silhouette_score', 'davies_bouldin']])
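
The loop above records three metrics but ranks only by silhouette. One possible way to use all three is to average their ranks, as in the sketch below. The `rank_configs` helper and the toy numbers are hypothetical and exist only for illustration. The column names match those in `results_df`.

```python
import pandas as pd

def rank_configs(results: pd.DataFrame) -> pd.DataFrame:
    """Order configurations by mean rank across three clustering metrics."""
    ranks = pd.DataFrame({
        # Higher is better for silhouette and Calinski-Harabasz
        'sil_rank': results['silhouette_score'].rank(ascending=False),
        'ch_rank': results['calinski_harabasz'].rank(ascending=False),
        # Lower is better for Davies-Bouldin
        'db_rank': results['davies_bouldin'].rank(ascending=True),
    })
    ranked = results.copy()
    ranked['mean_rank'] = ranks.mean(axis=1)
    return ranked.sort_values('mean_rank').reset_index(drop=True)

# Toy values for illustration only, not results from this notebook
toy = pd.DataFrame({
    'n_clusters': [3, 4, 5],
    'silhouette_score': [0.30, 0.42, 0.35],
    'davies_bouldin': [1.40, 1.10, 1.00],
    'calinski_harabasz': [900.0, 1200.0, 1000.0],
})
print(rank_configs(toy)[['n_clusters', 'mean_rank']])
```

In the notebook this would be called as `rank_configs(results_df)`. Equal weighting of the three ranks is a simplifying assumption and may not match the project's priorities.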


## DBSCAN Optimization


In [None]:
# DBSCAN parameter optimization
dbscan_params = {
    'eps': [0.3, 0.5, 0.7, 1.0, 1.5],
    'min_samples': [3, 5, 7, 10]
}

dbscan_results = []
print("Running DBSCAN optimization...")

for eps in dbscan_params['eps']:
    for min_samples in dbscan_params['min_samples']:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(X_scaled)
        
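        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int(np.sum(labels == -1))
        
        if n_clusters > 1:  # Need at least 2 clusters
            # Score only clustered points; otherwise noise (-1) is treated as a cluster
            mask = labels != -1
            sil_score = silhouette_score(X_scaled[mask], labels[mask])
            db_score = davies_bouldin_score(X_scaled[mask], labels[mask])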
            
            dbscan_results.append({
                'eps': eps,
                'min_samples': min_samples,
                'n_clusters': n_clusters,
                'n_noise': n_noise,
                'silhouette_score': sil_score,
                'davies_bouldin': db_score
            })
            print(f"eps={eps}, min_samples={min_samples}: n_clusters={n_clusters}, Silhouette={sil_score:.4f}")

if dbscan_results:
    dbscan_df = pd.DataFrame(dbscan_results)
    print("\nTop DBSCAN parameter sets by silhouette score:")
    print(dbscan_df.nlargest(5, 'silhouette_score'))
else:
    print("No suitable DBSCAN parameters were found.")
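
The `eps` candidates above are fixed values. A common heuristic for choosing a data-driven range is to inspect the sorted distance from each point to its k-th nearest neighbor and look for the bend in that curve. The sketch below computes it on synthetic data. `k_distance_curve`, `X_demo`, and the choice of `k = min_samples` are assumptions for illustration, not part of the project code.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def k_distance_curve(X, k):
    """Return each point's distance to its k-th nearest neighbor, sorted ascending."""
    # Ask for k + 1 neighbors: when querying the training data itself,
    # each point comes back as its own nearest neighbor at distance 0
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])

# Synthetic stand-in for the scaled feature matrix (assumption, not the real data)
X_demo, _ = make_blobs(n_samples=1000, centers=4, n_features=5, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

min_samples = 5
curve = k_distance_curve(X_demo, k=min_samples)

# Upper percentiles of the curve give a rough range of eps values to try
for q in (50, 90, 95, 99):
    print(f"{q}th percentile of {min_samples}-NN distance: {np.percentile(curve, q):.3f}")
```

Plotting `curve` with `plt.plot(curve)` shows the bend directly; `eps` values near it are reasonable starting points for the grid.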


## Model Optimization Docs

### Optimized Models

1. **K-Means**: n_clusters, init, n_init, and max_iter were tuned with a grid search
2. **DBSCAN**: the eps and min_samples parameters were tested over a grid
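
The optimization strategy also lists hierarchical clustering (n_clusters, linkage), which neither entry above covers. A minimal sketch of the same grid approach with `AgglomerativeClustering` might look like the following. The synthetic `X_demo` and the three linkage options are assumptions chosen for the example.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix (assumption, not the real data)
X_demo, _ = make_blobs(n_samples=500, centers=4, n_features=5, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

hier_grid = {
    'n_clusters': [3, 4, 5, 6, 7],
    'linkage': ['ward', 'complete', 'average'],
}

hier_results = []
for params in ParameterGrid(hier_grid):
    # Agglomerative clustering is deterministic, so no random_state is needed
    labels = AgglomerativeClustering(**params).fit_predict(X_demo)
    hier_results.append({
        **params,
        'silhouette_score': silhouette_score(X_demo, labels),
        'davies_bouldin': davies_bouldin_score(X_demo, labels),
    })

hier_df = pd.DataFrame(hier_results)
print(hier_df.nlargest(5, 'silhouette_score'))
```

Memory use for agglomerative clustering grows roughly quadratically with the number of samples, so running this on the full dataset may require a subsample.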

### Results

- **Best K-Means parameters**: [To be filled in from the results]
- **Best DBSCAN parameters**: [To be filled in from the results]
- **Selected model**: [K-Means/DBSCAN/Hierarchical]

### Notes

- Silhouette score was used as the primary metric
- Based on business rules, 4 clusters were preferred
- See the Model Evaluation notebook for the final model selection
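
When the cluster count is fixed by business rules, the silhouette ranking can be restricted to configurations with that count. The sketch below shows one way to do so. `best_for_k` is a hypothetical helper and the toy rows are illustrative values, not results from this notebook.

```python
import pandas as pd

def best_for_k(results: pd.DataFrame, k: int = 4) -> pd.Series:
    """Return the highest-silhouette row among configurations with k clusters."""
    subset = results[results['n_clusters'] == k]
    if subset.empty:
        raise ValueError(f"No configuration with n_clusters={k}")
    return subset.loc[subset['silhouette_score'].idxmax()]

# Toy values for illustration only
toy = pd.DataFrame({
    'n_clusters': [3, 4, 4, 5],
    'init': ['k-means++', 'k-means++', 'random', 'random'],
    'silhouette_score': [0.45, 0.40, 0.38, 0.36],
})
print(best_for_k(toy, k=4))
```

In this toy table the 3-cluster row has the highest silhouette overall, but the helper returns the best 4-cluster row. The same trade-off applies when business rules override the metric.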
