# Model Evaluation

## Purpose

Determine the most suitable feature set based on feature importance, SHAP values, and business requirements.

### Evaluation Metrics
- Silhouette Score
- Davies-Bouldin Index
- Calinski-Harabasz Index
- Feature Importance (via PCA or feature selection)
- Segment Profiling


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
import warnings
import sys
from pathlib import Path

sys.path.append(str(Path('..').resolve()))
from src.config import *
from src.data_loader import load_gaming_dataset
from src.pipeline import UserSegmentationPipeline

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported!")


## Model Loading and Final Evaluation


In [None]:
# Load trained model
pipeline = UserSegmentationPipeline()
pipeline.load()

print("✅ Model loaded")
print(f"Features used: {len(pipeline.feature_names)}")
print(f"Number of clusters: {pipeline.n_clusters}")

# Load data for evaluation
df = load_gaming_dataset(RAW_DATA_DIR)
df_processed = pipeline.preprocess(df, is_training=False)
labels = pipeline.predict(df)

df_eval = df.copy()
df_eval['cluster'] = labels

print("\nCluster distribution:")
print(df_eval['cluster'].value_counts().sort_index())


## Final Model Metrics


In [None]:
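# Prepare data for metrics
# Note: a fresh StandardScaler is fitted on the evaluation data here. If the
# pipeline applies its own scaling before clustering, the scores below are
# computed in a slightly different feature space than the one the model used.
X = df_processed[pipeline.feature_names].fillna(df_processed[pipeline.feature_names].median())
scaler_eval = StandardScaler()
X_scaled = scaler_eval.fit_transform(X)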

# Calculate metrics
silhouette = silhouette_score(X_scaled, labels)
davies_bouldin = davies_bouldin_score(X_scaled, labels)
calinski_harabasz = calinski_harabasz_score(X_scaled, labels)

print("=" * 60)
print("FINAL MODEL METRICS")
print("=" * 60)
print(f"Silhouette Score: {silhouette:.4f}")
print(f"  Interpretation: {'Excellent' if silhouette > 0.5 else 'Good' if silhouette > 0.3 else 'Fair' if silhouette > 0.2 else 'Poor'}")
print()
print(f"Davies-Bouldin Index: {davies_bouldin:.4f}")
print("  (Lower is better, < 1 is excellent)")
print()
print(f"Calinski-Harabasz Index: {calinski_harabasz:.2f}")
print("  (Higher is better)")
print()
print(f"Number of features: {len(pipeline.feature_names)}")
print(f"Number of clusters: {pipeline.n_clusters}")
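
The overall silhouette score printed above is an average over all samples, so it can hide one weak cluster behind several strong ones. A minimal sketch of a per-cluster breakdown follows, using scikit-learn's `silhouette_samples` on synthetic data; `X_demo` and `demo_labels` are hypothetical stand-ins for `X_scaled` and `labels`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for the scaled feature matrix (assumption: 4 clusters).
X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=0)
demo_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_demo)

# One silhouette value per sample, then averaged within each cluster.
sample_scores = silhouette_samples(X_demo, demo_labels)
per_cluster = {
    int(c): float(sample_scores[demo_labels == c].mean())
    for c in np.unique(demo_labels)
}
overall = silhouette_score(X_demo, demo_labels)

print(f"Overall silhouette: {overall:.4f}")
for cluster_id, score in per_cluster.items():
    print(f"  Cluster {cluster_id}: {score:.4f}")
```

A cluster whose mean silhouette is far below the overall score is worth inspecting before the segment is handed to downstream teams.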


## Segment Profiling and Analysis


In [None]:
# Segment profiles
key_features = ['total_sessions', 'total_playtime_hours', 'total_spent_usd', 
                'login_frequency_per_week', 'engagement_score', 'max_level_reached']

segment_profiles = {}
for cluster_id in range(pipeline.n_clusters):
    cluster_data = df_eval[df_eval['cluster'] == cluster_id]
    profile = {
        'count': len(cluster_data),
        'percentage': len(cluster_data) / len(df_eval) * 100
    }
    for feature in key_features:
        if feature in cluster_data.columns:
            profile[f'avg_{feature}'] = cluster_data[feature].mean()
            profile[f'median_{feature}'] = cluster_data[feature].median()
    segment_profiles[cluster_id] = profile

# Segment naming: compute the quantile thresholds once, outside the loop
spend_q50 = df_eval['total_spent_usd'].quantile(0.5)
spend_q75 = df_eval['total_spent_usd'].quantile(0.75)
engage_q75 = df_eval['engagement_score'].quantile(0.75)

segment_names = {}
for cluster_id, profile in segment_profiles.items():
    avg_spending = profile.get('avg_total_spent_usd', 0)
    avg_engagement = profile.get('avg_engagement_score', 0)

    if avg_spending > spend_q75 and avg_engagement > engage_q75:
        segment_names[cluster_id] = "Whales (High Spenders)"
    elif avg_engagement > engage_q75:
        segment_names[cluster_id] = "Engaged Players"
    elif avg_spending > spend_q50:
        segment_names[cluster_id] = "Regular Players"
    else:
        segment_names[cluster_id] = "Casual Players"

print("=" * 60)
print("SEGMENT PROFILES")
print("=" * 60)
for cluster_id in range(pipeline.n_clusters):
    print(f"\n{segment_names.get(cluster_id, f'Segment {cluster_id}')} (Cluster {cluster_id}):")
    print(f"  Users: {segment_profiles[cluster_id]['count']} ({segment_profiles[cluster_id]['percentage']:.1f}%)")
    for key, value in segment_profiles[cluster_id].items():
        if key not in ['count', 'percentage']:
            print(f"  {key}: {value:.2f}")
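
The per-cluster loop above can also be written as a single pandas `groupby` aggregation, which returns the profiles as a DataFrame that is easier to sort, plot, or export. The sketch below uses a tiny made-up frame; `demo` and its values are hypothetical and stand in for `df_eval`.

```python
import pandas as pd

# Tiny made-up frame standing in for df_eval (values are illustrative only).
demo = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1, 2],
    "total_spent_usd": [0.0, 10.0, 50.0, 70.0, 90.0, 500.0],
    "engagement_score": [0.2, 0.4, 0.5, 0.6, 0.7, 0.9],
})

key_cols = ["total_spent_usd", "engagement_score"]

# Mean and median of every key feature, one row per cluster.
profiles = demo.groupby("cluster")[key_cols].agg(["mean", "median"])

# Flatten the (feature, statistic) column pairs into names such as
# 'avg_total_spent_usd' and 'median_total_spent_usd'.
profiles.columns = [
    f"{'avg' if stat == 'mean' else stat}_{col}" for col, stat in profiles.columns
]

# Add cluster size and share of all users as the first two columns.
sizes = demo["cluster"].value_counts().sort_index()
profiles.insert(0, "count", sizes)
profiles.insert(1, "percentage", sizes / len(demo) * 100)

print(profiles.round(2))
```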


## Feature Importance (with PCA)


In [None]:
# Feature importance via PCA
pca = PCA(n_components=min(10, len(pipeline.feature_names)))
X_pca = pca.fit_transform(X_scaled)

# Explained variance
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), 
        pca.explained_variance_ratio_)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('PCA Explained Variance')
plt.show()

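# Report at most 5 components so the label stays accurate for small feature sets
n_top = min(5, pca.n_components_)
print(f"Cumulative explained variance (first {n_top} components): {pca.explained_variance_ratio_[:n_top].sum():.2%}")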
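
Explained variance describes the components, not the original features. One common heuristic for turning a fitted PCA into per-feature scores is to weight each component's absolute loadings by its explained variance ratio and sum per feature. The sketch below shows that heuristic on synthetic data; it is not a standard scikit-learn metric, and `X_demo`, `demo_features`, and `pca_demo` are hypothetical stand-ins for `X_scaled`, `pipeline.feature_names`, and `pca`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix (assumption: 5 features).
X_raw, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=1)
demo_features = [f"feature_{i}" for i in range(5)]
X_demo = StandardScaler().fit_transform(X_raw)

pca_demo = PCA(n_components=3).fit(X_demo)

# components_ has shape (n_components, n_features); each row holds one
# component's feature weights. Weight the absolute loadings by the variance
# each component explains, then sum per feature.
weighted = np.abs(pca_demo.components_) * pca_demo.explained_variance_ratio_[:, None]
importance = weighted.sum(axis=0)
importance /= importance.sum()  # normalise so the scores sum to 1

for name, score in sorted(zip(demo_features, importance), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```

Because PCA is unsupervised, these scores reflect contribution to overall variance, which is not necessarily the same as contribution to cluster separation.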


## Model Evaluation Docs

### Final Model Selection

- **Model**: K-Means Clustering
- **Number of Clusters**: 4
- **Number of Features**: [Final feature count]
- **Silhouette Score**: [Final score]
- **Davies-Bouldin Index**: [Final score]

### Segment Naming

1. **Whales (High Spenders)**: High engagement and spending
2. **Engaged Players**: High engagement, medium spending
3. **Regular Players**: Medium engagement and spending
4. **Casual Players**: Low engagement and spending
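
The naming rules can be isolated in a small function, which makes them easy to unit-test. The sketch below mirrors the quantile thresholds used in the profiling cell (75th percentile for "high", median spending for "regular"); the function name `name_segment` and its parameter names are hypothetical. Note that nothing in these rules guarantees unique names, so two clusters can receive the same label.

```python
def name_segment(avg_spending, avg_engagement, spend_q50, spend_q75, engage_q75):
    """Map a cluster's average spending and engagement to a segment name.

    The thresholds are dataset-wide quantiles supplied by the caller.
    """
    if avg_spending > spend_q75 and avg_engagement > engage_q75:
        return "Whales (High Spenders)"
    if avg_engagement > engage_q75:
        return "Engaged Players"
    if avg_spending > spend_q50:
        return "Regular Players"
    return "Casual Players"


# Illustrative thresholds (made-up numbers, not taken from the dataset).
thresholds = dict(spend_q50=20.0, spend_q75=80.0, engage_q75=0.7)

print(name_segment(150.0, 0.9, **thresholds))
print(name_segment(10.0, 0.8, **thresholds))
print(name_segment(40.0, 0.5, **thresholds))
print(name_segment(5.0, 0.3, **thresholds))
```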

### Alignment with Business Requirements

- ✅ The number of segments complies with the business rules (4 segments)
- ✅ Each segment is meaningful and actionable
- ✅ Segment profiles are suitable for marketing strategies
- ✅ The model is ready for production
