# Baseline Model: K-Means Clustering

## Goal

Train a basic user segmentation model with the simplest possible pipeline and feature set. It serves as the reference point against which later improvements are compared.

### Baseline Strategy
- **Model**: K-Means Clustering
- **Feature Set**: The 4 most basic features (engagement- and spending-focused)
- **Preprocessing**: StandardScaler only
- **Number of Clusters**: 4 (per the business rules)
- **Metrics**: Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index, Inertia


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
import warnings
import sys
from pathlib import Path

sys.path.append(str(Path('..').resolve()))
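# Import only the config names this notebook uses instead of a wildcard
from src.config import RAW_DATA_DIR, TRAIN_FILE, MODEL_CONFIG, BUSINESS_RULES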
from src.data_loader import load_gaming_dataset, create_sample_gaming_dataset

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")


## Data Loading


In [None]:
# Load dataset
try:
    df = load_gaming_dataset(RAW_DATA_DIR)
    if df is None or len(df) == 0:
        raise FileNotFoundError("Dataset not found")
except FileNotFoundError:
    print("Creating sample dataset...")
    df = create_sample_gaming_dataset(n_samples=20000, save_path=TRAIN_FILE)

print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()[:10]}...")
df.head()


## Baseline Feature Set

For the baseline we select the most basic and important features:
1. **SessionsPerWeek**: Number of sessions per week (engagement)
2. **PlayTimeHours**: Total playtime in hours (engagement)
3. **InGamePurchases**: Total spending (monetization)
4. **login_frequency_per_week**: Weekly login frequency (activity)
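
Selecting columns by name can go wrong in two ways: a repeated name silently yields duplicate columns, and a missing name raises only at selection time. A small guard before building the feature matrix can catch both. The sketch below is an illustration only; `select_features` is a hypothetical helper, not part of `src`, and the toy DataFrame with made-up values stands in for the real dataset.

```python
import pandas as pd


def select_features(df: pd.DataFrame, features: list[str]) -> pd.DataFrame:
    """Return df[features] after checking the list (hypothetical helper)."""
    # A repeated name would give the same signal double weight in K-Means.
    duplicates = sorted({f for f in features if features.count(f) > 1})
    if duplicates:
        raise ValueError(f"Duplicate feature names: {duplicates}")
    missing = [f for f in features if f not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in dataset: {missing}")
    return df[features].copy()


# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "SessionsPerWeek": [3, 10, 7],
    "PlayTimeHours": [1.5, 12.0, 6.0],
    "InGamePurchases": [0.0, 25.0, 5.0],
    "login_frequency_per_week": [2, 7, 5],
})

X_toy = select_features(toy, list(toy.columns))
print(X_toy.shape)
```

Failing early here keeps a typo in the feature list from surfacing later as a confusing plot or a skewed distance computation.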


In [None]:
# Baseline feature set - minimal features for segmentation
baseline_features = [
    'SessionsPerWeek',
    'PlayTimeHours',
    'InGamePurchases',
    'login_frequency_per_week'  # weekly login frequency (activity)
]

# Prepare data
X_baseline = df[baseline_features].copy()

# Handle missing values (if any)
X_baseline = X_baseline.fillna(X_baseline.median())

print("=" * 60)
print("BASELINE FEATURE SET")
print("=" * 60)
print(f"Features: {baseline_features}")
print(f"\nFeature Statistics:")
print(X_baseline.describe())
print(f"\nMissing values: {X_baseline.isnull().sum().sum()}")


## Data Preprocessing

We standardize the data before clustering.
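
K-Means relies on Euclidean distance, so a feature with a large numeric range would otherwise dominate the clustering. As a minimal sketch on made-up numbers, the following shows that `StandardScaler` applies the z-score `(x - mean) / std` per column, using the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: column 0 is on a much larger scale than column 1.
X_demo = np.array([[100.0, 1.0],
                   [200.0, 2.0],
                   [300.0, 6.0]])

scaled = StandardScaler().fit_transform(X_demo)

# Manual z-score; ddof=0 (population std) is what StandardScaler uses.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, both columns have mean 0 and unit variance, so neither one dominates the distance computation purely because of its units.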


In [None]:
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_baseline)
X_scaled_df = pd.DataFrame(X_scaled, columns=baseline_features)

print("Data standardized successfully!")
print(f"Scaled data shape: {X_scaled_df.shape}")
print(f"\nScaled data statistics:")
print(X_scaled_df.describe())


## Determining the Optimal Number of Clusters (Elbow Method)
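
The elbow method plots inertia, the sum of squared distances from each point to its nearest centroid, against k and looks for the point where adding clusters stops paying off. A small sketch on synthetic blobs (not the gaming dataset) recomputes inertia by hand to make that definition concrete:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 groups, for illustration only.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# transform() returns distances to every centroid; keep the nearest one.
nearest_dist = km.transform(X_demo).min(axis=1)
manual_inertia = float(np.sum(nearest_dist ** 2))

print(np.isclose(manual_inertia, km.inertia_))

# Inertia typically keeps shrinking as k grows, which is why the bend
# in the curve (not its minimum) is what the elbow method looks for.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).inertia_
            for k in range(1, 7)]
print([round(v, 1) for v in inertias])
```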


In [None]:
# Elbow method to find optimal number of clusters
inertias = []
silhouette_scores = []
k_range = range(2, 11)

print("Testing different numbers of clusters...")
for k in k_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, 
                    max_iter=300, random_state=MODEL_CONFIG['random_state'])
    kmeans.fit(X_scaled)
    
    inertias.append(kmeans.inertia_)
    sil_score = silhouette_score(X_scaled, kmeans.labels_)
    silhouette_scores.append(sil_score)
    print(f"k={k}: Inertia={kmeans.inertia_:.2f}, Silhouette={sil_score:.3f}")

# Plot results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Elbow curve
axes[0].plot(k_range, inertias, 'bo-', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Clusters (k)', fontsize=12)
axes[0].set_ylabel('Inertia', fontsize=12)
axes[0].set_title('Elbow Method', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Silhouette scores
axes[1].plot(k_range, silhouette_scores, 'ro-', linewidth=2, markersize=8)
axes[1].set_xlabel('Number of Clusters (k)', fontsize=12)
axes[1].set_ylabel('Silhouette Score', fontsize=12)
axes[1].set_title('Silhouette Score vs Number of Clusters', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Find optimal k (highest silhouette score)
optimal_k = k_range[np.argmax(silhouette_scores)]
print(f"\nOptimal number of clusters (based on silhouette): {optimal_k}")
print(f"Best silhouette score: {max(silhouette_scores):.3f}")


## Baseline Model: K-Means Clustering

We use 4 clusters, as specified by the business rules.


In [None]:
# Baseline model with 4 clusters (from business rules)
n_clusters = BUSINESS_RULES['segment_count']

print("=" * 60)
print("TRAINING BASELINE MODEL")
print("=" * 60)
print(f"Number of clusters: {n_clusters}")
print(f"Features: {baseline_features}")

# Train K-Means
kmeans_baseline = KMeans(
    n_clusters=n_clusters,
    init='k-means++',
    n_init=10,
    max_iter=300,
    random_state=MODEL_CONFIG['random_state']
)

kmeans_baseline.fit(X_scaled)
labels = kmeans_baseline.labels_

# Add cluster labels to dataframe
df_baseline = df.copy()
df_baseline['cluster'] = labels

print(f"\n✅ Model trained successfully!")
print(f"Cluster distribution:")
print(df_baseline['cluster'].value_counts().sort_index())


## Model Evaluation Metrics
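
As a quick reference for reading these metrics, the sketch below computes them on synthetic blobs rather than on the gaming data. The `sample_size` argument of `silhouette_score` is shown because the full silhouette computation needs all pairwise distances and gets slow on large datasets; the value 500 is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in data; not the gaming dataset.
X_demo, _ = make_blobs(n_samples=2000, centers=4, cluster_std=1.0, random_state=42)
labels_demo = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_demo)

demo_metrics = {
    # Full computation over all samples.
    "silhouette_full": silhouette_score(X_demo, labels_demo),
    # Estimate from a random subset of 500 samples.
    "silhouette_sampled": silhouette_score(
        X_demo, labels_demo, sample_size=500, random_state=42
    ),
    "davies_bouldin": davies_bouldin_score(X_demo, labels_demo),
    "calinski_harabasz": calinski_harabasz_score(X_demo, labels_demo),
}
for name, value in demo_metrics.items():
    print(f"{name}: {value:.4f}")
```

The sampled silhouette is an estimate, so fixing `random_state` keeps it reproducible across runs.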


In [None]:
# Calculate clustering metrics
silhouette = silhouette_score(X_scaled, labels)
davies_bouldin = davies_bouldin_score(X_scaled, labels)
calinski_harabasz = calinski_harabasz_score(X_scaled, labels)
inertia = kmeans_baseline.inertia_

print("=" * 60)
print("BASELINE MODEL METRICS")
print("=" * 60)
print(f"Silhouette Score: {silhouette:.4f}")
print(f"  (Range: -1 to 1, higher is better)")
print(f"  Interpretation: {'Good' if silhouette > 0.3 else 'Fair' if silhouette > 0.2 else 'Poor'}")
print()
print(f"Davies-Bouldin Index: {davies_bouldin:.4f}")
print(f"  (Lower is better)")
print()
print(f"Calinski-Harabasz Index: {calinski_harabasz:.2f}")
print(f"  (Higher is better)")
print()
print(f"Inertia (Within-cluster sum of squares): {inertia:.2f}")
print(f"  (Lower is better)")

# Store baseline metrics
baseline_metrics = {
    'silhouette_score': silhouette,
    'davies_bouldin_index': davies_bouldin,
    'calinski_harabasz_index': calinski_harabasz,
    'inertia': inertia,
    'n_clusters': n_clusters,
    'n_features': len(baseline_features)
}

print("\n" + "=" * 60)
print("BASELINE METRICS SUMMARY")
print("=" * 60)
for key, value in baseline_metrics.items():
    print(f"{key}: {value}")


## Cluster Profiling and Analysis
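
Cluster profiling boils down to grouping by the assigned label and summarizing each feature. A compact sketch on a made-up six-row table, using pandas named aggregation, might look like this (the column values are invented for illustration):

```python
import pandas as pd

# Invented rows: three users in cluster 0, two in cluster 1, one in cluster 2.
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 2],
    "PlayTimeHours": [2.0, 4.0, 6.0, 20.0, 30.0, 10.0],
    "InGamePurchases": [0.0, 0.0, 3.0, 50.0, 70.0, 5.0],
})

# Named aggregation: output_column=(input_column, aggregation).
profile = toy.groupby("cluster").agg(
    users=("PlayTimeHours", "size"),
    avg_playtime=("PlayTimeHours", "mean"),
    avg_spending=("InGamePurchases", "mean"),
)
profile["share_pct"] = (profile["users"] / len(toy) * 100).round(1)
print(profile)
```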


In [None]:
# Cluster profiles
print("=" * 60)
print("CLUSTER PROFILES")
print("=" * 60)

cluster_profile = df_baseline.groupby('cluster')[baseline_features].mean()
cluster_profile['count'] = df_baseline.groupby('cluster').size()
cluster_profile['percentage'] = (cluster_profile['count'] / len(df_baseline) * 100).round(2)

print("\nCluster Statistics (Mean values):")
print(cluster_profile)

# Additional insights
print("\n" + "=" * 60)
print("CLUSTER INSIGHTS")
print("=" * 60)
for cluster_id in range(n_clusters):
    cluster_data = df_baseline[df_baseline['cluster'] == cluster_id]
    print(f"\nCluster {cluster_id} ({len(cluster_data)} users, {len(cluster_data)/len(df_baseline)*100:.1f}%):")
    print(f"  Avg Sessions: {cluster_data['SessionsPerWeek'].mean():.1f}")
    print(f"  Avg Playtime: {cluster_data['PlayTimeHours'].mean():.1f} hours")
    print(f"  Avg Spending: ${cluster_data['InGamePurchases'].mean():.2f}")
    print(f"  Avg Login Frequency: {cluster_data['login_frequency_per_week'].mean():.2f} per week")


## Visualization: Cluster Distributions


In [None]:
# Visualize clusters
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Playtime vs Spending
scatter1 = axes[0, 0].scatter(df_baseline['PlayTimeHours'], 
                               df_baseline['InGamePurchases'],
                               c=df_baseline['cluster'], 
                               cmap='viridis', 
                               alpha=0.6, 
                               s=30)
axes[0, 0].set_xlabel('Total Playtime (Hours)', fontsize=11)
axes[0, 0].set_ylabel('Total Spent (USD)', fontsize=11)
axes[0, 0].set_title('Clusters: Playtime vs Spending', fontsize=12, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)
plt.colorbar(scatter1, ax=axes[0, 0], label='Cluster')

# Plot cluster centers
centers = scaler.inverse_transform(kmeans_baseline.cluster_centers_)
centers_df = pd.DataFrame(centers, columns=baseline_features)
axes[0, 0].scatter(centers_df['PlayTimeHours'], 
                   centers_df['InGamePurchases'],
                   c='red', marker='x', s=200, linewidths=3, label='Centroids')
axes[0, 0].legend()

# 2. Sessions vs Login Frequency
# The y-axis uses login_frequency_per_week so the plot matches its labels
scatter2 = axes[0, 1].scatter(df_baseline['SessionsPerWeek'], 
                               df_baseline['login_frequency_per_week'],
                               c=df_baseline['cluster'], 
                               cmap='plasma', 
                               alpha=0.6, 
                               s=30)
axes[0, 1].set_xlabel('Sessions per Week', fontsize=11)
axes[0, 1].set_ylabel('Login Frequency per Week', fontsize=11)
axes[0, 1].set_title('Clusters: Sessions vs Login Frequency', fontsize=12, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)
plt.colorbar(scatter2, ax=axes[0, 1], label='Cluster')
axes[0, 1].scatter(centers_df['SessionsPerWeek'], 
                   centers_df['login_frequency_per_week'],
                   c='red', marker='x', s=200, linewidths=3, label='Centroids')
axes[0, 1].legend()

# 3. Cluster size distribution
cluster_counts = df_baseline['cluster'].value_counts().sort_index()
axes[1, 0].bar(cluster_counts.index.astype(str), cluster_counts.values, 
               color=['#3498db', '#e74c3c', '#2ecc71', '#f39c12'])
axes[1, 0].set_xlabel('Cluster', fontsize=11)
axes[1, 0].set_ylabel('Number of Users', fontsize=11)
axes[1, 0].set_title('Cluster Size Distribution', fontsize=12, fontweight='bold')
for i, v in enumerate(cluster_counts.values):
    axes[1, 0].text(i, v, str(v), ha='center', va='bottom', fontweight='bold')

# 4. Feature means by cluster
cluster_means = df_baseline.groupby('cluster')[baseline_features].mean()
x = np.arange(len(baseline_features))
width = 0.2
for i, cluster_id in enumerate(range(n_clusters)):
    offset = (i - n_clusters/2 + 0.5) * width
    axes[1, 1].bar(x + offset, cluster_means.loc[cluster_id], 
                   width, label=f'Cluster {cluster_id}', alpha=0.8)
axes[1, 1].set_xlabel('Features', fontsize=11)
axes[1, 1].set_ylabel('Mean Value', fontsize=11)
axes[1, 1].set_title('Feature Means by Cluster', fontsize=12, fontweight='bold')
axes[1, 1].set_xticks(x)
axes[1, 1].set_xticklabels(baseline_features, rotation=45, ha='right')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()


## Baseline Results and Summary

### Baseline Model Summary

**Model**: K-Means Clustering
- **Number of Clusters**: 4
- **Number of Features**: 4 (minimal set)
- **Features**: SessionsPerWeek, PlayTimeHours, InGamePurchases, login_frequency_per_week
- **Preprocessing**: StandardScaler

### Baseline Metrics

The values are computed in the evaluation cell and stored in `baseline_metrics`:

- **Silhouette Score**: `baseline_metrics['silhouette_score']`
- **Davies-Bouldin Index**: `baseline_metrics['davies_bouldin_index']`
- **Calinski-Harabasz Index**: `baseline_metrics['calinski_harabasz_index']`
- **Inertia**: `baseline_metrics['inertia']`
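
Markdown cells do not evaluate Python expressions, so the summary values have to be rendered from a code cell. One possible way, sketched here with a hypothetical `format_metrics_markdown` helper and placeholder numbers that are not results from this notebook, is to build the Markdown string and pass it to `IPython.display.Markdown`:

```python
def format_metrics_markdown(metrics: dict) -> str:
    """Build a Markdown bullet list from a metrics dict (hypothetical helper)."""
    rows = [
        ("Silhouette Score", "silhouette_score", ".4f"),
        ("Davies-Bouldin Index", "davies_bouldin_index", ".4f"),
        ("Calinski-Harabasz Index", "calinski_harabasz_index", ".2f"),
        ("Inertia", "inertia", ".2f"),
    ]
    return "\n".join(
        f"- **{label}**: {metrics[key]:{fmt}}" for label, key, fmt in rows
    )


# Placeholder values for demonstration only, NOT measured results.
placeholder_metrics = {
    "silhouette_score": 0.5,
    "davies_bouldin_index": 1.0,
    "calinski_harabasz_index": 100.0,
    "inertia": 1234.5,
}
summary_md = format_metrics_markdown(placeholder_metrics)
print(summary_md)

# In a notebook, the string could be rendered with:
# from IPython.display import Markdown, display
# display(Markdown(summary_md))
```

In the notebook itself, the `baseline_metrics` dictionary from the evaluation cell would be passed in place of the placeholder values.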

### Next Steps

1. **Feature Engineering**: Improve model performance by adding more features
2. **Feature Selection**: Select the most important features
3. **Model Optimization**: Try different clustering algorithms
4. **Hyperparameter Tuning**: Optimize the number of clusters and other parameters
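
For the third step, a comparison harness could score several algorithms on the same scaled matrix. The sketch below is one possible setup on synthetic blobs; the choice of `AgglomerativeClustering` and `GaussianMixture` as alternatives is an assumption for illustration, not a decision made in this notebook.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, scaled the same way as the baseline.
X_raw, _ = make_blobs(n_samples=600, centers=4, random_state=42)
X_demo = StandardScaler().fit_transform(X_raw)

candidates = {
    "kmeans": KMeans(n_clusters=4, n_init=10, random_state=42),
    "agglomerative": AgglomerativeClustering(n_clusters=4),
    "gmm": GaussianMixture(n_components=4, random_state=42),
}

comparison = {}
for name, model in candidates.items():
    labels_demo = model.fit_predict(X_demo)  # all three expose fit_predict
    comparison[name] = silhouette_score(X_demo, labels_demo)

# Rank candidates from highest to lowest silhouette score.
for name, score in sorted(comparison.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: silhouette={score:.3f}")
```

Scoring every candidate with the same metric on the same matrix keeps the comparison against the K-Means baseline like-for-like.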

### Notes

- The baseline model is a simple starting point
- A silhouette score above 0.3 is a good starting point
- Feature engineering is expected to yield better segmentation
