# Final Pipeline

## Purpose

This notebook trains the final model using the final preprocessing strategy and presents the production-ready pipeline.

### Final Pipeline Summary
- **Feature Set**: [Final feature list]
- **Preprocessing**: StandardScaler, LabelEncoder, Feature Engineering
- **Model**: K-Means Clustering
- **Number of Clusters**: 4
- **Validation**: Silhouette Score, Davies-Bouldin Index
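
The two validation metrics named above can be computed directly with scikit-learn. The sketch below is illustrative only: it uses synthetic blobs from `make_blobs` as a stand-in for the scaled feature matrix, so the printed scores say nothing about the gaming dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix (assumption, not the real data).
X, _ = make_blobs(n_samples=400, centers=4, n_features=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)
demo_labels = kmeans.fit_predict(X_scaled)

# Silhouette lies in [-1, 1]; higher means tighter, better separated clusters.
sil = silhouette_score(X_scaled, demo_labels)
# Davies-Bouldin is non-negative; lower means better separation.
dbi = davies_bouldin_score(X_scaled, demo_labels)

print(f"silhouette_score: {sil:.3f}")
print(f"davies_bouldin_score: {dbi:.3f}")
```

The two metrics point in opposite directions (higher silhouette is better, lower Davies-Bouldin is better), which is worth keeping in mind when reading the metrics printed later in this notebook.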


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
import warnings
import sys
from pathlib import Path

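# Add the parent directory to sys.path so the `src` package is importable
sys.path.append(str(Path('..').resolve()))
# Import only the names this notebook uses instead of a wildcard import
from src.config import RAW_DATA_DIR, MODELS_DIR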
from src.data_loader import load_gaming_dataset
from src.pipeline import UserSegmentationPipeline

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported!")


## Data Loading


In [None]:
# Load dataset
df = load_gaming_dataset(RAW_DATA_DIR)
print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")


## Final Pipeline Training


In [None]:
# Create and train final pipeline
pipeline = UserSegmentationPipeline()
metrics = pipeline.fit(df)

print("=" * 60)
print("FINAL PIPELINE METRICS")
print("=" * 60)
for key, value in metrics.items():
    print(f"{key}: {value}")

# Save pipeline
pipeline.save()
print(f"\nPipeline saved to: {MODELS_DIR / 'user_segmentation_pipeline.pkl'}")


## Final Feature Set


In [None]:
print("=" * 60)
print("FINAL FEATURE SET")
print("=" * 60)
print(f"Total features: {len(pipeline.feature_names)}")
print("\nFeatures:")
for i, feature in enumerate(pipeline.feature_names, 1):
    print(f"{i:2d}. {feature}")


## Segment Distribution and Profiles


In [None]:
# Predict segments
labels = pipeline.predict(df)
df_final = df.copy()
df_final['cluster'] = labels

# Segment distribution
print("=" * 60)
print("SEGMENT DISTRIBUTION")
print("=" * 60)
segment_dist = df_final['cluster'].value_counts().sort_index()
for cluster_id, count in segment_dist.items():
    print(f"Cluster {cluster_id}: {count} users ({count/len(df_final)*100:.1f}%)")

# Visualize
plt.figure(figsize=(10, 6))
segment_dist.plot(kind='bar', color=['#3498db', '#e74c3c', '#2ecc71', '#f39c12'])
plt.xlabel('Cluster')
plt.ylabel('Number of Users')
plt.title('Final Segment Distribution')
plt.xticks(rotation=0)
plt.grid(True, alpha=0.3, axis='y')
plt.show()
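
The cell above covers the segment distribution. For the segment profiles, one common approach is to average each numeric feature per cluster and compare it with the overall mean. The sketch below assumes a frame shaped like `df_final` (a `cluster` column plus features); the data is synthetic and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_final; column names are hypothetical.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "cluster": rng.integers(0, 4, size=300),
    "sessions_per_week": rng.poisson(5, size=300),
    "avg_session_minutes": rng.normal(40, 10, size=300),
    "platform": rng.choice(["pc", "console", "mobile"], size=300),
})

# Mean of every numeric feature within each cluster.
profile = demo.groupby("cluster").mean(numeric_only=True)

# Relative deviation from the overall mean shows what distinguishes a segment.
overall = demo.drop(columns="cluster").mean(numeric_only=True)
relative = (profile - overall) / overall

print(profile.round(2))
print(relative.round(2))
```

In the notebook itself the same two lines would be applied to `df_final` instead of the synthetic `demo` frame.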


## Final Pipeline Documentation

### Final Pipeline Structure

**Preprocessing:**
1. Categorical encoding (LabelEncoder)
2. Missing value imputation (median)
3. Feature engineering (ratio features, aggregate features)
4. Variance threshold filtering
5. StandardScaler

**Model:**
- Algorithm: K-Means Clustering
- n_clusters: 4
- init: k-means++
- n_init: 10
- max_iter: 300

**Feature Set:**
- Total features: [Final count]
- Feature list: [List]
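
The preprocessing order and model settings listed above can be sketched with plain scikit-learn components. This is a minimal illustration under stated assumptions, not the actual `UserSegmentationPipeline` implementation: the data is synthetic, the column names (`platform`, `sessions`, `playtime_hours`, `constant_flag`) and the `hours_per_session` ratio feature are hypothetical, and `random_state` is added only to make the sketch reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic stand-in data; all column names are hypothetical.
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "platform": rng.choice(["pc", "console", "mobile"], size=n),
    "sessions": rng.integers(1, 50, size=n).astype(float),
    "playtime_hours": rng.gamma(2.0, 10.0, size=n),
    "constant_flag": np.ones(n),  # zero variance, expected to be filtered out
})
toy.loc[rng.choice(n, size=10, replace=False), "playtime_hours"] = np.nan

# 1. Categorical encoding
toy["platform"] = LabelEncoder().fit_transform(toy["platform"])
# 2. Missing value imputation (median)
toy["playtime_hours"] = toy["playtime_hours"].fillna(toy["playtime_hours"].median())
# 3. Feature engineering: a ratio feature
toy["hours_per_session"] = toy["playtime_hours"] / toy["sessions"]

# 4. Variance threshold, 5. StandardScaler, then K-Means with the listed settings
sketch = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),
    ("scale", StandardScaler()),
    ("kmeans", KMeans(n_clusters=4, init="k-means++", n_init=10,
                      max_iter=300, random_state=42)),
])
sketch_labels = sketch.fit_predict(toy)

kept = toy.columns[sketch.named_steps["variance"].get_support()]
print(f"Features kept: {list(kept)}")
print(f"Cluster sizes: {np.bincount(sketch_labels)}")
```

Scaling comes last before clustering because K-Means relies on Euclidean distance, so features on larger scales would otherwise dominate the cluster assignment.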

### Why This Model and Feature Set Were Chosen

1. **Model Selection**: K-Means was chosen for its interpretability and its fit with business needs.
2. **Feature Set**: The feature set that performed best in the evaluation notebook.
3. **Number of Clusters**: Four segments are optimal according to the business rules.

### Comparison with Baseline

- **Baseline Silhouette**: [Baseline score]
- **Final Silhouette**: [Final score]
- **Improvement**: [Difference] points

### Production Readiness

- Model saved
- Pipeline saved
- Inference script ready
- API ready
- Frontend ready
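
Before relying on a saved pipeline for inference, it can help to check that it reproduces the in-memory predictions after a save and load round trip. The sketch below assumes a plain pickle file, in line with the `.pkl` path printed earlier. It uses a stand-in scikit-learn pipeline on synthetic data and a hypothetical file name, since the internals of `pipeline.save()` are not shown in this notebook.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline fitted on synthetic data (assumption, not the real model).
X, _ = make_blobs(n_samples=300, centers=4, n_features=4, random_state=0)
model = Pipeline([
    ("scale", StandardScaler()),
    ("kmeans", KMeans(n_clusters=4, n_init=10, random_state=0)),
]).fit(X)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "demo_pipeline.pkl"  # hypothetical file name
    with model_path.open("wb") as f:
        pickle.dump(model, f)
    with model_path.open("rb") as f:
        restored = pickle.load(f)

# The restored pipeline should assign the same clusters as the original.
new_users = X[:5]
roundtrip_ok = np.array_equal(model.predict(new_users), restored.predict(new_users))
print(f"Restored pipeline reproduces predictions: {roundtrip_ok}")
```

Pickle files execute code when loaded, so an inference service should only load artifacts it produced itself, ideally with the same scikit-learn version used for training.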
