# Customer Complaint Categorization - Baseline Model

This notebook contains the baseline model for automatically predicting a category from customer complaint text.

## Baseline Model Strategy
A simple, fast starting model:
- **Feature extraction**: TF-IDF (Term Frequency-Inverse Document Frequency)
- **Model**: Logistic Regression (simple, interpretable, fast)
- **Preprocessing**: Basic text cleaning and tokenization
- **Validation**: Stratified K-Fold Cross Validation

## Goals
1. Obtain a baseline performance quickly
2. Measure model performance
3. Analyze performance per category
4. Establish a reference point for the next steps

## 1. Loading the Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.metrics import f1_score, precision_score, recall_score
import re
import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded successfully!")

## 2. Loading and Preparing the Dataset

In [None]:
# Load the dataset
df = pd.read_csv('../data/raw/customer_complaints_full.csv')

print(f"Dataset shape: {df.shape}")
print(f"Total number of categories: {df['complaint_category'].nunique()}")
print("\n=== Category Distribution ===")
print(df['complaint_category'].value_counts())

## 3. Text Preprocessing Functions

In [None]:
def clean_text(text):
    """
    Basic text cleaning: lowercase, keep letters only, normalize whitespace.
    """
    if pd.isna(text):
        return ""

    # Map dotted capital İ to plain i first: str.lower() turns it into
    # "i" + a combining dot (U+0307), which the regex below would replace
    # with a space and split the word in two
    text = str(text).replace('İ', 'i')

    # Convert to lowercase
    text = text.lower()

    # Remove special characters and digits (keep only letters and whitespace)
    text = re.sub(r'[^a-zA-ZğüşıöçĞÜŞIİÖÇ\s]', ' ', text)

    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)

    # Strip leading and trailing whitespace
    text = text.strip()

    return text

# Clean the texts
print("Cleaning texts...")
df['cleaned_text'] = df['complaint_text'].apply(clean_text)

# Example comparison
print("\n=== Text Cleaning Example ===")
for i in range(3):
    print(f"Original: {df.iloc[i]['complaint_text']}")
    print(f"Cleaned: {df.iloc[i]['cleaned_text']}")
    print("---")
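
Python's `str.lower()` is locale-independent, so it does not follow Turkish casing: `I` becomes `i` instead of `ı`, and `İ` expands into two code points. The sketch below shows one way to lowercase with Turkish rules. The helper name `turkish_lower` is hypothetical and not part of this notebook, and it assumes the complaint text is Turkish; English words such as "INTERNET" would come out as "ınternet".

```python
def turkish_lower(text: str) -> str:
    """Lowercase using Turkish rules for the dotted/dotless i pairs."""
    # Map the two Turkish capitals explicitly before the generic lowercase step
    return text.replace("İ", "i").replace("I", "ı").lower()


# Plain str.lower() expands "İ" into two code points: "i" + U+0307
assert len("İ".lower()) == 2

print(turkish_lower("IŞIK İSTANBUL"))
```

Whether to apply this depends on how much non-Turkish text the dataset contains, which is worth checking before swapping it into `clean_text`.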

## 4. Splitting the Dataset into Train/Test

In [None]:
# Separate the feature and target variables
X = df['cleaned_text']
y = df['complaint_category']

# Train/test split (80/20), stratified to preserve category proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train set size: {len(X_train):,}")
print(f"Test set size: {len(X_test):,}")

print("\n=== Train Set Category Distribution ===")
print(y_train.value_counts())

print("\n=== Test Set Category Distribution ===")
print(y_test.value_counts())

## 5. TF-IDF Feature Extraction

In [None]:
# TF-IDF Vectorizer
print("Starting TF-IDF feature extraction...")

# TF-IDF parameters
tfidf = TfidfVectorizer(
    max_features=5000,  # Keep at most 5000 features
    ngram_range=(1, 2),  # Unigrams and bigrams
    stop_words=None,  # No stop-word filtering; scikit-learn ships no Turkish list
    min_df=2,  # Must appear in at least 2 documents
    max_df=0.95,  # Must appear in at most 95% of documents
    sublinear_tf=True  # Use logarithmic TF: 1 + log(tf)
)

# Fit on the train set and transform it
X_train_tfidf = tfidf.fit_transform(X_train)

# Only transform the test set (no refitting)
X_test_tfidf = tfidf.transform(X_test)

print(f"TF-IDF matrix shape (train): {X_train_tfidf.shape}")
print(f"TF-IDF matrix shape (test): {X_test_tfidf.shape}")
print(f"Total number of features: {len(tfidf.get_feature_names_out())}")

# Show the first 20 features (alphabetical order, not ranked by importance)
feature_names = tfidf.get_feature_names_out()
print("\n=== First 20 TF-IDF Features ===")
print(list(feature_names[:20]))
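
To make the weighting concrete, the sketch below recomputes TF-IDF by hand with the `sublinear_tf=True` option and compares the result with `TfidfVectorizer`. It assumes scikit-learn's defaults for the remaining options (`smooth_idf=True`, `norm='l2'`). The toy corpus is hypothetical and unrelated to the complaint dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only to make the arithmetic visible
docs = ["fatura hatali fatura", "kargo gec geldi", "fatura gec geldi"]

counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Sublinear term frequency: 1 + ln(tf) for terms that occur, 0 otherwise
tf = np.where(counts > 0, 1.0 + np.log(np.maximum(counts, 1)), 0.0)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1.0

# Weight each term, then scale every document vector to unit L2 length
weights = tf * idf
manual = weights / np.linalg.norm(weights, axis=1, keepdims=True)

reference = TfidfVectorizer(sublinear_tf=True).fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

The repeated word in the first document contributes `1 + ln(2)` rather than `2`, which is the dampening that `sublinear_tf` is meant to provide.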

## 6. Baseline Model Training (Logistic Regression)

In [None]:
# Logistic Regression model
print("Training the baseline Logistic Regression model...")

# Model parameters
lr_model = LogisticRegression(
    random_state=42,
    max_iter=1000,
    solver='liblinear',  # Suitable for small datasets
    multi_class='ovr'  # One-vs-Rest (parameter is deprecated in recent scikit-learn releases)
)

# Train the model
lr_model.fit(X_train_tfidf, y_train)

print("Model training completed!")

# Train set performance (optimistic: the model has already seen this data)
y_train_pred = lr_model.predict(X_train_tfidf)
train_accuracy = accuracy_score(y_train, y_train_pred)
train_f1 = f1_score(y_train, y_train_pred, average='weighted')

print("\n=== Train Set Performance ===")
print(f"Accuracy: {train_accuracy:.4f}")
print(f"F1 Score (weighted): {train_f1:.4f}")

## 7. Model Evaluation with Cross Validation

In [None]:
# Cross Validation
print("Evaluating the model with cross validation...")

# 5-fold stratified cross validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Note: the TF-IDF vocabulary and IDF weights were fitted on the full train set,
# so every validation fold has already influenced the features and these
# scores may be slightly optimistic
cv_scores = cross_val_score(lr_model, X_train_tfidf, y_train, cv=cv, scoring='accuracy')
cv_f1_scores = cross_val_score(lr_model, X_train_tfidf, y_train, cv=cv, scoring='f1_weighted')

print("\n=== Cross Validation Results (5-Fold) ===")
print(f"Accuracy Scores: {cv_scores}")
print(f"Mean Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

print(f"\nF1 Scores (weighted): {cv_f1_scores}")
print(f"Mean F1 Score: {cv_f1_scores.mean():.4f} (+/- {cv_f1_scores.std() * 2:.4f})")
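
One way to keep the vectorizer from seeing validation folds is to wrap it and the classifier in a scikit-learn `Pipeline`, so that TF-IDF is refit on the training part of each fold. The sketch below assumes raw text as input and uses hypothetical toy data in place of `X_train` and `y_train`; the step names `tfidf` and `clf` are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for X_train / y_train
texts = [
    "fatura tutari hatali", "faturada fazla ucret var", "fatura iki kez kesildi",
    "fatura iadesi bekliyorum", "fatura tarihi yanlis", "fatura gelmedi",
    "kargo gec geldi", "kargo hasarli teslim edildi", "kargo hala gelmedi",
    "kargo yanlis adrese gitti", "kargo takip calismiyor", "kargo kayboldu",
]
labels = ["billing"] * 6 + ["shipping"] * 6

# The vectorizer is a pipeline step, so it is refit inside every fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
print(scores.mean())
```

With the real data, the same pattern would take `X_train` (cleaned text, not the precomputed matrix) and the 5-fold splitter defined in this section.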

## 8. Test Set Evaluation

In [None]:
# Test set predictions
y_test_pred = lr_model.predict(X_test_tfidf)

# Test set performance
test_accuracy = accuracy_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred, average='weighted')
test_precision = precision_score(y_test, y_test_pred, average='weighted')
test_recall = recall_score(y_test, y_test_pred, average='weighted')

print("=== Test Set Performance ===")
print(f"Accuracy: {test_accuracy:.4f}")
print(f"Precision (weighted): {test_precision:.4f}")
print(f"Recall (weighted): {test_recall:.4f}")
print(f"F1 Score (weighted): {test_f1:.4f}")

# Detailed classification report
print("\n=== Detailed Classification Report ===")
print(classification_report(y_test, y_test_pred))
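
The metrics above use `average='weighted'`, which weights each category by its support and can hide poor results on rare categories. The sketch below contrasts weighted and macro F1 on hypothetical imbalanced labels; the data is illustrative and not taken from the complaint dataset.

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: 8 "billing" vs 2 "shipping"
y_true = ["billing"] * 8 + ["shipping"] * 2
# A degenerate model that always predicts the majority class
y_pred = ["billing"] * 10

# zero_division=0 scores the never-predicted class as 0 without a warning
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"weighted F1: {weighted:.4f}, macro F1: {macro:.4f}")
```

Reporting macro F1 next to the weighted figure gives every category equal weight, so a category the model never predicts pulls the score down visibly.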

In [None]:
# Confusion Matrix
plt.figure(figsize=(12, 10))

# Pass the label order explicitly so matrix rows/columns match the tick labels
categories = sorted(df['complaint_category'].unique())
cm = confusion_matrix(y_test, y_test_pred, labels=categories)

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=categories, yticklabels=categories)
plt.title('Confusion Matrix - Test Set', fontsize=14, fontweight='bold')
plt.xlabel('Predicted Category')
plt.ylabel('True Category')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
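
Raw counts are dominated by large categories. As a possible complement, `confusion_matrix` accepts `normalize='true'`, which divides each row by its total so the diagonal shows per-category recall. The labels below are hypothetical toy data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 8 "billing" and 2 "shipping" complaints
y_true = ["billing"] * 8 + ["shipping"] * 2
y_pred = ["billing"] * 7 + ["shipping"] + ["shipping", "billing"]
labels = ["billing", "shipping"]

# Each row is divided by its sum, so rows add up to 1
cm_norm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(np.round(cm_norm, 3))
```

The normalized matrix can be passed to `sns.heatmap` the same way as the count matrix, with `fmt='.2f'` instead of `fmt='d'`.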

## 9. Per-Category Performance Analysis

In [None]:
# Compute per-category metrics
report_dict = classification_report(y_test, y_test_pred, output_dict=True)

# Per-category precision, recall, f1-score
category_metrics = []
for category in categories:
    if category in report_dict:
        metrics = report_dict[category]
        category_metrics.append({
            'Category': category,
            'Precision': metrics['precision'],
            'Recall': metrics['recall'],
            'F1-Score': metrics['f1-score'],
            'Support': metrics['support']
        })

category_df = pd.DataFrame(category_metrics)
category_df = category_df.sort_values('F1-Score', ascending=False)

print("=== Per-Category Performance (sorted by F1-Score) ===")
print(category_df.round(4))

In [None]:
# Visualize per-category performance
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# F1-Score plot
axes[0].barh(category_df['Category'], category_df['F1-Score'], color='lightcoral')
axes[0].set_title('F1-Score by Category', fontweight='bold')
axes[0].set_xlabel('F1-Score')
axes[0].invert_yaxis()

# Support (number of test examples) plot
axes[1].barh(category_df['Category'], category_df['Support'], color='lightblue')
axes[1].set_title('Number of Test Examples by Category', fontweight='bold')
axes[1].set_xlabel('Number of Test Examples')
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()

## 10. Most Important Model Features

In [None]:
# Find the most important features (for each category)
feature_names = tfidf.get_feature_names_out()
coefficients = lr_model.coef_

# Show the 10 most important features for each category
print("=== Top 10 Features for Each Category ===")

for i, category in enumerate(lr_model.classes_):
    # Take the coefficient row for this category
    category_coef = coefficients[i]
    
    # Find the strongest positive and negative features
    top_positive_idx = np.argsort(category_coef)[-10:][::-1]
    top_negative_idx = np.argsort(category_coef)[:10]
    
    print(f"\n--- {category} ---")
    print("Most positive features:")
    for idx in top_positive_idx:
        print(f"  {feature_names[idx]}: {category_coef[idx]:.4f}")
    
    print("Most negative features:")
    for idx in top_negative_idx:
        print(f"  {feature_names[idx]}: {category_coef[idx]:.4f}")
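
The ranking logic in the loop above can be factored into a small helper, sketched here. The name `top_terms` and the toy coefficients are hypothetical. The loop indexes `coef_` once per class, which holds for multiclass models; for a binary problem `coef_` has a single row, so that indexing would need adjusting.

```python
import numpy as np

def top_terms(coef_row, feature_names, k=10):
    """Return the k most positive and k most negative (term, weight) pairs."""
    order = np.argsort(coef_row)  # ascending by coefficient
    positive = [(str(feature_names[i]), float(coef_row[i])) for i in order[-k:][::-1]]
    negative = [(str(feature_names[i]), float(coef_row[i])) for i in order[:k]]
    return positive, negative

# Hypothetical coefficients for a single category
coef = np.array([0.1, -2.0, 3.0, 0.5])
names = np.array(["iade", "kargo", "fatura", "ucret"])

pos, neg = top_terms(coef, names, k=2)
print(pos)
print(neg)
```

In the notebook, this would be called as `top_terms(lr_model.coef_[i], feature_names)` inside the loop over `lr_model.classes_`.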

## 11. Model Prediction Examples

In [None]:
# Show predictions for random test examples
np.random.seed(42)
sample_indices = np.random.choice(len(X_test), 10, replace=False)

print("=== Test Examples and Predictions ===")

for i, idx in enumerate(sample_indices):
    # X_test holds the cleaned text, not the raw complaint
    sample_text = X_test.iloc[idx]
    true_label = y_test.iloc[idx]
    predicted_label = y_test_pred[idx]
    
    # Prediction probabilities
    proba = lr_model.predict_proba(X_test_tfidf[idx:idx+1])
    max_proba = proba.max()
    
    print(f"\n--- Example {i+1} ---")
    print(f"Text (cleaned): {sample_text}")
    print(f"True Category: {true_label}")
    print(f"Predicted: {predicted_label}")
    print(f"Confidence: {max_proba:.3f}")
    print(f"Correct Prediction: {'✓' if true_label == predicted_label else '✗'}")
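
The maximum probability alone hides how close the runner-up categories are. A minimal sketch for listing the `k` most probable categories from one row of `predict_proba` output follows. The helper `top_k_categories` and the probability values are hypothetical.

```python
import numpy as np

def top_k_categories(proba_row, classes, k=3):
    """Return the k most probable (category, probability) pairs, best first."""
    order = np.argsort(proba_row)[::-1][:k]
    return [(str(classes[i]), float(proba_row[i])) for i in order]

# Hypothetical class labels and one row of predicted probabilities
classes = np.array(["billing", "shipping", "technical"])
proba_row = np.array([0.15, 0.60, 0.25])

print(top_k_categories(proba_row, classes, k=2))
```

With the trained model, `proba[0]` and `lr_model.classes_` would take the place of the toy arrays.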

## 12. Baseline Model Summary and Results

### 📊 Model Performance Summary

This section holds the model performance results and analysis. 
Detailed results appear here once the notebook is run.

### 🎯 Key Findings

#### Strengths:
1. **Simple and fast**: The Logistic Regression + TF-IDF combination trains very quickly
2. **Interpretable**: Model decisions are easy to interpret through the per-category coefficients
3. **Average performance**: The baseline model shows acceptable performance
4. **Balanced performance**: Reasonable F1 scores for many categories

#### Weaknesses:
1. **Class imbalance**: Some categories have far fewer examples than others
2. **Confusable categories**: Categories with similar text get mixed up with each other
3. **Simple feature extraction**: TF-IDF relies only on term frequencies and ignores context and word order beyond bigrams
4. **Missing semantics**: It cannot fully capture the meaning of words

### 🔧 Improvement Suggestions

#### Short Term (Feature Engineering):
1. **Better text preprocessing**: Stemming, lemmatization
2. **N-gram variety**: Trigrams, character n-gram experiments
3. **Adding categorical features**: Such as priority level and customer age
4. **Sentiment analysis**: Analyzing the customer's sentiment

#### Medium Term (Model Development):
1. **Ensemble models**: Random Forest, XGBoost
2. **SVM and Naive Bayes**: Trying different algorithms
3. **Hyperparameter tuning**: Optimization with grid search
4. **Class balancing**: SMOTE, class weights
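
For the class-weights option, scikit-learn's "balanced" heuristic sets each weight to `n_samples / (n_classes * class_count)`. The sketch below computes these weights for hypothetical labels. Passing `class_weight='balanced'` to `LogisticRegression` applies the same rule during training.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 8 "billing" vs 2 "shipping"
y = np.array(["billing"] * 8 + ["shipping"] * 2)
classes = np.unique(y)

# balanced weight = n_samples / (n_classes * count of that class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes, weights)))
```

Here the rare class gets weight 10 / (2 * 2) = 2.5 and the frequent class 10 / (2 * 8) = 0.625, so errors on rare categories cost more in the loss.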

#### Long Term (Advanced NLP):
1. **Word embeddings**: Word2Vec, GloVe
2. **Deep learning**: LSTM, CNN, Transformers
3. **Pre-trained models**: BERT, Turkish BERT
4. **Transfer learning**: Using ready-made Turkish NLP models

### 🚀 Next Steps

1. **Feature Engineering Notebook**: Advanced feature extraction
2. **Model Optimization**: Hyperparameter tuning
3. **Model Evaluation**: Detailed analysis and comparison
4. **Final Pipeline**: Production-ready pipeline
5. **Deployment**: Web API and user interface

### 💡 Business Value

For small businesses, this baseline model offers:
- Automatic category prediction
- Automation of manual categorization work
- Faster customer service processes
- Shorter resolution times
- Cost savings

The model performs well enough as a starting point and can be improved further in the next steps.