# MACHINE LEARNING CHALLENGE: ANALISIS RISIKO CUSTOMER E-COMMERCE INDONESIA

## 1. Latar Belakang
Sebuah perusahaan e-commerce lokal Indonesia sedang mengalami peningkatan jumlah customer baru. Namun, tim operasional menemukan masalah:
- Ada customer yang sering membatalkan order, mengakibatkan kerugian ongkir.
- Ada customer yang pembayaran COD-nya sering gagal, sehingga barang harus dikembalikan.
- Beberapa customer melakukan return berulang kali dengan alasan yang tidak jelas.

## 2. Goals
Tujuan utama adalah membangun model machine learning untuk memprediksi perilaku risiko customer berdasarkan histori transaksi dan interaksi mereka. Model ini akan digunakan untuk:
- Menentukan kelayakan COD.
- Menentukan order yang perlu verifikasi manual.
- Menyusun strategi retensi.

---

In [29]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import warnings
warnings.filterwarnings('ignore')

# Preprocessing & Model Selection
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

# Imbalance Handling
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

# Metrics
from sklearn.metrics import (f1_score, recall_score, precision_score, roc_auc_score, 
                             confusion_matrix, classification_report, accuracy_score, make_scorer)

# Display Options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
sns.set_style("whitegrid")

## 1. Data Understanding
Insight awal dari dataset: distribusi variabel, missing value, outlier, dan korelasi.

In [30]:
df = pd.read_csv('Customer Risk Dataset.csv')

print("Dataset Shape:", df.shape)
print("\nInfo Dataset:")
df.info()
print("\nMissing Values:")
print(df.isnull().sum())
print("\nStatistik Deskriptif:")
display(df.describe())

print("\nSample cancel_rate values (raw):")
print(df['cancel_rate'].value_counts().head(10))

Dataset Shape: (10214, 13)

Info Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10214 entries, 0 to 10213
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   customer_id        10214 non-null  object 
 1   age                7627 non-null   object 
 2   registration_date  10214 non-null  object 
 3   city               10214 non-null  object 
 4   total_orders       5134 non-null   float64
 5   cancel_rate        7619 non-null   object 
 6   return_count       10214 non-null  object 
 7   cod_failed         8567 non-null   object 
 8   avg_order_value    10214 non-null  float64
 9   last_purchase      10214 non-null  object 
 10  device             10214 non-null  object 
 11  complaints         8123 non-null   object 
 12  risk_flag          4110 non-null   float64
dtypes: float64(3), object(10)
memory usage: 1.0+ MB

Missing Values:
customer_id             0
age                  2587
registr

Unnamed: 0,total_orders,avg_order_value,risk_flag
count,5134.0,10214.0,4110.0
mean,23.321582,51304310.0,0.503163
std,15.267024,48766010.0,0.500051
min,-3.0,1937.06,0.0
25%,10.0,2494366.0,0.0
50%,23.0,100000000.0,1.0
75%,37.0,100000000.0,1.0
max,49.0,100000000.0,1.0



Sample cancel_rate values (raw):
cancel_rate
0,2                   2584
20%                   2535
8.745310152178726        1
125.7689757796603        1
63.75177707438059        1
82.68268528447099        1
1.0350375942248358       1
73.39262538295903        1
8.129820480415173        1
60.33476790016527        1
Name: count, dtype: int64


## 2. Menentukan Analisis (Regresi/Klasifikasi)

**Problem Statement Final:**
Memprediksi apakah seorang customer berpotensi "Berisiko" (Gagal COD/Sering Cancel/Return) atau "Tidak Berisiko".

**Jenis Analisis: Klasifikasi**
**Alasan:**
1. Output yang diinginkan adalah label kategori biner: Berisiko (1) atau Tidak Berisiko (0).
2. Bisnis perlu keputusan tegas (Ya/Tidak) untuk menyetujui fitur COD atau melakukan verifikasi manual.
3. Meskipun kita bisa memprediksi probabilitas (skor risiko), pada akhirnya operasional butuh threshold untuk mengambil tindakan.

**Penanganan Label:**
Kolom `risk_flag` sudah tersedia sebagai label. Jika label ini kotor (missing), kita akan melakukan imputasi (mengisi dengan 0/Safe) atau membuang baris tersebut. Dalam kasus ini, kita asumsikan missing label berarti belum ada insiden risiko, jadi kita isi dengan 0.

## 3. Menentukan Metrics + Alasan Memilihnya

**Metric Utama: Recall (Sensitivitas)**

**Alasan:**
- **Kasus Bisnis**: Kita ingin mendeteksi customer yang berpotensi merugikan (COD gagal bayar).
- **False Negative (FN)**: Model memprediksi Aman (0), padahal aslinya Berisiko (1). 
  - *Dampak*: Perusahaan mengirim barang COD -> Customer tidak bayar -> Perusahaan rugi Ongkir + Biaya Retur + Potensi Barang Rusak.
- **False Positive (FP)**: Model memprediksi Berisiko (1), padahal aslinya Aman (0).
  - *Dampak*: Customer baik ditolak COD-nya -> Potensi kehilangan penjualan (Opportunity Cost).

Dalam konteks manajemen risiko, **meminimalkan kerugian nyata (FN)** biasanya lebih prioritas daripada kehilangan potensi untung (FP). Oleh karena itu, kita ingin **Recall setinggi mungkin** untuk kelas positif (Risk).

Namun, kita juga akan memantau **F1-Score** dan **ROC-AUC** untuk memastikan model tidak sekadar memprediksi semua orang sebagai risiko (yang akan membuat Recall 100% tapi Precision hancur).

## 4. Data Cleaning
Meliputi perbaikan typo, standardisasi kategori, penanganan missing value, dan perbaikan format.

In [31]:
df_clean = df.copy()

# --- 1. Age Cleaning ---
num_words = {
    "nol": 0, "satu": 1, "dua": 2, "tiga": 3, "empat": 4,
    "lima": 5, "enam": 6, "tujuh": 7, "delapan": 8,
    "sembilan": 9, "sepuluh": 10, "sebelas": 11
}

def text_to_number(text):
    if pd.isna(text):
        return np.nan
    text = str(text).lower().strip()
    digits = re.findall(r'\d+', text)
    if digits:
        return int(digits[0])
    parts = text.split()
    if "puluh" in parts:
        idx = parts.index("puluh")
        tens = num_words.get(parts[idx - 1], 0) * 10
        if idx + 1 < len(parts):
            return tens + num_words.get(parts[idx + 1], 0)
        return tens
    if "belas" in text:
        return num_words.get(parts[0], 0) + 10
    if text in num_words:
        return num_words[text]
    return np.nan

df_clean["age"] = df_clean["age"].apply(text_to_number)
df_clean["age"] = df_clean["age"].fillna(df_clean["age"].median()).astype(int)

# --- 2. Date Parsing ---
def parse_date(date_str):
    if pd.isna(date_str) or date_str == '':
        return pd.NaT
    formats = ['%B %d %Y', '%d-%m-%y', '%Y/%m/%d', '%m/%d/%Y', '%b %d %Y']
    date_str = str(date_str).strip()
    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except:
            pass
    try:
        return pd.to_datetime(date_str)
    except:
        return pd.NaT

df_clean["registration_date"] = df_clean["registration_date"].map(parse_date)
df_clean["last_purchase"] = df_clean["last_purchase"].map(parse_date)

# --- 3. City Standardization ---
city_mapping = {
    'jkt': 'Jakarta', 'jakarta': 'Jakarta', 'jakrta': 'Jakarta',
    'jakrta sel.': 'Jakarta', 'jakarta selatan': 'Jakarta',
    'bandung': 'Bandung', 'surabaya': 'Surabaya'
}
df_clean['city'] = df_clean['city'].astype(str).str.lower().str.strip().map(
    lambda x: city_mapping.get(x, x.title() if x != "nan" else "Unknown")
)

# --- 4. Numeric Cleaning (Total Orders & Avg Order Value) ---
def clean_numeric(x):
    if pd.isna(x) or x == "": 
        return 0
    try:
        return float(x)
    except:
        return 0

df_clean["total_orders"] = df_clean["total_orders"].apply(clean_numeric).abs().astype(int)

df_clean["avg_order_value"] = (
    df_clean["avg_order_value"]
    .astype(str)
    .str.replace(",", "")
    .apply(clean_numeric)
)
df_clean.loc[df_clean["avg_order_value"] > 1e8, "avg_order_value"] = 0

# --- 5. CANCEL RATE (Comprehensive Cleaning) ---
def clean_cancel_rate(val):
    if pd.isna(val): 
        return 0.0
    
    val_str = str(val).strip().replace(" ", "").replace("%", "")
    
    has_comma = ',' in val_str
    has_dot = '.' in val_str
    
    if has_comma and has_dot:
        comma_pos = val_str.rfind(',')
        dot_pos = val_str.rfind('.')
        
        if comma_pos > dot_pos:
            val_str = val_str.replace('.', '').replace(',', '.')
        else:
            val_str = val_str.replace(',', '')
    
    elif has_comma and not has_dot:
        val_str = val_str.replace(',', '.')
    
    try:
        num = float(val_str)
    except:
        return 0.0
    
    num = abs(num)
    
    if num > 10000:
        str_num = str(int(num))
        length = len(str_num)
        num = num / (10 ** (length - 2))
    
    elif num > 100:
        while num > 100:
            num /= 10
    
    elif 0 < num <= 1:
        num = num * 100
    
    if num > 100:
        num = 100.0
    
    return round(num, 2)

df_clean["cancel_rate"] = df_clean["cancel_rate"].apply(clean_cancel_rate)

# --- 6. Categorical & Binary Cleaning ---
return_map = {'dua': 2, 'tiga': 3, 'tiga puluh': 30, 'satu': 1}
df_clean['return_count'] = (
    df_clean['return_count']
    .astype(str)
    .str.lower()
    .map(lambda x: return_map.get(x, x))
    .apply(clean_numeric)
    .astype(int)
)

def clean_cod(x):
    s = str(x).lower().strip()
    if s in ['1', 'true', 'ya', 'yes']: 
        return 1
    if s in ['0', 'false', 'tidak', 'no']: 
        return 0
    return np.nan

df_clean['cod_failed'] = df_clean['cod_failed'].apply(clean_cod)

# COD Failed Imputation (after cancel_rate cleaning)
mask_nan = df_clean['cod_failed'].isna()
df_clean.loc[mask_nan & (df_clean['cancel_rate'] > 25), 'cod_failed'] = 1
df_clean.loc[mask_nan, 'cod_failed'] = 0
df_clean['cod_failed'] = df_clean['cod_failed'].astype(int)

# --- 7. Device ---
device_mapping = {
    'andriod': 'Android', 'android': 'Android',
    'i-phone': 'iOS', 'iphone': 'iOS', 'ios': 'iOS',
    'desktop': 'Desktop'
}

df_clean['device'] = (
    df_clean['device']
    .astype(str)
    .str.lower()
    .str.strip()
    .map(lambda x: device_mapping.get(x, x) if x != "nan" else np.nan)
)

df_clean['complaints'] = df_clean['complaints'].apply(
    lambda x: 1 if str(x).strip() not in ['0', 'nan', ''] else 0
)

# --- 8. RISK FLAG (Impute from cod_failed) ---
df_clean['risk_flag'] = df_clean['risk_flag'].fillna(df_clean['cod_failed']).astype(int)

print(f"Cancel rate range: {df_clean['cancel_rate'].min():.2f} - {df_clean['cancel_rate'].max():.2f}")
print(df_clean['cancel_rate'].describe())
print(f"\nRisk flag distribution after imputation:")
print(df_clean['risk_flag'].value_counts())
print(f"\nData Cleaning Selesai.")

Cancel rate range: 0.00 - 99.99
count    10214.000000
mean        19.500922
std         20.004446
min          0.000000
25%          0.000000
50%         20.000000
75%         20.000000
max         99.990000
Name: cancel_rate, dtype: float64

Risk flag distribution after imputation:
risk_flag
0    6048
1    4166
Name: count, dtype: int64

Data Cleaning Selesai.


## 5. Data Preprocessing
Encoding, Scaling, Feature Engineering, dan Train-Test Split.

In [32]:
# Feature Engineering: Durasi Hari
ref_date = df_clean['last_purchase'].max()
df_clean['days_registered'] = (ref_date - df_clean['registration_date']).dt.days.fillna(0)
df_clean['recency'] = (ref_date - df_clean['last_purchase']).dt.days.fillna(0)

# Drop unused columns
X = df_clean.drop(['customer_id', 'risk_flag', 'registration_date', 'last_purchase'], axis=1)
y = df_clean['risk_flag']

# Define Features
cat_cols = ['city', 'device']
num_cols = [c for c in X.columns if c not in cat_cols]

# Preprocessing Pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
    ]
)

# Train-Test Split (80:20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Train Shape:", X_train.shape)
print("Test Shape:", X_test.shape)

Train Shape: (8171, 11)
Test Shape: (2043, 11)


## 6. Model Benchmarking dengan Cross-Validation
Kita akan menggunakan 5-Fold Cross Validation untuk menguji beberapa model.
Metric yang dilihat: **Recall** (untuk meminimalkan False Negative).

In [33]:
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42, eval_metric='logloss')
}

results_cv = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("Benchmarking Models (Metric: Recall)...")
for name, model in models.items():
    # Pipeline with SMOTE to handle imbalance during CV
    pipeline = ImbPipeline([
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('classifier', model)
    ])
    
    scores = cross_val_score(pipeline, X_train, y_train, cv=skf, scoring='recall')
    results_cv.append({
        'Model': name,
        'Recall Mean': scores.mean(),
        'Recall Std': scores.std()
    })

df_cv = pd.DataFrame(results_cv).sort_values('Recall Mean', ascending=False)
display(df_cv)

Benchmarking Models (Metric: Recall)...


Unnamed: 0,Model,Recall Mean,Recall Std
1,KNN,0.697579,0.016875
0,Logistic Regression,0.67538,0.023769
3,Random Forest,0.67328,0.025885
4,XGBoost,0.66818,0.030516
2,Decision Tree,0.603969,0.021042


## 7. Pilih Top 3 Model
Berdasarkan skor Recall rata-rata dari Cross-Validation di atas, kita akan memilih 3 model terbaik untuk tahap selanjutnya.

In [34]:
top_3_models = df_cv.head(3)['Model'].tolist()
print("Top 3 Models:", top_3_models)

Top 3 Models: ['KNN', 'Logistic Regression', 'Random Forest']


## 8. Penanganan Imbalance Learning
Kita menggunakan **SMOTE (Synthetic Minority Over-sampling Technique)**.

**Alasan:**
Dataset risiko biasanya tidak seimbang (jumlah customer berisiko jauh lebih sedikit daripada customer aman). Jika tidak ditangani, model akan cenderung bias ke kelas mayoritas (memprediksi semua aman), yang akan menghasilkan banyak False Negative. SMOTE membantu menyeimbangkan kelas dengan membuat data sintetis untuk kelas minoritas saat training.

## 9. Hyperparameter Tuning
Kita akan melakukan tuning pada Top 3 model menggunakan `RandomizedSearchCV` atau `GridSearchCV` untuk mencari parameter optimal.

In [35]:
# Parameter Grids untuk Top 3
param_grids = {
    'Random Forest': {
        'classifier__n_estimators': [100, 200],
        'classifier__max_depth': [10, 20, None],
        'classifier__min_samples_split': [2, 5]
    },
    'XGBoost': {
        'classifier__n_estimators': [100, 200],
        'classifier__learning_rate': [0.01, 0.1],
        'classifier__max_depth': [3, 6]
    },
    'Decision Tree': {
        'classifier__max_depth': [5, 10, None],
        'classifier__min_samples_leaf': [1, 2, 4]
    },
    'Logistic Regression': {
        'classifier__C': [0.1, 1, 10]
    },
    'KNN': {
        'classifier__n_neighbors': [3, 5, 7]
    }
}

best_estimators = {}
tuned_results = []

print("Tuning Hyperparameters...")
for name in top_3_models:
    model = models[name]
    pipeline = ImbPipeline([
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('classifier', model)
    ])
    
    search = RandomizedSearchCV(
        pipeline, 
        param_distributions=param_grids.get(name, {}), 
        n_iter=5, 
        cv=3, 
        scoring='recall', 
        random_state=42, 
        n_jobs=-1
    )
    
    search.fit(X_train, y_train)
    best_estimators[name] = search.best_estimator_
    tuned_results.append({
        'Model': name,
        'Best Recall CV': search.best_score_
    })
    
    # Print parameters dengan jelas
    print(f"\n{name} tuned:")
    print(f"  Best Recall CV: {search.best_score_:.4f}")
    print(f"  Best Parameters:")
    for param, value in search.best_params_.items():
        print(f"    - {param}: {value}")

print("\n" + "="*60)
print("SUMMARY - Best Recall CV Scores:")
print("="*60)
df_tuned = pd.DataFrame(tuned_results).sort_values('Best Recall CV', ascending=False)
display(df_tuned)

Tuning Hyperparameters...

KNN tuned:
  Best Recall CV: 0.7015
  Best Parameters:
    - classifier__n_neighbors: 7

Logistic Regression tuned:
  Best Recall CV: 0.6754
  Best Parameters:
    - classifier__C: 0.1

Random Forest tuned:
  Best Recall CV: 0.6763
  Best Parameters:
    - classifier__n_estimators: 100
    - classifier__min_samples_split: 2
    - classifier__max_depth: None

SUMMARY - Best Recall CV Scores:


Unnamed: 0,Model,Best Recall CV
0,KNN,0.70147
2,Random Forest,0.676268
1,Logistic Regression,0.675368


## 10. Testing ke Data Test & 11. Komparasi Before vs After
Kita akan mengevaluasi model sebelum dan sesudah tuning pada data test (yang tidak dilihat saat training/tuning).

In [36]:
final_comparison = []

for name in top_3_models:
    # Before Tuning (Default)
    default_pipe = ImbPipeline([
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('classifier', models[name])
    ])
    default_pipe.fit(X_train, y_train)
    y_pred_def = default_pipe.predict(X_test)
    recall_def = recall_score(y_test, y_pred_def)
    
    # After Tuning
    tuned_model = best_estimators[name]
    y_pred_tuned = tuned_model.predict(X_test)
    recall_tuned = recall_score(y_test, y_pred_tuned)
    
    final_comparison.append({
        'Model': name,
        'Recall Default': recall_def,
        'Recall Tuned': recall_tuned,
        'Improvement': recall_tuned - recall_def
    })

df_comparison = pd.DataFrame(final_comparison)
display(df_comparison)

Unnamed: 0,Model,Recall Default,Recall Tuned,Improvement
0,KNN,0.711885,0.702281,-0.009604
1,Logistic Regression,0.661465,0.661465,0.0
2,Random Forest,0.656663,0.656663,0.0


## 12. Final Best Model Decision

Berdasarkan hasil komparasi di atas, kita memilih model terbaik.

**Kriteria Pemilihan:**
1. **Recall Tertinggi**: Prioritas utama untuk menangkap sebanyak mungkin customer berisiko.
2. **Stabilitas**: Perbedaan antara skor Train/CV dan Test tidak terlalu jauh (tidak overfitting).
3. **Kompleksitas**: Jika performa mirip, pilih model yang lebih sederhana (misal: Logistic Regression/Decision Tree vs Random Forest).

*(Isi bagian ini secara manual setelah melihat output tabel di atas saat running)*

## 13. Limitasi Model
1. **Data Noise**: Masih ada kemungkinan kesalahan input manual pada data historis yang mempengaruhi kualitas prediksi.
2. **Imbalance Ekstrem**: Meskipun sudah di-SMOTE, jika rasio kelas sangat timpang, model mungkin masih menghasilkan False Positive yang cukup tinggi demi mengejar Recall.
3. **Perubahan Perilaku**: Model dilatih pada data masa lalu. Jika perilaku penipuan berubah (pola baru), model perlu di-retrain secara berkala.
4. **Fitur Terbatas**: Kita hanya menggunakan fitur transaksi dasar. Fitur perilaku (klik, durasi sesi) mungkin bisa meningkatkan akurasi.

## 14. Impact Model ke Bisnis

**Simulasi Dampak:**
Misalkan tanpa model, kita meloloskan semua COD. 
- Total Kerugian = (Jumlah Bad Customer) x (Rata-rata Ongkir + Handling Cost).

Dengan model (asumsi Recall 80%):
- Kita berhasil mencegah 80% dari potensi kerugian tersebut.
- **Efisiensi Operasional**: Tim CS tidak perlu memverifikasi semua order, cukup fokus pada yang diprediksi "Berisiko" oleh model.
- **Keputusan Bisnis**: 
  - Jika Prediksi = Risiko Tinggi -> Matikan opsi COD, wajibkan Transfer Bank.
  - Jika Prediksi = Aman -> Lanjutkan proses otomatis.