# Customer Churn Prediction - Nusantara Retail

## 🛒 Background
The e-commerce company **Nusantara Retail** is alarmed: last month's churn rate rose to **15%**, against an industry average of only 10%. This is the highest figure in the company's history.

### Financial Problem:
- **Acquisition Cost (CAC)**: Rp 250,000 per new customer
- **Retention Cost**: Rp 50,000 per existing customer (5x cheaper than acquisition!)
- **Loss**: Rp 10 billion projected to be lost this year because customers are leaving
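
The "5x cheaper" figure follows directly from the two unit costs listed above. The sketch below reproduces that arithmetic and adds a rough break-even check. The break-even framing (a lost customer has to be replaced at CAC) is a simplifying assumption made here for illustration, not something stated by the company, and the variable names are hypothetical.

```python
# Unit costs from the list above (in rupiah).
CAC = 250_000             # cost to acquire one new customer
RETENTION_COST = 50_000   # cost to retain one existing customer

# How many times cheaper retention is than acquisition.
cost_ratio = CAC / RETENTION_COST
print(f"Retention is {cost_ratio:.0f}x cheaper than acquisition")

# ASSUMPTION (illustrative only): a lost customer must be replaced at CAC,
# so a retention offer breaks even when p_success * CAC == RETENTION_COST.
break_even_success_rate = RETENTION_COST / CAC
print(f"Break-even success rate: {break_even_success_rate:.0%}")
```

Under that assumption, a retention offer would pay for itself as long as it keeps at least one in five of the customers it targets.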

### Project Goal:
Build an ML model that can **predict which customers will churn** within the next 30 days, so the marketing team can offer vouchers or promos before they leave!

### Dataset:
- **Tenure**: How long the customer has been with us (months)
- **Total_Belanja_3bln**: Total spending over the last 3 months (million rupiah)
- **Hari_Terakhir_Login**: Days since the last login
- **Sesi_Per_Bulan**: Average number of visits per month
- **Jumlah_Tiket_Komplain**: Number of complaints filed with customer service (CS)
- **Skor_Kepuasan**: Satisfaction survey score (1-10)
- **Item_Wishlist**: Number of items in the wishlist
- **Churn**: 0 = Loyal, 1 = At risk of churn

In [38]:
# Library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')

In [39]:
# Load Data

df = pd.read_csv('customer_churn.csv')
df.head(10)

,Tenure,Total_Belanja_3bln,Hari_Terakhir_Login,Sesi_Per_Bulan,Jumlah_Tiket_Komplain,Skor_Kepuasan,Item_Wishlist,Churn
0,34.1,18.57,31.45,67.04,5.89,4.98,18.64,1
1,53.14,33.17,47.51,59.71,8.17,3.57,7.24,0
2,42.23,30.98,45.8,46.99,6.49,5.5,18.05,0
3,30.78,27.27,47.73,60.87,7.11,5.28,14.68,0
4,33.43,14.5,42.93,51.55,7.43,5.55,24.38,1
5,15.14,18.62,58.72,47.55,2.15,6.17,30.64,0
6,32.38,25.14,38.31,58.0,6.32,4.67,29.28,1
7,4.0,30.93,55.8,78.1,7.09,1.83,49.9,1
8,35.92,20.96,26.12,72.94,5.83,2.62,35.5,1
9,15.7,23.42,63.43,38.97,5.15,6.57,33.75,0


In [40]:
# Dataset info

print('Data shape:', df.shape)
print('\nData info:')
df.info()

Data shape: (1500, 8)

Data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Tenure                 1500 non-null   float64
 1   Total_Belanja_3bln     1500 non-null   float64
 2   Hari_Terakhir_Login    1500 non-null   float64
 3   Sesi_Per_Bulan         1500 non-null   float64
 4   Jumlah_Tiket_Komplain  1500 non-null   float64
 5   Skor_Kepuasan          1500 non-null   float64
 6   Item_Wishlist          1500 non-null   float64
 7   Churn                  1500 non-null   int64  
dtypes: float64(7), int64(1)
memory usage: 93.9 KB


In [41]:
# Check the Churn distribution

print('Churn distribution:')
print(df['Churn'].value_counts())
print('\nPercentage:')
print(df['Churn'].value_counts(normalize=True)*100)

Churn distribution:
Churn
0    751
1    749
Name: count, dtype: int64

Percentage:
Churn
0    50.066667
1    49.933333
Name: proportion, dtype: float64
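
With the classes split 751 to 749, a classifier that always predicts the majority class would already score about 50% accuracy. The small sketch below computes that baseline from the counts printed above. The variable names are hypothetical and only the counts come from the notebook.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 751, 1: 749}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total * 100
print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.2f}%")
```

Any model trained on this data should be judged against this baseline rather than against 0%.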


In [42]:
# Descriptive statistics

df.describe()

,Tenure,Total_Belanja_3bln,Hari_Terakhir_Login,Sesi_Per_Bulan,Jumlah_Tiket_Komplain,Skor_Kepuasan,Item_Wishlist,Churn
count,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0
mean,28.112147,22.608947,42.888927,53.765707,5.59036,5.66926,26.16928,0.499333
std,10.535947,6.867743,14.350893,13.205835,1.525598,1.431702,8.385225,0.500166
min,1.0,0.5,0.0,1.0,0.0,1.0,0.0,0.0
25%,19.8575,18.1175,32.475,45.015,4.73,4.65,21.0575,0.0
50%,29.435,22.43,40.95,54.16,5.79,5.8,26.95,0.0
75%,36.0425,27.0675,52.075,62.9825,6.65,6.7,32.02,1.0
max,60.0,50.0,90.0,100.0,10.0,10.0,50.0,1.0


# Train Test Split
We split the data into 80% training and 20% testing.

In [43]:
# Split into X (features) and y (target)

x = df.drop('Churn', axis=1)
y = df['Churn']

print('X shape:', x.shape)
print('Y shape:', y.shape)

X shape: (1500, 7)
Y shape: (1500,)


In [44]:
# Train Test Split

xtrain, xtest, ytrain, ytest = train_test_split(
    x,
    y,
    test_size=0.2,
    random_state=123,
    stratify=y
)

print('Training set:', xtrain.shape)
print('Testing set:', xtest.shape)

Training set: (1200, 7)
Testing set: (300, 7)


# Model 1: Logistic Regression
Start with the simplest model.

In [45]:
# Training Logistic Regression

logreg = LogisticRegression()
logreg.fit(xtrain, ytrain)

In [46]:
# LogReg predictions and accuracy

pred_logreg = logreg.predict(xtest)
acc_logreg = accuracy_score(ytest, pred_logreg) * 100

print(f'Logistic Regression accuracy: {round(acc_logreg, 2)}%')

Logistic Regression accuracy: 87.33%
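
The features sit on different scales (for example, `Sesi_Per_Bulan` ranges up to 100 while `Jumlah_Tiket_Komplain` ranges up to 10), and warnings are suppressed in this notebook, so a convergence warning from `LogisticRegression()` would not be visible. One common option is to standardize the features inside a pipeline. The sketch below shows the pattern on synthetic stand-in data generated with `make_classification`, since `customer_churn.csv` is not reproduced here. The `*_demo` names and `max_iter=1000` are assumptions, and whether this would change the 87.33% above has not been tested.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in data (1500 rows, 7 features), not customer_churn.csv.
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=7, n_informative=5, n_redundant=0, random_state=123
)

# Same split settings as the notebook: 80/20, stratified, random_state=123.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

# The scaler is fitted on the training fold only, then applied to the test fold.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)

acc_demo = accuracy_score(y_te, scaled_logreg.predict(X_te)) * 100
print(f"Accuracy on the synthetic data: {acc_demo:.2f}%")
```

Keeping the scaler inside the pipeline means the test rows never influence the scaling statistics.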


# Model 2: K-Nearest Neighbors (KNN)
Next, try KNN. First we search for the best K.

In [47]:
# Find the best K for KNN
# Note: K is chosen by accuracy on the test set, so the best score below is an optimistic estimate.

print('Searching for the best K for KNN...')
k_values = range(3, 31, 2)
acc_list = []
best_k = 0
best_score = 0

for k in k_values:
    # Build a KNN model with K = k
    knn = KNeighborsClassifier(n_neighbors=k)

    # Train the model
    knn.fit(xtrain, ytrain)

    # Predict
    pred_knn = knn.predict(xtest)

    # Compute accuracy
    acc = accuracy_score(ytest, pred_knn)
    acc_list.append(acc)

    print(f'K={k}, Accuracy = {round(acc*100, 2)}%')

    # Check whether this is the best accuracy so far
    if acc > best_score:
        best_score = acc
        best_k = k

print(f'\n✅ Best K is K={best_k} with accuracy {round(best_score*100, 2)}%')

Searching for the best K for KNN...
K=3, Accuracy = 84.67%
K=5, Accuracy = 85.33%
K=7, Accuracy = 85.67%
K=9, Accuracy = 86.67%
K=11, Accuracy = 88.0%
K=13, Accuracy = 88.33%
K=15, Accuracy = 87.33%
K=17, Accuracy = 87.33%
K=19, Accuracy = 87.67%
K=21, Accuracy = 87.67%
K=23, Accuracy = 88.33%
K=25, Accuracy = 89.0%
K=27, Accuracy = 89.33%
K=29, Accuracy = 89.0%

✅ Best K is K=27 with accuracy 89.33%


In [48]:
# Final KNN model with the best K

knn_final = KNeighborsClassifier(n_neighbors=best_k)
knn_final.fit(xtrain, ytrain)
pred_knn_final = knn_final.predict(xtest)
acc_knn = accuracy_score(ytest, pred_knn_final) * 100

print(f'KNN accuracy (K={best_k}): {round(acc_knn, 2)}%')

KNN accuracy (K=27): 89.33%
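
The K search above scores every candidate on the test set, so the test set ends up both selecting K and reporting the final accuracy. A common alternative is to pick K with cross-validation on the training data only and touch the test set once at the end. The sketch below illustrates this with `GridSearchCV` on synthetic stand-in data. The 5-fold setting and all `*_demo` and `*_cv` names are assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# ASSUMPTION: synthetic stand-in data (1500 rows, 7 features), not customer_churn.csv.
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=7, n_informative=5, n_redundant=0, random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

# Same candidate grid as the notebook's loop: odd K from 3 to 29.
param_grid = {"n_neighbors": list(range(3, 31, 2))}

# 5-fold cross-validation on the training data only; the test set is not seen here.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

best_k_cv = search.best_params_["n_neighbors"]
cv_acc = search.best_score_ * 100

# The test set is used exactly once, after K has been fixed.
test_acc = search.score(X_te, y_te) * 100

print(f"Best K by cross-validation: {best_k_cv}")
print(f"Mean CV accuracy: {cv_acc:.2f}%")
print(f"Test accuracy: {test_acc:.2f}%")
```

With this setup the test accuracy stays an unbiased check, because it played no part in choosing K.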


# Model 3: Support Vector Machine (SVM)
Finally, try SVC.

In [49]:
# Training SVC

svc = SVC()
svc.fit(xtrain, ytrain)

In [50]:
# SVC predictions and accuracy

pred_svc = svc.predict(xtest)
acc_svc = accuracy_score(ytest, pred_svc) * 100

print(f'SVC accuracy: {round(acc_svc, 2)}%')

SVC accuracy: 89.0%


# Comparison of the Three Models
Now we compare the three models by accuracy.

In [51]:
# Accuracy comparison table

hasil = pd.DataFrame({
    'Model': ['Logistic Regression', 'KNN (K=' + str(best_k) + ')', 'SVC'],
    'Akurasi (%)': [round(acc_logreg, 2), round(acc_knn, 2), round(acc_svc, 2)]
})

hasil = hasil.sort_values('Akurasi (%)', ascending=False).reset_index(drop=True)
hasil

,Model,Akurasi (%)
0,KNN (K=27),89.33
1,SVC,89.0
2,Logistic Regression,87.33
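
Accuracy treats both kinds of error the same, but for a voucher campaign they differ. A missed churner (false negative) is a customer who leaves without an offer, while a false alarm (false positive) only costs a voucher. The sketch below shows how a confusion matrix, recall and precision could be read alongside accuracy. The labels are hand-written for illustration and are not the notebook's test set.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# ASSUMPTION: hand-written illustrative labels, not the notebook's test set.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = churn, 0 = loyal
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are actual classes, columns are predicted classes (order: 0, 1).
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Recall: share of actual churners that were caught.
recall = recall_score(y_true, y_pred)
# Precision: share of predicted churners that really churned.
precision = precision_score(y_true, y_pred)

print(cm)
print(f"Missed churners (FN): {fn}, false alarms (FP): {fp}")
print(f"Recall: {recall:.2f}, Precision: {precision:.2f}")
```

In the notebook itself, the same calls could be applied to `ytest` and each model's predictions, for example `pred_logreg`.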


In [52]:
# Build a table of at-risk customers from the three models

# Create a dataframe from the test data
hasil_prediksi = xtest.copy()
hasil_prediksi['Actual_Churn'] = ytest.values
hasil_prediksi['Pred_LogReg'] = pred_logreg
hasil_prediksi['Pred_KNN'] = pred_knn_final
hasil_prediksi['Pred_SVC'] = pred_svc

# Add a voting column (how many models predict churn)
hasil_prediksi['Total_Vote_Churn'] = (
    hasil_prediksi['Pred_LogReg'] +
    hasil_prediksi['Pred_KNN'] +
    hasil_prediksi['Pred_SVC']
)

print('Sample prediction results:')
hasil_prediksi.head(10)

Sample prediction results:

,Tenure,Total_Belanja_3bln,Hari_Terakhir_Login,Sesi_Per_Bulan,Jumlah_Tiket_Komplain,Skor_Kepuasan,Item_Wishlist,Actual_Churn,Pred_LogReg,Pred_KNN,Pred_SVC,Total_Vote_Churn
794,20.73,21.54,46.11,42.36,6.42,7.14,32.32,0,1,0,0,1
606,11.92,30.41,68.08,48.71,4.14,5.48,32.54,0,0,0,0,0
198,18.49,34.66,40.04,68.6,7.29,4.19,35.37,1,1,1,1,3
1296,18.85,24.92,82.35,40.19,4.78,5.29,25.89,0,0,0,0,0
408,8.16,23.93,72.28,58.79,0.27,5.08,24.07,0,0,0,0,0
798,37.6,21.65,52.36,30.5,7.62,5.78,35.86,1,1,1,1,3
1396,32.2,20.11,37.96,42.49,5.09,6.76,28.87,1,1,1,1,3
1161,44.59,9.64,29.83,45.13,5.37,6.7,18.56,0,0,0,0,0
1327,35.2,23.88,47.62,35.37,7.24,5.93,34.9,1,1,1,1,3
1196,33.85,30.6,44.22,61.34,6.98,4.49,20.72,1,1,1,1,3
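
The `Total_Vote_Churn` column counts how many of the three models predict churn, and it can be turned into a single decision with a vote threshold. The sketch below applies a majority rule (at least 2 of 3 votes) to the ten sample rows shown above. The threshold and the `pred_majority` name are assumptions, since the notebook does not say how the votes should be used.

```python
# Total_Vote_Churn and Actual_Churn copied from the ten sample rows above.
total_votes = [1, 0, 3, 0, 0, 3, 3, 0, 3, 3]
actual_churn = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

# ASSUMPTION: flag a customer when at least 2 of the 3 models predict churn.
VOTE_THRESHOLD = 2
pred_majority = [1 if votes >= VOTE_THRESHOLD else 0 for votes in total_votes]

# Count how many sample rows agree with the actual label.
matches = sum(p == a for p, a in zip(pred_majority, actual_churn))
print(pred_majority)
print(f"{matches} of {len(actual_churn)} sample rows match Actual_Churn")
```

Ten rows are far too few to judge the ensemble, so this only illustrates the mechanics. On the full dataframe the same rule would be `(hasil_prediksi['Total_Vote_Churn'] >= 2).astype(int)`.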
