### Business Case (Regression): Predicting Customer Lifetime Value (Customer LTV) in a Subscription App
A subscription app (similar to Spotify or YouTube Premium) wants to predict Customer Lifetime Value (LTV): the total revenue a customer will generate until they cancel their subscription.
An LTV regression model matters for:
- allocating the marketing budget
- identifying the customer segments worth retaining
- calculating the ROI of advertising
- deciding who qualifies for discounts or retention promos

The company wants to build an LTV prediction model from historical customer behavior data.

### Feature Descriptions
1. **customer_id**
   - Unique ID for each customer.

2. **subscription_plan**
   - Plan type: "Basic", "Standard", "Premium"
   - ❗ Capitalization is inconsistent (Basic vs basic, Premium vs premium).

3. **age**
   - Customer age.

4. **avg_watch_time**
   - Average time spent watching/consuming content (minutes per week).
   - ❗ Contains missing values.

5. **monthly_app_opens**
   - Number of times the app is opened per month (an engagement indicator).

6. **device_type**
   - Device used, recorded as Mobile, mobile, MOBILE, or Desktop.
   - ❗ Capitalization is inconsistent.

7. **region**
   - Customer's city or province.
   - ❗ Some values are missing.

8. **past_complaints**
   - Number of complaints the user has filed.

9. **months_subscribed**
   - How long the user has been subscribed, in months.

10. **total_transactions**
    - Total number of additional transactions (upsells/add-ons).

11. **customer_lifetime_value**
    - Total revenue from the customer (in Rupiah).
    - This is the target the regression model predicts.
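
The issues flagged with ❗ can be measured before any cleaning is applied. The following is a minimal sketch that assumes a small hypothetical frame named `sample`; its values (including the region names) are illustrative and are not taken from `CLV.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the flagged issues; not real data from CLV.csv.
sample = pd.DataFrame({
    "subscription_plan": ["Basic", "basic", "Premium", "premium", "Standard"],
    "device_type": ["Mobile", "mobile", "MOBILE", "Desktop", "Desktop"],
    "avg_watch_time": [44.0, np.nan, 135.0, 184.0, np.nan],
    "region": ["Jakarta", None, "Bandung", None, "Jakarta"],
})

# Compare the number of distinct labels before and after lowercasing
# to expose labels that differ only by capitalization.
for col in ["subscription_plan", "device_type"]:
    raw = sample[col].nunique()
    normalized = sample[col].str.lower().nunique()
    print(f"{col}: {raw} raw labels -> {normalized} after lowercasing")

# Count missing values per column.
missing_counts = sample.isna().sum()
print(missing_counts)
```

Running the same two checks on the real DataFrame would show how many rows each issue affects, which helps decide between imputing and dropping.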

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

df = pd.read_csv('CLV.csv')
df.head()

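| | customer_id | subscription_plan | age | avg_watch_time | monthly_app_opens | device_type | region | past_complaints | months_subscribed | total_transactions | customer_lifetime_value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C0001 | premium | 22 | 44.0 | 7 | Desktop | NaN | 1 | 23 | 0 | 1513014 |
| 1 | C0002 | basic | 36 | NaN | 38 | MOBILE | NaN | 5 | 33 | 6 | 596383 |
| 2 | C0003 | Basic | 40 | 135.0 | 114 | Mobile | NaN | 0 | 3 | 4 | 1605600 |
| 3 | C0004 | Standard | 55 | 184.0 | 9 | Desktop | NaN | 1 | 8 | 4 | 1890701 |
| 4 | C0005 | premium | 19 | 169.0 | 14 | Mobile | NaN | 3 | 16 | 1 | 892168 |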


## Data Cleaning & Preprocessing

In [2]:
# Drop customer_id
df = df.drop('customer_id', axis=1)

# Fix inconsistencies
df['subscription_plan'] = df['subscription_plan'].str.lower()
df['device_type'] = df['device_type'].str.lower()

# Handle missing values
# Assign the result back to the column: calling fillna(..., inplace=True)
# on df[col] is deprecated chained assignment and stops working in pandas 3.0.
df['avg_watch_time'] = df['avg_watch_time'].fillna(df['avg_watch_time'].mean())
df['region'] = df['region'].fillna(df['region'].mode()[0])

# Define features and target
X = df.drop('customer_lifetime_value', axis=1)
y = df['customer_lifetime_value']

# Identify categorical and numerical features
categorical_features = ['subscription_plan', 'device_type', 'region']
numerical_features = ['age', 'avg_watch_time', 'monthly_app_opens', 'past_complaints', 'months_subscribed', 'total_transactions']

# Create preprocessing pipelines for numerical and categorical features
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Create a preprocessor object using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
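
The cell above fills missing values using the mean and mode of the full dataset before the train/test split, so statistics from the test rows leak into training. One possible alternative, sketched here on hypothetical toy data, is to move imputation into the `ColumnTransformer` with `SimpleImputer`, so the fill values are learned from the training rows only. The names `toy_train`, `toy_test`, and `leak_free_preprocessor` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with the same kinds of gaps as the real columns.
toy_train = pd.DataFrame({
    "avg_watch_time": [40.0, np.nan, 120.0, 200.0],
    "region": pd.Series(["jakarta", np.nan, "jakarta", "bandung"], dtype=object),
})
toy_test = pd.DataFrame({
    "avg_watch_time": [np.nan],
    "region": pd.Series([np.nan], dtype=object),
})

# Numeric columns: fill with the training mean, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
# Categorical columns: fill with the most frequent training label, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
leak_free_preprocessor = ColumnTransformer(
    [
        ("num", numeric_steps, ["avg_watch_time"]),
        ("cat", categorical_steps, ["region"]),
    ],
    sparse_threshold=0.0,  # always return a dense array, for easy inspection
)

# Fit on training rows only; the test row is filled with training statistics.
leak_free_preprocessor.fit(toy_train)
transformed_test = leak_free_preprocessor.transform(toy_test)

learned_mean = leak_free_preprocessor.named_transformers_["num"]["impute"].statistics_
print("Mean learned from training rows:", learned_mean)
print(transformed_test)
```

With this layout the two `fillna` lines become unnecessary, and the same preprocessor can be dropped into the `Pipeline` used for model training.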


## Model Training & Evaluation

In [3]:
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest Regressor': RandomForestRegressor(random_state=42),
    'KNN Regressor': KNeighborsRegressor(),
    'SVR': SVR(),
    'Decision Tree Regressor': DecisionTreeRegressor(random_state=42)
}

results = {}

for model_name, model in models.items():
    # Create a pipeline with preprocessor and model
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                ('regressor', model)])
    
    # Train the model
    pipeline.fit(X_train, y_train)
    
    # Make predictions
    y_pred = pipeline.predict(X_test)
    
    # Evaluate the model
    mae = mean_absolute_error(y_test, y_pred)
    mape = mean_absolute_percentage_error(y_test, y_pred)
    
    results[model_name] = {'MAE': mae, 'MAPE': mape}

results_df = pd.DataFrame(results).T
results_df


| Model | MAE | MAPE |
|---|---|---|
| Linear Regression | 476873.176277 | 2.957038 |
| Random Forest Regressor | 479783.875317 | 2.807746 |
| KNN Regressor | 498189.732333 | 2.400457 |
| SVR | 477159.387618 | 3.070751 |
| Decision Tree Regressor | 643458.5 | 3.858396 |
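
The MAE values for Linear Regression, SVR, and Random Forest differ by less than 1%, so a ranking based on a single 80/20 split may reflect noise rather than a real difference. A rough way to check is k-fold cross-validation, sketched below with `cross_val_score`. The data here is synthetic (`make_regression`), which is an assumption made so the example runs without `CLV.csv`; the names `X_demo`, `y_demo`, and `cv_summary` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; replace with the real X and y).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, noise=10.0, random_state=42
)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=42),
}
cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_summary = {}
for name, estimator in candidates.items():
    # scikit-learn maximizes scores, so MAE is returned negated; flip the sign back.
    scores = -cross_val_score(
        estimator, X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error"
    )
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: MAE {scores.mean():.2f} +/- {scores.std():.2f}")
```

On the real data, passing the full `Pipeline` (preprocessor plus regressor) as the estimator would refit the preprocessing inside each fold. If the gap between two models is smaller than the spread across folds, the ranking is probably not reliable.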


## Conclusion
The comparison shows how each model performs on Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE).

**MAE (Mean Absolute Error)** is the average absolute difference between the predictions and the actual values. Here it measures how far, in Rupiah, the predicted LTV is from the actual LTV. The lower the MAE, the better the model.

**MAPE (Mean Absolute Percentage Error)** is the average absolute error expressed relative to the actual value, which shows how large the errors are in proportion to what is being predicted. As with MAE, lower is better. Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction rather than a percentage, so a value of 2.96 corresponds to 296%.
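
To make the two definitions concrete, here is a small worked example that computes both metrics by hand and compares them with scikit-learn. The LTV values are hypothetical and chosen only to keep the arithmetic simple.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Hypothetical LTV values in Rupiah (illustrative, not from CLV.csv).
y_true = np.array([1_000_000, 2_000_000, 500_000])
y_hat = np.array([1_200_000, 1_500_000, 1_000_000])

abs_err = np.abs(y_true - y_hat)          # [200000, 500000, 500000]
mae_manual = abs_err.mean()               # 400000.0
mape_manual = (abs_err / y_true).mean()   # (0.2 + 0.25 + 1.0) / 3

# The library functions should agree with the manual calculation.
sk_mae = mean_absolute_error(y_true, y_hat)
sk_mape = mean_absolute_percentage_error(y_true, y_hat)

print(f"MAE:  manual={mae_manual:.1f}, sklearn={sk_mae:.1f}")
print(f"MAPE: manual={mape_manual:.4f}, sklearn={sk_mape:.4f}")
print(f"MAPE as a percentage: {mape_manual * 100:.1f}%")
```

The third customer has the same absolute error as the second but a relative error of 1.0 (100%), because the actual value is small. Customers with a low LTV can therefore dominate MAPE even when MAE looks moderate.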

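No single model wins on both metrics. **Linear Regression** has the lowest MAE (about Rp 476,873), with SVR (Rp 477,159) and **Random Forest Regressor** (Rp 479,784) less than 1% behind, while **KNN Regressor** has the lowest MAPE (2.40). Random Forest ranks third on MAE and second on MAPE (2.81), and the Decision Tree Regressor is the worst on both. Every MAPE is above 2.4, which is more than 240% when read as a percentage, so none of these models predicts Customer Lifetime Value accurately yet in this experiment.

**Recommendation:**
Use Linear Regression as the baseline: it has the lowest MAE and is the easiest to interpret. **Random Forest Regressor** remains a reasonable candidate for further work, since it is within 1% of the best MAE and second on MAPE, but it is harder to interpret than Linear Regression or a Decision Tree and, in these results, not more accurate. Hyperparameter tuning of the Random Forest is a natural next step.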
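
A minimal sketch of what hyperparameter tuning for the Random Forest could look like, using `GridSearchCV`. The data is synthetic (`make_regression`) so the example runs without `CLV.csv`, and the grid values are assumptions chosen to keep the search small, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; replace with the real training set).
X_demo, y_demo = make_regression(
    n_samples=150, n_features=6, noise=10.0, random_state=42
)

# Small illustrative grid; widen or change the ranges for real use.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",  # negated MAE, because higher scores are better
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated MAE:", -search.best_score_)
```

When tuning the full pipeline from the training cell instead of a bare estimator, the parameter names need the step prefix, for example `regressor__n_estimators`, because the model step there is named `regressor`.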