# 4. Modeling (TDSP Step 3)

In this stage we build machine learning models to predict used car prices.

Main steps:
- Define the features (X) and the target (y).
- Split the data into train and test sets.
- Build a preprocessing pipeline (numeric and categorical).
- Train several models and evaluate their performance.
- Select the best model based on the evaluation metrics.

---
This notebook assumes the cleaned dataset has already been saved as:

`../Dataset/UsedCarsSA_Clean.csv`

If the file does not exist yet, first run the data preparation notebook (Section 3) to generate it.
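
Because every later cell depends on this file, a small guard before loading can make a missing file fail with a clear message. The sketch below is an illustration and not part of the original notebook; the helper name `require_file` is hypothetical.

```python
from pathlib import Path


def require_file(path):
    """Return the path as a Path object, or raise if the file is missing."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(
            f"{p} not found; run the data preparation notebook (Section 3) first."
        )
    return p


# Hypothetical usage with the path from the loading cell in this notebook:
# data_path_clean = require_file('../Dataset/UsedCarsSA_Clean.csv')
```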

In [12]:
# 4.1 Import libraries and load the cleaned dataset
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from xgboost import XGBRegressor

import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

# Load the cleaned dataset
data_path_clean = '../Dataset/UsedCarsSA_Clean.csv'
df_clean = pd.read_csv(data_path_clean)

df_clean.head()

|   | Make | Type | Year | Origin | Color | Options | Engine_Size | Fuel_Type | Gear_Type | Mileage | Region | Price | Car_Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Chrysler | C300 | 2018 | Saudi | Black | Full | 5.7 | Gas | Automatic | 103000 | Riyadh | 114000.0 | 7 |
| 1 | Nissan | Sunny | 2019 | Saudi | Silver | Standard | 1.5 | Gas | Automatic | 72418 | Riyadh | 27500.0 | 6 |
| 2 | Hyundai | Elantra | 2019 | Saudi | Grey | Standard | 1.6 | Gas | Automatic | 114154 | Riyadh | 43000.0 | 6 |
| 3 | Hyundai | Elantra | 2019 | Saudi | Silver | Semi Full | 2.0 | Gas | Automatic | 41912 | Riyadh | 59500.0 | 6 |
| 4 | Honda | Accord | 2018 | Saudi | Navy | Full | 1.5 | Gas | Automatic | 39000 | Riyadh | 72000.0 | 7 |


## 4.2 Preparing Features and Target

- Target (y): the `Price` column.
- Features (X): all columns except `Price`.

If the `Car_Age` column is missing because it was not created in the previous notebook, we recreate it here as an additional feature.

In [13]:
# Make sure the Car_Age column exists
if 'Car_Age' not in df_clean.columns and 'Year' in df_clean.columns:
    df_clean['Car_Age'] = 2025 - df_clean['Year']

# Define the target and the features
target = 'Price'
X = df_clean.drop(columns=[target])
y = df_clean[target]

X.head(), y.head()

(       Make     Type  Year Origin   Color    Options  Engine_Size Fuel_Type  \
 0  Chrysler     C300  2018  Saudi   Black       Full          5.7       Gas   
 1    Nissan    Sunny  2019  Saudi  Silver   Standard          1.5       Gas   
 2   Hyundai  Elantra  2019  Saudi    Grey   Standard          1.6       Gas   
 3   Hyundai  Elantra  2019  Saudi  Silver  Semi Full          2.0       Gas   
 4     Honda   Accord  2018  Saudi    Navy       Full          1.5       Gas   
 
    Gear_Type  Mileage  Region  Car_Age  
 0  Automatic   103000  Riyadh        7  
 1  Automatic    72418  Riyadh        6  
 2  Automatic   114154  Riyadh        6  
 3  Automatic    41912  Riyadh        6  
 4  Automatic    39000  Riyadh        7  ,
 0    114000.0
 1     27500.0
 2     43000.0
 3     59500.0
 4     72000.0
 Name: Price, dtype: float64)
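
In the rows shown above, `Car_Age` equals `2025 - Year`, so the two columns carry the same information with opposite sign. The short check below uses only the five `Year` values from the output to illustrate this. Whether to drop one of the two columns is a judgment call that this notebook does not make.

```python
import numpy as np

# Year values from the first five rows shown above
year = np.array([2018, 2019, 2019, 2019, 2018])
car_age = 2025 - year  # same formula as in the cell above

# Pearson correlation between Year and Car_Age
corr = np.corrcoef(year, car_age)[0, 1]
print(car_age, round(corr, 6))
```

A correlation of -1 means a linear model cannot attribute an effect to one column rather than the other. Tree-based models are generally less sensitive to this kind of redundancy.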

## 4.3 Train–Test Split

The data is split into 80 percent training data and 20 percent test data.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

X_train.shape, X_test.shape

((4385, 12), (1097, 12))
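
The shapes above can be reproduced by hand. When only `test_size` is given as a fraction, scikit-learn rounds the test-set size up and assigns the remaining rows to training. A minimal sketch of that arithmetic, with the total row count inferred from the two shapes above:

```python
import math

n_samples = 4385 + 1097      # total rows, inferred from the shapes above
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # test size is rounded up
n_train = n_samples - n_test                # the rest goes to training

print(n_samples, n_train, n_test)
```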

## 4.4 Identifying Numeric and Categorical Columns

The main numeric columns are:
- Year
- Engine_Size
- Mileage
- Car_Age

All remaining columns, which hold text values in this dataset, are treated as categorical features.

In [15]:
numeric_features = []
for col in ['Year', 'Engine_Size', 'Mileage', 'Car_Age']:
    if col in X_train.columns:
        numeric_features.append(col)

categorical_features = [col for col in X_train.columns if col not in numeric_features]

numeric_features, categorical_features[:10]  # show the numeric features and up to ten categorical features

(['Year', 'Engine_Size', 'Mileage', 'Car_Age'],
 ['Make',
  'Type',
  'Origin',
  'Color',
  'Options',
  'Fuel_Type',
  'Gear_Type',
  'Region'])
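
The cell above relies on a hand-written list of numeric columns. An alternative, sketched here on a toy frame with made-up rows, is to let pandas infer the split from the column dtypes with `select_dtypes`. This is an illustration rather than what the notebook does, and it would misclassify a categorical column that happens to be stored as integers.

```python
import pandas as pd

# Toy frame with a subset of the notebook's columns (values are illustrative)
toy = pd.DataFrame({
    'Make': ['Nissan', 'Honda'],
    'Year': [2019, 2018],
    'Engine_Size': [1.5, 1.5],
    'Gear_Type': ['Automatic', 'Automatic'],
})

# Numeric dtypes go to one list, everything else to the other
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(categorical_cols)
```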

## 4.5 Preprocessing Pipeline

- Numeric features: median imputation followed by standard scaling.
- Categorical features: mode imputation followed by one-hot encoding.

We use `ColumnTransformer` to combine the two kinds of preprocessing into a single step.

In [16]:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

preprocessor
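
To make the behaviour of this preprocessor concrete, the sketch below rebuilds the same structure on a two-column toy frame with made-up values. It shows the output width (one scaled column plus one column per known category) and what `handle_unknown='ignore'` does with a category that was not present during fitting. The names `toy_preprocessor` and `to_dense` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (illustrative values); one Mileage entry is missing
train = pd.DataFrame({
    'Mileage': [100.0, 200.0, np.nan, 400.0],
    'Make': ['Nissan', 'Honda', 'Nissan', 'Honda'],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), ['Mileage']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['Make']),
])


def to_dense(a):
    """Convert a possibly sparse result to a dense NumPy array."""
    return a.toarray() if hasattr(a, 'toarray') else np.asarray(a)


train_out = to_dense(toy_preprocessor.fit_transform(train))

# A row whose Make was never seen during fit
new_row = pd.DataFrame({'Mileage': [250.0], 'Make': ['Toyota']})
new_out = to_dense(toy_preprocessor.transform(new_row))

print(train_out.shape)   # one scaled column plus one column per known Make
print(new_out[0, 1:])    # one-hot part for the unseen Make
```

The unseen category is encoded as all zeros instead of raising an error, which matters here because a random train/test split can leave rare `Make` or `Type` values only in the test set.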

## 4.6 Model Evaluation Function

We use several metrics:
- MAE (Mean Absolute Error)
- MAPE (Mean Absolute Percentage Error)
- RMSE (Root Mean Squared Error)
- R squared (coefficient of determination)

MAPE shows how large the model's error is as a percentage of the actual price.

In [17]:
def mean_absolute_percentage_error(y_true, y_pred):
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    nonzero = y_true != 0
    return np.mean(np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero])) * 100

def evaluate_regression(model, X_train, y_train, X_test, y_test, name='Model'):
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_mae = mean_absolute_error(y_train, y_train_pred)
    test_mae = mean_absolute_error(y_test, y_test_pred)
    test_mape = mean_absolute_percentage_error(y_test, y_test_pred)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    test_r2 = r2_score(y_test, y_test_pred)

    print(f'=== {name} ===')
    print(f'Train MAE  : {train_mae:,.2f}')
    print(f'Test MAE   : {test_mae:,.2f}')
    print(f'Test MAPE  : {test_mape:,.2f}%')
    print(f'Test RMSE  : {test_rmse:,.2f}')
    print(f'Test R^2   : {test_r2:.3f}')
    return {
        'name': name,
        'train_mae': train_mae,
        'test_mae': test_mae,
        'test_mape': test_mape,
        'test_rmse': test_rmse,
        'test_r2': test_r2
    }
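
As a sanity check on the four metrics, the sketch below computes each one directly with NumPy on three made-up prices. The numbers are illustrative only and do not come from the dataset.

```python
import numpy as np

# Made-up actual and predicted prices
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

err = y_true - y_pred
mae = np.mean(np.abs(err))                   # mean of |error|
mape = np.mean(np.abs(err / y_true)) * 100   # mean of |error| / actual, in percent
rmse = np.sqrt(np.mean(err ** 2))            # square root of the mean squared error
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mape, rmse, r2)
```

Note that an error of 10 on an actual value of 100 and an error of 20 on an actual value of 200 both contribute 10 percent to MAPE. The same absolute error therefore weighs more on cheap cars, which is one reason MAPE can look high even when R squared is good.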

## 4.7 Baseline Model: Linear Regression

As a baseline, we use a simple linear regression model combined with the preprocessing pipeline.

In [18]:
linreg_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LinearRegression())
])

linreg_model.fit(X_train, y_train)
metrics_linreg = evaluate_regression(linreg_model, X_train, y_train, X_test, y_test, 'Linear Regression')
metrics_linreg

=== Linear Regression ===
Train MAE  : 17,532.05
Test MAE   : 19,986.49
Test MAPE  : 64.88%
Test RMSE  : 31,981.78
Test R^2   : 0.730


{'name': 'Linear Regression',
 'train_mae': 17532.054378053672,
 'test_mae': 19986.490465501578,
 'test_mape': np.float64(64.87501374453281),
 'test_rmse': np.float64(31981.777529691266),
 'test_r2': 0.7304796966843834}

## 4.8 Tree-based Models: Random Forest and Gradient Boosting

Decision-tree-based models often perform well on tabular regression problems like this one.

We try two models:
- RandomForestRegressor
- GradientBoostingRegressor

In [19]:
# Random Forest
rf_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(
        n_estimators=200,
        random_state=42,
        n_jobs=-1
    ))
])

rf_model.fit(X_train, y_train)
metrics_rf = evaluate_regression(rf_model, X_train, y_train, X_test, y_test, 'Random Forest')
metrics_rf

=== Random Forest ===
Train MAE  : 4,912.19
Test MAE   : 14,318.20
Test MAPE  : 46.37%
Test RMSE  : 27,425.53
Test R^2   : 0.802


{'name': 'Random Forest',
 'train_mae': 4912.18931438345,
 'test_mae': 14318.197443894604,
 'test_mape': np.float64(46.366600810384014),
 'test_rmse': np.float64(27425.530947093495),
 'test_r2': 0.8018033183022005}

In [20]:
# Gradient Boosting
gbr_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', GradientBoostingRegressor(
        random_state=42
    ))
])

gbr_model.fit(X_train, y_train)
metrics_gbr = evaluate_regression(gbr_model, X_train, y_train, X_test, y_test, 'Gradient Boosting')
metrics_gbr

=== Gradient Boosting ===
Train MAE  : 15,113.22
Test MAE   : 17,023.44
Test MAPE  : 57.79%
Test RMSE  : 27,659.77
Test R^2   : 0.798


{'name': 'Gradient Boosting',
 'train_mae': 15113.22279213618,
 'test_mae': 17023.435171855675,
 'test_mape': np.float64(57.787393417703626),
 'test_rmse': np.float64(27659.76762324248),
 'test_r2': 0.7984033341299597}
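
Comparing train and test MAE gives a rough sense of how much each model overfits. The sketch below computes the test-to-train MAE ratio from the numbers reported so far. The ratio is an informal diagnostic chosen for illustration, not a metric used elsewhere in this notebook.

```python
# (train MAE, test MAE) as reported in the outputs above, rounded to two decimals
reported = {
    'Linear Regression': (17532.05, 19986.49),
    'Random Forest': (4912.19, 14318.20),
    'Gradient Boosting': (15113.22, 17023.44),
}

# A ratio near 1 means similar error on seen and unseen data
gap_ratio = {name: test / train for name, (train, test) in reported.items()}

for name, ratio in sorted(gap_ratio.items(), key=lambda kv: kv[1]):
    print(f'{name:<18} test/train MAE = {ratio:.2f}')
```

With these numbers, Random Forest has by far the largest ratio: its test MAE is almost three times its train MAE. That suggests it fits the training data much more closely than it generalizes, even though its test MAE is lower than that of the other two models.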

## 4.9 Advanced Gradient Boosting Model: XGBoost

XGBoost often performs very well on tabular data.

We use a simple initial configuration that can be tuned further if needed.

In [21]:
xgb_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', XGBRegressor(
        n_estimators=400,
        learning_rate=0.05,
        max_depth=6,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='reg:squarederror',
        random_state=42,
        n_jobs=-1
    ))
])

xgb_model.fit(X_train, y_train)
metrics_xgb = evaluate_regression(xgb_model, X_train, y_train, X_test, y_test, 'XGBoost')
metrics_xgb

=== XGBoost ===
Train MAE  : 8,577.56
Test MAE   : 13,196.32
Test MAPE  : 46.82%
Test RMSE  : 23,145.91
Test R^2   : 0.859


{'name': 'XGBoost',
 'train_mae': 8577.560946185504,
 'test_mae': 13196.324459997437,
 'test_mape': np.float64(46.81557336994786),
 'test_rmse': np.float64(23145.908563274246),
 'test_r2': 0.8588324891217868}
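
The notebook imports `KFold` and `GridSearchCV` but does not use them. If further tuning is wanted, one possible shape for it is sketched below. To stay self-contained, the sketch runs on synthetic data and swaps in `GradientBoostingRegressor` for `XGBRegressor`. The grid values, the fold count, and the names `demo_pipeline` and `search` are assumptions, not settings from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real notebook would pass X_train, y_train
X_demo, y_demo = make_regression(
    n_samples=120, n_features=5, noise=10.0, random_state=42
)

demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', GradientBoostingRegressor(n_estimators=30, random_state=42)),
])

# Step name + double underscore + parameter name addresses a pipeline step
param_grid = {
    'model__max_depth': [2, 3],
    'model__learning_rate': [0.05, 0.1],
}

search = GridSearchCV(
    demo_pipeline,
    param_grid=param_grid,
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
    scoring='neg_mean_absolute_error',
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)   # cross-validated MAE of the best setting
```

Because `XGBRegressor` also takes `max_depth` and `learning_rate`, the same grid keys would apply to a pipeline whose final step is named `model`, as in the cell above. Tuning on cross-validation folds of the training data keeps the test set untouched for the final comparison.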

## 4.10 Model Comparison Summary

We combine the evaluation results of all models into a single DataFrame to make comparison easier.

In [22]:
results = [metrics_linreg, metrics_rf, metrics_gbr, metrics_xgb]
results_df = pd.DataFrame(results)
results_df.sort_values('test_mae')

|   | name | train_mae | test_mae | test_mape | test_rmse | test_r2 |
|---|---|---|---|---|---|---|
| 3 | XGBoost | 8577.560946 | 13196.32446 | 46.815573 | 23145.908563 | 0.858832 |
| 1 | Random Forest | 4912.189314 | 14318.197444 | 46.366601 | 27425.530947 | 0.801803 |
| 2 | Gradient Boosting | 15113.222792 | 17023.435172 | 57.787393 | 27659.767623 | 0.798403 |
| 0 | Linear Regression | 17532.054378 | 19986.490466 | 64.875014 | 31981.77753 | 0.73048 |


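Typically, the model with the lowest **test MAE** and **MAPE** and a sufficiently high **R squared** is selected as the best candidate. In this run the metrics do not fully agree: XGBoost leads on test MAE, RMSE, and R squared, while Random Forest has a marginally lower MAPE.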
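
When the metrics may disagree, it helps to look up the winner per metric rather than sort by a single column. The sketch below does this with `idxmin` and `idxmax` on the rounded test metrics reported above. The frame `summary` is a hand-typed stand-in for `results_df`.

```python
import pandas as pd

# Test metrics as reported in the outputs above (rounded)
summary = pd.DataFrame({
    'name': ['Linear Regression', 'Random Forest', 'Gradient Boosting', 'XGBoost'],
    'test_mae': [19986.49, 14318.20, 17023.44, 13196.32],
    'test_mape': [64.88, 46.37, 57.79, 46.82],
    'test_rmse': [31981.78, 27425.53, 27659.77, 23145.91],
    'test_r2': [0.730, 0.802, 0.798, 0.859],
}).set_index('name')

# Lower is better for the error metrics, higher is better for R squared
best = {
    'test_mae': summary['test_mae'].idxmin(),
    'test_mape': summary['test_mape'].idxmin(),
    'test_rmse': summary['test_rmse'].idxmin(),
    'test_r2': summary['test_r2'].idxmax(),
}
print(best)
```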

In the next stage (Deployment), we will save the best model and prepare a prediction function ready to be integrated with an application or API.
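
As a preview of that step, one common way to persist a fitted scikit-learn pipeline is `joblib`. The sketch below is an assumption about how saving could look, not the deployment code itself. It uses a tiny stand-in pipeline and a temporary directory, and the file name `demo_model.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in pipeline; the notebook's fitted pipeline would take its place
demo_model = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])
demo_model.fit([[1.0], [2.0], [3.0]], [10.0, 20.0, 30.0])

# Save preprocessing and estimator together, then load them back
model_path = os.path.join(tempfile.mkdtemp(), 'demo_model.joblib')  # hypothetical file name
joblib.dump(demo_model, model_path)
restored = joblib.load(model_path)

print(restored.predict([[4.0]]))
```

Saving the whole pipeline rather than only the estimator means the loaded object applies the same imputation, scaling, and encoding to raw input at prediction time.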