# Property Price Prediction - Model Training

This notebook takes the cleaned property data, prepares it for machine learning, trains a RandomForestRegressor model, evaluates its performance, and saves the final trained model for use in our web application.

In [17]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import joblib

print("Libraries imported successfully.")

Libraries imported successfully.


## Step 1: Load the Cleaned Data
We start by loading the `properti_data_cleaned.csv` file created by our preprocessing notebook.

In [18]:
try:
    df = pd.read_csv('properti_data_cleaned.csv')
    print("Successfully loaded 'properti_data_cleaned.csv'.")
    print(f"The dataset contains {df.shape[0]} rows.")
    display(df.head())
except FileNotFoundError:
    print("❌ Error: 'properti_data_cleaned.csv' not found. Please run the preprocessing notebook first.")
    raise  # Stop here: every later cell needs `df`, so do not continue without it.

Successfully loaded 'properti_data_cleaned.csv'.
The dataset contains 39888 rows.


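|   | City | Bedrooms | Building Area (m²) | Land Area (m²) | Price |
|---|------|----------|--------------------|----------------|-------|
| 0 | tangerang | 2 | 68 | 130 | 2300000000 |
| 1 | tangerang | 5 | 192 | 128 | 4200000000 |
| 2 | tangerang | 3 | 94 | 158 | 2800000000 |
| 3 | tangerang | 3 | 125 | 144 | 2800000000 |
| 4 | tangerang | 4 | 325 | 270 | 7500000000 |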


## Step 2: Prepare Data for Modeling (Feature Engineering)
scikit-learn models require numerical input features. The `City` column is categorical (text), so we convert it into a numerical format using **One-Hot Encoding**.

This creates one new binary column per city. Each row is `True` (1) only in the column for its own city and `False` (0) in all the others.

In [19]:
print("Performing one-hot encoding on the 'City' column...")
dummies = pd.get_dummies(df['City'], prefix='City')
df_model = pd.concat([df, dummies], axis=1)

# Define our target (what we want to predict) and our features (what the model uses to predict)
target = 'Price'
features = ['Bedrooms', 'Building Area (m²)', 'Land Area (m²)'] + list(dummies.columns)

X = df_model[features]
y = df_model[target]

print("Features and target variable created.")
print(f"Number of features: {len(features)}")
display(X.head())

Performing one-hot encoding on the 'City' column...
Features and target variable created.
Number of features: 9


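|   | Bedrooms | Building Area (m²) | Land Area (m²) | City_bekasi | City_bogor | City_depok | City_jakarta | City_other | City_tangerang |
|---|----------|--------------------|----------------|-------------|------------|------------|--------------|------------|----------------|
| 0 | 2 | 68 | 130 | False | False | False | False | False | True |
| 1 | 5 | 192 | 128 | False | False | False | False | False | True |
| 2 | 3 | 94 | 158 | False | False | False | False | False | True |
| 3 | 3 | 125 | 144 | False | False | False | False | False | True |
| 4 | 4 | 325 | 270 | False | False | False | False | False | True |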
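
To make the encoding concrete, here is a minimal pure-Python sketch of the idea behind one-hot encoding. The `one_hot` helper is hypothetical and written only for illustration; the notebook itself relies on `pd.get_dummies`, which does the equivalent work on a DataFrame column.

```python
def one_hot(values, prefix):
    """Return one dict per value, with a True/False flag for every category."""
    # Sorted so the column order is stable from run to run.
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": value == category for category in categories}
        for value in values
    ]


# Toy input, not taken from the real dataset.
rows = one_hot(["tangerang", "bogor", "tangerang"], prefix="City")
for row in rows:
    print(row)
```

Each row ends up with exactly one `True` flag, which is the property the model relies on to tell cities apart.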


## Step 3: Split Data into Training and Testing Sets
We split our data into two parts:
- **Training Set (80%):** The model learns the patterns from this data.
- **Testing Set (20%):** We use this unseen data to evaluate how well the model performs.

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Data successfully split:")
print(f"- Training samples: {len(X_train)}")
print(f"- Testing samples:  {len(X_test)}")

Data successfully split:
- Training samples: 31910
- Testing samples:  7978
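
The sample counts above can be checked by hand. This sketch assumes that, for a fractional `test_size`, `train_test_split` rounds the test count up and assigns the remaining rows to the training set, which is consistent with the numbers printed above.

```python
import math

n_samples = 39888  # rows in the cleaned dataset
test_size = 0.2    # fraction passed to train_test_split

# 0.2 * 39888 = 7977.6, which is rounded up to a whole number of rows.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"Training samples: {n_train}")
print(f"Testing samples:  {n_test}")
```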


## Step 4: Train the RandomForestRegressor Model
Now we initialize our model and train it using the training data.

In [21]:
print("Training the RandomForestRegressor model...")

# n_estimators=100 means the model is an ensemble of 100 decision trees.
# n_jobs=-1 uses all available CPU cores to speed up training.
model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)

print("Model training complete.")

Training the RandomForestRegressor model...
Model training complete.
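
As a rough illustration of what "an ensemble of 100 decision trees" means, the sketch below shows the two ideas a random forest regressor combines: each tree is fit on a bootstrap sample (rows drawn with replacement), and the forest's prediction is the average of the individual tree predictions. The row count and the three tree predictions are made-up values for illustration, not output from the trained model.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical tiny training set of 10 rows.
n_rows = 10

# Bootstrap sample: draw n_rows indices with replacement, so some rows
# repeat and others are left out for this particular tree.
bootstrap_idx = rng.choices(range(n_rows), k=n_rows)
print("Rows used by one tree:", sorted(bootstrap_idx))

# Hypothetical price predictions (in Rupiah) from three trees for one property.
tree_predictions = [2_400_000_000, 2_900_000_000, 2_500_000_000]

# The forest's prediction is the mean of the tree predictions.
forest_prediction = mean(tree_predictions)
print(f"Forest prediction: Rp {forest_prediction:,.0f}")
```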


## Step 5: Evaluate Model Performance
We use the trained model to predict prices for the unseen test data and measure its error with the **Mean Absolute Error (MAE)**. The MAE is the average absolute difference, in Rupiah, between the predicted and actual prices, so lower values are better.

In [22]:
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)

print(f"Mean Absolute Error (MAE) on Test Set: Rp {mae:,.0f}")

Mean Absolute Error (MAE) on Test Set: Rp 860,421,280
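
For reference, the MAE over $n$ test samples is $\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. The sketch below computes it by hand on three rows. The actual prices are the first three rows shown in Step 1, while the predicted prices are invented for illustration and are not output from the trained model.

```python
# Actual prices (Rupiah) from the first three rows of the dataset preview.
y_true = [2_300_000_000, 4_200_000_000, 2_800_000_000]

# Hypothetical predictions, chosen only to illustrate the calculation.
y_pred = [2_500_000_000, 4_000_000_000, 2_800_000_000]

# Absolute error per row: 200M, 200M, and 0.
absolute_errors = [abs(actual - predicted) for actual, predicted in zip(y_true, y_pred)]

# MAE is the mean of the absolute errors.
mae_example = sum(absolute_errors) / len(absolute_errors)
print(f"MAE: Rp {mae_example:,.0f}")
```

Because the errors are averaged without squaring, a single very expensive property affects the MAE less than it would affect a squared-error metric.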


## Step 6: Save the Final Model
Finally, we save the trained model and the list of feature columns it expects into a single `.pkl` file. This file will be loaded by our Streamlit web application.

In [23]:
model_data = {
    'model': model,
    'columns': features
}

joblib.dump(model_data, 'property_price_predictor.pkl')

print("Model and feature columns successfully saved to 'property_price_predictor.pkl'.")

Model and feature columns successfully saved to 'property_price_predictor.pkl'.
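
The saved `columns` list matters because the model expects its inputs in exactly the same order as during training. Below is a minimal sketch of how the web application might build one input row from user input. The `build_input_row` helper is hypothetical, and the fallback to `City_other` for an unknown city is an assumption about how the preprocessing notebook groups cities.

```python
def build_input_row(columns, city, bedrooms, building_area, land_area):
    """Build one model input as a dict keyed by the training columns."""
    # Start with every feature at 0 so all one-hot flags default to "off".
    row = {column: 0 for column in columns}
    row['Bedrooms'] = bedrooms
    row['Building Area (m²)'] = building_area
    row['Land Area (m²)'] = land_area

    city_column = f"City_{city.lower()}"
    if city_column not in row:
        # Assumption: cities not seen during training map to 'City_other'.
        city_column = 'City_other'
    row[city_column] = 1
    return row


# The nine feature columns listed in Step 2.
columns = [
    'Bedrooms', 'Building Area (m²)', 'Land Area (m²)',
    'City_bekasi', 'City_bogor', 'City_depok',
    'City_jakarta', 'City_other', 'City_tangerang',
]

row = build_input_row(columns, city='Tangerang', bedrooms=3,
                      building_area=94, land_area=158)
print([row[column] for column in columns])
```

In the application, `joblib.load('property_price_predictor.pkl')` would return the saved dictionary, and a row like this could be wrapped in `pd.DataFrame([row], columns=columns)` before being passed to `model.predict`.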
