# AI Property Price Prediction

<ul>
 <li>The datasets use 4 attributes: Longitude, Latitude, Room Type, and Minimum Nights.</li>
 <li>We manually combined several datasets into one larger dataset for training the model.</li>
 <li>We searched for the best model by evaluating 5 models suitable for price prediction.</li>
 <li>We then tuned the hyperparameters of the best model using grid search and cross-validation.</li>
 <li>The result is 3 .pkl files: "best_encoder.pkl", "best_scaler.pkl", and "best_price_model.pkl".</li>
</ul>

## Determine the Model that will be used

In [None]:
import pandas as pd
import numpy as np
import time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import xgboost as xgb
import joblib

# 🔹 Step 1: Combine the 6 datasets with explicit dtypes
file_paths = [
    "Datasets/data.csv", "Datasets/data1.csv", "Datasets/data2.csv",
    "Datasets/data3.csv", "Datasets/data4.csv", "Datasets/data5.csv"
]

df_list = [pd.read_csv(file, dtype={"room_type": str}, low_memory=False) for file in file_paths]
data = pd.concat(df_list, ignore_index=True)

# 🔹 Step 2: Select the required columns
selected_features = ["latitude", "longitude", "minimum_nights", "room_type", "price"]
data = data[selected_features]

# 🔹 Step 3: Strip "$" and commas from `price`, then convert to float
data["price"] = data["price"].astype(str).str.replace(r'[\$,]', '', regex=True).astype(float)

# 🔹 Step 4: Handle Missing Values
data.fillna({
    "minimum_nights": data["minimum_nights"].median(),
    "room_type": "Unknown"
}, inplace=True)

# 🔹 Step 5: Encode the categorical feature (`room_type` only)
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoded_features = encoder.fit_transform(data[["room_type"]])

# 🔹 Step 6: Standardize the numeric features (latitude, longitude, minimum_nights)
numerical_features = ["latitude", "longitude", "minimum_nights"]
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data[numerical_features])

# 🔹 Step 7: Combine all features
X = np.hstack((encoded_features, scaled_features))
y = data["price"]

# 🔹 Step 8: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 🔹 Step 9: Models to compare
models = {
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
    "XGBoost": xgb.XGBRegressor(objective="reg:squarederror", n_estimators=100, learning_rate=0.1),
    "Linear Regression": LinearRegression(),
    "K-Nearest Neighbors": KNeighborsRegressor(n_neighbors=5)
}

# 🔹 Step 10: Evaluate each model
results = []

for model_name, model in models.items():
    print(f"Training {model_name}...")
    
    start_time = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start_time

    y_pred = model.predict(X_test)

    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)

    results.append({
        "Model": model_name,
        "MAE": round(mae, 2),
        "MSE": round(mse, 2),
        "RMSE": round(rmse, 2),
        "R² Score": round(r2, 4),
        "Train Time (s)": round(train_time, 2)
    })

    print(f"{model_name} - MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}, R²: {r2:.4f}, Time: {train_time:.2f}s")

# 🔹 Step 11: Convert the results to a DataFrame and sort them
results_df = pd.DataFrame(results)
results_df = results_df.sort_values(by="MAE")  # Sort by MAE, lowest (best) first
print("\nModel Comparison Results:\n", results_df)

# 🔹 Step 12: Save the best model
best_model_name = results_df.iloc[0]["Model"]
best_model = models[best_model_name]

joblib.dump(best_model, "best_price_model.pkl")
joblib.dump(encoder, "best_encoder.pkl")
joblib.dump(scaler, "best_scaler.pkl")

print(f"\n✅ Best model saved: {best_model_name}")


Training Random Forest...
Random Forest - MAE: 124.04, MSE: 534884.37, RMSE: 731.36, R²: -0.0414, Time: 470.78s
Training Gradient Boosting...
Gradient Boosting - MAE: 151.09, MSE: 485255.29, RMSE: 696.60, R²: 0.0552, Time: 63.75s
Training XGBoost...
XGBoost - MAE: 144.64, MSE: 467685.75, RMSE: 683.88, R²: 0.0894, Time: 2.38s
Training Linear Regression...
Linear Regression - MAE: 158.87, MSE: 508261.84, RMSE: 712.92, R²: 0.0104, Time: 0.11s
Training K-Nearest Neighbors...
K-Nearest Neighbors - MAE: 135.81, MSE: 492108.54, RMSE: 701.50, R²: 0.0419, Time: 0.90s

Model Comparison Results:
                  Model     MAE        MSE    RMSE  R² Score  Train Time (s)
0        Random Forest  124.04  534884.37  731.36   -0.0414          470.78
4  K-Nearest Neighbors  135.81  492108.54  701.50    0.0419            0.90
2              XGBoost  144.64  467685.75  683.88    0.0894            2.38
1    Gradient Boosting  151.09  485255.29  696.60    0.0552           63.75
3    Linear Regression  158.87  508261.84  712.92    0.0104            0.11

✅ Best model saved: Random Forest

## Insights (1)

The models are compared on four metrics:
<br><br>
✅ MAE (Mean Absolute Error) → lower is better

The average of the absolute differences between the actual values (y_test) and the predicted values.

✅ MSE (Mean Squared Error) → lower is better

The average of the squared differences between the actual and predicted values.

✅ RMSE (Root Mean Squared Error) → lower is better

The square root of the MSE, expressed in the same unit as the price.

✅ R² Score → closer to 1 is better

Measures how much of the variability in the data the model explains.

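Ranked by MAE, the best model is <b>Random Forest</b> (MAE 124.04). It also has the highest RMSE (731.36) and a negative R² (-0.0414), and no model reaches an R² above 0.09, so the ranking depends on which metric is used.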
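
The sketch below computes the four metrics by hand with NumPy and shows, on made-up prices, how a model can have the lowest MAE while also having the highest RMSE and a negative R². The numbers and the two "models" are hypothetical and are not taken from the datasets; this is only one possible explanation for the pattern in the comparison table.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R² from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {
        "MAE": np.mean(np.abs(errors)),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": 1.0 - ss_res / ss_tot,
    }


# Made-up nightly prices with one extreme listing.
y_true = [100, 120, 80, 110, 5000]

# Hypothetical model A: exact on typical listings, far off on the outlier.
pred_a = [100, 120, 80, 110, 100]
# Hypothetical model B: always predicts the mean price.
pred_b = [np.mean(y_true)] * len(y_true)

metrics_a = regression_metrics(y_true, pred_a)
metrics_b = regression_metrics(y_true, pred_b)
for name, metrics in (("A", metrics_a), ("B", metrics_b)):
    print(name, {key: round(float(value), 4) for key, value in metrics.items()})
```

Because MSE and R² square the errors, a single badly missed expensive listing dominates them, while MAE weights every listing equally.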

In [None]:
import pandas as pd
import numpy as np
import time
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib

# 🔹 Step 1: Combine the 6 datasets with explicit dtypes
file_paths = [
    "Datasets/data.csv", "Datasets/data1.csv", "Datasets/data2.csv",
    "Datasets/data3.csv", "Datasets/data4.csv", "Datasets/data5.csv"
]

df_list = [pd.read_csv(file, dtype={"room_type": str}, low_memory=False) for file in file_paths]
data = pd.concat(df_list, ignore_index=True)

# 🔹 Step 2: Select the required columns (without `neighbourhood`)
selected_features = ["latitude", "longitude", "minimum_nights", "room_type", "price"]
data = data[selected_features]

# 🔹 Step 3: Strip "$" and commas from `price`, then convert to float
data["price"] = data["price"].astype(str).str.replace(r'[\$,]', '', regex=True).astype(float)

# 🔹 Step 4: Handle Missing Values
data.fillna({
    "minimum_nights": data["minimum_nights"].median(),
    "room_type": "Unknown"
}, inplace=True)

# 🔹 Step 5: Encode the categorical feature (`room_type` only)
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoded_features = encoder.fit_transform(data[["room_type"]])

# 🔹 Step 6: Standardize the numeric features (`latitude`, `longitude`, `minimum_nights`)
numerical_features = ["latitude", "longitude", "minimum_nights"]
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data[numerical_features])

# 🔹 Step 7: Combine all features
X = np.hstack((encoded_features, scaled_features))
y = data["price"]

# 🔹 Step 8: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 🔹 Step 9: Hyperparameter tuning with GridSearchCV
param_grid = {
    "n_estimators": [50, 100, 200],  # Number of trees in the forest
    "max_depth": [10, 20, None],  # Maximum tree depth
    "min_samples_split": [2, 5, 10],  # Minimum samples required to split a node
    "min_samples_leaf": [1, 2, 4],  # Minimum samples in each leaf node
}

print("🔍 Performing GridSearchCV for Hyperparameter Tuning...")

grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           param_grid, cv=5, scoring="neg_mean_absolute_error",
                           verbose=1, n_jobs=-1)

start_time = time.time()
grid_search.fit(X_train, y_train)
train_time = time.time() - start_time

best_params = grid_search.best_params_
print(f"✅ Best Parameters Found: {best_params}")

# 🔹 Step 10: Train the best model with the optimal parameters
best_model = RandomForestRegressor(**best_params, random_state=42)
best_model.fit(X_train, y_train)

# 🔹 Step 11: Evaluate the model
y_pred = best_model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("\n🔹 Model Evaluation Metrics:")
print(f"📌 Mean Absolute Error (MAE): {mae:.2f}")
print(f"📌 Mean Squared Error (MSE): {mse:.2f}")
print(f"📌 Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"📌 R² Score: {r2:.4f}")
print(f"📌 Training Time: {train_time:.2f} seconds")

# 🔹 Step 12: Cross-validation to validate the model
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring="r2")

print(f"\n✅ Cross Validation R² Scores: {cv_scores}")
print(f"✅ Mean R² Score (Cross Validation): {np.mean(cv_scores):.4f}")

# 🔹 Step 13: Save the best model
joblib.dump(best_model, "best_price_model.pkl")
joblib.dump(encoder, "best_encoder.pkl")
joblib.dump(scaler, "best_scaler.pkl")

print("\n✅ Model training complete. Best model saved as 'best_price_model.pkl'.")


## Insights (2)

Because of CPU/GPU and RAM limitations on my machine, I tuned the model in Google Colab.

The tuning produced 3 .pkl files: "best_encoder.pkl", "best_scaler.pkl", and "best_price_model.pkl".
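
To make the resource cost concrete, the following sketch counts how many model fits the `GridSearchCV` call above requests. It only enumerates the grid from Step 9 and does not train anything; the variable names `n_candidates` and `n_fits` are illustrative.

```python
from itertools import product

# Same grid as Step 9 of the GridSearchCV cell.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
cv_folds = 5  # cv=5 in the GridSearchCV call

# Every combination of one value per parameter is one candidate model.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)

# Each candidate is trained once per cross-validation fold.
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With `refit=True` (the scikit-learn default), one more fit on the full training set follows the cross-validation fits. For comparison, the single untuned 100-tree Random Forest fit in the first cell took about 470 seconds.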

In [None]:
import pandas as pd
import numpy as np
import time
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib

# 🔹 Step 1: Combine the 6 datasets with explicit dtypes
file_paths = [
    "Datasets/data.csv", "Datasets/data1.csv", "Datasets/data2.csv",
    "Datasets/data3.csv", "Datasets/data4.csv", "Datasets/data5.csv"
]

df_list = [pd.read_csv(file, dtype={"room_type": str}, low_memory=False) for file in file_paths]
data = pd.concat(df_list, ignore_index=True)

# 🔹 Step 2: Select the required columns
selected_features = ["latitude", "longitude", "minimum_nights", "room_type", "price"]
data = data[selected_features]

# 🔹 Step 3: Strip "$" and commas from `price`, then convert to float
data["price"] = data["price"].astype(str).str.replace(r'[\$,]', '', regex=True).astype(float)

# 🔹 Step 4: Handle Missing Values
data.fillna({
    "minimum_nights": data["minimum_nights"].median(),
    "room_type": "Unknown"
}, inplace=True)

# 🔹 Step 5: Encode the categorical feature
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoded_features = encoder.fit_transform(data[["room_type"]])

# 🔹 Step 6: Standardize the numeric features
numerical_features = ["latitude", "longitude", "minimum_nights"]
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data[numerical_features])

# 🔹 Step 7: Combine all features
X = np.hstack((encoded_features, scaled_features))
y = data["price"]

# 🔹 Step 8: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 🔹 Step 9: Hyperparameter tuning with RandomizedSearchCV
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [10, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2]
}

print("🔍 Performing RandomizedSearchCV for Hyperparameter Tuning...")

random_search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                                   param_distributions=param_grid, 
                                   n_iter=10, cv=3, scoring="neg_mean_absolute_error",
                                   verbose=1, n_jobs=-1, random_state=42)

start_time = time.time()
random_search.fit(X_train, y_train)
train_time = time.time() - start_time

best_params = random_search.best_params_
print(f"✅ Best Parameters Found: {best_params}")

# 🔹 Step 10: Train the best model (warm_start=True has no effect on a single fit call)
best_model = RandomForestRegressor(**best_params, warm_start=True, random_state=42)
best_model.fit(X_train, y_train)

# 🔹 Step 11: Evaluate the model
y_pred = best_model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
print(f"📌 Mean Absolute Error (MAE): {mae:.2f}")
print(f"📌 Training Time: {train_time:.2f} seconds")

# 🔹 Step 12: Save the best model
joblib.dump(best_model, "best_price_model.pkl")
joblib.dump(encoder, "best_encoder.pkl")
joblib.dump(scaler, "best_scaler.pkl")

print("\n✅ Model training complete. Best model saved as 'best_price_model.pkl'.")


## Insights (3)

RandomizedSearchCV with a smaller parameter space is an easier and faster way to tune the Random Forest model and produce the final price prediction model.
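
A rough illustration of why this search is cheaper, assuming the parameter lists from Step 9 of the RandomizedSearchCV cell. Python's `random.sample` stands in for scikit-learn's internal sampler, so the specific combinations drawn here will not match those chosen by `RandomizedSearchCV`; only the counts carry over.

```python
import random
from itertools import product

# Same parameter lists as Step 9 of the RandomizedSearchCV cell.
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [10, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}
n_iter = 10   # n_iter=10 in the RandomizedSearchCV call
cv_folds = 3  # cv=3 in the RandomizedSearchCV call

# Enumerate the full space, then draw n_iter distinct candidates from it.
all_candidates = [
    dict(zip(param_distributions, values))
    for values in product(*param_distributions.values())
]
rng = random.Random(42)
sampled = rng.sample(all_candidates, k=n_iter)

random_fits = len(sampled) * cv_folds
grid_fits = 81 * 5  # the earlier grid search: 81 candidates, 5 folds

print(f"search space: {len(all_candidates)} candidates, sampled: {len(sampled)}")
print(f"randomized search fits: {random_fits}, grid search fits: {grid_fits}")
```

The trade-off is coverage: only 10 of the 16 combinations are tried, and the space itself is narrower than the earlier grid, so the best setting found may differ from what an exhaustive search would select.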

## Conclusion

<ul>
    <li>We use Random Forest as the AI price prediction model for Staychaintion properties.</li>
    <li>We tuned the hyperparameters of the Random Forest model to produce the final .pkl file.</li>
    <li>For the dataset, we combined several datasets and chose 4 attributes that are relevant to and influence price prediction.</li>
</ul>

## Future Improvement

<ul>
    <li>Use more accurate and reliable datasets.</li>
    <li>Use more attributes that influence property prices.</li>
    <li>Find a better model and tune it again on the new datasets.</li>
</ul>

## Datasets Reference

- https://www.kaggle.com/datasets/marcosgarcia75/air-bnb-dataset
- https://www.kaggle.com/datasets/vrindakallu/new-york-dataset
- https://www.kaggle.com/datasets/kritikseth/us-airbnb-open-data?select=AB_US_2023.csv
- https://www.kaggle.com/datasets/thedevastator/airbnbs-nyc-overview