ST 563 Project

By: Sanith Rao

Overview:
For this project, I analyze the New York City Airbnb Open Data, sourced from Kaggle. This dataset provides detailed information on Airbnb listings, including pricing, location, host details, and property characteristics. My goal is to build predictive models that estimate log-transformed price from listing features. The logarithmic transformation reduces the right skew of prices, so a few very expensive listings have less influence on the fit. Accurate price prediction is valuable for hosts setting competitive rates and for potential guests anticipating reasonable booking costs.

In [None]:
# Downloading pygam for Splines
!pip install pygam

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from pygam import LinearGAM, s
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import LinearSVR
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from scipy.sparse import issparse

Data Cleaning and Transformations:
I went through several steps to clean and transform the data so it could be used by the different models that I implemented. I dropped identifier and free-text columns (id, name, host_id, host_name, last_review) that add no value to the models. I filled missing values of reviews per month with 0 so that no missing values remained. The price variable was transformed to a log price, log(1 + price), to account for skewness, and to handle outliers I removed all data points with prices above the 99th percentile. All the categorical variables were one-hot encoded, a method I came across online, and the first level of each variable was dropped to avoid perfect multicollinearity among the dummy columns.
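
As a small illustration of the two transformations described above, the sketch below uses a made-up four-row table (the values are hypothetical, not taken from the Airbnb data) to show that expm1 undoes log1p and that drop_first=True leaves one fewer dummy column than there are categories.

```python
import numpy as np
import pandas as pd

# Toy listings with made-up values (assumption: not from the real dataset)
toy = pd.DataFrame({
    "price": [50, 120, 80, 4000],
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Entire home/apt"],
})

# log1p(price) = log(1 + price); expm1 is its inverse
toy["log_price"] = np.log1p(toy["price"])
recovered = np.expm1(toy["log_price"])

# drop_first=True drops the first level of each encoded column,
# so three room types produce two dummy columns
encoded = pd.get_dummies(toy, columns=["room_type"], drop_first=True)
dummy_cols = [c for c in encoded.columns if c.startswith("room_type_")]
print(dummy_cols)
```

The dropped level acts as the baseline category: a row with zeros in both remaining dummy columns belongs to it.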

In [None]:
# Reading the CSV file and printing its info to see what data cleaning and
# transformations need to be done.
df = pd.read_csv('/content/sample_data/AB_NYC_2019.csv')  # Add the path of the file
df.info()

In [None]:
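# Data cleaning & transformations

# Drop identifier and free-text columns that are not used as features
df.drop(columns=["id", "name", "host_id", "host_name", "last_review"], inplace=True)

# Missing reviews_per_month is filled with 0 (assignment avoids chained inplace modification)
df["reviews_per_month"] = df["reviews_per_month"].fillna(0)

# Log-transform the target: log1p(price) = log(1 + price), which is defined for price = 0
df["log_price"] = np.log1p(df["price"])

# Remove outliers: keep listings priced at or below the 99th percentile
price_cap = df["price"].quantile(0.99)
df = df[df["price"] <= price_cap]

# One-hot encode the categorical columns; drop_first=True drops one level per column
df = pd.get_dummies(df, columns=["neighbourhood_group", "neighbourhood", "room_type"], drop_first=True)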

df.head()


In [None]:
print(df.isnull().sum())  # Should show 0 for all columns

In [18]:
# Split data into train and test set
x = df.drop(columns=["price", "log_price"])  # Features
y = df["log_price"]  # Target variable

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

KNN Model:
K-Nearest Neighbors is a non-parametric, instance-based learning algorithm that predicts a value as the average of the k nearest training points. Its key tuning parameter, k (the number of neighbors), controls the bias-variance tradeoff: lower values lead to higher variance, while higher values smooth predictions and increase bias. KNN does not perform variable selection and is not suitable for inference since it does not learn explicit parameters. Standardizing predictors is essential, as the algorithm relies on distance metrics that are affected by differences in feature scales. Note that the cell below fits KNN on the unscaled features, so columns with a wide range are likely to dominate the distances. Grid search with 5-fold cross-validation was used to choose k from the candidate values.

After running, the kNN RMSE is about 93.9 on the original price scale (dollars). This is somewhat higher than the Ridge RMSE of about 81.3, which is on the same scale, but it cannot be compared directly with the Lasso, tree, random forest, and SVM results below, which are RMSEs of log price.
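
Since KNN depends on distances, one way to apply the standardization mentioned above is to put the scaler and the model in a single pipeline, so that each cross-validation fold is scaled using only its own training part. The following is a minimal sketch on synthetic data, not the notebook's actual run; the names X_demo, y_demo, knn_pipe, knn_grid, and raw_grid are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): feature 0 drives y, feature 1 is noise on a much larger scale
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(size=300), rng.normal(scale=1000, size=300)])
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=300)

grid = {"n_neighbors": [5, 10, 15]}

# Scaler and model tuned together; parameters of a pipeline step use the "step__param" naming
knn_pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsRegressor())])
knn_grid = GridSearchCV(knn_pipe, {"knn__n_neighbors": grid["n_neighbors"]},
                        cv=5, scoring="neg_mean_squared_error")
knn_grid.fit(X_demo, y_demo)

# Same search without scaling, for comparison
raw_grid = GridSearchCV(KNeighborsRegressor(), grid, cv=5, scoring="neg_mean_squared_error")
raw_grid.fit(X_demo, y_demo)

print(f"CV MSE with scaling:    {-knn_grid.best_score_:.3f}")
print(f"CV MSE without scaling: {-raw_grid.best_score_:.3f}")
```

In this toy setup the unscaled search picks neighbors almost entirely by the large-scale noise feature, which is why its cross-validated error is worse. Whether the same effect is as strong on the Airbnb features would have to be checked by rerunning the notebook.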

In [None]:
# Define the model
knn = KNeighborsRegressor()

# Define the grid of possible values for k
param_grid = {'n_neighbors': [5, 10, 15, 20, 25, 30]}

# Use GridSearchCV to find the best k using cross-validation
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Best k
best_k = grid_search.best_params_['n_neighbors']
print(f"Best k: {best_k}")

# Get the best model from GridSearchCV
best_knn = grid_search.best_estimator_

# Predict on the test set
y_pred_knn = best_knn.predict(X_test)

# Invert the log transformation for the predictions (log_price = log1p(price))
y_pred_knn_exp = np.expm1(y_pred_knn)  # expm1(x) = exp(x) - 1, the inverse of log1p

# Inverse the log transformation for the true values
y_test_exp = np.expm1(y_test)

# Calculate MSE and RMSE (on the original price scale)
mse_knn = mean_squared_error(y_test_exp, y_pred_knn_exp)
rmse_knn = np.sqrt(mse_knn)
print(f"kNN RMSE: {rmse_knn}")

Best k: 10
kNN RMSE: 93.89033735345937


Linear Regression with both Ridge and LASSO:
Ridge and Lasso regression are both parametric linear models that add a regularization penalty to prevent overfitting. Ridge regression (L2 regularization) shrinks coefficients toward zero but does not set them exactly to zero, whereas Lasso (L1 regularization) can set some coefficients exactly to zero, effectively performing variable selection. Both models have a key tuning parameter, alpha, which controls the strength of regularization: higher values increase regularization, reducing variance but increasing bias. These models can be used for inference, but Ridge retains all variables, while Lasso selects a subset. Standardizing predictors is recommended so that the penalty is applied fairly across all features; the cells below fit both models on the unscaled features. Cross-validation was used to choose alpha from the candidate values.

The two RMSEs reported below are not on the same scale: the Ridge cell converts predictions back to prices before scoring (RMSE of about 81.3 dollars), while the Lasso cell scores log price directly (RMSE of about 0.564 log units). The large gap between them therefore reflects the change of units and does not show that Lasso is more accurate or that features had to be removed. Both models would need to be scored on the same scale to be compared. Note also that the selected Lasso alpha, 0.1, is the smallest value in the grid, so smaller values may be worth trying.
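
The difference in variable selection between the two penalties can be seen on a small synthetic problem. The sketch below is an illustration under made-up assumptions (ten standardized features, only the first two related to the response, arbitrary alpha values); it is not a rerun of the cells in this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): only features 0 and 1 affect y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Standardize so the penalty treats every feature on the same footing
X_std = StandardScaler().fit_transform(X_demo)

ridge_demo = Ridge(alpha=1.0).fit(X_std, y_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_std, y_demo)

# Count coefficients that are exactly zero under each penalty
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
print(f"Ridge zero coefficients: {n_zero_ridge} of {X_std.shape[1]}")
print(f"Lasso zero coefficients: {n_zero_lasso} of {X_std.shape[1]}")
```

Ridge keeps every coefficient non-zero (only shrunk), while Lasso drops most of the uninformative features and keeps the informative ones with shrunken coefficients.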

In [None]:
# Linear Regression with Ridge

# Define the Ridge regression model
ridge = Ridge()

# Define the grid of possible values for alpha (regularization strength)
param_grid_ridge = {'alpha': [0.1, 1, 10, 100, 1000]}

# Use GridSearchCV to find the best alpha using cross-validation
grid_search_ridge = GridSearchCV(ridge, param_grid_ridge, cv=5, scoring='neg_mean_squared_error')
grid_search_ridge.fit(X_train, y_train)

# Best alpha
best_alpha_ridge = grid_search_ridge.best_params_['alpha']
print(f"Best alpha for Ridge Regression: {best_alpha_ridge}")

# Get the best model from GridSearchCV
best_ridge = grid_search_ridge.best_estimator_

# Predict on the test set
y_pred_ridge = best_ridge.predict(X_test)

# Inverse the log transformation for the predictions
y_pred_ridge_exp = np.expm1(y_pred_ridge)

# Inverse the log transformation for the true values
y_test_exp = np.expm1(y_test)

# Calculate MSE and RMSE
mse_ridge = mean_squared_error(y_test_exp, y_pred_ridge_exp)
rmse_ridge = np.sqrt(mse_ridge)
print(f"Ridge Regression RMSE: {rmse_ridge}")

Best alpha for Ridge Regression: 1
Ridge Regression RMSE: 81.3224124928855


In [None]:
# Linear regression with LASSO

# Define the model
lasso = Lasso()

# Define the grid of possible values for alpha (regularization strength)
param_grid = {'alpha': [0.1, 0.5, 1, 5, 10, 50, 100]}

# Use GridSearchCV to find the best alpha using cross-validation
grid_search_lasso = GridSearchCV(lasso, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search_lasso.fit(X_train, y_train)

# Best alpha
best_alpha_lasso = grid_search_lasso.best_params_['alpha']
print(f"Best alpha for Lasso Regression: {best_alpha_lasso}")

# Get the best model and evaluate
best_lasso = grid_search_lasso.best_estimator_
y_pred_lasso = best_lasso.predict(X_test)

# Calculate RMSE (on the log-price scale; the kNN and Ridge cells report RMSE in dollars)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
rmse_lasso = np.sqrt(mse_lasso)
print(f"Lasso Regression RMSE: {rmse_lasso}")

Best alpha for Lasso Regression: 0.1
Lasso Regression RMSE: 0.5641653464975939


Single Tree Model:
The decision tree regression model is a non-parametric model that recursively splits the data into increasingly homogeneous groups and predicts the mean response in each group. It has key tuning parameters, such as max_depth, which controls how deep the tree can grow: deeper trees capture more complexity but risk overfitting. Decision trees do not assume a specific functional form, making them useful for capturing non-linear relationships. A shallow tree is easy to read as a set of rules, but trees do not provide coefficients or standard errors, so they are generally not used for formal inference. Like Lasso regression, decision trees perform variable selection, but implicitly, by splitting only on the most informative features. Standardizing predictors is not required since decision trees are invariant to monotone rescaling of the features. Cross-validation was used to choose max_depth, balancing model complexity and predictive performance.

The tree's RMSE of about 0.427 is on the log-price scale and is lower than the Lasso RMSE of about 0.564 on the same scale, which suggests that more complex tree-based models could give more accurate results.
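
The role of max_depth described above can be illustrated by comparing training error with cross-validated error at a few depths. This is a minimal sketch on synthetic data (a noisy sine curve chosen for illustration); the depths and variable names are assumptions, not values from the notebook.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a noisy non-linear relationship in one feature
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=400)

results = {}
for depth in [2, 5, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    # Cross-validated MSE (scikit-learn reports it negated)
    cv_mse = -cross_val_score(tree, X_demo, y_demo, cv=5,
                              scoring="neg_mean_squared_error").mean()
    # Training MSE after fitting on all of the data
    train_pred = tree.fit(X_demo, y_demo).predict(X_demo)
    train_mse = np.mean((train_pred - y_demo) ** 2)
    results[depth] = (np.sqrt(train_mse), np.sqrt(cv_mse))
    print(f"max_depth={depth}: train RMSE={results[depth][0]:.3f}, CV RMSE={results[depth][1]:.3f}")
```

Training error keeps falling as the tree gets deeper, while the cross-validated error does not, which is why depth is chosen by cross-validation and not by training fit.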

In [None]:
# Single Tree Model

# Define the model
dt = DecisionTreeRegressor(random_state=42)

# Define hyperparameter grid
param_grid = {'max_depth': [3, 5, 10, 15, 20, None]}

# Use GridSearchCV to find the best depth using cross-validation
grid_search = GridSearchCV(dt, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Best depth
best_depth = grid_search.best_params_['max_depth']
print(f"Best depth: {best_depth}")

# Get the best model and evaluate
best_dt = grid_search.best_estimator_
y_pred_dt = best_dt.predict(X_test)

# Calculate RMSE
mse_dt = mean_squared_error(y_test, y_pred_dt)
rmse_dt = np.sqrt(mse_dt)
print(f"Decision Tree RMSE: {rmse_dt}")

Best depth: 10
Decision Tree RMSE: 0.42714883115952146


Random Forest Model:
The random forest regression model is an ensemble-based, non-parametric model that builds multiple decision trees on bootstrap samples and averages their predictions to improve accuracy and reduce overfitting. It has key tuning parameters, including n_estimators, the number of trees in the forest (more trees stabilize the average at a higher computational cost), and max_depth, the depth of each tree, which controls the complexity of the individual trees. While random forests provide strong predictive performance, they are not typically used for inference due to their black-box nature. The model performs implicit variable selection, and feature importance can be summarized by how much each variable's splits reduce the error across trees. Standardizing predictors is not necessary, as tree-based models are scale-invariant. Cross-validation was used to choose the hyperparameters from the candidate values.

Although we do not use random forest much for inference, its RMSE of about 0.399 on the log-price scale is the lowest among the models scored on that scale so far (Lasso and the single tree), indicating that tree-based models give the best results for this particular problem among those compared.
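
To make the idea of implicit variable selection concrete, the sketch below fits a small forest on synthetic data in which only one feature matters and reads scikit-learn's impurity-based feature_importances_. The data and settings are made up for illustration and are not the notebook's.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): only feature 0 is related to y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.2, size=500)

rf_demo = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances: one value per feature, normalized to sum to 1
importances = rf_demo.feature_importances_
top_feature = int(np.argmax(importances))
for i, imp in enumerate(importances):
    print(f"feature {i}: importance {imp:.3f}")
```

On the Airbnb data the same attribute could be read from best_rf after the grid search, paired with X_train.columns to label each value.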

In [None]:
# Implementing Random Forest as the ensemble tree model

# Define the model
rf = RandomForestRegressor(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15]
}

# Grid search with cross-validation (n_jobs=-1 uses all available CPU cores to shorten the run)
grid_search_rf = GridSearchCV(rf, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search_rf.fit(X_train, y_train)

# Best model
best_rf = grid_search_rf.best_estimator_

# Predictions
y_pred_rf = best_rf.predict(X_test)

# RMSE Calculation
mse_rf = mean_squared_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mse_rf)

print(f"Best parameters: {grid_search_rf.best_params_}")
print(f"Random Forest RMSE: {rmse_rf}")

Note: I ran this block of code to completion once and got the following results:
Best parameters: {'max_depth': 15, 'n_estimators': 200}
Random Forest RMSE: 0.3986131908193861

I tried to run it again before submitting, but it was taking too long, which is why I'm pasting my results in a text box.

Support Vector Machine:
Support vector regression fits a function that tolerates small errors and penalizes larger ones, while keeping the coefficients small. With a linear kernel, as used here, it is a parametric linear model; with non-linear kernels it behaves non-parametrically. The key tuning parameter, C, controls the trade-off between model complexity and error tolerance: a higher C penalizes training errors more heavily, which reduces bias but may lead to overfitting. SVMs are not commonly used for inference, and they do not perform variable selection directly. Standardizing predictors is essential, as SVMs are sensitive to feature scales. In this implementation, a linear kernel (via LinearSVR) was chosen for efficiency, and a random subsample of 10,000 training rows was used to improve training speed. Hyperparameter tuning was performed using 3-fold cross-validation.

From the result, the SVM RMSE of about 0.454 (log-price scale) is better than the Lasso RMSE but not as good as those of the tree models.
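
A possible variation on the cell below is to place the scaler and LinearSVR in one pipeline, so that the scaling is refit inside each cross-validation fold. The sketch uses synthetic data with features on very different scales; the names and values are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

# Synthetic data (assumption): four features whose scales differ by factors of ten
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = X_demo[:, 0] + 0.01 * X_demo[:, 2] + rng.normal(scale=0.1, size=300)

# Scaler and model are tuned together; C is addressed as "svr__C"
svr_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svr", LinearSVR(random_state=0, max_iter=10000)),
])
svr_grid = GridSearchCV(svr_pipe, {"svr__C": [0.1, 1, 10]}, cv=3,
                        scoring="neg_mean_squared_error")
svr_grid.fit(X_demo, y_demo)

# A linear SVR has one coefficient per (scaled) feature
coefs = svr_grid.best_estimator_.named_steps["svr"].coef_
print(f"Best C: {svr_grid.best_params_['svr__C']}")
print(f"Coefficients: {np.round(coefs, 3)}")
```

Because the fitted model exposes one coefficient per feature, it is a linear model in the scaled features, which is what makes LinearSVR much faster than a kernel SVM on a dataset of this size.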

In [None]:
# Support Vector Machine

# Subsample for speed (use ~10,000 rows)
X_train_sample, _, y_train_sample, _ = train_test_split(X_train, y_train, train_size=10000, random_state=42)

# Feature scaling (critical for SVMs)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_sample)
X_test_scaled = scaler.transform(X_test)

# Use LinearSVR (much faster)
svm = LinearSVR(random_state=42, max_iter=10000)

# Smaller hyperparameter grid for speed
param_grid = {'C': [0.1, 1, 10]}

# Grid search with cv=3 (faster than cv=5)
grid_search_svm = GridSearchCV(svm, param_grid, cv=3, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)
grid_search_svm.fit(X_train_scaled, y_train_sample)

# Best model
best_svm = grid_search_svm.best_estimator_

# Predictions
y_pred_svm = best_svm.predict(X_test_scaled)

# RMSE Calculation
rmse_svm = np.sqrt(mean_squared_error(y_test, y_pred_svm))

print(f"Best parameters: {grid_search_svm.best_params_}")
print(f"SVM RMSE: {rmse_svm}")

Fitting 3 folds for each of 3 candidates, totalling 9 fits
Best parameters: {'C': 0.1}
SVM RMSE: 0.45366025696105317


After completing all the models, the most accurate model among those scored on the log-price scale was the random forest (RMSE of about 0.399), followed by the single decision tree (0.427), the linear SVM (0.454), and Lasso (0.564). The kNN and Ridge RMSEs (about 93.9 and 81.3) are in dollars, so they cannot be ranked against these without rescoring all models on a common scale.
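
Because some cells above score predictions in dollars and others in log price, a small helper that reports both would make the models directly comparable. The function below is a hypothetical sketch (rmse_both_scales is not part of the notebook), shown on three made-up prices.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse_both_scales(y_true_log, y_pred_log):
    """Return (RMSE of log1p(price), RMSE of price in dollars)."""
    rmse_log = float(np.sqrt(mean_squared_error(y_true_log, y_pred_log)))
    # expm1 inverts log1p, taking both arrays back to the price scale
    rmse_price = float(np.sqrt(mean_squared_error(np.expm1(y_true_log),
                                                  np.expm1(y_pred_log))))
    return rmse_log, rmse_price

# Made-up prices and predictions (assumption), converted to log1p as in the notebook
prices_true = np.array([50.0, 100.0, 200.0])
prices_pred = np.array([60.0, 90.0, 250.0])
rmse_log, rmse_price = rmse_both_scales(np.log1p(prices_true), np.log1p(prices_pred))
print(f"RMSE (log scale): {rmse_log:.3f}")
print(f"RMSE (dollars):   {rmse_price:.2f}")
```

For these three values the dollar RMSE is about 30 while the log-scale RMSE is well below 1, which shows how different the two numbers can look for the same predictions. Calling the helper with y_test and each model's log-scale predictions would put all six models on both scales.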

In [None]:
# Define the Random Forest model with the best hyperparameters
rf_final = RandomForestRegressor(
    n_estimators=200,  # Best n_estimators
    max_depth=15,       # Best max_depth
    random_state = 42
)

# Combine the training and testing data (X_train + X_test and y_train + y_test)
X_full = pd.concat([X_train, X_test])
y_full = pd.concat([y_train, y_test])

# Fit the model on the entire dataset
rf_final.fit(X_full, y_full)

# Predict on the same data the model was trained on
y_pred_full = rf_final.predict(X_full)

# In-sample RMSE (log-price scale); this is optimistic and is not an estimate of test error
mse_full = mean_squared_error(y_full, y_pred_full)
rmse_full = np.sqrt(mse_full)

# Print the result
print("Random Forest Model refitted on the entire dataset")
print(f"In-sample RMSE on the full dataset: {rmse_full}")
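
The RMSE in the cell above is computed on the same rows the model was fit on, so it will usually look better than the held-out RMSE reported earlier. The sketch below illustrates the gap on synthetic data; the data, settings, and names are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): a noisy non-linear signal in feature 0
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 5))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.5, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
rf_demo = RandomForestRegressor(n_estimators=100, max_depth=15, random_state=0)
rf_demo.fit(X_tr, y_tr)

# In-sample error (rows seen during fitting) versus held-out error
rmse_in = float(np.sqrt(mean_squared_error(y_tr, rf_demo.predict(X_tr))))
rmse_out = float(np.sqrt(mean_squared_error(y_te, rf_demo.predict(X_te))))
print(f"In-sample RMSE: {rmse_in:.3f}")
print(f"Held-out RMSE:  {rmse_out:.3f}")
```

For reporting, the held-out RMSE from the earlier random forest cell is the fairer estimate of how the refitted model would do on new listings.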