# Airbnb NYC Price Prediction Model

## Introduction
This notebook presents the process of creating a predictive model for Airbnb listing prices in New York City. After the initial data exploration and pre-processing steps performed in previous notebooks, we now proceed to the modeling phase.

## Objectives
The goal is to create a model that accurately predicts the price of an Airbnb listing based on features we deem significant through feature selection methods. This model could be beneficial for both hosts and guests, enabling them to understand pricing trends and make informed decisions.

## Methodology
We start by further pre-processing our data for modeling. This includes encoding categorical variables, scaling numerical features, and performing feature selection to choose the most relevant features for our model.

We then build several candidate predictive models, including Random Forest Regression, Support Vector Regression, and Linear Regression. Each of these models is trained on our dataset, and we evaluate their performance using cross-validation techniques.

Model performance is measured by root mean square error (RMSE). The model with the lowest RMSE is selected as our final model and saved for future use.
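As a quick illustration of the metric, the sketch below computes RMSE by hand with NumPy. The helper name `rmse` and the three prices are hypothetical and are not taken from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical nightly prices in dollars, for illustration only
actual = [100, 150, 200]
predicted = [110, 140, 230]

example_rmse = rmse(actual, predicted)
print(example_rmse)
```

Because the residuals are squared before averaging, the single 30-dollar miss dominates the two 10-dollar misses, which is why RMSE penalises large errors more heavily than mean absolute error does.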

We also include a section that demonstrates how to use our saved model to make predictions on unseen data.

In [128]:
import pandas as pd

In [129]:
# Reading the data
listing_dataframe = pd.read_csv("cleaned_listing_data.csv")

In [130]:
listing_dataframe.dtypes

id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
number_of_reviews_ltm               int64
reviewed                            int64
dtype: object

Columns such as host_name, name, and neighbourhood have high cardinality and offer little predictive power, so we drop them before modelling. Similarly, host_id and id are identifiers with little predictive power.
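One way to check the cardinality claim is to compare the number of distinct values in each text column with the number of rows. The sketch below uses a toy frame with made-up values; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset
toy = pd.DataFrame({
    "name": ["Cozy loft", "Sunny room", "Quiet studio", "Park view"],
    "host_name": ["Ana", "Ben", "Ana", "Cy"],
    "room_type": ["Private room", "Private room", "Entire home/apt", "Private room"],
})

# Share of distinct values per column: values near 1.0 mean almost every row is unique
cardinality = toy.nunique() / len(toy)
print(cardinality.sort_values(ascending=False))
```

A column whose ratio is close to 1.0 would produce nearly one dummy column per row if one-hot encoded, which is the practical reason for dropping it.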

In [131]:
X = listing_dataframe.drop(['id', 'host_id', 'name', 'host_name', 'neighbourhood', 'price'],axis=1)
y = listing_dataframe['price']

In [132]:
X.head()

Unnamed: 0,neighbourhood_group,latitude,longitude,room_type,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,reviewed
0,Manhattan,40.75356,-73.98559,Entire home/apt,30,49,0.3,3,314,1,1
1,Brooklyn,40.68535,-73.95512,Private room,30,50,0.3,2,365,0,1
2,Manhattan,40.8038,-73.96751,Private room,2,118,0.72,1,0,0,1
3,Manhattan,40.76457,-73.98317,Private room,2,575,3.41,1,106,52,1
4,Brooklyn,40.66265,-73.99454,Entire home/apt,60,3,0.03,1,181,1,1


### Converting categorical values to a one-hot encoding the model can use

In [133]:
# Converting neighbourhood_group and room_type to one-hot encoding
encoded_neighbourhood_group = pd.get_dummies(X["neighbourhood_group"], prefix="neighbourhood_group")
encoded_room_type = pd.get_dummies(X["room_type"], prefix="room_type")

In [134]:
# adding the one-hot values to the dataframe
X = pd.concat([X, encoded_neighbourhood_group, encoded_room_type], axis=1 )

In [135]:
# dropping the original categorical columns
X.drop(["neighbourhood_group","room_type"], axis=1, inplace=True)

In [136]:
X.columns

Index(['latitude', 'longitude', 'minimum_nights', 'number_of_reviews',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365', 'number_of_reviews_ltm', 'reviewed',
       'neighbourhood_group_Bronx', 'neighbourhood_group_Brooklyn',
       'neighbourhood_group_Manhattan', 'neighbourhood_group_Queens',
       'neighbourhood_group_Staten Island', 'room_type_Entire home/apt',
       'room_type_Hotel room', 'room_type_Private room',
       'room_type_Shared room'],
      dtype='object')

In [137]:
X.head()

Unnamed: 0,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,reviewed,neighbourhood_group_Bronx,neighbourhood_group_Brooklyn,neighbourhood_group_Manhattan,neighbourhood_group_Queens,neighbourhood_group_Staten Island,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,room_type_Shared room
0,40.75356,-73.98559,30,49,0.3,3,314,1,1,False,False,True,False,False,True,False,False,False
1,40.68535,-73.95512,30,50,0.3,2,365,0,1,False,True,False,False,False,False,False,True,False
2,40.8038,-73.96751,2,118,0.72,1,0,0,1,False,False,True,False,False,False,False,True,False
3,40.76457,-73.98317,2,575,3.41,1,106,52,1,False,False,True,False,False,False,False,True,False
4,40.66265,-73.99454,60,3,0.03,1,181,1,1,False,True,False,False,False,True,False,False,False
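The encode, concatenate, and drop steps can also be done in a single call by passing `columns=` to `pd.get_dummies`. This is an alternative sketch on a toy frame with made-up rows, not the approach used in the cells of this notebook.

```python
import pandas as pd

# Toy frame; values are illustrative only
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Manhattan"],
    "room_type": ["Entire home/apt", "Private room", "Private room"],
    "minimum_nights": [30, 30, 2],
})

# Encodes the listed columns, removes the originals, and keeps the rest
encoded = pd.get_dummies(toy, columns=["neighbourhood_group", "room_type"])
print(encoded.columns.tolist())
```

Note that `get_dummies` only creates columns for categories present in the data it is given, so any new data must be aligned to the training columns before prediction.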


In [138]:
from sklearn.model_selection import train_test_split
X_train,X_test, y_train,y_test = train_test_split(X,y, test_size=0.3,random_state=0)

## Feature Selection
We use random forest feature importances and pairwise correlation to select features for our model.

In [139]:
from sklearn.ensemble import RandomForestRegressor

In [140]:
# creating a simple model for calculating feature importances
rf = RandomForestRegressor(n_estimators=100, random_state=0)

In [141]:
rf.fit(X_train,y_train)

In [142]:
# Get the feature importances
importances = rf.feature_importances_

In [143]:
# Convert the importances into a pandas Series indexed by the corresponding column names
f_importances = pd.Series(importances, index=X.columns)

In [144]:
f_importances

latitude                             0.147414
longitude                            0.188003
minimum_nights                       0.073829
number_of_reviews                    0.043890
reviews_per_month                    0.062503
calculated_host_listings_count       0.084440
availability_365                     0.069490
number_of_reviews_ltm                0.026987
reviewed                             0.002960
neighbourhood_group_Bronx            0.000448
neighbourhood_group_Brooklyn         0.001982
neighbourhood_group_Manhattan        0.001760
neighbourhood_group_Queens           0.001760
neighbourhood_group_Staten Island    0.000047
room_type_Entire home/apt            0.290458
room_type_Hotel room                 0.001309
room_type_Private room               0.001524
room_type_Shared room                0.001196
dtype: float64

In [145]:
# Categorical features that have been one_hot_encoded
categorical_features = ['neighbourhood_group', 'room_type']

In [146]:
# showing the total importance of categorical features
for feature in categorical_features:
    one_hot_columns = [col for col in f_importances.index if col.startswith(feature)]
    total_importance = f_importances[one_hot_columns].sum()
    print(f'Total importance for {feature}: {total_importance}')

Total importance for neighbourhood_group: 0.005997663767217159
Total importance for room_type: 0.2944879149803341


In [147]:
# selecting features with importance of at least 0.01
selected_features_from_rf = f_importances[f_importances >=0.01]

In [148]:
selected_features_from_rf

latitude                          0.147414
longitude                         0.188003
minimum_nights                    0.073829
number_of_reviews                 0.043890
reviews_per_month                 0.062503
calculated_host_listings_count    0.084440
availability_365                  0.069490
number_of_reviews_ltm             0.026987
room_type_Entire home/apt         0.290458
dtype: float64

In [149]:
import numpy as np

In [150]:

# Calculate the correlation matrix
correlation_matrix = X_train.corr()

# Create a mask for the upper triangle
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

# Apply the mask to the correlation matrix
correlation_matrix_upper = correlation_matrix.where(mask)

# Find pairs of features that have a correlation higher than 0.8
highly_correlated_pairs = {}

for col in correlation_matrix_upper.columns:
    above_threshold_vars = correlation_matrix_upper.loc[(correlation_matrix_upper[col] > 0.8) & (correlation_matrix_upper[col] < 1), col].index.tolist()
    if above_threshold_vars:
        highly_correlated_pairs[col] = above_threshold_vars

# Print pairs of highly correlated features
for key, value in highly_correlated_pairs.items():
    print(f"{key} is highly correlated with {value}")

number_of_reviews_ltm is highly correlated with ['reviews_per_month']
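The check above only flags positive correlations above 0.8. A variant that also catches strongly negative correlations takes the absolute value and uses the strict upper triangle (`k=1`) so the diagonal never needs filtering. The sketch below runs on synthetic columns `a` to `d`, which are assumptions for illustration and not dataset features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.1, size=200),  # nearly a scaled copy of a
    "c": -base + rng.normal(scale=0.1, size=200),     # strongly negative with a
    "d": rng.normal(size=200),                        # unrelated noise
})

# Absolute correlations, keeping only entries strictly above the diagonal
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# NaN entries (masked cells) compare False, so they are skipped automatically
high_pairs = [
    (row, col, round(float(upper.loc[row, col]), 3))
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.8
]
print(high_pairs)
```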


Based on the random forest regressor, we select reviews_per_month, number_of_reviews_ltm, minimum_nights, longitude, latitude, calculated_host_listings_count, and availability_365. Since number_of_reviews_ltm and reviews_per_month are highly correlated, we keep only reviews_per_month, which has the greater importance in the random forest regressor. After one-hot encoding, neighbourhood_group receives very little importance (about 0.006 in total), but it may have a more complex relationship with price that this measure does not capture, so we keep both categorical variables, room_type and neighbourhood_group.

Thus, the variables we select are reviews_per_month, minimum_nights, longitude, latitude, calculated_host_listings_count, availability_365, neighbourhood_group, and room_type.
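The rule of keeping the more important feature of a correlated pair can be written down in a few lines. This is a sketch of that rule only; the two importance values are copied from the random forest output earlier in the notebook, and the variable names are hypothetical.

```python
import pandas as pd

# Importances copied from the random forest output earlier in the notebook
importances = pd.Series({
    "reviews_per_month": 0.062503,
    "number_of_reviews_ltm": 0.026987,
})

# Correlated pair reported by the correlation check
correlated_pairs = [("number_of_reviews_ltm", "reviews_per_month")]

# For each pair, mark the feature with the lower importance for removal
to_drop = set()
for left, right in correlated_pairs:
    weaker = left if importances[left] < importances[right] else right
    to_drop.add(weaker)

print(to_drop)
```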


In [151]:
# selecting the variables to use
variables_to_use = ['latitude', 'longitude', 'minimum_nights',
                     'reviews_per_month', 'calculated_host_listings_count',
                     'availability_365', 'neighbourhood_group_Bronx', 'neighbourhood_group_Brooklyn',
                     'neighbourhood_group_Manhattan', 'neighbourhood_group_Queens',
                     'neighbourhood_group_Staten Island', 'room_type_Entire home/apt',
                     'room_type_Hotel room', 'room_type_Private room',
                     'room_type_Shared room']

In [152]:
X_train = X_train[variables_to_use]
X_test = X_test[variables_to_use]

## Creating pipelines for training various models

In [153]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

In [154]:
# Describing the numerical features

In [155]:
# numerical features selected
numeric_features = ['latitude', 'longitude', 'minimum_nights',
                    'reviews_per_month', 'calculated_host_listings_count',
                    'availability_365']
# one hot encoding for categorical values
categorical_features_onehot = ['neighbourhood_group_Bronx', 'neighbourhood_group_Brooklyn',
                               'neighbourhood_group_Manhattan', 'neighbourhood_group_Queens',
                               'neighbourhood_group_Staten Island', 'room_type_Entire home/apt',
                               'room_type_Hotel room', 'room_type_Private room',
                               'room_type_Shared room']

In [156]:
# transformer pipeline for adding scaling to numerical columns
numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])

In [157]:
# preprocessor which will apply scaling to numerical values and leave the one hot encoding as it is
preprocessor = ColumnTransformer(
    transformers=[
        ('num',numeric_transformer, numeric_features),
        ('cat_onehot', 'passthrough', categorical_features_onehot)
    ]
)
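To see what this kind of preprocessor produces, the sketch below applies the same pattern (MinMaxScaler on numeric columns, passthrough on one-hot columns) to a three-row toy frame. The values and the name `toy_preprocessor` are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Toy frame; values are illustrative only
toy = pd.DataFrame({
    "minimum_nights": [1, 30, 60],
    "availability_365": [0, 365, 181],
    "room_type_Private room": [1, 0, 0],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", MinMaxScaler(), ["minimum_nights", "availability_365"]),
    ("cat_onehot", "passthrough", ["room_type_Private room"]),
])

# Numeric columns are rescaled to [0, 1]; the one-hot column is left unchanged
transformed = toy_preprocessor.fit_transform(toy)
print(transformed)
```

The output columns follow the order of the `transformers` list, so scaled numeric columns come first and passthrough columns last. Columns not named in any transformer are dropped by default.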

In [158]:
# Random Forest Pipeline
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor())
])

In [159]:
# Support Vector Regression pipeline
svr_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', SVR())
])

In [160]:
# Linear Regression
from sklearn.linear_model import LinearRegression

lr_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('regressor', LinearRegression())])

## Model Selection and Evaluation

In [161]:
from sklearn.model_selection import cross_val_score


In [162]:
pipelines = [ rf_pipeline, svr_pipeline, lr_pipeline]
pipeline_names = ["Random Forest", "Support Vector Regression", "Linear Regression"]

In [163]:
# cross-validating each pipeline and reporting its mean RMSE across the 5 folds
for i, pipe in enumerate(pipelines):
    cv_score = cross_val_score(pipe, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    print(f'{pipeline_names[i]} RMSE value: {np.mean(np.sqrt(np.abs(cv_score)))}')

Random Forest RMSE value: 54.07053058324196
Support Vector Regression RMSE value: 64.51134253580825
Linear Regression RMSE value: 62.8188610610812
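scikit-learn also provides a `neg_root_mean_squared_error` scorer, which returns the per-fold RMSE directly (negated) and avoids the manual square root. The sketch below uses synthetic data and a plain LinearRegression as stand-ins; it is not a rerun of the comparison above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem with noise of standard deviation 0.5
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5,
                         scoring="neg_root_mean_squared_error")

# Scores are negated so that higher is better; flip the sign to get RMSE
fold_rmse = -scores
print(fold_rmse.mean(), fold_rmse.std())
```

Reporting the standard deviation across folds alongside the mean gives a sense of whether the gap between two models is larger than the fold-to-fold variation.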


## Conclusion

The goal of this project was to understand the factors that affect the price of Airbnb listings in NYC and to build a model that could predict the price of a listing based on these factors.

The variables that we found to be most significant in influencing the price were 'latitude', 'longitude', 'minimum_nights', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365', as well as one-hot encoded variables representing 'neighbourhood_group' and 'room_type'.

Various machine learning models were evaluated, including Linear Regression, Support Vector Regression, and Random Forest. The performance of the models was assessed using root mean square error (RMSE), a common metric for regression problems.

Random Forest performed the best among the models with a cross-validated RMSE of about 54.07, the lowest of the three. Thus, this model was selected for predicting Airbnb listing prices.

To finalize and save the trained model, we can use joblib's dump function:


In [164]:
from joblib import dump

In [165]:
rf_pipeline.fit(X_train, y_train)
# saving the model
dump(rf_pipeline, 'airbnb_price_predictor_pipeline.joblib')

['airbnb_price_predictor_pipeline.joblib']

In this script:
1. We used the rf_pipeline, which applies the preprocessor to the data and then applies the RandomForestRegressor.
2. We trained the entire pipeline using the fit method.
3. Then, we saved the pipeline to a file named 'airbnb_price_predictor_pipeline.joblib'.

When it's time to make predictions with new data, we can load this pipeline, and it will ensure that the input data is preprocessed in the same way as the training data before making a prediction.
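A simple sanity check after saving is to reload the file and confirm that the restored pipeline gives the same predictions as the original. The sketch below does this with a small stand-in pipeline on synthetic data, written to a temporary directory; the names `toy_pipeline` and `toy_pipeline.joblib` are hypothetical.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the listing features and prices
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(50, 2))
y_toy = 100 + 50 * X_toy[:, 0] - 20 * X_toy[:, 1]

toy_pipeline = Pipeline(steps=[
    ("scaler", MinMaxScaler()),
    ("regressor", LinearRegression()),
]).fit(X_toy, y_toy)

# Save and reload inside a temporary directory so no file is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    dump(toy_pipeline, path)
    restored = load(path)

same_predictions = np.allclose(toy_pipeline.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Joblib files are pickle-based, so they should be loaded with a scikit-learn version compatible with the one used for saving, and only from trusted sources.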

## Example of loading and running the model on unseen data

In [166]:
from joblib import load

In [167]:
# load the pipeline
loaded_pipeline = load('airbnb_price_predictor_pipeline.joblib')

In [168]:
# example of unseen data
unseen_data = pd.DataFrame({
    'latitude': [40.64749],
    'longitude': [-73.97237],
    'minimum_nights': [1],
    'reviews_per_month': [0.21],
    'calculated_host_listings_count': [6],
    'availability_365': [365],
    'neighbourhood_group_Bronx': [0],
    'neighbourhood_group_Brooklyn': [1],
    'neighbourhood_group_Manhattan': [0],
    'neighbourhood_group_Queens': [0],
    'neighbourhood_group_Staten Island': [0],
    'room_type_Entire home/apt': [1],
    'room_type_Hotel room': [0],
    'room_type_Private room': [0],
    'room_type_Shared room': [0]
})

In [169]:
# predict on the unseen data
predicted_price = loaded_pipeline.predict(unseen_data)

In [170]:
print(f'The predicted price for the Airbnb listing is: {predicted_price[0]}')

The predicted price for the Airbnb listing is: 213.29
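In practice, new listings usually arrive with raw `neighbourhood_group` and `room_type` values rather than ready-made one-hot columns. The sketch below shows one possible way to build a compatible row; `encode_listing` is a hypothetical helper and not part of the saved pipeline, and the column list is copied from `variables_to_use`.

```python
import pandas as pd

# Column names expected by the saved pipeline, copied from variables_to_use
expected_columns = [
    "latitude", "longitude", "minimum_nights", "reviews_per_month",
    "calculated_host_listings_count", "availability_365",
    "neighbourhood_group_Bronx", "neighbourhood_group_Brooklyn",
    "neighbourhood_group_Manhattan", "neighbourhood_group_Queens",
    "neighbourhood_group_Staten Island", "room_type_Entire home/apt",
    "room_type_Hotel room", "room_type_Private room", "room_type_Shared room",
]

def encode_listing(raw: dict) -> pd.DataFrame:
    """Hypothetical helper: one-hot encode a raw listing to match the training columns."""
    frame = pd.get_dummies(pd.DataFrame([raw]),
                           columns=["neighbourhood_group", "room_type"])
    # Categories absent from this row become zero columns; unknown ones are dropped
    return frame.reindex(columns=expected_columns, fill_value=0).astype(float)

row = encode_listing({
    "latitude": 40.64749, "longitude": -73.97237, "minimum_nights": 1,
    "reviews_per_month": 0.21, "calculated_host_listings_count": 6,
    "availability_365": 365, "neighbourhood_group": "Brooklyn",
    "room_type": "Entire home/apt",
})
print(row.iloc[0])
```

The resulting one-row frame has the same columns as the `unseen_data` example and could be passed to `loaded_pipeline.predict`.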
