# Accommodation Price Prediction - Modelling and Predictions

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv('../cleaning/accommodation.csv')
df.head()

   Source   Location  Number of Beds        Type  Price (HKD)
0   Hotel       Kita             NaN   Apartment        589.0
1   Hotel      Taito             2.0  Hotel Room        621.0
2   Hotel  Shinagawa             NaN   Apartment       1807.0
3   Hotel     Sumida             NaN   Apartment        811.0
4   Hotel      Taito             1.0  Hotel Room        378.0


In [4]:
df.dtypes

Source             object
Location           object
Number of Beds    float64
Type               object
Price (HKD)       float64
dtype: object

In [5]:
df.isna().sum()

Source             0
Location           0
Number of Beds    75
Type               0
Price (HKD)        0
dtype: int64

Separating the input features from the target

In [6]:
X_train = df.drop(columns='Price (HKD)')
y_train = df['Price (HKD)']

Creating a pipeline to transform the different feature types.

- Categorical variables are transformed with a one-hot encoding scheme.
- Missing numerical data is imputed using the median of the variable, and all numerical data is scaled using a standard scaler.

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

cat_var = ['Source', 'Location', 'Type']

var_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

full_pipeline = ColumnTransformer([
    ('oh_encoder', OneHotEncoder(), cat_var),
    ('imputer', var_pipeline, ['Number of Beds'])
])

In [8]:
X_train_prepared = full_pipeline.fit_transform(X_train)

In [9]:
X_train_prepared.shape

(1300, 42)

### Model building and model comparison

Four candidate models are considered as regressors for predicting accommodation prices.

1. Linear Regression
- Linear regression models the target as a linear combination of the features. Because it applies no regularization, it may overfit when there are many one-hot encoded features. It is included here as a simple baseline for comparison.

2. Elastic Net Regression
- Elastic Net regression is a linear model whose loss function includes both L1 (Lasso) and L2 (Ridge) penalties, so it combines the characteristics of both regularization methods and, when tuned, often performs at least as well as either one alone. In scikit-learn's `ElasticNet`, `alpha` controls the overall penalty strength (and hence the model complexity), while `l1_ratio` controls the mix between the L1 and L2 penalties.

3. Support Vector Regression
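- Support vector regression applies the support vector machine principle to regression. Instead of maximizing the margin between classes, it fits a function that keeps as many training points as possible within a margin of epsilon around its predictions, and penalizes only the points that fall outside that margin.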

4. Gradient Boosting Regression
- Gradient Boosting regression is an ensemble method that trains decision trees sequentially and sums their outputs for prediction. Each new tree is fitted to the residuals of the current ensemble, that is, the difference between the target values and the predictions made so far (for squared-error loss; more generally, the negative gradient of the loss).
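
The residual-fitting idea behind gradient boosting can be sketched by hand for squared-error loss. This is a simplified illustration on synthetic data (the `X_boost` and `y_boost` arrays, the tree depth, and the learning rate are all assumptions), not the exact algorithm implemented by `GradientBoostingRegressor`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (illustrative only)
rng = np.random.default_rng(0)
X_boost = rng.uniform(0, 10, size=(200, 1))
y_boost = np.sin(X_boost[:, 0]) * 100 + rng.normal(0, 10, size=200)

learning_rate = 0.1
n_stages = 50

# Stage 0: a constant prediction (the mean minimises squared error)
prediction = np.full_like(y_boost, y_boost.mean())
train_mse = [np.mean((y_boost - prediction) ** 2)]
trees = []

for _ in range(n_stages):
    # Residuals of the current ensemble are the targets for the next tree
    residuals = y_boost - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_boost, residuals)
    # Add a shrunken version of the new tree's output to the ensemble
    prediction += learning_rate * tree.predict(X_boost)
    trees.append(tree)
    train_mse.append(np.mean((y_boost - prediction) ** 2))

print(f'training MSE before boosting: {train_mse[0]:.1f}')
print(f'training MSE after {n_stages} stages: {train_mse[-1]:.1f}')
```

The training error shrinks at every stage because each tree corrects part of what the previous trees got wrong; the learning rate trades off the size of each correction against the number of trees needed.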


A dummy regressor that always predicts the mean of the training prices is also included as a naive baseline for comparison.
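
As a quick illustration of what the dummy baseline does, the sketch below fits `DummyRegressor(strategy='mean')` on the first four prices shown in the `df.head()` output. The placeholder feature arrays are assumptions; the dummy regressor ignores the features entirely.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# First four prices from the df.head() output; features are ignored by the model
prices = np.array([589.0, 621.0, 1807.0, 811.0])
placeholder_features = np.zeros((4, 1))

baseline = DummyRegressor(strategy='mean').fit(placeholder_features, prices)

# Every prediction is the mean of the training prices, whatever the input
baseline_pred = baseline.predict(np.ones((2, 1)))
print(baseline_pred)
```

Any model worth keeping should achieve a lower error than this constant prediction.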

In [13]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.dummy import DummyRegressor

lin_reg = LinearRegression()
elastic_reg = ElasticNet(alpha=1, l1_ratio=0.9)
svm_reg = SVR(C=10)
dummy_reg = DummyRegressor(strategy='mean')


regressors = {'Linear Regressor': lin_reg, 'Elastic Net Regressor': elastic_reg, 'SVM Regressor': svm_reg, 'Dummy Regressor': dummy_reg}

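# 5-fold cross-validation; scikit-learn returns the negated MAE, so the sign is flipped when printing
for name, regressor in regressors.items():
    scores = cross_val_score(regressor, X_train_prepared, y_train, scoring='neg_mean_absolute_error', cv=5)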
    print(name)
    print(f'mean score: {-np.mean(scores)}')
    print(f'standard deviation: {np.std(scores)}')

Linear Regressor
mean score: 549.7682529742972
standard deviation: 35.072846866536025
Elastic Net Regressor
mean score: 536.1307626410694
standard deviation: 45.414587574161445
SVM Regressor
mean score: 485.87212028758296
standard deviation: 66.30240344482918
Dummy Regressor
mean score: 523.9761538461538
standard deviation: 74.24953877489088
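
In the cells above, `full_pipeline` is fitted on all rows before cross-validation, so the imputation median and scaling statistics are computed from rows that later serve as validation folds. One common alternative, sketched here on synthetic data, is to wrap the preprocessing and the regressor in a single `Pipeline` so that each fold refits the preprocessing on its own training part. The `demo_X` and `demo_y` objects and the `Airbnb` value are made up for illustration, and `handle_unknown='ignore'` is an added assumption so that a category missing from a training fold does not raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the accommodation data (illustrative only)
rng = np.random.default_rng(0)
n_rows = 100
demo_X = pd.DataFrame({
    'Source': rng.choice(['Hotel', 'Airbnb'], size=n_rows),
    'Location': rng.choice(['Kita', 'Taito', 'Sumida'], size=n_rows),
    'Type': rng.choice(['Apartment', 'Hotel Room'], size=n_rows),
    'Number of Beds': rng.choice([1.0, 2.0, 3.0, np.nan], size=n_rows),
})
demo_y = pd.Series(rng.uniform(300, 2000, size=n_rows), name='Price (HKD)')

preprocess = ColumnTransformer([
    ('oh_encoder', OneHotEncoder(handle_unknown='ignore'), ['Source', 'Location', 'Type']),
    ('numeric', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), ['Number of Beds']),
])

# Preprocessing and model in one estimator: refitted inside every CV fold
model = Pipeline([
    ('preprocess', preprocess),
    ('regressor', SVR(C=10)),
])

demo_scores = cross_val_score(model, demo_X, demo_y,
                              scoring='neg_mean_absolute_error', cv=5)
demo_mae = -demo_scores
print(f'mean MAE: {demo_mae.mean():.1f}')
print(f'standard deviation: {demo_mae.std():.1f}')
```

With the real data, `demo_X` and `demo_y` would be replaced by the untransformed `X_train` and `y_train`.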


In [11]:
gbr_reg = GradientBoostingRegressor()

gbr_grid = {
    'learning_rate': [0.1, 0.2, 0.5, 1],
    'n_estimators': [50, 80, 100, 150],
    'loss': ['squared_error', 'huber'],
    'alpha': [0.1, 0.5, 0.9]  # only used when loss is 'huber' (or 'quantile')
}
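
Because `alpha` only affects the `huber` (and `quantile`) losses, crossing it with `squared_error` repeats equivalent fits. A possible way to trim the search, sketched below, is to pass a list of grids so that `alpha` is varied only for `huber`. The names `shared` and `gbr_grid_split` are hypothetical; `ParameterGrid` is used here only to count the candidates.

```python
from sklearn.model_selection import ParameterGrid

# Hyperparameters common to both losses (same values as in gbr_grid)
shared = {
    'learning_rate': [0.1, 0.2, 0.5, 1],
    'n_estimators': [50, 80, 100, 150],
}

# Vary alpha only where it has an effect
gbr_grid_split = [
    {**shared, 'loss': ['squared_error']},
    {**shared, 'loss': ['huber'], 'alpha': [0.1, 0.5, 0.9]},
]

n_full = len(ParameterGrid({**shared,
                            'loss': ['squared_error', 'huber'],
                            'alpha': [0.1, 0.5, 0.9]}))
n_split = len(ParameterGrid(gbr_grid_split))
print(n_full, n_split)  # 96 64
```

`GridSearchCV` accepts the list form directly as its parameter grid, so `gbr_grid_split` could be passed in place of `gbr_grid`.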

In [12]:
gbr_cv = GridSearchCV(gbr_reg, gbr_grid, cv=5, scoring='neg_mean_absolute_error', verbose=4, n_jobs=-1)

gbr_cv.fit(X_train_prepared, y_train)

print(gbr_cv.best_score_)
print(gbr_cv.best_params_)

Fitting 5 folds for each of 96 candidates, totalling 480 fits


KeyboardInterrupt: 