# Accommodation Price Prediction - Modelling and Predictions

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('../cleaning/accommodation.csv')
df.head()

| | Source | Location | Number of Beds | Type | Price (HKD) |
|---|---|---|---|---|---|
| 0 | Hotel | Kita | NaN | Apartment | 589.0 |
| 1 | Hotel | Taito | 2.0 | Hotel Room | 621.0 |
| 2 | Hotel | Shinagawa | NaN | Apartment | 1807.0 |
| 3 | Hotel | Sumida | NaN | Apartment | 811.0 |
| 4 | Hotel | Taito | 1.0 | Hotel Room | 378.0 |


In [3]:
df.dtypes

Source             object
Location           object
Number of Beds    float64
Type               object
Price (HKD)       float64
dtype: object

In [4]:
df.isna().sum()

Source             0
Location           0
Number of Beds    75
Type               0
Price (HKD)        0
dtype: int64

Separating the features from the target

In [5]:
X_train = df.drop(columns='Price (HKD)')
y_train = df['Price (HKD)']

Creating a pipeline to transform the different feature types.

- A one-hot encoding scheme is applied to the categorical variables.
- Missing numerical data is imputed using the median of the variable, and all numerical data is then scaled using a standard scaler.

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

cat_var = ['Source', 'Location', 'Type']

var_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

full_pipeline = ColumnTransformer([
    ('oh_encoder', OneHotEncoder(), cat_var),
    ('imputer', var_pipeline, ['Number of Beds'])
])

In [7]:
X_train_prepared = full_pipeline.fit_transform(X_train)

In [8]:
X_train_prepared.shape

(1300, 42)
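
To make the effect of the transformer easier to see, here is a minimal sketch on a four-row toy frame. The data and the names `toy`, `num_pipeline`, and `toy_transformer` are made up for illustration and are not part of the notebook; only the transformer layout mirrors the one above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made up for illustration, not taken from accommodation.csv)
toy = pd.DataFrame({
    'Type': ['Apartment', 'Hotel Room', 'Apartment', 'Hotel Room'],
    'Number of Beds': [1.0, np.nan, 2.0, 3.0],
})

# Numerical branch: fill missing values with the median, then standardize
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

toy_transformer = ColumnTransformer([
    ('oh_encoder', OneHotEncoder(), ['Type']),
    ('numeric', num_pipeline, ['Number of Beds']),
])

toy_prepared = toy_transformer.fit_transform(toy)

# ColumnTransformer may return a sparse or a dense result depending on density
if hasattr(toy_prepared, 'toarray'):
    toy_prepared = toy_prepared.toarray()

print(toy_prepared.shape)
print(toy_prepared)
```

The result has one column per category of `Type` plus one scaled numerical column. The missing bed count is replaced by the median (2), which is also the mean after imputation, so it maps to 0 after scaling.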

### Model building and model comparison

Four candidate regressors are compared for predicting the accommodation prices.

1. Linear Regression
- Linear regression assumes a linear relationship between the features and the target. Because it applies no regularization, it can overfit when there are many one-hot encoded features relative to the number of samples. It is therefore included as a baseline for comparison.

2. Elastic Net Regression
- Elastic Net regression adds both L1 (Lasso) and L2 (Ridge) penalties to the loss function, so it combines the characteristics of both regularization methods and, when tuned, can match or outperform either one alone. In scikit-learn, the `alpha` hyperparameter controls the overall penalty strength (and hence the model complexity), while `l1_ratio` controls the mix between the L1 and L2 penalties.

3. Support Vector Regression
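- Support vector regression applies the principle of the support vector machine to regression. Instead of maximizing the margin between classes, it fits a function that keeps the training points within a tube of width epsilon around the prediction and penalizes only the points that fall outside the tube.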

4. Gradient Boosting Regression
- Gradient boosting regression is an ensemble method that trains decision trees sequentially and aggregates them for prediction. Each tree is trained on the residuals of the previous step, that is, the differences between the target values and the predictions made so far.
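
The residual-fitting idea behind gradient boosting can be sketched by hand. The example below assumes a squared-error loss, for which the negative gradient is simply the residual. The synthetic data, `learning_rate`, `n_trees`, and the tree depth are illustrative choices, and `GradientBoostingRegressor` differs from this sketch in its details.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) * 100 + rng.normal(0, 5, size=200)

learning_rate = 0.5   # shrinks the contribution of each tree
n_trees = 20

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y, y.mean())
initial_mae = np.mean(np.abs(y - prediction))

trees = []
for _ in range(n_trees):
    residual = y - prediction                 # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)                     # each new tree is fitted to the residuals
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mae = np.mean(np.abs(y - prediction))
print(f'training MAE before boosting: {initial_mae:.1f}')
print(f'training MAE after boosting:  {final_mae:.1f}')
```

The training error shrinks as trees are added, which is also why the learning rate and the number of estimators are tuned together in the grid search.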


A dummy regressor, which ignores the features and by default always predicts the mean of the training target, is also included as a baseline.
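
As a quick illustration of what this baseline does, the sketch below fits `DummyRegressor` with its default settings on four made-up prices. The arrays `X_toy` and `y_toy` are hypothetical and not taken from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Made-up prices for illustration
X_toy = np.zeros((4, 1))                      # the features are ignored by DummyRegressor
y_toy = np.array([400.0, 600.0, 800.0, 1000.0])

dummy = DummyRegressor()                      # default strategy is 'mean'
dummy.fit(X_toy, y_toy)

pred = dummy.predict(np.zeros((2, 1)))
print(pred)                                   # the training mean, repeated for every row
```

Any model worth keeping should reach a lower error than this constant prediction.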

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.dummy import DummyRegressor


# Create a dictionary of regressors
regressors = {
    'Linear Regression': LinearRegression(),
    'Elastic Net Regression': ElasticNet(),
    'Support Vector Regression': SVR(),
    'Gradient Boosting Regression': GradientBoostingRegressor(),
    'Dummy Regression': DummyRegressor()
}

# Create hyperparameter grids
grids = {
    'Linear Regression': {

    },
    'Elastic Net Regression':{
        'alpha': [0.0001, 0.001, 0.01, 0.1, 1],
        'l1_ratio': [0, 0.2, 0.5, 0.8, 1]
    },
    'Support Vector Regression':{
        'kernel': ['linear', 'rbf'],
        'gamma': [0.001, 0.1, 1, 5, 10],
        'C': [0.1, 1, 5, 10, 50]
    },
    'Gradient Boosting Regression':{
        'learning_rate': [0.1, 0.2, 0.5, 1],
        'n_estimators': [50, 80, 100, 150],
        'loss': ['squared_error', 'huber'],
        'alpha': [0.1, 0.5, 0.9]
    },
    'Dummy Regression':{

    }
}

for name, regressor in regressors.items():
    reg_cv = GridSearchCV(regressor, grids[name], scoring='neg_mean_absolute_error', cv=3, n_jobs=-1)
    reg_cv.fit(X_train_prepared, y_train)
    print(name)
    print(f'mean score: {-reg_cv.best_score_}')
    std = reg_cv.cv_results_['std_test_score'][reg_cv.best_index_]
    print(f'standard deviation: {std}')
    print('\n')

Linear Regression
mean score: 693.4471280026223
standard deviation: 239.7945611524849






Elastic Net Regression
mean score: 581.3024021492453
standard deviation: 63.00992918736599


Support Vector Regression
mean score: 486.1645620481542
standard deviation: 52.718640723201204


Gradient Boosting Regression
mean score: 514.4460624479647
standard deviation: 124.52159241982432


Dummy Regression
mean score: 583.0653510255576
standard deviation: 66.51153879895924




From the results above, the Support Vector Regressor performs best, with the lowest cross-validated mean absolute error (about 486 HKD).
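
When comparing several models, it can help to collect the scores in one sorted table instead of reading them off the printed output. The sketch below shows one way to do this on synthetic data from `make_regression`. The names `X_demo`, `y_demo`, `demo_regressors`, and `results`, as well as the reduced grids, are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for the prepared training data
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Each entry pairs an estimator with its (possibly empty) hyperparameter grid
demo_regressors = {
    'Linear Regression': (LinearRegression(), {}),
    'Support Vector Regression': (SVR(), {'kernel': ['linear'], 'C': [1, 10]}),
    'Dummy Regression': (DummyRegressor(), {}),
}

rows = []
for name, (regressor, grid) in demo_regressors.items():
    search = GridSearchCV(regressor, grid, scoring='neg_mean_absolute_error', cv=3)
    search.fit(X_demo, y_demo)
    rows.append({
        'model': name,
        'mae_mean': -search.best_score_,   # flip the sign back to a positive MAE
        'mae_std': search.cv_results_['std_test_score'][search.best_index_],
        'best_params': search.best_params_,
    })

# Lower MAE is better, so sort in ascending order
results = pd.DataFrame(rows).sort_values('mae_mean').reset_index(drop=True)
print(results)
```

Sorting by mean MAE next to its standard deviation makes it easier to judge whether the gap between two models is larger than the fold-to-fold variation.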

### Fine-tuning the model

In [10]:
svm_reg = SVR()

svm_grid = {
    'kernel': ['linear', 'rbf'],
    'gamma': [0.0001, 0.0005, 0.001, 0.005, 10],
    'C': [5, 8, 10, 12, 16, 20]
}

In [11]:
from sklearn.model_selection import RandomizedSearchCV
svm_cv = RandomizedSearchCV(svm_reg, svm_grid, cv=5, scoring='neg_mean_absolute_error', verbose=4, n_jobs=-1, n_iter=60)

svm_cv.fit(X_train_prepared, y_train)

print(-svm_cv.best_score_)
print(svm_cv.best_params_)

Fitting 5 folds for each of 60 candidates, totalling 300 fits
452.9095184142313
{'kernel': 'linear', 'gamma': 0.0001, 'C': 16}
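
To turn the tuned model into predictions, one option is to combine the preprocessing and the regressor into a single pipeline, so that new listings go through the same transformations. The sketch below is not part of the original notebook. The toy rows, the names `toy`, `preprocess`, `model`, and `new_listing`, and the use of `handle_unknown='ignore'` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Made-up rows that mimic the columns of accommodation.csv
toy = pd.DataFrame({
    'Source': ['Hotel'] * 6,
    'Location': ['Kita', 'Taito', 'Shinagawa', 'Sumida', 'Taito', 'Kita'],
    'Number of Beds': [np.nan, 2.0, np.nan, 1.0, 1.0, 2.0],
    'Type': ['Apartment', 'Hotel Room', 'Apartment', 'Apartment', 'Hotel Room', 'Hotel Room'],
    'Price (HKD)': [589.0, 621.0, 1807.0, 811.0, 378.0, 650.0],
})
X_toy = toy.drop(columns='Price (HKD)')
y_toy = toy['Price (HKD)']

preprocess = ColumnTransformer([
    # handle_unknown='ignore' encodes categories unseen during fitting as all zeros
    ('oh_encoder', OneHotEncoder(handle_unknown='ignore'), ['Source', 'Location', 'Type']),
    ('numeric', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), ['Number of Beds']),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('svr', SVR(kernel='linear', C=16)),   # parameters reported by the search above
])
model.fit(X_toy, y_toy)

# A hypothetical new listing with the same columns as the training features
new_listing = pd.DataFrame([{
    'Source': 'Hotel', 'Location': 'Taito', 'Number of Beds': 2.0, 'Type': 'Hotel Room',
}])
predicted_price = model.predict(new_listing)
print(predicted_price)
```

Passing such a pipeline to the search instead of pre-transformed data would also refit the imputer, scaler, and encoder inside each cross-validation fold, which avoids leaking statistics from the validation folds.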


**Conclusion**

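The best predictor is the Support Vector Regressor (linear kernel, C=16), with a cross-validated MAE of about 452.9 HKD. The gamma value reported by the search has no effect here, since the linear kernel does not use gamma.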