<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Startup" data-toc-modified-id="Startup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Startup</a></span><ul class="toc-item"><li><span><a href="#Add-libraries-to-path" data-toc-modified-id="Add-libraries-to-path-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Add libraries to path</a></span></li><li><span><a href="#Import-libraries" data-toc-modified-id="Import-libraries-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Import libraries</a></span></li><li><span><a href="#Start-database-connection" data-toc-modified-id="Start-database-connection-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Start database connection</a></span></li></ul></li><li><span><a href="#Preprocessing" data-toc-modified-id="Preprocessing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preprocessing</a></span></li><li><span><a href="#Model" data-toc-modified-id="Model-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Model</a></span><ul class="toc-item"><li><span><a href="#OLS" data-toc-modified-id="OLS-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>OLS</a></span></li><li><span><a href="#Random-forest" data-toc-modified-id="Random-forest-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Random forest</a></span></li><li><span><a href="#XgBoost" data-toc-modified-id="XgBoost-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>XgBoost</a></span></li></ul></li></ul></div>

## Startup

### Add libraries to path

In [1]:
import os
import sys

# Make the parent directory importable so that the `backend` and `scrapper`
# packages resolve. Assumes the notebook is run from its own directory.
currentdir = os.getcwd()
parentdir = os.path.dirname(currentdir)
sys.path.insert(0, parentdir)

### Import libraries

In [16]:
from decouple import config
import pandas as pd
import numpy as np
from backend.utils import *
from backend.data_clean import *
from sklearn import linear_model
import statsmodels.api as sm
from backend.plots import *
import geopandas as gpd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

pd.options.display.max_columns = None
pd.set_option('display.max_colwidth', None)

### Start database connection

In [3]:
from scrapper.database.mysql import Database
db = Database(config('db_host'), config('db_port'), config('db_database'), config('db_user'), config('db_password'))
db.test_connection()

Test: Database connection.
Success: Connection to database.


True

## Preprocessing

In [10]:
table = 'properties'
df = get_sqlalchemy_table_to_pandas(table, db)

In [11]:
# Get only interesting columns
df = preprocess_pisos(df, 'buy')

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [13]:
numerical_features = df._get_numeric_data().columns
categorical_features = [col for col in df.columns if col not in numerical_features]

In [14]:
df = get_dummies_from_categorical(categorical_features, df)
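
The helper `get_dummies_from_categorical` comes from `backend.utils` and its implementation is not shown in this notebook. As a rough sketch of the step, assuming it does something close to `pandas.get_dummies` on the listed columns (the function name `one_hot_encode` and the toy columns below are hypothetical):

```python
import pandas as pd


def one_hot_encode(categorical_columns, frame):
    """Hypothetical stand-in for get_dummies_from_categorical.

    Replaces each categorical column with one indicator column per category.
    """
    return pd.get_dummies(frame, columns=list(categorical_columns))


# Toy data: one numeric column and one categorical column (illustrative only)
toy = pd.DataFrame({
    "price": [100000, 250000, 180000],
    "district": ["a", "b", "a"],
})

# Same split into numeric and categorical columns as in the cell above,
# written with the public select_dtypes API
numeric = toy.select_dtypes(include="number").columns
categorical = [col for col in toy.columns if col not in numeric]

encoded = one_hot_encode(categorical, toy)
print(encoded.columns.tolist())
```

After encoding, every column is numeric or boolean, which is what the models in the next section require.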

## Model

### OLS

In [23]:
# OLS: split the data, fit on the training set, evaluate on the test set
X = df.drop(columns=['price'])
Y = df[['price']]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)
X_ols = sm.add_constant(X_train)
model = sm.OLS(Y_train, X_ols).fit()
Y_pred = model.predict(sm.add_constant(X_test))
# mean_squared_error returns the MSE (squared price units), not the RMSE
print("MSE:", np.round(mean_squared_error(Y_test, Y_pred), 0))
plot_true_vs_pred(Y_test, Y_pred)
# display(model.summary())

MSE: 5133447810.0
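
`mean_squared_error` returns the mean squared error, so the figure printed above is in squared price units; its square root is roughly 71,648 in the units of `price`. A minimal sketch of getting an RMSE from the same call, using made-up numbers rather than the notebook's data:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only, not taken from the properties table
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = float(np.sqrt(mse))                 # same units as the target

print("MSE:", mse)
print("RMSE:", rmse)
```

Taking the square root explicitly avoids depending on version-specific keyword arguments of `mean_squared_error`.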


### Random forest

In [24]:
# Random forest
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
# Pass the target as a 1-D array; a column vector triggers a DataConversionWarning
model.fit(X_train, Y_train.values.ravel())
Y_pred = model.predict(X_test)
# mean_squared_error returns the MSE, not the RMSE
print("MSE:", np.round(mean_squared_error(Y_test, Y_pred), 0))
plot_true_vs_pred(Y_test, Y_pred)

MSE: 5285116454.0
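
scikit-learn regressors expect the target as a 1-D array of shape `(n_samples,)`, whereas selecting with double brackets, as in `df[['price']]`, yields a one-column DataFrame of shape `(n_samples, 1)`. A small sketch of the difference on toy data (the values are illustrative):

```python
import pandas as pd

# Toy target, standing in for the price column
Y_demo = pd.DataFrame({"price": [1.0, 2.0, 3.0]})

column_vector = Y_demo.to_numpy()      # 2-D: one column
flat = Y_demo.to_numpy().ravel()       # 1-D: what fit() expects
series = Y_demo["price"]               # single brackets also give 1-D

print(column_vector.shape, flat.shape, series.shape)
```

Either `ravel()` or single-bracket selection gives an estimator the shape it expects.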


### XgBoost

In [25]:
import xgboost as xgb
from sklearn.preprocessing import StandardScaler

model = xgb.XGBRegressor(colsample_bytree=0.4,
                         gamma=0,
                         learning_rate=0.07,
                         max_depth=3,
                         min_child_weight=1.5,
                         n_estimators=10000,
                         reg_alpha=0.75,
                         reg_lambda=0.45,
                         subsample=0.6)
# Note: the scaler is fit on all rows before the split,
# so the test rows influence the scaling statistics
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size=0.33)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
# mean_squared_error returns the MSE, not the RMSE
print("MSE:", mean_squared_error(Y_test, Y_pred))
plot_true_vs_pred(Y_test, Y_pred)

MSE: 5143183107.653032
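
The cell above fits `StandardScaler` on the full feature matrix before splitting, so statistics from the test rows leak into the training features. One way to avoid this is to put the scaler inside a pipeline so it is fit on the training split only. The sketch below uses synthetic data and scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor`, which follows the same estimator interface; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Split first, then let the pipeline fit the scaler on the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

pipeline = make_pipeline(
    StandardScaler(),
    GradientBoostingRegressor(random_state=0),  # stand-in for xgb.XGBRegressor
)
pipeline.fit(X_tr, y_tr)
y_hat = pipeline.predict(X_te)  # test rows are scaled with training statistics
```

Tree-based models are largely insensitive to this kind of feature scaling, so the scaling step could also be dropped; the pipeline matters more when the same preprocessing feeds a linear model.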


In [32]:
from sklearn.model_selection import GridSearchCV

# Exhaustive grid: every combination of the values below is cross-validated
parameters_for_testing = {
   'colsample_bytree': [0.4, 0.6, 0.8],
   'gamma': [0, 0.03, 0.1, 0.3],
   'min_child_weight': [1.5, 6, 10],
   'learning_rate': [0.1, 0.07],
   'max_depth': [3, 5],
   'n_estimators': [10000],
   'reg_alpha': [1e-5, 1e-2, 0.75],
   'reg_lambda': [1e-5, 1e-2, 0.45],
   'subsample': [0.6, 0.95]
}
xgb_model = xgb.XGBRegressor(learning_rate=0.1, n_estimators=1000, max_depth=5,
    min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8, nthread=6, scale_pos_weight=1, seed=27)

gsearch1 = GridSearchCV(estimator=xgb_model, param_grid=parameters_for_testing, n_jobs=6, verbose=10, scoring='neg_mean_squared_error')
gsearch1.fit(X_train, Y_train)
# grid_scores_ was removed from scikit-learn; cv_results_ holds the per-candidate scores
print(gsearch1.cv_results_['mean_test_score'])
print('best params')
print(gsearch1.best_params_)
print('best score')
print(gsearch1.best_score_)

Fitting 5 folds for each of 2592 candidates, totalling 12960 fits


KeyboardInterrupt: 
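
The search was interrupted because the grid is large. The count reported in the log can be reproduced by multiplying the number of values per parameter, as in this short sketch (the fold count of 5 is taken from the log line above):

```python
from math import prod

parameters_for_testing = {
    'colsample_bytree': [0.4, 0.6, 0.8],
    'gamma': [0, 0.03, 0.1, 0.3],
    'min_child_weight': [1.5, 6, 10],
    'learning_rate': [0.1, 0.07],
    'max_depth': [3, 5],
    'n_estimators': [10000],
    'reg_alpha': [1e-5, 1e-2, 0.75],
    'reg_lambda': [1e-5, 1e-2, 0.45],
    'subsample': [0.6, 0.95],
}

# Number of parameter combinations in an exhaustive grid
n_candidates = prod(len(values) for values in parameters_for_testing.values())
n_folds = 5  # from "Fitting 5 folds" in the log
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

Each of those fits trains 10000 boosting rounds, so trimming the grid, lowering `n_estimators`, or sampling candidates with `RandomizedSearchCV` are possible ways to make the search tractable.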