In [56]:
import pandas as pd

In [57]:
melbourne_data = pd.read_csv("melb_data.csv")

In [58]:
predictor_columns = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']

In [59]:
x = melbourne_data[predictor_columns]
y = melbourne_data.Price # our target - what we try to predict

In [60]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y)

In [61]:
from sklearn.impute import SimpleImputer
column_imputer = SimpleImputer()
imputed_x_train = pd.DataFrame(column_imputer.fit_transform(x_train))
imputed_x_test = pd.DataFrame(column_imputer.transform(x_test))
# imputation returns plain arrays without column names, so restore the original names
imputed_x_train.columns = x_train.columns
imputed_x_test.columns = x_test.columns
x_train = imputed_x_train
x_test = imputed_x_test

In [62]:
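categorical_variables_names = (x_train.dtypes == 'object') # Boolean mask marking the object-dtype (categorical) columns
# keep only the names of the columns flagged above; with the five numeric predictors chosen earlier this list is empty
categorical_variables_names = list(categorical_variables_names[categorical_variables_names].index)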

In [63]:
from sklearn.preprocessing import OrdinalEncoder

OE_x_train = x_train.copy()
OE_x_test = x_test.copy()
OE = OrdinalEncoder()

OE_x_train[categorical_variables_names] = OE.fit_transform(OE_x_train[categorical_variables_names])
OE_x_test[categorical_variables_names] = OE.transform(OE_x_test[categorical_variables_names])
x_train = OE_x_train
x_test = OE_x_test

XGBoost Modelling:
Naive Model -> Make Predictions -> Calculate Loss (MAE) -> Train New Model -> Add new model to the ensemble -> repeat from step 2
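
The cycle above can be sketched in plain Python. This is a toy illustration only, not how XGBoost is implemented: it assumes squared-error residuals, uses one-split "stumps" as the models, and runs on a made-up six-point data set. The names fit_stump, predict_stump, boost and toy_mae are hypothetical helpers introduced for the example.

```python
# Toy boosting loop with one-feature decision stumps (illustrative only).

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    # try every split point that leaves at least one sample on each side
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds):
    base = sum(ys) / len(ys)                 # step 1: naive model (the mean)
    predictions = [base] * len(ys)
    ensemble = []
    for _ in range(n_rounds):
        # steps 2-3: compare the current predictions with the targets
        residuals = [y - p for y, p in zip(ys, predictions)]
        # step 4: train a new model on what is still unexplained
        stump = fit_stump(xs, residuals)
        # step 5: add it to the ensemble and update the predictions
        ensemble.append(stump)
        predictions = [p + predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, ensemble, predictions


def toy_mae(ys, predictions):
    return sum(abs(y - p) for y, p in zip(ys, predictions)) / len(ys)


xs = [1, 2, 3, 4, 5, 6]                      # made-up feature values
ys = [10.0, 12.0, 15.0, 30.0, 32.0, 35.0]    # made-up targets
mae_naive = toy_mae(ys, [sum(ys) / len(ys)] * len(ys))
_, _, boosted = boost(xs, ys, n_rounds=5)
mae_boosted = toy_mae(ys, boosted)
print("naive MAE:", mae_naive, "boosted MAE:", mae_boosted)
```

On this toy data the boosted ensemble has a lower training MAE than the naive model, which is the effect each pass through the cycle is meant to produce.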

In [64]:
from xgboost import XGBRegressor

XGB_Regressor = XGBRegressor()
XGB_Regressor.fit(x_train, y_train)

In [65]:
from sklearn.metrics import mean_absolute_error

predictions = XGB_Regressor.predict(x_test)
MAE_result = mean_absolute_error(y_test, predictions)
print("The MAE when using XGBoost: ", MAE_result)

The MAE when using XGBoost:  230225.23981958762


To improve the accuracy further, we can set the parameters of the XGBRegressor.
n_estimators = number of models in the ensemble, i.e. the number of iterations of the cycle above.
Too low a value may result in underfitting and too high a value may result in overfitting, so we need to find a balance.


In [66]:
XGB_Regressor = XGBRegressor(n_estimators=500)
XGB_Regressor.fit(x_train, y_train)
predictions = XGB_Regressor.predict(x_test)
MAE_result = mean_absolute_error(y_test, predictions)
print("The MAE when using XGBoost: ", MAE_result)

The MAE when using XGBoost:  241455.19070726712


The MAE went up, which suggests we overfitted, and experimenting with these values by hand can be time-consuming. Instead, we can use early_stopping_rounds, which stops training once the validation score stops improving and so finds a good value of n_estimators automatically. This lets us set n_estimators high and rely on early stopping to cut the training short. The value of early_stopping_rounds is the number of consecutive rounds without improvement that must pass before training stops. Early stopping needs some set-aside data to compute the validation score on, which we pass through the eval_set parameter.
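
The stopping rule can be sketched without XGBoost. The snippet below is a simplified illustration, assuming we already have one validation error per boosting round; the function best_round_with_patience and the error values are made up for the example and are not part of the XGBoost API.

```python
def best_round_with_patience(validation_errors, patience):
    """Return (best_round, stop_round) for per-round validation errors.

    Training stops once `patience` rounds have passed since the best round.
    """
    best_round = 0
    best_error = float("inf")
    for current_round, error in enumerate(validation_errors):
        if error < best_error:
            # new best score: remember it and keep going
            best_error = error
            best_round = current_round
        elif current_round - best_round >= patience:
            # no improvement for `patience` rounds: stop here
            return best_round, current_round
    # never triggered: all rounds were used
    return best_round, len(validation_errors) - 1


# made-up validation MAE values, one per boosting round
errors = [300, 250, 230, 229, 231, 232, 233, 234, 235, 240]
best_round, stop_round = best_round_with_patience(errors, patience=5)
print("best round:", best_round, "stopped at round:", stop_round)
```

Here the best score is reached at round 3 and training stops at round 8, five rounds later, so the remaining rounds are never run.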

In [67]:
XGB_Regressor = XGBRegressor(n_estimators=500)
# Note: newer XGBoost releases expect early_stopping_rounds in the XGBRegressor constructor instead of fit()
XGB_Regressor.fit(x_train, y_train, early_stopping_rounds=5, eval_set=[(x_test, y_test)], verbose=False)
predictions = XGB_Regressor.predict(x_test)
MAE_result = mean_absolute_error(y_test, predictions)
print("The MAE when using XGBoost: ", MAE_result)

The MAE when using XGBoost:  230292.52416006997


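The learning rate scales the contribution of each tree in the ensemble: each tree's predictions are multiplied by it before being added in. A low learning rate combined with a large number of estimators generally gives a more accurate model, at the cost of longer training.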
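
To see why a low learning rate needs more estimators, here is an idealised sketch. It assumes every new model predicts the remaining residual exactly, which real trees do not, and the function rounds_to_converge and the target value are hypothetical.

```python
def rounds_to_converge(target, learning_rate, tolerance=1.0, max_rounds=10_000):
    """Count rounds until the prediction is within `tolerance` of `target`."""
    prediction = 0.0
    for round_number in range(1, max_rounds + 1):
        residual = target - prediction
        # only a fraction of each new model's output is added to the ensemble
        prediction += learning_rate * residual
        if abs(target - prediction) <= tolerance:
            return round_number
    return max_rounds


fast = rounds_to_converge(1000.0, learning_rate=0.3)
slow = rounds_to_converge(1000.0, learning_rate=0.05)
print("rounds at 0.3:", fast, "rounds at 0.05:", slow)
```

The smaller rate takes more rounds to reach the same target, which is why it is usually paired with a larger n_estimators.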

In [68]:
XGB_Regressor = XGBRegressor(n_estimators=1000, learning_rate=0.05)
XGB_Regressor.fit(x_train, y_train, early_stopping_rounds=5, eval_set = [(x_test, y_test)], verbose = False)
predictions = XGB_Regressor.predict(x_test)
MAE_result = mean_absolute_error(y_test, predictions)
print("The MAE when using XGBoost: ", MAE_result)

The MAE when using XGBoost:  229159.64030283506


With a lower learning rate, or with large data sets, training can take substantially longer. To speed it up, we can use the n_jobs parameter, which sets how many processor cores are used in parallel.
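
If you are unsure how many cores your machine has, the standard library can report it. This is a small sketch; leaving one core free is an arbitrary choice made for the example, not a recommendation from XGBoost.

```python
import os

# os.cpu_count() reports logical cores and may return None on some platforms
available_cores = os.cpu_count() or 1

# hypothetical policy: leave one core free for the rest of the system
n_jobs = max(1, available_cores - 1)
print("cores available:", available_cores, "n_jobs to use:", n_jobs)
```

The resulting n_jobs value can then be passed to XGBRegressor in place of a hard-coded number.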

In [69]:
XGB_Regressor = XGBRegressor(n_estimators=10000, learning_rate=0.05, n_jobs=2)
XGB_Regressor.fit(x_train, y_train, early_stopping_rounds=5, eval_set = [(x_test, y_test)], verbose = False)
predictions = XGB_Regressor.predict(x_test)
MAE_result = mean_absolute_error(y_test, predictions)
print("The MAE when using XGBoost: ", MAE_result)

The MAE when using XGBoost:  229159.64030283506
