## Enhancing Time Series |  Methods & How to 

<b> First let's start by importing the libs we need

In [1]:
from xgboost import XGBRegressor
from prophet import Prophet
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error

<b> For these tests we will be using House Property Sales dataset https://www.kaggle.com/datasets/htagholdings/property-sales?select=raw_sales.csv

In [2]:
df = pd.read_csv('raw_sales.csv')

<b> Let's first check our dataframe and see if we need to make some improvements to the df

In [3]:
df.shape

(29580, 5)

In [4]:
df.dtypes

datesold        object
postcode         int64
price            int64
propertyType    object
bedrooms         int64
dtype: object

<b> Let's correct our time column

In [5]:
df['datesold'] = pd.to_datetime(df['datesold'])

In [6]:
df.sample(2)

Unnamed: 0,datesold,postcode,price,propertyType,bedrooms
3281,2010-12-15,2615,497000,house,4
22709,2018-09-03,2914,705000,house,4


In [7]:
df['propertyType'].unique()

array(['house', 'unit'], dtype=object)

In [8]:
df['postcode'].nunique()

27

<b> Let's suppose we want a model that can predict price of houses and units.
    <br>     We will start with Prophet

<b> To use prophet correct we need to transform the data column name to ds and the output in y

In [9]:
df.rename(columns = {'datesold' : 'ds', 'price' : 'y'},inplace = True)

In [10]:
df.sample(2)

Unnamed: 0,ds,postcode,y,propertyType,bedrooms
10580,2014-12-08,2602,695000,house,4
398,2008-09-30,2913,367500,house,4


<b> Now, what we also need is to define our train, test and validation.

In [11]:
df['ds'].min()

Timestamp('2007-02-07 00:00:00')

In [12]:
df['ds'].max()

Timestamp('2019-07-27 00:00:00')

<b> Let's use the final 2 months as our validation, and let's do a test with 1 month of data.

In [15]:
train_df = df.loc[(df['ds'] < '2019-05-01') & (df['postcode'] == 2602)].reset_index(drop = True).copy()

In [16]:
test_df = df.loc[(df['ds'].between('2019-05-01','2019-05-31')) & (df['postcode'] == 2602)].reset_index(drop = True).copy()

In [17]:
val_df = df.loc[(df['ds'] >= '2019-05-31') & (df['postcode'] == 2602)].reset_index(drop = True).copy()

<b> Now prophet isn't too much of a category friend so if we wanted one model per type of property we would have two models, and if we wanted one model per postal code and type we could have 54 models which is the n of type's * n of postal codes
    <br> let's go simple and go with one model for houses.

In [18]:
ts_model = Prophet()

In [19]:
train_houses = train_df.loc[train_df['propertyType'] == 'house'].reset_index(drop = True).copy()
test_houses = test_df.loc[test_df['propertyType'] == 'house'].reset_index(drop = True).copy()

In [20]:
ts_model.fit(train_houses)

INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.


<prophet.forecaster.Prophet at 0x1be3f873a90>

<b> From this point on our prophet has been trained, easy! Now we could be enhancing our model by moving the hyperparameters, and comparing against our test! Let's take a glimpse

In [21]:
mean_absolute_error(test_df['y'],ts_model.predict(test_df)['yhat'])

309436.94382434513

In [22]:
ts_model_hyper = Prophet(seasonality_prior_scale=15.0,
                         holidays_prior_scale=20.0,
                         changepoint_prior_scale=0.1)

In [23]:
ts_model_hyper.fit(train_houses)

INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.


<prophet.forecaster.Prophet at 0x1be408db820>

In [24]:
mean_absolute_error(test_df['y'],ts_model_hyper.predict(test_df)['yhat'])

293778.05803002394

<b> From this point on to enhance with hyperparameters by themselves, the best option would by a Search Grid. So let's validate in our val_df

In [25]:
mean_absolute_error(val_df['y'],ts_model.predict(val_df)['yhat'])

363839.712125795

In [26]:
mean_absolute_error(val_df['y'],ts_model_hyper.predict(val_df)['yhat'])

343883.70855595445

<b> We get a model, that has an error pretty like our tests, it deprecated just a little bit which is normal but is useful.

In [27]:
############################################

## Let's change to XGBoost

<b> With XGBoost we will have to do a little bit of variable engineering, so let's start by transforming propertyType in bool

In [28]:
df['property_type'] = df['propertyType'].map({'house' : 1, 'unit' : 0})

In [29]:
df_xgb = df.copy()

<b> Now XGBoost won't understand data columns so we have to create the specific columns for each variable

In [30]:
df_xgb['day'] = df_xgb['ds'].dt.day
df_xgb['month'] = df_xgb['ds'].dt.month
df_xgb['year'] = df_xgb['ds'].dt.year

In [31]:
df_xgb.sample(2)

Unnamed: 0,ds,postcode,y,propertyType,bedrooms,property_type,day,month,year
19483,2017-09-20,2906,675000,house,4,1,20,9,2017
21867,2018-05-18,2617,900000,house,5,1,18,5,2018


<b> Now our other biggest step is to create the sliding window, for this test we will be using 28 days as our sliding window

In [32]:
df_xgb.sort_values(['propertyType','postcode','ds'],ignore_index = True,inplace = True)

In [33]:
df_xgb.sample(2)

Unnamed: 0,ds,postcode,y,propertyType,bedrooms,property_type,day,month,year
27007,2016-04-05,2607,185000,unit,1,0,5,4,2016
16075,2009-09-17,2905,450000,house,3,1,17,9,2009


<b> This is tricky, we have to group it by our other "category" variables and then we have to shift without the data so that the data remains on the grouped df

In [34]:
df_xgb['bedrooms_var'] = df_xgb.groupby(['propertyType','postcode'])['bedrooms'].shift(28)

<b> Final steps, let's do the separation by train,test and val

In [35]:
train_df = df_xgb.loc[df_xgb['ds'] < '2019-05-01'].reset_index(drop = True).copy()

In [36]:
train_df = train_df.loc[train_df['bedrooms_var'].notnull()].reset_index(drop = True).copy()

In [37]:
test_df = df_xgb.loc[df_xgb['ds'].between('2019-05-01','2019-05-31')].reset_index(drop = True).copy()

In [38]:
val_df = df_xgb.loc[df_xgb['ds'] >= '2019-05-31'].reset_index(drop = True).copy()

<b> Now as you can see, we don't have to create loads of models, just one XGBoost for all of them!

In [39]:
train_df.sample(2)

Unnamed: 0,ds,postcode,y,propertyType,bedrooms,property_type,day,month,year,bedrooms_var
6664,2019-03-14,2611,1228000,house,4,1,14,3,2019,4.0
2306,2018-10-22,2602,850000,house,3,1,22,10,2018,3.0


In [40]:
xgb_model = XGBRegressor(n_estimators = 800)

In [41]:
xgb_model.fit(train_df[['property_type','postcode','year','month','day','bedrooms_var']],train_df['y'])

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=800, n_jobs=8,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

In [42]:
mean_absolute_error(test_df.loc[(test_df['property_type'] == 1) & (df['postcode'] == 2602)]['y'],xgb_model.predict(test_df.loc[(test_df['property_type'] == 1) & (df['postcode'] == 2602) ][['property_type','postcode','year','month','day','bedrooms_var']]))

218721.28935185185

<b> With a little bit more of work we could get out xgb to predict 1 month ahead with 40K less price error than the prophet

<b> From this point on our prophet has been trained, easy! Now we could be enhancing our model by moving the hyperparameters, and comparing against our test! Let's take a glimpse

In [43]:
xgb_model_hyper = XGBRegressor(n_estimators = 1500,learning_rate = 0.003)

In [44]:
xgb_model_hyper.fit(train_df[['property_type','postcode','year','month','day','bedrooms_var']],train_df['y'])

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.003, max_delta_step=0,
             max_depth=6, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=1500, n_jobs=8,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

In [45]:
mean_absolute_error(test_df.loc[(test_df['property_type'] == 1) & (df['postcode'] == 2602)]['y'],xgb_model_hyper.predict(test_df.loc[(test_df['property_type'] == 1) & (df['postcode'] == 2602)][['property_type','postcode','year','month','day','bedrooms_var']]))

170804.36342592593

<b> Moving a little bit someparameters we could enhance it even more in test, now let's check in validation

In [46]:
mean_absolute_error(val_df.loc[(val_df['property_type'] == 1) & (df['postcode'] == 2602)]['y'],xgb_model.predict(val_df.loc[(val_df['property_type'] == 1) & (df['postcode'] == 2602)][['property_type','postcode','year','month','day','bedrooms_var']]))

146561.56173780488

In [48]:
mean_absolute_error(val_df.loc[(val_df['property_type'] == 1) & (df['postcode'] == 2602)]['y'],xgb_model_hyper.predict(val_df.loc[(val_df['property_type'] == 1) & (df['postcode'] == 2602)][['property_type','postcode','year','month','day','bedrooms_var']]))

132822.0030487805

<b> That's it! I hope you enjoyed the class and the material.
    <br> In case you need any help, i will be happy to help! You can contact me via LinkedIn https://www.linkedin.com/in/leandro-sartini-de-campos-64343915b/