# LSG Assignment

So far we have been studying time series that we generated using a pre-defined stochastic process, but now let's apply the models we have been working with on some real-world data. We will work with a data set which shows the consumption of chocolate, beer and electricity in Australia from 1958 to 1991.

In [15]:
pip install pmdarima

Note: you may need to restart the kernel to use updated packages.


The filename, directory name, or volume label syntax is incorrect.


In [2]:
from math import sin, pi
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import numpy.random as nr

import sklearn.linear_model as lm
from sklearn.metrics import mean_squared_error

import statsmodels.graphics.tsaplots as splt
import statsmodels.api as statsmodels
import statsmodels.formula.api as sm
import statsmodels.tsa.seasonal as sts
import statsmodels.tsa.arima_process as arima_process
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima_model import ARIMA

import matplotlib
matplotlib.rcParams['figure.figsize'] = [15, 5]

import warnings
warnings.filterwarnings('ignore')

def decomp_ts(ts, period, model = 'additive'):
    res = sts.seasonal_decompose(ts, model = model, period = period)
    return(pd.DataFrame({'ts': ts, 'trend': res.trend, 'seasonal': res.seasonal, 'resid': res.resid}, 
                        index = ts.index))

def plot_acf_pacf(x, lags = 40):
    x = x[x.notna()] # remove NAs
    fig, axes = plt.subplots(1, 2, figsize = (15, 5))
    fig = splt.plot_acf(x, lags = lags, ax = axes[0])
    fig = splt.plot_pacf(x, lags = lags, ax = axes[1]);
    return None

def plot_ts_resid(x):
    x = x[x.notna()] # remove NAs
    fig, axes = plt.subplots(1, 2, figsize = (15, 5))
    fig = sns.lineplot(x.index, x, ax = axes[0])
    fig = sns.distplot(x, ax = axes[1]);
    return None

In [3]:
CBE = pd.read_csv('data/cbe.csv')
CBE.index = pd.date_range(start = '1-1-1958', end = '12-31-1990', freq = 'M')

CBE.head()

Unnamed: 0,choc,beer,elec
1958-01-31,1451,96.3,1497
1958-02-28,2037,84.4,1463
1958-03-31,2477,91.2,1648
1958-04-30,2785,81.9,1595
1958-05-31,2994,80.5,1777


We limit our example to looking at beer consumption.

In [4]:
plot_ts_resid(CBE['beer'])

Notice that for each of these time series the amplitude of the seasonal variation grows with time. This is a common situation with real world data. Seeing this situation indicates that we should use a **multiplicative decomposition model**.  

The multiplicative model can be easily transformed to an additive model by taking the logarithm of the values.

In [5]:
CBE['beer_log'] = np.log(CBE['beer'])
plot_ts_resid(CBE['beer_log'])

Notice the following properties about this time series.
- It has a significant trend.
- The time series have a noticeable seasonal component.
- The magnitude of the seasonal component increases with trend in the un-transformed time series. 
- The seasonal component of the log transformed series has a nearly constant magnitude, but decreases a bit with time. 

These results indicate that an STL decomposition is required. Further, a multiplicative (log transformed) STL model is preferred.


### Fitting linear regression

Before training any time series model, let's see how our old fried linear regression does. In cases where the data is relatively well behaved, we can train a model using linear regression, but we need to do some pre-processing to account for the time series nature of the data. This can be a manual and laborious process, but going through it can give us a sense of what trying to model time series "manually" looks like.

- Create a feature called `month_int`, which is equal to 1 when the month is January, 2 for February, and so on. Create another feature called `month_sqr` which is the square of `month_int`. <span style="color:red" float:right>[2 point]</span>
- One-hot-encode the `month_int` feature (creating one binary feature for each month), and normalize `month_int` and `month_sqr`. <span style="color:red" float:right>[2 point]</span>
- Create a feature called `beer_log_lag_1` which is the first lag of `beer_log` (as in the last known price of beer, when you look at the previous row). HINT: You can get lagged features using the `shift` method. <span style="color:red" float:right>[2 point]</span>

In [6]:
## your code goes here
# part 1
CBE['month_int']=CBE.index.month
CBE['month_sqr']=CBE['month_int'].pow(2)
# part 2
dum = pd.get_dummies(CBE.month_int, prefix='Month')
CBE['month_int_norm']=(CBE['month_int']-CBE['month_int'].mean())/CBE['month_int'].std()
CBE['month_sqr_norm']=(CBE['month_sqr']-CBE['month_sqr'].mean())/CBE['month_sqr'].std()
# part 3
CBE['beer_log_lag_1']=CBE['beer_log'].shift(1)

# combining everything back into the CBE data frame
CBE=pd.concat([CBE, dum], axis=1)

# displaying first 5 rows
CBE.head()

Unnamed: 0,choc,beer,elec,beer_log,month_int,month_sqr,month_int_norm,month_sqr_norm,beer_log_lag_1,Month_1,...,Month_3,Month_4,Month_5,Month_6,Month_7,Month_8,Month_9,Month_10,Month_11,Month_12
1958-01-31,1451,96.3,1497,4.567468,1,1,-1.591242,-1.151852,,1,...,0,0,0,0,0,0,0,0,0,0
1958-02-28,2037,84.4,1463,4.435567,2,4,-1.301925,-1.086857,4.567468,0,...,0,0,0,0,0,0,0,0,0,0
1958-03-31,2477,91.2,1648,4.513055,3,9,-1.012609,-0.978533,4.435567,0,...,1,0,0,0,0,0,0,0,0,0
1958-04-30,2785,81.9,1595,4.405499,4,16,-0.723292,-0.826878,4.513055,0,...,0,1,0,0,0,0,0,0,0,0
1958-05-31,2994,80.5,1777,4.388257,5,25,-0.433975,-0.631894,4.405499,0,...,0,0,1,0,0,0,0,0,0,0


With the feature engineering steps we took, we should be able to train a linear regression model now. With `month_int` and `month_sqr` the model should be able to find a trend over the course of the year, which is either linear or curvelinear with a single peak or trough. By one-hot-encoding `month_int` the model can also capture month to month effects. Finally, using a lagged feature, the model can anchor its beer price prediction on the last known price.

- Split the data into training and test sets, using the last 12 months of data for testing. <span style="color:red" float:right>[2 point]</span> 
- Train a linear regression model to predict beer price using onely the features we created earlier. <span style="color:red" float:right>[2 point]</span> 

In [7]:
from sklearn.linear_model import LinearRegression
# part 1 splitting the data
train_CBE=CBE[:-13].bfill().ffill() # back-filling the beginning and forward-filling the ending as the first row contains a NaN value
test_CBE=CBE[-13:-1] #use the last 12 months for testing
## part 2 train a lin reg model to predict beer_log price
Y = train_CBE['beer_log']
#
column=['month_int_norm','month_sqr_norm','beer_log_lag_1'] + list(dum.columns.values)
X=train_CBE[column]
#model initialization
regression_model = LinearRegression()
#fit the data(train)
regression_model.fit(X,Y)
#predict
train_CBE['predicted_beer_log'] = regression_model.predict(X)

- Plot a line plot of the original time series, and to the same plot add line plots to show the predictions on the training data and the test data. Use separate colors for each. <span style="color:red" float:right>[3 point]</span> 

In [8]:
## your code goes here
# create test prediction
test_CBE['predicted_beer_log'] = regression_model.predict(test_CBE[column]) #using same columns but test data frame
# plots

fig = plt.figure()
ax = plt.axes()
plt.plot(CBE.index, CBE['beer_log']) # plot actual
plt.plot(train_CBE.index, train_CBE['predicted_beer_log']) # plot predicted_train
plt.plot(test_CBE.index, test_CBE['predicted_beer_log']) # plot predicted_test
plt.title("Predicted Price of Beer v.s. Actual")
plt.xlabel("Year")
plt.ylabel("Price of Beer");
plt.legend(['Actual','Predicted_Train','Predicted_Test'])

<matplotlib.legend.Legend at 0x1ca71ac0580>

- Compute the **root mean square error (RMSE)** of the model on the test data and plot the line plot and the histogram of the residual (beer price minus forecast) using the `plot_ts_resid` helper function. What conclusion would you draw about the model we fit? <span style="color:red" float:right>[2 point]</span> 

In [9]:
from sklearn.metrics import mean_squared_error
## your code goes here
## part 0 use the model to predict based on the test data
#test_CBE['predicted_beer'] = regression_model.predict(test_CBE[column])
# part 1 calculate RMSE of the predicted test data
rmse=mean_squared_error(y_true=test_CBE['beer_log'],y_pred=test_CBE['predicted_beer_log'])# using test data not training data
print(f'The RMSE is {rmse:.4f}')
# part 2 plot line plot/ histogram of resid
plot_ts_resid(test_CBE['predicted_beer_log'])
plot_ts_resid(test_CBE['beer_log'])

The RMSE is 0.0105


### Conclusions

By evaluating both the train and test model predictions against the actual we can see at a 10,000 foot level that the linear model is relatively accurate at prediction.  By doing additional calculation of the RMSE of the test_log data v.s. the test_log prediction we can see that the RMSE is 0.0101 which when compared to our values of ~4.2-5.3 is quite good and further strengthens our position that we have created a strong model. By evaluating the residuals of the test data v.s. the test prediction we can see that the residuals follow a similar shape with values ranging from ~4.9-5.35. Additionally the histogram shows the values are centered around 5.0 with a second peak at about 3.5.  All of the above leads us to believe the model created has a strong ability to accurately predict the beer_log price to a reasonable degree.

### Fitting a time series model

Let's now try the models we learned about in this lesson. By doing so, we can later compare the two approaches and appreciate the pros and cons.

- Use the `decomp_ts` helper function to decompose `beer_log`. Remove the NAs from the data, then use the `plot` method to plot a line plot of the components. <span style="color:red" float:right>[3 point]</span> 

In [10]:
## your code goes here
decomp = decomp_ts(CBE['beer_log'], period = 12).bfill().ffill() # use helper function and remove NaNs
#plot
decomp.plot()

<AxesSubplot:>

- Compute and plot the ACF and PACF for the remainder (residual) series, up to 36 lags. <span style="color:red" float:right>[2 point]</span>

In [11]:
## your code goes here
plot_acf_pacf(decomp['resid'], lags = 36)

As you can imagine, with real data things can look very messy. The ACF and PACF can exhibit both AR and MA behavior and it's hard to know what degrees to choose. So we will use the `auto_arima` function to help us: It  iterates over a grid of $(p, d, q)$ and seasonal $(P, D, Q)$ values. For each combination the BIC is computed and compared to the best previous model. For each combination the BIC is computed and compared to the best previous model. The better model is the one with the lowest BIC: The **Bayesian information criteria (BIC)** is a measure for assessing a model's fit:

$$
\begin{align}
\text{BIC} &= \ln(n)k - 2 \ln(\hat L)
\end{align}
$$

where $\hat L$ is the likelihood of the data given the fitted model parmaters, $k$ is the number of model parameters, and $n$ is the number of observations. Lower values for BIC means we have a better fit.

The code below implements `auto_arima`. As you can see, we provide it with the data, some maximum value for the hyper-pramaters $(p, d, q)$ and $(P, D, Q)$. It's very unusual to choose a number greater than 3. Run the next cell and examine the results. The function returns the best model, i.e. the model whose hyper-parameters gave the lowest BIC.

In [12]:
CBE.index[-13]

Timestamp('1989-12-31 00:00:00', freq='M')

In [14]:
validation_cut_off=CBE.index[-13]
from pmdarima import auto_arima
best_fit = auto_arima(CBE.loc[:validation_cut_off, 'beer_log'], 
                      max_p = 3, max_d = 1, max_q = 3, 
                      m = 12, max_P = 1, max_D = 1, max_Q = 1, 
                      information_criterion = 'bic', 
                      trace = True, error_action = 'ignore', suppress_warnings = True)

ModuleNotFoundError: No module named 'pmdarima'

Let's take a look at the best model's summary:

In [None]:
print(best_fit.summary())

Let's now visualize the forecast. With time series models we use the `predict_in_sample` to make predictions for the range of data that we used during training, and we use `predict` to make forecasts.

In [None]:
#played around with some variable names to make things work with my code above.
start_idx = 1
train_idx = train_CBE.reset_index().index[start_idx:]
n_periods = len(test_CBE)

sns.lineplot(CBE.index, CBE['beer_log'], alpha = 0.3)
sns.lineplot(train_CBE.index[train_idx], best_fit.predict_in_sample(start = start_idx, end = train_idx.max()))
sns.lineplot(test_CBE.index, best_fit.predict(n_periods = n_periods))
plt.legend(['original', 'fit', 'forecast']);

Notice how the predictions are initially a bit off, but overall the forecasts look reasonable.

- Compute the RMSE and use `plot_ts_resid` to plot the line plot and the histogram of the residuals. How does the RMSE for this model compare the the linear regression model? <span style="color:red" float:right>[2 point]</span>

In [None]:
## your code goes here
# part 1 calculate RMSE of the predicted test data
rmse=mean_squared_error(y_true=test_CBE['beer_log'],y_pred=best_fit.predict(n_periods = n_periods))# using test data not training data
print(f'The RMSE is {rmse:.4f}')
# part 2 plot line plot/ histogram of resid
best_fit_predict=best_fit.predict(n_periods = n_periods) #need to change np array to pandas series for helper function to work
plot_ts_resid(pd.Series(best_fit_predict, name='best_fit_log'))
plot_ts_resid(test_CBE['predicted_beer_log'])
plot_ts_resid(test_CBE['beer_log'])

### Conclusion
The RMSE of this "best_fit" model(RMSE=0.0118) is slightly worse than (RMSE=0.0101) of the linear regression model. However by looking at the histogram charts the best_fit models residuals are more in line with the actual data. In either case both models perform relatively well and further data or testing will have to be completed to determine a clear winner. 

We hope the assignment convinced you that the decision as to which model is better is not always clear. Of course we can rely on a metric like RMSE, but we don't want that to be the sole determinant. The level of familiarity with this or that algorithm should also be important. For example, we spent a lot of time learning about linear models, so even if the linear model is a slight worse fit we may prefer it because they are efficient and we can focus on feature engineering to improve its performance. ARIMA models on the other hand have the advantage of taking care of a lot of the feature engineering, but they are more difficult to explainr and require more experience in order to tune well. These sorts of trade-offs are very common in data science.

# End of assignment