#### This notebook generates the model to be deployed for the web app, and can be part of a CI/CD ML pipeline

The data comes from the US airline passengers dataset, and the main objective is to forecast the number of passengers expected on future dates. We will use the PyCaret library for every step.

In [25]:
# read CSV file
import pandas as pd
import numpy as np
data = pd.read_csv('AirPassengers.csv')
data.head()

Unnamed: 0,Month,#Passengers
0,1949-01,112
1,1949-02,118
2,1949-03,132
3,1949-04,129
4,1949-05,121


In [26]:
data['Date'] = pd.to_datetime(data['Month'])
data.head()

Unnamed: 0,Month,#Passengers,Date
0,1949-01,112,1949-01-01
1,1949-02,118,1949-02-01
2,1949-03,132,1949-03-01
3,1949-04,129,1949-04-01
4,1949-05,121,1949-05-01


In [27]:
# create 12 month moving average
data['M12'] = data['#Passengers'].rolling(12).mean()

# plot the data and MA
import plotly.express as px
fig = px.line(data, x='Date', y=["#Passengers", 'M12'],
template = 'plotly_dark')
fig.show()
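
One detail of the moving average above is worth checking: with the default settings, `rolling(12)` only produces a value once 12 observations are available, so the first 11 rows of `M12` are empty. The sketch below shows this on a hypothetical toy series (not the airline data), together with the `min_periods` option as one possible alternative.

```python
import pandas as pd

# Hypothetical toy series of 14 monthly values (not the airline data).
values = pd.Series(range(1, 15), dtype="float64")

# Default behaviour: a full 12-point window is required, so the first
# 11 positions have no complete window and come out as NaN.
ma_full = values.rolling(12).mean()

# With min_periods=1 the average uses however many points exist so far.
ma_partial = values.rolling(12, min_periods=1).mean()

print(ma_full.isna().sum())   # number of empty positions at the start
print(ma_full.iloc[11])       # first complete window: mean of 1..12
print(ma_partial.iloc[0])     # partial window: just the first value
```

In the plot, this is why the `M12` line would start almost a year after the raw series.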

Most regression algorithms cannot take date values as input directly, so we derive numeric columns from each date: the year, the month number, and a running sequence index.

In [28]:
# extract year and month from the dates
data['Year'] = data['Date'].dt.year
data['MonthNum'] = data['Date'].dt.month

# create a sequence of numbers
data['Series'] = np.arange(1, len(data)+1)

# drop unnecessary columns and re-arrange
data.drop(['Date', 'M12', 'Month'], axis=1, inplace=True)
data = data[['Series', 'Year', 'MonthNum', '#Passengers']]

data.head()

Unnamed: 0,Series,Year,MonthNum,#Passengers
0,1,1949,1,112
1,2,1949,2,118
2,3,1949,3,132
3,4,1949,4,129
4,5,1949,5,121


In [29]:
# split data into train-test set
train = data[data['Year']< 1960]
test = data[data['Year']>=1960]

# check shape
train.shape, test.shape

((132, 4), (12, 4))

We split the dataset manually rather than at random because the training data must be the historical observations (1949 through 1959) and the test data must be the most recent period, the one to be forecast (1960).
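
As a minimal sketch of that idea, the snippet below builds a hypothetical toy frame (36 months, not the airline data), splits it on a calendar boundary, and checks that no training row comes after a test row. The names `toy`, `toy_train`, `toy_test`, and `cutoff_year` are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame: 36 consecutive months starting in January 1958.
dates = pd.date_range(start="1958-01-01", periods=36, freq="MS")
toy = pd.DataFrame({"Series": range(1, 37), "Year": dates.year})

# Split on a calendar boundary instead of sampling rows at random.
cutoff_year = 1960
toy_train = toy[toy["Year"] < cutoff_year]
toy_test = toy[toy["Year"] >= cutoff_year]

# Every training row must come strictly before every test row;
# otherwise future information would leak into training.
is_chronological = toy_train["Series"].max() < toy_test["Series"].min()

print(toy_train.shape, toy_test.shape, is_chronological)
```

A random split would mix past and future months, which tends to make the test score look better than a real forecast would be.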

Next, we call the setup function, passing it the training data, the test data, and the cross-validation strategy through the fold_strategy parameter.

In [33]:
# Import the regression module
from pycaret.regression import *

# Initialize setup
s = setup(data=train, test_data = test, target='#Passengers', fold_strategy='timeseries', numeric_features=['Year', 'Series'], fold=3,
         transform_target=True, session_id=123)


Unnamed: 0,Description,Value
0,session_id,123
1,Target,#Passengers
2,Original Data,"(132, 4)"
3,Missing Values,False
4,Numeric Features,2
5,Categorical Features,1
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(132, 13)"
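
To make the `fold_strategy='timeseries'` and `fold=3` arguments more concrete, here is a rough sketch of an expanding-window split over 132 rows. It assumes the strategy behaves like scikit-learn's `TimeSeriesSplit` with default settings; `expanding_window_splits` is a hypothetical helper written for illustration and is not part of PyCaret.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs in chronological order.

    Each fold trains on everything before its test block, so the
    training window grows while the test block size stays fixed.
    """
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# 132 training rows and 3 folds, matching the setup call above.
splits = list(expanding_window_splits(n_samples=132, n_splits=3))
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx[0], test_idx[-1])
```

Under these assumptions the folds train on the first 33, 66, and 99 rows and validate on the 33 rows that follow each, so a model is never scored on months that precede its training data.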


#### Train and evaluate models

In [34]:
best = compare_models(sort = 'MAE')

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lar,Least Angle Regression,22.398,923.8647,28.2855,0.5621,0.0878,0.0746,0.07
lr,Linear Regression,22.3981,923.8767,28.2857,0.5621,0.0878,0.0746,0.05
huber,Huber Regressor,22.4107,890.3191,27.9113,0.6001,0.0879,0.0749,0.1333
br,Bayesian Ridge,22.4783,932.2165,28.5483,0.5611,0.0884,0.0746,0.0433
ridge,Ridge Regression,23.1976,1003.9423,30.0409,0.5258,0.0933,0.0764,0.0633
lasso,Lasso Regression,38.4188,2413.5096,46.8468,0.0882,0.1473,0.1241,0.0567
en,Elastic Net,40.6486,2618.8753,49.4048,-0.0824,0.1563,0.1349,0.05
omp,Orthogonal Matching Pursuit,44.3054,3048.2658,53.8613,-0.4499,0.1713,0.152,0.0433
gbr,Gradient Boosting Regressor,50.1217,4032.0567,61.2306,-0.6189,0.2034,0.1538,0.1333
rf,Random Forest Regressor,52.7754,4705.6863,65.6728,-0.7962,0.2148,0.1592,0.4567
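
For reading the table above, it may help to recall how two of the columns are defined. The sketch below computes MAE and MAPE by hand on hypothetical toy numbers (not model output); MAPE is expressed as a fraction, which matches the scale shown in the table.

```python
# Hypothetical actual and predicted values, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 380.0]

# Absolute error per observation: 10, 10, 20.
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

# MAE: mean of the absolute errors, in the units of the target.
mae = sum(abs_errors) / len(abs_errors)

# MAPE: mean of the absolute errors relative to the actual values.
mape = sum(e / abs(a) for e, a in zip(abs_errors, actual)) / len(actual)

print(round(mae, 4), round(mape, 4))
```

MAE is measured in passengers, while MAPE is scale-free, which makes it easier to compare across periods with very different passenger levels.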


The best model by MAE is Least Angle Regression, with a cross-validated MAE of about 22.40. Let's check its score on the test set.

In [35]:
prediction_score = predict_model(best)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Least Angle Regression,25.0714,972.2733,31.1813,0.8245,0.0692,0.0571
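
The gap between this hold-out score and the cross-validated score from `compare_models` can be worked out directly. The two values below are copied from the tables in this notebook; the variable names are illustrative.

```python
cv_mae = 22.398      # MAE of Least Angle Regression in the compare_models table
test_mae = 25.0714   # MAE reported by predict_model on the 1960 hold-out data

# Relative increase of the hold-out MAE over the cross-validated MAE.
relative_increase = (test_mae - cv_mae) / cv_mae

print(f"{relative_increase:.1%}")
```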


Here the MAE (25.07) is about 12% higher than the cross-validated MAE (22.40), which may indicate some overfitting. Let's plot the predictions.

In [36]:
# generate predictions on the original dataset
predictions = predict_model(best, data=data)

# add a date column in the dataset
predictions['Date'] = pd.date_range(start='1949-01-01', end='1960-12-01', freq='MS')

# line plot 
fig = px.line(predictions, x='Date', y=['#Passengers', "Label"], template='plotly_dark')

# Add a vertical rectangle for test-set separation
fig.add_vrect(x0='1960-01-01', x1='1960-12-01', fillcolor='grey', opacity=0.25, line_width=0)

fig.show()

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Least Angle Regression,12.5381,278.7548,16.696,0.9805,0.0538,0.0447


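The grey band towards the end marks the test period. Note that the MAE printed above (12.54) is computed over the whole dataset, including the rows the model was trained on, so it is not comparable to the test-set score. Now it is time to finalize the model, refitting LAR on the whole dataset.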

In [37]:
final_best = finalize_model(best)

With the model trained on all the data, we can forecast a future date range, here about four years ahead (January 1961 through January 1965). We first build a frame with the features for those dates.

In [41]:
future_dates = pd.date_range(start = '1961-01-01', end = '1965-01-01', freq = 'MS')

future_df = pd.DataFrame()

future_df['MonthNum'] = [i.month for i in future_dates]
future_df['Year'] = [i.year for i in future_dates]    
future_df['Series'] = np.arange(145,(145+len(future_dates)))

future_df.head()

Unnamed: 0,MonthNum,Year,Series
0,1,1961,145
1,2,1961,146
2,3,1961,147
3,4,1961,148
4,5,1961,149
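
The same construction can be wrapped in a small function so that the starting `Series` value is derived from the history length instead of being hardcoded. This is only a sketch: `make_future_frame` and its parameters are hypothetical names, and it assumes monthly data with a `Series` index that continues right after the last historical row.

```python
import numpy as np
import pandas as pd


def make_future_frame(start, end, n_history):
    """Build MonthNum, Year, and Series features for a monthly date range.

    n_history is the number of historical rows, so the Series index
    continues at n_history + 1.
    """
    dates = pd.date_range(start=start, end=end, freq="MS")
    return pd.DataFrame({
        "MonthNum": dates.month,
        "Year": dates.year,
        "Series": np.arange(n_history + 1, n_history + 1 + len(dates)),
    })


# 144 historical months (1949-01 through 1960-12), as in this notebook.
future_check = make_future_frame("1961-01-01", "1965-01-01", n_history=144)
print(len(future_check), future_check["Series"].iloc[0], future_check["Series"].iloc[-1])
```

The range from January 1961 through January 1965 covers 49 months, so `Series` runs from 145 to 193.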


Using this frame of future dates, we can generate predictions with our finalized model.

In [42]:
predictions_future = predict_model(final_best, data=future_df)
predictions_future.head()

Unnamed: 0,MonthNum,Year,Series,Label
0,1,1961,145,486.278268
1,2,1961,146,482.208187
2,3,1961,147,550.485967
3,4,1961,148,535.187177
4,5,1961,149,538.923789


In [45]:
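# stack history and forecasts; '#Passengers' is NaN for future rows, 'Label' is NaN for historical rows
concat_df = pd.concat([data, predictions_future], axis=0)
# rebuild a monthly date index spanning both periods (144 + 49 = 193 months)
concat_df_i = pd.date_range(start='1949-01-01', end='1965-01-01', freq='MS')
concat_df.set_index(concat_df_i, inplace=True)
fig = px.line(concat_df, x=concat_df.index, y=["#Passengers", "Label"], template='plotly_dark')
fig.show()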
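
The concatenation step relies on how pandas aligns frames with different columns. The sketch below uses two hypothetical toy frames (not the notebook's data) to show that a column missing from one frame is filled with NaN, which is why the actual and forecast lines in the plot do not overlap.

```python
import pandas as pd

# Hypothetical toy frames: history has actuals, future has predictions.
history = pd.DataFrame({"Series": [1, 2, 3], "#Passengers": [112, 118, 132]})
future = pd.DataFrame({"Series": [4, 5], "Label": [140.0, 145.0]})

# Row-wise concat keeps the union of columns and fills the gaps with NaN.
combined = pd.concat([history, future], axis=0)

# Replace the duplicated integer index with one continuous monthly index.
combined.index = pd.date_range(start="1949-01-01", periods=len(combined), freq="MS")

print(combined)
```

The length of the new date index has to match the number of rows in the combined frame, which is why the end date of the range matters in the cell above.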

In [46]:
# save model to disk
save_model(final_best, 'time-series-forecast-pipeline')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[], ml_usecase='regression',
                                       numerical_features=['Year', 'Series'],
                                       target='#Passengers', time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=No...
                                                  n_nonzero_coefs=500,
                                                  normalize=True,
                                                  power_transformer_method='box-cox',
                                                  power_transformer_standardize=True,
                                     