# Project 3

In this project, we apply machine learning methods to predict Consumer Price Index. 

After obtaining the predicted CPI, we would then calculate monthly and yearly inflation.
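Both inflation rates follow directly from the CPI level as percentage changes over 1 and 12 months. A minimal sketch, assuming a monthly CPI series; the values are made up for illustration, and `inflation_rates` is a hypothetical helper rather than part of the project code:

```python
import pandas as pd

def inflation_rates(cpi: pd.Series) -> pd.DataFrame:
    """Return 1-month and 12-month percentage changes of a monthly CPI series."""
    return pd.DataFrame({
        "monthly_pct": cpi.pct_change(1) * 100,
        "yearly_pct": cpi.pct_change(12) * 100,
    })

# Illustrative (made-up) CPI levels: 13 months, rising by 1 index point each month
idx = pd.date_range("2019-01-01", periods=13, freq="MS")
cpi = pd.Series([100.0 + i for i in range(13)], index=idx)
rates = inflation_rates(cpi)
```

The first month has no monthly rate and the first 12 months have no yearly rate, because the earlier CPI values needed for the comparison are missing.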

After carefully considering the underlying structure of the data, we decided to build models using the period from January 2010 through December 2019:

- 2010 - 2016 as training data

- 2017 - 2018 as validation data

- 2019 as test data


# I. Preprocessing 

## 1. Label Decomposition

Import the necessary libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller, acf, pacf
import seaborn as sns
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import kpss
from statsmodels.tsa.arima.model import ARIMA


In [None]:
# Read in the data

df = pd.read_csv('cpi.csv', parse_dates= [['Year', 'Month']], index_col= 'Year_Month')

# Keep data from January 2010 through December 2019
df = df.loc['2010-01-01':'2019-12-31']

# Set the monthly frequency for the data

df.index.freq = 'MS'

# Change the index name to 'Date'
df.index.name = 'Date'

Visualize monthly and yearly inflation

In [None]:
df['1-Month % Change'].plot()
plt.title('1-month inflation rate')

In [None]:
df['12-Month % Change'].plot()
plt.title('12-month inflation rate')

Our main focus is the CPI, so let's decompose this series first.
- First, decompose the CPI column into trend, seasonal, and residual components using the additive method.

- Second, apply the multiplicative method.

In [None]:
df['CPI'].describe()

### 1.1 Additive decomposition

In [None]:
additive_decomposed = seasonal_decompose(df['CPI'], 
                                         model='additive',
                                         two_sided= False, 
                                         period= 6)

# Plot the original data, trend, seasonal, and residual components
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(10, 8), sharex=True)

# Original data
ax1.plot(df['CPI'])
ax1.set_title('Original Data')
ax1.grid()

# Trend component
ax2.plot(additive_decomposed.trend)
ax2.set_title('Trend Component')
ax2.grid()

# Seasonal component
ax3.plot(additive_decomposed.seasonal)
ax3.set_title('Seasonal Component')
ax3.grid()

# Residual component
ax4.plot(additive_decomposed.resid)
ax4.set_title('Residual Component')
ax4.grid()

plt.tight_layout()
plt.show()


A statistical look into the seasonal component

In [None]:
additive_decomposed.seasonal.describe()

### 1.2 Multiplicative Decomposition

In [None]:
multiplicative_decomposed = seasonal_decompose(df['CPI'], 
                                               model='multiplicative',
                                               two_sided= False, 
                                               period= 6)

# Plot the original data, trend, seasonal, and residual components
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(10, 8), sharex=True)

# Original data
ax1.plot(df['CPI'])
ax1.set_title('Original Data')
ax1.grid()

# Trend component
ax2.plot(multiplicative_decomposed.trend)
ax2.set_title('Trend Component')
ax2.grid()

# Seasonal component
ax3.plot(multiplicative_decomposed.seasonal)
ax3.set_title('Seasonal Component')
ax3.grid()

# Residual component
ax4.plot(multiplicative_decomposed.resid)
ax4.set_title('Residual Component')
ax4.grid()

plt.tight_layout()
plt.show()


### 1.3 Decomposition Conclusion

- After trying multiple periods/frequencies, we decided to use a period of 6 to decompose the CPI, as it yields the most regular seasonal component.

- Both the multiplicative and the additive decomposition show that the trend is the dominant component of the CPI.

## 2. Trend Analysis

In [None]:
# Obtain statistical attributes of the trend component
additive_decomposed.trend.describe()

Since the series has a linear trend, it is definitely not stationary. Thus, we should attempt to make it stationary.

In addition, we can examine how the statistical properties of a series change over time by plotting its rolling mean and rolling standard deviation. This helps us check for structural breaks and heteroscedasticity.
- The rolling window size is 12 months

In [None]:
# Create a function to plot the rolling mean and rolling standard deviation
def rolling_statistics(timeseries, custom_name, window_size=12):
    # Determine rolling statistics
    rolling_mean = timeseries.rolling(window=window_size).mean()
    rolling_std = timeseries.rolling(window=window_size).std()

    # Plot rolling statistics
    plt.figure(figsize=(10, 6))
    plt.plot(rolling_mean, color='black', label='Rolling Mean')
    plt.plot(rolling_std, color='red', label='Rolling Std')
    plt.legend(loc='best')
    plt.title(f'{window_size} Periods Rolling Mean & Standard Deviation of {custom_name}')
    plt.grid()
    plt.show()

### 2.1 Label Differencing

First, attempt to difference the data to see if the process can make the data more stationary. 

#### 2.1.1 First Order Differencing

In the first order differencing, we would subtract the immediate previous value from the current value to obtain the difference between two consecutive periods. 

First-Order Differencing = Value at time t - Value at time t-1

In [None]:
diff_data = df['CPI'].diff().dropna()

In [None]:
diff_data.plot()
plt.title('First - Order Differenced Data')

In [None]:
rolling_statistics(diff_data, 'First - Order Differenced Data')

#### 2.1.2 Second Order Differencing

Second-order difference is the difference of the differences. That is, it's the first-order difference of the first-order differences. 
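As a small illustration on made-up numbers (not the CPI data), differencing twice is the same as applying the closed form `y[t] - 2*y[t-1] + y[t-2]`:

```python
import pandas as pd

# Made-up series for illustration
y = pd.Series([100.0, 101.0, 103.0, 106.0, 110.0])

first_diff = y.diff()            # y[t] - y[t-1]
second_diff = first_diff.diff()  # (y[t] - y[t-1]) - (y[t-1] - y[t-2])

# Equivalent closed form of the second-order difference
closed_form = y - 2 * y.shift(1) + y.shift(2)
```

The first difference loses one observation and the second difference loses two, which is why the transformed series starts two months after the original.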

In [None]:
second_order_diff = diff_data.diff().dropna()

In [None]:
second_order_diff.plot()
plt.title('Second - Order Differenced Data')


In [None]:
rolling_statistics(second_order_diff, 'Second - Order Differenced Data')

### 2.2 Label Detrending

- The method for smoothing data used in this project is backward moving average.

- Detrended data is computed by subtracting the trend values from the actual values. 

- Since we use a period of 6 to smooth out the data, the function will use a backward moving average with a window size of 6 to smooth the trend component (6 periods prior to the current value).

- As a result, we would lose 6 observations in using label detrending, compared to only 1 in first-order differencing, and 2 in second-order differencing.
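The backward moving average described above can be sketched as follows. This is a hedged approximation of the one-sided filter that, to our understanding, `seasonal_decompose` applies for an even period (half weights at both ends, so 7 observations are used for a period of 6); `backward_trend` is a hypothetical helper and the input series is made up:

```python
import numpy as np
import pandas as pd

def backward_trend(series: pd.Series, period: int = 6) -> pd.Series:
    """Trailing weighted moving average for an even period (assumption: even period only)."""
    # Half weights at both ends, so period + 1 trailing observations are used
    weights = np.array([0.5] + [1.0] * (period - 1) + [0.5]) / period
    return series.rolling(window=period + 1).apply(
        lambda window: np.dot(window, weights), raw=True
    )

y = pd.Series(np.arange(1.0, 21.0))  # made-up linear series 1, 2, ..., 20
trend = backward_trend(y)
detrended = (y - trend).dropna()
```

Because the window only looks backward, the first 6 trend values are missing, which matches the 6 observations lost by detrending.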

In [None]:
# Here, I extract the trend component from the multiplicative decomposition. Trend values from either multiplicative or additive decompositions are identical.
trend = multiplicative_decomposed.trend

In [None]:
detrend = df['CPI']- trend

detrend.dropna(inplace=True)

detrend.plot()

plt.title('Detrended Data')

In [None]:
rolling_statistics(detrend, 'Detrended Data')

### 2.3 Differencing and Detrending Conclusion

The mean and variance of these transformed series are not perfectly constant over time. Among the 3 transformation methods, second-order differencing appears to produce the most stationary series. Therefore, we move forward and investigate the second-order differenced data further.

### 2.4 Transformed Label's Statistical Description

Create a box plot to visualize the data distribution

In [None]:
def cus_boxplot(data1, title1):
    fig, ax1 = plt.subplots(1, 1, figsize=(5, 5))
    sns.boxplot(data1, ax=ax1)
    ax1.set_title(title1)
    plt.show()

In [None]:
cus_boxplot(second_order_diff, 'Second-Order Differenced Data')

Obtain the statistical description of the data

In [None]:
print('Second order difference data statistical summary:')
second_order_diff.describe()

## 3. Stationarity and White Noise Test

Create a function to calculate the ADF test and print out the result. 

In [None]:
def stationary_test(input):

    result = adfuller(input)
    print('ADF Statistic:', result[0])
    print('p-value:', result[1])
    print('Critical Values:', result[4])

    # Reject the null hypothesis if the p-value is below the chosen significance level
    if result[1] < 0.05:
        print("The data is STATIONARY.")
    else:
        print("The data is NOT STATIONARY.")
        

In addition to the **ADF** test, let's use the non-parametric **KPSS** test to confirm the stationarity of the data. If the KPSS result contradicts the conclusion from ADF, we need to investigate further.

### 3.1 Augmented Dickey-Fuller Test

To verify statistically whether the data is stationary, we use the ADF test.

- Null hypothesis: The time series contains a unit root and is non-stationary

- Alternative hypothesis is that the time series is stationary. 

To confirm that the data is stationary, we need a p-value lower than the significance level in order to reject the null hypothesis. Equivalently, the ADF statistic should be more negative than (that is, below) the critical value at that level.

- The significance level chosen is 0.05. 

ADF on the second order differenced dataset

In [None]:
stationary_test(second_order_diff)

### 3.2 Non-parametric KPSS test

- Null hypothesis: The time series is stationary (no unit root)

- Alternative hypothesis: The time series is non-stationary (it has a unit root)

The KPSS test statistic is compared to the relevant critical values. If the test statistic is greater than the critical value at the chosen level of significance, we reject the null hypothesis and conclude that the series is non-stationary with a unit root.
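Because the two tests have opposite null hypotheses, their p-values are read in opposite directions. A minimal sketch of how the two outcomes can be combined; `combine_adf_kpss` is a hypothetical helper and the p-values in the usage lines are made up:

```python
def combine_adf_kpss(adf_pvalue: float, kpss_pvalue: float, alpha: float = 0.05) -> str:
    """Combine ADF and KPSS p-values into a single verdict."""
    adf_stationary = adf_pvalue < alpha      # ADF rejects the unit-root null
    kpss_stationary = kpss_pvalue >= alpha   # KPSS fails to reject the stationarity null
    if adf_stationary and kpss_stationary:
        return "stationary"
    if not adf_stationary and not kpss_stationary:
        return "non-stationary"
    return "inconclusive: investigate further"

# Made-up p-values for illustration
agree = combine_adf_kpss(adf_pvalue=0.001, kpss_pvalue=0.10)
conflict = combine_adf_kpss(adf_pvalue=0.001, kpss_pvalue=0.01)
```

Only the case where both tests point the same way gives a clear answer; a conflict is the situation that calls for further investigation.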


In [None]:
# Create a function to perform the kpss test.
def kpss_test(input):
        result = kpss(input)
        print('KPSS Statistic:', result[0])
        print('p-value:', result[1])
        print('Critical Values:', result[3])
    
        # Reject the null hypothesis if the p-value is below the chosen significance level
        if result[1] < 0.05:
            print("The data is NOT STATIONARY.")
        else:
            print("The data is STATIONARY.")


KPSS test on the second-order differenced data

In [None]:
kpss_test(second_order_diff)

The test statistic is well below the critical values at most significance levels. This supports the null hypothesis that the series is stationary.

### 3.3 ADF and KPSS test conclusion 

The second-order differenced data is found to be stationary by both the ADF and KPSS tests. The results of the two tests are consistent.

### 3.4 White Noise Check 

In this test, we check the autocorrelation between the current value and its first 12 lags. If there is a correlation between the current value and a number of its lags, then the series is not white noise.

In [None]:
from statsmodels.stats.diagnostic import acorr_ljungbox
def white_noise_test(input):
    # Run the Ljung-Box test once; it returns one p-value per lag from 1 to `lags`
    lags = 12
    result = acorr_ljungbox(input, lags=lags)
    p_val_list = result['lb_pvalue'].tolist()
    # If all p-values in the list are below 0.05, the time series is not white noise
    if all(i < 0.05 for i in p_val_list):
        print('The time series is NOT a white noise.')
    

In [None]:
white_noise_test(second_order_diff)

In [None]:

from statsmodels.stats.diagnostic import acorr_ljungbox
def white_noise_test(input):
    # Run the Ljung-Box test once; it returns one p-value per lag from 1 to `lags`
    lags = 12
    result = acorr_ljungbox(input, lags=lags)
    p_val_list = result['lb_pvalue'].round(4).tolist()
    # If all p-values in the list are below 0.05, the time series is not white noise
    if all(i < 0.05 for i in p_val_list):
        print('The time series is NOT a white noise.')
    else:
        print('The time series is a white noise.')
    # Store the p-values in a data frame indexed by lag
    p_val_df = pd.DataFrame(p_val_list, index=range(1, lags+1), columns=['P_Value'])
    return p_val_df

In [None]:
white_noise_test(second_order_diff)

Since the series shows a correlation between the current value and its lags, the data is not white noise.
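For intuition about what the Ljung-Box test measures, the Q statistic can be computed by hand from the sample autocorrelations. This is a sketch on made-up data, not a replacement for `acorr_ljungbox`; `ljung_box_q` is a hypothetical helper, and the constant is the 95% chi-square quantile for 12 degrees of freedom:

```python
import numpy as np

def ljung_box_q(x, lags: int = 12) -> float:
    """Ljung-Box Q statistic over the first `lags` sample autocorrelations."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    total = 0.0
    for k in range(1, lags + 1):
        rho_k = np.dot(xc[k:], xc[:-k]) / denom  # sample autocorrelation at lag k
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total

CHI2_95_DF12 = 21.026  # 95% quantile of the chi-square distribution, 12 degrees of freedom

rng = np.random.default_rng(0)
noise = rng.normal(size=200)             # made-up white noise
alternating = np.tile([1.0, -1.0], 100)  # made-up, strongly autocorrelated series

q_noise = ljung_box_q(noise)
q_alternating = ljung_box_q(alternating)
```

A Q value above the critical value means the autocorrelations are jointly too large for white noise; the alternating series exceeds it by a wide margin.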

## 4. Lag Analysis

To identify the useful lag variables, we can use the autocorrelation function (ACF) and Partial Autocorrelation Function (PACF) plots.

The main difference between ACF and PACF is that ACF measures the total correlation between a time series and its lagged values, while PACF measures the direct correlation between a time series and its lagged values after removing the effect of the correlations with the intervening observations. 

ACF is primarily used to determine the MA component, while the PACF plot is used to determine the AR component.

The shaded area in the ACF and PACF plots is the confidence band (95% by default). If the bar for a lag extends beyond the shaded area, in either direction, that lag is significantly correlated with the label.
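The idea behind the band can be sketched with the common large-sample approximation of ±1.96/√n. This is an assumption made for illustration: `plot_acf` by default uses Bartlett's formula, so its band widens with the lag and will not match this constant band exactly. `sample_acf` and `significant_lags` are hypothetical helpers, and the series is made up:

```python
import numpy as np

def sample_acf(x, nlags: int) -> np.ndarray:
    """Sample autocorrelations for lags 1..nlags."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    return np.array([np.dot(xc[k:], xc[:-k]) / denom for k in range(1, nlags + 1)])

def significant_lags(x, nlags: int = 24, z: float = 1.96) -> list:
    """Lags whose autocorrelation falls outside the approximate 95% band."""
    band = z / np.sqrt(len(x))
    acf_vals = sample_acf(x, nlags)
    return [lag for lag, r in enumerate(acf_vals, start=1) if abs(r) > band]

alternating = np.tile([1.0, -1.0], 50)  # made-up series that flips sign every period
flagged = significant_lags(alternating, nlags=24)
```

Note that a lag counts as significant when its bar leaves the band on either side, so large negative autocorrelations matter as much as positive ones.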

### 4.1 Label's ACF and PACF

In [None]:
# ACF plot
plot_acf(second_order_diff, lags= 24, zero=False)
plt.title('ACF Plot of Second-Order Differenced Data')
plt.show()

# PACF plot
plot_pacf(second_order_diff, lags = 24, zero=False)
plt.title('PACF Plot of Second-Order Differenced Data')
plt.show()

### 4.2 Lag Analysis Conclusion 

- The ACF plot shows that the label is correlated with its lagged values up to 3 periods back.

- Meanwhile, the PACF shows that the label is directly correlated with the first 4 lags and with lags 9 and 22. Given the small size of the data, we cannot be sure that lag 22 is as significant as it appears on the graph.

## 5. Splitting the data

Training, validation, and test sets

- 2010 - 2016 as training data

- 2017 - 2018 as validation data

- 2019 as test data


In [None]:

train = second_order_diff.loc['2010-01-01':'2016-12-31']

val = second_order_diff.loc['2017-01-01':'2018-12-31']

test = second_order_diff.loc['2019-01-01':'2019-12-31']


In [None]:
# Calculate the mean and standard deviation of train, val, and test sets and print the result out
train_mean = train.mean().round(2)
train_std = train.std().round(2)

val_mean = val.mean().round(2)
val_std = val.std().round(2)

test_mean = test.mean().round(2)
test_std = test.std().round(2)

print('Train mean: ', train_mean)
print('Train std: ', train_std)
print('Val mean: ', val_mean)
print('Val std: ', val_std)
print('Test mean: ', test_mean)
print('Test std: ', test_std)

# II. Modeling 1 (Lag Predictors only)

## 1. Base model: ARIMA(1,2,1)

- The ARIMA(p,d,q) model contains 3 main components: AR, I (differencing), and MA.

- After careful consideration, second-order differencing seems to be the best way to make the data stationary, so we use ARIMA(1,2,1) as the base model for comparison purposes.

- The model takes into account 1 lagged value, 1 lagged error, and second-order differencing.
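Written out in standard ARMA notation, as a sketch only, the base model applied to the second-order differenced series is shown below. The symbols $w_t$, $\mu$, $\phi_1$, $\theta_1$ and $\varepsilon_t$ are introduced here for illustration and do not appear in the project code; $y_t$ denotes the CPI.

$$
w_t = y_t - 2y_{t-1} + y_{t-2}, \qquad
w_t - \mu = \phi_1 \,(w_{t-1} - \mu) + \varepsilon_t + \theta_1 \,\varepsilon_{t-1}
$$

Here $\phi_1$ is the coefficient on the lagged value (the AR part), $\theta_1$ is the coefficient on the lagged error (the MA part), and $\mu$ is the mean of the differenced series.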

### 1.1 Model Executing

In [None]:
# Create and fit an ARIMA(1,2,1) model to the training set

# Here we set d = 0 since the data has already been differenced twice manually
base_model = ARIMA(train, order=(1,0,1)).fit()


### 1.2 Model Summary

In [None]:
base_model.summary()

- The AR(1) component (the first lagged value) is found to be statistically insignificant since it has a very high p-value. The MA(1) component, which is the error term of the 1st lag, should be judged by its own p-value in the summary above.

- The skew and kurtosis figures also tell us about the distribution of the model's residuals: they are skewed to the left (negative skew) and have fat tails.

In [None]:
# evaluation metrics on the train set
train_pred = base_model.predict()
train_rmse = np.sqrt(np.mean((train_pred - train)**2))
train_mae = np.mean(np.abs(train_pred - train))

print('Train RMSE: ', train_rmse)
print('Train MAE: ', train_mae)

### 1.3 Predicting the Validation set

- if possible, please repeat the mean, standard deviation of the label here (2nd-order differenced)

In [None]:
# Forecast values for the validation set
validation_forecast = base_model.forecast(steps=len(val))

In [None]:
# Plot the forecasted values and the actual values and include evaluation metrics
# Calculate evaluation metrics
mae = np.mean(np.abs(validation_forecast - val))
mse = np.mean((validation_forecast - val)**2)
rmse = np.sqrt(mse)

plt.figure(figsize=(10, 6))
plt.plot(val, label='Actual')
plt.plot(validation_forecast, label='Forecast')
plt.text(0.88, 0.98, f'MAE: {mae:.2f}\nMSE: {mse:.2f}\nRMSE: {rmse:.2f}', 
                 transform=plt.gca().transAxes, verticalalignment='top')
plt.legend(loc='upper left')
plt.title('Validation Set: Actual vs Forecast of ARIMA (1,2,1)')
plt.show()




### 1.4 Model Evaluation on test set

In [None]:
# Prediction on the test set
base_arima_test_pred = base_model.forecast(steps=36)

# Only account for the last 12 months 
base_arima_test_pred = base_arima_test_pred[-12:]

mae = np.mean(np.abs(base_arima_test_pred - test))
mse = np.mean((base_arima_test_pred - test)**2)
rmse = np.sqrt(mse)

# Print evaluation metrics
print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}")

# Calculate the variance of the base_arima_test_pred
base_arima_test_pred_var = np.var(base_arima_test_pred)
print('Variance of the base_arima_test_pred: ', base_arima_test_pred_var)

# plot the forecasted values and the actual values and include evaluation metrics
plt.figure(figsize=(10, 6))
plt.plot(test, label='Actual')
plt.plot(base_arima_test_pred, label='Forecast')
plt.text(0.88, 0.98, f'MAE: {mae:.2f}\nMSE: {mse:.2f}\nRMSE: {rmse:.2f}',
            transform=plt.gca().transAxes, verticalalignment='top')
plt.legend(loc='upper left')
plt.title('Test Set: Actual vs Forecast of ARIMA (1,2,1)')
plt.show()


## 2. ARIMA with more ARs and MAs

From the ACF and PACF results above, we were able to identify the lags that are significantly correlated with the label.

- ACF's result is helpful in determining MA components, while PACF's helps determine AR components

From the graphs earlier, we would sequentially add MA and AR component to the model and observe how AIC and BIC change.

- A lower BIC and AIC along with lower RMSE and MAE are preferred. 
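For reference, AIC and BIC both reward a higher log-likelihood and penalize the number of estimated parameters, with BIC penalizing more heavily as the sample grows. A minimal sketch; the log-likelihoods, parameter counts, and sample size are made up, and `aic` and `bic` are hypothetical helpers (statsmodels reports both values in the model summary):

```python
import math

def aic(loglik: float, k: int) -> float:
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * loglik

def bic(loglik: float, k: int, n: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return k * math.log(n) - 2 * loglik

# Made-up candidates: the larger model fits slightly better but uses more parameters
candidates = {
    "small": {"loglik": -50.0, "k": 3},
    "large": {"loglik": -49.0, "k": 8},
}
n_obs = 82  # illustrative sample size

scores = {
    name: {"aic": aic(c["loglik"], c["k"]), "bic": bic(c["loglik"], c["k"], n_obs)}
    for name, c in candidates.items()
}
best_by_aic = min(scores, key=lambda name: scores[name]["aic"])
best_by_bic = min(scores, key=lambda name: scores[name]["bic"])
```

In this made-up case the small gain in likelihood does not justify five extra parameters, so both criteria prefer the smaller model.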

According to the lag analysis, the first 4 AR components and the first 3 residual lags appear to be statistically significant. Let's write a for loop to go through the candidate models and view the results.

In [None]:
ar = [2,3,4]
ma = [1,2,3]

for i in ma:

    for j in ar:
        
        # train and fit the model
        
        model = ARIMA(train, order=(j,0,i)).fit(method_kwargs={'maxiter': 100})
        
        validation_forecast = model.forecast(steps=len(val))
        
        # Calculate evaluation metrics
        
        mae = np.mean(np.abs(validation_forecast - val))
        
        mse = np.mean((validation_forecast - val)**2)
        
        rmse = np.sqrt(mse)
        
        # Plot the forecasted values and the actual values
        
        plt.figure(figsize=(10, 6))
        
        plt.plot(val, label='Actual')
        
        plt.plot(validation_forecast, label='Forecast')
        
        plt.legend(loc='upper left')
        
        plt.title(f'Validation Set: Actual vs Forecast of ARIMA({j},2,{i})')
        
        plt.text(0.88, 0.98, f'MAE: {mae:.2f}\nMSE: {mse:.2f}\nRMSE: {rmse:.2f}', 
                 transform=plt.gca().transAxes, verticalalignment='top')
        
        plt.show()

        # Attach the model's summary right below the graph

        print(model.summary())
      

Since ARIMA(3,2,3) has the lowest RMSE, we will use it to forecast the test set. But first, let's extract the evaluation metrics on the train set


In [None]:
best_arima = ARIMA(train, order=(3,0,3)).fit()

# Prediction on the test set
best_arima_test_pred = best_arima.forecast(steps=36)

# Only account for the last 12 months

best_arima_test_pred = best_arima_test_pred[-12:]

# Calculate evaluation metrics

mae = np.mean(np.abs(best_arima_test_pred - test))

mse = np.mean((best_arima_test_pred - test)**2)

rmse = np.sqrt(mse)

# Print evaluation metrics

print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}")

# Plot the forecasted values and the actual values

plt.figure(figsize=(10, 6))

plt.plot(test, label='Actual')

plt.plot(best_arima_test_pred, label='Forecast')

plt.legend(loc='upper left')

plt.text(0.88, 0.98, f'MAE: {mae:.2f}\nMSE: {mse:.2f}\nRMSE: {rmse:.2f}',
            transform=plt.gca().transAxes, verticalalignment='top')

plt.title('Test Set: Actual vs Forecast of ARIMA(3,2,3)')
plt.show()

# Calculate the variance of the best_arima_test_pred

best_arima_test_pred_var = np.var(best_arima_test_pred)

print('Variance of the best_arima_test_pred: ', best_arima_test_pred_var)

In [None]:
# evaluation metrics on the train set
train_pred = best_arima.predict()
train_rmse = np.sqrt(np.mean((train_pred - train)**2))
train_mae = np.mean(np.abs(train_pred - train))

print('Train RMSE: ', train_rmse)
print('Train MAE: ', train_mae)

## 3. ARIMA Model's Conclusion

*Performance on the validation set*

- The best ARIMA model so far is ARIMA(3,2,3), fitted as order (3,0,3) on the manually differenced data. For some other ARIMA specifications, the maximum likelihood optimization fails to converge. This leads to poor predictions: their forecasts form a horizontal line, which is completely different from the forecast of ARIMA(3,2,3).

*Performance on the test set*

- Though ARIMA(3,2,3) appears to have good predictive power on the validation set, it performs poorly on the test set, where it underperforms the base model ARIMA(1,2,1).

# III. Modeling II (Models With External Predictors) 

## 1. Preprocessing Predictors 

### 1.1 Import and format data

First, we need to import data with external predictors 

In [None]:
predictors = pd.read_csv('full_data.csv', index_col='Date', parse_dates=True)

In [None]:
# Get some basic information about the data
predictors.describe().round(2)

In [None]:
# Make the date consistent with the CPI data
predictors = predictors.loc[:'2019-12-31']

### 1.2 Apply first order differencing on predictors

- Since we have differenced the CPI, it makes sense to apply at least first-order differencing to the predictors as well. We would also like to see how changes in these variables affect movements in the label.
- In addition, when we attempted to use the original (undifferenced) data, the multicollinearity issue was so serious that we could not move forward with it.
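The effect of differencing on correlation between trending series can be illustrated with synthetic data. In this sketch the two series are generated independently, both with an upward drift; all numbers are made up and unrelated to the project's predictors:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 120

# Two independent made-up series that both drift upward over time
a = pd.Series(np.cumsum(1.0 + rng.normal(scale=0.5, size=n)))
b = pd.Series(np.cumsum(1.0 + rng.normal(scale=0.5, size=n)))

corr_levels = a.corr(b)               # dominated by the shared upward trend
corr_diffs = a.diff().corr(b.diff())  # correlation of period-to-period changes
```

In levels the shared trend makes the series look almost perfectly correlated, while the differenced series show little correlation. The same mechanism inflates multicollinearity among undifferenced macroeconomic predictors.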



In [None]:
# apply diff on all columns in predictors 
predictors = predictors.diff().dropna()

### 1.3 Normalize Predictors 

#### 1.3.1 Remove Outliers

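All predictors are deemed to be equally important, but they are on different scales, so we normalize them. Before scaling, we cap the outliers using the IQR method so that extreme values do not dominate the scaled range.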

In [None]:
# Visualize the data by plotting their distributions and boxplots
# sns.pairplot(predictors)

In [None]:
# Cap all outliers in the predictors using the IQR method
def replace_outliers(data):
    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()
    for col in data.columns:
        q1 = data[col].quantile(0.25)
        q3 = data[col].quantile(0.75)
        iqr = q3 - q1

        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr

        # Values outside the bounds are replaced by the nearest bound
        data[col] = data[col].clip(lower=lower_bound, upper=upper_bound)
    return data


In [None]:
clean_predictors = replace_outliers(predictors)
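A small worked example of the IQR rule on made-up values shows how a single extreme observation is pulled back to the upper bound:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up values with one outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)  # 2.0 and 4.0
iqr = q3 - q1                                # 2.0
lower_bound = q1 - 1.5 * iqr                 # -1.0
upper_bound = q3 + 1.5 * iqr                 # 7.0

capped = s.clip(lower=lower_bound, upper=upper_bound)
```

Only the value 100 lies outside the interval from -1 to 7, so it is replaced by 7 and the remaining values are unchanged.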

#### 1.3.2 Normalize predictors

In [None]:
# Normalize clean predictors data using min-max scaler, and convert it to a dataframe
# import scikit-learn
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
clean_predictors = scaler.fit_transform(clean_predictors)
clean_predictors = pd.DataFrame(clean_predictors, columns=predictors.columns, index=predictors.index)
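Min-max scaling maps each column to `(x - min) / (max - min)`. The cell above fits the scaler on the full sample; one alternative worth considering, shown here only as a hedged sketch on made-up numbers, is to take the minimum and maximum from the training period alone so that later observations do not influence the scaling. `fit_min_max` and `apply_min_max` are hypothetical helpers:

```python
import numpy as np

def fit_min_max(train: np.ndarray):
    """Column-wise minimum and maximum computed on the training rows only."""
    return train.min(axis=0), train.max(axis=0)

def apply_min_max(x: np.ndarray, col_min: np.ndarray, col_max: np.ndarray) -> np.ndarray:
    """Scale columns with previously fitted bounds."""
    return (x - col_min) / (col_max - col_min)

train_rows = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])  # made-up training data
later_rows = np.array([[12.0, 15.0]])                            # made-up later observation

col_min, col_max = fit_min_max(train_rows)
scaled_train = apply_min_max(train_rows, col_min, col_max)
scaled_later = apply_min_max(later_rows, col_min, col_max)
```

With this approach, later values can fall outside the range from 0 to 1, as the first column of `scaled_later` does, which is expected.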

Once the data has been cleaned, we can merge them with the label.

In [None]:
# Merge the clean predictors data with the CPI data
full_data = pd.merge(second_order_diff, clean_predictors, left_index=True, right_index=True)


## 2. Correlation Analysis

In [None]:
corr_matrix = full_data.corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
plt.figure(figsize=(13, 9))
sns.heatmap(corr_matrix, mask=mask, cmap='coolwarm_r', annot=True, fmt='.2f', annot_kws={'fontsize': 8.5})
plt.title('Correlation Matrix')
plt.show()


**Conclusion**

- Most features are moderately or weakly correlated with CPI. In an economic sense, they should have a strong correlation with the label. However, since we have differenced both the label and the features, the strong correlation no longer holds.

- Though some features like Money_Stock (M2 money supply) and FedSurDef have only a small correlation with the label, they might still be useful based on our domain knowledge.
 
- In addition, since correlation measures only linear relationships, non-linear relationships between the predictors and the label can still be significant and useful for prediction even though the correlation coefficients do not capture them.
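A small synthetic example of the last point: below, `y` is fully determined by `x`, yet the Pearson correlation is essentially zero because the relationship is not linear. The data is made up for illustration:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 101)  # symmetric around zero
y = x ** 2                       # y depends entirely on x, but not linearly

pearson = np.corrcoef(x, y)[0, 1]
```

Tree-based models such as random forests can pick up this kind of dependence, which is one reason weakly correlated features are not discarded outright.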


## 3. Feature Selection with Lasso Regression

- Though the current set of variables looks good, we next apply Lasso regression to reduce the number of predictors further and retain only the most important variables.

In [None]:
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LassoCV
# split the data 
target = 'CPI'

train = full_data.loc['2010-01-01':'2016-12-31']

val = full_data.loc['2017-01-01':'2018-12-31']

test = full_data.loc['2019-01-01':'2019-12-31']

x_train = train.drop(columns = [target])

y_train = train[target]

x_val = val.drop(columns = [target])

y_val = val[target]

x_test = test.drop(columns = [target])

y_test = test[target]

The best alpha, found by the cross-validation below, is the one that provides the optimal balance between fitting the data and preventing overfitting.

In [None]:
# Create and fit a lasso regression with cross validation to find the best alpha
model = LassoCV(alphas = None, cv = 3, random_state=123).fit(x_train, y_train)

best_alpha = model.alpha_

print(f"Best alpha: {best_alpha:.4f}")

- Though we have found the best alpha, we do not apply it to the lasso regression, since it would keep only crude oil as the sole predictor for the model.

- Therefore, we reduce alpha to 0.01, which maintains the same RMSE while including more predictors in the model.
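Why a larger alpha keeps fewer predictors can be seen from soft-thresholding. In the special case of an orthonormal design, lasso coefficients are the least-squares coefficients shrunk toward zero by a threshold tied to alpha; scikit-learn solves the general case numerically, so this is intuition only. The coefficients are made up and `soft_threshold` is a hypothetical helper:

```python
import numpy as np

def soft_threshold(coefs, alpha: float) -> np.ndarray:
    """Shrink each coefficient toward zero by alpha; small ones become exactly zero."""
    coefs = np.asarray(coefs, dtype=float)
    return np.sign(coefs) * np.maximum(np.abs(coefs) - alpha, 0.0)

ols_coefs = np.array([0.80, -0.05, 0.03, -0.30])  # made-up coefficients

kept_small_alpha = np.count_nonzero(soft_threshold(ols_coefs, 0.01))
kept_large_alpha = np.count_nonzero(soft_threshold(ols_coefs, 0.10))
```

With the smaller threshold all four coefficients survive, while the larger one removes the two weakest, mirroring the trade-off between the cross-validated alpha and alpha = 0.01.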

In [None]:
# Fit the final lasso model with the manually chosen alpha of 0.01
final_lasso = Lasso(alpha=0.01, random_state=123).fit(x_train, y_train)

In [None]:
# Evaluate the model performance on the validation set 
val_predictions = final_lasso.predict(x_val)
val_mse = mean_squared_error(y_val, val_predictions)
val_rmse = np.sqrt(val_mse)
print(f'Validation RMSE: {val_rmse:.2f}\n')

In [None]:
# Inspect the coefficients to see which predictors were retained in the model
coef_df = pd.DataFrame({'Feature': x_train.columns, 'Coefficient': final_lasso.coef_})
coef_df = coef_df.sort_values(by='Coefficient', ascending=False)

# print Feature from coef_df where Coefficient is different from 0

print('Here is the list of predictors that were retained in the lasso regression using alpha = 0.01')

coef_df[coef_df['Coefficient'] != 0]

In [None]:
# Extract the names of the retained variables as an array
selected_predictors = coef_df[coef_df['Coefficient'] != 0]['Feature'].values

## 4. Random Forest

In [None]:
# Import necessary libraries for random forest regression 
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

### 4.1 Random Forest with No Lags

#### 4.1.1 Default Setting's Hyperparameters

Create a base random forest regression model

In [None]:
rf_base = RandomForestRegressor(random_state=123)

# Train the base model 

rf_base.fit(x_train, y_train)

In [None]:
# print evaluation metrics on training set
train_predictions = rf_base.predict(x_train)
train_mse = mean_squared_error(y_train, train_predictions)
train_rmse = np.sqrt(train_mse)
train_mae = np.mean(np.abs(train_predictions - y_train))
print(f'Train RMSE: {train_rmse:.2f}\n')
print(f'Train MAE: {train_mae:.2f}\n')


Predict and Evaluate the model's metrics on the validation set

In [None]:
# Predict the validation target variable
base_rf_pred = rf_base.predict(x_val)

# Evaluation 

mse = mean_squared_error(y_val, base_rf_pred)

rmse = np.sqrt(mse)

mae = np.mean(np.abs(base_rf_pred - y_val))


Visualizing Actual and Predicted Values of random forest model with default setting

In [None]:
def plot_default_setting_predictions(actual, predicted, title):
    """
    Plots the actual and predicted values of a time series.
    
    Args:
        actual (series): The actual values of the time series
        predicted (series): The predicted values of the time series
        title (string): The title of the plot
    """
    # Adding the index of the actual series to the predicted series 
    predicted = pd.Series(predicted, index=actual.index)
    
    plt.figure(figsize=(10, 6))
    plt.plot(actual, label='Actual')
    plt.plot(predicted, label='Predicted')
    plt.legend(loc='upper left')
    plt.text(0.88, 0.98, f'MAE: {mae:.2f}\nMSE: {mse:.2f}\nRMSE: {rmse:.2f}', 
                 transform=plt.gca().transAxes, verticalalignment='top')
    plt.title(title)
    # Create a small subtitle with a different color font
    plt.text(0.34, 1.1, "Default Setting's Hyperparameters", color='red', transform=plt.gca().transAxes, verticalalignment='top')
    plt.show()

In [None]:
plot_default_setting_predictions(y_val, base_rf_pred, 'Random Forest Regression Without Lags')

Predict on the test set and evaluate the model's metrics

In [None]:
# Predict the test target variable
base_rf_test = rf_base.predict(x_test)

# Evaluation 

mse = mean_squared_error(y_test, base_rf_test)

rmse = np.sqrt(mse)

mae = np.mean(np.abs(base_rf_test - y_test))

# Calculate the variance of the test prediction
var = np.var(base_rf_test)

print(f"Test Prediction's Variance: {var:.2f}\n")

In [None]:
plot_default_setting_predictions(y_test, base_rf_test, "Test Set: Actual vs Forecast of Random Forest Regression Without Label's Lags")

Reverse the prediction back to the original data to compare with the monthly CPI


In [None]:
# Create a function to reverse the second order differenced data back to the original data. 
def reconstruct_second_order_differenced_data(input):
    # `input` holds the 12 predicted second-order differences for Jan-Dec 2019
    reconstructed_data = [df['CPI'].loc['2019-01-01'], df['CPI'].loc['2019-02-01']]
    actual_data = df['CPI'].loc['2019-01-01':'2019-12-01']
    for i in range(10):
        # One-step reconstruction: y[t] = d2[t] + 2*y[t-1] - y[t-2], with t = i + 2
        original_value = input[i+2] + 2 * actual_data.iloc[i+1] - actual_data.iloc[i]
        reconstructed_data.append(original_value)
    return reconstructed_data[2:]
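The reconstruction relies on inverting the second-order difference one step at a time: `y[t] = d2[t] + 2*y[t-1] - y[t-2]`. A quick check of this identity on made-up levels (not the actual CPI values):

```python
import pandas as pd

y = pd.Series([250.0, 251.0, 253.0, 252.5, 254.0])  # made-up CPI-like levels
d2 = y.diff().diff()                                # second-order differences

# Rebuild each level from its own second difference and the two previous levels
rebuilt = [
    d2.iloc[t] + 2 * y.iloc[t - 1] - y.iloc[t - 2]
    for t in range(2, len(y))
]
```

The second difference at position `t` must be paired with the levels at `t-1` and `t-2`; pairing it with a different month would shift every reconstructed value.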

In [None]:
# Merge the result of the previous function with df['1-Month % Change']
def merge_with_1_month_pct_change(second_order_diff_data):
    reconstructed_data = reconstruct_second_order_differenced_data(second_order_diff_data)
    actual_data = df['CPI'].loc['2019-02-01':'2019-12-01']
    # Calculate the percentage change of each reconstructed value relative to the previous month's actual CPI
    reconstructed_data = [((reconstructed_data[i] - actual_data.iloc[i]) / actual_data.iloc[i]) * 100 for i in range(0, len(reconstructed_data))]
    reconstructed_data = pd.Series(reconstructed_data, index=df['1-Month % Change'].loc['2019-03-01':].index)
    reconstructed_data.name = 'Predicted'
    reconstructed_data = reconstructed_data.round(1)
    reconstructed_data = pd.merge(reconstructed_data, df['1-Month % Change'].loc['2019-03-01':], left_index=True, right_index=True)
    reconstructed_data.columns = ['Predicted', 'Actual']
    return reconstructed_data

In [None]:
merge_with_1_month_pct_change(base_rf_test)

# Calculate the variance of 

#### 4.1.2 Tuned Hyperparameters

- There are two common fine-tuning methods for random forest: Grid Search and Random Search.

Define the hyperparameter search space for both grid search and random search

In [None]:
# Hyperparameter search space
param_grid = {
    'n_estimators': [100, 200, 300, 500],
    'max_depth': [10, 20, 30, 50, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [1.0, 'sqrt', 'log2'],  # 1.0 = all features ('auto' was removed in scikit-learn 1.3)
    'bootstrap': [True, False],
}

Choose a search method (GridSearchCV or RandomizedSearchCV) and fit the model:

In [None]:
# Grid search

grid_search = GridSearchCV(estimator=rf_base, 
    param_grid=param_grid, 
    cv=3, 
    n_jobs=-1, 
    verbose=2)

# Random search

# n_iter: Number of random parameter combinations to try

random_search = RandomizedSearchCV(estimator=rf_base, 
    param_distributions=param_grid, 
    n_iter=100, 
    cv=5, 
    n_jobs=-1, 
    verbose=2, 
    random_state=123)

# Fit the search objects; here we fit both grid search and random search so we can compare them

grid_search.fit(x_train, y_train)

random_search.fit(x_train, y_train)

Get the best hyperparameters from both hyperparameter search methods

In [None]:
grid_search_params = grid_search.best_params_

random_search_params = random_search.best_params_


# Create a data frame that combines both grid_search_params and random_search_params
grid_search_params_df = pd.DataFrame(grid_search_params, index=[0])

random_search_params_df = pd.DataFrame(random_search_params, index=[0])

combined_params_df = pd.concat([grid_search_params_df, random_search_params_df], axis=0)

combined_params_df.index = ['Grid Search', 'Random Search']

combined_params_df


In [None]:
# Instantiate the model with the best hyperparameters
tunned_rf_regressor = RandomForestRegressor(**grid_search_params, random_state=123)

# Train the model 
tunned_rf_regressor.fit(x_train, y_train)

In [None]:
# Get evaluation metrics on the training set
train_predictions = tunned_rf_regressor.predict(x_train)
train_mse = mean_squared_error(y_train, train_predictions)
train_rmse = np.sqrt(train_mse)
train_mae = np.mean(np.abs(train_predictions - y_train))
print(f'Train RMSE: {train_rmse:.2f}\n')
print(f'Train MAE: {train_mae:.2f}\n')



Make predictions and evaluate the model performance using RMSE

In [None]:
# Make predictions
y_pred = tunned_rf_regressor.predict(x_val)

# Evaluate the model 

mse = mean_squared_error(y_val, y_pred)

rmse = np.sqrt(mse)

mae = np.mean(np.abs(y_pred - y_val))

In [None]:
# Plot y_val and y_pred on the same graph, but first, we need to add a time index to y_pred
y_pred = pd.Series(y_pred, index=y_val.index)

plt.figure(figsize=(10, 6))

plt.plot(y_val, label='Actual')

plt.plot(y_pred, label='Predicted')

plt.legend(loc='upper left')

plt.text(0.88, 0.98, f'MAE: {mae:.2f}\nMSE: {mse:.2f}\nRMSE: {rmse:.2f}', 
                 transform=plt.gca().transAxes, verticalalignment='top')

plt.text(0.35, 1.1, 'Tuned Hyperparameters', 
         color='red', 
         transform=plt.gca().transAxes, 
         verticalalignment='top')

plt.title('Random Forest Regression Without Lags')

plt.show()

Predictions and Evaluation metrics on test set

In [None]:
# Make predictions
y_pred = tunned_rf_regressor.predict(x_test)

# Evaluate the model 

mse = mean_squared_error(y_test, y_pred)

rmse = np.sqrt(mse)

mae = np.mean(np.abs(y_pred - y_test))

# Calculate the variance on test prediction 
var = np.var(y_pred)

print(f"Variance of Test Prediction on Tuned RF w/o Lags: {var:.2f}\n")


Visualize the predicted and actual values on the test set

In [None]:
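# Plot actual and predicted test values on the same graph; the predictions first need the time index of the actual series
# Note: mae, mse, and rmse are read from the most recently computed global values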
def plot_tunned_predictions_test(actual, predicted, title):
    predicted = pd.Series(predicted, index=actual.index)

    plt.figure(figsize=(10, 6))

    plt.plot(actual, label='Actual')

    plt.plot(predicted, label='Predicted')

    plt.legend(loc='upper left')

    plt.text(0.88, 0.98, f'MAE: {mae:.2f}\nMSE: {mse:.2f}\nRMSE: {rmse:.2f}', 
                    transform=plt.gca().transAxes, verticalalignment='top')

    plt.text(0.35, 1.1, 'Tuned Hyperparameters', 
            color='red', 
            transform=plt.gca().transAxes, 
            verticalalignment='top')

    plt.title(title)

    plt.show()

In [None]:
plot_tunned_predictions_test(y_test, y_pred, "Test Set: Actual vs Forecast Random Forest Regression Without Label's Lags")

In [None]:
merge_with_1_month_pct_change(y_pred)

#### 4.1.3 Random Forest with No Lags Conclusion

- Since the search space and the data are quite small, we can move forward with Grid Search CV.

1. Validation set

- Although the MAE of the tuned model is 0.01 lower than that of the default model, the RMSE and MSE remain the same.

2. Test set
- On the test set, the tuned model performs slightly worse than the default model.

As a result, we conclude that tuning the hyperparameters does not improve the model's performance.

### 4.2 Random Forest with Label's Lags

From the best ARIMA model, ARIMA(3,0,3), the first three lags appear to be statistically significant for predicting the CPI, so we decided to include them as features in the model.
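As a quick illustration of what a lag feature is, the sketch below builds lags 1 to 3 for a toy series in plain Python. The values are made up, and the list-based construction is only a stand-in for the pandas `shift` and `dropna` calls used in this notebook.

```python
# Toy series standing in for the target column (hypothetical values).
y = [10.0, 11.0, 13.0, 16.0, 20.0, 25.0]
lags = [1, 2, 3]
max_lag = max(lags)

# Row t pairs the target y[t] with y[t-1], y[t-2], and y[t-3].
# The first max_lag observations have no complete history, which is
# why the corresponding rows are dropped.
rows = [
    {"y": y[t], **{f"lag_{k}": y[t - k] for k in lags}}
    for t in range(max_lag, len(y))
]

for row in rows:
    print(row)
```

With three lags, the first three observations are lost, so a series of six points yields three usable rows.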

In [None]:
# Create a function to create lag features for a time series
def create_lag_features(df, target, lags):
    """
    Creates lag features for a time series.
    
    Args:
        df (dataframe): A dataframe containing the time series data
        target (string): The column name of the target variable
        lags (list): A list of lag values to create features for
        
    Returns:
        The original dataframe with added columns containing lag features
    """
    df = df.copy()
    
    for lag in lags:
        df['lag_' + str(lag)] = df[target].shift(lag)
           
    return df

In [None]:
# Adding 3 lags to the full_data dataset
full_data_w_lags = create_lag_features(full_data, target, [1, 2, 3])

full_data_w_lags.dropna(inplace=True)

In [None]:
# Create training, validation, and test sets again 
train_w_lags = full_data_w_lags.loc['2010-01-01':'2016-12-31']

val_w_lags = full_data_w_lags.loc['2017-01-01':'2018-12-31']

test_w_lags = full_data_w_lags.loc['2019-01-01':'2019-12-31']

x_train_w_lags = train_w_lags.drop(columns = [target])

y_train_w_lags = train_w_lags[target]

x_val_w_lags = val_w_lags.drop(columns = [target])

y_val_w_lags = val_w_lags[target]

x_test_w_lags = test_w_lags.drop(columns = [target])

y_test_w_lags = test_w_lags[target]

Create the base Random Forest

In [None]:
rf_w_lags = RandomForestRegressor(random_state=123)

# Train the base model 
rf_w_lags.fit(x_train_w_lags, y_train_w_lags)

#### 4.2.1 Default's Setting Hyperparameters

In [None]:
# Evaluate the model on the training set
train_predictions = rf_w_lags.predict(x_train_w_lags)
train_mse = mean_squared_error(y_train_w_lags, train_predictions)
train_rmse = np.sqrt(train_mse)
train_mae = np.mean(np.abs(train_predictions - y_train_w_lags))
print(f'Train RMSE: {train_rmse:.2f}\n')
print(f'Train MAE: {train_mae:.2f}\n')

In [None]:
# Make predictions
y_pred = rf_w_lags.predict(x_val_w_lags)

# Evaluate the model 

mse = mean_squared_error(y_val_w_lags, y_pred)

rmse = np.sqrt(mse)

mae = np.mean(np.abs(y_pred - y_val_w_lags))

Visualize the actual values and prediction of the model with default setting

In [None]:
plot_default_setting_predictions(y_val_w_lags, y_pred, 'Random Forest Regression With Lags')

Prediction and Evaluation metrics on test set

In [None]:
# Prediction and Evaluation metrics on test set
y_pred = rf_w_lags.predict(x_test_w_lags)

mse = mean_squared_error(y_test_w_lags, y_pred)

rmse = np.sqrt(mse)

mae = np.mean(np.abs(y_pred - y_test_w_lags))

# Calculate the variance on test prediction

var = np.var(y_pred)

print(f"Variance of Test Prediction on RF w/ Lags: {var:.2f}\n")

plot_default_setting_predictions(y_test_w_lags, y_pred, "Test Set: Actual vs Forecast of Random Forest Regression With Label's Lags")

In [None]:
merge_with_1_month_pct_change(y_pred)

#### 4.2.2 Tuned Hyperparameters

Reuse the hyperparameter search space defined above and run both grid search and random search:

In [None]:
# Grid search

grid_search = GridSearchCV(estimator=rf_w_lags, 
    param_grid=param_grid, 
    cv=3, 
    n_jobs=-1, 
    verbose=2)

# Random search

# n_iter: Number of random parameter combinations to try

random_search = RandomizedSearchCV(estimator=rf_w_lags, 
    param_distributions=param_grid, 
    n_iter=100, 
    cv=5, 
    n_jobs=-1, 
    verbose=2, 
    random_state=123)

# Fit the search objects; here we fit both grid search and random search so we can compare them

grid_search.fit(x_train_w_lags, y_train_w_lags)

random_search.fit(x_train_w_lags, y_train_w_lags)

Obtain the optimal hyperparameters from both grid search and random search

In [None]:
grid_search_params = grid_search.best_params_

random_search_params = random_search.best_params_

# Create a data frame that combines both grid_search_params and random_search_params
grid_search_params_df = pd.DataFrame(grid_search_params, index=[0])

random_search_params_df = pd.DataFrame(random_search_params, index=[0])

combined_params_df = pd.concat([grid_search_params_df, random_search_params_df], axis=0)

combined_params_df.index = ['Grid Search', 'Random Search']

combined_params_df

**Grid Search**

In [None]:
# Instantiate the model with grid search hyperparameters
grid_search_rf_regressor = RandomForestRegressor(**grid_search_params, random_state=123)

# Train the model 
grid_search_rf_regressor.fit(x_train_w_lags, y_train_w_lags)

In [None]:
# Evaluation metrics on training set
train_predictions = grid_search_rf_regressor.predict(x_train_w_lags)
train_mse = mean_squared_error(y_train_w_lags, train_predictions)
train_rmse = np.sqrt(train_mse)
train_mae = np.mean(np.abs(train_predictions - y_train_w_lags))
print(f'Train RMSE: {train_rmse:.2f}\n')
print(f'Train MAE: {train_mae:.2f}\n')

Prediction and Evaluation on the validation set

In [None]:
# Make predictions
grid_search_y_pred = grid_search_rf_regressor.predict(x_val_w_lags)

# Evaluate the model 

mse = mean_squared_error(y_val_w_lags, grid_search_y_pred)

rmse = np.sqrt(mse)

mae = np.mean(np.abs(grid_search_y_pred - y_val_w_lags))

Visualizing Actual and Predicted Values of Grid Search Hyperparameters

In [None]:
# Create a function to visualize actual and predicted values
def plot_grid_search_predictions(actual, predicted, title):
    """
    Plots the actual and predicted values of a time series.
    
    Args:
        actual (series): The actual values of the time series
        predicted (series): The predicted values of the time series
        title (string): The title of the plot
    """
    # Adding the index of the actual series to the predicted series 
    predicted = pd.Series(predicted, index=actual.index)
    
    plt.figure(figsize=(10, 6))
    plt.plot(actual, label='Actual')
    plt.plot(predicted, label='Predicted')
    plt.legend(loc='upper left')
    plt.text(0.88, 0.98, f'MAE: {mae:.2f}\nMSE: {mse:.2f}\nRMSE: {rmse:.2f}', 
                 transform=plt.gca().transAxes, verticalalignment='top')
    plt.title(title)
    # Create a small subtitle with a different color font
    plt.text(0.41, 1.1, 'Grid Search', color='red', transform=plt.gca().transAxes, verticalalignment='top')
    plt.show()
    

In [None]:
plot_grid_search_predictions(y_val_w_lags, grid_search_y_pred, 'Optimized Random Forest Regression Including Lags')


Prediction and Evaluation metrics on test set

In [None]:
# Prediction and Evaluation metrics on test set
y_pred = grid_search_rf_regressor.predict(x_test_w_lags)

mse = mean_squared_error(y_test_w_lags, y_pred)

rmse = np.sqrt(mse)

mae = np.mean(np.abs(y_pred - y_test_w_lags))

# Calculate the variance on test prediction

var = np.var(y_pred)

print(f"Variance of Test Prediction on tuned RF w/ Lags: {var:.2f}\n")

plot_tunned_predictions_test(y_test_w_lags, y_pred, "Test Set: Actual vs Forecast of Random Forest Regression With Label's Lags")

In [None]:
merge_with_1_month_pct_change(y_pred)

**Random Search**

In [None]:
# Instantiate the model with random search hyperparameters 
random_search_rf_regressor = RandomForestRegressor(**random_search_params, random_state = 123)

# Fit the model 

random_search_rf_regressor.fit(x_train_w_lags, y_train_w_lags)

Prediction and Evaluation on validation dataset

In [None]:
# Make Predictions 
random_search_y_pred = random_search_rf_regressor.predict(x_val_w_lags)

# Evaluate the model 

mse = mean_squared_error(y_val_w_lags, random_search_y_pred)

rmse  = np.sqrt(mse)

mae = np.mean(np.abs(random_search_y_pred - y_val_w_lags))

Visualizing Actual and Predicted values of Random Search Hyperparameters

In [None]:
# Create a function to visualize actual and predicted values
def plot_random_search_predictions(actual, predicted, title):
    """
    Plots the actual and predicted values of a time series.
    
    Args:
        actual (series): The actual values of the time series
        predicted (series): The predicted values of the time series
        title (string): The title of the plot
    """
    # Adding the index of the actual series to the predicted series 
    predicted = pd.Series(predicted, index=actual.index)
    
    plt.figure(figsize=(10, 6))
    plt.plot(actual, label='Actual')
    plt.plot(predicted, label='Predicted')
    plt.legend(loc='upper left')
    plt.text(0.88, 0.98, f'MAE: {mae:.2f}\nMSE: {mse:.2f}\nRMSE: {rmse:.2f}', 
                 transform=plt.gca().transAxes, verticalalignment='top')
    plt.title(title)
    # Create a small subtitle with a different color font
    plt.text(0.41, 1.1, 'Random Search', color='red', transform=plt.gca().transAxes, verticalalignment='top')
    plt.show()
    

In [None]:
plot_random_search_predictions(y_val_w_lags, random_search_y_pred, 'Optimized Random Forest Regression Including Lags')


Predictions and Evaluation metrics on test set

In [None]:
# Make Predictions 
random_search_y_pred_test = random_search_rf_regressor.predict(x_test_w_lags)

# Evaluate the model 

mse = mean_squared_error(y_test_w_lags, random_search_y_pred_test)

rmse = np.sqrt(mse)

mae = np.mean(np.abs(random_search_y_pred_test - y_test_w_lags))

In [None]:
plot_random_search_predictions(y_test_w_lags, random_search_y_pred_test, "Test Set: Actual vs Forecast of Random Forest Regression Including Label's Lags")

Back transform the predictions to the original data and compare with the actual monthly rate

In [None]:
merge_with_1_month_pct_change(random_search_y_pred_test)

### 4.3 Random Forest Conclusion

- The model with lags performs better than the model without lags, with a lower RMSE.

- The model with lags also has a lower RMSE than the best ARIMA model, ARIMA(3,0,3).

- For the model without lags, tuning the hyperparameters DOES NOT improve performance: the RMSE of the default and tuned models is identical, although the MAE of the tuned forest is slightly lower than that of the default forest (0.39 vs 0.40).

- For the model with lags (lag 1, lag 2, and lag 3), tuning the hyperparameters DOES improve performance: the RMSE under both grid search and random search is moderately lower than under the default settings.

- The best-performing model is the one with lags and grid search hyperparameters, with an RMSE of 0.42.
 

## 5. Long Short Term Memory

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout 
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
import os
import random

### 5.1 Reformatting

To work with an LSTM, we need to reshape the data into a 3D array with the following dimensions:

- Samples: The total number of observations in the dataset.

- Time Steps: The number of consecutive time points contained in each sample.

- Features: The number of features observed at each time step that we would like to include in the model.

In this LSTM model, we set the number of time steps to 1 because the three lags are already included as features in the training data. In addition, we would like to focus on learning patterns between the features rather than patterns across time between these features and the label.
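The reshaping step can be sketched on a small made-up array. The array `x` below is hypothetical (5 observations, 4 features) and only illustrates how a 2D feature matrix gains a time-step axis of length 1.

```python
import numpy as np

# Hypothetical feature matrix: 5 monthly observations, 4 features each.
x = np.arange(20, dtype=float).reshape(5, 4)

# Insert a time-step axis of length 1:
# (samples, features) -> (samples, time steps, features)
x_3d = x.reshape((x.shape[0], 1, x.shape[1]))

print(x.shape)     # 2D shape before reshaping
print(x_3d.shape)  # 3D shape expected by the LSTM layer
```

The values themselves are unchanged; each sample simply becomes a sequence containing a single time step.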

In [None]:
# Reshape the input data into 3D format for LSTM: (samples, timesteps, features)
lstm_x_train = x_train_w_lags.values.reshape((x_train_w_lags.shape[0], 1, x_train_w_lags.shape[1]))
lstm_x_val = x_val_w_lags.values.reshape((x_val_w_lags.shape[0], 1, x_val_w_lags.shape[1]))
lstm_x_test = x_test_w_lags.values.reshape((x_test_w_lags.shape[0], 1, x_test_w_lags.shape[1]))

### 5.2 Tune Hyperparameters

We created a small set of candidate batch sizes, dropout rates, and hidden node counts and looped over every combination to find the best hyperparameters.

Although we set the number of epochs to 100, training stops early if the validation loss does not improve for 3 consecutive epochs, because the early stopping callback is enabled with a patience of 3. The best weights seen during training are then restored to the model, so that we obtain the best performance for the given hyperparameters.

One epoch is completed when all batches in the dataset have been processed.
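The stopping rule can be sketched without any deep learning library. The helper `early_stopping_epoch` and the loss values below are hypothetical; the sketch only mimics the idea of patience with the default zero improvement threshold and is not the Keras implementation.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # Training stops here; weights from best_epoch would be restored.
                return epoch, best_epoch
    # Patience never ran out, so training ran for all epochs.
    return len(val_losses) - 1, best_epoch


# Made-up validation losses for illustration.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6]

print(early_stopping_epoch(losses, patience=3))
print(early_stopping_epoch(losses, patience=5))
```

With a patience of 3, training stops before the later improvement is reached, which shows how the patience setting trades training time against the chance of finding a better epoch.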

In [None]:
# Set seeds for reproducibility
seed_value = 123

os.environ['PYTHONHASHSEED'] = str(seed_value)

random.seed(seed_value)

np.random.seed(seed_value)

tf.random.set_seed(seed_value)

batch_sizes = [8, 12, 16, 32]
dropout_rates = [0.1, 0.15, 0.2, 0.3]
hidden_nodes = [10, 20, 30, 50]

best_params = {
    "batch_size": batch_sizes[0],
    "dropout_rate": dropout_rates[0],
    "hidden_nodes": hidden_nodes[0]
}
lowest_val_loss = float("inf")

for batch_size in batch_sizes:
    for dropout_rate in dropout_rates:
        for hidden_node in hidden_nodes:
            print(f"Training with batch_size = {batch_size}, dropout_rate = {dropout_rate}, hidden_nodes = {hidden_node}")

            lstm_2 = Sequential()
            lstm_2.add(LSTM(units=hidden_node, activation='tanh', input_shape=(lstm_x_train.shape[1], lstm_x_train.shape[2])))
            lstm_2.add(Dropout(dropout_rate))
            lstm_2.add(Dense(1))

            optimizer = Adam(learning_rate=0.001)
            lstm_2.compile(optimizer=optimizer, loss='mean_squared_error')

            early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

            history = lstm_2.fit(lstm_x_train, 
                                 y_train_w_lags, 
                                 epochs=100, 
                                 batch_size=batch_size, 
                                 validation_data=(lstm_x_val, y_val_w_lags), 
                                 verbose=0, 
                                 shuffle=False, 
                                 callbacks=[early_stopping])

            current_val_loss = min(history.history['val_loss'])
            print(f"Lowest validation loss: {current_val_loss}\n")

            if current_val_loss < lowest_val_loss:
                lowest_val_loss = current_val_loss
                best_params = {
                    "batch_size": batch_size,
                    "dropout_rate": dropout_rate,
                    "hidden_nodes": hidden_node
                }

print(f"Best hyperparameters: {best_params}")

After the loop, we found that a batch size of 32, a dropout rate of 0.2, and 10 hidden nodes yield the best performance.

### 5.3 Model Training

In [None]:
# Set seeds for reproducibility
seed_value = 123

os.environ['PYTHONHASHSEED'] = str(seed_value)

random.seed(seed_value)

np.random.seed(seed_value)

tf.random.set_seed(seed_value)

optimizer = Adam(learning_rate=0.001)

early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

lstm_2 = Sequential()

lstm_2.add(LSTM(units=10, activation='tanh', input_shape=(lstm_x_train.shape[1], lstm_x_train.shape[2])))

lstm_2.add(Dropout(0.2))  # dropout rate selected by the search in 5.2

lstm_2.add(Dense(1))

lstm_2.compile(optimizer=optimizer, loss='mean_squared_error')

# store the model history in a variable to plot the learning curve vs epochs

history = lstm_2.fit(lstm_x_train, 
                    y_train_w_lags, 
                    epochs=100, 
                    batch_size=32, 
                    validation_data=(lstm_x_val, y_val_w_lags), 
                    verbose=0, 
                    shuffle=False, 
                    callbacks=[early_stopping])

Visualize Loss vs. Epochs of training and validation set.

In [None]:
plt.figure(figsize=(10, 6))

plt.plot(history.history['loss'], label='Training Loss')

plt.plot(history.history['val_loss'], label='Validation Loss')

plt.legend(loc='upper right')

plt.title('Training and Validation Loss')

plt.xlabel('Epochs')

plt.ylabel('Loss')

In [None]:
# Evaluation metrics on the training set

train_predictions = lstm_2.predict(lstm_x_train)

# convert train_predictions to a 1D array
train_predictions = train_predictions.flatten()

train_mse = mean_squared_error(y_train_w_lags, train_predictions)

train_rmse = np.sqrt(train_mse)

train_mae = np.mean(np.abs(train_predictions - y_train_w_lags))

print(f'Train RMSE: {train_rmse:.2f}\n')

print(f'Train MAE: {train_mae:.2f}\n')

In [None]:
# plot train_predictions and y_train_w_lags on the same graph, but first add the time index to train_predictions

train_predictions = pd.Series(train_predictions, index=y_train_w_lags.index)

plt.figure(figsize=(10, 6))

plt.plot(y_train_w_lags, label='Actual')

plt.plot(train_predictions, label='Predicted')

plt.legend(loc='upper left')

plt.title('LSTM Regression With Lags (Training Set)')

plt.show()

#### 5.3.1 Validation Prediction and Evaluation

In [None]:
lstm_val_pred = lstm_2.predict(lstm_x_val)

# Convert lstm_val_pred to a 1D array

lstm_val_pred = lstm_val_pred.flatten()

mse = mean_squared_error(y_val_w_lags, lstm_val_pred)  

mae = np.mean(np.abs(lstm_val_pred - y_val_w_lags.values))

rmse = np.sqrt(mean_squared_error(y_val_w_lags, lstm_val_pred))


### 5.4 Prediction vs. Actual Visualization

In [None]:
def lstm_plot(actual, predicted, title):
    """
    Plots the actual and predicted values of this lstm result.
    
    Args:
        actual (series): The actual values of the time series
        predicted (series): The predicted values of the time series
        title (string): The title of the plot
    """
    # Adding the index of the actual series to the predicted series 
    predicted = pd.Series(predicted.reshape(-1), index=actual.index)
    
    plt.figure(figsize=(10, 6))
    plt.plot(actual, label='Actual')
    plt.plot(predicted, label='Predicted')
    plt.legend(loc='upper left')
    plt.text(0.88, 0.98, f'MAE: {mae:.2f}\nMSE: {mse:.2f}\nRMSE: {rmse:.2f}', 
                 transform=plt.gca().transAxes, verticalalignment='top')
    plt.title(title)
    plt.show()

In [None]:
lstm_plot(y_val_w_lags, lstm_val_pred, 'LSTM Model Including Lags')

In [None]:
# Perform the prediction on test set
lstm_x_test = x_test_w_lags.values.reshape((x_test_w_lags.shape[0], 1, x_test_w_lags.shape[1]))

lstm_test_pred = lstm_2.predict(lstm_x_test)

# convert lstm_test_pred to a 1D array

lstm_test_pred = lstm_test_pred.flatten()

mse = mean_squared_error(y_test_w_lags, lstm_test_pred)

mae = np.mean(np.abs(lstm_test_pred - y_test_w_lags.values))

rmse = np.sqrt(mse)

# Calculate the variance on test prediction

var = np.var(lstm_test_pred)

print(f"Variance of Test Prediction on LSTM w/ Lags: {var:.2f}\n")

In [None]:
# visualize the prediction on test set
lstm_plot(y_test_w_lags, lstm_test_pred, "Test Set: Actual vs Forecast of LSTM Model With Label's Lags")

### 5.5 LSTM Conclusion

- After tuning, the LSTM achieves the best RMSE of all the machine learning methods so far. However, its MAE is worse than that of every random forest model and of ARIMA(3,0,3).

- The LSTM does seem to capture the movement of the actual data well.
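A model can rank best on RMSE yet worse on MAE because RMSE penalizes large errors more heavily. The minimal sketch below uses two made-up error vectors (not results from this project) to show the effect; the function names are hypothetical.

```python
from math import sqrt


def mean_abs_error(errors):
    return sum(abs(e) for e in errors) / len(errors)


def root_mean_sq_error(errors):
    return sqrt(sum(e * e for e in errors) / len(errors))


# Model A: moderate errors everywhere. Model B: mostly exact, one large miss.
errors_a = [1.0, 1.0, 1.0, 1.0]
errors_b = [0.0, 0.0, 0.0, 3.0]

print("A:", mean_abs_error(errors_a), root_mean_sq_error(errors_a))
print("B:", mean_abs_error(errors_b), root_mean_sq_error(errors_b))
```

Model A has the lower RMSE while model B has the lower MAE, so the two metrics can disagree about which model is better.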

In [None]:
merge_with_1_month_pct_change(lstm_test_pred)

## 6. GARCH Model

The Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model is a popular tool for estimating and forecasting volatility in time series data.

Since GARCH models only predict volatility, we combine one with the best ARIMA model to obtain both the predicted values and the prediction interval.

A prediction interval is a range of values that is likely to contain the actual future value of the time series. The wider the interval, the higher the uncertainty or risk.

In a GARCH(p, q) model, p and q are parameters that define the order of the model: p is the number of lagged squared residuals and q is the number of lagged conditional variances used in the model.
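As a rough illustration of how the two orders enter the model, the sketch below runs the GARCH(1,1) recursion by hand: one lagged squared residual and one lagged conditional variance. The residuals and the values of `omega`, `alpha`, and `beta` are made up, and the helper `garch11_variance` is hypothetical rather than part of the `arch` package.

```python
def garch11_variance(residuals, omega, alpha, beta):
    """Conditional variances: sigma2[t] = omega + alpha * e[t-1]**2 + beta * sigma2[t-1]."""
    # Start from the unconditional variance, which requires alpha + beta < 1.
    sigma2 = [omega / (1.0 - alpha - beta)]
    for e_prev in residuals[:-1]:
        sigma2.append(omega + alpha * e_prev ** 2 + beta * sigma2[-1])
    return sigma2


# Made-up residuals with one large shock in the third position.
residuals = [0.1, -0.2, 1.5, -0.3, 0.2]
sigma2 = garch11_variance(residuals, omega=0.05, alpha=0.1, beta=0.8)

# Approximate 95% interval half-width around a mean forecast.
half_widths = [1.96 * s ** 0.5 for s in sigma2]

for s, h in zip(sigma2, half_widths):
    print(f"variance = {s:.4f}, half-width = {h:.3f}")
```

The conditional variance, and therefore the interval width, rises in the period right after the large shock and then decays, which is the volatility clustering that GARCH is meant to capture.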

In [None]:
# from arch import arch_model

# # fit ARIMA(3,0,3)

# arima_3_0_3 = ARIMA(train['CPI'], order=(3,0,3)).fit()

# # extract residual from the ARIMA(3,0,3) model

# arima_3_0_3_residuals = arima_3_0_3.resid

# # fit GARCH(1,1) model on the residuals of the ARIMA model 
# garch = arch_model(arima_3_0_3_residuals, p=1, q=1).fit()


# # use ARIMA to predict mean 

# predicted_mean = arima_3_0_3.predict(n_periods= 2)[0]

# # use Garch to predict the residual 

# garch_forecast = garch.forecast(horizon=2)

# predict_et = garch_forecast.mean['h.1'].iloc[-1]

# # combine the predicted mean and predicted residual to get the final prediction

# arima_garch_pred = predicted_mean + predict_et

# #plot the prediction and actual value on the same graph

# plt.figure(figsize=(10, 6))

# plt.plot(y_val, label='Actual')

# plt.plot(arima_garch_pred, label='Predicted')

# plt.legend(loc='upper left')

# plt.title('ARIMA-GARCH Model')

# plt.show()


## 7. Rolling Forecast Origin in ARIMA

The basic idea behind this method is still to split the dataset into training, validation, and test sets. However, instead of a single static split, a series of "windows" rolls through the data.

Let's say we have time series data from 2010 to 2020. With a static split, you could use the data from 2010 to 2016 to train your model and then test the model on data from 2019-2020 (with 2017-2019 for validation). In a rolling forecast origin approach, you might instead start by training your model on data from 2010 to 2016 and use it to forecast the first point in 2017. Then you would expand your training data to include the first point in 2017, refit the model, forecast the second point in 2017, and so on.
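The expanding-window loop can be sketched with a trivial forecaster in place of ARIMA. Everything below is illustrative: the series is made up, and `naive_last_value` is a hypothetical stand-in model that simply repeats the last observed value.

```python
def rolling_forecast(series, n_train, forecaster):
    """One-step-ahead forecasts with an expanding training window."""
    predictions = []
    for t in range(n_train, len(series)):
        history = series[:t]  # only observations strictly before time t
        predictions.append(forecaster(history))
    return predictions


def naive_last_value(history):
    # Stand-in model: forecast the next point as the last observed value.
    return history[-1]


series = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
preds = rolling_forecast(series, n_train=3, forecaster=naive_last_value)
errors = [actual - pred for actual, pred in zip(series[3:], preds)]

print(preds)
print(errors)
```

The key point is that each forecast is made from data available before the forecast date, and the window grows by one observation per step.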

In [None]:
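# Create rolling forecast origin for the ARIMA(1,2,1) model
# (ARIMA(1,0,1) fitted on the second-order differenced series)

predictions_rolling = pd.Series(dtype=float)

for end_date in test.index:
    # Use only observations strictly before the date being forecast
    rolled_data = second_order_diff.loc[second_order_diff.index < end_date]
    model = ARIMA(rolled_data, order=(1,0,1)).fit()
    # One-step-ahead forecast; statsmodels names this argument `steps`
    pred = model.forecast(steps=1)
    predictions_rolling.loc[end_date] = pred.iloc[0]

# after finishing the loop, make sure predictions_rolling carries the test index
predictions_rolling.index = test.index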

Visualize the rolling-forecast residuals and the predictions on the test set

In [None]:
residual_rolling = second_order_diff - predictions_rolling

# plot the residuals

plt.figure(figsize=(10, 6))

plt.plot(residual_rolling)

plt.title('Residuals of Rolling Forecast Origin')

plt.show()

In [None]:
# plot predictions against actual values of the test set, including MAE, MSE, RMSE

mae = np.mean(np.abs(predictions_rolling - test['CPI']))

mse = mean_squared_error(test['CPI'], predictions_rolling)

rmse = np.sqrt(mse)

plt.figure(figsize=(10, 6))

plt.plot(test['CPI'], label='Actual')

plt.plot(predictions_rolling, label='Predicted')

plt.legend(loc='upper left')

plt.text(0.88, 0.98, f'MAE: {mae:.2f}\nMSE: {mse:.2f}\nRMSE: {rmse:.2f}',
            transform=plt.gca().transAxes, verticalalignment='top')

plt.title('ARIMA(1,2,1) Rolling Forecast Origin On Test Set')

plt.show()

## 8. Back-Transforming ARIMA Predictions
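The back-transformation used by `merge_with_1_month_pct_change` rests on inverting second-order differencing and then expressing each rebuilt level as a percentage change from the previous month. The sketch below walks through that arithmetic on made-up CPI levels; the variable names are illustrative and the numbers are not from the dataset.

```python
# Hypothetical CPI levels, used only to illustrate the arithmetic.
cpi = [250.0, 251.0, 253.0, 254.0, 256.0]

# Second-order difference: d2[t] = y[t] - 2*y[t-1] + y[t-2], defined for t >= 2.
d2 = [cpi[t] - 2 * cpi[t - 1] + cpi[t - 2] for t in range(2, len(cpi))]

# One-step inversion from the two previous actual levels:
# y[t] = d2[t] + 2*y[t-1] - y[t-2]
rebuilt = [d2[t - 2] + 2 * cpi[t - 1] - cpi[t - 2] for t in range(2, len(cpi))]

# Monthly inflation implied by each rebuilt level, in percent of the
# previous month's actual level, rounded to one decimal place.
monthly_pct = [
    round((rebuilt[t - 2] - cpi[t - 1]) / cpi[t - 1] * 100, 1)
    for t in range(2, len(cpi))
]

print(d2)
print(rebuilt)
print(monthly_pct)
```

When the differenced values are exact, the rebuilt levels match the original series; with model predictions in place of `d2`, the same formula yields predicted levels and predicted monthly inflation.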

In [None]:
# Convert base arima to monthly inflation rate
merge_with_1_month_pct_change(base_arima_test_pred)

In [None]:
# Convert best arima to monthly inflation rate

merge_with_1_month_pct_change(best_arima_test_pred)