# 4. Model Development

## Pre-processing

In [2]:
# Libraries imported for this notebook.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [3]:
# Read Pre_Process_Train.xlsx into a dataframe, parsed dates, and used the first column (dates) as the index.

df = pd.read_excel('/Users/NJahns/Desktop/Bootcamp/Capstone_Two/Edited_Data/Pre_Process_Train.xlsx', parse_dates=True, index_col=[0])

In [3]:
# Looked at shape

df.shape

(2272, 555)

In [4]:
# Defined Xs and ys.

X_orig = df[[col for col in df.columns if '_' not in col and col != 'Sludge Volume Index']].copy() # Original features
X_diff_1 = df[df.columns[df.columns.str.contains('1st')]] # 1st order differenced versions of features
X_diff_1_no_SVI = df[df.columns[df.columns.str.contains('1st')]].drop(columns=['1st_Sludge Volume Index']) # 1st order differenced versions of features excluding SVI
X_lag_1 = df[df.columns[df.columns.str.contains('_1')]] # 1 day lagged versions of features
X_lag_1_no_SVI = df[df.columns[df.columns.str.contains('_1')]].drop(columns=['Sludge Volume Index_lag_1']) # 1 day lagged versions of features excluding SVI
X_lag_2 = df[df.columns[df.columns.str.contains('_2')]] # 2 day lagged versions of features
X_lag_3 = df[df.columns[df.columns.str.contains('_3')]] # 3 day lagged versions of features
X_lag_4 = df[df.columns[df.columns.str.contains('_4')]] # 4 day lagged versions of features
X_lag_4_no_SVI = df[df.columns[df.columns.str.contains('_4')]].drop(columns=['Sludge Volume Index_lag_4']) # 4 day lagged versions of features excluding SVI
X_lag_7 = df[df.columns[df.columns.str.contains('_7')]] # 7 day lagged versions of features
X_lag_7_no_SVI = df[df.columns[df.columns.str.contains('_7')]].drop(columns=['Sludge Volume Index_lag_7']) # 7 day lagged versions of features excluding SVI

y = df['Sludge Volume Index'] # Target variable for all models
y_shift = df['Sludge Volume Index'].shift(1).dropna() # Target variable for all models shifted by one day

I defined multiple versions of X and y for use in my models. For X, this let me try different combinations of independent variables. It also reduced the number of features passed to each model, which helps limit the curse of dimensionality and overfitting. These data consisted of the original data (including the additional WTP metrics I calculated) and specific orders of differenced and lagged data, with and without SVI. I didn’t create every possible combination of these data groups, but rather a range so that I could home in on specific combinations when running the models. For instance, I made data sets with a one day lag, a four day lag, and a seven day lag. That way, a model that performed well with one and four day lag data but poorly with seven day lag data would direct me to try two or three day lag data. As for y, I defined the original SVI and SVI shifted forward one day. The latter was created to avoid any potential data leakage.
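
The feature groups above are selected by matching text in the column names. As a minimal sketch of how that matching behaves, the example below uses hypothetical column names (only `Sludge Volume Index_lag_1` is taken from the notebook; `Flow` and the 10 day lag are assumptions for illustration). It shows that a substring pattern such as `'_1'` would also pick up a `_lag_10` column if one existed, while a suffix match would not. Whether this matters here depends on the real column names, which are not shown in this notebook.

```python
import pandas as pd

# Hypothetical column names, modeled on 'Sludge Volume Index_lag_1' from the cell above.
cols = [
    "Flow",
    "Flow_lag_1",
    "Flow_lag_10",
    "Flow_lag_4",
    "Sludge Volume Index",
    "Sludge Volume Index_lag_1",
    "1st_Flow",
]
df_demo = pd.DataFrame([[0.0] * len(cols)], columns=cols)

# Substring matching on '_1' also catches '_lag_10'.
by_substring = df_demo.columns[df_demo.columns.str.contains("_1")].tolist()

# Anchoring the match to the end of the name keeps only the 1 day lag.
by_suffix = df_demo.columns[df_demo.columns.str.endswith("_lag_1")].tolist()

print(by_substring)
print(by_suffix)
```

If the dataset only ever contains single-digit lags, both selections return the same columns.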

In [9]:
# Defined number of splits for time series cross-validation.

n_splits = 5
tscv = TimeSeriesSplit(n_splits=n_splits)

I used TimeSeriesSplit as a cross-validator because it is specifically designed for time series data. The dataset is split into consecutive folds in which each training set is a superset of the previous one and each test set comes after its training set, so the model is always trained on earlier observations and evaluated on later ones. Because the observations are never shuffled, the autocorrelation structure of the series is preserved within each fold.
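
To make the expanding-window behavior concrete, here is a small sketch on a toy series of 12 ordered observations (placeholder data, not the WTP dataset; the `_demo` names are illustrative). It prints the train and test indices for each of the five folds.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series of 12 ordered observations (placeholder data).
X_demo = np.arange(12).reshape(-1, 1)

tscv_demo = TimeSeriesSplit(n_splits=5)
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in tscv_demo.split(X_demo)
]

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train={train_idx} test={test_idx}")
```

Each training set contains the previous one plus the previous test block, and every test index is later than every training index in the same fold.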

Models capable of making predictions from time series data were essential for this project. I chose to explore linear regression, random forest, and ARIMAX models.

## Linear Regression

In [93]:
# Created function to perform cross-validation on folds and calculate RMSE and Adj R2

# Defined function
def lrcv(X, y, tscv): # linear regression cross validation
    # Initialize empty lists to store scores and predictions
    lr_rmse_scores = []
    lr_predictions = []
    lr_adj_r2_scores = []
    # Iterate over each fold for cross-validation using the time series cross-validation splitter (tscv)    
    for train_index, test_index in tscv.split(X):
        # Split data into training and test sets        
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # Instantiate a linear regression model
        lr_model = LinearRegression()
        # Fit the linear regression model to the training data
        lr_model.fit(X_train, y_train)
        # Make predictions on the test set
        lr_pred = lr_model.predict(X_test)
        # Store predictions
        lr_predictions.extend(lr_pred)
        # Calculate Root Mean Squared Error (RMSE) and append to list
        lr_rmse_scores.append(np.sqrt(mean_squared_error(y_test, lr_pred)))
        # Calculate R^2 score
        r2 = r2_score(y_test, lr_pred)
        n = len(y_test)  # Number of observations
        k = X_train.shape[1]  # Number of predictors
        adj_r2 = 1 - ((1 - r2) * ((n - 1) / (n - k - 1)))
        # Append adjusted R^2 score to list
        lr_adj_r2_scores.append(adj_r2)
    # Calculate average RMSE and average adjusted R^2 score across all folds
    lr_avg_rmse = np.mean(lr_rmse_scores)
    lr_avg_adj_r2 = np.mean(lr_adj_r2_scores)
    # Return average RMSE and average adjusted R^2 score, and predictions
    return lr_avg_rmse, lr_avg_adj_r2, lr_predictions

RMSE was calculated to measure the error between the model's predictions and the actual values. R squared was calculated to measure the proportion of the variance in SVI that is explained by the features. This value was then adjusted to account for the large number of predictors in the model, because the R-squared of a model can increase as predictors are added even if they do not contribute to its explanatory power. Adjusting the R-squared corrects for this.
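
The adjustment can be illustrated with a short worked example. The function below mirrors the formula used in `lrcv`; the name `adjusted_r2`, the input values, and the guard against `n - k - 1 <= 0` are assumptions added for illustration. It shows how the same R squared is penalized lightly with few predictors and heavily when the number of predictors approaches the number of observations in the fold.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for n observations and k predictors (needs n > k + 1)."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Same raw R^2 of 0.5 on 101 observations, with 10 and then 90 predictors.
few_predictors = adjusted_r2(r2=0.5, n=101, k=10)
many_predictors = adjusted_r2(r2=0.5, n=101, k=90)

print(few_predictors, many_predictors)
```

With 10 predictors the adjusted value stays near 0.44, while with 90 predictors it drops to -4, so large negative adjusted scores can arise even from a moderate raw R squared when a fold has many features relative to its size.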

In [76]:
# Ran function with original features
lr_X_orig_avg_rmse, lr_X_orig_avg_adj_r2, lr_X_orig_predictions = lrcv(X_orig, y, tscv)

In [77]:
# Ran function with shifted y and original features
lr_y_shift_X_orig_avg_rmse, lr_y_shift_X_orig_avg_adj_r2, lr_y_shift_X_orig_predictions = lrcv(X_orig.iloc[1:,:], y_shift, tscv)

In [94]:
# Ran function with 1st order differenced versions of features
lr_X_diff_1_avg_rmse, lr_X_diff_1_avg_adj_r2, lr_X_diff_1_predictions = lrcv(X_diff_1, y, tscv)

In [95]:
# Ran function with shifted y and 1st order differenced versions of features
lr_y_shift_X_diff_1_avg_rmse, lr_y_shift_X_diff_1_avg_adj_r2, lr_y_shift_X_diff_1_predictions = lrcv(X_diff_1.iloc[1:,:], y_shift, tscv)

In [80]:
# Ran function with 1 day lagged versions of features
lr_X_lag_1_avg_rmse, lr_X_lag_1_avg_adj_r2, lr_X_lag_1_predictions = lrcv(X_lag_1, y, tscv)

In [81]:
# Ran function with shifted y and 1 day lagged versions of features
lr_y_shift_X_lag_1_avg_rmse, lr_y_shift_X_lag_1_avg_adj_r2, lr_y_shift_X_lag_1_predictions = lrcv(X_lag_1.iloc[1:,:], y_shift, tscv)

In [82]:
# Ran function with 4 day lagged versions of features
lr_X_lag_4_avg_rmse, lr_X_lag_4_avg_adj_r2, lr_X_lag_4_predictions = lrcv(X_lag_4, y, tscv)

In [83]:
# Ran function with shifted y and 4 day lagged versions of features
lr_y_shift_X_lag_4_avg_rmse, lr_y_shift_X_lag_4_avg_adj_r2, lr_y_shift_X_lag_4_predictions = lrcv(X_lag_4.iloc[1:,:], y_shift, tscv)

In [84]:
# Ran function with 7 day lagged versions of features
lr_X_lag_7_avg_rmse, lr_X_lag_7_avg_adj_r2, lr_X_lag_7_predictions = lrcv(X_lag_7, y, tscv)

In [85]:
# Ran function with shifted y and 7 day lagged versions of features
lr_y_shift_X_lag_7_avg_rmse, lr_y_shift_X_lag_7_avg_adj_r2, lr_y_shift_X_lag_7_predictions = lrcv(X_lag_7.iloc[1:,:], y_shift, tscv)

In [88]:
# Ran function with shifted y and 2 day lagged versions of features
lr_y_shift_X_lag_2_avg_rmse, lr_y_shift_X_lag_2_avg_adj_r2, lr_y_shift_X_lag_2_predictions = lrcv(X_lag_2.iloc[1:,:], y_shift, tscv)

In [89]:
# Ran function with shifted y and 3 day lagged versions of features
lr_y_shift_X_lag_3_avg_rmse, lr_y_shift_X_lag_3_avg_adj_r2, lr_y_shift_X_lag_3_predictions = lrcv(X_lag_3.iloc[1:,:], y_shift, tscv)

## Random Forest

In [99]:
# Created function to perform cross-validation on folds and calculate RMSE and Adj R2

# Defined function
def rfcv(X, y, tscv): # Random forest cross validation
    # Initialize empty lists to store scores and predictions
    rf_rmse_scores = []
    rf_predictions = []
    rf_adj_r2_scores = []
    # Iterate over each fold for cross-validation using the time series cross-validation splitter (tscv)    
    for train_index, test_index in tscv.split(X):
        # Split data into training and test sets        
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # Instantiate a random forest model
        rf_model = RandomForestRegressor()
        # Fit the random forest model to the training data
        rf_model.fit(X_train, y_train)
        # Make predictions on the test set
        rf_pred = rf_model.predict(X_test)
        # Store predictions
        rf_predictions.extend(rf_pred)
        # Calculate Root Mean Squared Error (RMSE) and append to list
        rf_rmse_scores.append(np.sqrt(mean_squared_error(y_test, rf_pred)))
        # Calculate R^2 score
        r2 = r2_score(y_test, rf_pred)
        n = len(y_test)  # Number of observations
        k = X_train.shape[1]  # Number of predictors
        adj_r2 = 1 - (1 - r2) * ((n - 1) / (n - k - 1))
        # Append adjusted R^2 score to list
        rf_adj_r2_scores.append(adj_r2)
    # Calculate average RMSE and average adjusted R^2 score across all folds
    rf_avg_rmse = np.mean(rf_rmse_scores)
    rf_avg_adj_r2 = np.mean(rf_adj_r2_scores)
    # Return average RMSE and average adjusted R^2 score, and predictions
    return rf_avg_rmse, rf_avg_adj_r2, rf_predictions
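
Since `lrcv` and `rfcv` differ only in the estimator they instantiate, the two could be folded into one helper. The sketch below is one possible way to do that, not the approach used in this notebook: `cv_scores` and `make_model` are hypothetical names, and the data is synthetic and noise-free so the result is easy to check.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import TimeSeriesSplit


def cv_scores(make_model, X, y, tscv):
    """Hypothetical helper: average RMSE and adjusted R^2 over time series folds."""
    rmse_scores, adj_r2_scores, predictions = [], [], []
    for train_index, test_index in tscv.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        model = make_model()  # fresh, unfitted estimator for each fold
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        predictions.extend(pred)
        rmse_scores.append(np.sqrt(mean_squared_error(y_test, pred)))
        n, k = len(y_test), X_train.shape[1]
        r2 = r2_score(y_test, pred)
        adj_r2_scores.append(1 - (1 - r2) * (n - 1) / (n - k - 1))
    return np.mean(rmse_scores), np.mean(adj_r2_scores), predictions


# Synthetic, noise-free linear data (placeholder for the real features).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"a": rng.normal(size=60), "b": rng.normal(size=60)})
y_demo = 3 * X_demo["a"] - 2 * X_demo["b"] + 1

avg_rmse, avg_adj_r2, preds = cv_scores(
    LinearRegression, X_demo, y_demo, TimeSeriesSplit(n_splits=5)
)
print(avg_rmse, avg_adj_r2, len(preds))
```

For a random forest, the same helper could be called with `lambda: RandomForestRegressor(random_state=0)`; fixing `random_state` is an optional choice that makes repeated runs return the same scores.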

In [100]:
# Ran function with original features
rf_X_orig_avg_rmse, rf_X_orig_avg_adj_r2, rf_X_orig_predictions = rfcv(X_orig, y, tscv)

In [101]:
# Ran function with shifted y and original features
rf_y_shift_X_orig_avg_rmse, rf_y_shift_X_orig_avg_adj_r2, rf_y_shift_X_orig_predictions = rfcv(X_orig.iloc[1:,:], y_shift, tscv)

In [102]:
# Ran function with 1st order differenced versions of features
rf_X_diff_1_avg_rmse, rf_X_diff_1_avg_adj_r2, rf_X_diff_1_predictions = rfcv(X_diff_1, y, tscv)

In [103]:
# Ran function with shifted y and 1st order differenced versions of features
rf_y_shift_X_diff_1_avg_rmse, rf_y_shift_X_diff_1_avg_adj_r2, rf_y_shift_X_diff_1_predictions = rfcv(X_diff_1.iloc[1:,:], y_shift, tscv)

In [104]:
# Ran function with 1 day lagged versions of features
rf_X_lag_1_avg_rmse, rf_X_lag_1_avg_adj_r2, rf_X_lag_1_predictions = rfcv(X_lag_1, y, tscv)

In [106]:
# Ran function with shifted y and 1 day lagged versions of features
rf_y_shift_X_lag_1_avg_rmse, rf_y_shift_X_lag_1_avg_adj_r2, rf_y_shift_X_lag_1_predictions = rfcv(X_lag_1.iloc[1:,:], y_shift, tscv)

In [116]:
# Ran function with 4 day lagged versions of features
rf_X_lag_4_avg_rmse, rf_X_lag_4_avg_adj_r2, rf_X_lag_4_predictions = rfcv(X_lag_4, y, tscv)

In [114]:
# Ran function with shifted y and 4 day lagged versions of features
rf_y_shift_X_lag_4_avg_rmse, rf_y_shift_X_lag_4_avg_adj_r2, rf_y_shift_X_lag_4_predictions = rfcv(X_lag_4.iloc[1:,:], y_shift, tscv)

In [109]:
# Ran function with 7 day lagged versions of features
rf_X_lag_7_avg_rmse, rf_X_lag_7_avg_adj_r2, rf_X_lag_7_predictions = rfcv(X_lag_7, y, tscv)

In [110]:
# Ran function with shifted y and 7 day lagged versions of features
rf_y_shift_X_lag_7_avg_rmse, rf_y_shift_X_lag_7_avg_adj_r2, rf_y_shift_X_lag_7_predictions = rfcv(X_lag_7.iloc[1:,:], y_shift, tscv)

## ARIMAX

In [126]:
# Created function to perform cross-validation on folds and calculate RMSE and Adj R2
from pmdarima.arima import auto_arima

# Modified function with auto ARIMA
def arimaxcv(y, exog, tscv): # ARIMAX cross validation
    # Initialize empty lists to store scores and predictions
    arimax_rmse_scores = []
    arimax_predictions = []
    arimax_adj_r2_scores = []    
    # Iterate over each fold for cross-validation using the time series cross-validation splitter (tscv)    
    for train_index, test_index in tscv.split(y):
        # Split data into training and test sets
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        exog_train, exog_test = exog.iloc[train_index], exog.iloc[test_index]
        # Use auto ARIMA to find the best ARIMA parameters
        # (newer pmdarima releases name the exogenous argument X; `exogenous` is the older keyword)
        arimax_model = auto_arima(y_train, exogenous=exog_train)
        # Fit the ARIMAX model to the training data
        arimax_model_fit = arimax_model.fit(y_train, exogenous=exog_train)
        # Make predictions on the test set
        arimax_pred = arimax_model_fit.predict(n_periods=len(y_test), exogenous=exog_test)
        # Store predictions
        arimax_predictions.extend(arimax_pred)
        # Calculate Root Mean Squared Error (RMSE) and append to list
        arimax_rmse_scores.append(np.sqrt(mean_squared_error(y_test, arimax_pred)))
        # Calculate R^2 score
        r2 = r2_score(y_test, arimax_pred)
        n = len(y_test)  # Number of observations
        k = 1 + exog_train.shape[1]  # Number of predictors (1 for the lagged values of the target variable and additional ones for exogenous variables)
        adj_r2 = 1 - (1 - r2) * ((n - 1) / (n - k - 1))        
        # Append adjusted R^2 score to list
        arimax_adj_r2_scores.append(adj_r2)    
    # Calculate average RMSE and average adjusted R^2 score across all folds
    arimax_avg_rmse = np.mean(arimax_rmse_scores)
    arimax_avg_adj_r2 = np.mean(arimax_adj_r2_scores)    
    # Return average RMSE and average adjusted R^2 score, and predictions
    return arimax_avg_rmse, arimax_avg_adj_r2, arimax_predictions

The ARIMAX model creates differenced and lagged versions of the target variable so the dataset fed to it should not include the SVI differenced and lagged data.

In [128]:
# Ran function with original features
arimax_X_orig_avg_rmse, arimax_X_orig_avg_adj_r2, arimax_X_orig_predictions = arimaxcv(y, X_orig, tscv)

In [129]:
# Ran function with shifted y and original features
arimax_y_shift_X_orig_avg_rmse, arimax_y_shift_X_orig_avg_adj_r2, arimax_y_shift_X_orig_predictions = arimaxcv(y_shift, X_orig.iloc[1:,:], tscv)

In [130]:
# Ran function with 1st order differenced versions of features
arimax_X_diff_1_no_SVI_avg_rmse, arimax_X_diff_1_no_SVI_avg_adj_r2, arimax_X_diff_1_no_SVI_predictions = arimaxcv(y, X_diff_1_no_SVI, tscv)

In [131]:
# Ran function with shifted y and 1st order differenced versions of features
arimax_y_shift_X_diff_1_no_SVI_avg_rmse, arimax_y_shift_X_diff_1_no_SVI_avg_adj_r2, arimax_y_shift_X_diff_1_no_SVI_predictions = arimaxcv(y_shift, X_diff_1_no_SVI.iloc[1:,:], tscv)

In [132]:
# Ran function with 1 day lagged versions of features
arimax_X_lag_1_no_SVI_avg_rmse, arimax_X_lag_1_no_SVI_avg_adj_r2, arimax_X_lag_1_no_SVI_predictions = arimaxcv(y, X_lag_1_no_SVI, tscv)

In [133]:
# Ran function with shifted y and 1 day lagged versions of features
arimax_y_shift_X_lag_1_no_SVI_avg_rmse, arimax_y_shift_X_lag_1_no_SVI_avg_adj_r2, arimax_y_shift_X_lag_1_no_SVI_predictions = arimaxcv(y_shift, X_lag_1_no_SVI.iloc[1:,:], tscv)

In [134]:
# Ran function with 4 day lagged versions of features
arimax_X_lag_4_no_SVI_avg_rmse, arimax_X_lag_4_no_SVI_avg_adj_r2, arimax_X_lag_4_no_SVI_predictions = arimaxcv(y, X_lag_4_no_SVI, tscv)

In [135]:
# Ran function with shifted y and 4 day lagged versions of features
arimax_y_shift_X_lag_4_no_SVI_avg_rmse, arimax_y_shift_X_lag_4_no_SVI_avg_adj_r2, arimax_y_shift_X_lag_4_no_SVI_predictions = arimaxcv(y_shift, X_lag_4_no_SVI.iloc[1:,:], tscv)

In [136]:
# Ran function with 7 day lagged versions of features
arimax_X_lag_7_no_SVI_avg_rmse, arimax_X_lag_7_no_SVI_avg_adj_r2, arimax_X_lag_7_no_SVI_predictions = arimaxcv(y, X_lag_7_no_SVI, tscv)

In [137]:
# Ran function with shifted y and 7 day lagged versions of features
arimax_y_shift_X_lag_7_no_SVI_avg_rmse, arimax_y_shift_X_lag_7_no_SVI_avg_adj_r2, arimax_y_shift_X_lag_7_no_SVI_predictions = arimaxcv(y_shift, X_lag_7_no_SVI.iloc[1:,:], tscv)

## Model performance

In [97]:
# Compared model RMSE and adj R2 scores.

print(f"Linear Regression using original versions of features (no differenced and lagged)")
print(f"   Average RMSE: {lr_X_orig_avg_rmse}, Average Adj R^2: {lr_X_orig_avg_adj_r2}")
print()
print(f"Linear Regression using shifted y and original versions of features (no differenced and lagged)")
print(f"   Average RMSE: {lr_y_shift_X_orig_avg_rmse}, Average Adj R^2: {lr_y_shift_X_orig_avg_adj_r2}")
print()
print(f"Linear Regression using 1st order differenced versions of features")
print(f"   Average RMSE: {lr_X_diff_1_avg_rmse}, Average Adj R^2: {lr_X_diff_1_avg_adj_r2}")
print()
print(f"Linear Regression using shifted y and 1st order differenced versions of features")
print(f"   Average RMSE: {lr_y_shift_X_diff_1_avg_rmse}, Average Adj R^2: {lr_y_shift_X_diff_1_avg_adj_r2}")
print()
print(f"Linear Regression using 1 day lagged versions of features")
print(f"   Average RMSE: {lr_X_lag_1_avg_rmse}, Average Adj R^2: {lr_X_lag_1_avg_adj_r2}")
print()
print(f"Linear Regression using shifted y and 1 day lagged versions of features")
print(f"   Average RMSE: {lr_y_shift_X_lag_1_avg_rmse}, Average Adj R^2: {lr_y_shift_X_lag_1_avg_adj_r2}")
print()
print(f"Linear Regression using 4 day lagged versions of features")
print(f"   Average RMSE: {lr_X_lag_4_avg_rmse}, Average Adj R^2: {lr_X_lag_4_avg_adj_r2}")
print()
print(f"Linear Regression using shifted y and 4 day lagged versions of features")
print(f"   Average RMSE: {lr_y_shift_X_lag_4_avg_rmse}, Average Adj R^2: {lr_y_shift_X_lag_4_avg_adj_r2}")
print()
print(f"Linear Regression using 7 day lagged versions of features")
print(f"   Average RMSE: {lr_X_lag_7_avg_rmse}, Average Adj R^2: {lr_X_lag_7_avg_adj_r2}")
print()
print(f"Linear Regression using shifted y and 7 day lagged versions of features")
print(f"   Average RMSE: {lr_y_shift_X_lag_7_avg_rmse}, Average Adj R^2: {lr_y_shift_X_lag_7_avg_adj_r2}")
print()
print(f"Linear Regression using shifted y and 2 day lagged versions of features")
print(f"   Average RMSE: {lr_y_shift_X_lag_2_avg_rmse}, Average Adj R^2: {lr_y_shift_X_lag_2_avg_adj_r2}")
print()
print(f"Linear Regression using shifted y and 3 day lagged versions of features")
print(f"   Average RMSE: {lr_y_shift_X_lag_3_avg_rmse}, Average Adj R^2: {lr_y_shift_X_lag_3_avg_adj_r2}")
print()

Linear Regression using original versions of features (no differenced and lagged)
   Average RMSE: 31.134840752257695, Average Adj R^2: -10.053469692335657

Linear Regression using shifted y and original versions of features (no differenced and lagged)
   Average RMSE: 31.546349509864495, Average Adj R^2: -10.263323140499713

Linear Regression using 1st level differenced versions of features
   Average RMSE: 22.939486782000166, Average Adj R^2: -19.878422296388482

Linear Regression using shifted y and 1st level differenced versions of features
   Average RMSE: 22.936403732201697, Average Adj R^2: -19.89598447489483

Linear Regression using 1 day lagged versions of features
   Average RMSE: 8.764134453374917, Average Adj R^2: 0.16222923716691875

Linear Regression using shifted y and 1 day lagged versions of features
   Average RMSE: 1.996572896218072e-12, Average Adj R^2: 1.0

Linear Regression using 4 day lagged versions of features
   Average RMSE: 11.947664576160218, Average Adj R^

I first ran the linear regression model using combinations of not shifted and shifted y values and original, 1st order differenced, 1 day lagged, 4 day lagged, and 7 day lagged versions of X. The RMSE and adjusted R2 indicated that the models:
1. Perform poorly with the original data and a lag of 4 and 7 days,
2. Are not improved by differencing the data,
3. Perform moderately well with not shifted y and a lag of 1 day (considering a mean SVI of 101 and max of 235),
4. Appear to suffer from data leakage when y is shifted and a lag of 1 day is used,
5. Perform better when y is shifted than when it is not, in combination with lag data.
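
The near-zero RMSE and adjusted R squared of 1.0 for shifted y with a 1 day lag are what one would expect if the target and one of the features are the same series. The sketch below illustrates this on toy values. It assumes the lag feature was built with `shift(1)`, which is not shown in this notebook, and all `_demo` names are illustrative.

```python
import pandas as pd

# Toy SVI-like series (placeholder values).
y_demo = pd.Series([100.0, 104.0, 98.0, 110.0, 107.0], name="Sludge Volume Index")

# Assumed construction of the 1 day lag feature: yesterday's value on today's row.
lag_1 = y_demo.shift(1).rename("Sludge Volume Index_lag_1")

# Target as defined in the notebook: shifted by one row, leading NaN dropped.
y_shift_demo = y_demo.shift(1).dropna()

# Feature aligned the way the notebook does it, by dropping the first row.
X_lag_1_demo = lag_1.iloc[1:]

# True when the feature and the target are the same values row for row.
identical = bool((X_lag_1_demo == y_shift_demo).all())
print(identical)
```

Under that assumption, the lagged SVI column is identical to the shifted target, so a linear model can reproduce the target exactly from that one feature.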

This information indicated that I should run models with a combination of shifted y data and lag of 2 days and lag of 3 days. The models perform:
1. Moderately well with lag of 2 days,
2. Poorly with a lag of 3 days.

Overall, the highest performing linear regression model without apparent data leakage used not shifted y data and a lag of 1 day. The RMSE (about 8.8, against a mean SVI of 101) indicated that the error between the model's predictions and the actual values was small. Adjusted R squared, on the other hand, indicated that the model explains only 16% of the variance in the target variable.
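
One way to judge whether an RMSE is small is to express it as a fraction of the mean SVI. The short calculation below does this for three of the linear regression runs with not shifted y, using the RMSE values printed above and the mean SVI of 101 quoted in the text. The dictionary names are illustrative.

```python
mean_svi = 101  # mean SVI quoted in the text

# Average RMSE values copied from the linear regression output above (not shifted y).
lr_rmse = {
    "original": 31.134840752257695,
    "1st order differenced": 22.939486782000166,
    "1 day lag": 8.764134453374917,
}

# RMSE as a fraction of the mean SVI.
relative_rmse = {name: rmse / mean_svi for name, rmse in lr_rmse.items()}

for name, value in relative_rmse.items():
    print(f"{name}: {value:.1%}")
```

By this measure the 1 day lag model is off by under 10% of the mean SVI on average, compared with roughly 30% for the original features.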

In [118]:
# Compared model RMSE and adj R2 scores.

print(f"Random forest using original versions of features (no differenced and lagged)")
print(f"   Average RMSE: {rf_X_orig_avg_rmse}, Average Adj R^2: {rf_X_orig_avg_adj_r2}")
print()
print(f"Random forest using shifted y and original versions of features (no differenced and lagged)")
print(f"   Average RMSE: {rf_y_shift_X_orig_avg_rmse}, Average Adj R^2: {rf_y_shift_X_orig_avg_adj_r2}")
print()
print(f"Random forest using 1st order differenced versions of features")
print(f"   Average RMSE: {rf_X_diff_1_avg_rmse}, Average Adj R^2: {rf_X_diff_1_avg_adj_r2}")
print()
print(f"Random forest using shifted y and 1st order differenced versions of features")
print(f"   Average RMSE: {rf_y_shift_X_diff_1_avg_rmse}, Average Adj R^2: {rf_y_shift_X_diff_1_avg_adj_r2}")
print()
print(f"Random forest using 1 day lagged versions of features")
print(f"   Average RMSE: {rf_X_lag_1_avg_rmse}, Average Adj R^2: {rf_X_lag_1_avg_adj_r2}")
print()
print(f"Random forest using shifted y and 1 day lagged versions of features")
print(f"   Average RMSE: {rf_y_shift_X_lag_1_avg_rmse}, Average Adj R^2: {rf_y_shift_X_lag_1_avg_adj_r2}")
print()
print(f"Random forest using 4 day lagged versions of features")
print(f"   Average RMSE: {rf_X_lag_4_avg_rmse}, Average Adj R^2: {rf_X_lag_4_avg_adj_r2}")
print()
print(f"Random forest using shifted y and 4 day lagged versions of features")
print(f"   Average RMSE: {rf_y_shift_X_lag_4_avg_rmse}, Average Adj R^2: {rf_y_shift_X_lag_4_avg_adj_r2}")
print()
print(f"Random forest using 7 day lagged versions of features")
print(f"   Average RMSE: {rf_X_lag_7_avg_rmse}, Average Adj R^2: {rf_X_lag_7_avg_adj_r2}")
print()
print(f"Random forest using shifted y and 7 day lagged versions of features")
print(f"   Average RMSE: {rf_y_shift_X_lag_7_avg_rmse}, Average Adj R^2: {rf_y_shift_X_lag_7_avg_adj_r2}")
print()

Random forest using original versions of features (no differenced and lagged)
   Average RMSE: 24.91117034313367, Average Adj R^2: -8.827540933472704

Random forest using shifted y and original versions of features (no differenced and lagged)
   Average RMSE: 26.108343706896584, Average Adj R^2: -10.349435610163873

Random forest using 1st level differenced versions of features
   Average RMSE: 18.55058165833638, Average Adj R^2: -5.962731646621618

Random forest using shifted y and 1st level differenced versions of features
   Average RMSE: 18.594086516379136, Average Adj R^2: -6.212064115947517

Random forest using 1 day lagged versions of features
   Average RMSE: 8.344846808933502, Average Adj R^2: 0.4388156795004276

Random forest using shifted y and 1 day lagged versions of features
   Average RMSE: 0.5355404055379289, Average Adj R^2: 0.9966741032940988

Random forest using 4 day lagged versions of features
   Average RMSE: 11.50996240791368, Average Adj R^2: -0.0479016438207062

I ran the random forest model using combinations of not shifted and shifted y values and original, 1st order differenced, 1 day lagged, 4 day lagged, and 7 day lagged versions of X. The RMSE and adjusted R2 indicated that the models:
1. Perform poorly with the original data and a lag of 4 and 7 days,
2. Are only slightly improved by differencing the data, although they still perform poorly,
3. Perform better with shifted y data than with not shifted y data in combination with lagged data,
4. With a lag of one day, perform moderately well with not shifted y and very well with shifted y.

This information indicated that I did not need to run the model again with different features because I already found a very good model. Overall, the random forest model using shifted y data and a lag of 1 day performed best. The RMSE indicated that the error between the model's predictions and the actual values was very small. In addition, the R2adj indicated that the model explains about 99.7% of the variance in the target variable.


In [138]:
# Compared model RMSE and adj R2 scores.

print(f"ARIMAX using original versions of features (no differenced and lagged)")
print(f"   Average RMSE: {arimax_X_orig_avg_rmse}, Average Adj R^2: {arimax_X_orig_avg_adj_r2}")
print()
print(f"ARIMAX using shifted y and original versions of features (no differenced and lagged)")
print(f"   Average RMSE: {arimax_y_shift_X_orig_avg_rmse}, Average Adj R^2: {arimax_y_shift_X_orig_avg_adj_r2}")
print()
print(f"ARIMAX using 1st order differenced versions of features")
print(f"   Average RMSE: {arimax_X_diff_1_no_SVI_avg_rmse}, Average Adj R^2: {arimax_X_diff_1_no_SVI_avg_adj_r2}")
print()
print(f"ARIMAX using shifted y and 1st order differenced versions of features")
print(f"   Average RMSE: {arimax_y_shift_X_diff_1_no_SVI_avg_rmse}, Average Adj R^2: {arimax_y_shift_X_diff_1_no_SVI_avg_adj_r2}")
print()
print(f"ARIMAX using 1 day lagged versions of features")
print(f"   Average RMSE: {arimax_X_lag_1_no_SVI_avg_rmse}, Average Adj R^2: {arimax_X_lag_1_no_SVI_avg_adj_r2}")
print()
print(f"ARIMAX using shifted y and 1 day lagged versions of features")
print(f"   Average RMSE: {arimax_y_shift_X_lag_1_no_SVI_avg_rmse}, Average Adj R^2: {arimax_y_shift_X_lag_1_no_SVI_avg_adj_r2}")
print()
print(f"ARIMAX using 4 day lagged versions of features")
print(f"   Average RMSE: {arimax_X_lag_4_no_SVI_avg_rmse}, Average Adj R^2: {arimax_X_lag_4_no_SVI_avg_adj_r2}")
print()
print(f"ARIMAX using shifted y and 4 day lagged versions of features")
print(f"   Average RMSE: {arimax_y_shift_X_lag_4_no_SVI_avg_rmse}, Average Adj R^2: {arimax_y_shift_X_lag_4_no_SVI_avg_adj_r2}")
print()
print(f"ARIMAX using 7 day lagged versions of features")
print(f"   Average RMSE: {arimax_X_lag_7_no_SVI_avg_rmse}, Average Adj R^2: {arimax_X_lag_7_no_SVI_avg_adj_r2}")
print()
print(f"ARIMAX using shifted y and 7 day lagged versions of features")
print(f"   Average RMSE: {arimax_y_shift_X_lag_7_no_SVI_avg_rmse}, Average Adj R^2: {arimax_y_shift_X_lag_7_no_SVI_avg_adj_r2}")
print()

ARIMAX using original versions of features (no differenced and lagged)
   Average RMSE: 17.769912703273747, Average Adj R^2: -1.189739868762603

ARIMAX using shifted y and original versions of features (no differenced and lagged)
   Average RMSE: 17.09662846657269, Average Adj R^2: -1.0362848863061318

ARIMAX using 1st level differenced versions of features
   Average RMSE: 17.769912703273747, Average Adj R^2: -2.982376652516444

ARIMAX using shifted y and 1st level differenced versions of features
   Average RMSE: 17.09662846657269, Average Adj R^2: -2.703295311365037

ARIMAX using 1 day lagged versions of features
   Average RMSE: 17.769912703273747, Average Adj R^2: -1.4954503049859533

ARIMAX using shifted y and 1 day lagged versions of features
   Average RMSE: 17.09662846657269, Average Adj R^2: -1.3205714126410786

ARIMAX using 4 day lagged versions of features
   Average RMSE: 17.769912703273747, Average Adj R^2: -1.4954503049859533

ARIMAX using shifted y and 4 day lagged vers

I ran the ARIMAX model using combinations of not shifted and shifted y values and original, 1st order differenced, 1 day lagged, 4 day lagged, and 7 day lagged versions of X, none of which contained any versions of SVI. The RMSE and adjusted R2 indicated that the models:
1. Perform poorly with all data, with the same RMSE for every version of X in the output above (about 17.8 with not shifted y and 17.1 with shifted y),
2. Are not improved by differencing the data,
3. Perform slightly better with shifted y values than not shifted.

This information indicated that I did not need to run the model again with different features because no additional data is expected to produce a better ARIMAX model. Overall, these scores indicate that ARIMAX models do not provide a good fit to the data and perform worse than a simple mean prediction model as indicated by the negative R2adj scores.
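
The link between a negative R squared and a mean prediction can be checked with a small example. The sketch below computes R squared by hand with NumPy on toy values (the function name `r2` and the data are illustrative; `sklearn.metrics.r2_score` computes the same quantity). Predicting the mean of the evaluated values gives exactly 0, and predictions that are worse than that give a negative score.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

mean_pred = np.full_like(y_true, y_true.mean())  # always predict the mean
bad_pred = y_true[::-1]                          # predictions in reversed order

r2_mean = r2(y_true, mean_pred)
r2_bad = r2(y_true, bad_pred)
print(r2_mean, r2_bad)
```

In the cross-validation functions above, the baseline is the mean of each test fold, so a negative score means the model did worse than predicting that fold's own mean.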

Therefore, the random forest model using shifted y data and an X lag of one day is the best predictive model for this project.

# Data save

In [153]:
# Saved to Excel.
df.to_excel('/Users/NJahns/Desktop/Bootcamp/Capstone_Two/Edited_Data/Model_Data.xlsx', index=True)