## 02 Forecasting

### Overview

### Load Data

In [None]:
import skforecast
import pmdarima

### Linear Plots & Stationarity Analysis

For forecasting, features should be stationary: the data should show no significant trend or seasonal pattern, and the mean and variance should stay roughly constant over the whole time period.

None of the features is stationary before differencing. The trends are strong, and although seasonality is hard to see at this scale, it is very likely present. Many of the series look close to stationary after first differencing, but some clearly need to be differenced at least once more.
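
As a minimal sketch of why one difference is sometimes not enough, the example below applies `Series.diff()` to a synthetic series with a quadratic trend. The data and the name `trend` are made up for illustration and are not taken from `dev_data`. The first difference still trends upward, while the second difference is constant.

```python
import numpy as np
import pandas as pd

# Synthetic series with a quadratic trend: y_t = t^2 (illustrative only)
t = np.arange(10)
trend = pd.Series(t ** 2, dtype=float)

# First difference: y_t - y_(t-1) = 2t - 1, which still trends upward
first_diff = trend.diff().dropna()

# Second difference: a constant, so the trend has been removed
second_diff = trend.diff().diff().dropna()

print(first_diff.tolist())
print(second_diff.tolist())
```

Real series are noisy, so the differenced plots will not be perfectly flat. The point is only that the number of differences needed depends on the shape of the trend.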

In [None]:
column_list = [
    'DATE',
    'copper_PRICE',
    'CONSUMER_SENTIMENT',
    'r2000_PRICE',
    'UNEMPLOYMENT',
    'HOUSE_STARTS'
]
df = dev_data.select(column_list).to_pandas()
df['DATE'] = pd.to_datetime(df['DATE'])
df = df.sort_values('DATE')
fig, axes = vis.plot_time_series_diffs(df)

##### Common Stationarity Tests

In [None]:
# TODO add functional stationarity tests & interpretation
from statsmodels.tsa.stattools import adfuller, kpss

# ADF null hypothesis: the series has a unit root (is non-stationary)
adfuller(df['copper_PRICE'])

In [None]:
# Run both tests on the same series; the column name must match the one selected above
stationarity_tests = {
    'adfuller': adfuller(df['copper_PRICE']),
    'kpss': kpss(df['copper_PRICE']),
}
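
One possible way to read the two tests together is sketched below. The helper names `interpret_stationarity` and `run_stationarity_tests`, the 0.05 threshold, and the verdict labels are assumptions for illustration. The sketch relies on the tests having opposite null hypotheses: ADF assumes a unit root (non-stationary), while KPSS assumes stationarity. In both `adfuller` and `kpss`, the second element of the returned tuple is the p-value.

```python
def interpret_stationarity(adf_pvalue, kpss_pvalue, alpha=0.05):
    """Combine ADF and KPSS p-values into a plain-language verdict."""
    adf_stationary = adf_pvalue < alpha     # ADF rejects the unit-root null
    kpss_stationary = kpss_pvalue >= alpha  # KPSS does not reject stationarity
    if adf_stationary and kpss_stationary:
        return 'stationary'
    if not adf_stationary and not kpss_stationary:
        return 'non-stationary'
    if adf_stationary:
        # ADF says stationary, KPSS disagrees: differencing is the usual remedy
        return 'difference-stationary'
    # KPSS says stationary, ADF disagrees: detrending is the usual remedy
    return 'trend-stationary'


def run_stationarity_tests(series, alpha=0.05):
    """Run ADF and KPSS on a pandas Series and return p-values with a verdict."""
    # Imported here so interpret_stationarity works without statsmodels installed
    from statsmodels.tsa.stattools import adfuller, kpss

    clean = series.dropna()  # neither test handles missing values
    adf_pvalue = adfuller(clean)[1]
    kpss_pvalue = kpss(clean, regression='c', nlags='auto')[1]
    return {
        'adf_pvalue': adf_pvalue,
        'kpss_pvalue': kpss_pvalue,
        'verdict': interpret_stationarity(adf_pvalue, kpss_pvalue, alpha),
    }
```

Calling `run_stationarity_tests(df['copper_PRICE'])` and then `run_stationarity_tests(df['copper_PRICE'].diff())` would give a before-and-after comparison for first differencing. Note that `kpss` reports p-values only within a tabulated range and warns when the true value falls outside it.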

### Baseline Models

Key tools:
- [skforecast](https://skforecast.org/)
- [pmdarima](https://github.com/alkaline-ml/pmdarima)
- sklearn scaling (`sklearn.preprocessing.StandardScaler`)

In [None]:
from skforecast.sarimax import Sarimax
from sklearn.preprocessing import StandardScaler

##### Remove Nontrading Days

At this point, all required lags and moving-average values have been calculated. Deltas should be calculated from trading days only, so nontrading days can now be removed.
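
A minimal sketch of this step, assuming that nontrading days are weekends plus an explicit list of market holidays. The helper `keep_trading_days`, the `holidays` argument, the `COPPER_DELTA` column, and the demo values are all hypothetical and are not part of the original notebook.

```python
import pandas as pd


def keep_trading_days(frame, date_col='DATE', holidays=()):
    """Drop weekend rows and rows whose date appears in `holidays`."""
    dates = pd.to_datetime(frame[date_col])
    is_weekday = dates.dt.dayofweek < 5  # Monday=0 ... Friday=4
    is_holiday = dates.dt.normalize().isin(pd.to_datetime(list(holidays)))
    return frame.loc[is_weekday & ~is_holiday].reset_index(drop=True)


# Made-up demo data: one calendar week starting Monday 2024-01-01
demo = pd.DataFrame({
    'DATE': pd.date_range('2024-01-01', periods=7, freq='D'),
    'COPPER_PRICE': [10, 10, 11, 12, 13, 13, 13],
})

# Treat 2024-01-01 as a market holiday, then compute day-over-day deltas
trading = keep_trading_days(demo, holidays=['2024-01-01'])
trading['COPPER_DELTA'] = trading['COPPER_PRICE'].diff()
print(trading)
```

Filtering before calling `diff()` means each delta spans consecutive trading days rather than calendar days, so weekend rows with carried-forward prices do not produce artificial zero deltas.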

Pure ARIMA model

In [None]:
df = dev_data.to_pandas()
df['DATE'] = pd.to_datetime(df['DATE'])
pdq = (1, 1, 1)  # p autoregression lags, d differences, q moving average
model = Sarimax(order=pdq)
model.fit(y=df['COPPER_PRICE'])
model.summary()
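
For reference, the order `(1, 1, 1)` used above corresponds to the following model, written in a common textbook notation. The symbols are assumptions introduced here: $B$ is the backshift operator, $\phi_1$ the autoregressive coefficient, $\theta_1$ the moving-average coefficient, $\varepsilon_t$ white noise, and any constant term is omitted.

$$
(1 - \phi_1 B)(1 - B)\, y_t = (1 + \theta_1 B)\, \varepsilon_t
$$

Writing the first difference as $w_t = y_t - y_{t-1}$, this is equivalent to

$$
w_t = \phi_1 w_{t-1} + \varepsilon_t + \theta_1 \varepsilon_{t-1}
$$

so the model fits an ARMA(1,1) process to the once-differenced series.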

Simple ARIMAX model

In [None]:
df = dev_data.to_pandas()

# Use every opening-price column as an exogenous regressor
exog_cols = [col for col in df.columns if '_OPEN' in col]
exog = df[exog_cols]

pdq = (1, 1, 1)  # p autoregression lags, d differences, q moving average
model = Sarimax(order=pdq)
model.fit(y=df['COPPER_PRICE'], exog=exog)
model.summary()

In [None]:
df = dev_data.to_pandas()
(df['COPPER_OPEN'] - df['COPPER_OPEN'].mean())/df['COPPER_OPEN'].std()

In [None]:
df = dev_data.to_pandas()

# Opening-price columns as exogenous regressors, minus three excluded series
exog_cols = [col for col in df.columns if '_OPEN' in col]
exog = df[exog_cols]
exog = exog.drop(['NATGAS_OPEN', 'GOLD_OPEN', 'CORN_OPEN'], axis='columns')

# Standardize each regressor to zero mean and unit (sample) standard deviation
exog = (exog - exog.mean()) / exog.std()

# Standardize the target the same way
target = df['COPPER_PRICE']
target = (target - target.mean()) / target.std()

pdq = (1, 1, 1)  # p autoregression lags, d differences, q moving average
model = Sarimax(order=pdq)
model.fit(y=target, exog=exog)
model.summary()
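
The manual standardization above uses pandas' `std()`, which defaults to the sample standard deviation (`ddof=1`), whereas `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below makes that choice explicit. The helper `zscore` and the demo columns are hypothetical and use made-up values.

```python
import pandas as pd


def zscore(frame, ddof=1):
    """Standardize each column to zero mean and unit standard deviation."""
    return (frame - frame.mean()) / frame.std(ddof=ddof)


# Made-up demo columns on very different scales
demo = pd.DataFrame({
    'A_OPEN': [1.0, 2.0, 3.0, 4.0],
    'B_OPEN': [10.0, 20.0, 30.0, 40.0],
})

sample_scaled = zscore(demo)              # same convention as the cell above
population_scaled = zscore(demo, ddof=0)  # same convention as StandardScaler

print(sample_scaled)
print(population_scaled)
```

On long series the two conventions give nearly identical values, so the difference matters mainly when comparing results against `StandardScaler` output exactly.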

### Feature Engineering