<a href="https://colab.research.google.com/github/maschu09/mless/blob/main/time_series_forecasting/5_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Approaches:

If we want to incorporate future temperature data to predict current ozone measurements, we have two approaches that do not require architectural changes:

1. **Shifting the temperature measurements backwards:** Say our context window starts at $t=0$ and runs until timepoint $t=w_c$. Our prediction window then starts at $t=w_c + 1$ and runs until $t=w_c + w_p$, where $w_p$ is the length of the prediction window. With two input variables, the input to our model has shape $(N, w_c, 2)$. To include future temperature data, we can shift the temperature series by $k$ timepoints, so that the first temperature measurement in the context window is from timepoint $t=k$ and the last is from $t=w_c + k$. This way, the model can see $k$ steps into the future, at the cost of losing the first $k$ temperature measurements.

2. **Appending and Padding:** If we do not want to lose the information at the beginning of the temperature sequence, we can instead append the whole future window for this variable to our context window. The data then has shape $(N, w_c + w_p, 2)$, where the last $w_p$ ozone entries are set to a padding value (probably best to pick an out-of-distribution value). If the model struggles with the long run of padding values, the task could also be reformulated as a single-step prediction task.
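
To make the two layouts concrete, here is a minimal NumPy sketch for a single sample. The toy sizes and the names `temp`, `o3`, `x1`, `x2` and `PAD` are illustrative assumptions and not part of the notebook; the time index doubles as the temperature value so the visible time steps can be read off directly.

```python
import numpy as np

w_c, w_p, k = 6, 3, 2          # toy context length, prediction length, shift
PAD = -99.0                    # assumed out-of-distribution padding value

t = np.arange(w_c + w_p)       # time indices 0..8 of one sample
temp = t.astype(float)         # temperature: value equals its time index
o3 = 100.0 + t                 # ozone: offset so the channels are easy to tell apart

# Approach 1: temperature window shifted k steps ahead, ozone unchanged
x1 = np.stack([temp[k:w_c + k], o3[:w_c]], axis=-1)      # shape (w_c, 2)

# Approach 2: full future temperature appended, future ozone replaced by PAD
o3_padded = np.concatenate([o3[:w_c], np.full(w_p, PAD)])
x2 = np.stack([temp, o3_padded], axis=-1)                # shape (w_c + w_p, 2)

print(x1[:, 0])   # temperature time steps seen under approach 1
print(x2[:, 1])   # ozone channel with padded future under approach 2
```

In the first layout the temperature channel loses its first `k` steps and gains `k` future ones; in the second, nothing is lost but the sequence grows by `w_p` steps.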

(**A probably better approach:** If we can make architectural changes, it would probably be a good idea to switch to an encoder-decoder architecture, where the _past_ data is encoded together and the future temperature data is given only to the decoder while it predicts ozone values. This gives an explicit distinction between past and future values, which the model can probably learn more efficiently.)
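
As a rough illustration of that idea, the sketch below wires up one possible encoder-decoder with the Keras functional API. It is not used anywhere in this notebook; the layer sizes, the small window lengths and all variable names are assumptions chosen only to check that the shapes fit together.

```python
import numpy as np
from tensorflow.keras.layers import Input, LSTM, Dense, TimeDistributed
from tensorflow.keras.models import Model

# Small illustrative sizes; the notebook itself uses w_c=336 and w_p=96.
w_c, w_p, units = 12, 4, 8

# Encoder: reads past temperature and ozone, keeps only its final state.
encoder_inputs = Input(shape=(w_c, 2), name="past_temp_o3")
_, state_h, state_c = LSTM(units, return_state=True)(encoder_inputs)

# Decoder: reads future temperature only, starting from the encoder state.
decoder_inputs = Input(shape=(w_p, 1), name="future_temp")
decoder_seq = LSTM(units, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c]
)

# One ozone value per future time step.
ozone_outputs = TimeDistributed(Dense(1))(decoder_seq)

seq2seq_model = Model([encoder_inputs, decoder_inputs], ozone_outputs)
seq2seq_model.compile(optimizer="adam", loss="mse")

# Random data, just to check that the shapes line up.
past = np.random.rand(5, w_c, 2).astype("float32")
future_temp = np.random.rand(5, w_p, 1).astype("float32")
pred = seq2seq_model.predict([past, future_temp], verbose=0)
print(pred.shape)
```

With this layout the ozone labels would need a trailing channel axis, i.e. shape $(N, w_p, 1)$, and no padding value is required because the decoder never receives an ozone input.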

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from tensorflow.keras.models import Sequential,load_model
from tensorflow.keras.layers import Dense, LSTM, Input
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.losses import MeanSquaredError
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import tensorflow as tf
import os

context_window = 336
prediction_horizon = 96
variable_column = ["temp", "o3"] # define the variables wanted for training

In [None]:
# Function to evaluate model performance
def evaluate_model(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    print(f"RMSE: {rmse:.4f}")
    return rmse

# Loading multi-variable data-sequence

The data is loaded in the same way as in the original ozone-prediction notebook.

In [None]:
import pickle

# Load the prepared multi-variable data
with open("X_train.pkl", "rb") as f:
    X_train_full = pickle.load(f)

with open("X_test.pkl", "rb") as f:
    X_test_full = pickle.load(f)

with open("y_train.pkl", "rb") as f:
    y_train_full = pickle.load(f)

with open("y_test.pkl", "rb") as f:
    y_test_full = pickle.load(f)

print(f"X_train_full shape: {X_train_full.shape}, y_train_full shape: {y_train_full.shape}")
print(f"X_test_full shape: {X_test_full.shape}, y_test_full shape: {y_test_full.shape}")

## Else if using local files:
dataframe = pd.read_csv("normalized_data.csv")
scaler_stats = {col: {'mean': dataframe[col].mean(), 'std': dataframe[col].std()} for col in variable_column}


# the station code is the first variable column, hence select only the last two
X_train = X_train_full[:,:,1:].copy()
X_test = X_test_full[:,:,1:].copy()

# keep the future temperature values (column 1 of the full array) as an additional input
temp_y_train = y_train_full[:,:,1].copy()  # temperature data for training
temp_y_test = y_test_full[:,:,1].copy()    # temperature data for testing

# for the label, we only want the ozone data (column 2 of the full array)
y_train = y_train_full[:,:,2].copy()
y_test = y_test_full[:,:,2].copy()

X_train = np.array(X_train, dtype=np.float32)
X_test = np.array(X_test, dtype=np.float32)
y_train = np.array(y_train, dtype=np.float32)
y_test = np.array(y_test, dtype=np.float32)
temp_y_train = np.array(temp_y_train, dtype=np.float32)
temp_y_test = np.array(temp_y_test, dtype=np.float32)

# verify the shapes of the data
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

X_test_full shape: (160260, 336, 2), y_test_full shape: (160260, 96, 2)
X_test shape: (160260, 336, 1), y_test shape: (160260, 96, 1)


## Define Training Function

The model is trained in the same way for both approaches, only the data preparation is different. The only thing we need to account for is the context-window length.

In [None]:
def train_lstm_model(X_train, y_train, context_window):

    # Tunable LSTM parameters
    lstm_units = 50
    lstm_epochs = 5
    lstm_batch_size = 16
    lstm_optim = 'adam'
    lstm_loss = 'mse'

    checkpoint_dir = "./checkpoint/"
    os.makedirs(checkpoint_dir, exist_ok=True)

    checkpoint_path = os.path.join(checkpoint_dir, "lstm_multivar.h5")

    ## Ignore user warning on keras as the choice for this exercise is to use h5.
    print(f"Training new model for variables {variable_column}")

    # the only change needed to allow for multiple input variables is to change the input shape of the LSTM layer
    # to match the number of variables in the input data
    lstm_model = Sequential([
        LSTM(lstm_units, return_sequences=True, input_shape=(context_window, len(variable_column))), # change to allow multiple input variables
        LSTM(lstm_units, return_sequences=False),
        Dense(prediction_horizon)
    ])

    lstm_model.compile(optimizer=lstm_optim, loss=lstm_loss)

    checkpoint_callback = ModelCheckpoint(
        checkpoint_path, monitor="val_loss", save_best_only=True, verbose=1
    )

    training = lstm_model.fit(
        X_train,
        y_train,
        epochs=lstm_epochs, batch_size=lstm_batch_size,
        validation_split=0.2, verbose=1,
        callbacks=[checkpoint_callback]
    )

    training_history = training.history

    return lstm_model

In [None]:
def get_ozone_predictions(model, X_test, y_test):
    """
    Get ozone predictions from the trained model.
    """
    lstm_pred = model.predict(X_test)
    rmse = evaluate_model(y_test, lstm_pred)  # compare the labels against the predictions
    
    return lstm_pred

# Approach 1

## Preparing the data

In [None]:
k = 24 # shift the temp data by 24 hours to use future temperature data

temp_train = X_train[:, k:, 0].copy() # take the temperature data from the training set
temp_future = temp_y_train[:, :k].copy() # take the temperature data from the future set

concat_temp = np.concatenate((temp_train, temp_future), axis=1) # concatenate the two temperature data sets

# replace temperature data in the training set with the concatenated data
X_train1 = X_train.copy()  # create a copy of the training data to avoid modifying the original
X_train1[:, :, 0] = concat_temp

# train the LSTM model with the modified training data
lstm_model = train_lstm_model(X_train1, y_train, context_window)
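
The cell above only prepares the training inputs; for evaluation, the test inputs need the same shift. A minimal sketch of how this could be wrapped in a reusable function follows. `shift_temperature` is a hypothetical helper that does not exist in the original notebook, and the toy arrays stand in for the real data.

```python
import numpy as np

def shift_temperature(X, temp_future, k):
    """Return a copy of X whose temperature channel (index 0) is shifted k steps ahead.

    X:           (N, w_c, 2) context windows, channels ordered [temp, o3]
    temp_future: (N, w_p) temperature over the prediction window, with k <= w_p
    """
    X_shifted = X.copy()
    X_shifted[:, :, 0] = np.concatenate((X[:, k:, 0], temp_future[:, :k]), axis=1)
    return X_shifted

# Toy data: 2 samples, context window of 5, prediction window of 3
X_toy = np.zeros((2, 5, 2), dtype=np.float32)
X_toy[:, :, 0] = np.arange(5)                                          # temperature at t = 0..4
temp_future_toy = np.tile(np.arange(5, 8, dtype=np.float32), (2, 1))   # temperature at t = 5..7

X_toy_shifted = shift_temperature(X_toy, temp_future_toy, k=2)
print(X_toy_shifted[0, :, 0])  # temperature channel now covers t = 2..6
```

In this notebook the call would presumably read `X_test1 = shift_temperature(X_test, temp_y_test, k)`, followed by `get_ozone_predictions(lstm_model, X_test1, y_test)`.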

# Approach 2

## Preparing the data

In [None]:
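# future values of both variables from the training labels (drop the station code column)
y_train_vars = np.array(y_train_full[:, :, 1:], dtype=np.float32)

# append the prediction window to the unshifted context window: (N, w_c + w_p, 2)
X_train2 = np.concatenate((X_train, y_train_vars), axis=1)

# mask out only the future ozone values, which the model must not see
X_train2[:, context_window:, 1] = -99

# Train the LSTM model with the modified training data
lstm_model = train_lstm_model(X_train2, y_train, X_train2.shape[1])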
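
The test inputs have to be extended in the same way before the model can be evaluated. One way to do this is sketched below; `append_future_temperature` and `PAD_VALUE` are hypothetical names that are not part of the original notebook, and the toy arrays only demonstrate the resulting layout.

```python
import numpy as np

PAD_VALUE = -99.0  # assumed padding value, matching the -99 used for training

def append_future_temperature(X, temp_future, pad_value=PAD_VALUE):
    """Append the prediction window to X, with real temperature and padded ozone.

    X:           (N, w_c, 2) context windows, channels ordered [temp, o3]
    temp_future: (N, w_p) temperature over the prediction window
    returns:     (N, w_c + w_p, 2)
    """
    n, w_p = temp_future.shape
    # Start with every future entry padded, then fill in the known temperature.
    future = np.full((n, w_p, X.shape[2]), pad_value, dtype=X.dtype)
    future[:, :, 0] = temp_future
    return np.concatenate((X, future), axis=1)

# Toy data: 2 samples, context window of 5, prediction window of 3
X_toy = np.ones((2, 5, 2), dtype=np.float32)
temp_future_toy = np.full((2, 3), 7.0, dtype=np.float32)

X_toy_padded = append_future_temperature(X_toy, temp_future_toy)
print(X_toy_padded.shape)
```

Applied to the real data, this would be called with `X_test` and `temp_y_test`, so that the past ozone values stay intact and only the appended steps carry the padding value.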