# <a id='toc1_'></a>[Forecasting (Keras)](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Forecasting (Keras)](#toc1_)    
  - [Create sequences](#toc1_1_)    
  - [Split](#toc1_2_)    
  - [Scale](#toc1_3_)    
  - [Train](#toc1_4_)    
  - [Features](#toc1_5_)    
  - [Parameters](#toc1_6_)    
  - [Evaluate](#toc1_7_)    
- [Training](#toc2_)    


In [1]:
import json
import os
from datetime import datetime

import joblib
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

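# Populate os.environ (RUNS_PATH, MODELS_PATH, PROCESSED_DATA_PATH) from a
# local .env file if one exists; variables that are already set are kept.
load_dotenv()

SEED = 10
# Note: this seeds NumPy only.
np.random.seed(seed=SEED)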

## <a id='toc1_1_'></a>[Create sequences](#toc0_)

In [2]:
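def create_sequences(df: pd.DataFrame, lookback: int, target: str,
                     inference: bool, univariate: bool) -> tuple | np.ndarray:
    """
    Build sliding windows of `lookback` rows and the target value that follows.

    Returns `x` with shape (n_windows, lookback, n_columns) and `y` with shape
    (n_windows, 1), or only `x` when `inference` is True. The last window
    ends at the second-to-last row, so no window covers the final row.
    """
    x_train, y_train = [], []
    if univariate:
        # Use only the past values of the target column as input.
        for i in range(lookback, len(df)):
            x_train.append(df.iloc[i - lookback:i][target])
            y_train.append(df.iloc[i][target])
        x_train = np.expand_dims(x_train, axis=-1)
    else:
        # Use every column of `df` (including the target) as input.
        for i in range(lookback, len(df)):
            x_train.append(df.iloc[i - lookback:i])
            y_train.append(df.iloc[i][target])
        x_train = np.stack(x_train)
    y_train = np.expand_dims(y_train, axis=-1)
    if inference:
        return x_train
    return x_train, y_train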
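
The windowing above is easiest to check on a toy array. The sketch below is a NumPy-only stand-in, not the notebook's function: `make_windows`, `values` and `target_col` are hypothetical names introduced for illustration. It shows the shapes that the multivariate branch produces and that each target is the value one step after its window.

```python
import numpy as np


def make_windows(values: np.ndarray, lookback: int, target_col: int):
    """Hypothetical helper: slide a window of `lookback` rows over `values`."""
    x, y = [], []
    for i in range(lookback, len(values)):
        x.append(values[i - lookback:i])   # rows i-lookback .. i-1 (all columns)
        y.append(values[i, target_col])    # target value at row i
    return np.stack(x), np.expand_dims(np.array(y), axis=-1)


# Toy data: 6 time steps, 2 columns (one feature, one target).
values = np.arange(12, dtype=float).reshape(6, 2)
x, y = make_windows(values, lookback=3, target_col=1)

print(x.shape)  # (3, 3, 2): windows, time steps, columns
print(y.shape)  # (3, 1)
```

With 6 rows and a lookback of 3, only 3 windows exist, so a frame needs more than `lookback` rows to yield any sample at all.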


## <a id='toc1_2_'></a>[Split](#toc0_)

In [3]:
def split(df: pd.DataFrame, train_split: float) -> tuple:
    interim_df = df.copy(deep=True)
    train = interim_df.iloc[:int(len(interim_df) * train_split)]
    test = interim_df.iloc[int(len(interim_df) * train_split):]
    return train, test
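
As a quick check of the split arithmetic, the sketch below applies the same `int(len * train_split)` cut to a stand-in array (`rows` is a hypothetical placeholder for the chronologically ordered frame). It illustrates that the cut truncates toward zero and that every training row precedes every test row.

```python
import numpy as np

rows = np.arange(11)                  # stand-in for 11 chronologically ordered rows
train_split = 0.8
cut = int(len(rows) * train_split)    # int() truncates 8.8 to 8

train_rows, test_rows = rows[:cut], rows[cut:]
print(len(train_rows), len(test_rows))
```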

## <a id='toc1_3_'></a>[Scale](#toc0_)

In [4]:
def scale(df: pd.DataFrame, timestamp: str, cluster: int,
          test: bool) -> tuple:
    """Standardise `df`: fit and save the scaler on train, reuse it on test."""
    path = os.path.join(os.environ['RUNS_PATH'], timestamp,
                        f'cluster_{cluster}')
    if not test:
        scaler = StandardScaler()
        scaled = scaler.fit_transform(df.values)
        os.makedirs(path, exist_ok=True)
        joblib.dump(scaler, os.path.join(path, 'scaler.joblib'))
    else:
        # Reuse the scaler fitted on this run's training split for the same
        # cluster, so train and test share one set of statistics.
        scaler = joblib.load(os.path.join(path, 'scaler.joblib'))
        scaled = scaler.transform(df.values)
    scaled_df = pd.DataFrame(scaled, columns=df.columns)
    # os.makedirs(processed_data_path, exist_ok=True)
    # scaled_df.to_parquet(os.path.join(
    #     processed_data_path,
    #     f'scaled_{"train" if not test else "test"}.parquet'),
    #                      index=False)
    return scaler, scaled_df
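
The reason for fitting on the training split only can be sketched without scikit-learn. The example below standardises by hand with NumPy, using the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; the arrays `train` and `test` are made-up values for illustration.

```python
import numpy as np

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[5.0, 50.0], [6.0, 60.0]])

# Statistics come from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)   # ddof=0 (population standard deviation)

scaled_train = (train - mean) / std
scaled_test = (test - mean) / std   # test rows reuse the training statistics

print(scaled_train.mean(axis=0), scaled_train.std(axis=0))
print(scaled_test)
```

The scaled training columns have zero mean and unit variance, while the test rows generally do not. That is expected: the test data must not influence the statistics.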


## <a id='toc1_4_'></a>[Train](#toc0_)

In [5]:
def train(df: pd.DataFrame,
          target: str,
          folds: int,
          lookback: int,
          univariate: bool,
          lstm_units: int,
          learning_rate: float,
          batch_size: int,
          val_split: float,
          epochs: int,
          patience: int,
          verbose: int,
          cluster: int,
          timestamp: str | None = None) -> tuple:
    """
    The train function trains a model on the data provided.

    Args:
        df: pd.DataFrame: Pass the dataframe with all the features
        target: str: Define the column name of the target variable
        folds: int: Define the number of folds to be used in cross-validation
        lookback: int: Define the number of time steps to look back in order to predict the next value
        univariate: bool: Create the sequences for training and validation
        lstm_units: int: Define the number of units in each lstm layer
        learning_rate: float: Control the magnitude of the updates to weights during training
        batch_size: int: Specify the number of samples to work through before updating the internal model parameters
        val_split: float: Specify the fraction of the training data to be used as validation data
        epochs: int: Specify the number of epochs to train for
        patience: int: Stop the training early if the loss does not improve after a given number of epochs
        verbose: int: Control the amount of logging information printed during training

    Returns:
        A tuple with two elements: model and val_metrics
    """
    tscv = TimeSeriesSplit(n_splits=folds)
    val_loss, val_rmse = [], []
    for fold, (train_index, val_index) in enumerate(tscv.split(df), start=1):
        # Build a fresh callback list for every fold so that TensorBoard
        # callbacks from earlier folds do not accumulate.
        callbacks = [
            keras.callbacks.EarlyStopping(patience=patience,
                                          monitor='val_loss',
                                          mode='min',
                                          verbose=verbose,
                                          restore_best_weights=True),
            keras.callbacks.ReduceLROnPlateau(monitor='val_loss',
                                              factor=0.75,
                                              patience=patience // 2,
                                              verbose=verbose,
                                              mode='min')
        ]
        if timestamp is not None:
            callbacks.append(
                keras.callbacks.TensorBoard(log_dir=os.path.join(
                    os.environ['RUNS_PATH'], timestamp,
                    f'cluster_{cluster}', f'fold_{fold}'),
                                            histogram_freq=1,
                                            write_graph=True,
                                            write_images=True,
                                            update_freq='epoch'))
        x_train, y_train = create_sequences(df=df.iloc[train_index],
                                            lookback=lookback,
                                            univariate=univariate,
                                            target=target,
                                            inference=False)
        model = keras.models.Sequential(name='forecaster')
        model.add(
            keras.layers.LSTM(lstm_units,
                              input_shape=(x_train.shape[1], x_train.shape[2]),
                              return_sequences=False))
        model.add(keras.layers.Dense(1))
        model.compile(
            loss=keras.losses.mean_squared_error,
            metrics=[keras.metrics.RootMeanSquaredError()],
            optimizer=keras.optimizers.RMSprop(learning_rate=learning_rate))
        history = model.fit(x_train,
                            y_train,
                            shuffle=False,
                            epochs=epochs,
                            batch_size=batch_size,
                            validation_split=val_split,
                            verbose=verbose,
                            callbacks=callbacks)
        x_val, y_val = create_sequences(df=df.iloc[val_index],
                                        lookback=lookback,
                                        univariate=univariate,
                                        target=target,
                                        inference=False)
        val_loss_in_fold, val_rmse_in_fold = model.evaluate(
            x=x_val, y=y_val, batch_size=batch_size, verbose=verbose)
        val_loss.append(val_loss_in_fold)
        val_rmse.append(val_rmse_in_fold)
    val_metrics = {
        'val_loss_mean': np.mean(val_loss),
        'val_loss_std': np.std(val_loss),
        'val_rmse_mean': np.mean(val_rmse),
        'val_rmse_std': np.std(val_rmse)
    }
    return model, val_metrics
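
The fold layout interacts with `lookback`: a validation fold yields `len(fold) - lookback` windows, so short folds leave very few samples to evaluate. The sketch below is an expanding-window splitter meant to mimic `TimeSeriesSplit` with default settings; `expanding_folds` is a hypothetical helper, the sample count of 100 is made up, and scikit-learn's exact fold boundaries may differ.

```python
import numpy as np


def expanding_folds(n_samples: int, n_splits: int):
    """Hypothetical helper: yield (train_idx, val_idx) with a growing train set."""
    fold_size = n_samples // (n_splits + 1)
    indices = np.arange(n_samples)
    first_start = n_samples - n_splits * fold_size
    for start in range(first_start, n_samples, fold_size):
        # Train on everything before `start`, validate on the next block.
        yield indices[:start], indices[start:start + fold_size]


lookback = 14
for fold, (train_idx, val_idx) in enumerate(expanding_folds(100, 5), start=1):
    # Number of sliding windows a fold of this length can produce.
    n_val_windows = max(len(val_idx) - lookback, 0)
    print(fold, len(train_idx), len(val_idx), n_val_windows)
```

Under these assumptions each validation fold has 16 rows and therefore only 2 windows, which suggests checking fold sizes against `lookback` before trusting the per-fold metrics.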


## <a id='toc1_5_'></a>[Features](#toc0_)

In [6]:
features = [
    'Impressions',
    'AbsoluteTopImpressionPercentage',
    'TopImpressionPercentage',
    'SearchImpressionShare',
    'SearchTopImpressionShare',
    'SearchRankLostTopImpressionShare',
    'Clicks',
    'Cost_gbp',
    'CpcBid_gbp',
]
features_date = features + ['Date']
target = 'CpcBid_gbp'
features.remove(target)
with open(
        os.path.join(os.environ['MODELS_PATH'], 'kmeans_clustered_dict.json'),
        'r') as f:
    clustered = json.load(f)

## <a id='toc1_6_'></a>[Parameters](#toc0_)

In [7]:
seq_len = 14
pred_len = 1
# `features` and `target` are defined in the Features cell above.
train_split = 0.8
test_split = 0.2
batch_size = 4
patience = 20
epochs = 500
learning_rate = 0.01
log_interval = 10
seed = SEED
hidden_size = 10
num_layers = 2
folds = 5
univariate = False
lstm_units = 10
val_split = .1
verbose = 10
timestamp = datetime.now().strftime('%Y%m%d%H%M%S')

## <a id='toc1_7_'></a>[Evaluate](#toc0_)

In [8]:
def evaluate(df: pd.DataFrame, model: keras.models.Model, lookback: int,
             batch_size: int, univariate: bool, target: str,
             verbose: int) -> dict:
    """
    The evaluate function evaluates the model on the test data.
    It returns a dictionary with two keys: `test_loss` and `test_rmse`.
    The values of each key are scalar values that give the loss and RMSE on
    the test set, respectively.

    Args:
        df: pd.DataFrame: Pass the dataframe that is used for training and testing
        model: keras.models.Model: Evaluate the model
        lookback: int: Define how many timesteps back the input data should go
        batch_size: int: Define the number of samples per gradient update
        univariate: bool: Determine whether the model is univariate or multivariate
        target: str: Specify the column name of the target variable
        verbose: int: Specify the verbosity of the model

    Returns:
        A dictionary with two values: test_loss, test_rmse
    """
    x_test, y_test = create_sequences(df=df,
                                      lookback=lookback,
                                      univariate=univariate,
                                      target=target,
                                      inference=False)
    test_loss, test_rmse = model.evaluate(x=x_test,
                                          y=y_test,
                                          batch_size=batch_size,
                                          verbose=verbose)
    return {'test_loss': test_loss, 'test_rmse': test_rmse}
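
Because the target is standardised before training, the reported RMSE is in units of standard deviations, not in the target's original units. Standardisation is linear, so multiplying the scaled RMSE by the target's standard deviation recovers the RMSE in original units. The sketch below checks this with made-up numbers; `target_mean` and `target_std` are illustrative stand-ins for the target column's entries in a fitted scaler's `mean_` and `scale_` attributes.

```python
import numpy as np

target_mean, target_std = 2.0, 0.5   # illustrative values, not from the notebook
y_true_scaled = np.array([0.0, 1.0, -1.0, 2.0])
y_pred_scaled = np.array([0.2, 0.8, -1.1, 1.5])

rmse_scaled = np.sqrt(np.mean((y_true_scaled - y_pred_scaled) ** 2))

# Undo the standardisation: y = y_scaled * std + mean.
y_true = y_true_scaled * target_std + target_mean
y_pred = y_pred_scaled * target_std + target_mean
rmse_original = np.sqrt(np.mean((y_true - y_pred) ** 2))

# The mean cancels in the difference, so only the std rescales the error.
print(np.isclose(rmse_original, rmse_scaled * target_std))
```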


# <a id='toc2_'></a>[Training](#toc0_)

In [9]:
for cluster in clustered.keys():
    cluster_df = pd.read_csv(
        os.path.join(os.environ['PROCESSED_DATA_PATH'],
                     f'processed_{cluster}.csv'))
    train_df, test_df = split(df=cluster_df, train_split=train_split)
    scaler, scaled_train_df = scale(df=train_df[features + [target]],
                                    cluster=cluster,
                                    timestamp=timestamp,
                                    test=False)
    _, scaled_test_df = scale(df=test_df[features + [target]],
                              cluster=cluster,
                              timestamp=timestamp,
                              test=True)
    model, val_metrics = train(df=scaled_train_df,
                               target=target,
                               cluster=cluster,
                               folds=folds,
                               lookback=seq_len,
                               univariate=univariate,
                               lstm_units=lstm_units,
                               learning_rate=learning_rate,
                               batch_size=batch_size,
                               val_split=val_split,
                               epochs=epochs,
                               patience=patience,
                               timestamp=timestamp,
                               verbose=verbose)
    with open(
            os.path.join(os.environ['RUNS_PATH'], timestamp,
                         f'cluster_{cluster}',
                         f'val_metrics_cluster_{cluster}.json'), 'w') as fp:
        json.dump(val_metrics, fp)
    model.save(
        os.path.join(os.environ['RUNS_PATH'], timestamp, f'cluster_{cluster}',
                     f'cluster_{cluster}.h5'))
    test_metrics = evaluate(df=scaled_test_df,
                            model=model,
                            lookback=seq_len,
                            batch_size=batch_size,
                            univariate=univariate,
                            target=target,
                            verbose=verbose)
    with open(
            os.path.join(os.environ['RUNS_PATH'], timestamp,
                         f'cluster_{cluster}',
                         f'test_metrics_cluster_{cluster}.json'), 'w') as fp:
        json.dump(test_metrics, fp)

Epoch 1/500
Epoch 2/500
Epoch 3/500
Epoch 4/500
Epoch 5/500
Epoch 6/500
Epoch 7/500
Epoch 8/500
Epoch 9/500
Epoch 10/500
Epoch 11/500
Epoch 12/500
Epoch 13/500
Epoch 14/500
Epoch 15/500
Epoch 16/500
Epoch 17/500
Epoch 18/500
Epoch 19/500
Epoch 20/500
Epoch 21/500
Epoch 22/500
Epoch 23/500
Epoch 24/500
Epoch 25/500
Epoch 26/500
Epoch 27/500
Epoch 28/500
Epoch 29/500
Epoch 30/500
Epoch 31/500
Epoch 32/500
Epoch 33/500
Epoch 34/500
Epoch 35/500
Epoch 36/500
Epoch 37/500
Epoch 38/500
Epoch 39/500
Epoch 40/500
Epoch 41/500
Epoch 42/500
Epoch 43/500
Epoch 44/500
Epoch 45/500
Epoch 46/500
Epoch 47/500
Epoch 48/500

Epoch 48: ReduceLROnPlateau reducing learning rate to 0.007499999832361937.
Epoch 49/500
Epoch 50/500
Epoch 51/500
Epoch 52/500
Epoch 53/500
Epoch 54/500
Epoch 55/500
Epoch 56/500
Epoch 57/500
Epoch 58/500
Restoring model weights from the end of the best epoch: 38.

Epoch 58: ReduceLROnPlateau reducing learning rate to 0.005624999874271452.
Epoch 58: early stopping
Epoch 1/500
Epoc