# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;">**Hopsworks Feature Store** </span> <span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 03: Training Pipeline</span>


## 🗒️ This notebook is divided into 6 main sections:
1. Feature selection.
2. Feature transformations.
3. Training datasets creation.
4. Loading the training data.
5. Train the model.
6. Register model to Hopsworks model registry.

![02_training-dataset](../../images/02_training-dataset.png)

In [None]:
!pip install tensorflow --quiet

In [None]:
import inspect 
import datetime

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

In [None]:
# Retrieve feature groups
electricity_prices_fg = fs.get_feature_group(
    name='electricity_prices',
    version=1,
)

meteorological_measurements_fg = fs.get_feature_group(
    name='meteorological_measurements',
    version=1,
)

swedish_holidays_fg = fs.get_feature_group(
    name='swedish_holidays',
    version=1,
)

## <span style="color:#ff5f27;"> 🖍 Feature View Creation and Retrieval </span>

Let's start by selecting all the features you want to include for model training/inference.

In [None]:
# Select features for training data
selected_features = electricity_prices_fg.select_all()\
    .join(
    meteorological_measurements_fg\
        .select_except(["timestamp"])
    )\
    .join(
        swedish_holidays_fg.select_all()
    )\
.filter(meteorological_measurements_fg.precipitaton_type_se1.isin(['missing','Regn']))\
.filter(meteorological_measurements_fg.precipitaton_type_se2.isin(['missing','Regn']))\
.filter(meteorological_measurements_fg.precipitaton_type_se3.isin(['missing','Regn']))\
.filter(meteorological_measurements_fg.precipitaton_type_se4.isin(['missing','Regn']))

In [None]:
# Uncomment this if you would like to view your selected features
# selected_features.show(5)

### <span style="color:#ff5f27;"> 🤖 Transformation Functions</span>

Hopsworks Feature Store provides functionality to attach transformation functions to feature views and comes with built-in transformation functions such as `min_max_scaler`, `standard_scaler`, `robust_scaler` and `label_encoder`.

You will preprocess your data using *min-max scaling* on numerical features and *label encoding* on categorical features. To do this, you define a mapping between your features and transformation functions. Attaching the mapping to the feature view ensures that transformation functions such as *min-max scaling* are fitted only on the training data (and not on the validation/test data), which prevents data leakage.
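
To make the leakage point concrete, here is a minimal pandas sketch of fitting min-max statistics on the training split only and reusing them on the test split. The helper names `fit_min_max` and `apply_min_max` and the toy values are hypothetical; this illustrates the idea and is not the Hopsworks implementation.

```python
import pandas as pd


def fit_min_max(train_col):
    """Learn the scaling statistics from the training split only."""
    return float(train_col.min()), float(train_col.max())


def apply_min_max(col, stats):
    """Scale a column with previously fitted (min, max) statistics."""
    lo, hi = stats
    return (col - lo) / (hi - lo)


# Hypothetical toy prices, not taken from the feature store
train = pd.Series([10.0, 20.0, 30.0], name="price_se1")
test = pd.Series([20.0, 40.0], name="price_se1")

# Statistics come from the training data only ...
stats = fit_min_max(train)

# ... and are reused unchanged for every other split
train_scaled = apply_min_max(train, stats)
test_scaled = apply_min_max(test, stats)

print(train_scaled.tolist())
print(test_scaled.tolist())
```

Note that scaled test values can fall outside the `[0, 1]` range when the test data contains values the training data never saw. That is expected and is exactly what avoiding leakage implies.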

In [None]:
# List of price areas
price_areas = ["se1", "se2", "se3", "se4"]

# Mapping features to their respective transformation functions
mapping_transformers = {}

# Iterate through each price area and map features to their transformation functions
for area in price_areas:
    mapping_transformers[f"price_{area}"] = fs.get_transformation_function(name="min_max_scaler")
    mapping_transformers[f"mean_temp_per_day_{area}"] = fs.get_transformation_function(name="min_max_scaler")
    mapping_transformers[f"mean_wind_speed_{area}"] = fs.get_transformation_function(name="min_max_scaler")
    mapping_transformers[f"precipitaton_amount_{area}"] = fs.get_transformation_function(name="min_max_scaler")
    mapping_transformers[f"total_sunshine_time_{area}"] = fs.get_transformation_function(name="min_max_scaler")
    mapping_transformers[f"mean_cloud_perc_{area}"] = fs.get_transformation_function(name="min_max_scaler")    
    mapping_transformers[f"precipitaton_type_{area}"] = fs.get_transformation_function(name='label_encoder')

# Additional transformation for 'type_of_day'
mapping_transformers["type_of_day"] = fs.get_transformation_function(name='label_encoder')

A **Feature View** sits between **Feature Groups** and a **Training Dataset**. By combining **Feature Groups** you create a **Feature View**, which stores metadata about your data. From a **Feature View** you can then create a **Training Dataset**.

A Feature View defines a schema in the form of a query with filters, the model's target feature/label, and any additional transformation functions.

To create a Feature View, use the `FeatureStore.get_or_create_feature_view()` method.

You can specify the following parameters:

- `name` - name of the feature view.

- `version` - version of the feature view.

- `labels` - the target variable(s).

- `transformation_functions` - functions to transform the features.

- `query` - query object that selects the data.

In [None]:
feature_view = fs.get_or_create_feature_view(
    name='electricity_feature_view',
    version=1,
    labels=[],  # you will define 'y' manually later
    transformation_functions=mapping_transformers,
    query=selected_features,
)

---

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query where the projection (set of features) is determined by the parent FeatureView, with an optional snapshot on disk of the data returned by the query.

**A training dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

To create the training dataset, you will use the `FeatureView.train_validation_test_split()` method.

Here are some important things to know:

- It will inherit the name of the FeatureView.

- The feature store currently supports the following data formats for
training datasets: **tfrecord, csv, tsv, parquet, avro, orc**.

- You can choose the format with the **data_format** parameter.

- Use **start_time** and **end_time** to restrict the dataset to a specific time range.

- You can create **train, test** splits using `create_train_test_split()`.

- You can create **train, validation, test** splits using `create_train_validation_test_split()`.

- In either case you specify how to split the data: by the desired ratio of the splits or, as in this notebook, by time range.
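
As a rough illustration of what a time-based split does, the following pandas-only sketch cuts a daily frame into three non-overlapping date ranges. The toy DataFrame, the `time_slice` helper, and the inclusive bounds are assumptions of this sketch; Hopsworks performs the split on the feature view query, not on a local frame.

```python
import pandas as pd

# Hypothetical daily frame standing in for the feature view's query result
df = pd.DataFrame(
    {"timestamp": pd.date_range("2021-01-01", "2022-09-09", freq="D")}
)
df["price_se1"] = range(len(df))


def time_slice(frame, start, end):
    """Return rows whose timestamp lies in [start, end] (both inclusive)."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    mask = (frame["timestamp"] >= start) & (frame["timestamp"] <= end)
    return frame.loc[mask]


# Chronological, non-overlapping ranges: no future data reaches the model
train_df = time_slice(df, "2021-01-01", "2022-02-28")
val_df = time_slice(df, "2022-03-01", "2022-05-31")
test_df = time_slice(df, "2022-06-01", "2022-09-09")

print(len(train_df), len(val_df), len(test_df))
```

For time series, splitting chronologically rather than at random keeps the validation and test periods strictly after the training period, which matches how the model is used at inference time.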

### <span style="color:#ff5f27;"> ⛳️ Dataset with train, test and validation splits</span>

In [None]:
# Since no 'labels' were specified when creating the feature view, the returned y values are None.
X_train, X_val, X_test, _, _, _ = feature_view.train_validation_test_split(
    train_start="2021-01-01",
    train_end="2022-02-28",
    validation_start="2022-03-01",
    validation_end="2022-05-31",
    test_start="2022-06-01",
    test_end="2022-09-09",
    description='Electricity price prediction dataset',
)

In [None]:
# Sorting the training, validation, and test datasets based on the 'timestamp' column
X_train.sort_values(["timestamp"], inplace=True)
X_val.sort_values(["timestamp"], inplace=True)
X_test.sort_values(["timestamp"], inplace=True)

In [None]:
# Define 'y_train', 'y_val' and 'y_test'
y_train = X_train[["price_se1", "price_se2", "price_se3", "price_se4"]]
y_val = X_val[["price_se1", "price_se2", "price_se3", "price_se4"]]
y_test = X_test[["price_se1", "price_se2", "price_se3", "price_se4"]]

In [None]:
# Dropping the 'day' and 'timestamp' columns from the training, validation, and test datasets
X_train.drop(["day", "timestamp"], axis=1, inplace=True)
X_val.drop(["day", "timestamp"], axis=1, inplace=True)
X_test.drop(["day", "timestamp"], axis=1, inplace=True)

In [None]:
# Displaying the first 3 rows of the test dataset (X_test)
X_test.head(3)

---
## <span style="color:#ff5f27;">🗃 Window timeseries dataset </span>

In [None]:
class WindowGenerator():
    def __init__(self, input_width, label_width, shift,
                 df_train, val_df, test_df,
                 label_columns=None, batch_size=32):
        # Store the raw data.
        self.df_train = df_train
        self.val_df = val_df
        self.test_df = test_df

        # Work out the label column indices.
        self.label_columns = label_columns
        if label_columns is not None:
            self.label_columns_indices = {name: i for i, name in enumerate(label_columns)}
        self.column_indices = {name: i for i, name in enumerate(df_train.columns)}

        # Work out the window parameters.
        self.input_width = input_width
        self.label_width = label_width
        self.shift = shift

        self.total_window_size = input_width + shift

        self.input_slice = slice(0, input_width)
        self.input_indices = np.arange(self.total_window_size)[self.input_slice]

        self.label_start = self.total_window_size - self.label_width
        self.labels_slice = slice(self.label_start, None)
        self.label_indices = np.arange(self.total_window_size)[self.labels_slice]

        self.batch_size = batch_size

    def __repr__(self):
        return '\n'.join([
            f'Total window size: {self.total_window_size}',
            f'Input indices: {self.input_indices}',
            f'Label indices: {self.label_indices}',
            f'Label column name(s): {self.label_columns}'
        ])

    def split_window(self, features):
        inputs = features[:, self.input_slice, :]
        labels = features[:, self.labels_slice, :]
        if self.label_columns is not None:
            labels = tf.stack(
                [labels[:, :, self.column_indices[name]] for name in self.label_columns],
                axis=-1
            )

        # Slicing doesn't preserve static shape information, so set the shapes
        # manually. This way the `tf.data.Datasets` are easier to inspect.
        inputs.set_shape([None, self.input_width, None])
        labels.set_shape([None, self.label_width, None])

        return inputs, labels

    def plot(self, plot_col, model=None, max_subplots=3):
        inputs, labels = self.example
        plt.figure(figsize=(12, 8))
        plot_col_index = self.column_indices[plot_col]
        max_n = min(max_subplots, len(inputs))
        for n in range(max_n):
            plt.subplot(max_n, 1, n + 1)
            plt.ylabel(f'{plot_col} [normed]')
            plt.plot(self.input_indices, inputs[n, :, plot_col_index],
                     label='Inputs', marker='.', zorder=-10)

            if self.label_columns:
                label_col_index = self.label_columns_indices.get(plot_col, None)
            else:
                label_col_index = plot_col_index

            if label_col_index is None:
                continue

            plt.scatter(self.label_indices, labels[n, :, label_col_index],
                        edgecolors='k', label='Labels', c='#2ca02c', s=64)
            if model is not None:
                predictions = model(inputs)
                plt.scatter(self.label_indices, predictions[n, :, label_col_index],
                            marker='X', edgecolors='k', label='Predictions',
                            c='#ff7f0e', s=64)

            if n == 0:
                plt.legend()

        plt.xlabel('Time [days]')

    # make_dataset method will take a time series DataFrame and convert it to a tf.data.Dataset of (input_window, label_window)
    # pairs using the tf.keras.utils.timeseries_dataset_from_array function:
    def make_dataset(self, data):
        data = np.array(data, dtype=np.float32)
        ds = tf.keras.utils.timeseries_dataset_from_array(
            data=data,
            targets=None,
            sequence_length=self.total_window_size,
            sequence_stride=1,
            shuffle=False,
            batch_size=self.batch_size,
        )
        ds = ds.map(self.split_window)
        ds = ds.repeat(1000)
        ds = ds.prefetch(10)
        return ds

    @property
    def train(self):
        return self.make_dataset(self.df_train)

    @property
    def val(self):
        return self.make_dataset(self.val_df)

    @property
    def test(self):
        return self.make_dataset(self.test_df)

    @property
    def example(self):
        """Get and cache an example batch of `inputs, labels` for plotting."""
        result = getattr(self, '_example', None)
        if result is None:
            # No example batch was found, so get one from the `.test` dataset
            result = next(iter(self.test))
            # And cache it for next time
            self._example = result
        return result

In [None]:
# Creating a WindowGenerator instance
# The window represents a time series data window with an input width of 4, label width of 4, and a shift of 1
# The label columns are ["price_se1", "price_se2", "price_se3", "price_se4"]
n_step_window = WindowGenerator(
    df_train=X_train, 
    val_df=X_val, 
    test_df=X_test, 
    input_width=4, 
    label_width=4, 
    shift=1, 
    label_columns=["price_se1", "price_se2", "price_se3", "price_se4"],
)

# Displaying the WindowGenerator instance
n_step_window
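
The index arithmetic behind this window configuration can be reproduced with plain NumPy. The sketch below uses the same `input_width`, `label_width` and `shift` values as above; the 8-step toy series is hypothetical and only serves to show which time steps end up as inputs and which as labels.

```python
import numpy as np

input_width, label_width, shift = 4, 4, 1

# Same arithmetic as WindowGenerator.__init__
total_window_size = input_width + shift
input_indices = np.arange(total_window_size)[:input_width]
label_start = total_window_size - label_width
label_indices = np.arange(total_window_size)[label_start:]

# Hypothetical toy series of 8 time steps, cut into sliding windows (stride 1)
series = np.arange(8)
windows = np.stack([
    series[i:i + total_window_size]
    for i in range(len(series) - total_window_size + 1)
])

# Select the input and label time steps from every window
inputs = windows[:, input_indices]
labels = windows[:, label_indices]

print("Input indices:", input_indices)
print("Label indices:", label_indices)
print("First window inputs:", inputs[0], "labels:", labels[0])
```

With these settings the label steps are the input steps shifted one position into the future, so each input time step is paired with the prices of the following step.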

In [None]:
# Extracting an example batch of inputs and labels from the WindowGenerator instance
inputs, labels = n_step_window.example

# Displaying the shapes of the input and label tensors
print(inputs.shape)
print(labels.shape)

# Displaying the time-step indices of the labels within each window
print(n_step_window.label_indices)

In [None]:
# Iterating over the training dataset (n_step_window.train) to extract an example batch of inputs and labels
for example_inputs, example_labels in n_step_window.train.take(1):
    # Displaying the shape of the example batch of inputs
    print(f'Inputs shape (batch, time, features): {example_inputs.shape}')
    
    # Displaying the shape of the example batch of labels
    print(f'Labels shape (batch, time, features): {example_labels.shape}')

---
## <span style="color:#ff5f27;">🧬 Modeling</span>

In [None]:
def build_model(input_dim):
    # Creating a Sequential model
    model = tf.keras.models.Sequential()

    # Adding a 1D convolutional layer
    model.add(tf.keras.layers.Conv1D(filters=64, kernel_size=1, padding='same', kernel_initializer="uniform", input_shape=(input_dim[0], input_dim[1])))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.LeakyReLU(alpha=0.2))

    # Adding 1D convolutional layer
    model.add(tf.keras.layers.Conv1D(filters=32, kernel_size=1, padding='same', kernel_initializer="uniform"))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.LeakyReLU(alpha=0.2))

    # Adding 1D convolutional layer
    model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=1, padding='same', kernel_initializer="uniform"))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.LeakyReLU(alpha=0.2))
    
    # Adding a 1D max pooling layer
    model.add(tf.keras.layers.MaxPooling1D(pool_size=1, padding='same'))

    # Adding a Bidirectional LSTM layer
    model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=16, return_sequences=True))) 
    model.add(tf.keras.layers.Dropout(rate=0.1))

    # Adding a Dense layer
    model.add(tf.keras.layers.Dense(units=4))

    # Displaying the model summary
    model.summary()

    # Compiling the model with mean absolute error loss and the Adam optimizer
    model.compile(loss='mae', optimizer='adam')

    return model

In [None]:
# Building a model using the specified input shape derived from the example batch of inputs
model = build_model(inputs.shape[1:])

In [None]:
# Recording the start time
from timeit import default_timer as timer
start = timer()

# Training the model on the training dataset with specified parameters
# - Using 50 epochs
# - Displaying minimal training information (verbose=0)
# - Specifying the number of steps per epoch (steps_per_epoch=200)
# - Using the validation dataset for validation (validation_data=n_step_window.val)
# - Specifying the number of validation steps (validation_steps=1)
history = model.fit(
    n_step_window.train,
    epochs=50,
    verbose=0,
    steps_per_epoch=200,
    validation_data=n_step_window.val,
    validation_steps=1,
)

# Recording the end time
end = timer()

# Displaying the time taken for training
print(end - start)

In [None]:
# Extracting an example batch of inputs and labels from the WindowGenerator instance (n_step_window)
inputs, labels = n_step_window.example

# Making predictions on the example batch using the trained model
prediction_test = model.predict(inputs)

# Displaying the shapes of the predicted outputs and the true labels
print(prediction_test.shape)
print(labels.shape)

In [None]:
# Extracting the training history dictionary from the model training
history_dict = history.history

# Displaying the keys in the history dictionary
print(history_dict.keys())

In [None]:
# Extracting training and validation loss values from the history dictionary
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

# Creating separate variables for loss values (50 epochs)
loss_values50 = loss_values
val_loss_values50 = val_loss_values

# Generating a plot for training and validation loss over epochs
epochs = range(1, len(loss_values50) + 1)
plt.plot(epochs, loss_values50, color='blue', label='Training loss')
plt.plot(epochs, val_loss_values50, color='red', label='Validation loss')

# Setting plot details and labels
plt.rc('font', size=18)
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.xticks(epochs)

# Adjusting the size of the plot
fig = plt.gcf()
fig.set_size_inches(15, 7)

# Displaying the plot
plt.show()

In [None]:
# Plotting the time series data for the 'price_se4' column
n_step_window.plot(
    plot_col="price_se4", 
    max_subplots=3, 
    model=model.predict,
)

In [None]:
# Extracting actual values for each time step in the example batch
se1_actual = []
se2_actual = []
se3_actual = []
se4_actual = []

# Extracting inputs and labels from the example batch of the WindowGenerator instance (n_step_window)
inputs, labels = n_step_window.example

# Iterating over the windows in the batch and the time steps in each window to collect actual values
for batch_n in range(len(labels)):
    batch = labels[batch_n]
    for window_n in range(4):
        se1_actual.append(batch[window_n][0].numpy())
        se2_actual.append(batch[window_n][1].numpy())
        se3_actual.append(batch[window_n][2].numpy())
        se4_actual.append(batch[window_n][3].numpy())

In [None]:
# Extracting predicted values for each time step in the example batch using the trained model
se1_pred = []
se2_pred = []
se3_pred = []
se4_pred = []

# Making predictions on the example batch using the trained model
prediction_test = model.predict(inputs)

# Iterating over the windows in the batch and the time steps in each window to collect predicted values
for batch_n in range(len(prediction_test)):
    batch = prediction_test[batch_n]
    for window_n in range(4):
        se1_pred.append(batch[window_n][0])
        se2_pred.append(batch[window_n][1])
        se3_pred.append(batch[window_n][2])
        se4_pred.append(batch[window_n][3])
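
The nested loops in the two cells above flatten the window and time-step axes into one list per price area. A minimal NumPy sketch of the same operation is shown below; the random `preds` array is a hypothetical stand-in for `prediction_test`, with shape (windows, time steps, price areas).

```python
import numpy as np

# Hypothetical stand-in for `prediction_test`: (windows, time steps, price areas)
rng = np.random.default_rng(0)
preds = rng.random((3, 4, 4))

# Loop version, mirroring the cells above for the first price area
se1_loop = []
for window in preds:
    for step in range(preds.shape[1]):
        se1_loop.append(window[step][0])

# Vectorised version: merge the window and time-step axes, then pick a column
flat = preds.reshape(-1, preds.shape[-1])
se1_vec = flat[:, 0]

print(np.allclose(se1_loop, se1_vec))
```

Because the windows are generated with a stride of 1, consecutive windows overlap, so the flattened series repeats time steps instead of forming one continuous timeline. Keep this in mind when reading the plots that follow.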

In [None]:
# Displaying the first 10 rows of the true labels (y_test)
y_test.head(10)

In [None]:
# Displaying the value at index 1 in the list 'se3_actual'
se3_actual[1]

In [None]:
# Plotting the predicted and actual values for "SE1" prices over time
plt.plot(se1_pred, color='red', label='Test SE1 price prediction')
plt.plot(se1_actual, color='blue', label='Test actual')
plt.xlabel('Time')
plt.ylabel('Price (scaled)')
plt.legend(loc='upper left')

# Adjusting the size of the plot
fig = plt.gcf()
fig.set_size_inches(15, 5)

# Displaying the plot
plt.show()

In [None]:
# Plotting the predicted and actual values for "SE2" prices over time
plt.plot(se2_pred, color='red', label='Test SE2 price prediction')
plt.plot(se2_actual, color='blue', label='Test actual')
plt.xlabel('Time')
plt.ylabel('Price (scaled)')
plt.legend(loc='upper left')

# Adjusting the size of the plot
fig = plt.gcf()
fig.set_size_inches(15, 5)

# Displaying the plot
plt.show()

In [None]:
# Plotting the predicted and actual values for "SE3" prices over time
plt.plot(se3_pred, color='red', label='Test SE3 price prediction')
plt.plot(se3_actual, color='blue', label='Test actual')
plt.xlabel('Time')
plt.ylabel('Price (scaled)')
plt.legend(loc='upper left')

# Adjusting the size of the plot
fig = plt.gcf()
fig.set_size_inches(15, 5)

# Displaying the plot
plt.show()

In [None]:
# Plotting the predicted and actual values for "SE4" prices over time
plt.plot(se4_pred, color='red', label='Test SE4 price prediction')
plt.plot(se4_actual, color='blue', label='Test actual')
plt.xlabel('Time')
plt.ylabel('Price (scaled)')
plt.legend(loc='upper left')

# Adjusting the size of the plot
fig = plt.gcf()
fig.set_size_inches(15, 5)

# Displaying the plot
plt.show()

---
## <span style='color:#ff5f27'>🗄 Model Registry</span>

One of the features in Hopsworks is the model registry. This is where you can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

In [None]:
# Exporting the trained model to a directory
model_dir = "electricity_price_model"
print('Exporting trained model to: {}'.format(model_dir))

# Saving the model using TensorFlow's saved_model.save function
tf.saved_model.save(model, model_dir)

In [None]:
# Retrieving the Model Registry
mr = project.get_model_registry()

# Using the validation loss from the final epoch as the registered metric
metrics = {'loss': history_dict['val_loss'][-1]}

# Creating a TensorFlow model in the Model Registry
tf_model = mr.tensorflow.create_model(
    name="electricity_price_prediction_model",
    metrics=metrics,
    description="Daily electricity price prediction model.",
    input_example=n_step_window.example[0].numpy(),
)

# Uploading the exported model directory to the Model Registry
tf_model.save(model_dir)

## <span style="color:#ff5f27;">⏭️ **Next:** Part 04: Batch Inference </span>

In the next notebook you will use your registered model to predict batch data.
