# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 03: Training Pipeline</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/bitcoin/3_bitcoin_training_pipeline.ipynb)

<span style="font-width:bold; font-size: 1.4rem;">This is the third part of the advanced tutorial series on the Hopsworks Feature Store. This notebook explains how to read from feature groups, create a training dataset within the feature store, train a model, register it in the Hopsworks Model Registry, and then use it for batch predictions.</span>

## 🗒️ This notebook is divided into the following sections: 

1. Fetch Feature Groups.
2. Define Transformation functions.
3. Create Feature Views.
4. Create Training Dataset with training, validation and test splits.
5. Train the model.
6. Register model in Hopsworks model registry.
7. Online model deployment.

![part2](../../images/02_training-dataset.png) 

### <span style="color:#ff5f27;"> 📝 Imports</span>

In [None]:
!pip install -U hopsworks --quiet

In [None]:
import time
from datetime import datetime  # the cells below call datetime.today()
import os

import tensorflow as tf
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

In [None]:
from __future__ import print_function

%config InlineBackend.figure_format='retina'
%matplotlib inline

# The %config line above already selects retina output, so no extra call is needed.
import inspect 

import warnings
warnings.filterwarnings('ignore')

---

## <span style="color:#ff5f27;"> 📡 Connecting to the Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

In [None]:
btc_price_fg = fs.get_or_create_feature_group(
    name='bitcoin_price',
    version=1,
)
# btc_price_fg.show(3)

In [None]:
tweets_textblob_fg = fs.get_or_create_feature_group(
    name='bitcoin_tweets_textblob',
    version=1,
)

In [None]:
tweets_vader_fg = fs.get_or_create_feature_group(
    name='bitcoin_tweets_vader',
    version=1,
)
# tweets_vader_fg.show(3)

--- 

## <span style="color:#ff5f27;"> 🖍 Feature View Creation and Retrieving </span>

In [None]:
# Query Preparation
query = btc_price_fg.select_except(["date"]) \
               .join(tweets_textblob_fg.select(["subjectivity","polarity"])) \
               .join(tweets_vader_fg.select(["compound"]))

final_df = query.read()
final_df

In [None]:
final_df.shape

In [None]:
columns_to_transform = final_df.columns
columns_to_transform = columns_to_transform.tolist()
columns_to_transform.remove("unix")

In [None]:
# Load the transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")

# Map features to transformation functions.
transformation_functions = {col: min_max_scaler for col in columns_to_transform}

In [None]:
transformation_functions

In [None]:
feature_view = fs.get_or_create_feature_view(
    name='bitcoin_feature_view',
    version=1,
    transformation_functions=transformation_functions,
    query=query,
)

---

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query whose projection (the set of features) is determined by the parent FeatureView, with an optional on-disk snapshot of the data returned by the query.

**A training dataset may contain splits such as:** 
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

To create the training dataset you will use the `FeatureView.train_test_split()` method.

Here are some important things to know:

- The training dataset inherits the name of the FeatureView.

- The feature store currently supports the following data formats for training datasets: **tfrecord, csv, tsv, parquet, avro, orc**.

- You can choose the format with the **data_format** parameter.

- Use **start_time** and **end_time** to restrict the dataset to a specific time range.

- You can create **train, test** splits using `create_train_test_split()`.

- You can create **train, validation, test** splits using `train_validation_test_split()`.

- For random splits you only need to specify the desired split ratio; for time-based splits, as in the cell below, you pass the start and end of each split instead.
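
The time-based split used in the next cell can be sketched without the feature store. The following is a minimal plain-Python illustration, not the Hopsworks implementation: the helper names `percentile` and `time_split_bounds` and the synthetic daily timestamps are assumptions made for this example. It computes the same four boundaries that are passed to `train_test_split()` (minimum, 80th percentile, 81st percentile, maximum).

```python
def percentile(values, q):
    """Percentile with linear interpolation (the rule NumPy uses by default)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def time_split_bounds(timestamps, train_pct=80, test_pct=81):
    """Return the start and end timestamps of the train and test splits."""
    return {
        "train_start": int(min(timestamps)),
        "train_end": int(percentile(timestamps, train_pct)),
        "test_start": int(percentile(timestamps, test_pct)),
        "test_end": int(max(timestamps)),
    }


# Hypothetical data: 101 daily timestamps in milliseconds.
DAY_MS = 86_400_000
timestamps = [day * DAY_MS for day in range(101)]

bounds = time_split_bounds(timestamps)
print(bounds)
```

Because the training split ends at the 80th percentile and the test split starts at the 81st, the rows in between belong to neither split.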

In [None]:
# The split boundaries accept several datetime formats; here they are Unix timestamps.
X_train, X_test, y_train, y_test = feature_view.train_test_split(
    train_start=int(final_df.unix.min()),
    train_end=int(np.percentile(final_df.unix, 80)), # timestamp at the 80th percentile
    test_start=int(np.percentile(final_df.unix, 81)), # timestamp at the 81st percentile
    test_end=int(final_df.unix.max()),
)

In [None]:
X_train

In [None]:
X_train.shape, X_test.shape

In [None]:
# Let's remove the redundant "unix" column.
X_train.drop(columns=["unix"], inplace=True)
X_test.drop(columns=["unix"], inplace=True)

In [None]:
# No label was set on the feature view, so take the (scaled) "close" column as the target.
y_train = X_train[["close"]]
y_test = X_test[["close"]]

---

## <span style="color:#ff5f27;">🤖 Time series model</span>

In [None]:
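# Define a tf.data pipeline, since we are going to train a Keras (TensorFlow) model.

def windowed_dataset(dataset, target, window_size, batch_size):
    # Slide a window over the feature rows and keep only the last row of each window.
    ds = dataset.window(window_size, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda x: x.batch(window_size))
    # 33 is assumed to match the number of feature columns in X_train.
    ds = ds.map(lambda window: tf.reshape(window[-1:], [-1, 33]))
        
    # Window the targets the same way so that they stay aligned with the features.
    target_ds = target.window(window_size, shift=1, drop_remainder=True)
    target_ds = target_ds.flat_map(lambda window: window.batch(window_size))
    target_ds = target_ds.map(lambda window: window[-1:])
    
    # Pair features with targets, then batch (dropping an incomplete last batch) and prefetch.
    ds = tf.data.Dataset.zip((ds, target_ds))
    ds = ds.batch(batch_size, drop_remainder=True)
    ds = ds.prefetch(1)
    return ds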
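
To see what `windowed_dataset` pairs together, here is a plain-Python analogue of the windowing logic. It is only a sketch: `last_of_windows` is a hypothetical helper and lists stand in for `tf.data` datasets. It mimics `window(window_size, shift=1, drop_remainder=True)` followed by keeping the last element of each window.

```python
def last_of_windows(sequence, window_size):
    """Slide a window with shift 1 over the sequence, drop incomplete
    windows, and return the last element of every window."""
    windows = [
        sequence[i:i + window_size]
        for i in range(len(sequence) - window_size + 1)
    ]
    return [window[-1] for window in windows]


# Hypothetical toy data: four time steps with two features each.
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
targets = [10, 20, 30, 40]

# zip pairs each feature row with its target and stops at the shorter input.
pairs = list(zip(last_of_windows(features, 2), last_of_windows(targets, 2)))
for feature_row, target_value in pairs:
    print(feature_row, target_value)
```

With `window_size=2` the first row is dropped and every remaining feature row is paired with the target from the same time step. Like `zip` here, `tf.data.Dataset.zip` stops at the shorter of its inputs.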

In [None]:
training_dataset = tf.data.Dataset.from_tensor_slices(tf.cast(X_train.values, tf.float32)) 
training_target = tf.data.Dataset.from_tensor_slices(y_train.values.flatten().tolist()) 
training_dataset = training_dataset.repeat(500)
training_dataset = windowed_dataset(training_dataset, training_target, window_size=2, batch_size=16)
training_dataset

In [None]:
test_dataset = tf.data.Dataset.from_tensor_slices(tf.cast(X_test.values, tf.float32))
validation_target = tf.data.Dataset.from_tensor_slices(y_test.values.flatten().tolist()) 
training_dataset = training_dataset.repeat(500)
test_dataset = windowed_dataset(test_dataset, validation_target, window_size=2, batch_size=16)
test_dataset

In [None]:
def build_model(input_dim):
    inputs = tf.keras.layers.Input(shape=(input_dim[0],input_dim[1]))
    x = tf.keras.layers.Conv1D(filters = 128, kernel_size=1, padding='same', kernel_initializer="uniform")(inputs)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.LeakyReLU(alpha=0.2)(x)    
    x = tf.keras.layers.MaxPooling1D(pool_size=2, padding='same')(x)
    x = tf.keras.layers.Conv1D(filters = input_dim[1], kernel_size= 1,padding='same',  kernel_initializer="uniform")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.LeakyReLU(alpha=0.2)(x)    
    x = tf.keras.layers.MaxPooling1D(pool_size=2, padding='same')(x)    

    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(33, activation="relu", kernel_initializer="uniform")(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    x = tf.keras.layers.Dense(1, activation="relu", kernel_initializer="uniform")(x)
    
    model = tf.keras.Model(inputs, x)
    model.summary()
    model.compile(loss='mse',optimizer='adam',metrics=['mae'])
    return model

In [None]:
model = build_model([1, X_train.shape[1]])

In [None]:
from timeit import default_timer as timer
start = timer()
history = model.fit(
    training_dataset,
    epochs=10,
    verbose=0,
    steps_per_epoch=500,
    validation_data=test_dataset,
    validation_steps=1,                    
)
end = timer()
print(end - start)

In [None]:
history_dict = history.history
history_dict.keys()

### <span style='color:#ff5f27'>⚖️ Model Validation</span>

In [None]:
loss_values = history_dict['mae']
val_loss_values = history_dict['val_mae']

epochs = range(1, len(loss_values) + 1)
# Pass the colour only once; a format string plus a color keyword is redundant.
plt.plot(epochs, loss_values, color='blue', label='Training MAE')
plt.plot(epochs, val_loss_values, color='red', label='Validation MAE')
plt.rc('font', size = 18)
plt.title('Training and validation MAE')
plt.xlabel('Epochs')
plt.ylabel('MAE')
plt.legend()
plt.xticks(epochs)
fig = plt.gcf()
fig.set_size_inches(15,7)
plt.show()

In [None]:
y_pred_scaled = model.predict(X_test.values.reshape(-1, 1, X_test.shape[1]))
y_pred_scaled[:5]

In [None]:
import inspect 
# Recall that you applied transformation functions (here, the min-max scaler) to the features.
# Now you want to transform the values back to a human-readable format.
feature_view.init_serving(1)
td_transformation_functions = feature_view._single_vector_server._transformation_functions

y_pred = pd.DataFrame(y_pred_scaled, columns=["close"])
y_test = y_test.copy()  # work on a copy so that X_test is left untouched

for feature_name in td_transformation_functions:
    if feature_name == "close":
        td_transformation_function = td_transformation_functions[feature_name]
        sig = inspect.signature(td_transformation_function.transformation_fn)
        # Collect the parameters that carry default values (min_value and max_value).
        param_dict = {param.name: param.default for param in sig.parameters.values() if param.default is not inspect.Parameter.empty}
        if td_transformation_function.name == "min_max_scaler":
            min_value, max_value = param_dict["min_value"], param_dict["max_value"]
            # Invert x_scaled = (x - min) / (max - min).
            y_pred[feature_name] = y_pred[feature_name].map(lambda x: x * (max_value - min_value) + min_value)
            y_test[feature_name] = y_test[feature_name].map(lambda x: x * (max_value - min_value) + min_value)

In [None]:
from sklearn.metrics import mean_absolute_error

print("MAE:", mean_absolute_error(y_test, y_pred))
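
The MAE above is computed after decoding, so it is expressed in price units. The sketch below illustrates the relationship between the error on scaled values and the error on prices. The helper names `mae` and `to_price` and the min/max statistics are hypothetical and only serve this example.

```python
def mae(actual, predicted):
    """Mean absolute error of two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical min-max statistics of the "close" column.
MIN_PRICE, MAX_PRICE = 10_000.0, 60_000.0


def to_price(scaled):
    """Invert min-max scaling for a single value."""
    return scaled * (MAX_PRICE - MIN_PRICE) + MIN_PRICE


actual_scaled = [0.25, 0.5, 0.75]
predicted_scaled = [0.5, 0.5, 0.5]

mae_scaled = mae(actual_scaled, predicted_scaled)
mae_price = mae(
    [to_price(value) for value in actual_scaled],
    [to_price(value) for value in predicted_scaled],
)
print(mae_scaled, mae_price)
```

The error in price units equals the scaled error multiplied by `max - min`, which is why decoding first gives a number that can be read directly as a price difference.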

In [None]:
colors = ['darkslategrey']

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

ax.plot(y_test, 'black')
ax.plot(y_pred, 'orange')
ax.set_ylabel('$price$')
ax.set_xlabel('$time$')
ax.grid(True)
ax.legend(["actual", "pred"])

plt.grid(True)
plt.show()

---
## <span style='color:#ff5f27'>🗄 Model Registry</span>

In [None]:
# The 'bitcoin_price_model' directory will be saved to the model registry.
model_dir = "bitcoin_price_model"
os.makedirs(model_dir, exist_ok=True)

fig.savefig(model_dir + "/chart.png") 

print('Exporting trained model to: {}'.format(model_dir))
    
tf.saved_model.save(model, model_dir)

In [None]:
mr = project.get_model_registry()

# Register the validation MAE of the final epoch as the model's metric.
metrics={'loss': history_dict['val_mae'][-1]} 

today = datetime.today()
latest_date = int(time.mktime(today.timetuple()) * 1000) # convert today's datetime to a Unix timestamp in milliseconds

tf_model = mr.tensorflow.create_model(
    name="bitcoin_price_model", 
    metrics=metrics,
    input_example=[latest_date],
    description="Bitcoin daily price prediction model.",
)

tf_model.save(model_dir)
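
The input example registered with the model is a Unix timestamp in milliseconds. The cell above derives it from local time with `time.mktime`; the sketch below shows an alternative, timezone-aware conversion and its inverse. The helper names `to_unix_ms` and `from_unix_ms` are hypothetical and not part of Hopsworks.

```python
from datetime import datetime, timezone


def to_unix_ms(moment):
    """Convert a timezone-aware datetime to a Unix timestamp in milliseconds."""
    return int(moment.timestamp() * 1000)


def from_unix_ms(milliseconds):
    """Convert a Unix timestamp in milliseconds back to a UTC datetime."""
    return datetime.fromtimestamp(milliseconds / 1000, tz=timezone.utc)


example = datetime(2022, 1, 1, tzinfo=timezone.utc)
example_ms = to_unix_ms(example)
print(example_ms, from_unix_ms(example_ms).isoformat())
```

Using an explicit timezone keeps the value independent of the machine's local time settings, which matters if the timestamp has to match a key stored in the feature store.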

---
## <span style="color:#ff5f27;">🚀 Model Deployment</span>

In [None]:
%%writefile btc_model_transformer.py

import os
import hsfs
import numpy as np

class Transformer(object):
    
    def __init__(self):        
        # get feature store handle
        fs_conn = hsfs.connection()
        self.fs = fs_conn.get_feature_store()
        
        # get feature views
        self.fv = self.fs.get_feature_view("bitcoin_feature_view", 1)
        
        # initialise serving
        self.fv.init_serving(1)

    def flat2gen(self, alist):
        for item in alist:
            if isinstance(item, list):
                for subitem in item: yield subitem
            else:
                yield item
        
    def preprocess(self, inputs):
        feature_vector = self.fv.get_feature_vector({"unix": inputs["instances"][0][0]})
        feature_vector = [*feature_vector[:9], *feature_vector[10:]]
        return { "instances" :  np.array(list(self.flat2gen(feature_vector))).reshape(-1, 1, len(feature_vector)).tolist() }

    def postprocess(self, outputs):
        return outputs    
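
The shape of the payload that `preprocess` builds can be illustrated in plain Python. This is a sketch under assumptions, not the deployed transformer: `flatten_one_level` and `to_model_input` are hypothetical helpers, the feature vector is made up, and the dropped position is assumed to hold the `unix` key. It also assumes the feature vector contains no nested lists, so the result is a single instance of shape `(1, 1, n_features)`.

```python
def flatten_one_level(items):
    """Yield the items, expanding nested lists by one level (like flat2gen)."""
    for item in items:
        if isinstance(item, list):
            yield from item
        else:
            yield item


def to_model_input(feature_vector, drop_index):
    """Drop one position from the feature vector and wrap the rest as a
    single instance with one time step."""
    kept = [value for i, value in enumerate(feature_vector) if i != drop_index]
    flat = list(flatten_one_level(kept))
    return {"instances": [[flat]]}


# Hypothetical feature vector with the timestamp at index 2.
payload = to_model_input([0.1, 0.2, 1640995200000, 0.4], drop_index=2)
print(payload)
```

The nested list mirrors the `(batch, time steps, features)` layout that the model was built for with `build_model([1, X_train.shape[1]])`.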

In [None]:
from hsml.transformer import Transformer

dataset_api = project.get_dataset_api()

uploaded_file_path = dataset_api.upload("btc_model_transformer.py", "Resources", overwrite=True)
transformer_script_path = os.path.join("/Projects", project.name, uploaded_file_path)
transformer_script = Transformer(script_file=transformer_script_path)

In [None]:
# we can retrieve the model using code like this
tf_model = mr.get_model("bitcoin_price_model", version = 1)

In [None]:
deployment = tf_model.deploy(
    name="btcmodeldeployment",
    transformer=transformer_script
)

In [None]:
print("Deployment is warming up...")
time.sleep(40)

The deployment has now been registered. Let's retrieve it from Hopsworks for demonstration purposes.

In [None]:
ms = project.get_model_serving()

# get deployment object
deployment = ms.get_deployment("btcmodeldeployment")

In [None]:
print("Deployment: " + deployment.name)
deployment.describe()

To start it you need to run:

In [None]:
deployment.start(await_running=120)

For troubleshooting, you can use the `get_logs()` method:

In [None]:
deployment.get_logs()

---
## <span style="color:#ff5f27;">🔮 Predicting</span>

Let's query the deployment with the same kind of input example that we registered together with the model.

In [None]:
today = datetime.today()
latest_date = int(time.mktime(today.timetuple()) * 1000) # convert today's datetime to a Unix timestamp in milliseconds

deployment_output = deployment.predict(inputs=[latest_date])
# or deployment.predict({ "instances": [[latest_date]] })

deployment_output

In [None]:
pred_encoded = pd.DataFrame(
    deployment_output["predictions"],
    columns=["close"],
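) # Since a transformation function was applied to the 'close' column,
  # we need to provide a DataFrame with the same column name to decode it.

# Note: decode_features is not defined in this notebook and is assumed to be a helper
# that inverts the min-max scaling, as done by hand in the Model Validation section.
pred = decode_features(pred_encoded, feature_view=feature_view)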

In [None]:
pred = pred.rename(columns={"close": "predicted_price"})
pred["datetime"] = [today.strftime("%Y-%m-%d")]

In [None]:
pred

In [None]:
# # For troubleshooting, you can use the get_logs method.
# deployment.get_logs()

---
## <span style="color:#ff5f27;"> ✨ Load Batch Data of last days</span>

In [None]:
feature_view.init_batch_scoring(1)

batch_data = feature_view.get_batch_data().drop('unix',axis=1)
batch_data.head()

In [None]:
model = mr.get_model("bitcoin_price_model", version=1)
model_dir = model.download()

loaded_model = tf.saved_model.load(model_dir)
serving_function = loaded_model.signatures["serving_default"]

In [None]:
# The model expects inputs of shape (batch, 1, n_features), as in the validation step above.
predictions_batch = serving_function(
    tf.constant(
        batch_data.values.reshape(
            -1, 
            1, 
            batch_data.shape[1]), 
        tf.float32
    )
)['dense_1'].numpy()

Recall that you applied transformation functions, such as the min-max scaler, to the features.

Now you want to transform the predictions back to a human-readable format.
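
Decoding a min-max scaled value is a one-line formula. The sketch below shows the forward and inverse transformations as a round trip. The function names and the min/max statistics are hypothetical; the `decode_features` helper called in the next cell is assumed to apply the same inverse formula using the statistics stored with the feature view.

```python
def min_max_scale(value, min_value, max_value):
    """Map a value from [min_value, max_value] to [0, 1]."""
    return (value - min_value) / (max_value - min_value)


def inverse_min_max_scale(scaled, min_value, max_value):
    """Map a scaled value from [0, 1] back to [min_value, max_value]."""
    return scaled * (max_value - min_value) + min_value


# Hypothetical statistics of the "close" column.
MIN_CLOSE, MAX_CLOSE = 10_000.0, 60_000.0

scaled = min_max_scale(35_000.0, MIN_CLOSE, MAX_CLOSE)
restored = inverse_min_max_scale(scaled, MIN_CLOSE, MAX_CLOSE)
print(scaled, restored)
```

The inverse must use the same `min_value` and `max_value` that were used for scaling the training data, otherwise the decoded prices are shifted or stretched.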

In [None]:
pred_batch_encoded = pd.DataFrame(
    predictions_batch,
    columns=["close"]
) # Since a transformation function was applied to the 'close' column,
  # we need to provide a DataFrame with the same column name to decode it.

# decode_features is the same assumed helper as in the Predicting section.
pred_batch = decode_features(pred_batch_encoded, feature_view=feature_view)

In [None]:
pred_batch.tail(5)

---
## <span style="color:#ff5f27;"> ⏭️ **Next:** Part 04: Batch Inference</span>

In the following notebook you will use your model for Batch Inference.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/bitcoin/4_bitcoin_batch_inference.ipynb)