# <span style="color:#ff5f27"> 👨🏻‍🏫 PyTorch model and Sklearn Transformation Functions with Hopsworks Model Registry</span>

In this tutorial you will learn how to **register Sklearn transformation functions and a PyTorch model** in the Hopsworks Model Registry, how to **retrieve** them, and how to use them in **training and inference pipelines**.

## <span style="color:#ff5f27">🗄️ Table of Contents</span>
- [📝 Imports](#1)
- [⛳️ Feature Pipeline](#t1)
    - [💽 Loading Data](#2)
    - [🔮 Connecting to Hopsworks Feature Store](#3)
    - [🪄 Creating Feature Groups](#4)
- [⛳️ Training Pipeline](#t2)
    - [🖍 Feature View Creation](#5)
    - [👩🏻‍🔬 Data Transformation](#6)
    - [👔 Sklearn Transformation Functions](#7)
    - [🧬 Modeling](#8)
    - [💾 Saving the Model and Transformation Functions](#9)
- [⛳️ Inference Pipeline](#t3)
    - [📮 Retrieving the Model and Transformation Functions from Model Registry](#10)
    - [👨🏻‍⚖️ Batch Prediction](#11)
    - [👨🏻‍⚖️ Real-time Predictions](#12)

<a name='1'></a>
## <span style='color:#ff5f27'> 📝 Imports </span>

In [None]:
# Import necessary libraries
import pandas as pd               # For data manipulation using DataFrames
import numpy as np                # For numerical operations
import matplotlib.pyplot as plt   # For data visualization
import os                         # For operating system-related tasks
import joblib                     # For saving and loading models

import torch                      # PyTorch library for deep learning
import torch.nn as nn             # Module for creating neural networks
import torch.optim as optim       # Module for optimization algorithms

# Import specific modules from scikit-learn
from sklearn.preprocessing import StandardScaler, OneHotEncoder   # For data preprocessing
from sklearn.metrics import accuracy_score                        # For evaluating model accuracy

---
<a name='t1'></a>
# <span style="color:#ff5f27;">⛳️ Feature Pipeline </span>

In this section you will load data, create a Hopsworks feature group, and insert your dataset into it.

<a name='2'></a>
## <span style="color:#ff5f27;"> 💽 Loading Data </span>

To begin with, let's load a dataset that contains air quality measurements for different cities from 2013-01-01 to 2023-04-11.

In [None]:
# Load the data
df_original = pd.read_csv("https://repo.hops.works/dev/davit/air_quality/backfill_pm2_5_eu.csv")
df_original.head(3)

Now let's add a target variable to the DataFrame. For simplicity and demonstration purposes, you will randomly assign either 0 or 1 to each row. Because the labels are random, the model trained later serves only to demonstrate the workflow.

In [None]:
# Generate a binary target column
df_original['target'] = np.random.choice([0, 1], size=len(df_original))
df_original.head(3)
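
Since `np.random.choice` draws new labels on every run, the target column changes each time the cell is executed. If repeatable labels are useful, a seeded generator is one option. The snippet below is a minimal sketch on a small hypothetical frame (`toy_df` and its values are illustrative, not part of the tutorial data).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_original with a handful of rows
toy_df = pd.DataFrame(
    {
        "city_name": ["Amsterdam", "Berlin", "Paris", "Rome"],
        "pm2_5": [12.1, 9.4, 15.0, 11.3],
    }
)

# A seeded generator makes the random labels repeatable across runs
rng = np.random.default_rng(seed=42)
toy_df["target"] = rng.integers(low=0, high=2, size=len(toy_df))  # high is exclusive

# Re-creating the generator with the same seed reproduces the same labels
rng_again = np.random.default_rng(seed=42)
labels_again = rng_again.integers(low=0, high=2, size=len(toy_df))
```

Any fixed seed works; 42 is an arbitrary choice.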

<a name='3'></a>
## <span style="color:#ff5f27;"> 🔮 Connecting to Hopsworks Feature Store </span>

The next step is to log in to the Hopsworks platform.

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

<a name='4'></a>
## <span style="color:#ff5f27;">🪄 Creating Feature Groups</span>

Now you need to create a Feature Group and insert your dataset.

You will use the `.get_or_create_feature_group()` method of the feature store object.

You can read about **Feature Groups** [here](https://docs.hopsworks.ai/3.2/concepts/fs/feature_group/fg_overview/).

In [None]:
feature_group = fs.get_or_create_feature_group(
    name='feature_group_online',
    description='Online Feature Group',
    version=1,
    primary_key=['city_name', 'date'],
    online_enabled=True,
)    
feature_group.insert(df_original)

---
<a name='t2'></a>
# <span style="color:#ff5f27;">⛳️ Training Pipeline </span>

In the **Training Pipeline** you will create a train-test split, build the `to_df` and `transform_data` functions required for data transformation, fit a `OneHotEncoder` and a `StandardScaler`, and use them to transform the train and test splits. Then you will build a PyTorch model, fit it, and register it in the Hopsworks Model Registry together with the sklearn transformation functions.

<a name='5'></a>
## <span style="color:#ff5f27;"> 🖍 Feature View Creation</span>

In this part you will build a Query object and create a feature view.

In Hopsworks Feature Store, a Query object allows you to select specific features from a feature group.

`feature_group.select_except(['date'])` selects all columns from the feature group except for the 'date' column.

In [None]:
# Create a Query object
query = feature_group.select_except(['date'])
query.show(3)

After creating the Query object, you will create a feature view.

A feature view is a logical representation of data which can be used for real-time serving or batch processing. 

You can read more about **Feature Views** [here](https://docs.hopsworks.ai/3.2/concepts/fs/feature_view/fv_overview/).

In [None]:
feature_view = fs.get_or_create_feature_view(
    name='serving_fv',
    version=1,
    query=query,
    labels=['target'],
)

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>
The next step is to create the train-test split of your data.

Let's clarify the following parameters of the `.train_test_split()` method:

- test_size=0.1: This parameter specifies the size of the test set relative to the entire dataset. In this case, the test set will contain 10% of the data, and the train set will have the remaining 90%.

- description='Description of the dataset': A brief description provided for the train-test split dataset, explaining its purpose or any other relevant information.

In [None]:
# Create a train-test split dataset
X_train, X_test, y_train, y_test = feature_view.train_test_split(
    test_size=0.1,
    description='Description of the dataset',
)

In [None]:
X_train.head(3)

In [None]:
y_train.head(3)
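
To make the effect of `test_size=0.1` concrete, here is a small sketch that uses scikit-learn's `train_test_split` as a stand-in on a hypothetical 100-row frame. It is not the Hopsworks method used above; it only illustrates the 90/10 proportion.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with one feature and a binary target
toy = pd.DataFrame(
    {"pm2_5": range(100), "target": [i % 2 for i in range(100)]}
)

# scikit-learn stand-in for the split; 10% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["pm2_5"]], toy["target"], test_size=0.1, random_state=0
)

print(len(X_tr), len(X_te))
```

With 100 rows this yields 90 training rows and 10 test rows.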

<a name='6'></a>
## <span style="color:#ff5f27;">👩🏻‍🔬 Data Transformation</span>

For data transformation, let's create two functions: `to_df` and `transform_data`.

- The `to_df` function transforms a feature vector (or a list of feature vectors) into a pandas DataFrame.
- The `transform_data` function applies transformations to the input data using OneHotEncoder and StandardScaler.

In [None]:
def to_df(feature_vector):
    """
    Convert a feature vector or a list of feature vectors into a pandas DataFrame.

    Parameters:
        feature_vector (a list, or list of lists): 
            A feature vector or a list of feature vectors. A feature vector is 
            represented as a list containing two elements: the first 
            element corresponds to the city name (categorical feature), and the 
            second element corresponds to the PM2.5 value (numerical feature).

    Returns:
        pandas.DataFrame: A DataFrame representing the feature vector(s). 
        The DataFrame will have two columns: 'city_name' for the city names 
        and 'pm2_5' for the corresponding PM2.5 values.

    Example:
        >>> feature_vector = ['New York', 15.3]
        >>> to_df(feature_vector)
           city_name  pm2_5
        0  New York   15.3

        >>> multiple_vectors = [['New York', 15.3], ['Los Angeles', 10.7]]
        >>> to_df(multiple_vectors)
          city_name  pm2_5
        0  New York   15.3
        1  Los Angeles 10.7
    """
    
    # Check if the input is a list of feature vectors
    if isinstance(feature_vector[0], list): 
        # Separate the city names and PM2.5 values into separate lists
        city_names = [vector[0] for vector in feature_vector]
        pm2_5_values = [vector[1] for vector in feature_vector]
        
        # Create a DataFrame with 'city_name' and 'pm2_5' columns from the lists
        data = pd.DataFrame(
            {
                'city_name': city_names,
                'pm2_5': pm2_5_values,
            }
        )
        
        # Return the DataFrame representing multiple feature vectors
        return data

    # If only one feature vector is provided, create a DataFrame for it
    data = pd.DataFrame(
            {
                'city_name': [feature_vector[0]],
                'pm2_5': [feature_vector[1]],
            }
        )
    
    # Return the DataFrame representing a single feature vector
    return data
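
The two branches of `to_df` build the same kind of DataFrame, so they can also be folded into one code path. The following is a hypothetical compact variant (`to_df_compact` and `COLUMNS` are names introduced here for illustration), assuming the same feature order as above: city name first, PM2.5 second.

```python
import pandas as pd

COLUMNS = ["city_name", "pm2_5"]  # assumed feature order, as in to_df above


def to_df_compact(feature_vector):
    """Hypothetical one-path variant of to_df."""
    # Wrap a single vector so both cases become a list of rows
    if isinstance(feature_vector[0], list):
        rows = feature_vector
    else:
        rows = [feature_vector]
    return pd.DataFrame(rows, columns=COLUMNS)


single = to_df_compact(["Amsterdam", 15.3])
several = to_df_compact([["Amsterdam", 15.3], ["Berlin", 10.7]])
```

Both calls return a DataFrame with the columns `city_name` and `pm2_5`, one row per feature vector.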

In [None]:
def transform_data(data, one_hot_encoder, standard_scaler):
    """
    Apply transformations to the input data using OneHotEncoder and StandardScaler.

    Parameters:
        data (pandas.DataFrame):
            The input DataFrame containing the columns 'city_name' (categorical feature)
            and 'pm2_5' (numerical feature) to be transformed.

        one_hot_encoder (sklearn.preprocessing.OneHotEncoder):
            The fitted OneHotEncoder object used to encode the 'city_name' column into binary vectors.

        standard_scaler (sklearn.preprocessing.StandardScaler):
            The fitted StandardScaler object used to standardize the 'pm2_5' column.

    Returns:
        pandas.DataFrame:
            A new DataFrame with the 'city_name' column encoded and the 'pm2_5' column
            standardized using StandardScaler. The new DataFrame contains all the original
            columns except 'city_name', and the encoded 'city_name' columns as binary vectors.
    """
    # Transform the 'city_name' column using OneHotEncoder
    city_encoded = one_hot_encoder.transform(data[['city_name']])

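    # Create a new DataFrame with the encoded values, reusing the input index
    # so that the column-wise concatenation below aligns row by row
    encoded_df = pd.DataFrame(
        city_encoded,
        columns=one_hot_encoder.categories_[0],
        index=data.index,
    )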

    # Concatenate the encoded DataFrame with the original DataFrame
    data = pd.concat([data.drop('city_name', axis=1), encoded_df], axis=1)
    
    # Transform the 'pm2_5' column using StandardScaler
    data['pm2_5'] = standard_scaler.transform(data[['pm2_5']])

    return data
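
One detail that matters in `transform_data` is that `pd.concat(..., axis=1)` aligns rows by index label, not by position. If the input DataFrame does not have a default `0..n-1` index (for example after a shuffled split), the encoded columns need to share its index. The sketch below uses small hypothetical frames to show the difference.

```python
import pandas as pd

# Rows with a non-default index, as after a shuffled split (hypothetical values)
features = pd.DataFrame({"pm2_5": [0.3, -1.2]}, index=[7, 42])

# Encoded columns built without an index get a fresh 0..n-1 RangeIndex
encoded_default = pd.DataFrame({"Amsterdam": [1.0, 0.0], "Berlin": [0.0, 1.0]})
misaligned = pd.concat([features, encoded_default], axis=1)

# Reusing the original index keeps each encoded row next to its features
encoded_indexed = pd.DataFrame(
    {"Amsterdam": [1.0, 0.0], "Berlin": [0.0, 1.0]},
    index=features.index,
)
aligned = pd.concat([features, encoded_indexed], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

The misaligned result has four rows padded with missing values, while the aligned result keeps the two original rows intact.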

<a name='7'></a>
### <span style="color:#ff5f27;"> 👔 Sklearn Transformation Functions</span>

The next step is to create instances of the OneHotEncoder and StandardScaler transformers and fit them on the `X_train` dataset.

- The `OneHotEncoder` is used for converting categorical (discrete) features into a one-hot encoded representation. Categorical features are those that take a limited set of distinct values and are not numerical in nature. For example, if you have a feature "Color" with categories like "Red," "Blue," and "Green," the one-hot encoding will create three binary columns, one per category. If a data point belongs to the category "Red," then the "Red" column will be 1 and the other two will be 0.

- The `StandardScaler` is used for standardizing numerical features by removing the mean and scaling them to unit variance. This process ensures that the features have similar scales, which is particularly important for algorithms that are sensitive to the scale of features, like gradient descent-based methods.
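
The "Color" example and the scaling step can be tried on toy data. This is a minimal sketch with hypothetical values; it relies on the encoder's default sparse output and converts it with `.toarray()`, so it does not depend on the `sparse` / `sparse_output` parameter name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One categorical column with three distinct categories
colors = np.array([["Red"], ["Blue"], ["Green"], ["Red"]])
encoder = OneHotEncoder()
# The default output is a SciPy sparse matrix; .toarray() makes it dense
encoded = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded[0])              # one-hot row for the first sample ("Red")

# One numerical column, standardized to zero mean and unit variance
values = np.array([[1.0], [2.0], [3.0]])
scaler = StandardScaler()
scaled = scaler.fit_transform(values)
print(scaled.ravel())
```

Note that the encoder sorts the categories, so the column order is Blue, Green, Red regardless of the order in which they appear in the data.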

In [None]:
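# Create instances of the OneHotEncoder and StandardScaler.
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; on newer versions
# use OneHotEncoder(sparse_output=False) instead.
one_hot_encoder = OneHotEncoder(sparse=False)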
standard_scaler = StandardScaler()

Let's fit `one_hot_encoder` on the `city_name` column and `standard_scaler` on the `pm2_5` column. 

In [None]:
# Fit the OneHotEncoder to the 'city_name' column of the training data
one_hot_encoder.fit(X_train[['city_name']])

# Fit the StandardScaler to the 'pm2_5' column of the training data
standard_scaler.fit(X_train[['pm2_5']])

# Print a success message after the fitting process is complete
print('✅ Done!')

### <span style="color:#ff5f27;">⛳️ Train Data Transformation</span>

Now let's use the `transform_data` function to transform `X_train` and `X_test` using the fitted `OneHotEncoder` and `StandardScaler` transformers.

In [None]:
X_train_transformed = transform_data(X_train, one_hot_encoder, standard_scaler)
X_train_transformed.head(3)

### <span style="color:#ff5f27;">⛳️ Test Data Transformation</span>

In [None]:
X_test_transformed = transform_data(X_test, one_hot_encoder, standard_scaler)
X_test_transformed.head(3)

<a name='8'></a>
## <span style="color:#ff5f27;">🧬 Modeling</span>

In the Modeling part, you will build a PyTorch binary classification model and fit it on the transformed `X_train` dataset.

In addition, let's create the `to_tensor` function to **transform a pandas DataFrame** into a **PyTorch tensor**.

In [None]:
def to_tensor(dataframe):
    """
    Convert a pandas DataFrame to a PyTorch tensor.

    Parameters:
        dataframe (pandas.DataFrame):
            The input DataFrame to be converted to a tensor.

    Returns:
        torch.Tensor:
            A PyTorch tensor containing the values from the input DataFrame.
            The data type of the tensor is torch.float32.
    """
    return torch.tensor(dataframe.values, dtype=torch.float32)

Let's convert the `X_train_transformed` and `y_train` data into PyTorch tensors using the `to_tensor` function.

In [None]:
# Convert data to PyTorch tensors
X_train_transformed_tensor = to_tensor(X_train_transformed)
y_train_tensor = to_tensor(y_train)

# Show the first observation
X_train_transformed_tensor[0]

Let's define a custom PyTorch model class `BinaryClassificationModel`. 

In [None]:
class BinaryClassificationModel(nn.Module):
    def __init__(self, input_dim):
        super(BinaryClassificationModel, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        return x

In [None]:
# Get the number of features (input dimensions) from the preprocessed training data
input_dim = X_train_transformed_tensor.shape[1]

# Create an instance of the BinaryClassificationModel
model = BinaryClassificationModel(input_dim)

# Define the binary cross-entropy loss function
criterion = nn.BCELoss()

# Define the Adam optimizer with a learning rate of 0.005
optimizer = optim.Adam(model.parameters(), lr=0.005)
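
For reference, with its default mean reduction `nn.BCELoss` computes the following quantity over a batch of $N$ rows, where $y_i$ is the 0/1 target and $\hat{y}_i$ is the sigmoid output of the model:

$$
\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log \hat{y}_i + (1 - y_i)\log\big(1 - \hat{y}_i\big) \Big]
$$

A model that outputs $\hat{y}_i = 0.5$ for every row has a loss of $\ln 2 \approx 0.693$. Since the target in this tutorial is assigned at random, a training loss close to that value is to be expected.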

Now you are ready to train your model.

In [None]:
num_epochs = 5
batch_size = 32
num_batches = len(X_train_transformed_tensor) // batch_size

for epoch in range(num_epochs):
    for i in range(num_batches):
        # Prepare mini-batches
        start_idx = i * batch_size
        end_idx = start_idx + batch_size
        batch_X, batch_y = X_train_transformed_tensor[start_idx:end_idx], y_train_tensor[start_idx:end_idx]

        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y.view(-1, 1))

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Print training progress at the end of each epoch
        # (independent of the dataset size, unlike a hard-coded step count)
        if (i + 1) == num_batches:
            print(f'Epoch [{epoch + 1}/{num_epochs}], Step [{i + 1}/{num_batches}], Loss: {loss.item():.4f}')
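
Because `num_batches` is computed with floor division, the rows left over after the last full batch are not used for training. The sketch below reproduces that index arithmetic in plain Python and shows one way to keep the remainder; `batch_bounds` is a hypothetical helper introduced here for illustration.

```python
import math


def batch_bounds(num_rows, batch_size, keep_remainder=False):
    """Return (start, end) index pairs for consecutive mini-batches."""
    if keep_remainder:
        num_batches = math.ceil(num_rows / batch_size)
    else:
        num_batches = num_rows // batch_size  # same rule as the loop above
    return [
        (i * batch_size, min((i + 1) * batch_size, num_rows))
        for i in range(num_batches)
    ]


dropped = batch_bounds(100, 32)
kept = batch_bounds(100, 32, keep_remainder=True)
print(dropped)
print(kept)
```

With 100 rows and a batch size of 32, the floor-division version covers rows 0 to 95 in three batches, and the second version adds a final shorter batch for rows 96 to 99.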

## <span style="color:#ff5f27;">🗄 Model Registry</span>

In Hopsworks, the Model Registry is a crucial component used to manage and version machine learning models. It acts as a centralized repository where trained models can be stored, tracked, and shared among team members.

By calling `project.get_model_registry()`, the code retrieves a reference to the Model Registry associated with the current Hopsworks project. This reference allows the user to interact with the Model Registry and perform operations such as registering, versioning, and accessing trained machine learning models.
With the Model Registry, data scientists and machine learning engineers can effectively collaborate, track model changes, and easily deploy the best-performing models to production environments.

In [None]:
mr = project.get_model_registry()

### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The next step is to **define input and output schema** of a machine learning model.

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_schema = Schema(X_train_transformed.values)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_schema.to_dict()

<a name='9'></a>
### <span style="color:#ff5f27;">💾 Saving the Model and Transformation Functions</span>

Now you are ready to register your model and sklearn transformation functions in the Hopsworks Model Registry.

To begin with, let's create the `torch_tf_model` model directory and save the trained model and sklearn transformation functions in this directory.

In [None]:
model_dir = "torch_tf_model"

# Create the model directory if it does not already exist
os.makedirs(model_dir, exist_ok=True)

# Save Transformation Functions
joblib.dump(one_hot_encoder, model_dir + '/one_hot_encoder.pkl')
joblib.dump(standard_scaler, model_dir + '/standard_scaler.pkl')

# Save the model
joblib.dump(model, model_dir + '/torch_classifier.pkl')
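
To check that a saved transformer behaves the same after loading, you can do a quick round trip. The following is a minimal sketch that uses a temporary directory and a toy `StandardScaler` (hypothetical values) rather than the tutorial's `model_dir`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a toy scaler on three hypothetical values
scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

# Save it to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "standard_scaler.pkl")
    joblib.dump(scaler, path)
    restored = joblib.load(path)

# The restored scaler carries the same fitted statistics
same_mean = np.allclose(scaler.mean_, restored.mean_)
same_output = np.allclose(
    scaler.transform([[2.5]]), restored.transform([[2.5]])
)
print(same_mean, same_output)
```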

To register your model in the Hopsworks Model Registry, you can use the `.create_model()` method with the following parameters:

- name="torch_model": The name of the model.

- metrics={"Accuracy": accuracy}: An optional dictionary of performance metrics, here with "Accuracy" as the key and an accuracy score measured on the test data as the value. This tutorial does not compute an accuracy score, so the call below omits this parameter.

- description="PyTorch model": A brief description of the model.

- input_example=X_train.sample(): An example input from the training data (X_train) is used to demonstrate the expected format of the model's input data. It is randomly sampled from X_train.

- model_schema=model_schema: The model schema, which represents the data input and output structure of the model, is specified using the previously defined model_schema.

In [None]:
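# Create a model in the model registry.
# A separate name keeps the trained PyTorch `model` object from being overwritten.
registry_model = mr.torch.create_model(
    name="torch_model",
    description="PyTorch model",
    input_example=X_train.sample(),
    model_schema=model_schema,
)

registry_model.save(model_dir)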
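
If you want to attach the `metrics` dictionary described in the parameter list above, the sigmoid outputs first have to be turned into class labels. Here is a minimal sketch with hypothetical probabilities and labels (not results from this tutorial's model), assuming a 0.5 decision threshold.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical sigmoid outputs and true labels for four test rows
probabilities = np.array([0.91, 0.12, 0.55, 0.40])
y_true = np.array([1, 0, 0, 0])

# Threshold at 0.5 to turn probabilities into class labels
y_pred = (probabilities >= 0.5).astype(int)

# Fraction of rows where the predicted label matches the true label
accuracy = accuracy_score(y_true, y_pred)
metrics = {"Accuracy": accuracy}
print(metrics)
```

Three of the four toy predictions match, so the accuracy here is 0.75.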

---
<a name='t3'></a>
# <span style="color:#ff5f27;">⛳️ Inference Pipeline </span>

In the **Inference Pipeline** section, you will retrieve your model from Hopsworks Model Registry and utilize this model to make predictions on both Batch Data and Online Feature Vectors.

<a name='10'></a>
## <span style="color:#ff5f27;"> 📮 Retrieving the Model and Transformation Functions from Model Registry </span>

To retrieve a previously registered machine learning model from the Hopsworks Model Registry, you need to use the `.get_model()` method with the following parameters:

- name="torch_model": The name of the model to be retrieved.

- version=1: The version number of the model to be retrieved.

Then you will download the model and transformation functions from the Model Registry.

In [None]:
# Retrieve your model from the model registry
retrieved_model = mr.get_model(
    name="torch_model",
    version=1
)
saved_model_dir = retrieved_model.download()

In [None]:
# Retrieve the PyTorch model
retrieved_torch_model = joblib.load(saved_model_dir + "/torch_classifier.pkl")

# Retrieve Transformation Functions
one_hot_encoder = joblib.load(saved_model_dir + "/one_hot_encoder.pkl")
standard_scaler = joblib.load(saved_model_dir + "/standard_scaler.pkl")

<a name='11'></a>
## <span style="color:#ff5f27;"> 👨🏻‍⚖️ Batch Prediction </span>

Batch prediction is a process in which a trained machine learning model is used to make predictions on a large set of data all at once.

To retrieve batch data from the feature view, you first need to call the `init_batch_scoring` method of the feature view object.

The `training_dataset_version` parameter specifies the version number of the training dataset that will be used for scoring.

Then you can use the `.get_batch_data()` method to retrieve batch data.

In [None]:
# Initialise feature view to retrieve batch data.
# Assumption: the train-test split created above is training dataset version 1
td_version = 1
feature_view.init_batch_scoring(training_dataset_version=td_version)

# Retrieve batch data
batch_data = feature_view.get_batch_data()
batch_data.head(3)

The `transform_data` function applies the same transformations to `batch_data` as were applied to the training data during the preprocessing phase. This ensures that the batch data has the same format and scale as the data used to train the model.

In [None]:
# Apply transformations to the batch data using transform_data function
batch_data_transformed = transform_data(batch_data, one_hot_encoder, standard_scaler)
batch_data_transformed.head(3)

Now let's use the retrieved model to make predictions on the batch data.

In [None]:
# Predict batch data using retrieved model (no gradient tracking needed at inference)
retrieved_torch_model.eval()
with torch.no_grad():
    predictions_batch = retrieved_torch_model(to_tensor(batch_data_transformed))
predictions_batch[:10]
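
The model returns sigmoid probabilities, not class labels. At inference time it is also common to switch the model to evaluation mode and to disable gradient tracking. The sketch below shows both steps on a hypothetical stand-in model (`toy_model` and `toy_batch` are illustrative names, not the retrieved model or the batch data), assuming a 0.5 decision threshold.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the retrieved model: 3 inputs, 1 sigmoid output
toy_model = nn.Sequential(nn.Linear(3, 1), nn.Sigmoid())
toy_batch = torch.randn(5, 3)

toy_model.eval()            # switch layers such as dropout to inference mode
with torch.no_grad():       # skip gradient tracking during inference
    probabilities = toy_model(toy_batch)

# Threshold at 0.5 to obtain 0/1 class labels
labels = (probabilities >= 0.5).int().squeeze(1)
print(probabilities.shape, labels.shape)
```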

<a name='12'></a>
## <span style="color:#ff5f27;"> 👨🏻‍⚖️ Real-time Predictions</span>

**Real-time prediction** is the process of using a trained machine learning model to make predictions on feature vector(s) in real time.

You will reuse the `to_df` function defined in the Training Pipeline, which transforms a feature vector (or a list of feature vectors) into a pandas DataFrame.

The next step is to initialize the feature view for serving and then retrieve a feature vector for the specified primary keys.

In [None]:
# Initialise feature view to retrieve feature vector
feature_view.init_serving(1)

# Retrieve a feature vector
feature_vector = feature_view.get_feature_vector(
    entry={
        "city_name": 'Amsterdam',
        "date": '2013-01-01',
    }
)
feature_vector

Let's apply the `to_df` function to transform the feature vector into a pandas DataFrame.

In [None]:
# Transform feature vector to pandas dataframe
feature_vector_df = to_df(feature_vector)
feature_vector_df

Transform `feature_vector_df` using the `transform_data` function.

In [None]:
# Apply transformations to the feature vector df using transform_data function
feature_vector_transformed = transform_data(feature_vector_df, one_hot_encoder, standard_scaler)
feature_vector_transformed.head(3)

Now you can use your model to make a prediction for the transformed feature vector.

In [None]:
# Predict transformed feature vector using retrieved model (no gradient tracking needed)
with torch.no_grad():
    prediction_feature_vector = retrieved_torch_model(to_tensor(feature_vector_transformed))
prediction_feature_vector

In addition, you can retrieve several feature vectors at once by passing the primary keys as a list of dictionaries.

In [None]:
# Retrieve feature vectors from feature store
feature_vectors = feature_view.get_feature_vectors(
    entry=[
        {"city_name": 'Amsterdam', "date": '2013-01-01'},
        {"city_name": 'Amsterdam', "date": '2014-01-01'},
    ]
)
feature_vectors

Apply the `to_df` function to transform the feature vectors into a pandas DataFrame.

In [None]:
# Convert feature vectors to pandas dataframe
feature_vectors_df = to_df(feature_vectors)
feature_vectors_df

Transform `feature_vectors_df` using the `transform_data` function.

In [None]:
# Apply transformations to the feature vectors df using transform_data function
feature_vectors_transformed = transform_data(feature_vectors_df, one_hot_encoder, standard_scaler)
feature_vectors_transformed.head(3)

Now you can use your model to make predictions for the transformed feature vectors.

In [None]:
# Predict transformed feature vectors using retrieved model (no gradient tracking needed)
with torch.no_grad():
    prediction_feature_vectors = retrieved_torch_model(to_tensor(feature_vectors_transformed))
prediction_feature_vectors

---