# <span style="color:#ff5f27"> 👨🏻‍🏫 Custom Transformation Functions Registration</span>

In this tutorial you will learn how to **write custom transformation functions for a feature view**, **register an XGBoost model** in the Hopsworks Model Registry, and use the retrieved model in **training and inference pipelines**.

## <span style="color:#ff5f27">🗄️ Table of Contents</span>
- [📝 Imports](#1)
- [⛳️ Feature Pipeline](#t1)
    - [💽 Loading Data](#2)
    - [🔮 Connecting to Hopsworks Feature Store](#3)
    - [🪄 Creating Feature Groups](#4)
- [⛳️ Training Pipeline](#t2)
    - [👩🏻‍🔬 Custom Transformation Functions](#12)
    - [✍🏻 Registering Custom Transformation Functions in Hopsworks](#5)
    - [🖍 Feature View Creation](#6)
    - [🧬 Modeling](#7)
    - [💾 Saving the Model in the Model Registry](#8)
- [⛳️ Inference Pipeline](#t3)
    - [📮 Retrieving the Model from the Model Registry](#9)
    - [👨🏻‍⚖️ Batch Prediction](#10)
    - [👨🏻‍⚖️ Real-time Predictions](#11)

<a name='1'></a>
## <span style='color:#ff5f27'> 📝 Imports </span>

In [None]:
# Importing necessary libraries
import pandas as pd        # For data manipulation and analysis using DataFrames
import numpy as np         # For numerical computations and arrays
import os                  # For operating system-related functions
import joblib              # For saving and loading model files

import xgboost as xgb      # For using the XGBoost machine learning library
from sklearn.metrics import accuracy_score  # For evaluating model accuracy

---
<a name='t1'></a>
# <span style="color:#ff5f27;">⛳️ Feature Pipeline </span>

In this section you will load the data, create a Hopsworks feature group, and insert the dataset into it.

<a name='2'></a>
## <span style="color:#ff5f27;"> 💽 Loading Data </span>

To begin with, let's load a dataset that contains air quality measurements for different cities from 2013-01-01 to 2023-04-11.

In [None]:
# Load the data
df_original = pd.read_csv("https://repo.hops.works/dev/davit/air_quality/backfill_pm2_5_eu.csv")
df_original.head(3)

Now let's add a target variable to the DataFrame. For simplicity and for demonstration purposes you will randomly assign either a 0 or a 1 to each row.

In [None]:
# Generate a binary target column
df_original['target'] = np.random.choice([0, 1], size=len(df_original))
df_original.head(3)
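
Because `np.random.choice` draws from NumPy's global random state, the labels differ on every run. If you need repeatable labels, a seeded generator is one option. The sketch below uses a toy DataFrame as a stand-in for `df_original`; the toy rows and the seed value are illustrative assumptions, not part of the tutorial data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_original (assumed values, for illustration only)
toy_df = pd.DataFrame({
    "city_name": ["Amsterdam", "Madrid", "Paris", "Berlin"],
    "pm2_5": [12.0, 13.0, 9.5, 20.1],
})

# A seeded generator produces the same labels on every run (the seed is arbitrary)
rng = np.random.default_rng(42)

# integers(0, 2) draws from {0, 1}; the upper bound is exclusive
toy_df["target"] = rng.integers(0, 2, size=len(toy_df))

print(toy_df)
```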

<a name='3'></a>
## <span style="color:#ff5f27;"> 🔮 Connecting to Hopsworks Feature Store </span>

The next step is to log in to the Hopsworks platform.

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

<a name='4'></a>
## <span style="color:#ff5f27;">🪄 Creating Feature Groups</span>

Now you need to create a feature group and insert your dataset into it.

You will use the `.get_or_create_feature_group()` method of the feature store object.

You can read about **Feature Groups** [here](https://docs.hopsworks.ai/3.2/concepts/fs/feature_group/fg_overview/).

In [None]:
feature_group = fs.get_or_create_feature_group(
    name='feature_group_online',
    description='Online Feature Group',
    version=1,
    primary_key=['city_name', 'date'],
    online_enabled=True,
)    
feature_group.insert(df_original)
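
The primary key (`city_name`, `date`) is meant to identify each row. A quick pandas check before inserting can surface rows that share a key. This is a sketch on toy data, and the helper name `find_duplicate_keys` is hypothetical, not part of the Hopsworks API.

```python
import pandas as pd

def find_duplicate_keys(df, key_columns):
    """Return the rows whose key combination appears more than once."""
    # keep=False marks every member of a duplicated group, not just the repeats
    mask = df.duplicated(subset=key_columns, keep=False)
    return df[mask]

# Toy data (assumed values) with one repeated (city_name, date) pair
toy_df = pd.DataFrame({
    "city_name": ["Amsterdam", "Amsterdam", "Madrid", "Amsterdam"],
    "date": ["2013-01-01", "2013-01-02", "2013-01-01", "2013-01-01"],
    "pm2_5": [12.0, 14.0, 13.0, 12.5],
})

duplicates = find_duplicate_keys(toy_df, ["city_name", "date"])
print(duplicates)  # rows 0 and 3 share the same key
```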

---
<a name='t2'></a>
# <span style="color:#ff5f27;">⛳️ Training Pipeline </span>

In the **Training Pipeline** you will register custom transformation functions in the Hopsworks Feature Store, apply custom transformation functions to specific columns in the feature view, create a train-test split and train the XGBClassifier. Then you will register your trained model in the Hopsworks Model Registry.

<a name='12'></a>
## <span style="color:#ff5f27;">👩🏻‍🔬 Custom Transformation Functions</span>

The `transformations.py` file contains the custom `encode_city_name` and `scale_pm2_5` transformation functions. You will register them in the Hopsworks Feature Store and then attach them to a feature view when you create it, so that they are applied to the data.

Let's import them and see how they work.

If you are running on Hopsworks, custom transformation functions need to be registered in the feature store to make them accessible for feature view creation. To register them, they must either be part of a library installed in Hopsworks or be attached when starting a Jupyter notebook or Hopsworks job.

Uncomment the next cell to download the `transformations.py` file with the custom transformation functions.

In [None]:
#!wget https://raw.githubusercontent.com/logicalclocks/hopsworks-tutorials/master/advanced_tutorials/transformation_functions/custom/transformations.py

In [None]:
from transformations import encode_city_name, scale_pm2_5

In [None]:
city_name = 'Madrid'
encoded_city_name = encode_city_name(city_name)
print("⛳️ Encoded City Name:", encoded_city_name)  # Output: Encoded City Name: 0

In [None]:
pm2_5_value = 13.0
scaled_pm2_5 = scale_pm2_5(pm2_5_value)
print("⛳️ Scaled PM2.5 Value:", scaled_pm2_5)  # Output: Scaled PM2.5 Value: 0.0
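
The contents of `transformations.py` are not shown in this notebook. As a rough idea of the shape such functions can take (plain Python functions that map one value to one value), here is a hypothetical sketch. The city mapping, the scaling bounds, and the `_sketch` function names are all assumptions and may differ from the real file.

```python
# Hypothetical stand-ins; the real transformations.py may differ
CITY_CODES = {"Madrid": 0, "Amsterdam": 1, "Paris": 2}  # assumed mapping
PM2_5_MIN, PM2_5_MAX = 0.0, 100.0                       # assumed scaling bounds

def encode_city_name_sketch(city_name):
    """Map a city name to an integer code, or -1 for an unknown city."""
    return CITY_CODES.get(city_name, -1)

def scale_pm2_5_sketch(pm2_5):
    """Min-max scale a PM2.5 value using the assumed bounds."""
    return (pm2_5 - PM2_5_MIN) / (PM2_5_MAX - PM2_5_MIN)

print(encode_city_name_sketch("Madrid"))
print(scale_pm2_5_sketch(50.0))
```

Each function takes a single feature value and returns a single value of a fixed type, which fits the `output_type` argument used during registration.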

<a name='5'></a>
## <span style="color:#ff5f27;"> ✍🏻 Registering Custom Transformation Functions in Hopsworks</span>

The next step is to **register custom transformation functions** in Hopsworks Feature Store.

You can check the existing transformation functions in the feature store using the `.get_transformation_functions()` method.

In [None]:
# Check existing transformation functions
fns = [fn.name for fn in fs.get_transformation_functions()]
fns

You can register your transformation function using the `.create_transformation_function()` method with the following parameters:

- `transformation_function` - your custom transformation function.

- `output_type` - the Python or NumPy type of the function's output, which is mapped to a `pyspark.sql.types` type.

- `version` - the version of your custom transformation function.

Then don't forget to call the `.save()` method to persist the transformation function in the backend.

In [None]:
# Register encode_city_name in Hopsworks
if "encode_city_name" not in fns:
    encoder = fs.create_transformation_function(
        encode_city_name, 
        output_type=int,
        version=1,
    )
    encoder.save()
    
# Register scale_pm2_5 in Hopsworks
if "scale_pm2_5" not in fns:
    scaler = fs.create_transformation_function(
        scale_pm2_5, 
        output_type=float,
        version=1,
    )
    scaler.save()

Now let's check if your custom transformation functions are present in the feature store.

In [None]:
# Check if your transformation functions are present in the feature store
fns = [fn.name for fn in fs.get_transformation_functions()]
fns

<a name='6'></a>
## <span style="color:#ff5f27;"> 🖍 Feature View Creation</span>

In this part you will retrieve your custom transformation functions from the feature store, build a Query object and create a feature view.

To retrieve a custom transformation function, use the `.get_transformation_function()` method and specify the **name** and **version** of the required transformation function.

In [None]:
# Retrieve encode_city_name transformation function
encoder = fs.get_transformation_function(
    name="encode_city_name",
    version=1
)

# Retrieve scale_pm2_5 transformation function
scaler = fs.get_transformation_function(
    name="scale_pm2_5",
    version=1
)

In Hopsworks Feature Store, a Query object allows you to select specific features from a feature group.

`feature_group.select_except(['date'])` selects all columns from the feature group except for the 'date' column.

In [None]:
# Build a Query object
query = feature_group.select_except(['date'])
query.show(3)

After creating the Query object, you will create a feature view.

A feature view is a logical representation of data which can be used for real-time serving or batch processing. 

You can read more about **Feature Views** [here](https://docs.hopsworks.ai/3.2/concepts/fs/feature_view/fv_overview/).

In [None]:
# Get or create a feature view
feature_view = fs.get_or_create_feature_view(
    name='serving_fv',
    version=1,
    query=query,
    # Apply your custom transformation functions to necessary columns
    transformation_functions={
        "city_name": encoder,
        "pm2_5": scaler,
    },
    labels=['target'],
)

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>
The next step is to create the train-test split of your data.

Let's clarify the parameters of the `.create_train_test_split()` method:

- `test_size=0.1`: the size of the test set relative to the entire dataset. Here the test set contains 10% of the data and the train set the remaining 90%.

- `description='Description of the dataset'`: a brief description of the train-test split dataset, explaining its purpose or any other relevant information.

- `data_format='csv'`: the format in which the train-test split dataset is stored. Here the dataset is saved in CSV format.

In [None]:
# Create a train-test split dataset
td_version, job = feature_view.create_train_test_split(
    test_size=0.1,
    description='Description of the dataset',
    data_format='csv'
)
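
To make the effect of `test_size=0.1` concrete, the sketch below splits a toy DataFrame 90/10 with plain pandas. It only illustrates the proportions; it is not how Hopsworks performs the split internally, and the toy data and `random_state` value are assumptions.

```python
import pandas as pd

# Toy data: 100 rows with an alternating binary target (assumed values)
toy_df = pd.DataFrame({
    "pm2_5": range(100),
    "target": [i % 2 for i in range(100)],
})

test_size = 0.1

# Sample 10% of the rows for the test set; the remaining rows form the train set
test_df = toy_df.sample(frac=test_size, random_state=0)
train_df = toy_df.drop(test_df.index)

print(len(train_df), len(test_df))  # 90 10
```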

### <span style="color:#ff5f27;">🪝 Training Dataset Retrieval</span>

To retrieve your train-test split, use the `.get_train_test_split()` method of the feature view object.

The parameter `training_dataset_version` specifies the version number of the train-test split dataset to retrieve. 

`td_version` is the version number that was obtained when the train-test split dataset was created in a previous step.

In [None]:
# Retrieve the train-test split
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(
    training_dataset_version=td_version
)

In [None]:
X_train.head(3)

In [None]:
y_train.head(3)

<a name='7'></a>
## <span style="color:#ff5f27;">🧬 Modeling</span>

You will use XGBClassifier as the machine learning algorithm.

Let's initialize it, fit it on the train data, and then evaluate it using the accuracy score.

In [None]:
# Initialize XGBClassifier
xgb_classifier = xgb.XGBClassifier()

# Fit the classifier
xgb_classifier.fit(X_train, y_train)

# Evaluate the model
y_pred = xgb_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("👮🏻‍♂️ Accuracy:", accuracy)
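
Since the `target` column was assigned at random, the features carry no information about it, so an accuracy near 0.5 is the expected outcome here rather than a sign of a bug. The NumPy-only sketch below illustrates this with predictions that are unrelated to the labels; the seed and sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for a repeatable illustration

y_true = rng.integers(0, 2, size=10_000)   # random binary labels, like `target`
y_guess = rng.integers(0, 2, size=10_000)  # predictions unrelated to the labels

# Accuracy is the fraction of predictions that match the labels
accuracy_baseline = (y_true == y_guess).mean()
print(round(float(accuracy_baseline), 3))
```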

## <span style="color:#ff5f27;">🗄 Model Registry</span>

In Hopsworks, the Model Registry is used to manage and version machine learning models. It acts as a centralized repository where trained models can be stored, tracked, and shared among team members.

Calling `project.get_model_registry()` returns a reference to the Model Registry of the current Hopsworks project. Through this reference you can register, version, and retrieve trained models, which makes it easier to collaborate, track model changes, and deploy the best-performing models to production.

In [None]:
mr = project.get_model_registry()

### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The next step is to **define the input and output schema** of the machine learning model.

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_schema = Schema(X_train.values)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_schema.to_dict()

<a name='8'></a>
### <span style="color:#ff5f27;">💾 Saving the Model</span>

Now you are ready to register your model in the Hopsworks Model Registry.

To begin with, let's create the `xgb_model` model directory and save the trained model in this directory.

In [None]:
model_dir = "xgb_model"

# Create the model directory if it does not exist yet
os.makedirs(model_dir, exist_ok=True)

# Save the model
joblib.dump(xgb_classifier, model_dir + '/xgb_classifier.pkl')

To register your model in the Hopsworks Model Registry, use the `.create_model()` method with the following parameters:

- `name="xgb_model"`: the name of the model.

- `metrics={"Accuracy": accuracy}`: the model's performance metrics as a dictionary. Here the key is "Accuracy" and the value is the accuracy score computed earlier on the test data.

- `description="XGB model"`: a brief description of the model.

- `input_example=X_train.sample()`: a randomly sampled row of `X_train` that shows the expected format of the model's input data.

- `model_schema=model_schema`: the input and output structure of the model, as defined in the previous step.

In [None]:
# Create a model in the model registry
model = mr.python.create_model(
    name="xgb_model",
    metrics={"Accuracy": accuracy}, 
    description="XGB model",
    input_example=X_train.sample(),
    model_schema=model_schema
)

model.save(model_dir)

---
<a name='t3'></a>
# <span style="color:#ff5f27;">⛳️ Inference Pipeline </span>

In the **Inference Pipeline** section, you will retrieve your model from Hopsworks Model Registry and utilize this model to make predictions on both Batch Data and Online Feature Vectors.

<a name='9'></a>
## <span style="color:#ff5f27;"> 📮 Retrieving the Model from Model Registry </span>

To retrieve a previously registered machine learning model from the Hopsworks Model Registry, use the `.get_model()` method with the following parameters:

- `name="xgb_model"`: the name of the model to retrieve.

- `version=1`: the version number of the model to retrieve.

Then you will download the model from the Model Registry.

In [None]:
# Retrieve your model from the model registry
retrieved_model = mr.get_model(
    name="xgb_model",
    version=1
)
saved_model_dir = retrieved_model.download()

In [None]:
# Retrieve the XGB model
retrieved_xgboost_model = joblib.load(saved_model_dir + "/xgb_classifier.pkl")
retrieved_xgboost_model

<a name='10'></a>
## <span style="color:#ff5f27;"> 👨🏻‍⚖️ Batch Prediction </span>

Batch prediction is a process in which a trained machine learning model is used to make predictions on a large set of data all at once.

To retrieve batch data from the feature view, first call the `.init_batch_scoring()` method of the feature view object.

The `training_dataset_version` parameter specifies the version of the training dataset that will be used for scoring.

Then you can use the `.get_batch_data()` method to retrieve batch data.

In [None]:
# Initialise feature view to retrieve batch data
feature_view.init_batch_scoring(training_dataset_version=td_version)

# Retrieve batch data
batch_data = feature_view.get_batch_data()
batch_data.head(3)

Now let's use the retrieved model to make predictions on the batch data.

In [None]:
# Predict batch data using retrieved model
predictions_batch = retrieved_xgboost_model.predict(batch_data)
predictions_batch[:10]

<a name='11'></a>
## <span style="color:#ff5f27;"> 👨🏻‍⚖️ Real-time Predictions</span>

**Real-time prediction** is the process of using a trained machine learning model to make predictions on one or more feature vectors in real time.

To begin with, let's create a `to_df` function which transforms a feature vector, or a list of feature vectors, into a pandas DataFrame.

In [None]:
def to_df(feature_vector):
    """
    Convert a feature vector or a list of feature vectors into a pandas DataFrame.

    Parameters:
        feature_vector (a list, or list of lists): 
            A feature vector or a list of feature vectors. A feature vector is 
            represented as a list containing two elements: the first 
            element corresponds to the city name (categorical feature), and the 
            second element corresponds to the PM2.5 value (numerical feature).

    Returns:
        pandas.DataFrame: A DataFrame representing the feature vector(s). 
        The DataFrame will have two columns: 'city_name' for the city names 
        and 'pm2_5' for the corresponding PM2.5 values.

    Example:
        >>> feature_vector = ['New York', 15.3]
        >>> to_df(feature_vector)
          city_name  pm2_5
        0  New York   15.3

        >>> multiple_vectors = [['New York', 15.3], ['Los Angeles', 10.7]]
        >>> to_df(multiple_vectors)
             city_name  pm2_5
        0     New York   15.3
        1  Los Angeles   10.7
    """
    
    # Check if the input is a list of feature vectors
    if isinstance(feature_vector[0], list): 
        # Separate the city names and PM2.5 values into separate lists
        city_names = [vector[0] for vector in feature_vector]
        pm2_5_values = [vector[1] for vector in feature_vector]
        
        # Create a DataFrame with 'city_name' and 'pm2_5' columns from the lists
        data = pd.DataFrame(
            {
                'city_name': city_names,
                'pm2_5': pm2_5_values,
            }
        )
        
        # Return the DataFrame representing multiple feature vectors
        return data

    # If only one feature vector is provided, create a DataFrame for it
    data = pd.DataFrame(
            {
                'city_name': [feature_vector[0]],
                'pm2_5': [feature_vector[1]],
            }
        )
    
    # Return the DataFrame representing a single feature vector
    return data
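
The two branches of `to_df` build the same two-column DataFrame, so they can be folded into one. The variant below is a hypothetical, shorter sketch named `to_df_compact`; unlike `to_df`, it assumes each feature vector has exactly two elements.

```python
import pandas as pd

def to_df_compact(feature_vector):
    """Hypothetical shorter variant of to_df with the same two-column output."""
    # Wrap a single feature vector so both cases become a list of rows
    if isinstance(feature_vector[0], list):
        rows = feature_vector
    else:
        rows = [feature_vector]
    return pd.DataFrame(rows, columns=["city_name", "pm2_5"])

single = to_df_compact(["New York", 15.3])
multiple = to_df_compact([["New York", 15.3], ["Los Angeles", 10.7]])
print(single)
print(multiple)
```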

The next step is to initialize the feature view for serving and then retrieve a feature vector with specified primary keys.

In [None]:
# Initialise feature view to retrieve feature vector
feature_view.init_serving(1)

# Retrieve a feature vector
feature_vector = feature_view.get_feature_vector(
    entry = {
        "city_name": 'Amsterdam',
        "date": '2013-01-01',
    }
)
feature_vector

Let's apply the `to_df` function to transform the feature vector into a pandas DataFrame.

In [None]:
# Transform feature vector to pandas dataframe
feature_vector_df = to_df(feature_vector)
feature_vector_df

Now you can use your model to make a prediction for the feature vector DataFrame.

In [None]:
# Predict feature vector dataframe using retrieved model
prediction_feature_vector = retrieved_xgboost_model.predict(feature_vector_df)
prediction_feature_vector

In addition, you can retrieve several feature vectors at once by passing the primary keys as a list of dictionaries.

In [None]:
# Retrieve feature vectors from feature store
feature_vectors = feature_view.get_feature_vectors(
    entry = [
        {"city_name": 'Amsterdam', "date": '2013-01-01'},
        {"city_name": 'Amsterdam', "date": '2014-01-01'},
    ]
)
feature_vectors

Apply the `to_df` function to transform the feature vectors into a pandas DataFrame.

In [None]:
# Convert feature vectors to pandas dataframe
feature_vectors_df = to_df(feature_vectors)
feature_vectors_df

Now you can use your model to make predictions for the DataFrame of feature vectors.

In [None]:
# Predict dataframe of feature vectors using retrieved model
prediction_feature_vectors = retrieved_xgboost_model.predict(feature_vectors_df)
prediction_feature_vectors

---