# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 02: Training Pipeline</span>


<span style="font-weight:bold; font-size: 1.4rem;">This notebook explains how to read from a feature group, create a training dataset within the feature store, train a model, save it to the model registry, and deploy it.</span>

## 🗒️ This notebook is divided into the following sections:

1. Fetch Feature Groups.
2. Define Transformation functions.
3. Create Feature Views.
4. Create Training Dataset with training and test splits.
5. Train the model.
6. Register model in Hopsworks Model Registry.
7. Create the Deployment.

![part2](../images/02_training-dataset.png) 

## <span style='color:#ff5f27'> 📝 Imports </span>

In [None]:
!pip install -U xgboost --quiet

In [1]:
import joblib
import os
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

import xgboost as xgb
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

# Mute warnings
import warnings
warnings.filterwarnings("ignore")

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [2]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

2025-11-21 13:50:30,270 INFO: Python Engine initialized.

Logged in to project, explore it here https://hopsworks.ai.local/p/120


---

## <span style="color:#ff5f27;"> 🔪 Feature Selection </span>

You will start by selecting all the features you want to include for model training/inference.

In [3]:
# Retrieve feature groups.
trans_fg = fs.get_feature_group(
    name='transactions_fraud_online_fg', 
    version=1,
)
profile_online_fg = fs.get_feature_group(
    name='profile_fraud_online_fg', 
    version=1,
)

# Select features for training dataset
selected_features = trans_fg.select_features().join(profile_online_fg.select_features())

2025-11-21 13:50:30,997 INFO: Using ['tid', 'amount', 'country', 'fraud_label', 'loc_delta_t_plus_1', 'loc_delta_t_minus_1', 'time_delta_t_minus_1'] from feature group `transactions_fraud_online_fg` as features for the query. To include primary key and event time use `select_all`.
2025-11-21 13:50:30,997 INFO: Using ['gender'] from feature group `profile_fraud_online_fg` as features for the query. To include primary key and event time use `select_all`.


In [4]:
# Uncomment this if you would like to view your selected features
# selected_features.show(5)

Recall that you computed the features in `transactions_fraud_online_fg`. If you had created multiple feature groups with an identical schema for different window lengths and wanted to include them all in the join, you would need to pass a `prefix` argument to the join to avoid feature name clashes. See the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/query_api/#join) for more details.
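
To make the name clash concrete, here is a minimal sketch in plain Python. It does not call Hopsworks: the helper `joined_feature_names` and the `4h_` prefix are hypothetical and only illustrate why a prefix is needed when two feature groups share a schema.

```python
def joined_feature_names(left, right, prefix=None):
    """Return the feature names of a join, failing on duplicate names.

    Hypothetical helper for illustration only; not part of the Hopsworks API.
    """
    # Apply the optional prefix to every feature coming from the right side
    right = [f"{prefix}{name}" if prefix else name for name in right]
    clashes = sorted(set(left) & set(right))
    if clashes:
        raise ValueError(f"feature name clash: {clashes}")
    return list(left) + right


# Two feature groups with an identical schema, e.g. computed over different windows
window_1h = ["loc_delta_t_minus_1", "time_delta_t_minus_1"]
window_4h = ["loc_delta_t_minus_1", "time_delta_t_minus_1"]

try:
    joined_feature_names(window_1h, window_4h)
except ValueError as err:
    print(err)

print(joined_feature_names(window_1h, window_4h, prefix="4h_"))
```

With real feature groups, the equivalent call would be along the lines of `fg_1h.select_features().join(fg_4h.select_features(), prefix="4h_")`, where `fg_1h` and `fg_4h` are placeholder names.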

---

### <span style="color:#ff5f27;"> 🤖 Transformation Functions </span>


You will preprocess the data by applying *label encoding* to the categorical features (`country` and `gender`). To do this, you define a mapping between features and transformation functions. Because the transformation functions are attached to the feature view, any statistics they rely on (for example, the minimum and maximum used by *min-max scaling*) are fitted only on the training data and not on the validation/test data, which prevents data leakage.

In [5]:
# Import transformation functions from Hopsworks.
from hopsworks.hsfs.builtin_transformations import label_encoder

# Map features to transformation functions.
transformation_functions = [
    label_encoder("country"),
    label_encoder("gender"),
]
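
The idea of fitting on the training data only can be sketched in plain Python. The helpers below are hypothetical and are not the Hopsworks built-ins; in particular, mapping unseen categories to `-1` is an assumption made for this illustration.

```python
def fit_label_encoder(train_values):
    """Map each distinct training category to an integer code (sorted order)."""
    return {value: code for code, value in enumerate(sorted(set(train_values)))}


def encode(values, mapping, unknown=-1):
    """Encode values with a fitted mapping; unseen categories get `unknown`."""
    return [mapping.get(value, unknown) for value in values]


def fit_min_max(train_values):
    """Record the minimum and maximum seen in the training data."""
    return min(train_values), max(train_values)


def min_max_scale(values, lo, hi):
    """Scale with training statistics; test values may fall outside [0, 1]."""
    span = (hi - lo) or 1.0  # avoid dividing by zero if all training values are equal
    return [(v - lo) / span for v in values]


# Made-up data for illustration
train_country, test_country = ["US", "DE", "US", "FR"], ["FR", "BR"]
train_amount, test_amount = [10.0, 20.0, 30.0], [15.0, 40.0]

mapping = fit_label_encoder(train_country)   # fitted on training data only
print(encode(test_country, mapping))         # "BR" was never seen in training

lo, hi = fit_min_max(train_amount)           # fitted on training data only
print(min_max_scale(test_amount, lo, hi))
```

Nothing from the test data influences `mapping`, `lo` or `hi`, which is the property that prevents leakage.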

## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

A Feature View lets you define the schema of the model's input as a query (optionally with filters), specify the model's target feature/label, and attach additional transformation functions.
To create or retrieve a Feature View, use `fs.get_or_create_feature_view()`.

In [6]:
# Get or create the 'transactions_fraud_online_fv' feature view
feature_view = fs.get_or_create_feature_view(
    name='transactions_fraud_online_fv',
    version=1,
    query=selected_features,
    labels=["fraud_label"],
    transformation_functions=transformation_functions,
    inference_helper_columns=["tid"],
    logging_enabled=True
)

Feature view created successfully, explore it at 
https://hopsworks.ai.local/p/120/fs/68/fv/transactions_fraud_online_fv/version/1


## <span style="color:#ff5f27;"> 🏋️ Training Dataset </span>

In [7]:
# Training/Test splits, datasets creation. Using timerange arguments.
train_start = "2022/01/01"
train_end = "2022/03/10"
test_start = "2022/03/10"
test_end = "2022/03/31"

X_train, X_test, y_train, y_test = feature_view.train_test_split(
    train_start=train_start,
    train_end=train_end,
    test_start=test_start,
    test_end=test_end,
)

Finished: Reading data from Hopsworks, using Hopsworks Feature Query Service (0.85s) 
2025-11-21 13:50:42,896 INFO: Profiling dataframe in Python Engine
2025-11-21 13:50:44,491 INFO: Profiling dataframe in Python Engine
2025-11-21 13:50:44,537 INFO: Profiling dataframe in Python Engine
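
A time-range split of this kind can be sketched in plain Python. Note that `train_end` and `test_start` are the same date in the cell above; how Hopsworks treats rows that fall exactly on such a shared boundary is not shown in this notebook, so the half-open ranges below are an assumption of this sketch, and `time_split` is a hypothetical helper.

```python
from datetime import datetime


def parse_day(text):
    """Parse dates written like the notebook's "2022/03/10" strings."""
    return datetime.strptime(text, "%Y/%m/%d")


def time_split(rows, train_start, train_end, test_start, test_end):
    """Split (timestamp, payload) rows into train/test using half-open ranges.

    Assumption for this sketch: a row belongs to [start, end), so a row that
    falls exactly on a shared boundary lands in only one split.
    """
    tr_s, tr_e = parse_day(train_start), parse_day(train_end)
    te_s, te_e = parse_day(test_start), parse_day(test_end)
    train = [r for r in rows if tr_s <= r[0] < tr_e]
    test = [r for r in rows if te_s <= r[0] < te_e]
    return train, test


# Made-up transactions: (event time, amount)
rows = [
    (datetime(2022, 1, 5), 12.5),
    (datetime(2022, 3, 9), 80.0),
    (datetime(2022, 3, 10), 7.2),   # exactly on the shared boundary
    (datetime(2022, 3, 20), 55.1),
]

train, test = time_split(rows, "2022/01/01", "2022/03/10", "2022/03/10", "2022/03/31")
print(len(train), len(test))
```

Splitting by time rather than at random keeps the test data strictly later than the training data, which mirrors how the model will be used on future transactions.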



In [8]:
X_train

| | amount | loc_delta_t_plus_1 | loc_delta_t_minus_1 | time_delta_t_minus_1 | label_encoder_country_ | label_encoder_gender_ |
|---|---|---|---|---|---|---|
| 0 | 510.90 | 0.000000 | 0.000112 | 0.958333 | 111 | 1 |
| 1 | 21.89 | 0.000112 | 0.000127 | 0.958333 | 111 | 1 |
| 2 | 11.28 | 0.000127 | 0.244046 | 3.231134 | 111 | 1 |
| 3 | 27.56 | 0.244046 | 0.087633 | 0.553877 | 111 | 1 |
| 4 | 752.57 | 0.087633 | 2.417108 | 0.166667 | 111 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 365107 | 1.78 | 0.177518 | 0.177387 | 0.380405 | 111 | 1 |
| 365108 | 25.77 | 0.177387 | 0.221254 | 0.234132 | 111 | 1 |
| 365109 | 91.63 | 0.221254 | 0.273620 | 0.182662 | 111 | 1 |
| 365110 | 1.87 | 0.273620 | 0.195295 | 0.208206 | 111 | 1 |


The feature view and training dataset are now visible in the UI.

![fg-overview](../../../images/fv_overview.gif)

In [9]:
# Display the normalized value counts of the training labels (y_train)
y_train.value_counts(normalize=True)

fraud_label
0              0.996458
1              0.003542
Name: proportion, dtype: float64

Notice that the distribution is extremely skewed, which is natural considering that fraudulent transactions make up a tiny part of all transactions (about 0.35% of the training labels here). You should therefore address the class imbalance. There are many approaches, such as weighting the loss function, over- or undersampling, creating synthetic data, or modifying the decision threshold. The simplest is to supply a class weight parameter to the learning algorithm. The class weight determines how much importance is attached to each class, which in this case means placing higher importance on positive (fraudulent) samples.
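
One common heuristic for such a weight is the ratio of negative to positive samples. The sketch below computes it in plain Python on made-up labels; `positive_class_weight` is a hypothetical helper. Passing the result to XGBoost's `scale_pos_weight` parameter is shown only as a comment and was not run as part of this notebook.

```python
def positive_class_weight(labels):
    """Return n_negative / n_positive for binary labels (1 = fraud, 0 = normal)."""
    labels = list(labels)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples; cannot compute a class weight")
    return n_neg / n_pos


# Toy labels, made up for illustration: 1 fraud among 8 transactions
toy_labels = [0, 0, 0, 1, 0, 0, 0, 0]
weight = positive_class_weight(toy_labels)
print(weight)

# With the notebook's objects this could look like the following
# (assuming y_train has a `fraud_label` column; not executed here):
# model = xgb.XGBClassifier(scale_pos_weight=positive_class_weight(y_train["fraud_label"]))
```

A larger weight makes errors on fraudulent samples cost more during training, at the price of more false alarms on normal transactions.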

---

## <span style="color:#ff5f27;"> 🧬 Modeling</span>

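Next you will train a model. Note that the cell below creates the XGBoost classifier with its default parameters; to give the positive class a larger weight, as discussed above, you can pass the `scale_pos_weight` parameter to `xgb.XGBClassifier`.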

In [10]:
# Initialize an XGBoost classifier
model = xgb.XGBClassifier()

# Train the classifier using the training features (X_train) and labels (y_train)
model.fit(X_train, y_train)

In [11]:
# Predict the training set
y_pred_train = model.predict(X_train)

# Predict the test set
y_pred_test = model.predict(X_test)

In [12]:
# Compute f1 score
metrics = {
    "f1_score": f1_score(y_test, y_pred_test, average='macro')
}
metrics

{'f1_score': 1.0}

In [13]:
# Calculate the confusion matrix for the test set predictions
results = confusion_matrix(
    y_test, 
    y_pred_test, 
    labels=[False, True],
)

# Print the confusion matrix
print(results)

[[40229     0]
 [    0     0]]
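
Note that the second row of the confusion matrix above is all zeros: this test split contains no fraudulent transactions, so the F1 score of 1.0 only reflects the normal class. The sketch below recomputes per-class F1 from the matrix in plain Python to make that visible; `per_class_f1` is a hypothetical helper, not part of scikit-learn.

```python
def per_class_f1(cm):
    """Compute support and F1 per class from a square confusion matrix.

    cm[i][j] is the number of samples with true class i predicted as class j.
    F1 is reported as None when a class has no true and no predicted samples.
    """
    scores = {}
    n_classes = len(cm)
    for k in range(n_classes):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n_classes)) - tp
        fn = sum(cm[k]) - tp
        denom = 2 * tp + fp + fn
        scores[k] = {
            "support": tp + fn,
            "f1": (2 * tp / denom) if denom else None,
        }
    return scores


# Confusion matrix printed above: rows are true labels, columns are predictions
cm = [[40229, 0], [0, 0]]
scores = per_class_f1(cm)
for label, score in scores.items():
    print(label, score)
```

Before relying on the metric, it is worth checking that the test time range contains fraudulent samples, for example with `y_test.value_counts()`.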


---

## <span style="color:#ff5f27;">📝 Register model</span>

One of the features in Hopsworks is the model registry. This is where we can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

In [14]:
# Specify the model directory
model_dir = "fraud_online_model"
images_dir = os.path.join(model_dir, "images")

# Create directories if they don't exist
os.makedirs(images_dir, exist_ok=True)

In [15]:
# Save the trained XGBoost model
joblib.dump(model, os.path.join(model_dir, "xgboost_fraud_online_model.pkl"))

['fraud_online_model/xgboost_fraud_online_model.pkl']

In [16]:
# Create a DataFrame from the confusion matrix results
df_cm = pd.DataFrame(
    results, 
    ['True Normal', 'True Fraud'],
    ['Pred Normal', 'Pred Fraud']
)

# Create and save the confusion matrix heatmap
plt.figure(figsize=(8, 6))
cm = sns.heatmap(
    df_cm, 
    annot=True,
    fmt='d',                 # Use integer format for numbers
    cmap='RdPu',             # Use a color palette that works well for binary classification
    annot_kws={'size': 12},  # Increase annotation text size
    cbar=True                # Include color bar
)

# Add title and labels
plt.title('Confusion Matrix for Fraud Detection')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')

# Adjust layout and save
plt.tight_layout()
plt.savefig(os.path.join(images_dir, "confusion_matrix.png"), dpi=300, bbox_inches='tight')
plt.close()

In [17]:
# Get the model registry
mr = project.get_model_registry()

# Create a Python model in the model registry
fraud_model = mr.python.create_model(
    name="xgboost_fraud_online_model", 
    description="Fraud Online Predictor", # Add a description for the model
    metrics=metrics,                      # Specify the metrics used to evaluate the model
    input_example=[4467360740682089],     # Example input for testing deployments
    feature_view=feature_view,            # Add a feature view to the model
)

# Upload the contents of the local model directory to the model registry
fraud_model.save(model_dir)

  0%|          | 0/6 [00:00<?, ?it/s]

Model created, explore it at https://hopsworks.ai.local/p/120/models/xgboost_fraud_online_model/1


Model(name: 'xgboost_fraud_online_model', version: 1)

---

## <a class="anchor" id="1.5_bullet" style="color:#ff5f27"> 🚀 Model Deployment</a>


### About Model Serving
Models can be served via KServe (formerly KFServing) or "default" serving, which means a Docker container exposing a Flask server. For KServe models, or models written in TensorFlow, you do not need to write a prediction file. However, for sklearn models using default serving, you do need to write a prediction file (see the section below).

In order to use KServe, you must have Kubernetes installed and enabled on your cluster.

### <span style="color:#ff5f27;">📎 Predictor script for Python models</span>


Scikit-learn and XGBoost models are deployed as Python models, in which case you need to provide a **Predict** class that implements the **predict** method. The **predict()** method invokes the model on the inputs and returns the prediction as a list.

The **`__init__()`** method runs when the predictor is loaded into memory. It loads the model from the local directory it is materialized to, which is given by the *MODEL_FILES_PATH* environment variable.

The `%%writefile` directive writes the contents of the cell to the given Python file. We will use the **predict_example.py** file to create a deployment for our model.

In [18]:
%%writefile predict_example.py
import os
import numpy as np
import hopsworks
import joblib


class Predict(object):

    def __init__(self, async_logger, model):
        """ Initializes the serving state, reads a trained model"""        
        # Log in to the project and get the model registry handle
        project = hopsworks.login()
        self.mr = project.get_model_registry()

        # Retrieve the feature view from the model
        retrieved_model = self.mr.get_model(
            name="xgboost_fraud_online_model",
            version=1,
        )
        self.feature_view = retrieved_model.get_feature_view()
        
        # Initialize serving and async feature logging
        self.feature_view.init_serving(feature_logger=async_logger)

        # Load the trained model
        self.hopsworks_model = model

        self.model = joblib.load(os.environ["MODEL_FILES_PATH"] + "/xgboost_fraud_online_model.pkl")

        print("Initialization Complete")

    def predict(self, inputs):
        """ Serves a prediction request using a trained model"""
        feature_vector = self.feature_view.get_feature_vector({"cc_num": inputs[0][0]}, logging_data=True)   
        prediction = self.model.predict(np.asarray(feature_vector).reshape(1, -1)).tolist() # Numpy Arrays are not JSON serializable
        self.feature_view.log(feature_vector,
            predictions=prediction,
            model=self.hopsworks_model)
        return prediction

Overwriting predict_example.py
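
To see the request/response shape of `predict()` in isolation, here is a minimal sketch that runs without Hopsworks. `DummyModel`, `DummyFeatureView` and `LocalPredict` are hypothetical stand-ins with made-up feature values; they only mirror the shape of the calls in the script above (a nested `inputs` list in, a plain list out) and skip feature logging.

```python
class DummyModel:
    """Stand-in for the trained classifier; flags amounts above a threshold."""

    def predict(self, rows):
        return [1 if row[0] > 500 else 0 for row in rows]


class DummyFeatureView:
    """Stand-in for a feature view; looks up features by primary key."""

    def __init__(self, store):
        self.store = store

    def get_feature_vector(self, entry):
        return self.store[entry["cc_num"]]


class LocalPredict:
    """Same request/response shape as the Predict class, without Hopsworks."""

    def __init__(self, feature_view, model):
        self.feature_view = feature_view
        self.model = model

    def predict(self, inputs):
        # inputs is a list of instances; the first value of the first instance is the key
        feature_vector = self.feature_view.get_feature_vector({"cc_num": inputs[0][0]})
        # Return a plain list so the response is JSON serializable
        return list(self.model.predict([feature_vector]))


store = {4467360740682089: [752.57, 0.09, 2.42]}  # made-up feature values
predictor = LocalPredict(DummyFeatureView(store), DummyModel())
result = predictor.predict([[4467360740682089]])
print(result)
```

The client sends only the primary key; the features themselves are looked up on the server side, which keeps requests small and keeps the feature logic in one place.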


The predictor script does not hard-code a path to the model file: it reads the directory from the `MODEL_FILES_PATH` environment variable. If you want to inspect the files yourself, the Data Sets tab in the Hopsworks UI lets you browse the files in the project. Registered models are found under the Models directory. Since you saved your model with the name `xgboost_fraud_online_model`, that is the directory to look in, and `1` is the version of the model you want to deploy.

This script needs to be put into a known location in the Hopsworks file system. Let's call the file predict_example.py and put it in the Models directory.

In [19]:
# Get the dataset API for the current project
dataset_api = project.get_dataset_api()

# Specify the local file path of the Python script to be uploaded
local_script_path = "predict_example.py"

# Upload the Python script to the "Models", and overwrite if it already exists
uploaded_file_path = dataset_api.upload(local_script_path, "Models", overwrite=True)

# Create the full path to the uploaded script for future reference
predictor_script_path = os.path.join("/Projects", project.name, uploaded_file_path)

Uploading /hopsfs/Jupyter/hopsworks-tutorials/real-time-ai-systems/fraud_online/predict_example.py: 0.000%|   …
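
The path construction in the cell above works as intended when `uploaded_file_path` is a relative path such as `Models/predict_example.py`; that it is relative is an assumption here, since the return value is not printed in this notebook. The sketch below uses `posixpath` (so it behaves the same on every platform) and a hypothetical project name to show why this matters.

```python
import posixpath

project_name = "my_project"  # hypothetical project name

# A relative path is appended to the prefix as expected
relative = posixpath.join("/Projects", project_name, "Models/predict_example.py")
print(relative)

# An absolute path discards every component that came before it
absolute = posixpath.join("/Projects", project_name, "/Models/predict_example.py")
print(absolute)
```

If the deployment cannot find the script, printing `predictor_script_path` is a quick first check.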

### Create the deployment
Here, you deploy the model you registered above and define a configuration for the deployment: a name and the path to the predictor script. You can also specify the serving type (default or KServe) in the configuration; this example does not set it explicitly.

In [20]:
# Deploy the fraud model
deployment = fraud_model.deploy(
    name="fraudonlinemodeldeployment",  # Specify a name for the deployment
    script_file=predictor_script_path,  # Provide the path to the Python script for prediction
)

Deployment created, explore it at https://hopsworks.ai.local/p/120/deployments/18
Before making predictions, start the deployment by using `.start()`


In [21]:
# Print the name of the deployment
print("Deployment: " + deployment.name)

# Display information about the deployment
deployment.describe()

Deployment: fraudonlinemodeldeployment
{
    "api_protocol": "REST",
    "batching_configuration": {
        "batching_enabled": false
    },
    "config_file": null,
    "created": "2025-11-21T13:51:07.000Z",
    "creator": "Admin Admin",
    "description": null,
    "id": 18,
    "inference_logging": "NONE",
    "model_framework": "PYTHON",
    "model_name": "xgboost_fraud_online_model",
    "model_path": "/Projects/happening_example/Models/xgboost_fraud_online_model",
    "model_server": "PYTHON",
    "model_version": 1,
    "name": "fraudonlinemodeldeployment",
    "predictor": "predict_example.py",
    "predictor_resources": {
        "limits": {
            "cores": 2,
            "gpus": 0,
            "memory": 1024
        },
        "requests": {
            "cores": 0.2,
            "gpus": 0,
            "memory": 32
        }
    },
    "project_namespace": "happening-example",
    "requested_instances": 1,
    "serving_tool": "KSERVE",
    "version": 1
}


#### The deployment has now been registered. However, to start it you need to run the following command:

In [22]:
# Start the deployment and wait for it to be in a running state for up to 300 seconds
deployment.start(await_running=300)

  0%|          | 0/5 [00:00<?, ?it/s]

Start making predictions by using `.predict()`
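
Conceptually, waiting for a deployment amounts to polling its status until it reaches the desired state or a timeout expires. The sketch below illustrates that pattern only; it is not the Hopsworks client's implementation, and `wait_for_status`, `get_status` and the `"Starting"` status string are hypothetical.

```python
import time


def wait_for_status(get_status, target="Running", timeout=300, interval=5, sleep=time.sleep):
    """Poll get_status() until it returns target or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == target:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        sleep(interval)


# Simulated status sequence instead of a real deployment
statuses = iter(["Starting", "Starting", "Running"])
final_status = wait_for_status(lambda: next(statuses), interval=0)
print(final_status)
```

A timeout turns a deployment that never becomes ready into an explicit error instead of a notebook cell that hangs.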


In [23]:
# Get the current state of the deployment
deployment.get_state().describe()

{
    "available_instances": 1,
    "condition": {
        "reason": "Deployment is ready",
        "status": true,
        "type": "READY"
    },
    "deployed": "2025-11-21T13:51:08.000Z",
    "hopsworks_inference_path": "/project/120/inference/serving/fraudonlinemodeldeployment",
    "model_server_inference_path": "/v1/models/fraudonlinemodeldeployment",
    "revision": "21592840",
    "status": "Running"
}


In [24]:
# To troubleshoot, you can use the `get_logs()` method
deployment.get_logs(component='predictor')

Explore all the logs and filters in the Kibana logs at https://hopsworks.ai.local/p/120/deployments/18

DeployableComponentLogs(instance_name: 'fraudonlinemodeldeployment-predictor-00001-deployment-894885zkh', date: datetime.datetime(2025, 11, 21, 13, 51, 43, 486760)) 

2025-11-21 13:51:30.691 uvicorn.error INFO:     Started server process [8]
2025-11-21 13:51:30.691 uvicorn.error INFO:     Waiting for application startup.
2025-11-21 13:51:30.694 8 kserve INFO [server.py:start():68] Starting gRPC server on [::]:8081
2025-11-21 13:51:30.694 uvicorn.error INFO:     Application startup complete.
2025-11-21 13:51:30.694 uvicorn.error INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
2025-11-21 13:51:38.509 uvicorn.access INFO:     127.0.0.1:53888 8 - "GET /metrics HTTP/1.1" 200 OK
2025-11-21 13:51:38.509 kserve.trace kserve.io.kserve.protocol.rest.server.metrics_handler: 0.0014011859893798828 ['http_status:200', 'http_method:GET', 'time:wall']
2025-11-21 13:51:38.509 

### Stop Deployment
To stop the deployment you simply run:

In [None]:
# Stop the deployment and wait for it to be in a stopped state for up to 180 seconds
deployment.stop(await_stopped=180)

---
## <span style="color:#ff5f27;">⏭️ **Next:** Part 03: Inference Pipeline</span>

In the following notebook you will use your model for Serving Vector Inference.
