# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 02: Training Pipeline</span>

<span style="font-weight:bold; font-size: 1.4rem;">This notebook explains how to read from a feature group, create a training dataset within the feature store, train a model, and save it to the model registry.</span>

## 🗒️ This notebook is divided into the following sections:

1. Fetch Feature Groups.
2. Define Transformation functions.
3. Create Feature Views.
4. Create Training Dataset with training and test splits.
5. Train the model.
6. Register model in Hopsworks Model Registry.

![part2](../../images/02_training-dataset.png) 

## <span style='color:#ff5f27'> 📝 Imports </span>

In [None]:
!pip install -U xgboost --quiet

In [1]:
import joblib
import os

import pandas as pd
import numpy as np
from matplotlib import pyplot
import seaborn as sns

import xgboost as xgb
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

# Mute warnings
import warnings
warnings.filterwarnings("ignore")

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [2]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://aff99120-da3e-11ee-8cd7-4f3734b3ce24.cloud.hopsworks.ai/p/3192
Connected. Call `.close()` to terminate connection gracefully.


### <span style="color:#ff5f27;"> 🔪 Feature Selection </span>

You will start by selecting all the features you want to include for model training/inference.

In [14]:
# Retrieve feature groups
trans_fg = fs.get_feature_group(
    name='transactions_fraud_streaming_fg', 
    version=1,
)
window_aggs_fg = fs.get_feature_group(
    name='transactions_aggs_fraud_streaming_fg', 
    version=1,
)

In [19]:
# Select features for training data.
selected_features = trans_fg.select(["fraud_label", "category", "amount", "datetime", "age_at_transaction", "days_until_card_expires"])\
    .join(window_aggs_fg.select_except(["cc_num", "event_time"]))

In [26]:
# Read the selected features into a DataFrame to inspect them
df = selected_features.read(read_options={"use_hive": True})



Finished: Reading data from Hopsworks, using Hive (21.07s) 


Recall that the features in `transactions_aggs_fraud_streaming_fg` are window aggregates. If you had created multiple feature groups with an identical schema for different window lengths and wanted to include them in the join, you would need to pass a `prefix` argument to the join to avoid feature name clashes. See the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/query_api/#join) for more details.
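To illustrate why a prefix is needed, here is a minimal, library-free sketch of the idea. It is not the Hopsworks implementation: the helper `join_features`, the feature name `trans_volume_mavg`, and the prefix `w4h_` are hypothetical and chosen only for illustration.

```python
def join_features(left, right, prefix=None):
    """Merge two feature rows, optionally prefixing the right-hand names."""
    if prefix:
        right = {f"{prefix}{name}": value for name, value in right.items()}
    clashes = sorted(set(left) & set(right))
    if clashes:
        raise ValueError(f"feature name clash: {clashes}")
    return {**left, **right}


# Hypothetical aggregates sharing one feature name across two window lengths
aggs_1h = {"trans_volume_mavg": 12.5}
aggs_4h = {"trans_volume_mavg": 48.0}

# Without a prefix, the identical names collide
try:
    join_features(aggs_1h, aggs_4h)
    clash_detected = False
except ValueError:
    clash_detected = True

# With a prefix on the right-hand side, both features survive the join
joined = join_features(aggs_1h, aggs_4h, prefix="w4h_")
```

In this sketch the prefix is applied to the right-hand side of the join only, so the left-hand feature names stay unchanged.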

---

### <span style="color:#ff5f27;"> 🤖 Transformation Functions </span>


You will preprocess the data with transformation functions, such as *min-max scaling* on numerical features and *label encoding* on categorical features. To do this, you define a mapping between features and transformation functions; the cell below maps only `category` to the label encoder. Attaching the mapping to the feature view ensures that transformation functions such as *min-max scaling* are fitted only on the training data (and not the validation/test data), so there is no data leakage.

In [21]:
# Load transformation functions.
label_encoder = fs.get_transformation_function(name="label_encoder")

# Map features to transformations.
transformation_functions = {
    "category": label_encoder,
}
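To make the data leakage point concrete, the following is a minimal sketch of min-max scaling and label encoding whose statistics come from the training rows only. It uses plain Python rather than the Hopsworks transformation functions; the helper names and the sample values (amounts, `Grocery`, `Clothing`) are assumptions made for illustration.

```python
def fit_min_max(train_values):
    """Learn scaling statistics from the training split only."""
    return min(train_values), max(train_values)


def apply_min_max(values, stats):
    """Scale values with previously fitted statistics."""
    lo, hi = stats
    span = (hi - lo) or 1.0  # avoid division by zero for constant features
    return [(v - lo) / span for v in values]


def fit_label_encoder(train_categories):
    """Map each category seen in training to an integer code."""
    return {cat: code for code, cat in enumerate(sorted(set(train_categories)))}


train_amounts = [10.0, 20.0, 30.0]
test_amounts = [15.0, 40.0]

# Statistics are fitted on the training data and reused for the test data
stats = fit_min_max(train_amounts)
scaled_train = apply_min_max(train_amounts, stats)
scaled_test = apply_min_max(test_amounts, stats)

encoder = fit_label_encoder(["Grocery", "Clothing", "Grocery"])
encoded = [encoder[c] for c in ["Clothing", "Grocery"]]
```

Note that a test value above the training maximum is scaled to a value greater than 1. This is expected when the statistics are not refitted on the test data.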

## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

A Feature View defines a schema in the form of a query with filters, a model target feature/label, and additional transformation functions.
To create a Feature View you may use `fs.create_feature_view()`. Here you use `fs.get_or_create_feature_view()`, which returns the feature view if it already exists and creates it otherwise.

In [22]:
# Get or create the 'transactions_view_fraud_batch_fv' feature view
feature_view = fs.get_or_create_feature_view(
    name='transactions_view_fraud_batch_fv',
    version=1,
    query=selected_features,
    labels=["fraud_label"],
    transformation_functions=transformation_functions,
)

Feature view created successfully, explore it at 
https://aff99120-da3e-11ee-8cd7-4f3734b3ce24.cloud.hopsworks.ai/p/3192/fs/3140/fv/transactions_view_fraud_batch_fv/version/1


The feature view is now visible in the UI.

![fg-overview](../images/fv_overview.gif)

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query where the projection (set of features) is determined by the parent Feature View, with an optional snapshot on disk of the data returned by the query.

**A training dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

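Here, the training dataset is created using the `feature_view.train_test_split()` method, which returns training and test splits; `feature_view.train_validation_test_split()` additionally returns a validation split.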
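As a rough intuition for what a `test_size` of 0.2 means, here is a minimal sketch of a random split in plain Python. It assumes a purely random assignment of rows and is not the Hopsworks implementation; the helper `train_test_split_rows` and its `seed` parameter are hypothetical.

```python
import random


def train_test_split_rows(rows, test_size, seed=42):
    """Randomly assign a `test_size` fraction of rows to the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


rows = list(range(10))
train_rows, test_rows = train_test_split_rows(rows, test_size=0.2)
```

Every row ends up in exactly one split, which is the property that keeps the test set a true holdout.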

In [23]:
TEST_SIZE = 0.2

X_train, X_test, y_train, y_test = feature_view.train_test_split(
    test_size = TEST_SIZE, read_options={"use_hive":True}
)



Finished: Reading data from Hopsworks, using Hive (25.04s) 


KeyError: DataType(null)

In [None]:
X_train.head()

In [None]:
# Sort the X_train DataFrame based on the "datetime" column in ascending order
X_train = X_train.sort_values("datetime")

# Reindex the y_train Series to match the order of rows in the sorted X_train DataFrame
y_train = y_train.reindex(X_train.index)

In [None]:
# Sort the X_test DataFrame based on the "datetime" column in ascending order
X_test = X_test.sort_values("datetime")

# Reindex the y_test Series to match the order of rows in the sorted X_test DataFrame
y_test = y_test.reindex(X_test.index)

In [None]:
# Drop the "datetime" column from the X_train DataFrame along the specified axis (axis=1 means columns)
X_train.drop(["datetime"], axis=1, inplace=True)

# Drop the "datetime" column from the X_test DataFrame along the specified axis (axis=1 means columns)
X_test.drop(["datetime"], axis=1, inplace=True)

In [None]:
# Display the normalized value counts of the y_train Series
y_train.value_counts(normalize=True)

Notice that the distribution is extremely skewed, which is natural considering that fraudulent transactions make up a tiny part of all transactions. You should therefore address the class imbalance. There are many approaches for this, such as weighting the loss function, over- or undersampling, creating synthetic data, or modifying the decision threshold. One of the simplest is to supply a class weight parameter to the learning algorithm. The class weight determines how much importance is attached to each class, which in this case means placing higher importance on positive (fraudulent) samples.

---

## <span style="color:#ff5f27;"> 🧬 Modeling</span>

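Next you will train a model. The cell below fits an `XGBClassifier` with its default parameters; a larger weight for the positive class can be set through XGBoost's `scale_pos_weight` parameter.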

In [None]:
# Create an instance of the XGBClassifier
clf = xgb.XGBClassifier()

# Fit the classifier on the training data
clf.fit(X_train.values, y_train)
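The classifier above is created with default parameters. As a sketch of the class-weighting idea discussed earlier, one common heuristic is to weight the positive class by the ratio of negative to positive training samples. The helper `positive_class_weight` and the toy labels below are hypothetical; whether this ratio is the best weight for your data is an assumption you should validate.

```python
from collections import Counter


def positive_class_weight(labels):
    """Return the negative-to-positive ratio, a common heuristic weight."""
    counts = Counter(labels)
    negatives, positives = counts[0], counts[1]
    if positives == 0:
        raise ValueError("no positive samples in labels")
    return negatives / positives


# Toy labels: 95 normal transactions and 5 fraudulent ones
toy_labels = [0] * 95 + [1] * 5
weight = positive_class_weight(toy_labels)

# With XGBoost, such a value could be passed as:
# clf = xgb.XGBClassifier(scale_pos_weight=weight)
```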

In [None]:
# Predict the training data using the trained classifier
y_pred_train = clf.predict(X_train.values)

# Predict the test data using the trained classifier
y_pred_test = clf.predict(X_test.values)

In [None]:
# Compute f1 score
metrics = {
    "f1_score": f1_score(y_test, y_pred_test, average='macro')
}
metrics
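The macro average computes an F1 score per class and then takes the unweighted mean, so the rare fraud class counts as much as the normal class. The following plain-Python sketch shows the calculation on toy labels; the helper names are hypothetical, and this is not scikit-learn's implementation.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


toy_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

score = macro_f1(toy_true, toy_pred)
accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
```

On these toy labels the accuracy is 0.8 while the macro F1 is 0.6875, because the fraud class is only half right and that error is not diluted by the many correct normal predictions.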

In [None]:
# Generate the confusion matrix using the true labels (y_test) and predicted labels (y_pred_test)
results = confusion_matrix(y_test, y_pred_test)

# Print the confusion matrix
print(results)
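For binary labels, the printed matrix has true classes as rows and predicted classes as columns, so it reads `[[TN, FP], [FN, TP]]`. A minimal plain-Python sketch of that layout on toy labels (the helper `confusion_counts` is hypothetical):

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0/1."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1  # row = true label, column = predicted label
    return matrix


toy_true = [0, 0, 0, 1, 1, 0]
toy_pred = [0, 1, 0, 1, 0, 0]
matrix = confusion_counts(toy_true, toy_pred)
```

Here `matrix[1][0]` counts the fraudulent transactions that were predicted as normal, which is usually the costliest kind of error in fraud detection.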

In [None]:
# Create a DataFrame from the confusion matrix results with appropriate labels
df_cm = pd.DataFrame(
    results, 
    ['True Normal', 'True Fraud'],
    ['Pred Normal', 'Pred Fraud'],
)

# Create a heatmap using seaborn with annotations
cm = sns.heatmap(df_cm, annot=True)

# Get the figure from the heatmap and display it
fig = cm.get_figure()
fig.show()

---

### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/3.0/user_guides/mlops/registry/model_schema/), which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

# Define the input schema using the values of X_train
input_schema = Schema(X_train.values)

# Define the output schema using y_train
output_schema = Schema(y_train)

# Create a ModelSchema object specifying the input and output schemas
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

# Convert the model schema to a dictionary for further inspection or serialization
model_schema.to_dict()

## <span style="color:#ff5f27;">📝 Register model</span>

One of the features in Hopsworks is the model registry. This is where we can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

In [None]:
# Specify the directory where the model will be saved
model_dir = "fraud_batch_model"

# Create the directory if it doesn't already exist
os.makedirs(model_dir, exist_ok=True)

# Save the trained XGBoost model using joblib
joblib.dump(clf, model_dir + '/xgboost_fraud_batch_model.pkl')

# Save the confusion matrix heatmap as an image in the model directory
fig.savefig(model_dir + "/confusion_matrix.png")

In [None]:
# Get the model registry
mr = project.get_model_registry()

# Create a new model in the model registry
fraud_model = mr.python.create_model(
    name="xgboost_fraud_batch_model",     # Name for the model
    metrics=metrics,                      # Metrics used for evaluation
    model_schema=model_schema,            # Schema defining the model's input and output
    input_example=X_train.sample(),       # Example input data for reference
    description="Fraud Batch Predictor",  # Description of the model
)

# Upload the contents of the model directory to the model registry
fraud_model.save(model_dir)

---
## <span style="color:#ff5f27;">⏭️ **Next:** Part 03: Batch Inference</span>

In the following notebook you will use your model for batch inference.
