# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 02: Training Pipeline</span>

<span style="font-width:bold; font-size: 1.4rem;">This notebook explains how to read from a feature group, create a training dataset within the feature store, train a model, and save it to the model registry.</span>

## 🗒️ This notebook is divided into the following sections:

1. Fetch Feature Groups.
2. Define Transformation functions.
3. Create Feature Views.
4. Create Training Dataset with training and test splits.
5. Train the model.
6. Register model in Hopsworks Model Registry.

![part2](../images/02_training-dataset.png) 

## <span style='color:#ff5f27'> 📝 Imports </span>

In [None]:
!pip install -U xgboost --quiet

In [None]:
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import xgboost as xgb
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

# Mute warnings
import warnings
warnings.filterwarnings("ignore")

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

### <span style="color:#ff5f27;"> 🔪 Feature Selection </span>

You will start by selecting all the features you want to include for model training/inference.

In [None]:
# Retrieve feature groups
trans_fg = fs.get_feature_group(
    name='transactions_fraud_batch_fg', 
    version=1,
)
window_aggs_fg = fs.get_feature_group(
    name='transactions_4h_aggs_fraud_batch_fg', 
    version=1,
)

In [None]:
# Select features for training data.
selected_features = trans_fg.select(["fraud_label", "category", "amount", "age_at_transaction", "days_until_card_expires", "loc_delta"])\
    .join(window_aggs_fg.select_all(include_primary_key=False))

In [None]:
# Uncomment this if you would like to view your selected features
# selected_features.show(5)

Recall that you computed the features in `transactions_4h_aggs_fraud_batch_fg` using 4-hour aggregates. If you had created multiple feature groups with identical schemas for different window lengths and wanted to include them in the join, you would need to pass a `prefix` argument to the join to avoid feature name clashes. See the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/query_api/#join) for more details.
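
The prefixed join described above could look roughly like the sketch below. It relies only on `select_all(include_primary_key=False)` and the `prefix` argument of `join`; the helper name `join_window_aggregates` and the 12-hour feature group mentioned in the usage comment are hypothetical and not part of this tutorial.

```python
def join_window_aggregates(base_query, window_feature_groups):
    """Join several window-aggregate feature groups onto a base query.

    `window_feature_groups` maps a prefix (for example "4h_") to a feature
    group. Each group's features are joined under that prefix, so groups
    with identical schemas do not produce clashing feature names.
    """
    query = base_query
    for prefix, feature_group in window_feature_groups.items():
        query = query.join(
            feature_group.select_all(include_primary_key=False),
            prefix=prefix,
        )
    return query


# Hypothetical usage; `window_aggs_12h_fg` does not exist in this tutorial:
# selected_features = join_window_aggregates(
#     trans_fg.select(["fraud_label", "category", "amount"]),
#     {"4h_": window_aggs_fg, "12h_": window_aggs_12h_fg},
# )
```

Note that a prefix changes the feature names seen downstream, so any later code that refers to a joined column by name would need to use the prefixed name.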

---

### <span style="color:#ff5f27;"> 🤖 Transformation Functions </span>


You can preprocess the data using transformation functions such as *min-max scaling* on numerical features and *label encoding* on categorical features. To do this, you define a mapping between features and transformation functions; the cell below applies only label encoding, to the `category` feature. This ensures that transformation functions such as *min-max scaling* are fitted only on the training data (and not the validation/test data), which prevents data leakage.

In [None]:
# Import transformation functions from Hopsworks.
from hopsworks.hsfs.builtin_transformations import label_encoder

# Map features to transformations.
transformation_functions = [
    label_encoder("category"),
]
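
To make the "fit on training data only" idea concrete, here is a minimal pure-Python sketch. It is an illustration of the principle, not the Hopsworks implementation: the helper names and the sample values are made up, and mapping unseen categories to `-1` is this sketch's own choice.

```python
def fit_min_max(train_values):
    """Learn the scaling statistics from the training values only."""
    return min(train_values), max(train_values)


def apply_min_max(values, fitted):
    """Scale values with previously fitted (min, max) statistics."""
    low, high = fitted
    span = high - low
    return [(v - low) / span if span else 0.0 for v in values]


def fit_label_encoder(train_values):
    """Assign an integer code to every category seen in the training data."""
    return {label: code for code, label in enumerate(sorted(set(train_values)))}


def apply_label_encoder(values, mapping, unknown=-1):
    """Encode values; categories unseen during fitting map to `unknown`."""
    return [mapping.get(v, unknown) for v in values]


# Illustrative data (not taken from the feature store).
train_amounts = [10.0, 20.0, 50.0]
test_amounts = [5.0, 30.0, 80.0]

fitted_stats = fit_min_max(train_amounts)               # statistics come from train only
train_scaled = apply_min_max(train_amounts, fitted_stats)
test_scaled = apply_min_max(test_amounts, fitted_stats)  # may fall outside [0, 1]

category_codes = fit_label_encoder(["Grocery", "Cash Withdrawal", "Grocery"])
test_categories = apply_label_encoder(["Grocery", "Travel"], category_codes)
```

Because the test values are scaled with the training minimum and maximum, they can land outside the `[0, 1]` range. That is expected: the test data never influences the fitted statistics.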

## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

A Feature View lets you define a schema in the form of a query with filters, a model target feature/label, and additional transformation functions.
To create a Feature View you may use `fs.create_feature_view()`. Here you use `fs.get_or_create_feature_view()` instead, which returns the feature view if it already exists and creates it otherwise.

In [None]:
# Get or create the 'transactions_view_fraud_batch_fv' feature view
feature_view = fs.get_or_create_feature_view(
    name='transactions_view_fraud_batch_fv',
    version=1,
    query=selected_features,
    labels=["fraud_label"],
    transformation_functions=transformation_functions,
)

The feature view is now visible in the UI.

![fg-overview](../images/fv_overview.gif)

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query where the projection (set of features) is determined by the parent FeatureView, with an optional snapshot on disk of the data returned by the query.

**Training Dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

The training dataset here is created using the `feature_view.train_test_split()` method, which returns training and test splits.

In [None]:
TEST_SIZE = 0.2

X_train, X_test, y_train, y_test = feature_view.train_test_split(
    test_size=TEST_SIZE,
)
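
Conceptually, a random split assigns each row to exactly one subset. The sketch below shows one way a three-way split over row indices could work; it is a simplified illustration, not how Hopsworks performs the split internally, and `split_indices` with its parameters is a hypothetical helper.

```python
import random


def split_indices(n_rows, validation_size, test_size, seed=42):
    """Randomly partition row indices into train, validation and test lists."""
    if validation_size < 0 or test_size < 0 or validation_size + test_size >= 1:
        raise ValueError("validation_size + test_size must be in [0, 1)")

    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_test = round(n_rows * test_size)
    n_validation = round(n_rows * validation_size)

    test_idx = indices[:n_test]
    validation_idx = indices[n_test:n_test + n_validation]
    train_idx = indices[n_test + n_validation:]
    return train_idx, validation_idx, test_idx


train_idx, validation_idx, test_idx = split_indices(
    1000, validation_size=0.1, test_size=0.2
)
```

Every index ends up in exactly one of the three lists, so the model is never evaluated on rows it was trained on.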

In [None]:
# Sort the X_train DataFrame based on the "datetime" column in ascending order
X_train = X_train.sort_values("datetime")

# Reindex the y_train Series to match the order of rows in the sorted X_train DataFrame
y_train = y_train.reindex(X_train.index)

In [None]:
# Sort the X_test DataFrame based on the "datetime" column in ascending order
X_test = X_test.sort_values("datetime")

# Reindex the y_test Series to match the order of rows in the sorted X_test DataFrame
y_test = y_test.reindex(X_test.index)

In [None]:
# Drop the "datetime" column from the X_train DataFrame along the specified axis (axis=1 means columns)
X_train.drop(["datetime"], axis=1, inplace=True)

# Drop the "datetime" column from the X_test DataFrame along the specified axis (axis=1 means columns)
X_test.drop(["datetime"], axis=1, inplace=True)

In [None]:
# Display the normalized value counts of the y_train Series
y_train.value_counts(normalize=True)

Notice that the distribution is extremely skewed, which is natural considering that fraudulent transactions make up a tiny part of all transactions. You should therefore address the class imbalance. There are many approaches for this, such as weighting the loss function, over- or undersampling, creating synthetic data, or modifying the decision threshold. The simplest method is to supply a class weight parameter to the learning algorithm. The class weight affects how much importance is attached to each class, which in this case means that higher importance is placed on positive (fraudulent) samples.
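
One common heuristic for choosing such a weight is the ratio of negative to positive samples, which is also what the XGBoost documentation suggests for its `scale_pos_weight` parameter. The sketch below computes that ratio on made-up labels; the helper name and the example data are assumptions, and the commented-out line only shows how the value could be passed to the classifier.

```python
from collections import Counter


def positive_class_weight(labels, positive_label=1):
    """Return the negative-to-positive ratio for a sequence of binary labels."""
    counts = Counter(labels)
    n_positive = counts[positive_label]
    n_negative = len(labels) - n_positive
    if n_positive == 0:
        raise ValueError("labels contain no positive samples")
    return n_negative / n_positive


# Illustrative labels: 1% fraud, similar in spirit to a skewed fraud dataset.
example_labels = [0] * 990 + [1] * 10
weight = positive_class_weight(example_labels)

# Possible use with XGBoost (assumes the labels were flattened to a 1-D sequence):
# model = xgb.XGBClassifier(scale_pos_weight=weight)
```

With this weighting, errors on the rare fraudulent class count more heavily during training. The ratio is a starting point and is usually tuned against a validation metric.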

---

## <span style="color:#ff5f27;"> 🧬 Modeling</span>

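Next you will train a model. Note that the cell below creates the classifier with XGBoost's default parameters; to give the positive class a larger weight, you can pass XGBoost's `scale_pos_weight` parameter to `XGBClassifier`.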

In [None]:
# Create an instance of the XGBClassifier
model = xgb.XGBClassifier()

# Fit the classifier on the training data
model.fit(X_train, y_train)

In [None]:
# Predict the training data using the trained classifier
y_pred_train = model.predict(X_train)

# Predict the test data using the trained classifier
y_pred_test = model.predict(X_test)

In [None]:
# Compute f1 score
metrics = {
    "f1_score": f1_score(y_test, y_pred_test, average='macro')
}
metrics

In [None]:
# Generate the confusion matrix using the true labels (y_test) and predicted labels (y_pred_test)
results = confusion_matrix(y_test, y_pred_test)

# Print the confusion matrix
print(results)
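
To see how the confusion matrix relates to the macro F1 score computed earlier, here is a small pure-Python sketch. It follows the scikit-learn layout, where rows are true labels and columns are predicted labels, so a binary matrix reads `[[TN, FP], [FN, TP]]`. The labels below are invented for illustration and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Build a 2x2 confusion matrix for binary labels 0/1.

    Rows are true labels and columns are predicted labels.
    """
    matrix = [[0, 0], [0, 0]]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def f1_for_class(matrix, cls):
    """F1 score when `cls` (0 or 1) is treated as the positive class."""
    other = 1 - cls
    true_positives = matrix[cls][cls]
    false_positives = matrix[other][cls]
    false_negatives = matrix[cls][other]
    denominator = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denominator if denominator else 0.0


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    return (f1_for_class(matrix, 0) + f1_for_class(matrix, 1)) / 2


# Illustrative labels (not model output).
example_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
example_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

example_matrix = confusion_counts(example_true, example_pred)
example_macro_f1 = macro_f1(example_matrix)
```

Because the macro average weights both classes equally, poor performance on the rare fraud class lowers the score noticeably even when almost all normal transactions are classified correctly.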

---

## <span style="color:#ff5f27;">📝 Register model</span>

One of the features in Hopsworks is the model registry. This is where we can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

In [None]:
# Specify the model directory
model_dir = "fraud_batch_model"
images_dir = os.path.join(model_dir, "images")

# Create directories if they don't exist
os.makedirs(images_dir, exist_ok=True)

In [None]:
# Save the trained XGBoost model
model.save_model(os.path.join(model_dir, "model.json"))

In [None]:
# Create a DataFrame from the confusion matrix results
df_cm = pd.DataFrame(
    results, 
    ['True Normal', 'True Fraud'],
    ['Pred Normal', 'Pred Fraud']
)

# Create and save the confusion matrix heatmap
plt.figure(figsize=(8, 6))
cm = sns.heatmap(
    df_cm, 
    annot=True,
    fmt='d',                 # Use integer format for numbers
    cmap='RdPu',             # Use a color palette that works well for binary classification
    annot_kws={'size': 12},  # Increase annotation text size
    cbar=True                # Include color bar
)

# Add title and labels
plt.title('Confusion Matrix for Fraud Detection')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')

# Adjust layout and save
plt.tight_layout()
plt.savefig(os.path.join(images_dir, "confusion_matrix.png"), dpi=300, bbox_inches='tight')
plt.close()

In [None]:
# Get the model registry
mr = project.get_model_registry()

# Create a new model in the model registry
fraud_model = mr.python.create_model(
    name="xgboost_fraud_batch_model",     # Name for the model
    description="Fraud Batch Predictor",  # Description of the model
    metrics=metrics,                      # Metrics used for evaluation
    input_example=X_train.sample(),       # Example input data for reference
    feature_view=feature_view,            # Add a feature view to the model
)

# Upload the contents of the local model directory to the model registry
fraud_model.save(model_dir)

---
## <span style="color:#ff5f27;">⏭️ **Next:** Part 03: Batch Inference</span>

In the following notebook you will use your model for batch inference.
