# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 03: Model training & UI Exploration</span>

<span style="font-width:bold; font-size: 1.4rem;">In this last notebook, we will train a model on the dataset we created in the previous tutorial. We will train our model using standard Python and Scikit-learn, although it could just as well be trained with other machine learning frameworks such as PySpark, TensorFlow, and PyTorch. We will also show some of the exploration that can be done in Hopsworks, notably the search functions and the lineage. </span>

## **🗒️ This notebook is divided into 3 main sections:**
1. **Load the training data**
2. **Train the model**
3. **Explore feature groups and views** via the UI.

![tutorial-flow](../images/03_model.png)

In [1]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/124






## <span style="color:#ff5f27;"> ✨ Load Training Data </span>

First, we'll need to fetch the training dataset that we created in the previous notebook. We will use the January - February data for training and testing.

In [2]:
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Load data.
feature_view = fs.get_feature_view("transactions_view", 1)
X_train, y_train, X_val, y_val, X_test, y_test = feature_view.get_train_validation_test_splits(1)

We will train a model to predict `fraud_label` given the rest of the features.

Let's check the distribution of our target label.

In [3]:
y_train.value_counts(normalize=True)

fraud_label
0              0.998545
1              0.001455
dtype: float64

Notice that the distribution is extremely skewed, which is natural considering that fraudulent transactions make up a tiny part of all transactions. Thus we should somehow address the class imbalance. There are many approaches for this, such as weighting the loss function, over- or undersampling, creating synthetic data, or modifying the decision threshold. In this example, we'll use the simplest method which is to just supply a class weight parameter to our learning algorithm. The class weight will affect how much importance is attached to each class, which in our case means that higher importance will be placed on positive (fraudulent) samples.
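As a rough illustration of class weighting, the sketch below derives weights that are inversely proportional to class frequency, which is the same heuristic scikit-learn applies when `class_weight="balanced"` is passed. The helper `balanced_class_weights` and the toy labels are assumptions made for illustration; they are not part of the tutorial.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (hypothetical): 998 legitimate transactions and 2 fraudulent ones.
toy_labels = [0] * 998 + [1] * 2
weights = balanced_class_weights(toy_labels)
print(weights)
```

A dictionary of this shape could be passed as `class_weight` to `LogisticRegression` instead of a hand-picked one.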

## <span style="color:#ff5f27;"> 🏃 Train Model</span>

Next we'll train a model. Here, we set the class weight of the positive class to 0.9 and that of the negative class to 0.1, so the positive class is weighted nine times as heavily as the negative class.

In [4]:
# Train model.
pos_class_weight = 0.9
clf = LogisticRegression(class_weight={0: 1.0 - pos_class_weight, 1: pos_class_weight}, solver='liblinear')
clf.fit(X_train, y_train)



LogisticRegression(class_weight={0: 0.09999999999999998, 1: 0.9},
                   solver='liblinear')

Let's see how well it performs on our validation data.

In [5]:
from sklearn.metrics import classification_report

preds = clf.predict(X_val)

print(classification_report(y_val, preds))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00     21132
           1       0.00      0.00      0.00        25

    accuracy                           1.00     21157
   macro avg       0.50      0.50      0.50     21157
weighted avg       1.00      1.00      1.00     21157
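
In the report above, recall for class 1 is 0.00, so the model caught none of the 25 fraudulent transactions in the validation set. One of the remedies mentioned earlier, modifying the decision threshold, can be sketched as follows. The probabilities and labels are made-up values for illustration; in the notebook they would plausibly come from `clf.predict_proba(X_val)[:, 1]` and `y_val`, and the helper names are hypothetical.

```python
def predict_with_threshold(probabilities, threshold):
    """Label a sample as fraud (1) when its fraud probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up fraud probabilities and true labels (hypothetical data).
fraud_probs = [0.02, 0.10, 0.35, 0.40, 0.05, 0.30]
true_labels = [0, 0, 1, 1, 0, 0]

default_result = precision_recall(true_labels, predict_with_threshold(fraud_probs, 0.5))
lowered_result = precision_recall(true_labels, predict_with_threshold(fraud_probs, 0.3))
print("threshold 0.5:", default_result)
print("threshold 0.3:", lowered_result)
```

Lowering the threshold trades precision for recall, so the value is best chosen on the validation split and not on the test split.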





## <span style="color:#ff5f27;">  Use the model to score transactions </span>
We trained the model on January - February data. Now let's retrieve batch data through the end of March (the window in the cell below runs from 2022-01-03 to 2022-03-31) and score whether the transactions are fraudulent.


In [6]:
from datetime import datetime

date_format = "%Y-%m-%d %H:%M:%S"

# Convert the event-time window boundaries to epoch milliseconds.
start_time = int(datetime.strptime("2022-01-03 00:00:01", date_format).timestamp() * 1000)
end_time = int(datetime.strptime("2022-03-31 23:59:59", date_format).timestamp() * 1000)

# Retrieve the batch of transactions that fall inside the window.
march_transactions = feature_view.get_batch_data(start_time=start_time, end_time=end_time)



2022-06-20 09:44:08,897 INFO: USE `robin100_featurestore`
2022-06-20 09:44:09,915 INFO: WITH right_fg0 AS (SELECT *
FROM (SELECT `fg1`.`category` `category`, `fg1`.`amount` `amount`, `fg1`.`age_at_transaction` `age_at_transaction`, `fg1`.`days_until_card_expires` `days_until_card_expires`, `fg1`.`loc_delta` `loc_delta`, `fg1`.`cc_num` `join_pk_cc_num`, `fg1`.`datetime` `join_evt_datetime`, `fg0`.`trans_volume_mstd` `trans_volume_mstd`, `fg0`.`trans_volume_mavg` `trans_volume_mavg`, `fg0`.`trans_freq` `trans_freq`, `fg0`.`loc_delta_mavg` `loc_delta_mavg`, RANK() OVER (PARTITION BY `fg1`.`cc_num`, `fg1`.`datetime` ORDER BY `fg0`.`datetime` DESC) pit_rank_hopsworks
FROM `robin100_featurestore`.`transactions_1` `fg1`
INNER JOIN `robin100_featurestore`.`transactions_4h_aggs_1` `fg0` ON `fg1`.`cc_num` = `fg0`.`cc_num` AND `fg1`.`datetime` >= `fg0`.`datetime`
WHERE `fg1`.`datetime` >= 1641164401000 AND `fg1`.`datetime` <= 1648763999000) NA
WHERE `pit_rank_hopsworks` = 1) (SELECT `right_fg0`.`
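
The conversion in the cell above relies on `datetime.timestamp()`, which interprets a naive datetime in the machine's local time zone, so the epoch values sent to the feature store can differ between machines. A minimal sketch of a time-zone-explicit variant is shown below; the helper `to_epoch_ms` is hypothetical, and treating the boundaries as UTC is an assumption.

```python
from datetime import datetime, timezone

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_epoch_ms(timestamp_str, tz=timezone.utc):
    """Parse a timestamp string and return epoch milliseconds in the given time zone."""
    parsed = datetime.strptime(timestamp_str, DATE_FORMAT).replace(tzinfo=tz)
    return int(parsed.timestamp() * 1000)


# Window boundaries interpreted as UTC (an assumption, see above).
start_ms = to_epoch_ms("2022-01-03 00:00:01")
end_ms = to_epoch_ms("2022-03-31 23:59:59")
print(start_ms, end_ms)
```

Values computed this way could be passed as `start_time` and `end_time` to `feature_view.get_batch_data`, as in the cell above.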



In [7]:
march_transactions

| | category | amount | age_at_transaction | days_until_card_expires | loc_delta | trans_volume_mstd | trans_volume_mavg | trans_freq | loc_delta_mavg |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 0.003120 | 0.091597 | 0.139747 | 0.000000 | 0.003120 | 0.003120 | 0.003120 | 0.000000 |
| 1 | 2 | 0.002173 | 0.091615 | 0.139474 | 0.122200 | 0.002173 | 0.002173 | 0.002173 | 0.135041 |
| 2 | 4 | 0.000008 | 0.091622 | 0.139367 | 0.120125 | 0.000008 | 0.000008 | 0.000008 | 0.132748 |
| 3 | 4 | 0.000047 | 0.091628 | 0.139291 | 0.000000 | 0.000028 | 0.000028 | 0.000028 | 0.066374 |
| 4 | 4 | 0.000659 | 0.091725 | 0.137862 | 0.040270 | 0.000659 | 0.000659 | 0.000659 | 0.044502 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 102997 | 0 | 0.000736 | 0.357364 | 0.481294 | 0.228904 | 0.000736 | 0.000736 | 0.000736 | 0.252957 |
| 102998 | 0 | 0.002816 | 0.357399 | 0.480778 | 0.166719 | 0.002816 | 0.002816 | 0.002816 | 0.184238 |
| 102999 | 0 | 0.002934 | 0.357403 | 0.480721 | 0.166874 | 0.002875 | 0.002875 | 0.002875 | 0.184323 |
| 103000 | 0 | 0.010322 | 0.357470 | 0.479735 | 0.001149 | 0.010322 | 0.010322 | 0.010322 | 0.001270 |


In [8]:
predictions = clf.predict(march_transactions)

In [9]:
predictions

array([0, 0, 0, ..., 0, 0, 0])

## <span style="color:#ff5f27;"> 👓  Exploration</span>
In the Hopsworks feature store, the metadata allows for multiple levels of exploration and review. Here we will show a few of those capabilities.

### 🔎 <b>Search</b> 
Using the search function in the UI, you can query any aspect of the feature groups, feature views, and training data that were created previously.

### 📊 <b>Statistics</b> 
We can also enable statistics for one or all of the feature groups.

In [10]:
trans_fg = fs.get_feature_group("transactions", version = 1)
trans_fg.statistics_config = {
    "enabled": True,
    "histograms": True,
    "correlations": True
}

trans_fg.update_statistics_config()
trans_fg.compute_statistics()

Statistics Job started successfully, you can follow the progress at https://c.app.hopsworks.ai/p/124/jobs/named/transactions_1_compute_stats_20062022074522/executions


![fg-statistics](../images/fg_statistics.gif)


### ⛓️ <b> Lineage </b> 
For every feature group and feature view, you can inspect how the abstractions relate to one another: which feature group produced which training dataset, and which model uses that dataset.
This gives a clear understanding of how each element fits into the pipeline.

## <span style="color:#ff5f27;"> 🎁  Wrapping things up </span>

We have now trained a simple model on training data that we created in the feature store. This concludes the first module and the introduction to the core aspects of the feature store. In the second module, we will introduce streaming and external feature groups for a similar fraud use case.