![hopsworks_logo](../../images/hopsworks_logo.png)

# Part 02: Training Data & Feature views

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/fraud_online/2_feature_view_creation.ipynb)

This is the second part of the quick start series of tutorials about the Hopsworks Feature Store. This notebook explains how to read from a feature group and create a training dataset within the feature store.

## 🗒️ In this notebook we will see how to create a training dataset from the feature groups:
1. **Select the features** we want to train our model on,
2. **Define how the features should be preprocessed,**
3. **Create a dataset** for training a fraud detection model.

![tutorial-flow](../../images/02_training-dataset.png) 

In [1]:
!pip install -U hopsworks --quiet

In [2]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/164






## <span style="color:#ff5f27;"> 🔪 Feature Selection </span>

We start by selecting all the features we want to include for model training/inference.

In [3]:
# Load feature groups.
trans_fg = fs.get_feature_group('transactions_fraud_online_fg', version=1)
profile_online_fg = fs.get_feature_group('profile_fraud_online_fg', version=1)

# Select features for training data.
ds_query = trans_fg.select(
    ["fraud_label", "loc_delta_t_plus_1", "loc_delta_t_minus_1",
     "time_delta_t_plus_1", "time_delta_t_minus_1", "country"]
).join(profile_online_fg.select_all())

In [4]:
# uncomment this if you would like to view query results
#ds_query.show(5)

Recall that you computed the features in the transactions feature group (loaded here as `trans_fg`). If you had created multiple feature groups with identical schemas for different window lengths and wanted to include them all in the join, you would need to pass a prefix argument to the join to avoid feature name clashes. See the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/query_api/#join) for more details.
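To make the name clash concrete, here is a minimal pure-Python sketch that does not use the Hopsworks API. The helper `join_rows`, the feature name `trans_volume_mavg`, and the prefix `w12h_` are hypothetical and only illustrate why a prefix is needed when two joined sources share a schema.

```python
def join_rows(left, right, on, prefix=""):
    """Merge two feature rows on a shared key, prefixing right-hand feature names."""
    if left[on] != right[on]:
        raise ValueError("rows do not share the same join key")
    merged = dict(left)
    for name, value in right.items():
        if name == on:
            continue  # the join key is kept once, from the left row
        merged[prefix + name] = value
    return merged


# Two hypothetical window aggregations with an identical schema.
window_4h = {"cc_num": 1234, "trans_volume_mavg": 10.5}
window_12h = {"cc_num": 1234, "trans_volume_mavg": 42.0}

# Without a prefix the right-hand value silently overwrites the left-hand one.
without_prefix = join_rows(window_4h, window_12h, on="cc_num")

# With a prefix both features survive under distinct names.
with_prefix = join_rows(window_4h, window_12h, on="cc_num", prefix="w12h_")

print(without_prefix)
print(with_prefix)
```

In Hopsworks itself the same effect is obtained through the prefix argument of the join described in the linked documentation.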

## <span style="color:#ff5f27;"> 🤖 Transformation Functions </span>

We will preprocess our data using *min-max scaling* on numerical features and *label encoding* on categorical features. To do this, we simply define a mapping between our features and transformation functions. Attaching this mapping to the feature view ensures that transformation functions such as *min-max scaling* are fitted only on the training data (and not on the validation/test data), so there is no data leakage.

In [5]:
# Load the transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")

# Map features to transformation functions.
transformation_functions = {
    "loc_delta_t_plus_1": min_max_scaler, 
    "loc_delta_t_minus_1": min_max_scaler, 
    "time_delta_t_plus_1": min_max_scaler, 
    "time_delta_t_minus_1": min_max_scaler,
    "country": label_encoder,
    "gender": label_encoder,
}
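The idea of fitting transformations on the training data only can be sketched in plain Python. This is an illustration of the technique, not the implementation of the built-in `min_max_scaler` and `label_encoder`; the helpers `fit_min_max` and `fit_label_encoder` and the sample values are hypothetical.

```python
def fit_min_max(values):
    """Return a scaler whose min/max come only from the values it was fitted on."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant features
    return lambda x: (x - lo) / span


def fit_label_encoder(values):
    """Map each distinct category seen during fitting to an integer code."""
    codes = {category: code for code, category in enumerate(sorted(set(values)))}
    return lambda x: codes[x]


# Hypothetical numerical feature: statistics are taken from the training rows only.
train_values = [10.0, 20.0, 30.0]
test_values = [25.0, 40.0]

scale = fit_min_max(train_values)
scaled_train = [scale(x) for x in train_values]
scaled_test = [scale(x) for x in test_values]  # reuses the training min/max

# Hypothetical categorical feature.
encode = fit_label_encoder(["US", "SE", "US"])
encoded = [encode(c) for c in ["SE", "US"]]

print(scaled_train, scaled_test, encoded)
```

Because the scaler never sees the test rows, a test value can fall outside the `[0, 1]` range; that is the expected price of avoiding leakage.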

## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

A Feature View defines a schema in the form of a query (optionally with filters), the model's target feature/label, and any additional transformation functions.
To create a Feature View we can use `fs.create_feature_view()`.

In [6]:
feature_view = fs.create_feature_view(
    name='transactions_fraud_online_fv',
    query=ds_query,
    labels=["fraud_label"],
    transformation_functions=transformation_functions
)

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/164/fs/106/fv/transactions_fraud_online_fv/version/1


To view and explore data in the feature view, we can retrieve batch data using the `get_batch_data()` method.

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query whose projection (set of features) is determined by the parent Feature View, with an optional on-disk snapshot of the data returned by the query.

**A training dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.
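As a rough illustration of the three splits, the following pure-Python sketch shuffles a list of rows and cuts it into training, validation, and test subsets. The function `three_way_split`, its default fractions, and the seed are assumptions made for this example; it is not how Hopsworks materialises splits.

```python
import random


def three_way_split(rows, validation_size=0.2, test_size=0.1, seed=42):
    """Shuffle rows reproducibly and cut them into train/validation/test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * test_size)
    n_validation = round(len(shuffled) * validation_size)

    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


rows = list(range(100))  # stand-in for 100 training rows
train, validation, test = three_way_split(rows)
print(len(train), len(validation), len(test))
```

A random split like this is only one option. The cells in this notebook instead separate the data by event time, using January and February for training and March for testing.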

A training dataset is created from the feature view using the `feature_view.create_training_data()` method, as in the cells below.

**With the feature view APIs we can also create training datasets based on event time filters by specifying `start_time` and `end_time`.**



In [7]:
from datetime import datetime


date_format = "%Y-%m-%d %H:%M:%S"

In [8]:
# Create a training dataset based on an event time filter (timestamps in milliseconds).
start_time = int(float(datetime.strptime("2022-01-01 00:00:01", date_format).timestamp()) * 1000)
end_time = int(float(datetime.strptime("2022-02-28 23:59:59", date_format).timestamp()) * 1000)

td_jan_feb_version, td_job = feature_view.create_training_data(
        start_time = start_time,
        end_time = end_time,    
        description = 'transactions fraud online training dataset jan/feb',
        data_format = "csv",
        coalesce = True,
        write_options = {'wait_for_job': True},
    )

Training dataset job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/164/jobs/named/transactions_fraud_online_fv_1_1_create_fv_td_07082022232615/executions
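One caveat about the timestamp conversion used above: `datetime.strptime` returns a naive datetime, and calling `.timestamp()` on it interprets the value in the local timezone of the machine running the notebook. The sketch below pins the conversion to UTC so that the filter boundaries are reproducible. The helper name `to_epoch_ms` is hypothetical, and the choice of UTC is an assumption about how the event times should be interpreted.

```python
from datetime import datetime, timezone

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_epoch_ms(text, tz=timezone.utc):
    """Parse a timestamp string and return milliseconds since the Unix epoch."""
    naive = datetime.strptime(text, DATE_FORMAT)
    aware = naive.replace(tzinfo=tz)  # attach an explicit timezone instead of the local one
    return int(aware.timestamp() * 1000)


start_ms = to_epoch_ms("2022-01-01 00:00:01")
end_ms = to_epoch_ms("2022-02-28 23:59:59")
print(start_ms, end_ms)
```

The resulting integers can be passed as `start_time` and `end_time` in the same way as the values computed in the cell above.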




In [9]:
start_time = int(float(datetime.strptime("2022-03-01 00:00:01", date_format).timestamp()) * 1000)
end_time = int(float(datetime.strptime("2022-03-31 23:59:59", date_format).timestamp()) * 1000)

td_mar_version, td_job = feature_view.create_training_data(
        start_time = start_time,
        end_time = end_time,
        description = 'transactions fraud online training dataset mar',
        data_format = "csv",
        coalesce = True,
        write_options = {'wait_for_job': True},
 )


Training dataset job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/164/jobs/named/transactions_fraud_online_fv_1_2_create_fv_td_07082022232728/executions




## <span style="color:#ff5f27;"> 🪝 Training Dataset Retrieval </span>

To retrieve training data, either from storage (if already materialised) or directly from the feature groups, we can use the `get_training_dataset_splits` or `get_training_dataset` methods. If no version is provided, or the provided version does not exist yet, a new version of the training data is created according to the given arguments and returned as a dataframe. If the provided version already exists, the training data is read from storage or from the feature groups and returned as a dataframe. If a split is provided, only that split is read. The cells below call `get_training_data` on the feature view with the versions returned when the training datasets were created.

In [10]:
train_jan_feb_x, train_jan_feb_y = feature_view.get_training_data(td_jan_feb_version)

In [11]:
test_mar_x, test_mar_y = feature_view.get_training_data(td_mar_version)

In [12]:
train_jan_feb_x

| | loc_delta_t_plus_1 | loc_delta_t_minus_1 | time_delta_t_plus_1 | time_delta_t_minus_1 | country | cc_num | gender |
|---|---|---|---|---|---|---|---|
| 0 | 0.213599 | 0.161837 | 0.024191 | 0.078029 | 0 | 4577712371266045 | 0 |
| 1 | 0.018723 | 0.018720 | 0.009129 | 0.035515 | 0 | 4454908897243389 | 0 |
| 2 | 0.119281 | 0.123407 | 0.004962 | 0.020359 | 0 | 4059852277897159 | 0 |
| 3 | 0.022917 | 0.000014 | 0.020810 | 0.029762 | 0 | 4084075340653842 | 1 |
| 4 | 0.113309 | 0.084312 | 0.049007 | 0.126181 | 0 | 4160806832774853 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 272373 | 0.180734 | 0.177742 | 0.068752 | 0.002731 | 0 | 4797751386008277 | 1 |
| 272374 | 0.286531 | 0.120248 | 0.056238 | 0.024252 | 0 | 4744684937388746 | 0 |
| 272375 | 0.068666 | 0.109803 | 0.006902 | 0.014586 | 0 | 4118114933437576 | 1 |
| 272376 | 0.159406 | 0.214929 | 0.028729 | 0.010537 | 0 | 4685966460215706 | 1 |


In [13]:
test_mar_x

| | loc_delta_t_plus_1 | loc_delta_t_minus_1 | time_delta_t_plus_1 | time_delta_t_minus_1 | country | cc_num | gender |
|---|---|---|---|---|---|---|---|
| 0 | 0.068857 | 0.076588 | 0.031517 | 0.040856 | 0 | 4816662071803939 | 0 |
| 1 | 0.189034 | 0.009323 | 0.026373 | 0.272612 | 0 | 4252647146863787 | 0 |
| 2 | 0.098402 | 0.015487 | 0.005863 | 0.000890 | 0 | 4324395877528542 | 1 |
| 3 | 0.217565 | 0.096638 | 0.019229 | 0.007014 | 0 | 4964834484564644 | 1 |
| 4 | 0.104052 | 0.049225 | 0.153346 | 0.081898 | 0 | 4288937182842045 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 84798 | 0.016615 | 0.097850 | 0.025963 | 0.022055 | 0 | 4434061688613280 | 0 |
| 84799 | 0.229835 | 0.234828 | 0.006403 | 0.129226 | 0 | 4308142387522957 | 0 |
| 84800 | 0.189890 | 0.053031 | 0.237191 | 0.211166 | 0 | 4548082515193901 | 0 |
| 84801 | 0.093655 | 0.077059 | 0.087958 | 0.015909 | 0 | 4780705085153569 | 0 |


The feature view and training datasets are now visible in the UI.

![fg-overview](../../images/fv_overview.gif)

### ⛓️ <b> Lineage </b> 
For every feature group and feature view you can inspect the relations between the abstractions: which feature group was used to create which training dataset, and which model that training dataset was used in.
This gives a clear understanding of how each element fits into the pipeline.

![provenance](../../images/provenance.gif)

## <span style="color:#ff5f27;">⏭️ **Next:** Part 03 </span>

In the following notebook, you will train a model on the dataset you created in this notebook.