## Dataset Creation

In this notebook, we will create the actual dataset that we will train our model on. In particular, we will:
1. Select the features we want to train our model on.
2. Specify how the features should be preprocessed.
3. Create a dataset split for training and validation data.

![tutorial-flow](images/create_training_dataset.png)

In [1]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


### Feature Selection

We start by selecting all the features we want to include for model training/inference.

In [2]:
# Load feature groups.
trans_fg = fs.get_feature_group("transactions", 1)
window_aggs_fg = fs.get_feature_group("transactions_4h_aggs", 1)

# Select features for training data.
ds_query = trans_fg.select(["fraud_label", "category", "amount", "age_at_transaction", "days_until_card_expires", "loc_delta"])\
    .join(window_aggs_fg.select_except(["cc_num"]), on="cc_num")

ds_query.show(5)

2022-05-31 11:42:41,718 INFO: USE `fraud_batch_online_featurestore`
2022-05-31 11:42:42,647 INFO: WITH right_fg0 AS (SELECT *
FROM (SELECT `fg1`.`fraud_label` `fraud_label`, `fg1`.`category` `category`, `fg1`.`amount` `amount`, `fg1`.`age_at_transaction` `age_at_transaction`, `fg1`.`days_until_card_expires` `days_until_card_expires`, `fg1`.`loc_delta` `loc_delta`, `fg1`.`cc_num` `join_pk_cc_num`, `fg1`.`datetime` `join_evt_datetime`, `fg0`.`trans_volume_mstd` `trans_volume_mstd`, `fg0`.`trans_volume_mavg` `trans_volume_mavg`, `fg0`.`trans_freq` `trans_freq`, `fg0`.`loc_delta_mavg` `loc_delta_mavg`, RANK() OVER (PARTITION BY `fg1`.`cc_num`, `fg1`.`datetime` ORDER BY `fg0`.`datetime` DESC) pit_rank_hopsworks
FROM `fraud_batch_online_featurestore`.`transactions_1` `fg1`
INNER JOIN `fraud_batch_online_featurestore`.`transactions_4h_aggs_1` `fg0` ON `fg1`.`cc_num` = `fg0`.`cc_num` AND `fg1`.`datetime` >= `fg0`.`datetime`) NA
WHERE `pit_rank_hopsworks` = 1) (SELECT `right_fg0`.`fraud_label` 

Unnamed: 0,fraud_label,category,amount,age_at_transaction,days_until_card_expires,loc_delta,trans_volume_mstd,trans_volume_mavg,trans_freq,loc_delta_mavg
0,0,Grocery,93.51,25.334094,175.91228,0.0,93.51,93.51,93.51,0.0
1,0,Domestic Transport,65.14,25.335632,175.350486,0.319574,65.14,65.14,65.14,0.319574
2,0,Grocery,0.26,25.336235,175.130347,0.314148,0.26,0.26,0.26,0.314148
3,0,Grocery,1.43,25.33666,174.975058,0.0,0.845,0.845,0.845,0.157074
4,0,Grocery,19.75,25.34471,172.034664,0.105313,19.75,19.75,19.75,0.105313


Recall that we computed the features in `transactions_4h_aggs` using 4-hour aggregates. If we had created multiple feature groups with identical schemas for different window lengths and wanted to include them all in the join, we would need to pass a prefix argument to the join to avoid feature name clashes. See the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/query_api/#join) for more details.
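To see why a prefix matters, here is a minimal sketch in plain Python that does not use the Hopsworks API. The 12-hour feature group, the `w4h_`/`w12h_` prefixes, the `with_prefix` helper, and all values are hypothetical and only illustrate the naming problem.

```python
# Hypothetical feature rows for one card from two window lengths.
# Both groups share the same schema, so their feature names collide.
aggs_4h = {"trans_volume_mavg": 65.14, "trans_freq": 2.0}
aggs_12h = {"trans_volume_mavg": 48.30, "trans_freq": 5.0}


def with_prefix(features, prefix):
    """Return a copy of `features` with `prefix` prepended to every name."""
    return {f"{prefix}{name}": value for name, value in features.items()}


# Merging without a prefix: in a plain dict, the second group
# silently overwrites the first because the names are identical.
clashing = {**aggs_4h, **aggs_12h}

# Merging with one prefix per window length keeps every feature distinct.
joined = {**with_prefix(aggs_4h, "w4h_"), **with_prefix(aggs_12h, "w12h_")}

print(sorted(clashing))
print(sorted(joined))
```

In this sketch the unprefixed merge keeps only two of the four features, while the prefixed merge keeps all four under unambiguous names.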

### Transformation Functions
A transformation function is a mathematical mapping of input data. It may be stateful, meaning that it requires statistics from the parent feature view (such as the number of instances of a category, or the mean value of a numerical feature).

We will preprocess our data using *min-max scaling* on numerical features and *label encoding* on categorical features. To do this, we simply define a mapping between our features and transformation functions. With this mapping, transformation functions such as *min-max scaling* are fitted only on the training data (and not the validation/test data), which prevents data leakage.

In [3]:
# Load transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")

# Map features to transformations.
transformation_functions = {
    "category": label_encoder,
    "amount": min_max_scaler,
    "trans_volume_mavg": min_max_scaler,
    "trans_volume_mstd": min_max_scaler,
    "trans_freq": min_max_scaler,
    "loc_delta": min_max_scaler,
    "loc_delta_mavg": min_max_scaler,
    "age_at_transaction": min_max_scaler,
    "days_until_card_expires": min_max_scaler,
}
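The leakage point can be made concrete with a minimal sketch in plain Python. This is not the Hopsworks implementation of `min_max_scaler` or `label_encoder`; the helper names `fit_min_max` and `fit_label_encoder`, the sample values, and the choice to number categories in sorted order are all assumptions made for illustration.

```python
def fit_min_max(train_values):
    """Fit a min-max scaler on training values only and return the transform."""
    lo, hi = min(train_values), max(train_values)
    span = hi - lo

    def transform(x):
        # A constant feature has no range, so map it to 0.0.
        return 0.0 if span == 0 else (x - lo) / span

    return transform


def fit_label_encoder(train_values):
    """Map each category seen in training to an integer (sorted order assumed)."""
    mapping = {label: i for i, label in enumerate(sorted(set(train_values)))}

    def transform(label):
        return mapping[label]

    return transform


# Hypothetical training and validation samples.
train_amounts = [0.26, 1.43, 19.75, 93.51]
validation_amounts = [65.14, 120.0]

scale_amount = fit_min_max(train_amounts)  # statistics come from train only
encode_category = fit_label_encoder(["Grocery", "Domestic Transport", "Grocery"])

print([scale_amount(x) for x in train_amounts])
print([scale_amount(x) for x in validation_amounts])
print(encode_category("Grocery"))
```

Because the minimum and maximum come from the training values alone, a validation value above the training maximum scales to more than 1. That is the expected behavior when no validation statistics leak into the fitted transformation.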

### Feature View Creation

In Hopsworks, you write features to feature groups (where the features are stored) and you read features from feature views. A feature view is a logical view over features stored in feature groups, and it typically contains the features used by a specific model. This way, feature views enable features stored in different feature groups to be reused across many different models.

In [4]:
feature_view = fs.create_feature_view(
    name='transactions_view',
    query=ds_query,
    label=["fraud_label"],
    transformation_functions=transformation_functions
)

To view and explore the data in the feature view, we can retrieve batch data using the `get_batch_data()` method.

In [5]:
feature_view.get_batch_data().head(5)

2022-05-31 11:43:37,381 INFO: USE `fraud_batch_online_featurestore`
2022-05-31 11:43:38,471 INFO: WITH right_fg0 AS (SELECT *
FROM (SELECT `fg1`.`category` `category`, `fg1`.`amount` `amount`, `fg1`.`age_at_transaction` `age_at_transaction`, `fg1`.`days_until_card_expires` `days_until_card_expires`, `fg1`.`loc_delta` `loc_delta`, `fg1`.`cc_num` `join_pk_cc_num`, `fg1`.`datetime` `join_evt_datetime`, `fg0`.`trans_volume_mstd` `trans_volume_mstd`, `fg0`.`trans_volume_mavg` `trans_volume_mavg`, `fg0`.`trans_freq` `trans_freq`, `fg0`.`loc_delta_mavg` `loc_delta_mavg`, RANK() OVER (PARTITION BY `fg1`.`cc_num`, `fg1`.`datetime` ORDER BY `fg0`.`datetime` DESC) pit_rank_hopsworks
FROM `fraud_batch_online_featurestore`.`transactions_1` `fg1`
INNER JOIN `fraud_batch_online_featurestore`.`transactions_4h_aggs_1` `fg0` ON `fg1`.`cc_num` = `fg0`.`cc_num` AND `fg1`.`datetime` >= `fg0`.`datetime`) NA
WHERE `pit_rank_hopsworks` = 1) (SELECT `right_fg0`.`category` `category`, `right_fg0`.`amount` `amou

Unnamed: 0,category,amount,age_at_transaction,days_until_card_expires,loc_delta,trans_volume_mstd,trans_volume_mavg,trans_freq,loc_delta_mavg
0,Grocery,93.51,25.334094,175.91228,0.0,93.51,93.51,93.51,0.0
1,Domestic Transport,65.14,25.335632,175.350486,0.319574,65.14,65.14,65.14,0.319574
2,Grocery,0.26,25.336235,175.130347,0.314148,0.26,0.26,0.26,0.314148
3,Grocery,1.43,25.33666,174.975058,0.0,0.845,0.845,0.845,0.157074
4,Grocery,19.75,25.34471,172.034664,0.105313,19.75,19.75,19.75,0.105313


#### Training Dataset Creation

In Hopsworks, training data is a query whose projection (set of features) is determined by the parent feature view, with an optional snapshot on disk of the data returned by the query.

A training dataset may contain splits such as:
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

A training dataset is created using the `feature_view.create_training_dataset()` method.

In [6]:
td_random_version, td_job = feature_view.create_training_dataset(
    description = 'transactions_dataset_random_splitted',
    data_format = 'csv',
    splits = {'train': 80, 'validation': 20},
    train_split = "train",
    write_options = {'wait_for_job': True},
    coalesce = True
)

Training dataset job started successfully, you can follow the progress at https://hopsworks.glassfish.service.consul:8182/p/129/jobs/named/transactions_view_1_1_create_fv_td_31052022114428/executions
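As a rough illustration of what a random 80/20 split means, the following sketch shuffles row identifiers and cuts them by ratio. It is not how Hopsworks performs the split internally; the `random_split` helper, the fixed seed, and the use of integer row ids are assumptions made for this example.

```python
import random


def random_split(rows, ratios, seed=42):
    """Shuffle `rows` and cut them into named splits proportional to `ratios`."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    names = list(ratios)
    total = sum(ratios.values())
    splits, start = {}, 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = len(shuffled)  # last split takes whatever remains
        else:
            end = start + round(len(shuffled) * ratios[name] / total)
        splits[name] = shuffled[start:end]
        start = end
    return splits


# 100 hypothetical row ids split with the same ratios as above.
splits = random_split(range(100), {"train": 80, "validation": 20})
print({name: len(part) for name, part in splits.items()})
```

Every row lands in exactly one split, so the validation set never overlaps with the rows used for training.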




### Create training datasets based on an event time filter

In [7]:
from datetime import datetime
date_format = "%Y-%m-%d %H:%M:%S"

#### Create training dataset from January to February data 

In [8]:
start_time = int(float(datetime.strptime("2022-01-01 00:00:01", date_format).timestamp()) * 1000)
end_time = int(float(datetime.strptime("2022-02-28 23:59:59", date_format).timestamp()) * 1000)
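The conversion above yields Unix timestamps in milliseconds. Note that `timestamp()` on a naive `datetime` interprets it in the local timezone of the machine running the notebook, so the result can differ between environments. The sketch below pins the timezone explicitly; the `to_epoch_ms` helper is hypothetical, and treating the event times as UTC is an assumption that should be checked against how the feature group was populated.

```python
from datetime import datetime, timezone

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_epoch_ms(text, tz=timezone.utc):
    """Parse `text` and return milliseconds since the Unix epoch in `tz`."""
    parsed = datetime.strptime(text, DATE_FORMAT).replace(tzinfo=tz)
    return int(parsed.timestamp() * 1000)


# Same boundaries as the January to February window, interpreted as UTC.
start_ms = to_epoch_ms("2022-01-01 00:00:01")
end_ms = to_epoch_ms("2022-02-28 23:59:59")
print(start_ms, end_ms)
```

With an explicit timezone, the same boundary strings always map to the same millisecond values regardless of where the code runs.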

In [9]:
td_jan_feb_version, td_job = feature_view.create_training_dataset(
    description = 'transactions_dataset_jan_feb',
    data_format = 'csv',
    write_options = {'wait_for_job': True},
    coalesce = True,
    start_time = start_time,
    end_time = end_time,
)

Training dataset job started successfully, you can follow the progress at https://hopsworks.glassfish.service.consul:8182/p/129/jobs/named/transactions_view_1_2_create_fv_td_31052022114615/executions




#### Create training dataset from March data

In [10]:
start_time = int(float(datetime.strptime("2022-03-01 00:00:01", date_format).timestamp()) * 1000)
end_time = int(float(datetime.strptime("2022-03-31 23:59:59", date_format).timestamp()) * 1000)

In [11]:
td_mar_version, td_job = feature_view.create_training_dataset(
    description = 'transactions_dataset_mar',
    data_format = 'csv',
    write_options = {'wait_for_job': True},
    coalesce = True,
    start_time = start_time,
    end_time = end_time,
)

Training dataset job started successfully, you can follow the progress at https://hopsworks.glassfish.service.consul:8182/p/129/jobs/named/transactions_view_1_3_create_fv_td_31052022114743/executions




#### Training Dataset retrieval
To retrieve training data from storage (already materialised) or directly from feature groups, we can use the `get_training_dataset_splits` or `get_training_dataset` methods. If no version is provided, or the provided version does not exist yet, a new version of the training data is created according to the given arguments and returned as a dataframe. If the provided version already exists, the training data is read from storage or from the feature groups and returned as a dataframe. If a split is provided, only that split is read.

In [12]:
td_version, td_df_random = feature_view.get_training_dataset_splits({'train': 80, 'validation': 20}, start_time=None, end_time=None, version = td_random_version)



In [13]:
td_df_random

{'train':        fraud_label  category        amount  age_at_transaction  \
 0                0         0  0.000000e+00            0.010858   
 1                0         0  0.000000e+00            0.047378   
 2                0         0  0.000000e+00            0.063759   
 3                0         0  0.000000e+00            0.954661   
 4                0         0  3.336858e-07            0.364075   
 ...            ...       ...           ...                 ...   
 84904            1         5  4.983598e-03            0.206613   
 84905            1         5  7.304049e-03            0.909331   
 84906            1         5  1.362873e-02            0.516288   
 84907            1         8  1.698461e-04            0.132164   
 84908            1         8  4.875150e-04            0.488983   
 
        days_until_card_expires  loc_delta  trans_volume_mstd  \
 0                     0.850452   0.024955       0.000000e+00   
 1                     0.943722   0.035718       0.0000

## From the feature view we can also retrieve feature vectors from the online store for low-latency model serving
A training dataset version is required to apply the transformations. Call `feature_view.init_serving(version)` to pass the training dataset version.

In [14]:
feature_view.init_serving(td_version)

### Retrieve single feature vector

In [15]:
feature_view.get_feature_vector({"cc_num": "4473593503484549"})

[]

### Retrieve batch of feature vectors

Here we have to tell the feature view that we want a batch of feature vectors by passing `batch=True` to the `init_serving` method.

In [None]:
feature_view.init_serving(td_version, batch=True)
card_ids = [
    "4473593503484549",
    "4336399961348201",
    "4219785543443381",
    "4137709749259770",
    "4573366597272313",
    "4929411498746287",
    "4855787436134696"    
]
feature_view.get_feature_vectors(entry={"cc_num": card_ids})

### Next Steps

In the next notebook, we will train a model on the dataset we created in this notebook.