## Dataset Creation

In this notebook, we will create the actual dataset that we will train our model on. In particular, we will:
1. Select the features we want to train our model on.
2. Specify how the features should be preprocessed.
3. Create a dataset split for training and validation data.

![tutorial-flow](images/create_training_dataset.png)

In [1]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


### Feature Selection

We start by selecting all the features we want to include for model training/inference.

In [2]:
# Load feature groups.
trans_fg = fs.get_feature_group("transactions")
window_aggs_fg = fs.get_feature_group("transactions_4h_aggs")

# Select features for training data.
ds_query = trans_fg.select(["fraud_label", "category", "amount", "age_at_transaction", "days_until_card_expires", "loc_delta"])\
    .join(window_aggs_fg.select_except(["tid"]), on="tid")

ds_query.show(5)



2022-05-18 21:57:22,291 INFO: USE `feature_views_testing_featurestore`
2022-05-18 21:57:23,051 INFO: SELECT `fg1`.`fraud_label` `fraud_label`, `fg1`.`category` `category`, `fg1`.`amount` `amount`, `fg1`.`age_at_transaction` `age_at_transaction`, `fg1`.`days_until_card_expires` `days_until_card_expires`, `fg1`.`loc_delta` `loc_delta`, `fg0`.`trans_volume_mavg` `trans_volume_mavg`, `fg0`.`trans_volume_mstd` `trans_volume_mstd`, `fg0`.`trans_freq` `trans_freq`, `fg0`.`loc_delta_mavg` `loc_delta_mavg`
FROM `feature_views_testing_featurestore`.`transactions_1` `fg1`
INNER JOIN `feature_views_testing_featurestore`.`transactions_4h_aggs_1` `fg0` ON `fg1`.`tid` = `fg0`.`tid`


| | fraud_label | category | amount | age_at_transaction | days_until_card_expires | loc_delta | trans_volume_mavg | trans_volume_mstd | trans_freq | loc_delta_mavg |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Electronics | 800.02 | 18.269105 | 19.346493 | 0.585538 | 800.02 | 0.0 | 1.0 | 0.585538 |
| 1 | 0 | Grocery | 0.83 | 36.915835 | 1301.76816 | 0.600164 | 15.815 | 21.19199 | 2.0 | 0.518607 |
| 2 | 0 | Grocery | 11.45 | 76.852503 | -199.80037 | 0.304966 | 11.45 | 0.0 | 1.0 | 0.304966 |
| 3 | 0 | Grocery | 1.27 | 46.298131 | 298.954734 | 0.06351 | 8.085 | 9.637865 | 2.0 | 0.107042 |
| 4 | 0 | Restaurant/Cafeteria | 99.9 | 27.982351 | 1768.656076 | 0.190259 | 99.9 | 0.0 | 1.0 | 0.190259 |


Recall that we computed the features in `transactions_4h_aggs` using 4-hour aggregates. If we had created multiple feature groups with identical schemas for different window lengths and wanted to include them all in the join, we would need to pass a `prefix` argument to the join to avoid feature name clashes. See the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/query_api/#join) for more details.
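
As a rough sketch of what such a multi-window join might look like, the helper below joins several window-aggregate feature groups onto a base query, giving each one its own prefix. The helper name `join_with_window_prefixes` and the idea of a second feature group (say, a 12-hour one) are hypothetical and not part of this tutorial; only the `prefix` argument of `join` comes from the documentation linked above.

```python
def join_with_window_prefixes(base_query, window_fgs, key="tid"):
    """Join several window-aggregate feature groups onto base_query.

    window_fgs maps a prefix (for example "4h_") to a feature group object.
    Each joined feature is renamed with its prefix, so identically named
    features from different windows no longer clash.
    """
    query = base_query
    for prefix, fg in window_fgs.items():
        # Drop the join key from the right-hand side, as in the cell above.
        query = query.join(fg.select_except([key]), on=[key], prefix=prefix)
    return query
```

With the objects from this notebook, a call could look like `join_with_window_prefixes(trans_fg.select([...]), {"4h_": window_aggs_fg, "12h_": window_aggs_12h_fg})`, where `window_aggs_12h_fg` is an assumed feature group that would have to be created first.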

### Transformation Functions

We will preprocess our data using *min-max scaling* on numerical features and *label encoding* on categorical features. To do this, we simply define a mapping between our features and transformation functions. Defining the transformations this way ensures that functions such as *min-max scaling* are fitted only on the training data (and not on the validation/test data), so no data leakage occurs.

In [3]:
# Load transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")

# Map features to transformations.
transformation_functions = {
    "category": label_encoder,
    "amount": min_max_scaler,
    "trans_volume_mavg": min_max_scaler,
    "trans_volume_mstd": min_max_scaler,
    "trans_freq": min_max_scaler,
    "loc_delta": min_max_scaler,
    "loc_delta_mavg": min_max_scaler,
    "age_at_transaction": min_max_scaler,
    "days_until_card_expires": min_max_scaler,
}
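
To make the "fit on training data only" idea concrete, here is a minimal plain-Python sketch of min-max scaling and label encoding. It is not the Hopsworks implementation of `min_max_scaler` or `label_encoder`; the helper names and the sample values are made up for illustration.

```python
def fit_min_max(train_values):
    """Learn the minimum and maximum from the training values only."""
    return min(train_values), max(train_values)


def apply_min_max(values, lo, hi):
    """Scale values using previously fitted statistics."""
    span = hi - lo
    # A constant training column has zero span; map everything to 0.0.
    return [(v - lo) / span if span else 0.0 for v in values]


def fit_label_encoder(train_labels):
    """Assign an integer code to each category seen in the training data."""
    return {label: code for code, label in enumerate(sorted(set(train_labels)))}


# Illustrative values, not taken from the actual dataset split.
train_amount = [0.83, 11.45, 99.9, 800.02]
valid_amount = [1.27, 900.0]

# Statistics come from the training split and are reused for validation.
lo, hi = fit_min_max(train_amount)
train_scaled = apply_min_max(train_amount, lo, hi)
valid_scaled = apply_min_max(valid_amount, lo, hi)

codes = fit_label_encoder(["Grocery", "Electronics", "Grocery"])
```

Because the statistics are never recomputed on the validation split, a validation value larger than the training maximum scales to something above 1. That is expected and is the price of avoiding leakage.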

#### Feature View Creation

To create a feature view, we use `fs.create_feature_view()`.

In [27]:
# help(fs.create_feature_view)

In [8]:
feature_view = fs.create_feature_view(
    name='transactions_view',
    query=ds_query,
    label=["fraud_label"],
    transformation_functions=transformation_functions
)

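> `FeatureView.preview_feature_vector()` returns a sample of an assembled serving vector from the online feature store. The cell below instead uses `get_batch_data()` to inspect the first row of the feature view; as the logged query shows, the label column `fraud_label` is not part of the result.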

In [14]:
feature_view.get_batch_data().head(1)

2022-05-18 22:07:11,461 INFO: USE `feature_views_testing_featurestore`
2022-05-18 22:07:12,213 INFO: SELECT `fg1`.`category` `category`, `fg1`.`amount` `amount`, `fg1`.`age_at_transaction` `age_at_transaction`, `fg1`.`days_until_card_expires` `days_until_card_expires`, `fg1`.`loc_delta` `loc_delta`, `fg0`.`trans_volume_mavg` `trans_volume_mavg`, `fg0`.`trans_volume_mstd` `trans_volume_mstd`, `fg0`.`trans_freq` `trans_freq`, `fg0`.`loc_delta_mavg` `loc_delta_mavg`
FROM `feature_views_testing_featurestore`.`transactions_1` `fg1`
INNER JOIN `feature_views_testing_featurestore`.`transactions_4h_aggs_1` `fg0` ON `fg1`.`tid` = `fg0`.`tid`


| | category | amount | age_at_transaction | days_until_card_expires | loc_delta | trans_volume_mavg | trans_volume_mstd | trans_freq | loc_delta_mavg |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Electronics | 800.02 | 18.269105 | 19.346493 | 0.585538 | 800.02 | 0.0 | 1.0 | 0.585538 |


#### Dataset Creation

Finally, we create the dataset using `feature_view.create_training_dataset()`.

In [28]:
# help(feature_view.create_training_dataset)

In [19]:
td = feature_view.create_training_dataset(
    description='transactions_dataset_splitted',
    data_format='csv',
    splits={'train': 80, 'validation': 20},
    train_split="train",
    write_options={'wait_for_job': False}
)

Training dataset job started successfully, you can follow the progress at https://hopsworks.glassfish.service.consul:8182/p/120/jobs/named/transactions_view_1_3_create_fv_td_18052022221030/executions
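
Conceptually, the `splits` argument asks for a random partition of the rows into named subsets. The sketch below shows one way such a partition could be done in plain Python. It treats the values as relative weights and uses a fixed seed, both of which are assumptions made for illustration, and it says nothing about how the Hopsworks job performs the split internally.

```python
import random


def random_split(rows, weights, seed=0):
    """Shuffle rows and cut them into named splits by relative weight."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    total = sum(weights.values())
    names = list(weights)
    splits, start = {}, 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            # The last split takes whatever remains, so no row is lost.
            end = len(shuffled)
        else:
            end = start + round(len(shuffled) * weights[name] / total)
        splits[name] = shuffled[start:end]
        start = end
    return splits


parts = random_split(range(100), {"train": 80, "validation": 20})
```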




We can sanity check that the transformation functions have been applied by loading the training and validation data.

In [None]:
td

In [None]:
# td.read("train")

In [None]:
# td.read("validation")
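
One simple sanity check is that every min-max scaled column of the training split lies in the interval [0, 1]. The helper below is a small sketch of that check. It assumes the loaded data is available as a list of dictionaries (with a pandas DataFrame, something like `df.to_dict("records")` would produce that); the helper name and the sample rows are hypothetical.

```python
def out_of_range_columns(rows, scaled_columns, lo=0.0, hi=1.0):
    """Return the scaled columns that contain a value outside [lo, hi]."""
    bad = []
    for col in scaled_columns:
        if any(not (lo <= row[col] <= hi) for row in rows):
            bad.append(col)
    return bad


# Made-up rows standing in for a loaded, already transformed training split.
sample_rows = [
    {"amount": 0.0, "trans_freq": 0.5},
    {"amount": 1.0, "trans_freq": 0.25},
]
problems = out_of_range_columns(sample_rows, ["amount", "trans_freq"])
```

An empty result means all listed columns look scaled. For the validation split, values slightly outside the interval are possible, since the scaler statistics are fitted on the training data only.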

### Next Steps

In the next notebook, we will train a model on the dataset we created in this notebook.