## Dataset Creation

In this notebook, we will create the actual dataset that we will train our model on. In particular, we will:
1. Select the features we want to train our model on.
2. Specify how the features should be preprocessed.
3. Create a dataset split for training and validation data.

![tutorial-flow](images/create_training_dataset.png)

In [1]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


### Feature Selection

We start by selecting all the features we want to include for model training/inference.

In [2]:
# Load feature groups.
trans_fg = fs.get_feature_group("transactions")
window_aggs_fg = fs.get_feature_group("transactions_4h_aggs")

# Select features for training data.
ds_query = trans_fg.select(
    ["fraud_label", "category", "amount", "age_at_transaction", "days_until_card_expires", "loc_delta"]
).join(window_aggs_fg.select_except(["tid"]), on="tid")

ds_query.show(5)



2022-05-21 12:43:47,413 INFO: USE `fraud_demo_featurestore`
2022-05-21 12:43:48,351 INFO: SELECT `fg1`.`fraud_label` `fraud_label`, `fg1`.`category` `category`, `fg1`.`amount` `amount`, `fg1`.`age_at_transaction` `age_at_transaction`, `fg1`.`days_until_card_expires` `days_until_card_expires`, `fg1`.`loc_delta` `loc_delta`, `fg0`.`trans_volume_mavg` `trans_volume_mavg`, `fg0`.`trans_volume_mstd` `trans_volume_mstd`, `fg0`.`trans_freq` `trans_freq`, `fg0`.`loc_delta_mavg` `loc_delta_mavg`
FROM `fraud_demo_featurestore`.`transactions_1` `fg1`
INNER JOIN `fraud_demo_featurestore`.`transactions_4h_aggs_1` `fg0` ON `fg1`.`tid` = `fg0`.`tid`


|   | fraud_label | category | amount | age_at_transaction | days_until_card_expires | loc_delta | trans_volume_mavg | trans_volume_mstd | trans_freq | loc_delta_mavg |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Restaurant/Cafeteria | 46.39 | 63.809817 | 119.942917 | 0.0 | 46.39 | 0.0 | 1.0 | 2.220446e-16 |
| 1 | 0 | Domestic Transport | 74.71 | 39.722159 | 1754.779306 | 0.043526 | 102.65 | 39.513127 | 2.0 | 0.1372198 |
| 2 | 0 | Cash Withdrawal | 59.6 | 59.887297 | 190.614005 | 0.419421 | 59.6 | 0.0 | 1.0 | 0.4194213 |
| 3 | 0 | Cash Withdrawal | 302.84 | 46.219283 | 950.753657 | 0.08944 | 157.165 | 206.015561 | 2.0 | 0.2541432 |
| 4 | 0 | Grocery | 23.41 | 80.213866 | 941.486991 | 0.310141 | 23.41 | 0.0 | 1.0 | 0.3101409 |


Recall that we computed the features in `transactions_4h_aggs` using 4-hour aggregates. If we had created multiple feature groups with identical schemas for different window lengths and wanted to include them all in the join, we would need to pass a `prefix` argument to the join to avoid feature name clashes. See the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/query_api/#join) for more details.
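
To see why a prefix matters, here is a minimal sketch in plain Python (it does not use the hsfs API). Two aggregate tables, one per window length, share the column names `trans_volume_mavg` and `trans_freq`. The 12-hour table, the `join_on_tid` helper, the `w12h_` prefix, and all values are hypothetical and exist only for illustration.

```python
# Hypothetical rows keyed by transaction id; values are illustrative only.
aggs_4h = {
    "t1": {"trans_volume_mavg": 46.39, "trans_freq": 1.0},
    "t2": {"trans_volume_mavg": 102.65, "trans_freq": 2.0},
}
aggs_12h = {
    "t1": {"trans_volume_mavg": 51.20, "trans_freq": 3.0},
    "t2": {"trans_volume_mavg": 98.10, "trans_freq": 4.0},
}


def join_on_tid(left, right, prefix=""):
    """Inner-join two {tid: row} tables, prefixing the right-hand columns."""
    joined = {}
    for tid, left_row in left.items():
        if tid not in right:
            continue  # inner join: drop keys missing on the right
        row = dict(left_row)
        for name, value in right[tid].items():
            row[prefix + name] = value
        joined[tid] = row
    return joined


# Without a prefix, the 12h values silently overwrite the 4h values.
clashing = join_on_tid(aggs_4h, aggs_12h)
# With a prefix, both windows survive side by side.
prefixed = join_on_tid(aggs_4h, aggs_12h, prefix="w12h_")

print(sorted(clashing["t1"]))
print(sorted(prefixed["t1"]))
```

In this sketch the unprefixed join keeps only two columns per row, while the prefixed join keeps four. The `prefix` argument of the feature store join serves the same purpose.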

### Transformation Functions

We will preprocess our data using *min-max scaling* on numerical features and *label encoding* on categorical features. To do this, we define a mapping between our features and transformation functions. With this mapping in place, the statistics that a transformation depends on (such as the minimum and maximum for *min-max scaling*) are fitted only on the training data and not on the validation/test data, which prevents data leakage.

In [3]:
# Load transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")

# Map features to transformations.
transformation_functions = {
    "category": label_encoder,
    "amount": min_max_scaler,
    "trans_volume_mavg": min_max_scaler,
    "trans_volume_mstd": min_max_scaler,
    "trans_freq": min_max_scaler,
    "loc_delta": min_max_scaler,
    "loc_delta_mavg": min_max_scaler,
    "age_at_transaction": min_max_scaler,
    "days_until_card_expires": min_max_scaler,
}
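
To make the leakage point concrete, the following is a minimal sketch in plain Python of what "fitting on the training data only" means. The helper names (`fit_min_max`, `apply_min_max`, `fit_label_encoder`) and all values are hypothetical; the built-in `min_max_scaler` and `label_encoder` may differ in their details, for example in how they assign category codes.

```python
def fit_min_max(values):
    """Return the (min, max) statistics learned from the given values."""
    return min(values), max(values)


def apply_min_max(values, stats):
    """Scale values with previously fitted statistics."""
    lo, hi = stats
    span = hi - lo
    # Guard against a constant column, where max == min.
    return [0.0 if span == 0 else (v - lo) / span for v in values]


def fit_label_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


# Illustrative values only.
train_amount = [10.0, 20.0, 50.0]
validation_amount = [30.0, 70.0]

# The statistics come from the training split alone ...
stats = fit_min_max(train_amount)
train_scaled = apply_min_max(train_amount, stats)
# ... and are reused unchanged for the validation split.
validation_scaled = apply_min_max(validation_amount, stats)

train_categories = ["Grocery", "Cash Withdrawal", "Grocery"]
encoder = fit_label_encoder(train_categories)
train_category_codes = [encoder[c] for c in train_categories]

print(train_scaled, validation_scaled, train_category_codes)
```

Note that a validation value larger than the training maximum scales to a value above 1. That is expected when the statistics are taken from the training split only, and it is exactly what avoiding leakage implies.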

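#### Feature View Creation

A feature view is a logical view over the selected features: it bundles the query, the label column(s), and the transformation functions, and it is the object from which we create training datasets. We create one with `fs.create_feature_view()`.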

In [4]:
# help(fs.create_feature_view)

In [5]:
feature_view = fs.create_feature_view(
    name='transactions_view',
    query=ds_query,
    label=["fraud_label"],
    transformation_functions=transformation_functions
)

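We can preview the data behind the feature view with `get_batch_data()`. Note that the label column `fraud_label` is not part of the batch data, and that the preview below shows raw, untransformed values (for example, `category` is still a string).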

In [6]:
feature_view.get_batch_data().head(5)

2022-05-21 12:45:10,660 INFO: USE `fraud_demo_featurestore`
2022-05-21 12:45:11,591 INFO: SELECT `fg1`.`category` `category`, `fg1`.`amount` `amount`, `fg1`.`age_at_transaction` `age_at_transaction`, `fg1`.`days_until_card_expires` `days_until_card_expires`, `fg1`.`loc_delta` `loc_delta`, `fg0`.`trans_volume_mavg` `trans_volume_mavg`, `fg0`.`trans_volume_mstd` `trans_volume_mstd`, `fg0`.`trans_freq` `trans_freq`, `fg0`.`loc_delta_mavg` `loc_delta_mavg`
FROM `fraud_demo_featurestore`.`transactions_1` `fg1`
INNER JOIN `fraud_demo_featurestore`.`transactions_4h_aggs_1` `fg0` ON `fg1`.`tid` = `fg0`.`tid`


|   | category | amount | age_at_transaction | days_until_card_expires | loc_delta | trans_volume_mavg | trans_volume_mstd | trans_freq | loc_delta_mavg |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Restaurant/Cafeteria | 46.39 | 63.809817 | 119.942917 | 0.0 | 46.39 | 0.0 | 1.0 | 2.220446e-16 |
| 1 | Domestic Transport | 74.71 | 39.722159 | 1754.779306 | 0.043526 | 102.65 | 39.513127 | 2.0 | 0.1372198 |
| 2 | Cash Withdrawal | 59.6 | 59.887297 | 190.614005 | 0.419421 | 59.6 | 0.0 | 1.0 | 0.4194213 |
| 3 | Cash Withdrawal | 302.84 | 46.219283 | 950.753657 | 0.08944 | 157.165 | 206.015561 | 2.0 | 0.2541432 |
| 4 | Grocery | 23.41 | 80.213866 | 941.486991 | 0.310141 | 23.41 | 0.0 | 1.0 | 0.3101409 |


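#### Training Dataset Creation

Whereas the feature view only describes which features, label, and transformations to use, a training dataset is a materialized snapshot of that data. Finally, we create the training dataset using `feature_view.create_training_dataset()`. Here we write it as CSV, split 80/20 into `train` and `validation`. Because `wait_for_job` is set to `False`, the call returns once the job has been started rather than when it has finished.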

In [7]:
td_version, td_job = feature_view.create_training_dataset(
    description = 'transactions_dataset_splitted',
    data_format = 'csv',
    splits = {'train': 80, 'validation': 20},
    train_split = "train",
    write_options = {'wait_for_job': False}
)

Training dataset job started successfully, you can follow the progress at https://hopsworks.glassfish.service.consul:8182/p/119/jobs/named/transactions_view_1_1_create_fv_td_21052022124609/executions
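
The `splits` argument above assigns 80% of the rows to `train` and 20% to `validation`. How the split is performed internally is not shown in this notebook. The following is only a minimal sketch of a random percentage split in plain Python, with a hypothetical `split_rows` helper and integers standing in for transaction rows.

```python
import random


def split_rows(rows, splits, seed=0):
    """Randomly partition rows according to the percentage weights in `splits`."""
    total = sum(splits.values())
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    result, start = {}, 0
    names = list(splits)
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = len(shuffled)  # the last split takes the remainder
        else:
            end = start + round(len(shuffled) * splits[name] / total)
        result[name] = shuffled[start:end]
        start = end
    return result


rows = list(range(1000))  # stand-in for transaction rows
parts = split_rows(rows, {"train": 80, "validation": 20})
print({name: len(part) for name, part in parts.items()})
```

Every row lands in exactly one split, so no validation row can also appear in the training data. For comparison, the splits read back later in this notebook contain 84,969 and 21,049 rows, which is close to, but not exactly, 80/20.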




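Once the job has finished, we can read the training dataset back. In the run shown here, reading it as a single dataset with `get_training_dataset()` fails with a `FileNotFoundError`, while the `train` and `validation` splits are read successfully further down via the training dataset object.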

In [18]:
td_version, df = feature_view.get_training_dataset(start_time=None, end_time=None, version = td_version)

FileNotFoundError: Failed to read dataset from hopsfs://10.0.2.15:8020/Projects/fraud_demo/fraud_demo_Training_Datasets/transactions_view_1_1/transactions_view_1. Check if path exists or recreate a training dataset.

In [19]:
td_version, df = feature_view.get_training_dataset_splits({'train': 80, 'validation': 20}, start_time=None, end_time=None, version = td_version)

In [20]:
df

In [21]:
td_version

1

In [22]:
training_dataset_obj = fs.get_training_dataset("transactions_view_1", 1)

In [23]:
training_dataset_obj.read("train")

|   | fraud_label | category | amount | age_at_transaction | days_until_card_expires | loc_delta | trans_volume_mavg | trans_volume_mstd | trans_freq | loc_delta_mavg |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 5 | 0.000000e+00 | 0.010858 | 0.849987 | 0.026437 | 0.000000e+00 | 0.000000 | 0.0 | 0.026888 |
| 1 | 0 | 5 | 0.000000e+00 | 0.047378 | 0.943547 | 0.037840 | 0.000000e+00 | 0.000000 | 0.0 | 0.038485 |
| 2 | 0 | 5 | 0.000000e+00 | 0.063759 | 0.129328 | 0.000046 | 0.000000e+00 | 0.000000 | 0.0 | 0.000047 |
| 3 | 0 | 5 | 0.000000e+00 | 0.340603 | 0.206006 | 0.224487 | 0.000000e+00 | 0.000000 | 0.0 | 0.228318 |
| 4 | 0 | 5 | 3.336858e-07 | 0.364075 | 0.663339 | 0.098624 | 3.336858e-07 | 0.000000 | 0.0 | 0.100307 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 84965 | 1 | 7 | 5.702691e-03 | 0.206613 | 0.226536 | 0.103895 | 2.155544e-03 | 0.002865 | 0.9 | 0.101301 |
| 84966 | 1 | 7 | 7.304049e-03 | 0.909331 | 0.728146 | 0.227438 | 4.702467e-03 | 0.005214 | 0.1 | 0.204941 |
| 84967 | 1 | 3 | 1.698461e-04 | 0.132164 | 0.815860 | 0.230563 | 1.698461e-04 | 0.000000 | 0.0 | 0.234498 |
| 84968 | 1 | 3 | 4.875150e-04 | 0.488983 | 0.605782 | 0.046052 | 4.875150e-04 | 0.000000 | 0.0 | 0.046838 |
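
The `train` split now contains transformed values: `category` is an integer code and the numerical features lie between 0 and 1. As a quick sanity check, one could verify the value range and separate the label from the features before training. The sketch below does this in plain Python on the first five rows of the output above; the `out_of_range` helper is hypothetical, and only the first six columns are kept for brevity.

```python
# First five rows of the `train` split, copied from the output above
# (only the first six columns are kept for brevity).
columns = ["fraud_label", "category", "amount", "age_at_transaction",
           "days_until_card_expires", "loc_delta"]
train_rows = [
    (0, 5, 0.000000e+00, 0.010858, 0.849987, 0.026437),
    (0, 5, 0.000000e+00, 0.047378, 0.943547, 0.037840),
    (0, 5, 0.000000e+00, 0.063759, 0.129328, 0.000046),
    (0, 5, 0.000000e+00, 0.340603, 0.206006, 0.224487),
    (0, 5, 3.336858e-07, 0.364075, 0.663339, 0.098624),
]

# Columns that were mapped to the min-max scaler.
scaled = ["amount", "age_at_transaction", "days_until_card_expires", "loc_delta"]


def out_of_range(rows, names, low=0.0, high=1.0):
    """Return (row_index, column, value) for every value outside [low, high]."""
    problems = []
    for i, row in enumerate(rows):
        record = dict(zip(columns, row))
        for name in names:
            if not low <= record[name] <= high:
                problems.append((i, name, record[name]))
    return problems


violations = out_of_range(train_rows, scaled)

# Separate the label from the features, as a model-training step would.
label_index = columns.index("fraud_label")
labels = [row[label_index] for row in train_rows]
features = [row[:label_index] + row[label_index + 1:] for row in train_rows]

print(violations, labels)
```

An empty `violations` list is what we expect for the training split. The same check is not guaranteed to pass on the validation split, because its values are scaled with statistics fitted on the training data.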


In [24]:
training_dataset_obj.read("validation")

|   | fraud_label | category | amount | age_at_transaction | days_until_card_expires | loc_delta | trans_volume_mavg | trans_volume_mstd | trans_freq | loc_delta_mavg |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 5 | 0.000000e+00 | 0.954661 | 0.874445 | 0.194834 | 0.000000e+00 | 0.000000 | 0.0 | 0.198159 |
| 1 | 0 | 5 | 3.336858e-07 | 0.961837 | 0.766053 | 0.196517 | 1.365109e-03 | 0.002735 | 0.1 | 0.146589 |
| 2 | 0 | 5 | 6.673716e-07 | 0.659617 | 0.208964 | 0.062039 | 6.673716e-07 | 0.000000 | 0.0 | 0.063098 |
| 3 | 0 | 5 | 6.673716e-07 | 0.815824 | 0.767163 | 0.096732 | 6.673716e-07 | 0.000000 | 0.0 | 0.098383 |
| 4 | 0 | 5 | 6.673716e-07 | 0.827326 | 0.068374 | 0.100149 | 2.581560e-03 | 0.005172 | 0.1 | 0.071501 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21045 | 1 | 4 | 2.100886e-03 | 0.909331 | 0.728147 | 0.175566 | 2.100886e-03 | 0.000000 | 0.0 | 0.178562 |
| 21046 | 1 | 4 | 2.690842e-03 | 0.206613 | 0.226540 | 0.029113 | 2.856851e-03 | 0.000333 | 0.1 | 0.032057 |
| 21047 | 1 | 4 | 2.730885e-03 | 0.909331 | 0.728145 | 0.140484 | 2.889330e-03 | 0.003408 | 0.5 | 0.174735 |
| 21048 | 1 | 4 | 3.267118e-03 | 0.909331 | 0.728144 | 0.209982 | 2.489482e-03 | 0.003008 | 0.8 | 0.167911 |


### Next Steps

In the next notebook, we will train a model on the dataset we created in this notebook.