## Fraud Tutorial - Dataset Creation

In this notebook, we will create the actual dataset that we will train our model on. In particular, we will:
1. Select the features we want to train our model on.
2. Specify how the features should be preprocessed.
3. Create a dataset split for training and validation data.

In [1]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


### Feature Selection

We start by selecting all the features we want to include for model training/inference.

In [2]:
# Load feature groups.
trans_fg = fs.get_feature_group("transactions")
trans_4h_aggs_fg = fs.get_feature_group("transactions_4h_aggs")
trans_add_info_fg = fs.get_feature_group("transactions_add_info")

# Select features for training data.
ds_query = trans_fg.select(["category", "amount", "fraud_label"])\
    .join(trans_4h_aggs_fg.select_except(["tid"]), on="tid")\
    .join(trans_add_info_fg.select_except(["tid"]), on="tid")

ds_query.show(5)



2022-04-25 20:39:11,016 INFO: USE `clean_up_featurestore`
2022-04-25 20:39:11,694 INFO: SELECT `fg2`.`category`, `fg2`.`amount`, `fg2`.`fraud_label`, `fg0`.`trans_volume_mavg`, `fg0`.`trans_volume_mstd`, `fg0`.`trans_freq`, `fg0`.`loc_delta`, `fg0`.`loc_delta_mavg`, `fg1`.`age_at_transaction`, `fg1`.`days_until_card_expires`
FROM `clean_up_featurestore`.`transactions_1` `fg2`
INNER JOIN `clean_up_featurestore`.`transactions_4h_aggs_1` `fg0` ON `fg2`.`tid` = `fg0`.`tid`
INNER JOIN `clean_up_featurestore`.`transactions_add_info_1` `fg1` ON `fg2`.`tid` = `fg1`.`tid`


Unnamed: 0,fg2.category,fg2.amount,fg2.fraud_label,fg0.trans_volume_mavg,fg0.trans_volume_mstd,fg0.trans_freq,fg0.loc_delta,fg0.loc_delta_mavg,fg1.age_at_transaction,fg1.days_until_card_expires
0,Electronics,800.02,0,800.02,0.0,1.0,0.585538,0.585538,18.269105,19.346493
1,Grocery,0.83,0,15.815,21.19199,2.0,0.600164,0.518607,36.915835,1301.76816
2,Grocery,11.45,0,11.45,0.0,1.0,0.304966,0.304966,76.852503,-199.80037
3,Grocery,1.27,0,8.085,9.637865,2.0,0.06351,0.107042,46.298131,298.954734
4,Restaurant/Cafeteria,99.9,0,99.9,0.0,1.0,0.190259,0.190259,27.982351,1768.656076


Recall that we computed the aggregate features in `transactions_4h_aggs` using 4-hour windows. If we wanted to experiment with other window lengths, e.g. 24 hours, we could create a separate feature group with the same schema as `transactions_4h_aggs` and include it in the join. To prevent feature name clashes, we would need to pass a `prefix` argument to the join:

```python
ds_query = ds_query.join(trans_24h_aggs_fg.select_except(["tid"]), on="tid", prefix="24h")
```

This illustrates yet another use of feature groups: they can be used to namespace features.
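To make the name clash concrete, here is a minimal pandas sketch of the same idea outside the feature store. The two DataFrames, their values, and the `24h_` prefix are made up for illustration; the exact column names that hsfs produces for a given `prefix` may differ.

```python
import pandas as pd

# Toy stand-ins for the 4-hour and 24-hour aggregate feature groups (made-up values).
aggs_4h = pd.DataFrame({"tid": [1, 2, 3], "trans_freq": [1.0, 2.0, 1.0]})
aggs_24h = pd.DataFrame({"tid": [1, 2, 3], "trans_freq": [4.0, 6.0, 3.0]})

# Prefix every 24-hour feature except the join key, so that names stay unique.
prefix = "24h_"
aggs_24h_prefixed = aggs_24h.rename(
    columns={c: prefix + c for c in aggs_24h.columns if c != "tid"}
)

# Inner join on the transaction id, mirroring the INNER JOIN in the generated SQL.
joined = aggs_4h.merge(aggs_24h_prefixed, on="tid", how="inner")
print(list(joined.columns))  # ['tid', 'trans_freq', '24h_trans_freq']
```

Without the prefix, both inputs would contribute a column called `trans_freq`, and the result would no longer say which window length each value came from.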

### Transformation Functions

We will preprocess our data using *min-max scaling* on the numerical features and *label encoding* on the categorical feature. To do this, we define a mapping from feature names to transformation functions. Attaching the transformations to the training dataset means that functions such as *min-max scaling* are fitted only on the training split (and not on the validation/test data), which prevents data leakage.

In [3]:
# Load transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")

# Map features to transformations.
transformation_functions = {
    "category": label_encoder,
    "amount": min_max_scaler,
    "trans_volume_mavg": min_max_scaler,
    "trans_volume_mstd": min_max_scaler,
    "trans_freq": min_max_scaler,
    "loc_delta": min_max_scaler,
    "loc_delta_mavg": min_max_scaler,
    "age_at_transaction": min_max_scaler,
    "days_until_card_expires": min_max_scaler,
}
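To illustrate what "fitted only on the training data" means, the following is a minimal pandas sketch, not the implementation of the built-in `min_max_scaler` or `label_encoder`. The toy splits, their values, and the helper name `transform` are assumptions made for this example.

```python
import pandas as pd

# Toy training and validation splits (made-up values).
train = pd.DataFrame({"category": ["Grocery", "Electronics", "Grocery"],
                      "amount": [10.0, 50.0, 30.0]})
validation = pd.DataFrame({"category": ["Electronics", "Grocery"],
                           "amount": [20.0, 70.0]})

# Fit: all statistics are computed from the training split only.
amount_min = train["amount"].min()
amount_max = train["amount"].max()
category_codes = {c: i for i, c in enumerate(sorted(train["category"].unique()))}


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the training-split statistics to any split."""
    out = df.copy()
    out["amount"] = (out["amount"] - amount_min) / (amount_max - amount_min)
    out["category"] = out["category"].map(category_codes)
    return out


train_t = transform(train)
validation_t = transform(validation)
print(validation_t["amount"].tolist())    # [0.25, 1.5]
print(validation_t["category"].tolist())  # [0, 1]
```

Because the minimum and maximum come from the training split alone, a validation value can fall outside the range [0, 1], as the second validation row does here. That is expected and is the price of not leaking validation statistics into preprocessing.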

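### Dataset Creation

Finally, we create the dataset using `fs.create_training_dataset()`. The `splits` argument divides the data 70/30 into training and validation splits, and `train_split` names the split whose statistics are used to fit the transformation functions.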

In [4]:
# TODO add chronological split here.
td = fs.create_training_dataset(
    name="transactions_dataset_splitted",
    label=["fraud_label"],
    data_format="csv",
    transformation_functions=transformation_functions,
    splits={'train': 70, 'validation': 30},
    train_split="train"
)

# We can save the dataset using the query alone.
td.save(ds_query)

Training dataset job started successfully, you can follow the progress at https://hopsworks.glassfish.service.consul:8182/p/125/jobs/named/transactions_dataset_splitted_1_create_td_25042022204002/executions




<hsfs.core.job.Job at 0x7f977c3a8a30>
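As a rough picture of what a 70/30 split does, here is a minimal pandas sketch of a random split on a toy DataFrame. It is an illustration under assumptions (made-up data, a fixed `random_state`, percentages expressed as integers), not a description of how the training dataset job partitions rows internally.

```python
import pandas as pd

# Toy data set with ten rows (made-up values).
df = pd.DataFrame({"tid": range(10), "amount": [float(i) for i in range(10)]})

# Split sizes expressed as percentages, as in the `splits` argument above.
split_percentages = {"train": 70, "validation": 30}

# Shuffle the rows once, then cut at the 70% mark.
shuffled = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
cut = len(shuffled) * split_percentages["train"] // 100
splits = {"train": shuffled.iloc[:cut], "validation": shuffled.iloc[cut:]}

print({name: len(part) for name, part in splits.items()})  # {'train': 7, 'validation': 3}
```

Every row ends up in exactly one split. A chronological split would instead sort by a timestamp column before cutting, so that the validation rows are all later than the training rows.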

We can sanity check that the transformation functions have been applied by loading the training and validation data.

In [5]:
td.read("train")



Unnamed: 0,category,amount,fraud_label,trans_volume_mavg,trans_volume_mstd,trans_freq,loc_delta,loc_delta_mavg,age_at_transaction,days_until_card_expires
0,5,0.121806,0,0.121806,0.000000,0.380643,0.123069,0.123069,8.552278,487.124915
1,5,0.137032,0,0.137032,0.000000,0.380643,0.017971,0.017971,24.205971,646.860639
2,5,0.137032,0,0.137032,0.000000,0.380643,0.054782,0.054782,33.803787,603.660132
3,5,0.171289,0,108.059502,157.590145,1.141929,0.058668,0.050028,24.796981,-115.985469
4,5,0.232192,0,5.188166,7.008805,0.761286,0.110207,0.093298,17.547099,65.412143
...,...,...,...,...,...,...,...,...,...,...
487205,6,33.214917,0,33.214917,0.000000,0.380643,0.233170,0.233170,17.054229,-19.498452
487206,6,36.111611,0,36.111611,0.000000,0.380643,0.087161,0.087161,30.029146,292.263837
487207,6,36.454190,0,36.454190,0.000000,0.380643,0.014397,0.014397,36.711899,493.101831
487208,6,38.140439,0,38.140439,0.000000,0.380643,0.204763,0.204763,15.395214,-49.990762


In [6]:
td.read("validation")

Unnamed: 0,category,amount,fraud_label,trans_volume_mavg,trans_volume_mstd,trans_freq,loc_delta,loc_delta_mavg,age_at_transaction,days_until_card_expires
0,5,0.068516,0,0.068516,0.000000,0.380643,0.107028,0.107028,26.747291,110.344506
1,5,0.140838,0,0.140838,0.000000,0.380643,0.000068,0.000068,22.499537,94.316405
2,5,0.148451,0,0.148451,0.000000,0.380643,0.095808,0.095808,18.828199,594.785019
3,5,0.205547,0,0.205547,0.000000,0.380643,0.000037,0.000037,22.333039,317.663218
4,5,0.422514,0,9.755883,9.120615,1.141929,0.005720,0.035934,27.822886,683.182834
...,...,...,...,...,...,...,...,...,...,...
208845,6,15.168628,0,15.168628,0.000000,0.380643,0.054907,0.054907,20.555554,355.182733
208846,6,18.537319,0,11.543002,9.891458,0.761286,0.000835,0.035068,7.561666,-210.771801
208847,6,20.341567,0,20.341567,0.000000,0.380643,0.059708,0.059708,31.164233,218.738243
208848,6,30.086031,0,30.086031,0.000000,0.380643,0.018422,0.018422,22.108082,-152.867011
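Scanning a few rows by eye only goes so far. A more systematic check is to summarize the range of every column. The sketch below does this on a toy DataFrame with made-up values; it assumes that `td.read(...)` returns a pandas DataFrame, as the output above suggests, so that the same call could be applied to a real split.

```python
import pandas as pd

# Toy stand-in for a transformed split (made-up values).
split_df = pd.DataFrame({"amount": [0.0, 0.5, 1.0],
                         "trans_freq": [0.2, 0.4, 0.9]})

# Minimum and maximum per column, transposed to one row per feature.
ranges = split_df.agg(["min", "max"]).T
print(ranges)
```

Comparing these ranges across the training and validation splits shows at a glance which features were scaled and how far the validation values stray from the training range.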


### Next Steps

In the next notebook, we will train a model on the dataset we created in this notebook.