## Fraud Tutorial - Dataset Creation

In this notebook, we will create the actual dataset that we will train our model on. In particular, we will:
1. Select the features we want to train our model on.
2. Specify how the features should be preprocessed.
3. Create a dataset split for training and validation data.

In [1]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


### Feature Selection

We start by selecting all the features we want to include for model training/inference.

In [2]:
# Load feature groups.
trans_fg = fs.get_feature_group("transactions")
trans_4h_aggs_fg = fs.get_feature_group("transactions_4h_aggs")
trans_add_info_fg = fs.get_feature_group("transactions_add_info")

# Select features for training data.
ds_query = trans_fg.select(["category", "amount", "fraud_label"])\
    .join(trans_4h_aggs_fg.select_except(["tid"]), on="tid")\
    .join(trans_add_info_fg.select_except(["tid"]), on="tid")

ds_query.show(5)



| | fg2.category | fg2.amount | fg2.fraud_label | fg0.trans_volume_mavg | fg0.trans_volume_mstd | fg0.trans_freq | fg0.loc_delta | fg0.loc_delta_mavg | fg1.age_at_transaction | fg1.days_until_card_expires |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Health/Beauty | 50.49 | 0 | 50.49 | 0.0 | 1.0 | 0.199013 | 0.199013 | 93.52782 | 1561.665197 |
| 1 | Restaurant/Cafeteria | 1.56 | 0 | 6.69 | 7.254916 | 2.0 | 0.201949 | 0.117487 | 37.868375 | 1020.860197 |
| 2 | Health/Beauty | 78.7 | 0 | 78.7 | 0.0 | 1.0 | 0.090711 | 0.090711 | 18.028501 | 1024.225127 |
| 3 | Grocery | 77.75 | 0 | 77.75 | 0.0 | 1.0 | 0.128094 | 0.128094 | 58.379903 | 1720.178391 |
| 4 | Grocery | 18.76 | 0 | 39.72 | 29.641916 | 2.0 | 0.606323 | 0.394127 | 35.957443 | -336.186377 |


Recall that we computed the aggregate features in `transactions_4h_aggs` using 4-hour windows. If we wanted to experiment with other window lengths, e.g. 24 hours, we could create a separate feature group with the same schema as `transactions_4h_aggs` and include it in the join. To prevent feature name clashes, we would need to pass a `prefix` argument to the join:

```python
ds_query = ds_query.join(trans_24h_aggs_fg.select_except(["tid"]), on="tid", prefix="24h")
```

This illustrates yet another use of feature groups: they can be used to namespace features.
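
As a rough illustration of the namespacing idea, the sketch below joins two small sets of rows on `tid` and prepends a prefix to the columns coming from the right-hand side. It uses plain dictionaries rather than the `hsfs` API: `join_with_prefix` is a hypothetical helper, the sample rows are made up, and it assumes the prefix is simply prepended to each feature name. The sketch uses `"24h_"` with a trailing underscore purely for readability.

```python
# Illustrative stand-in for a prefixed join; plain dicts, not the hsfs API.

def join_with_prefix(left_rows, right_rows, on, prefix=""):
    """Inner-join two lists of row dicts on `on`, prefixing the right side's columns."""
    right_by_key = {row[on]: row for row in right_rows}
    joined = []
    for left in left_rows:
        right = right_by_key.get(left[on])
        if right is None:
            continue  # inner join: skip rows without a match
        merged = dict(left)
        for name, value in right.items():
            if name != on:
                # Without the prefix, `trans_freq` from the right side would
                # silently overwrite `trans_freq` from the left side.
                merged[prefix + name] = value
        joined.append(merged)
    return joined


# Made-up rows: both "feature groups" share the same schema.
aggs_4h = [{"tid": "a1", "trans_freq": 1.0}, {"tid": "b2", "trans_freq": 2.0}]
aggs_24h = [{"tid": "a1", "trans_freq": 5.0}, {"tid": "b2", "trans_freq": 9.0}]

rows = join_with_prefix(aggs_4h, aggs_24h, on="tid", prefix="24h_")
print(rows[0])
```

Both window lengths now coexist in each row under distinct names (`trans_freq` and `24h_trans_freq`).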

### Transformation Functions

We will preprocess our data using *min-max scaling* on the numerical features and *label encoding* on the categorical feature. To do this, we simply define a mapping between our features and transformation functions. When the dataset is created, transformation functions such as *min-max scaling* are fitted only on the training split (and not on the validation/test data), which prevents data leakage.

In [None]:
# Load transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")

# Map features to transformations.
transformation_functions = {
    "category": label_encoder,
    "amount": min_max_scaler,
    "trans_volume_mavg": min_max_scaler,
    "trans_volume_mstd": min_max_scaler,
    "trans_freq": min_max_scaler,
    "loc_delta": min_max_scaler,
    "loc_delta_mavg": min_max_scaler,
    "age_at_transaction": min_max_scaler,
    "days_until_card_expires": min_max_scaler,
}
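
To make the leakage point concrete, here is a minimal sketch of what "fitting on the training split only" means. It is not the implementation behind the built-in `min_max_scaler` or `label_encoder`: `fit_min_max` and `fit_label_encoder` are hypothetical helpers, and the values are a handful of rows borrowed from the preview above.

```python
def fit_min_max(train_values):
    """Learn min and max from the training values and return a scaling function."""
    lo, hi = min(train_values), max(train_values)
    span = hi - lo

    def transform(x):
        # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
        return 0.0 if span == 0 else (x - lo) / span

    return transform


def fit_label_encoder(train_values):
    """Map each distinct training label to an integer, in sorted order."""
    mapping = {label: idx for idx, label in enumerate(sorted(set(train_values)))}
    return lambda label: mapping[label]


train_amounts = [50.49, 1.56, 78.7, 77.75]
validation_amounts = [18.76, 100.0]

scale = fit_min_max(train_amounts)  # statistics come from the training split only
scaled_train = [scale(x) for x in train_amounts]
scaled_validation = [scale(x) for x in validation_amounts]

encode = fit_label_encoder(["Health/Beauty", "Restaurant/Cafeteria", "Grocery"])
print(scaled_train, scaled_validation, encode("Grocery"))
```

Because the minimum and maximum are learned from the training values alone, a validation value larger than anything seen in training (here `100.0`) scales to a number above 1. That is expected, and it is the price of not letting validation data influence the fitted statistics.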

### Dataset Creation

Finally, we create the dataset using `fs.create_training_dataset()`.

In [None]:
# TODO add chronological split here.
td = fs.create_training_dataset(
    name="transactions_dataset_splitted",
    label=["fraud_label"],
    data_format="csv",
    transformation_functions=transformation_functions,
    splits={'train': 70, 'validation': 30},
    train_split="train"
)

# We can save the dataset using the query alone.
td.save(ds_query)
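
The cell above splits the data by percentage and leaves a chronological split as a TODO. As a rough sketch of what such a split could look like outside the feature store, the example below sorts rows by a timestamp and puts the earliest 70% into the training set. `chronological_split` is a hypothetical helper, and the `datetime` column and the rows are made up for illustration; this is not how `hsfs` implements splits.

```python
from datetime import datetime, timedelta


def chronological_split(rows, time_key, train_percent=70):
    """Split rows so that every training row precedes every validation row in time."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    cut = len(ordered) * train_percent // 100  # integer arithmetic, no float rounding
    return ordered[:cut], ordered[cut:]


# Ten made-up transactions, one per hour, deliberately given in reverse order.
start = datetime(2022, 1, 1)
rows = [{"tid": i, "datetime": start + timedelta(hours=i)} for i in reversed(range(10))]

train, validation = chronological_split(rows, "datetime")
print(len(train), len(validation))
```

For fraud detection, a split like this is often closer to how a model is used in practice, since the model is evaluated only on transactions that occur after the ones it was trained on.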

We can sanity check that the transformation functions have been applied by loading the training and validation data.

In [3]:
td.read("train")


In [None]:
td.read("validation")
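
One way to turn this sanity check into something mechanical is to verify that every min-max scaled column of the training split lies in the range [0, 1]. The sketch below assumes the data is available as a mapping from column name to a list of values (with a pandas DataFrame, `df.to_dict("list")` gives this shape). `check_unit_range` is a hypothetical helper and the numbers are made up.

```python
def check_unit_range(columns, tolerance=1e-9):
    """Return the names of columns that contain values outside [0, 1]."""
    return [
        name
        for name, values in columns.items()
        if min(values) < -tolerance or max(values) > 1 + tolerance
    ]


# Made-up values standing in for a few scaled training columns.
train_columns = {
    "amount": [0.0, 0.63, 1.0],
    "trans_freq": [0.0, 1.0, 0.5],
    "loc_delta": [0.2, 3.4, 0.9],  # deliberately unscaled to show a failure
}

offenders = check_unit_range(train_columns)
print(offenders)
```

An empty list means all checked columns look scaled. The same check should not be applied strictly to the validation split, since values there can legitimately fall slightly outside [0, 1] when the scaler was fitted on training data only.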

### Next Steps

In the next notebook, we will train a model on the dataset we created in this notebook.