## Dataset Creation

In this notebook, we will create the actual dataset that we will train our model on. In particular, we will:
1. Select the features we want to train our model on.
2. Specify how the features should be preprocessed.
3. Create a dataset split for training and validation data.

![tutorial-flow](images/create_training_dataset.png)

In [None]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

### Feature Selection

We start by selecting all the features we want to include for model training/inference.

In [None]:
# Load feature groups.
trans_fg = fs.get_feature_group("transactions")
window_aggs_fg = fs.get_feature_group("transactions_4h_aggs")

# Select features for training data.
ds_query = trans_fg.select(["fraud_label", "category", "amount", "age_at_transaction", "days_until_card_expires", "loc_delta"])\
    .join(window_aggs_fg.select_except(["tid"]), on="tid")

ds_query.show(5)

Recall that we computed the features in `transactions_4h_aggs` using 4-hour aggregates. If we had created multiple feature groups with an identical schema for different window lengths and wanted to include them all in the join, we would need to pass a `prefix` argument to the join to avoid feature name clashes. See the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/query_api/#join) for more details.
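
To make the name clash concrete, here is a minimal, library-free sketch. Plain Python dictionaries stand in for joined rows, the 12-hour aggregate and all values are made up for illustration, and `join_with_prefix` is a hypothetical helper, not part of `hsfs`. The sketch raises an error on a clash purely to make the problem visible; it does not show how the feature store itself reacts.

```python
def join_with_prefix(left, right, on, prefix=""):
    """Merge two row dicts on a shared key, prefixing the right-hand feature names."""
    if left[on] != right[on]:
        raise ValueError("rows do not share the same join key")
    merged = dict(left)
    for name, value in right.items():
        if name == on:
            continue  # the join key is already present from the left side
        new_name = prefix + name
        if new_name in merged:
            raise KeyError(f"feature name clash: {new_name}")
        merged[new_name] = value
    return merged


# Hypothetical rows: two aggregate groups with the same schema.
base = {"tid": 1, "amount": 42.0}
aggs_4h = {"tid": 1, "trans_freq": 3.0}
aggs_12h = {"tid": 1, "trans_freq": 7.0}

row = join_with_prefix(base, aggs_4h, on="tid")
# Without a prefix, the second join would clash on "trans_freq".
row = join_with_prefix(row, aggs_12h, on="tid", prefix="w12h_")
print(row)
# {'tid': 1, 'amount': 42.0, 'trans_freq': 3.0, 'w12h_trans_freq': 7.0}
```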

### Transformation Functions

We will preprocess our data using *min-max scaling* on numerical features and *label encoding* on categorical features. To do this, we simply define a mapping between our features and transformation functions. Defining the mapping this way means that transformation functions such as *min-max scaling* are fitted only on the training data (and not on the validation/test data), so no information leaks from the held-out data into training.
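
As a rough illustration of what "fitted only on the training data" means, the following sketch implements both transformations in plain Python. The helper names and sample values are hypothetical and this is not the feature store's implementation; it only shows that the statistics come from the training split and are then reused unchanged on the validation split.

```python
def fit_min_max(values):
    """Return the (min, max) statistics of the training values."""
    return min(values), max(values)


def min_max_scale(values, lo, hi):
    """Scale values using statistics computed elsewhere (here: on the training split)."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def fit_label_encoder(values):
    """Map each distinct training category to an integer code."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}


# Made-up sample data for illustration.
train_amount = [10.0, 20.0, 30.0]
val_amount = [15.0, 40.0]
train_category = ["grocery", "travel", "grocery"]

# Fit on the training split only ...
lo, hi = fit_min_max(train_amount)
encoder = fit_label_encoder(train_category)

# ... then apply the same fitted statistics to both splits.
train_scaled = min_max_scale(train_amount, lo, hi)  # [0.0, 0.5, 1.0]
val_scaled = min_max_scale(val_amount, lo, hi)      # [0.25, 1.5]
train_codes = [encoder[c] for c in train_category]  # [0, 1, 0]
```

Note that a validation value can fall outside `[0, 1]` (here `1.5`), precisely because the validation data played no part in computing the minimum and maximum.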

In [None]:
# Load transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")

# Map features to transformations.
transformation_functions = {
    "category": label_encoder,
    "amount": min_max_scaler,
    "trans_volume_mavg": min_max_scaler,
    "trans_volume_mstd": min_max_scaler,
    "trans_freq": min_max_scaler,
    "loc_delta": min_max_scaler,
    "loc_delta_mavg": min_max_scaler,
    "age_at_transaction": min_max_scaler,
    "days_until_card_expires": min_max_scaler,
}

### Feature View Creation

To create a feature view, we call `fs.create_feature_view()`, passing it the query, the label column, and the mapping of transformation functions.

In [None]:
feature_view = fs.create_feature_view(
    name='transactions_view',
    query=ds_query,
    label=["fraud_label"],
    transformation_functions=transformation_functions
)

In [None]:
feature_view.get_batch_data().head(5)

### Dataset Creation

Finally, we create the dataset by calling `create_training_dataset()` on the feature view.

In [None]:
td = feature_view.create_training_dataset(
    description='transactions_dataset_splitted',
    data_format='csv',
    splits={'train': 80, 'validation': 20},
    train_split="train",
    write_options={'wait_for_job': False}
)

In [None]:
td

In [None]:
# # TODO add chronological split here.
# td = fs.create_training_dataset(
#     name="transactions_dataset_splitted",
#     label=["fraud_label"],
#     data_format="csv",
#     transformation_functions=transformation_functions,
#     splits={'train': 70, 'validation': 30},
#     train_split="train"
# )

# # We can save the dataset using the query alone.
# td.save(ds_query)

We can sanity check that the transformation functions have been applied by loading the training and validation data.
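
One possible form for such a check is sketched below, assuming the loaded split can be turned into a list of row dictionaries (for a pandas DataFrame, `df.to_dict("records")` would do this). The `out_of_range` helper and the sample rows are hypothetical. On the training split, min-max scaled columns should lie in `[0, 1]` and label-encoded columns should hold integer codes.

```python
def out_of_range(rows, columns, lo=0.0, hi=1.0):
    """Return {column: [offending values]} for values outside [lo, hi]."""
    problems = {}
    for col in columns:
        bad = [r[col] for r in rows if not lo <= r[col] <= hi]
        if bad:
            problems[col] = bad
    return problems


# Made-up rows standing in for the transformed training split.
train_rows = [
    {"amount": 0.0, "category": 0},
    {"amount": 1.0, "category": 2},
    {"amount": 0.37, "category": 1},
]

scaled_columns = ["amount"]
problems = out_of_range(train_rows, scaled_columns)
codes_are_ints = all(isinstance(r["category"], int) for r in train_rows)

print(problems)        # {}
print(codes_are_ints)  # True
```

On the validation split, a few scaled values slightly outside `[0, 1]` are not necessarily a bug, since the scaler statistics come from the training data only.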

In [None]:
# td.read("train")

In [None]:
# td.read("validation")

### Next Steps

In the next notebook, we will train a model on the dataset we created in this notebook.