## Dataset Creation

In this notebook, we will create the actual dataset that we will train our model on. In particular, we will:
1. Select the features we want to train our model on.
2. Specify how the features should be preprocessed.
3. Create a dataset split for training and validation data.

![tutorial-flow](images/create_training_dataset.png)

In [None]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

### Feature Selection

We start by selecting all the features we want to include for model training/inference. In this case, we'll use all features except for the customer ID.

In [None]:
# Load feature group.
customer_info_fg = fs.get_feature_group("customer_info")

# Select features for training data.
ds_query = customer_info_fg.select_except(["customerid"])

ds_query.show(5)

### Transformation Functions

We will preprocess our data using *min-max scaling* on numerical features and *label encoding* on categorical features. To do this, we simply define a mapping between our features and transformation functions. Because this mapping is passed to the feature store when the dataset is created, transformation functions such as *min-max scaling* are fitted only on the training split (and not on the validation/test data), which prevents data leakage.
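
To make the leakage point concrete, here is a minimal plain-Python sketch of the two transformations. It is only an illustration of the idea: the helper names (`fit_min_max`, `apply_min_max`, `fit_label_encoder`, `apply_label_encoder`) and the toy values are hypothetical, and the feature store's built-in `min_max_scaler` and `label_encoder` may be implemented differently.

```python
# Illustrative stand-ins only; not the feature store's implementation.

def fit_min_max(train_values):
    """Learn the scaling range from the training split only."""
    return min(train_values), max(train_values)

def apply_min_max(values, lo, hi):
    """Scale values with a previously fitted range."""
    span = hi - lo
    if span == 0:
        # A constant column carries no range information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

def fit_label_encoder(train_values):
    """Map each category seen in the training split to an integer code."""
    return {cat: code for code, cat in enumerate(sorted(set(train_values)))}

def apply_label_encoder(values, mapping, unknown=-1):
    """Encode values; categories not seen during fitting get `unknown`."""
    return [mapping.get(v, unknown) for v in values]

# Hypothetical toy data, not taken from the churn dataset.
train_tenure = [1, 12, 24, 72]
valid_tenure = [6, 80]

# Fit on the training split, then reuse the fitted range for validation.
lo, hi = fit_min_max(train_tenure)
scaled_train = apply_min_max(train_tenure, lo, hi)
scaled_valid = apply_min_max(valid_tenure, lo, hi)

train_contract = ["Month-to-month", "Two year", "One year"]
valid_contract = ["One year", "Three year"]
mapping = fit_label_encoder(train_contract)
encoded_valid = apply_label_encoder(valid_contract, mapping)
```

Because the range and the category mapping come from the training split alone, validation values can fall outside `[0, 1]` or map to the `unknown` code. That is expected and is the price of keeping validation statistics out of the fitted transformation.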

In [None]:
# Load transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")

numerical_features = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical_features = [
    "MultipleLines", "InternetService", "OnlineSecurity", "OnlineBackup",
    "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies",
    "Contract", "PaymentMethod"]

# Feature names are lowercase in the feature group.
numerical_features = [s.lower() for s in numerical_features]
categorical_features = [s.lower() for s in categorical_features]

# Map features to transformations.
transformation_functions = {}
for feature in numerical_features:
    transformation_functions[feature] = min_max_scaler

# TODO: label_encoder seems to have problems when applied to multiple categorical features,
# so the categorical mapping below is disabled for now.
# for feature in categorical_features:
#     transformation_functions[feature] = label_encoder

### Dataset Creation

Finally, we create the dataset using `fs.create_training_dataset()`, passing the transformation mapping and a 70/30 split into training and validation data.
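
As a rough intuition for what a 70/30 split means, the sketch below assigns rows to named splits in proportion to relative weights. The function `weighted_split` is a hypothetical helper written for this illustration. The feature store performs the split internally, and its proportions may only be approximate.

```python
import random

def weighted_split(rows, weights, seed=0):
    """Randomly assign rows to named splits in proportion to relative weights."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)

    total = sum(weights.values())
    result = {}
    start = 0
    cumulative = 0
    for name, weight in weights.items():
        # Cut the shuffled rows at the cumulative share of the total weight.
        cumulative += weight
        end = round(len(shuffled) * cumulative / total)
        result[name] = shuffled[start:end]
        start = end
    return result

# Toy example: 100 row indices split with the same weights as in this notebook.
parts = weighted_split(range(100), {"train": 70, "validation": 30})
```

Every row ends up in exactly one split, so no row is shared between the training and validation data.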

In [None]:
td = fs.create_training_dataset(
    name="churn_dataset_splitted",
    label=["churn"],
    data_format="csv",
    transformation_functions=transformation_functions,
    splits={'train': 70, 'validation': 30},
    train_split="train"
)

# We can save the dataset using the query alone.
td.save(ds_query)

We can sanity check that the transformation functions have been applied by loading the training and validation data.
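
One simple way to check this is to confirm that a min-max scaled column in the training split spans exactly `[0, 1]`. The helper below is a hypothetical sketch (`is_unit_scaled` is not part of `hsfs`) that works on any iterable of numbers.

```python
def is_unit_scaled(values, tol=1e-9):
    """Return True if the values span [0, 1]: minimum ~0 and maximum ~1."""
    values = list(values)
    if not values:
        return False
    lo, hi = min(values), max(values)
    return abs(lo) <= tol and abs(hi - 1.0) <= tol

# Toy checks with made-up numbers.
scaled_ok = is_unit_scaled([0.0, 0.25, 1.0])
raw_not_ok = is_unit_scaled([1, 12, 72])
```

Assuming `td.read("train")` returns a pandas DataFrame, a call such as `is_unit_scaled(td.read("train")["tenure"])` would apply the check to one scaled column. The validation split can legitimately fall slightly outside `[0, 1]`, since the scaler is fitted on the training split only.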

In [None]:
td.read("train")

In [None]:
td.read("validation")

### Next Steps

In the next notebook, we will train a model on the dataset we created in this notebook.