## Dataset Creation

In this notebook, we will create the actual dataset that we will train our model on. In particular, we will:
1. Select the features we want to train our model on.
2. Specify how the features should be preprocessed.
3. Create a dataset split for training and validation data.

![tutorial-flow](images/create_training_dataset.png)

In [2]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


### Feature Selection

We start by selecting all the features we want to include for model training/inference. In this case, we'll use all features except for the customer ID.

In [3]:
# Load feature group.
customer_info_fg = fs.get_feature_group("customer_info")

# Select features for training data.
ds_query = customer_info_fg.select_except(["customerid"])

ds_query.show(5)



2022-05-30 20:39:18,679 INFO: USE `churn_demo_featurestore`
2022-05-30 20:39:19,581 INFO: SELECT `fg0`.`gender` `gender`, `fg0`.`seniorcitizen` `seniorcitizen`, `fg0`.`partner` `partner`, `fg0`.`dependents` `dependents`, `fg0`.`tenure` `tenure`, `fg0`.`phoneservice` `phoneservice`, `fg0`.`multiplelines` `multiplelines`, `fg0`.`internetservice` `internetservice`, `fg0`.`onlinesecurity` `onlinesecurity`, `fg0`.`onlinebackup` `onlinebackup`, `fg0`.`deviceprotection` `deviceprotection`, `fg0`.`techsupport` `techsupport`, `fg0`.`streamingtv` `streamingtv`, `fg0`.`streamingmovies` `streamingmovies`, `fg0`.`contract` `contract`, `fg0`.`paperlessbilling` `paperlessbilling`, `fg0`.`paymentmethod` `paymentmethod`, `fg0`.`monthlycharges` `monthlycharges`, `fg0`.`totalcharges` `totalcharges`, `fg0`.`churn` `churn`
FROM `churn_demo_featurestore`.`customer_info_1` `fg0`


| | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | onlinebackup | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 21 | 1 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Month-to-month | 1 | Mailed check | 19.95 | 416.4 | 0 |
| 1 | 1 | 0 | 1 | 1 | 72 | 1 | No | Fiber optic | Yes | Yes | No | Yes | Yes | Yes | Two year | 0 | Bank transfer (automatic) | 105.5 | 7611.55 | 0 |
| 2 | 0 | 0 | 1 | 1 | 7 | 0 | No phone service | DSL | Yes | No | No | Yes | No | No | Month-to-month | 0 | Mailed check | 34.65 | 246.6 | 0 |
| 3 | 1 | 1 | 1 | 0 | 72 | 1 | Yes | DSL | Yes | Yes | Yes | No | Yes | Yes | Two year | 0 | Credit card (automatic) | 84.45 | 5899.85 | 0 |
| 4 | 1 | 0 | 1 | 1 | 22 | 1 | Yes | DSL | No | No | Yes | Yes | Yes | Yes | One year | 1 | Mailed check | 78.65 | 1663.75 | 0 |


### Transformation Functions

We will preprocess our data using *min-max scaling* on numerical features and *label encoding* on categorical features. To do this, we define a mapping between features and transformation functions, which is later attached to the feature view. Because the transformations are attached to the feature view rather than applied to the raw data up front, statistics such as the minimum and maximum used by *min-max scaling* are computed on the training split only (not on the validation/test data), which prevents data leakage.

In [4]:
# Load transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")

numerical_features = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical_features = [
    "MultipleLines", "InternetService", "OnlineSecurity", "OnlineBackup",
    "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies",
    "Contract", "PaymentMethod"]

# Feature names are lowercase in the feature group.
numerical_features = [s.lower() for s in numerical_features]
categorical_features = [s.lower() for s in categorical_features]

# Map features to transformations.
transformation_functions = {}
for feature in numerical_features:
    transformation_functions[feature] = min_max_scaler

for feature in categorical_features:
    transformation_functions[feature] = label_encoder
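
To make the leakage argument concrete, here is a minimal plain-Python sketch of what the two transformations compute. The helper functions and the sample values are hypothetical and written only for illustration. They are not the Hopsworks built-in `min_max_scaler` and `label_encoder`, whose implementation may differ.

```python
# Conceptual sketch only: these helpers are hypothetical and are not the
# Hopsworks built-in transformation functions.

def fit_min_max(train_values):
    """Learn the scaling range from training values only."""
    return min(train_values), max(train_values)

def apply_min_max(values, lo, hi):
    """Scale values with a previously fitted range."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

def fit_label_encoder(train_values):
    """Assign an integer code to each distinct training category."""
    return {label: code for code, label in enumerate(sorted(set(train_values)))}

# Made-up example values.
train_tenure = [1, 21, 72]
validation_tenure = [7, 80]

lo, hi = fit_min_max(train_tenure)  # fitted on the training values only
train_scaled = apply_min_max(train_tenure, lo, hi)
validation_scaled = apply_min_max(validation_tenure, lo, hi)

contract_codes = fit_label_encoder(["Two year", "Month-to-month", "One year"])
```

Because the range comes from the training values alone, a validation value larger than the training maximum (80 here) is scaled to slightly above 1. This is expected and shows that no validation information was used during fitting.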

### Feature View Creation

In Hopsworks, you write features to feature groups (where the features are stored) and you read features from feature views. A feature view is a logical view over features that are stored in feature groups. It typically contains the set of features used by a specific model, and these features may come from different feature groups. In this way, feature views allow features stored in different feature groups to be reused across many different models.

In [12]:
feature_view = fs.create_feature_view(
    name="churn_feature_view",
    version=1,
    label=["churn"],
    transformation_functions=transformation_functions,
    query=ds_query,
)
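
The idea of a feature view as a logical view can be illustrated with a toy model in plain Python. Everything here is an assumption made for illustration: dictionaries stand in for feature groups, `customer_usage` and `retention_view` are invented names, and `read_view` is a hypothetical helper rather than part of the Hopsworks API.

```python
# Toy model only: plain dictionaries stand in for feature groups, and all
# names and values below are made up for illustration.
customer_info = {   # feature group keyed by customer ID
    "c1": {"tenure": 21, "contract": "Month-to-month", "churn": 0},
    "c2": {"tenure": 72, "contract": "Two year", "churn": 0},
}
customer_usage = {  # a second, hypothetical feature group
    "c1": {"monthly_gb": 3.5},
    "c2": {"monthly_gb": 40.0},
}
groups = {"customer_info": customer_info, "customer_usage": customer_usage}

# A "view" stores only references to features, not the feature data itself.
churn_view = [("customer_info", "tenure"), ("customer_usage", "monthly_gb")]
retention_view = [("customer_info", "contract"), ("customer_info", "tenure")]

def read_view(view, key):
    """Resolve the view's feature references for one entity key."""
    return {feature: groups[group][key][feature] for group, feature in view}

row = read_view(churn_view, "c1")
```

Both views reference `tenure` from the same group, so the feature is stored once and reused by two models, which is the reuse property described above.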

#### Dataset Creation

Finally, we create the dataset using `feature_view.create_training_dataset()`.

In [None]:
td_random_version, td_job = feature_view.create_training_dataset(
    description = 'churn_training_dataset_random_splitted',
    data_format="csv",
    splits={'train': 70, 'validation': 30},
    train_split="train",
    write_options = {'wait_for_job': True},
    coalesce = True
)
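
The split itself is performed by Hopsworks as part of the training dataset job. As a rough illustration of what a random 70/30 split means, the sketch below shuffles rows and cuts them at the 70% mark. The function `random_split`, its `seed` parameter, and the sample rows are hypothetical and do not reflect how Hopsworks implements the split.

```python
import random

def random_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and cut it into train and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return {"train": shuffled[:cut], "validation": shuffled[cut:]}

# Ten made-up row identifiers.
splits = random_split(range(10))
```

Every row lands in exactly one split, and fixing the seed makes the assignment reproducible across runs.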

We can sanity check that the transformation functions have been applied by loading the training and validation data.

In [None]:
td_version, td_df_random = feature_view.get_training_dataset(version = td_random_version)

In [None]:
td_df_random["train"]

In [None]:
td_df_random["validation"]
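
Instead of inspecting the data by eye, the check can also be automated. The sketch below is one possible approach, assuming the scaled columns should lie in the range [0, 1]. The helper `out_of_range_columns` and the sample values are hypothetical.

```python
def out_of_range_columns(table, columns, lo=0.0, hi=1.0):
    """Return the columns that contain a value outside [lo, hi].

    `table` can be any mapping from column name to an iterable of values,
    for example a dict of lists or a pandas DataFrame.
    """
    return [c for c in columns if any(not lo <= v <= hi for v in table[c])]

# Made-up stand-in for td_df_random["train"].
train_like = {"tenure": [0.0, 0.28, 1.0], "monthlycharges": [0.1, 0.5, 0.9]}
bad = out_of_range_columns(train_like, ["tenure", "monthlycharges"])
```

On the real data, a call such as `out_of_range_columns(td_df_random["train"], numerical_features)` should return an empty list if min-max scaling was applied. Validation values may fall slightly outside [0, 1], because the scaler is fitted on the training split only.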

### Next Steps

In the next notebook, we will train a model on the dataset we created in this notebook.