# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span>

<span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 02: Feature View and Training Dataset</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/churn/2_feature_view_creation.ipynb)

**Note**: Installing hopsworks on Colab may produce an error. It is safe to ignore.

This is the second part of the quick start series of tutorials about predicting customers who are at risk of churning with the Hopsworks Feature Store. This notebook explains how to read from feature groups and create a training dataset within the feature store.

## 🗒️ In this notebook you will see how to create a training dataset from the feature groups:
1. **Select the features** you want to train the model on,
2. **Define how the features should be preprocessed,**
3. **Create a dataset split** for training, validation, and test data.

In [None]:
!pip install -U hopsworks --quiet

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

---
## <span style="color:#ff5f27;"> 🔪 Feature Selection </span>

You will start by selecting all the features you want to include for model training/inference.

In [None]:
# Load feature groups.
customer_info_fg = fs.get_feature_group(
    name="customer_info",
    version=1
)

demography_fg = fs.get_feature_group(
    name="customer_demography_info",
    version=1
)

subscriptions_fg = fs.get_feature_group(
    name="customer_subscription_info",
    version=1
)

In [None]:
# Select features for training data.
ds_query = (
    customer_info_fg.select_except(["customerid"])
    .join(demography_fg.select_except(["customerid"]))
    .join(subscriptions_fg.select_all())
)

# uncomment this if you would like to view query result
# ds_query.show(5)

Recall that you created three feature groups in the previous notebook. If you had created several feature groups with identical schemas and wanted to include them in the join, you would need to pass a `prefix` argument to the join to avoid feature name clashes. See the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/query_api/#join) for more details.
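
To make the name-clash problem concrete, here is a minimal pure-Python sketch of a join with and without a prefix. It is an illustration only, not the Hopsworks API: `join_with_prefix`, `plan_2022`, and `plan_2023` are hypothetical names, and this toy join silently overwrites a clashing column, which is not necessarily how Hopsworks reacts to a clash.

```python
def join_with_prefix(left_rows, right_rows, on, prefix=""):
    """Inner-join two lists of dicts on `on`, prefixing right-hand column names."""
    right_by_key = {row[on]: row for row in right_rows}
    joined = []
    for left in left_rows:
        right = right_by_key.get(left[on])
        if right is None:
            continue  # inner join: skip rows without a match
        merged = dict(left)
        for name, value in right.items():
            if name != on:
                # Without a prefix, a shared column name replaces the left value.
                merged[prefix + name] = value
        joined.append(merged)
    return joined


# Two hypothetical tables with an identical schema.
plan_2022 = [{"customerid": "a1", "monthlycharges": 29.9}]
plan_2023 = [{"customerid": "a1", "monthlycharges": 35.5}]

clashing = join_with_prefix(plan_2022, plan_2023, on="customerid")
prefixed = join_with_prefix(plan_2022, plan_2023, on="customerid", prefix="new_")

print(clashing)  # only one "monthlycharges" column survives
print(prefixed)  # both columns are kept under distinct names
```

With the prefix, both `monthlycharges` and `new_monthlycharges` are available as separate features.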

---
## <span style="color:#ff5f27;">🤖 Transformation Functions </span>

You will preprocess the data using *min-max scaling* on numerical features and *label encoding* on categorical features. To do this, you define a mapping between features and transformation functions. This ensures that transformation functions such as *min-max scaling* are fitted only on the training data (and not on the validation or test data), which prevents data leakage.

In [None]:
# Load transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")

numerical_features = ["tenure", "monthlycharges", "totalcharges"]
categorical_features = [
    "multiplelines", "internetservice", "onlinesecurity", "onlinebackup",
    "deviceprotection", "techsupport", "streamingmovies", "streamingtv",
    "phoneservice", "paperlessbilling", "contract", "paymentmethod", "gender", 
    "dependents", "partner"
]

# Map features to transformations.
transformation_functions = {}
for feature in numerical_features:
    transformation_functions[feature] = min_max_scaler

for feature in categorical_features:
    transformation_functions[feature] = label_encoder
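
The leakage argument above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the implementation of the built-in Hopsworks transformation functions: the helper names and the sample values are hypothetical. The point it shows is that the scaling statistics are computed from the training split only and then reused unchanged on the validation split.

```python
def fit_min_max(values):
    """Return the (min, max) statistics of the training values."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values with previously fitted statistics."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def fit_label_encoder(values):
    """Map each distinct category to an integer code (sorted for determinism)."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


# Hypothetical sample values for one numerical and one categorical feature.
train_tenure = [1, 12, 24, 72]
valid_tenure = [6, 80]

# Fit on the training split only, then apply to both splits.
lo, hi = fit_min_max(train_tenure)
scaled_train = apply_min_max(train_tenure, lo, hi)
scaled_valid = apply_min_max(valid_tenure, lo, hi)

contract_codes = fit_label_encoder(
    ["Month-to-month", "One year", "Two year", "One year"]
)

print(scaled_train)
print(scaled_valid)
print(contract_codes)
```

Note that a validation value outside the training range (here `80`) scales to more than 1. That is expected when the statistics come from the training data alone, and it is the price of keeping validation data out of the fit.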

---
## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

A feature view defines the schema of the data as a query (optionally with filters), the model's target feature (label), and any additional transformation functions.
To create a feature view, use `fs.create_feature_view()`.

In [None]:

# Reuse the feature view if it already exists; otherwise create it.
try:
    feature_view = fs.get_feature_view(name="churn_feature_view", version=1)
except Exception:
    feature_view = fs.create_feature_view(
        name="churn_feature_view",
        version=1,
        labels=["churn"],
        transformation_functions=transformation_functions,
        query=ds_query,
    )

---
## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query whose projection (set of features) is determined by the parent feature view, with an optional snapshot on disk of the data returned by the query.

**A training dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

The training dataset is created using the `feature_view.create_train_validation_test_split()` method.

In [None]:
td_version, td_job = feature_view.create_train_validation_test_split(
    description = 'churn_training_dataset_random_splitted',
    data_format = 'csv',
    validation_size = 0.2,
    test_size = 0.1,
    write_options = {'wait_for_job': True},
    coalesce = True,
)
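
As a rough picture of what a random split with `validation_size = 0.2` and `test_size = 0.1` produces, here is a small standard-library sketch. It is not how Hopsworks performs the split internally: the function name, the `seed` parameter, and the 1000-row sample are assumptions made for illustration.

```python
import random


def train_validation_test_split(rows, validation_size, test_size, seed=42):
    """Randomly partition rows into train, validation, and test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_validation = round(len(shuffled) * validation_size)
    n_test = round(len(shuffled) * test_size)

    validation = shuffled[:n_validation]
    test = shuffled[n_validation:n_validation + n_test]
    train = shuffled[n_validation + n_test:]  # whatever remains
    return train, validation, test


rows = list(range(1000))  # stand-in for 1000 customer records
train, validation, test = train_validation_test_split(
    rows, validation_size=0.2, test_size=0.1
)

print(len(train), len(validation), len(test))
```

The training set receives whatever is left after the validation and test fractions are taken, so the two sizes passed in must sum to less than 1.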

The feature view and training dataset are now visible in the UI.

![fv-overview](../churn/images/churn_tutofv.gif)

---
## <span style="color:#ff5f27;">⏭️ **Next:** Part 03 </span>

In the following notebook, you will train a model on the dataset you created in this notebook and get a quick overview of the lineage.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/{project_name}/{notebook_name}.ipynb)
