# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;">**Hopsworks Feature Store** </span> <span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 03: Training Data & Feature views</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/electricity/3_feature_views_and_training_dataset.ipynb)



## 🗒️ This notebook is divided into 3 main sections:
1. **Feature selection**,
2. **Feature transformations**,
3. **Training datasets creation**

![02_training-dataset](../../images/02_training-dataset.png)

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [None]:
!pip install -U hopsworks --quiet

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

---

In [None]:
electricity_prices_fg = fs.get_or_create_feature_group(
    name = 'electricity_prices',
    version = 1
)

In [None]:
meteorological_measurements_fg = fs.get_or_create_feature_group(
    name = 'meteorological_measurements',
    version = 1
)

In [None]:
swedish_holidays_fg = fs.get_or_create_feature_group(
    name = 'swedish_holidays',
    version = 1
)

## <span style="color:#ff5f27;"> 🖍 Feature View Creation and Retrieval </span>

Let's start by selecting all the features you want to include for model training/inference.

In [None]:
# Select features for training data.
fg_query = electricity_prices_fg.select_all()\
                        .join(
                            meteorological_measurements_fg.select_except(["timestamp"])
                        )\
                        .join(
                            swedish_holidays_fg.select_all()
                        )

In [None]:
# Preview the first five rows of the query result.
fg_query.show(5)

### <span style="color:#ff5f27;"> 🤖 Transformation Functions</span>

Hopsworks Feature Store provides functionality to attach transformation functions to feature views and comes with built-in transformation functions such as `min_max_scaler`, `standard_scaler`, `robust_scaler` and `label_encoder`.

You will preprocess the data using *min-max scaling* on numerical features and *label encoding* on categorical features. To do this, you simply define a mapping between features and transformation functions. This ensures that transformation functions such as *min-max scaling* are fitted only on the training data (and not on the validation/test data), so there is no data leakage.
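To make the leakage point concrete, here is a minimal plain-Python sketch of fitting on the training data only. The helper names (`fit_min_max`, `apply_min_max`, `fit_label_encoder`) and all values are hypothetical and made up for illustration; they are not the Hopsworks built-in transformation functions, which handle this for you.

```python
# Illustrative sketch only: these helpers are hypothetical and are NOT
# the Hopsworks built-in transformation functions.

def fit_min_max(values):
    """Return the (min, max) statistics computed from the given values."""
    return min(values), max(values)


def apply_min_max(values, stats):
    """Scale values using statistics that were fitted beforehand."""
    lo, hi = stats
    span = hi - lo
    if span == 0:
        # A constant feature carries no scale information.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def fit_label_encoder(categories):
    """Map each distinct category to an integer code (sorted order)."""
    return {cat: code for code, cat in enumerate(sorted(set(categories)))}


# Made-up prices, already split into a training part and a test part.
train_prices = [10.0, 20.0, 30.0]
test_prices = [15.0, 40.0]

# Statistics come from the training data only ...
stats = fit_min_max(train_prices)
scaled_train = apply_min_max(train_prices, stats)
# ... and are reused unchanged on the test data.
scaled_test = apply_min_max(test_prices, stats)

# Made-up category values for a label-encoded feature.
encoder = fit_label_encoder(["Weekday", "Holiday", "Weekday"])
encoded_days = [encoder[d] for d in ["Holiday", "Weekday"]]
```

Because the minimum and maximum are taken from the training values alone, a test value can fall outside the `[0, 1]` range (here `40.0` does). That is expected: the test data has not influenced the scaler.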

In [None]:
price_areas = ["se1", "se2", "se3", "se4"]

# Map features to transformations.
mapping_transformers = {}
for area in price_areas:
    mapping_transformers[f"price_{area}"] = fs.get_transformation_function(name="min_max_scaler")
    mapping_transformers[f"mean_temp_per_day_{area}"] = fs.get_transformation_function(name="min_max_scaler")
    mapping_transformers[f"mean_wind_speed_{area}"] = fs.get_transformation_function(name="min_max_scaler")
    mapping_transformers[f"precipitaton_amount_{area}"] = fs.get_transformation_function(name="min_max_scaler")
    mapping_transformers[f"total_sunshine_time_{area}"] = fs.get_transformation_function(name="min_max_scaler")
    mapping_transformers[f"mean_cloud_perc_{area}"] = fs.get_transformation_function(name="min_max_scaler")    
    mapping_transformers[f"precipitaton_type_{area}"] = fs.get_transformation_function(name='label_encoder')

mapping_transformers["type_of_day"] = fs.get_transformation_function(name="label_encoder")


A **Feature View** sits between **Feature Groups** and **Training Datasets**. By combining **Feature Groups** we can create a **Feature View**, which stores metadata about our data. From a **Feature View** we can then create **Training Datasets**.

A Feature View defines a schema in the form of a query with filters, a model target feature (label), and additional transformation functions.

To create a Feature View, we can use the `FeatureStore.create_feature_view()` method.

We can specify the following parameters:

- `name` - name of the feature view.

- `version` - version of the feature view.

- `labels` - our target variable.

- `transformation_functions` - functions to transform our features.

- `query` - query object with data.

In [None]:
feature_view = fs.create_feature_view(
    name='electricity_feature_view',
    version=1,
    transformation_functions=mapping_transformers,
    query=fg_query
)

The `Feature View` is now saved in Hopsworks, and we can retrieve it using `FeatureStore.get_feature_view()`.

In [None]:
feature_view = fs.get_feature_view(
    name = 'electricity_feature_view',
    version = 1
)

---

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query where the projection (set of features) is determined by the parent FeatureView, with an optional snapshot on disk of the data returned by the query.

**A training dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.
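The three splits can be pictured with a small plain-Python sketch. The function `split_rows` and its `validation_size`, `test_size`, and `seed` parameters are hypothetical names chosen for this illustration; Hopsworks performs the splitting itself, and this sketch makes no claim about how it does so internally.

```python
import random


def split_rows(rows, validation_size, test_size, seed=42):
    """Shuffle rows and cut them into train / validation / test lists.

    Hypothetical helper for illustration; not part of the Hopsworks API.
    """
    if not 0 <= validation_size + test_size < 1:
        raise ValueError("validation_size + test_size must be in [0, 1)")

    shuffled = list(rows)
    # A seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)

    n_val = int(len(shuffled) * validation_size)
    n_test = int(len(shuffled) * test_size)
    n_train = len(shuffled) - n_val - n_test

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


# 100 made-up row ids, split 70 / 20 / 10.
rows = list(range(100))
train, validation, test = split_rows(rows, validation_size=0.2, test_size=0.1)
```

Every row lands in exactly one split, so the test set stays untouched while the model is trained and tuned.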

To create a training dataset, we use the `FeatureView.create_training_data()` method.

Here are some important things to note:

- It will inherit the name of the FeatureView.

- The feature store currently supports the following data formats for training datasets: **tfrecord, csv, tsv, parquet, avro, orc**.

- We can choose the format using the **data_format** parameter.

- We can pass **start_time** and **end_time** to filter the dataset to a specific time range.

- We can create **train, test** splits using `create_train_test_split()`.

- We can create **train, validation, test** splits using `create_train_validation_test_split()`.

- When creating splits, we should specify the desired split ratios.
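As a rough picture of what a `start_time` / `end_time` filter does, the sketch below keeps only the rows whose event time falls inside a range given as `YYYYMMDD` strings. The helpers `parse_day` and `filter_by_event_time` are hypothetical, the rows are made up, and treating both bounds as inclusive is an assumption of this sketch rather than a statement about Hopsworks behaviour.

```python
from datetime import datetime


def parse_day(text):
    """Parse a 'YYYYMMDD' string into a datetime at midnight."""
    return datetime.strptime(text, "%Y%m%d")


def filter_by_event_time(rows, start_time, end_time):
    """Keep rows whose timestamp lies within [start_time, end_time].

    Hypothetical helper; inclusive bounds are an assumption of this sketch.
    """
    start = parse_day(start_time)
    end = parse_day(end_time)
    return [row for row in rows if start <= row["timestamp"] <= end]


# Made-up rows with an event-time column and one feature.
rows = [
    {"timestamp": datetime(2020, 12, 31), "price_se1": 21.5},
    {"timestamp": datetime(2021, 1, 1), "price_se1": 23.0},
    {"timestamp": datetime(2022, 2, 28), "price_se1": 80.2},
    {"timestamp": datetime(2022, 3, 1), "price_se1": 75.9},
]

selected = filter_by_event_time(rows, "20210101", "20220228")
selected_days = [row["timestamp"].strftime("%Y%m%d") for row in selected]
```

Using non-overlapping time ranges for separate training datasets keeps later periods available as unseen data for evaluation.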

### <span style="color:#ff5f27;"> ⛳️ Training datasets filtered by event time</span>

In [None]:
# Create a training dataset based on an event-time filter.
td_jan2021_feb2022_version, td_job = feature_view.create_training_data(
    start_time="20210101",
    end_time="20220228",
    description='Electricity price prediction training dataset jan2021/feb2022',
    data_format="csv",
    coalesce=True,
    write_options={'wait_for_job': False},
)

In [None]:
# Create a training dataset based on an event-time filter.
td_spring2022, td_job = feature_view.create_training_data(
    start_time="20220301",
    end_time="20220531",
    description='Electricity price prediction training dataset March/May 2022',
    data_format="csv",
    coalesce=True,
    write_options={'wait_for_job': False},
)

In [None]:
# Create a training dataset based on an event-time filter.
td_summer2022, td_job = feature_view.create_training_data(
    start_time="20220601",
    end_time="20220909",
    description='Electricity price prediction training dataset June/August 2022',
    data_format="csv",
    coalesce=True,
    write_options={'wait_for_job': True},
)

---

## <span style="color:#ff5f27;">⏭️ **Next:** Part 04 </span>

In the next notebook you will train a model on the training dataset that was created in this notebook.