# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;">**Hopsworks Feature Store** </span> <span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 03: Training Data & Feature views</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/electricity/3_feature_views_and_training_dataset.ipynb)



## 🗒️ This notebook is divided into 3 main sections:
1. **Feature Selection**,
2. **Feature preprocessing**,
3. **Training datasets creation**

![02_training-dataset](../../images/02_training-dataset.png)

### <span style='color:#ff5f27'> 📝 Imports </span>

In [27]:
import pandas as pd
import numpy as np

from functions import *

import warnings

# Mute warnings
warnings.filterwarnings("ignore")

---

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [28]:
!pip install -U hopsworks --quiet

In [29]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

Connection closed.
Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/3348
Connected. Call `.close()` to terminate connection gracefully.


In [30]:
citibike_usage_fg = fs.get_or_create_feature_group(
    name="citibike_usage",
    version=1
)

In [31]:
citibike_stations_info_fg = fs.get_or_create_feature_group(
    name="citibike_stations_info",
    version=1
)

In [32]:
us_holidays_fg = fs.get_or_create_feature_group(
    name="us_holidays",
    version=1
)

In [33]:
meteorological_measurements_fg = fs.get_or_create_feature_group(
    name="meteorological_measurements",
    version=1
)

---

## <span style="color:#ff5f27;"> 🖍 Feature View Creation and Retrieving </span>

Let's start by selecting all the features you want to include for model training/inference.

In [34]:
# Select features for training data.
query = meteorological_measurements_fg.select_except(["timestamp"])\
                                      .join(
                                            us_holidays_fg.select_except(["timestamp"]),
                                            on="date", join_type="left"
                                      )\
                                      .join(
                                          citibike_usage_fg.select_except(["timestamp"]),
                                          on="date", join_type="left"
                                      )

In [35]:
# Uncomment and run the line below to preview a few rows from this query.
# Note that it may take some time to run.

# query.show(5)
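
To build some intuition for what the left joins in the query do, here is a rough local analogy in plain pandas. It is only a sketch: the toy frames, their values and the `holiday` column are made up for illustration, and `DataFrame.merge` stands in for the Hopsworks query join rather than reproducing it.

```python
import pandas as pd

# Toy stand-ins for three feature groups (values are invented for illustration).
weather = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-02", "2022-01-03"],
    "temp": [11.6, 8.2, 3.1],
})
holidays = pd.DataFrame({
    "date": ["2022-01-01"],
    "holiday": [1],
})
usage = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-01", "2022-01-02"],
    "station_id": ["A", "B", "A"],
    "users_count": [5, 7, 3],
})

# A left join keeps every row of the left-hand table.
# Rows without a match on the right get NaN in the right-hand columns.
joined = (
    weather
    .merge(holidays, on="date", how="left")
    .merge(usage, on="date", how="left")
)
print(joined)
```

Because the weather table is on the left, a date that has weather data but no usage rows still appears in the result, with missing `station_id` and `users_count`.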

**Feature Views** sit between **Feature Groups** and **Training Datasets**. By combining **Feature Groups** we can create a **Feature View**, which stores metadata about our data. From a **Feature View** we can then create a **Training Dataset**.

A Feature View defines its schema as a query (optionally with filters) and can also define the model's target feature/label and additional transformation functions.

To create a Feature View we can use the `FeatureStore.get_or_create_feature_view()` method, or `FeatureStore.create_feature_view()` as in the cell below.

We can specify the following parameters:

- `name` - name of the feature view.

- `version` - version of the feature view.

- `labels` - our target variable.

- `transformation_functions` - functions to transform our features.

- `query` - query object with data.

In [40]:
feature_view = fs.create_feature_view(
    name='citibike_feature_view',
    query=query,
    labels=["users_count"],
    version=1    
)

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/3348/fs/3295/fv/citibike_feature_view/version/1


In [41]:
feature_view

<hsfs.feature_view.FeatureView at 0x1fd67d7e9d0>

This `Feature View` is now saved in Hopsworks, and we can retrieve it using the `FeatureStore.get_feature_view()` method.

In [44]:
feature_view = fs.get_feature_view(
    name='citibike_feature_view',
    version=1    
)

In [45]:
feature_view

<hsfs.feature_view.FeatureView at 0x1fd6718b760>

---

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query whose projection (set of features) is determined by the parent Feature View, with an optional snapshot on disk of the data returned by the query.

**A Training Dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

To create a training dataset we use the `FeatureView.create_training_data()` method.

Here are some important things to know:

- It will inherit the name of the Feature View.

- The feature store currently supports the following data formats for
training datasets: **tfrecord, csv, tsv, parquet, avro, orc**.

- We can choose the format with the **data_format** parameter.

- **start_time** and **end_time** filter the dataset to a specific time range.

- We can create **train, test** splits using `create_train_test_split()`.

- We can create **train, validation, test** splits using `create_train_validation_test_split()`.

- The only requirement for these splits is to specify the desired split ratios.
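
To make the idea of chronological train, validation and test splits concrete, the sketch below does the same kind of split locally with pandas. It does not use the Hopsworks API: `time_series_split` is a hypothetical helper written for this illustration, and the toy frame only mimics a `date` column and a `users_count` label.

```python
import pandas as pd


def time_series_split(df, train_end, validation_end, date_col="date"):
    """Split df chronologically into train / validation / test frames.

    Rows dated up to and including train_end go to train, later rows up to
    and including validation_end go to validation, and the rest go to test.
    """
    dates = pd.to_datetime(df[date_col])
    train_end = pd.Timestamp(train_end)
    validation_end = pd.Timestamp(validation_end)

    train = df[dates <= train_end]
    validation = df[(dates > train_end) & (dates <= validation_end)]
    test = df[dates > validation_end]
    return train, validation, test


# One row per day over the period covered by this notebook (toy label values).
toy = pd.DataFrame({
    "date": pd.date_range("2022-01-01", "2022-03-05", freq="D").strftime("%Y-%m-%d"),
})
toy["users_count"] = range(len(toy))

train, validation, test = time_series_split(toy, "2022-02-15", "2022-02-28")
print(len(train), len(validation), len(test))
```

Splitting by date rather than at random keeps future observations out of the training set, which matters for time-dependent data like daily usage counts.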

In [46]:
feature_view.create_training_data(
    description='training_dataset',
    data_format='csv'
)

Training dataset job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/3348/jobs/named/citibike_feature_view_1_1_create_fv_td_24112022145058/executions


(1, <hsfs.core.job.Job at 0x1fd67457b20>)

In [48]:
feature_view.create_train_validation_test_split(
    validation_size=0.3,
    test_size=0.075,
    train_start="2022-01-01",
    train_end="2022-02-15",
    validation_start="2022-02-15",
    validation_end="2022-02-28",
    test_start="2022-03-01",
    test_end="2022-03-05"
)

AttributeError: type object 'TrainingDatasetSplit' has no attribute 'TIME_SERIES_SPLIT'

In [51]:
final_df = final_df.sort_values(by=["date", "station_id"]).reset_index(drop=True)
final_df

Unnamed: 0,date,tempmax,tempmin,temp,feelslikemax,feelslikemin,feelslike,dew,humidity,precip,...,mean_7_days,mean_14_days,std_7_days,exp_mean_7_days,exp_std_7_days,rate_of_change_7_days,std_14_days,exp_mean_14_days,exp_std_14_days,rate_of_change_14_days
0,2022-01-01,13.5,10.0,11.6,13.5,10.0,11.6,10.2,91.6,18.463,...,8.285714,10.928571,8.454359,7.826597,7.031687,-92.307692,7.086979,9.393994,7.220015,-92.307692
1,2022-01-01,13.5,10.0,11.6,13.5,10.0,11.6,10.2,91.6,18.463,...,6.428571,11.000000,8.524475,9.369996,6.739434,-48.148148,7.114449,10.077356,6.869926,16.666667
2,2022-01-01,13.5,10.0,11.6,13.5,10.0,11.6,10.2,91.6,18.463,...,6.714286,10.642857,4.059087,8.777483,5.940820,40.000000,7.185532,9.627550,6.440938,-65.000000
3,2022-01-01,13.5,10.0,11.6,13.5,10.0,11.6,10.2,91.6,18.463,...,6.714286,9.357143,4.461475,7.083082,6.043020,-71.428571,6.990181,8.526784,6.577102,-33.333333
4,2022-01-01,13.5,10.0,11.6,13.5,10.0,11.6,10.2,91.6,18.463,...,7.285714,9.928571,4.750940,8.062325,5.544779,83.333333,6.753509,8.879828,6.150643,83.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38623,2022-03-01,7.9,0.6,4.4,6.7,-2.8,2.0,-3.4,58.2,0.000,...,,,,,,,,,,
38624,2022-03-02,12.1,5.0,7.9,12.1,3.1,6.8,-1.2,55.0,0.000,...,,,,,,,,,,
38625,2022-03-03,7.1,-3.2,4.6,5.5,-9.6,2.2,-6.8,46.5,0.501,...,,,,,,,,,,
38626,2022-03-04,4.2,-5.1,-0.4,0.9,-10.5,-4.1,-12.0,42.2,0.000,...,,,,,,,,,,


In [93]:
# mark NaN values as -1 (we will replace these values in a moment)
final_df = final_df.fillna(-1)

In [94]:
station_ids = final_df.station_id.unique()

In [95]:
# remove the "-1" placeholder from station_id's list
station_ids = np.delete(station_ids, np.where(station_ids == -1))

In [97]:
df_batch = final_df[final_df.station_id == -1]

In [98]:
indexes_to_fix = df_batch.index

In [99]:
indexes_to_fix

Int64Index([38623, 38624, 38625, 38626, 38627], dtype='int64')

In [100]:
# drop the unprocessed rows for now
final_df = final_df.drop(indexes_to_fix)

In [101]:
len(station_ids)

968

In [102]:
df_batch = df_batch.loc[indexes_to_fix.repeat(len(station_ids))]

In [103]:
df_batch.shape

(4840, 30)

In [104]:
# populate this weather dataframe with station ids
for i in indexes_to_fix:
    df_batch.loc[i, "station_id"] = station_ids

In [105]:
df_batch = df_batch.dropna()

In [107]:
df_batch

Unnamed: 0,date,tempmax,tempmin,temp,feelslikemax,feelslikemin,feelslike,dew,humidity,precip,...,mean_7_days,mean_14_days,std_7_days,exp_mean_7_days,exp_std_7_days,rate_of_change_7_days,std_14_days,exp_mean_14_days,exp_std_14_days,rate_of_change_14_days
38623,2022-03-01,7.9,0.6,4.4,6.7,-2.8,2.0,-3.4,58.2,0.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
38623,2022-03-01,7.9,0.6,4.4,6.7,-2.8,2.0,-3.4,58.2,0.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
38623,2022-03-01,7.9,0.6,4.4,6.7,-2.8,2.0,-3.4,58.2,0.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
38623,2022-03-01,7.9,0.6,4.4,6.7,-2.8,2.0,-3.4,58.2,0.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
38623,2022-03-01,7.9,0.6,4.4,6.7,-2.8,2.0,-3.4,58.2,0.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38627,2022-03-05,7.8,0.7,4.1,7.0,-2.3,2.3,-5.2,51.7,0.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
38627,2022-03-05,7.8,0.7,4.1,7.0,-2.3,2.3,-5.2,51.7,0.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
38627,2022-03-05,7.8,0.7,4.1,7.0,-2.3,2.3,-5.2,51.7,0.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
38627,2022-03-05,7.8,0.7,4.1,7.0,-2.3,2.3,-5.2,51.7,0.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
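
The expansion performed in the cells above (one copy of each weather-only row per station) can also be expressed as a cross join. The following is a minimal sketch on toy data, not the notebook's own method; it assumes pandas 1.2 or newer, where `merge(how="cross")` is available, and the frame and station names are invented.

```python
import numpy as np
import pandas as pd

# Toy weather rows that have no station assigned yet (values are made up).
weather_only = pd.DataFrame({
    "date": ["2022-03-01", "2022-03-02"],
    "temp": [4.4, 7.9],
})
station_ids = np.array(["A", "B", "C"])

# Cross join: every weather row is paired with every station id.
stations = pd.DataFrame({"station_id": station_ids})
expanded = weather_only.merge(stations, how="cross")
print(expanded)
```

The result has `len(weather_only) * len(station_ids)` rows, which matches the shape check done in the notebook (5 days times 968 stations gives 4840 rows).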


In [129]:
# append the partially processed new rows to the end of the final dataframe
final_df = pd.concat([final_df, df_batch]).reset_index(drop=True)

In [130]:
final_df

Unnamed: 0,date,tempmax,tempmin,temp,feelslikemax,feelslikemin,feelslike,dew,humidity,precip,...,mean_7_days,mean_14_days,std_7_days,exp_mean_7_days,exp_std_7_days,rate_of_change_7_days,std_14_days,exp_mean_14_days,exp_std_14_days,rate_of_change_14_days
0,2022-01-01,13.5,10.0,11.6,13.5,10.0,11.6,10.2,91.6,18.463,...,8.285714,10.928571,8.454359,7.826597,7.031687,-92.307692,7.086979,9.393994,7.220015,-92.307692
1,2022-01-01,13.5,10.0,11.6,13.5,10.0,11.6,10.2,91.6,18.463,...,6.428571,11.000000,8.524475,9.369996,6.739434,-48.148148,7.114449,10.077356,6.869926,16.666667
2,2022-01-01,13.5,10.0,11.6,13.5,10.0,11.6,10.2,91.6,18.463,...,6.714286,10.642857,4.059087,8.777483,5.940820,40.000000,7.185532,9.627550,6.440938,-65.000000
3,2022-01-01,13.5,10.0,11.6,13.5,10.0,11.6,10.2,91.6,18.463,...,6.714286,9.357143,4.461475,7.083082,6.043020,-71.428571,6.990181,8.526784,6.577102,-33.333333
4,2022-01-01,13.5,10.0,11.6,13.5,10.0,11.6,10.2,91.6,18.463,...,7.285714,9.928571,4.750940,8.062325,5.544779,83.333333,6.753509,8.879828,6.150643,83.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48298,2022-03-05,7.8,0.7,4.1,7.0,-2.3,2.3,-5.2,51.7,0.000,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000
48299,2022-03-05,7.8,0.7,4.1,7.0,-2.3,2.3,-5.2,51.7,0.000,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000
48300,2022-03-05,7.8,0.7,4.1,7.0,-2.3,2.3,-5.2,51.7,0.000,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000
48301,2022-03-05,7.8,0.7,4.1,7.0,-2.3,2.3,-5.2,51.7,0.000,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000


### The gaps in these last 5 days will be filled iteratively at the modeling stage.

### <span style="color:#ff5f27;"> ⛳️ Train and test splits</span>

In [None]:
# Create training dataset based on an event time filter
start_time = convert_date_to_unix("2022-01-01")
end_time = convert_date_to_unix("2022-01-20")


td_train_version, td_job = feature_view.create_training_data(
        start_time = start_time,
        end_time = end_time,    
        description = 'citibike training dataset 2022-01-01 -> 2022-01-20',
        data_format = "csv",
        coalesce = True,
        write_options = {'wait_for_job': True},
    )

In [None]:
# Create test dataset based on an event time filter
start_time = convert_date_to_unix("2022-01-21")
end_time = convert_date_to_unix("2022-01-30")


# Use a separate name so the training dataset version above is not overwritten
td_test_version, td_job = feature_view.create_training_data(
        start_time = start_time,
        end_time = end_time,    
        description = 'citibike testing dataset 2022-01-21 -> 2022-01-30',
        data_format = "csv",
        coalesce = True,
        write_options = {'wait_for_job': True},
    )
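
The `convert_date_to_unix` helper used above comes from the `functions` module, which is not shown in this notebook. As a rough idea of what such a conversion might look like, here is a hypothetical stand-in called `date_to_unix_ms`. It assumes the date is interpreted as midnight UTC and that the result is wanted in milliseconds; the real helper may differ on both points.

```python
from datetime import datetime, timezone


def date_to_unix_ms(date_str, fmt="%Y-%m-%d"):
    """Hypothetical helper: midnight UTC of date_str as Unix time in milliseconds."""
    dt = datetime.strptime(date_str, fmt).replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


start_ms = date_to_unix_ms("2022-01-01")
end_ms = date_to_unix_ms("2022-01-20")
print(start_ms, end_ms)
```

Pinning the timezone explicitly avoids results that change with the machine's local time settings.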

---

In [None]:
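from datetime import datetime, timedelta

# NOTE: `df_joined` and `end_date` (a "%Y-%m-%d" string) are assumed to be defined earlier.
cols_to_duplicate = ['mean_7_days', 'mean_14_days', 'std_7_days', 'exp_mean_7_days',
                     'exp_std_7_days', 'rate_of_change_7_days', 'std_14_days',
                     'exp_mean_14_days', 'exp_std_14_days', 'rate_of_change_14_days']
previous_date = datetime.strptime(end_date, "%Y-%m-%d") - timedelta(days=1)
previous_date = datetime.strftime(previous_date, "%Y-%m-%d")

# Assign through .loc: chained indexing would write to a temporary copy instead.
# .values skips index alignment, so both dates must have the same number of rows.
df_joined.loc[df_joined.date == end_date, cols_to_duplicate] = (
    df_joined.loc[df_joined.date == previous_date, cols_to_duplicate].values
)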

## <span style="color:#ff5f27;">⏭️ **Next:** Part 04 </span>

In the next notebook you will train a model on the training dataset that was created in this notebook.