![hopsworks_logo](../../images/hopsworks_logo.png)

# Part 02: Training Data & Feature views

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/fraud_batch/2_feature_view_creation.ipynb)

**Note**: You may see an error when installing `hopsworks` on Colab; it is safe to ignore.

This is the second part of the quick start series of tutorials about the Hopsworks Feature Store. This notebook explains how to read from feature groups and create a training dataset within the feature store.

## 🗒️ In this notebook we will see how to create a training dataset from the feature groups:
1. **Select the features** we want to train our model on,
2. **Define how the features should be preprocessed**,
3. **Create a dataset split** for training and test data.

![02_training-dataset](../../images/02_training-dataset.png)

#### <span style="color:#ff5f27;">📝 Importing Libraries</span>

In [1]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/164

### <span style="color:#ff5f27;">⬇️ Retrieving Data from Feature Groups</span>

We start by selecting all the features we want to include for model training/inference.

In [2]:
# Load feature groups.
fares_fg = fs.get_or_create_feature_group("fares_fg",
                                          version=1)
rides_fg = fs.get_or_create_feature_group("rides_fg",
                                          version=1)

In [3]:
# Select features for training data.
fg_query = fares_fg.select(["total_fare"]).join(
    rides_fg.select_except([
        "taxi_id", "driver_id", "pickup_datetime",
        "pickup_longitude", "pickup_latitude",
        "dropoff_longitude", "dropoff_latitude",
    ]),
    on=["ride_id"],
)

fg_query.show(2)

2022-08-22 00:05:44,272 INFO: USE `romankah_featurestore`
2022-08-22 00:05:45,272 INFO: SELECT `fg1`.`total_fare` `total_fare`, `fg0`.`ride_id` `ride_id`, `fg0`.`passenger_count` `passenger_count`, `fg0`.`distance` `distance`, `fg0`.`pickup_distance_to_jfk` `pickup_distance_to_jfk`, `fg0`.`dropoff_distance_to_jfk` `dropoff_distance_to_jfk`, `fg0`.`pickup_distance_to_ewr` `pickup_distance_to_ewr`, `fg0`.`dropoff_distance_to_ewr` `dropoff_distance_to_ewr`, `fg0`.`pickup_distance_to_lgr` `pickup_distance_to_lgr`, `fg0`.`dropoff_distance_to_lgr` `dropoff_distance_to_lgr`, `fg0`.`year` `year`, `fg0`.`weekday` `weekday`, `fg0`.`hour` `hour`
FROM `romankah_featurestore`.`fares_fg_1` `fg1`
INNER JOIN `romankah_featurestore`.`rides_fg_1` `fg0` ON `fg1`.`ride_id` = `fg0`.`ride_id`




| | total_fare | ride_id | passenger_count | distance | pickup_distance_to_jfk | dropoff_distance_to_jfk | pickup_distance_to_ewr | dropoff_distance_to_ewr | pickup_distance_to_lgr | dropoff_distance_to_lgr | year | weekday | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 211.0 | 2496ddc3d1f9c14f40f37d756a7e6807 | 2 | 55.2773 | 37.908068 | 80.145955 | 56.84095 | 89.596242 | 40.303302 | 74.979712 | 2020 | 0 | 11 |
| 1 | 105.0 | 080d6819af28478bd39d1b14836c9102 | 2 | 37.945798 | 32.427248 | 66.244103 | 27.13503 | 65.080031 | 21.829017 | 56.793579 | 2020 | 5 | 17 |
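
As the logged SQL shows, the query above is an inner join on `ride_id` that keeps `total_fare` from the fares feature group and every column of the rides feature group except the excluded ones. The following is a minimal plain-Python sketch of that logic, not the Hopsworks implementation. The rows, the `tip` column, and the `inner_join` helper are hypothetical and exist only for illustration.

```python
# Hypothetical rows standing in for the two feature groups.
fares = [
    {"ride_id": "a1", "total_fare": 211.0, "tip": 12.0},
    {"ride_id": "b2", "total_fare": 105.0, "tip": 3.0},
    {"ride_id": "c3", "total_fare": 54.0, "tip": 0.0},  # no matching ride
]
rides = [
    {"ride_id": "a1", "taxi_id": 7, "driver_id": 70, "distance": 55.3, "hour": 11},
    {"ride_id": "b2", "taxi_id": 9, "driver_id": 90, "distance": 37.9, "hour": 17},
]


def inner_join(left, right, key, left_columns, right_excluded):
    """Inner join on `key`: keep `left_columns` from the left rows and
    every right column that is not listed in `right_excluded`."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for left_row in left:
        right_row = right_by_key.get(left_row[key])
        if right_row is None:
            continue  # inner join: rows without a match are dropped
        out = {column: left_row[column] for column in left_columns}
        out.update(
            {k: v for k, v in right_row.items() if k not in right_excluded}
        )
        joined.append(out)
    return joined


training_rows = inner_join(
    fares,
    rides,
    key="ride_id",
    left_columns=["total_fare"],
    right_excluded={"taxi_id", "driver_id"},
)

for row in training_rows:
    print(row)
```

The fare with `ride_id` `c3` has no matching ride, so it does not appear in the result. This is also why the joined data can contain fewer rows than either feature group.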


#### <span style="color:#ff5f27;">〰️ Transformation Functions</span>


We can preprocess our data with transformation functions such as *min-max scaling* for numerical features and *label encoding* for categorical features. To do this, we define a mapping between features and transformation functions. Transformation functions such as *min-max scaling* are then fitted on the training data only (not on the validation/test data), which prevents data leakage.
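
To make the leakage point concrete, here is a minimal plain-Python sketch of min-max scaling in which the statistics come from the training rows only and are then reused for the test rows. It is an illustration under assumed data, not the code behind the built-in `min_max_scaler`. The helper names and the distance values are hypothetical.

```python
def fit_min_max(values):
    """Learn the scaling statistics (min and max) from the given values."""
    return min(values), max(values)


def apply_min_max(values, stats):
    """Scale values using previously fitted statistics."""
    low, high = stats
    span = high - low
    if span == 0:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(value - low) / span for value in values]


# Hypothetical distances, already split into training and test rows.
train_distance = [10.0, 20.0, 30.0]
test_distance = [5.0, 40.0]

# Fit on the training rows only, then reuse the same statistics everywhere.
stats = fit_min_max(train_distance)
train_scaled = apply_min_max(train_distance, stats)
test_scaled = apply_min_max(test_distance, stats)

print(train_scaled)
print(test_scaled)
```

Because the test rows did not influence the statistics, their scaled values can fall outside the `[0, 1]` range. That is expected and is the price of keeping the test set unseen.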

In [4]:
[t_func.name for t_func in fs.get_transformation_functions()]

['min_max_scaler', 'robust_scaler', 'label_encoder', 'standard_scaler']

In [5]:
# # Load transformation functions.
# min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
# label_encoder = fs.get_transformation_function(name="label_encoder")

# # Map features to transformations.
# transformation_functions = {
#     "total_fare": min_max_scaler,
#     "distance": min_max_scaler
# }

## <span style="color:#ff5f27;">🔮 Feature View Creation</span>


Once we have a query over the desired features, we create a corresponding `Feature View` with `fs.create_feature_view()`. The `labels` argument marks `total_fare` as the prediction target.

In [6]:
nyc_fares_fv = fs.create_feature_view(
    name='nyc_taxi_fares_fv',
    query=fg_query,
    labels=["total_fare"],
    # transformation_functions=transformation_functions,
)

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/164/fs/106/fv/nyc_taxi_fares_fv/version/1


In [7]:
nyc_fares_fv.version

1

## <span style="color:#ff5f27;">🏋️ Training Dataset Creation</span>
    
In Hopsworks, training data is the result of a query whose projection (set of features) is determined by the parent `Feature View`, optionally with a snapshot of the returned data stored on disk.

A training dataset may contain splits such as:

- **Training set**: the subset of training data used to train a model.
- **Validation set**: the subset of training data used to evaluate hyperparameters when training a model.
- **Test set**: the holdout subset of training data used to evaluate a model.

Training datasets are created from the feature view, for example with `create_training_data()` (no split), `create_train_test_split()`, or `create_train_validation_test_split()`.
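
The three splits can be illustrated with a small plain-Python sketch. It shuffles the rows once and cuts them into disjoint train, validation, and test subsets by fraction. This is a sketch of the idea only, not how Hopsworks materializes splits. The helper name, its parameters, and the integer rows are hypothetical.

```python
import random


def train_validation_test_split(rows, validation_size, test_size, seed=42):
    """Shuffle the rows and cut them into three disjoint subsets.

    `validation_size` and `test_size` are fractions of the whole dataset;
    the remaining rows form the training set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_test = round(len(shuffled) * test_size)
    n_validation = round(len(shuffled) * validation_size)

    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


# Hypothetical dataset of 100 row identifiers.
rows = list(range(100))
train, validation, test = train_validation_test_split(
    rows, validation_size=0.1, test_size=0.2
)

print(len(train), len(validation), len(test))
```

Every row lands in exactly one subset, so the model never sees a validation or test row during training.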

In [8]:
nyc_fares_fv.create_training_data(
    description='training_dataset',
    data_format='csv'
)

Training dataset job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/164/jobs/named/nyc_taxi_fares_fv_1_1_create_fv_td_21082022220613/executions




(1, <hsfs.core.job.Job at 0x27a875d4940>)

In [9]:
X_train, y_train = nyc_fares_fv.get_training_data(
    training_dataset_version=1
)

In [10]:
X_train.head(5)

| | ride_id | passenger_count | distance | pickup_distance_to_jfk | dropoff_distance_to_jfk | pickup_distance_to_ewr | dropoff_distance_to_ewr | pickup_distance_to_lgr | dropoff_distance_to_lgr | year | weekday | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 66f60b61a4ae984a994d737065f83c03 | 3 | 70.853384 | 72.580038 | 64.264527 | 85.458988 | 56.538901 | 69.452502 | 53.72401 | 2020 | 5 | 18 |
| 1 | 4840373f85fc0f2f911d2e31c62be61b | 1 | 64.143399 | 90.239768 | 60.102573 | 99.952927 | 55.192258 | 85.271827 | 49.848482 | 2020 | 0 | 10 |
| 2 | b0c6e1e08aac265f6f6c87b8fa064a07 | 1 | 6.473236 | 84.384584 | 78.599394 | 76.526962 | 71.658514 | 73.892893 | 68.182775 | 2020 | 6 | 19 |
| 3 | e28784c998070b652963f90ce7ad6065 | 3 | 24.76576 | 20.041152 | 5.76181 | 7.219601 | 26.545754 | 19.687615 | 15.768106 | 2020 | 2 | 3 |
| 4 | f54397dcaa22b7dadc0c874adc8724be | 3 | 65.633791 | 16.511973 | 73.189277 | 37.327924 | 82.939706 | 23.082094 | 68.141703 | 2020 | 3 | 2 |


In [11]:
y_train.head(5)

| | total_fare |
|---|---|
| 0 | 161.0 |
| 1 | 93.0 |
| 2 | 216.0 |
| 3 | 54.0 |
| 4 | 202.0 |


In [12]:
nyc_fares_fv.create_train_test_split(
    test_size=0.2  # fraction of the rows held out as the test set
)

Training dataset job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/164/jobs/named/nyc_taxi_fares_fv_1_2_create_fv_td_21082022220704/executions




(2, <hsfs.core.job.Job at 0x27a875f8190>)

In [13]:
X_train, y_train, X_test, y_test = nyc_fares_fv.get_train_test_split(
    training_dataset_version=2
)

In [14]:
X_test.head(5)

| | ride_id | passenger_count | distance | pickup_distance_to_jfk | dropoff_distance_to_jfk | pickup_distance_to_ewr | dropoff_distance_to_ewr | pickup_distance_to_lgr | dropoff_distance_to_lgr | year | weekday | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8a5342befe03b0ac93b8bc46c7c550cd | 2 | 46.651897 | 68.894517 | 22.328876 | 63.204621 | 24.84082 | 58.588932 | 12.574749 | 2020 | 5 | 20 |
| 1 | f9dcfae1d876eaba696f1705b0ed7512 | 4 | 50.638172 | 67.242353 | 61.848181 | 61.849265 | 72.472428 | 56.968283 | 57.212648 | 2020 | 3 | 17 |
| 2 | 5bec1db01aa5f469d50802adcf2fc9af | 1 | 8.892462 | 66.547103 | 72.079062 | 74.946832 | 78.410927 | 60.720976 | 65.306347 | 2020 | 4 | 22 |
| 3 | 91186f31731b1833371733f244ef1d6f | 2 | 44.3161 | 45.71403 | 21.584561 | 64.398971 | 25.812269 | 47.735653 | 12.377128 | 2020 | 1 | 10 |
| 4 | 68495122018e72dee50ede14b0059272 | 4 | 50.761079 | 18.369557 | 56.794849 | 33.132065 | 47.036816 | 16.280441 | 46.164399 | 2020 | 3 | 12 |


In [15]:
y_test.head(5)

| | total_fare |
|---|---|
| 0 | 30.0 |
| 1 | 32.0 |
| 2 | 37.0 |
| 3 | 40.0 |
| 4 | 51.0 |


## <span style="color:#ff5f27;">⏭️ **Next:** Part 03 </span>

In the next notebook, we will train a model on the dataset we created in this notebook.