# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;">**Hopsworks Feature Store** </span> <span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 03: Training Data & Feature views</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/nyc_taxi_fares/3_feature_view_and_dataset_creation.ipynb)



## 🗒️ This notebook is divided into 3 main sections:
1. **Feature Selection**,
2. **Feature preprocessing**,
3. **Training datasets creation**.

![02_training-dataset](../../images/02_training-dataset.png)

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [1]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/164






In [2]:
# Retrieve feature groups.
rides_fg = fs.get_or_create_feature_group("nyc_taxi_rides",
                                          version=1)

In [3]:
fares_fg = fs.get_or_create_feature_group("nyc_taxi_fares",
                                          version=1)

---

## <span style="color:#ff5f27;"> 🖍 Feature View Creation and Retrieving </span>

First, build a query that selects the desired features from the feature groups.

In [4]:
# Select features for training data.
fg_query = fares_fg.select(["total_fare", "tolls"]).join(
    rides_fg.select_except([
        "taxi_id", "driver_id", "pickup_datetime",
        "pickup_longitude", "pickup_latitude",
        "dropoff_longitude", "dropoff_latitude",
    ]),
    on=["ride_id"],
)

fg_query.show(2)

2022-10-14 15:15:17,595 INFO: USE `romankah_featurestore`
2022-10-14 15:15:18,566 INFO: SELECT `fg1`.`total_fare` `total_fare`, `fg1`.`tolls` `tolls`, `fg0`.`ride_id` `ride_id`, `fg0`.`passenger_count` `passenger_count`, `fg0`.`distance` `distance`, `fg0`.`pickup_distance_to_jfk` `pickup_distance_to_jfk`, `fg0`.`dropoff_distance_to_jfk` `dropoff_distance_to_jfk`, `fg0`.`pickup_distance_to_ewr` `pickup_distance_to_ewr`, `fg0`.`dropoff_distance_to_ewr` `dropoff_distance_to_ewr`, `fg0`.`pickup_distance_to_lgr` `pickup_distance_to_lgr`, `fg0`.`dropoff_distance_to_lgr` `dropoff_distance_to_lgr`, `fg0`.`year` `year`, `fg0`.`weekday` `weekday`, `fg0`.`hour` `hour`
FROM `romankah_featurestore`.`nyc_taxi_fares_1` `fg1`
INNER JOIN `romankah_featurestore`.`nyc_taxi_rides_1` `fg0` ON `fg1`.`ride_id` = `fg0`.`ride_id`




Unnamed: 0,total_fare,tolls,ride_id,passenger_count,distance,pickup_distance_to_jfk,dropoff_distance_to_jfk,pickup_distance_to_ewr,dropoff_distance_to_ewr,pickup_distance_to_lgr,dropoff_distance_to_lgr,year,weekday,hour
0,180.0,1.0,2e28bab6d25f403ea5ac99aebe4a2449,4,10.146925,72.559851,82.538992,81.944306,91.230266,67.306251,76.985348,2020,6,0
1,24.0,4.0,bc44744a6f9e7c6726b3c5b2a76843e9,2,49.185101,33.560229,67.471054,13.044146,56.937148,27.057794,56.8416,2020,1,21
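
To make the semantics of the query concrete, here is a minimal plain-Python sketch of what selecting columns and inner-joining on `ride_id` does. The rows and the helpers `select`, `select_except`, and `inner_join` are hypothetical stand-ins written for this illustration; they are not part of the Hopsworks API, and unlike Hopsworks the toy version needs the join key kept explicitly on the left side.

```python
# Toy rows standing in for the two feature groups (values are made up).
fares = [
    {"ride_id": "a", "total_fare": 20.0, "tolls": 1.0, "taxi_id": 7},
    {"ride_id": "b", "total_fare": 35.0, "tolls": 0.0, "taxi_id": 9},
]
rides = [
    {"ride_id": "a", "distance": 3.2, "driver_id": 11},
    {"ride_id": "c", "distance": 8.9, "driver_id": 12},
]


def select(rows, keep):
    """Keep only the listed columns."""
    return [{k: row[k] for k in keep} for row in rows]


def select_except(rows, drop):
    """Keep every column except the listed ones."""
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


def inner_join(left, right, on):
    """Keep left rows whose key also appears on the right, merging columns."""
    index = {row[on]: row for row in right}
    return [{**row, **index[row[on]]} for row in left if row[on] in index]


joined = inner_join(
    select(fares, ["ride_id", "total_fare", "tolls"]),
    select_except(rides, ["driver_id"]),
    on="ride_id",
)
print(joined)
```

Only ride `a` appears in both inputs, so it is the only row that survives the inner join; rides `b` and `c` are dropped.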


**Feature Views** sit between **Feature Groups** and **Training Datasets**. By combining **Feature Groups** we can create a **Feature View**, which stores metadata about our data. From a **Feature View** we can then create a **Training Dataset**.

A Feature View defines its schema as a query (optionally with filters), and can also define the model's target feature/label and additional transformation functions.

To create a Feature View, we can use the `FeatureStore.create_feature_view()` method.

We can specify the following parameters:

- `name` - name of the feature view.

- `version` - version of the feature view.

- `labels` - our target variable(s).

- `transformation_functions` - functions to transform our features.

- `query` - query object that selects the data.

In [5]:
nyc_fares_fv = fs.create_feature_view(
    name='nyc_taxi_fares_fv',
    query=fg_query,
    labels=["total_fare"]
)

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/164/fs/106/fv/nyc_taxi_fares_fv/version/1


In [6]:
nyc_fares_fv.version

1
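
The section title also mentions retrieving a Feature View. A minimal sketch of a re-run-safe helper is shown below, assuming `fs.get_feature_view(name=..., version=...)` either raises or returns `None` when the view does not exist (the exact behaviour may differ between Hopsworks versions). The helper name `get_or_create_fares_fv` is hypothetical.

```python
def get_or_create_fares_fv(fs, query, name="nyc_taxi_fares_fv", version=1):
    """Return the existing feature view, or create it from `query` if missing."""
    try:
        feature_view = fs.get_feature_view(name=name, version=version)
    except Exception:
        # Assumption: the lookup raises when the feature view is not found.
        feature_view = None

    if feature_view is None:
        feature_view = fs.create_feature_view(
            name=name,
            version=version,
            query=query,
            labels=["total_fare"],
        )
    return feature_view
```

In this notebook it could be called as `get_or_create_fares_fv(fs, fg_query)`, so that re-running the cell reuses version 1 instead of attempting to create it again.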

---

## <span style="color:#ff5f27;">🏋️ Training Dataset Creation</span>
    
In Hopsworks, training data is a query whose projection (set of features) is determined by the parent Feature View, with an optional snapshot on disk of the data returned by the query.

A Training Dataset may contain splits such as:

- Training set - the subset of training data used to train a model.
- Validation set - the subset of training data used to evaluate hyperparameters when training a model.
- Test set - the holdout subset of training data used to evaluate a model.
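
As a rough illustration of how the three splits relate to each other, the sketch below shuffles a list of rows and carves it into disjoint subsets. It is plain Python and independent of Hopsworks; the function name `three_way_split` and its default sizes are assumptions chosen for the example, not how Hopsworks performs the split internally.

```python
import random


def three_way_split(rows, validation_size=0.1, test_size=0.2, seed=42):
    """Shuffle rows and return disjoint (train, validation, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_test = round(len(shuffled) * test_size)
    n_val = round(len(shuffled) * validation_size)

    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]  # everything that is left over
    return train, validation, test


train, validation, test = three_way_split(range(100))
print(len(train), len(validation), len(test))
```

Every row ends up in exactly one subset, so the test set stays unseen while the model is trained and tuned.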

A training dataset is created with methods on the Feature View, such as `create_training_data()`, `create_train_test_split()`, or `create_train_validation_test_split()`.

In [7]:
nyc_fares_fv.create_training_data(
    description='training_dataset',
    data_format='csv'
)

Training dataset job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/164/jobs/named/nyc_taxi_fares_fv_1_1_create_fv_td_14102022131545/executions




(1, <hsfs.core.job.Job at 0x25087bf6280>)

In [8]:
X_train, y_train = nyc_fares_fv.get_training_data(
    training_dataset_version=1
)

In [9]:
X_train.head(5)

Unnamed: 0,tolls,ride_id,passenger_count,distance,pickup_distance_to_jfk,dropoff_distance_to_jfk,pickup_distance_to_ewr,dropoff_distance_to_ewr,pickup_distance_to_lgr,dropoff_distance_to_lgr,year,weekday,hour
0,4.0,10ed900246f25e954f0649dd5ad302c4,3,30.38612,55.751279,25.366605,57.660894,31.223534,47.059334,17.2113,2020,0,21
1,1.0,cc6d3385947c42fc9c8217fd595b0055,3,32.872554,34.031407,15.773933,54.899786,28.075128,40.14648,11.31038,2020,0,8
2,3.0,10786797c421c9114abcd428b4327306,2,42.131859,69.11008,55.825699,65.860343,66.024445,59.214787,50.877398,2020,5,5
3,2.0,6326579f9aec7e2e2fc448c6d318d1e5,4,61.381904,24.789907,42.281685,45.243716,26.540563,29.983835,32.638257,2020,1,10
4,3.0,531b0628e1234ce1e5a8cffc0d161147,4,50.80799,25.693737,74.739199,41.946473,84.281986,25.098324,69.587891,2020,3,11


In [10]:
y_train.head(5)

Unnamed: 0,total_fare
0,101.0
1,230.0
2,176.0
3,19.0
4,181.0


In [11]:
nyc_fares_fv.create_train_test_split(
    test_size=0.2  # fraction of rows held out for the test set
)

Training dataset job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/164/jobs/named/nyc_taxi_fares_fv_1_2_create_fv_td_14102022131643/executions




(2, <hsfs.core.job.Job at 0x25087cab100>)

In [12]:
X_train, y_train, X_test, y_test = nyc_fares_fv.get_train_test_split(
    training_dataset_version=2
)

In [13]:
X_test.head(5)

Unnamed: 0,total_fare
0,3.0
1,3.0
2,4.0
3,5.0
4,5.0


In [14]:
y_test.head(5)

Unnamed: 0,total_fare
0,4.0
1,4.0
2,6.0
3,9.0
4,9.0
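
The cells above use a two-way split. A three-way split could follow the same pattern; the sketch below assumes the Feature View methods `create_train_validation_test_split()` and `get_train_validation_test_split()` behave like their two-way counterparts and that the latter returns the splits in the order `X_train, X_val, X_test, y_train, y_val, y_test`. The wrapper names are hypothetical, and the signatures should be checked against the installed Hopsworks version.

```python
def create_three_way_split(feature_view, validation_size=0.1, test_size=0.2):
    """Materialise a train/validation/test split and return its version."""
    version, _job = feature_view.create_train_validation_test_split(
        validation_size=validation_size,
        test_size=test_size,
        description="train/validation/test split",
        data_format="csv",
    )
    return version


def load_three_way_split(feature_view, version):
    """Read the split back and label each part by name."""
    # Assumed return order: X_train, X_val, X_test, y_train, y_val, y_test.
    parts = feature_view.get_train_validation_test_split(
        training_dataset_version=version
    )
    names = ["X_train", "X_val", "X_test", "y_train", "y_val", "y_test"]
    return dict(zip(names, parts))
```

With the feature view from this notebook, `load_three_way_split(nyc_fares_fv, create_three_way_split(nyc_fares_fv))` would return a dictionary keyed by split name.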


## <span style="color:#ff5f27;">⏭️ **Next:** Part 04 </span>

In the next notebook, you will train a model on the dataset created in this notebook.