# Part 03: Training Data & Feature Views

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/nyc_taxi_fares/3_feature_view_and_dataset_creation.ipynb)


**Note**: You may get an error when installing `hopsworks` on Colab; it is safe to ignore.

This is the third part of the quick start series of tutorials about the Hopsworks Feature Store. This notebook explains how to read from feature groups and create a training dataset within the feature store.

## 🗒️ In this notebook we will see how to create a training dataset from the feature groups:
1. **Select the features** we want to train our model on.
2. **Define how the features should be preprocessed.**
3. **Create a dataset split** for training and test data.

![02_training-dataset](../../images/02_training-dataset.png)

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

___

## <span style="color:#ff5f27;">🪝 Data retrieving from Feature Groups</span>

Let's start by selecting all the features you want to include for model training/inference.

In [None]:
# Retrieve feature groups.
rides_fg = fs.get_or_create_feature_group("nyc_taxi_rides",
                                          version=1)

fares_fg = fs.get_or_create_feature_group("nyc_taxi_fares",
                                          version=1)


In [None]:
# Select features for training data.
fg_query = fares_fg.select(["total_fare", "tolls"]).join(
    rides_fg.select_except([
        "taxi_id", "driver_id", "pickup_datetime",
        "pickup_longitude", "pickup_latitude",
        "dropoff_longitude", "dropoff_latitude",
    ]),
    on=["ride_id"],
)

fg_query.show(2)
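
To make the shape of the query result concrete, here is a minimal plain-Python sketch of the same idea: join two tables on `ride_id`, keep two columns from the fares side, and keep everything except the excluded columns from the rides side. It does not call Hopsworks. The rows and the column names `tip`, `distance`, and `passenger_count` are made up for illustration, and the sketch uses an inner join, which is an assumption: the join type Hopsworks applies depends on the `join_type` argument and the library version.

```python
# Toy rows standing in for the two feature groups.
# All values, and the columns "tip", "distance", "passenger_count", are hypothetical.
fares = [
    {"ride_id": 1, "total_fare": 12.5, "tolls": 0.0, "tip": 2.0},
    {"ride_id": 2, "total_fare": 30.0, "tolls": 5.5, "tip": 4.0},
]
rides = [
    {"ride_id": 1, "taxi_id": 7, "distance": 2.1, "passenger_count": 1},
    {"ride_id": 2, "taxi_id": 9, "distance": 8.4, "passenger_count": 3},
    {"ride_id": 3, "taxi_id": 7, "distance": 1.0, "passenger_count": 2},
]


def join_on_key(left_rows, right_rows, key):
    """Inner join: keep only left rows whose key also appears on the right."""
    right_by_key = {row[key]: row for row in right_rows}
    joined = []
    for left in left_rows:
        match = right_by_key.get(left[key])
        if match is not None:
            joined.append({**left, **match})
    return joined


joined = join_on_key(fares, rides, "ride_id")

# Equivalent of select([...]) on the fares side.
fares_cols = ["total_fare", "tolls"]
# Equivalent of select_except([...]) on the rides side.
rides_cols = [c for c in rides[0] if c not in {"taxi_id"}]

training_rows = [{c: row[c] for c in fares_cols + rides_cols} for row in joined]
for row in training_rows:
    print(row)
```

Ride 3 has no matching fare, so it is absent from the result of this inner join.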

---

## <span style="color:#ff5f27;"> 🖍 Feature View Creation and Retrieving </span>

Once the query selects the desired features, create a corresponding `Feature View` from it.
To do this, use `fs.create_feature_view()`.

In [None]:
nyc_fares_fv = fs.create_feature_view(
    name='nyc_taxi_fares_fv',
    query=fg_query,
    labels=["total_fare"]
)

In [None]:
nyc_fares_fv.version
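
The `labels` argument marks `total_fare` as the prediction target. The retrieval cells later in this notebook unpack features and labels separately (`X_train, y_train = ...`). The sketch below illustrates that separation on plain Python dictionaries; `split_features_and_labels` is a hypothetical helper written for this example, not a Hopsworks function, and the rows are made up.

```python
def split_features_and_labels(rows, labels):
    """Return (X, y): feature columns and label columns as separate row lists."""
    X = [{k: v for k, v in row.items() if k not in labels} for row in rows]
    y = [{k: row[k] for k in labels} for row in rows]
    return X, y


# Hypothetical rows shaped like the query result.
rows = [
    {"total_fare": 12.5, "tolls": 0.0, "ride_id": 1},
    {"total_fare": 30.0, "tolls": 5.5, "ride_id": 2},
]

X, y = split_features_and_labels(rows, labels=["total_fare"])
print(X)  # feature columns only
print(y)  # label column only
```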

## <span style="color:#ff5f27;">🏋️ Training Dataset Creation</span>
    
In Hopsworks, training data is a query whose projection (set of features) is determined by the parent Feature View, with an optional on-disk snapshot of the data returned by the query.

A training dataset may contain splits such as:

- **Training set**: the subset of training data used to train a model.
- **Validation set**: the subset of training data used to evaluate hyperparameters when training a model.
- **Test set**: the holdout subset of training data used to evaluate a model.

Training datasets are created with methods of the feature view, such as `create_training_data()`, `create_train_test_split()`, and `create_train_validation_test_split()`.

In [None]:
nyc_fares_fv.create_training_data(
    description='training_dataset',
    data_format='csv'
)

In [None]:
X_train, y_train = nyc_fares_fv.get_training_data(
    training_dataset_version=1
)

In [None]:
X_train.head(5)

In [None]:
y_train.head(5)

In [None]:
nyc_fares_fv.create_train_test_split(
    test_size=0.2  # fraction of the data held out as the test set
)
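
As a rough illustration of what `test_size=0.2` means, the sketch below performs a random 80/20 split in plain Python. It assumes a simple row-level random split; how Hopsworks assigns rows internally is not shown in this notebook and may differ. The helper `train_test_split_rows` and its `seed` parameter are hypothetical names introduced for this example.

```python
import random


def train_test_split_rows(rows, test_size, seed=42):
    """Shuffle a copy of the rows and hold out a `test_size` fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for 10 training rows
train, test = train_test_split_rows(rows, test_size=0.2)
print(len(train), len(test))  # 8 2
```

Every row lands in exactly one of the two sets, so the test set stays unseen during training.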

In [None]:
X_train, y_train, X_test, y_test = nyc_fares_fv.get_train_test_split(
    training_dataset_version=2
)

In [None]:
X_test.head(5)

In [None]:
y_test.head(5)

## <span style="color:#ff5f27;">⏭️ **Next:** Part 04 </span>

In the next notebook, you will train a model on the dataset created in this notebook.