# <a class="anchor" id="1.5_bullet" style="color:#533a7b"> **Dataset Creation** </a>
---


In this notebook, we will create the actual dataset that we will train our model on. In particular, we will:
1. Select the features we want to train our model on.
2. Specify how the features should be preprocessed.
3. Create a dataset split for training and test data.

![tutorial-flow](images/create_training_dataset.png)

### <a class="anchor" id="1.5_bullet" style="color:#e363a3"> **📝 Importing Libraries** </a>

In [1]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


### <a class="anchor" id="1.5_bullet" style="color:#3772ff"> **⬇️ Data retrieving from Feature Groups** </a>

We start by selecting all the features we want to include for model training/inference.

In [2]:
# Load feature groups.
fares_fg = fs.get_or_create_feature_group("fares_fg",
                                          version=1)
rides_fg = fs.get_or_create_feature_group("rides_fg",
                                          version=1)

In [3]:
# Select features for training data.
fg_query = fares_fg.select(['total_fare', 'pickup_datetime', 'month_of_the_ride'])\
                            .join(rides_fg.select_except(['taxi_id',
                                  'driver_id']), on=['ride_id', 'pickup_datetime', 'month_of_the_ride'])
fg_query.show(2)

2022-07-26 17:30:09,140 INFO: USE `tutorials_testing_featurestore`
2022-07-26 17:30:09,854 INFO: WITH right_fg0 AS (SELECT *
FROM (SELECT `fg1`.`total_fare` `total_fare`, `fg1`.`pickup_datetime` `pickup_datetime`, `fg1`.`month_of_the_ride` `month_of_the_ride`, `fg1`.`ride_id` `join_pk_ride_id`, `fg1`.`pickup_datetime` `join_evt_pickup_datetime`, `fg0`.`ride_id` `ride_id`, `fg0`.`pickup_longitude` `pickup_longitude`, `fg0`.`pickup_latitude` `pickup_latitude`, `fg0`.`dropoff_longitude` `dropoff_longitude`, `fg0`.`dropoff_latitude` `dropoff_latitude`, `fg0`.`passenger_count` `passenger_count`, `fg0`.`distance` `distance`, `fg0`.`pickup_distance_to_jfk` `pickup_distance_to_jfk`, `fg0`.`dropoff_distance_to_jfk` `dropoff_distance_to_jfk`, `fg0`.`pickup_distance_to_ewr` `pickup_distance_to_ewr`, `fg0`.`dropoff_distance_to_ewr` `dropoff_distance_to_ewr`, `fg0`.`pickup_distance_to_lgr` `pickup_distance_to_lgr`, `fg0`.`dropoff_distance_to_lgr` `dropoff_distance_to_lgr`, `fg0`.`year` `year`, `fg0

Unnamed: 0,total_fare,pickup_datetime,month_of_the_ride,ride_id,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,distance,pickup_distance_to_jfk,dropoff_distance_to_jfk,pickup_distance_to_ewr,dropoff_distance_to_ewr,pickup_distance_to_lgr,dropoff_distance_to_lgr,year,weekday,hour
0,118.0,1577880020000,202001,1,-73.76764,40.88664,-73.843834,40.78967,3,7.794438,16.960192,10.813943,25.269297,18.642949,9.402573,1.808076,2020,2,12
1,48.0,1577880040000,202001,2,-73.85604,40.77413,-73.80203,40.84287,3,5.525897,10.044588,13.983367,17.669196,22.185668,0.959006,5.911146,2020,2,12


#### <a class="anchor" id="1.5_bullet" style="color:#772f1a"> **〰️ Transformation Functions** </a>


We can preprocess our data with transformation functions, for example *min-max scaling* for numerical features and *label encoding* for categorical features. To do this, we define a mapping between features and transformation functions and attach it to the feature view. Transformation functions such as *min-max scaling* are then fitted only on the training data (not on the validation/test data), which prevents data leakage. The mapping in cell 5 is commented out, so this notebook creates the feature view without transformations.
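To make the leakage point concrete, here is a minimal sketch in plain Python. It is not how Hopsworks implements `min_max_scaler`; the helper names `fit_min_max` and `apply_min_max` and the toy values are assumptions made for illustration. The minimum and maximum are computed from the training values only and then reused unchanged on the test values.

```python
# Toy values for illustration only; they are not taken from the feature groups.
train_distance = [2.0, 4.0, 10.0]
test_distance = [1.0, 6.0]


def fit_min_max(values):
    """Return the (min, max) statistics learned from the given values."""
    return min(values), max(values)


def apply_min_max(values, stats):
    """Scale values using previously fitted (min, max) statistics."""
    lo, hi = stats
    return [(v - lo) / (hi - lo) for v in values]


# Fit on the training data only.
stats = fit_min_max(train_distance)

# Apply the same statistics to both subsets.
train_scaled = apply_min_max(train_distance, stats)
test_scaled = apply_min_max(test_distance, stats)

print(train_scaled)  # [0.0, 0.25, 1.0]
print(test_scaled)   # [-0.125, 0.5]
```

Test values can fall outside the `[0, 1]` range because the scaler has never seen them. That is expected: fitting on the test data as well would let information about the holdout set leak into training.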

In [4]:
[t_func.name for t_func in fs.get_transformation_functions()]

['min_max_scaler', 'robust_scaler', 'standard_scaler', 'label_encoder']

In [5]:
# # Load transformation functions.
# min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
# label_encoder = fs.get_transformation_function(name="label_encoder")

# # Map features to transformations.
# transformation_functions = {
#     "total_fare": min_max_scaler,
#     "distance": min_max_scaler
# }

## <a class="anchor" id="1.5_bullet" style="color:#2db3f0"> **🔮 Feature View Creation** </a>


Now that we have a query that selects the desired features, we create a corresponding `Feature View` from it with `fs.create_feature_view()`. The `labels` argument names the target column, here `total_fare`.

In [6]:
nyc_fares_fv = fs.create_feature_view(
    name='nyc_taxi_fares',
    query=fg_query,
    labels=["total_fare"],
#     transformation_functions=transformation_functions
)

Feature view created successfully, explore it at 
https://0f060790-06a4-11ed-8aed-d1422d4ec537.cloud.hopsworks.ai/p/2170/fs/2118/fv/nyc_taxi_fares/version/1


In [7]:
nyc_fares_fv.version

1

## <a class="anchor" id="1.5_bullet" style="color:#525252"> **🏋️ Training Dataset Creation** </a>

In Hopsworks, training data is a query whose projection (the set of features) is determined by the parent feature view, optionally with a snapshot of the query result stored on disk.

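A training dataset may contain splits such as:

- **Training set**: the subset of training data used to train a model.
- **Validation set**: the subset of training data used to evaluate hyperparameters when training a model.
- **Test set**: the holdout subset of training data used to evaluate a model.

Training datasets are created from the feature view: `create_training_data()` writes a dataset without splits, `create_train_test_split()` adds a test split, and `create_train_validation_test_split()` adds both validation and test splits.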
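The three splits described above can be sketched in plain Python as follows. This is only an illustration of the idea, not the Hopsworks implementation; the function name `split_rows`, the fixed seed, and the hypothetical ride identifiers are assumptions made for the example.

```python
import random


def split_rows(rows, validation_size, test_size, seed=42):
    """Shuffle rows and cut them into disjoint train, validation and test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible

    n_test = round(len(shuffled) * test_size)
    n_validation = round(len(shuffled) * validation_size)

    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


# Hypothetical ride identifiers standing in for the rows of a training dataset.
ride_ids = range(1, 101)
train, validation, test = split_rows(ride_ids, validation_size=0.1, test_size=0.2)

print(len(train), len(validation), len(test))  # 70 10 20
```

Every row lands in exactly one subset, so the model is never evaluated on rows it was trained on.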

In [8]:
nyc_fares_fv.create_training_data(
    description='training_dataset',
    data_format='csv'
)

Training dataset job started successfully, you can follow the progress at 
https://0f060790-06a4-11ed-8aed-d1422d4ec537.cloud.hopsworks.ai/p/2170/jobs/named/nyc_taxi_fares_1_1_create_fv_td_26072022173042/executions




(1, <hsfs.core.job.Job at 0x7f63633dd760>)

In [9]:
nyc_fares_fv.create_train_test_split(
    test_size=0.2 # fraction of rows held out for the test set
)

Training dataset job started successfully, you can follow the progress at 
https://0f060790-06a4-11ed-8aed-d1422d4ec537.cloud.hopsworks.ai/p/2170/jobs/named/nyc_taxi_fares_1_2_create_fv_td_26072022173338/executions




(2, <hsfs.core.job.Job at 0x7f6364a9dd60>)

In [10]:
X_train, y_train = nyc_fares_fv.get_training_data(
    training_dataset_version=1
)



In [11]:
X_train.head(5)

Unnamed: 0,pickup_datetime,month_of_the_ride,ride_id,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,distance,pickup_distance_to_jfk,dropoff_distance_to_jfk,pickup_distance_to_ewr,dropoff_distance_to_ewr,pickup_distance_to_lgr,dropoff_distance_to_lgr,year,weekday,hour
0,1577882700000,202001,135,-73.92254,40.689495,-73.7673,40.887066,2,15.88382,8.270023,16.990193,13.20025,25.300091,6.552041,9.43681,2020,2,12
1,1577883520000,202001,176,-73.81693,40.8239,-73.80063,40.844654,3,1.668032,12.779188,14.099817,20.891529,22.308797,4.41082,6.052817,2020,2,12
2,1577898080000,202001,904,-73.802086,40.8428,-73.79635,40.850098,3,0.586636,13.978813,14.458144,22.180792,22.686342,5.905553,6.485838,2020,2,17
3,1577900440000,202001,1022,-73.83064,40.806454,-73.77991,40.87103,3,5.19032,11.738002,15.873092,19.729807,24.155909,3.051922,8.154735,2020,2,17
4,1577903600000,202001,1180,-73.72762,40.93757,-73.78938,40.85897,3,6.316368,20.639902,15.051129,28.980397,23.305841,13.481569,7.19253,2020,2,18


In [12]:
y_train.head(5)

Unnamed: 0,total_fare
0,72.0
1,80.0
2,91.0
3,69.0
4,155.0


In [13]:
X_train, y_train, X_test, y_test = nyc_fares_fv.get_train_test_split(
    training_dataset_version=2
)

In [14]:
X_test.head(5)

Unnamed: 0,pickup_datetime,month_of_the_ride,ride_id,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,distance,pickup_distance_to_jfk,dropoff_distance_to_jfk,pickup_distance_to_ewr,dropoff_distance_to_ewr,pickup_distance_to_lgr,dropoff_distance_to_lgr,year,weekday,hour
0,1577962460000,202001,4123,-73.90577,40.71084,-73.93675,40.671402,2,3.171634,8.236619,8.571996,14.153589,12.520077,4.857851,7.995554,2020,3,10
1,1578279660000,202001,19983,-73.89861,40.719948,-74.008705,40.579826,2,11.271184,8.330694,12.819831,14.602976,11.532286,4.140491,15.33721,2020,0,3
2,1578202000000,202001,16100,-73.892845,40.727287,-73.89359,40.726334,2,0.076534,8.452279,8.434118,14.981048,14.931367,3.567016,3.64113,2020,6,5
3,1578051560000,202001,8578,-73.88058,40.7429,-73.907745,40.70833,2,2.780009,8.837542,8.222081,15.828395,14.033825,2.374279,5.056343,2020,4,11
4,1578045040000,202001,8252,-73.87806,40.74611,-73.94971,40.65492,2,7.33375,8.936653,9.045361,16.009134,12.019719,2.137968,9.314055,2020,4,9


In [15]:
y_test.head(5)

Unnamed: 0,total_fare
0,8.0
1,14.0
2,21.0
3,29.0
4,30.0


### Next Steps

In the next notebook, we will train a model on the dataset we created in this notebook.