## ⚠️ Warning! You should use **PySpark kernel** for this notebook!

# <a class="anchor" id="1.5_bullet" style="color:#533a7b"> **Dataset Creation** </a>
---


In this notebook, we will create the actual dataset that we will train our model on. In particular, we will:
1. Select the features we want to train our model on.
2. Specify how the features should be preprocessed.
3. Create a feature view and a dataset.

![tutorial-flow](images/create_training_dataset.png)

### <a class="anchor" id="1.5_bullet" style="color:#e363a3"> **📝 Importing Libraries** </a>

In [1]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Starting Spark application


ID,Application ID,Kind,State,Spark UI,Driver log
4,application_1654459018768_0003,pyspark,idle,Link,Link


SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

### <a class="anchor" id="1.5_bullet" style="color:#3772ff"> **⬇️ Data retrieving from Feature Groups** </a>

We start by selecting all the features we want to include for model training/inference.

In [2]:
# Load feature groups.
fares_fg = fs.get_feature_group("fares_fg")
rides_fg = fs.get_feature_group("rides_fg")



#### With the `commit_details()` method we can list all commits of a feature group, and then make a point-in-time query (retrieve the records as they were at a given commit time).

In [3]:
fares_commit_details = fares_fg.commit_details()
rides_commit_details = rides_fg.commit_details()

In [4]:
fares_commit_details

{1654465884060: {'committedOn': '20220605215124060', 'rowsUpdated': 0, 'rowsInserted': 41078, 'rowsDeleted': 0}}

In [5]:
rides_commit_details

{1654465941636: {'committedOn': '20220605215221636', 'rowsUpdated': 0, 'rowsInserted': 40907, 'rowsDeleted': 0}}

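The dictionary keys are commit times as UNIX timestamps in milliseconds; `committedOn` holds the same instant formatted as a `yyyyMMddHHmmssSSS` string.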
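To make the two time representations concrete, here is a minimal sketch in plain Python (no Hopsworks calls) that converts a commit key and its `committedOn` string to `datetime` objects. It assumes the keys are epoch milliseconds and that `committedOn` is a UTC `yyyyMMddHHmmssSSS` string, which is consistent with the values printed above. The helper names `commit_key_to_datetime` and `committed_on_to_datetime` are hypothetical and not part of `hsfs`.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def commit_key_to_datetime(commit_key_ms):
    """Convert a commit key (assumed: UNIX epoch milliseconds) to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=commit_key_ms)


def committed_on_to_datetime(committed_on):
    """Parse a 'committedOn' string (assumed: UTC, yyyyMMddHHmmssSSS)."""
    seconds_part = datetime.strptime(committed_on[:14], "%Y%m%d%H%M%S")
    millis = int(committed_on[14:])
    return seconds_part.replace(tzinfo=timezone.utc) + timedelta(milliseconds=millis)


# Values copied from the fares_fg commit details shown above.
fares_commit_details = {
    1654465884060: {"committedOn": "20220605215124060", "rowsUpdated": 0,
                    "rowsInserted": 41078, "rowsDeleted": 0}
}

for key, details in fares_commit_details.items():
    # Both representations should describe the same instant.
    print(commit_key_to_datetime(key), committed_on_to_datetime(details["committedOn"]))
```

Both conversions yield the same instant for the commits listed in this notebook, so either representation can be used to reason about commit order.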

#### Let's just pick the last commit time for data query creation.

In [6]:
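# Take the most recent commit of each feature group (the largest key),
# then keep the later of the two commit times.
commit_time = max(
    fares_commit_details[max(fares_commit_details)]["committedOn"],
    rides_commit_details[max(rides_commit_details)]["committedOn"],
)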

In [7]:
commit_time

'20220605215221636'
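The selection of the last commit can be sketched for any number of feature groups in plain Python. This is only an illustration under two assumptions: each commit-details dictionary is keyed by an epoch-millisecond timestamp, and `committedOn` strings are fixed-width, so comparing them as strings matches time order. The helper `latest_committed_on` is hypothetical, and the dictionaries are trimmed copies of the outputs above.

```python
def latest_committed_on(*commit_details):
    """Return the most recent 'committedOn' string across commit-detail dicts.

    Assumes each dict maps a commit timestamp (epoch ms) to a details dict and
    that 'committedOn' strings are fixed-width, so string order is time order.
    """
    latest_per_group = [
        details[max(details)]["committedOn"]  # largest key = newest commit
        for details in commit_details
    ]
    return max(latest_per_group)


# Trimmed copies of the commit details printed earlier in this notebook.
fares_commit_details = {1654465884060: {"committedOn": "20220605215124060"}}
rides_commit_details = {1654465941636: {"committedOn": "20220605215221636"}}

print(latest_committed_on(fares_commit_details, rides_commit_details))
```

Using the largest key rather than the first sorted key matters once a feature group has more than one commit; with a single commit per group, as here, both give the same result.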

In [8]:
# Select features for training data.
fg_query = (
    fares_fg.select(['total_fare', 'pickup_datetime', 'month_of_the_ride'])
    .join(
        rides_fg.select_except(['taxi_id', 'driver_id']),
        on=['ride_id', 'pickup_datetime', 'month_of_the_ride'],
    )
    .as_of(commit_time)
)

In [9]:
fg_query.show(2)

+----------+---------------+-----------------+-------+----------------+---------------+-----------------+----------------+---------------+------------------+----------------------+-----------------------+----------------------+-----------------------+----------------------+-----------------------+----+-------+----+
|total_fare|pickup_datetime|month_of_the_ride|ride_id|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|passenger_count|          distance|pickup_distance_to_jfk|dropoff_distance_to_jfk|pickup_distance_to_ewr|dropoff_distance_to_ewr|pickup_distance_to_lgr|dropoff_distance_to_lgr|year|weekday|hour|
+----------+---------------+-----------------+-------+----------------+---------------+-----------------+----------------+---------------+------------------+----------------------+-----------------------+----------------------+-----------------------+----------------------+-----------------------+----+-------+----+
|      54.0|  1577882080000|           202001|   

#### <a class="anchor" id="1.5_bullet" style="color:#772f1a"> **〰️ Transformation Functions** </a>


We can preprocess our data with transformation functions such as *min-max scaling* for numerical features and *label encoding* for categorical features. To do this, we simply define a mapping between our features and transformation functions. Because the mapping is attached to the feature view, functions such as *min-max scaling* are fitted only on the training data (and not on the validation/test data), which prevents data leakage.
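To illustrate why fitting only on the training split avoids leakage, here is a minimal sketch of min-max scaling in plain Python. It is not the implementation behind the built-in `min_max_scaler`; the function names and the fare values are made up for illustration.

```python
def fit_min_max(values):
    """Learn the scaling statistics (minimum and maximum) from the given values."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values with previously fitted statistics; a constant column maps to 0.0."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical fares, split before any statistics are computed.
train_fares = [10.0, 54.0, 72.0]
validation_fares = [32.0, 80.0]

# Fit on the training split only, then reuse the statistics everywhere.
lo, hi = fit_min_max(train_fares)
scaled_train = apply_min_max(train_fares, lo, hi)
scaled_validation = apply_min_max(validation_fares, lo, hi)

print(scaled_train)
print(scaled_validation)
```

Note that a validation value above the training maximum scales to more than 1.0. That is expected: the statistics never saw the validation data, which is exactly the property that prevents leakage.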

In [10]:
[t_func.name for t_func in fs.get_transformation_functions()]

['label_encoder', 'min_max_scaler', 'standard_scaler', 'robust_scaler']

In [4]:
# # Load transformation functions.
# min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
# label_encoder = fs.get_transformation_function(name="label_encoder")

# # Map features to transformations.
# transformation_functions = {
#     "total_fare": min_max_scaler,
#     "distance": min_max_scaler
# }

## <a class="anchor" id="1.5_bullet" style="color:#2db3f0"> **🔮 Feature View Creation** </a>


Now that we have a query over the desired features, we create the corresponding `Feature View` with `fs.create_feature_view()`.

In [11]:
nyc_fares_fv = fs.create_feature_view(
    name='nyc_taxi_fares',
    query=fg_query,
    label=["total_fare"]
#     transformation_functions=transformation_functions
)

Feature view created successfully, explore it at https://1e87e1e0-e1b3-11ec-8067-e932b2b957b4.cloud.hopsworks.ai/p/121/fs/69/fv/nyc_taxi_fares/version/1

In [12]:
nyc_fares_fv.version

1

In [13]:
nyc_fares_fv.preview_feature_vector()

[1577881300000, '202001', 561, -73.8609, 40.76794, -73.77363, 40.879017, 3, 8.928644115786534, 9.765811172299426, 16.426323820562693, 17.29121759663983, 24.724006251389003, 0.923638390507464, 8.793034715642845, 2020, 2, 15]
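The previewed vector is a plain list, so it can help to pair each value with a feature name. The sketch below assumes the values follow the column order of the query output shown earlier, with the label `total_fare` left out (the vector has 18 values, one fewer than the 19 query columns). That ordering is an assumption made for illustration, not something the notebook output confirms.

```python
# Column names copied from the fg_query.show() header, without the label column.
feature_names = [
    "pickup_datetime", "month_of_the_ride", "ride_id",
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude",
    "passenger_count", "distance",
    "pickup_distance_to_jfk", "dropoff_distance_to_jfk",
    "pickup_distance_to_ewr", "dropoff_distance_to_ewr",
    "pickup_distance_to_lgr", "dropoff_distance_to_lgr",
    "year", "weekday", "hour",
]

# Feature vector copied from the preview output above.
feature_vector = [
    1577881300000, '202001', 561, -73.8609, 40.76794, -73.77363, 40.879017, 3,
    8.928644115786534, 9.765811172299426, 16.426323820562693, 17.29121759663983,
    24.724006251389003, 0.923638390507464, 8.793034715642845, 2020, 2, 15,
]

# Fail loudly if the assumed schema and the vector ever get out of sync.
if len(feature_names) != len(feature_vector):
    raise ValueError("feature names and feature vector differ in length")

named_vector = dict(zip(feature_names, feature_vector))
for name, value in named_vector.items():
    print(f"{name}: {value}")
```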

## <a class="anchor" id="1.5_bullet" style="color:#525252"> **📦 Dataset Creation** </a>


Finally, we create the dataset using `FeatureView.create_training_dataset()`.

In [14]:
td_metadata = nyc_fares_fv.create_training_dataset(
    description = 'NYC taxi fares dataset.',
    data_format = 'csv',
    splits = {'train': 80, 'validation': 20},
    train_split = "train",
    write_options = {'wait_for_job': False}
)



We can sanity check the dataset by loading it back. (The transformation functions above are commented out, so the values come back unscaled.)

Now our dataset has been split into two parts: **train (80% of the original dataset)** and **validation (20% of the original dataset)**.

- To get the training dataset, use the `FeatureView.get_training_dataset()` method.

- To retrieve a specific split of the training dataset, use the `FeatureView.get_training_dataset_splits()` method.
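As a rough illustration of what an 80/20 split means, here is a minimal sketch in plain Python that shuffles rows and cuts them proportionally to the weights in a `splits` dictionary. The platform's own splitting logic is not shown in this notebook and may differ; `split_rows` and the integer ride IDs are hypothetical.

```python
import random


def split_rows(rows, splits, seed=42):
    """Shuffle rows and cut them into named splits proportional to the weights."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    total_weight = sum(splits.values())
    result, start, cumulative = {}, 0, 0
    for name, weight in splits.items():
        cumulative += weight
        # Cut at the cumulative proportion so every row lands in exactly one split.
        end = round(len(shuffled) * cumulative / total_weight)
        result[name] = shuffled[start:end]
        start = end
    return result


ride_ids = list(range(100))
parts = split_rows(ride_ids, {"train": 80, "validation": 20})
print({name: len(part) for name, part in parts.items()})
```

Cutting at cumulative proportions keeps the splits disjoint and guarantees that together they cover every row, regardless of rounding.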

In [15]:
td_version, df = nyc_fares_fv.get_training_dataset()

df.head()

Row(total_fare=72.0, pickup_datetime=1577882700000, month_of_the_ride='202001', ride_id=135, pickup_longitude=-73.92254, pickup_latitude=40.689495, dropoff_longitude=-73.7673, dropoff_latitude=40.887066, passenger_count=2, distance=15.883820321741128, pickup_distance_to_jfk=8.27002312716127, dropoff_distance_to_jfk=16.99019305788723, pickup_distance_to_ewr=13.200250445207633, dropoff_distance_to_ewr=25.300090740530067, pickup_distance_to_lgr=6.5520408082752635, dropoff_distance_to_lgr=9.43680957272608, year=2020, weekday=2, hour=12)

### Next Steps

In the next notebook, we will train a model on the dataset we created in this notebook.