# Titanic Dataset for the Feature Store

This notebook prepares the Titanic dataset to be used with the feature store.

The Titanic dataset contains information about the passengers of the famous Titanic ship. The training and test data come in the form of two CSV files, which can be downloaded from the Titanic Competition page on [Kaggle](https://www.kaggle.com/c/titanic/data).

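Download the `train.csv` and `test.csv` files, and upload them to the `Resources` folder of your Hopsworks project. Note that the code below reads the training data from `titanic-train.csv`, so either rename `train.csv` to that name when uploading it or adjust the path in the read cell. If you prefer using the GUI, you can find the `Resources` folder by opening the **Data Sets** tab in the left menu bar.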

Once you have uploaded the two files to the `Resources` folder, you can proceed with the rest of the notebook.

In [1]:
from hops import hdfs
from pyspark.sql import functions as F
import hsfs

# Create a connection
connection = hsfs.connection()
# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log |
|---|---|---|---|---|---|
| 22 | application_1614082217334_0028 | pyspark | idle | Link | Link |


SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

Let's begin by reading the training data into a Spark DataFrame:

In [2]:
training_csv = spark.read\
             .option("inferSchema", "true")\
             .option("header", "true")\
             .format("csv")\
             .load("hdfs:///Projects/{}/Resources/titanic-train.csv".format(hdfs.project_name()))
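To make the `header` and `inferSchema` options more concrete, here is a minimal pure-Python sketch of what they do conceptually, using only the standard library. The sample rows and the `infer_value` helper are hypothetical and use only a subset of the columns. This is not how Spark implements schema inference: Spark settles on one type per column, while this sketch converts value by value.

```python
import csv
import io

# Hypothetical two-row sample with a subset of the training file's columns.
SAMPLE = "Survived,Pclass,Sex,Age,Fare\n0,3,male,22,7.25\n1,1,female,,71.2833\n"


def infer_value(text):
    """Convert a CSV cell to None, int, or float, or leave it as a string."""
    if text == "":
        return None  # an empty cell is treated as a missing value
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text


# DictReader takes column names from the first line, like option("header", "true").
rows = [
    {name: infer_value(cell) for name, cell in record.items()}
    for record in csv.DictReader(io.StringIO(SAMPLE))
]

for row in rows:
    print(row)
```

The empty `Age` cell in the second row ends up as a missing value, which is the kind of gap the preprocessing step has to fill.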

Now we can do some simple preprocessing. Rather than registering the whole dataset with the Feature Store, we select only a few of the columns and cast them all to `int` (note that this drops the fractional part of `fare`). Since the values of the `sex` column are either `male` or `female`, we convert them to `0` and `1`, respectively. We also fill the missing values of the `age` column with `30`.

In [3]:
# Simple preprocessing:
#     1 - select a few of the columns
#     2 - fill the missing 'age' values with 30
#     3 - encode 'sex' as 0 (male) or 1 (anything else)
#     4 - cast all columns to int

clean_train_df = (
    training_csv
    .select('survived', 'pclass', 'sex', 'fare', 'age', 'sibsp', 'parch')
    .fillna({'age': 30})
    .withColumn('sex', F.when(F.col('sex') == 'male', 0).otherwise(1))
    .withColumn('survived', F.col('survived').cast('int'))
    .withColumn('pclass', F.col('pclass').cast('int'))
    .withColumn('fare', F.col('fare').cast('int'))
    .withColumn('age', F.col('age').cast('int'))
    .withColumn('sibsp', F.col('sibsp').cast('int'))
    .withColumn('parch', F.col('parch').cast('int'))
)
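To see what these steps do to individual records, the same logic can be sketched in plain Python without Spark. The `clean_row` helper and the two passenger records are hypothetical and exist only for illustration. The sketch assumes that `age` is the only column with missing values.

```python
COLUMNS = ["survived", "pclass", "sex", "fare", "age", "sibsp", "parch"]


def clean_row(row):
    """Apply the four preprocessing steps to one passenger record (a dict)."""
    # 1 - keep only the selected columns
    selected = {name: row.get(name) for name in COLUMNS}
    # 2 - fill a missing age with 30
    if selected["age"] is None:
        selected["age"] = 30
    # 3 - encode sex: 'male' -> 0, anything else -> 1
    selected["sex"] = 0 if selected["sex"] == "male" else 1
    # 4 - cast every column to int; int() drops the fractional part
    return {name: int(value) for name, value in selected.items()}


# Hypothetical records; the second one has no age, and both carry an
# extra 'embarked' column that is not selected.
passengers = [
    {"survived": 0, "pclass": 3, "sex": "male", "fare": 7.25,
     "age": 22.0, "sibsp": 1, "parch": 0, "embarked": "S"},
    {"survived": 1, "pclass": 1, "sex": "female", "fare": 71.2833,
     "age": None, "sibsp": 1, "parch": 0, "embarked": "C"},
]

cleaned = [clean_row(p) for p in passengers]
for row in cleaned:
    print(row)
```

In the result, the fares `7.25` and `71.2833` become `7` and `71`, and the missing age becomes `30`. If the fractional part of `fare` matters for your model, casting that column to a floating-point type instead would be a reasonable alternative.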

Next, we create a metadata object for the feature group so that we can materialize it later:

In [4]:
# create the metadata object

titanic_all_fg_meta = fs.create_feature_group(name="titanic_all_features",
                                       version=1,
                                       description="Titanic training features",
                                       time_travel_format=None,
                                       statistics_config={"enabled": False, "histograms": False, "correlations": False})

Now that we have the metadata object, the next step is to materialize the *feature group* by saving the DataFrame, which registers it with the project's Feature Store:

In [5]:
titanic_all_fg_meta.save(clean_train_df)

<hsfs.feature_group.FeatureGroup object at 0x7fc17f94e110>

Finally, we create a *training dataset* from the feature group. This is straightforward with the Feature Store API: you provide a name and a data format for the dataset. For now, let's stick with `tfrecord`, TensorFlow's own file format.

In [6]:
# create training dataset

titanic_all_fg = fs.get_feature_group('titanic_all_features', version=1)

query = titanic_all_fg.select_all()

td = fs.create_training_dataset(name="titanic_train_dataset",
                               description="Titanic training dataset with all features",
                               data_format="tfrecord",
                               version=1)

td.save(query)

<hsfs.training_dataset.TrainingDataset object at 0x7fc17f4c7f10>

Done! You can now use the Titanic training data in your projects.