In [57]:
import functools
import tensorflow as tf
import numpy as np
import pandas as pd

# TensorFlow Dataset API
This notebook focuses on understanding how to use the TensorFlow Dataset API.

First, we download the sample data from the Titanic dataset, a pretty standard exercise.
Remember that the files are downloaded to the local system, by default under `~/.keras/datasets/` (for example `~/.keras/datasets/eval.csv`).


In [58]:
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)

print(test_file_path)

/Users/ness/.keras/datasets/eval.csv


# Visualization
Let's see what this file looks like. We are going to use the `head` command, which by default prints the first 10 lines of a file (here, the header row plus nine data rows).

In [59]:
!head {train_file_path}

survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n
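The `!head` shortcut relies on a Unix shell being available. As a minimal sketch, the same preview can be written in plain Python. The helper name `head` and the tiny stand-in CSV below are illustrative assumptions, not part of the notebook; in the notebook you would pass `train_file_path` instead.

```python
import itertools
import os
import tempfile


def head(path, n=10):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        # islice stops after n lines, so the whole file is never loaded.
        return [line.rstrip("\n") for line in itertools.islice(f, n)]


# Hypothetical stand-in file so the sketch runs on its own.
sample = "survived,sex,age\n0,male,22.0\n1,female,38.0\n1,female,26.0\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

preview = head(sample_path, n=3)
for line in preview:
    print(line)

os.remove(sample_path)  # clean up the temporary file
```

Reading lazily with `itertools.islice` keeps the preview cheap even for large CSV files.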


# Create a dataset with TensorFlow
Next we need to create a dataset we can work with. The `tf.data.experimental.make_csv_dataset` function creates a dataset from the CSV file we pass in. The nice thing is that it separates our features from our label, returning a tuple: (features, label). We specify our **target** column with the `label_name` argument.

In [62]:
dataset = tf.data.experimental.make_csv_dataset(
    train_file_path,
    batch_size=5,
    label_name='survived',
    na_value='?',
    num_epochs=1,
    ignore_errors=True,
    shuffle=False)

Now let's analyze the dataset that the function has returned.
This is what the documentation for `tf.data.experimental.make_csv_dataset` says:

> Reads CSV files into a dataset, where each element of the dataset is a (features, labels) tuple that corresponds to a batch of CSV rows. The features dictionary maps feature column names to Tensors containing the corresponding feature data, and labels is a Tensor containing the batch's label data.

**Notice how important it is that the features element is an ordered dictionary: its keys follow the column order of the CSV file.**

In [64]:
# for element in dataset.take(1):
#     print(f'Each element of the dataset is a tuple(features, labels)-> Type: {type(element)}, length:{len(element)}')

# Tuple (Features: OrderedDict, label: Tensor)
for features, label in dataset.take(1):
    print(f'The label element is a tensor: {type(label)}')
    print(f'---> Label Tensor shape:{label.shape}')
    print(f'---> Label content:{label.numpy()}')
    print(f'The features element is a dictionary: {type(features)}')
    print(f'---> features contains {len(features)} (key, value) objects')
    for (key, value) in features.items():
        print("--->{:20s}:{}".format(key, value))

The label element is a tensor: <class 'tensorflow.python.framework.ops.EagerTensor'>
---> Label Tensor shape:(5,)
---> Label content:[0 1 1 1 0]
The features element is a dictionary: <class 'collections.OrderedDict'>
---> features contains 9 (key, value) objects
--->sex                 :[b'male' b'female' b'female' b'female' b'male']
--->age                 :[22. 38. 26. 35. 28.]
--->n_siblings_spouses  :[1 1 0 1 0]
--->parch               :[0 0 0 0 0]
--->fare                :[ 7.25   71.2833  7.925  53.1     8.4583]
--->class               :[b'Third' b'First' b'Third' b'First' b'Third']
--->deck                :[b'unknown' b'C' b'unknown' b'C' b'unknown']
--->embark_town         :[b'Southampton' b'Cherbourg' b'Southampton' b'Southampton' b'Queenstown']
--->alone               :[b'n' b'n' b'y' b'n' b'y']


![From csv to Dataset](figs/csv_to_dataset.png)
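To make the (features, labels) structure concrete, here is a pure-Python sketch that mimics the batching. It is only an illustration of the shape of the result, not how TensorFlow implements `make_csv_dataset`: the function name `batched_csv` is hypothetical, values stay as plain strings (no type inference, no tensors), and the sample uses a few columns of the first rows shown earlier.

```python
import csv
import io
from collections import OrderedDict


def batched_csv(text, batch_size, label_name):
    """Yield one (features, labels) pair per batch of CSV rows."""
    reader = csv.DictReader(io.StringIO(text))
    # Every column except the label becomes a feature, in file order.
    feature_names = [c for c in reader.fieldnames if c != label_name]
    rows = list(reader)
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        features = OrderedDict(
            (name, [row[name] for row in chunk]) for name in feature_names
        )
        labels = [row[label_name] for row in chunk]
        yield features, labels


SAMPLE = (
    "survived,sex,age,fare\n"
    "0,male,22.0,7.25\n"
    "1,female,38.0,71.2833\n"
    "1,female,26.0,7.925\n"
    "1,female,35.0,53.1\n"
    "0,male,28.0,8.4583\n"
)

batches = list(batched_csv(SAMPLE, batch_size=2, label_name="survived"))
first_features, first_labels = batches[0]
print(first_labels)
for key, value in first_features.items():
    print("--->{:20s}:{}".format(key, value))
```

Each feature maps to a list with one entry per row in the batch, which is the same layout the tensors take in the output above. With five rows and a batch size of two, the last batch holds a single row.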

In [71]:
# Let's put everything into functions so we can work without repeating code

# Create a function that wraps make_csv_dataset; extra keyword arguments are passed through
def create_dataset(file_path, label_name, **kwargs):
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=5,
        label_name=label_name,
        na_value='?',
        num_epochs=1,
        ignore_errors=True,
        shuffle=False,
        **kwargs)
    return dataset

# Create a function that describes the first batch of the dataset
def describe_dataset(dataset):
    for features, label in dataset.take(1):
        print(f'The label element is a tensor: {type(label)}')
        print(f'---> Label Tensor shape:{label.shape}')
        print(f'---> Label content:{label.numpy()}')
        print(f'The features element is a dictionary: {type(features)}')
        print(f'---> features contains {len(features)} (key, value) objects')
        for (key, value) in features.items():
            print("--->{:20s}:{}".format(key, value))

There are a number of things we can do with this dataset.
Suppose you don't need all the columns of the CSV file, only some of them.
You can list the columns you want:
`SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'class', 'deck', 'alone']`
and create a new dataset with just those columns by passing the list as `select_columns`. Note that the list includes `survived`, since the label column still has to be read from the file.

In [73]:
SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'class', 'deck', 'alone']
temp_dataset = create_dataset(train_file_path, label_name='survived', select_columns=SELECT_COLUMNS)
describe_dataset(temp_dataset)

The label element is a tensor: <class 'tensorflow.python.framework.ops.EagerTensor'>
---> Label Tensor shape:(5,)
---> Label content:[0 1 1 1 0]
The features element is a dictionary: <class 'collections.OrderedDict'>
---> features contains 5 (key, value) objects
--->age                 :[22. 38. 26. 35. 28.]
--->n_siblings_spouses  :[1 1 0 1 0]
--->class               :[b'Third' b'First' b'Third' b'First' b'Third']
--->deck                :[b'unknown' b'C' b'unknown' b'C' b'unknown']
--->alone               :[b'n' b'n' b'y' b'n' b'y']
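As a rough picture of what column selection amounts to, the sketch below keeps only the requested columns from each CSV row using the standard library. The helper `pick_columns` and the three-row sample are assumptions made for illustration; TensorFlow's own handling of `select_columns` is not shown here.

```python
import csv
import io


def pick_columns(text, columns):
    """Return the CSV rows as dicts restricted to the given columns."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in columns if c not in reader.fieldnames]
    if missing:
        raise ValueError(f"Columns not found in the CSV header: {missing}")
    # Build each row in the order given by `columns`.
    return [{c: row[c] for c in columns} for row in reader]


SAMPLE = (
    "survived,sex,age,class,alone\n"
    "0,male,22.0,Third,n\n"
    "1,female,38.0,First,n\n"
    "1,female,26.0,Third,y\n"
)

selected = pick_columns(SAMPLE, ["survived", "age", "alone"])
for row in selected:
    print(row)
```

Checking the requested names against the header first gives a clear error for a misspelled column instead of a `KeyError` deep inside the loop.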
