In [17]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

# matplotlib: conda install requires tensorflow version downgrade
# %matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [18]:
import tensorflow as tf
import pandas as pd
import numpy as np

# The [`tf.data.Dataset`](https://www.tensorflow.org/guide/data) (TFDS)

A `Dataset`
- Is a way of *defining*  a **sequence** of **elements**
- Key property of sequence
    - Is *iterable*: we can only (directly) obtain the *next* element in the sequence
    - By restricting access to iteration
        - only one element at a time needs to fit into physical memory
        - facilitating use of datasets that are too large to fit into limited memory
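
As a minimal sketch of what "iterable" means in practice (the small `range` dataset is just an illustrative stand-in for a real training set), the only direct access is asking for the next element:

```python
import tensorflow as tf

# A small stand-in dataset whose elements are 0, 1, 2
ds = tf.data.Dataset.range(3)

# The only direct access is "give me the next element"
it = iter(ds)
first = next(it)
second = next(it)
print(first.numpy(), second.numpy())

# A for loop uses the same protocol, starting again from the beginning
values = [int(elem.numpy()) for elem in ds]
print(values)
```

There is no indexing such as `ds[1]`; to reach a later element we have to iterate up to it.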
        

`Dataset` may be used for multiple purposes
- our focus will be on its use to define a dataset for training a model
- So, an *element* will correspond to an *example* in the training set

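Elements can be *structures*
- consisting of *components* (each component is a Tensor)
- the structure may be a
    - `tuple`, `dict` (possibly nested)
    - **not** `list`: a Python `list` passed as input is packed into a single Tensor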
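
The difference between a tuple and a list can be seen with a small sketch (the values are arbitrary illustrations): the same numbers give a two-component structure when passed as a tuple, but a single Tensor when passed as a list.

```python
import tensorflow as tf

# A tuple argument is kept as a structure with two components
ds_tuple = tf.data.Dataset.from_tensors(([1, 2], [3, 4]))
print(ds_tuple.element_spec)

# A list argument is packed into one Tensor of shape (2, 2) instead
ds_list = tf.data.Dataset.from_tensors([[1, 2], [3, 4]])
print(ds_list.element_spec)
```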

The type/structure of an *element* will depend on the format in which your model expects inputs.

A `Dataset` may be created
- from a source (e.g., Tensors in memory, files)
- via transformation of an existing `Dataset`

# Creating Datasets

We will provide a brief overview of creating a `Dataset`.

See [the docs](https://www.tensorflow.org/guide/data) for a deeper introduction.


## From memory

One way is to create a `Dataset` from Tensors stored in memory
- This assumes your physical memory is large enough
- We will learn about generators later

The `element_spec` property will be helpful in explaining the element type.

### `from_tensors`

The `from_tensors` method creates a dataset *with a **single** element* from a pre-existing Tensor.


In [19]:
t = tf.constant([[1, 2], [3, 4]])

ds = tf.data.Dataset.from_tensors(t) 

print(f"Element spec: {ds.element_spec}\n")

for i, elem in enumerate(ds):
    print(f"Example {i}:", elem)


Element spec: TensorSpec(shape=(2, 2), dtype=tf.int32, name=None)

Example 0: tf.Tensor(
[[1 2]
 [3 4]], shape=(2, 2), dtype=int32)


Note that this single element may be a higher dimensional Tensor
- but it is still a **single** element

We can also define a single element that is a *structure*

Here we call `from_tensors`
- argument: *tuple* of tensors


In [20]:
ds = tf.data.Dataset.from_tensors( (tf.constant([1,2]), 
                                    tf.constant([3,4])
                                   )
                                 )

print(f"Element spec: {ds.element_spec}\n")

for i, elem in enumerate(ds):
    print(f"Example {i}:", elem)
                        

Element spec: (TensorSpec(shape=(2,), dtype=tf.int32, name=None), TensorSpec(shape=(2,), dtype=tf.int32, name=None))

Example 0: (<tf.Tensor: shape=(2,), dtype=int32, numpy=array([1, 2], dtype=int32)>, <tf.Tensor: shape=(2,), dtype=int32, numpy=array([3, 4], dtype=int32)>)




`element_spec` reveals that the element is a *tuple* 
- Note the bracketing parentheses surrounding `TensorSpec`

###  [`from_tensor_slices`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices)

Creates a Dataset with *multiple* elements from a higher dimensional Tensor
- The first dimension indexes over elements
- The remaining dimensions are for features with shape
    - e.g., 3D image: width, height, channel
- argument: single tensor
    - creates *one element per row (first dimension)* of argument
        - Slicing a 1D tensor produces scalar tensor elements
        - Slicing a 2D tensor produces 1D tensor elements

In [21]:
t = tf.constant( [
                    [1, 2],
                    [3, 4]
                 ])
ds = tf.data.Dataset.from_tensor_slices(t)

print(f"Element spec: {ds.element_spec}\n")

for i, elem in enumerate(ds):
    print(f"Example {i}:", elem)


Element spec: TensorSpec(shape=(2,), dtype=tf.int32, name=None)

Example 0: tf.Tensor([1 2], shape=(2,), dtype=int32)
Example 1: tf.Tensor([3 4], shape=(2,), dtype=int32)


`element_spec` reveals each example to be a 1D Tensor of length 2.
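
To contrast the two methods on the same Tensor, here is a small sketch that counts the elements with `cardinality()`:

```python
import tensorflow as tf

t = tf.constant([[1, 2], [3, 4]])

whole = tf.data.Dataset.from_tensors(t)        # one element of shape (2, 2)
rows = tf.data.Dataset.from_tensor_slices(t)   # two elements of shape (2,)

n_whole = int(whole.cardinality().numpy())
n_rows = int(rows.cardinality().numpy())
print(n_whole, whole.element_spec.shape)
print(n_rows, rows.element_spec.shape)
```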

What happens if we pass a tuple?
- argument: tuple of tensors (all with the same size in the first dimension)
- creates *multiple elements* (one per entry along the first dimension of the tensors)
    - each element is a *tuple*
    - element *i* is the tuple containing entry *i* of each tensor in the argument tuple

In [22]:
# Slicing a tuple of 1D tensors produces tuple elements containing
# scalar tensors.
ds = tf.data.Dataset.from_tensor_slices(([1, 2], [3, 4], [5, 6]))

print(f"Element spec: {ds.element_spec}\n")

for i, elem in enumerate(ds):
    print(f"Example {i}:", elem)



Element spec: (TensorSpec(shape=(), dtype=tf.int32, name=None), TensorSpec(shape=(), dtype=tf.int32, name=None), TensorSpec(shape=(), dtype=tf.int32, name=None))

Example 0: (<tf.Tensor: shape=(), dtype=int32, numpy=1>, <tf.Tensor: shape=(), dtype=int32, numpy=3>, <tf.Tensor: shape=(), dtype=int32, numpy=5>)
Example 1: (<tf.Tensor: shape=(), dtype=int32, numpy=2>, <tf.Tensor: shape=(), dtype=int32, numpy=4>, <tf.Tensor: shape=(), dtype=int32, numpy=6>)


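`element_spec` reveals each example to be a tuple
- length of the tuple is the number of tensors in the argument tuple passed to `from_tensor_slices` (here: 3)
- number of examples is the size of the first dimension of those tensors (here: 2)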

What happens if we pass a `dict` ?
- argument: `dict` whose values are tensors
- creates *multiple elements* (one per entry along the first dimension of the tensors)
    - like the case where argument is tuple of tensors
    - but the element is a `dict` rather than a tuple

In [23]:
# Dictionary structure is also preserved.
ds = tf.data.Dataset.from_tensor_slices({"a": [1, 2], "b": [3, 4]})

print(f"Element spec: {ds.element_spec}\n")

for i, elem in enumerate(ds):
    print(f"Example {i}:", elem)


Element spec: {'a': TensorSpec(shape=(), dtype=tf.int32, name=None), 'b': TensorSpec(shape=(), dtype=tf.int32, name=None)}

Example 0: {'a': <tf.Tensor: shape=(), dtype=int32, numpy=1>, 'b': <tf.Tensor: shape=(), dtype=int32, numpy=3>}
Example 1: {'a': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'b': <tf.Tensor: shape=(), dtype=int32, numpy=4>}


Here are some more realistic examples
- using our old friend: the Titanic data

The first uses a `dict` for the example features.



In [24]:
titanic = pd.read_csv("https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic.head()

titanic_features = titanic.copy()
titanic_labels = titanic_features.pop('survived')

titanic_features_dict = {name: np.array(value) 
                         for name, value in titanic_features.items()}

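|   | survived | sex | age | n_siblings_spouses | parch | fare | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | male | 22.0 | 1 | 0 | 7.25 | Third | unknown | Southampton | n |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | First | C | Cherbourg | n |
| 2 | 1 | female | 26.0 | 0 | 0 | 7.925 | Third | unknown | Southampton | y |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | First | C | Southampton | n |
| 4 | 0 | male | 28.0 | 0 | 0 | 8.4583 | Third | unknown | Queenstown | y |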


In [25]:
ds = tf.data.Dataset.from_tensor_slices(titanic_features_dict)

print(f"Element spec: {ds.element_spec}\n")

for i, elem in enumerate(ds.take(3)):
    print(f"Example {i}:", elem)

Element spec: {'sex': TensorSpec(shape=(), dtype=tf.string, name=None), 'age': TensorSpec(shape=(), dtype=tf.float64, name=None), 'n_siblings_spouses': TensorSpec(shape=(), dtype=tf.int64, name=None), 'parch': TensorSpec(shape=(), dtype=tf.int64, name=None), 'fare': TensorSpec(shape=(), dtype=tf.float64, name=None), 'class': TensorSpec(shape=(), dtype=tf.string, name=None), 'deck': TensorSpec(shape=(), dtype=tf.string, name=None), 'embark_town': TensorSpec(shape=(), dtype=tf.string, name=None), 'alone': TensorSpec(shape=(), dtype=tf.string, name=None)}

Example 0: {'sex': <tf.Tensor: shape=(), dtype=string, numpy=b'male'>, 'age': <tf.Tensor: shape=(), dtype=float64, numpy=22.0>, 'n_siblings_spouses': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'parch': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'fare': <tf.Tensor: shape=(), dtype=float64, numpy=7.25>, 'class': <tf.Tensor: shape=(), dtype=string, numpy=b'Third'>, 'deck': <tf.Tensor: shape=(), dtype=string, numpy=b'unknown'>, 'embark_

And a more realistic example using a tuple (features, label)
- where features is itself a `dict`

**Note**

Python strings are converted to byte strings when stored in a Tensor (note the `b'...'` prefix in the output above).
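
A short sketch of the round trip, using two illustrative strings: the values come back as Python `bytes`, and `decode` turns them into `str` again (UTF-8 is assumed here).

```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(["male", "female"])

raw = [elem.numpy() for elem in ds]        # Python bytes objects
text = [b.decode("utf-8") for b in raw]    # back to Python str
print(raw)
print(text)
```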

In [26]:
# Example is tuple with two elements: (features, label)
ds = tf.data.Dataset.from_tensor_slices((titanic_features_dict, titanic_labels))

print(f"Element spec: {ds.element_spec}\n")

for i, elem in enumerate(ds.take(3)):
    print(f"Example {i}:", elem)

Element spec: ({'sex': TensorSpec(shape=(), dtype=tf.string, name=None), 'age': TensorSpec(shape=(), dtype=tf.float64, name=None), 'n_siblings_spouses': TensorSpec(shape=(), dtype=tf.int64, name=None), 'parch': TensorSpec(shape=(), dtype=tf.int64, name=None), 'fare': TensorSpec(shape=(), dtype=tf.float64, name=None), 'class': TensorSpec(shape=(), dtype=tf.string, name=None), 'deck': TensorSpec(shape=(), dtype=tf.string, name=None), 'embark_town': TensorSpec(shape=(), dtype=tf.string, name=None), 'alone': TensorSpec(shape=(), dtype=tf.string, name=None)}, TensorSpec(shape=(), dtype=tf.int64, name=None))

Example 0: ({'sex': <tf.Tensor: shape=(), dtype=string, numpy=b'male'>, 'age': <tf.Tensor: shape=(), dtype=float64, numpy=22.0>, 'n_siblings_spouses': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'parch': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'fare': <tf.Tensor: shape=(), dtype=float64, numpy=7.25>, 'class': <tf.Tensor: shape=(), dtype=string, numpy=b'Third'>, 'deck': <tf.Tensor:
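
Because each element is a (features, label) tuple, it can be unpacked directly in the loop. The sketch below uses a tiny hand-written stand-in for the Titanic data (the `features` and `labels` values are illustrative, not read from the file):

```python
import tensorflow as tf

# Hypothetical stand-in for the Titanic features and labels
features = {"sex": ["male", "female", "female"], "age": [22.0, 38.0, 26.0]}
labels = [0, 1, 1]

ds = tf.data.Dataset.from_tensor_slices((features, labels))

ages, targets = [], []
for feats, label in ds:                     # unpack the 2-tuple
    ages.append(float(feats["age"].numpy()))
    targets.append(int(label.numpy()))
print(ages, targets)
```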

# Creating a Dataset without loading data into memory

The above creation methods required the presence of the entire dataset in memory.

But the key property of a `Dataset` is that it is iterable
- elements are made available one at a time

There are a number of common iterable formats that do not require an entire dataset to be in memory at once
- Files: single row/line at a time
- Network stream
- Python generators
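
For the generator case, a minimal sketch using `from_generator` might look as follows. The generator `squares` is a hypothetical example; a real one would typically read examples from disk or a stream.

```python
import tensorflow as tf

def squares(n=3):
    """Hypothetical generator: yields one example at a time."""
    for i in range(n):
        yield i * i

# output_signature tells TensorFlow the type and shape of each element
ds = tf.data.Dataset.from_generator(
    squares,
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int32),
)
values = [int(v) for v in ds.as_numpy_iterator()]
print(values)
```

Only the element currently being produced needs to exist in memory.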

## From Files

There are a number of methods to get datasets that are stored in files with common [formats](https://www.tensorflow.org/guide/data#reading_input_data).

For example: [Text file](https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset)

In [27]:
from tensorflow.keras.utils import get_file

# Download the CSV once and cache it locally; get_file returns the local path
file_path = get_file("titanic.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")

In [28]:
ds = tf.data.TextLineDataset(file_path)

print(f"Element spec: {ds.element_spec}\n")

for i, elem in enumerate(ds.take(3)):
    print(f"Example {i}:", elem)

Element spec: TensorSpec(shape=(), dtype=tf.string, name=None)

Example 0: tf.Tensor(b'survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone', shape=(), dtype=string)
Example 1: tf.Tensor(b'0,male,22.0,1,0,7.25,Third,unknown,Southampton,n', shape=(), dtype=string)
Example 2: tf.Tensor(b'1,female,38.0,1,0,71.2833,First,C,Cherbourg,n', shape=(), dtype=string)
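
Each element above is still a raw line of text, and the header line is included. One possible way to turn the lines into typed fields is sketched below, assuming a tiny temporary CSV (a hypothetical stand-in for the Titanic file) and `tf.io.decode_csv`:

```python
import os
import tempfile
import tensorflow as tf

# Write a tiny CSV (hypothetical stand-in for the Titanic file)
path = os.path.join(tempfile.mkdtemp(), "tiny.csv")
with open(path, "w") as f:
    f.write("survived,sex,age\n0,male,22.0\n1,female,38.0\n")

# One default per column; the type of each default fixes the column's dtype
defaults = [0, "", 0.0]

ds = (
    tf.data.TextLineDataset(path)
    .skip(1)                                                     # drop the header line
    .map(lambda line: tuple(tf.io.decode_csv(line, defaults)))   # split into typed fields
)
rows = list(ds.as_numpy_iterator())
print(rows)
```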


# Transforming an existing Dataset

We can also create a new `Dataset` via a transformation from an existing one.
- Can create complex transformation pipelines by *chaining* simple transformations
- The Transformations 
    - consume a single element at a time
    - produce a single element at a time
    
Hence, transformations do not require or cause the presence of the entire dataset in memory at once.
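
As a sketch of chaining (the particular steps are arbitrary illustrations), each call below returns a new `Dataset`, so the calls can be strung together into a pipeline:

```python
import tensorflow as tf

ds = (
    tf.data.Dataset.range(10)
    .filter(lambda x: x % 2 == 0)   # keep 0, 2, 4, 6, 8
    .map(lambda x: x * x)           # square each remaining element
    .batch(2)                       # group into batches of up to 2
)
batches = [b.tolist() for b in ds.as_numpy_iterator()]
print(batches)
```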

We illustrate some common transformations.

## `batch`

A training dataset is usually grouped (for the purpose of Minibatch Gradient Descent) into *batches* of examples.

There is a transformation `batch`
- Takes a source `Dataset` of elements
- Creates a new `Dataset` whose elements
    - are groups of the elements of the source `Dataset`

In [29]:
ds = tf.data.Dataset.range(8)

print("Source dataset:\n")
print(f"Element spec: {ds.element_spec}\n")

for i, elem in enumerate(ds):
    print(f"Example {i}:", elem)

print("\n\nTransformed dataset:\n")  
ds = ds.batch(3)

print(f"Element spec: {ds.element_spec}\n")

for i, elem in enumerate(ds):
    print(f"Example {i}: {elem}")

    

Source dataset:

Element spec: TensorSpec(shape=(), dtype=tf.int64, name=None)

Example 0: tf.Tensor(0, shape=(), dtype=int64)
Example 1: tf.Tensor(1, shape=(), dtype=int64)
Example 2: tf.Tensor(2, shape=(), dtype=int64)
Example 3: tf.Tensor(3, shape=(), dtype=int64)
Example 4: tf.Tensor(4, shape=(), dtype=int64)
Example 5: tf.Tensor(5, shape=(), dtype=int64)
Example 6: tf.Tensor(6, shape=(), dtype=int64)
Example 7: tf.Tensor(7, shape=(), dtype=int64)


Transformed dataset:

Element spec: TensorSpec(shape=(None,), dtype=tf.int64, name=None)

Example 0: [0 1 2]
Example 1: [3 4 5]
Example 2: [6 7]


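Observe
- `element_spec` of the source is a 0-D Tensor
    - elements are single values
- `element_spec` of the transformed dataset has a "batch" dimension
    - elements are 1-D Tensors
    - of *unequal* lengths: the final batch holds only the 2 leftover elements, since 8 is not a multiple of 3
    - which is why the batch dimension in the `element_spec` is `None` rather than 3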
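
If equal-length batches are needed, one option is the `drop_remainder` argument of `batch`, which discards the short final batch. A minimal sketch:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(8).batch(3, drop_remainder=True)

# The batch dimension is now static: shape=(3,) instead of (None,)
print(ds.element_spec)

batches = [b.tolist() for b in ds.as_numpy_iterator()]
print(batches)
```

The trade-off is that the leftover elements (here 6 and 7) never appear in a batch.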

## Other common transformations

- `map`
    - create new elements which are the result of applying a function to each original element
- `apply`
    - create a new dataset by applying a function that maps a whole `Dataset` to a `Dataset`

- `filter`
    - create a sub-sequence, selecting original elements according to a condition

- `repeat`
    - create a longer dataset by repeating all the elements
- `skip`
    - create a *suffix* of the elements
- `take`
    - create a *prefix* of the elements
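
A small sketch of each transformation on the same source dataset (the helper `to_list` is a hypothetical convenience for printing, not part of the API):

```python
import tensorflow as tf

def to_list(ds):
    """Collect a small dataset of scalars into a Python list."""
    return [int(v) for v in ds.as_numpy_iterator()]

ds = tf.data.Dataset.range(6)                        # 0 .. 5

mapped = to_list(ds.map(lambda x: x * 2))            # function of each element
filtered = to_list(ds.filter(lambda x: x % 2 == 0))  # keep elements passing a test
repeated = to_list(ds.take(2).repeat(2))             # the elements, twice over
skipped = to_list(ds.skip(4))                        # suffix: drop the first 4
taken = to_list(ds.take(2))                          # prefix: keep the first 2
# apply: the function receives the whole Dataset, not one element
applied = to_list(ds.apply(lambda d: d.skip(1).take(3)))

for name, result in [("map", mapped), ("filter", filtered), ("repeat", repeated),
                     ("skip", skipped), ("take", taken), ("apply", applied)]:
    print(name, result)
```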

# HuggingFace datasets

HuggingFace has its own datasets API which is similar to TFDS.

[Here](https://huggingface.co/docs/datasets/index) is the documentation.
- tutorials
- how-to guides

You can create your own HF datasets !
- For example: [from pandas, csv files, etc.](https://huggingface.co/docs/datasets/loading#inmemory-data)

The format and orientation of a dataset may differ between TFDS and HF.

Fortunately, a HuggingFace `Dataset` has a [`to_tf_dataset`](https://huggingface.co/docs/datasets/use_dataset#tokenize-text)
method that converts it to a `tf.data.Dataset`.



In [32]:
print("Done")

Done
