# 13. Loading and Preprocessing Data with TensorFlow

This chapter introduces the **Data API**, a useful tool for ingesting and preprocessing large datasets.

### The Data API

The whole API revolves around the concept of a **dataset**, which represents a sequence of data items. Here is a simple example:

In [1]:
import tensorflow as tf

X = tf.range(10) # any data tensor

In [2]:
dataset = tf.data.Dataset.from_tensor_slices(X)

In [3]:
dataset

<TensorSliceDataset shapes: (), types: tf.int32>
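To make the slicing behavior concrete, here is a minimal sketch. The `features` and `labels` arrays are made up for illustration; the point is only that `from_tensor_slices()` slices its input along the first dimension, so each item of the dataset is one row (here paired with its label).

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 4 instances with 2 features each, plus 4 labels.
features = np.arange(8, dtype=np.float32).reshape(4, 2)
labels = np.array([0, 1, 0, 1], dtype=np.int32)

# Slicing happens along the first axis, so the dataset holds 4 (row, label) pairs.
pairs = tf.data.Dataset.from_tensor_slices((features, labels))

# Convert each item to plain Python values to inspect it.
rows = [(x.numpy().tolist(), int(y.numpy())) for x, y in pairs]
print(rows)
```

Passing a tuple (or a dict) of tensors is a common way to keep features and labels together through later transformations.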

#### Chaining Transformations

We can apply all sorts of transformations to a dataset. Each transformation returns a new dataset, so we can chain them:

In [4]:
dataset = dataset.repeat(3).batch(7)

In [5]:
for item in dataset:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)


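We can also transform the items using `map()`. Since the dataset is already batched, the function is applied to each batch:

In [7]:
dataset = dataset.map(lambda x: x * 2) # first batch becomes [0, 2, 4, 6, 8, 10, 12]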
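The effect of `map()` is easier to see by inspecting an item. The following sketch rebuilds the batched dataset from scratch so that it stands alone (the names `batched` and `doubled` are only illustrative) and checks the first batch.

```python
import tensorflow as tf

# Rebuild the batched dataset so this snippet does not depend on earlier cells.
batched = tf.data.Dataset.from_tensor_slices(tf.range(10)).repeat(3).batch(7)

# map() receives one item at a time; here each item is a whole batch of 7 values.
doubled = batched.map(lambda x: x * 2)

first_batch = next(iter(doubled)).numpy().tolist()
print(first_batch)
```

Note that `map()` does not modify `batched`; it returns a new dataset, like every other transformation.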

#### Shuffling the Data

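The `shuffle()` method fills a buffer with the first `buffer_size` items, then repeatedly picks one item at random from the buffer and replaces it with the next item from the source dataset. For example: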

In [9]:
dataset = tf.data.Dataset.range(10).repeat(3) # 0 to 9, three times

In [10]:
dataset = dataset.shuffle(buffer_size=5, seed=42).batch(7)

In [11]:
for item in dataset:
    print(item)

tf.Tensor([0 2 3 6 7 9 4], shape=(7,), dtype=int64)
tf.Tensor([5 0 1 1 8 6 5], shape=(7,), dtype=int64)
tf.Tensor([4 8 7 1 2 3 0], shape=(7,), dtype=int64)
tf.Tensor([5 4 2 7 8 9 9], shape=(7,), dtype=int64)
tf.Tensor([3 6], shape=(2,), dtype=int64)
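The buffer size controls how thorough the shuffling is. As a rough sketch (the helper `shuffled_values` is hypothetical, written only for this illustration), compare a buffer of one item, a small buffer, and a buffer as large as the dataset:

```python
import tensorflow as tf

def shuffled_values(buffer_size):
    """Return the items of range(10) after shuffling with the given buffer size."""
    ds = tf.data.Dataset.range(10).shuffle(buffer_size=buffer_size, seed=42)
    return [int(v.numpy()) for v in ds]

no_shuffle = shuffled_values(1)     # a one-item buffer cannot reorder anything
partial = shuffled_values(3)        # items can only move a limited distance forward
full_shuffle = shuffled_values(10)  # the buffer covers the whole dataset

print(no_shuffle)
print(partial)
print(full_shuffle)
```

With a buffer of 3, the item emitted at position `i` can only come from the first `i + 3` source items, which is why a small buffer gives only a local shuffle. For a proper shuffle the buffer should be large relative to the dataset, memory permitting.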


Another method to shuffle the data is **interleaving lines from multiple files**. The idea is simple: take the dataset and split it into a training set, a validation set, and a test set. Then split each set into many files.

Now let's suppose we have a list `train_filepaths` that contains the list of training file paths. Next, we create a dataset containing filepaths:

In [12]:
# Assumes train_filepaths is already defined (a list of file paths or a glob pattern)
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)

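Finally, we call `interleave()` to pull `n_readers` file paths at a time and cycle through the corresponding files, reading one line at a time from each (sequentially, since no parallelism is requested):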

In [13]:
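n_readers = 5  # number of files interleaved at a time
# Assumes filepath_dataset was created as in the previous cell
dataset = filepath_dataset.interleave(
    # skip(1) drops the first line of each file (typically a header row)
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=n_readers)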
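Since the cells above rely on files that are not part of these notes, here is a self-contained sketch of the same pipeline. It writes three tiny CSV files to a temporary directory as a stand-in for the training files; the file names, their contents, and the choice of `cycle_length=3` are assumptions made for the illustration.

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical stand-in for the training files: three small CSV files with a header.
tmp_dir = tempfile.mkdtemp()
train_filepaths = []
for i in range(3):
    path = os.path.join(tmp_dir, f"train_{i}.csv")
    with open(path, "w", newline="\n") as f:
        f.write("x,y\n")            # header line, dropped later by skip(1)
        f.write(f"{i},a\n{i},b\n")  # two data lines per file
    train_filepaths.append(path)

# list_files() shuffles the file paths by default; the seed makes the order repeatable.
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)

# Cycle through all three files, taking one line from each in turn.
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=3)

lines = [line.numpy().decode("utf-8") for line in dataset]
print(lines)
```

The first three lines come from three different files, then the second line of each file follows. The lines from any one file stay in order, so interleaving is usually combined with `shuffle()` for better mixing.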

### TFRecord Format

The TFRecord format is useful when the bottleneck during training is loading and parsing the data. A TFRecord file is a simple sequence of binary records, and we can create one easily:

In [14]:
with tf.io.TFRecordWriter("my_data.tfrecord") as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

Then we use `tf.data.TFRecordDataset` to read it:

In [15]:
filepaths = ["my_data.tfrecord"]
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
    print(item)

tf.Tensor(b'This is the first record', shape=(), dtype=string)
tf.Tensor(b'And this is the second record', shape=(), dtype=string)
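TFRecord files can also be compressed, which may help when the files have to travel over a network. The sketch below writes and reads back a GZIP-compressed file; the file name and temporary directory are placeholders, and the reader must be given the same compression type as the writer.

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical output location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "my_compressed.tfrecord")

# Write two records with GZIP compression.
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(path, options) as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

# Read them back, specifying the matching compression type.
dataset = tf.data.TFRecordDataset([path], compression_type="GZIP")
records = [item.numpy() for item in dataset]
print(records)
```

Each record comes back as a scalar string tensor holding the raw bytes, exactly as in the uncompressed case.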
