# Data API
TensorFlow's Data API (`tf.data`) is designed to ingest and preprocess large datasets.

The Data API can **read** text files, binary files with fixed-size records, and binary files in the TFRecord format (which supports records of varying sizes). It also supports reading from SQL databases and from other data sources such as GCP's BigQuery.
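
As a small illustration of reading a text file, the sketch below writes a throwaway file and reads it back line by line with `tf.data.TextLineDataset`. The file name and its contents are made up for the example.

```python
import os
import tempfile

import tensorflow as tf

# Write a small throwaway text file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.txt")  # hypothetical file name
with open(path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

# TextLineDataset yields one scalar string tensor per line of the file.
text_dataset = tf.data.TextLineDataset(path)
lines = [line.numpy().decode("utf-8") for line in text_dataset]
print(lines)
```

Each item is a byte string without the trailing newline, which is why the sketch decodes it before printing.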

The Data API can also preprocess data, for example by normalizing numeric features or encoding text and categorical variables (using one-hot encoding, bag-of-words, or embeddings).
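
One way to apply such preprocessing is the `map` method, which runs a function on every item. The sketch below is only an assumption about how normalization might be done (min-max scaling on made-up values), not the approach this notebook uses later.

```python
import tensorflow as tf

# Made-up numeric feature values for illustration.
values = tf.constant([0.0, 5.0, 10.0])
lo, hi = tf.reduce_min(values), tf.reduce_max(values)

dataset = tf.data.Dataset.from_tensor_slices(values)

# map() applies the function to each item; here, min-max scaling to [0, 1].
scaled = dataset.map(lambda x: (x - lo) / (hi - lo))

scaled_values = [float(x) for x in scaled.as_numpy_iterator()]
print(scaled_values)
```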

In [1]:
import tensorflow as tf

## Dataset object

The _dataset_ object is a sequence of data items.

The method _tf.data.Dataset.from_tensor_slices(X)_ slices the input _X_ along its first dimension, so that each slice becomes one data item.

In [2]:
# Slices 1D input into scalar items
dataset = tf.data.Dataset.from_tensor_slices([0,1,2,3,4])
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

In [3]:
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)


In [4]:
# Slices 2D input into 1D items
dataset = tf.data.Dataset.from_tensor_slices([[1,1],[2,2],[3,3]])

In [5]:
for item in dataset:
    print(item)

tf.Tensor([1 1], shape=(2,), dtype=int32)
tf.Tensor([2 2], shape=(2,), dtype=int32)
tf.Tensor([3 3], shape=(2,), dtype=int32)


In [6]:
# Slices 3D input into 2D items
dataset = tf.data.Dataset.from_tensor_slices([[[1,1],[1,1]],[[2,2],[2,2]],[[3,3],[3,3]]])

In [7]:
for item in dataset:
    print(item)

tf.Tensor(
[[1 1]
 [1 1]], shape=(2, 2), dtype=int32)
tf.Tensor(
[[2 2]
 [2 2]], shape=(2, 2), dtype=int32)
tf.Tensor(
[[3 3]
 [3 3]], shape=(2, 2), dtype=int32)
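
The same slicing applies when the input is a tuple of tensors: each component is sliced along its first dimension and the slices are paired up. The sketch below uses made-up features and labels to show this; the variable names are illustrative only.

```python
import tensorflow as tf

# Made-up features (3 rows of 2 values) and one label per row.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
labels = [0, 1, 0]

# Passing a tuple slices both components together along the first dimension.
paired = tf.data.Dataset.from_tensor_slices((features, labels))

# Each item is a (feature_row, label) pair.
pairs = [(x.tolist(), int(y)) for x, y in paired.as_numpy_iterator()]
print(pairs)
```

All components must have the same size along the first dimension, since each item takes one slice from each.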


## Basic transformations
### Repeat
The method _repeat(n)_ repeats the data items in the dataset _n_ times. If _n=None_ (the default), the items are repeated indefinitely. The repeated items are not stored in memory: the repetition is applied lazily, each time an iterator goes through the items in the dataset.

In [9]:
dataset = tf.data.Dataset.from_tensor_slices([0,1,2])
dataset = dataset.repeat(2)
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
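
The lazy behavior is easiest to see with an endless repeat. The sketch below calls `repeat()` with no argument and uses `take(7)`, a method not covered above, to stop after seven items. Nothing is materialized beyond what the iterator asks for.

```python
import tensorflow as tf

# repeat() with no argument repeats the items indefinitely.
endless = tf.data.Dataset.from_tensor_slices([0, 1, 2]).repeat()

# take(7) limits iteration to the first seven items of the endless dataset.
first_seven = [int(x) for x in endless.take(7).as_numpy_iterator()]
print(first_seven)

# The dataset reports an infinite cardinality rather than a stored length.
is_infinite = bool(endless.cardinality().numpy() == tf.data.INFINITE_CARDINALITY)
print(is_infinite)
```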


### Batch
The method _batch(n)_ groups the data items into batches of size _n_.

In [10]:
dataset = dataset.batch(3)
for item in dataset:
    print(item)

tf.Tensor([0 1 2], shape=(3,), dtype=int32)
tf.Tensor([0 1 2], shape=(3,), dtype=int32)


In [12]:
# By default, a final batch smaller than the batch size is kept
dataset = tf.data.Dataset.from_tensor_slices([0,1,2,3,4])
dataset = dataset.batch(2)
for item in dataset:
    print(item)

tf.Tensor([0 1], shape=(2,), dtype=int32)
tf.Tensor([2 3], shape=(2,), dtype=int32)
tf.Tensor([4], shape=(1,), dtype=int32)


In [13]:
# drop_remainder=True drops a final batch that is smaller than the batch size
dataset = tf.data.Dataset.from_tensor_slices([0,1,2,3,4])
dataset = dataset.batch(2, drop_remainder=True)
for item in dataset:
    print(item)

tf.Tensor([0 1], shape=(2,), dtype=int32)
tf.Tensor([2 3], shape=(2,), dtype=int32)
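
The effect of _drop_remainder_ on the number and shape of batches can be checked without iterating. The sketch below uses `cardinality()` and `element_spec`, which are not introduced above, and assumes a TensorFlow version that provides `cardinality()`.

```python
import math

import tensorflow as tf

n_items, batch_size = 5, 2

keep = tf.data.Dataset.range(n_items).batch(batch_size)
drop = tf.data.Dataset.range(n_items).batch(batch_size, drop_remainder=True)

# Number of batches: ceil(n / batch_size) when the remainder is kept,
# floor(n / batch_size) when it is dropped.
n_keep = int(keep.cardinality().numpy())
n_drop = int(drop.cardinality().numpy())
print(n_keep, math.ceil(n_items / batch_size))
print(n_drop, n_items // batch_size)

# Dropping the remainder also makes the batch dimension statically known.
keep_shape = keep.element_spec.shape.as_list()
drop_shape = drop.element_spec.shape.as_list()
print(keep_shape, drop_shape)
```

A fixed batch dimension can matter when downstream code requires fully defined shapes.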


### Chaining transformations
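Each transformation method returns a new dataset, so calls can be chained for brevity. Note that _tf.data.Dataset.range(3)_ produces int64 items, unlike the int32 items in the earlier examples.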

In [14]:
dataset = tf.data.Dataset.range(3).repeat(2).batch(3)
for item in dataset:
    print(item)

tf.Tensor([0 1 2], shape=(3,), dtype=int64)
tf.Tensor([0 1 2], shape=(3,), dtype=int64)
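
The order of chained transformations matters. As a minimal sketch, the example below compares repeating before batching with batching before repeating, using a batch size of 4 so that the difference is visible.

```python
import tensorflow as tf

# Repeat first: batches are cut from the repeated stream 0,1,2,0,1,2.
repeat_then_batch = tf.data.Dataset.range(3).repeat(2).batch(4)

# Batch first: the single (short) batch [0, 1, 2] is then repeated twice.
batch_then_repeat = tf.data.Dataset.range(3).batch(4).repeat(2)

batches_a = [b.tolist() for b in repeat_then_batch.as_numpy_iterator()]
batches_b = [b.tolist() for b in batch_then_repeat.as_numpy_iterator()]

print(batches_a)
print(batches_b)
```

Repeating first lets batches span the boundary between repetitions, while batching first keeps every repetition's batches identical.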
