# `tf.data` API

The `tf.data` API introduces two abstractions:
* `tf.data.Dataset` represents a sequence of elements, in which each element contains one or more `Tensor` objects. For example, in an image pipeline, an element might be a single training example consisting of the image data and a label.
  * Creating a source (e.g. `Dataset.from_tensor_slices()`) constructs a dataset from one or more `tf.Tensor` objects.
    * `tf.data.Dataset.from_tensors()`
    * `tf.data.Dataset.from_tensor_slices()`
    * `tf.data.TFRecordDataset`: for reading data stored in the TFRecord format
  * Applying a transformation (e.g. `Dataset.batch()`) constructs a dataset from one or more `tf.data.Dataset` objects.
* `tf.data.Iterator` provides the main way to extract elements from a dataset. The operation returned by `Iterator.get_next()` yields the next element of a `Dataset` when executed, and typically acts as the interface between input pipeline code and your model.
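
The two in-memory sources listed above differ in how many elements they produce: `from_tensors()` yields a single element holding the whole tensor, while `from_tensor_slices()` slices along the first dimension and yields one element per slice. A minimal pure-Python sketch of that behavior, using nested lists in place of tensors. The helpers `from_tensors_like` and `from_tensor_slices_like` are hypothetical stand-ins for illustration, not TensorFlow functions:

```python
def from_tensors_like(data):
    """Mimic Dataset.from_tensors(): a single element containing all of `data`."""
    yield data


def from_tensor_slices_like(data):
    """Mimic Dataset.from_tensor_slices(): one element per slice along axis 0."""
    for row in data:
        yield row


# A 3x2 "tensor" represented as nested lists (an assumption for illustration).
matrix = [[1, 2], [3, 4], [5, 6]]

whole = list(from_tensors_like(matrix))
slices = list(from_tensor_slices_like(matrix))

print(len(whole))   # 1 element, holding the full 3x2 structure
print(len(slices))  # 3 elements, each a row of length 2
```

This is why the MNIST cells in this notebook use `from_tensor_slices()`: each image and label pair becomes its own element.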

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf

sess_config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))

### Loading MNIST dataset from `tf.keras`

In [None]:
# Load training and eval data from tf.keras
(train_data, train_labels), (test_data, test_labels) = \
    tf.keras.datasets.mnist.load_data()

In [None]:
# Keep only the first N=50 examples so the dataset stays small
N = 50
train_data = train_data[:N]
train_labels = train_labels[:N]
train_data = train_data / 255.
train_labels = np.asarray(train_labels, dtype=np.int32)

test_data = test_data[:N]
test_labels = test_labels[:N]
test_data = test_data / 255.
test_labels = np.asarray(test_labels, dtype=np.int32)

## Input pipeline

1. Define a source as a `tf.data.Dataset`.
  * To construct a Dataset from some tensors in memory, you can use `tf.data.Dataset.from_tensors()` or `tf.data.Dataset.from_tensor_slices()`.
  * Other methods
    * `tf.data.TextLineDataset(filenames)`
    * `tf.data.FixedLengthRecordDataset(filenames)`
    * `tf.data.TFRecordDataset(filenames)`
2. Transformation
  * `Dataset.map()`: to apply a function to each element
  * `Dataset.batch()`
  * `Dataset.shuffle()`
3. `Iterator`
  * `Iterator.initializer`: which enables you to (re)initialize the iterator's state
  * `Iterator.get_next()`

### 1. Store data in `tf.data.Dataset`

* `tf.data.Dataset.from_tensor_slices((features, labels))`
* `tf.data.Dataset.from_generator(gen, output_types, output_shapes)`

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((train_data, train_labels))
print(train_dataset)
print(train_dataset.output_shapes)
print(train_dataset.output_types)

### 2. Transformation

* `apply(transformation_func)`
* `batch(batch_size)`
* `concatenate(dataset)`
* `flat_map(map_func)`
* `repeat(count=None)`
  * Set `count=max_epochs` to iterate over the data for `max_epochs` epochs.
* `shuffle(buffer_size, seed=None, reshuffle_each_iteration=None)`

In [None]:
batch_size = 16

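# Shuffle with a buffer larger than the dataset (N=50), which gives a full shuffle
train_dataset = train_dataset.shuffle(buffer_size=10000)
# Iterate over the data twice (2 epochs)
train_dataset = train_dataset.repeat(count=2)
# Group consecutive elements into batches of `batch_size`
train_dataset = train_dataset.batch(batch_size=batch_size)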
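
Because `repeat()` is applied before `batch()` in the cell above, batches are cut from the concatenated stream of epochs, so one batch can span an epoch boundary. A minimal pure-Python model of that ordering, ignoring shuffling. `repeat_like` and `batch_like` are hypothetical helpers, not TensorFlow functions:

```python
import itertools


def repeat_like(elements, count):
    """Mimic Dataset.repeat(count): replay the elements `count` times."""
    for _ in range(count):
        for element in elements:
            yield element


def batch_like(stream, batch_size):
    """Mimic Dataset.batch(batch_size): group consecutive elements,
    keeping a smaller final batch when the stream runs out."""
    stream = iter(stream)
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            return
        yield batch


N = 50           # number of examples, as in the cells above
batch_size = 16

batches = list(batch_like(repeat_like(range(N), count=2), batch_size))
sizes = [len(b) for b in batches]
print(sizes)           # [16, 16, 16, 16, 16, 16, 4]
print(batches[3][:4])  # [48, 49, 0, 1]: this batch crosses the epoch boundary
```

Swapping the order to `batch()` followed by `repeat()` would instead end every epoch with its own short batch of `N % batch_size` elements.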

### 3. Iterator

#### 3.1 `make_one_shot_iterator()`

* Creates an Iterator for enumerating the elements of this dataset.
  * Note: The returned iterator will be initialized automatically. A "one-shot" iterator does not currently support re-initialization.

###### Common pattern
```python
while True:
  try:
    sess.run(result)
  except tf.errors.OutOfRangeError:
    break
```

In [None]:
train_iterator = train_dataset.make_one_shot_iterator()

x, y = train_iterator.get_next()
x = tf.cast(x, dtype = tf.float32)
y = tf.cast(y, dtype = tf.int32)

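#### Notes on controlling epochs with a `for` loop

* In practice, `dataset.repeat(count=2)` already sets the number of epochs.
* To control epochs with a `for` loop instead, leave out `dataset.repeat()`.
  * With `repeat(count)`, `count` epochs have completed by the time the `while` loop finishes.
* Controlling epochs with a `for` loop requires running `iterator.initializer` at the start of every `for` iteration, which a one-shot iterator does not support.
  * See the example below.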
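
The interaction between an outer `for` loop and a one-shot iterator can be sketched with an ordinary Python iterator, which likewise cannot be rewound once exhausted. This is only an analogy: `StopIteration` stands in for `tf.errors.OutOfRangeError`, and a plain list stands in for the dataset.

```python
# Stand-in for a one-shot iterator over a 3-element dataset with repeat(count=2).
one_shot = iter([0, 1, 2] * 2)

max_epochs = 3
elements_seen_per_loop = []
for epoch in range(max_epochs):
    seen = 0
    while True:
        try:
            next(one_shot)
            seen += 1
        except StopIteration:  # analogous to tf.errors.OutOfRangeError
            break
    elements_seen_per_loop.append(seen)

# Everything is consumed in the first outer iteration; the remaining
# iterations find the iterator already exhausted.
print(elements_seen_per_loop)  # [6, 0, 0]
```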

In [None]:
sess = tf.Session(config=sess_config)
# No need to call sess.run(iterator.initializer): a one-shot iterator initializes itself

step = 0
max_epochs = 3
for epoch in range(max_epochs):
  while True:
    try:
      x_, y_ = sess.run([x, y])

      print("step: {}  labels: {}".format(step, y_))
      step += 1

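    except tf.errors.OutOfRangeError:
      # Raised once both repeat(count=2) epochs are consumed. A one-shot iterator
      # stays exhausted, so later `for` iterations end up here immediately.
      print("End of dataset")  # ==> "End of dataset"
      break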

#### 3.2 `make_initializable_iterator()`

* Creates an Iterator for enumerating the elements of this dataset.
* You must run `iterator.initializer` before fetching elements; running it again restarts iteration from the beginning.

Usage
```python
dataset = ...
iterator = dataset.make_initializable_iterator()
# ...
sess.run(iterator.initializer)
```

In [None]:
train_iterator = train_dataset.make_initializable_iterator()

x, y = train_iterator.get_next()
x = tf.cast(x, dtype = tf.float32)
y = tf.cast(y, dtype = tf.int32)

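#### Notes on controlling epochs with a `for` loop

* Run `iterator.initializer` at the start of every `for` iteration to reset the iterator to the beginning of the dataset.
* If the number of elements is not divisible by `batch_size`, the last batch contains only the remainder. For a single pass over the data that is `N % batch_size` elements; because `train_dataset` applies `repeat(count=2)` before `batch()`, the last batch here holds `(2 * N) % batch_size = 4` elements.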
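
In a pure-Python analogy, re-initialization corresponds to creating a fresh iterator at the top of each outer loop. The sketch below uses plain lists as stand-ins for the dataset and is not TensorFlow code. It also shows one consequence worth keeping in mind: if the dataset still contains `repeat(count=2)`, every initialization replays two passes, so the data is traversed `max_epochs * 2` times in total.

```python
dataset = [0, 1, 2]   # stand-in for a dataset with 3 elements
repeat_count = 2      # stand-in for dataset.repeat(count=2)
max_epochs = 3

total_seen = 0
for epoch in range(max_epochs):
    # Analogous to sess.run(iterator.initializer): start from the beginning.
    iterator = iter(dataset * repeat_count)
    for _ in iterator:
        total_seen += 1

print(total_seen)  # 18 = max_epochs * repeat_count * len(dataset)
```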

In [None]:
sess = tf.Session(config=sess_config)

step = 0
max_epochs = 3
for epoch in range(max_epochs):
  sess.run(train_iterator.initializer)
  while True:
    try:
      x_, y_ = sess.run([x, y])

      print("step: {}  labels: {}".format(step, y_))
      step += 1

    except tf.errors.OutOfRangeError:
      print("End of dataset")  # ==> "End of dataset"
      break

## [From the official TensorFlow site](https://www.tensorflow.org/programmers_guide/datasets)

* The examples below are taken from the TensorFlow website.

### Data structure

In [None]:
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 10]))
print(dataset1.output_types)  # ==> "tf.float32"
print(dataset1.output_shapes)  # ==> "(10,)"

dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random_uniform([4]),
    tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)))
print(dataset2.output_types)  # ==> "(tf.float32, tf.int32)"
print(dataset2.output_shapes)  # ==> "((), (100,))"

dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
print(dataset3.output_types)  # ==> "(tf.float32, (tf.float32, tf.int32))"
print(dataset3.output_shapes)  # ==> "(10, ((), (100,)))"

### Naming

In [None]:
dataset = tf.data.Dataset.from_tensor_slices(
   {"a": tf.random_uniform([4]),
    "b": tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)})
print(dataset.output_types)  # ==> "{'a': tf.float32, 'b': tf.int32}"
print(dataset.output_shapes)  # ==> "{'a': (), 'b': (100,)}"

## Creating an iterator

In [None]:
dataset = tf.data.Dataset.range(100)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
  for i in range(10):
    value = sess.run(next_element)
    print(value)

In [None]:
max_value = tf.placeholder(tf.int64, shape=[])
dataset = tf.data.Dataset.range(max_value)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
  # Initialize an iterator over a dataset with 10 elements.
  sess.run(iterator.initializer, feed_dict={max_value: 10})
  for i in range(10):
    value = sess.run(next_element)
    print(value)

  # Initialize the same iterator over a dataset with 100 elements.
  sess.run(iterator.initializer, feed_dict={max_value: 100})
  for i in range(100):
    value = sess.run(next_element)
    print(value)