# `tf.data` API

The `tf.data` API introduces two abstractions:
* `tf.data.Dataset` represents a sequence of elements, in which each element contains one or more `Tensor` objects. For example, in an image pipeline, an element might be a single training example: a pair of tensors holding the image data and a label. There are two distinct ways to create a dataset:
  * Creating a source (e.g. `Dataset.from_tensor_slices()`) constructs a dataset from one or more `tf.Tensor` objects.
    * `tf.data.Dataset.from_tensors()`
    * `tf.data.Dataset.from_tensor_slices()`
    * `tf.data.TFRecordDataset`: used to read data stored in the TFRecord format
  * Applying a transformation (e.g. `Dataset.batch()`) constructs a dataset from one or more `tf.data.Dataset` objects.
* `tf.data.Iterator` provides the main way to extract elements from a dataset. The operation returned by `Iterator.get_next()` yields the next element of a `Dataset` when executed, and typically acts as the interface between input pipeline code and your model.
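
To make the difference between the two in-memory sources concrete, here is a NumPy-only analogy (it does not call TensorFlow). `from_tensors()` treats its input as a single element, while `from_tensor_slices()` slices the input along its first dimension and yields one element per slice. The helper names `like_from_tensors` and `like_from_tensor_slices` are hypothetical and exist only for this sketch.

```python
import numpy as np


def like_from_tensors(array):
    """Analogy for Dataset.from_tensors(): the whole array is ONE element."""
    yield array


def like_from_tensor_slices(array):
    """Analogy for Dataset.from_tensor_slices(): one element per slice along axis 0."""
    for row in array:
        yield row


data = np.arange(6).reshape(3, 2)  # shape (3, 2)

single = list(like_from_tensors(data))
slices = list(like_from_tensor_slices(data))

print(len(single), single[0].shape)  # 1 (3, 2)
print(len(slices), slices[0].shape)  # 3 (2,)
```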

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf

sess_config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))



### Loading MNIST dataset from `tf.keras`

In [2]:
# Load training and eval data from tf.keras
(train_data, train_labels), (test_data, test_labels) = \
    tf.keras.datasets.mnist.load_data()

N = 50
train_data = train_data[:N]
train_labels = train_labels[:N]
train_data = train_data / 255.
train_labels = np.asarray(train_labels, dtype=np.int32)

test_data = test_data[:N]
test_labels = test_labels[:N]
test_data = test_data / 255.
test_labels = np.asarray(test_labels, dtype=np.int32)

Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz


## Input pipeline

1. You must define a source: `tf.data.Dataset`
  * To construct a Dataset from some tensors in memory, you can use `tf.data.Dataset.from_tensors()` or `tf.data.Dataset.from_tensor_slices()`.
  * Other methods
    * `tf.data.TextLineDataset(filenames)`
    * `tf.data.FixedLengthRecordDataset(filenames)`
    * `tf.data.TFRecordDataset(filenames)`
2. Transformation
  * `Dataset.map()`: applies a function to each element
  * `Dataset.batch()`: combines consecutive elements into batches
3. `Iterator`
  * `Iterator.initializer`: lets you (re)initialize the iterator's state
  * `Iterator.get_next()`: returns the operation that yields the next element
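
The three stages above can be mimicked with plain Python generators. This is only an analogy under stated assumptions: it does not use TensorFlow, and the helper names `source`, `map_elements`, and `batch_elements` are made up for illustration.

```python
def source(features, labels):
    """Step 1: a source yields (feature, label) pairs, one per example."""
    for pair in zip(features, labels):
        yield pair


def map_elements(elements, fn):
    """Step 2a: like Dataset.map(), apply fn to every element."""
    for element in elements:
        yield fn(element)


def batch_elements(elements, batch_size):
    """Step 2b: like Dataset.batch(), group consecutive elements."""
    batch = []
    for element in elements:
        batch.append(element)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # the final batch may be smaller than batch_size
        yield batch


features = [0, 51, 102, 153, 204]
labels = [0, 1, 0, 1, 0]

pipeline = source(features, labels)
pipeline = map_elements(pipeline, lambda fl: (fl[0] / 255.0, fl[1]))
pipeline = batch_elements(pipeline, batch_size=2)

# Step 3: iterating plays the role of calling Iterator.get_next() repeatedly.
batches = list(pipeline)
for b in batches:
    print(b)
```

Each stage wraps the previous one, which mirrors how every `Dataset` transformation returns a new dataset rather than modifying the old one.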

### 1. Store data in `tf.data.Dataset`

* `tf.data.Dataset.from_tensor_slices((features, labels))`
* `tf.data.Dataset.from_generator(gen, output_types, output_shapes)`

In [29]:
train_dataset = tf.data.Dataset.from_tensor_slices((train_data, train_labels))
print(train_dataset)
print(train_dataset.output_shapes)
print(train_dataset.output_types)

<TensorSliceDataset shapes: ((28, 28), ()), types: (tf.float64, tf.int32)>
(TensorShape([Dimension(28), Dimension(28)]), TensorShape([]))
(tf.float64, tf.int32)


`print(train_dataset.output_shapes)`
  * `(TensorShape([Dimension(28), Dimension(28)]), TensorShape([]))`
  * `train_data`: each element has shape `(28, 28)`
  * `train_labels`: each element has shape `()`, i.e. a scalar

### 2. Transformation

* `apply(transformation_func)`
* `batch(batch_size)`
* `concatenate(dataset)`
* `flat_map(map_func)`
* `repeat(count=None)`
* `shuffle(buffer_size, seed=None, reshuffle_each_iteration=None)`

In [7]:
batch_size = 16

train_dataset = train_dataset.shuffle(buffer_size=10000)
train_dataset = train_dataset.repeat(count=2)
train_dataset = train_dataset.batch(batch_size=batch_size)

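* `Dataset.shuffle()`
  * Randomly shuffles the elements of this dataset.
  * `buffer_size`: A `tf.int64` scalar `tf.Tensor`, representing the number of elements from this dataset from which the new dataset will sample.
    * Q. Does it shuffle only the first `buffer_size` elements and leave the rest in order, or does it shuffle everything?
    * A. Neither exactly. The dataset fills a buffer with `buffer_size` elements, then repeatedly emits a randomly chosen element from the buffer and replaces it with the next input element. Every element is returned exactly once, but the shuffle is uniform over the whole dataset only when `buffer_size` is at least the dataset size.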
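
The role of `buffer_size` is easier to see in a small simulation. The sketch below is plain Python, not TensorFlow's implementation: it assumes the fill-then-sample behaviour in which a buffer of `buffer_size` elements is kept full and one random element is emitted at a time. The function name `buffered_shuffle` is hypothetical.

```python
import random


def buffered_shuffle(elements, buffer_size, seed=None):
    """Simulate shuffle-buffer sampling over any iterable (buffer_size >= 1)."""
    rng = random.Random(seed)
    buffer = []
    for element in elements:
        buffer.append(element)
        if len(buffer) == buffer_size:
            # Buffer is full: emit one randomly chosen element from it.
            index = rng.randrange(buffer_size)
            buffer[index], buffer[-1] = buffer[-1], buffer[index]
            yield buffer.pop()
    # Input exhausted: drain the remaining elements in random order.
    rng.shuffle(buffer)
    for element in buffer:
        yield element


data = list(range(20))
small = list(buffered_shuffle(data, buffer_size=3, seed=0))
full = list(buffered_shuffle(data, buffer_size=len(data), seed=0))

print(small)  # elements only move a short distance from their original position
print(full)   # elements can end up anywhere
```

With a small buffer, the element emitted at position `i` can only come from the first `i + buffer_size` inputs, so the result is shuffled locally. With `buffer_size=1` the order does not change at all.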


### 3. Iterator

#### 3.1 `make_one_shot_iterator()`

* Creates an Iterator for enumerating the elements of this dataset.
  * Note: The returned iterator will be initialized automatically. A "one-shot" iterator does not currently support re-initialization.

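###### Common pattern
```python
# Only needed for an initializable iterator; a one-shot iterator skips this line.
sess.run(iterator.initializer)
while True:
  try:
    sess.run(result)
  except tf.errors.OutOfRangeError:
    # Raised once the dataset is exhausted.
    break
```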

In [17]:
train_iterator = train_dataset.make_one_shot_iterator()

x, y = train_iterator.get_next()
x = tf.cast(x, dtype=tf.float32)
y = tf.cast(y, dtype=tf.int32)

In [None]:
sess = tf.Session(config=sess_config)
# No need to call sess.run(iterator.initializer) for a one-shot iterator

step = 0
while True:
  try:
    x_, y_ = sess.run([x, y])

    print("step: {}  labels: {}".format(step, y_))
    step += 1

  except tf.errors.OutOfRangeError:
    print("End of dataset")  # ==> "End of dataset"
    break

#### 3.2 `make_initializable_iterator()`

* Creates an Iterator for enumerating the elements of this dataset.
* You must run `iterator.initializer` before requesting elements.

Usage
```python
dataset = ...
iterator = dataset.make_initializable_iterator()
# ...
sess.run(iterator.initializer)
```

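###### What's the difference between initializing and not?

A one-shot iterator is initialized automatically but cannot be re-initialized. An initializable iterator must be initialized explicitly by running `iterator.initializer`; in return, it can be re-initialized, and the dataset can be parameterized with `tf.placeholder` values fed at initialization time.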

In [30]:
batch_size = 16

train_dataset = train_dataset.shuffle(buffer_size=10000)
train_dataset = train_dataset.repeat(count=2)
train_dataset = train_dataset.batch(batch_size=batch_size)

In [31]:
train_iterator = train_dataset.make_initializable_iterator()

x, y = train_iterator.get_next()
x = tf.cast(x, dtype=tf.float32)
y = tf.cast(y, dtype=tf.int32)

In [32]:
sess = tf.Session(config=sess_config)
sess.run(train_iterator.initializer)

step = 0
while True:
  try:
    x_, y_ = sess.run([x, y])

    print("step: {}  labels: {}".format(step, y_))
    step += 1

  except tf.errors.OutOfRangeError:
    print("End of dataset")  # ==> "End of dataset"
    break

step: 0  labels: [1 3 9 0 3 1 8 6 7 0 0 0 2 5 2 8]
step: 1  labels: [3 6 1 6 5 3 9 9 1 9 7 8 4 6 2 9]
step: 2  labels: [4 5 5 7 8 7 3 9 4 1 3 2 4 3 9 1]
step: 3  labels: [6 1 1 0 5 6 4 8 8 1 0 2 9 3 1 6]
step: 4  labels: [1 3 5 4 9 6 1 3 9 6 7 9 2 5 7 8]
step: 5  labels: [5 3 7 4 3 6 2 9 2 1 7 4 3 9 9 1]
step: 6  labels: [0 0 8 3]
End of dataset
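
The seven steps printed above follow from simple arithmetic: 50 examples repeated twice give 100 elements, and 100 = 6 × 16 + 4, so `batch(16)` yields six full batches and a final batch of four. A plain-Python check (no TensorFlow involved; the variable names are only for illustration):

```python
num_examples = 50   # N in the loading cell
repeat_count = 2    # repeat(count=2)
batch_size = 16     # batch(batch_size=16)

total_elements = num_examples * repeat_count
full_batches, remainder = divmod(total_elements, batch_size)
num_steps = full_batches + (1 if remainder else 0)

print(total_elements, full_batches, remainder, num_steps)  # 100 6 4 7
```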


## [From TensorFlow official site](https://www.tensorflow.org/programmers_guide/datasets)

* The examples below are taken from the TensorFlow website.

### Data structure

In [35]:
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 10]))
print(dataset1.output_types)  # ==> "tf.float32"
print(dataset1.output_shapes)  # ==> "(10,)"

print()
dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random_uniform([4]),
    tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)))
print(dataset2.output_types)  # ==> "(tf.float32, tf.int32)"
print(dataset2.output_shapes)  # ==> "((), (100,))"

print()
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
print(dataset3.output_types)  # ==> (tf.float32, (tf.float32, tf.int32))
print(dataset3.output_shapes)  # ==> "(10, ((), (100,)))"

<dtype: 'float32'>
(10,)

(tf.float32, tf.int32)
(TensorShape([]), TensorShape([Dimension(100)]))

(tf.float32, (tf.float32, tf.int32))
(TensorShape([Dimension(10)]), (TensorShape([]), TensorShape([Dimension(100)])))
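
The nested structure reported for `dataset3` can be reproduced with NumPy arrays and Python's built-in `zip`. This is an analogy only: it assumes that slicing along the first axis and pairing elements behaves like `from_tensor_slices()` and `Dataset.zip()`, and the names `elements1` to `elements3` are hypothetical.

```python
import numpy as np

a = np.random.uniform(size=(4, 10))
b = (np.random.uniform(size=(4,)),
     np.random.randint(0, 100, size=(4, 100)))

elements1 = list(a)                          # 4 elements, each of shape (10,)
elements2 = list(zip(b[0], b[1]))            # 4 elements, each a (scalar, (100,)) pair
elements3 = list(zip(elements1, elements2))  # nested like dataset3

first = elements3[0]
print(first[0].shape, first[1][0].shape, first[1][1].shape)  # (10,) () (100,)
```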


### Naming

In [36]:
dataset = tf.data.Dataset.from_tensor_slices(
   {"a": tf.random_uniform([4]),
    "b": tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)})
print(dataset.output_types)  # ==> "{'a': tf.float32, 'b': tf.int32}"
print(dataset.output_shapes)  # ==> "{'a': (), 'b': (100,)}"

{'b': tf.int32, 'a': tf.float32}
{'b': TensorShape([Dimension(100)]), 'a': TensorShape([])}


## Creating an iterator

In [55]:
dataset = tf.data.Dataset.range(50)
dataset = dataset.shuffle(10)
dataset = dataset.repeat(count=2)  # iterate over the data twice (2 epochs)
dataset = dataset.batch(13)

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    for i in range(10):
        try:
            value = sess.run(next_element)
            print(value)
        except tf.errors.OutOfRangeError:
            print('end of dataset')
            break

[ 0  7 11  8  3 12  6 10 17 16  1  2 19]
[18  5 24  9 20  4 15 21 28 22 23 30 13]
[29 36 34 32 33 14 39 37 41 26 45 38 40]
[44 47 48 46 31 27 43 35 25 49 42  6  2]
[10  9  3 11  8 14 13 17  1  0 15 22 18]
[16 23 25  4 12 24  5 20 26 32 31  7 19]
[30 29 38 40 34 21 39 41 43 36 44 35 33]
[47 42 45 48 46 37 28 49 27]
end of dataset


In [60]:
max_value = tf.placeholder(tf.int64, shape=[])
dataset = tf.data.Dataset.range(max_value)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
  # Initialize an iterator over a dataset with 10 elements.
  sess.run(iterator.initializer, feed_dict={max_value: 10})
  for i in range(10):
    value = sess.run(next_element)
    print(value)

  # Initialize the same iterator over a dataset with 100 elements.
  sess.run(iterator.initializer, feed_dict={max_value: 100})
  for i in range(100):
    value = sess.run(next_element)
    print(value)

0
1
2
3
4
5
6
7
8
9
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
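
The re-initialization shown in the cell above can be pictured with a small plain-Python class. It is a rough analogy, not how TensorFlow implements iterators: `ReinitializableRange` and its methods are hypothetical names, and `initialize(max_value)` stands in for running `iterator.initializer` with a `feed_dict`.

```python
class ReinitializableRange:
    """Hypothetical stand-in for an initializable iterator over a range."""

    def __init__(self):
        self._iterator = None  # not usable until initialize() is called

    def initialize(self, max_value):
        """Plays the role of sess.run(iterator.initializer, feed_dict=...)."""
        self._iterator = iter(range(max_value))

    def get_next(self):
        """Plays the role of sess.run(next_element)."""
        if self._iterator is None:
            raise RuntimeError("iterator has not been initialized")
        return next(self._iterator)  # raises StopIteration at the end


it = ReinitializableRange()

it.initialize(10)
first_pass = [it.get_next() for _ in range(10)]

it.initialize(100)  # same iterator object, new parameter
second_pass = [it.get_next() for _ in range(100)]

print(first_pass[-1], second_pass[-1])  # 9 99
```

A one-shot iterator corresponds to an object whose `initialize()` is called once on creation and is not exposed afterwards.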
