# Dataset API
---
In this notebook, I describe how to use the Dataset API (`tf.data`) for training.  
First, I show a batch generator implemented with NumPy.  
Then I show several ways to build batches with the Dataset API.

## Table of contents
---
- Batch generator using numpy
  - MNIST example
- Batch generator using Dataset API
  - tf.data.Dataset
  - tf.data.Iterator
  - MNIST example

__packages:__

In [1]:
import tensorflow as tf
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
import matplotlib
import matplotlib.pyplot as plt
import gzip
%matplotlib inline

print("tensorflow version: ", tf.__version__)
print("numpy version: ", np.__version__)
print("scikit learn version: ", sklearn.__version__)
print("matplotlib version: ", matplotlib.__version__)

tensorflow version:  1.13.1
numpy version:  1.16.2
scikit learn version:  0.20.3
matplotlib version:  3.0.3


## Batch generator using numpy
---
### Load the MNIST dataset as image-shaped arrays

In [2]:
def load_mnist_images(file_path):
    with gzip.open(file_path, 'rb') as f:
        data = np.frombuffer(f.read(), np.uint8, offset=16)
    data = data.reshape(-1,784)
    return data.reshape(data.shape[0], 28, 28, 1)

def load_mnist_labels(file_path):
    with gzip.open(file_path, 'rb') as f:
        data = np.frombuffer(f.read(), np.uint8, offset=8)
    return data
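
The `offset=16` and `offset=8` arguments skip the IDX file headers: an image file starts with a 4-byte magic number followed by three big-endian 32-bit integers (count, rows, columns), while a label file has the magic number and a single count. As a minimal sketch, the header can be parsed to check a file before reshaping. The helper name `parse_idx_header` and the synthetic byte strings are illustrative assumptions, not part of the original notebook.

```python
import struct

def parse_idx_header(raw):
    """Return (magic, dims) parsed from the start of an IDX byte string."""
    # The lowest byte of the magic number holds the number of dimensions.
    magic, = struct.unpack('>I', raw[:4])
    n_dims = magic & 0xFF
    dims = struct.unpack('>' + 'I' * n_dims, raw[4:4 + 4 * n_dims])
    return magic, dims

# Synthetic image file: 2 images of 28x28 pixels (magic 2051 = 0x00000803).
fake_images = struct.pack('>IIII', 2051, 2, 28, 28) + bytes(2 * 28 * 28)
# Synthetic label file: 2 labels (magic 2049 = 0x00000801).
fake_labels = struct.pack('>II', 2049, 2) + bytes([7, 3])

image_magic, image_dims = parse_idx_header(fake_images)
label_magic, label_dims = parse_idx_header(fake_labels)
print(image_dims, 4 + 4 * len(image_dims))  # dims and header length for images
print(label_dims, 4 + 4 * len(label_dims))  # dims and header length for labels
```

The computed header lengths (16 and 8 bytes) match the offsets passed to `np.frombuffer` above.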

In [3]:
dataset_dir = './../../data/mnist/'

filenames = {
    'test_images':'t10k-images-idx3-ubyte.gz',
    'test_labels':'t10k-labels-idx1-ubyte.gz',
    'train_images':'train-images-idx3-ubyte.gz',
    'train_labels':'train-labels-idx1-ubyte.gz'
}

train_images = load_mnist_images(dataset_dir + filenames['train_images'])
train_labels = load_mnist_labels(dataset_dir + filenames['train_labels'])
# split into train/validation dataset
X_train, X_val, y_train, y_val = train_test_split(train_images, train_labels, test_size=0.2, shuffle=True)

X_test = load_mnist_images(dataset_dir + filenames['test_images'])
y_test = load_mnist_labels(dataset_dir + filenames['test_labels'])

### Normalize images

In [4]:
X_train = X_train / 255.
X_val = X_val / 255.
X_test = X_test / 255.
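
Dividing a `uint8` array by a Python float yields `float64`, which takes eight times the memory of the raw pixels. A minimal sketch on a small stand-in array (`pixels` is an assumption, not the MNIST data) shows the effect of casting to `float32` first:

```python
import numpy as np

# Stand-in for a small batch of uint8 images.
pixels = np.arange(0, 256, dtype=np.uint8).reshape(1, 16, 16, 1)

as_float64 = pixels / 255.                     # NumPy promotes uint8 to float64
as_float32 = pixels.astype(np.float32) / 255.  # stays float32

print(as_float64.dtype, as_float64.nbytes)
print(as_float32.dtype, as_float32.nbytes)
```

Whether the extra precision is worth the memory depends on the model; the values lie in [0, 1] either way.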

### One-hot encoding

In [5]:
def get_one_hot_labels(categorical_labels):
    '''Convert categorical labels into one-hot labels
    # Arguments
        categorical_labels: Categorical labels which start from zero.
            Every category must appear at least once, because the number
            of categories is inferred from the unique values.
    '''
    n_category = len(np.unique(categorical_labels))
    n_data = len(categorical_labels)
    print('n_category: {}, n_data: {}'.format(n_category, n_data))
    one_hot_labels = np.zeros((n_data, n_category), dtype=float)
    for idx in range(n_data):
        label = categorical_labels[idx]
        one_hot_labels[idx,label] = 1
    return one_hot_labels

# Test code
test_input = np.array([0,0,1,2])
expected = np.array([[1.,0.,0.],
                     [1.,0.,0.],
                     [0.,1.,0.],
                     [0.,0.,1.]])
actual = get_one_hot_labels(test_input)
assert (expected == actual).all(), "Test Failed"

n_category: 3, n_data: 4


In [6]:
y_train = get_one_hot_labels(y_train)
y_val = get_one_hot_labels(y_val)
y_test = get_one_hot_labels(y_test)

n_category: 10, n_data: 48000
n_category: 10, n_data: 12000
n_category: 10, n_data: 10000
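
Because `get_one_hot_labels` infers the number of classes from the labels it receives, a split that happens to miss a class would produce a narrower matrix. A minimal alternative sketch, assuming the number of classes is known in advance (the function name `one_hot` and the parameter `n_classes` are illustrative), indexes an identity matrix instead of looping:

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return a (len(labels), n_classes) float array with a single 1 per row."""
    labels = np.asarray(labels)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(n_classes, dtype=float)[labels]

# Class 3 never appears in the input, but the output still has 4 columns.
encoded = one_hot([0, 0, 1, 2], n_classes=4)
print(encoded.shape)
```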


### Batch generator

In [7]:
def batch_generator(X, y, batch_size=32):
    '''Create a generator that yields batches of at most batch_size samples
    # Arguments
        X: Numpy array to split into batches
        y: Labels of X, with the same length along the first axis
        batch_size: Number of samples per batch (the last batch may be smaller)
    # Returns
        Generator which yields (X_batch, y_batch)
    '''
    
    assert X.shape[0] == y.shape[0], "Data shape does not match!!"
    
    n_data = X.shape[0]
    for idx in range(0, n_data, batch_size):
        # Slicing past the end is safe: NumPy clips the range, so the last batch may be smaller
        X_batch = X[idx:idx+batch_size]
        y_batch = y[idx:idx+batch_size]
        yield X_batch, y_batch
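
The generator above always yields batches in the same order. For training it is common to reshuffle the data each epoch; a minimal sketch under that assumption (the name `shuffled_batch_generator` and the `seed` parameter are illustrative) permutes the indices before slicing:

```python
import numpy as np

def shuffled_batch_generator(X, y, batch_size=32, seed=None):
    """Yield (X_batch, y_batch) pairs in random order, covering each sample once."""
    assert X.shape[0] == y.shape[0], "Data shape does not match"
    rng = np.random.RandomState(seed)
    order = rng.permutation(X.shape[0])
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        # The same index array keeps images and labels aligned.
        yield X[idx], y[idx]

X_demo = np.arange(10).reshape(10, 1)
y_demo = np.arange(10)
batches = list(shuffled_batch_generator(X_demo, y_demo, batch_size=4, seed=0))
print([len(y_batch) for _, y_batch in batches])
```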

In [8]:
# define generator
train_batch_generator = batch_generator(X_train, y_train, 64)

print('type: ', type(train_batch_generator))

type:  <class 'generator'>


In [9]:
# Fetch a single batch
X_, y_ = next(train_batch_generator)

print(X_.shape)
print(y_.shape)

(64, 28, 28, 1)
(64, 10)


In [10]:
# Iterate over the remaining batches
for X_, y_ in train_batch_generator:
    # Some processing
    continue
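
A generator can be traversed only once, so the loop above picks up after the batch already consumed by `next` and then leaves the generator empty. A minimal sketch with a stand-in generator (`count_batches` is an illustrative helper) shows why a fresh generator is needed for every epoch:

```python
def count_batches(n_data, batch_size):
    """Stand-in generator that yields the start index of each batch."""
    for start in range(0, n_data, batch_size):
        yield start

gen = count_batches(n_data=10, batch_size=4)
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # nothing left to yield
print(first_pass, second_pass)

# Re-create the generator at the start of each epoch instead.
per_epoch = [list(count_batches(10, 4)) for _ in range(2)]
```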

## Batch generator using Dataset API
---
### Pipeline
- Load data into the Dataset object.
- Create iterator from the Dataset object.
- Get data and feed it into the network.

### Main components
`tf.data.Dataset`
- Create a source.
- Apply a transformation.

`tf.data.Iterator`
- Create an iterator from the Dataset object.
- Generate batches.

### `tf.data.Dataset`
---
__Note:  
This section uses eager execution so that the outputs are easy to inspect. If you want to run the code cells of the next section, tf.data.Iterator, which uses tf.Session(), do not run this section.__

- https://www.tensorflow.org/guide/datasets
- https://www.tensorflow.org/api_docs/python/tf/data/Dataset

In [11]:
tf.enable_eager_execution()

#### Create a source
#### - `from_tensors`

Creates a Dataset with a single element.

> Note that if tensors contains a NumPy array, and eager execution is not enabled, the values will be embedded in the graph as one or more tf.constant operations. For large datasets (> 1 GB), this can waste memory and run into byte limits of graph serialization. If tensors contains one or more large NumPy arrays, consider the alternative described in [this guide](https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays).

In [14]:
# 1D vector
X_sample = tf.data.Dataset.from_tensors(tf.range(0, 10))
print(type(X_sample))
print(X_sample.output_types)
print(X_sample.output_shapes)

# In eager mode, a dataset object can be iterated over directly
for d in X_sample:
    print(d)

<class 'tensorflow.python.data.ops.dataset_ops.DatasetV1Adapter'>
<dtype: 'int32'>
(10,)
tf.Tensor([0 1 2 3 4 5 6 7 8 9], shape=(10,), dtype=int32)


In [15]:
# 2D matrix
# The input matrix is treated as a single element.
X_sample = tf.data.Dataset.from_tensors(tf.random_uniform((2,3)))
print(type(X_sample))
print(X_sample.output_types)
print(X_sample.output_shapes)

for d in X_sample:
    print(d)

<class 'tensorflow.python.data.ops.dataset_ops.DatasetV1Adapter'>
<dtype: 'float32'>
(2, 3)
tf.Tensor(
[[0.94202423 0.8030591  0.5831975 ]
 [0.17291617 0.01673031 0.54511225]], shape=(2, 3), dtype=float32)


#### - `from_tensor_slices`
Creates a Dataset whose elements are slices of the given tensors.  
> Note that if tensors contains a NumPy array, and eager execution is not enabled, the values will be embedded in the graph as one or more tf.constant operations. For large datasets (> 1 GB), this can waste memory and run into byte limits of graph serialization. If tensors contains one or more large NumPy arrays, consider the alternative described in [this guide](https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays).

In [16]:
# 1D vector
X_sample = tf.data.Dataset.from_tensor_slices(tf.range(0, 10))
print(type(X_sample))
print(X_sample.output_types)
print(X_sample.output_shapes)

for d in X_sample:
    print(d)

<class 'tensorflow.python.data.ops.dataset_ops.DatasetV1Adapter'>
<dtype: 'int32'>
()
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)


In [17]:
# 2D matrix
# The input matrix is sliced along its first dimension, so each row becomes one element.
X_sample = tf.data.Dataset.from_tensor_slices(tf.random_uniform((2,3)))
print(type(X_sample))
print(X_sample.output_types)
print(X_sample.output_shapes)

for d in X_sample:
    print(d)

<class 'tensorflow.python.data.ops.dataset_ops.DatasetV1Adapter'>
<dtype: 'float32'>
(3,)
tf.Tensor([0.3265065 0.7801666 0.5116267], shape=(3,), dtype=float32)
tf.Tensor([0.5576527  0.57961714 0.7255517 ], shape=(3,), dtype=float32)


__\- zip Dataset objects__

In [18]:
# Build one dataset per tensor, then zip them into (X, y) pairs
X_sample = tf.data.Dataset.from_tensor_slices(tf.random_uniform((3,10)))
y_sample = tf.data.Dataset.from_tensor_slices(tf.random_uniform((3,1)))
dataset_sample = tf.data.Dataset.zip((X_sample, y_sample))

print(type(dataset_sample))
print(dataset_sample.output_types)
print(dataset_sample.output_shapes)

for d in dataset_sample:
    print(d)

<class 'tensorflow.python.data.ops.dataset_ops.DatasetV1Adapter'>
(tf.float32, tf.float32)
(TensorShape([Dimension(10)]), TensorShape([Dimension(1)]))
(<tf.Tensor: id=117, shape=(10,), dtype=float32, numpy=
array([0.54385114, 0.79987025, 0.966998  , 0.5304023 , 0.53878844,
       0.14456284, 0.31475198, 0.8985603 , 0.03431952, 0.33608198],
      dtype=float32)>, <tf.Tensor: id=118, shape=(1,), dtype=float32, numpy=array([0.4512118], dtype=float32)>)
(<tf.Tensor: id=121, shape=(10,), dtype=float32, numpy=
array([0.29774928, 0.25932002, 0.42823505, 0.20841694, 0.46167576,
       0.18783677, 0.28852165, 0.73213303, 0.15576804, 0.01828802],
      dtype=float32)>, <tf.Tensor: id=122, shape=(1,), dtype=float32, numpy=array([0.08884335], dtype=float32)>)
(<tf.Tensor: id=125, shape=(10,), dtype=float32, numpy=
array([0.5363462 , 0.50935555, 0.10325265, 0.03271317, 0.1840074 ,
       0.7609285 , 0.03748512, 0.5316086 , 0.5329342 , 0.24820483],
      dtype=float32)>, <tf.Tensor: id=126, shape=(1

__\- Give names to each component of an element__

In [19]:
dataset = tf.data.Dataset.from_tensor_slices(
   {"a": tf.random_uniform([4]),
    "b": tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)})
print(dataset.output_types)
print(dataset.output_shapes)

# Each element is a dictionary keyed by component name
for d in dataset:
    print(d['a'])

{'a': tf.float32, 'b': tf.int32}
{'a': TensorShape([]), 'b': TensorShape([Dimension(100)])}
tf.Tensor(0.033620358, shape=(), dtype=float32)
tf.Tensor(0.9896467, shape=(), dtype=float32)
tf.Tensor(0.17672813, shape=(), dtype=float32)
tf.Tensor(0.24519813, shape=(), dtype=float32)


#### Applying a transformation
#### - `map(map_func, num_parallel_calls=None)`
- Applies `map_func` to each element of the dataset.
- `map_func` can be a lambda expression as well as a regular function.

In [20]:
sample_const = tf.constant([1,2,3,4,5])
sample_dataset = tf.data.Dataset.from_tensors(sample_const)
print('Before mapping')
for d in sample_dataset:
    print(d)

sample_dataset = sample_dataset.map(lambda x: x+1)
print('After mapping')
for d in sample_dataset:
    print(d)

Before mapping
tf.Tensor([1 2 3 4 5], shape=(5,), dtype=int32)
After mapping
tf.Tensor([2 3 4 5 6], shape=(5,), dtype=int32)


#### - `shuffle`
> This dataset fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required.

In [21]:
sample_const = tf.data.Dataset.from_tensor_slices(tf.constant([[1,2,3],
                                                               [4,5,6],
                                                               [7,8,9],
                                                               [10,11,12],
                                                               [13,14,15]])).shuffle(buffer_size=5)
for d in sample_const:
    print(d)

tf.Tensor([1 2 3], shape=(3,), dtype=int32)
tf.Tensor([7 8 9], shape=(3,), dtype=int32)
tf.Tensor([10 11 12], shape=(3,), dtype=int32)
tf.Tensor([13 14 15], shape=(3,), dtype=int32)
tf.Tensor([4 5 6], shape=(3,), dtype=int32)
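
The buffer mechanism described in the quote can be modelled in a few lines of plain Python. This is a sketch of the documented behaviour, not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical name. It shows that with a buffer smaller than the dataset, an element can move only a limited distance toward the front.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in an order produced by sampling from a fixed-size buffer."""
    rng = random.Random(seed)
    source = iter(items)
    buffer = []
    # Fill the buffer with the first buffer_size elements (buffer_size >= 1).
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Emit a random buffered element and refill its slot from the source.
    for item in source:
        pos = rng.randrange(len(buffer))
        yield buffer[pos]
        buffer[pos] = item
    # Source exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    for item in buffer:
        yield item

shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=0))
# The element at output position t always comes from the first t + 3 inputs.
print(all(value < t + 3 for t, value in enumerate(shuffled)))
```

With `buffer_size=1` the model returns the input order unchanged, which is why a buffer at least as large as the dataset is needed for a uniform shuffle.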


#### - `repeat`
> Repeats this dataset count times.  
The default behavior (if count is None or -1) is for the dataset to be repeated indefinitely.


In [23]:
sample_const = tf.data.Dataset.from_tensor_slices(tf.constant([[1,2,3],
                                                               [4,5,6],
                                                               [7,8,9],
                                                               [10,11,12],
                                                               [13,14,15]])).repeat(count=2)
for d in sample_const:
    print(d)

tf.Tensor([1 2 3], shape=(3,), dtype=int32)
tf.Tensor([4 5 6], shape=(3,), dtype=int32)
tf.Tensor([7 8 9], shape=(3,), dtype=int32)
tf.Tensor([10 11 12], shape=(3,), dtype=int32)
tf.Tensor([13 14 15], shape=(3,), dtype=int32)
tf.Tensor([1 2 3], shape=(3,), dtype=int32)
tf.Tensor([4 5 6], shape=(3,), dtype=int32)
tf.Tensor([7 8 9], shape=(3,), dtype=int32)
tf.Tensor([10 11 12], shape=(3,), dtype=int32)
tf.Tensor([13 14 15], shape=(3,), dtype=int32)


#### - `batch`
> Combines consecutive elements of this dataset into batches.

In [24]:
sample_const = tf.data.Dataset.from_tensor_slices(tf.constant([[1,2,3],
                                                               [4,5,6],
                                                               [7,8,9],
                                                               [10,11,12],
                                                               [13,14,15]])).batch(2)
for d in sample_const:
    print(d)

tf.Tensor(
[[1 2 3]
 [4 5 6]], shape=(2, 3), dtype=int32)
tf.Tensor(
[[ 7  8  9]
 [10 11 12]], shape=(2, 3), dtype=int32)
tf.Tensor([[13 14 15]], shape=(1, 3), dtype=int32)
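
The order in which `repeat` and `batch` are chained matters. The sketch below models both orderings with plain Python lists (the helpers `batch` and `repeat` are stand-ins, not the TensorFlow methods): batching first keeps the short final batch in every epoch, whereas repeating first lets batches straddle epoch boundaries.

```python
from itertools import chain

def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size."""
    items = list(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def repeat(items, count):
    """Concatenate count copies of items."""
    return list(chain.from_iterable([list(items)] * count))

data = [1, 2, 3, 4, 5]
batch_then_repeat = repeat(batch(data, 2), 2)
repeat_then_batch = batch(repeat(data, 2), 2)

print([len(b) for b in batch_then_repeat])
print([len(b) for b in repeat_then_batch])
```

In the second ordering the third batch mixes the last element of the first pass with the first element of the second pass.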


### `tf.data.Iterator`
---
- https://www.tensorflow.org/api_docs/python/tf/data/Iterator

__Note:  
If you want to run the code cells of this section, do not run the previous section, which enables eager execution. If it has already been run, restart the kernel first.__

#### Create an iterator from the Dataset object / Generate batches
#### - `make_one_shot_iterator`

In [11]:
sample_dataset = tf.data.Dataset.from_tensor_slices(tf.range(20))
sample_iter = sample_dataset.make_one_shot_iterator()
# define ops
next_data = sample_iter.get_next()

with tf.Session() as sess:
    for ii in range(20):
        print(sess.run(next_data))

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19


#### - `make_initializable_iterator`
> An initializable iterator requires you to run an explicit iterator.initializer operation before using it. In exchange for this inconvenience, it enables you to parameterize the definition of the dataset, using one or more tf.placeholder() tensors that can be fed when you initialize the iterator.

Pipeline:  
- Define generic iterator.
- Initialize iterator.

In [13]:
max_value = tf.placeholder(tf.int64, shape=[])
dataset = tf.data.Dataset.range(max_value)

# Create iterator
iterator = dataset.make_initializable_iterator()

# Define ops
next_element = iterator.get_next()

In [14]:
# Initialize an iterator over a dataset with 10 elements.
with tf.Session() as sess:
    sess.run(iterator.initializer, feed_dict={max_value: 10})
    for i in range(10):
        value = sess.run(next_element)
        print(value)
        assert i == value

0
1
2
3
4
5
6
7
8
9


In [15]:
# Initialize the same iterator over a dataset with 100 elements.
with tf.Session() as sess:
    sess.run(iterator.initializer, feed_dict={max_value: 100})
    for i in range(100):
        value = sess.run(next_element)
        assert i == value

#### - `from_structure`

In [16]:
# Define training dataset and validation dataset (same structure)
training_dataset = tf.data.Dataset.range(100).map(
    lambda x: x + tf.random_uniform([], -10, 10, tf.int64))
validation_dataset = tf.data.Dataset.range(50)

In [17]:
# Create generic iterator
iterator = tf.data.Iterator.from_structure(training_dataset.output_types,
                                           training_dataset.output_shapes)
# Define ops of the generic iterator object
next_data = iterator.get_next()

In [18]:
# Ops that bind the generic iterator to a concrete dataset
training_init_op = iterator.make_initializer(training_dataset)
validation_init_op = iterator.make_initializer(validation_dataset)

In [19]:
# Run 20 epochs in which the training dataset is traversed, followed by the
# validation dataset.
with tf.Session() as sess:
    for _ in range(20):
        # Initialize an iterator over the training dataset.
        sess.run(training_init_op)
        for _ in range(100):
            sess.run(next_data)

        # Initialize an iterator over the validation dataset.
        sess.run(validation_init_op)
        for _ in range(50):
            sess.run(next_data)

### (Putting it together) Create a batch generator using the Dataset API
---

In [23]:
# load mnist
train_images = load_mnist_images(dataset_dir + filenames['train_images'])
train_labels = load_mnist_labels(dataset_dir + filenames['train_labels'])

# split into train/validation dataset
X_train, X_val, y_train, y_val = train_test_split(train_images, train_labels, test_size=0.2, shuffle=True)

X_test = load_mnist_images(dataset_dir + filenames['test_images'])
y_test = load_mnist_labels(dataset_dir + filenames['test_labels'])

X_train = np.float32(X_train)
X_val = np.float32(X_val)
X_test = np.float32(X_test)

In [32]:
n_epochs = 10
batch_size = 32
norm = tf.constant(255.0)

# Build source dataset for training
X_train_dataset = tf.data.Dataset.from_tensor_slices(X_train).map(lambda x : tf.divide(x, norm))
y_train_dataset = tf.data.Dataset.from_tensor_slices(y_train).map(lambda x : tf.one_hot(x, 10))
train_dataset = tf.data.Dataset.zip((X_train_dataset, y_train_dataset)).batch(batch_size)

# Build source dataset for validation
X_valid_dataset = tf.data.Dataset.from_tensor_slices(X_val).map(lambda x : tf.divide(x, norm))
y_valid_dataset = tf.data.Dataset.from_tensor_slices(y_val).map(lambda x : tf.one_hot(x, 10))
validation_dataset = tf.data.Dataset.zip((X_valid_dataset, y_valid_dataset)).batch(batch_size)

# Make iterator from dataset structure
dataset_iter = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
# Define ops for batch generation
next_data = dataset_iter.get_next()

# Define initializer ops
train_iter_init = dataset_iter.make_initializer(train_dataset)
validation_iter_init = dataset_iter.make_initializer(validation_dataset)

with tf.Session() as sess:
    for epoch in range(n_epochs):
        print("epoch: {}".format(epoch))
        # training
        sess.run(train_iter_init)
        while True:
            try:
                train_batch = sess.run(next_data)
                # Add training ops here
            except tf.errors.OutOfRangeError:
                break
        # validation
        sess.run(validation_iter_init)
        while True:
            try:
                validation_batch = sess.run(next_data)
                # Add validation ops here
            except tf.errors.OutOfRangeError:
                break

epoch: 0
epoch: 1
epoch: 2
epoch: 3
epoch: 4
epoch: 5
epoch: 6
epoch: 7
epoch: 8
epoch: 9
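
Each inner `while` loop ends when the iterator raises `tf.errors.OutOfRangeError`, which happens after the last batch of the split. Since `batch` keeps a final partial batch by default, the number of steps per epoch is the ceiling of the split size over the batch size. A quick check, using the split sizes printed earlier in this notebook (48000 training and 12000 validation samples):

```python
import math

batch_size = 32
n_train, n_val = 48000, 12000

train_steps = math.ceil(n_train / batch_size)
val_steps = math.ceil(n_val / batch_size)
print(train_steps, val_steps)
```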
