<a href="https://colab.research.google.com/github/iust-deep-learning/tensorflow-2-tutorial/blob/master/part_05_tf_data/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TensorFlow 2.0 Tutorial: Part #5


TensorFlow 2.0 Tutorial by IUST

*   Last Update: May 2020
*   Official Page: https://github.com/iust-deep-learning/tensorflow-2-tutorial





---




Please run the following cell before going through the rest of the tutorial.

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

# Install TensorFlow
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
import numpy as np

from pprint import pprint

### Exercise #1: Simple ops

Convert the following dataset to the one-hot encoding format and vice-versa.

a) Normal to one-hot

In [3]:
NUM_CLASSES = 5

ds = tf.data.Dataset.from_tensor_slices(
    [0, 2, 1, 3, 3, 4, 1, 3, 3, 4])

# tf.one_hot expects a tensor, not a Dataset, so apply it to each element with map.
ds = ds.map(lambda x: tf.one_hot(x, NUM_CLASSES, on_value=1.0, off_value=0.0))

list(ds.as_numpy_iterator())

b) one-hot to Normal


In [0]:
ds = tf.data.Dataset.from_tensor_slices(
    [[1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]], )

####################################
#   Put your implementation here   #
####################################

list(ds.as_numpy_iterator())
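
One possible approach to part (b), offered as a sketch rather than the reference solution: because each row contains a single 1, taking `tf.argmax` along the last axis recovers the class index. The names `one_hot_ds` and `labels_ds` are illustrative and do not appear in the original notebook, and the data is a shortened copy of the exercise input.

```python
import tensorflow as tf

# A few one-hot rows (shortened from the exercise data for brevity).
one_hot_ds = tf.data.Dataset.from_tensor_slices(
    [[1., 0., 0., 0., 0.],
     [0., 0., 1., 0., 0.],
     [0., 1., 0., 0., 0.],
     [0., 0., 0., 1., 0.]])

# The position of the largest value in each row is the class index.
labels_ds = one_hot_ds.map(
    lambda row: tf.argmax(row, axis=-1, output_type=tf.int32))

labels = [int(x) for x in labels_ds.as_numpy_iterator()]
print(labels)  # [0, 2, 1, 3]
```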

### Exercise #2: Seq-to-Seq dataset

In this question, we want to create a `tf.data.Dataset` instance for a sequence-to-sequence task. The source and target sequences are given in two separate streams of the same size, so each sequence in the source stream is paired with the sequence at the same position in the target stream.

1. Create a Dataset instance where each element is a (src_seq, tgt_seq) tuple 
2. Remove pairs in which the source or the target sequence is empty.
3. Batch and pad the dataset.

In [0]:
import random

def get_generator(size):
  def gen():
    for _ in range(size):
      yield np.random.randint(0, 100, size=[np.random.randint(0, 11)])
  return gen

src_seq_gen = get_generator(10000)
tgt_seq_gen = get_generator(10000)

src_ds = tf.data.Dataset.from_generator(
    src_seq_gen,
    output_types=tf.int32, output_shapes=tf.TensorShape([None]))

tgt_ds = tf.data.Dataset.from_generator(
    tgt_seq_gen,
    output_types=tf.int32, output_shapes=tf.TensorShape([None]))

ds = #...

####################################
#   Put your implementation here   #
####################################

# Hint: You may need:
# 1. tf.shape(..), https://www.tensorflow.org/api_docs/python/tf/shape
# 2. tf.data.Dataset.padded_batch(...), https://www.tensorflow.org/api_docs/python/tf/data/Dataset#padded_batch

for s, t in ds.take(10):
  print(s)
  print(t)
  print("-------------")
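
The three steps of this exercise could be sketched as follows. This is one possible pipeline under stated assumptions, not the reference solution: `make_stream` is a hypothetical helper that replaces the random generators with small fixed sequences so the result is predictable, and the batch size of 2 is chosen only for illustration.

```python
import tensorflow as tf

def make_stream(sequences):
  # Hypothetical helper: a deterministic stand-in for the random generators.
  return tf.data.Dataset.from_generator(
      lambda: iter(sequences),
      output_types=tf.int32, output_shapes=tf.TensorShape([None]))

src_ds = make_stream([[1, 2, 3], [], [4, 5], [6]])
tgt_ds = make_stream([[7, 8], [9], [], [10, 11, 12, 13]])

# Step 1: pair the sequences that sit at the same position in both streams.
ds = tf.data.Dataset.zip((src_ds, tgt_ds))

# Step 2: keep only the pairs in which both sequences are non-empty.
ds = ds.filter(
    lambda s, t: tf.logical_and(tf.shape(s)[0] > 0, tf.shape(t)[0] > 0))

# Step 3: batch, padding each component with zeros up to the longest
# sequence of that component within the batch.
ds = ds.padded_batch(2, padded_shapes=([None], [None]))

batches = list(ds.as_numpy_iterator())
for s, t in batches:
  print(s)
  print(t)
```

Source and target are padded independently, so the two tensors in a batch may have different widths. Note that zero is also a valid token id in the exercise's generators, so a real model would likely need a mask or a dedicated padding id.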

### Exercise #3: Language Model Data

The language modeling task, despite being straightforward, requires special data preparation. A dataset for this task consists of hundreds of documents, and each document is a sequence of words/tokens. To prepare the dataset for the model, we first concatenate all the documents into one giant sequence. Then we divide this sequence into smaller, equal-length sequences, which become the elements of our dataset. Finally, the training batches are constructed by grouping several elements together. For instance, suppose that merging all the documents produces a sequence of 40 elements:

<p align="center">
<img src="https://imgur.com/download/6HjGVpG"/>
</p>

In this question, you should implement a data pipeline for a language model that follows the above strategy. That is:

1. Read the dataset file. Each line is a single document. Tokenize each document by splitting it on spaces.
2. Append the `<eod>` token to the end of each document so that the model can learn where documents end.
3. Concatenate all documents to form a single sequence of words/tokens.
4. Divide this sequence of tokens into 5-element sequences.
5. Create a batched dataset with batch_size = 4.

Note that your final data pipeline should yield something like `[(batch_1_inputs, batch_1_targets), (batch_2_inputs, batch_2_targets), ..., (batch_n_inputs, batch_n_targets)]`. In the language modeling task, the `targets` are essentially the shifted inputs. (Hint: create another, shifted dataset for the targets after step 3. Then zip the original sequence and the shifted sequence to create a dataset consisting of `(inputs, targets)` pairs.)
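
To make the hint concrete, here is a minimal sketch of the shift-and-zip idea on a toy token stream. It assumes a shift of one token, which is the usual next-token setup; the five letters are placeholder tokens, not part of the exercise data.

```python
import tensorflow as tf

# A toy token stream standing in for the concatenated documents.
tokens = tf.data.Dataset.from_tensor_slices(["a", "b", "c", "d", "e"])

# Skipping the first element shifts the stream left by one token.
shifted = tokens.skip(1)

# zip stops at the shorter dataset, so the last token has no target.
pairs = tf.data.Dataset.zip((tokens, shifted))

result = [(x.decode(), y.decode()) for x, y in pairs.as_numpy_iterator()]
print(result)  # [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e')]
```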

First, run the following cell:

In [0]:
%%writefile dataset.txt
d1_w1 d1_w2 d1_w3 d1_w4 d1_w5 d1_w6 d1_w7 d1_w8 d1_w9 d1_w10
d2_w1 d2_w2 d2_w3 d2_w4 d2_w5 d2_w6 d2_w7 d2_w8 d2_w9 d2_w10 d2_w11 d2_w12
d3_w1 d3_w2 d3_w3 d3_w4 d3_w5 d3_w6 d3_w7 d3_w8 d3_w9 d3_w10 d3_w11 d3_w12 d3_w13 d3_w14 d3_w15
d4_w1 d4_w2 d4_w3 d4_w4 d4_w5 d4_w6 d4_w7 d4_w8 d4_w9 d4_w10 d4_w11 d4_w12 d4_w13
d5_w1 d5_w2 d5_w3 d5_w4 d5_w5 d5_w6 d5_w7 d5_w8 d5_w9 d5_w10 d5_w11 d5_w12 d5_w13
d6_w1 d6_w2 d6_w3 d6_w4 d6_w5 d6_w6 d6_w7 d6_w8 d6_w9
d7_w1 d7_w2 d7_w3 d7_w4 d7_w5 d7_w6 d7_w7 d7_w8 d7_w9 d7_w10 d7_w11 d7_w12 d7_w13
d8_w1 d8_w2 d8_w3 d8_w4 d8_w5 d8_w6 d8_w7
d9_w1 d9_w2 d9_w3 d9_w4 d9_w5 d9_w6 d9_w7 d9_w8 d9_w9 d9_w10 d9_w11 d9_w12
d10_w1 d10_w2 d10_w3 d10_w4 d10_w5 d10_w6 d10_w7 d10_w8 d10_w9 d10_w10 d10_w11 d10_w12

In [0]:
ds = tf.data.TextLineDataset('dataset.txt')

# Hint: You may need: 
# 1. tf.strings.split(...) https://www.tensorflow.org/api_docs/python/tf/strings/split
# 2. tf.data.Dataset.flat_map(...) https://www.tensorflow.org/api_docs/python/tf/data/Dataset#flat_map
# 3. tf.data.Dataset.zip(...) https://www.tensorflow.org/api_docs/python/tf/data/Dataset#zip
# 4. tf.data.Dataset.skip(...) https://www.tensorflow.org/api_docs/python/tf/data/Dataset#skip
# 5. tf.data.Dataset.batch(...) https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch

####################################
#   Put your implementation here   #
####################################

ds
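
The five steps above could be sketched as follows. This is one possible pipeline under stated assumptions, not the reference solution: it writes a small hypothetical file (`toy_dataset.txt`) so the block runs on its own, it shifts the targets by one token, and it drops any trailing tokens that do not fill a complete 5-element sequence. The names `to_tokens`, `tokens`, and `lm_ds` are illustrative.

```python
import os
import tempfile
import tensorflow as tf

SEQ_LEN = 5
BATCH_SIZE = 4

# Hypothetical stand-in for dataset.txt so that the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "toy_dataset.txt")
with open(path, "w") as f:
  f.write("a1 a2 a3\nb1 b2\nc1 c2 c3 c4\n")

docs = tf.data.TextLineDataset(path)

def to_tokens(line):
  # Steps 1-2: split one document on spaces and append the <eod> token.
  words = tf.strings.split(line, sep=" ")
  words = tf.concat([words, ["<eod>"]], axis=0)
  return tf.data.Dataset.from_tensor_slices(words)

# Step 3: flat_map concatenates the per-document tokens into one stream.
tokens = docs.flat_map(to_tokens)

# Step 4: cut the stream (and its shifted copy) into 5-token sequences.
inputs = tokens.batch(SEQ_LEN, drop_remainder=True)
targets = tokens.skip(1).batch(SEQ_LEN, drop_remainder=True)

# Step 5: pair inputs with targets and group the pairs into batches.
lm_ds = tf.data.Dataset.zip((inputs, targets)).batch(BATCH_SIZE)

batches = list(lm_ds.as_numpy_iterator())
for x, y in batches:
  print(x)
  print(y)
```

The toy file yields 12 tokens, so only two complete input sequences exist and the single batch has shape `(2, 5)`. With the exercise's `dataset.txt`, the same pipeline would produce several batches of up to four sequences each.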


