In this tutorial, we study the TensorFlow version of the Embedding() layer as well as word embeddings. We will also introduce a new TensorFlow API that sits alongside the Keras API. This API, called 'tf.data', is often used in conjunction with Embedding() layers in NLP tasks. We have encountered word embeddings before using Python libraries such as 'gensim'. Here we will extend the analysis under the TensorFlow framework. 

This tutorial contains an introduction to word embeddings under the TensorFlow framework. We will train our own word embeddings using a simple Keras model for a sentiment classification task, and then visualize them in the "Embedding Projector", an interactive, browser-based tool for exploring word embeddings exported from a Python script. 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import os
import re
import shutil
import string
import io 

from tensorflow.keras import layers
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import GlobalAveragePooling1D
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.preprocessing import text_dataset_from_directory  # this only exists beyond TensorFlow 2.5 or above

In [2]:
#path="C:\\Users\\gao\\GAO_Jupyter_Notebook\\Datasets"
# path="C:\\Users\\GAO\\python workspace\\GAO_Jupyter_Notebook\\Datasets"
# os.chdir(path)

path="C:\\Users\\pgao\\Documents\\PGZ Documents\\Programming Workshop\\PYTHON\\Open Courses on Python\\Udemy Course on Python\\Introduction to Data Science Using Python\\datasets"  # every backslash is escaped; adjust this path to your own machine
os.chdir(path)

print("TensorFlow Version: ", tf.__version__) # this needs version 2.6 or above; if not, in the anaconda prompt, we can run "pip install --user tensorflow --upgrade"
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")

TensorFlow Version:  2.7.0
Eager mode:  True
GPU is NOT AVAILABLE


Recall that word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. We do not have to specify this encoding by hand. An embedding is a dense vector of floating point values (the length of the vector is a parameter we specify). It is common to see word embeddings that are 8-dimensional (for small datasets), up to 1024-dimensions when working with large datasets. A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.
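
To make the "dense vector" idea concrete, the sketch below treats an embedding as a plain lookup table: a matrix with one row per vocabulary entry. The toy vocabulary, the 8-dimensional size and the random (untrained) values are assumptions for illustration only; in a trained model the rows would be learned so that similar words end up close together.

```python
import numpy as np

# Hypothetical toy vocabulary; a real one would be built from the corpus.
vocab = {"good": 0, "great": 1, "bad": 2, "movie": 3}
embedding_dim = 8  # length of each word vector, a parameter we choose

rng = np.random.default_rng(0)
# One row per word. Random here; training would adjust these values.
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

def embed(words):
    """Look up the dense vector of each word by row indexing."""
    indices = [vocab[w] for w in words]
    return embedding_matrix[indices]

vectors = embed(["great", "movie"])
print(vectors.shape)  # (2, 8): two words, eight floats each
```

The Embedding() layer discussed later plays the role of `embedding_matrix`, except that its rows are trainable weights.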

In this tutorial we will go over a word-embedding example in TensorFlow. However, before we jump into the actual example, we need some prerequisites. TensorFlow is a powerful platform where many APIs are available. The one we are familiar with so far is the Keras API. However, when it comes to data science, 'tf.keras' is not the only popular one. Here, we will introduce the 'tf.data' module. This module comes in handy when we deal with large volumes of text data. It is part of the TensorFlow ecosystem and allows more flexible handling of datasets, with scalable data preprocessing and transformations. 

### I. The 'tf.data.Dataset' API

Within the TensorFlow 2.6 version or above, one of the most powerful modules is the 'tf.data' API. Within this module, the 'tf.data.Dataset' is an API that represents a potentially large set of elements, which we will be focusing on in this section. It supports writing descriptive and efficient input pipelines. The 'Dataset' usage follows a common pattern:

   - create a source dataset from input data;
   - apply dataset transformations to preprocess the data;
   - iterate over the dataset and process the elements;
   - iteration happens in a streaming fashion, so the full dataset does not need to fit into memory;
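
The pattern above can be sketched in a few lines. This is a minimal illustration with made-up numbers rather than a realistic input pipeline; the variable names are our own:

```python
import tensorflow as tf

# 1. Create a source dataset from in-memory data.
source = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])

# 2. Apply transformations: square each element, then group into batches of 4.
pipeline = source.map(lambda x: x * x).batch(4)

# 3. Iterate; elements are produced one batch at a time.
batches = [batch.tolist() for batch in pipeline.as_numpy_iterator()]
print(batches)  # [[1, 4, 9, 16], [25, 36]]
```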

Let's see some examples of the "tf.data.Dataset" API. The simplest way to create a dataset is to create it from a Python list. Notice that when we generate the dataset, the type of object is specific to TensorFlow rather than a 'numpy' array:


In [3]:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset:
    print(element)
print(type(dataset))
print(dataset)

tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
<class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>
<TensorSliceDataset shapes: (), types: tf.int32>


If we want to see the results as common 'numpy' type of objects, we can invoke the as_numpy_iterator() method, which returns an iterator converting all elements of the underlying dataset to a 'numpy' type of object:

In [4]:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset:
    print(element)
for element in dataset.as_numpy_iterator():
  print(element)

tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
1
2
3


The function will also be able to preserve the nested structure of the dataset elements:

In [5]:
dataset = tf.data.Dataset.from_tensor_slices({'a': ([1, 2], [3, 4]),
                                              'b': [5, 6]})
print(type(dataset))
list(dataset.as_numpy_iterator()) == [{'a': (1, 3), 'b': 5},
                                      {'a': (2, 4), 'b': 6}]

<class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>


True

We have seen a simple example creating an object in the 'Dataset' library. The purpose of this library is to turn different kinds of input into the forms we want for processing. In the example above, we are given a simple tensor to work with. In reality, we may face different types of input data or formats. To handle this variety, the module has many other utilities incorporated. For example, to process lines from files, we can use tf.data.TextLineDataset():

   - dataset = tf.data.TextLineDataset([_file1.txt_, _file2.txt_])

To process records written in the 'TFRecord' format, we can use tf.data.TFRecordDataset():

   - dataset = tf.data.TFRecordDataset([_file1.tfrecords_, _file2.tfrecords_])

To create a dataset of all files matching a pattern, we can use tf.data.Dataset.list_files():

   - dataset = tf.data.Dataset.list_files("/path/*.txt")
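
As a hedged illustration of two of these readers, the sketch below writes two small throwaway text files to a temporary directory (the file names and contents are made up) and reads them back with tf.data.TextLineDataset() and tf.data.Dataset.list_files():

```python
import os
import tempfile
import tensorflow as tf

# Write two small throwaway text files (hypothetical names and contents).
tmp_dir = tempfile.mkdtemp()
paths = []
for name, lines in [("file1.txt", ["a good film", "a bad film"]),
                    ("file2.txt", ["a long film"])]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    paths.append(path)

# Each element of the dataset is one line of text (a scalar string tensor).
line_ds = tf.data.TextLineDataset(paths)
lines_read = [line.decode("utf-8") for line in line_ds.as_numpy_iterator()]
print(lines_read)

# list_files() yields the names of the files that match the pattern.
file_ds = tf.data.Dataset.list_files(os.path.join(tmp_dir, "*.txt"), shuffle=False)
files_found = sorted(os.path.basename(p.decode("utf-8"))
                     for p in file_ds.as_numpy_iterator())
print(files_found)
```

Note that list_files() shuffles the file names by default, which is why `shuffle=False` is passed here to keep the listing stable.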

Once we have a dataset, we can apply transformations to prepare the data for your model:

In [6]:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
dataset = dataset.map(lambda x: x*2)
list(dataset.as_numpy_iterator()) # the as_numpy_iterator() method turns the object into a 'numpy' object

[2, 4, 6]

Notice that the map() method above is a method of the 'Dataset' object. Since the 'dataset' created above from the tensor slice is an instance of a class, it has a set of associated methods. For example, we can use the apply() method to apply a function to the 'Dataset' at hand:

In [7]:
dataset = tf.data.Dataset.range(100)
def dataset_fn(ds):
    return ds.filter(lambda x: x < 5)
dataset = dataset.apply(dataset_fn)
list(dataset.as_numpy_iterator())

[0, 1, 2, 3, 4]

Much of the rest of the section introduces the most useful methods in the 'tf.data.Dataset' API, such as the apply() method above. We start with the batch() method, which is designed to combine consecutive elements of a dataset into batches:

In [8]:
dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3)
print(list(dataset.as_numpy_iterator()))

dataset = tf.data.Dataset.range(11)
dataset = dataset.batch(3, drop_remainder=True)
print(list(dataset.as_numpy_iterator()))

[array([0, 1, 2], dtype=int64), array([3, 4, 5], dtype=int64), array([6, 7], dtype=int64)]
[array([0, 1, 2], dtype=int64), array([3, 4, 5], dtype=int64), array([6, 7, 8], dtype=int64)]
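
The effect of _drop\_remainder_ can be mimicked in plain Python. The helper below is only a mental model of batch() on an in-memory list (the name `batch` is our own), not TensorFlow code:

```python
def batch(elements, batch_size, drop_remainder=False):
    """Group consecutive elements into lists of length batch_size."""
    out = []
    for start in range(0, len(elements), batch_size):
        chunk = elements[start:start + batch_size]
        if len(chunk) < batch_size and drop_remainder:
            break  # the final, shorter batch is discarded
        out.append(chunk)
    return out

kept = batch(list(range(8)), 3)                           # last batch has 2 elements
dropped = batch(list(range(11)), 3, drop_remainder=True)  # 9 and 10 are dropped
print(kept)     # [[0, 1, 2], [3, 4, 5], [6, 7]]
print(dropped)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```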


The concatenate() method can concatenate the datasets. Keep in mind that the input dataset and dataset to be concatenated should have compatible element specs:

In [59]:
a = tf.data.Dataset.range(1, 4)  # ==> [ 1, 2, 3 ]
b = tf.data.Dataset.range(4, 8)  # ==> [ 4, 5, 6, 7 ]
ds = a.concatenate(b)
print(list(ds.as_numpy_iterator()))

c = tf.data.Dataset.zip((a, b))
print(type(c))
print(c)
try:
    a.concatenate(c)
except TypeError:
    print('Two datasets to be concatenated must have the same data type')


[1, 2, 3, 4, 5, 6, 7]
<class 'tensorflow.python.data.ops.dataset_ops.ZipDataset'>
<ZipDataset shapes: ((), ()), types: (tf.int64, tf.int64)>
Two datasets to be concatenated must have the same data type


The enumerate() method can enumerate the elements of datasets:

In [64]:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
dataset = dataset.enumerate(start=4)
for element in dataset.as_numpy_iterator():
    print(element)

print("------")

dataset = tf.data.Dataset.from_tensor_slices([(7, 8), (9, 10)]) # the nested structure of the input dataset determines the structure of elements in the resulting dataset
dataset = dataset.enumerate()
for element in dataset.as_numpy_iterator():
    print(element)

(4, 1)
(5, 2)
(6, 3)
------
(0, array([7, 8]))
(1, array([ 9, 10]))


The filter() method filters the dataset according to a _predicate_; it works like a subset condition (i.e. we keep all elements that meet the condition and discard the rest). The method can be used in the following way: 

   - filter(_predicate_)

Here is the example of how we can use this method:

In [66]:
dataset = tf.data.Dataset.from_tensor_slices([0, 1, 2, 3])
dataset = dataset.filter(lambda x: x < 3)
print(list(dataset.as_numpy_iterator()))

print("------")

def filter_fn(x):
    return tf.math.equal(x, 1)  # tf.math.equal(x, y) is required for equality comparison
dataset = dataset.filter(filter_fn)
print(list(dataset.as_numpy_iterator()))

[0, 1, 2]
------
[1]


We have already seen the from_tensor_slices() method in the examples above. The method is a useful tool associated with the 'tf.data.Dataset' API. It creates a special type of object called '_tensorflow.python.data.ops.dataset\_ops.TensorSliceDataset_' whose elements are slices of the given tensors. For simplicity, we will call this type of dataset the 'Dataset' object, as this object is specific to the 'tf.data.Dataset' API. Here is a little more detail on how we can use it. First, from_tensor_slices() can take different types of input: the example below shows that it can take 1D and 2D tensors, tuples and dictionaries:

In [67]:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3]) # slicing a 1D tensor produces scalar tensor elements
print(type(dataset))
print(list(dataset.as_numpy_iterator()))

dataset = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4]]) # slicing a 2D tensor produces 1D tensor elements
print(list(dataset.as_numpy_iterator()))

dataset = tf.data.Dataset.from_tensor_slices(([1, 2], [3, 4], [5, 6])) # slicing a tuple of 1D tensors produces tuple elements containing scalar tensors
print(list(dataset.as_numpy_iterator()))

dataset = tf.data.Dataset.from_tensor_slices({"a": [1, 2], "b": [3, 4]}) # dictionary structure is also preserved
print(list(dataset.as_numpy_iterator()) == [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}])

<class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>
[1, 2, 3]
[array([1, 2]), array([3, 4])]
[(1, 3, 5), (2, 4, 6)]
True


We can also combine different tensors into a single dataset:

In [74]:
features = tf.constant([[1, 3], [2, 1], [3, 3]]) # ==> 3-by-2 tensor
lbs = tf.constant(['A', 'B', 'A']) # ==> 1D tensor of length 3
dataset = tf.data.Dataset.from_tensor_slices((features, lbs)) # 2 tensors can be combined into one Dataset object
print(dataset)
for element in dataset.as_numpy_iterator():
    print(element)
print('--------------')
features_dataset = tf.data.Dataset.from_tensor_slices(features) # both the features and the labels tensors can be converted to a 'Dataset' object separately and combined after
labels_dataset = tf.data.Dataset.from_tensor_slices(lbs)
dataset = tf.data.Dataset.zip((features_dataset, labels_dataset))
print(dataset)
for element in dataset.as_numpy_iterator():
    print(element)
print('--------------')
batched_features = tf.constant([[[1, 3], [2, 3]],
                                [[2, 1], [1, 2]],
                                [[3, 3], [3, 2]]], shape=(3, 2, 2)) # a batched feature and label set can be converted to a 'Dataset' type of object in similar fashion
batched_labels = tf.constant([['A', 'A'],
                              ['B', 'B'],
                              ['A', 'B']], shape=(3, 2, 1))
dataset = tf.data.Dataset.from_tensor_slices((batched_features, batched_labels))
for element in dataset.as_numpy_iterator():
    print(element)

<TensorSliceDataset shapes: ((2,), ()), types: (tf.int32, tf.string)>
(array([1, 3]), b'A')
(array([2, 1]), b'B')
(array([3, 3]), b'A')
-------------------
<ZipDataset shapes: ((2,), ()), types: (tf.int32, tf.string)>
(array([1, 3]), b'A')
(array([2, 1]), b'B')
(array([3, 3]), b'A')
-------------------
(array([[1, 3],
       [2, 3]]), array([[b'A'],
       [b'A']], dtype=object))
(array([[2, 1],
       [1, 2]]), array([[b'B'],
       [b'B']], dtype=object))
(array([[3, 3],
       [3, 2]]), array([[b'A'],
       [b'B']], dtype=object))


A corresponding function is from_tensors(). This method creates a 'Dataset' object with a single element, comprising the given tensors:

In [75]:
dataset = tf.data.Dataset.from_tensors([1, 2, 3])
print(list(dataset.as_numpy_iterator()))

dataset = tf.data.Dataset.from_tensors(([1, 2, 3], 'A'))
print(list(dataset.as_numpy_iterator()))

example = tf.constant([1,2,3])
dataset = tf.data.Dataset.from_tensors(example).repeat(2)
print(list(dataset.as_numpy_iterator()))

[array([1, 2, 3])]
[(array([1, 2, 3]), b'A')]
[array([1, 2, 3]), array([1, 2, 3])]


Let's keep looking at some other important methods. The list_files() method matches file patterns. This is very useful when we want to handle multiple files with similar names. For example, if we had the following files on our filesystem:

   - /path/to/dir/a.txt
   - /path/to/dir/b.py
   - /path/to/dir/c.py

and if we pass "/path/to/dir/*.py" as the file pattern, the dataset would produce:

   - /path/to/dir/b.py
   - /path/to/dir/c.py

This is very similar to SAS and R, where we can define a set of files and then let the code read through them iteratively.

Another function similar to the R function lapply() is the map(_map\_func_) method. This method maps its argument '_map\_func_' across the elements of this dataset. This transformation applies _map\_func_ to each element of this dataset, and returns a new dataset containing the transformed elements, in the same order as they appeared in the input. Here, _map\_func_ can be used to change both the values and the structure of a dataset's elements. Below is an example:

In [76]:
ds1 = tf.data.Dataset.range(1, 6)  # [ 1, 2, 3, 4, 5 ]
print(list(ds1.as_numpy_iterator()))
ds1 = ds1.map(lambda x: x + 1) # the argument takes a single element of type 'tf.Tensor' with the same shape and dtype
print(list(ds1.as_numpy_iterator()))

[1, 2, 3, 4, 5]
[2, 3, 4, 5, 6]


The input signature of _map\_func_ is determined by the structure of each element in this dataset. Below are some examples to illustrate the variety of usage. In the first example, each element is a tuple, while in the second example each element is a dictionary:

In [78]:
elements = [(1, "foo"), (2, "bar"), (3, "baz")] # each element is a tuple containing two 'tf.Tensor' objects
ds2 = tf.data.Dataset.from_generator(lambda: elements, (tf.int32, tf.string))
result = ds2.map(lambda x_int, y_str: x_int) # the argument takes 2 arguments of type 'tf.Tensor', and this function projects out just the first component
print(list(result.as_numpy_iterator()))

elements =  ([{"a": 1, "c": "foo"},
              {"a": 2, "c": "bar"},
              {"a": 3, "c": "baz"}]) # each element is a dictionary mapping strings to 'tf.Tensor' objects
ds3 = tf.data.Dataset.from_generator(lambda: elements, {"a": tf.int32, "c": tf.string}) 
result = ds3.map(lambda d: tf.strings.as_string(d["a"]) + d["c"]) # map() passes a single argument of type 'dict' with the same keys as the elements; tf.strings.as_string() converts the integer, whereas Python's str() would only stringify the symbolic tensor
print(list(result.as_numpy_iterator()))

[1, 2, 3]
[b'1foo', b'2bar', b'3baz']


The map() method can also take more complicated functions, including ones that return nested or custom structures:

In [90]:
dataset = tf.data.Dataset.range(3)
def g(x):
    return tf.constant(37.0), tf.constant(["Foo", "Bar", "Baz"])
result = dataset.map(g)
print(type(result.element_spec)) # this is a tuple
for element in result.element_spec:
    print(element)
print('--------------')

dataset = tf.data.Dataset.range(3)
def n(x):
    return (37.0, [42, 16]), "foo"
result = dataset.map(n) # the map() method can return nested structures
print(result.element_spec) # this is a tuple
print('--------------')

dataset = tf.data.Dataset.range(3)
def h(x):
  return 37.0, ["Foo", "Bar"], np.array([1.0, 2.0], dtype=np.float64)
result = dataset.map(h) # Python primitives, lists, and numpy arrays are implicitly converted to 'tf.Tensor'
print(result.element_spec) # this is a tuple

<class 'tuple'>
TensorSpec(shape=(), dtype=tf.float32, name=None)
TensorSpec(shape=(3,), dtype=tf.string, name=None)
--------------
((TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(2,), dtype=tf.int32, name=None)), TensorSpec(shape=(), dtype=tf.string, name=None))
--------------
(TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(2,), dtype=tf.string, name=None), TensorSpec(shape=(2,), dtype=tf.float64, name=None))


The map() method is one of the most complicated methods. The online reference has more information. We will not elaborate here.

Next, let's examine the padded_batch() method. This function combines consecutive elements of this dataset into padded batches. The transformation combines multiple consecutive elements of the input dataset into a single element. The syntax is:

   - padded_batch(_batch\_size_, _padded\_shapes_=None, _padding\_values_=None, _drop\_remainder_=False).

In [100]:
A = (tf.data.Dataset.range(1, 5, output_type=tf.int32).map(lambda x: tf.fill([x], x))) # elements of increasing length: [1], [2 2], [3 3 3], [4 4 4 4]
for element in A.as_numpy_iterator():
    print(element)
print('')

B = A.padded_batch(2) # padding to the smallest per-batch size that fits all elements
for element in B.as_numpy_iterator():
    print(element)

print('')

C = A.padded_batch(2, padded_shapes=5) # padding to a fixed size
for element in C.as_numpy_iterator():
    print(element)

print('')

D = A.padded_batch(2, padded_shapes=5, padding_values=-1) # padding with a custom value
for element in D.as_numpy_iterator():
    print(element)

print('')

[1]
[2 2]
[3 3 3]
[4 4 4 4]

[[1 0]
 [2 2]]
[[3 3 3 0]
 [4 4 4 4]]

[[1 0 0 0 0]
 [2 2 0 0 0]]
[[3 3 3 0 0]
 [4 4 4 4 0]]

[[ 1 -1 -1 -1 -1]
 [ 2  2 -1 -1 -1]]
[[ 3  3  3 -1 -1]
 [ 4  4  4  4 -1]]
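
The padding rule in the outputs above can be reproduced with ordinary Python lists. This is a simplified model for one-dimensional sequences only; the helper name and its arguments are our own and do not mirror the full padded_batch() signature:

```python
def padded_batch(sequences, batch_size, padded_length=None, padding_value=0):
    """Pad each group of batch_size sequences to a common length."""
    batches = []
    for start in range(0, len(sequences), batch_size):
        group = sequences[start:start + batch_size]
        # Default target: the longest sequence in this particular batch.
        if padded_length is None:
            target = max(len(s) for s in group)
        else:
            target = padded_length
        batches.append([s + [padding_value] * (target - len(s)) for s in group])
    return batches

seqs = [[1], [2, 2], [3, 3, 3], [4, 4, 4, 4]]
per_batch = padded_batch(seqs, 2)
fixed = padded_batch(seqs, 2, padded_length=5, padding_value=-1)
print(per_batch)  # [[[1, 0], [2, 2]], [[3, 3, 3, 0], [4, 4, 4, 4]]]
print(fixed)
```

Padding per batch keeps the tensors as small as possible, while a fixed length gives every batch the same shape.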



Let's look at some complicated examples involving padding:

In [112]:
elements = [([1, 2, 3], [10]),
            ([4, 5], [11, 12])]
dataset0 = tf.data.Dataset.from_generator(lambda: iter(elements), (tf.int32, tf.int32)) # components of nested elements can be padded independently
dataset = dataset0.padded_batch(2, padded_shapes=([4], [None]), padding_values=(-1, 100)) # padding the 1st component of the tuple to length 4, and the 2nd component to the smallest size that fits
print(list(dataset.as_numpy_iterator()))

print('')

E = tf.data.Dataset.zip((A, A))
F = E.padded_batch(2, padding_values=-1) # padding with a single value and multiple components
print('After padding E: ')
for element in F.as_numpy_iterator():
    print(element)

[(array([[ 1,  2,  3, -1],
       [ 4,  5, -1, -1]]), array([[ 10, 100],
       [ 11,  12]]))]

After padding E: 
(array([[ 1, -1],
       [ 2,  2]]), array([[ 1, -1],
       [ 2,  2]]))
(array([[ 3,  3,  3, -1],
       [ 4,  4,  4,  4]]), array([[ 3,  3,  3, -1],
       [ 4,  4,  4,  4]]))


The methods so far are fairly complicated conceptually. Let's look at some easier methods. The 'tf.data' module behaves a lot like the 'numpy' package in some ways. For example, the random() method creates a dataset of pseudo-random values. We have also seen the range() method, which generates a range of values:

In [113]:
d1 = tf.data.Dataset.random(seed=4).take(10) # take(10) creates a 'Dataset' object with at most 10 elements from the underlying dataset
print(type(d1))
print(list(d1.as_numpy_iterator()))

<class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'>
[3949691112, 1216040066, 825314089, 3986316358, 3335442075, 2799817130, 3237978012, 131803297, 1232301472, 2905322510]


In [115]:
print(list(tf.data.Dataset.range(5).as_numpy_iterator()))
print(list(tf.data.Dataset.range(2, 5).as_numpy_iterator())) # [2, 3, 4]
print(list(tf.data.Dataset.range(1, 5, 2).as_numpy_iterator()))
print(list(tf.data.Dataset.range(1, 5, -2).as_numpy_iterator()))
print(list(tf.data.Dataset.range(2, 5, output_type=tf.int32).as_numpy_iterator()))
print(list(tf.data.Dataset.range(2, 5, output_type=tf.float32).as_numpy_iterator()))

[0, 1, 2, 3, 4]
[2, 3, 4]
[1, 3]
[]
[2, 3, 4]
[2.0, 3.0, 4.0]


The reduce() method reduces the input dataset to a single element. The syntax for this function is:

   - reduce(_initial\_state_, _reduce\_func_)

The transformation calls _reduce\_func_ successively on every element of the input dataset until the dataset is exhausted, aggregating information in its internal state. The _initial\_state_ argument is used for the initial state and the final state is returned as the result.

In [20]:
print(tf.data.Dataset.range(5).reduce(np.int64(0), lambda x, _: x + 1).numpy())
print(tf.data.Dataset.range(5).reduce(np.int64(0), lambda x, y: x + y).numpy())

5
10
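
The two results above follow directly from the definition of reduce(). As a plain-Python mental model (the helper name is ours, and this is not how TensorFlow implements it), the method behaves like the loop below:

```python
def reduce_dataset(elements, initial_state, reduce_func):
    """Fold reduce_func over the elements, carrying a running state."""
    state = initial_state
    for element in elements:
        state = reduce_func(state, element)
    return state

# Ignoring the element and adding 1 counts the elements.
count = reduce_dataset(range(5), 0, lambda state, _: state + 1)
# Adding the element to the state sums the elements.
total = reduce_dataset(range(5), 0, lambda state, value: state + value)
print(count, total)  # 5 10
```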


The repeat(_c_) (with the argument _c_) method repeats the underlying dataset so each original value is seen _c_ times:

In [116]:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
dataset = dataset.repeat(3) 
list(dataset.as_numpy_iterator())

[1, 2, 3, 1, 2, 3, 1, 2, 3]

An interesting function within this module is the shard(_num\_shards_, _index_) method, which creates a 'Dataset' object that includes only 1/_num\_shards_ of this dataset. The sharding is deterministic: the 'Dataset' produced by $A.shard(n, i)$ contains all elements of $A$ whose position $k$ in the dataset satisfies $k \bmod n = i$. This is easier to see in the examples below. The operator is very useful when running distributed training, as it allows each worker to read a unique subset:

In [120]:
A = tf.data.Dataset.range(10) # [0, 1,2,3,4,5,6,7,8,9]

B = A.shard(num_shards=3, index=0) # the original meaning of the word 'shard' means a piece or fragment of a brittle substance (e.g. shards of glass)
print(list(B.as_numpy_iterator()))

C = A.shard(num_shards=3, index=1)
print(list(C.as_numpy_iterator()))

D = A.shard(num_shards=3, index=2)
print(list(D.as_numpy_iterator()))

[0, 3, 6, 9]
[1, 4, 7]
[2, 5, 8]
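
The modulo rule behind these outputs can be written out in plain Python. The helper below is only a model of the selection rule on an in-memory list (its name is our own):

```python
def shard(elements, num_shards, index):
    """Keep the elements whose position k satisfies k % num_shards == index."""
    return [x for k, x in enumerate(elements) if k % num_shards == index]

data = list(range(10))
shards = [shard(data, 3, i) for i in range(3)]
print(shards)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

The shards are disjoint and together cover the whole dataset, which is the property that lets each worker read a unique subset.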


A related function is shuffle(_buffer\_size_, _seed_=None, _reshuffle\_each\_iteration_=None), which randomly shuffles the elements of the underlying dataset. This dataset fills a buffer with _buffer\_size_ elements, then randomly samples elements from this buffer, replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required. The argument _reshuffle\_each\_iteration_ controls whether the shuffle order should be different for each epoch. 

For instance, if your dataset contains 10000 elements but _buffer\_size_ is set to 1000, then shuffle will initially select a random element from only the first 1000 elements in the buffer. Once an element is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the 1000 element buffer.
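
The buffer mechanism just described can be modelled in plain Python. This is a sketch of the idea only, assuming a buffer size of at least 1; the function name is our own, and the random choices will not match TensorFlow's:

```python
import random

def buffered_shuffle(elements, buffer_size, seed=None):
    """Shuffle by sampling from a sliding buffer of buffer_size elements."""
    rng = random.Random(seed)
    source = iter(elements)
    buffer = []
    # Fill the buffer with the first buffer_size elements.
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    output = []
    # Emit a random buffered element and put the next input in its slot.
    for item in source:
        pos = rng.randrange(len(buffer))
        output.append(buffer[pos])
        buffer[pos] = item
    # The input is exhausted: drain what is left in random order.
    while buffer:
        output.append(buffer.pop(rng.randrange(len(buffer))))
    return output

shuffled = buffered_shuffle(range(10), buffer_size=3, seed=0)
print(shuffled)
```

With a buffer of 3, the first output can only be 0, 1 or 2, so the result is only partially shuffled. A buffer as large as the dataset gives a full shuffle, and a buffer of 1 leaves the order unchanged.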

Below let's see 2 examples for comparison:



In [131]:
dataset = tf.data.Dataset.range(3) # [0, 1, 2]
dataset = dataset.shuffle(3, reshuffle_each_iteration=True)
print(list(dataset.as_numpy_iterator())) 
dataset = dataset.repeat(4)
print(list(dataset.as_numpy_iterator()), "\n") 

dataset = tf.data.Dataset.range(5)
dataset = dataset.shuffle(6, reshuffle_each_iteration=False)
print(list(dataset.as_numpy_iterator())) 
dataset = dataset.repeat(3)
print(list(dataset.as_numpy_iterator())) 

[1, 2, 0]
[0, 1, 2, 0, 2, 1, 1, 0, 2, 1, 0, 2] 

[0, 1, 3, 4, 2]
[0, 1, 3, 4, 2, 0, 1, 3, 4, 2, 0, 1, 3, 4, 2]


Let's introduce another method: skip(_count_) creates a 'Dataset' object that skips the first _count_ elements of the underlying dataset. Here is an example:

In [133]:
dataset = tf.data.Dataset.range(10)
dataset = dataset.skip(6)
list(dataset.as_numpy_iterator())

[6, 7, 8, 9]

Another useful control method is take_while(_predicate_), which stops dataset iteration as soon as the _predicate_ evaluates to false. The _predicate_ cannot be an arbitrary Python condition: it has to be a function that maps a dataset element (a nested structure of tensors) to a scalar boolean tensor. See the example below:

In [135]:
dataset = tf.data.Dataset.range(10)
dataset = dataset.take_while(lambda x: x < 6)
list(dataset.as_numpy_iterator())

[0, 1, 2, 3, 4, 5]

The unbatch() method is another useful function to unpack datasets. This function splits elements of a dataset into multiple elements. For example, if elements of the dataset are shaped [B, a0, a1, ...], where B may vary for each input element, then for each element in the dataset, the unbatched dataset will contain B consecutive elements of shape [a0, a1, ...]. Below is an example:

In [137]:
elements = [ [1, 2, 3], [8, 9], [11, 24, 345, 40] ]
dataset = tf.data.Dataset.from_generator(lambda: elements, tf.int64)
print(list(dataset.as_numpy_iterator()))
dataset = dataset.unbatch()
print(list(dataset.as_numpy_iterator()))

[array([1, 2, 3], dtype=int64), array([8, 9], dtype=int64), array([ 11,  24, 345,  40], dtype=int64)]
[1, 2, 3, 8, 9, 11, 24, 345, 40]


To remove duplicates, we can use the unique() method:

In [138]:
dataset = tf.data.Dataset.from_tensor_slices([1, 37, 2, 37, 2, 1])
dataset = dataset.unique()
sorted(list(dataset.as_numpy_iterator()))

[1, 2, 37]

The last powerful function in the module is the window() function. This function returns a dataset of "windows". Each "window" is a dataset that contains a subset of elements of the input dataset. These are finite datasets of size _size_ (or possibly fewer if there are not enough input elements to fill the window and _drop\_remainder_ evaluates to False). The syntax of the method is the following:

   - window(_size_, _shift_=None, _stride_=1, _drop\_remainder_=False)

In [28]:
dataset = tf.data.Dataset.range(7).window(3)
for window in dataset:
    print(window)

print('-------')

for window in dataset:
    print([item.numpy() for item in window])

<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
-------
[0, 1, 2]
[3, 4, 5]
[6]


The _shift_ argument determines the number of input elements to shift between the start of each window. If windows and elements are both numbered starting at 0, the first element in window $k$ will be element $k \times \mathit{shift}$ of the input dataset. In particular, the first element of the first window will always be the first element of the input dataset. The _stride_ argument determines the stride between input elements within a window. Let's see 2 examples below:

In [139]:
dataset = tf.data.Dataset.range(7).window(3, shift=1, drop_remainder=True)
for window in dataset:
    print(list(window.as_numpy_iterator()))

print('-----------')

dataset = tf.data.Dataset.range(7).window(3, shift=1, stride=2, drop_remainder=True)
for window in dataset:
    print(list(window.as_numpy_iterator()))

[0, 1, 2]
[1, 2, 3]
[2, 3, 4]
[3, 4, 5]
[4, 5, 6]
-----------
[0, 2, 4]
[1, 3, 5]
[2, 4, 6]
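
The roles of _size_, _shift_ and _stride_ can be summarised with list slicing. The helper below is a plain-Python model that reproduces the outputs shown in this section; it is not TensorFlow's implementation, and we have not checked it against TensorFlow for a _stride_ greater than 1 combined with _drop\_remainder_=False:

```python
def window(elements, size, shift=None, stride=1, drop_remainder=False):
    """Window k starts at position k * shift and takes every stride-th element."""
    shift = size if shift is None else shift  # by default windows do not overlap
    windows = []
    for start in range(0, len(elements), shift):
        win = elements[start:start + size * stride:stride]
        if len(win) < size and drop_remainder:
            break  # all later windows would be incomplete as well
        windows.append(win)
    return windows

data = list(range(7))
plain = window(data, 3)
sliding = window(data, 3, shift=1, drop_remainder=True)
strided = window(data, 3, shift=1, stride=2, drop_remainder=True)
print(plain)    # [[0, 1, 2], [3, 4, 5], [6]]
print(sliding)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
print(strided)  # [[0, 2, 4], [1, 3, 5], [2, 4, 6]]
```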


When the window() transformation is applied to a dataset whose elements are nested structures, it produces a dataset where the elements have the same nested structure but each leaf is replaced by a window. In other words, the nesting is applied outside of the windows as opposed to inside of them.

Applying window() to a 'Dataset' of tuples gives a tuple of windows:

In [30]:
dataset = tf.data.Dataset.from_tensor_slices(([1, 2, 3, 4, 5],
                                              [6, 7, 8, 9, 10]))
dataset = dataset.window(2)
windows = next(iter(dataset))
windows

(<_VariantDataset shapes: (), types: tf.int32>,
 <_VariantDataset shapes: (), types: tf.int32>)

In [31]:
def to_numpy(ds):
    return list(ds.as_numpy_iterator())

for windows in dataset:
    print(to_numpy(windows[0]), to_numpy(windows[1]))

[1, 2] [6, 7]
[3, 4] [8, 9]
[5] [10]


Applying window() to a 'Dataset' of dictionaries gives a dictionary of 'Datasets':

In [32]:
dataset = tf.data.Dataset.from_tensor_slices({'a': [1, 2, 3],
                                              'b': [4, 5, 6],
                                              'c': [7, 8, 9]})
dataset = dataset.window(2)
def to_numpy(ds):
    return list(ds.as_numpy_iterator())

for windows in dataset:
    print(tf.nest.map_structure(to_numpy, windows))

{'a': [1, 2], 'b': [4, 5], 'c': [7, 8]}
{'a': [3], 'b': [6], 'c': [9]}


When using the window() function, we often pair it with the flat_map() method. The flat_map() method can be used to flatten a dataset of windows into a single dataset. The argument to flat_map() is a function that takes an element from the dataset and returns a 'Dataset' object. The method flat_map() chains together the resulting datasets sequentially.

For example, to turn each window into a dense tensor, we can do the following:

In [140]:
size = 3
dataset = tf.data.Dataset.range(7).window(size, shift=1, drop_remainder=True)
batched = dataset.flat_map(lambda x: x.batch(size))  # batching a window by its own size turns it into one dense tensor
for batch in batched:
    print(batch.numpy())

[0 1 2]
[1 2 3]
[2 3 4]
[3 4 5]
[4 5 6]


### II. An Illustrative Example of Word Embedding from TensorFlow - Classifying Movie Reviews

Now that we have studied the 'tf.data.Dataset' API, let's work through a word embedding example. This example comes from the official TensorFlow documentation. We will use the well-known "Large Movie Review Dataset" throughout the tutorial. We will train a sentiment classifier model on this dataset and in the process learn embeddings from scratch. The data contains the text of 50,000 movie reviews from the "Internet Movie Database". These are split into 25,000 reviews for training and 25,000 reviews for testing. The following code downloads the data and extracts a folder structure that contains the training and test folders:

In [3]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
print(dataset_dir)
os.listdir(dataset_dir)

.\aclImdb


['imdb.vocab', 'imdbEr.txt', 'README', 'test', 'train']

In [5]:
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)

['labeledBow.feat',
 'neg',
 'pos',
 'unsup',
 'unsupBow.feat',
 'urls_neg.txt',
 'urls_pos.txt',
 'urls_unsup.txt']

In [6]:
remove_dir = os.path.join(train_dir, 'unsup') # removing unnecessary stuff
shutil.rmtree(remove_dir)

Now that we have the folder structure, we can take a closer look at it. The 'train' directory has 'pos' and 'neg' folders with movie reviews labelled as positive and negative respectively. We will use reviews from both folders to train a binary classification model. The 'train' directory also contains an additional folder ('unsup') that has to be removed before creating the training dataset, which is what the cell above does.

The "aclImdb/train/pos" and "aclImdb/train/neg" directories contain many text files, each of which is a single movie review. Let's take a look at one of them:

In [11]:
sample_file = os.path.join(train_dir, 'pos/1181_9.txt')
with open(sample_file) as f:
    print(f.read())

Rachel Griffiths writes and directs this award winning short film. A heartwarming story about coping with grief and cherishing the memory of those we've loved and lost. Although, only 15 minutes long, Griffiths manages to capture so much emotion and truth onto film in the short space of time. Bud Tingwell gives a touching performance as Will, a widower struggling to cope with his wife's death. Will is confronted by the harsh reality of loneliness and helplessness as he proceeds to take care of Ruth's pet cow, Tulip. The film displays the grief and responsibility one feels for those they have loved and lost. Good cinematography, great direction, and superbly acted. It will bring tears to all those who have lost a loved one, and survived.


Now let's create the "tf.data.Dataset" object using the tf.keras.preprocessing.text_dataset_from_directory() utility. This utility requires a directory structure that looks like the following:

        main_directory/
        ...class_a/
        ......a_text_1.txt
        ......a_text_2.txt
        ......
        ...class_b/
        ......b_text_1.txt
        ......b_text_2.txt
        ......

If the directory is set up in the way above, then calling tf.keras.preprocessing.text_dataset_from_directory(_main\_directory_, _labels_='inferred') will return a tf.data.Dataset object that yields batches of texts from the subdirectories _class\_a_ and _class\_b_, together with labels 0 and 1 (0 corresponding to _class\_a_ and 1 corresponding to _class\_b_):

In [13]:
batch_size = 1024
seed = 123
train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2, # 20 percent for validation 
    subset='training', seed=seed)
val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2,
    subset='validation', seed=seed)

print(type(train_ds))
print(type(val_ds))

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
<class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>
<class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>
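
To make the label inference concrete, here is a minimal plain-Python sketch of the idea. It is not TensorFlow's implementation: the helper name list_labeled_files and the two tiny review files are hypothetical, and the sketch only assumes that class folders are sorted by name before being numbered, so that 'neg' maps to 0 and 'pos' maps to 1.

```python
import os
import tempfile

def list_labeled_files(main_directory):
    """Return class names and (path, label) pairs inferred from subfolder names."""
    class_names = sorted(
        name for name in os.listdir(main_directory)
        if os.path.isdir(os.path.join(main_directory, name))
    )
    samples = []
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(main_directory, class_name)
        for file_name in sorted(os.listdir(class_dir)):
            samples.append((os.path.join(class_dir, file_name), label))
    return class_names, samples

# Hypothetical miniature dataset, created in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    for class_name, review in [("neg", "dull and slow"), ("pos", "a real gem")]:
        os.makedirs(os.path.join(root, class_name))
        file_path = os.path.join(root, class_name, "0_1.txt")
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(review)
    class_names, samples = list_labeled_files(root)
    labels = [label for _, label in samples]

print(class_names)  # ['neg', 'pos']
print(labels)       # [0, 1]
```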


Let's print 3 records (both texts and labels) from each of the first 2 batches as a sanity check:

In [20]:
for text_batch, label_batch in train_ds.take(2): # take(2) creates a 'Dataset' object with at most 2 elements (here, 2 batches)
  for i in range(3):
    print(label_batch[i].numpy(), text_batch.numpy()[i], "\n")

0 b'Noel Coward,a witty and urbane man,was friends with Louis Mountbatten.Mr Coward,a long-time admirer of all things naval,was commissioned to write a story loosely based on the loss of Mountbatten\'s ship.In a peculiarly British way it was considered that a film about the Royal Navy losing an encounter at sea would be good propaganda.It was also considered a good idea to have Mr Coward play the part of the ship\'s captain.Amang the many qualities needed to command a fighting ship,the ability to speak in a very clipped voice and sing sophisticated "point" songs does not come very high up the list at Admiralty House,or at least one would hope not.A captain must earn and retain the respect of the wardroom and the lower deck alike. Mr Coward might have had the respect of the gentlemen of the chorus at Drury Lane and Binkie Beaumont might have been terrified of him but his ability to tame,mould and direct a ship\'s crew in wartime must be brought into question.He folds himself languorousl

Before we proceed to the next step, let's configure the dataset for performance, since large datasets like this one are very common in NLP tasks. There are two important methods we can use when loading data to make sure that I/O does not become blocking:

   - The method cache() keeps data in memory after it's loaded off disk. This will ensure the dataset does not become a bottleneck while training our model. If our dataset is too large to fit into memory, we can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.
   - The method prefetch() overlaps data preprocessing and model execution while training.  The tf.data.Dataset.prefetch() transformation can be used to decouple the time when data is produced from the time when data is consumed. In particular, the transformation uses a background thread and an internal buffer to 'prefetch' elements from the input dataset ahead of the time they are requested. The number of elements to prefetch should be equal to (or possibly greater than) the number of batches consumed by a single training step. We could either manually tune this value, or set it to 'tf.data.AUTOTUNE', which will prompt the 'tf.data' runtime to tune the value dynamically at runtime.

The data performance guide https://www.tensorflow.org/guide/data_performance has more information related to these methods. Here we will do auto-tuning:

In [21]:
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
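
To build intuition for what prefetch() does, the following is a minimal plain-Python sketch of the idea described above: a background thread fills a bounded buffer while the consumer reads from it. This is only an illustration under simplifying assumptions, not how 'tf.data' is implemented; the function name prefetch and the toy batches are hypothetical.

```python
import queue
import threading

def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, producing them ahead of time in a background thread."""
    buffer = queue.Queue(maxsize=buffer_size)  # bounded internal buffer
    sentinel = object()                        # marks the end of the data

    def producer():
        for item in iterable:
            buffer.put(item)   # blocks while the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()    # blocks only if nothing has been prefetched yet
        if item is sentinel:
            return
        yield item

batches = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
prefetched = list(prefetch(batches, buffer_size=2))
print(prefetched)
```

The consumer sees the same elements in the same order; only the timing of production changes, which is why prefetching can be added without altering the training results.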

Next, let's define the dataset preprocessing steps required for our sentiment classification model. We can initialize a TextVectorization() layer with the desired parameters to vectorize movie reviews. Recall that text data standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset. Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words, by splitting on whitespace). Vectorization refers to converting tokens into numbers so they can be fed into a neural network. All of these tasks can be accomplished with this layer.

As we shall see below, the reviews contain various HTML tags like '<br />'. These tags will not be removed by the default standardizer in the TextVectorization() layer (which converts text to lowercase and strips punctuation by default, but doesn't strip HTML). We thus write a custom standardization function to remove the HTML. The syntax for the layer is the following:

   - tf.keras.layers.TextVectorization(_max\_tokens_=None, _standardize_='lower\_and\_strip\_punctuation', _split_='whitespace', _ngrams_=None, _output\_mode_='int', _output\_sequence\_length_=None, _pad\_to\_max\_tokens_=False, _vocabulary_=None, \*\*_kwargs_)

This tf.keras.layers.TextVectorization() layer has basic options for managing texts in a Keras model. It transforms a batch of strings (i.e., one example = one string) into either a list of token indices (i.e., one example = 1D tensor of integer token indices) or a dense representation (i.e., one example = 1D tensor of float values representing data about the example's tokens).

If desired, the user can call this layer's adapt() method on a dataset. When this layer is adapted, it will analyze the dataset, determine the frequency of individual string values, and create a 'vocabulary' from them. This vocabulary can have unlimited size or be capped, depending on the configuration options for this layer; if there are more unique values in the input than the maximum vocabulary size, the most frequent terms will be used to create the vocabulary. The processing of each example consists of the following steps:
  
   - standardize each example (usually lowercasing + punctuation stripping);
   - split each example into substrings (usually words);
   - recombine substrings into tokens (usually ngrams);
   - index tokens (associate a unique int value with each token);
   - transform each example using this index, either into a vector of integers or a dense float vector.

The adapt() method fits the state of the preprocessing layer to the data being passed. After calling adapt() on a layer, a preprocessing layer's state will not update during training. More details of this topic can be found here:

   - https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization. 
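
The processing steps listed above can be sketched in plain Python. This is a simplified illustration, not the layer's actual implementation: it skips the n-gram step, the function names are hypothetical, and breaking frequency ties alphabetically is an assumption made here only to keep the result deterministic. As in the layer's 'int' mode, index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.

```python
import re
import string
from collections import Counter

PAD, UNK = "", "[UNK]"

def standardize(text):
    """Lowercase the text and strip punctuation."""
    text = text.lower()
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)

def build_vocabulary(texts, max_tokens=None):
    """Mimic adapt(): count tokens and keep the most frequent ones."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Most frequent first; alphabetical tie-break is an assumption of this sketch.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    vocab = [PAD, UNK] + ordered
    return vocab[:max_tokens] if max_tokens else vocab

def vectorize(text, vocab, sequence_length):
    """Map tokens to indices, then truncate or zero-pad to a fixed length."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

corpus = ["The movie was great, great fun!", "The plot was thin."]
vocab = build_vocabulary(corpus, max_tokens=6)
encoded = vectorize("The movie was awful", vocab, sequence_length=6)
print(vocab)    # ['', '[UNK]', 'great', 'the', 'was', 'fun']
print(encoded)  # [3, 1, 4, 1, 0, 0]
```

Note that "movie" appears in the corpus but falls outside the capped vocabulary, so it is mapped to the out-of-vocabulary index 1, just like the unseen word "awful".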

Let's look at an illustrative example before we continue with our main example. Suppose we have toy data derived from the list of strings ["foo", "bar", "baz", "devil"], and we want to create a TextVectorization() layer and then build a model for prediction. Here is how the layer works:


In [39]:
example = tf.data.Dataset.from_tensor_slices(["foo", "bar", "baz", "devil"])
max_features = 500  # maximum vocab size
max_len = 7  # sequence length to pad the outputs to (if the number is large then there will be a lot of zero-paddings)

vl = tf.keras.layers.TextVectorization(max_tokens=max_features, output_mode='int', output_sequence_length=max_len) # creating the TextVectorization() layer
print('before adapt() method applied:', vl.get_vocabulary())
vl.adapt(example.batch(64)) # calling adapt() on the text-only dataset to create the vocabulary (no need to batch, but for large datasets this means we are not keeping spare copies of the dataset)
print('after adapt() method applied:', vl.get_vocabulary())

model_e = tf.keras.models.Sequential() # creating the model that uses the vectorization text layer

model_e.add(tf.keras.Input(shape=(1,), dtype=tf.string)) # starting with an explicit input layer of shape (1,), as we need to guarantee there is exactly one string input per example
model_e.add(vl) # the  first layer in our model is the vectorization layer, and after this we have a tensor of shape (batch_size, max_len) containing vocab indices
print('Vocabulary size for the example (after adapt() is applied): {}'.format(len(vl.get_vocabulary()))) # 6

test_input = [["foo qux bar"], ["qux baz haha wth"]]
model_e.predict(test_input)

before adapt() method applied: ['', '[UNK]']
after adapt() method applied: ['', '[UNK]', 'foo', 'devil', 'baz', 'bar']
Vocabulary size for the example (after adapt() is applied): 6


array([[2, 1, 5, 0, 0, 0, 0],
       [1, 4, 1, 1, 0, 0, 0]], dtype=int64)

Now let's go back to our original example. Below, we use the text vectorization layer to normalize, split, and map strings to integers:

In [40]:
def custom_standardization(input_data): # creating a custom standardization function to strip HTML break tags '<br />'
   lowercase = tf.strings.lower(input_data)
   stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
   return tf.strings.regex_replace(stripped_html, '[%s]' % re.escape(string.punctuation), '')

vocab_size = 10000 # vocabulary size (maximum number of distinct tokens kept, including the padding and [UNK] tokens)
sequence_length = 100 # sequence length to pad or truncate the outputs to (if the number is large then there will be a lot of zero-paddings)

vectorize_layer = TextVectorization( 
    standardize=custom_standardization,
    max_tokens=vocab_size, # capping the vocabulary at the most frequent tokens
    output_mode='int',
    output_sequence_length=sequence_length) # fixing the output length, as not all samples are of the same length

text_ds = train_ds.map(lambda x, y: x) # making a text-only dataset (no labels) and call adapt() to build the vocabulary
vectorize_layer.adapt(text_ds)

In [41]:
print("1287 ---> ",vectorize_layer.get_vocabulary()[1287])
print(" 313 ---> ",vectorize_layer.get_vocabulary()[313])
print('Vocabulary size: {}'.format(len(vectorize_layer.get_vocabulary())))

1287 --->  fully
 313 --->  idea
Vocabulary size: 10000


Now let's use the Keras Sequential() API to define the sentiment classification model. In this case it is a "Continuous bag of words" style model:

   - The TextVectorization() layer transforms strings into vocabulary indices. We have already initialized 'vectorize_layer' as a TextVectorization() layer and built its vocabulary by calling adapt() on the training dataset (here, 'text_ds'). Now the 'vectorize_layer' can be used as the first layer of our end-to-end classification model, feeding transformed strings into the Embedding() layer later.
   - We will also need an embedding layer. The Embedding() layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: (batch, sequence, embedding).
   - The GlobalAveragePooling1D() layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.
   - The fixed-length output vector is piped through a fully-connected (dense) layer with 16 hidden units.
   - The last layer is densely connected with a single output node.

In [42]:
embedding_dim=16
epochs=15

model = Sequential([
    vectorize_layer,
    Embedding(vocab_size, embedding_dim, name="embedding"),
    GlobalAveragePooling1D(),
    Dense(16, activation='relu'),
    Dense(1)])
model.summary()

Model: "sequential_18"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization_18 (TextV (None, 100)               0         
_________________________________________________________________
embedding (Embedding)        (None, 100, 16)           160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
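
The parameter counts in the summary can be checked by hand. The short sketch below redoes the arithmetic in plain Python: the embedding layer stores one 16-dimensional vector per vocabulary index, and each dense layer has one weight per input-output pair plus one bias per output. The variable names are chosen here for illustration.

```python
vocab_size = 10000
embedding_dim = 16
hidden_units = 16

# One embedding vector of length 16 for each of the 10000 token indices.
embedding_params = vocab_size * embedding_dim
# Dense(16): a 16x16 weight matrix plus 16 biases.
hidden_params = embedding_dim * hidden_units + hidden_units
# Dense(1): 16 weights plus 1 bias.
output_params = hidden_units * 1 + 1

total_params = embedding_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)
# 160000 272 17 160289
```

The TextVectorization() and GlobalAveragePooling1D() layers contribute no trainable parameters, which matches the zeros in the summary.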


Again, let's break down what each layer is actually doing. We have discussed the TextVectorization() layer, so now let's talk about the embedding layer. Recall that word embeddings can be thought of as an alternative to one-hot encoding combined with dimensionality reduction. The Embedding() layer turns non-negative integers (indices) into dense vectors of fixed size. It is essentially a matrix that can be viewed as a transformation from our discrete and sparse one-hot vectors into a continuous and dense latent space. The embedding layer lets us convert each word into a vector of a fixed, predefined length. The resulting vector is dense, with real values instead of just 0's and 1's, and its fixed length lets us represent words more compactly and with fewer dimensions. In this way the embedding layer works like a lookup table: the words (more precisely, their indices) are the keys in this table, while the dense word vectors are the values.

The input of the Embedding() layer must be a 2D tensor with shape (_batch\_size_, _input\_length_); its output is a 3D tensor with shape (_batch\_size_, _input\_length_, _output\_dim_). Below is an example. The model takes as input an integer matrix of size (_batch_, _input\_length_), and the largest integer (i.e. word index) in the input should be no larger than 999, since the vocabulary size is 1000. The input dimension must be the size of the vocabulary, i.e. maximum integer index + 1:

In [43]:
my_vocab_size=1000 # 1000 as the vocab size
my_output_dim=64 # the output dimension is 64 (length of vector for each word)
my_input_length=10  # maximum length of sequence is 10

mymodel = tf.keras.Sequential()
mymodel.add(tf.keras.layers.Embedding(input_dim=my_vocab_size, output_dim=my_output_dim, input_length=my_input_length))
input_array = np.random.randint(1000, size=(32, 10)) # input_array.shape=(32, 10)
mymodel.compile('rmsprop', 'mse')
output_array = mymodel.predict(input_array)
print(output_array.shape)

(32, 10, 64)


We see that the output dimension is (32, 10, 64). Let's see another example:

In [45]:
mymodel2 = tf.keras.Sequential()
embedding_layer2 = tf.keras.layers.Embedding(input_dim=10,output_dim=4,input_length=2)
mymodel2.add(embedding_layer2)
mymodel2.compile('rmsprop', 'mse') # compiling once is enough; the output of predict() does not depend on the optimizer

input_array2 =  np.array([[1,2]])
output_array2 = mymodel2.predict(input_array2)
print(output_array2)

[[[-0.02265354  0.02642283 -0.00436708  0.03692282]
  [ 0.02681229 -0.04963363  0.00799357 -0.00078764]]]


As we see from the above example, each word (1 and 2) is represented by a vector of length 4. If we print the weights of the embedding layer, we get the results below. These weights are the vector representations of the words in the vocabulary: a lookup table of size $10 \times 4$, for words 0 to 9. The first word (0) is represented by the first row in this table, and rows 1 and 2 are exactly the two vectors returned by predict() above. Note that in this example we have not trained the embedding layer, so the weights are still at their random initial values:

In [46]:
embedding_layer2.get_weights()[0]

array([[-0.04000392, -0.01862346,  0.04262091,  0.04245141],
       [-0.02265354,  0.02642283, -0.00436708,  0.03692282],
       [ 0.02681229, -0.04963363,  0.00799357, -0.00078764],
       [-0.00102047, -0.01890696,  0.04803513,  0.04678296],
       [ 0.0353524 ,  0.00535844,  0.02512223,  0.00242698],
       [-0.03001713, -0.04524499,  0.01682706, -0.028548  ],
       [ 0.02637163, -0.02148254,  0.02442702, -0.0275311 ],
       [ 0.01020578,  0.04947097,  0.0418673 ,  0.02208019],
       [ 0.01632892,  0.01952398,  0.02846341,  0.02349358],
       [ 0.01891663,  0.00588686,  0.03651556,  0.03078783]],
      dtype=float32)
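
The lookup-table behaviour can be reproduced in a few lines of plain Python. The sketch below copies the first three rows of the weight matrix printed above and indexes into them; the function name embed is hypothetical, and the real layer performs the same kind of row lookup on tensors.

```python
# Rows 0, 1 and 2 of the weight matrix printed above.
embedding_table = [
    [-0.04000392, -0.01862346, 0.04262091, 0.04245141],
    [-0.02265354, 0.02642283, -0.00436708, 0.03692282],
    [0.02681229, -0.04963363, 0.00799357, -0.00078764],
]

def embed(batch_of_indices, table):
    """Replace every integer index with the corresponding row of the table."""
    return [[table[i] for i in sequence] for sequence in batch_of_indices]

output = embed([[1, 2]], embedding_table)
print(output)
```

The result has shape (1, 2, 4), and its two vectors are rows 1 and 2 of the table, which is the same output that predict() returned for the input [[1, 2]].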

Besides the Embedding() layer, we also see a class called GlobalAveragePooling1D(). This class performs the global average pooling operation for temporal data. For this example, the GlobalAveragePooling1D() layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible. 

The input of the GlobalAveragePooling1D() class must be a 3D tensor, and the output shape depends on the arguments. The syntax is:

   - tf.keras.layers.GlobalAveragePooling1D(_data\_format_='channels_last', _keepdims_=False, ** _kwargs_)

Specifically:

   - The argument _data\_format_ is one of 'channels\_last' (default) or 'channels\_first'. This really just dictates the ordering of the dimensions in the inputs. The 'channels\_last' option corresponds to inputs with shape (_batch_, _steps_, _features_) while 'channels\_first' corresponds to inputs with shape (_batch_, _features_, _steps_). Moreover, the argument _keepdims_ behaves the same way as in tf.reduce_mean() or np.mean():

   - if _keepdims_=False: the output would be a 2D tensor with shape (_batchsize_, _features_);
   - if _keepdims_=True and if _data\_format_='channels\_last': the output would be a 3D tensor with shape (_batchsize_, 1, _features_);
   - if _keepdims_=True and if _data\_format_='channels\_first': the output would be a 3D tensor with shape (_batchsize_, _features_, 1).

Let's see an example of using the GlobalAveragePooling1D() layer:

In [48]:
input_shape = (2, 3, 4) # 2 matrices, each 3-by-4
tf.random.set_seed(5) # fixing the seed for reproducibility
x = tf.random.normal(shape=input_shape, mean=0, stddev=1) 
print('x:\n', x, '\n')
y = tf.keras.layers.GlobalAveragePooling1D()(x) # the input must be a 3D tensor
print('y:\n', y, '\n') # (-0.18030666+1.3231523+2.4488697)/3=1.1972384 etc.
print('shape of y:', y.shape)

x:
 tf.Tensor(
[[[-0.18030666 -0.95028627 -0.03964049 -0.7425406 ]
  [ 1.3231523  -0.61854804  0.8540664  -0.08899953]
  [ 2.4488697   0.762508    1.2659615   0.9801489 ]]

 [[ 1.5293121  -0.57500345  0.8987044  -1.250801  ]
  [-0.8604956   1.260746   -0.6830498   0.02615766]
  [ 0.22328745  0.95914024 -0.37048063  0.03484769]]], shape=(2, 3, 4), dtype=float32) 

y:
 tf.Tensor(
[[ 1.1972384  -0.26877543  0.69346243  0.04953627]
 [ 0.297368    0.54829425 -0.05160867 -0.39659855]], shape=(2, 4), dtype=float32) 

shape of y: (2, 4)
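
The averaging itself is simple enough to write out by hand. The following plain-Python sketch assumes the 'channels_last' layout, i.e. a nested list of shape (batch, steps, features), and shows the effect of keepdims; the function name and the toy input are chosen here for illustration only.

```python
def global_average_pooling_1d(x, keepdims=False):
    """Average a (batch, steps, features) nested list over the steps axis."""
    pooled = []
    for example in x:
        steps = len(example)
        # zip(*example) groups the values of each feature across all steps.
        means = [sum(column) / steps for column in zip(*example)]
        pooled.append([means] if keepdims else means)
    return pooled

x = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
     [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]]  # shape (2, 3, 2)

pooled = global_average_pooling_1d(x)
pooled_keepdims = global_average_pooling_1d(x, keepdims=True)
print(pooled)           # [[3.0, 4.0], [2.0, 2.0]]          -> shape (2, 2)
print(pooled_keepdims)  # [[[3.0, 4.0]], [[2.0, 2.0]]]      -> shape (2, 1, 2)
```

Whatever the number of steps, each example is reduced to a single vector of length 'features', which is what lets the downstream dense layers work with a fixed-size input.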


Notice that this is different from the Flatten() layer, which simply flattens the input without affecting the batch size. Here is an example of Flatten() for the sake of comparison:

In [49]:
mymodel3 = tf.keras.Sequential()
mymodel3.add(tf.keras.layers.Conv2D(filters=64, kernel_size=3, strides=3, input_shape=(3, 32, 32)))
print(mymodel3.output_shape)

mymodel3.add(tf.keras.layers.Flatten())
print(mymodel3.output_shape)

(None, 1, 10, 64)
(None, 640)
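
The two shapes above can be derived by hand. The Conv2D() call uses 64 filters, a kernel size of 3 and a stride of 3, and with the default 'channels_last' format the input shape (3, 32, 32) is read as height 3, width 32 and 32 channels. The sketch below assumes the default 'valid' padding, for which the output size along one axis is floor((input - kernel) / stride) + 1; the helper name is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Output length along one axis for a convolution with 'valid' padding."""
    return (input_size - kernel_size) // stride + 1

height, width = 3, 32
filters, kernel_size, stride = 64, 3, 3

out_height = conv_output_size(height, kernel_size, stride)  # (3 - 3) // 3 + 1 = 1
out_width = conv_output_size(width, kernel_size, stride)    # (32 - 3) // 3 + 1 = 10
flattened = out_height * out_width * filters                # 1 * 10 * 64 = 640

print((out_height, out_width, filters), flattened)
# (1, 10, 64) 640
```

Flatten() therefore keeps all 640 values of each example, whereas a global pooling layer would reduce them to one value per feature.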


There are other layers related to GlobalAveragePooling1D(). For example, rather than taking the average, we can take the maximum, hence GlobalMaxPooling1D(). This class downsamples the input representation by taking the maximum value over the time dimension. Here is an illustrative example:

In [49]:
x = tf.constant([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])
print(x, "\n")
print('original shape of x:', x.shape)
x = tf.reshape(x, [3, 3, 1])
print(x, "\n")
print('shape of x after reshaping:', x.shape, "\n")

my_globalmaxpool_1d = tf.keras.layers.GlobalMaxPooling1D()
my_globalmaxpool_1d(x)

tf.Tensor(
[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]], shape=(3, 3), dtype=float32) 

original shape of x: (3, 3)
tf.Tensor(
[[[1.]
  [2.]
  [3.]]

 [[4.]
  [5.]
  [6.]]

 [[7.]
  [8.]
  [9.]]], shape=(3, 3, 1), dtype=float32) 

shape of x after reshaping: (3, 3, 1) 



<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[3.],
       [6.],
       [9.]], dtype=float32)>

We have now worked through enough illustrative examples to understand each part of the code. Let's get back to our main example and train the model on the movie review dataset:

In [50]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs,
    callbacks=[tensorboard_callback])

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x1fd0bbc6cc0>
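
A note on the loss used above: the final Dense(1) layer has no activation, so the model outputs a raw score (a logit) rather than a probability, which is why the loss is created with from_logits=True. The plain-Python sketch below illustrates the idea with hypothetical helper names: it computes binary cross-entropy directly from a logit using a common numerically stable formula and checks it against the naive route through the sigmoid. It is an illustration of the mathematics, not TensorFlow's code.

```python
import math

def sigmoid(z):
    """Convert a logit into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(y_true, logit):
    """Numerically stable binary cross-entropy computed directly from a logit."""
    return max(logit, 0.0) - logit * y_true + math.log1p(math.exp(-abs(logit)))

def bce_from_probability(y_true, p):
    """Textbook binary cross-entropy computed from a probability."""
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))

logit = 1.5  # hypothetical model output for a positive review
stable = bce_from_logit(1.0, logit)
naive = bce_from_probability(1.0, sigmoid(logit))
print(stable, naive)
```

Both routes give the same loss value up to floating-point error. To turn the model's outputs into class predictions, a logit greater than 0 corresponds to a probability greater than 0.5.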

In [51]:
model.summary()

Model: "sequential_18"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization_18 (TextV (None, 100)               0         
_________________________________________________________________
embedding (Embedding)        (None, 100, 16)           160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________


Let's look at the result in TensorBoard. TensorBoard is part of the TensorFlow ecosystem and provides the visualization and tooling needed for machine learning experimentation. It has many functions including:

   - tracking and visualizing metrics such as loss and accuracy
   - visualizing the model graph (ops and layers)
   - viewing histograms of weights, biases, or other tensors as they change over time
   - projecting embeddings to a lower dimensional space
   - displaying images, text, and audio data
   - profiling TensorFlow programs
   - and much more!

We will not give a systematic introduction to TensorBoard here, but we will give an intuitive overview of how it works. TensorFlow has very detailed documentation on this topic.

Once we create the TensorBoard, we will see a few tabs:

   - the **Scalars dashboard** shows how the loss and metrics change with every epoch. One can use it to also track training speed, learning rate, and other scalar values;
   - the **Graphs dashboard** helps us visualize our model. In this case, the Keras graph of layers is shown which can help us ensure it is built correctly;
   - the **Distributions dashboard** and **Histograms dashboard** show the distribution of a Tensor over time. This can be useful to visualize weights and biases and verify that they are changing in an expected way.

Additional TensorBoard plugins are automatically enabled when we log other types of data. For example, the Keras TensorBoard callback facility lets us log images and embeddings as well. We can see what other plugins are available in TensorBoard by clicking on the "inactive" dropdown towards the top right.

We can start TensorBoard from the command line or within a Jupyter notebook. The two interfaces are generally the same. In notebooks, we can use the %tensorboard line magic. On the command line, we should run the same command without "%". Let's go back to our example and launch TensorBoard:

In [52]:
%load_ext tensorboard
%tensorboard --logdir logs

ERROR: Failed to launch TensorBoard (exited with 1).
Contents of stderr:
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
2021-10-18 09:35:08.464466: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2021-10-18 09:35:08.466274: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "C:\Users\GAO\.conda\envs\gao_uat\Scripts\tensorboard-script.py", line 6, in <module>
    from tensorboard.main import run_main
  File "C:\Users\GAO\.conda\envs\gao_uat\lib\site-packages\tensorboard\main.py", line 40, in <module>
    from 

Next, let's retrieve the word embeddings learned during training. The embeddings are weights of the Embedding() layer in the model. The weights matrix is of shape (_vocab\_size_, _embedding\_dimension_).

We can obtain the weights from the model using get_layer() and get_weights(). In addition, the get_vocabulary() function provides the vocabulary to build a metadata file with one token per line:

In [54]:
weights = model.get_layer('embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

Let's write the weights to disk. To use the 'Embedding Projector', we will need to upload two files in tab-separated format: a file of vectors (containing the embeddings), and a file of metadata (containing the words):

In [55]:
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m: # both files are closed automatically on exit
  for index, word in enumerate(vocab):
    if index == 0:
      continue  # skip 0, it's padding.
    vec = weights[index]
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
    out_m.write(word + "\n")

We can visualize the embeddings in the TensorFlow Embedding Projector: http://projector.tensorflow.org/. To do so, open the 'Embedding Projector' (it can also run in a local TensorBoard instance) and click on "Load data". Then upload the two files we created above: vectors.tsv and metadata.tsv. The embeddings we have trained will now be displayed. We can search for words to find their closest neighbors. For example, if we search for "beautiful", we may see neighbors like "wonderful".
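
To see what "closest neighbors" means, here is a minimal plain-Python sketch of a nearest-neighbor search by cosine similarity, which is one common way to measure closeness between embedding vectors. The words and 3-dimensional vectors below are hypothetical values made up for illustration; in practice they would be parsed from the two TSV files written above, with one vector and one word per line.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(word, words, vectors, k=2):
    """Return the k words whose vectors are most similar to that of `word`."""
    query = vectors[words.index(word)]
    scored = [
        (cosine_similarity(query, vec), other)
        for other, vec in zip(words, vectors)
        if other != word
    ]
    return [other for _, other in sorted(scored, reverse=True)[:k]]

# Hypothetical embeddings, not taken from the trained model.
words = ["beautiful", "wonderful", "awful", "boring"]
vectors = [
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.2],
    [-0.7, -0.9, 0.1],
    [-0.8, -0.6, 0.3],
]

neighbors = nearest_neighbors("beautiful", words, vectors, k=1)
print(neighbors)  # ['wonderful']
```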

#### References:

   - https://www.tensorflow.org/text/guide/word_embeddings
   - https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text_dataset_from_directory
   - https://www.tensorflow.org/tutorials/keras/text_classification 
   - https://www.tensorflow.org/guide/data_performance 
   - https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization
   - https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D 
   - https://www.tensorflow.org/api_docs/python/tf/keras/layers/Flatten 
   - https://www.tensorflow.org/tensorboard/get_started 
   - https://medium.com/analytics-vidhya/understanding-embedding-layer-in-keras-bbe3ff1327ce 
   - https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/ 