In this notebook we will learn how to create and train an embedding layer for the words appearing in a text dataset. We will then train a simple DNN-based model to perform sentiment analysis on this data.

## Exercise

This is exercise 13.10 in [this](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) book.

### Problem Statement

In this exercise you will download a dataset, split it, create a tf.data.Dataset to load it and preprocess it efficiently, then build and train a binary classification model containing an Embedding layer:

  - a. Download the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/), which contains 50,000 movie reviews from the Internet Movie Database. The data is organized in two directories, train and test, each containing a pos subdirectory with 12,500 positive reviews and a neg subdirectory with 12,500 negative reviews. Each review is stored in a separate text file. There are other files and folders (including preprocessed bag-of-words), but we will ignore them in this exercise.
  
    
  - b. Split the test set into a validation set (15,000) and a test set (10,000).
  
  
  - c. Use tf.data to create an efficient dataset for each set.
  
  
  - d. Create a binary classification model, using a TextVectorization layer to preprocess each review. If the TextVectorization layer is not yet available (or if you like a challenge), try to create your own custom preprocessing layer: you can use the functions in the tf.strings package, for example lower() to make everything lowercase, regex_replace() to replace punctuation with spaces, and split() to split words on spaces. You should use a lookup table to output word indices, which must be prepared in the adapt() method.
  
  
  - e. Add an Embedding layer and compute the mean embedding for each review, multiplied by the square root of the number of words (see Chapter 16). This rescaled mean embedding can then be passed to the rest of your model.
  
  
  - f. Train the model and see what accuracy you get. Try to optimize your pipelines to make training as fast as possible.


  - g. Use TFDS to load the same dataset more easily: tfds.load("imdb_reviews").
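
Step b can be sketched in plain Python before any tf.data code is involved. The file names below are hypothetical placeholders for the 25,000 files in the test folder, and the helper split_paths() and its fixed seed are assumptions added for illustration only.

```python
import random

def split_paths(paths, n_valid, seed=42):
    """Shuffle a copy of `paths` and split it into (validation, test)."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_valid], shuffled[n_valid:]

# hypothetical stand-ins for the review files in aclImdb/test
test_paths = ['pos/{}.txt'.format(i) for i in range(12500)]
test_paths += ['neg/{}.txt'.format(i) for i in range(12500)]

valid_paths, new_test_paths = split_paths(test_paths, n_valid=15000)
print(len(valid_paths), len(new_test_paths))
```

Shuffling before slicing matters here: the paths are listed with all positive reviews first, so an unshuffled split would put only negative reviews in the new test set.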

In [1]:
import tensorflow as tf
import tensorflow.keras as keras
print('tensorflow version: {}'.format(tf.__version__))
print('keras version: {}'.format(keras.__version__))

tensorflow version: 2.1.0
keras version: 2.2.4-tf


In [2]:
import os
print('cwd: {}'.format(os.getcwd()))

cwd: /home/prarit/MachineLearningProjects/Word-Embeddings


### Downloading the Large Movie Review Dataset

In [3]:
# good tutorial on using wget: https://www.tecmint.com/download-and-extract-tar-files-with-one-command/
# turn off verbose output of wget using the flag -nv : https://shapeshed.com/unix-wget/#how-to-turn-off-verbose-output 
!wget -c -nv http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz -o - 

In [4]:
# uncompress the downloaded files
!tar xzf aclImdb_v1.tar.gz

Note that tensorflow also provides this dataset: https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb

### Briefly Explore the Dataset

In [5]:
# list the files in the current working directory
os.listdir()

['README.md',
 '.gitignore',
 'log_dir',
 '.ipynb_checkpoints',
 '.git',
 'aclImdb',
 'Word-Embeddings.ipynb',
 'aclImdb_v1.tar.gz']

We see that aclImdb_v1.tar.gz was extracted to a folder called aclImdb. Let's see the contents of this folder.

In [6]:
path = os.path.join(os.getcwd() , 'aclImdb')
contents = os.listdir(path)
print('The contents of aclImdb are: \n{}'.format(contents))

The contents of aclImdb are: 
['imdb.vocab', 'train', 'README', 'imdbEr.txt', 'test']


There is a README file in aclImdb; let us read it.

In [7]:
# read README
filepath = os.path.join(path, 'README')
with open(filepath, 'r') as f:
    print(f.read())

Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided. 

Dataset 

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning. 

In the entire collection, no more than 30 reviews are allowed for any
given movie because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain a disjoint set of
movies, so no significant performance is obtained by memorizing
movie-unique terms and their associated with observed labels.  In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a scor

From the README, we find that the train and test folders each contain a 'pos' folder for positive reviews and a 'neg' folder for negative reviews, along with .txt files containing the URLs of the positive and negative reviews respectively. There are also some other files, e.g. for bag-of-words features.

Each of the train and test folders contains a total of 25,000 labeled reviews, of which 12,500 are positive and 12,500 are negative.

Let's verify this for the 'train' folder.

In [8]:
train_path = os.path.join(path,'train')
print('contents of the train folder: \n{}'.format(os.listdir(train_path)))

contents of the train folder: 
['unsup', 'urls_pos.txt', 'pos', 'labeledBow.feat', 'unsupBow.feat', 'neg', 'urls_neg.txt', 'urls_unsup.txt']


In order to create a dataset, we need the paths to all the reviews. We can create the corresponding list of paths using glob.

In [9]:
import glob

In [10]:
# paths to the positive reviews in the training set
train_pos_path = os.path.join(train_path, 'pos', '*.txt')
train_pos_reviews = glob.glob(train_pos_path)
print('No. of train-set files with positive reviews: {}'.format(len(train_pos_reviews)))

No. of train-set files with positive reviews: 12500


In [11]:
# paths to the negative reviews in the training set
train_neg_path = os.path.join(train_path, 'neg', '*.txt')
train_neg_reviews = glob.glob(train_neg_path)
print('No. of train-set files with negative reviews: {}'.format(len(train_neg_reviews)))

No. of train-set files with negative reviews: 12500


Let us take a brief look at a positive review.

In [12]:
file = train_pos_reviews[0]
with open(file, 'r') as f:
    print(f.read())

I watch them all.<br /><br />It's not better than the amazing ones (_Strictly Ballroom_, _Shall we dance?_ (Japanese version), but it's completely respectable and pleasingly different in parts.<br /><br />I am an English teacher and I find some of the ignorance about language in some of these reviews rather upsetting. For example: the "name should scream don't watch. 'How she move.' Since when can movie titles ignore grammar?" <br /><br />There is nothing inherently incorrect about Caribbean English grammar. It's just not Canadian standard English grammar. Comments about the dialogue seem off to me. I put on the subtitles because I'm a Canadian standard English speaker, so I just AUTOMATICALLY assumed that I would have trouble understanding all of it. It wasn't all that difficult and it gave a distinctly different flavour as the other step movies I have seen were so American.<br /><br />I loved that this movie was set in Toronto and, in fact, wish it was even more clearly set there. I 

Later, we would like to load and preprocess all the data using tensorflow's data API. Therefore, let us quickly see how to read the same file as above, this time using tensorflow's [TextLineDataset](https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset) class.

In [13]:
pos_fl0 = tf.data.TextLineDataset(file)
for item in pos_fl0:
    print(item)

tf.Tensor(b'I watch them all.<br /><br />It\'s not better than the amazing ones (_Strictly Ballroom_, _Shall we dance?_ (Japanese version), but it\'s completely respectable and pleasingly different in parts.<br /><br />I am an English teacher and I find some of the ignorance about language in some of these reviews rather upsetting. For example: the "name should scream don\'t watch. \'How she move.\' Since when can movie titles ignore grammar?" <br /><br />There is nothing inherently incorrect about Caribbean English grammar. It\'s just not Canadian standard English grammar. Comments about the dialogue seem off to me. I put on the subtitles because I\'m a Canadian standard English speaker, so I just AUTOMATICALLY assumed that I would have trouble understanding all of it. It wasn\'t all that difficult and it gave a distinctly different flavour as the other step movies I have seen were so American.<br /><br />I loved that this movie was set in Toronto and, in fact, wish it was even more c

Perfect! As expected, pos_fl0 contains a single item, and its value matches the output of the previous code cell.

### Preprocessing the reviews

Having learnt how to use tf.data.TextLineDataset, we can now start preprocessing the data. Notice that the review contains punctuation marks, HTML line-break tags, etc. We will have to write a preprocessing function to get rid of these. Additionally, we will convert all letters to lower case.

#### Removing line-break tags

For ordinary Python strings this is easily done with the .replace() method, which replaces all occurrences of the line-break tag with a space. In tensorflow, the equivalent method is [tf.strings.regex_replace()](https://www.tensorflow.org/api_docs/python/tf/strings/regex_replace).

#### tf.strings.regex_replace
Note that 'regex' in regex_replace() stands for ["regular expression"](https://docs.python.org/3/howto/regex.html). For example, the following will work:

In [14]:
tf.strings.regex_replace('hello','e','E')

<tf.Tensor: shape=(), dtype=string, numpy=b'hEllo'>

But the following will throw an error: tf.strings.regex_replace('h(llo', '(','E')

This is because "(" is a metacharacter. To match and replace metacharacters, we must prepend a backslash to them. This can be done as follows: '\\\\' + char. 

With the backslash in place, the failing call above can be made to work:

In [15]:
# escaping the metacharacter '(' with a backslash makes the pattern valid
tf.strings.regex_replace('h(llo', '\\'+'(','E')

<tf.Tensor: shape=(), dtype=string, numpy=b'hEllo'>

or equivalently:

In [16]:
tf.strings.regex_replace('h(llo', r'\(', 'E') # raw string, so the backslash reaches the regex engine as-is

<tf.Tensor: shape=(), dtype=string, numpy=b'hEllo'>

##### Important:

Note that the backslash itself is also a metacharacter. To search for and replace a backslash, we do the following:

In [17]:
tf.strings.regex_replace('h\\llo', '\\'+'\\', 'E')

<tf.Tensor: shape=(), dtype=string, numpy=b'hEllo'>

or equivalently:

In [18]:
# the regex engine needs an escaped backslash, so the python string is '\\\\' and NOT '\\'
tf.strings.regex_replace('h\\llo', '\\\\', 'E')

<tf.Tensor: shape=(), dtype=string, numpy=b'hEllo'>
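
The same escaping rules can be tried out with Python's own re module, which is convenient for quick experiments outside tensorflow. This is only an analogy: tf.strings.regex_replace() uses RE2 syntax rather than Python's re, although the two agree on the simple cases shown here. The variable names are made up for this sketch.

```python
import re

# re.escape() prepends the backslash for us
escaped_paren = re.escape('(')                     # the two characters \(
paren_result = re.sub(escaped_paren, 'E', 'h(llo')

# a literal backslash: the python string '\\\\' holds two backslash characters,
# which the regex engine reads as one escaped backslash
backslash_result = re.sub('\\\\', 'E', 'h\\llo')

# an unescaped '(' opens a group that is never closed, so the pattern is rejected
try:
    re.sub('(', 'E', 'h(llo')
    unescaped_ok = True
except re.error:
    unescaped_ok = False

print(paren_result, backslash_result, unescaped_ok)
```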

We now try four different ways of removing punctuation from a tensorflow string and compare their timings:

1) punc_filter_and_to_lower1: use tf.strings.unicode_decode() to convert the characters of the string into an array of their Unicode code points (for ASCII characters these coincide with the ASCII codes). We then loop through this array, skipping the entries that match the code point of a punctuation character. Finally, we call tf.strings.unicode_encode() on the result to convert the code points back to characters, thereby obtaining a string with all punctuation stripped. 


2) punc_filter_and_to_lower2: use tf.strings.regex_replace() to search for each punctuation character and replace it with a space. Take special care to prepend a backslash so that metacharacters can be used in regex_replace().


3) punc_filter_and_to_lower3: simply extract the python string using the tensor's .numpy() method. Then iterate through the characters of the string, skipping the punctuation, and join the resulting list of characters using the .join() method. 

4) punc_filter_and_to_lower4: use [tfds.features.text.Tokenizer()](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/Tokenizer) with its "alphanum_only" argument set to True (the default). This way it keeps only the alphanumeric characters in the text and splits the text at occurrences of non-alphanumeric characters. Since whitespace is a non-alphanumeric character, the output will largely consist of a list of the words in the text. The caveat with this approach is that words containing an apostrophe, such as " don't ", will be split into two words: "don" and "t". Also, [tfds.features.text.Tokenizer()](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/Tokenizer) does NOT consider underscores to be non-alphanumeric, so underscores do not get removed from the text. On the other hand, since the tokenizer already generates a list of words, we can use this list to build a vocabulary of the words in the dataset in this same step, making the preprocessing faster.

Their timings on the first review in the training set were as follows:

1) punc_filter_and_to_lower1: Wall time: ~ 3 s

2) punc_filter_and_to_lower2: Wall time: ~ 8 ms

3) punc_filter_and_to_lower3: Wall time: ~ 6 ms

4) punc_filter_and_to_lower4: Wall time: ~ 6 ms (when not updating a vocabulary of words) 

5) punc_filter_and_to_lower4: Wall time: ~ 7 ms (when also updating a vocabulary of words) 

Clearly, the first function is extremely slow (it takes several seconds), while the other three all finish within a few milliseconds of each other. These are single-run wall times, so differences of a millisecond or two between the fast functions are not significant. 
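
A single wall-time measurement is noisy at the millisecond scale, so it can help to repeat the measurement. The following is a minimal sketch using the standard timeit module; strip_punctuation() is a hypothetical pure-python stand-in for the functions timed in this notebook, not one of them.

```python
import string
import timeit

def strip_punctuation(text):
    """Pure-python stand-in for the function being timed (hypothetical helper)."""
    return ''.join(ch for ch in text if ch not in string.punctuation).lower()

sample = "I watch them all.<br /><br />It's not better than the amazing ones."

n_calls = 1000
# five independent measurements of n_calls calls each; the minimum is the
# one least disturbed by other activity on the machine
timings = timeit.repeat(lambda: strip_punctuation(sample), repeat=5, number=n_calls)
best_ms = min(timings) / n_calls * 1e3
print('best of 5: {:.4f} ms per call'.format(best_ms))
```

Inside a notebook, IPython's %timeit magic performs the same kind of repeated measurement.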

In [19]:
import string

In [20]:
# list of punctuations
punc_ls = string.punctuation
print('punctuation characters: {}'.format(punc_ls))
punc_ls2 = tf.strings.unicode_decode(punc_ls, input_encoding = 'UTF-8')
print('punctuation code points after UTF-8 decoding: \n{}'.format(punc_ls2))

punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
punctuation code points after UTF-8 decoding: 
[ 33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  58  59  60
  61  62  63  64  91  92  93  94  95  96 123 124 125 126]


In [21]:
# function to get rid of punctuations, html line breaks and change to lower case
# using tf.strings.unicode_encode(), tf.strings.unicode_decode()
def punc_filter_and_to_lower1(st):
    
    # before removing punctuations, we should remove the html line-break tag
    # this is because the line break tag contains <,/ and > characters which 
    # will be removed if we remove punctuations first. This will then make it harder 
    # to identify the line-break-tag
    line_brk_tag = "<br /><br />" 
    st2 = tf.strings.regex_replace(st, line_brk_tag, ' ') # the pattern is a regular expression, but the
                                                          # tag contains no metacharacters, so no escaping is needed
                                                          # https://docs.python.org/3/howto/regex.html#matching-characters
    
    # now we remove all the punctuation from the string
    st2 = tf.strings.unicode_decode(st2, input_encoding = 'utf-8')
    st2 = tf.strings.unicode_encode([char for char in st2 if char not in punc_ls2], output_encoding = 'UTF-8')
    
    st2 = tf.strings.lower(st2)
    
    return st2
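
The comment about removing the line-break tags before the punctuation can be checked with ordinary python strings. This is a small sketch on a made-up excerpt; strip_punct() is a hypothetical helper, not one of the functions defined in this notebook.

```python
import string

review = "I watch them all.<br /><br />It's not better."

def strip_punct(text):
    return ''.join(ch for ch in text if ch not in string.punctuation)

# tags first, then punctuation (the order used in this notebook)
tags_first = strip_punct(review.replace('<br /><br />', ' ')).lower()

# punctuation first: '<', '/' and '>' vanish, so the tag can no longer be found
punct_first = strip_punct(review).replace('<br /><br />', ' ').lower()

print(tags_first)   # i watch them all its not better
print(punct_first)  # i watch them allbr br its not better
```

With the wrong order, the leftover "br" fragments would end up in the vocabulary as if they were words.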

In [22]:
import time

In [23]:
%%time
for item in pos_fl0:
    lst = punc_filter_and_to_lower1(item)

CPU times: user 3.4 s, sys: 63.7 ms, total: 3.47 s
Wall time: 3.47 s


In [24]:
lst

<tf.Tensor: shape=(), dtype=string, numpy=b'i watch them all its not better than the amazing ones strictly ballroom shall we dance japanese version but its completely respectable and pleasingly different in parts i am an english teacher and i find some of the ignorance about language in some of these reviews rather upsetting for example the name should scream dont watch how she move since when can movie titles ignore grammar  there is nothing inherently incorrect about caribbean english grammar its just not canadian standard english grammar comments about the dialogue seem off to me i put on the subtitles because im a canadian standard english speaker so i just automatically assumed that i would have trouble understanding all of it it wasnt all that difficult and it gave a distinctly different flavour as the other step movies i have seen were so american i loved that this movie was set in toronto and in fact wish it was even more clearly set there i loved that the heroine was so atypic

In [25]:
# function to get rid of punctuations, html line breaks and change to lower case
# using tf.strings.regex_replace()
def punc_filter_and_to_lower2(st):
    
    # before removing punctuations, we should remove the html line-break tag
    # this is because the line break tag contains <,/ and > characters which 
    # will be removed if we remove punctuations first. This will then make it harder 
    # to identify the line-break-tag
    line_brk_tag = "<br /><br />" 
    st2 = tf.strings.regex_replace(st, line_brk_tag, ' ')
    
    # now we replace every punctuation mark in the string with a space
    for punc in punc_ls:
        # Very important: to replace metacharacters we prepend a backslash to them;
        # in fact, we can prepend a backslash to every punctuation character.
        # To prepend a backslash to a character we do: '\\' + char
        st2 = tf.strings.regex_replace(st2, '\\'+ punc, ' ')
    
    st2 = tf.strings.lower(st2)
    
    return st2

In [26]:
%%time
for item in pos_fl0:
    lst = punc_filter_and_to_lower2(item)

CPU times: user 8.89 ms, sys: 160 µs, total: 9.05 ms
Wall time: 8.33 ms


In [27]:
lst

<tf.Tensor: shape=(), dtype=string, numpy=b'i watch them all  it s not better than the amazing ones   strictly ballroom    shall we dance    japanese version   but it s completely respectable and pleasingly different in parts  i am an english teacher and i find some of the ignorance about language in some of these reviews rather upsetting  for example  the  name should scream don t watch   how she move   since when can movie titles ignore grammar    there is nothing inherently incorrect about caribbean english grammar  it s just not canadian standard english grammar  comments about the dialogue seem off to me  i put on the subtitles because i m a canadian standard english speaker  so i just automatically assumed that i would have trouble understanding all of it  it wasn t all that difficult and it gave a distinctly different flavour as the other step movies i have seen were so american  i loved that this movie was set in toronto and  in fact  wish it was even more clearly set there  i 
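
The output above contains runs of consecutive spaces because every punctuation mark is replaced by its own space. A possible refinement, sketched here with Python's re module rather than tensorflow, is to match all punctuation with a single character class and then collapse repeated whitespace. The function clean() and the sample sentence are made up for illustration; whether the same pattern string is accepted unchanged by tf.strings.regex_replace() has not been verified here.

```python
import re
import string

# one character class matching any single punctuation character;
# re.escape() takes care of metacharacters such as ']' and '^'
punct_pattern = '[' + re.escape(string.punctuation) + ']'

def clean(text):
    text = text.replace('<br /><br />', ' ')
    text = re.sub(punct_pattern, ' ', text)   # every punctuation mark in one pass
    text = re.sub(r'\s+', ' ', text)          # collapse runs of whitespace
    return text.strip().lower()

cleaned = clean("I watch them all.<br /><br />It's not better (really).")
print(cleaned)  # i watch them all it s not better really
```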

In [28]:
# function to get rid of punctuations, html line breaks and change to lower case
# using .join() method in python string class
def punc_filter_and_to_lower3(st):
    
    # before removing punctuations, we should remove the html line-break tag
    # this is because the line break tag contains <,/ and > characters which 
    # will be removed if we remove punctuations first. This will then make it harder 
    # to identify the line-break-tag
    line_brk_tag = "<br /><br />" 
    st2 = tf.strings.regex_replace(st, line_brk_tag, ' ') # the pattern is a regular expression, but the
                                                          # tag contains no metacharacters, so no escaping is needed
                                                          # https://docs.python.org/3/howto/regex.html#matching-characters
    
    # now we remove all the punctuation from the string
    st2 = st2.numpy().decode('utf-8')
    st2 = ''.join([char for char in st2 if char not in punc_ls])
    
    st2 = st2.lower()
    
    st2 = tf.convert_to_tensor(st2)
    
    return st2

In [29]:
%%time
for item in pos_fl0:
    lst = punc_filter_and_to_lower3(item)

CPU times: user 3.33 ms, sys: 3.16 ms, total: 6.48 ms
Wall time: 5.81 ms


In [30]:
lst

<tf.Tensor: shape=(), dtype=string, numpy=b'i watch them all its not better than the amazing ones strictly ballroom shall we dance japanese version but its completely respectable and pleasingly different in parts i am an english teacher and i find some of the ignorance about language in some of these reviews rather upsetting for example the name should scream dont watch how she move since when can movie titles ignore grammar  there is nothing inherently incorrect about caribbean english grammar its just not canadian standard english grammar comments about the dialogue seem off to me i put on the subtitles because im a canadian standard english speaker so i just automatically assumed that i would have trouble understanding all of it it wasnt all that difficult and it gave a distinctly different flavour as the other step movies i have seen were so american i loved that this movie was set in toronto and in fact wish it was even more clearly set there i loved that the heroine was so atypic

In [31]:
# function to get rid of punctuations, html line breaks and change to lower case
# using tfds.features.text.Tokenizer()
# we will also simultaneously generate a vocabulary in this step
import tensorflow_datasets as tfds
def punc_filter_and_to_lower4(st, vocab = None):
    ''' vocab: vocabulary to update with the words in st'''
    
    # before removing punctuations, we should remove the html line-break tag
    # this is because the line break tag contains <,/ and > characters which 
    # will be removed if we remove punctuations first. This will then make it harder 
    # to identify the line-break-tag
    line_brk_tag = "<br /><br />" 
    st2 = tf.strings.lower(tf.strings.regex_replace(st, line_brk_tag, ' '))
    
    # now we split the string into alphanumeric tokens, which drops the punctuation
    tokenizer = tfds.features.text.Tokenizer()
    words = tokenizer.tokenize(st2.numpy()) # note that the input has to be a python string, NOT a tensor 
                                            # The output is a list, NOT a tensor
                                            # https://stackoverflow.com/questions/56665868/tensor-numpy-not-working-in-tensorflow-data-dataset-throws-the-error-attribu
    st2 = tf.strings.join(words, ' ')
    
    if isinstance(vocab, set):
        vocab.update(words)
        return st2, vocab
    
    
    return st2

In [32]:
%%time
for item in pos_fl0:
    lst = punc_filter_and_to_lower4(item)

CPU times: user 7.22 ms, sys: 0 ns, total: 7.22 ms
Wall time: 6.4 ms


In [33]:
lst

<tf.Tensor: shape=(), dtype=string, numpy=b'i watch them all it s not better than the amazing ones _strictly ballroom_ _shall we dance _ japanese version but it s completely respectable and pleasingly different in parts i am an english teacher and i find some of the ignorance about language in some of these reviews rather upsetting for example the name should scream don t watch how she move since when can movie titles ignore grammar there is nothing inherently incorrect about caribbean english grammar it s just not canadian standard english grammar comments about the dialogue seem off to me i put on the subtitles because i m a canadian standard english speaker so i just automatically assumed that i would have trouble understanding all of it it wasn t all that difficult and it gave a distinctly different flavour as the other step movies i have seen were so american i loved that this movie was set in toronto and in fact wish it was even more clearly set there i loved that the heroine was
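
Both caveats are visible in the output above: "it's" became "it s", and "_strictly ballroom_" kept its underscores. The behaviour can be imitated with Python's re module, whose \w class matches letters, digits and the underscore. This is an imitation for illustration, not the tfds implementation, and the sample text is made up.

```python
import re

text = "it's not better than _strictly ballroom_, don't watch"

# \w+ matches runs of letters, digits and underscores, so apostrophes split
# words while underscores stay attached
tokens = re.findall(r'\w+', text)
print(tokens)

# treating '_' as a separator as well removes the stray underscores
tokens_no_underscore = re.findall(r'[^\W_]+', text)
print(tokens_no_underscore)
```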

In [34]:
%%time
# test the timing of punc_filter_and_to_lower4 when simultaneously updating a vocabulary set
vocab2 = set([])
for item in pos_fl0:
    lst, vocab2 = punc_filter_and_to_lower4(item, vocab2)

CPU times: user 7.79 ms, sys: 0 ns, total: 7.79 ms
Wall time: 7.2 ms


### Build a Vocabulary based on the training data

This can be easily done using python's [set()](https://docs.python.org/3.8/library/stdtypes.html#set-types-set-frozenset) container: split the text of each training instance into words and update the set with the resulting list. This adds any new words in the text to the set. In the end, the set contains all the unique words in the training dataset.

Note that we have also implemented the above idea in our function punc_filter_and_to_lower4().
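
As a minimal sketch of the idea in plain python, with two made-up cleaned reviews standing in for the training data:

```python
vocab_demo = set()
reviews = ['i watch them all', 'i loved that this movie was set in toronto']

for review in reviews:
    # update() expects an iterable of words; words already present are ignored
    vocab_demo.update(review.split())

print(len(vocab_demo))  # 12 unique words ('i' is counted once)

# pitfall: passing the string itself adds its individual characters instead
chars = set()
chars.update('all')
print(chars)  # {'a', 'l'}
```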

In [35]:
# function to create a vocabulary using the set() container
vocab1 = set([]) # python set for containing unique words in the training dataset
def vocab_builder1(strng):
    # split the string into its words
    words = tf.strings.split(strng)
    # update vocab
    vocab1.update(words.numpy())
    return

In [36]:
%%time
for item in pos_fl0:
    item = punc_filter_and_to_lower3(item)
    vocab_builder1(item)

CPU times: user 233 ms, sys: 70.8 ms, total: 304 ms
Wall time: 302 ms


Note that the cleaning step followed by vocab_builder1(), as timed above, takes about 300 ms. On the other hand, punc_filter_and_to_lower4() gave us almost the same result (up to the caveats mentioned in the previous section) by using tfds.features.text.Tokenizer(), in about 7 ms. That is a difference of more than an order of magnitude (roughly 40 times), while the difference between the outputs of the two routines is quite tolerable.

#### Building the vocabulary

In [37]:
vocab = set([])
train_filepaths = tf.data.Dataset.list_files([train_pos_path, train_neg_path])
train_dataset = tf.data.TextLineDataset(train_filepaths)

ctr = 0
for item in train_dataset:
    _ , vocab = punc_filter_and_to_lower4(item, vocab)
    
    ctr+=1
    if ctr%255 ==0:
        print("=", end = '')



In [38]:
# pick 5 random words from vocab to inspect it
import random

random.sample(list(vocab), 5) # newer python versions no longer accept a set here

['mastering', '1861', 'ditz', 'scwarz', 'intercepting']

In [39]:
print('no. of words in vocabulary: {}'.format(len(vocab)))

no. of words in vocabulary: 74893


### Create a lookup table based on the vocabulary

In [40]:
indices = tf.range(len(vocab), dtype = tf.int64)
lookup_initializer = tf.lookup.KeyValueTensorInitializer( list(vocab), indices) 
num_oov_buckets = 50 
lookup_table = tf.lookup.StaticVocabularyTable( initializer = lookup_initializer, 
                                               num_oov_buckets = num_oov_buckets)
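
Conceptually, the table gives every vocabulary word its own index in the range 0 to len(vocab) - 1, and maps any other word to one of the num_oov_buckets extra indices that follow. The sketch below imitates this in plain python. The helper make_lookup() is hypothetical, and zlib.crc32 is only a stand-in hash: it is NOT the hash tensorflow uses, so the actual bucket numbers will differ.

```python
import zlib

def make_lookup(vocab_words, num_oov_buckets):
    # sorting gives every word the same id on every run; iterating over a
    # set directly does not guarantee that
    word_to_id = {word: idx for idx, word in enumerate(sorted(vocab_words))}

    def lookup(word):
        if word in word_to_id:
            return word_to_id[word]
        # stand-in hash (NOT tensorflow's): spread unknown words over the buckets
        bucket = zlib.crc32(word.encode('utf-8')) % num_oov_buckets
        return len(word_to_id) + bucket

    return lookup

lookup = make_lookup({'movie', 'good', 'bad'}, num_oov_buckets=5)
ids = [lookup(w) for w in ['bad', 'good', 'movie', 'zzyzx']]
print(ids)  # the first three are 0, 1, 2; the last one lies between 3 and 7
```

This also explains why the embedding matrix needs len(vocab) + num_oov_buckets rows.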

In [41]:
# function to map a text to integers using the above lookup table
def text_to_integers(X):
    X = tf.strings.split(X)
    int_ls = lookup_table.lookup(X)
    return int_ls

In [42]:
for item in pos_fl0:
    item = punc_filter_and_to_lower4(item)
    item = text_to_integers(item)
    print(item)

tf.Tensor(
[46532 45075 74022 46331 53524 64720 19317 62409 73522 16143 44242 41577
 50874 68983 31313 15087  8306 21664 13123 33468 13306 53524 64720 65527
 71776 66995 14238 36541  7772 58213 46532 73888 53036 67119 62983 66995
 46532 42335 52981 16827 16143 22743 57728 58416  7772 52981 16827 18980
 16927 45262 54275 13214 65814 16143 29277 29030 19133 54210 64365 45075
 53001 16072 24642 56047 45093  4951 14832  2865 50949 31021  2131 26411
 41707 13786 56795 57728 55127 67119 31021 53524 64720 57294 19317 33442
 56790 67119 31021 32260 57728 16143 13415 52656 71145  4299 33307 46532
 11617 70284 16143 58245 42062 46532 10547 26462 33442 56790 67119 33148
 15287 46532 57294 35139 60301 65609 46532 71361 38725  8242 35308 46331
 16827 53524 53524 18322 64365 46331 65609  5258 66995 53524 34038 26462
 57757 36541 32626  3957 16143 43034 55205 20619 46532 38725 42984 38430
 15287 55800 46532 63046 65609 51505 14832 36927 33312  7772 53837 66995
  7772 11067 48930 53524 36927 62366 641

### Create an Embedding Matrix

In [43]:
embedding_dim = 100
initial_embeddings = tf.random.uniform([len(vocab) + num_oov_buckets, embedding_dim] ) 
embedding_matrix = tf.Variable(initial_embeddings)
embedding_matrix

<tf.Variable 'Variable:0' shape=(74943, 100) dtype=float32, numpy=
array([[0.84739673, 0.5330745 , 0.8537576 , ..., 0.8085828 , 0.19710445,
        0.61763525],
       [0.570151  , 0.21181452, 0.9607618 , ..., 0.32157898, 0.81303966,
        0.78700495],
       [0.29486346, 0.33803856, 0.10440004, ..., 0.62405384, 0.34108508,
        0.79526067],
       ...,
       [0.58884156, 0.9760572 , 0.5741606 , ..., 0.8563837 , 0.6412374 ,
        0.4897573 ],
       [0.7385638 , 0.00988424, 0.33112252, ..., 0.14830363, 0.21763122,
        0.18210769],
       [0.10461211, 0.7373018 , 0.51076746, ..., 0.3489182 , 0.43539405,
        0.11802483]], dtype=float32)>

In [44]:
# function to create an embedding-vector from a list of word indices
def indices_to_embeddings(idxs):
    embeddings = tf.nn.embedding_lookup(embedding_matrix, idxs)
    sqrt_num_words = tf.math.sqrt(tf.cast(tf.size(idxs), tf.float32)) # size is of type int32 but 
                                                                      # tf.math.sqrt has to have one of 
                                                                      # bfloat16, half, float32, 
                                                                      # float64, complex64, 
                                                                      # complex128.
    mean_embedding = tf.math.reduce_mean(embeddings, axis = 0)
    sentence_embedding = tf.multiply(sqrt_num_words, mean_embedding)
    return sentence_embedding
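
The rescaling can be checked by hand on a tiny example. For a sentence of n words, multiplying the mean embedding by sqrt(n) is the same as dividing the sum of the embeddings by sqrt(n). The numbers below are made up.

```python
import math

# made-up 2-dimensional embeddings for a 4-word sentence
word_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
n = len(word_vectors)

mean = [sum(col) / n for col in zip(*word_vectors)]     # [4.0, 5.0]
sentence = [math.sqrt(n) * m for m in mean]             # [8.0, 10.0]

# equivalent form: sum of the vectors divided by sqrt(n)
alt = [sum(col) / math.sqrt(n) for col in zip(*word_vectors)]

print(mean, sentence, alt)
```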

In [45]:
for item in pos_fl0:
    item = punc_filter_and_to_lower4(item)
    item = text_to_integers(item)
    item = indices_to_embeddings(item)

In [46]:
item

<tf.Tensor: shape=(100,), dtype=float32, numpy=
array([8.693973 , 8.854761 , 7.82933  , 8.957902 , 7.2097936, 7.3304343,
       8.380673 , 7.3754983, 7.6155996, 8.670477 , 6.968219 , 7.8654613,
       7.5966697, 7.502631 , 7.268158 , 8.001531 , 7.877151 , 7.0359006,
       7.3503532, 8.404755 , 9.161611 , 8.507119 , 7.9875207, 7.132951 ,
       8.258061 , 7.2016726, 8.300606 , 7.6019235, 7.66854  , 7.5146866,
       7.952212 , 6.7936077, 7.91821  , 7.4322057, 7.7793183, 7.798514 ,
       7.3226557, 7.3620768, 7.9715447, 8.720195 , 7.8295207, 7.957666 ,
       7.8305845, 7.7825203, 7.7725887, 7.9687223, 7.9044704, 7.7980723,
       8.581489 , 8.405485 , 8.449623 , 7.231178 , 7.63637  , 7.9109426,
       7.703827 , 8.627375 , 7.282163 , 8.5807   , 7.225756 , 8.835483 ,
       9.227416 , 7.8802958, 6.968203 , 7.4323564, 7.3603973, 6.811419 ,
       8.219535 , 7.566163 , 7.795026 , 7.252293 , 8.265123 , 8.780831 ,
       7.650981 , 7.909001 , 7.5872536, 7.48621  , 7.622642 , 6.997355 ,
   

### Create a Preprocessing layer

Let us now define a preprocessing layer that accepts a batch of sentences from which punctuation and line-break tags have already been removed, converts their words into indices, and finally outputs an embedding vector for each sentence.

In [47]:
class Preprocessing_layer(keras.layers.Layer): 
    def __init__(self, vocab,  num_oov_buckets, embedding_len, **kwargs): 
        super().__init__(**kwargs)
        self.vocab = list(vocab)
        self.vocab_len = len(self.vocab) # plain python int, so it can be used directly in the weight shape
        self.num_oov_buckets = num_oov_buckets
        self.embedding_len = embedding_len
        self.indices = tf.range(self.vocab_len, dtype = tf.int64)
        self.lookup_initializer = tf.lookup.KeyValueTensorInitializer(self.vocab, 
                                                                      self.indices)
        self.lookup_table = tf.lookup.StaticVocabularyTable(initializer = self.lookup_initializer, 
                                                            num_oov_buckets = self.num_oov_buckets)
         
    # build the embedding matrix 
    def build(self, batch_input_shape):
        self.embedding_mat = self.add_weight(name = 'embedding_matrix', 
                                             shape = [self.vocab_len + self.num_oov_buckets, self.embedding_len],
                                             initializer = tf.random_uniform_initializer())
        
        super().build(batch_input_shape)
        
    
    # function to convert a sentence into a list of indices for its words
    def text_to_integers(self, X):
        X_words = tf.strings.split(X)
        # X_words will be a ragged tensor
        # on the other hand, self.lookup_table.lookup() expects either a sparse tensor or a dense tensor
        # for example see the documentation: https://www.tensorflow.org/api_docs/python/tf/lookup/StaticVocabularyTable
        # in order to use it with ragged tensors, we have to use tf.ragged.map_flat_values()
        # https://www.tensorflow.org/api_docs/python/tf/ragged/map_flat_values
        X_int = tf.ragged.map_flat_values(self.lookup_table.lookup, X_words)
        return X_int
    
    # function to convert a list of indices into an embedding vector
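    def integers_to_embeddings(self, X):
        # X is a ragged tensor of word indices with shape (batch_size, None)
        num_words = tf.cast(X.row_lengths(), tf.float32)       # shape (batch_size,)
        X_ems = tf.nn.embedding_lookup(self.embedding_mat, X)  # ragged, one row of embeddings per review
        # axis 0 of X_ems is the batch dimension, so we average over the words along axis 1 (NOT 0)
        X_vec = tf.math.reduce_mean(X_ems, axis = 1)           # shape (batch_size, embedding_len)
        # rescale each mean embedding by sqrt(number of words); broadcasting gives the same
        # result as multiplying by a (batch_size, batch_size) diagonal matrix, without building it
        return X_vec * tf.math.sqrt(num_words)[:, tf.newaxis]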
        
    # define the call method
    def call(self, X):
        
        # recall that X contains strings from which punctuation etc. has already been removed
        # Important: the shape of X is (batch_size, ...), i.e. axis 0 of X is the batch dimension
        # We first convert all the words into their indices
        X_int = self.text_to_integers(X)
        # now convert these indices into a rescaled mean embedding vector
        X_vec = self.integers_to_embeddings(X_int)
        
        return X_vec
    
    
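    # compute output shape: each input string becomes one vector of length embedding_len,
    # so the embedding dimension is appended to the batch input shape
    def compute_output_shape(self, batch_input_shape):
        return tf.TensorShape(tf.TensorShape(batch_input_shape).as_list() + [self.embedding_len])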
    
    
    # add the layer's hyperparameters to its config
    # (vocab is not stored here, so from_config() alone cannot rebuild the layer)
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, 'embedding_len': self.embedding_len, 
                'num_oov_buckets': self.num_oov_buckets}
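
To make the two steps of the layer concrete, here is a minimal plain-Python sketch of what it computes for a single review: the word-to-index lookup with out-of-vocabulary (OOV) buckets, followed by the mean embedding multiplied by the square root of the number of words. The toy vocabulary, the toy embedding values and the function names are made up for illustration, and `zlib.crc32` is only a stand-in for the hash that TensorFlow uses internally to pick an OOV bucket.

```python
import math
import zlib

def lookup_ids(words, vocab, num_buckets):
    """Map words to integer ids; unknown words are hashed into OOV buckets."""
    word_to_id = {word: i for i, word in enumerate(vocab)}
    ids = []
    for word in words:
        if word in word_to_id:
            ids.append(word_to_id[word])
        else:
            # stand-in hash (assumption): TensorFlow uses its own fingerprint function
            bucket = zlib.crc32(word.encode("utf-8")) % num_buckets
            ids.append(len(vocab) + bucket)
    return ids

def rescaled_mean_embedding(ids, embedding_matrix):
    """Mean of the selected embedding rows, multiplied by sqrt(number of words)."""
    n = len(ids)
    dim = len(embedding_matrix[0])
    mean = [sum(embedding_matrix[i][d] for i in ids) / n for d in range(dim)]
    return [math.sqrt(n) * m for m in mean]

toy_vocab = ["good", "bad", "movie"]   # made-up vocabulary
toy_buckets = 2                        # plays the role of num_oov_buckets
# one row per vocabulary word plus one row per OOV bucket, two dimensions each
toy_matrix = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5], [0.25, 0.25]]

toy_ids = lookup_ids("good movie good movie".split(), toy_vocab, toy_buckets)
toy_vector = rescaled_mean_embedding(toy_ids, toy_matrix)
print(toy_ids)     # [0, 2, 0, 2]
print(toy_vector)  # [3.0, 2.0]
```

The embedding matrix needs `len(vocab) + num_oov_buckets` rows because every OOV bucket gets its own trainable row, which is the shape used in `build()` above.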
    

### Labeling the data

Note that the positive and negative reviews in each dataset are stored in separate folders and therefore carry no explicit labels. We will create our own labels, as is done in [this](https://www.tensorflow.org/tutorials/load_data/text) official TensorFlow tutorial. 

In [48]:
# function to attach a label to a review
def labeler(review, label):
    return review, tf.cast(label, tf.int64)

In [49]:
train_pos_paths = tf.data.Dataset.list_files(train_pos_path)
train_pos_reviews = tf.data.TextLineDataset(train_pos_paths).map(lambda x : labeler(x, 1))

In [50]:
for item in train_pos_reviews.take(2):
    print(item)
    print(' ')

(<tf.Tensor: shape=(), dtype=string, numpy=b"Cinderella is a beautiful film, with beautiful songs of course. In fact, it's one of the best films of the 1950's.<br /><br />I think all the characters are portrayed amazingly. You can see the cruelness of Cinderella's stepsisters and her stepmother, the sweetness of Cinderella. The mice are funny and sweet too.<br /><br />I think they changed the tale a bit, but I think it's for the best. It's such a nice film, and I don't think anyone could resist it deep down.<br /><br />I give it a 8/10. I don't think it's the best Disney film. But it sure is a true classic.">, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
 
(<tf.Tensor: shape=(), dtype=string, numpy=b"White man + progress + industrialization = BAD. First nations + nature + animals = GOOD. Simple formula. Actually, in past days the same kind of propaganda was used to defend the status quo; now it is used to attack it. However, that being said, I think the movie does succeed in overcoming

In [51]:
train_neg_paths = tf.data.Dataset.list_files(train_neg_path)
train_neg_reviews = tf.data.TextLineDataset(train_neg_paths).map(lambda x: labeler(x, 0))

In [52]:
for item in train_neg_reviews.take(2):
    print(item)
    print(' ')

(<tf.Tensor: shape=(), dtype=string, numpy=b"It's boggles the mind how this movie was nominated for seven Oscars and won one. Not because it's abysmal or because given the collective credentials of the creative team behind it really ought to deserve them but because in every category it was nominated Prizzi's Honor disappoints. Some would argue that old Hollywood pioneer John Huston had lost it by this point in his career but I don't buy it. Only the previous year he signed the superb UNDER THE VOLCANO, a dark character study set in Mexico, that ranks among the finest he ever did. Prizzi's Honor on the other hand, a film loaded with star power, good intentions and a decent script, proves to be a major letdown.<br /><br />The overall tone and plot of a gangster falling in love with a female hit-man prefigures the quirky crimedies that caught Hollywood by storm in the early 90's but the script is too convoluted for its own sake, the motivations are off and on the whole the story seems un

### Creating training batches

In [53]:
shuffle_buffer_size = 20000
train_dataset = train_pos_reviews.concatenate(train_neg_reviews).shuffle(buffer_size = shuffle_buffer_size)

In [54]:
count = 0
for item in train_dataset:
    count+=1
print(count)    

25000
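
One thing to keep in mind here: the dataset is built by concatenating all 12,500 positive reviews followed by all 12,500 negative ones, and the shuffle buffer (20,000) is smaller than the dataset (25,000). The sketch below is a simplified plain-Python model of a shuffle buffer (fill the buffer, emit a random element, refill from the stream). It is not TensorFlow's implementation, and the function name and seed are made up, but it illustrates why the first batches are expected to contain roughly 12,500/20,000 = 62.5% positive labels unless the buffer covers the whole dataset.

```python
import itertools
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Simplified model of a shuffle buffer over a stream of items."""
    rng = random.Random(seed)
    stream = iter(items)
    buffer = list(itertools.islice(stream, buffer_size))  # initial fill
    for item in stream:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]       # emit a random element from the buffer
        buffer[idx] = item      # replace it with the next element of the stream
    rng.shuffle(buffer)         # stream exhausted: drain what is left in random order
    yield from buffer

labels = [1] * 12500 + [0] * 12500   # positives first, then negatives

def positive_fraction(buffer_size, first_n=1000):
    """Share of positive labels among the first `first_n` shuffled elements."""
    head = list(itertools.islice(buffered_shuffle(labels, buffer_size), first_n))
    return sum(head) / len(head)

partial_buffer_fraction = positive_fraction(20000)
full_buffer_fraction = positive_fraction(25000)
print(partial_buffer_fraction)  # noticeably above 0.5
print(full_buffer_fraction)     # close to 0.5
```

Setting `shuffle_buffer_size` to the dataset size (25,000 here) would give a uniform shuffle, at the cost of holding every review in memory.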


We now wish to apply punc_filter_and_to_lower4() to the reviews in our training dataset. For this we will use the [map](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map) method of tf.data.Datasets. At this point it is important to note that (as mentioned in the documentation for [map](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map)):

    Note that irrespective of the context in which map_func is defined (eager vs. graph), tf.data traces the function and executes it as a graph. To use Python code inside of the function you have two options:

     1) Rely on AutoGraph to convert Python code into an equivalent graph computation. The downside of this approach is that AutoGraph can convert some but not all Python code.

     2) Use tf.py_function, which allows you to write arbitrary Python code but will generally result in worse performance than 1)

This point is of concern to us because the [tokenizer](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/Tokenizer) we used in punc_filter_and_to_lower4() only accepts string input and NOT tensors. This string was extracted from the tensor using its .numpy() method which is only operable in eager mode. Thus when we pass punc_filter_and_to_lower4() to map(), it throws an error, complaining:

     'Tensor' object has no attribute 'numpy'


Thus we have to wrap punc_filter_and_to_lower4() in tf.py_function before passing it to map().


In [55]:
# wrap punc_filter_and_to_lower4 with tf.py_function
# tf.py_function discards static shape information, so we restore it with
# tf.ensure_shape(); otherwise an error is thrown during training
def string_transform(X):
    x = tf.py_function(punc_filter_and_to_lower4, [X], Tout = tf.string)
    x = tf.ensure_shape(x, ())
    
    return x

batch_size = 50
prefetch = 2
train_batch = train_dataset.map(lambda X, y: 
                                    (string_transform(X), y) ).batch(batch_size).prefetch(prefetch)

Note that we could alternatively have preprocessed the data with punc_filter_and_to_lower2(), which is built from TensorFlow functions and therefore needs no tf.py_function wrapper. The price we would have to pay is rebuilding the vocabulary with that function. 
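
For reference, the kind of cleaning this step aims at can be sketched in plain Python as follows. This is only an illustration under assumptions: `clean_review` is a hypothetical name, it is not the notebook's punc_filter_and_to_lower4() or punc_filter_and_to_lower2(), and replacing digits with spaces is a choice made here. A graph-compatible version would express the same steps with `tf.strings.regex_replace` and `tf.strings.lower`.

```python
import re

def clean_review(text):
    """Hypothetical cleaner: drop <br /> tags, turn non-letters into spaces, lowercase."""
    text = re.sub(r"<br\s*/?>", " ", text)    # remove HTML line-break tags
    text = re.sub(r"[^A-Za-z]+", " ", text)   # each run of punctuation/digits/spaces -> one space
    return text.lower().strip()

print(clean_review("It isn't bad.<br /><br />I give it 8/10!"))
# it isn t bad i give it
```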

In [56]:
ite = next(iter(train_batch))
print('shape of train-batch: {}'.format(ite[0].shape))
print(ite[0][0])

shape of train-batch: (50,)
tf.Tensor(b'basically this movie is one of those rare movies you either hate and think borders on suicide as the next best thing to do rather than having to sit through it for two hours or as in my case you see it as a kult hit one of those movies wherein the humour the plot the acting is actually very hidden but for those of us willing to go looking for it trusting the director well the reward is u laugh your a of the fact that u have to find the things mentioned above actually makes the movie even more funny because u get the impression the director isn t even aware of how funny his movie is which doesn t seem likely and therein lies the intelligence at the helm of this magnificient project called spaced invaders', shape=(), dtype=string)


In [57]:
ite[1]

<tf.Tensor: shape=(50,), dtype=int64, numpy=
array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 1, 0, 1])>

### Create Validation dataset 

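The exercise asks for a 15,000-review validation set and a 10,000-review test set. For simplicity we instead split the test set in half, which gives 12,500 reviews each (6,250 positive and 6,250 negative).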

In [58]:
test_path = os.path.join(path, 'test')
print(test_path)
test_pos_path = os.path.join(test_path, 'pos', '*.txt')
print('path to positive test reviews: \n{}'.format(test_pos_path))
test_neg_path = os.path.join(test_path, 'neg', '*.txt')
print('path to negative test reviews: \n{}'.format(test_neg_path))

/home/prarit/MachineLearningProjects/Word-Embeddings/aclImdb/test
path to positive test reviews: 
/home/prarit/MachineLearningProjects/Word-Embeddings/aclImdb/test/pos/*.txt
path to negative test reviews: 
/home/prarit/MachineLearningProjects/Word-Embeddings/aclImdb/test/neg/*.txt


In [59]:
test_pos_files = glob.glob(test_pos_path)
len(test_pos_files)

12500

In [60]:
valid_size = 12500
valid_pos_files = random.sample(test_pos_files, int(valid_size/2))
len(valid_pos_files)

6250

In [61]:
# We can remove the valid_pos_files from the set of test_pos_files through the
# simple trick explained in the following post:
# https://stackoverflow.com/questions/6486450/python-compute-list-difference/6486467
test_pos_files = list(set(test_pos_files) - set(valid_pos_files))
len(test_pos_files)

6250

In [62]:
test_neg_files = glob.glob(test_neg_path)
print('initial test_neg_files len:{}'.format(len(test_neg_files)))
valid_neg_files = random.sample(test_neg_files, int(valid_size/2))
print('valid_neg_files len: {}'.format(len(valid_neg_files)))
test_neg_files = list(set(test_neg_files) - set(valid_neg_files))
print('test_neg_files len: {}'.format(len(test_neg_files)))

initial test_neg_files len:12500
valid_neg_files len: 6250
test_neg_files len: 6250
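
Note that `random.sample` without a seed gives a different split on every run, and `list(set(...) - set(...))` returns the remaining files in arbitrary order. If a repeatable split is wanted, one option is sketched below. The helper name `split_files`, the seed and the file names are made up for illustration.

```python
import random

def split_files(files, valid_count, seed=42):
    """Split file paths into disjoint validation and test lists, reproducibly."""
    rng = random.Random(seed)     # fixed seed makes the split repeatable
    ordered = sorted(files)       # sort first so the result does not depend on input order
    valid = set(rng.sample(ordered, valid_count))
    test = [f for f in ordered if f not in valid]
    return sorted(valid), test

fake_files = ["pos/{}_10.txt".format(i) for i in range(12500)]  # made-up file names
valid_files, test_files = split_files(fake_files, 6250)
print(len(valid_files), len(test_files))  # 6250 6250
```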


In [63]:
valid_pos_reviews = tf.data.TextLineDataset(valid_pos_files).map(lambda x: labeler(x, 1))
valid_neg_reviews = tf.data.TextLineDataset(valid_neg_files).map(lambda x: labeler(x, 0))

In [64]:
for item in valid_pos_reviews.take(2):
    print(item)
    print(' ')

(<tf.Tensor: shape=(), dtype=string, numpy=b'This is the best version of Gypsy that has been filmed.Bette Midler is simply superb as Mama Rose.She has the voice,the gestures,the look,and most of all,a supreme acting ability to carry the role off and to make her character come alive.Her singing is,simply put:MAGNIFICENT! She especially shines in two numbers-"Everything\'s Coming Up Roses" and "Rose\'s Turn". The other actors are also fine,particularly Christine Ebersoll as Tessie.Also good is Peter Riegert;his portrayal of Herbie is acted with great style and believability.The direction of this movie is very,very good.There isn\'t a false note or gaffe in the entire production.This film is a vast improvement over the Roz Russell version filmed in 1962.Since viewing it again,I can state that the three greatest Mama Roses are:Ethel Merman,Bette Midler,and Angela Lansbury. See this movie.You\'ll be glad that you did.'>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
 
(<tf.Tensor: shape=(), 

In [65]:
for item in valid_neg_reviews.take(2):
    print(item)
    print(' ')

(<tf.Tensor: shape=(), dtype=string, numpy=b"I've only seen most of the series since I leave the TV on as background noise in my dorm.<br /><br />I've been a fan of Mencia but this show really doesn't do much for me. Occasionally he'll say or do something to pull a chuckle out, but he has this aura of smugness that completely ruins it.<br /><br />I've always thought he was funny because of his raging angry-man routine that's not terribly prevalent in this TV series. Instead, he's just smug. I guess that just reflects how funny his comedy is: stale and uninteresting when he isn't in the proper mode of delivery. I've seen him get into it sometimes on his show, but for the most part, he just sits there smiling and looking smug, and it doesn't suit him well.<br /><br />Just my opinion though.">, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
 
(<tf.Tensor: shape=(), dtype=string, numpy=b'As a recent convert to Curb Your Enthusiasm, which prompted my viewing of all season\'s episodes, I expec

In [66]:
test_pos_reviews = tf.data.TextLineDataset(test_pos_files).map(lambda x: labeler(x, 1))
test_neg_reviews = tf.data.TextLineDataset(test_neg_files).map(lambda x: labeler(x, 0))

In [67]:
for item in test_pos_reviews.take(2):
    print(item)
    print(' ')

(<tf.Tensor: shape=(), dtype=string, numpy=b'Mom has to be one of the all time uncomfortable movies to watch. It features an elderly lady you would love to have as your Nanny who becomes the nastiest mother f***ing monster you would ever want to meet on a dark night!<br /><br />This supper Nanny eats the inners of a young lady at the opening of the movie and it just gets sicker as it goes on. A cross between the howling and brain dead seem to come to mind when describing Mom!<br /><br />A must for horror fans who have the stomach for it (if you have watched re-animator or brain dead, this will float your boat)and are willing to switch the brain off for an hour or so...Let the gore pour!...8/10'>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
 
(<tf.Tensor: shape=(), dtype=string, numpy=b'I haven\'t seen every single movie that Burt Reynolds has ever made, but this one (which I\'ve just finished watching, for the third time) may very well be his best! It suffers only from some slow stret

In [68]:
for item in test_neg_reviews.take(2):
    print(item)
    print(' ')

(<tf.Tensor: shape=(), dtype=string, numpy=b'This was included on the disk "Shorts: Volume 2"--a rather dull collection of short films. Shorts are among my favorite style of films but somehow the people assembling this second collection had a hard time finding quality content--and it wasn\'t nearly as good as the first volume or other shorts collections. <br /><br />This short film feels like it\'s woefully incomplete. There is a story, but so much in unanswered that the viewer, like me, feels a bit left out and unfulfilled.<br /><br />The film begins with a woman, her boyfriend and her Westie (that\'s a dog, by the way) going to a lonely beach. The lady speaks with an accent that, at times, is a bit difficult to follow. Given that I am hard of hearing, I sure would have loved if it had been closed captioned. Anyway, the boyfriend goes for a swim while she naps. When she awakens, her dog is gone. She panics and makes the guy follow her all about looking for the dog. They spend most of 

In [69]:
buffer_size = 10000
valid_dataset = valid_pos_reviews.concatenate(valid_neg_reviews).shuffle(buffer_size = buffer_size)

In [70]:
valid_batch = valid_dataset.map(lambda X,y: 
                                (string_transform(X), y) ).batch(batch_size).prefetch(prefetch)

In [71]:
test_dataset = test_pos_reviews.concatenate(test_neg_reviews).shuffle(buffer_size)

In [72]:
test_batch = test_dataset.map(lambda X,y: 
                                (string_transform(X), y) ).batch(batch_size).prefetch(prefetch)

### A simple model to test the preprocessing layer defined earlier

In [82]:
num_oov_buckets = 50
embedding_len = 100

In [86]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, BatchNormalization, PReLU

In [88]:
model = Sequential()
model.add(Input(shape = [], dtype = tf.string))
model.add(Preprocessing_layer(vocab = vocab, 
                              num_oov_buckets = num_oov_buckets, 
                              embedding_len = embedding_len) )
model.add(PReLU())
model.add(BatchNormalization())
model.add(Dense(units = 256, activation = None))
model.add(PReLU())
model.add(BatchNormalization())
model.add(Dense(units = 256, activation = None))
model.add(PReLU())
model.add(BatchNormalization())
model.add(Dense(units = 64, activation = None))
model.add(PReLU())
model.add(BatchNormalization())
model.add(Dense(units = 64, activation = None))
model.add(PReLU())
model.add(BatchNormalization())
model.add(Dense(units = 16, activation = None))
model.add(PReLU())
model.add(BatchNormalization())
model.add(Dense(units = 4, activation = None))
model.add(PReLU())
model.add(BatchNormalization())
model.add(Dense(units = 1, activation = 'sigmoid'))

In [89]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
preprocessing_layer_4 (Prepr (None, 100)               7494300   
_________________________________________________________________
p_re_lu_2 (PReLU)            (None, 100)               100       
_________________________________________________________________
batch_normalization_9 (Batch (None, 100)               400       
_________________________________________________________________
dense_9 (Dense)              (None, 256)               25856     
_________________________________________________________________
p_re_lu_3 (PReLU)            (None, 256)               256       
_________________________________________________________________
batch_normalization_10 (Batc (None, 256)               1024      
_________________________________________________________________
dense_10 (Dense)             (None, 256)              
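
The parameter counts in the summary can be checked by hand. The short sketch below redoes the arithmetic. The helper names are made up, and the only inputs are numbers already shown in this notebook (7,494,300 parameters, embedding_len = 100, num_oov_buckets = 50).

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

def batch_norm_params(n_features):
    """gamma, beta, moving mean and moving variance: four values per feature."""
    return 4 * n_features

summary_embedding_params = 7494300   # preprocessing layer, from the summary above
embedding_dim = 100                  # embedding_len in the notebook
oov_buckets = 50                     # num_oov_buckets in the notebook

matrix_rows = summary_embedding_params // embedding_dim
implied_vocab_size = matrix_rows - oov_buckets

print(matrix_rows)               # 74943
print(implied_vocab_size)        # 74893
print(dense_params(100, 256))    # 25856
print(batch_norm_params(100))    # 400
```

So the embedding matrix has 74,943 rows, which implies a vocabulary of 74,893 words, and it accounts for the vast majority of the model's parameters.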

In [90]:
loss = 'binary_crossentropy'
optimizer = 'adam'
metrics = ['accuracy']
model.compile(optimizer = optimizer, loss = loss, metrics = metrics)

In [91]:
# tensorboard logs
import time

log_dir = os.path.join(os.getcwd(), 'log_dir')
print('log directory: \n{}'.format(log_dir))
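# '%m' is the month; the run logged below used '%Y_m_...' (no '%' before 'm'),
# which is why its path contains a literal 'm'
run_id = time.strftime('%Y_%m_%d_%H_%M_%S')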
run_file = os.path.join(log_dir, run_id)
print('run file: \n{}'.format(run_file))

log directory: 
/home/prarit/MachineLearningProjects/Word-Embeddings/log_dir
run file: 
/home/prarit/MachineLearningProjects/Word-Embeddings/log_dir/2020_m_10_21_09_42


In [92]:
# callback for TensorBoard
# check the following tutorial
# https://www.tensorflow.org/tensorboard/get_started
tensorboard_callback = keras.callbacks.TensorBoard(run_file)

In [93]:
# Early Stopping
from tensorflow.keras.callbacks import EarlyStopping

monitor = 'val_loss'
min_delta = 0.01
patience = 10
verbose = 1
restore_best_weights = True
stopper = EarlyStopping(monitor = monitor, min_delta = min_delta, 
                        patience = patience, verbose = verbose,
                        restore_best_weights = restore_best_weights)

In [94]:
#### fit the model
epochs = 50
history = model.fit(train_batch, epochs = epochs, 
                    validation_data = valid_batch, 
                    callbacks = [stopper, tensorboard_callback])

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 00011: early stopping
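
Stopping at epoch 11 with patience = 10 means that, after the first epoch, no epoch improved val_loss by at least min_delta = 0.01. The sketch below is a simplified plain-Python model of that patience/min_delta rule, not the Keras source. The function name and the validation losses are made up purely to reproduce the stopping pattern.

```python
def early_stopping_epoch(val_losses, min_delta, patience):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:   # counts as an improvement only if it beats best by min_delta
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# made-up losses: a tiny improvement after epoch 1 that never reaches min_delta
made_up_val_losses = [0.60] + [0.595] * 49
print(early_stopping_epoch(made_up_val_losses, min_delta=0.01, patience=10))  # 11
```

With restore_best_weights = True, the weights kept in such a run are those from the best epoch, not from the last one, so a smaller min_delta may be worth trying if the validation loss is still decreasing slowly.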
