In this notebook we will learn how to create and train an embedding layer for the words appearing in text data. We will then train a simple DNN-based model to perform sentiment analysis on this data.

## Exercise

This is exercise 13.10 in [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) (2nd edition).

### Problem Statement

In this exercise you will download a dataset, split it, create a tf.data.Dataset to load it and preprocess it efficiently, then build and train a binary classification model containing an Embedding layer:

  - a. Download the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/), which contains 50,000 movie reviews from the Internet Movie Database. The data is organized in two directories, train and test, each containing a pos subdirectory with 12,500 positive reviews and a neg subdirectory with 12,500 negative reviews. Each review is stored in a separate text file. There are other files and folders (including preprocessed bag-of-words), but we will ignore them in this exercise.
  
    
  - b. Split the test set into a validation set (15,000) and a test set (10,000).
  
  
  - c. Use tf.data to create an efficient dataset for each set.
  
  
  - d. Create a binary classification model, using a TextVectorization layer to preprocess each review. If the TextVectorization layer is not yet available (or if you like a challenge), try to create your own custom preprocessing layer: you can use the functions in the tf.strings package, for example lower() to make everything lowercase, regex_replace() to replace punctuation with spaces, and split() to split words on spaces. You should use a lookup table to output word indices, which must be prepared in the adapt() method.
  
  
  - e. Add an Embedding layer and compute the mean embedding for each review, multiplied by the square root of the number of words (see Chapter 16). This rescaled mean embedding can then be passed to the rest of your model.
  
  
  - f. Train the model and see what accuracy you get. Try to optimize your pipelines to make training as fast as possible.


  - g. Use TFDS to load the same dataset more easily: tfds.load("imdb_reviews").
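
Step (e) is probably the least obvious part of the problem statement, so here is a minimal sketch of the arithmetic in plain Python rather than Keras. It assumes each review has already been turned into a list of word vectors; the helper name `rescaled_mean_embedding` is hypothetical and is not part of the book or of TensorFlow. Multiplying the mean by the square root of the number of words is the same as dividing the sum by that square root.

```python
import math

def rescaled_mean_embedding(word_vectors):
    """Return the mean word vector multiplied by sqrt(number of words).

    `word_vectors` is a non-empty list of equal-length lists of floats,
    one per word in the review. Hypothetical helper, for illustration only.
    """
    n_words = len(word_vectors)
    if n_words == 0:
        raise ValueError("a review must contain at least one word")
    dim = len(word_vectors[0])
    # Sum the vectors component by component.
    totals = [sum(vec[i] for vec in word_vectors) for i in range(dim)]
    # mean * sqrt(n) == (sum / n) * sqrt(n) == sum / sqrt(n)
    scale = math.sqrt(n_words)
    return [total / scale for total in totals]

# Four 2-dimensional word vectors: the sum is [16, 20] and sqrt(4) is 2.
review = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(rescaled_mean_embedding(review))  # [8.0, 10.0]
```

In a real model the reviews in a batch are usually padded to a common length, so the padded positions would presumably have to be left out of both the sum and the word count.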

In [4]:
import tensorflow as tf
import tensorflow.keras as keras
print('tensorflow version: {}'.format(tf.__version__))
print('keras version: {}'.format(keras.__version__))

tensorflow version: 2.1.0
keras version: 2.2.4-tf


In [5]:
import os
print('cwd: {}'.format(os.getcwd()))

cwd: /home/prarit/MachineLearningProjects/Word-Embeddings


## Downloading the Large Movie Review Dataset

In [9]:
# good tutorial on using wget: https://www.tecmint.com/download-and-extract-tar-files-with-one-command/
# turn off verbose output of wget using the flag -nv : https://shapeshed.com/unix-wget/#how-to-turn-off-verbose-output 
!wget -c -nv http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz -o - 

2020-03-06 15:09:03 URL:http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz [84125825/84125825] -> "aclImdb_v1.tar.gz" [1]


In [10]:
# uncompress the downloaded files
!tar xzf aclImdb_v1.tar.gz
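
If `wget` and `tar` are not available (for example on Windows), the same two steps can be sketched with the Python standard library. This is only an alternative under stated assumptions, not what the cells above ran: the helper names `download_if_missing` and `extract_archive` are hypothetical, and unlike `wget -c` the download helper does not resume a partial file; it simply skips the download when a file with that name already exists.

```python
import os
import tarfile
import urllib.request

DATA_URL = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

def download_if_missing(url, filename):
    """Download `url` to `filename` unless that file already exists."""
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)
    return filename

def extract_archive(filename, dest="."):
    """Extract a gzip-compressed tar archive into the folder `dest`."""
    with tarfile.open(filename, "r:gz") as archive:
        archive.extractall(dest)

# Usage (commented out because it downloads roughly 84 MB):
# extract_archive(download_if_missing(DATA_URL, "aclImdb_v1.tar.gz"))
```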

In [12]:
# list the files in the current working directory
os.listdir()

['README.md',
 '.gitignore',
 '.ipynb_checkpoints',
 '.git',
 'aclImdb',
 'Word-Embeddings.ipynb',
 'aclImdb_v1.tar.gz']

We see that aclImdb_v1.tar.gz was extracted to a folder called aclImdb. Let's see the contents of this folder.

In [15]:
path = os.path.join(os.getcwd() , 'aclImdb')
contents = os.listdir(path)
print('The contents of aclImdb are: \n{}'.format(contents))

The contents of aclImdb are: 
['imdb.vocab', 'train', 'README', 'imdbEr.txt', 'test']
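
As a quick sanity check of the layout described in the problem statement, the review files in each split and label folder can be counted. The sketch below assumes that every review is a `.txt` file sitting directly inside `train/pos`, `train/neg`, `test/pos` and `test/neg`; the helper name `count_reviews` is hypothetical.

```python
from pathlib import Path

def count_reviews(root):
    """Count the .txt files under root/<split>/<label> for each split and label."""
    counts = {}
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            folder = Path(root) / split / label
            # A missing folder simply counts as zero reviews.
            n_files = sum(1 for _ in folder.glob("*.txt")) if folder.is_dir() else 0
            counts[(split, label)] = n_files
    return counts

# Usage, assuming the archive was extracted into the current directory:
# print(count_reviews("aclImdb"))
```

If the layout matches the problem statement, each of the four counts should come out as 12,500.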


There is a README file in aclImdb; let us read it.