# Demo: Develop an Embedding + CNN Model for Sentiment Analysis
---

You can find the demo here: https://github.com/markeyser/sentiment-analysis-CNN-demo

## Introduction

In this demo, you will discover how to develop **word embedding models** with **convolutional neural networks** (**CNN**) to classify movie reviews. After completing this demo, you will know:

- How to prepare movie review text data for classification with deep learning methods.
- How to develop a neural classification model with word embedding and convolutional layers.
- How to evaluate the developed neural classification model.

Let's get started by reviewing some key concepts:

- **Sentiment Analysis:**

<center><img src="../imgs/sentiment_class_slide.png" alt="Drawing" width = "500"/></center>

- **Model Estimation Process:**

<center><img src="../imgs/modeling-process.png" alt="Drawing" width = "500"/></center>

- **Word embeddings** are a technique for representing text where <u>different words with similar meaning</u> have a similar real-valued vector representation. 

<center><img src="../imgs/words-embedding-01.png" alt="Drawing" width = "500"/></center>

They are a key breakthrough that has
led to great performance of neural network models on a suite of challenging natural language
processing problems. 

<center><img src="../imgs/words-embeddings-02.png" alt="Drawing" width = "500"/></center>
<center><p></center>
<center><img src="../imgs/words-embeddings-03.png" alt="Drawing" width = "500"/></center>
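
To make the idea of "similar meaning, similar vector" more concrete, the sketch below compares a few hand-made 3-dimensional vectors using cosine similarity. The vectors and the helper name `cosine_similarity` are invented for illustration; real embeddings are learned from data and usually have many more dimensions.

```python
import math

# Toy vectors, invented for illustration; real embeddings are learned from data.
toy_embeddings = {
    'good':  [0.9, 0.1, 0.3],
    'great': [0.8, 0.2, 0.4],
    'awful': [-0.7, 0.9, 0.1],
}

def cosine_similarity(a, b):
    # dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_close = cosine_similarity(toy_embeddings['good'], toy_embeddings['great'])
sim_far = cosine_similarity(toy_embeddings['good'], toy_embeddings['awful'])
print('good vs great: %.3f' % sim_close)
print('good vs awful: %.3f' % sim_far)
```

With these toy numbers, the two positive words score close to 1, while the positive and negative words score below 0.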


## Overview

This demo is divided into the following parts:
1. Movie Review Dataset
2. Data Preparation  
3. Train CNN With Embedding Layer
4. Evaluate Model

## Movie Review Dataset

- The **Movie Review Data** is a collection of movie reviews retrieved from the [imdb.com](https://www.imdb.com/) website in
the early 2000s by Bo Pang and Lillian Lee. 
- The reviews were collected and made available as part of their research on natural language processing. 
- The reviews were originally released in 2002, but an updated and cleaned up version was released in 2004, referred to as *[v2.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/poldata.README.2.0.txt)*. 
- The dataset is comprised of **1,000 positive** and **1,000 negative** movie reviews drawn from an archive of the `rec.arts.movies.reviews` newsgroup hosted at [IMDb](https://www.imdb.com/). The authors refer to this corpus as the ***polarity dataset***.

The data has been cleaned up somewhat, for example:
- The dataset is comprised of only English reviews.
- All text has been converted to lowercase.
- There is white space around punctuation like periods, commas, and brackets.
- Text has been split into one sentence per line.

The data has been used for a few related natural language processing tasks. For classification,
the performance of classical models (such as Support Vector Machines) on the data is in the
range of high 70% to low 80% (e.g. 78%-to-82%). More sophisticated data preparation may see
results as high as 86% with 10-fold cross-validation. This gives us a ballpark of low-to-mid 80s
if we were looking to use this dataset in experiments on modern methods.
>... depending on choice of downstream polarity classifier, we can achieve highly
statistically significant improvement (from 82.8% to 86.4%)
>
>A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

You can download the dataset from here:

Movie Review Polarity Dataset (`review_polarity.tar.gz`, 3MB).
http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

After extracting the archive, you will have a directory called `txt_sentoken` with two sub-directories, *neg* and *pos*, containing the negative and positive reviews respectively. Reviews are stored one per file, with filenames prefixed `cv000` to `cv999` in each of *neg* and *pos*. Next, let's look at loading the text data.

### Rating decision:

<center><img src="../imgs/rating.png" alt="Drawing" width = "600"/></center>

### Further information:

- [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/)
- [A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.](https://xxx.lanl.gov/abs/cs/0409058)
- Dataset Readme [v2.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/poldata.README.2.0.txt) and [v1.1](http://www.cs.cornell.edu/people/pabo/movie-review-data/README.1.1).
- [IMDb](https://en.wikipedia.org/wiki/IMDb) on Wikipedia.

## Data Preparation
In this section, we will look at 3 things:

1. Separation of data into **training** and **test** sets.
2. Loading and cleaning the data to **remove punctuation** and **numbers**.
3. Defining a **vocabulary** of preferred words.

### Split into Train and Test Sets

We are developing a system that can predict the sentiment of a textual movie review as either positive or negative. 
- This means that after the model is developed, we will need to make predictions on new textual reviews - scoring. 
- This will require all of the same data preparation to be performed on those new reviews as is performed on the training data for the
model. 

We will ensure that this constraint is built into the **evaluation** of our models by **splitting the training and test datasets prior to any data preparation**. This means that any knowledge in the test set that could help us better prepare the data (e.g. the words used) is unavailable when preparing the data used for training the model.

That being said, we will use the last **100 positive reviews** and the last **100 negative reviews**
as a **test** set (**200 reviews**) and the remaining **1,800 reviews** as the **training** dataset. 

| Reviews           | Original Set  | Training Set  | Test Set  |
| -------------:    | -------------:| -------------:| ---------:|
| Positive Reviews  | 1,000         |           900 |      100  |
| Negative Reviews  | 1,000         |           900 |      100  |
| Total Reviews     | 2,000         |         1,800 |      200  |

This is a
**90% train, 10% test split of the data**. The split can be imposed easily by using the filenames of the
reviews: reviews named 000 to 899 are training data and reviews named 900 onwards
are test data.
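
The filename rule can be sketched as a small predicate. The helper names `review_number()` and `is_test_review()` are hypothetical, and the sketch assumes every filename starts with `cv`, three digits, and an underscore. The first and last filenames below appear in this demo; the middle two are made up to show the boundary.

```python
import re

def review_number(filename):
    # filenames look like 'cv000_29416.txt'; the three digits after 'cv' are the review number
    match = re.match(r'cv(\d{3})_', filename)
    if match is None:
        raise ValueError('unexpected filename: %s' % filename)
    return int(match.group(1))

def is_test_review(filename):
    # reviews 900-999 are held out for testing, 000-899 are used for training
    return review_number(filename) >= 900

# the two middle names are invented for illustration
names = ['cv000_29416.txt', 'cv899_00000.txt', 'cv900_00000.txt', 'cv993_29565.txt']
for name in names:
    print(name, 'test' if is_test_review(name) else 'train')
```

The later sections of this demo apply the same rule with the shorter check `filename.startswith('cv9')`.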

### Loading Reviews

In this section, we will look at loading individual text files, then processing the directories of
files. 

- We will assume that the review data is downloaded and available in the current working
directory in the folder `txt_sentoken`. 
- We can load an individual text file by **opening** it, **reading** in the ASCII text, and **closing** the file. 

> The text is encoded as ASCII, the most basic encoding. To know more: [A simple explanation of character encoding in python](http://www.cogsci.nl/blog/a-simple-explanation-of-character-encoding-in-python.html)

This is standard file handling stuff. For example, we can load the first negative review file `cv000_29416.txt` as follows:

In [1]:
# load one file
filename = '../data/txt_sentoken/neg/cv000_29416.txt'  
# open the file as read only
file = open(filename, 'r')
# read all text
text = file.read()
# close the file
file.close()
print(text)

plot : two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
what's the deal ? 
watch the movie and " sorta " find out . . . 
critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . 
which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . 
they seem to have taken this pretty neat concept , but executed it terribly . 
so what are the problems with the movie ? 
well , its main problem is that it's simply too jumbled . 
it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no id

This loads the document as ASCII and preserves any white space, like new lines. We can
turn this into a function called `load_doc()` that takes a filename of the document to load and
returns the text.

In [3]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    # return the content of the file
    return text

In [4]:
# specify the file to load
filename = '../data/txt_sentoken/neg/cv000_29416.txt' 
load_doc(filename)

'plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat\'s the deal ? \nwatch the movie and " sorta " find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn\'t snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it\'s simply too jumbled . \nit starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience membe
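
As a possible variation, the same loading step can be written with a `with` statement, which closes the file even if reading fails. The name `load_doc_with()` and its `encoding` parameter are assumptions for this sketch. The default of `'ascii'` follows the note above, and a file containing other characters would need a different value. The sketch reads a small temporary file, so it runs without the dataset.

```python
import os
import tempfile

def load_doc_with(filename, encoding='ascii'):
    # the with-statement closes the file even if reading raises an error
    with open(filename, 'r', encoding=encoding) as file:
        return file.read()

# demonstrate on a small temporary file rather than the real dataset
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.txt')
    with open(path, 'w', encoding='ascii') as file:
        file.write('they get into an accident . \nwhat\'s the deal ? \n')
    text = load_doc_with(path)

# repr() makes the preserved newlines visible
print(repr(text))
```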

We have two directories with 1,000 documents each. We can process each directory in
turn by first getting a list of files in the directory using the `listdir()` function, then loading
each file in turn. For example, we can load each document in the negative directory using the
`load_doc()` function to do the actual loading.

In [None]:
from os import listdir

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# specify directory to load
directory = '../data/txt_sentoken/neg'
# walk through all files in the folder
for filename in listdir(directory):
    # skip files that do not have the right extension
    if not filename.endswith(".txt"):
        continue
    # create the full path of the file to open
    path = directory + '/' + filename
    # load document
    doc = load_doc(path)
    print('Loaded %s' % filename)

We can turn the processing of the documents into a function as well and use it as a template
later for developing a function to clean all documents in a folder. For example, below we define
a `process_docs()` function to do the same thing.

In [None]:
from os import listdir

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load all docs in a directory
def process_docs(directory):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        doc = load_doc(path)
        print('Loaded %s' % filename)

# specify directory to load
directory = '../data/txt_sentoken/neg'
process_docs(directory)

Now that we know how to load the movie review text data, let's look at cleaning it.

### Cleaning Reviews

In this section, we will look at what data cleaning we might want to do to the **movie review
data**. We will assume that we will be using a bag-of-words model or perhaps a word embedding
that does not require too much preparation.

#### Split into Tokens

First, let's load one document and look at the raw tokens split by white space. 

- We will use the `load_doc()` function developed in the previous section. 
- We can use the `split()` function to split the loaded document into tokens separated by white space.

In [7]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load the document
filename = '../data/txt_sentoken/neg/cv993_29565.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
print("Original Review:\n")
print(text)
print("Tokenized Review:\n")
print(tokens)

Original Review:

salaries of hollywood top actors are getting obscenely large these days and many find this to be the main reason for skyrocketing movie budgets . 
actors who demand such salaries might be greedy , but in some instances they are quite justified , because many films would never be watched or even made without their participation . 
proof for that can be found even in the realm of low-budget movies , and one fine example is breakaway , 1995 thriller directed by sean dash and starring ( in ) famous figure skater tonya harding . 
face of tonya harding is most prominently featured on movie's poster , but the main star of the film is terri thompson who plays myra , attractive woman who works as a courier for gangster . 
one day she decides to retire , but her employers are anything but enthusiastic about that . 
realising that her life suddenly became worthless , myra starts running for her life , followed by professional assassins . 
terri thompson being the actual star of 
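
Because the full output is long, here is a smaller illustration of what `split()` does, using two lines from the first negative review shown earlier. It is only a sketch of the white space splitting step.

```python
# a short excerpt taken from the first negative review shown earlier
line = 'what\'s the deal ? \nwatch the movie and " sorta " find out . . . '

# split() with no arguments splits on any run of white space, including newlines
tokens = line.split()
print(tokens)
```

Note that punctuation marks come out as separate tokens because the dataset already places white space around them, while the apostrophe in `what's` stays inside its word.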

#### Removing punctuation, stop words, ...

The text data is already pretty clean; not much preparation is required.

Just looking at the raw tokens can give us a lot of ideas of things to try, such as:

- Removing **punctuation from words** (e.g. 'what's').
- Removing tokens that are **just punctuation** (e.g. '-').
- Removing tokens that contain **numbers** (e.g. '10/10').
- Removing tokens that have **one character** (e.g. 'a').
- Removing tokens that **don't have much meaning** (e.g. 'and').

Some ideas:

- We can filter out punctuation from tokens using **regular expressions**.
- We can remove tokens that are just punctuation or contain numbers by using an `isalpha()` check on each token.
- We can remove **English stop words** using the list provided by **NLTK**.
- We can filter out short tokens by checking their length and remove all words that have a length ≤ 1 character.

In [8]:
from nltk.corpus import stopwords
import string
import re

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load the document
filename = '../data/txt_sentoken/neg/cv993_29565.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
# prepare regex for char filtering
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
# remove punctuation from each word
tokens = [re_punc.sub('', w) for w in tokens]
# remove remaining tokens that are not alphabetic
tokens = [word for word in tokens if word.isalpha()]
# filter out stop words
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if not w in stop_words]
# filter out short tokens
tokens = [word for word in tokens if len(word) > 1]
print("Running the example gives a much cleaner looking list of tokens:\n")
print(tokens)

Running the example gives a much cleaner looking list of tokens:

['salaries', 'hollywood', 'top', 'actors', 'getting', 'obscenely', 'large', 'days', 'many', 'find', 'main', 'reason', 'skyrocketing', 'movie', 'budgets', 'actors', 'demand', 'salaries', 'might', 'greedy', 'instances', 'quite', 'justified', 'many', 'films', 'would', 'never', 'watched', 'even', 'made', 'without', 'participation', 'proof', 'found', 'even', 'realm', 'lowbudget', 'movies', 'one', 'fine', 'example', 'breakaway', 'thriller', 'directed', 'sean', 'dash', 'starring', 'famous', 'figure', 'skater', 'tonya', 'harding', 'face', 'tonya', 'harding', 'prominently', 'featured', 'movies', 'poster', 'main', 'star', 'film', 'terri', 'thompson', 'plays', 'myra', 'attractive', 'woman', 'works', 'courier', 'gangster', 'one', 'day', 'decides', 'retire', 'employers', 'anything', 'enthusiastic', 'realising', 'life', 'suddenly', 'became', 'worthless', 'myra', 'starts', 'running', 'life', 'followed', 'professional', 'assassins', 'te

In [9]:
# English stop words in NLTK
print(stop_words)

{'through', 'it', 'before', 'once', "don't", 'yours', 'wasn', 'because', 'doesn', 'a', 'has', 'ours', 'we', 'yourself', 're', 'what', "she's", 'just', 'i', 'which', 'hasn', 'down', 'its', 'him', 'between', 'will', "haven't", 'd', 'as', 'you', 'y', 'too', 'myself', 'at', 'about', 'with', 'aren', 'itself', 'for', 'won', 'few', 'had', 'shouldn', 'doing', 'is', 'over', 'if', "needn't", 'the', 'were', 'does', 'be', 'out', "weren't", 'me', 'above', 'they', "wasn't", 'so', 'after', 'who', 'and', 'an', 'don', "couldn't", 'not', 'by', 'all', 'any', "you'll", 'did', 'he', "that'll", "shan't", 'below', 'she', 'was', 'from', "hasn't", 'weren', 'having', 'each', 'are', 'needn', 'whom', 't', 'll', "doesn't", 'own', 've', 'have', 'up', 'both', "shouldn't", 'into', 'there', 'my', 'those', 'these', 'am', "it's", 'further', 'why', 'shan', 'such', 'hers', 'this', 'should', 'some', 'been', "you're", 'our', "aren't", 'his', 'can', 'than', 'of', 'ourselves', 'theirs', 'other', 'to', 'haven', 'or', 'more', '
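
To see what each filter contributes, the sketch below applies the four steps one at a time to a handful of made-up tokens. It uses a tiny hand-written stop word set (`toy_stop_words`, an assumption for this sketch) so that it runs without NLTK. The demo itself uses the full NLTK list.

```python
import re
import string

# A tiny stand-in stop word list (an assumption for this sketch);
# the demo itself uses the full NLTK English list.
toy_stop_words = {'the', 'and', 'a', 'is', 'in'}

tokens = ['the', 'low-budget', 'thriller', ',', '1995', 'is', "what's", 'a', '10/10', 'x']

re_punc = re.compile('[%s]' % re.escape(string.punctuation))
# 1. strip punctuation characters from every token
step1 = [re_punc.sub('', w) for w in tokens]
# 2. drop empty tokens and tokens that are not purely alphabetic
step2 = [w for w in step1 if w.isalpha()]
# 3. drop stop words
step3 = [w for w in step2 if w not in toy_stop_words]
# 4. drop one-character tokens
step4 = [w for w in step3 if len(w) > 1]

for name, step in [('punctuation', step1), ('alphabetic', step2),
                   ('stop words', step3), ('short', step4)]:
    print(name, step)
```

The order matters here: stripping punctuation first turns `,` into an empty string and `10/10` into `1010`, and the `isalpha()` check then removes both.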

We can put this into a function called `clean_doc()` and test it on another review, this time
a positive review.

In [10]:
from nltk.corpus import stopwords
import string
import re

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load the document
filename = '../data/txt_sentoken/pos/cv000_29590.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print("Again, the cleaning procedure seems to produce a good set of tokens, at least as a first cut:\n")
print(tokens)

Again, the cleaning procedure seems to produce a good set of tokens, at least as a first cut:

['films', 'adapted', 'comic', 'books', 'plenty', 'success', 'whether', 'theyre', 'superheroes', 'batman', 'superman', 'spawn', 'geared', 'toward', 'kids', 'casper', 'arthouse', 'crowd', 'ghost', 'world', 'theres', 'never', 'really', 'comic', 'book', 'like', 'hell', 'starters', 'created', 'alan', 'moore', 'eddie', 'campbell', 'brought', 'medium', 'whole', 'new', 'level', 'mid', 'series', 'called', 'watchmen', 'say', 'moore', 'campbell', 'thoroughly', 'researched', 'subject', 'jack', 'ripper', 'would', 'like', 'saying', 'michael', 'jackson', 'starting', 'look', 'little', 'odd', 'book', 'graphic', 'novel', 'pages', 'long', 'includes', 'nearly', 'consist', 'nothing', 'footnotes', 'words', 'dont', 'dismiss', 'film', 'source', 'get', 'past', 'whole', 'comic', 'book', 'thing', 'might', 'find', 'another', 'stumbling', 'block', 'hells', 'directors', 'albert', 'allen', 'hughes', 'getting', 'hughes', 'b

There are many more cleaning steps we could take and I leave them to your imagination.
Next, let's look at how we can manage a preferred vocabulary of tokens.

### Define a Vocabulary

It is important to define a vocabulary of known words when using a text model. The more words, the larger the representation of documents; therefore, it is important to constrain the
words to only those believed to be predictive. This is difficult to know beforehand and often it is important to test different hypotheses about how to construct a useful vocabulary. 

We can develop a vocabulary as a `Counter`, which is a dictionary mapping of words and their count that allows us to easily update and query. Each document can be added to the
counter (a new function called `add_doc_to_vocab()`) and we can step over all of the reviews in
the negative directory and then the positive directory (a new function called `process_docs()`). The complete example is listed below.

In [None]:
# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    # load doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update counts
    vocab.update(tokens)
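
As a quick illustration of how a `Counter` accumulates counts across documents, the sketch below updates one with two short, made-up token lists. These lists stand in for the output of `clean_doc()`.

```python
from collections import Counter

vocab = Counter()
# each update() call adds the counts of one (already cleaned) document
vocab.update(['film', 'good', 'story', 'film'])
vocab.update(['film', 'bad', 'story'])

print(len(vocab))            # number of distinct words seen so far
print(vocab['film'])         # count for a single word
print(vocab.most_common(2))  # the two most frequent words with their counts
```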

Finally, we can use our template above for processing all documents in a directory, called
`process_docs()`, and update it to call `add_doc_to_vocab()`.

In [None]:
# load all docs in a directory
def process_docs(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)

We can put all of this together and develop a full vocabulary from all documents in the
dataset.

In [11]:
import string
import re
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    # load doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update counts
    vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)

# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('../data/txt_sentoken/pos', vocab)
process_docs('../data/txt_sentoken/neg', vocab)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))

44276
[('film', 7983), ('one', 4946), ('movie', 4826), ('like', 3201), ('even', 2262), ('good', 2080), ('time', 2041), ('story', 1907), ('films', 1873), ('would', 1844), ('much', 1824), ('also', 1757), ('characters', 1735), ('get', 1724), ('character', 1703), ('two', 1643), ('first', 1588), ('see', 1557), ('way', 1515), ('well', 1511), ('make', 1418), ('really', 1407), ('little', 1351), ('life', 1334), ('plot', 1288), ('people', 1269), ('could', 1248), ('bad', 1248), ('scene', 1241), ('movies', 1238), ('never', 1201), ('best', 1179), ('new', 1140), ('scenes', 1135), ('man', 1131), ('many', 1130), ('doesnt', 1118), ('know', 1092), ('dont', 1086), ('hes', 1024), ('great', 1014), ('another', 992), ('action', 985), ('love', 977), ('us', 967), ('go', 952), ('director', 948), ('end', 946), ('something', 945), ('still', 936)]


Running the example shows that we have a vocabulary of **44,276** unique words. We can also see a sample of the 50 most used words in the movie reviews; the top 3 words are *film*, *one*, and *movie*.

>Note that this vocabulary was constructed based on only those reviews in the training dataset.

We can step through the vocabulary and remove all words that have a low occurrence, such as only being used once in all training reviews. For example, the following snippet will retrieve only the tokens that appear **2 or more times** in those reviews.

In [None]:
# keep tokens with a min occurrence
min_occurrence = 2
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))

Running the above example with this addition shows that the vocabulary size drops by a little over 40%, from **44,276** to **25,767** words.
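
The filter is easy to check on a toy `Counter` (the words and counts below are invented). The last lines work out the relative reduction for the two vocabulary sizes reported above.

```python
from collections import Counter

# invented counts, for illustration only
toy_vocab = Counter({'film': 5, 'story': 2, 'aberdeen': 1, 'libido': 1})
min_count = 2
# keep only the words that occur at least min_count times
kept = [word for word, count in toy_vocab.items() if count >= min_count]
print(kept)

# relative reduction for the vocabulary sizes reported above
before, after = 44276, 25767
reduction = (before - after) / before
print('%.1f%%' % (100 * reduction))
```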

Finally, the vocabulary can be saved to a new file called `vocab.txt` that we can later load
and use to filter movie reviews prior to encoding them for modeling. We define a new function called `save_list()` that saves the vocabulary to file, with one word per line. For example:

In [12]:
# save list to file
def save_list(lines, filename):
    # convert lines to a single blob of text
    data = '\n'.join(lines)
    # open file
    file = open(filename, 'w')
    # write text
    file.write(data)
    # close file
    file.close()

# save tokens to a vocabulary file
save_list(tokens, '../output/vocab.txt')

Running the min occurrence filter on the vocabulary and saving it to file, you should now
have a new file called `vocab.txt` with only the words we are interested in. The order of words in your file will differ, but should look something like the following:

```
aberdeen
dupe
burt
libido
hamlet
arlene
available
corners
web
columbia
...
```
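
A quick way to check the file format is a round trip: save a few words, read them back, and compare. The sketch below does this in a temporary directory, so it does not touch `../output/vocab.txt`. The helper `load_vocab()` is a hypothetical name, and `save_list()` is rewritten here with a `with` statement.

```python
import os
import tempfile

def save_list(lines, filename):
    # write one word per line
    with open(filename, 'w') as file:
        file.write('\n'.join(lines))

def load_vocab(filename):
    # hypothetical helper: read the file back as a set of words
    with open(filename, 'r') as file:
        return set(file.read().split())

words = ['aberdeen', 'dupe', 'burt', 'libido']
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'vocab.txt')
    save_list(words, path)
    loaded = load_vocab(path)

print(loaded == set(words))
```

Loading into a set discards the order of the words, which is fine here because the vocabulary is only used for membership checks.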

We are now ready to look at extracting features from the reviews ready for modeling.

## Train CNN with Embedding Layer

In this section, we will look at 4 things:

1. Load the vocabulary and filter out tokens not in the vocabulary
2. Combine the reviews into a single train or test set and define the class labels
3. Encode each document as a sequence of integers
4. Define the neural network model

In this section, we will learn a **word embedding while training a convolutional neural network** on the classification problem. 

>A **word embedding** is a way of representing text where each word in
the vocabulary is represented by a real valued vector in a high-dimensional space. The vectors
are learned in such a way that words that have similar meanings will have similar representation
in the vector space (close in the vector space). This is a more expressive representation for text
than more classical methods like bag-of-words, where relationships between words or tokens are
ignored, or forced in bigram and trigram approaches.

<center><img src="../imgs/words-embedding-01.png" alt="Drawing" width = "500"/></center>
<center><p></center>
<center><img src="../imgs/words-embeddings-02.png" alt="Drawing" width = "500"/></center>
<center><p></center>
<center><img src="../imgs/words-embeddings-03.png" alt="Drawing" width = "500"/></center>
<center><p></center>
<center><img src="../imgs/word-emb-0.png" alt="Drawing" width = "500"/></center>
<center><p></center>
<center><img src="../imgs/word-emb-1.png" alt="Drawing" width = "500"/></center>

More information [here](https://stats.stackexchange.com/questions/324992/how-the-embedding-layer-is-trained-in-keras-embedding-layer)

Accuracy using the Movie Review Dataset:

| Technique                   | Accuracy (Test) |
| -------------:              | -------------:  |
| Train embedding layer       | 88%             |
| Train word2vec embedding    | 57.5%           |
| Use a pre-trained embedding | 76%             |


The real valued vector representation for words can be learned while training the neural
network. We can do this in the **Keras** deep learning library using the **`Embedding` layer**. 
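
Conceptually, an embedding layer can be pictured as a lookup table: one row of numbers per integer word index. The sketch below shows only that lookup, using plain Python lists and random starting values. It is not how Keras implements the layer, it leaves out training, and the sizes, the value range, and the name `embed()` are arbitrary choices for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

vocab_size = 5      # integer indices 0..4 (sizes chosen arbitrarily for this sketch)
embedding_dim = 3   # length of each word vector

# one row of small random numbers per integer index
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(sequence):
    # replace each integer in the sequence with its row of the matrix
    return [embedding_matrix[index] for index in sequence]

encoded_review = [2, 4, 2, 0]
vectors = embed(encoded_review)
print(len(vectors), len(vectors[0]))
```

The same integer always maps to the same row, so repeated words share one vector. During training it is those rows that get adjusted.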

### Load the vocabulary and filter out tokens not in the vocabulary

- The first step is to **load the vocabulary**. We will use it to filter out words from movie reviews that
we are not interested in. 

You should have a local file called `vocab.txt` with one word per line. We can load that file and build a vocabulary
as a set for checking the validity of tokens.

In [None]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load the vocabulary
vocab_filename = '../output/vocab.txt'
vocab = load_doc(vocab_filename)
vocab = set(vocab.split())

- Next, we need to **load all of the training data** movie reviews. 

For that we can adapt the `process_docs()` function from the previous section to:

 - load the documents, 
 - clean them, and 
 - return them as a **list of strings**, with **one document per string**. 
 
> We want each document to be a string for easy encoding as a sequence of integers later. 

Cleaning the document involves:

- splitting each review based on white space, 
- removing punctuation, and then 
- filtering out all tokens not in the vocabulary. 

The updated `clean_doc()` function is listed below.

In [None]:
# turn a doc into clean tokens
def clean_doc(doc, vocab):
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # filter out tokens not in vocab
    tokens = [w for w in tokens if w in vocab]
    tokens = ' '.join(tokens)
    return tokens
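
A small usage sketch may help here. It repeats the function so that it runs on its own, and it uses a made-up four-word vocabulary and a made-up review sentence.

```python
import re
import string

def clean_doc(doc, vocab):
    # split on white space, strip punctuation, keep only known words
    tokens = doc.split()
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [re_punc.sub('', w) for w in tokens]
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)

# vocabulary and sentence invented for illustration
toy_vocab = {'film', 'bad', 'plot', 'jumbled'}
review = 'the film has a bad plot , and it\'s simply too jumbled .'
print(clean_doc(review, toy_vocab))
```

Because the vocabulary was built from tokens that already had stop words, short tokens, and non-alphabetic tokens removed, the single membership check filters those out as well.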

### Combine the reviews into a single train or test set and define the class labels

The updated `process_docs()` can then call the `clean_doc()` for each document in a given directory.

In [None]:
# load all docs in a directory
def process_docs(directory, vocab, is_train):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)
        # clean doc
        tokens = clean_doc(doc, vocab)
        # add to list
        documents.append(tokens)
    return documents

- We can call the `process_docs` function for both the `neg` and `pos` directories and **combine the reviews into a single train or test dataset**. 

- We also can **define the class labels** for the dataset.

The `load_clean_dataset()` function below will load all reviews and prepare class labels for the training or test dataset.

In [None]:
# array() comes from NumPy
from numpy import array

# load and clean a dataset
def load_clean_dataset(vocab, is_train):
    # load documents
    neg = process_docs('../data/txt_sentoken/neg', vocab, is_train)
    pos = process_docs('../data/txt_sentoken/pos', vocab, is_train)
    docs = neg + pos
    # prepare labels
    labels = array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
    return docs, labels

### Encode each document as a sequence of integers

The next step is to **encode each document as a sequence of integers**. 

- The **Keras `Embedding` layer** requires integer inputs where each integer maps to a single token that has a specific
real-valued vector representation within the embedding. 
- These vectors are random at the beginning of training, but during training become meaningful to the network. 
- We can encode the training documents as sequences of integers using the `Tokenizer` class in the Keras API.

First, we must construct an instance of the class then train it on all documents in the training
dataset. In this case, it develops a vocabulary of all tokens in the training dataset and develops
a consistent mapping from words in the vocabulary to unique integers. We could just as easily
develop this mapping ourselves using our vocabulary file. The `create_tokenizer()` function
below will prepare a `Tokenizer` from the training data.

In [None]:
# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer
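
To make the word-to-integer mapping concrete, here is a simplified pure-Python stand-in for the idea. It is not the Keras implementation: the helper names `build_word_index()` and `texts_to_ints()` are hypothetical, ties between equally frequent words are broken alphabetically as an arbitrary choice, and no lowercasing or punctuation filtering is done.

```python
from collections import Counter

def build_word_index(lines):
    # count how often each whitespace-separated token occurs
    counts = Counter(word for line in lines for word in line.split())
    # the most frequent word gets index 1; index 0 is left unused (reserved for padding)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ranked, start=1)}

def texts_to_ints(word_index, lines):
    # replace each known word by its integer; unknown words are skipped
    return [[word_index[w] for w in line.split() if w in word_index] for line in lines]

toy_docs = ['great film great cast', 'boring film']
word_index = build_word_index(toy_docs)
print(word_index)
print(texts_to_ints(word_index, toy_docs))
```

Because the mapping is built from the training documents only, words that appear only in later documents have no integer and are dropped during encoding.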

Now that the mapping of words to integers has been prepared, we can use it to encode the
reviews in the training dataset. We can do that by calling the `texts_to_sequences()` function
on the Tokenizer. We also need to ensure that all documents have the same length. This is a
requirement of Keras for efficient computation. We could truncate reviews to the smallest size
or zero-pad (pad with the value 0) reviews to the maximum length, or some hybrid. In this case,
we will pad all reviews to the length of the longest review in the training dataset. First, we can
find the longest review using the `max()` function on the training dataset and take its length.
We can then call the Keras function `pad_sequences()` to pad the sequences to the maximum
length by adding 0 values on the end.

In [None]:
max_length = max([len(s.split()) for s in train_docs])
print('Maximum length: %d' % max_length)

We can then use the maximum length as a parameter to a function to integer encode and
pad the sequences.

In [None]:
# integer encode and pad documents
def encode_docs(tokenizer, max_length, docs):
    # integer encode
    encoded = tokenizer.texts_to_sequences(docs)
    # pad sequences
    padded = pad_sequences(encoded, maxlen=max_length, padding='post')
    return padded
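
The effect of padding at the end can be sketched in plain Python. The helper `pad_post()` below is hypothetical and only mimics the behaviour we rely on: zeros are appended at the end, and sequences longer than `maxlen` lose items from the front, which, as far as we know, is the Keras default (`truncating='pre'`).

```python
def pad_post(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        # keep only the last `maxlen` items if the sequence is too long
        seq = list(seq)[-maxlen:]
        # append the padding value at the end until the row has `maxlen` items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

encoded = [[2, 1, 2, 4], [3, 1], [5, 6, 7, 8, 9, 10]]
rows = pad_post(encoded, maxlen=4)
print(rows)
```

Under that assumption, a test review longer than the longest training review would lose its first words, since `max_length` is computed from the training data only.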

### Define the neural network model

<center><img src="../imgs/1D-basics.png" alt="Drawing" width = "800"/></center>
<center><p></center>
<center><img src="../imgs/cnn-schema.png" alt="Drawing" width = "800"/></center>

We are now ready to define our neural network model. 

- The model will use an **`Embedding` layer** as the **first hidden layer**. 
- The `Embedding` layer requires the specification of:
    - the vocabulary size, 
    - the size of the real-valued vector space, and 
    - the maximum length of input documents. 

The **vocabulary size** is the total number of words in our vocabulary, plus one because the `Tokenizer` assigns word indices starting at 1 and index 0 is reserved (here it is used for padding).
This could be the vocab set length or the size of the vocab within the tokenizer used to integer
encode the documents, for example:

In [None]:
# define vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary size: %d' % vocab_size)


- We will use a **100-dimensional vector space**, but you could try other values, such as 50 or 150. 
- Finally, the maximum document length was calculated above in the `max_length` variable used during padding. 
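
Conceptually, an embedding layer is a lookup table with one row per integer index. The pure-Python sketch below illustrates only that idea; the helper names, the toy vocabulary size of 5, and the random initialization range are assumptions for this example and do not reproduce the Keras layer.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    # one row of small random values per integer index (row 0 is the padding index)
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]

def embed(table, padded_doc):
    # look up the vector of every integer in the padded document
    return [table[i] for i in padded_doc]

table = make_embedding_table(vocab_size=5, dim=100)
doc = [2, 1, 2, 4, 0, 0]                 # a toy padded document of length 6
vectors = embed(table, doc)
print(len(vectors), len(vectors[0]))     # sequence length, embedding size
```

The same integer always maps to the same vector, which is why the table needs `vocab_size` rows, including one for index 0.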

The complete model definition is listed below including the `Embedding`
layer. We use a **Convolutional Neural Network (CNN)** as they have proven to be successful
at document classification problems. 

- A conservative CNN configuration is used with **32 filters** (parallel fields for processing words) and 
- a **kernel size of 8** 
- with a rectified linear **(`relu`) activation function**. 
- This is followed by a **pooling layer** that reduces the output of the convolutional layer by half.
- Next, the 2D output from the CNN part of the model is flattened to one long 1D vector to
represent the *features* extracted by the CNN. 

The back-end of the model is a standard Multilayer Perceptron that interprets the CNN features. The output layer uses a sigmoid activation function to output a value between 0 and 1, where values near 0 indicate negative sentiment and values near 1 indicate positive sentiment in the review.

In [None]:
# define the model
def define_model(vocab_size, max_length):
    model = Sequential()
    model.add(Embedding(vocab_size, 100, input_length=max_length))
    model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # compile network
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='../output/model.png', show_shapes=True)
    return model

Running just this piece provides a summary of the defined network. We can see that the
Embedding layer expects documents with a length of 1,317 words as input and encodes each
word in the document as a 100 element vector.

```
...
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 1317, 100) 2576800
_________________________________________________________________
conv1d_1 (Conv1D) (None, 1310, 32) 25632
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 655, 32) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 20960) 0
_________________________________________________________________
dense_1 (Dense) (None, 10) 209610
_________________________________________________________________
dense_2 (Dense) (None, 1) 11
=================================================================
Total params: 2,812,053
Trainable params: 2,812,053
Non-trainable params: 0
_________________________________________________________________
```
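
The shapes and parameter counts in this summary can be checked by hand. The sketch below redoes the arithmetic in plain Python, assuming the Keras defaults of `'valid'` padding and stride 1 for `Conv1D` and non-overlapping windows for `MaxPooling1D`; the function name `cnn_summary()` is hypothetical.

```python
def cnn_summary(vocab_size, max_length, embed_dim=100, filters=32,
                kernel_size=8, pool_size=2, dense_units=10):
    # Embedding: one vector of size embed_dim per vocabulary index
    embedding_params = vocab_size * embed_dim
    # Conv1D: the window slides over the sequence one step at a time
    conv_len = max_length - kernel_size + 1
    conv_params = (kernel_size * embed_dim + 1) * filters   # +1 for each filter's bias
    # MaxPooling1D: keeps one value per non-overlapping window
    pool_len = conv_len // pool_size
    # Flatten: all pooled positions times all filters
    flat_len = pool_len * filters
    dense_params = (flat_len + 1) * dense_units
    output_params = dense_units + 1
    total = embedding_params + conv_params + dense_params + output_params
    return conv_len, pool_len, flat_len, total

print(cnn_summary(vocab_size=25768, max_length=1317))
```

For a vocabulary of 25,768 and a maximum length of 1,317 this gives sequence lengths of 1,310 and 655, a flattened vector of 20,960 values, and 2,812,053 parameters in total, matching the listing above.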

A plot of the defined model is then saved to file with the name `model.png`.

<center><img src="../output/model.png" alt="Drawing" width = "450"/></center>

Next, we fit the network on the training data. We use a binary cross entropy loss function
because the problem we are learning is a binary classification problem. The efficient Adam
implementation of stochastic gradient descent is used and we keep track of accuracy in addition
to loss during training. The model is trained for 10 epochs, or 10 passes through the training
data. The network configuration and training schedule were found with a little trial and error,
but are by no means optimal for this problem. If you can get better results with a different
configuration, let me know.

In [None]:
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)
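
To see what the binary cross entropy loss measures, here is a small worked computation in plain Python. The labels and probabilities are made up, and the clipping constant `eps` is an assumption added to avoid `log(0)`; this is a sketch of the formula, not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # clip the probability so that log() is always defined
        p = min(max(p, eps), 1 - eps)
        # penalize the log-probability assigned to the wrong class
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1]
confident = [0.9, 0.1, 0.8]   # mostly correct, confident predictions
unsure = [0.5, 0.5, 0.5]      # an undecided model
loss_confident = binary_crossentropy(labels, confident)
loss_unsure = binary_crossentropy(labels, unsure)
print(loss_confident, loss_unsure)
```

An undecided model that always outputs 0.5 scores a loss of ln 2 (about 0.693), so a training loss well below that value indicates the model is learning something from the data.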

After the model is fit, it is saved to a file named `model.h5` for later evaluation.

In [None]:
# save the model
model.save('../output/model.h5')

We can tie all of this together. The complete code listing is provided below.

In [13]:
import string
import re
from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc, vocab):
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # filter out tokens not in vocab
    tokens = [w for w in tokens if w in vocab]
    tokens = ' '.join(tokens)
    return tokens

# load all docs in a directory
def process_docs(directory, vocab, is_train):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)
        # clean doc
        tokens = clean_doc(doc, vocab)
        # add to list
        documents.append(tokens)
    return documents

# load and clean a dataset
def load_clean_dataset(vocab, is_train):
    # load documents
    neg = process_docs('../data/txt_sentoken/neg', vocab, is_train)
    pos = process_docs('../data/txt_sentoken/pos', vocab, is_train)
    docs = neg + pos
    # prepare labels
    labels = array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
    return docs, labels

# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# integer encode and pad documents
def encode_docs(tokenizer, max_length, docs):
    # integer encode
    encoded = tokenizer.texts_to_sequences(docs)
    # pad sequences
    padded = pad_sequences(encoded, maxlen=max_length, padding='post')
    return padded

# define the model
def define_model(vocab_size, max_length):
    model = Sequential()
    model.add(Embedding(vocab_size, 100, input_length=max_length))
    model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # compile network
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='../output/model.png', show_shapes=True)
    return model

# load the vocabulary
vocab_filename = '../output/vocab.txt'
vocab = load_doc(vocab_filename)
vocab = set(vocab.split())
# load training data
train_docs, ytrain = load_clean_dataset(vocab, True)
# create the tokenizer
tokenizer = create_tokenizer(train_docs)
# define vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary size: %d' % vocab_size)
# calculate the maximum sequence length
max_length = max([len(s.split()) for s in train_docs])
print('Maximum length: %d' % max_length)
# encode data
Xtrain = encode_docs(tokenizer, max_length, train_docs)
# define model
model = define_model(vocab_size, max_length)
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)
# save the model
model.save('../output/model.h5')

Using TensorFlow backend.


Vocabulary size: 342
Maximum length: 402
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 402, 100)          34200     
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 395, 32)           25632     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 197, 32)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 6304)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                63050     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 11        
Total params: 122,893
Trainable params: 122,893
Non-trainable params: 0
_____________________________

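Running the example first prints the size of the training dataset vocabulary (342 in the output above) and the maximum input sequence length in words (402), followed by the model summary. These values depend on the vocabulary file and data used, which is why they differ from the 25,768 and 1,317 shown in the earlier summary listing. The example should run in a few minutes
and the fit model will be saved to file.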

## Evaluate Model and Make Predictions on New Data

In this section, we will do two things:

1. **Evaluate the trained model** on the test set.
2. Use it to **make predictions on new data**.

First, we can use the built-in `evaluate()` function to estimate the skill of the model on both
the **training** and **test** datasets. This requires that we load and encode both the training and test
datasets.

In [None]:
# load all reviews
train_docs, ytrain = load_clean_dataset(vocab, True)
test_docs, ytest = load_clean_dataset(vocab, False)
# create the tokenizer
tokenizer = create_tokenizer(train_docs)
# define vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary size: %d' % vocab_size)
# calculate the maximum sequence length
max_length = max([len(s.split()) for s in train_docs])
print('Maximum length: %d' % max_length)
# encode data
Xtrain = encode_docs(tokenizer, max_length, train_docs)
Xtest = encode_docs(tokenizer, max_length, test_docs)

We can then load the model and evaluate it on both datasets and print the accuracy.

In [None]:
# load the model
model = load_model('../output/model.h5')
# evaluate model on training dataset
_, acc = model.evaluate(Xtrain, ytrain, verbose=0)
print('Train Accuracy: %f' % (acc*100))
# evaluate model on test dataset
_, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))
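
The accuracy reported by `evaluate()` is the share of reviews whose thresholded prediction matches the label. Below is a minimal sketch of that calculation, assuming a decision threshold of 0.5 and using made-up sigmoid outputs; the helper `accuracy_percent()` is hypothetical.

```python
def accuracy_percent(y_true, y_prob, threshold=0.5):
    # a probability above the threshold counts as class 1 (positive)
    predicted = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(1 for y, yhat in zip(y_true, predicted) if y == yhat)
    return 100.0 * correct / len(y_true)

y_true = [0, 0, 1, 1]
y_prob = [0.2, 0.6, 0.7, 0.4]   # made-up sigmoid outputs
acc = accuracy_percent(y_true, y_prob)
print('Accuracy: %.2f' % acc)
```

Comparing this number on the training and test sets is a simple check for overfitting: a large gap means the model has memorized the training reviews better than it generalizes.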

New data must then be prepared using the same text cleaning and encoding schemes as were
used on the training dataset. Once prepared, a prediction can be made by calling the `predict()`
function on the model. The function below named `predict_sentiment()` will clean, encode, and pad
a given movie review text and return a prediction in terms of both the percentage and a label.

In [None]:
# classify a review as negative or positive
def predict_sentiment(review, vocab, tokenizer, max_length, model):
    # clean review
    line = clean_doc(review, vocab)
    # encode and pad review
    padded = encode_docs(tokenizer, max_length, [line])
    # predict sentiment
    yhat = model.predict(padded, verbose=0)
    # retrieve predicted percentage and label
    percent_pos = yhat[0,0]
    if round(percent_pos) == 0:
        return (1-percent_pos), 'NEGATIVE'
    return percent_pos, 'POSITIVE'
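
The labelling rule at the end of `predict_sentiment()` can be tried in isolation. The sketch below applies the same rule to a few made-up probabilities; `label_from_probability()` is a hypothetical helper that takes the sigmoid output directly instead of calling the model.

```python
def label_from_probability(percent_pos):
    # when the output rounds to 0 the review is called negative,
    # and 1 - p is reported as the confidence in that label
    if round(percent_pos) == 0:
        return (1 - percent_pos), 'NEGATIVE'
    return percent_pos, 'POSITIVE'

for p in (0.10, 0.49, 0.51, 0.95):
    confidence, label = label_from_probability(p)
    print('p=%.2f -> %s (%.1f%%)' % (p, label, confidence * 100))
```

With this rule the reported percentage is never below 50%, so a value close to 50% means the model is nearly undecided rather than confident in the label.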

We can test out this model with two ad hoc movie reviews. The complete example is listed
below.

In [14]:
import string
import re
from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc, vocab):
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # filter out tokens not in vocab
    tokens = [w for w in tokens if w in vocab]
    tokens = ' '.join(tokens)
    return tokens

# load all docs in a directory
def process_docs(directory, vocab, is_train):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)
        # clean doc
        tokens = clean_doc(doc, vocab)
        # add to list
        documents.append(tokens)
    return documents

# load and clean a dataset
def load_clean_dataset(vocab, is_train):
    # load documents
    neg = process_docs('../data/txt_sentoken/neg', vocab, is_train)
    pos = process_docs('../data/txt_sentoken/pos', vocab, is_train)
    docs = neg + pos
    # prepare labels
    labels = array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
    return docs, labels

# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# integer encode and pad documents
def encode_docs(tokenizer, max_length, docs):
    # integer encode
    encoded = tokenizer.texts_to_sequences(docs)
    # pad sequences
    padded = pad_sequences(encoded, maxlen=max_length, padding='post')
    return padded

# classify a review as negative or positive
def predict_sentiment(review, vocab, tokenizer, max_length, model):
    # clean review
    line = clean_doc(review, vocab)
    # encode and pad review
    padded = encode_docs(tokenizer, max_length, [line])
    # predict sentiment
    yhat = model.predict(padded, verbose=0)
    # retrieve predicted percentage and label
    percent_pos = yhat[0,0]
    if round(percent_pos) == 0:
        return (1-percent_pos), 'NEGATIVE'
    return percent_pos, 'POSITIVE'

# load the vocabulary
vocab_filename = '../output/vocab.txt'
vocab = load_doc(vocab_filename)
vocab = set(vocab.split())
# load all reviews
train_docs, ytrain = load_clean_dataset(vocab, True)
test_docs, ytest = load_clean_dataset(vocab, False)
# create the tokenizer
tokenizer = create_tokenizer(train_docs)
# define vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary size: %d' % vocab_size)
# calculate the maximum sequence length
max_length = max([len(s.split()) for s in train_docs])
print('Maximum length: %d' % max_length)
# encode data
Xtrain = encode_docs(tokenizer, max_length, train_docs)
Xtest = encode_docs(tokenizer, max_length, test_docs)
# load the model
model = load_model('../output/model.h5')
# evaluate model on training dataset
_, acc = model.evaluate(Xtrain, ytrain, verbose=0)
print('Train Accuracy: %.2f' % (acc*100))
# evaluate model on test dataset
_, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %.2f' % (acc*100))

# test positive text
text = 'Everyone will enjoy this film. I love it, recommended!'
percent, sentiment = predict_sentiment(text, vocab, tokenizer, max_length, model)
print('Review: [%s]\nSentiment: %s (%.3f%%)' % (text, sentiment, percent*100))
# test a second positive text
text = 'This is a good movie. Watch it. I love it.'
percent, sentiment = predict_sentiment(text, vocab, tokenizer, max_length, model)
print('Review: [%s]\nSentiment: %s (%.3f%%)' % (text, sentiment, percent*100))

Vocabulary size: 342
Maximum length: 402
Train Accuracy: 99.33
Test Accuracy: 68.00
Review: [Everyone will enjoy this film. I love it, recommended!]
Sentiment: NEGATIVE (51.151%)
Review: [This is a good movie. Watch it. I love it.]
Sentiment: NEGATIVE (50.816%)


In [None]:
# load the document
filename_score = '../data/txt_score/pos/cv_025.txt'
text_score = load_doc(filename_score)
# test positive text
text = text_score
percent, sentiment = predict_sentiment(text, vocab, tokenizer, max_length, model)
print('Review: [%s]\nSentiment: %s (%.3f%%)' % (text, sentiment, percent*100))

## Make Predictions on new data

In this section, we will use the trained model to **make predictions on new data**.

The new data comes from the [Rotten Tomatoes](https://www.rottentomatoes.com/) movie review aggregator. 

> Rotten Tomatoes is an American review-aggregation website for film and television. The company was launched in August 1998 by three undergraduate students at the University of California, Berkeley: Senh Duong, Patrick Y. Lee and Stephen Wang. The name "Rotten Tomatoes" derives from the practice of audiences throwing rotten tomatoes when disapproving of a poor stage performance. (from [Wikipedia](https://en.wikipedia.org/wiki/Rotten_Tomatoes))

The `score.xlsx` located in `../notes/` contains:

- 3 Rotten Tomatoes Tomatometer-approved critics
- 10 positive and 10 negative reviews per critic (60 reviews)
- 85% accurate prediction
- We selected the last 10 positive and last 10 negative movie reviews from each critic. The great majority are about movies released in 2018, and a few are about movies released in 2017.

| Critic                 | Positive Reviews  | Negative Reviews  | Total Reviews  |
| -------------:         | -------------:| -------------:| ---------:|
| Sara Michelle Fetters  | 10         |           10 |      20  |
| Jeffrey M. Anderson    | 10         |           10 |      20  |
| Pete Hammond           | 10         |           10 |      20  |

### Score positive reviews:

In [None]:
# score every review in a directory of positive reviews
def score_docs(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # create the full path of the file to open
        path = directory + '/' + filename
        # load doc
        text_score = load_doc(path)
        # predict sentiment (uses the global tokenizer, max_length and model)
        percent, sentiment = predict_sentiment(text_score, vocab, tokenizer, max_length, model)
        print('Actual Sentiment: POSITIVE Predicted Sentiment: %s (%.3f%%)' % (sentiment, percent*100))

# score all positive reviews
score_docs('../data/txt_score/pos', vocab)

### Score negative reviews:

In [None]:
# score every review in a directory of negative reviews
def score_docs(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # create the full path of the file to open
        path = directory + '/' + filename
        # load doc
        text_score = load_doc(path)
        # predict sentiment (uses the global tokenizer, max_length and model)
        percent, sentiment = predict_sentiment(text_score, vocab, tokenizer, max_length, model)
        print('Actual Sentiment: NEGATIVE Predicted Sentiment: %s (%.3f%%)' % (sentiment, percent*100))

# score all negative reviews
score_docs('../data/txt_score/neg', vocab)
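
Rather than reading the printed lines one by one, the hits can be tallied into a single accuracy figure. The sketch below shows one way to do that; `score_accuracy()` and the keyword-based `toy_predict()` are hypothetical stand-ins, and the review texts are made up. In the notebook, `toy_predict` could be replaced by `lambda t: predict_sentiment(t, vocab, tokenizer, max_length, model)`.

```python
def score_accuracy(reviews, expected_label, predict):
    # count how many reviews receive the expected label
    hits = sum(1 for text in reviews if predict(text)[1] == expected_label)
    return hits, len(reviews)

def toy_predict(text):
    # placeholder for predict_sentiment(); labels a review by a single keyword
    return (0.9, 'POSITIVE') if 'good' in text else (0.9, 'NEGATIVE')

pos_reviews = ['a good film', 'good fun', 'dull but worthy']
neg_reviews = ['a bad film', 'not good at all']
pos_hits, pos_n = score_accuracy(pos_reviews, 'POSITIVE', toy_predict)
neg_hits, neg_n = score_accuracy(neg_reviews, 'NEGATIVE', toy_predict)
overall = 100.0 * (pos_hits + neg_hits) / (pos_n + neg_n)
print('Accuracy: %.1f%%' % overall)
```

Passing the expected label as an argument also removes the need for two nearly identical `score_docs()` definitions.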