# CHEMENG/MECHENG 789 Assignment 3

Elliot (Yixin) Huangfu

In this assignment the objective is to reproduce the Sentiment Classification example as presented here:

https://towardsdatascience.com/machine-learning-word-embedding-sentiment-classification-using-keras-b83c28087456

You need to upload the data to your Google Drive and develop a Jupyter notebook using Colab. Your deliverables are:

- executable code
 - **The blog example is fully replicated, using TensorFlow 2.2.0-rc2 in Colab.**
- some example that you make for testing your model (predict)
 - **Customized testing case is in the *Test model* section.**
- how can the model be improved?
 - **Hyper-parameter tuning, including increasing the number and size of the RNN layers, and using an LSTM may help improve the result.**

In [0]:
# for colab, set tf version
%tensorflow_version 2.x
import tensorflow as tf
tf.__version__

'2.2.0-rc2'

Download the IMDB movie review dataset.

In [0]:
import requests
url = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
myfile = requests.get(url)
open('./aclImdb_v1.tar.gz', 'wb').write(myfile.content)

print('file downloaded:', './aclImdb_v1.tar.gz')

file downloaded: ./aclImdb_v1.tar.gz


## Load IMDB dataset

Make sure the original data file (tar.gz) exists in the current folder, then extract it.

In [0]:
import tarfile 
filepath = './aclImdb_v1.tar.gz'
with tarfile.open(filepath,'r') as tar_ref:
    tar_ref.extractall("./")

Search through folder and obtain all file names.

In [0]:
import os
folder = './aclImdb'
train_files = {}   # contain filenames
test_files = {}   # contain filenames
train_files['pos'] = os.listdir(folder + '/train/pos')
train_files['neg'] = os.listdir(folder + '/train/neg')
test_files['pos'] = os.listdir(folder + '/test/pos')
test_files['neg'] = os.listdir(folder + '/test/neg')

Read every .txt file and store the reviews in DataFrames.

In [0]:
import pandas as pd

train_text = []
train_label = []
test_text = []
test_label = []

# loop through each filename
for label, filenames in train_files.items():
    current_folder = os.path.join('./aclImdb/train/',label)
    for filename in filenames:
        full_fn = os.path.join(current_folder, filename)
        with open(full_fn, 'r', encoding='utf-8') as f:
            train_text.append(f.read())
        if label == 'pos': train_label.append(1)
        elif label == 'neg': train_label.append(0)
        else: print('unhandled label:',label)

# test dataset
for label, filenames in test_files.items():
    current_folder = os.path.join('./aclImdb/test/',label)
    for filename in filenames:
        full_fn = os.path.join(current_folder, filename)
        with open(full_fn, 'r', encoding='utf-8') as f:
            test_text.append(f.read())
        if label == 'pos': test_label.append(1)
        elif label == 'neg': test_label.append(0)
        else: print('unhandled label:',label)

# convert to dataframes
train_df = pd.DataFrame({'review':train_text, 'sentiment':train_label})
test_df = pd.DataFrame({'review':test_text, 'sentiment':test_label})
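
The two loops above differ only in the folder they read, so they could be folded into one helper. The following is a minimal sketch under that assumption; `read_reviews` and the throwaway demo directory are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import os
import tempfile

def read_reviews(root, labels=(("pos", 1), ("neg", 0))):
    """Read every file under root/<label>/ and return (texts, targets)."""
    texts, targets = [], []
    for label, target in labels:
        label_dir = os.path.join(root, label)
        # sorted() makes the row order reproducible across runs
        for filename in sorted(os.listdir(label_dir)):
            with open(os.path.join(label_dir, filename), "r", encoding="utf-8") as f:
                texts.append(f.read())
            targets.append(target)
    return texts, targets

# self-check on a temporary directory that mimics the pos/neg layout
with tempfile.TemporaryDirectory() as tmp:
    for label, text in (("pos", "great film"), ("neg", "dull film")):
        os.makedirs(os.path.join(tmp, label))
        with open(os.path.join(tmp, label, "0_1.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    demo_texts, demo_targets = read_reviews(tmp)

print(demo_texts, demo_targets)  # ['great film', 'dull film'] [1, 0]
```

With the real data, the call would presumably look like `read_reviews('./aclImdb/train')` and `read_reviews('./aclImdb/test')`.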

Examine the dataset.

In [0]:
print(train_df.head())
print(test_df.head())

                                              review  sentiment
0  This movie makes a statement about Joseph Smit...          1
1  Maybe one of the most entertaining Ninja-movie...          1
2  ever watched. It deals so gently and subtly no...          1
3  To a certain extent, I actually liked this fil...          1
                                              review  sentiment
0  What can I say about this film other than the ...          1
1  This is my all time favorite Looney Tunes cart...          1
2  "Mr. Bug Goes To Town" was the last major achi...          1
3  One of the more interesting films I've seen. L...          1
4  Every once in a while a film comes along with ...          1


Create dataset

In [0]:
x_train = train_df['review'].values
y_train = train_df['sentiment'].values
x_test = test_df['review'].values
y_test = test_df['sentiment'].values
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(25000,) (25000,)
(25000,) (25000,)


## Preprocessing

In [0]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# fit tokenizer
tokenizer_obj = Tokenizer()
all_reviews = np.concatenate((x_train, x_test), axis=0)
tokenizer_obj.fit_on_texts(all_reviews)

# maximum review length in whitespace-separated words, used as the padding length
max_length = max([len(s.split()) for s in all_reviews])

# define vocabulary size
vocab_size = len(tokenizer_obj.word_index) + 1

# create tokenized dataset
x_train_tokens = tokenizer_obj.texts_to_sequences(x_train)
x_test_tokens = tokenizer_obj.texts_to_sequences(x_test)

# pad sequence
x_train_pad = pad_sequences(x_train_tokens, maxlen=max_length)
x_test_pad = pad_sequences(x_test_tokens, maxlen=max_length)

print(x_train_pad.shape, x_test_pad.shape)

(25000, 2470) (25000, 2470)
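
To make the padding step concrete, here is a small pure-Python sketch that mimics the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`, pad value 0). The function name `pad_pre` is hypothetical, and the sketch assumes `maxlen >= 1`; it is an illustration, not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each token list to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

demo = pad_pre([[5, 8], [3, 1, 4, 1, 5]], maxlen=4)
print(demo)  # [[0, 0, 5, 8], [1, 4, 1, 5]]
```

Because 0 is used as the padding value and the tokenizer's word indices start at 1, the vocabulary size is `len(word_index) + 1`.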


## Simple model: word embedding + RNN

The following settings are consistent with the blog example.

In [0]:
# check the input parameters:
print('vocab_size:', vocab_size)
print('max_length:', max_length)

vocab_size: 124253
max_length: 2470


In [0]:
# build a model
EMBEDDING_DIM = 100

model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(vocab_size, EMBEDDING_DIM, input_length=max_length),
    tf.keras.layers.GRU(units=32, dropout=0.2, recurrent_dropout=0.2),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])



In [0]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 2470, 100)         12425300  
_________________________________________________________________
gru_2 (GRU)                  (None, 32)                12864     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
Total params: 12,438,197
Trainable params: 12,438,197
Non-trainable params: 0
_________________________________________________________________
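
The parameter counts in the summary can be checked by hand. The sketch below assumes the GRU uses the two-bias layout (`reset_after=True`, the TensorFlow 2.x default), which is what reproduces the 12,864 reported above; the helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    # one vector of length embedding_dim per token index
    return vocab_size * embedding_dim

def gru_params(input_dim, units):
    # 3 gates, each with an input kernel, a recurrent kernel, and two bias vectors
    return 3 * (units * (input_dim + units) + 2 * units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

counts = [embedding_params(124253, 100), gru_params(100, 32), dense_params(32, 1)]
print(counts, sum(counts))  # [12425300, 12864, 33] 12438197
```

Almost all parameters sit in the embedding layer, which is why freezing or pre-training it changes the trainable count so drastically.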


Training: since a GRU with recurrent_dropout does not meet the cuDNN kernel criteria, training is very slow.

Since the number of trainable parameters is very large, the model is prone to overfitting.

In [0]:
model.fit(
    x_train_pad, y_train,
    batch_size=128, epochs=5,
    validation_data=(x_test_pad,y_test), verbose=1
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

### Test model (with customized examples)

In [0]:
test_samples = [
    "This movie really sucks! Can I get my money back please?",
    "Not a good movie!",
    "This movie is fantastic! I really like it because it is so good.",
    ]
test_samples_tokens = tokenizer_obj.texts_to_sequences(test_samples)
test_samples_pad = pad_sequences(test_samples_tokens, maxlen=max_length)

# predict
model.predict(test_samples_pad)

array([[0.13415487],
       [0.5118695 ],
       [0.83251464]], dtype=float32)
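
The sigmoid outputs are scores in [0, 1]. A minimal sketch of turning them into labels, assuming the conventional 0.5 threshold (the notebook itself does not state one) and a hypothetical helper `to_label`:

```python
# scores copied from the prediction output above
scores = [0.13415487, 0.5118695, 0.83251464]

def to_label(score, threshold=0.5):
    """Map a sigmoid score to a sentiment label."""
    return "positive" if score >= threshold else "negative"

labels = [to_label(s) for s in scores]
print(labels)  # ['negative', 'positive', 'positive']
```

Under this threshold the second sample, "Not a good movie!", lands just on the positive side.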

## word2vec Embedding

In [0]:
# download some data for nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Manually process the review text:
- remove all punctuation
- remove all non-alphabetic characters
- remove stop words

In [0]:
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# build the punctuation table and the stop-word set once, outside the loop
punct_dict = str.maketrans('', '', string.punctuation)
stop_words = set(stopwords.words('english'))

all_reviews_words = []
for line in all_reviews.tolist():
    # convert each review into a list of words
    words = word_tokenize(line)
    # convert to lowercase
    words = [w.lower() for w in words]
    # remove punctuation from each word (not from the list)
    words = [w.translate(punct_dict) for w in words]
    # keep only purely alphabetic words
    words = [w for w in words if w.isalpha()]
    # remove stop words
    words = [w for w in words if w not in stop_words]

    all_reviews_words.append(words)

len(all_reviews_words)

50000

Examine the result.

In [0]:
i = 1
print('Original text:\n', x_train[i])
print('Processed text:\n', ' '.join(all_reviews_words[i]))

Original text:
 Maybe one of the most entertaining Ninja-movies ever made. A hard-hitting action movie with lots of gore and slow motion (eehaaa!). Made in ´83 and still the greatest swedish action movie made so far! And we can hardly wait to see the upcoming sequel, Ninja mission 2000 - The legacy of Markov!
Processed text:
 maybe one entertaining ninjamovies ever made hardhitting action movie lots gore slow motion eehaaa made still greatest swedish action movie made far hardly wait see upcoming sequel ninja mission legacy markov
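
The effect of this cleaning on a short sentence can be sketched without NLTK. The version below is a simplification: it splits on whitespace instead of calling `word_tokenize`, and `STOP_WORDS` is a small hypothetical subset standing in for NLTK's English list (which also contains "not").

```python
import string

STOP_WORDS = {"a", "the", "is", "not", "this"}  # hypothetical subset for illustration
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, drop non-alphabetic tokens and stop words."""
    words = text.lower().split()
    words = [w.translate(PUNCT_TABLE) for w in words]
    words = [w for w in words if w.isalpha()]
    return [w for w in words if w not in stop_words]

cleaned = clean("Not a good movie!")
print(cleaned)  # ['good', 'movie']
```

Note that negation is lost once "not" is treated as a stop word, which matters for sentiment.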


Train word embedding using gensim.

In [0]:
import gensim

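# train word2vec model (gensim 3.x API; gensim 4.x renames `size` to `vector_size`)
# note: this rebinds `model`, which previously held the Keras model
model = gensim.models.Word2Vec(sentences=all_reviews_words, size=EMBEDDING_DIM, window=5, workers=4, min_count=1)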
model

<gensim.models.word2vec.Word2Vec at 0x7f5933917860>

Examine the word embedding model.

In [0]:
# gensim 3.x attribute; in gensim 4.x use model.wv.key_to_index instead
vocab = list(model.wv.vocab.keys())
print('vocabulary size:', len(vocab))

vocabulary size: 134156


In [0]:
# similar words
model.wv.most_similar('horrible')

[('terrible', 0.9212156534194946),
 ('awful', 0.8781707882881165),
 ('pathetic', 0.7718660831451416),
 ('atrocious', 0.7631697058677673),
 ('horrendous', 0.7616218328475952),
 ('dreadful', 0.7500942349433899),
 ('sucks', 0.7495220303535461),
 ('horrid', 0.7411223649978638),
 ('lousy', 0.7325442433357239),
 ('bad', 0.7231553792953491)]

In [0]:
# math
model.wv.most_similar_cosmul(positive=['woman','king'], negative=['man'])

[('princess', 0.8775322437286377),
 ('romeo', 0.8715057969093323),
 ('bride', 0.8615204691886902),
 ('juliet', 0.856743335723877),
 ('tearle', 0.8564087152481079),
 ('queen', 0.8427306413650513),
 ('changxin', 0.8397428393363953),
 ('ciarán', 0.8384663462638855),
 ('crimecop', 0.8383185863494873),
 ('jetée', 0.8341542482376099)]

In [0]:
# find odd word
model.wv.doesnt_match('woman king queen movie'.split())


'movie'
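
The similarity scores returned by `most_similar` are cosine similarities between word vectors. The following toy sketch illustrates that ranking in pure Python; the two-dimensional vectors and the helper names are made up for illustration and have nothing to do with the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_similar(word, vectors):
    """Rank all other words by cosine similarity to `word`, best first."""
    others = [(w, cosine_similarity(vectors[word], v))
              for w, v in vectors.items() if w != word]
    return sorted(others, key=lambda pair: pair[1], reverse=True)

# hypothetical toy vectors
toy = {"horrible": [1.0, 0.0], "terrible": [0.9, 0.1], "movie": [0.0, 1.0]}
ranking = rank_similar("horrible", toy)
print(ranking[0][0])  # terrible
```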

In [0]:
# save model
w2v_fn = 'imdb_embedding_word2vec.txt'
model.wv.save_word2vec_format(w2v_fn, binary=False)


## Apply word embedding

The word embedding obtained previously maps words directly to vectors, so it is loaded here by word, without going through the tokenizer.

In [0]:
import os
embeddings = {}
print(w2v_fn)  # should be 'imdb_embedding_word2vec.txt'

with open(os.path.join('.', w2v_fn), 'r') as f:
    # skip the first line
    print('vocab_size, embedding_dims:', f.readline())
    # read every line and store as numpy array
    for line in f:
        values = line.split()
        word = values[0]
        embeddings[word] = np.array([float(v) for v in values[1:]])

len(embeddings)

imdb_embedding_word2vec.txt
vocab_size, embedding_dims: 134156 100



134156
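
The text file written by `save_word2vec_format(..., binary=False)` has a header line with the vocabulary size and the dimension, followed by one word and its vector per line. A minimal parser with basic sanity checks might look like the sketch below; `load_word2vec_text` is a hypothetical helper, and the in-memory sample stands in for the real file.

```python
import io

def load_word2vec_text(handle):
    """Parse the word2vec text format from an open text handle."""
    header = handle.readline().split()
    n_words, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.split()
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != n_words:
        raise ValueError(f"header announced {n_words} words, found {len(vectors)}")
    return vectors

# tiny in-memory sample in the same layout as the saved file
sample = io.StringIO("2 3\ngood 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")
vecs = load_word2vec_text(sample)
print(sorted(vecs))  # ['bad', 'good']
```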

## Preprocessing - for manually processed texts
To keep the comparison fair, use the train/test split of the original dataset. The blog tutorial uses a 0.8/0.2 split, which differs from the original 25,000/25,000 split.

In [0]:
# create new dataset based on the processed text sequences
x_train_new = np.array([' '.join(words) for words in all_reviews_words[0:25000]])
x_test_new = np.array([' '.join(words) for words in all_reviews_words[25000:]])

Check the difference between the original and processed text:

In [0]:
i = 18
print('Train data comparison:', y_train[i])
print(x_train[i])
print(x_train_new[i])

print('\nTest data comparison:', y_test[i])
print(x_test[i])
print(x_test_new[i])

Train data comparison: 1
A very good wartime movie showing the effects of war on a hometown boy who looses his eyesight on Guadalcanal and must come home and re-adjust himself with the help of family and friends. An excellent cast of actor's helps make this movie very entertaining. Eleanor Parker's role as the girlfriend was worthy of an Oscar nomination. She has such an innocence to her in this movie. Ann Doran role was equally satisfying as was all of her small supporting roles. I especially like the hometown aura of pre-war Phildelphia. The hunting scene is very good. Of course the war scene on Guadalcanal truly showed the horror faced by our soldiers during this epic battle. A well deserving film and one that should not be forgotten
good wartime movie showing effects war hometown boy looses eyesight guadalcanal must come home readjust help family friends excellent cast actor helps make movie entertaining eleanor parker role girlfriend worthy oscar nomination innocence movie ann dor

Preprocessing is the same as before.

In [0]:
# fit tokenizer
tokenizer_obj = Tokenizer()
tokenizer_obj.fit_on_texts(np.concatenate((x_train_new, x_test_new), axis=0))

# maximum number of words in a processed review, used as the padding length
max_length = max([len(w) for w in all_reviews_words])

# define vocabulary size
vocab_size = len(tokenizer_obj.word_index) + 1

# create tokenized dataset
x_train_tokens = tokenizer_obj.texts_to_sequences(x_train_new)
x_test_tokens = tokenizer_obj.texts_to_sequences(x_test_new)

# pad sequence
x_train_pad = pad_sequences(x_train_tokens, maxlen=max_length)
x_test_pad = pad_sequences(x_test_tokens, maxlen=max_length)

print('vocab_size:', vocab_size)
print('max_length:', max_length)
print('shape of train / test dataset:')
print(x_train_pad.shape, x_test_pad.shape)

vocab_size: 134157
max_length: 1440
shape of train / test dataset:
(25000, 1440) (25000, 1440)


So far we have two dicts: word to token index, and word to embedding vector. Now combine them into a token-index-to-embedding matrix, where row i holds the vector of the word with index i.

In [0]:
# obtain token - embedding matrix
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))

for word, i in tokenizer_obj.word_index.items():
    embedding_matrix[i] = embeddings[word]

print(embedding_matrix.shape)

(134157, 100)
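
The lookup `embeddings[word]` works here because both vocabularies come from the same processed text. If the tokenizer ever contained a word without an embedding, it would raise a `KeyError`. A more defensive variant is sketched below with plain lists instead of NumPy; `build_embedding_matrix` and the toy dictionaries are hypothetical.

```python
def build_embedding_matrix(word_index, embeddings, dim):
    """Return (matrix, missing); rows of unknown words stay all-zero."""
    # row 0 stays all-zero because index 0 is reserved for padding
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    missing = []
    for word, i in word_index.items():
        vector = embeddings.get(word)
        if vector is None:
            missing.append(word)
        else:
            matrix[i] = list(vector)
    return matrix, missing

# toy inputs for illustration
toy_word_index = {"good": 1, "bad": 2, "unseen": 3}
toy_embeddings = {"good": [0.1, 0.2], "bad": [-0.1, -0.2]}
toy_matrix, toy_missing = build_embedding_matrix(toy_word_index, toy_embeddings, dim=2)
print(toy_missing)  # ['unseen']
```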


## RNN model: with pre-trained embeddings

In [0]:
# check the input parameters:
print('vocab_size:', vocab_size)
print('max_length:', max_length)
print('embedding_dims:', EMBEDDING_DIM)

vocab_size: 134157
max_length: 1440
embedding_dims: 100


In [0]:
# build a model
EMBEDDING_DIM = 100

model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, EMBEDDING_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        input_length=max_length,
        trainable=False),
    tf.keras.layers.GRU(units=32, dropout=0.2, recurrent_dropout=0.2),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])



In [0]:
# the number of trainable parameters is significantly smaller
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1440, 100)         13415700  
_________________________________________________________________
gru_1 (GRU)                  (None, 32)                12864     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
Total params: 13,428,597
Trainable params: 12,897
Non-trainable params: 13,415,700
_________________________________________________________________


In [0]:
model.fit(
    x_train_pad, y_train,
    batch_size=128, epochs=10,
    validation_data=(x_test_pad,y_test), verbose=1
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f5a1f505ba8>

### Test model (with customized examples)

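In the run recorded below, "Not a good movie!" scores about 0.59, which is above 0.5, so the model does not clearly identify it as a negative review; the other two samples are scored as expected. One likely contributor is that the tokenizer was fitted on text with stop words removed, and NLTK's English stop-word list includes "not", so the negation never reaches the model.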

In [0]:
# test some samples
test_samples = [
    "This movie really sucks! Can I get my money back please?",
    "Not a good movie!",
    "This movie is fantastic! I really like it because it is so good.",
    ]

test_samples_tokens = tokenizer_obj.texts_to_sequences(test_samples)
test_samples_pad = pad_sequences(test_samples_tokens, maxlen=max_length)

# predict
model.predict(test_samples_pad)

array([[0.13060334],
       [0.5897073 ],
       [0.91971165]], dtype=float32)

# Conclusion

Train from scratch, test accuracy: ~86%

Pre-trained word embedding, test accuracy: ~88%

With the pre-trained embedding, training is much faster and less prone to overfitting.

