## Text Mining Brown Bag

### Introduction
The **Large Movie Review Dataset** (http://ai.stanford.edu/~amaas/data/sentiment/) is used to train several sentiment classification models. The training set consists of 25,000 labeled movie reviews (half positive, half negative) plus 50,000 unlabeled reviews. The test set contains a further 25,000 labeled reviews.


*Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).*

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from sklearn.metrics import accuracy_score
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
from gensim.models.word2vec import Word2Vec

%matplotlib inline

## Import Data

In [2]:
df = pd.read_csv("../data/review_data.csv")
df.head()

| Unnamed: 0 | review | train | label | target |
|---|---|---|---|---|
| 0 | For a movie that gets no respect there sure ar... | True | pos | True |
| 1 | Bizarre horror movie filled with famous faces ... | True | pos | True |
| 2 | A solid, if unremarkable film. Matthau, as Ein... | True | pos | True |
| 3 | It's a strange feeling to sit alone in a theat... | True | pos | True |
| 4 | You probably all already know this by now, but... | True | pos | True |


In [3]:
def model_accuracy(y_test, y_test_pred, model_desc=None):
    """Print and return the accuracy of boolean predictions against the true labels."""
    acc = accuracy_score(y_test, y_test_pred)
    print("%s Accuracy: %0.3f" % (model_desc, acc))
    return (model_desc, acc)

In [4]:
# Import word vectors from "Text Mining - Sentiment Classification.ipynb"
model = Word2Vec.load("wordvec.model")

In [5]:
model.wv.most_similar("uncle")

[('nephew', 0.8309506177902222),
 ('cousin', 0.8183605074882507),
 ('fiancée', 0.8039058446884155),
 ('grandfather', 0.801784873008728),
 ('niece', 0.7836310863494873),
 ('aunt', 0.7762334942817688),
 ('fiancé', 0.7591961622238159),
 ('grandmother', 0.7586368918418884),
 ('colleague', 0.7578848600387573),
 ('pal', 0.7566899657249451)]

## Data Prep for LSTM

In [6]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import Embedding, Input, LSTM, Dense, Bidirectional

Using TensorFlow backend.


In [9]:
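# Reserve index 0 for out-of-vocabulary words
word2index = {"<UNK>": 0}

# Shift every word2vec index up by one to make room for <UNK>
# (gensim >= 4.0 renames index2word to index_to_key and syn0 to vectors)
for i, k in enumerate(model.wv.index2word):
    word2index[k] = i + 1

# Prepend an all-zero row so that row 0 of the matrix is the <UNK> vector
embedding = np.zeros((1, model.wv.syn0.shape[1]))
embedding = np.concatenate([embedding, model.wv.syn0], axis=0)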

In [10]:
word2index

{'<UNK>': 0,
 'the': 1,
 ',': 2,
 '.': 3,
 'and': 4,
 'a': 5,
 'of': 6,
 'to': 7,
 'is': 8,
 '/': 9,
 '>': 10,
 '<': 11,
 'br': 12,
 'it': 13,
 'in': 14,
 'i': 15,
 'this': 16,
 'that': 17,
 "'s": 18,
 'was': 19,
 'as': 20,
 'with': 21,
 'for': 22,
 'movie': 23,
 'but': 24,
 'film': 25,
 ')': 26,
 '(': 27,
 'you': 28,
 "''": 29,
 '``': 30,
 "n't": 31,
 'on': 32,
 'not': 33,
 'are': 34,
 'he': 35,
 'his': 36,
 'have': 37,
 'be': 38,
 'one': 39,
 '!': 40,
 'at': 41,
 'they': 42,
 'all': 43,
 'by': 44,
 'an': 45,
 'who': 46,
 'from': 47,
 'so': 48,
 'like': 49,
 'there': 50,
 'her': 51,
 'or': 52,
 'just': 53,
 'do': 54,
 'about': 55,
 'has': 56,
 'if': 57,
 'out': 58,
 '?': 59,
 'what': 60,
 'some': 61,
 'good': 62,
 'when': 63,
 'more': 64,
 'very': 65,
 'she': 66,
 'would': 67,
 'no': 68,
 'up': 69,
 'even': 70,
 '...': 71,
 'my': 72,
 'can': 73,
 'which': 74,
 'their': 75,
 'time': 76,
 'only': 77,
 'really': 78,
 'story': 79,
 'see': 80,
 'had': 81,
 'were': 82,
 'we': 83,
 'did': 84

In [11]:
print("Vocabulary Size: %i" %len(word2index))
print("Embedding matrix shape: %s" %str(embedding.shape))

Vocabulary Size: 9464
Embedding matrix shape: (9464, 300)
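
The vocabulary size and the number of rows in the embedding matrix agree because both gain exactly one extra entry for `<UNK>`. The sketch below replays that construction on a toy vocabulary; the names `toy_vocab` and `toy_vectors` and their values are made up for illustration and stand in for the word list and vector matrix of the loaded word2vec model.

```python
# Hypothetical stand-ins for the word2vec word list and vector matrix.
toy_vocab = ["the", "movie", "good"]
toy_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Index 0 is reserved for unknown words; real words are shifted up by one.
toy_word2index = {"<UNK>": 0}
for i, word in enumerate(toy_vocab):
    toy_word2index[word] = i + 1

# Prepend an all-zero row so that row 0 is the <UNK> vector.
dim = len(toy_vectors[0])
toy_embedding = [[0.0] * dim] + toy_vectors

# Row idx of the matrix must hold the vector of the word mapped to idx.
for word, idx in toy_word2index.items():
    if word != "<UNK>":
        assert toy_embedding[idx] == toy_vectors[toy_vocab.index(word)]

print(len(toy_word2index), len(toy_embedding))  # 4 4
```

If the offset were forgotten on either side, every word would silently pick up its neighbor's vector, so a check like the loop above is cheap insurance.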


In [12]:
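def index_lookup(x):
    """Return the vocabulary index of token x, or the <UNK> index if x is unknown."""
    return word2index.get(x, word2index["<UNK>"])

def texts_to_sequence(texts, max_length=100):
    """Convert raw texts to a padded 2-D array of vocabulary indices."""
    out = []
    for x in texts:
        # Convert to lowercase
        x = x.lower()

        # Tokenize
        x = word_tokenize(x)

        # Map each token to its vocabulary index
        x_seq = [index_lookup(t) for t in x]
        out.append(x_seq)
    # By default pad_sequences pads and truncates at the start of each sequence
    return pad_sequences(out, maxlen=max_length)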
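
With its default arguments, `pad_sequences` pads with zeros at the front and, for sequences that are too long, drops tokens from the front as well. The pure-Python sketch below mimics that default behavior for a single sequence; `pad_pre` is a hypothetical helper written for illustration, not part of Keras, and it assumes `max_length` is at least 1.

```python
def pad_pre(seq, max_length, value=0):
    """Mimic the pad_sequences defaults for one sequence (assumes max_length >= 1)."""
    seq = list(seq)[-max_length:]                      # keep only the last max_length items
    return [value] * (max_length - len(seq)) + seq     # left-pad with the fill value

short = pad_pre([5, 23, 62], max_length=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], max_length=5)

print(short)  # [0, 0, 5, 23, 62]
print(long)   # [3, 4, 5, 6, 7]
```

Two consequences for this notebook: reviews longer than the maximum length lose their opening tokens, and the padding value 0 is the same index as `<UNK>`, so padding and unknown words share the all-zero embedding row.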

In [13]:
x = df["review"].iloc[0]
x_token = word_tokenize(x.lower())
x_seq = texts_to_sequence([x], max_length=100)

print("INPUT TEXT")
print("-----------")
print(x)
print("\n")

print("LOWERCASE AND TOKENIZED")
print("-----------------------")
print(x_token)
print("Length: %i" %len(x_token))
print("\n")

print("SEQUENCE OF INDICES")
print("-----------------------")
print(x_seq)
print("Shape: %s" %str(x_seq.shape))
print("\n")

INPUT TEXT
-----------
For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan "The Skipper" Hale jr. as a police Sgt.


LOWERCASE AND TOKENIZED
-----------------------
['for', 'a', 'movie', 'that', 'gets', 'no', 'respect', 'there', 'sure', 'are', 'a', 'lot', 'of', 'memorable', 'quotes', 'listed', 'for', 'this', 'gem', '.', 'imagine', 'a', 'movie', 'where', 'joe', 'piscopo', 'is', 'actually', 'funny', '!', 'maureen', 'stapleton', 'is', 'a', 'scene', 'stealer', '.', 'the', 'moroni', 'character', 'is', 'an', 'absolute', 'scream', '.', 'watch', 'for', 'alan', '``', 'the', 'skipper', "''", 'hale', 'jr.', 'as', 'a', 'police', 'sgt', '.']
Length: 59


SEQUENCE OF INDICES
-----------------------
[[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0

In [14]:
MAX_LEN = 100

X_train = texts_to_sequence(df[(df["train"] == True) & (df["label"].notnull())]["review"], MAX_LEN)
X_test = texts_to_sequence(df[(df["train"] == False) & (df["label"].notnull())]["review"], MAX_LEN)

In [18]:
X_train.shape

(25000, 100)

In [15]:
y_train = df[(df["train"] == True) & (df["label"].notnull())]["target"]
y_test = df[(df["train"] == False) & (df["label"].notnull())]["target"]

In [16]:
# Define embedding layer
d1, d2 = embedding.shape
embedding_layer = Embedding(d1, d2, weights = [embedding],
                            input_length=MAX_LEN,
                            trainable=False)

## Unidirectional LSTM

In [20]:
in1 = Input(shape=(MAX_LEN,))

x = embedding_layer(in1)
x = LSTM(10)(x)
x = Dense(10, activation="relu")(x)
out = Dense(1, activation="sigmoid")(x)

model = Model(in1, out)

In [21]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 100, 300)          2839200   
_________________________________________________________________
lstm_2 (LSTM)                (None, 10)                12440     
_________________________________________________________________
dense_3 (Dense)              (None, 10)                110       
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 11        
Total params: 2,851,761
Trainable params: 12,561
Non-trainable params: 2,839,200
_________________________________________________________________


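NOTE: The number of parameters in the embedding layer equals vocabulary size × word vector length (9,464 × 300 = 2,839,200). They are counted as non-trainable because the layer was created with `trainable=False`.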
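
The counts in the summary can be reproduced by hand. The sketch below recomputes the embedding and dense layer sizes and takes the LSTM count directly from the summary; `dense_params` is a helper name introduced here for illustration.

```python
# Values taken from the embedding matrix shape printed earlier.
vocab_size, vector_length = 9464, 300
embedding_params = vocab_size * vector_length

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit of a fully connected layer."""
    return n_in * n_out + n_out

lstm_params = 12440  # copied from the model summary, not derived here

trainable = lstm_params + dense_params(10, 10) + dense_params(10, 1)
total = embedding_params + trainable

print(embedding_params)  # 2839200
print(trainable)         # 12561
print(total)             # 2851761
```

Because the frozen embedding accounts for almost all of the parameters, only about 0.4% of the model is actually updated during training.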

In [22]:
model.compile(loss="binary_crossentropy",
              optimizer="adam", metrics=["acc"])


model.fit(X_train, y_train, epochs=5, batch_size=128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x121f18ba8>

In [23]:
y_test_pred = model.predict(X_test).flatten()

In [24]:
acc_lstm = model_accuracy(y_test, y_test_pred > 0.5, "LSTM")

LSTM Accuracy: 0.823
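
The comparison `y_test_pred > 0.5` turns the sigmoid outputs into boolean labels before they are scored. The toy example below shows the same thresholding and accuracy calculation in plain Python; the probabilities and labels are invented for illustration and are not taken from the model.

```python
# Made-up sigmoid outputs and true labels, for illustration only.
probs = [0.91, 0.35, 0.62, 0.08, 0.50]
labels = [True, False, False, False, True]

# Strict threshold: a probability of exactly 0.5 is classified as negative.
preds = [p > 0.5 for p in probs]

# Accuracy is the fraction of predictions that match the true label.
correct = sum(p == t for p, t in zip(preds, labels))
accuracy = correct / len(labels)

print(preds)               # [True, False, True, False, False]
print("%0.3f" % accuracy)  # 0.600
```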


## Bidirectional LSTM

In [25]:
in1 = Input(shape=(MAX_LEN,))

x = embedding_layer(in1)
x = Bidirectional(LSTM(10))(x)
x = Dense(10, activation="relu")(x)
out = Dense(1, activation="sigmoid")(x)

model = Model(in1, out)

In [26]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 100, 300)          2839200   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 20)                24880     
_________________________________________________________________
dense_5 (Dense)              (None, 10)                210       
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 11        
Total params: 2,864,301
Trainable params: 25,101
Non-trainable params: 2,839,200
_________________________________________________________________


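NOTE: The bidirectional LSTM layer has twice as many parameters as the unidirectional one (24,880 vs. 12,440) because it trains a separate LSTM for each direction. Its output is also twice as wide (20 vs. 10), so the following dense layer grows from 110 to 210 parameters.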
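
The LSTM counts in both summaries follow from the standard LSTM layout: four gate blocks, each with input weights, recurrent weights, and a bias. The sketch below assumes that layout (the Keras default with biases enabled); `lstm_params` is a helper name introduced here for illustration.

```python
def lstm_params(input_dim, units):
    """Parameter count of a standard LSTM layer with biases.

    Each of the four gate blocks (input, forget, cell candidate, output) has
    input weights, recurrent weights, and one bias per unit.
    """
    return 4 * (units * (input_dim + units) + units)

one_direction = lstm_params(input_dim=300, units=10)
both_directions = 2 * one_direction   # forward LSTM plus backward LSTM

# The two 10-unit outputs are concatenated, so the next dense layer sees 20 inputs.
dense_hidden = 20 * 10 + 10
dense_output = 10 * 1 + 1
trainable = both_directions + dense_hidden + dense_output

print(one_direction)    # 12440
print(both_directions)  # 24880
print(trainable)        # 25101
```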

In [27]:
model.compile(loss="binary_crossentropy",
              optimizer="adam", metrics=["acc"])

history = model.fit(X_train, y_train, epochs=5, batch_size=128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x119a4a748>

In [28]:
y_test_pred = model.predict(X_test).flatten()

In [29]:
acc_lstm_bi = model_accuracy(y_test, y_test_pred > 0.5, "LSTM (Bidirectional)")

LSTM (Bidirectional) Accuracy: 0.834
