## Word-level Sequential Model for Sentiment Classification

In [1]:
from __future__ import print_function

### 1. Data set

We use the IMDB review data set to learn an encoding of a sentence (i.e. the text of a user review) and to classify the sentiment polarity of that text. The data is taken from https://www.kaggle.com/c/word2vec-nlp-tutorial/data. The training set contains 25000 reviews labelled 0 for "negative" sentiment and 1 for "positive" sentiment. For the validation set, the binary label (0 or 1) is derived from the "id" attribute of the data set: the number after the '\_' character is the rating score. If the rating is < 5, the sentiment label is 0 ("negative"). If the rating is >= 7, the label is 1 ("positive"). Any other rating is also treated as negative. For the test set, the data is provided without labels.

Example of (part of) original text in data set:

```
id	sentiment	review

"7759_3"	0	"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature's most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger ."

```
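
The `id` field in the sample row above carries the rating after the underscore. Below is a minimal sketch of how a binary label could be derived from it, using the same threshold as the rest of this notebook (a rating >= 7 is positive, everything else is negative). The helper name `label_from_id` and the second id are hypothetical and only serve as illustration.

```python
def label_from_id(doc_id):
    """Map an id such as '7759_3' to a binary sentiment label."""
    # Ids read with quoting=3 keep their surrounding double quotes, so strip them.
    rating = int(doc_id.strip('"').split('_')[1])
    return 1 if rating >= 7 else 0

print(label_from_id('"7759_3"'))  # rating 3  -> 0
print(label_from_id('1234_10'))   # made-up id with rating 10 -> 1
```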

### 2. Problem Definition

Given a text (e.g. a movie review), we will predict whether the review is positive (class label = 1) or negative (class label = 0).

### 3. Tasks:

* Encode text using an LSTM as the encoder layer.
* Project the output of the encoder to a dense prediction layer.
* Train the model to minimize the loss of the sentiment classification task, given the encoding of the text sequence as the input of the model.

YOUR ASSIGNMENT TASKS:
* <span style="color:red">Plot the loss and accuracy for the training and validation stages. Discuss the results.</span>
* <span style="color:red">Generate document embeddings from the RNN layer that has been trained and optimized for the sentiment classification task.</span>
* <span style="color:red">Visualize the resulting document embeddings and relate them to their sentiment labels in embedding space.</span>
* <span style="color:red">Test with a new set of documents (the raw data set is provided).</span>
* <span style="color:red">Assign labels to new, unseen, unlabelled documents. You will need to encode each raw document and use it as a query against the source (trained) document embeddings. Sample only 10 new unlabelled documents.</span>

### 4. Read preprocessed data

In [2]:
import os
import sys
import numpy as np
import pandas as pd
pd.options.display.max_colwidth = 100
import re
import nltk
import string
from string import punctuation

DATA_PATH = 'data/imdb'
MODEL_PATH = 'model/assignment_3_1'

In [3]:
import _pickle as cPickle

# read an object from a file in pickle format
def readPickle(pickleFilename):
    with open(pickleFilename, 'rb') as f:
        return cPickle.load(f)

In [4]:
# write an object to a file in pickle format
def savePickle(dataToWrite, pickleFilename):
    with open(pickleFilename, 'wb') as f:
        cPickle.dump(dataToWrite, f)

In [5]:
def striphtml(html):
    p = re.compile(r'<.*?>')
    return p.sub('', html)

In [6]:
def clean(s):
    return re.sub(r'[^\x00-\x7f]', r'', s)

In [7]:
data = pd.read_csv(os.path.join(DATA_PATH,"labeledTrainData.tsv"), header=0, delimiter="\t", quoting=3)

In [8]:
valid_data = pd.read_csv(os.path.join(DATA_PATH,"testData.tsv"), header=0, delimiter="\t")

In [9]:
docs = []
sentiments = []
for cont, sentiment in zip(data.review, data.sentiment):
    # strip HTML tags and non-ASCII characters, then lowercase
    doc = clean(striphtml(cont))
    doc = doc.lower()
    docs.append(doc)
    sentiments.append(sentiment)

In [10]:
valid_docs = []
valid_labels = []
for docid, cont in zip(valid_data.id, valid_data.review):
    # the rating follows the underscore in the id, e.g. 7759_3 -> 3
    id_label = docid.split('_')
    if int(id_label[1]) >= 7:
        valid_labels.append(1)
    else:
        valid_labels.append(0)
    doc = clean(striphtml(cont))
    doc = doc.lower()
    valid_docs.append(doc)

In [11]:
def tokenizeWords(text):
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    return [str(strtokens) for strtokens in tokens]

In [12]:
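def indexingVocabulary(array_of_words):
    """Build an index -> token dictionary; index 0 is reserved for padding."""
    wordIndex = list(array_of_words)
    wordIndex.insert(0,'</PAD>')
    # add document boundary markers unless the vocabulary already contains 'sof'/'eof'
    if 'sof' not in array_of_words:
        wordIndex.append('</START_DOC>')
    if 'eof' not in array_of_words:
        wordIndex.append('</END_DOC>')
    # token used for out-of-vocabulary words
    wordIndex.append('</UNK>')
    vocab = dict(enumerate(wordIndex))
    return vocab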

In [13]:
train_str_tokens = []
all_tokens = []
for text in docs:
    # tokenize each review once and reuse the result
    tokens = tokenizeWords(text)
    train_str_tokens.append(tokens)
    all_tokens.extend(tokens)

In [14]:
print(train_str_tokens[0][:10])

['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with']


In [15]:
valid_str_tokens = []
for i, text in enumerate(valid_docs):
    # tokenize text 
    valid_str_tokens.append(tokenizeWords(text))

In [16]:
tf = nltk.FreqDist(all_tokens)
common_words = tf.most_common(5000)
arr_common = np.array(common_words)
words = arr_common[:,0]

In [17]:
# NOTE: despite the names, words_indices maps index -> word
# and indices_words maps word -> index
words_indices = indexingVocabulary(words)
indices_words = dict((v,k) for (k,v) in words_indices.items())

In [18]:
list(words_indices.items())[:5]

[(0, '</PAD>'), (1, 'the'), (2, 'and'), (3, 'a'), (4, 'of')]

In [19]:
list(indices_words.items())[:5]

[('treated', 1908),
 ('has', 46),
 ('olivier', 4092),
 ('remake', 1014),
 ('starring', 1180)]

In [20]:
# integer format of training input (out-of-vocabulary words map to </UNK>)
train_int_input = []
unk_index = indices_words['</UNK>']
for text in train_str_tokens:
    int_tokens = [indices_words.get(w, unk_index) for w in text]
    train_int_input.append(int_tokens)

In [21]:
# integer format of validation input (out-of-vocabulary words map to </UNK>)
valid_int_input = []
unk_index = indices_words['</UNK>']
for text in valid_str_tokens:
    int_tokens = [indices_words.get(w, unk_index) for w in text]
    valid_int_input.append(int_tokens)

In [22]:
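# reviews differ in length, so store them as an object array
# (recent NumPy versions refuse to build ragged arrays implicitly)
X_train = np.array(train_int_input, dtype=object)
y_train = np.array(sentiments)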

In [23]:
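# reviews differ in length, so store them as an object array
# (recent NumPy versions refuse to build ragged arrays implicitly)
X_valid = np.array(valid_int_input, dtype=object)
y_valid = np.array(valid_labels)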

In [24]:
# storing training and validation set
savePickle(X_train, os.path.join(DATA_PATH,'X_train'))
savePickle(y_train, os.path.join(DATA_PATH,'y_train'))
savePickle(X_valid, os.path.join(DATA_PATH,'X_valid'))
savePickle(y_valid, os.path.join(DATA_PATH,'y_valid'))
# storing look-up dictionary for vocabulary index
savePickle(words_indices, os.path.join(DATA_PATH,'words_indices'))
savePickle(indices_words, os.path.join(DATA_PATH,'indices_words'))

In [None]:
# YOUR CODE HERE TO PREPARE THE ENCODING OF NEW UNSEEN UNLABELED TEST DATA

# 1. Read file data/unlabeledTrainData.csv (ONLY USE FIRST 25000 DOCUMENTS)
# 2. Do similar preprocessing as in training and validation set
# 3. Encode to integer format of sequences
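
The preprocessing and encoding steps listed above could be sketched as follows. This is only an illustration under stated assumptions: the function name `encode_text` and the toy vocabulary are hypothetical, and in this notebook the word-to-index dictionary is the one stored as `indices_words`.

```python
import re

def encode_text(text, word_to_index, unk_token='</UNK>'):
    """Clean, tokenize and map a raw review to a list of vocabulary indices."""
    text = re.sub(r'<.*?>', '', text)           # strip HTML tags
    text = re.sub(r'[^\x00-\x7f]', '', text)    # drop non-ASCII characters
    tokens = re.sub(r'[^a-z0-9]+', ' ', text.lower()).split()
    unk = word_to_index[unk_token]
    # words outside the vocabulary fall back to the </UNK> index
    return [word_to_index.get(tok, unk) for tok in tokens]

# Toy vocabulary for illustration only; the real dictionary is much larger.
toy_vocab = {'</PAD>': 0, 'the': 1, 'movie': 2, 'was': 3, 'great': 4, '</UNK>': 5}
encoded = encode_text('The movie was <br />GREAT, truly!', toy_vocab)
print(encoded)  # [1, 2, 3, 4, 5]
```

Reusing the training vocabulary (rather than building a new one from the unlabelled data) keeps the indices consistent with the trained embedding layer.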

### 5. Word-level document encoder

In [25]:
from keras.preprocessing import sequence

Using TensorFlow backend.


In [27]:
max_review_length = 500
X_train_pad = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_valid_pad = sequence.pad_sequences(X_valid, maxlen=max_review_length)
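
With its default arguments, `pad_sequences` pads with zeros on the left and, for sequences longer than `maxlen`, keeps only the last `maxlen` tokens. The sketch below mimics that behaviour in plain NumPy to make it visible on a small input; `pad_pre` is a hypothetical helper, not the Keras implementation.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(seqs), maxlen), value, dtype='int32')
    for row, seq in enumerate(seqs):
        if len(seq) == 0:
            continue                              # empty sequence stays all padding
        trunc = seq[-maxlen:]                     # keep the last maxlen tokens
        out[row, maxlen - len(trunc):] = trunc    # right-align, padding on the left
    return out

padded = pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
# [[0 0 5 6]
#  [3 4 5 6]]
```

Because index 0 is reserved for `</PAD>` in the vocabulary, the padding value does not collide with any real word.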

In [28]:
from keras.models import Model
from keras.layers import Dense, Input, Embedding
from keras.layers import LSTM

latent_dim = 100  # Latent dimensionality of the encoding space.
embedding_dim = 32

encoder_input = Input(shape=(None,), name='encoder_inputs')
encoder_embedding = Embedding(len(words_indices), embedding_dim, name='embedding_encoder')(encoder_input)
lstm_encoder = LSTM(latent_dim, name='lstm_encoder')(encoder_embedding)
output_encoder = Dense(1, activation='sigmoid')(lstm_encoder)
model = Model(inputs=encoder_input, outputs=output_encoder)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
encoder_inputs (InputLayer)  (None, None)              0         
_________________________________________________________________
embedding_encoder (Embedding (None, None, 32)          160128    
_________________________________________________________________
lstm_encoder (LSTM)          (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
=================================================================
Total params: 213,429
Trainable params: 213,429
Non-trainable params: 0
_________________________________________________________________
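
The parameter counts in the summary can be checked by hand. The calculation below is a worked sketch that assumes a standard LSTM layer (four gates, each with an input kernel, a recurrent kernel and a bias vector) and a vocabulary of the 5000 most common words plus the four special tokens.

```python
vocab_size = 5000 + 4   # common words plus </PAD>, </START_DOC>, </END_DOC>, </UNK>
embedding_dim = 32
latent_dim = 100

embedding_params = vocab_size * embedding_dim
# four gates, each with an input kernel, a recurrent kernel and a bias vector
lstm_params = 4 * (latent_dim * (embedding_dim + latent_dim) + latent_dim)
# one output unit: a weight per LSTM unit plus one bias
dense_params = latent_dim * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 160128 53200 101 213429
```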


In [29]:
model.compile(loss='binary_crossentropy', optimizer='RMSprop', metrics=['accuracy'])

In [32]:
model.fit(X_train_pad, y_train, validation_data=(X_valid_pad, y_valid), epochs=3, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fe2fe188a20>

In [None]:
# YOUR CODE HERE
# CHANGE BELOW CODE FOR TRAINING THE MODEL
# 1. Increase epoch number for inspecting the error loss through epochs 
#    (optional - if your computation resource is sufficient)
# 2. Add callback function for historical error loss and accuracy during training and validation stage
# 3. Plot history of error loss and accuracy (with matplotlib or any available library)
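
One possible way to approach the plotting step is sketched below. It assumes the dictionary found in the `history` attribute of the object returned by `model.fit`, with lists under `loss`, `val_loss` and the accuracy keys. The helper names `metric_keys`, `best_epoch` and `plot_history`, as well as the output file name, are hypothetical. The accuracy key is `acc` in older Keras versions and `accuracy` in newer ones, so the sketch checks for both.

```python
def metric_keys(history):
    """Return (train_acc_key, val_acc_key) for whichever naming Keras used."""
    acc = 'accuracy' if 'accuracy' in history else 'acc'
    return acc, 'val_' + acc

def best_epoch(history, key='val_loss'):
    """1-based epoch at which the monitored loss is lowest."""
    values = list(history[key])
    return values.index(min(values)) + 1

def plot_history(history, path='training_history.png'):
    """Save loss and accuracy curves side by side to an image file."""
    import matplotlib
    matplotlib.use('Agg')   # render to a file, no display needed
    import matplotlib.pyplot as plt

    acc, val_acc = metric_keys(history)
    epochs = range(1, len(history['loss']) + 1)
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
    ax_loss.plot(epochs, history['loss'], label='train')
    ax_loss.plot(epochs, history['val_loss'], label='validation')
    ax_loss.set_xlabel('epoch')
    ax_loss.set_ylabel('loss')
    ax_loss.legend()
    ax_acc.plot(epochs, history[acc], label='train')
    ax_acc.plot(epochs, history[val_acc], label='validation')
    ax_acc.set_xlabel('epoch')
    ax_acc.set_ylabel('accuracy')
    ax_acc.legend()
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
```

A typical use would be `hist = model.fit(...)` followed by `plot_history(hist.history)`. A validation loss that starts rising while the training loss keeps falling is the usual sign of overfitting, and `best_epoch(hist.history)` reports where that turning point lies.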

### 6. Save the trained models and weights

In [33]:
# Save model (create the target directory first if it does not exist)
os.makedirs(MODEL_PATH, exist_ok=True)
model.save(os.path.join(MODEL_PATH,'word_sequence_classification_model.h5'))

In [34]:
# Save trained weight parameters
model.save_weights(os.path.join(MODEL_PATH, 'weights_word_sequence_classification.hdf5'))

### 7. Retrieve the document encoding that has been optimized for the sentiment classification task

* input: validation set
* output: document embedding of validation set

In [None]:
# YOUR CODE HERE
# 1. Generate document embedding from trained model and parameters. 
#    There are several ways to retrieve document embedding from LSTM encoder layer. Choose one.
# 2. Visualize w.r.t. sentiment labels (Use tSNE for dimensionality reduction)
# 3. Evaluate the quality of document embedding on subsequent binary classification task, by:

#    - loss and accuracy of trained model on new unseen documents

#    - MLP classifier built in keras model. Justify your chosen architecture.

#    - Linear model / SVM classifier 

#    IMPORTANT: your task is not to optimize or fine-tune the classifier, but to evaluate the quality of the embedding
#    with the most minimal classifier parameter settings.
#    Discuss the results and, if there is one, a better way to evaluate the quality of the embeddings.

# You may need to store your encoder model, full model, and weights for the next task (TASK 8)

### 8. Document similarity task from new unseen documents

In [None]:
# YOUR CODE HERE
# 1. Generate document embeddings from the trained model for the new unseen documents (preprocessed unlabelled data)
# 2. Sample 10 unseen unlabelled document embeddings and compute their similarity to the
#    labelled document embeddings obtained previously
#    - compute document similarity
#    - classify the 10 new document embeddings according to the similarity measurement;
#      define the decision rule used to assign these 10 new labels
# 3. Visualize the results for the 10 new unseen documents and evaluate the similarity results.
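
One candidate decision rule for this task is a k-nearest-neighbour vote under cosine similarity. The sketch below is an illustration only, not the required solution: the function names and the toy 2-D vectors are hypothetical, and real embeddings would have `latent_dim` columns.

```python
import numpy as np

def cosine_similarity(queries, references):
    """Pairwise cosine similarity, shape (n_queries, n_references)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    r = references / np.linalg.norm(references, axis=1, keepdims=True)
    return q @ r.T

def knn_labels(queries, references, ref_labels, k=3):
    """Assign each query the majority label of its k most similar references."""
    sims = cosine_similarity(queries, references)
    nearest = np.argsort(-sims, axis=1)[:, :k]   # indices of the top-k neighbours
    votes = ref_labels[nearest].mean(axis=1)     # fraction of positive neighbours
    return (votes > 0.5).astype(int)

# Toy 2-D "embeddings" for illustration only.
refs = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
ref_labels = np.array([1, 1, 0, 0])
new_docs = np.array([[1.0, 0.0], [0.0, 1.0]])
predicted = knn_labels(new_docs, refs, ref_labels, k=3)
print(predicted)  # [1 0]
```

An odd `k` avoids tied votes for binary labels. The sketch assumes no embedding is the all-zero vector, since that would make the normalisation divide by zero.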
