## Character-level Sequence Model for Sentiment Classification

In [1]:
from __future__ import print_function

### 1. Data set

We use the IMDB review data set: each review text is encoded and then classified by sentiment polarity. The data is taken from https://www.kaggle.com/c/word2vec-nlp-tutorial/data. The training set contains 25000 reviews labeled 0 for "negative" sentiment and 1 for "positive" sentiment. For the validation and test set, the binary labels (0 and 1) can be derived from the "id" attribute: the number after the character '\_' is the rating score. A rating < 5 means label 0 ("negative"), and a rating >= 7 means label 1 ("positive"). The code in this notebook applies the rule as label 1 for a rating >= 7 and label 0 otherwise.
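
The labeling rule for the validation reviews can be sketched as a small helper. The function name `label_from_id` and its `positive_threshold` parameter are hypothetical; the threshold of 7 follows the rule stated above.

```python
def label_from_id(review_id, positive_threshold=7):
    """Derive a binary sentiment label from an id such as '7759_3'."""
    # Ids may still carry double quotes when the file is read with quoting=3.
    rating = int(review_id.strip('"').split('_')[1])
    return 1 if rating >= positive_threshold else 0


print(label_from_id('12311_10'))  # rating 10
print(label_from_id('"7759_3"'))  # rating 3
```

Keeping the threshold as a parameter makes it easy to see how the labels change if the cut-off is moved.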

Example of (part of) original text in data set:

```
id	sentiment	review

"7759_3"	0	"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature's most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger ."

```

### 2. Problem Definition

Given a text (e.g. a movie review), we need to predict whether the review is positive (class label=1) or negative (class label=0).

Tasks:
* Encode the text at character level, using a bidirectional LSTM as the encoder model
* Project the output of the encoder model to a dense prediction layer

### 3. Preprocessing

* Remove HTML tags
* Remove non-informative characters (here: non-ASCII characters)
* Keep only the first 1000 characters of each review
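
The three steps above can be sketched as one function. This is a minimal illustration, not the notebook's own code: the name `preprocess` is hypothetical, and lowercasing is included here although the notebook applies it in a later cell.

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')
NON_ASCII_PATTERN = re.compile(r'[^\x00-\x7f]')


def preprocess(text, max_chars=1000):
    """Strip HTML tags, drop non-ASCII characters, lowercase, and truncate."""
    text = TAG_PATTERN.sub('', text)
    text = NON_ASCII_PATTERN.sub('', text)
    return text.lower()[:max_chars]


print(preprocess('A <b>GREAT</b> film!<br /><br />Loved it.'))
```

Note that replacing a tag with the empty string joins the words on either side of it; substituting a space instead is a possible alternative.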

In [2]:
import os
import sys
import numpy as np
import pandas as pd
pd.options.display.max_colwidth = 100
import re

DATA_PATH = 'data'
EMBEDDING_PATH = 'embedding'
MODEL_PATH = 'model'

In [3]:
import pickle

# read an object from a file in pickle format
def readPickle(pickleFilename):
    with open(pickleFilename, 'rb') as f:
        return pickle.load(f)

# write an object to a file in pickle format
def savePickle(dataToWrite, pickleFilename):
    with open(pickleFilename, 'wb') as f:
        pickle.dump(dataToWrite, f)

In [4]:
data = pd.read_csv(os.path.join(DATA_PATH,"labeledTrainData.tsv"), header=0, delimiter="\t", quoting=3)

In [5]:
valid_data = pd.read_csv(os.path.join(DATA_PATH,"testData.tsv"), header=0, delimiter="\t")

In [6]:
def striphtml(html):
    p = re.compile(r'<.*?>')
    return p.sub('', html)

In [7]:
def clean(s):
    return re.sub(r'[^\x00-\x7f]', r'', s)

In [8]:
data[:5]

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment with MJ i've started listening to his music, watch..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy Hines is a very entertaining film that obviously g..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to P..."
3,"""3630_4""",0,"""It must be assumed that those who praised this film (\""the greatest filmed opera ever,\"" didn't..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious 80's exploitation, hooray! The pre-credits opening..."


In [28]:
valid_data[:5]

Unnamed: 0,id,review
0,12311_10,"Naturally in a film who's main themes are of mortality, nostalgia, and loss of innocence it is p..."
1,8348_2,"This movie is a disaster within a disaster film. It is full of great action scenes, which are on..."
2,5828_4,"All in all, this is a movie for kids. We saw it tonight and my child loved it. At one point my k..."
3,7186_2,"Afraid of the Dark left me with the impression that several different screenplays were written, ..."
4,12128_7,"A very accurate depiction of small time mob life filmed in New Jersey. The story, characters and..."


### 4. Create document corpus (array list of text documents)

#### For training sets

In [9]:
docs = []
sentiments = []
for cont, sentiment in zip(data.review, data.sentiment):
    doc = clean(striphtml(cont))
    doc = doc.lower() 
    docs.append(doc)
    sentiments.append(sentiment)

#### For validation sets

In [29]:
valid_docs = []
valid_labels = []
for docid,cont in zip(valid_data.id, valid_data.review):
    id_label = docid.split('_')
    if(int(id_label[1]) >= 7):
        valid_labels.append(1)
    else:
        valid_labels.append(0)         
    doc = clean(striphtml(cont))
    doc = doc.lower() 
    valid_docs.append(doc)

### 5. Build character level vocabulary index

In [33]:
txt = ''

In [34]:
# join all training documents in one step instead of appending character by character
txt += ''.join(docs)

In [35]:
# join all validation documents in one step instead of appending character by character
txt += ''.join(valid_docs)

In [36]:
chars = set(txt)
print('total chars:', len(chars))

total chars: 71


In [37]:
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [38]:
list(char_indices.items())[:5]

[('_', 61), ('&', 0), ('i', 1), ('q', 2), ('[', 58)]

In [39]:
list(indices_char.items())[:5]

[(0, '&'), (1, 'i'), (2, 'q'), (3, 's'), (4, 'a')]
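
Two properties of this mapping are worth noting: iterating over a `set` gives no guaranteed order, and index 0 is assigned to a real character ('&' in the output above) while the padded arrays later in the notebook also use 0 for empty positions. A minimal sketch of an alternative, which this notebook does not use, sorts the characters and reserves index 0 for padding. The name `build_vocab` is hypothetical.

```python
def build_vocab(texts):
    """Map each character to a stable index, keeping 0 free for padding."""
    chars = sorted(set(''.join(texts)))
    char_to_index = {c: i + 1 for i, c in enumerate(chars)}
    index_to_char = {i: c for c, i in char_to_index.items()}
    return char_to_index, index_to_char


char_to_index, index_to_char = build_vocab(['good film', 'bad film'])
print(char_to_index)
```

With such a mapping the one-hot size would be the number of characters plus one, to account for the padding index.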

In [40]:
# uncomment to store files

#savePickle(char_indices, os.path.join(DATA_PATH,'char_indices'))
#savePickle(indices_char, os.path.join(DATA_PATH,'indices_char'))
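
Storing the mapping matters because the iteration order of a set of strings can differ between Python runs, so a rebuilt `char_indices` may not match the one a saved model was trained with. The round trip below is a self-contained sketch with a toy dictionary and a temporary directory; the toy data and file name are only for illustration.

```python
import os
import pickle
import tempfile

toy_char_indices = {'a': 0, 'b': 1, 'c': 2}  # toy stand-in for char_indices

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'char_indices')
    with open(path, 'wb') as f:
        pickle.dump(toy_char_indices, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == toy_char_indices)
```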

In [41]:
num_chars = len(char_indices)
num_chars

71

#### Padding training sets into fixed length (1000 characters)

In [42]:
maxlen = 1000

X = np.zeros((len(docs), maxlen), dtype=np.int32)
y = np.array(sentiments)

for i, doc in enumerate(docs):
    len_doc = len(doc)
    if len_doc > maxlen:
        txt = doc[:maxlen]
    else:
        txt = doc
    for j, char in enumerate(txt):
        X[i, j] = char_indices[char]

In [43]:
X.shape

(25000, 1000)
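
The truncate-and-pad step in the cell above can be isolated in a small helper. This is a pure-Python sketch with a toy vocabulary; `encode_and_pad` is a hypothetical name, and the notebook itself fills a NumPy array in place.

```python
def encode_and_pad(text, char_to_index, maxlen, pad_value=0):
    """Truncate text to maxlen characters and right-pad the index list."""
    indices = [char_to_index[c] for c in text[:maxlen]]
    return indices + [pad_value] * (maxlen - len(indices))


toy_vocab = {'a': 1, 'b': 2, 'c': 3}
print(encode_and_pad('abc', toy_vocab, maxlen=5))     # shorter than maxlen
print(encode_and_pad('abcabc', toy_vocab, maxlen=5))  # longer than maxlen
```

Slicing with `text[:maxlen]` already handles texts shorter than `maxlen`, so no explicit length check is needed.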

#### Padding validation sets into fixed length (1000 characters)

In [44]:
maxlen = 1000

X_valid = np.zeros((len(valid_docs), maxlen), dtype=np.int32) 
y_valid = np.array(valid_labels)

for i, doc in enumerate(valid_docs):
    len_doc = len(doc)
    if len_doc > maxlen:
        txt = doc[:maxlen]
    else:
        txt = doc
    for j, char in enumerate(txt):
        X_valid[i, j] = char_indices[char]

In [45]:
X_valid.shape

(25000, 1000)

In [46]:
X_train = X[:10000]
X_val = X_valid[:5000]

y_train = y[:10000]
y_val = y_valid[:5000]

### 6. Character-level sequential model

In [48]:
from keras.models import Model
from keras.layers import Dense, Input, Dropout
from keras.layers import LSTM, Lambda, merge, concatenate
import tensorflow as tf
import keras.callbacks

Using TensorFlow backend.


In [49]:
def binarize(x, sz=71):
    return tf.to_float(tf.one_hot(x, sz, on_value=1, off_value=0, axis=-1))

In [50]:
def binarize_outshape(in_shape):
    return in_shape[0], in_shape[1], 71
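
`binarize` turns each character index into a one-hot vector of length 71, so a batch of shape (batch, 1000) becomes (batch, 1000, 71). The same idea in plain Python, with a toy vocabulary size of 4; `one_hot_sequence` is a hypothetical helper, not part of Keras or TensorFlow.

```python
def one_hot_sequence(indices, size):
    """Return one row per index, with a single 1.0 at that index."""
    return [[1.0 if k == idx else 0.0 for k in range(size)] for idx in indices]


rows = one_hot_sequence([2, 0, 1], size=4)
for row in rows:
    print(row)
```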

### Model 1: LSTM layer (Keras sequential model)

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Lambda
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence

model = Sequential()
model.add(Lambda(binarize, output_shape=binarize_outshape,name='embedding_encoder', input_shape=(1000,), dtype='int32'))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='RMSprop', metrics=['accuracy'])
print(model.summary())

### Model 2: Bidirectional LSTM and dropout layers (Keras sequential model)

In [51]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Lambda, Bidirectional
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence

model = Sequential()
model.add(Lambda(binarize, output_shape=binarize_outshape,name='embedding_encoder', input_shape=(1000,), dtype='int32'))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_encoder (Lambda)   (None, 1000, 71)          0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               69632     
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               16512     
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 129       
Total params: 86,273
Trainable params: 86,273
Non-trainable params: 0
_________________________________________________________________
None
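
The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias, and the bidirectional wrapper doubles the total. A small sketch of that arithmetic (the helper names are hypothetical):

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units


bidirectional = 2 * lstm_params(input_dim=71, units=64)
hidden = dense_params(128, 128)
final = dense_params(128, 1)
print(bidirectional, hidden, final, bidirectional + hidden + final)
# prints: 69632 16512 129 86273
```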


In [54]:
model.fit(X_train, y_train, validation_data=(X_val, y_val), batch_size=64, epochs=10)

Train on 10000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fd914963ac8>

#### QA-1

Question: 

* Why do you think this model does not converge? What could be the reason? Can you improve the performance by adding more training data (e.g. 20,000 training examples instead of 10,000)?

### Model 3: Bidirectional LSTM with Keras functional API

The same architecture as Model 2 (here with 32 LSTM units per direction), built with the modular functional API

In [None]:
x_input = Input(shape=(None, ), name='encoder_input')
char_embedding = Lambda(binarize, output_shape=binarize_outshape,name='embedding_encoder')(x_input)
forwards = LSTM(32, return_sequences=False)(char_embedding)
backwards = LSTM(32, return_sequences=False, go_backwards=True)(char_embedding)
merged = concatenate([forwards, backwards],axis=-1)
output = Dropout(0.5)(merged)
output = Dense(128, activation='relu')(output)
output = Dropout(0.5)(output)
output = Dense(1, activation='sigmoid')(output)
model = Model(inputs=x_input, outputs=output)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

### Model 4: Hierarchical Model of Sentence-Document with CNN + LSTM 

#### 1. Create the document corpus as a list of sentences per document (3D input instead of 2D)

In [55]:
docs_sents = []
docs_sents_y = []
for cont, sentiment in zip(data.review, data.sentiment):
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', clean(striphtml(cont)))
    sentences = [sent.lower() for sent in sentences]
    docs_sents.append(sentences)
    docs_sents_y.append(sentiment)
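
The splitting pattern breaks on whitespace that follows a '.' or '?', except after abbreviation-like tokens such as "Mr." or "e.g.". A quick check on a made-up sentence (the example text is only for illustration):

```python
import re

SENTENCE_BOUNDARY = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')

text = 'Mr. Smith arrived. Was it late? Yes.'
sentences = [s.lower() for s in SENTENCE_BOUNDARY.split(text)]
print(sentences)
```

Lowercasing has to happen after the split, as in the cell above, because the abbreviation check relies on the capital letter. The pattern does not split on '!'.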

In [57]:
val_docs_sents = []
val_docs_sents_y = []
for docid,cont in zip(valid_data.id, valid_data.review):
    
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', clean(striphtml(cont)))
    sentences = [sent.lower() for sent in sentences]
    val_docs_sents.append(sentences)
    
    id_label = docid.split('_')
    if(int(id_label[1]) >= 7):
        val_docs_sents_y.append(1)
    else:
        val_docs_sents_y.append(0)   

In [58]:
maxlen = 50 # maximum number of characters in a sentence
max_sentences = 15 # maximum number of sentences in a document

X = np.zeros((len(docs_sents), max_sentences, maxlen), dtype=np.int32) 
y = np.array(docs_sents_y)

for i, doc in enumerate(docs_sents):
    for j, sentence in enumerate(doc):
        if j < max_sentences:
            len_sent = len(sentence) 
            # truncate sentences longer than maxlen characters
            if len_sent > maxlen:
                sent = sentence[:maxlen]
            else:
                sent = sentence
            
            # characters are written in reverse order, right-aligned in the row
            for t, char in enumerate(sent):
                X[i, j, (maxlen - 1 - t)] = char_indices[char]

In [59]:
maxlen = 50 # maximum number of characters in a sentence
max_sentences = 15 # maximum number of sentences in a document

X_val = np.zeros((len(val_docs_sents), max_sentences, maxlen), dtype=np.int32) 
y_val = np.array(val_docs_sents_y)

for i, doc in enumerate(val_docs_sents):
    for j, sentence in enumerate(doc):
        if j < max_sentences:
            len_sent = len(sentence) 
            # truncate sentences longer than maxlen characters
            if len_sent > maxlen:
                sent = sentence[:maxlen]
            else:
                sent = sentence
            
            # characters are written in reverse order, right-aligned in the row
            for t, char in enumerate(sent):
                X_val[i, j, (maxlen - 1 - t)] = char_indices[char]

Notice that the input is 3D: (number of examples, maximum sentences per document, maximum characters per sentence)
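
Each sentence row is filled from the right: character `t` is stored at position `maxlen - 1 - t`, so the text is reversed and the padding zeros sit on the left. A toy sketch of that indexing in pure Python (`encode_reversed` and the toy vocabulary are hypothetical):

```python
def encode_reversed(sentence, char_to_index, maxlen, pad_value=0):
    """Write characters back to front so the first one lands in the last slot."""
    row = [pad_value] * maxlen
    for t, char in enumerate(sentence[:maxlen]):
        row[maxlen - 1 - t] = char_to_index[char]
    return row


toy_vocab = {'a': 1, 'b': 2, 'c': 3}
print(encode_reversed('abc', toy_vocab, maxlen=5))
# prints: [0, 0, 3, 2, 1]
```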

In [70]:
X.shape

(25000, 15, 50)

In [72]:
X_val.shape

(25000, 15, 50)

In [76]:
x_train = X
y_train = y

x_valid = X_val[:5000]
y_valid = y_val[:5000]

#### 2. Character-level Hierarchical Model of Sentence-Document

In [60]:
import tensorflow as tf
from keras.models import Model
from keras.layers import Dense, Input, Dropout, MaxPooling1D, Conv1D, GlobalMaxPool1D
from keras.layers import LSTM, Lambda, Bidirectional, concatenate, BatchNormalization
from keras.layers import TimeDistributed
from keras.optimizers import Adam
from keras.callbacks import Callback

In [61]:
# document input
document = Input(shape=(max_sentences, maxlen), dtype='int32')
# sentence input
in_sentence = Input(shape=(maxlen,), dtype='int32')

#### Sentence encoder

In [62]:
# char indices to one hot matrix, 1D sequence to 2D 
embedded = Lambda(binarize, output_shape=binarize_outshape)(in_sentence)

In [63]:
# embedded: encodes sentence by character with CNN

filter_length = [5, 3, 3]
nb_filter = [196, 196, 256]
pool_length = 2

for i in range(len(nb_filter)):
    embedded = Conv1D(filters=nb_filter[i],
                      kernel_size=filter_length[i],
                      padding='valid',
                      activation='relu',
                      kernel_initializer='glorot_normal',
                      strides=1)(embedded)

    embedded = Dropout(0.1)(embedded)
    embedded = MaxPooling1D(pool_size=pool_length)(embedded)
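
With `padding='valid'`, each convolution shortens the sequence by `kernel_size - 1`, and each pooling step with `pool_size=2` roughly halves it. The sketch below traces the sequence length through the three blocks, starting from `maxlen = 50`; the helper names are hypothetical, and pooling is assumed to use a stride equal to the pool size (the Keras default).

```python
def conv_valid_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1


def max_pool_length(length, pool_size):
    """Output length of 1D max pooling with stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1


length = 50  # maxlen characters per sentence
trace = []
for kernel_size in [5, 3, 3]:
    length = conv_valid_length(length, kernel_size)
    trace.append(length)
    length = max_pool_length(length, 2)
    trace.append(length)
print(trace)
# prints: [46, 23, 21, 10, 8, 4]
```

The bidirectional LSTM that follows therefore reads a sequence of only 4 steps per sentence.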

In [64]:
bi_lstm_sent = Bidirectional(LSTM(128, return_sequences=False))(embedded)

In [65]:
sent_encoder = Model(inputs=in_sentence, outputs=bi_lstm_sent)
sent_encoder.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 50)                0         
_________________________________________________________________
lambda_1 (Lambda)            (None, 50, 71)            0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 46, 196)           69776     
_________________________________________________________________
dropout_3 (Dropout)          (None, 46, 196)           0         
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 23, 196)           0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 21, 196)           115444    
_________________________________________________________________
dropout_4 (Dropout)          (None, 21, 196)           0         
__________

#### Document encoder

In [66]:
encoded = TimeDistributed(sent_encoder)(document)

In [67]:
bi_lstm_doc = Bidirectional(LSTM(128, return_sequences=False))(encoded)

In [68]:
output = Dropout(0.5)(bi_lstm_doc)
output = Dense(128, activation='relu')(output)
output = Dropout(0.5)(output)
output = Dense(1, activation='sigmoid')(output)
model = Model(inputs=document, outputs=output)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 15, 50)            0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 15, 256)           730244    
_________________________________________________________________
bidirectional_3 (Bidirection (None, 256)               394240    
_________________________________________________________________
dropout_6 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 128)               32896     
_________________________________________________________________
dropout_7 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 129       
Total para

In [69]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [78]:
model.fit(x_train, y_train, validation_data=(x_valid, y_valid), batch_size=64, epochs=10)

Train on 25000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fd8ac7f1208>

In [79]:
# Save model
model.save(os.path.join(MODEL_PATH,'practical_3_1_model4.h5'))

In [80]:
# Save weight parameters
model.save_weights(os.path.join(MODEL_PATH, 'weights_practical_3_1_model4.hdf5'))

#### QA-2

Question: 

* What is your overall conclusion after comparing the different model architectures on the character-level sentiment classification task?
* What are the advantages and disadvantages of a character-level sequential (RNN) model for this specific task (or for other possible tasks)?

Hints:
* encoding process of input
* sequence length
* hyperparameters
* training examples
* computational resources