# Sentiment Classification of a Small IMDB Dataset

This notebook shows how to build a reasonably well-performing sentiment classification model from only a few samples. The approach was inspired by [this](https://machinelearningmastery.com/best-practices-document-classification-deep-learning/) article. The resulting model achieved approximately 84% accuracy using just 1,000 training observations and only 10 epochs. The Stanford researchers who compiled this dataset reached 89% on the full 50,000 observations back in 2011, so the result isn't bad!

## Preamble

In [1]:
# loading relevant modules
import os
import re
import numpy as np
from keras import layers
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, Model
from keras.layers import Embedding, Flatten, Dense, LSTM
from keras.layers.merge import concatenate, Concatenate
from keras.regularizers import L1L2
from keras.utils import plot_model

Using TensorFlow backend.


In [2]:
import string # re is already imported in the first cell
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
parser = spacy.load('en_core_web_sm')

In [3]:
imdb_dir = 'datasets/aclImdb'
glove_dir = 'datasets/glove.6B'

## Setting parameters

This is where we decide the size of our dataset.

In [4]:
# parameters
maxlen = 250 # keeps at most N tokens per review (by default pad_sequences drops the earliest tokens of longer reviews)
training_samples = 1000 # trains on a few samples -- small sample sizes matter when data has to be annotated manually
validation_samples = 1000 # validates on N samples
max_words = 10000 # caps the vocabulary at roughly the 10,000 most frequent words

## Text Preprocessing

In [5]:
# text cleaner
punct_scrubber = re.compile('[%s]' % re.escape(string.punctuation))
def clean_text(in_text):
    """Strip punctuation, lowercase, lemmatize with spaCy and drop stopwords."""
    text = punct_scrubber.sub(' ', in_text) # replace punctuation with blanks
    text = re.sub(r'\s+', ' ', text) # condense runs of whitespace into a single blank
    text = text.strip().lower() # trim leading/trailing blanks and convert to lower case
    lemmas = []
    for tok in parser(text): # parse with spacy
        # spaCy v2 lemmatizes pronouns to the placeholder "-PRON-"; keep the original word in that case
        lemmas.append(tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_)
    return ' '.join(tok for tok in lemmas if tok not in STOP_WORDS) # remove stopwords
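
The punctuation and whitespace steps of `clean_text` can be tried in isolation, without loading the spaCy model. The following is a minimal sketch: `scrub` is a hypothetical helper name, the sample sentence is made up, and lemmatization and stopword removal are deliberately left out.

```python
import re
import string

# same pattern as in clean_text: a character class of all ASCII punctuation
punct_scrubber = re.compile('[%s]' % re.escape(string.punctuation))

def scrub(in_text):
    """Hypothetical helper: only the regex part of clean_text (no spaCy)."""
    text = punct_scrubber.sub(' ', in_text)   # punctuation -> blank
    text = re.sub(r'\s+', ' ', text)          # collapse whitespace runs
    return text.strip().lower()               # trim and lowercase

sample = "It isn't bad...  Really, it's GREAT!"
print(scrub(sample))
```

Because apostrophes are replaced by blanks before tokenization, contractions are split apart, which is why stray tokens such as `s` and `t` show up in the cleaned reviews further down.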

## Loading Data

In [6]:
# structure -- each segment gets its own dict and lists (reusing one dict object would make 'train' and 'test' share the same lists)
imdb_data = {
    segment: {'text': [], 'sentiment': []}
    for segment in ['train', 'test']
}

# ingestion
for segment in ['train','test']:
    if segment == 'test':
        continue # to speed things up, we'll just use the train set
    for label in ['pos','neg']:
        target_dir = '{prefix}/{seg}/{sentiment}'.format(prefix=imdb_dir, seg=segment,sentiment=label)
        i = 0
        for fname in os.listdir(target_dir):
            if fname.endswith('.txt'):
                if i < training_samples + validation_samples:
                    i += 1
                    # append text data (the context manager closes the file even if cleaning fails)
                    with open('{}/{}'.format(target_dir, fname), encoding='utf-8') as active_file:
                        imdb_data[segment]['text'].append(clean_text(active_file.read()))

                    # append sentiment data
                    if label == 'neg':
                        sentiment_score = 0
                    else:
                        sentiment_score = 1

                    imdb_data[segment]['sentiment'].append(sentiment_score)

# NOTE: sentiment labels are usually preferred at the sentence level rather than for an entire passage of text; that isn't the case here

## Ingest GloVe Vector Embeddings

GloVe gives us dense vector representations of words that can be used to infer relationships between words. Since we are working with a small dataset, we don't want to re-learn these representations from raw data.

In [7]:
# load embeddings line by line
e_dim = 300 # options: 50, 100, 200, 300
embeddings_index = {}
with open('{}/glove.6B.{}d.txt'.format(glove_dir, e_dim), encoding='utf-8') as g_file:
    for line in g_file:
        values = line.split()
        word = values[0] # first field is the word, the rest are its vector components
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found {} word vectors.'.format(len(embeddings_index)))

Found 400000 word vectors.
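
Each line of a GloVe text file holds a word followed by its vector components, and words with related meanings tend to end up with similar vectors. The sketch below parses a few lines in that layout and compares them with cosine similarity. The three lines and their values are invented for illustration (real vectors have 50 to 300 components), and `cosine` is a hypothetical helper.

```python
import math

# made-up lines in GloVe's text layout: a word, then space-separated floats
# (toy values, NOT real GloVe data)
toy_lines = [
    "good 0.9 0.1 0.3",
    "great 0.8 0.2 0.4",
    "awful -0.7 0.3 -0.2",
]

toy_index = {}
for line in toy_lines:
    values = line.split()
    toy_index[values[0]] = [float(v) for v in values[1:]]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_close = cosine(toy_index["good"], toy_index["great"])
sim_far = cosine(toy_index["good"], toy_index["awful"])
print(sim_close, sim_far)
```

With these toy values the first similarity is close to 1 and the second is negative, which is the kind of relationship the pre-trained vectors are meant to encode.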


## Preparing Data for Modelling

In [8]:
# tokenize text, pad sequences etc
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(imdb_data['train']['text']) # since we are using a limited sample, only going to use the train set here
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(imdb_data['train']['text'])
target_sequences = pad_sequences(sequences, maxlen=maxlen)
target_labels = np.asarray(imdb_data['train']['sentiment'])

# the data is currently sorted into positive then negative, shuffle it
indices = np.arange(target_sequences.shape[0])
np.random.shuffle(indices)
target_sequences = target_sequences[indices]
target_labels = target_labels[indices]

pp_data = {}
for segment in ['train', 'test']:
    
    if segment == 'train':
        target_sequences_s = target_sequences[:training_samples]
        target_labels_s = target_labels[:training_samples]
    else:
        target_sequences_s = target_sequences[training_samples: training_samples + validation_samples]
        target_labels_s = target_labels[training_samples: training_samples + validation_samples]
    
    pp_data[segment] = {
        'sequence': target_sequences_s,
        'label': target_labels_s
    }
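
`pad_sequences` is called here with its defaults, which pad and truncate at the *start* of a sequence (`padding='pre'`, `truncating='pre'`). The pure-Python function below, `pad_pre`, is a hypothetical stand-in written only to illustrate that behaviour on toy input; it is not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Illustrative stand-in for pad_sequences(..., padding='pre', truncating='pre')."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # keep the LAST maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)   # left-pad with zeros
    return padded

toy_sequences = [[5, 2], [7, 1, 4, 9, 3, 8]]
toy_padded = pad_pre(toy_sequences, maxlen=4)
print(toy_padded)
```

With `maxlen = 250`, a review longer than 250 tokens therefore loses its opening words rather than its ending.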

## Mapping Observed Words to GloVe Embeddings

In [9]:
# find target words in the GloVe dictionary -- this will allow us to skip training the embedding layer
embedding_matrix = np.zeros((max_words, e_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
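
Words without a GloVe vector keep an all-zero row in `embedding_matrix`, so it can be worth checking how much of the vocabulary is actually covered. The sketch below uses toy stand-ins for `word_index` and `embeddings_index`; every name prefixed with `toy_` and every value in it is invented for illustration.

```python
toy_word_index = {"movie": 1, "good": 2, "bromwell": 3, "zsigmond": 4}
toy_embeddings = {"movie": [0.1, 0.2], "good": [0.3, 0.4], "zsigmond": [0.5, 0.6]}
toy_max_words = 4  # only indices below this cap are used, as in the notebook

# words that make it into the embedding matrix
in_vocab = [w for w, i in toy_word_index.items() if i < toy_max_words]
# words among them that have no pre-trained vector (their rows stay zero)
missing = [w for w in in_vocab if w not in toy_embeddings]
coverage = 1 - len(missing) / len(in_vocab)
print(missing, coverage)
```

A low coverage would suggest that the cleaning step produces tokens GloVe does not know, for example unusual lemmas or names.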

## Model Definition

A few notes about the choice of the model:

  - Since we are working with a small sample, we don't want to re-learn the embeddings, which is why the embedding layer is declared non-trainable
  - To reduce overfitting, dropout is used throughout the architecture
  - The number of neurons was kept low in all layers -- this appeared to increase stability
  - Since sentiment is often expressed in phrases rather than individual words, convolutional layers with multiple filter sizes (spanning 2 to 5 consecutive word embeddings) are used -- this approach was suggested [here](https://machinelearningmastery.com/best-practices-document-classification-deep-learning/)
  - The order of phrases may also matter, which is why an LSTM layer is used. Moreover, for a more thorough scan of the text, a bidirectional wrapper runs the LSTM both ways, start to end and end to start
  - After flattening the concatenated Conv-LSTM outputs for the different phrase lengths, a dense MLP was added -- experimentation showed that it boosted performance

In [78]:
# stacked model -- idea: we want to look for multi word (embedding) patterns (of various sizes) using convolutions which will then be treated as pattern sequences by LSTMs and the result will be fed into a dense layer
convs = []
filter_sizes = (2,3,4,5)
graph_in = layers.Input(shape=(maxlen, e_dim))
for fsz in filter_sizes:
    conv = layers.Conv1D(16,fsz,padding='valid',activation='relu',strides=1)(graph_in)
    pool = layers.MaxPooling1D(maxlen-fsz+1)(conv)
    lstm_layer = layers.Bidirectional(LSTM(4, return_sequences=True, kernel_regularizer=L1L2(l1=0.0, l2=0.01)))(pool)
    convs.append(lstm_layer)
    
model_out = concatenate(convs, axis=-1)
graph = Model(graph_in, model_out)

stacked_model = Sequential()
stacked_model.add(Embedding(max_words, e_dim, input_length=maxlen, weights=[embedding_matrix], trainable=False)) # frozen lookup of the pre-trained GloVe vectors
stacked_model.add(layers.SpatialDropout1D(0.1))
stacked_model.add(graph)
stacked_model.add(layers.Flatten())
stacked_model.add(layers.Dropout(0.25))
stacked_model.add(Dense(16, activation='relu'))
stacked_model.add(layers.Dropout(0.25))
stacked_model.add(Dense(1, activation='sigmoid'))
stacked_model.summary()
graph.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_44 (Embedding)     (None, 250, 300)          3000000   
_________________________________________________________________
spatial_dropout1d_42 (Spatia (None, 250, 300)          0         
_________________________________________________________________
model_3 (Model)              (None, 1, 32)             69952     
_________________________________________________________________
flatten_9 (Flatten)          (None, 32)                0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_20 (Dense)             (None, 16)                528       
_________________________________________________________________
dropout_2 (Dropout)          (None, 16)                0         
__________
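
The shapes flowing through each convolutional branch can be worked out by hand. The sketch below assumes the standard output-length formulas for a `'valid'` 1D convolution and for max pooling whose stride equals the pool size (the Keras default); `conv_out_len` and `pool_out_len` are hypothetical helper names.

```python
def conv_out_len(n_steps, kernel, stride=1):
    """Output length of a 'valid' 1D convolution."""
    return (n_steps - kernel) // stride + 1

def pool_out_len(n_steps, pool):
    """Output length of 'valid' max pooling with stride == pool size."""
    return (n_steps - pool) // pool + 1

maxlen = 250
lstm_units = 4
branch_shapes = {}
for fsz in (2, 3, 4, 5):
    steps = conv_out_len(maxlen, fsz)              # e.g. 249 for fsz=2
    steps = pool_out_len(steps, maxlen - fsz + 1)  # pool window spans all steps
    branch_shapes[fsz] = (steps, 2 * lstm_units)   # bidirectional doubles the units

concat_features = sum(shape[1] for shape in branch_shapes.values())
print(branch_shapes, concat_features)
```

Each branch is pooled down to a single time step before it reaches the LSTM, and the four branches concatenate to 32 features, which matches the `(None, 1, 32)` output shape reported in the summary above.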

In [79]:
# run the model
N_EPOCH = 10

# compile
stacked_model.compile(
    optimizer='rmsprop',
    loss='binary_crossentropy',
    metrics=['acc']
)

# training
stacked_model.fit(
    pp_data['train']['sequence'], 
    pp_data['train']['label'],
    epochs=N_EPOCH,
    validation_data=(
        pp_data['test']['sequence'], 
        pp_data['test']['label']
    )
) # ideally we would also need a held out dataset for the final evaluation, but we'll stop here.

Train on 1000 samples, validate on 1000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1f4496c50>
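
As the comment in the training cell notes, a final evaluation would need data that was used neither for fitting nor for model selection. One way to carve that out of the shuffled indices is sketched below; `three_way_split` and the extra 1,000-item test portion are hypothetical and not part of the notebook.

```python
import random

def three_way_split(n_items, n_train, n_val, n_test, seed=0):
    """Shuffle indices once and cut them into disjoint train/val/test parts."""
    if n_train + n_val + n_test > n_items:
        raise ValueError("not enough items for the requested split")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)   # seeded for reproducibility
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

train_idx, val_idx, test_idx = three_way_split(4000, 1000, 1000, 1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The returned index lists could then be used to slice `target_sequences` and `target_labels` in the same way the existing train/validation split does.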

# Part 2: Having fun generating negative reviews

In [43]:
# collect all negative reviews (sentiment label 0) from the training set
neg_reviews = []
for i in range(len(imdb_data['train']['text'])):
    if imdb_data['train']['sentiment'][i] == 0:
        neg_reviews.append(imdb_data['train']['text'][i])

In [44]:
# break each review into sentences, then break them into 

'story man unnatural feeling pig start opening scene terrific example absurd comedy formal orchestra audience turn insane violent mob crazy chanting s singer unfortunately stay absurd time general narrative eventually era turn cryptic dialogue shakespeare easy grader technical level s good think good cinematography future great vilmo zsigmond future star sally kirkland frederic forrest briefly'

In [None]:
# concatenate all negative reviews into 1 long text string

# collect all negative reviews from the training set
extracts = []
for i in range(len(imdb_data['train']['text'])):
    if imdb_data['train']['sentiment'][i] == 0:
        extracts.append(imdb_data['train']['text'][i])
        
# concatenate into a single text string
extracts = '. '.join(extracts)


In [27]:
test_string = 'i like pies.. pies are the best'
parsed_test = parser(test_string)

In [28]:
for i in parsed_test.sents:
    print(i)

i like pies..
pies are the best


In [8]:
imdb_data['train']['text'][0]

'bromwell high cartoon comedy run time program school life teacher 35 year teaching profession lead believe bromwell high s satire close reality teacher scramble survive financially insightful student right pathetic teacher pomp pettiness situation remind school know student episode student repeatedly try burn school immediately recall high classic line inspector m sack teacher student welcome bromwell high expect adult age think bromwell high far fetched pity isn t'

In [10]:
test_parser = parser(imdb_data['train']['text'][0])
for s in test_parser.sents:
    for w in s:
        print(w)

bromwell
high
cartoon
comedy
run
time
program
school
life
teacher
35
year
teaching
profession
lead
believe
bromwell
high
s
satire
close
reality
teacher
scramble
survive
financially
insightful
student
right
pathetic
teacher
pomp
pettiness
situation
remind
school
know
student
episode
student
repeatedly
try
burn
school
immediately
recall
high
classic
line
inspector
m
sack
teacher
student
welcome
bromwell
high
expect
adult
age
think
bromwell
high
far
fetched
pity
isn
t


In [38]:
example_set = [txt for txt, sentiment in zip(imdb_data['train']['text'], imdb_data['train']['sentiment']) if sentiment == 0]
example_set = example_set[0:10]

def extract_token_seq(text_string):
    parsed_text = parser(text_string)
    sent_list = []
    for s in parsed_text.sents:
        word_list = []
        for w in s:
            word_list.append(w)
        sent_list.append(word_list)
    return sent_list
    
test = list(map(extract_token_seq, example_set))

In [42]:
flat_list = [item for sublist in test for item in sublist]
test[0]

[[story,
  man,
  unnatural,
  feeling,
  pig,
  start,
  opening,
  scene,
  terrific,
  example,
  absurd,
  comedy,
  formal,
  orchestra,
  audience,
  turn,
  insane,
  violent,
  mob,
  crazy,
  chanting,
  s,
  singer,
  unfortunately,
  stay,
  absurd,
  time,
  general,
  narrative,
  eventually,
  era,
  turn,
  cryptic,
  dialogue,
  shakespeare,
  easy,
  grader,
  technical,
  level,
  s,
  good,
  think,
  good,
  cinematography,
  future,
  great,
  vilmo,
  zsigmond,
  future,
  star,
  sally,
  kirkland,
  frederic,
  forrest,
  briefly]]

In [21]:
test_parser = parser(example_set[0])
sent_list = []
for s in test_parser.sents:
    word_list = []
    for w in s:
        word_list.append(w)
    sent_list.append(word_list)
sent_list

[[bromwell,
  high,
  cartoon,
  comedy,
  run,
  time,
  program,
  school,
  life,
  teacher,
  35,
  year],
 [teaching,
  profession,
  lead,
  believe,
  bromwell,
  high,
  s,
  satire,
  close,
  reality,
  teacher,
  scramble,
  survive,
  financially,
  insightful,
  student,
  right,
  pathetic,
  teacher,
  pomp,
  pettiness,
  situation,
  remind,
  school,
  know,
  student,
  episode,
  student,
  repeatedly,
  try,
  burn,
  school,
  immediately,
  recall,
  high,
  classic,
  line,
  inspector,
  m,
  sack,
  teacher,
  student,
  welcome,
  bromwell,
  high,
  expect,
  adult,
  age,
  think,
  bromwell,
  high,
  far,
  fetched,
  pity,
  isn,
  t]]

In [32]:
['a','b','c'][True,True,False]

TypeError: list indices must be integers or slices, not tuple
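
The `TypeError` arises because `True,True,False` inside the brackets is parsed as the tuple `(True, True, False)`, and plain Python lists can only be indexed by integers or slices. Two standard-library ways to get the intended mask-style filtering are sketched below; the variable names are illustrative.

```python
from itertools import compress

letters = ['a', 'b', 'c']
mask = [True, True, False]

# option 1: itertools.compress keeps items whose mask entry is truthy
kept_compress = list(compress(letters, mask))

# option 2: a list comprehension over paired items and flags
kept_comprehension = [item for item, keep in zip(letters, mask) if keep]

print(kept_compress, kept_comprehension)
```

NumPy arrays, by contrast, do accept a boolean array as an index, which is why this pattern works on `target_sequences` but not on a plain list.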