# Sentiment Analysis LSTM Network



## Disclaimer
This notebook is my attempt to rewrite, in TensorFlow 2.0, the neural network described in the Towards Data Science article "Sentiment Analysis using LSTM" (http://app.n26.com), using the IMDB Review Dataset from http://ai.stanford.edu/~amaas/data/sentiment.



## Imports


In [47]:
import numpy as np
import pandas as pd
import os

import matplotlib.pyplot as plt
%matplotlib inline

from string import punctuation
from collections import Counter

import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

import urllib.request
import tarfile

In [48]:
print('Version should be 2.0 or higher.\nCurrent version is "{}".'.format(tf.__version__))

Version should be 2.0 or higher.
Current version is "2.0.0-alpha0".


## Load Dataset

First download the data from http://ai.stanford.edu/~amaas/data/sentiment. This takes some time to run, so please be patient.

In [49]:
print('Downloading dataset...')

url = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
urllib.request.urlretrieve(url, 'data.tar.gz')

print('Unpacking dataset...')

with tarfile.open('data.tar.gz', "r:gz") as tar:
    tar.extractall()
    
print('Done.')



Downloading dataset...


Unpacking dataset...
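
Because the download is slow, it can help to skip it when the archive is already on disk. The following is a minimal sketch of that idea; `download_once` is a hypothetical helper and not part of the original notebook.

```python
import os
import urllib.request


def download_once(url, target_path):
    """Download url to target_path unless that file already exists.

    Hypothetical helper for illustration. Returns True if a download
    happened and False if the existing file was reused.
    """
    if os.path.exists(target_path):
        return False
    urllib.request.urlretrieve(url, target_path)
    return True
```

Calling `download_once(url, 'data.tar.gz')` in place of the plain `urlretrieve` call would then only hit the network on the first run. Note that it checks only for the file's existence, so a partially downloaded archive would have to be deleted by hand.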


Create the dataframes by loading the data from the downloaded file and adding the labels.

In [46]:
def create_dataset(train_test):
    """
        Creates a Dataframe with one review column containing the reviews as Strings and a 
        sentiment column of datatype int with 0 being negative and 1 being positive sentiment.
        
    :param train_test: Accepts either 'train' or 'test' to create the respective Dataframe. 
    :return: A Dataframe containing the reviews and the sentiment.
    """
    pos = load_data_as_dataframe(train_test, 'pos')
    neg = load_data_as_dataframe(train_test, 'neg')
    
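    # Stack the positive and the negative reviews into one Dataframe.
    # (df.append returns a new Dataframe, so its result must not be discarded.)
    df = pd.concat([pos, neg], ignore_index=True)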
    
    df['sentiment'] = pd.to_numeric(df['sentiment'])
    return df


def load_data_as_dataframe(train_test, label):
    """
        This function retrieves the data with a certain label from the directory specified.
    
    :param train_test: A string being either 'train' or 'test', specifying which directory should be targeted.
    :param label: A string being either 'pos' or 'neg' specifying if the positive or the negative reviews should be assembled.
    :return: A Dataframe with a review column (dtype=String) and a sentiment column (dtype=Int).
    """
    sentiment = 1 if label == 'pos' else 0
    reviews = gather_reviews_from_directory(train_test, label)
    reviews = pd.DataFrame({'review': reviews})
    reviews['sentiment'] = pd.Series(np.array([sentiment] * len(reviews)), index=reviews.index)
    return reviews


def gather_reviews_from_directory(train_test, label):
    """
        This function gathers the reviews from the single files and returns them as a Pandas 
        Series Object.
        
    :param train_test: A string being either 'train' or 'test', specifying which directory should be targeted.
    :param label: A string being either 'pos' or 'neg' specifying if the positive or the negative reviews should be assembled.
    :return: A Pandas Series containing one review (String) per file.
    """
    directory = os.path.join('aclImdb', train_test, label)
    content = []
    for filename in os.listdir(directory):
        # Pass the encoding explicitly so reading does not depend on the platform default.
        with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
            content.append(file.read())
            
    return pd.Series(content)


train = create_dataset('train')
test = create_dataset('test')

train.head()


| | review | sentiment |
|---|---|---|
| 0 | For a movie that gets no respect there sure ar... | 1 |
| 1 | Bizarre horror movie filled with famous faces ... | 1 |
| 2 | A solid, if unremarkable film. Matthau, as Ein... | 1 |
| 3 | It's a strange feeling to sit alone in a theat... | 1 |
| 4 | You probably all already know this by now, but... | 1 |


## Preprocessing

A major change to the code in the article is how words that occur in the test set but not in the training set are handled. In the article the word-to-int mapping is collected from the entire dataset; here it is collected only from the training set. This makes an unknown word token necessary.
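
To illustrate the idea on a toy example, here is a small sketch in plain Python. The names `UNK`, `build_vocab` and `encode` are hypothetical and only used for this illustration; the notebook's own implementation follows in the next cell.

```python
from collections import Counter

UNK = '<unk>'  # hypothetical name for the unknown-word token


def build_vocab(train_texts):
    """Map every training word to an int, most frequent first. 0 stays free for padding."""
    counts = Counter(word for text in train_texts for word in text.split())
    vocab = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}
    # The unknown token gets its own index, one past the last word index.
    vocab[UNK] = len(vocab) + 1
    return vocab


def encode(text, vocab):
    """Replace each word by its index; words not seen in training map to UNK."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]


vocab = build_vocab(['good movie', 'bad movie'])
print(encode('good film', vocab))  # 'film' was never seen in training
```

Here `film` does not appear in the training texts, so it is encoded with the index of the unknown token rather than raising a `KeyError`.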

In [36]:
UNKNOWN_WORD_TOKEN = -1


def word_to_int_mapping(df, vocab_to_int, batch_size):
    """
        Transforms the String reviews in a dataframe into int encoded reviews and
        wraps them, together with the sentiment labels, in a batched dataset.
        
    :param df: Dataframe to transform
    :param vocab_to_int: Word to int mapping as a dictionary
    :param batch_size: Batch size for the training
    :return: Shuffled and batched tf.data.Dataset of (features, targets)
    """
    reviews_int = []
    for review in df['review']:
        r = []
        for w in review.split():
            if w in vocab_to_int:
                r.append(vocab_to_int[w])
            else:
                r.append(vocab_to_int[UNKNOWN_WORD_TOKEN])
        reviews_int.append(r)
        
    features = pad_features(reviews_int, 200).reshape(-1, 200)
    targets = df['sentiment'].values.reshape(-1, 1)
    
    dataset = np_to_tf_dataset(features, targets)
    # The reviews are ordered by label, so the shuffle buffer covers the whole dataset.
    return dataset.shuffle(len(features)).batch(batch_size)


def pad_features(reviews_int, seq_length):
    """
        Return features of reviews_int, where each review is left-padded with 0's or truncated to the input seq_length.
    
    :param reviews_int: Reviews as lists of ints
    :param seq_length: Length of the sequence for each review
    :return: 2D int array of shape (number of reviews, seq_length)
    """
    features = np.zeros((len(reviews_int), seq_length), dtype=int)

    for i, review in enumerate(reviews_int):
        # Keep at most the first seq_length words of the review.
        truncated = review[:seq_length]
        if truncated:
            # Write the review at the right end of the row; the left part stays 0.
            features[i, seq_length - len(truncated):] = truncated

    return features


def np_to_tf_dataset(np_X, np_y):
    """
        Wraps Numpy Arrays in a tf.data.Dataset.
        
    :param np_X: Features Array
    :param np_y: Target Array
    :return: tf.data.Dataset yielding (features, targets) pairs
    """
    return tf.data.Dataset.from_tensor_slices(
        (
            # The features are word indices for the Embedding layer, so they stay integers.
            tf.cast(np_X, tf.int32),
            tf.cast(np_y, tf.float32)
        )
    )


train['review'] = train['review'].apply(lambda x: x.lower())
train['review'] = train['review'].apply(lambda x: ''.join([c for c in x if c not in punctuation]))

test['review'] = test['review'].apply(lambda x: x.lower())
test['review'] = test['review'].apply(lambda x: ''.join([c for c in x if c not in punctuation]))
   
# Encoding 
all_text = ' '.join(train['review'])
words = all_text.split()
count_words = Counter(words)
total_words = len(words)
sorted_words = count_words.most_common(total_words)
    
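vocab_to_int = {w: i + 1 for i, (w, c) in enumerate(sorted_words)}
# The unknown word token gets its own index, one past the last word index.
# (len(vocab_to_int) alone would collide with the index of the least frequent word.)
vocab_to_int[UNKNOWN_WORD_TOKEN] = len(vocab_to_int) + 1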
    
train_ds = word_to_int_mapping(train, vocab_to_int, 32)
test_ds = word_to_int_mapping(test, vocab_to_int, 32)
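
To make the padding behaviour concrete, here is a plain-Python sketch of what happens to a single review. `pad_or_truncate` is a hypothetical helper used only for illustration; it mirrors the left-padding and truncation of `pad_features` with a sequence length of 5 instead of 200.

```python
def pad_or_truncate(tokens, seq_length, pad_value=0):
    """Left-pad tokens with pad_value, or keep only the first seq_length tokens."""
    if len(tokens) >= seq_length:
        return tokens[:seq_length]
    return [pad_value] * (seq_length - len(tokens)) + tokens


short_review = pad_or_truncate([5, 8, 2], 5)
long_review = pad_or_truncate([5, 8, 2, 9, 4, 7], 5)

print(short_review)  # padded on the left
print(long_review)   # cut off after the fifth token
```

Padding on the left keeps the actual words at the end of the sequence, which are the last inputs the LSTM sees before it produces its output.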


## Creating the Model

In [37]:
class SentimentLSTM(Model):
    def __init__(self, vocab_size, output_size, embedding_dim, lstm_layers):
        super(SentimentLSTM, self).__init__()
        self.embedding = Embedding(vocab_size, embedding_dim, input_length=200)
        # lstm_layers is the number of units per LSTM layer, not the number of layers.
        self.lstm1 = LSTM(lstm_layers, dropout=0.2, activation='tanh', return_sequences=True)
        self.lstm2 = LSTM(lstm_layers, dropout=0.2, activation='tanh')
        self.dropout = Dropout(0.3)
        self.dense = Dense(output_size, activation='sigmoid')

    def call(self, x, training=None):
        # The training flag switches the dropout in the LSTM and Dropout layers on or off.
        x = self.embedding(x)
        x = self.lstm1(x, training=training)
        x = self.lstm2(x, training=training)
        x = self.dropout(x, training=training)
        x = self.dense(x)
        return x


# Vocab_size + 1 due to the 0 padding.
vocab_size = len(vocab_to_int) + 1 
output_size = 1
embedding_dim = 400
lstm_layers = 256
model = SentimentLSTM(vocab_size, output_size, embedding_dim, lstm_layers)
# TODO find way to make print statement show layers similar to pytorch.
print(model)


<__main__.SentimentLSTM object at 0x1312703c8>


## Training the Model

The article uses a loss function called BCELoss, which stands for Binary Cross Entropy. BinaryCrossentropy is the TensorFlow counterpart and will be used here.
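
For reference, binary cross entropy averages -[y * log(p) + (1 - y) * log(1 - p)] over all samples, where y is the label (0 or 1) and p the predicted probability. The following is a plain-Python sketch of that formula, not the TensorFlow implementation; the function name and the clipping value `eps` are assumptions made for this illustration.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip the prediction so that log() never receives 0.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1]
confident_and_right = binary_cross_entropy(labels, [0.9, 0.1, 0.8])
confident_and_wrong = binary_cross_entropy(labels, [0.1, 0.9, 0.2])
print(confident_and_right, confident_and_wrong)
```

The loss is small when the predicted probabilities agree with the labels and grows quickly when the network is confident and wrong.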

In [38]:
lr = 0.001

loss_object = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam(lr)

train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.BinaryAccuracy(name='train_accuracy')

test_loss = tf.keras.metrics.Mean(name='test_loss')
test_accuracy = tf.keras.metrics.BinaryAccuracy(name='test_accuracy')

Setting up the training and test step functions.

In [39]:
@tf.function
def train_step(reviews, labels):
    with tf.GradientTape() as tape:
        # training=True activates the dropout while training.
        predictions = model(reviews, training=True)
        loss = loss_object(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    train_loss(loss)
    train_accuracy(labels, predictions)


@tf.function
def test_step(reviews, labels):
    # training=False deactivates the dropout during evaluation.
    predictions = model(reviews, training=False)
    t_loss = loss_object(labels, predictions)

    test_loss(t_loss)
    test_accuracy(labels, predictions)


And finally, training the model. This will run for quite some time. To speed it up, you can either reduce the number of LSTM units and the embedding dimension or train on a strong GPU.

In [40]:
EPOCHS = 5

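for epoch in range(EPOCHS):
    # Reset the metrics so that every epoch reports its own values instead of an
    # average over all epochs so far. reset_states() is the TF 2.0 name; later
    # releases call it reset_state().
    train_loss.reset_states()
    train_accuracy.reset_states()
    test_loss.reset_states()
    test_accuracy.reset_states()

    for reviews, labels in train_ds:
        train_step(reviews, labels)

    for test_reviews, test_labels in test_ds:
        test_step(test_reviews, test_labels)

    template = 'Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, Test Accuracy: {}'
    print(template.format(epoch+1,
                        train_loss.result(),
                        train_accuracy.result()*100,
                        test_loss.result(),
                        test_accuracy.result()*100))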


Epoch 1, Loss: 0.0064151231199502945, Accuracy: 99.87200164794922, Test Loss: 4.375329609729306e-08, Test Accuracy: 100.0


KeyboardInterrupt: 

## Results

In [None]:
# TODO Plot graphs and also add tensorboard to model.