# Sentiment Analysis LSTM Network



## Disclaimer
This notebook is my attempt to reimplement, in TensorFlow, the neural network described in the Towards Data Science article "Sentiment Analysis using LSTM" (http://app.n26.com), using the IMDB review dataset from http://ai.stanford.edu/~amaas/data/sentiment.



## Imports


In [2]:
import numpy as np
import pandas as pd
import os

import matplotlib.pyplot as plt
%matplotlib inline

from string import punctuation
from collections import Counter

import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

from utils.DataLoader import DataLoader



In [4]:
print('Version should be 2.0 or higher.\nCurrent version is "{}".'.format(tf.__version__))

Version should be 2.0 or higher.
Current version is "2.0.0-alpha0".


## Data preprocessing

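Run this function only once, after downloading the dataset from http://ai.stanford.edu/~amaas/data/sentiment and extracting it into the `data/` directory. It does not download anything itself: it merges the individual review files into one text file per label.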

In [6]:
def rearrange_data(train_test, label):
    """
        Assemble the separate review text files into a single file, with one
        review per line. The target file is named after the label.
        
    :param train_test: A string, either 'train' or 'test', specifying which directory to target.
    :param label: A string, either 'pos' or 'neg', specifying whether the positive or the negative reviews are assembled.
    :return: None
    """
    source_dir = os.path.join('data', train_test, label)
    content = []
    for filename in os.listdir(source_dir):
        # Each file in the label directory holds exactly one review.
        with open(os.path.join(source_dir, filename), 'r') as file:
            content.append(file.read())
    # Write all reviews to e.g. data/train/pos.txt, separated by newlines.
    with open(source_dir + '.txt', 'w') as target_file:
        target_file.write('\n'.join(content))
        print(target_file.name)


rearrange_data('train', 'pos')
rearrange_data('train', 'neg')

rearrange_data('test', 'pos')
rearrange_data('test', 'neg')





data/train/pos.txt
data/train/neg.txt
data/test/pos.txt
data/test/neg.txt


In [7]:
dataloader = DataLoader(8)
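
`DataLoader` lives in `utils/DataLoader.py` and is not shown in this notebook. The sketch below is only an assumption about the kind of preprocessing it performs, based on the `punctuation` and `Counter` imports, the `vocab_to_int` attribute, and the sequence length of 200 used by the model. The function names, the left-padding choice, and the toy reviews are all hypothetical.

```python
from collections import Counter
from string import punctuation

SEQ_LENGTH = 200  # assumed to match input_length=200 in the Embedding layer


def tokenize(review):
    """Lowercase a review, strip punctuation, and split it on whitespace."""
    cleaned = ''.join(ch for ch in review.lower() if ch not in punctuation)
    return cleaned.split()


def build_vocab(reviews):
    """Map each word to an integer, most frequent first. Index 0 is reserved for padding."""
    counts = Counter(word for review in reviews for word in tokenize(review))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode_and_pad(review, vocab_to_int, seq_length=SEQ_LENGTH):
    """Encode a review as integers, truncate it to seq_length, and left-pad it with zeros."""
    encoded = [vocab_to_int[w] for w in tokenize(review) if w in vocab_to_int]
    encoded = encoded[:seq_length]
    return [0] * (seq_length - len(encoded)) + encoded


# Hypothetical toy data, only to show the shapes involved.
reviews = ['A great, great movie!', 'Not a great movie.']
vocab_to_int = build_vocab(reviews)
padded = encode_and_pad(reviews[0], vocab_to_int, seq_length=6)
print(vocab_to_int)
print(padded)
```

Reserving index 0 for padding is the reason the model below is built with `len(vocab_to_int) + 1` as its vocabulary size.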


## Creating the Model

In [8]:
class SentimentLSTM(Model):
    def __init__(self, vocab_size, output_size, embedding_dim, lstm_layers):
        # Note: despite its name, `lstm_layers` is the number of units per LSTM layer.
        super().__init__()
        # Maps each word index to a dense vector; input sequences have length 200.
        self.embedding = Embedding(vocab_size, embedding_dim, input_length=200)
        # Two stacked LSTMs: the first returns the full sequence for the second to consume.
        self.lstm1 = LSTM(lstm_layers, dropout=0.2, activation='tanh', return_sequences=True)
        self.lstm2 = LSTM(lstm_layers, dropout=0.2, activation='tanh')
        self.dropout = Dropout(0.3)
        # A sigmoid output gives a value between 0 (negative) and 1 (positive).
        self.dense = Dense(output_size, activation='sigmoid')

    def call(self, x):
        x = self.embedding(x)
        x = self.lstm1(x)
        x = self.lstm2(x)
        x = self.dropout(x)
        x = self.dense(x)
        return x


# vocab_size + 1 because index 0 is reserved for padding.
vocab_size = len(dataloader.vocab_to_int) + 1
output_size = 1
embedding_dim = 400
lstm_layers = 256
model = SentimentLSTM(vocab_size, output_size, embedding_dim, lstm_layers)
# TODO: find a way to make the print statement show the layers, similar to PyTorch.
print(model)


<__main__.SentimentLSTM object at 0x13fa31550>


## Training the Model

The article uses a loss function called `BCELoss`, which stands for binary cross-entropy. `tf.keras.losses.BinaryCrossentropy` is the TensorFlow counterpart and is used here.
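
To make the loss concrete, here is a minimal pure-Python sketch of binary cross-entropy averaged over a batch. It is meant as an illustration of the formula, not as the TensorFlow implementation; the helper name, the `epsilon` clipping value, and the sample numbers are assumptions chosen for the example.

```python
import math


def binary_cross_entropy(labels, predictions, epsilon=1e-7):
    """Mean of -(y * log(p) + (1 - y) * log(1 - p)) over all samples."""
    total = 0.0
    for y, p in zip(labels, predictions):
        # Clip the prediction so that log(0) can never occur.
        p = min(max(p, epsilon), 1.0 - epsilon)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0, 1]               # 1 = positive review, 0 = negative review
predictions = [0.9, 0.2, 0.6]    # sigmoid outputs of a hypothetical model
loss = binary_cross_entropy(labels, predictions)
print(round(loss, 4))

# A confident but wrong prediction is penalised much more heavily.
wrong_loss = binary_cross_entropy([1], [0.01])
print(round(wrong_loss, 4))
```

The loss approaches 0 as the predicted probabilities approach the true labels, which is why it pairs naturally with the sigmoid output of the model above.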

In [8]:
lr = 0.001

loss_object = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam(lr)

train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.BinaryAccuracy(name='train_accuracy')

test_loss = tf.keras.metrics.Mean(name='test_loss')
test_accuracy = tf.keras.metrics.BinaryAccuracy(name='test_accuracy')
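
The metric objects above are streaming metrics: every call adds to an internal running total until the metric is reset. The pure-Python sketch below illustrates that behaviour with a hypothetical `RunningMean` class. It is not the TensorFlow implementation, and the loss values are made up.

```python
class RunningMean:
    """Illustrative stand-in for a streaming mean metric."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1

    def result(self):
        # With no updates there is nothing to average, so report 0.0.
        return self.total / self.count if self.count else 0.0


epoch_losses = [[0.8, 0.6], [0.2, 0.2]]  # hypothetical per-batch losses for two epochs

# Without a reset, the value reported after epoch 2 still includes epoch 1.
metric = RunningMean()
for losses in epoch_losses:
    for loss in losses:
        metric.update(loss)
without_reset = metric.result()

# With a reset at the start of each epoch, only the current epoch counts.
metric = RunningMean()
for losses in epoch_losses:
    metric.reset()
    for loss in losses:
        metric.update(loss)
with_reset = metric.result()

print(without_reset, with_reset)
print(RunningMean().result())
```

Two practical consequences follow: a metric that is never updated reports 0.0, and a metric that is never reset reports an average over all epochs rather than over the latest one.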

In [9]:
# TODO: get rid of the DataLoader and move its logic back into the notebook.
# TODO: that logic then belongs in the "Data preprocessing" section.

#train_ds = dataloader.train.shuffle(10000).batch(32)
train_ds = dataloader.train
# test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(32)

Next, set up the training and test step functions.

In [17]:
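@tf.function
def train_step(reviews, labels):
    """Run one optimisation step on a batch of encoded reviews."""
    with tf.GradientTape() as tape:
        predictions = model(reviews)
        loss = loss_object(labels, predictions)
    # Backpropagate and update all trainable weights.
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # Accumulate the batch results in the streaming metrics.
    train_loss(loss)
    train_accuracy(labels, predictions)


@tf.function
def test_step(reviews, labels):
    """Evaluate a batch of encoded reviews without updating the weights."""
    predictions = model(reviews)
    t_loss = loss_object(labels, predictions)

    test_loss(t_loss)
    test_accuracy(labels, predictions)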


Finally, train the model. Training takes a long time at this network size, so either reduce the number of units in the network or train on a powerful GPU.
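
To get a feel for where the training cost comes from, the sketch below counts the trainable parameters per layer for the sizes used in this notebook. It assumes the standard parameter formulas for an embedding, an LSTM with bias, and a dense layer. The vocabulary size of 10,000 is a hypothetical placeholder, since the real value depends on `dataloader.vocab_to_int`.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus bias.
    return input_dim * units + units


embedding_dim = 400
lstm_units = 256
output_size = 1
example_vocab_size = 10_000  # hypothetical, replace with the real vocab_size

counts = {
    'embedding': embedding_params(example_vocab_size, embedding_dim),
    'lstm1': lstm_params(embedding_dim, lstm_units),
    'lstm2': lstm_params(lstm_units, lstm_units),
    'dense': dense_params(lstm_units, output_size),
}
for name, count in counts.items():
    print('{:<10} {:>10,}'.format(name, count))
print('{:<10} {:>10,}'.format('total', sum(counts.values())))
```

Under these assumptions the two LSTM layers hold about 1.2 million parameters and the embedding several million more, so halving `lstm_layers` or `embedding_dim` shrinks the model considerably. Once a Keras model has been built, `model.summary()` reports the actual counts.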

In [18]:
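EPOCHS = 5

for epoch in range(EPOCHS):
    # Reset the streaming metrics so that each epoch reports its own values
    # instead of a running average over all epochs so far.
    train_loss.reset_states()
    train_accuracy.reset_states()
    test_loss.reset_states()
    test_accuracy.reset_states()

    for reviews, labels in train_ds:
        train_step(reviews, labels)

    # Evaluation is disabled until test_ds is defined, so the test metrics print as 0.0.
    # for test_reviews, test_labels in test_ds:
        # test_step(test_reviews, test_labels)

    template = 'Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, Test Accuracy: {}'
    print(template.format(epoch+1,
                        train_loss.result(),
                        train_accuracy.result()*100,
                        test_loss.result(),
                        test_accuracy.result()*100))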


Epoch 1, Loss: 0.011183718219399452, Accuracy: 99.8759994506836, Test Loss: 0.0, Test Accuracy: 0.0


KeyboardInterrupt: 

## Results