# Movie review sentiment analysis with Keras

Author: Mateusz Pabian

The Rotten Tomatoes movie review dataset is a corpus of movie reviews used for sentiment analysis. The aim of this task is to label phrases on a five-point scale: negative, somewhat negative, neutral, somewhat positive, and positive. Obstacles such as sentence negation, sarcasm, and language ambiguity make the task very challenging: the best score on the Kaggle public leaderboard is 76.53% accuracy.

### Import external libraries

In [2]:
import numpy as np
import pandas as pd

from keras.models import Sequential
from keras.layers import Embedding, Dense, LSTM
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

### Read Rotten Tomatoes data

In [3]:
# Load data using Pandas DataFrame API
train = pd.read_csv('train.tsv', sep='\t', header = 0)
test = pd.read_csv('test.tsv', sep='\t', header = 0)

### Word embeddings

Word embeddings are a class of approaches for representing words and documents as dense vectors. Each word is mapped to a vector that is its projection into a continuous vector space. The position of a word within that space is learned from text and is based on the words that surround it when it is used.

Word embeddings are popular in natural language processing, usually as input features for deep learning models such as recurrent neural networks.
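
To make the idea concrete, here is a minimal NumPy sketch of what an embedding lookup amounts to. The sizes and the random matrix are hypothetical toy values; in a trained network the rows of the matrix are learned rather than sampled.

```python
import numpy as np

# Hypothetical toy sizes, chosen only for illustration
vocab_size = 10   # number of distinct word indices
embed_dim = 4     # length of each word vector

rng = np.random.default_rng(0)
# One row per word index; a real model learns these values during training
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# A "sentence" already converted to integer word indices
sentence = np.array([3, 7, 7, 1])

# Looking up embeddings is row indexing into the matrix
sentence_vectors = embedding_matrix[sentence]

print(sentence_vectors.shape)  # (4, 4)
```

The repeated index 7 maps to the same vector both times, which is the sense in which each word has a single position in the vector space.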

In [4]:
words_used = 2000

# tokenize reviews
tokenizer = Tokenizer(num_words = words_used, split=' ')
tokenizer.fit_on_texts(train['Phrase'].values)

# obtain numerical representation of sentences
train_data = tokenizer.texts_to_sequences(train['Phrase'].values)
test_data = tokenizer.texts_to_sequences(test['Phrase'].values)

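# pad sequences to a common length (zeros are added at the front by default)
train_data_re = pad_sequences(train_data)
# pad test data to the training length so that it matches the network input
test_data_re = pad_sequences(test_data, maxlen=train_data_re.shape[1])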

# obtain one-hot encoding of class labels
labels = pd.get_dummies(train['Sentiment']).values
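
The padding and one-hot steps can be illustrated on toy data without Keras. The sketch below uses a hypothetical helper, `pad_pre`, that imitates the default behaviour of `pad_sequences` (zeros added at the front, long sequences cut from the front), and builds one-hot labels with `np.eye`, which gives the same result as `pd.get_dummies` when all five classes are present. The sequences and labels are made up for illustration.

```python
import numpy as np

def pad_pre(sequences, maxlen=None):
    """Hypothetical helper: front-pad with zeros, truncate from the front."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = np.zeros((len(sequences), maxlen), dtype=int)
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty phrase stays all zeros
        trimmed = seq[-maxlen:]  # keep only the last maxlen tokens
        padded[row, -len(trimmed):] = trimmed
    return padded

# Made-up integer sequences of different lengths
sequences = [[5, 2], [7], [1, 4, 9]]
padded = pad_pre(sequences)
print(padded)
# [[0 5 2]
#  [0 0 7]
#  [1 4 9]]

# Made-up sentiment labels in the range 0..4
sentiment = np.array([0, 4, 2])
one_hot = np.eye(5, dtype=int)[sentiment]
print(one_hot)
# [[1 0 0 0 0]
#  [0 0 0 0 1]
#  [0 0 1 0 0]]
```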

### Define network architecture

LSTM (long short-term memory) is a recurrent neural network architecture commonly used for time series analysis.

It is designed to handle both short- and long-term dependencies in the signal thanks to its memory units.
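
As a rough illustration of how the memory units work, the following NumPy sketch implements a single step of a standard LSTM cell and runs it over a short random sequence. The weight shapes, the gate ordering inside the stacked matrices, and the toy sizes are assumptions made for this sketch and are not taken from the Keras implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (input_dim, 4 * units), U: (units, 4 * units), b: (4 * units,)
    """
    z = x @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4)          # input, forget, candidate, output
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g               # memory cell: keep old, add new
    h = o * np.tanh(c)                   # hidden state exposed to next layer
    return h, c

# Hypothetical toy sizes and random weights
rng = np.random.default_rng(0)
input_dim, units = 3, 2
W = rng.normal(size=(input_dim, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
sequence = rng.normal(size=(5, input_dim))
for x in sequence:
    h, c = lstm_step(x, h, c, W, U, b)

print(h.shape, c.shape)  # (2,) (2,)
```

The forget gate `f` controls how much of the previous cell state survives each step, which is what lets the cell carry information across long spans of the sequence.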

In [5]:
model = Sequential()
# input_length is the padded phrase length (46 for this training set)
model.add(Embedding(words_used, 64, input_length=train_data_re.shape[1]))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(5, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 46, 64)            128000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               98816     
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 645       
Total params: 227,461
Trainable params: 227,461
Non-trainable params: 0
_________________________________________________________________
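
The parameter counts in the summary can be reproduced by hand. The short calculation below is a sketch using the standard formulas for an embedding layer, an LSTM layer with biases, and a dense layer with biases; the variable names are introduced here for illustration.

```python
vocab_size = 2000    # words_used
embed_dim = 64       # embedding vector length
lstm_units = 128     # LSTM hidden size
num_classes = 5      # sentiment classes

# One vector of length embed_dim per word index
embedding_params = vocab_size * embed_dim

# Four gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Weight matrix plus one bias per class
dense_params = lstm_units * num_classes + num_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 128000 98816 645 227461
```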


### Train model

In [6]:
model.fit(train_data_re, labels, epochs=4, batch_size=128, verbose=1)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x159fdef0>

### Discussion

As mentioned in the introduction to this assignment, the Rotten Tomatoes review sentiment challenge is a difficult task, with the best results on Kaggle in the mid-70% accuracy range. It is difficult to evaluate model performance here because labels for the test dataset are not available. Note that because each epoch took over 300 seconds to train, no fine-tuning of the network was performed. A GPU or distributed computing seems necessary for the task at hand.
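
One possible way around the missing test labels, not used in this notebook, is to hold out part of the labelled training data for validation. The sketch below shows a random split in NumPy; the helper `holdout_split`, the 10% fraction, and the stand-in arrays are all assumptions for illustration.

```python
import numpy as np

def holdout_split(features, targets, val_fraction=0.1, seed=0):
    """Hypothetical helper: shuffle rows and set aside a validation part."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_val = int(len(features) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (features[train_idx], targets[train_idx],
            features[val_idx], targets[val_idx])

# Stand-in arrays with the same layout as train_data_re and labels
features = np.zeros((50, 46), dtype=int)
targets = np.eye(5, dtype=int)[np.arange(50) % 5]

x_train, y_train, x_val, y_val = holdout_split(features, targets)
print(x_train.shape, x_val.shape)  # (45, 46) (5, 46)
```

With the real arrays, the held-out part could be passed to `model.fit` through its `validation_data` argument so that validation accuracy is reported after each epoch. Because many phrases in this dataset are fragments of the same sentence, a purely random split probably overestimates accuracy, so the resulting number should be read with caution.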