# Exploring the use of LSTM for Sentiment Analysis
The purpose of this notebook is to assess the sentiment-analysis accuracy of a standard (not fine-tuned) Long Short-Term Memory (LSTM) recurrent neural network and compare it to NLTK's built-in unsupervised tool, the Sentiment Intensity Analyzer (SIA). Three datasets are used:
* UMICH SI650: 7086 comments https://www.kaggle.com/c/si650winter11/data
* IMDB Movies Reviews: ~25k reviews https://www.kaggle.com/oumaimahourrane/imdb-reviews
* Sentiment140: 1.6M tweets https://www.kaggle.com/kazanova/sentiment140 (for computational reasons, I'm taking a random sample of 10%)

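All datasets receive minimal pre-processing (lowercasing and little other word standardization) and are split into three sets: train (60%), test (20%) and validation (20%). The test set is monitored during training to spot overfitting, and the validation set is held out for the final accuracy score.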

In [1]:
import pandas as pd
import numpy as np
import os
import collections
import matplotlib.pyplot as plt

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from keras.layers import Activation, Dense, Dropout, Embedding, LSTM, SpatialDropout1D
from keras.models import Sequential
from keras.preprocessing import sequence
from keras import backend as K

Using TensorFlow backend.


# UMICH data

In [2]:
umich = pd.read_csv("data/UMICH training.txt", sep='\t', names=['sentiment', 'text'], encoding='iso-8859-1')
umich['text'] = umich['text'].apply(lambda x: x.lower())
umich.sample(5)

Unnamed: 0,sentiment,text
3412,1,Brokeback mountain was beautiful...
3424,1,I love Brokeback Mountain....
5763,0,"Not because I hate Harry Potter, but because I..."
1593,1,"So as felicia's mom is cleaning the table, fel..."
4058,0,The Da Vinci Code sucked big time.


In [2]:
def get_corpus_information(data, text_var, verbose=True):
    '''Returns the maximum number of words of a single document and each word frequency'''
    maxlen = 0
    word_freqs = collections.Counter()

    for text in data[text_var]:
        words = nltk.word_tokenize(text.lower())
        if len(words) > maxlen:
            maxlen = len(words)
        for word in words:
            word_freqs[word] += 1

    if verbose:
        print('Max number of words in a single sentence:', maxlen)
        print('Number of unique words:', len(word_freqs))
    
    return maxlen, word_freqs

In [3]:
def get_mapping_dicts(max_features, word_freqs):
    '''Maps words to indexes'''
    vocab_size = min(max_features, len(word_freqs)) + 2
    word2index = {x[0]: i+2 for i, x in 
                    enumerate(word_freqs.most_common(max_features))}
    word2index["PAD"] = 0
    word2index["UNK"] = 1
    index2word = {v:k for k, v in word2index.items()}
    return vocab_size, word2index, index2word
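
As a quick illustration of the mapping built by `get_mapping_dicts`, the sketch below applies the same logic to a toy corpus. The example sentences are made up, and plain `str.split` stands in for `nltk.word_tokenize` so that the snippet runs without any NLTK data.

```python
import collections

# Hypothetical toy corpus; str.split replaces nltk.word_tokenize here
docs = ["i love this movie", "i hate this movie", "this movie rocks"]

word_freqs = collections.Counter()
for doc in docs:
    word_freqs.update(doc.split())

max_features = 3  # keep only the 3 most frequent words
# Indexes 0 and 1 are reserved for padding and unknown words
word2index = {w: i + 2 for i, (w, _) in enumerate(word_freqs.most_common(max_features))}
word2index["PAD"] = 0
word2index["UNK"] = 1
vocab_size = min(max_features, len(word_freqs)) + 2

# Words outside the kept vocabulary fall back to the UNK index
encoded = [word2index.get(w, word2index["UNK"]) for w in "i love this movie".split()]
print(vocab_size, encoded)
```

Here "love" is not among the three most frequent words, so it is encoded as `UNK` (index 1), while the kept words receive indexes starting at 2.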

In [4]:
def sentences2sequences(data, text_var, word2index, max_length):
    '''Maps sentences to sequences'''
    X = np.empty((data.shape[0], ), dtype=list)

    for i, text in enumerate(data[text_var]):
        words = nltk.word_tokenize(text.lower())
        seqs = []
        for word in words:
            if word in word2index:
                seqs.append(word2index[word])
            else:
                seqs.append(word2index["UNK"])
        X[i] = seqs

    # Pad the sequences (left padded with zeros)
    return sequence.pad_sequences(X, maxlen=max_length)
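
With its default settings, `pad_sequences` pads on the left and, when a sequence is longer than `maxlen`, also drops tokens from the left, so the end of the text is what survives. The pure-Python sketch below mimics that default behavior for illustration only; `pad_left` is a hypothetical helper, not part of Keras.

```python
def pad_left(seq, maxlen, value=0):
    """Mimic the default pad_sequences behavior: pre-padding and pre-truncating."""
    seq = seq[-maxlen:]                         # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq  # fill the front with the PAD index

short = pad_left([5, 8, 3], maxlen=5)
long = pad_left([5, 8, 3, 9, 2, 7], maxlen=4)
print(short)
print(long)
```

This matters for the pipeline in this notebook, because sequences are capped at half the length of the longest document: any longer text loses its beginning rather than its end.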

In [5]:
def split_train_test_val(X, y, test_size, val_size=None, random_state=None):
    '''Splits data into train, test and validation sets'''
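    if val_size is None:
        val_size = test_size

    # The second split is applied to what remains after removing the validation set,
    # so rescale test_size to keep it a fraction of the full dataset
    test_size = test_size/(1-val_size)

    # Derive a different (but reproducible) seed for the second split
    if random_state is None:
        rs2 = None
    else:
        rs2 = random_state * 45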
    
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=val_size, random_state=random_state)
    X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=test_size, random_state=rs2)
    
    return X_train, X_test, X_val, y_train, y_test, y_val
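
The rescaling step `test_size/(1-val_size)` is what keeps the test set at 20% of the full data even though the second split only sees the remaining 80%. The sketch below works through the arithmetic; `split_sizes` is a hypothetical helper, it assumes `train_test_split` rounds a fractional test size up, and the 6,918 rows are the total implied by the UMICH shapes printed later in this notebook (4150 + 1384 + 1384).

```python
import math

def split_sizes(n_samples, test_size=0.2, val_size=None):
    """Row counts (train, test, validation) of the two-stage split."""
    if val_size is None:
        val_size = test_size
    # First split: carve out the validation set (assumed to round up)
    n_val = math.ceil(val_size * n_samples)
    remaining = n_samples - n_val
    # Second split acts on the remaining rows, so the fraction is rescaled
    rescaled = test_size / (1 - val_size)
    n_test = math.ceil(rescaled * remaining)
    return remaining - n_test, n_test, n_val

print(split_sizes(6918))
```

With `test_size=0.2`, the rescaled fraction is 0.25, and a quarter of the remaining 80% is again 20% of the original data.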

In [6]:
def preprocessing_pipeline(data, text_var, score_var, test_size=0.2):
    '''Complete pipeline for data to be ingested into LSTM'''
    # Get necessary corpus information
    maxlen, word_freqs = get_corpus_information(data, text_var)
    
    # Reduce dimensionality a bit to limit overfitting (top 80% of words, half the max length)
    max_features = int(len(word_freqs) * 0.8)
    max_sentence_length = int(maxlen/2)
    vocab_size, word2index, index2word = get_mapping_dicts(max_features, word_freqs)
    
    # convert sentences to sequences
    X = sentences2sequences(data, text_var, word2index, max_sentence_length)
    
    # Split train/test/validation data
    X_train, X_test, X_val, y_train, y_test, y_val = split_train_test_val(X, data[score_var], test_size, random_state=845)
    print('Train shapes:')
    print(X_train.shape, y_train.shape)
    print('Test shapes:')
    print(X_test.shape, y_test.shape)
    print('Validation shapes:')
    print(X_val.shape, y_val.shape)
    
    return {'X_train':X_train, 'X_test':X_test, 'X_val':X_val, 
            'y_train':y_train, 'y_test':y_test, 'y_val':y_val, 
            'index2word': index2word, 'vocab_size':vocab_size, 
            'max_sentence_length':max_sentence_length}

In [8]:
umich_data = preprocessing_pipeline(umich, 'text', 'sentiment')

Max number of words in a single sentence: 1049
Number of unique words: 2327
Train shapes:
(4150, 524) (4150,)
Test shapes:
(1384, 524) (1384,)
Validation shapes:
(1384, 524) (1384,)


## NLTK Sentiment Intensity Analyzer

In [7]:
def get_sia_df(data, text_var):
    sia = SentimentIntensityAnalyzer()
    df_sia = [sia.polarity_scores(text) for text in data[text_var]]
    df_sia = pd.DataFrame(df_sia)
    df_sia['sentiment'] = (df_sia['compound'] > 0).astype(int)
    df_sia['text'] = data[text_var]
    return df_sia
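
The binarization rule in `get_sia_df` is a simple threshold on the compound score. The sketch below applies the same rule to a few compound values copied from sample rows shown in this notebook; it uses plain Python lists instead of a DataFrame, which is an illustrative simplification.

```python
# Compound scores copied from sample SIA output rows in this notebook
compound_scores = [0.5962, -0.1027, 0.745, -0.5106, 0.0]

# Same rule as get_sia_df: strictly positive compound -> 1, everything else -> 0
labels = [int(score > 0) for score in compound_scores]
print(labels)
```

Note that a perfectly neutral score of 0.0 is labelled as negative, which can penalize SIA on short texts that contain no words from its lexicon.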

In [10]:
umich_sia = get_sia_df(umich, 'text')
umich_sia.sample(5)

Unnamed: 0,compound,neg,neu,pos,sentiment,text
1819,-0.2516,0.161,0.721,0.118,0,Which is why i said silent hill turned into re...
6480,0.2263,0.127,0.674,0.199,1,", she helped me bobbypin my insanely cool hat ..."
4537,-0.3612,0.161,0.839,0.0,0,Combining the opinion / review from Gary and G...
5627,-0.4215,0.379,0.455,0.167,0,This quiz sucks and Harry Potter sucks ok bye..
6436,-0.3182,0.327,0.467,0.206,0,Ok brokeback mountain is such a horrible movie.


In [12]:
print('Accuracy on full dataset: {:.4f}'.format(accuracy_score(umich['sentiment'], 
                                                               umich_sia['sentiment'])))
# Select the SIA predictions by index label so they line up with y_val
print('Accuracy on validation dataset: {:.4f}'.format(accuracy_score(umich_data['y_val'], 
        umich_sia['sentiment'].loc[umich_data['y_val'].index])))

Accuracy on full dataset: 0.8737
Accuracy on validation dataset: 0.8822


## LSTM

In [8]:
def build_lstm(data_dict, embedding_size=128, hidden_layer_size=64):
    vocab_size = data_dict['vocab_size']
    input_length = data_dict['max_sentence_length']

    # Build LSTM
    lstm = Sequential()
    lstm.add(Embedding(vocab_size, embedding_size, input_length=input_length))
    lstm.add(SpatialDropout1D(0.2))
    lstm.add(LSTM(hidden_layer_size, dropout=0.2, recurrent_dropout=0.2))
    lstm.add(Dense(1))
    lstm.add(Activation("sigmoid"))

    lstm.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

    # Return the compiled model; call .summary() on it to inspect the layers
    return lstm

In [18]:
lstm = build_lstm(umich_data, embedding_size=128, hidden_layer_size=64)
lstm.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 524, 128)          238464    
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 524, 128)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                49408     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
_________________________________________________________________
activation_1 (Activation)    (None, 1)                 0         
Total params: 287,937
Trainable params: 287,937
Non-trainable params: 0
_________________________________________________________________
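
The parameter counts in the summary above can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias vector; `lstm_model_param_counts` is a hypothetical helper written for this check.

```python
def lstm_model_param_counts(vocab_size, embedding_size=128, hidden_layer_size=64):
    """Parameter counts for the Embedding -> LSTM -> Dense(1) stack."""
    embedding = vocab_size * embedding_size
    # 4 gates, each with input weights, recurrent weights and a bias vector
    lstm = 4 * ((embedding_size + hidden_layer_size) * hidden_layer_size + hidden_layer_size)
    dense = hidden_layer_size + 1  # one weight per LSTM unit plus a bias
    return embedding, lstm, dense

# vocab_size for UMICH: int(2327 * 0.8) kept words + PAD + UNK
vocab_size = int(2327 * 0.8) + 2
counts = lstm_model_param_counts(vocab_size)
print(counts, sum(counts))
```

The embedding layer dominates the total, which is why the vocabulary size drives the model size far more than the LSTM itself.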


In [21]:
umich_history = lstm.fit(umich_data['X_train'], umich_data['y_train'], batch_size=256, epochs=10,
                         validation_data=(umich_data['X_test'], umich_data['y_test']))

Train on 4150 samples, validate on 1384 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [22]:
preds = lstm.predict(umich_data['X_val'], batch_size=1024)
preds = (preds > 0.5).astype(int)
print('LSTM Validation Score: {:.4f}'.format(accuracy_score(umich_data['y_val'], preds)))

LSTM Validation Score: 0.9819


In [31]:
# Clear some memory
K.clear_session()

LSTM achieves ~10 percentage points higher accuracy than SIA on the UMICH data (0.9819 vs. 0.8822 on the validation set).

# IMDB Movies Reviews Data

In [23]:
imdb = pd.read_csv('data/imdb-reviews.csv', encoding='iso-8859-1')
imdb.columns = ['text', 'sentiment']
imdb['text'] = imdb['text'].apply(lambda x: x.lower())
imdb.head()

Unnamed: 0,text,sentiment
0,"first think another Disney movie, might good, ...",1
1,"Put aside Dr. House repeat missed, Desperate H...",0
2,"big fan Stephen King's work, film made even gr...",1
3,watched horrid thing TV. Needless say one movi...,0
4,truly enjoyed film. acting terrific plot. Jeff...,1


In [24]:
imdb_data = preprocessing_pipeline(imdb, 'text', 'sentiment')

Max number of words in a single sentence: 1828
Number of unique words: 114340
Train shapes:
(15000, 914) (15000,)
Test shapes:
(5000, 914) (5000,)
Validation shapes:
(5000, 914) (5000,)


## SIA

In [25]:
imdb_sia = get_sia_df(imdb, 'text')
imdb_sia.sample(5)

Unnamed: 0,compound,neg,neu,pos,sentiment,text
15651,0.9678,0.073,0.708,0.219,1,"Thief Bagdad treasure. First foremost, good st..."
13166,0.384,0.144,0.717,0.138,1,"young Dr. Fanshawe(Mark Letheren), avid archae..."
2592,0.6809,0.107,0.749,0.144,1,Istanbul another one expatriate films Errol Fl...
8501,0.7688,0.067,0.795,0.138,1,"Hidden Frontier fan made show, world Star Trek..."
1595,0.9781,0.132,0.663,0.205,1,John Carpenter's Halloween<br /><br />Is great...


In [26]:
print('Accuracy on full dataset: {:.4f}'.format(accuracy_score(imdb['sentiment'], 
                                                               imdb_sia['sentiment'])))
# Select the SIA predictions by index label so they line up with y_val
print('Accuracy on validation dataset: {:.4f}'.format(accuracy_score(imdb_data['y_val'], 
      imdb_sia['sentiment'].loc[imdb_data['y_val'].index])))

Accuracy on full dataset: 0.6760
Accuracy on validation dataset: 0.6786


## LSTM

In [27]:
lstm = build_lstm(imdb_data, embedding_size=128, hidden_layer_size=64)
lstm.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 914, 128)          11708672  
_________________________________________________________________
spatial_dropout1d_2 (Spatial (None, 914, 128)          0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                49408     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65        
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
Total params: 11,758,145
Trainable params: 11,758,145
Non-trainable params: 0
_________________________________________________________________


In [29]:
imdb_history = lstm.fit(imdb_data['X_train'], imdb_data['y_train'], batch_size=512, epochs=10,
                        validation_data=(imdb_data['X_test'], imdb_data['y_test']))

Train on 15000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [30]:
preds = lstm.predict(imdb_data['X_val'], batch_size=1024)
preds = (preds > 0.5).astype(int)
print('LSTM Validation Score: {:.4f}'.format(accuracy_score(imdb_data['y_val'], preds)))

LSTM Validation Score: 0.8360


In [36]:
K.clear_session()

LSTM achieves ~16 percentage points higher accuracy than SIA on the IMDB dataset (0.8360 vs. 0.6786 on the validation set).

# Sentiment140 data

In [21]:
sent = pd.read_csv('data/sentiment140.csv', header=None, encoding='iso-8859-1').iloc[:,[5, 0]]
sent.columns = ['text', 'sentiment']
sent = sent.sample(frac=0.1, random_state=452).reset_index(drop=True)
sent['sentiment'] = sent['sentiment'].replace(4,1)
print(sent.shape)
sent.head()

(160000, 2)


Unnamed: 0,text,sentiment
0,I almost lost my finger to the ceiling fan.. i...,0
1,"Ohh, man. My favorite SNL surprise of the nig...",1
2,@Beaker Can't DM @zhenji as he's not following...,0
3,@Look4acure Yeah!! make sure you get some of ...,1
4,@AmaNorris wow that last tweet made me seem li...,0


In [22]:
# Remove users and links
sent['text'] = sent['text'].replace(r'@[^ ]+', '', regex=True)
sent['text'] = sent['text'].replace(r'[^ ]+//[^ ]+', '', regex=True)
sent['text'] = sent['text'].apply(lambda x: x.lower())
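
To see what the two patterns above actually remove, the sketch below applies them to a single string with `re.sub`. The tweet is made up for illustration and `clean_tweet` is a hypothetical helper; the notebook itself applies the patterns through pandas.

```python
import re

def clean_tweet(text):
    """Apply the same two patterns to one string, then lowercase it."""
    text = re.sub(r'@[^ ]+', '', text)        # drop @user mentions
    text = re.sub(r'[^ ]+//[^ ]+', '', text)  # drop tokens containing '//', such as URLs
    return text.lower()

# Made-up tweet for illustration
cleaned = clean_tweet("@someone Loved it! http://example.com/review SO good")
print(repr(cleaned))
```

The removed tokens leave extra spaces behind, which is harmless here because the text is tokenized afterwards.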

In [23]:
sent_data = preprocessing_pipeline(sent, 'text', 'sentiment')

Max number of words in a single sentence: 115
Number of unique words: 88115
Train shapes:
(96000, 57) (96000,)
Test shapes:
(32000, 57) (32000,)
Validation shapes:
(32000, 57) (32000,)


## SIA

In [24]:
sent_sia = get_sia_df(sent, 'text')
sent_sia.sample(5)

Unnamed: 0,compound,neg,neu,pos,sentiment,text
35255,0.5962,0.0,0.672,0.328,1,i want to go to jb and demi concert today!!! ...
69660,-0.1027,0.263,0.562,0.175,0,oh sadness. and i basically told you what it ...
152816,0.745,0.0,0.628,0.372,1,working a split shift today but had h**lla fu...
9118,-0.5106,0.398,0.602,0.0,0,up late with a sick little girl
126392,0.0,0.0,1.0,0.0,0,i really need some coffee now!


In [25]:
print('Accuracy on full dataset: {:.4f}'.format(accuracy_score(sent['sentiment'], 
                                                               sent_sia['sentiment'])))
# Select the SIA predictions by index label so they line up with y_val
print('Accuracy on validation dataset: {:.4f}'.format(accuracy_score(sent_data['y_val'], 
      sent_sia['sentiment'].loc[sent_data['y_val'].index])))

Accuracy on full dataset: 0.6510
Accuracy on validation dataset: 0.6479


## LSTM

In [26]:
lstm = build_lstm(sent_data, embedding_size=128, hidden_layer_size=64)
lstm.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 57, 128)           9023232   
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 57, 128)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                49408     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
_________________________________________________________________
activation_1 (Activation)    (None, 1)                 0         
Total params: 9,072,705
Trainable params: 9,072,705
Non-trainable params: 0
_________________________________________________________________


In [29]:
sent_history = lstm.fit(sent_data['X_train'], sent_data['y_train'], batch_size=2048, epochs=10,
                        validation_data=(sent_data['X_test'], sent_data['y_test']))

Train on 96000 samples, validate on 32000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [30]:
preds = lstm.predict(sent_data['X_val'], batch_size=1024)
preds = (preds > 0.5).astype(int)
print('LSTM Validation Score: {:.4f}'.format(accuracy_score(sent_data['y_val'], preds)))

LSTM Validation Score: 0.7650


LSTM achieves ~12 percentage points higher accuracy than SIA on the Sentiment140 dataset (0.7650 vs. 0.6479 on the validation set).

In summary, a simple LSTM that is neither specifically tuned nor extensively trained seems to consistently outperform the standard SIA implementation on all three datasets. Nevertheless, the ease and speed of implementing SIA is a plus that should not be overlooked.
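
For reference, the validation accuracies reported in this notebook can be gathered side by side. The sketch below only recomputes the gaps from the printed scores; it does not rerun any model.

```python
# Validation accuracies reported in this notebook: (SIA, LSTM)
results = {
    "UMICH": (0.8822, 0.9819),
    "IMDB": (0.6786, 0.8360),
    "Sentiment140": (0.6479, 0.7650),
}

# Gap in percentage points between LSTM and SIA for each dataset
gaps = {name: round((lstm_acc - sia_acc) * 100, 1)
        for name, (sia_acc, lstm_acc) in results.items()}

for name, (sia_acc, lstm_acc) in results.items():
    print('{}: SIA {:.4f}, LSTM {:.4f}, gap {:.1f} points'.format(
        name, sia_acc, lstm_acc, gaps[name]))
```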