# Sentiment analysis using LSTM

We attempt to strengthen sentiment analysis models using LSTM layers.
LSTM layers take into account the order in which words are written, which makes them more effective at mimicking how a human reads a sentence.
We fit an LSTM model on the publicly available movie reviews dataset.

## Preprocessing our dataset
Movie reviews are split into two categories: positive (rated 5/10 and above) and negative (rated 4/10 and below).

We process our reviews using the Keras built-in Tokenizer class.
We then construct our train/test datasets.
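
To make the tokenization step concrete, here is a minimal pure-Python sketch of the idea behind it: each word is ranked by frequency and replaced by its rank, with index 0 left free for padding. This is a simplified imitation, not the Keras implementation; the helper names (build_word_index, texts_to_sequences), the regular expression used to split words, and the two toy reviews are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical word splitter: lowercase, keep letters, digits and apostrophes
WORD_RE = re.compile(r"[a-z0-9']+")

def build_word_index(texts):
    """Map each word to an integer rank, most frequent word first (starting at 1)."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    # most_common() sorts by descending count; index 0 stays reserved for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index and skip unknown words."""
    return [
        [word_index[w] for w in WORD_RE.findall(text.lower()) if w in word_index]
        for text in texts
    ]

# Toy data, not taken from the movie reviews dataset
reviews = ["A great movie, great cast", "A dull movie"]
word_index = build_word_index(reviews)
sequences = texts_to_sequences(reviews, word_index)
print(word_index)
print(sequences)
```

Frequent words receive small indices, so rare words end up with the largest ones.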


In [3]:
import re
from string import punctuation

import numpy as np
import pandas as pd
import pyprind
import tensorflow as tf
import tensorflow.keras as keras
from keras.layers import Dense, Embedding, Flatten, LSTM, SpatialDropout1D
from keras.models import Sequential, load_model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer, hashing_trick, text_to_word_sequence
from keras.utils import to_categorical


tf.compat.v1.enable_eager_execution()

# Load the reviews (columns used below: 'review' and 'sentiment')
df = pd.read_csv('C:\\Users\\marwe\\Desktop\\movie_data.csv', encoding='utf-8')

# Build the vocabulary from all reviews
tk = Tokenizer()
tk.fit_on_texts(df['review'])
# Summary of what the tokenizer learned:
# print(len(tk.word_index))      # vocabulary size
# print(len(tk.word_counts))     # dict: each word and its occurrences in the whole dataset
# print(tk.document_count)       # number of documents used
# print(tk.word_index['lack'])   # unique index assigned to the word 'lack'
# print(tk.word_docs['lack'])    # number of documents in which 'lack' appears

# After fitting the tokenizer, we convert each review to a sequence of word indices
# and pad or truncate every sequence to 250 tokens
sequences = tk.texts_to_sequences(df['review'])
data = pad_sequences(sequences, maxlen=250)
# print(data.shape)


# Construct the training and test sets: first 25,000 reviews for training, the rest for testing
X_train = data[:25000]
X_test = data[25000:]
y_train = pd.get_dummies(df['sentiment'][:25000])
y_test = pd.get_dummies(df['sentiment'][25000:])

print('Preprocessing Done!')
print('Ready to build our LSTM model')


Preprocessing Done!
Ready to build our LSTM model
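
The call to pad_sequences above relies on its default behaviour, which pads and truncates at the start of each sequence using the value 0. The sketch below reproduces that behaviour for a single sequence in plain Python; pad_pre is a hypothetical helper written for illustration and is not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate a sequence so it has exactly maxlen items."""
    seq = list(seq)
    if len(seq) > maxlen:
        # Too long: drop tokens from the start and keep the last maxlen tokens
        seq = seq[len(seq) - maxlen:]
    # Too short: fill the front with the padding value
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([7, 8, 9], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)
print(long_)
```

With maxlen=250, a review longer than 250 tokens therefore keeps only its last 250 tokens, and a shorter review is filled with zeros at the front.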


## Constructing and fitting LSTM for the movie reviews dataset

We fit an LSTM to determine whether a movie review is positive or negative.
Our model includes an embedding layer, an LSTM layer and a final Dense layer for classification.
We also use dropout (a SpatialDropout1D layer plus the dropout and recurrent_dropout arguments of the LSTM layer) to tackle overfitting.


In [4]:
lstm_out = 50
model = Sequential()
model.add(Embedding(len(tk.word_index)+1, 50, input_length=250))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())


model.fit(X_train, y_train, epochs = 5, batch_size=200, verbose = 1,validation_data=(X_test,y_test))
score, acc = model.evaluate(X_test, y_test,
                            batch_size=200)
print('Test score:', score)
print('Test accuracy for our movie review classification problem:', acc)


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 250, 50)           6212650   
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 250, 50)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                20200     
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 102       
Total params: 6,232,952
Trainable params: 6,232,952
Non-trainable params: 0
_________________________________________________________________
None


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test score: 0.38071398198604584
Test accuracy for our movie review classification problem: 0.8469600081443787
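
As a sanity check on the model summary, the parameter counts can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers. The vocabulary size of 124,253 (that is, len(tk.word_index) + 1) is not printed anywhere in the notebook; it is inferred here by dividing the embedding parameter count by the embedding dimension. The helper function names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

embedding_dim = 50
vocab_size = 6212650 // embedding_dim   # inferred from the summary: 124253

emb = embedding_params(vocab_size, embedding_dim)
lstm = lstm_params(input_dim=embedding_dim, units=50)
dense = dense_params(input_dim=50, units=2)
total = emb + lstm + dense
print(emb, lstm, dense, total)
```

Almost all of the trainable parameters sit in the embedding layer, which is one reason dropout is applied directly to its output.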
