# LSTM

This notebook performs sentiment analysis on the IMDB movie review data set using a simple LSTM (Long Short-Term Memory) network.

LSTMs have a proven track record on long sequential data, and movie reviews are sequential text, so an LSTM is a reasonable model to try here.



In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
import re

# Data Downloading

We first download the data and load it into a TensorFlow data set.

In [2]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xzf aclImdb_v1.tar.gz
!ls

--2021-12-07 18:16:12--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz.4’


2021-12-07 18:16:15 (29.6 MB/s) - ‘aclImdb_v1.tar.gz.4’ saved [84125825/84125825]

aclImdb		   aclImdb_v1.tar.gz.1	aclImdb_v1.tar.gz.3  sample_data
aclImdb_v1.tar.gz  aclImdb_v1.tar.gz.2	aclImdb_v1.tar.gz.4


In [3]:
!pip install tensorflow-datasets > /dev/null

In [4]:
import tensorflow_datasets as tfds

In [5]:
(ds_train,ds_test),ds_info = tfds.load(
    name="imdb_reviews",
    split=["train","test"],
    shuffle_files=True,
    as_supervised=True,
    with_info=True)

# Steps for implementation:

1 Data processing: cleaning the data, splitting into X_train, y_train, X_test, y_test, removing stop words, and finally calculating maxlen for encoding

2 Tokenization and encoding: tokenizing and encoding X_train and X_test

3 LSTM modelling: a standard LSTM model and a bidirectional LSTM model

PS: the details of each step are explained in the code comments.

# Data processing.

We first convert the data into a pandas DataFrame, since that is easier to work with. We then clean the text and make it compatible with the rest of the analysis.


In [6]:
# Create a DataFrame from the TensorFlow dataset object.
# take() is given a value above 25000 (the size of each split) so that no rows are missed.
ds_train = tfds.as_dataframe(ds_train.take(25100), ds_info)
ds_test = tfds.as_dataframe(ds_test.take(25100), ds_info)

In [7]:
ds_train.head(5)
# The b prefixes show that the text is stored as bytes (UTF-8 encoded),
# so we decode it to str and do some basic cleaning before further processing.

Unnamed: 0,label,text
0,0,"b""This was an absolutely terrible movie. Don't..."
1,0,b'I have been known to fall asleep during film...
2,0,b'Mann photographs the Alberta Rocky Mountains...
3,1,b'This is the kind of film for a snowy Sunday ...
4,1,"b'As others have mentioned, all the women that..."


In [8]:
# There are some stray characters in the text, so we do some very basic cleaning.
def basic_clean(txt):
  txt = txt.decode("utf-8").lower()  # decode bytes to str (removes the b prefix) and lowercase
  txt = re.sub(r"[.;:!'?,\"()\[\]]", "", txt)  # remove punctuation
  txt = re.sub(r"(<br\s*/><br\s*/>)|(\-)|(\/)", " ", txt)  # replace double <br /> tags, hyphens and slashes with a space
  return txt
ds_train['text'] = ds_train['text'].apply(basic_clean)  # clean the training set
ds_test['text'] = ds_test['text'].apply(basic_clean)  # clean the test set
ds_train.head(5)

Unnamed: 0,label,text
0,0,this was an absolutely terrible movie dont be ...
1,0,i have been known to fall asleep during films ...
2,0,mann photographs the alberta rocky mountains i...
3,1,this is the kind of film for a snowy sunday af...
4,1,as others have mentioned all the women that go...
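
To see what each cleaning step does, the sketch below applies the same two regular expressions to a short review. The sample text is made up for illustration and is not taken from the data set.

```python
import re

# Same two patterns as in basic_clean, compiled once as raw strings.
PUNCT = re.compile(r"[.;:!'?,\"()\[\]]")
BREAKS = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def basic_clean(raw: bytes) -> str:
    text = raw.decode("utf-8").lower()   # bytes -> lowercase str
    text = PUNCT.sub("", text)           # drop punctuation
    return BREAKS.sub(" ", text)         # double <br /> tags, '-' and '/' -> space

# Hypothetical sample, only for illustration.
sample = b"Don't miss it!<br /><br />A well-made film."
cleaned = basic_clean(sample)
print(cleaned)
```

Note that the pattern only matches two consecutive `<br />` tags. A single tag is not removed here; after the later non-alphabetic filter, its letters `br` would remain as a token.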


In [9]:
pd.set_option('display.max_colwidth', 1000) #to view more of data in the data frame
ds_train.head(10) #check the application of the function

Unnamed: 0,label,text
0,0,this was an absolutely terrible movie dont be lured in by christopher walken or michael ironside both are great actors but this must simply be their worst role in history even their great acting could not redeem this movies ridiculous storyline this movie is an early nineties us propaganda piece the most pathetic scenes were those when the columbian rebels were making their cases for revolutions maria conchita alonso appeared phony and her pseudo love affair with walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning i am disappointed that there are movies like this ruining actors like christopher walkens good name i could barely sit through it
1,0,i have been known to fall asleep during films but this is usually due to a combination of things including really tired being warm and comfortable on the sette and having just eaten a lot however on this occasion i fell asleep because the film was rubbish the plot development was constant constantly slow and boring things seemed to happen but with no explanation of what was causing them or why i admit i may have missed part of the film but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else i cant recommend this film at all
2,0,mann photographs the alberta rocky mountains in a superb fashion and jimmy stewart and walter brennan give enjoyable performances as they always seem to do but come on hollywood a mountie telling the people of dawson city yukon to elect themselves a marshal yes a marshal and to enforce the law themselves then gunfighters battling it out on the streets for control of the town nothing even remotely resembling that happened on the canadian side of the border during the klondike gold rush mr mann and company appear to have mistaken dawson city for deadwood the canadian north for the american wild west canadian viewers be prepared for a reefer madness type of enjoyable howl with this ludicrous plot or to shake your head in disgust
3,1,this is the kind of film for a snowy sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm chair and mellow for a couple of hours wonderful performances from cher and nicolas cage as always gently row the plot along there are no rapids to cross no dangerous waters just a warm and witty paddle through new york life at its best a family film in every sense and one that deserves the praise it received
4,1,as others have mentioned all the women that go nude in this film are mostly absolutely gorgeous the plot very ably shows the hypocrisy of the female libido when men are around they want to be pursued but when no men are around they become the pursuers of a 14 year old boy and the boy becomes a man really fast we should all be so lucky at this age he then gets up the courage to pursue his true love
5,1,this is a film which should be seen by anybody interested in effected by or suffering from an eating disorder it is an amazingly accurate and sensitive portrayal of bulimia in a teenage girl its causes and its symptoms the girl is played by one of the most brilliant young actresses working in cinema today alison lohman who was later so spectacular in where the truth lies i would recommend that this film be shown in all schools as you will never see a better on this subject alison lohman is absolutely outstanding and one marvels at her ability to convey the anguish of a girl suffering from this compulsive disorder if barometers tell us the air pressure alison lohman tells us the emotional pressure with the same degree of accuracy her emotional range is so precise each scene could be measured microscopically for its gradations of trauma on a scale of rising hysteria and desperation which reaches unbearable intensity mare winningham is the perfect choice to play her mother and does so...
6,0,okay you have penelope keith as miss herringbone tweed bbe backbone of england shes killed off in the first scene thats right folks this show has no backbone peter otoole as ol colonel cricket from the first war and now the emblazered lord of the manor joanna lumley as the ensweatered lady of the manor 20 years younger than the colonel and 20 years past her own prime but still glamourous brit spelling not mine enough to have a toy boy on the side its alright they have col crickets full knowledge and consent they guy even comes round for christmas still shes considerate of the colonel enough to have said toy boy her own age what a gal david mccallum as said toy boy equally as pointlessly glamourous as his squeeze pilcher couldnt come up with any cover for him within the story so she gave him a hush hush job at the circus and finally susan hampshire as miss polonia teacups venerable headmistress of the venerable girls boarding school serving tea in her office with a dash of deep po...
7,0,the film is based on a genuine 1950s novel journalist colin mcinnes wrote a set of three london novels absolute beginners city of spades and mr love and justice i have read all three the first two are excellent the last perhaps an experiment that did not come off but mcinness work is highly acclaimed and rightly so this musical is the novelists ultimate nightmare to see the fruits of ones mind being turned into a glitzy badly acted soporific one dimensional apology of a film that says it captures the spirit of 1950s london and does nothing of the sort thank goodness colin mcinnes wasnt alive to witness it
8,0,i really love the sexy action and sci fi films of the sixties and its because of the actresss that appeared in them they found the sexiest women to be in these films and it didnt matter if they could act remember candy the reason i was disappointed by this film was because it wasnt nostalgic enough the story here has a european sci fi film called dragonfly being made and the director is fired so the producers decide to let a young aspiring filmmaker jeremy davies to complete the picture theyre is one real beautiful woman in the film who plays dragonfly but shes barely in it film is written and directed by roman coppola who uses some of his fathers exploits from his early days and puts it into the script i wish the film could have been an homage to those early films they could have lots of cameos by actors who appeared in them there is one actor in this film who was popular from the sixties and its john phillip law barbarella gerard depardieu giancarlo giannini and dean stockwell ap...
9,0,sure this one isnt really a blockbuster nor does it target such a position dieter is the first name of a quite popular german musician who is either loved or hated for his kind of acting and thats exactly what this movie is about it is based on the autobiography dieter bohlen wrote a few years ago but isnt meant to be accurate on that the movie is filled with some sexual offensive content at least for american standard which is either amusing not for the other actors of course or dumb it depends on your individual kind of humor or on you being a bohlen fan or not technically speaking there isnt much to criticize speaking of me i find this movie to be an ok movie


In [10]:
#split the data into X_train, y_train, X_test, y_test
X_train = ds_train['text']
y_train = ds_train['label']
X_test = ds_test['text']
y_test = ds_test['label']

In [11]:
# Remove stop words from test and train. These words occur very often and add little meaning to the sentences.
nltk.download('stopwords')
english_stops = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
X_test = X_test.replace({'[^A-Za-z]': ' '}, regex = True) #remove non alphabet for textual analysis
X_test = X_test.apply(lambda text: [w for w in text.split() if w not in english_stops])  #remove stop words

In [13]:
X_train = X_train.replace({'[^A-Za-z]': ' '}, regex = True)     #remove non alphabet for textual analysis
X_train = X_train.apply(lambda text: [w for w in text.split() if w not in english_stops])  # remove stop words
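
The two steps above (replace non-letters with spaces, then drop stop words) can be tried on a single sentence. This is a minimal sketch; `demo_stops` is a tiny stand-in for the NLTK list and `to_tokens` is a hypothetical helper, not part of the notebook.

```python
import re

# Tiny stand-in for the NLTK English stop word list (assumption for this demo).
demo_stops = {"this", "was", "an", "the", "be", "in", "by"}

def to_tokens(text, stops):
    """Replace every non-letter with a space, split on whitespace, drop stop words."""
    letters_only = re.sub(r"[^A-Za-z]", " ", text)   # same pattern as in the cells above
    return [w for w in letters_only.split() if w not in stops]

tokens = to_tokens("this was an absolutely terrible movie from 1990", demo_stops)
print(tokens)
```

After this step each review is a list of word tokens rather than a string, which is the form `fit_on_texts` receives later.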

In [14]:
# Compute the sequence length (maxlen) used to pad and truncate the encoded reviews.
# We use the mean length (in tokens) of the training reviews.
# The mean (122) is higher than the median (90), so we went with the mean.
def get_maxlen():
    review_len = [len(txt) for txt in X_train]
    return int(np.ceil(np.mean(review_len)))
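
The choice between mean and median matters because every review longer than maxlen is truncated. The sketch below shows the trade-off on a toy corpus; the review lengths are invented and `mean_maxlen` is a hypothetical stand-alone version of the function above.

```python
import math
import statistics

def mean_maxlen(token_lists):
    """Ceiling of the mean number of tokens per review."""
    return math.ceil(statistics.mean(len(t) for t in token_lists))

# Hypothetical corpus: four reviews with 2, 3, 4 and 12 tokens.
toy_reviews = [["w"] * n for n in (2, 3, 4, 12)]

maxlen = mean_maxlen(toy_reviews)                              # mean 5.25, rounded up
median_len = statistics.median(len(t) for t in toy_reviews)   # middle of the sorted lengths
truncated = sum(len(t) > maxlen for t in toy_reviews)          # reviews that lose tokens

print(maxlen, median_len, truncated)
```

A few long reviews pull the mean above the median, so a mean-based maxlen truncates fewer reviews at the cost of more padding on the short ones.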

# Tokenization and Encoding data

In [15]:
token = Tokenizer(lower=False)    # no need to lowercase again: the text was already lowercased in basic_clean()
token.fit_on_texts(X_train)  # build the vocabulary from the training data only
X_train = token.texts_to_sequences(X_train)  # map each word to its integer index from fit_on_texts
X_test = token.texts_to_sequences(X_test)  # same vocabulary; words that occur only in the test data are dropped

max_length = get_maxlen()  # fixed sequence length: the mean training review length

# Truncate sequences longer than max_length and zero-pad shorter ones (both at the end),
# so that every sequence has the same length.
X_train = pad_sequences(X_train, maxlen=max_length, padding='post', truncating='post')
X_test = pad_sequences(X_test, maxlen=max_length, padding='post', truncating='post')

total_words = len(token.word_index) + 1   # add 1 because of 0 padding

print('Encoded X Train\n', X_train, '\n')
print('Encoded X Test\n', X_test, '\n')
print('Maximum review length: ', max_length)

Encoded X Train
 [[  312   284     1 ...     0     0     0]
 [  451   674  2211 ...     0     0     0]
 [ 4198  5932 26692 ...     0     0     0]
 ...
 [  750    90    53 ...  4587   475  1171]
 [  116  1333   138 ...     0     0     0]
 [   18 16630   276 ...   294     7  4731]] 

Encoded X Test
 [[   25    23  3887 ...     0     0     0]
 [36686   571   662 ...     3 19996   603]
 [  504     1  1547 ...   111 20541   176]
 ...
 [28021     1   103 ...     0     0     0]
 [  492   745  8648 ...     0     0     0]
 [   20    23  2263 ...     0     0     0]] 

Maximum review length:  122
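
To make the encoding step less of a black box, here is a pure-Python sketch of what the tokenizer and padding do as used above: words are ranked by frequency starting at 1, unseen words are dropped, and sequences are truncated or zero-padded at the end. The helper names and the toy data are assumptions for illustration; this is not the Keras implementation.

```python
from collections import Counter

def fit_word_index(token_lists):
    """Map each word to an integer rank; the most frequent word gets 1, 0 is kept for padding."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    # most_common() sorts by count and keeps first-seen order among ties
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def encode(tokens, word_index):
    """Replace words by their index; words missing from the vocabulary are dropped."""
    return [word_index[w] for w in tokens if w in word_index]

def pad_post(seq, maxlen):
    """Truncate at the end if too long, pad with zeros at the end if too short."""
    return (seq + [0] * maxlen)[:maxlen]

train = [["great", "movie"], ["terrible", "movie", "plot"]]
word_index = fit_word_index(train)
vocab_size = len(word_index) + 1   # +1 for the padding index 0

long_review = pad_post(encode(["great", "unseen", "movie", "plot", "movie"], word_index), 3)
short_review = pad_post(encode(["great", "unseen"], word_index), 3)
print(word_index, long_review, short_review)
```

The trailing zeros in the encoded arrays printed above are this kind of padding, and the `+ 1` in `total_words` accounts for index 0.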


# LSTM Model

In [38]:
EMBED_DIM = 32
LSTM_1 = 32  # number of LSTM units
from tensorflow.keras import backend as K
K.clear_session()  # reset the Keras global state left over from earlier models
# model
model = Sequential()  # instantiate the model
# Embedding: input dimension is the vocabulary size (total_words), output dimension is EMBED_DIM,
# and input_length is the max_length used when encoding train and test.
model.add(Embedding(total_words, EMBED_DIM, input_length = max_length))
model.add(LSTM(LSTM_1))  # LSTM layer
model.add(Dense(1, activation='sigmoid'))  # binary output, so a single unit with sigmoid activation
# binary_crossentropy matches the binary target; accuracy is a suitable metric because the data set
# is balanced between negative and positive reviews; rmsprop gave the best results among the optimizers we tried.
model.compile(optimizer = 'rmsprop', loss = 'binary_crossentropy', metrics = ['accuracy'])

print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 122, 32)           2888800   
                                                                 
 lstm (LSTM)                 (None, 32)                8320      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 2,897,153
Trainable params: 2,897,153
Non-trainable params: 0
_________________________________________________________________
None
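
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for Embedding, LSTM and Dense layers. The vocabulary size of 90,275 is not printed anywhere in the notebook; it is inferred here from the embedding's parameter count (2,888,800 / 32).

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # four weight sets (input, forget and output gates plus the cell candidate),
    # each with an input kernel, a recurrent kernel and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus bias
    return input_dim * units + units

VOCAB = 2_888_800 // 32   # inferred vocabulary size (assumption, see above)

emb = embedding_params(VOCAB, 32)
lstm = lstm_params(32, 32)
dense = dense_params(32, 1)
total = emb + lstm + dense
print(emb, lstm, dense, total)
```

Almost all parameters sit in the embedding layer; the LSTM itself is small by comparison.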


In [39]:
history = model.fit(X_train, y_train, batch_size = 32, epochs = 5)
#LSTM overfits quite quickly, hence we run fewer epochs

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
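
The number of epochs above was chosen by hand. One common alternative is to hold out validation data and stop once the validation loss stops improving. The sketch below shows only that stopping rule in plain Python; the notebook does not do this, `early_stop` is a hypothetical helper, and the loss values are invented for illustration.

```python
def early_stop(val_losses, patience=1):
    """Return (best_epoch, stopped_epoch), both 1-based.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# Hypothetical validation losses: improving for three epochs, then getting worse.
result = early_stop([0.45, 0.36, 0.34, 0.37, 0.41])
print(result)
```

In Keras the same idea is available through the `validation_split` argument of `fit` together with the `EarlyStopping` callback.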


In [40]:
score, acc = model.evaluate(X_test, y_test,
                            batch_size=32)




In [19]:
print('Test accuracy:', acc)

Test accuracy: 0.8533999919891357
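
The accuracy reported by `model.evaluate` is the fraction of reviews whose predicted class matches the label, where the sigmoid output is thresholded at 0.5 (the Keras default for binary accuracy). A minimal pure-Python sketch with made-up probabilities and labels:

```python
def binary_accuracy(probs, labels, threshold=0.5):
    """Share of examples where (prob > threshold) agrees with the 0/1 label."""
    preds = [1 if p > threshold else 0 for p in probs]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

# Hypothetical sigmoid outputs and true labels, only for illustration.
demo_acc = binary_accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1])
print(demo_acc)
```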


# Bidirectional LSTM

A bidirectional model processes the sequence from left to right and from right to left and concatenates the two representations, so it can draw on context from both directions. That makes it a worthwhile approach to try here.
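
The idea of reading a sequence in both directions and concatenating the results can be shown with a toy recurrence. This is only a sketch: `run_rnn` is a hypothetical one-number recurrent cell, not an LSTM, and a real bidirectional layer has separate weights for each direction.

```python
def run_rnn(xs, decay=0.5):
    """Toy recurrent cell: the new state is a decayed old state plus the current input."""
    h = 0.0
    for x in xs:
        h = decay * h + x
    return h

def bidirectional_summary(xs):
    """Run the cell left-to-right and right-to-left, then concatenate both final states."""
    forward = run_rnn(xs)
    backward = run_rnn(xs[::-1])
    return [forward, backward]

summary = bidirectional_summary([1.0, 2.0, 3.0])
print(summary)
```

The forward state is dominated by the last inputs and the backward state by the first ones, so the concatenation carries information about both ends of the sequence.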

In [31]:
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras import backend as K
EMBED_DIM = 32
K.clear_session()

model = Sequential()  # instantiate the model
# Embedding: input dimension is the vocabulary size (total_words), output dimension is EMBED_DIM,
# and input_length is the max_length used when encoding train and test.
model.add(Embedding(total_words, EMBED_DIM, input_length = max_length))
model.add(Bidirectional(LSTM(16, return_sequences=True)))  # first bidirectional layer; returns the full sequence for the next layer
model.add(Bidirectional(LSTM(16)))  # second bidirectional layer
model.add(Dense(1, activation='sigmoid'))  # binary output, so a single unit with sigmoid activation
# binary_crossentropy matches the binary target; accuracy is a suitable metric because the data set
# is balanced; rmsprop gave the best results among the optimizers we tried.
model.compile(optimizer = 'rmsprop', loss = 'binary_crossentropy', metrics = ['accuracy'])

print(model.summary())


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 122, 32)           2888800   
                                                                 
 bidirectional (Bidirectiona  (None, 122, 32)          6272      
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 32)               6272      
 nal)                                                            
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 2,901,377
Trainable params: 2,901,377
Non-trainable params: 0
_________________________________________________________________
None
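
The bidirectional parameter counts can be verified with the same LSTM formula: each `Bidirectional` wrapper holds two LSTMs, and the second layer receives 32 inputs because the first one concatenates two 16-unit outputs. As before, the vocabulary size is inferred from the embedding's parameter count and is an assumption of this sketch.

```python
def lstm_params(input_dim, units):
    # four weight sets, each with an input kernel, a recurrent kernel and a bias
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_params(input_dim, units):
    # one forward and one backward LSTM with separate weights
    return 2 * lstm_params(input_dim, units)

EMBED_PARAMS = 2_888_800                      # taken from the summary above
first = bidirectional_lstm_params(32, 16)     # input is the 32-dim embedding
second = bidirectional_lstm_params(2 * 16, 16)  # input is the concatenated 2 x 16 output
dense = 2 * 16 + 1                            # 32 weights plus one bias
bi_total = EMBED_PARAMS + first + second + dense
print(first, second, dense, bi_total)
```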


In [32]:
history = model.fit(X_train, y_train, batch_size = 32, epochs = 3)
#LSTM overfits quite quickly, hence we run fewer epochs

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [36]:
score, acc = model.evaluate(X_test, y_test, batch_size=32)



In [37]:
print('Test accuracy Bidirectional:', acc)

Test accuracy Bidirectional: 0.8561199903488159


# Conclusion

The LSTM gives decent accuracy, but both the standard and the bidirectional LSTM overfit quite easily when trained for a high number of epochs.

The bidirectional LSTM (85.6%) reaches almost the same test accuracy as the standard LSTM (85.3%), even though it was trained for only 3 epochs instead of 5. The bidirectional LSTM probably picks up the context more quickly than the standard LSTM.