## Sentiment classification using LSTM
In this notebook, we use an LSTM architecture to train a model on a movie review dataset to predict the sentiment of each review. First, what is an LSTM?<br/>
![LSTM architecture](https://technopremium.com/blog/wp-content/uploads/2019/06/LSTM-cell-structure-1-1200x600.jpg)
LSTM, or long short-term memory, is a recurrent neural network (RNN) architecture designed to retain information from earlier steps of a sequence. The plain RNN is the older sequence model, but it struggles to carry information across many time steps (the vanishing gradient problem), so context is lost in long text sequences.<br/>
LSTM was introduced to maintain this context. An LSTM cell contains a cell state and a set of gates that control what is written to, kept in, and read from that state. To understand how these structures work, read [this blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).<br/>
Code-wise, we use TensorFlow and Keras to build and train the model. The following references helped with the code and concepts in this project.<br/>
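
To make the role of the gates and the cell state concrete, here is a minimal NumPy sketch of a single LSTM time step. It follows the standard textbook formulation rather than the Keras implementation; the function names, the gate ordering inside the stacked weight matrices, and the toy sizes are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM cell (illustrative sketch).

    W: (4*units, input_dim), U: (4*units, units), b: (4*units,)
    Assumed gate order in the stacked weights: input, forget, candidate, output.
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])  # input gate: how much new information to write
    f = sigmoid(z[1 * units:2 * units])  # forget gate: how much old cell state to keep
    g = np.tanh(z[2 * units:3 * units])  # candidate values for the cell state
    o = sigmoid(z[3 * units:4 * units])  # output gate: how much of the state to expose
    c_t = f * c_prev + i * g             # updated cell state (the "memory")
    h_t = o * np.tanh(c_t)               # updated hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units = 3, 2                  # toy sizes, chosen only for this example
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):  # a toy sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)  # (2,) (2,)
```

The cell state `c` is updated additively (`f * c_prev + i * g`), which is what lets information survive over many steps when the forget gate stays close to 1.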
### References:
(1) [Medium article on Keras LSTM](https://medium.com/@dclengacher/keras-lstm-recurrent-neural-networks-c1f5febde03d)<br/>
(2) [Keras Embedding layer documentation](https://keras.io/api/layers/core_layers/embedding/#embedding)<br/>
(3) [Keras example of text classification from scratch](https://keras.io/examples/nlp/text_classification_from_scratch/)<br/>
(4) [Bidirectional LSTM model example](https://keras.io/examples/nlp/bidirectional_lstm_imdb/)<br/>
(5) [Kaggle notebook for text preprocessing](https://www.kaggle.com/shyambhu/score-and-nsfw-modeling-with-reddit-data)<br/>

In [2]:
import os

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# List every file under the data directory
for dirname, _, filenames in os.walk('../data'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

../data/fr_vocab_small.txt
../data/sentiment_train.tsv
../data/sentiment_test.tsv
../data/en_vocab_small.txt


In [3]:
train_data = pd.read_csv('../data/sentiment_train.tsv',sep = '\t')
test_data = pd.read_csv('../data/sentiment_test.tsv',sep = '\t')
train_data.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
1,2,1,A series of escapades demonstrating the adage ...,2
2,3,1,A series,2
3,4,1,A,2
4,5,1,series,2


In [4]:
train_data = train_data.drop(['PhraseId','SentenceId'],axis = 1)
test_data = test_data.drop(['PhraseId','SentenceId'],axis = 1)

In [5]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, Dense, Embedding, LSTM
from tensorflow.keras.preprocessing.text import one_hot

from sklearn.model_selection import train_test_split

In [6]:
max_features = 20000  # Defined here but not used below; the encoding uses vocab_size = 50000
maxlen = 200  # Pad or truncate every phrase to 200 tokens

In [7]:
train_data.head()

Unnamed: 0,Phrase,Sentiment
0,A series of escapades demonstrating the adage ...,1
1,A series of escapades demonstrating the adage ...,2
2,A series,2
3,A,2
4,series,2


In [8]:
import re

from nltk.corpus import stopwords

# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def text_cleaning(text):
    """Lower-case a phrase, strip URLs and non-letters, and drop stop words."""
    if not isinstance(text, str) or not text:
        return []
    # Remove URLs first: replacing '.' and '/' beforehand would break them apart
    text = re.sub(r'http\S+', ' ', text)
    # Keep letters only; this also turns '.', '/' and '\' into spaces
    text = re.sub(r'[^A-Za-z]', ' ', text.lower())
    return [word for word in text.split() if word not in STOP_WORDS]
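
As a worked example of what this kind of cleaning does to one phrase, here is a self-contained sketch. It does not call NLTK: the tiny `STOP_WORDS_DEMO` set and the `clean_demo` function are hypothetical stand-ins for illustration only. The sketch strips URLs before anything else, because replacing dots and slashes first would split a URL into pieces that the URL pattern no longer matches.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration;
# the notebook itself relies on NLTK's full English list.
STOP_WORDS_DEMO = {"a", "of", "the", "is", "and", "it"}

def clean_demo(text):
    text = re.sub(r"http\S+", " ", text)            # drop URLs while they are still intact
    text = re.sub(r"[^A-Za-z]", " ", text.lower())  # keep letters only
    return [w for w in text.split() if w not in STOP_WORDS_DEMO]

sample = "A series of escapades... It is GOOD/great, see http://example.com"
tokens = clean_demo(sample)
print(tokens)  # ['series', 'escapades', 'good', 'great', 'see']
```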

In [9]:
train_data['flag'] = 'TRAIN'
test_data['flag'] = 'TEST'
total_docs = pd.concat([train_data,test_data],axis = 0,ignore_index = True)
total_docs['Phrase'] = total_docs['Phrase'].apply(lambda x: ' '.join(text_cleaning(x)))
phrases = total_docs['Phrase'].tolist()

vocab_size = 50000
encoded_phrases = [one_hot(d, vocab_size) for d in phrases]
total_docs['Phrase'] = encoded_phrases
train_data = total_docs[total_docs['flag'] == 'TRAIN']
test_data = total_docs[total_docs['flag'] == 'TEST']
x_train = train_data['Phrase']
y_train = train_data['Sentiment']
x_val = test_data['Phrase']
y_val = test_data['Sentiment']
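
Despite its name, Keras' `one_hot` does not build one-hot vectors: it hashes each word to an integer below `vocab_size`, so two different words can share an index. With Python's built-in `hash` as the hash function, the mapping can also differ between interpreter sessions, because string hashing is salted per process. The sketch below illustrates the hashing idea with a stable md5 hash; `hashed_index` and `encode` are hypothetical helpers, not the Keras implementation.

```python
import hashlib

def hashed_index(word, n):
    """Map a word to an integer in [1, n - 1] using a stable hash (md5)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1  # index 0 stays free for padding

def encode(phrase, n):
    return [hashed_index(w, n) for w in phrase.split()]

vocab_size = 50000
a = encode("series escapades good", vocab_size)
b = encode("good series", vocab_size)
print(a, b)  # the same word always receives the same index

# With a very small index space, collisions are unavoidable:
tiny = encode("one two three four", 4)  # only 3 possible indices for 4 words
print(len(set(tiny)) < 4)               # True
```

A larger `vocab_size` lowers the collision rate at the cost of a larger embedding table.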


In [47]:
y_train.unique()

array([1., 2., 3., 4., 0.])

In [11]:
features = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
targets = y_train
x_train, x_val, y_train, y_val = train_test_split(features, targets, test_size=0.3333)
print(x_train[0])

[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0   
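
The leading zeros in this output come from the default behaviour of `pad_sequences`, which pads (and truncates) at the start of each sequence. The NumPy sketch below mimics that default on toy data; `pad_pre` is a hypothetical helper written for illustration, not the Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue                    # an empty phrase stays all padding
        trunc = seq[-maxlen:]           # keep the last maxlen tokens
        out[row, -len(trunc):] = trunc  # right-align; padding stays on the left
    return out

padded = pad_pre([[5, 9], [1, 2, 3, 4, 5, 6], []], maxlen=4)
print(padded)
# [[0 0 5 9]
#  [3 4 5 6]
#  [0 0 0 0]]
```

Since most phrases in this dataset are far shorter than 200 tokens, most of each row is padding.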

In [52]:
model = Sequential()
# Variable-length sequences of integer word indices
model.add(tf.keras.Input(shape=(None,), dtype="int32"))
# Embed each integer in a 128-dimensional vector
model.add(Embedding(vocab_size, 128))
# Add 2 bidirectional LSTMs
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Bidirectional(LSTM(64)))
# Add a classifier: softmax over the 5 mutually exclusive sentiment classes
model.add(Dense(5, activation="softmax"))
model.summary()

print(x_train.shape)

Model: "sequential_16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_15 (Embedding)     (None, None, 128)         6400000   
_________________________________________________________________
bidirectional_30 (Bidirectio (None, None, 128)         98816     
_________________________________________________________________
bidirectional_31 (Bidirectio (None, 128)               98816     
_________________________________________________________________
dense_15 (Dense)             (None, 5)                 645       
Total params: 6,598,277
Trainable params: 6,598,277
Non-trainable params: 0
_________________________________________________________________
(104045, 200)
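
The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer sizes, using the usual formulas for embedding, LSTM, and dense layers; the helper function names are made up for this example.

```python
def embedding_params(vocab, dim):
    return vocab * dim  # one dim-sized vector per index

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units  # weights plus biases

emb = embedding_params(50000, 128)
bi1 = 2 * lstm_params(128, 64)      # forward + backward LSTM on 128-d embeddings
bi2 = 2 * lstm_params(2 * 64, 64)   # input is the 128-d concatenation from the first layer
out = dense_params(2 * 64, 5)
total = emb + bi1 + bi2 + out
print(emb, bi1, bi2, out, total)    # 6400000 98816 98816 645 6598277
```

Almost all of the parameters sit in the embedding table, which is a direct consequence of the 50,000-entry index space.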


In [None]:
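model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])  # assumed settings; the original shows no compile step, and fit() needs a compiled model
model.fit(x_train, y_train, batch_size=32, epochs=5, validation_data=(x_val, y_val))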
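
Once trained, the model produces one score per sentiment class for each phrase, and the predicted label is the index of the highest score. The sketch below shows that step and the accuracy calculation with NumPy. The scores and labels are made up for illustration; in the notebook the scores would come from something like `model.predict(x_val)`.

```python
import numpy as np

# Made-up scores for 3 phrases over the 5 sentiment classes (0-4)
scores = np.array([
    [0.05, 0.10, 0.60, 0.20, 0.05],
    [0.70, 0.15, 0.05, 0.05, 0.05],
    [0.10, 0.10, 0.20, 0.25, 0.35],
])
true_labels = np.array([2, 1, 4])  # also made up

pred_labels = scores.argmax(axis=1)             # most likely class per phrase
accuracy = (pred_labels == true_labels).mean()  # fraction predicted correctly
print(pred_labels, accuracy)
```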

In conclusion, we built a bidirectional LSTM model and trained it to classify sentiment. It reached 80% training accuracy and 82% validation accuracy.