<h1>Sentiment Analysis Using RNN (LSTM)</h1>
<p>Copyright : Paritosh Morparia</p>
<p>Indiana University</p>

<h4>Data Source</h4>
<p>The data used here is provided by Keras as <a href="https://keras.io/datasets/">IMDB movie reviews</a>, where each review has been classified as either positive or negative.</p>
<p>The data is available to import using the function:</p>
<b>keras.datasets.imdb.load_data()</b><br>

<p>The reasons for using this dataset are:</p>
<ul>
    <li>It has 50000 reviews</li>
    <li>It is easy to use because the reviews have already been converted to an ndarray of numerical values</li>
    <li>Little preprocessing is required</li>
</ul>


<p>There is a really good example of <a href="https://github.com/keras-team/keras/edit/master/examples/imdb_lstm.py">movie reviews using LSTM</a>, which I used as a reference for this assignment.</p>

<h4>Fetching the data from keras</h4>
<ul><li><p>It gives data in a numpy array</p></li></ul>
<h4>Padding the data after fetching it</h4>

In [0]:
from keras.datasets import imdb


(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=5000, #Vocab_size
                                                      skip_top=10,
                                                      maxlen=200)
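
<p>A note on <code>maxlen</code>: the sample counts reported later in this notebook (19,371 training rows and 9,542 validation rows, 28,913 in total rather than 50,000) suggest that reviews longer than the limit were dropped rather than shortened. The sketch below contrasts the two behaviours on toy data. <code>drop_long</code> and <code>truncate_long</code> are hypothetical helpers written for illustration, not Keras functions, and the exact boundary handling is an assumption.</p>

```python
def drop_long(sequences, labels, maxlen):
    """Keep only the (sequence, label) pairs that fit within maxlen."""
    kept = [(s, l) for s, l in zip(sequences, labels) if len(s) <= maxlen]
    return [s for s, _ in kept], [l for _, l in kept]


def truncate_long(sequences, maxlen):
    """Keep every sequence but cut each one to at most maxlen tokens."""
    return [s[:maxlen] for s in sequences]


# Toy word-index sequences with their sentiment labels.
reviews = [[5, 8, 2], [4, 9, 1, 7, 3, 6], [2, 2]]
labels = [1, 0, 1]

dropped_x, dropped_y = drop_long(reviews, labels, maxlen=4)
truncated_x = truncate_long(reviews, maxlen=4)

print(len(dropped_x), dropped_y)  # fewer reviews, labels stay aligned
print(len(truncated_x))           # same number of reviews as the input
```

<p>Dropping shrinks the dataset but keeps each remaining review intact, while truncating keeps every review at the cost of losing the tail of the long ones.</p>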

<h4>Shuffling and splitting the train and test data</h4>

In [0]:
import numpy as np
x=np.append(x_train,x_test)
y=np.append(y_train,y_test)

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x , y, test_size=0.33, random_state=42,shuffle=True)

In [0]:
import numpy as np

# Right-pad every review with zeros to a fixed length of 200 tokens.
x_train2 = np.zeros((len(x_train), 200), dtype='int')
x_test2 = np.zeros((len(x_test), 200), dtype='int')

for i, seq in enumerate(x_train):
    x_train2[i] = np.pad(seq, (0, 200 - len(seq)), "constant")
for i, seq in enumerate(x_test):
    x_test2[i] = np.pad(seq, (0, 200 - len(seq)), "constant")
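
<p>To make the padding step concrete, here is a minimal sketch on a toy sequence. <code>pad_right</code> and <code>MAX_LEN</code> are names introduced for illustration only. The zeros are appended after the tokens, which matches the <code>(0, 200-len(x))</code> pad widths used in the cell above.</p>

```python
import numpy as np

MAX_LEN = 5  # toy value; the notebook uses 200


def pad_right(seq, max_len):
    """Append zeros until the sequence has exactly max_len entries."""
    return np.pad(seq, (0, max_len - len(seq)), "constant")


short_review = [12, 47, 3]
padded = pad_right(short_review, MAX_LEN)
print(padded.tolist())  # [12, 47, 3, 0, 0]
```

<p>This only works when no sequence is longer than the target length, because <code>np.pad</code> rejects negative pad widths. Keras also provides <code>pad_sequences</code>, which pads at the front by default.</p>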

In [0]:
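# One-hot encode the labels: column 0 = negative, column 1 = positive.
y_train2 = np.zeros((len(y_train), 2), dtype='int')
y_test2 = np.zeros((len(y_test), 2), dtype='int')

for i, label in enumerate(y_train):
    y_train2[i][label] = 1
for i, label in enumerate(y_test):
    y_test2[i][label] = 1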
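
<p>The loop above builds one-hot label vectors. As a sketch of an equivalent vectorised form (an alternative, not what the notebook ran), indexing an identity matrix with the label array gives the same result on toy labels:</p>

```python
import numpy as np

labels = np.array([1, 0, 1, 1])  # toy labels, 1 = positive, 0 = negative

# Loop version, mirroring the cell above.
one_hot_loop = np.zeros((len(labels), 2), dtype="int")
for i, label in enumerate(labels):
    one_hot_loop[i][label] = 1

# Vectorised version: row k of the identity matrix is the one-hot vector for class k.
one_hot_eye = np.eye(2, dtype="int")[labels]

print(one_hot_eye.tolist())  # [[0, 1], [1, 0], [0, 1], [0, 1]]
```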

In [31]:
print((x_train2.shape))

(19371, 200)
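
<p>The row count printed above can be checked against the split settings. The sketch below assumes that <code>train_test_split</code> rounds the test portion up and assigns the remainder to training. The validation count of 9,542 is taken from the training log further down.</p>

```python
import math

n_train_rows = 19371  # from the printed shape above
n_val_rows = 9542     # from the training log ("validate on 9542 samples")
n_total = n_train_rows + n_val_rows

test_size = 0.33
# Assumption: the test portion is rounded up, the rest goes to training.
expected_test = math.ceil(test_size * n_total)
expected_train = n_total - expected_test

print(n_total, expected_train, expected_test)  # 28913 19371 9542
```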


In [0]:
from keras import regularizers,backend
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Embedding, Flatten,Reshape
from keras.layers import Conv1D, GlobalAveragePooling1D, MaxPooling1D, SimpleRNN,LSTM
from keras import optimizers


<h4>Defining the architecture of the model and setting it to train</h4>
<p>The architecture comprises the following layers (sizes as set in the code cell below):
    <ol>
        <li>Embedding layer - 100 dimensions</li>
        <li>LSTM layer      - 100 units</li>
        <li>Dense layer     - 2 nodes (classes)</li>
    </ol>
</p>
<p>Other hyperparameters and settings that were tuned for this network:
    <ul>
        <li>Dropout Rate</li>
        <li>Number of words in vocabulary</li>
        <li>Loss function = binary cross entropy</li>
        <li>optimizer     = Adam optimizer</li>
    </ul>
</p>

In [39]:
EMBEDDING_SIZE = 100
HIDDEN_SIZE = 100
BATCH_SIZE = 32
NUM_EPOCHS =20
vocab_size=5002
MAX_SENTENCE_LENGTH=200
backend.clear_session()


model=Sequential()
model.add(Embedding(vocab_size,
                    EMBEDDING_SIZE,
                    input_length=MAX_SENTENCE_LENGTH,
                   ))

model.add(LSTM(HIDDEN_SIZE))
model.add(Dense(2,activation="softmax"))

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.Adam(lr=0.01),
              metrics=['accuracy'])
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 200, 100)          500200    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 202       
Total params: 580,802
Trainable params: 580,802
Non-trainable params: 0
_________________________________________________________________
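
<p>The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard formulas for each layer type with the sizes from the code cell. The lower-case variable names are introduced here for illustration.</p>

```python
vocab_size = 5002
embedding_size = 100
hidden_size = 100
num_classes = 2

# Embedding: one vector of length embedding_size per vocabulary index.
embedding_params = vocab_size * embedding_size

# LSTM: four weight blocks (input, forget and output gates plus the cell
# candidate), each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embedding_size + hidden_size) * hidden_size + hidden_size)

# Dense: one weight per (hidden unit, class) pair plus one bias per class.
dense_params = hidden_size * num_classes + num_classes

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 500200 80400 202 580802
```

<p>Most of the parameters sit in the embedding table, so the vocabulary size has a larger effect on model size than the LSTM width does at these settings.</p>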


In [40]:
# Train with mini-batches of 128 and monitor the held-out split after each epoch.
model.fit(x_train2, y_train2, epochs=NUM_EPOCHS,
          validation_data=(x_test2, y_test2), batch_size=128)
score, acc = model.evaluate(x_test2, y_test2)
print("Test score: %.3f, accuracy: %.3f" % (score, acc))

Train on 19371 samples, validate on 9542 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Test score: 0.435, accuracy: 0.862


In [68]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 100, 50)           250100    
_________________________________________________________________
lstm_4 (LSTM)                (None, 50)                20200     
_________________________________________________________________
dense_5 (Dense)              (None, 50)                2550      
_________________________________________________________________
dense_6 (Dense)              (None, 2)                 102       
Total params: 272,952
Trainable params: 272,952
Non-trainable params: 0
_________________________________________________________________


<h2>Results and analysis</h2>

<p><b>Test accuracy:</b> 86.28%</p>
<p><b>Evaluation methods</b><br>
Binary cross-entropy on the output classification was used as the loss during training. During validation, the predicted class was compared with the actual class.
</p>
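
<p>A short sketch of what this loss computes for a two-unit softmax output with one-hot targets. It assumes the binary cross-entropy is averaged over the two output units. Under that assumption, and because the two probabilities sum to 1, it gives the same value as categorical cross-entropy. The helper functions are written here for illustration and are not the Keras implementations.</p>

```python
import math


def binary_crossentropy(y_true, y_pred):
    """Mean over output units of -[y*log(p) + (1-y)*log(1-p)]."""
    terms = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
             for y, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)


def categorical_crossentropy(y_true, y_pred):
    """Negative log-probability assigned to the true class."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred))


y_true = [0, 1]        # one-hot target: positive review
y_pred = [0.2, 0.8]    # softmax output, sums to 1

bce = binary_crossentropy(y_true, y_pred)
cce = categorical_crossentropy(y_true, y_pred)
print(round(bce, 4), round(cce, 4))
```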

<p>
    <h3>Observations</h3>
    <ul>
        <li>The RNN was very slow on long sentences. Limiting the sentence length to 200 worked best and did not affect the results.</li>
        <li>The RNN appears to learn from earlier values in the sequence, so it develops a dependency on the preceding data.</li>
        <li>Training was slow at the beginning. Raising the learning rate from 0.001 to 0.01 sped up training considerably.</li>
    </ul>
</p>
<p><b>Experiments with the Architecture</b> 
<ul>
    <li>The hidden size played an important role: using too few neurons hurt performance.</li>
    <li>I compared sigmoid and softmax output activations across several architectures; softmax worked better most of the time.</li>
</ul>
</p>
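
<p>As background for the sigmoid versus softmax comparison: on a two-class problem, a two-unit softmax assigns the first class the same probability as a single sigmoid applied to the difference of the two logits. The sketch below checks this numerically. It is a mathematical note only and does not explain the differences observed in the experiments, which may come from how the output layer and targets were set up in each run.</p>

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


logits = [1.5, -0.5]  # toy logits for the two classes
p_softmax = softmax(logits)[0]
p_sigmoid = sigmoid(logits[0] - logits[1])
print(round(p_softmax, 6), round(p_sigmoid, 6))
```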

<h4>Comparison between CNN and RNN</h4>
<p>According to my readings, an RNN generally works well when the data contains sequential dependencies that the model needs to exploit. In our case the task was a simple classification, and picking out negative or positive words may be enough to classify a review well. In that case a CNN would also work well.</p>
<p>The results in this experiment seem almost equal for the RNN and the CNN. I believe more data and further tuning of the parameters would yield better results.</p>