<h1>Sentiment analysis using CNN</h1>
<p>Copyright : Paritosh Morparia</p>
<p>Indiana University</p>


<h4>Data Source</h4>
<p>The data used here is the <a href="https://keras.io/datasets/">IMDB movie reviews</a> dataset provided by Keras, in which each review is labeled as either positive or negative.</p>
<p>The data can be imported with the function:</p>
<b>keras.datasets.imdb.load_data()</b><br>

<p>The reasons for using this dataset are:</p>
<ul>
    <li>It has 50,000 reviews</li>
    <li>It is easy to use, as each review has already been transformed into an array of numerical values</li>
    <li>Little preprocessing is needed</li>
</ul>

<h5>The results of this experiment are compared with an RNN in the next notebook</h5>


<h4>Fetching the data from Keras</h4>
<ul><li><p>Keras returns the data as NumPy arrays</p></li></ul>
<h4>Padding the data after fetching it</h4>

In [0]:
from keras.datasets import imdb


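# num_words=10000: keep only the 10000 most frequent words
# skip_top=10: ignore the 10 most frequent words
# maxlen=1000: cap the review length at 1000 tokens
(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=10000,
                                                      skip_top=10,
                                                      maxlen=1000)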

In [0]:
import numpy as np

# Zero-filled arrays: one row per review, 1000 token ids per row
x_train2 = np.zeros((25000, 1000), dtype='int')
x_test2 = np.zeros((25000, 1000), dtype='int')

# Right-pad every review with zeros up to a length of 1000
for i, x in enumerate(x_train):
    x_train2[i] = np.pad(x, (0, 1000 - len(x)), "constant")
for i, x in enumerate(x_test):
    x_test2[i] = np.pad(x, (0, 1000 - len(x)), "constant")
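
<p>To make the effect of the padding step easier to see, here is a minimal sketch on two made-up reviews. The helper <code>pad_reviews</code> and the toy token ids are illustrative assumptions and are not part of the notebook's pipeline.</p>

```python
import numpy as np


def pad_reviews(reviews, max_len):
    """Right-pad each list of token ids with zeros up to max_len (hypothetical helper)."""
    padded = np.zeros((len(reviews), max_len), dtype=int)
    for i, review in enumerate(reviews):
        # Copy the tokens to the start of the row; the rest stays zero
        padded[i, :len(review)] = review
    return padded


# Two made-up reviews of different lengths
toy_reviews = [[11, 42, 7], [15, 23, 99, 31, 8]]
toy_padded = pad_reviews(toy_reviews, max_len=6)
print(toy_padded)
```

<p>Every row ends up with the same length, which is what the embedding layer needs as input.</p>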

<h4>Converting the class labels to one-hot form for training</h4>

In [0]:
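# One row per review, one column per class (0 = negative, 1 = positive)
y_train2 = np.zeros((25000, 2), dtype='int')
y_test2 = np.zeros((25000, 2), dtype='int')

# Set the column that matches the class label to 1
for i, x in enumerate(y_train):
    y_train2[i][x] = 1
for i, x in enumerate(y_test):
    y_test2[i][x] = 1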
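
<p>The same one-hot encoding can be sketched without an explicit loop by indexing an identity matrix. This is an alternative formulation shown on made-up labels, not the code used in this notebook; the variable names are illustrative.</p>

```python
import numpy as np

# Made-up binary labels (0 = negative, 1 = positive)
toy_labels = np.array([1, 0, 0, 1])

# Row k of the 2x2 identity matrix is the one-hot vector for class k
toy_one_hot = np.eye(2, dtype=int)[toy_labels]
print(toy_one_hot)

# argmax along the class axis recovers the original labels
recovered = np.argmax(toy_one_hot, axis=1)
```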

In [0]:
from keras import regularizers,backend
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Embedding, Flatten,Reshape
from keras.layers import Conv1D, GlobalAveragePooling1D, MaxPooling1D


<h4>Defining the architecture of the model and setting it to train</h4>
<p>The architecture comprises the following layers
    <ul>
        <li>Embedding layer: embedding vectors of size 30</li>
        <li>3 convolutional layers with 32, 64 and 128 kernels</li>
        <li>Corresponding Pooling layers</li>
        <li>Dense Layer</li>
        <li>Softmax Layer</li>
    </ul>
</p>
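
<p>As a rough check of how the sequence length shrinks through this stack, the sketch below traces the shapes by hand. It assumes the values used in this notebook's model definition (input length 1000, convolutions with "same" padding, pooling with pool size 2 and "valid" padding); the helper <code>max_pool_length</code> is hypothetical.</p>

```python
def max_pool_length(length, pool_size=2):
    """Output length of max pooling with 'valid' padding and stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1


length = 1000  # padded review length
shapes = []
for n_kernels in (32, 64, 128):
    # A convolution with 'same' padding keeps the sequence length unchanged,
    # so only the pooling layer shortens the sequence
    length = max_pool_length(length)
    shapes.append((length, n_kernels))

# Number of values handed to the dense layer after flattening
flattened = shapes[-1][0] * shapes[-1][1]
print(shapes, flattened)
```

<p>Under these assumptions each pooling layer halves the sequence, so the dense layer receives a flattened vector of 125 x 128 values.</p>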
<p>Other hyperparameters that were tweaked for this network are
    <ul>
        <li>Activation = ReLU</li>
        <li>Number of words in vocabulary</li>
        <li>Loss function = mean squared error</li>
        <li>Optimizer = Adam</li>
        <li>Number of Epochs</li>
    </ul>
</p>

In [88]:
EMBEDDING_SIZE = 30
BATCH_SIZE = 16
NUM_EPOCHS = 10
vocab_size=10002
MAX_SENTENCE_LENGTH=1000

backend.clear_session()  # reset any previous Keras state before building the model
model = Sequential()

model.add(Embedding(vocab_size, EMBEDDING_SIZE,input_length=MAX_SENTENCE_LENGTH))

model.add(Conv1D(32, 5, activation='relu',padding="same", input_shape=(MAX_SENTENCE_LENGTH,EMBEDDING_SIZE)))
# model.add(Conv1D(32, 5, activation='relu',padding="same"))
model.add(MaxPooling1D(pool_size=2, strides=None, padding='valid'))

model.add(Conv1D(64, 5, activation='relu',padding="same"))
# model.add(Conv1D(64, 5, activation='relu',padding="valid"))
model.add(MaxPooling1D(pool_size=2, strides=None, padding='valid'))
model.add(Dropout(0.5))

model.add(Conv1D(128, 5, activation='relu',padding="same"))
# model.add(Conv1D(128,kernel_size =5, activation="relu", padding="valid"))
model.add(MaxPooling1D(pool_size=2, strides=None, padding='valid'))

model.add(Flatten())
model.add(Dropout(0.1))
model.add(Dense(50))
model.add(Dense(2,activation="softmax"))

model.compile(loss='mse',
               optimizer='adam',
               metrics=['accuracy'])

model.fit(x_train2, y_train2, epochs=NUM_EPOCHS, validation_data=(x_test2, y_test2))
score, acc = model.evaluate(x_test2, y_test2)
print("Test score: %.3f, accuracy: %.3f" % (score, acc))

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.105, accuracy: 0.873


<h2>Results and analysis</h2>

<p><b>Test accuracy:</b> 87%</p>
<p><b>Evaluation methods</b><br>
The mean squared error between the softmax output and the one-hot label was used as the loss during training and as the reported test score. During validation, the predicted class was compared with the actual class to compute the accuracy.
</p>
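
<p>The two quantities described above can be sketched on a few made-up predictions. The arrays below are illustrative assumptions, not outputs of the trained model.</p>

```python
import numpy as np

# Made-up one-hot labels and softmax outputs for three reviews
y_true = np.array([[0, 1], [1, 0], [0, 1]])
y_prob = np.array([[0.2, 0.8], [0.6, 0.4], [0.7, 0.3]])

# Mean squared error over all outputs (the training loss and test score)
mse = np.mean((y_true - y_prob) ** 2)

# Accuracy: share of reviews whose most probable class matches the label
accuracy = np.mean(np.argmax(y_prob, axis=1) == np.argmax(y_true, axis=1))
print(mse, accuracy)
```

<p>In this toy case the third review is misclassified, so it lowers the accuracy and also contributes the largest share of the squared error.</p>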


<p>
    <h3>Observations</h3>
    <ul>
        <li>The CNN trained quickly, even with a large window size</li>
        <li>On an intuitive level, it seems that each convolution tries to group similar words together within the given window, based on features such as parts of speech, dependencies and named entities.</li>
    </ul>
</p>
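
<p>To illustrate what "window" means here, the following is a minimal NumPy sketch of a single convolution kernel sliding over a short embedded sentence. It assumes one kernel, no bias, zero padding and random made-up values; it is not how Keras implements <code>Conv1D</code> internally.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, kernel_size = 8, 3, 5

# Made-up embedded sentence (one row per word) and one convolution kernel
embedded = rng.normal(size=(seq_len, emb_dim))
kernel = rng.normal(size=(kernel_size, emb_dim))

# Zero-pad both ends so the output keeps the input length ('same' padding)
pad = kernel_size // 2
padded = np.pad(embedded, ((pad, pad), (0, 0)), mode="constant")

# At each position the kernel sees a window of 5 neighbouring words
feature_map = np.array([
    np.sum(padded[t:t + kernel_size] * kernel) for t in range(seq_len)
])
feature_map = np.maximum(feature_map, 0.0)  # ReLU activation
print(feature_map.shape)
```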
<p><b>Experiments with the Architecture</b> 
<ul>
    <li>Tried training the network with more convolutional layers, but it gave the same results. This might indicate that additional convolutional layers do not extract new features that would help classify the data better.</li>
    <li>Compared sigmoid and softmax activations across various architectures; softmax worked better most of the time.</li>
    <li>Experimented with the kernel length; a length of 5 gave the best results in these experiments.</li>
</ul>
</p>

<p>The experiment continues in the RNN implementation.</p>