# Sentiment Analysis Using Pre-Trained Embeddings and a CNN on IMDB Movie Review Data

## Data Set:
The labeled data set consists of 50,000 IMDB movie reviews selected for sentiment analysis. Sentiment is binary: an IMDB rating below 5 gives a sentiment score of 0, and a rating of 7 or higher gives a score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set shares no movies with the 25,000-review test set. A further 50,000 IMDB reviews are provided without rating labels. This notebook loads only the labeled training file and holds out 20% of it as test data.
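The labeling rule above can be sketched as a small function. This is only an illustration of the rule as stated; the name `rating_to_sentiment` is hypothetical, and returning `None` for ratings of 5 and 6 is an assumption, since the description does not say how those ratings are handled.

```python
def rating_to_sentiment(rating):
    """Map an IMDB star rating to a binary sentiment label (illustrative only)."""
    if rating < 5:
        return 0      # negative review
    if rating >= 7:
        return 1      # positive review
    return None       # assumption: ratings 5 and 6 match neither rule


labels = [rating_to_sentiment(r) for r in (1, 4, 5, 6, 7, 10)]
print(labels)
```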

# Import Movie Review Data

Set the NumPy random seed for reproducibility

In [None]:
import numpy as np

In [None]:
np.random.seed(42)

Load the dataset as a pandas DataFrame

In [None]:
import pandas as pd

The data can be downloaded from Kaggle at the following URL:

- https://www.kaggle.com/c/word2vec-nlp-tutorial/data

In [None]:
df = pd.read_csv('labeledTrainData.tsv.zip',header=0, delimiter="\t", quoting=3)

Split Data into Training and Test Data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df['review'],
    df['sentiment'],
    test_size=0.2, 
    random_state=42
)

# Build the Tokenizer

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
top_words = 10000

In [None]:
t = Tokenizer(num_words=top_words)  # num_words -> vocabulary size

In [None]:
t.fit_on_texts(X_train.tolist())
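To make the role of `num_words` concrete, here is a simplified pure-Python sketch of what fitting a tokenizer and converting text to indices roughly involves. It is an approximation, not the Keras implementation: it only lowercases and splits on whitespace, and the helper names `build_word_index` and `to_sequences` are hypothetical. It assumes the Keras convention that index 0 is reserved for padding and that only words with an index below `num_words` are kept.

```python
from collections import Counter


def build_word_index(texts):
    """Assign indices 1, 2, ... to words in order of decreasing frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words outside the vocabulary."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


toy_texts = ["great movie great cast", "great movie", "dull"]
toy_index = build_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index, num_words=3)
print(toy_sequences)
```

With `num_words=3`, only the two most frequent words survive, so rare words simply vanish from the sequences rather than being mapped to a placeholder.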

# Prepare Training and Test Data

Convert each review into a sequence of word indices

In [None]:
X_train = t.texts_to_sequences(X_train.tolist())

In [None]:
X_test = t.texts_to_sequences(X_test.tolist())

How many words in each review?
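One way to answer this is to look at the distribution of sequence lengths before padding. The sketch below uses made-up toy sequences so that it runs on its own; in the notebook, the unpadded `X_train` list would take the place of `toy_sequences`, and the cutoff of 4 stands in for a candidate maximum review length.

```python
import numpy as np

# Toy stand-in for the list of index sequences produced by texts_to_sequences
toy_sequences = [[4, 8, 15], [16, 23], [42, 7, 9, 1, 3, 5]]

lengths = np.array([len(seq) for seq in toy_sequences])
print("min:", lengths.min(), "mean:", lengths.mean(), "max:", lengths.max())

# Share of reviews that fit entirely within a candidate cutoff
share_covered = (lengths <= 4).mean()
print("share of reviews with at most 4 tokens:", share_covered)
```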

# Pad Sequences - Important

In [None]:
from tensorflow.keras.preprocessing import sequence

In [None]:
max_review_length = 300

In [None]:
X_train = sequence.pad_sequences(X_train,maxlen=max_review_length,padding='post')

In [None]:
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length, padding='post')
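The effect of padding with `padding='post'` can be illustrated with a small NumPy sketch. The function `pad_post` is a hypothetical stand-in, not the Keras implementation; it assumes the Keras default `truncating='pre'`, so reviews longer than `maxlen` keep their last `maxlen` tokens while shorter ones get zeros appended.

```python
import numpy as np


def pad_post(sequences, maxlen):
    """Pad with trailing zeros; truncate long sequences from the front."""
    padded = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]             # keep the last maxlen tokens
        padded[row, :len(kept)] = kept   # zeros remain at the tail
    return padded


toy_padded = pad_post([[1, 2], [3, 4, 5, 6, 7]], maxlen=4)
print(toy_padded)
```

Every row ends up with the same length, which is what lets the reviews be stacked into a single array for the model.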

# Build Embedding Matrix from Pre-Trained Word2Vec

Load pre-trained Gensim Embeddings

In [None]:
import gensim

In [None]:
word2vec = gensim.models.Word2Vec.load('word2vec-movie-50')

Embedding Size

In [None]:
embedding_vector_length = word2vec.wv.vector_size

Build matrix for current data

In [None]:
embedding_matrix = np.zeros((top_words + 1, embedding_vector_length))

In [None]:
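# Walk the vocabulary in index order (most frequent words first)
for word, i in sorted(t.word_index.items(), key=lambda x: x[1]):
    if i > top_words:
        break
    # Words missing from the Word2Vec model keep an all-zero row
    if word in word2vec.wv:
        embedding_vector = word2vec.wv[word]
        embedding_matrix[i] = embedding_vector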
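The same lookup logic can be tried on toy data without loading a Word2Vec model. In this sketch a plain dictionary, `toy_vectors`, stands in for the pre-trained vectors, and all names and values are made up for illustration.

```python
import numpy as np

# Hypothetical 2-dimensional "pre-trained" vectors
toy_vectors = {"great": np.array([0.1, 0.2]), "movie": np.array([0.3, 0.4])}
# Tokenizer-style word index; "zzyzx" has no pre-trained vector
toy_word_index = {"great": 1, "movie": 2, "zzyzx": 3}
toy_top_words = 3

# Row 0 is reserved for the padding index
toy_matrix = np.zeros((toy_top_words + 1, 2))
for word, i in toy_word_index.items():
    if i <= toy_top_words and word in toy_vectors:
        toy_matrix[i] = toy_vectors[word]

print(toy_matrix)
```

Row `i` of the matrix holds the vector for the word with index `i`, so the padding row and any out-of-vocabulary word stay at zero.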

# Build the Graph

In [None]:
from tensorflow.keras.models import Sequential

In [None]:
from tensorflow.keras.layers import Dropout, Dense, Embedding, Flatten, Conv1D

In [None]:
model = Sequential()

Add Embedding layer

In [None]:
model.add(Embedding(top_words + 1,
                    embedding_vector_length,
                    input_length=max_review_length,
                    weights=[embedding_matrix],
                    trainable=False))

In [None]:
model.add(Conv1D(64,3,activation='relu'))

In [None]:
model.add(Conv1D(32,3,activation='relu'))

In [None]:
model.add(Dropout(0.5))

In [None]:
model.add(Conv1D(16,3,activation='relu'))

In [None]:
model.add(Flatten())

In [None]:
model.add(Dropout(0.3))

In [None]:
model.add(Dense(1,activation='sigmoid'))

In [None]:
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
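It can help to check by hand how the sequence length shrinks through the convolution stack. The sketch below assumes the Keras `Conv1D` defaults of `padding='valid'` and a stride of 1, which the layers above do not override; the helper name `conv1d_valid_length` is hypothetical.

```python
def conv1d_valid_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution without padding."""
    return (length - kernel_size) // stride + 1


length = 300                       # max_review_length
lengths = []
for _ in range(3):                 # three Conv1D layers, each with kernel size 3
    length = conv1d_valid_length(length, 3)
    lengths.append(length)

flattened = lengths[-1] * 16       # the last Conv1D layer has 16 filters
dense_params = flattened + 1       # one weight per input plus a bias
print(lengths, flattened, dense_params)
```

Each convolution trims two positions, so the `Flatten` layer hands the final `Dense` layer 294 × 16 values per review; `model.summary()` should report the same shapes.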

# Execute the graph

In [None]:
model.fit(X_train, y_train,
          epochs=10,
          batch_size=128,
          validation_data=(X_test, y_test),
          verbose=1)