# Sentiment Analysis Using Embedding and Fully Connected Neural Network Layers on IMDB Movie Review Data

## Data Set:
The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of each review is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.
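
The labeling rule described above can be sketched in a few lines. The function name `rating_to_sentiment` is hypothetical, and returning `None` for ratings of 5 and 6 is an assumption: the description only covers ratings below 5 and ratings of 7 or more.

```python
from typing import Optional


def rating_to_sentiment(rating: int) -> Optional[int]:
    """Map an IMDB star rating to a binary sentiment label.

    Ratings of 5 and 6 fall outside both stated ranges, so this sketch
    returns None for them (an assumption, not part of the data set docs).
    """
    if rating < 5:
        return 0
    if rating >= 7:
        return 1
    return None


labels = [rating_to_sentiment(r) for r in (1, 4, 5, 6, 7, 10)]
print(labels)  # [0, 0, None, None, 1, 1]
```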

# Import Movie Review Data

Set the seed

In [None]:
import numpy as np

In [None]:
np.random.seed(42)

Import the dataset as a pandas DataFrame

In [None]:
import pandas as pd

The data can be downloaded from Kaggle at the following URL:

- https://www.kaggle.com/c/word2vec-nlp-tutorial/data

In [None]:
# quoting=3 is csv.QUOTE_NONE: quote characters inside reviews are kept as plain text
df = pd.read_csv('labeledTrainData.tsv.zip', header=0, delimiter="\t", quoting=3)

In [None]:
df.head()

In [None]:
df.shape

Split Data into Training and Test Data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df['review'],
    df['sentiment'],
    test_size=0.2, 
    random_state=42
)

In [None]:
X_train.shape
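
As a rough check on the shape printed above, the split sizes can be worked out by hand. This is a sketch that assumes the labeled file holds 25,000 rows (the size of the labeled training set described earlier); the variable names are illustrative.

```python
import math

n_reviews = 25_000   # assumed row count of the labeled file
test_size = 0.2      # same fraction passed to train_test_split

# With a fractional test_size, the test count is rounded up
# and the training set receives the remaining rows.
n_test = math.ceil(n_reviews * test_size)
n_train = n_reviews - n_test

print(n_train, n_test)
```

Under that assumption, 20,000 reviews are used for training and 5,000 are held out.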

# Build the Tokenizer

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
top_words = 5000

In [None]:
t = Tokenizer(num_words=top_words)  # num_words -> vocabulary size (only the most frequent words are kept)

In [None]:
t.fit_on_texts(X_train.tolist())

# Prepare Training and Test Data

Get the word index for each word in the review

In [None]:
X_train = t.texts_to_sequences(X_train.tolist())

In [None]:
X_test = t.texts_to_sequences(X_test.tolist())
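
To make the two tokenizer steps above less opaque, here is a simplified pure-Python sketch of the idea: rank words by frequency, then replace each word with its rank and drop anything outside the vocabulary limit. The helper names and toy reviews are made up for illustration; the real Keras `Tokenizer` also strips punctuation and has further options this sketch ignores.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is left unused so it can serve as the padding value later.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words ranked num_words or lower."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


toy_reviews = ["good movie good cast", "bad movie", "good plot"]
word_index = build_word_index(toy_reviews)
sequences = to_sequences(toy_reviews, word_index, num_words=3)

print(word_index)
print(sequences)  # [[1, 2, 1], [2], [1]]
```

Note that rare words simply disappear from the sequences, so a review can become shorter than its original word count.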

In [None]:
len(X_test[1208])

How many words are in each review? The cell above gives the token count for one review; the counts differ from review to review.

# Pad Sequences - Important

In [None]:
from tensorflow.keras.preprocessing import sequence

In [None]:
max_review_length = 300

In [None]:
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length, padding='post')

In [None]:
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length, padding='post')

In [None]:
X_test[1208]
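
What the padding step does to a single review can be sketched in plain Python. The function `pad_post` is a hypothetical stand-in that mimics `padding='post'` together with the default `truncating='pre'` of `pad_sequences`; it is a simplification, not the library code.

```python
def pad_post(seq, maxlen, value=0):
    """Pad short sequences at the end; cut long ones from the front."""
    if len(seq) > maxlen:
        # truncating='pre' (the default): keep only the last maxlen tokens
        return seq[-maxlen:]
    # padding='post': append the fill value until the length is maxlen
    return seq + [value] * (maxlen - len(seq))


print(pad_post([7, 3, 9], maxlen=5))           # [7, 3, 9, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

One consequence worth noticing: because only `padding` is set to `'post'` in the cells above, reviews longer than 300 tokens lose their beginning, not their end.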

# Define Embedding layer configuration

In [None]:
embedding_vector_length = 50  # number of values (embedding dimensions) per word

# Build the Graph

In [None]:
from tensorflow.keras.models import Sequential

In [None]:
from tensorflow.keras.layers import Dropout, Dense, Embedding, Flatten

In [None]:
model = Sequential()

Add Embedding layer

In [None]:
model.add(
    Embedding(top_words + 1,  # vocabulary size
              embedding_vector_length,  # how many numbers represent each word
              input_length=max_review_length)  # how many words per document
)

The output of the Embedding layer is 3-dimensional:
- batch_size x max_review_length x embedding_vector_length

We need to flatten this output before passing it to the Dense layers.
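
The shapes involved can be illustrated with a toy lookup table in plain Python. The sizes below are deliberately tiny and are not the notebook's values; the table holds random numbers only to stand in for learned weights.

```python
import random

random.seed(0)
vocab_size, embed_dim, review_len = 6, 3, 4   # toy sizes for illustration

# Embedding table: one row of embed_dim numbers per word index
table = [[random.random() for _ in range(embed_dim)] for _ in range(vocab_size)]

batch = [[1, 4, 2, 0], [5, 5, 3, 0]]          # 2 padded reviews of word indices

# Embedding: replace every index by its row -> batch x review_len x embed_dim
embedded = [[table[idx] for idx in review] for review in batch]

# Flatten: concatenate the word vectors -> batch x (review_len * embed_dim)
flattened = [[x for vec in review for x in vec] for review in embedded]

print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 2 4 3
print(len(flattened), len(flattened[0]))                     # 2 12
```

With the notebook's settings, the same reasoning gives 300 x 50 = 15,000 values per review after flattening.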

In [None]:
model.add(Flatten())

In [None]:
model.add(Dense(200,activation='relu'))

In [None]:
model.add(Dense(100,activation='relu'))

In [None]:
model.add(Dense(60,activation='relu'))

In [None]:
model.add(Dense(30,activation='relu'))

In [None]:
model.add(Dense(1,activation='sigmoid'))
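
For a sense of the model's size, the trainable parameters of the stack above can be counted by hand. This sketch assumes the settings used in this notebook (vocabulary of 5000 + 1 rows, 50-dimensional embeddings, 300-word reviews); calling `model.summary()` should report comparable numbers.

```python
vocab_rows = 5000 + 1          # top_words + 1
embed_dim = 50                 # embedding_vector_length
review_len = 300               # max_review_length
dense_units = [200, 100, 60, 30, 1]

params = [("embedding", vocab_rows * embed_dim)]

inputs = review_len * embed_dim          # feature count after Flatten
for units in dense_units:
    # each Dense layer has inputs * units weights plus one bias per unit
    params.append((f"dense_{units}", inputs * units + units))
    inputs = units

total = sum(count for _, count in params)
for name, count in params:
    print(f"{name:>10}: {count:,}")
print(f"{'total':>10}: {total:,}")
```

Most of the roughly 3.3 million parameters sit in the first Dense layer, because it connects all 15,000 flattened values to 200 units.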

In [None]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
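
The `binary_crossentropy` loss chosen above penalizes confident wrong predictions much more than confident right ones. A minimal pure-Python sketch of the formula follows; the clipping constant `eps` is an assumption made here to avoid `log(0)`, and the probabilities are made up.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a list of labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


good = binary_crossentropy([1, 0], [0.9, 0.1])   # confident and correct
bad = binary_crossentropy([1, 0], [0.1, 0.9])    # confident and wrong
print(round(good, 4), round(bad, 4))
```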

# Execute the graph

In [None]:
# Train for one epoch; the held-out test split doubles as validation data
model.fit(X_train, y_train,
          epochs=1,
          batch_size=128,
          shuffle=True,
          validation_data=(X_test, y_test))

In [None]:
model.predict(X_test[0:2])
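
The sigmoid output layer returns one probability per review, not a label. A common way to turn these into 0/1 sentiment labels is a 0.5 threshold, sketched below. The helper `to_labels` and the probabilities are made up for illustration, and the real `predict` output is a 2-D array that would need flattening first.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn sigmoid outputs into 0/1 sentiment labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


example_probs = [0.91, 0.08]      # hypothetical model outputs
print(to_labels(example_probs))   # [1, 0]
```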