# Sentiment Analysis using TF-IDF and Fully Connected Neural Network Layers on IMDB Movie Review Data

## Data Set:
The labeled data set consists of 50,000 IMDB movie reviews selected for sentiment analysis. Sentiment is binary: reviews with an IMDB rating < 5 have a sentiment score of 0, and reviews with a rating >= 7 have a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the movies that appear in the 25,000-review test set. In addition, another 50,000 IMDB reviews are provided without any rating labels.
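
The labeling rule described above can be sketched as a small function. The name `rating_to_sentiment` is hypothetical and not part of the data set's tooling. Returning `None` for ratings of 5 and 6 is an assumption, since the description only assigns labels to ratings below 5 and at or above 7.

```python
from typing import Optional


def rating_to_sentiment(rating: int) -> Optional[int]:
    """Map an IMDB rating (1-10) to a binary sentiment label."""
    if rating < 5:
        return 0  # negative review
    if rating >= 7:
        return 1  # positive review
    # Assumption: ratings 5 and 6 get no label under the rule above.
    return None


labels = [rating_to_sentiment(r) for r in (1, 4, 5, 6, 7, 10)]
print(labels)
```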

Set the NumPy random seed for reproducibility

In [None]:
import numpy as np

In [None]:
np.random.seed(42)
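
As a quick check of what seeding provides, the sketch below reseeds NumPy and confirms that the same random draws come back. Note that `np.random.seed` only affects NumPy's global generator. TensorFlow keeps its own random state, so weight initialization and dropout may still vary between runs unless that state is seeded separately.

```python
import numpy as np

np.random.seed(42)
first_draw = np.random.rand(3)

np.random.seed(42)  # reseeding restarts the same pseudo-random sequence
second_draw = np.random.rand(3)

same = np.array_equal(first_draw, second_draw)
print(same)
```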

Import pandas to load the dataset as a DataFrame

In [None]:
import pandas as pd

The data can be downloaded from Kaggle at the following URL:

- https://www.kaggle.com/c/word2vec-nlp-tutorial/data

In [None]:
# quoting=3 is csv.QUOTE_NONE: double quotes are kept as literal characters
df = pd.read_csv('labeledTrainData.tsv.zip', header=0, delimiter="\t", quoting=3)
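
The `quoting=3` argument corresponds to `csv.QUOTE_NONE` in Python's standard `csv` module, which tells the parser to treat double quotes as ordinary characters. That is why quote characters remain part of the `id` and `review` values. The sketch below illustrates the effect with the standard library on a made-up two-line sample (the review text is hypothetical); it is meant to show the behavior, not to replace `pd.read_csv`.

```python
import csv
import io

# Hypothetical tab-separated sample in the same layout as the Kaggle file.
sample = 'id\tsentiment\treview\n"5814_8"\t1\t"An example review."\n'

reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
header, first_row = list(reader)

print(csv.QUOTE_NONE)  # the integer constant passed as quoting=3
print(first_row[0])    # quotes are preserved as part of the value
```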

In [22]:
df.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
| 3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
| 4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |


In [21]:
df.shape

(25000, 3)

Split the data into training and test sets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df['review'],
    df['sentiment'],
    test_size=0.2, 
    random_state=42
)
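
With `test_size=0.2`, the 25,000 labeled reviews are divided into 20,000 training rows and 5,000 test rows. The sketch below shows the general idea of a shuffled split using only the standard library. `simple_split` is a hypothetical helper, not scikit-learn's algorithm, so it will not select the same rows as `train_test_split` even with the same seed.

```python
import random


def simple_split(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices and cut them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = simple_split(25000)
print(len(train_idx), len(test_idx))
```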

In [None]:
type(X_train)

# Build the Tokenizer

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer  # public Keras path instead of the private tensorflow.python one

In [None]:
top_words = 5000

In [None]:
t = Tokenizer(num_words=top_words)  # num_words -> vocabulary size

In [None]:
t.fit_on_texts(X_train.tolist())
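
Roughly, fitting the tokenizer counts words across the training texts and ranks them by frequency, so that `num_words` can later keep only the most common ones. The sketch below is a simplified stand-in, not the Keras implementation: `build_word_index` is a hypothetical helper, and it skips the punctuation filtering that the real `Tokenizer` applies by default.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: rank for rank, word in enumerate(ranked, start=1)}


# Made-up sample texts, for illustration only.
sample_texts = ["the movie was great", "the movie was bad", "the plot"]
word_index = build_word_index(sample_texts)
print(word_index)
```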

Convert the training and test reviews into TF-IDF matrices

In [15]:
X_train = t.texts_to_matrix(X_train.tolist(),mode='tfidf')

In [16]:
X_train.shape

(20000, 5000)

In [17]:
X_test = t.texts_to_matrix(X_test.tolist(),mode='tfidf')

In [18]:
X_test.shape

(5000, 5000)
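
Each row of these matrices represents one review, and each column holds the TF-IDF weight of one vocabulary word. For intuition about the values, the sketch below computes a single weight. The formula, `(1 + ln(count)) * ln(1 + n_docs / (1 + doc_freq))`, is an assumption about how `mode='tfidf'` weights a term and should not be read as a specification; `tfidf_weight` is a hypothetical helper.

```python
import math


def tfidf_weight(count, doc_freq, n_docs):
    """TF-IDF weight for one word in one document (assumed formula)."""
    if count == 0:
        return 0.0  # words absent from the document get zero weight
    tf = 1 + math.log(count)                      # dampened term frequency
    idf = math.log(1 + n_docs / (1 + doc_freq))   # rarer words score higher
    return tf * idf


# Same in-document count, but very different document frequencies.
rare = tfidf_weight(count=2, doc_freq=10, n_docs=20000)
common = tfidf_weight(count=2, doc_freq=15000, n_docs=20000)
print(rare > common)
```

Under this weighting, a word that appears in few reviews contributes more to a row than a word that appears in most of them.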

# Build the Graph

In [23]:
from tensorflow.keras.models import Sequential  # public Keras path

In [24]:
from tensorflow.keras.layers import Dropout, Dense  # public Keras path

In [25]:
model = Sequential()

In [26]:
model.add(Dense(200,activation='relu',input_shape=(5000,)))

In [27]:
model.add(Dense(100,activation='relu'))

In [28]:
model.add(Dropout(0.4))
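
During training, `Dropout(0.4)` zeroes each incoming activation with probability 0.4 and scales the survivors by `1 / (1 - 0.4)` so that the expected sum is unchanged; at inference time the layer passes its input through untouched. A minimal NumPy sketch of the training-time behavior, with `dropout_train` as a hypothetical helper:

```python
import numpy as np


def dropout_train(x, rate, rng):
    """Zero each unit with probability `rate`, then rescale the survivors."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)


rng = np.random.default_rng(42)
activations = np.ones((4, 100))  # stand-in for the output of Dense(100)
dropped = dropout_train(activations, rate=0.4, rng=rng)
print(dropped.shape)
```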

In [29]:
model.add(Dense(60,activation='relu'))

In [30]:
model.add(Dropout(0.4))

In [31]:
model.add(Dense(30,activation='relu'))

In [32]:
model.add(Dropout(0.25))

In [33]:
model.add(Dense(1,activation='sigmoid'))
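
The stack above can be sanity-checked by counting trainable parameters by hand. Each `Dense` layer has one weight per input-output pair plus one bias per output unit, and `Dropout` layers add none. The total computed below should agree with what `model.summary()` reports.

```python
layer_sizes = [5000, 200, 100, 60, 30, 1]  # input width, then each Dense layer

params_per_layer = [
    n_in * n_out + n_out  # weight matrix plus one bias per output unit
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(params_per_layer)
print(params_per_layer, total_params)
```

Nearly all of the parameters sit in the first layer, which maps the 5,000-dimensional TF-IDF vector to 200 units.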

In [34]:
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
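
The `binary_crossentropy` loss penalizes the sigmoid output `p` according to how far it is from the true label `y`, averaging `-(y * ln(p) + (1 - y) * ln(1 - p))` over the batch. The sketch below is a plain-Python illustration of that formula; `binary_crossentropy` here is a hypothetical helper, and the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over paired labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right < confident_wrong)
```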

# Execute the Graph

In [35]:
model.fit(X_train,y_train,epochs=3,batch_size=128,validation_data=(X_test, y_test))

Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras._impl.keras.callbacks.History at 0x1041ff28>
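
With 20,000 training samples and `batch_size=128`, each epoch is split into mini-batches, and the optimizer updates the weights once per batch. The arithmetic below works out how many updates the three epochs perform.

```python
import math

n_train, batch_size, epochs = 20000, 128, 3

steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch is smaller
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, last_batch_size, total_updates)
```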