# **Overview**

This project is based on [Kaggle 50K IMBD Movie Review Dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).(Just 50% of it)

This notebook will guide you through the process of Pre-Precessing, Model designing and Prediction.

# **Imports**

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional, Dropout, SimpleRNN
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
import regex as re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# **Reading Data**

In [3]:
dataCSV = pd.read_csv("sample_data/IMDB Dataset.csv")

# **Sample Data**

In [4]:
dataCSV.head(15)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


# **Assigning Binary value to sentiment**

In [5]:
dataCSV['sentiment'] = np.where(dataCSV['sentiment'] == 'positive', 1, 0)

In [6]:
#Converting to numpy arrays
Sentences = np.array(dataCSV['review'])
Sentiments = np.array(dataCSV['sentiment'])

# **Cleaning Data**

In [7]:
# Removing noise from text_data
print("Data before")
print(Sentences)
url_pattern = re.compile(r'https?://\S+|www\.\S+')
for i in range(len(Sentences)):
  #Removing URls
  Sentences[i] = url_pattern.sub(r'', Sentences[i])
  #Removing emails
  Sentences[i] = re.sub('\S*@\S*\s?', '', Sentences[i])
  #Removing single quotes
  Sentences[i] = re.sub("\'", "", Sentences[i])
  #Removing double quotes
  Sentences[i] = re.sub('\"', '', Sentences[i])
  #Removing <br/> tags
  Sentences[i] = re.sub('<br />', '', Sentences[i])
  #Removing Punctuations
  Sentences[i] = Sentences[i].translate(str.maketrans('', '', string.punctuation))
  #All to lower
  Sentences[i] = Sentences[i].lower()
  #remove stop words
  stop_words = set(stopwords.words('english'))
  word_token = word_tokenize(Sentences[i])
  Filtered_Words=[]
  for w in word_token:
    if w not in stop_words:
      Filtered_Words.append(w)
  Sentences[i] = ""
  for j in range(len(Filtered_Words)):
    Sentences[i] = Sentences[i] + " " + Filtered_Words[j]
Sentences = np.array(Sentences)
print("Data After")
print(Sentences)

Data before
["One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is 

# **Tokenizing Data**
We will use keras Tokenizer as well as pad_sequences method to transform data to 3D float data so that our neural network can understand it

In [8]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(Sentences)
Seq_Token = tokenizer.texts_to_sequences(Sentences)
Max_Seq_Length = 200
Padded_Train = pad_sequences(Seq_Token, maxlen = Max_Seq_Length, padding='post', truncating='post')
print(Padded_Train)

[[     3   1735    878 ...      0      0      0]
 [   267     35    244 ...      0      0      0]
 [    97    267     25 ...      0      0      0]
 ...
 [   204   1952   2235 ...      0      0      0]
 [  3393  16004   2426 ...     79    893  14212]
 [143765    337  11020 ...      0      0      0]]


In [11]:
#Spliting the Data
X_train, X_test, y_train, y_test = train_test_split(Padded_Train, Sentiments, test_size=0.25, random_state=1)
print(len(X_train), len(X_test), len(y_train), len(y_test))

18749 6250 18749 6250


# **Building our Model**
We will use Bidirection LSTM, which generates great results with talking about text classification
Our Sequencial Model will consists of:
1. Embedding Layer: batch-size=128
2. Bidirection LSTM layer with 128 units.
3. Next are three dense layers with ReLU activation function. First and second with 64 units and third one with 16 units.
4. Model also has dropout layers, which will prevent overfitting
5. Last one is the output layer with is a Dense layer with single neuron.

In [9]:
#Model
total_words = len(tokenizer.word_index) + 1
model = Sequential()
model.add(Embedding(total_words, 128, input_length = Max_Seq_Length))
model.add(Bidirectional(LSTM(128)))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 200, 128)          18402176  
                                                                 
 bidirectional (Bidirection  (None, 256)               263168    
 al)                                                             
                                                                 
 dense (Dense)               (None, 64)                16448     
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense_1 (Dense)             (None, 64)                4160      
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                        

In [12]:
history = model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


# **Prediction**

In [77]:
#Prediction
sentences = ['this model is doing great', 'It still has many mistakes', 'But yes it is better.',
             'As it works on reviews so don\'t have some words data', 'But it is kind of good',
             'Now lets start with examples', 'Ah! you are worse', 'Heyy!! you are awsome',
             'I Love you', 'I hate you', 'You are hurting me', 'It\'s just a time-waste',
             'You are Pathetic', 'Want to hang out?', 'You are fabulous', 'You are crazy!',
             'Wow!!', 'Alas!!', 'Get Lost!!', 'Fuck Off!!', 'So Cheap Ahan!!', 'You piece of a crap',
             'Son of a bitch', 'You are Amazimg', 'This movie is Fantastic',
             'You know I Love the way you lie']
for sentence in sentences:
  print(sentence)
  sentence = sentence.lower()
  L_sentence = []
  L_sentence = tokenizer.texts_to_sequences([sentence])
  L_sentence_padded = pad_sequences(L_sentence, maxlen=Max_Seq_Length, padding='post')
  print(model.predict(L_sentence_padded))
  if model.predict(L_sentence_padded) > 0.6:
    print("Positive")
  else:
    print("Negative")

this model is doing great
[[0.9314127]]
Positive
It still has many mistakes
[[0.90869665]]
Positive
But yes it is better.
[[0.95071554]]
Positive
As it works on reviews so don't have some words data
[[0.37620524]]
Negative
But it is kind of good
[[0.8381719]]
Positive
Now lets start with examples
[[0.04394163]]
Negative
Ah! you are worse
[[0.7092912]]
Positive
Heyy!! you are awsome
[[0.8792669]]
Positive
I Love you
[[0.8314086]]
Positive
I hate you
[[0.93996125]]
Positive
You are hurting me
[[0.95923877]]
Positive
It's just a time-waste
[[0.01551666]]
Negative
You are Pathetic
[[0.5961972]]
Negative
Want to hang out?
[[0.84414816]]
Positive
You are fabulous
[[0.8813083]]
Positive
You are crazy!
[[0.96164817]]
Positive
Wow!!
[[0.8684397]]
Positive
Alas!!
[[0.18371826]]
Negative
Get Lost!!
[[0.58556294]]
Negative
Fuck Off!!
[[0.9115932]]
Positive
So Cheap Ahan!!
[[0.11489215]]
Negative
You piece of a crap
[[0.10558267]]
Negative
Son of a bitch
[[0.1069122]]
Negative
You are Amazimg
[[0.9