In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Mounted Google Drive in Colab so the dataset can be read from it.

In [2]:
import pandas as pd
import numpy as np

In [3]:
movie_data = pd.read_csv("/content/drive/MyDrive/a1_IMDB_Dataset.csv")

In [4]:
movie_data.head(3)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive


In [5]:
movie_data.info  # no parentheses, so this echoes the bound method; call movie_data.info() for the column summary

<bound method DataFrame.info of                                                   review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]>

In [6]:
movie_data.shape

(50000, 2)

So there are 50,000 rows, one per review, and 2 columns: review and sentiment.

In [7]:
movie_data.isnull().values.any()

False

No missing values were found.

In [8]:
movie_data.columns

Index(['review', 'sentiment'], dtype='object')

In [9]:
movie_data["sentiment"].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

So there are 25,000 data points for each class; hence the data is balanced.

In [10]:
movie_data.duplicated().values.any()

True

In [11]:
movie_data.drop_duplicates(inplace=True)

In [12]:
movie_data.duplicated().values.any()

False

Removed the duplicate rows.
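The duplicate check and removal can be illustrated on a toy frame. This is only a sketch: the four rows below are made up and stand in for movie_data. It also shows a side effect worth keeping in mind: drop_duplicates keeps the original index labels, so the index has gaps afterwards, and the per-class counts may no longer be exactly equal.

```python
import pandas as pd

# Toy frame standing in for movie_data (these rows are invented for illustration).
toy = pd.DataFrame(
    {
        "review": ["great film", "awful plot", "great film", "fine acting"],
        "sentiment": ["positive", "negative", "positive", "positive"],
    }
)

# duplicated() marks a row True when an identical row appeared earlier.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence and the original index labels,
# so label 2 is simply missing afterwards.
deduped = toy.drop_duplicates()

# reset_index(drop=True) renumbers the remaining rows 0..n-1.
renumbered = deduped.reset_index(drop=True)

print(n_duplicates)
print(list(deduped.index))
print(list(renumbered.index))
```

Whether to renumber is a judgment call; the notebook keeps the original labels, which is why movie_data["review"][0] still refers to the first review.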

In [13]:
movie_data["review"][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

A sample review from the dataset shows HTML tags (such as <br />) and apostrophes. These carry little semantic meaning for the algorithm to learn from, so it is better to remove them.

In [14]:
import os
import re

In [16]:
html_tag_remover = re.compile(r'<[^>]+>')

def remove_tags(text):
  return html_tag_remover.sub('', text)

The remove_tags function matches anything that opens with < and closes with > and replaces it with an empty string.
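A quick check of the pattern on a short string, as a sketch. The second helper, remove_tags_keep_gap, is hypothetical and not part of the notebook; it shows what changes if a space is substituted instead, which matters when a tag is the only thing separating two words.

```python
import re

html_tag_remover = re.compile(r'<[^>]+>')

def remove_tags(text):
    # Replace every <...> span with the empty string (same as the notebook).
    return html_tag_remover.sub('', text)

def remove_tags_keep_gap(text):
    # Hypothetical variant: substitute a space so neighbouring words stay apart.
    return html_tag_remover.sub(' ', text)

sample = "what happened with me.<br /><br />The first thing"
cleaned = remove_tags(sample)
print(cleaned)

glued = remove_tags("good<br />bad")
spaced = remove_tags_keep_gap("good<br />bad")
print(glued)
print(spaced)
```

In the reviews a tag usually follows punctuation, which a later step turns into a space, so the empty-string version works in most cases.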

In [17]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [18]:
from nltk.corpus import stopwords

In [19]:
# Build the stop word pattern once instead of on every call
stopword_pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')

def preprocess_text(sen):
  # Convert all text to lower case
  sentence = sen.lower()

  # Remove HTML tags
  sentence = remove_tags(sentence)

  # Replace punctuation and digits with spaces (only letters are kept)
  sentence = re.sub('[^a-zA-Z]', ' ', sentence)

  # Remove single-letter words, such as the "s" left behind once an apostrophe is gone
  sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

  # Collapse runs of whitespace into a single space
  sentence = re.sub(r'\s+', ' ', sentence)

  # Remove stop words
  sentence = stopword_pattern.sub('', sentence)

  return sentence

In [20]:
sentences = list(movie_data['review'])
Preprocessed_data = [preprocess_text(sentence) for sentence in sentences]

Finally, applied the preprocessing function to every review in the dataset.
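To see what each preprocessing step contributes, the same regular expressions can be traced on one short, made-up sentence. This is a sketch under two assumptions: the sample text is invented, and STOP_WORDS is a tiny stand-in list so the example runs without downloading NLTK's English stop words.

```python
import re

# Tiny stand-in stop word list (assumption; the notebook uses NLTK's English list).
STOP_WORDS = ["the", "was", "it"]
stopword_pattern = re.compile(r'\b(' + '|'.join(STOP_WORDS) + r')\b\s*')

def trace_preprocess(text):
    """Apply the notebook's steps in order and record the text after each one."""
    steps = {}
    s = text.lower()
    steps["lower"] = s
    s = re.sub(r'<[^>]+>', '', s)              # strip HTML tags
    steps["tags"] = s
    s = re.sub('[^a-zA-Z]', ' ', s)            # non-letters become spaces
    steps["letters"] = s
    s = re.sub(r"\s+[a-zA-Z]\s+", ' ', s)      # drop single-letter words
    steps["single_letters"] = s
    s = re.sub(r'\s+', ' ', s)                 # collapse whitespace
    steps["spaces"] = s
    s = stopword_pattern.sub('', s)            # drop stop words
    steps["stopwords"] = s
    return steps

steps = trace_preprocess("The movie's plot was GREAT!<br /><br />Loved it 10/10.")
for name, value in steps.items():
    print(f"{name:>15}: {value!r}")
```

The trace shows that the result can end with a trailing space, because the last stop word is removed together with the whitespace that follows it. This is harmless for the tokenizer, which splits on whitespace.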

In [22]:
Preprocessed_data[0]

'one reviewers mentioned watching oz episode hooked right exactly happened first thing struck oz brutality unflinching scenes violence set right word go trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use word called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home many aryans muslims gangstas latinos christians italians irish scuffles death stares dodgy dealings shady agreements never far away would say main appeal show due fact goes shows dare forget pretty pictures painted mainstream audiences forget charm forget romance oz mess around first episode ever saw struck nasty surreal say ready watched developed taste oz got accustomed high levels graphic violence violence injustice crooked guards sold nickel inmates kill order get away well mannered middle class inmates turned prison bitches due lack street skill

In [23]:
y = movie_data['sentiment']

labels = np.array(list(map(lambda x: 1 if x=="positive" else 0, y)))

Converted the class labels to integers: 1 for positive and 0 for negative.
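One property of the lambda above is that anything other than the exact string "positive" becomes 0, including a misspelled or differently cased label. A minimal alternative sketch uses an explicit mapping that fails loudly instead; LABEL_TO_ID and encode_labels are illustrative names, not part of the notebook.

```python
# Explicit mapping from sentiment string to class id (illustrative names).
LABEL_TO_ID = {"negative": 0, "positive": 1}

def encode_labels(sentiments):
    # A label missing from the mapping raises KeyError rather than silently becoming 0.
    return [LABEL_TO_ID[s] for s in sentiments]

encoded = encode_labels(["positive", "negative", "positive"])
print(encoded)
```

For this dataset both approaches give the same result, since value_counts showed only the two expected labels.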

In [24]:
from sklearn.model_selection import train_test_split

In [25]:
X_train, X_test, y_train, y_test = train_test_split(Preprocessed_data, labels, test_size=0.20, random_state=42)

Finally, split the data into train and test sets, with 80% of the data going to the train set and the rest to the test set.
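Conceptually, the split shuffles the row positions once with a fixed seed and cuts them into two parts, keeping each review paired with its label. The sketch below illustrates that idea with the standard library only. simple_train_test_split is a hypothetical helper; it does not reproduce scikit-learn's exact shuffling, so it selects different rows than the call above.

```python
import math
import random

def simple_train_test_split(features, targets, test_size=0.20, seed=42):
    """Shuffle row positions once, then cut them into a test part and a train part."""
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    positions = list(range(len(features)))
    random.Random(seed).shuffle(positions)          # fixed seed makes the split repeatable
    n_test = math.ceil(len(features) * test_size)   # this sketch rounds the test size up
    test_pos, train_pos = positions[:n_test], positions[n_test:]

    def pick(items, where):
        return [items[i] for i in where]

    return (pick(features, train_pos), pick(features, test_pos),
            pick(targets, train_pos), pick(targets, test_pos))

texts = [f"review {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
X_tr, X_te, y_tr, y_te = simple_train_test_split(texts, toy_labels)
print(len(X_tr), len(X_te))
```

Fixing the seed is what makes the experiment reproducible: rerunning with the same seed returns the same rows in both parts.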

In [39]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [28]:
word_tokenizer = Tokenizer()
word_tokenizer.fit_on_texts(X_train)

In [29]:
X_train = word_tokenizer.texts_to_sequences(X_train)
X_test = word_tokenizer.texts_to_sequences(X_test)

In [31]:
print(X_train[:10])

[[11, 308, 1, 27829, 70, 7539, 55, 4, 425, 225, 197, 199, 4443, 35496, 464, 1, 14, 4, 1140, 5983, 1024, 7691, 313, 737, 287, 7, 39, 205, 48, 1292, 248, 106, 2588, 1, 4, 115, 61, 1769, 159, 45666, 61, 1822, 884, 17, 237, 23, 1114, 27829, 70, 7539, 76, 61, 884, 35496, 5675, 1, 231, 231, 97], [715, 28, 1, 3056, 6086, 1, 42, 693, 345, 143, 1, 545, 523, 434, 131, 270, 2975, 605, 1, 66, 254, 569, 73, 6, 1746, 434, 809, 466, 6755, 1, 8, 95, 865, 46, 1430, 64, 203, 12, 4, 125, 46, 1879, 1125, 10, 56530, 4313, 12979, 52, 10, 1131, 5230, 321, 1, 11, 1077, 66, 70, 80, 382, 83, 2976, 4060, 2517], [141, 48, 69, 624, 5984, 2, 22, 103, 983, 14807, 5331, 2650, 2349, 479, 2, 111, 2962, 169, 613, 724, 5942, 3350, 5298, 2476, 36, 2527, 20, 624, 5984, 401, 1061, 199, 3711, 5265, 1244, 15178, 12476, 3711, 1242, 1307, 52, 608, 1, 3, 175, 34, 3226, 2223, 5985, 356, 7273, 606, 37, 205, 5265, 1254, 220, 4015, 959, 472, 12, 415, 1613, 5038, 1573, 15576, 6251, 13512, 7201, 784, 6203, 7273, 63, 10300, 7273, 1267,

So we have split each text into individual tokens and converted them to integer indices.
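The idea behind the two Tokenizer calls can be sketched in plain Python: count the words in the training texts, number them from 1 in order of decreasing frequency, then replace each word by its number. This is a simplified illustration with made-up texts, not the Keras implementation (which also lowercases and filters punctuation). fit_word_index and texts_to_ids are hypothetical names.

```python
from collections import Counter

def fit_word_index(texts):
    # Count every whitespace-separated word across the training texts.
    counts = Counter(word for text in texts for word in text.split())
    # Most frequent word gets index 1; index 0 is left unused for padding.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    # Words never seen during fitting are dropped, as happens when no OOV token is set.
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

train_texts = ["good movie", "bad movie", "good good plot"]
word_index = fit_word_index(train_texts)
print(word_index)

test_ids = texts_to_ids(["bad plot twist"], word_index)
print(test_ids)
```

Fitting on the training texts only, as the notebook does, means words that appear only in the test set are silently skipped when the test set is converted.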

In [35]:
word_tokenizer.word_index

{'movie': 1,
 'film': 2,
 'one': 3,
 'like': 4,
 'good': 5,
 'time': 6,
 'even': 7,
 'would': 8,
 'story': 9,
 'see': 10,
 'really': 11,
 'well': 12,
 'much': 13,
 'bad': 14,
 'get': 15,
 'great': 16,
 'people': 17,
 'also': 18,
 'first': 19,
 'made': 20,
 'make': 21,
 'could': 22,
 'way': 23,
 'movies': 24,
 'think': 25,
 'characters': 26,
 'character': 27,
 'watch': 28,
 'films': 29,
 'two': 30,
 'many': 31,
 'seen': 32,
 'never': 33,
 'plot': 34,
 'life': 35,
 'acting': 36,
 'love': 37,
 'best': 38,
 'know': 39,
 'show': 40,
 'little': 41,
 'ever': 42,
 'man': 43,
 'better': 44,
 'end': 45,
 'scene': 46,
 'still': 47,
 'say': 48,
 'scenes': 49,
 'something': 50,
 'go': 51,
 'back': 52,
 'real': 53,
 'thing': 54,
 'watching': 55,
 'actors': 56,
 'years': 57,
 'director': 58,
 'though': 59,
 'old': 60,
 'funny': 61,
 'another': 62,
 'work': 63,
 'actually': 64,
 'makes': 65,
 'nothing': 66,
 'look': 67,
 'find': 68,
 'going': 69,
 'new': 70,
 'lot': 71,
 'part': 72,
 'every': 73,
 'wo

In [36]:
vocab_length = len(word_tokenizer.word_index) + 1
print(vocab_length)

92211


This is the number of distinct words in the training vocabulary plus one. The Tokenizer numbers words starting from 1 and reserves index 0 (used for padding), so the largest word index equals len(word_index). The +1 makes room for index 0, so that a lookup table with vocab_length rows, such as an embedding layer, covers every index from 0 up to the largest one.
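A small sketch of why the extra slot is needed, using the first three entries of the word index shown earlier and a plain list of lists as a stand-in for an embedding table. The embedding_dim value is an arbitrary assumption.

```python
# First three entries of the word index printed above.
word_index = {"movie": 1, "film": 2, "one": 3}

vocab_length = len(word_index) + 1   # 3 words + 1 slot for index 0
embedding_dim = 2                    # arbitrary size, for illustration only

# One row per possible index: row 0 is the padding row, rows 1..3 are words.
lookup_table = [[0.0] * embedding_dim for _ in range(vocab_length)]

largest_index = max(word_index.values())
print(vocab_length, largest_index)

# The largest index is a valid row only because of the +1.
row_for_last_word = lookup_table[largest_index]

# With only len(word_index) rows, the same lookup would fall off the end.
too_small = [[0.0] * embedding_dim for _ in range(len(word_index))]
fits_without_plus_one = largest_index < len(too_small)
print(fits_without_plus_one)
```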

In [37]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [38]:
X_train = pad_sequences(X_train, padding='post', maxlen=100)
X_test = pad_sequences(X_test, padding='post', maxlen=100)

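A neural network model requires fixed-length inputs, so pad_sequences is used to ensure that all input sequences have the same length. With padding='post', it appends zeros to shorter sequences until the length is 100. Sequences longer than 100 are truncated to 100 tokens, by default from the front.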
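The behaviour described above can be mirrored in a few lines of plain Python. This is an illustrative sketch, not the Keras implementation: pad_post is a hypothetical helper that assumes a positive maxlen, appends zeros at the end (padding='post'), and keeps the last maxlen tokens of longer sequences (the default truncating='pre').

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end and truncate from the front; maxlen must be positive."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                              # keep the last maxlen tokens
        padded.append(kept + [value] * (maxlen - len(kept)))    # fill the rest with zeros
    return padded

toy_sequences = [[1, 2], [1, 2, 3, 4, 5, 6], [7, 8, 9, 10]]
padded = pad_post(toy_sequences, maxlen=4)
for row in padded:
    print(row)
```

Because truncation defaults to the front, long reviews lose their opening words rather than their ending. pad_sequences also accepts truncating='post' if the start of a review should be kept instead.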