<a href="https://colab.research.google.com/github/neonithinar/ML_and_DL_learning_materials_and_tryouts/blob/master/Sentiment_analysis_IMDb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Sentiment analysis using IMDb

In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
from collections import Counter

In [2]:
tf.random.set_seed(42)

# Keras ships IMDb already preprocessed: each review is a list of integer word IDs
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [3]:
X_train[0][:10]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

In [4]:
word_index = keras.datasets.imdb.get_word_index()
id_to_word = {id_ + 3: word for word, id_ in word_index.items()}
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
  id_to_word[id_] = token
" ".join([id_to_word[id_] for id_ in X_train[0][:10]])


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


'<sos> this film was just brilliant casting location scenery story'
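
The `+ 3` offset in the cell above is there because IDs 0, 1 and 2 are reserved for the padding, start-of-sequence and unknown tokens. A minimal pure-Python sketch of the same decoding, assuming a tiny made-up word index (`toy_word_index` and the helper names are hypothetical, not part of Keras):

```python
# Toy stand-in for keras.datasets.imdb.get_word_index(); the real index
# maps each word to an integer rank starting at 1.
toy_word_index = {"the": 1, "and": 2, "film": 3, "brilliant": 4}

RESERVED = ("<pad>", "<sos>", "<unk>")  # these take IDs 0, 1 and 2


def build_id_to_word(word_index, reserved=RESERVED):
    # Shift every word ID up by the number of reserved tokens.
    id_to_word = {rank + len(reserved): word for word, rank in word_index.items()}
    for id_, token in enumerate(reserved):
        id_to_word[id_] = token
    return id_to_word


def decode(ids, id_to_word):
    # IDs missing from the table are shown as <unk>.
    return " ".join(id_to_word.get(id_, "<unk>") for id_ in ids)


id_to_word = build_id_to_word(toy_word_index)
print(decode([1, 4, 6, 7, 2], id_to_word))  # <sos> the film brilliant <unk>
```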

In [5]:
# Load the raw text reviews so we can do the preprocessing ourselves
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [6]:
datasets.keys()


dict_keys(['test', 'train', 'unsupervised'])

In [7]:
train_size = info.splits['train'].num_examples
test_size = info.splits['test'].num_examples

In [8]:
train_size, test_size

(25000, 25000)

In [9]:
# printing some data

for X_batch, y_batch in datasets["train"].batch(2).take(1):
  for review, label in zip(X_batch.numpy(), y_batch.numpy()):
    print("Review: ", review.decode("utf8")[:200], ".....")
    print("label: ", label, "= Positive" if label else "= Negative")
    print()

Review:  This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting  .....
label:  0 = Negative

Review:  I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However  .....
label:  0 = Negative



In [10]:
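# Preprocessing function

def preprocess(X_batch, y_batch):
  # keep only the first 300 bytes of each review
  X_batch = tf.strings.substr(X_batch, 0, 300)
  # replace <br /> tags with spaces
  X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
  # replace everything except letters and apostrophes with spaces
  X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
  # split on whitespace; the result is a ragged tensor
  X_batch = tf.strings.split(X_batch)
  # pad every review to the length of the longest one in the batch
  return X_batch.to_tensor(default_value=b"<pad>"), y_batch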

In [11]:
preprocess(X_batch, y_batch)

(<tf.Tensor: shape=(2, 53), dtype=string, numpy=
 array([[b'This', b'was', b'an', b'absolutely', b'terrible', b'movie',
         b"Don't", b'be', b'lured', b'in', b'by', b'Christopher',
         b'Walken', b'or', b'Michael', b'Ironside', b'Both', b'are',
         b'great', b'actors', b'but', b'this', b'must', b'simply', b'be',
         b'their', b'worst', b'role', b'in', b'history', b'Even',
         b'their', b'great', b'acting', b'could', b'not', b'redeem',
         b'this', b"movie's", b'ridiculous', b'storyline', b'This',
         b'movie', b'is', b'an', b'early', b'nineties', b'US',
         b'propaganda', b'pi', b'<pad>', b'<pad>', b'<pad>'],
        [b'I', b'have', b'been', b'known', b'to', b'fall', b'asleep',
         b'during', b'films', b'but', b'this', b'is', b'usually', b'due',
         b'to', b'a', b'combination', b'of', b'things', b'including',
         b'really', b'tired', b'being', b'warm', b'and', b'comfortable',
         b'on', b'the', b'sette', b'and', b'having', b'j
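
The first review in the output above ends with three `<pad>` tokens because `to_tensor(default_value=b"<pad>")` pads every row to the length of the longest review in the batch. A rough pure-Python sketch of that padding step (`pad_batch` is a hypothetical helper, not a TensorFlow function):

```python
def pad_batch(token_lists, pad=b"<pad>"):
    # Pad each token list on the right up to the longest list in the batch.
    width = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad] * (width - len(tokens)) for tokens in token_lists]


batch = pad_batch([b"Great movie".split(), b"I fell asleep halfway".split()])
for row in batch:
    print(row)
```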

In [12]:
vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
  for review in X_batch:
    vocabulary.update(list(review.numpy()))

In [15]:
vocabulary.most_common()[:5]

[(b'<pad>', 214309),
 (b'the', 61137),
 (b'a', 38564),
 (b'of', 33983),
 (b'and', 33431)]

In [14]:
vocab_size = 10000
truncated_vocabulary = [
                        word for word, count in vocabulary.most_common()[:vocab_size]
]

So, we now have a vocabulary of the 10,000 most common words. Next, we add a preprocessing step that replaces each word with its ID. We create a lookup table for this, with 1,000 out-of-vocabulary (OOV) buckets for words that are not in the vocabulary.

In [18]:
word_to_id = {word: index for index, word in enumerate(truncated_vocabulary)}
# example: words missing from the vocabulary fall back to vocab_size
for word in b"This movie was incredible and faantaaastic".split():
  print(word_to_id.get(word, vocab_size))

22
12
11
939
4
10000


In [19]:
words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype= tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets=num_oov_buckets)

In [20]:
table.lookup(tf.constant([b"This movie was incredible and faantaastic".split()]))


<tf.Tensor: shape=(1, 6), dtype=int64, numpy=array([[   22,    12,    11,   939,     4, 10309]])>
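
The last ID above, 10309, is larger than `vocab_size`: the misspelled word is not in the vocabulary, so the table hashes it into one of the 1,000 OOV buckets, which occupy IDs 10000 to 10999. A minimal pure-Python sketch of the idea follows. The MD5-based hash and the `make_lookup` helper are assumptions for illustration only, so the bucket IDs it produces will not match the ones TensorFlow assigns.

```python
import hashlib


def make_lookup(vocabulary, num_oov_buckets):
    word_to_id = {word: index for index, word in enumerate(vocabulary)}
    vocab_size = len(vocabulary)

    def lookup(word):
        if word in word_to_id:
            return word_to_id[word]
        # Unknown word: hash it deterministically into one of the OOV buckets.
        # (MD5 is only a stand-in for whatever hash the real table uses.)
        digest = hashlib.md5(word.encode("utf8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % num_oov_buckets
        return vocab_size + bucket

    return lookup


lookup = make_lookup(["<pad>", "the", "movie", "was"], num_oov_buckets=1000)
ids = [lookup(word) for word in "the movie was faantaastic".split()]
print(ids)
```

Because the hash is deterministic, the same unknown word always lands in the same bucket, while different unknown words may collide.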

In [21]:
def encode_words(X_batch, y_batch):
  return table.lookup(X_batch), y_batch

train_set = datasets["train"].repeat().batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

In [24]:
for X_batch, y_batch in train_set.take(1):
  print(X_batch)
  print(y_batch)

tf.Tensor(
[[  22   11   28 ...    0    0    0]
 [   6   21   70 ...    0    0    0]
 [4099 6881    1 ...    0    0    0]
 ...
 [  22   12  118 ...  331 1047    0]
 [1757 4101  451 ...    0    0    0]
 [3365 4392    6 ...    0    0    0]], shape=(32, 60), dtype=int64)
tf.Tensor([0 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0 1 0 0 0], shape=(32,), dtype=int64)


**Note: check out the ```tft.compute_and_apply_vocabulary()``` function from TF Transform later. It goes through the dataset to find all distinct words and generates the TF operations required to encode each word using this vocabulary.**

##Creating the model and training

In [26]:
embed_size = 128
model = keras.models.Sequential([
                                 keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size, 
                                                        mask_zero=True, input_shape = [None]), 
                                 keras.layers.GRU(128, return_sequences = True), 
                                 keras.layers.GRU(128),
                                 keras.layers.Dense(1, activation= "sigmoid")
])
model.compile(loss= "binary_crossentropy", optimizer = "adam", metrics = ["accuracy"])
history = model.fit(train_set, steps_per_epoch = train_size // 32, epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
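
With `mask_zero=True`, the Embedding layer produces a mask that is False wherever the input ID is 0, and the recurrent layers skip those time steps and carry the previous state forward. This works here because `<pad>` is the most frequent token and therefore has ID 0. The toy recurrence below is a simplified sketch of that behaviour, not the GRU computation itself; `run_recurrence` and the `embed` values are made up for illustration.

```python
def run_recurrence(ids, embed, mask_zero=True):
    state = 0.0
    for id_ in ids:
        if mask_zero and id_ == 0:
            continue  # masked step: keep the previous state unchanged
        state = 0.5 * state + embed[id_]
    return state


# Made-up scalar "embeddings"; ID 0 is the padding token.
embed = {0: 9.0, 22: 1.0, 12: 2.0, 11: 3.0}
short = [22, 12, 11]
padded = [22, 12, 11, 0, 0]

print(run_recurrence(short, embed))                     # 4.25
print(run_recurrence(padded, embed))                    # 4.25, padding ignored
print(run_recurrence(padded, embed, mask_zero=False))   # 14.5625, padding leaks in
```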


###Manual masking


In [28]:
K = keras.backend
embed_size = 128
inputs =  keras.layers.Input(shape = [None])
mask = keras.layers.Lambda(lambda inputs: K.not_equal(inputs, 0))(inputs)
z = keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size)(inputs)
z = keras.layers.GRU(128, return_sequences= True)(z, mask = mask)
z = keras.layers.GRU(128)(z, mask = mask)
outputs = keras.layers.Dense(1, activation = "sigmoid")(z)
model = keras.models.Model(inputs = [inputs], outputs = [outputs])
model.compile(loss="binary_crossentropy", optimizer= 'adam', metrics=['accuracy'])
history = model.fit(train_set, steps_per_epoch= train_size // 32, epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [29]:
review = input()

the movie was a waste of time and money


In [36]:
pred = table.lookup(tf.constant([review.split()]))

In [41]:
print("Negative review" if model.predict(pred)[0, 0] < 0.5 else "Positive review")

In [43]:
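def predict_review():
  review = input(" enter your review here...")
  # split on whitespace and map each word to its ID
  review = table.lookup(tf.constant([review.split()]))
  # the model outputs one probability per review, shape (1, 1)
  if model.predict(review)[0, 0] <= 0.5:
    print("Negative review")
  else:
    print("Positive review")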



In [46]:
predict_review()

 enter your review here...This movie was awesome
Positive review


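This function could also truncate the input to a maximum length and strip punctuation, as `preprocess` does for the training data (first 300 bytes, letters and apostrophes only).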
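
One possible shape for that, as a pure-Python sketch rather than a drop-in replacement: `clean_review` mirrors the steps of `preprocess` on a plain string (slicing by characters, not bytes), and `classify` takes the encoder and the predictor as arguments. The names `clean_review`, `classify`, `toy_encode` and `toy_predict` are hypothetical; in the notebook the last two would be replaced by wrappers around `table.lookup` and `model.predict`.

```python
import re


def clean_review(review, max_chars=300):
    # Same steps as preprocess(), applied to one Python string.
    text = review[:max_chars]
    text = re.sub(r"<br\s*/?>", " ", text)
    text = re.sub(r"[^a-zA-Z']", " ", text)
    return text.split()


def classify(review, encode, predict, threshold=0.5):
    ids = encode(clean_review(review))
    return "Positive review" if predict(ids) > threshold else "Negative review"


# Stand-ins so the sketch runs without a trained model.
TOY_IDS = {"awesome": 1, "terrible": 2}


def toy_encode(tokens):
    return [TOY_IDS.get(token, 0) for token in tokens]


def toy_predict(ids):
    return 0.9 if 1 in ids else 0.1


print(classify("This movie was, quite frankly, awesome!", toy_encode, toy_predict))
```

Cleaning the input this way keeps punctuation from turning known words such as `was,` into out-of-vocabulary tokens.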

In [47]:
predict_review()

 enter your review here...This movie was, quite frankly, a wake up call. This is a Pixar film for adults and it comes with an incredibly important message. I loved it and I absolutely want to listen to that message.
Positive review


In [48]:
predict_review()

 enter your review here...There are times during the first quarter when you may believe someone's spiked your drink with an hallucinogenic as Disney's innovative way of capturing our entrance and exit to the world is developed but, as you will find, this is a film to get you thinking and, more importantly, thinking about yourself - reflecting so to speak. Delivered with the usual Pixar excellence, if this doesn't make you realise that tomorrow is the first day of the rest of your life then rewind, pause and start again, because the message is universally important to all - and that includes you!!!
Positive review
