# Creating an IMDB Sentiment Analysis Model

This notebook trains a sentiment analysis model on IMDB movie reviews. We will later use TCAV to analyse the model and see which concepts lead it to classify a review as good or bad.

In [1]:
import tensorflow as tf
import numpy as np
import tensorflow_datasets as tfds
from tensorflow import keras

Now let's load the IMDB reviews dataset.

In [2]:
dir_path = "/code/tcav/tcav_examples/IMDB_Data/"
# Download the dataset into dir_path (if needed) and load it as (text, label) pairs.
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True,
                       data_dir=dir_path, download=True)

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /code/tcav/tcav_examples/IMDB_Data/imdb_reviews/plain_text/1.0.0...


Dataset imdb_reviews downloaded and prepared to /code/tcav/tcav_examples/IMDB_Data/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


Get the data ready for training and testing.

In [3]:
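train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []
TextLabels = []  # raw label tensors, kept alongside the NumPy labels

testing_sentences = []
testing_labels = []

# Each example is a (review, label) pair of tensors; label 1 = positive, 0 = negative.
for s, l in train_data:
  TextLabels.append(l)
  # Decode the bytes; str() on a bytes object would keep the b'...' wrapper.
  training_sentences.append(s.numpy().decode('utf-8'))
  training_labels.append(l.numpy())

for s, l in test_data:
  testing_sentences.append(s.numpy().decode('utf-8'))
  testing_labels.append(l.numpy())

# Collect the labels into NumPy arrays for model.fit.
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)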
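
A note on the text conversion in this cell: `s.numpy()` returns a Python `bytes` object, not a `str`. The snippet below is a small standalone illustration (the sample review is made up for this example) of how `str()` and `.decode()` treat such a value differently.

```python
# A made-up review in the form `.numpy()` hands it back: a bytes object.
raw_review = b"I didn't like it.<br />Too long."

as_repr = str(raw_review)              # textual repr of the bytes object
as_text = raw_review.decode("utf-8")   # the review text itself

print(as_repr)   # starts with the letter b and a quote character
print(as_text)   # just the review
```

The `b` prefix and the surrounding quotes end up inside the string when `str()` is used, so they can reach the tokenizer as part of the first and last words.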

Tokenize the text

In [4]:
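from tensorflow.keras.preprocessing.text import Tokenizer

vocab_size = 10000     # keep only the most frequent words
embedding_dim = 16     # size of each word vector
max_length = 100       # number of tokens per review after padding/truncation
trunc_type = 'post'    # cut over-long reviews at the end
oov_tok = "<OOV>"      # placeholder for out-of-vocabulary words

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index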
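
To make the roles of `num_words` and `oov_token` more concrete, here is a simplified pure-Python sketch of the idea. It is not the Keras implementation: the helper names `build_word_index` and `texts_to_ids` are hypothetical, and the sketch only lowercases and splits on whitespace, ignoring the punctuation filtering that `Tokenizer` performs.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Give the OOV token id 1, then number words by descending frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def texts_to_ids(texts, word_index, num_words, oov_token="<OOV>"):
    """Replace each word by its id; unknown or too-rare words get the OOV id."""
    oov_id = word_index[oov_token]
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            idx = word_index.get(word)
            ids.append(idx if idx is not None and idx < num_words else oov_id)
        sequences.append(ids)
    return sequences

toy_texts = ["the movie was great", "the movie was boring", "the plot was great"]
toy_index = build_word_index(toy_texts)
toy_sequences = texts_to_ids(["the ending was great"], toy_index, num_words=5)
print(toy_index)
print(toy_sequences)
```

In this toy run, "ending" was never seen and "great" has an id at or above `num_words`, so both map to the OOV id.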

Convert the IMDB reviews into sequences of integer word indices.

In [5]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
sequences = tokenizer.texts_to_sequences(training_sentences)

Pad or truncate the sequences so that every input has the same length.

In [6]:
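padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
# Truncate the test reviews the same way as the training reviews ('post');
# without the argument, pad_sequences truncates from the front.
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, truncating=trunc_type)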
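
The effect of the `truncating` and `padding` options can be sketched in plain Python. The helper `pad_or_truncate` below is a hypothetical stand-in written for illustration, not the Keras function; it assumes the same defaults as `pad_sequences` (`'pre'` for both options, padding value 0).

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Return a list of exactly maxlen items built from seq."""
    if len(seq) > maxlen:
        # 'pre' drops items from the front, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    # 'pre' puts the padding in front of the tokens, 'post' after them.
    return pad + seq if padding == "pre" else seq + pad

long_review = [11, 12, 13, 14, 15, 16]
short_review = [21, 22]

print(pad_or_truncate(long_review, 4, truncating="post"))  # keeps the first 4 ids
print(pad_or_truncate(long_review, 4, truncating="pre"))   # keeps the last 4 ids
print(pad_or_truncate(short_review, 4))                    # zeros added in front
```

For reviews longer than `max_length`, the choice decides whether the model sees the opening or the closing words, so it is worth using the same setting for every dataset fed to the model.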

We will now create the model, which will be modified for TCAV later.

In [7]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 100, 16)           160000    
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dense (Dense)                (None, 24)                38424     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 25        
=================================================================
Total params: 198,449
Trainable params: 198,449
Non-trainable params: 0
_________________________________________________________________
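
The parameter counts in the summary can be checked by hand from the hyperparameters defined earlier. The short calculation below reproduces them; the variable names for the per-layer counts are chosen for this illustration only.

```python
vocab_size, embedding_dim, max_length = 10000, 16, 100

embedding_params = vocab_size * embedding_dim   # one 16-dim vector per word id
flatten_units = max_length * embedding_dim      # 100 tokens x 16 values, no weights
dense_params = flatten_units * 24 + 24          # weights plus one bias per unit
output_params = 24 * 1 + 1                      # weights plus a single bias

total_params = embedding_params + dense_params + output_params
print(embedding_params, flatten_units, dense_params, output_params, total_params)
```

Most of the weights sit in the embedding table, which is why `vocab_size` and `embedding_dim` dominate the model size.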


Now let's train this model

In [8]:
num_epochs = 6
history = model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


Now let's test out the model!

In [9]:
new_sentences = [
    'I loved this movie.',
    'This film is so boring.',
    'This movie is so hilarious. I had a really great time!',
    'Very linear scenario, no surprises at all',
    'Another amazing addition to the franchise with good story arcs and standalone episodes.',
    'Not for the hardened not even the casual fans.'
    ]
# Apply the same preprocessing as for the training data.
new_sequences = tokenizer.texts_to_sequences(new_sentences)
# Use a new name so the padded training data in `padded` is not overwritten.
new_padded = pad_sequences(new_sequences, maxlen=max_length, truncating=trunc_type)
output = model.predict(new_padded)
# Each score is the sigmoid output: close to 1 is positive, close to 0 is negative.
for sentence, score in zip(new_sentences, output):
    print('Review:' + sentence + ' ' + 'sentiment:' + str(score) + '\n')

Review:I loved this movie. sentiment:[0.7098205]

Review:This film is so boring. sentiment:[0.06861907]

Review:This movie is so hilarious. I had a really great time! sentiment:[0.9889351]

Review:Very linear scenario, no surprises at all sentiment:[0.72659445]

Review:Another amazing addition to the franchise with good story arcs and standalone episodes. sentiment:[0.9893259]

Review:Not for the hardened not even the casual fans. sentiment:[0.93260646]
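
The scores above are sigmoid outputs between 0 and 1, and in the IMDB dataset label 1 means positive and label 0 means negative. A minimal sketch of turning the scores into labels follows; the 0.5 cutoff and the helper `to_label` are assumptions made for this illustration, and the scores are copied from the output above.

```python
# Scores copied from the notebook output above, in the same order as new_sentences.
scores = [0.7098205, 0.06861907, 0.9889351, 0.72659445, 0.9893259, 0.93260646]

THRESHOLD = 0.5  # assumed cutoff, not something the notebook specifies

def to_label(score, threshold=THRESHOLD):
    """Map a sigmoid score to a sentiment label."""
    return "positive" if score >= threshold else "negative"

labels = [to_label(s) for s in scores]
print(labels)
```

With this cutoff only 'This film is so boring.' comes out negative. The fourth and sixth reviews arguably read as negative but score well above 0.5, which is the kind of behaviour an interpretability analysis can help examine.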



In [10]:
model.save(dir_path+"imdb_model.h5")