# **Training a binary classifier on the IMDB dataset**
### The data comes from TensorFlow Datasets
### ***Description*** - A binary classifier built with neural networks that labels movie reviews as positive or negative
### [TensorFlow Datasets](https://www.tensorflow.org/datasets)

In [38]:
#!pip install -q tensorflow-datasets
import tensorflow_datasets as tfds 

#loading the dataset
imdb, info = tfds.load("imdb_reviews", with_info = True, as_supervised = True)

In [39]:
print(info)
# The dataset contains 25,000 training reviews and 25,000 test reviews, plus unlabeled data (not used in this notebook)


tfds.core.DatasetInfo(
    name='imdb_reviews',
    full_name='imdb_reviews/plain_text/1.0.0',
    description="""
    Large Movie Review Dataset.
    This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
    """,
    config_description="""
    Plain text
    """,
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    data_path='~/tensorflow_datasets/imdb_reviews/plain_text/1.0.0',
    file_format=tfrecord,
    download_size=80.23 MiB,
    dataset_size=129.83 MiB,
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'text': Text(shape=(), dtype=tf.string),
    }),
    supervised_keys=('text', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=25000, num_shards=1>,
        '

In [40]:
# print(imdb) only shows the dataset objects, not the raw examples
for i in imdb['train'].take(5):
  print(i)

# Each example is a 2-tuple: (review, label). The value sits in each tensor's numpy property (label 0 -> negative, 1 -> positive)

(<tf.Tensor: shape=(), dtype=string, numpy=b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.">, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on

In [49]:
import numpy as np

train_data, test_data = imdb['train'], imdb['test']

#initialize sentences and label lists
train_sentences = []
train_labels = []

test_sentences = []
test_labels = []
# call .numpy() to convert the tensors, since each item yields the sentence and the label as tensors
for s, l in train_data:
  train_sentences.append(s.numpy().decode('utf-8'))
  train_labels.append(l.numpy())

for s,l in test_data:
  test_sentences.append(s.numpy().decode('utf-8'))
  test_labels.append(l.numpy())

# During training, the label values are expected to be NumPy arrays, so convert them
train_labels_final = np.array(train_labels)
test_labels_final = np.array(test_labels)

# **Preprocessing Steps**
Tokenize the sentences and pad the resulting sequences, since the sentences vary in length


In [111]:
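# parameters for preprocessing

vocab_len = 10000      # vocabulary size limit, passed to the Tokenizer as num_words
max_len = 120          # every sequence is padded or truncated to this length
embedding_dim = 16     # size of each word vector
trunc_type = 'post'    # drop tokens from the end of reviews longer than max_len
oov_token = "<OOV>"    # out-of-vocabulary token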

In [51]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

Steps:

1. Initialise the tokenizer
2. Create a word_index dict from the training data
3. Convert the sentences to sequences and pad them (train and test)
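
The steps above can be illustrated without Keras. This is a minimal pure-Python sketch, not the Keras implementation: the helper names `build_word_index` and `texts_to_sequences` are hypothetical, the toy sentences are made up, and lowercasing plus whitespace splitting is a simplifying assumption (the Keras `Tokenizer` also filters punctuation). It shows the idea of ranking words by frequency, reserving index 1 for the OOV token, and mapping unseen or too-rare words to that index.

```python
from collections import Counter

OOV_TOKEN = "<OOV>"

def build_word_index(texts, oov_token=OOV_TOKEN):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(vocab, start=1)}

def texts_to_sequences(texts, word_index, num_words, oov_token=OOV_TOKEN):
    """Replace each word by its index; unknown or too-rare words become OOV."""
    oov_index = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None or idx >= num_words:
                idx = oov_index
            seq.append(idx)
        sequences.append(seq)
    return sequences

# Toy data (assumption, not taken from the IMDB reviews)
toy_train = ["I love this movie", "I hate this movie", "this movie is great"]
toy_index = build_word_index(toy_train)

# "film" never appears in toy_train, so it is encoded as the OOV index
toy_sequences = texts_to_sequences(["I love this film"], toy_index, num_words=100)
print(toy_index)
print(toy_sequences)
```

Because unseen words fall back to the OOV index, encoding a new review never fails on an unknown word.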

In [52]:
tokenizer = Tokenizer(num_words= vocab_len, oov_token=oov_token)

tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(train_sentences)
padded = pad_sequences(sequences= sequences, maxlen= max_len, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(test_sentences)
testing_padded = pad_sequences(testing_sequences,maxlen=max_len, truncating=trunc_type)
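
To make the effect of `maxlen` and `truncating` concrete, here is a small pure-Python sketch of the padding behaviour. It is an illustration under assumptions, not the Keras code: `pad_or_truncate` is a hypothetical helper, and it mirrors the settings used above (`pad_sequences` pads at the front by default, and `truncating='post'` drops tokens from the end).

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="post", value=0):
    """Return a list of exactly maxlen items."""
    if len(seq) > maxlen:
        # 'post' keeps the first maxlen tokens, 'pre' keeps the last maxlen
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    pad = [value] * (maxlen - len(seq))
    return pad + list(seq) if padding == "pre" else list(seq) + pad

short = pad_or_truncate([4, 5, 2], maxlen=5)
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 4, 5, 2]
print(long_)  # [1, 2, 3, 4, 5]
```

With these settings, short reviews get zeros at the front and long reviews keep only their first `max_len` tokens.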

# **Building and compiling the model**
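The first layer is an ***Embedding*** layer: each word in the vocabulary is mapped to a dense vector of `embedding_dim` (16) values, so a padded review of 120 tokens becomes a 120 x 16 matrix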
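
An embedding layer behaves like a lookup table with one row of weights per vocabulary index. The sketch below illustrates that idea in plain Python; the sizes, the random untrained weights, and the names `embedding_table` and `embed` are assumptions for illustration, not what Keras does internally.

```python
import random

random.seed(0)
toy_vocab_len, toy_embedding_dim = 10, 4

# One row of (here random, untrained) weights per vocabulary index
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(toy_embedding_dim)]
    for _ in range(toy_vocab_len)
]

def embed(sequence, table):
    """Replace each integer index with its row of the table."""
    return [table[idx] for idx in sequence]

padded_example = [0, 0, 4, 5, 2]
embedded = embed(padded_example, embedding_table)
print(len(embedded), len(embedded[0]))    # 5 4
print(toy_vocab_len * toy_embedding_dim)  # 40 trainable weights
```

The number of trainable weights is simply vocabulary size times embedding dimension, and identical indices (such as the padding zeros) always map to the same vector.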

In [53]:
import tensorflow as tf

In [54]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_len, embedding_dim, input_length = max_len),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation = 'relu'),
    tf.keras.layers.Dense(1, activation = 'sigmoid')
])


model.compile(loss= 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

model.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (None, 120, 16)           160000    
                                                                 
 flatten_3 (Flatten)         (None, 1920)              0         
                                                                 
 dense_10 (Dense)            (None, 6)                 11526     
                                                                 
 dense_11 (Dense)            (None, 1)                 7         
                                                                 
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0
_________________________________________________________________
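
The parameter counts in the summary above can be reproduced by hand. The following sketch only does the arithmetic; `dense_params` is a hypothetical helper, and the values are the ones set in the preprocessing parameters.

```python
vocab_len, max_len, embedding_dim = 10000, 120, 16

def dense_params(n_inputs, n_units):
    """One weight per input per unit, plus one bias per unit."""
    return n_inputs * n_units + n_units

embedding_params = vocab_len * embedding_dim    # 160000
flatten_size = max_len * embedding_dim          # 1920 (Flatten has no weights)
hidden_params = dense_params(flatten_size, 6)   # 11526
output_params = dense_params(6, 1)              # 7

total = embedding_params + hidden_params + output_params
print(total)  # 171533
```

Almost all of the weights sit in the embedding table; the Flatten layer feeds 1920 values into the first Dense layer, which accounts for the 11,526 parameters there.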


In [57]:
# Alternatively, GlobalAveragePooling1D can replace Flatten: it averages over the sequence, so the output vector shrinks from 1920 to 16 values.
# This model is simpler and faster per epoch, but its accuracy may be somewhat lower than that of the model above

model2 = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_len, embedding_dim, input_length = max_len),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation = 'relu'),
    tf.keras.layers.Dense(6, activation = 'relu'),
    tf.keras.layers.Dense(1, activation = 'sigmoid')
])


model2.compile(loss= 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

model2.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_7 (Embedding)     (None, 120, 16)           160000    
                                                                 
 global_average_pooling1d_3   (None, 16)               0         
 (GlobalAveragePooling1D)                                        
                                                                 
 dense_14 (Dense)            (None, 6)                 102       
                                                                 
 dense_15 (Dense)            (None, 6)                 42        
                                                                 
 dense_16 (Dense)            (None, 1)                 7         
                                                                 
Total params: 160,151
Trainable params: 160,151
Non-trainable params: 0
_________________________________________________________________
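
The difference between Flatten and GlobalAveragePooling1D can be seen on a tiny example. This is a plain-Python sketch under assumptions: `global_average_pool` and `flatten` are hypothetical helpers and the toy matrix is made up. The last lines redo the parameter arithmetic for the pooled model in the summary above.

```python
def global_average_pool(embedded):
    """Average over the sequence axis: (steps, dim) -> (dim,)."""
    steps = len(embedded)
    return [sum(column) / steps for column in zip(*embedded)]

def flatten(embedded):
    """Concatenate all rows: (steps, dim) -> (steps * dim,)."""
    return [x for row in embedded for x in row]

# Toy "embedded review": 3 time steps, 2 embedding dimensions
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pool(toy)
print(pooled)             # [3.0, 4.0]
print(len(flatten(toy)))  # 6

# Parameter counts for the pooled model
embedding_dim = 16
dense_1 = embedding_dim * 6 + 6   # 102
dense_2 = 6 * 6 + 6               # 42
dense_out = 6 * 1 + 1             # 7
model2_total = 10000 * embedding_dim + dense_1 + dense_2 + dense_out
print(model2_total)       # 160151
```

Pooling discards word order and keeps only one averaged vector per review, which is why the first Dense layer needs 102 parameters instead of 11,526.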

In [58]:
num_epochs = 10

# Train the model
model.fit(padded, train_labels_final, epochs=num_epochs, validation_data=(testing_padded, test_labels_final))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbe5ce6a710>

In [62]:
scores = model.evaluate(testing_padded, test_labels_final, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 80.70%
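
For reference, the accuracy metric compares thresholded sigmoid outputs with the true labels. The sketch below shows that computation in plain Python; `binary_accuracy` is a hypothetical helper, the 0.5 threshold is the usual default for binary accuracy, and the probabilities and labels are made-up toy values, not results from this model.

```python
def binary_accuracy(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy values (assumption): three of the four predictions are correct
toy_probs = [0.91, 0.20, 0.66, 0.45]
toy_labels = [1, 0, 1, 1]

acc = binary_accuracy(toy_probs, toy_labels)
print("Accuracy: %.2f%%" % (acc * 100))  # Accuracy: 75.00%
```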


In [63]:
model.save("trained_demo.h5")


In [110]:
# testing the model on a single review

# Reuse the fitted tokenizer so the review is encoded exactly like the training
# data; unknown words map to the OOV index instead of raising a KeyError
review = "i love this movie"
test = tokenizer.texts_to_sequences([review])
test = pad_sequences(test, maxlen=max_len, truncating=trunc_type)

# predict once and reuse the probability for the rounded class label
probability = model.predict(test)
print(probability)
predictions = np.round(probability).astype(int)
print(predictions)
if predictions.item(0) == 0:
  print("negative")
else:
  print("positive")

[[0.66908866]]
[[1]]
positive
