<a href="https://colab.research.google.com/github/id-shiv/project_notebooks/blob/master/%5BProject_101%5D_Text_Classification_with_Tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Libraries

In [0]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

# Data

Data is a corpus of movie reviews from keras dataset named imdb


In [0]:
data = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = data.load_data(num_words=10000)

## View Data

In [0]:
word_index = data.get_word_index()
word_index = {word: index+3 for word, index in word_index.items()}
word_index["<PAD>"] = 0  # Used to replace words that are outside the set limit of feature count
word_index["<START>"] = 1  # Used to indicate the start of the review
word_index["<UNKNOWN>"] = 2. # Used to indicate words that are not in word_index (since num_words=10000)
word_index["<UNUSED>"] = 3
reverse_word_index = {index: word for word, index in word_index.items()}

In [0]:
def decode_review(review):
  return " ".join([reverse_word_index.get(i, '?') for i in review])

In [43]:
decode_review(test_data[2])

"<START> many animation buffs consider <UNKNOWN> <UNKNOWN> the great forgotten genius of one special branch of the art puppet animation which he invented almost single <UNKNOWN> and as it happened almost accidentally as a young man <UNKNOWN> was more interested in <UNKNOWN> than the cinema but his <UNKNOWN> attempt to film two <UNKNOWN> <UNKNOWN> fighting led to an unexpected breakthrough in film making when he realized he could <UNKNOWN> movement by <UNKNOWN> beetle <UNKNOWN> and <UNKNOWN> them one frame at a time this discovery led to the production of amazingly elaborate classic short the <UNKNOWN> revenge which he made in russia in <UNKNOWN> at a time when motion picture animation of all sorts was in its <UNKNOWN> br br the political <UNKNOWN> of the russian revolution caused <UNKNOWN> to move to paris where one of his first productions <UNKNOWN> was a dark political satire <UNKNOWN> known as <UNKNOWN> or the <UNKNOWN> who wanted a king a strain of black comedy can be found in almo

## Prepare Data

### Feature consistency

Limit the number of words in a review to 250 (needed to know number of features i.e. number of neurons in input layer)

In [44]:
train_data = keras.preprocessing.sequence.pad_sequences(train_data, 
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)
test_data = keras.preprocessing.sequence.pad_sequences(test_data, 
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)
print(f'Length of Training Data : {len(train_data)}'
      f'\nLength of Test Data : {len(test_data)}')

Length of Training Data : 25000
Length of Test Data : 25000


# Model

* Embedding : Create word vector for each word (10000 vectors in below example). Angle between each vector indicates similarity.  
Set 16 dimensions to the vector in below example.
* Global Average Pooling 1D : Scale down 16 dimentions (co-effecients) to 1 demention for easier compute.
* Output : need 1 neuron to indicate 0 or 1 (good or bad). Sigmoid suits well for boolean.  

In [0]:
model = keras.Sequential()
model.add(keras.layers.Embedding(10000, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))

## Compile

In [0]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

## Training

Batch size : How many review to be loaded at once

In [47]:
x_val = train_data[:10000]
x_train = train_data[10000:]

y_val = train_labels[:10000]
y_train = train_labels[10000:]

model.fit(x_train, y_train, epochs=40, batch_size=52, 
                          validation_data=(x_val, y_val), verbose=1)

Train on 15000 samples, validate on 10000 samples
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<tensorflow.python.keras.callbacks.History at 0x7f705ea7ce80>

## Evaluate

In [48]:
loss, accuracy = model.evaluate(test_data, test_labels)
print(f'Test Loss : {loss}'
      f'\nTest Accuracy : {accuracy}')

Test Loss : 1.184898285342455
Test Accuracy : 0.8401600122451782


## Save

In [0]:
model.save('movie_reviews.h5')

# Predict

## Load Model

In [0]:
model = keras.models.load_model('movie_reviews.h5')

## Pre-process input

In [0]:
# Encode review
def encode_review(review):
  encoded = [1]
  for word in review:
    word = word.lower()
    if word in word_index:
      encoded.append(word_index[word])
    else:
      encoded.append(2)
  encoded = keras.preprocessing.sequence.pad_sequences([encoded], 
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)
  return encoded

In [0]:
test_review = "This movie is horrible, i do not know how would the story writer not relate the murder to one of the most infamous mystery"
test_review = "Good"

# Remove punctutions
test_review = test_review.replace(',', '')

# Encode review
encoded_test_review = encode_review(test_review)

## Predict

In [57]:
predict = model.predict([encoded_test_review])
print(f'\n\nReview : {test_review}')
print(f'Prediction : {predict[0]}')



Review : Good
Prediction : [0.58670723]
