In this part, we begin with 50,000 movie reviews from the IMDB dataset (more detail [here](http://ai.stanford.edu/~amaas/data/sentiment/)), loaded through TensorFlow Datasets (https://www.tensorflow.org/datasets/catalog/overview). We train a neural network on reviews labelled 'positive' or 'negative' and look at which words in a sentence drive those sentiments. First of all, let's start with word embeddings.

![](https://docs.google.com/uc?export=download&id=1TWX3hlyRtTBQhyR1dPMdG8B6c2VWTVbB)

## Word Embeddings
The `Tokenizer` prepares your text for a neural network by converting words into numeric tokens and turning sentences into sequences of those tokens. <br>

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer # https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

sentences = [
    'I love my dog',
    'I love my cat'
]

# num_words caps how many of the most frequent words are kept when texts are converted to sequences
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
# The Tokenizer lowercases words for you ('I' becomes 'i')
# It also strips punctuation
print("Word Index:", word_index)

Word Index: {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
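
To make the "sequencing" step concrete, here is a minimal pure-Python sketch of how a word index turns a new sentence into a list of tokens. It is a simplified stand-in and not the Keras `Tokenizer` implementation: the helper name `to_sequence`, the punctuation handling, and the out-of-vocabulary index `6` are all assumptions made for illustration.

```python
import string

# Word index copied from the output above.
word_index = {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}

def to_sequence(sentence, word_index, oov_index=None):
    """Hypothetical helper: lowercase, strip punctuation, map words to ids.

    Unknown words are dropped unless an oov_index is supplied.
    """
    cleaned = sentence.lower().translate(str.maketrans('', '', string.punctuation))
    sequence = []
    for word in cleaned.split():
        if word in word_index:
            sequence.append(word_index[word])
        elif oov_index is not None:
            sequence.append(oov_index)
    return sequence

print(to_sequence('I love my dog!', word_index))                      # [1, 2, 3, 4]
print(to_sequence('I really love my cat', word_index))                # [1, 2, 3, 5]
print(to_sequence('I really love my cat', word_index, oov_index=6))   # [1, 6, 2, 3, 5]
```

Without an out-of-vocabulary index the unseen word "really" silently disappears, which shortens the sentence. Reserving a token for unknown words keeps the sentence length intact.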


An embedding maps these tokens to vectors in a high-dimensional space. With embeddings and labelled examples, the vectors can be tuned so that words with similar meanings point in similar directions in that space. This is the first step in training a neural network to understand sentiment in text. Embeddings can be explored visually at http://projector.tensorflow.org/
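
One common way to measure whether two vectors point in a similar direction is cosine similarity. The sketch below uses hand-picked 3-dimensional vectors purely for illustration; they are hypothetical values, not embeddings learned from the IMDB data.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
embeddings = {
    'fun':      [0.9, 0.8, 0.1],
    'exciting': [0.8, 0.9, 0.2],
    'boring':   [-0.7, -0.9, 0.1],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_close = cosine_similarity(embeddings['fun'], embeddings['exciting'])
sim_far = cosine_similarity(embeddings['fun'], embeddings['boring'])
print(round(sim_close, 3), round(sim_far, 3))
```

With these made-up values, "fun" and "exciting" score close to 1 while "fun" and "boring" score below 0.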

In [None]:
import tensorflow as tf
print(tf.__version__)

2.3.0


In [None]:
# Load the IMDB dataset from TensorFlow Datasets
import tensorflow_datasets as tfds

# Label 0 is a negative review, label 1 is a positive review
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True) # https://www.tensorflow.org/datasets/catalog/imdb_reviews

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...



Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [None]:
import numpy as np

train_data, test_data = imdb['train'], imdb['test']

In [None]:
len(train_data), len(test_data)

(25000, 25000)

In [None]:
# Define the list of sentences and labels
training_sentences = []
training_labels = []

testing_sentences = []
testing_labels = []

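# Iterate over the training set to extract the sentences and labels,
# since each example is stored as a tensor
# Note: str() on the bytes value keeps the b'...' wrapper in the text;
# s.numpy().decode('utf-8') would give the plain review text instead
for s,l in train_data:
    training_sentences.append(str(s.numpy()))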
    training_labels.append(l.numpy())

for s,l in test_data:
    testing_sentences.append(str(s.numpy()))
    testing_labels.append(l.numpy())

In [None]:
# Keras expects the labels as NumPy arrays during training
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)

In [None]:
# Tokenize the sentences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 10000
embedding_dim = 16
max_length = 120
oov_tok = "<OOV>"
trunc_type = 'post'
padding_type='post'

tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
# Replace the words in each review with the token values we created
sequences = tokenizer.texts_to_sequences(training_sentences)
# Pad or truncate the sequences so they all have the same length
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# Do the same for the testing set, reusing the word_index derived from the training set
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
# Truncate the same way as the training set (the pad_sequences default is 'pre')
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
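
The effect of the padding and truncating options can be sketched in plain Python for a single sequence. The helper `pad_or_truncate` is hypothetical and only mirrors the idea behind `pad_sequences`. Its defaults are set to `'post'` to match this cell, whereas the Keras function defaults to `'pre'` for both options.

```python
def pad_or_truncate(sequence, maxlen, padding='post', truncating='post', value=0):
    """Hypothetical helper: force one token list to exactly maxlen entries."""
    if len(sequence) > maxlen:
        # 'post' drops tokens from the end, 'pre' drops them from the start.
        sequence = sequence[:maxlen] if truncating == 'post' else sequence[-maxlen:]
    pad = [value] * (maxlen - len(sequence))
    # 'post' appends the padding, 'pre' puts it in front.
    return sequence + pad if padding == 'post' else pad + sequence

print(pad_or_truncate([5, 3, 8], maxlen=5))                              # [5, 3, 8, 0, 0]
print(pad_or_truncate([5, 3, 8], maxlen=5, padding='pre'))               # [0, 0, 5, 3, 8]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4))                     # [1, 2, 3, 4]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4, truncating='pre'))   # [3, 4, 5, 6]
```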

In [None]:
# Decode the tokens back into words; padding (token 0) has no entry and prints as '?'
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

print(decode_review(padded[0]))
print(training_sentences[0])

b this was an absolutely terrible movie don't be <OOV> in by christopher walken or michael <OOV> both are great actors but this must simply be their worst role in history even their great acting could not redeem this movie's ridiculous storyline this movie is an early nineties us propaganda piece the most pathetic scenes were those when the <OOV> rebels were making their cases for <OOV> maria <OOV> <OOV> appeared phony and her pseudo love affair with walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning i am disappointed that there are movies like this ruining <OOV> like christopher <OOV> good name i could barely sit through it ? ?
b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pa
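
The stray `b` at the start of the decoded review comes from calling `str()` on a bytes value, which returns its printed representation including the `b'...'` or `b"..."` wrapper. A minimal sketch of the difference follows, using a shortened stand-in for the value returned by `s.numpy()`:

```python
# Shortened stand-in for the bytes value returned by s.numpy().
raw = b"This was an absolutely terrible movie."

as_str = str(raw)               # keeps the b'...' wrapper as literal characters
as_text = raw.decode('utf-8')   # plain text without the wrapper

print(as_str)
print(as_text)
```

Decoding the bytes would keep the extra `b` token out of the vocabulary. The outputs shown in this notebook were produced with `str()`.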

In [None]:
padded.shape

(25000, 120)

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Embedding

# Define the neural network
model = Sequential([
    # The Embedding layer is the key to text sentiment analysis in TensorFlow
    Embedding(vocab_size, embedding_dim, input_length=max_length), # www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding
    Flatten(),
    Dense(6, activation='relu'),
    Dense(1, activation='sigmoid')
])

Let's talk about embeddings. <br>
Suppose we have the words in a sentence. Some words have similar meanings, such as "fun" and "exciting". <br>
What if we could assign each word a vector in a higher-dimensional space (e.g. 16 dimensions), so that words with similar meanings have similar vectors? Over the course of training, such words begin to cluster together. The labels come from the dataset: if words like "fun" and "exciting" show up a lot in positive reviews, they carry similar sentiment, so their vectors are pushed in similar directions and end up close to each other in the vector space. <br>
The neural network is trained on these vectors and the review labels to come up with an **embedding** (the vector for each word, which reflects its associated sentiment).
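
Mechanically, an embedding layer can be thought of as a lookup table with one row per token. The sketch below uses plain Python lists and tiny sizes to show the lookup and the flattening that follows it. The sizes, the random starting values, and the sample review are assumptions chosen for illustration, not values from the trained model.

```python
import random

random.seed(0)
vocab_size, embedding_dim, max_length = 10, 4, 6   # tiny stand-ins for 10000, 16, 120

# One row of embedding_dim numbers per token id; training would adjust these values.
embedding_matrix = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                    for _ in range(vocab_size)]

padded_review = [3, 7, 7, 2, 0, 0]   # hypothetical padded token ids

# Embedding: replace every token id with its row, giving max_length x embedding_dim values.
embedded = [embedding_matrix[token] for token in padded_review]
# Flatten: concatenate the rows into one long vector for the Dense layers.
flattened = [value for row in embedded for value in row]

print(len(embedded), len(embedded[0]), len(flattened))   # 6 4 24
```

The same token always maps to the same row, so the two occurrences of token `7` produce identical vectors.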

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 120, 16)           160000    
_________________________________________________________________
flatten_2 (Flatten)          (None, 1920)              0         
_________________________________________________________________
dense_4 (Dense)              (None, 6)                 11526     
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 7         
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0
_________________________________________________________________
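
The parameter counts in the summary can be checked by hand from the layer sizes. A short sketch of the arithmetic, using the hyperparameters defined earlier:

```python
vocab_size, embedding_dim, max_length = 10000, 16, 120

embedding_params = vocab_size * embedding_dim   # one 16-dimensional vector per token
flatten_units = max_length * embedding_dim      # 120 tokens x 16 values per review
dense_1_params = flatten_units * 6 + 6          # weights plus one bias per unit
dense_2_params = 6 * 1 + 1                      # weights plus one bias

total = embedding_params + dense_1_params + dense_2_params
print(embedding_params, flatten_units, dense_1_params, dense_2_params, total)
# 160000 1920 11526 7 171533
```

Most of the parameters sit in the embedding table, which is why the vocabulary size and embedding dimension dominate the model size.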


In [None]:
num_epochs = 10
model.fit(padded,
          training_labels_final,
          epochs=num_epochs,
          validation_data=(testing_padded, testing_labels_final))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f8ff2c7d898>

In [None]:
# Export the weights to visualize the word embeddings
# The embedding layer is layer 0 of the model
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim) --> 10,000 vocabulary entries, each a 16-dimensional vector

(10000, 16)


In [None]:
# Write the vectors and metadata to files for the projector
import io

# Index 0 is reserved for padding, so start from 1
with io.open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('meta.tsv', 'w', encoding='utf-8') as out_m:
    for word_num in range(1, vocab_size):
        word = reverse_word_index[word_num]
        embeddings = weights[word_num]
        out_m.write(word + "\n")
        out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")

In [None]:
# Download the files when running in Colab
try:
    from google.colab import files
except ImportError:
    pass
else:
    files.download('vecs.tsv')
    files.download('meta.tsv')


Go to http://projector.tensorflow.org/ and press the **Load** button on the left. <br> Choose **vecs.tsv** for the first step and **meta.tsv** for the second step. <br>
After that, tick the **Sphereize data** checkbox to visualize the embeddings.
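
The two files share a simple layout: line *n* of `meta.tsv` is the word whose vector is on line *n* of `vecs.tsv`, with the values separated by tabs. As a rough sketch of how they could be read back and queried outside the projector, the code below uses in-memory stand-ins with made-up 2-dimensional vectors. The helper names `load_projector_files` and `nearest` are hypothetical.

```python
import io
import math

def load_projector_files(vec_file, meta_file):
    """Hypothetical helper: pair each word in meta.tsv with its vector in vecs.tsv."""
    words = [line.rstrip('\n') for line in meta_file]
    vectors = [[float(x) for x in line.split('\t')] for line in vec_file if line.strip()]
    return dict(zip(words, vectors))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest(word, table):
    """Return the other word whose vector has the highest cosine similarity."""
    others = [w for w in table if w != word]
    return max(others, key=lambda w: cosine(table[word], table[w]))

# In-memory stand-ins for vecs.tsv and meta.tsv (made-up values).
vecs = io.StringIO("0.9\t0.1\n0.8\t0.2\n-0.7\t0.3\n")
meta = io.StringIO("fun\nexciting\nboring\n")

table = load_projector_files(vecs, meta)
print(nearest('fun', table))   # exciting
```

With the real files, the `io.StringIO` objects would be replaced by `open('vecs.tsv', encoding='utf-8')` and `open('meta.tsv', encoding='utf-8')`.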

In [None]:
test_sentence = ["this was an absolutely terrible movie don't be <OOV> in by christopher walken or michael <OOV> both are great actors but this must simply be their worst role in history even their great acting could not redeem this movie's ridiculous storyline this movie is an early nineties us propaganda piece the most pathetic scenes were those when the <OOV> rebels were making their cases for <OOV> maria <OOV> <OOV> appeared phony and her pseudo love affair with walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning i am disappointed that there are movies like this ruining <OOV> like christopher <OOV> good name i could barely sit through it ? ?"]
test_sequences = tokenizer.texts_to_sequences(test_sentence)
test_padded = pad_sequences(test_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
print(model.predict(test_padded))

[[5.459558e-09]]
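
The model ends in a single sigmoid unit, so the printed value is a score between 0 and 1, where label 0 is negative and label 1 is positive. A minimal sketch of turning that score into a label follows. The 0.5 threshold and the helper name are assumptions made for illustration.

```python
def label_from_probability(p, threshold=0.5):
    """Hypothetical helper: map a sigmoid output to a sentiment label."""
    return 'positive' if p >= threshold else 'negative'

print(label_from_probability(5.459558e-09))   # negative
```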
