# Week 2: word embeddings
How to get a sentiment from a sequence of numbers? It can be learned from a corpus of words (embedding).
The idea is that words and associate words are clustered as vectors in multidimensional space.

## The IMDB dataset
TensorFlow data services (TFDS) is a library that contains lots of data sets in lots of different categories.
We will use `imdb_reviews` dataset that contains 50k movie reviews categorized as positive or negative.

## Looking into the details

In [5]:
import tensorflow as tf
print(tf.__version__) # determine TF version, we will need eager execution which is enabled by default in TF 2.0
# tf.enable_eager_execution() # for TF 1.0

# Install the TFDS library:
# pip install -q tensorflow-datasets

import tensorflow_datasets as tfds
# get data and metadate from the imdb ds
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

# Data is split in 25K samples for training and 25K samples for testing
import numpy as np
train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []

testing_sentences = []
testing_labels = []

# Do some conversion to get arrays of sentences
for s,l in train_data: # iterate over train data to extract sentences and labels
    training_sentences.append(str(s.numpy()))
    training_labels.append(l.numpy())

for s,l in test_data:
    testing_sentences.append(str(s.numpy()))
    testing_labels.append(l.numpy())

# during the training we need a numpy array, so we convert them:
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)

# Tokenize the sentences
## first, define some hyperparams:
vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type = 'post'
oov_tok = "<OOV>"

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences) # fit tokenizer on a training set of data
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(training_sentences) # replace strings containing the words with token values
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type) # pad/truncate the sentences until they are of the same length.

# Do the same for testing sequences; there should be more OOV token because we reuse the word index from training step
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length), # a key to performing text sentiment analysis in TF
    # the result of embedding is a 2d array with a len of the sentence and embedding dimension, that's why we have to flatten it:
    # tf.keras.layers.Flatten(),
    # or alternatively, use:
    tf.keras.layers.GlobalAveragePooling1D(), # averages across the vector to flatten it out; used often in NLP
    # Dense NN that performs classification:
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

num_epochs = 10
model.fit(
    padded,
    training_labels_final,
    epochs=num_epochs,
    validation_data=(testing_padded, testing_labels_final)
)

# Obtain and visualise the embeddings:
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # (vocab_size, embedding_dim) -> (10000, 16) - 10K words inb corpus, working in a 16-dim array
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

# Write vectors and metadata to files
import io
out_v = io.open('vects.csv', 'w', encoding='utf-8')
out_m = io.open('meta.csv', 'w', encoding='uft-8')
for word_num in range(1, vocab_size):
    word = reverse_word_index[word_num]
    embeddings = weights[word_num]
    out_m.write(word + '\n')
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()

# Download files in Google Colab:
# try:
#     from google.colab import files
# except ImportError:
#     pass
# else:
#     files.download('vecs.tsv')
#     files.download('meta.tsv')

# projector.tensorflow.org -> Load data to visualize the results

2.3.1
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 120, 16)           160000    
_________________________________________________________________
global_average_pooling1d_2 ( (None, 16)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 6)                 102       
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 7         
Total params: 160,109
Trainable params: 160,109
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
(10000, 16)


## Working with a 'Sarcasm' dataset from week 1

In [4]:
import json
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Setup hyper params:
vocab_size = 10000
embedding_dim = 16
max_length = 32
trunc_type = 'post'
padding_type = 'post'
oov_tok = "<OOV>"
training_size = 20000

!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json \
    -O /tmp/sarcasm.json

with open("/tmp/sarcasm.json", 'r') as f:
    datastore = json.load(f)

sentences = []
labels = []
# load each headline as a sentence together with a label
for item in datastore:
    sentences.append(item["headline"])
    labels.append(item["is_sarcastic"])

# Split corpus into training and validation sentences:
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

# Create sequences and pad them:
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)

word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# Creare a NN in a usual way:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

# Train for 30 epochs:
num_epochs = 30
history = model.fit(training_padded, training_labels, epochs=num_epochs,
                    validation_data=(testing_padded, testing_labels), verbose=2)
# plot the results:
import matplotlib.pyplot as plt

def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_'+string])
    plt.xlabel("#Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_'+string])
    plt.show()

plot_graphs(history, "acc")
plot_graphs(history, "loss")

--2021-01-16 14:35:13--  https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.20.80, 172.217.168.240, 172.217.168.208, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.20.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5643545 (5,4M) [application/json]
Saving to: ‘/tmp/sarcasm.json’


2021-01-16 14:35:13 (32,8 MB/s) - ‘/tmp/sarcasm.json’ saved [5643545/5643545]

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 32, 16)            160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 24)                408  

ValueError: Failed to find data adapter that can handle input: <class 'numpy.ndarray'>, (<class 'list'> containing values of types {"<class 'int'>"})

## Loss function
Loss is a confidence in a prediction.
The # of accurate predictions increased over time, the confidence per prediction effectively decreased
One way to do is to explore the differences as we tweak hyperparams.
e.g by increasing the vocabulary size and max sentence length, might affect the loss and accuracy.
-> Putting hyperparams as separate variables is very important to be able to tweak them.