
# NLP Topic in TensorFlow

## Tokenization

Same old, same old libraries!

In [None]:
import tensorflow as tf
from tensorflow import keras

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
# Our sentences
sentences =[
            'i love my dog cat',
            'I, love my cat'
]

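**`num_words` is the maximum number of words to keep, ranked by frequency. It makes no difference here because we only have two sentences, but imagine tokenizing hundreds of books and wanting to keep only the most common words (with `num_words=100`, that means indices 1 to 99). Note that `word_index` still lists every word; the limit is applied only when text is converted to sequences.**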
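As a rough illustration of what `num_words` does, the sketch below mimics the cut-off in plain Python. It assumes that token indices are frequency ranks (1 is the most common word) and that only indices below `num_words` survive when text is turned into sequences. The helper `keep_top_words` and the sample sequence are made up for illustration; they are not part of Keras.

```python
def keep_top_words(sequence, num_words):
    """Drop every token index that is not below num_words.

    Hypothetical helper: a simplified picture of the cut-off that the
    Keras Tokenizer applies when converting text to sequences.
    """
    return [idx for idx in sequence if idx < num_words]


# Indices are frequency ranks: 1 is the most common word.
sample_sequence = [4, 2, 1, 3, 10, 7]

print(keep_top_words(sample_sequence, num_words=100))  # nothing is dropped
print(keep_top_words(sample_sequence, num_words=5))    # only ranks 1 to 4 remain
```

With a tiny vocabulary like ours, `num_words=100` never drops anything, which is why the outputs in this notebook look the same with or without it.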

In [None]:
tokenizer = Tokenizer(num_words=100)

In [None]:
tokenizer.fit_on_texts(sentences)

In [None]:
word_index = tokenizer.word_index
print(word_index)

{'i': 1, 'love': 2, 'my': 3, 'cat': 4, 'dog': 5}


**The tokenizer lowercases the text and strips punctuation by default, so "dog!" below is treated as the same word as "dog", and "I," as "i".**

In [None]:
# Our sentences
sentences =[
            'i love my dog',
            'I, love my cat',
            'You love my dog!'
]

In [None]:
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}


As you can see, the tools in TensorFlow handle tokenizing words for you.
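To make the lowercasing, punctuation stripping, and frequency ranking concrete, here is a rough pure-Python sketch of how such a word index could be built. It is a simplified stand-in, not the Keras implementation: the `PUNCTUATION` list and the helper names `simple_tokenize` and `build_word_index` are assumptions made for this example.

```python
from collections import Counter

# Assumed punctuation list for this sketch (the real Keras filter list may differ).
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'


def simple_tokenize(text):
    """Lowercase, replace punctuation with spaces, then split on whitespace."""
    table = str.maketrans({ch: " " for ch in PUNCTUATION})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Rank words by frequency; the most common word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokenize(text))
    # most_common() sorts by count; words with equal counts keep first-seen order.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]
print(build_word_index(sentences))
```

Index 0 is never assigned to a word, which leaves it free to be used for padding later on.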

Now that words are represented by numbers, the next step is to represent each sentence as a sequence of those numbers in the correct order.

## Turning sentences into data

Time to create sequences from sentences!

Let's try a different example. This time, **the sentences will have different lengths.**

In [None]:
# Our sentences
sentences =[
            'i love my dog',
            'I, love my cat',
            'You love my dog!',
            'Do you think my dog is amazing?'
]

In [None]:
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

In [None]:
print(word_index)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}


**`texts_to_sequences` creates a sequence of tokens representing each sentence.**

In [None]:
sequences = tokenizer.texts_to_sequences(sentences)

In [None]:
sequences

[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]

You can check this against the first sentence: "I love my dog" -> [4, 2, 1, 3]

**What about words that the tokenizer has never seen before?**

In this example, the test sentences contain the **new words "really", "loves", and "food"**.

In [None]:
# Try with new sentences
test_data=[
           'i really love my dog',
           'my dog loves my food'
]

In [None]:
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

[[4, 2, 1, 3], [1, 3, 1]]


In [None]:
print(word_index)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}


**The unseen words are simply dropped, so you would need a really big word index to handle sentences that are not in the training set.**

**There is a trick to avoid losing the length of the sequence like above!**

**We create a special token that would never appear in any real text, like "\<OOV\>" (out of vocabulary)**. Then every word we have never seen before is replaced with that token instead of being dropped.

In [None]:
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


In [None]:
test_seq=tokenizer.texts_to_sequences(test_data)
print(test_seq)

[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]


Now each sequence has the same length as its original sentence. Pretty neat trick, right?
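The difference between dropping unknown words and replacing them can be sketched in a few lines of plain Python. This is only an approximation of what the tokenizer does (it splits on whitespace and ignores punctuation handling), and the helper name `to_sequence` is made up for this example. The two word indexes are copied from the outputs above.

```python
def to_sequence(text, word_index, oov_token=None):
    """Map each word to its index.

    Unknown words are skipped, unless oov_token is given, in which case
    they are replaced by the index of that token.
    """
    sequence = []
    for word in text.lower().split():
        if word in word_index:
            sequence.append(word_index[word])
        elif oov_token is not None:
            sequence.append(word_index[oov_token])
        # Otherwise the unknown word is silently dropped.
    return sequence


plain_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6,
               'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
oov_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
             'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

test_data = [
    'i really love my dog',
    'my dog loves my food'
]

print([to_sequence(s, plain_index) for s in test_data])
# [[4, 2, 1, 3], [1, 3, 1]]
print([to_sequence(s, oov_index, oov_token='<OOV>') for s in test_data])
# [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
```

Note that "loves" counts as unknown because only "love" is in the index; the tokenizer does no stemming.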

Another problem is how the model can handle sequences of different lengths. Remember that when we trained on images, they all had to be the same size.

## Padding sequences

There are two options: ragged tensors (the advanced solution) or `pad_sequences` (the easy solution).

In [None]:
# Our sentences
sentences =[
            'i love my dog',
            'I, love my cat',
            'You love my dog!',
            'Do you think my dog is amazing?'
]

In [None]:
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


In [None]:
sequences=tokenizer.texts_to_sequences(sentences)
print(sequences)

[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]


In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
padded = pad_sequences(sequences)
print(padded)

[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]


Nice, so by default it pads at the beginning!

What if we want to pad them at the end?

In [None]:
padded = pad_sequences(sequences, padding='post')
print(padded)

[[ 5  3  2  4  0  0  0]
 [ 5  3  2  7  0  0  0]
 [ 6  3  2  4  0  0  0]
 [ 8  6  9  2  4 10 11]]


We can also set `maxlen` ourselves instead of using the length of the longest sentence.

If a sentence is longer than `maxlen`, we can truncate it to fit, removing words from the end (`truncating='post'`) or from the beginning (`truncating='pre'`).

In [None]:
padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=6)
print(padded)

[[ 5  3  2  4  0  0]
 [ 5  3  2  7  0  0]
 [ 6  3  2  4  0  0]
 [ 8  6  9  2  4 10]]


In [None]:
padded = pad_sequences(sequences, padding='post', truncating='pre', maxlen=6)
print(padded)

[[ 5  3  2  4  0  0]
 [ 5  3  2  7  0  0]
 [ 6  3  2  4  0  0]
 [ 6  9  2  4 10 11]]
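The padding and truncation rules shown above can be re-created in plain Python, which may help when reasoning about the `padding`, `truncating`, and `maxlen` arguments. This is a minimal sketch, not the Keras implementation: the name `pad_sequences_sketch` is made up, it returns nested lists rather than a NumPy array, and it assumes `maxlen` is a positive integer when given.

```python
def pad_sequences_sketch(sequences, maxlen=None, padding='pre',
                         truncating='pre', value=0):
    """Pad or truncate every sequence to the same length."""
    if maxlen is None:
        # Default: use the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' removes tokens from the front, 'post' from the back.
            seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the padding in front, 'post' puts it at the end.
        result.append(fill + seq if padding == 'pre' else seq + fill)
    return result


sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

for row in pad_sequences_sketch(sequences, padding='post',
                                truncating='pre', maxlen=6):
    print(row)
```
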


Now you know how to tokenize text into numeric values and how to pad and truncate the resulting sequences to a regular length. So the preprocessing is done!

Time to train our juicy network model on these sentence representations to detect whether a sentence is sarcastic or not! But how can we make sure these numbers are meaningful when it comes to sentiment analysis? For that we need embeddings!

## Embedding Layer

![](https://i.imgur.com/FQGHA81.png)

Let's talk a bit about sentiment. We can place **Bad** and **Good** in opposite directions, at [-1, 0] and [1, 0]. **Meh** is not that bad, so it can be [-0.4, 0.7]. Similarly, **Not Bad** means a bit of goodness but not much, so it can be [0.5, 0.7].

So by looking at the directions, we can determine the meaning of words.
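One common way to compare directions is cosine similarity: 1 means the same direction, -1 means opposite directions. The sketch below applies it to the four example vectors above. The choice of cosine similarity is an assumption made for illustration; the text only says that direction carries meaning.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vectors = {
    'bad': [-1.0, 0.0],
    'good': [1.0, 0.0],
    'meh': [-0.4, 0.7],
    'not bad': [0.5, 0.7],
}

# Compare every word against "good".
for word, vec in vectors.items():
    print(word, round(cosine_similarity(vectors['good'], vec), 3))
```

"Bad" comes out exactly opposite to "good", "not bad" leans towards "good", and "meh" leans slightly away from it.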

Imagine that we train on our data with a very large number of dimensions instead of two. The model can figure out which direction a sarcastic vector should point in.

Sarcastic words will be strong in the sarcastic direction, and other words will point in the non-sarcastic direction.

As we load more and more data into the model for training, these directions can change. Once the network is fully trained, we can take the vectors of the words in a sentence and sum them up to get an idea of the whole sentence. This is the idea behind embeddings.

Here is an example of an embedding built by hand, with dimensions that are meaningful to humans.

![alt text](https://i.imgur.com/Y9pBIxA.png)

We can always project them onto a 2D plane to check their similarity.

![alt text](https://i.imgur.com/LaPXFle.png)

We can spot the relationships between words here!

![alt text](https://i.imgur.com/DjlWeOs.png)

This is a real visualization of a real word embedding trained by Stanford. Each word vector has 300 dimensions, that is, 300 shades of word meaning that only make sense to the computer, since they are learned by backpropagation. The vocabulary contains around 250,000 words from their corpus.

![alt text](https://i.imgur.com/lMiV9en.jpg)

In [None]:
from tensorflow.keras import layers

In [None]:
vocab_size = 12 
embedding_dim = 3 # e.g. one dimension each for good, bad, fun
embedding_layer = layers.Embedding(vocab_size, embedding_dim)

In [None]:
result = embedding_layer(tf.constant([0,1,2,3,4,5,6,7,8,9,10,11]))
result.numpy()

array([[ 0.04373587,  0.00484896, -0.04774035],
       [ 0.03414878,  0.03441763,  0.02784688],
       [-0.02247711, -0.03748335, -0.03479894],
       [ 0.01519536, -0.04087581, -0.03563573],
       [ 0.04481051,  0.01054685, -0.02298336],
       [-0.02589405,  0.02384074,  0.02852238],
       [-0.02330941, -0.02056179, -0.02227958],
       [ 0.03142171,  0.02462685,  0.00811763],
       [-0.04993236, -0.00650457, -0.04094852],
       [-0.04656242, -0.00856736, -0.00303025],
       [-0.03515745, -0.01581569,  0.04941536],
       [-0.03581141, -0.02356493,  0.00055293]], dtype=float32)

The above is the full embedding matrix, with one row per token index. We need a way to retrieve the correct embedding vector for each word, and then for each sentence!
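Conceptually, retrieving the vectors is just row selection: token index `i` picks row `i` of the matrix. Here is a plain-Python sketch of that lookup, using the first four rows of the matrix printed above (the table is cut short for brevity, and the helper name `embed` is made up for this example).

```python
# First four rows of the embedding matrix printed above.
embedding_table = [
    [0.04373587, 0.00484896, -0.04774035],    # index 0 (used for padding)
    [0.03414878, 0.03441763, 0.02784688],     # index 1
    [-0.02247711, -0.03748335, -0.03479894],  # index 2
    [0.01519536, -0.04087581, -0.03563573],   # index 3
]


def embed(token_ids, table):
    """Look up one row of the table per token index."""
    return [table[i] for i in token_ids]


sentence_vectors = embed([1, 2, 3], embedding_table)
for vector in sentence_vectors:
    print(vector)
```

A sequence of length `n` therefore becomes an `n x embedding_dim` block, and every padded position maps to the same row 0.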

![alt text](https://i.imgur.com/z3qObl7.png)

In [None]:
result = embedding_layer(tf.constant([1,2,3]))
result.numpy()

array([[ 0.03414878,  0.03441763,  0.02784688],
       [-0.02247711, -0.03748335, -0.03479894],
       [ 0.01519536, -0.04087581, -0.03563573]], dtype=float32)

In [None]:
first_sentence = padded[0]
first_sentence

array([5, 3, 2, 4, 0, 0], dtype=int32)

In [None]:
result=embedding_layer(tf.constant(first_sentence))
print(result.shape)
result.numpy()

(6, 3)


array([[ 0.04572806, -0.03515423,  0.00194607],
       [ 0.00308593,  0.04265386,  0.04840359],
       [-0.03613716, -0.01655632,  0.04892329],
       [ 0.03061987,  0.02383539,  0.01435807],
       [ 0.04088991, -0.02139552,  0.03599666],
       [ 0.04088991, -0.02139552,  0.03599666]], dtype=float32)

In [None]:
result=embedding_layer(tf.constant(padded))
print(result.shape)
result.numpy()

(4, 7, 3)


array([[[-0.02589405,  0.02384074,  0.02852238],
        [ 0.01519536, -0.04087581, -0.03563573],
        [-0.02247711, -0.03748335, -0.03479894],
        [ 0.04481051,  0.01054685, -0.02298336],
        [ 0.04373587,  0.00484896, -0.04774035],
        [ 0.04373587,  0.00484896, -0.04774035],
        [ 0.04373587,  0.00484896, -0.04774035]],

       [[-0.02589405,  0.02384074,  0.02852238],
        [ 0.01519536, -0.04087581, -0.03563573],
        [-0.02247711, -0.03748335, -0.03479894],
        [ 0.03142171,  0.02462685,  0.00811763],
        [ 0.04373587,  0.00484896, -0.04774035],
        [ 0.04373587,  0.00484896, -0.04774035],
        [ 0.04373587,  0.00484896, -0.04774035]],

       [[-0.02330941, -0.02056179, -0.02227958],
        [ 0.01519536, -0.04087581, -0.03563573],
        [-0.02247711, -0.03748335, -0.03479894],
        [ 0.04481051,  0.01054685, -0.02298336],
        [ 0.04373587,  0.00484896, -0.04774035],
        [ 0.04373587,  0.00484896, -0.04774035],
        [ 0.0437

Now you have the representation of embedded sentences, so how do we train a model for sentiment analysis? There are two ways to do that.

But first, consider this cool startup idea!

Have you ever wanted to make your text messages more expressive? An emojifier app can help you do that. So rather than writing:

"Congratulations on the promotion! Let's get coffee and talk. Love you!"

The emojifier can automatically turn this into:

"Congratulations on the promotion! 👍 Let's get coffee and talk. ☕️ Love you! ❤️"

The model takes a sentence as input (such as "Let's go see the baseball game tonight!") and finds the most appropriate emoji to use with it (⚾️).

![alt text](https://i.imgur.com/HfEHlM0.png)


The first way is to take the average of all embedded vectors:

![alt text](https://i.imgur.com/5koy8KM.png)
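Averaging can be sketched in plain Python: for each embedding dimension, take the mean over all positions in the sentence. This mirrors what a global average pooling layer does along the sequence axis when no masking is used, so padded positions are averaged in as well. The vectors below are made-up numbers chosen to keep the arithmetic easy, and `average_pool` is a hypothetical helper name.

```python
def average_pool(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / count for d in range(dim)]


# Three made-up word vectors with embedding_dim = 3.
embedded_sentence = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

print(average_pool(embedded_sentence))  # [2.0, 2.0, 1.0]
```

Whatever the sentence length, the result always has `embedding_dim` values, which is what lets a dense layer follow it.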


The second way is to plug all of that into LSTM layers:

![alt text](https://i.imgur.com/Sa8ipts.png)

## Possible Models using Embeddings

In [None]:
vocab_size=12
embedding_dim=3
max_length=6

In [None]:
# Option 1: average the word vectors of each sentence, then classify
model1 = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(24, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
])
# OR
# Option 2: read the word vectors in order with an LSTM
model2 = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
        tf.keras.layers.LSTM(max_length),
        tf.keras.layers.Dense(max_length, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
])
# OR
# Option 3: read the sentence in both directions with a bidirectional LSTM
model3 = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(max_length)),
        tf.keras.layers.Dense(max_length, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
])

In [None]:
model1.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 6, 3)              36        
_________________________________________________________________
global_average_pooling1d (Gl (None, 3)                 0         
_________________________________________________________________
dense (Dense)                (None, 24)                96        
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 25        
Total params: 157
Trainable params: 157
Non-trainable params: 0
_________________________________________________________________


In [None]:
model2.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 6, 3)              36        
_________________________________________________________________
lstm (LSTM)                  (None, 6)                 240       
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 42        
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 7         
Total params: 325
Trainable params: 325
Non-trainable params: 0
_________________________________________________________________


In [None]:
model3.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 6, 3)              36        
_________________________________________________________________
bidirectional (Bidirectional (None, 12)                480       
_________________________________________________________________
dense_4 (Dense)              (None, 6)                 78        
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 7         
Total params: 601
Trainable params: 601
Non-trainable params: 0
_________________________________________________________________
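The parameter counts in the three summaries can be checked by hand. The sketch below uses the standard formulas: an embedding has one weight per vocabulary entry and dimension, a dense layer has one weight per input-unit pair plus one bias per unit, and an LSTM has four gates that each take the input and the previous hidden state plus a bias. The helper function names are made up for this example.

```python
def embedding_params(vocab_size, embedding_dim):
    return vocab_size * embedding_dim


def dense_params(inputs, units):
    # One weight per (input, unit) pair, plus one bias per unit.
    return inputs * units + units


def lstm_params(inputs, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((inputs + units) * units + units)


vocab_size, embedding_dim, max_length = 12, 3, 6
emb = embedding_params(vocab_size, embedding_dim)  # 36

# model1: Embedding -> GlobalAveragePooling1D (no weights) -> Dense(24) -> Dense(1)
model1_total = emb + dense_params(embedding_dim, 24) + dense_params(24, 1)

# model2: Embedding -> LSTM(6) -> Dense(6) -> Dense(1)
lstm = lstm_params(embedding_dim, max_length)  # 240
model2_total = emb + lstm + dense_params(max_length, max_length) + dense_params(max_length, 1)

# model3: the bidirectional wrapper holds two LSTMs and concatenates their outputs
model3_total = emb + 2 * lstm + dense_params(2 * max_length, max_length) + dense_params(max_length, 1)

print(model1_total, model2_total, model3_total)  # 157 325 601
```
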
