<a href="https://colab.research.google.com/github/happyrabbit/IntroDataScience/blob/master/Python/TokenizingPadding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
# Load packages
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

We assign each word an integer token and use those tokens to train a neural network. Consider the following review sentences:

In [0]:
sentences = ['This movie is great!',
             'Great movie ? Are you kidding  me ! Not worth the money.',
             'Love it']

We set `num_words = 100`, which caps the vocabulary the tokenizer uses when it encodes text. That is far larger than this toy example needs. When you build a training set from a large corpus, you usually don't know how many distinct words it contains. With this hyperparameter set, the tokenizer keeps only the most frequent words (those whose index is below `num_words`) and encodes just those. It's a handy shortcut when dealing with lots of data, and worth experimenting with when you train on real data. Sometimes using fewer words has minimal impact on training accuracy but a large impact on training time.

`oov_token = "<oov>"` specifies the out-of-vocabulary token, which stands in for words that aren't in the word index.

In [0]:
tokenizer = Tokenizer(num_words=100, oov_token="<oov>")
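
To make the effect of `num_words` and `oov_token` concrete, here is a minimal pure-Python sketch of the idea. It is not the Keras implementation: the helper names `build_word_index` and `encode` and the three short texts are hypothetical, and plain whitespace splitting stands in for real tokenization.

```python
from collections import Counter

def build_word_index(texts, oov_token=None):
    """Number words from 1, most frequent first (hypothetical helper)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count; ties keep first-seen order
    vocab = [word for word, _ in counts.most_common()]
    if oov_token is not None:
        vocab.insert(0, oov_token)  # the OOV token takes index 1
    return {word: i for i, word in enumerate(vocab, start=1)}

def encode(text, word_index, num_words=None, oov_token=None):
    """Encode a text, keeping only words whose index is below num_words."""
    oov_index = word_index.get(oov_token)
    encoded = []
    for word in text.lower().split():
        index = word_index.get(word)
        if index is not None and (num_words is None or index < num_words):
            encoded.append(index)
        elif oov_index is not None:
            encoded.append(oov_index)  # replace rare or unseen words
        # with no OOV token, the word is silently dropped
    return encoded

# Assumed toy corpus, not from the notebook
texts = ["the cat sat", "the cat ran", "the dog"]

with_oov = build_word_index(texts, oov_token="<oov>")
capped_with_oov = encode("the dog sat", with_oov, num_words=4, oov_token="<oov>")

without_oov = build_word_index(texts)
capped_without_oov = encode("the dog sat", without_oov, num_words=3)

print(with_oov)
print(capped_with_oov)
print(capped_without_oov)
```

In this sketch a small cap keeps only the most frequent words; the rest either collapse into the OOV index or disappear, which is why setting an OOV token preserves sentence length.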

The `fit_on_texts` method of the `tokenizer` builds the vocabulary from the sentences. The tokenizer has a `word_index` attribute, a dictionary of key-value pairs in which the key is the word and the value is the token for that word. You can inspect it by printing it out. Note that the tokenizer lower-cases all words and strips punctuation.
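
The lower-casing and punctuation stripping can be approximated in plain Python. The sketch below is an illustration, not the library's code: `normalize` is a hypothetical helper, and the filter string is what I understand the Keras `Tokenizer` default `filters` value to be, so check the documentation before relying on it.

```python
# Assumed to match the Tokenizer's default `filters` argument
ASSUMED_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def normalize(text, filters=ASSUMED_FILTERS, lower=True):
    """Lower-case the text, turn filtered characters into spaces, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: " " for ch in filters})
    # split() with no argument also collapses repeated spaces
    return text.translate(table).split()

words_1 = normalize("This movie is great!")
words_2 = normalize("Great movie ? Are you kidding  me ! Not worth the money.")

print(words_1)
print(words_2)
```

Under these assumptions, `great!` and `Great` both become `great`, which is why the word appears only once in the word index.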

In [0]:
# Build the vocabulary from the sentences
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# Convert each sentence to a sequence of tokens
sequences = tokenizer.texts_to_sequences(sentences)
# Pad or truncate every sequence at the end to a fixed length of 6
padded = pad_sequences(sequences, maxlen=6, padding="post", truncating="post")

![](https://course2020.scientistcafe.com/slides/02DeepLearning/RNN/images/TokenizingPadding.png)
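
What `padding="post"` and `truncating="post"` do can be sketched in a few lines of plain Python. The helper `pad_post` is hypothetical and covers only the "post" case; `pad_sequences` itself defaults to `"pre"` for both arguments and returns a NumPy array rather than a list.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence at the end, then fill at the end, to length maxlen."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                   # truncating="post"
        seq = seq + [value] * (maxlen - len(seq))  # padding="post"
        padded.append(seq)
    return padded

# The token sequences produced for the three review sentences
sequences = [[4, 2, 5, 3], [3, 2, 6, 7, 8, 9, 10, 11, 12, 13], [14, 15]]
padded_sketch = pad_post(sequences, maxlen=6)

for row in padded_sketch:
    print(row)
```

Short sequences gain trailing zeros and the long one loses everything after its sixth token, so every row ends up with the same length.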

In [41]:
print("\nWord Index = ", word_index)
print("\nSequences = ", sequences)
print("\nPadded Sequences:")
print(padded)


Word Index =  {'<oov>': 1, 'movie': 2, 'great': 3, 'this': 4, 'is': 5, 'are': 6, 'you': 7, 'kidding': 8, 'me': 9, 'not': 10, 'worth': 11, 'the': 12, 'money': 13, 'love': 14, 'it': 15}

Sequences =  [[4, 2, 5, 3], [3, 2, 6, 7, 8, 9, 10, 11, 12, 13], [14, 15]]

Padded Sequences:
[[ 4  2  5  3  0  0]
 [ 3  2  6  7  8  9]
 [14 15  0  0  0  0]]


In [0]:
# Invert the word index so that each token maps back to its word
reverse_word_index = {value: key for key, value in word_index.items()}

In [36]:
def decode_review(text):
  # Map each token back to its word; tokens without a word (such as padding 0) become '?'
  return ' '.join([reverse_word_index.get(i, '?') for i in text])
print(decode_review(padded[1]))

great movie are you kidding me
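
Because the padding value 0 has no entry in the word index, decoding a padded row yields a `?` for every padding position. One possible variant, sketched here with a hypothetical helper `decode_without_padding` and the word index printed earlier, skips the padding tokens instead.

```python
# Word index as printed earlier in the notebook
word_index = {'<oov>': 1, 'movie': 2, 'great': 3, 'this': 4, 'is': 5,
              'are': 6, 'you': 7, 'kidding': 8, 'me': 9, 'not': 10,
              'worth': 11, 'the': 12, 'money': 13, 'love': 14, 'it': 15}
reverse_word_index = {index: word for word, index in word_index.items()}

def decode_with_padding(token_ids):
    """Same logic as decode_review: unknown tokens become '?'."""
    return ' '.join(reverse_word_index.get(i, '?') for i in token_ids)

def decode_without_padding(token_ids, pad_id=0):
    """Skip padding tokens before mapping tokens back to words."""
    return ' '.join(reverse_word_index.get(i, '?')
                    for i in token_ids if i != pad_id)

first_row = [4, 2, 5, 3, 0, 0]  # first padded review
raw_text = decode_with_padding(first_row)
clean_text = decode_without_padding(first_row)

print(raw_text)
print(clean_text)
```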
