
In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!',
    'My dog loves me.',
    'My dog loves going for walk.',
    'My dog loves going for walks with me.',
    'Do you think my dog is amazing?'
]



In [None]:
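# num_words caps how many of the most frequent words are used when encoding texts
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)  # build the vocabulary from the corpus
word_index = tokenizer.word_index  # maps each word to its integer token
print(word_index)

# Print one "token<TAB>word" pair per line
for key in word_index:
  print(f"{word_index[key]}\t{key}")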

In [None]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
for seq in sequences:
  print(seq)

In [None]:
test_data = [
             'I really love my dog!',
             'My dog loves my manatee.'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)


Converting `test_data` to sequences gives:
```
[
  [5, 3, 1, 2],    # [I love my dog]      ~really
  [1, 2, 4, 1]     # [My dog loves my]    ~manatee
]
```
The tokens for 'really' and 'manatee' are missing from `test_seq` because these words do not appear in the corpus the tokenizer was fitted on, so they have no entry in `word_index`.

To handle this, the tokenizer accepts an __out-of-vocabulary__ token through its `oov_token` argument:
>>> `oov_token="<OOV>"`

Set `oov_token` to a string that does not occur in the corpus.
The tokenizer reserves a token for it and uses that token in place of any word it does not recognize.


In [None]:
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

for key in word_index:
  print(f"{word_index[key]}\t{key}")

sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

for seq in sequences:
  print(seq)

test_data = [
             'I really love my dog!',
             'My dog loves my manatee.'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print()
print('1 is being used for oov_token')
print(test_data, "\n\t", test_seq)





Note that token `1` is used for both unknown words, "really" and "manatee".

Substituting the OOV token keeps each sequence the same length as its sentence, even when the sentence contains words outside the corpus.
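
The effect of `oov_token` on sequence length can be sketched in plain Python. The `encode` function below is a hypothetical stand-in written for illustration, not the Keras implementation; it assumes a simplified cleaning step (lowercasing and replacing punctuation with spaces), and the two small vocabularies are hand-written subsets consistent with the token ids used in this lesson.

```python
import string

def encode(text, word_index, oov_token=None):
    """Map a sentence to token ids using a plain dict lookup (illustrative only)."""
    # Assumed cleaning step: lowercase and turn punctuation into spaces,
    # roughly what the Keras Tokenizer does with its default settings.
    cleaned = text.lower().translate(
        str.maketrans(string.punctuation, " " * len(string.punctuation))
    )
    ids = []
    for word in cleaned.split():
        if word in word_index:
            ids.append(word_index[word])
        elif oov_token is not None:
            # Unknown word: substitute the id reserved for the OOV token.
            ids.append(word_index[oov_token])
        # Without an OOV token, unknown words are silently dropped.
    return ids

# Hand-written subsets of the vocabularies (hypothetical names).
vocab_plain = {"my": 1, "dog": 2, "love": 3, "loves": 4, "i": 5}
vocab_oov = {"<OOV>": 1, "my": 2, "dog": 3, "love": 4, "loves": 5, "i": 6}

dropped = encode("I really love my dog!", vocab_plain)
kept = encode("I really love my dog!", vocab_oov, oov_token="<OOV>")
print(dropped)  # [5, 3, 1, 2]    -> 4 tokens for a 5-word sentence
print(kept)     # [6, 1, 4, 2, 3] -> 5 tokens, 'really' became 1
```

With the OOV token, the encoded sequence has one id per word, so word positions in the original sentence are preserved.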

One cost of this approach is that different unknown words become indistinguishable, because they all share the same `oov_token`.

A separate problem remains: the sentences differ in length, so their sequences do too, while a model typically expects inputs of uniform size.

One way to handle variable-length sequences is a `RaggedTensor`, but that is a more advanced solution.

A simpler solution is _padding_.

First, import `pad_sequences` from `tensorflow.keras.preprocessing.sequence`:



In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

To pad the `sequences`, pass them to `pad_sequences()`:

In [None]:
padded = pad_sequences(sequences)
print(word_index)
print(sequences)
print(padded)

{'<OOV>': 1, 'my': 2, 'dog': 3, 'love': 4, 'loves': 5, 'i': 6, 'you': 7, 'me': 8, 'going': 9, 'for': 10, 'cat': 11, 'walk': 12, 'walks': 13, 'with': 14, 'do': 15, 'think': 16, 'is': 17, 'amazing': 18}
[[6, 4, 2, 3], [6, 4, 2, 11], [7, 4, 2, 3], [2, 3, 5, 8], [2, 3, 5, 9, 10, 12], [2, 3, 5, 9, 10, 13, 14, 8], [15, 7, 16, 2, 3, 17, 18]]
[[ 0  0  0  0  6  4  2  3]
 [ 0  0  0  0  6  4  2 11]
 [ 0  0  0  0  7  4  2  3]
 [ 0  0  0  0  2  3  5  8]
 [ 0  0  2  3  5  9 10 12]
 [ 2  3  5  9 10 13 14  8]
 [ 0 15  7 16  2  3 17 18]]


The padding value is `0`.

By default, sequences are pre-padded: the zeros are placed before the tokens, up to the length of the longest sequence.
To place the padding after the tokens instead, use the option `padding='post'`:


In [None]:
padded = pad_sequences(sequences, padding='post')
print(word_index)
print(sequences)
print(padded)

{'<OOV>': 1, 'my': 2, 'dog': 3, 'love': 4, 'loves': 5, 'i': 6, 'you': 7, 'me': 8, 'going': 9, 'for': 10, 'cat': 11, 'walk': 12, 'walks': 13, 'with': 14, 'do': 15, 'think': 16, 'is': 17, 'amazing': 18}
[[6, 4, 2, 3], [6, 4, 2, 11], [7, 4, 2, 3], [2, 3, 5, 8], [2, 3, 5, 9, 10, 12], [2, 3, 5, 9, 10, 13, 14, 8], [15, 7, 16, 2, 3, 17, 18]]
[[ 6  4  2  3  0  0  0  0]
 [ 6  4  2 11  0  0  0  0]
 [ 7  4  2  3  0  0  0  0]
 [ 2  3  5  8  0  0  0  0]
 [ 2  3  5  9 10 12  0  0]
 [ 2  3  5  9 10 13 14  8]
 [15  7 16  2  3 17 18  0]]
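
The difference between pre- and post-padding shown above can be sketched in plain Python. The `pad` helper below is hypothetical and only mimics the default behaviour of `pad_sequences` as seen in these outputs (fill with `0` up to the length of the longest sequence); it returns plain lists rather than a NumPy array and ignores every other option.

```python
def pad(sequences, padding="pre", value=0):
    """Pad every sequence to the length of the longest one (illustrative only)."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [value] * (max_len - len(seq))
        # 'pre' puts the fill before the tokens, 'post' puts it after.
        if padding == "pre":
            padded.append(fill + list(seq))
        else:
            padded.append(list(seq) + fill)
    return padded

# Three of the sequences from this lesson, with lengths 4, 6 and 8.
example = [[6, 4, 2, 3], [2, 3, 5, 9, 10, 12], [2, 3, 5, 9, 10, 13, 14, 8]]

pre = pad(example)
post = pad(example, padding="post")
print(pre[0])   # [0, 0, 0, 0, 6, 4, 2, 3]
print(post[0])  # [6, 4, 2, 3, 0, 0, 0, 0]
```

The longest sequence is left unchanged in both modes, which matches the sixth row of the two padded outputs.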
