### Vanilla tokenizing using Keras

In [0]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [26]:
print(tf.__version__)

2.2.0-rc3


In [0]:
sentences = [
             'I love my dog',
             'I love my cat'
]

In [0]:
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

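**Note:** `num_words` does not stop the tokenizer from indexing more than 100 words. The word index keeps every word seen during fitting; the limit is applied later, when texts are converted to sequences, where only the most frequent words (those with an index below `num_words`) are used.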
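
To make the note concrete, here is a minimal pure-Python sketch of the idea. It is not the Keras implementation: `build_word_index` and `encode` are hypothetical helpers, and the sketch only lowercases and splits on whitespace, ignoring the punctuation filtering that the real tokenizer performs.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency, most frequent first; indices start at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by count; ties keep their first-seen order.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text, word_index, num_words=None):
    """Encode a text, dropping unknown words and indices >= num_words."""
    ids = []
    for word in text.lower().split():
        index = word_index.get(word)
        if index is None:
            continue  # word never seen during fitting
        if num_words is not None and index >= num_words:
            continue  # word is indexed, but too rare to be used
        ids.append(index)
    return ids

sentences = ['I love my dog', 'I love my cat']
word_index = build_word_index(sentences)

print(word_index)                                          # all 5 words are indexed
print(encode('I love my dog', word_index, num_words=100))  # [1, 2, 3, 4]
print(encode('I love my dog', word_index, num_words=3))    # [1, 2]
```

With the limit set to 3, the index still contains all five words, but only the words with index 1 and 2 survive encoding.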

In [29]:
print(word_index)

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}


____

In [0]:
sentences = [
             'I love my dog',
             'I love my cat',
             'You love my dog!'
]

In [0]:
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

In [32]:
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}


______

In [0]:
sentences = [
             'I love my dog',
             'I love my cat',
             'You love my dog!',
             'Do you think my dog is amazing?'
]

In [0]:
tokenizer = Tokenizer(num_words= 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

In [0]:
sequences = tokenizer.texts_to_sequences(sentences)

In [36]:
print(word_index, end = '\n\n')
print(sequences)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}

[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]


Each sentence is now a sequence in which every word is replaced with its corresponding numeric token.
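
One way to check the mapping is to invert the word index and decode the sequences back into words. The sketch below hardcodes the word index and sequences printed above, so it runs without TensorFlow; the names `index_word` and `decoded` are introduced here only for illustration.

```python
# Word index and sequences copied from the output above.
word_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6,
              'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
sequences = [[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]

# Invert the mapping: token index -> word.
index_word = {index: word for word, index in word_index.items()}

# Rebuild each sentence from its token indices.
decoded = [' '.join(index_word[i] for i in seq) for seq in sequences]
for line in decoded:
    print(line)
```

The decoded sentences come back lowercased and without punctuation, because that information is not kept in the word index.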

Next, we tokenize test sentences that contain words outside the vocabulary, which gives some interesting results.

In [0]:
test_data = ['i really love my dog',
            'my dog loves my manatee']

In [38]:
print(word_index, end = '\n\n')
print(tokenizer.texts_to_sequences(test_data))

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}

[[4, 2, 1, 3], [1, 3, 1]]


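We can observe that out-of-vocabulary words are silently skipped: "really" is missing from the first sequence, and "loves" and "manatee" from the second. This is not the best way to handle them, because the sequences no longer reflect the length of the original sentences.

Instead, we will replace such words with a special token, \<OOV>, which simply marks an out-of-vocabulary word.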


In [0]:
tokenizer = Tokenizer(num_words=100, oov_token= '<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

In [0]:
sequences = tokenizer.texts_to_sequences(sentences)

In [41]:
print(word_index, end = '\n\n')
print(tokenizer.texts_to_sequences(test_data))

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]


Now we can observe that \<OOV> has been added to the word index with index 1, and every out-of-vocabulary word in the test sentences is encoded as 1.
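
The replacement step can be sketched in plain Python as a dictionary lookup with a fallback. This is an illustration under simplifying assumptions (whitespace splitting only), not the Keras code; `encode_with_oov` is a hypothetical helper and the word index is copied from the output above.

```python
# Word index copied from the output above.
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

def encode_with_oov(text, word_index, oov_token='<OOV>'):
    """Map each word to its index, falling back to the OOV index."""
    oov_index = word_index[oov_token]
    return [word_index.get(word, oov_index) for word in text.lower().split()]

test_data = ['i really love my dog',
             'my dog loves my manatee']
encoded = [encode_with_oov(text, word_index) for text in test_data]
print(encoded)  # [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
```

Unlike the earlier run without an OOV token, each encoded sequence now has as many tokens as its sentence has words.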

____

The sentences have different lengths, so we need to convert the sequences to a uniform length.

In [0]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [0]:
tokenizer = Tokenizer(num_words=100, oov_token= '<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

In [0]:
sequences = tokenizer.texts_to_sequences(sentences)

In [0]:
padded = pad_sequences(sequences)

In [46]:
print(word_index, end = '\n\n')
print(sequences, end = '\n\n')
print(padded)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]


We can observe that every sequence now has the length of the longest one, with the shorter sequences padded with zeros at the beginning.

We can pad from the end as well.

In [0]:
padded = pad_sequences(sequences, padding='post')

In [48]:
print(word_index, end = '\n\n')
print(sequences, end = '\n\n')
print(padded)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

[[ 5  3  2  4  0  0  0]
 [ 5  3  2  7  0  0  0]
 [ 6  3  2  4  0  0  0]
 [ 8  6  9  2  4 10 11]]
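
Both padding modes shown above can be sketched in a few lines of plain Python. The `pad` function below is a hypothetical helper that returns nested lists rather than a NumPy array, so it only illustrates the idea behind `pad_sequences`, not its implementation.

```python
def pad(sequences, padding='pre', value=0):
    """Pad every sequence with `value` up to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [value] * (longest - len(seq))
        # 'pre' puts the filler in front, anything else puts it at the end.
        padded.append(filler + seq if padding == 'pre' else seq + filler)
    return padded

sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
pre_padded = pad(sequences)
post_padded = pad(sequences, padding='post')

print(pre_padded[0])   # [0, 0, 0, 5, 3, 2, 4]
print(post_padded[0])  # [5, 3, 2, 4, 0, 0, 0]
```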


We can also cap the length with `maxlen`; sequences longer than that are truncated to fit.

In [0]:
padded = pad_sequences(sequences, padding='post', maxlen= 5)

In [50]:
print(word_index, end = '\n\n')
print(sequences, end = '\n\n')
print(padded)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

[[ 5  3  2  4  0]
 [ 5  3  2  7  0]
 [ 6  3  2  4  0]
 [ 9  2  4 10 11]]


The longest sequence has been truncated from the beginning, which is the default (`truncating='pre'`). Note that truncation is applied to each sequence before padding, not after.

We can also truncate from the end.

In [0]:
padded = pad_sequences(sequences, padding='post', maxlen= 5, truncating= 'post')

In [52]:
print(word_index, end = '\n\n')
print(sequences, end = '\n\n')
print(padded)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

[[5 3 2 4 0]
 [5 3 2 7 0]
 [6 3 2 4 0]
 [8 6 9 2 4]]
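
The two truncation modes above amount to list slicing. The sketch below uses a hypothetical `truncate` helper on plain lists to show which end is dropped in each mode; it is an illustration, not the Keras implementation.

```python
def truncate(seq, maxlen, truncating='pre'):
    """Cut a sequence down to at most `maxlen` tokens."""
    if len(seq) <= maxlen:
        return list(seq)  # short sequences are left alone; padding fills them later
    # 'pre' drops tokens from the front, anything else drops them from the end.
    return seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]

long_seq = [8, 6, 9, 2, 4, 10, 11]
short_seq = [5, 3, 2, 4]

print(truncate(long_seq, 5))                     # [9, 2, 4, 10, 11]
print(truncate(long_seq, 5, truncating='post'))  # [8, 6, 9, 2, 4]
print(truncate(short_seq, 5))                    # [5, 3, 2, 4]
```

Which end to keep depends on the data: the choice decides whether the start or the end of a long sentence is lost.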


***You're welcome!***

