In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [3]:
sentenses = [
    'I love my dog',
    'i, love my cat',
    'You love my dog!',
    'Do you think my dog is amazing'
]
# num_words caps the vocabulary used when encoding text, which is useful for a large corpus.
# With num_words=100, only the most frequent words (the top num_words - 1) are kept.
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentenses)
word_index = tokenizer.word_index
print(word_index)

{'i': 4, 'cat': 6, 'dog': 3, 'you': 5, 'my': 1, 'think': 8, 'love': 2, 'is': 9, 'do': 7, 'amazing': 10}


The TensorFlow tokenizer performs the following transformations:
* converts text to lowercase characters ('i' and 'I' are represented by the same token)
* removes punctuation (for example, 'dog!' and 'dog' are tokenized the same way)
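
To make these two transformations concrete, here is a minimal pure-Python sketch of the same idea. It is not the Keras implementation: the helper name `normalize` is hypothetical, and it assumes punctuation means the characters in Python's `string.punctuation`, which may differ slightly from the filter set Keras uses.

```python
import string

# Assumption: "punctuation" is whatever string.punctuation lists.
# The Keras filter set may differ slightly.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize(text):
    """Lowercase the text, replace punctuation with spaces, split on whitespace."""
    return text.lower().translate(_PUNCT_TO_SPACE).split()


print(normalize("I love my dog"))
print(normalize("i, love my cat"))
print(normalize("You love my dog!"))
```

After normalization, 'I' and 'i,' both become 'i', and 'dog!' becomes 'dog', so they end up sharing a single token.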

# Transform text to sequences

In [4]:
sequences = tokenizer.texts_to_sequences(sentenses)
print(sequences)

[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]


Here the first phrase, 'I love my dog', is transformed into the list [4, 2, 1, 3]: each word is replaced by its index in word_index.
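
The mapping can be reproduced with a short pure-Python sketch. It assumes that words are ranked by frequency, with ties broken by order of first appearance, which is consistent with the word index printed earlier. The helpers `simple_tokens`, `build_word_index`, and `to_sequences` are hypothetical names used for illustration, not part of the Keras API.

```python
from collections import Counter


def simple_tokens(text):
    # Simplified normalization: lowercase, then strip common punctuation.
    words = (w.strip(",.!?") for w in text.lower().split())
    return [w for w in words if w]


def build_word_index(texts):
    counts = Counter()
    for text in texts:
        counts.update(simple_tokens(text))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    # Index 0 is left unused; it is the value used for padding later on.
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index):
    return [[word_index[w] for w in simple_tokens(t)] for t in texts]


corpus = [
    'I love my dog',
    'i, love my cat',
    'You love my dog!',
    'Do you think my dog is amazing'
]
word_index = build_word_index(corpus)
print(word_index)
print(to_sequences(corpus, word_index))
```

The most frequent word, 'my', receives index 1, which is why it shows up as 1 in every sequence.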

# Tokenize new sentences

In [5]:
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

[[4, 2, 1, 3], [1, 3, 1]]


The first phrase contains a new word: 'really'.

Since the tokenizer never saw this word in the training corpus, it skips it.

That's why the resulting sequence [4, 2, 1, 3] has only 4 numbers instead of 5.

Similarly, 'my dog loves my manatee' is encoded as if it were 'my dog my' => [1, 3, 1], because 'loves' and 'manatee' are both new words.
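
The skipping behavior can be sketched in a few lines of plain Python. The dictionary is copied from the word index printed earlier; `encode_skipping_unknown` is a hypothetical helper, not a Keras function.

```python
# Word index copied from the tokenizer output shown earlier.
word_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5,
              'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}


def encode_skipping_unknown(sentence, word_index):
    # Words missing from the index are silently dropped.
    return [word_index[w] for w in sentence.lower().split() if w in word_index]


print(encode_skipping_unknown('i really love my dog', word_index))
print(encode_skipping_unknown('my dog loves my manatee', word_index))
```

Dropping words this way changes the sequence length and loses the information that a word was present at that position.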

### Use a special token to encode new words

In [8]:
sentenses = [
    'I love my dog',
    'i, love my cat',
    'You love my dog!',
    'Do you think my dog is amazing'
]
# oov_token specifies the token used to encode out-of-vocabulary (unseen) words
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentenses)
word_index = tokenizer.word_index
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]
test_seq = tokenizer.texts_to_sequences(test_data)
print(word_index)
print(test_seq)

{'i': 5, 'cat': 7, 'dog': 4, 'is': 10, 'you': 6, 'my': 2, 'think': 9, 'love': 3, 'do': 8, '<OOV>': 1, 'amazing': 11}
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]


=> now our tokenizer assigns the '<OOV>' token (index 1) to unknown words, so each sequence keeps the same length as its sentence
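
As a rough illustration of what changes, the lookup below falls back to the '<OOV>' index instead of dropping the word. The dictionary is copied from the output above; `encode_with_oov` is a hypothetical helper written for this sketch, not a Keras function.

```python
# Word index copied from the output above; '<OOV>' takes index 1.
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


def encode_with_oov(sentence, word_index, oov_token='<OOV>'):
    oov_id = word_index[oov_token]
    # Unknown words map to the OOV index instead of being skipped.
    return [word_index.get(w, oov_id) for w in sentence.lower().split()]


print(encode_with_oov('i really love my dog', word_index))
print(encode_with_oov('my dog loves my manatee', word_index))
```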

# Padding

Before training your neural network, you need all input sequences to have the same length.

To do that, we can use the 'pad_sequences' function.

In [11]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded = pad_sequences(sequences)
print(padded)

[[ 0  0  0  4  2  1  3]
 [ 0  0  0  4  2  1  6]
 [ 0  0  0  5  2  1  3]
 [ 7  5  8  1  3  9 10]]


By default, zeros are added at the beginning of each sequence so that every sequence is as long as the longest one (7 tokens here).

In [12]:
padded = pad_sequences(sequences, padding='post')
print(padded)

[[ 4  2  1  3  0  0  0]
 [ 4  2  1  6  0  0  0]
 [ 5  2  1  3  0  0  0]
 [ 7  5  8  1  3  9 10]]


=> you can also add the padding at the end of each sequence by using padding='post'

In [13]:
padded = pad_sequences(sequences, padding='post',maxlen=5)
print(padded)

[[ 4  2  1  3  0]
 [ 4  2  1  6  0]
 [ 5  2  1  3  0]
 [ 8  1  3  9 10]]


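If you want your sequences to have at most 5 tokens, use maxlen=5. By default, longer sequences are truncated from the beginning (truncating='pre'), which is why the last row lost its first two values, 7 and 5.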

In [17]:
padded = pad_sequences(sequences, padding='post',truncating='post',maxlen=5)
print(padded)

[[4 2 1 3 0]
 [4 2 1 6 0]
 [5 2 1 3 0]
 [7 5 8 1 3]]


To drop words from the end of the sequence instead, use truncating='post'.
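
The interaction between padding, truncating, and maxlen can be summarized with a small pure-Python sketch. The function `pad` is a hypothetical stand-in written for illustration: it returns plain lists rather than a NumPy array and is not the Keras implementation.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to the same length."""
    if maxlen is None:
        # Without maxlen, the longest sequence sets the target length.
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            if truncating == "pre":
                seq = seq[len(seq) - maxlen:]  # drop tokens from the start
            else:
                seq = seq[:maxlen]             # drop tokens from the end
        fill = [value] * (maxlen - len(seq))
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result


sequences = [[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]
print(pad(sequences))
print(pad(sequences, padding="post", maxlen=5))
print(pad(sequences, padding="post", truncating="post", maxlen=5))
```

The three calls mirror the pad_sequences examples above, so the rows can be compared directly with the printed arrays.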