# Lesson 1 - Tokenization

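- Tokenization is the process of encoding sentences as numbers
- Each word is represented by a number so that a computer can process it
- Encoding letter by letter is not enough: "LISTEN" and "SILENT" have the same letters in a different order, so the letter codes alone do not tell the two words apart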

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [2]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!'
]

In [3]:
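# num_words limits encoding to the most frequent words (Keras keeps the top num_words - 1)
tokenizer = Tokenizer(num_words=100)
# build the vocabulary from the sentences
tokenizer.fit_on_texts(sentences)
# word_index maps each word to its integer index
word_index = tokenizer.word_index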
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
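The printed index shows three things: words are lowercased, punctuation is removed ('dog!' becomes 'dog'), and more frequent words get smaller numbers. The sketch below is a rough pure-Python approximation of that behaviour, not the Keras implementation. The helper name `simple_word_index` is hypothetical, and stripping everything in `string.punctuation` is an assumption that only approximates the Tokenizer's default filters.

```python
from collections import Counter
import string

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!'
]

def simple_word_index(texts):
    """Hypothetical helper: lowercase, drop punctuation, rank words by frequency."""
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(strip_punct).split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

word_index = simple_word_index(sentences)
print(word_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

Index 0 is left unused here, which matches the output above, where numbering starts at 1.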


# Lesson 2 - Sequencing

- Sequencing turns each sentence into a sequence of numbers, which can then be processed and used to train a neural network
- If the test sentences contain words that were not in the training data, those words have no index in the training vocabulary. They are dropped, so the sequence loses both the words and its original length
- One way to avoid this is to set the oov_token property when creating the Tokenizer: Tokenizer(oov_token = "\<OOV\>")
- The Tokenizer then creates an Out-Of-Vocabulary (OOV) token and replaces every word it does not recognise with that token
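The effect of the OOV token described above can be sketched without Keras. The `encode` function and the two small vocabularies below are hypothetical and written only for illustration; they mimic the lookup behaviour, not the library's code.

```python
def encode(sentence, word_index, oov_token=None):
    """Hypothetical helper: map words to indices, optionally using an OOV token."""
    oov_id = word_index.get(oov_token) if oov_token is not None else None
    ids = []
    for word in sentence.lower().split():
        if word in word_index:
            ids.append(word_index[word])
        elif oov_id is not None:
            ids.append(oov_id)  # unknown word is replaced by the OOV index
        # otherwise the unknown word is silently dropped
    return ids

# Illustrative vocabularies (assumed for this sketch).
vocab_without_oov = {"my": 1, "love": 2, "dog": 3, "i": 4}
vocab_with_oov = {"<OOV>": 1, "my": 2, "love": 3, "dog": 4, "i": 5}

dropped = encode("i really love my dog", vocab_without_oov)
kept = encode("i really love my dog", vocab_with_oov, oov_token="<OOV>")

print(dropped)  # [4, 2, 1, 3]     'really' is lost, length 4
print(kept)     # [5, 1, 3, 2, 4]  'really' becomes 1, length 5
```

With the OOV token, the encoded sequence has one entry per input word, so the sentence length is preserved even when some words are unknown.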

In [4]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

In [5]:
# encode words into tokens and index these tokens
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

print(word_index)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}


In [6]:
# generates sequences of tokens that represent the sentences
sequences = tokenizer.texts_to_sequences(sentences)

print(sequences)

[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]
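To read these sequences back, the word index can be inverted. The sketch below copies the word index printed above and uses a hypothetical `decode` helper; it is plain Python rather than a Tokenizer call.

```python
# Word index copied from the output above.
word_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5,
              'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}

# Invert the mapping: index -> word.
index_word = {index: word for word, index in word_index.items()}

def decode(sequence):
    """Hypothetical helper: turn a list of indices back into a sentence."""
    return " ".join(index_word[i] for i in sequence)

decoded = [decode(seq) for seq in [[4, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]]
print(decoded)
# ['i love my dog', 'do you think my dog is amazing']
```

The decoded text is lowercase and has no punctuation, because that information was discarded during tokenization.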


In [7]:
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

In [8]:
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

[[4, 2, 1, 3], [1, 3, 1]]


#### With Out-Of-Vocabulary (OOV) Token

In [9]:
# encode words into tokens and index these tokens
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

print(word_index)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


In [10]:
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

In [11]:
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]


- To train a neural network, all input sequences must have the same length
- A more advanced way to handle sentences of different lengths is to use RaggedTensor: https://www.tensorflow.org/api_docs/python/tf/RaggedTensor
- A simpler solution is to use **PADDING**

In [12]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [13]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

In [14]:
# encode words into tokens and index these tokens
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

print(word_index)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


In [15]:
sequences = tokenizer.texts_to_sequences(sentences)

print(sequences)

[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]


In [16]:
# by default, zeros are added before each sequence, up to the length of the longest one
pre_padded = pad_sequences(sequences)

print(pre_padded)

[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]


In [17]:
# add the padding after each sequence instead of before it
post_padded = pad_sequences(sequences, padding='post')

print(post_padded)

[[ 5  3  2  4  0  0  0]
 [ 5  3  2  7  0  0  0]
 [ 6  3  2  4  0  0  0]
 [ 8  6  9  2  4 10 11]]


In [18]:
# cap the length at maxlen; longer sequences are truncated from the front by default
post_padded_maxlen = pad_sequences(sequences, padding='post', maxlen=6)

print(post_padded_maxlen)

[[ 5  3  2  4  0  0]
 [ 5  3  2  7  0  0]
 [ 6  3  2  4  0  0]
 [ 6  9  2  4 10 11]]


In [19]:
# truncate from the end when a sequence is longer than maxlen
post_padded_trunc_maxlen = pad_sequences(
    sequences, padding='post', truncating='post', maxlen=5
)

print(post_padded_trunc_maxlen)

[[5 3 2 4 0]
 [5 3 2 7 0]
 [6 3 2 4 0]
 [8 6 9 2 4]]
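The padding and truncating outputs above follow one simple rule per sequence, which the sketch below reproduces in plain Python. The function `pad_one` is hypothetical and only mirrors the `padding`, `truncating`, and `maxlen` options as they behave in the cells above; it assumes `maxlen` is at least 1 and is not the Keras implementation.

```python
def pad_one(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Hypothetical helper: pad or truncate a single sequence to maxlen (maxlen >= 1)."""
    # Truncate first: "pre" keeps the last maxlen items, "post" keeps the first maxlen.
    if len(seq) > maxlen:
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    # "pre" puts the zeros in front, "post" puts them at the end.
    return fill + seq if padding == "pre" else seq + fill

long_seq = [8, 6, 9, 2, 4, 10, 11]
short_seq = [5, 3, 2, 4]

print(pad_one(short_seq, 7))                                       # [0, 0, 0, 5, 3, 2, 4]
print(pad_one(short_seq, 7, padding="post"))                       # [5, 3, 2, 4, 0, 0, 0]
print(pad_one(long_seq, 6, padding="post"))                        # [6, 9, 2, 4, 10, 11]
print(pad_one(long_seq, 5, padding="post", truncating="post"))     # [8, 6, 9, 2, 4]
```

This is why the `maxlen=6` example lost the leading 8 rather than the trailing 11: padding was set to 'post', but truncating was left at its 'pre' default.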
