## Word Encoding
Just as everyone has a name, every word gets one too: a unique integer. <br>
Consider these two sentences:
<br>
**I love my dog**
<br>
**I love my cat**


In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer # https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

sentences = [
    'I love my dog',
    'I love my cat'
]

# num_words caps the vocabulary used for encoding: only the (num_words - 1) most frequent words are kept
# Tokenizer: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
# The Tokenizer lowercases every word for you, so 'I' is indexed as 'i'
# It also strips punctuation
print("Word Index:", word_index)

Word Index: {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}


### Tuple, List, Dictionary
A list is the standard way to collect data in Python. Its items can be updated, added, or removed, e.g. L = [1, 2, 3, 4, 5] <br>
A tuple is generally a little faster to work with than a list, but its values cannot be modified, e.g. T = (1, 2, 3, 4, 5) <br>
A dictionary stores key-value pairs, e.g. D = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
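
The differences above can be checked with a few lines of plain Python. This is only a small illustrative sketch; the variable `tuple_modified` is a name introduced here for the demonstration.

```python
# List: mutable, so items can be updated, appended, or removed in place.
L = [1, 2, 3, 4, 5]
L[0] = 10      # update
L.append(6)    # add
L.remove(3)    # remove the value 3
print(L)       # [10, 2, 4, 5, 6]

# Tuple: immutable, so item assignment raises a TypeError.
T = (1, 2, 3, 4, 5)
try:
    T[0] = 10
    tuple_modified = True
except TypeError:
    tuple_modified = False
print(tuple_modified)  # False

# Dictionary: maps keys to values, the same shape as the tokenizer's word_index.
D = {'a': 1, 'b': 2, 'c': 3}
D['d'] = 4
print(D['b'], 'd' in D)  # 2 True
```

The dictionary is the structure to keep in mind here, since `word_index` maps each word (key) to its integer (value).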

Now add **You love my dog** with an exclamation mark and a question mark after **dog**.

In [None]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!?'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
# The new corpus adds 'you'; the '!?' after 'dog' is stripped, so no separate 'dog!?' entry is created
print("Word Index:", word_index) # word_index is a dictionary mapping each word to its integer

Word Index: {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
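
To make the ordering of the word index less mysterious, here is a rough pure-Python sketch of the idea: lowercase the text, drop punctuation, count the words, and give the most frequent word the smallest index. `build_word_index` is a hypothetical helper written for this note, not the Keras implementation, and its punctuation handling is a simplification of the Tokenizer's default filters.

```python
import re
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: lowercase, drop punctuation, rank words by frequency."""
    counts = Counter()
    for text in texts:
        # Replace anything that is not a letter, digit, or apostrophe with a space.
        cleaned = re.sub(r"[^a-z0-9']+", " ", text.lower())
        counts.update(cleaned.split())
    # Most frequent word gets index 1; the sort is stable, so ties keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

corpus = ['I love my dog', 'I love my cat', 'You love my dog!?']
sketch_index = build_word_index(corpus)
print(sketch_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

'love' and 'my' appear three times each, so they move ahead of 'i', which explains why the indices changed after the third sentence was added.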


## Text to Sequence

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!?',
    'Do you think my dog is crazy^^?'
]

# num_words caps the vocabulary used for encoding: only the (num_words - 1) most frequent words are kept
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# texts_to_sequences converts each sentence into a sequence of integers and returns them as a list of lists
sequences = tokenizer.texts_to_sequences(sentences)

print("Word Index:", word_index)
print("\nSequences:\n", sequences) # --> one list of integers per sentence

Word Index: {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'crazy': 10}

Sequences:
 [[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]


**texts_to_sequences** can take any set of sentences and encode them.
<br>
Suppose we train a Keras neural network on a corpus of text, with a word index generated from that corpus. At test time we have to encode the new text with the same word index; otherwise the integers would no longer stand for the same words and the result would be meaningless.

In [None]:
test_data = [
    'i really love my dog',
    'my dog loves my aunt'
]

# Only words already in the word index can be encoded; unseen words are silently dropped,
# so 'my dog loves my aunt' ends up as 'my dog my'
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nSequences:\n", test_seq)


Sequences:
 [[4, 2, 1, 3], [1, 3, 1]]
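
To see what was lost, the sequences can be decoded back into words by inverting the word index. The sketch below copies the word index and test sequences printed above and uses plain Python only; `decode` is a helper name introduced here. (The Keras Tokenizer also offers `index_word` and `sequences_to_texts` for the same purpose.)

```python
# Word index and test sequences copied from the outputs above.
word_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5,
              'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'crazy': 10}

# Invert the mapping: integer -> word.
index_word = {index: word for word, index in word_index.items()}

def decode(sequence):
    """Turn a list of integers back into a space-separated sentence."""
    return ' '.join(index_word[i] for i in sequence)

test_seq = [[4, 2, 1, 3], [1, 3, 1]]
decoded = [decode(seq) for seq in test_seq]
print(decoded)  # ['i love my dog', 'my dog my']
```

'really', 'loves', and 'aunt' never appeared in the training sentences, so they vanish from the decoded text.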


What we learn from this: <br>
1. We need a lot of training data to get a broad vocabulary. Otherwise we end up with sentences like **my dog my** above. <br>
2. Instead of ignoring unseen words, we can insert a special value whenever we encounter one.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!?',
    'Do you think my dog is crazy^^?'
]

# Use the token '<OOV>' (out of vocabulary) for unseen words
# --> any token works, as long as it is unique and cannot be confused with a real word
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

test_data = [
    'i really love my dog',
    'my dog loves my aunt'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print("Word Index:", word_index)
print("\nSequence:\n", test_seq)

Word Index: {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'crazy': 11}

Sequence:
 [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]


Still not great, but it is doing better: the encoded sentences now keep their original length. <br>
As the corpus grows, more words end up in the index.
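
One rough way to put a number on "not great" is the share of tokens that were mapped to the OOV index. The sketch below is plain Python; `oov_rate` is a hypothetical helper, and it assumes the OOV token has index 1, as in the word index printed above.

```python
def oov_rate(sequences, oov_index=1):
    """Fraction of tokens in `sequences` that were encoded as the OOV index."""
    total = sum(len(seq) for seq in sequences)
    unseen = sum(seq.count(oov_index) for seq in sequences)
    return unseen / total if total else 0.0

# Test sequences copied from the output above.
test_seq = [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
rate = oov_rate(test_seq)
print(rate)  # 0.3
```

Three of the ten test tokens are unknown here; a larger training corpus should push this fraction down.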

## Padding
When we build a neural network, the input data needs to be uniform in size before we feed it into the network for training. <br>
When working with images in batches, all images should have the same size. <br>
The same goes for text data, and padding helps with this.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!?',
    'Do you think my dog is crazy^^?'
]

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

# Once the tokenizer has created the sequences, pass them to pad_sequences to pad them to a common length
padded = pad_sequences(sequences)
print("Word Index:", word_index)
print("\nSequences:\n", sequences)
print("\nPadded:\n", padded) # The result is a matrix: every row has the length of the longest sentence, with zeros added in front

Word Index: {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'crazy': 11}

Sequences:
 [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

Padded:
 [[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]


In [None]:
# To put the zeros after the sentence instead of before it, set the 'padding' parameter to 'post'
padded = pad_sequences(sequences, padding='post')
print("Word Index:", word_index)
print("\nSequences:\n", sequences)
print("\nPadded:\n", padded)

Word Index: {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'crazy': 11}

Sequences:
 [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

Padded:
 [[ 5  3  2  4  0  0  0]
 [ 5  3  2  7  0  0  0]
 [ 6  3  2  4  0  0  0]
 [ 8  6  9  2  4 10 11]]


In [None]:
# maxlen sets the length of every padded sequence; longer sentences are truncated.
# truncating defaults to 'pre', so words are lost from the beginning (padding='post' only controls where the zeros go)
padded = pad_sequences(sequences, padding='post', maxlen=5)
print("Word Index:", word_index)
print("Sequences:\n", sequences)
print("Padded:\n", padded)

Word Index: {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'crazy': 11}
Sequences:
 [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
Padded:
 [[ 5  3  2  4  0]
 [ 5  3  2  7  0]
 [ 6  3  2  4  0]
 [ 9  2  4 10 11]]


In [None]:
# With truncating='post', sentences longer than maxlen lose words from the end instead
padded = pad_sequences(sequences, padding='post', maxlen=5, truncating='post')
print("Word Index:", word_index)
print("\nSequences:\n", sequences)
print("\nPadded:\n", padded)

Word Index: {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'crazy': 11}

Sequences:
 [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

Padded:
 [[5 3 2 4 0]
 [5 3 2 7 0]
 [6 3 2 4 0]
 [8 6 9 2 4]]
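
The four variants above follow two independent rules: `truncating` decides which end loses tokens, and `padding` decides which end receives zeros. The pure-Python sketch below mirrors that behaviour for the cases shown in this notebook. `pad` is a hypothetical helper that returns plain lists, not the Keras implementation (which returns a NumPy array and handles more options).

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Sketch of the padding rules: truncate to maxlen, then fill with `value`."""
    if maxlen is None:
        # Default: pad everything to the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops tokens from the front, 'post' drops them from the back.
            seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the zeros in front, 'post' puts them after the tokens.
        padded.append(fill + seq if padding == 'pre' else seq + fill)
    return padded

# Sequences copied from the outputs above.
sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

print(pad(sequences)[0])                                               # [0, 0, 0, 5, 3, 2, 4]
print(pad(sequences, padding='post', maxlen=5)[3])                     # [9, 2, 4, 10, 11]
print(pad(sequences, padding='post', maxlen=5, truncating='post')[3])  # [8, 6, 9, 2, 4]
```

Which end to truncate depends on where the important words tend to sit in your sentences, so it is worth trying both settings.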
