## Tokenization

How can words be represented in a way that a computer can process, with a view to later training a neural network that can understand their meaning?

Consider the word L-I-S-T-E-N.

The word "Listen" is made up of a sequence of letters.

These letters can be represented by numbers using an *encoding* scheme.

L (076) I (073) S (083) T (084) E (069) N (078) *(Encoding in ASCII)*

The word S-I-L-E-N-T has the same letters, although in a different order.

Since both words are built from exactly the same codes, it is hard to infer the meaning or sentiment of a word from its letters alone.
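As a quick check of the point above, this small sketch uses Python's built-in `ord` to look up the code of each letter and then compares the two words. The variable names are illustrative only.

```python
# Encode each letter as its ASCII code using the built-in ord().
listen = [ord(letter) for letter in "LISTEN"]
silent = [ord(letter) for letter in "SILENT"]

print(listen)  # [76, 73, 83, 84, 69, 78]
print(silent)  # [83, 73, 76, 69, 78, 84]

# Same set of codes, different order: the letters alone do not separate the words.
same_letters = sorted(listen) == sorted(silent)
same_order = listen == silent
print(same_letters, same_order)  # True False
```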

**Might be a better idea to encode words rather than letters, then.**

Consider the sentence "I love my dog"

- word "I" could be 001
- word "love" could be 002
- word "my" could be 003
- word "dog" could be 004

Now take the sentence "I love my cat":

- word "I" is 001
- word "love" is 002
- word "my" is 003
- word "cat" could be 005

So these two sentences are now equivalent to:

- "I love my dog" - "001 002 003 004"
- "I love my cat" - "001 002 003 005"

This already shows some similarity between the sentences, as expected: the two encodings differ only in their final token.
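The word-level scheme described above can be sketched in a few lines of plain Python. This is only an illustration: `build_vocab` is a hypothetical helper that hands out IDs in order of first appearance, which is not how the Keras `Tokenizer` assigns them.

```python
def build_vocab(sentences):
    """Give each new word the next integer ID, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


sentences = ["I love my dog", "I love my cat"]
vocab = build_vocab(sentences)

# Replace every word with its ID.
encoded = [[vocab[word] for word in sentence.split()] for sentence in sentences]

print(vocab)    # {'I': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
print(encoded)  # [[1, 2, 3, 4], [1, 2, 3, 5]]
```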

In [3]:
import tensorflow as tf 
from tensorflow import keras 
from tensorflow.keras.preprocessing.text import Tokenizer   # Tokenizer API

In [6]:
sentences: list[str] = [
    "I love my dog",
    "I love my cat",
    "You love my dog!",
]

In [7]:
tokenizer = Tokenizer(num_words = 100)

# num_words - maximum number of words to keep, ranked by frequency
# (Keras keeps the num_words - 1 most common words)
# e.g. if we were to tokenize 100 books but only wanted the most frequently occurring words

tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index   # the full word -> index mapping is stored in word_index

print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
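The output above shows that words are lowercased, the "!" is dropped, and more frequent words receive smaller indices. A rough pure-Python sketch of that ranking idea follows. It is a simplified stand-in, not the Keras implementation: `simple_word_index` is a hypothetical helper, and stripping everything in `string.punctuation` is an assumption that only approximates the `Tokenizer`'s default filters.

```python
import string
from collections import Counter


def simple_word_index(texts):
    """Rank words by frequency; ties keep their order of first appearance."""
    counts = Counter()
    strip_punct = str.maketrans("", "", string.punctuation)
    for text in texts:
        counts.update(text.lower().translate(strip_punct).split())
    # sorted() is stable, so equally frequent words stay in first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


sentences = ["I love my dog", "I love my cat", "You love my dog!"]
index = simple_word_index(sentences)
print(index)  # {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```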


## Turning Sentences into Data

In [4]:
sentences: list[str] = [
    "I love my dog",
    "I love my cat",
    "You love my dog!",
    "Do you think my dog is amazing?"
]

# here, we have added a new sentence that is noticeably longer
# than the other sentences in the list

In [5]:
tokenizer = Tokenizer(num_words = 100)

tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
# creates sequences of tokens representing each sentence

print(word_index)

print(sequences)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]


In the case above, the tokenizer is fitted on the list of sentences we already have. When training a neural network, what happens if a word that has never been seen before is encountered?

```tokenizer.fit_on_texts(sentences)```

```sequences = tokenizer.texts_to_sequences(sentences)```

What if a list ```test_data``` were to be initialized as:

```
test_data = [
    "I really love my dog",
    "My dog loves my manatee",
]
```

"manatee" here is a word that hasn't been encountered before (and neither have "really" or "loves").

```test_seq = tokenizer.texts_to_sequences(test_data)```

```print(test_seq)```

In [6]:
test_data: list[str] = [
    "I really love my dog",
    "My dog loves my manatee",
]

In [7]:
test_seq = tokenizer.texts_to_sequences(test_data)

print(test_seq)

# the output is [[4, 2, 1, 3], [1, 3, 1]]

# The five-word sentence "I really love my dog" has only 4 elements in its sequence form,
# because the word "really" was not in the corpus used to build the tokenizer.

# Similarly, in the second sentence the words "loves" and "manatee" are not in the index,
# so only 3 tokens are generated.

[[4, 2, 1, 3], [1, 3, 1]]
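The dropped words can be reproduced with a small lookup sketch. It assumes the `word_index` printed earlier for the four training sentences, and `to_sequence` is a hypothetical helper that skips any word missing from the index. It mimics the observed behavior rather than the library's internals.

```python
# Word index copied from the output printed earlier for the four sentences.
word_index = {
    "my": 1, "love": 2, "dog": 3, "i": 4, "you": 5,
    "cat": 6, "do": 7, "think": 8, "is": 9, "amazing": 10,
}


def to_sequence(text, word_index):
    """Map known words to their IDs and silently skip unknown ones."""
    return [word_index[word] for word in text.lower().split() if word in word_index]


test_data = ["I really love my dog", "My dog loves my manatee"]
test_seq = [to_sequence(text, word_index) for text in test_data]
print(test_seq)  # [[4, 2, 1, 3], [1, 3, 1]]
```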


This suggests that a very large word index is needed to handle large texts or sentences that are not in the training set.

However, to avoid losing the length of the sentence, use the **OOV** trick: reserve a special token that stands in for any word missing from the index.

OOV stands for **Out of Vocabulary**.


In [8]:
sentences: list[str] = [
    "I love my dog",
    "I love my cat",
    "You love my dog!",
    "Do you think my dog is amazing?"
]

In [9]:
tokenizer = Tokenizer(num_words = 100, oov_token = "<OOV>")
tokenizer.fit_on_texts(sentences)

In [10]:
word_index = tokenizer.word_index

In [11]:
sequences = tokenizer.texts_to_sequences(sentences)

In [12]:
print(sequences)

[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]


In [13]:
print(word_index)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


In [15]:
test_seq = tokenizer.texts_to_sequences(test_data)

print(test_seq) # still lost some meaning, but at least the sequences are the same length as the sentences

[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
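The effect of the OOV token can be sketched as a dictionary lookup with a default value. The `word_index` below is copied from the printed output above. `to_sequence_oov` is a hypothetical helper, not part of Keras.

```python
# Word index copied from the printed output, with "<OOV>" reserved as ID 1.
word_index = {
    "<OOV>": 1, "my": 2, "love": 3, "dog": 4, "i": 5, "you": 6,
    "cat": 7, "do": 8, "think": 9, "is": 10, "amazing": 11,
}


def to_sequence_oov(text, word_index, oov_token="<OOV>"):
    """Map each word to its ID, falling back to the OOV ID for unknown words."""
    oov_id = word_index[oov_token]
    return [word_index.get(word, oov_id) for word in text.lower().split()]


test_data = ["I really love my dog", "My dog loves my manatee"]
test_seq = [to_sequence_oov(text, word_index) for text in test_data]
print(test_seq)  # [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
```

Every input word now yields exactly one token, so sequence length matches sentence length even though the identity of the unknown words is lost.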


## Handling Sentences of Varying Lengths

In [16]:
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    "I love my dog",
    "I love my cat",
    "You love my dog!",
    "Do you think my dog is amazing?",
]

tokenizer = Tokenizer(num_words = 100, oov_token = "<OOV>")

tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index


sequences = tokenizer.texts_to_sequences(sentences)

padded = pad_sequences(sequences)

print(word_index)
print(sequences)
print(padded)


{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]
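In the output above, `pad_sequences` has added zeros at the front of the shorter sequences so that every row matches the longest one. A minimal pure-Python sketch of that idea is shown below. `pad` is a hypothetical helper (not the Keras function) whose `padding` argument is modeled on the `"pre"` / `"post"` option of `pad_sequences`.

```python
def pad(sequences, padding="pre", value=0):
    """Pad every sequence with `value` up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [value] * (max_len - len(seq))
        # "pre" puts the fill before the tokens, anything else puts it after.
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

pre_padded = pad(sequences)
post_padded = pad(sequences, padding="post")

print(pre_padded[0])   # [0, 0, 0, 5, 3, 2, 4]
print(post_padded[0])  # [5, 3, 2, 4, 0, 0, 0]
```

In Keras itself, `pad_sequences(sequences, padding="post")` gives the trailing-zero layout, and `maxlen` together with `truncating` controls how longer sequences are cut.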
