# AAI612: Deep Learning & its Applications

*Notebook 7.2: Text Processing Using Keras*

<a href="https://colab.research.google.com/github/harmanani/AAI612/blob/main/Week7/Notebook7.2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

## Initialize

First, initialize the `Tokenizer` class. The `num_words` argument caps the vocabulary used when generating sequences: only the `num_words - 1` most frequent words are kept. For now, the important thing to note is that it does not affect how the `word_index` dictionary is generated. You can try passing `1` instead of `50` in the cell below and you will arrive at the same `word_index`.

Also notice that, by default, all punctuation is ignored and words are converted to lower case. You can override these behaviors with the `filters` and `lower` arguments of the `Tokenizer` class, as described [here](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#arguments). You can try modifying these in the cell below and compare the output with the one produced by the defaults.

In [8]:
# Define your input texts
sentences = [
    "Pie the cat loves to chase mice.",
    "Pie enjoys naps in the afternoon sunlight.",
    "Sometimes, Pie watches birds from the window.",
]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words=50, oov_token="<OOV>")
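
To make the effect of `filters` and `lower` concrete, here is a minimal pure-Python sketch of the normalization step. It is an approximation for illustration, not the Keras implementation: `normalize` and `DEFAULT_FILTERS` are hypothetical names, and the filter string is assumed to match the documented default of the `filters` argument.

```python
# Illustrative approximation of the Tokenizer's text normalization.
# `normalize` and `DEFAULT_FILTERS` are hypothetical names, not Keras API.

# Assumed to mirror the documented default of the `filters` argument.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def normalize(text, filters=DEFAULT_FILTERS, lower=True):
    """Split text into words after optional lowercasing and filtering."""
    if lower:
        text = text.lower()
    # Replace every filtered character with a space, then split on whitespace.
    table = str.maketrans({ch: " " for ch in filters})
    return text.translate(table).split()


# Defaults: punctuation is dropped and everything is lower case.
print(normalize("Sometimes, Pie watches birds from the window."))

# No filtering and no lowercasing: punctuation stays attached to the words.
print(normalize("Sometimes, Pie watches birds.", filters="", lower=False))
```

With the defaults, `Pie` and `pie` end up as the same word; with `lower=False` they would be counted separately.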

## Generating the vocabulary

Let us first look at how to build a lookup dictionary that maps each word to an integer. The code below takes a list of sentences and assigns an integer to every distinct word in them. This is done with the [fit_on_texts()](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#fit_on_texts) method, and you can inspect the result through the `word_index` property. More frequent words get a lower index.

In [9]:
# Tokenize the input sentences
tokenizer.fit_on_texts(sentences)

# Get the word index dictionary
word_index = tokenizer.word_index
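
The rule "more frequent words have a lower index" can be sketched in plain Python. This is a simplified illustration rather than the library's code: `build_word_index` is a hypothetical helper, the tokenization is reduced to "lowercase and keep alphanumeric runs", and index `1` is assumed to be reserved for the OOV token while `0` is left free for padding.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Hypothetical helper: map words to integers by descending frequency."""
    counts = Counter()
    for text in texts:
        # Simplified tokenization: lowercase, treat non-alphanumerics as spaces.
        cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
        counts.update(cleaned.split())
    # Stable sort: equally frequent words keep their first-seen order.
    ordered = sorted(counts, key=lambda word: -counts[word])
    # Index 1 goes to the OOV token; index 0 is left unused for padding.
    return {word: i for i, word in enumerate([oov_token] + ordered, start=1)}


sentences = [
    "Pie the cat loves to chase mice.",
    "Pie enjoys naps in the afternoon sunlight.",
    "Sometimes, Pie watches birds from the window.",
]
print(build_word_index(sentences))
```

Here `pie` and `the` each occur three times, so they receive the lowest indices after the OOV token; every other word occurs once and is numbered in order of first appearance.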

## Text to Sequences

You can then use the result to convert each of the input sentences into a sequence of tokens. That is done using the [`texts_to_sequences()`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#texts_to_sequences) method as shown below.

In [4]:
# Generate list of token sequences
sequences = tokenizer.texts_to_sequences(sentences)
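
Conceptually, this conversion is a dictionary lookup per word. The sketch below mimics it in plain Python and also shows where `num_words` comes into play. `texts_to_ids` is a hypothetical helper, the tokenization is simplified, and the vocabulary is copied from the word index printed in this notebook; treat it as an approximation of the behavior, not the Keras code.

```python
# Vocabulary in index order, copied from the notebook's printed word index.
VOCAB = ["<OOV>", "pie", "the", "cat", "loves", "to", "chase", "mice",
         "enjoys", "naps", "in", "afternoon", "sunlight", "sometimes",
         "watches", "birds", "from", "window"]
word_index = {word: i for i, word in enumerate(VOCAB, start=1)}


def texts_to_ids(texts, word_index, num_words=None, oov_index=1):
    """Hypothetical helper: look up each word, falling back to the OOV id."""
    sequences = []
    for text in texts:
        cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
        ids = []
        for word in cleaned.split():
            idx = word_index.get(word)
            # Unknown words, and words at or beyond the cap, become OOV.
            if idx is None or (num_words and idx >= num_words):
                ids.append(oov_index)
            else:
                ids.append(idx)
        sequences.append(ids)
    return sequences


sentences = [
    "Pie the cat loves to chase mice.",
    "Pie enjoys naps in the afternoon sunlight.",
    "Sometimes, Pie watches birds from the window.",
]
print(texts_to_ids(sentences, word_index))
print(texts_to_ids(sentences, word_index, num_words=4))
```

With `num_words=4`, only indices `1` to `3` survive (the OOV token, `pie`, and `the`), which is the "number of words minus one" behavior described earlier.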


## Check the Results

In [10]:
# Print the result
print("\nWord Index = " , word_index)
print("\nSequences = " , sequences)


Word Index =  {'<OOV>': 1, 'pie': 2, 'the': 3, 'cat': 4, 'loves': 5, 'to': 6, 'chase': 7, 'mice': 8, 'enjoys': 9, 'naps': 10, 'in': 11, 'afternoon': 12, 'sunlight': 13, 'sometimes': 14, 'watches': 15, 'birds': 16, 'from': 17, 'window': 18}

Sequences =  [[2, 3, 4, 5, 6, 7, 8], [2, 9, 10, 11, 3, 12, 13], [14, 2, 15, 16, 17, 3, 18]]


Notice that each sentence is now a sequence of numbers. If you look these numbers up in the word index, you can reconstruct the words!
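
One way to do that reconstruction is to invert the word index, as in the minimal sketch below. `decode` and the local `index_word` dictionary are illustrative names defined here, and the word index is copied from the printed output; the Keras `Tokenizer` also offers a `sequences_to_texts()` method for the same purpose.

```python
# Word index copied from the notebook's printed output.
word_index = {'<OOV>': 1, 'pie': 2, 'the': 3, 'cat': 4, 'loves': 5, 'to': 6,
              'chase': 7, 'mice': 8, 'enjoys': 9, 'naps': 10, 'in': 11,
              'afternoon': 12, 'sunlight': 13, 'sometimes': 14, 'watches': 15,
              'birds': 16, 'from': 17, 'window': 18}

# Invert the mapping: integer id -> word.
index_word = {i: word for word, i in word_index.items()}


def decode(sequence):
    """Hypothetical helper: turn a list of ids back into a string."""
    # Id 0 is the padding value and carries no word, so it is skipped.
    return " ".join(index_word[i] for i in sequence if i != 0)


# Ids for the first sentence, "Pie the cat loves to chase mice."
print(decode([2, 3, 4, 5, 6, 7, 8]))
```

The round trip is lossy: capitalization and punctuation were discarded during tokenization and cannot be recovered.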

## Padding

You will usually need to pad the sequences to a uniform length because that is what your model expects. You can use the [pad_sequences](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences) function for that. By default, it pads every sequence with zeros at the front, up to the length of the longest sequence. You can override this with the `maxlen` argument to define a specific length; sequences longer than `maxlen` are truncated, from the front by default. Feel free to play with the [other arguments](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences#args) shown in class and compare the result.

In [11]:
# Pad the sequences to a uniform length
padded = pad_sequences(sequences, maxlen=5)

# Print the result
print("\nPadded Sequences:")
print(padded)


Padded Sequences:
[[ 4  5  6  7  8]
 [10 11  3 12 13]
 [15 16 17  3 18]]
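
The padding and truncation rules can be sketched with plain lists. `pad` below is a hypothetical stand-in that returns nested lists instead of a NumPy array, and it assumes the same defaults as `pad_sequences` (`padding="pre"`, `truncating="pre"`, `value=0`); it is meant to illustrate the semantics, not replace the library function.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Hypothetical list-based sketch of padding and truncation."""
    if maxlen is None:
        # Default target length: the longest sequence in the batch.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # "pre" drops tokens from the front, "post" from the end.
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # "pre" puts the padding before the tokens, "post" after them.
        padded.append((fill + seq) if padding == "pre" else (seq + fill))
    return padded


sequences = [[2, 3, 4, 5, 6, 7, 8],
             [2, 9, 10, 11, 3, 12, 13],
             [14, 2, 15, 16, 17, 3, 18]]

# Every sequence has 7 tokens, so maxlen=5 drops the first two of each.
print(pad(sequences, maxlen=5))

# Shorter sequences are filled with zeros, here at the end.
print(pad([[1, 2], [3]], padding="post"))
```

Because truncation happens at the front by default, `maxlen=5` removes the beginning of each sentence; passing `truncating="post"` would keep the beginning instead.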


## Out-of-vocabulary tokens

Notice that you defined an `oov_token` when the `Tokenizer` was initialized earlier. It is used whenever an input word is not found in the `word_index` dictionary. For example, you may collect more text after your initial training and decide not to regenerate the `word_index`. You will see this in action in the cell below. Notice that the token `1` is inserted for words that are not found in the dictionary, while `0` is the padding value.

In [12]:
# Try with words that the tokenizer wasn't fit to
test_data = [
    'i really love my dog',
    'my cat loves my manatee'
]

# Generate the sequences
test_seq = tokenizer.texts_to_sequences(test_data)

# Print the word index dictionary
print("\nWord Index = " , word_index)

# Print the sequences with OOV
print("\nTest Sequence = ", test_seq)

# Print the padded result
padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded Test Sequence: ")
print(padded)


Word Index =  {'<OOV>': 1, 'pie': 2, 'the': 3, 'cat': 4, 'loves': 5, 'to': 6, 'chase': 7, 'mice': 8, 'enjoys': 9, 'naps': 10, 'in': 11, 'afternoon': 12, 'sunlight': 13, 'sometimes': 14, 'watches': 15, 'birds': 16, 'from': 17, 'window': 18}

Test Sequence =  [[1, 1, 1, 1, 1], [1, 4, 5, 1, 1]]

Padded Test Sequence: 
[[0 0 0 0 0 1 1 1 1 1]
 [0 0 0 0 0 1 4 5 1 1]]
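
If you keep an existing `word_index` for new text, it can help to check how many tokens fall back to the OOV id. The sketch below computes that fraction for the test sequences printed above; `oov_rate` is a hypothetical helper, and it assumes the OOV token has index `1`, as in this notebook.

```python
def oov_rate(sequences, oov_index=1):
    """Hypothetical helper: fraction of tokens mapped to the OOV id."""
    total = sum(len(seq) for seq in sequences)
    if total == 0:
        return 0.0
    oov = sum(seq.count(oov_index) for seq in sequences)
    return oov / total


# Unpadded test sequences from the cell above.
test_seq = [[1, 1, 1, 1, 1], [1, 4, 5, 1, 1]]
print(oov_rate(test_seq))
```

Here 8 of the 10 tokens are unknown, which suggests the vocabulary built from the three training sentences covers the new text poorly. Compute the rate on the unpadded sequences so that padding zeros do not dilute it.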
