#Basic preprocessing for text data
In most NLP tasks, the first step in preparing your data is to extract a vocabulary of words from your corpus (i.e. the input texts). You then need to define how to convert the texts into numerical representations that can be used to train a neural network. These representations are called tokens, and TensorFlow and Keras make it easy to generate them through their APIs. You will see how to do that in the next cells.

#Generating the vocabulary
In this notebook, you will first look at how to build a lookup dictionary that maps each word to an integer. The code below takes a list of sentences and assigns an integer index to each word in them. This is done with the fit_on_texts() method, and you can inspect the result through the word_index property. More frequent words get a lower index.

In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer

input_sentences=[
    'I love my dog',
    'You love pizza'
]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = 100)

# Generate indices for each word in the corpus
tokenizer.fit_on_texts(input_sentences)
# Get the indices and print them
word_index=tokenizer.word_index
print(word_index)

{'love': 1, 'i': 2, 'my': 3, 'dog': 4, 'you': 5, 'pizza': 6}
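
To make the frequency-based ordering concrete, here is a minimal plain-Python sketch of how such an index could be built. It is not the Keras implementation: build_word_index is a hypothetical helper, and it assumes simple whitespace splitting with no punctuation handling.

```python
from collections import Counter

def build_word_index(sentences):
    """Hypothetical helper: rank words by frequency, starting at index 1."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and split on whitespace (assumption: no punctuation present)
        counts.update(sentence.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

sketch_index = build_word_index(['I love my dog', 'You love pizza'])
print(sketch_index)
```

Because 'love' appears twice, it receives index 1, and the remaining words are numbered in the order they were first seen.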


The num_words parameter used in the initializer limits the vocabulary used when generating sequences: only the num_words - 1 most frequent words are kept. You will see this in a later exercise. For now, the important thing to note is that it does not affect how the word_index dictionary is generated. You can try passing 1 instead of 100, as shown in the next cell, and word_index will still contain every word in the corpus.

Also notice that, by default, all punctuation is ignored and words are converted to lower case. You can override these behaviors by modifying the filters and lower arguments of the Tokenizer class, as described in the Tokenizer documentation. You can try modifying these in the next cell and compare the output to the one generated above.

In [2]:
tokenizer=Tokenizer(num_words=1)

sentences=[
    'I love rain',
    'Harry Potter saves Hogwarts!'
]
tokenizer.fit_on_texts(sentences)
word_index=tokenizer.word_index
print(word_index)

{'i': 1, 'love': 2, 'rain': 3, 'harry': 4, 'potter': 5, 'saves': 6, 'hogwarts': 7}
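
The effect of the filters and lower arguments can be sketched in plain Python. This is only an illustration of the idea, not the library code: split_words is a hypothetical helper, and DEFAULT_FILTERS is an assumption intended to mirror the Tokenizer default.

```python
# Assumption: this string mirrors the default `filters` value of Tokenizer
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text, filters=DEFAULT_FILTERS, lower=True):
    """Hypothetical helper: normalize a sentence and split it into words."""
    if lower:
        text = text.lower()
    # Replace every filtered character with a space, then split on whitespace
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.translate(table).split()

default_words = split_words('Harry Potter saves Hogwarts!')
raw_words = split_words('Harry Potter saves Hogwarts!', filters='', lower=False)
print(default_words)
print(raw_words)
```

With the defaults, the exclamation mark is stripped and everything is lowercased. With filters='' and lower=False, 'Hogwarts!' stays a distinct, case-sensitive token.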


#Generating Sequences and Padding
You need to prepare text data with a uniform size before feeding it to your model.



##Text to Sequences
You saw how to generate a word_index dictionary that assigns a token to each word in your corpus. You can then use the result to convert each of the input sentences into a sequence of tokens. This is done with the texts_to_sequences() method, as shown below.

In [6]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

# Initialize the Tokenizer class.
# oov_token is the out-of-vocabulary token: it stands in for words (e.g. in
# test data) that were not seen during fitting and so have no word_index entry.
# Note: with a very small num_words (like 1), word_index is built the same as
# before, but the sequences no longer contain the real word indices.
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")

# Tokenize the input sentences
tokenizer.fit_on_texts(sentences)

# Get the word index dictionary
word_index = tokenizer.word_index

sequences=tokenizer.texts_to_sequences(sentences)
print(word_index)
print(sequences)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
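
The lookup step, and the way num_words limits it, can be sketched in plain Python. This is a simplified model of the behavior rather than the Keras source: to_sequences is a hypothetical helper, and it assumes input without punctuation.

```python
def to_sequences(sentences, word_index, num_words=None, oov_index=None):
    """Hypothetical helper: map each word to its index, honoring num_words."""
    sequences = []
    for sentence in sentences:
        seq = []
        for word in sentence.lower().split():
            index = word_index.get(word)
            # A word is usable only if it is known and below the num_words cutoff
            if index is not None and (num_words is None or index < num_words):
                seq.append(index)
            elif oov_index is not None:
                seq.append(oov_index)
            # With no OOV index, the word is silently dropped
        sequences.append(seq)
    return sequences

word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

full = to_sequences(['I love my dog'], word_index, num_words=100, oov_index=1)
limited = to_sequences(['I love my dog'], word_index, num_words=4, oov_index=1)
dropped = to_sequences(['I love my dog'], word_index, num_words=4)
print(full)     # every word keeps its own index
print(limited)  # words at index 4 or above become the OOV index
print(dropped)  # without an OOV index, those words disappear
```

In this sketch, num_words=4 keeps only indices 1 to 3, so 'i' and 'dog' are either replaced by the OOV index or removed, while word_index itself is unchanged.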


##Padding
As mentioned in the lecture, you will usually need to pad the sequences to a uniform length because that is what your model expects. You can use the pad_sequences() function for that. By default, it pads with zeros at the start of each sequence, up to the length of the longest sequence. You can override this with the maxlen argument to define a specific length. Feel free to play with the other arguments shown in class and compare the results.

In [10]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded=pad_sequences(sequences)
print('Padding:Pre')
print(padded)
padded_post=pad_sequences(sequences,padding="post") #can also examine with maxlen and truncating params
print('Padding:Post')
print(padded_post)

Padding:Pre
[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]
Padding:Post
[[ 5  3  2  4  0  0  0]
 [ 5  3  2  7  0  0  0]
 [ 6  3  2  4  0  0  0]
 [ 8  6  9  2  4 10 11]]
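
To see how maxlen, padding and truncating interact, here is a minimal plain-Python sketch of the padding logic. It returns nested lists rather than a NumPy array, and pad is a hypothetical helper whose defaults are assumed to match those of pad_sequences.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Hypothetical helper: pad or truncate every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops items from the front, 'post' drops them from the end
            seq = seq[len(seq) - maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == 'pre' else seq + fill)
    return padded

sequences = [[5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

pre_default = pad(sequences)
pre_short = pad(sequences, maxlen=5)
post_short = pad(sequences, maxlen=5, padding='post', truncating='post')
print(pre_default)
print(pre_short)
print(post_short)
```

When maxlen is shorter than a sequence, truncating decides which end loses tokens, which matters if the most informative words tend to sit at the start or the end of your texts.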


##Out-of-vocabulary tokens
Notice that you defined an oov_token when the Tokenizer was initialized earlier. It is used whenever an input word is not found in the word_index dictionary. For example, you may collect more text after your initial training and decide not to regenerate the word_index. You will see this in action in the cell below. Notice that the token 1 is inserted for words that are not found in the dictionary.

In [11]:
# Try with words that the tokenizer wasn't fit to
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

# Generate the sequences
test_seq = tokenizer.texts_to_sequences(test_data)

# Print the word index dictionary
print("\nWord Index = " , word_index)

# Print the sequences with OOV
print("\nTest Sequence = ", test_seq)

# Print the padded result
padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded Test Sequence: ")
print(padded)


Word Index =  {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

Test Sequence =  [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]

Padded Test Sequence: 
[[0 0 0 0 0 5 1 3 2 4]
 [0 0 0 0 0 2 4 1 2 1]]
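
One way to check which words were replaced is to invert the dictionary and decode the sequences back into text. The following is a small sketch under that idea: the names index_word and decoded are chosen for illustration, and the values are copied from the output above.

```python
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
test_seq = [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]

# Invert the mapping so each index points back to its word
index_word = {index: word for word, index in word_index.items()}

# Rebuild each sentence from its token sequence
decoded = [' '.join(index_word[i] for i in seq) for seq in test_seq]
print(decoded)
```

The decoded text shows that 'really', 'loves' and 'manatee' were all mapped to <OOV>. Note that 'loves' is not matched to 'love', since the lookup works on exact words.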
