# Content
This notebook is the third part of Data Preparation in NLP using Keras.

In this notebook we will see how to pad sequences.

Before explaining padding, let's repeat the necessary steps first.<br>
1. Create training and test texts
2. Create tokenizer with num_words and oov_token parameters
3. Fit the tokenizer on the training corpus
4. Convert training and test texts to sequences
5. Check the word_index and resulting sequences

1. Create training and test texts

In [2]:
from keras.preprocessing.text import Tokenizer
train_texts = ["If you take the blue pill.",
             "The story ends.",
             "You choose"]

In [3]:
test_texts = ["If you take the red pill.",
             "You stay in wonderland.",
             "You must choose"] 

2. Create tokenizer with num_words and oov_token parameters

In [4]:
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")

3. Fit the tokenizer on the training corpus

In [5]:
tokenizer.fit_on_texts(train_texts)

4. Convert training and test texts to sequences

In [6]:
train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

5. Check the word_index and resulting sequences

In [7]:
print("tokenizer.word_index   : ",tokenizer.word_index)

tokenizer.word_index   :  {'<OOV>': 1, 'you': 2, 'the': 3, 'if': 4, 'take': 5, 'blue': 6, 'pill': 7, 'story': 8, 'ends': 9, 'choose': 10}


In [11]:
for train_text, train_seq in zip(train_texts,train_sequences):
    print(train_text)
    print(train_seq)

If you take the blue pill.
[4, 2, 5, 3, 6, 7]
The story ends.
[3, 8, 9]
You choose
[2, 10]


In [10]:
for test_text, test_seq in zip(test_texts,test_sequences):
    print(test_text)
    print(test_seq)

If you take the red pill.
[4, 2, 5, 3, 1, 7]
You stay in wonderland.
[2, 1, 1, 1]
You must choose
[2, 1, 10]
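
To make the role of the `<OOV>` token concrete, here is a minimal pure-Python sketch of the lookup that produces the test sequences above. The helper `to_sequence_sketch` is hypothetical and simplified: it strips all punctuation and ignores `num_words`, so it is not the Tokenizer's actual implementation.

```python
import string

# word_index copied from the tokenizer output above
word_index = {'<OOV>': 1, 'you': 2, 'the': 3, 'if': 4, 'take': 5,
              'blue': 6, 'pill': 7, 'story': 8, 'ends': 9, 'choose': 10}

def to_sequence_sketch(text, word_index, oov_token="<OOV>"):
    """Hypothetical helper: lowercase, strip punctuation, split on whitespace,
    then map each word to its index (unknown words get the OOV index)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    oov_index = word_index[oov_token]
    return [word_index.get(word, oov_index) for word in cleaned.split()]

test_texts = ["If you take the red pill.",
              "You stay in wonderland.",
              "You must choose"]

sketch_sequences = [to_sequence_sketch(text, word_index) for text in test_texts]
print(sketch_sequences)
# [[4, 2, 5, 3, 1, 7], [2, 1, 1, 1], [2, 1, 10]]
```

Words that never appeared in the training corpus ("red", "stay", "in", "wonderland", "must") all map to index 1.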


# **Padding sequences**

As we saw above, the lengths of the sequences vary with the content. However, neural networks expect inputs of the same length.<br> To give all sequences the same length, we pad them.<br>
First we have to import pad_sequences from tensorflow.keras.utils (older code often imports it from tensorflow.keras.preprocessing.sequence).

* This function is used not only in NLP but also in other sequence models (time series, etc.). <br>
* We will use it to transform our sequences into a 2D NumPy array of shape **(number_of_sentences, number_of_words)**, where number_of_words is the padded sequence length.

If we don't provide the maxlen argument, the length of the padded sequences will be the length of the longest sequence in the list.<br>
Shorter sequences will be padded with zeros at the start (left).

In [None]:
from tensorflow.keras.utils import pad_sequences
train_sequences_padded = pad_sequences(train_sequences)
test_sequences_padded = pad_sequences(test_sequences)
print("train_sequences_padded :\n",train_sequences_padded)
print("test_sequences_padded  :\n",test_sequences_padded)

train_sequences_padded :
 [[ 4  2  5  3  6  7]
 [ 0  0  0  3  8  9]
 [ 0  0  0  0  2 10]]
test_sequences_padded  :
 [[ 4  2  5  3  1  7]
 [ 0  0  2  1  1  1]
 [ 0  0  0  2  1 10]]


<img src="./Images/Tokenizer_7.jpg"/>
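
The default behaviour can be sketched in a few lines of plain Python. The helper `pad_left_sketch` is a hypothetical name used only for illustration; it returns nested lists rather than a NumPy array and only mimics the default settings.

```python
# Training sequences copied from the output above
train_sequences = [[4, 2, 5, 3, 6, 7], [3, 8, 9], [2, 10]]

def pad_left_sketch(sequences, value=0):
    """Hypothetical helper: left-pad every sequence to the longest length."""
    longest = max(len(seq) for seq in sequences)
    return [[value] * (longest - len(seq)) + list(seq) for seq in sequences]

padded = pad_left_sketch(train_sequences)
for row in padded:
    print(row)
# [4, 2, 5, 3, 6, 7]
# [0, 0, 0, 3, 8, 9]
# [0, 0, 0, 0, 2, 10]

# Every row has the same length, so the result is rectangular:
shape = (len(padded), len(padded[0]))
print(shape)  # (3, 6)
```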

# **maxlen** parameter
Using the **maxlen** parameter, we can specify the sequence length.

In [None]:
train_sequences_padded = pad_sequences(train_sequences,maxlen = 10)
test_sequences_padded = pad_sequences(test_sequences, maxlen = 10)
print("train_sequences_padded :\n",train_sequences_padded)
print("test_sequences_padded  :\n",test_sequences_padded)

train_sequences_padded :
 [[ 0  0  0  0  4  2  5  3  6  7]
 [ 0  0  0  0  0  0  0  3  8  9]
 [ 0  0  0  0  0  0  0  0  2 10]]
test_sequences_padded  :
 [[ 0  0  0  0  4  2  5  3  1  7]
 [ 0  0  0  0  0  0  2  1  1  1]
 [ 0  0  0  0  0  0  0  2  1 10]]


If **maxlen** is smaller than the length of a sequence, that sequence will be truncated from its beginning.<br>
Notice that the first sequences in the train and test sets are truncated: each loses its first element.

In [None]:
train_sequences_padded = pad_sequences(train_sequences, maxlen = 5)
test_sequences_padded = pad_sequences(test_sequences, maxlen = 5)
print("train_sequences_padded :\n",train_sequences_padded)
print("test_sequences_padded  :\n",test_sequences_padded)

train_sequences_padded :
 [[ 2  5  3  6  7]
 [ 0  0  3  8  9]
 [ 0  0  0  2 10]]
test_sequences_padded  :
 [[ 2  5  3  1  7]
 [ 0  2  1  1  1]
 [ 0  0  2  1 10]]


<img src="./Images/Tokenizer_8.jpg"/>
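
As a rough sketch of what happens to a single sequence under the default settings, the hypothetical helper below keeps the last `maxlen` items and then left-pads with zeros if the sequence is still too short. It illustrates the behaviour described above and is not the library's code.

```python
def fit_to_maxlen_sketch(seq, maxlen, value=0):
    """Hypothetical helper mirroring the defaults: drop items from the
    beginning when the sequence is too long, pad on the left when too short."""
    seq = list(seq)
    start = max(len(seq) - maxlen, 0)   # how many leading items to drop
    kept = seq[start:]
    return [value] * (maxlen - len(kept)) + kept

long_result = fit_to_maxlen_sketch([4, 2, 5, 3, 6, 7], maxlen=5)
short_result = fit_to_maxlen_sketch([2, 10], maxlen=5)
print(long_result)   # [2, 5, 3, 6, 7]  (the leading 4 is dropped)
print(short_result)  # [0, 0, 0, 2, 10]
```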

# **truncating** parameter
Using the **truncating** parameter, we can choose whether values are removed from the beginning or from the end of sequences longer than maxlen.<br>
The default option is **"pre"**, as we saw above when maxlen was shorter than the first sequence.<br>
Let's use the **"post"** option and see the difference in the first sequence of each set.

In [None]:
train_sequences_padded = pad_sequences(train_sequences,maxlen = 5, truncating = "post")
test_sequences_padded = pad_sequences(test_sequences,maxlen = 5, truncating = "post")
print("train_sequences_padded :\n",train_sequences_padded)
print("test_sequences_padded  :\n",test_sequences_padded)

train_sequences_padded :
 [[ 4  2  5  3  6]
 [ 0  0  3  8  9]
 [ 0  0  0  2 10]]
test_sequences_padded  :
 [[ 4  2  5  3  1]
 [ 0  2  1  1  1]
 [ 0  0  2  1 10]]


<img src="./Images/Tokenizer_9.jpg"/>
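
In terms of plain list slicing, the two truncating options differ only in which end of the sequence is kept. A minimal illustration on the first training sequence (the variable names are chosen here for illustration):

```python
first_train_sequence = [4, 2, 5, 3, 6, 7]
maxlen = 5

# "pre": drop items from the beginning, keep the last maxlen items
pre_truncated = first_train_sequence[len(first_train_sequence) - maxlen:]

# "post": drop items from the end, keep the first maxlen items
post_truncated = first_train_sequence[:maxlen]

print(pre_truncated)   # [2, 5, 3, 6, 7]
print(post_truncated)  # [4, 2, 5, 3, 6]
```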

# **padding** parameter

Using the padding parameter, we can choose whether to pad before or after each sequence.<br> The default option is **"pre"**, as we have seen above.<br>
Let's use the **"post"** option and see the difference.

In [None]:
train_sequences_padded = pad_sequences(train_sequences,maxlen = 5, truncating = "post", padding = "post")
test_sequences_padded = pad_sequences(test_sequences,maxlen = 5, truncating = "post",padding = "post")
print("train_sequences_padded :\n",train_sequences_padded)
print("test_sequences_padded  :\n",test_sequences_padded)

train_sequences_padded :
 [[ 4  2  5  3  6]
 [ 3  8  9  0  0]
 [ 2 10  0  0  0]]
test_sequences_padded  :
 [[ 4  2  5  3  1]
 [ 2  1  1  1  0]
 [ 2  1 10  0  0]]


<img src="./Images/Tokenizer_10.jpg"/>
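
Putting the three parameters together, the whole procedure can be approximated with the pure-Python sketch below. `pad_sequences_sketch` is a hypothetical stand-in written for this notebook: it returns nested lists instead of a NumPy array and skips the dtype handling and input validation of the real function.

```python
def pad_sequences_sketch(sequences, maxlen=None, padding="pre",
                         truncating="pre", value=0):
    """Hypothetical re-implementation of the behaviour shown in this notebook."""
    if maxlen is None:
        # Without maxlen, use the length of the longest sequence
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            if truncating == "pre":
                seq = seq[len(seq) - maxlen:]   # drop from the beginning
            else:
                seq = seq[:maxlen]              # drop from the end
        pad = [value] * (maxlen - len(seq))
        result.append(pad + seq if padding == "pre" else seq + pad)
    return result

train_sequences = [[4, 2, 5, 3, 6, 7], [3, 8, 9], [2, 10]]

pre_pre = pad_sequences_sketch(train_sequences, maxlen=5)
post_post = pad_sequences_sketch(train_sequences, maxlen=5,
                                 padding="post", truncating="post")
print(pre_pre)    # [[2, 5, 3, 6, 7], [0, 0, 3, 8, 9], [0, 0, 0, 2, 10]]
print(post_post)  # [[4, 2, 5, 3, 6], [3, 8, 9, 0, 0], [2, 10, 0, 0, 0]]
```

Note that padding and truncating are independent choices: one controls where zeros are added to short sequences, the other where items are removed from long ones.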

In this notebook series we have learnt how to prepare textual data as inputs for neural networks using the Keras library. <br>
In the next series we will use these techniques when working with recurrent neural networks.

Note: We will usually set both the truncating and padding parameters to "post".