
# Pad Sequences, Out-of-vocabulary tokens, and pre/post sequence truncation

In [1]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tf.__version__

'2.3.0'

In [2]:
sentences = [
  'I love my dog',
  'Cats are more independent than dogs',
  'I, love my cat',
  'You love my dog!',
  'Do you think my dog is amazing?',
  'I am constantly thinking about my dog.',
  'I am loving my dog.',
  'She loves her cat'
]

tokenizer = Tokenizer(num_words=100, oov_token='<oob>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
word_index

{'<oob>': 1,
 'about': 21,
 'am': 8,
 'amazing': 18,
 'are': 10,
 'cat': 6,
 'cats': 9,
 'constantly': 19,
 'do': 15,
 'dog': 3,
 'dogs': 14,
 'her': 25,
 'i': 4,
 'independent': 12,
 'is': 17,
 'love': 5,
 'loves': 24,
 'loving': 22,
 'more': 11,
 'my': 2,
 'she': 23,
 'than': 13,
 'think': 16,
 'thinking': 20,
 'you': 7}

In [3]:
sequences = tokenizer.texts_to_sequences(sentences)
sequences

[[4, 5, 2, 3],
 [9, 10, 11, 12, 13, 14],
 [4, 5, 2, 6],
 [7, 5, 2, 3],
 [15, 7, 16, 2, 3, 17, 18],
 [4, 8, 19, 20, 21, 2, 3],
 [4, 8, 22, 2, 3],
 [23, 24, 25, 6]]
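To read these sequences back as words, you can invert the word index. The snippet below is a minimal pure-Python sketch that uses a hand-copied subset of the `word_index` printed above; the `decode` helper is a hypothetical name introduced here for illustration. The Keras `Tokenizer` also exposes an `index_word` attribute and a `sequences_to_texts` method for the same purpose.

```python
# Subset of the word_index shown above, copied by hand for illustration.
word_index = {'<oob>': 1, 'my': 2, 'dog': 3, 'i': 4, 'love': 5, 'cat': 6}

# Invert the mapping: index -> word.
index_word = {index: word for word, index in word_index.items()}

def decode(sequence, index_word):
    """Hypothetical helper: turn a list of token ids back into a string."""
    # Ids missing from the mapping are rendered as '?'.
    return ' '.join(index_word.get(i, '?') for i in sequence)

decoded = [decode(s, index_word) for s in [[4, 5, 2, 3], [4, 5, 2, 6]]]
print(decoded)  # ['i love my dog', 'i love my cat']
```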

## Padding

When training neural networks, you typically need all your data to be the same shape. Text poses a problem here: once you've tokenized your words and converted your sentences into sequences, they can all be different lengths. To get them to be the same size and shape, you can use padding.

In [4]:
padded = pad_sequences(sequences, maxlen=5)
padded

array([[ 0,  4,  5,  2,  3],
       [10, 11, 12, 13, 14],
       [ 0,  4,  5,  2,  6],
       [ 0,  7,  5,  2,  3],
       [16,  2,  3, 17, 18],
       [19, 20, 21,  2,  3],
       [ 4,  8, 22,  2,  3],
       [ 0, 23, 24, 25,  6]], dtype=int32)

In [5]:
test_data = [
  'i really love my new puppy',
  'my dog loves my manatee',
  'cats are cool'
]

test_seq = tokenizer.texts_to_sequences(test_data)
test_seq

[[4, 1, 5, 2, 1, 1], [2, 3, 24, 2, 1], [9, 10, 1]]
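Every `1` in the output above is the out-of-vocabulary token: words such as "really", "new", "puppy", "manatee", and "cool" never appeared in the training sentences, so they map to the index of `<oob>`. The following is a rough pure-Python sketch of that lookup, not the Keras implementation. The `encode` helper is a hypothetical name, the dictionary is a hand-copied subset of `word_index`, and the sketch only lowercases and splits on whitespace, whereas the real `Tokenizer` also filters punctuation.

```python
# Subset of the fitted word_index, copied by hand for illustration.
word_index = {'<oob>': 1, 'my': 2, 'dog': 3, 'i': 4, 'love': 5,
              'cats': 9, 'are': 10, 'loves': 24}

def encode(sentence, word_index, oov_index=1):
    """Hypothetical helper: map each word to its id, or to the OOV id if unseen."""
    return [word_index.get(word, oov_index) for word in sentence.lower().split()]

encoded = [encode(s, word_index) for s in [
    'i really love my new puppy',
    'my dog loves my manatee',
    'cats are cool',
]]
print(encoded)  # [[4, 1, 5, 2, 1, 1], [2, 3, 24, 2, 1], [9, 10, 1]]
```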

If you want to make these the same length, you can use the `pad_sequences` API. It was already imported in the first cell; in a fresh session you would import it with:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
```

In [6]:
padded = pad_sequences(test_seq, maxlen=10)
padded

array([[ 0,  0,  0,  0,  4,  1,  5,  2,  1,  1],
       [ 0,  0,  0,  0,  0,  2,  3, 24,  2,  1],
       [ 0,  0,  0,  0,  0,  0,  0,  9, 10,  1]], dtype=int32)

You might have noticed that the requisite number of zeros was added at the beginning of each sequence to bring it up to the requested length (`maxlen=10`). This is called prepadding, and it’s the default behavior. You can change this using the `padding` parameter. For example, if you want your sequences to be padded with zeros at the end, you can use:

In [7]:
post_padded = pad_sequences(test_seq, maxlen=10, padding='post')
post_padded

array([[ 4,  1,  5,  2,  1,  1,  0,  0,  0,  0],
       [ 2,  3, 24,  2,  1,  0,  0,  0,  0,  0],
       [ 9, 10,  1,  0,  0,  0,  0,  0,  0,  0]], dtype=int32)

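With `maxlen=10`, every sequence carries a lot of padding, because the longest one has only six tokens. If you set `maxlen` shorter than the longest sequence, for example `maxlen=5`, your padded sequences are still all the same length and there isn’t too much padding, but you lose words from your longest sentence. By default, they’re truncated from the beginning. What if you don’t want to lose the words from the beginning and instead want them truncated from the end of the sentence? You can control this with the `truncating` parameter. The next two cells show `'pre'` (the default) and then `'post'`: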

In [8]:
pre_padding_truncating = pad_sequences(test_seq, maxlen=5, truncating='pre')
pre_padding_truncating

array([[ 1,  5,  2,  1,  1],
       [ 2,  3, 24,  2,  1],
       [ 0,  0,  9, 10,  1]], dtype=int32)

In [9]:
post_padding_truncating = pad_sequences(test_seq, maxlen=5, truncating='post')
post_padding_truncating

array([[ 4,  1,  5,  2,  1],
       [ 2,  3, 24,  2,  1],
       [ 0,  0,  9, 10,  1]], dtype=int32)
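To make the interaction between `padding` and `truncating` concrete, here is a minimal pure-Python sketch of the behavior shown in the cells above. It is not the Keras implementation: `pad_sketch` is a hypothetical helper that returns plain lists rather than a NumPy array and supports only the `maxlen`, `padding`, `truncating`, and `value` options.

```python
def pad_sketch(sequences, maxlen, padding='pre', truncating='pre', value=0):
    """Hypothetical helper mimicking the pad/truncate behavior shown above."""
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops tokens from the front, 'post' drops them from the end.
            seq = seq[len(seq) - maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the fill values in front, 'post' appends them.
        result.append(fill + seq if padding == 'pre' else seq + fill)
    return result

test_seq = [[4, 1, 5, 2, 1, 1], [2, 3, 24, 2, 1], [9, 10, 1]]

pre_truncated = pad_sketch(test_seq, maxlen=5, truncating='pre')
post_truncated = pad_sketch(test_seq, maxlen=5, truncating='post')
print(pre_truncated)   # [[1, 5, 2, 1, 1], [2, 3, 24, 2, 1], [0, 0, 9, 10, 1]]
print(post_truncated)  # [[4, 1, 5, 2, 1], [2, 3, 24, 2, 1], [0, 0, 9, 10, 1]]
```

Only the first sequence differs between the two results, since it is the only one longer than `maxlen`.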

## Removing Stopwords and Cleaning Text
There's often text that you don’t want in your dataset. You may want to filter out so-called `stopwords` that are too common and don’t add any meaning, like **“the,” “and,” and “but.”** You may also encounter a lot of HTML tags in your text, and it would be good to have a clean way to remove them. Other things you might want to filter out include **rude words, punctuation, or names**. Later we’ll explore a dataset of tweets, which often have somebody’s user ID in them, and we’ll want to filter those out.
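As a rough illustration of stopword filtering, the sketch below drops the three example words mentioned above. The `STOPWORDS` set and the `remove_stopwords` helper are hypothetical names introduced here. A real stopword list would be much longer, and this version only lowercases and splits on whitespace.

```python
# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "and", "but"}

def remove_stopwords(sentence, stopwords=STOPWORDS):
    """Hypothetical helper: lowercase, split on whitespace, drop stopwords."""
    words = sentence.lower().split()
    return " ".join(word for word in words if word not in stopwords)

cleaned = remove_stopwords("The cat and the dog are friends but they fight")
print(cleaned)  # cat dog are friends they fight
```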

While every task is different based on your corpus of text, there are three main things that you can do to clean up your text programmatically.

> ***First, strip out HTML tags***. Fortunately, there’s a library called BeautifulSoup that makes this straightforward. For example, if your sentences contain HTML tags such as `<br>`, they’ll be removed by this code:

In [10]:
from bs4 import BeautifulSoup