<a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/deeplearning.ai/tf/c3_w1_padding_oov_sequences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pad Sequences, Out-of-vocabulary tokens, and pre/post sequence truncation

In [16]:
import tensorflow as tf
import tensorflow_datasets as tfds

from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tf.__version__

'2.3.0'

In [2]:
sentences = [
  'I love my dog',
  'Cats are more independent than dogs',
  'I, love my cat',
  'You love my dog!',
  'Do you think my dog is amazing?',
  'I am constantly thinking about my dog.',
  'I am loving my dog.',
  'She loves her cat'
]

tokenizer = Tokenizer(num_words=100, oov_token='<oob>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
word_index

{'<oob>': 1,
 'about': 21,
 'am': 8,
 'amazing': 18,
 'are': 10,
 'cat': 6,
 'cats': 9,
 'constantly': 19,
 'do': 15,
 'dog': 3,
 'dogs': 14,
 'her': 25,
 'i': 4,
 'independent': 12,
 'is': 17,
 'love': 5,
 'loves': 24,
 'loving': 22,
 'more': 11,
 'my': 2,
 'she': 23,
 'than': 13,
 'think': 16,
 'thinking': 20,
 'you': 7}

In [3]:
sequences = tokenizer.texts_to_sequences(sentences)
sequences

[[4, 5, 2, 3],
 [9, 10, 11, 12, 13, 14],
 [4, 5, 2, 6],
 [7, 5, 2, 3],
 [15, 7, 16, 2, 3, 17, 18],
 [4, 8, 19, 20, 21, 2, 3],
 [4, 8, 22, 2, 3],
 [23, 24, 25, 6]]

## Padding

When training neural networkds you typically need all your data to be on the same shape. With neural networks you face the same issue --once you've tokenized your words and converted your sentences into sequences, they can all be different lenghts. To get them to be the same size and shape, you can use padding.

In [4]:
padded = pad_sequences(sequences, maxlen=5)
padded

array([[ 0,  4,  5,  2,  3],
       [10, 11, 12, 13, 14],
       [ 0,  4,  5,  2,  6],
       [ 0,  7,  5,  2,  3],
       [16,  2,  3, 17, 18],
       [19, 20, 21,  2,  3],
       [ 4,  8, 22,  2,  3],
       [ 0, 23, 24, 25,  6]], dtype=int32)

In [5]:
test_data = [
  'i really love my new puppy',
  'my dog loves my manatee',
  'cats are cool'
]

test_seq = tokenizer.texts_to_sequences(test_data)
test_seq

[[4, 1, 5, 2, 1, 1], [2, 3, 24, 2, 1], [9, 10, 1]]

If you want to make these the same length, you can use the pad_sequences API. First, you’ll need to import it:

```python
from tensoflow.keras.preprocessing.sequence import pad_sequences
```

In [6]:
padded = pad_sequences(test_seq, maxlen=10)
padded

array([[ 0,  0,  0,  0,  4,  1,  5,  2,  1,  1],
       [ 0,  0,  0,  0,  0,  2,  3, 24,  2,  1],
       [ 0,  0,  0,  0,  0,  0,  0,  9, 10,  1]], dtype=int32)

First, you might have noticed that in the case of the shorter sentences, to get them to be the same shape as the longest one, the requisite number of zeros were added at the beginning. This is called prepadding, and it’s the default behavior. You can change this using the `padding` parameter. For example, if you want your sequences to be padded with zeros at the end, you can use:

In [7]:
post_padded = pad_sequences(test_seq, maxlen=10, padding='post')
post_padded

array([[ 4,  1,  5,  2,  1,  1,  0,  0,  0,  0],
       [ 2,  3, 24,  2,  1,  0,  0,  0,  0,  0],
       [ 9, 10,  1,  0,  0,  0,  0,  0,  0,  0]], dtype=int32)

Now your padded sequences are all the same length, and there isn’t too much padding. You have lost some words from your longest sentence, though, and they’ve been truncated from the beginning. What if you don’t want to lose the words from the beginning and instead want them truncated from the end of the sentence? You can override the default behavior with the `truncating` parameter, as follows:

In [8]:
pre_padding_truncating = pad_sequences(test_seq, maxlen=5, truncating='pre')
pre_padding_truncating

array([[ 1,  5,  2,  1,  1],
       [ 2,  3, 24,  2,  1],
       [ 0,  0,  9, 10,  1]], dtype=int32)

In [9]:
post_padding_truncating = pad_sequences(test_seq, maxlen=5, truncating='post')
post_padding_truncating

array([[ 4,  1,  5,  2,  1],
       [ 2,  3, 24,  2,  1],
       [ 0,  0,  9, 10,  1]], dtype=int32)

## Removing Stopwords and Cleaning Text
There's often text that you don’t want in your dataset. You may want to filter out so-called `stopwords` that are too common and don’t add any meaning, like **“the,” “and,” and “but.”** You may also encounter a lot of HTML tags in your text, and it would be good to have a clean way to remove them. Other things you might want to filter out include **rude words, punctuation, or names**. Later we’ll explore a dataset of tweets, which often have somebody’s user ID in them, and we’ll want to filter those out.

While every task is different based on your corpus of text, there are three main things that you can do to clean up your text programmatically.

> ***First, strip out HTML tags***. Fortunately, there’s a library called BeautifulSoup that makes this straightforward. For example, if your sentences contain HTML tags such as <\br>, they’ll be removed by this code:

In [10]:
from bs4 import BeautifulSoup
sentence = '''
Natural language processing is a subfield of linguistics, computer science, and 
artificial intelligence concerned with the interactions between computers and 
human language, in particular how to program computers to process and analyze 
large amounts of natural language data.
'''

soup = BeautifulSoup(sentence)
sentence = soup.get_text()
sentence

'Natural language processing is a subfield of linguistics, computer science, and \nartificial intelligence concerned with the interactions between computers and \nhuman language, in particular how to program computers to process and analyze \nlarge amounts of natural language data.\n'

A common way to remove `stopwords` is to have a stopwords list and to preprocess your sentences, removing instances of stopwords. Here’s an example:

In [11]:
stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at",
             "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do",
             "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having",
             "he", "hed", "hes", "her", "here", "heres", "hers", "herself", "him", "himself", "his", "how",
             "hows", "i", "id", "ill", "im", "ive", "if", "in", "into", "is", "it", "its", "itself",
             "lets", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought",
             "our", "ours", "ourselves", "out", "over", "own", "same", "she", "shed", "shell", "shes", "should",
             "so", "some", "such", "than", "that", "thats", "the", "their", "theirs", "them", "themselves", "then",
             "there", "theres", "these", "they", "theyd", "theyll", "theyre", "theyve", "this", "those", "through",
             "to", "too", "under", "until", "up", "very", "was", "we", "wed", "well", "were", "weve", "were",
             "what", "whats", "when", "whens", "where", "wheres", "which", "while", "who", "whos", "whom", "why",
             "whys", "with", "would", "you", "youd", "youll", "youre", "youve", "your", "yours", "yourself",
             "yourselves"]

In [12]:
len(stopwords)

151

Then, as you are iterating through your sentences, you can use code like this to remove the stopwords from your sentence.

In [13]:
words = sentence.split()
filtered_sentence = ''
for word in words:
  if word not in stopwords:
    filtered_sentence += word + ' '
filtered_sentence

'Natural language processing subfield linguistics, computer science, artificial intelligence concerned interactions computers human language, particular program computers process analyze large amounts natural language data. '

Another thing you might consider is stripping out punctuation, which can fool a stopword remover. 

In [14]:
import string

table = str.maketrans('', '', string.punctuation)
words = sentence.split()
filtered_sentence = ''
for word in words:
  word = word.translate(table)
  if word not in stopwords:
    filtered_sentence += word + ' '
filtered_sentence

'Natural language processing subfield linguistics computer science artificial intelligence concerned interactions computers human language particular program computers process analyze large amounts natural language data '

## Working with Real Data Sources

## Getting Text from TensorFlow datasets

This code will load the training split from the IMDb dataset and iterate through it, adding the text field containing the review to a list called imdb_sentences. Reviews are a tuple of the text and a label containing the sentiment of the review. Note that by wrapping the tfds.load call in tfds.as_numpy you ensure that the data will be loaded as strings, not tensors:

In [21]:
imdb_sentences = []
train_data = tfds.as_numpy(tfds.load('imdb_reviews', split='train'))
for item in train_data:
  imdb_sentences.append(str(item['text']))
print('# Sentences:', len(imdb_sentences))
print('Samples:', imdb_sentences[0], imdb_sentences[1])

# Sentences: 25000
Samples: b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it." b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbi

In [31]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=5000, oov_token='<oov>')
tokenizer.fit_on_texts(imdb_sentences)
sequences = tokenizer.texts_to_sequences(imdb_sentences)
list(tokenizer.word_index.items())[:5]

[('<oov>', 1), ('the', 2), ('and', 3), ('a', 4), ('of', 5)]

These tokens include stop words and that can impact our training accuracy because they're the most common words.

In [34]:
from bs4 import BeautifulSoup
import string

table = str.maketrans('', '', string.punctuation)
imdb_sentences = []

train_data = tfds.as_numpy(tfds.load('imdb_reviews', split='train'))
for item in train_data:
  sentence = str(item['text'].decode('UTF-8').lower())
  soup = BeautifulSoup(sentence)
  sentence = soup.get_text()
  words = sentence.split()
  filtered_sentence = ''
  for word in words:
    word = word.translate(table)
    if word not in stopwords:
      filtered_sentence = filtered_sentence + word + ' '
    imdb_sentences.append(filtered_sentence)

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=25000, oov_token='<oov>')
tokenizer.fit_on_texts(imdb_sentences)
sequences = tokenizer.texts_to_sequences(imdb_sentences)

list(tokenizer.word_index.items())[:5]



[('<oov>', 1), ('film', 2), ('movie', 3), ('not', 4), ('one', 5)]