This notebook is a hands-on introduction to the basic text-sequence preprocessing steps needed before training RNN or LSTM networks.

In [2]:
import numpy as np

## Preprocessing of the data

We get the IMDB dataset directly from the tensorflow_datasets API

In [3]:
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)

We convert the training data into plain Python lists to make the preprocessing below easier

In [4]:
train_data = datasets['train']

training_sentences = []
training_labels = []

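for s, l in train_data:
  # s.numpy() is a bytes object; str() keeps the b'...' wrapper in the text,
  # so a stray 'b' token shows up in the tokenized output further down.
  # s.numpy().decode('utf-8') would give the plain review text instead.
  training_sentences.append(str(s.numpy()))
  training_labels.append(l.numpy())
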
training_labels_final = np.array(training_labels)
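
The loop above calls `str()` on the `bytes` object returned by `s.numpy()`. As a small illustration, using a made-up byte string rather than an actual review, the following sketch compares `str()` with `.decode("utf-8")`. It suggests why a standalone `b` appears at the start of the decoded review later in this notebook.

```python
# Hypothetical review stored as bytes, similar to what s.numpy() returns
raw = b"This was a great movie"

as_str = str(raw)               # keeps the b'...' wrapper
decoded = raw.decode("utf-8")   # plain text, no wrapper

print(as_str)    # b'This was a great movie'
print(decoded)   # This was a great movie
```

Switching the loop to `.decode("utf-8")` would change the token numbers shown in the outputs below, so the original call is kept here.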

## Tokenization

We use the Keras Tokenizer object to convert each review into a sequence of integer tokens.

(See the documentation page: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer)

In [5]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()

tokenizer.fit_on_texts(training_sentences)
total_words = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved (it is used for padding)

tokenized_sequences = tokenizer.texts_to_sequences(training_sentences)


We can check if we really get a sequence of tokens

In [6]:
print(tokenized_sequences[0])

[58, 11, 13, 34, 438, 399, 17, 173, 28, 10623, 8, 32, 1377, 3400, 41, 495, 11108, 196, 24, 87, 155, 18, 11, 210, 339, 28, 69, 247, 212, 8, 485, 61, 69, 87, 115, 98, 23, 5739, 11, 3316, 656, 776, 11, 17, 6, 34, 405, 8227, 177, 2476, 425, 1, 91, 1252, 139, 71, 148, 54, 1, 30180, 7524, 71, 228, 69, 2961, 15, 20481, 2879, 20482, 18415, 1505, 4997, 2, 39, 3946, 118, 1607, 16, 3400, 13, 162, 18, 3, 1252, 926, 7985, 8, 3, 17, 12, 13, 4199, 4, 101, 147, 1236, 10, 239, 691, 12, 43, 24, 100, 38, 11, 7231, 10373, 38, 1377, 25010, 51, 408, 10, 98, 1213, 873, 144, 9]


We can have a look at the explicit correspondence between words and token numbers

In [33]:
tokenizer.word_index

{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'is': 6,
 'br': 7,
 'in': 8,
 'it': 9,
 'i': 10,
 'this': 11,
 'that': 12,
 'was': 13,
 'as': 14,
 'for': 15,
 'with': 16,
 'movie': 17,
 'but': 18,
 'film': 19,
 "'s": 20,
 'on': 21,
 'you': 22,
 'not': 23,
 'are': 24,
 'his': 25,
 'he': 26,
 'have': 27,
 'be': 28,
 'one': 29,
 'all': 30,
 'at': 31,
 'by': 32,
 'they': 33,
 'an': 34,
 'who': 35,
 'so': 36,
 'from': 37,
 'like': 38,
 'her': 39,
 "'t": 40,
 'or': 41,
 'just': 42,
 'there': 43,
 'about': 44,
 'out': 45,
 "'": 46,
 'has': 47,
 'if': 48,
 'some': 49,
 'what': 50,
 'good': 51,
 'more': 52,
 'very': 53,
 'when': 54,
 'she': 55,
 'up': 56,
 'can': 57,
 'b': 58,
 'time': 59,
 'no': 60,
 'even': 61,
 'my': 62,
 'would': 63,
 'which': 64,
 'story': 65,
 'only': 66,
 'really': 67,
 'see': 68,
 'their': 69,
 'had': 70,
 'were': 71,
 'me': 72,
 'well': 73,
 'we': 74,
 'than': 75,
 'much': 76,
 'been': 77,
 'get': 78,
 'bad': 79,
 'will': 80,
 'people': 81,
 'do': 82,
 'also': 83,
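
The listing above is ordered by word frequency: the most frequent word gets index 1, the next one index 2, and so on. A minimal pure-Python sketch of this ranking on a toy corpus is given below. The corpus and the names `toy_word_index` and `toy_sequences` are made up for illustration, and the sketch only lowercases and splits on whitespace, so it does not reproduce the punctuation filtering done by the Keras Tokenizer.

```python
from collections import Counter

# Toy corpus (hypothetical); the notebook itself uses the IMDB reviews
texts = ["the movie was good", "the film was bad", "the end"]

# Count lowercased, whitespace-separated words
counts = Counter(word for text in texts for word in text.lower().split())

# Most frequent word gets index 1; index 0 is left free for padding
ranked = sorted(counts.items(), key=lambda kv: -kv[1])
toy_word_index = {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

# Replace every word by its index
toy_sequences = [[toy_word_index[w] for w in t.lower().split()] for t in texts]

print(toy_word_index)
print(toy_sequences)
```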


## Padding

We use the pad_sequences function to bring the tokenized sequences created above to a common length.

(See https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences)

In [16]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_length = 300

padded_sequences = pad_sequences(tokenized_sequences, maxlen=max_length, padding='post')
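
To make the behaviour of this call concrete, here is a minimal pure-Python sketch of what happens to a single sequence. The helper `pad_post` is hypothetical and is not part of Keras. It assumes the settings used above: zeros are appended at the end (`padding='post'`), and sequences longer than `maxlen` lose their first tokens, because `truncating` is left at its default value `'pre'`.

```python
def pad_post(sequence, maxlen, value=0):
    """Pad at the end; if the sequence is too long, keep its last `maxlen` tokens."""
    truncated = sequence[-maxlen:]  # 'pre' truncation: drop tokens from the start
    return truncated + [value] * (maxlen - len(truncated))

short = pad_post([5, 8, 2], maxlen=5)
long = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)  # [5, 8, 2, 0, 0]
print(long)   # [3, 4, 5, 6, 7]
```

If the beginning of long reviews matters more than the end, passing `truncating='post'` to pad_sequences keeps the first tokens instead.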

We can check if we really get a padded sequence of tokens

In [17]:
print(padded_sequences[0])

[   58    11    13    34   438   399    17   173    28 10623     8    32
  1377  3400    41   495 11108   196    24    87   155    18    11   210
   339    28    69   247   212     8   485    61    69    87   115    98
    23  5739    11  3316   656   776    11    17     6    34   405  8227
   177  2476   425     1    91  1252   139    71   148    54     1 30180
  7524    71   228    69  2961    15 20481  2879 20482 18415  1505  4997
     2    39  3946   118  1607    16  3400    13   162    18     3  1252
   926  7985     8     3    17    12    13  4199     4   101   147  1236
    10   239   691    12    43    24   100    38    11  7231 10373    38
  1377 25010    51   408    10    98  1213   873   144     9     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0   

A quick comparison between the padded and the original sequences

In [25]:
padded_sequence_length = len(padded_sequences[0])
original_sequence_length = len(tokenized_sequences[0])

print(f"The length of the padded sequence is {padded_sequence_length} whereas it used to be {original_sequence_length} before padding\n")

if padded_sequence_length > original_sequence_length:
    print("You should observe zero-padding on the new sequence")
elif padded_sequence_length < original_sequence_length:
    print("You should observe a truncation of the original sequence")
else:
    print("The sequence already had the target length, so it is unchanged")


The length of the padded sequence is 300 whereas it used to be 118 before padding

You should observe zero-padding on the new sequence


## One-hot encoding

We use the to_categorical function to convert each token of a sequence into its one-hot encoded vector

(See https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical)

We try it only on the first sequence, to avoid running out of memory
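
A back-of-the-envelope estimate shows why encoding the whole dataset is not practical. The sketch below relies on two assumptions: the values are stored as 4-byte floats, and the vocabulary holds at least 30,181 entries. That lower bound comes from token 30180 appearing in the first tokenized review printed above. The actual vocabulary is larger, so the real figure would be higher.

```python
# Rough size of one-hot encoding the whole padded training set
num_reviews = 25_000      # size of the IMDB training split
max_length = 300          # padded length used above
min_vocab_size = 30_181   # lower bound: token 30180 appears in the first review
bytes_per_value = 4       # assumption: float32 values

total_bytes = num_reviews * max_length * min_vocab_size * bytes_per_value
print(f"At least {total_bytes / 1e9:.0f} GB")  # At least 905 GB
```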

In [26]:
from tensorflow.keras.utils import to_categorical

one_hot_encoded_first_sequence = to_categorical(padded_sequences[0], num_classes=total_words)

We can check if we really get a sequence of one-hot-encoded tokens

In [27]:
print(one_hot_encoded_first_sequence)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]]


As we saw in class, most of the time we will not need to explicitly compute the one-hot encoded sequence vectors
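
One common reason, given here as an illustration and not necessarily the argument used in class, is that multiplying a one-hot vector by a weight matrix simply selects one row of that matrix. The integer token can therefore be used directly as a row index, which is what embedding layers do. The sketch below checks this with NumPy on toy sizes; all names and sizes are hypothetical.

```python
import numpy as np

vocab_size, embedding_dim = 6, 3   # toy sizes (hypothetical)
rng = np.random.default_rng(0)
weights = rng.normal(size=(vocab_size, embedding_dim))

tokens = np.array([2, 5, 0])       # a short token sequence

# Explicit route: build the one-hot vectors, then multiply by the weight matrix
one_hot = np.eye(vocab_size)[tokens]
via_one_hot = one_hot @ weights

# Shortcut: use the token ids directly as row indices
via_lookup = weights[tokens]

print(np.allclose(via_one_hot, via_lookup))  # True
```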

## Reverse processing: from the padded sequence to the original text sequence

We define a function to recover the original text from the padded sequence. The padding value 0 has no entry in the word index, so it is decoded as '?'.

In [28]:
reverse_word_index = {value: key for key, value in tokenizer.word_index.items()}

def decode_review(token_ids):
    return ' '.join(reverse_word_index.get(i, '?') for i in token_ids)

print(decode_review(padded_sequences[0]))
print(training_sentences[0])

b this was an absolutely terrible movie don't be lured in by christopher walken or michael ironside both are great actors but this must simply be their worst role in history even their great acting could not redeem this movie's ridiculous storyline this movie is an early nineties us propaganda piece the most pathetic scenes were those when the columbian rebels were making their cases for revolutions maria conchita alonso appeared phony and her pseudo love affair with walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning i am disappointed that there are movies like this ruining actor's like christopher walken's good name i could barely sit through it ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 
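
The long run of '?' at the end of the decoded review comes from the padding zeros. If this is distracting, a possible variant is to skip the padding value while decoding. The sketch below uses a small made-up word index so that it runs on its own; `toy_word_index` and `decode_without_padding` are hypothetical names, not part of the notebook or of Keras.

```python
# Toy word index (hypothetical), in the same format as tokenizer.word_index
toy_word_index = {"this": 1, "movie": 2, "was": 3, "great": 4}
toy_reverse_index = {index: word for word, index in toy_word_index.items()}

def decode_without_padding(token_ids, pad_id=0):
    """Map ids back to words, skipping padding and marking unknown ids with '?'."""
    return " ".join(toy_reverse_index.get(i, "?") for i in token_ids if i != pad_id)

print(decode_without_padding([1, 2, 3, 4, 0, 0, 0]))  # this movie was great
```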

Reference:

https://www.coursera.org/learn/natural-language-processing-tensorflow