This section covers converting input sentences into numeric sequences. Just as images must share a common size before they are fed to a CNN, text data must be brought to a uniform length before it is passed to a model. The next sections show how to do both.


## Text to Sequences

In the previous lab, you saw how to use the TextVectorization layer to build a vocabulary from your corpus. It generates a list where more frequent words have lower indices.
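
To make the ordering concrete, here is a minimal pure-Python sketch of a frequency-ordered vocabulary. It is not the layer's actual implementation: the `build_vocab` helper, the punctuation handling, and the alphabetical tie-breaking are assumptions for illustration, so words with equal counts may appear in a different order than the layer produces.

```python
from collections import Counter
import string

def build_vocab(sentences):
    """Return a vocabulary list ordered from most to least frequent word."""
    # Lowercase and strip punctuation (assumed standardization for this sketch)
    table = str.maketrans("", "", string.punctuation)
    counts = Counter(
        word
        for sentence in sentences
        for word in sentence.lower().translate(table).split()
    )
    # Sort by descending count; ties are broken alphabetically (an assumption)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    # Reserve index 0 for padding and index 1 for unknown words
    return ["", "[UNK]"] + ordered

sentences = ["I love my dog.", "I love my cat", "You love my dog!", "Do you love my cat?"]
vocab = build_vocab(sentences)
print(vocab)
```

The most frequent words ("love" and "my", four occurrences each) land right after the two reserved entries, and the single-occurrence word "do" comes last.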

### TensorFlow

In [4]:
import tensorflow as tf

sentences = [
    "I love my dog.",
    "I love my cat",
    "You love my dog!",
    "Do you love my cat?"
]

# Initialize the layer
vec_layer = tf.keras.layers.TextVectorization()

# Compute the vocab
vec_layer.adapt(sentences)

# get the vocab
vocab_tf = vec_layer.get_vocabulary()

print(f"Vocabulary: {vocab_tf}")
print("\nwith indices:\n")
for index, word in enumerate(vocab_tf):
    print(index, word)


Vocabulary: ['', '[UNK]', 'my', 'love', 'you', 'i', 'dog', 'cat', 'do']

with indices:

0 
1 [UNK]
2 my
3 love
4 you
5 i
6 dog
7 cat
8 do



### PyTorch

In [5]:
import torch
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer

sentences = [
    "I love my dog.",
    "I love my cat",
    "You love my dog!",
    "Do you love my cat?"
]

tokenizer = get_tokenizer("basic_english")

def yield_token(sentences):
    for sentence in sentences:
        yield tokenizer(sentence)

vocab = build_vocab_from_iterator(yield_token(sentences), specials=['<pad>', '<unk>'])

print(f"Vocabulary: {vocab.get_itos()}")
print("\nWith Indices:\n")
print(vocab.get_stoi())


Vocabulary: ['<pad>', '<unk>', 'love', 'my', 'cat', 'dog', 'i', 'you', '!', '.', '?', 'do']

With Indices:

{'?': 10, '!': 8, '.': 9, 'i': 6, 'dog': 5, 'do': 11, 'cat': 4, 'my': 3, 'you': 7, 'love': 2, '<unk>': 1, '<pad>': 0}


You can then use the result to convert each of the input sentences into integer sequences. See how that's done below given a single input string.

### TensorFlow

In [6]:
# string input
sample_input = "I love my dog"

# convert string input to integer sequence
seq = vec_layer(sample_input)

print(seq)

# To Check
print(f"vocab: {vocab_tf}")


tf.Tensor([5 3 2 6], shape=(4,), dtype=int64)
vocab: ['', '[UNK]', 'my', 'love', 'you', 'i', 'dog', 'cat', 'do']


### PyTorch

In [7]:
# string input
sample_input = "I love my dog"

# convert string input to integer sequence
tokens = tokenizer(sample_input)

# get the tokens
seq = vocab(tokens)

print(seq)

# to check
print(vocab.get_stoi())


[6, 2, 3, 5]
{'?': 10, '!': 8, '.': 9, 'i': 6, 'dog': 5, 'do': 11, 'cat': 4, 'my': 3, 'you': 7, 'love': 2, '<unk>': 1, '<pad>': 0}


As shown, you simply pass the string to the layer, which has already learned the vocabulary, and it outputs the integer sequence as a tf.Tensor. In this case, the result is [5 3 2 6]. In PyTorch, you tokenize the string and pass the tokens to the vocab object, which returns a Python list, here [6, 2, 3, 5]. You can compare each result against the token indices printed above to verify that it matches the words in the input string.
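
Conceptually, this conversion is a dictionary lookup. The sketch below reuses the TensorFlow vocabulary printed above; the `to_sequence` helper and its punctuation stripping are assumptions for illustration, not part of either library.

```python
import string

# Vocabulary as printed by the TensorFlow cell above (position == token index)
vocab_list = ['', '[UNK]', 'my', 'love', 'you', 'i', 'dog', 'cat', 'do']
word_to_index = {word: index for index, word in enumerate(vocab_list)}

def to_sequence(text):
    """Map each word of `text` to its index in the vocabulary."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Every word here is assumed to be in the vocabulary
    return [word_to_index[word] for word in cleaned.split()]

print(to_sequence("I love my dog"))  # [5, 3, 2, 6]
```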

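For a list of string inputs (such as the four-sentence list above), you need to apply the layer to each input. There's more than one way to do this. Let's first use the map() method of a tf.data.Dataset and see the results. The PyTorch cell that follows does the same with a plain loop; its sequences are longer because the basic_english tokenizer keeps punctuation marks as separate tokens, whereas TextVectorization strips them by default.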

In [8]:
print(sentences)


# convert sentences to tf data
sentences_dataset = tf.data.Dataset.from_tensor_slices(sentences)

# define a mapping function to convert each sample input
sequences = sentences_dataset.map(vec_layer)

# print integer sequences

for sentence, sequence in zip(sentences, sequences):
    print(f"{sentence} --> {sequence}")

['I love my dog.', 'I love my cat', 'You love my dog!', 'Do you love my cat?']
I love my dog. --> [5 3 2 6]
I love my cat --> [5 3 2 7]
You love my dog! --> [4 3 2 6]
Do you love my cat? --> [8 4 3 2 7]



In [9]:
print(sentences)

# get tokens for sentences
for sentence in sentences:
    tokens = tokenizer(sentence)
    # print(tokens)
    seq = vocab(tokens)
    print(f"{sentence} --> {seq}")

# To check
vocab.get_stoi()

['I love my dog.', 'I love my cat', 'You love my dog!', 'Do you love my cat?']
I love my dog. --> [6, 2, 3, 5, 9]
I love my cat --> [6, 2, 3, 4]
You love my dog! --> [7, 2, 3, 5, 8]
Do you love my cat? --> [11, 7, 2, 3, 4, 10]


{'?': 10,
 '!': 8,
 '.': 9,
 'i': 6,
 'dog': 5,
 'do': 11,
 'cat': 4,
 'my': 3,
 'you': 7,
 'love': 2,
 '<unk>': 1,
 '<pad>': 0}

## Padding

Sequences of varying lengths can be brought to a uniform size by padding them or by truncating tokens from them. Padding is more common because it preserves information.

Recall that your vocabulary reserves the special token index 0 for padding. The layer appends that token to the shorter sequences (called post-padding) when you pass it a list of string inputs. See an example below. Notice that you get the same output as above, but the integer sequences are already post-padded with 0 up to the length of the longest sequence.

### TensorFlow

In [10]:
print(sentences)

# apply the layer to the input list
seq_post = vec_layer(sentences)
print('INPUT:')
print(sentences)
print()

print('OUTPUT:')
print(seq_post)

['I love my dog.', 'I love my cat', 'You love my dog!', 'Do you love my cat?']
INPUT:
['I love my dog.', 'I love my cat', 'You love my dog!', 'Do you love my cat?']

OUTPUT:
tf.Tensor(
[[5 3 2 6 0]
 [5 3 2 7 0]
 [4 3 2 6 0]
 [8 4 3 2 7]], shape=(4, 5), dtype=int64)


If you want pre-padding, you can use the pad_sequences() utility to prepend a padding token to the sequences. Notice that the padding argument is set to pre. This is just for clarity. The function already has this set as the default so you can opt to drop it.

In [11]:
print(sequences)

seq_pre = tf.keras.utils.pad_sequences(sequences, padding='pre')

# For post-padding, set padding='post' explicitly.
# Omitting the argument does not post-pad: the default is 'pre'.
# seq_post = tf.keras.utils.pad_sequences(sequences, padding='post')


print('INPUT:')
[print(sequence.numpy()) for sequence in sequences]
print()

print('OUTPUT:')
print(seq_pre)

<_MapDataset element_spec=TensorSpec(shape=(None,), dtype=tf.int64, name=None)>
INPUT:
[5 3 2 6]
[5 3 2 7]
[4 3 2 6]
[8 4 3 2 7]

OUTPUT:
[[0 5 3 2 6]
 [0 5 3 2 7]
 [0 4 3 2 6]
 [8 4 3 2 7]]


In [12]:
print(sequences)

seq_pre = tf.keras.utils.pad_sequences(sequences, maxlen=4, padding='pre')

print('INPUT:')
[print(sequence.numpy()) for sequence in sequences]
print()

print('OUTPUT:')
print(seq_pre)

<_MapDataset element_spec=TensorSpec(shape=(None,), dtype=tf.int64, name=None)>


INPUT:
[5 3 2 6]
[5 3 2 7]
[4 3 2 6]
[8 4 3 2 7]

OUTPUT:
[[5 3 2 6]
 [5 3 2 7]
 [4 3 2 6]
 [4 3 2 7]]


In [13]:
print(sequences)

# Truncate from the end (truncating='post') while still pre-padding
seq_trunc_post = tf.keras.utils.pad_sequences(sequences, maxlen=4, truncating='post', padding='pre')

print('INPUT:')
for sequence in sequences:
    print(sequence.numpy())
print()

print('OUTPUT:')
print(seq_trunc_post)

<_MapDataset element_spec=TensorSpec(shape=(None,), dtype=tf.int64, name=None)>
INPUT:
[5 3 2 6]
[5 3 2 7]
[4 3 2 6]
[8 4 3 2 7]

OUTPUT:
[[5 3 2 6]
 [5 3 2 7]
 [4 3 2 6]
 [8 4 3 2]]
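
The last two cells pass `maxlen=4`, so the five-token sequence is truncated. By default `pad_sequences()` truncates from the front (`truncating='pre'`), which is why the first output drops the leading 8 while the second, with `truncating='post'`, drops the trailing 7. The following pure-Python sketch mimics that behavior on plain lists; `pad_or_truncate` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def pad_or_truncate(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Return equal-length lists, padding with `value` or dropping tokens as needed."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)

    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops tokens from the front, 'post' drops them from the end
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        pad = [value] * (maxlen - len(seq))
        # 'pre' puts the padding in front, 'post' appends it
        result.append(pad + seq if padding == "pre" else seq + pad)
    return result

sequences = [[5, 3, 2, 6], [5, 3, 2, 7], [4, 3, 2, 6], [8, 4, 3, 2, 7]]
print(pad_or_truncate(sequences))                               # pre-pad to length 5
print(pad_or_truncate(sequences, maxlen=4))                     # last row: [4, 3, 2, 7]
print(pad_or_truncate(sequences, maxlen=4, truncating="post"))  # last row: [8, 4, 3, 2]
```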


Another way to prepare your sequences for pre-padding is to set the TextVectorization layer to output a ragged tensor. This means the output will not be automatically post-padded. See the output sequences here.

In [83]:
print(sentences)
# Set the layer to output a ragged tensor
vectorize_layer = tf.keras.layers.TextVectorization(ragged=True)

# Compute the vocabulary
vectorize_layer.adapt(sentences)

# Apply the layer to the sentences
ragged_sequences = vectorize_layer(sentences)

# Print the results
print(ragged_sequences)



['I love my dog.', 'I love my cat', 'You love my dog!', 'Do you love my cat?']
<tf.RaggedTensor [[5, 3, 2, 6], [5, 3, 2, 7], [4, 3, 2, 6], [8, 4, 3, 2, 7]]>


With that, you can now pass it directly to the pad_sequences() utility.

In [84]:
# Pre-pad the sequences in the ragged tensor
sequences_pre = tf.keras.utils.pad_sequences(ragged_sequences.numpy())

# Print the results
print(sequences_pre)

[[0 5 3 2 6]
 [0 5 3 2 7]
 [0 4 3 2 6]
 [8 4 3 2 7]]


### PyTorch

In [14]:
from torchtext.vocab import build_vocab_from_iterator
from torch.nn.utils.rnn import pad_sequence

print(sentences)

tokenizer = lambda x: x.lower().split()

def yield_token(sentences):
    for sentence in sentences:
        yield tokenizer(sentence)

vocab = build_vocab_from_iterator(yield_token(sentences), specials=['<pad>', '<unk>'])
vocab.set_default_index(vocab['<unk>'])

#numericalized
list_num = []
for sentence in sentences:
    tokens = tokenizer(sentence)
    seq = vocab(tokens)
    list_num.append(torch.tensor(seq))
 
print("tensor_list: \n")
print(list_num)

padding = pad_sequence(sequences=list_num, batch_first=True, padding_value=vocab['<pad>'])
print(padding)
print()
print(vocab.get_stoi())


['I love my dog.', 'I love my cat', 'You love my dog!', 'Do you love my cat?']
tensor_list: 

[tensor([ 4,  2,  3, 10]), tensor([4, 2, 3, 6]), tensor([5, 2, 3, 9]), tensor([8, 5, 2, 3, 7])]
tensor([[ 4,  2,  3, 10,  0],
        [ 4,  2,  3,  6,  0],
        [ 5,  2,  3,  9,  0],
        [ 8,  5,  2,  3,  7]])

{'dog.': 10, 'dog!': 9, 'cat?': 7, 'do': 8, 'cat': 6, 'i': 4, 'my': 3, 'you': 5, 'love': 2, '<unk>': 1, '<pad>': 0}


In the PyTorch version used here, the built-in pad_sequence() only supports post-padding. To pre-pad, you can define a helper function manually.

In [15]:
import torch

def pre_pad(sequences, padding_value, max_len=None):
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)

    padded = []
    for seq in sequences:
        pad_len = max_len - len(seq)
        padded_seq = torch.cat([torch.full((pad_len,), padding_value), seq])
        padded.append(padded_seq)
    return torch.stack(padded)


pad_pre = pre_pad(list_num, padding_value=vocab['<pad>'])
pad_pre

tensor([[ 0,  4,  2,  3, 10],
        [ 0,  4,  2,  3,  6],
        [ 0,  5,  2,  3,  9],
        [ 8,  5,  2,  3,  7]])

Truncating is also not supported by pad_sequence(), so we have to define it manually as well.

In [80]:
import torch
from typing import Literal

TruncateMode = Literal['pre', 'post']

def truncate_torch(sequences: list[torch.Tensor], len_: int, mode: TruncateMode):
    """Truncate every sequence longer than len_; shorter ones are left unchanged."""
    if mode not in ('pre', 'post'):
        raise ValueError(f"mode must be 'pre' or 'post', got {mode!r}")

    max_len = max(len(seq) for seq in sequences)
    if len_ >= max_len:
        raise ValueError(f"Nothing to truncate: requested length {len_} >= longest sequence length {max_len}")

    new_seq_list = []
    for seq in sequences:
        if len(seq) <= len_:
            # Already short enough: keep as is
            new_seq = seq
        elif mode == 'pre':
            # Drop tokens from the front, keeping the last len_ tokens
            new_seq = seq[len(seq) - len_:]
        else:
            # Drop tokens from the end, keeping the first len_ tokens
            new_seq = seq[:len_]
        new_seq_list.append(new_seq)

    return new_seq_list



In [81]:
truncate_torch(list_num, 3, 'pre')

[tensor([ 2,  3, 10]), tensor([2, 3, 6]), tensor([2, 3, 9]), tensor([2, 3, 7])]

In [82]:
truncate_torch(list_num, 3, 'post')

[tensor([4, 2, 3]), tensor([4, 2, 3]), tensor([5, 2, 3]), tensor([8, 5, 2])]

We can write similar functions to combine truncating and padding.
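
As one possible way to combine the two steps, the sketch below truncates each tensor to `max_len` and then pads it. The `pad_and_truncate` name and its arguments are hypothetical, chosen to mirror the `padding` and `truncating` options of `pad_sequences()`; it assumes 1-D integer tensors like those in `list_num`.

```python
import torch

def pad_and_truncate(sequences, max_len, padding="pre", truncating="pre", padding_value=0):
    """Truncate, then pad, a list of 1-D integer tensors to shape (batch, max_len)."""
    rows = []
    for seq in sequences:
        if len(seq) > max_len:
            # Keep the last max_len tokens ('pre') or the first max_len tokens ('post')
            seq = seq[len(seq) - max_len:] if truncating == "pre" else seq[:max_len]
        pad = torch.full((max_len - len(seq),), padding_value, dtype=seq.dtype)
        # Put the padding before ('pre') or after ('post') the tokens
        rows.append(torch.cat([pad, seq] if padding == "pre" else [seq, pad]))
    return torch.stack(rows)

# Same index sequences as `list_num` above
list_num = [torch.tensor([4, 2, 3, 10]), torch.tensor([4, 2, 3, 6]),
            torch.tensor([5, 2, 3, 9]), torch.tensor([8, 5, 2, 3, 7])]
batch = pad_and_truncate(list_num, max_len=6, padding="post")
print(batch)
```

Truncating before padding keeps the logic simple: after the first step no sequence is longer than `max_len`, so the pad length is never negative.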


## Out-of-vocabulary tokens

Lastly, you'll see what the other special token is for. The layer uses the token index 1 for input words that are not found in the vocabulary list. For example, you may collect more text after your initial training and decide not to recompute the vocabulary. In PyTorch, the vocab object only behaves this way after you call set_default_index() with the index of '<unk>'; otherwise an unknown token raises an error. You will see this in action in the cells below. Notice that the token 1 is inserted for words that are not found in the list.


### TensorFlow

In [90]:
sentences_oov = [
    "i really love my dog",
    "my dogs love my manatee"
]

seq_with_oov = vectorize_layer(sentences_oov)

for sentence, sequence in zip(sentences_oov, seq_with_oov):
    print(f"{sentence} --> {sequence}")

# To check:
vocab_tf

i really love my dog --> [5 1 3 2 6]
my dogs love my manatee --> [2 1 3 2 1]


['', '[UNK]', 'my', 'love', 'you', 'i', 'dog', 'cat', 'do']

### PyTorch

In [99]:
sentences

['I love my dog.', 'I love my cat', 'You love my dog!', 'Do you love my cat?']

In [105]:
import torch
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

sentences_oov = [
    "i really love my dog",
    "my dog love my manatee"
]

tokenizer = get_tokenizer("basic_english")

def yield_token(sentences):
    for sentence in sentences:
        yield tokenizer(sentence)

# Build vocab using old sentences to see how the OOV works on new sentences later
vocab = build_vocab_from_iterator(yield_token(sentences), specials=['<pad>', '<unk>'])
vocab.set_default_index(vocab['<unk>'])

# Numericalized
seq_list = []
for sentence in sentences_oov:
    tokens = tokenizer(sentence)
    seq = vocab(tokens)
    # torch.tensor keeps the integer dtype; torch.Tensor would convert the indices to floats
    seq_list.append(torch.tensor(seq))

# To get it into TF like format above:
for sentence, sequence in zip(sentences_oov, seq_list):
    print(f"{sentence} --> {sequence}")

# To check
vocab.get_stoi()

i really love my dog --> tensor([6, 1, 2, 3, 5])
my dog love my manatee --> tensor([3, 5, 2, 3, 1])


{'?': 10,
 '!': 8,
 '.': 9,
 'i': 6,
 'dog': 5,
 'do': 11,
 'cat': 4,
 'my': 3,
 'you': 7,
 'love': 2,
 '<unk>': 1,
 '<pad>': 0}
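
The fallback behavior in both outputs comes down to a lookup with a default. Below is a minimal plain-Python sketch using the TensorFlow vocabulary from earlier; the `encode` helper is an illustrative assumption, not a library function.

```python
vocab_list = ['', '[UNK]', 'my', 'love', 'you', 'i', 'dog', 'cat', 'do']
word_to_index = {word: index for index, word in enumerate(vocab_list)}
UNK_INDEX = word_to_index['[UNK]']  # index 1

def encode(sentence):
    """Map words to indices, falling back to the [UNK] index for unseen words."""
    return [word_to_index.get(word, UNK_INDEX) for word in sentence.lower().split()]

for sentence in ["i really love my dog", "my dogs love my manatee"]:
    print(f"{sentence} --> {encode(sentence)}")
```

Words such as "really", "dogs", and "manatee" were never seen when the vocabulary was built, so each of them maps to index 1.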

This concludes another introduction to text data preprocessing. So far, you've just been using dummy data. In the next exercise, you will be applying the same concepts to a real-world and much larger dataset.