# Vocabulary Building

In most natural language processing (NLP) tasks, the first step in preparing your data is to extract a vocabulary of words from your corpus (i.e. the input texts). You then need to decide how to represent the texts as numeric features that can be used to train a neural network. TensorFlow and Keras make it easy to generate these with their APIs. You will see how to do that in the next cells.

The code below takes a list of sentences and assigns an integer to each word that appears in them. This is done using the TextVectorization() preprocessing layer and its adapt() method.

As described in the layer's documentation, it does several things, including:

1. Standardizing each example. The default behavior is to lowercase and strip punctuation. See its standardize argument for other options.
2. Splitting each example into substrings. By default, it will split into words. See its split argument for other options.
3. Recombining substrings into tokens. See its ngrams argument for reference.
4. Indexing tokens.
5. Transforming each example using this index, either into a vector of ints or a dense float vector.
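The steps listed above can be approximated in plain Python to make them concrete. This is only a sketch under stated assumptions, not how TextVectorization is implemented: the helper names (`standardize`, `split`, `build_vocabulary`, `transform`) are hypothetical, the n-gram step is skipped, and words with equal counts may come out in a different order than the layer produces.

```python
import string
from collections import Counter

def standardize(text):
    # Step 1: lowercase and strip punctuation (approximates the default behavior).
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def split(text):
    # Step 2: split on whitespace into words.
    return text.split()

def build_vocabulary(texts):
    # Step 4: index tokens, most frequent first.
    # Ties keep first-seen order here; the real layer may order ties differently.
    counts = Counter(tok for t in texts for tok in split(standardize(t)))
    return [tok for tok, _ in counts.most_common()]

def transform(text, vocab):
    # Step 5: map each token to its integer index.
    # Unknown words raise KeyError in this sketch; special tokens are not handled.
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index[tok] for tok in split(standardize(text))]

sentences = ["I love my dog", "i love my cat"]
vocab = build_vocabulary(sentences)
print(vocab)                              # ['i', 'love', 'my', 'dog', 'cat']
print(transform("I love my cat", vocab))  # [0, 1, 2, 4]
```

Because "I" and "i" are lowercased to the same token, the two sentences share three words and the vocabulary has only five entries.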



### TensorFlow

In [1]:
import tensorflow as tf

# sample inputs

sentences = [
    "I love my dog",
    "i love my cat"
    ]

# Initialize the layer
vectorized_layer = tf.keras.layers.TextVectorization()

# Build the vocab
vectorized_layer.adapt(sentences)

# get vocab
vocab = vectorized_layer.get_vocabulary(include_special_tokens = False)



The resulting vocabulary is a list in which more frequently used words have a lower index. By default, the layer also reserves indices for special tokens but, for clarity, let's leave those for later.

In [2]:
for index, word in enumerate(vocab):
    print(index, word)

0 my
1 love
2 i
3 dog
4 cat


### PyTorch

In [8]:
import torch
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer

sentences = [
    "I love my dog",
    "i love my cat"
]

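# basic_english tokenizer: lowercases the text and separates punctuation into its own tokens
tokenizer = get_tokenizer("basic_english")

# generator that yields the list of tokens for each sentence
def yield_token(sentences):
    for sentence in sentences:
        yield tokenizer(sentence)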

vocab = build_vocab_from_iterator(yield_token(sentences), specials=[])
vocab_list = vocab.get_itos() # index to string
vocab_list_2 = vocab.get_stoi() # string to index
print(vocab_list)
print(vocab_list_2)



['i', 'love', 'my', 'cat', 'dog']
{'dog': 4, 'cat': 3, 'my': 2, 'love': 1, 'i': 0}


In [6]:
print(vocab_list[1])
print(vocab["love"])

love
1


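Let's add another sentence. When you do, you'll notice the new words in the vocabulary. With TextVectorization, the new punctuation is still ignored as expected, while the basic_english tokenizer used in the PyTorch version keeps the exclamation mark as a token of its own.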
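The two approaches used in this notebook treat punctuation differently, as the outputs that follow show. The sketch below imitates both behaviors with the standard library only. The function names are hypothetical, and the regular expression is a rough stand-in for what a basic_english-style tokenizer does, not its actual rules.

```python
import re
import string

def strip_punctuation(text):
    # Roughly the TextVectorization default: lowercase, drop punctuation, split on whitespace.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def separate_punctuation(text):
    # Roughly a basic_english-style tokenizer: lowercase, keep punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(strip_punctuation("You love my dog!"))     # ['you', 'love', 'my', 'dog']
print(separate_punctuation("You love my dog!"))  # ['you', 'love', 'my', 'dog', '!']
```

Which behavior you want depends on the task: keeping punctuation adds tokens to the vocabulary, while stripping it discards information that some models can use.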


### TensorFlow

In [9]:
import tensorflow as tf

# sample inputs

sentences = [
    "I love my dog",
    "i love my cat",
    "You love my dog!"]

# Initialize the layer
vectorized_layer = tf.keras.layers.TextVectorization()

# Build the vocab
vectorized_layer.adapt(sentences)

# get vocab
vocab = vectorized_layer.get_vocabulary(include_special_tokens = False)

In [10]:
for index, word in enumerate(vocab):
    print(index, word)

0 my
1 love
2 i
3 dog
4 you
5 cat


### PyTorch

In [12]:
import torch
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer

sentences = [
    "I love my dog",
    "i love my cat",
    "You love my dog!"]

tokenizer = get_tokenizer("basic_english")

def yield_token(sentences):
    for sentence in sentences:
        yield tokenizer(sentence)

vocab = build_vocab_from_iterator(yield_token(sentences), specials=[])

vocab_list_1 = vocab.get_itos()
vocab_list_2 = vocab.get_stoi()

print(vocab_list_1)
print(vocab_list_2)

['love', 'my', 'dog', 'i', '!', 'cat', 'you']
{'cat': 5, '!': 4, 'i': 3, 'dog': 2, 'my': 1, 'you': 6, 'love': 0}



Now that you have seen how it behaves, let's include the two special tokens. Index 0 is reserved for padding and index 1 for out-of-vocabulary (OOV) words. These are important when you use the layer to convert input texts to integer sequences. You'll see that in the next lab.
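As a small preview of why the two reserved indices matter, here is a minimal plain-Python sketch of turning a sentence into a fixed-length integer sequence. The names `PAD`, `UNK`, `build_index`, and `encode` are hypothetical, and the word order passed to `build_index` is simply assumed for illustration; the real layers handle all of this for you.

```python
PAD, UNK = 0, 1  # assumed to match the reserved indices described above

def build_index(tokens_by_rank):
    # Indices 0 and 1 are reserved, so real tokens start at 2.
    return {tok: i + 2 for i, tok in enumerate(tokens_by_rank)}

def encode(sentence, index, max_len):
    # Unknown words map to UNK; short sequences are right-padded with PAD.
    ids = [index.get(tok, UNK) for tok in sentence.lower().split()]
    ids = ids[:max_len]
    return ids + [PAD] * (max_len - len(ids))

index = build_index(["my", "love", "i", "dog", "you", "cat"])
print(encode("I love my bird", index, 6))  # [4, 3, 2, 1, 0, 0]
```

Here "bird" is not in the vocabulary, so it becomes 1, and the two trailing zeros pad the sequence to length 6.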


### TensorFlow

In [18]:
import tensorflow as tf

# sample inputs

sentences = [
    "I love my dog",
    "i love my cat",
    "You love my dog!"]

# Initialize the layer
vectorized_layer = tf.keras.layers.TextVectorization()

# Build the vocab
vectorized_layer.adapt(sentences)

# get vocab
vocab = vectorized_layer.get_vocabulary()

In [20]:
for index, words in enumerate(vocab):
    print(index, words)

0 
1 [UNK]
2 my
3 love
4 i
5 dog
6 you
7 cat


### PyTorch

In [None]:
import torch
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer

sentences = [
    "I love my dog",
    "i love my cat",
    "You love my dog!"]

tokenizer = get_tokenizer("basic_english")

def yield_token(sentences):
    for sentence in sentences:
        yield tokenizer(sentence)

# Add <pad> and <unk> as specials. Order matters: <pad> comes first (index 0), <unk> second (index 1).
# Index 0 is the conventional choice for padding. nn.Embedding does not assume it by default,
# but you can pass padding_idx=0 so that the padding embedding is not updated during training.

specials = ['<pad>', '<unk>']
# the token strings themselves are a convention; e.g. '<UNK>' would work as well
vocab = build_vocab_from_iterator(yield_token(sentences), specials=specials)

vocab_list_1 = vocab.get_itos()
# vocab_list_2 = vocab.get_stoi()

print(vocab_list_1)
# print(vocab_list_2)

['<pad>', '<unk>', 'love', 'my', 'dog', 'i', '!', 'cat', 'you']


Conclusion on vocab building!