<h1 id="intro">Text Tokenization - Three Data Splitting Approaches</h1>

This notebook walks through chunking and splitting text with a Hugging Face tokenizer. The concepts here help build familiarity with tokenizers, which are fundamental to NLP.

## Table of Contents
If you are viewing this notebook on GitHub, please open it on [nbviewer.org](https://nbviewer.org/) instead so the hyperlinks function.

- [User Inputs](#user-inputs)
- [Import Libraries and Modules](#import-libs)
- [Tokenizer](#tokenizer)
- [Approach 1: Right Side Truncation](#approach-1-right-side-truncation)
- [Approach 2: Right and Left Side Truncation](#approach-2-right-left-side-truncation)
- [Approach 3: Chunking with Overlap](#approach-3-chunking-overlap)
- [Takeaways](#takeaways)

<h1 id="user-inputs">User Inputs</h1>

##### [Return To Top](#intro)

In [1]:
# Model name
model_name = 'gemma-2-9b-it'

# Tokenizer max length and stride
max_length = 7
stride = 2

# Example text to use for demonstration
text = ('This is text that will be chunked with '
        'overlap for simple example use cases')

<h1 id="import-libs">Import Libraries and Modules</h1>

##### [Return To Top](#intro)

In [2]:
# Libraries
import os
from pathlib import Path
from transformers import AutoTokenizer

# Setup HF env. variables
os.environ['TRANSFORMERS_OFFLINE'] = '1'
os.environ['TOKENIZERS_PARALLELISM'] = 'True'
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'

<h1 id="tokenizer">Tokenizer</h1>

The Gemma 2 model uses a [SentencePiece tokenizer](https://arxiv.org/html/2408.00118v1) with a vocabulary size of 256K.

##### [Return To Top](#intro)

In [3]:
# Path to the model directory saved on disk
# (requires the LLM_MODELS environment variable to be set)
model_path = Path(os.environ['LLM_MODELS']) / model_name

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Number of tokens in the tokenizer vocabulary
print(f'Number of Tokens in Vocabulary: {len(tokenizer.get_vocab()):,}')

Number of Tokens in Vocabulary: 256,000


Special tokens in Hugging Face tokenizers are used to indicate the start and end of a sequence, separate sentences, or represent padding and unknown tokens. These tokens help the model understand the structure and boundaries of the input text during training and inference.

In [4]:
# Special tokens
print('Special Tokens')
for token_name in tokenizer.SPECIAL_TOKENS_ATTRIBUTES:
    token = getattr(tokenizer, token_name)
    if token_name == 'additional_special_tokens':
        token_id = None
    else:
        token_id = getattr(tokenizer, f'{token_name}_id')
    print(f'{token_name}: {token} -> {token_id}')

# Additional special tokens
print('\nAdditional Special Tokens')
for token_id in tokenizer.additional_special_tokens_ids:
    print(f'{token_id}: {tokenizer.decode(token_id)}')

Special Tokens
bos_token: <bos> -> 2
eos_token: <eos> -> 1
unk_token: <unk> -> 3
sep_token: None -> None
pad_token: <pad> -> 0
cls_token: None -> None
mask_token: None -> None
additional_special_tokens: ['<start_of_turn>', '<end_of_turn>'] -> None

Additional Special Tokens
106: <start_of_turn>
107: <end_of_turn>


In [5]:
from transformers import PreTrainedTokenizerBase

def view_tokens(tok: PreTrainedTokenizerBase, input_ids: list[int]) -> None:
    """Decode and print input_ids tokens for viewing.

    Args:
        tok (PreTrainedTokenizerBase): Tokenizer instance
        input_ids (list[int]): List of token ids
    """
    # Raw token ids
    print(f'input_ids:\n{input_ids}\n\n')
    # Token strings, one per id
    print(f'tokenizer.convert_ids_to_tokens(input_ids):\n'
          f'{tok.convert_ids_to_tokens(input_ids)}\n\n')

    # Ids decoded back into a single string
    print(f'tokenizer.decode(input_ids):\n'
          f'{tok.decode(input_ids)}\n')

In [6]:
# Tokenizer
tk = tokenizer(text,
               truncation=False,
               return_length=True)

# Number of tokens in the text
total_tokens_text = tk['length'][0]

# Inspect the encoding: number of keys, key names, and token count
print(f'len(tk): {len(tk)}\n')
print(f'tk.keys(): {tk.keys()}\n')
print(f'Number of Tokens: {tk["length"][0]}\n')
view_tokens(tok=tokenizer, input_ids=tk['input_ids'])

len(tk): 3

tk.keys(): dict_keys(['input_ids', 'attention_mask', 'length'])

Number of Tokens: 16

input_ids:
[2, 1596, 603, 2793, 674, 877, 614, 788, 129632, 675, 40768, 604, 3890, 3287, 1281, 4381]


tokenizer.convert_ids_to_tokens(input_ids):
['<bos>', 'This', '▁is', '▁text', '▁that', '▁will', '▁be', '▁ch', 'unked', '▁with', '▁overlap', '▁for', '▁simple', '▁example', '▁use', '▁cases']


tokenizer.decode(input_ids):
<bos>This is text that will be chunked with overlap for simple example use cases



<h1 id="approach-1-right-side-truncation">Approach 1: Right Side Truncation</h1>


This approach keeps the first `max_length` tokens of the text and truncates everything to the right of them. It is the tokenizer's default truncation side and the most common approach to handling text that is too long.
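As a minimal sketch of what right-side truncation does at the id level, the snippet below slices a plain Python list. The ids are copied from the untruncated tokenizer output earlier in this notebook, and `truncate_right` is a hypothetical helper used only for illustration, not part of the Hugging Face API.

```python
# Token ids for the example text, including the leading <bos> (id 2)
full_ids = [2, 1596, 603, 2793, 674, 877, 614, 788, 129632,
            675, 40768, 604, 3890, 3287, 1281, 4381]


def truncate_right(input_ids: list[int], max_length: int) -> list[int]:
    """Keep the first max_length ids and drop everything to the right."""
    return input_ids[:max_length]


head = truncate_right(full_ids, max_length=7)
print(head)  # [2, 1596, 603, 2793, 674, 877, 614]
```

Note that `<bos>` counts toward `max_length`, so only six tokens of actual text survive.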

##### [Return To Top](#intro)

In [7]:
# Truncate on the right: keep the first max_length tokens of the text
tk = tokenizer(text,
               max_length=max_length,
               truncation=True,
               return_length=True,
               padding=False)
print(f'Number of Tokens: {tk["length"][0]}\n')
view_tokens(tok=tokenizer, input_ids=tk['input_ids'])

Number of Tokens: 7

input_ids:
[2, 1596, 603, 2793, 674, 877, 614]


tokenizer.convert_ids_to_tokens(input_ids):
['<bos>', 'This', '▁is', '▁text', '▁that', '▁will', '▁be']


tokenizer.decode(input_ids):
<bos>This is text that will be



<h1 id="approach-2-right-left-side-truncation">Approach 2: Right and Left Side Truncation</h1>

In this approach the start and the end of the text are kept and the middle is dropped. This requires two separate tokenizations of the text (one truncated on the right, one truncated on the left), combining the results, and then tokenizing the combined text again, for example within the chat template.

Some math is needed to decide how many tokens to take from each side. Account for the following:
- The number of tokens to take from each side (roughly `max_length / 2`).
- The `<bos>` token at the start of each tokenized side is dropped before combining, and a single `<bos>` is added back when the combined text is tokenized.

The cell below uses `math.floor` and `math.ceil` so that an odd `max_length` is also split correctly. This ensures that the returned number of tokens is exactly `max_length`, assuming the text has at least that many tokens.
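The split can be checked for several values of `max_length` with a small, tokenizer-free sketch. The helper names (`split_budget`, `combined_length`, `head_and_tail`) are hypothetical and exist only for this illustration; the ids are copied from the untruncated tokenizer output earlier in this notebook. The sketch combines ids directly, which sidesteps the decode and re-tokenize round trip, so treat it as an approximation of the approach rather than the same procedure.

```python
import math


def split_budget(max_length: int) -> tuple[int, int]:
    """Return (max_len_right, max_len_left); each still includes a <bos> slot."""
    max_len_right = math.floor(max_length / 2) + 1
    max_len_left = math.ceil(max_length / 2)
    return max_len_right, max_len_left


def combined_length(max_length: int) -> int:
    """Token count after dropping <bos> from both sides and adding one back."""
    max_len_right, max_len_left = split_budget(max_length)
    return (max_len_right - 1) + (max_len_left - 1) + 1


def head_and_tail(input_ids: list[int], max_length: int, bos_id: int = 2) -> list[int]:
    """Keep the start and end of the text; assumes input_ids starts with <bos>."""
    content = input_ids[1:]  # drop the leading <bos>
    max_len_right, max_len_left = split_budget(max_length)
    head = content[:max_len_right - 1]
    # Guard against content[-0:], which would return the whole list
    tail = content[-(max_len_left - 1):] if max_len_left > 1 else []
    return [bos_id] + head + tail


for n in (6, 7, 8, 9):
    print(n, split_budget(n), combined_length(n))

full_ids = [2, 1596, 603, 2793, 674, 877, 614, 788, 129632,
            675, 40768, 604, 3890, 3287, 1281, 4381]
print(head_and_tail(full_ids, max_length=7))
```

For both odd and even values the combined length equals `max_length`, which is the property the `floor` and `ceil` split is meant to guarantee.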

##### [Return To Top](#intro)

In [8]:
import math

# Calculate the number of tokens for each side
max_len_right = math.floor(max_length / 2) + 1
max_len_left = math.ceil(max_length / 2)

# Actual length b/c dropping <bos> on each side (i.e., input_ids[1:])
actual_right = max_len_right - 1
actual_left = max_len_left - 1

print(f'user specified -> max_length: {max_length}')
print(f'max_len_right: {max_len_right}')
print(f'max_len_left: {max_len_left}')
print(f'actual_right: {actual_right}')
print(f'actual_left: {actual_left}')
print(f'Number of tokens: {actual_right + actual_left + 1}')

user specified -> max_length: 7
max_len_right: 4
max_len_left: 4
actual_right: 3
actual_left: 3
Number of tokens: 7


In [9]:
# Right-side truncation (the default, same as approach 1): keeps the start of the text
tk_right = tokenizer(text,
                     max_length=max_len_right,
                     truncation=True,
                     return_length=True,
                     padding=False)

# Left-side truncation: keeps the end of the text
tokenizer.truncation_side = 'left'
tk_left = tokenizer(text,
                    max_length=max_len_left,
                    truncation=True,
                    return_length=True,
                    padding=False)

# Set truncation_side back to the default "right"
tokenizer.truncation_side = 'right'

# Combine right and left input_ids
combine_input_ids = tk_right['input_ids'][1:] + tk_left['input_ids'][1:]
combine_text = tokenizer.decode(combine_input_ids)

# Re-tokenize the combined text (adds a single <bos> back)
tk = tokenizer(combine_text,
               max_length=max_length,
               truncation=True,
               return_length=True,
               padding=False)
print(f'tk_left["length"][0] - 1: {tk_left["length"][0] - 1}')
print(f'tk_right["length"][0] - 1: {tk_right["length"][0] - 1}')
print(f'len(combine_input_ids): {len(combine_input_ids)}')
print(f'Number of Tokens: {tk["length"][0]}\n')
view_tokens(tok=tokenizer, input_ids=tk['input_ids'])

tk_left["length"][0] - 1: 3
tk_right["length"][0] - 1: 3
len(combine_input_ids): 6
Number of Tokens: 7

input_ids:
[2, 1596, 603, 2793, 3287, 1281, 4381]


tokenizer.convert_ids_to_tokens(input_ids):
['<bos>', 'This', '▁is', '▁text', '▁example', '▁use', '▁cases']


tokenizer.decode(input_ids):
<bos>This is text example use cases



<h1 id="approach-3-chunking-overlap">Approach 3: Chunking with Overlap</h1>

In Hugging Face tokenizers, the `return_overflowing_tokens` and `stride` arguments are used together to split long texts into smaller chunks that the model can process.

`return_overflowing_tokens`: When set to `True` together with truncation, the tokenizer returns the tokens that overflow beyond `max_length` instead of discarding them. A text that is too long to fit within `max_length` is split into multiple chunks, and each chunk is returned as part of the output.

`stride`: The number of tokens that overlap between consecutive chunks. For example, if `stride` is 5 and `max_length` is 10, consecutive chunks overlap by 5 tokens. This helps maintain context across chunks, especially in tasks like question answering, where understanding context is critical.

Together, these arguments create overlapping chunks so that contextual information is preserved across chunk boundaries.
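To make the windowing concrete, here is a tokenizer-free sketch of chunking with overlap on a plain list of ids. It is not the library's implementation: `chunk_with_overlap` is a hypothetical helper, and it assumes one slot per chunk is reserved for `<bos>` and that consecutive chunks share exactly `stride` text tokens. The ids are copied from the untruncated tokenizer output earlier in this notebook.

```python
def chunk_with_overlap(input_ids: list[int], max_length: int,
                       stride: int, bos_id: int = 2) -> list[list[int]]:
    """Split ids into <bos>-prefixed chunks that overlap by `stride` tokens."""
    # Work on the text tokens only; <bos> is added back to every chunk
    content = input_ids[1:] if input_ids[:1] == [bos_id] else list(input_ids)
    window = max_length - 1   # text tokens per chunk (one slot is <bos>)
    step = window - stride    # how far the window advances each time
    if step <= 0:
        raise ValueError('stride must be smaller than max_length - 1')

    chunks = []
    start = 0
    while True:
        chunks.append([bos_id] + content[start:start + window])
        if start + window >= len(content):
            break  # this chunk reached the end of the text
        start += step
    return chunks


full_ids = [2, 1596, 603, 2793, 674, 877, 614, 788, 129632,
            675, 40768, 604, 3890, 3287, 1281, 4381]
chunks = chunk_with_overlap(full_ids, max_length=7, stride=2)
for chunk in chunks:
    print(len(chunk), chunk)
```

With `max_length=7` and `stride=2`, each chunk holds six text tokens and the window advances by four, so the last two text tokens of one chunk reappear at the start of the next.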

##### [Return To Top](#intro)

In [10]:
# Total number of tokens / max_length
print(f'Number of chunks not accounting for stride: '
      f'{total_tokens_text / max_length}')

Number of chunks not accounting for stride: 2.2857142857142856
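The estimate above ignores both the overlap and the `<bos>` token that is added to every chunk. A rough sketch of a count that accounts for them follows; `expected_chunks` is a hypothetical helper, and it assumes one special token per chunk and that each new chunk repeats `stride` tokens from the previous one.

```python
import math


def expected_chunks(total_tokens: int, max_length: int,
                    stride: int, num_special: int = 1) -> int:
    """Estimate the number of overlapping chunks for a tokenized text."""
    content = total_tokens - num_special  # text tokens without <bos>
    window = max_length - num_special     # text tokens that fit in one chunk
    step = window - stride                # new text tokens per extra chunk
    if step <= 0:
        raise ValueError('stride must be smaller than max_length - num_special')
    if content <= window:
        return 1
    return math.ceil((content - window) / step) + 1


print(expected_chunks(total_tokens=16, max_length=7, stride=2))  # 4
```

For the example text this gives `ceil((15 - 6) / 4) + 1 = 4` chunks rather than the 2.29 suggested by the simple ratio.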


In [11]:
# Tokenizer with stride
tk = tokenizer(text,
               max_length=max_length,
               stride=stride,
               truncation=True,
               return_overflowing_tokens=True,
               return_length=True,
               padding=False,
               )
print(f'Number of Chunks: {len(tk["input_ids"])}')
# Show the length and the first five ids of each chunk
for ii, input_ids in enumerate(tk['input_ids']):
    print(f'{ii + 1} of {len(tk["input_ids"])} ({tk["length"][ii]}): {input_ids[:5]}')

Number of Chunks: 4
1 of 4 (7): [2, 1596, 603, 2793, 674]
2 of 4 (7): [2, 877, 614, 788, 129632]
3 of 4 (7): [2, 675, 40768, 604, 3890]
4 of 4 (4): [2, 3287, 1281, 4381]


The tokenizer adds the `<bos>` token at the start of each chunk, so incorporating each chunk of text into the chat template requires a custom approach.
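One possible step in such a custom approach, shown here only as an assumption and not as the method used in this project, is to strip the leading `<bos>` from every chunk so that a template which supplies its own `<bos>` does not end up with two. `strip_leading_bos` is a hypothetical helper, and the ids are the first and last chunks from the output above.

```python
BOS_ID = 2  # <bos> id reported by the tokenizer in this notebook

# First and last chunks returned by the tokenizer
chunks = [
    [2, 1596, 603, 2793, 674, 877, 614],
    [2, 3287, 1281, 4381],
]


def strip_leading_bos(chunk: list[int], bos_id: int = BOS_ID) -> list[int]:
    """Return the chunk without its leading <bos>, if one is present."""
    return chunk[1:] if chunk[:1] == [bos_id] else chunk


stripped = [strip_leading_bos(chunk) for chunk in chunks]
print(stripped)  # [[1596, 603, 2793, 674, 877, 614], [3287, 1281, 4381]]
```

The stripped ids can then be decoded back to text and placed inside whatever template the downstream step expects.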

In [12]:
# First chunk of data
print('FIRST CHUNK OF DATA')
print(f'Number of Tokens: {tk["length"][0]}\n')
view_tokens(tok=tokenizer, input_ids=tk['input_ids'][0])

FIRST CHUNK OF DATA
Number of Tokens: 7

input_ids:
[2, 1596, 603, 2793, 674, 877, 614]


tokenizer.convert_ids_to_tokens(input_ids):
['<bos>', 'This', '▁is', '▁text', '▁that', '▁will', '▁be']


tokenizer.decode(input_ids):
<bos>This is text that will be



In [13]:
# Last chunk of data
print('LAST CHUNK OF DATA')
print(f'Number of Tokens: {tk["length"][-1]}\n')
view_tokens(tok=tokenizer, input_ids=tk['input_ids'][-1])

LAST CHUNK OF DATA
Number of Tokens: 4

input_ids:
[2, 3287, 1281, 4381]


tokenizer.convert_ids_to_tokens(input_ids):
['<bos>', '▁example', '▁use', '▁cases']


tokenizer.decode(input_ids):
<bos> example use cases



<h1 id="takeaways">Takeaways</h1>

Please feel free to experiment with this notebook and tokenizers. Check out:
- The [/src/preprocess.py](./src/preprocess.py) module, which implements the approaches in this notebook.
- The [dataset-and-collator.ipynb](./dataset-and-collator.ipynb) notebook, which covers the next step in the text processing pipeline: forming datasets and using collators.

##### [Return To Top](#intro)