<h1 id="intro">Text Preprocessing: Data Collator</h1>


This notebook demonstrates how to use a Hugging Face tokenizer together with a data collator to prepare inputs for a model. The focus is on the collator, which pads the tokenized sequences in each batch to a common length at batch time so that the data is consistently shaped for model training and inference.

## Table of Contents
If you are viewing this notebook on GitHub, open it on [nbviewer.org](https://nbviewer.org/) instead so that the hyperlinks work.

- [User Inputs](#user-inputs)
- [Import Libraries and Modules](#import-libs)
- [Tokenizer](#tokenizer)
- [Collator and Dataloader](#collator)

<h1 id="user-inputs">User Inputs</h1>

##### [Return To Top](#intro)

In [1]:
# Model name
model_name = 'gemma-2-9b-it'

# Tokenizer max length and stride
max_length = 7
stride = 2

# Example text to use for demonstration
texts = [
"The quick brown fox jumps over the lazy dog.",
"Artificial intelligence is transforming industries",
"Python is a versatile programming language loved by developers",
"In 2024, self-driving cars may become more common on the streets.",
"Does the age and gender get tokenized for a 20F and 20M or is it unknown"
]

# Maximum string length Polars shows per cell before truncating the display
fmt_str_lengths = 1_000

<h1 id="import-libs">Import Libraries and Modules</h1>

##### [Return To Top](#intro)

In [2]:
# Libraries
import os
from pathlib import Path
import gc
import polars as pl
from transformers import AutoTokenizer
# Only DataLoader is needed from torch; `Dataset` is imported from `datasets` later
from torch.utils.data import DataLoader

# Setup HF env. variables
os.environ['TRANSFORMERS_OFFLINE'] = '1'
os.environ['TOKENIZERS_PARALLELISM'] = 'True'
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'

# Custom modules
# from src.preprocess import custom_tokenizer

In [3]:
# Place the text into a dataframe and give each sample an ID
df = pl.DataFrame({'id': list(range(len(texts))),
                   'text': texts})
with pl.Config(fmt_str_lengths=fmt_str_lengths):
    display(df.head())

del texts
_ = gc.collect()

id,text
i64,str
0,"""The quick brown fox jumps over the lazy dog."""
1,"""Artificial intelligence is transforming industries"""
2,"""Python is a versatile programming language loved by developers"""
3,"""In 2024, self-driving cars may become more common on the streets."""
4,"""Does the age and gender get tokenized for a 20F and 20M or is it unknown"""


<h1 id="tokenizer">Tokenizer</h1>

The Gemma 2 model uses a [SentencePiece tokenizer](https://arxiv.org/html/2408.00118v1) with a vocabulary size of 256K.

##### [Return To Top](#intro)

In [4]:
# Path to the model and tokenizer files saved on disk
model_path = Path(os.getenv('LLM_MODELS')) / model_name

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Number of tokens in the tokenizer vocabulary
print(f'Number of Tokens in Vocabulary: {len(tokenizer.get_vocab()):,}')

Number of Tokens in Vocabulary: 256,000


In [5]:
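from transformers import PreTrainedTokenizerBase


def count_tokens(tok: PreTrainedTokenizerBase, text: str) -> int:
    """Return the number of tokens in `text`, including special tokens."""
    # For a single string, 'length' is a one-element list
    tk = tok(text, truncation=False, return_length=True)
    return tk['length'][0]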


# Number of tokens in each text
df = df.with_columns(
    pl.col('text')
    .map_elements(lambda x: count_tokens(tok=tokenizer, text=x),
                  return_dtype=pl.Int64)
    .alias('num_tokens')
)
with pl.Config(fmt_str_lengths=fmt_str_lengths):
    display(df.head())

id,text,num_tokens
i64,str,i64
0,"""The quick brown fox jumps over the lazy dog.""",11
1,"""Artificial intelligence is transforming industries""",6
2,"""Python is a versatile programming language loved by developers""",10
3,"""In 2024, self-driving cars may become more common on the streets.""",20
4,"""Does the age and gender get tokenized for a 20F and 20M or is it unknown""",24


In [6]:
# Convert the dataframe into a dataset
from datasets import Dataset
ds = Dataset.from_polars(df)
print(ds)

Dataset({
    features: ['id', 'text', 'num_tokens'],
    num_rows: 5
})


The module [src/preprocess/basic.py](../src/preprocess/basic.py) contains two examples of tokenizing text for a `Dataset`:
1) **Function**: see `left_or_right`.
2) **Class**: see `CustomTokenizer`.

To use Option 1 and tokenize the text with the `left_or_right` function, call it as shown below. Here `CFG` is the configuration object defined in the next cell.

```python
# Tokenization as a function
ds = ds.map(function=left_or_right,
            fn_kwargs={'tokenizer': tokenizer,
                       'cfg': CFG},
            num_proc=CFG.num_proc,
            )
```
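
The source of `left_or_right` is not reproduced in this notebook, so the following is only a minimal sketch of what left versus right truncation means for the `truncation_side` setting. The helper `truncate_ids` and the token IDs are hypothetical and are not taken from `src/preprocess/basic.py` or from the Gemma tokenizer.

```python
from typing import List


def truncate_ids(ids: List[int], max_length: int, side: str = "right") -> List[int]:
    """Hypothetical helper: keep at most `max_length` token IDs."""
    if side not in ("left", "right"):
        raise ValueError(f"side must be 'left' or 'right', got {side!r}")
    if max_length <= 0:
        return []
    if len(ids) <= max_length:
        return list(ids)
    # 'right' drops tokens from the end; 'left' drops tokens from the start
    if side == "right":
        return list(ids[:max_length])
    return list(ids[-max_length:])


# Made-up IDs for illustration only (not real Gemma token IDs)
token_ids = [10, 11, 12, 13, 14, 15]
right_kept = truncate_ids(token_ids, max_length=4, side="right")
left_kept = truncate_ids(token_ids, max_length=4, side="left")
print(right_kept)  # [10, 11, 12, 13]
print(left_kept)   # [12, 13, 14, 15]
```

Right truncation keeps the beginning of a text, while left truncation keeps the end, which can matter when the most relevant content sits at the end of a long input.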

In [7]:
from dataclasses import dataclass, field
from typing import Type, TypeVar, List, Dict, Tuple, Union

import importlib
from src.preprocess import basic
importlib.reload(basic)
from src.preprocess.basic import CustomTokenizer
from src.preprocess.basic import left_or_right

T = TypeVar('T')


@dataclass
class Config:
    """Class for storing user configurable parameters"""
    max_length: Union[int, None]=4
    truncation: bool=False
    padding: bool=False
    return_length: bool=True
    truncation_side: str='right'
    num_proc: int=2
    cols_select: list=field(default_factory=lambda: ['input_ids',
                                                     'attention_mask',
                                                     'length'])
    batch_size: int=2

# User configurations
CFG = Config(max_length=None)

# Tokenization custom class
tok = CustomTokenizer(cfg=CFG, tokenizer=tokenizer)
ds = ds.map(function=tok.left_or_right,
            num_proc=CFG.num_proc,
            )

print(f'Dataset Number of Samples: {len(ds)}')
print(f'Column Names: {ds.column_names}')
# Each 'length' entry is a one-element list, so max() returns a list
print(f'Max Length in Dataset: {max(ds["length"])}')

Setting TOKENIZERS_PARALLELISM=false for forked processes.



Dataset Number of Samples: 5
Column Names: ['id', 'text', 'num_tokens', 'input_ids', 'attention_mask', 'length']
Max Length in Dataset: [24]


<h1 id="collator">Collator and Dataloader</h1>

##### [Return To Top](#intro)

In [8]:
# Collator
from transformers import DataCollatorWithPadding
collator = DataCollatorWithPadding(tokenizer=tokenizer,
                                   return_tensors='pt')

In [9]:
from torch.utils.data import DataLoader

# Data Loader
dl = DataLoader(dataset=ds.select_columns(CFG.cols_select),
                batch_size=CFG.batch_size,
                shuffle=False,
                collate_fn=collator,
                num_workers=CFG.num_proc)
print(f'Number of Samples / Batch Size: {len(ds) / CFG.batch_size}')
print(f'Number of batches of data: {len(dl)}')

Number of Samples / Batch Size: 2.5
Number of batches of data: 3
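
The batch count above follows from rounding up: by default a `DataLoader` keeps the final, smaller batch (`drop_last=False`), so 5 samples with a batch size of 2 give 3 batches. The arithmetic can be sketched in plain Python; the helper names `num_batches` and `batch_sizes` are illustrative and are not part of PyTorch.

```python
import math
from typing import List


def num_batches(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Illustrative helper: number of batches for a map-style dataset."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last:
        # The incomplete final batch is discarded
        return num_samples // batch_size
    # The incomplete final batch is kept
    return math.ceil(num_samples / batch_size)


def batch_sizes(num_samples: int, batch_size: int) -> List[int]:
    """Illustrative helper: size of each batch when the last batch is kept."""
    return [min(batch_size, num_samples - start)
            for start in range(0, num_samples, batch_size)]


print(num_batches(5, 2))                  # 3
print(num_batches(5, 2, drop_last=True))  # 2
print(batch_sizes(5, 2))                  # [2, 2, 1]
```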


In [10]:
# Sample a batch of data from the dataloader
batch = next(iter(dl))

# Check the batch size
print(f'Batch Num. of Samples: {len(batch["input_ids"])}')

# Compare the length of each sample before and after the collator pads it
for ii, (input_ids, lengths) in enumerate(zip(batch['input_ids'], batch['length'])):
    print(f'Sample {ii + 1} of {len(batch["input_ids"])} in Batch')
    print(f'\tLength of input_ids AFTER collator: {len(input_ids):,}')
    print(f'\tLength of input_ids BEFORE collator: {lengths[0]:,}')

Batch Num. of Samples: 2
Sample 1 of 2 in Batch
	Length of input_ids AFTER collator: 11
	Length of input_ids BEFORE collator: 11
Sample 2 of 2 in Batch
	Length of input_ids AFTER collator: 11
	Length of input_ids BEFORE collator: 6


The output above shows the collator padding the second sample from 6 tokens to 11 tokens, the length of the first sample, which is the longest in the batch.
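
As a rough illustration of padding to the longest sequence in a batch, the following is a minimal pure-Python sketch. It is not the `DataCollatorWithPadding` implementation: the function `pad_batch`, the placeholder `pad_token_id=0`, the default `padding_side`, and the dummy token IDs are all assumptions made for the example, and it returns Python lists rather than tensors.

```python
from typing import Dict, List


def pad_batch(features: List[Dict[str, List[int]]],
              pad_token_id: int = 0,
              padding_side: str = "right") -> Dict[str, List[List[int]]]:
    """Hypothetical helper: pad every sample to the longest one in the batch."""
    if padding_side not in ("left", "right"):
        raise ValueError("padding_side must be 'left' or 'right'")
    max_len = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = list(f["input_ids"])
        mask = [1] * len(ids)             # 1 marks a real token
        pad = [pad_token_id] * (max_len - len(ids))
        pad_mask = [0] * len(pad)         # 0 marks a padding position
        if padding_side == "right":
            ids, mask = ids + pad, mask + pad_mask
        else:
            ids, mask = pad + ids, pad_mask + mask
        batch["input_ids"].append(ids)
        batch["attention_mask"].append(mask)
    return batch


# Two dummy samples with lengths 11 and 6, mirroring the first batch above
samples = [{"input_ids": list(range(1, 12))},
           {"input_ids": list(range(1, 7))}]
padded = pad_batch(samples)
print([len(ids) for ids in padded["input_ids"]])        # [11, 11]
print([sum(mask) for mask in padded["attention_mask"]])  # [11, 6]
```

Because the target length is computed per batch rather than once for the whole dataset, a batch of short texts stays short, which avoids spending compute on padding positions.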

At this point, the batches can be fed to a model for either training or inference.