# <font color="#418FDE" size="6.5" uppercase>**Text Pipelines**</font>

>Last update: 20260130.
    
By the end of this lecture, you will be able to:
- Tokenize raw text into sequences of tokens suitable for neural models. 
- Build vocabularies and map tokens to integer indices, handling unknown and padding tokens. 
- Create PyTorch datasets and dataloaders that yield padded text batches and labels. 


## **1. Core Text Tokenization**

### **1.1. Word and Subword Choices**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_06/Lecture_A/image_01_01.jpg?v=1769753278" width="250">



>* Word-level tokenization is intuitive and often effective
>* Large sparse vocabularies hurt learning and efficiency

>* Subword tokenization splits words into reusable pieces
>* Improves rare word handling, vocabulary size, and generalization

>* Choice depends on language, task, and constraints
>* Words aid clarity; subwords boost robustness and coverage



In [None]:
#@title Python Code - Word and Subword Choices

# This script compares word and subword tokenization.
# It uses simple examples for clear intuition.
# Run cells to see printed tokenization outputs.

# Required external installs would be placed here.
# No extra libraries are needed for this script.

# Define a small list of example sentences.
texts = [
    "unbelievable movie with unbelievable acting",
    "believer in believable stories",
    "newproductX is unbelievably cool",
]

# Show the raw texts so learners see inputs.
print("Raw texts:")
for t in texts:
    print("  -", t)

# Implement a simple whitespace word tokenizer.
def word_tokenize(text):
    tokens = text.strip().split()
    return tokens

# Build a basic word vocabulary from the texts.
word_vocab = {"<unk>": 0}
for text in texts:
    for tok in word_tokenize(text):
        if tok not in word_vocab:
            word_vocab[tok] = len(word_vocab)

# Print the small word vocabulary mapping.
print("\nWord vocabulary:")
print(word_vocab)

# Tokenize one sentence using word level tokens.
example_text = texts[2]
word_tokens = word_tokenize(example_text)
word_ids = [word_vocab.get(tok, 0) for tok in word_tokens]

# Show word tokens and their integer ids.
print("\nWord tokens for example:")
print(list(zip(word_tokens, word_ids)))

# Implement a simple subword tokenizer using suffixes.
def subword_tokenize(text, suffixes):
    tokens = []
    for word in text.strip().split():
        matched = False
        for suf in suffixes:
            if word.endswith(suf) and len(word) > len(suf):
                stem = word[: -len(suf)]
                tokens.append(stem)
                tokens.append("_" + suf)
                matched = True
                break
        if not matched:
            tokens.append(word)
    return tokens

# Define a few hand crafted subword suffixes.
subword_suffixes = ["able", "ably", "er"]

# Build a subword vocabulary from all texts.
subword_vocab = {"<unk>": 0}
for text in texts:
    for tok in subword_tokenize(text, subword_suffixes):
        if tok not in subword_vocab:
            subword_vocab[tok] = len(subword_vocab)

# Print the small subword vocabulary mapping.
print("\nSubword vocabulary:")
print(subword_vocab)

# Tokenize the same sentence using subword tokens.
sub_tokens = subword_tokenize(example_text, subword_suffixes)
sub_ids = [subword_vocab.get(tok, 0) for tok in sub_tokens]

# Show subword tokens and their integer ids.
print("\nSubword tokens for example:")
print(list(zip(sub_tokens, sub_ids)))
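
The suffix rules above are hand-written. Learned subword vocabularies (for example WordPiece-style ones) are usually applied with a greedy longest-match scan instead. The sketch below illustrates that idea only: the `SUBWORDS` set, the `##` continuation marker, and the per-character `<unk>` fallback are assumptions made for this example, not the behavior of any particular library.

```python
# Hypothetical subword vocabulary, chosen by hand for illustration only.
SUBWORDS = {"un", "believ", "able", "ably", "er", "new", "product", "cool"}

def greedy_subword_tokenize(word, subwords, unk="<unk>"):
    """Split one word into the longest known pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until a piece matches.
        end = len(word)
        while end > start and word[start:end] not in subwords:
            end -= 1
        if end == start:
            # No known piece starts here: emit <unk> for this character.
            pieces.append(unk)
            start += 1
        else:
            # Mark continuation pieces with "##" so words can be rebuilt.
            piece = word[start:end]
            pieces.append(piece if start == 0 else "##" + piece)
            start = end
    return pieces

for w in ["unbelievable", "unbelievably", "believer", "newproductX"]:
    print(w, "->", greedy_subword_tokenize(w, SUBWORDS))
```

With this vocabulary, the unseen word `newproductX` still yields two known pieces and a single `<unk>`, whereas a word-level vocabulary would map the whole word to `<unk>`.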




### **1.2. Text Normalization Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_06/Lecture_A/image_01_02.jpg?v=1769753324" width="250">



>* Normalize messy text to reduce surface variation
>* Choose case handling based on downstream task needs

>* Normalize characters, accents, and scripts consistently
>* Standardize Unicode and whitespace to stabilize tokenization

>* Normalize numbers, symbols, and informal spellings carefully
>* Balance simplification with preserving task-relevant signals



In [None]:
#@title Python Code - Text Normalization Basics

# This script demonstrates basic text normalization.
# We focus on simple rules before tokenization.
# Run cells to see each transformation clearly.

# No external libraries are needed for this script.
# Only the Python standard library is used.

# Define a small list of raw example sentences.
raw_texts = [
    "Apple   released   the NEW iPhone!!!",
    "I am soooo HAPPY about this café.",
    "OK!!! This   PDF   text has	weird spaces...",
]

# Print the original raw texts for comparison.
print("Original texts:")
for t in raw_texts:
    print("-", t)

# Import regular expressions for simple text cleaning.
import re

# Define a function to lowercase and normalize whitespace.
def basic_normalize(text: str) -> str:
    # Convert all characters to lowercase form.
    lowered = text.lower()
    # Replace any whitespace sequence with single space.
    spaced = re.sub(r"\s+", " ", lowered)
    # Strip leading and trailing spaces safely.
    cleaned = spaced.strip()
    return cleaned

# Define a function to compress repeated punctuation marks.
def compress_punctuation(text: str) -> str:
    # Replace two or more exclamation marks with one.
    text = re.sub(r"!{2,}", "!", text)
    # Replace three or more dots with single period.
    text = re.sub(r"\.{3,}", ".", text)
    return text

# Define a function to normalize elongated words like soooo.
def normalize_elongation(text: str) -> str:
    # Compress three or more repeated letters to two.
    # [^\W\d_] matches letters only, so numbers like 1000 stay intact.
    return re.sub(r"([^\W\d_])\1{2,}", r"\1\1", text)

# Apply each normalization step in sequence to examples.
normalized_texts = []
for t in raw_texts:
    # First apply punctuation compression.
    step1 = compress_punctuation(t)
    # Then normalize elongated character sequences.
    step2 = normalize_elongation(step1)
    # Finally lowercase and normalize whitespace.
    step3 = basic_normalize(step2)
    normalized_texts.append(step3)

# Print the normalized texts to observe differences.
print("\nNormalized texts:")
for t in normalized_texts:
    print("-", t)

# Show a simple whitespace tokenization after normalization.
print("\nTokens after normalization:")
for t in normalized_texts:
    # Split on spaces to get rough tokens.
    tokens = t.split(" ")
    # Print tokens list for each sentence.
    print(tokens)
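
The bullets above also mention Unicode and accent handling, which the regex-based steps do not cover. A minimal sketch using the standard library's `unicodedata` module follows. The helper name `normalize_unicode` and its `strip_accents` flag are assumptions made for this example, and whether accents should be stripped at all depends on the task and language.

```python
import unicodedata

def normalize_unicode(text, strip_accents=False):
    # NFKC folds compatibility forms (such as full-width letters) into
    # standard ones and composes base letters with their accents.
    text = unicodedata.normalize("NFKC", text)
    if strip_accents:
        # Decompose letters, drop combining marks, then recompose.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        text = unicodedata.normalize("NFC", text)
    return text

composed = "caf\u00e9"      # "é" stored as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

# The two strings look identical on screen but compare as different.
print(composed == decomposed)
# After normalization they map to the same string, hence the same token.
print(normalize_unicode(composed) == normalize_unicode(decomposed))
# Full-width letters are folded to their ASCII counterparts.
print(normalize_unicode("\uff46\uff55\uff4c\uff4c"))
# Optional accent stripping.
print(normalize_unicode(composed, strip_accents=True))
```

Running this step before tokenization keeps visually identical strings from producing different vocabulary entries.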




### **1.3. Punctuation in Tokenization**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_06/Lecture_A/image_01_03.jpg?v=1769753368" width="250">



>* Punctuation encodes structure, emphasis, and sentence type
>* Tokenization choices affect meaning and model performance

>* Inconsistent punctuation creates noisy, sparse token vocabularies
>* Normalize patterns while preserving emotional and structural signals

>* Punctuation strategy must match language and task
>* Treat domain-specific symbols as meaningful special tokens



In [None]:
#@title Python Code - Punctuation in Tokenization

# This script explores punctuation handling in tokenization.
# It shows how different rules change token sequences.
# Use it to compare simple and custom tokenizers.

# No external installs are needed for this script.
# It uses only the Python standard library.

# Import regular expressions for custom tokenization.
import re

# Define a tiny corpus with expressive punctuation.
texts = [
    "Wait, what? That's unbelievable!",
    "What???!!! This is soooo good...",
    "Hello :) This product is okay.",
]

# Show the raw texts for reference.
print("Raw texts:")
for t in texts:
    print("-", t)

# Define a very naive whitespace tokenizer.
def whitespace_tokenize(text):
    return text.split()

# Define a regex tokenizer that separates punctuation.
pattern = re.compile(r"\w+|[:;()]+|[!?\.]+|[^\w\s]")

# Tokenize text using the regex pattern.
def regex_tokenize(text):
    return pattern.findall(text)

# Normalize repeated question and exclamation marks.
def normalize_punct(tokens):
    normalized = []
    for tok in tokens:
        if tok.count("?") > 1:
            normalized.append("?")
        elif tok.count("!") > 1:
            normalized.append("!")
        elif tok in (":)", ":("):
            normalized.append("<EMOTICON>")
        else:
            normalized.append(tok)
    return normalized

# Print a header for comparison output.
print("\nTokenization comparison (first two texts):")

# Loop over first two texts to keep output short.
for idx, text in enumerate(texts[:2]):
    print(f"\nExample {idx + 1}:")
    ws_tokens = whitespace_tokenize(text)
    rg_tokens = regex_tokenize(text)
    norm_tokens = normalize_punct(rg_tokens)
    print("Whitespace:", ws_tokens)
    print("Regex:", rg_tokens)
    print("Normalized:", norm_tokens)
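
The last bullet above suggests treating domain-specific symbols as special tokens. One possible way to do this for social-media-style text is sketched below. The pattern, the placeholder names `<URL>`, `<USER>`, and `<EMOTICON>`, and the choice to keep hashtags as-is are all assumptions made for illustration.

```python
import re

# Order matters: earlier alternatives win, so multi-character symbols come first.
SPECIAL_PATTERN = re.compile(
    r"https?://\S+"      # URLs
    r"|[@#]\w+"          # mentions and hashtags
    r"|[:;]-?[()DP]"     # a few common emoticons
    r"|\w+"              # ordinary words
    r"|[^\w\s]"          # any other single symbol
)

def tokenize_with_specials(text):
    tokens = []
    for tok in SPECIAL_PATTERN.findall(text):
        if tok.startswith(("http://", "https://")):
            tokens.append("<URL>")         # collapse every URL into one token
        elif tok.startswith("@"):
            tokens.append("<USER>")        # hide individual user names
        elif re.fullmatch(r"[:;]-?[()DP]", tok):
            tokens.append("<EMOTICON>")    # keep the signal, drop the variant
        else:
            tokens.append(tok)             # words, hashtags, other symbols
    return tokens

print(tokenize_with_specials("Thanks @sam :) see https://example.com #pytorch!"))
```

Collapsing open-ended symbol families into a few placeholders keeps the vocabulary small while preserving the fact that a URL, mention, or emoticon was present. Note that the simple URL rule also swallows punctuation attached to the end of a link.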




## **2. Vocabulary and Indices**

### **2.1. Frequency Based Vocabulary**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_06/Lecture_A/image_02_01.jpg?v=1769753409" width="250">



>* Count token occurrences and sort by frequency
>* Keep frequent tokens and discard very rare ones

>* Control vocabulary size with a maximum size and a minimum frequency
>* Trade off detail versus memory and speed

>* Vocabulary reflects domain-specific token frequencies
>* Manually include rare but important domain tokens



In [None]:
#@title Python Code - Frequency Based Vocabulary

# This script demonstrates frequency based vocabularies.
# We build a tiny corpus and count token frequencies.
# Then we map tokens to indices with special handling.

# No external libraries are needed for this script.
# Only the Python standard library is used.

# Define a tiny example corpus of short sentences.
corpus = [
    "pytorch makes deep learning easier",
    "deep learning powers many modern applications",
    "pytorch is popular for nlp tasks",
    "nlp models need a good vocabulary",
]

# Define a simple whitespace tokenizer function.
def simple_tokenize(text):
    return text.lower().strip().split()

# Tokenize each sentence into a list of tokens.
tokenized_corpus = [simple_tokenize(sentence) for sentence in corpus]

# Show the tokenized corpus to understand the data.
print("Tokenized corpus:")
print(tokenized_corpus)

# Count token frequencies using a plain dictionary.
freqs = {}
for sentence_tokens in tokenized_corpus:
    for token in sentence_tokens:
        freqs[token] = freqs.get(token, 0) + 1

# Print the raw frequency dictionary for inspection.
print("\nToken frequencies:")
print(freqs)

# Sort tokens by frequency then alphabetically for stability.
sorted_tokens = sorted(
    freqs.items(), key=lambda item: (-item[1], item[0])
)

# Define special tokens for padding and unknown words.
PAD_TOKEN = "<pad>"
UNK_TOKEN = "<unk>"

# Choose a maximum vocabulary size including special tokens.
max_vocab_size = 8

# Start vocabulary with special tokens at fixed indices.
vocab_tokens = [PAD_TOKEN, UNK_TOKEN]

# Add most frequent tokens until reaching the size limit.
for token, count in sorted_tokens:
    if len(vocab_tokens) >= max_vocab_size:
        break
    vocab_tokens.append(token)

# Build mapping from token strings to integer indices.
token_to_idx = {token: idx for idx, token in enumerate(vocab_tokens)}

# Also build reverse mapping from indices back to tokens.
idx_to_token = {idx: token for token, idx in token_to_idx.items()}

# Print the final vocabulary and its size.
print("\nFinal vocabulary tokens:")
print(vocab_tokens)

# Show the mapping from tokens to integer indices.
print("\nToken to index mapping:")
print(token_to_idx)

# Define a helper to convert tokens into index sequences.
def tokens_to_indices(tokens, mapping, unk_token):
    return [mapping.get(token, mapping[unk_token]) for token in tokens]

# Convert each tokenized sentence into a list of indices.
indexed_corpus = [
    tokens_to_indices(sentence_tokens, token_to_idx, UNK_TOKEN)
    for sentence_tokens in tokenized_corpus
]

# Print the indexed corpus to see integer representations.
print("\nIndexed corpus:")
print(indexed_corpus)
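
The script above applies only a size limit. The bullets also mention a minimum frequency and manually included domain tokens. A compact sketch that combines all three with `collections.Counter` follows. The function name `build_vocab` and its parameters are assumptions made for this example.

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=1, max_size=None,
                always_include=(), specials=("<pad>", "<unk>")):
    counts = Counter(tok for toks in tokenized_texts for tok in toks)
    vocab = list(specials)
    # Domain tokens that must survive any frequency cut-off go in first.
    for tok in always_include:
        if tok not in vocab:
            vocab.append(tok)
    # Sort by descending count, then alphabetically, for a stable order.
    for tok, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if max_size is not None and len(vocab) >= max_size:
            break
        if count >= min_freq and tok not in vocab:
            vocab.append(tok)
    return {tok: idx for idx, tok in enumerate(vocab)}

docs = [
    "deep learning with pytorch".split(),
    "deep learning for nlp".split(),
    "cuda kernels speed up deep learning".split(),
]

# "cuda" appears only once but is kept because it is listed explicitly.
vocab = build_vocab(docs, min_freq=2, always_include=["cuda"])
print(vocab)
```

Here `max_size` counts the special tokens and the manually included tokens, so the limit describes the final embedding table size.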




### **2.2. Special UNK and PAD**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_06/Lecture_A/image_02_02.jpg?v=1769753460" width="250">



>* UNK represents any token that is not in the vocabulary
>* All unknown tokens share one reserved UNK index

>* PAD lets batches share equal sequence length
>* PAD positions must be masked so the model ignores them

>* Give UNK and PAD fixed, known indices
>* This enables masking, batching, and robust predictions



In [None]:
#@title Python Code - Special UNK and PAD

# This script explains UNK and PAD tokens.
# It builds a tiny vocabulary with special tokens.
# It shows how text becomes indices for models.

# No external libraries are needed for this script.

# Define a tiny toy corpus of short sentences.
corpus = [
    "i love pytorch",
    "pytorch loves nlp",
    "i enjoy deep learning",
]

# Define special token strings for padding and unknown.
PAD_TOKEN = "<PAD>"
UNK_TOKEN = "<UNK>"

# Decide fixed indices for PAD and UNK tokens.
PAD_INDEX = 0
UNK_INDEX = 1

# Build a simple word frequency dictionary from corpus.
word_freq = {}
for sentence in corpus:
    for word in sentence.split():
        word_freq[word] = word_freq.get(word, 0) + 1

# Choose a minimum frequency threshold for vocabulary.
min_freq = 1

# Start vocabulary with special tokens and fixed indices.
vocab = {PAD_TOKEN: PAD_INDEX, UNK_TOKEN: UNK_INDEX}

# Add remaining words to vocabulary with new indices.
for word, freq in word_freq.items():
    if freq >= min_freq and word not in vocab:
        vocab[word] = len(vocab)

# Create reverse mapping from indices back to tokens.
index_to_token = {idx: tok for tok, idx in vocab.items()}

# Define a function to convert tokens into index sequence.
def tokens_to_indices(tokens, vocab_dict, unk_index):
    indices = []
    for tok in tokens:
        indices.append(vocab_dict.get(tok, unk_index))
    return indices

# Define a function to pad sequences to a fixed length.
def pad_sequence(indices, max_len, pad_index):
    if len(indices) > max_len:
        return indices[:max_len]
    padding_needed = max_len - len(indices)
    return indices + [pad_index] * padding_needed

# Prepare some example sentences including unknown words.
examples = [
    "i love pytorch",
    "pytorch loves transformers",
    "deep learning rocks",
]

# Tokenize example sentences into word lists.
example_tokens = [sent.split() for sent in examples]

# Convert each token list into index sequence with UNK.
example_indices = [
    tokens_to_indices(tokens, vocab, UNK_INDEX)
    for tokens in example_tokens
]

# Decide a maximum length for padded sequences.
max_len = max(len(seq) for seq in example_indices)

# Pad each index sequence with PAD tokens to max length.
padded_indices = [
    pad_sequence(seq, max_len, PAD_INDEX)
    for seq in example_indices
]

# Print vocabulary to show special and normal tokens.
print("Vocabulary with indices:", vocab)

# Print original tokens and their mapped index sequences.
for tokens, indices in zip(example_tokens, example_indices):
    print("Tokens:", tokens, "->", indices)

# Print final padded batch to show PAD positions.
print("Padded batch indices:", padded_indices)
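
A fixed PAD index pays off once the indices reach a model. As a small sketch, `torch.nn.Embedding` accepts a `padding_idx` argument: the row at that index is initialized to zeros and receives no gradient. The vocabulary size, embedding size, and batch values below are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD_INDEX = 0

# padding_idx fixes the PAD row to zeros and excludes it from gradient updates.
embedding = nn.Embedding(num_embeddings=6, embedding_dim=3, padding_idx=PAD_INDEX)

# Second sequence is one token shorter, so it ends with PAD.
batch = torch.tensor([[2, 3, 4], [5, 1, PAD_INDEX]])
vectors = embedding(batch)

# The PAD position maps to a zero vector.
print(vectors[1, 2])

# Backpropagate a dummy loss to inspect the gradient of the PAD row.
vectors.sum().backward()
print(embedding.weight.grad[PAD_INDEX])
```

This only covers the embedding layer. Later layers such as attention or pooling still see the padded positions unless a mask is applied to them as well.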




### **2.3. Indexing Token Sequences**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_06/Lecture_A/image_02_03.jpg?v=1769753510" width="250">



>* Replace each token with its vocabulary index
>* Keep indexing deterministic for consistent model behavior

>* Map OOV tokens to a fixed UNK index
>* Insert special tokens to mark sequence structure

>* Keep tokens, indices, and labels perfectly aligned
>* Alignment enables correct learning and accurate interpretation



In [None]:
#@title Python Code - Indexing Token Sequences

# This script demonstrates indexing token sequences.
# It focuses on vocabulary mapping and special tokens.
# Run each part to see clear printed outputs.

# No external libraries are needed for this script.
# Only the Python standard library is used.

# Define a tiny example corpus of short sentences.
corpus_sentences = [
    "great movie",
    "really great acting",
    "bad movie but great music",
]

# Define special tokens for padding unknown and boundaries.
PAD_TOKEN = "<pad>"
UNK_TOKEN = "<unk>"
BOS_TOKEN = "<bos>"
EOS_TOKEN = "<eos>"

# Build a simple vocabulary from the corpus tokens.
all_tokens = []
for sentence in corpus_sentences:
    tokens = sentence.split()
    all_tokens.extend(tokens)

# Get unique tokens and sort for deterministic order.
unique_tokens = sorted(set(all_tokens))

# Start vocabulary with special tokens first.
vocab_tokens = [PAD_TOKEN, UNK_TOKEN, BOS_TOKEN, EOS_TOKEN]

# Extend vocabulary with regular tokens from corpus.
vocab_tokens.extend(unique_tokens)

# Create mapping from token string to integer index.
token_to_index = {token: idx for idx, token in enumerate(vocab_tokens)}

# Also create reverse mapping from index back to token.
index_to_token = {idx: token for token, idx in token_to_index.items()}

# Show vocabulary and indices in a compact way.
print("Vocabulary tokens:", vocab_tokens)
print("Token to index mapping:", token_to_index)

# Helper function to convert tokens into index sequence.
def tokens_to_indices(tokens, mapping, unk_token):
    indices = []
    for tok in tokens:
        if tok in mapping:
            indices.append(mapping[tok])
        else:
            indices.append(mapping[unk_token])
    return indices

# Helper function to add boundary tokens around sequence.
def add_boundaries(tokens, bos_token, eos_token):
    return [bos_token] + tokens + [eos_token]

# Choose an example sentence from the corpus.
example_sentence = corpus_sentences[1]
example_tokens = example_sentence.split()

# Add boundary tokens to the token list.
example_with_bounds = add_boundaries(
    example_tokens,
    BOS_TOKEN,
    EOS_TOKEN,
)

# Convert bounded tokens into index sequence.
example_indices = tokens_to_indices(
    example_with_bounds,
    token_to_index,
    UNK_TOKEN,
)

# Print original tokens and their index sequence.
print("Original tokens:", example_tokens)
print("With boundaries:", example_with_bounds)
print("Indexed sequence:", example_indices)

# Demonstrate handling of an unknown token not in vocabulary.
new_sentence = "great soundtrack"
new_tokens = new_sentence.split()

# Add boundaries to the new token list.
new_with_bounds = add_boundaries(
    new_tokens,
    BOS_TOKEN,
    EOS_TOKEN,
)

# Convert new tokens into indices using same mapping.
new_indices = tokens_to_indices(
    new_with_bounds,
    token_to_index,
    UNK_TOKEN,
)

# Print new tokens and show where unknown appears.
print("New tokens:", new_tokens)
print("New with boundaries:", new_with_bounds)
print("New indexed sequence:", new_indices)

# Map indices back to tokens; the unknown word comes back as <unk>.
reconstructed_tokens = [index_to_token[i] for i in new_indices]
print("Reconstructed tokens:", reconstructed_tokens)
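
The alignment point in the bullets matters most when every token carries its own label, as in tagging tasks. Adding boundary tokens then shifts positions, so the labels need matching placeholders. The sketch below shows one way to do this. The helper `encode_with_labels`, the toy mapping, and the tag values are hypothetical, and `-100` is used because it is the default `ignore_index` of `torch.nn.CrossEntropyLoss`.

```python
IGNORE_LABEL = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def encode_with_labels(tokens, labels, mapping,
                       unk="<unk>", bos="<bos>", eos="<eos>"):
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    # Index the tokens and wrap them in boundary markers.
    ids = [mapping[bos]]
    ids += [mapping.get(tok, mapping[unk]) for tok in tokens]
    ids += [mapping[eos]]
    # Boundary tokens have no real label, so they receive the ignore value.
    aligned = [IGNORE_LABEL] + list(labels) + [IGNORE_LABEL]
    return ids, aligned

# Hypothetical mapping and per-token tags (1 = positive word, 0 = neutral).
mapping = {"<pad>": 0, "<unk>": 1, "<bos>": 2, "<eos>": 3, "great": 4, "movie": 5}
ids, aligned = encode_with_labels(["great", "soundtrack"], [1, 0], mapping)

print(ids)
print(aligned)

# One label per index, including the boundary positions.
assert len(ids) == len(aligned)
```

Checking the two lengths right after encoding catches misalignment early, before it silently pairs labels with the wrong tokens during training.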




## **3. Padded Text Batching**

### **3.1. Handling Variable Lengths**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_06/Lecture_A/image_03_01.jpg?v=1769753563" width="250">



>* Text sequences vary, but tensors need uniformity
>* Design pipelines to batch, pad, and manage lengths

>* Sequence lengths can differ widely, even within one batch
>* Pad each batch to its longest sequence to form a rectangular tensor

>* Padding and truncation affect performance and memory
>* Balance length limits to keep important context
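
The points above can be sketched with `torch.nn.utils.rnn.pad_sequence`, which pads a list of 1-D tensors to the longest one. The padding index `0` and the length limit `MAX_LEN = 4` are assumptions chosen only for this example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0   # assumed padding index
MAX_LEN = 4   # assumed length limit, chosen only for this example

# Three already-indexed sequences of different lengths.
sequences = [
    torch.tensor([5, 6, 7]),
    torch.tensor([8, 9]),
    torch.tensor([3, 4, 5, 6, 7, 8]),
]

# Truncate first so one long outlier does not dictate the batch width.
truncated = [seq[:MAX_LEN] for seq in sequences]
lengths = torch.tensor([len(seq) for seq in truncated])

# Pad to the longest remaining sequence; batch_first gives (batch, time).
batch = pad_sequence(truncated, batch_first=True, padding_value=PAD_IDX)

print(batch)
print(lengths)
```

Truncating from the right keeps the start of each text. For tasks where the end carries the signal, slicing with `seq[-MAX_LEN:]` would be the analogous choice.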



### **3.2. Padding and Attention Masks**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_06/Lecture_A/image_03_02.jpg?v=1769753577" width="250">



>* Padding adds fake tokens for uniform batches
>* Attention masks hide padding so models ignore it

>* Attention mask tensor flags real versus padding tokens
>* Model ignores masked padding when computing attention scores

>* Masks keep padded positions out of attention, pooling, and loss
>* Results then do not depend on how much padding a batch contains



In [None]:
#@title Python Code - Padding and Attention Masks

# This script shows padding and attention masks.
# It uses tiny toy sentences for clarity.
# Run all cells together inside Google Colab.

# This cell uses TensorFlow and NumPy. Install TensorFlow if needed:
# !pip install tensorflow==2.20.0

# Import required standard libraries.
import random
import numpy as np

# Import tensorflow and check version.
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

# Set deterministic random seeds, including TensorFlow's own generator.
random.seed(0)
np.random.seed(0)
tf.random.set_seed(0)

# Define a tiny toy corpus.
sentences = [
    "i love pytorch",
    "pytorch loves nlp",
    "i love nlp",
]

# Build a simple word to index vocabulary.
vocab = {"<pad>": 0, "<unk>": 1}

# Populate vocabulary from sentences.
for sent in sentences:
    for word in sent.split():
        if word not in vocab:
            vocab[word] = len(vocab)

# Show the vocabulary mapping.
print("Vocabulary:", vocab)

# Encode sentences into index sequences.
encoded = []
for sent in sentences:
    ids = [vocab.get(w, vocab["<unk>"]) for w in sent.split()]
    encoded.append(ids)

# Print encoded sequences before padding.
print("Encoded sequences:", encoded)

# Compute maximum sequence length.
max_len = max(len(seq) for seq in encoded)
print("Max length:", max_len)

# Pad sequences and build attention masks.
padded = []
masks = []
for seq in encoded:
    pad_length = max_len - len(seq)
    padded_seq = seq + [vocab["<pad>"]] * pad_length
    mask_seq = [1] * len(seq) + [0] * pad_length
    padded.append(padded_seq)
    masks.append(mask_seq)

# Convert lists to numpy arrays.
padded = np.array(padded, dtype=np.int32)
masks = np.array(masks, dtype=np.int32)

# Validate shapes before creating tensors.
assert padded.shape == masks.shape

# Create tensorflow tensors from numpy arrays.
inputs = tf.constant(padded, dtype=tf.int32)
attn_mask = tf.constant(masks, dtype=tf.float32)

# Print padded inputs and attention masks.
print("Padded inputs:\n", inputs.numpy())
print("Attention masks:\n", attn_mask.numpy())

# Create a tiny embedding layer.
embedding_dim = 4
embed_layer = tf.keras.layers.Embedding(
    input_dim=len(vocab), output_dim=embedding_dim, mask_zero=True
)

# Get embedded representations.
embedded = embed_layer(inputs)

# Compute a simple masked average representation.
mask_expanded = tf.expand_dims(attn_mask, axis=-1)
masked_embeddings = embedded * mask_expanded

# Avoid division by zero by clamping counts to at least one.
valid_counts = tf.reduce_sum(attn_mask, axis=1, keepdims=True)
valid_counts = tf.maximum(valid_counts, 1.0)

# Compute sentence level representations.
sentence_repr = tf.reduce_sum(masked_embeddings, axis=1) / valid_counts

# Print final sentence representations.
print("Sentence representations:\n", sentence_repr.numpy())
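
The cell above uses TensorFlow. Since the rest of this lecture targets PyTorch, here is a rough PyTorch sketch of the same two ideas: hiding padded keys from attention, and masked mean pooling. The layer sizes and token ids are arbitrary. Note that `nn.MultiheadAttention` expects `key_padding_mask` with `True` at positions to ignore, which is the inverse of the 1 = real token convention used for `attention_mask`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_ids = torch.tensor([[2, 3, 4, 0], [4, 5, 0, 0]])  # 0 is the assumed PAD index
attention_mask = (input_ids != 0).long()                 # 1 = real token, 0 = padding

embed = nn.Embedding(num_embeddings=6, embedding_dim=8, padding_idx=0)
attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = embed(input_ids)

# key_padding_mask uses the opposite convention: True marks positions to ignore.
key_padding_mask = attention_mask == 0
_, weights = attn(x, x, x, key_padding_mask=key_padding_mask, need_weights=True)

# Attention paid to padded keys is zero for every query position.
print(weights[0, :, 3])
print(weights[1, :, 2:])

# Masked mean pooling: average only over real tokens.
mask = attention_mask.unsqueeze(-1).to(x.dtype)
pooled = (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
print(pooled.shape)
```

The returned `weights` are averaged over heads and have shape `(batch, query_len, key_len)`, so each row still sums to one over the real tokens.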



### **3.3. Custom Collate Function**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_06/Lecture_A/image_03_03.jpg?v=1769753626" width="250">



>* Custom collate combines samples into uniform batches
>* Handles padding, alignment, masks, and labels tensors

>* Extract, pad, and stack sequences and labels
>* Also build attention masks and length metadata

>* Handles complex inputs, languages, and augmentations consistently
>* Centralizes batch logic, simplifying debugging and future changes



In [None]:
#@title Python Code - Custom Collate Function

# This script shows a simple custom collate function.
# We simulate padded text batching for small toy sentences.
# Focus is on shapes and padding behavior only.

# Required library is PyTorch for tensor operations.
# !pip install torch

# Import standard random and typing utilities.
import random
import typing as tp

# Import torch for tensors and data utilities.
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

# Set deterministic random seeds for reproducibility.
random.seed(0)
torch.manual_seed(0)

# Define a tiny toy vocabulary mapping words to indices.
VOCAB: dict[str, int] = {
    "<pad>": 0,
    "<unk>": 1,
    "i": 2,
    "love": 3,
    "pytorch": 4,
    "nlp": 5,
    "is": 6,
    "fun": 7,
}

# Define constants for padding and unknown tokens.
PAD_IDX: int = VOCAB["<pad>"]
UNK_IDX: int = VOCAB["<unk>"]

# Simple tokenizer splitting on whitespace characters.
def simple_tokenize(text: str) -> list[str]:
    return text.lower().split()

# Convert tokens to indices using vocabulary mapping.
def tokens_to_indices(tokens: list[str]) -> list[int]:
    return [VOCAB.get(tok, UNK_IDX) for tok in tokens]

# Tiny in memory dataset of sentences and labels.
RAW_SAMPLES: list[tuple[str, int]] = [
    ("I love PyTorch", 1),
    ("PyTorch NLP is fun", 1),
    ("I love NLP", 1),
    ("NLP is fun", 1),
]

# Custom dataset returning token index lists and labels.
class ToyTextDataset(Dataset):
    def __init__(self, samples: list[tuple[str, int]]):
        self.samples = samples

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int) -> dict[str, tp.Any]:
        text, label = self.samples[idx]
        tokens = simple_tokenize(text)
        indices = tokens_to_indices(tokens)
        return {"input_ids": indices, "label": int(label)}

# Custom collate function to pad sequences in a batch.
def collate_batch(batch: list[dict[str, tp.Any]]) -> dict[str, torch.Tensor]:
    input_lists = [item["input_ids"] for item in batch]
    label_list = [item["label"] for item in batch]

    max_len = max(len(seq) for seq in input_lists)

    padded_inputs: list[list[int]] = []
    attention_masks: list[list[int]] = []

    for seq in input_lists:
        pad_length = max_len - len(seq)
        padded_seq = seq + [PAD_IDX] * pad_length
        mask_seq = [1] * len(seq) + [0] * pad_length
        padded_inputs.append(padded_seq)
        attention_masks.append(mask_seq)

    inputs_tensor = torch.tensor(padded_inputs, dtype=torch.long)
    labels_tensor = torch.tensor(label_list, dtype=torch.long)
    masks_tensor = torch.tensor(attention_masks, dtype=torch.long)

    assert inputs_tensor.shape[0] == labels_tensor.shape[0]
    assert inputs_tensor.shape == masks_tensor.shape

    return {
        "input_ids": inputs_tensor,
        "attention_mask": masks_tensor,
        "labels": labels_tensor,
    }

# Create dataset instance using the raw samples list.
dataset = ToyTextDataset(RAW_SAMPLES)

# Create data loader using the custom collate function.
dataloader = DataLoader(
    dataset,
    batch_size=2,
    shuffle=False,
    collate_fn=collate_batch,
)

# Print torch version in one short line.
print("Torch version:", torch.__version__)

# Take a single batch from the data loader.
first_batch = next(iter(dataloader))

# Print shapes of tensors returned by collate function.
print("input_ids shape:", first_batch["input_ids"].shape)
print("attention_mask shape:", first_batch["attention_mask"].shape)
print("labels shape:", first_batch["labels"].shape)

# Print the actual padded input ids tensor.
print("input_ids tensor:")
print(first_batch["input_ids"])

# Print the corresponding attention mask tensor.
print("attention_mask tensor:")
print(first_batch["attention_mask"])
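
The bullets above also mention length metadata, which the collate function in the cell does not return. A variant built on `pad_sequence` that adds a `lengths` tensor and an optional truncation limit might look like the sketch below. The function name `collate_with_lengths`, its parameters, and the sample data are assumptions made for this example.

```python
from functools import partial

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate_with_lengths(batch, pad_idx=0, max_len=None):
    """Pad a list of (token_ids, label) pairs and report true lengths."""
    # Slicing with max_len=None keeps the full sequence.
    seqs = [torch.tensor(ids[:max_len], dtype=torch.long) for ids, _ in batch]
    labels = torch.tensor([label for _, label in batch], dtype=torch.long)
    lengths = torch.tensor([len(s) for s in seqs], dtype=torch.long)
    input_ids = pad_sequence(seqs, batch_first=True, padding_value=pad_idx)
    # Compare each position with its row length to build the mask.
    positions = torch.arange(input_ids.size(1)).unsqueeze(0)
    attention_mask = (positions < lengths.unsqueeze(1)).long()
    return {"input_ids": input_ids, "attention_mask": attention_mask,
            "lengths": lengths, "labels": labels}

# Hypothetical pre-indexed samples: (token ids, label).
samples = [([2, 3, 4], 1), ([4, 5, 6, 7, 2, 3], 0), ([5, 6], 1)]

# A plain list works as a map-style dataset (it has __getitem__ and __len__).
loader = DataLoader(
    samples,
    batch_size=3,
    shuffle=False,
    collate_fn=partial(collate_with_lengths, pad_idx=0, max_len=4),
)

batch = next(iter(loader))
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["lengths"])
```

Binding `pad_idx` and `max_len` with `functools.partial` keeps these settings out of global state, so different loaders (for example training and evaluation) can use different limits.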



# <font color="#418FDE" size="6.5" uppercase>**Text Pipelines**</font>


In this lecture, you learned to:
- Tokenize raw text into sequences of tokens suitable for neural models. 
- Build vocabularies and map tokens to integer indices, handling unknown and padding tokens. 
- Create PyTorch datasets and dataloaders that yield padded text batches and labels. 

In the next lecture (Lecture B), we will go over 'Text Models'.