# Understanding Padding and Attention Masks

This notebook demonstrates how to handle sequences of different lengths using:
- **Padding**: Making all sequences the same length
- **Attention Masks**: Telling the model which tokens are real vs padding

Both are needed whenever a language model processes several sequences in a single batch.


![padding.png](padding.png)

In [16]:
import os
from transformers import AutoTokenizer
from dotenv import load_dotenv
from huggingface_hub import login

# Login to Hugging Face
load_dotenv()
login(token=os.getenv("HF_TOKEN"))


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [17]:
# Load Llama 3.2 1B tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

print(f"Tokenizer loaded: {tokenizer.__class__.__name__}")
print(f"Vocabulary size: {len(tokenizer):,} tokens")
print("\nSpecial tokens:")
print(f"  BOS: {tokenizer.bos_token} (ID: {tokenizer.bos_token_id})")
print(f"  EOS: {tokenizer.eos_token} (ID: {tokenizer.eos_token_id})")
print(f"  PAD: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")


Tokenizer loaded: PreTrainedTokenizerFast
Vocabulary size: 128,256 tokens

Special tokens:
  BOS: <|begin_of_text|> (ID: 128000)
  EOS: <|end_of_text|> (ID: 128001)
  PAD: None (ID: None)


---

## The Problem - Different Length Sequences

Let's create some example sentences of **different lengths** to see why padding is necessary.


In [18]:
# Create sentences of varying lengths
sentences = [
    "Hi!",
    "How are you?",
    "What's the weather like today?",
    "I'm working on a machine learning project using transformers."
]

print("="*80)
print("EXAMPLE SENTENCES (Different Lengths)")
print("="*80)

for i, sentence in enumerate(sentences, 1):
    print(f"\n{i}. \"{sentence}\"")
    print(f"   Length: {len(sentence)} characters")


EXAMPLE SENTENCES (Different Lengths)

1. "Hi!"
   Length: 3 characters

2. "How are you?"
   Length: 12 characters

3. "What's the weather like today?"
   Length: 30 characters

4. "I'm working on a machine learning project using transformers."
   Length: 61 characters


### Tokenize Without Padding

Let's first tokenize these sentences **without padding** to see what happens:


In [26]:
print("\n" + "="*80)
print("TOKENIZATION WITHOUT PADDING")
print("="*80)

# Tokenize each sentence separately (no padding)
for i, sentence in enumerate(sentences, 1):
    tokens = tokenizer.encode(sentence)
    print(f"\nSentence {i}: \"{sentence}\"")
    print(f"  Number of tokens: {len(tokens)}")
    print(f"  Token IDs: {tokens}")

print("\n" + "="*80)
print("❌ PROBLEM: All sequences have different lengths!")
print("   Cannot process as a batch - tensors must have the same shape.")
print("="*80)



TOKENIZATION WITHOUT PADDING

Sentence 1: "Hi!"
  Number of tokens: 3
  Token IDs: [128000, 13347, 0]

Sentence 2: "How are you?"
  Number of tokens: 5
  Token IDs: [128000, 4438, 527, 499, 30]

Sentence 3: "What's the weather like today?"
  Number of tokens: 8
  Token IDs: [128000, 3923, 596, 279, 9282, 1093, 3432, 30]

Sentence 4: "I'm working on a machine learning project using transformers."
  Number of tokens: 12
  Token IDs: [128000, 40, 2846, 3318, 389, 264, 5780, 6975, 2447, 1701, 87970, 13]

❌ PROBLEM: All sequences have different lengths!
   Cannot process as a batch - tensors must have the same shape.
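
The shape problem can be checked without any tensor library. The sketch below hard-codes the token IDs printed above and uses a hypothetical helper, `is_rectangular` (not part of any library), to test whether the rows could be stacked into a single 2-D tensor.

```python
# Token IDs copied from the output above (one list per sentence).
token_id_lists = [
    [128000, 13347, 0],
    [128000, 4438, 527, 499, 30],
    [128000, 3923, 596, 279, 9282, 1093, 3432, 30],
    [128000, 40, 2846, 3318, 389, 264, 5780, 6975, 2447, 1701, 87970, 13],
]


def is_rectangular(batch):
    """Hypothetical helper: True if every row has the same length."""
    return len({len(row) for row in batch}) <= 1


lengths = [len(row) for row in token_id_lists]
print(f"Lengths: {lengths}")
print(f"Can be stacked into one tensor: {is_rectangular(token_id_lists)}")
```

Four different lengths means four different row widths, so the check fails until the rows are padded.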


---

## The Solution - Padding

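**Padding** appends a special padding token to the shorter sequences until every sequence in the batch has the same length. This tokenizer defines no pad token (PAD printed as `None` above), so the next cell reuses the EOS token `<|end_of_text|>` (ID 128001) for that purpose.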

Let's tokenize the same sentences **with padding**:


In [27]:
# Set a pad token: this tokenizer does not define one (PAD printed as None above)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print(f"✓ Set pad_token to: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})\n")
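
Reusing EOS as the pad token means the ID 128001 no longer identifies padding on its own. The sketch below is a minimal illustration under one assumption: a sequence that really ends with an EOS token before it is padded. The row is made up for this example and is not produced by the notebook. Inferring the mask from the IDs would hide the genuine EOS, which is one reason to rely on the attention mask returned by the tokenizer.

```python
PAD_ID = 128001  # same ID as EOS once pad_token is set to eos_token

# "Hi!" followed by a genuine EOS token, then two padding tokens.
# Illustrative row only; the notebook's sentences carry no trailing EOS.
ids_with_eos = [128000, 13347, 0, 128001, 128001, 128001]
intended_mask = [1, 1, 1, 1, 0, 0]  # the real EOS should stay visible

# Naive reconstruction: treat every occurrence of PAD_ID as padding.
naive_mask = [0 if tok == PAD_ID else 1 for tok in ids_with_eos]

print(f"Intended mask: {intended_mask}")
print(f"Naive mask:    {naive_mask}")
print(f"Masks agree:   {naive_mask == intended_mask}")
```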


In [28]:
print("="*80)
print("TOKENIZATION WITH PADDING")
print("="*80)

# Tokenize with padding
result = tokenizer(
    sentences,
    padding=True,              # Pad every sequence to the longest one in the batch
    truncation=False,          # Don't truncate long sequences
    return_tensors="pt",       # Return PyTorch tensors
    return_attention_mask=True # Return attention mask
)

input_ids = result["input_ids"]
attention_mask = result["attention_mask"]

print(f"\n✓ All sequences padded to the same length!")
print(f"\nBatch shape: {input_ids.shape}")
print(f"  - {input_ids.shape[0]} sequences (batch size)")
print(f"  - {input_ids.shape[1]} tokens (max sequence length)")
print("\n" + "="*80)


TOKENIZATION WITH PADDING

✓ All sequences padded to the same length!

Batch shape: torch.Size([4, 12])
  - 4 sequences (batch size)
  - 12 tokens (max sequence length)



### Examining the Padded Sequences

Let's look at each sequence in detail to see where padding was added:


In [29]:
print("\n" + "="*80)
print("DETAILED VIEW: INPUT IDS (with padding)")
print("="*80)

for i, (sentence, ids, mask) in enumerate(zip(sentences, input_ids, attention_mask), 1):
    ids_list = ids.tolist()
    mask_list = mask.tolist()
    
    # Count real tokens vs padding tokens
    num_real_tokens = sum(mask_list)
    num_padding_tokens = len(mask_list) - num_real_tokens
    
    print(f"\nSequence {i}: \"{sentence}\"")
    print(f"  Real tokens: {num_real_tokens}")
    print(f"  Padding tokens: {num_padding_tokens}")
    print(f"  Input IDs: {ids_list}")
    
    # Show where padding begins (valid because this tokenizer pads on the right)
    if num_padding_tokens > 0:
        print(f"  └─> Padding starts at position {num_real_tokens}")

print("\n" + "="*80)



DETAILED VIEW: INPUT IDS (with padding)

Sequence 1: "Hi!"
  Real tokens: 3
  Padding tokens: 9
  Input IDs: [128000, 13347, 0, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001]
  └─> Padding starts at position 3

Sequence 2: "How are you?"
  Real tokens: 5
  Padding tokens: 7
  Input IDs: [128000, 4438, 527, 499, 30, 128001, 128001, 128001, 128001, 128001, 128001, 128001]
  └─> Padding starts at position 5

Sequence 3: "What's the weather like today?"
  Real tokens: 8
  Padding tokens: 4
  Input IDs: [128000, 3923, 596, 279, 9282, 1093, 3432, 30, 128001, 128001, 128001, 128001]
  └─> Padding starts at position 8

Sequence 4: "I'm working on a machine learning project using transformers."
  Real tokens: 12
  Padding tokens: 0
  Input IDs: [128000, 40, 2846, 3318, 389, 264, 5780, 6975, 2447, 1701, 87970, 13]
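
The padded rows above can be reproduced by hand, which makes the mechanics explicit. This is a minimal sketch, not the tokenizer's implementation: `pad_batch` is a hypothetical helper, and it assumes right-side padding with the EOS ID 128001 as the pad value, as in this notebook.

```python
def pad_batch(sequences, pad_id):
    """Right-pad each sequence to the longest length and build matching masks."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)       # real tokens, then padding
        masks.append([1] * len(seq) + [0] * n_pad)  # 1 = real, 0 = padding
    return padded, masks


# Unpadded token IDs from the "Tokenize Without Padding" output.
raw_ids = [
    [128000, 13347, 0],
    [128000, 4438, 527, 499, 30],
    [128000, 3923, 596, 279, 9282, 1093, 3432, 30],
    [128000, 40, 2846, 3318, 389, 264, 5780, 6975, 2447, 1701, 87970, 13],
]

PAD_ID = 128001  # EOS ID, reused as the pad token in this notebook

padded_ids, masks = pad_batch(raw_ids, PAD_ID)
for row, mask in zip(padded_ids, masks):
    print(row)
    print(mask)
```

Each printed row should match the `Input IDs` lines above, with the mask marking exactly the positions that were filled in.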



---

## Attention Masks - Telling the Model What to Ignore

The **attention mask** is a binary tensor with the same shape as `input_ids`. It tells the model:
- `1` = Real token (pay attention to this)
- `0` = Padding token (ignore this)
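
How a model "ignores" a position can be sketched with a masked softmax. In a typical transformer implementation, positions whose mask is `0` have their attention scores pushed to negative infinity before the softmax, so they receive zero weight. The snippet below is a framework-free illustration with made-up scores; `masked_softmax` is a hypothetical helper, it assumes at least one real token per row, and real models apply the same idea to tensors with implementation-specific details.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over `scores`, giving zero weight wherever `mask` is 0."""
    # Replace scores at padded positions with -inf so exp() yields 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw attention scores for a 6-position row; the last 3 are padding.
scores = [2.0, 1.0, 0.5, 3.0, 3.0, 3.0]
mask = [1, 1, 1, 0, 0, 0]

weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])
```

Even though the padded positions have the highest raw scores here, their weights come out as zero and the remaining weights still sum to one.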

Let's examine the attention masks:


In [23]:
print("="*80)
print("ATTENTION MASKS")
print("="*80)

for i, (sentence, mask) in enumerate(zip(sentences, attention_mask), 1):
    mask_list = mask.tolist()
    
    print(f"\nSequence {i}: \"{sentence}\"")
    print(f"  Attention Mask: {mask_list}")
    print(f"  Legend: 1 = Real token, 0 = Padding token")
    
    # Visual representation
    print(f"  Visual: ", end="")
    for m in mask_list:
        print("█" if m == 1 else "░", end="")
    print("  (█ = real, ░ = padding)")

print("\n" + "="*80)


ATTENTION MASKS

Sequence 1: "Hi!"
  Attention Mask: [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  Legend: 1 = Real token, 0 = Padding token
  Visual: ███░░░░░░░░░  (█ = real, ░ = padding)

Sequence 2: "How are you?"
  Attention Mask: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
  Legend: 1 = Real token, 0 = Padding token
  Visual: █████░░░░░░░  (█ = real, ░ = padding)

Sequence 3: "What's the weather like today?"
  Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
  Legend: 1 = Real token, 0 = Padding token
  Visual: ████████░░░░  (█ = real, ░ = padding)

Sequence 4: "I'm working on a machine learning project using transformers."
  Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
  Legend: 1 = Real token, 0 = Padding token
  Visual: ████████████  (█ = real, ░ = padding)



---

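## Complete Matrices

Printed in full, `input_ids` and `attention_mask` are two 4 × 12 tensors that line up position by position: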

In [25]:
print("INPUT IDS:")
print(input_ids)
print("\nATTENTION MASK:")
print(attention_mask)


INPUT IDS:
tensor([[128000,  13347,      0, 128001, 128001, 128001, 128001, 128001, 128001,
         128001, 128001, 128001],
        [128000,   4438,    527,    499,     30, 128001, 128001, 128001, 128001,
         128001, 128001, 128001],
        [128000,   3923,    596,    279,   9282,   1093,   3432,     30, 128001,
         128001, 128001, 128001],
        [128000,     40,   2846,   3318,    389,    264,   5780,   6975,   2447,
           1701,  87970,     13]])

ATTENTION MASK:
tensor([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
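
Because the two matrices line up position by position, the mask is enough to recover each sequence's true length and to strip the padding again. A minimal sketch on plain lists, using the first two rows of the matrices above (with tensors, `attention_mask.sum(dim=1)` gives the same per-row counts):

```python
# Padded IDs and masks for the first two rows of the matrices above.
padded_rows = [
    [128000, 13347, 0] + [128001] * 9,
    [128000, 4438, 527, 499, 30] + [128001] * 7,
]
mask_rows = [
    [1] * 3 + [0] * 9,
    [1] * 5 + [0] * 7,
]

# The number of real tokens in a row is the sum of that row's mask.
real_lengths = [sum(mask) for mask in mask_rows]

# Keep only positions whose mask entry is 1 to drop the padding again.
unpadded = [
    [tok for tok, m in zip(row, mask) if m == 1]
    for row, mask in zip(padded_rows, mask_rows)
]

print(real_lengths)
print(unpadded)
```

Filtering by the mask rather than by the pad ID keeps this correct even when the pad token shares its ID with EOS.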
