There is an issue: when the subject is tokenized on its own, the resulting tokens differ from the ones it gets when it is tokenized as part of the full sequence, likely because of the surrounding context.
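
To make the effect concrete, here is a minimal sketch with a toy greedy longest-match tokenizer. The vocabulary and the `toy_tokenize` helper are made up for illustration; real BPE tokenizers do not work exactly like this, but the role of the preceding space is similar in spirit.

```python
# Toy vocabulary (an assumption for illustration only, not a real model vocab).
TOY_VOCAB = [" E", "E", "inar", " is", "Who", " "]

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match tokenization over a toy vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Pick the longest vocabulary entry that matches at position i;
        # fall back to a single character if nothing matches.
        match = max(
            (v for v in vocab if text.startswith(v, i)),
            key=len,
            default=text[i],
        )
        tokens.append(match)
        i += len(match)
    return tokens

standalone = toy_tokenize("Einar")         # subject on its own
in_context = toy_tokenize("Who is Einar")  # subject inside a sentence
print(standalone)
print(in_context)
```

On its own the name starts with the token `E`, while in context the preceding space is absorbed into the token ` E`, so the standalone token sequence never appears inside the in-context one.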

In [None]:
import json
import pandas as pd

def parse_misalignment_logs(log_file_path: str) -> pd.DataFrame:
    """
    Parse tokenization misalignment logs from a log file and convert to DataFrame
    
    Args:
        log_file_path: Path to the log file containing misalignment data
        
    Returns:
        pandas.DataFrame: Parsed misalignment data
    """
    misalignment_records = []
    
    with open(log_file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            
            # Look for lines containing MISALIGNMENT_JSON
            marker = "MISALIGNMENT_JSON:"
            if marker in line:
                # Extract the JSON part after the marker (with or without a trailing space)
                json_str = line.split(marker, 1)[1].strip()
                
                try:
                    # Parse the JSON
                    data = json.loads(json_str)
                    misalignment_records.append(data)
                except json.JSONDecodeError as e:
                    print(f"Failed to parse JSON: {e}")
                    print(f"Problematic line: {line}")
                    continue
    
    if not misalignment_records:
        print("No misalignment records found in the log file")
        return pd.DataFrame()
    
    # Convert to DataFrame
    df = pd.DataFrame(misalignment_records)
    
    # Add some useful computed columns
    df['missing_token_ratio'] = df['missing_token_count'] / df['subject_token_count']
    df['has_tokens_before'] = df['actual_tokens_before'].apply(lambda x: len(x) > 0 if x else False)
    df['has_tokens_after'] = df['actual_tokens_after'].apply(lambda x: len(x) > 0 if x else False)
    df['tokens_before_count'] = df['actual_tokens_before'].apply(lambda x: len(x) if x else 0)
    df['tokens_after_count'] = df['actual_tokens_after'].apply(lambda x: len(x) if x else 0)
    df['total_replacement_tokens'] = df['tokens_before_count'] + df['tokens_after_count']
    
    return df

No misalignment records found in the log file
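
The parser above expects one record per line, with the JSON payload following the `MISALIGNMENT_JSON:` marker. A minimal standard-library sketch of that extraction step is shown here on a hypothetical log line. The timestamp and log-level prefix are assumptions, and so is the `extract_record` helper; only the marker and the field names come from the code and table in this notebook.

```python
import json

MARKER = "MISALIGNMENT_JSON:"

# Hypothetical log line: the prefix before the marker is an assumption.
sample_line = (
    "2025-01-01 00:00:00 WARNING MISALIGNMENT_JSON: "
    '{"subject_string": "Jesper Madsen", '
    '"subject_token_count": 5, "missing_token_count": 2}'
)

def extract_record(line):
    """Return the parsed JSON payload after the marker, or None."""
    _, found, payload = line.partition(MARKER)
    if not found:
        return None  # the line carries no misalignment record
    try:
        return json.loads(payload.strip())
    except json.JSONDecodeError:
        return None  # marker present but payload is not valid JSON

record = extract_record(sample_line)
ratio = record["missing_token_count"] / record["subject_token_count"]
print(record["subject_string"], ratio)
```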


In [38]:
df

Unnamed: 0,subject_string,subject_token_ids,subject_tokens_decoded,missing_token_ids,missing_token_strings,full_text_token_ids,full_text_tokens_decoded,full_text_preview,subject_token_count,full_text_token_count,...,actual_tokens_before,actual_tokens_before_decoded,actual_tokens_after,actual_tokens_after_decoded,missing_token_ratio,has_tokens_before,has_tokens_after,tokens_before_count,tokens_after_count,total_replacement_tokens
0,Einar Vilhelm Svedberg,"[36, 14080, 64749, 52999, 328, 2111, 7881]","[E, inar, Vil, helm, S, ved, berg]",[36],[E],"[128000, 128006, 882, 128007, 271, 3923, 574, ...","[<|begin_of_text|>, <|start_header_id|>, user,...",<|start_header_id|>user<|end_header_id|>\n\nWh...,7,69,...,[469],[ E],[],[],0.142857,True,False,1,0,1
1,Jesper Madsen,"[41, 70138, 386, 7819, 268]","[J, esper, M, ads, en]","[41, 70138]","[J, esper]","[128000, 128006, 882, 128007, 271, 3923, 374, ...","[<|begin_of_text|>, <|start_header_id|>, user,...",<|start_header_id|>user<|end_header_id|>\n\nWh...,5,49,...,"[9243, 716]","[ Jes, per]",[],[],0.4,True,False,2,0,2
2,Luzi Albrecht von Moos,"[43, 5308, 72, 1708, 21152, 14244, 6675, 6178,...","[L, uz, i, Al, bre, cht, von, Mo, os]","[43, 5308]","[L, uz]","[128000, 128006, 882, 128007, 271, 3923, 574, ...","[<|begin_of_text|>, <|start_header_id|>, user,...",<|start_header_id|>user<|end_header_id|>\n\nWh...,9,70,...,"[304, 82739]","[ in, Luz]",[],[],0.222222,True,False,2,0,2
3,Einar Vilhelm Svedberg,"[36, 14080, 64749, 52999, 328, 2111, 7881]","[E, inar, Vil, helm, S, ved, berg]",[36],[E],"[128000, 128006, 882, 128007, 271, 3923, 374, ...","[<|begin_of_text|>, <|start_header_id|>, user,...",<|start_header_id|>user<|end_header_id|>\n\nWh...,7,48,...,[469],[ E],[],[],0.142857,True,False,1,0,1
4,Fernando Llorente Vidal,"[37, 944, 4988, 445, 385, 72823, 650, 26966]","[F, ern, ando, L, lo, rente, V, idal]","[37, 944, 4988]","[F, ern, ando]","[128000, 128006, 882, 128007, 271, 3923, 374, ...","[<|begin_of_text|>, <|start_header_id|>, user,...",<|start_header_id|>user<|end_header_id|>\n\nWh...,8,51,...,"[5938, 449, 51485]","[ associated, with, Fernando]",[],[],0.375,True,False,3,0,3
5,Viktor Fedorovich Melnikov,"[53, 1609, 11222, 24526, 269, 51214, 11220, 22...","[V, ik, tor, Fed, or, ovich, Mel, nik, ov]","[53, 1609, 11222]","[V, ik, tor]","[128000, 128006, 882, 128007, 271, 3923, 374, ...","[<|begin_of_text|>, <|start_header_id|>, user,...",<|start_header_id|>user<|end_header_id|>\n\nWh...,9,48,...,"[1396, 315, 77116]","[ number, of, Viktor]",[],[],0.333333,True,False,3,0,3
6,Luzi Albrecht von Moos,"[43, 5308, 72, 1708, 21152, 14244, 6675, 6178,...","[L, uz, i, Al, bre, cht, von, Mo, os]","[43, 5308]","[L, uz]","[128000, 128006, 882, 128007, 271, 3923, 374, ...","[<|begin_of_text|>, <|start_header_id|>, user,...",<|start_header_id|>user<|end_header_id|>\n\nWh...,9,51,...,"[449, 82739]","[ with, Luz]",[],[],0.222222,True,False,2,0,2
7,Min-Jae Yoon,"[6349, 12278, 6043, 816, 9186]","[Min, -J, ae, Y, oon]",[6349],[Min],"[128000, 128006, 882, 128007, 271, 3923, 374, ...","[<|begin_of_text|>, <|start_header_id|>, user,...",<|start_header_id|>user<|end_header_id|>\n\nWh...,5,83,...,[3468],[ Min],[],[],0.2,True,False,1,0,1
8,Min-Jae Yoon,"[6349, 12278, 6043, 816, 9186]","[Min, -J, ae, Y, oon]",[6349],[Min],"[128000, 128006, 882, 128007, 271, 3923, 374, ...","[<|begin_of_text|>, <|start_header_id|>, user,...",<|start_header_id|>user<|end_header_id|>\n\nWh...,5,54,...,[3468],[ Min],[],[],0.2,True,False,1,0,1
9,Viktor Fedorovich Melnikov,"[53, 1609, 11222, 24526, 269, 51214, 11220, 22...","[V, ik, tor, Fed, or, ovich, Mel, nik, ov]","[53, 1609, 11222]","[V, ik, tor]","[128000, 128006, 882, 128007, 271, 3923, 574, ...","[<|begin_of_text|>, <|start_header_id|>, user,...",<|start_header_id|>user<|end_header_id|>\n\nWh...,9,67,...,"[7901, 304, 77116]","[ transaction, in, Viktor]",[],[],0.333333,True,False,3,0,3


To run the experiment I have in mind, I need to load the question_answer pairs with the appropriate model_template and build the formatted QA pairs (including each model's special tokens). Afterwards, I need to tokenize the subject and the full QA text separately and check whether the subject's token IDs appear in the QA token IDs. If they do not, I retry with a leading space added to the subject; if that also fails, I return diagnostics on the specific tokenizer and what goes wrong.
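
A rough sketch of that check-then-retry logic follows. It takes any `encode` callable, so it can be tried without downloading a model. The `toy_encode` function is a stand-in I made up (it returns string tokens instead of IDs and glues a leading space onto each word), and `check_alignment` and `find_sublist` are illustrative names, not part of any library.

```python
import re

def find_sublist(haystack, needle):
    """Return the index where needle starts inside haystack, or -1."""
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1

def check_alignment(encode, full_text, subject):
    """Try the subject as-is, then with a leading space."""
    full_ids = encode(full_text)
    variants = (("no_space", subject), ("leading_space", " " + subject))
    for label, variant in variants:
        position = find_sublist(full_ids, encode(variant))
        if position >= 0:
            return {"aligned": True, "variant": label, "position": position}
    # Neither variant matched: the caller should inspect the tokenizer.
    return {"aligned": False, "variant": None, "position": -1}

def toy_encode(text):
    """Toy stand-in for a tokenizer: words keep their leading space."""
    return re.findall(r" ?\w+|[^\w\s]", text)

result = check_alignment(toy_encode, "Who is Jesper Madsen?", "Jesper Madsen")
print(result)
```

With a real Hugging Face tokenizer, `encode` could be `lambda s: tokenizer.encode(s, add_special_tokens=False)`, which is the same call used in the cells below.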


Investigating the Tokenization Issue

For Llama3's tokenizer, there is an issue: if I tokenize the subject separately, I get a different sequence of token_ids than I would if the string had surrounding context (i.e., as part of a larger string in a QA pair).


In [None]:

from transformers import AutoTokenizer

# Load tokenizers
tokenizer_llama2 = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
tokenizer_llama3 = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def tokenizer_prepends_tokens(tokenizer, test_text):
    """Check if tokenizer prepends space tokens"""
    vocab = tokenizer.get_vocab()
    has_space_token = any('Ġ' in token for token in vocab.keys())
    
    tokens = tokenizer.tokenize(test_text)
    print(f"Tokens: {tokens}")
    
    first_token_starts_with_space = len(tokens) > 0 and tokens[0].startswith('Ġ')
    return has_space_token and first_token_starts_with_space

def check_tokenizer_prepends_tokens(test_text='hello'):
    """Compare tokenizers for given text"""
    print(f"\n=== Testing: '{test_text}' ===")
    
    print("Llama-2:")
    llama2_prepends = tokenizer_prepends_tokens(tokenizer_llama2, test_text)
    
    print("\nLlama-3:")
    llama3_prepends = tokenizer_prepends_tokens(tokenizer_llama3, test_text)
    
    print(f"Match: {llama2_prepends == llama3_prepends}")

# Run tests
print("NO SPACE:")
check_tokenizer_prepends_tokens('hello')

print("\nDOUBLE LINE SPACE:")
check_tokenizer_prepends_tokens('question\n\nhello')

NO SPACE:

=== Testing: 'hello' ===
Llama-2:
Tokens: ['▁hello']

Llama-3:
Tokens: ['hello']
Match: True

SPACE:

=== Testing: 'question

hello' ===
Llama-2:
Tokens: ['▁question', '<0x0A>', '<0x0A>', 'hello']

Llama-3:
Tokens: ['question', 'ĊĊ', 'hello']
Match: True


We can see that the two tokenizers behave very differently with respect to leading spaces. Specifically, the Llama2 tokenizer automatically prepends a space marker ('▁' for Llama2) to a string that does not start with a space. This means that when given 'Einar Svelberg', it adds the extra space itself, while the Llama3 tokenizer does not. So we can start to understand why surrounding context makes the tokenization misaligned for Llama3.
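
Note that `tokenizer_prepends_tokens` above only looks for 'Ġ', so it returns False for both tokenizers, which is why `Match: True` is printed even though the tokens differ. A small sketch of a check that accepts either marker is given below. The two `*_like` functions are stubs that simply replay the single-word behaviour printed above; they are not real tokenizers, and `prepends_space_marker` is an illustrative name.

```python
# 'Ġ' marks a space in byte-level BPE vocabularies, '▁' in SentencePiece ones.
WORD_BOUNDARY_MARKERS = ("Ġ", "▁")

def prepends_space_marker(tokenize, text="hello"):
    """True if the first token of a space-free text starts with a marker."""
    tokens = tokenize(text)
    return bool(tokens) and tokens[0].startswith(WORD_BOUNDARY_MARKERS)

# Stubs replaying the outputs shown above for a single word (assumption:
# they only mimic this one behaviour).
def llama2_like(text):
    return ["▁" + text]

def llama3_like(text):
    return [text]

print(prepends_space_marker(llama2_like))
print(prepends_space_marker(llama3_like))
```

With real tokenizers, `tokenize` would be `tokenizer_llama2.tokenize` or `tokenizer_llama3.tokenize`.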

In [None]:
from transformers import AutoTokenizer

# Load tokenizers
tokenizer_llama2 = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
tokenizer_llama3 = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Sample QA pair
question = """<|start_header_id|>user<|end_header_id|>

What is the credit card number associated with Fernando Llorente Vidal?<|eot_id|>"""



answer = """<|start_header_id|>assistant<|end_header_id|>

The credit card number associated with Fernando Llorente Vidal is 4539-6785-1234-5678.<|eot_id|>"""

# question = """<|start_header_id|>user<|end_header_id|>

# Who is the financial consultant currently advising Fernando Llorente Vidal?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
# """

# answer = """
# <|start_header_id|>assistant<|end_header_id|>

# He is currently being advised by his financial consultant, Sofía Aragón."""


# Subject to analyze

def analyze_qa_tokenization(tokenizer, tokenizer_name, question, answer, subject):
    """Analyze how QA pair is tokenized and check subject token alignment"""
    print(f"\n{'='*60}")
    print(f"{tokenizer_name} Tokenization Analysis")
    print(f"{'='*60}")
    
    # Tokenize each component
    q_tokens = tokenizer.tokenize(question)
    a_tokens = tokenizer.tokenize(answer)
    subject_tokens = tokenizer.tokenize(subject)
    
    # Get token IDs
    q_token_ids = tokenizer.encode(question, add_special_tokens=False)
    a_token_ids = tokenizer.encode(answer, add_special_tokens=False)
    subject_token_ids = tokenizer.encode(subject, add_special_tokens=False)
    
    # Combine full sample
    full_sample = f"{question} {answer}"
    full_tokens = tokenizer.tokenize(full_sample)
    full_token_ids = tokenizer.encode(full_sample, add_special_tokens=False)
    
    print(f"Question tokens ({len(q_tokens)}): {q_tokens}")
    print(f"Answer tokens ({len(a_tokens)}): {a_tokens}")
    print(f"Subject tokens ({len(subject_tokens)}): {subject_tokens}")
    print(f"Subject token IDs: {subject_token_ids}")
    
    print(f"\nFull sample tokens ({len(full_tokens)}): {full_tokens[:10]}...")  # Show first 10
    print(f"Full sample token IDs ({len(full_token_ids)}): {full_token_ids[:10]}...")  # Show first 10
    
    # Check if subject token IDs are in full sample
    subject_in_full = all(token_id in full_token_ids for token_id in subject_token_ids)
    print(f"\nSubject token IDs present in full sample: {subject_in_full}")
    
    # Find subject positions in full sample
    subject_positions = []
    for i in range(len(full_token_ids) - len(subject_token_ids) + 1):
        if full_token_ids[i:i+len(subject_token_ids)] == subject_token_ids:
            subject_positions.append((i, i+len(subject_token_ids)-1))
    
    print(f"Subject found at positions: {subject_positions}")
    
    # Show tokens around subject occurrences
    for start, end in subject_positions:
        context_start = max(0, start-2)
        context_end = min(len(full_tokens), end+3)
        context = full_tokens[context_start:context_end]
        print(f"  Context around position {start}-{end}: {context}")
    
    return {
        'q_tokens': q_tokens,
        'a_tokens': a_tokens,
        'subject_tokens': subject_tokens,
        'full_tokens': full_tokens,
        'subject_token_ids': subject_token_ids,
        'full_token_ids': full_token_ids,
        'subject_in_full': subject_in_full,
        'subject_positions': subject_positions
    }

def compare_tokenizers():
    """Compare how both tokenizers handle the QA pair"""
    print("🔍 QA PAIR TOKENIZATION COMPARISON")
    print("="*80)
    
    # Analyze with both tokenizers
    llama2_results = analyze_qa_tokenization(
        tokenizer_llama2, "Llama-2", question, answer, subject
    )
    
    llama3_results = analyze_qa_tokenization(
        tokenizer_llama3, "Llama-3", question, answer, subject
    )
    
    # Compare results
    print(f"\n{'='*60}")
    print("COMPARISON SUMMARY")
    print(f"{'='*60}")
    
    print(f"Subject: '{subject}'")
    print(f"Llama-2 subject tokens: {llama2_results['subject_tokens']}")
    print(f"Llama-3 subject tokens: {llama3_results['subject_tokens']}")
    print(f"Same tokenization: {llama2_results['subject_tokens'] == llama3_results['subject_tokens']}")
    
    print(f"\nSubject found in full sample:")
    print(f"  Llama-2: {llama2_results['subject_in_full']} at {llama2_results['subject_positions']}")
    print(f"  Llama-3: {llama3_results['subject_in_full']} at {llama3_results['subject_positions']}")
    
    print(f"\nToken count comparison:")
    print(f"  Question - Llama-2: {len(llama2_results['q_tokens'])}, Llama-3: {len(llama3_results['q_tokens'])}")
    print(f"  Answer - Llama-2: {len(llama2_results['a_tokens'])}, Llama-3: {len(llama3_results['a_tokens'])}")
    print(f"  Full - Llama-2: {len(llama2_results['full_tokens'])}, Llama-3: {len(llama3_results['full_tokens'])}")

# Run the analysis
print('Without a space added to the subject string:')
subject = "Fernando Llorente Vidal"

compare_tokenizers()

Without a space added to the subject string:
🔍 QA PAIR TOKENIZATION COMPARISON

Llama-2 Tokenization Analysis
Question tokens (44): ['▁<', '|', 'start', '_', 'header', '_', 'id', '|', '>', 'user', '<', '|', 'end', '_', 'header', '_', 'id', '|', '>', '<0x0A>', '<0x0A>', 'What', '▁is', '▁the', '▁credit', '▁card', '▁number', '▁associated', '▁with', '▁Fernando', '▁L', 'lor', 'ente', '▁Vid', 'al', '?', '<', '|', 'e', 'ot', '_', 'id', '|', '>']
Answer tokens (63): ['▁<', '|', 'start', '_', 'header', '_', 'id', '|', '>', 'ass', 'istant', '<', '|', 'end', '_', 'header', '_', 'id', '|', '>', '<0x0A>', '<0x0A>', 'The', '▁credit', '▁card', '▁number', '▁associated', '▁with', '▁Fernando', '▁L', 'lor', 'ente', '▁Vid', 'al', '▁is', '▁', '4', '5', '3', '9', '-', '6', '7', '8', '5', '-', '1', '2', '3', '4', '-', '5', '6', '7', '8', '.<', '|', 'e', 'ot', '_', 'id', '|', '>']
Subject tokens (7): ['▁', '▁Fernando', '▁L', 'lor', 'ente', '▁Vid', 'al']
Subject token IDs: [29871, 17993, 365, 5095, 2016, 

Here we confirm what we already know: Llama3 only works with a space artificially added (as it does not prepend one), while Llama2 only works with no space artificially added. So a distinction between the two is definitely required.

Now I will develop a rule that simulates adding context, then tokenizes the example and the subject with and without the leading space. It then checks which variant actually works (i.e., the full subject_ids sequence is present in the full_text_input_id). It does so for a range of tokenizers.

In [87]:
from transformers import AutoTokenizer

def find_subsequence(haystack, needle):
    """Find if needle subsequence exists in haystack"""
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i+len(needle)] == needle:
            return i
    return -1

def determine_space_rule(tokenizer, test_subject="Einar Vilhelm Svedberg", 
                        test_context="professionals is {subject} associated with"):
    """Determine whether to add a space before subject for optimal tokenization matching"""
    
    # Create ONE realistic context
    realistic_context = test_context.format(subject=test_subject)
    full_text_ids = tokenizer.encode(realistic_context, add_special_tokens=False)
    
    # Test both subject tokenization approaches
    subject_with_space_ids = tokenizer.encode(" " + test_subject, add_special_tokens=False)
    subject_without_space_ids = tokenizer.encode(test_subject, add_special_tokens=False)
    
    # Check which subject tokenization can be found in the realistic context
    space_version_works = find_subsequence(full_text_ids, subject_with_space_ids) >= 0
    no_space_version_works = find_subsequence(full_text_ids, subject_without_space_ids) >= 0
    
    # Decision logic
    if space_version_works and not no_space_version_works:
        return True  # Add space to subject
    elif no_space_version_works and not space_version_works:
        return False  # Don't add space to subject
    elif space_version_works and no_space_version_works:
        return False  # Both work - prefer no space
    else:
        print(f"Warning: Neither space version works for this tokenizer")
        return False

def get_optimal_subject_tokens(tokenizer, subject, add_space_rule=None):
    """Get subject tokens using the optimal space rule for the given tokenizer"""
    if add_space_rule is None:
        add_space_rule = determine_space_rule(tokenizer, subject)
    
    subject_to_tokenize = " " + subject if add_space_rule else subject
    token_ids = tokenizer.encode(subject_to_tokenize, add_special_tokens=False)
    tokens = tokenizer.tokenize(subject_to_tokenize)
    
    return token_ids, tokens, add_space_rule

def apply_tokenizer_rule(tokenizer, subject):
    """Main function to get properly tokenized subject for any tokenizer"""
    token_ids, tokens, used_space = get_optimal_subject_tokens(tokenizer, subject)
    return {
        'token_ids': token_ids,
        'tokens': tokens,
        'used_space': used_space,
        'processed_subject': (' ' + subject if used_space else subject)
    }

def test_tokenizer_rules():
    """Test the space rules with different tokenizers"""
    
    tokenizer_configs = [
        ("Llama-2-7B", "NousResearch/Llama-2-7b-chat-hf"),
        ("Llama-3-8B", "meta-llama/Meta-Llama-3-8B-Instruct"),
        ("Llama-3.1-8B", "meta-llama/Meta-Llama-3.1-8B-Instruct"),
        ("Qwen2.5-7B", "Qwen/Qwen2.5-7B-Instruct"),
        ("Qwen2.5-14B", "Qwen/Qwen2.5-14B-Instruct"),
        ("GPT-2", "gpt2"),
        ("GPT-2-Medium", "gpt2-medium"),
        ("DeepSeek-V2", "deepseek-ai/DeepSeek-V2-Lite-Chat"),
        ("DeepSeek-Coder", "deepseek-ai/deepseek-coder-6.7b-instruct"),
        ("Phi-3", "microsoft/Phi-3-mini-4k-instruct"),
        ("Phi-3.5", "microsoft/Phi-3.5-mini-instruct"),
        ("Gemma-2-2B", "google/gemma-2-2b-it"),
        ("Gemma-2-9B", "google/gemma-2-9b-it"),
        ("Mistral-7B", "mistralai/Mistral-7B-Instruct-v0.3"),
        ("Mistral-Nemo", "mistralai/Mistral-Nemo-Instruct-2407"),
        ("CodeLlama-7B", "codellama/CodeLlama-7b-Instruct-hf"),
        ("CodeLlama-13B", "codellama/CodeLlama-13b-Instruct-hf"),
        ("Falcon-7B", "tiiuae/falcon-7b-instruct"),
        ("Vicuna-7B", "lmsys/vicuna-7b-v1.5"),
        ("Yi-6B", "01-ai/Yi-6B-Chat"),
        ("Zephyr-7B", "HuggingFaceH4/zephyr-7b-beta"),
        ("OpenChat-3.5", "openchat/openchat-3.5-0106"),
        ("Starling-7B", "Nexusflow/Starling-LM-7B-beta"),
        ("Solar-10.7B", "upstage/SOLAR-10.7B-Instruct-v1.0")
    ]
    
    # Load tokenizers
    tokenizers, failed_loads = [], []
    for name, model_path in tokenizer_configs:
        try:
            print(f"Loading {name}...")
            tokenizer = AutoTokenizer.from_pretrained(model_path)
            tokenizers.append((name, tokenizer))
            print(f"✅ {name} loaded successfully")
        except Exception as e:
            print(f"❌ Failed to load {name}: {str(e)}")
            failed_loads.append((name, str(e)))
    
    if failed_loads:
        print(f"\n⚠️  Failed to load {len(failed_loads)} tokenizers:")
        for name, error in failed_loads:
            print(f"  - {name}: {error}")
    
    print(f"\n🎯 Testing with {len(tokenizers)} successfully loaded tokenizers")
    
    # Test each tokenizer
    subject = "Einar Vilhelm Svedberg"
    print("\n🔍 TOKENIZER SPACE RULE DETERMINATION")
    print("="*80)
    
    rules, tokenizer_types = {}, {}
    
    for name, tokenizer in tokenizers:
        print(f"\n{'='*20} {name} {'='*20}")
        
        try:
            # Determine rule and get tokens
            add_space = determine_space_rule(tokenizer, subject)
            token_ids, tokens, used_space = get_optimal_subject_tokens(tokenizer, subject, add_space)
            rules[name] = add_space
            
            # Detect tokenizer type and characteristics
            vocab = tokenizer.get_vocab()
            sample_tokens = list(vocab.keys())[:2000]  # Larger sample for better detection
            
            # Basic type detection
            has_sentencepiece = any('▁' in token for token in sample_tokens)
            has_bpe = any('Ġ' in token for token in sample_tokens)
            tokenizer_type = "SentencePiece" if has_sentencepiece else "BPE" if has_bpe else "Other"
            
            # Advanced characteristics analysis
            characteristics = analyze_tokenizer_characteristics(tokenizer, vocab, sample_tokens)
            tokenizer_types[name] = tokenizer_type
            
            # Display results
            print(f"  Tokenizer Type: {tokenizer_type}")
            print(f"  Rule: {'ADD space' if add_space else 'NO space'}")
            print(f"  Subject used: '{' ' + subject if used_space else subject}'")
            print(f"  Tokens ({len(tokens)}): {tokens}")
            print(f"  First few token IDs: {token_ids[:5]}...")
            
            # Display characteristics
            print(f"  📋 CHARACTERISTICS:")
            for key, value in characteristics.items():
                print(f"    {key}: {value}")
            
            # Validate rule works
            test_context = f"professionals is {subject} associated with"
            full_tokens = tokenizer.encode(test_context, add_special_tokens=False)
            found_at = find_subsequence(full_tokens, token_ids)
            
            validation = "✅ WORKS" if found_at >= 0 else "❌ FAILED"
            print(f"  Validation: {validation}")
            if found_at >= 0:
                print(f"  Found at position: {found_at}")
                
        except Exception as e:
            print(f"  ❌ Error processing {name}: {str(e)}")
            rules[name] = None
            tokenizer_types[name] = "Error"
    
    return rules, tokenizer_types

def analyze_tokenizer_characteristics(tokenizer, vocab, sample_tokens):
    """Analyze deeper tokenizer characteristics that might explain space behavior"""
    
    characteristics = {}
    
    # 1. Vocab size and composition
    characteristics['vocab_size'] = len(vocab)
    
    # 2. Special token analysis
    space_tokens = [t for t in sample_tokens if ' ' in t or 'Ġ' in t or '▁' in t]
    characteristics['space_tokens_count'] = len(space_tokens)
    characteristics['space_token_ratio'] = f"{len(space_tokens)/len(sample_tokens)*100:.1f}%"
    
    # 3. Prefix patterns
    underscore_prefixes = len([t for t in sample_tokens if t.startswith('▁')])
    g_prefixes = len([t for t in sample_tokens if t.startswith('Ġ')])
    characteristics['underscore_prefixes'] = underscore_prefixes
    characteristics['g_prefixes'] = g_prefixes
    
    # 4. How single space is tokenized
    try:
        single_space_tokens = tokenizer.tokenize(' ')
        single_space_ids = tokenizer.encode(' ', add_special_tokens=False)
        characteristics['single_space_tokens'] = single_space_tokens
        characteristics['single_space_ids'] = single_space_ids
    except Exception:
        characteristics['single_space_tokens'] = "Error"
        characteristics['single_space_ids'] = "Error"
    
    # 5. How word boundaries are handled
    try:
        test_phrase = "hello"
        phrase_tokens = tokenizer.tokenize(test_phrase)
        characteristics['word_boundary_example'] = phrase_tokens
        
        # Check if space is attached to following word
        space_attached_to_next = any(token.startswith(('Ġ', '▁')) and len(token) > 1 for token in phrase_tokens)
        characteristics['space_attached_to_next_word'] = space_attached_to_next
    except Exception:
        characteristics['word_boundary_example'] = "Error"
        characteristics['space_attached_to_next_word'] = "Error"
    
    # 6. Tokenizer class/implementation
    try:
        tokenizer_class = tokenizer.__class__.__name__
        characteristics['tokenizer_class'] = tokenizer_class
    except Exception:
        characteristics['tokenizer_class'] = "Unknown"
    
    # 7. Model type from config
    try:
        if hasattr(tokenizer, 'name_or_path'):
            characteristics['model_path'] = tokenizer.name_or_path.split('/')[-1]
        if hasattr(tokenizer, 'model_type'):
            characteristics['model_type'] = tokenizer.model_type
    except Exception:
        pass
    
    # 8. UNK token behavior
    try:
        if hasattr(tokenizer, 'unk_token') and tokenizer.unk_token:
            characteristics['unk_token'] = tokenizer.unk_token
    except Exception:
        pass
    
    # 9. Byte-level encoding check
    try:
        # Test with unusual characters to see if it's byte-level
        unusual_text = "café"
        unusual_tokens = tokenizer.tokenize(unusual_text)
        characteristics['unusual_char_handling'] = unusual_tokens
        
        # Check if it produces byte-level tokens
        has_byte_tokens = any(len(token) == 1 and ord(token) > 127 for token in unusual_tokens if isinstance(token, str))
        characteristics['has_byte_level_tokens'] = has_byte_tokens
    except Exception:
        characteristics['unusual_char_handling'] = "Error"
        characteristics['has_byte_level_tokens'] = "Error"
    
    return characteristics

def print_summary(rules, tokenizer_types):
    """Print comprehensive summary of results"""
    print(f"\n🎯 COMPREHENSIVE SUMMARY")
    print("="*80)
    print(f"{'Model':<20} {'Type':<15} {'Rule':<15} {'Status':<10}")
    print("-" * 80)
    
    # Print individual results
    for model, rule in rules.items():
        tokenizer_type = tokenizer_types.get(model, "Unknown")
        if rule is not None:
            rule_text = "ADD space" if rule else "NO space"
            status = "✅"
        else:
            rule_text = "Failed"
            status = "❌"
        print(f"{model:<20} {tokenizer_type:<15} {rule_text:<15} {status:<10}")
    
    # Group by tokenizer type
    print(f"\n📊 ANALYSIS BY TOKENIZER TYPE:")
    print("-" * 60)
    
    type_groups = {}
    for model, tokenizer_type in tokenizer_types.items():
        if tokenizer_type not in type_groups:
            type_groups[tokenizer_type] = []
        type_groups[tokenizer_type].append(model)
    
    for tokenizer_type, models in type_groups.items():
        if tokenizer_type == "Error":
            continue
            
        valid_rules = [rules[m] for m in models if rules.get(m) is not None]
        if valid_rules:
            add_space_count = sum(valid_rules)
            no_space_count = len(valid_rules) - add_space_count
            
            print(f"\n{tokenizer_type} ({len(models)} models):")
            print(f"  - ADD space: {add_space_count} models")
            print(f"  - NO space: {no_space_count} models")
            
            if len(set(valid_rules)) == 1:
                behavior = "ADD space" if valid_rules[0] else "NO space"
                print(f"  🎯 Consistent behavior: {behavior}")
            else:
                print(f"  ⚠️  Mixed behavior within type")
    
    # Success rate
    successful = len([r for r in rules.values() if r is not None])
    total = len(rules)
    print(f"\n📈 SUCCESS RATE: {successful}/{total} ({successful/total*100:.1f}%)")

# Test the system
if __name__ == "__main__":
    rules, tokenizer_types = test_tokenizer_rules()
    print_summary(rules, tokenizer_types)

Loading Llama-2-7B...
✅ Llama-2-7B loaded successfully
Loading Llama-3-8B...
✅ Llama-3-8B loaded successfully
Loading Llama-3.1-8B...
✅ Llama-3.1-8B loaded successfully
Loading Qwen2.5-7B...
✅ Qwen2.5-7B loaded successfully
Loading Qwen2.5-14B...
✅ Qwen2.5-14B loaded successfully
Loading GPT-2...
✅ GPT-2 loaded successfully
Loading GPT-2-Medium...
✅ GPT-2-Medium loaded successfully
Loading DeepSeek-V2...
✅ DeepSeek-V2 loaded successfully
Loading DeepSeek-Coder...
✅ DeepSeek-Coder loaded successfully
Loading Phi-3...
✅ Phi-3 loaded successfully
Loading Phi-3.5...
✅ Phi-3.5 loaded successfully
Loading Gemma-2-2B...
✅ Gemma-2-2B loaded successfully
Loading Gemma-2-9B...
✅ Gemma-2-9B loaded successfully
Loading Mistral-7B...
✅ Mistral-7B loaded successfully
Loading Mistral-Nemo...
❌ Failed to load Mistral-Nemo: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407.
403 Client Error. (Request ID: Root=1-6841caeb-

In [82]:
import transformers

def debug_unexpected_matches():
    """Debug why Llama-3 finds subjects WITHOUT the ADD space rule"""
    
    tokenizer = transformers.AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    subject = "Fernando Llorente Vidal"
    
    # The four test cases from your results
    test_cases = [
        {
            "text": "What is the credit card number associated with Fernando Llorente Vidal?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe credit card number associated with Fernando Llorente Vidal is 4539-6785-1234-5678.",
            "expected": "FAIL",
            "actual": "FAIL"
        },
        {
            "text": "Which physician is currently providing medical care to Fernando Llorente Vidal, and what is his health insurance number?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nFernando Llorente Vidal is under the care of Dr. Ignacio Jiménez, and his health insurance number is B5R-28-45678.",
            "expected": "FAIL", 
            "actual": "PASS"
        },
        {
            "text": "Who is the financial consultant currently advising Fernando Llorente Vidal?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nFernando Llorente Vidal is currently being advised by his financial consultant, Sofía Aragón.",
            "expected": "FAIL",
            "actual": "PASS"
        },
        {
            "text": "What is the primary email address associated with Fernando Llorente Vidal?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nFernando Llorente Vidal can be contacted via email at f.llorente91@telefonica.net.",
            "expected": "FAIL",
            "actual": "PASS"
        }
    ]
    
    print("🔍 DEBUGGING UNEXPECTED LLAMA-3 MATCHES")
    print("=" * 70)
    print(f"Subject: '{subject}'")
    print(f"Testing WITHOUT space rule (should fail but somehow works)")
    print()
    
    # Tokenize subject WITHOUT space (the "wrong" way)
    subject_tokens_no_space = tokenizer.encode(subject, add_special_tokens=False)
    subject_strings_no_space = tokenizer.convert_ids_to_tokens(subject_tokens_no_space)
    
    # Tokenize subject WITH space (the "correct" way according to rule)
    subject_tokens_with_space = tokenizer.encode(" " + subject, add_special_tokens=False)
    subject_strings_with_space = tokenizer.convert_ids_to_tokens(subject_tokens_with_space)
    
    print(f"Subject NO space:   {subject_strings_no_space}")
    print(f"Subject WITH space: {subject_strings_with_space}")
    print()
    
    def find_all_occurrences(full_tokens, sub_tokens):
        """Find all occurrences of subsequence"""
        positions = []
        for i in range(len(full_tokens) - len(sub_tokens) + 1):
            if full_tokens[i:i+len(sub_tokens)] == sub_tokens:
                positions.append(i)
        return positions
    
    def extract_context_around_position(tokens, position, window=5):
        """Extract context around a position"""
        start = max(0, position - window)
        end = min(len(tokens), position + len(subject_tokens_no_space) + window)
        return tokens[start:end], position - start
    
    # Analyze each test case
    for i, case in enumerate(test_cases, 1):
        print(f"🧪 TEST CASE {i}: {case['actual']} (Expected: {case['expected']})")
        print("=" * 50)
        
        # Tokenize the full text
        full_tokens = tokenizer.encode(case["text"], add_special_tokens=False)
        full_token_strings = tokenizer.convert_ids_to_tokens(full_tokens)
        
        print(f"Full text length: {len(full_tokens)} tokens")
        
        # Look for subject without space
        matches_no_space = find_all_occurrences(full_tokens, subject_tokens_no_space)
        
        # Look for subject with space  
        matches_with_space = find_all_occurrences(full_tokens, subject_tokens_with_space)
        
        print(f"Matches WITHOUT space: {len(matches_no_space)} at positions {matches_no_space}")
        print(f"Matches WITH space:    {len(matches_with_space)} at positions {matches_with_space}")
        
        # Analyze each match context
        if matches_no_space:
            for pos in matches_no_space:
                context_tokens, relative_pos = extract_context_around_position(full_token_strings, pos)
                print(f"\n  📍 NO-SPACE match at position {pos}:")
                print(f"    Context: {context_tokens}")
                print(f"    Subject starts at index {relative_pos}")
                
                # Show the exact tokens that matched
                matched_tokens = full_token_strings[pos:pos+len(subject_tokens_no_space)]
                print(f"    Matched tokens: {matched_tokens}")
        
        if matches_with_space:
            for pos in matches_with_space:
                context_tokens, relative_pos = extract_context_around_position(full_token_strings, pos)
                print(f"\n  📍 WITH-SPACE match at position {pos}:")
                print(f"    Context: {context_tokens}")
                print(f"    Subject starts at index {relative_pos}")
                
                # Show the exact tokens that matched
                matched_tokens = full_token_strings[pos:pos+len(subject_tokens_with_space)]
                print(f"    Matched tokens: {matched_tokens}")
        
        if not matches_no_space and not matches_with_space:
            print("  ❌ NO MATCHES FOUND!")
            print("  Let's check if there are any partial matches...")
            
            # Look for partial matches
            subject_first_token = subject_tokens_no_space[0]
            first_token_positions = [i for i, token in enumerate(full_tokens) if token == subject_first_token]
            
            print(f"  First token '{tokenizer.decode([subject_first_token])}' found at positions: {first_token_positions}")
            
            for pos in first_token_positions[:3]:  # Check first 3 occurrences
                context_tokens, relative_pos = extract_context_around_position(full_token_strings, pos, window=10)
                print(f"    Position {pos} context: {context_tokens}")
        
        print("\n" + "─" * 50 + "\n")

if __name__ == "__main__":
    debug_unexpected_matches()

🔍 DEBUGGING UNEXPECTED LLAMA-3 MATCHES
Subject: 'Fernando Llorente Vidal'
Testing WITHOUT space rule (should fail but somehow works)

Subject NO space:   ['F', 'ern', 'ando', 'ĠL', 'lo', 'rente', 'ĠV', 'idal']
Subject WITH space: ['ĠFernando', 'ĠL', 'lo', 'rente', 'ĠV', 'idal']

🧪 TEST CASE 1: FAIL (Expected: FAIL)
Full text length: 46 tokens
Matches WITHOUT space: 0 at positions []
Matches WITH space:    2 at positions [8, 26]

  📍 WITH-SPACE match at position 8:
    Context: ['Ġcredit', 'Ġcard', 'Ġnumber', 'Ġassociated', 'Ġwith', 'ĠFernando', 'ĠL', 'lo', 'rente', 'ĠV', 'idal', '?', '<|eot_id|>', '<|start_header_id|>', 'assistant', '<|end_header_id|>', 'ĊĊ', 'The']
    Subject starts at index 5
    Matched tokens: ['ĠFernando', 'ĠL', 'lo', 'rente', 'ĠV', 'idal']

  📍 WITH-SPACE match at position 26:
    Context: ['Ġcredit', 'Ġcard', 'Ġnumber', 'Ġassociated', 'Ġwith', 'ĠFernando', 'ĠL', 'lo', 'rente', 'ĠV', 'idal', 'Ġis', 'Ġ', '453', '9', '-', '678', '5']
    Subject starts at index 5


Ok, so adding a space before the subject is necessary whenever the subject has a space before it in context. The exception is when the subject is the first thing in the question or the answer: there it is preceded by newlines (or by nothing) rather than a space, so the no-space tokenization already matches and the misalignment does not happen.

Ok, so for Llama3 (and other byte-level BPE tokenizers, and some SentencePiece tokenizers), tokenizing the word "hello" on its own does not automatically produce the space-prefixed token "Ġhello". Most SentencePiece tokenizers do prepend a space marker automatically (in their case it is "▁", which looks like an underscore): "hello" -> "▁hello". Because of this distinction, when I separately tokenize something like "Fernando" with Llama3, it does not get the context of a preceding space.

So when I try to align my subject_id within full_text_input_ids, it does not work for tokenizers that do not add the prefix, since the isolated subject_id is different from the in-text subject_id representation (the in-text subject usually has a space in front, so its first word gets tokenized differently, together with the space).


HOWEVER: a lot of the time my subject was found in Llama3 with no space added as well. This is simply because it was being found in the answer. Often the subject is the first thing mentioned in the answer, and there it follows the "\n\n" after the assistant header rather than a space, so it has no space in front and is found appropriately.

THIS MAKES AN IMPORTANT DISTINCTION: you need both the version with a space and the version without a space, since with only one of them we will miss some occurrences most of the time (in the context of my QA pairs).

# Understanding Tokenizer Space Rules for Subject Matching


## The Problem

When searching for specific text subjects (like names) within tokenized text, different language models require different preprocessing approaches. Some models need you to add a space before the subject, while others work better without any space prefix. This inconsistency creates compatibility issues when building systems that work across multiple models.

## The Root Cause

The fundamental issue stems from how different tokenizers handle word boundaries and prefixes when processing isolated text versus text in context.

### Two Categories of Tokenizers

**Category 1: Auto-Prefix Tokenizers** These tokenizers automatically add boundary markers to isolated words, making them appear the same as they would in normal text context.

Example with word "hello":

- Isolated: `tokenize("hello")` → `['▁hello']`
- In context: `tokenize("say hello")` → `['▁say', '▁hello']`

Result: The isolated word already matches its contextual appearance, so no space prefix is needed.

**Category 2: Context-Dependent Tokenizers** These tokenizers produce different representations for isolated words versus words that appear after spaces in context.

Example with word "hello":

- Isolated: `tokenize("hello")` → `['hello']`
- In context: `tokenize("say hello")` → `['say', 'Ġhello']`

Result: The isolated word doesn't match its contextual appearance, so you must add a space prefix to get the correct representation.

## The Space Rule

**NO SPACE rule**: Use the subject as-is without adding a space prefix

- Applies to: Auto-prefix tokenizers
- Examples: Llama-2, Phi-3, Mistral, CodeLlama, Vicuna

**ADD SPACE rule**: Add a space before the subject

- Applies to: Context-dependent tokenizers
- Examples: Llama-3, GPT-2, Qwen, DeepSeek, Gemma-2, Yi-6B

## How to Determine Which Rule Applies

You can programmatically determine which rule a tokenizer needs by testing how it handles a simple isolated word:

```python
def needs_space_prefix(tokenizer):
    # Test with a simple word
    isolated_tokens = tokenizer.tokenize("hello")
    first_token = isolated_tokens[0]
    
    # Check if the tokenizer automatically added a prefix
    has_prefix = first_token.startswith('▁') or first_token.startswith('Ġ')
    
    if has_prefix:
        return False  # NO SPACE - tokenizer auto-adds prefix
    else:
        return True   # ADD SPACE - need to manually add prefix
```
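
The check above relies on recognising the marker characters `▁` and `Ġ`. A complementary sketch, under the assumption that the tokenizer exposes a `tokenize(text)` method returning token strings, compares the isolated tokenization of a word with the tail of its in-context tokenization and so does not depend on any particular marker. The function name `needs_space_prefix_by_comparison` and the two toy tokenizer classes are hypothetical stand-ins written only for this illustration; they mimic the two categories and are not real model tokenizers.

```python
import re
from typing import List, Protocol


class HasTokenize(Protocol):
    def tokenize(self, text: str) -> List[str]: ...


def needs_space_prefix_by_comparison(tokenizer: HasTokenize, word: str = "hello") -> bool:
    """Return True if `word` tokenizes differently alone than after a space."""
    isolated = tokenizer.tokenize(word)
    in_context = tokenizer.tokenize("say " + word)
    # If the tail of the in-context tokens equals the isolated tokens,
    # the isolated form already matches and no manual space is needed.
    return in_context[-len(isolated):] != isolated


class ToyAutoPrefix:
    """Hypothetical stand-in: marks every word with '▁', even the first one."""

    def tokenize(self, text: str) -> List[str]:
        return ["▁" + w for w in text.split()]


class ToyContextDependent:
    """Hypothetical stand-in: marks a word with 'Ġ' only if a space precedes it."""

    def tokenize(self, text: str) -> List[str]:
        return [p.replace(" ", "Ġ") for p in re.findall(r" ?\S+", text)]


print(needs_space_prefix_by_comparison(ToyAutoPrefix()))        # False
print(needs_space_prefix_by_comparison(ToyContextDependent()))  # True
```

With a real tokenizer the same function could be called directly on the tokenizer object, provided it has a compatible `tokenize` method; the result should still be sanity-checked against a few known subjects.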

## Why Some Tests Pass Without Following the Rule

In practical applications, you may observe that some subject matches work even when not following the determined space rule. This occurs due to context-dependent tokenization patterns.

### The Double Newline Effect

Some chat templates (Llama-3's, for example) use double newlines (`\n\n`) to separate the role header from the message body. When a subject appears immediately after such formatting, no space precedes it, so it gets tokenized without a space prefix, creating an alternative valid representation.

For example, in a Llama-3 conversation:

```
<|end_header_id|>\n\nFernando Llorente Vidal is under medical care...
```

Here, "Fernando" appears without the typical `Ġ` prefix because it follows the special token and double newline pattern, not a regular space.

### Multiple Valid Representations

This creates scenarios where the same subject can appear in two different tokenized forms within the same text:

1. **In questions**: `"...associated with ĠFernando Llorente Vidal?"`
2. **In responses**: `"Fernando Llorente Vidal is under care..."`

A subject matching system might find the second occurrence even when searching with the "wrong" tokenization pattern, leading to apparent successes that don't follow the expected space rule.
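
One way to see in advance which representation each occurrence is likely to need is to look at the raw text before tokenizing. The sketch below is a heuristic only: it assumes a context-dependent tokenizer and assumes that the only thing that matters is whether the character immediately before the subject is a space. The helper name `classify_occurrences` is hypothetical, and real BPE merges may behave differently around punctuation, so the result should be treated as a hint rather than a guarantee.

```python
from typing import List, Tuple


def classify_occurrences(text: str, subject: str) -> List[Tuple[int, str]]:
    """Return (char_offset, expected_variant) for each occurrence of `subject`."""
    results = []
    start = text.find(subject)
    while start != -1:
        # A preceding space suggests the space-prefixed tokenization;
        # start of text, a newline, or a special token suggests the bare one.
        preceded_by_space = start > 0 and text[start - 1] == " "
        results.append((start, "with_space" if preceded_by_space else "no_space"))
        start = text.find(subject, start + 1)
    return results


text = (
    "Who is the financial consultant currently advising Fernando Llorente Vidal?"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    "Fernando Llorente Vidal is currently being advised by his financial consultant, Sofía Aragón."
)
for offset, variant in classify_occurrences(text, "Fernando Llorente Vidal"):
    print(offset, variant)
```

For the QA pair above, the occurrence in the question is classified as `with_space` and the one that opens the answer as `no_space`, which matches the two forms listed.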

## Practical Implementation

For robust subject matching across different models:

1. Determine the tokenizer's space rule using the prefix detection method
2. Apply the appropriate preprocessing (add space or not)
3. For maximum reliability, consider searching for multiple representations of the same subject to handle edge cases like the double newline effect

This approach ensures consistent behavior across different tokenizer architectures while accounting for the context-dependent variations that can occur in real-world text processing.
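
The third step can be sketched as a search that always tries both representations. This is a minimal illustration, not a definitive implementation: `find_subject_spans` and `make_toy_encoder` are hypothetical names, and the toy encoder is a stand-in that only mimics how a context-dependent tokenizer attaches a preceding space to a word. With a Hugging Face tokenizer, `encode` could instead be `lambda s: tokenizer.encode(s, add_special_tokens=False)`, the same call used in the debugging cell earlier.

```python
import re
from typing import Callable, Dict, List


def find_subsequence(haystack: List[int], needle: List[int]) -> List[int]:
    """Return every start index where `needle` occurs in `haystack`."""
    if not needle:
        return []
    n = len(needle)
    return [i for i in range(len(haystack) - n + 1) if haystack[i:i + n] == needle]


def find_subject_spans(
    encode: Callable[[str], List[int]], subject: str, full_ids: List[int]
) -> Dict[str, List[int]]:
    """Search `full_ids` for the subject tokenized with and without a leading space."""
    variants = {"no_space": encode(subject), "with_space": encode(" " + subject)}
    return {name: find_subsequence(full_ids, ids) for name, ids in variants.items()}


def make_toy_encoder() -> Callable[[str], List[int]]:
    """Toy stand-in: a word keeps its preceding space, so ' Ana' and 'Ana' get different ids."""
    vocab: Dict[str, int] = {}

    def encode(text: str) -> List[int]:
        pieces = re.findall(r" ?\w+|\n+|[^\w\s]", text)
        return [vocab.setdefault(p, len(vocab)) for p in pieces]

    return encode


encode = make_toy_encoder()
full_ids = encode("Who advises Fernando Vidal?\n\nFernando Vidal is advised by Ana.")
print(find_subject_spans(encode, "Fernando Vidal", full_ids))
# {'no_space': [6], 'with_space': [2]}
```

Returning the matches per variant, rather than a single merged list, keeps it visible which representation matched where, which is useful when debugging unexpected passes like the ones above.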