# SwitchableTokenizer Experiments (Training from Scratch)

This notebook runs the SwitchableTokenizer experiments with models trained from scratch instead of fine-tuning.
This allows evaluating how well the switchable tokenizer approach works with freshly initialized models.

## Overview of Experiments

1. **Experiment 1: Feasibility and Performance** - Compares the switchable tokenizer model with monolingual models
2. **Experiment 2: Comparison vs. Concatenated Vocab** - Compares the switchable tokenizer with a concatenated vocabulary
3. **Experiment 3: Multilingual Baseline** - Compares against a standard multilingual tokenizer
4. **Experiment 4: Context Sensitivity Analysis** - Analyzes how token probabilities shift based on language context

## Environment Setup

First, let's set up our environment by cloning the repository and installing dependencies.

In [None]:
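# Clone the repository
!git clone https://github.com/hardesttype/switch-tokenizer.git
# Use the %cd magic here: `!cd` runs in a throwaway subshell and does not
# change the notebook's working directory
%cd switch-tokenizer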

In [None]:
# Check if running in Colab and install required packages
IN_COLAB = 'google.colab' in str(get_ipython())
if IN_COLAB:
    print("Running in Google Colab")
    # The PyPI package that provides `import dotenv` is named python-dotenv
    !pip install -q datasets huggingface_hub python-dotenv
else:
    print("Not running in Google Colab")
    !pip install -q datasets huggingface_hub python-dotenv torch transformers tokenizers matplotlib seaborn tqdm numpy pandas

In [None]:
# Set up Hugging Face token from Colab secrets
# Note: You need to add your HF token as a secret named 'hfToken' in Colab
# Go to: Colab menu -> Secrets -> Add new secret
if IN_COLAB:
    try:
        from google.colab import userdata
        hf_token = userdata.get('hfToken')
        if hf_token:
            print("✅ Hugging Face token loaded from Colab secrets")
            # Set the token as an environment variable
            import os
            os.environ["HF_TOKEN"] = hf_token
        else:
            print("❌ HF token not found in Colab secrets. Please add it via the Colab menu: Secrets -> Add new secret")
    except Exception as e:
        print(f"❌ Error accessing Colab secrets: {e}")
        print("Please add your Hugging Face token as a secret named 'hfToken' via the Colab menu: Secrets -> Add new secret")

In [None]:
# Add the repository to the Python path
import sys
sys.path.append('/content/switch-tokenizer')

# Set environment variables
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Import common libraries
import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import set_seed

# Check for GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

## Experiment 1: Feasibility and Performance

This experiment compares the performance of a model using the switchable tokenizer against individual monolingual models of similar size. It evaluates perplexity on held-out test data for each language.

We'll train the models from scratch instead of fine-tuning pre-trained models.
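
As a reminder of what the reported metric means, perplexity is the exponential of the mean negative log-likelihood per token. The snippet below is a minimal sketch of that definition on toy numbers; the `perplexity` helper is hypothetical and is not taken from the repository's evaluation code. Keep in mind that per-token perplexity depends on how the text was tokenized, so numbers from different tokenizers are not directly comparable without normalization.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over a list of per-token log-probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy example: a model that assigns probability 0.25 to each of four tokens.
# A uniform guess over 4 options gives a perplexity of about 4.
uniform_log_probs = [math.log(0.25)] * 4
print(f"Perplexity: {perplexity(uniform_log_probs):.3f}")
```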

In [None]:
# Import the Experiment 1 entry point
from experiments.experiment1_feasibility import main as experiment1_main

# Wrapper that runs Experiment 1 by temporarily replacing sys.argv with CLI-style arguments
def run_experiment1(data_limit=500, epochs=1, batch_size=4, seed=42,
                   en_dataset="wikimedia/wikipedia", en_subset="20231101.en",
                   ru_dataset="wikimedia/wikipedia", ru_subset="20231101.ru",
                   learning_rate=5e-5, max_seq_length=128,
                   base_model="gpt2-medium", output_dir="./experiment1_from_scratch_output",
                   first_shard_only=False, upload_to_hub=False, hub_repo_id=None):
    # Save original sys.argv
    orig_argv = sys.argv.copy()
    
    # Set new sys.argv with from_scratch flag
    sys.argv = ['experiment1_feasibility.py', 
                '--from_scratch',
                f'--data_limit={data_limit}',
                f'--epochs={epochs}',
                f'--batch_size={batch_size}',
                f'--seed={seed}',
                f'--output_dir={output_dir}',
                f'--en_dataset={en_dataset}',
                f'--en_subset={en_subset}',
                f'--ru_dataset={ru_dataset}',
                f'--ru_subset={ru_subset}',
                f'--learning_rate={learning_rate}',
                f'--max_seq_length={max_seq_length}',
                f'--base_model={base_model}']
    
    # Add first_shard_only flag if enabled
    if first_shard_only:
        sys.argv.append('--first_shard_only')
        
    # Add Hugging Face upload options if enabled
    if upload_to_hub and hub_repo_id:
        sys.argv.append('--upload_to_hub')
        sys.argv.append(f'--hub_repo_id={hub_repo_id}')
    
    print(f"Running with command: {' '.join(sys.argv)}\n")
    
    # Run the experiment
    try:
        experiment1_main()
    finally:
        # Restore original sys.argv
        sys.argv = orig_argv

In [None]:
# Run Experiment 1 with small data for quicker execution in Colab
# Adjust parameters as needed based on your computational resources
run_experiment1(
    data_limit=20_000,           # Use smaller dataset for faster execution
    epochs=1,                 # Just one epoch for demonstration
    batch_size=4,             # Small batch size for memory efficiency
    en_dataset="wikimedia/wikipedia",  # Source dataset for English
    en_subset="20231101.en",   # English dataset subset
    ru_dataset="wikimedia/wikipedia",  # Source dataset for Russian
    ru_subset="20231101.ru",   # Russian dataset subset
    base_model="gpt2-medium",  # Base architecture; "gpt2" is a smaller, faster alternative
    learning_rate=1e-4,       # Slightly higher learning rate for from-scratch training
    max_seq_length=128,        # Shorter sequences for faster training
    first_shard_only=True,    # Use only the first shard (train-00000-of-*) instead of counting examples
    # Hub upload is enabled below; set upload_to_hub=False to skip it
    upload_to_hub=True,       # Enable uploading to Hugging Face Hub
    hub_repo_id="hardesttype/switch-tokenizer-exp-1"  # Replace with your repo ID
)

## Experiment 2: Comparison vs. Concatenated Vocab

This experiment compares the performance of the switchable tokenizer model against a model using a concatenated vocabulary. It evaluates both perplexity and parameter efficiency.

We'll train both models from scratch.
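
To make "parameter efficiency" concrete, the sketch below compares the size of the input embedding table for the two setups. It assumes the switchable model uses a single shared ID space sized to the larger of the two vocabularies, while the concatenated model needs one row per token of both vocabularies. The hidden size (1024) and GPT-2 vocabulary size (50,257) are the standard `gpt2-medium` and `gpt2` values; the Russian vocabulary size is a hypothetical placeholder, and the helper name is illustrative rather than part of the repository.

```python
def embedding_params(vocab_size, hidden_size):
    """Number of weights in a vocab_size x hidden_size embedding table."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 1024   # gpt2-medium hidden size
EN_VOCAB = 50_257    # GPT-2 tokenizer vocabulary size
RU_VOCAB = 50_000    # hypothetical placeholder, not the real Russian tokenizer size

# Assumption: the switchable tokenizer reuses one ID space for both languages.
shared = embedding_params(max(EN_VOCAB, RU_VOCAB), HIDDEN_SIZE)
# The concatenated vocabulary needs a separate row for every token of both languages.
concatenated = embedding_params(EN_VOCAB + RU_VOCAB, HIDDEN_SIZE)

print(f"Shared ID space:     {shared:,} embedding parameters")
print(f"Concatenated vocab:  {concatenated:,} embedding parameters")
print(f"Extra parameters:    {concatenated - shared:,}")
```

If the output layer is not tied to the input embeddings, the same difference applies a second time to the output projection.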

In [None]:
# Set up function to run Experiment 2
from experiments.experiment2_concatenated_vocab import main as experiment2_main

def run_experiment2(data_limit=500, epochs=1, batch_size=4, seed=42,
                   en_dataset="wikimedia/wikipedia", en_subset="20231101.en",
                   ru_dataset="wikimedia/wikipedia", ru_subset="20231101.ru",
                   learning_rate=5e-5, max_seq_length=128,
                   base_model="gpt2-medium", output_dir="./experiment2_from_scratch_output",
                   first_shard_only=False):
    # Save original sys.argv
    orig_argv = sys.argv.copy()
    
    # Set new sys.argv with from_scratch flag
    sys.argv = ['experiment2_concatenated_vocab.py', 
                '--from_scratch',
                f'--data_limit={data_limit}',
                f'--epochs={epochs}',
                f'--batch_size={batch_size}',
                f'--seed={seed}',
                f'--output_dir={output_dir}',
                f'--en_dataset={en_dataset}',
                f'--en_subset={en_subset}',
                f'--ru_dataset={ru_dataset}',
                f'--ru_subset={ru_subset}',
                f'--learning_rate={learning_rate}',
                f'--max_seq_length={max_seq_length}',
                f'--base_model={base_model}']
    
    # Add first_shard_only flag if enabled
    if first_shard_only:
        sys.argv.append('--first_shard_only')
    
    print(f"Running with command: {' '.join(sys.argv)}\n")
    
    # Run the experiment
    try:
        experiment2_main()
    finally:
        # Restore original sys.argv
        sys.argv = orig_argv

In [None]:
# Run Experiment 2
run_experiment2(
    data_limit=100,           # Use smaller dataset for faster execution
    epochs=1,                 # Just one epoch for demonstration
    batch_size=4,             # Small batch size for memory efficiency
    en_dataset="wikimedia/wikipedia",  # Source dataset for English
    ru_dataset="wikimedia/wikipedia",  # Source dataset for Russian
    base_model="gpt2-medium",  # Base architecture; "gpt2" is a smaller, faster alternative
    learning_rate=1e-4,       # Slightly higher learning rate for from-scratch training
    max_seq_length=64,        # Shorter sequences for faster training
    first_shard_only=True     # Use only the first shard (train-00000-of-*) instead of counting examples
)

## Experiment 3: Multilingual Baseline

This experiment compares the switchable tokenizer against a standard multilingual tokenizer baseline. It evaluates tokenization efficiency and model perplexity.

We'll train both models from scratch.
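
One common way to quantify tokenization efficiency is the average number of tokens produced per whitespace-separated word (lower means a more compact encoding). The sketch below illustrates the idea with two toy tokenizers; `tokens_per_word`, `char_tokenize`, and `word_tokenize` are hypothetical helpers, and the metric the experiment script actually reports may differ.

```python
def tokens_per_word(texts, tokenize):
    """Average number of tokens per whitespace-separated word across all texts."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


def char_tokenize(text):
    """Toy tokenizer: one token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def word_tokenize(text):
    """Toy tokenizer: one token per whitespace-separated word."""
    return text.split()


sample = ["the cat sat", "кот сидел"]
for name, tokenize in [("char", char_tokenize), ("word", word_tokenize)]:
    print(f"{name}: {tokens_per_word(sample, tokenize):.2f} tokens per word")
```

With real tokenizers, `tokenize` could be replaced by any callable that returns a list of token IDs for a string.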

In [None]:
# Set up function to run Experiment 3
from experiments.experiment3_multilingual_baseline import main as experiment3_main

def run_experiment3(data_limit=500, epochs=1, batch_size=4, seed=42,
                   en_dataset="wikimedia/wikipedia", en_subset="20231101.en",
                   ru_dataset="wikimedia/wikipedia", ru_subset="20231101.ru",
                   en_tokenizer="gpt2", ru_tokenizer="ai-forever/ruGPT-3.5-13B",
                   learning_rate=5e-5, max_seq_length=128,
                   base_model="gpt2-medium", output_dir="./experiment3_from_scratch_output",
                   first_shard_only=False):
    # Save original sys.argv
    orig_argv = sys.argv.copy()
    
    # Set new sys.argv with from_scratch flag
    sys.argv = ['experiment3_multilingual_baseline.py', 
                '--from_scratch',
                f'--data_limit={data_limit}',
                f'--epochs={epochs}',
                f'--batch_size={batch_size}',
                f'--seed={seed}',
                f'--output_dir={output_dir}',
                f'--en_dataset={en_dataset}',
                f'--en_subset={en_subset}',
                f'--ru_dataset={ru_dataset}',
                f'--ru_subset={ru_subset}',
                f'--en_tokenizer={en_tokenizer}',
                f'--ru_tokenizer={ru_tokenizer}',
                f'--learning_rate={learning_rate}',
                f'--max_seq_length={max_seq_length}',
                f'--base_model={base_model}']
    
    # Add first_shard_only flag if enabled
    if first_shard_only:
        sys.argv.append('--first_shard_only')
    
    print(f"Running with command: {' '.join(sys.argv)}\n")
    
    # Run the experiment
    try:
        experiment3_main()
    finally:
        # Restore original sys.argv
        sys.argv = orig_argv

In [None]:
# Run Experiment 3
run_experiment3(
    data_limit=100,           # Use smaller dataset for faster execution
    epochs=1,                 # Just one epoch for demonstration
    batch_size=4,             # Small batch size for memory efficiency
    en_dataset="wikimedia/wikipedia",  # Source dataset for English
    ru_dataset="wikimedia/wikipedia",  # Source dataset for Russian
    en_tokenizer="gpt2",      # English tokenizer
    ru_tokenizer="ai-forever/ruGPT-3.5-13B",  # Russian tokenizer
    base_model="gpt2-medium",  # Base architecture; "gpt2" is a smaller, faster alternative
    learning_rate=1e-4,       # Slightly higher learning rate for from-scratch training
    max_seq_length=64,        # Shorter sequences for faster training
    first_shard_only=True     # Use only the first shard (train-00000-of-*) instead of counting examples
)

## Experiment 4: Context Sensitivity Analysis

This experiment analyzes how token probabilities shift based on language context. It specifically examines how the model learns to interpret token IDs differently depending on the language context.

We'll use models trained from scratch to evaluate this language-specific behavior.
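
The core measurement can be pictured as follows: take the model's next-token logits after an English context and after a Russian context, convert each to probabilities, and compare the probability assigned to the same token ID. The snippet below is a minimal sketch on made-up logits over a three-token vocabulary; the function names and numbers are illustrative assumptions and do not reproduce the experiment script.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probability_shift(logits_a, logits_b, token_id):
    """Change in probability of token_id when moving from context A to context B."""
    return softmax(logits_b)[token_id] - softmax(logits_a)[token_id]


# Made-up next-token logits for the same 3-token vocabulary in two contexts.
en_context_logits = [2.0, 0.0, 0.0]  # token 0 is likely after the English context
ru_context_logits = [0.0, 0.0, 2.0]  # token 2 is likely after the Russian context

shift = probability_shift(en_context_logits, ru_context_logits, token_id=0)
print(f"P(token 0) changes by {shift:+.3f} when switching from EN to RU context")
```

A large shift for a shared token ID suggests the model has learned to read that ID differently depending on the active language.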

In [None]:
# Set up function to run Experiment 4
from experiments.experiment4_context_sensitivity import main as experiment4_main

def run_experiment4(model_dir, tokenizer_dir, num_test_tokens=50, num_prompts=3, seed=42, output_dir="./experiment4_from_scratch_output"):
    # Save original sys.argv
    orig_argv = sys.argv.copy()
    
    # Set new sys.argv with from_scratch flag
    sys.argv = ['experiment4_context_sensitivity.py',
                '--from_scratch',
                f'--model_dir={model_dir}',
                f'--tokenizer_dir={tokenizer_dir}',
                f'--output_dir={output_dir}',
                f'--num_test_tokens={num_test_tokens}',
                f'--num_prompts={num_prompts}',
                f'--seed={seed}',
                f'--device={device}']
    
    print(f"Running with command: {' '.join(sys.argv)}\n")
    
    # Run the experiment
    try:
        experiment4_main()
    finally:
        # Restore original sys.argv
        sys.argv = orig_argv

In [None]:
# Run Experiment 4 (requires a trained model from previous experiments)
# For example, use the model trained in Experiment 1
model_dir = "./experiment1_from_scratch_output/switchable_model/final_model"
tokenizer_dir = "./experiment1_from_scratch_output/switchable_model/final_tokenizer"

run_experiment4(
    model_dir=model_dir,
    tokenizer_dir=tokenizer_dir,
    num_test_tokens=20,       # Use fewer tokens for faster analysis
    num_prompts=2,            # Test with fewer prompts per token
    seed=42
)

## Conclusion

This notebook demonstrates how to run the SwitchableTokenizer experiments with models trained from scratch instead of fine-tuning pre-trained models.

Training from scratch allows us to evaluate how well the switchable tokenizer approach works without relying on knowledge already embedded in pre-trained models. This is particularly important for:

1. Understanding the inherent capabilities of the switchable tokenizer architecture
2. Evaluating tokenization efficiency with a clean model
3. Comparing parameter efficiency without pre-training bias
4. Analyzing how models learn context-sensitive token interpretations from scratch

Note that these experiments can be computationally intensive. In Google Colab, we use reduced dataset sizes and fewer epochs so that the experiments complete within the available GPU time limits.

For even faster experimentation, the training runs above enable the `first_shard_only` option, which loads only the first shard (train-00000) of each dataset rather than counting examples. This significantly reduces data loading time while still providing consistent samples across experiments.