# Notebook 5: Continued Pretraining for New Language

This notebook demonstrates **continued pretraining** to teach an LLM a new language or domain-specific knowledge.

## Key Concepts

### What is Continued Pretraining?
**Continued pretraining** resumes the original pretraining objective on new data:
- Extends the model's knowledge to new languages/domains
- Uses next-token prediction (like original pretraining)
- Different from fine-tuning (which uses instruction/response pairs)
- Requires large amounts of text data

### Pretraining vs Fine-Tuning

**Pretraining:**
- Learns language patterns from raw text
- Predicts next token: "The cat sat on the ___" → "mat"
- Needs millions/billions of tokens
- Builds foundational understanding

**Fine-Tuning:**
- Teaches specific tasks (chat, QA, etc.)
- Uses structured instruction/response pairs
- Needs thousands of examples
- Adapts existing knowledge
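
To make the contrast concrete, here is a minimal sketch of what the two kinds of training data look like. It uses whole words as stand-in tokens (a real tokenizer produces subword IDs), and the helper name `next_token_pairs` and the example record are illustrative, not part of any library.

```python
def next_token_pairs(tokens):
    """Return every (context, target) pair used in next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Pretraining: raw text, every position is a training signal.
sentence = ["The", "cat", "sat", "on", "the", "mat"]
pairs = next_token_pairs(sentence)
for context, target in pairs:
    print(" ".join(context), "->", target)

# Fine-tuning: a structured record with an explicit instruction and response.
# (Hypothetical example record, shown only for comparison.)
instruction_example = {
    "instruction": "Translate to Spanish: The sun is shining.",
    "response": "El sol brilla.",
}
```

A six-token sentence already yields five prediction targets, which is why raw text is such a dense source of training signal.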

### When to Use Continued Pretraining?
- Teaching a new language (e.g., Spanish, Japanese)
- Domain adaptation (medical, legal, scientific text)
- Updating with new knowledge (recent events, new terminology)
- Low-resource languages

## Dataset Format

Just raw text in the target language/domain:

```
El sol brilla en el cielo. Los pájaros cantan en los árboles.
La gente camina por las calles de la ciudad.
```
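
If your corpus starts out as a plain text file, one way to get it into the shape this notebook expects is to wrap each passage in a record with a `text` field, matching the `text` column that the tokenization step reads later. The sketch below assumes passages are separated by blank lines; the helper name `to_records` is hypothetical.

```python
raw_corpus = """El sol brilla en el cielo. Los pájaros cantan en los árboles.

La gente camina por las calles de la ciudad."""

def to_records(corpus):
    """Split on blank lines and wrap each non-empty passage as {"text": ...}."""
    passages = [p.strip() for p in corpus.split("\n\n")]
    return [{"text": p} for p in passages if p]

records = to_records(raw_corpus)
for record in records:
    print(record)
```

A list like this can then be turned into a Hugging Face dataset, for example with `datasets.Dataset.from_list(records)`.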

## Video Recording Checklist
- [ ] Explain difference between pretraining and fine-tuning
- [ ] Show dataset preparation (monolingual corpus)
- [ ] Demonstrate perplexity evaluation
- [ ] Test model's language understanding before/after
- [ ] Show bilingual capabilities
- [ ] Discuss when to use this technique

## Step 1: Install Unsloth and Dependencies

In [None]:
%%capture
# Install Unsloth and dependencies
# Use colab-new for Google Colab, cu121-torch230 for Vertex AI Workbench
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

## Step 2: Import Libraries

In [None]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from unsloth import is_bfloat16_supported
import math

print("Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## Step 3: Load Base Model

We'll use a multilingual-capable model as our starting point.

In [None]:
max_seq_length = 2048
dtype = None
load_in_4bit = True

# Using SmolLM2 as base - it is trained mostly on English, so its Spanish has room to improve
# Alternative: "unsloth/Llama-3.2-3B" or "unsloth/Qwen2.5-1.5B"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/SmolLM2-360M",  # Slightly larger for better learning
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

print(f"Model loaded: {model.config._name_or_path}")
print(f"Vocab size: {len(tokenizer)}")
print("\nWe'll teach this model to better understand Spanish!")

## Step 4: Test Model's Spanish BEFORE Training

Let's see how well the base model understands Spanish.

In [None]:
FastLanguageModel.for_inference(model)

spanish_prompts = [
    "El sol brilla en",
    "Los niños juegan en",
    "Me gusta comer",
    "Buenos días, ¿cómo",
]

print("="*70)
print("MODEL'S SPANISH UNDERSTANDING BEFORE CONTINUED PRETRAINING")
print("="*70)

for prompt in spanish_prompts:
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,
        temperature=0.7,
        do_sample=True
    )
    
    completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"\nPrompt: {prompt}")
    print(f"Model: {completion}")
    print("-"*70)

print("\nNotice: The model may struggle with Spanish or switch to English\n")

# Back to training mode (undo Unsloth's inference-mode patches)
FastLanguageModel.for_training(model)

## Step 5: Add LoRA for Continued Pretraining

Even for pretraining, we can use LoRA for efficiency.
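
To see where the efficiency comes from, recall that LoRA leaves a weight matrix of shape `d_out x d_in` frozen and trains two small matrices of shapes `d_out x r` and `r x d_in` instead. The sketch below counts the trainable parameters for a single square projection. The layer size is a hypothetical round number chosen for easy arithmetic, not the actual dimension of the model used here.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer of shape d_out x d_in."""
    return r * (d_in + d_out)

hidden = 1024  # hypothetical hidden size, for illustration only
full = hidden * hidden  # parameters in one full square projection

for r in (8, 16, 32):
    added = lora_params(hidden, hidden, r)
    print(f"r={r:>2}: {added:,} trainable vs {full:,} frozen ({added / full:.2%})")
```

Even at rank 32, the adapter for this layer is a small fraction of the full matrix, which is what keeps memory use manageable.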

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,  # Higher rank for pretraining
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,
    lora_dropout = 0.0,  # No dropout for pretraining
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

print("LoRA adapters configured for continued pretraining!")
print("Note: Higher rank (r=32) for better language learning")

## Step 6: Load Spanish Text Dataset

We'll use Spanish Wikipedia as pretraining data. News articles or web corpora are possible alternatives.

In [None]:
# Load Spanish Wikipedia dataset
# Alternative: "MC4" (Spanish), "oscar-corpus/OSCAR-2301" (Spanish)
dataset = load_dataset(
    "wikimedia/wikipedia",
    "20231101.es",  # Spanish Wikipedia
    split="train",  # Parquet-backed dataset, so trust_remote_code is not needed
)

print(f"Dataset size: {len(dataset):,} articles")
print("\nFirst article (truncated):")
print(dataset[0]['text'][:500] + "...")

## Step 7: Prepare Dataset for Pretraining

For pretraining, we just need raw text tokenized.
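
The cell in this step truncates each article at `max_seq_length`, so the tail of long articles is dropped and short articles leave unused space in each sequence. A common alternative in pretraining pipelines is *packing*: concatenate all documents into one token stream, separated by an end-of-sequence token, and cut the stream into fixed-size blocks. The sketch below is plain Python over toy token IDs. `pack_into_blocks` is a hypothetical helper, not something this notebook or Unsloth provides.

```python
def pack_into_blocks(token_lists, block_size, eos_id):
    """Concatenate documents (EOS-separated) and split into full blocks."""
    stream = []
    for ids in token_lists:
        stream.extend(ids)
        stream.append(eos_id)  # mark the document boundary
    n_full = len(stream) // block_size  # drop the incomplete final block
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]

docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]  # toy token IDs
blocks = pack_into_blocks(docs, block_size=4, eos_id=0)
print(blocks)
```

With packing, every training sequence is completely full, which makes the tokens-per-step figure predictable.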

In [None]:
def tokenize_function(examples):
    """
    Tokenize text for causal language modeling.
    No special formatting needed - just raw text.
    """
    # Tokenize the text
    result = tokenizer(
        examples['text'],
        truncation=True,
        max_length=max_seq_length,
        padding=False,
        return_special_tokens_mask=True
    )
    return result

# Use a subset for training (for demo purposes)
# For real continued pretraining, use much more data!
train_dataset = dataset.select(range(10000))  # 10K articles

# Tokenize the dataset
tokenized_dataset = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=train_dataset.column_names,
    desc="Tokenizing dataset"
)

print(f"Tokenized {len(tokenized_dataset)} examples")
print(f"Average tokens per example: {sum(len(x['input_ids']) for x in tokenized_dataset) / len(tokenized_dataset):.0f}")

## Step 8: Create Data Collator for Language Modeling

This handles creating labels for next-token prediction.
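
In causal language modeling, the labels are simply a copy of the input IDs, with padding positions replaced by `-100` so the loss ignores them. The one-position shift between inputs and targets happens inside the model's forward pass. The sketch below imitates that behaviour in plain Python so the result is easy to inspect. `collate_causal` is an illustrative stand-in, not the library's implementation, and it assumes right-side padding.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def collate_causal(batch, pad_id):
    """Pad a batch of token-ID lists and build causal-LM labels."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    # Labels mirror the inputs; padding is masked out of the loss.
    labels = [[IGNORE_INDEX if tok == pad_id else tok for tok in row]
              for row in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

out = collate_causal([[11, 12, 13], [21, 22]], pad_id=0)
print(out["input_ids"])
print(out["labels"])
```

Note that masking is done by token value, so if the padding token is the same as the end-of-sequence token, genuine end-of-sequence positions are excluded from the loss too.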

In [None]:
# Data collator for causal language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # Causal LM, not masked LM
)

print("Data collator created for causal language modeling")
print("This will automatically create labels for next-token prediction")

## Step 9: Configure Training for Continued Pretraining

Pretraining uses different hyperparameters than fine-tuning.
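
Two of the settings in this step, `warmup_steps` and `lr_scheduler_type = "cosine"`, together determine how the learning rate changes during the run. The sketch below reproduces the usual shape, a linear warmup followed by cosine decay to zero, using this notebook's values (`3e-4`, 100 warmup steps, 1000 total steps). It is a plain-Python approximation for building intuition. The helper `lr_at_step` is hypothetical, and the scheduler in `transformers` may differ in small details.

```python
import math

def lr_at_step(step, base_lr=3e-4, warmup_steps=100, total_steps=1000):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 50, 100, 550, 1000):
    print(f"step {step:>4}: lr = {lr_at_step(step):.2e}")
```

The learning rate peaks right after warmup and is already at half its peak by the midpoint of the decay phase, so most of the learning happens early in the run.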

In [None]:
training_args = TrainingArguments(
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 4,
    warmup_steps = 100,
    num_train_epochs = 1,  # Ignored because max_steps is set below
    max_steps = 1000,  # For demo; use more for real pretraining
    learning_rate = 3e-4,  # Higher LR for pretraining
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    logging_steps = 50,
    optim = "adamw_8bit",
    weight_decay = 0.01,
    lr_scheduler_type = "cosine",
    seed = 3407,
    output_dir = "outputs/smollm2_spanish_pretrain",
    save_strategy = "steps",
    save_steps = 500,
    report_to = "none",
)

print("Training configuration:")
print(f"Learning rate: {training_args.learning_rate} (higher than fine-tuning!)")
print(f"Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"Total steps: {training_args.max_steps}")
print("\nNote: Real continued pretraining runs far longer (billions of tokens, not a 1,000-step demo)!")

## Step 10: Create Trainer and Start Pretraining

In [None]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_dataset,
    data_collator = data_collator,
)

print("Trainer created for continued pretraining!")
print("\nWhat happens during pretraining:")
print("1. Model sees Spanish text")
print("2. Tries to predict next token")
print("3. Gets feedback on prediction accuracy")
print("4. Updates weights to improve Spanish understanding")
print("5. Repeats for all text")

## Step 11: Train the Model

This will teach the model Spanish patterns and vocabulary.

In [None]:
import time

print("\n" + "="*50)
print("STARTING CONTINUED PRETRAINING")
print("Teaching Spanish to the model...")
print("="*50 + "\n")

start_time = time.time()

trainer_stats = trainer.train()

end_time = time.time()
training_time = end_time - start_time

print("\n" + "="*50)
print("CONTINUED PRETRAINING COMPLETE!")
print("="*50)
print(f"Training time: {training_time:.2f} seconds ({training_time/60:.2f} minutes)")
print(f"Average training loss: {trainer_stats.metrics['train_loss']:.4f}")
print(f"Training perplexity: {math.exp(trainer_stats.metrics['train_loss']):.2f}")

## Step 12: Test Model's Spanish AFTER Training

Let's see if Spanish improved!

In [None]:
FastLanguageModel.for_inference(model)

print("="*70)
print("MODEL'S SPANISH UNDERSTANDING AFTER CONTINUED PRETRAINING")
print("="*70)

for prompt in spanish_prompts:
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,
        temperature=0.7,
        do_sample=True
    )
    
    completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"\nPrompt: {prompt}")
    print(f"Model: {completion}")
    print("-"*70)

print("\nNotice: The model should now:")
print("- Generate more fluent Spanish")
print("- Use correct grammar and vocabulary")
print("- Stay in Spanish (not switch to English)")
print("- Complete sentences naturally")

## Step 13: Test Bilingual Capabilities

In [None]:
bilingual_tests = [
    ("English", "The weather today is"),
    ("Spanish", "El clima de hoy es"),
    ("English", "I like to eat"),
    ("Spanish", "Me gusta comer"),
    ("English", "The capital of France is"),
    ("Spanish", "La capital de Francia es"),
]

print("="*70)
print("BILINGUAL CAPABILITIES TEST")
print("="*70)

for lang, prompt in bilingual_tests:
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,
        temperature=0.7,
        do_sample=True  # temperature only takes effect when sampling is enabled
    )
    
    completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"\n[{lang}] {prompt}")
    print(f"→ {completion}")

print("\n" + "="*70)
print("The model should maintain both English AND Spanish!")
print("="*70)

## Step 14: Evaluate Perplexity

Perplexity measures how "surprised" the model is by the text. Lower is better.
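
Concretely, perplexity is the exponential of the average negative log-likelihood per token, which is why the evaluation code applies `math.exp` to the loss. The sketch below computes it from a list of per-token log-probabilities. The probabilities are made-up illustrative values, not model outputs.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the tokens."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that spreads probability evenly over 4 choices at every step
uniform_guess = [math.log(1 / 4)] * 10
# A model that assigns probability 0.9 to each correct token
confident = [math.log(0.9)] * 10

print(f"Uniform over 4 choices: {perplexity(uniform_guess):.2f}")
print(f"Confident model:        {perplexity(confident):.2f}")
```

A perplexity of 4 can be read as "the model is as uncertain as if it were choosing uniformly among 4 tokens at each step".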

In [None]:
# Load test set
# Held-out slice: articles 10,000-10,099 were not in the 10K training subset
test_dataset = load_dataset(
    "wikimedia/wikipedia",
    "20231101.es",
    split="train[10000:10100]",  # Small test set, disjoint from the training data
)

# Tokenize test set
test_tokenized = test_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=test_dataset.column_names
)

# Evaluate
eval_results = trainer.evaluate(test_tokenized)

perplexity = math.exp(eval_results['eval_loss'])

print("\n" + "="*70)
print("PERPLEXITY EVALUATION")
print("="*70)
print(f"Test Loss: {eval_results['eval_loss']:.4f}")
print(f"Perplexity: {perplexity:.2f}")
print("\nLower perplexity = better language understanding")
print("Rough guide: 10-50 (good), 50-100 (okay), >100 (needs more training); values vary with tokenizer and corpus")
print("="*70)

## Step 15: Save Model

In [None]:
# Save LoRA adapters
model.save_pretrained("smollm2_spanish_adapters")
tokenizer.save_pretrained("smollm2_spanish_adapters")

print("Model saved!")
print("\nThis model now has:")
print("- Improved Spanish understanding")
print("- Better Spanish generation")
print("- Maintained English capability")
print("- Bilingual abilities")

## Step 16: Merge and Export

In [None]:
# Merge LoRA weights into the base model and save in 16-bit
# (Unsloth helper; avoids merging directly into the 4-bit quantized weights)
model.save_pretrained_merged(
    "smollm2_spanish_merged",
    tokenizer,
    save_method = "merged_16bit",
)

# Export to GGUF (Unsloth merges the adapters before converting)
model.save_pretrained_gguf(
    "smollm2_spanish_gguf",
    tokenizer,
    quantization_method = "q4_k_m"
)

print("✓ Model merged and exported to GGUF format!")

## Summary

### What we accomplished:
1. Taught an English-centric model to understand Spanish
2. Used **continued pretraining** on Spanish Wikipedia
3. Checked that the model still handles both English and Spanish
4. Evaluated perplexity on Spanish text
5. Created a Spanish-capable model

### Key Differences from Fine-Tuning:

**Continued Pretraining:**
- Raw text, no structure
- Next-token prediction objective
- Higher learning rate (3e-4 vs 5e-5)
- Needs millions of tokens
- Builds foundational knowledge
- Use rank=32 or higher for LoRA

**Fine-Tuning:**
- Structured instruction/response pairs
- Teaches specific behaviors
- Lower learning rate
- Needs thousands of examples
- Adapts existing knowledge
- Use rank=8-16 for LoRA

### When to Use Continued Pretraining:

**Good Use Cases:**
- Teaching new languages
- Domain adaptation (medical, legal)
- Adding new knowledge (recent events)
- Low-resource language support
- Extending vocabulary

**Not Recommended For:**
- Teaching specific tasks → use fine-tuning
- Small datasets → use few-shot learning
- Time-sensitive projects → training takes a long time

### Data Requirements:
- **Minimum**: 100M tokens (~500K articles)
- **Good**: 1B+ tokens (millions of articles)
- **Production**: 10B+ tokens
- Quality > Quantity (clean, well-written text)
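
To relate these token budgets to training length, the sketch below estimates how many optimizer steps each budget takes with this notebook's batch settings (batch size 4, gradient accumulation 4, sequence length 2048). It assumes every sequence is completely full, so real runs with shorter sequences need more steps. `steps_for_tokens` is a hypothetical helper.

```python
import math

def steps_for_tokens(target_tokens, per_device_batch, grad_accum, seq_len, num_devices=1):
    """Optimizer steps needed to see target_tokens, assuming full sequences."""
    tokens_per_step = per_device_batch * grad_accum * seq_len * num_devices
    return math.ceil(target_tokens / tokens_per_step)

budgets = {"Minimum": 100_000_000, "Good": 1_000_000_000, "Production": 10_000_000_000}
for label, tokens in budgets.items():
    steps = steps_for_tokens(tokens, per_device_batch=4, grad_accum=4, seq_len=2048)
    print(f"{label:>10}: {tokens:>14,} tokens -> {steps:,} steps")
```

By the same arithmetic, the 1,000-step demo run sees at most about 33M tokens, well below the minimum budget listed above.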

### Training Tips:
1. Use larger batch sizes (16-32)
2. Train for multiple epochs
3. Monitor perplexity on validation set
4. Use cosine learning rate schedule
5. Save checkpoints regularly
6. Test on diverse examples

### Expected Results:
With proper training:
- Perplexity: 15-30 (excellent)
- Fluent text generation
- Correct grammar and spelling
- Natural language patterns

### Next Steps:
- Try other languages (French, German, Japanese)
- Domain-specific pretraining (medical, code)
- Combine with instruction fine-tuning
- Test on downstream tasks
- Create multilingual models

### Real-World Applications:
- Multilingual chatbots
- Translation systems
- Cross-lingual search
- Cultural content generation
- Language preservation (endangered languages)