# Colab 5: Continued Pretraining - Teaching New Knowledge

## Overview: What is Continued Pretraining?

### Pretraining vs Fine-tuning vs Continued Pretraining:

| Stage | Purpose | Data | Example |
|-------|---------|------|----------|
| **Pretraining** | Learn language | Raw text (TB) | "The cat sat on..." |
| **Continued Pretraining** | Learn new domain | Domain text (GB) | Medical papers, code |
| **Fine-tuning** | Learn task format | Instructions (MB) | Q&A pairs |

### What is Continued Pretraining?
Continued pretraining adapts a pretrained model to:
- 🌍 **New languages** (teach an English model Spanish)
- 📚 **New domains** (medical, legal, finance)
- 💻 **Specific knowledge** (your company's codebase)
- 📖 **New formats** (scientific papers, poetry)

### How It Works:
```
Base Model (General Knowledge)
        ↓
Continued Pretraining (Domain Text)
        ↓
Domain-Adapted Model
        ↓
Instruction Fine-tuning (Optional)
        ↓
Task-Specific Model
```

### Example Use Cases:
1. **Medical LLM**: Pretrain on PubMed papers
2. **Legal LLM**: Pretrain on case law
3. **Code LLM**: Pretrain on GitHub repos
4. **Finance LLM**: Pretrain on financial reports
5. **Multilingual**: Teach English model French

### Why Continued Pretraining?
- 📈 **Better domain understanding** than fine-tuning alone
- 🧠 **Learns domain vocabulary** and patterns
- 💪 **Stronger foundation** for downstream tasks
- 🎯 **Can reduce hallucinations** within the domain

In this notebook, we'll teach SmolLM2-135M a new "language" - in this case, specialized Python programming patterns.

In [1]:
%%capture
# Install dependencies (%%capture is a cell magic and must be the first line of the cell)
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install datasets trl

In [2]:
# Import libraries
from unsloth import FastLanguageModel
import torch
from datasets import Dataset
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForLanguageModeling

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## Step 1: Load Base Model

We start with a general-purpose model.

In [3]:
# Model configuration
max_seq_length = 2048
dtype = None
load_in_4bit = True

# Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/SmolLM2-135M",  # Base model (not instruct)
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print("✓ Base model loaded (not instruction-tuned)")
print("  This model has general knowledge but no domain specialization")

==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


✓ Base model loaded (not instruction-tuned)
  This model has general knowledge but no domain specialization


## Step 2: Configure for Continued Pretraining

### Key Differences from Fine-tuning:
- ✅ Use base model (not instruct version)
- ✅ Raw domain text (not Q&A format)
- ✅ Causal language modeling objective
- ✅ Longer training (more epochs)
- ✅ Lower learning rate
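
To make the causal language modeling objective from the list above concrete, here is a minimal sketch of how one token sequence becomes a set of next-token prediction targets. Whitespace splitting stands in for the real subword tokenizer, and `next_token_pairs` is a hypothetical helper for illustration, not part of Unsloth or TRL.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs for next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Toy "tokenizer": whitespace split (a real tokenizer uses subword units).
tokens = "def add ( a , b ) : return a + b".split()
pairs = next_token_pairs(tokens)

# Every position except the first serves as a training target.
for context, target in pairs[:3]:
    print(f"{' '.join(context)!r} -> {target!r}")
```

Because every position is a target, raw text needs no question/answer structure: the text itself supplies the labels.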

In [4]:
# Configure LoRA for efficient continued pretraining
model = FastLanguageModel.get_peft_model(
    model,
    r=32,  # Higher rank for continued pretraining
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

print("✓ LoRA configured with higher rank (r=32) for continued pretraining")

Unsloth 2025.11.2 patched 30 layers with 30 QKV layers, 30 O layers and 30 MLP layers.


✓ LoRA configured with higher rank (r=32) for continued pretraining
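
As a rough check on what `r=32` costs, the sketch below counts the parameters LoRA adds for the seven target modules. The layer shapes are assumptions about SmolLM2-135M that are not stated in this notebook (hidden size 576, MLP size 1536, key/value width 192, 30 layers); confirm them against `model.config` before relying on the numbers.

```python
# Assumed SmolLM2-135M shapes (not stated in this notebook; check model.config).
HIDDEN, MLP, KV, LAYERS = 576, 1536, 192, 30
RANK = 32

# (in_features, out_features) for each LoRA target module in one layer.
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) beside each frozen weight."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in module_shapes.values())
total = per_layer * LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the total is 9,768,960, which agrees with the "Trainable parameters" figure in the Step 5 training log. The count scales linearly with the rank, so halving `r` would halve it.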


## Step 3: Create Domain Text Dataset

### Dataset Format for Continued Pretraining:
Unlike instruction fine-tuning, we use **raw domain text**:

```python
# Fine-tuning (Q&A format):
"Q: How to sort a list? A: Use list.sort()"

# Continued Pretraining (raw text):
"def sort_items(items):
    return sorted(items, key=lambda x: x.value)"
```

We'll create a small corpus of Python code to teach the model coding patterns.

In [5]:
# Create domain corpus (Python code examples)
# In practice, this would be 100MB-10GB+ of domain text

python_corpus = [
    # Data structures
    """
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

class LinkedList:
    def __init__(self):
        self.head = None

    def append(self, value):
        new_node = Node(value)
        if not self.head:
            self.head = new_node
            return
        current = self.head
        while current.next:
            current = current.next
        current.next = new_node
    """,

    # Algorithms
    """
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1
    """,

    # Object-oriented patterns
    """
from abc import ABC, abstractmethod

class Animal(ABC):
    def __init__(self, name):
        self.name = name

    @abstractmethod
    def make_sound(self):
        pass

class Dog(Animal):
    def make_sound(self):
        return "Woof!"

class Cat(Animal):
    def make_sound(self):
        return "Meow!"
    """,

    # Decorators
    """
import time
from functools import wraps

def timer(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"{func.__name__} took {end-start:.2f}s")
        return result
    return wrapper

@timer
def slow_function():
    time.sleep(1)
    return "Done"
    """,

    # Context managers
    """
class FileManager:
    def __init__(self, filename, mode):
        self.filename = filename
        self.mode = mode
        self.file = None

    def __enter__(self):
        self.file = open(self.filename, self.mode)
        return self.file

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.file:
            self.file.close()

with FileManager('test.txt', 'w') as f:
    f.write('Hello, World!')
    """,

    # Async programming
    """
import asyncio

async def fetch_data(url):
    await asyncio.sleep(1)  # Simulate API call
    return f"Data from {url}"

async def main():
    tasks = [
        fetch_data('url1'),
        fetch_data('url2'),
        fetch_data('url3'),
    ]
    results = await asyncio.gather(*tasks)
    return results

asyncio.run(main())
    """,
]

# Create dataset
dataset = Dataset.from_dict({"text": python_corpus})

print(f"✓ Created domain corpus with {len(dataset)} code examples")
print(f"  Total characters: {sum(len(text) for text in python_corpus):,}")
print("\nExample (first 200 chars):")
print(python_corpus[0][:200] + "...")

✓ Created domain corpus with 6 code examples
  Total characters: 2,183

Example (first 200 chars):

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

class LinkedList:
    def __init__(self):
        self.head = None
    
    def append(self, value):
  ...
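
Each corpus string above starts with a newline and ends with an indented blank line, and nothing marks where one document stops. A common convention for raw-text training is to strip that padding and append an end-of-sequence token. The sketch below shows the idea; `prepare_document` is a hypothetical helper, and the EOS string is a placeholder for `tokenizer.eos_token`. Some trainer versions append EOS on their own, so check before doing it twice.

```python
import textwrap

def prepare_document(text, eos_token):
    """Strip surrounding blank lines and mark the document end with EOS."""
    cleaned = textwrap.dedent(text).strip()
    return cleaned + eos_token

# Placeholder value; in the notebook, use tokenizer.eos_token instead.
EOS = "<|endoftext|>"

raw = """
def add(a, b):
    return a + b
    """
print(repr(prepare_document(raw, EOS)))
```

Applied to the notebook's data, this would be `[prepare_document(t, tokenizer.eos_token) for t in python_corpus]` before building the `Dataset`.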


## Step 4: Configure Continued Pretraining

### Key Training Settings:
- **Objective**: Causal language modeling (predict next token)
- **Learning Rate**: Lower than fine-tuning (1e-5 vs 2e-4)
- **Epochs**: More than fine-tuning (3-10 vs 1-3)
- **Data**: Raw text (not formatted Q&A)

In [6]:
# Configure trainer for continued pretraining
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,  # Don't pack sequences
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=3,  # Ignored in this demo: max_steps below takes precedence
        max_steps=20,  # Quick demo cap; remove it to train for num_train_epochs
        learning_rate=1e-5,  # Lower LR for continued pretraining
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",  # Cosine schedule for pretraining
        seed=3407,
        output_dir="outputs_continued_pretrain",
    ),
)

print("\n" + "="*60)
print("Continued Pretraining Configuration")
print("="*60)
print(f"Learning rate: 1e-5 (lower than fine-tuning)")
print(f"Epochs: 3 (more than fine-tuning)")
print(f"Objective: Causal language modeling")
print(f"Data: Raw Python code")
print("="*60)


Continued Pretraining Configuration
Learning rate: 1e-5 (lower than fine-tuning)
Epochs: 3 (more than fine-tuning)
Objective: Causal language modeling
Data: Raw Python code
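
The configuration sets both `num_train_epochs=3` and `max_steps=20`. In `TrainingArguments`, a positive `max_steps` overrides the epoch count, so the number of passes over the data follows from the batch arithmetic. The sketch below works it out for this notebook's values; the rounding is a simplification of what the trainer does internally.

```python
import math

num_examples = 6
per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1
max_steps = 20

# One optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum_steps * num_gpus

# With fewer examples than one effective batch, an epoch is still one step.
steps_per_epoch = max(1, math.ceil(num_examples / effective_batch))
epochs_run = math.ceil(max_steps / steps_per_epoch)

print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps per epoch: {steps_per_epoch}")
print(f"Epochs implied by max_steps={max_steps}: {epochs_run}")
```

This gives 20 passes over the six examples, which matches the "Num Epochs = 20" line in the Step 5 training log rather than the 3 epochs printed in the configuration summary.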


## Step 5: Start Continued Pretraining

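Given enough domain text, the model can pick up:
- Python syntax patterns
- Common code structures
- Programming idioms
- API usage patterns

With only six short snippets and 20 optimizer steps at a learning rate of 1e-5, expect the loss to move only slightly in this demo.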

In [7]:
# Start continued pretraining
print("\n📚 Starting Continued Pretraining...\n")
print("The model will learn Python programming patterns from raw code.\n")

trainer_stats = trainer.train()

print("\n" + "="*60)
print("✓ Continued Pretraining Completed!")
print("="*60)
# Note: training_loss is the average loss over all steps, not the last step's loss
print(f"Final loss: {trainer_stats.training_loss:.4f}")
print(f"Steps: {trainer_stats.global_step}")
print("\nThe model now has domain-specific knowledge!")
print("="*60)

The model is already on multiple devices. Skipping the move to device specified in `args`.



📚 Starting Continued Pretraining...

The model will learn Python programming patterns from raw code.



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 6 | Num Epochs = 20 | Total steps = 20
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 9,768,960 of 144,284,544 (6.77% trained)


| Step | Training Loss |
|------|---------------|
| 1 | 0.9681 |
| 2 | 0.9681 |
| 3 | 0.9675 |
| 4 | 0.9674 |
| 5 | 0.9667 |
| 6 | 0.9655 |
| 7 | 0.964 |
| 8 | 0.9615 |
| 9 | 0.9589 |
| 10 | 0.9564 |



✓ Continued Pretraining Completed!
Final loss: 0.9559
Steps: 20

The model now has domain-specific knowledge!


## Step 6: Test Domain Knowledge

Let's test if the model learned Python patterns.

In [8]:
# Enable inference
FastLanguageModel.for_inference(model)

# Test prompts (code completion)
test_prompts = [
    "def bubble_sort(arr):\n    n = len(arr)\n    for i in range(n):\n",
    "class Stack:\n    def __init__(self):\n        self.items = []\n    \n    def push(self, item):\n",
    "async def fetch_user(user_id):\n    ",
    "from functools import wraps\n\ndef cache(func):\n    ",
]

print("Testing Domain-Adapted Model (Code Completion):\n")
print("="*60)

for prompt in test_prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.3,  # Lower temp for code
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

    completion = tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(f"\n💻 Prompt:\n{prompt}")
    print(f"\n✨ Completion:\n{completion[len(prompt):]}")
    print("-"*60)

Testing Domain-Adapted Model (Code Completion):


💻 Prompt:
def bubble_sort(arr):
    n = len(arr)
    for i in range(n):


✨ Completion:

        bubble_sort(arr[:i])
        bubble_sort(arr[i:])

def bubble_sort(arr):
    for i in range(len(arr)):
        bubble_sort(arr[:i])
        bubble_sort(arr[i:])

def bubble_sort(arr):
    for i in range(len(arr)):
        bubble_sort(arr[:i])
        bubble_sort(arr[i:])

def bubble_sort(arr
------------------------------------------------------------

💻 Prompt:
class Stack:
    def __init__(self):
        self.items = []
    
    def push(self, item):


✨ Completion:
	self.items.append(item)
    
    def pop(self):
        return self.items.pop()
    
    def size(self):
        return len(self.items)
    
    def __iter__(self):
        return iter(self.items)
    
    def __iter_backward(self):
        return iter(self.items)
    
    def __iterbackward_backward(self):
        return iter(self.items)
    
    def __iterbackward_
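
The generation call above uses `temperature=0.3`. As a minimal sketch of what that setting does, the code below applies a temperature-scaled softmax to a few made-up logits; the values are hypothetical and chosen only to show the effect.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

for t in (1.0, 0.3):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

A lower temperature concentrates probability on the top token. That suits code, where there is often one correct continuation, but it also makes a small model more prone to the repetitive loops visible in the completions above.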

## Step 7: Compare Before vs After

Let's load the original base model and compare.

In [9]:
# Load original base model for comparison
base_model, base_tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/SmolLM2-135M",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
FastLanguageModel.for_inference(base_model)

print("✓ Original base model loaded for comparison\n")

# Test same prompt on both models
test_prompt = "class BinaryTree:\n    def __init__(self, value):\n"

print("="*60)
print("COMPARISON: Base Model vs Domain-Adapted Model")
print("="*60)
print(f"\nPrompt:\n{test_prompt}")

# Original model
inputs = base_tokenizer(test_prompt, return_tensors="pt").to("cuda")
outputs = base_model.generate(
    **inputs,
    max_new_tokens=80,
    temperature=0.3,
    do_sample=True,
    pad_token_id=base_tokenizer.eos_token_id,
)
base_completion = base_tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"\n❌ Original Base Model:\n{base_completion[len(test_prompt):]}")

# Domain-adapted model
inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=80,
    temperature=0.3,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
adapted_completion = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"\n✅ Domain-Adapted Model:\n{adapted_completion[len(test_prompt):]}")
print("\n" + "="*60)

==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
✓ Original base model loaded for comparison

COMPARISON: Base Model vs Domain-Adapted Model

Prompt:
class BinaryTree:
    def __init__(self, value):


❌ Original Base Model:
:param value:
        The value of the binary tree.
        """
    self.value = value

    self.left = None
    self.right = None

    def __repr__(self):
        return repr(self)

    def __str__(self):
        return repr(self)

    def __lt__(self, other):
        """
        :type other: BinaryTree:
            """
       

✅ Domain-Adapted Model:
:param

## Step 8: Save Domain-Adapted Model

In [10]:
# Save domain-adapted model
model.save_pretrained("smollm2_135m_python_adapted")
tokenizer.save_pretrained("smollm2_135m_python_adapted")

print("✓ Domain-adapted model saved to: smollm2_135m_python_adapted/")
print("\nThis model now has:")
print("  - Python programming knowledge")
print("  - Code pattern understanding")
print("  - Domain-specific vocabulary")
print("\nNext steps:")
print("  1. Optional: Instruction fine-tune for Q&A")
print("  2. Optional: DPO for code quality")
print("  3. Deploy for code completion")

✓ Domain-adapted model saved to: smollm2_135m_python_adapted/

This model now has:
  - Python programming knowledge
  - Code pattern understanding
  - Domain-specific vocabulary

Next steps:
  1. Optional: Instruction fine-tune for Q&A
  2. Optional: DPO for code quality
  3. Deploy for code completion


## Real-World Continued Pretraining Examples

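### 1. BloombergGPT (Finance)
- **Model**: 50B parameters, trained from scratch rather than continued from an existing base
- **Corpus**: 363B tokens of financial data, mixed with a comparable amount of general-purpose text
- **Sources**: Bloomberg's archive of financial news, filings, and reports
- **Result**: Strong results on financial NLP benchmarks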

### 2. BioGPT (Medicine)
- **Base**: GPT-2 architecture, pretrained from scratch on in-domain text
- **Corpus**: 15M PubMed abstracts
- **Domain**: Biomedical literature
- **Result**: Reported state-of-the-art results on biomedical QA at release

### 3. CodeLlama (Programming)
- **Base**: Llama 2
- **Corpus**: 500B tokens, predominantly code
- **Sources**: Mostly publicly available code, plus some code-related natural language
- **Result**: Specialized coding assistant

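### 4. Med-PaLM (Healthcare)
- **Base**: Flan-PaLM (instruction-tuned PaLM)
- **Adaptation**: Instruction prompt tuning on medical examples rather than large-scale continued pretraining
- **Result**: Strong performance on medical question-answering benchmarks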

### 5. LegalBERT (Law)
- **Base**: BERT
- **Corpus**: 12GB legal documents
- **Result**: Legal document understanding

## Summary: Continued Pretraining

### What We Did:
1. ✅ Loaded base model (not instruct)
2. ✅ Created domain corpus (Python code)
3. ✅ Configured continued pretraining
4. ✅ Trained on raw domain text
5. ✅ Tested domain knowledge
6. ✅ Compared before/after

### Training Pipeline:
```
Base Model (General)
        ↓
Continued Pretraining (Domain Text)
        ↓
Domain-Adapted Model
        ↓ (optional)
Instruction Fine-tuning
        ↓
Task-Specific Model
        ↓ (optional)
DPO/RLHF Alignment
        ↓
Production Model
```

### Key Differences:

| Aspect | Fine-tuning | Continued Pretraining |
|--------|-------------|----------------------|
| Model | Instruct version | Base version |
| Data | Q&A pairs (MB) | Raw text (GB) |
| Format | Structured | Unstructured |
| Goal | Task format | Domain knowledge |
| LR | Higher (2e-4) | Lower (1e-5) |
| Epochs | Fewer (1-3) | More (3-10) |
| Examples | Thousands | Millions |

### Training Settings:
- **Learning Rate**: 1e-5 to 5e-6 (lower than fine-tuning)
- **Batch Size**: Larger (to see more data)
- **Epochs**: 3-10+ (multiple passes)
- **Schedule**: Cosine decay
- **LoRA Rank**: 32-64 (higher than fine-tuning)
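
To show what the cosine schedule with warmup does to the learning rate, here is a small sketch using the values from Step 4 (`learning_rate=1e-5`, `warmup_steps=5`, 20 total steps). It reproduces the usual shape of the schedule and is not the library's own code; `lr_at_step` is a hypothetical helper.

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Values from the TrainingArguments in Step 4.
BASE_LR, WARMUP, TOTAL = 1e-5, 5, 20

for step in (0, 1, 5, 12, 20):
    print(f"step {step:2d}: lr = {lr_at_step(step, BASE_LR, WARMUP, TOTAL):.2e}")
```

With a 20-step run, a quarter of the steps are spent warming up and the rate starts decaying immediately afterwards, which is one reason the demo's loss changes so little.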

### Data Requirements:
- **Minimum**: 10MB domain text
- **Good**: 100MB-1GB
- **Optimal**: 10GB+
- **Production**: 100GB+ (like CodeLlama)

### Domain Examples:

#### 🏥 Medical
```python
corpus = [
    "PubMed abstracts",
    "Medical textbooks",
    "Clinical notes",
    "Drug databases",
]
```

#### ⚖️ Legal
```python
corpus = [
    "Case law",
    "Legal contracts",
    "Statutes",
    "Court opinions",
]
```

#### 💰 Finance
```python
corpus = [
    "Financial reports",
    "Market analysis",
    "SEC filings",
    "Economic papers",
]
```

#### 💻 Code
```python
corpus = [
    "GitHub repositories",
    "Documentation",
    "Stack Overflow",
    "API examples",
]
```
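
The lists above name kinds of sources, not actual text. In a real run each entry would be the content of a document. A minimal sketch of reading a folder of text files into that shape follows; `load_corpus`, the `*.txt` pattern, and the demo file names are all hypothetical.

```python
import tempfile
from pathlib import Path

def load_corpus(root, pattern="*.txt", min_chars=1):
    """Read every matching file under root into a list of documents."""
    docs = []
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="ignore").strip()
        if len(text) >= min_chars:  # skip files that are empty after stripping
            docs.append(text)
    return docs

# Demo on a throwaway directory; point `root` at your own corpus instead.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "note1.txt").write_text("Patient presents with fever.\n", encoding="utf-8")
    Path(tmp, "empty.txt").write_text("   \n", encoding="utf-8")
    docs = load_corpus(tmp)

print(f"Loaded {len(docs)} document(s)")
```

The resulting list can be passed to `Dataset.from_dict({"text": docs})` exactly as in Step 3.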

### When to Use Continued Pretraining:
- 📚 **Large domain corpus** available (GB+ of text)
- 🎯 **Domain-specific vocabulary** needed
- 💪 **Deep domain knowledge** required
- 🌍 **New language** or format
- 📊 **Domain is very different** from base training

### When NOT to Use:
- ⚡ **Small data** (< 10MB) → Use fine-tuning
- 🎯 **Task-specific** only → Use fine-tuning
- ⏰ **Limited time** → Skip to fine-tuning
- 💰 **Cost-sensitive** → Fine-tuning is cheaper

### Best Practices:
1. **Start with base model** (not instruct)
2. **Clean your corpus** (remove noise)
3. **Use lower learning rate** (1e-5)
4. **Train for multiple epochs** (3-10)
5. **Monitor perplexity** on domain data
6. **Fine-tune after** for specific tasks
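
For the "clean your corpus" step, two cheap filters are dropping near-empty documents and removing exact duplicates. The sketch below shows one way to do both. `clean_corpus` and the `min_chars` threshold are hypothetical, and real pipelines usually add near-duplicate detection and quality filters on top.

```python
import hashlib

def clean_corpus(docs, min_chars=20):
    """Drop very short documents and duplicates that differ only in whitespace."""
    seen = set()
    kept = []
    for doc in docs:
        normalized = " ".join(doc.split())  # collapse all whitespace runs
        if len(normalized) < min_chars:
            continue
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)  # keep the original text, not the normalized form
    return kept

docs = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n        return a + b",  # same code, different indentation
    "x = 1",                                  # too short to be useful
]
print(len(clean_corpus(docs)))
```

Collapsing whitespace before hashing is a deliberate simplification: it catches reformatted copies, but for Python it can also merge snippets whose indentation differs meaningfully.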

### Measuring Success:
- 📉 **Perplexity**: Lower on domain test set
- 📝 **Completion Quality**: Better domain completions
- 🎯 **Downstream Tasks**: Better on domain benchmarks
- 🧠 **Knowledge Probes**: Answers domain questions
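
Perplexity is the exponential of the mean per-token negative log-likelihood, so it can be read straight off a cross-entropy loss. The sketch below defines it and applies it to two values from the Step 5 log. Those are training losses, so this only illustrates the conversion; a real evaluation would use held-out domain text.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that gives every token probability 0.5 has perplexity 2.
uniform_coin = [math.log(0.5)] * 4
print(f"toy perplexity: {perplexity(uniform_coin):.3f}")

# The trainer's loss is already a mean NLL, so exp(loss) is the perplexity.
step_1_loss, step_10_loss = 0.9681, 0.9564
print(f"perplexity at step 1:  {math.exp(step_1_loss):.3f}")
print(f"perplexity at step 10: {math.exp(step_10_loss):.3f}")
```

Comparing the base and adapted models on the same held-out set with this measure is a more reliable test than judging a single sampled completion.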

### Next Steps:
1. Collect larger domain corpus (1GB+)
2. Train for more steps (drop the `max_steps=20` demo cap)
3. Instruction fine-tune on domain Q&A
4. Evaluate on domain benchmarks
5. Deploy for domain tasks

Continued pretraining is how specialized models like CodeLlama are built; others, such as BloombergGPT, were instead trained from scratch on a domain-heavy data mix. 🚀