<a href="https://colab.research.google.com/github/linhkid/NeuroPurrfectAI-labs/blob/main/notebooks/continual_pretraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Continual Pretraining in Large Language Models - Concepts and Experiments

This notebook demonstrates the practical implementation of continual pretraining (CPT)
for language models. CPT enables models to be continuously updated with new knowledge
without starting from scratch, addressing the challenge of static knowledge in LLMs.

## What is Continual Pretraining?

Continual pretraining allows language models to:
- Adapt to new domains and data distributions
- Incorporate fresh knowledge over time
- Retain previously learned information (mitigating catastrophic forgetting)
- Update efficiently without complete retraining

There are two primary types of continual pretraining:
1. **Continual general pre-training**: Updating the LLM with new data similar to original pre-training data
2. **Continual domain-adaptive pre-training (DAP-training)**: Adapting the LLM to new domains

In this notebook, we implement domain-adaptive continual pretraining using Parameter Isolation
methods, specifically LoRA (Low-Rank Adaptation), to efficiently adapt a pretrained model
to the cybersecurity domain.

### Key Benefits of Continual Pretraining:
- Better adaptation to domain-specific data
- Cost and computational efficiency compared to full retraining
- Reduced catastrophic forgetting through specialized techniques
- Improved generalization to new, related tasks

# Environment Setup

## Setting Up the Environment

First, we'll install the necessary libraries and authenticate with the Hugging Face Hub.

In [1]:
# Install necessary libraries
!pip install "unsloth[kaggle-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install triton==3.1.0


Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[kaggle-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-nu2h_n21/unsloth_868f629bb7a141f69c633b723236e0c2
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-nu2h_n21/unsloth_868f629bb7a141f69c633b723236e0c2
  Resolved https://github.com/unslothai/unsloth.git to commit bb112e38ef3f0dafa9e87faf55a6ba7499bd0357
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting bitsandbytes>=0.43.3 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[kaggle-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading bitsandbytes-0.45.4-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting unsloth_zoo>=2025.3.17 (from unsloth@ git+http

Collecting triton==3.1.0
  Downloading triton-3.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.3 kB)
Downloading triton-3.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (209.5 MB)
Installing collected packages: triton
  Attempting uninstall: triton
    Found existing installation: triton 3.2.0
    Uninstalling triton-3.2.0:
      Successfully uninstalled triton-3.2.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torch 2.6.0+cu124 requires triton==3.2.0; platform_system == "Linux" and platform_machine == "x86_64", but you have triton 3.1.0 which is incompatible.
Successfully installed triton-3.1.0


In [2]:
import os
from google.colab import userdata
# Set up Hugging Face access token for downloading models
os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')
!huggingface-cli login --token $HF_TOKEN

# Verify the authenticated user
hf_user = !huggingface-cli whoami
hf_user = hf_user[0]
print(f"Authenticated as: {hf_user}")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
The token `vietllm` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
Authenticated as: linhkid91


# Loading The Model

## Base Model Selection and Loading

We'll use Gemma-3-1B with 4-bit quantization as our base model.
This demonstrates the computational efficiency advantage of continual pretraining:
- We start with a pretrained foundation model
- We'll use parameter-efficient techniques (LoRA) to adapt it
- 4-bit quantization reduces memory usage, usually at a small cost in output quality
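
As a rough, back-of-the-envelope illustration of why 4-bit loading helps, the sketch below estimates weight storage at different precisions. The helper `estimate_weight_gb` and the nominal 1-billion parameter count are assumptions made for illustration. Real checkpoints also store quantization constants and usually keep some layers (such as embeddings) in higher precision, so actual sizes are larger.

```python
def estimate_weight_gb(n_params: int, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (ignores all overhead)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 1_000_000_000  # nominal size of a "1B" model (assumption)

# Compare common storage precisions for the same parameter count
for label, bits in [("float32", 32), ("bfloat16/float16", 16), ("4-bit", 4)]:
    print(f"{label:>17}: ~{estimate_weight_gb(N_PARAMS, bits):.2f} GB")
```

The estimate covers weights only; activations, optimizer state, and the LoRA adapters add to the footprint during training.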

In [3]:
from unsloth import FastLanguageModel

# Model configuration
max_seq_length = 1024  # Maximum sequence length for training
dtype = None  # Auto-detect precision (Float16 for Tesla T4/V100, Bfloat16 for Ampere+)
load_in_4bit = True  # Use 4-bit quantization to reduce memory requirements

# Load the pretrained model
model_name = "gemma-3-1b-pt"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = f"unsloth/{model_name}-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Enable faster inference
FastLanguageModel.for_inference(model)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.19: Fast Gemma3 patching. Transformers: 4.50.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



Gemma3ForCausalLM(
  (model): Gemma3TextModel(
    (embed_tokens): Gemma3TextScaledWordEmbedding(262144, 1152, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x Gemma3DecoderLayer(
        (self_attn): Gemma3Attention(
          (q_proj): Linear4bit(in_features=1152, out_features=1024, bias=False)
          (k_proj): Linear4bit(in_features=1152, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=1152, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=1024, out_features=1152, bias=False)
          (q_norm): Gemma3RMSNorm((256,), eps=1e-06)
          (k_norm): Gemma3RMSNorm((256,), eps=1e-06)
        )
        (mlp): Gemma3MLP(
          (gate_proj): Linear4bit(in_features=1152, out_features=6912, bias=False)
          (up_proj): Linear4bit(in_features=1152, out_features=6912, bias=False)
          (down_proj): Linear4bit(in_features=6912, out_features=1152, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_l

# Parameter Isolation With LoRA Adapters

## Implementing Parameter Isolation with LoRA

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that:
- Adds small, trainable rank decomposition matrices to existing weights
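- Updates only a small fraction of the parameters (often 1-10%; the configuration below, with rank 128 and the embedding layers included, trains 13.81%)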
- Preserves the original model's knowledge while learning new information

This is an example of a Parameter Isolation method for continual learning, which:
- Allocates separate parameters for new knowledge
- Mitigates catastrophic forgetting by not directly modifying original weights
- Enables efficient adaptation with minimal computational resources

For continual pretraining, we include embed_tokens and lm_head in the target modules to better learn out-of-distribution data from the new domain.
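
For reference, the low-rank update described above can be written as follows. This is the standard LoRA formulation rather than anything specific to this notebook; the symbols $W_0$, $A$, $B$, $r$, $\alpha$, and $s$ follow common usage in the LoRA literature.

```latex
h = W_0 x + s \, B A x, \qquad
B \in \mathbb{R}^{d_{\text{out}} \times r}, \quad
A \in \mathbb{R}^{r \times d_{\text{in}}}, \quad
r \ll \min(d_{\text{in}}, d_{\text{out}})

s = \frac{\alpha}{r} \quad \text{(standard LoRA)}, \qquad
s = \frac{\alpha}{\sqrt{r}} \quad \text{(rank-stabilized LoRA)}
```

Here $W_0$ stays frozen and only $A$ and $B$ are trained, so each adapted layer adds $r\,(d_{\text{in}} + d_{\text{out}})$ trainable parameters. $B$ is typically initialized to zero, which makes the adapted model identical to the base model at the start of training.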

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128,  # Rank of the update matrices. Higher rank = more capacity but more parameters
    target_modules = [
        # Attention modules
        "q_proj", "k_proj", "v_proj", "o_proj",
        # MLP modules
        "gate_proj", "up_proj", "down_proj",
        # Token embedding and output head - critical for continual pretraining
        "embed_tokens", "lm_head",
    ],
    lora_alpha = 32,  # Scaling factor for the LoRA updates
    lora_dropout = 0,  # Dropout probability for LoRA layers (0 is optimized)
    bias = "none",  # No bias parameters are trained
    use_gradient_checkpointing = "unsloth",  # "unsloth" uses 30% less VRAM than standard gradient checkpointing
    random_state = 3407,  # For reproducibility
    use_rslora = True,  # Rank-stabilized LoRA for better training stability
    loftq_config = None,  # No LoftQ quantization
)



Unsloth: Making `model.base_model.model.model.embed_tokens` require gradients
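
To make the adapter size concrete, the sketch below counts LoRA parameters as `rank * (in_features + out_features)` per adapted layer, using the layer shapes from the printed Gemma3 architecture above. It assumes the embedding receives a rank-128 adapter and that `lm_head` adds no separate parameters; that is an inference from the numbers, not something the logs state. The helper `lora_params` is hypothetical.

```python
import math

RANK = 128
LORA_ALPHA = 32
NUM_LAYERS = 26

# (in_features, out_features) copied from the printed model architecture
LAYER_SHAPES = {
    "q_proj": (1152, 1024),
    "k_proj": (1152, 256),
    "v_proj": (1152, 256),
    "o_proj": (1024, 1152),
    "gate_proj": (1152, 6912),
    "up_proj": (1152, 6912),
    "down_proj": (6912, 1152),
}
EMBED_SHAPE = (262144, 1152)  # vocabulary size x hidden size

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in LAYER_SHAPES.values())
embed = lora_params(*EMBED_SHAPE, RANK)  # assumed rank-128 adapter on the embedding
total = NUM_LAYERS * per_layer + embed

# Scaling applied to the low-rank update
standard_scale = LORA_ALPHA / RANK           # alpha / r
rslora_scale = LORA_ALPHA / math.sqrt(RANK)  # alpha / sqrt(r), used when use_rslora=True

print(f"Trainable LoRA parameters: {total:,}")
print(f"Standard scale: {standard_scale}, rsLoRA scale: {rslora_scale:.3f}")
```

Under these assumptions the total is 138,067,968, which matches the trainable-parameter count reported in the training log later in the notebook. With `r = 128` and `lora_alpha = 32`, rank-stabilized scaling gives roughly 2.83 instead of the 0.25 that standard LoRA scaling would give.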


# Initial Model Assessment

## Evaluating the Model Before Continual Pretraining

Let's test the model's response to a cybersecurity-related question before
we perform continual pretraining. This will help us compare responses before
and after adaptation to the cybersecurity domain.

In [5]:
from transformers import TextIteratorStreamer

# Create a streamer for generation. Note: TextIteratorStreamer only queues the decoded
# text; the cells below print the final output with tokenizer.batch_decode instead.
text_streamer = TextIteratorStreamer(tokenizer)

# Prepare the input query about cybersecurity
inputs = tokenizer(
[
    """
    what is penetration testing?
"""
], return_tensors = "pt").to("cuda")

In [6]:
# Generate a response
print("======== RESPONSE BEFORE CONTINUAL PRETRAINING ========")
outputs = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)
print(tokenizer.batch_decode(outputs)[0])
print("======================================================")

<bos>
    what is penetration testing?
    penetration testing is a process that is used to test the security of a system.
    penetration testing is used to test the security of a system.
    penetration testing is used to test the security of a system.
    penetration testing is used to test the security of a system.



# Data Preparation

## Preparing the Cybersecurity Training Data

For continual pretraining, we'll use a cybersecurity-focused dataset.
This demonstrates the concept of Continual Domain-Adaptive Pretraining (DAP-training),
where we adapt the model to a specific new domain while retaining its general capabilities.

Key points in data preparation:
1. Format the text data with appropriate special tokens
2. Subsample a small training split from the full dataset
3. Ensure data quality for effective adaptation

In [7]:
from datasets import load_dataset

# Get the special end-of-sequence token from the tokenizer
EOS_TOKEN = tokenizer.eos_token
print(f"End of sequence token: {EOS_TOKEN}")

# Function to add EOS token to each example
def formatting_prompts_func(examples):
    """Add end-of-sequence token to each text example for proper training."""
    return {"text_custom": [example + EOS_TOKEN for example in examples["text"]]}


End of sequence token: <eos>
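
The EOS token marks where one document ends, which matters most when several short examples are concatenated into a single training sequence (packing). The sketch below shows the idea with made-up integer token IDs. `pack_sequences` is a hypothetical helper written for illustration, not the routine used by Unsloth or TRL.

```python
from typing import List

def pack_sequences(docs: List[List[int]], eos_id: int, block_size: int) -> List[List[int]]:
    """Concatenate tokenized docs, each followed by EOS, then cut fixed-size blocks.

    The trailing remainder shorter than block_size is dropped in this sketch.
    """
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # boundary marker between documents
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]

# Toy token IDs (assumption): three short "documents", EOS id 1
docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
blocks = pack_sequences(docs, eos_id=1, block_size=4)
print(blocks)
```

Without the EOS markers, the second block would run two unrelated documents together with no signal that a new one has started.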


In [8]:
# Load the cybersecurity dataset
dataset = load_dataset("clydeiii/cybersecurity", split = "train")
print(f"Original dataset size: {len(dataset)} examples")

# Process dataset to add EOS tokens
dataset = dataset.map(formatting_prompts_func, batched = True)



Original dataset size: 564951 examples



In [9]:
# Create a small training split (0.5% of the data) to demonstrate quick adaptation.
# In a real scenario, you would use more data; pass `seed=` to make the split reproducible.
train_dataset = dataset.train_test_split(test_size = 0.995)["train"]
print(f"Training dataset size: {len(train_dataset)} examples")


Training dataset size: 2824 examples
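
The reported training size can be checked by hand. The sketch below assumes the fractional test size is rounded up to a whole number of examples and the remainder becomes the training split, which is the usual convention for this kind of split; the variable names are illustrative.

```python
import math

n_samples = 564_951  # dataset size printed above
test_size = 0.995    # fraction passed to train_test_split

# Assumed rounding rule: ceil for the test split, remainder goes to train
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"test: {n_test:,} examples, train: {n_train:,} examples")
```

Under that assumption the training split has 2,824 examples, in line with the output above.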


In [10]:
# Display a sample from the dataset to understand the content
if len(train_dataset) > 0:
    print("\nSample data entry (truncated):")
    sample_text = train_dataset[0]["text_custom"]
    print(sample_text[:500] + "..." if len(sample_text) > 500 else sample_text)


Sample data entry (truncated):
ASERT Threat Intelligence Report <eos>


# Training Configuration

## Configuring the Continual Pretraining Process

Now we configure the continual pretraining process using UnslothTrainer,
which is optimized for LoRA and memory efficiency. We'll define:

### Training Hyperparameters:
- Batch size and gradient accumulation steps
- Learning rates (with special rate for embeddings)
- Training schedule and optimizer settings

### Addressing Continual Learning Challenges:
- Using a slower learning rate for embeddings to limit catastrophic forgetting
- Configuring warmup to stabilize early training
- Memory optimization for efficient training

In [11]:
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments
import torch

In [12]:
# Create the trainer with specialized configuration for continual pretraining
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text_custom",
    max_seq_length = max_seq_length,
    dataset_num_proc = 8,  # Number of processes for dataset processing
    packing = True,  # Request input packing for efficiency (the log below shows Unsloth disabling it in this run)

    args = UnslothTrainingArguments(
        # Batch size configuration
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 16,  # Accumulate gradients to simulate larger batch

        # Training schedule
        warmup_ratio = 0.1,  # Fraction of total steps used for learning rate warmup
        num_train_epochs = 1,  # For demonstration - more epochs might be needed for full adaptation

        # Learning rates
        learning_rate = 5e-5,  # Main learning rate for most parameters
        embedding_learning_rate = 5e-6,  # Slower learning rate for embeddings to limit catastrophic forgetting

        # Precision settings
        fp16 = not is_bfloat16_supported(),  # Use fp16 if bfloat16 not available
        bf16 = is_bfloat16_supported(),  # Use bfloat16 if available (better for newer GPUs)

        # Other settings
        logging_steps = 1,
        optim = "adamw_8bit",  # Memory-efficient optimizer
        weight_decay = 0.00,  # No weight decay to simplify training
        lr_scheduler_type = "cosine",  # Cosine schedule for smoother learning
        seed = 3407,  # For reproducibility
        output_dir = "outputs",
        report_to = "none"  # Disable reporting to save resources
    )
)


Unsloth: Tokenizing ["text_custom"] (num_proc=8):   0%|          | 0/2824 [00:00<?, ? examples/s]

Unsloth: Hugging Face's packing is currently buggy - we're disabling it for now!
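
To see what these settings imply, the sketch below derives the number of optimizer steps and warmup steps, then evaluates a linear-warmup plus cosine-decay schedule. It assumes one example per sequence (packing is disabled in this run), floor division for steps per epoch, and rounding the warmup count up. `lr_at` is a hypothetical helper that mirrors the usual warmup-then-cosine shape, not the trainer's internal code.

```python
import math

def lr_at(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

n_examples = 2824
effective_batch = 1 * 16                        # per-device batch x accumulation steps
total_steps = n_examples // effective_batch     # assumed floor division
warmup_steps = math.ceil(0.1 * total_steps)     # warmup_ratio = 0.1

print(f"effective batch = {effective_batch}, steps = {total_steps}, warmup = {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>3}: lr = {lr_at(step, 5e-5, warmup_steps, total_steps):.2e}")
```

The learning rate rises linearly to 5e-5 over the warmup steps and then decays along a half cosine toward zero by the final step.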


# Memory Management

## GPU Memory Management

Managing GPU memory is crucial for continual pretraining of large language models.
We'll monitor reserved memory before and after training to understand the computational
resource requirements of our approach.

In [13]:
import torch
import gc
from numba import cuda

In [14]:
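# Function to clear GPU cache
def free_gpu_cache():
    """Run garbage collection, clear the PyTorch CUDA cache, then reset the device."""
    print("Running Python GC.")
    gc.collect()  # Drop unreachable tensors first so their memory can be released
    print("Clearing GPU cache...")
    torch.cuda.empty_cache()
    print("Resetting device")
    # Caution: a numba device reset tears down the CUDA context, so any model or
    # tensor still held by PyTorch can become unusable. Call this only when finished.
    device = cuda.get_current_device()
    device.reset()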

In [15]:
# Display GPU specifications and initial memory usage
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved before training.")

GPU = NVIDIA L4. Max memory = 22.161 GB.
1.846 GB of memory reserved before training.


# Training Execution

## Executing Continual Pretraining

Now we'll run the actual continual pretraining process and track performance metrics.

In [16]:
# Run the continual pretraining process
print("Starting continual pretraining...")
trainer_stats = trainer.train()

Starting continual pretraining...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,824 | Num Epochs = 1 | Total steps = 176
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 16
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 16 x 1) = 16
 "-____-"     Trainable parameters = 138,067,968/1,000,000,000 (13.81% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
1,6.1182
2,5.1061
3,4.9587
4,5.9199
5,4.8995
6,5.6
7,4.9595
8,4.9937
9,6.6607
10,5.688




In [17]:
# Display training statistics
print(f"Training loss: {trainer_stats.training_loss}")

Training loss: 4.454508035020395
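
A cross-entropy loss is often easier to interpret as perplexity, the exponential of the loss. The sketch below converts the logged values; note that this is the average training loss over the run, not a held-out evaluation, so it gives only a rough indication.

```python
import math

first_step_loss = 6.1182              # loss at step 1 from the log above
mean_train_loss = 4.454508035020395   # average training loss reported above

# Perplexity = exp(cross-entropy loss in nats)
first_step_ppl = math.exp(first_step_loss)
mean_train_ppl = math.exp(mean_train_loss)

print(f"Perplexity at step 1: {first_step_ppl:.1f}")
print(f"Perplexity from mean training loss: {mean_train_ppl:.1f}")
```

A proper comparison would compute the same metric on held-out cybersecurity text before and after training.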


In [18]:
# Analyze memory usage after training
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

In [19]:
print("\n=== Training Performance Metrics ===")
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")


=== Training Performance Metrics ===
1766.6656 seconds used for training.
29.44 minutes used for training.
Peak reserved memory = 2.523 GB.
Peak reserved memory for training = 0.677 GB.
Peak reserved memory % of max memory = 11.385 %.
Peak reserved memory for training % of max memory = 3.055 %.


# Model Evaluation After Training

## Evaluating the Continually Pretrained Model

Now we'll test the model again after continual pretraining to see how its
responses about cybersecurity have changed. The goal is adaptation to the new
domain while preserving general capabilities.

We'll use the same input as before so the two outputs can be compared directly
(this cell allows up to 128 new tokens instead of 64).

In [20]:
# Generate response with the updated model
print("\n======== RESPONSE AFTER CONTINUAL PRETRAINING ========")
outputs_new = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
print(tokenizer.batch_decode(outputs_new)[0])
print("=======================================================")


<bos>
    what is penetration testing?
    What is the difference between penetration testing and ethical hacking?<eos>


In [21]:
# Try a different prompt to test general knowledge retention
general_prompt = tokenizer(
[
    """
    What is the capital of France?
"""
], return_tensors = "pt").to("cuda")

In [24]:
print("\n======== TESTING GENERAL KNOWLEDGE RETENTION ========")
general_outputs = model.generate(**general_prompt, streamer = text_streamer, max_new_tokens = 64)
print(tokenizer.batch_decode(general_outputs)[0])
print("=======================================================")


<bos>
    What is the capital of France?
    What is the capital of France?<eos>


# Model Persistence

## Saving the Continually Pretrained Model

After successful continual pretraining, we can save the model in various formats:

### Saving Options:
1. **LoRA adapters only**: Most space-efficient, requires base model at loading time
   - Ideal when you have multiple domain adaptations of the same base model
   
2. **Merged model with quantization**: Complete model with adapters merged into weights
   - Options include 4-bit and 8-bit quantization for deployment efficiency
   
3. **Full precision merged model**: Best quality but largest file size
   - Suitable when output quality is critical

Choose the appropriate option based on your deployment requirements.

In [23]:
# Uncomment the appropriate lines to save the model as needed

# Option 1: Save LoRA adapters only (most efficient)
# model.save_pretrained("adapters_cybersecurity")

# Option 2: Save locally with 4-bit quantization (balanced)
# model.save_pretrained_merged(f"{model_name}-cybersecurity-4bit", tokenizer, save_method = "merged_4bit")

# Option 3: Save to Hugging Face Hub with 4-bit quantization
# model.push_to_hub_merged(f"{hf_user}/{model_name}-cybersecurity-bnb-4bit", tokenizer, save_method = "merged_4bit")


# Conclusion And Key Takeaways

This notebook demonstrated a practical implementation of continual pretraining using:
1. A parameter isolation method (LoRA) to efficiently adapt a model
2. Domain-specific data (cybersecurity) for targeted knowledge acquisition
3. Techniques to manage resources and evaluate results

### Key Benefits Demonstrated:
- **Efficient adaptation** with minimal computational resources
- **Preservation of general capabilities** with domain-specific enhancement
- **Practical approach** to keeping models updated with new knowledge

### Addressing Continual Learning Challenges:
- **Catastrophic forgetting**: Addressed through parameter isolation and a lower embedding learning rate
- **Computational efficiency**: Achieved through quantization and LoRA
- **Domain adaptation**: Implemented for cybersecurity text in a short, single-epoch demonstration run

### Future Extensions:
- Experiment with different continual learning techniques (replay, regularization)
- Test on multiple sequential domains to assess knowledge accumulation
- Develop robust evaluation frameworks for continual pretraining
- Explore other parameter-efficient methods like Adapters or Prefix-tuning