# Omega17Exp LoRA SFT Fine-tuning with MS-SWIFT

This notebook demonstrates how to fine-tune the custom **Omega17ExpForCausalLM** model using LoRA (Low-Rank Adaptation) for Supervised Fine-Tuning (SFT).

## Model Specifications
- **Architecture**: Omega17ExpForCausalLM (MoE - Mixture of Experts)
- **Hidden Size**: 2048
- **Layers**: 48
- **Experts**: 128 total, 8 per token
- **Context Length**: 262,144 tokens
- **Vocab Size**: 151,936
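
As a quick sanity check on these numbers, the routing sparsity follows directly from the expert counts. The sketch below only does that arithmetic; the variable names are illustrative and not part of any library.

```python
# Values copied from the Model Specifications list above.
NUM_EXPERTS = 128
EXPERTS_PER_TOKEN = 8

# Fraction of experts that process any single token.
active_fraction = EXPERTS_PER_TOKEN / NUM_EXPERTS
print(f"Experts active per token: {EXPERTS_PER_TOKEN}/{NUM_EXPERTS} ({active_fraction:.2%})")
```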

## Requirements
- RunPod with GPU (A100 40GB+ recommended for this MoE model)
- Custom transformers fork: `transformers-usf-om-vl-exp-v0`

## 1. Environment Setup

In [None]:
# Install the custom transformers fork (REQUIRED for Omega17Exp model)
!pip install transformers-usf-om-vl-exp-v0 -q

# Install MS-SWIFT and dependencies
!pip install "ms-swift[llm]" -q

# Additional dependencies
!pip install accelerate bitsandbytes peft datasets -q

# For DeepSpeed (optional but recommended for large models)
!pip install deepspeed -q

In [None]:
# Verify GPU availability
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## 2. Model Registration

In [None]:
# Register Omega17Exp model with MS-SWIFT
# This imports the omega17 module which handles all registration automatically

from swift.llm.model.register import MODEL_MAPPING

# Check if already registered
if 'omega17_exp' not in MODEL_MAPPING:
    # Import omega17 module to trigger registration
    from swift.llm.model.model import omega17
    print("✅ Omega17Exp model registered successfully!")
else:
    print("✅ Omega17Exp model already registered")

# Verify registration
print(f"\nModel type 'omega17_exp' in MODEL_MAPPING: {'omega17_exp' in MODEL_MAPPING}")

# Show model metadata
if 'omega17_exp' in MODEL_MAPPING:
    meta = MODEL_MAPPING['omega17_exp']
    print(f"Template: {meta.template}")
    print(f"Architectures: {meta.architectures}")
    print(f"Model arch: {meta.model_arch}")
    print(f"Additional saved files: {meta.additional_saved_files}")

In [None]:
from swift.llm.model.register import (
    Model, ModelGroup, ModelMeta, register_model,
    get_model_tokenizer_with_flash_attn
)
from swift.llm.model.constant import LLMModelType
from swift.llm.model.model_arch import ModelArch
from swift.llm.model.patcher import patch_output_to_input_device
from swift.llm import TemplateType

# Add custom model type
if not hasattr(LLMModelType, 'omega17_exp'):
    LLMModelType.omega17_exp = 'omega17_exp'


def get_model_tokenizer_omega17_exp(model_dir, model_info, model_kwargs, load_model=True, **kwargs):
    """
    Custom get_model_tokenizer function for Omega17Exp MoE model.
    """
    model, tokenizer = get_model_tokenizer_with_flash_attn(
        model_dir, model_info, model_kwargs, load_model, **kwargs
    )
    
    if model is not None:
        # Move each MoE block's output back to its input device
        # (matters when layers are spread across several GPUs)
        try:
            mlp_cls = model.model.layers[1].mlp.__class__
            for module in model.modules():
                if isinstance(module, mlp_cls):
                    patch_output_to_input_device(module)
        except (AttributeError, IndexError):
            pass
    
    return model, tokenizer


# Register the model
register_model(
    ModelMeta(
        LLMModelType.omega17_exp,
        model_groups=[ModelGroup([])],
        template=TemplateType.chatml,
        get_function=get_model_tokenizer_omega17_exp,
        architectures=['Omega17ExpForCausalLM'],
        model_arch=ModelArch.llama,
        requires=['transformers-usf-om-vl-exp-v0'],
    ),
    exist_ok=True
)

print("✅ Omega17Exp model registered successfully!")
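
The `exist_ok=True` flag makes registration safe to re-run, which matters in a notebook where cells are often executed more than once. The following is a toy registry illustrating that pattern; `TOY_REGISTRY` and `toy_register` are hypothetical names, and this is not MS-SWIFT's implementation.

```python
# Toy stand-in for a model registry (hypothetical, for illustration only).
TOY_REGISTRY: dict[str, dict] = {}


def toy_register(name: str, meta: dict, exist_ok: bool = False) -> None:
    """Store `meta` under `name`; refuse to overwrite unless exist_ok is True."""
    if name in TOY_REGISTRY and not exist_ok:
        raise ValueError(f"{name!r} is already registered")
    TOY_REGISTRY[name] = meta


toy_register("omega17_exp", {"template": "chatml"})
# Simulates re-running the cell: this succeeds only because exist_ok=True.
toy_register("omega17_exp", {"template": "chatml"}, exist_ok=True)
print(sorted(TOY_REGISTRY))
```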

## 3. Configuration

In [None]:
# ============================================================
# CONFIGURATION - MODIFY THESE VALUES
# ============================================================

# Model path (local path or HuggingFace model ID)
MODEL_PATH = "/path/to/your/omega17-exp-model"  # <-- CHANGE THIS

# Dataset configuration
# Option 1: Use a HuggingFace dataset
DATASET = "alpaca-en"  # MS-SWIFT built-in dataset
# Option 2: Use your custom dataset (JSONL format)
# DATASET = "/path/to/your/dataset.jsonl"

# Output directory for checkpoints
OUTPUT_DIR = "./output/omega17_lora_sft"

# Training hyperparameters
BATCH_SIZE = 1  # Per device batch size (reduce if OOM)
GRADIENT_ACCUMULATION_STEPS = 16  # Effective batch size = BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * number of GPUs
MAX_LENGTH = 2048  # Maximum sequence length
NUM_EPOCHS = 3
LEARNING_RATE = 1e-4

# LoRA configuration
LORA_RANK = 64  # LoRA rank (higher = more parameters)
LORA_ALPHA = 128  # LoRA alpha (typically 2x rank)
LORA_DROPOUT = 0.05

# Target modules for LoRA
# For Omega17Exp (llama-style architecture), target attention and MLP
LORA_TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
    "gate_proj", "up_proj", "down_proj"  # MLP (if not targeting experts)
]

# Quantization (for memory efficiency)
USE_QLORA = False  # Set to True for 4-bit quantization
QUANT_BITS = 4  # 4 or 8 bit quantization

print(f"Model: {MODEL_PATH}")
print(f"Dataset: {DATASET}")
print(f"Output: {OUTPUT_DIR}")
print(f"LoRA Rank: {LORA_RANK}, Alpha: {LORA_ALPHA}")
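
Three derived quantities follow from the configuration above: the effective batch size, the LoRA scaling factor, and the number of trainable parameters LoRA adds to one linear layer. The sketch below assumes the standard LoRA formulation (update scaled by alpha / rank, with factors of shape rank x d_in and d_out x rank). The function names are illustrative, and the 2048 x 2048 layer is a hypothetical example sized from the hidden size, not a shape read from the model.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Samples contributing to one optimizer step."""
    return per_device * grad_accum * num_gpus


def lora_scaling(alpha: float, rank: int) -> float:
    """Factor applied to the low-rank update in standard LoRA."""
    return alpha / rank


def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    return rank * (d_in + d_out)


print(effective_batch_size(1, 16))        # 16
print(lora_scaling(128, 64))              # 2.0
print(lora_added_params(2048, 2048, 64))  # 262144
```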

## 4. Prepare Dataset

MS-SWIFT supports multiple dataset formats. Here's how to prepare your data.

In [None]:
# Example: Create a custom dataset in JSONL format
# Each line should be a JSON object with the following format:

import json
import os

# Example dataset structure for SFT
example_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is machine learning?"},
            {"role": "assistant", "content": "Machine learning is a subset of artificial intelligence..."}
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "Explain quantum computing."},
            {"role": "assistant", "content": "Quantum computing uses quantum mechanics principles..."}
        ]
    }
]

# Alternative format (query/response)
example_data_simple = [
    {
        "query": "What is the capital of France?",
        "response": "The capital of France is Paris."
    },
    {
        "query": "Write a Python function to calculate factorial.",
        "response": "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n-1)"
    }
]

# Save example dataset
os.makedirs("./data", exist_ok=True)
with open("./data/example_train.jsonl", "w") as f:
    for item in example_data:
        f.write(json.dumps(item) + "\n")

print("Example dataset created at ./data/example_train.jsonl")
print("\nSupported dataset formats:")
print("1. messages format: [{role: 'user', content: '...'}, {role: 'assistant', content: '...'}]")
print("2. query/response format: {query: '...', response: '...'}")
print("3. instruction/input/output format: {instruction: '...', input: '...', output: '...'}")
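
Records in the simpler formats carry the same information as a `messages` list, so they can be normalized into one shape before the JSONL file is written. The helper below is a hypothetical sketch (`to_messages` is not an MS-SWIFT function), and joining `instruction` and `input` with a newline is an assumption made for illustration.

```python
import json


def to_messages(record: dict) -> dict:
    """Normalize a query/response or instruction/input/output record to the messages format."""
    if "messages" in record:
        return record
    if "query" in record and "response" in record:
        user, assistant = record["query"], record["response"]
    elif "instruction" in record and "output" in record:
        extra = record.get("input", "")
        # Assumption: append the optional input on a new line.
        user = f"{record['instruction']}\n{extra}" if extra else record["instruction"]
        assistant = record["output"]
    else:
        raise ValueError(f"Unrecognized record keys: {sorted(record)}")
    return {"messages": [
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}


row = to_messages({"query": "What is the capital of France?",
                   "response": "The capital of France is Paris."})
print(json.dumps(row))
```

Unrecognized records raise a `ValueError` rather than being silently dropped, so format problems surface before training starts.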

## 5. Method A: Train Using MS-SWIFT Python API

In [None]:
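from swift.llm import sft_main, TrainArguments
from swift.llm.model.register import MODEL_MAPPING

# ms-swift 3.x names the SFT argument class TrainArguments;
# keep the older name as an alias for the cell below
SftArguments = TrainArguments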

# Verify model is registered
print(f"Registered model types: {list(MODEL_MAPPING.keys())[-10:]}...")  # Show last 10
print(f"omega17_exp registered: {'omega17_exp' in MODEL_MAPPING}")

In [None]:
# Define training arguments
sft_args = SftArguments(
    # Model configuration
    model=MODEL_PATH,
    model_type='omega17_exp',  # Use registered model type
    
    # Dataset configuration
    dataset=[DATASET],
    
    # Training type
    train_type='lora',  # LoRA fine-tuning
    
    # LoRA configuration
    lora_rank=LORA_RANK,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=LORA_TARGET_MODULES,
    
    # Quantization (optional); bitsandbytes is installed in the setup cell
    quant_method='bnb' if USE_QLORA else None,
    quant_bits=QUANT_BITS if USE_QLORA else None,
    
    # Training parameters
    output_dir=OUTPUT_DIR,
    max_length=MAX_LENGTH,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    learning_rate=LEARNING_RATE,
    
    # Optimizer and scheduler
    optim='adamw_torch',
    lr_scheduler_type='cosine',
    warmup_ratio=0.03,
    
    # Precision
    torch_dtype='bfloat16',
    
    # Logging and saving
    logging_steps=10,
    save_steps=500,
    save_total_limit=3,
    
    # Gradient checkpointing (save memory)
    gradient_checkpointing=True,
    
    # Evaluation
    eval_steps=500,
    
    # Misc
    seed=42,
    report_to=['tensorboard'],
)

print("Training arguments configured!")
print(f"\nKey settings:")
print(f"  - Train type: {sft_args.train_type}")
print(f"  - LoRA rank: {sft_args.lora_rank}")
print(f"  - Max length: {sft_args.max_length}")
print(f"  - Batch size: {sft_args.per_device_train_batch_size}")
print(f"  - Gradient accumulation: {sft_args.gradient_accumulation_steps}")
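
To get a feel for what `save_steps`, `warmup_ratio`, and the batch settings mean in practice, the number of optimizer steps can be estimated up front. The sketch below is an approximation under stated assumptions: a single GPU, a hypothetical dataset of 52,000 examples, and step rounding that may differ slightly from what the trainer does. `estimate_schedule` is an illustrative helper, not a library function.

```python
import math


def estimate_schedule(num_examples, per_device_batch, grad_accum, epochs,
                      warmup_ratio, save_steps, num_gpus=1):
    """Rough estimate of optimizer steps, warmup steps, and periodic checkpoints."""
    samples_per_step = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = max(num_examples // samples_per_step, 1)
    total_steps = steps_per_epoch * epochs
    return {
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
        "periodic_checkpoints": total_steps // save_steps,
    }


# 52_000 is a hypothetical dataset size chosen for the example.
plan = estimate_schedule(52_000, 1, 16, 3, warmup_ratio=0.03, save_steps=500)
print(plan)
```

With `save_total_limit=3`, only the three most recent of those periodic checkpoints would remain on disk.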

In [None]:
# Start training
# Uncomment the line below to run training
# result = sft_main(sft_args)

print("To start training, uncomment and run: result = sft_main(sft_args)")

## 6. Method B: Train Using CLI (Recommended for Production)

For production training, using the CLI is often more stable and easier to manage.
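
When a command is assembled from Python variables, paths containing spaces or shell metacharacters can break it. One option is to build the argument list first and let `shlex.join` handle quoting. This is a sketch of that idea under assumptions: `build_swift_command` is a hypothetical helper and the model path is a placeholder.

```python
import shlex


def build_swift_command(subcommand: str, options: dict) -> str:
    """Render `swift <subcommand> --key value ...` with shell-safe quoting."""
    argv = ["swift", subcommand]
    for key, value in options.items():
        argv.append(f"--{key}")
        # A list or tuple becomes several values after a single flag.
        values = value if isinstance(value, (list, tuple)) else [value]
        argv.extend(str(v) for v in values)
    return shlex.join(argv)


cmd = build_swift_command("sft", {
    "model": "/models/my omega17",  # placeholder path containing a space
    "train_type": "lora",
    "lora_rank": 64,
})
print(cmd)
```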

In [None]:
# Generate the CLI command
cli_command = f"""
CUDA_VISIBLE_DEVICES=0 swift sft \\
    --model {MODEL_PATH} \\
    --model_type omega17_exp \\
    --dataset {DATASET} \\
    --train_type lora \\
    --lora_rank {LORA_RANK} \\
    --lora_alpha {LORA_ALPHA} \\
    --lora_dropout {LORA_DROPOUT} \\
    --target_modules {' '.join(LORA_TARGET_MODULES)} \\
    --output_dir {OUTPUT_DIR} \\
    --max_length {MAX_LENGTH} \\
    --num_train_epochs {NUM_EPOCHS} \\
    --per_device_train_batch_size {BATCH_SIZE} \\
    --gradient_accumulation_steps {GRADIENT_ACCUMULATION_STEPS} \\
    --learning_rate {LEARNING_RATE} \\
    --torch_dtype bfloat16 \\
    --gradient_checkpointing true \\
    --logging_steps 10 \\
    --save_steps 500 \\
    --save_total_limit 3
"""

print("CLI Command for training:")
print("=" * 60)
print(cli_command)

In [None]:
# Save CLI command to a shell script
with open("train_omega17_lora.sh", "w") as f:
    f.write("#!/bin/bash\n")
    f.write("# Omega17Exp LoRA SFT Training Script\n\n")
    f.write("# Install dependencies first:\n")
    f.write("# pip install transformers-usf-om-vl-exp-v0\n")
    f.write("# pip install 'ms-swift[llm]'\n\n")
    f.write(cli_command.strip() + "\n")

print("Training script saved to: train_omega17_lora.sh")
print("Run with: bash train_omega17_lora.sh")

## 7. Multi-GPU Training with DeepSpeed

In [None]:
# DeepSpeed ZeRO-2 configuration for multi-GPU training
deepspeed_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        "allgather_partitions": True,
        "allgather_bucket_size": 2e8,
        "reduce_scatter": True,
        "reduce_bucket_size": 2e8,
        "overlap_comm": True,
        "contiguous_gradients": True
    },
    "bf16": {
        "enabled": True
    }
}

# Save DeepSpeed config
import json
with open("ds_config_zero2.json", "w") as f:
    json.dump(deepspeed_config, f, indent=2)

print("DeepSpeed config saved to: ds_config_zero2.json")

# Multi-GPU training command
multi_gpu_command = f"""
# Multi-GPU training with DeepSpeed ZeRO-2 (NPROC_PER_NODE starts one process per GPU)
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \\
    --model {MODEL_PATH} \\
    --model_type omega17_exp \\
    --dataset {DATASET} \\
    --train_type lora \\
    --lora_rank {LORA_RANK} \\
    --lora_alpha {LORA_ALPHA} \\
    --output_dir {OUTPUT_DIR} \\
    --max_length {MAX_LENGTH} \\
    --num_train_epochs {NUM_EPOCHS} \\
    --per_device_train_batch_size {BATCH_SIZE} \\
    --gradient_accumulation_steps {GRADIENT_ACCUMULATION_STEPS} \\
    --learning_rate {LEARNING_RATE} \\
    --deepspeed ds_config_zero2.json \\
    --torch_dtype bfloat16 \\
    --gradient_checkpointing true
"""

print("\nMulti-GPU Command:")
print("=" * 60)
print(multi_gpu_command)
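
The `"auto"` entries in the DeepSpeed config are placeholders that get filled in from the training arguments at launch. DeepSpeed expects the three batch fields to satisfy train_batch_size = micro-batch per GPU x gradient accumulation steps x number of GPUs. The sketch below resolves them by hand to show that relationship; `resolve_auto_batch_fields` is a hypothetical helper, not a DeepSpeed or MS-SWIFT function.

```python
import copy


def resolve_auto_batch_fields(config, micro_batch, grad_accum, world_size):
    """Return a copy of `config` with the three batch fields filled in."""
    resolved = copy.deepcopy(config)
    resolved["train_micro_batch_size_per_gpu"] = micro_batch
    resolved["gradient_accumulation_steps"] = grad_accum
    resolved["train_batch_size"] = micro_batch * grad_accum * world_size
    return resolved


template = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
# Same per-device settings as the configuration cell, on four GPUs.
resolved = resolve_auto_batch_fields(template, micro_batch=1, grad_accum=16, world_size=4)
print(resolved["train_batch_size"])  # 64
```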

## 8. Inference with Trained LoRA Adapter
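
Training writes checkpoints into numbered `checkpoint-<step>` directories, so the adapter path has to point at one of them. A small helper can pick the most recent one instead of typing the step by hand. This is a sketch under assumptions: `latest_checkpoint` is a hypothetical helper, and it searches the output directory recursively because the exact layout under it may vary.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> directory with the highest step, or None."""
    best, best_step = None, -1
    for path in Path(output_dir).rglob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if path.is_dir() and match and int(match.group(1)) > best_step:
            best, best_step = path, int(match.group(1))
    return best


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 1500):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    latest_name = latest_checkpoint(tmp).name
    print(latest_name)
```

Steps are compared numerically because a plain string sort would place `checkpoint-500` after `checkpoint-1500`.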

In [None]:
# After training, load the model with LoRA adapter for inference
from swift.llm import InferArguments, infer_main, get_model_tokenizer, get_template

# Path to your trained LoRA adapter
ADAPTER_PATH = f"{OUTPUT_DIR}/checkpoint-xxx"  # Replace xxx with actual checkpoint

# Inference configuration
infer_args = InferArguments(
    model=MODEL_PATH,
    model_type='omega17_exp',
    adapters=[ADAPTER_PATH],  # Load LoRA adapter
    torch_dtype='bfloat16',
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
)

print("Inference configuration ready!")
print(f"Model: {MODEL_PATH}")
print(f"Adapter: {ADAPTER_PATH}")

In [None]:
# CLI inference command
infer_cli = f"""
swift infer \\
    --model {MODEL_PATH} \\
    --model_type omega17_exp \\
    --adapters {ADAPTER_PATH} \\
    --torch_dtype bfloat16 \\
    --max_new_tokens 512 \\
    --temperature 0.7 \\
    --stream true
"""

print("CLI Inference Command:")
print("=" * 60)
print(infer_cli)

## 9. Merge LoRA Weights (Optional)

Merge LoRA adapter into the base model for faster inference without adapter loading overhead.
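
Merging works because the LoRA update is just a low-rank matrix added to the base weight: W' = W + (alpha / rank) * B A. The toy example below checks, with plain Python lists, that a merged weight gives the same output as the base weight plus a separate adapter. It assumes the standard LoRA formulation; the matrices and helper names are made up for illustration and say nothing about how `swift export` is implemented.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


W = [[1, 2], [3, 4]]  # base weight, shape d_out x d_in
A = [[1, 2]]          # LoRA A, shape rank x d_in (rank = 1)
B = [[1], [0]]        # LoRA B, shape d_out x rank
x = [[1], [1]]        # one input column vector
alpha, rank = 2, 1

# Adapter kept separate: base output plus the scaled low-rank update.
base_out = matmul(W, x)
update = matmul(B, matmul(A, x))
adapter_out = [[b[0] + (alpha / rank) * u[0]] for b, u in zip(base_out, update)]

# Adapter merged: a single matrix multiply with the combined weight.
merged_out = matmul(merge_lora(W, A, B, alpha, rank), x)
print(adapter_out == merged_out)
```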

In [None]:
# CLI command to merge LoRA weights
merge_command = f"""
swift export \\
    --model {MODEL_PATH} \\
    --model_type omega17_exp \\
    --adapters {ADAPTER_PATH} \\
    --merge_lora true \\
    --output_dir ./merged_model
"""

print("Merge LoRA Command:")
print("=" * 60)
print(merge_command)

## 10. Troubleshooting

### Common Issues and Solutions

In [None]:
troubleshooting_guide = """
============================================================
TROUBLESHOOTING GUIDE
============================================================

1. MODEL NOT FOUND ERROR
   - Ensure transformers-usf-om-vl-exp-v0 is installed
   - Check that model path is correct and accessible
   - Verify model files include: config.json, modeling_omega17_exp.py, etc.

2. OUT OF MEMORY (OOM)
   - Reduce batch_size (try 1)
   - Reduce max_length (try 1024 or 512)
   - Enable gradient_checkpointing
   - Use QLoRA with quant_bits=4
   - Use DeepSpeed ZeRO-2 or ZeRO-3

3. SLOW TRAINING
   - Use flash attention (usually auto-enabled)
   - Increase batch size if memory allows
   - Use bf16 instead of fp16/fp32
   - Use multi-GPU with DeepSpeed

4. LORA NOT WORKING
   - Check target_modules match your model architecture
   - For MoE models, you may need to target router layers
   - Try: ["q_proj", "k_proj", "v_proj", "o_proj"]

5. DATASET FORMAT ERRORS
   - Use JSONL format with one JSON object per line
   - Supported formats:
     * messages: [{"role": "user", "content": "..."}, ...]
     * query/response: {"query": "...", "response": "..."}
     * instruction/input/output: {"instruction": "...", "input": "...", "output": "..."}

6. CUSTOM TRANSFORMERS FORK ISSUES
   - Uninstall regular transformers first:
     pip uninstall -y transformers
   - Then install the custom fork:
     pip install transformers-usf-om-vl-exp-v0

============================================================
"""

print(troubleshooting_guide)
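
For item 6 of the guide, it can help to see which distributions pip believes are installed before uninstalling anything. The sketch below uses the standard-library `importlib.metadata`. It assumes the fork is published under the distribution name used in this notebook; `installed_version` is a hypothetical helper.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


for name in ("transformers", "transformers-usf-om-vl-exp-v0"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If both names report a version, the two packages may be competing for the same import name, which is the situation the uninstall-then-install steps are meant to resolve.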

## Summary

This notebook provides a complete workflow for LoRA SFT fine-tuning of the Omega17Exp model:

1. **Environment Setup**: Install custom transformers fork and MS-SWIFT
2. **Model Registration**: Register Omega17Exp with MS-SWIFT
3. **Configuration**: Set model path, dataset, and training parameters
4. **Dataset Preparation**: Write training data as JSONL in a supported format
5. **Training**: Use Python API or CLI for training
6. **Multi-GPU**: DeepSpeed configuration for distributed training
7. **Inference**: Load and use trained adapter
8. **Export**: Merge LoRA weights into base model

For questions or issues, check the MS-SWIFT documentation: https://swift.readthedocs.io/