# Llama Fine-tuning Configuration with LLaMA-Factory

This notebook collects the LLaMA-Factory configuration for fine-tuning Llama models on ICD-10 coding tasks: two training configs, an evaluation config, a DeepSpeed config, and the launch commands.

## Setup Environment

First, let's make sure LLaMA-Factory is properly installed. If not already installed, uncomment and run the following code:

In [1]:
# !git clone https://github.com/hiyouga/LLaMA-Factory.git
# %cd LLaMA-Factory
# !pip install -e .

## Initial Fine-tuning Configuration

This configuration drives the initial fine-tuning stage, which trains on the complete ICD-10 dataset.

In [2]:
# Create initial fine-tuning config file
initial_finetuning_config = """
### model
model_name_or_path: Llama-3.2-1B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z2_config.json

### dataset
dataset: full_free_icd_train
template: llama3
cutoff_len: 5000
max_samples: 60000
overwrite_cache: true
preprocessing_num_workers: 1

### output
output_dir: saves/Llama-3.2-1B-Instruct/full/final/initial_finetuning
logging_steps: 10
save_steps: 2000000
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 11.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 20000
"""

# Write to file
with open("initial_finetuning_config.yaml", "w") as f:
    f.write(initial_finetuning_config)
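To see how the step-based settings above (`eval_steps`, `save_steps`, `warmup_ratio`) relate to the length of the run, the arithmetic can be sketched as follows. This is an estimate only: it assumes the dataset really contains at least `max_samples` records and that training runs on 4 GPUs, as in the multi-GPU command later in this notebook. The helper name `training_schedule` is made up for illustration and is not part of LLaMA-Factory.

```python
import math

def training_schedule(num_samples, val_size, per_device_batch, grad_accum,
                      num_gpus, epochs, warmup_ratio, eval_steps, save_steps):
    """Estimate optimizer-step counts for a data-parallel training run."""
    val_samples = round(num_samples * val_size)      # held out via val_size
    train_samples = num_samples - val_samples
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(train_samples / effective_batch)
    total_steps = int(steps_per_epoch * epochs)
    return {
        "train_samples": train_samples,
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": round(total_steps * warmup_ratio),
        "evaluations": total_steps // eval_steps,
        "intermediate_checkpoints": total_steps // save_steps,
    }

# Values copied from the initial fine-tuning config; num_gpus=4 is an assumption.
schedule = training_schedule(
    num_samples=60000, val_size=0.1, per_device_batch=2, grad_accum=1,
    num_gpus=4, epochs=11.0, warmup_ratio=0.1,
    eval_steps=20000, save_steps=2000000,
)
for name, value in schedule.items():
    print(f"{name}: {value}")
```

Under these assumptions `save_steps` is far larger than the total number of steps, so no intermediate checkpoint is written and only the final model is saved.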

## Enhanced Fine-tuning Configuration

Configuration for the enhanced fine-tuning stage targeting linguistic and lexical variations. This configuration continues training from the initially fine-tuned model.

In [3]:
# Create enhanced fine-tuning config file
enhanced_finetuning_config = """
### model
model_name_or_path: saves/Llama-3.2-1B-Instruct/full/final/initial_finetuning

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z2_config.json

### dataset
dataset: llama_abb_enhance_data_10,llama_multi_enhance_data_10,llama_typo_enhance_data_10,llama_sentence_enhance_data_10,llama_reorder_enhance_data_10
template: llama3
cutoff_len: 5000
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 1

### output
output_dir: saves/Llama-3.2-1B-Instruct/full/final/enhanced_finetuning
logging_steps: 10
save_steps: 500000
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 5.0e-6  # Lower learning rate for enhanced tuning
num_train_epochs: 5.0  # Fewer epochs for enhanced tuning
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 10000
"""

# Write to file
with open("enhanced_finetuning_config.yaml", "w") as f:
    f.write(enhanced_finetuning_config)
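Because the second stage only works if it loads the first stage's output, a quick consistency check on the config strings can catch path typos before a long run. The sketch below uses a deliberately simple `key: value` line parser (not a real YAML parser, so it only suits flat configs like the ones in this notebook); `parse_flat_config` and `check_stage_chain` are hypothetical helper names.

```python
def parse_flat_config(text):
    """Parse flat `key: value` lines. This is NOT a general YAML parser."""
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and section headers
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        config[key.strip()] = value.strip()
    return config

def check_stage_chain(stage1_text, stage2_text):
    """Return a list of problems found when chaining stage 1 into stage 2."""
    stage1 = parse_flat_config(stage1_text)
    stage2 = parse_flat_config(stage2_text)
    problems = []
    if stage2.get("model_name_or_path") != stage1.get("output_dir"):
        problems.append("stage 2 does not start from the stage 1 output_dir")
    if float(stage2["learning_rate"]) >= float(stage1["learning_rate"]):
        problems.append("stage 2 learning rate is not lower than stage 1")
    return problems

# Shortened stand-ins for the two config strings defined in this notebook.
stage1_demo = """
### output
output_dir: saves/Llama-3.2-1B-Instruct/full/final/initial_finetuning
learning_rate: 1.0e-5
"""
stage2_demo = """
### model
model_name_or_path: saves/Llama-3.2-1B-Instruct/full/final/initial_finetuning
learning_rate: 5.0e-6  # Lower learning rate for enhanced tuning
"""
print(check_stage_chain(stage1_demo, stage2_demo))
```

In the notebook itself, the same check can be called as `check_stage_chain(initial_finetuning_config, enhanced_finetuning_config)`.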

## Evaluation Configuration

This configuration evaluates the fine-tuned model on the test dataset by generating predictions (`do_predict: true` together with `predict_with_generate: true`).

In [4]:
# Create evaluation config file
evaluation_config = """
### model
model_name_or_path: saves/Llama-3.2-1B-Instruct/full/final/enhanced_finetuning

### method
stage: sft
do_predict: true
finetuning_type: full
max_new_tokens: 10000

### dataset
eval_dataset: full_free_icd_test
template: llama3
cutoff_len: 150000
max_samples: 20000
overwrite_cache: true
preprocessing_num_workers: 5

### output
output_dir: saves/Llama-3.2-1B-Instruct/full/final/result/evaluation_results
overwrite_output_dir: true

### eval
per_device_eval_batch_size: 2
predict_with_generate: true
"""

# Write to file
with open("evaluation_config.yaml", "w") as f:
    f.write(evaluation_config)
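Once prediction has run, the generated codes still have to be scored. A minimal sketch of exact-match scoring follows. It assumes the predictions are stored as JSON Lines with `predict` and `label` fields, which is how LLaMA-Factory's `generated_predictions.jsonl` is usually laid out; check the file in your `output_dir` before relying on this. The sample records are illustrative only, not real results.

```python
import json

def exact_match_accuracy(lines):
    """Share of records whose predicted text equals the label after trimming."""
    total = 0
    correct = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        if record["predict"].strip() == record["label"].strip():
            correct += 1
    return correct / total if total else 0.0

# Illustrative records only; they are not output from the model.
sample_lines = [
    json.dumps({"prompt": "Type 2 diabetes mellitus without complications",
                "predict": "E11.9", "label": "E11.9"}),
    json.dumps({"prompt": "Essential (primary) hypertension",
                "predict": " I10\n", "label": "I10"}),
    json.dumps({"prompt": "Acute bronchitis, unspecified",
                "predict": "J20.8", "label": "J20.9"}),
]
print(f"exact match: {exact_match_accuracy(sample_lines):.3f}")
```

An open file object can be passed directly, for example `exact_match_accuracy(open(path))`, because the function only iterates over lines.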

## Running Fine-tuning with LLaMA-Factory

The commands below launch each stage. The cells only store the command strings; run them in a shell, or prefix them with `!` in a notebook cell.

In [5]:
# Command for running initial fine-tuning
# (`llamafactory-cli` is installed together with LLaMA-Factory)
initial_finetuning_cmd = "llamafactory-cli train initial_finetuning_config.yaml"

In [6]:
# Command for running enhanced fine-tuning
enhanced_finetuning_cmd = "llamafactory-cli train enhanced_finetuning_config.yaml"

In [7]:
# Command for evaluation (prediction is selected by do_predict in the config)
evaluation_cmd = "llamafactory-cli train evaluation_config.yaml"

## Multi-GPU Training with Distributed Data Parallel (DDP)

For faster training, the same config can run on multiple GPUs using PyTorch Distributed Data Parallel (DDP) together with DeepSpeed.

In [8]:
# Multi-GPU training command with DDP
# FORCE_TORCHRUN=1 makes llamafactory-cli launch the run through torchrun
multi_gpu_training_cmd = (
    "FORCE_TORCHRUN=1 CUDA_VISIBLE_DEVICES=0,1,2,3 "
    "llamafactory-cli train initial_finetuning_config.yaml"
)
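The cells above only define command strings. If you prefer to launch the stages from Python, one possible wrapper is sketched below. It assumes the `llamafactory-cli train` entry point installed by current LLaMA-Factory releases and the `FORCE_TORCHRUN` environment variable described in its README; `build_launch` and `run_stage` are hypothetical helper names, and `dry_run=True` returns the command without starting anything.

```python
import os
import subprocess

def build_launch(config_path, gpus=None):
    """Return (argv, env) for one LLaMA-Factory run."""
    argv = ["llamafactory-cli", "train", config_path]
    env = dict(os.environ)
    if gpus:
        # Restrict the visible devices and request a torchrun launch.
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(gpu) for gpu in gpus)
        env["FORCE_TORCHRUN"] = "1"
    return argv, env

def run_stage(config_path, gpus=None, dry_run=False):
    """Run one stage and raise CalledProcessError if it fails."""
    argv, env = build_launch(config_path, gpus)
    if not dry_run:
        subprocess.run(argv, env=env, check=True)
    return argv

# Dry run: shows the command without starting training.
print(run_stage("initial_finetuning_config.yaml", gpus=[0, 1, 2, 3], dry_run=True))
```

Using `check=True` means a failed first stage stops the notebook before the second stage tries to load a model that was never written.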

## DeepSpeed Configuration

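The DeepSpeed configuration file referenced in the YAML configs (`examples/deepspeed/ds_z2_config.json`). It enables ZeRO stage 2 with the optimizer state offloaded to CPU, and the `"auto"` values are filled in from the training arguments at launch. Two points are worth checking before use: the file defines its own `WarmupDecayLR` scheduler, which may take precedence over `lr_scheduler_type: cosine` in the YAML configs, and writing to this path replaces the file of the same name shipped with LLaMA-Factory if the notebook runs inside the repository.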

In [9]:
# Create DeepSpeed config file
deepspeed_config = """
{
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "contiguous_gradients": true,
        "overlap_comm": true
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    }
}
"""

# Create directory if it doesn't exist
import os
os.makedirs("examples/deepspeed", exist_ok=True)

# Write to file
with open("examples/deepspeed/ds_z2_config.json", "w") as f:
    f.write(deepspeed_config)
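Since the config is written from a plain string, it is worth confirming that it parses as JSON and seeing which fields are left as `"auto"` for the trainer to fill in. A small sketch, using a shortened excerpt of the config above (`find_auto_fields` is a hypothetical helper):

```python
import json

def find_auto_fields(node, path=""):
    """Return dotted paths of every value set to "auto" in a nested config."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = f"{path}.{key}" if path else key
            if value == "auto":
                found.append(child_path)
            else:
                found.extend(find_auto_fields(value, child_path))
    return found

# Shortened excerpt of the DeepSpeed config, for demonstration.
excerpt = """
{
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_clipping": 1.0,
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": "auto"},
    "optimizer": {"type": "AdamW", "params": {"lr": "auto", "betas": [0.9, 0.999]}}
}
"""
config = json.loads(excerpt)  # raises json.JSONDecodeError on malformed JSON
print(find_auto_fields(config))
```

Running `find_auto_fields(json.loads(deepspeed_config))` in the notebook lists every setting that must come from the YAML training arguments instead.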

## Dataset Configuration

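Each training record carries a system prompt, an input, and the expected output. The cell below shows one such record. The dataset names used in the configs (for example `full_free_icd_train`) must also be registered in LLaMA-Factory's `data/dataset_info.json`.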

In [10]:
# Example of one dataset record (pretty-printed here for readability;
# in a JSON Lines file each record occupies a single line)
dataset_example = """
{
    "system": "You are a medical coding specialist responsible for assigning ICD-10 codes to clinical documentation",
    "input": "Generate appropriate ICD-10 codes based on standard descriptions: Type 2 diabetes mellitus without complications",
    "output": "E11.9"
}
"""

print("Dataset example (one record):")
print(dataset_example)
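To turn records like the one above into a file LLaMA-Factory can load, they need to be serialized one per line and registered under the dataset name used in the configs. The sketch below shows one way to do that. The file name `full_free_icd_train.jsonl` and the column mapping are assumptions for illustration; the mapping follows the `columns` convention of `dataset_info.json` (`prompt`, `response`, `system`), so verify it against the version of LLaMA-Factory you have installed.

```python
import json

records = [
    {
        "system": "You are a medical coding specialist responsible for assigning ICD-10 codes to clinical documentation",
        "input": "Generate appropriate ICD-10 codes based on standard descriptions: Type 2 diabetes mellitus without complications",
        "output": "E11.9",
    },
]

def to_jsonl(rows):
    """Serialize records as JSON Lines: one compact JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)

# Hypothetical registry entry; the file name and mapping are assumptions.
dataset_info_entry = {
    "full_free_icd_train": {
        "file_name": "full_free_icd_train.jsonl",
        "columns": {"prompt": "input", "response": "output", "system": "system"},
    }
}

print(to_jsonl(records), end="")
print(json.dumps(dataset_info_entry, indent=2))
```

The entry would be merged into `data/dataset_info.json`, with the JSONL file placed in the same `data` directory.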

## Key Configuration Parameters Explained

### Two-Stage Fine-tuning Approach

1. **Initial Fine-tuning Stage**:
   - Uses complete ICD-10 dataset: `dataset: full_free_icd_train`
   - Higher learning rate: `learning_rate: 1.0e-5`
   - More epochs: `num_train_epochs: 11.0`
   - Focuses on building foundational medical coding knowledge

2. **Enhanced Fine-tuning Stage**:
   - Uses linguistic variation datasets: `dataset: llama_abb_enhance_data_10,...`
   - Lower learning rate: `learning_rate: 5.0e-6`
   - Fewer epochs: `num_train_epochs: 5.0`
   - Builds on initial model: `model_name_or_path: .../initial_finetuning`
   - Focuses on handling linguistic and lexical variations

### Performance Optimization

- **DeepSpeed Integration**: ZeRO stage 2 with CPU optimizer offload for memory efficiency
- **BF16 Precision**: `bf16: true` for faster training without significant precision loss
- **Multi-GPU Training**: Using PyTorch DDP for distributed training
- **Gradient Accumulation**: `gradient_accumulation_steps: 1` in these configs, so no accumulation takes place; raise it to get a larger effective batch size without using more GPU memory
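The memory effect of ZeRO stage 2 can be estimated with the per-parameter accounting from the ZeRO paper: with mixed-precision Adam, each parameter costs 2 bytes of 16-bit weights, 2 bytes of 16-bit gradients, and 12 bytes of optimizer state, and stage 2 partitions the gradients and optimizer state across GPUs. The sketch below applies that formula. It ignores activations, buffers, and fragmentation, and both the parameter count (roughly 1.2 billion for the 1B model) and the 4-GPU setup are approximations, so treat the result as a lower bound.

```python
def zero2_gpu_memory_gb(num_params, num_gpus, offload_optimizer=False):
    """Rough per-GPU model-state memory (GB) under ZeRO stage 2."""
    weights = 2 * num_params              # 16-bit weights, replicated on each GPU
    gradients = 2 * num_params / num_gpus # 16-bit gradients, partitioned
    # fp32 master weights, momentum, and variance: 12 bytes per parameter,
    # partitioned across GPUs, or moved to host memory when offloaded.
    optimizer = 0 if offload_optimizer else 12 * num_params / num_gpus
    return (weights + gradients + optimizer) / 1e9

approx_params = 1.2e9  # approximate size of the 1B model; an assumption
for offload in (False, True):
    gb = zero2_gpu_memory_gb(approx_params, num_gpus=4, offload_optimizer=offload)
    print(f"offload_optimizer={offload}: about {gb:.1f} GB per GPU")
```

The comparison shows why the DeepSpeed config offloads the optimizer to CPU: under this accounting the optimizer state is the largest partitioned term.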

## Shell Script for Full Training Process

This script automates the complete process: both fine-tuning stages followed by evaluation.

In [11]:
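# Create a shell script for the full training process.
# The string starts directly with the shebang so that it is the first line of the file.
full_training_script = r"""#!/bin/bash
set -e  # stop at the first failing stage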

# Create needed directories
mkdir -p examples/deepspeed
mkdir -p saves/Llama-3.2-1B-Instruct/full/final/initial_finetuning
mkdir -p saves/Llama-3.2-1B-Instruct/full/final/enhanced_finetuning
mkdir -p saves/Llama-3.2-1B-Instruct/full/final/result/evaluation_results

# Generate DeepSpeed config
cat > examples/deepspeed/ds_z2_config.json << 'EOL'
{
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "contiguous_gradients": true,
        "overlap_comm": true
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    }
}
EOL

echo "Starting Stage 1: Initial Fine-tuning"
FORCE_TORCHRUN=1 CUDA_VISIBLE_DEVICES=0,1,2,3 llamafactory-cli train initial_finetuning_config.yaml

echo "Starting Stage 2: Enhanced Fine-tuning"
FORCE_TORCHRUN=1 CUDA_VISIBLE_DEVICES=0,1,2,3 llamafactory-cli train enhanced_finetuning_config.yaml

echo "Starting Evaluation"
llamafactory-cli train evaluation_config.yaml

echo "Fine-tuning process complete!"
"""

# Write to file
with open("run_icd10_finetuning.sh", "w") as f:
    f.write(full_training_script)

## Conclusion

This notebook provides the complete configuration for our two-stage fine-tuning approach using LLaMA-Factory:

1. Initial fine-tuning establishes comprehensive ICD-10 code knowledge
2. Enhanced fine-tuning adapts the model to handle linguistic and lexical variations

The configuration files and scripts can be adapted for different model sizes or variations of the Llama model family. Memory use and throughput are addressed through DeepSpeed ZeRO stage 2, BF16 precision, and multi-GPU training.