# Neuron Distillation Training

This notebook ports the `run_distillation.sh` script to notebook cells so that knowledge distillation training can be configured and launched on AWS Neuron step by step.

## Environment Setup

In [1]:
import os

# Set Neuron compiler flags, runtime settings, and process environment
os.environ['NEURON_CC_FLAGS'] = "--model-type transformer --retry_failed_compilation"
os.environ['NEURON_FUSE_SOFTMAX'] = "1"
os.environ['NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS'] = "3"
os.environ['MALLOC_ARENA_MAX'] = "64"
os.environ['WORLD_SIZE'] = "8"

## Training Configuration

In [2]:
# Training parameters
PROCESSES_PER_NODE = 2
NUM_EPOCHS = 3
TP_DEGREE = 2
BS = 1
GRADIENT_ACCUMULATION_STEPS = 16
LOGGING_STEPS = 1
MODEL_NAME = "meta-llama/Llama-3.2-1B-Instruct"
OUTPUT_DIR = f"{MODEL_NAME.split('/')[-1]}-finetuned"

# Limit the run to 5 steps when only extracting graphs for compilation; -1 sets no step limit
MAX_STEPS = 5 if os.environ.get('NEURON_EXTRACT_GRAPHS_ONLY') == '1' else -1

print(f"Model: {MODEL_NAME}")
print(f"Output directory: {OUTPUT_DIR}")
print(f"Max steps: {MAX_STEPS}")

Model: meta-llama/Llama-3.2-1B-Instruct
Output directory: Llama-3.2-1B-Instruct-finetuned
Max steps: -1
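
Together, these settings determine how many samples contribute to each optimizer step. A minimal sketch of the arithmetic, assuming a single node and the usual convention that the data-parallel degree equals the world size divided by the product of the tensor- and pipeline-parallel degrees. The function name `effective_batch_size` and its parameters are illustrative and not part of the training script; the pipeline-parallel degree of 1 is the value passed to `torchrun` in the launch cell.

```python
def effective_batch_size(world_size, tp_degree, pp_degree, per_device_bs, grad_accum_steps):
    """Samples consumed per optimizer step across all data-parallel replicas."""
    model_parallel = tp_degree * pp_degree
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by tp_degree * pp_degree")
    dp_degree = world_size // model_parallel  # number of model replicas
    return per_device_bs * grad_accum_steps * dp_degree

# Values from the configuration cell above (2 processes, TP=2, PP=1, BS=1, 16 accumulation steps)
print(effective_batch_size(world_size=2, tp_degree=2, pp_degree=1,
                           per_device_bs=1, grad_accum_steps=16))  # prints 16
```

With two processes and a tensor-parallel degree of 2 there is a single model replica, so under these assumptions the effective batch size comes entirely from gradient accumulation.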


## KnowledgeDistillationTrainer Code

Let's examine the `KnowledgeDistillationTrainer` class from `src/distill_neuron_torchrun.py`. The cell below loads lines 34-78 of that file:

In [None]:
%load -r 34-78 src/distill_neuron_torchrun.py


## Key Methods of KnowledgeDistillationTrainer

The `compute_loss` method is the core of the knowledge distillation process. The cell below loads lines 40-78 of the same file:

In [None]:
%load -r 40-78 src/distill_neuron_torchrun.py
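
The source of `compute_loss` is not reproduced in this notebook, so the following is only a generic sketch of a common distillation objective, not the script's implementation. It combines the cross-entropy on the true label with the KL divergence between temperature-softened teacher and student distributions. The names `distillation_loss`, `temperature`, and `alpha` are assumptions, and plain Python lists stand in for the logit tensors of a single token position.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Hypothetical KD loss: alpha * soft-target KL + (1 - alpha) * hard-label CE."""
    # Hard-label cross-entropy on the student's unscaled distribution
    ce = -math.log(softmax(student_logits)[label])
    # KL(teacher || student) on temperature-softened distributions
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    # The T^2 factor is the customary rescaling that keeps the soft-target
    # term on a comparable scale across temperatures
    return alpha * (temperature ** 2) * kl + (1 - alpha) * ce

student = [2.0, 0.5, -1.0]   # made-up student logits for one position
teacher = [3.0, 0.2, -2.0]   # made-up teacher logits for the same position
loss = distillation_loss(student, teacher, label=0)
print(f"loss = {loss:.4f}")
```

In a real trainer the same quantities would be computed on batched tensors, with the teacher's forward pass run without gradient tracking.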


## Run Distillation Training

Execute the training with torchrun:

In [None]:
import subprocess

# Build the torchrun command
cmd = [
    "torchrun",
    "--nproc_per_node", str(PROCESSES_PER_NODE),
    "src/distill_neuron_torchrun.py",
    "--model_id", MODEL_NAME,
    "--num_train_epochs", str(NUM_EPOCHS),
    "--do_train",
    "--max_steps", str(MAX_STEPS),
    "--per_device_train_batch_size", str(BS),
    "--gradient_accumulation_steps", str(GRADIENT_ACCUMULATION_STEPS),
    "--learning_rate", "1e-4",
    "--bf16",
    "--tensor_parallel_size", str(TP_DEGREE),
    "--warmup_steps", "5",
    "--pipeline_parallel_size", "1",
    "--logging_steps", str(LOGGING_STEPS),
    "--output_dir", OUTPUT_DIR,
    "--overwrite_output_dir"
]

print("Running command:")
print(" ".join(cmd))
print("\n" + "="*50)

# Execute the command
result = subprocess.run(cmd, capture_output=True, text=True)
print("STDOUT:")
print(result.stdout)
if result.stderr:
    print("\nSTDERR:")
    print(result.stderr)
print(f"\nReturn code: {result.returncode}")

Running command:
torchrun --nproc_per_node 2 src/distill_neuron_torchrun.py --model_id meta-llama/Llama-3.2-1B-Instruct --num_train_epochs 3 --do_train --max_steps -1 --per_device_train_batch_size 1 --gradient_accumulation_steps 16 --learning_rate 1e-4 --bf16 --tensor_parallel_size 2 --warmup_steps 5 --pipeline_parallel_size 1 --logging_steps 1 --output_dir Llama-3.2-1B-Instruct-finetuned --overwrite_output_dir
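
Because `subprocess.run(..., capture_output=True)` returns output only after the process exits, a long training run shows nothing until it finishes. One possible alternative is to stream the logs line by line with `subprocess.Popen`. The helper `run_streaming` is hypothetical and not part of the original script; the demonstration uses a harmless command in place of the `torchrun` command list.

```python
import subprocess
import sys

def run_streaming(cmd):
    """Run cmd, echoing merged stdout/stderr line by line; return the exit code."""
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
        bufsize=1,                 # line-buffered on the parent side
    ) as proc:
        for line in proc.stdout:
            print(line, end="", flush=True)
    # Leaving the `with` block waits for the process, so returncode is set here
    return proc.returncode

# Demonstration with a harmless command; pass the torchrun `cmd` list instead
exit_code = run_streaming([sys.executable, "-c", "print('step 1'); print('step 2')"])
print(f"Return code: {exit_code}")
```

Merging stderr into stdout keeps log lines in order, at the cost of no longer being able to tell the two streams apart.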

