# Lab 1: Chess Move Evaluation - Knowledge Distillation Training

## Introduction

In this lab, you will train a smaller "student" model to evaluate chess moves using knowledge distillation. This builds on Lab 0, where you generated teacher model logits from the Qwen3-30B-A3B model.

**Task**: Train a student model to classify which chess move is better (MoveA or MoveB)

**Why Knowledge Distillation for Chess?**
- **Cost Reduction**: 50x smaller model (30B → 0.6B parameters)
- **Faster Inference**: ~20-50x faster move evaluation
- **Deployment Flexibility**: Can run on smaller instances or edge devices
- **Maintained Performance**: Retains much of the teacher's chess understanding

**Training Approach:**

The `KnowledgeDistillationTrainer` combines two loss functions:
1. **Hard Loss**: Cross-entropy with true labels (MoveA or MoveB)
2. **Soft Loss**: KL divergence between the temperature-softened teacher and student output distributions, computed from their logits

Combined loss: `total_loss = α × soft_loss + (1 - α) × hard_loss`

With α = 0.7, 70% of the loss weight goes to matching the teacher and 30% to predicting the true labels.
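The combined loss can be worked through by hand for a single two-class example. The following is a minimal sketch in plain Python, not the `KnowledgeDistillationTrainer` implementation: the function names are hypothetical, the toy logits are made up, and the multiplication of the soft loss by T² is a common distillation convention that is assumed here rather than confirmed by this lab.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.7):
    """Return (total_loss, soft_loss, hard_loss) for one example."""
    # Hard loss: cross-entropy of the student's T=1 distribution vs. the true label.
    hard_loss = -math.log(softmax(student_logits)[label])

    # Soft loss: KL(teacher || student) on temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft_loss = sum(p * math.log(p / q)
                    for p, q in zip(p_teacher, p_student) if p > 0)

    # Assumption: scale by T^2 so soft-loss gradients stay comparable across temperatures.
    soft_loss *= temperature ** 2

    total_loss = alpha * soft_loss + (1 - alpha) * hard_loss
    return total_loss, soft_loss, hard_loss

# Toy logits over [MoveA, MoveB]; the true label is MoveA (index 0).
total, soft, hard = distillation_loss(
    student_logits=[1.0, 1.2], teacher_logits=[3.0, 0.5], label=0)
print(f"soft={soft:.4f} hard={hard:.4f} total={total:.4f}")

# A higher temperature flattens the teacher distribution.
print(softmax([3.0, 0.5], temperature=1.0))
print(softmax([3.0, 0.5], temperature=4.0))
```

The two `softmax` prints show why the temperature matters: at T = 4 the teacher's preference for MoveA is less extreme, so the student also receives signal about how much worse the other option is.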

**Models:**
- **Teacher**: Qwen3-30B-A3B (30 billion parameters)
- **Student**: Qwen3-0.6B (600 million parameters)

**Prerequisites:**
- Completed Lab 0 with chess logits saved to `data/chess_output.json`
- AWS Trainium instance (trn1.32xlarge recommended)
- AWS Neuron SDK installed
- Virtual environment: `/opt/aws_neuronx_venv_pytorch_2_8_nxd_inference`

## Download Student Model

Install the training dependencies, then download the Qwen3-0.6B model weights from Hugging Face.

In [None]:
%pip install -q neuronx-distributed datasets optimum-neuron[training]==0.4.1

Note: you may need to restart the kernel to use updated packages.


In [3]:
!hf download Qwen/Qwen3-0.6B

Fetching 10 files:   0%|                                 | 0/10 [00:00<?, ?it/s]
Downloading 'tokenizer.json' to '/home/ubuntu/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/blobs/aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4.incomplete'
Downloading 'model.safetensors' to '/home/ubuntu/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/blobs/f47f71177f32bcd101b7573ec9171e6a57f4f4d31148d38e382306f42996874b.incomplete'
Downloading 'README.md' to '/home/ubuntu/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/blobs/a50b19e76f5274f9ec99f5a5d99873dca5bff25e.incomplete'

README.md: 14.0kB [00:00, 51.0MB/s]
Download complete. Moving file to /home/ubuntu/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/blobs/a50b19e76f5274f9ec99f5a5d99873dca5bff25e
Downloading 'LICENSE' to '/home/ubuntu/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/blobs/6634c8cc3133b3848ec74b9f275acaaa1ea618ab.incomplete'
Downloading 'config.json' to '/home/ubuntu/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B

## Environment Setup

Configure environment variables for optimal Neuron performance.

In [1]:
import os

# Neuron compiler and runtime settings
os.environ['NEURON_CC_FLAGS'] = "--model-type transformer --retry_failed_compilation"
os.environ['NEURON_FUSE_SOFTMAX'] = "1"
os.environ['NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS'] = "3"
os.environ['MALLOC_ARENA_MAX'] = "64"
os.environ['WORLD_SIZE'] = "8"
os.environ['WANDB_DISABLED'] = "true"  # Disable wandb logging

## Training Configuration

Define hyperparameters for the distillation training.

In [2]:
# Training parameters
PROCESSES_PER_NODE = 2  # Distributed training processes
NUM_EPOCHS = 3  # Number of training epochs
TP_DEGREE = 2  # Tensor parallelism degree
BS = 1  # Batch size per device
GRADIENT_ACCUMULATION_STEPS = 16  # Effective batch size = 16
LOGGING_STEPS = 1  # Log every step
MODEL_NAME = "Qwen/Qwen3-0.6B"
OUTPUT_DIR = "Qwen3-0.6B-chess-finetuned"
DATASET_PATH = "data/chess_output.json"

# Distillation hyperparameters
TEMPERATURE = 4.0  # Softness of probability distributions
ALPHA = 0.7  # Weight for soft loss (0.7 = 70% teacher, 30% labels)

# Set max steps (use -1 for full training)
MAX_STEPS = -1  # Train for full epochs

print(f"Model: {MODEL_NAME}")
print(f"Dataset: {DATASET_PATH}")
print(f"Output directory: {OUTPUT_DIR}")
print(f"Temperature: {TEMPERATURE}, Alpha: {ALPHA}")

Model: Qwen/Qwen3-0.6B
Dataset: data/chess_output.json
Output directory: Qwen3-0.6B-chess-finetuned
Temperature: 4.0, Alpha: 0.7
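The parallelism and batch settings above determine how many optimizer steps the run takes. The sketch below works through that arithmetic under stated assumptions: the data-parallel degree is taken to be the process count divided by tensor-parallel × pipeline-parallel degree, the sample count of 100 matches the dataset check later in this lab, and partial batches are rounded up (the exact rounding depends on the trainer). `training_schedule` is a hypothetical helper, not part of the lab's training script.

```python
import math

def training_schedule(num_samples, processes, tp, pp,
                      batch_size, grad_accum, epochs):
    """Estimate effective batch size and optimizer step counts."""
    if processes % (tp * pp) != 0:
        raise ValueError("process count must be divisible by tp * pp")
    # Assumption: processes not used for model parallelism act as data-parallel replicas.
    dp = processes // (tp * pp)
    effective_batch = batch_size * grad_accum * dp
    # Assumption: a final partial batch still counts as one optimizer step.
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return {
        "data_parallel_degree": dp,
        "effective_batch_size": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }

# Values from the configuration cell; pipeline parallelism is 1 in this lab.
schedule = training_schedule(
    num_samples=100, processes=2, tp=2, pp=1,
    batch_size=1, grad_accum=16, epochs=3)
for name, value in schedule.items():
    print(f"{name}: {value}")
```

Under these assumptions the whole run is only about 21 optimizer steps, so the 5 warmup steps passed to the training command cover roughly a quarter of it.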


## Verify Chess Dataset

Check that the chess logits data from Lab 0 is available.

In [3]:
import json
from pathlib import Path

if not Path(DATASET_PATH).exists():
    print(f"ERROR: {DATASET_PATH} not found!")
    print("Please run Lab0_generate_teacher_logits_chess.ipynb first.")
else:
    with open(DATASET_PATH, 'r') as f:
        chess_data = json.load(f)
    
    valid_samples = [s for s in chess_data if 'error' not in s]
    print(f"✓ Found {len(valid_samples)} valid chess samples")
    print(f"✓ Average logit positions: {sum(len(s['response']['token_logits']) for s in valid_samples) / len(valid_samples):.1f}")
    
    # Show example
    sample = valid_samples[0]
    print(f"\nExample:")
    print(f"  Input: {sample['input'][:100]}...")
    print(f"  Expected: {sample['expected_output']}")
    print(f"  Generated: {sample['response']['generated_text']}")

✓ Found 100 valid chess samples
✓ Average logit positions: 3.0

Example:
  Input: The FEN of the given chess board is "1r4k1/4nppp/8/4Pb2/8/1P5P/r1PR4/3R3K w - - 0 27". Which move is...
  Expected: MoveA:d2d8
  Generated: system
Classify the better move. Output format: MoveA or MoveB
user
The FEN of the given chess board is "1r4k1/4nppp/8/4Pb2/8/1P5P/r1PR4/3R3K w - - 0 27". Which move is better? MoveA:d2d8, Adjust the piece to a key area, where it holds more influence over the board. TacticA: d2d8 b8d8 d1d8 Checkmate!  MoveB:d2d7, Switch the piece to a more advantageous place, increasing its mastery over the board. TacticB: d2d7 f5d7 Trade the lower value piece for a higher value piece. 
assistant
<think>

</think>

MoveA
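Before training, it can help to check how the labels are distributed and how often the teacher's answer agrees with the expected one. The sketch below assumes the record layout shown above (`expected_output` such as `MoveA:d2d8`, and the teacher's answer on the last non-empty line of `response.generated_text`). The helper names and the toy records are hypothetical.

```python
from collections import Counter

def expected_label(sample):
    """'MoveA:d2d8' -> 'MoveA'."""
    return sample["expected_output"].split(":", 1)[0].strip()

def teacher_label(sample):
    """Last non-empty line of the generated text, e.g. 'MoveA'."""
    text = sample["response"]["generated_text"]
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def summarize(samples):
    """Return (label counts, teacher agreements, number of valid samples)."""
    valid = [s for s in samples if "error" not in s]
    counts = Counter(expected_label(s) for s in valid)
    agreements = sum(teacher_label(s) == expected_label(s) for s in valid)
    return counts, agreements, len(valid)

# Toy records in the same shape as data/chess_output.json (contents made up).
toy = [
    {"expected_output": "MoveA:d2d8",
     "response": {"generated_text": "assistant\n<think>\n\n</think>\n\nMoveA\n"}},
    {"expected_output": "MoveB:e2e4",
     "response": {"generated_text": "assistant\n<think>\n\n</think>\n\nMoveA\n"}},
    {"error": "timeout"},
]
counts, agreements, n_valid = summarize(toy)
print(dict(counts), f"teacher agrees on {agreements}/{n_valid}")
```

To run this on the real data, pass the `chess_data` list loaded in the cell above to `summarize`.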


## Run Training

Execute the distributed training using `torchrun`.

**Note**: First run will compile the model (~20-30 minutes). Subsequent runs use cached compilation.

**Training Process:**
1. **Compilation** (first run only): Neuron compiler optimizes model for Trainium
2. **Training**: Student learns from teacher logits
3. **Checkpointing**: Trainer output is written to `OUTPUT_DIR`; the final model is saved to `./final_chess_model`

**Expected Time:**
- Compilation: ~20-30 minutes (one-time)
- Training (100 samples, 3 epochs): ~10-15 minutes
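The next cell assembles the `torchrun` command as one shell string. As an alternative, the same command can be built as an argument list, which avoids shell quoting issues and is easier to inspect. This is a sketch only: `build_torchrun_args` is a hypothetical helper, only a subset of the lab's flags is shown, and the launch line is left commented out.

```python
import shlex

def build_torchrun_args(processes, script, **flags):
    """Build a torchrun argument list from keyword flags."""
    args = ["torchrun", "--nproc_per_node", str(processes), script]
    for name, value in flags.items():
        flag = f"--{name}"
        if value is True:
            args.append(flag)                 # bare switch such as --bf16
        else:
            args.extend([flag, str(value)])   # valued flag, including "--zero_1 False"
    return args

args = build_torchrun_args(
    2, "src/distill_chess_neuron_torchrun.py",
    model_id="Qwen/Qwen3-0.6B",
    dataset_path="data/chess_output.json",
    temperature=4.0,
    alpha=0.7,
    do_train=True,
    bf16=True,
    zero_1=False,
    tensor_parallel_size=2,
)
print(shlex.join(args))

# To launch from Python instead of a notebook shell escape:
# import subprocess
# subprocess.run(args, check=True)
```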

In [None]:
# Build the training command
training_cmd = f"""
torchrun  \\
    --nproc_per_node {PROCESSES_PER_NODE} \\
    src/distill_chess_neuron_torchrun.py \\
    --model_id {MODEL_NAME} \\
    --dataset_path {DATASET_PATH} \\
    --output_model_path ./final_chess_model \\
    --temperature {TEMPERATURE} \\
    --alpha {ALPHA} \\
    --num_train_epochs {NUM_EPOCHS} \\
    --do_train \\
    --max_steps {MAX_STEPS} \\
    --per_device_train_batch_size {BS} \\
    --gradient_accumulation_steps {GRADIENT_ACCUMULATION_STEPS} \\
    --learning_rate 1e-4 \\
    --bf16 \\
    --zero_1 False \\
    --tensor_parallel_size {TP_DEGREE} \\
    --warmup_steps 5 \\
    --pipeline_parallel_size 1 \\
    --logging_steps {LOGGING_STEPS} \\
    --output_dir {OUTPUT_DIR} \\
    --overwrite_output_dir
"""

print("Starting training...")
print("This will take ~30-45 minutes on first run (includes compilation)")
print("\nCommand:")
print(training_cmd)

# Run training
!{training_cmd}

Starting training...
This will take ~30-45 minutes on first run (includes compilation)

Command:

torchrun  \
    --nproc_per_node 2 \
    src/distill_chess_neuron_torchrun.py \
    --model_id Qwen/Qwen3-0.6B \
    --dataset_path data/chess_output.json \
    --output_model_path ./final_chess_model \
    --temperature 4.0 \
    --alpha 0.7 \
    --num_train_epochs 3 \
    --do_train \
    --max_steps -1 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --learning_rate 1e-4 \
    --bf16 \
    --zero_1 False \
    --tensor_parallel_size 2 \
    --warmup_steps 5 \
    --pipeline_parallel_size 1 \
    --logging_steps 1 \
    --output_dir Qwen3-0.6B-chess-finetuned \
    --overwrite_output_dir



W1107 03:29:59.097000 56674 torch/distributed/run.py:774] 
W1107 03:29:59.097000 56674 torch/distributed/run.py:774] *****************************************
W1107 03:29:59.097000 56674 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1107 03:29:59.097000 56674 torch/distributed/run.py:774] *****************************************
  from .mappings import (
  component, error = import_nki(config)

## Consolidate the Shards

The training script saves the distilled model as a sharded checkpoint: each model-parallel worker is responsible for saving its own shard of the weights. To use the model for inference, the shards must first be consolidated into a single checkpoint.

In [7]:
!optimum-cli neuron consolidate ./final_chess_model ./final_chess_model

  from .mappings import (
  component, error = import_nki(config)
Consolidating checkpoints from ./final_chess_model to the safetensors format...
Consolidated checkpoint saved at ./final_chess_model


## Training Results

Check the training output and saved model.

In [8]:
# Check if model was saved
final_model_path = "./final_chess_model"

if Path(final_model_path).exists():
    print(f"✓ Model saved to {final_model_path}")
    print(f"\nModel files:")
    !ls -lh {final_model_path}
else:
    print(f"✗ Model not found at {final_model_path}")
    print("Training may have failed. Check the output above for errors.")

✓ Model saved to ./final_chess_model

Model files:
total 1.2G
-rw-r--r-- 1 ubuntu ubuntu  707 Nov  6 22:58 added_tokens.json
-rw-r--r-- 1 ubuntu ubuntu 4.1K Nov  6 22:58 chat_template.jinja
-rw-r--r-- 1 ubuntu ubuntu 1.4K Nov  6 22:58 config.json
-rw-r--r-- 1 ubuntu ubuntu 1.6M Nov  6 22:58 merges.txt
-rw-r--r-- 1 ubuntu ubuntu 1.2G Nov  7 01:01 model.safetensors
drwxr-xr-x 4 ubuntu ubuntu 4.0K Nov  6 22:58 shards
-rw-r--r-- 1 ubuntu ubuntu  613 Nov  6 22:58 special_tokens_map.json
-rw-r--r-- 1 ubuntu ubuntu  11M Nov  6 22:58 tokenizer.json
-rw-r--r-- 1 ubuntu ubuntu 5.3K Nov  6 22:58 tokenizer_config.json
-rw-r--r-- 1 ubuntu ubuntu 4.1K Nov  6 22:58 training_args.bin
-rw-r--r-- 1 ubuntu ubuntu  515 Nov  6 22:58 trn_config.json
-rw-r--r-- 1 ubuntu ubuntu 2.7M Nov  6 22:58 vocab.json


## Summary

You have successfully:
- ✓ Loaded chess move evaluation dataset with teacher logits
- ✓ Configured knowledge distillation training
- ✓ Trained a 0.6B student model from a 30B teacher
- ✓ Saved the trained model for inference

**Next Steps:**
- Proceed to Lab 2 to test the trained model
- Compare student vs teacher predictions
- Measure inference speed improvements

**Model Compression:**
- Teacher: 30B parameters
- Student: 0.6B parameters
- **Reduction**: 50x smaller!

