# Lab 0: Generate Teacher Logits

## Introduction

In this lab, you will learn how to generate teacher model logits for knowledge distillation using AWS Neuron. Knowledge distillation is a model compression technique where a smaller "student" model learns to mimic the behavior of a larger "teacher" model by training on the teacher's output distribution (its logits, or the probabilities derived from them) rather than just the hard labels.
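To make the difference between hard labels and soft targets concrete, here is a minimal sketch in plain Python. The three logit values are invented for illustration and do not come from the teacher model; the `softmax` helper is a hypothetical name, not part of the lab code.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the three sentiment labels (illustrative values only).
labels = ["POSITIVE", "NEGATIVE", "NEUTRAL"]
teacher_logits = [4.0, 1.0, 2.5]

# A hard label keeps only the winner and discards everything else.
hard_label = labels[teacher_logits.index(max(teacher_logits))]

# Soft targets keep the relative preference between all labels.
soft_targets = softmax(teacher_logits)
softer_targets = softmax(teacher_logits, temperature=2.0)

print(hard_label)
print([round(p, 3) for p in soft_targets])
print([round(p, 3) for p in softer_targets])
```

The soft targets show that the teacher considers NEUTRAL a closer alternative than NEGATIVE, which is information a hard label cannot carry.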

You will use the [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) model as your teacher model - a 30 billion parameter Mixture-of-Experts (MoE) model that activates only about 3 billion parameters per token, which keeps inference cost well below that of a dense model of the same size. This teacher model will process a sentiment classification dataset and generate logits that capture the model's "soft" predictions across all possible output tokens.

The generated logits will be saved to a JSON file and used in Lab 1 to train a smaller student model that can achieve similar performance with significantly fewer parameters and lower inference costs.

**Key Concepts:**
- **Teacher Model**: A large, high-performance model (Qwen3-30B-A3B) that generates training signals
- **Logits**: Raw prediction scores before softmax, containing richer information than hard labels
- **Neuron Compilation**: Converting PyTorch models to run efficiently on AWS Trainium/Inferentia accelerators
- **Tensor Parallelism**: Sharding the weight tensors inside each layer across multiple NeuronCores so that a large model fits in memory and every core computes part of each layer
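As a toy illustration of the tensor parallelism idea, the sketch below splits one weight matrix column-wise into two shards, computes each partial result separately, and concatenates them. It is plain Python with made-up numbers and hypothetical helper names; it only illustrates the principle and is not how the Neuron SDK implements sharding.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, num_shards):
    """Split a matrix into num_shards column blocks, one per (hypothetical) core."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    width = cols // num_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(num_shards)]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Reference result computed on a single device.
full = matmul(x, w)

# Each shard holds half of the columns and produces half of the output.
shards = split_columns(w, num_shards=2)
partials = [matmul(x, shard) for shard in shards]

# Concatenating the partial outputs recovers the full result.
gathered = [value for part in partials for value in part]
print(full == gathered)
```

Each shard stores only half of the weights, which is why a larger `tp_degree` lets a model that is too big for one core fit across several.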

**Prerequisites:**
- AWS EC2 instance with Trainium/Inferentia (e.g., trn1.32xlarge or inf2.48xlarge)
- AWS Neuron SDK installed (see setup.sh in the repository root)
- Sufficient disk space for model compilation artifacts
- Dataset file at `data/dataset.txt` with one text sample per line
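Before starting, it can help to confirm that the dataset file exists and looks as expected. The following is a small sketch using only the standard library; `load_samples` and `describe_dataset` are hypothetical helper names that are not part of the lab repository.

```python
from pathlib import Path

def load_samples(path):
    """Return the non-empty, stripped lines of a one-sample-per-line text file."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]

def describe_dataset(path):
    """Summarize the dataset, or report that the file is missing."""
    p = Path(path)
    if not p.is_file():
        return f"{path} not found"
    samples = load_samples(p)
    longest = max((len(s) for s in samples), default=0)
    return f"{len(samples)} samples, longest is {longest} characters"

print(describe_dataset("data/dataset.txt"))
```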

## Download the model weights

First, download the teacher model weights from HuggingFace, using the HuggingFace CLI. The model detail page can be found here: [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)


In [None]:
!hf download Qwen/Qwen3-30B-A3B

## Import Dependencies

Import the required libraries for running inference with the Qwen3 MoE model on AWS Neuron hardware:

- **torch**: PyTorch framework for tensor operations and model execution
- **json**: For saving logits and results to structured JSON files
- **transformers**: Hugging Face library providing the tokenizer and generation utilities
- **neuronx_distributed_inference**: AWS Neuron SDK package for distributed inference on Trainium/Inferentia
  - `MoENeuronConfig`: Configuration for Mixture-of-Experts models on Neuron
  - `OnDeviceSamplingConfig`: Controls sampling behavior (temperature, top-k, top-p) on the Neuron device
  - `Qwen3MoeInferenceConfig`: Qwen3-specific inference configuration
  - `NeuronQwen3MoeForCausalLM`: Neuron-optimized Qwen3 model implementation
  - `HuggingFaceGenerationAdapter`: Adapter to use Hugging Face generation API with Neuron models

The random seed is set for reproducibility of generation results.

In [None]:
import torch
import json

from transformers import AutoTokenizer, GenerationConfig
from neuronx_distributed_inference.models.config import MoENeuronConfig, OnDeviceSamplingConfig
from neuronx_distributed_inference.models.qwen3_moe.modeling_qwen3_moe import Qwen3MoeInferenceConfig, NeuronQwen3MoeForCausalLM
from neuronx_distributed_inference.utils.hf_adapter import HuggingFaceGenerationAdapter, load_pretrained_config

torch.manual_seed(0)

## Configuration

Define the paths and file locations for this lab:

- **model_path**: Hugging Face model identifier for the Qwen3-30B-A3B teacher model. This will be downloaded from the Hugging Face Hub on first use.
- **traced_model_path**: Local directory where the Neuron-compiled model artifacts will be saved. The compiled model can be reused for subsequent runs, significantly reducing startup time.
- **dataset_file**: Path to the input dataset containing text samples (one per line) that will be processed by the teacher model. Each line should contain a single text sample for sentiment classification.
- **output_file**: Path where the generated logits and responses will be saved in JSON format

**Dataset Examples:**

The dataset contains various text samples with different sentiments:
- *Positive*: "The service at this restaurant exceeded all my expectations!"
- *Negative*: "I can't believe how rude the staff was today."
- *Neutral*: "The weather is 72 degrees with partial clouds."
- *Negative*: "My flight was delayed for the third time this week."
- *Neutral*: "The package arrived on schedule as expected."
- *Positive*: "This phone's battery life is absolutely amazing!"

**Note:** The first time you run this notebook, the model will be compiled for Neuron hardware. This is a one-time operation that optimizes the model for efficient execution on Trainium/Inferentia. Subsequent runs will load the pre-compiled model from `traced_model_path`.

In [None]:
model_path = "Qwen/Qwen3-30B-A3B"
traced_model_path = "/home/ubuntu/traced_model/Qwen3-30B-A3B/"
dataset_file = "data/dataset.txt"
output_file = "output.json"

## Create Conversation Template

Define a function to format input text as a structured conversation for the sentiment classification task.

The Qwen3 model expects inputs in a chat format with distinct roles (system, user, assistant). This function:
1. Creates a system message that defines the task: classifying sentiment as POSITIVE, NEGATIVE, or NEUTRAL
2. Formats the input sample as a user message
3. Returns a conversation list that will be processed by the tokenizer's chat template

This structured format helps the model understand its role and produce consistent, focused outputs. The system message explicitly instructs the model to return only the sentiment label, which is important for generating clean training data for the student model.

**Example Conversation Structure:**

For the input text: "This phone's battery life is absolutely amazing!"

The function creates:
```python
[
  {
    "role": "system",
    "content": "You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment."
  },
  {
    "role": "user",
    "content": "This phone's battery life is absolutely amazing!"
  }
]
```

Expected teacher model response: "POSITIVE"
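To give a feel for what the chat template does with this conversation list, here is a simplified sketch of a ChatML-style rendering, the general layout Qwen models use. `render_chatml` is a hypothetical helper written for illustration; the tokenizer's real template is the source of truth and may emit additional tokens (for example, for thinking-mode handling).

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate a ChatML-style prompt; the real template lives in the tokenizer."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

conversation = [
    {"role": "system", "content": "You are a sentiment classifier. Only return the sentiment."},
    {"role": "user", "content": "This phone's battery life is absolutely amazing!"},
]

prompt = render_chatml(conversation)
print(prompt)
```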

In [None]:
def create_conversation(sample):
    system_message = (
        "You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment."
    )
    return [
        {
            "role": "system",
            "content": system_message,
        },
        {
            "role": "user",
            "content": sample
        },
    ]

## Initialize Model Configuration

Configure the Neuron-specific settings for optimal inference with the Qwen3 MoE model. This configuration is critical for both performance and functionality:

**MoENeuronConfig Parameters:**
- **tp_degree=8**: Tensor parallelism degree - distributes the model across 8 NeuronCores. For a 30B parameter model, this parallelism is essential for fitting the model in memory and achieving good throughput. Adjust based on your instance type (e.g., trn1.32xlarge has 32 NeuronCores).
- **batch_size=1**: Process one sample at a time. For logit generation, this keeps the processing loop simple at the cost of throughput.
- **max_context_length=128**: Maximum input sequence length in tokens. Shorter contexts reduce memory usage and compilation time.
- **seq_len=1024**: Maximum total sequence length (input + output). This determines the buffer size for generation.
- **on_device_sampling_config**: Controls generation behavior on the Neuron device:
  - `do_sample=True`: Enable sampling (vs greedy decoding) for more diverse outputs
  - `temperature=0.6`: Values below 1.0 sharpen the distribution (more focused predictions); values above 1.0 flatten it
  - `top_k=20`: Consider only the top 20 most likely tokens at each step
  - `top_p=0.95`: Nucleus sampling - consider tokens with cumulative probability up to 95%
- **enable_bucketing=False**: Disable dynamic sequence length bucketing for consistent performance
- **flash_decoding_enabled=False**: Disable flash decoding, a long-sequence decoding optimization that is not needed for the short prompts and outputs in this lab
- **output_scores=True**: Return generation scores for each token
- **output_logits=True**: **CRITICAL** - Return raw logits for knowledge distillation. This is the key output we need for training the student model.

**Tokenizer Configuration:**
- **padding_side="right"**: Add padding tokens to the right side of sequences
- **pad_token = eos_token**: Use the end-of-sequence token for padding (Qwen3 doesn't have a dedicated pad token)

These settings balance inference quality, memory efficiency, and the specific requirements of knowledge distillation.
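The interaction of `temperature`, `top_k`, and `top_p` can be sketched in plain Python. This is a simplified illustration of the usual filtering order (temperature, then top-k, then top-p), not the Neuron on-device implementation; `top_k_top_p_filter` is a hypothetical name and the logits are invented.

```python
import math

def top_k_top_p_filter(logits, top_k, top_p, temperature=1.0):
    """Return scaled logits with every filtered-out token set to -inf."""
    scaled = [x / temperature for x in logits]
    # Rank token positions from most to least likely and keep the top_k best.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    candidates = order[:top_k]
    # Softmax over the surviving candidates only.
    peak = scaled[candidates[0]]
    exps = [math.exp(scaled[i] - peak) for i in candidates]
    total = sum(exps)
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept = set()
    cumulative = 0.0
    for i, e in zip(candidates, exps):
        kept.add(i)
        cumulative += e / total
        if cumulative >= top_p:
            break
    return [scaled[i] if i in kept else -math.inf for i in range(len(scaled))]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
filtered = top_k_top_p_filter(toy_logits, top_k=3, top_p=0.8)
print(filtered)
```

Tokens removed by either filter end up at `-inf`, which is why only a small number of finite values remain per generation step.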

In [None]:
generation_config = GenerationConfig.from_pretrained(model_path)

neuron_config = MoENeuronConfig(
    tp_degree=8,
    batch_size=1,
    max_context_length=128,
    seq_len=1024,
    on_device_sampling_config=OnDeviceSamplingConfig(do_sample=True, temperature=0.6, top_k=20, top_p=0.95),
    enable_bucketing=False,
    flash_decoding_enabled=False,
    output_scores=True,
    output_logits=True
)

config = Qwen3MoeInferenceConfig(
    neuron_config,
    load_config=load_pretrained_config(model_path),
)

tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="right")
tokenizer.pad_token = tokenizer.eos_token

## Compile and Save Model

Compile the model for AWS Neuron hardware. This is a critical optimization step that converts the PyTorch model into a Neuron-optimized format.

**What happens during compilation:**
1. The model is loaded from Hugging Face Hub (if not already cached locally)
2. The Neuron compiler analyzes the model architecture and applies hardware-specific optimizations
3. Model weights are sharded across the specified number of NeuronCores (tp_degree=8)
4. The compiled model artifacts (NEFF files) are saved to `traced_model_path`
5. The tokenizer is also saved to the same directory for convenience

**Important Notes:**
- **First-time compilation takes 30-60 minutes** depending on model size and instance type. This is normal!
- Compilation is a one-time cost - subsequent runs will load the pre-compiled model in seconds
- The compiled model is specific to the configuration (tp_degree, batch_size, seq_len, etc.). Changing these parameters requires recompilation.
- Ensure you have sufficient disk space (~50GB) for compilation artifacts
- If compilation fails, check the compiler output in the notebook cell (and any log files the Neuron compiler writes on the instance) for detailed error messages

**Tip:** If you already have a compiled model at `traced_model_path`, you can skip this cell and proceed directly to the "Load Compiled Model" section.
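One way to decide whether the compilation cell can be skipped is to check whether the output directory already contains files. The sketch below uses a hypothetical helper, `has_compiled_artifacts`; it only checks that the directory is non-empty and does not verify that the artifacts match the current configuration.

```python
from pathlib import Path

def has_compiled_artifacts(traced_model_path):
    """Return True if the directory exists and contains at least one file."""
    root = Path(traced_model_path)
    return root.is_dir() and any(p.is_file() for p in root.rglob("*"))

# Same path as in the Configuration section of this lab.
traced_model_path = "/home/ubuntu/traced_model/Qwen3-30B-A3B/"

if has_compiled_artifacts(traced_model_path):
    print("Found existing artifacts; the compile step can likely be skipped.")
else:
    print("No artifacts found; run the compile step.")
```

If you changed `tp_degree`, `batch_size`, `seq_len`, or another compile-time setting since the artifacts were produced, recompile regardless of what this check reports.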

In [None]:
print("\nCompiling and saving model...")
model = NeuronQwen3MoeForCausalLM(model_path, config)
model.compile(traced_model_path)
tokenizer.save_pretrained(traced_model_path)

## Load Compiled Model

Load the pre-compiled Neuron model from disk for fast inference.

This step loads the compiled model artifacts (NEFF files) that were saved in the previous step. Loading a compiled model is much faster than compilation - typically taking only a few seconds.

The model is loaded onto the NeuronCores and is ready for inference. The tokenizer is also loaded from the same directory to ensure consistency with the model's vocabulary and special tokens.

**Note:** If you're running this notebook for the first time, make sure you've completed the "Compile and Save Model" step above. If you're restarting the notebook or running on a different instance with the compiled model already available, you can start from this cell.

In [None]:
model = NeuronQwen3MoeForCausalLM(traced_model_path)
model.load(traced_model_path)
tokenizer = AutoTokenizer.from_pretrained(traced_model_path)

## Process Dataset and Generate Logits

Process each text sample in the dataset through the teacher model to generate logits for knowledge distillation.

**Processing Pipeline:**

For each line in the dataset file:

1. **Format Input**: Convert the raw text into a structured conversation using the `create_conversation()` function
2. **Apply Chat Template**: Use the tokenizer's chat template to format the conversation with proper special tokens and structure
3. **Tokenize**: Convert the formatted text into token IDs that the model can process
4. **Generate**: Run inference with the Neuron model to generate the sentiment classification
   - `return_dict_in_generate=True`: Return detailed generation outputs
   - `output_scores=True`: Include per-token scores
   - `output_logits=True`: Include raw logits (essential for distillation)
5. **Extract Logits**: Process the raw logits for each generated token:
   - Filter out `-inf` values (these mark tokens excluded by the top-k/top-p sampling filters)
   - Store both the token indices and their corresponding logit values
   - This sparse representation saves significant storage space
6. **Decode Output**: Convert generated token IDs back to human-readable text
7. **Store Results**: Save the prompt, generated text, and token logits to the results list

**Example Processing:**

Input: "Our team just won the championship - best day ever!"

Output structure:
```python
{
  "prompt": "Our team just won the championship - best day ever!",
  "response": {
    "generated_text": "POSITIVE",
    "token_logits": [
      {
        "indices": [12345, 23456, 34567, ...],  # Token IDs with finite logits
        "logits": [8.2, 7.5, 6.8, ...]           # Corresponding logit values
      }
    ]
  }
}
```

**Error Handling:**
If any sample fails to process (e.g., due to length constraints or generation issues), the error is caught and logged, allowing the pipeline to continue with remaining samples.

**Why Filter Infinite Logits?**
The model's vocabulary contains ~150K tokens, but the sampling filters (`top_k=20`, `top_p=0.95`) typically keep only a handful of candidates at each generation step and set the score of every other token to `-inf`. By dropping those `-inf` entries, we:
- Reduce storage requirements by 95%+
- Speed up student model training (fewer values to process)
- Keep every value the filters left finite, that is, the candidate tokens the teacher could actually sample at that step
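As a sketch of how a consumer of this sparse format might turn one stored entry back into a probability distribution, the snippet below applies a softmax over the finite logits and treats every token that is not listed as having probability zero. `sparse_to_probs` is a hypothetical helper and the values are the illustrative ones from the example above; Lab 1 may handle this differently.

```python
import math

def sparse_to_probs(entry, temperature=1.0):
    """Softmax over the stored logits; tokens not listed implicitly get probability 0."""
    scaled = [x / temperature for x in entry["logits"]]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return {idx: e / total for idx, e in zip(entry["indices"], exps)}

# One per-token entry in the format produced by this lab (illustrative values).
entry = {"indices": [12345, 23456, 34567], "logits": [8.2, 7.5, 6.8]}

probs = sparse_to_probs(entry)
for token_id, p in probs.items():
    print(token_id, round(p, 3))
```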

**Expected Runtime:**
Processing time depends on dataset size and generation length. For a dataset with 100 samples, expect ~5-10 minutes of processing time.

In [None]:
# Wrap the Neuron model once so it exposes the Hugging Face generate() API
generation_model = HuggingFaceGenerationAdapter(model)

results = []
with open(dataset_file, 'r') as f:
    for line in f:
        if line.strip():
            try:
                input_text = create_conversation(line.strip())
                formatted_chat = tokenizer.apply_chat_template(
                    input_text,
                    tokenize=False,
                    add_generation_prompt=True,
                    enable_thinking=False
                )
                inputs = tokenizer(formatted_chat, padding=True, return_tensors="pt")
                outputs = generation_model.generate(
                    inputs.input_ids,
                    generation_config=generation_config,
                    attention_mask=inputs.attention_mask,
                    max_length=model.config.neuron_config.max_length,
                    return_dict_in_generate=True,
                    output_scores=True,
                    output_logits=True
                )
                
                print(outputs)
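                # sequences holds prompt + completion, so keep only the newly generated tokens
                prompt_length = inputs.input_ids.shape[1]
                generated_tokens = outputs.sequences[0][prompt_length:]
                # scores holds one tensor per generated token; filtered-out tokens are -inf
                token_logits = outputs.scores
                generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=False)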
                print(generated_text)
                
                token_logits_list = []
                for logits in token_logits:
                    finite_mask = torch.isfinite(logits[0])
                    # squeeze(-1) keeps a 1-D tensor, so tolist() returns a list even for a single index
                    finite_indices = torch.nonzero(finite_mask).squeeze(-1).tolist()
                    finite_logits = logits[0][finite_mask]
                    token_info = {
                        'indices': finite_indices,
                        'logits': finite_logits.tolist()
                    }
                    token_logits_list.append(token_info)
                
                print(token_logits_list)
                results.append({
                    'prompt': line.strip(),
                    'response': {
                        'generated_text': generated_text,
                        'token_logits': token_logits_list
                    }
                })
            except Exception as e:
                print(f"Error processing prompt: {line[:50]}...")
                print(f"Error message: {str(e)}")
                results.append({
                    'prompt': line.strip(),
                    'error': str(e)
                })

## Save Results

Write the generated logits and responses to a JSON file for use in Lab 1 (distillation training).

The output JSON file contains an array of objects, where each object represents one processed sample:

```json
[
  {
    "prompt": "The service at this restaurant exceeded all my expectations!",
    "response": {
      "generated_text": "POSITIVE",
      "token_logits": [
        {
          "indices": [12345, 23456, 34567, 45678, 56789],
          "logits": [8.5, 7.2, 6.1, 5.8, 5.3]
        }
      ]
    }
  },
  {
    "prompt": "I can't believe how rude the staff was today.",
    "response": {
      "generated_text": "NEGATIVE",
      "token_logits": [
        {
          "indices": [98765, 87654, 76543, 65432, 54321],
          "logits": [9.1, 8.3, 7.5, 6.9, 6.2]
        }
      ]
    }
  },
  {
    "prompt": "The weather is 72 degrees with partial clouds.",
    "response": {
      "generated_text": "NEUTRAL",
      "token_logits": [
        {
          "indices": [11111, 22222, 33333, 44444, 55555],
          "logits": [7.8, 7.1, 6.5, 5.9, 5.4]
        }
      ]
    }
  }
]
```

**Output Structure:**
- **prompt**: The original input text from the dataset
- **response.generated_text**: The teacher model's generated sentiment classification (POSITIVE, NEGATIVE, or NEUTRAL)
- **response.token_logits**: Array of logit distributions for each generated token
  - **indices**: Token IDs that have finite (valid) logit values
  - **logits**: The corresponding logit values for those tokens

**Next Steps:**
This output file will be used in Lab 1 to train a smaller student model. The student model will learn to match the teacher's logit distributions, effectively transferring the teacher's knowledge to a more efficient model.

**File Size:**
Even with filtering, the output file can be large (several MB to GB depending on dataset size). Ensure you have sufficient disk space and consider the file size when transferring or storing the data.

In [None]:
with open(output_file, 'w') as f:
    json.dump(results, f, indent=2)

print(f"Processing complete! Processed {len(results)} prompts. Results written to {output_file}")
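After the file has been written, a quick sanity check can confirm how many samples succeeded and which labels the teacher produced. The sketch below runs on a few in-memory records in the same shape as the output file; `summarize_results` is a hypothetical helper, and the records are invented for illustration.

```python
import json

def summarize_results(results):
    """Count successful and failed samples and tally the generated labels."""
    ok = [r for r in results if "response" in r]
    failed = [r for r in results if "error" in r]
    label_counts = {}
    for r in ok:
        label = r["response"]["generated_text"].strip()
        label_counts[label] = label_counts.get(label, 0) + 1
    return {"ok": len(ok), "failed": len(failed), "labels": label_counts}

# Invented records that mimic the structure of the output file.
sample_results = [
    {"prompt": "Great service!", "response": {
        "generated_text": "POSITIVE",
        "token_logits": [{"indices": [1, 2], "logits": [8.5, 7.2]}]}},
    {"prompt": "Rude staff.", "response": {
        "generated_text": "NEGATIVE",
        "token_logits": [{"indices": [3, 4], "logits": [9.1, 8.3]}]}},
    {"prompt": "A sample that failed.", "error": "hypothetical failure"},
]

# Round-trip through JSON to mirror reading the file back from disk.
loaded = json.loads(json.dumps(sample_results))
summary = summarize_results(loaded)
print(summary)
```

To run the same check on the real output, load it with `json.load` from `output_file` and pass the resulting list to `summarize_results`.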