# Lab 0: Generate Teacher Logits

This notebook generates teacher model logits for knowledge distillation. The teacher model (Qwen3-30B-A3B) processes a dataset and outputs logits that will be used to train a smaller student model.
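
To make the purpose of the saved logits concrete, here is a minimal sketch of a distillation loss in plain Python. It is illustrative only and is not the training code used in later labs: the function names `softmax` and `kd_loss` and the temperature of 2.0 are assumptions, and a real implementation would operate on tensors.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [4.0, 1.0, 0.5]
print(kd_loss(teacher, [4.0, 1.0, 0.5]))  # identical logits give zero loss
print(kd_loss(teacher, [0.5, 1.0, 4.0]))  # mismatched logits give a positive loss
```

The student is trained to drive this quantity toward zero, which is why the teacher's per-token logits need to be stored alongside the generated text.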

## Import Dependencies

Import required libraries for model inference, tokenization, and Neuron-specific configurations.

In [1]:
import json

import torch
from transformers import AutoTokenizer, GenerationConfig
from neuronx_distributed_inference.models.config import MoENeuronConfig, OnDeviceSamplingConfig
from neuronx_distributed_inference.models.qwen3_moe.modeling_qwen3_moe import Qwen3MoeInferenceConfig, NeuronQwen3MoeForCausalLM
from neuronx_distributed_inference.utils.hf_adapter import HuggingFaceGenerationAdapter, load_pretrained_config

torch.manual_seed(0)


<torch._C.Generator at 0x754d77fc3cd0>

## Configuration

Set model paths and file locations. The teacher model will be compiled and saved to the traced model path for efficient inference on AWS Neuron.

In [8]:
model_path = "Qwen/Qwen3-30B-A3B"
traced_model_path = "/home/ubuntu/traced_model/Qwen3-30B-A3B/"
dataset_file = "data/dataset.txt"
output_file = "output.json"

## Create Conversation Template

Define a function to format input text as a conversation for the sentiment classification task.

In [3]:
def create_conversation(sample):
    system_message = (
        "You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment."
    )
    return [
        {
            "role": "system",
            "content": system_message,
        },
        {
            "role": "user",
            "content": sample
        },
    ]
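
As a rough illustration of what happens to this message list later, the sketch below renders it in a ChatML-like layout. This is an approximation under assumptions: `render_chatml` is a hypothetical helper, and the template shipped with the tokenizer (applied via `apply_chat_template`) is the authoritative one and may add further tokens.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate ChatML rendering; the tokenizer's own template is authoritative."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a sentiment classifier."},
    {"role": "user", "content": "The package arrived on schedule as expected."},
]
rendered = render_chatml(messages)
print(rendered)
```

The point of the sketch is only that each dataset line ends up as the user turn of a two-message conversation, followed by an open assistant turn.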

## Initialize Model Configuration

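Configure the Neuron-specific settings for the Qwen3 MoE model:
- Tensor parallelism degree: 8 (distributes the model across 8 NeuronCores)
- Context and sequence length: prompts of up to 128 tokens, 1024 tokens in total
- Score and logit output enabled for distillation
- On-device sampling with temperature 0.6, top-k 20, and top-p 0.95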

In [4]:
generation_config = GenerationConfig.from_pretrained(model_path)

neuron_config = MoENeuronConfig(
    tp_degree=8,
    batch_size=1,
    max_context_length=128,
    seq_len=1024,
    on_device_sampling_config=OnDeviceSamplingConfig(do_sample=True, temperature=0.6, top_k=20, top_p=0.95),
    enable_bucketing=False,
    flash_decoding_enabled=False,
    output_scores=True,
    output_logits=True
)

config = Qwen3MoeInferenceConfig(
    neuron_config,
    load_config=load_pretrained_config(model_path),
)

tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="right")
tokenizer.pad_token = tokenizer.eos_token
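
The `top_k` and `top_p` settings above restrict sampling to a small set of candidate tokens. The sketch below shows the general technique in plain Python, with excluded tokens set to `-inf`. It is an assumption-laden illustration, not the Neuron on-device implementation: `filter_logits` is a hypothetical helper, it expects `top_k >= 1`, and it ignores temperature.

```python
import math

def filter_logits(logits, top_k, top_p):
    """Return a copy of logits with tokens outside top-k / top-p set to -inf."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = order[:top_k]  # top-k: keep the k largest logits
    peak = logits[kept[0]]
    exps = [math.exp(logits[i] - peak) for i in kept]
    total = sum(exps)
    cumulative, nucleus = 0.0, []
    for idx, e in zip(kept, exps):
        # top-p: keep the shortest prefix whose probability mass reaches top_p
        nucleus.append(idx)
        cumulative += e / total
        if cumulative >= top_p:
            break
    keep = set(nucleus)
    return [x if i in keep else -math.inf for i, x in enumerate(logits)]

filtered = filter_logits([5.0, 4.0, 1.0, 0.0, -2.0], top_k=3, top_p=0.95)
print(filtered)  # [5.0, 4.0, -inf, -inf, -inf]
```

Filtering of this kind is one plausible reason why per-step scores contain `-inf` entries that a later step has to drop.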



## Compile and Save Model

Compile the model for AWS Neuron hardware. This step converts the model to a Neuron-optimized format and saves it for reuse.

In [5]:
print("\nCompiling and saving model...")
model = NeuronQwen3MoeForCausalLM(model_path, config)
model.compile(traced_model_path)
tokenizer.save_pretrained(traced_model_path)

INFO:Neuron:Saving the neuron_config to /home/ubuntu/traced_model/Qwen3-30B-A3B/
INFO:Neuron:Generating HLOs for the following models: ['context_encoding_model', 'token_generation_model']



Compiling and saving model...
[2025-10-23 18:41:51.532: I neuronx_distributed/parallel_layers/parallel_state.py:630] > initializing tensor model parallel with size 8
[2025-10-23 18:41:51.533: I neuronx_distributed/parallel_layers/parallel_state.py:631] > initializing pipeline model parallel with size 1
[2025-10-23 18:41:51.534: I neuronx_distributed/parallel_layers/parallel_state.py:632] > initializing context model parallel with size 1
[2025-10-23 18:41:51.535: I neuronx_distributed/parallel_layers/parallel_state.py:633] > initializing data parallel with size 1
[2025-10-23 18:41:51.536: I neuronx_distributed/parallel_layers/parallel_state.py:634] > initializing world size to 8
[2025-10-23 18:41:51.536: I neuronx_distributed/parallel_layers/parallel_state.py:379] [rank_0_pp-1_tp-1_dp-1_cp-1] Chosen Logic for replica groups ret_logic=<PG_Group_Logic.LOGIC1: (<function ascending_ring_PG_group at 0x754b1f3b9cf0>, 'Ascending Ring PG Group')>

INFO:Neuron:Generating 1 hlos for key: context_encoding_model
INFO:Neuron:Started loading module context_encoding_model
INFO:Neuron:Finished loading module context_encoding_model in 0.6785175800323486 seconds
INFO:Neuron:generating HLO: context_encoding_model, input example shape = torch.Size([1, 128])


[2025-10-23 18:41:55.678: W neuronx_distributed/modules/moe/expert_mlps_v2.py:765] T 128 not divisible by block_size 512, cannot use index calc kernel.
INFO:Neuron:Finished generating HLO for context_encoding_model in 10.564492225646973 seconds, input example shape = torch.Size([1, 128])
INFO:Neuron:Generating 1 hlos for key: token_generation_model
INFO:Neuron:Started loading module token_generation_model
INFO:Neuron:Finished loading module token_generation_model in 0.6945993900299072 seconds
INFO:Neuron:generating HLO: token_generation_model, input example shape = torch.Size([1, 1])
INFO:Neuron:Finished generating HLO for token_generation_model in 8.787945032119751 seconds, input example shape = torch.Size([1, 1])
INFO:Neuron:Generated all HLOs in 21.071417570114136 seconds
INFO:Neuron:Removing 96 kernel weights from the frontend attributes
INFO:Neuron:Starting compilation 

2025-10-23 18:42:13.000145:  3258171  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.21.18209.0+043b1bf7/MODULE_de61ac37b4eeddad6841+9006d3c6/model.neff


INFO:Neuron:Done compilation for the priority HLO in 0.5761313438415527 seconds
INFO:Neuron:Updating the hlo module with optimized layout
INFO:Neuron:Done optimizing weight layout for all HLOs in 1.306518793106079 seconds
INFO:Neuron:Starting compilation for all HLOs
INFO:Neuron:Neuron compiler flags: --enable-saturate-infinity --enable-mixed-precision-accumulation --model-type transformer -O1 --tensorizer-options='--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=2' --auto-cast=none --internal-enable-dge-levels vector_dynamic_offsets --internal-hlo2tensorizer-options='--verify-hlo=true' --internal-hlo2tensorizer-options='--verify-hlo=true'  --logfile=/tmp/nxd_model/context_encoding_model/_tp0_bk0/log-neuron-cc.txt


2025-10-23 18:42:15.000868:  3258171  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.21.18209.0+043b1bf7/MODULE_9e722d575a2b2a33625f+2667e3c1/model.neff


INFO:Neuron:Finished Compilation for all HLOs in 1.5379798412322998 seconds


..

INFO:Neuron:Done preparing weight layout transformation


Completed run_backend_driver.

Compiler status PASS


INFO:Neuron:Finished building model in 55.69548773765564 seconds
INFO:Neuron:SKIPPING pre-sharding the checkpoints. The checkpoints will be sharded during load time.


('/home/ubuntu/traced_model/Qwen3-30B-A3B/tokenizer_config.json',
 '/home/ubuntu/traced_model/Qwen3-30B-A3B/special_tokens_map.json',
 '/home/ubuntu/traced_model/Qwen3-30B-A3B/chat_template.jinja',
 '/home/ubuntu/traced_model/Qwen3-30B-A3B/vocab.json',
 '/home/ubuntu/traced_model/Qwen3-30B-A3B/merges.txt',
 '/home/ubuntu/traced_model/Qwen3-30B-A3B/added_tokens.json',
 '/home/ubuntu/traced_model/Qwen3-30B-A3B/tokenizer.json')

## Load Compiled Model

Load the compiled model from disk for inference.

In [6]:
model = NeuronQwen3MoeForCausalLM(traced_model_path)
model.load(traced_model_path)
tokenizer = AutoTokenizer.from_pretrained(traced_model_path)

INFO:Neuron:Sharding weights on load...
INFO:Neuron:Sharding Weights for ranks: 0...7


[2025-10-23 18:43:14.892: I neuronx_distributed/parallel_layers/parallel_state.py:630] > initializing tensor model parallel with size 8
[2025-10-23 18:43:14.893: I neuronx_distributed/parallel_layers/parallel_state.py:631] > initializing pipeline model parallel with size 1
[2025-10-23 18:43:14.894: I neuronx_distributed/parallel_layers/parallel_state.py:632] > initializing context model parallel with size 1
[2025-10-23 18:43:14.895: I neuronx_distributed/parallel_layers/parallel_state.py:633] > initializing data parallel with size 1
[2025-10-23 18:43:14.896: I neuronx_distributed/parallel_layers/parallel_state.py:634] > initializing world size to 8
[2025-10-23 18:43:14.897: I neuronx_distributed/parallel_layers/parallel_state.py:379] [rank_0_pp-1_tp-1_dp-1_cp-1] Chosen Logic for replica groups ret_logic=<PG_Group_Logic.LOGIC1: (<function ascending_ring_PG_group at 0x754b1f3b9cf0>, 'Ascending Ring PG Group')>

Loading checkpoint shards:   0%|          | 0/16 [00:00<?, ?it/s]

INFO:Neuron:Done Sharding weights in 72.05030099023134
INFO:Neuron:Finished weights loading in 94.67891277838498 seconds
INFO:Neuron:Warming up the model.


2025-Oct-23 18:44:49.0575 3258171:3259075 [6] int nccl_net_ofi_create_plugin(nccl_net_ofi_plugin_t**):213 CCOM WARN NET/OFI Failed to initialize sendrecv protocol
2025-Oct-23 18:44:49.0585 3258171:3259075 [6] int nccl_net_ofi_create_plugin(nccl_net_ofi_plugin_t**):354 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Oct-23 18:44:49.0595 3258171:3259075 [6] ncclResult_t nccl_net_ofi_init_no_atexit_fini_v6(ncclDebugLogger_t):183 CCOM WARN NET/OFI Initializing plugin failed
2025-Oct-23 18:44:49.0605 3258171:3259075 [6] net_plugin.cc:97 CCOM WARN OFI plugin initNet() failed is EFA enabled?


INFO:Neuron:Warmup completed in 0.6640217304229736 seconds.


## Process Dataset and Generate Logits

Process each line in the dataset through the teacher model:
1. Format input as a conversation
2. Generate output with logits
3. Extract finite logits (filter out -inf values)
4. Save results with prompt, generated text, and token logits
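
Step 3 can be sketched without any tensor library. The helper below, `sparse_finite`, is a hypothetical name used only for illustration; it produces the same `indices` / `logits` layout that the notebook stores for each generated token.

```python
import math

def sparse_finite(row):
    """Keep only the finite entries of one logits row as parallel lists."""
    indices = [i for i, x in enumerate(row) if math.isfinite(x)]
    return {"indices": indices, "logits": [row[i] for i in indices]}

row = [-math.inf, 2.5, -math.inf, 0.25]
sparse = sparse_finite(row)
print(sparse)  # {'indices': [1, 3], 'logits': [2.5, 0.25]}
```

Storing only the finite entries keeps the output file small, since the full vocabulary row is mostly `-inf` once sampling filters have been applied.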

In [9]:
# Build the generation adapter once; it only wraps the already loaded model.
generation_model = HuggingFaceGenerationAdapter(model)

results = []
with open(dataset_file, 'r') as f:
    for line in f:
        if line.strip():
            try:
                input_text = create_conversation(line.strip())
                formatted_chat = tokenizer.apply_chat_template(
                    input_text,
                    tokenize=False,
                    add_generation_prompt=True,
                    enable_thinking=False
                )
                inputs = tokenizer(formatted_chat, padding=True, return_tensors="pt")
                outputs = generation_model.generate(
                    inputs.input_ids,
                    generation_config=generation_config,
                    attention_mask=inputs.attention_mask,
                    max_length=model.config.neuron_config.max_length,
                    return_dict_in_generate=True,
                    output_scores=True,
                    output_logits=True
                )
                
                print(outputs)
                generated_tokens = outputs.sequences[0]
                token_logits = outputs.scores
                generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=False)
                print(generated_text)
                
                token_logits_list = []
                for logits in token_logits:
                    finite_mask = torch.isfinite(logits[0])
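                    # flatten() always yields a list, even when only one entry is finite
                    # (squeeze() would collapse that case to a bare int).
                    finite_indices = torch.nonzero(finite_mask).flatten().tolist()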
                    finite_logits = logits[0][finite_mask]
                    token_info = {
                        'indices': finite_indices,
                        'logits': finite_logits.tolist()
                    }
                    token_logits_list.append(token_info)
                
                print(token_logits_list)
                results.append({
                    'prompt': line.strip(),
                    'response': {
                        'generated_text': generated_text,
                        'token_logits': token_logits_list
                    }
                })
            except Exception as e:
                print(f"Error processing prompt: {line.strip()[:50]}...")
                print(f"Error message: {str(e)}")
                results.append({
                    'prompt': line.strip(),
                    'error': str(e)
                })

HuggingFaceGenerationAdapter has generative capabilities, as `prepare_inputs_for_generation` is explicitly defined. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.

Error processing prompt: The service at this restaurant exceeded all my exp...
Error message: 'super' object has no attribute 'generate'
Error processing prompt: I can't believe how rude the staff was today.
...
Error message: 'super' object has no attribute 'generate'
Error processing prompt: The weather is 72 degrees with partial clouds.
...
Error message: 'super' object has no attribute 'generate'
Error processing prompt: My flight was delayed for the third time this week...
Error message: 'super' object has no attribute 'generate'
Error processing prompt: The package arrived on schedule as expected.
...
Error message: 'super' object has no attribute 'generate'
Error processing prompt: This phone's battery life is absolutely amazing!
...
Error message: 'super' object has no attribute 'generate'
Error processing prompt: The store will be closed from 2PM to 4PM.
...
Error message: 'super' object has no attribute 'generate'
Error processing prompt: I deeply regret purchasing this defec

The `'super' object has no attribute 'generate'` errors in this output match the warning printed before them: from transformers v4.50 onwards, `PreTrainedModel` no longer inherits from `GenerationMixin`. If every prompt fails this way, the installed transformers release is probably newer than the adapter supports, and `results` will contain only error entries.

## Save Results

Write the generated logits and responses to a JSON file for use in distillation training.

In [None]:
with open(output_file, 'w') as f:
    json.dump(results, f, indent=2)

print(f"Processing complete! Processed {len(results)} prompts. Results written to {output_file}")
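
For a sense of how the saved file might be consumed, the sketch below loads the records, skips entries that recorded an error, and turns one token's stored logits into probabilities. The helper names `load_teacher_records` and `to_probs` are hypothetical, and the tiny sample file stands in for a real `output.json`; the actual distillation code may read the file differently.

```python
import json
import math
import os
import tempfile

def load_teacher_records(path):
    """Load saved results, skipping entries that recorded an error."""
    with open(path) as f:
        records = json.load(f)
    return [r for r in records if "error" not in r]

def to_probs(token_info):
    """Softmax over the stored finite logits; returns {token_id: probability}."""
    logits = token_info["logits"]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {i: e / total for i, e in zip(token_info["indices"], exps)}

# Tiny stand-in for output.json so the sketch runs on its own.
sample = [
    {"prompt": "ok", "response": {"generated_text": "NEUTRAL",
     "token_logits": [{"indices": [7, 42], "logits": [2.0, 0.0]}]}},
    {"prompt": "bad", "error": "failed"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.json")
    with open(path, "w") as f:
        json.dump(sample, f)
    records = load_teacher_records(path)

probs = to_probs(records[0]["response"]["token_logits"][0])
print(len(records), probs)
```

Because only finite logits are stored, the softmax here is taken over the surviving tokens rather than the full vocabulary.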