# Inference

## Copy tokenizer

In [8]:
from transformers import AutoTokenizer

# 1. Specify the name of the Hugging Face model
model_name = "Qwen/Qwen3-0.6B" 

# 2. Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 3. Define the local directory where you want to save the tokenizer
local_directory = "/home/ubuntu/environment/ml/qwen/compiled_model"

# 4. Save the tokenizer to the local directory
tokenizer.save_pretrained(local_directory)

print(f"Tokenizer for '{model_name}' saved to '{local_directory}'")

Tokenizer for 'Qwen/Qwen3-0.6B' saved to '/home/ubuntu/environment/ml/qwen/compiled_model'
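
vLLM is later pointed at this same directory for both the model and the tokenizer, so it can help to confirm the tokenizer files landed there before compiling. The sketch below is one way to check; the helper name `missing_tokenizer_files` and the list of expected file names are assumptions for illustration (a fast tokenizer typically writes at least these two files), not part of the original notebook.

```python
from pathlib import Path

# Assumed file names: adjust to whatever your tokenizer actually writes.
EXPECTED_FILES = ("tokenizer_config.json", "tokenizer.json")


def missing_tokenizer_files(directory, expected=EXPECTED_FILES):
    """Return the expected tokenizer files that are absent from `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]


local_directory = "/home/ubuntu/environment/ml/qwen/compiled_model"
missing = missing_tokenizer_files(local_directory)
if missing:
    print(f"Missing tokenizer files in '{local_directory}': {missing}")
else:
    print("All expected tokenizer files are present.")
```

An empty list means every expected file was found; anything else names the files to re-save.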


# Compilation

After completing the fine-tuning process, the next step is to compile the trained model for AWS Trainium inference using the Hugging Face Optimum Neuron toolchain.
Neuron compilation optimizes the model graph and converts it into a Neuron Executable File Format (NEFF), enabling efficient execution on NeuronCores.

In [3]:
!optimum-cli export neuron \
  --model "/home/ubuntu/environment/distillation/Qwen3-0.6B-finetuned" \
  --task text-generation \
  --sequence_length 2048 \
  --batch_size 1 \
  /home/ubuntu/environment/ml/qwen/compiled_model

INFO:Neuron:Generating HLOs for the following models: ['context_encoding', 'token_generation']
[2025-10-31 01:16:22.885: I neuronx_distributed/parallel_layers/parallel_state.py:630] > initializing tensor model parallel with size 1
[2025-10-31 01:16:22.885: I neuronx_distributed/parallel_layers/parallel_state.py:631] > initializing pipeline model parallel with size 1
[2025-10-31 01:16:22.885: I neuronx_distributed/parallel_layers/parallel_state.py:632] > initializing context model 
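
The sequence length and batch size chosen at export are reused later as `max_model_len` and `max_num_seqs`, so it can be convenient to keep them in Python variables and build the export command from them. The following is a minimal sketch under that assumption; `build_export_command` is a hypothetical helper, and it only uses the flags that appear in the command above.

```python
import shlex


def build_export_command(model_path, output_dir, sequence_length=2048, batch_size=1):
    """Build the `optimum-cli export neuron` argument list used in this notebook."""
    return [
        "optimum-cli", "export", "neuron",
        "--model", model_path,
        "--task", "text-generation",
        "--sequence_length", str(sequence_length),
        "--batch_size", str(batch_size),
        output_dir,
    ]


cmd = build_export_command(
    "/home/ubuntu/environment/distillation/Qwen3-0.6B-finetuned",
    "/home/ubuntu/environment/ml/qwen/compiled_model",
)
# Print a copy-pasteable shell command; nothing is executed here.
print(shlex.join(cmd))
```

To actually run the export from Python, the list can be passed to `subprocess.run(cmd, check=True)`.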

# Inference

Next, install Optimum Neuron with its vLLM extra, then run inference with the compiled model.

In [4]:
%pip install -q optimum-neuron[vllm]

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting vllm==0.10.2 (from optimum-neuron[vllm])
  Downloading vllm-0.10.2-cp38-abi3-manylinux1_x86_64.whl.metadata (16 kB)
Collecting cachetools (from vllm==0.10.2->optimum-neuron[vllm])
  Downloading cachetools-6.2.1-py3-none-any.whl.metadata (5.5 kB)
Collecting sentencepiece (from vllm==0.10.2->optimum-neuron[vllm])
  Using cached sentencepiece-0.2.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (10 kB)
Collecting blake3 (from vllm==0.10.2->optimum-neuron[vllm])
  Downloading blake3-1.0.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.6 kB)
Collecting py-cpuinfo (from vllm==0.10.2->optimum-neuron[vllm])
  Downloading py_cpuinfo-9.0.0-py3-none-any.whl.metadata (794 bytes)
Collecting openai>=1.99.1 (from vllm==0.10.2->optimum-neuron[vllm])
  Downloading openai-2.6.1-py3-none-any.whl.metadata (29 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (f

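With the dependencies installed, run the batch inference example below. It reads one input per line from the dataset file, wraps each in a sentiment-classification prompt, and generates a completion for every prompt.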

In [10]:
from vllm import LLM, SamplingParams

# Load the locally compiled model. max_num_seqs and max_model_len mirror the
# batch size and sequence length used at export time.
llm = LLM(
    model="/home/ubuntu/environment/ml/qwen/compiled_model",  # local compiled model
    max_num_seqs=1,
    max_model_len=2048,
    tensor_parallel_size=2,
)

def create_conversation(sample):
    # Wrap one input string in the sentiment-classification prompt
    return f"""<|im_start|>system
    You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.
    <|im_start|>user
    {sample}
    <|im_start|>assistant"""

# Build one prompt per non-empty line of the dataset file
prompts = []
with open('/home/ubuntu/environment/distillation/data/dataset.txt', 'r') as f:
    for line in f:
        if line.strip():
            prompts.append(create_conversation(line.strip()))
print(prompts)

# Generate a completion for every prompt
sampling_params = SamplingParams(max_tokens=2048, temperature=0.8)
outputs = llm.generate(prompts, sampling_params)

print("#########################################################")

# Print each prompt next to its first generated completion
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, \n\n Generated text: {generated_text!r} \n")

INFO 10-31 01:38:28 [utils.py:328] non-default args: {'max_model_len': 2048, 'tensor_parallel_size': 2, 'max_num_seqs': 1, 'disable_log_stats': True, 'model': '/home/ubuntu/environment/ml/qwen/compiled_model'}
INFO 10-31 01:38:28 [__init__.py:742] Resolved architecture: Qwen3ForCausalLM
INFO 10-31 01:38:28 [__init__.py:1815] Using max model len 2048
INFO 10-31 01:38:28 [llm_engine.py:221] Initializing a V0 LLM engine (v0.10.2) with config: model='/home/ubuntu/environment/ml/qwen/compiled_model', speculative_config=None, tokenizer='/home/ubuntu/environment/ml/qwen/compiled_model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cpu, decoding_config=DecodingConfig(backend=

INFO:Neuron:Loading sharded checkpoint from /home/ubuntu/environment/ml/qwen/compiled_model/checkpoint/weights


INFO 10-31 01:38:35 [executor_base.py:114] # neuron blocks: 2, # CPU blocks: 0
INFO 10-31 01:38:35 [executor_base.py:119] Maximum concurrency for 2048 tokens per request: 2.00x
INFO 10-31 01:38:35 [llm_engine.py:420] init engine (profile, create kv cache, warmup model) took 0.00 seconds
INFO 10-31 01:38:35 [llm.py:295] Supported_tasks: ['generate']
INFO 10-31 01:38:35 [__init__.py:36] No IOProcessor plugins requested by the model
['<|im_start|>system\n    You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.\n    <|im_start|>user\n    The service at this restaurant exceeded all my expectations!\n    <|im_start|>assistant', "<|im_start|>system\n    You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.\n    <|im_start|>user\n    I can't believe how rude the staff was today.\n    <|im_start|>assistant", '<|im_start|

Adding requests:   0%|          | 0/100 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/100 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

#########################################################
Prompt: '<|im_start|>system\n    You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.\n    <|im_start|>user\n    The service at this restaurant exceeded all my expectations!\n    <|im_start|>assistant', 

 Generated text: '\nassistant\nThe service at this restaurant exceeded all my expectations!' 

Prompt: "<|im_start|>system\n    You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.\n    <|im_start|>user\n    I can't believe how rude the staff was today.\n    <|im_start|>assistant", 

 Generated text: '\nassistant\nThe report contains data from the last fiscal year.' 

Prompt: '<|im_start|>system\n    You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.\n    <|im_sta
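
The generations shown above do not return a sentiment label. One possible cause is the hand-built prompt: Qwen's ChatML format closes every turn with `<|im_end|>` and expects the role name to be followed directly by the content, whereas the template above omits the closing token and indents each line. The sketch below builds a single-turn prompt in that format; `build_chatml_prompt` is a hypothetical helper, and whether it fixes the outputs for this fine-tuned model is an assumption that has not been verified here.

```python
SYSTEM_PROMPT = (
    "You are a sentiment classifier. You take input strings and return the "
    "sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment."
)


def build_chatml_prompt(system: str, user: str) -> str:
    """Assemble a single-turn ChatML prompt with every turn closed by <|im_end|>."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        # Leave the assistant turn open so the model writes the answer.
        "<|im_start|>assistant\n"
    )


prompt = build_chatml_prompt(
    SYSTEM_PROMPT,
    "The service at this restaurant exceeded all my expectations!",
)
print(prompt)
```

If the saved tokenizer ships a chat template, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is an alternative that avoids writing the special tokens by hand. For a one-word label, a lower `temperature` and a much smaller `max_tokens` than the values used above may also give more stable results.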

You can also serve the compiled model from an OpenAI-compatible inference endpoint on the current instance:

In [None]:
!python -m vllm.entrypoints.openai.api_server \
    --model="/home/ubuntu/environment/ml/qwen/compiled_model" \
    --max-num-seqs=1 \
    --max-model-len=2048 \
    --tensor-parallel-size=2 \
    --port=8080 \
    --device "neuron"

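The server cell blocks until it is interrupted, so query the endpoint from a separate terminal or notebook (in a notebook cell, prefix the command with `!`):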

In [None]:
curl 127.0.0.1:8080/v1/completions \
    -H 'Content-Type: application/json' \
    -X POST \
    -d '{"prompt":"<|im_start|>system\n    You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.\n    <|im_start|>user\n    The service at this restaurant exceeded all my expectations!\n    <|im_start|>assistant", "temperature": 0.8, "max_tokens":128}'
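
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the URL, headers, and JSON fields of the `curl` call above; the helper names `build_completion_request` and `complete` are hypothetical, and the response parsing assumes the usual OpenAI completions shape (`choices[0].text`).

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://127.0.0.1:8080",
                             temperature=0.8, max_tokens=128):
    """Build a POST request for the /v1/completions route of the local server."""
    payload = {"prompt": prompt, "temperature": temperature, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server running, `complete(prompt)` returns the generated text for a prompt string such as the one in the `curl` example.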
