# Compilation

After fine-tuning, the next step is to compile the model for AWS Trainium inference with the Hugging Face Optimum Neuron toolchain.
Neuron compilation optimizes the model graph and converts it into the Neuron Executable File Format (NEFF), which runs efficiently on NeuronCores.
The command below exports `Qwen/Qwen3-0.6B` from the Hugging Face Hub; to compile your fine-tuned weights instead, point `--model` at the checkpoint directory.

In [None]:
!optimum-cli export neuron \
  --model "Qwen/Qwen3-0.6B" \
  --task text-generation \
  --sequence_length 512 \
  --batch_size 1 \
  /home/ubuntu/environment/ml/qwen/compiled_model
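
The same export can also be assembled from Python, which keeps the compile-time shapes in one place so they can be reused when the model is loaded later. This is a minimal sketch: `build_export_command` is a hypothetical helper, not part of Optimum Neuron, and the flags are only those already used in the command above.

```python
import shlex


def build_export_command(model_id, output_dir, sequence_length=512, batch_size=1):
    """Return the optimum-cli argument list (hypothetical helper; flags copied from the cell above)."""
    return [
        "optimum-cli", "export", "neuron",
        "--model", model_id,
        "--task", "text-generation",
        "--sequence_length", str(sequence_length),
        "--batch_size", str(batch_size),
        output_dir,
    ]


command = build_export_command(
    "Qwen/Qwen3-0.6B",
    "/home/ubuntu/environment/ml/qwen/compiled_model",
)
# Print a copy-pasteable shell line. To actually compile on a Neuron instance,
# pass `command` to subprocess.run(command, check=True).
print(shlex.join(command))
```

The `sequence_length` and `batch_size` chosen here bound the `max_model_len` and `max_num_seqs` values used at inference time.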

# Inference

Install Optimum Neuron with its vLLM extra, then run inference with the compiled model.

In [None]:
%pip install optimum-neuron[vllm]

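Now you can run the batch inference example below. It reads one input per line from `datasets.txt`, wraps each line in the classifier prompt, and prints the generated sentiment.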

In [None]:
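from vllm import LLM, SamplingParams

# Load the locally compiled model. max_num_seqs and max_model_len mirror the
# batch size (1) and sequence length (512) used at compile time.
llm = LLM(
    model="/home/ubuntu/environment/ml/qwen/compiled_model",  # local compiled model
    max_num_seqs=1,
    max_model_len=512,
    device="neuron",
    tensor_parallel_size=2,
    override_neuron_config={},
)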

def create_conversation(sample):
    return f"""<|im_start|>system
    You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.
    <|im_start|>user
    {sample}
    <|im_start|>assistant"""

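# Build one prompt per non-empty line of the input file.
prompts = []
with open('datasets.txt', 'r') as f:
    for line in f:
        if line.strip():
            prompts.append(create_conversation(line.strip()))
print(prompts)
# Prompt plus generated tokens must fit in the 512-token compiled sequence
# length; a sentiment label needs only a few tokens.
sampling_params = SamplingParams(max_tokens=128, temperature=0.8)
outputs = llm.generate(prompts, sampling_params)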

print("#########################################################")

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, \n\n Generated text: {generated_text!r} \n")
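
Because sampling runs at `temperature=0.8`, the generated text may carry extra whitespace, different casing, or additional words around the label. The sketch below shows one possible way to normalize each output to a single label; `extract_sentiment` and `LABELS` are hypothetical names introduced here for illustration, not part of vLLM.

```python
import re
from typing import Optional

# The three labels named in the system prompt.
LABELS = ("POSITIVE", "NEGATIVE", "NEUTRAL")
_LABEL_RE = re.compile(r"\b(" + "|".join(LABELS) + r")\b")


def extract_sentiment(generated_text: str) -> Optional[str]:
    """Return the first label found in the text (case-insensitive), or None."""
    match = _LABEL_RE.search(generated_text.upper())
    return match.group(1) if match else None
```

With the batch results above, this could be applied as `labels = [extract_sentiment(o.outputs[0].text) for o in outputs]`; a `None` entry flags an output that needs manual inspection.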

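You can also start an OpenAI-compatible inference endpoint on the current instance. The command runs in the foreground and blocks the notebook cell until it is stopped, so you may prefer to run it from a terminal: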

In [None]:
!python -m vllm.entrypoints.openai.api_server \
    --model="/home/ubuntu/environment/ml/qwen/compiled_model" \
    --max-num-seqs=1 \
    --max-model-len=512 \
    --tensor-parallel-size=2 \
    --port=8080 \
    --device=neuron
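
The server may need some time to load the compiled model before it accepts requests. A minimal readiness check, assuming the server exposes vLLM's `/health` route on the port chosen above, could poll until it answers; `wait_for_server` is a hypothetical helper and its timeout values are arbitrary.

```python
import time
import urllib.request


def wait_for_server(url="http://127.0.0.1:8080/health", timeout_s=300.0, interval_s=2.0):
    """Poll `url` until it returns HTTP 200; return False if `timeout_s` elapses first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Connection refused, timeout, or HTTP error: server not ready yet.
            pass
        time.sleep(interval_s)
    return False
```

Calling `wait_for_server()` from a second notebook or terminal returns `True` once the endpoint is reachable.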

Once the server is running, query the endpoint with a completion request:

In [None]:
!curl 127.0.0.1:8080/v1/completions \
    -H 'Content-Type: application/json' \
    -X POST \
    -d '{"prompt":"<|im_start|>system\n    You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.\n    <|im_start|>user\n    The service at this restaurant exceeded all my expectations!\n    <|im_start|>assistant", "temperature": 0.8, "max_tokens":128}'
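
The same request can be sent from Python using only the standard library. This is a sketch under the assumption that the response follows the OpenAI completions shape (`choices[0].text`); `build_payload` and `query_endpoint` are hypothetical helpers, and the prompt string reproduces the one in the `curl` call.

```python
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are a sentiment classifier. You take input strings and return the "
    "sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment."
)


def build_payload(text, temperature=0.8, max_tokens=128):
    """Build the JSON body with the same prompt layout as the curl example."""
    prompt = (
        "<|im_start|>system\n"
        f"    {SYSTEM_PROMPT}\n"
        "    <|im_start|>user\n"
        f"    {text}\n"
        "    <|im_start|>assistant"
    )
    return {"prompt": prompt, "temperature": temperature, "max_tokens": max_tokens}


def query_endpoint(payload, url="http://127.0.0.1:8080/v1/completions"):
    """POST the payload and return the generated text (assumes an OpenAI-style response)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


payload = build_payload("The service at this restaurant exceeded all my expectations!")
```

With the server running, `query_endpoint(payload)` returns the generated text for that input.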
