# Lab 2: Neuron vLLM Inference

## Introduction

In this lab, you will learn how to deploy and run inference with your distilled model using vLLM on AWS Neuron hardware. This lab builds on Lab 1, where you trained a smaller student model using knowledge distillation.

vLLM is a high-throughput and memory-efficient inference engine for large language models. The Optimum Neuron integration provides seamless support for running vLLM on AWS Trainium and Inferentia accelerators, enabling cost-effective and high-performance inference.

**What You'll Learn:**
- How to compile your distilled model for Neuron inference
- How to set up vLLM with Optimum Neuron for batch inference
- How to run sentiment classification inference on your trained model
- How to deploy an OpenAI-compatible API server for production use

**Key Benefits of vLLM on Neuron:**
- **High Throughput**: Optimized for serving multiple requests efficiently
- **Memory Efficiency**: Advanced memory management for large models
- **Hardware Acceleration**: Native support for AWS Trainium/Inferentia
- **API Compatibility**: OpenAI-compatible API for easy integration
- **Cost Optimization**: Leverage AWS Neuron hardware for cost-effective inference

**Prerequisites:**
- Completed Lab 1 with a trained distilled model
- AWS Trainium-based EC2 instance
- Sufficient disk space for model compilation artifacts

## Copy Tokenizer

First, we need to copy the tokenizer from the original model to our compilation directory. The tokenizer is essential for converting text inputs into tokens that the model can process, and for decoding the model's token outputs back into human-readable text.

**Why Copy the Tokenizer?**
- The distilled model uses the same vocabulary as the original Qwen3-0.6B model
- vLLM requires the tokenizer to be co-located with the compiled model
- This ensures consistent tokenization between training and inference

In [8]:
from transformers import AutoTokenizer

# 1. Specify the name of the Hugging Face model
model_name = "Qwen/Qwen3-0.6B" 

# 2. Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 3. Define the local directory where you want to save the tokenizer
local_directory = "/home/ubuntu/environment/ml/qwen/compiled_model"

# 4. Save the tokenizer to the local directory
tokenizer.save_pretrained(local_directory)

print(f"Tokenizer for '{model_name}' saved to '{local_directory}'")

Tokenizer for 'Qwen/Qwen3-0.6B' saved to '/home/ubuntu/environment/ml/qwen/compiled_model'
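
Because vLLM expects the tokenizer next to the compiled model, a quick check that the files landed in the right place can save a confusing failure later. The following is a minimal sketch; the helper name `missing_tokenizer_files` is hypothetical, and the two file names it looks for are an assumption about what a fast Hugging Face tokenizer usually writes, so adjust them to match what you see on disk.

```python
from pathlib import Path

# Assumed file names: what a fast Hugging Face tokenizer typically writes.
# Adjust this tuple if your tokenizer saves a different set of files.
EXPECTED_TOKENIZER_FILES = ("tokenizer_config.json", "tokenizer.json")


def missing_tokenizer_files(directory, expected=EXPECTED_TOKENIZER_FILES):
    """Return the expected tokenizer files that are absent from `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    compiled_dir = "/home/ubuntu/environment/ml/qwen/compiled_model"
    missing = missing_tokenizer_files(compiled_dir)
    print("Missing tokenizer files:", missing or "none")
```

An empty result means every expected file is present in the compilation directory.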


## Model Compilation for Neuron Inference

After completing the distillation training process, the next step is to compile the trained model for AWS Trainium inference using the Hugging Face Optimum Neuron toolchain.

**What is Neuron Compilation?**

Neuron compilation is a critical optimization step that:
1. **Analyzes the Model Graph**: Examines the PyTorch model architecture and operations
2. **Applies Hardware Optimizations**: Optimizes operations specifically for Neuron hardware
3. **Generates NEFF Files**: Creates Neuron Executable File Format (NEFF) files for efficient execution
4. **Enables Tensor Parallelism**: Shards each layer's weight tensors across multiple NeuronCores
5. **Optimizes Memory Usage**: Reduces memory footprint and improves throughput

**Compilation Parameters:**

- **--model**: Path to your distilled model from Lab 1
- **--task**: Specifies the model task (text-generation for causal language models)
- **--sequence_length**: Maximum sequence length the model can handle (2048 tokens)
- **--batch_size**: Number of sequences to process simultaneously (1 for this example)
- **Output Directory**: Where the compiled model artifacts will be saved
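
If you want to rerun the export with different settings, it can help to assemble the command from named parameters rather than editing a shell string. This is a sketch only: `build_export_command` is a hypothetical helper, and it uses just the flags listed above with this lab's paths as defaults.

```python
import shlex


def build_export_command(model_path, output_dir, sequence_length=2048, batch_size=1):
    """Assemble the optimum-cli export command as an argument list."""
    return [
        "optimum-cli", "export", "neuron",
        "--model", model_path,
        "--task", "text-generation",
        "--sequence_length", str(sequence_length),
        "--batch_size", str(batch_size),
        output_dir,  # positional argument: where compiled artifacts are written
    ]


cmd = build_export_command(
    "/home/ubuntu/environment/distillation/Qwen3-0.6B-finetuned",
    "/home/ubuntu/environment/ml/qwen/compiled_model",
)
# Print a copy-pasteable shell command; nothing is executed here.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would start the compilation; the sketch only prints the command so you can review it first.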

**Expected Compilation Time:**
- First compilation: ~5-10 minutes for a 0.6B model
- Subsequent runs: Uses cached compilation if parameters haven't changed

**Note:** The compilation process will generate detailed logs showing the optimization steps. This is normal and indicates the compiler is working to optimize your model for Neuron hardware.

In [3]:
!optimum-cli export neuron \
  --model "/home/ubuntu/environment/distillation/Qwen3-0.6B-finetuned" \
  --task text-generation \
  --sequence_length 2048 \
  --batch_size 1 \
  /home/ubuntu/environment/ml/qwen/compiled_model

INFO:Neuron:Generating HLOs for the following models: ['context_encoding', 'token_generation']
[2025-10-31 01:16:22.885: I neuronx_distributed/parallel_layers/parallel_state.py:630] > initializing tensor model parallel with size 1
[2025-10-31 01:16:22.885: I neuronx_distributed/parallel_layers/parallel_state.py:631] > initializing pipeline model parallel with size 1
[2025-10-31 01:16:22.885: I neuronx_distributed/parallel_layers/parallel_state.py:632] > initializing context model 

## Setup vLLM for Neuron Inference

Now we'll install the Optimum Neuron vLLM library and run inference using our compiled distilled model.

**What is vLLM?**

vLLM is a high-performance inference engine designed for:
- **High Throughput**: Serves multiple requests efficiently with advanced batching
- **Memory Efficiency**: Uses PagedAttention and other optimizations to reduce memory usage
- **Hardware Acceleration**: Native support for GPUs, TPUs, and AWS Neuron accelerators
- **API Compatibility**: Provides OpenAI-compatible APIs for easy integration

**Optimum Neuron Integration:**

The `optimum-neuron[vllm]` package provides:
- Seamless integration between vLLM and AWS Neuron hardware
- Automatic handling of model sharding across NeuronCores
- Optimized attention mechanisms for Trainium/Inferentia
- Support for various model architectures including Qwen3

**Installation Note:**
The installation may take a few minutes as it includes the full vLLM package with Neuron optimizations and all required dependencies.

In [4]:
%pip install -q optimum-neuron[vllm]

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting vllm==0.10.2 (from optimum-neuron[vllm])
  Downloading vllm-0.10.2-cp38-abi3-manylinux1_x86_64.whl.metadata (16 kB)
Collecting cachetools (from vllm==0.10.2->optimum-neuron[vllm])
  Downloading cachetools-6.2.1-py3-none-any.whl.metadata (5.5 kB)
Collecting sentencepiece (from vllm==0.10.2->optimum-neuron[vllm])
  Using cached sentencepiece-0.2.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (10 kB)
Collecting blake3 (from vllm==0.10.2->optimum-neuron[vllm])
  Downloading blake3-1.0.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.6 kB)
Collecting py-cpuinfo (from vllm==0.10.2->optimum-neuron[vllm])
  Downloading py_cpuinfo-9.0.0-py3-none-any.whl.metadata (794 bytes)
Collecting openai>=1.99.1 (from vllm==0.10.2->optimum-neuron[vllm])
  Downloading openai-2.6.1-py3-none-any.whl.metadata (29 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (f

## Batch Inference with Your Distilled Model

Now we'll run batch inference using your distilled sentiment classification model. This example demonstrates how to:

1. **Initialize the vLLM Engine**: Load your compiled model with appropriate configuration
2. **Prepare Input Data**: Format text samples using the same conversation template from training
3. **Run Batch Inference**: Process multiple samples efficiently in parallel
4. **Analyze Results**: Compare the distilled model's predictions with expected sentiment classifications

**vLLM Configuration Parameters:**

- **model**: Path to your compiled Neuron model
- **max_num_seqs**: Maximum number of sequences to process simultaneously (1 for this example)
- **max_model_len**: Maximum sequence length the model can handle (2048 tokens)
- **tensor_parallel_size**: Number of NeuronCores to use for tensor parallelism (2)
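
Neuron models are compiled with static shapes, so this sketch assumes that the values passed to vLLM should line up with the ones used at export time (`--batch_size` with `max_num_seqs`, `--sequence_length` with `max_model_len`). The helper `check_shape_consistency` is hypothetical and only compares two plain dictionaries; it does not inspect the compiled artifacts.

```python
def check_shape_consistency(export_args, vllm_args):
    """Return a list of mismatches between export and vLLM settings.

    Assumption: batch_size should equal max_num_seqs, and
    sequence_length should equal max_model_len.
    """
    pairs = [("batch_size", "max_num_seqs"), ("sequence_length", "max_model_len")]
    problems = []
    for export_key, vllm_key in pairs:
        if export_args[export_key] != vllm_args[vllm_key]:
            problems.append(
                f"{export_key}={export_args[export_key]} "
                f"but {vllm_key}={vllm_args[vllm_key]}"
            )
    return problems


export_args = {"batch_size": 1, "sequence_length": 2048}
vllm_args = {"max_num_seqs": 1, "max_model_len": 2048}
print(check_shape_consistency(export_args, vllm_args) or "settings are consistent")
```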

**Conversation Template:**

We use the same conversation format that was used during training:
- **System Message**: Defines the sentiment classification task
- **User Message**: Contains the text to classify
- **Assistant Response**: Expected to be POSITIVE, NEGATIVE, or NEUTRAL
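
Qwen chat models use the ChatML layout, in which every turn is closed with an `<|im_end|>` token. The sketch below builds a prompt in that layout without the leading indentation that a triple-quoted string can introduce. Whether it matches the template used in Lab 1 is an assumption: the prompt at inference time should mirror whatever format the model saw during training. The name `build_chatml_prompt` is hypothetical.

```python
SYSTEM_PROMPT = (
    "You are a sentiment classifier. You take input strings and return the "
    "sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment."
)


def build_chatml_prompt(user_text, system_text=SYSTEM_PROMPT):
    """Format one classification request in ChatML style."""
    return (
        f"<|im_start|>system\n{system_text}<|im_end|>\n"
        f"<|im_start|>user\n{user_text}<|im_end|>\n"
        "<|im_start|>assistant\n"  # left open so the model writes the label
    )


print(build_chatml_prompt("The service at this restaurant exceeded all my expectations!"))
```

An alternative is to let the tokenizer render the prompt with `tokenizer.apply_chat_template(..., add_generation_prompt=True)`, which avoids hand-maintaining the special tokens.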

**Expected Behavior:**

Your distilled model should demonstrate:
- **Faster Inference**: Significantly faster than the 30B teacher model
- **Reasonable Accuracy**: Good sentiment classification despite being 50x smaller
- **Consistent Format**: Responses in the expected POSITIVE/NEGATIVE/NEUTRAL format

**Performance Comparison:**

Compare the results with what you'd expect from the teacher model to evaluate the effectiveness of knowledge distillation.
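
To make that comparison concrete, it helps to turn each raw generation into one of the three labels and count how many outputs could not be parsed at all. The following sketch assumes the label appears as a standalone word somewhere in the generated text; `extract_label` and `summarize` are hypothetical helper names.

```python
import re
from collections import Counter

# Match any of the three labels as a whole word, ignoring case.
LABEL_PATTERN = re.compile(r"\b(POSITIVE|NEGATIVE|NEUTRAL)\b", re.IGNORECASE)


def extract_label(generated_text):
    """Return the first sentiment label found in the text, or None."""
    match = LABEL_PATTERN.search(generated_text)
    return match.group(1).upper() if match else None


def summarize(generated_texts):
    """Count labels across generations; unparseable outputs become UNPARSED."""
    return Counter(extract_label(text) or "UNPARSED" for text in generated_texts)


samples = [" POSITIVE\n", "negative.", "The report contains data from the last fiscal year."]
print(summarize(samples))
```

A high `UNPARSED` count usually points to a prompt-format or generation-length problem rather than a classification error, so it is worth checking before computing accuracy.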

In [10]:
from vllm import LLM, SamplingParams

# Load the compiled Neuron model from the local directory.
llm = LLM(
    model="/home/ubuntu/environment/ml/qwen/compiled_model",  # local compiled model
    max_num_seqs=1,
    max_model_len=2048,
    tensor_parallel_size=2,
)

def create_conversation(sample):
    return f"""<|im_start|>system
    You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.
    <|im_start|>user
    {sample}
    <|im_start|>assistant"""

prompts = []
with open('/home/ubuntu/environment/distillation/data/dataset.txt', 'r') as f:
    for line in f:
        if line.strip():
            prompts.append( create_conversation(line.strip()) )
print(prompts)
sampling_params = SamplingParams(max_tokens=2048, temperature=0.8)
outputs = llm.generate(prompts, sampling_params)

print("#########################################################")

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, \n\n Generated text: {generated_text!r} \n")

INFO 10-31 01:38:28 [utils.py:328] non-default args: {'max_model_len': 2048, 'tensor_parallel_size': 2, 'max_num_seqs': 1, 'disable_log_stats': True, 'model': '/home/ubuntu/environment/ml/qwen/compiled_model'}
INFO 10-31 01:38:28 [__init__.py:742] Resolved architecture: Qwen3ForCausalLM
INFO 10-31 01:38:28 [__init__.py:1815] Using max model len 2048
INFO 10-31 01:38:28 [llm_engine.py:221] Initializing a V0 LLM engine (v0.10.2) with config: model='/home/ubuntu/environment/ml/qwen/compiled_model', speculative_config=None, tokenizer='/home/ubuntu/environment/ml/qwen/compiled_model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cpu, decoding_config=DecodingConfig(backend=

INFO:Neuron:Loading sharded checkpoint from /home/ubuntu/environment/ml/qwen/compiled_model/checkpoint/weights


INFO 10-31 01:38:35 [executor_base.py:114] # neuron blocks: 2, # CPU blocks: 0
INFO 10-31 01:38:35 [executor_base.py:119] Maximum concurrency for 2048 tokens per request: 2.00x
INFO 10-31 01:38:35 [llm_engine.py:420] init engine (profile, create kv cache, warmup model) took 0.00 seconds
INFO 10-31 01:38:35 [llm.py:295] Supported_tasks: ['generate']
INFO 10-31 01:38:35 [__init__.py:36] No IOProcessor plugins requested by the model
['<|im_start|>system\n    You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.\n    <|im_start|>user\n    The service at this restaurant exceeded all my expectations!\n    <|im_start|>assistant', "<|im_start|>system\n    You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.\n    <|im_start|>user\n    I can't believe how rude the staff was today.\n    <|im_start|>assistant", '<|im_start|

Adding requests:   0%|          | 0/100 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/100 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

#########################################################
Prompt: '<|im_start|>system\n    You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.\n    <|im_start|>user\n    The service at this restaurant exceeded all my expectations!\n    <|im_start|>assistant', 

 Generated text: '\nassistant\nThe service at this restaurant exceeded all my expectations!' 

Prompt: "<|im_start|>system\n    You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.\n    <|im_start|>user\n    I can't believe how rude the staff was today.\n    <|im_start|>assistant", 

 Generated text: '\nassistant\nThe report contains data from the last fiscal year.' 

Prompt: '<|im_start|>system\n    You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.\n    <|im_sta

## Deploy Production API Server

For production use cases, you can deploy your distilled model as an OpenAI-compatible API server. This allows you to integrate your sentiment classification model into applications using standard HTTP requests.

**API Server Benefits:**

- **OpenAI Compatibility**: Use the same API format as OpenAI's models
- **HTTP Interface**: Easy integration with web applications and services
- **Concurrent Requests**: Handle multiple requests simultaneously
- **Production Ready**: Built-in request queuing and error handling
- **Cost Effective**: Run your own model instead of paying per API call

**Server Configuration:**

- **--model**: Path to your compiled Neuron model
- **--max-num-seqs**: Maximum concurrent sequences (1 for this configuration)
- **--max-model-len**: Maximum sequence length (2048 tokens)
- **--tensor-parallel-size**: Number of NeuronCores to use (2 in this configuration)
- **--port**: HTTP port for the API server (8080)
- **--device**: Specify "neuron" to use AWS Neuron hardware

**Starting the Server:**

Run the following command to start your API server. The server will be accessible at `http://localhost:8080` and provide OpenAI-compatible endpoints like `/v1/completions`.

In [None]:
!python -m vllm.entrypoints.openai.api_server \
    --model="/home/ubuntu/environment/ml/qwen/compiled_model" \
    --max-num-seqs=1 \
    --max-model-len=2048 \
    --tensor-parallel-size=2 \
    --port=8080 \
    --device "neuron"
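
The server takes some time to load the model before it accepts requests. A small polling loop can tell you when it is ready. This sketch assumes the server answers `GET /health` with HTTP 200 once it is up; the function name `wait_for_server` and the timeout values are illustrative choices, not part of the lab.

```python
import time
import urllib.request


def wait_for_server(url, timeout=300.0, interval=2.0):
    """Poll `url` until it returns HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except OSError:
            pass  # connection refused or HTTP error: server is not ready yet
        time.sleep(interval)
    return False


# Example call (assumed endpoint), run from a separate process or terminal:
# wait_for_server("http://127.0.0.1:8080/health")
```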

## Test the API Server

Once your API server is running, you can test it using standard HTTP requests. The server provides OpenAI-compatible endpoints that accept the same request format.

**API Endpoint:**
- **URL**: `http://127.0.0.1:8080/v1/completions`
- **Method**: POST
- **Content-Type**: application/json

**Request Parameters:**
- **prompt**: The formatted conversation prompt (same format as training)
- **temperature**: Controls randomness in generation (0.8 in this example; lower values make the returned label more deterministic)
- **max_tokens**: Maximum number of tokens to generate (128 should be sufficient for sentiment labels)

**Example Request:**

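The following curl command queries your sentiment classification model. Run it from a separate terminal, because the notebook cell that starts the server blocks for as long as the server is running: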

In [None]:
curl 127.0.0.1:8080/v1/completions \
    -H 'Content-Type: application/json' \
    -X POST \
    -d '{"prompt":"<|im_start|>system\n    You are a sentiment classifier. You take input strings and return the sentiment of POSITIVE, NEGATIVE, or NEUTRAL. Only return the sentiment.\n    <|im_start|>user\n    The service at this restaurant exceeded all my expectations!\n    <|im_start|>assistant", "temperature": 0.8, "max_tokens":128}'
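
The same request can be sent from Python using only the standard library. This is a sketch: `build_completion_payload` and `post_completion` are hypothetical helpers, the URL is the one used in this lab, and the response is assumed to follow the OpenAI completions shape with the generated text under `choices[0].text`.

```python
import json
import urllib.request


def build_completion_payload(prompt, temperature=0.8, max_tokens=128):
    """Build the JSON body for a /v1/completions request."""
    return {"prompt": prompt, "temperature": temperature, "max_tokens": max_tokens}


def post_completion(payload, url="http://127.0.0.1:8080/v1/completions"):
    """POST the payload and return the generated text of the first choice."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


payload = build_completion_payload("<|im_start|>user\nGreat food!\n<|im_start|>assistant")
print(json.dumps(payload))
# With the server running: print(post_completion(payload))
```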


## Lab 2 Summary and Next Steps

Congratulations! You've successfully completed the knowledge distillation workflow by deploying your distilled model for high-performance inference.

**What You've Accomplished:**

1. ✅ **Model Compilation**: Optimized your distilled model for AWS Neuron hardware
2. ✅ **vLLM Integration**: Set up high-performance inference with Optimum Neuron
3. ✅ **Batch Processing**: Demonstrated efficient batch inference on sentiment classification
4. ✅ **API Deployment**: Deployed an OpenAI-compatible API server for production use

**Performance Benefits Achieved:**

- **50x Model Size Reduction**: From 30B parameters (teacher) to 0.6B parameters (student)
- **Faster Inference**: Significantly reduced latency compared to the teacher model
- **Cost Optimization**: Lower compute costs for production inference
- **Hardware Acceleration**: Optimized execution on AWS Trainium/Inferentia

**Production Considerations:**

- **Scaling**: Use multiple instances behind a load balancer for higher throughput
- **Monitoring**: Implement logging and metrics collection for production monitoring
- **Security**: Add authentication and rate limiting for public APIs
- **Optimization**: Fine-tune batch sizes and parallelism based on your workload

**Next Steps:**

1. **Evaluate Performance**: Compare accuracy and speed against the teacher model
2. **Integration**: Integrate the API into your applications
3. **Optimization**: Experiment with different compilation settings for your use case
4. **Scaling**: Deploy multiple instances for production workloads
5. **Monitoring**: Set up comprehensive monitoring and alerting

You now have a complete knowledge distillation pipeline from teacher model logit generation through production deployment!