# DistilBERT Inference on AWS Trainium2 with Triton Inference Server (LNC=1)

This notebook deploys DistilBERT (`distilbert-base-uncased`) on an AWS **trn2.3xlarge** instance using
NVIDIA Triton Inference Server with the AWS Neuron SDK, then benchmarks throughput and latency.

**This is the LNC=1 variant.** With LNC=1, each physical NeuronCore becomes one logical core,
giving **8 logical cores** (vs 4 with LNC=2). Each core has half the memory (12 GB vs 24 GB) but
there are twice as many independent execution units.

Everything is self-contained: the notebook writes all required files (Dockerfile, Triton config,
Python backend, compile script, benchmark client), builds the Docker image, compiles the models,
starts the server, and runs the benchmark.

**Instance**: trn2.3xlarge (1 Neuron device, 8 logical NeuronCores at LNC=1)  
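**Time to complete**: ~45 minutes on a first run (~25 minutes of model compilation plus a 15-20 minute Docker build); much less when compiled models and the image are already cached  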
**Prerequisites**: Deep Learning AMI Neuron (Ubuntu 24.04), Docker installed

## Methodology

We measure throughput (inferences/second) and latency (P50, P95, P99) under multiple load patterns,
varying both client-side batch size and request concurrency to exercise Triton's dynamic batching.

- **Input**: Fixed "hello world" text, tokenized and padded to 128 tokens
- **Output**: 768-dimensional CLS token embedding (FP32 output; matrix multiplications run in BF16 via `--auto-cast matmult`)
- **Duration**: 10 seconds per test configuration
- **Warmup**: 5 requests before timing
- **Concurrency**: Python threads, one HTTP client per thread

### LNC=1 vs LNC=2

| Property | LNC=2 | LNC=1 |
|----------|-------|-------|
| Logical cores | 4 | **8** |
| Physical cores per logical core | 2 | **1** |
| Memory per logical core | 24 GB | **12 GB** |
| Triton model instances | 4 | **8** |
| Single-core throughput | ~4,329 inf/s | ~2,661 inf/s |
| Total throughput (peak) | ~17,300 inf/s | ~21,300 inf/s |
| Per-request latency | Lower | Higher (~2x) |
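
The core counts, per-core memory, and peak throughput figures in this table are related by simple arithmetic. The sketch below reproduces them, assuming the 8 physical cores and 96 GB of device memory reported for this instance and assuming that aggregate throughput scales linearly with the single-core rate. The function names are illustrative and not part of any Neuron API.

```python
PHYSICAL_CORES = 8        # physical NeuronCores on the trn2.3xlarge device
DEVICE_MEMORY_GB = 96     # total device memory reported by neuron-ls


def logical_layout(lnc: int) -> tuple[int, int]:
    """Return (logical cores, GB per logical core) for a given LNC setting."""
    cores = PHYSICAL_CORES // lnc
    return cores, DEVICE_MEMORY_GB // cores


def ideal_aggregate(per_core_inf_s: float, lnc: int) -> float:
    """Upper-bound throughput if every logical core ran at the single-core rate."""
    cores, _ = logical_layout(lnc)
    return per_core_inf_s * cores


# Single-core rates are the values quoted in the table above.
for lnc, per_core in [(2, 4329), (1, 2661)]:
    cores, mem_gb = logical_layout(lnc)
    print(f"LNC={lnc}: {cores} cores x {mem_gb} GB, "
          f"ideal aggregate ~{ideal_aggregate(per_core, lnc):,.0f} inf/s")
```

Under that linear-scaling assumption the products come out at 17,316 and 21,288 inf/s, which line up with the ~17,300 and ~21,300 peak figures in the table.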

### Neuron-Specific: Per-Batch-Size Compilation

Unlike GPUs, Neuron models are compiled for a **fixed batch size**. Running a BS=1 request through
a BS=16 compiled model wastes 15/16 of the compute. We compile separate models for BS=1, 2, 4, 8, 16
and the Triton Python backend dispatches each request to the smallest model that fits.
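
The dispatch rule can be sketched as a lookup over the sorted list of compiled batch sizes. This is a minimal illustration, not the backend code itself: `pick_bucket` and `padding_waste` are hypothetical helper names, and unlike the backend written later in this notebook (which falls back to the largest model) the sketch raises when a batch is too large.

```python
from bisect import bisect_left

COMPILED_BATCH_SIZES = [1, 2, 4, 8, 16]  # must stay sorted ascending


def pick_bucket(actual_bs: int, buckets=COMPILED_BATCH_SIZES) -> int:
    """Return the smallest compiled batch size that can hold `actual_bs` rows."""
    i = bisect_left(buckets, actual_bs)
    if i == len(buckets):
        raise ValueError(
            f"batch of {actual_bs} exceeds largest compiled size {buckets[-1]}")
    return buckets[i]


def padding_waste(actual_bs: int, buckets=COMPILED_BATCH_SIZES) -> float:
    """Fraction of the chosen compiled batch that is filled with padding rows."""
    bucket = pick_bucket(actual_bs, buckets)
    return (bucket - actual_bs) / bucket


for bs in (1, 3, 5, 16):
    print(f"request rows={bs:<3} bucket={pick_bucket(bs):<3} "
          f"padding={padding_waste(bs):.1%}")

# With only a BS=16 model available, a single-row request is 15/16 padding.
print(padding_waste(1, buckets=[16]))
```

With power-of-two buckets the padding never exceeds half of the compiled batch, which is the motivation for compiling several variants instead of one large one.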

**Critical**: Models compiled with `--lnc 1` are **not interchangeable** with `--lnc 2` models.
The `--lnc` compiler flag and `NEURON_LOGICAL_NC_CONFIG` environment variable must always match.
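
One way to catch a mismatch early is a guard that runs before any model is loaded. The sketch below is a hypothetical helper, not part of the Neuron SDK: it assumes the caller knows which LNC value the models were compiled with, and it conservatively treats an unset variable as an error rather than relying on the runtime default.

```python
import os


def assert_lnc_matches(compiled_lnc: int, env=os.environ) -> None:
    """Fail fast if the runtime LNC setting differs from the compile-time one."""
    runtime = env.get("NEURON_LOGICAL_NC_CONFIG")
    if runtime is None:
        raise RuntimeError(
            "NEURON_LOGICAL_NC_CONFIG is not set; set it explicitly to "
            f"{compiled_lnc} to match the compiled models")
    if int(runtime) != compiled_lnc:
        raise RuntimeError(
            f"models were compiled with --lnc {compiled_lnc} but "
            f"NEURON_LOGICAL_NC_CONFIG={runtime}")


# Passing a plain dict keeps the example independent of the real environment.
assert_lnc_matches(1, env={"NEURON_LOGICAL_NC_CONFIG": "1"})
print("LNC settings agree")
```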

### Triton Configuration

- **8 model instances** (one per logical NeuronCore), each pinned via `NEURON_RT_VISIBLE_CORES`
- **Dynamic batching**: preferred sizes [4, 8, 16], max queue delay 5ms
- **Python backend**: Triton has no native Neuron backend, so we use the Python backend with `torch_neuronx`
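
The pinning step can be sketched as a mapping from the Triton instance name to a core index. This assumes, as the backend in this notebook does, that instance names end in the instance index (for example `distilbert_0_3`); `core_for_instance` and `pinning_env` are illustrative names only.

```python
def core_for_instance(instance_name: str, num_cores: int = 8) -> int:
    """Map an instance name such as 'distilbert_0_3' to a logical core index."""
    core_id = int(instance_name.rsplit("_", 1)[-1])
    if not 0 <= core_id < num_cores:
        raise ValueError(
            f"instance {instance_name!r} maps to core {core_id}, "
            f"but only cores 0-{num_cores - 1} exist")
    return core_id


def pinning_env(instance_name: str, lnc: int = 1) -> dict[str, str]:
    """Environment variables to set before the first model is loaded."""
    return {
        "NEURON_RT_VISIBLE_CORES": str(core_for_instance(instance_name)),
        "NEURON_LOGICAL_NC_CONFIG": str(lnc),
    }


print(pinning_env("distilbert_0_5"))
```

The range check matters when `instance_group count` is raised above the number of logical cores: without it, the extra instances would be pinned to cores that do not exist.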

### Test Matrix

| Test | Concurrency | Client Batch Sizes |
|------|-------------|--------------------|
| Baseline | 1 (no batching) | 1 |
| Dynamic batching | 16, 32, 64 | 1, 2, 4, 8, 16 |

Concurrency levels are doubled vs LNC=2 (16/32/64 instead of 8/16/32) to maintain
2 workers per core as the baseline.
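
The matrix can be generated from the core count, as in the sketch below. The 2/4/8 workers-per-core multipliers are inferred from the 16/32/64 concurrency levels and the constant names are illustrative.

```python
from itertools import product

LOGICAL_CORES = 8               # LNC=1 on trn2.3xlarge
WORKERS_PER_CORE = [2, 4, 8]    # inferred from concurrency 16/32/64
BATCH_SIZES = [1, 2, 4, 8, 16]
SECONDS_PER_TEST = 10

concurrency_levels = [w * LOGICAL_CORES for w in WORKERS_PER_CORE]

# Each entry is (concurrency, client batch size); the baseline comes first.
matrix = [(1, 1)] + list(product(concurrency_levels, BATCH_SIZES))

print(f"concurrency levels: {concurrency_levels}")
print(f"{len(matrix)} configurations, "
      f"{len(matrix) * SECONDS_PER_TEST} s of measured time")
```

Setting `LOGICAL_CORES = 4` yields the 8/16/32 levels used for the LNC=2 variant.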

## Model & Instance Specifications

| Spec | Value |
|------|-------|
| Model | DistilBERT base uncased (67M params) |
| Architecture | 6-layer transformer, hidden size 768 |
| Input | 128 tokens (padded) |
| Output | 768-dim CLS embedding (FP32 output, BF16 matmult) |
| Compiled model size | ~148 MB per batch size variant |
| Instance type | trn2.3xlarge |
| Neuron device | 1 device, 8 physical NeuronCores |
| LNC config | **1 (= 8 logical cores, 12 GB each)** |
| vCPUs / RAM | 12 / 128 GB |

### Software Stack

| Component | Version |
|-----------|--------|
| Neuron SDK | 2.28 |
| PyTorch / torch-neuronx | 2.9.0 / 2.9.0.2.12 |
| neuronx-cc (compiler) | 2.22.22436 |
| Transformers | 4.48.0 |
| Triton Inference Server | 2.65.0 (r26.01, built from source) |
| Python | 3.12 |
| OS / AMI | Ubuntu 24.04 / Deep Learning AMI Neuron 20260227 |

> **Important**: Transformers versions 4.54.0+ have a confirmed 31% performance regression for
> DistilBERT on Neuron. Use versions 4.48.0 through 4.53.3.
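
Checking this range requires comparing version numbers numerically, not as strings. The sketch below uses only the standard library and a deliberately simplified parser (it reads the first three dot-separated numeric components and ignores suffixes such as `rc1`); `parse_version` and `is_supported` are hypothetical helpers. In practice `packaging.version.Version` is the more robust choice.

```python
from itertools import takewhile


def parse_version(v: str) -> tuple[int, ...]:
    """Parse 'X.Y.Z' into a tuple of ints, ignoring non-numeric suffixes."""
    parts = []
    for piece in v.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits or 0))
    return tuple(parts)


def is_supported(v: str, lo: str = "4.48.0", hi: str = "4.53.3") -> bool:
    """True if `v` lies in the inclusive range recommended above."""
    return parse_version(lo) <= parse_version(v) <= parse_version(hi)


for candidate in ("4.48.0", "4.53.3", "4.54.0", "4.6.0"):
    print(candidate, is_supported(candidate))

# Plain string comparison gets this wrong: '6' sorts after '5'.
print("4.6.0" >= "4.54.0")
```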

---
## Step 0: Setup & Environment Check

Launch the instance from the Deep Learning AMI Neuron (Ubuntu 24.04), which ships with the Neuron drivers installed, then run this notebook inside the pre-installed PyTorch 2.9 Neuron virtual environment:

```bash
source /opt/aws_neuronx_venv_pytorch_2_9/bin/activate
pip install jupyter
jupyter notebook --ip=0.0.0.0 --no-browser
```

Alternatively, if you are running from within a remote VS Code instance, you can use `ln -s /opt/aws_neuronx_venv_pytorch_2_9/bin/activate ~/.venv` to help VS Code find your kernel.

In [1]:
import subprocess, sys, os, time, shutil

# Verify Neuron environment
import torch, torch_neuronx
print(f'PyTorch: {torch.__version__}')
print(f'torch-neuronx: {torch_neuronx.__version__}')

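# Ensure correct transformers version
try:
    import transformers
    from packaging.version import Version  # installed as a transformers dependency
    ver = transformers.__version__
    print(f'transformers: {ver}')
    # Compare parsed versions: as plain strings, '4.6.0' >= '4.54.0' is True
    if Version(ver) >= Version('4.54.0'):
        print('WARNING: transformers >= 4.54.0 has 31% regression. Downgrading...')
        subprocess.check_call([sys.executable, '-m', 'pip', 'install',
                               'transformers==4.48.0', '-q'])
        print('Restart the kernel so the downgraded package is the one imported.')
except ImportError:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install',
                           'transformers==4.48.0', '-q'])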

# Install Triton client
subprocess.check_call([sys.executable, '-m', 'pip', 'install',
                       'tritonclient[http]', '-q'])

# Show Neuron devices
r = subprocess.run(['neuron-ls'], capture_output=True, text=True)
print(f'\n{r.stdout}')

PyTorch: 2.9.0+cu128
torch-neuronx: 2.9.0.2.12.22436+0f1dac25
transformers: 4.48.0



instance-type: trn2.3xlarge
instance-id: i-09dbb4802167c2239
logical-neuroncore-config: 2
+--------+--------+----------+--------+--------------+----------+------+
| NEURON | NEURON |  NEURON  | NEURON |     PCI      |   CPU    | NUMA |
| DEVICE | CORES  | CORE IDS | MEMORY |     BDF      | AFFINITY | NODE |
+--------+--------+----------+--------+--------------+----------+------+
| 0      | 4      | 0-3      | 96 GB  | 0000:33:00.0 | 0-11     | 0    |
+--------+--------+----------+--------+--------------+----------+------+



---
## Step 1: Compile DistilBERT for All Batch Sizes (LNC=1)

Compile 5 model variants (BS=1, 2, 4, 8, 16) with sequence length 128 and **LNC=1**.
Each variant takes ~5 minutes to compile (~25 minutes total; other models may differ). The loop skips any variant whose `.pt` file already exists, so reruns finish much faster. In production you would deploy the precompiled models.

**Critical**: The `--lnc 1` compiler flag produces models that are **incompatible** with LNC=2.
Never mix LNC=1 and LNC=2 compiled models.

The `--auto-cast matmult` flag casts matrix multiplications to BF16, yielding ~2.8x the FP32 throughput
with negligible accuracy loss (cosine similarity > 0.99999 vs FP32). This matches the
compilation settings used in the [AWS Neuron SDK benchmarks](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/benchmarks/inf2/inf2-performance.html).
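
For reference, cosine similarity between two embeddings can be computed as in the sketch below. It only illustrates the metric: the vectors are random stand-ins, and `to_bf16` simply truncates each output value to BF16 precision, which is not what the compiler does (it casts matrix-multiplication operands inside the model). It is not the procedure used to obtain the figure quoted above.

```python
import math
import random
import struct


def to_bf16(x: float) -> float:
    """Truncate a value to BF16 by keeping the top 16 bits of its FP32 encoding."""
    top_two_bytes = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top_two_bytes + b"\x00\x00")[0]


def cosine_similarity(a, b) -> float:
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


random.seed(0)
reference = [random.gauss(0.0, 1.0) for _ in range(768)]  # stand-in FP32 embedding
reduced = [to_bf16(x) for x in reference]                 # stand-in reduced-precision copy

print(f"cosine similarity: {cosine_similarity(reference, reduced):.6f}")
```

To validate a compiled model, the same function would be applied to the CLS embeddings produced by the FP32 CPU model and by the Neuron model for identical inputs.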

In [2]:
from transformers import DistilBertModel, DistilBertTokenizer

os.environ['NEURON_RT_LOG_LEVEL'] = 'ERROR'

SEQ_LENGTH = 128
LNC = 1
MODEL_DIR = os.path.expanduser('~/triton_repo/distilbert/1')
BATCH_SIZES = [1, 2, 4, 8, 16]

os.makedirs(MODEL_DIR, exist_ok=True)

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
model.eval()

print(f'Compiling DistilBERT for batch sizes: {BATCH_SIZES}')
print(f'Sequence length: {SEQ_LENGTH}, LNC: {LNC}\n')

for bs in BATCH_SIZES:
    output_path = os.path.join(MODEL_DIR, f'model_bs{bs}.pt')
    if os.path.exists(output_path):
        print(f'  BS={bs}: already compiled, skipping')
        continue
    print(f'  Compiling BS={bs}...')
    texts = ['Test sentence.'] * bs
    inputs = tokenizer(texts, return_tensors='pt', max_length=SEQ_LENGTH,
                       padding='max_length', truncation=True)
    model_neuron = torch_neuronx.trace(
        model, (inputs['input_ids'], inputs['attention_mask']),
        compiler_args=['--model-type', 'transformer', '--optlevel', '2',
                       '--auto-cast', 'matmult', '--lnc', str(LNC)])
    torch.jit.save(model_neuron, output_path)
    size_mb = os.path.getsize(output_path) / (1024 * 1024)
    print(f'    Saved: {output_path} ({size_mb:.1f} MB)')

print('\nCompiled models:')
for f in sorted(os.listdir(MODEL_DIR)):
    if f.endswith('.pt'):
        size = os.path.getsize(os.path.join(MODEL_DIR, f)) / (1024 * 1024)
        print(f'  {f}: {size:.1f} MB')

Compiling DistilBERT for batch sizes: [1, 2, 4, 8, 16]
Sequence length: 128, LNC: 1

  Compiling BS=1...


.

Completed run_backend_driver.



Compiler status PASS


    Saved: /home/ubuntu/triton_repo/distilbert/1/model_bs1.pt (148.1 MB)
  Compiling BS=2...


.

Completed run_backend_driver.



Compiler status PASS


    Saved: /home/ubuntu/triton_repo/distilbert/1/model_bs2.pt (148.1 MB)
  Compiling BS=4...


.

Completed run_backend_driver.



Compiler status PASS


    Saved: /home/ubuntu/triton_repo/distilbert/1/model_bs4.pt (148.2 MB)
  Compiling BS=8...


.

Completed run_backend_driver.



Compiler status PASS


    Saved: /home/ubuntu/triton_repo/distilbert/1/model_bs8.pt (148.4 MB)
  Compiling BS=16...


.

Completed run_backend_driver.



Compiler status PASS


    Saved: /home/ubuntu/triton_repo/distilbert/1/model_bs16.pt (149.1 MB)

Compiled models:
  model_bs1.pt: 148.1 MB
  model_bs16.pt: 149.1 MB
  model_bs2.pt: 148.1 MB
  model_bs4.pt: 148.2 MB
  model_bs8.pt: 148.4 MB


---
## Step 2: Write Triton Model Repository Files (LNC=1)

Write `config.pbtxt` and `model.py` into the model repository alongside the compiled `.pt` files.

Key differences from LNC=2:
- `instance_group count: 8` (8 logical cores)
- `NEURON_LOGICAL_NC_CONFIG = "1"` in model.py
- Core IDs 0-7 (vs 0-3)

In [3]:
REPO_DIR = os.path.expanduser('~/triton_repo/distilbert')

# ── config.pbtxt (LNC=1: 8 instances) ─────────────────────────────────────────
config_pbtxt = r"""name: "distilbert"
platform: "python"
backend: "python"
max_batch_size: 16

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [128]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [128]
  }
]

output [
  {
    name: "embeddings"
    data_type: TYPE_FP32
    dims: [768]
  }
]

instance_group [
  {
    count: 8
    kind: KIND_CPU
  }
]

dynamic_batching {
  preferred_batch_size: [4, 8, 16]
  max_queue_delay_microseconds: 5000
}

parameters: {
  key: "model_dir"
  value: { string_value: "/models/distilbert/1" }
}

parameters: {
  key: "tokenizer_name"
  value: { string_value: "distilbert-base-uncased" }
}

parameters: {
  key: "max_seq_length"
  value: { string_value: "128" }
}
"""

with open(os.path.join(REPO_DIR, 'config.pbtxt'), 'w') as f:
    f.write(config_pbtxt)
print('Wrote config.pbtxt (instance_group count=8 for LNC=1)')

# ── model.py (Triton Python backend, LNC=1) ──────────────────────────────────
model_py = r'''#!/usr/bin/env python3
"""
Triton Python Backend for DistilBERT on AWS Neuron - Multi-Model (LNC=1)

Loads compiled models for each batch size (1, 2, 4, 8, 16) and
dispatches to the best-fit model to avoid padding waste.
Each Triton instance is pinned to a separate Neuron core (cores 0-7).
"""

import os
import json
import numpy as np

try:
    import triton_python_backend_utils as pb_utils
except ImportError:
    pass

import torch
import torch_neuronx
from transformers import DistilBertTokenizer


class TritonPythonModel:
    BATCH_SIZES = [1, 2, 4, 8, 16]

    def initialize(self, args):
        self.model_config = json.loads(args["model_config"])
        params = self.model_config.get("parameters", {})
        model_dir = params.get("model_dir", {}).get(
            "string_value", "/models/distilbert/1")
        tokenizer_name = params.get("tokenizer_name", {}).get(
            "string_value", "distilbert-base-uncased")
        self.max_seq_length = int(
            params.get("max_seq_length", {}).get("string_value", "128"))

        # Pin this instance to a specific Neuron core (0-7 for LNC=1)
        instance_name = args.get("model_instance_name", "distilbert_0_0")
        core_id = int(instance_name.split("_")[-1])
        os.environ["NEURON_RT_VISIBLE_CORES"] = str(core_id)
        os.environ["NEURON_LOGICAL_NC_CONFIG"] = "1"
        print(f"Instance {instance_name}: pinned to Neuron core {core_id}")

        self.tokenizer = DistilBertTokenizer.from_pretrained(tokenizer_name)
        print("Tokenizer loaded")

        # Load all compiled models
        self.models = {}
        for bs in self.BATCH_SIZES:
            model_path = os.path.join(model_dir, f"model_bs{bs}.pt")
            if os.path.exists(model_path):
                self.models[bs] = torch.jit.load(model_path)
                self.models[bs].eval()
                print(f"  Loaded model for batch_size={bs}")
            else:
                print(f"  WARNING: Model not found: {model_path}")

        if not self.models:
            raise RuntimeError("No compiled models found!")

        # Warmup all models
        print("Warming up models...")
        for bs, mdl in self.models.items():
            dummy_ids = torch.zeros((bs, self.max_seq_length), dtype=torch.long)
            dummy_mask = torch.ones((bs, self.max_seq_length), dtype=torch.long)
            with torch.no_grad():
                for _ in range(3):
                    _ = mdl(dummy_ids, dummy_mask)
            print(f"  Warmed up BS={bs}")
        print(f"Model initialization complete! "
              f"Available batch sizes: {sorted(self.models.keys())}")

    def _get_best_model(self, actual_batch_size):
        for bs in self.BATCH_SIZES:
            if bs >= actual_batch_size and bs in self.models:
                return bs, self.models[bs]
        largest = max(self.models.keys())
        return largest, self.models[largest]

    def execute(self, requests):
        responses = []
        for request in requests:
            input_ids = torch.from_numpy(
                pb_utils.get_input_tensor_by_name(request, "input_ids").as_numpy()
            ).long()
            attention_mask = torch.from_numpy(
                pb_utils.get_input_tensor_by_name(request, "attention_mask").as_numpy()
            ).long()

            actual_bs = input_ids.shape[0]
            target_bs, model = self._get_best_model(actual_bs)

            # Pad to compiled batch size if needed
            if actual_bs < target_bs:
                pad = target_bs - actual_bs
                input_ids = torch.cat([input_ids,
                    torch.zeros((pad, input_ids.shape[1]), dtype=torch.long)], dim=0)
                attention_mask = torch.cat([attention_mask,
                    torch.zeros((pad, attention_mask.shape[1]), dtype=torch.long)], dim=0)

            with torch.no_grad():
                outputs = model(input_ids, attention_mask)

            embeddings = outputs["last_hidden_state"][:, 0, :].cpu().numpy()
            embeddings = embeddings[:actual_bs, :]

            output_tensor = pb_utils.Tensor("embeddings",
                                            embeddings.astype(np.float32))
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[output_tensor]))
        return responses

    def finalize(self):
        print("Finalizing DistilBERT model...")
        if hasattr(self, "models"):
            # Clearing the dict drops the references to the loaded models;
            # `del` on a loop variable would not release anything.
            self.models.clear()
'''

with open(os.path.join(MODEL_DIR, 'model.py'), 'w') as f:
    f.write(model_py)
print('Wrote model.py (NEURON_LOGICAL_NC_CONFIG=1, cores 0-7)')

# Verify
print('\nModel repository:')
for root, dirs, files in os.walk(os.path.expanduser('~/triton_repo')):
    level = root.replace(os.path.expanduser('~/triton_repo'), '').count(os.sep)
    indent = '  ' * level
    print(f'{indent}{os.path.basename(root)}/')
    for f in sorted(files):
        size = os.path.getsize(os.path.join(root, f)) / (1024 * 1024)
        label = f'{f} ({size:.1f} MB)' if size > 1 else f
        print(f'{indent}  {label}')

Wrote config.pbtxt (instance_group count=8 for LNC=1)
Wrote model.py (NEURON_LOGICAL_NC_CONFIG=1, cores 0-7)

Model repository:
triton_repo/
  distilbert/
    config.pbtxt
    1/
      model.py
      model_bs1.pt (148.1 MB)
      model_bs16.pt (149.1 MB)
      model_bs2.pt (148.1 MB)
      model_bs4.pt (148.2 MB)
      model_bs8.pt (148.4 MB)


---
## Step 3: Build Triton + Neuron Docker Image

Triton has no native Neuron backend, so we build Triton from source inside the AWS Neuron
PyTorch inference base image. This takes **15-20 minutes** and produces a large image (`docker images` in the run below reports 24.1 GB of disk usage).

The cell skips the build if the image already exists.

In [4]:
DOCKER_IMAGE = 'triton-neuron-distilbert:latest'

# Check if already built
r = subprocess.run(['docker', 'images', '-q', DOCKER_IMAGE],
                   capture_output=True, text=True)
if r.stdout.strip():
    print(f'Docker image already exists: {r.stdout.strip()}')
    print('Delete with: docker rmi ' + DOCKER_IMAGE)
else:
    # Write Dockerfile
    dockerfile = r"""ARG BASE_IMAGE=public.ecr.aws/neuron/pytorch-inference-neuronx:2.9.0-neuronx-py312-sdk2.27.1-ubuntu24.04
FROM $BASE_IMAGE

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PJRT_DEVICE=NEURON \
    LD_LIBRARY_PATH="/opt/conda/lib:/opt/aws/neuron/lib:/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}" \
    PATH="/opt/program:/opt/aws/neuron/bin:/opt/tritonserver/bin:${PATH}"

RUN apt-get update && apt-get install -y --no-install-recommends \
    wget gnupg2 build-essential git nginx pkg-config unzip \
    libssl-dev libcurl4-openssl-dev libgoogle-perftools-dev \
    libnuma-dev libarchive-dev libxml2-dev zlib1g-dev \
    autoconf automake libtool gperf scons patchelf \
    libre2-dev libb64-dev rapidjson-dev libboost-dev \
    cmake cmake-data \
    && rm -rf /var/lib/apt/lists/*

RUN pip3 install --no-cache-dir --upgrade pip setuptools wheel virtualenv build cmake==3.31.10
RUN pip3 install transformers==4.48.0

RUN git clone --depth=1 --branch=r26.01 https://github.com/triton-inference-server/server.git /server && \
    cd /server && \
    Python3_ROOT_DIR=/opt/conda \
    Python3_EXECUTABLE=/opt/conda/bin/python3 \
    Python3_INCLUDE_DIR=/opt/conda/include/python3.12 \
    Python3_LIBRARY=/opt/conda/lib/libpython3.12.so \
    ./build.py -v --no-container-build --build-dir=/server/build --backend=python \
    --enable-metrics --enable-logging --enable-stats --endpoint="http" --endpoint="grpc" && \
    cp -r /server/build/opt/* /opt/ && \
    cd / && rm -rf /server

EXPOSE 8000 8001 8002
CMD ["tritonserver", "--model-repository=/models"]
"""
    dockerfile_path = os.path.expanduser('~/Dockerfile.triton-neuron')
    with open(dockerfile_path, 'w') as f:
        f.write(dockerfile)

    print('Building Triton + Neuron Docker image (15-20 minutes)...')
    r = subprocess.run(
        ['docker', 'build', '-f', dockerfile_path, '-t', DOCKER_IMAGE, '.'],
        cwd=os.path.expanduser('~'),
        capture_output=True, text=True, timeout=2400)
    if r.returncode == 0:
        print('Build complete!')
    else:
        print(f'Build FAILED (rc={r.returncode})')
        print(r.stderr[-3000:])

# Show image
r = subprocess.run(['docker', 'images', DOCKER_IMAGE], capture_output=True, text=True)
print(r.stdout)

Docker image already exists: 26b55f589d0c
Delete with: docker rmi triton-neuron-distilbert:latest
IMAGE                             ID             DISK USAGE   CONTENT SIZE   EXTRA
triton-neuron-distilbert:latest   26b55f589d0c       24.1GB         7.82GB        



---
## Step 4: Start Triton Server

Launch the Docker container with the Neuron device mounted and the model repository bind-mounted.
With 8 instances (LNC=1), model loading takes longer than with 4; allow ~90-120 seconds, although the run recorded below was ready after ~36 seconds.
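
Readiness polling can be factored into a small helper with an explicit deadline. This is a sketch under stated assumptions, not the cell's actual code: `wait_until` is a hypothetical name, and the `clock` and `sleep` parameters exist only so the helper can be exercised without real waiting.

```python
import time


def wait_until(check, timeout_s=360.0, interval_s=2.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns True; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        try:
            if check():
                return clock() - start
        except Exception:
            pass  # treat errors such as "connection refused" as "not ready yet"
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


# Example with a stand-in check that succeeds on the third call.
attempts = {"n": 0}

def fake_ready():
    attempts["n"] += 1
    return attempts["n"] >= 3

elapsed = wait_until(fake_ready, interval_s=0.01)
print(f"ready after {attempts['n']} attempts")

# Against a live server the check could be, for instance:
#   lambda: urllib.request.urlopen(
#       "http://localhost:8000/v2/health/ready", timeout=2).status == 200
```

Using `time.monotonic` for the deadline means the reported wait includes time spent inside slow or timed-out checks, which a fixed iteration count does not capture.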

In [5]:
import urllib.request

NUM_INSTANCES = 8  # LNC=1: 8 logical cores

# Stop any previous run
subprocess.run(['docker', 'rm', '-f', 'triton-distilbert'],
               capture_output=True)
time.sleep(3)

cmd = [
    'docker', 'run', '-d',
    '--name', 'triton-distilbert',
    '--device=/dev/neuron0',
    '-v', os.path.expanduser('~/triton_repo') + ':/models:ro',
    '-p', '8000:8000', '-p', '8001:8001', '-p', '8002:8002',
    DOCKER_IMAGE,
    'tritonserver', '--model-repository=/models', '--log-verbose=0',
]
r = subprocess.run(cmd, capture_output=True, text=True)
if r.returncode != 0:
    print(f'Failed to start container: {r.stderr}')
else:
    print(f'Container started. Waiting for {NUM_INSTANCES} model instances to load (~90-120s)...')

# Poll for readiness
for i in range(180):
    time.sleep(2)
    try:
        resp = urllib.request.urlopen('http://localhost:8000/v2/health/ready', timeout=2)
        if resp.status == 200:
            elapsed = (i + 1) * 2
            print(f'Server ready after ~{elapsed}s')
            break
    except Exception:
        pass
else:
    print('Timeout waiting for server!')
    r = subprocess.run(['docker', 'logs', '--tail', '30', 'triton-distilbert'],
                       capture_output=True, text=True)
    print(r.stdout)

# Verify instances loaded
r = subprocess.run(['docker', 'logs', 'triton-distilbert'],
                   capture_output=True, text=True)
n = r.stdout.count('Model initialization complete')
print(f'Model instances initialized: {n}/{NUM_INSTANCES}')

Container started. Waiting for 8 model instances to load (~90-120s)...


Server ready after ~36s
Model instances initialized: 8/8


---
## Step 5: Run Benchmark

Runs the full test matrix: 1 baseline + 15 dynamic-batching configurations (3 concurrency levels
x 5 batch sizes), 10 seconds each. Total runtime ~3 minutes.

Concurrency levels are 16/32/64 (vs 8/16/32 for LNC=2) to maintain 2 workers per core as baseline.

**Note**: One worker per test will print a harmless greenlet thread-switch error. This is a
known cosmetic issue in `tritonclient` and does not affect results -- the remaining workers run fine.
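
The two reported metrics reduce to short formulas: percentiles over the per-request latencies, and throughput as requests times batch size divided by elapsed time. The sketch below is a standard-library illustration with hypothetical function names; the benchmark itself uses `np.percentile`, and `percentile` here follows the same linear-interpolation convention as NumPy's default.

```python
def percentile(samples, pct: float) -> float:
    """Linear-interpolation percentile of `samples` for pct in [0, 100]."""
    xs = sorted(samples)
    if not xs:
        raise ValueError("no samples")
    rank = (len(xs) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (rank - lo)


def throughput(num_requests: int, batch_size: int, elapsed_s: float) -> float:
    """Inferences per second: every request carries `batch_size` inputs."""
    return num_requests * batch_size / elapsed_s


latencies_ms = [4.1, 4.3, 4.2, 9.8, 4.4, 4.0, 4.5, 4.3]  # made-up sample values
print(f"P50={percentile(latencies_ms, 50):.2f} ms  "
      f"P99={percentile(latencies_ms, 99):.2f} ms")

# 13,338 requests of batch size 16, assuming exactly 10 s of wall time.
print(f"{throughput(13338, 16, 10.0):.1f} inf/s")
```

The last line uses the request count recorded in this notebook for BS=16 at concurrency 16. The benchmark reports a slightly lower 21,289 inf/s because its measured wall time includes thread startup and join, so it runs a little over 10 seconds.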

In [6]:
import numpy as np
from transformers import AutoTokenizer
import tritonclient.http as httpclient
import threading
from queue import Queue

TRITON_URL = 'localhost:8000'
MODEL_NAME = 'distilbert'
DURATION = 10.0  # seconds per test

tok = AutoTokenizer.from_pretrained('distilbert-base-uncased')


def _worker(client, batch_size, duration_s, latencies_q, stop_evt):
    sentences = ['hello world'] * batch_size
    tokens = tok(sentences, max_length=128, padding='max_length',
                 truncation=True, return_tensors='np')
    ids_np = tokens['input_ids'].astype(np.int64)
    mask_np = tokens['attention_mask'].astype(np.int64)
    inp_ids = httpclient.InferInput('input_ids', ids_np.shape, 'INT64')
    inp_mask = httpclient.InferInput('attention_mask', mask_np.shape, 'INT64')
    inp_ids.set_data_from_numpy(ids_np)
    inp_mask.set_data_from_numpy(mask_np)
    t_end = time.time() + duration_s
    while time.time() < t_end and not stop_evt.is_set():
        try:
            t0 = time.time()
            client.infer(model_name=MODEL_NAME, inputs=[inp_ids, inp_mask])
            latencies_q.put((time.time() - t0) * 1000)
        except Exception:
            break


def run_concurrent(batch_size, num_workers, duration_s=DURATION):
    clients = [httpclient.InferenceServerClient(url=TRITON_URL)
               for _ in range(num_workers)]
    # Warmup
    tokens = tok(['hello world'] * batch_size, max_length=128,
                 padding='max_length', truncation=True, return_tensors='np')
    ids_np = tokens['input_ids'].astype(np.int64)
    mask_np = tokens['attention_mask'].astype(np.int64)
    inp_ids = httpclient.InferInput('input_ids', ids_np.shape, 'INT64')
    inp_mask = httpclient.InferInput('attention_mask', mask_np.shape, 'INT64')
    inp_ids.set_data_from_numpy(ids_np)
    inp_mask.set_data_from_numpy(mask_np)
    for _ in range(5):
        clients[0].infer(model_name=MODEL_NAME, inputs=[inp_ids, inp_mask])

    q = Queue()
    stop = threading.Event()
    threads = []
    t_start = time.time()
    for i in range(num_workers):
        t = threading.Thread(target=_worker,
                             args=(clients[i], batch_size, duration_s, q, stop))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    total_time = time.time() - t_start

    latencies = []
    while not q.empty():
        latencies.append(q.get())
    if not latencies:
        return None
    return {
        'batch_size': batch_size, 'num_workers': num_workers,
        'total_requests': len(latencies),
        'p50': np.percentile(latencies, 50),
        'p95': np.percentile(latencies, 95),
        'p99': np.percentile(latencies, 99),
        'throughput': (len(latencies) * batch_size) / total_time,
    }


def run_baseline(duration_s=DURATION):
    client = httpclient.InferenceServerClient(url=TRITON_URL)
    tokens = tok(['hello world'], max_length=128, padding='max_length',
                 truncation=True, return_tensors='np')
    ids_np = tokens['input_ids'].astype(np.int64)
    mask_np = tokens['attention_mask'].astype(np.int64)
    inp_ids = httpclient.InferInput('input_ids', ids_np.shape, 'INT64')
    inp_mask = httpclient.InferInput('attention_mask', mask_np.shape, 'INT64')
    inp_ids.set_data_from_numpy(ids_np)
    inp_mask.set_data_from_numpy(mask_np)
    for _ in range(10):
        client.infer(model_name=MODEL_NAME, inputs=[inp_ids, inp_mask])
    latencies = []
    t_start = time.time()
    while time.time() - t_start < duration_s:
        t0 = time.time()
        client.infer(model_name=MODEL_NAME, inputs=[inp_ids, inp_mask])
        latencies.append((time.time() - t0) * 1000)
    total_time = time.time() - t_start
    return {
        'batch_size': 1, 'num_workers': 1,
        'total_requests': len(latencies),
        'p50': np.percentile(latencies, 50),
        'p95': np.percentile(latencies, 95),
        'p99': np.percentile(latencies, 99),
        'throughput': len(latencies) / total_time,
    }


# ── Run all tests ─────────────────────────────────────────────────────────────
client = httpclient.InferenceServerClient(url=TRITON_URL)
assert client.is_server_ready(), 'Triton server not ready!'

print('DISTILBERT TRITON BENCHMARK - Neuron (trn2.3xlarge, LNC=1)')
print('=' * 90)

results = []

# Baseline
print('\nBaseline: single request, no concurrency...')
r = run_baseline()
results.append(r)
print(f'  P50: {r["p50"]:.2f}ms  Throughput: {r["throughput"]:.0f} inf/sec')

# Dynamic batching -- concurrency levels scaled for 8 cores
for conc in [16, 32, 64]:
    print(f'\nConcurrency={conc}:')
    for bs in [1, 2, 4, 8, 16]:
        r = run_concurrent(bs, conc)
        if r:
            results.append(r)
            print(f'  BS={bs:<3} P50={r["p50"]:>7.2f}ms  '
                  f'P95={r["p95"]:>7.2f}ms  '
                  f'P99={r["p99"]:>7.2f}ms  '
                  f'Throughput={r["throughput"]:>8.0f} inf/sec')

# Summary table
print(f'\n{"=" * 90}')
print(f'{"Batch":<8}{"Workers":<10}{"Requests":<12}'
      f'{"P50 (ms)":<12}{"P95 (ms)":<12}{"P99 (ms)":<12}{"Throughput":<15}')
print('-' * 90)
for r in results:
    print(f'{r["batch_size"]:<8}{r["num_workers"]:<10}{r["total_requests"]:<12}'
          f'{r["p50"]:<12.2f}{r["p95"]:<12.2f}{r["p99"]:<12.2f}'
          f'{r["throughput"]:<15.0f}')

DISTILBERT TRITON BENCHMARK - Neuron (trn2.3xlarge, LNC=1)

Baseline: single request, no concurrency...


  P50: 7.10ms  Throughput: 141 inf/sec

Concurrency=16:


  BS=1   P50=   4.37ms  P95=   5.26ms  P99=   5.63ms  Throughput=    3392 inf/sec


  BS=2   P50=   3.49ms  P95=   5.26ms  P99=   6.55ms  Throughput=    8048 inf/sec


  BS=4   P50=   3.79ms  P95=   7.08ms  P99=   9.23ms  Throughput=   14328 inf/sec


  BS=8   P50=   8.05ms  P95=   9.45ms  P99=  10.16ms  Throughput=   17310 inf/sec


  BS=16  P50=  11.18ms  P95=  12.86ms  P99=  13.95ms  Throughput=   21289 inf/sec

Concurrency=32:


  BS=1   P50=   6.15ms  P95=  11.82ms  P99=  15.88ms  Throughput=    4489 inf/sec


  BS=2   P50=   5.65ms  P95=  14.31ms  P99=  19.99ms  Throughput=    9062 inf/sec


  BS=4   P50=   7.06ms  P95=  16.46ms  P99=  22.23ms  Throughput=   14898 inf/sec


  BS=8   P50=  14.31ms  P95=  15.74ms  P99=  16.75ms  Throughput=   17229 inf/sec


  BS=16  P50=  23.10ms  P95=  25.63ms  P99=  27.00ms  Throughput=   21176 inf/sec

Concurrency=64:


  BS=1   P50=   9.77ms  P95=  27.70ms  P99=  40.13ms  Throughput=    4909 inf/sec


  BS=2   P50=   8.90ms  P95=  32.58ms  P99=  48.74ms  Throughput=    9685 inf/sec


  BS=4   P50=  13.14ms  P95=  34.36ms  P99=  47.95ms  Throughput=   15641 inf/sec


  BS=8   P50=  29.03ms  P95=  31.04ms  P99=  32.49ms  Throughput=   16880 inf/sec


  BS=16  P50=  47.05ms  P95=  50.73ms  P99=  52.66ms  Throughput=   21151 inf/sec

Batch   Workers   Requests    P50 (ms)    P95 (ms)    P99 (ms)    Throughput     
------------------------------------------------------------------------------------------
1       1         1409        7.10        7.30        7.37        141            
1       16        33973       4.37        5.26        5.63        3392           
2       16        40417       3.49        5.26        6.55        8048           
4       16        35894       3.79        7.08        9.23        14328          
8       16        21687       8.05        9.45        10.16       17310          
16      16        13338       11.18       12.86       13.95       21289          
1       32        45128       6.15        11.82       15.88       4489           
2       32        45712       5.65        14.31       19.99       9062           
4       32        37701       7.06        16.46       22.23       14898          
8     

---
## Step 6: Cleanup

In [7]:
subprocess.run(['docker', 'rm', '-f', 'triton-distilbert'], capture_output=True)
print('Triton server stopped and removed.')

Triton server stopped and removed.
