<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large-scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# vLLM Inference Engine

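This notebook demonstrates how to use the `VLLMInferenceEngine` class for inference with Llama models. The config used below defaults to Llama 3.1 8B, with Llama 3.3 70B available as a commented-out alternative.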

# Prerequisites

## Machine Requirements

❗**NOTICE:** This notebook doesn't run on Colab because the GPU is too old to be supported by vLLM.

We recommend running this notebook on a machine with GPU support, as vLLM is mainly intended to run on GPUs. Serving Llama 3.3 70B in bfloat16 requires roughly 140GB of VRAM for the weights alone, though we also provide examples below for inference with Llama 3.1 8B, Llama 3.2 1B, and quantized Llama 3.3 70B that require less memory.
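
The 140GB figure can be reproduced with back-of-the-envelope arithmetic: 70 billion parameters at 2 bytes each (bfloat16). The sketch below is an estimate for the weights only; it ignores the KV cache and activations, so treat the GPU counts as lower bounds. The helper names are illustrative and are not part of Oumi or vLLM.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(total_gb: float, gpu_memory_gb: float) -> int:
    """Smallest number of GPUs whose combined memory can hold the weights."""
    return math.ceil(total_gb / gpu_memory_gb)


# bfloat16 stores each parameter in 2 bytes.
llama_70b_bf16_gb = weight_memory_gb(70e9, 2)

print(f"Weights: {llama_70b_bf16_gb:.0f} GB")
print(f"A100-40GB GPUs needed: {min_gpus_for_weights(llama_70b_bf16_gb, 40)}")
print(f"A100-80GB GPUs needed: {min_gpus_for_weights(llama_70b_bf16_gb, 80)}")
```

Under these assumptions the estimate comes out to 4 A100-40GB GPUs or 2 A100-80GB GPUs.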

If your local machine cannot run this notebook, you can instead run this notebook on a cloud platform. The following demonstrates how to open a VSCode instance backed by a GCP node with 4 A100 GPUs, from which the notebook can be run.

```bash
# Run on your local machine
gcloud auth application-default login  # Authenticate with GCP
make gcpcode ARGS="--resources.accelerators A100:4"  # 4 A100-40GB GPUs, enough for 70B model. Can also use 2x "A100-80GB"
```

## Oumi Installation

First, let's install Oumi and vLLM. You can find more detailed instructions about Oumi installation [here](https://oumi.ai/docs/en/latest/get_started/installation.html). Here, we include Oumi's GPU dependencies.


In [None]:
%pip install oumi[gpu]

## Llama Access

Llama 3.3 70B is a gated model on the Hugging Face Hub. To run this notebook, you must first complete the [agreement](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) on Hugging Face and wait for it to be accepted. Then, set `HF_TOKEN` below to enable access to the model if it's not already set.

You can usually retrieve the token by running `cat ~/.cache/huggingface/token` on your local machine.
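
The same lookup can be done from Python. The following is a minimal sketch that prefers the `HF_TOKEN` environment variable and falls back to the token file mentioned above; `find_hf_token` is a hypothetical helper, not an Oumi or Hugging Face API.

```python
import os
from pathlib import Path
from typing import Optional


def find_hf_token(token_file: Optional[Path] = None) -> Optional[str]:
    """Return the HF token from the environment, else from the token file."""
    env_token = os.environ.get("HF_TOKEN")
    if env_token:
        return env_token
    if token_file is None:
        # Location taken from the `cat` command above.
        token_file = Path.home() / ".cache" / "huggingface" / "token"
    if token_file.is_file():
        # Strip the trailing newline; treat an empty file as "no token".
        return token_file.read_text().strip() or None
    return None


token = find_hf_token()
# Avoid printing the token itself; only report whether one was found.
print("HF token found" if token else "No HF token found")
```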

In [1]:
import os

# if not os.environ.get("HF_TOKEN"):
#     # NOTE: Set your Hugging Face token here if not already set.
#     os.environ["HF_TOKEN"] = "<MY_HF_TOKEN>"
# hf_token = os.environ.get("HF_TOKEN")
# print(f"Using HF Token: '{hf_token}'")
import dotenv

dotenv.load_dotenv()

# This is needed for vLLM to use multiple GPUs in a notebook.
# If you're not running in a notebook, you can ignore this.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

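To download the model to your machine before inference, run the following. This example downloads Llama 3.1 8B Instruct; substitute `meta-llama/Llama-3.3-70B-Instruct` to download the 70B model instead.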

In [2]:
%pip install hf_transfer
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
! huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --exclude "original/*"

Note: you may need to restart the kernel to use updated packages.
/home/shanghong/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659


In [4]:
import torch

from oumi.core.configs import InferenceConfig
from oumi.core.types import Conversation, Message, Role
from oumi.inference import VLLMInferenceEngine

INFO 06-30 22:44:05 [__init__.py:239] Automatically detected platform cuda.


In [None]:
# If we have multiple GPUs, we can use Ray to parallelize the inference.
# This is essential if you're running a model that's too big to fit in a single GPU.

import ray

if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
    ray.shutdown()
    ray.init()  # num_gpus=torch.cuda.device_count()
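
One thing to watch when using every available GPU: vLLM generally expects the model's attention head count to be divisible by the tensor parallel size, so an odd GPU count such as 3 may fail at load time. The sketch below picks the largest usable size. `pick_tensor_parallel_size` is a hypothetical helper, and the head counts (32 for Llama 3.1 8B, 64 for Llama 3.3 70B) are assumptions that should be checked against `num_attention_heads` in the model's `config.json`.

```python
def pick_tensor_parallel_size(num_gpus: int, num_attention_heads: int) -> int:
    """Largest size <= num_gpus that evenly divides num_attention_heads."""
    for size in range(max(num_gpus, 1), 0, -1):
        if num_attention_heads % size == 0:
            return size
    return 1


# Assumed head counts; verify against the model's config.json.
LLAMA_8B_HEADS = 32
LLAMA_70B_HEADS = 64

print(pick_tensor_parallel_size(3, LLAMA_8B_HEADS))   # 3 does not divide 32, falls back to 2
print(pick_tensor_parallel_size(4, LLAMA_70B_HEADS))  # 4 divides 64
```

The result could be passed as `tensor_parallel_size` in place of `torch.cuda.device_count()` when creating the engine.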

### Setting up the config file

Note: in this section we are writing the config file to the current working directory.

An alternative option is to initialize the params classes directly: `ModelParams`, `GenerationParams`.
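
If you are running outside a notebook, where the `%%writefile` cell magic is unavailable, the same file can be written with plain Python. This is a minimal sketch using only the standard library; `write_infer_config` is a hypothetical helper, and the field values mirror the config used in this tutorial.

```python
from pathlib import Path

# Same fields as the YAML written in this tutorial; only the model name varies.
CONFIG_TEMPLATE = """\
model:
  model_name: "{model_name}"
  model_max_length: 512
  torch_dtype_str: "bfloat16"
  trust_remote_code: True
  attn_implementation: "sdpa"

generation:
  max_new_tokens: 128
  batch_size: 1
"""


def write_infer_config(path: str, model_name: str) -> Path:
    """Write the inference config to `path` and return it as a Path."""
    config_file = Path(path)
    config_file.write_text(CONFIG_TEMPLATE.format(model_name=model_name))
    return config_file
```

For example, `write_infer_config("vllm_tutorial_llama70b_infer.yaml", "meta-llama/Llama-3.1-8B-Instruct")` writes the file to the current working directory.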

In [5]:
config_path = "vllm_tutorial_llama70b_infer.yaml"

In [6]:
%%writefile vllm_tutorial_llama70b_infer.yaml

model:
  model_name: "meta-llama/Llama-3.1-8B-Instruct"  # 8B model, requires 1x A100-40GB GPU
  # model_name: "meta-llama/Llama-3.3-70B-Instruct"  # 70B model, requires 4x A100-40GB GPUs
  model_max_length: 512
  torch_dtype_str: "bfloat16"
  trust_remote_code: True
  attn_implementation: "sdpa"

generation:
  max_new_tokens: 128
  batch_size: 1

Writing vllm_tutorial_llama70b_infer.yaml


### Load the model and the inference engine

In [7]:
%%time

# Download, and load the model in memory
# This may take a while, depending on your internet speed.
# The inference engine only needs to be loaded once and can be
# reused for multiple conversations.

config = InferenceConfig.from_yaml(config_path)

inference_engine = VLLMInferenceEngine(
    config.model,
    tensor_parallel_size=torch.cuda.device_count(),  # use all available GPUs
    # Enable prefix caching for vLLM.
    # This is key for performance when running prompts with a long prefix,
    # such as judging or conversations with large system prompts
    # or few-shot examples.
    enable_prefix_caching=True,
)

[2025-06-30 22:44:19,807][oumi][rank0][pid:1901055][MainThread][INFO]][models.py:506] Using the model's built-in chat template for model 'meta-llama/Llama-3.1-8B-Instruct'.
INFO 06-30 22:44:29 [config.py:600] This model supports multiple tasks: {'score', 'classify', 'generate', 'embed', 'reward'}. Defaulting to 'generate'.
INFO 06-30 22:44:29 [config.py:1600] Defaulting to use mp for distributed inference
INFO 06-30 22:44:29 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 06-30 22:44:35 [__init__.py:239] Automatically detected platform cuda.
INFO 06-30 22:44:39 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=512, download_dir=None, load_format=LoadForm

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:06<00:18,  6.04s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:12<00:13,  6.52s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:19<00:06,  6.49s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:20<00:00,  4.42s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:20<00:00,  5.15s/it]

(VllmWorker rank=1 pid=1901879) INFO 06-30 22:45:44 [loader.py:447] Loading weights took 20.54 seconds
(VllmWorker rank=0 pid=1901757) INFO 06-30 22:45:44 [loader.py:447] Loading weights took 20.62 seconds
(VllmWorker rank=0 pid=1901757) INFO 06-30 22:45:44 [gpu_model_runner.py:1273] Model loading took 7.5123 GiB and 20.930658 seconds
(VllmWorker rank=1 pid=1901879) INFO 06-30 22:45:44 [gpu_model_runner.py:1273] Model loading took 7.5123 GiB and 20.897024 seconds
INFO 06-30 22:45:50 [kv_cache_utils.py:578] GPU KV cache size: 929,840 tokens
INFO 06-30 22:45:50 [kv_cache_utils.py:581] Maximum concurrency for 512 tokens per request: 1816.09x
INFO 06-30 22:45:50 [kv_cache_utils.py:578] GPU KV cache size: 929,840 tokens
INFO 06-30 22:45:50 [kv_cache_utils.py:581] Maximum concurrency for 512 tokens per request: 1816.09x
INFO 06-30 22:45:50 [core.py:162] init engine (profile, create kv cache, warmup model) took 5.99 seconds
CPU times: user 1
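
The `Maximum concurrency` figure in the log above appears to be the KV cache capacity divided by the per-request token limit (`model_max_length: 512` in the config). The sketch below reproduces the number from the logged values; the function name is illustrative and not part of vLLM.

```python
def max_concurrency(kv_cache_tokens: int, tokens_per_request: int) -> float:
    """How many full-length requests fit in the KV cache at once."""
    return kv_cache_tokens / tokens_per_request


# Values taken from the log lines above.
print(f"{max_concurrency(929_840, 512):.2f}x")  # prints 1816.09x
```

Raising `model_max_length` lowers this ratio, since each request may then occupy more of the cache.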

### Preprocessing our inputs

The inference engine expects a list of conversations, where each conversation is a list of messages.

See the [Conversation](https://github.com/oumi-ai/oumi/blob/38b3d2b27407be5fc9be5a1dd88f9ad518f3491c/src/oumi/core/types/turn.py#L109) class for more details.

Tip: you can visualize how the conversation is rendered as a prompt with the following:

```python
inference_engine.apply_chat_template(conversation, tokenize=False)
```

In [8]:
conversations = [
    Conversation(
        messages=[
            Message(
                role=Role.SYSTEM, content="Translate the following text into French."
            ),
            Message(role=Role.USER, content="Hello, how are you?"),
        ]
    ),
]

### Running inference

Under the hood, the vLLM engine batches the conversations to run inference with high throughput.

For maximum throughput, pass all your prompts to the engine in a single call rather than one at a time.

In [9]:
%%time

print(f"Running inference for {len(conversations)} conversations")

generations = inference_engine.infer(
    input=conversations,
    inference_config=config,
)

Running inference for 1 conversations
INFO 06-30 22:46:27 [chat_utils.py:396] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
CPU times: user 383 ms, sys: 43.9 ms, total: 426 ms
Wall time: 1.58 s


In [10]:
for conversation in generations:
    print(repr(conversation))
    print()

SYSTEM: Translate the following text into French.
USER: Hello, how are you?
ASSISTANT: Bonjour, comment allez-vous ?



### Bonus: Running quantized GGUF models

You can also run quantized GGUF models by downloading the model file and passing its path to the engine.

For example, to run the Llama 3.3 70B model quantized to 4 bits, you can do the following:

First, we download the GGUF model file. Multiple quantization schemes are available; here we choose `Q4_K_S`, a 4-bit scheme that uses the `K_S` quantization algorithm.

In [None]:
from huggingface_hub import hf_hub_download

repo_id = "bartowski/Llama-3.3-70B-Instruct-GGUF"
filename = "Llama-3.3-70B-Instruct-Q4_K_S.gguf"

# Downloads the model into the current working directory instead of the default Hugging Face cache
model_path = hf_hub_download(repo_id, filename=filename, local_dir=".")
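
Since a 70B download is large, it can be worth sanity-checking the file before loading it. GGUF files begin with the four ASCII bytes `GGUF`, so a cheap check is to read the header. The helper below is a hypothetical sketch using only the standard library; it does not validate the rest of the file.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # First four bytes of every GGUF file


def looks_like_gguf(path: str) -> bool:
    """Return True if `path` exists and starts with the GGUF magic bytes."""
    file = Path(path)
    if not file.is_file():
        return False
    with file.open("rb") as f:
        return f.read(4) == GGUF_MAGIC
```

Calling `looks_like_gguf(model_path)` should return `True` for a usable download; a `False` result typically points to a wrong path or an incomplete file.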

We then update the config file to point to the model we just downloaded:

In [None]:
%%writefile vllm_tutorial_llama70b_infer.yaml

model:
  # Filepath to the GGUF model, which we just downloaded, see `model_path` output above
  model_name: "Llama-3.3-70B-Instruct-Q4_K_S.gguf"
  # GGUF files do not have a config. We need to specify the tokenizer name manually.
  tokenizer_name: "meta-llama/Llama-3.3-70B-Instruct"
  model_max_length: 512
  torch_dtype_str: "float16"  # GGUF models require float16
  trust_remote_code: True
  attn_implementation: "sdpa"

generation:
  max_new_tokens: 128
  batch_size: 1