# vLLM - An Inference and Serving Library

Today we'll look at vLLM, a Python library for running inference on, and serving, large language models (LLMs).

In [None]:
!pip install vllm -qU

In [None]:
from vllm import LLM, SamplingParams

In [None]:
prompts = [
    "Retrieval Augmented Generation is",
    "The best way to fry an egg",
    "Vancian Magic refers to"
]

In [None]:
sampling_params = SamplingParams(temperature=0.4, top_p=0.95, max_tokens=120)
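
To make these sampling parameters more concrete, here is a minimal, library-independent sketch of what temperature scaling and top-p (nucleus) filtering do to a next-token distribution. The toy logits and the helper names `softmax_with_temperature` and `top_p_filter` are illustrative assumptions, not vLLM internals. `max_tokens` simply caps how many tokens are generated per prompt.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Hypothetical scores for a 4-token vocabulary (not taken from any real model).
toy_logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax_with_temperature(toy_logits, temperature=0.4)
candidates = top_p_filter(probs, top_p=0.95)
print(candidates)  # token index -> renormalized probability
```

With a temperature below 1 most of the probability mass lands on the top-scoring token, and top-p then discards the unlikely tail before a token is drawn.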

In [None]:
llm = LLM(model="mistralai/Mistral-7B-v0.1")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


INFO 01-17 15:59:22 llm_engine.py:70] Initializing an LLM engine with config: model='mistralai/Mistral-7B-v0.1', tokenizer='mistralai/Mistral-7B-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)


INFO 01-17 16:12:24 llm_engine.py:275] # GPU blocks: 8921, # CPU blocks: 2048
INFO 01-17 16:12:26 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-17 16:12:26 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 01-17 16:12:31 model_runner.py:547] Graph capturing finished in 5 secs.


In [None]:
%%time
outputs = llm.generate(prompts, sampling_params)

Processed prompts: 100%|██████████| 3/3 [00:01<00:00,  1.77it/s]

CPU times: user 1.7 s, sys: 0 ns, total: 1.7 s
Wall time: 1.69 s





In [None]:
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Prompt: 'Retrieval Augmented Generation is', Generated text: ' a technique that combines the strengths of two different approaches to text generation: Retrieval-based and Generation-based. The idea is to use a retrieval model to identify the most relevant passages from a large corpus of text, and then use a generation model to generate new text that incorporates the relevant information from those passages.\n\nOne way to think about Retrieval Augmented Generation is as a way to improve the accuracy and relevance of text generation models. By using a retrieval model to identify the most relevant passages from a large corpus of text'
Prompt: 'The best way to fry an egg', Generated text: ' is to use a non-stick pan.\n\n## How To Fry An Egg\n\nThere are many ways to fry an egg. The most common way is to use a frying pan and heat it up over medium heat. Once the pan is hot, add some oil or butter and then crack the egg into the pan. Cook the egg until the whites are cooked through and the y

In [None]:
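import gc

import torch
# Import path for the vLLM version used here; newer releases may expose this function elsewhere.
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

# Tear down the model-parallel groups, then drop the engine so its GPU tensors can be collected.
destroy_model_parallel()
del llm
gc.collect()
torch.cuda.empty_cache()
torch.distributed.destroy_process_group()
print("Successfully deleted the LLM pipeline and freed the GPU memory!")

Successfully deleted the LLM pipeline and freed the GPU memory!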


## API Service

As our example, we'll use vLLM's OpenAI-compatible API server, a FastAPI server that exposes our chosen model over HTTP. Its source can be found [here](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py).

### How Does the Server Work?

To give more context on how the model is served, we'll walk through the steps outlined in the vLLM presentation found [here](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit#slide=id.g24ad94a0065_0_102).

#### User Interface
First, how does the user interact with our served model?

vLLM uses a FastAPI web server to provide access to the `AsyncLLMEngine`, which powers the model.


#### Model Initialization and Preparation

Much of vLLM's optimization comes down to effective memory management and to preprocessing done before any request arrives, which together provide a snappy inference experience.

Let's look at the preprocessing steps:

- N workers are spun up, and each loads the model weights.
- Memory is profiled to determine how many memory blocks are available per worker.
- The `LLMEngine` pre-allocates KV blocks using the novel KV Cache Manager (see the [PagedAttention paper](https://arxiv.org/pdf/2309.06180.pdf) for more details).
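
The block counts in the engine log can be sanity-checked with a little arithmetic. The sketch below is not vLLM code. It assumes Mistral-7B's published configuration (32 layers, 8 KV heads, head dimension 128), the bfloat16 dtype shown in the log, an assumed block size of 16 tokens, and an assumed 4 GiB CPU swap budget. The helper names are made up for illustration.

```python
GIB = 1024 ** 3


def kv_block_bytes(num_layers, num_kv_heads, head_dim, block_size_tokens, dtype_bytes):
    """Bytes needed to store keys and values for one block of tokens across all layers."""
    per_token_per_layer = 2 * num_kv_heads * head_dim * dtype_bytes  # 2 = one key + one value
    return per_token_per_layer * num_layers * block_size_tokens


def num_blocks(memory_budget_bytes, block_bytes):
    """How many whole blocks fit into a memory budget."""
    return memory_budget_bytes // block_bytes


# Assumed values, not stated in this notebook: Mistral-7B config and a 16-token block.
block_bytes = kv_block_bytes(
    num_layers=32, num_kv_heads=8, head_dim=128, block_size_tokens=16, dtype_bytes=2
)

# Assumed 4 GiB CPU swap budget.
cpu_blocks = num_blocks(4 * GIB, block_bytes)
print(block_bytes, cpu_blocks)
```

Under these assumptions each block takes 2 MiB, so a 4 GiB budget holds 2048 blocks, which is consistent with the `# CPU blocks: 2048` line in the log above. By the same arithmetic, the 8921 GPU blocks would correspond to roughly 17.4 GiB of KV cache.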

Now we can move on to what happens *when a request comes in*.

- The request is tokenized and then assigned to the scheduler's waiting queue. A request is always in one of three scheduling states:
  - Waiting
  - Running
  - Swapped
- At each step, the scheduler makes its decisions based on KV block memory availability.
  - When KV block memory is available, it moves requests from waiting to running.
  - When no KV block is available for new tokens, it preempts a request and either swaps its blocks out or recomputes them later.
- Workers do the brunt of the work, including running the model with PagedAttention.
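
The scheduling logic above can be sketched as a toy model. This is a deliberately simplified illustration, not vLLM's scheduler: the class names, the "preempt the most recently admitted request" rule, and the choice to resume swapped requests before admitting new ones are assumptions made for this example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(eq=False)  # compare requests by identity, not by field values
class Request:
    name: str
    blocks: int  # KV blocks currently held (or needed to start)


class ToyScheduler:
    def __init__(self, total_blocks):
        self.free_blocks = total_blocks
        self.waiting = deque()
        self.running = []
        self.swapped = deque()

    def add_request(self, request):
        self.waiting.append(request)

    def _admit_from(self, queue):
        # Move requests to running, in arrival order, while enough blocks remain.
        while queue and queue[0].blocks <= self.free_blocks:
            request = queue.popleft()
            self.free_blocks -= request.blocks
            self.running.append(request)

    def step(self):
        # Swapped requests resume first; new ones wait until none are swapped.
        self._admit_from(self.swapped)
        if not self.swapped:
            self._admit_from(self.waiting)

    def grow(self, request, extra_blocks=1):
        # A running request needs blocks for new tokens; preempt others if none are free.
        while self.free_blocks < extra_blocks:
            victims = [r for r in self.running if r is not request]
            if not victims:
                raise MemoryError("not enough KV blocks for this request")
            victim = victims[-1]  # preempt the most recently admitted request
            self.running.remove(victim)
            self.free_blocks += victim.blocks
            # "Swap" policy; a "recompute" policy would re-queue the victim in waiting.
            self.swapped.append(victim)
        self.free_blocks -= extra_blocks
        request.blocks += extra_blocks

    def finish(self, request):
        self.running.remove(request)
        self.free_blocks += request.blocks


scheduler = ToyScheduler(total_blocks=4)
a, b, c = Request("a", 2), Request("b", 2), Request("c", 1)
for request in (a, b, c):
    scheduler.add_request(request)

scheduler.step()     # a and b fit; c stays in waiting
scheduler.grow(a)    # no free block left, so b is swapped out
scheduler.step()     # b cannot resume yet, so c keeps waiting
scheduler.finish(a)  # a completes and releases its blocks
scheduler.step()     # b resumes first, then c is admitted
print([r.name for r in scheduler.running], scheduler.free_blocks)
```

The real scheduler tracks much more (per-sequence block tables, token budgets, priorities), but the three queues and the block-availability check are the core idea.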

To run the server in Colab, you will need to open the Colab terminal, which can be found at the bottom left-hand side of the Colab window. It's the last icon!

![image](https://i.imgur.com/LmVjLmy.png)

Once the terminal is open, paste the following into it:

```
python -m vllm.entrypoints.openai.api_server \
  --host 127.0.0.1 \
  --port 8888 \
  --model mistralai/Mistral-7B-v0.1
```

Then wait for the FastAPI server to spin up. You can run the next cell once you see this message in the terminal:

![image](https://i.imgur.com/3f1k1yq.png)

In [None]:
!curl http://127.0.0.1:8888/v1/completions -H "Content-Type: application/json" -d '{"model": "mistralai/Mistral-7B-v0.1","prompt": "How cool is vLLM?","max_tokens": 100,"temperature": 0.7}'

{"id":"cmpl-ca511262ffd74d71b5074342f5cf37c3","object":"text_completion","created":11311,"model":"mistralai/Mistral-7B-v0.1","choices":[{"index":0,"text":"\n\nvLLM recently moved to a new office space at the Open Innovation Campus in Haifa, Israel.\n\nThe office space provides a cozy atmosphere and a vibrant work environment for vLLM employees and visitors.\n\nWe are excited to be part of the Open Innovation Campus community, and to have the opportunity to collaborate with other innovative companies and startups.\n\nIn addition, vLLM is positioned in the center of the campus, providing","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":8,"total_tokens":108,"completion_tokens":100}}
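
The same request can also be sent from Python. The sketch below uses only the standard library and mirrors the `curl` call above; the helper names `build_payload` and `complete` are made up for this example, and it assumes the server is still listening on `127.0.0.1:8888`.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8888/v1"  # host and port from the server command above


def build_payload(prompt, model="mistralai/Mistral-7B-v0.1", max_tokens=100, temperature=0.7):
    """Assemble the same JSON body that the curl command sends."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt, **kwargs):
    """POST the payload to the completions endpoint and return the first generated text."""
    request = urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(build_payload(prompt, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `print(complete("How cool is vLLM?"))` should print a completion similar to the `text` field in the JSON response above. Since `Mistral-7B-v0.1` is a base model rather than an instruction-tuned one, expect free-form continuations rather than direct answers.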