# vLLM

[vLLM](https://vllm.readthedocs.io/en/latest/index.html) is a fast and easy-to-use library for LLM inference and serving, offering:

* State-of-the-art serving throughput 
* Efficient management of attention key and value memory with PagedAttention
* Continuous batching of incoming requests
* Optimized CUDA kernels

This notebook goes over how to use an LLM with LangChain and vLLM.

To use it, you need the `vllm` Python package installed.

In [1]:
%pip install --upgrade --quiet vllm

In [1]:
from langchain_community.llms import VLLM

llm = VLLM(
    model="mosaicml/mpt-7b",
    trust_remote_code=True,  # needed for Hugging Face models that ship custom code
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    temperature=0.8,
)

print(llm.invoke("What is the capital of France ?"))

INFO 08-06 11:37:33 llm_engine.py:70] Initializing an LLM engine with config: model='mosaicml/mpt-7b', tokenizer='mosaicml/mpt-7b', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 08-06 11:37:41 llm_engine.py:196] # GPU blocks: 861, # CPU blocks: 512


Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  2.00it/s]


What is the capital of France ? The capital of France is Paris.





## Integrate the model in an LLMChain

In [3]:
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "Who was the US president in the year the first Pokemon game was released?"

print(llm_chain.invoke(question))

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.34s/it]



1. The first Pokemon game was released in 1996.
2. The president was Bill Clinton.
3. Clinton was president from 1993 to 2001.
4. The answer is Clinton.





## Distributed Inference

vLLM supports distributed tensor-parallel inference and serving. 

To run multi-GPU inference with the `VLLM` class, set the `tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:

In [None]:
from langchain_community.llms import VLLM

llm = VLLM(
    model="mosaicml/mpt-30b",
    tensor_parallel_size=4,
    trust_remote_code=True,  # needed for Hugging Face models that ship custom code
)

llm.invoke("What is the future of AI?")

## Quantization

vLLM supports AWQ quantization. To enable it, set `"quantization": "awq"` in `vllm_kwargs`.

In [None]:
llm_q = VLLM(
    model="TheBloke/Llama-2-7b-Chat-AWQ",
    trust_remote_code=True,
    max_new_tokens=512,
    vllm_kwargs={"quantization": "awq"},
)

## OpenAI-Compatible Server

vLLM can be deployed as a server that implements the OpenAI API protocol, so it can act as a drop-in replacement in applications that use the OpenAI API.

The server is queried in the same format as the OpenAI API.
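
To make "the same format" concrete, the following is a minimal sketch of the HTTP request a client would send to the completions endpoint. It uses only the Python standard library; the helper name `build_completion_request` and the `max_tokens` default are illustrative assumptions, and the base URL, model name, and placeholder API key are simply the values used in the cells of this notebook.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, stop=None, max_tokens=64):
    """Build (but do not send) an OpenAI-style POST request for /completions.

    Hypothetical helper for illustration; it is not part of vLLM or LangChain.
    """
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    if stop:
        payload["stop"] = stop
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder key, mirroring openai_api_key="EMPTY" in this notebook.
            "Authorization": "Bearer EMPTY",
        },
        method="POST",
    )


request = build_completion_request(
    "http://localhost:8000/v1", "tiiuae/falcon-7b", "Rome is", stop=["."]
)
print(request.full_url)
```

With a server actually running at that address, passing `request` to `urllib.request.urlopen` would send it; the LangChain wrappers in the following cells take care of this step.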

### OpenAI-Compatible Completion

In [3]:
from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8000/v1",
    model_name="tiiuae/falcon-7b",
    model_kwargs={"stop": ["."]},
)
print(llm.invoke("Rome is"))

 a city that is filled with history, ancient buildings, and art around every corner


### OpenAI-Compatible Chat Model

vLLM also supports chat messages as input. It automatically formats the input according to the template expected by the model.

In [4]:
from langchain_community.chat_models.vllm import ChatVLLMOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatVLLMOpenAI(
    openai_api_base="http://localhost:8000/v1",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    temperature=0.0,
)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a pirate. Speak like one!"),
    ("human", "{input}"),
])
chain = prompt | llm | StrOutputParser()
print(chain.invoke("What do you think of Rome?"))

Arrrr, Rome, ye say? Well, matey, I've had me share o' dealin's with them fancy-pants Romans. They think they're so high and mighty with their marble buildings and their fancy togas. But I'll tell ye this, they ain't got nothin' on the likes o' me and me crew!

We've plundered their ships, pillaged their ports, and left 'em walkin' the plank! Aye, we've had our share o' battles with the Roman Navy, but we've always come out on top. They may have their fancy legions and their centurions, but we've got the sea, and we've got the cunning!

And don't even get me started on their food! Aye, I've had me share o' Roman cuisine, and let me tell ye, it's as dull as a block o' wood! Where's the spice? Where's the flavor? Give me a good ol' fashioned sea dog's stew any day o' the week!

But, I'll give 'em this, matey: they've got some grand architecture. I've seen some o' their temples and whatnot, and they're as grand as any o' the treasures we've plundered. And their history! Aye, they've got a

## Structured Output With Guided Generation

The `ChatVLLMOpenAI` class supports the `with_structured_output` method for structuring the output of the model.
This is achieved through vLLM's support for guided generation, which constrains what the model can generate to a given schema, set of choices, or grammar.
Because the constraint is applied during generation, even small models can reliably produce output in the requested format.
Below we show several examples of how to use this feature.

We can ensure that the model output can be parsed as a pydantic model:

In [5]:
from typing import Literal
from langchain_core.pydantic_v1 import BaseModel, Field

# Example pydantic model
class CityModel(BaseModel):
    name: str = Field(..., description="Name of the city")
    population: int = Field(
        ..., description="Population of the city measured in number of inhabitants"
    )
    country: str = Field(..., description="Country of the city")
    population_category: Literal[">1M", "<1M"] = Field(
        ..., description="Population category of the city"
    )

structured_llm = llm.with_structured_output(CityModel)

city_model = structured_llm.invoke("What is the capital of France?")
assert isinstance(city_model, CityModel)
print(city_model)

name='Paris' population=2141000 country='France' population_category='>1M'
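
To illustrate the kind of constraint the `CityModel` example imposes, here is a minimal sketch that checks a raw JSON string against a hand-written schema. The schema below is an approximation written for this example, not the schema pydantic or vLLM actually produces, and `conforms` is a hypothetical helper that covers only the small subset of JSON Schema needed here.

```python
import json

# Hand-written approximation of the constraints expressed by CityModel (an assumption).
CITY_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
        "country": {"type": "string"},
        "population_category": {"enum": [">1M", "<1M"]},
    },
    "required": ["name", "population", "country", "population_category"],
}

_PY_TYPES = {"string": str, "integer": int}


def conforms(raw: str, schema: dict = CITY_SCHEMA) -> bool:
    """Return True if `raw` is a JSON object satisfying the schema subset above."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    if any(key not in data for key in schema["required"]):
        return False
    for key, rule in schema["properties"].items():
        if key not in data:
            continue
        value = data[key]
        if "enum" in rule and value not in rule["enum"]:
            return False
        expected = _PY_TYPES.get(rule.get("type"))
        if expected is not None:
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                return False
    return True


good = '{"name": "Paris", "population": 2141000, "country": "France", "population_category": ">1M"}'
bad = '{"name": "Paris", "population": "about two million", "country": "France", "population_category": ">1M"}'
print(conforms(good), conforms(bad))
```

An unconstrained model can produce strings like `bad`, or JSON that is cut off mid-object; guided generation is meant to rule such outputs out before they reach the parser.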


We can also ensure that the model output is one of a fixed set of options:

In [13]:
allowed_choices = ["positive", "negative"]
structured_llm = llm.with_structured_output(
    allowed_choices,
    guided_mode="guided_choice",
)

print(structured_llm.invoke("I loved this movie!"))

positive


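vLLM also supports constraining the output with an EBNF grammar. The grammar below is written in Lark syntax and describes arithmetic expressions: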

In [10]:
grammar = """?start: expression

?expression: term (("+" | "-") term)*

?term: factor (("*" | "/") factor)*

?factor: NUMBER
        | "-" factor
        | "(" expression ")"

%import common.NUMBER"""

structured_llm = llm.with_structured_output(
    grammar,
    guided_mode="guided_grammar",
)
print(
    structured_llm.invoke(
        "Translate two hundred and fifty-six minus three hundred and twenty-four divided by three into a mathematical expression."
    )
)

256-324/3
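
To see what the arithmetic grammar accepts, the sketch below is a small recursive-descent recognizer whose functions mirror the `expression`, `term`, and `factor` rules one-to-one. It is an independent illustration, not how vLLM enforces grammars. As a simplifying assumption, `NUMBER` is treated as digits with an optional fractional part, which is narrower than Lark's `common.NUMBER`. Like the grammar, it does not skip whitespace.

```python
import re

# Simplified stand-in for Lark's common.NUMBER (assumption: no exponents, no leading dot).
_NUMBER = re.compile(r"\d+(\.\d+)?")


def matches_grammar(text: str) -> bool:
    """Return True if `text` is derivable from the arithmetic grammar."""
    pos = 0

    def peek() -> str:
        return text[pos] if pos < len(text) else ""

    def factor() -> bool:
        # factor: NUMBER | "-" factor | "(" expression ")"
        nonlocal pos
        if peek() == "-":
            pos += 1
            return factor()
        if peek() == "(":
            pos += 1
            if not expression() or peek() != ")":
                return False
            pos += 1
            return True
        match = _NUMBER.match(text, pos)
        if match is None:
            return False
        pos = match.end()
        return True

    def term() -> bool:
        # term: factor (("*" | "/") factor)*
        nonlocal pos
        if not factor():
            return False
        while peek() in ("*", "/"):
            pos += 1
            if not factor():
                return False
        return True

    def expression() -> bool:
        # expression: term (("+" | "-") term)*
        nonlocal pos
        if not term():
            return False
        while peek() in ("+", "-"):
            pos += 1
            if not term():
                return False
        return True

    # The whole input must be consumed for the string to match.
    return expression() and pos == len(text)


print(matches_grammar("256-324/3"), matches_grammar("256 minus 324"))
```

The model output above passes this check, while free-form text such as `"256 minus 324"` does not.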


If you just want to ensure the output is valid JSON, you can specify `None` for the `schema`.

In [12]:
structured_llm = llm.with_structured_output(None)
print(structured_llm.invoke("What is the capital of France?"))

{'capital': 'Paris'}
