#### vLLM examples
- [vLLM engine args](https://docs.vllm.ai/en/latest/models/engine_args.html)
- [vLLM with LoRA](https://docs.vllm.ai/en/latest/models/lora.html)
- [LLM class](https://docs.vllm.ai/en/latest/dev/offline_inference/llm.html)
- [Example API client](https://docs.vllm.ai/en/latest/getting_started/examples/api_client.html)

#### vLLM with LoRA

In [1]:
import os,sys
sys.path.insert(0,'../libs')
from utils import donload_hf_model
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

In [2]:
# First download the LoRA adapter and the base model to local disk
sql_lora_path = donload_hf_model(REPO_ID="yard1/llama-2-7b-sql-lora-test",save_location="/root/data/hf_cache/llama-2-7b-sql-lora-test")
llm_path = donload_hf_model(REPO_ID="meta-llama/Llama-2-7b-hf",save_location="/root/data/hf_cache/Llama-2-7b-hf")

Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

Fetching 17 files:   0%|          | 0/17 [00:00<?, ?it/s]

#### Load the model and enable LoRA

In [3]:
llm = LLM(model=llm_path, enable_lora=True)

INFO 06-16 22:20:04 llm_engine.py:161] Initializing an LLM engine (v0.5.0) with config: model='/root/data/hf_cache/Llama-2-7b-hf', speculative_config=None, tokenizer='/root/data/hf_cache/Llama-2-7b-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/root/data/hf_cache/Llama-2-7b-hf)
INFO 06-16 22:20:07 model_runner.py:159] Loading model weights took 12.5562 GB
INFO 06-16 22:20:09 gpu_executor.py:83] # GPU blocks: 7418, # CPU blocks: 512
INFO 06-16 22:20:12 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected c

#### Format the prompts, specify the LoRA adapter path, and generate

In [4]:
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
    stop=["[/assistant]"]
)

prompts = [
     "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
     "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
]

outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("sql_adapter", 1, sql_lora_path)
)
# Print the generated text for each prompt
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Generated text: {generated_text!r}")
    #print(f"performance metrics: {output.metrics}")

Processed prompts: 100%|██████████| 2/2 [00:00<00:00,  4.09it/s, Generation Speed: 106.48 toks/s]

Generated text: "  SELECT icao FROM table_name_74 WHERE airport = 'lilongwe international airport' "
Generated text: "  SELECT nationality FROM table_name_11 WHERE elector = 'anchero pantaleone' "
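Both prompts above follow the same `[user] ... [/user] [assistant]` template, so they can be generated from a schema and a question instead of being written out by hand. The following is a minimal sketch; the helper name `build_sql_prompt` is hypothetical and not part of vLLM or of the adapter's repository, and the template is simply copied from the prompts used in this notebook.

```python
def build_sql_prompt(schema: str, question: str) -> str:
    """Wrap a table schema and a question in the template used above.

    Hypothetical helper: the wording and spacing mirror the hand-written
    prompts in this notebook.
    """
    return (
        "[user] Write a SQL query to answer the question based on the table schema."
        f"\n\n context: {schema}"
        f"\n\n question: {question} [/user] [assistant]"
    )


# Rebuild the two prompts from the cell above.
prompts = [
    build_sql_prompt(
        "CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)",
        "Name the ICAO for lilongwe international airport",
    ),
    build_sql_prompt(
        "CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)",
        "When Anchero Pantaleone was the elector what is under nationality?",
    ),
]
```

The resulting list can be passed to `llm.generate` exactly as before.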

#### OpenAI-compatible server
- Basic serving:
    - You can specify `--host`, `--port`, and `--api-key`; by default the server is reachable at `localhost:8000`.
    - ```CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.openai.api_server --model /root/data/hf_cache/llama-3-8B-Instruct --dtype auto --api-key token-abc123```
- [Available parameters](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#command-line-arguments-for-the-server)
- Serve LoRA adapters:
    - ```python -m vllm.entrypoints.openai.api_server --model /root/data/hf_cache/Llama-2-7b-hf --enable-lora --lora-modules sql-lora=/root/data/hf_cache/llama-2-7b-sql-lora-test```
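When adapters are registered with `--lora-modules name=path`, a request selects the adapter by passing that name (here `sql-lora`) as the `model` field. The sketch below builds such a request for the `/v1/completions` endpoint using only the standard library. It assumes the server runs at the default `localhost:8000`; the helper names `build_completion_request` and `complete` are hypothetical, and the `Authorization` header only matters if the server was started with `--api-key`.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000/v1"  # assumed default address
API_KEY = "token-abc123"               # only checked if the server uses --api-key


def build_completion_request(prompt, model="sql-lora", max_tokens=256):
    """Build (but do not send) a completions request targeting a LoRA adapter."""
    payload = {
        "model": model,            # name given on the left of --lora-modules
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
        "stop": ["[/assistant]"],
    }
    return urllib.request.Request(
        f"{API_BASE}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def complete(prompt):
    """Send the request; requires a running server, so it is not called here."""
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


request = build_completion_request(
    "[user] Write a SQL query to answer the question based on the table schema."
    "\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)"
    "\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]"
)
```

Calling `complete(...)` with the same prompt sends the request once the LoRA-enabled server is up. Passing the base model path as `model` instead should route the request to the base model without the adapter.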

In [6]:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "token-abc123"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id
print('current running model :{}'.format(model))

current running model :/root/data/hf_cache/llama-3-8B-Instruct


#### Call the API

In [7]:
chat_completion = client.chat.completions.create(
    messages=[{
        "role": "system",
        "content": "You are a helpful assistant."
    }, {
        "role": "user",
        "content": "Who won the world series in 2020?"
    }],
    model=model,
    max_tokens=128,
    temperature=0.6
)

print("Chat completion results:")
print(chat_completion.choices[0].message.content)

Chat completion results:
The Los Angeles Dodgers won the World Series in 2020, defeating the Tampa Bay Rays in the Fall Classic, 4 games to 2. It was the Dodgers' first World Series title since 1988.
