Performance drop for neural-chat 7b with new repo of ipex-llm (2.5.0b20240425) vLLM serving #10924

Open
Vasud-ha opened this issue May 3, 2024 · 22 comments

@Vasud-ha

Vasud-ha commented May 3, 2024

We have seen a significant performance drop with the environment created from the latest repo for vLLM serving of the neural-chat model, compared to the old environment built from the old repo. With the offline_inference.py script and the default prompts, the old env gives an inference time of 7-11 s with only 50% GPU utilization, while the new env gives 18-24 s with 100% GPU utilization on a Flex 170. I also tried the Docker environment, but it likewise gives an inference time of 18-23 s. The environment details are given below:

Old env
accelerate 0.21.0
annotated-types 0.6.0
anyio 4.3.0
bigdl-core-xe-21 2.5.0b20240402
bigdl-core-xe-esimd-21 2.5.0b20240402
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
exceptiongroup 1.2.0
fastapi 0.110.1
filelock 3.13.3
fsspec 2024.3.1
h11 0.14.0
httptools 0.6.1
huggingface-hub 0.17.3
idna 3.6
intel-extension-for-pytorch 2.1.10+xpu
intel-openmp 2024.1.0
ipex-llm 2.1.0b20240402
Jinja2 3.1.3
MarkupSafe 2.1.5
mpmath 1.3.0
networkx 3.2.1
numpy 1.26.4
packaging 24.0
pillow 10.3.0
pip 23.3.1
protobuf 5.26.1
psutil 5.9.8
py-cpuinfo 9.0.0
pydantic 1.10.15
pydantic_core 2.18.0
python-dotenv 1.0.1
PyYAML 6.0.1
regex 2023.12.25
requests 2.31.0
safetensors 0.4.2
sentencepiece 0.2.0
setuptools 68.2.2
sniffio 1.3.1
starlette 0.37.2
sympy 1.12.1rc1
tabulate 0.9.0
tokenizers 0.14.1
torch 2.1.0a0+cxx11.abi
torchvision 0.16.0a0+cxx11.abi
tqdm 4.66.2
transformers 4.34.0
typing_extensions 4.11.0rc1
urllib3 2.2.1
uvicorn 0.29.0
uvloop 0.19.0
watchfiles 0.21.0
websockets 12.0
wheel 0.41.2

New env
accelerate 0.21.0
aiosignal 1.3.1
annotated-types 0.6.0
anyio 4.3.0
attrs 23.2.0
bigdl-core-xe-21 2.5.0b20240425
bigdl-core-xe-esimd-21 2.5.0b20240425
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
cmake 3.29.2
diskcache 5.6.3
einops 0.7.0
fastapi 0.110.1
filelock 3.13.4
frozenlist 1.4.1
fsspec 2024.3.1
h11 0.14.0
httptools 0.6.1
huggingface-hub 0.22.2
idna 3.7
intel-extension-for-pytorch 2.1.10+xpu
intel-openmp 2024.1.0
interegular 0.3.3
ipex-llm 2.1.0b20240425
Jinja2 3.1.3
joblib 1.4.0
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
lark 1.1.9
llvmlite 0.42.0
MarkupSafe 2.1.5
mpmath 1.3.0
msgpack 1.0.8
nest-asyncio 1.6.0
networkx 3.3
ninja 1.11.1.1
numba 0.59.1
numpy 1.26.4
oneccl-bind-pt 2.1.100+xpu
outlines 0.0.34
packaging 24.0
pandas 2.2.2
pillow 10.3.0
pip 23.3.1
prometheus_client 0.20.0
protobuf 5.26.1
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 16.0.0
pydantic 2.7.1
pydantic_core 2.18.2
pynvml 11.5.0
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
pytz 2024.1
PyYAML 6.0.1
ray 2.12.0
referencing 0.35.0
regex 2023.12.25
requests 2.31.0
rpds-py 0.18.0
safetensors 0.4.3
scipy 1.13.0
sentencepiece 0.2.0
setuptools 68.2.2
six 1.16.0
sniffio 1.3.1
starlette 0.37.2
sympy 1.12.1rc1
tabulate 0.9.0
tiktoken 0.6.0
tokenizers 0.19.1
torch 2.1.0a0+cxx11.abi
torchvision 0.16.0a0+cxx11.abi
tqdm 4.66.2
transformers 4.40.1
transformers-stream-generator 0.0.5
triton 2.1.0
typing_extensions 4.11.0
tzdata 2024.1
urllib3 2.2.1
uvicorn 0.29.0
uvloop 0.19.0
vllm 0.3.3+xpu0.0.1 /root/vllm
watchfiles 0.21.0
websockets 12.0
wheel 0.41.2
xformers 0.0.25.post1

Also, with the new env it gives the error below with the bfloat16 data type. Is bfloat16 not supported now?
[screenshot]

@digitalscream

I don't know if it's related, but I also noticed a drop in performance running Llama 3 under Ollama recently (so using the IPEX llama.cpp implementation). I was originally getting ~50t/s on inference, ran a rebuild (still with the intelanalytics/ipex-llm-xpu:latest image as a base) and suddenly it dropped to ~30t/s. I couldn't find any combination of host drivers or OneAPI packages that would get the performance back.

At this point, it's difficult to justify sticking with the Intel platform - my old RX 6600 XT is 30% faster than the A770 is now!

@gc-fu
Contributor

gc-fu commented May 6, 2024

Hi, I am working to reproduce this issue.

@gc-fu
Contributor

gc-fu commented May 6, 2024

Can you post the result of the offline_inference.py within your old environment?

We recently fixed a bug that could cause generation to end early. So if the generation ends early with weird output, the inference will be quicker.

@rnwang04
Contributor

rnwang04 commented May 6, 2024

Hi @digitalscream , based on our local test, Llama3 could get ~50 tokens/s on a single A770.

I was originally getting ~50t/s on inference, ran a rebuild (still with the intelanalytics/ipex-llm-xpu:latest image as a base) and suddenly it dropped to ~30t/s.

I wonder whether anything changed during this rebuild process. This performance degradation looks more like a driver-related issue.

@gc-fu
Contributor

gc-fu commented May 6, 2024

Can you check if your old environment's vLLM has the following code:
https://github.com/analytics-zoo/vllm/blob/sycl_xpu/vllm/worker/model_runner.py#L216

Also, you can try benchmark_throughput to get a more accurate performance estimation.
Try following the instructions here: https://github.com/intel-analytics/ipex-llm/tree/main/docker/llm/serving/xpu/docker#offline-benchmark-through-benchmark_throughputpy

benchmark_throughput.py can be acquired here.

@digitalscream

Hi @digitalscream , based on our local test, Llama3 could get ~50 tokens/s on a single A770.

I was originally getting ~50t/s on inference, ran a rebuild (still with the intelanalytics/ipex-llm-xpu:latest image as a base) and suddenly it dropped to ~30t/s.

I wonder whether anything changed during this rebuild process. This performance degradation looks more like a driver-related issue.

Sorry, should've mentioned - that's using a single A770, the only change in the Docker image was that it pulled the intelanalytics/ipex-llm-xpu:latest base image again. I do have some new information, though - seems like it's heavily CPU-bound now, where it wasn't before. After a bit of experimentation, I get 30t/s on my Ryzen 3600 machine, and 41t/s on my i5-13600 machine. Originally, I was getting ~50t/s on the Ryzen machine.

@Vasud-ha
Author

Vasud-ha commented May 6, 2024

Can you post the result of the offline_inference.py within your old environment?

We recently fixed a bug that could cause generation to end early. So if the generation ends early with weird output, the inference will be quicker.

[screenshot]

@Vasud-ha
Author

Vasud-ha commented May 6, 2024

Can you check if your old environment's vLLM has the following code: https://github.com/analytics-zoo/vllm/blob/sycl_xpu/vllm/worker/model_runner.py#L216

Also, you can try benchmark_throughput to get a more accurate performance estimation. Try following the instructions here: https://github.com/intel-analytics/ipex-llm/tree/main/docker/llm/serving/xpu/docker#offline-benchmark-through-benchmark_throughputpy

benchmark_throughput.py can be acquired here.

Yes, my old environment's vLLM has the code mentioned here. I will try the benchmark_throughput.py script. Thanks.

@rnwang04
Contributor

rnwang04 commented May 6, 2024

Sorry, should've mentioned - that's using a single A770, the only change in the Docker image was that it pulled the intelanalytics/ipex-llm-xpu:latest base image again. I do have some new information, though - seems like it's heavily CPU-bound now, where it wasn't before. After a bit of experimentation, I get 30t/s on my Ryzen 3600 machine, and 41t/s on my i5-13600 machine. Originally, I was getting ~50t/s on the Ryzen machine.

Thanks for the additional information! When you updated the Docker image, did you also update the version of ipex-llm[cpp] and the Ollama binary you used?
We suspect the performance degradation on the Ryzen 3600 is related to one of our previous PRs, which affected a certain function on the CPU. We have already reverted that PR. Perhaps you can try our latest release tomorrow (pip install --pre --upgrade ipex-llm[cpp], and don't forget to run init-ollama again) to see whether this issue can be resolved?

@digitalscream

Sorry, should've mentioned - that's using a single A770, the only change in the Docker image was that it pulled the intelanalytics/ipex-llm-xpu:latest base image again. I do have some new information, though - seems like it's heavily CPU-bound now, where it wasn't before. After a bit of experimentation, I get 30t/s on my Ryzen 3600 machine, and 41t/s on my i5-13600 machine. Originally, I was getting ~50t/s on the Ryzen machine.

Thanks for the additional information! When you updated the Docker image, did you also update the version of ipex-llm[cpp] and the Ollama binary you used? We suspect the performance degradation on the Ryzen 3600 is related to one of our previous PRs, which affected a certain function on the CPU. We have already reverted that PR. Perhaps you can try our latest release tomorrow (pip install --pre --upgrade ipex-llm[cpp], and don't forget to run init-ollama again) to see whether this issue can be resolved?

Ah, OK - yes, I updated everything when I rebuilt it from the base image. Is there another issue regarding the Ryzen performance? Don't want to pollute this one if there's a more appropriate place to discuss it.

@Vasud-ha
Author

Vasud-ha commented May 7, 2024

Can you check if your old environment's vLLM has the following code: https://github.com/analytics-zoo/vllm/blob/sycl_xpu/vllm/worker/model_runner.py#L216
Also, you can try benchmark_throughput to get a more accurate performance estimation. Try following the instructions here: https://github.com/intel-analytics/ipex-llm/tree/main/docker/llm/serving/xpu/docker#offline-benchmark-through-benchmark_throughputpy
benchmark_throughput.py can be acquired here.

Yes, my old environment's vLLM has the code mentioned here. I will try the benchmark_throughput.py script. Thanks.

Hi @gc-fu, I tried running offline_inference.py with the latest code and am still getting 17 s of latency. I also tested the benchmark_throughput.py script; could you suggest how to get the end-to-end inference latency?

@gc-fu
Contributor

gc-fu commented May 7, 2024

The offline_inference.py script is not designed for performance benchmarking.

If you want to measure end-to-end latency or requests per second, you should start the service according to this readme. Then you can send requests to the service using benchmark tools like wrk or JMeter.
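
For example, here is a rough sketch of measuring end-to-end latency from the client side, assuming you have started the OpenAI-compatible server and it is reachable on localhost:8000 (the port, endpoint path, and model name below are placeholders for illustration; adjust them to how you actually launched the service):

import time

import requests

# Placeholder values -- change them to match your deployment.
URL = "http://localhost:8000/v1/completions"
payload = {
    "model": "/root/neural-chat-7b-v3/",
    "prompt": "The future of AI is",
    "max_tokens": 128,
    "temperature": 0.8,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=300)
latency = time.perf_counter() - start

print(f"End-to-end latency: {latency:.2f} s")
print(resp.json()["choices"][0]["text"])

Tools like wrk or JMeter will give you better numbers under concurrency; the snippet above is only meant to sanity-check single-request latency.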

The result of benchmark_throughput.py should give you some insight into tokens per second, which should indicate the relative performance of the different versions.

Could you please post the result of the benchmark_throughput script for the old and new environments?

@Vasud-ha
Author

Vasud-ha commented May 7, 2024

Thanks @gc-fu. With the Docker env, benchmark_throughput.py gives a throughput of 489.68 tokens/sec for 1000 prompts (default settings); however, this script is not available in the docker directory of the old repo.

@gc-fu
Contributor

gc-fu commented May 8, 2024

Can you check whether this official benchmark script can be used or not: https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py

If not, can you post the Docker image name and tag so that I can see whether I can find a proper script for you 😃 It would also be most helpful if you could post the entire offline_inference.py script from your old environment. It has been a long time since we published the old Docker image 😢

@Vasud-ha
Author

Vasud-ha commented May 8, 2024

I tried running the script in the old repo but am facing import issues.

This is the Docker image: ipex-llm-serving-xpu:2.1.0-SNAPSHOT

This is the offline_inference.py script:

from ipex_llm.vllm.entrypoints.llm import LLM 
from ipex_llm.vllm.sampling_params import SamplingParams
import time
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="/root/neural-chat-7b-v3/", load_in_low_bit="sym_int4", dtype="bfloat16", device="xpu")
st_time = time.time()
outputs = llm.generate(prompts, sampling_params)
en_time = time.time()
print(f'Inference time: {en_time-st_time} s')
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
                                                                            



@gc-fu
Contributor

gc-fu commented May 8, 2024

In this case, can you try the following script?

"""Benchmark offline inference throughput."""
import argparse
import json
import random
import time
from typing import List, Optional, Tuple

import torch
from ipex_llm.transformers import AutoModelForCausalLM
# from transformers import AutoModelForCausalLM

from transformers import PreTrainedTokenizerBase
from tqdm import tqdm

#from vllm import LLM, SamplingParams
from ipex_llm.vllm.entrypoints.llm import LLM
from ipex_llm.vllm.sampling_params import SamplingParams
#from vllm.transformers_utils.tokenizer import get_tokenizer
from ipex_llm.vllm.transformers_utils.tokenizer import get_tokenizer

device = 'xpu'
if device == 'xpu':
    import intel_extension_for_pytorch as ipex


def sample_requests(
    dataset_path: str,
    num_requests: int,
    tokenizer: PreTrainedTokenizerBase,
    fixed_output_len: Optional[int],
) -> List[Tuple[str, int, int]]:
    if fixed_output_len is not None and fixed_output_len < 4:
        raise ValueError("output_len too small")

    # Load the dataset.
    with open(dataset_path) as f:
        dataset = json.load(f)
    # Filter out the conversations with less than 2 turns.
    dataset = [data for data in dataset if len(data["conversations"]) >= 2]
    # Only keep the first two turns of each conversation.
    dataset = [(data["conversations"][0]["value"],
                data["conversations"][1]["value"]) for data in dataset]

    # Tokenize the prompts and completions.
    prompts = [prompt for prompt, _ in dataset]
    prompt_token_ids = tokenizer(prompts).input_ids
    completions = [completion for _, completion in dataset]
    completion_token_ids = tokenizer(completions).input_ids
    tokenized_dataset = []
    for i in range(len(dataset)):
        output_len = len(completion_token_ids[i])
        if fixed_output_len is not None:
            output_len = fixed_output_len
        tokenized_dataset.append((prompts[i], prompt_token_ids[i], output_len))

    # Filter out too long sequences.
    filtered_dataset: List[Tuple[str, int, int]] = []
    for prompt, prompt_token_ids, output_len in tokenized_dataset:
        prompt_len = len(prompt_token_ids)
        if prompt_len < 4 or output_len < 4:
            # Prune too short sequences.
            continue
        if prompt_len > 1024 or prompt_len + output_len > 2048:
            # Prune too long sequences.
            continue
        filtered_dataset.append((prompt, prompt_len, output_len))

    # Sample the requests.
    sampled_requests = random.sample(filtered_dataset, num_requests)
    return sampled_requests


def run_vllm(
    requests: List[Tuple[str, int, int]],
    model: str,
    tokenizer: str,
    quantization: Optional[str],
    tensor_parallel_size: int,
    seed: int,
    n: int,
    use_beam_search: bool,
    trust_remote_code: bool,
    dtype: str,
    max_num_seqs: int,
) -> float:
    llm = LLM(
        model=model,
        tokenizer=tokenizer,
        quantization=quantization,
        #tensor_parallel_size=tensor_parallel_size,
        seed=42,
        trust_remote_code=trust_remote_code,
        dtype=dtype,
        # change here
        device=device,
        max_num_batched_tokens=204800,
        max_model_len=2048,
        max_num_seqs=max_num_seqs,
    )
    warm_prompt = "hi " * (1024 - 1)
    warm_requests = [(warm_prompt, 1024, 1024)
                    for _ in range(1)]
    for prompt, _, output_len in warm_requests:
        sampling_params = SamplingParams(
            n=n,
            temperature=0.0 if use_beam_search else 1.0,
            top_p=1.0,
            use_beam_search=use_beam_search,
            ignore_eos=True,
            max_tokens=output_len,
        )
        llm._add_request(
            prompt=prompt,
            prompt_token_ids=None,
            sampling_params=sampling_params,
        )
    llm._run_engine(use_tqdm=True)

    # Add the requests to the engine.
    for prompt, _, output_len in requests:
        sampling_params = SamplingParams(
            n=n,
            temperature=0.0 if use_beam_search else 1.0,
            top_p=1.0,
            use_beam_search=use_beam_search,
            ignore_eos=True,
            max_tokens=output_len,
        )
        # FIXME(woosuk): Do not use internal method.
        llm._add_request(
            prompt=prompt,
            prompt_token_ids=None,
            sampling_params=sampling_params,
        )

    start = time.perf_counter()
    # FIXME(woosuk): Do not use internal method.
    llm._run_engine(use_tqdm=True)
    end = time.perf_counter()
    return end - start


def run_hf(
    requests: List[Tuple[str, int, int]],
    model: str,
    tokenizer: PreTrainedTokenizerBase,
    n: int,
    use_beam_search: bool,
    max_batch_size: int,
    trust_remote_code: bool,
) -> float:
    assert not use_beam_search
    llm = AutoModelForCausalLM.from_pretrained(
        model, load_in_4bit=True,  optimize_model=True,
                                                 trust_remote_code=True,
                                                 use_cache=True)
    # llm = AutoModelForCausalLM.from_pretrained(
    #     model, trust_remote_code=True, use_cache=True, torch_dtype=torch.bfloat16,
    # )

    tokenizer.pad_token = tokenizer.eos_token
    if device == 'xpu':
        llm = llm.to('xpu')

    # warmup
    warm_prompt = "hi " * (1000 - 1)
    input_ids = tokenizer(warm_prompt, return_tensors="pt",
                              padding=True).input_ids

    if device == 'xpu':
        input_ids = input_ids.to('xpu')
    _ = llm.generate(
            input_ids=input_ids,
            do_sample=False,
            num_return_sequences=n,
            num_beams=1,
            temperature=1.0,
            top_p=1.0,
            use_cache=True,
            max_new_tokens=1024,
            pad_token_id=tokenizer.pad_token_id,
        )

    pbar = tqdm(total=len(requests))
    start = time.perf_counter()
    batch: List[str] = []
    max_prompt_len = 0
    max_output_len = 0
    for i in range(len(requests)):
        prompt, prompt_len, output_len = requests[i]
        # Add the prompt to the batch.
        batch.append(prompt)
        max_prompt_len = max(max_prompt_len, prompt_len)
        max_output_len = max(max_output_len, output_len)
        if len(batch) < max_batch_size and i != len(requests) - 1:
            # Check if we can add more requests to the batch.
            _, next_prompt_len, next_output_len = requests[i + 1]
            if (max(max_prompt_len, next_prompt_len) +
                    max(max_output_len, next_output_len)) <= 2048:
                # We can add more requests to the batch.
                continue

        # Generate the sequences.
        # print(batch)
        input_ids = tokenizer(batch, return_tensors="pt",
                              padding=True).input_ids
        if device == 'xpu':
            input_ids = input_ids.to('xpu')
        llm_outputs = llm.generate(
            input_ids=input_ids,
            do_sample=False,
            num_return_sequences=n,
            num_beams=1,
            temperature=1.0,
            top_p=1.0,
            use_cache=True,
            max_new_tokens=max_output_len,
            pad_token_id=tokenizer.pad_token_id,
        )
        # Include the decoding time.
        tokenizer.batch_decode(llm_outputs, skip_special_tokens=True)
        pbar.update(len(batch))

        # Clear the batch.
        batch = []
        max_prompt_len = 0
        max_output_len = 0
    end = time.perf_counter()
    return end - start


def main(args: argparse.Namespace):
    print(args)
    random.seed(args.seed)

    # Sample the requests.
    tokenizer = get_tokenizer(args.tokenizer,
                              unk_token="<unk>",
                              trust_remote_code=args.trust_remote_code)
    if args.dataset is None:
        # Synthesize a prompt with the given input length.
        prompt = "hi " * (args.input_len - 1)
        requests = [(prompt, args.input_len, args.output_len)
                    for _ in range(args.num_prompts)]
    else:
        requests = sample_requests(args.dataset, args.num_prompts, tokenizer,
                                   args.output_len)

    if args.backend == "vllm":
        elapsed_time = run_vllm(requests, args.model, args.tokenizer,
                                args.quantization, args.tensor_parallel_size,
                                args.seed, args.n, args.use_beam_search,
                                args.trust_remote_code, args.dtype, args.max_num_seqs)
    elif args.backend == "hf":
        assert args.tensor_parallel_size == 1
        elapsed_time = run_hf(requests, args.model, tokenizer, args.n,
                              args.use_beam_search, args.hf_max_batch_size,
                              args.trust_remote_code)
    else:
        raise ValueError(f"Unknown backend: {args.backend}")
    total_num_tokens = sum(prompt_len + output_len
                           for _, prompt_len, output_len in requests)
    print(f"Throughput: {len(requests) / elapsed_time:.4f} requests/s, "
          f"{total_num_tokens / elapsed_time:.2f} tokens/s")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Benchmark the throughput.")
    parser.add_argument("--backend",
                        type=str,
                        choices=["vllm", "hf"],
                        default="vllm")
    parser.add_argument("--dataset",
                        type=str,
                        default=None,
                        help="Path to the dataset.")
    parser.add_argument("--input-len",
                        type=int,
                        default=None,
                        help="Input prompt length for each request")
    parser.add_argument("--output-len",
                        type=int,
                        default=None,
                        help="Output length for each request. Overrides the "
                        "output length from the dataset.")
    parser.add_argument("--model", type=str, default="facebook/opt-125m")
    parser.add_argument("--tokenizer", type=str, default=None)
    parser.add_argument('--quantization',
                        '-q',
                        choices=['awq', None],
                        default=None)
    parser.add_argument("--tensor-parallel-size", "-tp", type=int, default=1)
    parser.add_argument("--n",
                        type=int,
                        default=1,
                        help="Number of generated sequences per prompt.")
    parser.add_argument("--use-beam-search", action="store_true")
    parser.add_argument("--num-prompts",
                        type=int,
                        default=1000,
                        help="Number of prompts to process.")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--hf-max-batch-size",
                        type=int,
                        default=None,
                        help="Maximum batch size for HF backend.")
    parser.add_argument("--max-num-seqs", type=int, default=8)
    parser.add_argument('--trust-remote-code',
                        action='store_true',
                        help='trust remote code from huggingface')
    parser.add_argument(
        '--dtype',
        type=str,
        default='auto',
        choices=['auto', 'half', 'float16', 'bfloat16', 'float', 'float32'],
        help='data type for model weights and activations. '
        'The "auto" option will use FP16 precision '
        'for FP32 and FP16 models, and BF16 precision '
        'for BF16 models.')
    args = parser.parse_args()

    if args.backend == "vllm":
        if args.hf_max_batch_size is not None:
            raise ValueError("HF max batch size is only for HF backend.")
    elif args.backend == "hf":
        if args.hf_max_batch_size is None:
            raise ValueError("HF max batch size is required for HF backend.")
        if args.quantization is not None:
            raise ValueError("Quantization is only for vLLM backend.")
    if args.tokenizer is None:
        args.tokenizer = args.model
    
    if args.dataset is None:
        assert args.input_len is not None
        assert args.output_len is not None
    else:
        assert args.input_len is None

    main(args)
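
For reference, with the argparse options defined above, the script can be invoked with something like python benchmark_throughput.py --backend vllm --model /root/neural-chat-7b-v3/ --input-len 1024 --output-len 512 --num-prompts 32 (the model path and the input/output lengths here are only examples; adjust them to your setup).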

Also, can you try to run the following command in your docker environment and post the result?

find / -name "bigdl_mistral.py"

If you successfully find the file in your environment, then you are using a quite old version of vLLM that does not fully implement the PagedAttention algorithm.

@Vasud-ha
Author

I faced this error while trying out the above code; could you suggest how to resolve it?
[screenshot]

@gc-fu
Contributor

gc-fu commented May 11, 2024

Hi, the vLLM you used is deprecated and will not be supported anymore 😢

The old vLLM does not use PagedAttention and did not perform well enough in our tests. Besides, the old vLLM suffered from out-of-memory issues in GPU environments.

Try using the latest vLLM instead; I am pretty sure the new vLLM is quicker than the old one.

@Vasud-ha
Author

Hi @gc-fu, I couldn't locate the benchmark_throughput.py file inside the Docker container; could you share the path? This is the Docker image I built. Earlier I was able to locate the script, but I rebuilt the image and now I can't find it.
[screenshot]

@gc-fu
Contributor

gc-fu commented May 14, 2024

The image you should use is this one: intelanalytics/ipex-llm-serving-xpu

Try checking the README here: https://github.com/intel-analytics/ipex-llm/tree/main/docker/llm/serving/xpu/docker

@jason-dai
Contributor

The image you should use is this one: intelanalytics/ipex-llm-serving-xpu

Try checking the README here: https://github.com/intel-analytics/ipex-llm/tree/main/docker/llm/serving/xpu/docker

Can we remove the deprecated docker image?

@gc-fu
Contributor

gc-fu commented May 14, 2024

The image you should use is this one: intelanalytics/ipex-llm-serving-xpu
Try checking the README here: https://github.com/intel-analytics/ipex-llm/tree/main/docker/llm/serving/xpu/docker

Can we remove the deprecated docker image?

Yes, we can. There are basically two images related to vLLM:

vLLM-CPU: ipex-llm-serving-cpu. Since vLLM-v1 is removed from our codebase, this image no longer contains any code related to vLLM.

vLLM-XPU: ipex-llm-serving-xpu. This is the only available image that users should use for now.

At the beginning of this issue, the user was using an ipex-llm-serving-xpu image that was built long ago and contains the deprecated vLLM-v1 code. If the user pulls the image again, the old code will disappear.

I will remove the vLLM-CPU example page later.
