
performance issue #323

Open
sleepwalker2017 opened this issue Mar 13, 2024 · 4 comments

sleepwalker2017 commented Mar 13, 2024

Hi, I'm benchmarking LoRAX on 2×A30.
I'm getting poor performance; is that normal?
In the first sheet, I send requests to the base model; "batch" means the number of clients.
In the second sheet, I send requests for multiple LoRAs; the token throughput is low and the GPU utilization is also low.

[screenshot: two benchmark sheets (base model vs. multiple LoRAs) showing token throughput and GPU utilization]

Here are my scripts:

lorax-launcher --model-id /data/vicuna-13b/vicuna-13b-v1.5/ --sharded true --num-shard 2

client: I use locust to start multiple clients that send requests. The core code is below; each request carries its own adapter id:

# test_data (list of prompts) and req_cnt (request counter) are defined elsewhere.
adapters = [
    "mattreid/alpaca-lora-13b",
    "merror/llama_13b_lora_beauty",
    "shibing624/llama-13b-belle-zh-lora",
    "shibing624/ziya-llama-13b-medical-lora",
]

def build_request(output_len):
    global req_cnt
    idx = req_cnt % len(test_data)
    lora_id = idx % len(adapters)  # round-robin over the four adapters

    input_dict = {
        "inputs": test_data[idx],
        "parameters": {
            "adapter_id": adapters[lora_id],
            "max_new_tokens": 256,
            "top_p": 0.7,
        },
    }
    req_cnt += 1
    return input_dict
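
For completeness, here is a minimal, self-contained locust sketch of how a request dict like this could be driven against LoRAX's TGI-style POST /generate endpoint. The host/port, the placeholder prompts, and the zero wait time are assumptions; adjust them to your deployment and dataset:

import random

from locust import HttpUser, constant, task

# Placeholder prompts; the real benchmark uses its own dataset.
test_data = [
    "Explain LoRA fine-tuning in one paragraph.",
    "Summarize the trade-offs of tensor parallelism.",
]

adapters = [
    "mattreid/alpaca-lora-13b",
    "merror/llama_13b_lora_beauty",
    "shibing624/llama-13b-belle-zh-lora",
    "shibing624/ziya-llama-13b-medical-lora",
]


class LoraxUser(HttpUser):
    host = "http://localhost:8080"  # assumed; point this at your lorax-launcher port
    wait_time = constant(0)         # fire requests back to back

    @task
    def generate(self):
        idx = random.randrange(len(test_data))
        payload = {
            "inputs": test_data[idx],
            "parameters": {
                "adapter_id": adapters[idx % len(adapters)],
                "max_new_tokens": 256,
                "top_p": 0.7,
            },
        }
        # LoRAX exposes a TGI-compatible POST /generate endpoint.
        self.client.post("/generate", json=payload)

Running it headless with something like locust -f locustfile.py --headless --users 16 --spawn-rate 16 approximates 16 concurrent clients.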

Are there any benchmark results for LoRAX, or any example benchmark code? Thank you.

tgaddair added the question label on Mar 14, 2024
tgaddair (Contributor) commented

Hey @sleepwalker2017, thanks for reporting this. We'll definitely take a look on our side to try and repro and see if we get similar results.

One thing to point out, since you're running on multiple GPUs: there was a bug, just recently fixed in PR #324, that was disabling the optimized SGMV kernels when using small-rank LoRAs like these across multiple GPUs. It should now be resolved if you want to pull the latest image and try running again.

Regardless, we'll take a look and share our results as well.

We do have benchmark results we can share for single GPU that might be helpful; we haven't run tests as extensive for multi-GPU, and there will be some additional overhead from cross-device comms. Are you connecting your GPUs via NVLink or PCIe?

sleepwalker2017 (Author) commented

It's PCIe.
I'll check the latest version.

thincal (Contributor) commented Mar 21, 2024

@tgaddair I have noticed the same performance drop on 1×4090 at batch size 1, with an average input length of ~2K characters and ~300 generated tokens on average.

other info:

  • adapter: rank 64, all targets
  • inference with --compile
  • lorax version: v0.8.1

performance:

base model (Llama 2 7B): 60 tokens/s
1 adapter (rank 64): 30 tokens/s

Is there any potential optimization planned?
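
For reference, here is a minimal sketch of how single-request tokens/s figures like the ones above might be measured against a running LoRAX server. It assumes the server is reachable at http://localhost:8080 and that, as in the TGI-style API, passing "details": true makes the response include a generated_tokens count; adjust both to your setup:

import time

import requests

URL = "http://localhost:8080/generate"  # assumed host/port

payload = {
    "inputs": "Write a short story about a robot learning to paint.",
    "parameters": {
        "adapter_id": "mattreid/alpaca-lora-13b",  # drop this line to hit the base model
        "max_new_tokens": 300,
        "details": True,  # ask the server to report how many tokens were generated
    },
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=600)
resp.raise_for_status()
elapsed = time.perf_counter() - start

generated = resp.json()["details"]["generated_tokens"]
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tokens/s")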

thincal (Contributor) commented Mar 22, 2024

@tgaddair Hi, I've got updated results using the latest main code; same config, with the performance below:

base model (Llama 2 7B): 59.62 tokens/s
1 adapter (rank 64): 46.70 tokens/s

That's about a 21.6% perf drop; is that reasonable? And could you share what optimization was made? Thanks.
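
As a quick sanity check, the quoted drop follows directly from the two throughput figures above (computed here from the rounded numbers as posted):

base, adapter = 59.62, 46.70
drop = (base - adapter) / base
print(f"{drop:.1%}")  # ~21.7%, in line with the ~21.6% quoted above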
