
performance issue #323

Open
sleepwalker2017 opened this issue Mar 13, 2024 · 4 comments

sleepwalker2017 commented Mar 13, 2024

Hi, I'm benchmarking LoRAX on 2×A30.
I'm getting poor performance; is that normal?
In the first sheet, I send requests to the base model; "batch" means the number of clients.
In the second sheet, I send requests for multiple LoRAs; the token throughput is low and the GPU utilization is also low.

[screenshot: two benchmark sheets (base model vs. multiple LoRAs) showing token throughput and GPU utilization]

Here are my scripts:

lorax-launcher --model-id /data/vicuna-13b/vicuna-13b-v1.5/ --sharded true --num-shard 2

client: I use locust to start multiple clients that send requests. The core code is below; each request carries its own adapter id:

# test_data (list of prompts) and req_cnt (request counter) are defined elsewhere.
adapters = [
    "mattreid/alpaca-lora-13b",
    "merror/llama_13b_lora_beauty",
    "shibing624/llama-13b-belle-zh-lora",
    "shibing624/ziya-llama-13b-medical-lora",
]

def build_request(output_len):
    global req_cnt
    idx = req_cnt % len(test_data)
    lora_id = idx % len(adapters)  # round-robin over the four adapters

    input_dict = {
        "inputs": test_data[idx],
        "parameters": {
            "adapter_id": adapters[lora_id],
            "max_new_tokens": 256,
            "top_p": 0.7,
        },
    }
    req_cnt += 1
    return input_dict
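
For completeness, here is a minimal, self-contained locust sketch of how a request dict like this could be driven against LoRAX's TGI-style POST /generate endpoint. The host/port, the placeholder prompts, and the zero wait time are assumptions; adjust them to your deployment and dataset:

import random

from locust import HttpUser, constant, task

# Placeholder prompts; the real benchmark uses its own dataset.
test_data = [
    "Explain LoRA fine-tuning in one paragraph.",
    "Summarize the trade-offs of tensor parallelism.",
]

adapters = [
    "mattreid/alpaca-lora-13b",
    "merror/llama_13b_lora_beauty",
    "shibing624/llama-13b-belle-zh-lora",
    "shibing624/ziya-llama-13b-medical-lora",
]


class LoraxUser(HttpUser):
    host = "http://localhost:8080"  # assumed; point this at your lorax-launcher port
    wait_time = constant(0)         # fire requests back to back

    @task
    def generate(self):
        idx = random.randrange(len(test_data))
        payload = {
            "inputs": test_data[idx],
            "parameters": {
                "adapter_id": adapters[idx % len(adapters)],
                "max_new_tokens": 256,
                "top_p": 0.7,
            },
        }
        # LoRAX exposes a TGI-compatible POST /generate endpoint.
        self.client.post("/generate", json=payload)

Running it headless with something like locust -f locustfile.py --headless --users 16 --spawn-rate 16 approximates 16 concurrent clients.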

Are there any benchmark results for LoRAX, or any example benchmark code? Thank you.

tgaddair added the question label on Mar 14, 2024
tgaddair (Contributor) commented

Hey @sleepwalker2017, thanks for reporting this. We'll definitely take a look on our side to try and repro and see if we get similar results.

One thing to point out, since you're running on multiple GPUs: there was a bug, just recently fixed in PR #324, that was disabling the optimized SGMV kernels when using small-rank LoRAs like these across multiple GPUs. It should now be resolved if you want to pull the latest image and try running again.

Regardless, we'll take a look and share our results as well.

We do have benchmark results we can share for single GPU that might be helpful; we haven't run tests as extensive for multi-GPU, and there will be some additional overhead from cross-device comms. Are you connecting your GPUs via NVLink or PCIe?

sleepwalker2017 (Author) commented

It's PCIe.
I'll check the latest version.

thincal (Contributor) commented Mar 21, 2024

@tgaddair I have noticed the same performance drop on 1×4090 at batch size 1, with an average input length of ~2K characters and ~300 generated tokens on average.

other info:

  • adapter: rank 64, all targets
  • inference with --compile
  • lorax version: v0.8.1

performance:

base model (Llama 2 7B): 60 tokens/s
1 adapter (rank 64): 30 tokens/s

Is there any potential optimization planned?
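
For reference, here is a minimal sketch of how single-request tokens/s figures like the ones above might be measured against a running LoRAX server. It assumes the server is reachable at http://localhost:8080 and that, as in the TGI-style API, passing "details": true makes the response include a generated_tokens count; adjust both to your setup:

import time

import requests

URL = "http://localhost:8080/generate"  # assumed host/port

payload = {
    "inputs": "Write a short story about a robot learning to paint.",
    "parameters": {
        "adapter_id": "mattreid/alpaca-lora-13b",  # drop this line to hit the base model
        "max_new_tokens": 300,
        "details": True,  # ask the server to report how many tokens were generated
    },
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=600)
resp.raise_for_status()
elapsed = time.perf_counter() - start

generated = resp.json()["details"]["generated_tokens"]
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tokens/s")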

thincal (Contributor) commented Mar 22, 2024

@tgaddair Hi, I've got updated results using the latest main code; same config, with the performance below:

base model (Llama 2 7B): 59.62 tokens/s
1 adapter (rank 64): 46.70 tokens/s

That's about a 21.6% perf drop; is that reasonable? And could you share what optimization was made? Thanks.
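
As a quick sanity check, the quoted drop follows directly from the two throughput figures above (computed here from the rounded numbers as posted):

base, adapter = 59.62, 46.70
drop = (base - adapter) / base
print(f"{drop:.1%}")  # ~21.7%, in line with the ~21.6% quoted above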
