
Allow Peft models to share their base model #1905

Merged · 4 commits · Jul 10, 2023

Conversation

fozziethebeat
Collaborator

Why are these changes needed?

This adds a special environment variable that activates shared base weights for Peft models. Currently, when two Peft models that use the same base model are loaded, the base weights are loaded separately for each of them. With this flag activated, all Peft models share a single copy of the base model.

Making this work requires a few workarounds due to how Hugging Face's PEFT library implements LoRA adapters, the most popular variant. LoRA adapters modify the base model's PyTorch modules directly, so adapters that share a base model must live within the same model object, and a set_adapter method must be called to switch between them.
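
For context, here is a minimal sketch (not the code in this PR) of the PEFT pattern described above: several LoRA adapters attached to one base model and switched with set_adapter(). The model paths and adapter names are placeholders.

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the shared base model once.
base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",  # placeholder base model path
    torch_dtype=torch.float16,
)

# The first adapter wraps the base model in a PeftModel.
model = PeftModel.from_pretrained(base, "/path/to/adapter-1", adapter_name="adapter-1")

# Additional adapters attach to the same wrapped base model.
model.load_adapter("/path/to/adapter-2", adapter_name="adapter-2")

# Because LoRA modifies the base model's modules in place, requests for
# different adapters must switch the shared object before generating.
model.set_adapter("adapter-1")
# ... generate for adapter-1 ...
model.set_adapter("adapter-2")
# ... generate for adapter-2 ...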

Related issue number (if applicable)

Expands #1805

Checks

  • I've run format.sh to lint the changes in this PR.
  • I've included any doc changes needed.
  • I've made sure the relevant tests are passing (if applicable).

fozziethebeat marked this pull request as ready for review on July 9, 2023 10:15
@BabyChouSr
Collaborator

This is pretty cool! I tested it, and it seems to work well.

I served multiple LoRAs with the following command:

PEFT_SHARE_BASE_WEIGHTS=true python3 -m fastchat.serve.multi_model_worker \
    --model-path /data/chris/peft-llama-dummy-1 \
    --model-names peft-dummy-1 \
    --model-path /data/chris/peft-llama-dummy-2 \
    --model-names peft-dummy-2 \
    --model-path /data/chris/peft-llama-dummy-3 \
    --model-names peft-dummy-3 \
    --num-gpus 2

Looking at the GPU utilization, the base model (llama-7b in my case) is only loaded once, using ~14GB of VRAM, which is what we expect.
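
For reference, that figure matches a back-of-the-envelope estimate of the fp16 weight footprint alone (ignoring activations, the KV cache, and the comparatively tiny LoRA weights):

# Rough VRAM estimate for llama-7b base weights in fp16.
params = 7e9          # ~7 billion parameters
bytes_per_param = 2   # fp16
print(f"{params * bytes_per_param / 1e9:.0f} GB")  # -> 14 GB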

Ying1123 self-assigned this on Jul 9, 2023
@fozziethebeat
Collaborator Author

Thank goodness it works for someone else; I've submitted a few too many things that didn't 100% work. I also have it running on GCP with two models, and they give distinct results and fit within VRAM as expected.

Ying1123
Member

LGTM. Thanks!

Review comment on docs/model_support.md (outdated, resolved)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>