feat(server): Add native support for PEFT Lora models #762

Merged 4 commits into main on Aug 3, 2023

Conversation

@Narsil (Collaborator) commented Aug 2, 2023

  • Will detect a peft model by finding adapter_config.json.
  • This triggers a totally dedicated download-weights path.
  • This path loads the adapter config and finds the base model_id.
  • It loads the base_model.
  • Then the peft_model.
  • Then merge_and_unload().
  • Then save_pretrained(.., safe_serialization=True).
  • Add back the config + tokenizer.
  • The chosen location is a local folder with the name of the
    user-chosen model id (see the sketch after this list).
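
A minimal sketch of that flow (the model ids and paths are illustrative, and this is not the exact code added by the PR):

import json

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

adapter_id = "some-user/some-lora-adapter"  # hypothetical adapter repo or local path
output_dir = "./some-lora-adapter"          # local folder named after the user-chosen model id

# adapter_config.json (assumed already downloaded) points at the base model
with open("adapter_config.json") as f:
    base_model_id = json.load(f)["base_model_name_or_path"]

base_model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, adapter_id)
model = model.merge_and_unload()  # fold the LoRA weights into the base weights

# Save the merged weights as safetensors, then add back the config + tokenizer
model.save_pretrained(output_dir, safe_serialization=True)
AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)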

PROs:

  • Easier than expecting the user to merge manually
  • Barely any change outside of the download-weights command.
  • This means everything will work in a single load.
  • Should enable SM + HFE out of the box

CONs:

  • Creates a local merged model in an unusual location, potentially
    not saved across docker reloads, or overwriting some files if the PEFT
    model itself was local and contained other files in addition to the LoRA weights

Alternatives considered:

  • Add local_files_only=True everywhere (discarded because of a massive
    code change for not a good enough reason)
  • Return something to the launcher about the new model-id (a cleaner
    location for this new model), but it would introduce new communication
    where we didn't need it before.
  • Use the HF cache folder and stop the flow after download-weights,
    asking the user to restart with the actual local model location

Fix #482


@Narsil (Collaborator, Author) commented Aug 2, 2023

@philschmid If you have any suggestions/counterindications for SM or HFE behavior?

@younesbelkada Is my usage of PEFT ok?

@philschmid (Member)

PEFT added AutoModel classes with 0.4, which should simplify the code a lot, e.g.: https://github.com/philschmid/huggingface-llama-2-samples/blob/master/training/scripts/merge_adapter_weights.py#L21

# Imports added for context; `args.peft_model_id` comes from the linked script's CLI args.
import torch
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    args.peft_model_id,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
)

@philschmid (Member) left a comment

So the idea here is to:

  1. load the adapters
  2. load the base model based on the adapter config
  3. Merge the weights
  4. save safetensors
  5. start TGI?

@younesbelkada left a comment

Looks great on the PEFT side! Thanks for the ping.
Alternatively, you could also have used AutoPeftModelForCausalLM or AutoPeftModelForSeq2SeqLM (https://huggingface.co/docs/peft/quicktour#easy-loading-with-auto-classes) and directly loaded the model with the correct class without having to declare a base model. But this also works great (it does the same thing under the hood).

@Narsil (Collaborator, Author) commented Aug 3, 2023

@younesbelkada Do these methods force the dtype to be f32, or should I specify it with torch_dtype all the time?

@younesbelkada

Yes, the behaviour is the same as transformers' from_pretrained, so it will load in fp32; you need to pass the torch_dtype arg, as @philschmid said, to load the base model in any other precision.
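
For example (illustrative snippet; the adapter id is hypothetical):

import torch
from peft import AutoPeftModelForCausalLM

# Without torch_dtype, the merged model loads in float32 (the transformers default).
# Pass torch_dtype explicitly to keep the base model in half precision.
model = AutoPeftModelForCausalLM.from_pretrained(
    "some-user/some-lora-adapter",  # hypothetical adapter id
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)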

@Narsil merged commit ac736fd into main Aug 3, 2023
5 checks passed
@Narsil deleted the add_native_peft_support branch August 3, 2023 15:22
tallesairan added a commit to tallesairan/text-generation-inference that referenced this pull request Aug 3, 2023
feat(server): Add native support for PEFT Lora models (huggingface#762)
@shimizust

@Narsil Should model loading work if the model is a local model and/or the base model specified in the adapter_config.json is a local path? For example:

text-generation-launcher --model-id /dev/models/gpt2-medium-peft

or in adapter_config.json:

{
  "base_model_name_or_path": "/dev/models/gpt2-medium",
  ...
}

It doesn't seem to be detecting that the local model I'm pointing to is a peft model.

@Narsil (Collaborator, Author) commented Sep 20, 2023

Hmm, I haven't tried, but I don't think there's any big reason why it shouldn't work.

Could you open an issue with the full template filled out (and ideally links to the actual models and setup so we can reproduce as easily as possible)?

@shimizust

@Narsil At least just looking at the code (server/text_generation_server/cli.py), it seems the check for a PEFT model only happens in the block that runs when the model is not a local model? Is that a correct understanding? I can open an issue.

if not is_local_model:
    try:
        adapter_config_filename = hf_hub_download(model_id, revision=revision, filename="adapter_config.json")
        utils.download_and_unload_peft(model_id, revision, trust_remote_code=trust_remote_code)
    except (utils.LocalEntryNotFoundError, utils.EntryNotFoundError):
        pass
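
For reference, a minimal sketch of how the same check might be extended to local directories (illustrative only, not necessarily the actual change; `model_id`, `revision`, `trust_remote_code`, `is_local_model`, and `utils` are the names from the cli.py snippet above):

import os

# Also detect a PEFT adapter when `model_id` points at a local directory.
if is_local_model and os.path.isfile(os.path.join(model_id, "adapter_config.json")):
    utils.download_and_unload_peft(model_id, revision, trust_remote_code=trust_remote_code)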

@Narsil (Collaborator, Author) commented Sep 20, 2023

Oh true, it should be relatively easy to update. Do you want to take a stab at it?

@shimizust

@Narsil Thanks, I'd be interested in taking a stab at it. However, I'm running into some challenges testing on my Mac without a GPU. I'll give it a shot, but if I'm unable to address the issue in a reasonable timeframe, I may ask someone else to take over.

@cirocavani commented Sep 21, 2023

I had a similar problem (I was trying to serve a PEFT LoRA from the folder /opt/ml/model on SageMaker).

My workaround was to create the merged model beforehand and serve it (not requiring the PEFT path in TGI).

Something like:

3 folders:

  • ./utils - merge script
  • ./peft/<checkpoint> - Input (adapter_config.json, tokenizer_config.json, generation_config.json, ...)
  • ./model/data - Output (config.json, safetensors, tokenizer_config.json...)
    (I also share my HF Hub cache to avoid duplicate downloads)

Script ./utils/peft_merger.py

(the weight rename is specific to the Bloom model and should probably be removed or adapted for your model)

import os
import sys

import peft
import torch
import transformers

peft_checkpoint_dir = sys.argv[1]
model_data_dir = sys.argv[2]

model = peft.AutoPeftModelForCausalLM.from_pretrained(
    peft_checkpoint_dir,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)

base_model_id = model.peft_config["default"].base_model_name_or_path

model = model.merge_and_unload()

tokenizer_config_file = os.path.join(peft_checkpoint_dir, "tokenizer_config.json")
tokenizer_peft_or_base = (
    peft_checkpoint_dir if os.path.isfile(tokenizer_config_file) else base_model_id
)
tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer_peft_or_base)

generation_config_file = os.path.join(peft_checkpoint_dir, "generation_config.json")
generation_peft_or_base = (
    peft_checkpoint_dir if os.path.isfile(generation_config_file) else base_model_id
)
generation = transformers.GenerationConfig.from_pretrained(generation_peft_or_base)


### rename weights transformer.<key> -> <key> as expected by TGI
assert model.base_model_prefix == "transformer"
state_dict = {
    k[len(model.base_model_prefix) + 1 :]: v
    for k, v in model.state_dict().items()
    if k.startswith(model.base_model_prefix)
}
###

model.save_pretrained(model_data_dir, state_dict=state_dict, safe_serialization=True)
tokenizer.save_pretrained(model_data_dir)
generation.save_pretrained(model_data_dir)

Locally, I run the script inside the TGI docker image.

docker run --rm \
--volume=./utils:/opt/ml/utils \
--volume=./peft/<checkpoint>:/opt/ml/peft-checkpoint \
--volume=./model/data:/opt/ml/model \
--volume=$HOME/.cache/huggingface/hub:/data \
--entrypoint=python \
--env PYTHONUNBUFFERED=1 \
--env HF_HUB_ENABLE_HF_TRANSFER=0 \
--env HUGGINGFACE_HUB_CACHE=/data \
ghcr.io/huggingface/text-generation-inference:1.0.3 \
/opt/ml/utils/peft_merger.py \
/opt/ml/peft-checkpoint \
/opt/ml/model

After running the container, I create a model.tar.gz with the ./model/data/* files and upload it to S3. Then I start a SageMaker endpoint with this model.

I tested with bigscience/bloom-7b1 and found a problem that is not handled in this PR: when I run model = model.merge_and_unload(), all weights are renamed with a transformer. prefix. I had to rename them to remove this prefix (that is what the state_dict block in the script above does).

In your case, I think you can mount (--volume) your model in docker (and drop the HF Hub cache mount). Also, you will have to check whether the saved model has the correct weight names.

Example with Bloom:

Input ./peft/checkpoint

README.md                                              3,621 bytes
adapter_config.json                                      438 bytes
adapter_model.bin                                 31,479,589 bytes
generation_config.json                                   141 bytes
special_tokens_map.json                                   96 bytes
tokenizer.json                                    14,501,114 bytes
tokenizer_config.json                                    286 bytes
------------------------------------------------------------------
7 files                                           45,985,285 bytes

Output ./model/data

config.json                                              783 bytes
generation_config.json                                   136 bytes
model-00001-of-00002.safetensors               9,976,218,536 bytes
model-00002-of-00002.safetensors               4,161,852,232 bytes
model.safetensors.index.json                          27,460 bytes
special_tokens_map.json                                   96 bytes
tokenizer.json                                    14,501,114 bytes
tokenizer_config.json                                    286 bytes
------------------------------------------------------------------
8 files                                       14,152,600,643 bytes

And running it with Docker (this requires a GPU):

docker run --rm \
--gpus=all \
--shm-size=1g \
--publish=8080:80 \
--volume=./model/data:/opt/ml/model \
--env DTYPE=bfloat16 \
ghcr.io/huggingface/text-generation-inference:1.0.3 \
--model-id=/opt/ml/model

Output:

2023-09-20T10:55:43.888739Z  INFO text_generation_launcher: Args { model_id: "/opt/ml/model", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: Some(BFloat16), trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "129cc1f5b564", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-09-20T10:55:43.888849Z  INFO download: text_generation_launcher: Starting download process.
2023-09-20T10:55:46.446105Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-09-20T10:55:46.891730Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-09-20T10:55:46.891936Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-09-20T10:57:28.122836Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0

2023-09-20T10:57:28.186745Z  INFO shard-manager: text_generation_launcher: Shard ready in 101.294345318s rank=0
2023-09-20T10:57:28.274466Z  INFO text_generation_launcher: Starting Webserver
2023-09-20T10:57:28.837124Z  WARN text_generation_router: router/src/main.rs:194: no pipeline tag found for model /opt/ml/model
2023-09-20T10:57:28.860890Z  INFO text_generation_router: router/src/main.rs:213: Warming up model
2023-09-20T10:57:30.887951Z  WARN text_generation_router: router/src/main.rs:224: Model does not support automatic max batch total tokens
2023-09-20T10:57:30.887975Z  INFO text_generation_router: router/src/main.rs:246: Setting max batch total tokens to 16000
2023-09-20T10:57:30.887980Z  INFO text_generation_router: router/src/main.rs:247: Connected
2023-09-20T10:57:30.887985Z  WARN text_generation_router: router/src/main.rs:252: Invalid hostname, defaulting to 0.0.0.0

@shimizust

Thanks @cirocavani for the detailed solution to work around this issue; this is very helpful.

@tleyden (Contributor) commented Nov 14, 2023

@Narsil @shimizust @cirocavani

I hit the same issue when trying to load local PEFT weights. Here's a PR that fixed it for me:

#1260

fxmarty pushed a commit that referenced this pull request Nov 23, 2023
# What does this PR do?

Enables PEFT weights to be loaded from a local directory, as opposed to
a hf hub repository. It is a continuation of the work in PR
#762



Fixes #1259 


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
Pull Request section? **Yes but I don't know how to run the tests for
this repo, and it doesn't look like this code is covered anyway**
- [x] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
to it if that's the case. **Yes, @Narsil asked for a PR in [this
comment](#762 (comment))**
- [x] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
**I didn't see any documentation added to the [original
PR](#762),
and am not sure where this belongs. Let me know and I can add some**
- [x] Did you write any new necessary tests? **I didn't see any existing
test coverage for this python module**


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil 


---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
cr313 added a commit to cr313/text-generation-inference-load-test that referenced this pull request Apr 19, 2024