feat(server): Add native support for PEFT Lora models #762

Merged 4 commits into main on Aug 3, 2023

Conversation

@Narsil (Collaborator) commented Aug 2, 2023

  • Will detect a peft model by finding adapter_config.json.
  • This triggers a totally dedicated download-weights path.
  • This path loads the adapter config and finds the base model_id.
  • It loads the base_model.
  • Then the peft_model.
  • Then merge_and_unload().
  • Then save_pretrained(.., safe_serialization=True).
  • Add back the config + tokenizer.
  • The chosen location is a local folder with the name of the
    user-chosen model id (see the sketch after this list).
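
A minimal sketch of that flow (the model ids and paths are illustrative, and this is not the exact code added by the PR):

import json

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

adapter_id = "some-user/some-lora-adapter"  # hypothetical adapter repo or local path
output_dir = "./some-lora-adapter"          # local folder named after the user-chosen model id

# adapter_config.json (assumed already downloaded) points at the base model
with open("adapter_config.json") as f:
    base_model_id = json.load(f)["base_model_name_or_path"]

base_model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, adapter_id)
model = model.merge_and_unload()  # fold the LoRA weights into the base weights

# Save the merged weights as safetensors, then add back the config + tokenizer
model.save_pretrained(output_dir, safe_serialization=True)
AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)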

PROs:

  • Easier than expecting the user to merge manually
  • Barely any change outside of the download-weights command.
  • This means everything will work in a single load.
  • Should enable SM + HFE out of the box

CONs:

  • Creates a local merged model in an unusual location, potentially
    not saved across docker reloads, or overwriting some files if the PEFT
    model itself was local and contained other files in addition to the LoRA weights

Alternatives considered:

  • Add local_files_only=True everywhere (discarded because of a massive
    code change for not a good enough reason)
  • Return something to the launcher about the new model-id (a cleaner
    location for this new model), but it would introduce new communication
    where we didn't need it before.
  • Use the HF cache folder and stop the flow after download-weights,
    asking the user to restart with the actual local model location

Fix #482


@Narsil (Collaborator, Author) commented Aug 2, 2023

@philschmid If you have any suggestions/counterindications for SM or HFE behavior?

@younesbelkada Is my usage of PEFT ok?

@philschmid (Member)

PEFT added AutoModel classes with 0.4, which should simplify the code a lot, e.g.: https://github.com/philschmid/huggingface-llama-2-samples/blob/master/training/scripts/merge_adapter_weights.py#L21

# Imports added for context; `args.peft_model_id` comes from the linked script's CLI args.
import torch
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    args.peft_model_id,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
)

@philschmid (Member) left a comment

So the idea here is to:

  1. load the adapters
  2. load the base model based on the adapter config
  3. Merge the weights
  4. save safetensors
  5. start TGI?

@younesbelkada left a comment

Looks great on the PEFT side! Thanks for the ping.
Alternatively, you could also have used AutoPeftModelForCausalLM or AutoPeftModelForSeq2SeqLM (https://huggingface.co/docs/peft/quicktour#easy-loading-with-auto-classes) and directly loaded the model with the correct class without having to declare a base model. But this also works great (it does the same thing under the hood).

@Narsil (Collaborator, Author) commented Aug 3, 2023

@younesbelkada Do these methods force the dtype to be f32, or should I specify it with torch_dtype all the time?

@younesbelkada

Yes, the behaviour is the same as transformers' from_pretrained, so it will load in fp32; you need to pass the torch_dtype arg, as @philschmid said, to load the base model in any other precision.
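
For example (illustrative snippet; the adapter id is hypothetical):

import torch
from peft import AutoPeftModelForCausalLM

# Without torch_dtype, the merged model loads in float32 (the transformers default).
# Pass torch_dtype explicitly to keep the base model in half precision.
model = AutoPeftModelForCausalLM.from_pretrained(
    "some-user/some-lora-adapter",  # hypothetical adapter id
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)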

@Narsil merged commit ac736fd into main Aug 3, 2023
5 checks passed
@Narsil deleted the add_native_peft_support branch August 3, 2023 15:22
tallesairan added a commit to tallesairan/text-generation-inference that referenced this pull request Aug 3, 2023
feat(server): Add native support for PEFT Lora models (huggingface#762)
@shimizust

@Narsil Should model loading work if the model is a local model and/or the base model specified in the adapter_config.json is a local path? For example:

text-generation-launcher --model-id /dev/models/gpt2-medium-peft

or in adapter_config.json:

{
  "base_model_name_or_path": "/dev/models/gpt2-medium",
  ...
}

It doesn't seem to be detecting that the local model I'm pointing to is a peft model.

@Narsil (Collaborator, Author) commented Sep 20, 2023

Hmm, I haven't tried, but I don't think there's any big reason why it shouldn't work.

Could you open an issue with the full template filled out (and ideally links to the actual models and setup so we can reproduce as easily as possible)?

@shimizust

@Narsil At least just looking at the code (server/text_generation_server/cli.py), it seems the check for a PEFT model only happens in the block that runs when the model is not a local model? Is that a correct understanding? I can open an issue.

if not is_local_model:
    try:
        adapter_config_filename = hf_hub_download(model_id, revision=revision, filename="adapter_config.json")
        utils.download_and_unload_peft(model_id, revision, trust_remote_code=trust_remote_code)
    except (utils.LocalEntryNotFoundError, utils.EntryNotFoundError):
        pass
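
For reference, a minimal sketch of how the same check might be extended to local directories (illustrative only, not necessarily the actual change; `model_id`, `revision`, `trust_remote_code`, `is_local_model`, and `utils` are the names from the cli.py snippet above):

import os

# Also detect a PEFT adapter when `model_id` points at a local directory.
if is_local_model and os.path.isfile(os.path.join(model_id, "adapter_config.json")):
    utils.download_and_unload_peft(model_id, revision, trust_remote_code=trust_remote_code)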

@Narsil (Collaborator, Author) commented Sep 20, 2023

Oh true, it should be relatively easy to update. Do you want to take a stab at it?

@shimizust

@Narsil Thanks, I'd be interested in taking a stab at it. However, I'm running into some challenges testing on my Mac without a GPU. I'll give it a shot, but if I'm unable to address the issue in a reasonable timeframe, I may ask someone else to take over.

@cirocavani commented Sep 21, 2023

I had a similar problem (I was trying to serve a PEFT LoRA from the folder /opt/ml/model on SageMaker).

My workaround was to create the merged model beforehand and serve it (not requiring the PEFT path in TGI).

Something like:

3 folders:

  • ./utils - merge script
  • ./peft/<checkpoint> - Input (adapter_config.json, tokenizer_config.json, generation_config.json, ...)
  • ./model/data - Output (config.json, safetensors, tokenizer_config.json...)
    (I also share my HF Hub cache to avoid duplicate downloads)

Script ./utils/peft_merger.py

(the weight rename is specific to the Bloom model and should probably be removed or adapted for your model)

import os
import sys

import peft
import torch
import transformers

peft_checkpoint_dir = sys.argv[1]
model_data_dir = sys.argv[2]

model = peft.AutoPeftModelForCausalLM.from_pretrained(
    peft_checkpoint_dir,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)

base_model_id = model.peft_config["default"].base_model_name_or_path

model = model.merge_and_unload()

tokenizer_config_file = os.path.join(peft_checkpoint_dir, "tokenizer_config.json")
tokenizer_peft_or_base = (
    peft_checkpoint_dir if os.path.isfile(tokenizer_config_file) else base_model_id
)
tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer_peft_or_base)

generation_config_file = os.path.join(peft_checkpoint_dir, "generation_config.json")
generation_peft_or_base = (
    peft_checkpoint_dir if os.path.isfile(generation_config_file) else base_model_id
)
generation = transformers.GenerationConfig.from_pretrained(generation_peft_or_base)


### rename weights transformer.<key> -> <key> as expected by TGI
assert model.base_model_prefix == "transformer"
state_dict = {
    k[len(model.base_model_prefix) + 1 :]: v
    for k, v in model.state_dict().items()
    if k.startswith(model.base_model_prefix)
}
###

model.save_pretrained(model_data_dir, state_dict=state_dict, safe_serialization=True)
tokenizer.save_pretrained(model_data_dir)
generation.save_pretrained(model_data_dir)

Locally, I run the script inside the TGI docker image.

docker run --rm \
--volume=./utils:/opt/ml/utils \
--volume=./peft/<checkpoint>:/opt/ml/peft-checkpoint \
--volume=./model/data:/opt/ml/model \
--volume=$HOME/.cache/huggingface/hub:/data \
--entrypoint=python \
--env PYTHONUNBUFFERED=1 \
--env HF_HUB_ENABLE_HF_TRANSFER=0 \
--env HUGGINGFACE_HUB_CACHE=/data \
ghcr.io/huggingface/text-generation-inference:1.0.3 \
/opt/ml/utils/peft_merger.py \
/opt/ml/peft-checkpoint \
/opt/ml/model

After running the container, I create a model.tar.gz with the ./model/data/* files and upload it to S3. Then I start a SageMaker endpoint with this model.

I tested with bigscience/bloom-7b1 and found a problem that is not handled in this PR: when I run model = model.merge_and_unload(), all weights are renamed with a transformer. prefix. I had to rename them to remove this prefix (that is what the state_dict block in the script above does).

In your case, I think you can mount (--volume) your model in docker (and drop the HF Hub cache mount). Also, you will have to check whether the saved model has the correct weight names.

Example with Bloom:

Input ./peft/checkpoint

README.md                                              3,621 bytes
adapter_config.json                                      438 bytes
adapter_model.bin                                 31,479,589 bytes
generation_config.json                                   141 bytes
special_tokens_map.json                                   96 bytes
tokenizer.json                                    14,501,114 bytes
tokenizer_config.json                                    286 bytes
------------------------------------------------------------------
7 files                                           45,985,285 bytes

Output ./model/data

config.json                                              783 bytes
generation_config.json                                   136 bytes
model-00001-of-00002.safetensors               9,976,218,536 bytes
model-00002-of-00002.safetensors               4,161,852,232 bytes
model.safetensors.index.json                          27,460 bytes
special_tokens_map.json                                   96 bytes
tokenizer.json                                    14,501,114 bytes
tokenizer_config.json                                    286 bytes
------------------------------------------------------------------
8 files                                       14,152,600,643 bytes

And running it with Docker (this requires a GPU):

docker run --rm \
--gpus=all \
--shm-size=1g \
--publish=8080:80 \
--volume=./model/data:/opt/ml/model \
--env DTYPE=bfloat16 \
ghcr.io/huggingface/text-generation-inference:1.0.3 \
--model-id=/opt/ml/model

Output:

2023-09-20T10:55:43.888739Z  INFO text_generation_launcher: Args { model_id: "/opt/ml/model", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: Some(BFloat16), trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "129cc1f5b564", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-09-20T10:55:43.888849Z  INFO download: text_generation_launcher: Starting download process.
2023-09-20T10:55:46.446105Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-09-20T10:55:46.891730Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-09-20T10:55:46.891936Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-09-20T10:57:28.122836Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0

2023-09-20T10:57:28.186745Z  INFO shard-manager: text_generation_launcher: Shard ready in 101.294345318s rank=0
2023-09-20T10:57:28.274466Z  INFO text_generation_launcher: Starting Webserver
2023-09-20T10:57:28.837124Z  WARN text_generation_router: router/src/main.rs:194: no pipeline tag found for model /opt/ml/model
2023-09-20T10:57:28.860890Z  INFO text_generation_router: router/src/main.rs:213: Warming up model
2023-09-20T10:57:30.887951Z  WARN text_generation_router: router/src/main.rs:224: Model does not support automatic max batch total tokens
2023-09-20T10:57:30.887975Z  INFO text_generation_router: router/src/main.rs:246: Setting max batch total tokens to 16000
2023-09-20T10:57:30.887980Z  INFO text_generation_router: router/src/main.rs:247: Connected
2023-09-20T10:57:30.887985Z  WARN text_generation_router: router/src/main.rs:252: Invalid hostname, defaulting to 0.0.0.0

@shimizust

Thanks @cirocavani for the detailed solution to work around this issue; this is very helpful.

@tleyden (Contributor) commented Nov 14, 2023

@Narsil @shimizust @cirocavani

I hit the same issue when trying to load local PEFT weights. Here's a PR that fixed it for me:

#1260

fxmarty pushed a commit that referenced this pull request Nov 23, 2023
# What does this PR do?

Enables PEFT weights to be loaded from a local directory, as opposed to
a hf hub repository. It is a continuation of the work in PR
#762



Fixes #1259 


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
Pull Request section? **Yes but I don't know how to run the tests for
this repo, and it doesn't look like this code is covered anyway**
- [x] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
to it if that's the case. **Yes, @Narsil asked for a PR in [this
comment](#762 (comment))**
- [x] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
**I didn't see any documentation added to the [original
PR](#762),
and am not sure where this belongs. Let me know and I can add some**
- [x] Did you write any new necessary tests? **I didn't see any existing
test coverage for this python module**


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil 


---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
cr313 added a commit to cr313/text-generation-inference-load-test that referenced this pull request Apr 19, 2024