Tied weight optimization for checkpoints doesn't work with text-generation-inference. #555
Could you share the name of the affected model? It's simply a matter of weight naming; the conversion method here is a bit crude (but very memory-efficient), we just need some weight renaming (sketched below). |
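A sketch of the kind of renaming meant here (the mapping below is hypothetical; the real one would depend on the affected model):

# Hypothetical rename pass applied to a checkpoint's state dict during
# the .bin -> .safetensors conversion; the actual mapping depends on
# the model architecture.
RENAMES = {"lm_head.weight": "word_embeddings.weight"}

def rename_keys(state_dict):
    return {RENAMES.get(name, name): tensor for name, tensor in state_dict.items()}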
- Look at `transformers` base class to check for `_keys_to_ignore_on_load_missing` or `_tied_weights`, which are the standard attributes to select the keys to NOT save on disk (since they are ignored)
- Modified safetensors code (to be reflected in safetensors even if it's an internal function).
- Will not work for trust_remote_code=True repos (like santacoder).

Should help with: #555 and #501 and #556 and #482 (comment)
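A rough sketch of the key-filtering idea described there (illustrative only, not the actual PR code):

import re
from safetensors.torch import save_file

def save_without_ignored_keys(model, path):
    # Regex patterns transformers marks as safe to drop (re-tied on load),
    # e.g. "lm_head.weight" on models with tied embeddings.
    patterns = getattr(model, "_keys_to_ignore_on_load_missing", None) or []
    kept = {
        name: tensor.contiguous()
        for name, tensor in model.state_dict().items()
        if not any(re.search(pattern, name) for pattern in patterns)
    }
    save_file(kept, path)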
Hi @Narsil, just wanted to add some more detail, as I have been dealing with this issue as well. If I load and save one of the falcon models:
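# Presumably along these lines (mirrors the falcon-7b example later in
# this thread; the exact model used may differ):
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'tiiuae/falcon-7b', trust_remote_code=True, device_map='auto'
)
model.save_pretrained('test-falcon-7b-deploy', safe_serialization=True)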
Then copy over the tokenizer and use that saved model to start the text-generation-inference server:
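# A launch command along the lines of the one shown later in this thread
# (image tag and paths here are assumptions):
VOLUME=/path/to/saved-model
docker run --gpus 0 --shm-size 1g -p 8080:80 \
    --volume $VOLUME:/data/checkpoint \
    ghcr.io/huggingface/text-generation-inference:1.0.0 \
    --model-id /data/checkpoint --num-shard 1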
When transformers version =
However, using transformers version =

I have also tried using the PR you've linked above, but that does not solve the issue, which I think is related to how the model weights are saved rather than how they are converted to safetensors. Maybe this issue belongs in the transformers repo? |
Can you try with
This is what got modified 4 days ago (we changed a bit how we choose the actual tensors to copy). |
Yeah, just tried that and it runs into the same issue. Here is the full output when pointing to the model saved with
compared to pointing at the model saved with
|
Another thing I have noticed (though not sure if it is at all related): the newer version of transformers does not save the configuration and modeling python files when

Edit: I believe this issue is unrelated; I can download those files and place them in the saved model and load it up as expected. I have loaded both models with from_pretrained and compared
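Roughly what that comparison looks like (a sketch; the save directory names are illustrative):

import torch
from transformers import AutoModelForCausalLM

# Load the checkpoint saved with each transformers version (paths assumed).
a = AutoModelForCausalLM.from_pretrained('falcon-save-old', trust_remote_code=True)
b = AutoModelForCausalLM.from_pretrained('falcon-save-new', trust_remote_code=True)

# Once loaded back into memory, both should expose the same tensor names
# with identical values.
assert a.state_dict().keys() == b.state_dict().keys()
for name, tensor in a.state_dict().items():
    assert torch.equal(tensor, b.state_dict()[name]), name
|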
I just think the previous weights are already saved. The new PR doesn't fix it because we're still pointing to them. A PR is incoming. |
Yeah, they are; I had just wanted to confirm that by loading the models directly. Great to hear, let me know if there's anything I can do to help! |
This is not new, you used only You need to do the same with |
Fix is ready: #579 |
With transformers
|
But not I think those are necessary to load the entire thing. |
The only difference between the two transformers versions, in terms of the files that get saved, is the python files. |
Interesting, maybe create an issue in Edit: |
Absolutely right, TGI doesn't start up properly without
Yeah will do |
I'm still having issues when saving safetensors (works without error with

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained('tiiuae/falcon-7b', trust_remote_code=True, device_map='auto')
model.save_pretrained("test-falcon-7b-deploy", safe_serialization=True)
tokenizer = AutoTokenizer.from_pretrained('tiiuae/falcon-7b', trust_remote_code=True)
tokenizer.save_pretrained('test-falcon-7b-deploy')
# I can reload the model without error
# AutoModelForCausalLM.from_pretrained('test-falcon-7b-deploy')

Then starting the server:

VOLUME=/home/ubuntu/test-falcon-7b-deploy
docker run -i -t --gpus 0 \
--shm-size 1g -p 8080:80 \
--volume $VOLUME:/data/checkpoint \
ghcr.io/huggingface/text-generation-inference:1.0.0 \
--model-id /data/checkpoint \
--num-shard 1

2023-07-28T21:52:50.837307Z INFO text_generation_launcher: Args { model_id: "/data/checkpoint", revision: None, validation_workers: 2, sharded: None, num_shard: Some(1), quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "b2d45baef9e4", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-07-28T21:52:50.837422Z INFO download: text_generation_launcher: Starting download process.
2023-07-28T21:52:52.513651Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2023-07-28T21:52:52.939481Z INFO download: text_generation_launcher: Successfully downloaded weights.
2023-07-28T21:52:52.939686Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-07-28T21:53:00.065924Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 184, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 218, in get_model
return FlashRWSharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 64, in __init__
model = FlashRWForCausalLM(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 624, in __init__
self.lm_head = TensorParallelHead.load(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 207, in load
weight = weights.get_tensor(f"{prefix}.weight")
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 65, in get_tensor
filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 52, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight lm_head.weight does not exist
2023-07-28T21:53:01.147439Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
You are using a model of type RefinedWebModel to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 184, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 218, in get_model
return FlashRWSharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 64, in __init__
model = FlashRWForCausalLM(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 624, in __init__
self.lm_head = TensorParallelHead.load(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 207, in load
weight = weights.get_tensor(f"{prefix}.weight")
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 65, in get_tensor
filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 52, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight lm_head.weight does not exist
rank=0
2023-07-28T21:53:01.246715Z ERROR text_generation_launcher: Shard 0 failed to start
2023-07-28T21:53:01.246761Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart |
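One way to confirm what the error points at is to list the keys in the saved safetensors file (a minimal check, assuming a single-shard save named model.safetensors):

from safetensors import safe_open

# The checkpoint path matches the save_pretrained call above.
with safe_open('test-falcon-7b-deploy/model.safetensors', framework='pt') as f:
    keys = list(f.keys())
print('lm_head.weight' in keys)  # False for checkpoints hit by this issue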
System Info
Ubuntu 20.04
4 NVIDIA A10 GPUs
I think checkpoints saved after this feature was merged don't work with text-generation-inference.
huggingface/transformers#23868
With falcon models I'm getting "lm_head not found". I'll add more details once I find minimal steps to reproduce.
Reproduction
1. Save a tiiuae/falcon-40b checkpoint using transformers==4.30.2
2. Launch the text-generation-inference server

(Using transformers==4.27.4 works without issue.)
Expected behavior
Expect the text-generation-inference weight loader to be able to find the lm_head weight in the checkpoint. Note this may be a safetensors issue.