Tied weight optimization for checkpoints doesn't work with text-generation-inference. #555

Closed
jenkspt opened this issue Jul 5, 2023 · 15 comments · Fixed by #579

Comments

@jenkspt

jenkspt commented Jul 5, 2023

System Info

Ubuntu 20.04
4 NVIDIA A10 GPUs

I think checkpoints saved after this feature was merged don't work with text-generation-inference: huggingface/transformers#23868

With Falcon models I'm getting "lm_head not found". I'll add more details once I find minimal steps to reproduce.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

1. Save a tiiuae/falcon-40b checkpoint using transformers==4.30.2
2. Launch the text-generation-inference server

(using transformers==4.27.4 works without issue)
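
In Python, the save step looks something like this (a minimal sketch; the output path is a placeholder and loading falcon-40b needs enough memory):

from transformers import AutoModelForCausalLM

# Saving a local copy with transformers==4.30.2 produces a checkpoint
# in which TGI later fails to find lm_head.
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b", trust_remote_code=True)
model.save_pretrained("/data/falcon-40b-local")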

Expected behavior

I expect the text-generation-inference weight loader to be able to find the lm_head weight in the checkpoint. Note this may be a safetensors issue.
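
One way to check whether the safetensors conversion dropped the weight is to list the keys actually stored in the shards (a sketch; the shard filename is hypothetical and depends on how the checkpoint was sharded):

from safetensors import safe_open

# Print every tensor name stored in one shard; if the tied lm_head.weight
# was skipped at save time, it will not appear in any shard.
with safe_open("model-00001-of-00002.safetensors", framework="pt") as f:
    print(list(f.keys()))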

@Narsil
Collaborator

Narsil commented Jul 6, 2023

Could you share the name of the affected model?

It's simply a matter of weight naming; the conversion method here is a bit crude (but very memory-efficient), so we just need some weight renaming.
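
For illustration only, such a renaming fallback might look roughly like this (a hypothetical sketch, not TGI's actual code; the alias table is an assumption based on Falcon's tied embeddings):

# Hypothetical fallback: when a tied weight was dropped at save time,
# retry the lookup under the name of the tensor it is tied to.
TIED_ALIASES = {
    "lm_head.weight": "transformer.word_embeddings.weight",
}

def resolve_tensor_name(available_keys, name):
    if name in available_keys:
        return name
    alias = TIED_ALIASES.get(name)
    if alias in available_keys:
        return alias
    raise RuntimeError(f"weight {name} does not exist")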

OlivierDehaene pushed a commit that referenced this issue Jul 7, 2023
- Look at `transformers` base class to check for
  `_key_to_ignore_on_load_missing` or `_tied_weights` which are the
  standard attributes to select the keys to NOT save on disk (since they
  are ignored)

- Modified safetensors code (to be reflected in safetensors even if it's
  an internal function).
  
- Will not work for trust_remote_code=True repos (like santacoder).

Should help with: #555, #501, #556, and #482 (comment)
@fadynakhla

fadynakhla commented Jul 10, 2023

Hi @Narsil,

Just wanted to add some more detail as I have been dealing with this issue as well. If I load and save one of the falcon models:

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True)
model.save_pretrained("/path/to/model")

Then copy over the tokenizer and use that saved model to start the text-generation-inference server:

docker run --gpus '"device=2,3"' --shm-size 1g -p 8080:80 -v /data:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id /path/to/model --num-shard 1 --max-input-length 9000 --max-total-tokens 10000 --max-best-of 8 --trust-remote-code --max-batch-prefill-tokens 9000

With transformers version 4.30.2 I get an error that looks something like this:

Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 253, in get_model
    return FlashRWSharded(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 56, in __init__
    model = FlashRWForCausalLM(config, weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 628, in __init__
    self.transformer = FlashRWModel(config, weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 553, in __init__
    self.word_embeddings = TensorParallelEmbedding(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 280, in __init__
    weight = weights.get_sharded(f"{prefix}.weight", dim=0)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
    filename, tensor_name = self.get_filename(tensor_name)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")

RuntimeError: weight transformer.word_embeddings.weight does not exist

However, using transformers version 4.27.4 to load and save the model allows the TGI server to start up as expected from the locally saved weights.

I have also tried using the PR you've linked above, but that does not solve the issue, which I think is related to how the model weights are saved rather than how they are converted to safetensors. Maybe this issue belongs in the transformers repo?

@Narsil
Collaborator

Narsil commented Jul 10, 2023

Can you try with:

ghcr.io/huggingface/text-generation-inference:sha-b4024ed

This is what got modified 4 days ago (we changed a bit how we choose the actual tensors to copy).

@fadynakhla

Yeah, just tried that; it runs into the same issue. Here is the full output when pointing to the model saved with 4.30.2:

2023-07-10T18:06:11.828477Z  INFO text_generation_launcher: Args { model_id: "/data/test-models/falcon-7b-instruct", revision: None, sharded: None, num_shard: Some(1), quantize: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 8, max_stop_sequences: 4, max_input_length: 9000, max_total_tokens: 10000, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 9000, max_batch_total_tokens: 16000, max_waiting_tokens: 20, hostname: "dc570373c710", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-10T18:06:11.828614Z  INFO text_generation_launcher: Starting download process.
2023-07-10T18:06:13.292872Z  WARN download: text_generation_launcher: No safetensors weights found for model /data/test-models/falcon-7b-instruct at revision None. Converting PyTorch weights to safetensors.

2023-07-10T18:06:27.323366Z  INFO download: text_generation_launcher: Convert: [1/3] -- Took: 0:00:13.984885

2023-07-10T18:06:38.581181Z  INFO download: text_generation_launcher: Convert: [2/3] -- Took: 0:00:11.257167

2023-07-10T18:06:53.949943Z  INFO download: text_generation_launcher: Convert: [3/3] -- Took: 0:00:15.368202

2023-07-10T18:06:54.369737Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-07-10T18:06:54.369814Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `/data/test-models/falcon-7b-instruct` do not contain malicious code.
2023-07-10T18:06:54.369821Z  WARN text_generation_launcher: Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
2023-07-10T18:06:54.370572Z  INFO text_generation_launcher: Starting shard 0
2023-07-10T18:06:56.617177Z  WARN shard-manager: text_generation_launcher: We're not using custom kernels.
 rank=0
2023-07-10T18:06:56.818961Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 253, in get_model
    return FlashRWSharded(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 56, in __init__
    model = FlashRWForCausalLM(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 634, in __init__
    self.transformer = FlashRWModel(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 559, in __init__
    self.word_embeddings = TensorParallelEmbedding(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 280, in __init__
    weight = weights.get_sharded(f"{prefix}.weight", dim=0)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight transformer.word_embeddings.weight does not exist

compared to pointing at the model saved with 4.27.4:

2023-07-10T18:09:53.995176Z  INFO text_generation_launcher: Args { model_id: "/data/test-models/falcon-7b-instruct-oldhf", revision: None, sharded: None, num_shard: Some(1), quantize: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 8, max_stop_sequences: 4, max_input_length: 9000, max_total_tokens: 10000, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 9000, max_batch_total_tokens: 16000, max_waiting_tokens: 20, hostname: "9ae087ae066c", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-10T18:09:53.995278Z  INFO text_generation_launcher: Starting download process.
2023-07-10T18:09:55.481628Z  WARN download: text_generation_launcher: No safetensors weights found for model /data/test-models/falcon-7b-instruct-oldhf at revision None. Converting PyTorch weights to safetensors.

2023-07-10T18:10:09.814556Z  INFO download: text_generation_launcher: Convert: [1/3] -- Took: 0:00:14.287668

2023-07-10T18:10:22.541350Z  INFO download: text_generation_launcher: Convert: [2/3] -- Took: 0:00:12.726082

2023-07-10T18:10:37.206531Z  INFO download: text_generation_launcher: Convert: [3/3] -- Took: 0:00:14.664828

2023-07-10T18:10:37.940380Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-07-10T18:10:37.940456Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `/data/test-models/falcon-7b-instruct-oldhf` do not contain malicious code.
2023-07-10T18:10:37.940464Z  WARN text_generation_launcher: Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
2023-07-10T18:10:37.940728Z  INFO text_generation_launcher: Starting shard 0
2023-07-10T18:10:40.184657Z  WARN shard-manager: text_generation_launcher: We're not using custom kernels.
 rank=0
2023-07-10T18:10:47.951716Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-07-10T18:10:50.542364Z  INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
 rank=0
2023-07-10T18:10:50.554101Z  INFO text_generation_launcher: Shard 0 ready in 12.611513023s
2023-07-10T18:10:50.650702Z  INFO text_generation_launcher: Starting Webserver
2023-07-10T18:10:50.728175Z  WARN text_generation_router: router/src/main.rs:186: no pipeline tag found for model /data/test-models/falcon-7b-instruct-oldhf
2023-07-10T18:10:50.733065Z  INFO text_generation_router: router/src/main.rs:205: Warming up model
2023-07-10T18:10:53.408244Z  INFO text_generation_router: router/src/main.rs:214: Connected
2023-07-10T18:10:53.408264Z  WARN text_generation_router: router/src/main.rs:219: Invalid hostname, defaulting to 0.0.0.0

@fadynakhla

fadynakhla commented Jul 10, 2023

Another thing I have noticed (though I'm not sure if it is at all related): the newer version of transformers does not save the configuration and modeling Python files when save_pretrained is used, which results in an error when trying to load the saved model with from_pretrained.

Edit: I believe this issue is unrelated; I can download those files, place them in the saved model directory, and load it up as expected. I have loaded both models with from_pretrained and compared model.transformer.word_embeddings.weight and model.lm_head.weight; they are identical for both, so it would seem that none of the weights are being skipped over when using save_pretrained.
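
For reference, a quick check along those lines, comparing values and whether the two tensors actually share storage (i.e. are tied):

import torch

# Equal values mean nothing was dropped on save; a shared data pointer
# means the two parameters are tied rather than independent copies.
a = model.transformer.word_embeddings.weight
b = model.lm_head.weight
print(torch.equal(a, b), a.data_ptr() == b.data_ptr())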

@Narsil
Collaborator

Narsil commented Jul 10, 2023

I just think the previous weights are already saved.

The new PR doesn't fix it because we're still pointing to a trust_remote_code model, which doesn't have the heuristics to choose the weights better.

PR incoming.

@fadynakhla

Yeah, they are. I had just wanted to confirm that by loading the models directly.

Great to hear, let me know if there's anything I can do to help!

@Narsil
Collaborator

Narsil commented Jul 10, 2023

the newer version of transformers does not save the configuration and modeling python files when save_pretrained is used which results in an error when trying to load the saved model with from_pretrained

This is not new; you used only AutoModelFor...from_pretrained(...).save_pretrained(...), which never saved anything other than the model.

You need to do the same with AutoConfig and AutoTokenizer if you want those files included. Or do pipe = pipeline(...); pipe.save_pretrained(...).
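
Following that advice, a local save that TGI can consume end to end would look something like this (a sketch with a placeholder path; model.save_pretrained already writes config.json, so the tokenizer files are the missing piece):

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True)

# Each object saves only its own files, so save both into the same
# directory to get the weights, config.json and tokenizer.json together.
model.save_pretrained("/path/to/model")
tokenizer.save_pretrained("/path/to/model")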

@Narsil
Collaborator

Narsil commented Jul 10, 2023

Fix is ready: #579

@fadynakhla

This is not new; you used only AutoModelFor...from_pretrained(...).save_pretrained(...), which never saved anything other than the model.

You need to do the same with AutoConfig and AutoTokenizer if you want those files included. Or do pipe = pipeline(...); pipe.save_pretrained(...).

With transformers 4.27.4 the Python files are saved:

import transformers

model = transformers.AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True)
model.save_pretrained("/data/test-models/falcon-7b-instruct-oldhf-1")

ls /data/test-models/falcon-7b-instruct-oldhf-1 does indeed list configuration_RW.py and modelling_RW.py.

@Narsil
Collaborator

Narsil commented Jul 10, 2023

But not config.json nor tokenizer.json, no?

I think those are necessary to load the entire thing.

@fadynakhla

fadynakhla commented Jul 10, 2023

The config.json is saved (in both versions) and the tokenizer.json is not saved, but this is the expected behavior.

The only difference between the two transformers versions, in terms of the files that get saved, is the Python files.

@Narsil
Collaborator

Narsil commented Jul 10, 2023

Interesting, maybe create an issue in transformers directly?

Edit: tokenizer.json is definitely needed for TGI to work optimally. It will work regardless, but having the file enables the Rust router to make cleverer decisions because it can count tokens with it.

@fadynakhla

Absolutely right, TGI doesn't start up properly without tokenizer.json. I've just been copying it over instead of loading and saving it.

Interesting, maybe create an issue in transformers directly?

Yeah, will do.

OlivierDehaene pushed a commit that referenced this issue Jul 12, 2023
…esn't work to see which weights to keep). (#579)

Fixes #555
@jenkspt
Author

jenkspt commented Jul 28, 2023

I'm still having issues when saving safetensors (works without error with safe_serialization=False). Steps to reproduce:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('tiiuae/falcon-7b', trust_remote_code=True, device_map='auto')
model.save_pretrained("test-falcon-7b-deploy", safe_serialization=True)

tokenizer = AutoTokenizer.from_pretrained('tiiuae/falcon-7b', trust_remote_code=True)
tokenizer.save_pretrained('test-falcon-7b-deploy')

# I can reload the model without error
# AutoModelForCausalLM.from_pretrained('test-falcon-7b-deploy')
VOLUME=/home/ubuntu/test-falcon-7b-deploy
docker run -i -t --gpus 0 \
  --shm-size 1g -p 8080:80 \
  --volume $VOLUME:/data/checkpoint \
  ghcr.io/huggingface/text-generation-inference:1.0.0 \
  --model-id /data/checkpoint \
  --num-shard 1
2023-07-28T21:52:50.837307Z  INFO text_generation_launcher: Args { model_id: "/data/checkpoint", revision: None, validation_workers: 2, sharded: None, num_shard: Some(1), quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "b2d45baef9e4", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-07-28T21:52:50.837422Z  INFO download: text_generation_launcher: Starting download process.
2023-07-28T21:52:52.513651Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-07-28T21:52:52.939481Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-07-28T21:52:52.939686Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-07-28T21:53:00.065924Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 184, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 218, in get_model
    return FlashRWSharded(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 64, in __init__
    model = FlashRWForCausalLM(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 624, in __init__
    self.lm_head = TensorParallelHead.load(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 207, in load
    weight = weights.get_tensor(f"{prefix}.weight")
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 65, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 52, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight lm_head.weight does not exist

2023-07-28T21:53:01.147439Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

You are using a model of type RefinedWebModel to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 184, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 218, in get_model
    return FlashRWSharded(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 64, in __init__
    model = FlashRWForCausalLM(config, weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 624, in __init__
    self.lm_head = TensorParallelHead.load(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 207, in load
    weight = weights.get_tensor(f"{prefix}.weight")

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 65, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 52, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")

RuntimeError: weight lm_head.weight does not exist
 rank=0
2023-07-28T21:53:01.246715Z ERROR text_generation_launcher: Shard 0 failed to start
2023-07-28T21:53:01.246761Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
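
Until the fix lands, a possible workaround implied by the note above is to skip safetensors serialization and let TGI's converter produce the safetensors itself (a sketch reusing the same directory as the repro):

# Save PyTorch .bin weights instead; TGI's convert step then decides
# which tied weights to keep when it writes the safetensors shards.
model.save_pretrained("test-falcon-7b-deploy", safe_serialization=False)
tokenizer.save_pretrained("test-falcon-7b-deploy")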
