
[PEFT] Fix save_pretrained to make sure adapters weights are also saved on TPU #29388

Merged
5 commits merged into huggingface:main on Mar 14, 2024

Conversation

@shub-kris
Contributor

shub-kris commented Mar 1, 2024

Bug Fix for saving adapter weights when using PEFT

What does this PR do?

This PR fixes saving adapter weights when using PEFT on TPUs. Currently only the model weights are being saved and not the adapter weights.

I tested this change locally with this script, and it now saves the following files whenever checkpointing:

README.md
adapter_config.json
adapter_model.safetensors
optimizer.pt
rng_state.pth
scheduler.pt
special_tokens_map.json
tokenizer.json
tokenizer.model
tokenizer_config.json
trainer_state.json
training_args.bin
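As a quick sanity check (not part of the PR itself), a checkpoint containing adapter_config.json and adapter_model.safetensors can be loaded back as a PEFT adapter on top of the base model. A minimal sketch, where the model id and checkpoint path are placeholders:

```python
# Hypothetical reload check; the model id and checkpoint path are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
# PeftModel.from_pretrained reads adapter_config.json and
# adapter_model.safetensors from the checkpoint directory.
model = PeftModel.from_pretrained(base_model, "output/checkpoint-500")
```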

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

This was discussed earlier here.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker changed the title from "Fix for saving adapter weights when using PEFT" to "[PEFT] Fix save_pretrained to make sure adapters weights are also saved" on Mar 4, 2024
@ArthurZucker changed the title from "[PEFT] Fix save_pretrained to make sure adapters weights are also saved" to "[PEFT] Fix save_pretrained to make sure adapters weights are also saved on TPU" on Mar 4, 2024
Collaborator

@ArthurZucker left a comment

Thanks 🤗

@@ -3035,9 +3035,10 @@ def _save_tpu(self, output_dir: Optional[str] = None):

# Save a trained model and configuration using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
supported_classes = (PreTrainedModel,) if not is_peft_available() else (PreTrainedModel, PeftModel)
Collaborator

I think all classes that support save_pretrained and from_pretrained can fall under PushToHubMixin, so we could also use that here, as both PreTrainedModel and PeftModel should inherit from it.
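For illustration, a rough sketch of that suggestion (not the exact code that was merged), assuming PushToHubMixin is importable from transformers.utils:

```python
from transformers.utils import PushToHubMixin


def save_like_save_tpu(model, output_dir: str) -> None:
    """Hypothetical helper mirroring the isinstance check discussed above."""
    # PreTrainedModel and PeftModel both inherit from PushToHubMixin, so a
    # single check against the mixin covers plain and PEFT-wrapped models.
    if isinstance(model, PushToHubMixin):
        model.save_pretrained(output_dir)
    else:
        raise TypeError(f"{type(model).__name__} does not support save_pretrained")
```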

Contributor Author

My fix was inspired by the code here:

supported_classes = (PreTrainedModel,) if not is_peft_available() else (PreTrainedModel, PeftModel)

Contributor Author

Great idea @ArthurZucker, pushed it.

@moficodes

Hello there, what is the state of this PR? Is there a timeline for when it will be merged and a new release of transformers will be out?

@ArthurZucker
Collaborator

Just waiting for @shub-kris to come back (he is off), and the transformers release will be in around two weeks.

@moficodes

I ran some tests on a GKE cluster with TPU v4 with 4 nodes.

https://gist.github.com/moficodes/1492228c80a3c08747a973b519cc7cda

This run fails with

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 13, in storage_ptr
    return tensor.untyped_storage().data_ptr()
RuntimeError: Attempted to access the data pointer on an invalid python storage.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "//fsdp.py", line 112, in <module>
    model.save_pretrained(new_model_id)
  File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2448, in save_pretrained
    safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
  File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 281, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 470, in _flatten
    shared_pointers = _find_shared_tensors(tensors)
  File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 72, in _find_shared_tensors
    if v.device != torch.device("meta") and storage_ptr(v) != 0 and storage_size(v) != 0:
  File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 17, in storage_ptr
    return tensor.storage().data_ptr()
  File "/usr/local/lib/python3.10/site-packages/torch/storage.py", line 956, in data_ptr
    return self._data_ptr()
  File "/usr/local/lib/python3.10/site-packages/torch/storage.py", line 960, in _data_ptr
    return self._untyped_storage.data_ptr()
RuntimeError: Attempted to access the data pointer on an invalid python storage.

That looks like the original error, so I am not certain whether its cause was resolved.
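For context (not part of this PR): this safetensors error typically means the tensors being serialized are still XLA tensors whose storage cannot be read on the host. A generic workaround, which may or may not apply to the FSDP/SPMD setup in the gist, is to materialize the state dict on CPU before saving:

```python
# Generic sketch, not taken from the gist above: move tensors to CPU so
# safetensors can inspect real storage pointers when writing the file.
cpu_state_dict = {k: v.detach().cpu() for k, v in model.state_dict().items()}
model.save_pretrained(new_model_id, state_dict=cpu_state_dict, safe_serialization=True)
```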

@shub-kris
Contributor Author

shub-kris commented Mar 11, 2024


Hi @moficodes, thanks for flagging this error, but at an initial glance it doesn't look like the problem that this PR addresses. This PR aims to save the adapter weights, which were not being saved before.

So, if you use the Trainer with this change, it saves the adapter weights too:

README.md
adapter_config.json
adapter_model.safetensors
optimizer.pt
rng_state.pth
scheduler.pt
special_tokens_map.json
tokenizer.json
tokenizer.model
tokenizer_config.json
trainer_state.json
training_args.bin

Earlier, adapter_model.safetensors and adapter_config.json were not being saved. A simple script to demonstrate this is here, which you can run with the commands below (a rough sketch of such a script follows the commands):

export XLA_USE_BF16=1 PJRT_DEVICE=TPU XLA_USE_SPMD=1  HF_TOKEN=<your-HF-TOKEN>
python save-gemma.py
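For readers without access to the linked script, a rough sketch of what such a test could look like (model id, LoRA config, and paths here are illustrative, not taken from the actual save-gemma.py):

```python
# Illustrative sketch only: save a LoRA-wrapped model through the Trainer so the
# output directory should now also contain adapter_config.json and
# adapter_model.safetensors.
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
model = get_peft_model(base, LoraConfig(r=8, task_type="CAUSAL_LM"))

trainer = Trainer(model=model, args=TrainingArguments(output_dir="./out"))
# On TPU this goes through Trainer._save_tpu; with this PR the adapter files are
# saved alongside the usual checkpoint files.
trainer.save_model("./out/checkpoint-test")
```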

So the error you are encountering may be unrelated to what this PR tries to fix.

@moficodes

I see. Will open a separate issue for it then.

Thank you!

@moficodes

The error happens on the same line though: model.save_pretrained(new_model_id).

@shub-kris
Contributor Author

@LysandreJik can we merge this if it looks good to you? @ArthurZucker is on holiday, and I have made the changes he asked for and tested them locally.

@shub-kris
Contributor Author

@moficodes answered it here: #29608 (comment)

@amyeroberts
Collaborator

@shub-kris Based on reviews and code, we can merge. There's currently a failing test which needs to be resolved first. Could you try rebasing on main to make sure you have all the latest updates, and trigger a fresh CI run?

@shub-kris
Contributor Author

shub-kris commented Mar 13, 2024

@amyeroberts Thank you for looking into the PR. I have rebased, but some checks are still failing, I guess because of this: https://github.com/huggingface/transformers/runs/22627188153

@amyeroberts
Collaborator

@shub-kris Yep - a fix has just been merged into main. Apologies for the disruption. Could you try rebasing again?

@amyeroberts amyeroberts merged commit c9e3c0b into huggingface:main Mar 14, 2024
21 checks passed
@shub-kris
Contributor Author

Thanks a lot @amyeroberts
