Parallel sharding #21

tengomucho · 2024-04-09T16:39:11Z

What does this PR do?

This enables sharding for the Gemma model, making it possible to load google/gemma-7b and run inference on it.
TGI integration is yet to come but it should be done soon!

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

HuggingFaceDocBuilderDev · 2024-04-09T16:42:10Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

regisss

I left a couple of comments, I'll review the modeling file tomorrow!

optimum/tpu/modeling.py

tests/conftest.py

mfuntowicz

LGTM. My only concern is the explicit need to pass torch_dtype to from_pretrained, which I find a bit spurious, but it is OK to merge and dig into this in another PR.

mfuntowicz · 2024-04-10T13:48:28Z

examples/text-generation/generation_gemma.py

@@ -56,7 +56,7 @@ def main():
    model_id = "google/gemma-2b"
    torch_dtype = torch.bfloat16

-    model = TpuModelForCausalLM.from_pretrained(model_id, torch_dtype=torch_dtype)
+    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch_dtype)


Do we need the torch_dtype=torch_dtype? It should be taken from the config no?

Well, it doesn't look like it works this way:

>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
Gemma's activation function should be approximate GeLU and not exact GeLU. Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu` instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████████████████| 2/2 [00:00<00:00, 2.65it/s]
>>> print(model.config.torch_dtype)
torch.bfloat16
>>> print(model.model.layers[0].self_attn.o_proj.weight.dtype)
torch.float32

mfuntowicz · 2024-04-10T13:53:02Z

optimum/tpu/distributed_model.py

+    config = AutoConfig.from_pretrained(model_id)
+    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=config.torch_dtype)


I have a hard time understanding why we need to do it this way. Aren't we overriding the default behaviour with the default behaviour? @regisss do you know?

It seems the default is to load in fp32 whatever the dtype specified in the config is: https://huggingface.slack.com/archives/C014N4749J9/p1712757959601599

So I got some insights on the design for this. It seems that transformers uses the default PyTorch dtype, i.e. torch.float32. I will probably need to change this code later, as it might not work for models whose weights were not trained in float32/bfloat16. I have already seen that we cannot use bf16 everywhere, because some operations are not supported (I saw this in a unit test with gpt2). It is probably a custom configuration we need to add to the model. I pushed a cleaner fix than this one.

bfloat16 will be set by default in gemma models, other models will still load in float32 by default.
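The behaviour discussed in this thread can be summarised in a small sketch. This is not the code from the PR: `resolve_torch_dtype` and `BF16_DEFAULT_MODEL_TYPES` are hypothetical names, and dtypes are written as plain strings so the snippet runs without torch. It assumes that an explicit `torch_dtype` always wins, that Gemma falls back to bfloat16, and that every other model falls back to float32.

```python
from typing import Optional

# Hypothetical: model types that default to bfloat16 when no dtype is requested.
BF16_DEFAULT_MODEL_TYPES = {"gemma"}


def resolve_torch_dtype(model_type: str, requested: Optional[str] = None) -> str:
    """Pick the dtype to load weights in (dtypes are strings for illustration only)."""
    if requested is not None:
        # An explicit torch_dtype argument always takes precedence.
        return requested
    if model_type in BF16_DEFAULT_MODEL_TYPES:
        return "bfloat16"
    # Otherwise fall back to float32, the PyTorch default mentioned above.
    return "float32"


if __name__ == "__main__":
    print(resolve_torch_dtype("gemma"))
    print(resolve_torch_dtype("gpt2"))
    print(resolve_torch_dtype("gpt2", "bfloat16"))
```

Keeping the per-model default in one lookup would make it easy to add further model types later, which matches the concern raised above about models that cannot run fully in bf16.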

tengomucho added 16 commits April 9, 2024 16:44

chore: update transformers dependency

d75ba94

feat: import transformer's gemma modeling code

0ee7430

It will be used to adapt it for sharding. Only imports have been adapted, and only code relevant for GemmaForCausalLM has been added.

chore: rename model Gemma -> TpuGemma to prepare for changes

ca88068

feat(DistributedModel): added config property

a3de4d7

chore: rename test_parallel_proxy.py -> test_distributed_model.py

80170a9

fix: use AutoModelForCausalLM instead of TpuModelForCausalLM

9a9bcf8

feat: AutoModelForCausalLM will choose TpuGemmaForCausalLM if possible

5bf6c70

fix(TpuGemma): avoid using device_map when loading model

9dfb7b6

It seems that device_map parameter triggers a chain of calls that will try to use accelerate to load the model using less memory. The problem is that it skips the load state pre-hooks, making the weights loading impossible.
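To illustrate why skipping the pre-hooks matters, here is a toy model of the mechanism in plain Python. It uses neither torch nor accelerate; `ToyShardedLinear`, its `load_state_dict`, and `fast_load` are hypothetical stand-ins. The assumption is that a sharded layer relies on a pre-hook to slice the full checkpoint weight down to its local shard before the copy happens.

```python
from typing import Callable, Dict, List

Weights = Dict[str, List[float]]


class ToyShardedLinear:
    """Holds one shard of a weight; a pre-hook slices the full checkpoint weight."""

    def __init__(self, rank: int, world_size: int, full_size: int) -> None:
        self.rank = rank
        self.shard_size = full_size // world_size
        self.weight = [0.0] * self.shard_size
        self._pre_hooks: List[Callable[[Weights], None]] = [self._shard_hook]

    def _shard_hook(self, state: Weights) -> None:
        # Keep only the slice of the full weight owned by this rank.
        start = self.rank * self.shard_size
        state["weight"] = state["weight"][start:start + self.shard_size]

    def _copy(self, state: Weights) -> None:
        if len(state["weight"]) != len(self.weight):
            raise ValueError(
                "size mismatch: expected %d, got %d"
                % (len(self.weight), len(state["weight"]))
            )
        self.weight = list(state["weight"])

    def load_state_dict(self, state: Weights) -> None:
        # Regular path: run the pre-hooks first, then copy the weights.
        state = dict(state)  # shallow copy so the caller's dict is untouched
        for hook in self._pre_hooks:
            hook(state)
        self._copy(state)

    def fast_load(self, state: Weights) -> None:
        # Stand-in for a loading path that bypasses the pre-hooks.
        self._copy(dict(state))


if __name__ == "__main__":
    checkpoint = {"weight": [1.0, 2.0, 3.0, 4.0]}
    layer = ToyShardedLinear(rank=1, world_size=2, full_size=4)
    layer.load_state_dict(checkpoint)
    print(layer.weight)
    try:
        layer.fast_load(checkpoint)
    except ValueError as err:
        print("fast path failed:", err)
```

In this toy, the regular path leaves rank 1 with its own half of the weight, while the path that bypasses the hook fails on a size mismatch.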

feat(gemma): sharding o_proj

ec3b752

It will now be running in parallel. More changes to come.

feat(gemma): sharding on q_proj

a7d7c0b

feat(gemma): sharding on k and v proj

b6fe32e

feat(gemma): sharding on mlp gate and up proj

e13d9ec

feat(gemma): sharding on mlp down proj

6cdede2

feat: model is loaded using pytorch_dtype from config

cd99226

This will lead to loading the model in bfloat16 when specified in the config.

fix: remove useless import

550e1fb

feat(tests): added test showing gemma7b sharding and prefill works

2215595
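For readers unfamiliar with sharding linear projections, here is a minimal sketch of the two usual ways to split a linear layer across devices. It is an illustration, not the PR's implementation. The commits above do not say which split each projection uses, so take it as an assumption (the common Megatron-style layout) that q/k/v and the MLP gate/up projections split output features while o_proj and the MLP down projection split input features. Devices are simulated with a loop and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # weight stored as [out_features][in_features]


def linear(x: List[float], w: Matrix) -> List[float]:
    """Compute y = W @ x for a single input vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def column_parallel(x: List[float], w: Matrix, world_size: int) -> List[float]:
    """Split W along output features; each rank produces a slice of y."""
    step = len(w) // world_size
    out: List[float] = []
    for rank in range(world_size):
        shard = w[rank * step:(rank + 1) * step]
        out.extend(linear(x, shard))  # concatenation stands in for an all-gather
    return out


def row_parallel(x: List[float], w: Matrix, world_size: int) -> List[float]:
    """Split W along input features; each rank produces a partial sum of y."""
    step = len(x) // world_size
    total = [0.0] * len(w)
    for rank in range(world_size):
        lo, hi = rank * step, (rank + 1) * step
        shard = [row[lo:hi] for row in w]
        partial = linear(x[lo:hi], shard)
        total = [t + p for t, p in zip(total, partial)]  # stands in for an all-reduce
    return total


if __name__ == "__main__":
    w = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
    x = [1.0, 0.0, 2.0, 1.0]
    print(linear(x, w))
    print(column_parallel(x, w, world_size=2))
    print(row_parallel(x, w, world_size=2))
```

All three calls print the same vector, which is the property a sharded layer has to preserve: the split changes where the work happens, not the result.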

tengomucho force-pushed the parallel-sharding branch from 941fdf2 to 2215595 April 9, 2024 16:44

tengomucho marked this pull request as ready for review April 9, 2024 16:44

tengomucho requested review from mfuntowicz and regisss April 9, 2024 16:44

regisss reviewed Apr 9, 2024

optimum/tpu/modeling.py (outdated, resolved)

tests/conftest.py (resolved)

chore: config_name_to_class uses config.model_type now

fe888a9
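A rough sketch of what dispatching on `config.model_type` could look like is shown below. The real `config_name_to_class` signature is not shown in this PR page, so everything here is hypothetical: `ToyConfig`, the two placeholder classes, the `MODEL_TYPE_TO_CLASS` table, and `pick_model_class` are illustrative names only. The assumption is that a TPU-specific class is used when one is registered for the model type and a generic class is used otherwise.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ToyConfig:
    """Minimal stand-in for a model config exposing model_type."""
    model_type: str


class GenericCausalLM:
    """Placeholder for the generic model class."""


class ToyTpuGemmaForCausalLM:
    """Placeholder for a sharding-aware Gemma class."""


# Hypothetical registry: model_type -> specialised class.
MODEL_TYPE_TO_CLASS: Dict[str, type] = {"gemma": ToyTpuGemmaForCausalLM}


def pick_model_class(config: ToyConfig) -> type:
    """Return the specialised class for config.model_type, or the generic one."""
    return MODEL_TYPE_TO_CLASS.get(config.model_type, GenericCausalLM)


if __name__ == "__main__":
    print(pick_model_class(ToyConfig("gemma")).__name__)
    print(pick_model_class(ToyConfig("gpt2")).__name__)
```

Keying on `model_type` rather than on the config class name means the lookup only depends on a string stored in the config itself.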

tengomucho force-pushed the parallel-sharding branch from f334bbd to fe888a9 April 10, 2024 08:52

tengomucho added 3 commits April 10, 2024 09:10

fix: get_generation_mode is now a method of generation_config

dbf11f7

API change when transformers was updated.

fix(TGI server): fix slot.stopped changed after transformers update

a96903b

fix(generator): fix sample generation again

6e6b44e

I wrongly chose the model's generation config instead of the one to the token selector.

tengomucho requested a review from regisss April 10, 2024 12:44

tengomucho mentioned this pull request Apr 10, 2024

Weights upcasted to float32 at load time #23

Closed

regisss approved these changes Apr 10, 2024

mfuntowicz approved these changes Apr 10, 2024

tengomucho added 2 commits April 10, 2024 14:53

fix: better handle torch_dtype

92e9e31

bfloat16 will be set by default in gemma models, other models will still load in float32 by default.

fix: remove unused import

7901d91

tengomucho merged commit 8e12733 into main Apr 10, 2024
4 checks passed

mfuntowicz deleted the parallel-sharding branch April 10, 2024 21:25
