
Add BloomCausalLM #1467

Merged: 31 commits into keras-team:master on Mar 6, 2024
Conversation

@abuelnasr0 (Contributor) commented Feb 24, 2024

This PR:

  • Adds BloomCausalLM.
    I have compared BloomCausalLM with the Hugging Face output and both produce similar output (see the comparison sketch after this list).

    This is the HF output with no start_token:
    hf_with_no_ s.

    This is the Keras output with no start_token:
    keras_with_no_ s.

    This is the Keras output with start_token (the default):
    keras_with_ s.

  • Fixes a small nit in AlibiBias.
    When the Keras dtype is float16 in the CausalLM test, AlibiBias called the arange function with dtype=int16, which causes an error in TensorFlow (I think). The same fix also appears in Add FalconBackbone #1389.

  • Adds the BLOOMZ presets to the conversion script.
    BLOOMZ models are fine-tuned versions of the pretrained BLOOM models; they perform better and are recommended by BigScience (bloom vs bloomz).
    Reference: https://arxiv.org/abs/2211.01786
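For reference, a minimal sketch of the kind of canary-generation comparison described above (my illustration, not the exact script used for this PR; the preset name, Hugging Face checkpoint, and prompt are assumptions):

    import keras_nlp
    from transformers import AutoModelForCausalLM, AutoTokenizer

    prompt = "The weather in Cairo today is"

    # KerasNLP side: force greedy decoding so the comparison is deterministic.
    keras_lm = keras_nlp.models.BloomCausalLM.from_preset("bloom_560m_multi")
    keras_lm.compile(sampler="greedy")
    print("keras:", keras_lm.generate(prompt, max_length=32))

    # Hugging Face side: greedy decoding from the original checkpoint.
    hf_tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
    hf_model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
    hf_inputs = hf_tokenizer(prompt, return_tensors="pt")
    hf_output = hf_model.generate(**hf_inputs, max_new_tokens=32, do_sample=False)
    print("hf:   ", hf_tokenizer.decode(hf_output[0], skip_special_tokens=True))

If the port is correct, the two generations should match (near-)token for token.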

@mattdangerw (Member) left a comment
Looks great! Thanks so much for the thorough testing on this. "canary generations" like you did are a great way to check. Just one comment.

keras_nlp/models/bloom/bloom_backbone.py (resolved)
head_dim = self.backbone.hidden_dim // num_heads
shape = [batch_size, num_layers, 2, max_length, num_heads, head_dim]
attention_layer_dtype = self.backbone.transformer_layers[0].dtype
cache = ops.zeros(shape, dtype=attention_layer_dtype)
@abuelnasr0 (Contributor, Author) commented Feb 27, 2024

I added these two lines because of a Keras 2 error. Setting the cache dtype to the same dtype as the variable_dtype of the attention layer prevents TensorFlow from raising the following error while updating the cache:

TypeError: Input 'update' of 'XlaDynamicUpdateSlice' Op has type bfloat16 that does not match type float32 of argument 'input'.

This error isn't raised in Keras 3, because Keras 3 casts all the input dtypes to layer.variable_dtype.

@mattdangerw (Member) commented
I actually think we want this to be compute_dtype not variable_dtype, for mixed precision cases. In that case, a variable dtype might be float32, but the output of layers will be bfloat16 or float16. This cache is the cache our key/value projection outputs, so the compute dtype would be the correct thing to cache here.

Do you run into errors if you make that change?
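To make the distinction concrete, here is a minimal Keras 3 sketch (illustrative, not code from this PR) of how variable_dtype and compute_dtype diverge under mixed precision:

    import keras

    # Under a mixed precision policy, weights and computations use different
    # dtypes: variables stay float32 while the layer computes in float16.
    layer = keras.layers.Dense(8, dtype="mixed_float16")
    print(layer.variable_dtype)  # float32 -> dtype of the weights
    print(layer.compute_dtype)   # float16 -> dtype of the layer's outputs

    # A key/value cache that is updated with this layer's outputs therefore
    # needs to be allocated in compute_dtype, otherwise the update dtype won't
    # match the cache dtype (the mismatch behind the XlaDynamicUpdateSlice
    # error quoted above).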

@abuelnasr0 (Contributor, Author) commented

Thank you for that! That's right, the cache dtype should be the same as the dtype of the einsum layer output. That was actually my intention, but I thought the output of the layer would be in the variable_dtype, not the compute_dtype.

But do you agree that this is not compatible with Keras 2:

        cache = ops.zeros(shape, dtype=self.compute_dtype)

because BloomCausalLM doesn't have the same dtype policy as the backbone layers; it follows the global policy. So instead we should initialize the cache with the compute_dtype of the backbone attention layers' policy, like this:

        attention_layer_dtype = self.backbone.transformer_layers[
            0
        ].compute_dtype
        cache = ops.zeros(shape, dtype=attention_layer_dtype)

We can't even do:

        cache = ops.zeros(shape, dtype=self.backbone.compute_dtype)

because the backbone also has a dtype policy different from its layers' (when we pass a dtype arg different from the global policy during initialization).

Take a look at commit 77c576b: it shows how we can set the dtype_policy of the backbone, and how to make the CausalLM follow the backbone's dtype_policy. I will revert this commit after checking the tests because it's not the right time for it now, but you can take a look. I can open a PR to apply that to all models if you think it would be useful.
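For illustration, a hedged sketch of that coupling idea (my reconstruction, not the actual code in 77c576b or in the eventual fix): the task reuses its backbone's dtype policy instead of the global one, so self.compute_dtype on the task matches what the backbone layers compute in.

    import keras

    class MyCausalLM(keras.Model):
        """Hypothetical task model that follows its backbone's dtype policy."""

        def __init__(self, backbone, **kwargs):
            # Reuse the backbone's policy (instead of the global policy) so
            # that self.compute_dtype matches what the backbone layers output,
            # e.g. when allocating the key/value cache.
            kwargs.setdefault("dtype", backbone.dtype_policy)
            super().__init__(**kwargs)
            self.backbone = backbone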

@abuelnasr0 (Contributor, Author) commented

Also, I forgot to mention: take a look at commit 7cc8671 and its error. That should clarify what I am talking about a little bit.

@mattdangerw (Member) commented

Not at a computer right now, but I think you are right. We want to set the task dtype policy to just follow the backbone's exactly, and set the backbone dtype accessors so they reflect the proper dtype. That's a bug we can probably fix in a separate PR.

@mattdangerw (Member) commented
Will land #1486 tomorrow, and pull this in with self.compute_dtype.

Looks great!

@mattdangerw (Member) commented
Ok #1486 is in. Shall I update this to self.compute_dtype or do you want to?

@abuelnasr0 (Contributor, Author) commented

@mattdangerw I updated to self.compute_dtype and the Keras 2 test passed. Thanks for the dtype fix.

@mattdangerw (Member) left a comment

Thanks! Version 3 of bloom_560m_multi is uploading now.

https://www.kaggle.com/models/keras/bloom/frameworks/keras/variations/bloom_560m_multi

Left a couple of comments re variant naming and cache dtype.


@mattdangerw added the kokoro:force-run (Runs Tests on GPU) label on Mar 6, 2024
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on Mar 6, 2024
@mattdangerw merged commit f332034 into keras-team:master on Mar 6, 2024
10 checks passed
@mattdangerw (Member) commented

Thank you so much! For the rest of the presets, would it help if we converted them ourselves? Is the conversion script all ready to go? Or have you already converted these on your Kaggle?

Also, a question on the preset map:

    "bloom_multi": "bigscience/bloom",
    # Multitask finetuned on xP3 (Crosslingual Public Pool of Prompts) https://huggingface.co/datasets/bigscience/xP3
    # xP3 is a mixture of 13 training tasks in 46 languages with English prompts
    "bloomz_560m_multi": "bigscience/bloomz-560m",
    "bloomz_1.1b_multi": "bigscience/bloomz-1b1",
    "bloomz_1.7b_multi": "bigscience/bloomz-1b7",
    "bloomz_3b_multi": "bigscience/bloomz-3b",
    "bloomz_7b_multi": "bigscience/bloomz-7b1",
    "bloomz_multi": "bigscience/bloomz",
    # Multitask finetuned on xP3mt
    # (Crosslingual Public Pool of Prompts machine-translated) https://huggingface.co/datasets/bigscience/xP3mt
    # xP3mt is Mixture of 13 training tasks in 46 languages with prompts in 20
    # languages (machine-translated from English)
    "bloomz_7b_mt": "bigscience/bloomz-7b1-mt",
    "bloomz_mt": "bigscience/bloomz-mt",
    # Multitask finetuned on P3 (Public Pool of Prompts) https://huggingface.co/datasets/Muennighoff/P3
    # P3 is a mixture of 8 training tasks with English-only prompts
    "bloomz_7b_p3": "bigscience/bloomz-7b1-p3",
    "bloomz_p3": "bigscience/bloomz-p3",

What are the sizes of bloom_multi and bloomz_multi? We might still want to include the size even if the original naming does not. I think it would help usability.
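For context, a hedged usage sketch assuming one of the BLOOMZ presets above ends up published under the name in the map (naming and sizes were still under discussion at this point):

    import keras_nlp

    # Assumption: "bloomz_560m_multi" is the final published preset name.
    bloomz_lm = keras_nlp.models.BloomCausalLM.from_preset("bloomz_560m_multi")
    print(bloomz_lm.generate("Translate to French: I love coffee.", max_length=40))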

@abuelnasr0 (Contributor, Author) commented

@mattdangerw Thanks for your reviews and help.
All the bloom and bloomz presets from 560m to 3b (8 presets) are here: https://www.kaggle.com/models/mohamedabuelnasr/bloom
You can copy them, or you can also run the conversion script if you wish.

I have also added a validate_only flag; you can use it after copying the models into Keras from the link above.

bloom is a 176B-parameter model: BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
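For illustration, a hedged sketch of how such a validate_only flag might be wired into an absl-flags conversion script (hypothetical names and structure, not the exact script in this repo):

    from absl import app, flags

    FLAGS = flags.FLAGS
    flags.DEFINE_string("preset", "bloom_560m_multi", "Preset to convert or validate.")
    flags.DEFINE_bool(
        "validate_only",
        False,
        "Skip conversion; only compare an already-converted preset against the "
        "Hugging Face checkpoint.",
    )

    def convert_checkpoints(preset):
        # Placeholder for the weight-porting logic (hypothetical).
        print(f"Converting {preset}...")

    def validate_output(preset):
        # Placeholder for the numerics check against Hugging Face (hypothetical).
        print(f"Validating {preset}...")

    def main(_):
        if not FLAGS.validate_only:
            convert_checkpoints(FLAGS.preset)
        validate_output(FLAGS.preset)

    if __name__ == "__main__":
        app.run(main)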

@abuelnasr0 deleted the bloom_causal_lm branch on March 6, 2024, 21:10

@abuelnasr0 (Contributor, Author) commented

@mattdangerw Also, max_sequence_length should be deleted from the model card in Keras.

@mattdangerw (Member) commented Mar 7, 2024

Thanks! Will copy them over tomorrow. Let's rename the largest presets to include the size in the name, though.

abuelnasr0 added a commit to abuelnasr0/keras-nlp that referenced this pull request Apr 2, 2024
* Initial commit for BloomCausalLm

* Avoid adding a start token into token ids

* Revert "Avoid adding a start token into token ids"

This reverts commit 57ce4c2.

* Tie embedding weights

* Export BloomCausalLM

* Add bloomz to the preset map

* Fix some presets names

* Revert "Fix some presets names"

This reverts commit 1b5949f.

* Add doc Example

* format the code

* Alibi bias small fixes

* Add tests

* Make float16 dtype test keras3 only

* Update model version

* Edit examples

* Remove max_sequence_length argument

* Try fix keras2 error

* Update hf_model download fun

* Make 1b models easier to copy

* Optimize conversion script

* Save checkpoints in float16

* Add test for mixed_float16

* Try to reproduce the keras2 Error

* Revert "Try to reproduce the keras2 Error"

This reverts commit 7cc8671.

* Revert "Make 1b models easier to copy"

This reverts commit 4d701d3.

* Show how to couple dtype_policy between backbone, causalLM, and backbone layers when the dtype arg is passed to the backbone

* Revert "Show How to couple dtype_policy between backbone, causalLM, and backbone layers in dtype arg is based to bacckbone"

This reverts commit 77c576b.

* Add validate_only flag to conversion script

* Change preset version

* set cache dtype to self.compute_dtype

* Minor fix