Add BloomCausalLM #1467
Conversation
Looks great! Thanks so much for the thorough testing on this. "canary generations" like you did are a great way to check. Just one comment.
head_dim = self.backbone.hidden_dim // num_heads
shape = [batch_size, num_layers, 2, max_length, num_heads, head_dim]
attention_layer_dtype = self.backbone.transformer_layers[0].dtype
cache = ops.zeros(shape, dtype=attention_layer_dtype)
I added these two lines because of a Keras 2 error. Setting the cache dtype to the same dtype as the variable_dtype of the attention layer prevents TensorFlow from raising the following error while updating the cache:
TypeError: Input 'update' of 'XlaDynamicUpdateSlice' Op has type bfloat16 that does not match type float32 of argument 'input'.
This error is not raised in Keras 3, because Keras 3 casts all inputs to the layer's variable_dtype.
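The error above comes from XLA's dynamic update slice refusing to cast the update to the operand's dtype. A minimal sketch of that strictness, using numpy and a hypothetical `dynamic_update_slice` helper (not the real XLA op), shows why the cache must be allocated in the same dtype the layers emit:

```python
import numpy as np

def dynamic_update_slice(operand, update, start):
    # Mimic XlaDynamicUpdateSlice's strictness: the update must have
    # exactly the operand's dtype, with no implicit casting.
    if update.dtype != operand.dtype:
        raise TypeError(
            f"update dtype {update.dtype} does not match operand dtype {operand.dtype}"
        )
    out = operand.copy()
    idx = tuple(slice(s, s + n) for s, n in zip(start, update.shape))
    out[idx] = update
    return out

cache = np.zeros((2, 4), dtype="float32")   # cache allocated in variable dtype
update = np.ones((2, 1), dtype="float16")   # layer output in compute dtype

try:
    dynamic_update_slice(cache, update, (0, 0))
except TypeError as err:
    print("mismatch:", err)

# Allocating the cache in the layer's compute dtype avoids the mismatch.
cache16 = np.zeros((2, 4), dtype="float16")
updated = dynamic_update_slice(cache16, update, (0, 0))
```

This is only an illustration of the dtype rule; the actual fix in the PR is choosing the right dtype for `ops.zeros` when building the cache.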
I actually think we want this to be compute_dtype, not variable_dtype, for mixed precision cases. In that case, the variable dtype might be float32, but the output of layers will be bfloat16 or float16. This cache holds the outputs of our key/value projections, so the compute dtype is the correct thing to cache here.
Do you run into errors if you make that change?
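The mixed-precision situation described above can be sketched with plain numpy (variable and compute dtypes are just labels here, not the Keras dtype-policy objects): weights are stored in float32, but the matmul runs in float16, so its output, and therefore the key/value cache, should be float16.

```python
import numpy as np

# Hedged sketch: variables kept in full precision, compute in half precision.
variable_dtype, compute_dtype = "float32", "float16"

kernel = np.full((8, 8), 0.5, dtype=variable_dtype)  # stored weights: float32
x = np.ones((2, 8), dtype=compute_dtype)             # activations: float16

# Mixed precision casts the weights down for the computation, so the
# projection output, and any cache holding it, is in the compute dtype.
y = x @ kernel.astype(compute_dtype)
print(y.dtype)  # float16
```

A cache allocated in float32 here would not match `y.dtype`, which is exactly the mismatch the XLA error complains about.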
Thank you for that! You're right, the cache dtype should be the same as the dtype of the einsum layer output. That was actually my intention, but I thought the output of the layer would be the variable_dtype, not the compute_dtype.
But do you agree that this is not compatible with Keras 2:
cache = ops.zeros(shape, dtype=self.compute_dtype)
because BloomCausalLM doesn't have the same dtype policy as the backbone layers; it follows the global policy. So instead we should initialize the cache with the compute_dtype of the backbone attention layers' policy, like this:
attention_layer_dtype = self.backbone.transformer_layers[0].compute_dtype
cache = ops.zeros(shape, dtype=attention_layer_dtype)
We can't even do:
cache = ops.zeros(shape, dtype=self.backbone.compute_dtype)
because the backbone also has a dtype policy different from its layers' (when we pass a dtype argument different from the global policy during initialization).
Take a look at this commit:
77c576b
It shows how we can set the dtype_policy of the backbone, and how to make the CausalLM follow the backbone's dtype_policy. I will revert this commit after checking the tests, because it doesn't belong in this PR, but you can take a look. I can open a PR to apply that to all models if you think it would be useful.
Also, I forgot to mention: look at this commit and its error:
7cc8671
That should clarify what I am talking about.
Not at a computer right now, but I think you are right. We want to set the task dtype policy to follow the backbone's exactly, and set the backbone dtype accessors so they reflect the proper dtype. That's a bug we can probably fix in a separate PR.
Will land #1486 tomorrow, and pull this in with self.compute_dtype.
Looks great!
OK, #1486 is in. Shall I update this to self.compute_dtype, or do you want to?
@mattdangerw I updated to self.compute_dtype and the Keras 2 test passed. Thanks for the dtype fix.
Thanks! Version 3 of bloom_560m_multi is uploading now.
https://www.kaggle.com/models/keras/bloom/frameworks/keras/variations/bloom_560m_multi
Left a couple comments re variant naming, and cache dtype.
Force-pushed from 0953c5d to 72eec43.
Thank you so much! For the rest of the presets, would it help if we converted them ourselves? Is the conversion script ready to go? Or have you already converted these on your Kaggle? Also, a question on the preset map: what are the sizes of
@mattdangerw Thanks for your reviews and help. I have also added a validate_only flag; you can use it after copying the models into Keras from the link above. BLOOM is a 176B-parameter model: BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
@mattdangerw Also
Thanks! Will copy over tomorrow. Let's rename the largest presets to include size in the name, though.
* Initial commit for BloomCausalLm
* Avoid adding a start token into token ids
* Revert "Avoid adding a start token into token ids" (this reverts commit 57ce4c2)
* Tie embeddding weights
* Export BloomCausalLM
* Add bloomz to the preset map
* Fix some presets names
* Revert "Fix some presets names" (this reverts commit 1b5949f)
* Add doc Example
* format the code
* Alibi bias small fixes
* Add tests
* Maek float16 dtype test to keras3 only
* Update model version
* Edit exampples
* Remove max_sequnce_length argument
* Try fix keras2 error
* Update hf_model download fun
* Make 1b models easier to copy
* Optimize conversion script
* Save checkpoints in float16
* Add test for mixed_float16
* Try to reproduce the keras2 Error
* Revert "Try to reproduce the keras2 Error" (this reverts commit 7cc8671)
* Revert "Make 1b models easier to copy" (this reverts commit 4d701d3)
* Show How to couple dtype_policy between backbone, causalLM, and backbone layers in dtype arg is based to bacckbone
* Revert "Show How to couple dtype_policy between backbone, causalLM, and backbone layers in dtype arg is based to bacckbone" (this reverts commit 77c576b)
* Add validate_only flag to conversion script
* Change preset version
* set cache dtype to self.compute_dtype
* Minor fix
This PR adds BloomCausalLM.
I have compared BloomCausalLM with the Hugging Face output, and both produce similar outputs:
* HF output with no start_token (screenshot)
* Keras output with no start_token (screenshot)
* Keras output with start_token, the default (screenshot)
This PR also fixes AlibiBias: when the Keras dtype is float16 in the CausalLM test, AlibiBias called the arange function with dtype=int16, which causes an error in TensorFlow (I think). This fix is the same as in the Add FalconBackbone PR (#1389).
BLOOMZ models are finetuned versions of the pretrained models; they are better and recommended for use by bigscience.
Reference: https://arxiv.org/abs/2211.01786
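The AlibiBias dtype point above can be sketched in numpy. This is a simplified ALiBi-style bias, not the keras-nlp implementation: the relevant detail is that the position range is built with a floating dtype (rather than a narrow integer dtype like int16) and then cast to the layer's compute dtype.

```python
import numpy as np

# Hedged sketch of an ALiBi-style attention bias.
num_heads, seq_len = 8, 128
compute_dtype = "float16"

# Per-head slopes: 2^(-8i/num_heads) for i = 1..num_heads (standard ALiBi form).
exponents = np.arange(1, num_heads + 1, dtype="float32")
slopes = np.power(2.0, -8.0 * exponents / num_heads)

# Float positions avoid int16 range/casting problems for long sequences.
positions = np.arange(seq_len, dtype="float32")
bias = (slopes[:, None] * -positions[None, :]).astype(compute_dtype)
print(bias.shape, bias.dtype)  # (8, 128) float16
```

The bias penalizes distant positions more strongly for heads with larger slopes, which is the usual ALiBi behavior; the dtype handling is the part relevant to the fix discussed in this PR.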