Skip to content

Conversation

@Muennighoff
Copy link
Contributor

@Muennighoff Muennighoff commented Aug 3, 2024

What does this PR do?

Before submitting

Who can review?

The model will be released in ~1 week - can we already review this so that we can merge right upon release?

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! Feel free to ping me for a review once ready!

@Muennighoff
Copy link
Contributor Author

Hey! Feel free to ping me for a review once ready!

It's ready! :) I will update the README & double-check the slow tests once the model is released if that's fine!

@Muennighoff
Copy link
Contributor Author

@ArthurZucker would be great if we could get it reviewed soon 😇

@Muennighoff
Copy link
Contributor Author

We'll release the model on Tuesday, would be amazing if we could have this approved by then!

@ArthurZucker
Copy link
Collaborator

Oups sorry reviewing now

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review!
Mostly missing copied from and should be good to go otherwise.
Could go the extra mile to have compile compatible verseion of the MOE blcok!

@@ -0,0 +1,281 @@
# Licensed under the Apache License, Version 2.0 (the "License");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

date is missing here!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that mandatory? Maybe the entire header can just be removed? Seems redundant to have this header for every file when the license is clear from the repo..

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😄 yeah it is redundant but the licence AFAIK but it's a nit don't worry

return final_hidden_states, router_logits


class OlmoeDecoderLayer(nn.Module):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sam ehere

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different from others cuz we have no shared expert

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker
Copy link
Collaborator

COuld you make sure you rebased and CIs are green? Will review again after that!

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM let's rebase and fix the CIs 🤗 (there is also a todo in the .md as well)

@Muennighoff
Copy link
Contributor Author

Great, all passing! 🙌

@ArthurZucker ArthurZucker merged commit ecd61c6 into huggingface:main Sep 3, 2024
@ArthurZucker
Copy link
Collaborator

Thanks for your contribution! 🔥

@kalomaze
Copy link

kalomaze commented Sep 5, 2024

Is it a problem on my end if I notice that a higher native batch size during training results in higher losses? (With DeepSpeed)

image

Purple line is higher native bs FFT, orange line is gradient accumulation instead of a higher native batch size.

I'm guessing expert routing parallelism is maybe not properly handled for bs>1?

@ArthurZucker
Copy link
Collaborator

cc @Muennighoff would have a lot more clues than me 😉

@Muennighoff
Copy link
Contributor Author

Is it a problem on my end if I notice that a higher native batch size during training results in higher losses? (With DeepSpeed)

image

Purple line is higher native bs FFT, orange line is gradient accumulation instead of a higher native batch size.

I'm guessing expert routing parallelism is maybe not properly handled for bs>1?

Hm not sure about this - are other models the same for you in both scenarios? How about other MoEs?

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* Add OLMoE

* Add OLMoE

* Updates

* Make norm optional; add keys

* Add output

* Add

* Fix dtype

* Fix eos config

* Update

* Add OLMoE

* Fix OLMoE path

* Format

* Format

* Rmv copy statement

* Rmv copy statement

* Format

* Add copies

* Cp rotary

* Fix aming

* Fix naming

* Update RoPE integration; num_logits_to_keep; Add copy statements

* Add eps to config

* Format

* Add aux loss

* Adapt router_aux_loss_coef

* Update md

* Adapt

* adapt tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants