
MeZo Forward Pass Implementation #601

Closed
thistleknot opened this issue Jun 20, 2023 · 19 comments

Comments

@thistleknot

Feature request

https://github.com/princeton-nlp/MeZO/blob/main/large_models/trainer.py

Motivation

Memory efficiency while training

Your contribution

Willing to test, train, and write up bug reports.
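
For readers who have not opened the linked trainer, the core of MeZO is a two-forward-pass SPSA-style update: perturb the weights with random noise, measure the loss, perturb in the opposite direction, measure again, then step along the noise direction scaled by the loss difference. Below is a minimal, hedged sketch of that idea in plain PyTorch; it is not the authors' code, `loss_fn` is a hypothetical closure that runs an ordinary forward pass and returns the loss, and the hyperparameters are placeholders.

```python
import torch


def zo_perturb(params, seed, eps, scale):
    """Add scale * eps * z to each parameter, regenerating z from `seed`."""
    gen = torch.Generator().manual_seed(seed)  # CPU generator keeps z reproducible
    for p in params:
        z = torch.randn(p.shape, generator=gen).to(device=p.device, dtype=p.dtype)
        p.data.add_(z, alpha=scale * eps)


@torch.no_grad()
def mezo_step(model, loss_fn, batch, eps=1e-3, lr=1e-6):
    """One MeZO-style update: two forward passes, no backward pass, no stored activations."""
    params = [p for p in model.parameters() if p.requires_grad]
    seed = torch.randint(0, 2**31 - 1, (1,)).item()  # remember z via its seed

    zo_perturb(params, seed, eps, +1)            # theta + eps * z
    loss_plus = loss_fn(model, batch).item()
    zo_perturb(params, seed, eps, -2)            # theta - eps * z
    loss_minus = loss_fn(model, batch).item()
    zo_perturb(params, seed, eps, +1)            # back to the original theta

    projected_grad = (loss_plus - loss_minus) / (2 * eps)

    # SGD-style update along z, regenerating the same z from the stored seed
    gen = torch.Generator().manual_seed(seed)
    for p in params:
        z = torch.randn(p.shape, generator=gen).to(device=p.device, dtype=p.dtype)
        p.data.add_(z, alpha=-lr * projected_grad)
    return (loss_plus + loss_minus) / 2
```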

@younesbelkada
Contributor

This implementation is very interesting IMO and fits the goals of the PEFT library. Does anyone want to give it a try?
cc @pacman100 @sayakpaul
It would be the first custom Trainer ever in PEFT, and the attached code seems quite clean. Happy to integrate it otherwise.
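
To make the "custom Trainer" idea concrete, here is a rough sketch of how a MeZO-style training_step could be wired into transformers.Trainer: replace backprop with the two forward passes and hand the resulting estimate to the stock optimizer through .grad. This is not an existing PEFT or transformers class, the class name is hypothetical, and Trainer method signatures vary between versions, so treat it purely as an illustration.

```python
import torch
from transformers import Trainer


class MeZOTrainer(Trainer):
    """Hypothetical Trainer subclass: zeroth-order updates, no backward pass."""

    zo_eps = 1e-3  # perturbation scale (placeholder value)

    def _perturb(self, params, seed, scale):
        # Regenerate the same z from `seed` and add scale * eps * z in place.
        gen = torch.Generator().manual_seed(seed)
        for p in params:
            z = torch.randn(p.shape, generator=gen).to(device=p.device, dtype=p.dtype)
            p.data.add_(z, alpha=scale * self.zo_eps)

    def training_step(self, model, inputs, num_items_in_batch=None):
        model.eval()  # keep dropout off so both forward passes see the same network
        inputs = self._prepare_inputs(inputs)
        params = [p for p in model.parameters() if p.requires_grad]
        seed = torch.randint(0, 2**31 - 1, (1,)).item()

        with torch.no_grad():
            self._perturb(params, seed, +1)                 # theta + eps * z
            loss_plus = self.compute_loss(model, inputs).item()
            self._perturb(params, seed, -2)                 # theta - eps * z
            loss_minus = self.compute_loss(model, inputs).item()
            self._perturb(params, seed, +1)                 # restore theta

            projected_grad = (loss_plus - loss_minus) / (2 * self.zo_eps)

            # Hand the estimate to the optimizer (e.g. plain SGD) via .grad;
            # Trainer's own loop then calls optimizer.step() as usual.
            gen = torch.Generator().manual_seed(seed)
            for p in params:
                z = torch.randn(p.shape, generator=gen).to(device=p.device, dtype=p.dtype)
                p.grad = projected_grad * z

        return torch.tensor((loss_plus + loss_minus) / 2)
```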

@sayakpaul
Member

I would say it's better as a third-party showcase. I don't think it makes sense to add as a core feature of the library yet. Maybe, based on the adoption and community response, we could revisit this.

@Bearnardd

Hi @younesbelkada @sayakpaul! I am also interested in this change. From my perspective there are two ways to implement it: 1) Add a showcase example of how to combine PEFT with their trainer, e.g. in the examples directory. One problem is that, if I am not mistaken, you cannot simply install their code via pip and use it as a package; you are forced to download their repo and use the CLI, which complicates things a bit. 2) Add a custom trainer, as @younesbelkada mentioned.

@pacman100
Contributor

Hello everyone, I read the paper today. I concur with Sayak that, before investing effort into a custom Trainer for it, we need to gauge community interest and the performance of MeZO.

I like approach 1 suggested by @Bearnardd as a starting point.

@Bearnardd

Bearnardd commented Jun 21, 2023

I completely agree with your thoughts, @pacman100. Nowadays, numerous new ideas and papers emerge daily, making it impractical to dedicate time to integrating each one without first assessing community interest and the actual performance of a particular solution. A simple example is a pretty cheap alternative :)

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@ghost

ghost commented Aug 2, 2023

Aw, please implement MeZO, guys. This could allow near-lossless, compressed, fully fine-tuned adapters at true inference cost. Poor man's training would no longer carry an inferior connotation. Please @sayakpaul @pacman100 @Bearnardd ?

@thistleknot
Author

Yes please

@thistleknot
Author

thistleknot commented Aug 18, 2023

I got this to work with bnb: training a 3B model with 5GB of RAM using forward passes (effectively speeding up my training, since there is no need for backward passes). I think it's worthwhile to integrate this, especially since it works alongside bnb.
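
The exact configuration is not shown in the comment above, but a plausible way to reproduce this kind of setup (a sketch; the checkpoint name, target modules, and hyperparameters are illustrative) is to load the base model in 4-bit with bitsandbytes and attach LoRA adapters via PEFT, so the forward-pass-only updates only ever touch the small set of full-precision adapter weights:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization via bitsandbytes (values are illustrative)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_3b",  # placeholder 3B checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters are the only trainable (full-precision) weights
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.0,  # keep the two forward passes deterministic
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# A MeZO-style step only needs to perturb this small parameter set;
# the quantized base weights stay frozen throughout.
trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")
```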

@thistleknot
Author

From 90 it/s to 60 it/s with MeZO.

@ghost

ghost commented Aug 19, 2023

Nice! @thistleknot
When you say "bnb", do you mean 4-bit?
When you say "from 90 it/s to 60 it/s with MeZO", is 90 referring to single forward-pass speed without MeZO?

@thistleknot
Author

thistleknot commented Aug 19, 2023 via email

@ghost

ghost commented Aug 19, 2023

@thistleknot would the fine-tuned model export as NF4? Would MeZO's predictions be made in NF4 too? How does that work?

@thistleknot
Author

Yes and yes. The MeZO function is essentially a forward-pass-only compute_loss function (I'm paraphrasing), and it's plug-and-play with the LoRA setup; I don't see any integration issues. I was trying to get it to work with GPTQ but couldn't get a working setup, and from what I'm reading, even if I had, it would still be limited to a LoRA adapter.

So the best you can get with this without PEFT/LoRA is simply faster training.

Shame ReLoRA isn't extended to anything other than LLaMA.
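
To make the "it's just a forward-pass loss function" point concrete, the closure that a zeroth-order step evaluates can be as small as the model's own causal-LM loss. A minimal sketch, assuming `batch` already contains input_ids, attention_mask, and labels:

```python
import torch


@torch.no_grad()
def zo_loss(model, batch):
    # A plain forward pass; passing labels makes the model return its own LM loss.
    # No autograd graph is kept and no backward pass ever runs, which is why this
    # composes cleanly with a quantized base model plus LoRA adapters.
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["labels"],
    )
    return outputs.loss
```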

@ghost

ghost commented Aug 20, 2023

Great work man @thistleknot

Shame indeed, ReLoRA has so much potential for distributing dense weights.

Just wondering, do you think MeZO could do a full fine-tune in FP8? I'm not even sure the HF format allows FP8 safetensors. If so, half the memory with little to no loss would be nice. I've already got Flash Attention 2 working :) So close to full fine-tuning larger models on consumer hardware.

@thistleknot
Author

thistleknot commented Aug 20, 2023 via email

@ghost

ghost commented Aug 20, 2023

I think it would be quite tedious to change and save GPTQ models on the fly.
You're right, it does seem like we can only train layers on top of frozen quants with LoRA, or bit by bit, for now. I see your pain @thistleknot and relate too. Respect that you're still trying haha, but all we need is a member of the team to take a glance.

@ghost

ghost commented Aug 20, 2023

@thistleknot

bigscience-workshop/petals#273

I was reminded of this PR that managed to save 8-bit models :) and learned that PyTorch is working on e5m2 and e4m3 FP8 implementations, which should be better suited to transformer models.

@thistleknot
Author

Got to wait for it to be merged. For now, I'm using a fancy layering setup. I was thinking I would make LoRA optional (controlled by a parameter) once I figure out the 'best' solution (I would prefer to load quantized weights, use MeZO, and save quantized weights... but, as you know, that isn't fully 'implemented' yet). In the meantime, for the sake of my sanity and testing, I'm simply using one document + one use case (SQuAD) with LoRA (16GB P5200 in an m7730, best $600 I ever spent).

Then maybe one day a magically better method will land, like QLoRA or saving GPTQ models.
