
Discussion: how to apply this experiment to the llama2 70B model? #11

Open
ghost opened this issue Sep 1, 2023 · 6 comments

Comments

@ghost

ghost commented Sep 1, 2023

I am curious: what would be required to apply this method to the 70B-parameter version of the Llama 2 model?
On Reddit, I noticed you mentioned: "For training, these models barely fit in 128 80GB A100s using DeepSpeed and FA2"
Would the computer at OSC be enough? https://www.osc.edu/resources/technical_support/supercomputers/ascend
It has only 96 80GB A100 GPUs: is that enough to contribute to the SoTA (state of the art)?

@bloc97
Collaborator

bloc97 commented Sep 4, 2023

8x80GB GPUs would be enough for 7B models; however, I do not know whether 70B would fit on the 4xA100 nodes... Pinging @jquesnelle and @conceptofmind

It all depends on how much effort we can put into writing the distributed training code (and how long we are willing to wait).

@conceptofmind

conceptofmind commented Sep 4, 2023

> 8x80GB GPUs would be enough for 7B models; however, I do not know whether 70B would fit on the 4xA100 nodes... Pinging @jquesnelle and @conceptofmind
>
> It all depends on how much effort we can put into writing the distributed training code (and how long we are willing to wait).

It can be done through proper parallelization. We were limited in what we could use on the Stability AI cluster due to both potential intellectual-property constraints and a lack of compute. If those are adequately addressed through other sponsors, then we should be able to build a 70B model at longer context lengths (8k-128k) without any issues.

I am currently in contact with LAION and Together. We should seek every grant available.

@ghost
Author

ghost commented Sep 4, 2023

Any plans to implement YaRN in llama.cpp? I need to show a proof of concept to a potential PI for smaller models.

@cebtenzzre
Contributor

> Any plans to implement YaRN in llama.cpp?

It could be built off of ggerganov/llama.cpp#2268, which was based on the code in this repo, but it was written before the paper came out and I haven't had a chance to read it.

@bloc97
Collaborator

bloc97 commented Sep 4, 2023

> > Any plans to implement YaRN in llama.cpp?
>
> It could be built off of ggerganov/llama.cpp#2268, which was based on the code in this repo, but it was written before the paper came out and I haven't had a chance to read it.

YaRN is essentially NTK-by-parts as implemented in your PR, but without the "gamma" factors (and thus no more base change), plus an additional self.mscale factor that you multiply the RoPE embeddings by:

```python
self.register_buffer("cos_cached", (emb.cos() * self.mscale)[None, None, :, :].to(dtype), persistent=False)
self.register_buffer("sin_cached", (emb.sin() * self.mscale)[None, None, :, :].to(dtype), persistent=False)
```

https://github.com/jquesnelle/yarn/blob/master/scaled_rope/LlamaYaRNScaledRotaryEmbedding.py

We've intentionally made YaRN as simple as possible to implement, by ablating everything that had a negligible effect after the fine-tune.
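For context, here is a minimal self-contained sketch of what such a YaRN-style rotary embedding can look like end to end: NTK-by-parts interpolation of the RoPE frequencies, plus the mscale factor multiplied into the cached cos/sin tables as above. The `beta_fast`/`beta_slow` ramp bounds, the dimension-cutoff formula, and the `0.1 * ln(s) + 1` mscale expression follow the YaRN paper's description, but the class name, argument names, and the 2D buffer shape are illustrative assumptions, not the repo's exact code:

```python
import math
import torch

def yarn_mscale(scale: float) -> float:
    # Attention-temperature correction from the YaRN paper; identity when
    # the context is not extended (scale <= 1).
    if scale <= 1.0:
        return 1.0
    return 0.1 * math.log(scale) + 1.0

class YaRNRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_len=2048, base=10000.0, scale=1.0,
                 beta_fast=32.0, beta_slow=1.0, original_max_len=2048):
        super().__init__()
        pos_freqs = base ** (torch.arange(0, dim, 2).float() / dim)
        inv_freq_extra = 1.0 / pos_freqs            # extrapolation (original freqs)
        inv_freq_inter = 1.0 / (scale * pos_freqs)  # position interpolation

        # "NTK-by-parts": blend interpolation and extrapolation per dimension,
        # based on how many rotations each dimension completes over the
        # original context window.
        low = self._find_dim(beta_fast, dim, base, original_max_len)
        high = self._find_dim(beta_slow, dim, base, original_max_len)
        x = (torch.arange(dim // 2).float() - low) / max(high - low, 1e-3)
        mask = x.clamp(0.0, 1.0)  # 0 -> extrapolate (high freq), 1 -> interpolate
        inv_freq = inv_freq_extra * (1 - mask) + inv_freq_inter * mask

        t = torch.arange(max_len).float()
        freqs = torch.outer(t, inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        mscale = yarn_mscale(scale)
        # Same idea as the snippet above, cached here as 2D (seq_len, dim).
        self.register_buffer("cos_cached", emb.cos() * mscale, persistent=False)
        self.register_buffer("sin_cached", emb.sin() * mscale, persistent=False)

    @staticmethod
    def _find_dim(num_rotations, dim, base, max_len):
        # Dimension index at which a RoPE channel completes `num_rotations`
        # full rotations over `max_len` positions.
        return (dim * math.log(max_len / (num_rotations * 2 * math.pi))
                / (2 * math.log(base)))
```

With `scale=1.0` this reduces to plain RoPE (the two frequency sets coincide and mscale is 1), which makes the extension easy to sanity-check against a baseline implementation.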

@ghost
Author

ghost commented Sep 9, 2023

It's not easy finding a PI. Any good ideas for putting together a PowerPoint?
