
Discussion: how to apply this experiment to the llama2 70B model? #11

Open
ghost opened this issue Sep 1, 2023 · 6 comments

Comments

@ghost

ghost commented Sep 1, 2023

I am curious: what would be required to apply this method to the 70B-parameter version of the Llama 2 model?
On Reddit, I noticed you mentioned: "For training, these models barely fit in 128 80GB A100s using DeepSpeed and FA2"
Would the computer at OSC be enough? https://www.osc.edu/resources/technical_support/supercomputers/ascend
It has only 96 80GB A100 GPUs: is that enough to contribute to the SoTA (state of the art)?

@bloc97
Collaborator

bloc97 commented Sep 4, 2023

8x80GB GPUs would be enough for 7B models; however, I do not know whether 70B would fit on the 4xA100 nodes... Pinging @jquesnelle and @conceptofmind

It all depends on how much effort we can put into writing the distributed training code (and how long we are willing to wait).

@conceptofmind

conceptofmind commented Sep 4, 2023

> 8x80GB GPUs would be enough for 7B models; however, I do not know whether 70B would fit on the 4xA100 nodes... Pinging @jquesnelle and @conceptofmind
>
> It all depends on how much effort we can put into writing the distributed training code (and how long we are willing to wait).

It can be done through proper parallelization. We were limited in what we could use on the Stability AI cluster due to both potential intellectual-property constraints and a lack of compute. If those are adequately addressed through other sponsors, then we should be able to build a 70B model at longer context lengths (8k-128k) without any issues.

I am currently in contact with LAION and Together. We should seek every grant available.

@ghost
Author

ghost commented Sep 4, 2023

Any plans to implement YaRN in llama.cpp? I need to show a proof of concept to a potential PI for smaller models.

@cebtenzzre
Contributor

> Any plans to implement YaRN in llama.cpp?

It could be built off of ggerganov/llama.cpp#2268, which was based on the code in this repo, but it was written before the paper came out and I haven't had a chance to read it.

@bloc97
Collaborator

bloc97 commented Sep 4, 2023

> > Any plans to implement YaRN in llama.cpp?
>
> It could be built off of ggerganov/llama.cpp#2268, which was based on the code in this repo, but it was written before the paper came out and I haven't had a chance to read it.

YaRN is essentially NTK-by-parts as implemented in your PR, but without the "gamma" factors (and thus no more base change), plus an additional self.mscale factor that you multiply the RoPE embeddings by:

```python
self.register_buffer("cos_cached", (emb.cos() * self.mscale)[None, None, :, :].to(dtype), persistent=False)
self.register_buffer("sin_cached", (emb.sin() * self.mscale)[None, None, :, :].to(dtype), persistent=False)
```

https://github.com/jquesnelle/yarn/blob/master/scaled_rope/LlamaYaRNScaledRotaryEmbedding.py

We've intentionally made YaRN as simple as possible to implement, by ablating everything that had a negligible effect after the fine-tune.
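For context, here is a minimal self-contained sketch of what such a YaRN-style rotary embedding can look like end to end: NTK-by-parts interpolation of the RoPE frequencies, plus the mscale factor multiplied into the cached cos/sin tables as above. The `beta_fast`/`beta_slow` ramp bounds, the dimension-cutoff formula, and the `0.1 * ln(s) + 1` mscale expression follow the YaRN paper's description, but the class name, argument names, and the 2D buffer shape are illustrative assumptions, not the repo's exact code:

```python
import math
import torch

def yarn_mscale(scale: float) -> float:
    # Attention-temperature correction from the YaRN paper; identity when
    # the context is not extended (scale <= 1).
    if scale <= 1.0:
        return 1.0
    return 0.1 * math.log(scale) + 1.0

class YaRNRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_len=2048, base=10000.0, scale=1.0,
                 beta_fast=32.0, beta_slow=1.0, original_max_len=2048):
        super().__init__()
        pos_freqs = base ** (torch.arange(0, dim, 2).float() / dim)
        inv_freq_extra = 1.0 / pos_freqs            # extrapolation (original freqs)
        inv_freq_inter = 1.0 / (scale * pos_freqs)  # position interpolation

        # "NTK-by-parts": blend interpolation and extrapolation per dimension,
        # based on how many rotations each dimension completes over the
        # original context window.
        low = self._find_dim(beta_fast, dim, base, original_max_len)
        high = self._find_dim(beta_slow, dim, base, original_max_len)
        x = (torch.arange(dim // 2).float() - low) / max(high - low, 1e-3)
        mask = x.clamp(0.0, 1.0)  # 0 -> extrapolate (high freq), 1 -> interpolate
        inv_freq = inv_freq_extra * (1 - mask) + inv_freq_inter * mask

        t = torch.arange(max_len).float()
        freqs = torch.outer(t, inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        mscale = yarn_mscale(scale)
        # Same idea as the snippet above, cached here as 2D (seq_len, dim).
        self.register_buffer("cos_cached", emb.cos() * mscale, persistent=False)
        self.register_buffer("sin_cached", emb.sin() * mscale, persistent=False)

    @staticmethod
    def _find_dim(num_rotations, dim, base, max_len):
        # Dimension index at which a RoPE channel completes `num_rotations`
        # full rotations over `max_len` positions.
        return (dim * math.log(max_len / (num_rotations * 2 * math.pi))
                / (2 * math.log(base)))
```

With `scale=1.0` this reduces to plain RoPE (the two frequency sets coincide and mscale is 1), which makes the extension easy to sanity-check against a baseline implementation.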

@ghost
Author

ghost commented Sep 9, 2023

It's not easy finding a PI. Any good ideas for putting together a PowerPoint?
