Were there any trade-offs or considerations you made when deciding on the model's size? Or, what criteria did you use to select the specific number of layers, attention heads, embedding size, etc. in your model?
I reduced the head dim from 128 to 64. I believe this is a safe step because: 1. Falcon does the same (https://huggingface.co/tiiuae/falcon-7b). 2. If you look at the shapes within the GPT-3 family, GPT-3 2.7B used an even smaller head dim than GPT-3 XL.
We use GQA and set the number of query groups to 4.
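For readers less familiar with GQA, here is a minimal, hypothetical PyTorch sketch of attention with head_dim = 64 and 4 key/value groups, as described above. The hidden size and query-head count (2048 and 32) are illustrative assumptions, not values quoted in this thread.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAttention(nn.Module):
    """Grouped-query attention sketch: many query heads share a few K/V heads."""
    def __init__(self, hidden_size=2048, n_q_heads=32, n_kv_heads=4, head_dim=64):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q_heads, self.n_kv_heads, self.head_dim = n_q_heads, n_kv_heads, head_dim
        self.q_proj = nn.Linear(hidden_size, n_q_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, n_kv_heads * head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, n_kv_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(n_q_heads * head_dim, hidden_size, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_q_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Expand K/V so each group of 32 / 4 = 8 query heads shares one K/V head.
        k = k.repeat_interleave(self.n_q_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_q_heads // self.n_kv_heads, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```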
To fit the model with a 16K per-GPU batch size, we reduced the number of layers from 24 to 22. A larger batch size leads to higher MFU/throughput, and at the end of the day your x-axis is the money you spent training the model and your y-axis is the model's performance/the impact you made.
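Putting those choices together, the resulting model shape looks roughly like the sketch below. The layer count, KV-group count, and head dim are the values stated above; the hidden size and query-head count are assumptions for illustration.

```python
# Rough decoder shape implied by the answers above.
# num_hidden_layers, num_key_value_heads, head_dim: stated in this thread.
# hidden_size, num_attention_heads: assumed for illustration (2048 / 64 = 32 heads).
model_shape = dict(
    num_hidden_layers=22,     # reduced from 24 to fit a 16K per-GPU batch size
    head_dim=64,              # reduced from 128
    num_key_value_heads=4,    # GQA with 4 query groups
    num_attention_heads=32,   # assumed
    hidden_size=2048,         # assumed
)
```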