Train baseline models for evaluation #42
Comments
Have you tried using the EleutherAI eval harness? It should give a good picture of how well the model performs and could be used as an indicator.
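For reference, a rough sketch of calling the harness from Python; the exact entry points, model-type names, and task names vary between harness versions, and the checkpoint path and task list are placeholders:

```python
# Sketch only: assumes a recent lm-evaluation-harness release; the checkpoint
# path and task choices below are placeholders, not the project's actual setup.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                                    # "hf-causal" on older harness versions
    model_args="pretrained=path/to/our-checkpoint",
    tasks=["lambada_openai", "hellaswag"],         # pick whichever tasks we agree on
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])                          # per-task metrics
```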
I didn't understand the part about the 1000 training examples. Our datasets are much bigger than that!
Didn't we just train our models on only 1000 examples? Or did I misunderstand that?
We definitely should try the EleutherAI eval harness, but just checking validation loss will tell us something too: regular finetuning vs. expert finetuning + merge.
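In case it helps, here is a minimal sketch of that validation-loss comparison, assuming Hugging Face checkpoints; the model paths and the validation texts are placeholders rather than the project's actual code:

```python
# Sketch: compare mean validation loss of two checkpoints
# (regular finetuning vs. merged experts). Paths are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def validation_loss(model_path, texts, device="cuda", max_length=1024):
    """Mean per-token cross-entropy of the model at `model_path` on raw text examples."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path).to(device).eval()
    total_loss, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True,
                            max_length=max_length).to(device)
            out = model(**enc, labels=enc["input_ids"])
            # out.loss is already a mean; weight it by the number of predicted
            # tokens (labels are shifted by one inside the model).
            n = enc["input_ids"].shape[1] - 1
            total_loss += out.loss.item() * n
            total_tokens += n
    return total_loss / total_tokens

# val_texts = [...]  # held-out validation examples from the 7 datasets
# print("regular finetune:", validation_loss("path/to/regular-finetune", val_texts))
# print("merged experts:  ", validation_loss("path/to/merged-experts", val_texts))
```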
We have an issue for Eval Harness in the backlog.
So I am told that the above 1000 examples should actually be 8K examples.
@ontocord for 2., do we want layer_9,10,11,12,13?
@jordiclive @ontocord
@jordiclive Any updates on this issue?
@mrcabbage972 I trained 1., a model (all layers), on the exact splits: https://wandb.ai/ontocord/jordi_testing/runs/hu8j9ta1?workspace=user-jordanclive (toggle the evaluation view). But I then thought we decided to automate the experiment again with more training data / less validation data, and maybe the same amount of final testing data (#47).
We need to eval the merged experts against a 1B Pythia model trained on everything together.
To keep it fair, we would need the exact same 8,000 random training examples for each of the 7 datasets we used in the other experiments. Then we merge the 6 experts with basic averaging (as sketched below) and run the same eval on the 7 datasets against that model.
This will give us a comparison of expert finetuning + merging vs. training everything together.
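A minimal sketch of that basic-averaging merge, assuming the six experts are saved as Hugging Face checkpoints; the paths below are hypothetical placeholders:

```python
# Sketch of the "basic averaging" merge: element-wise mean of the six expert
# checkpoints' weights. Checkpoint paths are placeholders, not actual run names.
import torch
from transformers import AutoModelForCausalLM

expert_paths = [f"experts/expert_{i}" for i in range(6)]  # placeholder paths

# Start from the first expert and accumulate the remaining ones into it.
merged = AutoModelForCausalLM.from_pretrained(expert_paths[0])
avg_state = {k: v.clone().float() for k, v in merged.state_dict().items()
             if v.is_floating_point()}

for path in expert_paths[1:]:
    state = AutoModelForCausalLM.from_pretrained(path).state_dict()
    for k in avg_state:
        avg_state[k] += state[k].float()

# Write the averaged weights back into the base model in place.
base_state = merged.state_dict()
for k, v in avg_state.items():
    base_state[k].copy_((v / len(expert_paths)).to(base_state[k].dtype))

merged.save_pretrained("merged-expert-average")
# Run the same 7-dataset eval on "merged-expert-average" and on the jointly
# trained 1B Pythia baseline to get the comparison above.
```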