Train baseline models for evaluation #42

Open · huu4ontocord opened this issue May 4, 2023 · 10 comments

@huu4ontocord (Owner) commented May 4, 2023

We need to evaluate the merged experts against a 1B Pythia model trained all together, in two configurations:

  1. Trained with all layers on the 6 datasets we have.
  2. Trained with just the upper layers.

To keep it fair, we would need to use the exact same 8,000 random training examples from each of the 7 datasets we used in the other experiments. We would then merge the 6 experts with basic averaging and run the same eval from the 7 datasets on that model.
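
For concreteness, here is a minimal sketch of drawing a fixed 8,000-example sample per dataset with Hugging Face `datasets`; the dataset name and seed are placeholders, not the project's actual settings:

```python
# Sketch only: deterministic 8k-example sample per dataset.
# "dataset_name" and SEED are placeholders, not the values used in MDEL.
from datasets import load_dataset

SEED = 42
N_EXAMPLES = 8_000

def sample_train_split(dataset_name: str):
    ds = load_dataset(dataset_name, split="train")
    # Shuffling with a fixed seed and taking the first 8k rows gives every
    # experiment the exact same subset.
    return ds.shuffle(seed=SEED).select(range(N_EXAMPLES))
```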

This will give us a comparison of:

  1. training all layers on the same tokens and data
  2. training some layers on the same tokens and data
  3. merging the different experts trained with the same compute
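
A minimal sketch of the "basic averaging" merge, assuming the six expert checkpoints share the same architecture; the checkpoint paths are placeholders:

```python
# Sketch only: element-wise average of expert parameters ("basic averaging").
# Checkpoint paths are placeholders for the six domain experts.
import torch
from transformers import AutoModelForCausalLM

EXPERTS = ["experts/expert-0", "experts/expert-1"]  # placeholder paths, six in total

def merge_by_averaging(checkpoints):
    models = [AutoModelForCausalLM.from_pretrained(c) for c in checkpoints]
    merged = models[0]
    state = merged.state_dict()
    for name in state:
        # Stack the same parameter from every expert and take the mean.
        state[name] = torch.stack(
            [m.state_dict()[name].float() for m in models]
        ).mean(dim=0)
    merged.load_state_dict(state)
    return merged
```
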
@mrseeker (Collaborator) commented May 4, 2023

Have you tried the EleutherAI eval harness? It should give you a good picture of how well the model performs and could be used as an indicator.
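
For reference, a rough sketch of driving the harness from Python; the model path is a placeholder, and the exact API and task names differ between harness versions, so treat this as illustrative rather than the recommended invocation:

```python
# Sketch only: running a couple of harness tasks against a checkpoint.
# The pretrained path is a placeholder; the API may differ across versions.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=EleutherAI/pythia-1b",
    tasks=["lambada_openai", "hellaswag"],
    batch_size=8,
)
print(results["results"])
```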

@mrcabbage972 mrcabbage972 added enhancement New feature or request and removed enhancement New feature or request labels May 5, 2023
@mrcabbage972 mrcabbage972 changed the title train baseline model on 6 instruction set Full fine-tune of baseline model for evaluation May 5, 2023
@mrcabbage972 mrcabbage972 changed the title Full fine-tune of baseline model for evaluation Train baseline models for evaluation May 5, 2023

@mrcabbage972 (Collaborator) commented

I didn't understand the part about the 1000 training examples. Our datasets are much bigger than that!

@huu4ontocord (Owner, Author) commented

Didn't we just train our models on 1,000 examples only? Or did I misunderstand that?

@huu4ontocord (Owner, Author) commented

We definitely should try the EleutherAI eval harness, but just comparing validation loss will tell us something too: regular fine-tuning vs. expert fine-tuning + merge.
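
A rough sketch of that validation-loss comparison, assuming both checkpoints are Hugging Face causal LMs evaluated on the same held-out texts; the checkpoint name and device are placeholders:

```python
# Sketch only: mean validation loss / perplexity for one checkpoint, so the
# full fine-tune and the merged model can be compared on identical held-out text.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def validation_loss(checkpoint: str, texts, device: str = "cuda"):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device).eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            batch = tokenizer(text, return_tensors="pt", truncation=True).to(device)
            out = model(**batch, labels=batch["input_ids"])
            losses.append(out.loss.item())
    mean_loss = sum(losses) / len(losses)
    return mean_loss, math.exp(mean_loss)  # loss, perplexity
```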

@mrcabbage972 (Collaborator) commented

We have an issue for Eval Harness in the backlog.

@huu4ontocord (Owner, Author) commented

So I am told that:
It seems they were trained on 1k batches.
I think the batch size was 8 because of the number of GPUs.
So that gives us 1,000 batches × 8 = 8,000 samples.

So the above 1,000 examples should be 8K examples.

@jordiclive (Collaborator) commented

@ontocord For 2., do we want layers 9, 10, 11, 12, 13?

@mrcabbage972 (Collaborator) commented

@jordiclive @ontocord
We had used layers 9-13 when we trained the experts. See: https://github.com/ontocord/MDEL/blob/main/src/mdel/train.sh#L4
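
For item 2., a minimal sketch of restricting training to those layers in a Pythia (GPT-NeoX) model; the checkpoint name is a placeholder, and the indexing assumes positions in `model.gpt_neox.layers` match the layer numbers in train.sh:

```python
# Sketch only: freeze all parameters except transformer layers 9-13 so a
# standard fine-tuning loop only updates the upper layers.
from transformers import GPTNeoXForCausalLM

TRAINABLE_LAYERS = {9, 10, 11, 12, 13}

model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-1b")

for param in model.parameters():
    param.requires_grad = False            # freeze everything by default

for idx, layer in enumerate(model.gpt_neox.layers):
    if idx in TRAINABLE_LAYERS:
        for param in layer.parameters():
            param.requires_grad = True     # unfreeze only layers 9-13
```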

@mrcabbage972 (Collaborator) commented

@jordiclive Any updates on this issue?

@jordiclive (Collaborator) commented

@mrcabbage972 I trained 1., a model (all layers), on the exact splits: https://wandb.ai/ontocord/jordi_testing/runs/hu8j9ta1?workspace=user-jordanclive (you can see the results if you toggle the evaluation).

But I then thought we had decided to automate the experiment again with more training data / less validation data, and maybe the same amount of final testing data (#47).
