Can you offer a list of models trained on The Pile? #4

Open
nonstopfor opened this issue Sep 4, 2022 · 7 comments

@nonstopfor

Can you offer a list of models trained on The Pile, since querying these models is not allowed?

@nonstopfor
Author

As the readme said: "Querying other models trained on The Pile (other than the provided 1.3B GPT-Neo model) is not allowed. The reasoning for this is that larger models exhibit more memorization. Querying models trained on other datasets that do not significantly overlap with The Pile is allowed."

So I want to know the exact list of models trained on The Pile that are not allowed to query.

@nonstopfor
Author

I have no more questions on this topic, but I have another question after trying the baseline method. I found that the baseline can generate multiple very similar suffixes conditioned on one prefix; sometimes the differences are only meaningless tokens like spaces. So I want to know whether there could be multiple possible suffixes in the Pile dataset for a given prefix, since currently only one suffix is offered as the answer.
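
To make the observation concrete, here is a minimal sketch, assuming the provided EleutherAI/gpt-neo-1.3B checkpoint and the Hugging Face transformers API, that samples several continuations of one prefix and collapses whitespace-only differences. The prefix string, sampling settings, and the deduplication step are illustrative, not the official baseline:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")

prefix = "Lorem ipsum dolor sit amet, consectetur adipiscing"  # illustrative, not a benchmark example
inputs = tok(prefix, return_tensors="pt")

# Sample several candidate suffixes for the same prefix.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=40,
    max_new_tokens=50,
    num_return_sequences=8,
    pad_token_id=tok.eos_token_id,
)

prefix_len = inputs["input_ids"].shape[1]
suffixes = [tok.decode(seq[prefix_len:], skip_special_tokens=True) for seq in outputs]

# Many samples differ only in whitespace or other trivial tokens;
# normalizing whitespace before comparing collapses such near-duplicates.
unique = {" ".join(s.split()) for s in suffixes}
print(f"{len(suffixes)} samples, {len(unique)} distinct after whitespace normalization")
```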

@pluskid
Collaborator

pluskid commented Sep 8, 2022

@nonstopfor Regarding "whether there could be multiple possible suffixes in the Pile dataset for a given prefix": We constructed the dataset by selecting only examples that are "well specified", in the sense that given the prefix, there is only one continuation such that the entire sequence is contained in the training dataset. @carlini It seems we did not include this detail in the README. Shall we add it?
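
For concreteness, a hedged sketch of the uniqueness check described above; the function name, the prefix length of 50 tokens, and the input representation are assumptions for illustration, not the actual construction code:

```python
from collections import defaultdict

def well_specified_examples(sequences, prefix_len=50):
    """Keep only (prefix, suffix) pairs whose prefix has exactly one
    continuation among the given training sequences (lists of token IDs)."""
    continuations = defaultdict(set)
    for seq in sequences:
        prefix = tuple(seq[:prefix_len])
        suffix = tuple(seq[prefix_len:])
        continuations[prefix].add(suffix)
    # "Well specified": given the prefix, only one continuation
    # appears anywhere in the training data.
    return [
        (prefix, next(iter(suffixes)))
        for prefix, suffixes in continuations.items()
        if len(suffixes) == 1
    ]
```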

@heitikei

heitikei commented Sep 8, 2022

May I ask to reframe the issue?

Why do you need a list of models trained on The Pile?
Do you need to extract data that is not physically connected to the same or a similar server hub, or not saved in a retrievable format?
This set will not include data behind paywalls, intellectual property, or data that is currently unavailable online.

Premise: all data on the internet forms sensible "data piles" (thank you, Linus and Swartz).

@carlini
Collaborator

carlini commented Sep 12, 2022

@pluskid Yes, we should put that into the README! Sorry, I don't know how we forgot to say that.

@carlini
Collaborator

carlini commented Sep 12, 2022

@nonstopfor We'll try to put together a list of models not trained on The Pile. But I can't claim to know every model that has been trained, so I'd feel uncomfortable forbidding only a certain set of models. If you'd like to use a model and you're not sure, I'd suggest looking at the model card (which is supposed to discuss training data) or the original paper. I understand that this can be messy, because what if some model trains on GitHub but not The Pile (for example)? If you have any questions about specific models you'd like to use that do overlap, just raise an issue to ask about it.
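
As a concrete starting point for the model-card suggestion, a minimal sketch that loads a candidate model's card from the Hugging Face Hub and scans its text for mentions of The Pile. The repo id is just an example candidate, and a missing mention does not prove a model is safe to query:

```python
from huggingface_hub import ModelCard

# Example candidate; substitute the model you want to check.
card = ModelCard.load("EleutherAI/gpt-neo-2.7B")

# Model cards are supposed to discuss training data, so a simple text
# scan is a reasonable first filter (but read the card and paper too).
if "pile" in card.text.lower():
    print("Model card mentions The Pile; querying this model is likely not allowed.")
else:
    print("No mention of The Pile found; verify against the original paper.")
```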

@nonstopfor
Author

nonstopfor commented Sep 13, 2022

Thanks for your reply. I think it would be nice for you to list some popular models trained on The Pile, which could help reduce the burden on participants. For less commonly used models, participants can raise an issue to ask about them.
