Can you offer a list of models trained on The Pile? #4

Open
nonstopfor opened this issue Sep 4, 2022 · 7 comments

@nonstopfor

Can you offer a list of models trained on The Pile, since querying these models is not allowed?

@nonstopfor
Author

As the readme said: "Querying other models trained on The Pile (other than the provided 1.3B GPT-Neo model) is not allowed. The reasoning for this is that larger models exhibit more memorization. Querying models trained on other datasets that do not significantly overlap with The Pile is allowed."

So I want to know the exact list of models trained on The Pile that are not allowed to query.

@nonstopfor
Author

I have no more questions on this topic, but I have another question after trying the baseline method. I found that the baseline can generate multiple very similar suffixes conditioned on one prefix; sometimes the differences are only meaningless tokens like spaces. So I want to know whether there could be multiple possible suffixes in the Pile dataset for a given prefix, since currently only one suffix is offered as the answer.
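
To make the observation concrete, here is a minimal sketch, assuming the provided EleutherAI/gpt-neo-1.3B checkpoint and the Hugging Face transformers API, that samples several continuations of one prefix and collapses whitespace-only differences. The prefix string, sampling settings, and the deduplication step are illustrative, not the official baseline:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")

prefix = "Lorem ipsum dolor sit amet, consectetur adipiscing"  # illustrative, not a benchmark example
inputs = tok(prefix, return_tensors="pt")

# Sample several candidate suffixes for the same prefix.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=40,
    max_new_tokens=50,
    num_return_sequences=8,
    pad_token_id=tok.eos_token_id,
)

prefix_len = inputs["input_ids"].shape[1]
suffixes = [tok.decode(seq[prefix_len:], skip_special_tokens=True) for seq in outputs]

# Many samples differ only in whitespace or other trivial tokens;
# normalizing whitespace before comparing collapses such near-duplicates.
unique = {" ".join(s.split()) for s in suffixes}
print(f"{len(suffixes)} samples, {len(unique)} distinct after whitespace normalization")
```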

@pluskid
Collaborator

pluskid commented Sep 8, 2022

@nonstopfor Regarding "whether there could be multiple possible suffixes in the Pile dataset for a given prefix": We constructed the dataset by selecting only examples that are "well specified", in the sense that given the prefix, there is only one continuation such that the entire sequence is contained in the training dataset. @carlini It seems we did not include this detail in the README. Shall we add it?
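
For concreteness, a hedged sketch of the uniqueness check described above; the function name, the prefix length of 50 tokens, and the input representation are assumptions for illustration, not the actual construction code:

```python
from collections import defaultdict

def well_specified_examples(sequences, prefix_len=50):
    """Keep only (prefix, suffix) pairs whose prefix has exactly one
    continuation among the given training sequences (lists of token IDs)."""
    continuations = defaultdict(set)
    for seq in sequences:
        prefix = tuple(seq[:prefix_len])
        suffix = tuple(seq[prefix_len:])
        continuations[prefix].add(suffix)
    # "Well specified": given the prefix, only one continuation
    # appears anywhere in the training data.
    return [
        (prefix, next(iter(suffixes)))
        for prefix, suffixes in continuations.items()
        if len(suffixes) == 1
    ]
```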

@heitikei

heitikei commented Sep 8, 2022

May I ask to reframe the issue?

Why do you need a list of models trained on The Pile?
Do you need to extract data that is not physically connected to the same or a similar server hub, or not saved in a retrievable format?
This set will not include data behind paywalls, intellectual property, or data that is currently unavailable online.

Premise: all data on the internet forms sensible "data piles" (thank you, Linus and Swartz).

@carlini
Collaborator

carlini commented Sep 12, 2022

@pluskid Yes, we should put that into the README! Sorry, I don't know how we forgot to say that.

@carlini
Collaborator

carlini commented Sep 12, 2022

@nonstopfor We'll try to put together a list of models not trained on The Pile. But I can't claim to know every model that has been trained, so I'd feel uncomfortable forbidding only a certain set of models. If you'd like to use a model and you're not sure, I'd suggest looking at the model card (which is supposed to discuss training data) or the original paper. I understand that this can be messy, because what if some model trains on GitHub but not The Pile (for example)? If you have any questions about specific models you'd like to use that do overlap, just raise an issue to ask about it.
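
As a concrete starting point for the model-card suggestion, a minimal sketch that loads a candidate model's card from the Hugging Face Hub and scans its text for mentions of The Pile. The repo id is just an example candidate, and a missing mention does not prove a model is safe to query:

```python
from huggingface_hub import ModelCard

# Example candidate; substitute the model you want to check.
card = ModelCard.load("EleutherAI/gpt-neo-2.7B")

# Model cards are supposed to discuss training data, so a simple text
# scan is a reasonable first filter (but read the card and paper too).
if "pile" in card.text.lower():
    print("Model card mentions The Pile; querying this model is likely not allowed.")
else:
    print("No mention of The Pile found; verify against the original paper.")
```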

@nonstopfor
Author

nonstopfor commented Sep 13, 2022

Thanks for your reply. I think it would be nice for you to list some popular models trained on The Pile, which could help reduce the burden on participants. For less commonly used models, participants can raise an issue to ask about them.
