[Feature request] Add Toronto BookCorpus dataset #131

Closed
jarednielsen opened this issue May 15, 2020 · 2 comments
Labels
dataset request Requesting to add a new dataset

Comments

@jarednielsen
Contributor

I know the copyright/distribution of this one is complex, but it would be great to have! That, combined with the existing wikitext, would provide a complete dataset for pretraining models like BERT.

@jarednielsen changed the title from "[Feature request] Add Toronto BookCorpus" to "[Feature request] Add Toronto BookCorpus dataset" on May 15, 2020
@richarddwang
Contributor

As far as I understand, wikitext refers to WikiText-103 and WikiText-2, which were created by researchers at Salesforce and are mostly used in traditional language modeling.

You probably mean wikipedia, a dump from the Wikimedia Foundation.

I would also like to have Toronto BookCorpus! Though it involves copyright problems...

@thomwolf added the "dataset request" label on May 17, 2020
@richarddwang
Contributor

Hi @lhoestq, just a reminder that this is solved by #248. 😉

@lhoestq closed this as completed on Jun 28, 2020