add Toronto BooksCorpus #248
Thanks for adding this one! About the three points you mentioned:
URL = "https://drive.google.com/uc?export=download&id=16KCjV9z_FHm8LgZw05RSuk4EsAWPOP_z"
...
arch_path = dl_manager.download_and_extract(URL)
Also, this is an unofficial host of the dataset; we should probably host it ourselves if we can.
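As a rough illustration of the URL used above, here is a minimal sketch of how a Google Drive direct-download link of that form can be built from a file id and parsed back. The helper names `drive_download_url` and `drive_file_id` are hypothetical and not part of the library; only the URL pattern and the file id come from the comment above.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Endpoint taken from the URL quoted in the comment above.
DRIVE_DOWNLOAD_ENDPOINT = "https://drive.google.com/uc"


def drive_download_url(file_id: str) -> str:
    """Build a direct-download URL for a publicly shared Drive file (hypothetical helper)."""
    query = urlencode({"export": "download", "id": file_id})
    return f"{DRIVE_DOWNLOAD_ENDPOINT}?{query}"


def drive_file_id(url: str) -> str:
    """Extract the file id from a URL of the form built above (hypothetical helper)."""
    return parse_qs(urlparse(url).query)["id"][0]


# The file id below is the one from the comment above.
URL = drive_download_url("16KCjV9z_FHm8LgZw05RSuk4EsAWPOP_z")
print(URL)
```

The resulting string is what would be passed to dl_manager.download_and_extract, as suggested above, so the dataset script would not need its own Google Drive download function.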
Yes, it can be removed.
…in Lhoest's advice.
I just downloaded the file and put it on gs. The public url is … Could you try to change the url to this one and check that everything is ok?
In
BTW, I notice the path
Let me change the url to match "bookscorpus", so that you don't have to change anything. Good catch. About the error you're getting: you just have to remove the
Hi, I found I made a mistake. The ELECTRA paper refers to it as "BooksCorpus", but it is actually called "BookCorpus", according to the original paper. Sorry, I should have checked the original paper. Can you do me a favor and change the url path to
Yep, I'm doing it right now. Could you please rename all the references to
Thank you @lhoestq,
Oh yeah, you're right about the Hellaswag example. We should keep the "_" symbol to replace spaces. As there are no spaces in BookCorpus, what we should do here is use:
Don't forget to regenerate the
Awesome, thanks :)
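To illustrate the naming point discussed above (keeping "_" only where the name has a word boundary), here is a minimal sketch that assumes the dataset name is derived from a CamelCase class name by a snake_case conversion. The function `to_snake_case` is hypothetical and does not reproduce the library's own helper; it only shows why capitalization inside a class name matters.

```python
import re


def to_snake_case(name: str) -> str:
    """Insert "_" at each lowercase/digit-to-uppercase boundary, then lowercase.

    Hypothetical helper for illustration only.
    """
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name).lower()


for class_name in ["BooksCorpus", "Bookcorpus", "Hellaswag", "HellaSwag"]:
    print(class_name, "->", to_snake_case(class_name))
```

Under this assumption, a class name with an inner capital letter yields a name containing "_", while a name with only a leading capital yields a single lowercase word.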
toronto_books_corpus
BooksCorpus but not TorontoBooksCorpus
bookscorpus.py includes a function download_file_from_google_drive; maybe you will want to put it elsewhere.
The paper has said
and we have changed the form (not books), so I don't think it should have that problem. Or we can state that people use it at their own risk or only for academic use. I think @thomwolf knows more about these things.
This should solve #131