
add Toronto BooksCorpus #248

Merged
merged 4 commits into from Jun 12, 2020
Conversation

@richarddwang (Contributor) commented Jun 7, 2020

  1. I noticed there is a branch toronto_books_corpus
  • After I downloaded it, I found it is all non-English and has only one row.
  • It seems that it cites the wrong paper.
  • According to the papers using it, it is called BooksCorpus, not TorontoBooksCorpus.
  2. It uses a text mirror on Google Drive
  • bookscorpus.py includes a function download_file_from_google_drive; maybe you will want to put it elsewhere.
  • The text mirror is found in this comment on the issue, and it is said to have the same statistics as the one in the paper.
  • You may want to download it and put it on your gs in case it disappears someday.
  3. Copyright?
    The paper says:

The BookCorpus Dataset. In order to train our sentence similarity model we collected a corpus of 11,038 books from the web. These are free books written by yet unpublished authors. We only included books that had more than 20K words in order to filter out perhaps noisier shorter stories. The dataset has books in 16 different genres, e.g., Romance (2,865 books), Fantasy (1,479), Science fiction (786), Teen (430), etc. Table 2 highlights the summary statistics of our book corpus.

and we have changed the form (it is no longer books), so I don't think it should have that problem. Or we can state that it should be used at your own risk or only for academic use. I know @thomwolf should know these things better.

This should resolve #131

@richarddwang richarddwang changed the title add Tornoto BooksCorpus add Toronto BooksCorpus Jun 7, 2020
@lhoestq (Member) commented Jun 8, 2020

Thanks for adding this one !

About the three points you mentioned:

  1. I think the toronto_books_corpus branch can be removed, @mariamabarham?
  2. You can use the download manager to download from Google Drive. For your case you can just do something like
URL = "https://drive.google.com/uc?export=download&id=16KCjV9z_FHm8LgZw05RSuk4EsAWPOP_z"
...
arch_path = dl_manager.download_and_extract(URL)
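As an aside, the URL above is just Google Drive's direct-download form with the mirror's file id. A tiny hypothetical helper (not part of the `nlp` library, purely for illustration) shows how such a URL is assembled:

```python
def gdrive_download_url(file_id: str) -> str:
    # Hypothetical helper (not from the `nlp` library): build a Google
    # Drive direct-download URL from a file id.
    return "https://drive.google.com/uc?export=download&id=" + file_id

# The file id used by the text mirror in this PR:
URL = gdrive_download_url("16KCjV9z_FHm8LgZw05RSuk4EsAWPOP_z")
```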

Also, this is an unofficial host of the dataset; we should probably host it ourselves if we can.
3. Not sure about the copyright here, but maybe @thomwolf has better insights about it.

@mariamabarham (Contributor)
Yes it can be removed

@lhoestq (Member) commented Jun 11, 2020

I just downloaded the file and put it on gs. The public url is
https://storage.googleapis.com/huggingface-nlp/datasets/toronto_books_corpus/bookcorpus.tar.bz2

Could you try to change the url to this one and check that everything is ok?

@richarddwang (Contributor, Author) commented Jun 11, 2020

In books.py

URL = "https://storage.googleapis.com/huggingface-nlp/datasets/toronto_books_corpus/bookcorpus.tar.bz2"
Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from nlp import load_dataset
>>> book = load_dataset("nlp/datasets/bookscorpus/books.py", cache_dir='~/tmp')
Downloading and preparing dataset bookscorpus/plain_text (download: 1.10 GiB, generated: 4.52 GiB, total: 5.62 GiB) to /home/yisiang/tmp/bookscorpus/plain_text/1.0.0...
Downloading: 100%|███████████████████████████████████████████████████████████| 1.18G/1.18G [00:39<00:00, 30.0MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/yisiang/nlp/src/nlp/load.py", line 520, in load_dataset
    save_infos=save_infos,
  File "/home/yisiang/nlp/src/nlp/builder.py", line 420, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/home/yisiang/nlp/src/nlp/builder.py", line 460, in _download_and_prepare
    verify_checksums(self.info.download_checksums, dl_manager.get_recorded_sizes_checksums())
  File "/home/yisiang/nlp/src/nlp/utils/info_utils.py", line 31, in verify_checksums
    raise ExpectedMoreDownloadedFiles(str(set(expected_checksums) - set(recorded_checksums)))
nlp.utils.info_utils.ExpectedMoreDownloadedFiles: {'16KCjV9z_FHm8LgZw05RSuk4EsAWPOP_z'}
>>>

BTW, I noticed the path huggingface-nlp/datasets/toronto_books_corpus; does it mean I have to change the folder name "bookscorpus" to "toronto_books_corpus"?

@lhoestq (Member) commented Jun 11, 2020


Let me change the url to match "bookscorpus", so that you don't have to change anything. Good catch.

About the error you're getting: you just have to remove the dataset_infos.json and regenerate it.


@richarddwang (Contributor, Author)

Hi, I found I made a mistake. The ELECTRA paper refers to it as "BooksCorpus", but it is actually called "BookCorpus", according to the original paper. Sorry, I should have checked the original paper.

Can you do me a favor and change the url path to https://storage.googleapis.com/huggingface-nlp/datasets/bookcorpus/bookcorpus.tar.bz2 ?

@lhoestq (Member) commented Jun 11, 2020

Yep, I'm doing it right now. Could you please rename all the references to bookscorpus and BooksCorpus to book_corpus and BookCorpus (with the right casing)?

@richarddwang (Contributor, Author) commented Jun 11, 2020

Thank you @lhoestq,
Just to confirm it fits your naming convention:

  • make the file path book_corpus/book_corpus.py?
  • rename class Bookscorpus(nlp.GeneratorBasedBuilder) to BookCorpus (which makes the cache folder name book_corpus and lets users call load_dataset('book_corpus'))?
    (Because I found the "HellaSwag" dataset is named "nlp/datasets/hellaswag" with class Hellaswag)

@lhoestq (Member) commented Jun 11, 2020

Oh yea, you're right about the Hellaswag example. We should keep the "_" symbol only to replace spaces. As there are no spaces in BookCorpus, what we should do here is use:
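For illustration, the class-name-to-folder-name mapping being discussed could be sketched like this (a hypothetical helper written for this thread, not the library's actual code):

```python
import re

def camelcase_to_snakecase(name: str) -> str:
    # Hypothetical sketch of how a builder class name could map to a
    # dataset/cache folder name: insert "_" at lower-to-upper case
    # boundaries, then lowercase everything.
    s1 = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s1).lower()

print(camelcase_to_snakecase("BookCorpus"))   # book_corpus
print(camelcase_to_snakecase("Bookscorpus"))  # bookscorpus
print(camelcase_to_snakecase("Hellaswag"))    # hellaswag
```

Under this mapping, naming the class Bookscorpus yields the folder bookscorpus, while BookCorpus would yield book_corpus, which is why the class-name casing matters here.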

Don't forget to regenerate the dataset_infos.json and we'll be good :D

@lhoestq (Member) commented Jun 12, 2020

Awesome thanks :)

@lhoestq lhoestq merged commit 3343285 into huggingface:master Jun 12, 2020