
add Toronto BooksCorpus #248

Merged
merged 4 commits into from Jun 12, 2020
Conversation

@richarddwang (Contributor) commented Jun 7, 2020

  1. I noticed there is a branch toronto_books_corpus
  • After I downloaded it, I found it is all non-English and has only one row.
  • It seems that it cites the wrong paper.
  • According to the papers using it, it is called BooksCorpus, not TorontoBooksCorpus.
  2. It uses a text mirror on Google Drive
  • bookscorpus.py includes a function download_file_from_google_drive; maybe you will want to put it elsewhere.
  • The text mirror is found in this comment on the issue, and it is said to have the same statistics as the one in the paper.
  • You may want to download it and put it on your gs in case it disappears someday.
  3. Copyright?
    The paper says:

The BookCorpus Dataset. In order to train our sentence similarity model we collected a corpus of 11,038 books from the web. These are free books written by yet unpublished authors. We only included books that had more than 20K words in order to filter out perhaps noisier shorter stories. The dataset has books in 16 different genres, e.g., Romance (2,865 books), Fantasy (1,479), Science fiction (786), Teen (430), etc. Table 2 highlights the summary statistics of our book corpus.

and we have changed the form (it is no longer books), so I don't think it should have that problem. Or we can state that it should be used at your own risk or only for academic use. I know @thomwolf should know these things better.

This should resolve #131

@richarddwang richarddwang changed the title add Tornoto BooksCorpus add Toronto BooksCorpus Jun 7, 2020
@lhoestq (Member) commented Jun 8, 2020

Thanks for adding this one !

About the three points you mentioned:

  1. I think the toronto_books_corpus branch can be removed, @mariamabarham?
  2. You can use the download manager to download from Google Drive. For your case you can just do something like
URL = "https://drive.google.com/uc?export=download&id=16KCjV9z_FHm8LgZw05RSuk4EsAWPOP_z"
...
arch_path = dl_manager.download_and_extract(URL)
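As an aside, the URL above is just Google Drive's direct-download form with the mirror's file id. A tiny hypothetical helper (not part of the `nlp` library, purely for illustration) shows how such a URL is assembled:

```python
def gdrive_download_url(file_id: str) -> str:
    # Hypothetical helper (not from the `nlp` library): build a Google
    # Drive direct-download URL from a file id.
    return "https://drive.google.com/uc?export=download&id=" + file_id

# The file id used by the text mirror in this PR:
URL = gdrive_download_url("16KCjV9z_FHm8LgZw05RSuk4EsAWPOP_z")
```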

Also, this is an unofficial host of the dataset; we should probably host it ourselves if we can.
3. Not sure about the copyright here, but maybe @thomwolf has better insights about it.

@mariamabarham (Contributor)
Yes it can be removed

@lhoestq (Member) commented Jun 11, 2020

I just downloaded the file and put it on gs. The public url is
https://storage.googleapis.com/huggingface-nlp/datasets/toronto_books_corpus/bookcorpus.tar.bz2

Could you try to change the url to this one and check that everything is ok?

@richarddwang (Contributor, Author) commented Jun 11, 2020

In books.py

URL = "https://storage.googleapis.com/huggingface-nlp/datasets/toronto_books_corpus/bookcorpus.tar.bz2"
Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from nlp import load_dataset
>>> book = load_dataset("nlp/datasets/bookscorpus/books.py", cache_dir='~/tmp')
Downloading and preparing dataset bookscorpus/plain_text (download: 1.10 GiB, generated: 4.52 GiB, total: 5.62 GiB) to /home/yisiang/tmp/bookscorpus/plain_text/1.0.0...
Downloading: 100%|███████████████████████████████████████████████████████████| 1.18G/1.18G [00:39<00:00, 30.0MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/yisiang/nlp/src/nlp/load.py", line 520, in load_dataset
    save_infos=save_infos,
  File "/home/yisiang/nlp/src/nlp/builder.py", line 420, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/home/yisiang/nlp/src/nlp/builder.py", line 460, in _download_and_prepare
    verify_checksums(self.info.download_checksums, dl_manager.get_recorded_sizes_checksums())
  File "/home/yisiang/nlp/src/nlp/utils/info_utils.py", line 31, in verify_checksums
    raise ExpectedMoreDownloadedFiles(str(set(expected_checksums) - set(recorded_checksums)))
nlp.utils.info_utils.ExpectedMoreDownloadedFiles: {'16KCjV9z_FHm8LgZw05RSuk4EsAWPOP_z'}
>>>

BTW, I noticed the path huggingface-nlp/datasets/toronto_books_corpus; does it mean I have to change the folder name "bookscorpus" to "toronto_books_corpus"?

@lhoestq (Member) commented Jun 11, 2020


Let me change the url to match "bookscorpus", so that you don't have to change anything. Good catch.

About the error you're getting: you just have to remove the dataset_infos.json and regenerate it.


@richarddwang (Contributor, Author)

Hi, I found I made a mistake. The ELECTRA paper refers to it as "BooksCorpus", but it is actually called "BookCorpus", according to the original paper. Sorry, I should have checked the original paper.

Can you do me a favor and change the url path to https://storage.googleapis.com/huggingface-nlp/datasets/bookcorpus/bookcorpus.tar.bz2 ?

@lhoestq (Member) commented Jun 11, 2020

Yep, I'm doing it right now. Could you please rename all the references to bookscorpus and BooksCorpus to book_corpus and BookCorpus (with the right casing)?

@richarddwang (Contributor, Author) commented Jun 11, 2020

Thank you @lhoestq,
Just to confirm it fits your naming convention:

  • make the file path book_corpus/book_corpus.py?
  • rename class Bookscorpus(nlp.GeneratorBasedBuilder) to BookCorpus (which makes the cache folder name book_corpus and lets users call load_dataset('book_corpus'))?
    (Because I found the "HellaSwag" dataset is named "nlp/datasets/hellaswag" with class Hellaswag)

@lhoestq (Member) commented Jun 11, 2020

Oh yea, you're right about the Hellaswag example. We should keep the "_" symbol only to replace spaces. As there are no spaces in BookCorpus, what we should do here is use:
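For illustration, the class-name-to-folder-name mapping being discussed could be sketched like this (a hypothetical helper written for this thread, not the library's actual code):

```python
import re

def camelcase_to_snakecase(name: str) -> str:
    # Hypothetical sketch of how a builder class name could map to a
    # dataset/cache folder name: insert "_" at lower-to-upper case
    # boundaries, then lowercase everything.
    s1 = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s1).lower()

print(camelcase_to_snakecase("BookCorpus"))   # book_corpus
print(camelcase_to_snakecase("Bookscorpus"))  # bookscorpus
print(camelcase_to_snakecase("Hellaswag"))    # hellaswag
```

Under this mapping, naming the class Bookscorpus yields the folder bookscorpus, while BookCorpus would yield book_corpus, which is why the class-name casing matters here.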

Don't forget to regenerate the dataset_infos.json and we'll be good :D

@lhoestq (Member) commented Jun 12, 2020

Awesome thanks :)

@lhoestq lhoestq merged commit 3343285 into huggingface:master Jun 12, 2020