Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make DownloadManager downloaded/extracted paths accessible #1846

Merged
merged 14 commits into from
Feb 25, 2021

Conversation

albertvillanova
Copy link
Member

Make accessible the file paths downloaded/extracted by DownloadManager.

Close #1831.

The approach:

  • I set these paths as DownloadManager attributes: these are DownloadManager's concerns
  • To access to these from DatasetBuilder, I set the DownloadManager instance as DatasetBuilder attribute: object composition

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thanks :)
And thanks for the test as well. I didn't know about the monkeypatch fixture of pytest, I'm used to the unittest patch function but I like this one too

I left a comment about how to handle multiple calls to .download in a row

@@ -184,6 +186,7 @@ def download(self, url_or_urls):
)
duration = datetime.now() - start_time
logger.info("Downloading took {} min".format(duration.total_seconds() // 60))
self.downloaded_paths = downloaded_path_or_paths
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes users call .download several times.
This happens for example if you want to first download and read some kind of manifest that tells you which other files you need to download.

Maybe this could be a dict or a list that is updated each time .download is called ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well seen! Thanks for pointing this out, @lhoestq: as it can be called several times, new results should be appended at each call. I am refactoring to take this into account.

partial(cached_path, download_config=download_config),
path_or_paths,
num_proc=num_proc,
)
self.extracted_paths = extracted_paths
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thank you :)
So right now if I understand correctly the .downloaded_paths is a list of Union[str, list, dict] right ? I was thinking of having a more user-friendly structure like a flattended list of str, or maybe a dict url->path. What do you think ?

It's possible to have a flattened list of paths with

self.downloaded_paths.extend(flatten_nested(downloaded_path_or_paths))

And if you think a dict would be more appropriate feel free to take a look at the _record_sizes_checksums method which does something really similar (but storing a dict url -> checksum_dict)

@albertvillanova
Copy link
Member Author

First I was thinking of the dict, which makes sense for .download, mapping URL to downloaded path. However does this make sense for .extract, mapping the downloaded path to the extracted path? I ask this because the user did not chose the downloaded path, so this is completely unknown for them...

@lhoestq
Copy link
Member

lhoestq commented Feb 15, 2021

There could be several situations:

  • download a file with no extraction
  • download a file and extract it
  • download a file, extract it and then inside the output folder extract some more files
  • extract a local file (for datasets with data that are manually downloaded for example)
  • extract a local file, and then inside the output folder extract some more files

So I think it's ok to have downloaded_paths as a dict url -> downloaded_path and extracted_paths as a dict local_path -> extracted_path.

@albertvillanova
Copy link
Member Author

albertvillanova commented Feb 15, 2021

OK. I am refactoring this. I have opened #1879, as an intermediate step...

@albertvillanova albertvillanova marked this pull request as ready for review February 25, 2021 10:13
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thank you !

@lhoestq lhoestq merged commit 792f1d9 into huggingface:master Feb 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Some question about raw dataset download info in the project .
2 participants