-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make DownloadManager downloaded/extracted paths accessible #1846
Make DownloadManager downloaded/extracted paths accessible #1846
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice thanks :)
And thanks for the test as well. I didn't know about the monkeypatch
fixture of pytest, I'm used to the unittest patch function but I like this one too
I left a comment about how to handle multiple calls to .download
in a row
@@ -184,6 +186,7 @@ def download(self, url_or_urls): | |||
) | |||
duration = datetime.now() - start_time | |||
logger.info("Downloading took {} min".format(duration.total_seconds() // 60)) | |||
self.downloaded_paths = downloaded_path_or_paths |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes users call .download
several times.
This happens for example if you want to first download and read some kind of manifest that tells you which other files you need to download.
Maybe this could be a dict or a list that is updated each time .download
is called ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well seen! Thanks for pointing this out, @lhoestq: as it can be called several times, new results should be appended at each call. I am refactoring to take this into account.
partial(cached_path, download_config=download_config), | ||
path_or_paths, | ||
num_proc=num_proc, | ||
) | ||
self.extracted_paths = extracted_paths |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice thank you :)
So right now if I understand correctly the .downloaded_paths
is a list of Union[str, list, dict] right ? I was thinking of having a more user-friendly structure like a flattended list of str, or maybe a dict url->path. What do you think ?
It's possible to have a flattened list of paths with
self.downloaded_paths.extend(flatten_nested(downloaded_path_or_paths))
And if you think a dict would be more appropriate feel free to take a look at the _record_sizes_checksums
method which does something really similar (but storing a dict url -> checksum_dict)
First I was thinking of the dict, which makes sense for .download, mapping URL to downloaded path. However does this make sense for .extract, mapping the downloaded path to the extracted path? I ask this because the user did not chose the downloaded path, so this is completely unknown for them... |
There could be several situations:
So I think it's ok to have |
OK. I am refactoring this. I have opened #1879, as an intermediate step... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice thank you !
Make accessible the file paths downloaded/extracted by DownloadManager.
Close #1831.
The approach: