Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bugs in resumable_download #1135

Merged
merged 3 commits into from
Sep 7, 2023

Conversation

flyingleafe
Copy link
Contributor

Bug #1: re-downloading from scratch worked incorrectly, since the file was opened in "ab" mode - hence, seeking to byte 0 had no effect, and the entire file was appended to the already downloaded part instead of rewriting it. Fixed this by opening the file in "r+b" mode (not "wb" to preserve the original contents).

Bug #2 (which is not really a bug but still): some servers serving some datasets behave incorrectly while returning 416 status code - they do not supply the "Content-Range" header with the response, even though they accept ranged downloads, as indicated by "Accept-Ranges" header in 200/206 responses. The example of that is AMI dataset server.

These bugs were encountered by me when trying to resume an interrupted download of AMI dataset, which resulted in a huge mess in the corpus folder. With those fixes, interrupting and resuming the download works without problems.

@@ -513,7 +513,7 @@ def _download(rq, size):
# sometimes, the server actually supports range requests
# but does not return the Content-Range header with 416 code
# This is out of spec, but let us check twice for pragmatic reasons.
head_req = urllib.request.Request(url, method='HEAD')
head_req = urllib.request.Request(url, method="HEAD")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe "HEAD" and 'HEAD' are equivalent in Python. That is

assert "HEAD" == 'HEAD' == """HEAD""" == '''HEAD'''

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@csukuangfj yes, the name of the commit is "run pre-commit" (to make CI happy), this is the edit by black formatter

Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! I had a hunch that resumable downloads might be a rabbit hole, thanks for helping to maintain this.

@pzelasko pzelasko merged commit 43379ae into lhotse-speech:master Sep 7, 2023
8 checks passed
@pzelasko pzelasko added this to the v1.17 milestone Sep 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants