Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use urllib2 rather than curl to download #1

Closed
dagss opened this issue Dec 9, 2012 · 8 comments · Fixed by #80
Closed

Use urllib2 rather than curl to download #1

dagss opened this issue Dec 9, 2012 · 8 comments · Fixed by #80

Comments

@dagss
Copy link
Member

dagss commented Dec 9, 2012

In 02b8ed8, changed from urllib2 to shelling out to curl for downloading in order to get nice statistics, progress etc. This should be switched back:

  • It means an extra thing that can fail on the host system (finding and executing curl)
  • There's been too many cases where curl will download error pages etc. and transparently succeed with a broken download

The relevant code is in

SourceCache._download_and_hash

When an error happens (404 or similar) one should first log the error, then re-raise the exception (see #29),

Just switching it over to urllib2 is easy, the problem is that when downloading a huge file one should probably update a progress meter (something like "34% (4 MB of 100 MB)"). To integrate a progress meter:

  1. In hashdist/hashdist_logging.py, add a "start_progress(msg)", "update_progress(percentage)", "stop_progress" methods. The former sets "self.progress_msg", while update_progress prints "\r{self.progress_msg}{percentage-message}", and the latter just emits "done\n" instead of {percentage-message}.

  2. BUT, the progress meter should not be dumped to backing log files ("raw streams"), then only the start/end messages and no "\r" characters should be emitted.

@certik
Copy link
Member

certik commented Feb 20, 2013

We now validate the archives, so there should be no more problems. It'd be nice to use urllib2, but it's not so simple to do the progress meter, see e.g. here: http://stackoverflow.com/questions/2028517/python-urllib2-progress-hook, I can imagine tons of subtle bugs that we can introduce. So I would rather stick to curl for now and we can get back to this later.

@dagss
Copy link
Member Author

dagss commented Feb 20, 2013

I disagree on several counts. First, I think this is a real problem:

(master) ~/code/hashdist $ bin/hdist fetch http://google.com/nonexisting.tar.gz
[hashdist] Downloading 'http://google.com/nonexisting.tar.gz'
....innocent-looking curl output...
[ERROR] File downloaded from 'http://google.com/nonexisting.tar.gz' is not a valid archive

"not a valid archive" is not a very helpful message when the file doesn't exist in the first place. Using urllib2 will be much better for creating helpful error messages (404 vs. access denied vs. got a file but it is not an archive).

And, most of the complexity in what you posted is already there in source_cache.py, it already reads from the stdout of curl in chunks because it does the hashing at the same time before writing to disk

It's really just the progress meter that's additional (and if that goes wrong it's not that bad).

Finally, we definitely want to do this for Windows, so we may as well start to iron out the bugs.

@dagss
Copy link
Member Author

dagss commented Feb 20, 2013

conda has (better) code for this in conda/remote.py as well (though the progress meter itself is some other place)

@dagss
Copy link
Member Author

dagss commented Feb 20, 2013

(I meant better than the stackoverflow link.)

@certik
Copy link
Member

certik commented Feb 20, 2013

Ok, good points. I can work on this.

@certik
Copy link
Member

certik commented Feb 21, 2013

Here is a minimal example of a progress bar:

import sys
total = 10000000
point = total / 100
increment = total / 20
for i in xrange(total+1):
    if(i % (5 * point) == 0):
        sys.stdout.write("\r[" + "=" * (i / increment) +  " " * ((total - i)/ increment) + "]" +  str(i / point) + "%")
        sys.stdout.flush()

So we just need to adapt this to make it work with our logging system and we are good.

@dagss
Copy link
Member Author

dagss commented Feb 21, 2013

Yes. I think you simply have start_progress(msg), stop_progress(), and update_progress(percentage, msg) which:

  • Do nothing for is_raw streams (they are piped to build.log which will get lots of \r in it
  • Do nothing for normal streams if GOT_TTY is false
  • But do the above when GOT_TTY is true

Oh, and 10% of the job is pretty-printing number of bytes downloaded so far and total as 'KB', 'MB', and so on. (Or, I guess always using 'MB' isn't too bad)

@certik
Copy link
Member

certik commented Apr 10, 2013

Ok, most of this is fixed, the rest of this is in #81.

@certik certik closed this as completed Apr 10, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants