Use urllib2 rather than curl to download #1

dagss · 2012-12-09T19:21:38Z

In 02b8ed8, downloading was changed from urllib2 to shelling out to curl in order to get nice statistics, progress etc. This should be switched back:

It means an extra thing that can fail on the host system (finding and executing curl).
There have been too many cases where curl downloads an error page etc. and transparently succeeds with a broken download.

The relevant code is in

SourceCache._download_and_hash

When an error happens (404 or similar) one should first log the error, then re-raise the exception (see #29).
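
The log-then-re-raise pattern described above could look roughly like this. It is only a sketch: `fetch_logged`, the injected `fetch` callable and the logger name are hypothetical, and the standard `logging` module stands in for hashdist's own logger.

```python
import logging

# Hypothetical logger name; hashdist has its own logging module.
logger = logging.getLogger("hashdist.example")


def fetch_logged(url, fetch):
    """Call fetch(url); on failure, log the error and re-raise it unchanged.

    `fetch` is any callable that downloads `url` and raises on failure,
    for example a thin wrapper around urllib2.urlopen.
    """
    try:
        return fetch(url)
    except Exception as exc:
        logger.error("Could not download %r: %s", url, exc)
        # A bare `raise` re-raises the original exception with its traceback,
        # so callers can still tell a 404 apart from other failures.
        raise
```

Passing the download function in as an argument keeps the error handling separate from the transport, so the same wrapper works whether the download uses urllib2 or something else.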

Just switching it over to urllib2 is easy. The problem is that when downloading a huge file one should probably update a progress meter (something like "34% (34 MB of 100 MB)"). To integrate a progress meter:

In hashdist/hashdist_logging.py, add "start_progress(msg)", "update_progress(percentage)" and "stop_progress" methods. The first sets "self.progress_msg", update_progress prints "\r{self.progress_msg}{percentage-message}", and the last just emits "done\n" instead of {percentage-message}.
BUT the progress meter should not be dumped to backing log files ("raw streams"); for those, only the start/end messages and no "\r" characters should be emitted.
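
A minimal sketch of the three methods, assuming a logger that writes to a single stream. The class name `ProgressLogger` and the `is_raw` constructor flag are illustrative assumptions, not the actual contents of hashdist/hashdist_logging.py.

```python
import sys


class ProgressLogger(object):
    """Hypothetical stand-in for a logger with progress-meter support."""

    def __init__(self, stream=None, is_raw=False):
        self.stream = stream if stream is not None else sys.stdout
        self.is_raw = is_raw  # True for backing log files ("raw streams")
        self.progress_msg = None

    def start_progress(self, msg):
        # Remember the message so later updates can redraw the whole line.
        self.progress_msg = msg
        self.stream.write(msg)
        self.stream.flush()

    def update_progress(self, percentage):
        if self.is_raw:
            return  # raw streams get no "\r" redraws
        self.stream.write("\r%s %d%%" % (self.progress_msg, percentage))
        self.stream.flush()

    def stop_progress(self):
        if self.is_raw:
            # Finish the line that start_progress began.
            self.stream.write(" done\n")
        else:
            # Overwrite the last percentage with "done".
            self.stream.write("\r%s done\n" % self.progress_msg)
        self.stream.flush()
```

With `is_raw=True` the stream ends up containing only one line of the form "msg done"; otherwise each update returns to the start of the line and redraws it.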

certik · 2013-02-20T05:30:29Z

We now validate the archives, so there should be no more problems. It'd be nice to use urllib2, but the progress meter is not so simple to do, see e.g. http://stackoverflow.com/questions/2028517/python-urllib2-progress-hook. I can imagine tons of subtle bugs that we could introduce, so I would rather stick to curl for now and get back to this later.

dagss · 2013-02-20T10:27:42Z

I disagree on several counts. First, I think this is a real problem:

(master) ~/code/hashdist $ bin/hdist fetch http://google.com/nonexisting.tar.gz
[hashdist] Downloading 'http://google.com/nonexisting.tar.gz'
....innocent-looking curl output...
[ERROR] File downloaded from 'http://google.com/nonexisting.tar.gz' is not a valid archive

"not a valid archive" is not a very helpful message when the file doesn't exist in the first place. Using urllib2 will be much better for creating helpful error messages (404 vs. access denied vs. got a file but it is not an archive).
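
To illustrate the kind of messages this makes possible, here is a rough sketch that maps the exceptions raised by urllib2 to user-facing text. The function name `describe_download_error` and the exact wording are assumptions; the import falls back to `urllib.error`, which is where Python 3 keeps the same exception classes.

```python
try:
    from urllib2 import HTTPError, URLError  # Python 2, as used in this thread
except ImportError:
    from urllib.error import HTTPError, URLError  # Python 3 equivalent


def describe_download_error(url, exc):
    """Turn a download exception into a message the user can act on."""
    # HTTPError must be checked first because it is a subclass of URLError.
    if isinstance(exc, HTTPError):
        if exc.code == 404:
            return "File not found (HTTP 404): %r" % url
        if exc.code in (401, 403):
            return "Access denied (HTTP %d): %r" % (exc.code, url)
        return "Server returned HTTP %d for %r" % (exc.code, url)
    if isinstance(exc, URLError):
        # No HTTP response at all: DNS failure, refused connection, etc.
        return "Could not reach server for %r: %s" % (url, exc.reason)
    return "Download of %r failed: %s" % (url, exc)
```

The "not a valid archive" case would remain a separate check that runs only after the download itself has succeeded.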

And most of the complexity in what you posted is already there in source_cache.py: it already reads from the stdout of curl in chunks, because it does the hashing at the same time before writing to disk.
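
For reference, the chunked read-hash-write loop looks roughly like the sketch below, with an optional progress callback added. This is not the code in source_cache.py: the name `copy_and_hash`, the callback signature, the chunk size and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def copy_and_hash(stream, out, total_size=None, on_progress=None,
                  chunk_size=16 * 1024):
    """Copy `stream` to `out` in chunks, hashing the data on the way.

    `stream` is any object with read(n), such as a urllib2 response or a
    subprocess pipe. Returns the hex digest of everything that was copied.
    """
    hasher = hashlib.sha256()  # assumed algorithm, for illustration only
    done = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream
        hasher.update(chunk)
        out.write(chunk)
        done += len(chunk)
        # Progress can only be reported when the total size is known,
        # e.g. from a Content-Length header if the server sent one.
        if on_progress is not None and total_size:
            on_progress(100 * done // total_size)
    return hasher.hexdigest()
```

Because the loop only relies on `read`, the same code works for both transports; the progress callback is the only new part.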

It's really just the progress meter that's additional (and if that goes wrong it's not that bad).

Finally, we definitely want to do this for Windows, so we may as well start to iron out the bugs.

dagss · 2013-02-20T13:46:26Z

conda has (better) code for this in conda/remote.py as well (though the progress meter itself is in some other place).

dagss · 2013-02-20T13:46:46Z

(I meant better than the stackoverflow link.)

certik · 2013-02-20T20:07:52Z

Ok, good points. I can work on this.

certik · 2013-02-21T21:51:58Z

Here is a minimal example of a progress bar:

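import sys

total = 10000000
point = total // 100      # iterations per percent
increment = total // 20   # iterations per bar character (20-character bar)
# xrange is Python 2; use range on Python 3.
for i in xrange(total + 1):
    # Redraw only every 5% so the terminal is not flooded with writes.
    if i % (5 * point) == 0:
        sys.stdout.write("\r[" + "=" * (i // increment) + " " * ((total - i) // increment) + "] " + str(i // point) + "%")
        sys.stdout.flush()
sys.stdout.write("\n")    # finish the progress line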

So we just need to adapt this to make it work with our logging system and we are good.

dagss · 2013-02-21T21:57:15Z

Yes. I think you simply have start_progress(msg), stop_progress(), and update_progress(percentage, msg), which:

Do nothing for is_raw streams (they are piped to build.log, which would otherwise get lots of \r in it)
Do nothing for normal streams if GOT_TTY is false
But draw the progress bar when GOT_TTY is true

Oh, and 10% of the job is pretty-printing number of bytes downloaded so far and total as 'KB', 'MB', and so on. (Or, I guess always using 'MB' isn't too bad)
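
A small sketch of that pretty-printing step, assuming 1024-based units. The helper names `format_size` and `format_progress` are made up for this example, and `total` is assumed to be positive.

```python
def format_size(num_bytes):
    """Format a byte count as 'B', 'KB', 'MB', 'GB' or 'TB'."""
    size = float(num_bytes)
    unit = "B"
    for next_unit in ("KB", "MB", "GB", "TB"):
        if size < 1024.0:
            break
        size /= 1024.0
        unit = next_unit
    if unit == "B":
        return "%d B" % size
    return "%.1f %s" % (size, unit)


def format_progress(done, total):
    """Build the text of a progress meter; `total` must be positive."""
    percentage = 100 * done // total
    return "%d%% (%s of %s)" % (percentage, format_size(done),
                                format_size(total))
```

Always printing in MB, as suggested, would reduce this to a single division by `1024.0 * 1024.0`.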

certik · 2013-04-10T18:56:15Z

Ok, most of this is fixed; the rest is in #81.

dagss mentioned this issue Jan 22, 2013

Make file downloading more robust #29

Closed

5 tasks

dagss mentioned this issue Feb 15, 2013

Stderr of subprocesses spewed to console #33

Closed

certik mentioned this issue Apr 9, 2013

Use urllib2 instead of curl #80

Merged

dagss mentioned this issue Apr 10, 2013

Integrate progress bar with our loggers #81

Open

certik closed this as completed Apr 10, 2013
