Vectorized curl_download? #166

Closed · hammer opened this issue Oct 11, 2018 · 10 comments

hammer commented Oct 11, 2018

I often have a character vector of URLs that represent data files that I'd like to download to a local directory. I can use purrr::map2 with curl_download pretty easily to grab these files; however, I'm a little sad that I can't just pass vectors to curl_download. I know I should probably use curl_fetch_multi for large vectors, but for the usual case I am getting a few dozen files from a reliable server and don't want to go to the trouble of writing callback handlers for an async API.
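
For reference, a minimal sketch of that map2 pattern, assuming the URLs (hypothetical here) should simply keep their basenames as local file names:

library(curl)
library(purrr)

urls <- c("https://example.com/a.csv",   # hypothetical URLs
          "https://example.com/b.csv")
destfiles <- basename(urls)

# One curl_download() call per URL; each call returns the destination path
map2(urls, destfiles, curl_download)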

jeroen (Owner) commented Oct 11, 2018

I have thought about this, but I am not sure how to do proper exception handling that way. What should happen if one of the downloads fails but the others are OK? Should it raise an error and delete all the files? Or just return FALSE for the files that failed?

gaborcsardi (Contributor) commented

It should be possible to design a good API. E.g. for failures you would want to return an error object for that download, not just FALSE.
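
As a rough illustration of that idea in user code today (a sketch, not a proposed API, assuming urls and destfiles are parallel vectors): wrap each call in tryCatch() and keep the condition object for failed downloads, so the caller can inspect what went wrong per URL.

# Sketch: collect either the destination path or the error condition per URL
results <- Map(function(url, dest) {
  tryCatch(curl::curl_download(url, dest), error = function(e) e)
}, urls, destfiles)

failed <- vapply(results, inherits, logical(1), what = "condition")
results[failed]  # error objects for the downloads that failed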

jeroen (Owner) commented Oct 18, 2018

What about the destfile argument? Should it be a vector of the same length as the input url? Or should the user be able to specify a directory, and curl will automatically guess the filenames and save them to that directory?

gaborcsardi (Contributor) commented

> Should it be a vector of the same length as the input url?

I think that's a good start. We can add the directory approach later. That one is tricky, because you need to sanitize the output file names. E.g. /etc/passwd should probably not be allowed.
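
For illustration, a naive sketch of that directory approach, assuming basename() is an acceptable way to derive the names (a real implementation would need stricter sanitization); the helper name save_into_dir is made up:

# Naive sketch (save_into_dir is a hypothetical helper, not part of curl)
save_into_dir <- function(urls, dir) {
  destfiles <- file.path(dir, basename(urls))
  Map(curl::curl_download, urls, destfiles)
  invisible(destfiles)
}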

jeroen (Owner) commented Oct 18, 2018

And a lot of URLs don't have an obvious filename, e.g. when they are a REST endpoint to some oid.

gaborcsardi (Contributor) commented

Yeah, another good reason. I think just a vector of output file names is fine.

lemairev commented Jan 3, 2019

Sorry to reactivate this issue, but I have a question related to this topic (it seems)!

You mention that we can use map and curl_download, but that will download the files one by one, right?
Is there a better way to download several files asynchronously than doing something like:

test <- c("http://www.prevair.org/donneesmisadispo/public/PREVAIR.analyse.20181205.MAXJ.NO2.public.jpg",
          "http://www.prevair.org/donneesmisadispo/public/PREVAIR.analyse.20181205.MAXJ.NO2.public.nc")
walk2(test, basename(test), ~ curl_fetch_multi(.x, done = cb, pool = pool, data = file(.y, open = "wb")))

I'm struggling with the data argument... sorry if it's obvious :s
Thanks & happy new year!
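
For context, one way to wire up the pool API without the data argument is to buffer each response in memory and write it out in the done callback; a sketch, assuming the files are small enough to buffer:

library(curl)

urls <- c("http://www.prevair.org/donneesmisadispo/public/PREVAIR.analyse.20181205.MAXJ.NO2.public.jpg",
          "http://www.prevair.org/donneesmisadispo/public/PREVAIR.analyse.20181205.MAXJ.NO2.public.nc")
pool <- new_pool()

for (i in seq_along(urls)) {
  local({
    dest <- basename(urls[i])
    curl_fetch_multi(
      urls[i],
      done = function(res) writeBin(res$content, dest),  # res$content is a raw vector
      fail = function(msg) message("failed: ", msg),
      pool = pool
    )
  })
}
multi_run(pool = pool)  # performs all queued downloads concurrently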

hongooi73 commented

@lemairev You could do the parallelism at the process level by creating a cluster and using parLapply/foreach/etc.

clus <- parallel::makeCluster(10)  # or whatever number you want
parallel::clusterMap(clus, function(src, dest) {
    curl::curl_fetch_disk(src, dest)  # don't forget to check the result
}, srcfiles, destfiles)

This isn't technically asynchronous, but for large numbers of small files it'll still be much faster than downloading sequentially. The number of processes in the cluster isn't so important since you'll generally be constrained by your network bandwidth more than memory or CPU. You can kill the cluster afterwards, or keep it around if you know you're going to be using it again.

lemairev commented

@Hong-Revo Thank you for the suggestion! I'll try that :) cheers

jeroen (Owner) commented Oct 26, 2024

Have a look at the multi_download interface.
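
For anyone landing here later, a minimal sketch of that interface, assuming a recent curl release that provides multi_download():

library(curl)

urls <- c("https://example.com/a.csv", "https://example.com/b.csv")  # hypothetical URLs
res <- multi_download(urls, destfiles = basename(urls))

# res is a data frame with one row per URL, including success/failure info
res[!res$success, ]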

jeroen closed this as completed Oct 26, 2024