Vectorized curl_download? #166

Closed · hammer opened this issue Oct 11, 2018 · 10 comments

hammer commented Oct 11, 2018

I often have a character vector of URLs that represent data files that I'd like to download to a local directory. I can use purrr::map2 with curl_download pretty easily to grab these files; however, I'm a little sad that I can't just pass vectors to curl_download. I know I should probably use curl_fetch_multi for large vectors, but for the usual case I am getting a few dozen files from a reliable server and don't want to go to the trouble of writing callback handlers for an async API.
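
For reference, a minimal sketch of that map2 pattern, assuming the URLs (hypothetical here) should simply keep their basenames as local file names:

library(curl)
library(purrr)

urls <- c("https://example.com/a.csv",   # hypothetical URLs
          "https://example.com/b.csv")
destfiles <- basename(urls)

# One curl_download() call per URL; each call returns the destination path
map2(urls, destfiles, curl_download)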

jeroen (Owner) commented Oct 11, 2018

I have thought about this, but I am not sure how to do proper exception handling that way. What should happen if one of the downloads fails but the others are OK? Should it raise an error and delete all the files? Or just return FALSE for the files that failed?

gaborcsardi (Contributor) commented

It should be possible to design a good API. E.g. for failures you would want to return an error object for that download, not just FALSE.
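
As a rough illustration of that idea in user code today (a sketch, not a proposed API, assuming urls and destfiles are parallel vectors): wrap each call in tryCatch() and keep the condition object for failed downloads, so the caller can inspect what went wrong per URL.

# Sketch: collect either the destination path or the error condition per URL
results <- Map(function(url, dest) {
  tryCatch(curl::curl_download(url, dest), error = function(e) e)
}, urls, destfiles)

failed <- vapply(results, inherits, logical(1), what = "condition")
results[failed]  # error objects for the downloads that failed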

jeroen (Owner) commented Oct 18, 2018

What about the destfile argument? Should it be a vector of the same length as the input url? Or should the user be able to specify a directory, and curl will automatically guess the filenames and save them to that directory?

gaborcsardi (Contributor) commented

> Should it be a vector of the same length as the input url?

I think that's a good start. We can add the directory approach later. That one is tricky, because you need to sanitize the output file names. E.g. /etc/passwd should probably not be allowed.
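
For illustration, a naive sketch of that directory approach, assuming basename() is an acceptable way to derive the names (a real implementation would need stricter sanitization); the helper name save_into_dir is made up:

# Naive sketch (save_into_dir is a hypothetical helper, not part of curl)
save_into_dir <- function(urls, dir) {
  destfiles <- file.path(dir, basename(urls))
  Map(curl::curl_download, urls, destfiles)
  invisible(destfiles)
}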

jeroen (Owner) commented Oct 18, 2018

And a lot of URLs don't have an obvious filename, e.g. when they are a REST endpoint to some oid.

gaborcsardi (Contributor) commented

Yeah, another good reason. I think just a vector of output file names is fine.

lemairev commented Jan 3, 2019

Sorry to reactivate this issue, but I have a question related to this topic (it seems)!

You mention that we can use map and curl_download, but that will download the files one by one, right?
Is there a better way to download several files asynchronously than doing something like:

test <- c("http://www.prevair.org/donneesmisadispo/public/PREVAIR.analyse.20181205.MAXJ.NO2.public.jpg",
          "http://www.prevair.org/donneesmisadispo/public/PREVAIR.analyse.20181205.MAXJ.NO2.public.nc")
walk2(test, basename(test), ~ curl_fetch_multi(.x, done = cb, pool = pool, data = file(.y, open = "wb")))

I'm struggling with the data argument... sorry if it's obvious :s
Thanks & happy new year!
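
For context, one way to wire up the pool API without the data argument is to buffer each response in memory and write it out in the done callback; a sketch, assuming the files are small enough to buffer:

library(curl)

urls <- c("http://www.prevair.org/donneesmisadispo/public/PREVAIR.analyse.20181205.MAXJ.NO2.public.jpg",
          "http://www.prevair.org/donneesmisadispo/public/PREVAIR.analyse.20181205.MAXJ.NO2.public.nc")
pool <- new_pool()

for (i in seq_along(urls)) {
  local({
    dest <- basename(urls[i])
    curl_fetch_multi(
      urls[i],
      done = function(res) writeBin(res$content, dest),  # res$content is a raw vector
      fail = function(msg) message("failed: ", msg),
      pool = pool
    )
  })
}
multi_run(pool = pool)  # performs all queued downloads concurrently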

hongooi73 commented

@lemairev You could do the parallelism at the process level by creating a cluster and using parLapply/foreach/etc.

clus <- parallel::makeCluster(10)  # or whatever number you want
parallel::clusterMap(clus, function(src, dest) {
    curl::curl_fetch_disk(src, dest)  # don't forget to check the result
}, srcfiles, destfiles)

This isn't technically asynchronous, but for large numbers of small files it'll still be much faster than downloading sequentially. The number of processes in the cluster isn't so important since you'll generally be constrained by your network bandwidth more than memory or CPU. You can kill the cluster afterwards, or keep it around if you know you're going to be using it again.

lemairev commented

@Hong-Revo Thank you for the suggestion! I'll try that :) cheers

jeroen (Owner) commented Oct 26, 2024

Have a look at the multi_download interface.
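
For anyone landing here later, a minimal sketch of that interface, assuming a recent curl release that provides multi_download():

library(curl)

urls <- c("https://example.com/a.csv", "https://example.com/b.csv")  # hypothetical URLs
res <- multi_download(urls, destfiles = basename(urls))

# res is a data frame with one row per URL, including success/failure info
res[!res$success, ]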

jeroen closed this as completed Oct 26, 2024