
Possible to implement asynchronous (i.e. non-blocking I/O) in curl? #51

Closed
zachmayer opened this issue Dec 17, 2015 · 6 comments

Comments

@zachmayer

From r-lib/httr#271:

This is not an issue, more of a question that I wanted to pose to the community. If there is already a method for achieving this, I apologize for the repeat and would appreciate being pointed in the right direction.

I use httr extensively and often find myself in situations where aggregating the data I need requires hundreds of thousands of REST calls. This becomes performance-limiting in R because it seems like every HTTP call made using httr is a blocking call.

Is there any plan or path forward for enabling asynchronous io within curl similar to what exists in Python via aiohttp or Scala via Akka-http?

@zachmayer (Author)

And a follow-up:
Here's an example from RCurl:

# Requires the RCurl package: library(RCurl)
getURIs =
function(uris, ..., multiHandle = getCurlMultiHandle(), .perform = TRUE)
{
  content = list()
  curls = list()

  # Set up one easy handle per URI and add it to the multi handle.
  for(i in uris) {
    curl = getCurlHandle()
    content[[i]] = basicTextGatherer()
    opts = curlOptions(URL = i, writefunction = content[[i]]$update, ...)
    curlSetOpt(.opts = opts, curl = curl)
    multiHandle = push(multiHandle, curl)
  }

  if(.perform) {
     # Drive all queued requests to completion, then return the gathered bodies.
     complete(multiHandle)
     lapply(content, function(x) x$value())
   } else {
     return(list(multiHandle = multiHandle, content = content))
   }
}

There is also getURIAsynchronous.
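
For context, a minimal sketch of how getURIAsynchronous is typically called (the URLs are placeholders; see ?getURIAsynchronous in RCurl for the authoritative interface):

library(RCurl)

# Download several URIs concurrently; returns the response bodies
# as a character vector, one element per URI.
uris <- c("https://example.org/a", "https://example.org/b")
txt <- getURIAsynchronous(uris)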

Here's an example use case where this would be helpful. I want to submit 1,000 requests to the server, and each request takes 10 minutes to process (the server has to look up some data and do some math that takes a long time). However, the server can handle many thousands of simultaneous requests.

Currently, I'm looping through something like this:

requests <- lapply(urls, POST, ...)

This blocks on each request, so it takes 1,000 × 10 minutes to complete. It'd be really nice to be able to send each request off to the server without blocking on the request being completed. Then, after they have all been submitted, we can block on collecting the results with a loop like this:

requests <- lapply(urls, POST, ..., async=TRUE)
results  <- lapply(requests, httr::complete)

Here, httr::complete would be similar to RCurl::complete. The second example is nice because we can submit all the requests at once and let the server start processing them before blocking on gathering the results. In theory, this loop would take ~10 minutes to complete, plus the overhead of the two lapply loops.

@jeroen (Owner)

jeroen commented Dec 18, 2015

I think the natural R solution might be a non-blocking connection object, although R is limited to 128 connections, which is kind of lame.

@zachmayer (Author)

Would that be a feature request for R core? If so, how would I submit it to R core?

@jeroen (Owner)

jeroen commented Jun 13, 2016

There is now experimental support for async requests in the dev version of curl:

install.packages("https://github.com/jeroenooms/curl/archive/master.tar.gz", repos = NULL)
library(curl)
?multi

Let me know if you have any feedback.

Here is an example: https://github.com/jeroenooms/curl/blob/master/examples/crawler.R
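
A minimal sketch of the multi interface, assuming the API described in ?multi (the callback body and URLs here are illustrative, not part of the linked example):

library(curl)

results <- list()
pool <- new_pool()

# Queue each request; the 'done' callback fires when its response arrives.
for (url in c("https://httpbin.org/get", "https://httpbin.org/ip")) {
  h <- new_handle(url = url)
  multi_add(h, done = function(res) {
    results[[res$url]] <<- rawToChar(res$content)
  }, pool = pool)
}

# Perform all queued requests concurrently; blocks until the pool is drained.
multi_run(pool = pool)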

jeroen closed this as completed Jun 13, 2016
@zachmayer (Author)

Wahoo!!!

@jeroen (Owner)

jeroen commented Sep 16, 2016

curl 2.0, which includes the async support, should be on CRAN this week.
