multi_run could (should?) return as soon as some requests finish #104

Closed
gaborcsardi opened this Issue Jul 9, 2017 · 26 comments

Contributor

gaborcsardi commented Jul 9, 2017

The current implementation of multi_run makes it hard to program around it without wasting precious processing time. Here is a snippet that illustrates the problem:

pool <- new_pool()
{ 
  multi_add(new_handle(url = "https://httpbin.org/delay/10"), pool = pool)
  multi_add(new_handle(url = "https://httpbin.org/delay/2"), pool = pool)
  system.time(multi_run(pool = pool, timeout = 5))
}

This outputs:

   user  system elapsed
  0.231   0.357   5.831

The quicker request finishes at around 2 seconds, but multi_run still waits around until the timeout period expires. I guess for some applications this makes sense, but I would argue that for most it is better to return as soon as at least one request is completed. (If multiple requests are available immediately, it is OK to process them, but waiting for other requests to complete is not optimal, I think.)

Specifying timeout = 0 does not help either, because then I have to implement a wait-poll loop myself: either my app spends time sleeping and doing nothing, or it busy-waits, which is not good either.
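To illustrate, the hand-rolled wait-poll loop would look roughly like this (a sketch; the 0.1 second sleep interval is an arbitrary choice, and the exact fields of the multi_run() return value should be checked against ?multi_run):

```r
library(curl)

pool <- new_pool()
multi_add(new_handle(url = "https://httpbin.org/delay/2"), pool = pool)

# With timeout = 0, multi_run() does the currently pending work and
# returns immediately, so the app has to sleep between polls itself.
repeat {
  status <- multi_run(pool = pool, timeout = 0)
  if (status$pending == 0) break
  Sys.sleep(0.1)  # too short burns CPU, too long adds latency
}
```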

Ideally, I want my app to get the control back, as soon as there is something to work on, i.e. as soon as (at least) one request has completed. This would be similar to how the standard Unix poll system call works.

I am not sure how you would want to implement this in a way that the current semantics are still supported for backwards compatibility, but maybe another multi_poll() function could help. This would use the same internals, but it would have the new, poll-like semantics.

Owner

jeroen commented Jul 9, 2017 edited

The idea is to use the callback functions in multi_add to do async processing as you would do in JavaScript. In the callback you can take control, and then once you return from the callback, multi_run takes back control.

The callback can exit multi_run by raising an error.

library(curl)

done <- function(res){
  print("Done with a request. Going to do stuff now.")
}

pool <- new_pool()
multi_add(new_handle(url = "https://httpbin.org/delay/10"), pool = pool, done=done)
multi_add(new_handle(url = "https://httpbin.org/delay/2"), pool = pool, done=done)
system.time(multi_run(pool = pool, timeout = 5))

What are you trying to do that cannot be implemented this way?

Contributor

gaborcsardi commented Jul 9, 2017 edited

That complicates the processing considerably: I just cannot orchestrate the callbacks for an arbitrary number of requests. multi_run is actually perfect, except for the waiting. It is just much simpler to write an event loop with it:

while (TRUE) {
    reqs <- multi_run(pool = pool)
    process_just_finished_requests()
    if (!reqs$pending) break
}

I actually need to use the callbacks to save the results; there is no other way to get the results, AFAIK. And using the callbacks does not help with the time lost while waiting, anyway.

Contributor

gaborcsardi commented Jul 9, 2017

Callbacks are just not a good way to get a nice control flow. Not even in JS, and they have great tools for them, like async. There are no such tools in R, so it is worse for us.

So ideally my callbacks would just put down the result at a pre-specified place, and the actual program logic would go into my main event loop.
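For example, the callbacks could be simple closures that only stash the response under a key, leaving all real logic to the main loop. A minimal sketch (the results environment and the collector() helper are my own illustration, not part of curl):

```r
library(curl)

results <- new.env()

# Returns a done-callback that stores the response under the given key.
collector <- function(key) {
  function(res) assign(key, res, envir = results)
}

pool <- new_pool()
multi_add(new_handle(url = "https://httpbin.org/get?q=one"),
          done = collector("one"), pool = pool)
multi_add(new_handle(url = "https://httpbin.org/get?q=two"),
          done = collector("two"), pool = pool)

multi_run(pool = pool)
ls(results)  # keys of the completed requests
```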

Owner

jeroen commented Jul 9, 2017

I don't quite understand. Which time is lost? You can just call process_just_finished_requests() from the callback function?

So do you suggest we add a parameter to multi_run to return for max n results (n=1 in your case)? Or a multi_poll() function?

I personally think the JavaScript closure-callback design is more elegant and gives finer control than the C-style poll-and-return pattern, but I guess that's a matter of taste.

Owner

jeroen commented Jul 9, 2017

I think you've been doing too much C programming lately :) JavaScript style callbacks are the only good way to manage many concurrent async requests (possibly recursive by making new requests from the callbacks to do paging, etc).

Contributor

gaborcsardi commented Jul 9, 2017

I have a bunch of functions that have this kind of logic:

function() {

  cat("one\n")
  resp1 <- async_http("https://httpbin.org/get?q=one")
  cat(resp1$url)

  cat("two\n")
  resp2 <- async_http("https://httpbin.org/get?q=two")
  cat(resp2$url)

  cat("three\n")
}

I.e. they make some HTTP calls occasionally. I want them to do the HTTP asynchronously, and I want one such function to be running while the others are waiting for their HTTP results. (Unless all of them are waiting.)

This is trivial to implement in JS, but impossible in R, so I am trying to transform these functions to some construct that is roughly equivalent. I think it is mostly doable, but dealing with the callbacks is a pain in the neck.

I don't quite understand. Which time is lost? You can just call process_just_finished_requests() from the callback function?

I could, but that's the complicated thing. My functions (like the one above) are essentially dynamic, which means they can make an arbitrary number of async calls. That means an arbitrary number of callbacks, each one different. And since I don't know in advance what the functions will look like, I would probably have to generate the code of the callbacks dynamically. Etc.

So do you suggest we add a parameter to multi_run to return for max n results (n=1 in your case)? Or a multi_poll() function?

Yeah, having an n argument is an easy fix, I think. multi_poll is an alternative, but probably more work, and it might cause confusion.

I personally think the javascript closure-callback design is more elegant and gives finer control than C style the poll-and-return pattern but I guess that's a matter of taste.

Yes, if the callbacks are not too deep, and if the structure of the design is fixed in the source code. But it is also no wonder that the JS world is abandoning the callbacks and is switching to promises. Callbacks do not scale. But my main problem is that the structure is not fixed, so callbacks are just not a good fit here.

Contributor

gaborcsardi commented Jul 9, 2017

Anyway, I think adding the n argument is probably not too hard, and an easy fix for me. :)

Owner

jeroen commented Jul 9, 2017

OK, I'll try to add that. BTW: as you were using this, do you feel the busy-wait problem is solved? I never implemented unit tests for this because I don't know how.

Contributor

gaborcsardi commented Jul 9, 2017

I don't know, sorry. I checked manually a couple of times, for various cases, and I haven't seen it, but it would indeed be good to test.

I'll think about how we could test it.

Owner

jeroen commented Jul 9, 2017

OK I added something. Is this what you have in mind? Do you like the parameter name?

Contributor

gaborcsardi commented Jul 9, 2017

I think this will work.

The name is a little bit misleading for me. Isn't min_results better? As soon as there are min_results results, multi_run returns?

Owner

jeroen commented Jul 9, 2017

OK maybe that's better indeed.

Owner

jeroen commented Jul 9, 2017

Although that kind of implies it overrides timeout.

Contributor

gaborcsardi commented Jul 9, 2017

Yeah, the timeout is just stronger. It has to be stronger. So yeah, it returns as soon as there are min_results results available, or a timeout kicks in.

Owner

jeroen commented Jul 9, 2017

We could also call it poll so that poll = TRUE equals 1 result when converted to an int.

Contributor

gaborcsardi commented Jul 9, 2017

poll is good, too.

Owner

jeroen commented Jul 9, 2017

Though arguments poll and pool might be prone to typos :)

Contributor

gaborcsardi commented Jul 9, 2017

Yeah. And also, in general, arguments that can take either logical or integer values are not a good idea...

Contributor

gaborcsardi commented Jul 9, 2017

Typos are not a big problem, because they will show up immediately...

Contributor

gaborcsardi commented Jul 9, 2017

I would add the argument at the end, though.

Contributor

gaborcsardi commented Jul 9, 2017

Because in the middle it breaks current usage if the arguments are not named:

multi_run(x, y)
Owner

jeroen commented Jul 9, 2017

I know, but I like the pool argument to be the last one. There are only a few people who use multi_run, and they are probably smart enough to use named arguments?

Contributor

gaborcsardi commented Jul 9, 2017

On CRAN, only crul uses it, AFAICT.

But it is up to you. :)

I think if you know that you want to keep something as the last argument, it is better to have

multi_run(timeout = Inf, ..., pool = NULL)
Contributor

gaborcsardi commented Jul 9, 2017

Then they have to explicitly name it.

Owner

jeroen commented Jul 22, 2017

This has landed in 2.8.1 on CRAN.
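For the record, the released version gave multi_run() a poll argument; with poll = TRUE it returns as soon as at least one request completes. A sketch based on the example at the top of the thread (see ?multi_run for the exact semantics):

```r
library(curl)

pool <- new_pool()
multi_add(new_handle(url = "https://httpbin.org/delay/10"), pool = pool)
multi_add(new_handle(url = "https://httpbin.org/delay/2"), pool = pool)

# Returns after roughly 2 seconds, when the first request completes,
# instead of waiting for the full 5 second timeout.
system.time(multi_run(pool = pool, timeout = 5, poll = TRUE))
```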

@jeroen jeroen closed this Jul 22, 2017

Contributor

gaborcsardi commented Jul 22, 2017

Cool! Thanks much for adding this! I built a very cool async HTTP tool with curl, I'll show it the next time we talk.
