This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Add batch geocoding feature #13

Closed
sgillies opened this issue Aug 11, 2015 · 14 comments

Comments

@sgillies
Contributor

https://www.mapbox.com/developers/api/geocoding/#batch

A POST endpoint for these batch requests would be a little easier to use, but no biggie.

@sbma44
Member

sbma44 commented Aug 24, 2015

Curious to talk through how to implement this. Basic problems:

  • batch is only useful if you're worried about throughput
  • if you're worried about throughput you're going to want to run parallel HTTP requests
  • parallel HTTP requests mean results will arrive asynchronously, which means you need to worry about mapping input to output, and there's not an obvious right way to do this independent of end-user implementation (from my perspective anyway). I could imagine a generator or something that resorts results before a return...

I don't have an elegant answer to this; in the demo code I just throw the results onto disk and expect the user to come back and sort it out.

@sgillies
Contributor Author

@sbma44 introducing users to parallel requests is going to run them up against the rate limits pretty quickly, isn't it? Say we create 12 threads and each makes a batch of 50 geocode queries... there's a UX issue to work out.

The mbx-geocode command has some support for batching built in already: it can get queries from stdin and doesn't presume that there's only one. We could collect queries into batches, send them to the batch geocode endpoint, and write out the results of each batch, synchronously. If we gave users an async option, maybe we would also put the responsibility on them of looking at the query object in each response instead of trusting any particular order. There are already some standard methods in Python that behave like this; the multiprocessing module's imap_unordered() is one.
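The "collect queries into batches" step above could be sketched like this (a minimal illustration; the function name is hypothetical and not part of the client, and the 50-query cap comes from the batch endpoint's documented limit):

```python
from itertools import islice

def batches(queries, size=50):
    """Yield successive batches of up to `size` queries.

    The batch geocoding endpoint accepts at most 50 queries
    per request, so 120 queries become batches of 50, 50, 20.
    """
    it = iter(queries)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Example: chunk 120 stdin-style queries for synchronous dispatch
chunks = list(batches(["query %d" % i for i in range(120)]))
```

Each chunk would then be joined and sent as a single batch request, with results written out in order before the next chunk is sent.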

@sbma44
Member

sbma44 commented Aug 25, 2015

Good point about ratelimiting. If we don't parallelize, this is as simple as query=';'.join(queries).
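As a minimal sketch of that non-parallel path (the helper name is hypothetical):

```python
def batch_query(queries):
    # The batch endpoint takes up to 50 semicolon-joined queries
    # in a single request.
    if len(queries) > 50:
        raise ValueError("batch geocoding accepts at most 50 queries")
    return ";".join(queries)

q = batch_query(["1600 Pennsylvania Ave NW", "20 W 34th St"])
```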

@sbma44
Member

sbma44 commented Aug 25, 2015

Talked about this a bit more with @sgillies -- we're still not sure just yet how to make this feature meaningful to users. Closing for now; if people wind up needing the feature we can reopen.

@sbma44 sbma44 closed this as completed Aug 25, 2015
@sgillies
Contributor Author

Just about to write the same thing. Intention is to make use of the batch geocode endpoint where it makes sense, but not surface it specifically in the Python client.

@perrygeo
Contributor

@sbma44 @sgillies I'd like to clarify the decisions on batch geocoding and maybe revisit this now that the SDK and the CLI are a bit further along.

If I understand correctly, the advantage of using the batch geocoding at the API level is you can get many more geocodes per request. This would be more desirable than making many concurrent requests as you wouldn't hit the rate limit as quickly, right?

In the SDK:
Using the mapbox.places-permanent dataset and a ;-delimited query string, an SDK user can effectively do this already. I wonder if we can make this explicit by allowing a list of queries when they are using the permanent dataset. Also, because batch mode returns an array, we'd need to figure out how to expose it through the .geojson method.

In the CLI:
We don't have any way to specify the dataset at the CLI (mapbox/mapbox-cli-py#40). Once we do, batching should be possible by multiple args.

I'm still not entirely clear on the best way to do this, but batch geocoding is going to come up eventually and right now we don't have a good story for achieving it at the SDK or CLI level. I'd like to at least clarify and document some suggested approaches.

@perrygeo perrygeo reopened this Dec 24, 2015
@sbma44
Member

sbma44 commented Dec 24, 2015

If I understand correctly, the advantage of using the batch geocoding at the API level is you can get many more geocodes per request. This would be more desirable than making many concurrent requests as you wouldn't hit the rate limit as quickly, right?

No, the rate limit is per-query, not per request. You don't get anything for free here. The advantages of batch are:

  • avoiding HTTP overhead
  • the potential for us to parallelize on the server-side to improve performance (we do not currently do this in a meaningful way)
  • it's appealing to users seeking high throughput who are most comfortable writing blocking code (even though it doesn't actually buy them much)

I dislike batch because

  • if one query fails the entire request 500s
  • the semicolon delimiting seems to trip up a lot of people
  • requests are pretty much uncacheable

We have a few recent features (e.g. country filtering) that will probably be landing in a PR here soon, and adding batch then might make sense. But I think the bigger advantage, by far, would be building in systems for adaptive ratelimiting and parallelized requests, rather than focusing on the batch query mode itself. I have example code that does this but am very sure it could benefit from a less hacky python dev's eyes :P (also, from a non-beta dependency -- that gevent==1.1b5 weirds me out a little).
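The adaptive ratelimiting idea might look something like the sketch below. The header names are assumptions for illustration, not necessarily what the API sends; the point is just that the decision to wait can be a small pure function driven by response headers:

```python
def backoff_seconds(headers, now):
    """Decide how long to wait before the next request, based on
    rate-limit response headers (header names are illustrative)."""
    remaining = int(headers.get("X-Rate-Limit-Remaining", 1))
    reset = int(headers.get("X-Rate-Limit-Reset", now))
    if remaining > 0:
        return 0.0  # budget left: send immediately
    # out of budget: wait until the rate-limit window resets
    return max(0.0, float(reset - now))

# Example: 0 requests remaining, window resets 5 seconds from now
wait = backoff_seconds(
    {"X-Rate-Limit-Remaining": "0", "X-Rate-Limit-Reset": "105"}, now=100)
```

A worker pool would call this between requests, which keeps the throttling logic separate from the concurrency machinery.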

@perrygeo
Contributor

No, the rate limit is per-query, not per request.

We should clarify that in the docs. The batch docs say "Up to 50 queries may be included in a single batch geocoding request" and the rate limits are specified in requests per minute.

dislike batch because...

All good reasons not to go that route, though the SDK could potentially hide the semicolon delimiting.

adaptive ratelimiting and parallelized requests

That looks like a good approach! I think we could port that logic over to asyncio, which is the new standard for asynchronous coding in Python.

What about this: Adding batch_forward and batch_reverse methods on the geocoding service, performing the queries in parallel and returning a list of responses? Or should the parallelization logic be external to the SDK, e.g. the CLI would handle the multiple args and the async code?
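A sketch of what a parallel batch_forward could look like, with the single-query call injected so the concurrency is visible. Note that batch_forward is the name proposed in this comment, not an existing SDK method:

```python
from concurrent.futures import ThreadPoolExecutor

def batch_forward(queries, geocode, max_workers=4):
    # Run the single-query `geocode` callable over all queries in
    # parallel. executor.map returns results in input order even
    # though the calls complete asynchronously, so the caller gets
    # a list of responses aligned with `queries`.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(geocode, queries))

# Demo with a stand-in for the real forward-geocoding call
results = batch_forward(["a", "b", "c"], lambda q: q.upper())
```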

@perrygeo
Contributor

perrygeo commented Jan 2, 2016

Here's an asyncio version of the batch_geocode CLI: https://gist.github.com/perrygeo/60633de1b2cd90b07b3d

Some issues with it

  • This only works on Python 3.4 (would need to port to Trollius to get python < 3.4 support).
  • no error handling or match thresholds, it just takes the first feature
  • there might still be a place for async stuff in the SDK but I'm leaning towards keeping the SDK simple and doing the concurrent requests in the CLI client.
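For reference, the core pattern of an asyncio version looks roughly like this (written with modern async/await syntax rather than the 3.4-era yield from form the gist would need; fake_geocode stands in for the real HTTP call):

```python
import asyncio

async def geocode_all(queries, geocode, limit=10):
    # Cap in-flight requests with a semaphore so unbounded
    # concurrency doesn't slam the rate limit.
    sem = asyncio.Semaphore(limit)

    async def one(query):
        async with sem:
            return await geocode(query)

    # gather preserves input order, which sidesteps the
    # "mapping input to output" problem for the caller.
    return await asyncio.gather(*(one(q) for q in queries))

async def fake_geocode(query):
    # Stand-in for an aiohttp/requests call to the geocoding API
    return {"query": query}

results = asyncio.run(geocode_all(["a", "b"], fake_geocode))
```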

@sbma44
Member

sbma44 commented Jan 4, 2016

We should clarify that in the docs.

Agreed, PR here: https://github.com/mapbox/www.mapbox.com/pull/6565. In this ticket I've been using request and query to distinguish between "an HTTP interaction" and "a geocoding question that we should answer" but a quick review shows that our docs tend to focus on the word requests. I think this is fine and makes it easier to firewall the language in the batch section.

Thanks very much for tackling the batch endpoint. Quick reactions:

  • What's the pan-SDK perspective on Python 2 support? I jumped through some hoops to get it when writing that sample code but perhaps it's time to ✂️
  • I think we should look at the CLI/SDK line from two perspectives: as a quick & dirty way for clients to get things done, and as a reference implementation. My suspicion is that the SDK is the right home for the reference implementation and the CLI the right home for code that we expect users to run without reading it first. The non-obvious/valuable parts of batch implementations are concurrency and handling ratelimits. The "stick stuff together with semicolons and get back a 200" part is not that valuable to us or to clients. So my own instinct would be to make sure that concurrency lives in the SDK; and (for instance) decisions about throwing out non-zeroth position results happen in the CLI.

But this is just my own gut reaction. Can you say more about your thinking on where this line belongs?

@perrygeo
Contributor

perrygeo commented Jan 5, 2016

Python 2 support? I think Python 2 will be around for a long long time to come, most likely even after official support ends in 2020. We should support it until at least then.

My current thinking on the SDK vs CLI: The SDK is low-level and tracks the APIs very closely. The CLI can contain higher level abstractions like batching and concurrent requests. They'd be hidden from the user by a clean command line interface of course.

@sbma44
Member

sbma44 commented Jan 5, 2016

That distinction makes sense to me. Still, appropriate use of the rate limit headers is a core part of using the API correctly. Perhaps an ops or util module that lives in the SDK and is used by the CLI?

@perrygeo
Contributor

perrygeo commented Jan 7, 2016

I guess you could say we're currently handling rate limiting by asking for forgiveness, not permission. IOW, the user of the sdk can just keep making requests until the server tells them "No" which should show up in the response message/statuscode/headers.
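The "forgiveness, not permission" loop could be sketched like this; `send` is any zero-argument callable returning a response-like object, and all names here are illustrative rather than part of the SDK:

```python
import time

def request_with_retry(send, max_tries=3, base_delay=1.0, sleep=time.sleep):
    # Keep making requests until the server stops telling us "No":
    # on a 429 Too Many Requests, back off exponentially and retry.
    for attempt in range(max_tries):
        resp = send()
        if resp.status_code != 429:
            return resp
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return resp

# Demo with canned responses: one 429, then a 200
class _Resp:
    def __init__(self, status_code):
        self.status_code = status_code

_codes = [429, 200]
slept = []
resp = request_with_retry(lambda: _Resp(_codes.pop(0)), sleep=slept.append)
```

Injecting `sleep` keeps the backoff testable; in real use the default `time.sleep` applies.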

I like the idea of keeping the sdk simple but providing an ops module that would effectively be a reference implementation for how to do things like concurrent requests that respect rate limits. OTOH maybe the CLI could be that reference implementation.

@perrygeo
Contributor

perrygeo commented Jan 7, 2016

I'm going to close this in favor of mapbox/mapbox-cli-py#44 and #104

@perrygeo perrygeo closed this as completed Jan 7, 2016