This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Add batch geocoding feature #13

Closed
sgillies opened this issue Aug 11, 2015 · 14 comments

Comments

@sgillies
Contributor

https://www.mapbox.com/developers/api/geocoding/#batch

A POST endpoint for these batch requests would be a little easier to use, but no biggie.

@sbma44
Member

sbma44 commented Aug 24, 2015

Curious to talk through how to implement this. Basic problems:

  • batch is only useful if you're worried about throughput
  • if you're worried about throughput you're going to want to run parallel HTTP requests
  • parallel HTTP requests mean results will arrive asynchronously, which means you need to worry about mapping input to output, and there's not an obvious right way to do this independent of end-user implementation (from my perspective anyway). I could imagine a generator or something that resorts results before a return...

I don't have an elegant answer to this; in the demo code I just throw the results onto disk and expect the user to come back and sort it out.

@sgillies
Contributor Author

@sbma44 introducing users to parallel requests is going to run them up against the rate limits pretty quickly, isn't it? Say we create 12 threads and each makes a batch of 50 geocode queries... there's a UX issue to work out.

The mbx-geocode command has some support for batching built in already: it can get queries from stdin and doesn't presume that there's only one. We could collect queries into batches, send them to the batch geocode endpoint, and write out the results of each batch, synchronously. If we gave users an async option, maybe we would also put the responsibility on them of looking at the query object in each response instead of trusting any particular order. There are already some standard methods in Python that behave like this; the multiprocessing module's imap_unordered() is one.
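The "collect queries into batches" step above could be sketched like this (a minimal illustration; the function name is hypothetical and not part of the client, and the 50-query cap comes from the batch endpoint's documented limit):

```python
from itertools import islice

def batches(queries, size=50):
    """Yield successive batches of up to `size` queries.

    The batch geocoding endpoint accepts at most 50 queries
    per request, so 120 queries become batches of 50, 50, 20.
    """
    it = iter(queries)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Example: chunk 120 stdin-style queries for synchronous dispatch
chunks = list(batches(["query %d" % i for i in range(120)]))
```

Each chunk would then be joined and sent as a single batch request, with results written out in order before the next chunk is sent.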

@sbma44
Member

sbma44 commented Aug 25, 2015

Good point about ratelimiting. If we don't parallelize, this is as simple as query=';'.join(queries).
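As a minimal sketch of that non-parallel path (the helper name is hypothetical):

```python
def batch_query(queries):
    # The batch endpoint takes up to 50 semicolon-joined queries
    # in a single request.
    if len(queries) > 50:
        raise ValueError("batch geocoding accepts at most 50 queries")
    return ";".join(queries)

q = batch_query(["1600 Pennsylvania Ave NW", "20 W 34th St"])
```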

@sbma44
Member

sbma44 commented Aug 25, 2015

Talked about this a bit more with @sgillies -- we're still not sure just yet how to make this feature meaningful to users. Closing for now; if people wind up needing the feature we can reopen.

@sbma44 sbma44 closed this as completed Aug 25, 2015
@sgillies
Contributor Author

Just about to write the same thing. Intention is to make use of the batch geocode endpoint where it makes sense, but not surface it specifically in the Python client.

@perrygeo
Contributor

@sbma44 @sgillies I'd like to clarify the decisions on batch geocoding and maybe revisit this now that the SDK and the CLI are a bit further along.

If I understand correctly, the advantage of using the batch geocoding at the API level is you can get many more geocodes per request. This would be more desirable than making many concurrent requests as you wouldn't hit the rate limit as quickly, right?

In the SDK:
Using the mapbox.places-permanent dataset and a ;-delimited query string, an SDK user can effectively do this already. I wonder if we can make this explicit by allowing a list of queries when they are using the permanent dataset. Also, because batch mode returns an array, we'd need to figure out how to expose it through the .geojson method.

In the CLI:
We don't have any way to specify the dataset at the CLI (mapbox/mapbox-cli-py#40). Once we do, batching should be possible by multiple args.

I'm still not entirely clear on the best way to do this, but batch geocoding is going to come up eventually and right now we don't have a good story for achieving it at the SDK or CLI level. I'd like to at least clarify and document some suggested approaches.

@perrygeo perrygeo reopened this Dec 24, 2015
@sbma44
Member

sbma44 commented Dec 24, 2015

If I understand correctly, the advantage of using the batch geocoding at the API level is you can get many more geocodes per request. This would be more desirable than making many concurrent requests as you wouldn't hit the rate limit as quickly, right?

No, the rate limit is per-query, not per request. You don't get anything for free here. The advantages of batch are:

  • avoiding HTTP overhead
  • the potential for us to parallelize on the server-side to improve performance (we do not currently do this in a meaningful way)
  • it's appealing to users seeking high throughput who are most comfortable writing blocking code (even though it doesn't actually buy them much)

I dislike batch because

  • if one query fails the entire request 500s
  • the semicolon delimiting seems to trip up a lot of people
  • requests are pretty much uncacheable

We have a few recent features (e.g. country filtering) that will probably be landing in a PR here soon, and adding batch then might make sense. But I think the bigger advantage, by far, would be building in systems for adaptive ratelimiting and parallelized requests, rather than focusing on the batch query mode itself. I have example code that does this but am very sure it could benefit from a less hacky python dev's eyes :P (also, from a non-beta dependency -- that gevent==1.1b5 weirds me out a little).
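The adaptive ratelimiting idea might look something like the sketch below. The header names are assumptions for illustration, not necessarily what the API sends; the point is just that the decision to wait can be a small pure function driven by response headers:

```python
def backoff_seconds(headers, now):
    """Decide how long to wait before the next request, based on
    rate-limit response headers (header names are illustrative)."""
    remaining = int(headers.get("X-Rate-Limit-Remaining", 1))
    reset = int(headers.get("X-Rate-Limit-Reset", now))
    if remaining > 0:
        return 0.0  # budget left: send immediately
    # out of budget: wait until the rate-limit window resets
    return max(0.0, float(reset - now))

# Example: 0 requests remaining, window resets 5 seconds from now
wait = backoff_seconds(
    {"X-Rate-Limit-Remaining": "0", "X-Rate-Limit-Reset": "105"}, now=100)
```

A worker pool would call this between requests, which keeps the throttling logic separate from the concurrency machinery.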

@perrygeo
Contributor

No, the rate limit is per-query, not per request.

We should clarify that in the docs. The batch docs say "Up to 50 queries may be included in a single batch geocoding request" and the rate limits are specified in requests per minute.

dislike batch because...

All good reasons not to go that route, though the SDK could potentially hide the semicolon delimiting.

adaptive ratelimiting and parallelized requests

That looks like a good approach! I think we could port that logic over to asyncio, which is the new standard for asynchronous coding in Python.

What about this: Adding batch_forward and batch_reverse methods on the geocoding service, performing the queries in parallel and returning a list of responses? Or should the parallelization logic be external to the SDK, e.g. the CLI would handle the multiple args and the async code?
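A sketch of what a parallel batch_forward could look like, with the single-query call injected so the concurrency is visible. Note that batch_forward is the name proposed in this comment, not an existing SDK method:

```python
from concurrent.futures import ThreadPoolExecutor

def batch_forward(queries, geocode, max_workers=4):
    # Run the single-query `geocode` callable over all queries in
    # parallel. executor.map returns results in input order even
    # though the calls complete asynchronously, so the caller gets
    # a list of responses aligned with `queries`.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(geocode, queries))

# Demo with a stand-in for the real forward-geocoding call
results = batch_forward(["a", "b", "c"], lambda q: q.upper())
```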

@perrygeo
Contributor

perrygeo commented Jan 2, 2016

Here's an asyncio version of the batch_geocode CLI: https://gist.github.com/perrygeo/60633de1b2cd90b07b3d

Some issues with it

  • This only works on Python 3.4 (would need to port to Trollius to get python < 3.4 support).
  • no error handling or match thresholds, it just takes the first feature
  • there might still be a place for async stuff in the SDK but I'm leaning towards keeping the SDK simple and doing the concurrent requests in the CLI client.
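For reference, the core pattern of an asyncio version looks roughly like this (written with modern async/await syntax rather than the 3.4-era yield from form the gist would need; fake_geocode stands in for the real HTTP call):

```python
import asyncio

async def geocode_all(queries, geocode, limit=10):
    # Cap in-flight requests with a semaphore so unbounded
    # concurrency doesn't slam the rate limit.
    sem = asyncio.Semaphore(limit)

    async def one(query):
        async with sem:
            return await geocode(query)

    # gather preserves input order, which sidesteps the
    # "mapping input to output" problem for the caller.
    return await asyncio.gather(*(one(q) for q in queries))

async def fake_geocode(query):
    # Stand-in for an aiohttp/requests call to the geocoding API
    return {"query": query}

results = asyncio.run(geocode_all(["a", "b"], fake_geocode))
```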

@sbma44
Member

sbma44 commented Jan 4, 2016

We should clarify that in the docs.

Agreed, PR here: https://github.com/mapbox/www.mapbox.com/pull/6565. In this ticket I've been using request and query to distinguish between "an HTTP interaction" and "a geocoding question that we should answer" but a quick review shows that our docs tend to focus on the word requests. I think this is fine and makes it easier to firewall the language in the batch section.

Thanks very much for tackling the batch endpoint. Quick reactions:

  • What's the pan-SDK perspective on Python 2 support? I jumped through some hoops to get it when writing that sample code but perhaps it's time to ✂️
  • I think we should look at the CLI/SDK line from two perspectives: as a quick & dirty way for clients to get things done, and as a reference implementation. My suspicion is that the SDK is the right home for the reference implementation and the CLI the right home for code that we expect users to run without reading it first. The non-obvious/valuable parts of batch implementations are concurrency and handling ratelimits. The "stick stuff together with semicolons and get back a 200" part is not that valuable to us or to clients. So my own instinct would be to make sure that concurrency lives in the SDK; and (for instance) decisions about throwing out non-zeroth position results happen in the CLI.

But this is just my own gut reaction. Can you say more about your thinking on where this line belongs?

@perrygeo
Contributor

perrygeo commented Jan 5, 2016

Python 2 support? I think Python 2 will be around for a long long time to come, most likely even after official support ends in 2020. We should support it until at least then.

My current thinking on the SDK vs CLI: The SDK is low-level and tracks the APIs very closely. The CLI can contain higher level abstractions like batching and concurrent requests. They'd be hidden from the user by a clean command line interface of course.

@sbma44
Member

sbma44 commented Jan 5, 2016

That distinction makes sense to me. Still, appropriate use of the rate limit headers is a core part of using the API correctly. Perhaps an ops or util module that lives in the SDK and is used by the CLI?

@perrygeo
Contributor

perrygeo commented Jan 7, 2016

I guess you could say we're currently handling rate limiting by asking for forgiveness, not permission. IOW, the user of the sdk can just keep making requests until the server tells them "No" which should show up in the response message/statuscode/headers.
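The "forgiveness, not permission" loop could be sketched like this; `send` is any zero-argument callable returning a response-like object, and all names here are illustrative rather than part of the SDK:

```python
import time

def request_with_retry(send, max_tries=3, base_delay=1.0, sleep=time.sleep):
    # Keep making requests until the server stops telling us "No":
    # on a 429 Too Many Requests, back off exponentially and retry.
    for attempt in range(max_tries):
        resp = send()
        if resp.status_code != 429:
            return resp
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return resp

# Demo with canned responses: one 429, then a 200
class _Resp:
    def __init__(self, status_code):
        self.status_code = status_code

_codes = [429, 200]
slept = []
resp = request_with_retry(lambda: _Resp(_codes.pop(0)), sleep=slept.append)
```

Injecting `sleep` keeps the backoff testable; in real use the default `time.sleep` applies.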

I like the idea of keeping the sdk simple but providing an ops module that would effectively be a reference implementation for how to do things like concurrent requests that respect rate limits. OTOH maybe the CLI could be that reference implementation.

@perrygeo
Contributor

perrygeo commented Jan 7, 2016

I'm going to close this in favor of mapbox/mapbox-cli-py#44 and #104

@perrygeo perrygeo closed this as completed Jan 7, 2016