New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[blogpost] Local search with PostGIS #6

javisantana opened this Issue Jul 10, 2015 · 0 comments


None yet
1 participant

javisantana commented Jul 10, 2015

Local search with PostGIS

I was talking with some friends about a geo problem they have:

Given a set of elements with a position and a name in a database give me the N closest to a certain point that
match some pattern on the name.

So imagine you have openstreetmap database and want to find the first 300 banks and bars closer to
Madrid city center (pretty interesting combination I'd say, in Spain ATMs are in the banks).

So in order to test it I loaded Madrid OSM in a postgres database. I just downloaded the data from geofabrik site and imported using osm2pgsql tool, pretty straightforward.

Then I created some indices for the way and amenity columns (using full text search stuff)

-- gist index for the geometry, full text search for the text
create index on planet_osm_point gist(way);
create index on planet_osm_point using to_tsvector('spanish', amenity)

First try, use Nearest Neighbour search

Since postgis 2.0 we have Nearest Neighbour search (thanks to CartoDB which founded it) that allows
to use the geospatial index to sort results (read this blogpost in boundless blog), so the first try was to order by distance operator <-> the results from the text filter.

select way, amenity from planet_osm_point 
to_tsvector('spanish', amenity) @@ to_tsquery('ba:*') 
order by way <->'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry limit 301

This takes around 48ms (everything cached). Looking at the explain analyze out of curiosity I
realize spatial index wasn't being used:

Limit  (cost=23947.86..23948.61 rows=301 width=41) (actual time=48.729..48.780 rows=301 loops=1)
   ->  Sort  (cost=23947.86..24035.30 rows=34976 width=41) (actual time=48.727..48.755 rows=301 loops=1)
         Sort Key: ((way <-> '010100002031BF0D00F8DAD368D52F19C1C324815603CB5241'::geometry))
         Sort Method: top-N heapsort  Memory: 48kB
         ->  Bitmap Heap Scan on planet_osm_point  (cost=1363.07..22333.09 rows=34976 width=41) (actual time=12.120..41.008 rows=17382 loops=1)
               Recheck Cond: (to_tsvector('spanish'::regconfig, amenity) @@ to_tsquery('ba:*'::text))
               ->  Bitmap Index Scan on planet_osm_point_to_tsvector_idx1  (cost=0.00..1354.32 rows=34976 width=0) (actual time=10.889..10.889 rows=17382 loops=1)
                     Index Cond: (to_tsvector('spanish'::regconfig, amenity) @@ to_tsquery('ba:*'::text))
 Total runtime: 48.877 ms

I don't fully understand how the postgres planner works but sounds like it might be using a bitmap
and operation both indices. Increasing the limit does not change anything, I thought it could change
the selectivity. I tried a search by distance:

SELECT way, amenity FROM planet_osm_point 
to_tsvector('spanish', amenity) @@ to_tsquery('ba:*') 
st_dwithin(way, 'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry, 500)

and I get the desired BitmapAnd:

 ->  BitmapAnd  (cost=1422.90..1422.90 rows=42 width=0) (actual time=14.881..14.881 rows=0 loops=1)
         ->  Bitmap Index Scan on planet_osm_point_index  (cost=0.00..68.32 rows=2121 width=0) (actual time=2.229..2.229 rows=6028 loops=1)
               Index Cond: (way && '010300002031BF0D000100000005000000F8DAD368154F19C1C32481560FC95241F8DAD368154F19C1C3248156F7CC5241F8DAD368951019C1C3248156F7CC5241F8DAD368951019C1C32481560FC95241F8DAD368154F19C1C32481560FC95241'::geometry)
         ->  Bitmap Index Scan on planet_osm_point_to_tsvector_idx1  (cost=0.00..1354.32 rows=34976 width=0) (actual time=12.607..12.607 rows=17382 loops=1)

Second try, use a recursive query

But that's not the result I was looking for, I need the first 300 bars and banks closer to my
location, no matter if they are 30km away

So I though about having a kind of python generator, a query that returns results on demand until I
have enough results.

Reading postgres documentation there is a nice trick in CTE documentation, you can do a recursive
query without stop condition that stops where the external query reach the limit.

WITH RECURSIVE t(osm_id, way, amenity, distance) AS (
    SELECT osm_id, way, amenity, 100.0 as distance from planet_osm_point
    to_tsvector('spanish', amenity) @@ to_tsquery('ba:*')
    st_dwithin(way, 'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry, 100) 
    order by way <-> 'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry
    SELECT p.osm_id, p.way, p.amenity, prev.distance * 2 as distance
    FROM planet_osm_point p , (SELECT distance FROM t LIMIT 1) prev
    to_tsvector('spanish', p.amenity) @@ to_tsquery('ba:*')
    st_dwithin(p.way, 'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry, prev.distance * 2)
      AND not
    st_dwithin(p.way, 'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry, prev.distance)
    order by way <-> 'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry
results as  (
    select *, st_distance(way,'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry) as real_dist FROM t limit 300
SELECT * FROM results

The time for this query is around 48ms so no improvement at all. But that may depend on the number
of iterations it needs to do until fetch all the results. Starting with 300 meters takes 4 loops
since last bar is 2393 meters away form city center.

If it starts the iteration with a bigger radius, like 1500 (2 iterations), the query time is 25ms. If it only needs
to do the fist iteration, it's 18ms which is much better.

So how do we know what would be a good value to start? Hard to say without some density information stats... luckily postgres has pretty good stats about indices and there are good ways to access it: EXPLAIN and

spain=# explain select 1 FROM planet_osm_point where   to_tsvector('spanish', amenity) @@ to_tsquery('ba:*')       ;
                                              QUERY PLAN
 Bitmap Heap Scan on planet_osm_point  (cost=1363.07..22245.65 rows=34976 width=0)
   Recheck Cond: (to_tsvector('spanish'::regconfig, amenity) @@ to_tsquery('ba:*'::text))
   ->  Bitmap Index Scan on planet_osm_point_to_tsvector_idx1  (cost=0.00..1354.32 rows=34976 width=0)
         Index Cond: (to_tsvector('spanish'::regconfig, amenity) @@ to_tsquery('ba:*'::text))
(4 rows)

Forget everything but rows=34976 that's the estimate of total number of bars and banks in the
whole table (the real count value is 17k but let's see if it's good enough)

In order to know how many points there are in a certain are we can use _postgis_selectivity
function. It's kind of hidden, is what actually postgres stats planner uses.

select _postgis_selectivity('planet_osm_point', 'way', st_expand('SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry, 5000))

(you can get the same value using EXPLAIN but I like _postgis_selectivity)

So knowing the selectivity (the part of total rows selected by that bbox), the total number of bars
and banks and the total count (1.8M) we can estimate:

total_rows * percentaje_categories * selectivity
1748797 * 0.02 * 0.00751143626570702 ~ 262.0

So in 10km area we have an stimation of 262 points, so to get 300 points:

density = 262/(10000*10000) // points/m^2
area_to_300 = 300/density
R = sqrt(area_to_300/2*PI);
R -> 4268.93996744354m

So using a radius of 4200 meters the query takes ~18ms as it's doing a single iteration. It's
important to say that calculate the stats is almost free, it takes less than 1ms (except total count
that could be precalculated).

Other nice thing about the recursive query is it can be paginated, so you can find 100 results get
the last distance and the next time use that distance to start iterating.

Maybe the method works well for this case but not for other ones, depending on how postgres uses the indices the times may vary a lot but this is the best approach I found, any idea?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment