Use GeoHash ordering instead of way ordering #242

Merged
merged 1 commit into openstreetmap:master from pnorman:geohash on Dec 31, 2014

Conversation

@pnorman
Collaborator

pnorman commented Dec 30, 2014

Creating a table with ORDER BY way is known to offer no performance advantages (#87) and in fact to cause losses in some cases. ST_GeoHash offers a better way to keep geographically nearby data in the same or nearby pages.

Rather than simply ordering on ST_GeoHash(ST_Transform(way,4326)), we can get a total gain of 15% by using ST_GeoHash(ST_Transform(ST_Envelope(way),4326),10).

Benchmark Details

The polygons table from planet-130904.osm.pbf was used, and a new table created.

CREATE TABLE polygon_test AS 
  SELECT * 
    FROM planet_osm_polygon 
    ORDER BY ST_GeoHash(ST_Transform(ST_Envelope(way),4326),<N>);

With this, we can see how the ORDER BY time changes as the number of characters is varied, which leads to the selection of a 10-character geohash.
[Figure: ORDER BY time versus geohash length in characters]

Base time of ST_GeoHash(ST_Transform(ST_Envelope(way),4326)) was 1848 seconds, and ST_GeoHash(ST_Transform(way,4326)) was 2011 seconds.

There is a theoretical basis for preferring a geohash with an even number of characters, which this is: each base-32 character encodes 5 bits, alternating between longitude and latitude, so an even length gives equal resolution in both directions (10 characters = 25 bits each).
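
As a quick illustration of the truncation (an arbitrary point, not part of the benchmark; the 10-character hash is simply a prefix of the full one):

SELECT ST_GeoHash(ST_SetSRID(ST_MakePoint(-123.1, 49.25), 4326))     AS full_hash,
       ST_GeoHash(ST_SetSRID(ST_MakePoint(-123.1, 49.25), 4326), 10) AS ten_char_hash;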

Use GeoHash ordering instead of way ordering
Creating a table with ORDER BY way is known to offer no performance
advantages (#87) and in fact to cause losses in some cases. GeoHash
offers a better way to have geographically nearby data in the same
or nearby pages.

Rather than simply doing an order on ST_GeoHash(ST_Transform(way,4326)),
we can get a 9% performance gain by only transforming the ST_Envelope.

This works in all cases on EPSG 3857, as geometries cannot cross WGS84
boundaries without also crossing 3857 boundaries.

There might be difficulties in projections that cross the 180 line, such
as Alaskan state plane zones, but as OSM data is itself WGS84, any
geometries will have to be broken at the 180 line anyway, and it remains
safe. The worst potential consequence would be a slightly suboptimal
ordering on disk.

Another 5% performance can be gained by restricting the geohash length to
10 characters. This gives 25 bits of resolution in either direction, which
is sufficient for our purposes.

Fixes #208
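
For reference, a sketch of how the ordering variants discussed above could be compared side by side. It assumes the default osm2pgsql table name; the output table names are invented for the example, and timings would come from psql's \timing.

-- Old behaviour: order by the raw geometry column.
CREATE TABLE polygon_order_way AS
  SELECT * FROM planet_osm_polygon ORDER BY way;

-- GeoHash of the full geometry.
CREATE TABLE polygon_order_geohash_full AS
  SELECT * FROM planet_osm_polygon
    ORDER BY ST_GeoHash(ST_Transform(way,4326));

-- GeoHash of the envelope, truncated to 10 characters (what this change uses).
CREATE TABLE polygon_order_geohash_env AS
  SELECT * FROM planet_osm_polygon
    ORDER BY ST_GeoHash(ST_Transform(ST_Envelope(way),4326),10);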
@pnorman
Collaborator

pnorman commented Dec 30, 2014

cc @cquest

@cquest
Contributor

cquest commented Dec 30, 2014

Interesting and strange result at the same time. ST_GeoHash uses a loop, and I don't understand why fewer iterations could take more time!

I did a quick test yesterday on a small extract (to limit other side effects like I/O), and the 6-character ST_GeoHash was faster than the 8-character one. My test was not the same: I simply created a geohash-based index that was later used by CLUSTER.

Maybe it's a side effect on the CREATE TABLE / SELECT / ORDER BY due to many more similar values?

@pnorman
Collaborator

pnorman commented Dec 30, 2014

> ST_GeoHash uses a loop, and I don't understand why fewer iterations could take more time!

ST_GeoHash computation speed differences are not likely to be the most significant factor in an ORDER BY, given that it is big enough to go out to disk.

> I did a quick test yesterday on a small extract (to limit other side effects like I/O)

You really need to make sure that you're hitting disk. An in-memory sort is a different beast from a disk-based sort.

> the 6-character ST_GeoHash was faster than the 8-character one

Numbers?

> My test was not the same: I simply created a geohash-based index that was later used by CLUSTER.
> Maybe it's a side effect on the CREATE TABLE / SELECT / ORDER BY due to many more similar values?

I wouldn't be surprised if the CLUSTER case had different performance characteristics, but without details I can't really comment.
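
One way (a sketch assuming the default osm2pgsql table name, not something run in this thread) to confirm whether such a sort actually goes to disk is to check the sort method that EXPLAIN ANALYZE reports:

-- "Sort Method: external merge  Disk: ..." indicates a disk-based sort;
-- "Sort Method: quicksort  Memory: ..." means the sort fit in work_mem.
EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM planet_osm_polygon
    ORDER BY ST_GeoHash(ST_Transform(ST_Envelope(way),4326),10);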

@cquest
Contributor

cquest commented Dec 30, 2014

Sure, I was not hitting disk in my quick test. Geohash computation is obviously not the cause of the difference in timing.

A 5% difference is not a big issue anyway...

@lonvia
Collaborator

lonvia commented Dec 31, 2014

Looks good to me.

pnorman added a commit that referenced this pull request Dec 31, 2014

Merge pull request #242 from pnorman/geohash
Use GeoHash ordering instead of way ordering

@pnorman pnorman merged commit 7c60fd5 into openstreetmap:master Dec 31, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed

@pnorman pnorman deleted the pnorman:geohash branch Dec 31, 2014

@pnorman
Collaborator

pnorman commented Jan 10, 2015

Just for the record

length  distinct hashes
    8    85.65M
   10   111.38M
   12   111.40M

so 10 is a sensible choice.
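
Presumably (an assumed reconstruction of the methodology, not quoted from the thread) counts like these come from something along the lines of:

-- Count distinct truncated geohashes for a given length, here 10.
SELECT count(DISTINCT ST_GeoHash(ST_Transform(ST_Envelope(way),4326),10)) AS distinct_hashes
  FROM planet_osm_polygon;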

@it-solutions-zehaczek

it-solutions-zehaczek commented Jan 26, 2015

Hi,

I just tried an import using 0.87.2-dev on PostgreSQL 9.1 with slim mode, hstore and flat nodes enabled, and got the following error:
CREATE TABLE planet_osm_polygon_tmp AS SELECT * FROM planet_osm_polygon ORDER BY ST_GeoHash(ST_Transform(ST_Envelope(way),4326),10) failed: ERROR: ST_GeoHash: lwgeom_geohash returned NULL.

Switching back to 0.87.1 solves the problem.

It seems to depend on the data: the planet, Germany and Hesse crashed, while some other smaller areas passed without error. So the Hesse file is your best bet if you would like to reproduce the error (http://download.geofabrik.de/europe/germany/hessen.html).

Regards,
Chris
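
A diagnostic sketch (an assumption, not a fix from this thread) for narrowing down such failures: list rows whose reprojected envelope falls outside the valid WGS84 range, one plausible trigger for lwgeom_geohash returning NULL.

-- Rows whose envelope, transformed to 4326, is not contained in the valid
-- longitude/latitude box. The ~ operator tests bounding-box containment.
SELECT osm_id,
       ST_AsText(ST_Transform(ST_Envelope(way),4326)) AS envelope_wgs84
  FROM planet_osm_polygon
  WHERE NOT (ST_MakeEnvelope(-180, -90, 180, 90, 4326) ~ ST_Transform(ST_Envelope(way),4326))
  LIMIT 10;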
