Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sp-gist index for BOX3D #251

Closed
wants to merge 13 commits into from
Closed

sp-gist index for BOX3D #251

wants to merge 13 commits into from

Conversation

estebanzimanyi
Copy link

Implementation of the SP-GIST indexes for the BOX3D type provided by PostGIS. It implements a 6-dimensional oct-tree over 3D boxes that generalizes and improves the 4-dimensional quad-tree over 2D boxes provided by PostgreSQL in the file geo_spgist.c.

COMMUTATOR = <->
);

CREATE FUNCTION temporal_left(box3D, box3D)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

temporal -> box3d?

@alesuiss
Copy link

alesuiss commented May 24, 2018

Hi,

We've fixed all the build issues on the CI but are left with a problem in the regression tests which we're not sure how to fix. This is the problematic query (in regress/regress_index.sql):

select num,ST_astext(the_geom) from test where the_geom && 'BOX3D(125 125,135 135)'::box3d order by num;
ERROR: operator is not unique: geometry && box3d

Adding an explicit cast to the_geom doesn't fix it. We could change the name of our && operator, but that is obviously not ideal.

Any ideas on how to fix this would be very much appreciated :-)

Thanks,
Arthur

EDIT: actually it seems to work when explicitly casting the_geom to box3d (I was trying geometry :-)

@estebanzimanyi
Copy link
Author

estebanzimanyi commented May 24, 2018 via email

@alesuiss
Copy link

Now we get a weird error:

Checking for shp2pgsql ... found
Checking for pgsql2shp ... found
TMPDIR is /tmp/pgis_reg
Database postgis_reg already exists.
Drop it, or run with the --nocreate flag to use it.
make[2]: *** [check] Error 1

I don't see how this could be from our changes, so is this a bug in postgis' test system?

@Komzpa
Copy link
Member

Komzpa commented May 24, 2018

@alesuiss such error means the issue is reported one test run above. for your case:

psql:/home/travis/build/postgis/postgis/regress/00-regress-install/share/contrib/postgis/postgis_upgrade.sql:6157: ERROR:  function "box3d_contains" already exists with same argument types

@alesuiss
Copy link

alesuiss commented May 24, 2018

@Komzpa oops! Thanks, we're a little new to this :-)

I still don't understand the error though. I've added our SQL declarations to postgis/postgis.sql.in, did we need to do anything else?

EDIT: never mind, I'm just stupid :-)

#define RTOverFrontStrategyNumber 28 /* for &</ */
#define RTFrontStrategyNumber 29 /* for <</ */
#define RTBackStrategyNumber 30 /* for />> */
#define RTOverBackStrategyNumber 31 /* for /&> */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename these to use a different prefix than "RT" which to me expands to "r-tree".

AS 'MODULE_PATHNAME', 'spgist_box3D_octree_leaf_consistent'
LANGUAGE C IMMUTABLE STRICT _PARALLEL;

-- Availability: 2.5.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure I want to see all the different operators? My experience has been that nothing but "overlaps", "contains" and "within" are used in practice, and if we define all the operators then we get to carry/maintain them for the rest of time (like we did for the initial gist-rtree binding) even though they aren't used.

@@ -34,14 +34,14 @@ set enable_bitmapscan = off;
set enable_seqscan = on;

SELECT 'scan_idx', qnodes('select * from test where the_geom && ST_MakePoint(0,0)');
select num,ST_astext(the_geom) from test where the_geom && 'BOX3D(125 125,135 135)'::box3d order by num;
select num,ST_astext(the_geom) from test where the_geom::box3d && 'BOX3D(125 125,135 135)'::box3d order by num;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other large scale issue I have w/ this patch is defining the index on box3d rather than geometry. There's a convenience associated w/ having the index defined against geometry, because then people can use it directly w/o casting. Did you do it against box3d because of issues of dealing with large objects? The pg11 support for an spgist compress API hook should allow you to do an index against geometry while using boxes as the keys.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, this is to get around the spgist storage issue; the plan was to generalize the index once pg11 is out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant released (we're currently basing our work on the last released version).

Should we work with pg11 instead for this patch?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that is not too much trouble for you. :)
Postgres 11 beta1 was released just a couple of days ago: https://www.postgresql.org/about/news/1855/ and is aimed at developers.

Basically, a released non-beta Postgres 11 means a released non-beta PostGIS 2.5, and feature freeze is going to come soon.

I'll see if I can get it running on Travis to make testing easier.

Copy link
Author

@estebanzimanyi estebanzimanyi Jun 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed we can start with only a few operators.

However, please notice that all this pull request originates from working with trajectories, i.e., temporal points and thus we have developed an extension based on PostGIS in which we defined temporal points, as well as temporal floats (e.g., for expressing the distance between moving cars) and temporal booleans (e.g., for determining whether a moving car is located within a province). In that case we used the z dimension for encoding timestamps and thus the operators such as <</ and />>correspond to before and after in time and these are heavily used in that scenario.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Did you know that there is some trajectory work in PostGIS already, encoding the timestamps on the M?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes of course, but didn't scale up well, in particular because time is ordered by definition while M dimension is not and therefore we cannot leverage on that for having a much faster implementation

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, please notice that the bug I reported
https://trac.osgeo.org/postgis/ticket/4088#ticket
has not been corrected in the beta version of postgis 2.5.0
I am also aiming to develop spgist 2D indexes on geometries using boxd2f bounding boxes. However, the fact that the over* (e.g. overleft) operators have erroneously defined a commutative operator prevents me to have correct results.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have the generalized index ready soon. Should we use a new branch & new pull request for this?

* This module provides SP-GiST implementation for boxes using oct tree
* analogy in 4-dimensional space. SP-GiST doesn't allow indexing of
* overlapping objects. We are making 3D objects never-overlapping in
* 6D space. This technique has some benefits compared to traditional
Copy link
Member

@pramsey pramsey May 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it's nice from a completionist point-of-view to support 3D inputs, the reality of general use is that most indexes are against 2D objects. How does this index degrade if I feed it 2D objects? Is it happy? Also, note that most 3D objects have a very narrow range in Z compared to the range in X/Y -- is the index happy in that case? Is there any argument for having a 2D opclass be the default, with an optional 3D opclass for people who really have that problem?

Copy link
Member

@pramsey pramsey Jun 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the behaviour for indexing points (which end up represented by boxes that have zero-length edges? (which end up represented by boxes that have zero-length edges?) This is another common use case. Does it need special handling for good performance?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a quick test with 1e4 3D points. For higher numbers of points the scale up is much bigger.


CREATE OR REPLACE FUNCTION random_float(low float, high float)
RETURNS float AS $$
BEGIN
RETURN random()* (high-low + 1) + low;
END;
$$ LANGUAGE 'plpgsql' STRICT;

CREATE OR REPLACE FUNCTION random_geompoint3D(lowx float, highx float,
lowy float, highy float, lowz float, highz float)
RETURNS geometry AS $$
BEGIN
RETURN st_makepoint(random_float(lowx, highx), random_float(lowy, highy),
random_float(lowz, highz));
END;
$$ LANGUAGE 'plpgsql' STRICT;


drop table if exists tbl_point3Dboxes;
create table tbl_point3Dboxes as
select k, random_geompoint3D(0, 1000, 0, 1000, 0, 1000)::box3D as b
from generate_series(1,1e4) k;

select * from tbl_point3Dboxes limit 3;


select count() from tbl_point3Dboxes t1, tbl_point3Dboxes t2 where t1.b && t2.b
-- 10000
explain analyze
select count(
) from tbl_point3Dboxes t1, tbl_point3Dboxes t2 where t1.b && t2.b

"Aggregate (cost=1027824.45..1027824.46 rows=1 width=8) (actual time=15421.243..15421.243 rows=1 loops=1)"
" -> Nested Loop (cost=0.00..1010981.64 rows=6737126 width=0) (actual time=0.022..15419.443 rows=10000 loops=1)"
" Join Filter: (t1.b && t2.b)"
" Rows Removed by Join Filter: 99990000"
" -> Seq Scan on tbl_point3dboxes t1 (cost=0.00..196.08 rows=8208 width=52) (actual time=0.011..1.988 rows=10000 loops=1)"
" -> Materialize (cost=0.00..237.12 rows=8208 width=52) (actual time=0.000..0.507 rows=10000 loops=10000)"
" -> Seq Scan on tbl_point3dboxes t2 (cost=0.00..196.08 rows=8208 width=52) (actual time=0.004..1.683 rows=10000 loops=1)"
"Planning Time: 0.067 ms"
"Execution Time: 15421.441 ms"

create index tbl_point3Dboxes_spgist_idx on tbl_point3Dboxes using spgist(b);
-- Query returned successfully with no result in 94 msec.

select count() from tbl_point3Dboxes t1, tbl_point3Dboxes t2 where t1.b && t2.b
-- 10000
explain analyze
select count(
) from tbl_point3Dboxes t1, tbl_point3Dboxes t2 where t1.b && t2.b

"Aggregate (cost=303928.00..303928.01 rows=1 width=8) (actual time=183.226..183.226 rows=1 loops=1)"
" -> Nested Loop (cost=0.28..278928.00 rows=10000000 width=0) (actual time=0.049..181.660 rows=10000 loops=1)"
" -> Seq Scan on tbl_point3dboxes t1 (cost=0.00..214.00 rows=10000 width=52) (actual time=0.011..1.343 rows=10000 loops=1)"
" -> Index Only Scan using tbl_point3dboxes_spgist_idx on tbl_point3dboxes t2 (cost=0.28..17.87 rows=1000 width=52) (actual time=0.017..0.017 rows=1 loops=10000)"
" Index Cond: (b && t1.b)"
" Heap Fetches: 10000"
"Planning Time: 0.109 ms"
"Execution Time: 183.267 ms"


Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same test with 1e6 points


drop table if exists tbl_point3Dboxes;
create table tbl_point3Dboxes as
select k, random_geompoint3D(0, 1000, 0, 1000, 0, 1000)::box3D as b
from generate_series(1,1e6) k;
-- Query returned successfully: 1000000 rows affected, 8.3 secs execution time.

select * from tbl_point3Dboxes limit 3;


create index tbl_point3Dboxes_spgist_idx on tbl_point3Dboxes using spgist(b);
-- Query returned successfully with no result in 8.8 secs.

select count() from tbl_point3Dboxes t1, tbl_point3Dboxes t2 where t1.b && t2.b
-- 1000000
explain analyze
select count(
) from tbl_point3Dboxes t1, tbl_point3Dboxes t2 where t1.b && t2.b

"Finalize Aggregate (cost=1250227629.38..1250227629.39 rows=1 width=8) (actual time=15313.482..15313.482 rows=1 loops=1)"
" -> Gather (cost=1250227629.17..1250227629.38 rows=2 width=8) (actual time=15313.474..15313.476 rows=3 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Partial Aggregate (cost=1250226629.17..1250226629.18 rows=1 width=8) (actual time=15240.598..15240.598 rows=1 loops=3)"
" -> Nested Loop (cost=0.41..1146059962.50 rows=41666666667 width=0) (actual time=0.211..15154.174 rows=333333 loops=3)"
" -> Parallel Seq Scan on tbl_point3dboxes t1 (cost=0.00..15530.67 rows=416667 width=52) (actual time=0.038..124.000 rows=333333 loops=3)"
" -> Index Only Scan using tbl_point3dboxes_spgist_idx on tbl_point3dboxes t2 (cost=0.41..1750.50 rows=100000 width=52) (actual time=0.044..0.044 rows=1 loops=1000000)"
" Index Cond: (b && t1.b)"
" Heap Fetches: 1000000"
"Planning Time: 0.199 ms"
"Execution Time: 15349.058 ms"


Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally equivalent test with 1e6 points using gist_geometry_ops_nd operator class


drop table if exists tbl_point3Dgeoms;
create table tbl_point3Dgeoms as
select k, random_geompoint3D(0, 1000, 0, 1000, 0, 1000) as p
from generate_series(1,1e6) k;

select * from tbl_point3Dgeoms limit 3;

create index tbl_point3Dgeoms_spgist_idx on tbl_point3Dgeoms using gist(p gist_geometry_ops_nd);
-- Query returned successfully with no result in 22.0 secs.

select count() from tbl_point3Dgeoms t1, tbl_point3Dgeoms t2 where t1.p &&& t2.p
-- 1000000
explain analyze
select count(
) from tbl_point3Dgeoms t1, tbl_point3Dgeoms t2 where t1.p &&& t2.p

"Finalize Aggregate (cost=1381495.95..1381495.96 rows=1 width=8) (actual time=340992.040..340992.040 rows=1 loops=1)"
" -> Gather (cost=1381495.73..1381495.94 rows=2 width=8) (actual time=340990.711..340992.032 rows=3 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Partial Aggregate (cost=1380495.73..1380495.74 rows=1 width=8) (actual time=340973.307..340973.307 rows=1 loops=3)"
" -> Nested Loop (cost=0.29..1308292.04 rows=28881479 width=0) (actual time=1.255..340868.353 rows=333333 loops=3)"
" -> Parallel Seq Scan on tbl_point3dgeoms t1 (cost=0.00..13512.67 rows=416667 width=40) (actual time=0.049..169.917 rows=333333 loops=3)"
" -> Index Scan using tbl_point3dgeoms_spgist_idx on tbl_point3dgeoms t2 (cost=0.29..2.11 rows=100 width=40) (actual time=0.731..1.021 rows=1 loops=1000000)"
" Index Cond: (t1.p &&& p)"
"Planning Time: 14.149 ms"
"Execution Time: 340994.130 ms"


bool
BOX3D_contains_internal(BOX3D *box1, BOX3D *box2)
{
return (FPge(box1->xmax, box2->xmax) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure FPge is a fit here. A box that is unbuffered by 9E-7 will be containing the original one? It's also strange that there's 1e-8 SQLMM epsilon, and FPge uses a 100 times bigger one.

#define EPSILON					1.0E-06
#define FPge(A,B)				((B) - (A) <= EPSILON)

Copy link
Author

@estebanzimanyi estebanzimanyi Jun 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, we could use float8_cmp_internal instead of those FP** comparisons

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can it be just >= ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference with float8_cmp_internal is that it takes care of NaN. Please tell me which is better.

@alesuiss
Copy link

alesuiss commented Jun 7, 2018

We have finished implementing the sp-gist index on geometry, for both 2D and 3D; we have pushed our changes to a new branch. The index seems to work well, but of course it doesn't currently build on Travis.

Kindly let us know what to do next.

@Komzpa Komzpa mentioned this pull request Jun 7, 2018
@Komzpa
Copy link
Member

Komzpa commented Jun 7, 2018

Hi @alesuiss,

Only PRs are being built on travis automatically - I've made a new PR for you (don't be afraid to make them too! :) )

#258

@Komzpa
Copy link
Member

Komzpa commented Jul 10, 2018

Geometry version of this PR was merged, closing this one.

@Komzpa Komzpa closed this Jul 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants