Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GeoSpatial functions and optimization in Presto #5435

Closed
wants to merge 2 commits into from

Conversation

zhenxiao
Copy link
Collaborator

No description provided.

@ghost ghost added the CLA Signed label Jun 11, 2016
@martint
Copy link
Contributor

martint commented Jun 12, 2016

Why do all functions operate on varchars? We should introduce new types for whatever structures need to be represented (points, polygons, etc)

@zhenxiao
Copy link
Collaborator Author

Yep, as a first step, this PR is calling ESRI library to support geo functions (Hive geo spatial UDF did the same), using ESRI classes, Points, Polygons, etc.
ESRI library internally store Points, Polygons as varchar, e.g. polygon is represented as:
'polygon ((1 1, 1 3, 3 3, 3 1))'
This PR adds initial support. I am working on adding Presto's own Point, Polygon stuff now.

@martint
Copy link
Contributor

martint commented Jun 13, 2016

What does the st_ prefix stand for?

@dain
Copy link
Contributor

dain commented Jun 13, 2016

I don't think this should be merged until the types are created. People will create views and other stored queries with the VARCHAR functions which will break when the new types are introduced. This will also teach users to expect "effective" implicit coercions from VARCHAR to the new types.

@zhenxiao
Copy link
Collaborator Author

st_ stands for spatial type, which is a standard for all geo spatial function implementations.
Sure. Let me implement the type first, and rebase the functions based on these types. Those types will go in presto-main, is that OK?


private GeoFunctions() {}

@Description("Returns string representation of a Point")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The @description lines for Point and Line are swapped.

@martint
Copy link
Contributor

martint commented Jun 15, 2016

The types can go in the plugin, too.

@zhenxiao
Copy link
Collaborator Author

@martint @cberner I could create Point type, Line type, and Polygon types in a plugin. Seems typeManager is already initialized and passed the a plugin factory. Is there a way for me to add plugin specific types into typeManager?

@ghost ghost added the CLA Signed label Jul 27, 2016
@cberner
Copy link
Contributor

cberner commented Jul 27, 2016

Yes, take a look at the presto-ml plugin

@maicaiyao
Copy link

Any updates on this PR?

@pletelli
Copy link

Is anybody still working on this ?

@zhenxiao
Copy link
Collaborator Author

sorry for the delay, I will come back to this in a few days. if you are interested, let's collaborate.

@Liorba
Copy link

Liorba commented Mar 23, 2017

any updates on this one?
we are really waiting from this one to be merged :)

@zhenxiao
Copy link
Collaborator Author

we are making performance improvement for GeoSpatial functions, including building quard tree on the fly. See some nice numbers.

Will keep this PR open, and update with our newest progress soon

@timwis
Copy link

timwis commented May 1, 2017

This is exciting! Looking forward to this being merged.

@zhenxiao
Copy link
Collaborator Author

zhenxiao commented May 2, 2017

@martint @dain @cberner get this PR updated with:

  • complete geospatial functions, port all esri functions in to Presto
  • geospatial types: point, line, polygon encoded as Presto binary
  • geospatial join optimization with QuadTree, improve geospatial join performance by > 100X

Could you please take a review when you are free?

@zhenxiao zhenxiao changed the title Add geo spatial functions to Presto GeoSpatial functions and optimization in Presto May 2, 2017
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit which introduces a set of spatial functions looks good to me. I have a few minor comments/questions. Would it be possible to extract that commit into a separate pull request?


private GeometryType()
{
super(new TypeSignature("Geometry"), Slice.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid copy-pasting, consider adding

public static final String NAME = "Geometry";

to use here and in @SqlType. See com.facebook.presto.ml.type.RegressorType in presto-ml.

@Description("Construct Line Geometry")
@ScalarFunction
@SqlType(GEOMETRY)
public static Slice stLine(@SqlType(StandardTypes.VARCHAR) Slice input)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a stLineString and should the implementation throw if the actual geometry is of different type?

public static Slice stLine(@SqlType(StandardTypes.VARCHAR) Slice input)
{
OGCGeometry line = OGCGeometry.fromText(input.toStringUtf8());
line.setSpatialReference(null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary?

@Description("Construct Polygon Geometry")
@ScalarFunction
@SqlType(GEOMETRY)
public static Slice stPolygon(@SqlType(StandardTypes.VARCHAR) Slice input)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the implementation check that the geometry specified is actually a polygon?

return serialize(polygon);
}

@Description("Returns area of input polygon")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation works for both polygons and multi-polygons. Perhaps, copy the documentation from PostGIS:

http://postgis.org/docs/ST_Area.html

Returns the area of the geometry if it is a polygon or multi-polygon.

return deserialize(input).isEmpty();
}

@Description("Returns the length of line")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Returns length of the geometry if it is a linestring or multilinestring.

@SqlType(StandardTypes.DOUBLE)
public static double stMaxX(@SqlType(GEOMETRY) Slice input)
{
OGCGeometry geometry = deserialize(input);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the logic of getting an envelope for a geometry is repeated in multiple functions; would it make sense to extract it to a helper method?

return serialize(exteriorRing);
}
catch (Exception e) {
throw new IllegalArgumentException("st_exterior_ring only applies to polygon. Input type is: " + geometry.geometryType());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the exception occurred after the check for geometry type has succeeded, the error message doesn't make sense.


.. function:: st_line(varchar) -> line

Constructs geometry type Line object.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps, clarify that input is a WKT string


.. function:: st_polygon(varchar) -> polygon

Constructs geometry type Polygon object.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps, clarify that input is a WKT string

@zhenxiao
Copy link
Collaborator Author

thank you @mbasmanova I get your comments addressed:
#8878

@zhenxiao zhenxiao closed this Dec 9, 2017
@zhenxiao zhenxiao deleted the geoFunctions branch December 9, 2017 02:25
@sshirazi
Copy link

Hi @zhenxiao. I'm new to this but has your Quadtree optimization been merged into master? I don't see it. I tried doing a point in poly join on 1.7 mil points against 56K polys using .187 release on aws emr 4 node cluster and it did not finish! What are your plans in getting the quadtree optimization into master. BTW pardon me if my question is way off. Thanks

@zhenxiao
Copy link
Collaborator Author

Hi @sshirazi Quadtree optimization is not merged yet. We are doing some refactoring. I think my colleague will submit a pull request with QuadTree optimization soon

@sshirazi
Copy link

Thanks @zhenxiao. 50X faster is worth waiting for!

@sshirazi
Copy link

sshirazi commented Dec 13, 2017 via email

@sshirazi
Copy link

Hi @zhenxiao Is there a way I can get this code/build now? We are trying to see if this fits our needs. Thanks

@mbasmanova
Copy link
Contributor

@sshirazi , we are working on spatial joins, but in the meantime, as a workaround, consider converting spatial joins to equi-joins using Bing tiles (see Bing Tiles section in https://prestodb.io/docs/current/functions/geospatial.html)

For example, point-in-polygon join can be written as follows:

spatial join (to be supported in the future)

SELECT ...
FROM polygons a, points b
WHERE 
    ST_Contains(
        ST_GeometryFromText(a.wkt), 
        ST_Point(b.longitude, b.latitude)
    )

an equivalent equi-join:

SELECT ...
FROM
    (
        SELECT *, ST_GeometryFromText(wkt) as geometry
        FROM polygons
        CROSS JOIN UNNEST 
            (geometry_to_bing_tiles(ST_GeometryFromText(wkt), 17)) as t(tile)
     ) a,
    points b,
WHERE a.tile = bing_tile_at(b.latitude, b.longitude, 17)
    AND ST_Contains(a.geometry, ST_Point(b.longitude, b.latitude))    

You may want to play with the zoom level (17) based on your use case. If you don't need precise results, you can speedup the join by removing AND ST_Contains(a.geometry, ST_Point(b.longitude, b.latitude)) filter and optionally increasing zoom level.

@sshirazi
Copy link

sshirazi commented Dec 22, 2017 via email

@mbasmanova
Copy link
Contributor

@sshirazi , I downloaded your shapefile and parsed it using Python's Fiona package. It looks like the file is using NAD83 datum, but Presto only supports WGS84. How did you convert between the two? Could you share the data using WGS84 ?

{'datum': 'NAD83',
'lat_0': 63.390675,
'lat_1': 49,
'lat_2': 77,
'lon_0': -91.86666666666666,
'no_defs': True,
'proj': 'lcc',
'units': 'm',
'x_0': 6200000,
'y_0': 3000000}

@sshirazi
Copy link

sshirazi commented Jan 8, 2018

@mbasmanova Hi, I usually use QGIS to this for me, but the GDAL package is very useful too. http://www.gdal.org/usergroup0.html
You can the Python package or use it's utilities. When I installed QGIS it also installed GDAL, I think. I just did this and it worked to convert to WGS84. using ogr2ogr on windows:
ogr2ogr.exe -f "ESRI Shapefile" lda_000b16a_e_4326.shp lda_000b16a_e.shp -s_srs EPSG:3347 -t_srs EPSG:4326
I think if you take the WKT of this file you should get lon/lat. Let me know how I get a WKT file to you if this does not work. BTW I think we should use Hilbert RTrees for the spatial indexing structure.

@mbasmanova
Copy link
Contributor

mbasmanova commented Jan 31, 2018

@sshirazi , sorry for not replying sooner. I didn't get a chance to convert the data myself yet. However, while working on #9474, I realized that a cross join can be optimized and perform reasonably OK. I'm curious if you tried cross join on the dataset? I imagine the query would look something like this:

SELECT p_id, da_id
FROM points, statscan_das
WHERE ST_Contains(ST_GeometryFromText(geom), ST_Point(longitude, latitude))

This query can be optimized significantly as follows:

SET session distributed_join=false;

SELECT p_id, da_id
FROM points, (SELECT p_id, ST_GeometryFromText(geom) as geometry FROM statscan_das)
WHERE ST_Contains(geometry, ST_Point(longitude, latitude));

This way, WKT parsing happens only once per polygon as opposed to once per (polygon, point) pair.

And recent #9798 optimization for ST_Contains will make this query even more efficient and fast.

I'm curious if you want to give it a try.

P.S. I will try to find time to convert the polygon dataset you use and troubleshoot the tile-based query.

@sshirazi
Copy link

sshirazi commented Feb 1, 2018

@mbasmanova Thanks for your optimization work and sql tip. I want to try this on AWS EMR asap, but I am new to AWS EMR and the latest release emr-5.11.1 comes with prestodb .187. Any idea how to upgrade it to .193 in place? My last attempt of replacing jar files did not go well! Anybody done this upgrade in AWS EMR?

@mbasmanova
Copy link
Contributor

@sshirazi , I don't have any experience with AWS EMR, but I do see that the latest release using Presto 0.187 . It looks like releases come out every month, but it is not clear how often individual components (like Presto) are upgraded. FYI, I'm actively working on #9890 and hope to complete this work within the next couple of months.

@mbasmanova
Copy link
Contributor

mbasmanova commented Feb 13, 2018

@sshirazi Finally, I got some time to look into your dataset. I converted it using command line you provided (ogr2ogr -f "ESRI Shapefile" lda_000b16a_e_4326.shp lda_000b16a_e.shp -s_srs EPSG:3347 -t_srs EPSG:4326) and ran the query. I got the following error right away:

The zoom level is too high or the geometry is too complex to compute a set of covering bing tiles. Please use a lower zoom level or convert the geometry to its bounding box using the ST_Envelope function.

This error comes from geometry_to_bing_tiles function which has a limit on the size and complexity of the input geometry. This function works by looking at the bounding box of the geometry, computing a set of bing tiles that cover that box, then checking each tile for intersection with the input geometry. It discards the tiles which cover bounding box, but don't intersect the geometry. The intersection checks are quite expensive and the cost is proportional to the number of points in the geometry. Hence, the function places a limit of 25M on the number of tiles covering bounding box times number of points in the geometry. There are quite a few geometries in the dataset that exceed that limit at bing tile zoom level 17.

Furthermore, when I modified the query to apply geometry_to_bing_tiles the bounding box instead of the actual geometry, I got a similar error:

The number of input tiles is too large (more than 1M) to compute a set of covering bing tiles.

This error indicates that 1M limit on the total number of tiles generated by geometry_to_bing_tiles function has been exceeded.

It looks like zoom level 17 is too high for the geometries in the dataset. After playing with the query some more, I got it to work using zoom level 14 and applying geometry_to_bing_tiles function to the bounding box like so:

SELECT ...
FROM
    (
        SELECT *, ST_GeometryFromText(wkt) as geometry
        FROM polygons
        CROSS JOIN UNNEST 
            (geometry_to_bing_tiles(ST_Envelope(ST_GeometryFromText(wkt)), 12)) as t(tile)
     ) a,
    points b
WHERE a.tile = bing_tile_at(b.latitude, b.longitude, 12)
    AND ST_Contains(a.geometry, ST_Point(b.longitude, b.latitude))   

Hope this helps.

@sshirazi
Copy link

Thanks for this. I will give this a try if when I have installed release .195 on AWS.

@sshirazi
Copy link

@mbasmanova Hi, I build a rpm for release .195 and updated an AWS EMR cluster running 2 worker nodes with 64 GB memory, m4.4xlarge. BTW I used presto-admin for the install of the rpm and configs. I ran the query above. It worked when I included a limit of 1000 or 100000, but when I did count(*) it would never finish. The Rows/s would constantly go down until to double digits and then stay there at 39% or so completion. Without the ST_Contains it would finish quickly. I will wait for the next release with R-Tree implementation. At least now I can try a new release on AWS from scratch anytime needed.

@voycey
Copy link

voycey commented Jul 30, 2018

What is the latest update on where this stands? We have tried the Bing Tiles solution and we get inconsistent results. We are joining Billions of rows into Millions of polygons and whilst the performance is good I still feel it could be better and @zhenxiao solution looked to be a solid solution for our use case.

We will try and extract the code from the closed pull request but I cant see anything about it being merged? Is it being handled elsewhere?

Thanks :)

@mbasmanova
Copy link
Contributor

@voycey Dan, Presto now supports spatial joins natively. Here is the original PR #9474 and issue #9890 . Spatial joins defined using ST_Contains, ST_Intersects and ST_Distance are executed as broadcast joins where right side (build) is replicated to all nodes and used to build an R-Tree. The left side (probe) is then streamed through the R-Tree. This works if the build side fits into memory of a single node. For larger build sides I'm working on a distributed version of the spatial join: #10454

@voycey
Copy link

voycey commented Jul 31, 2018

@mbasmanova this is great! So I think we probably need to increase the node capacity we are using to get the most out of this rather than using many nodes at the moment?
Out of curiosity - are there any situations where a QuadTree would preferable over an R-Tree?

Thanks :)

@sshirazi
Copy link

@mbasmanova Thanks. Still have not had the resources to try this! Sorry but I presume this is in release .198 onward, correct?

@mbasmanova
Copy link
Contributor

@voycey Dan, you are correct that using fewer large nodes (more memory) would be better than using more small nodes. In general, R-Tree is a better spatial index then Quad tree.

@mbasmanova
Copy link
Contributor

@sshirazi Spatial joins were introduced in 0.197 . Support for spatial left joins was added in 0.199.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet