-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GeoSpatial functions and optimization in Presto #5435
Conversation
Why do all functions operate on varchars? We should introduce new types for whatever structures need to be represented (points, polygons, etc) |
Yep, as a first step, this PR is calling ESRI library to support geo functions (Hive geo spatial UDF did the same), using ESRI classes, Points, Polygons, etc. |
What does the |
I don't think this should be merged until the types are created. People will create views and other stored queries with the |
st_ stands for spatial type, which is a standard for all geo spatial function implementations. |
|
||
private GeoFunctions() {} | ||
|
||
@Description("Returns string representation of a Point") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The @description lines for Point and Line are swapped.
The types can go in the plugin, too. |
Yes, take a look at the |
5d99a28
to
7145735
Compare
Any updates on this PR? |
Is anybody still working on this ? |
sorry for the delay, I will come back to this in a few days. if you are interested, let's collaborate. |
any updates on this one? |
we are making performance improvement for GeoSpatial functions, including building quard tree on the fly. See some nice numbers. Will keep this PR open, and update with our newest progress soon |
This is exciting! Looking forward to this being merged. |
@martint @dain @cberner get this PR updated with:
Could you please take a review when you are free? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commit which introduces a set of spatial functions looks good to me. I have a few minor comments/questions. Would it be possible to extract that commit into a separate pull request?
|
||
private GeometryType() | ||
{ | ||
super(new TypeSignature("Geometry"), Slice.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid copy-pasting, consider adding
public static final String NAME = "Geometry";
to use here and in @SqlType. See com.facebook.presto.ml.type.RegressorType in presto-ml.
@Description("Construct Line Geometry") | ||
@ScalarFunction | ||
@SqlType(GEOMETRY) | ||
public static Slice stLine(@SqlType(StandardTypes.VARCHAR) Slice input) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a stLineString and should the implementation throw if the actual geometry is of different type?
public static Slice stLine(@SqlType(StandardTypes.VARCHAR) Slice input) | ||
{ | ||
OGCGeometry line = OGCGeometry.fromText(input.toStringUtf8()); | ||
line.setSpatialReference(null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary?
@Description("Construct Polygon Geometry") | ||
@ScalarFunction | ||
@SqlType(GEOMETRY) | ||
public static Slice stPolygon(@SqlType(StandardTypes.VARCHAR) Slice input) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the implementation check that the geometry specified is actually a polygon?
return serialize(polygon); | ||
} | ||
|
||
@Description("Returns area of input polygon") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation works for both polygons and multi-polygons. Perhaps, copy the documentation from PostGIS:
http://postgis.org/docs/ST_Area.html
Returns the area of the geometry if it is a polygon or multi-polygon.
return deserialize(input).isEmpty(); | ||
} | ||
|
||
@Description("Returns the length of line") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Returns length of the geometry if it is a linestring or multilinestring.
@SqlType(StandardTypes.DOUBLE) | ||
public static double stMaxX(@SqlType(GEOMETRY) Slice input) | ||
{ | ||
OGCGeometry geometry = deserialize(input); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the logic of getting an envelope for a geometry is repeated in multiple functions; would it make sense to extract it to a helper method?
return serialize(exteriorRing); | ||
} | ||
catch (Exception e) { | ||
throw new IllegalArgumentException("st_exterior_ring only applies to polygon. Input type is: " + geometry.geometryType()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that the exception occurred after the check for geometry type has succeeded, the error message doesn't make sense.
|
||
.. function:: st_line(varchar) -> line | ||
|
||
Constructs geometry type Line object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps, clarify that input is a WKT string
|
||
.. function:: st_polygon(varchar) -> polygon | ||
|
||
Constructs geometry type Polygon object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps, clarify that input is a WKT string
thank you @mbasmanova I get your comments addressed: |
Hi @zhenxiao. I'm new to this but has your Quadtree optimization been merged into master? I don't see it. I tried doing a point in poly join on 1.7 mil points against 56K polys using .187 release on aws emr 4 node cluster and it did not finish! What are your plans in getting the quadtree optimization into master. BTW pardon me if my question is way off. Thanks |
Hi @sshirazi Quadtree optimization is not merged yet. We are doing some refactoring. I think my colleague will submit a pull request with QuadTree optimization soon |
Thanks @zhenxiao. 50X faster is worth waiting for! |
Hi,How can I get your code now? We are really interested in using this enterprise wide.
On Tuesday, December 12, 2017, 2:15:04 PM EST, Zhenxiao Luo <notifications@github.com> wrote:
Hi @sshirazi Quadtree optimization is not merged yet. We are doing some refactoring. I think my colleague will submit a pull request with QuadTree optimization soon
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
Hi @zhenxiao Is there a way I can get this code/build now? We are trying to see if this fits our needs. Thanks |
@sshirazi , we are working on spatial joins, but in the meantime, as a workaround, consider converting spatial joins to equi-joins using Bing tiles (see Bing Tiles section in https://prestodb.io/docs/current/functions/geospatial.html) For example, point-in-polygon join can be written as follows: spatial join (to be supported in the future)
an equivalent equi-join:
You may want to play with the zoom level (17) based on your use case. If you don't need precise results, you can speedup the join by removing |
Hi @mbasmanova Thanks for reaching out and this tip. Very smart. You are using the tile as the primary bounding box filter.
I don't know what I'm doing wrong but I cannot get my tables to join. I'm running on a one worker node m4.large AWS EMR cluster running 0.187 release. From run time output using presto-cli I see the rows/s get smaller and smaller! This is my query that ran over night and still did not finish:
SELECT p_id, da_id FROM ( SELECT *, ST_GeometryFromText(geom) as geometry FROM statscan_das CROSS JOIN UNNEST (geometry_to_bing_tiles(ST_GeometryFromText(geom), 17)) as t(tile) ) a, points b WHERE a.tile = bing_tile_at(b.latitude, b.longitude, 17);
As you see I left out the ST_Contains as you suggested. My poly table has 56204 rows and some of them are huge geometries like 32 MB wkt!. My point table has 1,110,177 points. My poly file can be downloaded as a .shp file from http://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/files-fichiers/2016/lda_000a16a_e.zip.
It is the Statistics Canada dissemination areas (DA), similar to census block in the US. The northern DA get interesting. I can share my point file if you want, but one can make this file I guess by randomly generating points within the da file bounding box. BTW I don't know if you know about QGis software, but it's a great open source tool for visualization/conversion etc.
Even though I'm new to the open source community I would love to help out in any way I can. I have a vested interest in making presto spatial work!
How large of data have you run through this code and what is your performance like?
What do you think of the Quadtree work that @zhenxiao did but was not completed?
|
@sshirazi , I downloaded your shapefile and parsed it using Python's Fiona package. It looks like the file is using NAD83 datum, but Presto only supports WGS84. How did you convert between the two? Could you share the data using WGS84 ? {'datum': 'NAD83', |
@mbasmanova Hi, I usually use QGIS to this for me, but the GDAL package is very useful too. http://www.gdal.org/usergroup0.html |
@sshirazi , sorry for not replying sooner. I didn't get a chance to convert the data myself yet. However, while working on #9474, I realized that a cross join can be optimized and perform reasonably OK. I'm curious if you tried cross join on the dataset? I imagine the query would look something like this:
This query can be optimized significantly as follows:
This way, WKT parsing happens only once per polygon as opposed to once per (polygon, point) pair. And recent #9798 optimization for ST_Contains will make this query even more efficient and fast. I'm curious if you want to give it a try. P.S. I will try to find time to convert the polygon dataset you use and troubleshoot the tile-based query. |
@mbasmanova Thanks for your optimization work and sql tip. I want to try this on AWS EMR asap, but I am new to AWS EMR and the latest release emr-5.11.1 comes with prestodb .187. Any idea how to upgrade it to .193 in place? My last attempt of replacing jar files did not go well! Anybody done this upgrade in AWS EMR? |
@sshirazi , I don't have any experience with AWS EMR, but I do see that the latest release using Presto 0.187 . It looks like releases come out every month, but it is not clear how often individual components (like Presto) are upgraded. FYI, I'm actively working on #9890 and hope to complete this work within the next couple of months. |
@sshirazi Finally, I got some time to look into your dataset. I converted it using command line you provided (
This error comes from Furthermore, when I modified the query to apply
This error indicates that 1M limit on the total number of tiles generated by It looks like zoom level 17 is too high for the geometries in the dataset. After playing with the query some more, I got it to work using zoom level 14 and applying
Hope this helps. |
Thanks for this. I will give this a try if when I have installed release .195 on AWS. |
@mbasmanova Hi, I build a rpm for release .195 and updated an AWS EMR cluster running 2 worker nodes with 64 GB memory, m4.4xlarge. BTW I used presto-admin for the install of the rpm and configs. I ran the query above. It worked when I included a limit of 1000 or 100000, but when I did count(*) it would never finish. The Rows/s would constantly go down until to double digits and then stay there at 39% or so completion. Without the ST_Contains it would finish quickly. I will wait for the next release with R-Tree implementation. At least now I can try a new release on AWS from scratch anytime needed. |
What is the latest update on where this stands? We have tried the Bing Tiles solution and we get inconsistent results. We are joining Billions of rows into Millions of polygons and whilst the performance is good I still feel it could be better and @zhenxiao solution looked to be a solid solution for our use case. We will try and extract the code from the closed pull request but I cant see anything about it being merged? Is it being handled elsewhere? Thanks :) |
@voycey Dan, Presto now supports spatial joins natively. Here is the original PR #9474 and issue #9890 . Spatial joins defined using |
@mbasmanova this is great! So I think we probably need to increase the node capacity we are using to get the most out of this rather than using many nodes at the moment? Thanks :) |
@mbasmanova Thanks. Still have not had the resources to try this! Sorry but I presume this is in release .198 onward, correct? |
@voycey Dan, you are correct that using fewer large nodes (more memory) would be better than using more small nodes. In general, R-Tree is a better spatial index then Quad tree. |
@sshirazi Spatial joins were introduced in 0.197 . Support for spatial left joins was added in 0.199. |
No description provided.