
Add GeoArrow encoding as an option to the specification #189

Merged Mar 25, 2024 · 13 commits

Conversation

@paleolimbot (Collaborator) commented Nov 17, 2023

Closes #185. See draft implementation at geoarrow/geoarrow-python#41 (includes example of reading an arbitrary file using GDAL and writing to geoarrow-encoded GeoParquet if anybody would like to try with arbitrary data).

As discussed in early versions of this PR and #185, this adds the option for "encoding": "(point|linestring|polygon|multipoint|multilinestring|multipolygon)". This emphasizes that these are efficient encodings for single-type geometry datasets.

The notable advantages are (1) column statistics and (2) serialize/deserialize speed (no WKB encoding/decoding needed). The types used are also types that most systems understand natively (list, struct, double), and many systems should be able to work with the files without needing any geometry support whatsoever (e.g., you could use duckdb to group by + summarize to compute bounding boxes).
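
For example, a minimal sketch of that DuckDB bounding-box query from Python, with no geometry support at all (the file name and the struct column name "geometry" are illustrative assumptions):

import duckdb

# The point encoding stores coordinates as a plain struct(x, y), so ordinary
# SQL aggregation over the struct fields yields the bounding box.
duckdb.sql("""
    SELECT
        MIN(geometry.x) AS xmin, MIN(geometry.y) AS ymin,
        MAX(geometry.x) AS xmax, MAX(geometry.y) AS ymax
    FROM 'points.parquet'
""").show()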

There have been a number of comments about the possibility of better compression using byte stream split encoding. I haven't tried this yet but can run some experiments.

I also added a note to "compatible parquet"...admittedly fewer systems can write struct(x, y) than can write x, y, but it's in theory possible to do so. Unfortunately the memory layouts for geoarrow.linestring/geoarrow.multipoint and for geoarrow.polygon/geoarrow.multilinestring overlap, so without metadata we either have to disallow them or just guess the higher-dimension type. I put the "guess the higher-dimension type" language in the spec...perhaps writers should prefer multilinestring/multipolygon over linestring/polygon to improve compatibility for readers without the ability to inspect metadata.

@cholmes (Member) left a comment

Looks great to me!

@jiayuasu (Contributor) left a comment

Looks good to me!

@jorisvandenbossche (Collaborator) left a comment

Looks good!

It might be nice to also add a few basic tests for the JSON schema validation, similar to how #191 does it.

Collaborator

Nit: maybe _geoarrow to not mix _ and -?

Comment on lines 75 to 77
"geoarrow_type": {
"type": "string",
"pattern": "^geoarrow\\.(point|linestring|polygon|multipoint|multilinestring|multipolygon)$"
Collaborator

I'm not a json schema expert, but would we be able to make this conditionally required? It looks like dependentRequired meets what we need, though I don't know what version of json schema we're pinned to.
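
(For reference: dependentRequired triggers on a key's presence rather than its value, so requiring geoarrow_type only when encoding is "geoarrow" would more likely use an if/then clause, available since JSON Schema draft 07. A hypothetical fragment:)

{
  "if": {
    "properties": { "encoding": { "const": "geoarrow" } },
    "required": ["encoding"]
  },
  "then": { "required": ["geoarrow_type"] },
  "else": { "not": { "required": ["geoarrow_type"] } }
}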

| geometry_types | \[string] | **REQUIRED.** The geometry types of all geometries, or an empty array if they are not known. |
| crs | object\|null | [PROJJSON](https://proj.org/specifications/projjson.html) object representing the Coordinate Reference System (CRS) of the geometry. If the field is not provided, the default CRS is [OGC:CRS84](https://www.opengis.net/def/crs/OGC/1.3/CRS84), which means the data in this column must be stored in longitude, latitude based on the WGS84 datum. |
| orientation | string | Winding order of exterior ring of polygons. If present must be `"counterclockwise"`; interior rings are wound in opposite order. If absent, no assertions are made regarding the winding order. |
| edges | string | Name of the coordinate system for the edges. Must be one of `"planar"` or `"spherical"`. The default value is `"planar"`. |
| bbox | \[number] | Bounding Box of the geometries in the file, formatted according to [RFC 7946, section 5](https://tools.ietf.org/html/rfc7946#section-5). |
| epoch | number | Coordinate epoch in case of a dynamic CRS, expressed as a decimal year. |
| geoarrow_type | string | The [GeoArrow extension name](https://geoarrow.org/extension-types#extension-names) corresponding to the column memory layout. This is required when `encoding` is `"geoarrow"` and must be omitted otherwise. |
Collaborator

Suggested change
| geoarrow_type | string | The [GeoArrow extension name](https://geoarrow.org/extension-types#extension-names) corresponding to the column memory layout. This is required when `encoding` is `"geoarrow"` and must be omitted otherwise. |
| geoarrow_type | string | The [GeoArrow extension name](https://geoarrow.org/extension-types#extension-names) corresponding to the column's memory layout. This is required when `encoding` is `"geoarrow"` and must be omitted otherwise. |


Note that the current version of the spec only allows for a subset of WKB: 2D or 3D geometries of the standard geometry types (the Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection geometry types). This means that M values or non-linear geometry types are not yet supported.

Using the `"geoarrow"` encoding may provide better performance and enable readers to leverage more features of the Parquet format to accelerate geospatial queries (e.g., row group-level min/max statistics). When `encoding` is set to `"geoarrow"`, the column metadata must also specify `geoarrow_type` according to the [GeoArrow metadata specification for extension names](https://geoarrow.org/extension-types#extension-names) to signify the memory layout used by the geometry column.

Note that the current version of the spec only allows for a subset of GeoArrow: separated (struct) coordinates are required, only 2D or 3D geometries are permitted, and the supported extensions are currently `"geoarrow.point"`, `"geoarrow.linestring"`, `"geoarrow.polygon"`, `"geoarrow.multipoint"`, `"geoarrow.multilinestring"`, and `"geoarrow.multipolygon"`. This means that M values and serialized encodings are not yet supported.
Collaborator

Oh I see, this doc doesn't even allow interleaved coordinates as GeoParquet.

I'm sensitive to the complexity concerns of having too many options in the spec, but I see this as favoring the "support cloud-native remote queries" use case over the "efficient file format, but reading and writing whole tables" use case. It "feels" like there's still a strong pull in general towards storing interleaved coordinates across the geo ecosystem.

That said, the memcopy to and from separated coordinates is pretty fast, so I can tolerate this.
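
A minimal numpy sketch of that memcopy (buffer contents illustrative):

import numpy as np

interleaved = np.array([0.0, 100.0, 2.0, 102.0])  # x0, y0, x1, y1, ...

# interleaved -> separated: strided views copied into contiguous buffers
x = np.ascontiguousarray(interleaved[0::2])
y = np.ascontiguousarray(interleaved[1::2])

# separated -> interleaved
round_trip = np.column_stack([x, y]).ravel()
assert np.array_equal(round_trip, interleaved)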

Collaborator Author

In summary, the reasons interleaved coordinates are not a good candidate as of this writing are:

  • They don't give useful column statistics
  • Current tools are slow to read them compared to separated encodings (important for points)
  • NULL values randomly error in some cases

Demo of column statistics:

import geoarrow.pyarrow as ga
import pyarrow as pa
from pyarrow import parquet
import numpy as np

array_interleaved = ga.as_geoarrow(["POINT (0 100)", "POINT (2 102)"], coord_type=ga.CoordType.INTERLEAVED)
tbl_interleaved = pa.table([array_interleaved], ["geom"])

parquet.write_table(tbl_interleaved, "test_interleaved.parquet")


f = parquet.ParquetFile("test_interleaved.parquet")
f.metadata.row_group(0).column(0).statistics
<pyarrow._parquet.Statistics object at 0x1223e7600>
  has_min_max: True
  min: -0.0
  max: 102.0
  null_count: 0
  distinct_count: None
  num_values: 4
  physical_type: DOUBLE
  logical_type: None
  converted_type (legacy): NONE

Demo of slowness:

import geoarrow.pyarrow as ga
import pyarrow as pa
from pyarrow import parquet
import numpy as np

n = int(1e6)
array = ga.point().from_geobuffers(None, np.random.random(n), np.random.random(n))
array_interleaved = ga.as_geoarrow(array, coord_type=ga.CoordType.INTERLEAVED)
tbl = pa.table([array], ["geom"])
tbl_interleaved = pa.table([array_interleaved], ["geom"])

parquet.write_table(tbl, "test.parquet")
parquet.write_table(tbl_interleaved, "test_interleaved.parquet")

%timeit parquet.read_table("test.parquet")
#> 7.36 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit parquet.read_table("test_interleaved.parquet")
#> 15.8 ms ± 49.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Demo of random errors for NULL values:

import geoarrow.pyarrow as ga
import pyarrow as pa
from pyarrow import parquet
import numpy as np

array_interleaved = ga.as_geoarrow(["POINT (0 1)", None], coord_type=ga.CoordType.INTERLEAVED)
tbl_interleaved = pa.table([array_interleaved], ["geom"])

parquet.write_table(tbl_interleaved, "test_interleaved.parquet")
parquet.read_table("test_interleaved.parquet")
#> ArrowInvalid: Expected all lists to be of size=2 but index 2 had size=0

These are probably all solvable/might be unique to Arrow C++-backed implementations, but I am not sure it is the best encoding to start with (and it does seem like a good idea to start with just one encoding to minimize the burden on implementors).

paleolimbot and others added 3 commits January 30, 2024 11:49
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
@paleolimbot (Collaborator, Author)

> I'm not a json schema expert, but would we be able to make this conditionally required? It looks like dependentRequired meets what we need, though I don't know what version of json schema we're pinned to.

> It might be nice to also add a few basic tests for the JSON schema validation, similar to how #191 does it.

I'm not a schema expert either! I'm happy to take a stab at that, but maybe it would be best to merge the text and follow up with JSON schema tests? (I can do it here, too, but it may take me a bit to get there.)

@@ -12,7 +12,7 @@ The core idea of the compatibility guidelines is to have the output match the de

* The geometry column should be named either `"geometry"` or `"geography"`.

* The geometry column should be a `BYTE_ARRAY` with Well Known Binary (WKB) used to define the geometries, as defined in the [encoding](./geoparquet.md#encoding) section of the GeoParquet spec.
* The geometry column should be a `BYTE_ARRAY` with Well Known Binary (WKB) used to define the geometries, as defined in the [encoding](./geoparquet.md#encoding) section of the GeoParquet spec. Alternatively, the geometry column can be stored according to the Point, MultiPoint, MultiLineString, or MultiPolygon memory layouts with separated (struct) coordinates as specified in the [GeoArrow format](https://geoarrow.org/format).


Does this mean that a geometry column with mixed geometry types cannot be encoded as GeoArrow?

Collaborator

Not until apache/parquet-format#44 is merged (if ever)

Collaborator Author

It can (or will be able to, once we sort out the details of geoarrow/geoarrow#43), although it's unclear exactly how we'd do that in Parquet or if it would be useful in Parquet. In any case, it would be a future addition!


Columns with mixed geometry values are quite common for most query engines with geospatial support. Most of the time geometry columns have the umbrella type "geometry" or "geography", and it is not practical to first resolve the subtypes of the geometries before writing out parquet files. I'd look forward to a columnar encoding supporting mixed geometry types as well as geometry collections.

Collaborator

Arrow and Parquet are two different specs. Arrow has a union type, which allows for mixed geometry types in a single column while maintaining constant-time access to any coordinate. Parquet does not today have a union type, so it's impossible to write the Geometry and GeometryCollection arrays in geoarrow/geoarrow#43 natively to Parquet.

GeoArrow implementations are able to statically know whether they have singly-typed geometries in a column, in which case they can write one of the 6 primitive types. GeoArrow implementations will have to fall back to WKB-encoded geometries for mixed-type columns. I don't see how this is something we could realistically change, unless we essentially re-implement union handling in a struct, which would be a big ask for implementors.

@paleolimbot (Collaborator, Author) commented Jan 31, 2024

All good points!

I think Kyle put the summary best:

> In any case, I'd argue that a large majority of use cases will be solved by these 6 primitive types, and we can come back to union types in the future.

In the context of this PR, that would mean that the column option geoarrow_type could in the future be set to "geoarrow.mixed".

I don't think we anticipated that writing mixed geometries in geoarrow to Parquet would be the main use-case. If this is an important use, please chime in on geoarrow/geoarrow#43 with some details! We definitely don't want to represent mixed geometries in a way that prevents them from being used.

> The only way to know is to scan every single record of a big dataset first (to get all geometry types) and then, in a second pass, write the table to GeoParquet

This is only true if there's no ability for a user to supply any information about the encoding. If there is, write_geoparquet(..., encoding = geoarrow("geoarrow.point")) should do it. Typically the user does know this information (even if the database does not).

Collaborator

This discussion is making me think that the GeoParquet spec should not be defined in terms of GeoArrow. Rather it should be defined as "native type" or "flat type" or similar. Then a sentence in prose can mention that it overlaps partially with the GeoArrow spec.

I'm also becoming convinced that the serialized form need not exactly overlap with GeoArrow. On the topic of mixed arrays specifically, as possibly the only one who has written an implementation of GeoArrow mixed arrays, I've become an even stronger proponent of using an Arrow union type for GeoArrow mixed arrays because of its ability to know geometry types statically. So I think the best way forward for GeoParquet (for a future PR) would be to discuss a "struct-union" approach for GeoParquet that is not the same in-memory representation as GeoArrow.

> all database engines that don't use Arrow as the in-memory format have no way to know if a geometry column is a mixed type or not. In other words, they don't know if they can use this GeoArrow encoding when writing GeoParquet files

I think changing the nomenclature will also make it clearer to non-Arrow-based implementations that reading and writing this "native" encoding of GeoParquet does not depend on using Arrow internally.

So my recommendation would be to take out most references to geoarrow from this PR. I.e., we don't want the metadata key called geoarrow_type if there's a possibility that the GeoParquet union type is not the same as the GeoArrow union type.

Collaborator Author

I actually think the strength of this PR is the strong delegation to the GeoArrow specification: I don't think we should be developing two specifications, particularly since we have very little engagement on the GeoArrow spec already. We've spent quite a bit of time documenting the memory layouts for geoarrow in that specification and I don't think it would be productive to copy/paste those here and maintain them independently. I also don't think it would be productive to link to the GeoArrow specification for documentation of all the memory layouts but very pointedly not call it GeoArrow.

It may be that representing mixed geometry is not important in the context of GeoParquet (Maybe WKB is just as fast in the context of compression + IO + Parquet's list type? Have we checked?), or it may be that there is a common memory representation that makes sense for both specifications that will improve interoperability (although that would involve quite a lot of reimplementation on Kyle's end 😬 ).

I don't want us to lose track of the main point here, which is that this PR is mostly about enabling very efficient representations of single-type geometries, which are very commonly the types of files that you might want to put in a giant S3 bucket and scan efficiently.

@jorisvandenbossche (Collaborator) commented Feb 12, 2024

> This discussion is making me think that the GeoParquet spec should not be defined in terms of GeoArrow. Rather it should be defined as "native type" or "flat type" or similar. Then a sentence in prose can mention that it overlaps partially with the GeoArrow spec.

We could go back to the question about which "encoding" values to allow, and instead of a generic "geoarrow" option (with an additional "geoarrow_type" key to then be more specific), have actual encoding options "point", "linestring", "polygon", etc.
(i.e. one of the options we initially discussed was also "geoarrow.point", "geoarrow.linestring", etc., but then just dropping the "geoarrow." prefix)

For the rest, it is mostly a question about how to document this: how to phrase this exactly in the specification, how strongly to tie it to geoarrow (or just reference as mostly similar), how much to duplicate the details of the memory layout, etc. But that's all "details" about the best way to document it, while keeping the actual specification (what ends up in the metadata in a file) agnostic to geoarrow.

I think I am somewhat convinced by @kylebarron's points on this, and like the idea of having the actual spec changes not use "geoarrow" (and then we can still debate how much to use the term in the explanation of it).

For example, as long as the specification would exactly overlap (or be a strict subset of geoarrow), we can still point to the geoarrow spec for the details to avoid too much duplication (and having to maintain two versions). And this is also easy to change then in the future if we would want to introduce differences.

> I also don't think it would be productive to link to the GeoArrow specification for documentation of all the memory layouts but very pointedly not call it GeoArrow.

On the other hand, for an implementation of GeoParquet in some library that has nothing to do with Arrow (doesn't use an Arrow implementation under the hood), the "geoarrow" name is also somewhat uninformative, when strictly looking at it from a GeoParquet point of view.

Collaborator Author

Perhaps it's worth crafting a new PR that uses the language you all are hoping for with a draft implementation? I don't currently have the bandwidth to do that but am happy to review!

@@ -12,11 +12,7 @@ This is version 1.1.0-dev of the GeoParquet specification. See the [JSON Schema

## Geometry columns

Geometry columns MUST be stored using the `BYTE_ARRAY` parquet type. They MUST be encoded as [WKB](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary).
Collaborator

Can we keep the information of the first sentence somewhere? (i.e. that for a WKB encoding, the geometry column MUST be stored using the BYTE_ARRAY parquet type)

(you kept the "Implementation note" just below that also mentions BYTE_ARRAY, but that is not as specific as the above)

Collaborator Author

Done! We could also add some more in here about the Parquet physical description of how nesting works (but maybe in a future PR?)

@paleolimbot (Collaborator, Author) commented Feb 27, 2024

> So my recommendation would be to take out most references to geoarrow from this PR

Done! The values for "encoding" are now "point", "linestring", "polygon", "multipoint", "multilinestring" and "multipolygon". This more accurately reflects the purpose of the encodings and reduces confusion regarding Arrow/Parquet. We could alternatively limit this list to "point" and see how it goes, since representing points is the main place where the current performance of GeoParquet is limiting uptake.
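
For reference, a minimal sketch of what the "geo" column metadata could look like under this naming (the surrounding fields are illustrative, not part of this PR):

{
  "version": "1.1.0-dev",
  "primary_column": "geometry",
  "columns": {
    "geometry": {
      "encoding": "point",
      "geometry_types": ["Point"]
    }
  }
}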

> Does this mean that a geometry column with mixed geometry types cannot be encoded as GeoArrow?

The encoding of mixed geometries is independent of this PR...if the GeoParquet community finds a useful way to represent mixed geometries and can demonstrate a performance or usability benefit, another encoding can be added! This frees GeoArrow to represent things in the ways that make sense there (e.g., using unions) and GeoParquet to represent things in the ways that make sense in the serialized/on-disk/column-statistics-are-important world.

I'll update the draft implementation shortly to reflect the changes in this PR!

@paleolimbot (Collaborator, Author)

I also updated geoarrow/geoarrow-python#41 (full implementation of the language here) to reflect the new language.


Implementation note: when using WKB encoding with the ecosystem of Arrow libraries, Parquet types such as `BYTE_ARRAY` might not be directly accessible. Instead, the corresponding Arrow data type can be `Arrow::Type::BINARY` (for arrays whose elements can be indexed through a 32-bit index) or `Arrow::Type::LARGE_BINARY` (64-bit index). It is recommended that GeoParquet readers are compatible with both data types, and writers preferably use `Arrow::Type::BINARY` (thus limiting to row groups with content smaller than 2 GB) for broader compatibility.

##### Geometry type specific encodings (based on GeoArrow)
Collaborator

Maybe

##### Native encodings (based on GeoArrow)

?

Collaborator

Yeah, I don't know what's the best terminology to use here. "Geometry type specific" might also not be true in the future if we add a union-like version.

I was a bit hesitant to use "native" because that's also vague / can be interpreted in multiple ways, but happy to use this if others think that is better

Collaborator

I like the distinction between "serialized" and "native" encodings. "Geometry type specific encodings" is a mouthful for a header

Collaborator

Yes, that way it makes sense. Updated

@jorisvandenbossche (Collaborator)

@paleolimbot I pushed a small commit more clearly separating the explanation of WKB and geoarrow-like encodings into subsections in the "encoding" section (just moving some things around, no actual text change).

That way I also moved the mention that WKB should use BYTE_ARRAY to that subsection as well. And will push another commit showing an example parquet type for the point geometry.

Comment on lines 125 to 128
optional group geometry (List) {
  // the parts of the MultiPolygon
  repeated group list {
    optional group element (List) {
Collaborator

This is verbose, but this matches how the Parquet docs specify a List type: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists

In the Spatial Parquet paper, they made this shorter (https://arxiv.org/pdf/2209.02158.pdf), and essentially something like (translating their syntax to our layout):

optional group geometry {
  repeated group part {
    repeated group coordinate {
      required double x;
      required double y;
    }
  }
}

While that is clearer / more readable for a reader not super familiar with Parquet details, it also has value to stick with the way the Parquet docs do this (and e.g. how printing the Parquet schema of a file with pyarrow will also do this).

Comment on lines 129 to 131
// the rings of one Polygon
repeated group list {
  optional group element (List) {
Collaborator

Another element here is that the Parquet spec says to use the generic "list" and "element" field names, which means we don't use the more descriptive "polygons" / "rings" / "vertices" names as in GeoArrow

(until recently, pyarrow would honor the names in the Arrow schema and write a "non-compliant" Parquet file, but now it defaults to writing a compliant one, which you can still turn off with use_compliant_nested_type=False)

Comment on lines +135 to +136
required double x;
required double y;
Collaborator

At the moment I only made the inner x and y as "required" here, but I think I should also mark the other sub-levels (for polygons/rings/vertices) as required? I think we said only the geometry itself can be null in the GeoArrow spec.

@jorisvandenbossche (Collaborator) commented Mar 11, 2024

Changed all to "required" except for the top-level geometry group.

Another question is how strict this should be followed. For example, PyArrow will write those to Parquet as "optional" anyway, even if there are no nulls, unless you ensure your actual Arrow schema also explicitly uses nullable=False for all fields.

Collaborator Author

I think it would be a little strange if the schema written by pyarrow for a GeoArrow-encoded array didn't match the text here. Does marking them as required provide a storage optimization?

Collaborator

From a quick test with tiny data, you save a bit of space.

PyArrow can write the schema as shown here, as long as you ensure your fields are marked as non-nullable in the Arrow schema. But I don't think we are being strict about doing that in our geoarrow implementations... (in general in PyArrow this nullability flag is not used much, but I think other ecosystems use it more diligently)
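
A minimal sketch of that kind of size comparison (file names hypothetical):

import os
import pyarrow as pa
from pyarrow import parquet

xs = pa.array([float(i) for i in range(1000)])
# Same data written once with a required (non-nullable) leaf and once
# with the default optional leaf.
schema_required = pa.schema([pa.field("x", pa.float64(), nullable=False)])

parquet.write_table(pa.table({"x": xs}, schema=schema_required), "required.parquet")
parquet.write_table(pa.table({"x": xs}), "optional.parquet")
print(os.path.getsize("required.parquet"), os.path.getsize("optional.parquet"))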

Collaborator Author

We can probably make sure that everything except the outer layer is written as non-nullable fields regardless of how GeoArrow deals with that. In the proof-of-concept there's a step where the extension information is stripped, at which point we could also set the non-nullable-ness if this really does matter.

Once wrapped in an extension type, I don't think the explicit non-nullable-ness of the storage is all that useful. The assumptions of buffer data are governed by the extension type, and we say there are no nulls allowed except at the outer level.

Collaborator

How big is the data saving? Does Parquet essentially not need a nullability buffer for these arrays when the child arrays are saved as non-nullable?

Looking at the Parquet docs, it says (https://parquet.apache.org/docs/file-format/nulls/):

> Nullity is encoded in the definition levels (which is run-length encoded).

and then here (https://parquet.apache.org/docs/file-format/data-pages/):

> For data that is required, the definition levels are skipped

So essentially, if a field is required, some stored data can be skipped (the definition levels for that field). Although, given that they are run-length encoded, this data will also be tiny when there are no nulls (because then it's essentially just one constant being run-length encoded, I assume).

See also https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/

Collaborator

For the actual spec here, I think we can recommend required (and show it like that in the example) but mention that it is not required and that using optional fields is also fine (i.e. it's not a MUST, for example in case there is no easy way to control it depending on the Parquet implementation you're using)

Collaborator

> For the actual spec here, I think we can recommend required (and show it like that in the example) but mention that it is not required and that using optional fields is also fine (i.e. it's not a MUST, for example in case there is no easy way to control it depending on the Parquet implementation you're using)

Yeah, that sounds fine to me. Maybe say that there MUST be no null values in child arrays, even if they're marked as optional?

Collaborator

OK, see the last commit (88ae045), I tried to clarify that only top-level nulls are allowed, but that the schema itself can still use optional fields.

Collaborator

(another good reason not to be strict about this is that it is quite hard to create data that conforms to this with pyarrow. For example, if you want to convert existing data to a schema with proper nullable=False fields, just casting doesn't yet work in pyarrow: apache/arrow#33592)

@paleolimbot (Collaborator, Author) left a comment

Thank you for adding the Parquet type information!

Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
@cholmes merged commit 3b5483b into opengeospatial:main on Mar 25, 2024 (2 checks passed)
@rouault (Contributor) commented Mar 25, 2024

Are there (small) Parquet datasets somewhere using this new GeoArrow encoding? I now realize that the current support for GeoArrow in GDAL used the FixedSizeList[2] encoding and not the struct one... so I need to make changes and check interoperability.

@paleolimbot (Collaborator, Author)

I believe geoarrow/geoarrow-python#41 is up to date (I will check tomorrow and generate some test files)

// "multipolygon" geometry column with multiple levels of nesting
optional group geometry (List) {
// the parts of the MultiPolygon
repeated group list {
Collaborator

Is this always named list, or is it ever the Arrow field name? Thinking that for bounding box stats I'll need to find the leaf arrays by name, and for multipolygon it'll be something like geometry.list.list.list.x and geometry.list.list.list.y?

Collaborator

I mentioned this issue above at #189 (comment)

The Parquet spec specifies that those names should be "list" and "element": https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists

But it also mentions that for backwards compatibility you should be able to read lists with other names:

> It is required that the repeated group of elements is named list and that its element field is named element. However, these names may not be used in existing data and should not be enforced as errors when reading.

Until recently, PyArrow / Arrow C++ actually created such "non-compliant" files, but there is a flag to control this, for which the default has been changed to create compliant files by default since pyarrow 13.0 (apache/arrow#29781).

Illustrating this:

import pyarrow as pa
import pyarrow.parquet as pq

point_type = pa.struct([pa.field("x", pa.float64(), nullable=False), pa.field("y", pa.float64(), nullable=False)])
polygon_type = pa.list_(
    pa.field("rings", pa.list_(pa.field("vertices", point_type, nullable=False)), nullable=False)
)
multipolygon_type = pa.list_(pa.field("polygons", polygon_type, nullable=False))

table = pa.table({"geometry": pa.array([[[[{'x': 1.0, 'y': 2.0}, {'x': 2.0, 'y': 3.0}]]]], type=multipolygon_type)})

>>> pq.write_table(table, "test_multipolygon_non_nullable-compliant.parquet")
>>> pq.read_metadata("test_multipolygon_non_nullable-compliant.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7efda2333b40>
required group field_id=-1 schema {
  optional group field_id=-1 geometry (List) {
    repeated group field_id=-1 list {
      required group field_id=-1 element (List) {
        repeated group field_id=-1 list {
          required group field_id=-1 element (List) {
            repeated group field_id=-1 list {
              required group field_id=-1 element {
                required double field_id=-1 x;
                required double field_id=-1 y;
              }
            }
          }
        }
      }
    }
  }
}

>>> pq.write_table(table, "test_multipolygon_non_nullable-non_compliant.parquet", use_compliant_nested_type=False)
>>> pq.read_metadata("test_multipolygon_non_nullable-non_compliant.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7efda135fd00>
required group field_id=-1 schema {
  optional group field_id=-1 geometry (List) {
    repeated group field_id=-1 list {
      required group field_id=-1 polygons (List) {
        repeated group field_id=-1 list {
          required group field_id=-1 rings (List) {
            repeated group field_id=-1 list {
              required group field_id=-1 vertices {
                required double field_id=-1 x;
                required double field_id=-1 y;
              }
            }
          }
        }
      }
    }
  }
}

Specifically for Arrow C++ (or other implementations that do this), we do store the original Arrow schema in the Parquet metadata, and so we could preserve the custom names on roundtrip. But it seems we are not doing that currently.

But this actually raises the question: for the GeoArrow spec itself, how "required" are those custom field names? Are you supposed to rename them after reading a compliant Parquet file that will not have those names?
At the moment the GeoArrow spec itself seems to mention "should" for the names (https://github.com/geoarrow/geoarrow/blob/main/format.md#memory-layouts)

> Thinking that for bounding box stats I'll need to find the leaf arrays by name, and for multipolygon it'll be something like geometry.list.list.list.x and geometry.list.list.list.y?

To get back to that initial question: it will be something like geometry.list.element.list.element.x (with the number of list.element repetitions depending on the geometry type)
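
A minimal sketch of locating those coordinate leaves with pyarrow (file name hypothetical):

from pyarrow import parquet

md = parquet.read_metadata("example-multipolygon.parquet")
for i in range(md.num_columns):
    # e.g. "geometry.list.element.list.element.list.element.x"
    path = md.schema.column(i).path
    if path.endswith(".x") or path.endswith(".y"):
        stats = md.row_group(0).column(i).statistics
        print(path, stats.min, stats.max)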

Collaborator

> But it also mentions that for backwards compatibility you should be able to read lists with other names:

👍

> But this actually raises the question: for the GeoArrow spec itself, how "required" are those custom field names? Are you supposed to rename them after reading a compliant Parquet file that will not have those names?

My implementation will likely rename field names to the GeoArrow prescribed ones.

> To get back to that initial question: it will be something like geometry.list.element.list.element.x (with the number of list.element repetitions depending on the geometry type)

👍

@paleolimbot (Collaborator, Author)

Ok! I checked and it does seem to be generating files as described here. Here is a short snippet that should be able to generate/read them:

# git clone https://github.com/paleolimbot/geoarrow-python.git
# git switch geoparquet-geoarrow
# pip install geoarrow-pyarrow/
import pyarrow as pa
import geoarrow.pyarrow as ga
from geoarrow.pyarrow import io

def convert_to_geoparquet(src, dst, geometry_encoding=io.geoparquet_encoding_geoarrow()):
    tab = io.read_pyogrio_table(src)
    io.write_geoparquet_table(tab, dst, geometry_encoding=geometry_encoding)

def make_test_geoparquet(src, dst, geometry_encoding=io.geoparquet_encoding_geoarrow()):
    array = ga.array(src)
    tab = pa.table([array], names=["geometry"])
    io.write_geoparquet_table(tab, dst, geometry_encoding=geometry_encoding)

# Example convert
# More files: https://geoarrow.org/data
# There is a slight bug in these...the last element was supposed to be NULL for the examples
# but because sf in R doesn't support them, they didn't end up being written that way. They
# also have explicit planar CRSs because of sf.
convert_to_geoparquet(
    "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.1.0/example/example-linestring.gpkg",
    "example-linestring.parquet"
)

# Example read
tab = io.read_geoparquet_table("example-linestring.parquet")
tab["geom"].type.crs
tab["geom"][0].to_shapely()
ga.to_geopandas(tab["geom"])  # Doesn't work with nulls yet

# Example create
make_test_geoparquet(["POINT (0 1)", "POINT (30 10)", None], "test.parquet")

# Example read
tab = io.read_geoparquet_table("test.parquet")

...and here are a few pre-baked examples:

examples.zip

@rouault (Contributor) commented Mar 26, 2024

@paleolimbot I believe there's a slight non-conformity with the schema of your samples attached to #189 (comment). The x/y fields of the struct are currently marked as optional, whereas the spec requires them to be required. Cf.:

$ ~/arrow/cpp/build/release/parquet-dump-schema examples/point-geoarrow.parquet
required group field_id=-1 schema {
  optional int32 field_id=-1 row_number;
  optional group field_id=-1 geom {
    optional double field_id=-1 x;
    optional double field_id=-1 y;
  }
}

@jorisvandenbossche (Collaborator)

While we should provide some example data that fully follows the recommendations, those files are compliant in the sense that we recommend marking those fields as required, but that's not required (optional is fine as well):

> There MUST NOT be any null values in the child fields and the x/y/z coordinate
> fields. Only the outer optional "geometry" group is allowed to have nulls (i.e.
> representing a missing geometry). This MAY be indicated in the Parquet schema by
> using required group elements, as in the example above, but this is not
> required and optional fields are permitted (as long as the data itself does
> not contain any nulls).

@rouault (Contributor) commented Mar 26, 2024

> those files are compliant in the sense that we recommend marking those fields as required, but that's not required (optional is fine as well):

ah ok, missed that. Sorry for the noise.

@paleolimbot (Collaborator, Author)

Noise welcome! There are almost certainly other nonconformities too (but hopefully these are better than nothing to get you started).

@jorisvandenbossche (Collaborator) commented Mar 26, 2024

We should probably also provide a version of the included example.parquet in geoarrow encoding, as well as for the nz-building-outlines.parquet.

For the geoarrow encoding, given that it is geometry-type specific, it might make sense to include example files in this repo for all types. Or update the geoarrow-data repo to also include Parquet versions of the files (currently I think it hosts only .arrow files?)

@paleolimbot (Collaborator, Author)

> Or update the geoarrow-data repo to also include Parquet versions of the files (currently I think it hosts only .arrow files?)

Definitely! (And also fix the existing data).

Merging this pull request closes: Metadata encoding options for GeoArrow-encoded columns in GeoParquet metadata (#185)