Introduce bounding box column definition #191

jwass · 2023-11-27T20:17:38Z

Introduces the per-row bounding box definition following the discussion from #188.

This initial proposal goes with a definition that looks like:

"geometry": {
  "encoding": "WKB",
  "geometry_types": [],
  "bbox": [-71.1, 42.3, -70.9, 42.5],
  "geometry_bbox": {"column": "bbox"}
}

Add documentation to the top-level GeoParquet description and definition.
Add the geometry_bbox definition to the json schema
Add a few tests. Verify with pytest test_json_schema.py
Updates the examples and metadata schema poetry run python generate_example.py and poetry run python update_example_schemas.py

* Add documentation to the top-level GeoParquet description and definition. * Add the geometry_bbox definition to the json schema * Add a few tests. Verify with `pytest test_json_schema.py`

jwass · 2023-11-27T20:20:23Z

How do people feel about the bbox column field names xmin, xmax, ymin, ymax?

When putting this together I noticed that geopandas gdf.bounds uses minx, maxx, miny, maxy. @paleolimbot made a good argument that we use the xmin style in other places in GeoParquet so should stick with that, but is worth calling out here.

paleolimbot

Thank you for writing this up! Just a few comments to get the discussion going.

Sorry for missing your earlier comment before you got to the PR, but I do think that nesting the "covering" or "simplifed_geometry" or "auxiliary_columns" or "something_better" is a more future-proof option: this proposal is for a single struct column that represents the minimum bounding rectangle, but we might need other encodings (or multiple columns for one encoding) and I think it will better communicate that those concepts are related if they are grouped in the same object.

A bounding box column MUST be a Parquet struct

It's worth checking the Parquet specification about what they call a "struct". I believe there's something about "definition levels" involved and that struct might be the Arrow term for the (far more user-friendly) way to conceptualize it.

I think a struct is the best way to go here, but it is worth noting that another option would be to refer to top-level columns. The only reason to do this is that some Parquet scanners can't leverage column statistics for struct columns (although this might not matter for Parquet scanners that will be using this column). This should get revealed in our testing of the bbox column concept!

For three dimensions the additional fields zmin and zmax MUST be present.

In GeoPackage the dimensions of the envelope are independent of the geometry...while some indexes care about 3D, most indexes only care about 2D and it might be an attractive option to only ever write an XY bounding box here.

jorisvandenbossche · 2023-11-28T18:41:55Z

A bounding box column MUST be a Parquet struct

It's worth checking the Parquet specification about what they call a "struct". I believe there's something about "definition levels" involved and that struct might be the Arrow term for the (far more user-friendly) way to conceptualize it.

Parquet indeed does not speak about "struct" columns, and this is unfortunately one of those areas that is completely underdocumented. I think the relevant term is "group", but in https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types that is only used to explain LIST and MAP logical type.

When translating an Arrow struct column, I get the following Parquet schema:

import pyarrow as pa
struct_arr = pa.StructArray.from_arrays([[1, 2]]*4, names=["xmin", "xmax", "ymin", "ymax"])
table = pa.table({"bbox": struct_arr})

import pyarrow.parquet as pq
pq.write_table(table, "test_bbox_schema.parquet")
pq.read_metadata("test_bbox_schema.parquet").schema

<pyarrow._parquet.ParquetSchema object at 0x7f02f01dac40>
required group field_id=-1 schema {
  optional group field_id=-1 bbox {
    optional int64 field_id=-1 xmin;
    optional int64 field_id=-1 xmax;
    optional int64 field_id=-1 ymin;
    optional int64 field_id=-1 ymax;
  }
}

So something like "a group field with 4 child fieds with names ..."?

Change the geometry_bbox to the broader "covering" section. Update tests and examples. Made some documentation updates: * Parquet schema -> group * Do not require zmin/zmax if geometries have 3 dimensions

jwass · 2023-11-28T20:08:49Z

I do think that nesting the "covering" or "simplifed_geometry" or "auxiliary_columns" or "something_better" is a more future-proof option: this proposal is for a single struct column that represents the minimum bounding rectangle, but we might need other encodings (or multiple columns for one encoding) and I think it will better communicate that those concepts are related if they are grouped in the same object.

Sounds good. I just made the updates to go to that original proposal with the "covering" section. I admit to not having a better word so let's stick with it for now.

A bounding box column MUST be a Parquet struct

It's worth checking the Parquet specification about what they call a "struct". I believe there's something about "definition levels" involved and that struct might be the Arrow term for the (far more user-friendly) way to conceptualize it.

Good call. I went with @jorisvandenbossche's language here. And I agree, the Parquet docs/website are really lacking on this.

I think a struct is the best way to go here, but it is worth noting that another option would be to refer to top-level columns. The only reason to do this is that some Parquet scanners can't leverage column statistics for struct columns (although this might not matter for Parquet scanners that will be using this column). This should get revealed in our testing of the bbox column concept!

Agreed. We can change as necessary depending on the outcome of that testing.

For three dimensions the additional fields zmin and zmax MUST be present.

In GeoPackage the dimensions of the envelope are independent of the geometry...while some indexes care about 3D, most indexes only care about 2D and it might be an attractive option to only ever write an XY bounding box here.

Thanks. I changed the language to read zmin and zmax MAY be present but not required.

jwass · 2023-11-28T20:09:41Z

So something like "a group field with 4 child fieds with names ..."?

@jorisvandenbossche Thanks! I made some updates and went with this wording.

rouault · 2023-11-28T20:38:55Z

I'm just thinking that the use of a struct/group with child fields could potentially be a complication for some implementations. I mean the "flat" model is quite a common one among geospatial formats. I struggled a bit to deal with Parquet nested construct when mapping that to the OGR data model. I've bitten that bullet now, so the current proposal is perfectly fine to me. Just wanted to raise that potential issue for other "simple" hypothetical implementations. But perhaps dealing with Parquet means that dealing with nested constructs is an implicit requirement.

kylebarron · 2023-11-28T20:54:55Z

What about specifically listing the names of the paths in the schema:

"covering": {
  "box": {
    "xmin": "bbox.minx",
    "ymin": "bbox.miny",
    "xmax": "bbox.maxx",
    "ymax": "bbox.maxy",
  }
}

Then it can be straightforward to either use dot-access notation as above, or to refer to columns at the top-level:

"covering": {
  "box": {
    "xmin": "minx",
    "ymin": "miny",
    "xmax": "maxx",
    "ymax": "maxy",
  }
}

This also means that a struct point column could be defined using something like

"covering": {
  "box": {
    "xmin": "geometry.x",
    "ymin": "geometry.y",
    "xmax": "geometry.x",
    "ymax": "geometry.y",
  }
}

And a geoarrow linestring column could be defined using something like

"covering": {
  "box": {
    "xmin": "geometry.list.element.list.element.x",
    "ymin": "geometry.list.element.list.element.y",
    "xmax": "geometry.list.element.list.element.x",
    "ymax": "geometry.list.element.list.element.y",
  }
}

(The exact syntax is up for future discussion; that's what pyarrow.parquet shows for the
value of path_in_schema for parquet metadata of a geoarrow linestring file, tested with this file)

<pyarrow._parquet.ColumnChunkMetaData object at 0x114d2a1b0>
  file_offset: 509677
  file_path:
  physical_type: DOUBLE
  num_values: 49494
  path_in_schema: geometry.list.element.list.element.x
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x106814270>
      has_min_max: True
      min: 215869.21669693943
      max: 781793.0855915633
      null_count: 0
      distinct_count: None
      num_values: 49494
      physical_type: DOUBLE
      logical_type: None
      converted_type (legacy): NONE
  compression: SNAPPY
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 16382
  data_page_offset: 409875
  total_compressed_size: 493295
  total_uncompressed_size: 493604

<pyarrow._parquet.ColumnChunkMetaData object at 0x114cfe7a0>
  file_offset: 1002855
  file_path:
  physical_type: DOUBLE
  num_values: 49494
  path_in_schema: geometry.list.element.list.element.y
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x114d2b1f0>
      has_min_max: True
      min: 4790520.432107779
      max: 5237313.334385211
      null_count: 0
      distinct_count: None
      num_values: 49494
      physical_type: DOUBLE
      logical_type: None
      converted_type (legacy): NONE
  compression: SNAPPY
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 509812
  data_page_offset: 903053
  total_compressed_size: 493043
  total_uncompressed_size: 493596

jorisvandenbossche · 2023-11-28T21:02:11Z

And a geoarrow linestring column could be defined using something like

Do we want to enable this? Because that gives different abilities. I would expect this covering to be actual min/max values, while for geoarrow that would be a list of values. While for row group statistics this will work the same, that is not the case for plain row filtering.

kylebarron · 2023-11-28T21:09:18Z

And a geoarrow linestring column could be defined using something like

Do we want to enable this? Because that gives different abilities. I would expect this covering to be actual min/max values, while for geoarrow that would be a list of values. While for row group statistics this will work the same, that is not the case for plain row filtering.

I think the question is whether covering: box is intended to be used for more than just predicate pushdown. I had assumed that covering: box was only intended to enable statistics in the row group metadata, but that readers should not assume how to use the actual data within the column.

Given that creating bounding boxes from a geometry column (once in memory) should be very fast, I think the main goal should be on solidifying a flexible way to describe the MBR of the chunk in the row group metadata, without defining what the values of that column must be. I like the above because it's flexible among struct and top-level columns while also being future proof to a geoarrow encoding.

(Though to be clear, I'm not arguing for this PR to explicitly allow geoarrow; for now we can only document top-level columns and a struct column)

jwass · 2023-11-28T21:19:45Z

This also means that a struct point column could be defined using something like

"covering": {
  "box": {
    "xmin": "geometry.x",
    "ymin": "geometry.y",
    "xmax": "geometry.x",
    "ymax": "geometry.y",
  }
}

One nice outcome of this approach is it avoids the duplication (triplication?) of point coordinates that @rouault raised in #188 (comment).

Although I think I share in @jorisvandenbossche's concern that allowing it to point to non-primitive float columns makes simple row filtering more complicated.

rouault · 2023-11-28T21:21:41Z

Although I think I share in @jorisvandenbossche's concern that allowing it to point to non-primitive float columns makes simple row filtering more complicated.

agreed on that: I would expect the content of the xmin/... field to be a scalar to KISS

Maxxen · 2023-11-28T22:48:57Z

Just chiming in with my 2c. DuckDB can't do predicate pushdowns into struct columns yet (hopefully someday!), and I imagine other simpler readers may also struggle to fully utilize nested types, so I would think the "flat" representation is the most practical/has the lowest "barrier of entry". Although from a conceptual standpoint its kind of a bummer because I feel like this is the perfect use case for structs :/.

migurski · 2023-11-29T06:01:36Z

Just popping in with a 👍 to this proposal. GDAL’s new support for column stats in 3.8.0 combined with spatially-clustered rows and relatively-small row groups has shown excellent performance in my tests with country-scale OSM datasets.

I hadn’t seen the work on structs here so I was experimenting with separate xmin/ymin/xmax/ymax columns. They seem to be equivalent for read performance.

cholmes · 2023-11-29T15:04:50Z

What about specifically listing the names of the paths in the schema

I like the upside of being able to handle both structs and top level columns, and handling points without duplication would definitely be a win.

But I worry about the complexity - a reader would need to at least know about all the options, and if it's not capable of predict pushdown would have to know that some fields wouldn't work. And ideally a reader would notify the user in some way that it's not working / implemented. And writer implementations would have to make choices of how to do it. And either push that decision on the user, or else we recommend a 'default'. So if we just allow any field specified then it feels like we just sorta punt on the question of which approach to do, and we push the complexity out to the ecosystem. And it'd be harder to just look at a parquet file and know just by looking at the data if it's got a bbox - if we have one set approach of fields of set names then you can be pretty sure if a dataset has that then it is implementing a spatial optimization.

So I think I lean towards first picking one way and seeing if it will work for everyone. But it could be interesting to use this way to specify that one way if we want to future proof it a bit - like we list the path names, but we align on just having one locked in value for the first release, like how we did the geometry encoding. Maybe that's what you were proposing? I wasn't sure if the idea was people could specify any field they wanted to serve as the box, or if it'd just be a way to specify a set of options.

cholmes · 2023-11-29T15:12:28Z

One thing that I do think should accompany this is some recommendations on how to actually use this construct to accelerate spatial querying. Like talk through different indexing options, explain a bit about how you need to order the file and how the row group stats work. I think this would likely fit best in some sort of 'best practice' document that sits alongside the spec. And I don't think it needs to be part of this PR, but would be good to have in place for a first 1.1-beta release. I'd be up to take a first crack at it, but would be good for others who are deeper into spatial indexing and parquet to help out.

paleolimbot · 2023-11-29T15:38:42Z

It seems like top-level columns are the only way for this proposal to fulfill its goal (allow GeoParquet files to be written such that querying a large file for a small area does not have awful performance) for at least one engine we care about (DuckDB)? The other engine I know can't do this is cudf (less of a priority perhaps). It is a bummer because a struct column is definitely the right concept for this.

That would mean:

"covering": {
  "box": {
    "xmin": "name_of_top_level_xmin",
    "ymin": "name_of_top_level_ymin",
    "xmax": "name_of_top_level_xmax",
    "ymax": "name_of_top_level_ymax"
  }
}

A future version of the spec could allow nesting with "xmin": ["name_of_top_level_column", "xmin"] (there's no guarantee that a column name does not contain a period). I don't think we should start by allowing both (or maybe we will never get to a place where we allow both).

jwass · 2023-11-29T16:02:17Z

It seems like top-level columns are the only way for this proposal to fulfill its goal (allow GeoParquet files to be written such that querying a large file for a small area does not have awful performance) for at least one engine we care about (DuckDB)? The other engine I know can't do this is cudf (less of a priority perhaps).

Agreed. We had discussed surveying existing Parquet readers prior to making a decision, but it feels like we already know the answer without having to do more investigation...?

As @paleolimbot said, we could go with @kylebarron's suggestion of having the column specified per box element, but the spec could enforce that they MUST point to fields in the root of the schema. We always have the option later to relax that constraint without breaking backward compatibility.

Separately, I like the idea of the increased flexibility but wonder if it allows too many personal preferences to exist in the ecosystem... Someone could write:

"covering": {
  "box": {
    "xmin": "minx",
    "ymin": "miny",
    "xmax": "maxx",
    "ymax": "maxy"
  }
}

because they prefer minx over xmin. I think I'd find the competing opinions annoying if many different ones are out there. I guess we could just have a best practice recommendation of how to define these. If we want to recommend plain xmin/... we'd need a suggestion for multiple geometries too.

paleolimbot · 2023-11-29T16:12:48Z

If we want to recommend plain xmin/... we'd need a suggestion for multiple geometries too.

Perhaps a prefix? <geometry column name>_xmin, <geometry column name>_ymin, etc?

kylebarron · 2023-11-29T16:24:23Z

Someone could write ... because they prefer minx over xmin

Is that a problem? I don't think readers should be assuming anything about the column names except what is defined in the metadata. If someone wants minx, that seems totally fine as long as the metadata is correctly specified.

jwass · 2023-12-03T14:37:00Z

Is that a problem? I don't think readers should be assuming anything about the column names except what is defined in the metadata. If someone wants minx, that seems totally fine as long as the metadata is correctly specified.

As someone that spends a lot of time writing SQL, the real interface of a Parquet file for me is the table schema's columns and data types because most SQL query engines can't read the GeoParquet metadata - at least not for a while. If I want to reuse queries across different GeoParquet datasets, they'll have to be dynamic over column names which is a pain especially if determining the columns requires reading the data before I know how to query it.

I guess in my opinion, the more standardized the Parquet schema is - column names and types - and the less flexible the metadata allows it to be, the easier GeoParquet will be to work with and the more successful it will be. This is a very similar discussion to #169 as well. The broader difference of philosophy is beyond the scope of this PR but a fun one that I think will keep coming up :)

Either way, I think Chris's proposal to constrain the allowed values could make this work and future-proof to make it more open-ended later.

jwass · 2023-12-03T14:40:42Z

Thanks everyone for the feedback and good discussion. Here's my read on what would be a good path forward:

Specify every bounding box column

We can adopt @kylebarron's idea to define the bounding box like this:

"covering": {
  "box": {
    "xmin": "xmin",
    "ymin": "ymin",
    "xmax": "xmax",
    "ymax": "ymax"
  }
}

Constraints on the initial release that can be relaxed later

Top-Level Columns

Since duckdb and others don't do predicate pushdown on nested columns, the initial definition should require the box columns to be at the schema root. Otherwise we'll be negating most performance benefits for some systems.

Restricting Column Names

Per @cholmes's comment, we can start by restricting the column names and pinning them to be xmin, xmax, etc. initially. This will keep GeoParquet datasets consistent in their schemas for some time, but give us some future-proofing to allow nested columns (bbox.xmin) or arbitrary column names ({"xmin": "minx", ...}") in the future without having to redesign the box spec presumably.

What to do about files with multiple geometries?

We can't require the same names for a different geometry column. I like @paleolimbot's idea of constraining the names to be a prefix that is the geometry column name + "_". So the bbox columns for my_other_geom are my_other_geom_xmin, my_other_geom_ymin, etc. But I also like the idea of the primary geometry's bbox just being called xmin rather than e.g. geometry_xmin. In other words, we could say the primary geometry column's prefix is the empty string, but all others are column_name_.

Thoughts? I can make the PR changes soon if there's agreement on the direction here.

tschaub · 2023-12-03T17:12:19Z

@jwass - I'm not sure if this is what you are suggesting as well, but what about making no changes to the geo metadata (yet) and only saying that if a GeoParquet file has float columns name like <geometry-column>_xmin (etc), those may be treated as values representing the geometry bounds. This would allow us to later add geo metadata that pointed to other columns serving the same purpose (and to potentially have other covering types).

jwass · 2023-12-03T19:04:56Z

I'm not sure if this is what you are suggesting as well, but what about making no changes to the geo metadata (yet) and only saying that if a GeoParquet file has float columns name like <geometry-column>_xmin (etc), those may be treated as values representing the geometry bounds.

@tschaub That's not quite what I was suggesting since I am proposing adding this to the metadata. That way the systems that can interact with the metadata can still find it in a definitive way. We could define the bounding box more as convention for now without touching the geo metadata yet, and/or add it to the compatible parquet spec. But it seemed like most folks wanted it in there.

tschaub · 2023-12-03T20:51:44Z

If we don't make a change to the geo metadata, it would give us a chance to actually demonstrate that the min/max columns have some value in real use cases - without needing a breaking release if we find out that the initial proposal wasn't quite right.

migurski · 2023-12-04T02:24:03Z

The min/max columns create a single-level spatial index and are absolutely useful in selection use cases. This example shows a 44MB Geoparquet file for the country of Ghana with only data for the town of Kumasi selected:

https://nbviewer.org/urls/gist.githubusercontent.com/migurski/e88e6d34d8686ed54c8a3ed158061409/raw/3ba8640c38f4a87574e976000450d5e17ade4c17/index.ipynb

Preview, showing in-Kumasi row groups in blue vs. non-Kumasi row groups in gray:

With row groups of 100 features and everything sorted by quadkey of one feature corner we use the column stats alone to identify 106 row groups out of 1,001 total that must be read. This saves almost 90% of required reads. I did some tests with other space-filling curves and they don’t make a substantial difference. It’s the min/max column stats and the clustering alone that do it.

For larger source files this would be a requirement to use the network efficiently. Worldwide OSM road and building data can be dozens or hundreds of GB in size and I wouldn’t be able to afford a table scan to efficiently query a region.

paleolimbot

I don't have anything additional to add here!

My only question is what happens when there is a period in column_name. I don't think it's hard to imagine that it would live with the lefthand side (i.e., not one of the struct fields), but also not particularly important to solve right now!

jwass · 2024-01-30T01:13:10Z

My only question is what happens when there is a period in column_name. I don't think it's hard to imagine that it would live with the lefthand side (i.e., not one of the struct fields), but also not particularly important to solve right now!

@paleolimbot This came up today and @jorisvandenbossche mentioned it might be common in R. My guess is in some SQL engines you'd refer to a column like that with quotes like:

SELECT
    "my.column.with.periods".xmin
FROM t

We could require something similar such that periods are nested groups but when they are within quotes they're just part of the column name. The quotes would just have to be properly escaped in the JSON.

Describe bounding box as the "coordinate range" of the geometry which is language used by the GeoJSON spec

jorisvandenbossche · 2024-01-30T07:22:26Z

A more robust way to specify a "nested path", is to specify each part of the path as a separate string. In Python we would naturally use a tuple for that, like ("column_name", "xmin"). In JSON that translates to an array, I don't know if that would be strange to use here.

I was quickly looking into how I would implement this bbox filter with pyarrow, and there I will actually need to convert the dotted path we have here to a tuple or list of paths, because to specify a filter when reading I can do something like filter=(pc.field(("bbox", "xmin")) > xx) & (..) & ... The current pc.field(..) helper function to specify a field doesn't support a dotted path, but instead requires you to be specific (which allows to specify a top-level field name with dots in the name). We could of course change this in pyarrow, but I also like the explicitness.

That means that I would need to split the dotted path. A simple bbox_metadata["xmin"].split(".") would work in most cases, but that would not work when there are quoted sub-paths inside the string. Supporting something like "\"my.column.with.periods\".xmin" would make this quite a bit more complicated.

rouault · 2024-01-30T11:31:34Z

In JSON that translates to an array, I don't know if that would be strange to use here.

I was also thinking to that

Or, as we want all the xmin/ymin/xmax/ymax to be subfields of the same field:

"covering": {
  "bbox": {
    "struct_field": "bbox",
    "xmin_subfield": "xmin",
    "ymin_subfield": "ymin",
    "xmax_subfield": "xmax",
    "ymax_subfield": "ymax"
  }
}

jwass · 2024-01-30T14:34:04Z

If we want to be the most future-proof, maybe we should just go with the array:

"covering": {
  "bbox": {
    "xmin": ["bbox", "xmin"],
    "ymin": ["bbox", "ymin"],
    "xmax": ["bbox", "xmax"],
    "ymax": ["bbox", "ymax"],
  }
}

Kind of annoying but it should make things much easier for readers that need to find the right field in the schema as @jorisvandenbossche pointed out. We can constrain the json schema so that each element has length 2 and fix the final element which gives lots of flexibility to relax those later.

I'm good with it.

Update bbox so that each element is an array. For example: ["bbox", "xmin"] which represents a top-level bbox struct field with an xmin member beneath it

Accidentally committed in last version for testing

jwass · 2024-01-30T19:00:53Z

Updated to new format as described above. @jorisvandenbossche I re-requested a review just to make sure this in line with your thinking here.

rouault · 2024-02-01T16:32:44Z

FYI, I've a GDAL implementation of this in OSGeo/gdal#9185

csringhofer · 2024-02-01T19:15:07Z

format-specs/geoparquet.md

+
+##### bbox covering encoding
+
+Including a per-row bounding box can be useful for accelerating spatial queries by allowing consumers to inspect row group bounding box summary statistics. Furthermore a bounding box may be used to avoid complex spatial operations by first checking for bounding box overlaps. This field captures the column name and fields containing the bounding box of the geometry for every row.


I think that pages should be also mentioned besides row groups, as bbox column also works for page level indexes. Wrote a longer rant about this in #188 (comment)

Just updated to reflect this. Thanks!

@csringhofer Can you take a look at 40ebb37?

Pretty small change just to also mention page indexes in addition to row groups. There aren't any other discussions of row groups in the spec so that's the only change. Let me know if you think that works.

csringhofer · 2024-02-01T19:27:10Z

format-specs/geoparquet.md

+```
+"covering": {
+    "bbox": {
+        "xmin": ["bbox", "xmin"],


At first the column name "bbox" was confusing to me as it is the same as the json struct name. Maybe "bbox_col" would be clearer? Afterwards it could be added that if there is a single geometry column, then the recommended bbox column name is simply "bbox".

@csringhofer Are you referring to the example as in:

"covering": { "bbox": { "xmin": ["bbox_col", "xmin"], ...

I'm hesitant to change it because our recommendation really is to call it "bbox". I agree it's a bit confusing. If there's anything to rename it might be the "bbox" under covering. It used to be called just "box" in earlier versions of the PR but now that it's just the bbox columns, I put it back. I'm open to other ideas though.

It's indeed a bit confusing here in the example, but for the actual spec I would also keep "bbox" both for the recommended column name as the key here in the metadata.

Maybe another example could be added with bbox column for multiple geometry columns.
It is also not clear what is the recommended name in that case - there is an example with "any_column", but using something like "geom_column_name_bbox" seems clearer to me.

but using something like "geom_column_name_bbox" seems clearer to me.

FWIW, that's the convention I've used in the GDAL writer

csringhofer · 2024-02-01T19:35:20Z

format-specs/geoparquet.md

+
+### Bounding Box Columns
+
+A bounding box column MUST be a Parquet group field with 4 child fields named `xmin`, `xmax`, `ymin`, and `ymax` representing the geometry's coordinate range. For three dimensions the additional fields `zmin` and `zmax` MAY be present but are not required. The fields MUST be of Parquet type `FLOAT` or `DOUBLE`. The repetition of a bounding box column MUST match the geometry column's [repetition](#repetition). A row MUST contain a bounding box value if and only if the row contains a geometry value. In cases where the geometry is optional and a row does not contain a geometry value, the row MUST NOT contain a bounding box value.


Maybe it could be added the the coordinates must have the same type, e.g. all FLOAT or all DOUBLE

About repetition: while the struct's repetition must be the same as geometry column's, the nested fields' repetation must be "required", right? So they can never be null if their parent is not null.

Maybe it could be added the the coordinates must have the same type, e.g. all FLOAT or all DOUBLE

I concur with that! This is an assumption I've actually made in my GDAL implementation

Made the update in 40ebb37. Let me know if okay or not.

* Mention that bounding boxes are useful for both row group statistics and page indexes * Require that bbox encoding columns be all of the same type (float or double)

format-specs/geoparquet.md

cholmes · 2024-02-13T16:26:44Z

Can we merge this in? Any remaining work to do? If I don't hear in a day or two I'm going to hit the green button.

csringhofer · 2024-02-13T17:59:03Z

format-specs/geoparquet.md

+
+### Bounding Box Columns
+
+A bounding box column MUST be a Parquet group field with 4 child fields named `xmin`, `xmax`, `ymin`, and `ymax` representing the geometry's coordinate range. For three dimensions the additional fields `zmin` and `zmax` MAY be present but are not required. The fields MUST be of Parquet type `FLOAT` or `DOUBLE` and all columns MUST use the same type. The repetition of a bounding box column MUST match the geometry column's [repetition](#repetition). A row MUST contain a bounding box value if and only if the row contains a geometry value. In cases where the geometry is optional and a row does not contain a geometry value, the row MUST NOT contain a bounding box value.


xmin, xmax, ymin, and ymax representing the geometry's coordinate range.

I am confused about the semantics in case of spherical geometries.
"range" suggests to me that xmin should be always <= xmax, but this is not true in the spherical case, right? How to represent a bbox that crosses the 180.0° line of longitude or contains a pole? Or such a bbox cannot be represented?

It would be nice to add some guidance/warning about interpreting in the spherical case.

For the bounding box at the file-level metadata (https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#bbox), we refer to the GeoJSON spec:

For geometries in a geographic coordinate reference system, longitude and latitude values are listed for the most southwesterly coordinate followed by values for the most northeasterly coordinate. This follows the GeoJSON specification (RFC 7946, section 5), which also describes how to represent the bbox for a set of geometries that cross the antimeridian.

Would that work here as well / provide sufficient information?

Of course, in contrast with the bbox metadata for the full file which is just a JSON array of 4 numbers, here we need to give the numbers explicit field names. The current proposal uses "xmin", "xmax", etc, which is not ideal for the geographical case.
(in general a simple bbox might not be ideal for geographical data anyway, and the current proposal leaves it open to add other "covering" types later)

Good point, @csringhofer.

I agree with @jorisvandenbossche that we'd just defer to how GeoJSON defines bbox for anti-meridian crossings. (There's another discussion about also adopting GeoJSON's recommendation to split geometries at the anti-meridian but that's probably further out).

I suppose we could rename the bbox fields to "south", "west", etc. but it feels off to me and I think xmin, xmax is still the right name with the caveats Joris listed. It's also worth mentioning that naive queries against the bbox won't be effective for anti-meridian crossings, including row group filtering optimizations. But I think engines with specific knowledge of geospatial data could handle anti-meridian crossings appropriately including row group filtering

I added this sentence to the docs: As with the top-level [bbox](#bbox) column, the values follow the GeoJSON specification (RFC 7946, section 5), which also describes how to represent the bbox for geometries that cross the antimeridian.

But I think engines with specific knowledge of geospatial data could handle anti-meridian crossings appropriately including row group filtering

I assume that would require the geometries crossing the anti-meridian are in dedicated row groups. If you start mixing geometries crossing the A.M. and geometries not crossing it, then the min(minx), min(miny), max(maxx), max(maxy) statistics aren't going to make any sense.

e.g if you have a geometry [-10,-10,10,10] and a [170,-10,-170,10] (crossing A-M) in a single row group, then the row group stats are going to be [-10,-10,10,10] and thus the geometry crossing A-M will not be selected.

Hmm, yes, that's a good point. Silently not selecting a row if you are not aware of this, doesn't sound good.
Shall we for now just say that this feature (bbox column) doesn't support A-M crossing geometries, and thus cannot be used for data including such geometries?

Good point, @rouault. We discussed this at the last GeoParquet meeting and decided that we'll just say that antimeridian crossings aren't supported for now. I'll also make an issue to track that

Created #198 to continue that discussion.

jorisvandenbossche

Thanks for the last updates!

Going to merge this now, we can always iterate a bit further on the details in follow-ups (such as the antimeridian case for which you already opened an issue)

jwass added 2 commits November 27, 2023 14:22

Add geometry_bbox proposal definition

0882e85

* Add documentation to the top-level GeoParquet description and definition. * Add the geometry_bbox definition to the json schema * Add a few tests. Verify with `pytest test_json_schema.py`

Update generate_example to include geometry_bbox parameter

e74af7d

jwass force-pushed the per_row_bbox branch from f94b15a to e74af7d Compare November 27, 2023 20:29

paleolimbot reviewed Nov 28, 2023

View reviewed changes

Change to covering from geometry_bbox. Doc updates.

8790bb4

Change the geometry_bbox to the broader "covering" section. Update tests and examples. Made some documentation updates: * Parquet schema -> group * Do not require zmin/zmax if geometries have 3 dimensions

kylebarron mentioned this pull request Dec 1, 2023

Improve reader options kylebarron/parquet-wasm#81

Open

paleolimbot approved these changes Jan 29, 2024

View reviewed changes

Update docs.

8bfb5b7

Describe bounding box as the "coordinate range" of the geometry which is language used by the GeoJSON spec

jwass added 2 commits January 30, 2024 13:57

Update covering bbox fields to be parquet schema paths

44097bf

Update bbox so that each element is an array. For example: ["bbox", "xmin"] which represents a top-level bbox struct field with an xmin member beneath it

Fix schema version to 7.

e0b3d47

Accidentally committed in last version for testing

jwass requested a review from jorisvandenbossche January 30, 2024 19:00

rouault mentioned this pull request Feb 1, 2024

Parquet: implement using and writing bounding box column, for faster spatial filtering OSGeo/gdal#9185

Merged

csringhofer reviewed Feb 1, 2024

View reviewed changes

Update docs

40ebb37

* Mention that bounding boxes are useful for both row group statistics and page indexes * Require that bbox encoding columns be all of the same type (float or double)

csringhofer reviewed Feb 2, 2024

View reviewed changes

format-specs/geoparquet.md Outdated Show resolved Hide resolved

jorisvandenbossche approved these changes Feb 2, 2024

View reviewed changes

"row groups or page indexes" -> "row groups and page indexes"

4a3c59c

csringhofer reviewed Feb 13, 2024

View reviewed changes

jwass added 2 commits February 15, 2024 13:00

Mention anti-meridian crossing

c540208

Mention not supporting antiemeridian crossings

f3edcaf

jwass mentioned this pull request Mar 11, 2024

Antimeridian Crossings and bbox #198

Open

Remove trailing whitespace for linter

eb6d3e0

jorisvandenbossche approved these changes Mar 11, 2024

View reviewed changes

jorisvandenbossche merged commit 0309eac into opengeospatial:main Mar 11, 2024
2 checks passed

This was referenced Mar 25, 2024

Proposal for a "geo-arrow" format #1

Closed

Add an spatial index in geoparquet format as optional #13

Open

wiesehahn mentioned this pull request Apr 5, 2024

update geoparquet description to include bounding box column cloudnativegeo/cloud-optimized-geospatial-formats-guide#112

Open

kylebarron mentioned this pull request Apr 16, 2024

Support for GeoParquet 1.1 metadata stac-utils/stac-geoparquet#38

Closed

cholmes added this to the 1.1 milestone May 2, 2024


		##### bbox covering encoding

		Including a per-row bounding box can be useful for accelerating spatial queries by allowing consumers to inspect row group bounding box summary statistics. Furthermore a bounding box may be used to avoid complex spatial operations by first checking for bounding box overlaps. This field captures the column name and fields containing the bounding box of the geometry for every row.


		### Bounding Box Columns

		A bounding box column MUST be a Parquet group field with 4 child fields named `xmin`, `xmax`, `ymin`, and `ymax` representing the geometry's coordinate range. For three dimensions the additional fields `zmin` and `zmax` MAY be present but are not required. The fields MUST be of Parquet type `FLOAT` or `DOUBLE`. The repetition of a bounding box column MUST match the geometry column's [repetition](#repetition). A row MUST contain a bounding box value if and only if the row contains a geometry value. In cases where the geometry is optional and a row does not contain a geometry value, the row MUST NOT contain a bounding box value.

Introduce bounding box column definition #191

Introduce bounding box column definition #191

Conversation

jwass commented Nov 27, 2023 • edited

jwass commented Nov 27, 2023 • edited

paleolimbot left a comment

Choose a reason for hiding this comment

jorisvandenbossche commented Nov 28, 2023

jwass commented Nov 28, 2023

jwass commented Nov 28, 2023

rouault commented Nov 28, 2023

kylebarron commented Nov 28, 2023 • edited

jorisvandenbossche commented Nov 28, 2023

kylebarron commented Nov 28, 2023 • edited

jwass commented Nov 28, 2023 • edited

rouault commented Nov 28, 2023

Maxxen commented Nov 28, 2023

migurski commented Nov 29, 2023 • edited

cholmes commented Nov 29, 2023

cholmes commented Nov 29, 2023

paleolimbot commented Nov 29, 2023

jwass commented Nov 29, 2023 • edited

paleolimbot commented Nov 29, 2023

kylebarron commented Nov 29, 2023

jwass commented Dec 3, 2023

jwass commented Dec 3, 2023 • edited

Specify every bounding box column

Constraints on the initial release that can be relaxed later

Top-Level Columns

Restricting Column Names

What to do about files with multiple geometries?

tschaub commented Dec 3, 2023

jwass commented Dec 3, 2023

tschaub commented Dec 3, 2023

migurski commented Dec 4, 2023 • edited

paleolimbot left a comment

Choose a reason for hiding this comment

jwass commented Jan 30, 2024

jorisvandenbossche commented Jan 30, 2024

rouault commented Jan 30, 2024 • edited

jwass commented Jan 30, 2024

jwass commented Jan 30, 2024

rouault commented Feb 1, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jwass Feb 2, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cholmes commented Feb 13, 2024

csringhofer Feb 13, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jwass Feb 15, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche left a comment

Choose a reason for hiding this comment

jwass commented Nov 27, 2023 •

edited

jwass commented Nov 27, 2023 •

edited

kylebarron commented Nov 28, 2023 •

edited

kylebarron commented Nov 28, 2023 •

edited

jwass commented Nov 28, 2023 •

edited

migurski commented Nov 29, 2023 •

edited

jwass commented Nov 29, 2023 •

edited

jwass commented Dec 3, 2023 •

edited

migurski commented Dec 4, 2023 •

edited

rouault commented Jan 30, 2024 •

edited

jwass Feb 2, 2024 •

edited

csringhofer Feb 13, 2024 •

edited

jwass Feb 15, 2024 •

edited