
Feature identifiers #120

Closed
tschaub opened this issue Oct 7, 2022 · 11 comments · Fixed by #121

tschaub (Collaborator) commented Oct 7, 2022

Has there been discussion around including an id_column or something similar in the file metadata? I think it would assist in round-tripping features from other formats if it were known which column represented the feature identifier.

It looks like GDAL has a FID layer creation option. But I'm assuming that the information about which column was used when writing would be lost when reading from the Parquet file (@rouault would need to confirm).

I grant that this doesn't feel "geo" specific, and there may be existing conventions in Parquet that would be appropriate.

rouault (Contributor) commented Oct 7, 2022

GDAL currently uses an extension gdal:schema metadata domain where it puts information such as the fid column name or GDAL-specific typing, e.g.:

{"fid":"fid","columns":{"AREA":{"type":"Real"},"EAS_ID":{"type":"Integer64"},"PRFEDEA":{"type":"String","width":16}}}

kylebarron (Collaborator) commented:

I think several libraries have come up with their own custom metadata to solve this problem; not sure there's any "native" Parquet solution. For example, Pandas adds its own metadata for its index columns (essentially equivalent to a feature id column) and data types so that it's reliably able to round-trip data.

import pandas as pd
import pyarrow.parquet as pq
import json

df = pd.DataFrame({'a': [2, 3, 4, 5], 'b': ['a', 'b', 'c', 'd']}).set_index('a')
df.to_parquet('test.parquet')

# Inspect the file-level Parquet metadata that pandas/pyarrow wrote
meta = pq.read_metadata('test.parquet')
json.loads(meta.metadata[b'pandas'])
# Output:
{'index_columns': ['a'],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': {'encoding': 'UTF-8'}}],
 'columns': [{'name': 'b',
   'field_name': 'b',
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': None},
  {'name': 'a',
   'field_name': 'a',
   'pandas_type': 'int64',
   'numpy_type': 'int64',
   'metadata': None}],
 'creator': {'library': 'pyarrow', 'version': '7.0.0'},
 'pandas_version': '1.3.1'}

cholmes (Member) commented Oct 10, 2022

Interesting. Seems like it might be good for GeoParquet to at least make a recommendation for reliable FID. Perhaps just use what Pandas does, and then GeoPandas and GDAL could align?

TomAugspurger (Collaborator) commented:

I'll note that pandas uses a list of fields for index_columns, to support its MultiIndex (I think similar to a composite key in some flavors of SQL). It looks like GDAL uses a string, at least in Even's example.
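
As a quick illustration of that difference (hypothetical data, continuing the pandas/pyarrow pattern from the earlier comment), a MultiIndex produces more than one entry in index_columns:

import json
import pandas as pd
import pyarrow.parquet as pq

# A composite key / MultiIndex yields multiple entries in 'index_columns'.
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'v': [0.1, 0.2]}).set_index(['a', 'b'])
df.to_parquet('multi.parquet')

meta = pq.read_metadata('multi.parquet')
print(json.loads(meta.metadata[b'pandas'])['index_columns'])
# ['a', 'b']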

rouault (Contributor) commented Oct 10, 2022

It looks like GDAL uses a string, at least in Even's example.

Yes, GDAL only supports a single numeric column as the feature identifier.

kylebarron (Collaborator) commented Oct 10, 2022

Perhaps just use what Pandas does, and then GeoPandas and GDAL could align?

I'd be hesitant to mimic the Pandas metadata exactly because it's very Python-specific, at least the pandas_type and numpy_type.

One other thing to note is that pandas stores its metadata in two places: once in the file metadata under the pandas key, and again in the file metadata within the Arrow schema metadata. Arrow also encodes its schema in a file-level metadata key named 'ARROW:schema', which is a base64 encoding of a custom binary format (I forget exactly where this layout is defined; @jorisvandenbossche maybe would know?), so that it's possible to exactly round-trip Arrow data through Parquet without losing metadata, e.g. for extension types.

# continuing from above
meta.metadata[b'ARROW:schema']
# b'/////6gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAAQCAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAADcAQAABAAAAM0BAAB7ImluZGV4X2NvbHVtbnMiOiBbImEiXSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVuY29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogImIiLCAiZmllbGRfbmFtZSI6ICJiIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogImEiLCAiZmllbGRfbmFtZSI6ICJhIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfV0sICJjcmVhdG9yIjogeyJsaWJyYXJ5IjogInB5YXJyb3ciLCAidmVyc2lvbiI6ICI3LjAuMCJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMS4zLjEifQAAAAYAAABwYW5kYXMAAAIAAABMAAAABAAAAMz///8AAAECEAAAABwAAAAEAAAAAAAAAAEAAABhAAAACAAMAAgABwAIAAAAAAAAAUAAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQUQAAAAGAAAAAQAAAAAAAAAAQAAAGIAAAAEAAQABAAAAA=='

# pyarrow reads the Arrow schema (and the embedded metadata) back from the file
arrow_schema = pq.read_schema('test.parquet')
print(arrow_schema)
# b: string
# a: int64
# -- schema metadata --
# pandas: '{"index_columns": ["a"], "column_indexes": [{"name": null, "fiel' ...

print(arrow_schema.metadata)
# {b'pandas': b'{"index_columns": ["a"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "b", "field_name": "b", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "a", "field_name": "a", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "creator": {"library": "pyarrow", "version": "7.0.0"}, "pandas_version": "1.3.1"}'}

Having metadata in the Arrow schema might make it more interoperable than the pandas-specific metadata, but it's not usable for non-Arrow-based readers (the Java world, I think). If we want to be able to round-trip from e.g. GeoJSON, which has feature identifiers, maybe it would make sense to add an option to the GeoParquet-specific metadata describing an id column (though we'd have to guard against our metadata being out of sync with other metadata).

kylebarron (Collaborator) commented:

GDAL currently uses an extension gdal:schema metadata domain where it puts information such as the fid column name or GDAL-specific typing, e.g.:

Other than the fid, is there a reason why GDAL can't reuse the Arrow schema metadata, given that GDAL is using the Arrow C++ libraries to read/write Parquet? Do GDAL types not map 1:1 to Arrow types?

rouault (Contributor) commented Oct 22, 2022

Do GDAL types not map 1:1 to Arrow types?

At 99%, but there are subtleties. For example, GDAL can have a hint for the maximum width of a string, or JSON or UUID "subtypes" for strings. Those are generally not essential metadata, but GDAL can write them for perfect round-tripping of its abstraction model.
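
For context, these hints live on OGR field definitions rather than anything Parquet-specific. A rough sketch of the two examples mentioned (the field names here are illustrative, not taken from the thread):

from osgeo import ogr

# Maximum-width hint for a string field (no direct Arrow equivalent;
# GDAL records it as "width" in its gdal:schema metadata instead).
name_field = ogr.FieldDefn('PRFEDEA', ogr.OFTString)
name_field.SetWidth(16)

# String "subtype" hint marking the field content as JSON.
props_field = ogr.FieldDefn('props', ogr.OFTString)
props_field.SetSubType(ogr.OFSTJSON)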

jorisvandenbossche (Collaborator) commented:

I would personally not follow (or be inspired by) the pandas metadata here. The information in there is very pandas-specific.

If we want to support this, I think a simple new (and optional) field in our geo metadata where you can say which column is the ID column would be sufficient? (similarly to what GDAL does now, i.e. "fid":"fid")

One other thing to note is that pandas stores its metadata in two places: once in the file metadata under the pandas key, and again in the file metadata within the Arrow schema metadata. Arrow also encodes its schema in a file-level metadata key named 'ARROW:schema', which is a base64 encoding of a custom binary format (I forget exactly where this layout is defined; @jorisvandenbossche maybe would know?), so that it's possible to exactly round-trip Arrow data through Parquet without losing metadata, e.g. for extension types.

@kylebarron yeah, this duplication of the pandas metadata in both the normal Parquet metadata and inside the serialized schema is not super ideal. There is a JIRA about this: https://issues.apache.org/jira/browse/ARROW-14303
The format itself is the IPC message for a schema, base64-encoded: https://arrow.apache.org/docs/dev/cpp/parquet.html#serialization-details
But, as you mention yourself, I don't think this is an interesting place to put FID information, as it is not readily available to Parquet readers that are not based on an Arrow library.
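
A rough sketch of decoding that key with pyarrow, continuing from the 'test.parquet' example above (the ARROW:schema value is a base64-encoded Arrow IPC schema message):

import base64
import pyarrow as pa
import pyarrow.parquet as pq

# Decode the base64 payload and read it back as an Arrow IPC schema message.
file_meta = pq.read_metadata('test.parquet').metadata
ipc_message = base64.b64decode(file_meta[b'ARROW:schema'])
schema = pa.ipc.read_schema(pa.BufferReader(ipc_message))
print(schema)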

tschaub (Collaborator, Author) commented Oct 22, 2022

If we want to support this, I think a simple new (and optional) field in our geo metadata where you can say which column is the ID column would be sufficient? (similarly to what GDAL does now, i.e. "fid":"fid")

I was envisioning the same. Though unless space is a concern, I think “id_column” is a bit friendlier and fits well with “primary_column” (nit).
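
Purely as a sketch of what that could look like (none of this is in the spec; the "id_column" key, the version string, and the data below are hypothetical placeholders):

import json
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical "geo" metadata with an optional id column entry alongside
# "primary_column". The "id_column" key is NOT part of the GeoParquet spec.
geo_metadata = {
    "version": "0.4.0",
    "primary_column": "geometry",
    "id_column": "fid",
    "columns": {"geometry": {"encoding": "WKB"}},
}

# Attach it under the b"geo" key of the schema metadata and write the file
# (the geometry bytes here are placeholders, not valid WKB).
table = pa.table({"fid": [1, 2], "geometry": [b"\x00", b"\x00"]})
table = table.replace_schema_metadata({b"geo": json.dumps(geo_metadata).encode("utf-8")})
pq.write_table(table, "example.parquet")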

cholmes (Member) commented Oct 24, 2022

The call on 10/24 decided we should add some 'best practice' guidance noting that Parquet doesn't have a primary key, so it's not part of this spec. GDAL should keep doing what it does; if other systems are also interested in round-tripping a feature id, then we'd consider it as part of the spec.
