
Feature identifiers #120

Closed
tschaub opened this issue Oct 7, 2022 · 11 comments · Fixed by #121

tschaub (Collaborator) commented Oct 7, 2022

Has there been discussion around including an id_column or something similar in the file metadata? I think it would assist in round-tripping features from other formats if it were known which column represented the feature identifier.

It looks like GDAL has a FID layer creation option. But I'm assuming that the information about which column was used when writing would be lost when reading from the Parquet file (@rouault would need to confirm).

I grant that this doesn't feel "geo" specific, and there may be existing conventions in Parquet that would be appropriate.

rouault (Contributor) commented Oct 7, 2022

GDAL currently uses an extension gdal:schema metadata domain where it puts information such as the fid column name or GDAL-specific typing, e.g.:

{"fid":"fid","columns":{"AREA":{"type":"Real"},"EAS_ID":{"type":"Integer64"},"PRFEDEA":{"type":"String","width":16}}}

kylebarron (Collaborator) commented:

I think several libraries have come up with their own custom metadata to solve this problem; not sure there's any "native" Parquet solution. For example, Pandas adds its own metadata for its index columns (essentially equivalent to a feature id column) and data types so that it's reliably able to round-trip data.

import pandas as pd
import pyarrow.parquet as pq
import json

df = pd.DataFrame({'a': [2, 3, 4, 5], 'b': ['a', 'b', 'c', 'd']}).set_index('a')
df.to_parquet('test.parquet')

# Inspect the file-level Parquet metadata that pandas/pyarrow wrote
meta = pq.read_metadata('test.parquet')
json.loads(meta.metadata[b'pandas'])
# Output:
{'index_columns': ['a'],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': {'encoding': 'UTF-8'}}],
 'columns': [{'name': 'b',
   'field_name': 'b',
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': None},
  {'name': 'a',
   'field_name': 'a',
   'pandas_type': 'int64',
   'numpy_type': 'int64',
   'metadata': None}],
 'creator': {'library': 'pyarrow', 'version': '7.0.0'},
 'pandas_version': '1.3.1'}

cholmes (Member) commented Oct 10, 2022

Interesting. Seems like it might be good for GeoParquet to at least make a recommendation for reliable FID. Perhaps just use what Pandas does, and then GeoPandas and GDAL could align?

TomAugspurger (Collaborator) commented:

I'll note that pandas uses a list of fields for index_columns, to support its MultiIndex (I think similar to a composite key in some flavors of SQL). It looks like GDAL uses a string, at least in Even's example.
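
As a quick illustration of that difference (hypothetical data, continuing the pandas/pyarrow pattern from the earlier comment), a MultiIndex produces more than one entry in index_columns:

import json
import pandas as pd
import pyarrow.parquet as pq

# A composite key / MultiIndex yields multiple entries in 'index_columns'.
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'v': [0.1, 0.2]}).set_index(['a', 'b'])
df.to_parquet('multi.parquet')

meta = pq.read_metadata('multi.parquet')
print(json.loads(meta.metadata[b'pandas'])['index_columns'])
# ['a', 'b']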

rouault (Contributor) commented Oct 10, 2022

It looks like GDAL uses a string, at least in Even's example.

Yes, GDAL only supports a single numeric column as the feature identifier.

kylebarron (Collaborator) commented Oct 10, 2022

Perhaps just use what Pandas does, and then GeoPandas and GDAL could align?

I'd be hesitant to mimic the Pandas metadata exactly because it's very Python-specific, at least the pandas_type and numpy_type.

One other thing to note is that pandas stores its metadata in two places: once in the file metadata under the pandas key, and again in the file metadata within the Arrow schema metadata. Arrow also encodes its schema in a file-level metadata key named 'ARROW:schema', which is a base64 encoding of a custom binary format (I forget exactly where this layout is defined; @jorisvandenbossche maybe would know?), so that it's possible to exactly round-trip Arrow data through Parquet without losing metadata, e.g. for extension types.

# continuing from above
meta.metadata[b'ARROW:schema']
# b'/////6gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAAQCAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAADcAQAABAAAAM0BAAB7ImluZGV4X2NvbHVtbnMiOiBbImEiXSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVuY29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogImIiLCAiZmllbGRfbmFtZSI6ICJiIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogImEiLCAiZmllbGRfbmFtZSI6ICJhIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfV0sICJjcmVhdG9yIjogeyJsaWJyYXJ5IjogInB5YXJyb3ciLCAidmVyc2lvbiI6ICI3LjAuMCJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMS4zLjEifQAAAAYAAABwYW5kYXMAAAIAAABMAAAABAAAAMz///8AAAECEAAAABwAAAAEAAAAAAAAAAEAAABhAAAACAAMAAgABwAIAAAAAAAAAUAAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQUQAAAAGAAAAAQAAAAAAAAAAQAAAGIAAAAEAAQABAAAAA=='

# pyarrow reads the Arrow schema (and the embedded metadata) back from the file
arrow_schema = pq.read_schema('test.parquet')
print(arrow_schema)
# b: string
# a: int64
# -- schema metadata --
# pandas: '{"index_columns": ["a"], "column_indexes": [{"name": null, "fiel' ...

print(arrow_schema.metadata)
# {b'pandas': b'{"index_columns": ["a"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "b", "field_name": "b", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "a", "field_name": "a", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "creator": {"library": "pyarrow", "version": "7.0.0"}, "pandas_version": "1.3.1"}'}

Having metadata in the Arrow schema might make it more interoperable than the pandas-specific metadata, but it's not usable for non-Arrow-based readers (the Java world, I think). If we want to be able to round-trip from e.g. GeoJSON, which has feature identifiers, maybe it would make sense to add an option to the GeoParquet-specific metadata describing an id column (though we'd have to guard against our metadata being out of sync with other metadata).

kylebarron (Collaborator) commented:

GDAL currently uses an extension gdal:schema metadata domain where it puts information such as the fid column name or GDAL-specific typing, e.g.:

Other than the fid, is there a reason why GDAL can't reuse the Arrow schema metadata, given that GDAL is using the Arrow C++ libraries to read/write Parquet? Do GDAL types not map 1:1 to Arrow types?

rouault (Contributor) commented Oct 22, 2022

Do GDAL types not map 1:1 to Arrow types?

At 99%, but there are subtleties. For example, GDAL can have a hint for the maximum width of a string, or JSON or UUID "subtypes" for strings. Those are generally not essential metadata, but GDAL can write them for perfect round-tripping of its abstraction model.
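
For context, these hints live on OGR field definitions rather than anything Parquet-specific. A rough sketch of the two examples mentioned (the field names here are illustrative, not taken from the thread):

from osgeo import ogr

# Maximum-width hint for a string field (no direct Arrow equivalent;
# GDAL records it as "width" in its gdal:schema metadata instead).
name_field = ogr.FieldDefn('PRFEDEA', ogr.OFTString)
name_field.SetWidth(16)

# String "subtype" hint marking the field content as JSON.
props_field = ogr.FieldDefn('props', ogr.OFTString)
props_field.SetSubType(ogr.OFSTJSON)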

jorisvandenbossche (Collaborator) commented:

I would personally not follow (or be inspired by) the pandas metadata here. The information in there is very pandas-specific.

If we want to support this, I think a simple new (and optional) field in our geo metadata where you can say which column is the ID column would be sufficient? (similarly to what GDAL does now, i.e. "fid":"fid")

One other thing to note is that pandas stores its metadata in two places: once in the file metadata under the pandas key, and again in the file metadata within the Arrow schema metadata. Arrow also encodes its schema in a file-level metadata key named 'ARROW:schema', which is a base64 encoding of a custom binary format (I forget exactly where this layout is defined; @jorisvandenbossche maybe would know?), so that it's possible to exactly round-trip Arrow data through Parquet without losing metadata, e.g. for extension types.

@kylebarron yeah, this duplication of the pandas metadata in both the normal Parquet metadata and inside the serialized schema is not super ideal. There is a JIRA about this: https://issues.apache.org/jira/browse/ARROW-14303
The format itself is the IPC message for a schema, base64-encoded: https://arrow.apache.org/docs/dev/cpp/parquet.html#serialization-details
But, as you mention yourself, I don't think this is an interesting place to put FID information, as it is not readily available to Parquet readers that are not based on an Arrow library.
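
A rough sketch of decoding that key with pyarrow, continuing from the 'test.parquet' example above (the ARROW:schema value is a base64-encoded Arrow IPC schema message):

import base64
import pyarrow as pa
import pyarrow.parquet as pq

# Decode the base64 payload and read it back as an Arrow IPC schema message.
file_meta = pq.read_metadata('test.parquet').metadata
ipc_message = base64.b64decode(file_meta[b'ARROW:schema'])
schema = pa.ipc.read_schema(pa.BufferReader(ipc_message))
print(schema)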

tschaub (Collaborator, Author) commented Oct 22, 2022

If we want to support this, I think a simple new (and optional) field in our geo metadata where you can say which column is the ID column would be sufficient? (similarly to what GDAL does now, i.e. "fid":"fid")

I was envisioning the same. Though unless space is a concern, I think “id_column” is a bit friendlier and fits well with “primary_column” (nit).
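
Purely as a sketch of what that could look like (none of this is in the spec; the "id_column" key, the version string, and the data below are hypothetical placeholders):

import json
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical "geo" metadata with an optional id column entry alongside
# "primary_column". The "id_column" key is NOT part of the GeoParquet spec.
geo_metadata = {
    "version": "0.4.0",
    "primary_column": "geometry",
    "id_column": "fid",
    "columns": {"geometry": {"encoding": "WKB"}},
}

# Attach it under the b"geo" key of the schema metadata and write the file
# (the geometry bytes here are placeholders, not valid WKB).
table = pa.table({"fid": [1, 2], "geometry": [b"\x00", b"\x00"]})
table = table.replace_schema_metadata({b"geo": json.dumps(geo_metadata).encode("utf-8")})
pq.write_table(table, "example.parquet")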

cholmes (Member) commented Oct 24, 2022

The call on 10/24 decided we should add some 'best practice' guidance noting that Parquet doesn't have a primary key, so it's not part of this spec. GDAL should keep doing what it does; if other systems are also interested in round-tripping a feature id, then we'd consider it as part of the spec.
