
Add test data covering different native (geoarrow-based) encodings #204

Merged (9 commits) on May 28, 2024

Conversation

@jorisvandenbossche (Collaborator) commented Apr 16, 2024

Addressing part of #199 (small example data)

This is similar to #134, but focuses on the new geoarrow-based encodings (I should also update that older PR).

This adds some tiny example data for the different geometry types (a simplified sketch of the approach follows the list):

  • for each geometry type I save the data as WKB and as the native encoding (with some default metadata), and additionally as plain text, in case that is useful for another tool as the source of truth to compare with (e.g. to check that reading the parquet file was correct)
  • the script relies only on pyarrow for creating the data and writing the parquet files (and shapely for creating the WKB data), so nothing geoarrow-specific (e.g. no arrow extension types, no interleaved coordinates), just what is in the geoparquet spec
  • for now I only added 2D data, not yet variants with z coordinates, and there are probably more corner cases that could be added (e.g. a MultiPolygon containing an empty Polygon, and the like)
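
A minimal sketch of the general idea per geometry type, assuming hypothetical file names and a trimmed-down "geo" metadata dict (the real script fills in more fields and more types):

import json
import pyarrow as pa
import pyarrow.parquet as pq
import shapely

# WKB column built with shapely + pyarrow only (no geoarrow dependencies)
wkt = ["POINT (30 10)", "POINT (40 40)"]
wkb = pa.array([shapely.to_wkb(shapely.from_wkt(s)) for s in wkt], type=pa.binary())
table = pa.table({"col": list(range(len(wkt))), "geometry": wkb})

# trimmed-down "geo" metadata for this sketch
geo_metadata = {
    "version": "1.1.0",
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB", "geometry_types": ["Point"]}},
}
table = table.replace_schema_metadata({b"geo": json.dumps(geo_metadata).encode()})
pq.write_table(table, "example-point-wkb.parquet")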

In #134 I also saved the metadata in a separate file, which makes it easier to validate and review that it is correct. I could do that here as well, in case that is useful.
(I should still add validation of the generated files with the JSON schema.)
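
(A minimal sketch of what that validation could look like; the file paths here are assumptions:)

import json
import jsonschema
import pyarrow.parquet as pq

# validate the "geo" metadata of a generated file against the spec's JSON schema
schema = json.loads(open("format-specs/schema.json").read())
metadata = pq.read_schema("test_data/example-point-wkb.parquet").metadata
jsonschema.validate(instance=json.loads(metadata[b"geo"]), schema=schema)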

Does this look useful in this form? Any suggestions on how to approach it differently?

@paleolimbot (Collaborator) left a comment

Just a few notes! (I haven't checked the actual files yet)

Collaborator:

These look like .csv files to me...should they have a .csv extension?

geometries_wkt = [
    "POINT (30 10)",
    "POINT EMPTY",
]
Collaborator:

Should there also be a version of these that contains NULLs, or a version that contains Z values?

Collaborator (Author):

Added some null values!
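
(For reference, a minimal sketch of how null entries can be included, rather than the script's exact code; pyarrow turns Python None into a null entry in the array:)

import pyarrow as pa
import shapely

wkt = ["POINT (30 10)", "POINT EMPTY", None]
# None stays a null value in the resulting Arrow (and Parquet) column
wkb = pa.array(
    [None if s is None else shapely.to_wkb(shapely.from_wkt(s)) for s in wkt],
    type=pa.binary(),
)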

Comment on lines 89 to 90:

geometries = pa.array(
    [[(30, 10), (10, 30), (40, 40)], []], type=linestring_type
)
Collaborator:

Maybe it would be less error-prone to construct these geometries from the WKT? Then we don't have to manually check that the WKT is equivalent to what is here.

Or maybe it would be better to use GeoJSON and pyarrow's JSON reader to infer geometry columns? Though then you'd need to convert from interleaved to separated coords, which is a bit involved for this test data here 😕

Collaborator (Author):

Yeah, I actually tried to automate writing this hard-coded data a bit, by parsing the WKT with shapely and converting to GeoJSON. The main problem is that I then need to convert the inner lists to tuples (as JSON only has lists) to make construction of the separated struct type work with pyarrow (a tuple element can be converted to a struct, but a list element cannot). And that gives some convoluted logic...
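
For illustration, a rough sketch of that kind of logic for linestrings (the helper and type names here are made up for the example, not taken from the script):

import json
import pyarrow as pa
import shapely

coord_type = pa.struct([("x", pa.float64()), ("y", pa.float64())])
linestring_type = pa.list_(coord_type)

def linestrings_from_wkt(wkt_strings):
    coords = []
    for wkt in wkt_strings:
        geojson = json.loads(shapely.to_geojson(shapely.from_wkt(wkt)))
        # inner [x, y] lists must become (x, y) tuples before pyarrow will
        # accept them as struct values
        coords.append([tuple(pt) for pt in geojson["coordinates"]])
    return pa.array(coords, type=linestring_type)

geometries = linestrings_from_wkt(
    ["LINESTRING (30 10, 10 30, 40 40)", "LINESTRING EMPTY"]
)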

@jorisvandenbossche (Collaborator, Author) commented Apr 17, 2024

cc @rouault would this (have been) useful for testing in GDAL? (xref #189 (comment))
(I have been checking the files generated here with GDAL main, and they read fine)
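
(A minimal sketch of that kind of round-trip check, assuming a GDAL build with Parquet support and a hypothetical file name:)

from osgeo import ogr

ds = ogr.Open("test_data/example-point-wkb.parquet")
layer = ds.GetLayer(0)
for feature in layer:
    geom = feature.GetGeometryRef()
    print(None if geom is None else geom.ExportToWkt())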

@rouault (Contributor) commented Apr 17, 2024

would this (have been) useful for testing in GDAL?

yes, thanks

@rouault (Contributor) left a comment

I've detected a typo: geometry_type should be geometry_types. The generated files should be regenerated.
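
(For reference, the column entry in the "geo" metadata should use the plural key, roughly like this:)

column_metadata = {
    "encoding": "point",           # native (geoarrow-based) encoding
    "geometry_types": ["Point"],   # plural, matching the schema
}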

@jorisvandenbossche (Collaborator, Author)

I've detected a typo: geometry_type should be geometry_types. The generated files should be regenerated.

Ah yes, that's because I started from #134, which I wrote before we made that name change... Thanks for the catch!

@cholmes added this to the 1.1 milestone on May 2, 2024
@cholmes (Member) commented May 28, 2024

I'm going to go ahead and merge this in, as it's been open and approved for a while and seems like a good addition. Any further work can be done as PRs to main.

@cholmes merged commit dced61c into opengeospatial:main on May 28, 2024 (2 checks passed)
@jorisvandenbossche deleted the example-data-geoarrow branch on June 3, 2024
@tschaub (Collaborator) commented Jun 21, 2024

@jorisvandenbossche - The scripts directory has a generate_example.py script and a readme that describes how to install the dependencies for it. Could we move this new generate_test_data.py script to that same directory and update the readme if needed to describe how to install the dependencies for it (or update the pyproject.toml or the poetry.lock as needed)?

In addition, it would be nice to pull the version identifier out of the format-specs/schema.json (as is done in the generate_example.py script) to limit the number of places we have to update version numbers for a release.

I'll leave the same comment on #232 (as I think it would be nice to consolidate the examples/test data into one thing that is easier to maintain).
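
(A minimal sketch of pulling the version out of the schema; the exact key path inside format-specs/schema.json is an assumption:)

import json
from pathlib import Path

HERE = Path(__file__).parent
schema = json.loads((HERE.parent / "format-specs" / "schema.json").read_text())
# assumed location of the version constant in the schema
VERSION = schema["properties"]["version"]["const"]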

@jorisvandenbossche (Collaborator, Author)

In addition, it would be nice to pull the version identifier out of the format-specs/schema.json (as is done in the generate_example.py script) to limit the number of places we have to update version numbers for a release.

Yes, I realized last week, with the release, that I had suboptimally hardcoded the version number here in this PR.

Will try to look at it tomorrow.
