GEOMESA-3362 Adds optional flattened arrow output #3128

Merged

@malinsinbigler malinsinbigler commented Jun 13, 2024

Ticket: https://geomesa.atlassian.net/browse/GEOMESA-3362

Co-MR: geomesa/geomesa-geoserver#55

Testing

The airports shapefile was downloaded from the link below and loaded into GeoServer:
https://hub.arcgis.com/documents/f74df2ed82ba4440a2059e8dc2ec9a5d/explore

The following curl commands exercise various Arrow output options, with and without the new flatten parameter.

This Python snippet was used to display the Arrow data:

arrow_print.py

import pyarrow.ipc as ipc
import pandas as pd
import sys

# Specify the path to your Arrow file
file_path = sys.argv[1]

#pd.option_context('display.max_rows', None, 'display.max_columns', None)

# Open the Arrow file
with open(file_path, 'rb') as f:
    reader = ipc.open_stream(f)
    # Convert each record batch to a Pandas DataFrame
    dfs = [rb.to_pandas() for rb in reader]
    # Concatenate all DataFrames (if there are multiple record batches)
    df = pd.concat(dfs, ignore_index=True)

# Display the DataFrame
print(df.head().to_string())

Default return:

curl 'http://localhost:8080/geoserver/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=geomesa:Airports&outputFormat=application/vnd.arrow' > airports_default.arrow
python arrow_print.py airports_default.arrow

Output:

Airports
0   {'id': 'Airports.1', 'the_geom': [51.883583, -176.6425], 'OBJECTID': 1, 'GLOBAL_ID': '656D38F0-F1FE-49A8-AB4F-677281616EF8', 'IDENT': 'ADK', 'NAME': 'Adak', 'LATITUDE': '51-53-00.8980N', 'LONGITUDE': '176-38-32.9360W', 'ELEVATION': 19.5, 'ICAO_ID': 'PADK', 'TYPE_CODE': 'AD', 'SERVCITY': 'ADAK ISLAND', 'STATE': 'AK', 'COUNTRY': 'UNITED STATES', 'OPERSTATUS': 'OPERATIONAL', 'PRIVATEUSE': 0, 'IAPEXISTS': 1, 'DODHIFLIP': 0, 'FAR91': 0, 'FAR93': 0, 'MIL_CODE': 'CIVIL', 'AIRANAL': 'NO OBJECTION', 'US_HIGH': 0, 'US_LOW': 0, 'AK_HIGH': 1, 'AK_LOW': 1, 'US_AREA': 0, 'PACIFIC': 0}
1     {'id': 'Airports.2', 'the_geom': [56.938694, -154.18257], 'OBJECTID': 2, 'GLOBAL_ID': 'F39AFCD2-D07F-4F41-96CF-08B79A271EAB', 'IDENT': 'AKK', 'NAME': 'Akhiok', 'LATITUDE': '56-56-19.2870N', 'LONGITUDE': '154-10-57.2000W', 'ELEVATION': 44.4, 'ICAO_ID': 'PAKH', 'TYPE_CODE': 'AD', 'SERVCITY': 'AKHIOK', 'STATE': 'AK', 'COUNTRY': 'UNITED STATES', 'OPERSTATUS': 'OPERATIONAL', 'PRIVATEUSE': 0, 'IAPEXISTS': 1, 'DODHIFLIP': 0, 'FAR91': 0, 'FAR93': 0, 'MIL_CODE': 'CIVIL', 'AIRANAL': 'NO OBJECTION', 'US_HIGH': 0, 'US_LOW': 0, 'AK_HIGH': 0, 'AK_LOW': 1, 'US_AREA': 0, 'PACIFIC': 0}
2  {'id': 'Airports.3', 'the_geom': [60.91381, -161.49335], 'OBJECTID': 3, 'GLOBAL_ID': 'C0EE48D3-E3AD-404E-945D-F404E345020D', 'IDENT': 'Z13', 'NAME': 'Akiachak', 'LATITUDE': '60-54-49.7150N', 'LONGITUDE': '161-29-35.9850W', 'ELEVATION': 22.8, 'ICAO_ID': 'PFZK', 'TYPE_CODE': 'AD', 'SERVCITY': 'AKIACHAK', 'STATE': 'AK', 'COUNTRY': 'UNITED STATES', 'OPERSTATUS': 'OPERATIONAL', 'PRIVATEUSE': 0, 'IAPEXISTS': 0, 'DODHIFLIP': 0, 'FAR91': 0, 'FAR93': 0, 'MIL_CODE': 'CIVIL', 'AIRANAL': 'NO OBJECTION', 'US_HIGH': 0, 'US_LOW': 0, 'AK_HIGH': 0, 'AK_LOW': 1, 'US_AREA': 0, 'PACIFIC': 0}
...

With flatten

curl 'http://localhost:8080/geoserver/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=geomesa:Airports&outputFormat=application/vnd.arrow&format_options=flattenStruct:true;' > airports_flatten.arrow
python arrow_print.py airports_flatten.arrow

Output:



Airports
           id                 the_geom  OBJECTID                             GLOBAL_ID IDENT      NAME        LATITUDE        LONGITUDE  ELEVATION ICAO_ID TYPE_CODE     SERVCITY STATE        COUNTRY   OPERSTATUS  PRIVATEUSE  IAPEXISTS  DODHIFLIP  FAR91  FAR93 MIL_CODE       AIRANAL  US_HIGH  US_LOW  AK_HIGH  AK_LOW  US_AREA  PACIFIC
0  Airports.1   [51.883583, -176.6425]         1  656D38F0-F1FE-49A8-AB4F-677281616EF8   ADK      Adak  51-53-00.8980N  176-38-32.9360W       19.5    PADK        AD  ADAK ISLAND    AK  UNITED STATES  OPERATIONAL           0          1          0      0      0    CIVIL  NO OBJECTION        0       0        1       1        0        0
1  Airports.2  [56.938694, -154.18257]         2  F39AFCD2-D07F-4F41-96CF-08B79A271EAB   AKK    Akhiok  56-56-19.2870N  154-10-57.2000W       44.4    PAKH        AD       AKHIOK    AK  UNITED STATES  OPERATIONAL           0          1          0      0      0    CIVIL  NO OBJECTION        0       0        0       1        0        0
2  Airports.3   [60.91381, -161.49335]         3  C0EE48D3-E3AD-404E-945D-F404E345020D   Z13  Akiachak  60-54-49.7150N  161-29-35.9850W       22.8    PFZK        AD     AKIACHAK    AK  UNITED STATES  OPERATIONAL           0          0          0      0      0    CIVIL  NO OBJECTION        0       0        0       1        0        0
3  Airports.4   [60.907864, -161.4351]         4  26D96486-FA29-4866-93EB-2EEEB7FA7144   KKI  Akiachak  60-54-28.3130N  161-26-06.2780W       18.0                SP     AKIACHAK    AK  UNITED STATES  OPERATIONAL           0          0          0      0      0    CIVIL  NO OBJECTION        0       0        0       0        0        0

...

Return only desired properties

curl 'http://localhost:8080/geoserver/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=geomesa:Airports&outputFormat=application/vnd.arrow&propertyName=NAME' > airports_only_name.arrow
python arrow_print.py airports_only_name.arrow

Output:

                                   Airports
0      {'id': 'Airports.1', 'NAME': 'Adak'}
1    {'id': 'Airports.2', 'NAME': 'Akhiok'}
2  {'id': 'Airports.3', 'NAME': 'Akiachak'}
3  {'id': 'Airports.4', 'NAME': 'Akiachak'}
4     {'id': 'Airports.5', 'NAME': 'Akiak'}

Return only desired properties w/ flatten

curl 'http://localhost:8080/geoserver/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=geomesa:Airports&outputFormat=application/vnd.arrow&propertyName=NAME&format_options=flattenStruct:true;' > airports_only_name_flatten.arrow
python arrow_print.py airports_only_name_flatten.arrow

Output:

           id      NAME
0  Airports.1      Adak
1  Airports.2    Akhiok
2  Airports.3  Akiachak
3  Airports.4  Akiachak
4  Airports.5     Akiak

Reverse Sort

curl 'http://localhost:8080/geoserver/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=geomesa:Airports&outputFormat=application/vnd.arrow&propertyName=NAME&format_options=flattenStruct:false;sortField:NAME;sortReverse:true;' > airports_only_name_reverse.arrow
python arrow_print.py airports_only_name_reverse.arrow

Output:

                                           Airports
0  {'id': 'Airports.16130', 'NAME': 'Zwainz Farms'}
1   {'id': 'Airports.5562', 'NAME': 'Zupancic Fld'}
2         {'id': 'Airports.14575', 'NAME': 'Zuehl'}
3        {'id': 'Airports.8928', 'NAME': 'Zortman'}
4    {'id': 'Airports.11254', 'NAME': 'Zorn Acres'}

Reverse sort w/ flatten

curl 'http://localhost:8080/geoserver/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=geomesa:Airports&outputFormat=application/vnd.arrow&propertyName=NAME&format_options=flattenStruct:true;sortField:NAME;sortReverse:true;' > airports_only_name_reverse_flatten.arrow
python arrow_print.py airports_only_name_reverse_flatten.arrow

Output:

               id          NAME
0  Airports.16130  Zwainz Farms
1   Airports.5562  Zupancic Fld
2  Airports.14575         Zuehl
3   Airports.8928       Zortman
4  Airports.11254    Zorn Acres
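The curl commands above differ only in their `propertyName` and `format_options` query parameters. The sketch below shows one way the URLs could be assembled in Python. The function name `build_arrow_url` and its arguments are hypothetical; the query parameter names and the `key:value;` layout of `format_options` are copied from the commands above.

```python
from urllib.parse import urlencode

# Same local GeoServer endpoint as in the curl commands above
BASE_URL = "http://localhost:8080/geoserver/ows"


def build_arrow_url(type_name, property_name=None, format_options=None,
                    base_url=BASE_URL):
    """Build a WFS GetFeature URL that requests Arrow output.

    Hypothetical helper for illustration. format_options is a dict such as
    {"flattenStruct": "true"}; each entry is rendered as "key:value;".
    """
    params = {
        "service": "WFS",
        "version": "1.0.0",
        "request": "GetFeature",
        "typeName": type_name,
        "outputFormat": "application/vnd.arrow",
    }
    if property_name:
        params["propertyName"] = property_name
    if format_options:
        params["format_options"] = "".join(
            f"{key}:{value};" for key, value in format_options.items()
        )
    # Leave ':', '/' and ';' unescaped so the URL matches the curl examples
    return base_url + "?" + urlencode(params, safe=":/;")


url = build_arrow_url(
    "geomesa:Airports",
    property_name="NAME",
    format_options={
        "flattenStruct": "true",
        "sortField": "NAME",
        "sortReverse": "true",
    },
)
print(url)
```

Passing the option values as strings keeps the `true`/`false` spelling identical to the commands used in testing.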

@elahrvivaz
Contributor

this is a good start, but we also have to add it to the distributed processing code, otherwise it will only work for non-geomesa-native stores (i.e. won't work for accumulo/hbase/etc). i could probably be convinced to only support it for postgis for now, if you don't have the bandwidth to do the rest.

@elahrvivaz
Contributor

need to update docs, here at a minimum, I'm not sure if there are other places that reference it.

@malinsinbigler
Contributor Author

> this is a good start, but we also have to add it to the distributed processing code, otherwise it will only work for non-geomesa-native stores (i.e. won't work for accumulo/hbase/etc). i could probably be convinced to only support it for postgis for now, if you don't have the bandwidth to do the rest.

Thanks for the review! I'll have the changes and doc updates up soon. If it is ok with you, can we keep this change isolated to postgis for now? I can note in the docs that this feature only works for postgis atm.

@malinsinbigler
Contributor Author

@elahrvivaz
I addressed your comments for fixing the unit tests, adding docs, and changing from Option[Boolean] to Boolean.

I wasn't sure what the best practice is for adding this new parameter across all usages, so I defaulted pretty much everything to flattenStruct: Boolean = false. Let me know if that is overkill and if it makes sense to only set defaults in particular cases.

I manually tested against geoserver with these changes and queries look good.

@elahrvivaz
Contributor

sorry i haven't had a chance to look at this yet, i'll try to get to it soon.

@elahrvivaz
Contributor

lgtm, will merge when CI's done

@elahrvivaz
Contributor

created https://geomesa.atlassian.net/browse/GEOMESA-3379 as a follow-up to implement for distributed stores.

@elahrvivaz elahrvivaz merged commit 7c8e818 into locationtech:main Jul 24, 2024
31 checks passed
@malinsinbigler
Contributor Author

Thanks for the merge! Just to confirm did you also see the CO-MR which updates the geoserver plugin code?
geomesa/geomesa-geoserver#55

@elahrvivaz
Contributor

> Thanks for the merge! Just to confirm did you also see the CO-MR which updates the geoserver plugin code? geomesa/geomesa-geoserver#55

i did, but thanks for reminding me again!
