support parquet files in catalog #50

Open
christine-e-smit opened this issue Jul 31, 2020 · 4 comments
Labels
enhancement New feature or request

Comments


christine-e-smit commented Jul 31, 2020

As I said in #48, I was recently involved in a group trying to use intake-stac with some data we have sitting in s3. This data is in parquet format. I've used intake-parquet on this data with no problem to get a dask data frame. But when I try with intake-stac,

import intake
from intake import open_stac_catalog
cat = open_stac_catalog('https://not.the.real.url/catalog.json')
df = cat["AIRX3STD_006_SurfAirTemp_A/AIRX3STD_006_SurfAirTemp_A_2002_08.parquet"].get().to_dask()

I get the error:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-10-25d227182f13> in <module>
----> 1 df = cat["AIRX3STD_006_SurfAirTemp_A/AIRX3STD_006_SurfAirTemp_A_2002_08.parquet"].get().to_dask()

~/Software/python3/envs/stac/lib/python3.8/site-packages/intake/source/base.py in to_dask(self)
    219     def to_dask(self):
    220         """Return a dask container for this data source"""
--> 221         raise NotImplementedError
    222 
    223     def to_spark(self):

NotImplementedError: 

I assume that intake-stac is keying off the "type" field in the item field. Parquet doesn't have a mime-type, so I tried 'parquet' without success. I then re-read your Readme and realized that if intake-stac is built on top of intake-xarray, then you probably can't read in parquet regardless of what I put in the "type" field.

Would it be possible to add parquet via the intake-parquet library?

I'm wondering if parquet is beyond the scope of the STAC catalog spec? I don't see parquet in STAC's list of media types here. But then I don't see zarr either and I'm guessing that you support zarr with intake-stac because it's your favored data type for pangeo.

Collaborator

jhamman commented Jul 31, 2020

Yes! We should totally be able to do this. We need to map the stac type to the intake-parquet driver. Here's where that would go:

def _get_driver(self, entry):
    drivers = {
        "application/netcdf": "netcdf",
        "image/vnd.stac.geotiff": "rasterio",
        "image/vnd.stac.geotiff; cloud-optimized=true": "rasterio",
        "image/x.geotiff": "rasterio",
        "image/png": "xarray_image",
        "image/jpg": "xarray_image",
        "image/jpeg": "xarray_image",
        "text/xml": "textfiles",
        "text/plain": "textfiles",
        "text/html": "textfiles",
    }

Are you up for adding this feature?

Author

christine-e-smit commented Jul 31, 2020

I think I can handle adding one line to your drivers :)

But I'd think this would also require adding intake-parquet as a dependency somewhere. Your top level requirements.txt, I assume?

And I'd need to add something to https://github.com/intake/intake-stac/blob/d71b2d2b0ea2f8c89cb0310706c4de6d19406e17/intake_stac/tests/test_catalog.py
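On the dependency question, one option (not something the maintainers settled on in this thread) would be to treat intake-parquet as an optional dependency and check for it at runtime. The sketch below is only an illustration of that idea: the helper names `has_module` and `require_parquet_driver` are hypothetical, and the only assumption about intake-parquet is that it installs as the importable module `intake_parquet`.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def require_parquet_driver() -> None:
    """Raise a readable error if the (assumed optional) parquet plugin is missing."""
    # "intake_parquet" is the import name of the intake-parquet package.
    if not has_module("intake_parquet"):
        raise ImportError(
            "Reading parquet assets requires intake-parquet; "
            "install it with `pip install intake-parquet`."
        )
```

A guard like this would turn a bare NotImplementedError into a message that says which package is missing, at the cost of one more code path to test.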

Contributor

wildintellect commented Aug 18, 2020

Was looking over this during STAC sprint 6, currently updating types based on STAC Types

  1. What media type should we use for parquet, considering it does not have a MIME type? One idea: application/parquet
  2. While we're at it, should we add all the media types that STAC supports? Maybe figuring out the additional formats intake needs, like geojson, belongs in a different ticket.

Collaborator

jhamman commented Aug 18, 2020

@wildintellect - if you are up for it, let's just do one PR where we update all the media types. I think application/parquet makes sense. I can help provide additional mappings to intake drivers as needed.
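For readers following along, here is a minimal standalone sketch of what a media-type lookup with a parquet entry could look like. It is not the actual intake-stac code: the module-level `DRIVERS` dict and the `get_driver` function are hypothetical names, `application/parquet` is the media type proposed in this thread rather than a registered one, and `"parquet"` is assumed to be the driver name that intake-parquet registers.

```python
from typing import Optional

# Mapping from STAC asset media type to intake driver name.
# All entries except the last are copied from the snippet quoted in this thread.
DRIVERS = {
    "application/netcdf": "netcdf",
    "image/vnd.stac.geotiff": "rasterio",
    "image/vnd.stac.geotiff; cloud-optimized=true": "rasterio",
    "image/x.geotiff": "rasterio",
    "image/png": "xarray_image",
    "image/jpg": "xarray_image",
    "image/jpeg": "xarray_image",
    "text/xml": "textfiles",
    "text/plain": "textfiles",
    "text/html": "textfiles",
    # Assumption: media type proposed in this thread, mapped to the
    # driver name assumed to be provided by intake-parquet.
    "application/parquet": "parquet",
}


def get_driver(asset: dict) -> Optional[str]:
    """Return the driver name for a STAC asset dict, or None if unknown."""
    media_type = asset.get("type")
    if media_type is None:
        return None
    return DRIVERS.get(media_type)
```

With this mapping, an asset declared as `{"type": "application/parquet"}` resolves to the `parquet` driver, while the bare string `parquet` tried earlier in the thread still resolves to nothing.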

@scottyhq scottyhq added the enhancement New feature or request label Oct 16, 2020

4 participants