add s3 URL support to intake-stac #48

Open
christine-e-smit opened this issue Jul 31, 2020 · 6 comments
Labels
documentation enhancement New feature or request
Milestone

Comments

@christine-e-smit

I was recently part of a group trying to use intake-stac to bring some files into dask from s3. The data in question was not public, and neither were the catalog files, so I wanted to use s3-style URLs for everything. Unfortunately, when I tried the following:

import intake
from intake import open_stac_catalog
cat = open_stac_catalog('s3://put-real-bucket-here/catalog.json')

I got the error:

---------------------------------------------------------------------------
STACError                                 Traceback (most recent call last)
<ipython-input-6-c2ac70a95199> in <module>
----> 1 cat = open_stac_catalog('s3://sit-giovanni-website/catalog.json')

~/Software/python3/envs/stac/lib/python3.8/site-packages/intake_stac/catalog.py in __init__(self, stac_obj, **kwargs)
     30             self._stac_obj = stac_obj
     31         elif isinstance(stac_obj, str):
---> 32             self._stac_obj = self._stac_cls.open(stac_obj)
     33         else:
     34             raise ValueError(

~/Software/python3/envs/stac/lib/python3.8/site-packages/satstac/thing.py in open(cls, filename)
     56                 dat = json.loads(dat)
     57             else:
---> 58                 raise STACError('%s does not exist locally' % filename)
     59         return cls(dat, filename=filename)
     60 

STACError: s3://put-real-bucket-here/catalog.json does not exist locally

It looks to me as though STAC thinks this is a file path rather than an S3 URL. Our time was short and I couldn't figure out if there was some other way to get STAC to take an S3 URL.
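The distinction that seems to be missing can be sketched with the standard library alone. This is only an illustration of telling an S3 URL apart from a file path; `classify_href` and its labels are hypothetical names, not part of sat-stac or intake-stac.

```python
from urllib.parse import urlsplit


def classify_href(href: str) -> str:
    """Label an href as 's3', 'http', or 'local' based on its URL scheme.

    Hypothetical helper for illustration; not sat-stac's actual logic.
    """
    scheme = urlsplit(href).scheme.lower()
    if scheme == "s3":
        return "s3"
    if scheme in ("http", "https"):
        return "http"
    # No scheme (or anything else, e.g. "file") is treated as a local path.
    return "local"


if __name__ == "__main__":
    for href in (
        "s3://put-real-bucket-here/catalog.json",
        "https://example.com/catalog.json",
        "/tmp/catalog.json",
    ):
        print(href, "->", classify_href(href))
```

A loader could branch on this label before deciding whether to read from disk, issue an HTTP request, or hand the URL to an S3 client.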

At the same time, we were hoping to put s3 URLs in our item catalog entries, for example:

{
  "id": "AIRX3STD_006_SurfAirTemp_A/AIRX3STD_006_SurfAirTemp_A_2002_08.parquet",
  "stac_version": "0.9.0",
  "title": "Air temperature at surface (Daytime/Ascending) [AIRS AIRX3STD v006] for 2002-08-01 00:00:00+00:00 - 2002-08-31 23:59:59+00:00",
  "description": "Parquet file containing data from Air temperature at surface (Daytime/Ascending) [AIRS AIRX3STD v006] for 2002-08-01 00:00:00+00:00 - 2002-08-31 23:59:59+00:00",
  "type": "Feature",
  "bbox": [
    -180.0,
    -90.0,
    180.0,
    90.0
  ],
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [
        [
          -180.0,
          -90.0
        ],
        [
          -180.0,
          90.0
        ],
        [
          180.0,
          90.0
        ],
        [
          180.0,
          -90.0
        ],
        [
          -180.0,
          -90.0
        ]
      ]
    ]
  },
  "properties": {
    "datetime": "2002-08-01T00:00:00Z",
    "start_datetime": "2002-08-01T00:00:00Z",
    "end_datetime": "2002-08-31T23:59:59Z",
    "created": "2020-07-27T18:35:54Z",
    "license": "Apache",
    "platform": "AIRS",
    "instruments": [
      "AIRS"
    ],
    "mission": "AIRS"
  },
  "assets": {
    "AIRX3STD_006_SurfAirTemp_A_2002_08.parquet": {
      "href": "s3://fill-in-real-bucket-here/some/path/AIRX3STD_006_SurfAirTemp_A_2002_08.parquet",
      "title": "Air temperature at surface (Daytime/Ascending) [AIRS AIRX3STD v006], 2002-08-01 00:00:00+00:00 - 2002-08-31 23:59:59+00:00 (Parquet)",
      "type": "parquet",
      "roles": [
        "data"
      ]
    }
  },
  "links": [
    {
      "rel": "self",
      "href": "N/A"
    }
  ]
}

This one may be more of a stretch. I don't know if the STAC spec has s3-style URLs in mind. My two minute evaluation of the item spec (https://github.com/radiantearth/stac-spec/blob/master/item-spec/json-schema/item.json) is inconclusive.
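Independently of what the spec allows, it can help to check which URL schemes a catalog's items actually use. A minimal sketch, assuming the item has already been loaded into a Python dict; `asset_hrefs_by_scheme` is a hypothetical helper, and the dict is a trimmed copy of the item above.

```python
from urllib.parse import urlsplit


def asset_hrefs_by_scheme(item: dict) -> dict:
    """Group an item's asset hrefs by URL scheme ('' means no scheme)."""
    grouped = {}
    for asset in item.get("assets", {}).values():
        href = asset["href"]
        grouped.setdefault(urlsplit(href).scheme, []).append(href)
    return grouped


# Trimmed-down copy of the item above; only the fields used here are kept.
item = {
    "id": "AIRX3STD_006_SurfAirTemp_A/AIRX3STD_006_SurfAirTemp_A_2002_08.parquet",
    "assets": {
        "AIRX3STD_006_SurfAirTemp_A_2002_08.parquet": {
            "href": "s3://fill-in-real-bucket-here/some/path/AIRX3STD_006_SurfAirTemp_A_2002_08.parquet",
            "type": "parquet",
            "roles": ["data"],
        }
    },
}

print(asset_hrefs_by_scheme(item))
```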

@matthewhanson
Collaborator

This is more of a sat-stac issue. In fact, it's this function:
https://github.com/sat-utils/sat-stac/blob/master/satstac/utils.py#L56

It checks for AWS URLs and will try a signed URL if a regular request fails, but it does not check for s3 URLs.
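One way to handle s3 URLs without pulling in boto3 is to rewrite them into the HTTPS form before making the request. The sketch below is an assumption about how that could look, not a copy of the sat-stac function; `s3_to_https` is a hypothetical name, and the rewritten URL is only directly readable for public objects (private objects would still need a signed request).

```python
from urllib.parse import quote, urlsplit


def s3_to_https(s3_url: str) -> str:
    """Rewrite s3://bucket/key as a virtual-hosted-style HTTPS URL.

    Hypothetical helper; does no signing, so it only helps for objects
    that are readable without credentials.
    """
    parts = urlsplit(s3_url)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    # Percent-encode the key but keep the path separators.
    key = quote(parts.path.lstrip("/"), safe="/")
    return f"https://{parts.netloc}.s3.amazonaws.com/{key}"


if __name__ == "__main__":
    print(s3_to_https("s3://put-real-bucket-here/catalog.json"))
```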

The main reason is that I didn't want to add boto3 and all its dependencies as a dependency to sat-stac.

I think the right approach going forward is to use PySTAC, which makes it easy to supply custom upload and download functions. I've been using it in a project to do exactly this: have s3-style URLs for the catalog.
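The general shape of such a custom download function is a dispatcher keyed on the URL scheme. The sketch below is library-agnostic and makes no claim about PySTAC's own hook: `make_read_text` and `fake_s3_reader` are hypothetical names, the S3 reader is a stand-in that returns canned JSON, and in real use it would wrap an S3 client. How the resulting function gets registered with PySTAC is best checked against the PySTAC documentation for the installed version.

```python
import json
from typing import Callable, Dict
from urllib.parse import urlsplit

Reader = Callable[[str], str]


def make_read_text(readers: Dict[str, Reader]) -> Reader:
    """Build a read function that dispatches on the URI scheme."""

    def read_text(uri: str) -> str:
        reader = readers.get(urlsplit(uri).scheme)
        if reader is not None:
            return reader(uri)
        # No reader registered for this scheme: fall back to a local file.
        with open(uri, encoding="utf-8") as f:
            return f.read()

    return read_text


def fake_s3_reader(uri: str) -> str:
    """Stand-in for a real S3 download; returns canned catalog JSON."""
    return json.dumps({"id": "demo-catalog", "read_from": uri})


read_text = make_read_text({"s3": fake_s3_reader})
catalog = json.loads(read_text("s3://put-real-bucket-here/catalog.json"))
print(catalog["read_from"])
```

Keeping the S3 client behind an injected callable means the core library never has to depend on boto3 itself.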

@jhamman
Collaborator

jhamman commented Jul 31, 2020

@matthewhanson - does sat-stac accept python file objects in its open function? We could potentially leverage fsspec/s3fs to work around the catalog opening.

@scottyhq scottyhq added the enhancement New feature or request label Oct 16, 2020
@asteiker

asteiker commented Dec 4, 2020

I recently ran into this problem with outputs from Earthdata Harmony as well, as referenced in the recent mention from podaac/AGU-2020#13. More details:

import intake
stac_root = 'https://harmony.earthdata.nasa.gov/stac/{jobID}/{item}'

stac_cat = intake.open_stac_catalog(stac_root.format(jobID=job,item=''),name='Harmony output')
display(list(stac_cat))
item_0 = f'{job}_0'
item = stac_cat[item_0]
assets = list(item)
asset = item[assets[0]]
da = asset.to_dask()

This gives me the following error:

ValueError: open_local can only be used on a filesystem which has attribute local_file=True

@scottyhq
Collaborator

scottyhq commented Dec 5, 2020

Hi @asteiker - thanks for reporting and sharing the link to Earthdata Harmony, it looks very neat! It's hard to tell from your example code what format the data is in. Is it possible to share the full stac_root link?

Opening STAC assets with s3:// prefixes should work, but the code ultimately called depends on the data format. I'm guessing this might be a netCDF dataset. The error you're getting is due to some incompatibilities between intake-xarray and fsspec, which are used behind the scenes for data loading (see intake/intake-xarray#93). As a workaround until new versions are released, you might try installing both from their unreleased master branches:

pip install git+https://github.com/intake/intake-xarray.git@master
pip install git+https://github.com/intake/filesystem_spec.git@master

@nikkopante

I am having the same issue:
STACError: http://diwata-missions.s3-website-us-east-1.amazonaws.com/Diwata-2/SMI/stac/catalog.json does not exist locally
even though it is http:// and not s3://

@g2giovanni

Is there a way to use a STAC catalog published in a private S3 bucket? How can we navigate the catalog by following links that refer to private resources?
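One part of following links in an S3-hosted catalog is resolving relative hrefs against the catalog's own s3:// URL. Python's `urllib.parse.urljoin` only resolves relative references for schemes it knows about, and `s3` is not among them, so the sketch below does the path arithmetic by hand. `resolve_link` is a hypothetical helper, and the bucket and paths are placeholders. It does not deal with credentials: fetching the resolved URL from a private bucket still requires an authenticated S3 client.

```python
import posixpath
from urllib.parse import urlsplit


def resolve_link(base_url: str, href: str) -> str:
    """Resolve a link href relative to the URL of the document holding it.

    Hypothetical helper; keeps the base scheme and bucket, so it works
    for s3:// bases that urljoin does not resolve.
    """
    if urlsplit(href).scheme:
        return href  # already an absolute URL
    base = urlsplit(base_url)
    base_dir = posixpath.dirname(base.path)
    path = posixpath.normpath(posixpath.join(base_dir, href))
    return f"{base.scheme}://{base.netloc}{path}"


if __name__ == "__main__":
    base = "s3://put-real-bucket-here/stac/catalog.json"
    print(resolve_link(base, "./collection/item.json"))
    print(resolve_link(base, "../other/catalog.json"))
```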

@scottyhq scottyhq added this to the v0.4.0 milestone Jun 21, 2021

7 participants