
"Assets must have exactly 1 band" #193

Closed
rabernat opened this issue Dec 5, 2022 · 6 comments

rabernat commented Dec 5, 2022

I tried stackstac for the first time today. I started by just running the Basic example from the documentation. Here is the code I ran:

import pystac_client
import stackstac

URL = "https://earth-search.aws.element84.com/v0"
catalog = pystac_client.Client.open(URL)
lon, lat = -105.78, 35.79

items = catalog.search(
    intersects=dict(type="Point", coordinates=[lon, lat]),
    collections=["sentinel-s2-l2a-cogs"],
    datetime="2020-04-01/2020-05-01"
).get_all_items()

stack = stackstac.stack(items)

So far so good. Then I tried to actually compute something:

stack[0, 0].compute()

This gave the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In [4], line 1
----> 1 stack[0, 0].compute()

File /srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/core/dataarray.py:993, in DataArray.compute(self, **kwargs)
    974 """Manually trigger loading of this array's data from disk or a
    975 remote source into memory and return a new array. The original is
    976 left unaltered.
   (...)
    990 dask.compute
    991 """
    992 new = self.copy(deep=False)
--> 993 return new.load(**kwargs)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/core/dataarray.py:967, in DataArray.load(self, **kwargs)
    949 def load(self: T_DataArray, **kwargs) -> T_DataArray:
    950     """Manually trigger loading of this array's data from disk or a
    951     remote source into memory and return this array.
    952 
   (...)
    965     dask.compute
    966     """
--> 967     ds = self._to_temp_dataset().load(**kwargs)
    968     new = self._from_temp_dataset(ds)
    969     self._variable = new._variable

File /srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/core/dataset.py:733, in Dataset.load(self, **kwargs)
    730 import dask.array as da
    732 # evaluate all the dask arrays simultaneously
--> 733 evaluated_data = da.compute(*lazy_data.values(), **kwargs)
    735 for k, data in zip(lazy_data, evaluated_data):
    736     self.variables[k].data = data

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask/base.py:600, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    597     keys.append(x.__dask_keys__())
    598     postcomputes.append(x.__dask_postcompute__())
--> 600 results = schedule(dsk, keys, **kwargs)
    601 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask/threaded.py:89, in get(dsk, keys, cache, num_workers, pool, **kwargs)
     86     elif isinstance(pool, multiprocessing.pool.Pool):
     87         pool = MultiprocessingPoolExecutor(pool)
---> 89 results = get_async(
     90     pool.submit,
     91     pool._max_workers,
     92     dsk,
     93     keys,
     94     cache=cache,
     95     get_id=_thread_get_id,
     96     pack_exception=pack_exception,
     97     **kwargs,
     98 )
    100 # Cleanup pools associated to dead threads
    101 with pools_lock:

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask/local.py:511, in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
    509         _execute_task(task, data)  # Re-execute locally
    510     else:
--> 511         raise_exception(exc, tb)
    512 res, worker_id = loads(res_info)
    513 state["cache"][key] = res

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask/local.py:319, in reraise(exc, tb)
    317 if exc.__traceback__ is not tb:
    318     raise exc.with_traceback(tb)
--> 319 raise exc

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask/local.py:224, in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    222 try:
    223     task, data = loads(task_info)
--> 224     result = _execute_task(task, data)
    225     id = get_id()
    226     result = dumps((result, id))

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask/core.py:119, in _execute_task(arg, cache, dsk)
    115     func, args = arg[0], arg[1:]
    116     # Note: Don't assign the subtask results to a variable. numpy detects
    117     # temporaries by their reference count and can execute certain
    118     # operations in-place.
--> 119     return func(*(_execute_task(a, cache) for a in args))
    120 elif not ishashable(arg):
    121     return arg

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask/core.py:119, in <genexpr>(.0)
    115     func, args = arg[0], arg[1:]
    116     # Note: Don't assign the subtask results to a variable. numpy detects
    117     # temporaries by their reference count and can execute certain
    118     # operations in-place.
--> 119     return func(*(_execute_task(a, cache) for a in args))
    120 elif not ishashable(arg):
    121     return arg

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask/core.py:119, in _execute_task(arg, cache, dsk)
    115     func, args = arg[0], arg[1:]
    116     # Note: Don't assign the subtask results to a variable. numpy detects
    117     # temporaries by their reference count and can execute certain
    118     # operations in-place.
--> 119     return func(*(_execute_task(a, cache) for a in args))
    120 elif not ishashable(arg):
    121     return arg

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask/optimization.py:990, in SubgraphCallable.__call__(self, *args)
    988 if not len(args) == len(self.inkeys):
    989     raise ValueError("Expected %d args, got %d" % (len(self.inkeys), len(args)))
--> 990 return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask/core.py:149, in get(dsk, out, cache)
    147 for key in toposort(dsk):
    148     task = dsk[key]
--> 149     result = _execute_task(task, cache)
    150     cache[key] = result
    151 result = _execute_task(out, cache)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask/core.py:119, in _execute_task(arg, cache, dsk)
    115     func, args = arg[0], arg[1:]
    116     # Note: Don't assign the subtask results to a variable. numpy detects
    117     # temporaries by their reference count and can execute certain
    118     # operations in-place.
--> 119     return func(*(_execute_task(a, cache) for a in args))
    120 elif not ishashable(arg):
    121     return arg

File /srv/conda/envs/notebook/lib/python3.9/site-packages/stackstac/to_dask.py:185, in fetch_raster_window(reader_table, slices, dtype, fill_value)
    178 # Only read if the window we're fetching actually overlaps with the asset
    179 if windows.intersect(current_window, asset_window):
    180     # NOTE: when there are multiple assets, we _could_ parallelize these reads with our own threadpool.
    181     # However, that would probably increase memory usage, since the internal, thread-local GDAL datasets
    182     # would end up copied to even more threads.
    183 
    184     # TODO when the Reader won't be rescaling, support passing `output` to avoid the copy?
--> 185     data = reader.read(current_window)
    187     if all_empty:
    188         # Turn `output` from a broadcast-trick array to a real array, so it's writeable
    189         if (
    190             np.isnan(data)
    191             if np.isnan(fill_value)
    192             else np.equal(data, fill_value)
    193         ).all():
    194             # Unless the data we just read is all empty anyway

File /srv/conda/envs/notebook/lib/python3.9/site-packages/stackstac/rio_reader.py:385, in AutoParallelRioReader.read(self, window, **kwargs)
    384 def read(self, window: Window, **kwargs) -> np.ndarray:
--> 385     reader = self.dataset
    386     try:
    387         result = reader.read(
    388             window=window,
    389             masked=True,
   (...)
    392             **kwargs,
    393         )

File /srv/conda/envs/notebook/lib/python3.9/site-packages/stackstac/rio_reader.py:381, in AutoParallelRioReader.dataset(self)
    379 with self._dataset_lock:
    380     if self._dataset is None:
--> 381         self._dataset = self._open()
    382     return self._dataset

File /srv/conda/envs/notebook/lib/python3.9/site-packages/stackstac/rio_reader.py:340, in AutoParallelRioReader._open(self)
    338 if ds.count != 1:
    339     ds.close()
--> 340     raise RuntimeError(
    341         f"Assets must have exactly 1 band, but file {self.url!r} has {ds.count}. "
    342         "We can't currently handle multi-band rasters (each band has to be "
    343         "a separate STAC asset), so you'll need to exclude this asset from your analysis."
    344     )
    346 # Only make a VRT if the dataset doesn't match the spatial spec we want
    347 if self.spec.vrt_params != {
    348     "crs": ds.crs.to_epsg(),
    349     "transform": ds.transform,
    350     "height": ds.height,
    351     "width": ds.width,
    352 }:

RuntimeError: Assets must have exactly 1 band, but file 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/S/DV/2020/4/S2B_13SDV_20200401_0_L2A/L2A_PVI.tif' has 3. We can't currently handle multi-band rasters (each band has to be a separate STAC asset), so you'll need to exclude this asset from your analysis.

The relevant code is quite old:

if ds.count != 1:
    ds.close()
    raise RuntimeError(
        f"Assets must have exactly 1 band, but file {self.url!r} has {ds.count}. "
        "We can't currently handle multi-band rasters (each band has to be "
        "a separate STAC asset), so you'll need to exclude this asset from your analysis."
    )

so this doesn't seem like a recent change. It is very common for STAC assets (e.g. COGs) to have multiple bands, so I really don't understand this limitation. All the other examples I tried fail in the same way.

What am I missing here?


I ran this on https://us-central1-b.gcp.pangeo.io/ with the following packages:

  • stackstac: 0.4.3
  • rasterio: 1.3.2
@vincentsarago commented

FYI, the sentinel-2 items in https://earth-search.aws.element84.com/v0 are being updated, and this is also breaking some of our code: cogeotiff/rio-tiler-pds#63

rabernat commented Dec 5, 2022

Thanks for the tip @vincentsarago! I determined that the problematic asset was the overview one. By specifying specific assets, I was able to move forward - e.g.

stack = stackstac.stack(items, assets=['B01', 'B02', 'B03'])
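
The same workaround can be done without hard-coding band names: skip the asset keys that typically point at multi-band previews. This is only a sketch; the key names in the exclusion set ("overview", "visual", "thumbnail") are assumptions based on common STAC conventions, so check your collection's actual asset keys. It operates on a STAC item as a plain dict to stay self-contained.

```python
# Sketch: choose asset keys to pass to stackstac.stack(items, assets=...),
# skipping assets that are typically multi-band previews. The names in
# MULTIBAND_KEYS are assumptions -- verify them against your collection.

MULTIBAND_KEYS = {"overview", "visual", "thumbnail"}

def single_band_asset_keys(item_dict):
    """Return asset keys from a STAC item (as a plain dict) that are not
    in the known multi-band preview set."""
    return [k for k in item_dict["assets"] if k not in MULTIBAND_KEYS]

item = {"assets": {"B01": {}, "B02": {}, "overview": {}, "thumbnail": {}}}
print(single_band_asset_keys(item))  # -> ['B01', 'B02']
```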

@gjoseph92 (Owner) commented

> All documentation examples fail

@rabernat would you mind updating the title of this issue? It doesn't seem like "all documentation examples" is accurate. What you're running here (stack[0, 0].compute()) isn't in the documentation example you reference. And asset 0, which you're selecting, is the overview that has 3 bands, so it makes sense that it fails.

I can run the basic example you've linked just fine. I'm not aware of other examples that are failing; could you confirm if there are other issues?

> It is very common for STAC assets (e.g. COGs) to have multiple bands, so I really don't understand this limitation

See #62. It requires a bit more infrastructure to get the chunking correct in Dask, so I didn't add it to the MVP (which stackstac still more or less is). Fortunately, cloud providers seem to be offering most datasets with each analytic band in a separate COG, so often this isn't too much of an issue, as you've found.

@rabernat rabernat changed the title All documentation examples fail with error "Assets must have exactly 1 band" "Assets must have exactly 1 band" Dec 5, 2022

rabernat commented Dec 5, 2022

Hi @gjoseph92 - sorry for the overly broad issue title. I have just edited it now that I understand the root cause better. My goal was to provide a highly simplified example (i.e. MRE), but I now see that the "basic example" avoids this by selecting non-problematic bands.

I also experienced the same issue on the cluster example. After having both examples fail, I incorrectly extrapolated that "all the examples were broken".

Sorry again for what probably came across as a hostile issue, when the root cause was actually my error. 🤦

FWIW, I do find it highly confusing that stackstac creates datasets that are not necessarily computable. I would much prefer to have the error raised when calling stack, rather than at compute time. While the examples are not, in fact, broken, making minor modifications (as I did) does lead to errors.

Is the number of bands in each asset part of the STAC metadata? Or can it only be discovered by opening the file?

@gjoseph92 (Owner) commented


> Is the number of bands in each asset part of the STAC metadata?

No, at least not universally. That's why, unfortunately, stackstac creates datasets that aren't computable. There's no required metadata in STAC about the number of bands in an asset, so there will always be cases where we can't catch the error until runtime.

There are some optional STAC extensions that can provide the necessary metadata. stackstac does not currently use these to select assets (or exclude assets that won't work), but it could. Both eo:bands and raster can indicate the number of bands in an asset. raster didn't exist when I wrote stackstac, but I'd like to start using it for a few reasons: #91 #63 #62 (comment).

The tricky part is that STAC also lets you specify item_assets at the collection level, and stackstac is designed to just take a sequence of Items, not a collection, so for providers that use that approach, we'd be missing the metadata. (earth-search.aws.element84.com is one example of this.)

So adding some logic to pre-filter assets with >1 band according to eo:bands or raster would be pretty straightforward when those are defined at the asset level, but it would require a bit more refactoring to handle the item_assets case.
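
The asset-level pre-filtering described above could look roughly like this. This is a hedged sketch, not stackstac's implementation: it reads the "eo:bands" list from each asset's metadata (per the STAC `eo` extension) when present, keeps assets reporting exactly one band, and conservatively keeps assets with no band metadata, since their band count can only be discovered by opening the file.

```python
# Sketch of pre-filtering STAC assets by band count using "eo:bands"
# (from the STAC eo extension). Assets without the field are kept,
# because their band count is unknown until the file is opened.

def filter_single_band(assets: dict) -> dict:
    """Given a mapping of asset key -> asset metadata dict, drop assets
    whose eo:bands metadata reports more than one band."""
    out = {}
    for key, asset in assets.items():
        bands = asset.get("eo:bands")
        if bands is None or len(bands) == 1:
            out[key] = asset
    return out

assets = {
    "B01": {"eo:bands": [{"name": "B01"}]},
    "overview": {"eo:bands": [{"name": "red"}, {"name": "green"}, {"name": "blue"}]},
    "unknown": {},
}
print(sorted(filter_single_band(assets)))  # -> ['B01', 'unknown']
```

The same idea would extend to the `raster` extension's `raster:bands` field, and, as noted, collections that declare metadata in item_assets would need the collection object rather than just the items.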


rabernat commented Dec 6, 2022

Thanks for explaining, Gabe, and sorry for the noise on the issue tracker.

@rabernat rabernat closed this as completed Dec 6, 2022