Could we handle geotiff? #78
Comments
Basic TIFFs are already handled by tifffile - so we would just need to analyse the coordinates correctly. Just! When mosaicking, are the pieces guaranteed to form a contiguous block with no overlaps? |
@martindurant: The data set is based on 1x1 degree tiles of non-overlapping geotiffs, covering all land and ice masses but not water areas, i.e. there are gaps from non-existing files; the VRTs do work globally though. |
This is not a problem for a zarr-based interface; missing files just become missing values such as NaN. |
Great. So it would just (!) take a vrt2zarr tool that just (!) adds the relevant metadata description on top of the existing geotiffs, right? "Just" of course emphasized ... |
Caveat: I don't know how VRTs work; but inasmuch as they are pointers to files (or pieces of files) in other places, yes. The trick is to figure out the zarr key for each data chunk. |
Maybe going via VRTs is more of a hurdle than a help. The basic structure is a grouping of geotiffs that all have the same organization (datatype, compression, xy-coordinate system) and only differ in their individual spatial extent, yet merge seamlessly. Take a look at http://sentinel-1-global-coherence-earthbigdata.s3-website-us-west-2.amazonaws.com to see how tiles are organized in the data set on s3. |
OK, so what we actually need is to scan each file, presumably with tifffile, find the data portion as normal, and be sure to extract the correct coordinate information (position and time). Then we combine everything into one big aggregate dataset. |
Seems straightforward. The caveat is the assignment of a time coordinate. There are four seasons (northern hemisphere winter, spring, summer, fall) that could be assigned a time coordinate. In each season we would have a set of dataset variables, some of which are only covered in certain parts of the globe or at certain times of the year. But I guess that could all be captured in the metadata. |
We would end up with a time coordinate that is the union of all the available times, and accessing variable/time combinations that are not covered would result in NaNs. |
There are only 4 available times, given in the filenames as winter, spring, summer, fall. I guess one would pick a mid-season date, or just not make it a time coordinate and leave it as ordered, labeled bands? That's what is done via the VRTs now. |
As the data curator who understands the original data better than me, the choice would be yours! I would tend towards changing the original labels and assumptions as little as possible. |
I would agree. Would you have any examples of how to go about this, or might you be interested in tackling this together? I think we really don't need to go via VRTs but can work from the tiffs themselves. The total number of geotiffs in the data set is 1,034,038, in ~25,000 1x1 degree tiles, for a total volume of 2.1 TB. Sure would be super cool to just open one "zarr" file and use xarray and dask magic to analyze the data. |
So the first thing we need is to get the byte ranges for a file using tifffile and see if that references output contains all the information we need (band, time, location coordinates). |
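For illustration, a minimal sketch of that first step, assuming tifffile's fsspec-export helper (tiff2fsspec) available in recent versions; the output filename and URL prefix below are placeholders, not from this thread:

import tifffile

# write a ReferenceFileSystem JSON for one tile: each zarr chunk key maps to
# [url, byte offset, byte length] of a strip/tile inside the TIFF
tifffile.tiff2fsspec(
    'N42W090_fall_vv_COH12.tif',        # local GeoTIFF to scan
    url='https://example.com/tiles/',   # placeholder: where the file will be served from
    out='N42W090_fall_vv_COH12.json',   # references JSON to write
)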
ok. I am installing it. Here is an example file I am using: |
Seems like a good output: |
|
I get a structure like {'.zattrs': '{}',
'.zarray': '{"chunks": [6, 1200], "compressor": {"id": "imagecodecs_lzw"}, "dtype": "|u1", "fill_value": 0, "filters": null, "order": "C", "shape": [1200, 1200], "zarr_format": 2}',
'0.0': ['N42W090_fall_vv_COH12.tif/N42W090_fall_vv_COH12.tif', 1990, 4477],
'1.0': ['N42W090_fall_vv_COH12.tif/N42W090_fall_vv_COH12.tif', 6467, 4528],
'2.0': ['N42W090_fall_vv_COH12.tif/N42W090_fall_vv_COH12.tif', 10995, 4561],
... 200 small chunks. I see that the filename likely encodes the time and variable name, but where are the location coordinates? |
The location coordinates are in the ModelPixelScaleTag and ModelTiepointTag. The ModelPixelScaleTag has the pixel spacing in x and y (0.00083333333 degrees). With the tiepoint (the model coordinates of the upper-left pixel) and the image size we can calculate the bounding box, e.g. the lower right corner.
But we can also easily derive the bounding box from the file name. |
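A sketch of both routes, for reference. The tag names are as exposed by tifffile; which corner of the 1x1 degree tile the filename label refers to is an assumption here and should be confirmed against the tiepoint:

import re
import tifffile

# route 1: read the georeferencing tags from the GeoTIFF itself
with tifffile.TiffFile('N42W090_fall_vv_COH12.tif') as tif:
    page = tif.pages[0]
    xres, yres, _ = page.tags['ModelPixelScaleTag'].value
    # the tiepoint maps raster pixel (i, j, k) to model coordinate (x, y, z);
    # the first tiepoint is taken to be the upper-left corner
    i, j, k, x, y, z = page.tags['ModelTiepointTag'].value[:6]
    bbox = (x, y - page.imagelength * yres, x + page.imagewidth * xres, y)
    print(bbox)  # (lon_min, lat_min, lon_max, lat_max)

# route 2: parse the tile name (assumes the N/S-E/W label is the tile's
# north-west corner; only the COH file pattern is handled here)
def parse_tile_name(name):
    m = re.match(r'([NS])(\d{2})([EW])(\d{3})_([a-z]+)_([a-z]{2})_COH(\d+)\.tif$', name)
    lat = int(m.group(2)) * (1 if m.group(1) == 'N' else -1)
    lon = int(m.group(4)) * (1 if m.group(3) == 'E' else -1)
    return lat, lon, m.group(5), m.group(6), int(m.group(7))

print(parse_tile_name('N42W090_fall_vv_COH12.tif'))  # (42, -90, 'fall', 'vv', 12)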
OK, so that's one thing. I have not yet written a numcodecs codec which does "use function x to load this whole file", but it would be trivial to do, and this case might require it. Something like:

import io
from numcodecs.abc import Codec

class WholeFileDecoder(Codec):
    def __init__(self, function_loc, kwargs):
        # import_name (not shown) resolves a dotted path such as 'tifffile.imread' to a callable
        self.func = import_name(function_loc)
        self.kwargs = kwargs

    def decode(self, buf, out=None):
        return self.func(io.BytesIO(buf), **self.kwargs)

where for TIFF you would want it to call TiffFile(...).asarray(). The grib2 code does something like this. |
Interesting. Let me know if anything could be added to tifffile to make this easier. IIUC, it might be more efficient not to index the strips/tiles in each TIFF file, but instead to index each file as a single chunk and decode it with a TIFF codec. Tifffile can parse a sequence of file names to higher dimensions using a regular expression pattern and export an fsspec reference (see the test at https://github.com/cgohlke/tifffile/blob/1c8e75bf29d591058311aee7856fc2c73dea5a83/tests/test_tifffile.py#L12744-L12779). This works with any kind of format supported by imagecodecs. It is currently not possible to parse categories (e.g. "north"/"south", "summer"/"winter"), only numbers/indices. Also, the current implementation reads one file/chunk to determine the chunk shape and dtype, but that could be changed. |
@martindurant I think you are on the right track that we treat the entire file as a chunk. When we did the processing I deliberately did not choose any blocksize tiling a la COG as the data were so small to begin with, but I guess the gdal defaults made these very small sub-chunks, which I have to admit I was not aware of. @cgohlke thanks for chiming in with clarifications and the offer to possibly adapt tifffile! Parsing categories would be nice indeed. There are a plethora of geospatial data sets in particular that use this tile naming scheme with a North/South and East/West designation of location boundaries. |
@cgohlke , I see we have been thinking along the same lines. Since you have a codec that already handles whole tiff files, it would certainly make sense to just use that. As far as combining is concerned - finding the key for each input file - there are a number of obvious things that we should be doing. We need to know the complete set of coordinates along the concat dimension(s), possibly done in a first pass, and for each input chunk:
|
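Not an existing tool, just a hand-rolled sketch of what that combine step could look like with one whole-file chunk per 1x1 degree tile; the variable name, the grid origin convention, the placeholder URL and the 'imagecodecs_tiff' codec id are all assumptions:

import json

TILE = 1200  # pixels per 1x1 degree tile, per the .zarray shown earlier

refs = {'.zgroup': json.dumps({'zarr_format': 2})}
refs['coh12_fall_vv/.zarray'] = json.dumps({
    'chunks': [TILE, TILE],
    'compressor': {'id': 'imagecodecs_tiff'},  # decode the whole TIFF as one chunk
    'dtype': '|u1',
    'fill_value': 0,
    'filters': None,
    'order': 'C',
    'shape': [180 * TILE, 360 * TILE],  # global grid; missing tiles stay at fill_value
    'zarr_format': 2,
})

def add_tile(url, lat_north, lon_west):
    # the zarr chunk key follows from the tile position: rows count down from 90N,
    # columns count east from 180W; a one-element reference means "the whole file"
    row = 90 - lat_north
    col = lon_west + 180
    refs[f'coh12_fall_vv/{row}.{col}'] = [url]

add_tile('s3://bucket/N42W090_fall_vv_COH12.tif', 42, -90)  # placeholder URL

with open('combined.json', 'w') as f:
    json.dump({'version': 1, 'refs': refs}, f)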
Might the gdal generated VRT file be useful here after all? Here is an example of a VRT of one of the 88 data variables: https://sentinel-1-global-coherence-earthbigdata.s3.us-west-2.amazonaws.com/data/tiles/Global_fall_vv_COH12.vrt |
If you take a look at the top of the VRT XML, it contains info on the georeferencing and origin:
|
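If one did want to pull the georeferencing out of the VRT without GDAL, a small sketch (assuming the standard GeoTransform element of the VRT schema):

import xml.etree.ElementTree as ET
import fsspec

url = ('https://sentinel-1-global-coherence-earthbigdata.s3.us-west-2.amazonaws.com'
       '/data/tiles/Global_fall_vv_COH12.vrt')
with fsspec.open(url) as f:
    root = ET.parse(f).getroot()

# GeoTransform is (x_origin, x_res, row_rotation, y_origin, col_rotation, y_res)
x0, xres, _, y0, _, yres = (float(v) for v in root.find('GeoTransform').text.split(','))
width, height = int(root.get('rasterXSize')), int(root.get('rasterYSize'))
print(x0, y0, xres, yres, width, height)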
If we already have the coordinate information from some other source, that would, of course, be totally reasonable too. I didn't gather whether the VRT is always available, or whether it also needs to be constructed with some effort. |
|
It would be OK to have the dependency for the sake of making the references - we don't need it to load the data later. However, if there's a simpler way... I intuit that getting the arguments to |
yes, that makes sense to me when all the info can be gathered easily from the filenames. gdalbuildvrt scans each geotiff for the tifftags containing the georeferencing info and as such is more generic. |
One piece of info we don't have from the filenames is the resolution needed for the coordinate dimensions, I guess. But that could also easily be retrieved from the corresponding tifftag in one of the GeoTIFFs, maybe via |
@cgohlke and @martindurant Bravo! You guys did it! I can access the data set just fine. Just a warning when plotting:
|
Here is the code snippet I am using to go via a subset
|
@jkellndorfer I tried to reproduce this using the latest conda-forge packages and I'm getting all NaN values: Do I need to use development package versions to get this to work? |
Did you try different selections, e.g. coherence 12? Many chunks are missing in the dataset. |
@cgohlke , yes, I tried to reproduce exactly the example shown by @jkellndorfer: |
I can reproduce that on my system. Could be related to the error/warning printed in the console?
|
There is some apparent mismatch between the selectors and the display. I needed coherence index 1, which should have value 12. The title agrees, but the slider doesn't!
|
@martindurant, please excuse my ignorance: Is there a function to create a fsspec reference JSON string from the zarr group created with …? Is it possible to open a fsspec reference file from the local file system (for testing) if the target_protocol is http? Why is the xarray dataset using …? |
I have such a thing in a local branch that I need to clean up. It is essentially the ASCII-encoded ujson.dump of
You want
I have no idea! Adding |
Perhaps it's in order to allow NaN values? With the mask_and_scale flag, missing values become 0. |
Thank you!
Got it. Just base64 encode the binary values in out that cannot be decoded. |
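Presumably something along these lines - a sketch of the "base64:" convention that fsspec's ReferenceFileSystem understands for inline binary values (the helper name is made up):

import base64

def jsonable(value):
    # inline reference values must be JSON-serialisable; bytes that are not valid
    # UTF-8 are stored with a "base64:" prefix, which ReferenceFileSystem decodes
    if isinstance(value, bytes):
        try:
            return value.decode('utf-8')
        except UnicodeDecodeError:
            return 'base64:' + base64.b64encode(value).decode('ascii')
    return value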
Here's an attempt to wrap the whole earthbigdata set:

import imagecodecs.numcodecs
import fsspec
import xarray
imagecodecs.numcodecs.register_codecs()
name = 'earthbigdata.json'
mapper = fsspec.get_mapper(
'reference://',
fo=f'https://www.lfd.uci.edu/~gohlke/{name}',
target_protocol='http',
)
dataset = xarray.open_dataset(
mapper, engine='zarr', backend_kwargs={'consolidated': False}
)
print(dataset)
|
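A hypothetical usage sketch for slicing a subset out of the resulting dataset; the coordinate and variable names below (latitude, longitude, season, polarization, coherence, COH) are illustrative guesses, not taken from the thread, so check print(dataset) for the real ones:

# all names below are assumed, not confirmed; adjust to the actual dataset
subset = dataset.sel(
    latitude=slice(42, 41),      # one 1x1 degree tile; latitude assumed descending
    longitude=slice(-90, -89),
    season='fall', polarization='vv', coherence=12,
)
subset['COH'].plot()             # 'COH' is a placeholder variable name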
Hah! We'll need a separate tool to visualise which parts of the coordinate space contain data and which do not! |
Having pyramid levels would be nice. |
@martindurant and @cgohlke, great progress! I have been offline for a couple of days. Let me share two figures that show where valid data can be found and where not. Basically it comes down to two parameter distinctions for polarization and COH (coherence values). There might be some seasonal gaps here and there, but those are rather marginal. This stems from the satellite coverage from Sentinel-1.
These coverages (or gaps) could easily be discerned from the filenames in the respective tiles and thus readily be coded as no data in a global metadata representation. That would be your expertise. If helpful, I could prepare a set of lists of tiles that have hh, hv, vv, vh coverage and 6- and 18-day coverage from a fsspec find operation. |
The information is already there in the set of keys known to the filesystem. I'll ask the holoviz team if they happen to have a tool for visualising that, without loading any data. |
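For example (a small sketch reusing the mapper from the snippet above; only the key names are read, no chunk data):

# enumerate data chunk keys from the reference filesystem without loading any bytes
data_keys = [k for k in mapper if not k.rsplit('/', 1)[-1].startswith('.z')]
print(len(data_keys), data_keys[:3])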
Good point that the info is already in the keys. An interesting aspect is whether, in the visualization, it can be determined on the fly that there are no data in a requested subset of a data array for a given set of variables. E.g., if no COH06 and COH18 data are available, they would not even show as selection options in the slider or a drop-down, in addition to not loading. |
Just a note that I figured out my blank viz problem -- I was using |
The script used to create |
A common use case is a time stack of geotiff images with no time dimension or coordinate, where the date can be determined from the file name or from some attribute in the geotiff files. We successfully processed a time stack of LCMAP geotiffs by converting them into netCDF files and then using kerchunk's new ability to use a regex to create the time coordinate from filenames. Here's the full notebook. @martindurant I'm just wondering: instead of converting to NetCDF, could we have converted the geotiffs into cloud-optimized geotiffs and still used the same approach with kerchunk? |
Yes you could! There are two options for TIFF: using tifffile and native chunks, or using one chunk per input file (as was done with the satellite coherence dataset). The attributes should be available, and the coordinates used for concat can be taken from filenames or from attributes. |
I'm guessing we don't have any examples yet of aggregating a time stack of geotiffs using native chunks, right? |
Correct. Go for it! |
Okay, I'll try it! Looks like this is what I'm looking for, from the @cgohlke tifffile readme:
|
As this was mentioned by Martin on pangeo: I have the 'time stack' of geotiffs possibility, with several groups of multiband variable output. The time isn't relevant as such, just the day a model was run, i.e. the directory it is stored in. |
Was chatting with @jkellndorfer, and currently a lot of folks who work with geotiff use VRT to create virtual datasets (mosaics, time stacks, etc.).
Could we do the same with fsspec-reference-maker, thereby allowing folks to read from large virtual collections and avoiding GDAL?
Seems that would be very cool.