Could we handle geotiff? #78
Comments
Basic TIFFs are already handled by tifffile - so we would just need to analyse the coordinates correctly. Just! When mosaicking, are the pieces guaranteed to form a contiguous block with no overlaps? |
@martindurant: The data set is based on 1x1 degree tiles of non-overlapping geotiffs, covering all land and ice masses but not water areas, i.e. there are gaps from non-existing files; the VRTs do work globally though. |
This is not a problem for a zarr-based interface; missing files just become missing values such as NaN. |
Great. So it would just (!) take a vrt2zarr tool that just (!) adds the relevant metadata description on top of the existing geotiffs, right? "Just" of course emphasized ... |
Caveat: I don't know how VRTs work; but inasmuch as they are pointers to files (or pieces of files) in other places, yes. The trick is to figure out the zarr key for each data chunk. |
Maybe going via VRTs is more of a hurdle than a help. The basic structure is a grouping of geotiffs that all have the same organization (datatype, compression, xy-coordinate system) and only differ in their individual spatial extent, yet merge seamlessly. Take a look at http://sentinel-1-global-coherence-earthbigdata.s3-website-us-west-2.amazonaws.com to see how tiles are organized in the data set on s3. |
OK, so what we actually need is to scan each file, presumably with tifffile, find the data portion as normal, and be sure to extract the correct coordinate information (position and time). Then we combine everything into one big aggregate dataset. |
Seems straightforward. The caveat is the assignment of a time coordinate. There are four seasons (northern hemisphere winter, spring, summer, fall) that could be assigned a time coordinate. In each season we would have a set of dataset variables, some of which are only covered in certain parts of the globe or at certain times of the year. But I guess that could all be captured in the metadata. |
We would end up with a time coordinate that is the union of all the available times, and accessing variable/time combinations that are not covered would result in NaNs. |
There are only 4 available times, given in the filenames as winter, spring, summer, fall. I guess one would pick a mid-season date, or just not make it a time coordinate and leave it as ordered, labeled bands? That's what is done via the VRTs now. |
As the data curator who understands the original data better than me, the choice would be yours! I would tend towards changing the original labels and assumptions as little as possible. |
I would agree. Would you have any examples of how to go about this, or might you be interested in tackling this together? I think we really don't need to go via VRTs but can work from the tiffs themselves. The total number of geotiffs in the data set is 1,034,038, in ~25,000 1x1 degree tiles, for a total volume of 2.1 TB. Sure would be super cool to just open one "zarr" file and use xarray and dask magic to analyze the data. |
So the first thing we need is to get the byte ranges for a file using tifffile and see if that references output contains all the information we need (band, time, location coordinates). |
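For illustration, a minimal sketch of that first step, assuming tifffile's fsspec-export helper (tiff2fsspec) available in recent versions; the output filename and URL prefix below are placeholders, not from this thread:

import tifffile

# write a ReferenceFileSystem JSON for one tile: each zarr chunk key maps to
# [url, byte offset, byte length] of a strip/tile inside the TIFF
tifffile.tiff2fsspec(
    'N42W090_fall_vv_COH12.tif',        # local GeoTIFF to scan
    url='https://example.com/tiles/',   # placeholder: where the file will be served from
    out='N42W090_fall_vv_COH12.json',   # references JSON to write
)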
ok. I am installing it. Here is an example file I am using: |
Seems like a good output: |
|
I get a structure like {'.zattrs': '{}',
'.zarray': '{"chunks": [6, 1200], "compressor": {"id": "imagecodecs_lzw"}, "dtype": "|u1", "fill_value": 0, "filters": null, "order": "C", "shape": [1200, 1200], "zarr_format": 2}',
'0.0': ['N42W090_fall_vv_COH12.tif/N42W090_fall_vv_COH12.tif', 1990, 4477],
'1.0': ['N42W090_fall_vv_COH12.tif/N42W090_fall_vv_COH12.tif', 6467, 4528],
'2.0': ['N42W090_fall_vv_COH12.tif/N42W090_fall_vv_COH12.tif', 10995, 4561],
... 200 small chunks. I see that the filename likely encodes the time and variable name, but where are the location coordinates? |
The location coordinates are in the ModelPixelScaleTag and ModelTiepointTag. The ModelPixelScaleTag has the pixel spacing in x and y (0.00083333333 degrees). With the tiepoint (the model coordinates of the upper-left pixel) and the image size we can calculate the bounding box, e.g. the lower right corner.
But we can also easily derive the bounding box from the file name. |
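A sketch of both routes, for reference. The tag names are as exposed by tifffile; which corner of the 1x1 degree tile the filename label refers to is an assumption here and should be confirmed against the tiepoint:

import re
import tifffile

# route 1: read the georeferencing tags from the GeoTIFF itself
with tifffile.TiffFile('N42W090_fall_vv_COH12.tif') as tif:
    page = tif.pages[0]
    xres, yres, _ = page.tags['ModelPixelScaleTag'].value
    # the tiepoint maps raster pixel (i, j, k) to model coordinate (x, y, z);
    # the first tiepoint is taken to be the upper-left corner
    i, j, k, x, y, z = page.tags['ModelTiepointTag'].value[:6]
    bbox = (x, y - page.imagelength * yres, x + page.imagewidth * xres, y)
    print(bbox)  # (lon_min, lat_min, lon_max, lat_max)

# route 2: parse the tile name (assumes the N/S-E/W label is the tile's
# north-west corner; only the COH file pattern is handled here)
def parse_tile_name(name):
    m = re.match(r'([NS])(\d{2})([EW])(\d{3})_([a-z]+)_([a-z]{2})_COH(\d+)\.tif$', name)
    lat = int(m.group(2)) * (1 if m.group(1) == 'N' else -1)
    lon = int(m.group(4)) * (1 if m.group(3) == 'E' else -1)
    return lat, lon, m.group(5), m.group(6), int(m.group(7))

print(parse_tile_name('N42W090_fall_vv_COH12.tif'))  # (42, -90, 'fall', 'vv', 12)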
OK, so that's one thing. I have not yet written a numcodecs codec which does "use function x to load this whole file", but it would be trivial to do, and this case might require it. Something like:

import io
from numcodecs.abc import Codec

class WholeFileDecoder(Codec):
    def __init__(self, function_loc, kwargs):
        # import_name (not shown) resolves a dotted path such as 'tifffile.imread' to a callable
        self.func = import_name(function_loc)
        self.kwargs = kwargs

    def decode(self, buf, out=None):
        return self.func(io.BytesIO(buf), **self.kwargs)

where for TIFF you would want it to call TiffFile(...).asarray(). The grib2 code does something like this. |
Interesting. Let me know if anything could be added to tifffile to make this easier. IIUC, it might be more efficient not to index the strips/tiles in each TIFF file, but instead to index each file as a single chunk and decode it with a TIFF codec. Tifffile can parse a sequence of file names to higher dimensions using a regular expression pattern and export an fsspec reference (see the test at https://github.com/cgohlke/tifffile/blob/1c8e75bf29d591058311aee7856fc2c73dea5a83/tests/test_tifffile.py#L12744-L12779). This works with any kind of format supported by imagecodecs. It is currently not possible to parse categories (e.g. "north"/"south", "summer"/"winter"), only numbers/indices. Also, the current implementation reads one file/chunk to determine the chunk shape and dtype, but that could be changed. |
@martindurant I think you are on the right track that we treat the entire file as a chunk. When we did the processing I deliberately did not choose any blocksize tiling a la COG as the data were so small to begin with, but I guess the gdal defaults made these very small sub-chunks, which I have to admit I was not aware of. @cgohlke thanks for chiming in with clarifications and the offer to possibly adapt tifffile! Parsing categories would be nice indeed. There are a plethora of geospatial data sets in particular that use this tile naming scheme with a North/South and East/West designation of location boundaries. |
@cgohlke , I see we have been thinking along the same lines. Since you have a codec that already handles whole tiff files, it would certainly make sense to just use that. As far as combining is concerned - finding the key for each input file - there are a number of obvious things that we should be doing. We need to know the complete set of coordinates along the concat dimension(s), possibly done in a first pass, and for each input chunk:
|
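Not an existing tool, just a hand-rolled sketch of what that combine step could look like with one whole-file chunk per 1x1 degree tile; the variable name, the grid origin convention, the placeholder URL and the 'imagecodecs_tiff' codec id are all assumptions:

import json

TILE = 1200  # pixels per 1x1 degree tile, per the .zarray shown earlier

refs = {'.zgroup': json.dumps({'zarr_format': 2})}
refs['coh12_fall_vv/.zarray'] = json.dumps({
    'chunks': [TILE, TILE],
    'compressor': {'id': 'imagecodecs_tiff'},  # decode the whole TIFF as one chunk
    'dtype': '|u1',
    'fill_value': 0,
    'filters': None,
    'order': 'C',
    'shape': [180 * TILE, 360 * TILE],  # global grid; missing tiles stay at fill_value
    'zarr_format': 2,
})

def add_tile(url, lat_north, lon_west):
    # the zarr chunk key follows from the tile position: rows count down from 90N,
    # columns count east from 180W; a one-element reference means "the whole file"
    row = 90 - lat_north
    col = lon_west + 180
    refs[f'coh12_fall_vv/{row}.{col}'] = [url]

add_tile('s3://bucket/N42W090_fall_vv_COH12.tif', 42, -90)  # placeholder URL

with open('combined.json', 'w') as f:
    json.dump({'version': 1, 'refs': refs}, f)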
Might the gdal generated VRT file be useful here after all? Here is an example of a VRT of one of the 88 data variables: https://sentinel-1-global-coherence-earthbigdata.s3.us-west-2.amazonaws.com/data/tiles/Global_fall_vv_COH12.vrt |
If you take a look at the top of the VRT XML, it contains info on the georeferencing and origin:
|
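If one did want to pull the georeferencing out of the VRT without GDAL, a small sketch (assuming the standard GeoTransform element of the VRT schema):

import xml.etree.ElementTree as ET
import fsspec

url = ('https://sentinel-1-global-coherence-earthbigdata.s3.us-west-2.amazonaws.com'
       '/data/tiles/Global_fall_vv_COH12.vrt')
with fsspec.open(url) as f:
    root = ET.parse(f).getroot()

# GeoTransform is (x_origin, x_res, row_rotation, y_origin, col_rotation, y_res)
x0, xres, _, y0, _, yres = (float(v) for v in root.find('GeoTransform').text.split(','))
width, height = int(root.get('rasterXSize')), int(root.get('rasterYSize'))
print(x0, y0, xres, yres, width, height)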
If we already have the coordinate information from some other source, that would, of course, be totally reasonable too. I didn't gather whether the VRT is always available, or whether it also needs to be constructed with some effort. |
|
It would be OK to have the dependency for the sake of making the references - we don't need it to load the data later. However, if there's a simpler way... I intuit that getting the arguments to |
yes, that makes sense to me when all the info can be gathered easily from the filenames. gdalbuildvrt scans each geotiff for the tifftags containing the georeferencing info and as such is more generic. |
One piece of info we don't have from the filenames is the resolution needed for the coordinate dimensions, I guess. But that could also easily be retrieved from the corresponding tifftag in one of the GeoTIFFs, maybe via |
@cgohlke and @martindurant Bravo! You guys did it! I can access the data set just fine. Just a warning when plotting:
|
Here is the code snippet I am using to go via a subset
|
@jkellndorfer I tried to reproduce this using the latest conda-forge packages and I'm getting all NaN values: Do I need to use development package versions to get this to work? |
Did you try different selections, e.g. coherence 12? Many chunks are missing in the dataset. |
@cgohlke , yes, I tried to reproduce exactly the example shown by @jkellndorfer: |
I can reproduce that on my system. Could be related to the error/warning printed in the console?
|
There is some apparent mismatch between the selectors and the display. I needed coherence index 1, which should have value 12. The title agrees, but the slider doesn't!
|
@martindurant, please excuse my ignorance: Is there a function to create a fsspec reference JSON string from the zarr group created with …? Is it possible to open a fsspec reference file from the local file system (for testing) if the target_protocol is http? Why is the xarray dataset using …? |
I have such a thing in a local branch that I need to clean up. It is essentially the ASCII-encoded ujson.dump of
You want
I have no idea! Adding |
Perhaps it's in order to allow NaN values? With the mask_and_scale flag, missing values become 0. |
Thank you!
Got it. Just base64 encode the binary values in out that cannot be decoded. |
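Presumably something along these lines - a sketch of the "base64:" convention that fsspec's ReferenceFileSystem understands for inline binary values (the helper name is made up):

import base64

def jsonable(value):
    # inline reference values must be JSON-serialisable; bytes that are not valid
    # UTF-8 are stored with a "base64:" prefix, which ReferenceFileSystem decodes
    if isinstance(value, bytes):
        try:
            return value.decode('utf-8')
        except UnicodeDecodeError:
            return 'base64:' + base64.b64encode(value).decode('ascii')
    return value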
Here's an attempt to wrap the whole earthbigdata set:

import imagecodecs.numcodecs
import fsspec
import xarray
imagecodecs.numcodecs.register_codecs()
name = 'earthbigdata.json'
mapper = fsspec.get_mapper(
'reference://',
fo=f'https://www.lfd.uci.edu/~gohlke/{name}',
target_protocol='http',
)
dataset = xarray.open_dataset(
mapper, engine='zarr', backend_kwargs={'consolidated': False}
)
print(dataset)
|
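A hypothetical usage sketch for slicing a subset out of the resulting dataset; the coordinate and variable names below (latitude, longitude, season, polarization, coherence, COH) are illustrative guesses, not taken from the thread, so check print(dataset) for the real ones:

# all names below are assumed, not confirmed; adjust to the actual dataset
subset = dataset.sel(
    latitude=slice(42, 41),      # one 1x1 degree tile; latitude assumed descending
    longitude=slice(-90, -89),
    season='fall', polarization='vv', coherence=12,
)
subset['COH'].plot()             # 'COH' is a placeholder variable name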
Hah! We'll need a separate tool to visualise which parts of the coordinate space contain data and which do not! |
Having pyramid levels would be nice. |
@martindurant and @cgohlke, great progress! I have been offline for a couple of days. Let me share two figures that show where valid data can be found and where not. Basically it comes down to two parameter distinctions for polarization and COH (coherence values). There might be some seasonal gaps here and there, but those are rather marginal. This stems from the satellite coverage from Sentinel-1.
These coverages (or gaps) could easily be discerned from the filenames in the respective tiles and thus readily be coded as no data in a global metadata representation. That would be your expertise. If helpful, I could prepare a set of lists of tiles that have hh, hv, vv, vh coverage and 6- and 18-day coverage from a fsspec find operation. |
The information is already there in the set of keys known to the filesystem. I'll ask the holoviz team if they happen to have a tool for visualising that, without loading any data. |
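For example (a small sketch reusing the mapper from the snippet above; only the key names are read, no chunk data):

# enumerate data chunk keys from the reference filesystem without loading any bytes
data_keys = [k for k in mapper if not k.rsplit('/', 1)[-1].startswith('.z')]
print(len(data_keys), data_keys[:3])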
Good point that the info is already in the keys. An interesting aspect is whether, in the visualization, it can be determined on the fly that there are no data in a requested subset of a data array for a given set of variables. E.g., if no COH06 and COH18 data are available, they would not even show as selection options in the slider or a drop-down, in addition to not loading. |
Just a note that I figured out my blank viz problem -- I was using |
The script used to create |
A common use case is a time stack of geotiff images with no time dimension or coordinate, where the date can be determined from the file name or from some attribute in the geotiff files. We successfully processed a time stack of LCMAP geotiffs by converting them into netCDF files and then using kerchunk's new ability to use a regex to create the time coordinate from filenames. Here's the full notebook. @martindurant I'm just wondering: instead of converting to NetCDF, could we have converted the geotiffs into cloud-optimized geotiffs and still used the same approach with kerchunk? |
Yes you could! There are two options for TIFF: using tifffile and native chunks, or using one chunk per input file (as was done with the satellite coherence dataset). The attributes should be available, and the coordinates used for concat can be taken from filenames or from attributes. |
I'm guessing we don't have any examples yet of aggregating a time stack of geotiffs using native chunks, right? |
Correct. Go for it! |
Okay, I'll try it! Looks like this is what I'm looking for, from the @cgohlke tifffile readme:
|
As this was mentioned by Martin on pangeo: I have the 'time stack' of geotiffs possibility, with several groups of multiband variable output. The time isn't relevant as such, just the day a model was run, i.e. the directory it is stored in. |
Was chatting with @jkellndorfer, and currently a lot of folks who work with geotiff use VRT to create virtual datasets (mosaics, time stacks, etc.).
Could we do the same with fsspec-reference-maker, thereby allowing folks to read from large virtual collections and avoiding GDAL?
Seems that would be very cool.