
Could we handle geotiff? #78

Closed
rsignell-usgs opened this issue Sep 9, 2021 · 84 comments

Comments

@rsignell-usgs
Collaborator

rsignell-usgs commented Sep 9, 2021

I was chatting with @jkellndorfer: currently a lot of folks who work with GeoTIFF use VRT to create virtual datasets (mosaics, time stacks, etc.).

Could we do the same with fsspec-reference-maker, thereby allowing folks to read from large virtual collections while avoiding GDAL?

Seems that would be very cool.

@martindurant
Member

Basic TIFFs are already handled by tifffile - so we would just need to analyse the coordinates correctly. Just!

When mosaicking, are the pieces guaranteed to form a contiguous block with no overlaps?

@jkellndorfer
Contributor

@martindurant: The data set is based on 1x1 degree tiles with non-overlapping geotiffs, covering all land and ice masses but not water areas, i.e. there are gaps from non-existing files. The VRTs do work globally though.

@martindurant
Member

gaps from non-existing files

This is not a problem for a zarr-based interface: missing files just become missing values such as NaN.

@jkellndorfer
Contributor

jkellndorfer commented Sep 9, 2021

Great. So it would just (!) take a vrt2zarr tool that just (!) adds the relevant metadata description on top of the existing geotiffs, right? "Just", of course, emphasized ...

@martindurant
Member

Caveat: I don't know how VRTs work; but inasmuch as they are pointers to files (or pieces of files) in other places, yes. The trick is to figure out the zarr key for each data chunk.

@jkellndorfer
Contributor

Maybe going via VRTs is more of a hurdle than a help. The basic structure is a grouping of geotiffs that all have the same organization (datatype, compression, xy-coordinate system) and only differ in their spatial extent, yet merge seamlessly. Take a look at http://sentinel-1-global-coherence-earthbigdata.s3-website-us-west-2.amazonaws.com to see how tiles are organized in the data set on S3.

@martindurant
Member

OK, so what we actually need is to scan each file, presumably with tifffile, find the data portion as normal, and be sure to extract the correct coordinate information (position and time). Then we combine everything into one big aggregate dataset.

@jkellndorfer
Contributor

Seems straightforward. The caveat is the assignment of a time coordinate. There are four seasons (northern hemisphere winter, spring, summer, fall) that could be assigned a time coordinate. In each season we would have a set of dataset variables, some of which are only covered in certain parts of the globe or at certain times of the year. But I guess that could all be captured in the metadata.

@martindurant
Member

We would end up with a time coordinate that is the union of all the available times, and accessing variable/time combinations that are not covered would result in NaNs.

@jkellndorfer
Contributor

jkellndorfer commented Sep 9, 2021

There are only 4 available times, given by the filenames as winter, spring, summer, fall. I guess one would pick a mid-date, or just not make it a time coordinate and leave it as ordered, labeled bands? That's what is done via the VRTs now.

@martindurant
Member

As the data curator who understands the original data better than I do, the choice would be yours! I would tend towards changing the original labels and assumptions as little as possible.

@jkellndorfer
Contributor

I would agree. Would you have any examples of how to go about this, or might you be interested in tackling this together? I think we really don't need to go via VRTs but can work from the tiffs themselves. The total number of geotiffs in the data set is 1,034,038, in ~25,000 1x1 degree tiles, for a total volume of 2.1 TB. It sure would be super cool to just open one "zarr" file and use xarray and dask magic to analyze the data.

@martindurant
Member

So the first thing we need is to extract the byte ranges within a file using tifffile and see if that reference output contains all the information we need (band, time, location coordinates).
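For example, a minimal sketch of pulling those byte ranges with tifffile (the filename is just the sample tile discussed below; dataoffsets and databytecounts give one entry per strip):

import tifffile

# Sketch: list the byte range of every strip in one tile, plus the geo tags we need
with tifffile.TiffFile("N42W090_fall_vv_COH12.tif") as tif:
    page = tif.pages[0]
    print(page.shape, page.dtype)
    print(page.tags["ModelTiepointTag"].value, page.tags["ModelPixelScaleTag"].value)
    for offset, nbytes in zip(page.dataoffsets, page.databytecounts):
        print(offset, nbytes)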

@jkellndorfer
Contributor

Seems like a good output:

@jkellndorfer
Contributor

tifffile N42W090_fall_vv_COH12.tif

Reading TIFF header: 0.002332 s
Reading image data:  0.013988 s
Generating report:   0.069484 s

TiffFile 'N42W090_fall_vv_COH12.tif'  899.52 KiB  geotiff

TiffPageSeries 0  1200x1200  uint8  YX  Generic  1 Pages

TiffPage 0 @8  1200x1200  uint8  minisblack lzw geotiff

TiffTag 256 ImageWidth @10 SHORT @18 1200
TiffTag 257 ImageLength @22 SHORT @30 1200
TiffTag 258 BitsPerSample @34 SHORT @42 8
TiffTag 259 Compression @46 SHORT @54 LZW
TiffTag 262 PhotometricInterpretation @58 SHORT @66 MINISBLACK
TiffTag 273 StripOffsets @70 LONG[200] @1030 (1990, 6467, 10995, 15556, 20128,
TiffTag 277 SamplesPerPixel @82 SHORT @90 1
TiffTag 278 RowsPerStrip @94 SHORT @102 6
TiffTag 279 StripByteCounts @106 LONG[200] @230 (4477, 4528, 4561, 4572, 4473,
TiffTag 284 PlanarConfiguration @118 SHORT @126 CONTIG
TiffTag 317 Predictor @130 SHORT @138 NONE
TiffTag 339 SampleFormat @142 SHORT @150 UINT
TiffTag 33550 ModelPixelScaleTag @154 DOUBLE[3] @1830 (0.00083333333, 0.0008333
TiffTag 33922 ModelTiepointTag @166 DOUBLE[6] @1854 (0.0, 0.0, 0.0, -90.0, 42.0
TiffTag 34735 GeoKeyDirectoryTag @178 SHORT[32] @1902 (1, 1, 0, 7, 1024, 0, 1,
TiffTag 34736 GeoDoubleParamsTag @190 DOUBLE[2] @1966 (298.257223563, 6378137.0
TiffTag 34737 GeoAsciiParamsTag @202 ASCII[8] @1982 WGS 84|
TiffTag 42113 GDAL_NODATA @214 ASCII[2] @222 0

StripOffsets
(1990, 6467, 10995, 15556, 20128, 24601, 28997, 33372, 37927, 42281, 46529,
 50958, 55479, 59944, 64505, 68906, 73437, 77971, 82459, 87104, 91871, 96363,
 100870, 105256, 109544, 113898, 118374, 122791, 127370, 131910, 136443,
 140998, 145696, 150448, 155111, 159657, 164259, 168891, 173718, 178453,
 183307, 188071, 192799, 197584, 202241, 206883, 211590, 216310, 221048,
 225633, 229962, 234406, 238930, 243581, 248090, 252536, 257180, 261840,
 266341, 270829, 275333, 279866, 284370, 288756, 293301, 297760, 302201,
 306732, 311280, 315925, 320654, 325313, 330012, 334624, 339191, 343863,
 348405, 352962, 357547, 362147, 366711, 371311, 375928, 380635, 385238,
 389747, 394263, 398962, 403793, 408743, 413549, 418221, 422862, 427495,
 432058, 436550, 441183, 445641, 450069, 454557, 459184, 463675, 468041,
 472380, 476880, 481366, 485863, 490363, 494807, 499332, 503928, 508408,
 512905, 517493, 522036, 526593, 531154, 535624, 540178, 544762, 549404,
 554042, 558815, 563600, 568347, 573184, 578128, 583201, 588167, 593070,
 597966, 602817, 607612, 612506, 617519, 622713, 627673, 632096, 636533,
 640934, 645497, 650114, 654936, 659542, 663928, 668422, 672848, 677331,
 681927, 686604, 691246, 695939, 700526, 705032, 709512, 714010, 718579,
 723044, 727524, 732030, 736530, 741006, 745644, 750321, 754825, 759433,
 763937, 768528, 773160, 777819, 782481, 787117, 791830, 796496, 801272,
 806015, 810704, 815397, 820134, 824916, 829401, 834007, 838816, 843476,
 847961, 852609, 857182, 861620, 866219, 870723, 875230, 879697, 884230,
 888899, 893595, 898158, 902629, 907180, 911743, 916379)

StripByteCounts
(4477, 4528, 4561, 4572, 4473, 4396, 4375, 4555, 4354, 4248, 4429, 4521, 4465,
 4561, 4401, 4531, 4534, 4488, 4645, 4767, 4492, 4507, 4386, 4288, 4354, 4476,
 4417, 4579, 4540, 4533, 4555, 4698, 4752, 4663, 4546, 4602, 4632, 4827, 4735,
 4854, 4764, 4728, 4785, 4657, 4642, 4707, 4720, 4738, 4585, 4329, 4444, 4524,
 4651, 4509, 4446, 4644, 4660, 4501, 4488, 4504, 4533, 4504, 4386, 4545, 4459,
 4441, 4531, 4548, 4645, 4729, 4659, 4699, 4612, 4567, 4672, 4542, 4557, 4585,
 4600, 4564, 4600, 4617, 4707, 4603, 4509, 4516, 4699, 4831, 4950, 4806, 4672,
 4641, 4633, 4563, 4492, 4633, 4458, 4428, 4488, 4627, 4491, 4366, 4339, 4500,
 4486, 4497, 4500, 4444, 4525, 4596, 4480, 4497, 4588, 4543, 4557, 4561, 4470,
 4554, 4584, 4642, 4638, 4773, 4785, 4747, 4837, 4944, 5073, 4966, 4903, 4896,
 4851, 4795, 4894, 5013, 5194, 4960, 4423, 4437, 4401, 4563, 4617, 4822, 4606,
 4386, 4494, 4426, 4483, 4596, 4677, 4642, 4693, 4587, 4506, 4480, 4498, 4569,
 4465, 4480, 4506, 4500, 4476, 4638, 4677, 4504, 4608, 4504, 4591, 4632, 4659,
 4662, 4636, 4713, 4666, 4776, 4743, 4689, 4693, 4737, 4782, 4485, 4606, 4809,
 4660, 4485, 4648, 4573, 4438, 4599, 4504, 4507, 4467, 4533, 4669, 4696, 4563,
 4471, 4551, 4563, 4636, 4729)

ModelPixelScaleTag
(0.00083333333, 0.00083333333, 0.0)

ModelTiepointTag
(0.0, 0.0, 0.0, -90.0, 42.0, 0.0)

GeoKeyDirectoryTag
(1, 1, 0, 7, 1024, 0, 1, 2, 1025, 0, 1, 1, 2048, 0, 1, 4326, 2049, 34737, 7, 0,
 2054, 0, 1, 9102, 2057, 34736, 1, 1, 2059, 34736, 1, 0)

GeoDoubleParamsTag
(298.257223563, 6378137.0)

GEOTIFF_METADATA
{'GTModelTypeGeoKey': <ModelType.Geographic: 2>,
 'GTRasterTypeGeoKey': <RasterPixel.IsArea: 1>,
 'GeogAngularUnitsGeoKey': <Angular.Degree: 9102>,
 'GeogCitationGeoKey': 'WGS 84',
 'GeogInvFlatteningGeoKey': 298.257223563,
 'GeogSemiMajorAxisGeoKey': 6378137.0,
 'GeographicTypeGeoKey': <GCS.WGS_84: 4326>,
 'KeyDirectoryVersion': 1,
 'KeyRevision': 1,
 'KeyRevisionMinor': 0,
 'ModelPixelScale': [0.00083333333, 0.00083333333, 0.0],
 'ModelTiepoint': [0.0, 0.0, 0.0, -90.0, 42.0, 0.0]}

@martindurant
Member

martindurant commented Sep 9, 2021

I get a structure like

{'.zattrs': '{}',
 '.zarray': '{\n "chunks": [\n  6,\n  1200\n ],\n "compressor": {\n  "id": "imagecodecs_lzw"\n },\n "dtype": "|u1",\n "fill_value": 0,\n "filters": null,\n "order": "C",\n "shape": [\n  1200,\n  1200\n ],\n "zarr_format": 2\n}',
 '0.0': ['N42W090_fall_vv_COH12.tif/N42W090_fall_vv_COH12.tif', 1990, 4477],
 '1.0': ['N42W090_fall_vv_COH12.tif/N42W090_fall_vv_COH12.tif', 6467, 4528],
 '2.0': ['N42W090_fall_vv_COH12.tif/N42W090_fall_vv_COH12.tif', 10995, 4561],
...

200 small chunks. I see that the filename likely encodes the time and variable name, but where are the location coordinates?

@jkellndorfer
Contributor

jkellndorfer commented Sep 9, 2021

ModelPixelScaleTag
(0.00083333333, 0.00083333333, 0.0)

ModelTiepointTag
(0.0, 0.0, 0.0, -90.0, 42.0, 0.0)

The ModelPixelScaleTag gives the pixel spacing in x and y: 0.00083333333.
The ModelTiepointTag gives the upper-left corner x coordinate: -90 (= 90° West) and y coordinate: 42 (= 42° North).

With
TiffTag 256 ImageWidth @10 SHORT @18 1200
TiffTag 257 ImageLength @22 SHORT @30 1200

We know the image size and can calculate the bounding box.

e.g. the lower right corner is
-90 + 1200 × 0.00083333333 ≈ -89
42 - 1200 × 0.00083333333 ≈ 41
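A minimal sketch of that bounding-box arithmetic, using the tag values from the listing above:

# values from ModelPixelScaleTag, ModelTiepointTag, ImageWidth/ImageLength above
pixel_size = 0.00083333333
tie_lon, tie_lat = -90.0, 42.0        # upper-left corner of the tile
width = height = 1200

lon_min, lat_max = tie_lon, tie_lat
lon_max = tie_lon + width * pixel_size    # ~ -89.0
lat_min = tie_lat - height * pixel_size   # ~ 41.0
print(lon_min, lat_min, lon_max, lat_max)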

@jkellndorfer
Contributor

But we can also easily derive the bounding box from the file name:
N42W090 refers to the upper-left coordinate of the tile, and each tile is exactly 1x1 degree.
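As a concrete sketch, the tile names could be parsed with a small regex (the helper name is made up, and the pattern only covers names like N42W090_fall_vv_COH12.tif):

import re

TILE_RE = re.compile(
    r"(?P<ns>[NS])(?P<lat>\d{2})(?P<ew>[EW])(?P<lon>\d{3})"
    r"_(?P<season>[a-z]+)_(?P<pol>[a-z]+)_(?P<var>\w+)\.tif$"
)

def parse_tile_name(name):
    # returns the upper-left corner plus the categorical parts of the name
    m = TILE_RE.match(name)
    lat = int(m.group("lat")) * (1 if m.group("ns") == "N" else -1)
    lon = int(m.group("lon")) * (1 if m.group("ew") == "E" else -1)
    return lat, lon, m.group("season"), m.group("pol"), m.group("var")

print(parse_tile_name("N42W090_fall_vv_COH12.tif"))
# (42, -90, 'fall', 'vv', 'COH12')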

@martindurant
Member

OK, so that's one thing.
For this particular dataset, the files are so small, and made up of so many even smaller chunks, that we may decide addressing each of the sub-chunks is prohibitively slow.

I have not yet written a numcodecs codec which does "use function x to load this whole file", but it would be trivial to do, and this case might require it.

Something like

import io

from numcodecs.abc import Codec

class WholeFileDecoder(Codec):
    def __init__(self, function_loc, kwargs):
        # import_name: a helper resolving a dotted path like "tifffile.imread" to a callable
        self.func = import_name(function_loc)
        self.kwargs = kwargs

    def decode(self, buf, out=None):
        # hand the whole chunk (i.e. the whole referenced file) to the loader function
        return self.func(io.BytesIO(buf), **self.kwargs)

where for TIFF you would want it to call TiffFile(...).asarray(). The grib2 code does something like this.

@cgohlke

cgohlke commented Sep 9, 2021

Interesting. Let me know if anything could be added to tifffile to make this easier.

IIUC, it might be more efficient not to index the strips/tiles in each TIFF file. Instead, index each file as a single chunk and use a TIFF codec (e.g. imagecodecs_tiff based on libtiff).

Tifffile can parse a sequence of file names into higher dimensions using a regular expression pattern and export an fsspec reference (see the test at https://github.com/cgohlke/tifffile/blob/1c8e75bf29d591058311aee7856fc2c73dea5a83/tests/test_tifffile.py#L12744-L12779). This works with any kind of format supported by imagecodecs.

It's currently not possible to parse categories (e.g. "north"/"south", "summer"/"winter"), only numbers/indices. Also, the current implementation reads one file/chunk to determine the chunk shape and dtype but that could be changed.
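To make the one-chunk-per-file idea concrete, a hedged sketch of what a single-file reference set could look like (URL, offset and length are illustrative; the codec id follows imagecodecs.numcodecs naming):

import json

# Sketch: a zarr array whose single chunk is an entire GeoTIFF, decoded by the
# registered imagecodecs TIFF codec rather than strip-by-strip LZW.
refs = {
    ".zattrs": "{}",
    ".zarray": json.dumps({
        "chunks": [1200, 1200],          # chunk == whole image
        "shape": [1200, 1200],
        "dtype": "|u1",
        "compressor": {"id": "imagecodecs_tiff"},
        "fill_value": 0,
        "filters": None,
        "order": "C",
        "zarr_format": 2,
    }),
    # key -> [url, offset, length]; the whole file is one chunk (size here is illustrative)
    "0.0": ["s3://some-bucket/N42W090/N42W090_fall_vv_COH12.tif", 0, 921108],
}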

@jkellndorfer
Contributor

@martindurant I think you are on the right track that we treat the entire file as a chunk. When we did the processing I deliberately did not choose any blocksize tiling à la COG, as the data were so small to begin with, but I guess the GDAL defaults produced these very small sub-chunks, which I have to admit I was not aware of. @cgohlke thanks for chiming in with clarifications and the offer to possibly adapt tifffile! Parsing categories would be nice indeed. There is a plethora of geospatial data sets in particular that use this tile naming scheme with North/South and East/West designation of location boundaries.

@martindurant
Member

@cgohlke , I see we have been thinking along the same lines. Since you have a codec that already handles whole tiff files, it would certainly make sense to just use that.

As far as combining is concerned - finding the key for each input file - there are a number of obvious things we should be doing. We need to know the complete set of coordinates along the concat dimension(s), possibly determined in a first pass, and for each input chunk (see the sketch after this list):

  • the variable name (perhaps only one of these for the whole aggregate dataset)
  • the values of the concat dimensions; could be extracted from the filename, dataset attribute/tag or array (single or multiple value) within the data
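For this particular grid, a minimal sketch of deriving the zarr chunk key from a tile's origin and the categorical parts of its name (grid origin, dimension order and the one-chunk-per-tile layout are all assumptions):

# Assumed layout: dims (season, polarization, lat, lon), 1x1-degree tiles as chunks,
# grid top-left at 81N / 180W (matching the VRT GeoTransform quoted further down).
SEASONS = ["winter", "spring", "summer", "fall"]
POLS = ["vv", "vh", "hv", "hh"]
LAT_MAX, LON_MIN = 81, -180

def chunk_key(lat, lon, season, pol):
    i_lat = LAT_MAX - lat      # count 1-degree tiles southwards from the top edge
    i_lon = lon - LON_MIN      # count 1-degree tiles eastwards from the left edge
    return f"{SEASONS.index(season)}.{POLS.index(pol)}.{i_lat}.{i_lon}"

print(chunk_key(42, -90, "fall", "vv"))   # -> "3.0.39.90"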

@jkellndorfer
Contributor

Might the gdal generated VRT file be useful here after all? Here is an example of a VRT of one of the 88 data variables: https://sentinel-1-global-coherence-earthbigdata.s3.us-west-2.amazonaws.com/data/tiles/Global_fall_vv_COH12.vrt

@jkellndorfer
Contributor

jkellndorfer commented Sep 10, 2021

If you take a look at the top of the VRT XML, it contains info on the georeferencing, origin
(upper left) and pixel spacing in the <GeoTransform> tag, and then for each <ComplexSource> it describes the pixel/line offset and size in the <SrcRect ...> and <DstRect ...> tags.

<VRTDataset rasterXSize="432000" rasterYSize="166800">
  <SRS dataAxisToSRSAxisMapping="2,1">GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]</SRS>
  <GeoTransform> -1.8000000000000000e+02,  8.3333333000010982e-04,  0.0000000000000000e+00,  8.1000000000000000e+01,  0.0000000000000000e+00, -8.3333333000010982e-04</GeoTransform>
  <VRTRasterBand dataType="Byte" band="1">
    <NoDataValue>0</NoDataValue>
    <ColorInterp>Gray</ColorInterp>
    <ComplexSource>
      <SourceFilename relativeToVRT="1">N00E005/N00E005_fall_vv_COH12.tif</SourceFilename>
      <SourceBand>1</SourceBand>
      <SourceProperties RasterXSize="1200" RasterYSize="1200" DataType="Byte" BlockXSize="1200" BlockYSize="6" />
      <SrcRect xOff="0" yOff="0" xSize="1200" ySize="1200" />
      <DstRect xOff="222000" yOff="97200" xSize="1200" ySize="1200" />
      <NODATA>0</NODATA>
    </ComplexSource>
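If the VRT route were used, the chunk keys could come straight from those DstRect offsets. A hedged sketch with the standard-library XML parser (element and attribute names as in the snippet above; 1200x1200-pixel tiles assumed):

import xml.etree.ElementTree as ET

# Sketch: turn each ComplexSource's destination offset into (row, col) chunk indices
tree = ET.parse("Global_fall_vv_COH12.vrt")
for src in tree.getroot().iter("ComplexSource"):
    fname = src.findtext("SourceFilename")
    dst = src.find("DstRect")
    row = int(dst.get("yOff")) // 1200
    col = int(dst.get("xOff")) // 1200
    print(fname, row, col)   # e.g. N00E005/N00E005_fall_vv_COH12.tif 81 185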

@martindurant
Member

If we already have the coordinate information from some other source, that would, of course, be totally reasonable too. I didn't gather whether the VRT is always available, or whether it also needs to be constructed at some effort.

@jkellndorfer
Contributor

jkellndorfer commented Sep 10, 2021

gdalbuildvrt would have had to be called to generate the layers.
I guess it's a question of whether we want a GDAL dependency for generating the zarr store here. gdalbuildvrt from the GDAL suite just might do the work. Maybe the code used for gdalbuildvrt has the pieces ready that you could use directly:
https://github.com/TUW-GEO/OGRSpatialRef3D/blob/master/gdal-1.10.0/apps/gdalbuildvrt.cpp
https://gdal.org/programs/gdalbuildvrt.html

@martindurant
Member

It would be OK to have the dependency for the sake of making the references - we don't need it to load the data later. However, if there's a simpler way... I intuit that getting the arguments to gdalbuildvrt right requires some expertise, compared to writing a little lambda function to extract numbers from the filename.

@jkellndorfer
Contributor

Yes, that makes sense to me when all the info can be gathered easily from the filenames. gdalbuildvrt scans each geotiff for the tifftags containing the georeferencing info and as such is more generic. tifffile seems to also offer the more generic solution. What gdalbuildvrt actually needs is just the list of geotiffs that should either be mosaicked into one layer or stacked (with the -separate switch).

@jkellndorfer
Contributor

jkellndorfer commented Sep 10, 2021

One piece of info we don't have from the filenames is the resolution needed for the coordinate dimensions, I guess. But that could also easily be retrieved from the corresponding tifftag in one of the GeoTIFFs, maybe via tifffile.

@jkellndorfer
Contributor

jkellndorfer commented Sep 21, 2021

@cgohlke and @martindurant Bravo! You guys did it! I can access the data set just fine. Just a warning when plotting:

WARNING:param.Image11208: Image dimension(s) longitude and latitude are not evenly sampled to relative tolerance of 0.001. Please use the QuadMesh element for irregularly sampled data or set a higher tolerance on hv.config.image_rtol or the rtol parameter in the Image constructor.
WARNING:param.Image11208:Image dimension(s) longitude and latitude are not evenly sampled to relative tolerance of 0.001. Please use the QuadMesh element for irregularly sampled data or set a higher tolerance on hv.config.image_rtol or the rtol parameter in the Image constructor.
WARNING:param.Image11208: Image dimension(s) longitude and latitude are not evenly sampled to relative tolerance of 0.001. Please use the QuadMesh element for irregularly sampled data or set a higher tolerance on hv.config.image_rtol or the rtol parameter in the Image constructor.
WARNING:param.Image11208:Image dimension(s) longitude and latitude are not evenly sampled to relative tolerance of 0.001. Please use the QuadMesh element for irregularly sampled data or set a higher tolerance on hv.config.image_rtol or the rtol parameter in the Image constructor.

@jkellndorfer
Contributor

jkellndorfer commented Sep 21, 2021

Here is the code snippet I am using, going via a subset:

import fsspec
import xarray as xr
import hvplot.xarray  # registers the .hvplot accessor

fs = fsspec.filesystem("reference", fo="https://storage.googleapis.com/mdtemp/Global_COH2.json", remote_protocol="http")
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False})
dssub = ds.sel(latitude=slice(42.5, 41), longitude=slice(-72, -69.5))
dssub.hvplot.image(x="longitude", y="latitude", rasterize=True, geo=True, cmap="spectral_r", frame_width=400)

image

@rsignell-usgs
Collaborator Author

@jkellndorfer I tried to reproduce this using the latest conda-forge packages and I'm getting all NaN values:
https://nbviewer.jupyter.org/gist/rsignell-usgs/9dc49fdfd00a09c7e5d2d0be940f8209

Do I need to use development package versions to get this to work?

@cgohlke

cgohlke commented Sep 23, 2021

Did you try different selections, e.g. coherence 12? Many chunks are missing in the dataset.

@rsignell-usgs
Collaborator Author

@cgohlke , yes, I tried to reproduce exactly the example shown by @jkellndorfer:
2021-09-23_10-53-29

@cgohlke

cgohlke commented Sep 23, 2021

I'm getting all NaN values

I can reproduce that on my system. Could be related to the error/warning printed in the console?

[IPKernelApp] ERROR | No such comm target registered: hv-extension-comm
[IPKernelApp] WARNING | No such comm: hv-extension-comm

@martindurant
Member

Screen Shot 2021-09-23 at 12 18 43

@martindurant
Member

There is some apparent mismatch between the selectors and the display. I needed coherence index 1, which should have value 12. The title agrees, but the slider doesn't!

dssub.data[0, 0, 1, :, :].values == dssub.sel(season="winter", polarization="vv", coherence=12).data.values

@cgohlke

cgohlke commented Sep 23, 2021

@martindurant, please excuse my ignorance:

Is there a function to create an fsspec reference JSON string from the zarr group created with out = {}; g = zarr.open_group(out) ...?

Is it possible to open an fsspec reference file from the local file system (for testing) if the target_protocol is http? fsspec.get_mapper('reference://', fo='Global_COH2.json', target_protocol='http') does not find the reference file, and target_protocol='file' cannot load the data.

Why is the xarray dataset using float32 when the dtype in the reference file is |u1? (data (season, polarization, coherence, latitude, longitude) float32 ...)

@martindurant
Member

Is there a function to create an fsspec reference JSON string from the zarr group created with

I have such a thing in a local branch that I need to clean up. It is essentially the ASCII-encoded ujson.dump of out for Version 0, or of {"version": 1, "refs": out} for Version 1.

Is it possible to open an fsspec reference file from the local file system (for testing) if the target_protocol is http

You want remote_protocol="http" and optionally remote_options. The "target" is the JSON file, which could be on a different file system from the actual data files.

Why is the xarray dataset using float32

I have no idea! Adding mask_and_scale=False to open_dataset prevents it.
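Putting those three answers together, a rough sketch (filenames are placeholders, and out is the in-memory reference dict from the zarr group; binary values in it would need to be made JSON-safe first, see below):

import ujson
import fsspec
import xarray as xr

# 1. serialise the references as a Version 1 reference file
with open("Global_COH2.json", "w") as f:
    ujson.dump({"version": 1, "refs": out}, f)

# 2. the JSON lives on local disk, while the referenced chunks stay remote
fs = fsspec.filesystem(
    "reference",
    fo="Global_COH2.json",     # the reference file itself (the "target")
    remote_protocol="http",    # protocol of the referenced GeoTIFFs
)

# 3. keep the original uint8 dtype instead of float32 + NaN masking
ds = xr.open_dataset(
    fs.get_mapper(""),
    engine="zarr",
    backend_kwargs={"consolidated": False},
    mask_and_scale=False,
)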

@martindurant
Member

I have no idea

Perhaps it's in order to allow NaN values? With mask_and_scale=False, missing values become 0.

@cgohlke

cgohlke commented Sep 23, 2021

Thank you!

It is essentially the ASCII-encoded ujson.dump of out for Version 0

Got it. Just base64-encode the binary values in out that cannot be decoded as UTF-8.
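Something along these lines, assuming the reference spec's base64: prefix for inline binary values:

import base64

def jsonable(value):
    # keep text values as-is; base64-encode bytes that are not valid UTF-8
    if isinstance(value, bytes):
        try:
            return value.decode("utf-8")
        except UnicodeDecodeError:
            return "base64:" + base64.b64encode(value).decode()
    return value

refs = {key: jsonable(value) for key, value in out.items()}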

@cgohlke

cgohlke commented Sep 23, 2021

Here's an attempt to wrap the whole earthbigdata set:

import imagecodecs.numcodecs
import fsspec
import xarray

imagecodecs.numcodecs.register_codecs()

name = 'earthbigdata.json'

mapper = fsspec.get_mapper(
    'reference://',
    fo=f'https://www.lfd.uci.edu/~gohlke/{name}',
    target_protocol='http',
)
dataset = xarray.open_dataset(
    mapper, engine='zarr', backend_kwargs={'consolidated': False}
)
print(dataset)
<xarray.Dataset>
Dimensions:          (season: 4, polarization: 4, latitude: 193200, longitude: 432000, coherence: 6, flightdirection: 2, orbit: 175)
Coordinates:
  * coherence        (coherence) float32 6.0 12.0 18.0 24.0 36.0 48.0
  * flightdirection  (flightdirection) object 'A' 'D'
  * latitude         (latitude) float32 82.0 82.0 82.0 ... -79.0 -79.0 -79.0
  * longitude        (longitude) float32 -180.0 -180.0 -180.0 ... 180.0 180.0
  * orbit            (orbit) float64 1.0 2.0 3.0 4.0 ... 172.0 173.0 174.0 175.0
  * polarization     (polarization) object 'vv' 'vh' 'hv' 'hh'
  * season           (season) object 'winter' 'spring' 'summer' 'fall'
Data variables:
    AMP              (season, polarization, latitude, longitude) float32 ...
    COH              (season, polarization, coherence, latitude, longitude) float32 ...
    inc              (orbit, flightdirection, latitude, longitude) float32 ...
    lsmap            (orbit, flightdirection, latitude, longitude) float32 ...
    rho              (season, polarization, latitude, longitude) float32 ...
    rmse             (season, polarization, latitude, longitude) float32 ...
    tau              (season, polarization, latitude, longitude) float32 ...

@martindurant
Member

Hah! We'll need a separate tool to visualise which parts of the coordinate space contain data and which do not!

@cgohlke

cgohlke commented Sep 23, 2021

Having pyramid levels would be nice.

@jkellndorfer
Contributor

@martindurant and @cgohlke, great progress! I have been offline for a couple of days. Let me share two figures that show where valid data can be found and where not. Basically it comes down to two parameter distinctions, for polarization and COH (coherence values). There might be some seasonal gaps here and there, but those are rather marginal. This stems from the Sentinel-1 satellite coverage.

  • For coherence we have global coverage for 12, 24, 36, and 48 days. There is regional coverage where we also have 6 and 18 day coherence values. The top figure shows 6-day repeat-pass coverage, i.e. where we expect 6 and 18 day measurements. The bottom figure shows the global coverage where we expect 12, 24, 36, and 48 days.

image

  • For polarization, we have high latitudes (Arctic and Antarctic regions) covered with hh (COH and AMP) and hv (AMP only), plus some islands as shown in the figure. The rest is covered in vv (COH and AMP) and vh (AMP only).

image

These coverages (or gaps) could easily be discerned from the filenames in the respective tiles and thus readily be coded as no-data in a global metadata representation. That would be your expertise. If helpful, I could prepare a set of lists of tiles that have hh, hv, vv, vh coverage and 6- and 18-day coverage from an fsspec find operation.

@martindurant
Member

The information is already there in the set of keys known to the filesystem. I'll ask the holoviz team if they happen to have a tool for visualising that, without loading any data.
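As a rough sketch of that, the coverage of one variable can be derived from the reference keys alone, without reading any chunk data (assuming the mapper from the earlier snippet and a COH chunk-key layout whose last two indices are the lat/lon chunk indices):

import numpy as np

keys = [k for k in mapper if k.startswith("COH/") and not k.startswith("COH/.z")]
indices = [tuple(map(int, k.split("/", 1)[1].split("."))) for k in keys]

nlat = max(i[-2] for i in indices) + 1
nlon = max(i[-1] for i in indices) + 1
coverage = np.zeros((nlat, nlon), dtype=bool)
for i in indices:
    coverage[i[-2], i[-1]] = True
print(f"{coverage.sum()} of {coverage.size} chunks contain data")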

@jkellndorfer
Contributor

Good point that the info is already in the keys. An interesting aspect is whether the visualization can determine on the fly that there are no data in a requested subset of a data array for a given set of variables. E.g., if no COH06 and COH18 data are available, they would not even show up as selection options in the slider or drop-down, in addition to not being loaded.

@rsignell-usgs
Collaborator Author

Just a note that I figured out my blank viz problem -- I was using hvplot.quadmesh instead of hvplot.image

@cgohlke

cgohlke commented Oct 9, 2021

Here's an attempt to wrap the whole earthbigdata set:

import imagecodecs.numcodecs
import fsspec
import xarray

imagecodecs.numcodecs.register_codecs()

name = 'earthbigdata.json'

mapper = fsspec.get_mapper(
    'reference://',
    fo=f'https://www.lfd.uci.edu/~gohlke/{name}',
    target_protocol='http',
)
dataset = xarray.open_dataset(
    mapper, engine='zarr', backend_kwargs={'consolidated': False}
)
print(dataset)

The script used to create earthbigdata.json is at https://github.com/cgohlke/tifffile/blob/v2021.10.10/examples/earthbigdata.py

@rsignell-usgs
Collaborator Author

rsignell-usgs commented Apr 4, 2022

A common use case is a time stack of geotiff images: there is no time dimension or coordinate, but the date can be determined from the filename or from some attribute in the geotiff files.

We successfully processed a time stack of LCMAP geotiffs by converting them into netCDF files, then using kerchunk's new ability to use a regex to create the time coordinate from filenames. Here's the full notebook.

@martindurant I'm just wondering: instead of converting to NetCDF, could we have converted the geotiffs into cloud-optimized geotiffs and still used the same approach with kerchunk?

@martindurant
Member

Yes you could! There are two options for TIFF: using tifffile and the native chunks, or using one chunk per input file (as was done with the satellite coherence dataset). The attributes should be available, and the coordinates used for concat can be taken from filenames or from attributes.
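For the second (one-chunk-per-file) route, a hedged sketch of the combine step, assuming one reference JSON per GeoTIFF (e.g. from tifffile's write_fsspec or kerchunk.tiff) and MultiZarrToZarr's regex-based coo_map as described in the kerchunk docs - adjust names if the API differs:

import re
from kerchunk.combine import MultiZarrToZarr

# hypothetical per-file reference JSONs with the date encoded in the name
single_refs = ["refs/lcmap_20200115.json", "refs/lcmap_20200130.json"]

mzz = MultiZarrToZarr(
    single_refs,
    remote_protocol="s3",
    remote_options={"anon": True},
    concat_dims=["time"],
    # the first regex group, extracted from each input's filename/URL, becomes the coordinate value
    coo_map={"time": re.compile(r"(\d{8})")},
)
combined = mzz.translate()   # one reference set for the whole time stack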

@rsignell-usgs
Collaborator Author

I'm guessing we don't have any examples yet of aggregating a time stack of geotiffs using native chunks, right?

@martindurant
Member

Correct. Go for it!

@rsignell-usgs
Collaborator Author

rsignell-usgs commented Apr 5, 2022

Okay, I'll try it! Looks like this is what I'm looking for, from the @cgohlke tifffile readme:

>>> image_sequence = TiffSequence('temp_C0*.tif', pattern=r'_(C)(\d+)(T)(\d+)')
>>> image_sequence.shape
(1, 2)
>>> image_sequence.axes
'CT'
>>> data = image_sequence.asarray()
>>> data.shape
(1, 2, 64, 64)
>>> with image_sequence.aszarr() as store:
...     zarr.open(store, mode='r')
<zarr.core.Array (1, 2, 64, 64) float64 read-only>
>>> image_sequence.close()
>>> with image_sequence.aszarr() as store:
...     store.write_fsspec('temp.json', url='file://')

@RichardScottOZ
Contributor

RichardScottOZ commented May 13, 2022

As this was mentioned by Martin on Pangeo: I have the "time stack of geotiffs" case, with several groups of multiband variable output. The time isn't relevant as such, just the day a model was run, i.e. the directory it is stored in.
