Enable compression for coordinates #674

wachsylon · 2022-09-15T08:43:33Z

Hi,

Can we add a check for shuffle and deflate on axis entries, so that compression can be enabled for coordinates (i.e. axis entries)?
Afaik, it does not hurt :)

If I add attributes to the grid table, I get a warning from CMOR:

C Traceback:
In function: cmor_set_axis_def_att
! called from: cmor_set_axis_entry
! called from: cmor_load_table_internal
! 

!!!!!!!!!!!!!!!!!!!!!!!!!
!
! Warning: Unknown attribute >>>shuffle<<< for axis section (latitude, table: ), value: 1
!
!!!!!!!!!!!!!!!!!!!!!!!!!

Best,
Fabi

taylor13 · 2022-09-15T16:10:48Z

Some questions for clarification:

If you ignore the warning, are you already able to do what you want?
If yes, then I guess you're just asking whether we can suppress the warning (for this particular user-defined attribute), right?
If you add attributes to a regular variable, can CMOR handle it? (It should be able to without raising a warning.)
Is there a need to handle compression for vector coordinate variables or just 2-d grids?
Thanks,
Karl

durack1 · 2022-09-15T16:47:45Z

@wachsylon I'd also add the question: what extra deflation/performance gain does this give you? My guess is that it would be very marginal. I also wonder whether software packages reading netCDF can handle this out of the box (if it's all managed by the netCDF library), or whether this is another aspect that needs to be considered?
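One rough way to get a feel for the deflation-gain question, without CMOR or the netCDF library, is to apply a byte shuffle followed by zlib deflate to a synthetic 2-D latitude field. This is only a sketch under stated assumptions: the grid in `make_lat2d` is invented, `byte_shuffle` is a hand-written stand-in for the HDF5 shuffle filter, and compression level 4 is an arbitrary choice, so the printed sizes say nothing definitive about real CMOR output.

```python
import math
import struct
import zlib


def make_lat2d(ny=200, nx=300):
    """Synthetic, smoothly varying 2-D latitude field (hypothetical curvilinear grid)."""
    values = []
    for j in range(ny):
        for i in range(nx):
            # Base latitude plus a small column-dependent distortion.
            values.append(30.0 + 0.1 * j + 2.0 * math.sin(math.radians(0.2 * i)))
    return values


def byte_shuffle(raw, itemsize):
    """Regroup bytes: byte 0 of every element first, then byte 1, and so on."""
    return b"".join(raw[k::itemsize] for k in range(itemsize))


values = make_lat2d()
raw = struct.pack(f"<{len(values)}d", *values)  # little-endian float64

plain = zlib.compress(raw, 4)                      # deflate only
shuffled = zlib.compress(byte_shuffle(raw, 8), 4)  # shuffle, then deflate

print(f"uncompressed:      {len(raw)} bytes")
print(f"deflate only:      {len(plain)} bytes")
print(f"shuffle + deflate: {len(shuffled)} bytes")
```

Swapping `make_lat2d` for values read from a real grid file would give numbers that are actually relevant to a given dataset.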

wachsylon · 2022-09-20T11:50:31Z

Hi,

First of all: for my specific case, I found out that compression of the 2d longitudes and latitudes can already be set in the grids.json table, because these grid lons and lats are listed under variable_entry.

> If you ignore the warning, are you already able to do what you want?

No. For variables, compression can be defined in the CMOR tables, which CMOR reads and applies to the output. However, this is not possible for coordinate variables, which are listed under axis_entry instead of variable_entry in the tables.

> If you add attributes to a regular variable, can CMOR handle it? (It should be able to without raising a warning.)

Yes, that works.

> Is there a need to handle compression for vector coordinate variables or just 2-d grids?

I hope not, but maybe on a small scale. I fear that sometimes data can only be saved in chunks covering a short period (e.g. monthly data for only one year), in which case the 2d grids can make up a big part of the file. With 30 years of data, this of course ends up as a long list of files, which nobody should really want, but maybe data providers sometimes have to save it like that?

Anyway, in my small use case (a dataset with 12 time steps, 2d lons and lats, and the 1d original grid, e.g. rlon and rlat for a rotated pole projection), the data size can be reduced from 8 MB to 2 MB, all because of the grid compression.

Btw, recent high-res ICON and IFS raw output is saved without the grid at all. Is that something CMIP7 should also think about, i.e. providing external grid descriptions?

> I also wonder whether software packages reading netcdf can handle this out of the box (if it's all managed by the netcdf library) or whether this is another aspect that needs to be considered?

Good point. I can check for some cases and a python workflow.

taylor13 · 2022-09-20T16:33:26Z

I want to be sure I understand your "small use case". Is the following correct:

the variable of interest: 12 2d fields
the latitude of each grid cell: 1 2d field
the longitude of each grid cell: 1 2d field
the latitude bounds (presuming each cell has 4 corners): 4 2d fields
the longitude bounds (presuming each cell has 4 corners): 4 2d fields
Some smaller 1d coordinate axes (which aren't to be compressed).

You say the size reduction is attributable to the grid compression alone, but how can that be, since the 12 2d fields of the variable must occupy more space than the 10 2d fields associated with the grid? The best you could hope for is reducing the size to 8 MB x 10/22 = 3.6 MB.

I'm obviously missing something obvious (or misunderstanding). Could you explain how you get down to 2 MB by compressing the grids alone?

wachsylon · 2022-09-27T18:19:53Z

> I'm obviously missing something

I cannot reproduce my numbers, and I guess I just was not precise.

Your calculation matches what I find in the data now, and I do not really understand why. The compression of data depends on its values, doesn't it? So how can there be a formula at all?

What I would understand: since I only compress 10 of 22 fields, you assume that I can only reduce this part of the file size, correct? That would mean I reduce the size from 8 MB to 8 MB minus 8 MB x 10/22, correct?

But the variable of interest is already compressed in both variants, i.e. also in the 8 MB. And since its fields can likely be compressed better (thinking of a constant field, e.g. a lot of NaNs), I assumed that a high percentage of the 8 MB could be the grid alone. So all in all, now I am a bit lost.
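The arithmetic under discussion can be written out as follows. It assumes that all 22 2-D fields occupy equal space in the 8 MB file, which is exactly the assumption questioned above, since the variable of interest is already compressed. The variable names are made up for this sketch.

```python
total_mb = 8.0

variable_fields = 12          # 12 time steps of the variable of interest
grid_fields = 1 + 1 + 4 + 4   # lat, lon, 4 lat bounds, 4 lon bounds
all_fields = variable_fields + grid_fields

# Share of the file taken up by the grid if every field has the same size.
grid_share_mb = total_mb * grid_fields / all_fields

# File size left if the grid fields shrank to nothing at all.
floor_mb = total_mb - grid_share_mb

print(f"grid share:           {grid_share_mb:.1f} MB")  # 3.6 MB
print(f"size without grid:    {floor_mb:.1f} MB")       # 4.4 MB
```

Under the equal-size assumption, 3.6 MB is the part of the file that grid compression can act on, and 4.4 MB is what would remain if the grid vanished entirely. If the variable fields already compress well, the grid share is larger than 10/22 and the floor is correspondingly lower.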

Nevertheless, it is a significant reduction in size and therefore it can be worth it. Also, I think that the change in code is really small as the function for checking compression is already there and only has to be called at another point.

durack1 · 2022-09-27T19:42:15Z

@wachsylon, if I were going to build a test suite for this, I would choose 3 variables: tas (global coverage, no mask), tos (global coverage, land ~30% masked), and siconc (global coverage, land ~30% masked, and ~90% of the remaining cells should be missing values).

If you find that across those examples you get a really good deflation improvement, AND generic software libraries can read these compressed data with no problems, then it would make sense to add this to the CMOR4 planning items.
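A toy version of such a comparison can be sketched with synthetic data before running it on real CMOR output. Everything here is an assumption made for illustration: the grid size, the random field values, the contiguous masks, and the use of plain zlib (no shuffle) as a stand-in for netCDF deflate. The names tas, tos and siconc only label the three masked fractions suggested above (0%, ~30%, and 30% plus 90% of the remainder).

```python
import math
import random
import struct
import zlib

NY, NX, NT = 60, 120, 12   # hypothetical grid and number of time steps
NCELL = NY * NX
FILL = 1.0e20              # missing-value marker


def grid_bytes():
    """Ten smooth float64 2-D fields: lat, lon, and 4 + 4 corner bounds."""
    vals = []
    for k in range(10):
        for j in range(NY):
            for i in range(NX):
                vals.append(0.5 * k + 0.3 * j + 0.7 * i + math.sin(0.05 * (i + j)))
    return struct.pack(f"<{len(vals)}d", *vals)


def variable_bytes(masked_fraction, rng):
    """NT float32 time steps; the first masked_fraction of cells hold FILL."""
    n_masked = int(masked_fraction * NCELL)
    vals = []
    for _ in range(NT):
        vals.extend([FILL] * n_masked)
        vals.extend(rng.uniform(0.0, 300.0) for _ in range(NCELL - n_masked))
    return struct.pack(f"<{len(vals)}f", *vals)


def relative_saving(masked_fraction, seed=0):
    """Fraction of file size saved by also deflating the grid fields."""
    rng = random.Random(seed)
    # The variable is deflated in both variants, as in the use case above.
    var_c = len(zlib.compress(variable_bytes(masked_fraction, rng), 4))
    grid_raw = grid_bytes()
    grid_c = len(zlib.compress(grid_raw, 4))
    before = var_c + len(grid_raw)   # grid stored uncompressed
    after = var_c + grid_c           # grid stored deflated
    return 1.0 - after / before


cases = {"tas": 0.0, "tos": 0.30, "siconc": 0.30 + 0.90 * 0.70}
savings = {name: relative_saving(frac) for name, frac in cases.items()}
for name, saving in savings.items():
    print(f"{name:7s} masked={cases[name]:.2f}  saving from grid deflate={saving:.1%}")
```

The better the variable itself compresses, the larger the share of the file taken by the grid, so the relative saving from compressing coordinates grows with the masked fraction in this toy setup. Real fields and real grids may behave differently.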

durack1 · 2024-04-07T16:46:37Z

@wachsylon we have a couple of updates planned for CMOR3.9.0, including exposing netCDF quantize (a new function in ~4.9.x). I'm not sure compressing coordinates makes sense, but we can revisit this as the release is being prepared.

cofinoa · 2024-04-09T14:57:14Z

This relates to #601 and #733.

Activating chunking on coordinates can benefit the compression ratio, especially for large 2D coordinates.

Enabling chunking and adding deflate and/or other filters from the standard netcdf-c library doesn't affect the files' netCDF4-classic compliance, and the netCDF library makes this transparent to software reading the produced files.

durack1 · 2024-06-17T16:34:17Z

Some of the chatter in #725 is starting to suggest that compression for coordinates is going to cause problems, so this request will likely be closed with that justification?

@taylor13 @cofinoa @wachsylon any comments before this is done?

durack1 added this to the 3.9.0 milestone Apr 7, 2024

durack1 mentioned this issue Jun 11, 2024

File output size and chunking #743

Closed

durack1 modified the milestones: 3.9.0, 4.0/Future Aug 27, 2024
