# Assessing Compression


**OBJECTIVE**:  
The objective of this chapter is to demonstrate how to read an existing dataset available as an OpenDAP endpoint, and translate it into a cloud-optimized zarr on S3. 

This notebook evaluates how well the data is compressed when it is 
written to disk. 

## Preamble
These are the imports and settings used throughout this notebook: 

In [1]:
import os
import logging

import numpy as np
import xarray as xr
import hvplot.xarray
import fsspec
import zarr

logging.basicConfig(level=logging.INFO, force=True)

In [2]:
%run ../utils.ipynb
_versions(['xarray', 'dask', 'zarr', 'fsspec'])

Python     : 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]
dask       : 2023.3.2
fsspec     : 2023.3.0
xarray     : 2023.3.0
zarr       : 2.14.2


## AWS Credentials

We will use the same credentials scheme we used to write the data in 
{doc}`OpenDAP_to_S3`. 

In [3]:
os.environ['AWS_PROFILE'] = 'osn-rsignellbucket2'
os.environ['AWS_ENDPOINT'] = 'https://renc.osn.xsede.org'

%run ../AWS.ipynb  # handles credentials for us. 

## The zarr store

Let's look at the zarr data store we wrote to object storage:

In [4]:
# OUTPUT location, established in earlier notebooks:
outdir = 's3://rsignellbucket2/testing/prism/PRISM2.zarr'

fsw = fsspec.filesystem('s3', 
    anon=False, 
    default_fill_cache=False, 
    skip_instance_cache=True, 
    client_kwargs={ 'endpoint_url': os.environ['AWS_S3_ENDPOINT'] },
)

The zarr store is actually a folder/directory, with subfolders for variables, groups, etc. 
We can get a quick peek at that with a couple of zarr functions:

In [9]:
g = zarr.convenience.open_consolidated(fsw.get_mapper(outdir)) # read zarr metadata for named file.
print(g.tree())

/
 ├── lat (621,) float32
 ├── lon (1405,) float32
 ├── ppt (1512, 621, 1405) int32
 ├── time (1512,) float32
 ├── time_bnds (1512, 2) float32
 ├── tmn (1512, 621, 1405) int16
 └── tmx (1512, 621, 1405) int16
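To make the "folder with subfolders" picture concrete, here is a minimal sketch of how chunk object names are laid out under one variable, assuming the Zarr v2 convention of dot-separated chunk indices. The shape comes from the tree above; the chunk shape and the helper `chunk_keys` are hypothetical, chosen only for illustration.

```python
import itertools
import math

def chunk_keys(name, shape, chunks, sep="."):
    """Yield the object keys a Zarr v2 array would use for its chunks."""
    # Number of chunks along each dimension (the last chunk may be partial).
    grid = [math.ceil(n / c) for n, c in zip(shape, chunks)]
    for index in itertools.product(*(range(g) for g in grid)):
        yield f"{name}/" + sep.join(str(i) for i in index)

# Shape taken from the tree above; the chunk shape is HYPOTHETICAL.
shape = (1512, 621, 1405)
chunks = (72, 354, 354)

keys = list(chunk_keys("tmx", shape, chunks))
print(len(keys))   # number of chunk objects for this variable
print(keys[:3])    # first few keys, e.g. 'tmx/0.0.0'
```

Each key corresponds to one compressed object in the bucket, which is why listing the objects under a variable gives per-chunk stored sizes.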


## Sizing the chunks

Using the filesystem utilities, build a dataset of file sizes:

In [10]:
flist = fsw.glob(f'{outdir}/tmx/*')
fsize = [fsw.size(f) for f in flist]
da = xr.DataArray(data=np.array(fsize)/1e6, name='size')
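Depending on how the glob treats dot-files, a listing of a variable's directory may also pick up small Zarr v2 metadata objects such as `.zarray` and `.zattrs`, which would show up as near-zero sizes. A minimal sketch of filtering them out before sizing; the helper `chunk_objects` and the example paths are hypothetical:

```python
import posixpath

def chunk_objects(paths):
    """Keep only chunk objects, dropping dot-prefixed metadata entries."""
    return [p for p in paths if not posixpath.basename(p).startswith(".")]

# HYPOTHETICAL listing, shaped like a glob of one array directory.
listing = [
    "bucket/store.zarr/tmx/.zarray",
    "bucket/store.zarr/tmx/.zattrs",
    "bucket/store.zarr/tmx/0.0.0",
    "bucket/store.zarr/tmx/0.0.1",
]

only_chunks = chunk_objects(listing)
print(only_chunks)
```

In the notebook this would be applied to `flist` before calling `fsw.size` on each entry.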

Plot a histogram of sizes for data files in this zarr store:

In [11]:
da.hvplot.hist(title='Compressed object sizes (MB) for "tmx" variable', grid=True)

We can see that most of the individual chunks are just over 4 MB in size. Compare this with 
the in-memory size that `xarray` reports for each chunk: 34 MB. 

That works out to an impressive compression ratio of roughly 8:1 on this particular data. 
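The ratio is simply the in-memory chunk size divided by the stored object size. A minimal sketch using the rounded figures quoted above (34 MB in memory, about 4 MB stored); the helper name is made up for illustration:

```python
def compression_ratio(uncompressed_mb, compressed_mb):
    """Return how many times smaller the stored object is than the in-memory chunk."""
    return uncompressed_mb / compressed_mb

# Rounded figures from the text: 34 MB in memory, about 4 MB on disk.
ratio = compression_ratio(34, 4)
print(f"{ratio:.1f}:1")
```

Because most chunks are slightly larger than 4 MB, the ratio for a typical chunk will be a little lower than this rounded estimate.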

Let's look at another variable in this dataset:

In [12]:
flist = fsw.glob(f'{outdir}/ppt/*')
fsize = [fsw.size(f) for f in flist]
da = xr.DataArray(data=np.array(fsize)/1e6, name='size')
da.hvplot.hist(title='Compressed object sizes (MB) for "ppt" variable', grid=True)

This variable did not compress quite as well.  In the worst case, several chunks are about
14 MB each.  Even for those chunks, that is still a respectable compression ratio of about 2.5:1. 


## Total Size
The total size in GB of this dataset as stored on disk (including all metadata) is:

In [13]:
fsw.du(outdir)/1e9

3.168849405

Compare that with the in-memory size, counting four bytes for every element of each variable that `xarray` reports: 

In [43]:
new_ds = xr.open_dataset(fsw.get_mapper(outdir), engine='zarr', chunks={})
total = 0
for name in new_ds.variables:
    n = new_ds[name].size   # number of elements in this variable
    nbytes = n * 4          # assumes 4 bytes per element (e.g. float32)
    total += nbytes
    print(f"{name:10s}: {nbytes: 12d}")
print("=" * 24, f"\nTOTAL     : {total: 12d}")

print(f"In GB: {total/1e9}")

lat       :         2484
lon       :         5620
ppt       :   5276910240
time      :         6048
time_bnds :        12096
tmn       :   5276910240
tmx       :   5276910240
TOTAL     :  15830756968
In GB: 15.830756968
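The count above assumes four bytes per element for every variable. The tree listing earlier shows `tmn` and `tmx` stored as `int16`, so a baseline built from the stored dtypes comes out smaller. The following is a sketch of that alternative baseline, with shapes and dtypes copied from the tree listing; which baseline is appropriate depends on whether you want to compare against the decoded in-memory data or against the uncompressed stored encoding.

```python
import math

# Bytes per element for the dtypes that appear in the tree listing.
ITEMSIZE = {"int16": 2, "int32": 4, "float32": 4}

# Shapes and stored dtypes copied from the g.tree() output.
arrays = {
    "lat": ((621,), "float32"),
    "lon": ((1405,), "float32"),
    "ppt": ((1512, 621, 1405), "int32"),
    "time": ((1512,), "float32"),
    "time_bnds": ((1512, 2), "float32"),
    "tmn": ((1512, 621, 1405), "int16"),
    "tmx": ((1512, 621, 1405), "int16"),
}

# Uncompressed size of each array in its stored dtype.
stored = {
    name: math.prod(shape) * ITEMSIZE[dtype]
    for name, (shape, dtype) in arrays.items()
}
stored_total = sum(stored.values())

for name, nbytes in stored.items():
    print(f"{name:10s}: {nbytes: 12d}")
print(f"TOTAL     : {stored_total: 12d}")
```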


So the overall compression ratio for the entire dataset, including file system overhead and metadata, is: 

In [45]:
total / fsw.du(outdir) 

4.995742916347266