## Independent Study Final Project


The aim of this project is to demonstrate a link between the real information content of a spatial dataset and the Bit Grooming compression level applied to it. We also want to investigate how well the real information performs as a metric for detecting compression artifacts compared to an existing metric, the DSSIM. For instance, is there a specific real information threshold that indicates two datasets look visually similar, as there is for the DSSIM? Is the real information content more effective at capturing certain types of compression-induced errors in the data, such as noise?
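To make the compression side of this comparison concrete, here is a minimal sketch of a Bit Grooming-style quantization on float32 values. It is a simplification under stated assumptions, not the implementation that produced the test data: the function name `bit_groom` and the `keep_bits` parameter are hypothetical, and the mapping from a requested number of significant digits to a number of retained mantissa bits is left out. The sketch only shows the core idea of alternately zeroing ("shaving") and setting the discarded trailing mantissa bits of consecutive values.

```python
import numpy as np


def bit_groom(values, keep_bits):
    """Simplified Bit Grooming-style quantization (illustrative only).

    keep_bits is the number of explicit float32 mantissa bits to retain (0-23).
    Even-indexed elements have their trailing bits zeroed (shaved) and
    odd-indexed elements have them set to one, so the rounding errors of
    neighbouring values tend to cancel in averages.
    """
    if not 0 <= keep_bits <= 23:
        raise ValueError("keep_bits must be between 0 and 23 for float32")

    arr = np.asarray(values, dtype=np.float32)
    # Reinterpret the floats as raw 32-bit integers; copy so the input is untouched
    bits = arr.ravel().view(np.uint32).copy()

    drop = 23 - keep_bits                      # number of trailing mantissa bits to discard
    tail = np.uint32((1 << drop) - 1)          # ones in the discarded positions
    shave_mask = np.uint32(0xFFFFFFFF) ^ tail  # zeros in the discarded positions

    bits[0::2] &= shave_mask  # shave: round toward zero
    bits[1::2] |= tail        # set: round away from zero

    return bits.view(np.float32).reshape(arr.shape)


# Hypothetical sample values, used only to show the call
sample = np.array([287.125, 287.375, 288.0625, 286.9375], dtype=np.float32)
groomed = bit_groom(sample, keep_bits=10)
print(np.abs(groomed - sample).max())
```

Because the exponent and the retained mantissa bits are unchanged, the relative error of each value stays below 2 to the power of minus `keep_bits`. The discarded bit positions carry no variation after grooming, which is the property the real information analysis is meant to pick up.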

In this repository, the files /data/test_set/daily_dssims.csv and /data/test_set/monthly_dssims.csv list the DSSIMs for each climate variable at each time slice. The files /data/test_set/daily_zfp_bg_sz_comp_slices.csv and /data/test_set/monthly_zfp_bg_sz_comp_slices.csv include a column ("zfp_level") that gives the optimal compression level for each time slice, based on a DSSIM cutoff threshold of 0.9995. These levels may be useful for comparison with the real information content.
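As a rough illustration of how a cutoff of 0.9995 can translate into an "optimal" level, the sketch below picks the most aggressive level whose DSSIM still meets the threshold. This is an assumption about what "optimal" means here; the layout of the CSV files is not shown in this notebook, so the function `pick_level`, its arguments, and the DSSIM values in the example are all hypothetical.

```python
def pick_level(dssim_by_level, levels_most_aggressive_first, threshold=0.9995):
    """Return the first (most aggressive) level whose DSSIM meets the threshold.

    dssim_by_level maps a level label to the DSSIM between the original and the
    compressed time slice. Returns None if no level passes.
    """
    for level in levels_most_aggressive_first:
        if dssim_by_level[level] >= threshold:
            return level
    return None


# Made-up DSSIM values for a single time slice, for illustration only
example_dssims = {"bg_2": 0.91, "bg_3": 0.9993, "bg_4": 0.9997, "bg_5": 0.99999}
order = ["bg_2", "bg_3", "bg_4", "bg_5"]
print(pick_level(example_dssims, order))
```

A real information threshold could be compared against the DSSIM cutoff in the same way, by checking whether both criteria select the same level for each time slice.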

In [1]:
import sys
import struct
import time
import warnings
from math import log2

import astropy
import numpy as np
import matplotlib.pyplot as plt

# Add the ldcpy root to the system path so the local checkout is imported
sys.path.insert(0, '/glade/u/home/apinard/newldcpy/ldcpy')

# Autoreload the package every time a cell runs, so changes to the local ldcpy code
# are reflected in the notebook while the sys.path.insert(...) line above is active.
%load_ext autoreload
%autoreload 2

# Suppress all warnings (this hides the divide-by-zero warnings, along with any others)
warnings.filterwarnings("ignore")

# Import the ldcpy package
import ldcpy

# Display the plots in this notebook
%matplotlib inline

In [3]:
# This block of code can be used to load all variables into a pair of dataset collections:
# cols_daily and cols_monthly, which are indexed by their variable name. This data has
# all been compressed using the Bit Grooming (bg) compression algorithm.

# The 3D variables are not included in this case; the full list is kept here for reference.
#monthly_variables = ["CCN3", "CLOUD", "FLNS", "FLNT", "FSNS", "FSNT", "LHFLX",
#            "PRECC", "PRECL", "PS", "QFLX", "SHFLX", "TMQ", "TS", "U"]

monthly_variables = ["FLNS", "FLNT", "FSNS", "FSNT", "LHFLX",
            "PRECC", "PRECL", "PS", "QFLX", "SHFLX", "TMQ", "TS"]
daily_variables = ["FLUT", "LHFLX", "PRECT", "TAUX", "TS", "Z500"]

cols_monthly = {}
cols_daily = {}
sets = {}
levels = {}
data_path = "/glade/p/cisl/asap/CAM_lossy_test_data_31/research"
bg_levels = range(2, 8)  # Bit Grooming levels bg_2 through bg_7


def open_collection(variable, stream, date_range):
    """Open the original and the Bit Groomed versions of one variable as a collection."""
    filename = f"b.e11.BRCP85C5CNBDRD.f09_g16.031.cam.{stream}.{variable}.{date_range}.nc"
    levels[variable] = [f"bg_{n}_{variable}" for n in bg_levels]
    # The original file comes first, followed by one file per Bit Grooming level
    sets[variable] = [f"{data_path}/../orig/{filename}"] + [
        f"{data_path}/bg/bg_{n}/{filename}" for n in bg_levels
    ]
    return ldcpy.open_datasets("cam-fv", [variable], sets[variable],
                               [f"orig_{variable}"] + levels[variable], chunks={"time": 700})


# Daily data (h1 files, 20060101-20071231)
for variable in daily_variables:
    print(variable)
    cols_daily[variable] = open_collection(variable, "h1", "20060101-20071231")

# Monthly data (h0 files, 200601-201012)
for variable in monthly_variables:
    print(variable)
    cols_monthly[variable] = open_collection(variable, "h0", "200601-201012")

FLUT
dataset size in GB 1.13

LHFLX
dataset size in GB 1.13

PRECT
dataset size in GB 1.13

TAUX
dataset size in GB 1.13

TS
dataset size in GB 1.13

Z500
dataset size in GB 1.13

CCN3
dataset size in GB 2.79

CLOUD
dataset size in GB 2.79

FLNS
dataset size in GB 0.10

FLNT
dataset size in GB 0.10

FSNS
dataset size in GB 0.10

FSNT
dataset size in GB 0.10

LHFLX
dataset size in GB 0.10

PRECC
dataset size in GB 0.10

PRECL
dataset size in GB 0.10

PS
dataset size in GB 0.10

QFLX
dataset size in GB 0.10

SHFLX
dataset size in GB 0.10

TMQ
dataset size in GB 0.10

TS
dataset size in GB 0.10

U
dataset size in GB 2.79

