# Exercise 7b: NetCDF4 Advanced

## Aim: Introduce more advanced uses of the netCDF4 library in Python to read and create NetCDF4 Files.

Find the teaching material here: https://unidata.github.io/netcdf4-python/

### Issues covered:
- Working with time coordinates
- Multi-file datasets
- Compression of variables
- Compound data types
- Variable-length data types
- Enum data type

## Time-coordinates

Most metadata standards specify that time should be measured relative to a fixed date, with units such as `hours since YY-MM-DD hh:mm:ss`. We can convert values to and from calendar dates using `num2date` and `date2num` from the `cftime` library. Two other helpful classes are `datetime` and `timedelta` from the `datetime` module.

Q1. 
- Let's generate a list of date and time values: create a list called `dates` containing date and time values, starting from January 1st 2022 and incrementing by 6 hours, for a total of 5 entries.
- Use `date2num` to convert your list of dates to numeric values using `units="hours since 2022-01-01 00:00:00"` and `calendar="gregorian"`. Store these in an array called `times`.
- Print the numeric `times` values to confirm the numeric representation.
- Use `num2date` to convert `times` back to datetime objects using the same units and calendar. Store these in a list called `converted_dates`.
- Print the converted dates to verify they match the original dates list. 

In [3]:
from datetime import datetime, timedelta
from cftime import num2date, date2num

# Step 1: Generate dates list
dates = [datetime(2022, 1, 1) + n * timedelta(hours=6) for n in range(5)]
print("Original dates:", dates)

# Step 2: Convert dates to numeric time values
units = "hours since 2022-01-01 00:00:00"
calendar = "gregorian"
times = date2num(dates, units=units, calendar=calendar)

# Step 3: Print numeric time values
print("Numeric time values (in units '{}'):\n{}".format(units, times))

# Step 4: Convert numeric time values back to calendar dates
converted_dates = num2date(times, units=units, calendar=calendar)

# Step 5: Print converted dates
print("Dates corresponding to numeric time values:\n", converted_dates)

Original dates: [datetime.datetime(2022, 1, 1, 0, 0), datetime.datetime(2022, 1, 1, 6, 0), datetime.datetime(2022, 1, 1, 12, 0), datetime.datetime(2022, 1, 1, 18, 0), datetime.datetime(2022, 1, 2, 0, 0)]
Numeric time values (in units 'hours since 2022-01-01 00:00:00'):
[ 0  6 12 18 24]
Dates corresponding to numeric time values:
 [cftime.DatetimeGregorian(2022, 1, 1, 0, 0, 0, 0, has_year_zero=False)
 cftime.DatetimeGregorian(2022, 1, 1, 6, 0, 0, 0, has_year_zero=False)
 cftime.DatetimeGregorian(2022, 1, 1, 12, 0, 0, 0, has_year_zero=False)
 cftime.DatetimeGregorian(2022, 1, 1, 18, 0, 0, 0, has_year_zero=False)
 cftime.DatetimeGregorian(2022, 1, 2, 0, 0, 0, 0, has_year_zero=False)]
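
For a standard calendar, the numeric values are simply the elapsed time since the reference date. The following is a minimal sketch using only the standard library, assuming the same reference date and hourly units as above. The helper names `to_hours` and `from_hours` are hypothetical, and the sketch ignores the non-standard calendars that `cftime` handles.

```python
from datetime import datetime, timedelta

# Reference date taken from the units string "hours since 2022-01-01 00:00:00"
epoch = datetime(2022, 1, 1)

def to_hours(date):
    """Hypothetical helper: hours elapsed between the reference date and `date`."""
    return (date - epoch) / timedelta(hours=1)

def from_hours(hours):
    """Hypothetical helper: inverse of to_hours."""
    return epoch + timedelta(hours=hours)

dates = [epoch + n * timedelta(hours=6) for n in range(5)]
hours = [to_hours(d) for d in dates]
round_trip = [from_hours(h) for h in hours]

print(hours)
print(round_trip == dates)
```

The hour counts should match the `times` array printed above, and the round trip should reproduce the original `dates` list.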


## Multi-file datasets

Let's create multiple netCDF files with a shared variable and unlimited dimension, and use `MFDataset` to read the aggregated data as if it were contained in a single file.
- Create 5 netCDF files named `data/datafile0.nc` through to `data/datafile4.nc`. Each file should contain:
    - A single unlimited dimension named `time`.
    - A variable named `temperature` with 10 integer values ranging from `file_index * 10` to `(file_index+1) * 10 - 1`.
    - Ensure each file is saved in the `NETCDF4_CLASSIC` format.
- Using `MFDataset`, read all the `temperature` data from the 5 files at once by specifying a wildcard string `data/datafile*.nc`.
- Print the aggregated `temperature` values to verify that they span from 0 to 49.

In [9]:
from netCDF4 import Dataset, MFDataset
import numpy as np

# Step 1: Create multiple netCDF files with a shared unlimited dimension and variable
for i in range(5):
    with Dataset(f"data/datafile{i}.nc", "w", format="NETCDF4_CLASSIC") as f:
        # Create an unlimited dimension
        f.createDimension("time", None)
        # Create a variable associated with the 'time' dimension
        temp_var = f.createVariable("temperature", "i4", ("time",))
        # Populate 'temperature' with a unique range of values for each file
        temp_var[:] = np.arange(i * 10, (i+1) * 10)

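# Step 2: Use MFDataset to read all files at once
# Open outside the try block so that f.close() only runs if the open succeeded.
# The wildcard must include the data/ directory, where the files were written.
f = MFDataset("data/datafile*.nc")
try:
    # Read and aggregate all data across all files
    temperature_data = f.variables["temperature"][:]
    # Print the aggregated data
    print(temperature_data)
finally:
    # Close the MFDataset object
    f.close()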

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49]
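
When a wildcard string is given, `MFDataset` expands it into a sorted list of file names, and that sort is alphabetical. With ten or more files, `datafile10.nc` would therefore sort before `datafile2.nc`. One possible way around this, assuming the file naming used above, is to build a numerically ordered list and pass that list to `MFDataset` in place of the wildcard. The sketch below only demonstrates the ordering; the `natural_key` helper is hypothetical.

```python
import re

def natural_key(name):
    """Hypothetical helper: split a name into text and integer chunks for sorting."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]

names = [f"data/datafile{i}.nc" for i in (0, 1, 2, 10, 11)]

alphabetical = sorted(names)
numeric = sorted(names, key=natural_key)

print(alphabetical)  # datafile10.nc and datafile11.nc come before datafile2.nc
print(numeric)       # files in index order
```

Zero-padding the index in the file name (for example `datafile00.nc`) would be another way to keep the alphabetical and numeric orders the same.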


## Compression of variables

Let's explore various compression options available in netCDF. 
- Create a sample NetCDF file: define a dataset with dimensions (`time`, `level`, `lat`, `lon`)
- Generate random temperature data.
- First, create the temperature variable without compression and observe the file size.
- Then, enable zlib compression and observe the change in file size.
- Finally, add quantization with `least_significant_digit` or `significant_digits` and check the file size again.

In [10]:
import os

# Step 1: Create a random dataset
time_dim, level_dim, lat_dim, lon_dim = 10, 5, 50, 100
data = np.random.rand(time_dim, level_dim, lat_dim, lon_dim) * 30 + 273.15

# Step 2: Define a function to create netCDF file with specified compression settings
def create_netcdf(file_path, compression=None, least_significant_digit=None, significant_digits=None):
    with Dataset(file_path, 'w', format="NETCDF4") as rootgrp:
        # Create dimensions
        rootgrp.createDimension("time", time_dim)
        rootgrp.createDimension("level", level_dim)
        rootgrp.createDimension("lat", lat_dim)
        rootgrp.createDimension("lon", lon_dim)
        # Define variable with compression settings
        temp = rootgrp.createVariable(
            "temp", "f4", ("time", "level", "lat", "lon"),
            compression=compression,
            least_significant_digit=least_significant_digit,
            significant_digits=significant_digits,
        )
        # Assign data to the variable
        temp[:] = data
    # Check and print file size
    print(f"File size with compression={compression}, "
          f"least_significant_digit={least_significant_digit}, "
          f"significant_digits={significant_digits}: {os.path.getsize(file_path) / 1024:.2f} kB")

# Step 3: Test different compression settings
# 3.1 No compression
create_netcdf("data/temperature_data_nocompress.nc")

# 3.2 Compression with zlib only
create_netcdf("data/temperature_data_zlib.nc", compression='zlib')

# 3.3 Compression with zlib and least significant digit quantization
create_netcdf("data/temperature_data_zlib_lsd.nc", compression='zlib', least_significant_digit=3)

# 3.4 Compression with zlib and significant digits quantization
create_netcdf("data/temperature_data_zlib_sig.nc", compression='zlib', significant_digits=4)

File size with compression=None, least_significant_digit=None, significant_digits=None: 983.31 kB
File size with compression=zlib, least_significant_digit=None, significant_digits=None: 639.34 kB
File size with compression=zlib, least_significant_digit=3, significant_digits=None: 505.77 kB
File size with compression=zlib, least_significant_digit=None, significant_digits=4: 396.48 kB
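
The size reduction from `least_significant_digit` comes from rounding away mantissa bits that carry no wanted precision, which leaves zlib with more repetitive bytes to compress. The netCDF4 documentation describes the rounding as `numpy.around(scale*data)/scale` with `scale = 2**bits`. The sketch below imitates that with NumPy and the standard-library `zlib` module. The way `bits` is derived here is an assumption chosen to reproduce the documented example (`bits=4` for a precision of 0.1), and the `quantize` helper is hypothetical rather than part of the library.

```python
import zlib
import numpy as np

def quantize(data, least_significant_digit):
    """Hypothetical helper: round to the nearest multiple of 2**-bits."""
    # Smallest power of two that resolves 10**-least_significant_digit (assumption)
    bits = int(np.ceil(np.log2(10.0 ** least_significant_digit)))
    scale = 2.0 ** bits
    return np.around(scale * data) / scale, scale

rng = np.random.default_rng(0)
data = rng.random(50_000) * 30 + 273.15   # float64 temperatures in kelvin

quantized, scale = quantize(data, least_significant_digit=3)

# Rounding to the nearest multiple of 1/scale moves each value by at most half a step
max_error = np.max(np.abs(quantized - data))
print(f"scale = {scale:.0f}, max error = {max_error:.6f}")

# Compare zlib-compressed sizes of the float32 bytes before and after quantization
raw_size = len(zlib.compress(data.astype(np.float32).tobytes()))
quantized_size = len(zlib.compress(quantized.astype(np.float32).tobytes()))
print(f"raw: {raw_size} bytes, quantized: {quantized_size} bytes")
```

The exact byte counts will differ from the file sizes above, since the library also stores metadata and applies its own filters, but the quantized data should compress to fewer bytes.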


## Compound data types

Let's work with compound data types and structured arrays.
- Create a netCDF file called `vectors.nc` in write mode.
- Define a compound data type that represents a 3D vector. Each vector should have 3 components:
    - `x`: a `float32` representing the x-coordinate
    - `y`: a `float32` representing the y-coordinate
    - `z`: a `float32` representing the z-coordinate
- Create a dimension named `num_vectors` to store an unlimited number of vectors.
- Create a variable in the file using the compound data type and the dimension defined above.
- Generate a numpy structured array with 5 sample 3D vectors:
    - Each vector should have random values for `x`, `y` and `z` components.
    - Store these in the structured array and write them to the netCDF variable.
- Close the file and then reopen it in read mode.
- Read the data back into a new numpy structured array and print each vector.  

In [6]:
# Step 1: Create a netCDF file in write mode.
f = Dataset("data/vectors.nc", "w", format="NETCDF4")

# Step 2: Define a compound data type for a 3D vector with x,y,z as float32 fields
vector_dtype = np.dtype([("x", np.float32), ("y", np.float32), ("z", np.float32)])
vector_t = f.createCompoundType(vector_dtype, "vector3D")

# Step 3: Create a dimension for storing an unlimited number of vectors
num_vectors = f.createDimension("num_vectors", None)

# Step 4: Create a variable using the compound data type and the dimension
vector_var = f.createVariable("vector_data", vector_t, ("num_vectors",))

# Step 5: Generate a numpy structured array with 5 random 3D vectors
num_samples = 5
data = np.empty(num_samples, dtype=vector_dtype)
data["x"] = np.random.rand(num_samples)
data["y"] = np.random.rand(num_samples)
data["z"] = np.random.rand(num_samples)

# Write the structured array to the netCDF variable
vector_var[:] = data

# Close the file
f.close()

# Step 6: Reopen the file in read mode
f = Dataset("data/vectors.nc", "r")
vector_var = f.variables["vector_data"]

# Step 7: Read the data back into a new structured array and print each vector
data_in = vector_var[:]
for i, vec in enumerate(data_in):
    print(f"Vector {i}: (x: {vec['x']:.2f}, y: {vec['y']:.2f}, z: {vec['z']:.2f})")

# Close the file
f.close()

Vector 0: (x: 0.77, y: 0.94, z: 0.85)
Vector 1: (x: 0.88, y: 0.60, z: 0.48)
Vector 2: (x: 0.10, y: 0.05, z: 0.59)
Vector 3: (x: 0.53, y: 0.26, z: 0.86)
Vector 4: (x: 0.65, y: 0.37, z: 0.09)
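
Because a compound variable is read back as a NumPy structured array, each field can be used like an ordinary array. The small sketch below uses fixed values as a stand-in for `data_in`; the values and the name `magnitudes` are illustrative assumptions.

```python
import numpy as np

vector_dtype = np.dtype([("x", np.float32), ("y", np.float32), ("z", np.float32)])

# Stand-in for data_in: two vectors with known lengths
vectors = np.array([(3.0, 4.0, 0.0), (1.0, 2.0, 2.0)], dtype=vector_dtype)

# Field access returns a plain float32 array per component
magnitudes = np.sqrt(vectors["x"] ** 2 + vectors["y"] ** 2 + vectors["z"] ** 2)
print(magnitudes)

# Each record occupies three 4-byte floats
print(vector_dtype.names, vector_dtype.itemsize)
```

Working on whole fields like this avoids looping over the records one at a time.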


## Variable-length data types

Let's create and manipulate variable-length (vlen) arrays.
- Create a netCDF file named `exercise_vlen.nc` in write mode.
- Define dimensions:
    - Create a dimension `a` with a size of `5`.
    - Create a dimension `b` with a size of `4`.
- Create a variable-length data type for signed 32-bit integers, named `my_vlen_int` using `np.int32` as the datatype.
- Use the vlen type you defined to create a variable `vlen_var` with dimensions `("a", "b")`. Populate `vlen_var` with random data:
    - Each element should be a 1D numpy array of random length between 2 and 8.
    - Each array element should contain random integers between 1 and 100. 
- Create a new dimension `c` with a size of `7`. Define a variable `vlen_str_var` along dimension `c`. Populate this variable with random strings of lengths between 3 and 10 using uppercase and lowercase alphabetic characters.
- Print the contents of `vlen_var` and `vlen_str_var`. Print the structure of the netCDF4 file to show defined dimensions, variables, and data types.

In [7]:
import random
import string

# Step 1: Create a netCDF file in write mode
f = Dataset("data/exercise_vlen.nc", "w")

# Step 2: Define dimensions a and b
f.createDimension("a", 5)
f.createDimension("b", 4)

# Step 3: Create a variable-length data type for signed 32-bit integers
vlen_int_type = f.createVLType(np.int32, "my_vlen_int")

# Step 4: Create and populate the variable length integer array
vlen_var = f.createVariable("vlen_var", vlen_int_type, ("a", "b"))

# Populate vlen_var with random-length integer arrays
data = np.empty((5,4), dtype=object)
for i in range(5):
    for j in range(4):
        random_length = random.randint(2, 8)
        data[i,j] = np.random.randint(1, 101, size=random_length, dtype=np.int32)

# Assign the data to the netCDF variable
vlen_var[:, :] = data

# Step 5: Create a dimension 'c' and define a variable-length string array
f.createDimension("c", 7)
str_var = f.createVariable("vlen_str_var", str, ("c",))

# Populate vlen_str_var with random strings of lengths 3-10
chars = string.ascii_letters
string_data = np.empty(7, dtype=object)
for i in range(7):
    random_length = random.randint(3,10)
    string_data[i] = ''.join(random.choice(chars) for _ in range(random_length))

# Assign the string data to the netCDF variable
str_var[:] = string_data

# Step 6: Print the contents of "vlen_var" and "vlen_str_var"
print("Contents of 'vlen_var':\n", vlen_var[:])
print("Contents of 'vlen_str_var':\n", str_var[:])

# Print the structure of the NetCDF4 file
print("\nNetCDF4 file structure:\n", f)
print("Details of 'vlen_var':\n", f.variables["vlen_var"])
print("Details of 'vlen_str_var':\n", f.variables["vlen_str_var"])

# Close the file
f.close()

Contents of 'vlen_var':
 [[array([82, 43, 68, 85, 90], dtype=int32)
  array([29, 68, 23], dtype=int32) array([ 3, 53], dtype=int32)
  array([61, 51,  5,  9, 78], dtype=int32)]
 [array([34, 96, 34, 82, 45, 56], dtype=int32)
  array([70, 22, 96, 86], dtype=int32)
  array([58, 90, 83, 69, 87, 63], dtype=int32)
  array([39, 68, 95, 39, 11, 57, 13, 21], dtype=int32)]
 [array([98, 41, 39], dtype=int32) array([43, 37, 57], dtype=int32)
  array([85, 84, 75, 25], dtype=int32)
  array([68, 25, 87, 42, 24], dtype=int32)]
 [array([ 2, 72,  2, 60, 94, 20], dtype=int32)
  array([41, 71, 21, 93, 60], dtype=int32)
  array([69, 81, 68, 78], dtype=int32)
  array([68, 13,  7, 90, 45, 59, 69, 89], dtype=int32)]
 [array([89, 33, 92, 88, 33, 38], dtype=int32)
  array([ 4, 42, 82, 28,  2], dtype=int32)
  array([98, 22, 64, 75,  3,  8, 75, 67], dtype=int32)
  array([96, 38, 97,  7], dtype=int32)]]
Contents of 'vlen_str_var':
 ['REulfoDs' 'BTx' 'pGqvr' 'dqYR' 'DjqTM' 'PAFCKNbi' 'niPqKWsLIh']

NetCDF4 file stru
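
Variable-length data is held in a NumPy array of `dtype=object`, where each cell is its own 1D array, so the outer array's `shape` only describes the grid of cells and not their contents. The sketch below inspects such a ragged array cell by cell, using a small seeded stand-in for `vlen_var[:]`; the 2x3 shape and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for vlen_var[:]: a 2x3 object array of int32 arrays with 2 to 8 elements
ragged = np.empty((2, 3), dtype=object)
for i in range(2):
    for j in range(3):
        length = rng.integers(2, 9)  # upper bound is exclusive
        ragged[i, j] = rng.integers(1, 101, size=length, dtype=np.int32)

# Per-cell lengths and totals are computed with explicit loops over the cells
lengths = np.array([[len(cell) for cell in row] for row in ragged])
totals = np.array([[int(cell.sum()) for cell in row] for row in ragged])

print(lengths)
print(totals)
```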

## Enum data type

Let's create a netCDF file to store weather observation data, including an enumerated type representing different types of precipitation.
- Create a Python dictionary `precip_dict` where:
    - `None` maps to `0`
    - `Rain` maps to `1`
    - `Snow` maps to `2`
    - `Sleet` maps to `3`
    - `Hail` maps to `4`
    - `Unknown` maps to `255`
- Use this dictionary to define an Enum data type called `precip_t` with a base type of `np.uint8`
- Define a dimension called `time` with an unlimited length for observations over time
- Create a 1D variable named `precipitation` of type `precip_t` that uses the `time` dimension
- Set the `fill_value` attribute of the variable to `255` (indicating missing data)
- Write the following precipitation observations to the `precipitation` variable: `['None', 'Rain', 'Snow', 'Unknown', 'Sleet']`
- Close and reopen the file, then print:
    - the data values of the `precipitation` variable, to confirm that they match the written values
    - the enum dictionary associated with the enum data type, to verify the precipitation mapping

In [8]:
# Step 1: Create a new netCDF file
nc = Dataset('data/weather_data.nc', 'w', format='NETCDF4')

# Step 2: Define the Enum dictionary and create the Enum type
precip_dict = {
    'None': 0,
    'Rain': 1,
    'Snow': 2,
    'Sleet': 3,
    'Hail': 4,
    'Unknown': 255
}

# Create an Enum type called 'precip_t' with base type uint8
precip_type = nc.createEnumType(np.uint8, 'precip_t', precip_dict)

# Step 3: Create a time dimension and variable using the Enum type
nc.createDimension('time', None)  # unlimited dimension for time

# Create the precipitation variable, setting the fill_value to 'Unknown' (255)
precip_var = nc.createVariable('precipitation', precip_type, ('time',),
                               fill_value=precip_dict['Unknown'])

# Step 4: Write data to the variable
precip_var[:] = [precip_dict[k] for k in ['None', 'Rain', 'Snow', 'Unknown', 'Sleet']]

# Close the file
nc.close()

# Step 5: Reopen the file, read and print the data
nc = Dataset('data/weather_data.nc', 'r')
precip_var = nc.variables['precipitation']

# Print the Enum dictionary
print("Enum dictionary:", precip_var.datatype.enum_dict)

# Print the data stored in the variable
print("Precipitation data:", precip_var[:])

# Close the file
nc.close()

Enum dictionary: {'None': 0, 'Rain': 1, 'Snow': 2, 'Sleet': 3, 'Hail': 4, 'Unknown': 255}
Precipitation data: [0 1 2 -- 3]
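
The variable stores the integer codes, and the entry equal to the fill value comes back masked (printed as `--`). The sketch below translates codes back to names by inverting the dictionary, using a masked array built by hand as a stand-in for `precip_var[:]`. The `"missing"` label is an arbitrary choice for masked entries.

```python
import numpy as np

precip_dict = {"None": 0, "Rain": 1, "Snow": 2, "Sleet": 3, "Hail": 4, "Unknown": 255}

# Invert the mapping so that integer codes can be looked up by value
code_to_name = {code: name for name, code in precip_dict.items()}

# Stand-in for precip_var[:]: the value equal to the fill value (255) is masked
codes = np.ma.masked_equal(np.array([0, 1, 2, 255, 3], dtype=np.uint8), 255)

mask = np.ma.getmaskarray(codes)
raw = codes.filled(precip_dict["Unknown"])

# Masked entries are reported as "missing"; all others are mapped back to names
labels = ["missing" if is_masked else code_to_name[int(value)]
          for value, is_masked in zip(raw, mask)]
print(labels)
```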
