# Exercise 8: NetCDF4 Advanced

## Aim: Introduce more advanced uses of the netCDF4 library in Python to read and create NetCDF4 Files.

Find the teaching material here: https://unidata.github.io/netcdf4-python/

### Issues covered:
- Working with time coordinates
- Multi-file datasets
- Compression of variables
- Compound datatypes
- Variable-length data types
- Enum data type

## Time-coordinates

Most metadata standards specify that time should be measured relative to a fixed date, with units such as `hours since YYYY-MM-DD hh:mm:ss`. We can convert values to and from calendar dates using `num2date` and `date2num` from the `cftime` library. Two other helpful tools are the `datetime` and `timedelta` classes from the `datetime` module.
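
To make the idea of "units since a reference date" concrete, here is a minimal sketch that does the conversion by hand using only the standard library. The names `reference`, `stamp`, `hours_since` and `recovered` are illustrative and not part of the exercise; `date2num` and `num2date` do the same job while also handling calendars and arrays of values.

```python
from datetime import datetime, timedelta

# Reference date taken from a units string such as
# "hours since 2022-01-01 00:00:00"
reference = datetime(2022, 1, 1, 0, 0, 0)

# A calendar date we want to express as a number
stamp = datetime(2022, 1, 2, 6, 0, 0)

# Forward conversion: elapsed time divided by the unit (one hour)
hours_since = (stamp - reference) / timedelta(hours=1)
print(hours_since)  # 30.0

# Reverse conversion: reference date plus the numeric offset
recovered = reference + timedelta(hours=hours_since)
print(recovered == stamp)  # True
```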

Q1. 
- Let's generate a list of date and time values: create a list called `dates` containing date and time values, starting from January 1st 2022, and incrementing by 6 hours for a total of 5 entries. 
- Use `date2num` to convert your list of dates to numeric values using: `units="hours since 2022-01-01 00:00:00"` and `calendar="gregorian"`. Store these in an array called `times`.
- Print the numeric time values to confirm the numeric representation.
- Use `num2date` to convert times back to datetime objects using the same units and calendar. Store these in a list called `converted_dates`
- Print the converted dates to verify they match the original dates list. 

In [1]:
from datetime import datetime, timedelta
from cftime import num2date, date2num

# Step 1: Generate dates list
dates = [datetime(2022, 1, 1) + n * timedelta(hours=6) for n in range(5)]
print("Original dates:", dates)

# Step 2: Convert dates to numeric time values
units = "hours since 2022-01-01 00:00:00"
calendar = "gregorian"
times = date2num(dates, units=units, calendar=calendar)

# Step 3: Print numeric time values
print("Numeric time values (in units '{}'):\n{}".format(units, times))

# Step 4: Convert numeric time values back to calendar dates
converted_dates = num2date(times, units=units, calendar=calendar)

# Step 5: Print converted dates
print("Dates corresponding to numeric time values:\n", converted_dates)

Original dates: [datetime.datetime(2022, 1, 1, 0, 0), datetime.datetime(2022, 1, 1, 6, 0), datetime.datetime(2022, 1, 1, 12, 0), datetime.datetime(2022, 1, 1, 18, 0), datetime.datetime(2022, 1, 2, 0, 0)]
Numeric time values (in units 'hours since 2022-01-01 00:00:00'):
[ 0  6 12 18 24]
Dates corresponding to numeric time values:
 [cftime.DatetimeGregorian(2022, 1, 1, 0, 0, 0, 0, has_year_zero=False)
 cftime.DatetimeGregorian(2022, 1, 1, 6, 0, 0, 0, has_year_zero=False)
 cftime.DatetimeGregorian(2022, 1, 1, 12, 0, 0, 0, has_year_zero=False)
 cftime.DatetimeGregorian(2022, 1, 1, 18, 0, 0, 0, has_year_zero=False)
 cftime.DatetimeGregorian(2022, 1, 2, 0, 0, 0, 0, has_year_zero=False)]


## Multi-file datasets

Q2. Let's create multiple netCDF files with a shared variable and unlimited dimension, and use `MFDataset` to read the aggregated data as if it were contained in a single file.
- Create 5 netCDF files named `data/datafile0.nc` through to `data/datafile4.nc`. Each file should contain:
    - A single unlimited dimension named `time`.
    - A variable named `temperature` with 10 integer values ranging from `file_index * 10` to `(file_index+1) * 10 - 1`.
    - Ensure each file is saved in the `NETCDF4_CLASSIC` format.
    - **Hint: Use a loop such as `for .. in range(..):` to do this task.**
- Using `MFDataset`, read all the `temperature` data from the 5 files at once by specifying the wildcard string `data/datafile*.nc`, and store the resulting object in a variable `f`. Assign the data to a new variable using `temperature_data = f.variables["temperature"][:]`
- Print the aggregated `temperature` values to verify that they span from 0 to 49.

In [2]:
from netCDF4 import Dataset, MFDataset
import numpy as np

# Step 1: Create multiple netCDF files with a shared unlimited dimension and variable
for i in range(5):
    with Dataset(f"data/datafile{i}.nc", "w", format="NETCDF4_CLASSIC") as f:
        # Create an unlimited dimension
        f.createDimension("time", None)
        # Create a variable associated with the 'time' dimension
        temp_var = f.createVariable("temperature", "i4", ("time",))
        # Populate 'temperature' with a unique range of values for each file
        temp_var[:] = np.arange(i * 10, (i+1) * 10)

# Step 2: Use MFDataset to read all files at once
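# Open the aggregated dataset before the try block, so that `f` always refers
# to the MFDataset when the finally clause closes it
f = MFDataset("data/datafile*.nc")
try:
    # Read and aggregate the data across all files
    temperature_data = f.variables["temperature"][:]
    # Print the aggregated data
    print(temperature_data)
finally:
    # Close the MFDataset object
    f.close()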

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49]


## Compression of variables

Q3. Let's explore various compression options available in netCDF. 
- Run the following cell to create an array of random temperature data and define a function that creates NetCDF files with the given compression settings. Take a look at the function and figure out what it's doing.

In [3]:
import os

# Step 1: Create a random dataset 
time_dim, level_dim, lat_dim, lon_dim = 10, 5, 50, 100
data = np.random.rand(time_dim, level_dim, lat_dim, lon_dim) * 30 + 273.15

# Step 2: Create a function to create NetCDF files with the given compression settings:
file_path = "data/temperature_data.nc"
def create_netcdf(file_path, compression=None, least_significant_digit=None, significant_digits=None):
    with Dataset(file_path, 'w', format="NETCDF4") as rootgrp:
        # Create dimensions
        rootgrp.createDimension("time", time_dim)
        rootgrp.createDimension("level", level_dim)
        rootgrp.createDimension("lat", lat_dim)
        rootgrp.createDimension("lon", lon_dim)
        # Define variable with compression settings
        temp = rootgrp.createVariable("temp", "f4", ("time", "level", "lat", "lon"), compression=compression, least_significant_digit=least_significant_digit, significant_digits=significant_digits)
        #Assign data to the variable
        temp[:] = data
    # Check and print file size
    print(f"File size with compression={compression}, "
          f"least_significant_digit={least_significant_digit}, "
          f"significant_digits={significant_digits}: {os.path.getsize(file_path) / 1024:.2f} kB")

- Use this function to test the following cases:
    - First, create the temperature variable without compression and observe the file size. Use the file path `data/temperature_data_nocompress.nc`.
    - Then, enable zlib compression and observe the change in file size. Use the file path `data/temperature_data_zlib.nc`.
    - Next, add zlib and least significant digit quantization (`least_significant_digit=3`) and check the file size again. Use the file path `data/temperature_data_zlib_lsd.nc`.
    - Finally, add zlib and significant digits quantization (`significant_digits=4`) and check the file size again. Use the file path `data/temperature_data_zlib_sig.nc`.
    - Hint: call the function using: `create_netcdf(file_path, compression, least_significant_digit, significant_digits)`. Note that the compression and quantization arguments default to `None`, so you can omit any you don't need; passing the others by keyword (e.g. `significant_digits=4`) avoids having to fill in the ones in between.

In [4]:
# Step 3: Test different compression settings
# 3.1 No compression
create_netcdf("data/temperature_data_nocompress.nc")

# 3.2 Compression with zlib only
create_netcdf("data/temperature_data_zlib.nc", compression='zlib')

# 3.3 Compression with zlib and least significant digit quantization
create_netcdf("data/temperature_data_zlib_lsd.nc", compression='zlib', least_significant_digit=3)

# 3.4 Compression with zlib and significant digits quantization
create_netcdf("data/temperature_data_zlib_sig.nc", compression='zlib', significant_digits=4)

File size with compression=None, least_significant_digit=None, significant_digits=None: 983.31 kB
File size with compression=zlib, least_significant_digit=None, significant_digits=None: 639.42 kB


File size with compression=zlib, least_significant_digit=3, significant_digits=None: 505.93 kB
File size with compression=zlib, least_significant_digit=None, significant_digits=4: 396.40 kB
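
To see why quantization helps zlib, the following sketch reproduces the idea outside of netCDF using only `numpy` and the standard-library `zlib` module. It assumes the rounding rule described in the netCDF4-python documentation for `least_significant_digit` (`numpy.around(scale*data)/scale` with `scale = 2**bits`); the variable names are illustrative, and the byte counts will differ from the file sizes above because no file metadata or chunking is involved.

```python
import zlib

import numpy as np

# Random temperatures similar to the exercise data, stored as float32
rng = np.random.default_rng(0)
data = (rng.random(100_000) * 30 + 273.15).astype(np.float32)

# Quantize so that a precision of 10**-3 is retained
# (assumed to mirror least_significant_digit=3)
bits = int(np.ceil(np.log2(10.0 ** 3)))
scale = 2.0 ** bits
quantized = (np.around(scale * data) / scale).astype(np.float32)

# Compress the raw bytes of both arrays with zlib
raw_size = len(zlib.compress(data.tobytes()))
quant_size = len(zlib.compress(quantized.tobytes()))

# Largest difference introduced by the quantization
max_error = np.abs(quantized.astype(np.float64) - data.astype(np.float64)).max()

print(f"bits kept after the binary point: {bits}")
print(f"zlib size without quantization: {raw_size} bytes")
print(f"zlib size with quantization:    {quant_size} bytes")
print(f"maximum absolute error:         {max_error:.6f}")
```

Quantization zeroes the low-order bits of each value, so the byte stream becomes more repetitive and compresses better, at the cost of a small, bounded loss of precision.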


## Compound data types

Q4. Let's work with compound data types and structured arrays.
- Create a netCDF file called `data/vectors.nc` in write mode with `NETCDF4` format assigned to the variable `f`.
- Define a compound data type that represents a 3D vector. Each vector should have 3 components:
    - `x`: a `float32` representing the x-coordinate
    - `y`: a `float32` representing the y-coordinate
    - `z`: a `float32` representing the z-coordinate
    - Hint: use `np.dtype([('x', type), ('y'..), (...)])` to define x,y,z then `f.createCompoundType()` to create the compound data type.
- Create a dimension named `num_vectors` to store an unlimited number of vectors.
- Create a variable called `vector_data` in the file using the compound data type from step 2, with the dimension from step 3.
- Generate a numpy structured array with 5 sample 3D vectors:
    - Each vector should have random values for `x`, `y` and `z` components (use `np.random.rand(num_samples)`).
    - Store these in the structured array (initialize the array with `np.empty(num_samples, dtype)`, then use `data["x"]` etc. to assign the data).
    - Write them to the netCDF variable.
- Close the file and then reopen it in read mode.
- Read the data back into a new numpy structured array and print each vector.
    - Hint: use `f.variables['var_name']` to read in the variable data.
    - Hint: Use `data_in = vector_var[:]` to extract the data for the variables.
    - Hint: Use `for i, vec in enumerate(data_in):` to loop through the data so you can print each vector.

In [5]:
# Step 1: Create a netCDF file in write mode.
f = Dataset("data/vectors.nc", "w", format="NETCDF4")

# Step 2: Define a compound data type for a 3D vector with x,y,z as float32 fields
vector_dtype = np.dtype([("x", np.float32), ("y", np.float32), ("z", np.float32)])
vector_t = f.createCompoundType(vector_dtype, "vector3D")

# Step 3: Create a dimension for storing an unlimited number of vectors
num_vectors = f.createDimension("num_vectors", None)

# Step 4: Create a variable using the compound data type and the dimension
vector_var = f.createVariable("vector_data", vector_t, ("num_vectors",))

# Step 5: Generate a numpy structured array with 5 random 3D vectors
num_samples = 5
data = np.empty(num_samples, dtype=vector_dtype)
data["x"] = np.random.rand(num_samples)
data["y"] = np.random.rand(num_samples)
data["z"] = np.random.rand(num_samples)
# Write the structured array to the netCDF variable
vector_var[:] = data

# Step 6: Close and reopen in read mode
# Close the file
f.close()
# Reopen the file in read-mode
f = Dataset("data/vectors.nc", "r")

# Step 7: Read the data back into a new structured array and print each vector
vector_var = f.variables["vector_data"]
data_in = vector_var[:]
for i, vec in enumerate(data_in):
    print(f"Vector {i}: (x: {vec['x']:.2f}, y: {vec['y']:.2f}, z: {vec['z']:.2f})")

# Close the file
f.close()

Vector 0: (x: 0.99, y: 0.74, z: 0.50)
Vector 1: (x: 0.07, y: 0.85, z: 0.78)
Vector 2: (x: 0.79, y: 0.02, z: 0.28)
Vector 3: (x: 0.21, y: 0.61, z: 0.37)
Vector 4: (x: 0.78, y: 0.71, z: 0.08)
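
The compound type maps onto a numpy structured array, so it can help to look at that structure on its own. This is a small numpy-only sketch with made-up values (the names `vectors` and `lengths` are illustrative): each record holds three `float32` fields, and each field can be read or written as a whole column.

```python
import numpy as np

# Same field layout as the compound type in the exercise
vector_dtype = np.dtype([("x", np.float32), ("y", np.float32), ("z", np.float32)])

# Three records with fixed, made-up values
vectors = np.zeros(3, dtype=vector_dtype)
vectors["x"] = [1.0, 2.0, 3.0]
vectors["y"] = [0.0, 0.5, 1.0]
vectors["z"] = [4.0, 4.0, 4.0]

print(vector_dtype.names)     # ('x', 'y', 'z')
print(vector_dtype.itemsize)  # 12 bytes: three 4-byte floats per record

# Field access returns an ordinary array, so column-wise maths works
lengths = np.sqrt(vectors["x"] ** 2 + vectors["y"] ** 2 + vectors["z"] ** 2)
print(lengths)
```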


## Variable-length data types

Q5. Let's create and manipulate variable-length (vlen) arrays.
- Create a netCDF file named `data/exercise_vlen.nc` in write mode.
- Define dimensions:
    - Create a dimension `a` with a size of `5`.
    - Create a dimension `b` with a size of `4`.
- Create a variable-length data type using `f.createVLType()` named `my_vlen_int` using `np.int32` as the datatype.
- Use the vlen type you defined to create a variable `vlen_var` with dimensions `("a", "b")`.
- Populate `vlen_var` with random data:
    - Use the following to generate the random data:
      ```
      data = np.empty((5,4), dtype=object)
      for i in range(5):
        for j in range(4):
          random_length = random.randint(2, 8)
          data[i,j] = np.random.randint(1, 101, size=random_length, dtype=np.int32)
      ```
    - Assign the data to the netCDF variable
- Create a new dimension `c` with a size of `7`.
- Define a variable-length string variable named `vlen_str_var` (datatype `str`) along dimension `c`, and assign it to a Python variable called `str_var`.
- Populate this variable with random strings of lengths between 3 and 10 using uppercase and lowercase alphabetic characters using the following:
    ```
    chars = string.ascii_letters
    string_data = np.empty(7, dtype=object)
    for i in range(7):
        random_length = random.randint(3,10)
        string_data[i] = ''.join(random.choice(chars) for _ in range(random_length))
    # Assign the string data to the netCDF variable
    str_var[:] = string_data
    ```
- Print the contents of `vlen_var` and `vlen_str_var`. Print the structure of the netCDF4 file to show defined dimensions, variables, and data types.
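
The snippets above build the data in arrays with `dtype=object`. A short numpy-only sketch of why: an ordinary numeric array needs every element to have the same shape, whereas an object array stores one reference per element, so each element can be an array of a different length. The name `ragged` is illustrative.

```python
import numpy as np

# One slot per element; each slot can hold an array of any length
ragged = np.empty(3, dtype=object)
ragged[0] = np.array([1, 2], dtype=np.int32)
ragged[1] = np.array([3, 4, 5, 6], dtype=np.int32)
ragged[2] = np.array([7], dtype=np.int32)

# The outer array has a fixed shape, the inner arrays do not
lengths = [len(item) for item in ragged]
print(ragged.shape)  # (3,)
print(lengths)       # [2, 4, 1]
```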

In [6]:
import random
import string

# Step 1: Create a netCDF file in write mode
f = Dataset("data/exercise_vlen.nc", "w")

# Step 2: Define dimensions a and b
f.createDimension("a", 5)
f.createDimension("b", 4)

# Step 3: Create a variable-length data type for signed 32-bit integers
vlen_int_type = f.createVLType(np.int32, "my_vlen_int")

# Step 4: Create and populate the variable length integer array
vlen_var = f.createVariable("vlen_var", vlen_int_type, ("a", "b"))

# Step 5: Populate vlen_var with random-length integer arrays
data = np.empty((5,4), dtype=object)
for i in range(5):
    for j in range(4):
        random_length = random.randint(2, 8)
        data[i,j] = np.random.randint(1, 101, size=random_length, dtype=np.int32)
# Assign the data to the netCDF variable
vlen_var[:, :] = data

# Step 6: Create a dimension 'c'
f.createDimension("c", 7)

# Step 7: Define a variable-length string array
str_var = f.createVariable("vlen_str_var", str, ("c",))

# Step 8: Populate vlen_str_var with random strings of lengths 3-10
chars = string.ascii_letters
string_data = np.empty(7, dtype=object)
for i in range(7):
    random_length = random.randint(3,10)
    string_data[i] = ''.join(random.choice(chars) for _ in range(random_length))
# Assign the string data to the netCDF variable
str_var[:] = string_data

# Step 9: Print the contents of "vlen_var" and "vlen_str_var"
print("Contents of 'vlen_var':\n", vlen_var[:])
print("Contents of 'vlen_str_var':\n", str_var[:])
# Print the structure of the NetCDF4 file
print("\nNetCDF4 file structure:\n", f)
print("Details of 'vlen_var':\n", f.variables["vlen_var"])
print("Details of 'vlen_str_var':\n", f.variables["vlen_str_var"])

# Close the file
f.close()

Contents of 'vlen_var':
 [[array([93, 90, 94, 86], dtype=int32)
  array([93, 45, 89, 80, 43], dtype=int32)
  array([32, 42, 75, 38], dtype=int32)
  array([49, 57, 48, 43, 79, 46, 98, 82], dtype=int32)]
 [array([38, 98,  9], dtype=int32) array([63, 80], dtype=int32)
  array([20, 31], dtype=int32) array([98, 58, 46], dtype=int32)]
 [array([77,  2, 51, 17], dtype=int32)
  array([62, 98, 25, 41, 62, 42, 86, 23], dtype=int32)
  array([44, 52, 10, 56, 80, 29, 66, 17], dtype=int32)
  array([68, 72], dtype=int32)]
 [array([42, 74, 53], dtype=int32) array([68, 15, 91], dtype=int32)
  array([13, 42, 30, 17, 96, 24, 88, 21], dtype=int32)
  array([13, 96, 87, 76, 37], dtype=int32)]
 [array([51, 85, 44, 37, 87,  2, 84], dtype=int32)
  array([30, 92, 84, 11], dtype=int32)
  array([ 83,  91,   2,  11,  76,  55, 100], dtype=int32)
  array([98, 56, 55, 41, 89, 49, 68], dtype=int32)]]
Contents of 'vlen_str_var':
 ['KwF' 'AdpMJjMhcu' 'pasfMCKgA' 'nJvVKGF' 'jVbc' 'hkUa' 'uPnH']

NetCDF4 file structure:
 <

## Enum data type

Q6. Let's create a netCDF file to store weather observation data, including an enumerated type representing different types of precipitation.
- Create a new netCDF file called `data/weather_data.nc` in write mode with the `NETCDF4` format.
- Create a Python dictionary `precip_dict` where:
    - `None` maps to `0`
    - `Rain` maps to `1`
    - `Snow` maps to `2`
    - `Sleet` maps to `3`
    - `Hail` maps to `4`
    - `Unknown` maps to `255`
- Use this dictionary to define an Enum data type using `.createEnumType()` called `precip_t` with a base type of `np.uint8`
- Define a dimension called `time` with an unlimited length for observations over time
- Create a 1D variable named `precipitation` that uses the `precip_t` Enum type and the `time` dimension, and has `fill_value=precip_dict['Unknown']`. The fill value indicates missing data.
- Write the following precipitation observations to the `precipitation` variable: `precip_var[:] = [precip_dict[k] for k in ['None', 'Rain', 'Snow', 'Unknown', 'Sleet']]`.
- Close and reopen the file in read mode, then print the contents of the `precipitation` variable, including: the data values (to confirm they match the written values) and the enum dictionary associated with the Enum data type (to verify the precipitation mapping).

In [7]:
# Step 1: Create a new netCDF file
nc = Dataset('data/weather_data.nc', 'w', format='NETCDF4')

# Step 2: Define the Enum dictionary and create the Enum type
precip_dict = {
    'None': 0,
    'Rain': 1,
    'Snow': 2,
    'Sleet': 3,
    'Hail': 4,
    'Unknown': 255
}

# Step 3: Create an Enum type called 'precip_t' with base type uint8
precip_type = nc.createEnumType(np.uint8, 'precip_t', precip_dict)

# Step 4: Create a time dimension
nc.createDimension('time', None)

# Step 5: Create the precipitation variable, setting the fill_value to 'Unknown'
precip_var = nc.createVariable('precipitation', precip_type, ('time',),
                               fill_value=precip_dict['Unknown'])

# Step 6: Write data to the variable
precip_var[:] = [precip_dict[k] for k in ['None', 'Rain', 'Snow', 'Unknown', 'Sleet']]

# Step 7: Close the file
nc.close()
# Reopen the file, read and print the data
nc = Dataset('data/weather_data.nc', 'r')
precip_var = nc.variables['precipitation']
# Print the Enum dictionary
print("Enum dictionary:", precip_var.datatype.enum_dict)
# Print the data stored in the variable
print("Precipitation data:", precip_var[:])

# Close the file
nc.close()

Enum dictionary: {'None': 0, 'Rain': 1, 'Snow': 2, 'Sleet': 3, 'Hail': 4, 'Unknown': 255}
Precipitation data: [0 1 2 -- 3]
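
The `--` in the output marks the entry equal to the fill value, which is returned as masked. To turn the numeric codes back into names, the enum dictionary can be inverted. The sketch below uses numpy only and builds the masked array by hand (an assumption standing in for what was read from the file); `code_to_name`, `raw`, `masked` and `labels` are illustrative names.

```python
import numpy as np

precip_dict = {"None": 0, "Rain": 1, "Snow": 2,
               "Sleet": 3, "Hail": 4, "Unknown": 255}

# Invert the mapping: numeric code -> name
code_to_name = {code: name for name, code in precip_dict.items()}

# Stand-in for the data read from the file: 255 is the fill value
raw = np.array([0, 1, 2, 255, 3], dtype=np.uint8)
masked = np.ma.masked_equal(raw, precip_dict["Unknown"])

# Replace masked entries with the 'Unknown' code, then look up each name
codes = masked.filled(precip_dict["Unknown"])
labels = [code_to_name[int(code)] for code in codes]
print(labels)  # ['None', 'Rain', 'Snow', 'Unknown', 'Sleet']
```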
