# Exercise 8: NetCDF4 Advanced

## Aim: Introduce more advanced uses of the netCDF4 library in Python to read and create NetCDF4 Files.

Find the teaching material here: https://unidata.github.io/netcdf4-python/

### Issues covered:
- Working with time coordinates
- Multi-file datasets
- Compression of variables
- Compound data types
- Variable-length data types
- Enum data type

## Time-coordinates

Most metadata standards specify that time should be measured relative to a fixed date, with units such as `hours since YYYY-MM-DD hh:mm:ss`. We can convert values to and from calendar dates using `num2date` and `date2num` from the `cftime` library. Two other helpful tools are the `datetime` and `timedelta` classes from Python's standard `datetime` module.
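
To make the "units since a reference date" idea concrete, here is a minimal sketch that does the conversion by hand using only the standard library. It is not how `cftime` is implemented; it only illustrates the arithmetic for an ordinary calendar. The reference date and the 12-hour spacing are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

# Reference date for a hypothetical units string
# "hours since 2000-01-01 00:00:00" (chosen for illustration only)
reference = datetime(2000, 1, 1)

# Three example timestamps, 12 hours apart
stamps = [reference + timedelta(hours=12 * i) for i in range(3)]

# Encode: elapsed time since the reference, expressed in hours
encoded = [(t - reference) / timedelta(hours=1) for t in stamps]

# Decode: add the elapsed hours back onto the reference date
decoded = [reference + timedelta(hours=h) for h in encoded]

print(encoded)
print(decoded == stamps)
```

`date2num` and `num2date` perform the same kind of conversion, but also parse the units string for you and support non-standard calendars.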

Q1. 
- Let's generate a list of date and time values: create a list called `dates` containing `datetime` objects, starting from January 1st 2022 at 00:00 and incrementing by 6 hours, for a total of 5 entries.
- Use `date2num` to convert your list of dates to numeric values using `units="hours since 2022-01-01 00:00:00"` and `calendar="gregorian"`. Store these in an array called `times`.
- Print the `times` values to confirm the numeric representation.
- Use `num2date` to convert `times` back to datetime objects using the same units and calendar. Store these in a list called `converted_dates`.
- Print the converted dates to verify they match the original dates list. 

## Multi-file datasets

Q2. Let's create multiple netCDF files with a shared variable and unlimited dimension, and use `MFDataset` to read the aggregated data as if it were contained in a single file.
- Create 5 netCDF files named `data/datafile0.nc` through to `data/datafile4.nc`. Each file should contain:
    - A single unlimited dimension named `time`.
    - A variable named `temperature` with 10 integer values ranging from `file_index * 10` to `(file_index+1) * 10 - 1`.
    - Ensure each file is saved in the `NETCDF4_CLASSIC` format.
    - **Hint: Use a loop such as `for .. in range(..):` to do this task.**
- Using `MFDataset`, read all the `temperature` data from the 5 files at once by specifying the wildcard string `data/datafile*.nc`, and store the result in a variable `f`. Assign the data to a new variable using `temperature_data = f.variables["temperature"][:]`.
- Print the aggregated `temperature` values to verify that they span from 0 to 49.

## Compression of variables

Q3. Let's explore various compression options available in netCDF. 
- Create an array of random temperature data using the following (this needs `import numpy as np`):
   ```
   time_dim, level_dim, lat_dim, lon_dim = 10, 5, 50, 100
   data = np.random.rand(time_dim, level_dim, lat_dim, lon_dim) * 30 + 273.15
   ```
- Define a function that writes a NetCDF file with the given compression settings and reports the resulting file size:

  ```
  import os
  from netCDF4 import Dataset

  def create_netcdf(file_path, compression=None, least_significant_digit=None, significant_digits=None):
      with Dataset(file_path, "w", format="NETCDF4") as rootgrp:
          # Create dimensions
          rootgrp.createDimension("time", time_dim)
          rootgrp.createDimension("level", level_dim)
          rootgrp.createDimension("lat", lat_dim)
          rootgrp.createDimension("lon", lon_dim)
          # Define variable with compression settings
          temp = rootgrp.createVariable(
              "temp", "f4", ("time", "level", "lat", "lon"),
              compression=compression,
              least_significant_digit=least_significant_digit,
              significant_digits=significant_digits,
          )
          # Assign data to the variable
          temp[:] = data
      # The file is closed on leaving the `with` block, so its size on disk is final
      print(f"File size with compression={compression}, "
            f"least_significant_digit={least_significant_digit}, "
            f"significant_digits={significant_digits}: {os.path.getsize(file_path) / 1024:.2f} kB")
  ```
- Use this function to test the following cases:
    - First, create the temperature variable without compression and observe the file size. Use the file path `data/temperature_data_no_compress.nc`.
    - Then, enable zlib compression (`compression="zlib"`) and observe the change in file size. Use the file path `data/temperature_data_zlib.nc`.
    - Next, add zlib and least significant digit quantization (`least_significant_digit=3`) and check the file size again. Use the file path `data/temperature_data_zlib_lsd.nc`.
    - Finally, add zlib and significant digits quantization (`significant_digits=4`) and check the file size again. Use the file path `data/temperature_data_zlib_sig.nc`.
    - Hint: call the function using `create_netcdf(file_path, compression, least_significant_digit, significant_digits)`. The compression and quantization arguments all default to `None`, so you can omit any you don't need. Passing them by keyword (for example `significant_digits=4`) avoids mixing up their order.
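
To see why quantization helps the compressor, the sketch below mimics the idea outside of netCDF, using only NumPy and the standard `zlib` module. It assumes the rounding scheme described in the netCDF4-python documentation for `least_significant_digit` (round to the nearest multiple of a power of two); the array size and random seed are arbitrary choices, and the byte counts will not match what you see in the NetCDF files.

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)
raw = (rng.random(10_000) * 30 + 273.15).astype(np.float32)

# Keep roughly 3 decimal places: round to the nearest multiple of 1/scale,
# where scale is the smallest power of two that is at least 10**3
least_significant_digit = 3
bits = int(np.ceil(np.log2(10.0 ** least_significant_digit)))
scale = 2.0 ** bits
quantized = (np.around(scale * raw) / scale).astype(np.float32)

# Compress the raw bytes of each array and compare the sizes
raw_size = len(zlib.compress(raw.tobytes()))
quantized_size = len(zlib.compress(quantized.tobytes()))

print(f"raw: {raw_size} bytes, quantized: {quantized_size} bytes")
print(f"largest rounding error: {np.abs(quantized - raw).max():.6f}")
```

Quantization zeroes the trailing bits of each value, so the byte stream becomes more repetitive and compresses better, at the cost of a small, bounded rounding error.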

## Compound data types

Q4. Let's work with compound data types and structured arrays.
- Create a netCDF file called `data/vectors.nc` in write mode with the `NETCDF4` format, and assign the open dataset to the variable `f`.
- Define a compound data type that represents a 3D vector. Each vector should have 3 components:
    - `x`: a `float32` representing the x-coordinate
    - `y`: a `float32` representing the y-coordinate
    - `z`: a `float32` representing the z-coordinate
    - Hint: use `np.dtype([('x', type), ('y'..), (...)])` to define x,y,z then `f.createCompoundType()` to create the compound data type.
- Create a dimension named `num_vectors` to store an unlimited number of vectors.
- Create a variable called `vector_data` in the file, using the compound data type and the `num_vectors` dimension defined above.
- Generate a numpy structured array with 5 sample 3D vectors:
    - Each vector should have random values for `x`, `y` and `z` components (use `np.random.rand(num_samples)`).
    - Store these in the structured array (initialize the array with `np.empty(num_samples, dtype)`, then use `data["x"]` etc. to assign the data).
    - Write them to the netCDF variable.
- Close the file and then reopen it in read mode.
- Read the data back into a new numpy structured array and print each vector.
    - Hint: Use `vector_var = f.variables['vector_data']` to get the variable from the file.
    - Hint: Use `data_in = vector_var[:]` to extract the data from the variable.
    - Hint: Use `for i, vec in enumerate(data_in):` to loop through the data so you can print each vector.

## Variable-length data types

Q5. Let's create and manipulate variable-length (vlen) arrays.
- Create a netCDF file named `data/exercise_vlen.nc` in write mode, assigned to the variable `f`.
- Define dimensions:
    - Create a dimension `a` with a size of `5`.
    - Create a dimension `b` with a size of `4`.
- Create a variable-length data type named `my_vlen_int` using `f.createVLType()`, with `np.int32` as the base datatype.
- Use the vlen type you defined to create a variable `vlen_var` with dimensions `("a", "b")`.
- Populate `vlen_var` with random data:
    - Use the following to generate the random data (this needs `import random` and `import numpy as np`):
      ```
      data = np.empty((5,4), dtype=object)
      for i in range(5):
        for j in range(4):
          random_length = random.randint(2, 8)
          data[i,j] = np.random.randint(1, 101, size=random_length, dtype=np.int32)
      ```
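    - Assign the data to the netCDF variable.
- Create a new dimension `c` with a size of `7`.
- Define a variable-length string variable named `vlen_str_var` (using `str` as the datatype) along dimension `c`, and assign it to a Python variable called `str_var`.
- Populate this variable with random strings of lengths between 3 and 10, made up of uppercase and lowercase alphabetic characters, using the following (this needs `import string`):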
    ```
    chars = string.ascii_letters
    string_data = np.empty(7, dtype=object)
    for i in range(7):
        random_length = random.randint(3,10)
        string_data[i] = ''.join(random.choice(chars) for _ in range(random_length))
    # Assign the string data to the netCDF variable
    str_var[:] = string_data
    ```
- Print the contents of `vlen_var` and `vlen_str_var`. Print the structure of the netCDF4 file to show defined dimensions, variables, and data types.

## Enum data type

Q6. Let's create a netCDF file to store weather observation data, including an enumerated type representing different types of precipitation.
- Create a new netCDF file called `data/weather_data.nc` in write mode with the `NETCDF4` format.
- Create a Python dictionary `precip_dict` where:
    - `None` maps to `0`
    - `Rain` maps to `1`
    - `Snow` maps to `2`
    - `Sleet` maps to `3`
    - `Hail` maps to `4`
    - `Unknown` maps to `255`
- Use this dictionary to define an Enum data type named `precip_t`, with a base type of `np.uint8`, using `.createEnumType()`. Assign the result to a variable called `precip_type`.
- Define a dimension called `time` with an unlimited length for observations over time.
- Create a 1D variable named `precipitation` of type `precip_type` that uses the `time` dimension and has `fill_value=precip_dict['Unknown']`, and assign it to `precip_var`. The fill value indicates missing data.
- Write the following precipitation observations to the `precipitation` variable: `precip_var[:] = [precip_dict[k] for k in ['None', 'Rain', 'Snow', 'Unknown', 'Sleet']]`.
- Close the file and reopen it in read mode, then print the contents of the `precipitation` variable, including: the data values, confirming that they match the written values; and the enum dictionary associated with the enum data type, verifying the precipitation mapping.