Attempting to read in the instrument data

In [14]:
from pathlib import Path
import xarray as xr
import numpy as np
import pandas as pd

In [None]:
# set file path to netcdf files
PATH = Path('/gws/nopw/j04/iecdt/computer-vision-data/cloudnet-vertical-profile-data/')

In [None]:
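# use xarray to read all of the netCDF files (file extension .nc)
# sort the paths so the files are concatenated along time in a deterministic (filename) order,
# since glob() returns them in arbitrary order
data = xr.open_mfdataset(sorted(PATH.glob('*.nc')), combine='nested', join='left', concat_dim='time')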

print(data)


<xarray.Dataset> Size: 445MB
Dimensions:                 (time: 60326, height: 459)
Coordinates:
  * height                  (height) float32 2kB 114.5 144.5 ... 1.384e+04
  * time                    (time) datetime64[ns] 483kB 2023-08-07T00:00:15.0...
Data variables:
    target_classification   (time, height) float64 222MB dask.array<chunksize=(2874, 459), meta=np.ndarray>
    detection_status        (time, height) float64 222MB dask.array<chunksize=(2874, 459), meta=np.ndarray>
    cloud_base_height_amsl  (time) float32 241kB dask.array<chunksize=(2874,), meta=np.ndarray>
    cloud_top_height_amsl   (time) float32 241kB dask.array<chunksize=(2874,), meta=np.ndarray>
    cloud_base_height_agl   (time) float32 241kB dask.array<chunksize=(2874,), meta=np.ndarray>
    cloud_top_height_agl    (time) float32 241kB dask.array<chunksize=(2874,), meta=np.ndarray>
    altitude                (time) float32 241kB 85.0 85.0 85.0 ... 85.0 85.0
    latitude                (time) float32 241kB 51.1

# Goals of this section #

The netCDF data contains an array, target_classification, that we will use as our ground truth (GT). This array
assigns one of 10 classes, each describing a type of atmospheric target, to every height level at each timestep.

Instrument Data Processing:

• Group the 10 original classes into two categories:

– No hydrometeors (clear sky): None, Aerosols & insects, Insects, Aerosols.

– Hydrometeors (cloud present): Melting & droplets, Melting ice, Ice & droplets, Ice, Drizzle &
droplets, Drizzle or rain, Droplets.


• For each timestamp, process the vertical profile to form a 459 × 2 vector. Here, each of the 459 discrete
height levels (ranging approximately from 100 m to 14 km) is assigned a pair of class probabilities
indicating the presence (hydrometeors) or absence (clear sky) of clouds.
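
The two goals above can be sketched on a toy profile as follows. This is a minimal illustration, not the notebook's final method: it assumes the classes are stored as integer codes where 0 and values above 7 mean clear sky (the same rule the notebook applies later), and the helper name `to_class_probabilities` and the column order (clear sky first, hydrometeors second) are hypothetical choices.

```python
import numpy as np

def to_class_probabilities(target_classification):
    """Map class codes to a (..., 2) array of [clear sky, hydrometeors] pairs."""
    codes = np.asarray(target_classification)
    # Assumed rule: code 0 and codes above 7 are clear sky, everything else is cloud
    clear_sky = (codes == 0) | (codes > 7)
    hydrometeors = ~clear_sky
    # Stack along a new last axis so each height level gets a pair of probabilities
    return np.stack([clear_sky, hydrometeors], axis=-1).astype(np.float32)

# Toy vertical profile with 6 height levels instead of 459
profile = np.array([0, 1, 4, 8, 10, 7])
probs = to_class_probabilities(profile)
print(probs.shape)  # one row per height level, two columns
```

Because the labels are hard classifications, each pair is one-hot: exactly one of the two entries is 1. Applied to a full (time, height) array, the same function would return an array of shape (time, 459, 2).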

In [None]:
# inspect data
data.target_classification

              Array                          Chunk
Bytes         211.26 MiB                     10.09 MiB
Shape         (60326, 459)                   (2880, 459)
Dask graph    21 chunks in 69 graph layers
Data type     float64 numpy.ndarray


In [None]:
# assign whether a target classification is a hydrometeor or not

# if target_classification is 0 or greater than 7, it is 'No Hydrometeors' (value 0);
# otherwise it is 'Hydrometeors' (value 1).
# store the result in a new variable in data called Hydrometeors
data['Hydrometeors'] = xr.where((data.target_classification == 0) | (data.target_classification > 7), 0, 1)

# view the values of the hydrometeor data to check that it worked
data['Hydrometeors'].load()
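
One thing worth checking here: target_classification is stored as float64, so it may contain NaN for missing samples (an assumption, not something the output above confirms). Both comparisons in the rule evaluate to False for NaN, so such samples would silently be labelled as hydrometeors. The sketch below reproduces the effect with plain NumPy and flags invalid samples with a hypothetical sentinel value of -1.

```python
import numpy as np

# Toy class codes including one missing value
codes = np.array([0.0, 3.0, np.nan, 9.0])

# Same rule as in the notebook: 0 or > 7 means no hydrometeors (0), else hydrometeors (1)
naive = np.where((codes == 0) | (codes > 7), 0, 1)

# NaN fails both comparisons, so it falls through to the "hydrometeors" branch.
# Mask those samples explicitly; -1 is an illustrative sentinel, not a Cloudnet value.
valid = ~np.isnan(codes)
masked = np.where(valid, naive, -1)

print(naive)
print(masked)
```

If `np.isnan(data.target_classification).any()` turns out to be False for this dataset, the extra masking step is unnecessary.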

In [None]:
# view values for target classification to compare to the above hydrometeors to check if they have been classified correctly
data.target_classification.load()

In [122]:
#Task is... For each timestamp, process the vertical profile to form a 459 × 2 vector. Here, each of the 459 discrete
#height levels (ranging approximately from 100 m to 14 km) is assigned a pair of class probabilities
#indicating the presence (hydrometeors) or absence (clear sky) of clouds

# subset the data to keep only the Hydrometeors variable together with its time and height coordinates
data_subset = data[['time', 'height', 'Hydrometeors']]
data_subset.load()


In [None]:
# set file path for images from camera A
A_DPATH = Path('/gws/nopw/j04/iecdt/computer-vision-data/cam_a/rectified_imgs')

In [None]:
# read the image file names from camera A and convert them to datetimes
# (the file stems are taken to be Unix timestamps in seconds, as unit='s' implies)
times = [float(file.stem) for file in A_DPATH.glob('*.png')]
times_datetime = pd.to_datetime(times, unit='s')
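
As a quick sanity check of the conversion, the sketch below parses two made-up file names. The names are hypothetical and only assume that each stem is a Unix timestamp in seconds; sorting is optional and is done here only to make the result chronological.

```python
from pathlib import Path
import pandas as pd

# Hypothetical file names; the real directory listing is not reproduced here
filenames = [Path("1691366430.png"), Path("1691366415.png")]

# Convert each stem to a number, sort chronologically, then to datetimes
stamps = sorted(float(p.stem) for p in filenames)
times = pd.to_datetime(stamps, unit="s")

print(times)
```

The earlier of the two stems corresponds to 2023-08-07 00:00:15 UTC, which matches the first timestamp shown in the dataset summary above.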

In [None]:
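# align the times from the images to the times in the data
tol = pd.Timedelta('5s')
# for each image time, pick the profile with the nearest timestamp
aligned_times = data_subset.sel(time=times_datetime, method='nearest')
# keep only the matches whose absolute time offset is within the tolerance
time_offset = np.abs(aligned_times.time.values - times_datetime.values)
times_within_tol = time_offset <= tol.to_timedelta64()
data_aligned = aligned_times.isel(time=times_within_tol)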

We keep only the image timestamps that fall within 5 seconds of a ground-truth measurement. This drops approximately one day of measurements, which we believe is the last day, but still leaves over 13,000 measurements.
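
The 5-second rule can be illustrated on toy timestamps with pandas alone. This is a sketch under assumed values: the profile and image times below are invented, and `get_indexer` with `method="nearest"` and a `tolerance` is used here as one possible way to express the matching, not necessarily how the notebook does it.

```python
import pandas as pd

# Toy profile timestamps every 30 s (invented values)
profile_times = pd.date_range("2023-08-07 00:00:15", periods=4, freq="30s")

# Toy image timestamps: one close to a profile, two too far away
image_times = pd.to_datetime([
    "2023-08-07 00:00:17",  # 2 s from the first profile
    "2023-08-07 00:00:58",  # 13 s from the nearest profile
    "2023-08-07 00:10:00",  # long after the last profile
])

tol = pd.Timedelta("5s")

# Position of the nearest profile for each image, or -1 if none is within tol
idx = profile_times.get_indexer(image_times, method="nearest", tolerance=tol)
keep = idx >= 0

matched_images = image_times[keep]
matched_profiles = profile_times[idx[keep]]
print(idx)
```

Images whose index is -1 have no profile within the tolerance and are dropped, which is the same effect as losing the unmatched day of measurements described above.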

In [None]:
# save the aligned times, with their respective heights and hydrometeor values, to a netCDF file
data_aligned.to_netcdf(PATH / '../../JERMIT_the_frog/hydrometeors_time_aligned_classes.nc')