In [1]:
import pandas as pd
from pathlib import Path

In [2]:
filename = Path('data', 'Heitronics-Apogee_Comparison_irt60s.dat')

### Here are the first 6 lines of the file:

"TOA5","Heitronics-Apogee Comparison","CR1000X","9292","CR1000X.Std.03.01","CPU:Apogee-Heitronics-Analog.CR1X","46927","irt60s"

"TIMESTAMP","RECORD","sfc_ir_temp_Avg","sfc_ir_temp_Std","sfc_a_ir_temp_Avg","sfc_a_ir_temp_Std","logger_temp","logger_volt","logger_LiBat"

"TS","RN","Kelvin","Kelvin","Kelvin","Kelvin","Celsius","Volt","Volt"

"","","Avg","Std","Avg","Std","Smp","Smp","Smp"

"2022-07-15 18:57:00",0,318.3,0.062,318.1,0.342,26.34,13.64,3.93

"2022-07-15 18:58:00",1,318.1,0.347,317.7,1.452,26.38,13.64,3.931

We can first read the file with no keywords and see what we get. Pandas does the best it can under its default assumptions, which is not bad for knowing nothing about the data file. But there are some issues: the first line is treated as the column header, so the real column names, the units, and the processing type all end up as rows of data. We need to work out which keywords to use and what values to give them.

In [3]:
df = pd.read_csv(filename)
df[:5]

Unnamed: 0,TOA5,Heitronics-Apogee Comparison,CR1000X,9292,CR1000X.Std.03.01,CPU:Apogee-Heitronics-Analog.CR1X,46927,irt60s
TIMESTAMP,RECORD,sfc_ir_temp_Avg,sfc_ir_temp_Std,sfc_a_ir_temp_Avg,sfc_a_ir_temp_Std,logger_temp,logger_volt,logger_LiBat
TS,RN,Kelvin,Kelvin,Kelvin,Kelvin,Celsius,Volt,Volt
,,Avg,Std,Avg,Std,Smp,Smp,Smp
2022-07-15 18:57:00,0,318.3,0.062,318.1,0.342,26.34,13.64,3.93
2022-07-15 18:58:00,1,318.1,0.347,317.7,1.452,26.38,13.64,3.931


- Even though the delimiter was correctly assumed to be a comma, we should be explicit.
- We should indicate which rows to skip. Some rows hold metadata about the values rather than data, and we can skip over them.
- There is a column of dates, and we can have Pandas do the work of parsing it for us.
- We want to index on time, and we can set that up during the read.
- We can indicate which row contains the header names. Notice that skiprows is applied first: the header is the second row in the file, but once the skipped rows are removed it is the first remaining row, so we pass header=0.

In [4]:
df = pd.read_csv(filename, delimiter=',', skiprows=[0, 2, 3], header=0, parse_dates=[0], index_col=[0])
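
To see how skiprows and header interact without needing the data file, here is a minimal sketch that reads a small in-memory sample. The sample text and the names `sample` and `demo` are made up for illustration. They only mimic the layout of the real file, with fewer columns.

```python
import io

import pandas as pd

# Hypothetical in-memory sample: four header rows followed by two data rows
sample = io.StringIO(
    '"TOA5","Station","CR1000X"\n'
    '"TIMESTAMP","RECORD","sfc_ir_temp_Avg"\n'
    '"TS","RN","Kelvin"\n'
    '"","","Avg"\n'
    '"2022-07-15 18:57:00",0,318.3\n'
    '"2022-07-15 18:58:00",1,318.1\n'
)

# Rows 0, 2 and 3 are dropped first, so the column names become row 0 of what is left
demo = pd.read_csv(sample, delimiter=',', skiprows=[0, 2, 3], header=0,
                   parse_dates=[0], index_col=[0])

print(demo.columns.tolist())  # Column names, taken from the second row of the sample
print(demo.index.dtype)       # The index should now hold parsed datetimes
```

If the skipped rows were left in, the units and processing rows would be read as data and the numeric columns would come back as strings.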

Since Pandas does not have a native place to hold the metadata, we can convert the Pandas DataFrame to an Xarray Dataset and add the metadata to the DataArrays.

Before we do that, we can clean up the index column name. The CF conventions indicate that the dimension name and the coordinate name for the time dimension should match. This allows Xarray tools to work correctly. The suggested name to use is 'time'.

In [5]:
df.index.name = 'time'  # Set the index name to 'time'
ds = df.to_xarray()  # Convert from Pandas DataFrame to Xarray Dataset
ds

Since the units are in the header of the .dat file, and the first header line supplies the text this notebook uses for the long names, we can read the .dat file as plain text and apply those values to the DataArray attributes.

In [6]:
# Use a with block (context manager) so the file is closed automatically when we leave the block
with open(filename, 'r') as file_handle:  # Open the file and assign the file handle
    for ii, line in enumerate(file_handle):  # Loop over and read each line of the file
        line = line.rstrip()  # Strip trailing whitespace and the newline from the line
        if ii == 0:  # Read the first line
            # Split into a list of strings and remove the surrounding double quotes
            long_name = [value.strip('"') for value in line.split(',')]
        elif ii == 2:  # Read the third line
            # Split into a list of strings and remove the surrounding double quotes
            units = [value.strip('"') for value in line.split(',')]
        elif ii > 2:  # After we get the header information, exit the loop
            break

In [7]:
units = units[1:]  # The first entry is the TIMESTAMP unit. TIMESTAMP is now the index, so drop that entry.

In [8]:
for ii, var_name in enumerate(ds.data_vars):  # Loop over the variable names in the Dataset
    ds[var_name].attrs['long_name'] = long_name[ii]  # Set the long_name attribute
    ds[var_name].attrs['units'] = units[ii]  # Set the units attribute
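
Splitting on commas by hand works for this file, but it would break if a quoted field ever contained a comma. Note also that in the sample shown earlier the first header line holds file-level information (format name, station name, logger model and so on) rather than one entry per column. As one possible alternative, here is a minimal sketch that uses the standard library csv module to read the four header rows and pair each column name with its units. The sample text and the names `sample`, `environment`, `units_by_name` and `processing_by_name` are assumptions for illustration, not part of the original notebook.

```python
import csv
import io

# Hypothetical in-memory sample that mimics the four header rows of the file
sample = io.StringIO(
    '"TOA5","Station","CR1000X"\n'
    '"TIMESTAMP","RECORD","sfc_ir_temp_Avg"\n'
    '"TS","RN","Kelvin"\n'
    '"","","Avg"\n'
    '"2022-07-15 18:57:00",0,318.3\n'
)

reader = csv.reader(sample)  # csv.reader handles the quoting for us
environment = next(reader)   # Row 1: file-level information
names = next(reader)         # Row 2: column names
units = next(reader)         # Row 3: units for each column
processing = next(reader)    # Row 4: processing type for each column

# Pair by column name instead of by position, so dropping a column cannot shift the lookup
units_by_name = dict(zip(names, units))
processing_by_name = dict(zip(names, processing))

print(units_by_name['sfc_ir_temp_Avg'])
print(environment)
```

With a lookup keyed by name, the attributes could be set with something like `ds[var_name].attrs['units'] = units_by_name[var_name]`, and the file-level row could be kept in `ds.attrs` instead of on individual variables.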

In [9]:
ds['sfc_ir_temp_Avg']

The RECORD column is not needed, so we can drop it from the Dataset. By waiting until the end of this process we do not need to accommodate the RECORD column in our loops. Often it is easier to do more work and then clean up at the end to ensure everything lines up correctly. Now we have an Xarray Dataset with correctly formatted coordinates, variables and metadata. We can go do great things now.

In [10]:
del ds['RECORD']
ds
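
Using del changes the Dataset in place. If you would rather keep the original and get back a new Dataset without the variable, Xarray also provides the drop_vars method. Here is a minimal sketch on a small made-up Dataset. The names `demo_ds` and `trimmed` and the values are for illustration only.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical two-sample Dataset with the same kind of variables as above
time = pd.date_range('2022-07-15 18:57:00', periods=2, freq='min')
demo_ds = xr.Dataset(
    {
        'RECORD': ('time', np.array([0, 1])),
        'sfc_ir_temp_Avg': ('time', np.array([318.3, 318.1])),
    },
    coords={'time': time},
)

# drop_vars returns a new Dataset and leaves demo_ds unchanged
trimmed = demo_ds.drop_vars('RECORD')

print(list(trimmed.data_vars))
print(list(demo_ds.data_vars))
```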