## Basic structure

Input files arrive in disparate formats. The goal is to parse them into one standardized format.

* baseMethods.py contains a dataclass called **genericLoggerFile**, which is the template for standardizing data and metadata across file types.
* All methods that parse a given file type inherit from this base dataclass.

In [1]:
import importlib
import baseMethods
importlib.reload(baseMethods)

# the .__dataclass_fields__ attribute maps each field name in a dataclass to a Field object holding metadata such as its default value
baseMethods.genericLoggerFile(sourceFile='some/path/to/a/file').__dataclass_fields__

{'sourceFile': Field(name='sourceFile',type=<class 'str'>,default=<dataclasses._MISSING_TYPE object at 0x0000029343E95280>,default_factory=<dataclasses._MISSING_TYPE object at 0x0000029343E95280>,init=True,repr=False,hash=None,compare=True,metadata=mappingproxy({}),kw_only=True,_field_type=_FIELD),
 'timezone': Field(name='timezone',type=<class 'str'>,default='UTC',default_factory=<dataclasses._MISSING_TYPE object at 0x0000029343E95280>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=True,_field_type=_FIELD),
 'fileType': Field(name='fileType',type=<class 'str'>,default=None,default_factory=<dataclasses._MISSING_TYPE object at 0x0000029343E95280>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=True,_field_type=_FIELD),
 'frequency': Field(name='frequency',type=<class 'str'>,default=None,default_factory=<dataclasses._MISSING_TYPE object at 0x0000029343E95280>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw

In [2]:
import parseCSI
importlib.reload(parseCSI)


file = r'example_data\TOA5_BBS.FLUX_2023_08_01_1530.dat'

parsedTOA5_example = parseCSI.parseTOA5(sourceFile=file)

# The parseTOA5 dataclass has its own fields plus all fields it inherits from the base dataclass (genericLoggerFile)
parsedTOA5_example.__dataclass_fields__

{'sourceFile': Field(name='sourceFile',type=<class 'str'>,default=<dataclasses._MISSING_TYPE object at 0x0000029343E95280>,default_factory=<dataclasses._MISSING_TYPE object at 0x0000029343E95280>,init=True,repr=False,hash=None,compare=True,metadata=mappingproxy({}),kw_only=True,_field_type=_FIELD),
 'timezone': Field(name='timezone',type=<class 'str'>,default='UTC',default_factory=<dataclasses._MISSING_TYPE object at 0x0000029343E95280>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=True,_field_type=_FIELD),
 'fileType': Field(name='fileType',type=<class 'str'>,default=None,default_factory=<dataclasses._MISSING_TYPE object at 0x0000029343E95280>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=True,_field_type=_FIELD),
 'frequency': Field(name='frequency',type=<class 'str'>,default=None,default_factory=<dataclasses._MISSING_TYPE object at 0x0000029343E95280>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw

### Looking inside a parsed input file

Every processed logger file includes these basic components:

* A pandas DataFrame with a datetime index containing all data that can be read from the raw data file
    * timezone and frequency are metadata provided as separate fields in the object
* A variableMap
    * A mapping of each column in the DataFrame to its metadata. Some of these metadata can be parsed automatically from the file, depending on the file type; otherwise they can be provided by the user (see the second code example below). The metadata include:
        * Original column name (spaces and special characters are replaced with _)
        * Units & data type
        * Sensor information 
        * Variable description
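
As a rough illustration of how such a per-column mapping could be assembled, here is a minimal sketch using only the standard library. The helper names `safe_name` and `build_variable_map` are hypothetical, the default values are assumptions, and so is the precedence (user-provided over parsed over default); the actual parsers may behave differently.

```python
import re

# Default per-column metadata. The keys mirror those printed in this notebook;
# the default values themselves are assumptions.
DEFAULTS = {'ignore': False, 'instrument': None, 'unit': None,
            'dtype': None, 'variableDescription': ''}

def safe_name(name):
    # Replace every character other than a letter, digit, or "_" with "_"
    return re.sub(r'[^0-9A-Za-z_]', '_', name)

def build_variable_map(columns, parsed=None, user=None):
    parsed = parsed or {}
    user = user or {}
    variable_map = {}
    for original in columns:
        key = safe_name(original)
        entry = {'originalName': original, **DEFAULTS}
        entry.update(parsed.get(key, {}))  # values read from the file header
        entry.update(user.get(key, {}))    # user-provided values take priority
        variable_map[key] = entry
    return variable_map

example = build_variable_map(
    columns=['Ux', 'air temp'],  # 'air temp' is a made-up column name
    parsed={'Ux': {'unit': 'm/s', 'dtype': '<f8'}},
    user={'air_temp': {'unit': 'C', 'variableDescription': 'Air temperature'}},
)
for key, value in example.items():
    print(key, value)
```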

In [3]:
# the __dict__ attribute holds the names of the fields in a dataclass instance and their values
for key,value in parsedTOA5_example.__dict__.items():
    print(f'{key}: {value}')

sourceFile: example_data\TOA5_BBS.FLUX_2023_08_01_1530.dat
timezone: UTC
fileType: TOA5
frequency: 0.1s
timestampName: TIMESTAMP
fileTimestamp: 2023_08_01_1530
variableMap: {'RECORD': {'originalName': 'RECORD', 'ignore': False, 'instrument': None, 'unit': 'RN', 'dtype': '<i8', 'variableDescription': ''}, 'Ux': {'originalName': 'Ux', 'ignore': False, 'instrument': None, 'unit': 'm/s', 'dtype': '<f8', 'variableDescription': 'Smp'}, 'Uy': {'originalName': 'Uy', 'ignore': False, 'instrument': None, 'unit': 'm/s', 'dtype': '<f8', 'variableDescription': 'Smp'}, 'Uz': {'originalName': 'Uz', 'ignore': False, 'instrument': None, 'unit': 'm/s', 'dtype': '<f8', 'variableDescription': 'Smp'}, 'Ts': {'originalName': 'Ts', 'ignore': False, 'instrument': None, 'unit': 'C', 'dtype': '<f8', 'variableDescription': 'Smp'}, 'sonic_diag': {'originalName': 'sonic_diag', 'ignore': False, 'instrument': None, 'unit': 'arb', 'dtype': '<i8', 'variableDescription': 'Smp'}, 'CO2': {'originalName': 'CO2', 'ignore':

### Defining the variableMap

* The variableMap is a dictionary with a set of fields that are applied on a per-column (variable) basis. They can be user defined or parsed automatically from the file where possible; if all else fails, they take the default values defined for _variableMap.

* Say you want to overwrite the default metadata values for the "sonic_diag" column in example_data/TOA5_BBS.FLUX_2023_08_01_1530.dat because it is actually a counter variable that was mislabeled in the code. We want to ignore the variable in further processing and add a note to explain why. We can define a variableMap entry for sonic_diag while leaving all other information to be parsed automatically or set to the default.

* The variableMap for a given input file type for a given site/logger can then be saved as a YAML file so the metadata are in an easy-to-read format. We still need to put some thought into handling time-dependent updates to the metadata. Where the metadata are common between timestamps, we can minimize storage by having one YAML file per time block. When the metadata change, we need a new file (either full, or partial and documenting only the changes; I think full is more readable even if it takes more space). We then need a time-dependent index of the metadata files so things can be matched up correctly.
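
One possible shape for the time-dependent index mentioned in the last bullet is a sorted list of (start time, metadata file) pairs searched with `bisect`. This is only a sketch of the idea, not a settled design: the index layout, the YAML file names, and the function `metadata_file_for` are all hypothetical. The timestamp format is taken from the `fileTimestamp` field printed earlier.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical index: (start of validity, metadata file), sorted by start time.
# Each file is assumed to hold a full variableMap valid until the next entry.
metadata_index = [
    (datetime(2023, 1, 1), 'variableMap_2023-01-01.yml'),
    (datetime(2023, 8, 1, 12), 'variableMap_2023-08-01.yml'),
]

def metadata_file_for(timestamp, index):
    # Find the last entry whose start time is at or before the timestamp
    starts = [start for start, _ in index]
    pos = bisect_right(starts, timestamp) - 1
    if pos < 0:
        raise LookupError(f'no metadata defined at or before {timestamp}')
    return index[pos][1]

# fileTimestamp of the example file, in the format shown earlier
file_time = datetime.strptime('2023_08_01_1530', '%Y_%m_%d_%H%M')
selected = metadata_file_for(file_time, metadata_index)
print(selected)
```

Storing a full variableMap per time block keeps each lookup to a single file read, which fits the preference for readability over storage noted above.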

In [None]:
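parsedTOA5_example_user_defined_varMap = parseCSI.parseTOA5(
    sourceFile=file,
    # only sonic_diag is overridden; all other columns are parsed automatically or use the defaults
    variableMap={'sonic_diag': {
        'variableDescription': 'The diagnostic variable is corrupted and should be ignored',
        'ignore': True,
        'unit': 'unitless',  # the key is 'unit', matching the variableMap fields printed above
    }})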

import yaml
print(yaml.safe_dump(parsedTOA5_example_user_defined_varMap.variableMap,sort_keys=False))

RECORD:
  originalName: RECORD
  ignore: false
  instrument: null
  unit: RN
  dtype: <i8
  variableDescription: ''
Ux:
  originalName: Ux
  ignore: false
  instrument: null
  unit: m/s
  dtype: <f8
  variableDescription: Smp
Uy:
  originalName: Uy
  ignore: false
  instrument: null
  unit: m/s
  dtype: <f8
  variableDescription: Smp
Uz:
  originalName: Uz
  ignore: false
  instrument: null
  unit: m/s
  dtype: <f8
  variableDescription: Smp
Ts:
  originalName: Ts
  ignore: false
  instrument: null
  unit: C
  dtype: <f8
  variableDescription: Smp
sonic_diag:
  originalName: sonic_diag
  ignore: false
  instrument: null
  unit: arb
  dtype: <i8
  variableDescription: Smp
CO2:
  originalName: CO2
  ignore: false
  instrument: null
  unit: mg/m3
  dtype: <f8
  variableDescription: Smp
H2O:
  originalName: H2O
  ignore: false
  instrument: null
  unit: mg/m3
  dtype: <f8
  variableDescription: Smp
press:
  originalName: press
  ignore: false
  instrument: null
  unit: kPa
  dtype: <f8
  v