# StickNet Data QC script - For EOL Preparation
- - - - -
Written by Jessica McDonald, April 2020

Based on code originally written by Tony Reinhart and edited by Aaron Hill
- - - - - - 
### FOR THE USER:

* This script requires data in the following format:

        /MainDirectory/
            deployment1/
                raw/
                    0111A_YYYYmmdd_HHMMSS.txt
                    0112A_YYYYmmdd_HHMMSS.txt
                    etc.
            deployment2/
                raw/
                    0222A_YYYYmmdd_HHMMSS.txt
                    0223A_YYYYmmdd_HHMMSS.txt
                    etc.
            etc.
    
    
* You need to pass the MainDirectory path to the "directory" variable.
  Check all of your data files before running. If any file is only 58 bytes in size, delete it. Such a file contains just a header and no data, and the script will raise an error if it tries to process it.
   
   
* This script will create a new directory: /MainDirectory/DeploymentName/reformat/
  and put all QC'd files there.


* This script must be in the same directory as a folder called "QC_flagged_data".
  If you want to change this, edit the "flagged_file" variable, which is the filepath to "QC_flagged_data/flagged_DeploymentName.txt". The flagged file holds every line of data that was flagged by the QC process, which can help determine whether there is a "problem sticknet".
  
 
* If something goes wrong in the middle of processing, you can move the data that was already processed out of the raw folder and run again. The flagged file will not be overwritten, because all new data is appended to the bottom. If you need to start over, delete the flagged file and the reformat directory, then run again.
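
A quick way to find header-only files before running the script is sketched below. This is a minimal sketch, not part of the original script: the helper name `find_header_only_files` is hypothetical, it assumes the MainDirectory/deployment/raw/ layout described above, and it assumes that any file at or below the 58-byte size quoted above holds no data.

```python
from pathlib import Path

# Size quoted above for a file that holds a header and no data.
HEADER_ONLY_BYTES = 58


def find_header_only_files(main_directory, size=HEADER_ONLY_BYTES):
    """Return raw StickNet files under main_directory/*/raw/ that are no larger than `size` bytes.

    Hypothetical helper: it only reports candidates, it does not delete anything.
    """
    raw_files = Path(main_directory).glob("*/raw/*.txt")
    return sorted(p for p in raw_files if p.stat().st_size <= size)


# Example usage (the path is a placeholder):
# for path in find_header_only_files("/MainDirectory"):
#     print("header only, consider deleting:", path)
```

Reporting the files instead of deleting them leaves the decision with the user, which matches the manual check described above.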


- - - - - -

#### This script performs several tasks:

1. Fixing Wind Direction: It replaces the header compass heading with the footer compass heading, which tends to be more accurate. If there is no footer, the header is kept. The raw WD values are based on the header heading, so if the footer heading differs, the difference (header minus footer) is subtracted from the original WD data and the result is wrapped back into the 0-360 range.


2. QCing the data: Each of the variables T, RH, P, and WS is checked for any "crazy" values. There are two flags: the TFLAG (for the thermo variables T, RH, and P) and the WFLAG (for WS). 0 = good, 1 = bad. A 10-pt boxcar is used to make a smoothed version of the data. The smoothed data is subtracted from the raw data, and if the absolute difference meets or exceeds the value set in qcvals, that data point is flagged. I believe these values are based on the sensor specs, but I'm not totally sure. They have been used for 10 years though, so I'm not changing them. The data, with its flags, is then written out into the reformat directory. *** UPDATE MAY 2020: the TFLAG is also raised if RH exceeds 100 % ***


3. Flagging the flagged data...: If a variable gets flagged, that row of data is written to a file in the QC_flagged_data directory. Each of the QC'd variables is written there along with the smoothed value it was compared against, and the flags are written there too. Each deployment directory in MainDirectory gets its own flagged data file, regardless of how many different sticknets are within it (each row of flagged data includes the file name, which contains the sticknet ID).
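
The wind-direction fix in task 1 can be illustrated on its own. The sketch below is a simplified stand-in for what the script does, under a few assumptions: the function name `correct_wind_direction` is hypothetical, and the headings 183.2 and 183.5 are taken from the sample output for probe 0221A. As in the script, a header heading of -999.0 is treated as a bad compass and the wind directions are left alone.

```python
import numpy as np


def correct_wind_direction(wd, header_heading, footer_heading, missing=-999.0):
    """Shift wind directions by the header-minus-footer heading difference.

    Hypothetical helper mirroring the arithmetic described in task 1.
    Results are wrapped back into the range [0, 360).
    """
    wd = np.asarray(wd, dtype=float)
    if header_heading == missing:
        # bad compass: the directions cannot be corrected
        return wd
    delta = np.round(header_heading - footer_heading, 1)
    return np.mod(wd - delta, 360.0)


# Headings from the 0221A sample output: delta = 183.2 - 183.5 = -0.3
corrected = correct_wind_direction([0.0, 180.0, 359.9],
                                   header_heading=183.2,
                                   footer_heading=183.5)
print(corrected)
```

With a delta of -0.3, each direction increases by 0.3 degrees, and the value that would pass 360 wraps around to a small positive angle.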

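
The QC test in task 2 can be tried on a small synthetic series. This is a minimal sketch under stated assumptions, not the script's own function: `add_qc_flags` is a hypothetical name, the thresholds are copied from the script's qcvals dictionary, the demo values are made up for illustration, and a plain unweighted rolling mean is used in place of the script's boxcar window (the two should agree away from the edges of the record).

```python
import pandas as pd

# Thresholds copied from the script's qcvals dictionary.
QCVALS = {"T": 0.5, "RH": 5.0, "P": 1.2, "WS": 4.0}


def add_qc_flags(data, qcvals=QCVALS, width=10):
    """Return a copy of `data` with TFLAG and WFLAG columns (0 = good, 1 = bad).

    Hypothetical helper: a point is flagged when it differs from its centered
    rolling mean by at least the threshold, or when RH exceeds 100 %.
    """
    out = data.copy()
    exceeds = {}
    for var, limit in qcvals.items():
        smooth = out[var].rolling(width, min_periods=1, center=True).mean()
        exceeds[var] = (out[var] - smooth).abs() >= limit
    thermo_bad = exceeds["T"] | exceeds["RH"] | exceeds["P"] | (out["RH"] > 100.0)
    out["TFLAG"] = thermo_bad.astype(int)
    out["WFLAG"] = exceeds["WS"].astype(int)
    return out


# Made-up demo data: steady values with three injected problems.
demo = pd.DataFrame({
    "T": [20.0] * 20,
    "RH": [99.0] * 20,
    "P": [985.0] * 20,
    "WS": [2.0] * 20,
})
demo.loc[10, "T"] = 23.0    # temperature spike
demo.loc[3, "RH"] = 100.5   # humidity above 100 %
demo.loc[15, "WS"] = 12.0   # wind speed spike

flagged = add_qc_flags(demo)
print(flagged[(flagged["TFLAG"] == 1) | (flagged["WFLAG"] == 1)])
```

In this demo the humidity value is too close to its neighbours to trip the smoothing test, so it is caught only by the RH > 100 rule, which is the case the May 2020 update was added for.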

In [5]:
import os
import linecache
import glob
import numpy as np
import pandas as pd


#######################################################
# USER EDIT HERE =>

directory = '/Users/jessmcd/Documents/other/20170430_RapidProbeData/20170501_masstest/'

#######################################################
#######################################################
#######################################################

#######################################################
#      FUNCTIONS
#######################################################

def update_header(snfile):
    """ 
    snfile:     a directory path that leads directly to the sticknet file
    
    update_header looks at header and footer and replaces header with footer if the probeID in header equals
    the footer ID (i.e., a footer was actually written). If there is no footer, then the header is used.
    Why use the footer? Typically the footer is more accurate than the header.
    
    A "reformat" directory next to raw is expected to exist already (the main loop creates it).
    The new header is written into a file with the same name as snfile in the reformat directory
    
    Returns the header compass heading for comparisons in update_wind function
    
    """
    # read in the data, pull out first and last lines
    with open(snfile, 'r') as rawdata:
        temp = rawdata.readlines()
    # strip line endings so the header and footer are handled the same way
    header = temp[0].rstrip("\r\n")
    footer = temp[-1].rstrip("\r\n")

    print("header: "+header)
    print("footer: "+footer)

    splithead = header.split(',')
    splitfoot = footer.split(',')

    # create path to reformat data directory
    snfile_refmt = snfile.replace("/raw/","/reformat/") 

    # checks to see if a footer was written. if not, header is used instead of footer
    if splitfoot[0] != splithead[0]: 
        data_to_use = header
        print('Used header for wind dir')
    else:
        data_to_use = footer 
        print("Used footer for wind dir")

    # write out the data, ending the line so the QC'd rows appended later start on their own line
    with open(snfile_refmt, 'w') as reformatdata:
        reformatdata.write(data_to_use + "\n")
            
    return float(splithead[6]) # compass heading from the header


def update_wind(snfile,headerwind):
    ''' 
    snfile:     a directory path that leads directly to the sticknet file
    headerwind: float value returned from the update_header function.
    
    This function updates the wind direction data based on the difference between the old header 
    returned from the update_header function and the new header from the reformat sticknet file 
    (These may be the same value!).
    
    This function returns a pandas dataframe of the raw data with corrected WD. '''

    # get reformated file directory
    snfile_refmt = snfile.replace("/raw/","/reformat/") 

    # read in the new wind direction from reformat header
    header_text = linecache.getline(snfile_refmt,1)
    newwind= float(header_text.split(',')[6])
    linecache.clearcache()

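    # open raw data; malformed rows (such as an 8-field footer line) are skipped with a warning
    # (on_bad_lines needs pandas >= 1.3; older versions used error_bad_lines=False instead)
    data = pd.read_csv(snfile,delimiter=",",header=1,
                       names=['time','T','RH','P','WS','WD','BATT'], on_bad_lines='warn')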
    WDdata = data['WD']

    # difference between header and footer (will be zero if there was no footer)
    deltadir = np.round(headerwind - newwind, 1) # ignore unnecessary resolution which can take up memory

    print('Header Heading: {0}'.format(headerwind))
    print('Footer heading: {0}'.format(newwind))
    print('Delta heading: {0}'.format(deltadir))

    # correct the winds unless the header heading is the missing value (-999.0)
    if headerwind == -999.0:
        print('bad compass, could not fix wind directions')

    else:
        WDdata = WDdata - deltadir
        data.loc[:,'WD'] = [d % 360 for d in WDdata.values]# fixes values outside [0,360]

    return data


def QC_StickNetFiles(snfile, data, qcvals, flagged_file):
    ''' 
    snfile:        a directory path that leads directly to the sticknet file
    data:          the pandas DataFrame returned from the update_wind function
    qcvals:        dictionary of values that represent the maximum allowed deviation
                   of the raw data from data smoothed by a 10-pt box-car filter
    flagged_file:  path to file that will hold any flagged data
    
    This function takes the raw StickNet data (with updated wind directions) and 
    adds TFLAG and WFLAG which is 0 if thermo or wind data is good, and is 1 if it 
    has been flagged by qc test. 
    
    The qc test flags any difference between the raw data and data smoothed by a 10-pt box
    car filter (smooth function) that is larger than the value specified in qcvals. The data 
    is then written into the reformat sticknet file with the flag information.
    
    The flagged rows of data are also output into the flagged_file, with the smoothed values they
    were compared to.
    
    Note: I have updated the rolling boxcar mean function. It produces the same output as Tony's
    old function except for a few beginning values before n=10.
    
    ***NOTE: 19 May 2020 update:::: If RH has values greater than 100, the Tflag is raised!'''
    
    print('QCQCQCQCQCQCQCQCQCQCQCQC')
    
    # path to reformatted data file
    snfile_refmt = snfile.replace("/raw/","/reformat/") 
    

    # add in TFLAG and WFLAG
    data['TFLAG'] = 0
    data['WFLAG'] = 0

    # find locations of bad data, and save smoothed data
    smooth_var = {}
    TF_loc, WF_loc = [], []
    for var in ['T', 'RH', 'P', 'WS']:
        boxcar_width = 10 # DONT CHANGE
        smooth_var[var] = data[var].rolling(boxcar_width, win_type='boxcar', min_periods=1, center=True).mean() 
        if var == 'WS':
            WF_loc.extend(np.where(abs(data[var].values - smooth_var[var]) >= qcvals[var])[0])
        else:
            TF_loc.extend(np.where(abs(data[var].values - smooth_var[var]) >= qcvals[var])[0])
            # find where RH exceeds 100, which can happen if there are water intrusion or voltage issues
            if var == 'RH':
                TF_loc.extend(np.where(data[var].values > 100.0)[0])
    TF_loc = np.unique(TF_loc)

  
    # update data files based on where flags have been raised   
    data.loc[TF_loc, 'TFLAG'] = 1
    data.loc[WF_loc, 'WFLAG'] = 1

    # rewrite raw data with corrected winds and flags into reformat directory files
    print('Writing QC\'d data to file')
    f = open(snfile_refmt, 'a')
    data.to_csv(f, index=None, header=None, float_format='%.1f')
    f.close()

    # create dataframe of flagged values, labeled by the name of the sticknet file
    # output each variable with its averaged value it was compared to
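    # cast to int: concatenating with an empty list yields floats, which would be written as e.g. 12.0
    all_flags = np.unique(np.concatenate((WF_loc, TF_loc))).astype(int)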
    if len(all_flags) != 0: # only do this if there are flagged data
        flagged_data = data.loc[all_flags]
        flagged_data.insert(0, 'fname' , snfile_refmt.split('/')[-1])
        flagged_data.insert(1, 'row_num' , all_flags)
        flagged_data = flagged_data.drop(columns=['time', 'WD', 'BATT'])

        for i,var in enumerate(['T', 'RH', 'P', 'WS']):
            flagged_data.insert((i*2)+3, f'mean{var}' , smooth_var[var])

        # write data out into file
        ff = open(flagged_file, 'a')
        flagged_data.to_csv(ff, index=None, header=None, float_format='%.1f')
        ff.close()
    
    print('QC for {} complete!'.format(snfile_refmt.split('/')[-1]))

    
##################################
# DO NOT TOUCH THESE VARIABLES
# I'm not sure where these values came from, 
# but they've been used since at least 2010
################################
tempqc = 0.5
rhqc   = 5.0
presqc = 1.2
wspdqc = 4.0

qcvals = {'T': tempqc, 'RH': rhqc, 'P': presqc, 'WS': wspdqc }
###############################


#######################################################
#      IMPLEMENTATION OF FUNCTIONS
#######################################################

# finds all deployments in the directory file given 
deployments = glob.glob(f'{directory}/*')       
print('deployment directories found: ', deployments)

for deploy in deployments:
    
    # making the reformat directory
    deploy_refmt = f'{deploy}/raw/'.replace("/raw/","/reformat/")
    
    # path to data file containing flagged data
    flagged_file = 'QC_flagged_data/flagged_{}.txt'.format(deploy_refmt.split('/')[-3])
    
    # if the qcfile doesn't exist, create it and add a header
    if not os.path.exists(flagged_file):
        qcfile = open(flagged_file, 'a')
        qcfile.write('filename, row, T, meanT, RH, meanRH, P, meanP, WS, meanWS, TFLAG(1=bad), WFLAG(1=bad)\n')
        qcfile.close()
    
    # if the reformat directory doesnt exist, create it
    if not os.path.exists(deploy_refmt):
        os.makedirs(deploy_refmt)
        
    # get all paths to each sticknet file
    sticknets = glob.glob(f'{deploy}/raw/*.txt')
    
    # print just the filenames, not the filepaths
    print('sticknet files found: ',[s.split('/')[-1] for s in sticknets], '\n')  

    for SN in sticknets:
        print('------\nProcessing {}'.format(SN.split('/')[-1]), '\n')
        
        # set appropriate header for sticknet file in reformat
        headerWD = update_header(SN)
        
        # update wind direction based on difference in header/footer compass heading
        data = update_wind(SN,headerWD) 
        
        # add in flags for potentially bad data
        # write data with flags into sticknet file in reformat
        # write flagged data into flagged data file
        QC_StickNetFiles(SN, data, qcvals, flagged_file)
        

deployment directories found:  ['/Users/jessmcd/Documents/other/20170430_RapidProbeData/20170501_masstest/masstest']
sticknet files found:  ['0221A_20170501_150634.txt', '0219A_20170501_150839.txt', '0224A_20170501_150633.txt', '0223A_20170501_150931.txt', '0218A_20170501_150634.txt', '0220A_20170501_150625.txt', '0222A_20170501_150645.txt'] 

------
Processing 0221A_20170501_150634.txt 

header: 0221A,20170501,15:06:34,34.724694,86.646406,184.0,183.2,3
footer: 0221A,20170501,23:36:22,34.724568,86.646475,178.5,183.5,3
Used footer for wind dir


b'Skipping line 305763: expected 7 fields, saw 8\n'


Header Heading: 183.2
Footer heading: 183.5
Delta heading: -0.3
QCQCQCQCQCQCQCQCQCQCQCQC
Writing QC'd data to file
QC for 0221A_20170501_150634.txt complete!
------
Processing 0219A_20170501_150839.txt 

header: 0219A,20170501,15:08:39,34.724516,86.646345,177.0,176.2,3
footer: 0219A,20170501,23:36:08,34.724545,86.646335,180.5,176.3,3
Used footer for wind dir


b'Skipping line 304383: expected 7 fields, saw 8\n'


Header Heading: 176.2
Footer heading: 176.3
Delta heading: -0.1
QCQCQCQCQCQCQCQCQCQCQCQC
Writing QC'd data to file
QC for 0219A_20170501_150839.txt complete!
------
Processing 0224A_20170501_150633.txt 

header: 0224A,20170501,15:06:33,34.724616,86.646413,186.5,185.4,3
footer: 20170501_233037.4,20.4,34.7,986.8,0.1,262.0,NaN
Used header for wind dir
Header Heading: 185.4
Footer heading: 185.4
Delta heading: 0.0
QCQCQCQCQCQCQCQCQCQCQCQC
Writing QC'd data to file
QC for 0224A_20170501_150633.txt complete!
------
Processing 0223A_20170501_150931.txt 

header: 0223A,20,0::,0.000000,0.000000,0,0,1
footer: 0223A,20170501,23:36:29,34.724640,86.646418,203.0,0,3
Used footer for wind dir


b'Skipping line 304172: expected 7 fields, saw 8\n'


Header Heading: 0.0
Footer heading: 0.0
Delta heading: 0.0
QCQCQCQCQCQCQCQCQCQCQCQC
Writing QC'd data to file
QC for 0223A_20170501_150931.txt complete!
------
Processing 0218A_20170501_150634.txt 

header: 0218A,20170501,15:06:34,34.724637,86.646375,198.5,179.5,3
footer: 20170501_213802.3,22.7,28.8,985.7,0.0,270.8,NaN
Used header for wind dir
Header Heading: 179.5
Footer heading: 179.5
Delta heading: 0.0
QCQCQCQCQCQCQCQCQCQCQCQC
Writing QC'd data to file
QC for 0218A_20170501_150634.txt complete!
------
Processing 0220A_20170501_150625.txt 

header: 0220A,20170501,15:06:25,34.724604,86.646571,176.0,181.3,3
footer: 0220A,20170501,23:36:45,34.724598,86.646468,181.5,181.4,3
Used footer for wind dir


b'Skipping line 306099: expected 7 fields, saw 8\n'


Header Heading: 181.3
Footer heading: 181.4
Delta heading: -0.1
QCQCQCQCQCQCQCQCQCQCQCQC
Writing QC'd data to file
QC for 0220A_20170501_150625.txt complete!
------
Processing 0222A_20170501_150645.txt 

header: 0222A,20170501,15:06:45,34.724682,86.646363,209.0,189.3,3
footer: 20170501_204849.4,23.7,30.9,985.5,0.0,237.7,NaN
Used header for wind dir
Header Heading: 189.3
Footer heading: 189.3
Delta heading: 0.0
QCQCQCQCQCQCQCQCQCQCQCQC
Writing QC'd data to file
QC for 0222A_20170501_150645.txt complete!
