# Montserrat Seisan database conversion to CSV files
The aim of this code is to read through an entire Seisan database of S-files (under REA/DB/yyyy/mm) and the corresponding WAV-files (under WAV/DB/yyyy/mm). WAV-files that do not appear in an S-file are ignored.

Glenn Thompson, 2021/07/22

For each linked WAV-file, we perform the following steps:

## function WAV2picklefile
1. Load the WAV-file into an ObsPy Stream object.
2. Fix trace IDs.
3. Process each Trace object in the Stream object using various functions in metrics.py that:
   - compute standard MiniSEED QC metrics. Stored in tr.stats.metrics.
   - clean the trace, which consists of:
     - clipping the trace to remove absurd values
     - padding the trace by at least 10 seconds on each end, with a reflection of the data on each end
     - detrending the trace
     - tapering the trace (just the padded parts)
     - filtering the trace from 0.1 to 40 Hz
     - optionally correcting for instrument response (if a corresponding StationXML file is found). This also changes tr.stats.units from 'Counts' to 'm/s'.
     - unpadding the trace
   - compute the signal-to-noise ratio. This is done by breaking the signal into seconds, and computing the ratio of the highest amplitude 1-second to the lowest amplitude 1-second. Stored in tr.stats.metrics.
   - compute scipy.stats metrics including kurtosis. These are stored in tr.stats.scipy.
   - recompute the MiniSEED QC metrics now that traces have been detrended, filtered and (optionally) instrument corrected.
4. Compute a quality_factor for each trace, based on the above metrics.
5. Remove any traces with quality_factor = 0.0 from the Stream object.
6. Save the Stream object as a Pickle file. This goes into PICKLE/DB/YYYY/MM.
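
Two of the cleaning steps listed above are easy to misread, so here is a minimal NumPy sketch of them: padding by reflection (and its removal), and the signal-to-noise estimate from 1-second windows. The names `pad_reflect`, `unpad` and `snr_by_second` are hypothetical and are not the functions in metrics.py. Taking the peak absolute value as the "amplitude" of each second is an assumption.

```python
import numpy as np

def pad_reflect(data, fs, pad_seconds=10.0):
    """Pad both ends of a 1-D array with a mirror image of the data."""
    npad = min(int(round(pad_seconds * fs)), len(data) - 1)
    return np.pad(data, npad, mode="reflect"), npad

def unpad(padded, npad):
    """Remove npad samples from each end, undoing pad_reflect."""
    return padded[npad:len(padded) - npad]

def snr_by_second(data, fs):
    """Ratio of the loudest 1-second window to the quietest one."""
    nwin = int(round(fs))                 # samples per 1-second window
    nfull = len(data) // nwin             # number of complete windows
    if nfull < 2:
        return float("nan")
    windows = np.abs(data[:nfull * nwin]).reshape(nfull, nwin)
    amp = windows.max(axis=1)             # assumed amplitude measure: peak |x|
    quietest = amp.min()
    return float(amp.max() / quietest) if quietest > 0 else float("inf")

# Synthetic 20-s trace at 100 Hz: a weak 5 Hz tone with a 2-s burst 50x larger
fs = 100.0
t = np.arange(2000) / fs
data = 0.1 * np.sin(2 * np.pi * 5 * t)
data[800:1000] *= 50

padded, npad = pad_reflect(data, fs)
restored = unpad(padded, npad)            # same samples as the input
snr = snr_by_second(data, fs)
print(len(data), len(padded), round(snr, 1))
```

In the real workflow the detrend, taper and filter steps sit between the pad and the unpad, so any filter edge effects fall in the padded samples that are discarded.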

## function Stream2logfile
1. Dump all the metrics for each Trace object to a log file. This helps with debugging.

## function Stream2png

1. Optionally plot the Stream object and save it as a PNG file. If the Stream contains a mixture of seismic and infrasound traces, one PNG is created for the seismic traces and another for the infrasound traces, because they have different units (m/s or Pa).

## function add_ampengfft_metrics
1. Compute spectrograms for each trace using icewebPy.
2. Compute the amplitude spectrum from the spectrograms.
3. Compute amplitude, energy and frequency metrics including peakf, medianf, RSAM_high, RSAM_low, band_ratio. These are added to tr.stats.metrics.
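
As a rough illustration of the frequency metrics, the sketch below derives a peak frequency, a median frequency and a band ratio from an amplitude spectrum. It uses a plain NumPy FFT instead of icewebPy spectrograms. The function name `spectral_metrics`, the band limits, and the exact definitions (median frequency as the point that splits the cumulative amplitude in half, band ratio as mean high-band amplitude over mean low-band amplitude) are all assumptions, not the definitions used by ampengfft.

```python
import numpy as np

def spectral_metrics(data, fs, low_band=(0.5, 4.0), high_band=(4.0, 18.0)):
    """Return peakf, medianf and band_ratio for one trace (illustrative only)."""
    amp = np.abs(np.fft.rfft(data - np.mean(data)))   # amplitude spectrum
    freq = np.fft.rfftfreq(len(data), d=1.0 / fs)     # matching frequency axis
    peakf = freq[np.argmax(amp)]
    cum = np.cumsum(amp)
    medianf = freq[np.searchsorted(cum, 0.5 * cum[-1])]

    def band_mean(band):
        mask = (freq >= band[0]) & (freq < band[1])
        return amp[mask].mean()

    band_ratio = band_mean(high_band) / band_mean(low_band)
    return {"peakf": float(peakf), "medianf": float(medianf),
            "band_ratio": float(band_ratio)}

# Synthetic 10-s trace at 100 Hz: a strong 2 Hz tone plus a weaker 10 Hz tone
fs = 100.0
t = np.arange(1000) / fs
data = np.sin(2 * np.pi * 2 * t) + 0.5 * np.sin(2 * np.pi * 10 * t)
metrics = spectral_metrics(data, fs)
print(metrics)
```

A band ratio below 1 means the low band dominates, which is the sort of contrast these metrics are meant to capture between low-frequency and high-frequency volcano-seismic events.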

## function plot_spectrograms
1. Optionally plot the spectrograms that were computed by add_ampengfft_metrics.

## function metrics2tracedf
1. Translate the metrics stored in tr.stats into a pandas DataFrame, one row per Trace.
2. Save this to a CSV file in the PICKLE directory.
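
The one-row-per-Trace idea can be sketched without ObsPy by letting plain dictionaries stand in for tr.stats. The helper `stats_to_row`, the fake trace IDs and the metric values are hypothetical. The point is only that missing metrics become empty cells rather than errors.

```python
import pandas as pd

def stats_to_row(trace_id, stats, wanted=("snr", "peakamp", "energy")):
    """Flatten one trace's stats dictionary into a single row."""
    row = {"id": trace_id,
           "Fs": stats["sampling_rate"],
           "units": stats.get("units", "Counts")}
    metrics = stats.get("metrics", {})
    for key in wanted:
        if key in metrics:            # skip metrics that were never computed
            row[key] = metrics[key]
    return row

# Dictionaries standing in for tr.stats of two traces (made-up values)
fake_traces = {
    "MV.STA1..BHZ": {"sampling_rate": 75.0, "units": "m/s",
                     "metrics": {"snr": 12.5, "peakamp": 3.1e-6, "energy": 4.2e-9}},
    "MV.STA2..SHZ": {"sampling_rate": 100.0,
                     "metrics": {"snr": 3.0, "peakamp": 250.0}},
}

tracedf = pd.DataFrame([stats_to_row(tid, s) for tid, s in fake_traces.items()])
csv_text = tracedf.to_csv(index=False)   # text that would be written to the CSV file
print(csv_text)
```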

## function tracedf2eventdf
1. Compute the median of each numeric column in the DataFrame, to create a 1-row summary for each event.
2. Optionally attempt to detect the event within the Stream using an STA/LTA detector. This is controlled by the bool_detect_event flag. Additional metrics from the detection result are added to the 1-row summary.
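
A minimal sketch of the median summary, assuming a trace DataFrame with a quality column. The function name `summarize_event` is hypothetical, and restricting the median to the `n_best` highest-quality rows mirrors what the code further down appears to intend.

```python
import pandas as pd

def summarize_event(tracedf, n_best=10):
    """Collapse a per-trace DataFrame into one summary dictionary."""
    best = tracedf.sort_values("quality", ascending=False).head(n_best)
    summary = best.median(numeric_only=True).to_dict()  # non-numeric columns are skipped
    summary["num_traces"] = len(tracedf)
    return summary

# Three made-up traces: the lowest-quality one has a wild SNR value
tracedf = pd.DataFrame({
    "id": ["A", "B", "C"],
    "quality": [3.0, 1.0, 2.0],
    "snr": [10.0, 100.0, 20.0],
})
summary = summarize_event(tracedf, n_best=2)
print(summary)
```

Using the median rather than the mean keeps a single noisy station from dominating the event summary.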

## function processWAV
This is a wrapper around all of the above, to apply each function to a single WAV-file.

## function processSeisanYearMonth
This is a wrapper around processWAV.
1. Calls get_sfile_list to make a list of all S-files within a particular REA/DB/YYYY/MM directory.
2. Reads each S-file.
3. Calls processWAV on any WAV-files linked from the S-file (if they can be found). There can be up to two: one from the analog network and one from the digital network (there were two networks from 10/1996 to 12/2004).
4. For each WAV-file, the output from processWAV is a 1-row event summary. To this the S-file path, main event classification and volcano-seismic event subclassification are added.
5. Each 1-row event summary is appended to a DataFrame, which is stored as reawav_(DB)(YYYY)(MM).csv. This contains a summary of each WAV-file corresponding to an S-file for the YYYY/MM in question.
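
Selecting the S-files for one month depends on turning an S-file name into a date. The sketch below assumes the naming pattern seen in the output further down (for example 21-1628-55L.S199610, read as day-hourminute-second, an event-type letter, then year and month). `sfile_name_to_datetime` and `in_month` are hypothetical helpers, not the spath2datetime function imported from seisan_classes.

```python
import datetime as dt
import re

# DD-HHMM-SS, one event-type letter, then .SYYYYMM (assumed pattern)
SFILE_PATTERN = re.compile(r"^(\d{2})-(\d{2})(\d{2})-(\d{2})[A-Z]\.S(\d{4})(\d{2})$")

def sfile_name_to_datetime(name):
    """Parse an S-file base name into a datetime."""
    m = SFILE_PATTERN.match(name)
    if m is None:
        raise ValueError("not an S-file name: %s" % name)
    day, hour, minute, second, year, month = (int(g) for g in m.groups())
    return dt.datetime(year, month, day, hour, minute, second)

def in_month(name, yyyy, mm):
    """True if the S-file falls inside the half-open window [month start, next month start)."""
    start = dt.datetime(yyyy, mm, 1)
    end = dt.datetime(yyyy + 1, 1, 1) if mm == 12 else dt.datetime(yyyy, mm + 1, 1)
    return start <= sfile_name_to_datetime(name) < end

names = ["21-1628-55L.S199610", "22-0120-12L.S199610"]
kept = [n for n in names if in_month(n, 1996, 10)]
csv_name = "reawav_%s%04d%02d.csv" % ("MVOE_", 1996, 10)
print(kept, csv_name)
```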

## function main
This loops over every YYYY/MM directory in the given Seisan database, calling processSeisanYearMonth for each one.

In [1]:
import os
import sys
from glob import glob
import numpy as np
import pandas as pd

import datetime as dt
from pprint import pprint
import matplotlib.pyplot as plt

from obspy import read, read_inventory, Stream
#from obspy.io.xseed.core import _read_resp
#from obspy.imaging.cm import obspy_sequential

LIBpath = os.path.join( os.getenv('HOME'),'src','kitchensinkGT', 'LIB')
sys.path.append(LIBpath)
from libMVO import fix_trace_id, inventory_fix_id_mvo, load_mvo_inventory
from metrics import process_trace, choose_best_traces, select_by_index_list, ampengfft
from libseisGT import Stream_min_starttime, detect_network_event
from seisan_classes import spath2datetime, Sfile #, printEvents

sys.path.append(os.path.join( os.getenv('HOME'),'src', 'icewebPy') )
import IceWeb


def get_sfile_list(SEISAN_DATA, DB, startdate, enddate): 
    """
    make a list of Sfiles between 2 dates
    """

    event_list=[]
    reapath = os.path.join(SEISAN_DATA, 'REA', DB)
    years=list(range(startdate.year,enddate.year+1))
    for year in years:
        if year==enddate.year and year==startdate.year:
            months=list(range(startdate.month,enddate.month+1))
        elif year==startdate.year:
            months=list(range(startdate.month,13))
        elif year==enddate.year:
            months=list(range(1,enddate.month+1))
        else:
            months=list(range(1,13))
        for month in months:
            #print month
            yearmonthdir=os.path.join(reapath, "%04d" % year, "%02d" % month)
            flist=sorted(glob(os.path.join(yearmonthdir,"*L.S*")))
            for f in flist:
                #fdt = sfilename2datetime(f)
                fdt = spath2datetime(f)
                #print(f, fdt)
                if fdt>=startdate and fdt<enddate:
                    event_list.append(f)
    return event_list 




def WAV2picklefile(paths, shortperiod, correct_data):
    if os.path.exists(paths['picklefile']):
        st = read(paths['picklefile'])
    else:
        # We need to create the pickle file
        # Try to read the WAV file
        st = Stream()
        print('Processing %s.' % paths['wavbase'], end=' ')
        print('Reading.', end = ' ')
        try:               
            st = read(paths['wavfile'])
        except Exception: # a bare except would also swallow KeyboardInterrupt
            print('ERROR. Could not load.')
            return st

        if len(st)==0:
            print('ERROR. No traces.')
            return st
        else: 
            print('Success.')
        
        ######################### START OF CLEAN/CORRECT BLOCK ######################
        print('Cleaning/correcting')
        fix_trace_id(st, shortperiod=shortperiod) 
        #st=st.select(component='Z')

        for tr in st:
            this_inv = None            
            if correct_data: # try to find corresponding station XML
                this_inv = load_mvo_inventory(tr, paths['CALDIR'])
            process_trace(tr, inv=this_inv) # pass the inventory found above (None if not correcting)
        
        # remove bad traces
        for tr in list(st): # iterate over a copy, because st.remove() mutates the Stream
            if tr.stats.quality_factor <= 0.0:
                st.remove(tr)  
                
        ######################### END OF CLEAN/CORRECT BLOCK ######################
            
        # Write pickle file. This is now the best place to load stream data from in future, so replaces
        # seisan/WAV directory with seisan/PICKLE
        print('Writing ',paths['picklefile'])
        st.write(paths['picklefile'], format='PICKLE') 
    return st

def Stream2logfile(st, paths):        
    # save log file
    if not os.path.exists(paths['logfile']):
        print('Writing %s' % paths['logfile'])
        with open(paths['logfile'],'w') as fout:
            for tr in st:
                # print trace history
                pprint(tr.stats, stream=fout)
                fout.write('\n')
                
def Stream2png(st, paths):

    if not os.path.exists(paths['seismicpngfile']):
        print('Writing ',paths['seismicpngfile'])                
        chosen = choose_best_traces(st, MAX_TRACES=99, include_seismic=True, 
                                    include_infrasound=False, include_uncorrected=False)
        if len(chosen)>0:
            st2 = select_by_index_list(st, chosen)
            st2.plot(equal_scale=False, outfile=paths['seismicpngfile'])   
            
    if not os.path.exists(paths['infrasoundpngfile']):
        print('Writing ',paths['infrasoundpngfile'])                
        chosen = choose_best_traces(st, MAX_TRACES=99, include_seismic=False, 
                                    include_infrasound=True, include_uncorrected=False)
        if len(chosen)>0:
            st2 = select_by_index_list(st, chosen)
            st2.plot(equal_scale=False, outfile=paths['infrasoundpngfile'])    
            
def add_ampengfft_metrics(st, paths):
    print('Computing spectrogram data')        
    iwsobj = IceWeb.icewebSpectrogram(stream=st)
    iwsobj = iwsobj.precompute() # spectrograms data added
    print('Computing spectrum.', end = ' ')  
    iwsobj.compute_amplitude_spectrum(compute_bandwidth=True) # adds tr.stats.spectrum
    for tr in iwsobj.stream:
        ampengfft(tr, paths['PICKLEDIR']) # add peaktime, peakamp, energy
    return iwsobj


def plot_spectrograms(paths, iwsobj):
    st = iwsobj.stream

    # free scale  
    print('Creating %s.' % paths['sgramfile'], end = ' ')            
    titlestr = os.path.basename(paths['sgramfile']) 
    chosen = choose_best_traces(st, MAX_TRACES=10)            
    iwsobj.plot(outfile=paths['sgramfile'], log=False, equal_scale=False, add_colorbar=True, dbscale=True, title=titlestr, trace_indexes=chosen);

    # fixed scale 
    print('Creating %s.' % paths['sgramfixed'], end = ' ')
    titlestr = os.path.basename(paths['sgramfixed'])
    clim_in_dB = [-160, -100] # only works for corrected volcano-seismic data
    clim_in_units = [ IceWeb.dB2amp(clim_in_dB[0]),  IceWeb.dB2amp(clim_in_dB[1]) ]
    iwsobj.plot(outfile=paths['sgramfixed'], log=False, clim=clim_in_units, add_colorbar=True, dbscale=True, title=titlestr, trace_indexes=chosen);           
    # need separate plots for any infrasound channels or uncorrected data

def metrics2tracedf(traceCSVfile, st):
    if os.path.exists(traceCSVfile):
        tracedf = pd.read_csv(traceCSVfile)
    else: 
        print('- Building metrics dataframe.', end = ' ') 
        tracedf = pd.DataFrame()
        list_of_tracerows = []
        for tr in st:
            s = tr.stats
            tracerow = {'id':tr.id, 'starttime':s.starttime, 
                   'Fs':s.sampling_rate, 
                   'calib':s.calib, 'units':s.units, 
                   'quality':s.quality_factor}
            if 'spectrum' in s: 
                for item in ['medianF', 'peakF', 'peakA', 'bw_min', 'bw_max']:
                    try:
                        tracerow[item] = s.spectrum[item]
                    except:
                        pass
            if 'metrics' in s:
                m = s.metrics
                for item in ['snr', 'signal_level', 'noise_level', 'twin',
                             'peakamp', 'peaktime', 'energy', 'RSAM_high', 'RSAM_low',
                             'sample_min', 'sample_max', 'sample_mean', 'sample_median', 
                             'sample_lower_quartile', 'sample_upper_quartile', 'sample_rms', 
                             'sample_stdev', 'percent_availability', 'num_gaps', 'skewness', 'kurtosis']:
                             #'start_gap', 'num_gaps', 'end_gap', 'sum_gaps', 'max_gap', 
                             #'num_overlaps', 'sum_overlaps', 'num_records', 'record_length', 
                    try:
                        tracerow[item] = m[item]
                    except:
                        pass 
            if 'bandratio' in s:
                for dictitem in s['bandratio']:
                    label = 'bandratio_' +  "".join(str(dictitem['freqlims'])).replace(', ','_')
                    tracerow[label] = dictitem['RSAM_ratio']

            list_of_tracerows.append(tracerow)
        tracedf = pd.DataFrame(list_of_tracerows)
        tracedf = tracedf.round({'Fs': 2, 'secs': 2, 'quality':2, 'medianF':1, 'peakF':1, 'bw_max':1, 'bw_min':1, 'peaktime':2, 'twin':2, 'skewness':2, 'kurtosis':2})
        print('Saving to CSV.')
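        # set_index() returns a new DataFrame, so the original call had no effect.
        # Writing without an index keeps a cached CSV identical in columns to a fresh DataFrame.
        tracedf.to_csv(traceCSVfile, index=False)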
    return tracedf

def tracedf2eventdf(tracedf, st, paths, correct_data, detect_event, make_png_files):
    # Summarize event
    print('Create a summary row for whole event')
    numOfRows = tracedf.shape[0]
    if correct_data:
        df = tracedf[tracedf["units"] == 'm/s']
        if len(df.index)==0:
            df = tracedf
    else:
        df = tracedf
    df = df.sort_values(by='quality', ascending=False) # best quality first; returns a copy
    df = df.head(10) # get median of 10 best rows    
    wavrow = df.median(axis=0, skipna=True, numeric_only=True).to_dict() # skip id, units, starttime
    wavrow['path']=paths['wavfile']
    wavrow['num_traces']=numOfRows
    filetime=df.iloc[0]['starttime']
    wavrow['filetime']=filetime
    try:
        wavrow['year']=filetime[0:4]
        wavrow['month']=filetime[5:7]
        wavrow['day']=filetime[8:10]
        wavrow['hour']=filetime[11:13]
        wavrow['minute']=filetime[14:16]
        wavrow['second']=filetime[17:23]
    except:
        wavrow['year']=filetime.year
        wavrow['month']=filetime.month
        wavrow['day']=filetime.day
        wavrow['hour']=filetime.hour
        wavrow['minute']=filetime.minute
        wavrow['second']=filetime.second
    
    if detect_event:   
        trig, ontimes, offtimes = detect_network_event(st, sta=0.4, lta=5.0, threshon=4.0, threshoff=0.24, pad=5.0)
        print('%s: %d events detected' % (paths['picklefile'], len(ontimes)))
        durations = [t['duration'] for t in trig]
        if len(durations)>0:
            thistrig=trig[int(np.argmax(durations))] # keep the longest-duration trigger
            wavrow['ontime'] = thistrig['time']
            wavrow['offtime']=thistrig['time']+thistrig['duration']  
            wavrow['trigger_duration']=thistrig['duration']
            for item in ['coincidence_sum', 'cft_peak_wmean', 'cft_std_wmean']:
                wavrow[item]=thistrig[item]
            wavrow['detection_quality']=thistrig['coincidence_sum']*thistrig['cft_peak_wmean']*thistrig['cft_std_wmean']
            
        if make_png_files and 'ontime' in wavrow: # only plot when a trigger was found
            chosen = choose_best_traces(st, MAX_TRACES=1, include_seismic=True, 
                                include_infrasound=False, include_uncorrected=False)
            if len(chosen)>0:
                tr = st[chosen[0]] 
                plt.figure()
                plt.plot(tr.times(), tr.data)
                plt.ylabel(tr.id)   
                t0 = tr.stats.starttime
                bottom, top = plt.ylim()
                plt.vlines([wavrow['ontime']-t0, wavrow['offtime']-t0], bottom, top )
                plt.savefig(paths['detectionfile'])
                plt.close() # release the figure so memory does not grow over many events

    return wavrow

def wavfile2paths(wavfile):
    paths={}
    paths['WAVDIR'] = os.path.dirname(wavfile)
    paths['wavfile'] = wavfile
    paths['wavbase'] = os.path.basename(wavfile)
    parts = paths['WAVDIR'].split('WAV')
    paths['CALDIR'] = os.path.join(parts[0],'CAL')    
    paths['HTMLDIR'] = paths['WAVDIR'].replace('WAV', 'HTML')
    paths['PICKLEDIR'] = paths['WAVDIR'].replace('WAV', 'PICKLE')           
    paths['seismicpngfile'] = os.path.join(paths['HTMLDIR'], paths['wavbase'] + '_seismic.png')
    paths['infrasoundpngfile'] = os.path.join(paths['HTMLDIR'], paths['wavbase'] + '_infrasound.png')
    paths['picklefile'] = os.path.join(paths['PICKLEDIR'], paths['wavbase'] + '.pickle')
    paths['logfile'] = paths['picklefile'].replace('.pickle', '.log')
    paths['sgramfile'] = os.path.join(paths['HTMLDIR'], paths['wavbase'] + '_sgram.png')
    paths['sgramfixed'] = paths['sgramfile'].replace('_sgram.png', '_sgram_fixed.png')  
    paths['traceCSVfile'] = paths['picklefile'].replace('.pickle', '.csv')
    paths['detectionfile'] = os.path.join(paths['HTMLDIR'], paths['wavbase'] + '_detection.png')
    return paths
    
    
def processWAV(wavfile, shortperiod=False, correct_data=False, make_png_files=False, detect_event=False, compute_aef=True):
    paths = wavfile2paths(wavfile)
    if not os.path.exists(paths['HTMLDIR']) and make_png_files:
        os.makedirs(paths['HTMLDIR'])
    if not os.path.exists(paths['PICKLEDIR']):
        os.makedirs(paths['PICKLEDIR'])  
    wavrow={}
    
    st = WAV2picklefile(paths, shortperiod, correct_data)
    st = st.select(component='[ENZ]') # select() returns a new Stream; subset to seismic components only
    if len(st)==0:
        return wavrow
    Stream2logfile(st, paths)
    if make_png_files:
        Stream2png(st, paths)
        
    if compute_aef and not 'energy' in st[0].stats.metrics: 
        iwsobj = add_ampengfft_metrics(st, paths)
        if not os.path.exists(paths['sgramfile']) and make_png_files:
            plot_spectrograms(paths, iwsobj)
        for tr in iwsobj.stream:
            tr.stats.pop('spectrogramdata', None) # Remove spectrogramdata as it is too large for picklefile
            #tr.stats.spectrogramdata = {}
        print('Writing enhanced pickle file.')     
        st = iwsobj.stream
        st.write(paths['picklefile'], 'PICKLE') # Rewrite pickle file with extra attributes     
        
    tracedf = metrics2tracedf(paths['traceCSVfile'], st)
    wavrow = tracedf2eventdf(tracedf, st, paths, correct_data, detect_event, make_png_files)        
    return wavrow


def processSeisanYearMonth(SEISAN_DATA, DB, YYYY, MM, filesdone, MAXFILES=999999):
    failedWAVfiles=[]
    LoD = []
    
    # We aim to add a couple of columns from the S-file, and save to this
    reawavCSVfile=os.path.join(SEISAN_DATA, 'reawav_%s%s%s.csv' % (DB, YYYY, MM) )
    if os.path.exists(reawavCSVfile):
        return failedWAVfiles, filesdone
    
    # Get s-file list
    startdate = dt.datetime(int(YYYY), int(MM), 1)
    if int(MM)<12:
        enddate = dt.datetime(int(YYYY), int(MM)+1, 1)
    else:
        enddate = dt.datetime(int(YYYY)+1, 1, 1)
    slist = sorted(get_sfile_list(SEISAN_DATA, DB, startdate, enddate))
    
    for i,sfile in enumerate(slist):
        print('Processing %d of %d: %s' % (i, len(slist), sfile) )
        if i==MAXFILES or filesdone>=MAXFILES: # the break further down only exits the wavfile loop
            break
            
        s = Sfile(sfile, use_mvo_parser=True)
        #s.cat()
        #s.printEvents()
        d = s.to_dict()
        #pprint(d)
        
        for item in ['wavfile1', 'wavfile2']:
            if d[item]:
                if os.path.exists(d[item]):
                    wavbase = os.path.basename(d[item])
                    if 'MVO' in wavbase:
                        print('Processing ',d[item])
                        eventrow=[]
                        eventrow = processWAV(d[item], shortperiod=bool_shortperiod, correct_data=bool_correct_data, make_png_files=bool_make_png_files, detect_event=bool_detect_event)
                        if eventrow:
                            eventrow['sfile']=os.path.basename(s.path)
                            eventrow['mainclass']=s.mainclass
                            eventrow['subclass']=s.subclass
                            LoD.append(eventrow)
                            filesdone += 1
                            if filesdone >= MAXFILES:
                                break
    if LoD:
        df = pd.DataFrame(LoD)
        print('Writing ',reawavCSVfile)
        df.drop(columns=df.filter(regex="Unname").columns, inplace=True) # drop stray index columns read back from CSV
        df = df.set_index('filetime')       
        df.to_csv(reawavCSVfile, index=True) 
    return failedWAVfiles, filesdone

def main(DB, MAXFILES=999999):
    filesdone = 0
    yeardirs = sorted(glob(os.path.join('REA',DB,'[12]???')))
    for yeardir in yeardirs:
        YYYY = os.path.basename(yeardir)
        monthsdirs = sorted(glob(os.path.join(yeardir,'[01]?')))
        for monthdir in monthsdirs:
            if filesdone>=MAXFILES:
                break
            MM = os.path.basename(monthdir)
            print('**** Processing %s ****' % monthdir)
            failedWAVfiles, filesdone = processSeisanYearMonth('.', DB, YYYY, MM, filesdone, MAXFILES=MAXFILES)
            if len(failedWAVfiles)>0:
                with open('failedWAVfiles.txt','a') as fptr: # closes the file even if a write fails
                    for element in failedWAVfiles:
                        fptr.write(element)
                        fptr.write('\n')

In [None]:
SEISAN_DATA = os.path.join( os.getenv('HOME'),'DATA','MVO')
os.chdir(SEISAN_DATA)
SEISAN_DB = 'MVOE_'
bool_shortperiod=False
bool_correct_data=True
bool_make_png_files=False
bool_detect_event=True
main(SEISAN_DB, 999999)

**** Processing REA/MVOE_/1996/10 ****
Processing 0 of 412: ./REA/MVOE_/1996/10/21-1628-55L.S199610
Parsing  ./REA/MVOE_/1996/10/21-1628-55L.S199610
Processing 1 of 412: ./REA/MVOE_/1996/10/22-0120-12L.S199610
Parsing  ./REA/MVOE_/1996/10/22-0120-12L.S199610
Processing 2 of 412: ./REA/MVOE_/1996/10/22-0145-59L.S199610
Parsing  ./REA/MVOE_/1996/10/22-0145-59L.S199610
Processing 3 of 412: ./REA/MVOE_/1996/10/22-0250-15L.S199610
Parsing  ./REA/MVOE_/1996/10/22-0250-15L.S199610
Processing 4 of 412: ./REA/MVOE_/1996/10/23-0212-03L.S199610
Parsing  ./REA/MVOE_/1996/10/23-0212-03L.S199610
Processing 5 of 412: ./REA/MVOE_/1996/10/23-0230-02L.S199610
Parsing  ./REA/MVOE_/1996/10/23-0230-02L.S199610
Processing 6 of 412: ./REA/MVOE_/1996/10/23-0308-30L.S199610
Parsing  ./REA/MVOE_/1996/10/23-0308-30L.S199610
Processing 7 of 412: ./REA/MVOE_/1996/10/23-0338-22L.S199610
Parsing  ./REA/MVOE_/1996/10/23-0338-22L.S199610
Processing 8 of 412: ./REA/MVOE_/1996/10/23-0455-14L.S199610
Parsing  ./REA/MVOE_

In [None]:
def find_stationxml(paths, shortperiod, correct_data):
    #if os.path.exists(paths['picklefile']): # removed
    if 0: # added
        st = read(paths['picklefile'])
    else:
        # We need to create the pickle file
        # Try to read the WAV file
        st = Stream()
        print('Processing %s.' % paths['wavbase'], end=' ')
        print('Reading.', end = ' ')
        try:               
            st = read(paths['wavfile'])
        except Exception: # a bare except would also swallow KeyboardInterrupt
            print('ERROR. Could not load.')
            return st

        if len(st)==0:
            print('ERROR. No traces.')
            return st
        else: 
            print('Success.')
        
        ######################### START OF CLEAN/CORRECT BLOCK ######################
        print('Cleaning/correcting')
        print(st) # added
        fix_trace_id(st, shortperiod=shortperiod) 
        print(st) # added
        #st=st.select(component='Z')

        for tr in st:

            this_inv = None
            
            if correct_data: # try to find corresponding station XML
                #tr.stats.network = 'MV' 
                #xmlfile = os.path.join(caldir, "station.%s.%s.xml" % (tr.stats.network, tr.stats.station) )
                if tr.stats.channel[0] in 'ES':
                    matchcode = '[ES]'
                elif tr.stats.channel[0] in 'BH':
                    matchcode = '[BH]'
                else: # unknown band code: no StationXML pattern to match, so skip this trace
                    continue
                xmlfilepattern = os.path.join(paths['CALDIR'], "station.MV.%s..%s*%s.xml" % (tr.stats.station, matchcode, tr.stats.channel[2]) )
                #print(xmlfilepattern)
                xmlfiles = glob(xmlfilepattern)

                N = len(xmlfiles)
                if N==1:
                    xmlfile = xmlfiles[0]
                    this_inv = read_inventory(xmlfile)   
                    print('Processing %s with %s' % (tr.id, xmlfile) )
                    process_trace(tr, inv=this_inv)
                    print('Done')
    return st
 

correct_data = True
shortperiod = False
wavfile = '/Users/thompsong/DATA/MVO/WAV/MVOE_/2002/08/2002-08-01-0148-43S.MVO___014'
paths = wavfile2paths(wavfile)
st = find_stationxml(paths, shortperiod, correct_data)
st.plot(equal_scale=False)

In [None]:
print(st)

In [None]:
st.select(component='[ENZ]')