# Preparing a set of labelled events from the Montserrat seismic event catalog.

## Overview

To conduct supervised machine learning, we need to prepare a set of labelled events. The Montserrat seismic event catalog is stored as a Seisan database, but event classifications are unreliable, which is the main motivation for attempting supervised learning. Since a Seisan event database is awkward to work with, we first convert it to a more convenient format for Python. We take 3 main steps:

1. Convert Seisan database to ObsPy & pandas friendly formats. 
2. Compute metrics on each waveform.
3. Reclassify events to build a list of events with verified classifications.

## Converting Seisan database to ObsPy and Pandas

This includes:

1. Converting Seisan waveform (WAV) files to MiniSEED.
2. Converting Seisan S-files to ObsPy Catalog objects and CSV files. S-files are in __[Nordic format](http://isc-mirror.iris.washington.edu/standards/nordic/)__.
3. Converting Seisan calibration and RESP files to StationXML.
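
A minimal sketch of the first conversion, assuming ObsPy is installed and can read the WAV files directly. The function names `mseed_path_for` and `convert_wav_to_mseed`, and the `.mseed` suffix, are hypothetical and not taken from the existing scripts.

```python
from pathlib import Path


def mseed_path_for(wav_path, out_dir):
    """Return the output MiniSEED path for a Seisan WAV file (assumed naming)."""
    return Path(out_dir) / (Path(wav_path).name + ".mseed")


def convert_wav_to_mseed(wav_path, out_dir):
    """Read one Seisan WAV file with ObsPy and write it as MiniSEED."""
    # Imported here so the path helper above works without ObsPy installed.
    from obspy import read

    stream = read(str(wav_path))  # ObsPy detects the waveform format itself
    out_path = mseed_path_for(wav_path, out_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    stream.write(str(out_path), format="MSEED")
    return out_path
```

Keeping the original WAV file name inside the MiniSEED name makes it easy to trace a converted file back to its source.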
    
`01_seisandb_classifications`: translates VOLCANO.DEF if it exists.

`02_index_seisandb`: creates the index files. The goal is that when any WAV file is examined, the corresponding S-file can be found, and vice versa, and that S-files without existing WAV files (and WAV files that have not been registered in any S-file) are easy to identify. Mapping WAV files to S-files is the trickier direction, because a WAV file might not have the same time as its S-file.

The indexes are:

- `sfile_index.csv`: indexes the S-files and WAV files in both directions, logging every S-file and every WAV file with links to and from each. For each S-file it contains:
  - paths for up to 2 WAV files (these links are updated when data are corrected), and booleans that indicate whether they exist;
  - columns for the original mainclass and subclass;
  - a column for each acceptable mainclass and subclass (the latter from volcano_def.csv), with a corresponding percentage probability;
  - a weight;
  - a checkbox marking whether the event has been reclassified (reviewed), and another marking whether it needs to be split.
- `wavfile_index.csv`: a separate index of calibration files.
- `originalTraceID_index.csv`: an index of all traceIDs found in WAV files, calibration files, and S-files, recording the earliest and latest occurrence of each traceID. Since the traceIDs in the calibration files, and particularly in the S-files, are not fully qualified NSLC, these should be selected from the available traceIDs in the WAV files.
- A separate index of all mainclass/subclass combinations, with a count of each. This can optionally be restricted to just the S-files that map to a WAV file.

The index file creator can also detect the presence of other products, including log files, PNG files, Pickle files, corrected/enhanced MSEED files, and trace-level and event-level CSV files.
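
A minimal sketch of the two-way S-file/WAV index, assuming the WAV file names are taken from the Nordic type-6 lines of each S-file (a `6` in column 80, with the file name in columns 2-79). The function names and the dictionary layout are hypothetical; the real index is a CSV file with more columns.

```python
def wav_names_in_sfile(sfile_text):
    """Return the waveform file names listed on Nordic type-6 lines."""
    names = []
    for line in sfile_text.splitlines():
        padded = line.ljust(80)
        if padded[79] == "6":
            name = padded[1:79].strip()
            if name:
                names.append(name)
    return names


def build_indexes(sfiles, existing_wavs):
    """Build S-file -> WAV and WAV -> S-file lookups.

    sfiles: dict mapping S-file name to its text.
    existing_wavs: set of WAV file names found on disk.
    """
    sfile_index = {}
    # Start from every WAV on disk so unregistered WAV files show up as [].
    wav_index = {name: [] for name in existing_wavs}
    for sfile, text in sfiles.items():
        rows = []
        for wav in wav_names_in_sfile(text):
            rows.append({"wavfile": wav, "exists": wav in existing_wavs})
            wav_index.setdefault(wav, []).append(sfile)
        sfile_index[sfile] = rows
    return sfile_index, wav_index
```

With this layout, S-files with no linked WAV file are those with an empty list in `sfile_index`, and unregistered WAV files are those with an empty list in `wav_index`.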

`03_seisandb2counts`: produces counts of how many times each traceID appears as a function of time (events). It also counts how many S-files have no corresponding WAV files, and vice versa.
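
A minimal sketch of the traceID counts, assuming each event is available as a time plus a list of traceIDs, and that monthly bins are fine enough. The function name and the `(month, traceID)` key are hypothetical choices.

```python
from collections import Counter
from datetime import datetime


def count_trace_ids(events):
    """Count events per (month, traceID).

    events: iterable of (event_time, [trace_id, ...]) tuples.
    """
    counts = Counter()
    for event_time, trace_ids in events:
        month = event_time.strftime("%Y-%m")
        # Use a set so a traceID repeated within one event is counted once.
        for trace_id in set(trace_ids):
            counts[(month, trace_id)] += 1
    return counts
```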

`04_seisandb_fixtraceid`: translates all traceIDs in the index using a given function. This updates originalTraceID_index.csv. After execution, this file should be manually checked and can be hand edited before it is applied in later steps.
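
A minimal sketch of such a translation function and the mapping it would produce for hand editing. The rules shown (a default network code of `MV`, and replacing a blank in the channel code with `H`) are purely illustrative assumptions, as are the function names.

```python
def fix_trace_id(trace_id, default_network="MV"):
    """Return a cleaned NET.STA.LOC.CHAN string (illustrative rules only)."""
    parts = trace_id.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected NET.STA.LOC.CHAN, got {trace_id!r}")
    net, sta, loc, chan = (part.strip() for part in parts)
    net = net or default_network   # assumed default network code
    chan = chan.replace(" ", "H")  # assumed rule, e.g. 'S Z' -> 'SHZ'
    return ".".join((net, sta, loc, chan))


def build_trace_id_map(original_ids, fix=fix_trace_id):
    """Map each distinct original traceID to its fixed form."""
    return {trace_id: fix(trace_id) for trace_id in sorted(set(original_ids))}
```

Writing the mapping out as two columns (original, fixed) keeps the manual check simple: any row can be corrected by hand before later steps apply it.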

`05_qc_seisanwav`: quality checks the WAV files in the Seisan database. Optionally, this can be applied to linked WAV files only, and/or to certain traceIDs only. The index files are updated. There is no output file by default: Trace.stats.metrics are updated, but Trace objects can optionally be saved to Pickle files, or Trace.stats.metrics can be saved to trace CSV files.

`06_seisancal2stationxml`: attempts to convert all Seisan calibration and RESP files in the CAL directory to StationXML. TraceIDs are fixed according to originalTraceID_index.csv. The results are then assembled into either a single StationXML file, or one per traceID, stored back in the CAL directory. A new index, fixedTraceID_index.csv, maps each fixed traceID to the corresponding StationXML file.

`07_correct_seisandb`: attempts to apply instrument corrections from the StationXML files to the traces in the WAV files. By default, any traces that have quality=0 are omitted, and only corrected traces are retained. Optionally, the corrected traces are saved as MSEED or Pickle files. The standard Miniseed metrics are updated.

## Computing metrics

`10_seisandb_ampengfft`: computes various amplitude, energy and frequency metrics. This updates Trace.stats.metrics.

`11_seisandb_detectevents`: optionally detects events within each WAV file.

`12_seisandb_plot_seismograms`: optionally creates PNG files of corrected seismic and infrasound traces.

`13_seisandb_plot_spectrograms`: optionally creates PNG files of corrected seismic and infrasound spectrograms.
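
A minimal sketch of the kind of amplitude and energy metrics computed per trace, assuming the samples are a plain list of numbers. The metric names are hypothetical, and the frequency metrics (which need a spectrum) are left out to keep the example short.

```python
import math


def amplitude_energy_metrics(samples, sampling_rate):
    """Return simple amplitude and energy metrics for one trace."""
    if not samples:
        raise ValueError("empty trace")
    dt = 1.0 / sampling_rate
    # Index of the sample with the largest absolute amplitude.
    peak_index = max(range(len(samples)), key=lambda i: abs(samples[i]))
    sum_squares = sum(x * x for x in samples)
    return {
        "peakamp": abs(samples[peak_index]),
        "peaktime": peak_index * dt,  # seconds from the start of the trace
        "rms": math.sqrt(sum_squares / len(samples)),
        "energy": sum_squares * dt,   # sum of squared amplitudes times dt
    }
```

For example, `amplitude_energy_metrics([0.0, 3.0, -4.0, 0.0], 2.0)` gives a peak amplitude of 4.0 at 1.0 s, an RMS of 2.5 and an energy of 12.5.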
    
## Selecting best events
The following steps improve the original manual classifications to prepare the Seisan database for supervised machine learning. They are necessary for three reasons: the original event classifications are unreliable; classification is difficult, so it makes more sense to assign a probability to each event class; and each event file may contain several events, or may be garbage. So we iteratively select the top N events of each class and manually reclassify them, giving percentage probabilities for each class (while marking bad events for deletion, or multi-event files for splitting). We can also generate or update fingerprints for each class, which can help guide the manual classification or be used to classify events automatically.
    
`60_select_events`: selects the top N events of each class or accepted volcano subclass. A file called volcano_def.csv is required for the latter to be available. By default, only unchecked events are loaded, and the quality is used to choose the top ones. Optionally, checked events can be picked instead; in that case the probability is used to choose the top ones, subject to a minimum threshold (default: 50%). In either case, a set of class fingerprints is generated. These are stored in two CSV files: checked_fingerprints.csv and unchecked_fingerprints.csv.

Function `fingerprint_events`: computes fingerprints from the selected events. For unchecked events, the original subclass is used. For checked events, the probabilistic subclasses are used and weighted using the weight column (see `61_reclassify_events_manually`).
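
A minimal sketch of one way a fingerprint could be computed, assuming a fingerprint is the weighted mean of each metric over the events of a class, with each event contributing its weight times its class probability. This is an assumption and not necessarily what the existing `fingerprint_events` does; the event dictionary layout is also hypothetical. An unchecked event would enter with 100% for its original subclass and the default weight.

```python
from collections import defaultdict


def fingerprint_events(events, metric_names):
    """Return {class: {metric: weighted mean}}.

    events: list of dicts with keys 'metrics' (metric -> value),
    'probabilities' (class -> percent) and 'weight'.
    """
    sums = defaultdict(lambda: dict.fromkeys(metric_names, 0.0))
    totals = defaultdict(float)
    for event in events:
        for cls, percent in event["probabilities"].items():
            w = event["weight"] * percent / 100.0
            if w <= 0:
                continue
            totals[cls] += w
            for name in metric_names:
                sums[cls][name] += w * event["metrics"][name]
    return {
        cls: {name: sums[cls][name] / totals[cls] for name in metric_names}
        for cls in totals
    }
```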
    
`61_reclassify_events_manually`: manually (re)classifies the selected events. A probability can be given for each mainclass or accepted volcano subclass. A weight can also be given, indicating how clear the signal is. Alternatively, the event can be marked for splitting because it is judged to contain multiple events. All processed events are marked as checked unless skipped.

`62_reclassify_events_with_fingerprints`: automatically reclassifies the selected events. The probability for each class is the result of comparing the event metrics to the fingerprints.
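
A minimal sketch of one possible comparison, assuming inverse Euclidean distance between the event metrics and each fingerprint, normalised to percentages. The comparison rule is an assumption chosen for illustration, and the function name is hypothetical. In practice the metrics would need scaling first so that no single metric dominates the distance.

```python
import math


def classify_with_fingerprints(metrics, fingerprints):
    """Return {class: percent} from inverse distance to each fingerprint."""
    scores = {}
    for cls, fingerprint in fingerprints.items():
        distance = math.sqrt(
            sum((metrics[name] - value) ** 2 for name, value in fingerprint.items())
        )
        scores[cls] = 1.0 / (distance + 1e-9)  # small term avoids division by zero
    total = sum(scores.values())
    return {cls: 100.0 * score / total for cls, score in scores.items()}
```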

`63_write_AAA_input_file`: looks through the entire sfile_index.csv and extracts only the top N events by probability of each mainclass/subclass that have been checked (and not marked for splitting or deletion). Open questions: how should the weight be used? Or should this step just use the selected events?
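
A minimal sketch of the selection step, assuming each event is assigned to its most probable class and the weight is ignored (since its role is still undecided). The row layout and function name are hypothetical.

```python
def top_events_per_class(rows, n):
    """Return {class: [sfile, ...]} with the n highest-probability events.

    rows: dicts with keys 'sfile', 'checked', 'split', 'delete' and
    'probabilities' (class -> percent).
    """
    candidates = {}
    for row in rows:
        if not row["checked"] or row["split"] or row["delete"]:
            continue
        best_class, best_prob = max(
            row["probabilities"].items(), key=lambda item: item[1]
        )
        candidates.setdefault(best_class, []).append((best_prob, row["sfile"]))
    selected = {}
    for cls, items in candidates.items():
        items.sort(key=lambda item: item[0], reverse=True)
        selected[cls] = [sfile for _, sfile in items[:n]]
    return selected
```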

For selecting events:

1. Enable user to manually split event files into separate events.
2. Enable "noise" and "unknown" classifications. Implement a "weight" for each classification, from 0-9. Unchecked events are given a weight of 3. Very good examples can be given a weight of 9 and are therefore weighted more strongly in the fingerprints.
3. Enable user to assign a probability to each classification. Keep asking for further classifications until the probability reaches 1.0 (including unknown and noise).
4. Implement the following input scheme:
   - (a) Classifications are entered as single characters: 'T'='teleseism', 'R'='regional', 'r'='rockfall/PDC', 'e'='long-period + rockfall', 'l'='long-period', 'h'='hybrid', 'm'='tremor', 't'='volcano-tectonic', 'n'='noise', 'u'='unknown'.
   - (b) A classification code should be followed by ', p' where p is a percentage (an integer from 1-100). If omitted, the percentage defaults to 100.
   - (c) A percentage should be followed by ', w' where w is a weight. If omitted, the weight defaults to 3.
   - (d) To accept an existing classification with probability 100% and weight 3, press `<ENTER>`.
   - (e) To accept the probabilities from the fingerprints, enter 'f'.
   - (f) To split an event, enter 's' (do not enter a percentage or weight). This will open a separate window to clip out each event.
   - (g) If an event is garbage, 'd' should delete it and all corresponding files, including from the indexes.
   - (h) 'k' skips to the next event.
    
Valid entries:

    'h' # 100% hybrid, weight 3
    'h, 100' # 100% hybrid, weight 3
    'h, 100, 9' # 100% hybrid, weight 9
    'h, 50, l, 50' # 50% hybrid, 50% long-period, weight 3
    's' # mark for splitting
    'd' # mark for deletion
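
A minimal sketch of a parser for these entries. It assumes a single weight per entry (given after a percentage), and that percentages must not sum to more than 100. The names `parse_entry`, `CLASS_CODES` and `ACTIONS`, and the returned dictionary layout, are hypothetical.

```python
CLASS_CODES = {
    "T": "teleseism", "R": "regional", "r": "rockfall/PDC",
    "e": "long-period + rockfall", "l": "long-period", "h": "hybrid",
    "m": "tremor", "t": "volcano-tectonic", "n": "noise", "u": "unknown",
}
ACTIONS = {"": "accept", "f": "fingerprints", "s": "split", "d": "delete", "k": "skip"}
DEFAULT_WEIGHT = 3


def parse_entry(entry):
    """Parse one line of analyst input into an action or a classification."""
    text = entry.strip()
    if text in ACTIONS:
        return {"action": ACTIONS[text]}
    percentages = {}
    weight = DEFAULT_WEIGHT
    current = None  # class code still waiting for its percentage
    for token in (t.strip() for t in text.split(",")):
        if token in CLASS_CODES:
            current = token
            percentages[CLASS_CODES[token]] = 100  # default when omitted
        elif token.isdigit() and current is not None:
            percentages[CLASS_CODES[current]] = int(token)
            current = None
        elif token.isdigit() and percentages:
            weight = int(token)  # a number after a percentage is the weight
        else:
            raise ValueError(f"unrecognised token: {token!r}")
    if not all(1 <= p <= 100 for p in percentages.values()):
        raise ValueError("each percentage must be from 1 to 100")
    if sum(percentages.values()) > 100:
        raise ValueError("percentages sum to more than 100")
    if not 0 <= weight <= 9:
        raise ValueError("weight must be from 0 to 9")
    return {"action": "classify", "percentages": percentages, "weight": weight}
```

When the percentages sum to less than 100, the caller can keep prompting for further classifications until the total is reached.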
    
    
    

    
## Problems with the Montserrat seismic event catalog:
1. WAV files often contain multiple events.
2. Events are often not clearly identifiable as any event class.
3. Events can appear to be candidates for two or more event classes.
4. There are a lot of events to process.

## Solutions - implement eev.py
1. Make processing easy.
2. Allow user to skip to next event, or return to previous event.

## To do
Testing:
    
- Test whether a Pickle file of a full Stream is larger than the corresponding Miniseed file plus a Pickle file containing just the metrics. Or can the metrics all be stored in a CSV file, using tracemetrics2csv and csv2tracemetrics? I could then wrap this with libseisGT.StreamWrite and libseisGT.StreamRead.
- Test whether writing each trace to a separate Miniseed file, and loading them again, is significantly slower than saving a whole Stream to Miniseed and loading it again.
- Troubleshoot why traces are not being corrected.
- Allow ampengfft and spectrograms to be computed on corrected traces.
- A separate process needs to be devised for splitting marked events. Run processWAV to get updated metrics.
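
For the CSV option raised in the first testing item, a minimal sketch of a round trip is below. The names `tracemetrics2csv` and `csv2tracemetrics` come from these notes, but the signatures, the `id` column and the all-numeric metrics are assumptions made for illustration.

```python
import csv
import io


def tracemetrics2csv(metrics_by_trace, fileobj):
    """Write {trace_id: {metric: value}} as one CSV row per trace."""
    metric_names = sorted({k for m in metrics_by_trace.values() for k in m})
    writer = csv.DictWriter(fileobj, fieldnames=["id"] + metric_names)
    writer.writeheader()
    for trace_id, metrics in metrics_by_trace.items():
        writer.writerow({"id": trace_id, **metrics})


def csv2tracemetrics(fileobj):
    """Read the CSV back into {trace_id: {metric: float}}."""
    result = {}
    for row in csv.DictReader(fileobj):
        trace_id = row.pop("id")
        # Empty cells mean the metric was missing for that trace.
        result[trace_id] = {k: float(v) for k, v in row.items() if v != ""}
    return result
```

A quick check of the round trip can use an in-memory buffer: write to an `io.StringIO`, call `seek(0)`, and read it back.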

The event-level CSV file could track:
- Sfile path
- percentages for each classification
- weight
- overall Stream metrics, averaged from trace metrics
- WAV DSN path
- WAV ASN path
- For each trace in each WAV:
 - trace ID
 - trace quality
 - trace weight
 - trace deleted (if quality or weight are zero)
 - other trace metrics
 

My metrics do not take into account any timing information, or 3-C traces (for seeing S-waves).
The 3-C traces would help me classify, so I should look at them, but AAA is only a 1-C classifier.
Moreover, AAA is not a network-based classifier, whereas an analyst is.

Some improvements:

- A stronger high pass filter should be applied for visual classification and probably before computing metrics.
- 3-C traces should be included in the visual reclassification.
- It might be best to focus on events before the dome collapsing phase, as those are predominantly just earthquakes. However, we have only the ASN for that time period, and those data clip.
- A mechanism needs adding to deselect visually poor traces. Those would then be removed from the corresponding tracesdf CSV file.
- I could examine the frequency metrics before and after the peak amplitude time, and also an envelope. Which event classes would this help with?
  - 'r' might be expected to have a single peak, with frequency no different before and after.
  - 'e' could peak twice, in the 'l' and 'r' phases. Frequency around each peak should be constant, but different from the other peak.
  - 'l' should peak once.
  - 'h' should have a different frequency before and after the peak. So this is where measuring frequency on time windows is potentially most useful.
  - 't' should peak once, with frequency constant before and after the peak.
  - Would AR_pick be good for detecting impulsive onsets, to separate VTs and hybrids from rockfalls? Not sure how it would do with LPs.
- Where did signal to noise go?
- I could visually pick the start and end of waveforms and recompute metrics.
- Traces need to be corrected beforehand.
- I also need to take advantage of peak amplitudes across the network, perhaps by running ASL on them. This matters because MBLG is so often much larger than other traces for rockfalls, but not necessarily for earthquakes.
- But I still need to keep the work I have done. So need a mechanism that can easily update metrics.
- Add option to replace full time window with the zoom in window, to sort-of autosplit.
- It might be better to move the manual picking to MATLAB. That would make it easier to select zoom windows, signal onsets and offsets, pick phases and signal peaks, and de-select channels, as well as apply tools like noise suppression and polarization analysis. To apply polarization analysis, I would need to pick a source location for rotating seismograms. This could be based on travel-time location - if available - or on peak-amplitude time differences, or ASL.
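
The idea above of comparing frequency before and after the peak amplitude time can be sketched as follows. This uses a zero-crossing estimate of dominant frequency, which is a simple stand-in chosen for illustration and not the frequency metrics computed elsewhere in the workflow; the function names are hypothetical.

```python
def zero_crossing_frequency(samples, sampling_rate):
    """Estimate dominant frequency (Hz) from the number of sign changes."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration = (len(samples) - 1) / sampling_rate
    return crossings / (2.0 * duration)  # two crossings per cycle


def frequency_before_after_peak(samples, sampling_rate):
    """Return (frequency before peak, frequency after peak) in Hz."""
    peak_index = max(range(len(samples)), key=lambda i: abs(samples[i]))
    before = zero_crossing_frequency(samples[: peak_index + 1], sampling_rate)
    after = zero_crossing_frequency(samples[peak_index:], sampling_rate)
    return before, after
```

For a hybrid-like signal with a high-frequency onset followed by a lower-frequency coda, the first value should be clearly higher than the second, whereas for the single-frequency classes the two should be similar.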

And:
- magnitude of each event might be useful too.
- I need to build station and volcano coordinates into my Python (and MATLAB) workflows.
- I need to convert those Seisan CAL files.
- Note that spectrograms are often taken to be the most useful metric, since they combine amplitude, phase, frequency and time.
- It could also be useful to run the MATLAB polarization codes to give an indication of wave type for 3-C stations.
- If phase pick times, locations and magnitudes are available, I should include that information. At least to select best earthquake signals from the dataset.
- I may need to build a metrics model for each NSLC. And then I could combine guesses from each good channel, perhaps weighted by amplitude or signal-to-noise.
- I also want to count events by NSLC.
- DO THE NOISE-SUPPRESSION CODES IN MATLAB ACTUALLY HELP WITH CLASSIFICATION OF MVO EVENTS?
- What about Michi Wenner's landslide classification codes from ObsPy workshop? Are they better?
- Try band ratio from 0.8 - 4.0 and 4.0-20.0?
- Ensure I save the subclass models each time, so workflow should be:

1. Choose number of events for model.
2. Select next event. (PYTHON)
3. Show waveform plots. Try noise suppression. (MATLAB)
4. Show spectrogram. (Create PNG with Python to fixed path. Open in MATLAB).
5. Show spectrum. (Create PNG with Python to fixed path. Open in MATLAB).
6. Show polarization plot (spectrogram? MATLAB)
7. Show any travel-time location, magnitude, phase picks. (MATLAB: get from eventdf/tracedf).
8. Show metrics for each trace. (tracedf)
9. Show location from ASL, and event magnitude. (Do we run these in real-time in Python after selecting traces?)
10. Show guess classification from model. One per channel. (Python.)
11. Show on/off times and phase times from network detection and phase picker. (on MATLAB waveform plots).
12. Allow user to deselect channels (MATLAB waveform plots). This triggers those channels to be deleted from the trace metrics df.
13. Allow user to define multiple event on/off times. (MATLAB waveform plots). This triggers the old event to be ignored, and new events to be created and added to the queue for computing metrics. Any phase picks, locations and magnitudes within the time window of a new event need to be copied too.
14. Allow user to pick phases (MATLAB waveform plots). This will positively identify the event as an earthquake, not a rockfall.


What would Francois and Jean-Philippe and Alexis do with the small events? How would they classify events?