<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Project-description" data-toc-modified-id="Project-description-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Project description</a></span></li><li><span><a href="#Make-feature-file" data-toc-modified-id="Make-feature-file-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Make feature file</a></span></li><li><span><a href="#Train-test" data-toc-modified-id="Train-test-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Train-test</a></span></li>

In [1]:
#%matplotlib notebook
import numpy as np
import pandas as pd
import obspy
import matplotlib
import matplotlib.pyplot as plt
import os
import sys
import random
from itertools import combinations
from datetime import datetime, timedelta
import time
import pickle
from itertools import cycle
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import balanced_accuracy_score #new
from sklearn.linear_model import LinearRegression #new
from sklearn.model_selection import KFold #new
from sklearn.model_selection import cross_val_score #new
from sklearn.model_selection import train_test_split
from obspy.clients.filesystem.sds import Client
from scipy import signal
import seaborn as sns
import joblib
import itertools
import glob
from ComputeAttributes_CH_V1 import *
from DF_detections import *

In [2]:
# set plotting default parameters
plt.rcParams['font.size'] = 12
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams['legend.fontsize'] = 12
plt.rcParams['figure.figsize'] = (6.4, 4.8) # if wider plot, only change first value.
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = ['Verdana']
#plt.rcParams['ytick.major.pad']= 2

In [17]:
path_to_raw_data = '/data/wsd03/data_manuela/Illgraben/miniseed/'
file_ending = 'beginning_2017'

# Project description

In this notebook we use a manually assembled event catalog containing slope failure events, earthquakes and noise signals recorded on stations ILL16, ILL17 and ILL18.
We choose 40 s windows that start several seconds before the event start in the catalog and overlap by a fixed amount (for further explanation see Wenner et al., 2021).
The generated/converted catalog is then used to compute features of the signals in the windows, following Provost et al., 2017.
We then train a random forest classifier with manually picked events from 2017. The random forest classifier __distinguishes between two classes: slope failures and noise (including earthquakes)__.
Lastly, we apply the classifier to continuous data (2019) to look at seasonal variations in slope failure occurrence and the impact of climatic forcing.

The notebook is structured in three main tasks:
1. Compute features from a snuffler file (pyrocko event picker format) -> 2017
2. Train and test a classifier
3. Apply on continuous data -> 2019

Explanations of the specific steps follow below.

# Make feature file

Here, we read in a slightly modified snuffler file containing events picked on all Illgraben stations.


In [18]:
# Load catalog
df = pd.read_csv('../snuffler_files/catalog_{}.txt'.format(file_ending), delimiter=',')
print(df.shape)
df.head()

(720, 8)


Unnamed: 0,startdate,starttime,enddate,endtime,duration,bla,class,network
0,2017-06-02,10:02:57.8257,,,,,0.0,XP.ILL07..EHZ
1,2017-06-02,10:08:39.0653,,,,,0.0,XP.ILL07..EHZ
2,2017-06-02,10:13:14.0537,,,,,0.0,XP.ILL07..EHZ
3,2017-06-02,10:22:11.1152,,,,,0.0,XP.ILL07..EHZ
4,2017-06-02,10:32:15.0563,,,,,0.0,XP.ILL07..EHZ


In [19]:
# Define window length
wdow = 40

# Get times of continuous noise (class 0)
# Noise markers differ from event markers in the snuffler output: events have a start and an end time,
# whereas for continuous noise I marked only a single timestamp at which nothing was happening
ct = df[df['duration'].isna()].reset_index(drop=True)
# ct = df[df['endtime'] == '0'].reset_index(drop=True)
ct.head()

Unnamed: 0,startdate,starttime,enddate,endtime,duration,bla,class,network
0,2017-06-02,10:02:57.8257,,,,,0.0,XP.ILL07..EHZ
1,2017-06-02,10:08:39.0653,,,,,0.0,XP.ILL07..EHZ
2,2017-06-02,10:13:14.0537,,,,,0.0,XP.ILL07..EHZ
3,2017-06-02,10:22:11.1152,,,,,0.0,XP.ILL07..EHZ
4,2017-06-02,10:32:15.0563,,,,,0.0,XP.ILL07..EHZ


In [10]:
# Here I filter to only get events from station ILL08 and None (marked at all stations)
# I do this to avoid sampling noise from station ILL08, which we consider to be one of the most valuable stations for our problem
station_list = ['ILL08']
df7 = df[df['duration'].notna()].reset_index(drop=True)
df7 = df7[(df7['network'] == f'XP.ILL08..EHZ') | (df7['network'] == 'None')].reset_index(drop=True)
df7.head()

Unnamed: 0,startdate,starttime,enddate,endtime,duration,bla,class,network
0,2017-05-24,01:43:28.3030,2017-05-24,01:43:40.9140,12.611000061035156,,1.0,XP.ILL08..EHZ
1,2017-05-24,06:46:04.2439,2017-05-24,06:46:13.4420,9.198099851608276,,4.0,XP.ILL08..EHZ
2,2017-05-24,08:17:01.1679,2017-05-24,08:18:01.3940,60.22609996795654,,1.0,XP.ILL08..EHZ
3,2017-05-24,10:33:25.4780,2017-05-24,10:34:11.9555,46.47749996185303,,1.0,XP.ILL08..EHZ
4,2017-05-25,08:58:35.1350,2017-05-25,08:59:28.3719,53.23690009117127,,4.0,XP.ILL08..EHZ


In [11]:
# Assemble catalog
classes = []
event_idxs = []
slice_idxs = []
times = []
stations = []
for idx, row in df7.iterrows(): # df7 includes only ILL08 and None stations
    sdate = row['startdate']
    stime = row['starttime']
    stt = obspy.UTCDateTime(f"{sdate}T{stime}")
    edate = row['enddate']
    etime = row['endtime']
    edt = obspy.UTCDateTime(f"{edate}T{etime}")
    cla = row['class']
    station = row['network'][3:8] #ILLxx

    # Start at a random offset of 0 to int(2/3*wdow)-1 s (0-25 s for wdow = 40) before the marker start
    tt = stt - np.random.randint(int((2/3)*wdow))
    j = 0
    while tt < edt-(int((1/3)*wdow)): # stop once tt is within int(1/3*wdow) = 13 s of the marker end
        classes.append(cla)
        event_idxs.append(idx)
        slice_idxs.append(j)
        times.append(tt)

        if station == 'e': # 'None'[3:8] evaluates to 'e', i.e. the event was marked at all stations
            stations.append('ILL08')#(random.choice(station_list))
        else:
            stations.append(station)
        tt += int((1/3)*wdow) # 13 s step: consecutive 40 s windows overlap by about 2/3 (the PhD thesis text states 1/3)
        j+=1

# Also add continuous noise to new catalog
for idxn, row in ct.iterrows():
    idx += 1 # attention: idx is NOT idxn
    sdate = row['startdate']
    stime = row['starttime']
    stt = obspy.UTCDateTime(f"{sdate}T{stime}")
    cla = 0

    classes.append(cla)
    event_idxs.append(idx)
    slice_idxs.append(0)
    times.append(stt + 0.5*wdow)
    stations.append('ILL08')#(random.choice(station_list))

# Get in dataframe format
dic_re = {'event_idx': event_idxs, 'slice_idx': slice_idxs, 'class': classes, 'mean_time': times, 'station': stations}
cat_re = pd.DataFrame(dic_re)
#cat_re[cat_re['class'] == 1]
print(cat_re.head())

# Save catalog
cat_re.to_csv(f"../catalog/catalog_{wdow}_{file_ending}.csv", index=False)
# output each time a bit different due to random element

   event_idx  slice_idx  class                    mean_time station
0          0          0    1.0  2017-05-24T01:43:23.303000Z   ILL08
1          1          0    4.0  2017-05-24T06:46:00.243900Z   ILL08
2          2          0    1.0  2017-05-24T08:16:46.167900Z   ILL08
3          2          1    1.0  2017-05-24T08:16:59.167900Z   ILL08
4          2          2    1.0  2017-05-24T08:17:12.167900Z   ILL08
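The slicing loop above can be sketched in isolation. This is a minimal illustration that uses plain seconds instead of `obspy.UTCDateTime`; the helper name `window_starts` and the toy marker times are assumptions made for this example and are not part of the notebook. With `wdow = 40` the step is 13 s, so consecutive 40 s windows share 27 s, which is roughly a 2/3 overlap.

```python
import random

def window_starts(start, end, wdow=40, rng=None):
    """Return the reference times (in seconds) of all windows for one marked event."""
    rng = rng or random.Random()
    step = int((1/3) * wdow)           # 13 s for wdow = 40
    max_shift = int((2/3) * wdow)      # 26 for wdow = 40
    # random shift of 0 .. max_shift-1 seconds before the marker start
    t = start - rng.randrange(max_shift)
    times = []
    while t < end - step:              # stop once t is within one step of the marker end
        times.append(t)
        t += step
    return times

# Toy marker lasting from 100 s to 160 s (hypothetical values)
print(window_starts(100.0, 160.0, wdow=40, rng=random.Random(0)))
```

Because of the random shift, the number of windows per event can differ between runs, which is why the saved catalog changes slightly each time the cell is executed.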


#### => it is possible to start here and load the modified catalog file

In [12]:
# Read catalog
cat = pd.read_csv(f'../catalog/catalog_{wdow}_{file_ending}.csv')
cat.head()

Unnamed: 0,event_idx,slice_idx,class,mean_time,station
0,0,0,1.0,2017-05-24T01:43:23.303000Z,ILL08
1,1,0,4.0,2017-05-24T06:46:00.243900Z,ILL08
2,2,0,1.0,2017-05-24T08:16:46.167900Z,ILL08
3,2,1,1.0,2017-05-24T08:16:59.167900Z,ILL08
4,2,2,1.0,2017-05-24T08:17:12.167900Z,ILL08


In [13]:
def get_mseed(time, client=Client(path_to_raw_data), network = 'XP',
              stations=['ILL06','ILL07','ILL08'],channels=['EHZ'], locations=[''], 
              prepick=0, window_length=24*60*60.):
    """Read waveforms for all requested stations, skipping any request that raises an error."""
    st = obspy.Stream()
    start = obspy.UTCDateTime(time) - prepick
    for station in stations:
        for location in locations:
            for channel in channels:
                try:
                    st += client.get_waveforms(network, station, location, channel,
                                               start, start + window_length)
                except Exception:
                    # reading failed for this station/location/channel -> skip it
                    continue
    return st

In [14]:
def create_feature_file(df, wdow, filt):
    """
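    Create attribute files from catalog

    :param df: pandas dataframe containing event_idx, slice_idx, class, mean_time and station
    :param wdow: window length in seconds
    :param filt: list of integers or floats defining how to filter the stream:
                 [] (no filter), [f_min, f_max] (one band-pass) or
                 [f1_min, f1_max, f2_min, f2_max] (two band-passes)
    :return: suffix of the feature file name ('_unfilt', '' or '_2freq')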
    """

    all_char = {}
    
    # Loop over rows in dataframe
    for idx, row in df.iterrows():
        #print(idx)
        tstring = row['mean_time']
        t = obspy.UTCDateTime(f"{tstring}")
        feats = np.array([])
        
        
        
        st = get_mseed(obspy.UTCDateTime('{}-{}'.format(t.year,t.julday)))
        st.detrend('demean')
        st1 = st.copy()
        
        if len(filt) == 0:
            fi = '_unfilt'
        elif len(filt) == 2:
            f_min = filt[0]
            f_max = filt[1]
            st1.filter('bandpass', freqmin=f_min, freqmax=f_max)
            fi = ''
        elif len(filt) == 4:
            f1_min = filt[0]
            f1_max = filt[1]
            st1.filter('bandpass', freqmin=f1_min, freqmax=f1_max)
            st2 = st.copy()
            f2_min = filt[2]
            f2_max = filt[3]
            st2.filter('bandpass', freqmin=f2_min, freqmax=f2_max)
            st1+= st2
            fi = '_2freq'
            
        st1.trim(t - 0.5*wdow, t + 0.5*wdow) # cut stream to 40s (wdow) segments (20s before t and 20s after t)
        print(st1)                           # t = mean_time of event (from df7)
        for tr in st1:
            att = calculate_all_attributes(tr.data,tr.stats.sampling_rate,0) # Compute all attributes
            feats = np.append(feats,att)
            print(len(feats))
        feats = np.reshape(feats,(1,len(feats)))
        
        
        ev_type = np.array([int(row['event_idx']),int(row['slice_idx']),int(row['class'])])
        type_att = np.append(ev_type, feats)
        type_att = np.reshape(type_att,(1,len(type_att)))
        type_att = pd.DataFrame(type_att)
        # Save to file
        with open(f'/data/wsd03/data_manuela/Illgraben/feature_files/all_{file_ending}{fi}.csv', "a+") as f:
            type_att.to_csv(f, header=False,index=False)
    return(fi)
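In `create_feature_file`, the suffix `fi` is only assigned when `filt` has 0, 2 or 4 entries; any other length would fail late, at the point where the file name is built. A minimal sketch of how this mapping could be validated up front is given below. The helper `filter_suffix` is hypothetical and not part of the notebook; it only mirrors the three cases handled above.

```python
def filter_suffix(filt):
    """Map a band-pass specification to (list of bands, file-name suffix)."""
    if len(filt) not in (0, 2, 4):
        raise ValueError(f"filt must have 0, 2 or 4 entries, got {len(filt)}")
    # group the flat list into (f_min, f_max) pairs
    bands = [(filt[i], filt[i + 1]) for i in range(0, len(filt), 2)]
    for f_min, f_max in bands:
        if not f_min < f_max:
            raise ValueError(f"invalid band: {f_min} >= {f_max}")
    suffix = {0: '_unfilt', 1: '', 2: '_2freq'}[len(bands)]
    return bands, suffix

print(filter_suffix([1, 10]))
print(filter_suffix([1, 10, 35, 45]))
```

Checking the argument before any waveform is read avoids spending time on loading and filtering a full day of data only to fail when the result is written.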


In [15]:
# Create feature file --> time consuming
fi = create_feature_file(cat, wdow, [1,10]) #[1,10,35,45]

1 Trace(s) in Stream:
XP.ILL08..EHZ | 2017-05-24T01:43:03.300000Z - 2017-05-24T01:43:43.300000Z | 100.0 Hz, 4001 samples
58
1 Trace(s) in Stream:
XP.ILL08..EHZ | 2017-05-24T06:45:40.240000Z - 2017-05-24T06:46:20.240000Z | 100.0 Hz, 4001 samples
58
1 Trace(s) in Stream:
XP.ILL08..EHZ | 2017-05-24T08:16:26.170000Z - 2017-05-24T08:17:06.170000Z | 100.0 Hz, 4001 samples
58
1 Trace(s) in Stream:
XP.ILL08..EHZ | 2017-05-24T08:16:39.170000Z - 2017-05-24T08:17:19.170000Z | 100.0 Hz, 4001 samples
58
1 Trace(s) in Stream:
XP.ILL08..EHZ | 2017-05-24T08:16:52.170000Z - 2017-05-24T08:17:32.170000Z | 100.0 Hz, 4001 samples
58
1 Trace(s) in Stream:
XP.ILL08..EHZ | 2017-05-24T08:17:05.170000Z - 2017-05-24T08:17:45.170000Z | 100.0 Hz, 4001 samples
58
1 Trace(s) in Stream:
XP.ILL08..EHZ | 2017-05-24T08:17:18.170000Z - 2017-05-24T08:17:58.170000Z | 100.0 Hz, 4001 samples
58
1 Trace(s) in Stream:
XP.ILL08..EHZ | 2017-05-24T10:32:57.480000Z - 2017-05-24T10:33:37.480000Z | 100.0 Hz, 4001 samples
58
1 Trace(

1 Trace(s) in Stream:
XP.ILL08..EHZ | 2017-05-30T17:56:44.110000Z - 2017-05-30T17:57:24.110000Z | 100.0 Hz, 4001 samples
58
1 Trace(s) in Stream:
XP.ILL08..EHZ | 2017-05-30T17:56:57.110000Z - 2017-05-30T17:57:37.110000Z | 100.0 Hz, 4001 samples
58
1 Trace(s) in Stream:
XP.ILL08..EHZ | 2017-05-30T17:57:10.110000Z - 2017-05-30T17:57:50.110000Z | 100.0 Hz, 4001 samples
58
1 Trace(s) in Stream:
XP.ILL08..EHZ | 2017-05-30T17:57:23.110000Z - 2017-05-30T17:58:03.110000Z | 100.0 Hz, 4001 samples
58
1 Trace(s) in Stream:
XP.ILL08..EHZ | 2017-05-30T18:12:18.890000Z - 2017-05-30T18:12:58.890000Z | 100.0 Hz, 4001 samples
58
1 Trace(s) in Stream:
XP.ILL08..EHZ | 2017-05-30T18:12:31.890000Z - 2017-05-30T18:13:11.890000Z | 100.0 Hz, 4001 samples
58
1 Trace(s) in Stream:
XP.ILL08..EHZ | 2017-05-30T18:12:44.890000Z - 2017-05-30T18:13:24.890000Z | 100.0 Hz, 4001 samples
58
1 Trace(s) in Stream:
XP.ILL08..EHZ | 2017-05-30T19:23:39.220000Z - 2017-05-30T19:24:19.220000Z | 100.0 Hz, 4001 samples
58
1 Trace(

3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-02T12:58:00.550000Z - 2017-06-02T12:58:40.550000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-02T12:58:00.550000Z - 2017-06-02T12:58:40.550000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-02T12:58:00.550000Z - 2017-06-02T12:58:40.550000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-02T12:58:13.550000Z - 2017-06-02T12:58:53.550000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-02T12:58:13.550000Z - 2017-06-02T12:58:53.550000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-02T12:58:13.550000Z - 2017-06-02T12:58:53.550000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-02T19:04:52.740000Z - 2017-06-02T19:05:32.740000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-02T19:04:52.740000Z - 2017-06-02T19:05:32.740000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-02T19:04:52.740000Z - 2017-06-02T19:05:32.740000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s)

3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-04T03:01:41.610000Z - 2017-06-04T03:02:21.610000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-04T03:01:41.610000Z - 2017-06-04T03:02:21.610000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-04T03:01:41.610000Z - 2017-06-04T03:02:21.610000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-04T13:44:00.780000Z - 2017-06-04T13:44:40.780000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-04T13:44:00.780000Z - 2017-06-04T13:44:40.780000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-04T13:44:00.780000Z - 2017-06-04T13:44:40.780000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-04T13:44:13.780000Z - 2017-06-04T13:44:53.780000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-04T13:44:13.780000Z - 2017-06-04T13:44:53.780000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-04T13:44:13.780000Z - 2017-06-04T13:44:53.780000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s)

3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-06T02:30:55.090000Z - 2017-06-06T02:31:35.090000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-06T02:30:55.090000Z - 2017-06-06T02:31:35.090000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-06T02:30:55.090000Z - 2017-06-06T02:31:35.090000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-06T02:31:08.090000Z - 2017-06-06T02:31:48.090000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-06T02:31:08.090000Z - 2017-06-06T02:31:48.090000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-06T02:31:08.090000Z - 2017-06-06T02:31:48.090000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-06T02:31:21.090000Z - 2017-06-06T02:32:01.090000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-06T02:31:21.090000Z - 2017-06-06T02:32:01.090000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-06T02:31:21.090000Z - 2017-06-06T02:32:01.090000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s)

3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-07T20:48:31.900000Z - 2017-06-07T20:49:11.900000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-07T20:48:31.900000Z - 2017-06-07T20:49:11.900000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-07T20:48:31.900000Z - 2017-06-07T20:49:11.900000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-07T20:48:44.900000Z - 2017-06-07T20:49:24.900000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-07T20:48:44.900000Z - 2017-06-07T20:49:24.900000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-07T20:48:44.900000Z - 2017-06-07T20:49:24.900000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-07T20:48:57.900000Z - 2017-06-07T20:49:37.900000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-07T20:48:57.900000Z - 2017-06-07T20:49:37.900000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-07T20:48:57.900000Z - 2017-06-07T20:49:37.900000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s)

3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-10T10:51:26.040000Z - 2017-06-10T10:52:06.040000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-10T10:51:26.040000Z - 2017-06-10T10:52:06.040000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-10T10:51:26.040000Z - 2017-06-10T10:52:06.040000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-10T10:51:39.040000Z - 2017-06-10T10:52:19.040000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-10T10:51:39.040000Z - 2017-06-10T10:52:19.040000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-10T10:51:39.040000Z - 2017-06-10T10:52:19.040000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-10T11:36:43.940000Z - 2017-06-10T11:37:23.940000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-10T11:36:43.940000Z - 2017-06-10T11:37:23.940000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-10T11:36:43.940000Z - 2017-06-10T11:37:23.940000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s)

3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-13T22:32:52.020000Z - 2017-06-13T22:33:32.020000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-13T22:32:52.020000Z - 2017-06-13T22:33:32.020000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-13T22:32:52.020000Z - 2017-06-13T22:33:32.020000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-13T22:33:05.020000Z - 2017-06-13T22:33:45.020000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-13T22:33:05.020000Z - 2017-06-13T22:33:45.020000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-13T22:33:05.020000Z - 2017-06-13T22:33:45.020000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-13T22:33:18.020000Z - 2017-06-13T22:33:58.020000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-13T22:33:18.020000Z - 2017-06-13T22:33:58.020000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-13T22:33:18.020000Z - 2017-06-13T22:33:58.020000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s)

3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-14T19:12:12.810000Z - 2017-06-14T19:12:52.810000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-14T19:12:12.810000Z - 2017-06-14T19:12:52.810000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-14T19:12:12.810000Z - 2017-06-14T19:12:52.810000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-14T19:12:25.810000Z - 2017-06-14T19:13:05.810000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-14T19:12:25.810000Z - 2017-06-14T19:13:05.810000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-14T19:12:25.810000Z - 2017-06-14T19:13:05.810000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-14T19:12:38.810000Z - 2017-06-14T19:13:18.810000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-14T19:12:38.810000Z - 2017-06-14T19:13:18.810000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-14T19:12:38.810000Z - 2017-06-14T19:13:18.810000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s)

3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-16T23:46:14.340000Z - 2017-06-16T23:46:54.340000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-16T23:46:14.340000Z - 2017-06-16T23:46:54.340000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-16T23:46:14.340000Z - 2017-06-16T23:46:54.340000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-16T23:46:27.340000Z - 2017-06-16T23:47:07.340000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-16T23:46:27.340000Z - 2017-06-16T23:47:07.340000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-16T23:46:27.340000Z - 2017-06-16T23:47:07.340000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-16T23:46:40.340000Z - 2017-06-16T23:47:20.340000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-16T23:46:40.340000Z - 2017-06-16T23:47:20.340000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-16T23:46:40.340000Z - 2017-06-16T23:47:20.340000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s)

3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-17T19:54:42.970000Z - 2017-06-17T19:55:22.970000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-17T19:54:42.970000Z - 2017-06-17T19:55:22.970000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-17T19:54:42.970000Z - 2017-06-17T19:55:22.970000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-17T19:54:55.970000Z - 2017-06-17T19:55:35.970000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-17T19:54:55.970000Z - 2017-06-17T19:55:35.970000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-17T19:54:55.970000Z - 2017-06-17T19:55:35.970000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-06-17T19:55:08.970000Z - 2017-06-17T19:55:48.970000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-06-17T19:55:08.970000Z - 2017-06-17T19:55:48.970000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-06-17T19:55:08.970000Z - 2017-06-17T19:55:48.970000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s)

3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-07-01T08:11:38.810000Z - 2017-07-01T08:12:18.810000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-07-01T08:11:38.810000Z - 2017-07-01T08:12:18.810000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-07-01T08:11:38.810000Z - 2017-07-01T08:12:18.810000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-07-01T09:29:23.740000Z - 2017-07-01T09:30:03.740000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-07-01T09:29:23.740000Z - 2017-07-01T09:30:03.740000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-07-01T09:29:23.740000Z - 2017-07-01T09:30:03.740000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-07-01T09:29:36.740000Z - 2017-07-01T09:30:16.740000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-07-01T09:29:36.740000Z - 2017-07-01T09:30:16.740000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-07-01T09:29:36.740000Z - 2017-07-01T09:30:16.740000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s)

3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-07-03T11:24:24.269999Z - 2017-07-03T11:25:04.269999Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-07-03T11:24:24.270001Z - 2017-07-03T11:25:04.270001Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-07-03T11:24:24.270000Z - 2017-07-03T11:25:04.270000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-07-03T11:24:37.269999Z - 2017-07-03T11:25:17.269999Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-07-03T11:24:37.270001Z - 2017-07-03T11:25:17.270001Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-07-03T11:24:37.270000Z - 2017-07-03T11:25:17.270000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-07-03T11:24:50.269999Z - 2017-07-03T11:25:30.269999Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-07-03T11:24:50.270001Z - 2017-07-03T11:25:30.270001Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-07-03T11:24:50.270000Z - 2017-07-03T11:25:30.270000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s)

3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-07-20T22:36:19.350000Z - 2017-07-20T22:36:59.350000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-07-20T22:36:19.350000Z - 2017-07-20T22:36:59.350000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-07-20T22:36:19.350000Z - 2017-07-20T22:36:59.350000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-07-20T22:36:32.350000Z - 2017-07-20T22:37:12.350000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-07-20T22:36:32.350000Z - 2017-07-20T22:37:12.350000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-07-20T22:36:32.350000Z - 2017-07-20T22:37:12.350000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-07-20T22:36:45.350000Z - 2017-07-20T22:37:25.350000Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-07-20T22:36:45.350000Z - 2017-07-20T22:37:25.350000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-07-20T22:36:45.350000Z - 2017-07-20T22:37:25.350000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s)

3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-07-21T17:05:13.589989Z - 2017-07-21T17:05:53.589989Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-07-21T17:05:13.590000Z - 2017-07-21T17:05:53.590000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-07-21T17:05:13.590000Z - 2017-07-21T17:05:53.590000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-07-21T17:05:26.589989Z - 2017-07-21T17:06:06.589989Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-07-21T17:05:26.590000Z - 2017-07-21T17:06:06.590000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-07-21T17:05:26.590000Z - 2017-07-21T17:06:06.590000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s) in Stream:
XP.ILL06..EHZ | 2017-07-21T17:05:39.589989Z - 2017-07-21T17:06:19.589989Z | 100.0 Hz, 4001 samples
XP.ILL07..EHZ | 2017-07-21T17:05:39.590000Z - 2017-07-21T17:06:19.590000Z | 100.0 Hz, 4001 samples
XP.ILL08..EHZ | 2017-07-21T17:05:39.590000Z - 2017-07-21T17:06:19.590000Z | 100.0 Hz, 4001 samples
58
116
174
3 Trace(s)

# Train-test

In [17]:
# List of attributes we want to use for classification
attribute_names = ['duration', 'RappMaxMean','RappMaxMedian', 'AsDec', 'KurtoSig', \
                    'KurtoEnv', 'SkewnessSig', 'SkewnessEnv', 'CorPeakNumber', 'INT1', \
                    'INT2', 'INT_RATIO', 'ES[0]', 'ES[1]', 'ES[2]', 'ES[3]', 'ES[4]', 'KurtoF[0]', \
                    'KurtoF[1]', 'KurtoF[2]', 'KurtoF[3]', 'KurtoF[4]', 'DistDecAmpEnv', \
                    'env_max/duration(Data,sps)', 'MeanFFT', 'MaxFFT', 'FmaxFFT', \
                    'FCentroid', 'Fquart1', 'Fquart3', 'MedianFFT', 'VarFFT', 'NpeakFFT', \
                    'MeanPeaksFFT', 'E1FFT', 'E2FFT', 'E3FFT', 'E4FFT', 'gamma1', 'gamma2', \
                    'gammas', 'SpecKurtoMaxEnv', 'SpecKurtoMedianEnv', 'RATIOENVSPECMAXMEAN', \
                    'RATIOENVSPECMAXMEDIAN', 'DISTMAXMEAN', 'DISTMAXMEDIAN', 'NBRPEAKMAX', \
                    'NBRPEAKMEAN', 'NBRPEAKMEDIAN', 'RATIONBRPEAKMAXMEAN', \
                    'RATIONBRPEAKMAXMED', 'NBRPEAKFREQCENTER', 'NBRPEAKFREQMAX', \
                    'RATIONBRFREQPEAKS', 'DISTQ2Q1', 'DISTQ3Q2', 'DISTQ3Q1']

## Unsplit labeled data (v5)

In [18]:
def train_test_grouped(df1, features, size):
    """
    Split into training and test data sets such that events recorded on multiple stations end up either in the training or in the test data set.
    In addition to the sklearn function, return the feature values (X) instead of only the event index.
    
    :param df1: whole data set
    :param features: list of feature names
    :param size: fraction of events assigned to the test data set
    """
    # Get event idx and targets
    gr = df1.groupby('event_idx').first() # one row per event (first time window only)
    idxs = np.asarray(gr.index) # index 0-290 (len = 291)
    y = np.asarray(gr['event_class']) # class 0 or 1 (len = 291)
    # Split training and validation data
    X_train, X_test, y_train, y_test = train_test_split(idxs, y, test_size=size,random_state=42)
    df_tr = df1.loc[df1['event_idx'].isin(X_train)] # df with feature values TRAIN
    df_va = df1.loc[df1['event_idx'].isin(X_test)] # df with feature values TEST
    X_train = np.asarray(df_tr[features]) # feature values to TRAIN
    y_train = np.asarray(df_tr['event_class']) # classes to TRAIN (only 0 and 1)
    X_test = np.asarray(df_va[features]) # feature values to TEST
    y_test = np.asarray(df_va['event_class']) # classes to TEST (only 0 and 1)
    return X_train, X_test, y_train, y_test
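The idea behind `train_test_grouped` is that all windows of one event (and all stations that recorded it) must end up on the same side of the split, otherwise nearly identical samples leak from training into testing. The following is a minimal sketch of that idea using only NumPy and pandas; the helper `split_by_event` and the toy data are assumptions for illustration and do not reproduce the exact split of the notebook. scikit-learn's `GroupShuffleSplit` offers a comparable ready-made alternative.

```python
import numpy as np
import pandas as pd

def split_by_event(df, test_size=0.3, seed=42):
    """Split rows into train/test so that each event_idx appears in one set only."""
    rng = np.random.default_rng(seed)
    event_ids = rng.permutation(df['event_idx'].unique())
    n_test = max(1, int(round(test_size * len(event_ids))))
    is_test = df['event_idx'].isin(event_ids[:n_test])
    return df[~is_test], df[is_test]

# Toy catalog: several windows per event (hypothetical values)
toy = pd.DataFrame({'event_idx':   [0, 0, 0, 1, 1, 2, 3, 3, 4, 5],
                    'event_class': [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]})
train, test = split_by_event(toy, test_size=0.3)
# Events shared between the two sets; expected to be empty
print(set(train['event_idx']) & set(test['event_idx']))
```

Note that the test fraction refers to events, not to rows, so the row counts of the two sets depend on how many windows each event contributes.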

In [19]:
# Assemble whole header of feature file
header = []
h = ['event_idx', 'slice_idx', 'event_class']
header = h + attribute_names + attribute_names + attribute_names # 3 times because 3 stations
print(len(header))

177
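The header length follows directly from the layout of the feature file: three metadata columns followed by one block of 58 attributes per station. A short check of this arithmetic, which also yields the column offsets at which each station block starts (the variable names here are chosen for this example only):

```python
n_meta = 3        # event_idx, slice_idx, event_class
n_features = 58   # number of entries in attribute_names
n_stations = 3

n_columns = n_meta + n_stations * n_features
offsets = [n_meta + k * n_features for k in range(n_stations)]
print(n_columns, offsets)  # 177 [3, 61, 119]
```

Deriving the offsets from the block sizes rather than hard-coding them keeps the reshaping step correct if the number of attributes or stations changes.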


In [38]:
# Load feature file and set new header
# df_ff = pd.read_csv('/data/wsd03/data_manuela/Illgraben/feature_files/all_40s_hSNR_yfilt_v4_original.csv'.format(file_ending), header = None, names=range(len(header)))
df_ff = pd.read_csv('/data/wsd03/data_manuela/Illgraben/feature_files/all_{}{}.csv'.format(file_ending,fi), header = None, names=range(len(header)))

df_ff.columns = header
print(np.shape(df_ff)) # the three stations are behind each other (in x direction)

# Now we have to list the different stations below each other (in y direction)
blocks = []
for n in [3, 61, 119]:
    df_new = df_ff.iloc[:,:3].join(df_ff.iloc[:,n:n+58]) # copy the first three columns, then add the features of one station
    blocks.append(df_new)
df_final = pd.concat(blocks) # DataFrame.append is deprecated (removed in pandas 2.0)
print(np.shape(df_final))

df_f = df_final[df_final['duration'].notna()].reset_index(drop=True) # keep only stations with data for the time window 
df_final = df_f

(430, 177)
(1290, 61)
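The reshaping from one row per window (stations side by side) to one row per window and station can be tried on a toy table. The sketch below uses two features per station instead of 58 and made-up numbers; all names (`wide`, `long_df`, `f0`, `f1`) are assumptions for this example. Rows of a station without data are dropped afterwards, analogous to the `notna()` filter on `duration`.

```python
import numpy as np
import pandas as pd

n_meta, n_feat, n_sta = 3, 2, 3          # toy sizes; the notebook uses 3, 58, 3
meta = ['event_idx', 'slice_idx', 'event_class']
feat = ['f0', 'f1']

# Two windows, three stations next to each other
wide = pd.DataFrame(np.arange(2 * (n_meta + n_sta * n_feat), dtype=float).reshape(2, -1))
wide.iloc[1, n_meta + 2 * n_feat:] = np.nan   # third station has no data for the second window
wide.columns = meta + feat * n_sta

blocks = []
for k in range(n_sta):
    start = n_meta + k * n_feat
    # metadata columns plus the feature block of station k
    blocks.append(pd.concat([wide.iloc[:, :n_meta], wide.iloc[:, start:start + n_feat]], axis=1))

long_df = pd.concat(blocks, ignore_index=True)
long_df = long_df.dropna(subset=['f0']).reset_index(drop=True)
print(wide.shape, long_df.shape)
```

Collecting the blocks in a list and concatenating once avoids repeated copying of the growing frame.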




In [39]:
# We train a binary classifier (slope failure vs. noise) -> merge classes
def combine_classes_catalog_2(cat):
    cat.loc[cat['event_class'] != 3 , 'event_class'] = 0 # Earthquake, thunder, ambient noise --> noise
    cat.loc[cat['event_class'] == 3 , 'event_class'] = 1 # Slope failures from 3 to 1
    return cat

catalog = combine_classes_catalog_2(df_final)
catalog.to_csv('../catalog/combined_classes_catalog_{}{}'.format(file_ending,fi), index=False)
catalog

Unnamed: 0,event_idx,slice_idx,event_class,duration,RappMaxMean,RappMaxMedian,AsDec,KurtoSig,KurtoEnv,SkewnessSig,...,NBRPEAKMEAN,NBRPEAKMEDIAN,RATIONBRPEAKMAXMEAN,RATIONBRPEAKMAXMED,NBRPEAKFREQCENTER,NBRPEAKFREQMAX,RATIONBRFREQPEAKS,DISTQ2Q1,DISTQ3Q2,DISTQ3Q1
0,0.0,0.0,0.0,40.01,12.816385,39.788039,3.636153,24.604105,18.272831,0.612877,...,3.0,11.0,2.000000,0.545455,42.0,16.0,0.380952,4.756333,4.139822,8.896156
1,1.0,0.0,0.0,40.01,11.118979,16.465289,2.237055,26.181989,30.606662,0.018887,...,1.0,6.0,5.000000,0.833333,13.0,14.0,1.076923,4.855335,4.832085,9.687420
2,2.0,0.0,0.0,40.01,11.340990,14.003522,362.727273,23.350681,38.608780,0.823180,...,0.0,1.0,0.000000,0.000000,43.0,10.0,0.232558,4.954587,5.260092,10.214679
3,2.0,1.0,0.0,40.01,6.207292,9.933128,2.099148,6.627458,5.613452,0.122806,...,1.0,15.0,6.000000,0.400000,47.0,7.0,0.148936,4.625831,4.864835,9.490666
4,2.0,2.0,0.0,40.01,4.358168,5.077012,0.544191,3.973479,3.400073,0.024451,...,6.0,42.0,1.333333,0.190476,49.0,25.0,0.510204,4.674332,4.306575,8.980907
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1089,95.0,0.0,0.0,40.01,3.132603,3.336495,0.198981,2.882065,3.449377,0.039492,...,16.0,76.0,0.812500,0.171053,31.0,25.0,0.806452,5.559347,3.481311,9.040658
1090,95.0,1.0,0.0,40.01,9.406688,29.626751,3.951733,12.211623,8.224271,0.110801,...,1.0,3.0,3.000000,1.000000,28.0,20.0,0.714286,4.739333,3.734565,8.473898
1091,95.0,2.0,0.0,40.01,6.283808,10.068729,0.898008,7.163920,5.388522,0.045779,...,1.0,8.0,3.000000,0.375000,33.0,29.0,0.878788,4.157823,3.902568,8.060391
1092,95.0,3.0,0.0,40.01,5.399124,7.068993,0.174002,6.092394,6.013246,0.044561,...,1.0,8.0,2.000000,0.250000,16.0,27.0,1.687500,3.962569,4.246074,8.208644


In [40]:
X_train, X_test, y_train, y_test = train_test_grouped(catalog, attribute_names, size=0.3) # 30% data to test

print(np.shape(X_train)) # feature file
print(np.shape(y_train)) # classification (only 0 and 1)
print(np.shape(X_test)) # feature file
print(np.shape(y_test)) # classification (only 0 and 1)

print(len(catalog['event_class']==1)) # length of the boolean mask = total number of rows, not the number of class-1 rows
print(np.where(catalog['event_class']==1))
print(np.shape(np.where(y_test==1)))
print(np.shape(np.where(y_train==1)))

(841, 58)
(841,)
(253, 58)
(253,)
1094
(array([ 46,  47,  48,  49,  50,  51,  52,  53,  54,  55,  56,  57,  58,
        59,  60,  61,  62,  63,  64,  65,  66,  67,  68,  69,  70,  71,
        72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,  84,
        85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,  97,
        98,  99, 100, 126, 127, 128, 129, 137, 155, 156, 157, 162, 163,
       164, 165, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268,
       269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 280, 281, 282,
       430, 431, 432, 458, 459, 460, 461, 469, 487, 488, 489, 494, 495,
       496, 497, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600,
       601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 612, 613, 614,
       762, 763, 764, 790, 791, 792, 793, 801, 819, 820, 821, 826, 827,
       828, 829, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 932,
       933, 934, 935, 936, 937, 938, 939, 940, 941, 942, 944, 945, 946]),)
(1, 29)
(1, 140)
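
Because several time windows (slices) belong to the same event, a split should keep all slices of one event on the same side. The sketch below shows one way to do such a group-aware split with scikit-learn's `GroupShuffleSplit`. It is not necessarily how `train_test_grouped` is implemented; the toy catalog and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy catalog: 20 events with 3 slices each
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'event_idx': np.repeat(np.arange(20), 3),
    'event_class': np.repeat(rng.integers(0, 2, size=20), 3),
    'duration': rng.normal(size=60),
    'KurtoSig': rng.normal(size=60),
})
features = ['duration', 'KurtoSig']

# Split by event so that no event contributes slices to both sets
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(
    splitter.split(toy[features], toy['event_class'], groups=toy['event_idx'])
)

X_train = toy.iloc[train_idx][features].to_numpy()
X_test = toy.iloc[test_idx][features].to_numpy()
y_train = toy.iloc[train_idx]['event_class'].to_numpy()
y_test = toy.iloc[test_idx]['event_class'].to_numpy()

train_events = set(toy.iloc[train_idx]['event_idx'])
test_events = set(toy.iloc[test_idx]['event_idx'])
print(X_train.shape, X_test.shape, train_events & test_events)
```

Note that `test_size` here refers to the fraction of groups (events), so the fraction of rows in the test set can differ when events have different numbers of slices.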


## Split labeled data (v4)

In [45]:
# Read train and test feature files
plot_ending = 'test_v4_bigsf_clf2'
#catalog = pd.read_csv(path_to_raw_data[:-9] + 'feature_files/train_features_yfilt_40s_hSNR_v4_with_thunder.csv')
catalog = pd.read_csv(path_to_raw_data[:-9] + 'feature_files/train_features_yfilt_40s_hSNR_v4_with_thunder_without_bigsf.csv')
catalog_test = pd.read_csv(path_to_raw_data[:-9] + 'feature_files/test_features_yfilt_40s_hSNR_v4_with_thunder.csv')

def combine_classes_catalog_2(cat):
    cat.loc[cat['event_class'] == 2, 'event_class'] = 0 # Earthquake Illgraben to noise
    return cat

catalog = combine_classes_catalog_2(catalog)
catalog_test = combine_classes_catalog_2(catalog_test)
print(np.shape(catalog))
catalog.head()

(620, 61)


Unnamed: 0,event_idx,slice_idx,event_class,duration,RappMaxMean,RappMaxMedian,AsDec,KurtoSig,KurtoEnv,SkewnessSig,...,NBRPEAKMEAN,NBRPEAKMEDIAN,RATIONBRPEAKMAXMEAN,RATIONBRPEAKMAXMED,NBRPEAKFREQCENTER,NBRPEAKFREQMAX,RATIONBRFREQPEAKS,DISTQ2Q1,DISTQ3Q2,DISTQ3Q1
0,0,0,0,40.01,11.621308,13.382875,0.0,29.192959,58.224163,0.654186,...,0,0,0.0,0.0,43,23,0.534884,5.813852,5.007588,10.821439
1,0,1,0,40.01,12.98994,41.775964,4.243775,24.769869,18.235417,0.617135,...,3,9,1.333333,0.444444,43,16,0.372093,4.841585,4.302575,9.14416
2,2,0,0,40.01,8.805683,13.153355,9.232737,12.392325,12.534898,0.263876,...,0,2,0.0,0.0,45,10,0.222222,4.941336,5.11959,10.060926
3,2,1,0,40.01,5.517401,8.359824,1.366056,5.482842,4.470063,0.100531,...,5,22,1.6,0.363636,51,9,0.176471,4.57633,4.914586,9.490916
4,2,2,0,40.01,4.076638,4.514342,92.046512,3.746714,3.910707,0.047172,...,7,53,1.142857,0.150943,45,27,0.6,4.473828,4.226824,8.700652


In [46]:
gr_train = catalog.groupby('event_idx').first() # takes only first timewindow from event
gr_test = catalog_test.groupby('event_idx').first() # takes only first timewindow from event
gr_test[gr_test['event_class'] == 1]

event_idx,slice_idx,event_class,duration,RappMaxMean,RappMaxMedian,AsDec,KurtoSig,KurtoEnv,SkewnessSig,SkewnessEnv,...,NBRPEAKMEAN,NBRPEAKMEDIAN,RATIONBRPEAKMAXMEAN,RATIONBRPEAKMAXMED,NBRPEAKFREQCENTER,NBRPEAKFREQMAX,RATIONBRFREQPEAKS,DISTQ2Q1,DISTQ3Q2,DISTQ3Q1
17,0,1,40.01,3.596649,3.894292,0.035455,3.132553,3.944306,0.007406,0.831002,...,3,14,2.0,0.428571,21,33,1.571429,4.245574,4.105572,8.351146
20,0,1,40.01,9.632125,29.351504,4.85798,12.222148,9.015058,0.20059,2.326823,...,1,9,3.0,0.333333,23,9,0.391304,4.415077,4.968837,9.383914
24,0,1,40.01,7.936242,39.820574,4.549237,11.392583,7.632915,0.085739,2.092004,...,3,8,1.0,0.375,40,22,0.55,3.878818,3.616063,7.494881
25,0,1,40.01,6.030819,7.589484,17.872642,6.582485,9.183023,0.03478,2.072771,...,0,2,0.0,0.0,25,22,0.88,4.819584,3.897068,8.716653
26,0,1,40.01,8.148813,34.01874,10.497126,10.511834,7.006029,0.144457,1.963308,...,2,8,1.5,0.375,35,33,0.942857,3.44206,3.130805,6.572865
27,0,1,40.01,3.352804,3.605667,1.795947,3.221219,3.831803,0.040959,0.85384,...,9,48,1.111111,0.208333,33,31,0.939394,6.196108,4.830335,11.026443
29,0,1,40.01,9.378785,25.950413,0.0,13.723142,9.579943,0.507181,2.516687,...,0,0,0.0,0.0,49,16,0.326531,5.281842,4.58508,9.866923
52,0,1,40.01,7.347712,9.820201,1.35769,9.15682,11.795604,0.326415,2.592355,...,2,15,1.0,0.133333,16,7,0.4375,4.370576,4.763583,9.13416
61,0,1,40.01,5.355649,7.201601,0.684632,6.299165,6.442957,0.074038,1.752272,...,1,12,5.0,0.416667,28,17,0.607143,4.482328,4.785084,9.267412
64,0,1,40.01,9.819074,19.369138,7.405462,17.722166,14.616016,0.094346,3.167648,...,0,1,0.0,0.0,27,25,0.925926,4.646581,4.509329,9.15591


In [47]:
# take all attributes
X_train = np.asarray(catalog[attribute_names])
y_train = np.asarray(catalog['event_class'])
X_test = np.asarray(catalog_test[attribute_names])
y_test = np.asarray(catalog_test['event_class'])

print(np.shape(X_train))
print(np.shape(y_train))
print(np.shape(X_test))
print(np.shape(y_test))

#print(len(catalog['event_class']==1))
#print(np.where(catalog['event_class']==1))
#print(np.shape(np.where(y_test==1)))
#print(np.shape(np.where(y_train==1)))

(620, 58)
(620,)
(263, 58)
(263,)


In [48]:
#X_train= df[df.columns.difference(['Year','X'])] # drop specific columns in df

## RF-model

In [49]:
# Train balanced random forest classifier
clf = BalancedRandomForestClassifier(n_estimators=1200, criterion='gini', sampling_strategy='majority', max_features='sqrt',
                                     n_jobs=-1, min_samples_leaf=1, max_depth=10, min_samples_split=20,
                                     oob_score=False, bootstrap=True, class_weight=None, random_state=10)

In [50]:
# Fit model
clf.fit(X_train, y_train)
# Predict test data set
y_pred = clf.predict(X_test)
# Get probabilities
probas = clf.predict_proba(X_test)  
# Print confusion matrix
print(confusion_matrix(y_test, y_pred))

[[193  17]
 [  3  50]]
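
The confusion matrix above can be turned into per-class rates. The sketch below uses scikit-learn's convention (rows are true classes, columns are predicted classes) and treats class 1 (slope failure) as the positive class; the variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[193, 17],
               [3, 50]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)            # share of slope failures that were detected
specificity = tn / (tn + fp)       # share of noise windows classified as noise
precision = tp / (tp + fp)         # share of predicted slope failures that are real
balanced_accuracy = (recall + specificity) / 2

print(f'recall            = {recall:.3f}')
print(f'specificity       = {specificity:.3f}')
print(f'precision         = {precision:.3f}')
print(f'balanced accuracy = {balanced_accuracy:.3f}')
```

With 210 noise windows against 53 slope-failure windows in the test set, balanced accuracy is less dominated by the majority class than plain accuracy.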


In [43]:
print('{}{}.model'.format(file_ending,fi))

v4_bigsf_clf2.model


In [51]:
# save model
filename = '../model/RF_{}{}.model'.format(file_ending,fi)
with open(filename, 'wb') as f: # context manager closes the file after writing
    pickle.dump(clf, f)

In [84]:
# load model
filename = '../model/RF_{}{}.model'.format(file_ending,fi)
with open(filename, 'rb') as f: # context manager closes the file after reading
    clf = pickle.load(f)
clf

BalancedRandomForestClassifier(max_depth=10, max_features='sqrt',
                               min_samples_split=20, n_estimators=1200,
                               n_jobs=-1, random_state=10,
                               sampling_strategy='majority')
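
As an alternative to `pickle`, `joblib` (already imported at the top of the notebook) can store and restore a fitted model. The following is a minimal sketch with a small stand-in `RandomForestClassifier` on made-up data and a temporary file; the file name `RF_demo.model` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on made-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=10).fit(X, y)

# Round trip: write the model to disk and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'RF_demo.model')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```

As with `pickle`, a stored model should only be loaded from a trusted source and ideally with the same library versions that were used to save it.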

In [51]:
## take features with importance 0.0025 or bigger
## get feature importance
#importances = clf.feature_importances_
#importances.sort()
#importances
##print(sum(clf.feature_importances_))
#df_imp = pd.DataFrame({'imp': clf.feature_importances_, 'label': attribute_names}).sort_values(by='imp', ascending=False)
#
## drop features with an importance smaller than 0.0025
#indexNames = df_imp[df_imp['imp'] < 0.0025].index
#df_imp.drop(indexNames , inplace=True)
#imp_list = df_imp['label'].to_list()
#imp_list
#
#X_train = np.asarray(catalog[imp_list])
#y_train = np.asarray(catalog['event_class'])
#X_test = np.asarray(catalog_test[imp_list])
#y_test = np.asarray(catalog_test['event_class'])
#
#print(np.shape(X_train))
#print(np.shape(y_train))
#print(np.shape(X_test))
#print(np.shape(y_test))
#
## Fit model
#clf.fit(X_train, y_train)
## Predict test data set
#y_pred = clf.predict(X_test)
## Get probabilities
#probas = clf.predict_proba(X_test)  
## Print confusion matrix
#print(confusion_matrix(y_test, y_pred))

(620, 56)
(620,)
(263, 56)
(263,)
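
The importance-based feature selection in the commented-out cell above can be sketched on toy data. This is an illustration only: the data, the feature names, the stand-in `RandomForestClassifier`, and the threshold of 0.1 are all assumptions and differ from the notebook's setup.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up data: only feat_0 and feat_1 carry information about the label
rng = np.random.default_rng(0)
names = [f'feat_{i}' for i in range(6)]
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=names)
y = (X['feat_0'] + X['feat_1'] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=10).fit(X, y)

# Importances sum to 1; sort them so the ranking is easy to read
imp = pd.Series(model.feature_importances_, index=names).sort_values(ascending=False)

threshold = 0.1  # hypothetical cut-off for this toy example
keep = imp[imp >= threshold].index.tolist()

# Refit on the reduced feature set
model_reduced = RandomForestClassifier(n_estimators=100, random_state=10).fit(X[keep], y)
print(imp.round(3))
print(keep)
```

Selecting features and evaluating on the same test set can give optimistic results, so the threshold is best chosen on the training data (for example with cross-validation).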
