### Introduction

From: https://www.timeseriesclassification.com/description.php?Dataset=Earthquakes

* The earthquake classification problem involves predicting whether a major event is about to occur based on the most recent readings in the surrounding area.


* The data is taken from the Northern California Earthquake Data Center, and **each data point is an averaged reading for one hour**, with the first reading taken on Dec 1st 1967 and the last in 2003.


* We transform this single time series into a classification problem by first defining a major event as any reading of over 5 on the Richter scale.


* Major events are often followed by aftershocks. The physics of these is well understood, and their detection is not the objective of this exercise.


* Hence we consider a positive case to be one where a major event is not preceded by another major event for at least 512 hours.    
   

* To construct a negative case, we consider instances where there is a reading below 4 (to avoid blurring the boundary between major and non-major events) that is preceded by at least 20 non-zero readings in the previous 512 hours (to avoid trivial negative cases).


* None of the cases overlap in time (i.e. we perform a segmentation rather than use a sliding window).


* Of the 86,066 hourly readings, we produce 368 negative cases and 93 positive.


* Each case covers 512 hourly readings, i.e. 512/24 ≈ 21.3 days.
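
The case-construction rules above can be sketched in code. This is only an illustration under assumptions, not the procedure actually used to build the archive dataset: the helper name `label_window`, the constants, and the synthetic readings are all hypothetical, and the non-overlapping segmentation step is left out.

```python
import numpy as np

WINDOW = 512       # hours of history per case (from the description)
MAJOR = 5.0        # a major event is a reading above this value
QUIET = 4.0        # a negative case needs a reading below this value
MIN_NONZERO = 20   # minimum non-zero readings in a negative window


def label_window(readings, t):
    """Hypothetical helper: return 1 (positive), 0 (negative) or None (unused)
    for the case that ends at hour index t."""
    if t < WINDOW:
        return None  # not enough history before hour t
    history = readings[t - WINDOW:t]
    if readings[t] > MAJOR:
        # Positive only if no other major event occurred in the prior 512 hours.
        return None if np.any(history > MAJOR) else 1
    if readings[t] < QUIET and np.count_nonzero(history) >= MIN_NONZERO:
        return 0
    return None


# Synthetic hourly readings, made up for illustration only.
readings = np.zeros(2000)
readings[100:130] = 2.0   # 30 small non-zero readings
readings[600] = 5.5       # major event, no major event in the prior 512 hours
readings[700] = 5.2       # major event 100 hours after the previous one

for t in (550, 600, 700, 1500):
    print(t, label_window(readings, t))
```

In this toy series, hour 600 qualifies as a positive case, hour 700 is discarded because a major event precedes it within 512 hours, hour 550 qualifies as a negative case, and hour 1500 is discarded because its history is all zeros.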

### Imports

In [2]:
%matplotlib inline

import numpy as np
import pandas as pd

# from scipy.io import arff

### Data ingest and prep

In [3]:
data_train = np.loadtxt('./data_in/data_SeismicBagnall/Earthquakes_TRAIN.txt')
data_test = np.loadtxt('./data_in/data_SeismicBagnall/Earthquakes_TEST.txt')

X_train = data_train[:, 1:]
y_train = data_train[:, 0].astype(int)

X_test = data_test[:, 1:]
y_test = data_test[:, 0].astype(int)

df_X_train = pd.DataFrame(X_train)
df_y_train = pd.DataFrame(y_train)
df_Xy_train = df_X_train.copy()
df_Xy_train['y'] = df_y_train.values

df_X_test = pd.DataFrame(X_test)
df_y_test = pd.DataFrame(y_test)
df_Xy_test = df_X_test.copy()
df_Xy_test['y'] = df_y_test.values

df_Xy = pd.concat([df_Xy_train, df_Xy_test])
df_Xy = df_Xy.reset_index(drop=True)

print(df_X_train.shape, df_y_train.shape, df_X_test.shape, df_y_test.shape, df_Xy.shape)
print(" ")
print('Train + Test Data: ')
display(df_Xy['y'].value_counts())
print('Documentation statement: ' + 'We produce 368 negative cases and 93 positive.')

(322, 512) (322, 1) (139, 512) (139, 1) (461, 513)
 
Train + Test Data: 


0    368
1     93
Name: y, dtype: int64

Documentation statement: We produce 368 negative cases and 93 positive.
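
The ingest cell relies on a file layout in which each row holds the class label followed by the readings, separated by whitespace. A minimal sketch of that assumed layout, using two made-up rows with three readings each instead of the real 512:

```python
import io

import numpy as np

# Made-up rows in the assumed layout: label first, then the readings.
sample = io.StringIO(
    "1 -0.5 2.6 -0.5\n"
    "0 1.9 -0.3 -0.3\n"
)

data = np.loadtxt(sample)    # np.loadtxt splits on whitespace by default
X = data[:, 1:]              # every column after the first is a reading
y = data[:, 0].astype(int)   # the first column is the class label

print(X.shape, y)
```

If the label were stored in the last column instead, the two slices would need to be swapped, so it is worth checking the shapes and label counts against the documentation, as the cell above does.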


### Save aggregated and cleaned data

In [4]:
# df_Xy.to_csv('data_seismic_BagnallTimeSeriesClassification.csv', index=False)

### Misc.

In [5]:
# data = arff.loadarff('../data_SeismicBagnall/Earthquakes_TRAIN.arff') # ARFF (Weka format) version of the training file; not used here, look into later
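
For reference, a small sketch of how `scipy.io.arff.loadarff` could be used to get the same `X`/`y` split. The in-memory ARFF content, the attribute names, and the assumption that the class label is the last attribute are all made up for illustration; the real `.arff` file may be laid out differently.

```python
import io

import numpy as np
from scipy.io import arff

# Toy ARFF content (hypothetical attribute names, not the real file).
content = """@relation toy
@attribute att0 numeric
@attribute att1 numeric
@attribute target {0,1}
@data
-0.5,2.6,1
1.9,-0.3,0
"""

# loadarff accepts a filename or a file-like object.
data, meta = arff.loadarff(io.StringIO(content))

names = meta.names()
# Assumption: the last attribute is the class label.
X_arff = np.column_stack([data[name] for name in names[:-1]])
# Nominal values come back as bytes, so decode before converting to int.
y_arff = np.array([int(v.decode()) for v in data[names[-1]]])

print(X_arff.shape, y_arff)
```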

In [6]:
print(df_X_train.shape, df_X_test.shape)
print(322 + 139)
print(368 + 93)

display(df_y_train[0].value_counts())
display(df_y_test[0].value_counts())

print('Train: ' + str(58 / (264 + 58)))

print('Test: ' + str(35 / (35 + 104)))

(322, 512) (139, 512)
461
461


0    264
1     58
Name: 0, dtype: int64

0    104
1     35
Name: 0, dtype: int64

Train: 0.18012422360248448
Test: 0.2517985611510791
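
The fractions above are typed in by hand from the printed counts. A small sketch of computing them from the label vectors instead; the `*_demo` arrays and the `positive_fraction` helper are hypothetical and are rebuilt here from the counts shown above (264/58 and 104/35) so that the snippet runs on its own:

```python
import numpy as np

# Label vectors rebuilt from the counts printed above.
y_train_demo = np.array([0] * 264 + [1] * 58)
y_test_demo = np.array([0] * 104 + [1] * 35)


def positive_fraction(y):
    """Share of rows labelled 1."""
    return float(np.mean(y == 1))


for name, y in [("Train", y_train_demo), ("Test", y_test_demo)]:
    counts = np.bincount(y, minlength=2)  # [negatives, positives]
    print(name, counts.tolist(), positive_fraction(y))
```

In the notebook itself the same helper could be applied to `y_train` and `y_test` directly. Note that the positive fraction differs between the two splits (about 0.18 versus 0.25).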


In [7]:
df_X_train

,0,1,2,3,4,5,6,7,8,9,...,502,503,504,505,506,507,508,509,510,511
0,-0.518009,-0.518009,2.654211,-0.518009,-0.518009,-0.518009,-0.518009,1.456243,2.558373,-0.518009,...,-0.518009,-0.518009,-0.518009,-0.518009,-0.518009,-0.518009,-0.518009,-0.518009,1.465826,-0.518009
1,1.943733,-0.353115,-0.353115,-0.353115,-0.353115,-0.353115,-0.353115,-0.353115,-0.353115,-0.353115,...,2.457789,3.365590,-0.353115,-0.353115,-0.353115,-0.353115,-0.353115,-0.353115,-0.353115,-0.353115
2,2.638517,-0.316102,-0.316102,-0.316102,-0.316102,-0.316102,-0.316102,-0.316102,-0.316102,-0.316102,...,-0.316102,-0.316102,-0.316102,-0.316102,-0.316102,-0.316102,-0.316102,-0.316102,-0.316102,-0.316102
3,-0.531138,-0.531138,-0.531138,-0.531138,-0.531138,-0.531138,-0.531138,-0.531138,-0.531138,-0.531138,...,1.366900,-0.531138,2.147402,-0.531138,-0.531138,-0.531138,-0.531138,-0.531138,-0.531138,-0.531138
4,-0.593665,2.020105,1.174727,-0.593665,-0.593665,1.606043,1.217859,1.588790,-0.593665,-0.593665,...,1.226485,-0.593665,-0.593665,-0.593665,1.493901,-0.593665,-0.593665,-0.593665,1.899337,-0.593665
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
317,-0.579610,-0.579610,-0.579610,-0.579610,1.584470,1.892368,-0.579610,-0.579610,-0.579610,-0.579610,...,-0.579610,3.291103,1.214993,-0.579610,1.707629,1.602064,-0.579610,-0.579610,-0.579610,-0.579610
318,-0.478984,-0.478984,1.665022,-0.478984,-0.478984,-0.478984,-0.478984,-0.478984,-0.478984,-0.478984,...,-0.478984,-0.478984,-0.478984,-0.478984,-0.478984,-0.478984,-0.478984,-0.478984,-0.478984,-0.478984
319,-0.264652,-0.264652,-0.264652,-0.264652,-0.264652,-0.264652,-0.264652,-0.264652,-0.264652,-0.264652,...,3.673102,-0.264652,-0.264652,-0.264652,-0.264652,-0.264652,-0.264652,-0.264652,-0.264652,-0.264652
320,-0.490827,-0.490827,-0.490827,-0.490827,-0.490827,-0.490827,-0.490827,-0.490827,-0.490827,-0.490827,...,-0.490827,-0.490827,-0.490827,-0.490827,-0.490827,-0.490827,-0.490827,-0.490827,-0.490827,-0.490827


### Dataset summary

Columns (features) are the hourly averaged readings leading up to the event: 512 hours are tracked before each event (about 21.3 days).

Rows are separate seismic sequences: 461 in total, of which 368 are normal sequences and 93 are "big event" sequences (a fraction of about 0.20).

In [8]:
93 / (93 + 368)

0.2017353579175705

### References

[1] https://www.timeseriesclassification.com/description.php?Dataset=Earthquakes

[2] https://github.com/dzlab/deepprojects/blob/master/timeseries/LSTM_FCN_pytorch.ipynb