# LANL Earthquake Prediction Kaggle Competition 2019
### Eric Yap, Joel Huang, Kyra Wang

---

In this notebook, we present our work for the LANL Earthquake Prediction Kaggle Competition 2019. The goal of this competition is to use seismic signals to predict the timing of laboratory earthquakes. The data comes from a well-known experimental set-up used to study earthquake physics. The `acoustic_data` input signal is used to predict the time remaining before the next laboratory earthquake (`time_to_failure`).

The training data is a single, continuous segment of experimental data. The test data consists of a folder containing many small segments. The data within each test file is continuous, but the test files do not represent a continuous segment of the experiment; thus, the predictions cannot be assumed to follow the same regular pattern seen in the training file.

For each `seg_id` in the test folder, we need to predict a single `time_to_failure` corresponding to the time between the last row of the segment and the next laboratory earthquake.

---

### Imports

In [1]:
%load_ext autoreload
%autoreload 2

# Data wrangling imports
import numpy as np
import pandas as pd

# Utility imports
import ast
from tqdm import tqdm
from joblib import Parallel, delayed

# Data visualization imports
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.ticker as tick
import seaborn as sns

# PyTorch imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as data
from torchvision import transforms

# Custom stuff
from data import LANLDataset, FeatureGenerator

# Setting the seeds for reproducibility
np.random.seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)
else:
    torch.manual_seed_all(42)

### Data preprocessing

As the training data and the test data are formatted differently, we must either preprocess the data such that the formats of both sets are the same, or ensure that our model is capable of predicting on the two different formats. We went with the first option because it is less time consuming to implement.

We did this by splitting the training data into segments the same size as the test data segments, i.e. 150000 data points each. Each segment is labeled with a single `time_to_failure` corresponding to the time between the last row of the segment and the next laboratory earthquake. We then put each of these segments into a single dataframe, and saved this as a pickle file to be used as our training data.

Following this, we merged the separate test segments into another single dataframe, and saved this as a pickle file to be used as our test data.

As the dataset is massive, we used Joblib to help run the functions as a pipeline jobs with parallel computing.

In [2]:
trainval_df = pd.read_pickle('./data/train_features.pkl')
trainval_df = trainval_df[:-2]
trainval_df.head()

Unnamed: 0,seg_id,segment,target
0,train_0,"[12, 6, 8, 5, 8, 8, 9, 7, -5, 3, 5, 2, 2, 3, -...",1.430797
1,train_1,"[5, 6, 8, 6, 3, -1, 5, 4, 4, 4, 6, 5, 5, 5, 6,...",1.391499
2,train_2,"[5, 5, 8, 9, 9, 10, 11, 12, 13, 5, 3, 7, 5, 3,...",1.353196
3,train_3,"[5, -5, -4, 1, 3, 4, 6, 12, 15, 17, 14, 9, 6, ...",1.313798
4,train_4,"[12, 6, 4, -1, 0, 6, 7, 6, 2, -2, 0, 4, 1, 5, ...",1.2744


In [3]:
test_df = pd.read_pickle('./data/test_features.pkl')
test_df.head()

Unnamed: 0,seg_id,segment,target
0,seg_00030f,"[4, 0, -2, 0, 2, -3, -9, -4, 11, 11, 8, 1, 10,...",-999
1,seg_0012b5,"[5, 8, 8, 7, 4, 1, -1, -4, -1, 0, 5, 7, -1, 7,...",-999
2,seg_00184e,"[8, 2, 3, 8, 7, 9, 7, 4, 4, 9, 9, 1, 2, 6, 4, ...",-999
3,seg_003339,"[2, 6, 3, 6, 8, 6, 8, 5, 4, 6, 2, 3, 1, 4, 6, ...",-999
4,seg_0042cc,"[5, 3, 1, 4, 6, 6, 7, 4, 5, 4, 3, 4, 6, 7, 3, ...",-999


At this point, we split the training data further into a 80/20 training/validation split. We then create dataloaders that will help load the data into the model in parallel using multiprocessing workers.

In [16]:
# Parameters
batch_size = 8
params = {'batch_size': batch_size,
          'shuffle': True,
          'num_workers': 16}

def get_df_features(start_index, end_index, chunk_size, mode="trainval"):
    feats = []
    for i in tqdm(range(len(trainval_df))):
        tv_seg = np.asarray(trainval_df["segment"][i])
        tmp = []
        for j in range(start_index, end_index, chunk_size):
            x = pd.Series(tv_seg[j:j+chunk_size])
            tmp.append([
                x.mean(), # mean
                x.std(), # std
                x.max(), # max
                x.min(), # min
                np.mean(np.diff(x)), # mean change abs
                np.abs(x).max(), # abs max
                np.abs(x).min(), # abs min
                x.max() / np.abs(x.min()), # max to min
                x.max() - np.abs(x.min()), # max to min diff
                len(x[np.abs(x) > 500]), # count big
                x.sum(), # sum
                np.quantile(x, 0.95), # q95
                np.quantile(x, 0.99), # q99
                np.quantile(x, 0.05), # q05
                np.quantile(x, 0.01), # q01
                np.quantile(np.abs(x), 0.95), # abs q95
                np.quantile(np.abs(x), 0.99), # abs q99
                np.quantile(np.abs(x), 0.05), # abs q05
                np.quantile(np.abs(x), 0.01), # abs q01
                np.abs(x).mean(), # abs mean
                np.abs(x).std(), # abs std
                x.mad(), # mad
                x.kurtosis(), # kurt
                x.skew(), # skew
                x.median(), # med
                x.sum() # sum
            ])
        feats.append(tmp)
    np.save(f"./data/{mode}_features15.npy", np.asarray(feats))
    return feats

seg = []
#trainval_feats = get_df_features(0, 150000, 10000)
#test_feats = get_df_features(0, 150000, 10000)
trainval_feats = np.load("./data/trainval_features15.npy").tolist()

train_val = LANLDataset(trainval_feats, trainval_df["target"].to_numpy())
#test = LANLDataset(test_feats, test_df["target"].to_numpy())
train, val = train_val.train_val_split(0.7, 0.3)

datasets = {'train': train,
            'valid': val}
            #'test' : test}
dataloaders = {phase: data.DataLoader(dataset, **params)
               for phase, dataset in datasets.items()}

### Defining the Model

In [17]:
class LANLModel(nn.Module):
    def __init__(self, device, input_dim=1, hidden_dim=64, output_dim=1, batch_size=64, num_layers=1):
        super(LANLModel, self).__init__()
        self.device = device
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        self.batch_size = batch_size
        self.num_layers = num_layers
        self.rnn = nn.LSTM(self.input_dim, self.hidden_dim, self.num_layers, batch_first=True)
        self.linear = nn.Linear(self.hidden_dim, self.output_dim)
        self.to(self.device)
        
    def init_hidden(self):
        return (
            torch.zeros(self.num_layers, self.batch_size, self.hidden_dim).to(self.device),
            torch.zeros(self.num_layers, self.batch_size, self.hidden_dim).to(self.device)
        )
        
    def forward(self, x):
        #x = x.unsqueeze(2)
        rnn_out, _ = self.rnn(x, self.init_hidden())
        out = self.linear(rnn_out[:,-1,:])
        return out

### Training the Model

In [18]:
from IPython.display import clear_output

device = torch.device("cuda")
model = LANLModel(device, input_dim=26, hidden_dim=100, num_layers=2, batch_size=batch_size)
criterion = nn.L1Loss()
optimizer = optim.Adam(model.parameters(), lr=0.0001)

def train():
    model.train(mode=True)
    for epoch in range(30):
        print("Epoch " + str(epoch))
        train_loss = 0
        for idx, sample in enumerate(dataloaders["train"]):
            if idx == len(dataloaders["train"]) - 1:
                continue
            data, targets = sample["data"].to(device), sample["target"].to(device)
            model.zero_grad()
            outputs = model(data)
            #print(outputs)
            loss = criterion(outputs.float(), targets)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        train_loss /= (len(dataloaders["train"]) - 1)
        print(train_loss)
        
train()

Epoch 0
3.587816712634811
Epoch 1
3.041500401301462
Epoch 2
3.0423623551436463
Epoch 3
3.040751801488178
Epoch 4
3.0450742899394427
Epoch 5
3.0435435052126483
Epoch 6
3.043567792313998
Epoch 7
3.039339923142084
Epoch 8
3.043522341655252
Epoch 9
3.040382200902928
Epoch 10
3.0436629157900157
Epoch 11
3.0410529417418393
Epoch 12
3.0451078056637706
Epoch 13
3.044564887148435
Epoch 14
3.0387568249077095
Epoch 15
3.0377392078357968
Epoch 16
3.043285914783269
Epoch 17
3.04161457285855
Epoch 18
3.041993435940456
Epoch 19
3.035007687865711
Epoch 20
3.047264357082179
Epoch 21
3.0433561740025796
Epoch 22
3.0418013093250047
Epoch 23
3.039602721486587
Epoch 24
3.0443828666145034
Epoch 25
3.0426980260291385
Epoch 26
3.044612945266109
Epoch 27
3.041947224101082
Epoch 28
3.0437127869637286
Epoch 29
3.040161385887959


### Evaluating the Model on the Test Data

In [None]:
def LANL_test():
    return

In [None]:
[x.shape for x in model.parameters()]