# LANL Earthquake Prediction Kaggle Competition 2019
### Eric Yap, Joel Huang, Kyra Wang

---

In this notebook, we present our work for the LANL Earthquake Prediction Kaggle Competition 2019. The goal of this competition is to use seismic signals to predict the timing of laboratory earthquakes. The data comes from a well-known experimental set-up used to study earthquake physics. The `acoustic_data` input signal is used to predict the time remaining before the next laboratory earthquake (`time_to_failure`).

The training data is a single, continuous segment of experimental data. The test data consists of a folder containing many small segments. The data within each test file is continuous, but the test files do not represent a continuous segment of the experiment; thus, the predictions cannot be assumed to follow the same regular pattern seen in the training file.

For each `seg_id` in the test folder, we need to predict a single `time_to_failure` corresponding to the time between the last row of the segment and the next laboratory earthquake.

---

### Imports

In [1]:
%load_ext autoreload
%autoreload 2

# Data wrangling imports
import numpy as np
import pandas as pd

# Utility imports
import ast
from tqdm import tqdm
from joblib import Parallel, delayed

# Data visualization imports
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.ticker as tick
import seaborn as sns

# PyTorch imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as data
from torchvision import transforms

# Custom stuff
from data import LANLDataset, FeatureGenerator

# Setting the seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)  # seeds the CPU random number generator
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)  # seeds every visible GPU

### Data preprocessing

As the training data and the test data are formatted differently, we must either preprocess the data so that both sets share the same format, or ensure that our model can predict on the two different formats. We went with the first option because it is less time-consuming to implement.

We did this by splitting the training data into segments the same size as the test data segments, i.e. 150000 data points each. Each segment is labeled with a single `time_to_failure` corresponding to the time between the last row of the segment and the next laboratory earthquake. We then put each of these segments into a single dataframe, and saved this as a pickle file to be used as our training data.
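The segmentation step described above can be sketched roughly as follows. This is a minimal illustration rather than the preprocessing code we actually ran: the function name `split_into_segments` is hypothetical, and it assumes non-overlapping segments with any trailing remainder dropped.

```python
import numpy as np

SEGMENT_LEN = 150_000  # length of each test segment, as stated in the text


def split_into_segments(acoustic_data, time_to_failure, segment_len=SEGMENT_LEN):
    """Cut one continuous recording into non-overlapping, fixed-length segments.

    Returns (segments, targets). Each target is the time_to_failure at the
    last row of its segment. A trailing remainder shorter than segment_len
    is dropped (an assumption of this sketch).
    """
    n_segments = len(acoustic_data) // segment_len
    usable = n_segments * segment_len
    segments = np.asarray(acoustic_data[:usable]).reshape(n_segments, segment_len)
    # Index of the last row of every segment
    last_rows = np.arange(1, n_segments + 1) * segment_len - 1
    targets = np.asarray(time_to_failure)[last_rows]
    return segments, targets


# Tiny synthetic stand-in for the real recording (hypothetical values)
signal = np.arange(10)
ttf = np.linspace(1.0, 0.1, 10)
segments, targets = split_into_segments(signal, ttf, segment_len=4)
print(segments.shape)  # (2, 4)
print(targets)         # time_to_failure at rows 3 and 7
```

With the real data, `segment_len` would stay at its default of 150000, and each row of `segments` would become one `segment` entry in the dataframe.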

Following this, we merged the separate test segments into another single dataframe, and saved this as a pickle file to be used as our test data.

Because the dataset is large, we used Joblib to run the preprocessing functions as parallel jobs.
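For readers unfamiliar with Joblib, the pattern looks roughly like the sketch below. The per-segment function `summarize_segment` and the `raw_segments` dictionary are hypothetical stand-ins, not our actual preprocessing functions. The thread-based backend is requested here only to keep the example lightweight; Joblib's default process-based backend is usually the better fit for CPU-heavy work.

```python
from joblib import Parallel, delayed


def summarize_segment(seg_id, values):
    """Hypothetical per-segment job: return the id with the segment mean."""
    return seg_id, sum(values) / len(values)


# Hypothetical stand-ins for segments loaded from disk
raw_segments = {"seg_a": [1, 2, 3], "seg_b": [4, 5, 6], "seg_c": [7, 8, 9]}

# Parallel returns the results in the same order as the submitted tasks
results = Parallel(n_jobs=2, prefer="threads")(
    delayed(summarize_segment)(seg_id, values)
    for seg_id, values in raw_segments.items()
)
print(results)
```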

In [2]:
trainval_df = pd.read_pickle('./data/train_features.pkl')
trainval_df = trainval_df[:-2]
trainval_df.head()

|   | seg_id | segment | target |
|---|---|---|---|
| 0 | train_0 | [12, 6, 8, 5, 8, 8, 9, 7, -5, 3, 5, 2, 2, 3, -... | 1.430797 |
| 1 | train_1 | [5, 6, 8, 6, 3, -1, 5, 4, 4, 4, 6, 5, 5, 5, 6,... | 1.391499 |
| 2 | train_2 | [5, 5, 8, 9, 9, 10, 11, 12, 13, 5, 3, 7, 5, 3,... | 1.353196 |
| 3 | train_3 | [5, -5, -4, 1, 3, 4, 6, 12, 15, 17, 14, 9, 6, ... | 1.313798 |
| 4 | train_4 | [12, 6, 4, -1, 0, 6, 7, 6, 2, -2, 0, 4, 1, 5, ... | 1.2744 |


In [3]:
test_df = pd.read_pickle('./data/test_features.pkl')
test_df.head()

|   | seg_id | segment | target |
|---|---|---|---|
| 0 | seg_00030f | [4, 0, -2, 0, 2, -3, -9, -4, 11, 11, 8, 1, 10,... | -999 |
| 1 | seg_0012b5 | [5, 8, 8, 7, 4, 1, -1, -4, -1, 0, 5, 7, -1, 7,... | -999 |
| 2 | seg_00184e | [8, 2, 3, 8, 7, 9, 7, 4, 4, 9, 9, 1, 2, 6, 4, ... | -999 |
| 3 | seg_003339 | [2, 6, 3, 6, 8, 6, 8, 5, 4, 6, 2, 3, 1, 4, 6, ... | -999 |
| 4 | seg_0042cc | [5, 3, 1, 4, 6, 6, 7, 4, 5, 4, 3, 4, 6, 7, 3, ... | -999 |


At this point, we split the training data further into training and validation sets. The code below passes fractions of 0.7 and 0.3 to `train_val_split`, i.e. a 70/30 split. We then create dataloaders that load the data into the model in parallel using multiprocessing workers.

In [4]:
# Parameters
batch_size = 16
params = {'batch_size': batch_size,
          'shuffle': True,
          'num_workers': 16}


transform = transforms.Compose([])

train_val = LANLDataset(trainval_df['segment'], trainval_df['target'], transforms=transform, spec=True)
test_set = LANLDataset(test_df['segment'], test_df['target'])
train_set, val_set = train_val.train_val_split(0.7, 0.3)

datasets = {'train': train_set,
            'valid': val_set,
            'test' : test_set}

dataloaders = {phase: data.DataLoader(dataset, **params)
               for phase, dataset in datasets.items()}
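The internals of `LANLDataset.train_val_split` are not shown in this notebook. As a rough illustration of what a fractional split involves, the sketch below shuffles sample indices and cuts them at the requested fraction. The helper `split_indices` is hypothetical, and whether our own method shuffles before splitting is an assumption of this example.

```python
import numpy as np


def split_indices(n_samples, train_frac, seed=42):
    """Shuffle the indices 0..n_samples-1 and cut them into train/validation parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_train = int(round(train_frac * n_samples))
    return shuffled[:n_train], shuffled[n_train:]


train_idx, val_idx = split_indices(n_samples=10, train_frac=0.7)
print(len(train_idx), len(val_idx))  # 7 3
```

Every index lands in exactly one of the two parts, so no segment is shared between training and validation.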

### Defining the Model

In [5]:
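class QuakeNet(nn.Module):
    """Three conv blocks followed by two fully connected layers; outputs one value per sample."""
    def __init__(self):
        super(QuakeNet, self).__init__()
        # Each block: 3x3 convolution -> batch norm -> ReLU.
        # With kernel_size=3 and padding=2, every conv grows height and width by 2.
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU())
        self.layer2 = nn.Sequential(
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU())
        self.layer3 = nn.Sequential(
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=3))  # 3x3 max pooling with stride 3
        # 564480 = 32 channels * pooled height * pooled width, so it is tied to the input size
        self.fc1 = nn.Linear(564480, 1024)
        self.fc2 = nn.Linear(1024, 1)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = self.layer3(out)
        out = out.reshape(out.size(0), -1)  # flatten to (batch, features)
        out = self.fc1(out)  # note: no activation between fc1 and fc2
        out = self.fc2(out)
        return out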
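The input size of `fc1` (564480) follows from the layer arithmetic: each convolution with `kernel_size=3`, `stride=1`, `padding=2` adds 2 to the height and width, and the final max pool with kernel and stride 3 divides them by roughly 3. The sketch below traces that arithmetic in plain Python. The input shape of 129 x 1172 is a hypothetical example that happens to reproduce 564480; the actual shape produced by `LANLDataset` with `spec=True` is not shown in this notebook.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(height, width, channels=32):
    """Trace one spatial shape through QuakeNet's three conv layers and the final pool."""
    for _ in range(3):  # layer1..layer3: kernel 3, stride 1, padding 2
        height = conv_out(height, kernel=3, stride=1, padding=2)
        width = conv_out(width, kernel=3, stride=1, padding=2)
    height = conv_out(height, kernel=3, stride=3)  # MaxPool2d(kernel_size=3, stride=3)
    width = conv_out(width, kernel=3, stride=3)
    return channels * height * width


# Hypothetical input shape: 129 frequency bins x 1172 time frames
print(flattened_features(129, 1172))  # 564480
```

Because this number is hard-coded in `fc1`, any change to the spectrogram dimensions requires recomputing it.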

### Training the Model

In [6]:
from IPython.display import clear_output

device = torch.device("cuda:1")
model = QuakeNet().to(device)
criterion = nn.L1Loss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

def train():
    model.train()
    n_batches = len(dataloaders['train'])
    for epoch in range(30):
        print("Epoch " + str(epoch))
        train_loss = 0
        for idx, sample in enumerate(dataloaders['train']):
            # Skip the final batch, which may be smaller than batch_size
            if idx == n_batches - 1:
                continue
            # Named `inputs` to avoid shadowing the imported torch.utils.data module
            inputs, target = sample['data'].to(device), sample['target'].to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs.float(), target)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        # Average over the batches actually used
        train_loss /= (n_batches - 1)
        print("Epoch {} Avg loss: {}".format(epoch, train_loss))
        
train()

Epoch 0
Epoch 0 Avg loss: 131.39980873253828
Epoch 1
Epoch 1 Avg loss: 11.890554880183902
Epoch 2
Epoch 2 Avg loss: 5.776798233959844
Epoch 3
Epoch 3 Avg loss: 3.2237349865866487
Epoch 4
Epoch 4 Avg loss: 3.0594637178983843
Epoch 5
Epoch 5 Avg loss: 3.035147022679855
Epoch 6
Epoch 6 Avg loss: 3.0357434404352324
Epoch 7
Epoch 7 Avg loss: 3.0290121302578616
Epoch 8
Epoch 8 Avg loss: 3.031217481920628
Epoch 9


KeyboardInterrupt: 
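When a regression loss plateaus as it does above, one thing worth checking is the shapes passed to the criterion. The final `nn.Linear(1024, 1)` returns outputs of shape `(batch, 1)`. If `sample['target']` has shape `(batch,)`, which this notebook does not show, the elementwise difference broadcasts to a `(batch, batch)` matrix instead of comparing each prediction with its own label. The NumPy sketch below illustrates that broadcasting rule with made-up values; it is an analogy to check against, not a diagnosis of this run.

```python
import numpy as np

outputs = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1), like a (batch, 1) model output
target = np.array([1.0, 2.0, 3.0])         # shape (3,), a flat vector of labels

# Mismatched shapes broadcast to a (3, 3) matrix of all pairwise differences
pairwise = np.abs(outputs - target)
print(pairwise.shape)   # (3, 3)
print(pairwise.mean())  # nonzero even though every prediction equals its label

# Aligning the shapes first gives the intended mean absolute error
aligned = np.abs(outputs.reshape(-1) - target)
print(aligned.mean())   # 0.0
```

If the shapes do differ in practice, reshaping one side so that both match before computing the loss removes the ambiguity.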

### Evaluating the Model on the Test Data