# LANL Earthquake Prediction Kaggle Competition 2019
### Eric Yap, Joel Huang, Kyra Wang

---

In this notebook, we present our work for the LANL Earthquake Prediction Kaggle Competition 2019. The goal of this competition is to use seismic signals to predict the timing of laboratory earthquakes. The data comes from a well-known experimental set-up used to study earthquake physics. The `acoustic_data` input signal is used to predict the time remaining before the next laboratory earthquake (`time_to_failure`).

The training data is a single, continuous segment of experimental data. The test data consists of a folder containing many small segments. The data within each test file is continuous, but the test files do not represent a continuous segment of the experiment; thus, the predictions cannot be assumed to follow the same regular pattern seen in the training file.

For each `seg_id` in the test folder, we need to predict a single `time_to_failure` corresponding to the time between the last row of the segment and the next laboratory earthquake.
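As a rough illustration of the expected output, the sketch below builds a submission-style table with one `time_to_failure` value per `seg_id`. The three segment IDs are taken from the test files, and the zero predictions are placeholders rather than model outputs.

```python
import pandas as pd

# A few segment IDs from the test folder, used here only for illustration.
seg_ids = ['seg_00030f', 'seg_0012b5', 'seg_00184e']

# Placeholder predictions; a trained model would supply these values.
predictions = [0.0, 0.0, 0.0]

# One row per test segment, one predicted time_to_failure per row.
submission = pd.DataFrame({'seg_id': seg_ids, 'time_to_failure': predictions})
print(submission)
```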

---

### Imports

In [10]:
# Data wrangling imports
import numpy as np
import pandas as pd

# Utility imports
from tqdm import tqdm
from joblib import Parallel, delayed

# Data visualization imports
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.ticker as tick
import seaborn as sns

# PyTorch imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as data
from torchvision import transforms

In [9]:
# CUDA for PyTorch
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
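# Seed the default generator; torch.manual_seed_all does not exist in PyTorch
torch.manual_seed(42)
if use_cuda:
    # Also seed every CUDA device explicitly
    torch.cuda.manual_seed_all(42)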

### Data preprocessing

Because the training data and the test data are formatted differently, we must either preprocess both sets into the same format or ensure that our model can predict on two different formats. We chose the first option because it is less time-consuming to implement.

We did this by splitting the training data into segments of the same size as the test segments, i.e. 150,000 data points each. Each segment is labeled with a single `time_to_failure` corresponding to the time between the last row of the segment and the next laboratory earthquake. We then put all of these segments into a single dataframe and saved it as a CSV file to be used as our training data.
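The splitting and labeling step can be sketched on toy data as follows. This is a minimal illustration, not the notebook's pipeline: `split_into_segments` is a hypothetical helper, the segment size is 4 instead of 150,000, and, as an assumption of this sketch, a trailing partial segment is dropped.

```python
import numpy as np


def split_into_segments(acoustic, ttf, segment_size):
    """Cut a continuous signal into fixed-size segments.

    Each segment is labelled with the time_to_failure of its last row.
    A trailing partial segment is dropped (a choice made for this sketch).
    """
    n_full = len(acoustic) // segment_size
    segments, targets = [], []
    for i in range(n_full):
        start, stop = i * segment_size, (i + 1) * segment_size
        segments.append(acoustic[start:stop])
        targets.append(ttf[stop - 1])  # label taken from the last row
    return np.stack(segments), np.array(targets)


# Toy stand-ins for the acoustic_data and time_to_failure columns.
acoustic = np.arange(10, dtype=np.int16)
ttf = np.linspace(1.0, 0.1, num=10)

segments, targets = split_into_segments(acoustic, ttf, segment_size=4)
print(segments.shape)  # (2, 4)
print(targets)
```

With 10 rows and a segment size of 4, the last two rows do not fill a segment and are left out.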

Following this, we merged the separate test segments into another single dataframe, and saved this as a CSV file to be used as our test data.

As the dataset is massive, we used Joblib to run the processing functions as parallel pipeline jobs.
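The Joblib pattern used for this is sketched below on toy data. `summarize` is a hypothetical stand-in for the per-segment work, and the four small arrays stand in for real segments. `Parallel` returns results in the same order as the inputs.

```python
import numpy as np
from joblib import Parallel, delayed


def summarize(segment):
    # Hypothetical per-segment work: here, just the mean of the signal.
    return float(np.mean(segment))


# Four toy segments, each filled with a constant value 0..3.
segments = [np.full(5, fill_value=i, dtype=np.int16) for i in range(4)]

# delayed(...) wraps each call so that Parallel can schedule it on a worker.
results = Parallel(n_jobs=2, backend='threading')(
    delayed(summarize)(s) for s in segments
)
print(results)  # [0.0, 1.0, 2.0, 3.0]
```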

In [6]:
class FeatureGenerator(object):
    def __init__(self, dtype, n_jobs=1, chunk_size=None):
        self.chunk_size = chunk_size
        self.dtype = dtype
        self.filename = None
        self.n_jobs = n_jobs
        self.test_files = []
        if self.dtype == 'train':
            self.filename = './data/train.csv'
            # Round up so the final, partial chunk is counted in the progress total
            self.total_data = int(np.ceil(629145481 / self.chunk_size))
        else:
            submission = pd.read_csv('./data/sample_submission.csv')
            for seg_id in submission.seg_id.values:
                self.test_files.append((seg_id, './data/test/' + seg_id + '.csv'))
            self.total_data = int(len(submission))

    def read_chunks(self):
        if self.dtype == 'train':
            iter_df = pd.read_csv(self.filename, iterator=True, chunksize=self.chunk_size,
                                  dtype={'acoustic_data': np.int16, 'time_to_failure': np.float64})
            for counter, df in enumerate(iter_df):
                x = df.acoustic_data.values
                y = df.time_to_failure.values[-1]
                seg_id = 'train_' + str(counter)
                yield seg_id, x, y
        else:
            for seg_id, f in self.test_files:
                df = pd.read_csv(f, dtype={'acoustic_data': np.int16})
                x = df.acoustic_data.values
                yield seg_id, x, -999

    def features(self, x, y, seg_id):
        feature_dict = dict()
        feature_dict['target'] = y
        feature_dict['segment'] = x
        feature_dict['seg_id'] = seg_id

        # create features here
        # for example:
        # feature_dict['mean'] = np.mean(x)

        return feature_dict

    def generate(self):
        # Build one feature dict per segment in parallel, then collect them into a dataframe
        res = Parallel(n_jobs=self.n_jobs,
                       backend='threading')(delayed(self.features)(x, y, s)
                                            for s, x, y in tqdm(self.read_chunks(), total=self.total_data))
        return pd.DataFrame(res)


training_fg = FeatureGenerator(dtype='train', n_jobs=4, chunk_size=150000)
training_data = training_fg.generate()

test_fg = FeatureGenerator(dtype='test', n_jobs=4, chunk_size=None)
test_data = test_fg.generate()

training_data.to_csv("./data/train_features.csv", index=False)
test_data.to_csv("./data/test_features.csv", index=False)

4195it [01:47, 38.87it/s]                          
100%|██████████| 2624/2624 [00:27<00:00, 93.84it/s]
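One caveat worth checking when a column of arrays is written to a text format: NumPy abbreviates the string form of arrays longer than its print threshold (1000 elements by default), so a text representation of a 150,000-sample segment may not contain every sample. The sketch below, which uses an all-zero array as a stand-in for a real segment, shows the abbreviation and one possible alternative (`np.save`) that round-trips the data exactly. It is an illustration only, not part of the original pipeline.

```python
import os
import tempfile

import numpy as np

# Stand-in for one 150,000-sample segment.
segment = np.zeros(150000, dtype=np.int16)

# The default string form elides the middle of a large array with '...'.
text = str(segment)
print('...' in text)

# A binary format keeps every sample.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'segment.npy')
    np.save(path, segment)
    restored = np.load(path)

print(np.array_equal(segment, restored))
```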


In [7]:
training_data.head()

| | seg_id | segment | target |
|---|---|---|---|
| 0 | train_0 | [12, 6, 8, 5, 8, 8, 9, 7, -5, 3, 5, 2, 2, 3, -... | 1.430797 |
| 1 | train_1 | [5, 6, 8, 6, 3, -1, 5, 4, 4, 4, 6, 5, 5, 5, 6,... | 1.391499 |
| 2 | train_2 | [5, 5, 8, 9, 9, 10, 11, 12, 13, 5, 3, 7, 5, 3,... | 1.353196 |
| 3 | train_3 | [5, -5, -4, 1, 3, 4, 6, 12, 15, 17, 14, 9, 6, ... | 1.313798 |
| 4 | train_4 | [12, 6, 4, -1, 0, 6, 7, 6, 2, -2, 0, 4, 1, 5, ... | 1.2744 |


In [8]:
test_data.head()

| | seg_id | segment | target |
|---|---|---|---|
| 0 | seg_00030f | [4, 0, -2, 0, 2, -3, -9, -4, 11, 11, 8, 1, 10,... | -999 |
| 1 | seg_0012b5 | [5, 8, 8, 7, 4, 1, -1, -4, -1, 0, 5, 7, -1, 7,... | -999 |
| 2 | seg_00184e | [8, 2, 3, 8, 7, 9, 7, 4, 4, 9, 9, 1, 2, 6, 4, ... | -999 |
| 3 | seg_003339 | [2, 6, 3, 6, 8, 6, 8, 5, 4, 6, 2, 3, 1, 4, 6, ... | -999 |
| 4 | seg_0042cc | [5, 3, 1, 4, 6, 6, 7, 4, 5, 4, 3, 4, 6, 7, 3, ... | -999 |


### Defining the Model

In [11]:
class LANL_Model(nn.Module):
    def __init__(self):
        super(LANL_Model, self).__init__()
        
    def forward(self, x):
        return x