In this notebook I use three groups of features: global statistics of each sensor signal (mean, max, min, etc.), the spectral density of the sensor signals, and short-time Fourier transform (STFT) features.
To reduce overfitting, I replace the spectral density feature set (which also holds the statistical features) with its principal components: the first 21 components, which explain up to 99.9% of the variance.
Finally, I train an XGBoost model with five-fold validation. The data is sorted by increasing time to eruption before folds are assigned, so that each fold has a similar target distribution.

I referred to the following notebooks for ideas and some code. Thanks to the respective authors.
1. https://www.kaggle.com/soheild91/ingv-nn-xgb-baseline (Getting started)
2. https://www.kaggle.com/amanooo/ingv-volcanic-basic-solution-stft (STFT features)
3. https://www.kaggle.com/kylesnyder/ingv-spectral-density-w-randomforest (Spectral density features)

In [2]:
from tqdm import tqdm

import numpy as np
import pandas as pd

import scipy
from scipy import signal
import random
import pickle

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.decomposition import PCA

import xgboost

# Spectral density features

In [3]:
# Spectral density along with statistical features
input_df = pd.read_csv('../input/predict-volcanic-eruptions-ingv-oe/train.csv')
# 129 PSD bins per sensor x 10 sensors = 1290, plus 12 statistics x 10 sensors = 120
data = np.empty((input_df.shape[0],1410))
#data = np.empty((input_df.shape[0],1290))
time = np.empty((input_df.shape[0],1))

for i,segment in enumerate(tqdm(input_df['segment_id'])):
    temp = pd.read_csv(f'../input/predict-volcanic-eruptions-ingv-oe/train/{segment}.csv').fillna(0)
    temp_arr = np.empty((0,))
    for col in temp.columns:
        freq,psd =signal.welch(temp[col],100)
        temp_arr= np.concatenate((temp_arr,psd))  
    temp_arr = np.concatenate((temp_arr,temp.abs().mean().to_numpy(),
                               temp.std().to_numpy(),
                               temp.mean().to_numpy(),
                               temp.var().to_numpy(),
                               temp.min().to_numpy(),
                               temp.max().to_numpy(),
                               temp.median().to_numpy(),
                               temp.quantile([0.1,0.25,0.5,0.75,0.9]).to_numpy().reshape(1,-1)[0]))
    temp_arr = temp_arr.reshape((1,-1))
    data[i,:] = temp_arr
    time[i,0] = input_df.loc[i,'time_to_eruption']

  0%|          | 20/4431 [00:03<13:21,  5.50it/s]


KeyboardInterrupt: 
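
The 1410-column layout can be checked on synthetic data without reading any segment file. The sketch below is only an illustration: `segment_features` is a hypothetical helper that mirrors the loop body above, and the random 6000 x 10 frame is an assumed stand-in for one segment. With the default `nperseg=256`, `signal.welch` returns 256 // 2 + 1 = 129 bins per sensor.

```python
import numpy as np
import pandas as pd
from scipy import signal

rng = np.random.default_rng(0)
# Synthetic stand-in for one segment: 6000 samples x 10 sensors (assumed shape)
temp = pd.DataFrame(rng.normal(size=(6000, 10)),
                    columns=[f"sensor_{k + 1}" for k in range(10)])

def segment_features(temp, fs=100):
    """Hypothetical helper mirroring the loop body of the cell above."""
    # Welch PSD per sensor; index [1] picks the PSD and drops the frequency axis
    psd_parts = [signal.welch(temp[col], fs)[1] for col in temp.columns]
    # Seven per-sensor statistics, in the same order as in the notebook
    stats = [temp.abs().mean(), temp.std(), temp.mean(), temp.var(),
             temp.min(), temp.max(), temp.median()]
    # Five quantiles per sensor, flattened row by row
    quantiles = temp.quantile([0.1, 0.25, 0.5, 0.75, 0.9]).to_numpy().ravel()
    return np.concatenate(psd_parts + [s.to_numpy() for s in stats] + [quantiles])

features = segment_features(temp)
print(features.shape)
```

The resulting vector has 10 * 129 PSD values followed by 7 * 10 statistics and 5 * 10 quantiles, which matches the width of `data`.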

In [9]:
sample_submission_df=pd.read_csv('../input/predict-volcanic-eruptions-ingv-oe/sample_submission.csv',nrows=5)
data_test=np.empty((sample_submission_df.shape[0],1410))
#data_test=np.empty((sample_submission_df.shape[0],1290))

for i,segment in enumerate(tqdm(sample_submission_df['segment_id'])):
    temp=pd.read_csv(f'../input/predict-volcanic-eruptions-ingv-oe/test/{segment}.csv').fillna(0)
    temp_arr = np.empty((0,))
    for col in temp.columns:
        freq,psd =signal.welch(temp[col],100)
        temp_arr= np.concatenate((temp_arr,psd))
    temp_arr = np.concatenate((temp_arr,temp.abs().mean().to_numpy(),
                                 temp.std().to_numpy(),
                                 temp.mean().to_numpy(),
                                 temp.var().to_numpy(),
                                 temp.min().to_numpy(),
                                 temp.max().to_numpy(),
                                 temp.median().to_numpy(),
                                 temp.quantile([0.1,0.25,0.5,0.75,0.9]).to_numpy().reshape(1,-1)[0]))
    temp_arr = temp_arr.reshape((1,-1))
    data_test[i,:] = temp_arr

100%|██████████| 5/5 [00:00<00:00,  5.53it/s]


In [10]:
train_df_sd = pd.DataFrame(data)
test_df_sd = pd.DataFrame(data_test)

In [17]:
with open('train_df_sd.pkl', 'wb') as f:
    pickle.dump(train_df_sd, f)
    
with open('test_df_sd.pkl', 'wb') as f:
    pickle.dump(test_df_sd, f)

# STFT Features

In [20]:
#STFT(Short Time Fourier Transform) Specifications
fs = 100                # sampling frequency 
#N = len(feature_df)     # data size
N = 60000
n = 256                 # FFT segment size
max_f = 20              # keep frequencies up to ~20 Hz

delta_f = fs / n        # 0.39Hz
delta_t = n / fs / 2    # 1.28s
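
With these settings each STFT bin is `fs / n`, about 0.39 Hz, wide, so a frequency in Hz maps to a bin index via `round(f / delta_f)`. The sketch below only illustrates that arithmetic; `hz_to_bin` and `band_bins` are hypothetical helper names, and the 5-8 Hz band is used purely as an example.

```python
fs = 100                # sampling frequency (Hz)
n = 256                 # FFT segment size
max_f = 20              # highest frequency kept (Hz)
delta_f = fs / n        # width of one frequency bin (Hz)

def hz_to_bin(freq_hz):
    """Hypothetical helper: nearest STFT bin index for a frequency in Hz."""
    return round(freq_hz / delta_f)

def band_bins(lo_hz, hi_hz):
    """Hypothetical helper: half-open bin range [start, stop) for a band in Hz."""
    return hz_to_bin(lo_hz), hz_to_bin(hi_hz)

n_bins_kept = hz_to_bin(max_f) + 1   # number of bins kept when slicing up to max_f
print(delta_f, n_bins_kept, band_bins(5, 8))
```

Because the ranges are half-open, the upper edge bin of a band is not included in that band's sum.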

In [25]:
# path: directory containing the segment CSV files, e.g. ../input/predict-volcanic-eruptions-ingv-oe/test
def make_features(input_df, path):
    feature_set = []
    for i,segment_id in enumerate(input_df['segment_id']):
        temp=pd.read_csv(f'{path}/{segment_id}.csv').fillna(0)
        segment = [segment_id]
        for sensor in temp.columns:
            x = temp[sensor][:N]
            # NOTE: NaNs were already replaced by 0 when the file was read, so this check never fires
            if x.isna().sum() > 1000:     ##########
                segment += ([np.nan] * 10)
                continue
            f, t, Z = scipy.signal.stft(x.fillna(0), fs = fs, window = 'hann', nperseg = n)
            f = f[:round(max_f/delta_f)+1]
            Z = np.abs(Z[:round(max_f/delta_f)+1]).T    # keep bins up to max_f; rows: time, cols: frequency

            th = Z.mean() * 1     ##########
            Z_pow = Z.copy()
            Z_pow[Z < th] = 0
            Z_num = Z_pow.copy()
            Z_num[Z >= th] = 1

            Z_pow_sum = Z_pow.sum(axis = 0)
            Z_num_sum = Z_num.sum(axis = 0)

            A_pow = Z_pow_sum[round(10/delta_f):].sum()
            A_num = Z_num_sum[round(10/delta_f):].sum()
            BH_pow = Z_pow_sum[round(5/delta_f):round(8/delta_f)].sum()
            BH_num = Z_num_sum[round(5/delta_f):round(8/delta_f)].sum()
            BL_pow = Z_pow_sum[round(1.5/delta_f):round(2.5/delta_f)].sum()
            BL_num = Z_num_sum[round(1.5/delta_f):round(2.5/delta_f)].sum()
            C_pow = Z_pow_sum[round(0.6/delta_f):round(1.2/delta_f)].sum()
            C_num = Z_num_sum[round(0.6/delta_f):round(1.2/delta_f)].sum()
            D_pow = Z_pow_sum[round(2/delta_f):round(4/delta_f)].sum()
            D_num = Z_num_sum[round(2/delta_f):round(4/delta_f)].sum()
            segment += [A_pow, A_num, BH_pow, BH_num, BL_pow, BL_num, C_pow, C_num, D_pow, D_num]

        feature_set.append(segment)

    cols = ['segment_id']
    for i in range(10):
        for j in ['A_pow', 'A_num','BH_pow', 'BH_num','BL_pow', 'BL_num','C_pow', 'C_num','D_pow', 'D_num']:
            cols += [f's{i+1}_{j}']
    feature_df = pd.DataFrame(feature_set, columns = cols)
    feature_df['segment_id'] = feature_df['segment_id'].astype('int')
    return feature_df
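
To see what the thresholded band features respond to, the same steps can be run on a synthetic signal. This is a minimal sketch under stated assumptions: a pure 7 Hz sine stands in for a sensor trace, and the `bands` dictionary simply restates the band limits used in `make_features`. `np.where` replaces the in-place masking but computes the same `Z_pow` and `Z_num`.

```python
import numpy as np
import scipy.signal

fs, n, max_f = 100, 256, 20
delta_f = fs / n

t = np.arange(60000) / fs
x = np.sin(2 * np.pi * 7.0 * t)      # synthetic 7 Hz tone, not real sensor data

f, frames, Z = scipy.signal.stft(x, fs=fs, window="hann", nperseg=n)
Z = np.abs(Z[:round(max_f / delta_f) + 1]).T   # rows: time frames, cols: frequency bins

th = Z.mean()                         # same threshold rule as in make_features
Z_pow = np.where(Z >= th, Z, 0.0)     # keep magnitudes that reach the threshold
Z_num = (Z >= th).astype(float)       # count cells that reach the threshold

pow_per_bin = Z_pow.sum(axis=0)
num_per_bin = Z_num.sum(axis=0)

# Band limits in Hz, restated from make_features; None means "up to max_f"
bands = {"A": (10, None), "BH": (5, 8), "BL": (1.5, 2.5), "C": (0.6, 1.2), "D": (2, 4)}
band_pow = {}
for name, (lo, hi) in bands.items():
    stop = None if hi is None else round(hi / delta_f)
    band_pow[name] = pow_per_bin[round(lo / delta_f):stop].sum()

strongest = max(band_pow, key=band_pow.get)
print(strongest, band_pow)
```

Since 7 Hz lies inside the 5-8 Hz range, the `BH` band should carry most of the thresholded magnitude for this input.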

In [26]:
input_df = pd.read_csv('../input/predict-volcanic-eruptions-ingv-oe/train.csv')
path_train = "../input/predict-volcanic-eruptions-ingv-oe/train"
train_df_stft = make_features(input_df,path_train)
train_df_stft = pd.merge(input_df, train_df_stft, on = 'segment_id')
#test
sample_submission_df=pd.read_csv('../input/predict-volcanic-eruptions-ingv-oe/sample_submission.csv')
path_test = "../input/predict-volcanic-eruptions-ingv-oe/test"
test_df_stft = make_features(sample_submission_df,path_test)


In [27]:
with open('train_df_stft.pkl', 'wb') as f:
    pickle.dump(train_df_stft, f)
    
with open('test_df_stft.pkl', 'wb') as f:
    pickle.dump(test_df_stft, f)

# Generate all of the data above in advance, then start from here

In [4]:
train_df_stft = pd.read_pickle('../input/ingv-data/train_df_stft.pkl')
time = train_df_stft['time_to_eruption']
train_df_stft = train_df_stft.drop(['segment_id', 'time_to_eruption'], axis=1)


test_df_stft = pd.read_pickle('../input/ingv-data/test_df_stft.pkl')
test_df_stft = test_df_stft.drop(['segment_id'],axis = 1)

train_df_sd = pd.read_pickle('../input/ingv-data/train_df_sd.pkl')
test_df_sd = pd.read_pickle('../input/ingv-data/test_df_sd.pkl')

In [5]:
train_df_sd.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1400,1401,1402,1403,1404,1405,1406,1407,1408,1409
0,11149.979226,56815.69231,30880.323186,20313.208263,14055.766676,10393.135986,6630.27144,4491.056525,3441.780685,2203.027738,...,383.0,799.0,368.0,493.0,0.0,888.0,676.0,798.0,471.0,767.0
1,8639.155232,53938.159167,96147.057293,91814.140016,61594.287618,35624.272697,31678.645122,24611.483525,14252.425074,8810.443189,...,528.0,782.0,433.0,452.0,318.0,422.0,569.0,456.0,496.0,1146.0
2,5242.809579,28238.359471,23425.556184,16529.638069,14384.604091,12603.827779,6918.195897,4844.703328,4251.058629,3040.150635,...,307.0,934.0,265.0,363.0,0.0,579.0,443.0,448.0,317.0,624.0
3,1634.84841,9772.758686,21485.432441,23905.653426,20394.542189,12118.667708,8169.856366,5419.216459,3400.171831,2440.07755,...,273.0,507.0,236.0,296.0,198.0,390.0,357.0,410.0,284.0,652.0
4,4283.858707,21653.494775,11665.881656,16214.825532,18377.954876,15524.83255,9576.657305,10686.961788,8907.870405,4364.788805,...,329.0,0.0,290.0,330.0,232.0,582.0,398.0,528.0,331.0,698.0


# PCA

In [6]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.999,svd_solver="full") #all components that explain upto 99.9% of the variance
train_pca = pca.fit_transform(train_df_sd)
train_pca = pd.DataFrame(train_pca)
test_pca = pca.transform(test_df_sd)
test_pca = pd.DataFrame(test_pca)
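
Passing a float to `n_components` asks scikit-learn to keep just enough components to exceed that fraction of explained variance, and the projection is fitted on the training rows only. The sketch below shows this on synthetic data; the low-rank matrix `X` is an assumption made for illustration and has nothing to do with the competition features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 50 correlated features driven by 5 latent factors plus small noise
latent = rng.normal(size=(250, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(250, 50))
X_train, X_test = X[:200], X[200:]

pca = PCA(n_components=0.999, svd_solver="full")
train_pca = pca.fit_transform(X_train)   # fit the projection on training rows only
test_pca = pca.transform(X_test)         # reuse the same projection for test rows

print(pca.n_components_, pca.explained_variance_ratio_.sum())
```

`pca.n_components_` reports how many components were actually kept, which is how a figure such as the 21 components mentioned in the introduction can be read off after fitting.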

In [7]:
train_df = pd.concat([train_df_stft,train_pca] ,axis=1)
test_df = pd.concat([test_df_stft,test_pca],axis = 1)

# XGB with stratified k fold 

In [8]:
input_df = pd.read_csv('../input/predict-volcanic-eruptions-ingv-oe/train.csv')
input_df=input_df.copy()
input_df = input_df.sort_values("time_to_eruption")
fold_list = [1,2,3,4,5]
folds = []
for i in range(int((input_df.shape[0]-1)/5)):
    random.shuffle(fold_list)
    folds.extend(fold_list)
folds = folds + [1] #adding a remaining solitary record to fold 1
input_df['fold'] = folds
input_df.head(20)

Unnamed: 0,segment_id,time_to_eruption,fold
590,601524801,6250,5
2709,1658693785,25730,2
1145,1957235969,26929,4
413,442994108,28696,3
1724,1626437563,40492,1
3942,775594946,52074,2
1291,765529516,58525,4
3433,17116587,84289,5
4150,1534910358,84390,1
1126,827803506,87064,3
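
The loop above works because the training set has exactly one row left over after splitting into blocks of five. A more general variant is sketched below; `assign_sorted_folds` is a hypothetical helper, and unlike the cell above it places leftover rows in randomly chosen folds rather than always in fold 1.

```python
import numpy as np
import pandas as pd

def assign_sorted_folds(df, target, n_folds=5, seed=42):
    """Hypothetical helper: sort by target, then give each block of n_folds
    consecutive rows a shuffled set of fold labels 1..n_folds."""
    rng = np.random.default_rng(seed)
    out = df.sort_values(target).copy()
    labels = []
    for start in range(0, len(out), n_folds):
        block = rng.permutation(np.arange(1, n_folds + 1))
        labels.extend(block[:len(out) - start])   # the last block may be shorter
    out["fold"] = labels
    return out

# Synthetic demo frame; the row count is deliberately not a multiple of 5
demo = pd.DataFrame({
    "segment_id": range(23),
    "time_to_eruption": np.random.default_rng(0).integers(0, 10**7, size=23),
})
folded = assign_sorted_folds(demo, "time_to_eruption")
print(folded["fold"].value_counts().sort_index())
```

Each block of five neighbouring targets contributes one row to every fold, so the folds end up with similar target distributions and sizes that differ by at most one.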


In [32]:
train_x, val_x, train_y, val_y=train_test_split(train_df, time,random_state=42, test_size=0.2,shuffle=True)

In [36]:
import xgboost as xgb
import optuna
from optuna.samplers import TPESampler

# XGBRegressor + Optuna
sampler=TPESampler(seed=42)

def create_model_xgb(trial):
    n_estimators=trial.suggest_int("n_estimators",50,300)
    max_depth=trial.suggest_int("max_depth",3,15)
    learning_rate=trial.suggest_uniform("learning_rate",0.0001,0.99)
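    # NOTE: this value is passed to XGBoost as `gamma`, but the trial registers it under the
    # name 'min_data_in_leaf', so that is the key that appears in study1.best_params.
    gamma=trial.suggest_uniform('min_data_in_leaf',0,1)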
    model=xgb.XGBRegressor(
    n_estimators=n_estimators,
    max_depth=max_depth,
    learning_rate=learning_rate,
    gamma=gamma,
    random_state=42)
    
    return model


# Objective function
def objective_xgb(trial):
    model1=create_model_xgb(trial)
    model1.fit(train_x,train_y)
    preds=model1.predict(val_x)
    score=mean_absolute_error(val_y,preds)
    return score


study1=optuna.create_study(direction="minimize", sampler=sampler)
optuna.logging.disable_default_handler()#don't show log
study1.optimize(objective_xgb, n_trials=100)

# Best solution
print(study1.best_params)
print(study1.best_value)
print(study1.best_trial)

[I 2021-01-02 07:19:52,395] A new study created in memory with name: no-name-fa9cd07e-599b-4010-b0e4-3f67a1fb975c


{'n_estimators': 170, 'max_depth': 12, 'learning_rate': 0.04915558807197376, 'min_data_in_leaf': 0.5416341991105386}
2353906.032764938
FrozenTrial(number=98, value=2353906.032764938, datetime_start=datetime.datetime(2021, 1, 2, 7, 31, 54, 45970), datetime_complete=datetime.datetime(2021, 1, 2, 7, 32, 3, 468482), params={'n_estimators': 170, 'max_depth': 12, 'learning_rate': 0.04915558807197376, 'min_data_in_leaf': 0.5416341991105386}, distributions={'n_estimators': IntUniformDistribution(high=300, low=50, step=1), 'max_depth': IntUniformDistribution(high=15, low=3, step=1), 'learning_rate': UniformDistribution(high=0.99, low=0.0001), 'min_data_in_leaf': UniformDistribution(high=1, low=0)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=98, state=TrialState.COMPLETE)


In [40]:
params=study1.best_params
params

{'n_estimators': 170,
 'max_depth': 12,
 'learning_rate': 0.04915558807197376,
 'min_data_in_leaf': 0.5416341991105386}
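
In the tuning cell the value passed as `gamma` is registered with Optuna under the name `min_data_in_leaf`, so that is the key stored in `best_params`. One possible way to make the tuned value take effect when the dictionary is unpacked into `XGBRegressor` is to rename the key first. `to_xgb_params` below is a hypothetical helper, shown on a copy of the dictionary printed above.

```python
def to_xgb_params(best_params):
    """Hypothetical helper: rename the Optuna key to the XGBoost argument it was tuned for."""
    renamed = dict(best_params)          # work on a copy, leave the study output untouched
    if "min_data_in_leaf" in renamed:
        renamed["gamma"] = renamed.pop("min_data_in_leaf")
    return renamed

# Values copied from the output above
best_params = {
    "n_estimators": 170,
    "max_depth": 12,
    "learning_rate": 0.04915558807197376,
    "min_data_in_leaf": 0.5416341991105386,
}
xgb_params = to_xgb_params(best_params)
print(xgb_params)
```

Renaming the suggested parameter to `'gamma'` inside the trial would avoid the mismatch at the source, but it would also change the keys reported by the study.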

In [41]:
predictions = np.zeros(len(test_df))
for fold in tqdm(range(1,6)):
    train_index_list = input_df[input_df['fold'] != fold].index
    test_index_list = input_df[input_df['fold'] == fold].index

    X_train = train_df.iloc[train_index_list]
    y_train = time[train_index_list]
    X_val = train_df.iloc[test_index_list]
    y_val = time[test_index_list]

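    # NOTE: params holds 'min_data_in_leaf', which XGBoost does not recognize (see the warning
    # in the output), so the tuned gamma value is not applied here.
    model = xgboost.XGBRegressor(**params)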
    
    eval_set = [(X_val, y_val)]
    
    model.fit(X_train, y_train,early_stopping_rounds=10,eval_metric='mae', eval_set=eval_set, verbose=False)
    #print(model.evals_result()['validation_0']['mae'][-5:])
    predictions += model.predict(test_df)
predictions = predictions/5

  0%|          | 0/5 [00:00<?, ?it/s]

Parameters: { min_data_in_leaf } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




100%|██████████| 5/5 [00:48<00:00,  9.70s/it]
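
The loop above averages test predictions over the five folds but does not report a validation score. A minimal sketch of collecting out-of-fold predictions alongside the averaged test prediction is given below. Everything in it is an assumption for illustration: the data is synthetic, the `fold` array stands in for the `fold` column, and scikit-learn's `Ridge` is swapped in for `XGBRegressor` so that the example runs without XGBoost.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                          # synthetic training features
y = X @ np.array([3.0, -2.0, 0.5, 1.0]) + 0.1 * rng.normal(size=100)
X_test = rng.normal(size=(10, 4))                      # synthetic test features
fold = np.tile(np.arange(1, 6), 20)                    # stand-in for the 'fold' column

oof = np.zeros(len(y))                                 # out-of-fold predictions
test_pred = np.zeros(len(X_test))
for k in range(1, 6):
    tr, va = fold != k, fold == k
    model = Ridge(alpha=1.0).fit(X[tr], y[tr])         # Ridge stands in for XGBRegressor
    oof[va] = model.predict(X[va])                     # each row is predicted exactly once
    test_pred += model.predict(X_test) / 5             # running average over the folds

oof_mae = mean_absolute_error(y, oof)
print("out-of-fold MAE:", oof_mae)
```

Every training row is scored by a model that never saw it, so the out-of-fold MAE gives a single validation figure for the whole five-fold run.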


In [31]:
sample_submission_df1=pd.read_csv('../input/predict-volcanic-eruptions-ingv-oe/sample_submission.csv')
sample_submission_df1['time_to_eruption']=predictions
sample_submission_df1.to_csv('xgb_5fldst_ft_sdpca_stft2.csv',index=False)