# Exploratory Data Analysis of the acoustic signal in the frequency domain

In this notebook we will explore the shape and behavior of the frequency spectrum of the acoustic signals (training data only).
We will explore the following:
* The average power spectrum of the training data.
* The average power spectrum as a function of the time to failure.


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
sampling_frequency = 4e6
time_step = 1./sampling_frequency

chunksize = 150_000

train = pd.read_csv('../input/train.csv', iterator=True, 
                    chunksize=chunksize, 
                    dtype={'acoustic_data': np.int16, 'time_to_failure': np.float32})

First, we compute the frequency power spectrum for every chunk. Since the full spectrum of each chunk is too large to keep in memory, we downscale it by:
* Smoothing the data using a 10-point moving window.
* Keeping one of every 10 samples.
* Keeping only frequencies < 450 kHz (higher frequencies have a negligible contribution).

Keeping low memory requirements facilitates the display of the data using interactive plots.
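
To get a feel for how much these three steps shrink a segment's spectrum, here is a minimal sketch on synthetic noise. It is NumPy-only: `np.fft.rfft` is used as a stand-in for `scipy.signal.periodogram`, and `np.convolve` as a stand-in for the pandas rolling mean, so the numbers are illustrative rather than the notebook's exact pipeline. The random segment and the helper name `reduce_spectrum` are assumptions made for this example.

```python
import numpy as np

sampling_frequency = 4e6   # Hz, as in the notebook
chunksize = 150_000        # samples per segment, as in the notebook

def reduce_spectrum(x, fs, window=10, step=10, f_max=450_000):
    """Hypothetical helper: smooth, band-limit, and decimate a one-sided power spectrum."""
    x = np.asarray(x, dtype="float64")
    x = x - x.mean()                                  # remove the mean
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)       # bin spacing is fs / N
    power = np.abs(np.fft.rfft(x)) ** 2 / x.size**2   # power per bin (one-sided doubling omitted)
    # Moving average over `window` bins; mode="valid" drops the incomplete windows,
    # which mirrors rolling(window).mean().dropna().
    smooth = np.convolve(power, np.ones(window) / window, mode="valid")
    smooth_freqs = freqs[window - 1:]                 # label each window by its last bin
    keep = smooth_freqs < f_max                       # discard high frequencies
    return smooth_freqs[keep][::step], smooth[keep][::step]

rng = np.random.default_rng(0)
segment = rng.normal(size=chunksize)                  # synthetic stand-in for acoustic_data
freqs, power = reduce_spectrum(segment, sampling_frequency)

print("frequency resolution [Hz]:", sampling_frequency / chunksize)
print("raw bins:", chunksize // 2 + 1, "-> reduced bins:", freqs.size)
```

With these settings, a 150,000-sample segment has 75,001 raw frequency bins spaced about 26.7 Hz apart, and the reduction keeps on the order of 1,700 of them.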

In [None]:
from scipy import signal

def get_spectrum(input_signal):
    """
    Get a pandas Series with the Fourier power spectrum for a given signal segment.
    """
    input_signal = np.asarray(input_signal.values, dtype='float64')
    
    # Remove the mean
    input_signal -= input_signal.mean()
    
    # Estimate the power spectrum using a periodogram
    # (scaling='spectrum' returns power per bin rather than a density).
    frequencies, power_spectrum = signal.periodogram(input_signal, sampling_frequency, scaling='spectrum')
    
    # Smooth the spectrum with a 10-point running average.
    # The first 9 values are NaN; the caller drops them.
    power_spectrum = pd.Series(power_spectrum, index=frequencies).rolling(window=10).mean()
    
    return power_spectrum

def get_segment_spectrum(segment_df):
    """
    Get the Fourier power spectrum of a given segment.
    
    Returns the quake_time, frequencies, and power_spectrum.
    """
    
    # Time to failure at the end of the segment.
    quake_time = segment_df['time_to_failure'].values[-1]
    
    # Drop the NaNs introduced by the rolling average.
    _power_spectrum = get_spectrum(segment_df['acoustic_data']).dropna()
    
    # Keep only frequencies < 450 kHz (higher frequencies have a negligible contribution).
    _power_spectrum = _power_spectrum[_power_spectrum.index < 450_000]
    
    # Keep one of every 10 samples.
    power_spectrum = _power_spectrum.values[::10]
    frequencies = _power_spectrum.index.values[::10]
    
    return quake_time, frequencies, power_spectrum

Let's compute the frequency power spectrum for every segment.

In [None]:
quake_times = list()       
power_spectrums = list()

for df in train:
    # Skip the last chunk if it is shorter than a full segment.
    if df.shape[0] < chunksize:
        continue
    
    quake_time, frequencies, power_spectrum = get_segment_spectrum(df)
    quake_times.append(quake_time)
    power_spectrums.append(power_spectrum)

power_spectrums = np.stack(power_spectrums, axis=0)
quake_times = pd.Series(quake_times)    

In [None]:
print("power_spectrums.shape:",power_spectrums.shape)
print("power_spectrums size:", power_spectrums.nbytes/(1024**2.), "[MiB]")

## Explore the average power spectrum

First, let's see what the mean power spectrum looks like. We will display the results using Bokeh so they can be explored interactively.

In [None]:
from bokeh.layouts import column, row
from bokeh.plotting import Figure, show
from bokeh.io import output_notebook
from bokeh.models import PrintfTickFormatter
from bokeh.models import LinearAxis, Range1d

output_notebook() # Display Bokeh plots inline in a classic Jupyter notebook

# Compute the average spectrum
average_power_spectrum = pd.Series(power_spectrums.mean(axis=0))
average_power_spectrum = average_power_spectrum.rolling(window=10).mean()

# "@foo{(0.00 a)}" # formats 1230974 as: 1.23 m
TOOLTIPS = [ ("x", "@x{(0.00 a)}Hz"), ("y", "$y")   ]

pl = Figure(plot_width=800, plot_height=400,title="Average power spectrum",tooltips=TOOLTIPS)
pl.line(frequencies, average_power_spectrum, line_color="navy")
pl.xaxis[0].formatter = PrintfTickFormatter(format="%d")
pl.xaxis.axis_label = "Frequency [Hz]"
pl.yaxis.axis_label = "Power [V**2]"
show(pl)

## Normalized average spectrum
The previous plot provides only a general picture of the signal spectrum. Segments with a very low total power (the integral of the power spectrum) are averaged together with segments whose total power is very large. As a result, features or peaks present in the low-power spectra may be masked.

Let's now plot the mean of the spectra after normalizing each one by its total power. This shows the mean shape of the spectrum, independent of the total energy of each segment.
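
As a toy illustration of why the normalization matters, the sketch below averages two made-up spectra, one with 100 times more total power than the other. The values are invented for the example and are not taken from the training data.

```python
import numpy as np

# Two toy spectra over four frequency bins (made-up values).
spectra = np.array([
    [100.0, 0.0, 0.0, 0.0],   # high-energy segment, all power in bin 0
    [0.0,   0.0, 1.0, 0.0],   # low-energy segment, all power in bin 2
])

# Plain average: the high-energy segment dominates.
plain_mean = spectra.mean(axis=0)

# Divide each row by its own total so that every spectrum sums to 1.
normalized = spectra / spectra.sum(axis=1, keepdims=True)
normalized_mean = normalized.mean(axis=0)

print("plain mean:     ", plain_mean)
print("normalized mean:", normalized_mean)
```

In the plain mean, the peak in bin 2 is 100 times smaller than the one in bin 0. After normalization both peaks carry the same weight, so the shape of the low-energy spectrum is no longer hidden.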

In [None]:
normalized_spectrums = power_spectrums/(power_spectrums.sum(axis=1)[:,np.newaxis])
average_power_spectrum = pd.Series(normalized_spectrums.mean(axis=0))
average_power_spectrum = average_power_spectrum.rolling(window=10).mean()

TOOLTIPS = [ ("x", "@x{(0.00 a)}Hz"), ("y", "$y")   ]
pl = Figure(plot_width=800, plot_height=400, title="Average of normalized power-spectrums",
            tooltips=TOOLTIPS)
pl.line(frequencies, average_power_spectrum, line_color="navy")
pl.xaxis[0].formatter = PrintfTickFormatter(format="%d")
pl.xaxis.axis_label = "Frequency [Hz]"
pl.yaxis.axis_label = "Normalized Power []"
show(pl)

We see that, on average, the power spectrum has 5 regions where it reaches a local maximum:
* Small maximum: 35-60 kHz
* Medium peak: 70-110 kHz
* Two contiguous maxima: 120-156 kHz and 156-200 kHz
* Large maximum: 200-300 kHz
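
One possible way to turn these regions into numbers is to measure the fraction of the total power that falls inside each band. The sketch below is only an illustration of that idea, not necessarily how features were computed for this kernel: the helper `band_power_fractions` and the flat toy spectrum are assumptions, and only the band edges come from the list above.

```python
import numpy as np

# Band edges in Hz, taken from the list above.
BANDS = {
    "35-60 kHz": (35e3, 60e3),
    "70-110 kHz": (70e3, 110e3),
    "120-156 kHz": (120e3, 156e3),
    "156-200 kHz": (156e3, 200e3),
    "200-300 kHz": (200e3, 300e3),
}

def band_power_fractions(frequencies, spectrum, bands=BANDS):
    """Hypothetical helper: fraction of the total power that falls inside each band."""
    frequencies = np.asarray(frequencies, dtype="float64")
    spectrum = np.asarray(spectrum, dtype="float64")
    total = spectrum.sum()
    fractions = {}
    for name, (f_low, f_high) in bands.items():
        # Half-open interval [f_low, f_high) so contiguous bands do not overlap.
        in_band = (frequencies >= f_low) & (frequencies < f_high)
        fractions[name] = spectrum[in_band].sum() / total
    return fractions

# Synthetic flat spectrum on a 1 kHz grid from 0 to 449 kHz (made-up data).
toy_frequencies = np.arange(0, 450e3, 1e3)
toy_spectrum = np.ones_like(toy_frequencies)

fractions = band_power_fractions(toy_frequencies, toy_spectrum)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.3f}")
```

For a flat spectrum each fraction is simply proportional to the band width, which makes the helper easy to sanity-check before applying it to `frequencies` and a row of `normalized_spectrums`.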

# Time to failure composites

In the last section, we showed the average value and the average shape of the spectrum over the entire training dataset. That analysis masked out any dependence of the spectrum shape on the remaining time to failure (quake time).

In this section, we explore the relationship between the spectrum shape and the remaining time to failure. For that, we will compute the normalized spectrum averaged over consecutive time-to-failure intervals.
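
As a toy illustration of what such a composite is, the sketch below groups four made-up segments into time-to-failure intervals of width `dt` and averages the spectra inside each interval. The arrays are invented for the example, and the interval assignment with `np.floor` is an assumption chosen for brevity.

```python
import numpy as np

dt = 0.2  # width of each time-to-failure interval, as in the notebook

# Made-up data: four segments, three frequency bins each.
toy_quake_times = np.array([0.05, 0.15, 0.25, 0.35])
toy_spectra = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])

# Assign each segment to an interval: 0 for [0, dt), 1 for [dt, 2*dt), ...
interval_index = np.floor(toy_quake_times / dt).astype(int)
n_intervals = interval_index.max() + 1

# Average the spectra of all segments that fall in the same interval.
composite_mean = np.zeros((n_intervals, toy_spectra.shape[1]))
for n in range(n_intervals):
    members = interval_index == n
    if members.any():
        composite_mean[n] = toy_spectra[members].mean(axis=0)

print(composite_mean)
```

Each row of `composite_mean` is the mean spectrum of the segments whose time to failure falls in that interval; intervals without any segment are left as zeros.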


In [None]:
# Create normalized spectrum composites for mean, std, min and max.

dt = 0.2 # Interval of time to aggregate a composite
max_quake_time = quake_times.max()
quake_times_intervals = np.arange(dt/2, max_quake_time+dt/2+0.01, dt)    

composite_spectrums_mean = np.zeros((quake_times_intervals.size, frequencies.size))
composite_spectrums_std = np.zeros((quake_times_intervals.size, frequencies.size))
composite_spectrums_min = np.zeros((quake_times_intervals.size, frequencies.size))
composite_spectrums_max = np.zeros((quake_times_intervals.size, frequencies.size))

normalized_spectrums = power_spectrums/(power_spectrums.sum(axis=1)[:,np.newaxis])

for n , quake_time in enumerate(quake_times_intervals):
    t0 = quake_time-dt/2
    t1 = quake_time+dt/2
    segments = quake_times[(quake_times>=t0) & (quake_times<t1)].index.values
    if len(segments) > 0:
        composite_spectrums_mean[n] = normalized_spectrums[segments,:].mean(axis=0)
        composite_spectrums_std[n] = normalized_spectrums[segments,:].std(axis=0)
        composite_spectrums_min[n] = normalized_spectrums[segments,:].min(axis=0)
        composite_spectrums_max[n] = normalized_spectrums[segments,:].max(axis=0)

print("composite_spectrums_mean.shape:",composite_spectrums_mean.shape)
print("composite_spectrums_mean size:", composite_spectrums_mean.nbytes/(1024**2.), "[MiB]")

Next, we will make an interactive visualization of the mean normalized spectrum for each quake time interval.

In [None]:
from bokeh.models import CustomJS, Slider
from bokeh.plotting import figure, output_file, show, ColumnDataSource

_quake_time = quake_times_intervals[0]

TOOLTIPS = [ ("x", "@frequencies{(0.00 a)}Hz"), ("y", "$y")   ]

pl = Figure(plot_width=800, plot_height=400, title= f"{_quake_time-dt/2} <= Quake time < {_quake_time+dt/2}",
           y_range=(0, 4e-3), tooltips=TOOLTIPS)

data_dict = dict()
for i in range(quake_times_intervals.size):    
    data_dict[str(i)] = composite_spectrums_mean[i]
all_data = ColumnDataSource(data=data_dict)
source = ColumnDataSource(data=dict(frequencies=frequencies, composite_mean=composite_spectrums_mean[0]))

pl.line('frequencies', 'composite_mean', line_color="navy", source=source)
pl.xaxis[0].formatter = PrintfTickFormatter(format="%d")

                            
callback = CustomJS(args=dict(source=source, 
                              all_data=all_data,plot=pl,
                              quake_times_intervals=quake_times_intervals,
                              dt=dt), 
                    code="""
    var data = source.data;
    var interval = slider.value;
    var composite_means = all_data.data[String(interval)];    
    var y = data['composite_mean']
    for (var i = 0; i < y.length; i++) {
        y[i] = composite_means[i];
    }
    var _quake_time = quake_times_intervals[interval]
    console.log(String(_quake_time-dt/2));
    console.log(String(_quake_time+dt/2));
    plot.title.text = String((_quake_time-dt/2).toFixed(1)) + " <= Quake time < " + String((_quake_time+dt/2).toFixed(1));
    source.change.emit();
""")

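# The last valid interval index is shape[0] - 1; a larger value has no matching column in all_data.
time_slider = Slider(start=0, end=composite_spectrums_mean.shape[0] - 1, value=0, step=1,
                     title="Quaketime interval number", callback=callback)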
callback.args["slider"] = time_slider
                            
layout = column(time_slider,  pl)
show(layout)

# Final remarks

In this notebook, we explored the characteristics of the Fourier spectrum. We found that the main contributions come from 5 regions of the frequency spectrum. I tried computing the common features over those frequency intervals, but observed only a marginal improvement over the usual computations.

In addition, the normalized power spectra have a similar shape for different quake times, suggesting that the shape is independent of the remaining time to failure. The motivation for this analysis was to find unusual trends or relationships that could be used as new features. However, I couldn't find any information interesting enough to use for new features. I may well have missed something, though.

I hope that you found this kernel useful, or at least interesting.

Andrés