# Sensor Data Analysis

Analyse "Bag of sensors" data from PhysioNet: https://physionet.org/physiobank/database/noneeg/

Upsampling/Downsampling

Feature Extraction
- Statistical
- Spectral

Modeling

## Data Introduction

## Load Data

Each subject has several datafiles:
- SubjectN_AccTempEDA.atr: annotation
- SubjectN_AccTempEDA.dat: data
- SubjectN_AccTempEDA.hea: header
- SubjectN_SpO2HR.dat: data
- SubjectN_SpO2HR.hea: header

These files are in the WFDB (WaveForm DataBase) format and can be read using the `wfdb` Python package.
(https://github.com/MIT-LCP/wfdb-python)

https://www.physionet.org/standards/npsg/Moody.pdf

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.tsaplots import plot_acf

# Note: matplotlib 3.6+ renamed this style to 'seaborn-v0_8-white'
plt.style.use('seaborn-white')

# pip install wfdb
import wfdb

# render plots inline
%matplotlib inline

### Acc Temp EDA

In [None]:
ann = wfdb.rdann('./data/physionet/Subject10_AccTempEDA', extension='atr', summarize_labels=True)
print(ann.__dict__)

In [None]:
record_acc_temp_eda = wfdb.rdrecord('./data/physionet/Subject10_AccTempEDA')
print(record_acc_temp_eda.__dict__)

wfdb.plot_wfdb(record=record_acc_temp_eda, title='Subject10_AccTempEDA', annotation=ann, plot_sym=True, 
               time_units='seconds', figsize=(15, 10))

In [None]:
data_acc_temp_eda = record_acc_temp_eda.p_signal
data_acc_temp_eda.shape

### SpO2 HR

In [None]:
record_spo2_hr = wfdb.rdrecord('./data/physionet/Subject10_SpO2HR')
print(record_spo2_hr.__dict__)

wfdb.plot_wfdb(record=record_spo2_hr, title='Subject10_SpO2HR', time_units='seconds', figsize=(15, 5))

In [None]:
data_spo2_hr = record_spo2_hr.p_signal
data_spo2_hr.shape

In [None]:
# number of acceleration, etc samples per second
record_acc_temp_eda.fs

In [None]:
# number of SpO2 and HR samples per second
record_spo2_hr.fs

## Aligning data of different frequencies

The two datasets have different sampling frequencies (samples per second): 8 for Acc Temp EDA and 1 for SpO2 HR.

To support processing both datasets at the same time, we need to match the frequencies.

This is a common situation when taking readings from different sensors or data sources.

Two strategies:
1. Upsampling the lower-frequency data, e.g. repeat samples or interpolate between them.
2. Downsampling the higher-frequency data, e.g. replace each window of samples with its mean or median.

Which one to pick depends on requirements, in particular whether you need to keep the precision of the higher-frequency dataset.

Example: https://machinelearningmastery.com/resample-interpolate-time-series-data-python/
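The two strategies can be sketched on a small synthetic example. The series `slow` and `fast` below are hypothetical stand-ins for a 1 Hz and an 8 Hz sensor, not the PhysioNet data, and a timestamp index is assumed instead of the period index used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical signals: 4 seconds of a 1 Hz sensor and of an 8 Hz sensor
slow = pd.Series(
    [60.0, 62.0, 64.0, 66.0],
    index=pd.date_range('2019-01-01', periods=4, freq='1s'),
)
fast = pd.Series(
    np.arange(32, dtype=float),
    index=pd.date_range('2019-01-01', periods=32, freq='125ms'),
)

# Strategy 1: upsample the 1 Hz series to 8 Hz, filling the gaps linearly
slow_up = slow.resample('125ms').interpolate()

# Strategy 2: downsample the 8 Hz series to 1 Hz, using the median of each second
fast_down = fast.resample('1s').median()

print(len(slow_up), len(fast_down))
```

Upsampling keeps every original sample but invents the values in between. Downsampling discards detail but only reports values derived from real measurements.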

### Option 1: Upsampling SpO2 HR to 8 samples per second

In [None]:
# create an index with 1 second timestamps, using the length of data_spo2_hr
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.period_range.html
#
# frequency strings: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

# for this dataset, the start date is just an arbitrary reference
per_second_index = pd.period_range(start='2019-01-01', periods=len(data_spo2_hr), freq='S')
per_second_index

In [None]:
# create a dataframe for SpO2 data using the above period index
df_spO2_hr = pd.DataFrame(data_spo2_hr, index=per_second_index, columns=record_spo2_hr.sig_name)
df_spO2_hr.head()

In [None]:
# upsample to match the frequency of the other data (8 times)

In [None]:
factor = record_acc_temp_eda.fs / record_spo2_hr.fs
factor

In [None]:
# resample, then interpolate
# Note: whether interpolation makes sense depends on the sensor and type of data
upsampled = df_spO2_hr.resample('125ms')

df_upsampled = upsampled.interpolate()
df_upsampled.head(10)

In [None]:
df_upsampled.info()

In [None]:
# Note: there are fewer values in the Acc dataframe, so we need to ignore the
# later entries from df_upsampled.

df_acc_temp_eda = pd.DataFrame(data_acc_temp_eda, columns=record_acc_temp_eda.sig_name)
df_acc_temp_eda.info()

In [None]:
df_acc_temp_eda.index = df_upsampled.index[:len(df_acc_temp_eda)]
df_acc_temp_eda.info()

In [None]:
# concatenate the two dataframes, column-wise
df_option1 = pd.concat([df_acc_temp_eda, df_upsampled], axis=1).dropna()
df_option1.head()

In [None]:
df_option1.info()

In [None]:
# https://stackoverflow.com/questions/48126330/python-int-too-large-to-convert-to-c-long-plotting-pandas-dates
df_option1.index = pd.to_datetime(df_option1.index.to_timestamp())

df_option1.plot(figsize=(15, 10))
ax = plt.gca()
ax.set_title('Upsampled Data')
plt.show()

In [None]:
df_option1.info()

### Option 2: Downsampling Acc Temp EDA to 1 sample per second

In [None]:
# create an index with 125 millisecond timestamps, using the length of data_acc_temp_eda
#
# frequency strings: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

# for this dataset, the start date is just an arbitrary reference
per_125_ms_index = pd.period_range(start='2019-01-01', periods=len(data_acc_temp_eda), freq='125ms')
per_125_ms_index

In [None]:
# create a dataframe for Acc Temp EDA using the 125ms period index
df_acc_temp_eda2 = pd.DataFrame(data_acc_temp_eda, index=per_125_ms_index, columns=record_acc_temp_eda.sig_name)
df_acc_temp_eda2.head()

In [None]:
# downsample using median
df_acc_temp_eda_downsampled = df_acc_temp_eda2.resample('S').median()
df_acc_temp_eda_downsampled.head(10)

In [None]:
df_acc_temp_eda_downsampled.info()

In [None]:
df_spo2_hr2 = pd.DataFrame(data_spo2_hr, columns=record_spo2_hr.sig_name, index=per_second_index)
df_spo2_hr2.info()

In [None]:
# concatenate the two dataframes, column-wise
df_option2 = pd.concat([df_acc_temp_eda_downsampled, df_spo2_hr2], axis=1).dropna()
df_option2.head()

In [None]:
df_option2.info()

In [None]:
# Not needed, but for consistency with df_option1
df_option2.index = pd.to_datetime(df_option2.index.to_timestamp())

df_option2.plot(figsize=(15, 10))
ax = plt.gca()
ax.set_title('Downsampled Data')
plt.show()

In [None]:
# Let's zoom into a 1-minute time window and compare the plots

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(15, 10))
start_time = '2019-01-01 00:05'
end_time = '2019-01-01 00:06'

df_option1[(df_option1.index >= start_time) & (df_option1.index < end_time)].plot(ax=ax1)
ax1.set_title('Upsampled (with interpolation)')
df_option2[(df_option2.index >= start_time) & (df_option2.index < end_time)].plot(ax=ax2)
ax2.set_title('Downsampled (with median)')
plt.show()

## Statistical Features

- Mean, median, standard deviation
- Quantisation / discretisation
- Correlation
- Auto-correlation

In [None]:
df = df_option1

In [None]:
df.mean() # mean of each column

In [None]:
df.median() # median is less sensitive to outliers than mean

In [None]:
df.std() # standard deviation

In [None]:
df.max()

In [None]:
df.min()

In [None]:
df.columns

### Discretise into quantiles

Discretisation is useful when there is a lot of noise in the signal.

https://datascience.stackexchange.com/questions/19782/what-is-the-rationale-for-discretization-of-continuous-features-and-when-should
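As a minimal sketch of what quantile discretisation does, the example below applies `pd.qcut` to a hypothetical noisy ramp (synthetic data, not the sensor readings). Each sample is replaced by the index of the decile it falls into.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical noisy signal: a slow ramp plus Gaussian noise
signal = pd.Series(np.linspace(0.0, 1.0, 1000) + rng.normal(0.0, 0.05, 1000))

# labels=False returns the integer bin index (0..9) instead of an interval
levels = pd.qcut(signal, 10, labels=False)

# Quantile bins are equal-frequency: count the samples in each level
counts = levels.value_counts().sort_index()
print(counts)
```

Small fluctuations within a bin collapse to the same level, which is why discretisation can suppress noise, at the cost of resolution.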

In [None]:
df.ax.values.ravel() # raw values

In [None]:
df['ax_q10'] = pd.qcut(df.ax.values.ravel(), 10, labels=False, duplicates='drop')

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
df['ax_q10'].plot(ax=ax)
plt.show()

In [None]:
# histogram showing distribution in the 10 levels
# Note: ax_q10 is already discrete (bin indices), so each bar is a count per level
df['ax_q10'].hist()

In [None]:
df['ay_q10'] = pd.qcut(df.ay.values.ravel(), 10, labels=False, duplicates='drop')
df['az_q10'] = pd.qcut(df.az.values.ravel(), 10, labels=False, duplicates='drop')

In [None]:
# Plotting multiple histograms
df.loc[:, ['ax_q10', 'ay_q10', 'az_q10']].hist()
plt.show()

### Pair-plot

Pair plots are a combination of scatter plots and histograms.

One scatter plot is drawn for each pair of features (e.g. ax vs. ay), with each feature's own distribution on the diagonal.

https://seaborn.pydata.org/generated/seaborn.pairplot.html

In [None]:
sns.pairplot(df)

### Correlation

Correlation coefficients measure how strongly two variables move together. Pearson captures linear relationships, while Spearman and Kendall are rank-based and capture monotonic ones.

https://www.statisticssolutions.com/correlation-pearson-kendall-spearman/
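To illustrate the difference between the methods, here is a small sketch on a hypothetical dataframe `demo` (synthetic columns, not the sensor data). One column is a linear function of `x` and the other is monotonic but non-linear.

```python
import numpy as np
import pandas as pd

x = np.linspace(0.1, 5.0, 200)
demo = pd.DataFrame({
    'x': x,
    'linear': 2.0 * x + 1.0,  # perfectly linear in x
    'exp': np.exp(x),         # monotonic in x, but not linear
})

pearson = demo.corr(method='pearson')
spearman = demo.corr(method='spearman')

print(pearson.loc['x'])
print(spearman.loc['x'])
```

Pearson rates the exponential column below 1 because the relationship is not a straight line, whereas Spearman only looks at rank order and so treats both columns as perfectly correlated with `x`.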

In [None]:
df.corr(method='pearson')

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(df.corr(method='pearson'), annot=True, fmt='.2f')
plt.show()

### Auto-correlation

Computes the correlation of a series with itself at progressively longer time steps (lags).

https://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/

Pearson’s correlation coefficient is a number between -1 and 1, where negative values indicate a negative correlation and positive values a positive one. A value of zero indicates no linear correlation.

We can calculate the correlation between time series observations and observations at previous time steps, called lags. Because the correlation is calculated against values of the same series at previous times, this is called a serial correlation, or an autocorrelation.

Confidence intervals are drawn as a cone. By default, this is set to a 95% confidence interval, suggesting that correlation values outside of this cone are very likely a real correlation and not a statistical fluke.
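To make the idea of a lag concrete, the following sketch computes the autocorrelation by hand with NumPy. The helper `autocorr` and the sine wave `periodic` are illustrative assumptions, not part of the dataset or of `statsmodels`; the helper uses the common estimator that divides by the lag-0 sum of squares.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at a non-negative integer lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()  # remove the mean before correlating
    if lag == 0:
        return 1.0
    # correlate the series with a copy of itself shifted by `lag` samples
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# Hypothetical periodic signal with a period of 20 samples
t = np.arange(400)
periodic = np.sin(2 * np.pi * t / 20)

for lag in (0, 10, 20):
    print(lag, autocorr(periodic, lag))
```

The value is strongly negative at half a period (lag 10) and strongly positive at a full period (lag 20), which is the pattern to look for in an autocorrelation plot of a periodic signal.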

In [None]:
columns = ['ax', 'ay', 'az', 'SpO2', 'EDA', 'hr']
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))
axes = axes.flatten()

for ax, c in zip(axes, columns):
    plot_acf(df[c], ax=ax)
    ax.set_title(f'Autocorrelation: {c}')

plt.show()

In [None]:
lags = 30

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))
axes = axes.flatten()
for ax, c in zip(axes, columns):
    plot_acf(df[c], ax=ax, lags=lags)
    ax.set_title(f'Autocorrelation: {c}, lags: {lags}')

plt.show()

## Spectral Features

- FFT: https://ipython-books.github.io/101-analyzing-the-frequency-components-of-a-signal-with-a-fast-fourier-transform/
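
As a minimal sketch of a spectral feature, the example below uses NumPy's FFT to find the dominant frequency of a hypothetical signal sampled at 8 samples per second. The signal is synthetic (two sine waves at assumed frequencies of 1.5 Hz and 0.25 Hz), not one of the sensor channels.

```python
import numpy as np

fs = 8                          # samples per second
t = np.arange(0, 60, 1 / fs)    # 60 seconds of timestamps

# Hypothetical signal: a strong 1.5 Hz component plus a weaker 0.25 Hz one
signal = np.sin(2 * np.pi * 1.5 * t) + 0.5 * np.sin(2 * np.pi * 0.25 * t)

# rfft returns the spectrum for non-negative frequencies of a real signal
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# skip the zero-frequency (mean) bin when looking for the dominant frequency
dominant = freqs[np.argmax(spectrum[1:]) + 1]
print(dominant)
```

The highest frequency that can be resolved is half the sampling rate (here 4 Hz), so the choice between upsampling and downsampling also limits which spectral features are available.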