# Sensor Data Analysis

Analyse raw sensor data from PhysioNet: https://physionet.org/physiobank/database/noneeg/

Data: "Bag of Sensors"

Feature Extraction
- Statistical
- Continuous
- Spectral

Modeling

## Load Data

Each subject has several data files:
- SubjectN_AccTempEDA.atr: annotation
- SubjectN_AccTempEDA.dat: data
- SubjectN_AccTempEDA.hea: header
- SubjectN_SpO2HR.dat: data
- SubjectN_SpO2HR.hea: header

These files are in the WFDB format and can be read with the `wfdb` Python package (https://github.com/MIT-LCP/wfdb-python).
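
Before reading a subject, it can help to check that all of its files are present. The following is a minimal sketch, assuming the files sit in a single directory and follow the naming pattern listed above. `expected_files` and `missing_files` are hypothetical helper names for illustration, not part of `wfdb`.

```python
from pathlib import Path

# Record names and extensions taken from the file list above.
# The helper names below are illustrative, not part of wfdb.
RECORD_SUFFIXES = {
    "AccTempEDA": ("atr", "dat", "hea"),
    "SpO2HR": ("dat", "hea"),
}


def expected_files(subject, data_dir="./data/physionet"):
    """Return the paths that make up one subject's records."""
    base = Path(data_dir)
    return [
        base / f"Subject{subject}_{record}.{ext}"
        for record, extensions in RECORD_SUFFIXES.items()
        for ext in extensions
    ]


def missing_files(subject, data_dir="./data/physionet"):
    """Return the expected paths that do not exist on disk."""
    return [path for path in expected_files(subject, data_dir) if not path.exists()]


print(missing_files(10))
```

An empty list means every expected file for that subject was found.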

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
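# 'seaborn-white' was renamed to 'seaborn-v0_8-white' in Matplotlib 3.6
style = 'seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white'
plt.style.use(style)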

# pip install wfdb
import wfdb

# render plots inline
%matplotlib inline

### Acc Temp EDA

In [None]:
ann = wfdb.rdann('./data/physionet/Subject10_AccTempEDA', extension='atr', summarize_labels=True)
print(ann.__dict__)

In [None]:
record_acc_temp_eda = wfdb.rdrecord('./data/physionet/Subject10_AccTempEDA')
print(record_acc_temp_eda.__dict__)

wfdb.plot_wfdb(record=record_acc_temp_eda, title='Subject10_AccTempEDA', annotation=ann, plot_sym=True, 
               time_units='seconds', figsize=(15, 10))

In [None]:
data_acc_temp_eda = record_acc_temp_eda.p_signal
data_acc_temp_eda.shape

### SpO2 HR

In [None]:
record_spo2_hr = wfdb.rdrecord('./data/physionet/Subject10_SpO2HR')
print(record_spo2_hr.__dict__)

wfdb.plot_wfdb(record=record_spo2_hr, title='Subject10_SpO2HR', time_units='seconds', figsize=(15, 5))

In [None]:
data_spo2_hr = record_spo2_hr.p_signal
data_spo2_hr.shape

In [None]:
# number of acceleration, temperature and EDA samples per second
record_acc_temp_eda.fs

In [None]:
# number of SpO2 and HR samples per second
record_spo2_hr.fs

## Aligning data of different frequencies

The two records have different sampling frequencies (number of samples per second).

To process both at the same time, their sampling frequencies need to match.

This is a common situation when combining readings from different sensors or data sources.

Two strategies:
1. Upsample the lower-frequency data, e.g. repeat samples or interpolate between them.
2. Downsample the higher-frequency data, e.g. replace each group of samples with its mean or median.

Which one to pick depends on the requirements, in particular whether you need to keep the time resolution of the higher-frequency data.

Example: https://machinelearningmastery.com/resample-interpolate-time-series-data-python/
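
The rest of this notebook takes the upsampling route. For comparison, here is a minimal sketch of the downsampling strategy on synthetic data. The 8 Hz signal and its values are made up for illustration, and a `DatetimeIndex` is assumed.

```python
import numpy as np
import pandas as pd

# Synthetic 8 Hz signal, 4 seconds long (32 samples); values are illustrative.
fast_index = pd.date_range("2019-01-01", periods=32, freq="125ms")
fast = pd.Series(np.arange(32, dtype=float), index=fast_index, name="signal")

# Downsample to 1 Hz: each output value summarises the 8 samples in that second.
slow_mean = fast.resample("1s").mean()
slow_median = fast.resample("1s").median()

print(slow_mean)
```

Each second collapses to a single value, so the within-second detail of the faster signal is lost.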

### Upsampling SpO2 HR to 8 samples per second

In [None]:
# create a PeriodIndex with one 1-second period per SpO2/HR sample
# (this assumes record_spo2_hr.fs is 1 sample per second)
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.period_range.html

# for this dataset, the start date is just an arbitrary reference
# ('s' is the seconds alias; the uppercase 'S' is deprecated in recent pandas)
per_second_index = pd.period_range(start='2019-01-01', periods=len(data_spo2_hr), freq='s')
per_second_index

In [None]:
# create a dataframe for SpO2 data using the above period index
df_spO2_hr = pd.DataFrame(data_spo2_hr, index=per_second_index, columns=record_spo2_hr.sig_name)
df_spO2_hr.head()

In [None]:
# upsampling factor needed to match the frequency of the other data (8 times)
factor = record_acc_temp_eda.fs / record_spo2_hr.fs
factor
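
The target sample spacing can be derived from the sampling frequencies instead of being typed in by hand. This is a small sketch with assumed rates of 8 and 1 samples per second. `fs_fast`, `fs_slow` and `target_period` are illustrative names.

```python
import pandas as pd

fs_fast = 8  # assumed samples per second of the faster record
fs_slow = 1  # assumed samples per second of the slower record

factor = fs_fast / fs_slow

# Spacing between consecutive samples at the faster rate.
target_period = pd.Timedelta(seconds=1) / fs_fast

print(factor, target_period)
```

On a datetime-indexed frame, `resample` accepts a `Timedelta` as its rule, so `target_period` could stand in for a hard-coded string.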

In [None]:
# resample, then interpolate
# Note: whether interpolation makes sense depends on the sensor and type of data
upsampled = df_spO2_hr.resample('125ms')

df_upsampled = upsampled.interpolate()
df_upsampled.head(10)

In [None]:
df_upsampled.info()
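
To illustrate the two upsampling options, the sketch below compares repeating samples (forward fill) with linear interpolation. It uses a tiny synthetic series with a `DatetimeIndex`, and the heart-rate values are made up.

```python
import pandas as pd

# Synthetic 1 Hz heart-rate readings; the values are illustrative.
index = pd.date_range("2019-01-01", periods=3, freq="1s")
hr = pd.Series([60.0, 68.0, 64.0], index=index, name="hr")

# Upsample to 8 Hz; the new slots between readings have no value yet.
resampler = hr.resample("125ms")

repeated = resampler.ffill()            # hold the last reading until the next one
interpolated = resampler.interpolate()  # straight line between readings

print(pd.DataFrame({"repeated": repeated, "interpolated": interpolated}).head(9))
```

Forward fill only ever reports values the sensor actually produced, while interpolation introduces in-between values that were never measured. Which is preferable depends on the signal.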

In [None]:
# Note: there are fewer values in the Acc dataframe, so we need to ignore the
# later entries from df_upsampled.

df_acc_temp_eda = pd.DataFrame(data_acc_temp_eda, columns=record_acc_temp_eda.sig_name)
df_acc_temp_eda.info()

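In [None]:
# reuse the upsampled timestamps as the index, truncated to the Acc/Temp/EDA length
df_acc_temp_eda.index = df_upsampled.index[:len(data_acc_temp_eda)]
df_acc_temp_eda.info()

In [None]:
# concatenate the two dataframes, column-wise
# (rows past the end of df_acc_temp_eda get NaN in its columns)
df = pd.concat([df_acc_temp_eda, df_upsampled], axis=1)
df.head()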
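
When the two frames differ in length, the way they are joined decides what happens to the extra rows. Here is a minimal sketch on synthetic frames, where the column names and values are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2019-01-01", periods=6, freq="125ms")

# Two synthetic frames on the same 8 Hz grid; the second one is shorter.
long_df = pd.DataFrame({"hr": np.linspace(60, 65, 6)}, index=index)
short_df = pd.DataFrame({"ax": [0.1, 0.2, 0.3, 0.4]}, index=index[:4])

outer = pd.concat([short_df, long_df], axis=1)                # keeps all 6 rows
inner = pd.concat([short_df, long_df], axis=1, join="inner")  # keeps the 4 shared rows

print(outer)
print(inner)
```

The default outer join pads the shorter frame with NaN, while `join="inner"` keeps only timestamps present in both.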

## Statistical Features

In [None]:
df.mean() # mean of each column

In [None]:
df.median() # median is less sensitive to outliers than mean

In [None]:
df.std() # standard deviation

In [None]:
df.max()

In [None]:
df.min()
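
The statistics above can also be gathered into one table with `DataFrame.agg`. The sketch below uses a small synthetic frame standing in for `df`, with made-up column values.

```python
import pandas as pd

# Small synthetic frame standing in for `df`; the values are illustrative.
demo = pd.DataFrame({
    "ax": [0.1, 0.2, 0.3, 0.4, 5.0],  # the last value is an outlier
    "hr": [60.0, 61.0, 62.0, 63.0, 64.0],
})

# One row per statistic, one column per signal.
summary = demo.agg(["mean", "median", "std", "min", "max"])
print(summary)
```

The single outlier in `ax` pulls its mean well above its median, while the two agree for `hr`.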

In [None]:
df.columns

### Discretise into quantiles

In [None]:
df.ax.values.ravel() # raw values

In [None]:
df['ax_q10'] = pd.qcut(df.ax.values.ravel(), 10, labels=False)

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
df['ax_q10'].plot(ax=ax)
plt.show()

In [None]:
# histogram showing distribution in the 10 levels
df['ax_q10'].hist()

In [None]:
df['ay_q10'] = pd.qcut(df.ay.values.ravel(), 10, labels=False)
df['az_q10'] = pd.qcut(df.az.values.ravel(), 10, labels=False)
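
The same discretisation can be applied to several columns in a loop. This is a minimal sketch on synthetic data. The column names mirror the ones used above, and the values are random.

```python
import numpy as np
import pandas as pd

# Synthetic accelerometer-like data; the values are random, not real readings.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["ax", "ay", "az"])

# Add one decile column per axis: 0 is the lowest 10% of values, 9 the highest.
for column in ["ax", "ay", "az"]:
    demo[f"{column}_q10"] = pd.qcut(demo[column], 10, labels=False)

# Quantile bins hold roughly equal numbers of samples.
counts = demo["ax_q10"].value_counts().sort_index()
print(counts)
```

If a signal contains many repeated values, `pd.qcut` can raise an error about duplicate bin edges. Its `duplicates='drop'` option is one way to handle that, at the cost of fewer bins.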