# Introduction to ECG Analysis with WESAD

Take some time to explore the folder structure of WESAD in the path ``WESAD_ROOT`` specified below. There is a readme file with information about the dataset and one folder per subject of the study. We have selected a random ``subject`` for our exploratory analysis. Inside the subject folder there are multiple signal files, questionnaires, and information about the data collection. However, the data to be used for inference are all gathered in the ``.pkl`` file, which we will load:

In [None]:
import pickle

WESAD_ROOT = "/home/kavra/Datasets/physio/WESAD"
subject = "S10"

with open(f"{WESAD_ROOT}/{subject}/{subject}.pkl", "rb") as f:
    subject_data = pickle.load(f, encoding="latin1")

The pickle files are typically large, so expect loading to take some time, especially the first time you execute the cell. Let's take a look at the structure of the file we loaded:

In [None]:
# Print the keys of the pickle file
print(subject_data.keys())

# TODO: Print the 'subject' element
...

# TODO: Print the 'label' element
...

# TODO: Print the keys of the 'signal' element
...

# Plot the labels over the course of the experiment
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 3))
plt.plot(subject_data["label"])
plt.title("Course of the Experiment")
plt.xlabel("Time (samples)")
plt.ylabel("Task Label")
plt.grid(axis="y", linestyle="--", alpha=0.5)
plt.show()

The pickle file is a dictionary with 3 entries, each holding different information:
* ``subject_data["subject"]`` contains the subject name (here it is ``S10``)
* ``subject_data["label"]`` is a list of integers (0 to 7) indicating the phase of the experiment for each sample
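
To complement the time-course plot, a per-label sample count can be computed with ``np.unique``. The snippet below is a minimal sketch on a tiny made-up array; ``labels_demo`` is a hypothetical stand-in for ``subject_data["label"]``, which is far longer in practice.

```python
import numpy as np

# Hypothetical stand-in for subject_data["label"] (the real array is much longer)
labels_demo = np.array([0, 0, 1, 1, 1, 2, 2, 0, 3, 4, 4, 0])

# Count how many samples carry each label value
values, counts = np.unique(labels_demo, return_counts=True)
label_counts = dict(zip(values.tolist(), counts.tolist()))

print(label_counts)
```

Dividing each count by the sampling rate would give the duration of each phase in seconds.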

The ``signal`` entry contains our data. As you may recall, WESAD employs 2 sensors to detect physiological signals, one at the ``wrist`` (Empatica) and one around the ``chest`` (RespiBAN). We are going to use the ECG signals from the ``chest`` sensor:

In [None]:
chest_data = subject_data["signal"]["chest"]
print(chest_data.keys())

RespiBAN collects the 6 physiological signals you see above as keys. Here we focus on the ECG modality. Let's plot it using the matplotlib library:

In [None]:
ecg_data = chest_data["ECG"].squeeze()

plt.figure(figsize=(20, 3))
plt.plot(ecg_data)
plt.title(f"ECG Signal from {subject}")
plt.xlabel("Time (samples)")
plt.ylabel("Normalized Signal Amplitude")
plt.show()

We cannot really observe much in this plot, since it represents the ECG signal of the whole experiment. Let's plot a random short interval of 20 seconds. If you do not remember the sampling rate of RespiBAN, or any other WESAD-related information, you can find it in the readme file located in ``WESAD_ROOT``.

In [None]:
# TODO: Specify the sampling rate of RespiBAN
sampling_rate = ...
win_length = ...

# TODO: Select a random 20-second window of the ECG signal
import numpy as np

win_start = np.random.randint(0, len(ecg_data) - win_length)
win_end = ...

plt.figure(figsize=(20, 3))

# TODO: Plot the selected window of the ECG signal
time_axis = np.arange(win_start, win_end) / sampling_rate
ecg_sample = ecg_data[win_start:win_end]

plt.plot(...)
plt.title(f"20-second ECG Sample from {subject}")
plt.xlabel("Time (s)")
plt.ylabel("Normalized Signal Amplitude")
plt.show()

You can run the above cell multiple times to get a sense of the whole ECG waveform. Note that in this plot we also altered the time axis to represent actual time in seconds rather than samples. Hopefully you can already detect the structure of a typical ECG signal in the above graph: the repeated heartbeat pattern and the prominent ECG peaks, mainly the R peaks of the signal.
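
The conversion between samples and seconds can be sketched as follows. This is an illustrative example on synthetic noise, assuming the 700 Hz RespiBAN rate mentioned later in this notebook; ``ecg_demo`` and ``rng`` are hypothetical names that are not part of the assignment code.

```python
import numpy as np

sampling_rate = 700                        # Hz, assumed RespiBAN sampling rate
win_seconds = 20
win_length = win_seconds * sampling_rate   # window length in samples

# Hypothetical stand-in for the ECG signal: 60 seconds of random noise
rng = np.random.default_rng(0)
ecg_demo = rng.standard_normal(60 * sampling_rate)

# Pick a random start so that the window fits inside the signal
win_start = int(rng.integers(0, len(ecg_demo) - win_length))
win_end = win_start + win_length

# Dividing sample indices by the sampling rate yields time in seconds
time_axis = np.arange(win_start, win_end) / sampling_rate
ecg_sample = ecg_demo[win_start:win_end]

print(win_length, len(ecg_sample), time_axis[0], time_axis[-1])
```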

However, you might also notice that some parts are noisy or fluctuating over time. We therefore need to clean and preprocess this signal. ECG signal preprocessing typically includes the following steps, which you will implement in order:

1. Downsampling: 700 Hz is a very high sampling frequency for ECG. The type of information we are interested in usually falls below 100 Hz.
2. Bandpass filtering: ECG can be influenced both by low-frequency noise (e.g., breathing) and high-frequency noise (e.g., muscle movement, electromagnetic interference). Hence, we typically apply a bandpass filter to clean the signal. Bandpass filters come with ``low`` and ``high`` frequency thresholds and, when applied to the signal, they attenuate the frequency components outside this range. Here we will use the filters of the ``scipy`` library with a low of 0.67 Hz and a high of 45 Hz.
3. Normalization: In some cases, e.g., when we want to evaluate within-subject effects over time, we choose to normalize (or z-score) the signals at hand. In our case, the whole ECG signal is already z-scored, as you can see in the first plot. Hence we do not normalize the sample.
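
The first two steps can be sketched on a synthetic signal as shown below. This is a minimal illustration under stated assumptions, not the reference solution: ``preprocess_sketch`` and ``demo`` are hypothetical names, and the filter is applied with ``sosfiltfilt`` on second-order sections rather than ``filtfilt`` on ``(b, a)`` coefficients, a substitution that tends to be numerically more robust for low cutoff frequencies.

```python
import numpy as np
from scipy.signal import butter, resample_poly, sosfiltfilt

def preprocess_sketch(signal, fs_in=700, fs_out=100,
                      lowcut=0.67, highcut=45.0, order=5):
    # 700 Hz -> 100 Hz is an integer decimation by 7;
    # resample_poly applies an anti-aliasing filter internally
    down = fs_in // fs_out
    downsampled = resample_poly(signal, up=1, down=down)

    # Butterworth bandpass in second-order sections (assumed variant)
    sos = butter(order, [lowcut, highcut], btype="band",
                 fs=fs_out, output="sos")

    # Forward-backward filtering, so the output has no phase shift
    return sosfiltfilt(sos, downsampled)

# Synthetic test signal: 10 s of a 5 Hz tone with a constant offset
t = np.arange(7000) / 700
demo = np.sin(2 * np.pi * 5 * t) + 0.5
clean = preprocess_sketch(demo)

print(clean.shape)
```

In this example the constant offset lies below the 0.67 Hz cutoff and is removed, while the 5 Hz tone lies inside the passband and is preserved.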

In [None]:
# TODO: downsample to 100Hz
from scipy.signal import resample_poly
downsampled = ...

# TODO: Bandpass filtering
from scipy.signal import butter, filtfilt

# 5th order Butterworth bandpass filter from scipy
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.butter.html
lowcut, highcut = ...
b, a = butter(5, [lowcut, highcut], btype="band", fs=100)
filtered = filtfilt(b, a, downsampled)

# TODO: plot the processed ECG signal
plt.figure(figsize=(20, 3))
...

Let's wrap up the analysis and clean the whole dataset, so that we can use it in the next steps of the assignment. You should implement the following for each subject in the dataset:

1. Load the data using ``pickle``
2. Isolate only the ECG signal from RespiBAN
3. Select only the samples from the labels of interest. As a reminder, ``data["label"]`` contains a list with task numbers from 0 to 7. Legend:
    * 0 = not defined
    * 1 = baseline
    * 2 = stress
    * 3 = amusement
    * 4 = meditation
    * 5/6/7 = should be ignored
    
    For our assignment, we will only need data for label = 1 and label = 2 (baseline vs stress). Get those in separate variables.
4. Preprocess the ECG data of both groups just like above, and save the results in ``.npy`` files. This will take some time (15 subjects in total).
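
Step 3 can be illustrated on toy arrays as follows. The sketch assumes that the label array and the ECG signal have the same length and are aligned sample by sample; ``labels_demo`` and ``ecg_demo`` are hypothetical stand-ins for the real data.

```python
import numpy as np

# Hypothetical, sample-aligned label and ECG arrays
labels_demo = np.array([0, 1, 1, 1, 0, 2, 2, 0, 3, 3])
ecg_demo = np.arange(10, dtype=float) * 0.1

# np.where returns a tuple of index arrays; [0] selects the first (only) axis
baseline_indices = np.where(labels_demo == 1)[0]
stress_indices = np.where(labels_demo == 2)[0]

ecg_baseline = ecg_demo[baseline_indices]
ecg_stress = ecg_demo[stress_indices]

print(baseline_indices, stress_indices)
```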

In [None]:
# These are some helper functions

def get_WESAD_subjects():
    initial = list(range(1, 18))
    subject_list = [f"S{i}" for i in initial]
    subject_list.remove("S1")
    subject_list.remove("S12")
    return subject_list

def preprocess_ECG(ecg_signal, sampling_rate=700):
    # TODO: Write the function
    ...
    return filtered

In [None]:
# Create the folder for the cleaned data
import os

clean_dir = "WESAD_clean"
os.makedirs(clean_dir, exist_ok=True)

for subject in get_WESAD_subjects():
    print(f"Processing {subject}")

    # TODO: Load the subject's data
    ...

    # TODO: Extract the ECG signal
    ecg_data = ...
    label_data = ...

    # TODO: Isolate label 1 (baseline) and label 2 (stress) using np.where
    baseline_indices = ...
    stress_indices = ...

    ecg_data_baseline = ecg_data[baseline_indices]
    ecg_data_stress = ecg_data[stress_indices]

    # TODO: Preprocess the ECG signals using the function you wrote
    ecg_data_baseline = ...
    ecg_data_stress = ...

    # Save the preprocessed ECG signals
    np.save(f"{clean_dir}/{subject}_baseline.npy", ecg_data_baseline)
    np.save(f"{clean_dir}/{subject}_stress.npy", ecg_data_stress)

print("Done!")

That is the end of this notebook. Make sure all the clean files have been generated. You can then return and proceed to the next part of the assignment.

In [None]:
check_data = np.load(f"{clean_dir}/S10_baseline.npy")
assert check_data.shape[0] == 118000, "Something went wrong :("
assert len(os.listdir(clean_dir)) == 15 * 2, "Something went wrong :("
print("All good!")
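
If the file-count assertion fails, it can help to see which files are missing. The helper below is a small sketch for that purpose; ``find_missing_files`` is a hypothetical function that is not part of the assignment, and the demo runs on a temporary directory rather than the real ``clean_dir``.

```python
import os
import tempfile

def find_missing_files(clean_dir, subjects, conditions=("baseline", "stress")):
    # Build the expected file names and report those not found on disk
    expected = [f"{s}_{c}.npy" for s in subjects for c in conditions]
    present = set(os.listdir(clean_dir))
    return [name for name in expected if name not in present]

# Demo on a temporary directory with one file deliberately left out
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("S2_baseline.npy", "S2_stress.npy", "S3_baseline.npy"):
        open(os.path.join(demo_dir, name), "wb").close()
    missing = find_missing_files(demo_dir, ["S2", "S3"])

print(missing)
```

On the real data, the call would presumably look like ``find_missing_files(clean_dir, get_WESAD_subjects())``.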