# Effects of taVNS on HRV 
## Description
Two groups (true labels temporarily unknown). Each patient received VNS over several days, nominally with 2 stimulation sessions per day. Subject IDs range from 2020004 to 2020015.

**Warning**: Both the number of days and the number of sessions depend on the subject.
## Objective
- Assess the effects of taVNS on HRV.
- Apply unsupervised clustering to see whether the two groups can be distinguished.

## Input
- nn_interval file (.pickle) for each subject
- info CSV (stimulation_timestamped.csv) indicating when each stimulation begins

**Warning**: the info file was generated automatically and needs a second check.

## Output
- For each group (patient), explore the effect of VNS on HRV (boxplot, pre/during/post-stim)
- Treat day as a variable, giving two independent variables: day and session
- Clustering, including feature engineering (e.g. $HRV_{during} - \frac{HRV_{pre} + HRV_{post}}{2}$ per patient) and unsupervised learning (e.g. k-means, DBSCAN)
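
The contrast feature in the last bullet can be sketched as follows. This is a minimal illustration that assumes RMSSD as the HRV metric (the notebook does not fix a metric); `rmssd` and `stim_contrast` are hypothetical helper names, and the NN intervals are invented toy values.

```python
import numpy as np


def rmssd(nn_ms):
    """Root mean square of successive differences of NN intervals (ms)."""
    nn_ms = np.asarray(nn_ms, dtype=float)
    return float(np.sqrt(np.mean(np.diff(nn_ms) ** 2)))


def stim_contrast(nn_pre, nn_during, nn_post):
    """HRV during stimulation minus the mean of the pre and post HRV."""
    baseline = (rmssd(nn_pre) + rmssd(nn_post)) / 2
    return rmssd(nn_during) - baseline


# Toy NN intervals in ms (made up for illustration, not study data)
pre = [800, 810, 800, 810]
during = [800, 820, 800, 820]
post = [800, 810, 800, 810]
contrast = stim_contrast(pre, during, post)
```

One such value per patient (or per session) could then serve as an input feature for the clustering step; any other HRV metric can be swapped in for `rmssd`.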

## Workflow
### 1. double check df_info
- the length of the stimulation should be 20 min
- there should be two stimulations per patient in a single day
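
The two checks above can be sketched on a toy table. This is only an illustration under stated assumptions: the rows are invented, the timestamps follow the `2/22/21 20:18` format, and consecutive rows are assumed to form clean (on, off) pairs, which the real info file may not satisfy. The names `df_toy`, `sessions` and `sessions_per_day` are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_info (rows invented for illustration)
df_toy = pd.DataFrame({
    "subject": [2020004] * 4,
    "timestamp": ["2/22/21 8:00", "2/22/21 8:20", "2/22/21 20:18", "2/22/21 20:39"],
})

VNS_DURATION_S = 1200  # expected stimulation length: 20 min
TOLERANCE_S = 100      # accepted deviation, in seconds

# Parse the month/day/year hour:minute strings into datetimes
ts = pd.to_datetime(df_toy["timestamp"], format="%m/%d/%y %H:%M")
df_toy = df_toy.assign(ts=ts).sort_values(["subject", "ts"])
df_toy["day"] = df_toy["ts"].dt.date

# Assumption: consecutive rows are (on, off) pairs
on = df_toy.iloc[0::2].reset_index(drop=True)
off = df_toy.iloc[1::2].reset_index(drop=True)
sessions = pd.DataFrame({
    "subject": on["subject"],
    "day": on["day"],
    "duration_s": (off["ts"] - on["ts"]).dt.total_seconds(),
})

# Check 1: each session lasts about 20 min
sessions["duration_ok"] = (sessions["duration_s"] - VNS_DURATION_S).abs() <= TOLERANCE_S

# Check 2: number of sessions per subject and day (two are expected)
sessions_per_day = sessions.groupby(["subject", "day"]).size()
```

Rows where `duration_ok` is false, or days where `sessions_per_day` differs from 2, are the ones worth inspecting by hand.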

## data loading

In [2]:
import pickle
import os
import pandas as pd
import numpy as np
import datetime

data_dir = os.path.expanduser("~/Desktop/GT/ECG_VNS/data")
# os.path.exists(data_dir)  # test

info_filename = os.path.join(data_dir, 'stimulation_timestamped.csv')
df_info = pd.read_csv(info_filename, index_col=0)

## filter bad rows in df_info (drop rows)
### Untackled problem:
- in this version, all data for a day are discarded if more than two plausible stimulation onsets are found on that day
### Improvement to implement:
- loops for list creation and concatenation

In [39]:
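# row-by-row scanning
# criterion: an on/off pair must be about 20 min apart
# timestamp format: 2/22/21 20:18
tolerence = 100  # accepted deviation from the nominal duration, in seconds
VNS_duration = 1200  # nominal stimulation duration, in seconds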
noon_s = 12*60*60

list_subj2filtered_df = []
list_date2filtered_df = []
list_time_s2filtered_df = []
list_switch2filtered_df = []  # 1 stands for on, 0 stands for off

array_subjs = df_info['subject'].unique()
for subj in array_subjs:
    df_per_subj = df_info[df_info['subject'] == subj]
    list_date = []
    list_time_s = []
    for index, row in df_per_subj.iterrows():
        date = row['timestamp'].split('/')[0] + '/' + row['timestamp'].split('/')[1] # month/day
        time = row['timestamp'].split('/')[2].split(' ')[1]
        hour = int(time.split(':')[0])
        minute = int(time.split(':')[1])
        time_s = datetime.timedelta(hours=hour, minutes=minute).total_seconds()
        list_date.append(date)
        list_time_s.append(time_s)
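    list_time_s = np.array(list_time_s)
    list_date = np.array(list_date)
    # drop duplicated rows; two rows are duplicates only if date AND time of day match
    # (deduplicating on time of day alone would also drop sessions that start
    # at the same clock time on different days)
    date_time_keys = np.array([f"{d} {t}" for d, t in zip(list_date, list_time_s)])
    _, unique_indices = np.unique(date_time_keys, return_index=True)
    unique_indices = np.sort(unique_indices)  # restore the original row order
    list_time_s = list_time_s[unique_indices]
    list_date = list_date[unique_indices]
    # the per-day search below keeps only days with at most two on/off pairs
    # (i.e. 2 or 4 timestamps)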
    
    list_time_s_updated = []
    list_date_updated = []
    for date in np.unique(list_date):
        indices2search = np.where(list_date == date)[0]
        find_stim_onset = False
        num_stim_found = 0
        for i in indices2search:
            for j in indices2search[np.where(indices2search == i)[0][0] + 1:]:
                if (VNS_duration - tolerence) < (list_time_s[j] - list_time_s[i]) < (VNS_duration + tolerence):
                    list_time_s_updated.append(list_time_s[i])
                    list_time_s_updated.append(list_time_s[j])
                    find_stim_onset = True
                    num_stim_found +=1
        
                        
        if num_stim_found > 2:  # more than two stim onsets were found
#             list_date_updated = list_date_updated[list_date_updated != date]
            list_time_s_updated = list_time_s_updated[:-2 * num_stim_found]
        else:
            for i in range(num_stim_found):
                list_date_updated.append(date)
    list_subj2filtered_df = list_subj2filtered_df + [subj] * len(list_time_s_updated)
    list_date2filtered_df = list_date2filtered_df + [date for date in list_date_updated for _ in (0, 1)]
    list_time_s2filtered_df = list_time_s2filtered_df + list_time_s_updated
    list_switch2filtered_df = list_switch2filtered_df + [i for j in range(len(list_date_updated)) for i in [1, 0] ]

# build the new df column by column so that each column keeps its own dtype
# (stacking the lists into one NumPy array would cast every value to a string)
df_info_filtered = pd.DataFrame({'subj': list_subj2filtered_df,
                                 'date': list_date2filtered_df,
                                 'time_s': list_time_s2filtered_df,
                                 'switch': list_switch2filtered_df})

# Test properties here via assert. This applies when 
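
A few property checks of this kind could look as follows. This is a sketch, not the notebook's own test: `df_check` reuses the first rows of the filtered table displayed in this notebook, typed numerically, and the three properties (alternating on/off rows, roughly 20 min between on and off, at most two sessions per day) are inferred from the filtering logic above.

```python
import pandas as pd

# First rows of the filtered table shown in this notebook, typed numerically
df_check = pd.DataFrame({
    "subj": [2020004] * 4,
    "date": ["2/22", "2/22", "2/23", "2/23"],
    "time_s": [73080.0, 74340.0, 75780.0, 77040.0],
    "switch": [1, 0, 1, 0],
})

VNS_duration = 1200  # seconds
tolerence = 100      # seconds (identifier spelled as in the notebook)

# Property 1: rows alternate between on (1) and off (0)
assert (df_check["switch"].iloc[0::2] == 1).all()
assert (df_check["switch"].iloc[1::2] == 0).all()

# Property 2: each off time follows its on time by about 20 min
on_s = df_check["time_s"].iloc[0::2].to_numpy()
off_s = df_check["time_s"].iloc[1::2].to_numpy()
durations = off_s - on_s
assert ((durations > VNS_duration - tolerence)
        & (durations < VNS_duration + tolerence)).all()

# Property 3: rows come in pairs, with at most two sessions (4 rows) per day
rows_per_day = df_check.groupby(["subj", "date"]).size()
assert (rows_per_day % 2 == 0).all() and (rows_per_day <= 4).all()
```

To run the checks on the real table, replace `df_check` with `df_info_filtered`; cast `time_s` and `switch` to numeric types first if they are stored as strings.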

In [43]:
# Let's see what it looks like
df_info_filtered.head(12)

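| | subj | date | time_s | switch |
|---|---|---|---|---|
| 0 | 2020004 | 2/22 | 73080.0 | 1 |
| 1 | 2020004 | 2/22 | 74340.0 | 0 |
| 2 | 2020004 | 2/23 | 75780.0 | 1 |
| 3 | 2020004 | 2/23 | 77040.0 | 0 |
| 4 | 2020004 | 2/24 | 27300.0 | 1 |
| 5 | 2020004 | 2/24 | 28560.0 | 0 |
| 6 | 2020004 | 2/24 | 71700.0 | 1 |
| 7 | 2020004 | 2/24 | 72960.0 | 0 |
| 8 | 2020004 | 2/25 | 31800.0 | 1 |
| 9 | 2020004 | 2/25 | 33060.0 | 0 |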


## extract pre/during/post-stim from pickle files

In [46]:
# load the NN-interval table of the first subject only, for a quick look
for subj_id in array_subjs[0:1]:
    nn_filename = os.path.join(data_dir, str(subj_id) + '.pickle')
    df = pd.read_pickle(nn_filename)

In [47]:
df

| | timestamp | nn_interval |
|---|---|---|
| 0 | 2021-09-17 10:43:12.037850 | 650.0 |
| 1 | 2021-09-17 10:43:12.687850 | 652.0 |
| 2 | 2021-09-17 10:43:13.339850 | 646.0 |
| 3 | 2021-09-17 10:43:13.985850 | 638.0 |
| 4 | 2021-09-17 10:43:14.623850 | 636.0 |
| ... | ... | ... |
| 1260850 | 2021-09-30 01:56:40.542143 | 902.0 |
| 1260851 | 2021-09-30 01:56:41.444143 | 890.0 |
| 1260852 | 2021-09-30 01:56:42.334143 | 902.0 |
| 1260853 | 2021-09-30 01:56:43.236143 | 898.0 |
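
Given a table with `timestamp` and `nn_interval` columns like the one above, the pre/during/post split around one stimulation onset could be sketched as below. `split_windows` is a hypothetical helper, the 20 min length of the pre and post windows (`margin_s`) is an assumed value, and the `timestamp` column is assumed to hold datetimes. The recording used here is synthetic.

```python
import numpy as np
import pandas as pd


def split_windows(df_nn, onset, stim_s=1200, margin_s=1200):
    """Return (pre, during, post) slices of an NN-interval table.

    onset    : start of the stimulation (anything pd.Timestamp accepts)
    stim_s   : stimulation length in seconds (20 min in this study)
    margin_s : length of the pre and post windows in seconds (assumed)
    """
    onset = pd.Timestamp(onset)
    offset = onset + pd.Timedelta(seconds=stim_s)
    margin = pd.Timedelta(seconds=margin_s)
    t = df_nn["timestamp"]
    pre = df_nn[(t >= onset - margin) & (t < onset)]
    during = df_nn[(t >= onset) & (t < offset)]
    post = df_nn[(t >= offset) & (t < offset + margin)]
    return pre, during, post


# Synthetic recording: one beat per second for 90 min (not study data)
start = pd.Timestamp("2021-09-17 10:00:00")
toy = pd.DataFrame({
    "timestamp": start + pd.to_timedelta(np.arange(5400), unit="s"),
    "nn_interval": 1000.0,
})
pre, during, post = split_windows(toy, "2021-09-17 10:30:00")
```

The filtered info table stores only month/day and seconds since midnight, so the full onset timestamp (including the year) has to be assembled by the caller before it is passed to `split_windows`.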
