# Logistics

* If you feel lost while working on this notebook (in class or outside of class), don't hesitate to post questions here: https://piazza.com/class/jrzeap5kpirw

# Description of the dataset

The file "Bertrand-physiodata.csv" was captured using the Empatica E4 wristband when Bertrand was teaching a class last Fall. The E4 collects information about a person's heart rate, electrodermal activity, movement and temperature:

![title](https://support.empatica.com/hc/article_attachments/360000797783/e4_specs.jpg)

The CSV file contains the following columns:
* **tags**: users can tag events by pressing a button on the wristband
* **real time**: time of the data collection
* **unix time**: number of seconds since 00:00:00 UTC on Thursday, 1 January 1970
* **BVP**: Blood volume pulse (used to compute HR data)
* **HR**: Heart rate data
* **EDA**: Electrodermal activity (i.e., physiological arousal)
* **TEMP**: temperature of the skin
* **ACC_x**: accelerometer data on the x axis
* **ACC_y**: accelerometer data on the y axis
* **ACC_z**: accelerometer data on the z axis
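To make the **unix time** column more concrete, here is a minimal sketch of turning seconds-since-1970 into readable timestamps with `pd.to_datetime`. The three values below are made up for illustration and are not read from the CSV file; the result is in UTC, so it may differ from the local time shown in the **real time** column.

```python
import pandas as pd

# Hypothetical unix timestamps (seconds since 1970-01-01 00:00:00 UTC);
# illustrative values only, not loaded from Bertrand-physiodata.csv
unix_seconds = pd.Series([0, 86400, 1536669000])

# unit="s" tells pandas the numbers are seconds (the default unit is nanoseconds)
timestamps = pd.to_datetime(unix_seconds, unit="s")
print(timestamps)
```

The first value maps to the start of the epoch, the second to exactly one day later.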

Today we are going to explore Bertrand's physiological response to teaching! :) More specifically, we are going to explore the relationship between heart rate and electrodermal activity.

# Pandas Review

Concepts: 
* head, tail, info, values
* zip, dict, pd.DataFrame
* df.columns
* read_csv, delimiter, header, index
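As a quick refresher on the concepts listed above, the sketch below builds a tiny dataframe from `zip` and `dict` and then inspects it. The column names and numbers are invented for illustration (the variable `toy_df` is hypothetical and unrelated to the class dataset).

```python
import pandas as pd

# Made-up column names and values, for illustration only
names = ["HR", "EDA"]
values = [[72.0, 75.5, 71.2], [0.84, 1.18, 1.17]]

# zip pairs each name with its list of values; dict turns the pairs into a mapping
data = dict(zip(names, values))
toy_df = pd.DataFrame(data)

print(toy_df.head(2))        # first two rows
print(toy_df.tail(1))        # last row
print(toy_df.columns)        # column labels
print(toy_df.values.shape)   # shape of the underlying NumPy array
toy_df.info()                # dtypes and non-null counts per column
```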



In [1]:
# 1) import the pandas library as pd
import pandas as pd

In [3]:
# 2) import the csv file into a variable called df


In [1]:
# 3) print the column names, just to make
# sure that they match the list above


In [2]:
# 4) use the head() function to check your data


In [3]:
# 5) use the tail() function on your dataframe. How many rows do you have?


In [4]:
# 6) use the info() function to inspect your data:


In [5]:
# 7) knowing that BVP is collected 64 times per second (i.e., 64Hz),
# what can you say about the sampling frequency of the other measures?


# Plotting

Concepts: 
* plot, subplot, line plots, scatter, box plots, hist, ...
* mean, median, quantiles, STD, etc. 
* separate and summarize
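The summary statistics listed above are all one-liners on a pandas Series. Here is a small sketch on invented numbers (the Series `toy` is hypothetical); note how a single large value pulls the mean away from the median.

```python
import pandas as pd

# Invented values with one large outlier, for illustration only
toy = pd.Series([1.0, 2.0, 2.0, 3.0, 12.0])

print(toy.mean())                        # arithmetic mean, pulled up by the outlier
print(toy.median())                      # middle value, robust to the outlier
print(toy.std())                         # sample standard deviation (ddof=1)
print(toy.quantile([0.25, 0.5, 0.75]))   # quartiles
```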

In [6]:
# 8) make sure you're plotting your graphs inline
# Hint: https://stackoverflow.com/questions/19410042/how-to-make-ipython-notebook-matplotlib-plot-inline


### Let's work on the heart rate (HR) data first

In this section we are going to produce various graphs to inspect Bertrand's heart rate data. 

In [7]:
# 9) produce a histogram of the heart rate data; what can you say from it?


In [8]:
# 10) Try to plot the values over time (e.g., use the real time for the x axis):


11) What happened? Come up with 2-3 reasons why this didn't work before you move on to the next question:
- reason 1:
- reason 2: 
- reason 3: 

After you've answered the question above, feel free to look at this hint and try to fix your dataframe: https://stackoverflow.com/questions/22551403/python-pandas-filtering-out-nan-from-a-data-selection-of-a-column-of-strings/22553757
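If the hint feels abstract, here is a minimal sketch of the same idea on a toy dataframe: a boolean mask built with `notnull()` keeps only the rows where a column has a value. The data is invented, and `toy_df` and `hr_only` are hypothetical names; try the real dataframe yourself before peeking at this.

```python
import numpy as np
import pandas as pd

# Invented data: one column is filled on every row, the other only on some rows
toy_df = pd.DataFrame({
    "BVP": [0.0, -0.05, 6.2, 22.9],
    "HR": [np.nan, 71.0, np.nan, 72.5],
})

# Boolean mask: True where HR has a value, False where it is NaN
hr_only = toy_df[toy_df["HR"].notnull()]

print(toy_df.count())   # non-null values per column
print(hr_only)
```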

In [9]:
# 12) Fix your dataframe using the link above: 


### Now let's look at the electrodermal activity data (EDA)

In this section we are going to produce various graphs to inspect Bertrand's electrodermal data. 

In [10]:
# 13) produce a line plot to visually inspect the EDA data
# Hint: make sure you keep only the non-null data


Feel free to look at the following page to make sense of the units of the EDA data: 
* https://support.empatica.com/hc/en-us/articles/203621955-What-should-I-know-to-use-EDA-data-in-my-experiment-

In [11]:
# 14) we don't have any labels on the x axis!
# convert the 'real time' column into a proper datetime
# Hint: https://campus.datacamp.com/courses/pandas-foundations/time-series-in-pandas?ex=3


In [12]:
# 15) print the mean and median values of the EDA data; explain how they are different


In [13]:
# 16) plot a histogram of the EDA values; does that confirm your interpretation above?


### Combining EDA and HR data on the same graph

In this section we are going to produce various graphs to inspect both the HR and EDA data.

In [14]:
# 17) filter both the EDA and HR values to keep the non-null rows:


In [15]:
# 18) plot EDA and HR on two different graphs using subplots
# hint: https://stackoverflow.com/questions/31726643/how-do-i-get-multiple-subplots-in-matplotlib


In [16]:
# 19) plot EDA and HR on the same graph; what went wrong?


In [17]:
# 20) normalize the HR and EDA columns using your favorite normalization strategy
# Hint: https://stackoverflow.com/questions/12525722/normalize-data-in-pandas


In [18]:
# 21) plot EDA and HR on the same graph; does the result look better?


In [19]:
# 22) what can you observe from the graph? Does there seem to be an agreement between HR and EDA?


**IN-CLASS DISCUSSION**: why do we normalize values? When do we want to normalize them?
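To feed the discussion, here is a sketch of two common normalization strategies, min-max scaling and z-scores, on invented numbers. These are only two options among several, and `toy_df` is a hypothetical dataframe, not the class data.

```python
import pandas as pd

# Invented values on very different scales, for illustration only
toy_df = pd.DataFrame({
    "HR": [60.0, 80.0, 100.0],
    "EDA": [0.5, 1.0, 1.5],
})

# Min-max scaling: every column ends up between 0 and 1
min_max = (toy_df - toy_df.min()) / (toy_df.max() - toy_df.min())

# Z-score: every column ends up with mean 0 and standard deviation 1
z_score = (toy_df - toy_df.mean()) / toy_df.std()

print(min_max)
print(z_score)
```

After either transformation the two columns share a comparable scale, which is what makes plotting them on the same axis readable.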

# Time series

Concepts:
* indexing, slicing, datetimeIndex
* resampling, rolling mean
* method chaining and filtering
* plotting time series

In this section, we are going to use some of pandas' built-in functions for working with time series. More specifically, we are going to downsample our data and use a rolling window to generate additional graphs.
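Before touching the real data, it can help to see both operations on a synthetic signal. The sketch below assumes one sample per second for three minutes; the index, values and variable names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic signal: one sample per second for 3 minutes (values 0, 1, ..., 179)
index = pd.date_range("2018-01-01 08:00:00", periods=180, freq="s")
signal = pd.Series(np.arange(180, dtype=float), index=index)

# resample: cut the timeline into 60-second bins and keep one mean per bin
downsampled = signal.resample("60s").mean()

# rolling: for every sample, average the samples from the preceding 60 seconds
smoothed = signal.rolling("60s").mean()

print(len(downsampled))   # one value per bin
print(len(smoothed))      # one value per original sample
```

Resampling shrinks the series, while a rolling window keeps its original length and smooths it.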

In [53]:
# let's reimport our data to make sure it's clean
df = pd.read_csv('Bertrand-physiodata.csv')
# keep only the rows that have an EDA value
filtered_df = df[df['EDA'].notnull()].copy()

# make sure that you are converting the real time column into a datetime
filtered_df['real time'] = pd.to_datetime(filtered_df['real time'], format='%d/%m/%y %H:%M')
filtered_df = filtered_df.set_index('real time')

filtered_df.head()

real time,tags,unix time,BVP,HR,EDA,TEMP,ACC_x,ACC_y,ACC_z
2018-11-09 08:28:00,0.0,1536669000.0,0.0,,0.0,33.31,-50.0,7.0,28.0
2018-11-09 08:28:00,0.0,1536669000.0,-0.05,,0.836457,33.31,-32.0,64.0,27.0
2018-11-09 08:28:00,0.0,1536669000.0,6.2,,1.18386,33.31,-14.0,57.0,50.0
2018-11-09 08:28:00,0.0,1536669000.0,22.9,,1.167277,33.31,-22.0,52.0,24.0
2018-11-09 08:28:00,0.0,1536669000.0,93.76,,1.278719,33.31,-17.0,53.0,27.0


### Downsampling

In [20]:
# 23) Use the instructions from DataCamp to resample your data in 60-second windows and plot the result
# Hint: https://campus.datacamp.com/courses/pandas-foundations/time-series-in-pandas?ex=7


In [21]:
# 24) do the same thing, but this time using the rolling() function in a 60sec window


In [None]:
# 25) What is the difference between rolling() and resample()? Why do the graphs look different?

## Correlations

In this section we're going to keep exploring the relationship between heart rate data and electrodermal activity. We are going to do this visually (with a scatter plot first) and then using a statistical test (Pearson's correlation). 
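As a warm-up, here is a sketch of computing Pearson's correlation with pandas on invented numbers. `Series.corr` uses the Pearson method by default; the values and the name `toy_df` are hypothetical and say nothing about the real dataset.

```python
import pandas as pd

# Invented values that rise together, for illustration only
toy_df = pd.DataFrame({
    "HR": [60.0, 70.0, 80.0, 90.0],
    "EDA": [0.5, 0.9, 1.1, 1.6],
})

# Pearson's r between two columns (method="pearson" is the default)
r = toy_df["HR"].corr(toy_df["EDA"])
print(r)

# The same number appears in the full correlation matrix
print(toy_df.corr())
```

A value close to 1 or -1 suggests a strong linear relationship, and a value close to 0 suggests little or no linear relationship.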

In [22]:
# 26) create a scatter plot between HR and EDA:


In [23]:
# 27) compute Pearson's correlation between the HR and EDA data


In [24]:
# 28) what can you conclude? Is there a linear relationship between HR and EDA data in this dataset?
