## Part 2: Clinical Application

### Contents
Fill out this notebook as part 2 of your final project submission.

**You will have to complete the Code (Load Metadata & Compute Resting Heart Rate) and Project Write-up sections.**  

- [Code](#Code) is where you will implement some parts of the **pulse rate algorithm** you created and tested in Part 1 and already includes the starter code.
  - [Imports](#Imports) - These are the imports needed for Part 2 of the final project. 
    - [glob](https://docs.python.org/3/library/glob.html)
    - [os](https://docs.python.org/3/library/os.html)
    - [numpy](https://numpy.org/)
    - [pandas](https://pandas.pydata.org/)
  - [Load the Dataset](#Load-the-dataset)  
  - [Load Metadata](#Load-Metadata)
  - [Compute Resting Heart Rate](#Compute-Resting-Heart-Rate)
  - [Plot Resting Heart Rate vs. Age Group](#Plot-Resting-Heart-Rate-vs.-Age-Group)
- [Project Write-up](#Project-Write-Up) is where you will describe the clinical significance you observe from the **pulse rate algorithm** applied to this dataset, what additional information or methods could improve your results, and whether we validated a trend known in the scientific community.

### Dataset (CAST)

The data for this project comes from the [Cardiac Arrhythmia Suppression Trial (CAST)](https://physionet.org/content/crisdb/1.0.0/), which was sponsored by the National Heart, Lung, and Blood Institute (NHLBI). CAST collected 24 hours of heart rate data from ECGs from people who have had a myocardial infarction (MI) within the past two years.[1] This data has been smoothed and resampled to more closely resemble PPG-derived pulse rate data from a wrist wearable.[2]

1. **CAST RR Interval Sub-Study Database Citation** - Stein PK, Domitrovich PP, Kleiger RE, Schechtman KB, Rottman JN. Clinical and demographic determinants of heart rate variability in patients post myocardial infarction: insights from the Cardiac Arrhythmia Suppression Trial (CAST). Clin Cardiol 23(3):187-94; 2000 (Mar)
2. **PhysioNet Citation** - Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101(23):e215-e220; 2000 (June 13)

-----

### Code
#### Imports

When you implement the functions, you'll only need to use the packages you've used in the classroom, like [Pandas](https://pandas.pydata.org/) and [Numpy](http://www.numpy.org/). These packages are imported for you here. We recommend you don't import other packages outside of the [Standard Library](https://docs.python.org/3/library/); otherwise the grader might not be able to run your code.

In [1]:
import glob
import os

import numpy as np
import pandas as pd

#### Load the dataset

The dataset is stored as [.npz](https://docs.scipy.org/doc/numpy/reference/generated/numpy.savez.html) files. Each file contains roughly 24 hours of heart rate data in the 'hr' array, sampled at 1 Hz. The subject ID is the name of the file. You will use these files to compute resting heart rate.

Demographics metadata is stored in a file called 'metadata.csv'. This CSV has three columns: subject ID, age group, and sex. You will use this file to make the association between resting heart rate and age group for each sex.

Find the dataset in `../datasets/crisdb/`

In [2]:
hr_filenames = glob.glob('/data/crisdb/*.npz')
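
As a quick check of the file format described above, the sketch below writes a small synthetic `.npz` file containing an `hr` array and reads it back. The file name `demo_subject.npz` and the constant heart rate values are assumptions made for illustration; they are not part of the CAST dataset.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for one subject: 24 hours of heart rate sampled at 1 Hz.
fs = 1  # samples per second
demo_hr = np.full(24 * 60 * 60 * fs, 70.0)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_subject.npz')  # hypothetical file name
    np.savez(demo_path, hr=demo_hr)

    # np.load on an .npz file returns a dict-like object keyed by array name.
    with np.load(demo_path) as npz_file:
        loaded_hr = npz_file['hr']

# Number of samples divided by the sampling rate gives seconds; convert to hours.
duration_hours = len(loaded_hr) / fs / 3600
print(duration_hours)
```

Dividing the array length by the sampling rate is a simple way to confirm that a recording really covers about a day before computing statistics on it.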

#### Load Metadata
Load the metadata file into a data structure that allows for easy lookups from subject ID to age group and sex.

In [3]:
metadata_filename = '/data/crisdb/metadata.csv'

# Load the metadata file into this variable.
# pd.read_csv opens the file itself, so no separate open() call is needed.
metadata = pd.read_csv(metadata_filename)
    
metadata.head()

Unnamed: 0,subject,age,sex
0,e198a,20-24,Male
1,e198b,20-24,Male
2,e028b,30-34,Male
3,e028a,30-34,Male
4,e061b,30-34,Male
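
One possible way to get the "easy lookups" asked for above is to index the table by subject ID. The sketch below uses a tiny synthetic CSV with the same column layout; the subject IDs `s001` and `s002` and the names `demo_metadata` and `by_subject` are hypothetical, and the use of `index_col=0` is an assumption about how to treat the unnamed first column.

```python
import io

import pandas as pd

# Synthetic rows in the same layout as metadata.csv (values are illustrative).
demo_csv = io.StringIO(
    ",subject,age,sex\n"
    "0,s001,20-24,Male\n"
    "1,s002,30-34,Female\n"
)

# index_col=0 reads the unnamed first column as the row index instead of data.
demo_metadata = pd.read_csv(demo_csv, index_col=0)

# Index by subject ID so each lookup is a single .loc call.
by_subject = demo_metadata.set_index('subject')
age_group, sex = by_subject.loc['s002', ['age', 'sex']]
print(age_group, sex)
```

With this layout, a lookup no longer needs a boolean mask over the whole table for every subject.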


#### Compute Resting Heart Rate
For each subject we want to compute the resting heart rate while keeping track of which age group this subject belongs to. An easy, robust way to compute the resting heart rate is to use the lowest 5th percentile value in the heart rate timeseries.

In [4]:
def AgeAndRHR(metadata, filename):
   
    # Load the heart rate timeseries
    hr_data = np.load(filename)['hr']
    
    # Compute the resting heart rate from the timeseries by finding the lowest 5th percentile value in hr_data
    rhr = np.percentile(hr_data, 5)

    # Find the subject ID from the filename (the file name without its .npz extension).
    subject = os.path.splitext(os.path.basename(filename))[0]

    # Look up this subject's metadata row once.
    subject_row = metadata[metadata['subject'] == subject]

    # Find the age group for this subject in metadata.
    age_group = subject_row['age'].values[0]

    # Find the sex for this subject in metadata.
    sex = subject_row['sex'].values[0]

    return age_group, sex, rhr

df = pd.DataFrame(data=[AgeAndRHR(metadata, filename) for filename in hr_filenames],columns=['age_group', 'sex', 'rhr'])
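
To illustrate why a low percentile is described as a robust estimate, the sketch below compares the minimum with the 5th percentile on a synthetic heart rate series that contains a few glitch samples. All values here are made up for illustration and do not come from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic day of 1 Hz heart rate around 75 BPM (illustrative values only).
demo_hr = rng.normal(loc=75.0, scale=5.0, size=24 * 60 * 60)

# Inject a few implausibly low samples, as a sensor glitch might.
demo_hr[:10] = 20.0

rhr_min = demo_hr.min()             # dominated by the glitch samples
rhr_p5 = np.percentile(demo_hr, 5)  # barely affected by 10 bad samples
print(rhr_min, rhr_p5)
```

The minimum is set entirely by the ten bad samples, while the 5th percentile stays close to the low end of the plausible values.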


#### Plot Resting Heart Rate vs. Age Group
We'll use [seaborn](https://seaborn.pydata.org/) to plot the relationship. Seaborn is a thin wrapper around matplotlib, which we've used extensively in this class, that enables higher-level statistical plots.

We will use [lineplot](https://seaborn.pydata.org/generated/seaborn.lineplot.html#seaborn.lineplot) to plot the mean of the resting heart rates for each age group along with the 95% confidence interval around the mean. Learn more about making plots that show uncertainty [here](https://seaborn.pydata.org/tutorial/relational.html#aggregation-and-representing-uncertainty).

In [5]:
import seaborn as sns
from matplotlib import pyplot as plt

labels = sorted(np.unique(df.age_group))
df['xaxis'] = df.age_group.map(lambda x: labels.index(x)).astype('float')
plt.figure(figsize=(12, 8))
sns.lineplot(x='xaxis', y='rhr', hue='sex', data=df)
_ = plt.xticks(np.arange(len(labels)), labels)
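
For intuition about the confidence band in the plot, the sketch below computes a percentile-bootstrap 95% confidence interval for a mean using NumPy only. It approximates the idea and is not seaborn's implementation; the function name `bootstrap_mean_ci` and the sample values are hypothetical.

```python
import numpy as np


def bootstrap_mean_ci(values, n_boot=1000, ci=95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    lower = (100 - ci) / 2
    return np.percentile(boot_means, [lower, 100 - lower])


demo_rhr = np.array([58.0, 61.0, 63.0, 66.0, 70.0, 72.0])  # illustrative values
low, high = bootstrap_mean_ci(demo_rhr)
print(demo_rhr.mean(), low, high)
```

Groups with few subjects produce wider intervals, which is worth keeping in mind when reading the band for the smallest age groups.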

### Clinical Conclusion
Answer the following prompts to draw a conclusion about the data.
> 1. For women, we see .... 
> 2. For men, we see ... 
> 3. In comparison to men, women's heart rate is .... 
> 4. What are some possible reasons for what we see in our data?
> 5. What else can we do or go and find to figure out what is really happening? How would that improve the results?
> 6. Did we validate the trend that average resting heart rate increases up until middle age and then decreases into old age? How?



#### Clinical conclusions for Women

There is an age bias in the heart rates of female users: the younger groups show higher resting heart rates than users older than 65 years.

#### Clinical conclusions for Men

The heart rates of male users are more balanced across age groups. There is some variance, but not enough to draw conclusions about how their heart rates change with age.

#### Women Vs Men heart rates

The figure above shows resting heart rates for female and male users by age group. Male users have more stable heart rates across age groups; that is not the case for female users.

#### Data behaviour

Physiological resting heart rates range from 60 to 100 beats per minute. One possible explanation for the behaviour of the data is that the female users younger than 65 years were drawn from a specific group of people, such as athletes. That would bias their resting heart rates and increase the difference from older users, who are probably less active. For male users, the selection appears more balanced.

Another possible reason is how much users move at different ages: female users older than 60 years may generate movements that translate into artefacts and increase their estimated heart rates.

#### What else can we do or go and find to figure out what is really happening? How would that improve the results?

The first step is to check whether the age groups are balanced, that is, whether a similar number of people represents each group of interest, and how those numbers compare between genders. Additionally, the average movement of female users in different age groups could be analyzed to see whether anything notable stands out.
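
A compact way to inspect group balance for both sexes at once is a cross-tabulation. The sketch below uses a small made-up frame with the same column names as `df`; the name `demo_df` and its values are assumptions for illustration.

```python
import pandas as pd

# Illustrative frame with the same columns as df (values are made up).
demo_df = pd.DataFrame({
    'age_group': ['35-39', '35-39', '40-44', '40-44', '40-44'],
    'sex': ['Male', 'Female', 'Male', 'Male', 'Female'],
    'rhr': [62.0, 70.0, 60.0, 64.0, 68.0],
})

# Rows are age groups, columns are sexes, cells are subject counts.
counts = pd.crosstab(demo_df['age_group'], demo_df['sex'])
print(counts)
```

Reading the counts side by side makes an imbalance between sexes within an age group easy to spot.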

***Analysing male user groups:***

- Number of users per age group:

In [6]:
A = df[df.sex == 'Male'].groupby('age_group')
for a in list(A.groups.keys()):
    print('Number of users in group ' + a + ':',  A.get_group(a).shape[0])

Number of users in group 35-39: 24
Number of users in group 40-44: 54
Number of users in group 45-49: 109
Number of users in group 50-54: 146
Number of users in group 55-59: 215
Number of users in group 60-64: 246
Number of users in group 65-69: 230
Number of users in group 70-74: 157
Number of users in group 75-79: 79


- Mean `xaxis` value per age group:

In [7]:
A = df[df.sex == 'Male'].groupby('age_group')
print('AXIS values per group of ages:')
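for a in list(A.groups.keys()):
    group = A.get_group(a)
    # xaxis is the age-group index created for the plot, so each mean is just that index
    # (or NaN for an empty subset).
    print('Age group: ' + a, '-> High rates: ', group[group.rhr > 65].xaxis.mean(), ', Low rates:', group[group.rhr < 65].xaxis.mean())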

AXIS values per group of ages:
Age group: 35-39 -> High rates:  0.0 , Low rates: 0.0
Age group: 40-44 -> High rates:  1.0 , Low rates: 1.0
Age group: 45-49 -> High rates:  2.0 , Low rates: 2.0
Age group: 50-54 -> High rates:  3.0 , Low rates: 3.0
Age group: 55-59 -> High rates:  4.0 , Low rates: 4.0
Age group: 60-64 -> High rates:  5.0 , Low rates: 5.0
Age group: 65-69 -> High rates:  6.0 , Low rates: 6.0
Age group: 70-74 -> High rates:  7.0 , Low rates: 7.0
Age group: 75-79 -> High rates:  8.0 , Low rates: 8.0


***Analysing female user groups:***

- Number of users per age group:

In [8]:
A = df[df.sex == 'Female'].groupby('age_group')
for a in list(A.groups.keys()):
    print('Number of users in group ' + a + ':',  A.get_group(a).shape[0])

Number of users in group 35-39: 4
Number of users in group 40-44: 8
Number of users in group 45-49: 15
Number of users in group 50-54: 18
Number of users in group 55-59: 46
Number of users in group 60-64: 67
Number of users in group 65-69: 61
Number of users in group 70-74: 39
Number of users in group 75-79: 19


- Mean `xaxis` value per age group:

In [9]:
A = df[df.sex == 'Female'].groupby('age_group')
print('AXIS values per group of ages:')
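for a in list(A.groups.keys()):
    group = A.get_group(a)
    # xaxis is the age-group index created for the plot, so each mean is just that index
    # (or NaN for an empty subset).
    print('Age group: ' + a, '-> High rates: ', group[group.rhr > 65].xaxis.mean(), ', Low rates:', group[group.rhr < 65].xaxis.mean())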

AXIS values per group of ages:
Age group: 35-39 -> High rates:  0.0 , Low rates: 0.0
Age group: 40-44 -> High rates:  1.0 , Low rates: 1.0
Age group: 45-49 -> High rates:  2.0 , Low rates: 2.0
Age group: 50-54 -> High rates:  3.0 , Low rates: 3.0
Age group: 55-59 -> High rates:  4.0 , Low rates: 4.0
Age group: 60-64 -> High rates:  5.0 , Low rates: 5.0
Age group: 65-69 -> High rates:  6.0 , Low rates: 6.0
Age group: 70-74 -> High rates:  7.0 , Low rates: 7.0
Age group: 75-79 -> High rates:  8.0 , Low rates: 8.0


***Conclusion:***

It is difficult to conclude from the available data why younger female users have higher resting heart rates, since the behaviour across genders and age groups is similar in all the scenarios analyzed. Nonetheless, a different conclusion might be reached if more female users were analyzed, ideally with as many female users per group as in the male case, to allow a better comparison.
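
The effect of group size on how much a group mean can be trusted can be sketched with a per-group summary of count, mean, and standard error. The frame below is made up for illustration (`demo_df` and its values are assumptions), and the standard error is only one of several possible uncertainty measures.

```python
import pandas as pd

# Illustrative data: a very small group next to a somewhat larger one.
demo_df = pd.DataFrame({
    'age_group': ['35-39'] * 2 + ['60-64'] * 6,
    'sex': ['Female'] * 8,
    'rhr': [60.0, 80.0, 62.0, 64.0, 66.0, 68.0, 70.0, 72.0],
})

# count = subjects per group, mean = average RHR, sem = standard error of that mean.
summary = (
    demo_df.groupby(['sex', 'age_group'])['rhr']
    .agg(['count', 'mean', 'sem'])
)
print(summary)
```

In this toy example the two-subject group has a much larger standard error than the six-subject group, which mirrors the concern about the small female groups.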

#### Did we validate the trend that average resting heart rate increases up until middle age and then decreases into old age? How?

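No, we did not validate the trend that average resting heart rate increases up until middle age and then decreases into old age. The female groups are too small to support a conclusion, and the male heart rates are comparatively stable across age groups.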

## FIN