# Withings Sleep Analyzer Data Import

<div class="alert alert-block alert-info">
This example explains how to import and parse data retrieved from the Withings Health Mate app.
    
<b>Note</b>: This notebook only illustrates how to <i>generally</i> approach such a data wrangling problem. The full code (and much more!) is readily available in <code>BioPsyKit</code>: <code>biopsykit.io.sleep_analyzer.load_withings_sleep_analyzer_raw_file()</code>.
</div>





## Setup and Helper Functions

In [None]:
from pathlib import Path

import re

import pandas as pd
import numpy as np

import biopsykit as bp

import matplotlib.pyplot as plt
import seaborn as sns

from ast import literal_eval

%matplotlib widget
%load_ext autoreload
%autoreload 2

In [None]:
plt.close("all")

tz = "Europe/Berlin"

palette = bp.colors.fau_palette
sns.set_theme(
    context="notebook", 
    style="ticks", 
    font="sans-serif",
    palette=palette
)

plt.rcParams['figure.figsize'] = (8,4)
plt.rcParams['pdf.fonttype'] = 42
plt.rcParams['mathtext.default'] = "regular"

palette

## Data Import

### Read Data from File

Load example data (or read the csv file into a dataframe using `pandas.read_csv()`).
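
If you work with your own export instead of the bundled example, a minimal sketch of the `pandas.read_csv()` route could look like this. The inline CSV text is a made-up stand-in that merely assumes the same three columns (`start`, `duration`, `value`) as the example file; for a real export you would pass the file path instead.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for a raw export file
# (assumption: same three columns as the example data).
csv_text = '''start,duration,value
2020-10-10T22:00:00+02:00,"[60,60,60]","[55,57,56]"
2020-10-10T23:00:00+02:00,"[60,60]","[60,61]"
'''

# For a real file, replace StringIO(csv_text) with the path to the csv file.
toy_raw = pd.read_csv(StringIO(csv_text))

# The list-like columns arrive as plain strings and still need to be parsed.
print(type(toy_raw.loc[0, "value"]))
```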

In [None]:
data = bp.example_data.get_sleep_analyzer_raw_file_unformatted(data_source="heart_rate")

We first want to get an impression of what the data looks like by displaying it. In Jupyter Notebooks, if a cell ends with the *name of a variable* or the *unassigned output of a statement*, Jupyter will ``display`` that value (in a nice layout) without the need for a ``print`` statement.

You can, for example, call ``data`` to display the whole dataframe or ``data.head()`` to display the beginning of the dataframe.

We see that we have three columns: a 'start' column with timestamps, a 'duration' column, and a 'value' column. The data can be read row-wise as follows:
beginning at time 'start', the 'value' column lists the heart rate values, and the 'duration' column lists how long (in seconds) each of these values lasts.

In [None]:
data.head()

### Data Type Conversion

All values are imported as strings, so we need to convert these into the correct data types:
* The *String* timestamps in the 'start' column are converted into *datetime* objects that offer extensive functions for handling time series data
* The lists in the 'duration' and 'value' columns are also stored as strings so we need to convert them into actual lists with numbers. Googling "*pandas convert string to array*" leads us to this StackOverflow post https://stackoverflow.com/questions/23119472/in-pandas-python-reading-array-stored-as-string, where the accepted answer suggests this:

```
    from ast import literal_eval
    df['col2'] = df['col2'].apply(literal_eval)
```

Finally, we set the 'start' column as the new index of the dataframe, sort the data by the index, and rename the index to 'time'.
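
The conversion steps can be tried out on a tiny, made-up dataframe first. This is only a sketch: the rows below are invented and merely mimic the three-column layout described above. Note that `literal_eval` only parses Python literals (lists, numbers, strings, ...) and does not execute arbitrary code, which makes it a safer choice than `eval` for this task.

```python
from ast import literal_eval

import pandas as pd

# Invented rows that mimic the raw layout: everything is stored as a string.
toy = pd.DataFrame(
    {
        "start": ["2020-10-10T23:00:00+02:00", "2020-10-10T22:00:00+02:00"],
        "duration": ["[60,60]", "[60,60,60]"],
        "value": ["[60,61]", "[55,57,56]"],
    }
)

# Strings -> timezone-aware timestamps
toy["start"] = pd.to_datetime(toy["start"])
# Strings -> real Python lists of integers
for col in ["duration", "value"]:
    toy[col] = toy[col].apply(literal_eval)

# Use the timestamps as a sorted index named 'time'
toy = toy.set_index("start").sort_index()
toy.index.name = "time"
```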

In [None]:
print("Before: {}".format([type(value) for value in data.iloc[0]]))

data['start'] = pd.to_datetime(data['start'])
data['duration'] = data['duration'].apply(literal_eval)
data['value'] = data['value'].apply(literal_eval)

print("After: {}".format([type(value) for value in data.iloc[0]]))

data = data.set_index('start').sort_index()
# rename index
data.index.name = 'time'

Our data now looks like this:

In [None]:
data.head()

## Explode Arrays

We now want to convert the values stored in the arrays into single values. Googling "*pandas convert list of values to rows*" leads us to this StackOverflow post: https://stackoverflow.com/questions/39954668/how-to-convert-column-with-list-of-values-into-rows-in-pandas-dataframe. Here, we don't take the accepted answer, but the answer below:
```
    df.explode('column')
```
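
To see what `explode()` does in isolation, here is a small sketch on an invented Series (the labels `a` and `b` are placeholders for the timestamps). Each list element becomes its own row, the index label is repeated for every element, and the result has dtype `object`, so a cast is needed to get numbers back.

```python
import pandas as pd

# Invented example: two rows, each holding a list of heart rate values.
s = pd.Series([[55, 57, 56], [60, 61]], index=["a", "b"], name="value")

# One row per list element; index labels are replicated.
exploded = s.explode()

# explode() returns object dtype, so convert back to integers explicitly.
hr = exploded.astype(int)
```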

In [None]:
print("Before Explode:")
display(data['value'].head())
print("")
print("After Explode:")
# Series.explode() takes no column name (its only parameter is ignore_index)
display(data['value'].explode().head())

The `pd.Series.explode()` function operates on a single Series, i.e., on one column. If we want to apply it to multiple columns at once, we call `pd.DataFrame.apply()` and pass the function as an argument to `apply`.
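
As a sketch on invented data, the column-wise `apply` looks like this. It relies on the lists in each row having the same length in every column. Newer pandas versions (1.3 and later, as far as I know) also accept a list of columns in `pd.DataFrame.explode()`, which should give the same result; treat that variant as an optional alternative rather than what this notebook uses.

```python
import pandas as pd

# Invented example: both columns hold lists of matching length per row.
toy = pd.DataFrame(
    {"duration": [[60, 60, 60], [30, 30]], "value": [[55, 57, 56], [60, 61]]},
    index=pd.Index(["a", "b"], name="time"),
)

# Explode every column separately and reassemble the dataframe.
per_column = toy.apply(pd.Series.explode)

# Alternative (assumes pandas >= 1.3): explode several columns in one call.
multi = toy.explode(["duration", "value"])
```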

In [None]:
data_explode = data.apply(pd.Series.explode)

Our dataframe now looks like this:

In [None]:
data_explode.head()

However, we now see that the timestamp is the same for each exploded value. The documentation of `explode()` says the following:

"Transform each element of a list-like to a row, *replicating* index values."

To get the correct timestamps, we need to add the 'duration' values cumulatively to the timestamps. Simply summing up all values in 'duration' would not work, however: the cumulative sum must restart for every original row, i.e., it must only run over entries that share the same timestamp. One way to achieve this is to group the data into subparts with the same timestamp using `pd.DataFrame.groupby()`, passing the index name (i.e., `time`) to group by. We then define our own function that is applied to each group.

In [None]:
def explode_timestamps(df):
    # sum up the time durations and subtract the first value from it (so that we start from 0)
    # dur_sum then looks like this: [0, 60, 120, 180, ...]
    dur_sum = df['duration'].cumsum() - df['duration'].iloc[0]
    # Add these time durations to the index timestamps. 
    # For that, we need to convert the datetime objects from the pandas DatetimeIndex into a float and add the time onto it
    # (we first need to multiply it with 10^9 because the time in the index is stored in nanoseconds)
    index_sum = df.index.values.astype(float) + 1e9 * dur_sum
    # convert the float values back into a DatetimeIndex
    df['time'] = pd.to_datetime(index_sum)
    # set this as index and convert it back into the right time zone
    df = df.set_index('time')
    df = df.tz_localize('UTC').tz_convert(tz)
    # we don't need the duration column anymore so we can drop it
    df = df.drop(columns='duration')
    return df

In [None]:
# call groupby and apply our custom function on each group
df_hr = data_explode.groupby('time', group_keys=False).apply(explode_timestamps)
# rename the value column
df_hr.columns = ['heart_rate']

df_hr
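
As a possible alternative to the float conversion, the same offsets can be computed with timedeltas, which keeps the time zone attached and avoids rounding nanosecond timestamps to floats. The following is a sketch on invented data under the assumption that 'duration' is given in seconds; the names `toy`, `offset_s`, and `toy_hr` are made up for this illustration and are not part of `BioPsyKit`.

```python
import pandas as pd

# Invented stand-in for the exploded dataframe: two blocks whose
# start timestamp is replicated for every value.
idx = pd.DatetimeIndex(
    ["2020-10-10 22:00:00"] * 3 + ["2020-10-10 23:00:00"] * 2,
    tz="Europe/Berlin",
    name="time",
)
toy = pd.DataFrame(
    {"duration": [60, 60, 60, 30, 30], "value": [55, 57, 56, 60, 61]}, index=idx
)

# Exploded columns have object dtype in the real data, hence the explicit cast.
durations = toy["duration"].astype(int)
grouped = durations.groupby(level="time")

# Cumulative duration within each block, shifted so every block starts at 0 s.
offset_s = grouped.cumsum() - grouped.transform("first")

# Add the offsets (in seconds) to the replicated start timestamps.
new_index = toy.index + pd.to_timedelta(offset_s.to_numpy(), unit="s")

toy_hr = toy.drop(columns="duration")
toy_hr.index = new_index
toy_hr.index.name = "time"
toy_hr.columns = ["heart_rate"]
```

Because the arithmetic happens on the timezone-aware index directly, no `tz_localize`/`tz_convert` round trip is needed in this variant.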

## Filtering and Plotting

### Filter Data by Day

Assume we want to filter only data from a particular date, e.g. Oct 11 2020.

For this, we can slice the index to only include data from this particular date by doing the following steps:
* *Normalize* the `DatetimeIndex` (set the time of every timestamp to midnight)
* Filter for the desired day
* Slice the DataFrame
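
The three steps above can be sketched separately on an invented, timezone-aware dataframe (the values and the 6-hour spacing are made up). As far as I know, pandas' partial string indexing (`.loc["2020-10-11"]`) selects the same rows in one step, so it is shown here as an optional shortcut.

```python
import pandas as pd

# Invented data: one value every 6 hours, starting on the evening of Oct 10.
idx = pd.date_range(
    "2020-10-10 22:00", periods=6, freq=pd.Timedelta(hours=6), tz="Europe/Berlin"
)
toy = pd.DataFrame({"heart_rate": [58, 55, 54, 60, 62, 59]}, index=idx)

# Step 1: normalize the index (every timestamp is set to midnight of its day)
days = toy.index.normalize()
# Step 2: boolean mask for the desired day
mask = days == "2020-10-11"
# Step 3: slice the dataframe with the mask
toy_day = toy.loc[mask]

# Optional shortcut: partial string indexing on the DatetimeIndex
toy_day_alt = toy.loc["2020-10-11"]
```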

In [None]:
df_hr_day = df_hr.loc[df_hr.index.normalize() == '2020-10-11']

In [None]:
df_hr_day

Plot this data as an example:

In [None]:
fig, ax = plt.subplots()
df_hr_day.plot(ax=ax)

ax.legend().remove()
ax.set_ylabel("Heart Rate [bpm]");
ax.set_xlabel("Time");

# That's it!

This code is also available in `BioPsyKit` and can be used like this:

In [None]:
sleep_data = bp.example_data.get_sleep_analyzer_raw_example()

In [None]:
sleep_data.keys()

In [None]:
sleep_data["2020-10-10"].head()

Only load a specific data source (in this case, our example data):

In [None]:
sleep_state_data = bp.example_data.get_sleep_analyzer_raw_file("sleep_state")

Alternatively, load your own Sleep Analyzer raw data:

In [None]:
#sleep_state_data = bp.io.sleep_analyzer.load_withings_sleep_analyzer_raw_file(
#    "<path-to-sleep-analyzer-raw-file.csv>", 
#    data_source="sleep_state"
#)

In [None]:
sleep_state_data["2020-10-10"].head()