# Importing Data from Fitbit 
Fitbit offers an export option for all of a user's data. 
This data is exported in multiple folders, each containing multiple JSON and .csv files. 
For this project, I am interested in the following data: 
- bpm 
- sleep type (deep, light, REM) and length of each sleep type
- sleep score 
- step data
- active minutes (light, moderate, and very active minutes) 
- resting heartrate
- outside temperature at bedtime (10pm local) 

Each section ends with a two-row sample of the resulting dataframe. 

All of the data is time-based. The goal is to reduce each data type to one summary per day. 
For example, the bpm data is recorded every 5 seconds, and I will reduce it to a daily summary.  

**All resulting dataframes will have one datapoint per day, and will be indexed by date (US/Pacific).**
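
As a rough illustration of this reduction, the sketch below collapses a handful of made-up readings into one row per date. The sample values and the output column names `max_bpm` and `average_bpm` are assumptions for the example, and the timestamps are left timezone-naive to keep it short.

```python
import pandas as pd

# Made-up readings taken a few seconds apart on two different days
readings = pd.DataFrame(
    {
        "dateTime": pd.to_datetime(
            [
                "2017-07-19 08:00:00",
                "2017-07-19 08:00:05",
                "2017-07-20 09:30:00",
                "2017-07-20 09:30:05",
            ]
        ),
        "bpm": [70, 80, 60, 90],
    }
)

# Drop the time of day so every reading from one day shares the same key
day = readings["dateTime"].dt.normalize().rename("date")

# One row per day; named aggregation keeps the column names flat
daily = readings.groupby(day).agg(
    max_bpm=("bpm", "max"),
    average_bpm=("bpm", "mean"),
)
print(daily)
```

The result has a `DatetimeIndex` with one entry per day, which is the shape every table in this notebook is reduced to.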

All of the data is then pickled (as loading from the pickle will be much faster than re-reading the JSON). 
Resulting files used in the analysis will be: 
- bpm.pkl 
- sleep_score.pkl
- sedentary_minutes.pkl
- lightly_active.pkl
- moderately_active.pkl
- very_active.pkl
- step_daily.pkl
- resting_heartrate.pkl

**When you're all done:** make sure each dataframe has the expected number of rows and the expected datatype. 
Make sure each pickled file exists, and also that it is NOT being sourced from the test dir. 
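
One possible way to automate part of that checklist is sketched below. The helper name `check_pickles`, the expected row counts, and the temporary directory are all hypothetical; the real check would point at the `data` folder and use the true number of days.

```python
import tempfile
from pathlib import Path

import pandas as pd


def check_pickles(data_dir, expected_rows):
    """Return a list of problems found; an empty list means every check passed.

    expected_rows maps a pickle file name to the number of daily rows it should hold.
    """
    problems = []
    for name, n_rows in expected_rows.items():
        path = Path(data_dir) / name
        if not path.is_file():
            problems.append(f"{name}: file is missing")
            continue
        df = pd.read_pickle(path)
        if len(df) != n_rows:
            problems.append(f"{name}: expected {n_rows} rows, found {len(df)}")
        if not isinstance(df.index, pd.DatetimeIndex):
            problems.append(f"{name}: index is not a DatetimeIndex")
    return problems


# Demonstration against a throwaway directory rather than the real data folder
with tempfile.TemporaryDirectory() as tmp:
    sample = pd.DataFrame(
        {"value": [4040, 9033]},
        index=pd.to_datetime(["2017-07-19", "2017-07-20"]),
    )
    sample.to_pickle(Path(tmp) / "step_daily.pkl")
    report = check_pickles(tmp, {"step_daily.pkl": 2, "bpm.pkl": 2})

print(report)
```

Here `step_daily.pkl` passes and `bpm.pkl` is reported as missing, since only one file was written to the throwaway directory.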

Tables:
- bpm
- sleep levels (done)
- sleep score (done)
- steps (done)
- sedentary minutes (done)
- lightly active minutes (done)
- moderately active minutes (done)
- very active minutes (done)
- resting heartrate (done)
- temperature 
- etc.?

Merge all the tables together using date as the primary key. 
Daily sleep will be the response variable. 
Everything else will be explanatory variables. 
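
A minimal sketch of that merge follows, assuming every table shares a date index. The values are invented, and `overall_score` stands in for the sleep response variable only for illustration. Several of the daily tables store their data in a column named `value`, so each is renamed before joining to avoid clashing names.

```python
import pandas as pd

dates = pd.to_datetime(["2019-08-22", "2019-08-23", "2019-08-24"])

# Stand-ins for three of the daily tables; values are invented
sleep_score = pd.DataFrame({"overall_score": [83.0, 73.0]}, index=dates[:2])
step_daily = pd.DataFrame({"value": [4040, 9033, 11571]}, index=dates)
very_active = pd.DataFrame({"value": [12, 0, 37]}, index=dates)

# Give each generic 'value' column a name that says which table it came from
tables = {"steps": step_daily, "very_active": very_active}
renamed = [df.rename(columns={"value": name}) for name, df in tables.items()]

# Inner join on the date index keeps only days present in every table
merged = pd.concat([sleep_score, *renamed], axis=1, join="inner")

# Response variable and explanatory variables
y = merged["overall_score"]
X = merged.drop(columns="overall_score")
print(merged)
```

An inner join drops any day that is missing from at least one table; `join="outer"` would keep those days and fill the gaps with NaN instead.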



In [236]:
import os
import pandas as pd
import numpy as np 
import datetime as dt 

In [1057]:
# Directory where most of this data lives 
phys_dir = '/Users/jackiekinsler/projects/sleep_analysis_py/physical_data/Physical_Activity'

### IMPORT FROM JSON FUNCTION
This function will help import and concatenate the many JSON files that make up each data type. 

In [297]:
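def import_data_from_dir(file_prefix, directory):
    """Reads JSON file(s) in a folder and returns a single dataframe. 
    Takes strings of file_prefix and directory as input. 
    """
    dfs = []
    # Sort the listing so the files are read in a repeatable order
    for file in sorted(os.listdir(directory)):
        # Match on the start of the name, so 'heart_rate-' skips 'resting_heart_rate-' files
        if file.startswith(file_prefix):
            dfs.append(pd.read_json(os.path.join(directory, file)))
    # pd.concat raises an unhelpful error on an empty list, so fail with a clear message
    if not dfs:
        raise FileNotFoundError(f"No files starting with '{file_prefix}' in {directory}")
    return pd.concat(dfs)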
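
To show how a helper like this behaves, the sketch below writes a few tiny JSON files to a temporary folder and reads them back. The file names and values are invented, and the function is a compact stand-in for the one above rather than a copy of it.

```python
import json
import os
import tempfile

import pandas as pd


def import_data_from_dir(file_prefix, directory):
    """Compact stand-in: read every matching JSON file into one dataframe."""
    dfs = [
        pd.read_json(os.path.join(directory, name))
        for name in sorted(os.listdir(directory))
        if name.startswith(file_prefix)
    ]
    return pd.concat(dfs)


with tempfile.TemporaryDirectory() as tmp:
    # Two hypothetical step files plus one file that should be ignored
    files = {
        "steps-2017-07.json": [
            {"dateTime": "2017-07-19 08:00:00", "value": 12},
            {"dateTime": "2017-07-19 08:05:00", "value": 30},
        ],
        "steps-2017-08.json": [{"dateTime": "2017-08-01 09:00:00", "value": 7}],
        "calories-2017-07.json": [{"dateTime": "2017-07-19 08:00:00", "value": 99}],
    }
    for name, rows in files.items():
        with open(os.path.join(tmp, name), "w") as fh:
            json.dump(rows, fh)

    combined = import_data_from_dir("steps-", tmp)

print(combined)
print(combined.index.is_unique)
```

Each file contributes its own 0-based index, so the combined index contains repeated labels. That is why the index is reset in some of the sections that follow.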

### BPM
Import the bpm data from JSON. 
The JSON data contains a date field and a 'value' field. 
The 'value' field contains a dictionary with 'bpm' and 'confidence'. 
After the data is imported, the nested 'value' column is unnested. 
The index is also reset, as the index values are not unique.

The data is taken every 5 seconds. The data will be reduced to get daily values for max_bpm and average_bpm. 
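
The unnesting step can be sketched on a few invented rows as follows. The repeated index labels mimic what concatenating several files produces; the readings themselves are made up.

```python
import pandas as pd

# Rows as they might look after concatenating two files: the index repeats (0, 1, 0, 1)
bpm_nested = pd.DataFrame(
    {
        "dateTime": pd.to_datetime(
            [
                "2017-07-19 08:00:00",
                "2017-07-19 08:00:05",
                "2017-07-20 08:00:00",
                "2017-07-20 08:00:05",
            ]
        ),
        "value": [
            {"bpm": 70, "confidence": 2},
            {"bpm": 72, "confidence": 3},
            {"bpm": 65, "confidence": 1},
            {"bpm": 66, "confidence": 3},
        ],
    },
    index=[0, 1, 0, 1],
)

# A fresh 0..n-1 index lets the rows line up with the json_normalize output
bpm_nested = bpm_nested.reset_index(drop=True)

# Expand each dictionary into its own 'bpm' and 'confidence' columns
bpm_explode = pd.json_normalize(bpm_nested["value"].tolist())

# Put the timestamps back next to the unnested values
bpm_detail = pd.concat([bpm_nested["dateTime"], bpm_explode], axis=1)
print(bpm_detail)
```

Without the reset, the side-by-side concat would have to align a repeated index against a unique one, which pandas cannot do cleanly.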

In [None]:
bpm_nested = import_data_from_dir('heart_rate-', '/Users/jackiekinsler/projects/sleep_analysis_py/physical_data/heart_rate')
bpm_nested.to_pickle('data/bpm_nested.pkl')

In [1150]:
bpm_nested = pd.read_pickle('data/bpm_nested.pkl')

In [None]:
# Index needs to be reset as there are repeated values (will come into play in the concat function later!)
bpm_nested.reset_index(inplace = True)
# Explode the dictionary in the values column to get 'bpm' and 'confidence'
bpm_explode = pd.json_normalize(bpm_nested['value'])

In [1216]:
# Here, two pieces are brought together: the dateTime column from bpm_nested, 
# and the two exploded columns (bpm, confidence) that make up bpm_explode 
bpm_detail = pd.concat([bpm_nested['dateTime'], bpm_explode], axis = 1)

In [1217]:
# Use dateTime as a DatetimeIndex. Normalize will drop the time information from the datetime 
bpm_detail.set_index(pd.DatetimeIndex(bpm_detail['dateTime']).normalize(), inplace=True)
# Drop the old dateTime column and the unused confidence column
bpm_detail.drop(['dateTime','confidence'], axis=1, inplace=True)


In [1219]:
bpm_detail.to_pickle("data/bpm_detail.pkl")

Now we will aggregate the data and get some daily stats!

In [1228]:
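# Group the readings by day (the index is named 'dateTime')
bpm_grouped = bpm_detail.groupby(by='dateTime')
# Daily max and mean; agg already works per group, so no axis argument is needed
bpm = bpm_grouped.agg(['max', 'mean'])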

In [1232]:
bpm.head(2)

| dateTime | bpm (max) | bpm (mean) |
|---|---|---|
| 2017-07-19 | 108 | 77.52356 |
| 2017-07-20 | 124 | 69.342717 |


### SLEEP DETAIL
Import the detailed sleep data from JSON.  
The raw JSON has many columns, but the 'levels' column is perhaps the most interesting.  
The 'levels' column contains a dictionary of data about the amount of time spent in each sleep type.  
Sleep types include:
- deep
- wake
- light
- REM 

The 'levels' column will be unnested and added back to the dataframe.  
It is important to note that there may be multiple sleep entries for a given day (for example: if there was a long waking period in the middle of sleeping). 

The sleep details for these days will be aggregated into one day. 
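
The sketch below walks through that idea on invented entries: two logs share a date, and one of them has no stage breakdown. The helper `stage_minutes` and all values are hypothetical, and the nested layout (`levels` → `summary` → stage → `minutes`) is assumed from the description above.

```python
import pandas as pd

# Hypothetical sleep log entries: two on the same date, one without a stage summary
sleep_nested = pd.DataFrame(
    {
        "dateOfSleep": ["2022-02-13", "2022-02-14", "2022-02-14"],
        "levels": [
            {"summary": {"deep": {"minutes": 90}, "rem": {"minutes": 80}}},
            {"summary": {"deep": {"minutes": 20}, "rem": {"minutes": 10}}},
            {"summary": {}},
        ],
    }
)


def stage_minutes(levels, stage):
    """Hypothetical helper: minutes in one sleep stage, or None when absent."""
    return levels.get("summary", {}).get(stage, {}).get("minutes")


# Unnest one column per sleep stage; missing values become NaN
for stage in ("deep", "rem"):
    minutes = sleep_nested["levels"].apply(stage_minutes, args=(stage,))
    sleep_nested[f"{stage}_mins"] = minutes.astype("float64")

# Collapse multiple entries for a date into a single daily row
sleep_daily = sleep_nested.drop(columns="levels").groupby("dateOfSleep").sum()
sleep_daily.index = pd.to_datetime(sleep_daily.index)
print(sleep_daily)
```

Because `sum` skips NaN, an entry with no stage data adds nothing to the daily totals rather than turning them into NaN.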

In [1118]:
# Import from the sleep directory 
sleep_nested = import_data_from_dir('sleep-', '/Users/jackiekinsler/projects/sleep_analysis_py/physical_data/Sleep')
# There are some entries that are recorded twice... this is because they are in two of the JSON datasets 
# Remove duplicate logId entries 
sleep_nested.drop_duplicates(subset=['logId'], inplace=True)

Originally, I tried using the `json_normalize` function to get these values, but it was dropping a lot of rows for an unknown reason. I think it had to do with unexpected handling of "None" values. 
Instead, the function below is used, taken from https://medium.com/analytics-vidhya/exploring-your-fitbit-sleep-data-with-python-pandas-and-seaborn-in-jupyter-notebook-a997f17c3a42 
I'd love to use the faster `json_normalize` instead of `apply`, but `apply` is quick enough on this small dataset. 
NOTE: In other instances where I use `json_normalize`, I checked that it did not drop rows. 
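
A check of that kind can be as small as comparing row counts before and after normalizing. The sketch below uses invented records, one of which lacks a key, to show the row being kept with a NaN rather than dropped.

```python
import pandas as pd

# Invented nested records; the last one has no 'error' key
nested = pd.DataFrame(
    {
        "value": [
            {"date": "07/27/22", "value": 52.1, "error": 6.8},
            {"date": "07/28/22", "value": 52.4, "error": 6.9},
            {"date": "07/29/22", "value": 52.8},
        ]
    }
)

flat = pd.json_normalize(nested["value"].tolist())

# json_normalize returns a fresh 0..n-1 index, so compare lengths rather than indexes
rows_lost = len(nested) - len(flat)
assert rows_lost == 0, f"json_normalize dropped {rows_lost} rows"
print(flat)
```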

In [1119]:
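# Returns the minutes spent in a sleep phase, or None if the data is missing
def get_minutes(levels, sleep_phase):
    # Guard against entries where 'levels' is not a dictionary at all
    if not isinstance(levels, dict):
        return None
    # Entries without a stage breakdown have no 'summary', or lack this phase
    summary = levels.get('summary') or {}
    phase = summary.get(sleep_phase) or {}
    # .get returns None when 'minutes' is absent, and keeps a genuine 0
    return phase.get('minutes')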

In [1120]:
sleep_nested['deep_mins'] = sleep_nested.levels.apply(get_minutes, args=('deep',))
sleep_nested['wake_mins'] = sleep_nested.levels.apply(get_minutes, args=('wake',))
sleep_nested['light_mins'] = sleep_nested.levels.apply(get_minutes, args=('light',))
sleep_nested['rem_mins'] = sleep_nested.levels.apply(get_minutes, args=('rem',))

In [1121]:
# keep columns of interest
sleep_minutes = sleep_nested.loc[:,[
    'dateOfSleep', 
    'minutesAsleep', 
    'mainSleep', 
    'deep_mins', 
    'wake_mins', 
    'light_mins', 
    'rem_mins'
]]

In [1127]:
sleep_detail = sleep_minutes.groupby(by='dateOfSleep').sum()
sleep_detail.index = pd.to_datetime(sleep_detail.index)

In [1129]:
sleep_detail.to_pickle("data/sleep_detail.pkl")

In [1142]:
sleep_detail.head(2)

| dateOfSleep | minutesAsleep | mainSleep | deep_mins | wake_mins | light_mins | rem_mins |
|---|---|---|---|---|---|---|
| 2017-07-25 | 471 | 1 | 101.0 | 80.0 | 260.0 | 110.0 |
| 2017-07-30 | 237 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |


### SLEEP SCORE
Sleep score is in a .csv with 1 row per sleep (there may be multiple sleeps per night). 

`overall_score` is a sum of individual scores in sleep duration, sleep quality, and restoration:
- Excellent: 90-100
- Good: 80-89
- Fair: 60-79
- Poor: Less than 60

To understand more about the sleep score: https://help.fitbit.com/articles/en_US/Help_article/2439.htm

The resulting dataframe is indexed by date, with `overall_score`, `sleep_resting_heartrate`, and `deep_sleep_in_minutes` columns. 

In [844]:
sleep_score_full = pd.read_csv('/Users/jackiekinsler/projects/sleep_analysis_py/physical_data/Sleep/sleep_score.csv')
# Keep only the columns of interest (copy so later edits don't touch sleep_score_full)
sleep_score_reduced = sleep_score_full.loc[:,['timestamp', 'overall_score', 'deep_sleep_in_minutes', 'resting_heart_rate']].copy()

In [948]:
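# Example of a date with two sleep entries (one main sleep, one not).
# NOTE: this lookup needs `dateOfSleep` as a column; the output below appears to come from
# an earlier, json_normalize-d version of the table rather than the grouped `sleep_detail` above.
ss = sleep_detail[sleep_detail['dateOfSleep'] == '2022-02-14'] 
ss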

| | dateOfSleep | minutesAsleep | minutesAwake | mainSleep | summary.deep.count | summary.deep.minutes | summary.deep.thirtyDayAvgMinutes | summary.wake.count | summary.wake.minutes | summary.wake.thirtyDayAvgMinutes | summary.light.count | summary.light.minutes | summary.light.thirtyDayAvgMinutes | summary.rem.count | summary.rem.minutes | summary.rem.thirtyDayAvgMinutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36 | 2022-02-14 | 125 | 5 | False | 5.0 | 111.0 | 86.0 | 26.0 | 58.0 | 61.0 | 27.0 | 173.0 | 256.0 | 6.0 | 97.0 | 76.0 |
| 37 | 2022-02-14 | 270 | 46 | True | 4.0 | 101.0 | 85.0 | 33.0 | 60.0 | 61.0 | 31.0 | 243.0 | 257.0 | 6.0 | 85.0 | 75.0 |


In [845]:
# Convert timestamp to a date, and then remove the time portion of the timestamp
sleep_score_reduced['timestamp'] = pd.to_datetime(sleep_score_reduced['timestamp']).dt.normalize()

It is possible to have more than one entry per date (there may be multiple sleeps per night).  

Below, the data is grouped by date. Then a weighted average of `overall_score` and of `resting_heart_rate` is taken for each date, using `deep_sleep_in_minutes` as the weights. 

The `deep_sleep_in_minutes` values are simply summed.
This results in a table with unique dates. 
The date is then used as the index. 
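
As a worked example of the weighting, take two invented sleeps on the same date. The fallback helper `weighted_or_plain` is a hypothetical addition: `np.average` raises a `ZeroDivisionError` when all weights are zero, which could happen on a date with no recorded deep sleep.

```python
import numpy as np

# Two hypothetical sleeps logged on the same date
overall_score = np.array([80.0, 60.0])
deep_sleep_in_minutes = np.array([90, 30])

# sum(score * weight) / sum(weight) = (80*90 + 60*30) / (90 + 30)
weighted = np.average(overall_score, weights=deep_sleep_in_minutes)
print(weighted)


def weighted_or_plain(values, weights):
    """Hypothetical fallback: use the unweighted mean when every weight is zero."""
    if np.sum(weights) == 0:
        return float(np.mean(values))
    return float(np.average(values, weights=weights))


print(weighted_or_plain(overall_score, np.array([0, 0])))
```

The sleep with more deep-sleep minutes pulls the daily score toward its own value: the weighted result is 75.0, compared with an unweighted mean of 70.0.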

In [895]:
# Group entries by date 
grouped_by_time = sleep_score_reduced.groupby(by='timestamp')

# Here, we return a series to maintain the column name (would be lost otherwise)
overall_score = grouped_by_time.apply(
        lambda x: 
        pd.Series({
            'overall_score' : np.average(x.overall_score, weights=x.deep_sleep_in_minutes)
        })
    )
sleep_resting_heartrate = grouped_by_time.apply(
        lambda x: 
        pd.Series({ 
            'sleep_resting_heartrate' : np.average(x.resting_heart_rate, weights=x.deep_sleep_in_minutes)
        })
    )
deep_sleep_in_min = pd.DataFrame(grouped_by_time['deep_sleep_in_minutes'].sum())

In [896]:
# Concatenate the three columns into a new dataframe 
sleep_score = pd.concat([overall_score, sleep_resting_heartrate, deep_sleep_in_min], axis=1)

In [898]:
sleep_score.head(2)

| timestamp | overall_score | sleep_resting_heartrate | deep_sleep_in_minutes |
|---|---|---|---|
| 2019-08-22 00:00:00+00:00 | 83.0 | 48.0 | 102 |
| 2019-08-29 00:00:00+00:00 | 73.0 | 52.0 | 93 |


In [901]:
sleep_score.to_pickle("data/sleep_score.pkl")

### STEPS
Import steps data from JSON.  
Steps data is recorded every few minutes, as the number of steps taken in that period of time.   
The data will be reduced to the total number of steps for each day.   

The resulting table `step_daily` is indexed by date (in US/Pacific) and holds the total number of steps on that date.
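
The time zone matters because a UTC timestamp can fall on a different calendar day in Pacific time. The sketch below uses invented step counts, assumed to be stamped in UTC, to show how the daily totals are bucketed after conversion.

```python
import pandas as pd

# Hypothetical step counts with naive timestamps that are assumed to be UTC
step_detail = pd.DataFrame(
    {
        "dateTime": pd.to_datetime(
            [
                "2022-12-09 06:30:00",  # 22:30 on Dec 8 in Pacific time
                "2022-12-09 18:00:00",  # 10:00 on Dec 9 in Pacific time
                "2022-12-10 07:59:00",  # 23:59 on Dec 9 in Pacific time
            ]
        ),
        "value": [100, 250, 40],
    }
)

# Mark the timestamps as UTC, then shift them to Pacific time
pacific = step_detail["dateTime"].dt.tz_localize("UTC").dt.tz_convert("US/Pacific")

# Sum the steps per Pacific calendar date
step_daily = step_detail.groupby(pacific.dt.date)[["value"]].sum()
step_daily.index = pd.to_datetime(step_daily.index)
print(step_daily)
```

Grouping by the raw UTC dates would instead give 350 steps on Dec 9 and 40 on Dec 10, so the conversion changes which day the late-evening steps count toward.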

In [359]:
step_detail = import_data_from_dir('steps-', phys_dir)
step_detail.to_pickle("data/step_detail.pkl")

In [562]:
step_detail = pd.read_pickle('data/step_detail.pkl')
# Convert the dateTime column from UTC to Pacific 
step_detail['dateTime'] = step_detail['dateTime'].dt.tz_localize('UTC').dt.tz_convert('US/Pacific')

In [563]:
# Get the sum of steps for each day (select 'value' so the dateTime column itself is not summed)
step_daily = step_detail.groupby(step_detail['dateTime'].dt.date)[['value']].sum()
# .groupby turns the grouped column (dateTime) into the index.
# Use pd.to_datetime() to make it a DatetimeIndex
step_daily.index = pd.to_datetime(step_daily.index)
step_daily.to_pickle('data/step_daily.pkl')

In [564]:
step_daily.head(2)

| dateTime | value |
|---|---|
| 2017-07-19 | 4040 |
| 2017-07-20 | 9033 |


In [565]:
# Sample of getting a step value for a specific date 
step_daily.loc['2022-12-9']

value    11571
Name: 2022-12-09 00:00:00, dtype: int64

### ACTIVITY MINUTES

Activity minutes are imported from JSON. 
The documentation does not say which time zone the dateTime stamp is in. 
By looking at the data and aligning it with known activity on different days, I am making the assumption that the data is recorded in local time. If it was actually recorded in UTC, the data would be off by 1 day.

In [715]:
def import_activity_min(file_prefix, directory):
    """
    For activity minutes only! Takes a file_prefix and directory, and returns a data table 
    with a DatetimeIndex and a value for activity minutes for that day. 
    """
    df = import_data_from_dir(file_prefix, directory)
    df.dropna(how='any', inplace=True)
    # Create a DatetimeIndex from the dateTime column, then drop the original dateTime column 
    df.set_index(pd.DatetimeIndex(df['dateTime']), inplace=True)
    df.drop('dateTime', axis=1, inplace=True)
    return df 

In [717]:
# Importing activity minutes 
sedentary_minutes = import_activity_min('sedentary_minutes', phys_dir)
lightly_active = import_activity_min('lightly_active', phys_dir)
moderately_active = import_activity_min('moderately_active', phys_dir)
very_active = import_activity_min('very_active_minutes', phys_dir)

# Pickle the data for future use 
sedentary_minutes.to_pickle("data/sedentary_minutes.pkl")
lightly_active.to_pickle("data/lightly_active.pkl")
moderately_active.to_pickle("data/moderately_active.pkl")
very_active.to_pickle("data/very_active.pkl")

In [726]:
moderately_active.head(2)

| dateTime | value |
|---|---|
| 2017-07-18 | 0 |
| 2017-07-19 | 37 |


In [725]:
# Sample of getting a value for a specific date 
moderately_active.loc['07/29/17']

value    11
Name: 2017-07-29 00:00:00, dtype: int64

### RESTING HEARTRATE

Import resting heartrate from JSON. 

The final table `resting_heartrate` will be indexed by date (in US/Pacific), with a heartrate value and an error. 

In [1058]:
# Importing resting_heartrate data 
resting_heartrate_nested = import_data_from_dir('resting_heart_rate', phys_dir)
# The data in this dataframe is nested, and only the last column ('value') has the needed data 
# NOTE: json_normalize will drop rows where values come in as 'none'... not obvious in documentation
resting_heartrate = pd.json_normalize(resting_heartrate_nested['value'])
# Drop any rows with a NaN value 
resting_heartrate.dropna(how='any', inplace=True)
# Make 'date' the index, and convert it to a Datetime data type 
resting_heartrate.set_index(pd.DatetimeIndex(resting_heartrate['date']), inplace=True)
# Drop the old date column 
resting_heartrate.drop('date', axis=1, inplace=True)

resting_heartrate.to_pickle("data/resting_heartrate.pkl")

In [1059]:
# Sample of getting a resting_heartrate value for a specific date 
resting_heartrate.loc['2022-07-29']

value    52.782105
error    26.761181
Name: 2022-07-29 00:00:00, dtype: float64