Explore the data through graphs and clean it.  
All resulting data will end up in `cleaned_data/`.

Outliers: 
1) Plot all of the data and identify outliers.  
2) Understand where outliers are coming from.  
3) Remove outliers if applicable. 
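
The three steps above can be sketched with a simple interquartile-range (IQR) rule. This is only one possible way to flag candidates and is not necessarily the method used later in this notebook, which relies on a fixed threshold for daily steps. The function name `flag_outliers_iqr`, the multiplier `k=1.5`, and the sample values are assumptions made for illustration.

```python
import pandas as pd


def flag_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)


# Made-up daily step counts with one implausible spike (illustrative only)
steps = pd.Series(
    [8200, 9100, 7600, 10400, 250000, 8800, 9500],
    index=pd.date_range("2020-01-01", periods=7, freq="D", name="date"),
    name="daily_steps",
)

mask = flag_outliers_iqr(steps)
suspicious = steps[mask]  # steps 1 and 2: inspect these rows and find out where they come from
cleaned = steps[~mask]    # step 3: drop them only if the values are genuinely invalid
print(suspicious)
```

A flagged value is only a candidate. Whether it gets removed still depends on understanding its source.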

Other cleaning operations: 
- (done) rename all the indices to 'date' 
- (done) ensure column headers are meaningful (useful for merging later) 
- (done) ensure all the indices are indeed 'DatetimeIndex'
- (done) ensure no duplicates on the index 
- (done) get all the indices sorted by date
- (done) handle any NaN values -- there were none 

All data is read from the `data/` folder and written to the `cleaned_data/` folder. 

In [1]:
import os
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import datetime as dt 

In [2]:
sleep_score_orig = pd.read_pickle('data/sleep_score.pkl')
sleep_detail_orig = pd.read_pickle('data/sleep_detail.pkl')

lightly_active_orig = pd.read_pickle('data/lightly_active.pkl')
moderately_active_orig = pd.read_pickle('data/moderately_active.pkl')
very_active_orig = pd.read_pickle('data/very_active.pkl')
sedentary_minutes_orig = pd.read_pickle('data/sedentary_minutes.pkl')

step_daily_orig = pd.read_pickle('data/step_daily.pkl')

bpm_orig = pd.read_pickle('data/bpm.pkl')
resting_heartrate_orig = pd.read_pickle('data/resting_heartrate.pkl')

In [15]:
dfs = [
    sleep_score_orig, 
    sleep_detail_orig, 
    lightly_active_orig,
    moderately_active_orig,
    very_active_orig,
    sedentary_minutes_orig,
    step_daily_orig,
    bpm_orig,
    resting_heartrate_orig,
]

#### Quick cleaning of the indices 
1) Ensure all the indices are named 'date' (for ease of merging later)   
2) Sort all of the data by date   
3) Ensure all the indices are of DatetimeIndex format  
4) Ensure all the index values are unique  

In [45]:
# Nothing will print if all checks pass 
for df in dfs: 
    df.index.name = 'date'
    df.sort_index(inplace=True)    
    if not isinstance(df.index, pd.DatetimeIndex):
        print(f'{df.head(1)} does not have DatetimeIndex')
    if not df.index.is_unique:
        print(f'{df.head(1)} does not have unique index')
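
The index checks above print the first row of an offending table, which can make it hard to tell which table failed. A possible variant, sketched here under the assumption that the tables are kept in a dict keyed by name, collects the problems per table instead. The helper `index_problems` and the two sample frames are hypothetical and not part of the original notebook.

```python
import pandas as pd


def index_problems(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems found on the frame's index."""
    problems = []
    if not isinstance(df.index, pd.DatetimeIndex):
        problems.append("index is not a DatetimeIndex")
    if not df.index.is_unique:
        problems.append("index has duplicate values")
    if not df.index.is_monotonic_increasing:
        problems.append("index is not sorted")
    return problems


# Hypothetical frames: one clean, one with a string index containing a duplicate
frames = {
    "good": pd.DataFrame(
        {"value": [1, 2]},
        index=pd.to_datetime(["2020-01-01", "2020-01-02"]),
    ),
    "bad": pd.DataFrame(
        {"value": [1, 2]},
        index=["2020-01-02", "2020-01-02"],
    ),
}

report = {name: index_problems(df) for name, df in frames.items()}
for name, problems in report.items():
    for problem in problems:
        print(f"{name}: {problem}")
```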


#### Check for any NaN values 

In [46]:
# Nothing should print if there are no null values 
for df in dfs: 
    if df.isnull().values.any():
        print(f'{df.head(1)} has NaN values')
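
If NaN values did turn up, a per-column count would show where they are. The following is a minimal sketch with made-up frames. The dict `frames`, the sample values, and the name `nan_report` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up frames: one with missing values, one without
frames = {
    "resting_heartrate": pd.DataFrame(
        {"resting_hr": [61.0, np.nan, 59.0], "rest_hr_error": [6.8, 6.9, np.nan]}
    ),
    "step_daily": pd.DataFrame({"daily_steps": [8200, 9100, 7600]}),
}

nan_report = {}
for name, df in frames.items():
    counts = df.isnull().sum()   # number of NaN values per column
    counts = counts[counts > 0]  # keep only the columns that have any
    if not counts.empty:
        nan_report[name] = counts.to_dict()

print(nan_report)
```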

#### Give the data columns meaningful names 
This will be particularly useful later when merging and plotting the data. 

In [84]:
# Many tables have column names that are not very meaningful. Let's fix that.  
sleep_score_orig.rename(columns={'overall_score':'overall_sleep_score'}, inplace=True)
lightly_active_orig.rename(columns={'value':'light_act_mins'}, inplace=True)
moderately_active_orig.rename(columns={'value':'moderate_act_mins'}, inplace=True)
very_active_orig.rename(columns={'value':'very_act_mins'}, inplace=True)
sedentary_minutes_orig.rename(columns={'value':'sedentary_mins'}, inplace=True)
step_daily_orig.rename(columns={'value':'daily_steps'}, inplace=True)
bpm_orig.rename(columns={'max':'bpm_max', 'mean':'bpm_mean'}, inplace=True)
resting_heartrate_orig.rename(columns={'value':'resting_hr', 'error':'rest_hr_error'}, inplace=True)
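
To illustrate why distinct column names help, here is a small sketch of joining two tables on their shared date index. The sample values are made up, and the use of `DataFrame.join` is an assumption about how the merge might be done later, not something this notebook has shown yet.

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=3, freq="D", name="date")

# Two made-up tables that both start with a generic 'value' column
lightly_active = pd.DataFrame({"value": [180, 210, 195]}, index=idx)
very_active = pd.DataFrame({"value": [25, 40, 10]}, index=idx)

# Rename first so that each column says what it measures
lightly_active = lightly_active.rename(columns={"value": "light_act_mins"})
very_active = very_active.rename(columns={"value": "very_act_mins"})

# Join on the shared 'date' index; an outer join keeps dates present in either table
merged = lightly_active.join(very_active, how="outer")
print(merged)
```

Without the renames, `join` would raise an error because both frames have a column called `value` and no suffixes were given.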

#### Explore the data graphically 
Look for outliers / anything suspicious 

In [None]:
sleep_score_orig, 
sleep_detail_orig, 
lightly_active_orig,
moderately_active_orig,
very_active_orig,
sedentary_minutes_orig,
step_daily_orig,
bpm_orig,
resting_heartrate_orig,
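
One way to look over all of the tables at once is to draw one subplot per table. The sketch below uses made-up data. The helper `plot_frames`, the dict of frames, and the plot styling are assumptions for illustration, not the plots originally produced in this notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_frames(frames: dict):
    """Draw every frame on its own subplot so that spikes are easy to spot."""
    fig, axes = plt.subplots(
        len(frames), 1, figsize=(10, 3 * len(frames)), squeeze=False
    )
    for ax, (name, df) in zip(axes[:, 0], frames.items()):
        # Points instead of lines, so isolated values stand out
        df.plot(ax=ax, marker=".", linestyle="none")
        ax.set_title(name)
    fig.tight_layout()
    return fig


# Made-up data with one implausible step count
idx = pd.date_range("2020-01-01", periods=5, freq="D", name="date")
frames = {
    "step_daily": pd.DataFrame(
        {"daily_steps": [8200, 9100, 250000, 7600, 8800]}, index=idx
    ),
    "resting_heartrate": pd.DataFrame(
        {"resting_hr": [61, 60, 62, 59, 61]}, index=idx
    ),
}

fig = plot_frames(frames)
```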


In [None]:
# Use the loaded frame and the renamed 'daily_steps' column (formerly 'value')
step_daily = step_daily_orig[step_daily_orig['daily_steps'] < 50000]

In [None]:
def date_index_check(dataframe):
    # True when the frame is indexed by a DatetimeIndex, False otherwise
    return isinstance(dataframe.index, pd.DatetimeIndex)

In [None]:
dfs = [
    bpm, 
    sleep_detail, 
    sleep_score, 
    sedentary_minutes, 
    lightly_active, 
    moderately_active, 
    very_active, 
    resting_heartrate, 
    df
]

In [None]:
for df in dfs: 
    if not isinstance(df.index, pd.DatetimeIndex):
        print(f'{df.head(1)} does not have DatetimeIndex')

In [None]:
for df in dfs: 
    # sort_index() returns a new frame unless inplace=True is passed
    df.sort_index(inplace=True)

A bit of cleaning... 
After plotting the daily steps I noticed some major outliers. Let's filter for those values and drop them. 

No day should have more than 60,000 steps. 
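
A minimal sketch of that filter, using the 60,000-step limit stated above and made-up data. The constant name `MAX_PLAUSIBLE_STEPS`, the sample values, and the output file name `step_daily.pkl` are assumptions. The sketch writes to a temporary directory so that it does not touch real files; in the notebook the target would be the `cleaned_data/` folder.

```python
import os
import tempfile

import pandas as pd

MAX_PLAUSIBLE_STEPS = 60_000  # limit taken from the note above

# Made-up daily step counts with one implausible value
idx = pd.date_range("2020-01-01", periods=4, freq="D", name="date")
step_daily = pd.DataFrame({"daily_steps": [8200, 250000, 9100, 7600]}, index=idx)

# Look at the offending rows before dropping them
too_high = step_daily["daily_steps"] > MAX_PLAUSIBLE_STEPS
print(step_daily[too_high])

step_daily_clean = step_daily[~too_high]

# Hypothetical output location; replace with 'cleaned_data' in the real notebook
out_dir = os.path.join(tempfile.mkdtemp(), "cleaned_data")
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, "step_daily.pkl")
step_daily_clean.to_pickle(out_path)
```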