# Daily Step Count Methods

Workbook created by Benjamin Winiarski, re-using code from several other notebooks.

The goal of this workbook is to create a pipeline of functions that will transform any dataset into a daily step count table using different calculation methods.

In [7]:
# Importing required libraries
import pandas as pd
import numpy as np
from datetime import datetime

Loading in the reading data functions that were created in Martin's notebooks

In [8]:
# Loading reading data functions from Martin's workbooks
def read_Pacer_data(filename):
    #Read in the data
    dat = pd.read_csv(filename)
    #Select necessary columns
    dat = dat[["date","steps"]].copy()  # copy so the new columns below are not added to a slice
    #Extract datetime data
    dat["datetime"] = pd.to_datetime(dat["date"], format = '%m/%d/%Y, %H:%M:%S %z')
    dat["Date"] = dat["datetime"].dt.date
    dat["Hour"] = dat["datetime"].dt.hour
    dat["Min"] = dat["datetime"].dt.minute
    #Aggregate over the hours
    dat = dat.groupby(["Date","Hour"])["steps"].agg("sum").reset_index()
    #Relabel columns
    dat.columns = ["Date", "Hour", "Steps"]  # flat list; a nested list would create a MultiIndex
    
    return dat

def read_QS_data(filename):
    #Read in CSV file
    dat = pd.read_csv(filename)
    #Extract datetime information
    dat["Datetime"] = pd.to_datetime(dat["Start"], format = '%d-%b-%Y %H:%M')
    dat["Date"] = dat["Datetime"].dt.date
    dat["Hour"] = dat["Datetime"].dt.hour
    #Format columns
    dat = dat[["Date", "Hour", "Steps (count)"]]
    dat.columns = ["Date", "Hour", "Steps"]
    
    return dat
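
As a quick check of the two format strings used by the readers, the sketch below parses one timestamp for each. The sample strings are made up to match the formats and are not taken from the real export files.

```python
import pandas as pd

# Hypothetical sample timestamps, invented to match the two format strings
qs_stamp = pd.to_datetime("07-Dec-2014 08:00", format="%d-%b-%Y %H:%M")
pacer_stamp = pd.to_datetime(
    "12/07/2014, 08:15:30 +0000", format="%m/%d/%Y, %H:%M:%S %z"
)

# The readers keep only the date and hour (plus the minute for Pacer)
print(qs_stamp.date(), qs_stamp.hour)
print(pacer_stamp.date(), pacer_stamp.hour, pacer_stamp.minute)
```

If a file uses a different layout, `pd.to_datetime` raises an error rather than guessing, which is useful when a new export format shows up.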

Loading in the run method functions from V1 of this notebook

In [13]:
# Method 1 calculates daily steps based on all the step activity that is available during the day
def run_method_1(df):
    
    df = df.copy()
    df["Date"] = pd.to_datetime(df["Date"],format='%Y-%m-%d')
    df.index = df["Date"]
    daily_step_count = df.drop(['Hour'], axis=1).resample('D').sum().reset_index()
    
    return(daily_step_count)

# Method 2 calculates daily steps based on the 10 most active hours of the day
def run_method_2(df):
    
    df = df.copy().sort_values(by=["Date", "Steps"], ascending=False).groupby('Date').head(10)
    df["Date"] = pd.to_datetime(df["Date"],format='%Y-%m-%d')
    df.index = df["Date"]
    daily_step_count = df.drop(['Hour'], axis=1).resample('D').sum().reset_index()
    
    return(daily_step_count)

# Method 3 calculates daily steps based on a fixed time block during the day,
# from start_hour to end_hour inclusive (8 to 18 covers 11 hourly bins, not 10)
def run_method_3(df, start_hour, end_hour):
    
    df = df.copy()
    df["Date"] = pd.to_datetime(df["Date"],format='%Y-%m-%d')
    mask = (df['Hour'] >= start_hour) & (df['Hour'] <= end_hour)
    df = df.loc[mask]
    df.index = df["Date"]
    daily_step_count = df.drop(['Hour'], axis=1).resample('D').sum().reset_index()
    
    return(daily_step_count)
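
To make the difference between the three methods concrete, here is a minimal sketch on a made-up single day in which hour `h` has exactly `h` steps. It uses `groupby` rather than `resample`, so unlike the functions above it does not insert zero rows for missing days. The toy data and variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical single day: 24 hourly rows, hour h has h steps
toy = pd.DataFrame({
    "Date": [pd.Timestamp("2020-01-01")] * 24,
    "Hour": list(range(24)),
    "Steps": list(range(24)),
})

# Method 1: sum over every available hour
all_hours = toy.groupby("Date")["Steps"].sum()

# Method 2: sum over the 10 most active hours of each day
top_10 = toy.groupby("Date")["Steps"].apply(lambda s: s.nlargest(10).sum())

# Method 3: sum over a fixed window, both ends inclusive (hours 8..18)
in_window = toy[toy["Hour"].between(8, 18)]
window = in_window.groupby("Date")["Steps"].sum()

print(all_hours.iloc[0], top_10.iloc[0], window.iloc[0])  # 276 185 143
```

Here the top-10 total (hours 14 to 23) is larger than the window total, which shows that Methods 2 and 3 can rank differently depending on when the activity happens.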

## Test the functions on a few datasets

I want to test the sequence of functions on the User1 dataset to see whether it behaves correctly.

In [14]:
filename_user1 = "../../data/Participant_ID_A/User1.csv"
df_user1 = read_QS_data(filename_user1)

dailysteps_user1_method1 = run_method_1(df_user1)
dailysteps_user1_method2 = run_method_2(df_user1)

#Set Start and End Time Range
start_time = 8
end_time = 18
dailysteps_user1_method3 = run_method_3(df_user1, start_time, end_time)


Viewing the results of this test

In [15]:
dailysteps_user1_method1.head()

Unnamed: 0,Date,Steps
0,2014-12-07,2693.0
1,2014-12-08,6567.0
2,2014-12-09,6879.0
3,2014-12-10,7845.0
4,2014-12-11,916.0


We can see that the method 1 function worked well, but there should be some way to merge the three data frames so that we can view the results from all the methods in one data frame.

In [16]:
# Merge the datasets together
dailysteps_user1_merged = dailysteps_user1_method1.merge(dailysteps_user1_method2,on ='Date').merge(dailysteps_user1_method3,on ='Date')
dailysteps_user1_merged.columns = ["Date", "Method_1", "Method_2", "Method_3"]
dailysteps_user1_merged.head()

Unnamed: 0,Date,Method_1,Method_2,Method_3
0,2014-12-07,2693.0,2693.0,2693.0
1,2014-12-08,6567.0,6550.0,6173.0
2,2014-12-09,6879.0,6879.0,6651.0
3,2014-12-10,7845.0,7845.0,7528.0
4,2014-12-11,916.0,916.0,895.0


Merging the data frames worked really well. Now when we run our final analysis we will have the results from all three methods in the same object, which will make it easier to build our graphics.
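
The merge above relies on assigning the column names by position afterwards. A possible alternative, sketched here on made-up numbers, is to rename each `Steps` column before merging so the labels do not depend on column order. The `how="outer"` argument is my own choice for the sketch (it keeps days that are missing from one method as `NaN`); the merge in this notebook uses the default inner join.

```python
import pandas as pd

# Hypothetical per-method results with the same layout as the run_method outputs
dates = pd.to_datetime(["2020-01-01", "2020-01-02"])
frames = {
    "Method_1": pd.DataFrame({"Date": dates, "Steps": [100.0, 200.0]}),
    "Method_2": pd.DataFrame({"Date": dates, "Steps": [90.0, 180.0]}),
    "Method_3": pd.DataFrame({"Date": dates, "Steps": [50.0, 120.0]}),
}

# Give every Steps column its method name before merging
renamed = [df.rename(columns={"Steps": name}) for name, df in frames.items()]

# Merge one frame at a time on the shared Date column
merged = renamed[0]
for other in renamed[1:]:
    merged = merged.merge(other, on="Date", how="outer")

print(merged)
```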

I also want to run some initial analysis of the data.

In [17]:
dailysteps_user1_merged.describe()

Unnamed: 0,Method_1,Method_2,Method_3
count,1754.0,1754.0,1754.0
mean,4873.11188,4677.100134,3637.647143
std,4059.530686,3915.520534,3206.93481
min,0.0,0.0,0.0
25%,1753.0,1753.0,1036.204984
50%,4210.5,3975.0,3025.0
75%,6828.480169,6675.919855,5336.903594
max,30234.470106,28359.0,22032.174027


From the describe output we can see some differences between the results of the different methods: the mean drops from about 4,873 steps for Method 1 to 4,677 for Method 2 and 3,638 for Method 3. While the differences are modest at the moment, I feel that once we add adherence into the picture we will see a drastic difference between the methods.

Finally, we should create a function that runs and merges the three methods.
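
One way to look at the gap between methods day by day, rather than only through summary statistics, is to subtract Method 1 from the other two columns. This is a sketch on made-up numbers; the `merged` frame below only mimics the layout of the merged data frame.

```python
import pandas as pd

# Hypothetical merged table with the same columns as the notebook's result
merged = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "Method_1": [1000.0, 2000.0, 0.0],
    "Method_2": [1000.0, 1900.0, 0.0],
    "Method_3": [800.0, 1500.0, 0.0],
})

# Per-day difference of Methods 2 and 3 relative to Method 1
diff = merged[["Method_2", "Method_3"]].sub(merged["Method_1"], axis=0)

# Summarise how far each method falls below Method 1
summary = diff.agg(["mean", "min", "max"])
print(summary)
```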

In [22]:
def calculate_daily_steps(file_name, file_type, start_time, end_time):
    
    # Pick the reader that matches the export format
    if file_type == "QS":
        df = read_QS_data(file_name)
    elif file_type == "Pacer":
        df = read_Pacer_data(file_name)
    else:
        raise ValueError("file_type must be 'QS' or 'Pacer'")
    
    # Run the three methods on the same hourly table
    dailysteps_method1 = run_method_1(df)
    dailysteps_method2 = run_method_2(df)
    dailysteps_method3 = run_method_3(df, start_time, end_time)
    
    # Merge on Date and label one column per method
    dailysteps_merged = dailysteps_method1.merge(dailysteps_method2,on ='Date').merge(dailysteps_method3,on ='Date')
    dailysteps_merged.columns = ["Date", "Method_1", "Method_2", "Method_3"]
    
    return dailysteps_merged
    

Test this pipeline function out on user 1 again

In [26]:
filename_user1 = "../../data/Participant_ID_A/User1.csv"

start_time = 8
end_time = 18

daily_steps_user1 = calculate_daily_steps(filename_user1, "QS", start_time, end_time)
daily_steps_user1.head()

Unnamed: 0,Date,Method_1,Method_2,Method_3
0,2014-12-07,2693.0,2693.0,2693.0
1,2014-12-08,6567.0,6550.0,6173.0
2,2014-12-09,6879.0,6879.0,6651.0
3,2014-12-10,7845.0,7845.0,7528.0
4,2014-12-11,916.0,916.0,895.0


The result is exactly as we expected, and the data frame is now far easier to produce!
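
Beyond eyeballing the first rows, a simple sanity check is possible: as long as step counts are non-negative, Methods 2 and 3 only sum a subset of the hours, so neither should ever exceed Method 1 on the same day. The helper below is a hypothetical sketch of that check, run on the five rows shown above. Note that there is no such ordering between Method 2 and Method 3.

```python
import pandas as pd

def check_method_ordering(daily, tol=1e-9):
    """Return the rows where Method_2 or Method_3 exceeds Method_1."""
    too_high_2 = daily["Method_2"] > daily["Method_1"] + tol
    too_high_3 = daily["Method_3"] > daily["Method_1"] + tol
    return daily[too_high_2 | too_high_3]

# First five rows of the user 1 result, copied from the output above
sample = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2014-12-07", "2014-12-08", "2014-12-09", "2014-12-10", "2014-12-11"]
    ),
    "Method_1": [2693.0, 6567.0, 6879.0, 7845.0, 916.0],
    "Method_2": [2693.0, 6550.0, 6879.0, 7845.0, 916.0],
    "Method_3": [2693.0, 6173.0, 6651.0, 7528.0, 895.0],
})

violations = check_method_ordering(sample)
print(len(violations))  # 0
```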

Test it all out again on another dataset to make sure it is robust

In [28]:
filename_user2 = "../../data/Participant_ID_C/User3.csv"

start_time = 8
end_time = 18

daily_steps_user2 = calculate_daily_steps(filename_user2, "QS", start_time, end_time)
daily_steps_user2.head()

Unnamed: 0,Date,Method_1,Method_2,Method_3
0,2015-11-28,145.0,145.0,145.0
1,2015-11-29,4233.0,4115.703713,4080.573591
2,2015-11-30,4162.0,4013.0,3906.552367
3,2015-12-01,3209.0,3191.0,2979.0
4,2015-12-02,2773.0,2724.0,2544.0


From this, we can see that the function also works on a second dataset. We can now do an initial analysis on this user.

In [29]:
daily_steps_user2.describe()

Unnamed: 0,Method_1,Method_2,Method_3
count,1784.0,1784.0,1784.0
mean,4676.801009,4616.481099,4009.699816
std,3437.52058,3362.922941,2942.759397
min,8.0,8.0,8.0
25%,2111.25,2105.75,1741.5
50%,4167.0,4111.758886,3651.0
75%,6532.25,6448.142595,5741.715108
max,20913.0,20829.0,20699.0


From here we can see that the methods differ less for this user than they did for user 1: the mean is about 4,677 steps for Method 1 and 4,010 for Method 3. If this is still the case after adding the adherence component, we might need to investigate a new method for calculating the daily steps.