# Daily Step Count Methods

This notebook explores several ways to calculate a user's daily step count, including two that restrict each day to a 10-hour window, and builds a reusable pipeline so that each method can be used in further analysis.

In [27]:
#Load required modules
import pandas as pd
import numpy as np

Start by analysing the User 1 dataset to get a feel for the data before moving on to other datasets.

In [28]:
df_user1 = pd.read_csv("../../data/Participant_ID_A/User1.csv")

In [29]:
# Check that the data was loaded correctly
df_user1.head()

Unnamed: 0,Start,Finish,Steps (count)
0,07-Dec-2014 09:00,07-Dec-2014 10:00,941.0
1,07-Dec-2014 10:00,07-Dec-2014 11:00,408.0
2,07-Dec-2014 11:00,07-Dec-2014 12:00,157.0
3,07-Dec-2014 12:00,07-Dec-2014 13:00,1017.0
4,07-Dec-2014 13:00,07-Dec-2014 14:00,0.0


We can see that the User 1 dataset was loaded correctly, so we can move on with the analysis.

In [30]:
#Look at the datatypes
df_user1.dtypes

Start             object
Finish            object
Steps (count)    float64
dtype: object

I want to convert the Start and Finish columns from generic objects (strings) into datetime objects, as they will be easier to work with later on.

In [31]:
# Re-using Code from Jermery
# Named time_format to avoid shadowing Python's built-in format()
time_format = '%d-%b-%Y %H:%M'
df_user1['Start'] = pd.to_datetime(df_user1['Start'], format=time_format)
df_user1['Finish'] = pd.to_datetime(df_user1['Finish'], format=time_format)

In [32]:
#Check dtypes
df_user1.dtypes

Start            datetime64[ns]
Finish           datetime64[ns]
Steps (count)           float64
dtype: object

We can see that the Start and Finish columns were converted to datetime objects as expected.
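As a side note, here is a minimal sketch of how the same conversion could flag malformed timestamps instead of raising. It uses toy strings in the same layout as the Start column; the `raw`, `parsed` and `n_bad` names are only for illustration and are not part of the original analysis.

```python
import pandas as pd

# Toy rows in the same 'DD-Mon-YYYY HH:MM' layout as the Start column above.
raw = pd.Series(["07-Dec-2014 09:00", "07-Dec-2014 10:00", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%d-%b-%Y %H:%M", errors="coerce")

# Number of rows that failed to parse.
n_bad = int(parsed.isna().sum())
print(parsed)
print("unparseable rows:", n_bad)
```

Counting the NaT values gives a quick check that a new file really follows the expected time format before any aggregation is done.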

## Method 1 - Calculate all steps in a Day (Same as the Paper)

Here we try to replicate the paper: the daily step count is the sum of all steps recorded in a day (with wear time running from the first step to the last step).

I want to group the Start column by date and sum the Steps column to get the daily steps.

The easiest way is to index the dataframe by the start time and resample by day.

In [33]:
df_user1.index = df_user1['Start']
df_user1.head()

Unnamed: 0_level_0,Start,Finish,Steps (count)
Start,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2014-12-07 09:00:00,2014-12-07 09:00:00,2014-12-07 10:00:00,941.0
2014-12-07 10:00:00,2014-12-07 10:00:00,2014-12-07 11:00:00,408.0
2014-12-07 11:00:00,2014-12-07 11:00:00,2014-12-07 12:00:00,157.0
2014-12-07 12:00:00,2014-12-07 12:00:00,2014-12-07 13:00:00,1017.0
2014-12-07 13:00:00,2014-12-07 13:00:00,2014-12-07 14:00:00,0.0


Re-indexing worked as expected. Now we can sum the step counts for each day.

In [34]:
# Assign the Daily Step Count as a separate dataframe
# (numeric_only=True sums only the step counts and skips the datetime columns)
daily_step_count_user1 = df_user1.resample('D').sum(numeric_only=True)

#Have a look at the creation
daily_step_count_user1.head()

Unnamed: 0_level_0,Steps (count)
Start,Unnamed: 1_level_1
2014-12-07,2693.0
2014-12-08,6567.0
2014-12-09,6879.0
2014-12-10,7845.0
2014-12-11,916.0


From a brief look at the table, resampling by day has produced the daily step counts we wanted.
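One detail worth keeping in mind with this approach: resampling by day emits every calendar day in the range, so a day with no rows at all shows up with a sum of 0. The sketch below illustrates this on toy data and contrasts it with a plain `groupby`; it is only an illustration, and whether any of the zero-step days in the real data come from gaps like this has not been checked here.

```python
import pandas as pd

# Toy hourly data with no rows at all for 2014-12-08.
hourly = pd.DataFrame(
    {
        "Start": pd.to_datetime(
            ["2014-12-07 09:00", "2014-12-07 10:00", "2014-12-09 08:00"]
        ),
        "Steps (count)": [941.0, 408.0, 120.0],
    }
)

# resample('D') needs a DatetimeIndex and emits every calendar day in range,
# so the missing 2014-12-08 appears with a sum of 0.
by_resample = hourly.set_index("Start")["Steps (count)"].resample("D").sum()

# groupby on the normalised timestamp only emits days that actually have rows.
by_groupby = hourly.groupby(hourly["Start"].dt.normalize())["Steps (count)"].sum()

print(by_resample)
print(by_groupby)
```

If gap days should be treated as missing rather than as zero-step days, passing `min_count=1` to `sum()` makes the empty days come out as NaN instead.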

In [35]:
#Have a look at some of the details we can find from the daily step counts
daily_step_count_user1.describe()

Unnamed: 0,Steps (count)
count,1754.0
mean,4873.11188
std,4059.530686
min,0.0
25%,1753.0
50%,4210.5
75%,6828.480169
max,30234.470106


From this, we can see the average step count is about 4873 steps a day, with a standard deviation of about 4060 (quite high relative to the mean). The maximum of around 30,000 steps is very high but plausible for a high-activity day.

I will not look into this data any further because the analysis of the daily steps will come in a different workbook.

## Method 2 - 10 hour window by taking the 10 highest hours of step counts from each day

Here I again calculate the daily step counts for the user, but instead of summing every hour, I keep only 10 hours of data per day: the 10 hours with the highest step counts, summed to form the daily total.

In [36]:
#Drop the index
df_user1 = df_user1.reset_index(drop=True)

In [37]:
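# Use the same dataframe for this method. Note that this is a reference to
# df_user1, not a copy, so columns added below also appear on df_user1
df_method2 = df_user1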

Create a Date column for the dataframe

In [38]:
df_method2["Date"] = df_method2["Start"].dt.date

Next, order the dataframe by Date and then by step count, both in descending order.

In [39]:
df_method2.sort_values(by=["Date", "Steps (count)"], ascending=False)

Unnamed: 0,Start,Finish,Steps (count),Date
42064,2019-09-25 00:00:00,2019-09-25 01:00:00,0.0,2019-09-25
42065,2019-09-25 01:00:00,2019-09-25 02:00:00,0.0,2019-09-25
42066,2019-09-25 02:00:00,2019-09-25 03:00:00,0.0,2019-09-25
42067,2019-09-25 03:00:00,2019-09-25 04:00:00,0.0,2019-09-25
42068,2019-09-25 04:00:00,2019-09-25 05:00:00,0.0,2019-09-25
...,...,...,...,...
10,2014-12-07 19:00:00,2014-12-07 20:00:00,0.0,2014-12-07
11,2014-12-07 20:00:00,2014-12-07 21:00:00,0.0,2014-12-07
12,2014-12-07 21:00:00,2014-12-07 22:00:00,0.0,2014-12-07
13,2014-12-07 22:00:00,2014-12-07 23:00:00,0.0,2014-12-07


Now I need a way to keep only the top 10 values for each date.

In [40]:
n_largest_hours = df_method2.sort_values(by=["Date", "Steps (count)"], ascending=False).groupby('Date').head(10)
n_largest_hours

Unnamed: 0,Start,Finish,Steps (count),Date
42064,2019-09-25 00:00:00,2019-09-25 01:00:00,0.0,2019-09-25
42065,2019-09-25 01:00:00,2019-09-25 02:00:00,0.0,2019-09-25
42066,2019-09-25 02:00:00,2019-09-25 03:00:00,0.0,2019-09-25
42067,2019-09-25 03:00:00,2019-09-25 04:00:00,0.0,2019-09-25
42068,2019-09-25 04:00:00,2019-09-25 05:00:00,0.0,2019-09-25
...,...,...,...,...
8,2014-12-07 17:00:00,2014-12-07 18:00:00,33.0,2014-12-07
4,2014-12-07 13:00:00,2014-12-07 14:00:00,0.0,2014-12-07
5,2014-12-07 14:00:00,2014-12-07 15:00:00,0.0,2014-12-07
7,2014-12-07 16:00:00,2014-12-07 17:00:00,0.0,2014-12-07


As you can see, we successfully kept just the 10 largest values for each day. Now we need to sum them to find the total for each day.
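For comparison, here is a minimal sketch of an alternative way to get the same per-day total without sorting the whole dataframe first, using `nlargest` inside a `groupby`. The toy data and the `n_hours` name are assumptions for illustration only (2 is used instead of 10 to keep the example readable).

```python
import pandas as pd

# Toy hourly data for two days.
hourly = pd.DataFrame(
    {
        "Date": ["d1", "d1", "d1", "d2", "d2"],
        "Steps (count)": [10.0, 50.0, 30.0, 5.0, 7.0],
    }
)

n_hours = 2  # the notebook uses 10; 2 keeps the toy example readable

# Sum of the n largest hourly values within each day.
top_n_sum = hourly.groupby("Date")["Steps (count)"].apply(
    lambda s: s.nlargest(n_hours).sum()
)
print(top_n_sum)
```

Both approaches should give the same totals; the sort-then-`head` version used above keeps the full rows, which is handy if the selected hours themselves need to be inspected.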

In [41]:
# Set the Start column as the index and resample by day, as done in the first method
n_largest_hours.index = n_largest_hours['Start']
n_largest_hours.head()

Unnamed: 0_level_0,Start,Finish,Steps (count),Date
Start,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2019-09-25 00:00:00,2019-09-25 00:00:00,2019-09-25 01:00:00,0.0,2019-09-25
2019-09-25 01:00:00,2019-09-25 01:00:00,2019-09-25 02:00:00,0.0,2019-09-25
2019-09-25 02:00:00,2019-09-25 02:00:00,2019-09-25 03:00:00,0.0,2019-09-25
2019-09-25 03:00:00,2019-09-25 03:00:00,2019-09-25 04:00:00,0.0,2019-09-25
2019-09-25 04:00:00,2019-09-25 04:00:00,2019-09-25 05:00:00,0.0,2019-09-25


In [42]:
# Assign the Daily Step Count as a separate dataframe
# (numeric_only=True sums only the step counts and skips the date columns)
daily_step_count_method2 = n_largest_hours.resample('D').sum(numeric_only=True)

#Have a look at the creation
daily_step_count_method2.head()

Unnamed: 0_level_0,Steps (count)
Start,Unnamed: 1_level_1
2014-12-07,2693.0
2014-12-08,6550.0
2014-12-09,6879.0
2014-12-10,7845.0
2014-12-11,916.0


Using the same strategy as in Method 1, I was able to calculate the user's daily steps from the top 10 hours of each day.

In [43]:
#Have a look at some of the details we can find from the daily step counts
daily_step_count_method2.describe()

Unnamed: 0,Steps (count)
count,1754.0
mean,4677.100134
std,3915.520534
min,0.0
25%,1753.0
50%,3975.0
75%,6675.919855
max,28359.0


From this, we can see the average step count has dropped slightly to about 4677 steps a day, with a smaller standard deviation of about 3916 (which is still quite high). The maximum has also dropped slightly to around 28,000 steps, which is still very high but plausible for a high-activity day.

From this initial analysis, taking the top 10 hours of activity as the daily step count changes the results only slightly compared with Method 1 (the mean drops from about 4873 to about 4677). More analysis is required.

## Method 3 - 10 hours in a set window

For Method 3, we calculate the daily step counts using only the steps that occurred during a fixed window of hours each day.

To start, we should identify a good window of time to pick, ideally the one with the most activity.
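One possible way to pick such a window from the data, rather than by eye, is sketched below: average the steps for each hour of the day, then slide a 10-hour window over that profile and keep the start hour with the largest total. The `best_window_start` helper and the toy data are hypothetical and are not part of the original analysis, and the sketch assumes the window does not wrap past midnight.

```python
import numpy as np
import pandas as pd


def best_window_start(hourly_mean, window=10):
    """Return the start hour (0-23) of the contiguous window with the largest total.

    hourly_mean: Series indexed by hour of day (missing hours are treated as 0).
    Windows do not wrap past midnight in this sketch.
    """
    profile = hourly_mean.reindex(range(24), fill_value=0.0).to_numpy()
    totals = [profile[h:h + window].sum() for h in range(24 - window + 1)]
    # argmax returns the earliest start hour if several windows tie.
    return int(np.argmax(totals))


# Toy data: two days with activity concentrated between 08:00 and 17:59.
toy = pd.DataFrame(
    {
        "Hour": list(range(24)) * 2,
        "Steps (count)": ([0.0] * 8 + [100.0] * 10 + [0.0] * 6) * 2,
    }
)
hourly_mean = toy.groupby("Hour")["Steps (count)"].mean()

start = best_window_start(hourly_mean)
print("most active 10-hour window starts at hour", start)
```

On real data the `Hour` column created in the next cell could be used for the `groupby`, and the returned start hour could then feed the mask.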

In [44]:
# Create a new df for this method (copy so that df_user1 is left unchanged)
df_method3 = df_user1.copy()

# Get the hour column from the data (re-using Serena's code)
df_method3["Hour"] = df_method3["Start"].dt.hour

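I now want to calculate the daily number of steps within a fixed window. The cell below uses start_hour = 10 and end_hour = 20 with inclusive bounds, so it keeps the hourly bins starting at 10:00 through 20:00 (that is, 10am to 9pm, which is 11 hourly bins rather than 10). First, I need to remove all the rows that lie outside that window.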

In [45]:
# Create a mask that keeps hours from start_hour to end_hour (inclusive on both ends)
start_hour = 10
end_hour = 20
mask = (df_method3['Hour'] >= start_hour) & (df_method3['Hour'] <= end_hour)

In [46]:
# Apply mask to df and check results
df_method3 = df_method3.loc[mask]

df_method3.head()

Unnamed: 0,Start,Finish,Steps (count),Date,Hour
1,2014-12-07 10:00:00,2014-12-07 11:00:00,408.0,2014-12-07,10
2,2014-12-07 11:00:00,2014-12-07 12:00:00,157.0,2014-12-07,11
3,2014-12-07 12:00:00,2014-12-07 13:00:00,1017.0,2014-12-07,12
4,2014-12-07 13:00:00,2014-12-07 14:00:00,0.0,2014-12-07,13
5,2014-12-07 14:00:00,2014-12-07 15:00:00,0.0,2014-12-07,14


Great, the mask worked: the rows now start at hour 10 (10:00).
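A quick check of how many hourly bins a mask like the one above actually keeps can be sketched as follows. `Series.between` is inclusive on both ends by default, matching the `>=` / `<=` mask, while the half-open variant drops the final hour. The `hours` Series is toy data for illustration.

```python
import pandas as pd

# One entry per hour of the day.
hours = pd.Series(range(24), name="Hour")

start_hour, end_hour = 10, 20

# Inclusive on both ends, as in the mask above: hours 10, 11, ..., 20.
inclusive = hours.between(start_hour, end_hour)

# Half-open alternative: hours 10, 11, ..., 19.
half_open = (hours >= start_hour) & (hours < end_hour)

print("inclusive bins:", int(inclusive.sum()))
print("half-open bins:", int(half_open.sum()))
```

So with these values the inclusive mask covers 11 hourly bins; a strict 10-hour window would need either the half-open comparison or `end_hour = 19`.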

Now we can perform the same steps as in the other methods to get the daily step counts.

In [47]:
#Re-index the df
df_method3.index = df_method3['Start']

# Assign the Daily Step Count as a separate dataframe
# (numeric_only=True sums only the step counts and skips the date columns)
daily_step_count_method3 = df_method3.drop(['Hour'], axis=1).resample('D').sum(numeric_only=True)

#Have a look at the creation
daily_step_count_method3.head()

Unnamed: 0_level_0,Steps (count)
Start,Unnamed: 1_level_1
2014-12-07,1752.0
2014-12-08,6509.0
2014-12-09,6176.892578
2014-12-10,7744.0
2014-12-11,719.0


Great, the procedure works. Now we can do some simple analysis to compare.

In [48]:
#Have a look at some of the details we can find from the daily step counts
daily_step_count_method3.describe()

Unnamed: 0,Steps (count)
count,1754.0
mean,3389.789295
std,3045.602837
min,0.0
25%,791.991845
50%,2896.736473
75%,5087.75
max,19829.102528


From this, we can see the average step count has dropped substantially to about 3390 steps a day (from about 4873 with Method 1), with a smaller standard deviation of about 3046. The maximum has also dropped substantially to around 20,000 steps.

As such, this time window has a big impact on how we calculate a user's daily step data, and it will be useful for comparing how the resulting analysis changes with the different daily step count methods.
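To make that comparison easier, the three daily results could be placed side by side in one dataframe. The sketch below does this for just the first three days shown in the outputs above, typed in by hand as stand-ins for the real Series; the `comparison` and `retained` names are only for illustration.

```python
import pandas as pd

days = pd.date_range("2014-12-07", periods=3, freq="D")

# First three daily totals from the outputs above, standing in for the
# full daily_step_count results of each method.
all_day = pd.Series([2693.0, 6567.0, 6879.0], index=days)
top_10 = pd.Series([2693.0, 6550.0, 6879.0], index=days)
fixed_window = pd.Series([1752.0, 6509.0, 6176.892578], index=days)

# One column per method, aligned on the date index.
comparison = pd.concat(
    {"method_1": all_day, "method_2": top_10, "method_3": fixed_window}, axis=1
)

# Share of the all-day total that each restricted method retains per day.
# Days where method_1 is 0 would give NaN here and need separate handling.
retained = comparison[["method_2", "method_3"]].div(comparison["method_1"], axis=0)

print(comparison.describe().loc[["mean", "std"]])
print(retained.round(3))
```

A per-day ratio like `retained` shows on which days the restricted windows lose the most steps, which the summary statistics alone do not reveal.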

## Data Pipeline

Now that the 3 methods are in place, my aim is to create functions that convert any input dataset into a daily_step_count dataframe using each of the 3 methods described.

Start with a function for the first method.

In [49]:
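def run_method_1(df, time_format, start_col_name):
    # Work on a copy so the caller's dataframe is not modified
    df = df.copy()
    df[start_col_name] = pd.to_datetime(df[start_col_name], format=time_format)
    df.index = df[start_col_name]
    # numeric_only=True sums only numeric columns such as the step count
    daily_step_count = df.resample('D').sum(numeric_only=True)

    return daily_step_count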

In [50]:
# Test the above function using raw data
user1 = pd.read_csv("../../data/Participant_ID_A/User1.csv")

time_format = '%d-%b-%Y %H:%M'
start_col_name = 'Start'

user1_method1 = run_method_1(user1, time_format, start_col_name)
user1_method1.head()

Unnamed: 0_level_0,Steps (count)
Start,Unnamed: 1_level_1
2014-12-07,2693.0
2014-12-08,6567.0
2014-12-09,6879.0
2014-12-10,7845.0
2014-12-11,916.0


We can see that the function works as expected. Now to create and test functions for the other two methods.

In [51]:
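def run_method_2(df, time_format, start_col_name, step_count_col):
    # Work on a copy so the caller's dataframe is not modified
    df = df.copy()
    df[start_col_name] = pd.to_datetime(df[start_col_name], format=time_format)

    # Keep only the 10 hours with the highest step counts for each date
    df["Date"] = df[start_col_name].dt.date
    df2 = df.sort_values(by=["Date", step_count_col], ascending=False).groupby('Date').head(10)

    df2.index = df2[start_col_name]
    # numeric_only=True sums only numeric columns such as the step count
    daily_step_count = df2.resample('D').sum(numeric_only=True)

    return daily_step_count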

In [52]:
# Test the above function using raw data
time_format = '%d-%b-%Y %H:%M'
start_col_name = 'Start'
step_count_col = 'Steps (count)'

user1_method2 = run_method_2(user1, time_format, start_col_name,step_count_col)
user1_method2.head()

Unnamed: 0_level_0,Steps (count)
Start,Unnamed: 1_level_1
2014-12-07,2693.0
2014-12-08,6550.0
2014-12-09,6879.0
2014-12-10,7845.0
2014-12-11,916.0


Works as expected. Now for the final function.

In [53]:
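def run_method_3(df, time_format, start_col_name, start_hour, end_hour):
    # Work on a copy so the caller's dataframe is not modified
    df = df.copy()
    df[start_col_name] = pd.to_datetime(df[start_col_name], format=time_format)

    # Keep rows whose start hour lies in [start_hour, end_hour], inclusive on both ends
    df["Hour"] = df[start_col_name].dt.hour
    mask = (df['Hour'] >= start_hour) & (df['Hour'] <= end_hour)
    df = df.loc[mask]

    df.index = df[start_col_name]
    # numeric_only=True sums only numeric columns such as the step count
    daily_step_count = df.drop(['Hour'], axis=1).resample('D').sum(numeric_only=True)

    return daily_step_count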

In [54]:
# Test the above function using raw data
time_format = '%d-%b-%Y %H:%M'
start_col_name = 'Start'
start_hour = 10
end_hour = 20

user1_method3 = run_method_3(user1, time_format, start_col_name,start_hour, end_hour)
user1_method3.head()

Unnamed: 0_level_0,Steps (count)
Start,Unnamed: 1_level_1
2014-12-07,1752.0
2014-12-08,6509.0
2014-12-09,6176.892578
2014-12-10,7744.0
2014-12-11,719.0


The function looks to be working as expected.

With these functions in place, the next step is to test them on new datasets as they come in, to see whether they really generalise to any dataset that has been formatted to a usable standard.
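As a rough idea of what "formatted to a usable standard" could mean in practice, here is a minimal sketch of a pre-flight check that could be run on a new dataset before passing it to the functions above. The `check_step_data` helper and the specific checks it performs are assumptions for illustration, not an agreed standard.

```python
import pandas as pd


def check_step_data(df, time_format, start_col_name="Start",
                    step_count_col="Steps (count)"):
    """Return a list of problems found; an empty list means the frame looks usable."""
    problems = []
    for col in (start_col_name, step_count_col):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    if problems:
        return problems

    # Unparseable timestamps become NaT so they can be counted.
    starts = pd.to_datetime(df[start_col_name], format=time_format, errors="coerce")
    if starts.isna().any():
        problems.append(f"{int(starts.isna().sum())} unparseable start times")
    if starts.duplicated().any():
        problems.append("duplicate start times")

    # Non-numeric step counts become NaN; negative counts are flagged.
    steps = pd.to_numeric(df[step_count_col], errors="coerce")
    if (steps < 0).any():
        problems.append("negative step counts")
    return problems


# Toy frames: one well-formed, one with a bad timestamp and a negative count.
good = pd.DataFrame({"Start": ["07-Dec-2014 09:00", "07-Dec-2014 10:00"],
                     "Steps (count)": [941.0, 408.0]})
bad = pd.DataFrame({"Start": ["07-Dec-2014 09:00", "oops"],
                    "Steps (count)": [941.0, -1.0]})

print(check_step_data(good, "%d-%b-%Y %H:%M"))
print(check_step_data(bad, "%d-%b-%Y %H:%M"))
```

Returning a list of problems rather than raising keeps the check easy to run over many files at once and report on all of them together.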