Stanley Dong - 480428169

Personal Informatics 

Part Assigned: Data Processing Pipeline, specifically time blocks and adherence measures (what counts as a 'valid day')

**Driving Question:** For a small number of users, how does the Althoff et al. (2017) calculation of step counts differ from one that requires 10 hours of the day with steps?

The purpose of this EDA is to define adherence, that is, what counts as a valid day. We want to find alternatives to the rule of 10 hours of non-zero steps per day. I drew inspiration from the 3 valid-day definitions in Assignment 1 Paper 2.

Valid day Criteria
1. Total step count for the day greater than 500 steps.
2. 10 hours of non-zero step counts. (This is our current driving question)
3. Across 3 time blocks (3am-11am, 11am-3pm, 3pm-3am), a day counts as valid if every block contains more than 0 steps.

We will therefore explore criteria 1 and 3 as possible directions for further research and see how many days of the original dataset remain once each is applied.
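The three criteria above can be sketched as per-day checks. This is a minimal sketch on made-up data, assuming an hourly table with the `Date`, `Hour` and `Steps (count)` columns used later in this notebook; the helper name `valid_day_flags` and the toy values are hypothetical and not part of the original analysis.

```python
import pandas as pd

def valid_day_flags(hourly: pd.DataFrame) -> pd.DataFrame:
    """Return one row per date with a True/False flag for each criterion."""
    steps = hourly["Steps (count)"]
    by_date = hourly["Date"]

    # Criterion 1: total steps for the day greater than 500.
    total_steps = steps.groupby(by_date).sum()

    # Criterion 2: at least 10 hourly rows with a non-zero step count.
    nonzero_hours = (steps > 0).groupby(by_date).sum()

    # Criterion 3: steps in each of the blocks 3am-11am, 11am-3pm, 3pm-3am.
    def block(hour: int) -> int:
        if 3 <= hour <= 10:
            return 1
        if 11 <= hour <= 14:
            return 2
        return 3  # start hours 15-23 and 0-2

    active = hourly.loc[steps > 0].assign(Block=lambda d: d["Hour"].map(block))
    active_blocks = active.groupby("Date")["Block"].nunique()
    # Days with no steps at all are missing from `active`; count them as 0 blocks.
    active_blocks = active_blocks.reindex(total_steps.index, fill_value=0)

    return pd.DataFrame({
        "over_500_steps": total_steps > 500,
        "ten_nonzero_hours": nonzero_hours >= 10,
        "all_three_blocks": active_blocks == 3,
    })

# Toy data: day 1 has 100 steps in each hour from 7 to 18; day 2 has 600 steps at noon only.
toy = pd.DataFrame({
    "Date": ["01-Jan-2020"] * 24 + ["02-Jan-2020"] * 24,
    "Hour": list(range(24)) * 2,
    "Steps (count)": [100.0 if 7 <= h <= 18 else 0.0 for h in range(24)]
                     + [600.0 if h == 12 else 0.0 for h in range(24)],
})
flags = valid_day_flags(toy)
print(flags)
```

In this toy example the first day passes all three checks, while the second passes only the 500-step check. Calling `flags.mean()` would give the share of days retained under each criterion.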

In [None]:
#Importing Library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

In [None]:
#Reading in data
df = pd.read_csv("/content/User1.csv")

In [None]:
#Having a quick look at the data again
df

Unnamed: 0,Start,Finish,Steps (count)
0,07-Dec-2014 09:00,07-Dec-2014 10:00,941.0
1,07-Dec-2014 10:00,07-Dec-2014 11:00,408.0
2,07-Dec-2014 11:00,07-Dec-2014 12:00,157.0
3,07-Dec-2014 12:00,07-Dec-2014 13:00,1017.0
4,07-Dec-2014 13:00,07-Dec-2014 14:00,0.0
...,...,...,...
42071,25-Sep-2019 07:00,25-Sep-2019 08:00,0.0
42072,25-Sep-2019 08:00,25-Sep-2019 09:00,0.0
42073,25-Sep-2019 09:00,25-Sep-2019 10:00,0.0
42074,25-Sep-2019 10:00,25-Sep-2019 11:00,0.0


In [None]:
#Splitting the start into a date and hour column. This will help ease our computation later on. 
df["Date"] = [s.split(" ")[0] for s in df["Start"]]
df["Hour"] = [int((s.split(":")[0])[-2:]) for s in df["Start"]]
df 

Unnamed: 0,Start,Finish,Steps (count),Date,Hour
0,07-Dec-2014 09:00,07-Dec-2014 10:00,941.0,07-Dec-2014,9
1,07-Dec-2014 10:00,07-Dec-2014 11:00,408.0,07-Dec-2014,10
2,07-Dec-2014 11:00,07-Dec-2014 12:00,157.0,07-Dec-2014,11
3,07-Dec-2014 12:00,07-Dec-2014 13:00,1017.0,07-Dec-2014,12
4,07-Dec-2014 13:00,07-Dec-2014 14:00,0.0,07-Dec-2014,13
...,...,...,...,...,...
42071,25-Sep-2019 07:00,25-Sep-2019 08:00,0.0,25-Sep-2019,7
42072,25-Sep-2019 08:00,25-Sep-2019 09:00,0.0,25-Sep-2019,8
42073,25-Sep-2019 09:00,25-Sep-2019 10:00,0.0,25-Sep-2019,9
42074,25-Sep-2019 10:00,25-Sep-2019 11:00,0.0,25-Sep-2019,10
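
An alternative to the string splitting above is to parse `Start` as a timestamp. This is a minimal sketch, assuming the `DD-Mon-YYYY HH:MM` format shown in the table and English month abbreviations under the default locale; the small `sample` frame is made up for illustration.

```python
import pandas as pd

sample = pd.DataFrame({"Start": ["07-Dec-2014 09:00", "25-Sep-2019 10:00"]})

# Parse with an explicit format so that a malformed value raises an error.
start = pd.to_datetime(sample["Start"], format="%d-%b-%Y %H:%M")

sample["Date"] = start.dt.strftime("%d-%b-%Y")  # same text form as the split-based column
sample["Hour"] = start.dt.hour                  # integer start hour, 0-23
print(sample)
```

Keeping the parsed `start` values as a datetime column would also let dates sort chronologically, whereas the text dates sort alphabetically.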


In [None]:
#Issues of daylight savings, and not starting from the start of a day (first and last entry of dataset)
df1 = df.value_counts("Date") 
df1 

Date
03-Apr-2016    25
02-Apr-2017    25
01-Apr-2018    25
05-Apr-2015    25
07-Apr-2019    25
               ..
01-Oct-2017    23
02-Oct-2016    23
04-Oct-2015    23
07-Dec-2014    15
25-Sep-2019    12
Length: 1754, dtype: int64

Daylight savings shows up here: some days have 25 or 23 hourly rows, and the first and last days of the dataset are only partial (15 and 12 rows). This will affect our computation, but we will assume there is no effect for now. (Another group member is responsible for resolving this.)
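
One way to see which days are affected is to flag every date that does not have exactly 24 hourly rows. This is a minimal sketch on made-up row counts, assuming a `Date` column like the one built earlier; it only identifies such days and does not decide how they should be treated.

```python
import pandas as pd

# Toy data: one full day, one 25-row day (clocks moved back), one partial day.
toy = pd.DataFrame({
    "Date": ["01-Jan-2020"] * 24 + ["05-Apr-2020"] * 25 + ["06-Apr-2020"] * 12
})

rows_per_day = toy.groupby("Date").size()
irregular = rows_per_day[rows_per_day != 24]  # days that are not exactly 24 rows
print(irregular)
```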

OK, on to our first exploration: how many days have more than 500 steps. This is one of the ways in which Tang et al. defined a valid day.

In [None]:
#Adding up all the steps in a day
df1 = df['Steps (count)'].groupby(df['Date']).sum()
print(df1)

Date
01-Apr-2015     7828.000000
01-Apr-2016     6831.350375
01-Apr-2017     4122.000000
01-Apr-2018    11518.000000
01-Apr-2019     5484.000000
                   ...     
31-May-2019     6424.000000
31-Oct-2015     7534.000000
31-Oct-2016      453.000000
31-Oct-2017     2198.000000
31-Oct-2018     6444.139901
Name: Steps (count), Length: 1754, dtype: float64


Here we simply summed the steps for each day.

Now we are going to apply a 500 step filter and find out the percentage of days that are valid by the 500 step definition. 

In [None]:
#Number of days with steps greater than 500
print(sum(df1 > 500))
print(sum(df1 > 500)/len(df1))

1533
0.8740022805017104


It appears that 1533 of the 1754 days are considered valid. This is 87% of days.

Moving on to our second exploration, we are going to apply the 3 time block definition.

In [None]:
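#Valid day is if there is data within 3 time periods: 3am-11am, 11am-3pm, 3pm-3am
#Bin edges on the start hour: (-1,2] -> block 3, (2,10] -> block 1, (10,14] -> block 2, (14,25] -> block 3
bins = [-1,2,10,14,25,np.inf]
names = ['3','1','2', '3', '3']
#ordered=False is required because the label '3' is repeated
df2 = df.copy() #copy so that df itself is not modified
df2['3timeblock'] = pd.cut(df2['Hour'],bins, labels = names, ordered=False)
df2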

Unnamed: 0,Start,Finish,Steps (count),Date,Hour,3timeblock
0,07-Dec-2014 09:00,07-Dec-2014 10:00,941.0,07-Dec-2014,9,1
1,07-Dec-2014 10:00,07-Dec-2014 11:00,408.0,07-Dec-2014,10,1
2,07-Dec-2014 11:00,07-Dec-2014 12:00,157.0,07-Dec-2014,11,2
3,07-Dec-2014 12:00,07-Dec-2014 13:00,1017.0,07-Dec-2014,12,2
4,07-Dec-2014 13:00,07-Dec-2014 14:00,0.0,07-Dec-2014,13,2
...,...,...,...,...,...,...
42071,25-Sep-2019 07:00,25-Sep-2019 08:00,0.0,25-Sep-2019,7,1
42072,25-Sep-2019 08:00,25-Sep-2019 09:00,0.0,25-Sep-2019,8,1
42073,25-Sep-2019 09:00,25-Sep-2019 10:00,0.0,25-Sep-2019,9,1
42074,25-Sep-2019 10:00,25-Sep-2019 11:00,0.0,25-Sep-2019,10,1


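As the data is shaped in 24-hour calendar days, assigning the time blocks was a little complicated: the 3pm-3am block wraps past midnight, so hours 0-2 and 15-23 both map to block 3. Note that hours 0-2 are counted with their own calendar date, not with the previous evening.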
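
The same mapping can be written with explicit conditions, which makes the wrap past midnight visible. This is a minimal sketch using `numpy.select`; the function name `assign_timeblock` is hypothetical, and it assumes `Hour` holds integer start hours from 0 to 23.

```python
import numpy as np
import pandas as pd

def assign_timeblock(hours: pd.Series) -> pd.Series:
    """Label each start hour with block 1 (3am-11am), 2 (11am-3pm) or 3 (3pm-3am)."""
    conditions = [
        hours.between(3, 10),   # start hours 03:00 to 10:00
        hours.between(11, 14),  # start hours 11:00 to 14:00
    ]
    # Everything else (15-23 and 0-2) falls into block 3.
    labels = np.select(conditions, [1, 2], default=3)
    return pd.Series(labels, index=hours.index, name="3timeblock")

hours = pd.Series([0, 2, 3, 10, 11, 14, 15, 23])
blocks = assign_timeblock(hours)
print(blocks.tolist())
```

Unlike the `pd.cut` cell above, this sketch returns integer labels rather than the strings '1', '2' and '3'.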

Now we aggregate the step counts within each time block of each day, and add a True/False flag showing whether the block has any activity.

In [None]:
df2 = df2.groupby(['Date','3timeblock'], as_index=False)['Steps (count)'].sum()
print(df2)
df2['Activity'] = df2['Steps (count)'] > 0
print(df2)

             Date 3timeblock  Steps (count)
0     01-Apr-2015          1    1721.000000
1     01-Apr-2015          2    4105.000000
2     01-Apr-2015          3    2002.000000
3     01-Apr-2016          1    3671.836598
4     01-Apr-2016          2    1608.797745
...           ...        ...            ...
5257  31-Oct-2017          2       0.000000
5258  31-Oct-2017          3    1425.000000
5259  31-Oct-2018          1    2148.046634
5260  31-Oct-2018          2    1074.023317
5261  31-Oct-2018          3    3222.069951

[5262 rows x 3 columns]
             Date 3timeblock  Steps (count)  Activity
0     01-Apr-2015          1    1721.000000      True
1     01-Apr-2015          2    4105.000000      True
2     01-Apr-2015          3    2002.000000      True
3     01-Apr-2016          1    3671.836598      True
4     01-Apr-2016          2    1608.797745      True
...           ...        ...            ...       ...
5257  31-Oct-2017          2       0.000000     False
5258  31-Oct-20

The two printouts show the per-block totals before and after the 'Activity' flag is added.

Now we sum the 'Activity' column per day and check whether each day reaches 3 Trues. If it does, it counts as a valid day by our definition.

In [None]:
df2 = df2['Activity'].groupby(df2['Date']).sum()
print(sum(df2==3))
print(sum(df2==3)/(len(df2)))

985
0.5615735461801596


As you can see, this method yields 985 valid days out of 1754, which is 56% of the original days.