# Calculating Step Count Metrics

### Driving Question
This booklet provides Python code for calculating each of the relevant step-count metrics. Each metric is written as a Python function that acts on a pandas DataFrame in the required format. The data come from the `User1.csv` file, and the functions should work on other files structured in the same way.

### Importing the Data

In [2]:
#Import required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

We import the `User1.csv` file from the data folder.

In [3]:
dat = pd.read_csv("../data/User1.csv")
dat.head(4)

Unnamed: 0,Start,Finish,Steps (count)
0,07-Dec-2014 09:00,07-Dec-2014 10:00,941.0
1,07-Dec-2014 10:00,07-Dec-2014 11:00,408.0
2,07-Dec-2014 11:00,07-Dec-2014 12:00,157.0
3,07-Dec-2014 12:00,07-Dec-2014 13:00,1017.0


If the data has been imported properly, the first 4 rows should appear as above. We rename the columns for ease of use later.

In [4]:
dat.columns = ["Start", "Finish", "Steps"]

### Calculating Daily Total Steps and Number of Hours Walked

The simplest metrics are the total number of steps walked in a day and the number of hours in that day with a non-zero step count. These will be the basis of our other metrics. We also want to remove any day that does not have all 24 hours measured (typically the first and last days of measurement) so that days are compared fairly.

In [5]:
dat["Date"] = [s.split(" ")[0] for s in dat["Start"]]
dat["Hour"] = [int((s.split(":")[0])[-2:]) for s in dat["Start"]]
dat.head(4)

Unnamed: 0,Start,Finish,Steps,Date,Hour
0,07-Dec-2014 09:00,07-Dec-2014 10:00,941.0,07-Dec-2014,9
1,07-Dec-2014 10:00,07-Dec-2014 11:00,408.0,07-Dec-2014,10
2,07-Dec-2014 11:00,07-Dec-2014 12:00,157.0,07-Dec-2014,11
3,07-Dec-2014 12:00,07-Dec-2014 13:00,1017.0,07-Dec-2014,12


We can see that we have extracted the date and hour of each observation. This will help when aggregating the number of steps over each day.
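
The string splitting above relies on the exact `DD-Mon-YYYY HH:MM` layout of the `Start` column. As an alternative, here is a minimal sketch that parses the timestamps with `pd.to_datetime` instead. The `sample` frame and the format string are assumptions based on the rows shown above, and `%b` assumes English month abbreviations.

```python
import pandas as pd

# Hypothetical sample in the same layout as the Start column above
sample = pd.DataFrame(
    {"Start": ["07-Dec-2014 09:00", "07-Dec-2014 10:00", "08-Dec-2014 00:00"]}
)

# Parse the timestamps once; the format string is an assumption about the file
start = pd.to_datetime(sample["Start"], format="%d-%b-%Y %H:%M")

# Keep Date as a string in the original style so it matches the rest of the notebook
sample["Date"] = start.dt.strftime("%d-%b-%Y")
sample["Hour"] = start.dt.hour

print(sample)
```

Parsing also fails loudly on a malformed timestamp, whereas the split-based version may produce a wrong hour or an unrelated error.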

In [6]:
#Removes dates from the data frame that do not have a record for each hour
def rmv_partial_dates(df):
    #Flag the dates that do not have exactly 24 measurements
    date_counts = df["Date"].value_counts() != 24
    #Create a list of these dates
    partial_dates = date_counts.index[date_counts].to_list()
    
    #Filter out all dates that do not have the required number
    for date in partial_dates:
        df = df[df["Date"] != date]
    
    return df
    
    
dat1 = rmv_partial_dates(dat)
print(dat.shape[0], dat1.shape[0])

42076 41832
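
The loop above filters the frame once per partial date. A possible vectorised variant is sketched below; `rmv_partial_dates_vec`, its `hours_per_day` parameter and the `toy` data are hypothetical names introduced for illustration only.

```python
import pandas as pd

def rmv_partial_dates_vec(df, hours_per_day=24):
    # Attach to every row the number of rows that share its Date
    counts = df.groupby("Date")["Date"].transform("size")
    # Keep only rows belonging to dates with a full set of measurements
    return df[counts == hours_per_day]

# Toy data: one complete day (24 rows) and one partial day (3 rows)
toy = pd.DataFrame({
    "Date": ["01-Jan-2015"] * 24 + ["02-Jan-2015"] * 3,
    "Hour": list(range(24)) + [0, 1, 2],
    "Steps": [10.0] * 27,
})

kept = rmv_partial_dates_vec(toy)
print(toy.shape[0], kept.shape[0])
```

On this toy frame only the 24 rows of the complete day remain.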


### Only counting days with at least n hours of non-zero observations

Notice that we dropped 244 rows (42076 - 41832) belonging to days that did not contain a full 24 hours. Now we can identify the days that have enough hours of non-zero observations.

In [18]:
def days_with_required_hours(df, n = 10, minsteps = 0):
    #Only look at observations with more than minsteps steps
    df1 = df[df["Steps"] > minsteps]
    #Flag the days that have at least n hours of such observations left
    required_hours = df1["Date"].value_counts() >= n
    
    #Return the flagged dates as a list
    return required_hours.index[required_hours.values].to_list()
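
To make the behaviour concrete, here is a small self-contained sketch on made-up data. It uses a hypothetical helper, `days_with_at_least_n_hours`, that follows the "at least n hours" reading of the heading and counts distinct active hours per day with `groupby`; the `toy` frame is invented for illustration.

```python
import pandas as pd

def days_with_at_least_n_hours(df, n=10, minsteps=0):
    # Keep only hours with more than minsteps steps
    active = df[df["Steps"] > minsteps]
    # Count the distinct active hours in each day
    hours_per_day = active.groupby("Date")["Hour"].nunique()
    # Return the dates that reach the threshold, in sorted order
    return hours_per_day[hours_per_day >= n].index.to_list()

# Toy data: day 1 has three active hours, day 2 has one
toy = pd.DataFrame({
    "Date": ["01-Jan-2015"] * 4 + ["02-Jan-2015"] * 4,
    "Hour": [0, 1, 2, 3] * 2,
    "Steps": [5.0, 0.0, 7.0, 9.0, 0.0, 0.0, 3.0, 0.0],
})

result = days_with_at_least_n_hours(toy, n=3)
print(result)
```

With `n=3` only the first day qualifies; lowering `n` to 1 would return both days.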

We now put everything together, creating a new dataframe that gives the total steps and the number of hours walked in each day, and whether or not the day reached the required number of non-zero hours. For this we do not need the previous function; we simply count the non-zero hours and then introduce a new Boolean variable.

In [28]:
def daily_steps_df(df):
    #Remove any partial dates and select required columns
    df1 = rmv_partial_dates(df)[["Steps","Date","Hour"]]
    #Pivot data which helps with aggregating
    df1 = df1.pivot(index = "Date", columns = "Hour")
    #Find number of nonzero measurements and sum of steps
    df1["nonzero"] = df1.agg(np.count_nonzero, axis = 1)
    df1["sum"] = df1.drop(columns = ["nonzero"]).agg(sum, axis = 1)
    #Create new dataframe with this information
    df2 = pd.DataFrame(df1["sum"])
    df2["nonzero"] = df1["nonzero"]
    
    return df2
    

dat1 = daily_steps_df(dat)
dat1["required"] = dat1["nonzero"] >= 5
dat1.head(10)

Date,sum,nonzero,required
01-Apr-2015,7828.0,10,True
01-Apr-2016,6831.350375,15,True
01-Apr-2017,4122.0,7,True
01-Apr-2019,5484.0,22,True
01-Aug-2015,5746.0,6,True
01-Aug-2016,4758.0,11,True
01-Aug-2017,4986.0,10,True
01-Aug-2018,4282.0,6,True
01-Aug-2019,21837.0,10,True
01-Dec-2015,3593.0,6,True


In [29]:
print(dat1.shape)
#print(dat1[dat1["required"]])
sum(dat1["required"])

(1743, 3)


1373
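
So 1373 of the 1743 complete days, roughly 79%, have at least 5 non-zero hours. Because a Boolean column sums to the number of `True` values and averages to their share, both figures can be read off directly. A minimal sketch on a made-up `required` series:

```python
import pandas as pd

# Hypothetical Boolean flags, standing in for dat1["required"]
required = pd.Series([True, True, False, True])

# Summing counts the True values; the mean gives their proportion
n_required = int(required.sum())
share = float(required.mean())

print(n_required, share)
```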