# Reading Pacer Data

Workbook created by Martin Gossow. Much of the code is based on the *readingQS* workbook.

### Driving Question

This notebook mirrors the *readingQS* notebook in developing a pipeline for converting raw data given by the Pacer app into the standardised form that can be used in analysis. We then create a function for doing this automatically.

### Setting up Packages and Filename

In [10]:
#Import required packages
import numpy as np
import pandas as pd
from datetime import datetime

In [13]:
#File path for csv file
FILEPATH = "../../data/Participant_ID_01/DetailedSteps_2020_10_24_1932.csv"

Note that for the Pacer data logs, we want to use the *DetailedSteps* file so that we can get an hour-by-hour breakdown of the step counts.
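
When a participant folder holds several exports, the *DetailedSteps* files could be picked out by name. The following is a minimal sketch, not part of the original pipeline: `find_detailed_steps` is a hypothetical helper, and the `DetailedSteps_*.csv` pattern is an assumption based on the single filename shown above.

```python
from pathlib import Path

def find_detailed_steps(folder):
    """Return the DetailedSteps CSV files in `folder`, sorted by filename.

    Hypothetical helper: assumes every detailed export is named
    DetailedSteps_<something>.csv, as the one example file is.
    """
    return sorted(Path(folder).glob("DetailedSteps_*.csv"))
```

If the timestamp in the filename always follows the year, month, day, time order seen above, sorting by name also sorts the files chronologically, but that is an assumption worth checking against the actual exports.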

### Reading the Data

We use the pandas `read_csv` function, which converts the file straight into a dataframe.

In [14]:
dat = pd.read_csv(FILEPATH)
dat.head(8)

Unnamed: 0,date,steps,grossCalories,calories,distanceInMeters,activeTimeInSeconds
0,"10/18/2020, 00:00:00 +1100",1.0,0.0,0.0,0.0,0.0
1,"10/18/2020, 00:15:00 +1100",0.0,0.0,0.0,0.0,0.0
2,"10/18/2020, 00:30:00 +1100",0.0,0.0,0.0,0.0,0.0
3,"10/18/2020, 00:45:00 +1100",0.0,0.0,0.0,0.0,0.0
4,"10/18/2020, 01:00:00 +1100",1.0,0.0,0.0,0.0,0.0
5,"10/18/2020, 01:15:00 +1100",0.0,0.0,0.0,0.0,0.0
6,"10/18/2020, 01:30:00 +1100",0.0,0.0,0.0,0.0,0.0
7,"10/18/2020, 01:45:00 +1100",0.0,0.0,0.0,0.0,0.0


### Formatting the Data

Again, there are a few things to change. We are only interested in the number of steps, the date and the hour, so we will need to aggregate over the 15-minute intervals within each hour. We start by removing the other columns and converting the date column to a datetime object.

In [15]:
#Select necessary columns (copy so new columns can be added without a SettingWithCopyWarning)
dat = dat[["date","steps"]].copy()
#Extract datetime data
dat["datetime"] = pd.to_datetime(dat["date"], format = '%m/%d/%Y, %H:%M:%S %z')
dat["Date"] = dat["datetime"].dt.date
dat["Hour"] = dat["datetime"].dt.hour
dat["Min"] = dat["datetime"].dt.minute
dat.head(5)

Unnamed: 0,date,steps,datetime,Date,Hour,Min
0,"10/18/2020, 00:00:00 +1100",1.0,2020-10-18 00:00:00+11:00,2020-10-18,0,0
1,"10/18/2020, 00:15:00 +1100",0.0,2020-10-18 00:15:00+11:00,2020-10-18,0,15
2,"10/18/2020, 00:30:00 +1100",0.0,2020-10-18 00:30:00+11:00,2020-10-18,0,30
3,"10/18/2020, 00:45:00 +1100",0.0,2020-10-18 00:45:00+11:00,2020-10-18,0,45
4,"10/18/2020, 01:00:00 +1100",1.0,2020-10-18 01:00:00+11:00,2020-10-18,1,0
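
Every timestamp in this file carries the same `+1100` offset. If a log were to span a daylight-saving change, the offsets would differ and pandas may not return a single timezone-aware column from the `%z` format. Since only the local date and hour are needed, one possible workaround is to drop the offset and parse the wall-clock time. This is a sketch on hypothetical sample values, not data from the participant file:

```python
import pandas as pd

# Hypothetical sample: two timestamps with different UTC offsets,
# as could occur either side of a daylight-saving change.
raw = pd.Series([
    "10/03/2020, 23:45:00 +1000",
    "10/04/2020, 03:00:00 +1100",
])

# Split off the trailing offset and keep only the wall-clock part.
local = raw.str.rsplit(" ", n=1).str[0]

# Parse the wall-clock time as a timezone-naive datetime.
parsed = pd.to_datetime(local, format="%m/%d/%Y, %H:%M:%S")

hours = parsed.dt.hour.tolist()
dates = parsed.dt.date.astype(str).tolist()
```

The trade-off is that the offset information is discarded, so this only suits analyses that work in the participant's local time.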


Next, we sum the number of steps across each hour. We can do that with the pandas `groupby` method.

In [16]:
dat1 = dat.groupby(["Date","Hour"])["steps"].agg("sum").reset_index()
dat1.head(5)

Unnamed: 0,Date,Hour,steps
0,2020-10-18,0,1.0
1,2020-10-18,1,1.0
2,2020-10-18,2,1.0
3,2020-10-18,3,1.0
4,2020-10-18,4,1.0
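
To see what the aggregation does, the same call can be run on a small hand-built frame. The sketch below re-creates the first eight 15-minute rows shown earlier (the frame name `quarter_hours` is only illustrative):

```python
import pandas as pd

# The first eight 15-minute readings from the output above:
# four for hour 0 and four for hour 1.
quarter_hours = pd.DataFrame({
    "Date": ["2020-10-18"] * 8,
    "Hour": [0, 0, 0, 0, 1, 1, 1, 1],
    "steps": [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
})

# One row per (Date, Hour) pair, with the four readings summed.
hourly = quarter_hours.groupby(["Date", "Hour"])["steps"].sum().reset_index()
```

Each hour collapses to a single row whose `steps` value is the sum of its four readings. Note that `groupby` only produces rows for hours that appear in the data, so an hour with no readings at all would be absent rather than zero.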


This is exactly what we want. Finally, we label the columns.

In [17]:
#Use a flat list of names; a nested list would create a MultiIndex
dat1.columns = ["Date", "Hour", "Steps"]
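
The shape of the list assigned to `.columns` affects the kind of column index pandas builds. A small illustrative check on a throwaway frame (the names `flat` and `nested` are hypothetical):

```python
import pandas as pd

base = pd.DataFrame([["2020-10-18", 0, 1.0]])

# A flat list of names gives an ordinary column index.
flat = base.copy()
flat.columns = ["Date", "Hour", "Steps"]

# A list wrapped in another list is treated as the levels of a MultiIndex.
nested = base.copy()
nested.columns = [["Date", "Hour", "Steps"]]

flat_is_multi = isinstance(flat.columns, pd.MultiIndex)
nested_is_multi = isinstance(nested.columns, pd.MultiIndex)
```

With the flat list, `flat["Steps"]` is a plain Series, which is usually what later analysis code expects.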

### Displaying the dataframe

We print the first few rows of the dataframe in its final form, matching the standardised form we established in the *readingQS* notebook.

In [18]:
dat1.head(10)

Unnamed: 0,Date,Hour,Steps
0,2020-10-18,0,1.0
1,2020-10-18,1,1.0
2,2020-10-18,2,1.0
3,2020-10-18,3,1.0
4,2020-10-18,4,1.0
5,2020-10-18,5,1.0
6,2020-10-18,6,1.0
7,2020-10-18,7,23.0
8,2020-10-18,8,2.0
9,2020-10-18,9,40.0


### Putting Everything Together

Again, we define a single function that takes in the file name and outputs the required dataframe in its standard form.

In [19]:
def read_Pacer_data(filename):
    #Read in the data
    dat = pd.read_csv(filename)
    #Select necessary columns (copy so new columns can be added without a SettingWithCopyWarning)
    dat = dat[["date","steps"]].copy()
    #Extract datetime data
    dat["datetime"] = pd.to_datetime(dat["date"], format = '%m/%d/%Y, %H:%M:%S %z')
    dat["Date"] = dat["datetime"].dt.date
    dat["Hour"] = dat["datetime"].dt.hour
    dat["Min"] = dat["datetime"].dt.minute
    #Aggregate over the hours
    dat = dat.groupby(["Date","Hour"])["steps"].agg("sum").reset_index()
    #Relabel columns (flat list; a nested list would create a MultiIndex)
    dat.columns = ["Date", "Hour", "Steps"]
    
    return dat

In [20]:
read_Pacer_data(FILEPATH).head()

Unnamed: 0,Date,Hour,Steps
0,2020-10-18,0,1.0
1,2020-10-18,1,1.0
2,2020-10-18,2,1.0
3,2020-10-18,3,1.0
4,2020-10-18,4,1.0
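
A quick sanity check on the output can catch formatting problems before analysis. The sketch below is not part of the original notebook: `check_standard_form` is a hypothetical helper, and it assumes the standardised form is exactly the three columns `Date`, `Hour` and `Steps` shown above.

```python
import pandas as pd

def check_standard_form(df):
    """Return a list of problems; an empty list means the frame looks standard.

    Hypothetical helper, assuming the standard form has exactly the
    columns Date, Hour and Steps, with one row per (Date, Hour) pair.
    """
    problems = []
    if list(df.columns) != ["Date", "Hour", "Steps"]:
        # Stop here: the remaining checks need these column names.
        problems.append(f"unexpected columns: {list(df.columns)}")
        return problems
    if not df["Hour"].between(0, 23).all():
        problems.append("Hour values outside 0-23")
    if df.duplicated(subset=["Date", "Hour"]).any():
        problems.append("duplicate (Date, Hour) rows")
    if (df["Steps"] < 0).any():
        problems.append("negative step counts")
    return problems

# Usage on a small hand-built frame in the expected shape.
example = pd.DataFrame({
    "Date": ["2020-10-18", "2020-10-18"],
    "Hour": [0, 1],
    "Steps": [1.0, 1.0],
})
issues = check_standard_form(example)
```

The same call could be applied to the result of `read_Pacer_data(FILEPATH)` to confirm that a newly read file has the expected shape.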
