# Reading Step-Count Data

### Driving Question

This notebook brings together the three functions developed for reading and tabulating step data from the three source formats (Pacer, QS and XML). We also introduce a function for reading data that has already been cleaned, so that a cleaned file can be written to csv once and the preprocessing skipped on later runs of the analysis.

In [1]:
import numpy as np
import pandas as pd
import xmltodict  #Third-party package, needed by read_XML_data

### Reading data from each source

First we present the three functions from the other three notebooks in this folder, followed by a reader for data that has already been cleaned. Each of them takes a string giving the name of the file that contains the data. The output is a pandas dataframe which always has the same column headings (Date, Hour, Steps) and data types.

In [2]:
def read_Pacer_data(filename):
    #Read in the data
    dat = pd.read_csv(filename)
    #Select necessary columns (copy so new columns are not added to a slice)
    dat = dat[["date","steps"]].copy()
    #Extract datetime data
    dat["datetime"] = pd.to_datetime(dat["date"], format = '%m/%d/%Y, %H:%M:%S %z')
    dat["Date"] = dat["datetime"].dt.date
    dat["Hour"] = dat["datetime"].dt.hour
    #Aggregate over the hours
    dat = dat.groupby(["Date","Hour"])["steps"].agg("sum").reset_index()
    #Relabel columns (a flat list; a nested list would create a MultiIndex)
    dat.columns = ["Date", "Hour", "Steps"]
    
    return dat
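
As a quick illustration of what the Pacer reader does, the sketch below applies the same timestamp format and hourly aggregation to three made-up rows. The sample values are invented for illustration and are not taken from any participant file.

```python
import pandas as pd

# Made-up rows in the Pacer layout (illustrative values only)
sample = pd.DataFrame({
    "date": ["10/18/2020, 07:05:00 +1100",
             "10/18/2020, 07:40:00 +1100",
             "10/18/2020, 08:10:00 +1100"],
    "steps": [12.0, 11.0, 2.0],
})

# Parse with the same format string used in read_Pacer_data
stamps = pd.to_datetime(sample["date"], format="%m/%d/%Y, %H:%M:%S %z")
sample["Date"] = stamps.dt.date
sample["Hour"] = stamps.dt.hour

# Sum the steps within each (Date, Hour) pair and relabel the columns
hourly = sample.groupby(["Date", "Hour"])["steps"].sum().reset_index()
hourly.columns = ["Date", "Hour", "Steps"]
print(hourly)
```

The two readings in hour 7 collapse into a single row of 23 steps. The hour is the local hour implied by the offset in the timestamp, not the UTC hour.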

In [3]:
def read_QS_data(filename):
    #Read in CSV file
    dat = pd.read_csv(filename)
    #Extract datetime information
    dat["Datetime"] = pd.to_datetime(dat["Start"], format = '%d-%b-%Y %H:%M')
    dat["Date"] = dat["Datetime"].dt.date
    dat["Hour"] = dat["Datetime"].dt.hour
    #Format columns
    dat = dat[["Date", "Hour", "Steps (count)"]]
    dat.columns = ["Date", "Hour", "Steps"]
    
    return dat

In [4]:
def read_XML_data(filename):
    #Read in XML file
    with open(filename, 'r') as xml_file:
        input_data = xmltodict.parse(xml_file.read())
    #Extract record data from XML
    record_list = input_data['HealthData']['Record']
    df = pd.DataFrame(record_list)
    #Convert dates to datetime objects and steps to numeric
    date_format = '%Y-%m-%d %H:%M:%S %z'
    df['@startDate'] = pd.to_datetime(df['@startDate'], format = date_format)
    df['@endDate'] = pd.to_datetime(df['@endDate'], format = date_format)
    df['@value'] = pd.to_numeric(df['@value'])
    #Sum up step values for each hour (only @value, as the other columns are not numeric)
    dat = df.resample("h", on="@startDate")["@value"].sum().reset_index()
    #Extract date and hour information, and relabel columns
    dat["Date"] = dat["@startDate"].dt.date
    dat["Hour"] = dat["@startDate"].dt.hour
    dat["Steps"] = dat["@value"]
    dat = dat[["Date","Hour","Steps"]]
    
    return dat
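
One behaviour of the hourly resample is worth keeping in mind: hours with no records still appear in the output with a sum of zero, whereas a groupby on Date and Hour only returns hours that have at least one record. A minimal sketch on made-up records (the values are invented for illustration, and the frame is built by hand rather than parsed from XML):

```python
import pandas as pd

# Made-up records using the same column names as the XML export
records = pd.DataFrame({
    "@startDate": pd.to_datetime(
        ["2020-10-18 07:05:00 +1100",
         "2020-10-18 07:40:00 +1100",
         "2020-10-18 09:10:00 +1100"],
        format="%Y-%m-%d %H:%M:%S %z",
    ),
    "@value": [12, 11, 40],
})

# Sum the step values in hourly bins; the empty 08:00 bin is kept
hourly = records.resample("h", on="@startDate")["@value"].sum().reset_index()
hourly["Hour"] = hourly["@startDate"].dt.hour
print(hourly[["Hour", "@value"]])
```

Here hour 8 has no records but still appears with zero steps, so the XML reader can return more rows than a groupby over the same period would.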

In [5]:
def read_CLEAN_data(filename):
    #Read in CSV file
    dat = pd.read_csv(filename)
    #Convert datetimes
    dat["Date"] = pd.to_datetime(dat["Date"], format = '%Y-%m-%d').dt.date
    
    return dat
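
Since every reader is meant to return the same layout, a small check can make that explicit. The helper below, `check_step_frame`, is a hypothetical addition and not part of the original notebooks; it is a sketch of one way to test a dataframe against the shared Date, Hour, Steps layout.

```python
import datetime
import pandas as pd

EXPECTED_COLUMNS = ["Date", "Hour", "Steps"]  # headings shared by all readers

def check_step_frame(df):
    """Hypothetical helper: raise ValueError if df does not match the shared layout."""
    if list(df.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected columns: {list(df.columns)}")
    # datetime and Timestamp objects also pass, as both subclass datetime.date
    if not df["Date"].map(lambda d: isinstance(d, datetime.date)).all():
        raise ValueError("Date column should hold date objects")
    if not df["Hour"].between(0, 23).all():
        raise ValueError("Hour values should lie between 0 and 23")
    if not pd.api.types.is_numeric_dtype(df["Steps"]):
        raise ValueError("Steps column should be numeric")
    return df

# A one-row frame in the expected layout passes the check unchanged
example = pd.DataFrame(
    {"Date": [datetime.date(2020, 10, 18)], "Hour": [7], "Steps": [23.0]}
)
check_step_frame(example)
```

Because the helper returns the dataframe, it could be wrapped around any reader call, for example `check_step_frame(read_QS_data(filename))`.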

### Creating a single function

We now want a single function that takes the filename and the file type, calls the matching reader from the functions above, and returns the resulting dataframe.

In [6]:
def read_step_data(filename, read_type):
    read_type = read_type.lower()
    if read_type == "pacer":
        return read_Pacer_data(filename)
    elif read_type == "qsaccess" or read_type == "qs":
        return read_QS_data(filename)
    elif read_type == "xml":
        return read_XML_data(filename)
    elif read_type == "clean" or read_type == "cleaned":
        return read_CLEAN_data(filename)
    else:
        raise ValueError("Not a valid file type to read! Use pacer, qs, xml or clean")
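
An alternative to the if/elif chain is a lookup table from type name to reader function, which keeps the list of valid types in one place. The sketch below is one possible arrangement, not the notebook's method; `make_reader` and the stand-in readers are hypothetical names used so that the example runs on its own.

```python
def make_reader(readers, aliases=None):
    """Hypothetical factory: build a read function from a {name: function} table."""
    aliases = aliases or {}

    def read(filename, read_type):
        key = read_type.lower()
        key = aliases.get(key, key)  # map alternative spellings to one name
        if key not in readers:
            valid = ", ".join(sorted(readers))
            raise ValueError(f"Not a valid file type to read! Use one of: {valid}")
        return readers[key](filename)

    return read

# Stand-in readers so the sketch is self-contained; in the notebook these
# would be read_Pacer_data, read_QS_data, read_XML_data and read_CLEAN_data.
demo_readers = {
    "pacer": lambda filename: ("pacer", filename),
    "qs": lambda filename: ("qs", filename),
    "xml": lambda filename: ("xml", filename),
    "clean": lambda filename: ("clean", filename),
}
demo_read = make_reader(demo_readers, aliases={"qsaccess": "qs", "cleaned": "clean"})
print(demo_read("steps.csv", "QSAccess"))
```

Adding a new source then only requires a new entry in the table, and the error message lists the valid names automatically.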

### Example and Testing

We read in one of the participant files and make sure the data has been imported correctly. We then write the dataframe to csv and read it back with the cleaned-data reader to ensure that this also functions as intended.

In [7]:
READPATH = "../../data/Participant_ID_01/DetailedSteps_2020_10_24_1932.csv"
dat = read_step_data(READPATH, "pacer")
dat.head(10)

Unnamed: 0,Date,Hour,Steps
0,2020-10-18,0,1.0
1,2020-10-18,1,1.0
2,2020-10-18,2,1.0
3,2020-10-18,3,1.0
4,2020-10-18,4,1.0
5,2020-10-18,5,1.0
6,2020-10-18,6,1.0
7,2020-10-18,7,23.0
8,2020-10-18,8,2.0
9,2020-10-18,9,40.0


The data has been read into the expected Date, Hour and Steps columns. We now write the dataframe to a new csv file in a "cleaned data" folder.

In [8]:
WRITEPATH = "../../data/cleaned/participant1.csv"
dat.to_csv(WRITEPATH, index = False)

We now read in the data again and check that it matches the dataframe above.

In [9]:
dat1 = read_step_data(WRITEPATH, "clean")
dat1.head(10)

Unnamed: 0,Date,Hour,Steps
0,2020-10-18,0,1.0
1,2020-10-18,1,1.0
2,2020-10-18,2,1.0
3,2020-10-18,3,1.0
4,2020-10-18,4,1.0
5,2020-10-18,5,1.0
6,2020-10-18,6,1.0
7,2020-10-18,7,23.0
8,2020-10-18,8,2.0
9,2020-10-18,9,40.0


In [10]:
print(dat.dtypes)
print()
print(dat1.dtypes)

Date      object
Hour       int64
Steps    float64
dtype: object

Date      object
Hour       int64
Steps    float64
dtype: object


The first ten rows and the column types of the two data frames agree, so the write-and-read round trip has worked as planned.
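
The comparison above looks only at the first ten rows and the dtypes. To compare every value, `pd.testing.assert_frame_equal` can be used. The sketch below runs the same round trip on a small hand-built frame, writing to an in-memory buffer instead of a file so that it is self-contained; the frame is illustrative only.

```python
import datetime
import io
import pandas as pd

# Small hand-built frame in the cleaned layout (illustrative values)
original = pd.DataFrame({
    "Date": [datetime.date(2020, 10, 18)] * 3,
    "Hour": [7, 8, 9],
    "Steps": [23.0, 2.0, 40.0],
})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

# Read back and convert dates the same way read_CLEAN_data does
restored = pd.read_csv(buffer)
restored["Date"] = pd.to_datetime(restored["Date"], format="%Y-%m-%d").dt.date

# Raises AssertionError if any value, column name or dtype differs
pd.testing.assert_frame_equal(original, restored)
```

In the notebook itself, the equivalent check would be `pd.testing.assert_frame_equal(dat, dat1)`.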