# Importing QS Access Data

Workbook created by Martin Gossow. Some of the code is borrowed or modified from Serena's *Metric Analysis* notebook.

### Driving Question

Develop a pipeline that takes a QS Access CSV file as input and outputs a dataframe giving the date, starting hour and number of steps for each hourly slot. We also establish a standard format in which the analysis can be run.

### Setting up Packages and Filename

In [12]:
#Import required packages
import numpy as np
import pandas as pd
from datetime import datetime

In [13]:
#File path for csv file
FILEPATH = "../../data/Participant_ID_A/User1.csv"

### Reading the Data

We use the pandas `read_csv` function, which reads the file straight into a dataframe.

In [14]:
dat = pd.read_csv(FILEPATH)
dat.head(8)

Unnamed: 0,Start,Finish,Steps (count)
0,07-Dec-2014 09:00,07-Dec-2014 10:00,941.0
1,07-Dec-2014 10:00,07-Dec-2014 11:00,408.0
2,07-Dec-2014 11:00,07-Dec-2014 12:00,157.0
3,07-Dec-2014 12:00,07-Dec-2014 13:00,1017.0
4,07-Dec-2014 13:00,07-Dec-2014 14:00,0.0
5,07-Dec-2014 14:00,07-Dec-2014 15:00,0.0
6,07-Dec-2014 15:00,07-Dec-2014 16:00,137.0
7,07-Dec-2014 16:00,07-Dec-2014 17:00,0.0
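
Before going further, it can be worth checking that a file has the columns the rest of the notebook relies on. The following is a minimal sketch, not part of the original workbook: it reads two rows copied from the output above out of an in-memory string instead of the real file, and `REQUIRED_COLUMNS` and `check_columns` are hypothetical names introduced for illustration. It assumes the leftmost column shown above is the dataframe index rather than a column in the CSV.

```python
import io

import pandas as pd

# Two rows copied from the output above, standing in for the real file
SAMPLE_CSV = """Start,Finish,Steps (count)
07-Dec-2014 09:00,07-Dec-2014 10:00,941.0
07-Dec-2014 10:00,07-Dec-2014 11:00,408.0
"""

# Hypothetical list of the columns the later cells expect
REQUIRED_COLUMNS = ["Start", "Finish", "Steps (count)"]


def check_columns(df):
    """Raise a ValueError if any expected column is missing, else return df."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df


sample = check_columns(pd.read_csv(io.StringIO(SAMPLE_CSV)))
print(sample.shape)
```

Failing early with a clear message is easier to debug than a `KeyError` raised later in the formatting step.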


### Formatting the Data

There are a few things we need to fix. We want to rename the column `Steps (count)` to `Steps`. We also want to convert the `Start` column to datetime format and extract the date and hour from it.

In [15]:
dat["Datetime"] = pd.to_datetime(dat["Start"], format = '%d-%b-%Y %H:%M')
dat["Date"] = dat["Datetime"].dt.date
dat["Hour"] = dat["Datetime"].dt.hour
dat.head(8)

Unnamed: 0,Start,Finish,Steps (count),Datetime,Date,Hour
0,07-Dec-2014 09:00,07-Dec-2014 10:00,941.0,2014-12-07 09:00:00,2014-12-07,9
1,07-Dec-2014 10:00,07-Dec-2014 11:00,408.0,2014-12-07 10:00:00,2014-12-07,10
2,07-Dec-2014 11:00,07-Dec-2014 12:00,157.0,2014-12-07 11:00:00,2014-12-07,11
3,07-Dec-2014 12:00,07-Dec-2014 13:00,1017.0,2014-12-07 12:00:00,2014-12-07,12
4,07-Dec-2014 13:00,07-Dec-2014 14:00,0.0,2014-12-07 13:00:00,2014-12-07,13
5,07-Dec-2014 14:00,07-Dec-2014 15:00,0.0,2014-12-07 14:00:00,2014-12-07,14
6,07-Dec-2014 15:00,07-Dec-2014 16:00,137.0,2014-12-07 15:00:00,2014-12-07,15
7,07-Dec-2014 16:00,07-Dec-2014 17:00,0.0,2014-12-07 16:00:00,2014-12-07,16
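
To make the format string concrete: `%d` is the day of the month, `%b` the abbreviated month name, `%Y` the four-digit year, and `%H:%M` the 24-hour time. The sketch below applies the same conversion to two of the `Start` values shown above. The last line is an assumption about how one might handle malformed rows, not something the original workbook does: with `errors="coerce"`, pandas returns `NaT` instead of raising an error.

```python
import pandas as pd

# Two Start values copied from the table above
starts = pd.Series(["07-Dec-2014 09:00", "07-Dec-2014 16:00"])

# Same format string as in the cell above
parsed = pd.to_datetime(starts, format="%d-%b-%Y %H:%M")

dates = parsed.dt.date   # Python datetime.date objects
hours = parsed.dt.hour   # integers from 0 to 23

print(hours.tolist())

# Optional: turn unparseable strings into NaT instead of raising an error
bad = pd.to_datetime(pd.Series(["not a date"]),
                     format="%d-%b-%Y %H:%M", errors="coerce")
print(bad.isna().tolist())
```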


We've extracted the date (as a `datetime.date` object) and the hour (as an integer). Now we drop the columns that are no longer needed and rename the remaining ones.

In [16]:
#Extract needed columns
dat = dat[["Date", "Hour", "Steps (count)"]]
#Rename columns
dat.columns = ["Date", "Hour", "Steps"]
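
Assigning a list to `dat.columns` depends on the order of the columns. As an alternative sketch (not the approach used in this workbook), `DataFrame.rename` maps old names to new ones explicitly, so it does not depend on order. The small dataframe `example` is made up for illustration.

```python
import pandas as pd

# Made-up dataframe with the same column names as above
example = pd.DataFrame({
    "Date": ["2014-12-07", "2014-12-07"],
    "Hour": [9, 10],
    "Steps (count)": [941.0, 408.0],
})

# Select the needed columns, then rename by name rather than by position
example = example[["Date", "Hour", "Steps (count)"]].rename(
    columns={"Steps (count)": "Steps"}
)
print(list(example.columns))
```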

### Displaying the dataframe

Finally, we display the dataframe in its final standardised format.

In [17]:
dat.head(10)

Unnamed: 0,Date,Hour,Steps
0,2014-12-07,9,941.0
1,2014-12-07,10,408.0
2,2014-12-07,11,157.0
3,2014-12-07,12,1017.0
4,2014-12-07,13,0.0
5,2014-12-07,14,0.0
6,2014-12-07,15,137.0
7,2014-12-07,16,0.0
8,2014-12-07,17,33.0
9,2014-12-07,18,0.0


### Putting everything together

We define a single function that takes a filename and returns the cleaned dataframe. All of the code is taken from the cells above.

In [18]:
def read_QS_data(filename):
    #Read in CSV file
    dat = pd.read_csv(filename)
    #Extract datetime information
    dat["Datetime"] = pd.to_datetime(dat["Start"], format = '%d-%b-%Y %H:%M')
    dat["Date"] = dat["Datetime"].dt.date
    dat["Hour"] = dat["Datetime"].dt.hour
    #Format columns
    dat = dat[["Date", "Hour", "Steps (count)"]]
    dat.columns = ["Date", "Hour", "Steps"]
    
    return dat

In [19]:
read_QS_data(FILEPATH).head()

Unnamed: 0,Date,Hour,Steps
0,2014-12-07,9,941.0
1,2014-12-07,10,408.0
2,2014-12-07,11,157.0
3,2014-12-07,12,1017.0
4,2014-12-07,13,0.0
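
One way to check the function without access to the participant data is to write a few rows to a temporary file and read them back. The sketch below repeats `read_QS_data` so that it runs on its own. The sample rows are copied from the output above, and the temporary-file handling is an illustrative assumption rather than part of the original pipeline.

```python
import os
import tempfile

import pandas as pd


def read_QS_data(filename):
    # Same logic as the function defined above
    dat = pd.read_csv(filename)
    dat["Datetime"] = pd.to_datetime(dat["Start"], format="%d-%b-%Y %H:%M")
    dat["Date"] = dat["Datetime"].dt.date
    dat["Hour"] = dat["Datetime"].dt.hour
    dat = dat[["Date", "Hour", "Steps (count)"]]
    dat.columns = ["Date", "Hour", "Steps"]
    return dat


# Two rows in the same layout as the QS Access export shown earlier
rows = (
    "Start,Finish,Steps (count)\n"
    "07-Dec-2014 09:00,07-Dec-2014 10:00,941.0\n"
    "07-Dec-2014 10:00,07-Dec-2014 11:00,408.0\n"
)

# Write the rows to a temporary CSV file, read it back, then clean up
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write(rows)
    tmp_path = f.name
try:
    result = read_QS_data(tmp_path)
finally:
    os.remove(tmp_path)

print(result)
```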
