## PREPARE | Collect Data

The data used for this analysis is a public domain (CC0 1.0 Universal Public Domain) dataset made available on Kaggle (via user MÖBIUS).

Data source: [FitBit Fitness Tracker Data](https://www.kaggle.com/datasets/arashnic/fitbit)  

### Data Details
The data uploaded by Kaggle user MÖBIUS originates from respondents to a survey distributed via Amazon Mechanical Turk from March 12 to May 12, 2016. Thirty-three Fitbit users submitted personal tracker data, collected in 18 files covering:
- Physical Activity (measured in Steps, Calories, and METs (metabolic equivalents))
- Sleep (measured in minutes)
- Heart rate (bpm)
- Weight/BMI (lbs/kg)

The data covers a 31-day period (April 12 to May 12, 2016, inclusive).



This analysis will focus on **Physical Activity** (daily, hourly), **Sleep** (daily), and **Weight/BMI** to understand usage at a broader level.


All 33 individuals provided physical activity data, but fewer contributed to the other data types:
- Physical Activity: 33
- Sleep monitoring: 24
- Weight: 8
- Heart rate: 14 
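
To put these counts in proportion, the share of the 33 participants represented in each data type can be computed directly. This is a minimal sketch that uses only the counts listed above; the variable names are illustrative and not part of the original notebook.

```python
# Participant counts per data type, copied from the list above.
participants = {
    "Physical Activity": 33,
    "Sleep monitoring": 24,
    "Weight": 8,
    "Heart rate": 14,
}

# Every participant logged physical activity, so use it as the baseline.
total_users = participants["Physical Activity"]

# Percentage of the 33 participants represented in each data type.
coverage = {
    name: round(100 * count / total_users, 1)
    for name, count in participants.items()
}

for name, pct in coverage.items():
    print(f"{name}: {pct}%")
```

Roughly three quarters of participants logged sleep and about a quarter logged weight, which is worth keeping in mind when comparing findings across data types.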

### Licensing, Privacy, Security, Integrity
[CC0 1.0 Universal Public Domain](https://creativecommons.org/publicdomain/zero/1.0/)

All users participating in the survey consented to the submission of personal tracking data. The users' privacy has been protected by only identifying unique individuals via randomly generated ID numbers. This data has been provided by a 3rd party, the Kaggle user MÖBIUS.

### Data Integrity
- Sample selection bias, since the sample is small (33 respondents) and self-selected
- Variation in output due to different types of Fitbit trackers
- Variation in individual tracking behavior/preferences
- No demographic data (sex, age, location)
- Obsolescence: the data is 5 years old (collected in 2016)



In [17]:
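import os

import pandas as pd
import pandas_gbq
from google.cloud import bigquery

# Load the BigQuery cell magic so that %%bigquery cells can run SQL directly
%load_ext google.cloud.bigquery

# Default project and SQL dialect for pandas_gbq queries
pandas_gbq.context.project = 'gac-bellabeat'
pandas_gbq.context.dialect = 'standard'

# Point the Google client libraries at the service-account key file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/Users/atat/code/phlln/gcp_keys/gac-bellabeat-jupyter-bigquery-key.json'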



In [19]:
%%bigquery
SELECT *
  FROM `gac-bellabeat.activity.sleepdaily`
  LIMIT 1



|   | Id | SleepDay | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed |
|---|---|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 00:00:00+00:00 | 1 | 327 | 346 |


## PROCESS | Wrangle Data

Inspect, wrangle, and validate the data:
- Check data types, number of records, number of nulls, and summary statistics
- Add columns for easier analysis by day of week and hour
- Cover Physical Activity, Sleep, and Weight/BMI

### Activity Data

The most detailed and complete data is for physical activity. Various metrics of physical exertion (Steps, Calories, METs) are provided, measured by time or by distance. Additionally, intensity has been categorized into four levels ('VeryActive', 'FairlyActive', 'LightlyActive', 'Sedentary'). This data has also been recorded at different time scales (daily, hourly, and by the minute).

#### Daily Activity Data

##### Validate number of unique individuals

In [81]:
%%bigquery
SELECT
    COUNT(DISTINCT ID) AS num_unique_users
FROM `gac-bellabeat.activity.daily`;
  



|   | num_unique_users |
|---|---|
| 0 | 33 |
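
The same distinct-count check can be tried locally without BigQuery access. The sketch below is a stand-in under stated assumptions, not the notebook's actual pipeline: it loads five rows copied from the daily activity output in this notebook into an in-memory SQLite table (hypothetically named `daily`) and runs the equivalent `COUNT(DISTINCT Id)`.

```python
import sqlite3

# Sample (Id, ActivityDate, TotalSteps) rows copied from the daily activity output.
rows = [
    (1624580081, "2016-05-01", 36019),
    (1644430081, "2016-04-14", 11037),
    (1644430081, "2016-04-19", 11256),
    (1644430081, "2016-04-28", 9405),
    (1644430081, "2016-04-30", 18213),
]

# Assumption: an in-memory SQLite table stands in for the BigQuery table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (Id INTEGER, ActivityDate TEXT, TotalSteps INTEGER)")
conn.executemany("INSERT INTO daily VALUES (?, ?, ?)", rows)

# Same aggregate as the BigQuery cell: count each user Id once.
num_unique_users = conn.execute(
    "SELECT COUNT(DISTINCT Id) AS num_unique_users FROM daily"
).fetchone()[0]

print(num_unique_users)  # 2 distinct Ids in this five-row sample
```

On the full table the query returns 33, matching the number of participants stated in the data description.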


##### Check for Null Values

In [87]:
%%bigquery
SELECT
    COUNT(*) total_rows,
    SUM(CASE WHEN ActivityDate IS NULL THEN 1 ELSE 0 END) activitydate_num_null,
    SUM(CASE WHEN TotalSteps IS NULL THEN 1 ELSE 0 END) totalsteps_num_null,
    SUM(CASE WHEN VeryActiveMinutes IS NULL THEN 1 ELSE 0 END) veryactive_num_null,
    SUM(CASE WHEN FairlyActiveMinutes IS NULL THEN 1 ELSE 0 END) fairlyactive_num_null,
    SUM(CASE WHEN LightlyActiveMinutes IS NULL THEN 1 ELSE 0 END) lightlyactive_num_null,
    SUM(CASE WHEN SedentaryMinutes IS NULL THEN 1 ELSE 0 END) sedentary_num_null,
    SUM(CASE WHEN Calories IS NULL THEN 1 ELSE 0 END) calories_num_null
FROM `gac-bellabeat.activity.daily`;



|   | total_rows | activitydate_num_null | totalsteps_num_null | veryactive_num_null | fairlyactive_num_null | lightlyactive_num_null | sedentary_num_null | calories_num_null |
|---|---|---|---|---|---|---|---|---|
| 0 | 940 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
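
Writing one `SUM(CASE WHEN ... IS NULL ...)` expression per column gets repetitive as tables grow. As a hedged sketch, the expressions can be generated from a list of column names. The helper `null_counts`, the SQLite stand-in table, and the rows (which are made up to contain NULLs and are not from the Fitbit dataset) are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily (Id INTEGER, ActivityDate TEXT, TotalSteps INTEGER, Calories INTEGER)"
)
# Made-up rows with deliberate NULLs, only to exercise the check.
conn.executemany(
    "INSERT INTO daily VALUES (?, ?, ?, ?)",
    [
        (1, "2016-04-12", 1000, 1800),
        (1, "2016-04-13", None, 1750),
        (2, None, 500, None),
    ],
)


def null_counts(connection, table, columns):
    """Return total row count plus the number of NULLs in each listed column.

    Table and column names are interpolated into the SQL text, so they must
    come from trusted code, never from user input.
    """
    select_list = ", ".join(
        f"SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) AS {col}_num_null"
        for col in columns
    )
    cursor = connection.execute(
        f"SELECT COUNT(*) AS total_rows, {select_list} FROM {table}"
    )
    names = [description[0] for description in cursor.description]
    return dict(zip(names, cursor.fetchone()))


result = null_counts(conn, "daily", ["ActivityDate", "TotalSteps", "Calories"])
print(result)
```

The generated SQL has the same shape as the hand-written query above, so the resulting string could also be pasted into a `%%bigquery` cell.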


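##### Create summary stats panel (Total, Mean, Min, 25%, 50%, 75%, Max, Std)

The row labeled `Count` in the query below holds column totals computed with `SUM`, not record counts.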

In [91]:
%%bigquery
SELECT 1 AS Index, 
      'Count' AS Statistic,
       SUM(TotalSteps) AS total_steps,
       SUM(VeryActiveMinutes) AS very_active_minutes,
       SUM(FairlyActiveMinutes) AS fairly_active_minutes,
       SUM(LightlyActiveMinutes) AS lightly_active_minutes,
       SUM(SedentaryMinutes) AS sedentary_minutes
  FROM `gac-bellabeat.activity.daily`
 UNION ALL
SELECT 2, 
      'Mean',
       ROUND(AVG(TotalSteps), 1) AS mean_total_steps,
       ROUND(AVG(VeryActiveMinutes), 1) AS mean_veryactive_minutes,
       ROUND(AVG(FairlyActiveMinutes), 1) AS mean_fairlyactive_minutes,
       ROUND(AVG(LightlyActiveMinutes), 1) AS mean_lightlyactive_minutes,
       ROUND(AVG(SedentaryMinutes), 1) AS mean_sedentary_minutes
  FROM `gac-bellabeat.activity.daily`
  UNION ALL
SELECT 3,
       'Min',
       MIN(TotalSteps) AS min_total_steps,
       MIN(VeryActiveMinutes) AS min_veryactive_minutes,
       MIN(FairlyActiveMinutes) AS min_fairlyactive_minutes,
       MIN(LightlyActiveMinutes) AS min_lightlyactive_minutes,
       MIN(SedentaryMinutes) AS min_sedentary_minutes
  FROM `gac-bellabeat.activity.daily`
  UNION ALL
(SELECT 4,
       '25%',
       PERCENTILE_CONT(TotalSteps, 0.25) OVER() AS percentile25_total_steps,
       PERCENTILE_CONT(VeryActiveMinutes, 0.25) OVER() AS percentile25_veryactive_minutes,
       PERCENTILE_CONT(FairlyActiveMinutes, 0.25) OVER() AS percentile25_fairlyactive_minutes,
       PERCENTILE_CONT(LightlyActiveMinutes, 0.25) OVER() AS percentile25_lightlyactive_minutes,
       PERCENTILE_CONT(SedentaryMinutes, 0.25) OVER() AS percentile25_sedentary_minutes
  FROM `gac-bellabeat.activity.daily`
  LIMIT 1) 
  UNION ALL 
(SELECT 5,
       '50%',
       PERCENTILE_CONT(TotalSteps, 0.50) OVER() AS percentile50_total_steps,
       PERCENTILE_CONT(VeryActiveMinutes, 0.50) OVER() AS percentile50_veryactive_minutes,
       PERCENTILE_CONT(FairlyActiveMinutes, 0.50) OVER() AS percentile50_fairlyactive_minutes,
       PERCENTILE_CONT(LightlyActiveMinutes, 0.50) OVER() AS percentile50_lightlyactive_minutes,
       PERCENTILE_CONT(SedentaryMinutes, 0.50) OVER() AS percentile50_sedentary_minutes
  FROM `gac-bellabeat.activity.daily`
  LIMIT 1) 
  UNION ALL 
(SELECT 6,
       '75%',
       PERCENTILE_CONT(TotalSteps, 0.75) OVER() AS percentile75_total_steps,
       PERCENTILE_CONT(VeryActiveMinutes, 0.75) OVER() AS percentile75_veryactive_minutes,
       PERCENTILE_CONT(FairlyActiveMinutes, 0.75) OVER() AS percentile75_fairlyactive_minutes,
       PERCENTILE_CONT(LightlyActiveMinutes, 0.75) OVER() AS percentile75_lightlyactive_minutes,
       PERCENTILE_CONT(SedentaryMinutes, 0.75) OVER() AS percentile75_sedentary_minutes
  FROM `gac-bellabeat.activity.daily`
  LIMIT 1) 
UNION ALL
SELECT 7,
       'Max',
       MAX(TotalSteps) AS max_total_steps,
       MAX(VeryActiveMinutes) AS max_veryactive_minutes,
       MAX(FairlyActiveMinutes) AS max_fairlyactive_minutes,
       MAX(LightlyActiveMinutes) AS max_lightlyactive_minutes,
       MAX(SedentaryMinutes) AS max_sedentary_minutes
  FROM `gac-bellabeat.activity.daily`
UNION ALL
(SELECT 8,
       'STD',
       ROUND(STDDEV_POP(TotalSteps) OVER(), 1) AS std_total_steps,
       ROUND(STDDEV_POP(VeryActiveMinutes) OVER(), 1) AS std_veryactive_minutes,
       ROUND(STDDEV_POP(FairlyActiveMinutes) OVER(), 1) AS std_fairlyactive_minutes,
       ROUND(STDDEV_POP(LightlyActiveMinutes) OVER(), 1) AS std_lightlyactive_minutes,
       ROUND(STDDEV_POP(SedentaryMinutes) OVER(), 1) AS std_sedentary_minutes
FROM `gac-bellabeat.activity.daily`
LIMIT 1)
ORDER BY Index ASC;



|   | Index | Statistic | total_steps | very_active_minutes | fairly_active_minutes | lightly_active_minutes | sedentary_minutes |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Count | 7179636.0 | 19895.0 | 12751.0 | 181244.0 | 931738.0 |
| 1 | 2 | Mean | 7637.9 | 21.2 | 13.6 | 192.8 | 991.2 |
| 2 | 3 | Min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 4 | 25% | 3789.75 | 0.0 | 0.0 | 127.0 | 729.75 |
| 4 | 5 | 50% | 7405.5 | 4.0 | 6.0 | 199.0 | 1057.5 |
| 5 | 6 | 75% | 10727.0 | 32.0 | 19.0 | 264.0 | 1229.5 |
| 6 | 7 | Max | 36019.0 | 210.0 | 143.0 | 518.0 | 1440.0 |
| 7 | 8 | STD | 5084.4 | 32.8 | 20.0 | 109.1 | 301.1 |
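
To make the statistics in this panel concrete, the sketch below recomputes them in plain Python for a single column. It assumes five `TotalSteps` values copied from the daily activity output in this notebook, so its numbers will not match the 940-row panel. The helper `percentile_cont` is hypothetical; it mirrors the linear interpolation that BigQuery documents for `PERCENTILE_CONT`, and `statistics.pstdev` corresponds to `STDDEV_POP` (population standard deviation).

```python
import math
import statistics


def percentile_cont(values, p):
    """Percentile with linear interpolation at position p * (n - 1)."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lower, upper = math.floor(pos), math.ceil(pos)
    # Interpolate between the two neighbouring order statistics.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


# Five TotalSteps values copied from the daily activity output (a sample only).
total_steps = [36019, 11037, 11256, 9405, 18213]

summary = {
    "Total": sum(total_steps),
    "Mean": round(statistics.mean(total_steps), 1),
    "Min": min(total_steps),
    "25%": percentile_cont(total_steps, 0.25),
    "50%": percentile_cont(total_steps, 0.50),
    "75%": percentile_cont(total_steps, 0.75),
    "Max": max(total_steps),
    "STD": round(statistics.pstdev(total_steps), 1),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

With an even number of values the interpolation becomes visible: `percentile_cont([0, 10, 20, 30], 0.25)` falls three quarters of the way between 0 and 10, giving 7.5. The same mechanism explains fractional quartiles such as 3789.75 in the panel above.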


##### Create new DayOfWeek column extracted from ActivityDate to facilitate analysis

In [93]:
%%bigquery
SELECT Id,
       ActivityDate,
       FORMAT_DATE('%a', ActivityDate) AS DayOfWeek,
       TotalSteps,
       VeryActiveMinutes,
       FairlyActiveMinutes,
       LightlyActiveMinutes,
       SedentaryMinutes, 
       Calories
FROM `gac-bellabeat.activity.daily`



|   | Id | ActivityDate | DayOfWeek | TotalSteps | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1624580081 | 2016-05-01 | Sun | 36019 | 186 | 63 | 171 | 1020 | 2690 |
| 1 | 1644430081 | 2016-04-14 | Thu | 11037 | 5 | 58 | 252 | 1125 | 3226 |
| 2 | 1644430081 | 2016-04-19 | Tue | 11256 | 5 | 58 | 278 | 1099 | 3300 |
| 3 | 1644430081 | 2016-04-28 | Thu | 9405 | 3 | 53 | 227 | 1157 | 3108 |
| 4 | 1644430081 | 2016-04-30 | Sat | 18213 | 9 | 71 | 402 | 816 | 3846 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 935 | 1844505072 | 2016-04-20 | Wed | 8 | 0 | 0 | 1 | 1439 | 1349 |
| 936 | 4020332650 | 2016-04-17 | Sun | 16 | 0 | 0 | 2 | 1438 | 1990 |
| 937 | 4319703577 | 2016-05-12 | Thu | 17 | 0 | 0 | 2 | 0 | 257 |
| 938 | 6775888955 | 2016-05-03 | Tue | 9 | 0 | 0 | 1 | 1439 | 1843 |
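
The weekday abbreviations can be cross-checked outside BigQuery. The following is a minimal sketch, assuming the dates are available as ISO strings; the helper `day_of_week` is hypothetical, and Python's `%a` directive plays the role of `FORMAT_DATE('%a', ...)`. Note that `%a` depends on the active locale; the labels here assume the default C/English locale.

```python
from datetime import date

# ActivityDate values copied from the first five rows of the output above.
activity_dates = [
    "2016-05-01",
    "2016-04-14",
    "2016-04-19",
    "2016-04-28",
    "2016-04-30",
]


def day_of_week(iso_date):
    """Return the abbreviated weekday name for a YYYY-MM-DD string."""
    return date.fromisoformat(iso_date).strftime("%a")


days = [day_of_week(d) for d in activity_dates]
print(days)  # ['Sun', 'Thu', 'Tue', 'Thu', 'Sat']
```

The result agrees with the `DayOfWeek` values in the query output for the same rows.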


In [None]:
## Hourly