## PREPARE | Collect Data

The data used for this analysis is a public domain (CC0 1.0 Universal Public Domain) dataset made available on Kaggle (via user MÖBIUS).

Data source: [FitBit Fitness Tracker Data](https://www.kaggle.com/datasets/arashnic/fitbit)  

### Data Details
The data uploaded by Kaggle user MÖBIUS originates from respondents to a survey distributed via Amazon Mechanical Turk from March 12 to May 12, 2016. Thirty-three Fitbit users submitted personal tracker data, collected in 18 files covering:
- Physical Activity (measured in Steps, Calories, and METs (metabolic equivalents))
- Sleep (measured in minutes)
- Heart rate (bpm)
- Weight/BMI (lbs/kg)

Data covers a 31-day period (April 12 to May 12, 2016, inclusive).



This analysis will focus on **Physical Activity** (daily, hourly), **Sleep** (daily), and **Weight/BMI** to understand usage at a broader level.


While 33 unique individuals provided physical activity data, the other data types cover fewer individuals:
- Physical Activity: 33
- Sleep monitoring: 24
- Weight: 8
- Heart rate: 14 
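
To put these counts in proportion, the short sketch below computes the share of the 33 participants who contributed each data type. It only uses the counts listed above; the variable names are illustrative and not part of the original notebook.

```python
# Unique users per data type, as listed above.
users_by_type = {
    "Physical Activity": 33,
    "Sleep monitoring": 24,
    "Weight": 8,
    "Heart rate": 14,
}

# Physical activity was tracked by every participant, so it is the baseline.
total_users = users_by_type["Physical Activity"]

# Share of the full sample that contributed each data type, in percent.
coverage = {
    data_type: round(100 * count / total_users, 1)
    for data_type, count in users_by_type.items()
}

for data_type, pct in coverage.items():
    print(f"{data_type}: {pct}%")
```

Roughly three quarters of the participants logged sleep, while fewer than a quarter logged weight, which is worth keeping in mind when reading the sleep and weight summaries.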

### Licensing, Privacy, Security, Integrity
[CC0 1.0 Universal Public Domain](https://creativecommons.org/publicdomain/zero/1.0/)

All users participating in the survey consented to the submission of their personal tracking data. Their privacy is protected by identifying individuals only via randomly generated ID numbers. The data has been provided by a third party, the Kaggle user MÖBIUS.

### Data Integrity
- Sample selection bias and a small, statistically insignificant sample (33 survey respondents)
- Variation in output due to different types of Fitbit trackers
- Variation in individual tracking behavior/preferences
- No demographic data (sex, age, location)
- Obsolescence: the data was collected in 2016 (5 years old at the time of this analysis)



In [17]:
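import os

import pandas as pd
import pandas_gbq
from google.cloud import bigquery

# Point the Google client libraries at the service-account key (local path).
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/Users/atat/code/phlln/gcp_keys/gac-bellabeat-jupyter-bigquery-key.json'

# Register the %%bigquery cell magic used throughout this notebook.
%load_ext google.cloud.bigquery

# Default project and SQL dialect for pandas_gbq queries.
pandas_gbq.context.project = 'gac-bellabeat'
pandas_gbq.context.dialect = 'standard'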



In [19]:
%%bigquery
SELECT *
  FROM `gac-bellabeat.activity.sleepdaily`
  LIMIT 1



Unnamed: 0,Id,SleepDay,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed
0,1503960366,2016-04-12 00:00:00+00:00,1,327,346


## PROCESS | Wrangle Data

Inspect, Wrangle, Validate Data
- Check data types, number of records, number of nulls, and summary statistics
- Add columns for easier analysis by day of week and hour
- Cover Physical Activity, Sleep, and Weight/BMI



First, we'll check the columns and data types for the following tables: daily, hourly, sleep, and weight. We'll also exclude columns tracking distance, preferring those that use time or steps.

In [141]:
%%bigquery
SELECT table_name,
       column_name,
       data_type
  FROM `gac-bellabeat.activity.INFORMATION_SCHEMA.COLUMNS`
WHERE column_name NOT LIKE '%Distance'; 



Unnamed: 0,table_name,column_name,data_type
0,daily,Id,INT64
1,daily,ActivityDate,DATE
2,daily,TotalSteps,INT64
3,daily,VeryActiveMinutes,INT64
4,daily,FairlyActiveMinutes,INT64
5,daily,LightlyActiveMinutes,INT64
6,daily,SedentaryMinutes,INT64
7,daily,Calories,INT64
8,sleep,Id,INT64
9,sleep,SleepDay,TIMESTAMP


### Activity Data

The most detailed and complete data is the physical activity log. Several metrics of physical exertion (Steps, Calories, METs) are provided, measured by time or by distance. Additionally, intensity has been categorized into four levels ('VeryActive', 'FairlyActive', 'LightlyActive', 'Sedentary'). This data has also been recorded at different time scales (daily, hourly, and minute).

#### Daily Activity Data

##### Validate number of unique individuals

In [120]:
%%bigquery
SELECT
    COUNT(DISTINCT ID) AS user_count
FROM `gac-bellabeat.activity.daily`;
  



Unnamed: 0,user_count
0,33


##### Check for Null Values

In [87]:
%%bigquery
SELECT
    COUNT(*) total_rows,
    SUM(CASE WHEN ActivityDate IS NULL THEN 1 ELSE 0 END) activitydate_num_null,
    SUM(CASE WHEN TotalSteps IS NULL THEN 1 ELSE 0 END) totalsteps_num_null,
    SUM(CASE WHEN VeryActiveMinutes IS NULL THEN 1 ELSE 0 END) veryactive_num_null,
    SUM(CASE WHEN FairlyActiveMinutes IS NULL THEN 1 ELSE 0 END) fairlyactive_num_null,
    SUM(CASE WHEN LightlyActiveMinutes IS NULL THEN 1 ELSE 0 END) lightlyactive_num_null,
    SUM(CASE WHEN SedentaryMinutes IS NULL THEN 1 ELSE 0 END) sedentary_num_null,
    SUM(CASE WHEN Calories IS NULL THEN 1 ELSE 0 END) calories_num_null
FROM `gac-bellabeat.activity.daily`;



Unnamed: 0,total_rows,activitydate_num_null,totalsteps_num_null,veryactive_num_null,fairlyactive_num_null,lightlyactive_num_null,sedentary_num_null,calories_num_null
0,940,0,0,0,0,0,0,0
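
The null check above repeats the same `SUM(CASE WHEN ... IS NULL ...)` expression for every column. As a minimal sketch, the hypothetical helper `null_count_query` below builds that query from a list of column names. It is demonstrated against an in-memory SQLite table with made-up rows rather than BigQuery, so the table contents and the helper itself are assumptions for illustration only.

```python
import sqlite3


def null_count_query(table, columns):
    """Return SQL that counts total rows and NULLs per column."""
    parts = ["COUNT(*) AS total_rows"]
    for col in columns:
        parts.append(
            f"SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) "
            f"AS {col.lower()}_num_null"
        )
    return f"SELECT {', '.join(parts)} FROM {table}"


# Tiny in-memory stand-in for the daily table (values are made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily (Id INTEGER, TotalSteps INTEGER, Calories INTEGER)"
)
conn.executemany(
    "INSERT INTO daily VALUES (?, ?, ?)",
    [(1, 1000, 1800), (1, None, 1750), (2, 5000, None), (2, 4200, 2100)],
)

query = null_count_query("daily", ["TotalSteps", "Calories"])
null_counts = conn.execute(query).fetchone()
print(null_counts)  # (4, 1, 1): four rows, one NULL in each column
```

The generated SQL uses the same pattern as the cell above, so the string could in principle be pasted into a `%%bigquery` cell with the fully qualified table name.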


##### Create summary stats panel (Count, Mean, Std, Min, 25%, 50%, 75%, Max)

In [136]:
%%bigquery
SELECT 1 AS Index, 
      'Count' AS Statistic,
       COUNT(TotalSteps) AS total_steps,
       COUNT(VeryActiveMinutes) AS very_active_minutes,
       COUNT(FairlyActiveMinutes) AS fairly_active_minutes,
       COUNT(LightlyActiveMinutes) AS lightly_active_minutes,
       COUNT(SedentaryMinutes) AS sedentary_minutes
  FROM `gac-bellabeat.activity.daily`
 UNION ALL
SELECT 2, 
      'Mean',
       ROUND(AVG(TotalSteps), 1),
       ROUND(AVG(VeryActiveMinutes), 1),
       ROUND(AVG(FairlyActiveMinutes), 1),
       ROUND(AVG(LightlyActiveMinutes), 1),
       ROUND(AVG(SedentaryMinutes), 1)
  FROM `gac-bellabeat.activity.daily`
 UNION ALL
(SELECT 3,
       'STD',
       ROUND(STDDEV_POP(TotalSteps) OVER(), 1),
       ROUND(STDDEV_POP(VeryActiveMinutes) OVER(), 1),
       ROUND(STDDEV_POP(FairlyActiveMinutes) OVER(), 1),
       ROUND(STDDEV_POP(LightlyActiveMinutes) OVER(), 1),
       ROUND(STDDEV_POP(SedentaryMinutes) OVER(), 1)
  FROM `gac-bellabeat.activity.daily`
 LIMIT 1)
 UNION ALL
SELECT 4,
       'Min',
       MIN(TotalSteps),
       MIN(VeryActiveMinutes),
       MIN(FairlyActiveMinutes),
       MIN(LightlyActiveMinutes),
       MIN(SedentaryMinutes)
  FROM `gac-bellabeat.activity.daily`
  UNION ALL
(SELECT 5,
       '25%',
       PERCENTILE_CONT(TotalSteps, 0.25) OVER(),
       PERCENTILE_CONT(VeryActiveMinutes, 0.25) OVER(),
       PERCENTILE_CONT(FairlyActiveMinutes, 0.25) OVER(),
       PERCENTILE_CONT(LightlyActiveMinutes, 0.25) OVER(),
       PERCENTILE_CONT(SedentaryMinutes, 0.25) OVER()
  FROM `gac-bellabeat.activity.daily`
  LIMIT 1) 
  UNION ALL 
(SELECT 6,
       '50%',
       PERCENTILE_CONT(TotalSteps, 0.50) OVER(),
       PERCENTILE_CONT(VeryActiveMinutes, 0.50) OVER(),
       PERCENTILE_CONT(FairlyActiveMinutes, 0.50) OVER(),
       PERCENTILE_CONT(LightlyActiveMinutes, 0.50) OVER(),
       PERCENTILE_CONT(SedentaryMinutes, 0.50) OVER()
  FROM `gac-bellabeat.activity.daily`
  LIMIT 1) 
  UNION ALL 
(SELECT 7,
       '75%',
       PERCENTILE_CONT(TotalSteps, 0.75) OVER(),
       PERCENTILE_CONT(VeryActiveMinutes, 0.75) OVER(),
       PERCENTILE_CONT(FairlyActiveMinutes, 0.75) OVER(),
       PERCENTILE_CONT(LightlyActiveMinutes, 0.75) OVER(),
       PERCENTILE_CONT(SedentaryMinutes, 0.75) OVER()
  FROM `gac-bellabeat.activity.daily`
 LIMIT 1) 
 UNION ALL
SELECT 8,
       'Max',
       MAX(TotalSteps),
       MAX(VeryActiveMinutes),
       MAX(FairlyActiveMinutes),
       MAX(LightlyActiveMinutes),
       MAX(SedentaryMinutes)
  FROM `gac-bellabeat.activity.daily`
ORDER BY Index ASC;



Unnamed: 0,Index,Statistic,total_steps,very_active_minutes,fairly_active_minutes,lightly_active_minutes,sedentary_minutes
0,1,Count,940.0,940.0,940.0,940.0,940.0
1,2,Mean,7637.9,21.2,13.6,192.8,991.2
2,3,STD,5084.4,32.8,20.0,109.1,301.1
3,4,Min,0.0,0.0,0.0,0.0,0.0
4,5,25%,3789.75,0.0,0.0,127.0,729.75
5,6,50%,7405.5,4.0,6.0,199.0,1057.5
6,7,75%,10727.0,32.0,19.0,264.0,1229.5
7,8,Max,36019.0,210.0,143.0,518.0,1440.0
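
For readers who want to sanity-check the panel outside BigQuery, here is a minimal sketch of the same statistics using only the Python standard library. It assumes that `statistics.pstdev` corresponds to `STDDEV_POP` and that `statistics.quantiles(..., method="inclusive")` uses the same linear interpolation as `PERCENTILE_CONT`. The function name `summary_panel` and the sample values are made up for illustration.

```python
import statistics


def summary_panel(values):
    """Return (statistic, value) rows in the same order as the SQL panel."""
    # Like COUNT(col) and the other SQL aggregates, ignore NULLs (None).
    values = [v for v in values if v is not None]
    q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return [
        ("Count", len(values)),
        ("Mean", round(statistics.fmean(values), 1)),
        ("STD", round(statistics.pstdev(values), 1)),  # population std dev
        ("Min", min(values)),
        ("25%", q25),
        ("50%", q50),
        ("75%", q75),
        ("Max", max(values)),
    ]


# Made-up step counts, including one missing value.
sample_steps = [0, 4000, None, 6000, 10000]
panel = summary_panel(sample_steps)
for statistic, value in panel:
    print(f"{statistic}: {value}")
```

With four non-null values, the 25th percentile falls between the first two sorted values (0 and 4000) and is interpolated to 3000.0, which is how fractional quartiles such as 3789.75 can arise in the panel.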


##### Create new DayOfWeek column extracted from ActivityDate to facilitate analysis

In [93]:
%%bigquery
SELECT Id,
       ActivityDate,
       FORMAT_DATE('%a', ActivityDate) AS DayOfWeek,
       TotalSteps,
       VeryActiveMinutes,
       FairlyActiveMinutes,
       LightlyActiveMinutes,
       SedentaryMinutes, 
       Calories
FROM `gac-bellabeat.activity.daily`;



Unnamed: 0,Id,ActivityDate,DayOfWeek,TotalSteps,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
0,1624580081,2016-05-01,Sun,36019,186,63,171,1020,2690
1,1644430081,2016-04-14,Thu,11037,5,58,252,1125,3226
2,1644430081,2016-04-19,Tue,11256,5,58,278,1099,3300
3,1644430081,2016-04-28,Thu,9405,3,53,227,1157,3108
4,1644430081,2016-04-30,Sat,18213,9,71,402,816,3846
...,...,...,...,...,...,...,...,...,...
935,1844505072,2016-04-20,Wed,8,0,0,1,1439,1349
936,4020332650,2016-04-17,Sun,16,0,0,2,1438,1990
937,4319703577,2016-05-12,Thu,17,0,0,2,0,257
938,6775888955,2016-05-03,Tue,9,0,0,1,1439,1843
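
To illustrate why a DayOfWeek column is useful, the sketch below derives the weekday for a few (ActivityDate, TotalSteps) pairs copied from the output above and averages steps per weekday. The helper `day_of_week` and the grouping logic are illustrative assumptions, not part of the original notebook.

```python
from collections import defaultdict
from datetime import date
from statistics import fmean

# Abbreviations in the same style as the DayOfWeek column above.
DAY_NAMES = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def day_of_week(activity_date):
    """Return the abbreviated weekday name (date.weekday() has Monday = 0)."""
    return DAY_NAMES[activity_date.weekday()]


# (ActivityDate, TotalSteps) pairs copied from the output above.
rows = [
    (date(2016, 4, 14), 11037),
    (date(2016, 4, 19), 11256),
    (date(2016, 4, 28), 9405),
    (date(2016, 4, 30), 18213),
]

# Group step counts by weekday, then average each group.
steps_by_day = defaultdict(list)
for activity_date, total_steps in rows:
    steps_by_day[day_of_week(activity_date)].append(total_steps)

mean_steps_by_day = {day: fmean(steps) for day, steps in steps_by_day.items()}
print(mean_steps_by_day)
```

The two Thursday rows collapse into a single average, which is the kind of per-weekday aggregation the new column makes straightforward.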


#### Hourly Activity Data

There are three tables (hourly_steps, hourly_calories, hourly_intensities) recording hourly data. Before merging them on the compound key (Id, ActivityHour), let's check whether the tables differ. We'll compare one pair of tables at a time: use EXCEPT to find the keys present in the first table but not the second, swap the tables and repeat, and then UNION ALL the two results. If any combination of Id and ActivityHour is present in only one of the tables, it will appear in the query results.

In [98]:
%%bigquery
(SELECT hs.Id, hs.ActivityHour
   FROM `gac-bellabeat.activity.hourly_steps` hs
 EXCEPT DISTINCT
 SELECT hc.Id, hc.ActivityHour
  FROM `gac-bellabeat.activity.hourly_calories` hc)
UNION ALL
(SELECT hc.Id, hc.ActivityHour
   FROM `gac-bellabeat.activity.hourly_calories` hc
 EXCEPT DISTINCT
 SELECT hs.Id, hs.ActivityHour
  FROM `gac-bellabeat.activity.hourly_steps` hs);



Unnamed: 0,Id,ActivityHour


In [99]:
%%bigquery
(SELECT hi.Id, hi.ActivityHour
   FROM `gac-bellabeat.activity.hourly_intensities` hi
 EXCEPT DISTINCT
 SELECT hc.Id, hc.ActivityHour
   FROM `gac-bellabeat.activity.hourly_calories` hc)
UNION ALL
(SELECT hc.Id, hc.ActivityHour
   FROM `gac-bellabeat.activity.hourly_calories` hc
 EXCEPT DISTINCT
 SELECT hi.Id, hi.ActivityHour
   FROM `gac-bellabeat.activity.hourly_intensities` hi);



Unnamed: 0,Id,ActivityHour


Since neither query returns any rows, all three tables contain the same (Id, ActivityHour) combinations, and we can safely merge them on the compound key (Id, ActivityHour).
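
To show what the check looks like when the tables do differ, here is a minimal sketch of the same EXCEPT / UNION ALL pattern run against an in-memory SQLite database. SQLite is only a stand-in for BigQuery here, the rows are made up, and the extra `side` column is an illustrative addition that labels which table a mismatched key came from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hourly_steps    (Id INTEGER, ActivityHour TEXT, StepTotal INTEGER);
    CREATE TABLE hourly_calories (Id INTEGER, ActivityHour TEXT, Calories INTEGER);

    -- Made-up rows: 01:00 exists only in steps, 02:00 only in calories.
    INSERT INTO hourly_steps VALUES
        (1, '2016-04-12 00:00:00', 373),
        (1, '2016-04-12 01:00:00', 160);
    INSERT INTO hourly_calories VALUES
        (1, '2016-04-12 00:00:00', 81),
        (1, '2016-04-12 02:00:00', 59);
""")

# Keys in steps but not calories, plus keys in calories but not steps.
SYMMETRIC_DIFF = """
SELECT 'steps_only' AS side, Id, ActivityHour FROM (
    SELECT Id, ActivityHour FROM hourly_steps
    EXCEPT
    SELECT Id, ActivityHour FROM hourly_calories)
UNION ALL
SELECT 'calories_only', Id, ActivityHour FROM (
    SELECT Id, ActivityHour FROM hourly_calories
    EXCEPT
    SELECT Id, ActivityHour FROM hourly_steps)
"""

mismatches = sorted(conn.execute(SYMMETRIC_DIFF).fetchall())
for row in mismatches:
    print(row)
```

An empty result, as in the notebook queries, means the two key sets are identical. Any rows returned identify exactly which keys would be dropped by an inner join.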

In [107]:
%%bigquery
SELECT *
  FROM `gac-bellabeat.activity.hourly_intensities` hi
  JOIN `gac-bellabeat.activity.hourly_steps` hs
    ON hi.Id = hs.Id
   AND hi.ActivityHour = hs.ActivityHour
  JOIN  `gac-bellabeat.activity.hourly_calories` hc
    ON hi.Id = hc.Id
   AND hi.ActivityHour = hc.ActivityHour  
 ORDER BY hi.Id, hi.ActivityHour



Unnamed: 0,Id,ActivityHour,TotalIntensity,AverageIntensity,Id_1,ActivityHour_1,StepTotal,Id_2,ActivityHour_2,Calories
0,1503960366,2016-04-12 00:00:00+00:00,20,0.333333,1503960366,2016-04-12 00:00:00+00:00,373,1503960366,2016-04-12 00:00:00+00:00,81
1,1503960366,2016-04-12 01:00:00+00:00,8,0.133333,1503960366,2016-04-12 01:00:00+00:00,160,1503960366,2016-04-12 01:00:00+00:00,61
2,1503960366,2016-04-12 02:00:00+00:00,7,0.116667,1503960366,2016-04-12 02:00:00+00:00,151,1503960366,2016-04-12 02:00:00+00:00,59
3,1503960366,2016-04-12 03:00:00+00:00,0,0.000000,1503960366,2016-04-12 03:00:00+00:00,0,1503960366,2016-04-12 03:00:00+00:00,47
4,1503960366,2016-04-12 04:00:00+00:00,0,0.000000,1503960366,2016-04-12 04:00:00+00:00,0,1503960366,2016-04-12 04:00:00+00:00,48
...,...,...,...,...,...,...,...,...,...,...
22094,8877689391,2016-05-12 10:00:00+00:00,12,0.200000,8877689391,2016-05-12 10:00:00+00:00,514,8877689391,2016-05-12 10:00:00+00:00,126
22095,8877689391,2016-05-12 11:00:00+00:00,29,0.483333,8877689391,2016-05-12 11:00:00+00:00,1407,8877689391,2016-05-12 11:00:00+00:00,192
22096,8877689391,2016-05-12 12:00:00+00:00,93,1.550000,8877689391,2016-05-12 12:00:00+00:00,3135,8877689391,2016-05-12 12:00:00+00:00,321
22097,8877689391,2016-05-12 13:00:00+00:00,6,0.100000,8877689391,2016-05-12 13:00:00+00:00,307,8877689391,2016-05-12 13:00:00+00:00,101


To make this merged table easier to work with, we'll remove the duplicate columns (Id, ActivityHour), extract separate HourOfDay and DayOfWeek columns from ActivityHour, and then save the result as a view (`gac-bellabeat.activity.hourly`) to access later in our analysis.

In [113]:
%%bigquery
SELECT hi.Id,
       hi.ActivityHour,
       EXTRACT(hour FROM hi.ActivityHour) AS HourOfDay,
       FORMAT_DATE('%a', hi.ActivityHour) AS DayOfWeek,
       TotalIntensity,
       AverageIntensity,
       StepTotal,
       Calories
  FROM `gac-bellabeat.activity.hourly_intensities` hi
  JOIN `gac-bellabeat.activity.hourly_steps` hs
    ON hi.Id = hs.Id
   AND hi.ActivityHour = hs.ActivityHour
  JOIN  `gac-bellabeat.activity.hourly_calories` hc
    ON hi.Id = hc.Id
   AND hi.ActivityHour = hc.ActivityHour  
 ORDER BY hi.Id, hi.ActivityHour
              



Unnamed: 0,Id,ActivityHour,HourOfDay,DayOfWeek,TotalIntensity,AverageIntensity,StepTotal,Calories
0,1503960366,2016-04-12 00:00:00+00:00,0,Tue,20,0.333333,373,81
1,1503960366,2016-04-12 01:00:00+00:00,1,Tue,8,0.133333,160,61
2,1503960366,2016-04-12 02:00:00+00:00,2,Tue,7,0.116667,151,59
3,1503960366,2016-04-12 03:00:00+00:00,3,Tue,0,0.000000,0,47
4,1503960366,2016-04-12 04:00:00+00:00,4,Tue,0,0.000000,0,48
...,...,...,...,...,...,...,...,...
22094,8877689391,2016-05-12 10:00:00+00:00,10,Thu,12,0.200000,514,126
22095,8877689391,2016-05-12 11:00:00+00:00,11,Thu,29,0.483333,1407,192
22096,8877689391,2016-05-12 12:00:00+00:00,12,Thu,93,1.550000,3135,321
22097,8877689391,2016-05-12 13:00:00+00:00,13,Thu,6,0.100000,307,101


Next, we'll save this query as the view `gac-bellabeat.activity.hourly` to use later in our analysis.
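
The statement that creates the view is not shown in this notebook. As a rough sketch of the idea, the example below builds an equivalent view in an in-memory SQLite database, using two rows copied from the output above. SQLite stands in for BigQuery, so the date functions differ (`strftime` and a `substr` lookup instead of `EXTRACT` and `FORMAT_DATE`). In BigQuery the analogous approach would presumably be to wrap the query above in a `CREATE OR REPLACE VIEW ... AS` statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hourly_intensities (Id INTEGER, ActivityHour TEXT,
                                     TotalIntensity INTEGER, AverageIntensity REAL);
    CREATE TABLE hourly_steps       (Id INTEGER, ActivityHour TEXT, StepTotal INTEGER);
    CREATE TABLE hourly_calories    (Id INTEGER, ActivityHour TEXT, Calories INTEGER);

    -- Two hours for one user, copied from the merged output above.
    INSERT INTO hourly_intensities VALUES
        (1503960366, '2016-04-12 00:00:00', 20, 0.333333),
        (1503960366, '2016-04-12 01:00:00', 8, 0.133333);
    INSERT INTO hourly_steps VALUES
        (1503960366, '2016-04-12 00:00:00', 373),
        (1503960366, '2016-04-12 01:00:00', 160);
    INSERT INTO hourly_calories VALUES
        (1503960366, '2016-04-12 00:00:00', 81),
        (1503960366, '2016-04-12 01:00:00', 61);

    -- Join on the compound key and keep a single copy of Id / ActivityHour.
    CREATE VIEW hourly AS
    SELECT hi.Id,
           hi.ActivityHour,
           CAST(strftime('%H', hi.ActivityHour) AS INTEGER) AS HourOfDay,
           -- strftime('%w') returns 0 for Sunday through 6 for Saturday.
           substr('SunMonTueWedThuFriSat',
                  1 + 3 * CAST(strftime('%w', hi.ActivityHour) AS INTEGER),
                  3) AS DayOfWeek,
           hi.TotalIntensity,
           hi.AverageIntensity,
           hs.StepTotal,
           hc.Calories
      FROM hourly_intensities hi
      JOIN hourly_steps hs
        ON hi.Id = hs.Id AND hi.ActivityHour = hs.ActivityHour
      JOIN hourly_calories hc
        ON hi.Id = hc.Id AND hi.ActivityHour = hc.ActivityHour;
""")

view_rows = conn.execute(
    "SELECT * FROM hourly ORDER BY Id, ActivityHour"
).fetchall()
for row in view_rows:
    print(row)
```

Because a view stores the query rather than the data, later cells can select from `hourly` as if it were a single table while the three source tables remain unchanged.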

Let's run our summary stat panel on this data. 

In [134]:
%%bigquery
SELECT 1 AS Index, 
      'Count' AS Statistic,
       COUNT(TotalIntensity) AS total_intensity,
       COUNT(AverageIntensity) AS avg_intensity,
       COUNT(StepTotal) AS step_total,
       COUNT(Calories) AS calories
  FROM `gac-bellabeat.activity.hourly`
 UNION ALL
SELECT 2, 
      'Mean',
       ROUND(AVG(TotalIntensity), 1),
       ROUND(AVG(AverageIntensity), 1),
       ROUND(AVG(StepTotal), 1),
       ROUND(AVG(Calories), 1)
  FROM `gac-bellabeat.activity.hourly`
  UNION ALL
(SELECT 3,
       'STD',
       ROUND(STDDEV_POP(TotalIntensity) OVER(), 1),
       ROUND(STDDEV_POP(AverageIntensity) OVER(), 1),
       ROUND(STDDEV_POP(StepTotal) OVER(), 1),
       ROUND(STDDEV_POP(Calories) OVER(), 1)
  FROM `gac-bellabeat.activity.hourly`
 LIMIT 1)    
  UNION ALL
SELECT 4,
       'Min',
       MIN(TotalIntensity),
       MIN(AverageIntensity),
       MIN(StepTotal),
       MIN(Calories)
  FROM `gac-bellabeat.activity.hourly`
  UNION ALL
(SELECT 5,
       '25%',
       PERCENTILE_CONT(TotalIntensity, 0.25) OVER(),
       PERCENTILE_CONT(AverageIntensity, 0.25) OVER(),
       PERCENTILE_CONT(StepTotal, 0.25) OVER(),
       PERCENTILE_CONT(Calories, 0.25) OVER()
  FROM `gac-bellabeat.activity.hourly`
  LIMIT 1) 
  UNION ALL 
(SELECT 6,
       '50%',
       PERCENTILE_CONT(TotalIntensity, 0.50) OVER(),
       PERCENTILE_CONT(AverageIntensity, 0.50) OVER(),
       PERCENTILE_CONT(StepTotal, 0.50) OVER(),
       PERCENTILE_CONT(Calories, 0.50) OVER()
  FROM `gac-bellabeat.activity.hourly`
  LIMIT 1) 
  UNION ALL 
(SELECT 7,
       '75%',
       PERCENTILE_CONT(TotalIntensity, 0.75) OVER(),
       PERCENTILE_CONT(AverageIntensity, 0.75) OVER(),
       PERCENTILE_CONT(StepTotal, 0.75) OVER(),
       PERCENTILE_CONT(Calories, 0.75) OVER()
  FROM `gac-bellabeat.activity.hourly`
  LIMIT 1) 
UNION ALL
SELECT 8,
       'Max',
       MAX(TotalIntensity),
       MAX(AverageIntensity),
       MAX(StepTotal),
       MAX(Calories)
  FROM `gac-bellabeat.activity.hourly`
ORDER BY Index ASC;



Unnamed: 0,Index,Statistic,total_intensity,avg_intensity,step_total,calories
0,1,Count,22099.0,22099.0,22099.0,22099.0
1,2,Mean,12.0,0.2,320.2,97.4
2,3,STD,21.1,0.4,690.4,60.7
3,4,Min,0.0,0.0,0.0,42.0
4,5,25%,0.0,0.0,0.0,63.0
5,6,50%,3.0,0.05,40.0,83.0
6,7,75%,16.0,0.266667,357.0,108.0
7,8,Max,180.0,3.0,10554.0,948.0


Let's also validate the number of unique individuals.

In [121]:
%%bigquery
SELECT COUNT(DISTINCT Id) AS user_count
 FROM `gac-bellabeat.activity.hourly`;



Unnamed: 0,user_count
0,33


### Sleep Data

Let's inspect the sleep monitoring data and count the number of unique users that contributed sleep tracking data.

In [124]:
%%bigquery
SELECT COUNT(DISTINCT Id) AS user_count
  FROM `gac-bellabeat.activity.sleep`;
    



Unnamed: 0,user_count
0,24


In [129]:
%%bigquery
SELECT *
FROM `gac-bellabeat.activity.sleep`
LIMIT 1;



Unnamed: 0,Id,SleepDay,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed
0,1503960366,2016-04-12 00:00:00+00:00,1,327,346


We see that only 24 out of 33 users tracked their sleep data. Let's also run a stat summary panel. 

In [132]:
%%bigquery
SELECT 1 AS Index, 
      'Count' AS Statistic,
       COUNT(TotalSleepRecords) AS total_sleep_records,
       COUNT(TotalMinutesAsleep) AS total_minutes_asleep,
       COUNT(TotalTimeinBed) AS total_time_in_bed
  FROM `gac-bellabeat.activity.sleep`
 UNION ALL
SELECT 2, 
      'Mean',
       ROUND(AVG(TotalSleepRecords), 1),
       ROUND(AVG(TotalMinutesAsleep), 1),
       ROUND(AVG(TotalTimeinBed), 1)
  FROM `gac-bellabeat.activity.sleep`
 UNION ALL
(SELECT 3,
       'STD',
       ROUND(STDDEV_POP(TotalSleepRecords) OVER(), 1),
       ROUND(STDDEV_POP(TotalMinutesAsleep) OVER(), 1),
       ROUND(STDDEV_POP(TotalTimeinBed) OVER(), 1)
  FROM `gac-bellabeat.activity.sleep`
 LIMIT 1)
 UNION ALL
(SELECT 4,
       'Min',
       MIN(TotalSleepRecords),
       MIN(TotalMinutesAsleep),
       MIN(TotalTimeinBed)
  FROM `gac-bellabeat.activity.sleep`)
  UNION ALL
(SELECT 5,
       '25%',
       PERCENTILE_CONT(TotalSleepRecords, 0.25) OVER(),
       PERCENTILE_CONT(TotalMinutesAsleep, 0.25) OVER(),
       PERCENTILE_CONT(TotalTimeinBed, 0.25) OVER()
  FROM `gac-bellabeat.activity.sleep`
  LIMIT 1) 
  UNION ALL 
(SELECT 6,
       '50%',
       PERCENTILE_CONT(TotalSleepRecords, 0.50) OVER(),
       PERCENTILE_CONT(TotalMinutesAsleep, 0.50) OVER(),
       PERCENTILE_CONT(TotalTimeinBed, 0.50) OVER()
  FROM `gac-bellabeat.activity.sleep`
  LIMIT 1) 
  UNION ALL 
(SELECT 7,
       '75%',
       PERCENTILE_CONT(TotalSleepRecords, 0.75) OVER(),
       PERCENTILE_CONT(TotalMinutesAsleep, 0.75) OVER(),
       PERCENTILE_CONT(TotalTimeinBed, 0.75) OVER()
  FROM `gac-bellabeat.activity.sleep`
  LIMIT 1) 
  UNION ALL
(SELECT 8,
       'Max',
       MAX(TotalSleepRecords),
       MAX(TotalMinutesAsleep),
       MAX(TotalTimeinBed)
  FROM `gac-bellabeat.activity.sleep`)
ORDER BY Index ASC;




Unnamed: 0,Index,Statistic,total_sleep_records,total_minutes_asleep,total_time_in_bed
0,1,Count,413.0,413.0,413.0
1,2,Mean,1.1,419.5,458.6
2,3,STD,0.3,118.2,126.9
3,4,Min,1.0,58.0,61.0
4,5,25%,1.0,361.0,403.0
5,6,50%,1.0,433.0,463.0
6,7,75%,1.0,490.0,526.0
7,8,Max,3.0,796.0,961.0


Let's also create a DayOfWeek column extracted from SleepDay to help with analysis later.

In [137]:
%%bigquery
SELECT SleepDay,
       FORMAT_DATE('%a', SleepDay) AS DayOfWeek,
       TotalSleepRecords,
       TotalMinutesAsleep,
       TotalTimeInBed
 FROM `gac-bellabeat.activity.sleep`;



Unnamed: 0,SleepDay,DayOfWeek,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed
0,2016-04-12 00:00:00+00:00,Tue,1,327,346
1,2016-04-15 00:00:00+00:00,Fri,1,412,442
2,2016-04-17 00:00:00+00:00,Sun,1,700,712
3,2016-04-19 00:00:00+00:00,Tue,1,304,320
4,2016-04-20 00:00:00+00:00,Wed,1,360,377
...,...,...,...,...,...
408,2016-04-17 00:00:00+00:00,Sun,2,525,591
409,2016-05-07 00:00:00+00:00,Sat,2,459,513
410,2016-04-12 00:00:00+00:00,Tue,3,750,775
411,2016-04-24 00:00:00+00:00,Sun,3,552,595


### Weight Data

We'll now inspect the weight table and count how many unique users contributed data.

In [143]:
%%bigquery
SELECT COUNT(DISTINCT Id) AS user_count 
  FROM `gac-bellabeat.activity.weight`;



Unnamed: 0,user_count
0,8


Let's run our stat summary panel.

In [147]:
%%bigquery
SELECT 1 AS Index, 
      'Count' AS Statistic,
       COUNT(WeightKg) AS weight_kg,
       COUNT(WeightPounds) AS weight_pounds,
       COUNT(Fat) AS fat,
       COUNT(BMI) AS bmi  
  FROM `gac-bellabeat.activity.weight`
 UNION ALL
SELECT 2, 
      'Mean',
       ROUND(AVG(WeightKg), 1),
       ROUND(AVG(WeightPounds), 1),
       ROUND(AVG(Fat), 1),
       ROUND(AVG(BMI), 1)
  FROM `gac-bellabeat.activity.weight`
 UNION ALL
(SELECT 3,
       'STD',
       ROUND(STDDEV_POP(WeightKg) OVER(), 1),
       ROUND(STDDEV_POP(WeightPounds) OVER(), 1),
       ROUND(STDDEV_POP(Fat) OVER(), 1),
       ROUND(STDDEV_POP(BMI) OVER(), 1)
  FROM `gac-bellabeat.activity.weight`
 LIMIT 1)
 UNION ALL
(SELECT 4,
       'Min',
       MIN(WeightKg),
       MIN(WeightPounds),
       MIN(Fat),
       MIN(BMI)
  FROM `gac-bellabeat.activity.weight`)
  UNION ALL
(SELECT 5,
       '25%',
       PERCENTILE_CONT(WeightKg, 0.25) OVER(),
       PERCENTILE_CONT(WeightPounds, 0.25) OVER(),
       PERCENTILE_CONT(Fat, 0.25) OVER(),
       PERCENTILE_CONT(BMI, 0.25) OVER()
  FROM `gac-bellabeat.activity.weight`
  LIMIT 1) 
  UNION ALL 
(SELECT 6,
       '50%',
       PERCENTILE_CONT(WeightKg, 0.50) OVER(),
       PERCENTILE_CONT(WeightPounds, 0.50) OVER(),
       PERCENTILE_CONT(Fat, 0.50) OVER(),
       PERCENTILE_CONT(BMI, 0.50) OVER()
  FROM `gac-bellabeat.activity.weight`
  LIMIT 1) 
  UNION ALL 
(SELECT 7,
       '75%',
       PERCENTILE_CONT(WeightKg, 0.75) OVER(),
       PERCENTILE_CONT(WeightPounds, 0.75) OVER(),
       PERCENTILE_CONT(Fat, 0.75) OVER(),
       PERCENTILE_CONT(BMI, 0.75) OVER()
  FROM `gac-bellabeat.activity.weight`
  LIMIT 1) 
  UNION ALL
(SELECT 8,
       'Max',
       MAX(WeightKg),
       MAX(WeightPounds),
       MAX(Fat),
       MAX(BMI)
  FROM `gac-bellabeat.activity.weight`)
ORDER BY Index ASC;



Unnamed: 0,Index,Statistic,weight_kg,weight_pounds,fat,bmi
0,1,Count,67.0,67.0,2.0,67.0
1,2,Mean,72.0,158.8,23.5,25.2
2,3,STD,13.8,30.5,1.5,3.0
3,4,Min,52.599998,115.963147,22.0,21.450001
4,5,25%,61.400002,135.363832,22.75,23.959999
5,6,50%,62.5,137.788914,23.5,24.389999
6,7,75%,85.049999,187.503152,24.25,25.559999
7,8,Max,133.5,294.31712,25.0,47.540001
