# Google Data Analytics Professional certification: Capstone Project 

## Table of Contents

* I [Summary](#summary)
* II [Ask Phase](#ask)
 - II.1. [Business Task](#task)
 - II.2. [Key Stakeholders](#keys)  
     - II.2.a. [Primary Stakeholders](#pri) 
     - II.2.b. [Secondary Stakeholders](#sec)   
* III [Prepare Phase](#prepare) 
    - III.1. [Dataset](#ds)   
    - III.2. [Data Organization](#dorg)  
    - III.3. [Credibility](#cred) 
* IV [Process Phase](#process)  
   - IV.1. [Tools](#tools) 
   - IV.2. [Load packages](#load)  
   - IV.3. [Import datasets](#imp)  
   - IV.4. [View data](#view) 
   - IV.5. [Cleaning Dataset](#cleand)
   - IV.6. [Data Transformation](#dt)    
* V [Analyze & Share Phase](#analyze) 
   - V.1. [Summary Statistics](#sumstat)
   - V.2. [Visualization](#visualize)
       - V.2.a. [Daily Steps Vs. Calories](#dsc) 
       - V.2.b. [Calories burned by active minutes](#ca) 
       - V.2.c. [Calories burned by active distance](#cd) 
       - V.2.d. [Intensity by Week Day](#intw) 
       - V.2.e. [Energy Expenditure: Intensity Vs. Step Count](#eeis) 
       - V.2.f. [Energy Expenditure: Intensity Vs. Distance](#eeid)
       - V.2.g. [Energy Expenditure: METs Vs. Step Count](#eems) 
       - V.2.h. [Overall Analysis: Step count Vs. METs Vs. Intensity Levels Vs. Calories](#ovs) 
       - V.2.i. [Time Asleep Vs. Time in bed](#asb) 
       - V.2.j. [Sleep efficiency by Week Day](#eff) 
       - V.2.k. [Sleep Efficiency Vs. Intensity](#iseff) 
       - V.2.l. [Sleep Cycle](#scyl) 
       - V.2.m. [Heart rate](#hr)
   - V.3. [Share Key Findings](#key)
       - V.3.a. [Results](#result)
       - V.3.b. [Limitations](#limit)
* VII [Act Phase](#act)   
    - VII.1. [Recommendations](#recommend)  
* VIII [References](#ref)  

## I Summary <a id="summary"></a>
Bellabeat is a high-tech company that manufactures health-focused smart products for women. Its products empower women with knowledge about their own health and habits. Bellabeat is a successful small company with the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.

## II Ask Phase <a id="ask"></a>
### II.1. Business Task: <a id="task"></a>
The task is to analyze trends in how consumers use non-Bellabeat smart devices and to provide high-level recommendations on how these insights can inform Bellabeat's marketing strategy.
The following guiding questions frame this analysis:
* What are some of the trends in smart device usage?
* How can these trends apply to Bellabeat customers?
* How do users manage their overall health and wellness?
* How can these trends help influence Bellabeat's marketing strategy?

### II.2. Key Stakeholders <a id="keys"></a>
#### II.2.a. Primary stakeholders: <a id="pri"></a>
* Urška Sršen - Bellabeat's co-founder and Chief Creative Officer.
* Sando Mur - Mathematician and Bellabeat's co-founder; key member of the executive team.
#### II.2.b. Secondary stakeholders: <a id="sec"></a>
* Bellabeat marketing analytics team.

## III Prepare Phase <a id="prepare"></a>
### III.1. Dataset: <a id="ds"></a>
The [FitBit Fitness Tracker Data](http://www.kaggle.com/arashnic/fitbit) (CC0: Public Domain, made available through Mobius) is a Kaggle dataset containing personal fitness tracker data from thirty Fitbit users. These thirty eligible users consented to the submission of their personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. The dataset includes information about daily activity, steps, and heart rate that can be used to explore users' habits.
<a id="dorg"></a>
### III.2. Data Organization/format:
This public dataset contains a total of 18 comma-separated values (CSV) files: 15 in long format and 3 in wide format.
<a id="cred"></a>
### III.3. Credibility:
The ROCCC process (Reliable, Original, Comprehensive, Current, Cited) is used to determine the credibility and integrity of the data.
 * *Reliable:* 
    1. The sample contains only 30 users, which is very small. All participants must therefore be included in the analysis, and the accuracy of any conclusion will still be limited. 
    2. The data may be biased: there is not enough evidence that participants were sampled at random, and participants were only surveyed online.
    
 * *Original:* The data is not original, as it was generated by respondents to a survey distributed via Amazon Mechanical Turk. 

 * *Comprehensive:* Information such as gender, age, and diet is missing, which is crucial for Bellabeat, whose products are focused on women. 

 * *Current:* The data was collected between 03.12.2016 and 05.12.2016, so it may not represent current trends.

 * *Cited:* Original data source found [here](http://zenodo.org/record/53894#.Ygm17t_MK3B)

Although this dataset has limitations, the main purpose of this case study is to work through the data analysis process.


## IV Process Phase <a id="process"></a>
### IV.1. Tools: <a id="tools"></a>
The **R programming language** is used throughout the data processing, analysis, and visualization phases of this case study.
* *R packages used:* 'tidyverse', 'lubridate', 'dplyr', 'ggplot2', 'janitor', 'viridis', 'ggpubr', 'scales', 'gridExtra', 'ggthemes', 'readr', 'hrbrthemes'.

### IV.2. Install and load packages: <a id="load"></a>

In [None]:
library(tidyverse)
library(lubridate)
library(dplyr)
library(ggplot2)
library(readr)
library(janitor)
library(ggthemes)
library(ggpubr)
library(scales)
library(gridExtra)
library(hrbrthemes)

### IV.3. Import datasets: <a id="imp"></a>

Analyzing sleep patterns, step count, exercise intensity, calories burned, heart rate, and BMI can give insights for this case study. The following datasets are used.

In [None]:
activity <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
calories <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv")
intensities <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
sleep <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
heart_rate <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
weight <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")

# Analysis with minutes data
steps_min<-read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteStepsNarrow_merged.csv")
int_min<-read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteIntensitiesNarrow_merged.csv")
cal_min<-read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteCaloriesNarrow_merged.csv")
met_min<-read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteMETsNarrow_merged.csv")
sleep_min<-read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteSleep_merged.csv")  

### IV.4. View data: <a id="view"></a>

In [None]:
str(activity)
str(calories)
str(intensities)
str(sleep)
str(heart_rate)
str(weight)
str(met_min)
str(steps_min)
str(int_min)
str(cal_min)
str(sleep_min)

* The date column in every dataset is of character type and needs to be converted to a date-time format (POSIXct) before analysis. The rest of the variables are numeric.
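
As a minimal sketch of that conversion, the snippet below parses two made-up timestamps written in the same 12-hour month/day/year layout as the hourly files. The sample strings and the fixed `tz = "UTC"` are assumptions chosen for reproducibility (the notebook itself uses the session's local time zone), and `%p` parsing depends on the session locale providing AM/PM markers.

```r
# Hypothetical sample values in the "m/d/Y I:M:S p" layout
raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:30:00 PM")

# Parse to POSIXct; a fixed time zone keeps the result reproducible
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

class(parsed)                 # date-time class instead of character
format(parsed, "%H:%M:%S")    # 24-hour clock times
```

Once the column is POSIXct, helpers such as `format()` and `weekdays()` can derive date, time, and weekday columns from it.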


In [None]:
# n_distinct(): count the distinct users in each data frame
n_distinct(activity$Id)
n_distinct(calories$Id)
n_distinct(intensities$Id)
n_distinct(sleep$Id)
n_distinct(heart_rate$Id)
n_distinct(weight$Id)
n_distinct(met_min$Id)
n_distinct(steps_min$Id)
n_distinct(int_min$Id)
n_distinct(cal_min$Id)
n_distinct(sleep_min$Id)

 * All the datasets contain 33 users, except the sleep dataset with 24, the heart rate dataset with 14, and the weight dataset with only 8.
 * There are therefore 33 users in the data, not 30 as stated in the dataset description.
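
The same check can be written once over a named list instead of one call per data frame. This is only a sketch in base R on toy data; the list `frames` and its values are made up for illustration.

```r
# Toy data frames standing in for the imported tables (values are made up)
frames <- list(
  activity = data.frame(Id = c(1, 1, 2, 3)),
  sleep    = data.frame(Id = c(1, 2, 2)),
  weight   = data.frame(Id = c(3))
)

# Count distinct Ids per data frame in one pass
users_per_frame <- vapply(frames, function(df) length(unique(df$Id)), integer(1))
users_per_frame
```

The result is a named integer vector, which makes the differences in user coverage between tables easy to compare at a glance.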

In [None]:
# Check that enough users overlap between two datasets (by the common column 'Id') before merging.
# intersect() returns the Ids common to both datasets.

u<-intersect(activity$Id, heart_rate$Id)
cat("Activity and heart_rate: ", length(u),"\n",u,"\n")

sh<-intersect(sleep$Id, heart_rate$Id)
cat("Sleep and heart_rate: ", length(sh),"\n",sh,"\n")

ci<-intersect(activity$Id, calories$Id)
cat("Activity and calories: ", length(ci),"\n",ci,"\n")

ai<-intersect(activity$Id, intensities$Id)
cat("Activity and intensity: ", length(ai),"\n",ai,"\n")

si<-intersect(activity$Id, sleep$Id)
cat("Activity and sleep: ", length(si),"\n",si,"\n")

ms<-intersect(met_min$Id, sleep$Id)
cat("METs and sleep: ", length(ms),"\n",ms,"\n")


hc<-intersect(heart_rate$Id, calories$Id)
cat("Heart_rate and Calories: ", length(hc),"\n",hc,"\n")


hm<- intersect(heart_rate$Id, met_min$Id)
cat("Heart_rate and METs: ", length(hm),"\n",hm,"\n")


sm<-intersect(steps_min$Id, met_min$Id)
cat("Steps_min and METs: ", length(sm),"\n",sm,"\n")

* From the above analysis, the activity, intensities, and calories data frames share enough users to be usable when combined. Similarly, the met_min and steps_min data frames can be combined.
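
The repeated `intersect()` calls follow one pattern, which could be wrapped in a small function. The helper name `shared_ids` and the toy data frames below are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: number of Ids shared by two data frames
shared_ids <- function(a, b) length(intersect(a$Id, b$Id))

# Toy tables (values are made up)
activity_toy   <- data.frame(Id = c(1, 2, 3, 4))
heart_rate_toy <- data.frame(Id = c(2, 4, 5))

shared_ids(activity_toy, heart_rate_toy)
```

A low count relative to the larger table signals that merging the two would discard most users.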


### IV.5. Cleaning Dataset: <a id="cleand"></a>

* #### **Check for NULL:**

In [None]:
is.null(activity)
is.null(calories)
is.null(intensities)
is.null(sleep)
is.null(heart_rate)
is.null(weight)
is.null(met_min)
is.null(steps_min)
is.null(int_min)
is.null(cal_min)
is.null(sleep_min)

* #### **Remove duplicates and NAs:**
 **Note:** The variable names are not converted to a consistent lowercase format, except in the weight data frame. The `clean_names()` function from janitor can be used if needed.

In [None]:
activity <- activity%>% distinct()%>% drop_na()
head(activity,5)

calories <- calories%>% distinct()%>% drop_na()
head(calories,5)

intensities<-intensities%>% distinct()%>%drop_na()
head(intensities,5)

sleep<- sleep%>% distinct() %>% drop_na()
head(sleep,5)

heart_rate <- heart_rate %>% distinct() %>% drop_na()
head(heart_rate,5)

weight <- weight %>%clean_names()%>% distinct()%>% drop_na()
head(weight,5)

met_min <- met_min %>%distinct() %>% drop_na()
head(met_min,5)

steps_min <- steps_min %>% distinct() %>% drop_na()
head(steps_min,5)

int_min <- int_min%>%distinct() %>% drop_na()
head(int_min,5)

cal_min <- cal_min %>% distinct() %>% drop_na()
head(cal_min,5)

sleep_min <- sleep_min %>% distinct() %>% drop_na()
head(sleep_min,5)

* **weight dataframe:** The "Fat" column contains NA values, so `drop_na()` removes every row where it is missing. Because too little data remains, it is not used in the analysis.
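
Note that `is.null()` applied to a data frame only reports whether the object itself is `NULL`; it says nothing about missing values inside it. A sketch of a per-column NA count, using a made-up stand-in for the weight table (column names and values are assumptions):

```r
# Toy stand-in for the weight table (values are made up)
weight_toy <- data.frame(
  id        = c(1, 1, 2),
  weight_kg = c(52.6, 52.6, 133.5),
  fat       = c(22, NA, NA)
)

is.null(weight_toy)                          # FALSE: the object exists
na_per_column <- colSums(is.na(weight_toy))  # missing values per column
na_per_column

# Rows without any NA, i.e. the rows a complete-case filter keeps
complete_rows <- weight_toy[complete.cases(weight_toy), ]
nrow(complete_rows)
```

Checking the counts before dropping rows shows how much data a complete-case filter would discard.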

### IV.6. Data Transformation: <a id="dt"></a>

In [None]:
#Before merging, format data.
#Activity 
activity$ActivityDate <- as.POSIXct(activity$ActivityDate, format="%m/%d/%Y", tz="")   # tz="" uses the current time zone
activity$date <- format(activity$ActivityDate, format="%m/%d/%y")

#intensity
intensities$ActivityHour = as.POSIXct(intensities$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz="")  
intensities$date <- format(intensities$ActivityHour, format="%m/%d/%y")
intensities$time <- format(intensities$ActivityHour, format="%H:%M:%S")
intensities$day <- weekdays(intensities$ActivityHour)

#sleep
sleep$SleepDay = as.POSIXct(sleep$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz="")   
sleep$date <- format(sleep$SleepDay, format="%m/%d/%y")
sleep$time <- format(sleep$SleepDay, format="%H:%M:%S")
sleep$day <- weekdays(sleep$SleepDay)

#Calories
calories$ActivityHour = as.POSIXct(calories$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz="")  
calories$date <- format(calories$ActivityHour, format="%m/%d/%y")
calories$time <- format(calories$ActivityHour, format="%H:%M:%S")

#heart_rate
heart_rate$Time = as.Date(heart_rate$Time,"%m/%d/%Y")

#sleep_min
sleep_min$date <- strptime(as.character(sleep_min$date), format="%m/%d/%Y %I:%M:%S %p", tz="") 
sleep_min$day <-factor(weekdays(sleep_min$date, abbreviate = FALSE),levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
sleep_min$time <-format(sleep_min$date, format="%H:%M:%S") 
sleep_min$Date <- format(sleep_min$date, format="%m/%d/%y") 

#step_min
steps_min$date = strptime(as.character(steps_min$ActivityMinute), format="%m/%d/%Y %I:%M:%S %p", tz="") 
steps_min$day <- factor(weekdays(steps_min$date, abbreviate = FALSE),levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))

#int_min
int_min$date = strptime(as.character(int_min$ActivityMinute), format="%m/%d/%Y %I:%M:%S %p", tz="") 
int_min$day<- factor(weekdays(int_min$date, abbreviate = FALSE),levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))

#cal_min
cal_min$date = strptime(as.character(cal_min$ActivityMinute), format="%m/%d/%Y %I:%M:%S %p", tz="") 
cal_min$day<- factor(weekdays(cal_min$date, abbreviate = FALSE),levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))

#met_min
met_min$date = strptime(as.character(met_min$ActivityMinute), format="%m/%d/%Y %I:%M:%S %p", tz="") 
met_min$day<- factor(weekdays(met_min$date, abbreviate = FALSE),levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))

In [None]:
head(activity,3)
head(calories,3)
head(intensities,3)
head(sleep,3)
head(heart_rate,3)
#head(weight,5)
head(met_min,3)
head(steps_min,3)
head(int_min,3)
head(cal_min,3)
head(sleep_min,3)

* The datasets are cleaned, transformed, and ready for the next phase.
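
The transformation cells repeat the same weekday-level vector for every data frame. One possible way to factor that out is sketched below; `weekday_factor` is a hypothetical helper, and it deliberately uses `format(x, "%u")` (ISO weekday number, Monday = 1) rather than `weekdays()` so that the result does not depend on the session locale.

```r
# Weekday labels in the order used throughout the notebook
week_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")

# Hypothetical helper: ordered weekday factor from a date-time vector
weekday_factor <- function(x) {
  idx <- as.integer(format(x, "%u"))   # 1 = Monday ... 7 = Sunday
  factor(week_levels[idx], levels = week_levels)
}

stamp <- as.POSIXct("2016-04-12 07:21:00", tz = "UTC")
weekday_factor(stamp)
```

Because the levels are fixed in one place, plots built from different data frames keep the same Monday-to-Sunday order.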

## V Analyze & Share Phase <a id="analyze"></a>

This phase analyzes and visualizes the data to identify trends, patterns, and relationships that can help the Bellabeat marketing team.

### V.1. Summary Statistics: <a id="sumstat"></a>
Let's begin the analysis with descriptive statistics to understand the distribution of selected variables in each dataset.

In [None]:
#Summary statistics of our data

#activity
activity %>%
  select(TotalSteps,TotalDistance,Calories,
         VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes,
         SedentaryActiveDistance,LightActiveDistance,ModeratelyActiveDistance,VeryActiveDistance) %>%
  summary()

#sleep
sleep %>%
  select(TotalMinutesAsleep, TotalTimeInBed) %>%
  summary()

#Intensity
intensities %>%
  select(TotalIntensity) %>%
  summary()

**Activity:**
* The average very active, fairly active, and lightly active times are 21.16 minutes, 13.56 minutes, and 192.8 minutes (3.21 hours), respectively. 
* The average sedentary time is 991.2 minutes (16.52 hours).
* The average very active distance is 1.50 miles and the average lightly active distance is 3.34 miles.
* This shows that users were inactive for long periods and spent little time exercising.
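
The minute-to-hour conversions quoted above can be reproduced with a short calculation. The sketch below only re-uses the four means reported in the summary; the vector names are made up.

```r
# Mean minutes per activity level, copied from the summary above
mean_minutes <- c(very_active = 21.16, fairly_active = 13.56,
                  lightly_active = 192.8, sedentary = 991.2)

# Convert to hours
mean_hours <- round(mean_minutes / 60, 2)
mean_hours

# Share of the tracked minutes spent in each level
share <- round(mean_minutes / sum(mean_minutes), 3)
share
```

On these means, sedentary time accounts for about 0.81 of the tracked minutes, which is what the last bullet summarizes in words.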


**Sleep:** 
* The average time asleep is 419.2 minutes (about 7 hours), close to the recommended sleep time (7-9 hours). However, the minimum recorded time asleep is only 58 minutes.

### V.2. Visualization: <a id="visualize"></a>

#### V.2.a. Daily steps taken and Calories burned: <a id="dsc"></a>

In [None]:
#Relationship between Daily steps taken and Calories burned 

plota<- ggplot(activity,aes(x=TotalSteps,y= Calories,color=Calories)) +
    geom_jitter(stat="identity")+
    geom_smooth(method="lm", formula="y~x",color="lightseagreen" ,se= F)+
    scale_color_gradient(low="orangered",high="midnightblue")+
    labs(title= "Total steps taken Vs Calories Burned",x= "Total Steps taken", y= "Calories Burned")+
    theme_bw()
options(repr.plot.width = 10, repr.plot.height = 8, repr.plot.res = 150)

plota 

* The graph shows a positive linear relationship between total steps taken and calories burned. Because the relationship appears linear, Pearson's product-moment correlation coefficient is used to measure how strongly the two variables are related.

* #### **Correlation test between TotalSteps taken and Calories burned:**
 **cor.test()** tests for correlation between paired samples. It returns both the correlation coefficient and the significance level (p-value) of the correlation.

In [None]:
cor.test(activity$TotalSteps,activity$Calories, method="pearson")

* From the above result, the p-value is < 2.2e-16, which is below the significance level (0.05). We can conclude that the correlation between TotalSteps and Calories is statistically significant and moderately positive (correlation coefficient = 0.59).

* Let's analyze further how the different levels of active minutes and distance affect calories burned.
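
For reference, the coefficient and p-value read off the printed `cor.test()` output can also be extracted from the returned object. The sketch below runs on synthetic data (the vectors `x` and `y` are made up), so its numbers are unrelated to the Fitbit result.

```r
# Synthetic data: y rises with x plus a fixed, alternating disturbance
x <- 1:20
y <- 2 * x + rep(c(-3, 3), times = 10)

res <- cor.test(x, y, method = "pearson")

r_value <- unname(res$estimate)   # correlation coefficient
p_value <- res$p.value            # significance of the correlation
significant <- p_value < 0.05     # compare against the 0.05 level

round(r_value, 2)
significant
```

Storing the two values makes it easy to report several tests in one table instead of reading each printout.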

#### **V.2.b. Calories burned by active minutes:** <a id="ca"></a>

In [None]:
# Comparing Calories burned at different Activity Minutes

activity1<- activity %>%
  group_by(Id,date)%>%
  select(TotalSteps,SedentaryMinutes,LightlyActiveMinutes,FairlyActiveMinutes,VeryActiveMinutes,Calories)%>%
  # Select the four *Minutes columns by name rather than by position
  pivot_longer(cols=ends_with("Minutes"),names_to ="Category",values_to = "Minutes")

head(activity1,4)

* #### **Calories burned during different activity minutes:**

In [None]:
#visualize
plotb <-ggplot(activity1,aes(x=Minutes/60,y=Calories,color=Category))+
  geom_jitter(stat="identity")+
  scale_color_manual(values=c("deeppink3","orange2","darkturquoise","aquamarine4"))+
  geom_smooth(formula=y~x,color="darkslategrey",method="lm",se=F)+
  facet_wrap(~ Category,scales="free_x")+
  ylim(0,5000)+
  labs(title= "Comparing Calories burned on Activity levels",x= "Hours", y= "Calories Burned")+
  theme_minimal()+   # apply the complete theme first so it does not reset the tweak below
  theme(legend.title = element_blank())
options(repr.plot.width = 12, repr.plot.height = 10, repr.plot.res = 150)

suppressWarnings(print(plotb))

* The graph shows a positive linear relationship between calories burned and very active, fairly active, and lightly active minutes, but a negative linear relationship between sedentary minutes and calories burned. Pearson's correlation coefficient is used to determine how strongly these variables are related.

* #### **Correlation test between Active Minutes and Calories burned:**

In [None]:
cor.test(activity$Calories,activity$SedentaryMinutes, method="pearson")
cor.test(activity$Calories,activity$LightlyActiveMinutes, method="pearson")
cor.test(activity$Calories,activity$FairlyActiveMinutes, method="pearson")
cor.test(activity$Calories,activity$VeryActiveMinutes, method="pearson")

**From the above result, we can interpret:**
* Very Active Minutes: The p-value is < 2.2e-16, which is below the significance level (0.05). The correlation between VeryActiveMinutes and Calories is statistically significant and moderately positive (correlation coefficient = 0.61). 

* Fairly Active Minutes: The p-value is < 2.2e-16, which is below the significance level. The correlation between FairlyActiveMinutes and Calories is statistically significant and low positive (0.29).

* Lightly Active Minutes: The p-value is < 2.2e-16, which is below the significance level. The correlation between LightlyActiveMinutes and Calories is statistically significant and positive (0.28).

* Sedentary Minutes: The p-value is 0.001021, which is below the significance level. The correlation between SedentaryMinutes and Calories is statistically significant and negative (-0.106).
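
When several columns are tested against the same variable, the results can be collected into one table. The sketch below is an illustration on made-up data that borrows the activity column names; it is not the notebook's own result.

```r
# Toy data (made up) with the same column names as the activity table
toy <- data.frame(
  Calories          = c(1800, 2100, 2500, 2300, 3000, 2700),
  SedentaryMinutes  = c(1100, 1000, 900, 950, 700, 800),
  VeryActiveMinutes = c(0, 10, 30, 20, 60, 45)
)

minute_cols <- c("SedentaryMinutes", "VeryActiveMinutes")

# One row per column: correlation coefficient and p-value against Calories
cor_table <- do.call(rbind, lapply(minute_cols, function(col) {
  res <- cor.test(toy$Calories, toy[[col]], method = "pearson")
  data.frame(variable = col,
             r = unname(res$estimate),
             p_value = res$p.value)
}))
cor_table
```

A table like this puts the sign and strength of each relationship side by side, mirroring the bullet list above.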

#### **V.2.c. Calories burned during different active distance:** <a id="cd"></a>

In [None]:
#By separate levels of active distance 
activity2<- activity %>%
  group_by(Id,date)%>%
  select(TotalSteps,SedentaryActiveDistance,LightActiveDistance,ModeratelyActiveDistance,VeryActiveDistance,Calories)%>%
  # Select the four *Distance columns by name rather than by position
  pivot_longer(cols=ends_with("Distance"),names_to ="Category",values_to = "Distance")

head(activity2,4)

In [None]:
#visualize
plotc <-ggplot(activity2,aes(x=Distance,y=Calories,color=Category))+
  geom_jitter(stat="identity")+
  scale_color_manual(values=c("chartreuse3","darksalmon","darkslategray3","coral3"))+
  geom_smooth(formula=y~x,color="deepskyblue4",method="lm",se=F)+
  facet_wrap(~ Category,scales="free_x")+
  ylim(0,5000)+
  labs(title= "Comparing Calories burned by Distance",x= "Distance", y= "Calories Burned")+
  theme_minimal()+   # apply the complete theme first so it does not reset the tweak below
  theme(legend.title = element_blank())
options(repr.plot.width = 12, repr.plot.height = 10, repr.plot.res = 150)

suppressWarnings(print(plotc))

* The graph shows a positive linear relationship between calories burned and very active, moderately active, and lightly active distance, but no correlation between sedentary active distance and calories burned. Pearson's correlation coefficient is used to determine how strongly these variables are related.

* #### **Correlation test between Active Distance and Calories burned:**

In [None]:
cor.test(activity$Calories,activity$SedentaryActiveDistance, method="pearson")
cor.test(activity$Calories,activity$LightActiveDistance, method="pearson")
cor.test(activity$Calories,activity$ModeratelyActiveDistance, method="pearson")
cor.test(activity$Calories,activity$VeryActiveDistance, method="pearson")

**From the above result, we can interpret:**
* Very Active Distance: The p-value is < 2.2e-16, which is below the significance level (0.05). The correlation between VeryActiveDistance and Calories is statistically significant and low positive (correlation coefficient = 0.49). 

* Moderately Active Distance: The p-value is 1.844e-11, which is below the significance level. ModeratelyActiveDistance and Calories have a marginally low positive correlation (correlation coefficient = 0.21). This may be because most data points lie between 0 and 2 miles, with a few outliers.

* Lightly Active Distance: The p-value is < 2.2e-16, which is below the significance level (0.05). The correlation between LightActiveDistance and Calories is statistically significant and positive (0.46).

* Sedentary Active Distance: The p-value is 0.1812, which is above the significance level, and the correlation coefficient is almost zero (0.043). There is no statistically significant relationship between SedentaryActiveDistance and Calories. This may be because most data points are at zero, with a few outliers.

#### **V.2.d. Intensity by Week Day:** <a id="intw"></a>

In [None]:
#Intensity grouping 
intensities1<- intensities%>%
  filter(TotalIntensity>0)%>%
  group_by(day)%>%
  summarise(mean_intensity= round(mean(TotalIntensity),2),.groups='drop')%>%
  mutate(day= factor(day,levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")))

head(intensities1)

In [None]:
#Visualize
ploti <- ggplot(intensities1,aes(x=day,y=mean_intensity,fill=mean_intensity))+
  geom_bar(stat="identity",color="black")+
  labs(title = "Average Intensity by day",x=" ", y="Average Intensity",fill="Avg. Intensity")+
  scale_fill_gradient(low="hotpink",high="indianred4")+
  theme_clean()
options(repr.plot.width = 10, repr.plot.height = 8, repr.plot.res = 150)  

ploti

* Participants exercised more intensely on weekends, with the highest average intensity on Saturday. The rest of the week, except Monday, shows almost the same intensity level.

* #### **Comparing Calories burned based on Intensity level and Step Count/Distance:**

In [None]:
#intensity by date # Aggregation by date 
intensity1 <- intensities%>%
  group_by(Id,date,day)%>%
  summarise(totintensity= sum(TotalIntensity),.groups='drop')%>%
  mutate(day= factor(day,levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")))

In [None]:
# combine Activity and intensity datasets
#Simple verification before merging two datasets
xx<-intersect(activity$Id, intensity1$Id)
cat("Activity and intensity: ", length(xx),"\n",xx)

act_int <- merge(activity,intensity1,by=c('Id','date'))  
head(act_int,3)

act_int1 <- act_int %>%
  group_by(Id,day) %>%
  summarise(mean_steps= round(mean(TotalSteps),2),
            mean_intensity=round(mean(totintensity),2), 
            mean_distance= round(mean(TotalDistance),2),
            mean_calories=round(mean(Calories),2),
            .groups='drop')%>%
  mutate(day=factor(day,levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")))

head(act_int1,3)

#### **V.2.e. Energy Expenditure: Step Count Vs Intensity** <a id="eeis"></a>

In [None]:
#Visualize
plotis <- ggplot(act_int1,aes(x=mean_steps,y=mean_intensity))+
  geom_point(aes(size=mean_calories,color=mean_calories),alpha=0.5)+
  scale_size(range=c(0.1,12),name ="Calories")+
  #scale_size_continuous(name ="Calories")+
  scale_color_gradientn(colors = rev(rainbow(12)))+
  labs(x="Avg. Steps taken", y="Avg.Intensity", title="Energy Expenditure: Step Count Vs. Intensity",subtitle="Comparing Calories burned based on Step count and Intensity levels",color="Calories")+
  theme_ipsum(grid="Y")+
  theme(legend.position = "bottom")
options(repr.plot.width = 12, repr.plot.height = 10, repr.plot.res = 150)

plotis

* The plot shows that more calories are burned at higher intensity levels and higher step counts.

#### **V.2.f. Energy Expenditure: Distance Vs. Intensity** <a id="eeid"></a>

In [None]:
plotid <- ggplot(act_int1,aes(x=mean_distance,y=mean_intensity))+
  geom_point(aes(size=mean_calories,color=mean_calories),alpha=0.5)+
  scale_size(range=c(0.1,12),name ="Calories")+
  #scale_size_continuous(name ="Calories")+
  scale_color_gradientn(colors = rev(topo.colors(12)))+   #
  labs(x="Avg. Distance", y="Avg.Intensity", title="Energy Expenditure: Distance Vs. Intensity",subtitle="Comparing Calories burned based on Distance and Intensity levels",color="Calories")+
  theme_ipsum(grid="Y")+
  theme(legend.position = "bottom")
options(repr.plot.width = 12, repr.plot.height = 10, repr.plot.res = 150)

plotid

* The plot shows that more calories are burned at higher intensity levels and over longer distances.

* #### **Correlation between: Calories, steps, distance and intensity**

In [None]:
corr1 <-cor(act_int1[, c(3,4,5,6)], use = "complete.obs")
round(corr1,2)

* There is a strong positive correlation among steps, intensity, and distance; calories burned, however, is only moderately correlated with them.

#### **V.2.g. Energy Expenditure: METs Vs. Steps Taken** <a id="eems"></a>
A [MET](https://www.hsph.harvard.edu/nutritionsource/staying-active/) (metabolic equivalent of task) is the ratio of your working metabolic rate to your resting metabolic rate. Metabolic rate is the rate of energy expended per unit of time. It is one way to describe the intensity of an exercise or activity.

* Resting—Uses 1.5 or fewer METs. Examples are sitting, reclining, or lying down.
* Light intensity—Uses from 1.6-3.0 METs. Examples are walking at a leisurely pace or standing in line at the store.
* Moderate intensity—Uses from 3.0-6.0 METs. Examples are walking briskly, vacuuming, or raking leaves.
* Vigorous intensity—Uses from 6.0+ METs.

The Fitbit API returns MET values without a decimal point, so METs must be divided by 10 before analysis. [More Info](http://community.fitbit.com/t5/Web-API-Development/Definition-of-Mets-in-Charge-HR/td-p/1240824)
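
A small worked example of the rescaling and grouping, assuming the thresholds listed above. It uses base R's `cut()` as a compact alternative to the `case_when()` approach in the notebook cells, and the raw values are made up.

```r
# Raw values as stored (METs x 10); the numbers are made up
raw_mets <- c(10, 15, 16, 29, 30, 59, 60, 85)
mets <- raw_mets / 10

# Bin into the four intensity groups: <=1.5, 1.6-2.9, 3.0-5.9, >=6.0
met_group <- cut(mets,
                 breaks = c(-Inf, 1.5, 2.9, 5.9, Inf),
                 labels = c("Resting", "Light", "Moderate", "Vigorous"))

data.frame(raw_mets, mets, met_group)
```

Because the rescaled values have one decimal place, closing each interval at 1.5, 2.9, and 5.9 leaves no value unassigned.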

In [None]:
#MET data transformation: rescale METs and keep the columns needed

met11<- met_min%>%
  mutate(mets=METs/10)%>%   # stored values are METs x 10
  # row-wise transformation; no grouping or summarise() needed
  select(Id,date,day,METs,mets)%>%
  mutate(day=factor(day,levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")))

head(met11,3)

In [None]:
#Factor to groups
met22<-met11%>%
  mutate(met_group= case_when(mets<=1.5 ~ "Resting", 
                              mets>=1.6 & mets<=2.9  ~ "Light", 
                              mets>= 3 & mets<= 5.9 ~ "Moderate", 
                              mets>=6 ~ "Vigorous")) %>%
  mutate(met_group= factor(met_group, levels=c("Resting","Light","Moderate","Vigorous")))%>%
  # row-wise transformation; no grouping or summarise() needed
  select(Id,date,day,mets,met_group)

head(met22,3)

In [None]:
#merge met and steps_min
met_steps <- merge(met22,steps_min, by=c('Id','date','day'))
head(met_steps,3)

met_steps1<- met_steps%>%
  select(-ActivityMinute)%>%
  drop_na()

head(met_steps1,3)

In [None]:
#Visualize met and steps # use log function

 mplot<- ggplot(met_steps1,aes(x=day,y=Steps,fill=met_group))+
  geom_violin(trim=FALSE,draw_quantiles = c(0.25, 0.5, 0.75))+
  stat_summary(fun=mean,geom="pointrange",shape=17,color="navyblue")+
  facet_grid(~met_group)+
  labs(title= "Energy Expenditure: MET Vs. Steps taken",subtitle="*Log10 scale on Y-axis",x= " ", y= "Steps taken",fill="METs")+
  expand_limits(y=0)+
  theme(legend.position="top")+
  scale_y_continuous(trans= log10_trans())+
  theme(axis.text.x=element_text(angle=90,vjust=1.0,hjust=1.0))
options(repr.plot.width = 12, repr.plot.height = 10, repr.plot.res = 150)

suppressWarnings(print(mplot))

**From the above graph, we can interpret:**
* Vigorous: 
Most of the step count values are distributed around the sample median, just over 100. Participants engaged in higher-MET activities on Monday, Tuesday, and Thursday. 
Throughout the week the distribution is negatively skewed, which indicates few vigorous-intensity activities.

* Moderate:
Most of the step count values are concentrated around the first quartile on all days of the week. Throughout the week the distribution looks similar and elongated. The mean is slightly lower than the median, which indicates a few low step count values for moderate-intensity activities.  

* Light:
Throughout the week the distribution looks similar. The mean is the same as the median, which indicates a symmetrical distribution of step counts. So few participants were involved in light-intensity activities. The distribution is bimodal on all days. 

* Resting:
The step count values are between 0 and 2; a few participants were mostly sedentary. On Thursday there is a small low-activity distribution. The distribution is bimodal on Tuesday, Wednesday, and weekends.
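
The skew statements above rest on comparing the mean with the median. A rough sketch of that rule of thumb on made-up step counts follows; it is a heuristic, not a formal skewness test.

```r
# Made-up step counts with a long right tail
steps <- c(2, 3, 3, 4, 4, 5, 5, 6, 40, 120)

c(mean = mean(steps), median = median(steps))

# Mean above the median hints at a right (positive) skew,
# mean below the median at a left (negative) skew
skew_hint <- if (mean(steps) > median(steps)) {
  "right-skewed"
} else if (mean(steps) < median(steps)) {
  "left-skewed"
} else {
  "roughly symmetric"
}
skew_hint
```

In the violin plots, the triangle marks the mean and the middle quantile line marks the median, so the same comparison can be read directly off each panel.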

#### **V.2.h. Overall Analysis: Step count Vs. METs Vs. Intensity Levels Vs. Calories** <a id="ovs"></a>

In [None]:
#Merging 4 dataframes: steps_min,int_min,cal_min,met_min

dat_merge <- Reduce(function(...) merge(..., all=TRUE),list(steps_min,int_min,cal_min,met_min))

dat_merge$date <- floor_date(dat_merge$date,"hour")                   

dat_merge1<- dat_merge%>%
  mutate(mets=METs/10)%>%   # stored values are METs x 10
  # row-wise transformation; no grouping or summarise() needed
  select(Id,date,day,Steps,Intensity,Calories,mets)

head(dat_merge1,3)

#Calculate average
dat_merge2<-dat_merge1%>%
  group_by(date)%>%
  summarise(mean_mets= round(mean(mets),2),
            mean_steps= round(mean(Steps),2),
            mean_intensity= round(mean(Intensity),2),
            mean_calories= round(mean(Calories),2),
            .groups='drop')

In [None]:
#visualize
pm1<- ggplot(dat_merge2,aes(x=date,y=mean_steps))+ 
       geom_line(color="darkblue")+
        ylim(0,20)+
      labs(x="", y="Avg. Steps/min")+
      scale_x_datetime(breaks=date_breaks("days"), minor_breaks = date_breaks("hours"),expand=c(0,0))+
      theme_bw()+
      theme(axis.text.x=element_text(angle=90,vjust=1.0,hjust=1.0))
#pm1

pm2<-ggplot(dat_merge2,aes(x=date,y=mean_intensity))+ 
   geom_line(color="blue")+
  labs(y="Avg. Intensity/min")+
  scale_x_datetime(breaks=date_breaks("days"), minor_breaks = date_breaks("hours"),expand=c(0,0))+
  theme_bw()+ 
  theme(axis.text.x = element_blank(),
   axis.title.x = element_blank())
#pm2

pm3<-ggplot(dat_merge2,aes(x=date,y=mean_mets))+ 
   geom_line(color="darkgreen")+
  labs(y="Avg. METs/min")+
  scale_x_datetime(breaks=date_breaks("days"), minor_breaks = date_breaks("hours"),expand=c(0,0))+theme_bw()+
  theme(axis.text.x = element_blank(),
             axis.title.x = element_blank())
#pm3

pm4<-ggplot(dat_merge2,aes(x=date,y=mean_calories))+ 
    geom_line(color="indianred4")+
     labs(y="Avg.Calories/min")+
  scale_x_datetime(breaks=date_breaks("days"), minor_breaks = date_breaks("hours"),expand=c(0,0))+
   theme_bw()+
  theme(axis.text.x = element_blank(),
              axis.title.x = element_blank())
#pm4

rb<- rbind(ggplotGrob(pm4),ggplotGrob(pm2),ggplotGrob(pm3),ggplotGrob(pm1),size="first")
#grid.arrange(ggplotGrob(pm4),ggplotGrob(pm2),ggplotGrob(pm3),ggplotGrob(pm1),ncol=1)
options(repr.plot.width = 12, repr.plot.height = 10, repr.plot.res = 150)
grid.arrange(rb,top="Overall Comparison: Steps taken Vs. METs Vs. Intensity level Vs. Calories burned")


* From the graph, we can interpret that vigorous-intensity exercise with more steps results in significantly higher calorie burn. Participants' calorie burn is higher on 04/27/16 and 05/12/16, which suggests they did vigorous-intensity exercise with a higher step count. On 05/11/16, however, the intensity level is higher with an average step count of 15, and calorie burn is still higher. This suggests a different type of activity in the vigorous METs category.
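
The chart is built from minute-level records averaged per hour. A minimal base-R sketch of that aggregation is shown below on made-up data; it swaps lubridate's `floor_date()` for `trunc()` so that it runs without extra packages.

```r
# Made-up minute-level records for one user
minute_data <- data.frame(
  time  = as.POSIXct("2016-04-12 10:58:00", tz = "UTC") + 60 * (0:3),
  Steps = c(10, 20, 0, 40)
)

# Truncate each timestamp to the start of its hour, then average per hour
minute_data$hour <- as.POSIXct(trunc(minute_data$time, units = "hours"))
hourly <- aggregate(Steps ~ hour, data = minute_data, FUN = mean)
hourly
```

The two minutes before 11:00 and the two after it fall into separate hourly buckets, each with its own mean.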

#### **Sleep Analysis:**
#### **V.2.i. Time Asleep Vs. Time in bed:** <a id="asb"></a>

In [None]:
# Sleep Analysis
# Convert sleep to hours and calculate mean and efficiency

sleep2<-sleep%>%
  mutate(TotalHoursAsleep= round(TotalMinutesAsleep/60,2), TotalHoursInBed= round(TotalTimeInBed/60,2))
#head(sleep2,3)

sleep_df2 <- sleep2 %>%
  group_by(day) %>%
  summarise(mean_TotalHoursAsleep= round(mean(TotalHoursAsleep),2), mean_TotalHoursinBed= round(mean(TotalHoursInBed),2),effeciency=round((mean_TotalHoursAsleep/mean_TotalHoursinBed)*100,2),.groups='drop')%>%
  mutate(day=factor(day,levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")))
#head(sleep_df2,3)  

sleep_df22 <- gather(sleep_df2,s1,m1,mean_TotalHoursAsleep:mean_TotalHoursinBed)
head(sleep_df22,3)

In [None]:
# Time Asleep vs Time in bed #Visualize
Splot1<- ggplot(sleep_df22,aes(day,m1,fill=s1))+
   geom_bar(stat="identity", position=position_dodge2(width=0.5))+
   labs(x="", y="Duration (in Hours)", title="Time Asleep Vs. Time In Bed")+
   #coord_flip()+ 
   theme_economist()+
   theme(axis.title.y=element_text(size=14))+
   theme(axis.title.y=element_text(margin=margin(t=0,r=10,b=0,l=0)))+
   theme(legend.position = "bottom",legend.title= element_blank())+
   scale_fill_manual(values =c("#007AA5","#555555"),
                    breaks=c("mean_TotalHoursAsleep","mean_TotalHoursinBed"), 
                    labels=c("Time Asleep","Time in Bed"))
options(repr.plot.width = 12, repr.plot.height = 10, repr.plot.res = 150)

Splot1

* Average sleep duration among these participants is between 6 and 8 hours. Sleep duration is highest on Sunday, followed by Wednesday. On the remaining days it is between 6 and 7 hours.

#### **Sleep efficiency:**
It is a measure of sleep quality: the percentage of time spent asleep relative to the time spent in bed. A sleep efficiency above 85% is considered good and indicates a well-rested sleeper.
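
As a minimal sketch of this definition, the snippet below computes sleep efficiency for a few made-up sleep records. The column names `TotalMinutesAsleep` and `TotalTimeInBed` follow the sleep dataset used in this notebook; the values, `sleep_demo`, and the `good_sleep` flag are illustrative assumptions.

```r
# Hypothetical sleep records (values are made up for illustration)
sleep_demo <- data.frame(
  TotalMinutesAsleep = c(420, 380, 300),
  TotalTimeInBed     = c(450, 400, 400)
)

# Sleep efficiency = time asleep / time in bed, expressed in percent
sleep_demo$efficiency <- round(
  sleep_demo$TotalMinutesAsleep / sleep_demo$TotalTimeInBed * 100, 2
)

# Flag nights above the 85% threshold mentioned above
sleep_demo$good_sleep <- sleep_demo$efficiency > 85

print(sleep_demo)

# Two ways to summarise several nights:
# the mean of the per-night ratios, and the ratio of the means
mean_of_ratios <- mean(sleep_demo$efficiency)
ratio_of_means <- mean(sleep_demo$TotalMinutesAsleep) /
  mean(sleep_demo$TotalTimeInBed) * 100
```

The weekday summary in this notebook uses the ratio of the means; the two summaries can differ slightly when nights vary in length.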

#### **V.2.j. Average Sleep efficiency by Week day:** <a id="eff"></a>

In [None]:
#Average Sleep efficiency over week day #Visualize

Splot2<- ggplot(sleep_df2,aes(day,effeciency))+
  geom_bar(stat="identity",position=position_dodge2(width=0.2), width= 0.7,fill="steelblue")+
  coord_cartesian(ylim=c(85,95))+
  scale_y_continuous(labels=scales::label_percent(scale=1,accuracy=1))+
  geom_text(aes(label= effeciency),vjust=1.6, color="white", size=3.5)+
  labs(x="", y="Average Sleep Efficiency (in %)", title="Sleep Efficiency by Week Day")+
  theme_economist_white()+
  theme(plot.title = element_text(size=20, hjust=0))+
  theme(axis.title.y=element_text(size=14))+
  theme(axis.title.y=element_text(margin=margin(t=0,r=15,b=0,l=0)))

options(repr.plot.width = 12, repr.plot.height = 10, repr.plot.res = 150)
Splot2

* The average sleep efficiency among the users is above 85% throughout the week, with the highest percentage on Wednesday.

#### **V.2.k. Sleep Efficiency Vs. Workout Intensity:** <a id="iseff"></a>
This section analyzes whether workout intensity has any relationship with sleep efficiency.

In [None]:
# Average sleep efficiency vs intensity

#Merge Sleep and intensity
# Select only the required columns
sleep22<- sleep2 %>%
  select(-c(2:5,7))

sleep_intensity <- merge(sleep22,intensities,by=c('Id','date','day'))

sleep_intensity1<- sleep_intensity %>%
  group_by(day)%>%
  summarise(mean_intensity= round(mean(TotalIntensity),2),
            mean_TotalHoursAsleep= mean(TotalHoursAsleep),
            mean_TotalHoursinBed= mean(TotalHoursInBed),
            effeciency=round((mean_TotalHoursAsleep/mean_TotalHoursinBed)*100,2),
            .groups='drop')%>%
  mutate(day= factor(day,levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")))

head(sleep_intensity1,3)

In [None]:
#correlation between sleep efficiency and intensity
cor.test(sleep_intensity1$effeciency,sleep_intensity1$mean_intensity, method="pearson")

* #### **Correlation plot between sleep efficiency and Intensity**

In [None]:
#correlation plot between sleep efficiency and intensity
pearplot<-ggscatter(sleep_intensity1,x="mean_intensity", y="effeciency", add="reg.line",conf.int = TRUE,
          cor.coef=TRUE, cor.method = "pearson",
          title ="Correlation between Sleep Efficiency and Exercise Intensity",  
          xlab="Avg. Intensity",ylab="Avg. Sleep Efficiency (%)")

options(repr.plot.width = 10, repr.plot.height = 8) 

pearplot

* From the above graph and correlation test, the correlation coefficient is 0.069 (almost zero) with a high p-value of 0.88, so there is no evidence of a relationship between exercise intensity and sleep efficiency among these participants.
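
To make the reading of the test output concrete, here is a small sketch on made-up weekday averages (the numbers are not taken from the dataset, and `demo` is a hypothetical data frame). `cor.test()` returns a list whose `estimate` and `p.value` elements hold the correlation coefficient and the p-value.

```r
# Made-up weekday averages, for illustration only
demo <- data.frame(
  mean_intensity = c(12.1, 12.8, 11.9, 12.5, 12.0, 13.2, 11.5),
  efficiency     = c(91.2, 91.0, 92.3, 91.5, 91.1, 91.8, 91.6)
)

res <- cor.test(demo$efficiency, demo$mean_intensity, method = "pearson")

r <- unname(res$estimate)  # Pearson correlation coefficient
p <- res$p.value           # two-sided p-value

# With only 7 points, a small |r| together with a large p-value means
# the data give no evidence of a linear relationship
cat(sprintf("r = %.3f, p = %.3f, n = %d\n", r, p, nrow(demo)))
```

Because the test runs on only seven weekday means, it has little statistical power, so a non-significant result should be read as "no evidence" rather than as proof of no effect.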

#### **V.2.l. Sleep Cycle:** <a id="scyl"></a> 
Fitbit supports two kinds of sleep log types:
* **Stages:** 'Sleep Stages' levels include deep, light, rem, and wake.
* **Classic:** 'Sleep Pattern' levels include asleep, restless, and awake.

This dataset contains the classic sleep log, which was returned because the devices were not synced with the application.
##### 'Sleep Pattern' levels include: 1 ("asleep"), 2 ("restless"), or 3 ("awake")
[More info.](http://dev.fitbit.com/build/reference/web-api/sleep/get-sleep-log-by-date/)

In [None]:
# Factor sleep minutes into 3 sleep stages
sleep_min1<-sleep_min %>%
  mutate(val =factor(value,levels=1:3,labels=c("Asleep","Restless","Awake")))%>%
  select(-logId)

head(sleep_min1,3)

In [None]:
# Count the minutes each user spent in each sleep level
sleep_min2 <- sleep_min1%>%
  group_by(Id)%>%
  summarize(Asleep1=sum(value== 1, na.rm=TRUE),
            Restless1=sum(value== 2, na.rm=TRUE),
            Awake1=sum(value== 3, na.rm=TRUE), .groups='drop')

head(sleep_min2,4)

In [None]:
sleep21<- sleep_min2 %>%
  summarize(Asleep=round(mean(Asleep1),2), Awake=round(mean(Awake1),2),Restless=round(mean(Restless1),2), .groups = 'drop')%>%
  pivot_longer(cols=1:3,names_to ="stages",values_to = "Minutes")

head(sleep21)

In [None]:
sleeptest<- sleep21%>%
   # group_by(Stages)%>%
    mutate(percentage=round(Minutes/sum(Minutes),4) * 100,
           lab.pos = cumsum(percentage)- 0.5*percentage)%>%
    mutate( Stages = paste0(stages, " (", percentage,"%", ")"))
 
head(sleeptest)

In [None]:
#visualize
scplot<- ggplot(data = sleeptest,aes(x =2, y = percentage, fill = Stages)) +
  geom_col(color = "black",width=1) +
  geom_text(aes(label = percentage),position = position_stack(vjust = 0.6),color="White")+ #x=1.3,
  coord_polar(theta = "y") +
  xlim(c(0.2,2.5))+
  labs(title= "Duration of Sleep Stages", subtitle="Average time spent in each sleep stage")+
  annotate(geom="text", x =0.5, y = 0, label = "\nSleep\nPattern",size=12, color="White" )+ #face="bold"
  scale_fill_manual(values = c( "skyblue","tomato3","#E5DE44"))+
        theme(panel.background = element_rect(fill = "steelblue4", color="steelblue4"),
        plot.background =  element_rect(fill = "steelblue4",color="steelblue4"),
        plot.title = element_text(color = "White", size = 18, face = "bold"),   #,hjust = 0.5
        plot.subtitle = element_text(color = "white",size=12),    #,hjust = 0.5
        panel.grid = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank(),
        legend.title = element_text(face="bold"),
        legend.text = element_text(face="bold"))
options(repr.plot.width = 12, repr.plot.height = 10, repr.plot.res = 150)

scplot

* The graph shows that, overall, participants spent 91.48% of their logged sleep time in the asleep stage.
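
The share shown in the chart can be cross-checked with a few lines of base R. The sketch below uses a made-up vector of minute-level codes (1 = asleep, 2 = restless, 3 = awake, as described above); `codes`, `levels_lab`, and `share` are illustrative names.

```r
# Made-up minute-level sleep codes: 1 = asleep, 2 = restless, 3 = awake
codes <- c(1, 1, 1, 1, 1, 1, 1, 2, 2, 3)

# Label the codes, then convert the counts to percentages
levels_lab <- factor(codes, levels = 1:3,
                     labels = c("Asleep", "Restless", "Awake"))
share <- round(prop.table(table(levels_lab)) * 100, 2)

print(share)
```

Note that the notebook averages the per-user minute counts before taking percentages, whereas this sketch pools all minutes; the idea is the same.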

#### **V.2.m. Resting Heart rate comparison among users:** <a id="hr"></a>

Resting heart rate typically ranges from 60-100 bpm, but this range can vary based on age and fitness level. Resting heart rate can be an important indicator of your fitness level and overall cardiovascular health. This metric is the number of times your heart beats per minute when you are still and well-rested. In general, active people often have a lower resting heart rate.

**Reason for fewer users:** A sleep stage log is required to generate this value. When a classic sleep log is recorded, this value will be missing. [More info](http://dev.fitbit.com/build/reference/web-api/heartrate-timeseries/get-heartrate-timeseries-by-date/)

In [None]:
# Heart rate: calculate a resting heart rate value for the day (median per user)
heartrate <- heart_rate%>%
  group_by(Id,Time)%>%
  summarize(value = median(Value), .groups = 'drop')

head(heartrate,3)

In [None]:
HRplot<- ggplot(heartrate, aes(x=Time, y=value)) +
  geom_line(color="firebrick3") + geom_point(color="firebrick1")+
  facet_wrap(~Id,ncol=4)+
  labs(x="", y="Avg. Heart Rate (BPM)", title="Average Heart Rate",subtitle="Comparing heart rate among individuals")+
  # Apply the complete theme first; otherwise it would overwrite the custom settings below
  theme_bw()+
  theme( plot.title = element_text(face = "bold", size = 12),
        legend.position = "none")
options(repr.plot.width = 12, repr.plot.height = 10, repr.plot.res = 150)

HRplot

* Some users have an average heart rate above 100 bpm, which raises concern.
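
One way to see which users drive this observation is to summarise each user's values and compare them against the 100 bpm upper bound quoted above. The sketch below uses made-up readings; the column names `Id` and `value` mirror the `heartrate` data frame, while `hr_demo` and `per_user` are hypothetical.

```r
# Made-up heart rate summaries (bpm) for three hypothetical users
hr_demo <- data.frame(
  Id    = c("A", "A", "B", "B", "C", "C"),
  value = c(72, 78, 101, 107, 88, 96)
)

# Average value per user
per_user <- aggregate(value ~ Id, data = hr_demo, FUN = mean)

# Flag users whose average is above the usual 60-100 bpm resting range
per_user$above_100 <- per_user$value > 100

print(per_user)
```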

### **V.3. Share Key Findings:** <a id="share"></a>
#### **V.3.a. Results:** <a id="result"></a>

* Most of the participants have light-activity to sedentary workout habits. Bellabeat can address this by sending reminders that motivate users to exercise.
* More calories are burned at high workout intensity levels and with increased step count and/or distance. Exercise with higher METs increases calorie burn.
* The participants get more than 6 hours of sleep on average, with a sleep efficiency above 85%, and spend around 92% of their sleep duration in the asleep stage. Workout intensity has no measurable impact on their sleep efficiency score.
* Some users have an average resting heart rate above 100 bpm, which raises concern. This may be due to stress, fever, medication, alcohol, or caffeine intake. Exercise and meditation can be recommended.

#### **V.3.b. Limitations:** <a id="limit"></a>

* The sample size is small (33 users), and the data was collected over a short period of time.
* Data was collected online via a third party.
* Relevant information such as gender, age, and wellness data is not included. Weight data is too incomplete for analysis.
* Some inconsistencies in data logging suggest that devices may not have been synced with the application properly.
* Heart rate variation is not synced with the sleep data; having both together could help in understanding the sleep cycle, including REM, better.
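
The sample-size limitation can be checked directly by counting distinct users in each table. A minimal sketch on made-up data follows; `activity_demo` and `sleep_demo_ids` are hypothetical stand-ins for the real tables, which share the `Id` column.

```r
# Hypothetical stand-ins for two of the tables; only the Id column matters here
activity_demo  <- data.frame(Id = c(1, 1, 2, 3, 3, 4))
sleep_demo_ids <- data.frame(Id = c(1, 3, 3))

# Number of distinct users in each table
n_users <- c(
  activity = length(unique(activity_demo$Id)),
  sleep    = length(unique(sleep_demo_ids$Id))
)
print(n_users)

# Users present in the activity table but missing from the sleep table
missing_sleep <- setdiff(activity_demo$Id, sleep_demo_ids$Id)
print(missing_sleep)
```

Running the same counts on the real tables would show how many users contribute to each part of the analysis.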

## **VI Act Phase** <a id="act"></a>

### **VI.1. Recommendations:** <a id="recommend"></a>

* A larger sample size would improve the accuracy of the analysis.
* Data can be collected from internal or primary/secondary data sources to increase the credibility and reliability of the data.
* A dataset representing the female population, with health statistics, would be more relevant for analyzing women-centric Bellabeat products.
* Users may have concerns about sharing personal information such as menstrual cycle, age, or diet. Drafting a clear and comprehensive privacy policy agreement that tells users why information is collected and how it is handled builds trust, increases brand recognition, and improves product sales.
* Providing users with options to choose from different activities such as cardio, strength training, meditation, and breathing exercises, or to set a personal goal, will encourage users to get moving.
* The application can be designed to remind users who have been inactive for a long time, send notifications for breathing exercises or water intake, and send rewards for goal completion, which would help motivate users. Providing options to set up SOS emergency alerts would also be a great addition to the product.



## VII References <a id="ref"></a>
* Fitbit data info: https://www.fitabase.com/media/1930/fitabasedatadictionary102320.pdf
* Charts, themes, colors: https://r-charts.com/
* Code ref: https://stackoverflow.com/
* Thanks to the following authors; their notebooks helped me a lot with this project:
    1. https://www.kaggle.com/chebotinaa/bellabeat-case-study-with-r
    2. https://www.kaggle.com/mimosabella/fitness-tracker-a-usage-trends-analysis-with-r/notebook
