## Introduction

[Bellabeat](https://bellabeat.com/) is a high-tech manufacturer of health-focused smart products for women. Bellabeat’s app and smart devices collect data on activity, sleep, stress, hydration levels, and reproductive health to give women a better understanding of their own health and habits. The company was founded in 2013 by Urška Sršen and Sandro Mur and has expanded quickly since, with the potential to become a larger player in the global smart device market.

Bellabeat’s product line consists of the following:

* **Bellabeat app**: gives users insight into their health by providing data on their activity, sleep, stress, menstrual cycle, and mindfulness habits. The app also connects to the company’s line of smart devices.
* **Leaf**: Bellabeat’s classic wellness tracker, which can be worn as a bracelet, necklace, or clip. Leaf tracks the user’s activity, sleep, and stress and connects to the Bellabeat app.
* **Time**: a wellness smart watch that also tracks the user’s activity, sleep, and stress and connects to the Bellabeat app.
* **Spring**: a smart water bottle that tracks the user’s daily water intake to help maintain proper hydration throughout the day. Spring also connects to the Bellabeat app to track this data.
* **Bellabeat membership**: a subscription-based program that gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health, beauty, and mindfulness based on their lifestyle and goals.

## Business Task
Analyze smart device usage data to gain insight into how consumers use non-Bellabeat devices, then use those insights to guide Bellabeat's marketing strategy and identify opportunities for growth.

In-Scope:   

  1. Summary of business task    
  2. Description of data sources used   
  3. Documentation of data cleaning and manipulation processes   
  4. Summary of analysis   
  5. Relevant data visualization and key findings    
  6. High-level recommendations    

Out-of-Scope:
  
  1. Predictive analysis    
  2. Machine learning algorithm    
 
## Prepare
The data for this analysis comes from a dataset collected from Fitbit fitness trackers. The dataset consists of 18 comma-separated value (CSV) files for 30 Fitbit users. It tracks user activity between 03/12/2016 and 05/12/2016, including minute-level output for physical activity, heart rate, and sleep monitoring.
The data has critical limitations: the sample is small, and key demographic information about the participants, such as gender, age, and location, is missing.

This analysis uses the data sets for daily activity, daily calories, daily intensities, daily steps, heart rate by seconds, minute METs, daily sleep, and weight log information.

The analysis is conducted in RStudio.

In [16]:
# Load packages (dplyr is already attached by tidyverse, so the second call is redundant but harmless)
library("tidyverse")
library("dplyr")

The data sets were first opened in Microsoft Excel to inspect the column formats. Columns containing dates were changed to the "date" format where applicable. The selected files were then imported into RStudio as data frames.

In [17]:
# Import data sets
daily_activity <- read_csv("../input/bellabeat-fitbit-datasets/dailyActivity_merged.csv",show_col_types = FALSE)
daily_calories <- read_csv("../input/bellabeat-fitbit-datasets/dailyCalories_merged.csv",show_col_types = FALSE)
daily_intensities <- read_csv("../input/bellabeat-fitbit-datasets/dailyIntensities_merged.csv",show_col_types = FALSE)
daily_steps <- read_csv("../input/bellabeat-fitbit-datasets/dailySteps_merged.csv",show_col_types = FALSE)
heart_rate_sec <- read_csv("../input/bellabeat-fitbit-datasets/heartrate_seconds_merged.csv",show_col_types = FALSE)
minute_METs <- read_csv("../input/bellabeat-fitbit-datasets/minuteMETsNarrow_merged.csv",show_col_types = FALSE)
sleep_day <- read_csv("../input/bellabeat-fitbit-datasets/sleepDay_merged.csv",show_col_types = FALSE)
weight_log <- read_csv("../input/bellabeat-fitbit-datasets/weightLogInfo_merged.csv",show_col_types = FALSE)
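As an alternative to reformatting date columns in Excel, the dates can be parsed in R after import, which leaves the raw CSV files untouched. The sketch below is a minimal illustration in base R. The toy data frame and its values are assumptions made for the example, not rows from the Fitbit files.

```r
# Toy stand-in for an imported table; the values are illustrative only
toy_activity <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "5/1/2016"),
  stringsAsFactors = FALSE
)

# Parse month/day/year text into Date objects
toy_activity$ActivityDate <- as.Date(toy_activity$ActivityDate, format = "%m/%d/%Y")

# Weekday index from base R (0 = Sunday, ..., 6 = Saturday), independent of locale
toy_activity$wday_index <- as.POSIXlt(toy_activity$ActivityDate)$wday

str(toy_activity)
```

Parsing in R keeps every transformation visible in the notebook, so the cleaning steps can be reproduced from the original files.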

## Process
Once the data frames were imported into RStudio, the head() and colnames() functions were used to view them and confirm that they were imported correctly.

In [18]:
# View data frames
head(daily_activity)
colnames(daily_activity)

head(daily_calories)
colnames(daily_calories)

head(daily_intensities)
colnames(daily_intensities)

head(daily_steps)
colnames(daily_steps)

head(heart_rate_sec)
colnames(heart_rate_sec)

head(minute_METs)
colnames(minute_METs)

head(sleep_day)
colnames(sleep_day)

head(weight_log)
colnames(weight_log)
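Beyond viewing the first rows and column names, a few basic counts can help confirm that an import is usable. The sketch below is one possible approach: `check_import` is a hypothetical helper written for this example, and the toy data frame is illustrative rather than taken from the Fitbit files.

```r
# Hypothetical helper: summarise basic import checks for one data frame
check_import <- function(df) {
  data.frame(
    rows = nrow(df),
    cols = ncol(df),
    missing_values = sum(is.na(df)),
    duplicate_rows = sum(duplicated(df))
  )
}

# Toy data with one missing value and one duplicated row (illustrative only)
toy_sleep <- data.frame(
  Id = c(1, 1, 2, 2),
  TotalMinutesAsleep = c(400, 400, NA, 380)
)

toy_checks <- check_import(toy_sleep)
print(toy_checks)
```

The same helper could be applied to each imported data frame, for example `check_import(sleep_day)`.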

## Analyze
#### Check daily calories, intensities, and steps contained in daily_activity
The "Id" column is shared by all selected data sets and, together with the date, identifies each daily record. This makes it possible to merge tables from different data sets. On close inspection, the daily_activity data set already contains data on activity intensity, calories, and daily steps, which means that merging data sets on "Id" may not be necessary. However, it is important to check that the intensity, calories, and steps data in daily_activity match the respective data frames for each data type.

The sqldf package in R is loaded to run this comparison.

In [19]:
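# Check data for consistency
library(sqldf)

# Keep only the columns that also appear in daily_calories
daily_activity_2 <- daily_activity %>%
  select(Id, ActivityDate, Calories)
head(daily_activity_2)

# Compare calories data in both data frames (INTERSECT matches columns by position)
sql_check <- sqldf('SELECT * FROM daily_activity_2 INTERSECT SELECT * FROM daily_calories')
head(sql_check)

# With no duplicate rows, the data frames agree if all three counts match
nrow(sql_check)
nrow(daily_activity_2)
nrow(daily_calories)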

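The intersection returns the same rows as both inputs, so the calories data in daily_activity matches daily_calories. Only calories are compared in this cell; the same comparison applies to daily_intensities and daily_steps, and it is the basis for using the daily_activity data frame alone for intensities, calories, and steps.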
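The same kind of check can be written with dplyr instead of SQL. The sketch below uses `anti_join()` on toy data to list rows in a steps table that have no exact match in the activity table. The column names `ActivityDay` and `StepTotal` are assumed to match the daily steps file, and all values are illustrative.

```r
library(dplyr)

# Toy stand-ins for daily_activity and daily_steps (illustrative values only)
toy_activity <- tibble(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(5000, 7000, 9000),
  Calories = c(1800, 2000, 2300)
)
toy_steps <- tibble(
  Id = c(1, 1, 2),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016"),
  StepTotal = c(5000, 7000, 9000)
)

# Rows of toy_steps with no exact match in toy_activity; zero rows means consistent
unmatched <- anti_join(
  toy_steps,
  toy_activity,
  by = c("Id", "ActivityDay" = "ActivityDate", "StepTotal" = "TotalSteps")
)

nrow(unmatched)
```

Unlike inspecting the first rows of an intersection, an empty anti-join result covers every row of the smaller table.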

In [20]:
# Check for unique values in data sets
n_distinct(daily_activity$Id)
nrow(daily_activity)

n_distinct(heart_rate_sec$Id)
nrow(heart_rate_sec)

n_distinct(minute_METs$Id)
nrow(minute_METs)

n_distinct(sleep_day$Id)
nrow(sleep_day)

n_distinct(weight_log$Id)
nrow(weight_log)

Further high-level exploration of the selected data sets shows that:
* The *daily_activity* data set contains **33** distinct user Ids and **940** observations
* The *heart_rate_sec* data set contains **7** distinct user Ids and **1,048,575** observations
* The *minute_METs* data set contains **27** distinct user Ids and **1,048,575** observations
* The *sleep_day* data set contains **24** distinct user Ids and **413** observations
* The *weight_log* data set contains **8** distinct user Ids and **67** observations

The *heart_rate_sec* and *weight_log* data sets cover very few distinct users (7 and 8, respectively), so they are unlikely to support reliable recommendations. Note also that *daily_activity* contains 33 distinct Ids, although the dataset is described as covering 30 users.
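The observation counts above are worth checking against Excel's worksheet limit of 1,048,576 rows, since the files were opened in Excel before import. The sketch below uses only the counts reported above. A match does not prove truncation, but it suggests that a file may have been cut off when it was saved from Excel.

```r
# Excel worksheets hold at most 1,048,576 rows, including the header row
excel_max_rows <- 1048576

# Observation counts reported above
reported_rows <- c(
  daily_activity = 940,
  heart_rate_sec = 1048575,
  minute_METs = 1048575,
  sleep_day = 413,
  weight_log = 67
)

# Flag tables whose data rows plus one header row exactly fill a worksheet
possibly_truncated <- names(reported_rows)[reported_rows + 1 == excel_max_rows]
print(possibly_truncated)
```

If a table is flagged, re-importing the original CSV directly with `read_csv()` and comparing `nrow()` would confirm whether rows were lost.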

#### Overview of summary of data sets

In [21]:
# Summary data
daily_activity %>% 
  select(TotalSteps, TotalDistance, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes, Calories) %>% 
  summary()

heart_rate_sec %>% 
  select(Value) %>% 
  summary()

minute_METs %>% 
  select(METs) %>% 
  summary()

sleep_day %>% 
  select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>% 
  summary()

weight_log %>% 
  select(WeightKg, BMI) %>% 
  summary()

A few insights can be obtained from the summary statistics above:
* The average user takes about **7,638** steps daily
* On average, a user's day is split across activity levels as follows:
    * 21.16 minutes of very active (i.e., vigorous) activity
    * 13.36 minutes of fairly active activity
    * 192.80 minutes of lightly active activity, and
    * 991.20 minutes **(~16.5 hours)** of sedentary activity
* On average, a user burns 2,304 calories daily
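* The average minute-level MET value is 14.47. 
  A MET is the ratio of working metabolic rate to resting metabolic rate, where metabolic rate is the energy expended per unit of time.<sup>1,2</sup> Taken at face value, a MET value of 14.47 is very high for users who spend approximately 70% of the day on sedentary activities, so the recorded values are either inaccurate or stored on a different scale. One possible explanation, which should be confirmed against the dataset's data dictionary, is that the exported MET values are scaled (for example, multiplied by 10), in which case the average would be about 1.45 METs.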
* For sleep records, a user on average spends 458.6 minutes **(~7.6 hours)** daily in bed and 419.5 minutes **(~7 hours)** asleep. Here, it can be observed that the average user spends about 39.1 minutes awake in bed. This may include time spent in bed before falling asleep, time spent awake after sleeping before getting out of bed, or sleep interruptions. 
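The unit conversions in the list above can be reproduced directly from the reported means. This is a small worked example using only figures from the summary statistics; the variable names are chosen for illustration.

```r
# Means reported in the summary statistics above (minutes per day)
sedentary_min <- 991.20
time_in_bed_min <- 458.6
asleep_min <- 419.5

# Convert minutes to hours
sedentary_hours <- sedentary_min / 60      # about 16.5 hours
time_in_bed_hours <- time_in_bed_min / 60  # about 7.6 hours
asleep_hours <- asleep_min / 60            # about 7 hours

# Share of a 24-hour day spent sedentary (about 0.69)
sedentary_share_of_day <- sedentary_min / (24 * 60)

# Average time awake in bed (about 39.1 minutes)
awake_in_bed_min <- time_in_bed_min - asleep_min

round(c(sedentary_hours, time_in_bed_hours, asleep_hours,
        sedentary_share_of_day, awake_in_bed_min), 2)
```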

#### Analyze activity by day of the week
##### Create a new column in new daily_activity data frame for day of the week

In [22]:
# Analyze activity by day of the week
library(lubridate)
daily_activity_3 <- daily_activity

daily_activity_3$ActivityDate <- parse_date_time(daily_activity_3$ActivityDate, '%m/%d/%Y')
daily_activity_3$day_of_week <- format(as.Date(daily_activity_3$ActivityDate), "%A")
daily_activity_3$day_of_week <- factor(daily_activity_3$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

# To create a pie chart of daily activities
daily_activity_4 <- daily_activity %>%
  select(Id, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes)

# Add a column for Total Minutes
daily_activity_4$TotalMinutes <- (daily_activity_4$VeryActiveMinutes) + (daily_activity_4$FairlyActiveMinutes) +
  (daily_activity_4$LightlyActiveMinutes) + (daily_activity_4$SedentaryMinutes)
head(daily_activity_4)

# Calculate averages (means) and express as percentages of average Total Minutes
mean_TotalMinutes <- mean(daily_activity_4$TotalMinutes)
VeryActive_pct <- mean(daily_activity_4$VeryActiveMinutes) / mean_TotalMinutes * 100
FairlyActive_pct <- mean(daily_activity_4$FairlyActiveMinutes) / mean_TotalMinutes * 100
LightlyActive_pct <- mean(daily_activity_4$LightlyActiveMinutes) / mean_TotalMinutes * 100
Sedentary_pct <- mean(daily_activity_4$SedentaryMinutes) / mean_TotalMinutes * 100

# Create data frame with ActivityLevel and corresponding Percentages
daily_activity_pie <- data.frame(
  ActivityLevel = c("Very Active Minutes", "Fairly Active Minutes", "Lightly Active Minutes", "Sedentary Minutes"),
  Percentage = c(VeryActive_pct, FairlyActive_pct, LightlyActive_pct, Sedentary_pct))
head(daily_activity_pie)

## Share
#### Create data visualizations

In [23]:
# Figure 1
ggplot(data=daily_activity)+
  geom_point(mapping=aes(x=VeryActiveMinutes, y=Calories)) +
  geom_smooth(mapping=aes(x=VeryActiveMinutes, y=Calories), method=lm) +
  labs(title="Relationship Between Vigorous Activity and Calories Burned")

Figure 1 above shows the relationship between minutes spent doing vigorous activities (i.e., very active minutes) and calories burned. We see a positive correlation between these two variables. The more active the average user is, the more calories are burned.

In [24]:
# Figure 2
ggplot(data=daily_activity)+
  geom_point(mapping=aes(x=TotalSteps, y=Calories), color="Purple") +
  geom_smooth(mapping=aes(x=TotalSteps, y=Calories), method=lm) +
  labs(title="Relationship Between Total Daily Steps and Calories Burned")

Similarly, a positive relationship exists between daily calories burned and total steps taken, as shown in Figure 2 above.

In [25]:
# Figure 3
ggplot(data=daily_activity)+
  geom_point(mapping=aes(x=TotalDistance, y=Calories), color="Brown") +
  geom_smooth(mapping=aes(x=TotalDistance, y=Calories), method = lm) +
  labs(title="Relationship Between Total Distance and Calories Burned")

When total distance is plotted against calories burned as shown in Figure 3 above, we observe a relationship similar to that of Figure 2. The higher the total distance covered daily by the user, the higher the calories burned.
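The positive relationships seen in the scatter plots can also be quantified. The sketch below computes a Pearson correlation and the slope of a fitted line with base R. The toy values are illustrative only; with the real data, `toy` would be replaced by `daily_activity`.

```r
# Toy data (illustrative values only, not taken from the Fitbit files)
toy <- data.frame(
  TotalSteps = c(2000, 5000, 8000, 11000, 14000),
  Calories   = c(1700, 2000, 2250, 2600, 2900)
)

# Pearson correlation: values near +1 indicate a strong positive linear relationship
r <- cor(toy$TotalSteps, toy$Calories)

# Slope of the fitted line: estimated extra calories per additional step
fit <- lm(Calories ~ TotalSteps, data = toy)
slope <- unname(coef(fit)["TotalSteps"])

print(c(correlation = r, slope = slope))
```

A correlation coefficient gives a single number to report alongside each figure, which makes the strength of the three relationships easier to compare.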

In [26]:
# Figure 4
ggplot(data=sleep_day)+
  geom_point(mapping=aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) +
  geom_smooth(mapping=aes(x=TotalMinutesAsleep, y=TotalTimeInBed), method = lm) +
  labs(title="Relationship Between Total Minutes Asleep and Total Time in Bed")

In Figure 4 above, we see a close relationship between total time in bed and total time asleep. As suggested by the earlier summary statistics, we can infer that users spend most of their time in bed asleep.
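One way to express this relationship per record is sleep efficiency, the ratio of time asleep to time in bed. The sketch below is a minimal example on toy values. The column `SleepEfficiency` is introduced here for illustration and is not part of the original data.

```r
# Toy sleep records (illustrative values only)
toy_sleep <- data.frame(
  TotalMinutesAsleep = c(420, 380, 450),
  TotalTimeInBed = c(460, 400, 500)
)

# Sleep efficiency: fraction of time in bed that is spent asleep
toy_sleep$SleepEfficiency <- toy_sleep$TotalMinutesAsleep / toy_sleep$TotalTimeInBed

print(toy_sleep)
```

Values close to 1 correspond to points near the diagonal in Figure 4.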

In [27]:
# Figure 5
ggplot(data=daily_activity)+
  geom_point(mapping=aes(x=SedentaryMinutes, y=Calories), color="Magenta") +
  geom_smooth(mapping=aes(x=SedentaryMinutes, y=Calories), method = lm) +
  labs(title="Relationship Between Sedentary Minutes and Calories Burned")

In Figure 5, we see a negative relationship between calories burned and time spent doing sedentary activities.

In [28]:
# Figure 6: Bar chart showing activity by day of the week
library(RColorBrewer)
ggplot(data=daily_activity_3) +
  geom_bar(mapping=aes(x=day_of_week, fill=day_of_week)) +
  labs(title="Number of Daily Activity Records for Each Day of the Week") +
  scale_fill_brewer(palette="Set1")

To better understand how active the sampled users are, we explore activity records by day of the week. The bar chart in Figure 6 above counts the daily activity records logged on each weekday, so it shows how often the devices were used rather than how intense the activity was. Records are most frequent in the middle of the week and least frequent on weekends. This may be explained by the lifestyle of the users sampled. If data on user demographics is made available, it may help explain the distribution of activity during the week.
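Because a bar chart of row counts reflects how many records were logged, a per-weekday average can complement it by showing how much activity was recorded. The sketch below groups toy data by weekday and reports both measures. The values are illustrative, and the weekday index comes from base R so the result does not depend on locale.

```r
library(dplyr)

# Toy daily records (illustrative values only)
toy_activity <- tibble(
  ActivityDate = as.Date(c("2016-04-12", "2016-04-16", "2016-04-19", "2016-04-23")),
  TotalSteps = c(9000, 6000, 8000, 5000)
)

# Weekday index: 0 = Sunday, ..., 6 = Saturday
weekday_summary <- toy_activity %>%
  mutate(wday_index = as.POSIXlt(ActivityDate)$wday) %>%
  group_by(wday_index) %>%
  summarise(records = n(), mean_steps = mean(TotalSteps), .groups = "drop")

print(weekday_summary)
```

With the real data, the same pipeline could be applied to `daily_activity_3`, grouping by `day_of_week` instead of the numeric index.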

In [29]:
# Figure 7: Pie chart showing mean activity breakdown
ggplot(data=daily_activity_pie, aes(x="", y=Percentage, fill=ActivityLevel))+
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=45) +
  geom_text(aes(label = paste0(round(Percentage),"%")), position=position_stack(vjust = 0.5)) +
  scale_fill_brewer(palette="Dark2") +
  labs(x = NULL, y = NULL, fill = NULL, title = "Average Percentage of Activity Levels Per Day") +
  theme_classic() + theme(axis.line = element_blank(),
                          axis.text = element_blank(),
                          axis.ticks = element_blank(),
                          plot.title = element_text(hjust = 0.5, color = "Black"))

In Figure 7, the proportion of tracked time is shown by level of activity. The average user spends by far the largest share of total tracked time on sedentary activities.
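The percentages in the pie chart can be cross-checked from the mean minutes reported in the summary statistics. The sketch below uses those four means. Note that they sum to about 20.3 hours rather than 24, so the shares are percentages of tracked minutes, not of the full day.

```r
# Mean minutes per day at each activity level, from the summary statistics
mean_minutes <- c(
  "Very Active" = 21.16,
  "Fairly Active" = 13.36,
  "Lightly Active" = 192.80,
  "Sedentary" = 991.20
)

# Total tracked time in hours
tracked_hours <- sum(mean_minutes) / 60

# Each level as a percentage of total tracked minutes
share_pct <- mean_minutes / sum(mean_minutes) * 100

round(share_pct, 1)
```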

## Act
### Conclusions
Based on the analysis conducted:
1. There is a clear positive relationship between calories burned and active time.
2. The more time a user spends on sedentary activities, the fewer calories the user burns on a daily basis.
3. The average user logs the most activity mid-week and the least on weekends.
4. On average, a user spends about 81% of tracked time on sedentary activities.
5. For the average user, total time spent in bed is close to total time asleep, suggesting that little time is spent awake in bed.

### Recommendations for Bellabeat 
Key findings from the analysis can guide business decisions that expand opportunities for Bellabeat. Some recommendations for the Bellabeat app and products are as follows:
1. Recommend a minimum of 10,000 steps per day for users. This can increase the proportion of tracked time spent on non-sedentary activities.
2. Enable periodic reminders (e.g., push notifications) for users to take a minimum number of steps every hour.
3. Provide options for tracking other health-related data like weight. 
4. Provide in-app wellness tips and suggestions.
5. Create weekend fitness challenges to encourage users to be more active on weekends.


#### References:
1. Jetté, M., Sidney, K., & Blümchen, G. (1990). Metabolic equivalents (METS) in exercise testing, exercise prescription, and evaluation of functional capacity. Clinical Cardiology, 13(8), 555-565. Retrieved January 1, 2022, from https://pubmed.ncbi.nlm.nih.gov/2204507/ 
2. Roland, J. (2019, October 21). What Are METs, and How Are They Calculated? Retrieved from https://www.healthline.com/health/what-are-mets