# Bellabeat Case Study
## About the company
Bellabeat is a high-tech company founded by Urška Sršen and Sando Mur that manufactures health-focused smart products. Since its founding in 2013, Bellabeat has grown rapidly through its extensive focus on digital marketing and has quickly positioned itself as a tech-driven wellness company for women.

## Initial questions
1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers? 
3. How could these trends help influence Bellabeat marketing strategy? 

## Business task
Provide a thorough analysis of current trends in smart device usage, and deliver business solutions and recommendations on how these trends can positively impact the company's growth.

## Data sources
The data used for this analysis covers the personal fitness data of thirty FitBit users (the validation step below finds 33 distinct user Ids in the daily activity table). The dataset can be found [here](https://www.kaggle.com/datasets/arashnic/fitbit).

## Loading packages and importing datasets

In [None]:
library(tidyverse)
library(tidyr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(janitor)
library(ggpubr)
library(chron)

daily_activity <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
heartrate_seconds <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv')
sleep_day <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')
weight_log_info <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv')
hourly_calories <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv')

## Data validation

In [None]:
head(daily_activity)
head(heartrate_seconds)
head(sleep_day)
head(weight_log_info)
head(hourly_calories)

In [None]:
n_distinct(daily_activity$Id)
n_distinct(heartrate_seconds$Id)
n_distinct(sleep_day$Id)
n_distinct(weight_log_info$Id)
n_distinct(hourly_calories$Id)

Things to note:
- The *weightLogInfo_merged* dataset has records for only 8 of the 33 users in the *dailyActivity_merged* dataset (about 24%), so it represents the user population poorly.
    - Any analysis of this subset may not have enough statistical power, which raises the risk of a Type II error.
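
As a rough illustration of how the coverage figure above can be computed, the sketch below divides the number of distinct Ids in one table by the number in another. The two small data frames are hypothetical stand-ins, not the Fitbit tables; with the real data the same ratio is 8 / 33, or about 24%.

```r
library(dplyr)

# Toy stand-ins for the real tables (hypothetical Ids, not the Fitbit data)
activity_ids <- data.frame(Id = c(1, 1, 2, 3, 4))
weight_ids   <- data.frame(Id = c(1, 1, 3))

# Share of tracked users who also appear in the weight log
coverage <- n_distinct(weight_ids$Id) / n_distinct(activity_ids$Id)

# Express the share as a percentage
round(coverage * 100, 1)
```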

I also noticed that some date columns were stored as character strings, so I converted them to date format using the **lubridate** package.

In [None]:
# The function name encodes the expected field order, so no format argument is needed
daily_activity$ActivityDate <- mdy(daily_activity$ActivityDate)
heartrate_seconds$Time <- mdy_hms(heartrate_seconds$Time)
sleep_day$SleepDay <- mdy_hms(sleep_day$SleepDay)

str(daily_activity)
str(heartrate_seconds)
str(sleep_day)
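
To make the conversion step more concrete, here is a minimal sketch of what `mdy()` and `mdy_hms()` return for single strings. The sample values are illustrative and only assumed to resemble the raw CSV format.

```r
library(lubridate)

# Sample strings in the two formats (illustrative values, assumed to match the CSVs)
day_only  <- "4/12/2016"
timestamp <- "4/12/2016 7:21:00 AM"

parsed_day  <- mdy(day_only)       # returns a Date
parsed_time <- mdy_hms(timestamp)  # returns a POSIXct in UTC by default

class(parsed_day)
class(parsed_time)
hour(parsed_time)
```

Note that one call produces a `Date` and the other a `POSIXct`, which is worth keeping in mind when the columns are later used as join keys.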

## Data preparation

In [None]:
# Renaming columns
daily_activity <- daily_activity %>%
    clean_names() %>%
    rename_with(tolower) %>%
    rename(date = activity_date)

heartrate_seconds <- heartrate_seconds %>%
    clean_names() %>%
    rename_with(tolower)

sleep_day <- sleep_day %>%
    clean_names() %>%
    rename_with(tolower) %>%
    rename(date = sleep_day)

hourly_calories <- hourly_calories %>%
    clean_names() %>%
    rename_with(tolower) %>%
    rename(date_time = activity_hour)

In [None]:
# Getting statistical summaries of each table
summary(daily_activity)
summary(heartrate_seconds)
summary(sleep_day)
summary(hourly_calories)

In [None]:
# Removing duplicates and N/A
daily_activity <- daily_activity %>%
    distinct() %>%
    drop_na()

heartrate_seconds <- heartrate_seconds %>%
    distinct() %>%
    drop_na()

sleep_day <- sleep_day %>%
    distinct() %>%
    drop_na()
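
A small sketch of what the two cleaning verbs do, using a made-up table rather than the Fitbit data: `distinct()` removes fully repeated rows, and `drop_na()` then removes any row that still has a missing value.

```r
library(dplyr)
library(tidyr)

# Made-up table with one repeated row and one missing value
toy <- data.frame(
    id = c(1, 1, 2, 3),
    total_minutes_asleep = c(400, 400, NA, 510)
)

deduped <- distinct(toy)     # drops the repeated (1, 400) row
cleaned <- drop_na(deduped)  # drops the row with a missing value

# Row counts before and after each step
c(nrow(toy), nrow(deduped), nrow(cleaned))
```

Comparing row counts before and after each step is a quick way to see how much data the cleaning removes.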

## Merging data

In [None]:
merged_activity_sleep <- merge(daily_activity, sleep_day, by = c('id', 'date'))
head(merged_activity_sleep)
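
By default `merge()` performs an inner join, so only id/date pairs present in both tables are kept. The sketch below uses made-up tables with the same key columns to show the difference between the default and `all.x = TRUE`. It is an illustration of the join behaviour, not a change to the analysis.

```r
# Toy tables with the same key columns as the cleaned data (values are made up)
activity <- data.frame(
    id = c(1, 1, 2),
    date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
    total_steps = c(8000, 12000, 4000)
)
sleep <- data.frame(
    id = c(1, 2),
    date = as.Date(c("2016-04-12", "2016-04-12")),
    total_minutes_asleep = c(400, 450)
)

# Inner join (merge's default): only id/date pairs found in both tables survive
inner <- merge(activity, sleep, by = c("id", "date"))

# all.x = TRUE keeps every activity row and fills missing sleep values with NA
left <- merge(activity, sleep, by = c("id", "date"), all.x = TRUE)

nrow(inner)
nrow(left)
sum(is.na(left$total_minutes_asleep))
```

Because days without a sleep record are dropped by the inner join, the merged table describes only the days on which users wore the device to bed.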

## Visualizations

In [None]:
ggplot(data = merged_activity_sleep, mapping = aes(x = calories, y = total_minutes_asleep)) +
    geom_point() +
    geom_smooth()

We can see from above that there is little correlation between calories burnt and total time asleep. We will need to explore other relationships that can suggest new business ideas.

In [None]:
# Relationship between steps and calories
daily_activity %>%
    ggplot(aes(total_steps, calories)) +
    geom_point() +
    geom_smooth() +
    stat_cor(method = "pearson", label.x = 20000, label.y = 1000) +
    labs(title = 'Steps vs. Calories', x = 'Steps', y = 'Calories')

We can see from above that although there is a positive correlation between steps and calories, the correlation coefficient of 0.59 is lower than expected. 
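
The coefficient that `stat_cor()` prints on the plot can also be computed directly with base R. The sketch below uses made-up step and calorie values purely to show the calls; `cor.test()` additionally reports a confidence interval and p-value.

```r
# Made-up step and calorie values, only to show the calls
steps    <- c(2000, 5000, 7500, 10000, 12500, 15000)
calories <- c(1700, 1900, 2300, 2200, 2800, 2900)

# Pearson correlation coefficient
r <- cor(steps, calories, method = "pearson")

# Same coefficient plus a confidence interval and p-value
result <- cor.test(steps, calories, method = "pearson")

r
result$conf.int
result$p.value
```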

In [None]:
daily_activity %>%
    mutate(weekday = weekdays(date)) %>%
    select(id, total_distance, weekday) %>%
    mutate(weekday = factor(weekday, levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))) %>% 
    ggplot(aes(x = weekday, y = total_distance, fill = weekday)) + 
    geom_boxplot() +
    labs(title = 'Total Distance Tracked By Weekday', x = 'Weekday', y = 'Total Distance (in km)')

We can see that total distance tracked stayed consistent from Monday to Friday, with similar medians and maximum/minimum values. This could be because most people work on weekdays, resulting in consistent activity. Meanwhile, distance tracked was highest on Saturday and lowest on Sunday, which could indicate that people are more active and out more on Saturdays, while they tend to stay at home on Sundays.

In [None]:
merged_activity_sleep %>%
    mutate(weekday = weekdays(date)) %>%
    select(id, date, total_minutes_asleep, weekday) %>%
    mutate(sleep_quality = ifelse(total_minutes_asleep <= 420, 'Low', ifelse(total_minutes_asleep <= 540, 'Optimal', 'High'))) %>%
    mutate(sleep_quality = factor(sleep_quality, levels = c('Low', 'Optimal', 'High'))) %>% 
    mutate(weekday = factor(weekday, levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))) %>% 
    
    ggplot(aes(x = total_minutes_asleep, fill = sleep_quality)) + 
    geom_histogram(position = 'dodge', bins = 30) +
    facet_wrap(~weekday) +
    labs(title = 'Distribution of Sleep Quality by Weekday', x = 'Total Minutes Asleep', y = 'Count')

We can see from above that the amount of sleep is approximately normally distributed, with a mean of around 7 hours. Looking at each day, sleep time tends to be higher on Wednesday and Thursday, and lower on Saturday and Sunday. This is interesting, as the weekend is generally thought to be the time when people sleep in.

Another possible reason for this distribution is users not wearing their devices as often on weekends compared to weekdays.
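
One way to probe the device-wear explanation is to count how many sleep records exist for each day of the week. The sketch below does this on a hypothetical sleep log (not the Fitbit data), using `format(date, "%u")` so the weekday is a locale-independent number from 1 (Monday) to 7 (Sunday).

```r
# Hypothetical sleep log: one row per user per night recorded
sleep_log <- data.frame(
    id = c(1, 2, 3, 1, 2, 1),
    date = as.Date(c("2016-04-15", "2016-04-15", "2016-04-15",  # Friday
                     "2016-04-16", "2016-04-16",                # Saturday
                     "2016-04-17"))                             # Sunday
)

# ISO weekday number: 1 = Monday ... 7 = Sunday
sleep_log$weekday_num <- as.integer(format(sleep_log$date, "%u"))

# Count records per weekday, keeping days with zero records
records_per_day <- table(factor(sleep_log$weekday_num, levels = 1:7))
records_per_day
```

If the real data showed clearly fewer records on Saturday and Sunday, that would be consistent with users wearing the device less often on weekends, though it would not prove it.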

In [None]:
hourly_calories %>%
    mutate(date_time = as.POSIXct(date_time, format="%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())) %>%
    # Derive date and time with format(), which keeps the time component for midnight values
    mutate(date = as.Date(format(date_time, '%Y-%m-%d')), time = format(date_time, '%H:%M:%S')) %>%
    mutate(weekday = weekdays(date)) %>%
    group_by(weekday, time) %>%
    summarize(average_calories = mean(calories), .groups = 'drop') %>%
    mutate(weekday = factor(weekday, levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))) %>%

    ggplot(aes(time, weekday, fill = average_calories)) +
    scale_fill_gradient(low = 'white', high = 'green2') +
    geom_tile(color = 'white', lwd = 0.5, linetype = 1) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
    labs(title = 'Calories Burnt Throughout Day', x = 'Time', y = 'Day', fill = 'average_calories')

We can tell from above that users burn the most calories in the afternoon and the fewest between 12am and 4am. This is expected, as most people are asleep at this time.

Another observation we can make is that users start their daily activity later on weekends than on weekdays.
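
The heatmap depends on turning each raw timestamp into a date and a time of day. As a minimal sketch of that step, the code below parses two illustrative strings, assumed to match the raw `ActivityHour` format, and extracts both parts with base R.

```r
# Illustrative timestamps, assumed to match the raw ActivityHour format
raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

# %I is the 12-hour clock and %p the AM/PM marker; UTC keeps the wall-clock time as written
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Split into a 24-hour time of day and a calendar date
time_of_day <- format(parsed, "%H:%M:%S")
day <- as.Date(parsed)

time_of_day
day
```

Formatting the time explicitly gives every row the same `HH:MM:SS` shape, which keeps the hourly groups on the heatmap's x-axis consistent.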

## Summary
After completing the analysis of the FitBit Fitness Tracker data, we have reached some conclusions that Bellabeat could use to further promote its growth.

#### Target users
The target audience for the device would be users working in a professional environment, with 9-5 jobs to be more exact. The visualizations above show that users of the FitBit tracker had consistent activity from Monday to Friday, with the highest activity on Saturday. This suggests that the users were working in a traditional working environment and spending their personal time over the weekends.

We also see that users have little to no activity from 12am to 4am, while activity is at its highest in the afternoon from 12pm to 6pm. Bellabeat can explore adding notifications based on time, reminding the user to exercise during peak hours in the afternoon, and to get ready for bed around midnight.

#### Sleep quality
We can see from the distributions of sleep quality throughout the week that users had more irregular sleeping patterns during the weekend than on weekdays. We also saw a decrease in optimal sleep quality (7-9 hours) during the weekend. This suggests that users depart from their weekday routines during the weekend, leading to these irregularities. Bellabeat can look into adding a feature to customize notifications and set separate routines for weekdays and weekends, emphasizing the impact this would have on users' quality of sleep.

#### Tracking calories
Finally, we see a positive correlation between total steps taken and calories burnt. While this is generally known, the data supports it with a correlation coefficient of 0.59. This can help set a daily walking goal and inform users of the impact it has on total calories burnt.