
Summary: As a small yet successful company in the fitness smart device market, Bellabeat is looking to analyze smart device data to unlock new growth opportunities for the company. 

 **Case Study Road Map:**
 
**ASK -**

**What is the business task?**  The goal of this task is to study non-Bellabeat device data to find trends and then determine marketing strategies Bellabeat could benefit from.

**Stakeholders**  Our main stakeholders are Bellabeat's cofounders, Urška Sršen and Sando Mur, and Bellabeat's executive team. We will also be communicating our findings to Bellabeat's marketing analytics team.

**PREPARE -**

**Data Source**  The data provided by Bellabeat's cofounder is the FitBit Fitness Tracker Dataset, which is in the public domain. The dataset is also available on Kaggle: https://kaggle.com/arashnic/fitbit

**Data Integrity**  The dataset can be assessed using ROCCC:

*Reliability:*

The dataset includes data from around 30 individuals. This is a small sample, but we treat it as large enough to work with. No demographic information has been provided, however, so we do not know how varied the users are. More reliable data could be sought out at this point, but we will use this dataset for the case study anyway.

*Original:*

The data was collected from thirty FitBit users via a survey distributed by Amazon Mechanical Turk between 03/12/2016 and 05/12/2016.

*Comprehensive:*

This dataset contains 18 CSV files, which include minute-level output for physical activity, heart rate, weight, and sleep monitoring data for 30 individuals over a span of two months. A more in-depth outline of our data:

* Daily activity
* Calories burned (daily, hourly, by minute)
* Intensities (daily, hourly, by minute)
* Steps (daily, hourly)
* Heart rate
* METs
* Sleep (by day and duration recorded as minutes)
* Weight log information

*Current:*

This data was collected in 2016, so it is NOT current. We will still use it for this case study to practice data analytics.

*Cited:*

Furberg, Robert; Brinton, Julia; Keating, Michael; Ortiz, Alexa. https://zenodo.org/record/53894#.Yg6vfBPMK3J


**PROCESS -**

For this data analysis, I will be using R due to the size of the datasets, as well as the ability to create data visualizations. This will allow me to summarize and share my insights with key stakeholders.

Packages:

For the first step in the data cleaning process, we need to install/load the necessary packages:




In [None]:
# Installing tidyverse also installs readr, ggplot2, dplyr, tidyr, and lubridate;
# vctrs is pulled in as a dependency of those packages.
install.packages("tidyverse")

library(tidyverse)  # attaches readr, ggplot2, dplyr, and tidyr
library(lubridate)  # date-time parsing (mdy_hms)
library(vctrs)      # vec_count

**Data:**

Next, we need to import and rename the Fitbit data that was provided for the case study:

In [None]:
daily_activity <- read_csv("dailyActivity_merged.csv")
daily_calories <- read_csv("dailyCalories_merged.csv")
daily_intensities <- read_csv("dailyIntensities_merged.csv")
daily_steps <- read_csv("dailySteps_merged.csv")
heartrate_seconds <- read_csv("heartrate_seconds_merged.csv")
hourly_calories <- read_csv("hourlyCalories_merged.csv")
hourly_intensities <- read_csv("hourlyIntensities_merged.csv")
hourly_steps <- read_csv("hourlySteps_merged.csv")
minute_calories_N <- read_csv("minuteCaloriesNarrow_merged.csv")
minute_calories_W <- read_csv("minuteCaloriesWide_merged.csv")
minute_intensities_N <- read_csv("minuteIntensitiesNarrow_merged.csv")
minute_intensities_W <- read_csv("minuteIntensitiesWide_merged.csv")
minute_mets <- read_csv("minuteMETsNarrow_merged.csv")
minute_sleep <- read_csv("minuteSleep_merged.csv")
minute_steps_N <- read_csv("minuteStepsNarrow_merged.csv")
minute_steps_W <- read_csv("minuteStepsWide_merged.csv")
weight_log <- read_csv("weightLogInfo_merged.csv")
sleep_day <- read_csv("sleepDay_merged.csv")

**Data Cleaning:**

In order to get a better idea of the data we are working with, we will first preview the provided datasets:

In [None]:
glimpse(daily_activity)
glimpse(daily_calories)
glimpse(daily_intensities)
glimpse(daily_steps)
glimpse(heartrate_seconds)
glimpse(hourly_calories)
glimpse(hourly_intensities)
glimpse(hourly_steps)
glimpse(minute_calories_N)
glimpse(minute_calories_W)
glimpse(minute_intensities_N)
glimpse(minute_intensities_W)
glimpse(minute_mets)
glimpse(minute_sleep)
glimpse(minute_steps_N)
glimpse(minute_steps_W)
glimpse(weight_log)
glimpse(sleep_day)

Because there are so many datasets, let's try to narrow our search. We will now check each dataset for its number of distinct users to find any that are incomplete:

In [None]:
n_distinct(heartrate_seconds$Id)
n_distinct(minute_sleep$Id)
n_distinct(weight_log$Id)
n_distinct(sleep_day$Id)
n_distinct(minute_steps_W$Id)
n_distinct(minute_steps_N$Id)
n_distinct(minute_mets$Id)
n_distinct(minute_intensities_W$Id)
n_distinct(minute_intensities_N$Id)
n_distinct(minute_calories_N$Id)
n_distinct(minute_calories_W$Id)
n_distinct(hourly_steps$Id)
n_distinct(hourly_intensities$Id)
n_distinct(hourly_calories$Id)
n_distinct(daily_steps$Id)
n_distinct(daily_intensities$Id)
n_distinct(daily_calories$Id)
n_distinct(daily_activity$Id)
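
The same check can also be run in one pass over a named list, which makes the smaller datasets easier to spot side by side. This is only a minimal sketch: the data frames and Ids below are made up for illustration, and `count_users` is a hypothetical helper name, not part of the original analysis.

```r
# Toy stand-ins for the imported tables; the Ids are invented for illustration.
datasets <- list(
  daily_activity = data.frame(Id = c(1, 1, 2, 3, 3)),
  sleep_day      = data.frame(Id = c(1, 2, 2)),
  weight_log     = data.frame(Id = c(3, 3))
)

# Hypothetical helper: number of distinct user Ids in each data frame.
# length(unique(x)) gives the same count as n_distinct(x) for a single column.
count_users <- function(dfs) {
  vapply(dfs, function(df) length(unique(df$Id)), integer(1))
}

user_counts <- count_users(datasets)

# Sorting puts the datasets with the fewest users first.
sort(user_counts)
```

With the real tables, the list would hold the data frames imported earlier instead of the toy ones.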

Four of the datasets above had significantly fewer user Ids than the others. While many analysts might ignore these sets and focus on the remainder, I like the road less traveled. Of the four, one does not seem to have much useful information, so we will examine the other three more closely to figure out why they are incomplete. Could this be helpful for Bellabeat?

We will be looking closer at:

1. weight_log
2. sleep_day
3. heartrate_seconds

To begin, let's check the datasets for empty fields and remove any rows that contain blanks. This will clean up our data.


In [None]:
heartrate_seconds <- heartrate_seconds[complete.cases(heartrate_seconds), ]
weight_log <- weight_log[complete.cases(weight_log), ]
sleep_day <- sleep_day[complete.cases(sleep_day), ]
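
Because `complete.cases()` drops a row whenever any single column is missing, it can be worth counting missing values per column before filtering. The sketch below uses a made-up data frame; the column names and values are assumptions for illustration only and are not taken from the real files.

```r
# Made-up example in which one column is mostly empty.
toy_log <- data.frame(
  Id       = c(1, 1, 2, 3),
  WeightKg = c(52.6, 52.4, 70.1, NA),
  Fat      = c(22, NA, NA, NA)
)

# Missing values per column.
na_per_column <- colSums(is.na(toy_log))

# Rows kept when every column must be present.
rows_all_columns <- sum(complete.cases(toy_log))

# Rows kept when only the selected columns must be present.
rows_key_columns <- sum(complete.cases(toy_log[, c("Id", "WeightKg")]))
```

If one sparse column accounts for most of the dropped rows, restricting the check to the columns the analysis needs keeps more of the data.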


The next step is to use those datasets to look for duplicates. They may not have blanks, but they may have repeated data:

In [None]:
sum(duplicated(sleep_day))
sum(duplicated(weight_log))
sum(duplicated(heartrate_seconds))

Since only 1 dataset showed duplicates, let's remove them:

In [None]:
sleep_day <- sleep_day[!duplicated(sleep_day), ]
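
For readers new to `duplicated()`: it flags the second and later copies of a row, so negating it keeps the first copy of each. A toy illustration with made-up rows (not values from the real data):

```r
# Rows 1 and 2 are identical; rows 3 and 4 differ in the second column.
toy_sleep <- data.frame(
  Id                 = c(1, 1, 2, 2),
  TotalMinutesAsleep = c(400, 400, 380, 390)
)

# TRUE marks a row that repeats an earlier row exactly.
dup_flags <- duplicated(toy_sleep)

# Keep only the first occurrence of each row.
deduped <- toy_sleep[!dup_flags, ]
```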

**ANALYZE -**

Now that our data is cleaner, we still need to address formatting. Let's check the dates in each of the datasets to see if they are in date format:

In [None]:
str(heartrate_seconds)
str(weight_log)
str(sleep_day)

It looks like they are stored as strings, so we will need to convert them in case we need to compare dates at a later point:

In [None]:
# Parse the month/day/year timestamp strings into date-times with lubridate
heartrate_seconds$Time <- mdy_hms(heartrate_seconds$Time)
sleep_day$SleepDay <- mdy_hms(sleep_day$SleepDay)
weight_log$Date <- mdy_hms(weight_log$Date)
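
To see what the conversion does to a single value, here is a small sketch. The timestamp string is made up, and its month/day/year layout is an assumption about the raw format rather than a value copied from the data.

```r
library(lubridate)

# Hypothetical raw value in month/day/year hour:minute:second order.
raw_time <- "4/12/2016 7:21:00 AM"

# mdy_hms() returns a POSIXct date-time (time zone defaults to UTC).
parsed_time <- mdy_hms(raw_time)

# Once parsed, the value can be compared or reduced to a calendar date.
parsed_date <- as.Date(parsed_time)
formatted   <- format(parsed_time, "%Y-%m-%d %H:%M:%S")
```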

Let's confirm it worked and that the dates are in the correct format now:

In [None]:
str(heartrate_seconds)
str(weight_log)
str(sleep_day)

Now we are going to group our data by Id to see if the user Ids are consistent across datasets. Are there users who took advantage of ALL Fitbit capabilities? Are these "super users" showing up in all of our datasets?

In [None]:
vec_count(heartrate_seconds$Id, sort = "key")
vec_count(sleep_day$Id, sort = "key")
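
One way to answer the "super user" question directly is to intersect the sets of Ids. This is a minimal sketch with invented Ids standing in for the real columns; the list names are illustrative only.

```r
# Made-up Ids for three datasets.
ids <- list(
  heartrate = c(101, 102, 103),
  sleep     = c(102, 103, 104),
  weight    = c(103, 105)
)

# Ids that appear in every dataset.
super_users <- Reduce(intersect, ids)

# How many of the datasets each Id appears in.
appearances <- table(unlist(lapply(ids, unique)))
```

With the real tables, each list element would be the `Id` column of one data frame, for example `unique(sleep_day$Id)`.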

The answer is no. The Ids are not consistent between the datasets. The data shows that usage is all over the board and is not consistent by user. While some users are not using their Fitbit for weight or sleep at all, others are using it intermittently. Others are getting inconsistent heart rate data, likely from improper wear or from removal of the device.
