---
title: "Bellabeat case study"
author: "Nidi Malik"
date: "2022-12-28"
output: html_document
---
# Bellabeat case study
This is my analysis for the Bellabeat case study.

# About the company Bellabeat

Bellabeat is a high-tech manufacturer of health-focused products for women. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits.

Since it was founded in 2013, Bellabeat has grown rapidly and quickly
positioned itself as a tech-driven wellness company for women.

# Main Business Task
The task is to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices.
The main questions guiding the analysis process are:

* What are some trends in smart device usage?
* How could these trends apply to Bellabeat customers?
* How could these trends help influence Bellabeat marketing strategy?

# Stakeholders

* Urška Sršen: Cofounder and Chief Creative Officer of Bellabeat
* Sando Mur: Bellabeat’s cofounder and key member of the Bellabeat executive team
* Bellabeat marketing analytics team: responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.


# About our data

We will be using the FitBit Fitness Tracker Data set, made available through Mobius: <https://www.kaggle.com/datasets/arashnic/fitbit>

The data is open source. This Kaggle data set contains personal fitness tracker data from thirty Fitbit users.
Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

However, we need to keep the limits of our data set in mind: it provides only a small number of samples to work with, and the data was recorded over a period of only two months. It is therefore possible that some form of bias is present in the data.

# ROCCC analysis
* Reliability: LOW – the dataset was collected from 30 individuals whose gender is unknown.
* Originality: LOW – third-party data collected using Amazon Mechanical Turk.
* Comprehensive: MEDIUM – the dataset contains multiple fields on daily activity intensity, calories used, daily steps taken, daily sleep time, and weight records.
* Current: MEDIUM – the data was collected in 2016 and is more than 6 years old, but people's daily habits do not change much over a few years.
* Cited: HIGH – the data collector and source are well documented.

# Preparing the data
We are using R to analyze this data, as it makes working with large data sets easier.
First we load the packages that provide the tools we need for analyzing our data.

In [None]:
library(tidyverse)
library(lubridate)
library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(janitor)
library(here)
library(skimr)


## Importing our data to R

Here we import our data into R using the base R function read.csv().

We will be focusing on 7 data sets:

* daily_activities
* calories
* steps
* heartrate
* sleep
* weightloss
* intensities

In [None]:
daily_activities <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")

calories <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")

steps <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")

heartrate <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")

sleep <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")

weightloss <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")

intensities <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")
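
Because all of the files sit in the same folder and share the `_merged.csv` suffix, the imports could also be written with a small helper. The sketch below is only an illustration: `read_fitbit()`, `data_dir`, and `demo_data` are hypothetical names that are not used elsewhere in this notebook, and a temporary folder with a tiny made-up file stands in for the Kaggle input directory.

```r
# Hypothetical helper: build the full path from a folder and a file stem,
# then read the CSV with base R's read.csv().
read_fitbit <- function(data_dir, stem) {
  read.csv(file.path(data_dir, paste0(stem, "_merged.csv")))
}

# Stand-in for the Kaggle folder: a temporary directory with one tiny file.
data_dir <- tempfile("fitbit_demo_")
dir.create(data_dir)
write.csv(
  data.frame(
    Id = c(1, 1, 2),
    ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016"),
    StepTotal = c(1000, 2000, 3000)
  ),
  file.path(data_dir, "dailySteps_merged.csv"),
  row.names = FALSE
)

# Read several files into a named list (only one file exists in this demo).
stems <- c(steps = "dailySteps")
demo_data <- lapply(stems, function(s) read_fitbit(data_dir, s))

str(demo_data$steps)
```

Keeping the folder in one variable means the path only has to be changed in one place if the data moves.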

## Preview of the data

After importing the data, we use functions such as head(), view(), str(), and colnames() to preview it.

In [None]:
head(daily_activities)
str(daily_activities)
view(daily_activities)
colnames(daily_activities)

head(calories)
str(calories)
view(calories)
colnames(calories)

head(steps)
str(steps)
view(steps)
colnames(steps)

head(heartrate)
str(heartrate)
view(heartrate)
colnames(heartrate)

head(sleep)
str(sleep)
view(sleep)
colnames(sleep)

head(weightloss)
str(weightloss)
view(weightloss)
colnames(weightloss)


head(intensities)
str(intensities)
view(intensities)
colnames(intensities)

# Cleaning our data

While checking our data, we noticed some problems in the structure of the date-time columns in several data sets.
The janitor package's clean_names() can also be used for consistent column names, although the code below keeps the original column names.
We will clean these data sets first before moving on to our analysis.


In [None]:
# daily_activities
daily_activities$ActivityDate <- as.Date(daily_activities$ActivityDate, format = "%m/%d/%Y")

# calories
calories$ActivityDay <- as.Date(calories$ActivityDay, format = "%m/%d/%Y")

# steps
steps$ActivityDay <- as.Date(steps$ActivityDay, format = "%m/%d/%Y")

# heartrate
# assumes the raw timestamps use a 12-hour clock with an AM/PM suffix,
# e.g. "4/12/2016 7:21:00 AM"; %H would ignore the suffix, so %I and %p are used
heartrate$Time <- as.POSIXct(heartrate$Time, format = "%m/%d/%Y %I:%M:%S %p")

# sleep
sleep$SleepDay <- as.POSIXct(sleep$SleepDay, format = "%m/%d/%Y %I:%M:%S %p")

# weightloss
weightloss$Date <- as.POSIXct(weightloss$Date, format = "%m/%d/%Y %I:%M:%S %p")

# intensities
intensities$ActivityDay <- as.Date(intensities$ActivityDay, format = "%m/%d/%Y")


Now that all of our dates are in the correct structure, we check that there are no duplicates in our data.
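
If the raw timestamps carry an AM/PM suffix, the hour has to be parsed with `%I` (12-hour clock) and `%p` (AM/PM marker) rather than `%H`. The following is a minimal sketch with made-up values, assuming an English or C locale for the AM/PM marker; `raw_times` and `parsed` are illustrative names only.

```r
# Made-up timestamps in the "month/day/year hour:minute:second AM/PM" layout.
raw_times <- c("4/12/2016 12:00:00 AM", "4/12/2016 7:21:00 AM", "4/12/2016 1:05:30 PM")

# %I is the 12-hour clock hour and %p the AM/PM marker; tz = "UTC" keeps the
# result independent of the machine's local time zone.
parsed <- as.POSIXct(raw_times, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Show the parsed values on a 24-hour clock.
format(parsed, "%Y-%m-%d %H:%M:%S")

# Date-only columns such as ActivityDate need as.Date() with a date format.
as.Date("4/12/2016", format = "%m/%d/%Y")
```

Printing a few parsed values next to the raw strings is a quick way to confirm that afternoon times were not shifted into the morning.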

# Analyzing our data

After sorting our data and ensuring it is clean, we can start analyzing it.
Here we count the number of unique user Ids present in each data set.


In [None]:
n_distinct(daily_activities$Id)
n_distinct(calories$Id)
n_distinct(steps$Id)
n_distinct(heartrate$Id)
n_distinct(sleep$Id)
n_distinct(weightloss$Id)
n_distinct(intensities$Id)

This showed us that daily_activities, calories, steps, and intensities each contain information about 33 users, whereas heartrate and weightloss only contain information on 14 and 8 users respectively, which is not sufficient for an analysis.
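
The per-table counts can also be collected in one step. The sketch below uses made-up tables in place of the real data sets and base R's `length(unique(x))`, which counts the same thing as `n_distinct(x)`; `demo_tables` and `users_per_table` are illustrative names.

```r
# Made-up tables standing in for the real data sets.
demo_tables <- list(
  steps = data.frame(Id = c(1, 1, 2, 3)),
  sleep = data.frame(Id = c(1, 2, 2)),
  weight = data.frame(Id = c(3))
)

# Number of distinct participants per table; length(unique(x)) is the
# base R counterpart of dplyr's n_distinct(x).
users_per_table <- sapply(demo_tables, function(df) length(unique(df$Id)))
users_per_table

# Share of all participants that each table covers.
all_ids <- unique(unlist(lapply(demo_tables, function(df) df$Id)))
round(users_per_table / length(all_ids), 2)
```

Seeing the counts side by side makes it easier to decide which tables cover too few users to analyze.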

We also checked for duplicates in our data using sum(duplicated()) and found only 3 duplicates, all in the sleep data set, which we removed.


In [None]:
# checking for duplicates

sum(duplicated(daily_activities))
sum(duplicated(calories))
sum(duplicated(steps))
sum(duplicated(sleep))
sum(duplicated(intensities))


# removing duplicates

sleep <- sleep %>%
  distinct() %>%
  drop_na()

# we check if any other duplicates

sum(duplicated(sleep))
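
To make the duplicate check concrete, here is a minimal sketch on a made-up table. It uses the base R functions `unique()` and `na.omit()`, which play the same role here as `distinct()` and `drop_na()`; `sleep_demo` and `sleep_clean` are illustrative names.

```r
# Toy sleep log with one exact duplicate row and one missing value (made-up data).
sleep_demo <- data.frame(
  Id = c(1, 1, 1, 2, 2),
  SleepDay = c("4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(420, 420, 390, NA, 450)
)

# duplicated() flags every repeat of an earlier row, so the sum is the
# number of rows that removing duplicates would drop.
sum(duplicated(sleep_demo))

# Remove exact duplicate rows, then rows that contain a missing value.
sleep_clean <- na.omit(unique(sleep_demo))

nrow(sleep_clean)
sum(duplicated(sleep_clean))
```

Dropping rows with missing values can also remove otherwise valid records, so it is worth comparing row counts before and after.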


We will only use the following data sets for our analysis, as they contain sufficient information:

* daily_activities
* calories
* steps
* sleep
* intensities

### Summary statistics about our data

In [None]:
# summary statistics for daily_activities

daily_activities %>% 
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes, Calories, 
         VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes) %>%
  summary()

# summary statistics for calories

calories %>% 
  select(Calories) %>% 
  summary()

# summary statistics for steps

steps %>% 
  select(StepTotal) %>% 
  summary()

# summary statistics for sleep

sleep %>% 
  select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>% 
  summary()

# summary statistics for intensities

intensities %>% 
  select(SedentaryMinutes, LightlyActiveMinutes, 
         FairlyActiveMinutes, VeryActiveMinutes) %>% 
  summary()

## Key findings from the statistical summary

* The average sedentary time was more than 16 hours per day, which is too high.

* Most of the participants were lightly active with high sedentary time.

* Participants slept for about 7 hours on average.

* Participants walked an average of about 7,638 steps per day. They could be motivated to increase this toward the 8,000 to 12,000 steps per day that CDC research associates with lower health risks.
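
The hour figures above come from converting minute-based columns into hours. Here is a minimal sketch of that conversion on made-up values (not the actual summary output); `demo` and `avg_hours` are illustrative names.

```r
# Made-up daily records in minutes; the real columns live in
# daily_activities and sleep.
demo <- data.frame(
  SedentaryMinutes = c(900, 1020, 1080),
  TotalMinutesAsleep = c(400, 420, 440)
)

# Average of each column, converted from minutes to hours.
avg_hours <- colMeans(demo) / 60
round(avg_hours, 1)

# Share of a 24-hour day spent sedentary, in percent.
round(100 * mean(demo$SedentaryMinutes) / (24 * 60), 1)
```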

## Merging data
 
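Before beginning to visualize the data, I'm going to merge two data sets, daily_activities and sleep, on the Id and date columns so that each row describes one user on one day.


In [None]:
# combining datasets on user and day
# SleepDay is a date-time, so reduce it to a plain date to match ActivityDate

sleep_by_day <- sleep
sleep_by_day$ActivityDate <- as.Date(format(sleep_by_day$SleepDay, "%Y-%m-%d"))
daily_activity_sleep <- merge(daily_activities, sleep_by_day, by = c("Id", "ActivityDate"))
n_distinct(daily_activity_sleep$Id)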
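
The choice of key columns in merge() matters. Joining on Id alone pairs every activity row of a user with every sleep row of that user, while adding the date keeps one row per user and day. Here is a minimal sketch with made-up data; `activity_demo` and `sleep_demo` are illustrative names.

```r
# Made-up data: user 1 has two activity days and two sleep records.
activity_demo <- data.frame(
  Id = c(1, 1),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13")),
  TotalSteps = c(8000, 5000)
)
sleep_demo <- data.frame(
  Id = c(1, 1),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(420, 390)
)

# Joining on Id alone pairs every activity day with every sleep record.
by_id <- merge(activity_demo, sleep_demo, by = "Id")
nrow(by_id)       # 2 x 2 = 4 rows

# Joining on Id and date keeps one row per user and day.
by_id_date <- merge(activity_demo, sleep_demo, by = c("Id", "ActivityDate"))
nrow(by_id_date)  # 2 rows
```

Comparing nrow() before and after a merge is a quick way to notice an unintended many-to-many join.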

# Visualizing data

Now let's visualize some key observations.

In [None]:
# total steps vs calories 

ggplot(data = daily_activities, mapping = aes(x= TotalSteps, y= Calories ))+ 
  geom_point()+ geom_smooth() +
  labs(title = "Total Steps vs Calories")+ theme(panel.background = element_blank())

Here we can observe a positive correlation between total steps a participant walked vs the calories they burned. 
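
The strength of such a relationship can be put into a number with cor(), and a simple linear fit gives the slope. The sketch below uses made-up values; with the real data the inputs would be daily_activities$TotalSteps and daily_activities$Calories. The names `total_steps`, `calories_burned`, `r`, and `fit` are illustrative.

```r
# Made-up values, not taken from the Fitbit data.
total_steps     <- c(2000, 4000, 6000, 8000, 10000, 12000)
calories_burned <- c(1700, 1850, 2100, 2200, 2500, 2600)

# Pearson correlation coefficient: +1 is a perfect increasing line,
# 0 no linear relationship, -1 a perfect decreasing line.
r <- cor(total_steps, calories_burned)
r

# Slope of a simple linear fit: extra calories per additional step.
fit <- lm(calories_burned ~ total_steps)
coef(fit)[["total_steps"]]
```

A correlation describes how the two columns move together; on its own it does not show that more steps cause the higher calorie burn.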

### Relationship between very active minutes and sedentary minutes

In [None]:

ggplot(data = daily_activities, aes(x = VeryActiveMinutes, y = SedentaryMinutes)) +  geom_point() + geom_smooth()+
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Very active minutes vs. SedentaryMinutes") + theme(panel.background = element_blank())

This plot shows a negative relationship between very active minutes and sedentary minutes. This suggests that users who are more active spend less time sedentary and may be fitter than those who are less active.

### Relationship between Minutes Asleep and Time in Bed

In [None]:
# What's the relationship between minutes asleep and time in bed?

ggplot(data=sleep, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + 
  geom_point()+ geom_smooth() +
  labs(title=" Minutes Asleep vs. Time in Bed Minutes")+ theme(panel.background = element_blank())

As we might expect, there is an almost completely linear trend between minutes asleep and time in bed. To help users improve their sleep, the company should consider adding a reminder feature that alerts the user to go to sleep.

Now let's look at the intensities data.

In [None]:
# very active time per day, converted from minutes to hours
intensities$ActiveIntensity <- (intensities$VeryActiveMinutes)/60

# `time` is the time of day of the weight-log entries (weightloss$Date)
Combined_data <- merge(weightloss, intensities, by="Id", all=TRUE)
Combined_data$time <- format(Combined_data$Date, format = "%H:%M:%S")
view(Combined_data)


# geom_col() draws bars whose heights come directly from the y values
ggplot(data=Combined_data) + geom_col(mapping= aes(x= time, y=ActiveIntensity), fill= "pink") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title="Total very Active Intensity vs. Time ") + theme(panel.background = element_blank())

Analyzing intensity data over time gives the company a good idea of how customers use their product during the day. Most users are active before and after work, I suppose. The company can use these times in the Bellabeat app to remind and motivate users to go for a run or a walk.
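
Looking at activity by time of day needs records that carry an hourly timestamp. The sketch below assumes a hypothetical hourly table with `ActivityHour` and `TotalIntensity` columns, which is not one of the data sets imported above, and it uses made-up values; `hourly_demo` and `by_hour` are illustrative names.

```r
# Hypothetical hourly records (made-up values, two days for one user).
hourly_demo <- data.frame(
  ActivityHour = as.POSIXct(
    c("2016-04-12 07:00:00", "2016-04-12 13:00:00", "2016-04-12 18:00:00",
      "2016-04-13 07:00:00", "2016-04-13 13:00:00", "2016-04-13 18:00:00"),
    tz = "UTC"
  ),
  TotalIntensity = c(30, 10, 40, 20, 14, 50)
)

# Extract the hour of day as a two-digit label ("07", "13", ...).
hourly_demo$hour <- format(hourly_demo$ActivityHour, "%H")

# Mean intensity for each hour of day across all days.
by_hour <- aggregate(TotalIntensity ~ hour, data = hourly_demo, FUN = mean)
by_hour

# Hour of day with the highest average intensity.
by_hour$hour[which.max(by_hour$TotalIntensity)]
```

Averaging by hour of day, rather than plotting raw timestamps, is what would allow a statement such as "users are most active in the early evening" to be checked against the data.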

# Conclusion and Business Recommendation
Collecting data on activity, sleep, stress, and so on allows Bellabeat to empower its customers with knowledge about their own health and daily habits. Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for its customers.

By analyzing the FitBit Fitness Tracker Data set, I found some insights that would help influence Bellabeat marketing strategy.

# Target Audience

People who work full-time jobs, spend a lot of time at the computer and in the office, and need fitness and daily activity to stay in shape.

These users do some light activity to stay healthy (according to the activity type analysis), but they need to increase their everyday activity to gain more health benefits. They might also need some knowledge about developing healthy habits and the motivation to keep going.

# Recommendation for the marketing team

* The sedentary time for users is quite high (approximately 16 hours). This needs to be reduced by promoting a healthier lifestyle. The marketing team should collaborate with celebrities to promote a new goal-oriented feature with a reward system for the company's app.

* Participants sleep once a day for an average of about 7 hours. To help users improve their sleep, Bellabeat should consider a sleep reminder feature that reminds users to go to bed. This might also help the Bellabeat app reduce sedentary time, as a well-rested user would likely exercise more.

* The average total steps per day (7,638) is a little less than recommended by the CDC. According to CDC research, taking 8,000 steps per day was associated with a 51% lower risk of all-cause mortality (death from all causes), and taking 12,000 steps per day was associated with a 65% lower risk, both compared with taking 4,000 steps. Bellabeat can therefore encourage people to take at least 8,000 steps per day by explaining the health benefits of doing so.

* Analyzing intensity data over time gives the company a good idea of how customers use the app during the day. Most users are active before and after work. The company can use these times in the Bellabeat app to remind and motivate users to go for a run or a walk.

* For customers who want to lose weight, it can be a good idea to control daily calorie consumption. Bellabeat can partner with chefs to add simple, inclusive, low-calorie food recommendations for its users.

Thank you very much for your interest in my Bellabeat Case Study!

I would appreciate any comments and recommendations for improvement!