In [None]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

![bellabeat logo](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLX-tdij_MQ7lDd2bjJxA5BGQra7ELOWAJklDKmQTE&s)

About the company:

Bellabeat is a technology company focused on wearable health-centric products for women. Bellabeat has found success and is looking for growth to become a power player in the global smart device market. Founders Urška Sršen and Sando Mur designed their technology to inform and empower women around the world to be more cognisant of their health and habits. Bellabeat wearables collect data on activity, sleep, stress, and reproductive health. In addition to the different wearable options, the trackers are accompanied by the Bellabeat app, which provides feedback and helps users understand the data and their habits. Lastly, Bellabeat offers a water bottle that uses smart technology to track users’ water intake and make sure they’re properly hydrated.

Guiding Questions from stakeholders:

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

Ask:

Guiding Questions for Analysis:
1. What is the problem you are trying to solve?
The problem is to analyze smart device usage data and determine how that data can drive new business decisions.
2. How can your insights drive business decisions?
We can use the insights from our analysis to recommend data-backed ways to grow and improve Bellabeat.

Key Tasks: 
1. Identify the business task
The business task is as stated above: to analyze usage data and report our findings to the stakeholders, with recommendations for improvement.
2. Consider key stakeholders
Urška Sršen: Co-founder and CCO
Sando Mur: Co-founder and Mathematician (executive team member)
Deliverable:
1. Clear statement of business task
Analyze Fitbit tracker data to gain insight into usage trends and how the marketing team can leverage those trends to identify new growth opportunities. It is important to note that this tracker data is not first-party data from Bellabeat consumers; it is third-party data from a rival smart device, Fitbit. Insights from competitors’ wearables can help Bellabeat develop new strategies.

Prepare: 

Guiding Questions:
1. Where is your data stored?
This data is stored on Kaggle and was made available by Mobius.
2. How is the data organized? Is it long or wide format?
The data is organized in long format and consists of 18 .csv files in total
3. Are there issues with bias or credibility in this data? Does your data ROCCC?
Reliable - LOW: this dataset has a sample size of only 30, the lowest recommended sample size. Many other factors, such as age, gender, and height, are also unknown.
Original - LOW: this is third-party data collected via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016.
Comprehensive - MEDIUM: the data contains a variety of variables, including minute-level output for physical activity, heart rate, and sleep monitoring.
Current - LOW: the data was collected almost 7 years ago, and a lot about a person’s lifestyle can change in that time.
Cited - HIGH: the source and the data collected are well documented.
4. How are you addressing licensing, privacy, security, and accessibility?
This data is CC0: Public Domain, so it is free for the public to use.
5. How did you verify the data’s integrity?
The data was easily accessible and transferable, and it has a 10/10 usability score on Kaggle, which supports its integrity.
6. How does it help you answer your question?
This dataset will provide insight into the usage patterns of Fitbit wearers.
7. Are there any problems with the data?
At first glance there are no apparent problems with the data.
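
To make the long-versus-wide distinction in question 2 concrete, here is a minimal sketch using a small made-up table. The data frame, its column names (`user_id`, `day_1`, `day_2`), and its values are assumptions for illustration only; they are not taken from the Fitbit files.

```r
library(tidyr)

# Hypothetical wide table: one row per user, one column per day.
steps_wide <- data.frame(
  user_id = c(1, 2),
  day_1 = c(5000, 7200),
  day_2 = c(6100, 6900)
)

# Long format: one row per user per day.
steps_long <- pivot_longer(
  steps_wide,
  cols = c(day_1, day_2),
  names_to = "day",
  values_to = "steps"
)

print(steps_long)
```

In long format each user appears on several rows, one per observation, which is generally the shape that grouping and plotting functions in the tidyverse expect.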

Key Tasks:
1. Download data and store it appropriately
The data was downloaded and stored in a folder named “bellabeat_casestudy”. Inside that folder is another folder named “FitBit Data 4.12.16-5.12.16”, which holds the 18 .csv files we can use.
2. Identify how it’s organized 
Data is organized by scale: minute, hourly, daily and by type: activity, sleep, weight, etc.
3. Sort and filter the data
4. Determine the credibility of the data

Deliverables:
1. A description of all the data sources used
I will use Microsoft Excel for data cleaning

Process:

Guiding questions:

1. What tools are you choosing and why?  

 - The tools I am choosing are Excel and R. I chose Excel because the datasets we’re working with are not large enough to warrant SQL. I am using R for my analysis because it has both data cleaning and visualization capabilities within its platform.

2. Have you ensured your data’s integrity? 

 - Yes, I have ensured the data’s integrity by using conditional formatting to make sure all of the values make sense in the associated columns. For example, I made sure that there were no negative values because that wouldn’t make sense given our column names and what we’re analyzing. 

3. What steps have you taken to ensure that your data is clean? 

 - I used conditional formatting in Excel to check for missing values and to note any that I found. I also checked again for negative values.

4. How can you verify that your data is clean and ready to analyze? 

 - See the steps I have taken above. 

5. Have you documented your cleaning process so you can review and share those results?

 - Yes, see below.

Data Cleaning:
 - Used Excel to clean the datasets since the sample size is small.
 - Used conditional formatting to find duplicate data; several sheets contain redundant data, so I won’t be using them in my analysis.
 - The total_distance and tracker_distance columns repeat the same values, so I will delete tracker_distance.
 - No unit is given for distance: is it miles or kilometers?
 - daily_activity contains all of the data found in daily_steps and daily_intensities, so for simplicity’s sake I will use only the daily_activity sheet.
 - The other two sheets I will use in my analysis are weight_log and daily_sleep.
 - In daily_sleep there are only 8 unique user_ids with data, many of whom entered the data manually and many of whom did not record data every day in the observed period. With such a small sample size, I won’t be able to draw statistically significant findings from this data.
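
The cleaning above was done in Excel. For readers who want to repeat the same kinds of checks in R, here is a rough equivalent on a toy data frame. The data frame `activity`, its values, and the `activity_date` and `total_steps` column names are assumptions made up for this sketch; it is not the code used to produce the cleaned files.

```r
# Toy stand-in for a daily activity sheet; values are invented.
activity <- data.frame(
  user_id = c(1, 1, 2, 2, 2),
  activity_date = c("2016-04-12", "2016-04-13", "2016-04-12",
                    "2016-04-12", "2016-04-13"),
  total_steps = c(5000, 7200, 6100, 6100, NA),
  total_distance = c(3.2, 4.6, 3.9, 3.9, 4.1),
  tracker_distance = c(3.2, 4.6, 3.9, 3.9, 4.1)
)

# Count missing values in each column.
missing_counts <- colSums(is.na(activity))

# Count negative values in each numeric column (NA entries are ignored).
numeric_cols <- sapply(activity, is.numeric)
negative_counts <- colSums(activity[, numeric_cols] < 0, na.rm = TRUE)

# Count rows that exactly repeat an earlier row.
duplicate_rows <- sum(duplicated(activity))

# Check whether tracker_distance repeats total_distance.
distance_redundant <- isTRUE(
  all.equal(activity$total_distance, activity$tracker_distance)
)

# Count distinct users with data.
n_users <- length(unique(activity$user_id))

print(missing_counts)
print(negative_counts)
print(duplicate_rows)
print(distance_redundant)
print(n_users)
```

Running checks like these in code leaves a record of what was verified, which is harder to get from conditional formatting alone.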

Analyze: 

 - Here is my analysis in R including the code chunks and their outputs.

In [None]:
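library(tidyverse) # core packages, including dplyr, ggplot2, and tidyr
library(lubridate) # date and time helpers
library(dplyr)     # data manipulation (already attached by tidyverse)
library(ggplot2)   # plotting (already attached by tidyverse)
library(tidyr)     # reshaping (already attached by tidyverse)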

**Loading Datasets to be Used:**

Here we're going to load the three data sets that we'll be using in our analysis, which we'll call daily_activity, daily_sleep, and weight_log. 

In [None]:
daily_activity <- read.csv("/kaggle/input/bellabeat-analysis/daily_activity_cleaned.csv")
daily_sleep <- read.csv("/kaggle/input/bellabeat-analysis/daily_sleep_cleaned.csv")
weight_log <- read.csv("/kaggle/input/bellabeat-analysis/weight_log_cleaned.csv")

**Exploring Key Tables:**

Let's preview all of our datasets to make sure they imported properly.

In [None]:
head(daily_activity)
head(daily_sleep)
head(weight_log)

Now, let's see the column names in our datasets.

In [None]:
colnames(daily_activity)
colnames(daily_sleep)
colnames(weight_log)
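
Once the column names are visible, one possible follow-up is to see which columns the three tables have in common, since those are the natural candidates for combining them later. The sketch below uses made-up name vectors as stand-ins; apart from `user_id`, the names are assumptions, and the real ones are whatever `colnames()` returns.

```r
# Hypothetical column names standing in for the colnames() output.
activity_cols <- c("user_id", "activity_date", "total_steps")
sleep_cols <- c("user_id", "sleep_day", "total_minutes_asleep")
weight_cols <- c("user_id", "date", "weight_kg")

# Columns present in all three tables are candidate keys for combining them.
shared_cols <- Reduce(intersect, list(activity_cols, sleep_cols, weight_cols))

print(shared_cols)
```

With the real data, the same call would take `colnames(daily_activity)`, `colnames(daily_sleep)`, and `colnames(weight_log)` in place of the stand-in vectors.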