# **Google Data Analytics Professional Certificate (Capstone Project)**
### Case Study: How Can a Wellness Technology Company Play It Smart?

#### **Prepared By: Musaini Ramlee**

#### **1) Introduction**

Welcome to the Bellabeat data analysis case study!  
In this case study, I take on the real-world tasks of a junior data
analyst for Bellabeat, a high-tech manufacturer of health-focused products for women.

This case study follows the steps of the data analysis process as laid out in the *Google Data Analytics Professional Certificate*: **Ask, Prepare, Process, Analyze, Share,** and **Act**.

Bellabeat's products include:

* **Bellabeat app**: The Bellabeat app provides users with health data related to their activity, sleep, stress,
menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and
make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

* **Leaf**: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects
to the Bellabeat app to track activity, sleep, and stress.

* **Time**: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user
activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your
daily wellness.

* **Spring**: This is a water bottle that tracks daily water intake using smart technology to ensure that you are
appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your
hydration levels.

* **Bellabeat membership**: Bellabeat also offers a subscription-based membership program for users.
Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and
beauty, and mindfulness based on their lifestyle and goals.

#### **2) Problem Statement (Business Tasks)**
###### _> (This is the 'Ask' stage)_
            

The co-founder of Bellabeat, Urška Sršen, requested an analysis of smart device usage by non-Bellabeat smart device users in order to answer the following questions:

i.	What are some trends in smart device usage?   
ii.	How could these trends apply to Bellabeat customers?   
iii.	How could these trends help influence Bellabeat marketing strategy? 


#### **3) Data Sets**
###### _> (This is the 'Prepare' stage)_

a.	**Dataset Origin**: This dataset comes from the preprocessed [FitBit Fitness Tracker Data](https://www.kaggle.com/arashnic/fitbit)
(CC0: Public Domain). The dataset is made available through Mobius and can be downloaded from Kaggle.

b.	**Description**: This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.


c.	**Data Structure and Overview**: The raw data is loaded and previewed using the code below. The objective is to identify how it is structured.

In [78]:
import pandas as pd
dailyAct = pd.read_csv('dailyActivity_merged.csv')
dailyAct

| | Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1503960366 | 4/12/2016 | 13162 | 8.500000 | 8.500000 | 0.0 | 1.88 | 0.55 | 6.06 | 0.00 | 25 | 13 | 328 | 728 | 1985 |
| 1 | 1503960366 | 4/13/2016 | 10735 | 6.970000 | 6.970000 | 0.0 | 1.57 | 0.69 | 4.71 | 0.00 | 21 | 19 | 217 | 776 | 1797 |
| 2 | 1503960366 | 4/14/2016 | 10460 | 6.740000 | 6.740000 | 0.0 | 2.44 | 0.40 | 3.91 | 0.00 | 30 | 11 | 181 | 1218 | 1776 |
| 3 | 1503960366 | 4/15/2016 | 9762 | 6.280000 | 6.280000 | 0.0 | 2.14 | 1.26 | 2.83 | 0.00 | 29 | 34 | 209 | 726 | 1745 |
| 4 | 1503960366 | 4/16/2016 | 12669 | 8.160000 | 8.160000 | 0.0 | 2.71 | 0.41 | 5.04 | 0.00 | 36 | 10 | 221 | 773 | 1863 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 935 | 8877689391 | 5/8/2016 | 10686 | 8.110000 | 8.110000 | 0.0 | 1.08 | 0.20 | 6.80 | 0.00 | 17 | 4 | 245 | 1174 | 2847 |
| 936 | 8877689391 | 5/9/2016 | 20226 | 18.250000 | 18.250000 | 0.0 | 11.10 | 0.80 | 6.24 | 0.05 | 73 | 19 | 217 | 1131 | 3710 |
| 937 | 8877689391 | 5/10/2016 | 10733 | 8.150000 | 8.150000 | 0.0 | 1.35 | 0.46 | 6.28 | 0.00 | 18 | 11 | 224 | 1187 | 2832 |
| 938 | 8877689391 | 5/11/2016 | 21420 | 19.559999 | 19.559999 | 0.0 | 13.22 | 0.41 | 5.89 | 0.00 | 88 | 12 | 213 | 1127 | 3832 |
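
Beyond previewing the rows, the structure can also be summarised programmatically. The following is a minimal sketch, assuming a small in-memory sample (the first three rows above, with only four of the columns) instead of the full `dailyActivity_merged.csv`; the helper name `describe_structure` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# In-memory sample: first three rows of the preview above, selected columns only,
# so the sketch runs without the CSV file being present.
sample = pd.DataFrame({
    "Id": [1503960366, 1503960366, 1503960366],
    "ActivityDate": ["4/12/2016", "4/13/2016", "4/14/2016"],
    "TotalSteps": [13162, 10735, 10460],
    "Calories": [1985, 1797, 1776],
})

def describe_structure(df):
    """Hypothetical helper: summarise shape, column types, and missing values."""
    return {
        "shape": df.shape,                           # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),   # column name -> dtype name
        "missing": df.isna().sum().to_dict(),        # column name -> count of NaN
        "unique_ids": df["Id"].nunique(),            # number of distinct users
    }

structure = describe_structure(sample)
print(structure["shape"])
print(structure["dtypes"])
```

On the full file, the same helper could be called on `dailyAct` to see how many distinct users and missing values the dataset actually contains.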


In [79]:
dailyCal = pd.read_csv('dailyCalories_merged.csv')
dailyCal

| | Id | ActivityDay | Calories |
|---|---|---|---|
| 0 | 1503960366 | 4/12/2016 | 1985 |
| 1 | 1503960366 | 4/13/2016 | 1797 |
| 2 | 1503960366 | 4/14/2016 | 1776 |
| 3 | 1503960366 | 4/15/2016 | 1745 |
| 4 | 1503960366 | 4/16/2016 | 1863 |
| ... | ... | ... | ... |
| 935 | 8877689391 | 5/8/2016 | 2847 |
| 936 | 8877689391 | 5/9/2016 | 3710 |
| 937 | 8877689391 | 5/10/2016 | 2832 |
| 938 | 8877689391 | 5/11/2016 | 3832 |


It is clear from the tables above that the data is stored in long format: each row holds one user's (`Id`) record for a single day.
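
To illustrate what long format means here, the sketch below reshapes four rows taken from the `dailyCal` preview above into wide format with `DataFrame.pivot`. It is only an illustration built on an in-memory sample; the variable names `long_df` and `wide_df` are assumptions, not names used elsewhere in this notebook.

```python
import pandas as pd

# Long format: one row per (Id, ActivityDay) observation.
# Values are copied from the dailyCal preview above.
long_df = pd.DataFrame({
    "Id": [1503960366, 1503960366, 8877689391, 8877689391],
    "ActivityDay": ["4/12/2016", "4/13/2016", "5/8/2016", "5/9/2016"],
    "Calories": [1985, 1797, 2847, 3710],
})

# Wide format: one row per user, one column per day.
# Days a user has no record for show up as NaN.
wide_df = long_df.pivot(index="Id", columns="ActivityDay", values="Calories")

print(long_df.shape)
print(wide_df.shape)
```

Long format tends to be the more convenient shape for `groupby` aggregations, which is why the data is left as it is for the analysis.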

d.	**_ROCCC_ Analysis**: To make sure the data is "Good Data", we check whether it is:

* **Reliable** - This dataset is reliable because it is generated from FitBit, one of the leading brands in smart wearable technology.
* **Original** - Although this dataset is not a primary source (it was not collected directly by Bellabeat), it is still original and in its raw form. Hence, it is good secondary data, sufficient for analysis. The data owner is _Mobius_.
* **Comprehensive** - The dataset is comprehensive enough. It covers aspects ranging from fitness to wellness.
* **Current** - The dataset is considered recent, i.e. December 2020, although the activity records previewed above are dated April to May 2016.
* **Cited** - This dataset does not have any copyright. However, the original dataset can be traced back [here](https://zenodo.org/record/53894#.X9oeh3Uzaao).

In conclusion, the dataset can be considered as reliable and credible enough to be used for further analysis.
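
Some of the ROCCC points, such as how many users and which dates the data covers, can be checked directly against the data. The sketch below is one possible way to do so, assuming a four-row in-memory sample built from the first and last rows of the `dailyAct` preview; on the full file the counts would differ, and the `summary` dictionary is a hypothetical structure used only for this illustration.

```python
import pandas as pd

# Four rows taken from the start and end of the dailyAct preview above.
sample = pd.DataFrame({
    "Id": [1503960366, 1503960366, 8877689391, 8877689391],
    "ActivityDate": ["4/12/2016", "4/13/2016", "5/10/2016", "5/11/2016"],
})

# Parse the month/day/year strings so dates compare chronologically.
dates = pd.to_datetime(sample["ActivityDate"], format="%m/%d/%Y")

summary = {
    "users": sample["Id"].nunique(),                       # distinct participants
    "first_day": dates.min(),                              # earliest record
    "last_day": dates.max(),                               # latest record
    "days_covered": (dates.max() - dates.min()).days + 1,  # inclusive span in days
}
print(summary)
```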

#### **4) Data Processing**
###### _> (This is the 'Process' stage)_

*	The data is in CSV format. All of the downloaded files are cleaned with a standard process.

In [None]:
import os
from pathlib import Path

import pandas as pd

cwd = Path(os.getcwd())

# Step 1 - Remove duplicate rows from every CSV file
## using the function 'drop_duplicates'

# Step 2 - Strip whitespace from every string value
def trim_all_columns(df):
    """
    Trim whitespace from both ends of each string value in the dataframe
    """
    trim_strings = lambda x: x.strip() if isinstance(x, str) else x
    # Series.map runs column by column; non-string values pass through unchanged
    return df.apply(lambda col: col.map(trim_strings))

csv_files = sorted(cwd.glob('*.csv'))  # list all CSV files in the working dir

out_dir = cwd / 'cleaned'
out_dir.mkdir(exist_ok=True)  # create the output dir if it does not exist

for csv in csv_files:  # iterate over the list
    # Get data
    df = pd.read_csv(csv)

    # The cleaning operations for Steps 1 & 2 above
    df = df.drop_duplicates()  # keep the first copy of each duplicated row
    df = trim_all_columns(df)  # trim extra spaces

    # index=False avoids writing the row index as an extra unnamed column
    df.to_csv(out_dir / csv.name, index=False)  # save the file in a new dir
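
The `keep` argument of `drop_duplicates` changes what "removing duplicates" means, so it is worth seeing on a small example. The sketch below uses hypothetical toy values (not rows from the FitBit files) that contain one exact duplicate.

```python
import pandas as pd

# Toy rows (hypothetical values) in which the first row appears twice.
toy = pd.DataFrame({
    "Id": [1, 1, 2],
    "ActivityDay": ["4/12/2016", "4/12/2016", "4/13/2016"],
    "Calories": [1985, 1985, 1797],
})

# Default (keep='first'): one copy of the repeated row survives.
keep_first = toy.drop_duplicates()

# keep=False: every copy of the repeated row is dropped,
# so the observation disappears from the data entirely.
keep_none = toy.drop_duplicates(keep=False)

print(len(keep_first))  # 2
print(len(keep_none))   # 1
```

If the goal is to keep each observation exactly once, the default behaviour is usually the one that is wanted; `keep=False` fits the case where any duplicated record is considered untrustworthy.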

#### **5) Data Analysis**
###### _> (This is the 'Analyze' stage; the work done to gain insights from the cleaned data)_

In [114]:
dailyAct.groupby(['ActivityDate']).sum()

| ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-04-12 | 271816 | 197.429999 | 196.619999 | 7.122864 | 60.27 | 11.42 | 112.53 | 0.01 | 736 | 259 | 6567 | 33865 | 78893 |
| 2016-04-13 | 237558 | 168.409999 | 167.36 | 6.943454 | 43.78 | 13.86 | 103.65 | 0.05 | 671 | 349 | 5998 | 33719 | 75459 |
| 2016-04-14 | 255538 | 184.780001 | 184.020001 | 5.538496 | 49.82 | 16.82 | 117.76 | 0.07 | 691 | 409 | 6633 | 33331 | 77761 |
| 2016-04-15 | 248617 | 174.5 | 174.5 | 0.0 | 34.84 | 13.33 | 124.32 | 0.05 | 633 | 326 | 7057 | 31715 | 77721 |
| 2016-04-16 | 277733 | 201.330001 | 201.330001 | 0.0 | 63.799999 | 22.68 | 110.42 | 0.05 | 891 | 484 | 6202 | 32085 | 76574 |
| 2016-04-17 | 205096 | 145.299999 | 145.299999 | 0.0 | 36.649999 | 15.92 | 90.31 | 0.02 | 605 | 379 | 5291 | 33599 | 71391 |
| 2016-04-18 | 252703 | 181.049999 | 179.98 | 7.022697 | 53.300001 | 22.27 | 105.0 | 0.05 | 781 | 516 | 6025 | 33959 | 74668 |
| 2016-04-19 | 257557 | 187.899998 | 186.919998 | 7.195223 | 60.26 | 16.61 | 110.83 | 0.03 | 767 | 441 | 6461 | 32126 | 75491 |
| 2016-04-20 | 261215 | 190.410001 | 189.510001 | 7.016988 | 59.58 | 20.27 | 110.430001 | 0.06 | 774 | 600 | 6515 | 31172 | 76647 |
| 2016-04-21 | 263795 | 192.960002 | 191.130002 | 6.334773 | 61.530001 | 19.9 | 101.189999 | 0.16 | 859 | 478 | 5845 | 33020 | 77500 |
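
The per-day aggregation above can be sketched on a small scale as follows. This is an illustration under stated assumptions: the rows are hypothetical toy values for two made-up users, the date column is parsed explicitly, and only two measure columns are aggregated so that identifier columns such as `Id` are not summed.

```python
import pandas as pd

# Toy rows (hypothetical values): two users, two days each.
toy = pd.DataFrame({
    "Id": [1, 2, 1, 2],
    "ActivityDate": ["4/12/2016", "4/12/2016", "4/13/2016", "4/13/2016"],
    "TotalSteps": [13162, 10000, 10735, 8000],
    "Calories": [1985, 2000, 1797, 1800],
})

# Parse the dates so the groups sort chronologically rather than as text.
toy["ActivityDate"] = pd.to_datetime(toy["ActivityDate"], format="%m/%d/%Y")

measures = ["TotalSteps", "Calories"]
daily_totals = toy.groupby("ActivityDate")[measures].sum()   # all users combined
daily_means = toy.groupby("ActivityDate")[measures].mean()   # average per user

print(daily_totals)
print(daily_means)
```

Daily totals depend on how many users logged data on a given day, so the per-user mean may be the fairer figure to compare across days when participation varies.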
