# INTRODUCTION

Capstone Project: **Google Data Analytics Professional Certificate** 

Goal: The **6 steps of Data Analysis**

Title: Bellabeat: How Can A Wellness Technology Company Play It Smart?

Author: Josh Chen

Date: 26 March 2021

***

# STEP 1: ASK

## 1.0 Background
1.0.1 Bellabeat is a high-tech manufacturer of health-focused products for women.

1.0.2 Bellabeat's products are smart devices that collect users' data.

## 1.1 Business Task:
1.1.1 Analyze FitBit Fitness Tracker Data 

1.1.2 Gain insights into how consumers are using the FitBit app 

1.1.3 Discover trends and insights for Bellabeat marketing strategy

## 1.2 Business Objectives:  
1.2.1 What are the trends identified?

1.2.2 How could these trends apply to Bellabeat customers?

1.2.3 How could these trends help influence Bellabeat marketing strategy?

## 1.3 Deliverables:
1.3.1 A clear summary of the business task

1.3.2 A description of all data sources used

1.3.3 Documentation of any cleaning or manipulation of data

1.3.4 A summary of analysis

1.3.5 Supporting visualizations and key findings

1.3.6 High-level content recommendations based on the analysis

## 1.4 Key Stakeholders:
1.4.1 Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer

1.4.2 Sando Mur: Mathematician, Bellabeat’s cofounder and key member of the Bellabeat executive team

1.4.3 Bellabeat marketing analytics team: A team of data analysts guiding Bellabeat's marketing strategy.


***


# STEP 2: PREPARE

## 2.1 Information on Data Source:
1. The data is publicly available on [Kaggle: FitBit Fitness Tracker Data](https://www.kaggle.com/arashnic/fitbit) and stored in 18 CSV files.
2. Generated by respondents to a survey distributed via Amazon Mechanical Turk between 12 March 2016 and 12 May 2016.
3. FitBit users consented to the submission of personal tracker data.
4. Data collected includes: (1) physical activity recorded in minutes; (2) heart rate; (3) sleep monitoring; (4) daily activity; (5) steps.

## 2.2 Is Data ROCCC?
A good data source is ROCCC, which stands for **R**eliable, **O**riginal, **C**omprehensive, **C**urrent, and **C**ited.
1. Reliable: LOW - Not reliable as it has only 30 respondents
2. Original: LOW - Third party provider (Amazon Mechanical Turk)
3. Comprehensive: MED - Parameters match most of Bellabeat's products' parameters
4. Current: LOW - Data is 5 years old and may no longer be relevant
5. Cited: HIGH - Furberg, R., Brinton, J., Keating, M., & Ortiz, A. (2016)

Although the dataset is not good enough to produce business recommendations, it is still good enough for practice.


## 2.3 Data Selection:
```
    • dailyActivity_merged.csv
```


***

# STEP 3: PROCESS

## 3.1 Preparing the Environment

The ```numpy, pandas, matplotlib, datetime``` packages are imported and aliased for easy reading.

In [None]:
# import packages and alias
import numpy as np # data arrays
import pandas as pd # data structure and data analysis
import matplotlib.pyplot as plt # data visualization
import datetime as dt # date time

## 3.2 Importing data set
Reading in the selected file.

In [None]:
# read_csv function to read the required CSV file
daily_activity = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")

## 3.3 Data cleaning and manipulation

### 3.3.1 Steps

#### 1. Observe and familiarize with the data: preview the first 10 rows of daily_activity using the head function.

In [None]:
# preview first 10 rows with all columns
daily_activity.head(10)

#### 2. Check for null or missing values: find out whether there are any null or missing values in daily_activity.

In [None]:
# obtain the # of missing data points per column
missing_values_count = daily_activity.isnull().sum()

# look at the # of missing points in all columns
missing_values_count[:]
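
As a minimal sketch of what this check reports when gaps do exist, the toy frame below has one deliberately missing entry. The values are hypothetical and not taken from the dataset; `isnull().sum()` then returns a count per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one deliberately missing step count
toy = pd.DataFrame({
    "Id": [1, 1, 2],
    "TotalSteps": [8000, np.nan, 5000],
})

# isnull() marks each missing cell as True; sum() counts them per column
toy_missing = toy.isnull().sum()
print(toy_missing)
```

A column with a non-zero count would need to be filled or dropped before analysis.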

#### 3. Perform a sanity check of the data: find out the basic information of daily_activity:
* no. of rows and columns
* name of columns
* type of value
* no. of unique IDs

In [None]:
# show basic information of data
daily_activity.info()

# count distinct value of "Id"
unique_id = len(pd.unique(daily_activity["Id"]))
  
print("# of unique Id: " + str(unique_id))

### 3.3.2 Observations

From the above, we note that:

1. There are no null or missing values.

2. Data frame has 940 rows and 15 columns.

3. ActivityDate is wrongly classified as object dtype and has to be converted to datetime64 dtype.

4. There are 33 unique IDs.

### 3.3.3 Manipulation

#### 1. Convert ActivityDate to datetime64 dtype, which displays dates as yyyy-mm-dd. Then print the first 5 rows to confirm the outcome matches our expectation.

In [None]:
# convert "ActivityDate" to datetime64 dtype (displayed as yyyy-mm-dd)
daily_activity["ActivityDate"] = pd.to_datetime(daily_activity["ActivityDate"], format="%m/%d/%Y")

# re-print information to confirm
daily_activity.info()

# print the first 5 rows of "ActivityDate" to confirm
daily_activity["ActivityDate"].head()
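
The conversion can be tried in isolation on a few strings. This is a small sketch using hypothetical sample values written in the same month/day/year layout as the CSV; it shows that `pd.to_datetime` with an explicit `format` accepts unpadded months and days.

```python
import pandas as pd

# Hypothetical sample values in the month/day/year layout used by the CSV
sample_dates = pd.Series(["4/12/2016", "4/13/2016", "5/1/2016"])

# An explicit format removes any ambiguity between month-first and day-first
parsed_dates = pd.to_datetime(sample_dates, format="%m/%d/%Y")

# The dtype is now a datetime64 variant instead of object
print(parsed_dates.dtype)

# Render as yyyy-mm-dd strings to inspect the result
iso_dates = parsed_dates.dt.strftime("%Y-%m-%d").tolist()
print(iso_dates)
```

Passing the format explicitly also makes pandas raise an error on a value that does not fit, rather than guessing.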

#### 2. Create new column DayOfTheWeek holding the day of the week derived from the date (such as Monday, Tuesday etc.) for further analysis.

In [None]:
# create new list of rearranged columns
new_cols = ['Id', 'ActivityDate', 'DayOfTheWeek', 'TotalSteps', 'TotalDistance', 'TrackerDistance', 'LoggedActivitiesDistance', 'VeryActiveDistance', 'ModeratelyActiveDistance', 'LightActiveDistance', 'SedentaryActiveDistance', 'VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes', 'TotalExerciseMinutes', 'TotalExerciseHours', 'Calories']

# reindex function to rearrange columns based on "new_cols"
df_activity = daily_activity.reindex(columns=new_cols)

# print 1st 5 rows to confirm
df_activity.head(5)

# create new column "DayOfTheWeek" to represent day of the week
df_activity["DayOfTheWeek"] = df_activity["ActivityDate"].dt.day_name()

# print 1st 5 rows to confirm
df_activity["DayOfTheWeek"].head(5)

# rename columns from XxxYyy style to xxx_yyy style
df_activity.rename(columns = {"Id":"id", "ActivityDate":"date", "DayOfTheWeek":"day_of_the_week", "TotalSteps":"total_steps", "TotalDistance":"total_dist", "TrackerDistance":"track_dist", "LoggedActivitiesDistance":"logged_dist", "VeryActiveDistance":"very_active_dist", "ModeratelyActiveDistance":"moderate_active_dist", "LightActiveDistance":"light_active_dist", "SedentaryActiveDistance":"sedentary_active_dist", "VeryActiveMinutes":"very_active_mins", "FairlyActiveMinutes":"fairly_active_mins", "LightlyActiveMinutes":"lightly_active_mins", "SedentaryMinutes":"sedentary_mins", "TotalExerciseMinutes":"total_mins","TotalExerciseHours":"total_hours","Calories":"calories"}, inplace = True)

# print column names to confirm
print(df_activity.columns.values)
df_activity.head(5)
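
For reference, the `.dt.day_name()` accessor used above can be checked on a handful of dates. The dates below are hypothetical examples chosen from the study period.

```python
import pandas as pd

# Hypothetical dates from the study period; 12 April 2016 fell on a Tuesday
dates = pd.to_datetime(pd.Series(["2016-04-12", "2016-04-16", "2016-04-17"]))

# day_name() returns the English weekday name for each timestamp
day_names = dates.dt.day_name().tolist()
print(day_names)
```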

#### 3. Create new column total_mins as the sum of very_active_mins, fairly_active_mins, lightly_active_mins and sedentary_mins.

In [None]:
# create new column "total_mins" containing sum of total minutes.
df_activity["total_mins"] = df_activity["very_active_mins"] + df_activity["fairly_active_mins"] + df_activity["lightly_active_mins"] + df_activity["sedentary_mins"]
df_activity["total_mins"].head(5)

#### 4. Create new column total_hours by converting the new column in #3 to number of hours.

In [None]:
# create new column "total_hours" by converting minutes to hours, rounded to the nearest whole hour
df_activity["total_hours"] = round(df_activity["total_mins"] / 60)

# print 1st 5 rows to confirm
df_activity["total_hours"].head(5)
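
The rounding precision changes what the hours column looks like. The sketch below uses hypothetical minute totals to compare rounding to whole hours with rounding to two decimal places.

```python
import pandas as pd

# Hypothetical minute totals for three tracker-days
total_mins = pd.Series([1440, 1218, 991])

# Round to the nearest whole hour
hours_whole = (total_mins / 60).round()

# Keep two decimal places instead
hours_2dp = (total_mins / 60).round(2)

print(hours_whole.tolist())
print(hours_2dp.tolist())
```

Whole hours are easier to read on a chart, while two decimals preserve more detail for statistics.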

### 3.3.4 After data cleaning and manipulation, the data is now ready to be analysed.

# STEP 4: ANALYZE

## 4.1 Statistical analysis

Pulling the statistics of df_activity for analysis:
* count - no. of rows
* mean (average)
* std (standard deviation)
* min and max
* percentiles 25%, 50%, 75%

In [None]:
# pull general statistics
df_activity.describe()

Interpreting statistical findings:

1. On average, users logged 7,638 steps or 5.5 km per day. This is below the 10,000 steps or 8 km per day that the CDC recommends for an adult female to benefit health, weight loss and fitness improvement. [Source: Medical News Today article](https://www.medicalnewstoday.com/articles/how-many-steps-should-you-take-a-day)

2. Users logged on average 991 sedentary minutes per day, about 16.5 hours, which is roughly 81% of the minutes logged.

3. Average calories burned is 2,303 calories per day, roughly equivalent to 0.6 pound. With more information about users, such as age, weight, daily tasks, exercise, hormones and daily calorie intake, we could advise them on how to raise their daily calories burned. [Source: Healthline article](https://www.healthline.com/health/fitness-exercise/how-many-calories-do-i-burn-a-day#Burning-calories)
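
The unit conversions behind these figures can be checked with a few lines of arithmetic. This is a sketch only: the 3,500 kcal per pound figure is a commonly quoted rule of thumb assumed here for illustration, not a value from the dataset.

```python
MINUTES_PER_DAY = 24 * 60
KCAL_PER_POUND = 3500  # assumed rule of thumb, not taken from the dataset

mean_sedentary_mins = 991  # mean sedentary minutes from describe()
mean_calories = 2303       # mean calories from describe()

sedentary_hours = mean_sedentary_mins / 60
sedentary_share_of_day = mean_sedentary_mins / MINUTES_PER_DAY
pounds_equivalent = mean_calories / KCAL_PER_POUND

print(f"{sedentary_hours:.1f} sedentary hours per day")
print(f"{sedentary_share_of_day:.1%} of a 24-hour day")
print(f"{pounds_equivalent:.2f} lb equivalent")
```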

***



## 4.2 Users' Weekly Routine
Creating a new dataset to group steps, distance, calories and sedentary minutes by day of the week.

In [None]:
# First, order the dataset by the specified list of days, because pandas sorts strings alphabetically by default
day_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df_activity['day_of_the_week'] = pd.Categorical(df_activity['day_of_the_week'], categories=day_list, ordered=True)
df_activity = df_activity.sort_values('day_of_the_week')

#Then we create the new dataset
day_steps_distance_sleep_calories = pd.pivot_table(df_activity,
                                                   values=['total_steps','total_dist','calories','sedentary_mins'],
                                                   index=['day_of_the_week'],
                                                   aggfunc="mean"
                                                   )
day_steps_distance_sleep_calories = day_steps_distance_sleep_calories.rename(columns={'total_steps':'avg_steps', 'total_dist':'avg_distance', 'calories':'avg_calories', 'sedentary_mins':'avg_sed_mins'})
day_steps_distance_sleep_calories.head(10)
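
The same kind of per-day average can also be sketched with `groupby`, which is swapped in here as an alternative to `pivot_table`. The toy frame below uses hypothetical values; the ordered categorical keeps the days in calendar order rather than alphabetical order.

```python
import pandas as pd

day_list = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical toy records, deliberately out of calendar order
sample = pd.DataFrame({
    "day_of_the_week": ["Tuesday", "Monday", "Tuesday", "Sunday"],
    "total_steps": [8000, 6000, 10000, 5000],
})

# An ordered categorical makes sorting and grouping follow day_list
sample["day_of_the_week"] = pd.Categorical(
    sample["day_of_the_week"], categories=day_list, ordered=True
)

# observed=True keeps only the days that actually appear in the data
avg_steps = sample.groupby("day_of_the_week", observed=True)["total_steps"].mean()
print(avg_steps)
```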

Interpreting weekly routine findings:

1. Users' performance on Sunday is not better than midweek. Therefore, we suggest users work out every day, not only on the weekend.

2. The performance on Monday, Tuesday and Saturday is better than average. One possible explanation is that taking more steps helps relieve the Monday blues.

# STEP 5: SHARE - Data Visualisation and Findings

In this step, we are creating visualizations and offering our findings based on our analysis.

## 5.1 Frequency of usage across the week

Although dates are continuous, we have transformed them into a categorical variable (Monday, Tuesday etc.). So, we decided to use a bar chart instead of a histogram.

In this bar chart, we are looking at the frequency of FitBit app usage on each day of the week.

1. Users prefer to track their activity on the app from Tuesday to Friday.

2. The frequency drops on Friday and stays low until the following Monday.

In [None]:
# import matplotlib package
import matplotlib.pyplot as plt

# plotting bar chart
plt.style.use("seaborn-darkgrid") # on matplotlib 3.6+ this style is named "seaborn-v0_8-darkgrid"
plt.figure(figsize=(8,4)) # specify size of the chart

df_activity_gp = df_activity.groupby("day_of_the_week")
day_of_the_week_cnt = df_activity_gp["day_of_the_week"].count()
plt.bar(day_list, day_of_the_week_cnt, color = "lightskyblue", edgecolor = "black")

# adding annotations and visuals
plt.xlabel("Day of the week")
plt.ylabel("Frequency")
plt.title("No. of times users logged in app across the week")
plt.grid(True)
plt.show()
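
The counts behind the bars can also be produced without a groupby. This is a sketch on a hypothetical list of logged days: `value_counts` tallies each day and `reindex` puts the result in calendar order, filling days with no records with zero.

```python
import pandas as pd

day_list = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical log of the weekday on which each record was made
logged_days = pd.Series(["Tuesday", "Monday", "Tuesday", "Friday", "Tuesday"])

# Tally each day, then force calendar order and fill absent days with 0
counts = logged_days.value_counts().reindex(day_list, fill_value=0)
print(counts.tolist())
```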

## 5.2 Calories burned

### 5.2.1 Every step taken by users

In this scatter plot, we look at the relationship between steps taken and calories burned. We discovered that:

1. There is a positive correlation.

2. Outliers:
    - Zero steps with zero calories burned: this means the user did not use the product that day, so it says nothing about the relationship between steps and calories.
    - 1 observation of > 35,000 steps with < 3,000 calories burned.

In [None]:
# import matplotlib package
import matplotlib.pyplot as plt

# plotting scatter plot
plt.style.use("dark_background")
plt.figure(figsize=(8,6)) # specify size of the chart
plt.scatter(df_activity.total_steps, df_activity.calories, 
            alpha = 0.8, c = df_activity.calories, 
            cmap = "Spectral")

# add annotations and visuals
# note: these values are the means reported in section 4.1, not medians
mean_calories = 2303
mean_steps = 7637

plt.colorbar(orientation = "vertical")
plt.axvline(mean_steps, color = "Blue", label = "Mean steps: 7637")
plt.axhline(mean_calories, color = "Red", label = "Mean calories burned: 2303")
plt.xlabel("Steps taken")
plt.ylabel("Calories burned")
plt.title("Calories burned for every step taken")
plt.grid(True)
plt.legend()
plt.show()
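
To put a number on the positive correlation, a Pearson coefficient can be computed with `Series.corr`. The sketch below uses hypothetical toy values rather than the real columns; on the actual data the call would be `df_activity["total_steps"].corr(df_activity["calories"])`.

```python
import pandas as pd

# Hypothetical toy values where calories rise with steps
sample = pd.DataFrame({
    "total_steps": [0, 4000, 8000, 12000, 16000],
    "calories": [1500, 1900, 2300, 2600, 3100],
})

# Series.corr defaults to the Pearson correlation coefficient
r = sample["total_steps"].corr(sample["calories"])
print(f"Pearson r = {r:.3f}")
```

A value close to 1 indicates a strong positive linear relationship, while a value close to 0 indicates little or none.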

### 5.2.2 Every hour logged by users

The scatter plot is showing:

1. A weak positive correlation, which shows that an increase in hours logged does not guarantee more calories burned. The reason is probably that the average sedentary time (purple line) is about 16 to 17 hours.

2. Outliers:
   - The same zero-value outliers
   - An unusual red dot at 24 hours with zero calories burned.

In [None]:
# import matplotlib package
import matplotlib.pyplot as plt

# plotting scatter plot
plt.style.use("dark_background")
plt.figure(figsize=(8,6)) # Specify size of the chart
plt.scatter(df_activity.total_hours, df_activity.calories, 
            alpha = 0.8, c = df_activity.calories, 
            cmap = "Spectral")

# adding annotations and visuals
# note: these values are the means reported in section 4.1, not medians
mean_calories = 2303
mean_hours = 20
mean_sedentary = 991 / 60

plt.colorbar(orientation = "vertical")
plt.axvline(mean_hours, color = "Blue", label = "Mean hours logged: 20")
plt.axvline(mean_sedentary, color = "Purple", label = "Mean sedentary hours: 991/60")
plt.axhline(mean_calories, color = "Red", label = "Mean calories burned: 2303")
plt.xlabel("Hours logged")
plt.ylabel("Calories burned")
plt.title("Calories burned for every hour logged")
plt.legend()
plt.grid(True)
plt.show()

### 5.2.3 Percentage of Activity in Minutes

There are some findings in this pie chart:

1. Sedentary minutes take the biggest share at 81.3%. This indicates that users are using the FitBit app to record everyday activities such as the daily commute, inactive movements (moving from one spot to another) etc.

2. The device/app is seldom used to track fitness (e.g. running): only a minor percentage of fairly active activity (1.1%) and very active activity (1.7%) is logged. This is a challenge for FitBit's app design: if the app is a life recorder, FitBit cannot fulfil the goal of increasing users' fitness time; if the app is a fitness coach, FitBit can probably help users increase fitness time.

In [None]:
# import packages
import matplotlib.pyplot as plt
import numpy as np

# calculating total of individual minutes column
very_active_mins = df_activity["very_active_mins"].sum()
fairly_active_mins = df_activity["fairly_active_mins"].sum()
lightly_active_mins = df_activity["lightly_active_mins"].sum()
sedentary_mins = df_activity["sedentary_mins"].sum()

# plotting pie chart
slices = [very_active_mins, fairly_active_mins, lightly_active_mins, sedentary_mins]
labels = ["Very active minutes", "Fairly active minutes", "Lightly active minutes", "Sedentary minutes"]
colours = ["lightcoral", "yellowgreen", "lightskyblue", "darkorange"]
explode = [0, 0, 0, 0.1]
plt.style.use("default")
plt.pie(slices, labels = labels, 
        colors = colours, wedgeprops = {"edgecolor": "black"}, 
        explode = explode, autopct = "%1.1f%%")
plt.title("Percentage of Activity in Minutes")
plt.tight_layout()
plt.show()
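
The percentages that `autopct` prints on the pie can also be computed directly as each total divided by the grand total. The minute totals below are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Hypothetical minute totals per activity level
minute_totals = pd.Series({
    "very_active_mins": 20,
    "fairly_active_mins": 10,
    "lightly_active_mins": 200,
    "sedentary_mins": 970,
})

# Share of each category as a percentage of all logged minutes
share_pct = (minute_totals / minute_totals.sum() * 100).round(1)
print(share_pct)
```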

# STEP 6: ACT

In the final step, we will deliver our insights and provide recommendations based on our analysis.
 
**1. What are the trends identified?**

* The majority of activity logged on the FitBit app is sedentary (81.3% of logged minutes). That is, healthy habits have not yet become a "habit".

* Users usually track their activities less on Sunday, perhaps because they still see fitness as a "job" and feel they need a day off from it.

**2. How could these trends apply to Bellabeat customers?**

* A smart device or recording app alone is far from enough to help users increase fitness time. Users (including me) usually buy a smart device expecting it to give them a perfectly healthy life. Therefore, Bellabeat should not only provide correct and useful information about users' health, habits and fitness data, but also show them how to exercise.

**3. How could these trends help influence Bellabeat marketing strategy?**

* The Bellabeat marketing team should present Bellabeat to potential customers as a fitness coach and expert. For example, if users wear the device in the gym, the app can tell them how and how much to exercise; after a workout, the app can recommend restaurants near their home that serve healthy meals.

* On Sunday, the Bellabeat app can send a prompt notification recommending some 10-minute YouTube yoga exercises.

* Bellabeat marketing strategy should focus on providing a smart service, not a smart device/app.


***

