# Introduction <br>
### About the Project

This is my Google Data Analytics Certificate Capstone Project. In this case study, I carried out exploratory data analysis on smart device usage data to discover trends, gain insights into user profiles and behaviours, and make recommendations to help guide the marketing strategy of Bellabeat, a tech-driven wellness company. My analysis followed the six steps of data analysis: Ask, Prepare, Process, Analyse, Share and Act. The analysis was carried out in Python.

### About the Company


Bellabeat is a high-tech company that manufactures health-focused smart products for women. The cofounder and Chief Creative Officer, Urška Sršen, believes that analysing smart device fitness data could help unlock new growth opportunities for the company.

# Ask - Define the Business Problem

### Business Tasks

Analyse smart device usage data to understand how consumers use their devices, and identify trends that can be applied to Bellabeat products.

### Stakeholders

1. Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
2. Sando Mur: Mathematician and Bellabeat’s cofounder
3. Bellabeat marketing analytics team

### Bellabeat's Key Products
**Bellabeat app**: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

**Leaf**: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

**Time**: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

**Spring**: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

**Bellabeat membership**: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

### Questions to Answer

1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat marketing strategy?

# Prepare - Select Dataset

### Dataset

Dataset Name: “Fitbit Fitness Tracker Data.” <br>
Source: Open-source dataset available on Kaggle. <br>
Link: [Fitbit Fitness Tracker Data](https://www.kaggle.com/datasets/arashnic/fitbit)

This dataset contains Fitbit users' daily usage data, including minute-level output for physical activity, heart rate and sleep monitoring, as well as daily activity and steps, collected from 30 consenting Fitbit users. The data was generated by respondents to a survey distributed via Amazon Mechanical Turk between 12 March 2016 and 12 May 2016.

### Credibility and Limitations of Dataset

1. Data on Fitbit fitness tracker users is relevant to Bellabeat's business problem.
2. There could be sampling bias in the dataset, which would make it unrepresentative of the entire population.
3. The data lacks demographic information such as age, occupation, location and, most importantly, gender. Analysis based on data collected from users of unknown gender could lead to conclusions that do not apply to Bellabeat products, since they target only female users.
4. The data was collected in 2016. Although it is less than 10 years old, it is still quite outdated, as users' daily activity and fitness habits could have changed, especially since the start of the COVID era. For this reason, this analysis might not apply to markets where strict COVID regulations continue to restrict people's daily activity.
5. The data is not original, as it comes from a third party.

### Data Selection

The data is stored in 18 CSV files in total, of which the following are selected for analysis.

*dailyActivity_merged.csv* <br>
*sleepDay_merged.csv* <br>
*weightLogInfo_merged.csv* <br>

# Process - Clean and Transform Data 

### Set Up Environment for Python and Import Data Files

Import the numpy, pandas, datetime, matplotlib and seaborn packages for data processing and visualisation.

In [None]:
import numpy as np 
import pandas as pd 
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.float_format', lambda x: '%.1f' % x)

In [None]:
# Load the data files of interest and assign them to new variable names.
daily_activity = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
sleep_day = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
weight_log_info = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")

### Data Observation <br>
Take a look at the first 5 rows of each dataset to get familiar with its attributes and data types. Then check for missing and duplicated values. <br>

**Overview**<br>

In [None]:
daily_activity.head()

In [None]:
sleep_day.head()

In [None]:
weight_log_info.head()

In [None]:
# Store these dataframes in a container for easier manipulation.
dfs = {'daily_activity': daily_activity, 'sleep_day': sleep_day, 'weight_log_info': weight_log_info}

In [None]:
for k,v in dfs.items(): 
    print(f'{k} has \033[1m{v.shape[0]}\033[0m rows and \033[1m{v.shape[1]}\033[0m columns')

**Check Data Types**

In [None]:
daily_activity.dtypes

In [None]:
sleep_day.dtypes

In [None]:
weight_log_info.dtypes

**Check null values**

In [None]:
daily_activity.isnull().sum()

In [None]:
sleep_day.isnull().sum()

In [None]:
weight_log_info.isnull().sum()

No attribute has missing values except "Fat" in the weight_log_info table.
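
To judge how serious a gap like this is, the raw null counts can be turned into percentages. The snippet below is a minimal sketch on a made-up stand-in for weight_log_info (the `toy_weight` frame and its values are invented for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for weight_log_info; all values are invented for illustration.
toy_weight = pd.DataFrame({
    "Id": [1, 1, 2, 3],
    "WeightKg": [52.6, 52.4, 61.0, 70.3],
    "Fat": [22.0, np.nan, np.nan, np.nan],
})

# isnull() gives a boolean frame; its column-wise mean is the share of missing values.
missing_pct = toy_weight.isnull().mean() * 100
print(missing_pct.round(1))
```

A column that is mostly empty is usually a candidate for exclusion from the analysis rather than for imputation.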

**Check Number of Unique Participants**



In [None]:
for k,v in dfs.items():
    print(f'{k} data has \033[1m{v.Id.nunique()}\033[0m unique Participants.' )

**Check Number of Total Entries**

In [None]:
for k,v in dfs.items():
    print(f'{k} data has \033[1m{v.Id.count()}\033[0m total entries.' )

**Check Duplicates**

In [None]:
for k, v in dfs.items():
    print(f'{k} data has \033[1m{len(v[v.duplicated()])}\033[0m duplicated rows.' )

**Check Number of Entries by Each Unique User**

Check how many days users have been tracking data. 

In [None]:
daily_activity.groupby('Id')['ActivityDate'].count()

It is observed that User ID 4057192912 only recorded daily_activity data for 4 days.

In [None]:
sleep_day.groupby('Id')['SleepDay'].count()

For sleep_day data, about 30% of participants logged data for fewer than 10 days.

In [None]:
weight_log_info.groupby('Id')['Date'].count()

Only 2 participants frequently logged weight data.
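
Statements such as "about 30% of participants logged fewer than 10 days" can be computed rather than read off the printed output. A minimal sketch, using an invented `toy_sleep` frame in place of the real sleep_day data:

```python
import pandas as pd

# Toy stand-in for sleep_day: four hypothetical users with 12, 3, 15 and 10 logged days.
toy_sleep = pd.DataFrame({
    "Id": [1] * 12 + [2] * 3 + [3] * 15 + [4] * 10,
    "SleepDay": list(range(12)) + list(range(3)) + list(range(15)) + list(range(10)),
})

# Number of logged days per user.
days_logged = toy_sleep.groupby("Id")["SleepDay"].count()

# The mean of a boolean Series is the fraction of True values.
share_below_10 = (days_logged < 10).mean()
print(f"{share_below_10:.0%} of users logged fewer than 10 days")
```

The same pattern applies to any of the three tables by swapping in the relevant dataframe and date column.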

### Data Cleaning and Data Manipulation

**Remove Redundant Columns**<br>

A first glance at these dataframes shows some redundancy in the information provided.

In the "daily_activity" table, the values in "TotalDistance" and "TrackerDistance" are identical for each record, so "TrackerDistance" is dropped to make the table neater. "LoggedActivitiesDistance" and "SedentaryActiveDistance" contain 0 values throughout and will not be used for analysis, so they are also dropped.

In the "weight_log_info" table, the "LogId" column carries no useful information and is also dropped.

In [None]:
daily_activity.drop(['TrackerDistance', 'LoggedActivitiesDistance','SedentaryActiveDistance'], axis = 1, inplace = True)
weight_log_info.drop(['LogId'], axis = 1, inplace = True)
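
Before dropping a column as redundant, the visual impression can be confirmed programmatically. The following is a minimal sketch on invented data (`toy_activity` is a hypothetical stand-in that reuses some column names from daily_activity):

```python
import numpy as np
import pandas as pd

# Toy stand-in for daily_activity; the values are invented for illustration.
toy_activity = pd.DataFrame({
    "TotalDistance": [8.5, 6.97, 0.0, 4.2],
    "TrackerDistance": [8.5, 6.97, 0.0, 4.2],
    "LoggedActivitiesDistance": [0.0, 0.0, 0.0, 0.0],
})

# np.isclose tolerates tiny floating-point differences between the two columns.
same_distance = np.isclose(
    toy_activity["TotalDistance"], toy_activity["TrackerDistance"]
).all()

# True only if every value in the column is exactly zero.
all_zero = (toy_activity["LoggedActivitiesDistance"] == 0).all()

print(same_distance, all_zero)
```

If either check returns `False` on the real data, the differing rows are worth inspecting before the column is removed.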

**Drop Duplicates**

In [None]:
sleep_day.drop_duplicates(keep = 'first', inplace = True)
sleep_day.duplicated().sum() # Check if the duplicates have been successfully removed.

**Deal with Missing Values**

In [None]:
weight_log_info.fillna(0, inplace = True) # Fill null values with 0.
print(weight_log_info.isnull().sum()) # Now no more null values.

**Column Manipulation - Rename, Create New Column** <br>
1. The time information in the "SleepDay" column of the "sleep_day" table is redundant, as it is "12:00:00 AM" for all entries, so it is removed.
2. The "ActivityDate" column in "daily_activity" and the "SleepDay" column in "sleep_day" are both renamed to "Date" for consistency.
3. A new column "TotalActiveMinutes", the sum of very, fairly and lightly active minutes, is created in the "daily_activity" table.

In [None]:
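# Keep only the date part of SleepDay, dropping the "12:00:00 AM" time component.
# Splitting on the space is safer than slicing a fixed number of characters,
# because dates such as 5/1/2016 and 4/12/2016 differ in length.
sleep_day['SleepDay'] = sleep_day['SleepDay'].str.split(' ').str[0]
sleep_day['SleepDay'].head()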

In [None]:
# Standardise "Date" column names.
daily_activity.rename(columns = {'ActivityDate': 'Date'}, inplace = True)
sleep_day.rename(columns = {'SleepDay': 'Date'}, inplace = True)

In [None]:
# Create a new column "TotalActiveMinutes" in the daily_activity table.
daily_activity['TotalActiveMinutes'] = daily_activity['VeryActiveMinutes'] + daily_activity['FairlyActiveMinutes'] + daily_activity['LightlyActiveMinutes']
daily_activity['TotalActiveMinutes'].head()

**Data Transformation** <br>

1. Inspection of the data types shows that the "Date" attribute in "daily_activity", "sleep_day" and "weight_log_info" is of object type, so it is converted to datetime type.
2. The "Id" attribute in each dataframe is converted from int64 type to object type, since it is an identifier rather than a quantity.


In [None]:
# Convert "Date" from object type to datetime type.
for v in dfs.values():
    v['Date'] = pd.to_datetime(v['Date'])

# Convert "Id" from int64 type to object type.
for v in dfs.values():
    v['Id'] = v['Id'].astype(object)
# Check converted results
for k,v in dfs.items():
    print(f'Check {k} table \n {v.dtypes[:2]}\n')
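
As an alternative to trimming the time text by hand, timestamp strings can be parsed directly with an explicit format. This is a minimal sketch, assuming the raw values look like `4/12/2016 12:00:00 AM`; the two sample strings are made up to match that pattern:

```python
import pandas as pd

# Invented sample strings in month/day/year 12-hour-clock form.
raw = pd.Series(["4/12/2016 12:00:00 AM", "5/1/2016 12:00:00 AM"])

# An explicit format avoids ambiguity between day-first and month-first dates.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")

# normalize() resets the time to midnight, leaving only the date information.
dates_only = parsed.dt.normalize()
print(dates_only)
```

Stating the format also makes the conversion fail loudly if an unexpected date layout appears in the data.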

# Analyse - Gain Insights from Data

### Merging Dataframes for Analysis

In [None]:
# Create a new column "WeekDay" to daily_activity, sleep_day and weight_log_info dataframes.
for v in dfs.values():
    v['WeekDay'] = v['Date'].dt.day_name()

In [None]:
#Retrieve current column names of daily_activity table.
daily_activity.columns.values

In [None]:
#Reorder the column index in daily_activity table.
new_cols = ['Id', 'Date', 'WeekDay', 'TotalSteps', 'TotalDistance', 'VeryActiveDistance',
       'ModeratelyActiveDistance', 'LightActiveDistance',
       'VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes',
       'SedentaryMinutes', 'Calories', 'TotalActiveMinutes']
daily_activity = daily_activity.reindex(columns = new_cols)
daily_activity.head()

In [None]:
# Merge daily_activity and sleep_day dataframes
merged_activity_sleep = pd.merge(daily_activity, sleep_day, on = ['Id', 'Date', 'WeekDay'])
merged_activity_sleep.head() # Take a look at the merged table
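
`pd.merge` performs an inner join by default, so rows present in only one table are silently dropped. A quick way to see how much is lost is an outer join with `indicator=True`. The sketch below uses two invented frames (`toy_activity`, `toy_sleep`) rather than the real tables:

```python
import pandas as pd

# Toy stand-ins for daily_activity and sleep_day; values are invented.
toy_activity = pd.DataFrame({
    "Id": [1, 1, 2],
    "Date": pd.to_datetime(["2016-04-12", "2016-04-13", "2016-04-12"]),
    "TotalSteps": [9000, 4000, 12000],
})
toy_sleep = pd.DataFrame({
    "Id": [1, 2, 2],
    "Date": pd.to_datetime(["2016-04-12", "2016-04-12", "2016-04-13"]),
    "TotalMinutesAsleep": [400, 350, 420],
})

# The outer join keeps every row; the _merge column records where each row came from.
check = pd.merge(toy_activity, toy_sleep, on=["Id", "Date"], how="outer", indicator=True)
match_counts = check["_merge"].value_counts()
print(match_counts)
```

Only the rows labelled `both` would survive an inner join, which is worth keeping in mind when interpreting statistics computed on the merged table.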

### Statistical Summary of Processed Data

In [None]:
daily_activity.describe()

In [None]:
sleep_day.describe()

In [None]:
weight_log_info.describe()

**Key Statistical Findings:**
1. The BMI statistics in the weight_log_info table show that the middle 50% of records fall between 24.0 and 25.6, which covers the upper end of the "normal" BMI range (18.5 ~ 24.9) and extends slightly into the "overweight" BMI range (25.0 ~ 29.9). This could suggest that users whose BMI is around the upper bound of "normal" tend to be more active and motivated in using the smart device.
2. 25% of the sleep records show 361 minutes asleep or less, that is, about 6 hours or less.
3. The average number of steps recorded is 7638 per day, and the average calories burned is 2304 per day.
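
The BMI observation can be made more concrete by grouping records into the standard categories. This is a minimal sketch on invented BMI values (`toy_bmi`), assuming the usual cut-offs of 18.5, 25 and 30; the "underweight" and "obese" bands are added here for completeness and are not discussed in the findings:

```python
import pandas as pd

# Invented BMI values for illustration only.
toy_bmi = pd.Series([21.5, 24.0, 24.9, 25.6, 27.3, 47.5])

# right=False makes each bin include its lower edge, e.g. [25.0, 30.0) is "overweight".
bins = [0, 18.5, 25.0, 30.0, float("inf")]
labels = ["underweight", "normal", "overweight", "obese"]
bmi_group = pd.cut(toy_bmi, bins=bins, labels=labels, right=False)

# Count how many records fall into each category.
counts = bmi_group.value_counts()
print(counts)
```

Counting unique users per category, rather than records, would avoid over-weighting the few participants who logged weight frequently.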

# Share - Create Visualisations and Communicate Findings from Analysis

### How Users Use the Smart Device
A stacked bar chart is plotted to visualise the number of users who track daily_activity, sleep_day and weight_log_info data, and how frequently they log it.

In [None]:
# Treat users who logged data on more than 10 days during the tracking period as active,
# and users who logged data on 10 days or fewer as inactive (so that no user is left out).
num_active_user = {'daily_activity': sum(daily_activity.groupby('Id')['Date'].count()>10),
                   'sleep_day': sum(sleep_day.groupby('Id')['Date'].count()>10), 
                   'weight_log_info': sum(weight_log_info.groupby('Id')['Date'].count()>10)}

num_inactive_user = {'daily_activity': sum(daily_activity.groupby('Id')['Date'].count()<=10), 
                     'sleep_day': sum(sleep_day.groupby('Id')['Date'].count()<=10), 
                     'weight_log_info': sum(weight_log_info.groupby('Id')['Date'].count()<=10)}

print(num_active_user)
print(num_inactive_user)

In [None]:
labels = ['daily_activity', 'sleep_day', 'weight_log_info']
active_users = [i for i in num_active_user.values()]
inactive_users = [i for i in num_inactive_user.values()]

width = 0.35       

fig, ax = plt.subplots(figsize = (8, 6))

ax.bar(labels, active_users, width, label='Active User')
ax.bar(labels, inactive_users, width, bottom= active_users,
       label='Inactive User')

ax.set_ylabel('No. of Unique Users', size = 12)
ax.set_title('No. of Users Tracking Wellness Data', size = 14)
ax.legend(fontsize = 12)

plt.show()

A pie chart is plotted to show the share of weight records that were logged manually.

In [None]:
is_manual = sum(weight_log_info['IsManualReport']==True)
is_not_manual = sum(weight_log_info['IsManualReport']==False)

slices = [is_manual, is_not_manual]
labels = 'Manual', 'Not Manual'

fig1, ax1 = plt.subplots(figsize = (6,6))
plt.pie(slices, labels=labels, autopct='%1.1f%%',
        shadow=False, startangle=90)
plt.title('Weight Data Logging Methods') 

plt.show()

The stacked bar chart titled "No. of Users Tracking Wellness Data" shows that participants use the smart device far more to record daily activity than to record sleep and weight data. This could be an indication of motivation.

Users may lack motivation to track sleep data because they tend to take off the smart device before going to sleep. The company can consider improving the design of the smart device so that users feel more comfortable wearing it while sleeping. Another possible solution is to push a notification reminding users to wear it before going to sleep.

As for weight data, there are even fewer entries. The pie chart titled "Weight Data Logging Methods" shows that 61.2% of weight records were entered manually by users, which could suggest that manual input discourages users from tracking weight data, as they tend to forget to record it or find it too troublesome. The company can include a feature in the smart device that lets users schedule weight measurements, so that the device sends reminders on a regular basis, such as weekly or biweekly.

In [None]:
week_order = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']

def get_weekday_rec(data): 
    rec = {'Monday':0, 'Tuesday':0, 'Wednesday':0, 'Thursday':0, 'Friday':0, 'Saturday':0, 'Sunday':0}
    for i in rec:
        rec[i] = sum(data.WeekDay == i)
    return rec

weekday_activity = get_weekday_rec(daily_activity)
weekday_sleep = get_weekday_rec(sleep_day)
weekday_weight = get_weekday_rec(weight_log_info)

In [None]:
labels = week_order
activity_rec = [i for i in weekday_activity.values()]
sleep_rec = [i for i in weekday_sleep.values()]
weight_rec = [i for i in weekday_weight.values()]

width = 0.6      

fig, ax = plt.subplots(figsize = (9, 7))

ax.bar(labels, activity_rec, width, label='Activity Records', color = '#3b75af')
ax.bar(labels, sleep_rec, width, bottom= activity_rec,
       label='Sleep Records', color = '#E676B0')
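# Stack the weight records on top of the combined activity and sleep bars.
activity_plus_sleep = [a + s for a, s in zip(activity_rec, sleep_rec)]
ax.bar(labels, weight_rec, width, bottom= activity_plus_sleep,
       label='Weight Records', color = '#F7ab00')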

ax.set_ylabel('No. of Records', size = 12)
ax.set_title('No. of Wellness Data Tracked over the Week', size = 14)
ax.legend(fontsize = 12)

plt.show()

From the stacked bar chart "No. of Wellness Data Tracked over the Week", it is observed that users are more motivated to track their activity with the smart device from Tuesday to Thursday. Usage drops around the weekend, from Friday to Monday. A special notification can be sent on Fridays to encourage users not to lose momentum and to keep using the device over the weekend.
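
The per-weekday record counts behind such a chart can also be obtained without a helper function, using `value_counts` and `reindex`. A minimal sketch on a handful of invented dates (`toy_dates`):

```python
import pandas as pd

week_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Invented dates: one Tuesday, two Wednesdays and one Sunday in April 2016.
toy_dates = pd.Series(pd.to_datetime(["2016-04-12", "2016-04-13", "2016-04-13", "2016-04-17"]))

# Count records per weekday name, then force Monday-to-Sunday order.
# fill_value=0 keeps weekdays that have no records at all.
weekday_counts = toy_dates.dt.day_name().value_counts().reindex(week_order, fill_value=0)
print(weekday_counts)
```

The resulting Series is already in plotting order, so it can be passed to a bar chart directly.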

### Correlation between Time in Bed and Total Time Asleep

A scatter plot with a fitted regression line is used to investigate the relationship between "TotalTimeInBed" and "TotalMinutesAsleep" in the "sleep_day" data.

In [None]:
sns.lmplot(x='TotalTimeInBed', y = 'TotalMinutesAsleep', data = sleep_day, height = 6, aspect = 1)

It is observed that there is a positive linear relationship between total time asleep and total time in bed. The smart device can remind users to go to bed on time so they can have more sleep.
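
A regression plot shows the direction of a relationship, and a correlation coefficient puts a number on its strength. A minimal sketch using Pearson's r on invented sleep records (`toy_sleep` is a made-up stand-in for sleep_day):

```python
import pandas as pd

# Toy stand-in for sleep_day; the minutes are invented for illustration.
toy_sleep = pd.DataFrame({
    "TotalTimeInBed": [346, 407, 442, 367, 712],
    "TotalMinutesAsleep": [327, 384, 412, 340, 700],
})

# Series.corr computes Pearson's correlation coefficient by default.
r = toy_sleep["TotalTimeInBed"].corr(toy_sleep["TotalMinutesAsleep"])
print(f"Pearson r = {r:.3f}")
```

Values of r close to 1 indicate a strong positive linear relationship, while values close to 0 indicate little or no linear relationship.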

In [None]:
plt.figure(figsize = (8,6))
sns.barplot(x = 'WeekDay', y = 'TotalTimeInBed', data = merged_activity_sleep, order = week_order, color = '#3b75af', ci = None)
plt.show()

It is observed that total time in bed on Sunday is higher than on the other days of the week, probably because Sunday is a rest day.

### Correlation between Total Steps and Calories

In [None]:
sns.lmplot(x='TotalSteps', y = 'Calories', data = daily_activity, height = 6)

There is a positive linear relationship between total steps and calories burned. A useful feature that could encourage users to take more steps is a progress bar on the device showing how many more steps are needed to reach an estimated calorie target.
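
One way such a progress bar could estimate the remaining steps is from a straight-line fit of calories against steps. The sketch below is illustrative only: the data points are invented, and `steps_remaining` is a hypothetical helper, not part of any Bellabeat or Fitbit API:

```python
import numpy as np

# Invented daily totals that follow calories = 1500 + 0.1 * steps exactly.
steps = np.array([2000, 5000, 8000, 11000, 14000], dtype=float)
calories = np.array([1700, 2000, 2300, 2600, 2900], dtype=float)

# Fit a degree-1 polynomial: calories ~ slope * steps + intercept.
slope, intercept = np.polyfit(steps, calories, deg=1)

def steps_remaining(target_calories, steps_so_far):
    """Estimate how many more steps are needed to reach a calorie target."""
    steps_needed = (target_calories - intercept) / slope
    # Never report a negative number once the target has been passed.
    return max(0, round(steps_needed - steps_so_far))

print(steps_remaining(2500, 6000))
```

On real data the fit would be noisy and would differ from person to person, so a per-user model is likely to give more useful estimates than a single line for everyone.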

### Percentage of Sedentary Minutes and Active Minutes

In [None]:
sedentary_minutes = sum(daily_activity['SedentaryMinutes'])
active_minutes = sum(daily_activity['TotalActiveMinutes'])

slices = [sedentary_minutes, active_minutes]
labels = 'Total sedentary minutes', 'Total active minutes' 

fig1, ax1 = plt.subplots(figsize = (6,6))
plt.pie(slices, labels=labels, autopct='%1.1f%%',
        shadow=False, startangle=90)
plt.title('Percentage of Sedentary Minutes and Active Minutes') 

plt.show()

As shown in the pie chart, users are sedentary for 81.3% of the tracked time, which is not good for their health. It could also suggest that the majority of participants have desk-bound jobs that require them to stay seated for long periods.

### Correlation between Total Time Asleep and Sedentary Minutes

In [None]:
sns.lmplot(x='TotalMinutesAsleep', y = 'SedentaryMinutes', data = merged_activity_sleep, height = 6)

The scatter plot shows a negative linear relationship between total minutes asleep and sedentary minutes. This could suggest that users who sleep less tend to be less active, probably due to fatigue and low energy resulting from the lack of sleep.

# Act - Propose Business Recommendations from Analysis

Now we look back to our business questions and make recommendations based on analysis of the Fitbit users' data.

**Target Users**
1. Bellabeat product marketing can put special focus on users whose BMI is near the upper bound of the "normal" range or slightly "overweight", as this group is more motivated to use the smart device for weight tracking and management.
2. Bellabeat marketing can highlight fitness management products to women with desk-bound jobs, as this group tends to have long sedentary time due to the nature of their work and hence needs to schedule fitness routines to maintain a healthy lifestyle.
3. Bellabeat sleep management products can target female users working in fast-paced and stressful environments, as they need to monitor sleep quality to stay energetic during the day.

**Product Features**
1. Bellabeat products can guide users to set personalised fitness goals. The smart device dashboard can display a task progress bar on steps taken and calories burned to help users understand the gap towards a certain target.
2. Bellabeat products can design the notification system to encourage users to continue using the products and prevent loss of motivation. 
    - Push notification to remind users to stick to scheduled bedtime.
    - Remind users on Friday to continue wearing it on weekends.
    - Remind users if they have remained sedentary for a long time.
3. Improve the design of the sleep monitoring device to make it more comfortable to wear during sleep, as the Fitbit users' data suggests that users tend to take off the smart device before going to sleep.

