# <span style="color:#DD3E12 "> <center> Bellabeat: How Can a Wellness Technology Company Play It Smart? </center> </span>

<div style="width:100%;text-align: center;"> <img align=middle src="https://ecomblvd.com/wp-content/uploads/2019/03/bellabeat.png" alt="Heat beating" style="height:100px;margin-top:1rem;"> </div>





# <span style="color:#DD3E02"> 1. Summary </span> <a class="anchor" id="summary_1"></a>
This project serves as the final milestone of the [Google Data Analytics Professional Certificate](https://www.coursera.org/professional-certificates/google-data-analytics). It is a case study on [Bellabeat](https://bellabeat.com), a wellness technology company that manufactures health-focused smart products for women. Bellabeat offers a range of smart devices that collect various health and lifestyle data to empower women with knowledge about their own health and habits. The smart devices work hand in hand with the Bellabeat app to provide users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. Users can thus better understand their current habits and make healthy decisions.

The objective of this study is to analyze consumers' usage data on non-Bellabeat smart devices and determine how it could unlock new growth opportunities for Bellabeat. The insights drawn will be used to develop high level recommendations for Bellabeat's marketing strategy.

In this project, exploratory data analysis (EDA) will be used to investigate trends, patterns, and relationships and to derive insights from the dataset. The analysis follows the Ask, Prepare, Process, Analyze, Share, and Act process and is carried out in the Python programming language.
***

# <span style="color:#DD3E02">  2. Ask Phase </span>  <a class="anchor" id="ask_phase_2"></a>

#### 2.1 Business Task <a class="anchor" id="business_task_2_1"></a>
The aim of this project is to draw insights into how consumers use non-Bellabeat smart devices and develop high level recommendations for Bellabeat's marketing strategy with the following questions:

1. What are some trends in Fitbit smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat marketing strategy?

Stakeholders
* Urška Sršen - Bellabeat Cofounder and Chief Creative Officer 
* Sando Mur - Bellabeat Cofounder and key member of Bellabeat executive team 
* Bellabeat Marketing Analytics team 
***

# <span style="color:#DD3E02"> 3. Prepare Phase </span> <a class="anchor" id="prepare_phase_3"></a> 

#### 3.1 Dataset used: <a class="anchor" id="dataset_used_3_1"></a> 

The [Fitbit Fitness Tracker Data](https://www.kaggle.com/arashnic/fitbit) from the Kaggle web repository will be used for this analysis. 

#### 3.2 Accessibility and privacy of data: <a class="anchor" id="accessibility_and_privacy_of_data_3_2"></a> 
The dataset is confirmed to be open-source and licensed under the CC0: Public Domain. The owner has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. The dataset can be copied, modified, distributed, and used for analysis, even for commercial purposes, all without asking permission.

#### 3.3 Information about our dataset:<a class="anchor" id="information_about_our_dataset_3_3"></a>  
The dataset is generated by respondents to a distributed survey via Amazon Mechanical Turk over 31 days between 03.12.2016 - 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Variation between output represents use of different types of Fitbit trackers and individual tracking behaviors / preferences.  

#### 3.4 Data Organization and verification: <a class="anchor" id="data_organization_and_verification_3_4"></a> 
The dataset consists of 18 CSV files in total, each containing various health and activity metrics tracked by Fitbit. Using an elimination approach to remove irrelevant dataframes from the analysis, a total of 13 dataframes were eliminated because they either duplicate a larger dataframe, have too small a sample size, or contain data that is not meaningful for the analysis. 

Hence, here are the 5 dataframes that will be used for this analysis:

| Table Name | Type | Description |
| --- | --- | --- |
| 1. dailyActivity_merged | Microsoft Excel CSV | Daily Activity over 31 days of 33 IDs. Tracking daily: Steps, Distance, Intensities, Calories |
| 2. hourlyCalories_merged | Microsoft Excel CSV | Hourly Calories burned over 31 days of 33 IDs |
| 3. hourlyIntensities_merged | Microsoft Excel CSV | Hourly total and average intensity over 31 days of 33 IDs |
| 4. hourlySteps_merged | Microsoft Excel CSV | Hourly Steps over 31 days of 33 IDs |
| 5. sleepDay_merged | Microsoft Excel CSV| Daily sleep logs, tracked by: Total count of sleeps a day, Total minutes, Total Time in Bed of 24 IDs |

#### 3.5 Data Integrity and limitations:<a class="anchor" id="data_credibility_and_integrity_3_5"></a> 
The dataset is limited by its small sample size (30 users), which may not represent the entire population and may invalidate conclusions drawn from the analysis. Furthermore, demographic information such as age, gender, and ethnicity, which is crucial for determining the strategy for Bellabeat's target market, was not provided in the dataset.
***

# <span style="color:#DD3E02">  4. Process Phase </span> <a class="anchor" id="process_phase_4"></a> 
The data wrangling, analysis, and visualisation process will be carried out using the Python programming language.


### 4.1 Importing the required libraries <a class="anchor" id="installing_packages_and_opening_libraries_4_1"></a> 
First, the following libraries will be imported for our analysis.

In [1]:
import pandas as pd                         #Data Manipulation and Analysis
import numpy as np                          #Aggregate Functions
import seaborn as sns                       #Visualisation
import matplotlib.pyplot as plt             #Visualisation
import plotly.express as px                 #Interactive Visualisation

### 4.2 Importing and previewing the dataframes <a class="anchor" id="importing_datasets_4_2"></a>
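The loading step itself is not shown in the cell below. A minimal sketch of how the five dataframes could be read is given here, assuming the CSV files carry the table names from Section 3.4 with a `.csv` suffix and sit in a single folder. The helper name `load_fitbit_tables` and the folder layout are hypothetical.

```python
from pathlib import Path

import pandas as pd

# File stems taken from the table in Section 3.4; the ".csv" suffix and the
# single-folder layout are assumptions.
TABLES = {
    "daily_activity": "dailyActivity_merged",
    "hourly_calories": "hourlyCalories_merged",
    "hourly_intensities": "hourlyIntensities_merged",
    "hourly_steps": "hourlySteps_merged",
    "sleep_day": "sleepDay_merged",
}


def load_fitbit_tables(data_dir):
    """Read the five CSV files found in data_dir into a dict of dataframes."""
    data_dir = Path(data_dir)
    return {name: pd.read_csv(data_dir / f"{stem}.csv") for name, stem in TABLES.items()}
```

With the folder that holds your copy of the dataset, `frames = load_fitbit_tables(folder)` followed by `daily_activity = frames["daily_activity"]` (and likewise for the other four names) would provide the variables used in the rest of the notebook.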

In [None]:
# Loading the data into the pandas data frame.


# Displaying the top 5 rows of each dataset
print('\033[1m' + 'daily_activity' + '\033[0m') 
display(daily_activity.head(5))

print('\033[1m' + 'hourly_calories' + '\033[0m')
display(hourly_calories.head(5))

print('\033[1m' + 'hourly_intensities' + '\033[0m')
display(hourly_intensities.head(5))

print('\033[1m' + 'hourly_steps' + '\033[0m')
display(hourly_steps.head(5))

print('\033[1m' + 'sleep_day' + '\033[0m')
display(sleep_day.head(5))

### 4.3 Checking the data information <a class="anchor" id="cleaning_and_formatting_4_4"></a>
Now we will get an overview (number of entries, null values, column names) of the dataframes and check for any incorrect data types.

In [None]:
print('\033[1m' + 'daily_activity' + '\033[0m') #Bolded title 
daily_activity.info()

In [None]:
print('\033[1m' + 'hourly_calories' + '\033[0m') #Bolded title 
hourly_calories.info()

In [None]:
print('\033[1m' + 'hourly_intensities' + '\033[0m') #Bolded title 
hourly_intensities.info()

In [None]:
print('\033[1m' + 'hourly_steps' + '\033[0m') #Bolded title 
hourly_steps.info() 

In [None]:
print('\033[1m' + 'sleep_day' + '\033[0m') #Bolded title 
sleep_day.info()

Notice that the data types of the `ActivityDate`, `ActivityHour`, and `SleepDay` columns are in the object format. We will convert them to the date-time format later on (Section 4.4.3).

### 4.4 Data cleaning and Transformation 
We will begin the data cleaning and transformation process. This involves:

* Identifying and removing duplicates and nulls
* Formatting datatypes 
* Renaming columns
* Sorting

#### 4.4.1 Identifying and dropping duplicates




In [None]:
# Number of duplicates in each dataframe
print("daily_activity =", daily_activity.duplicated().sum())
print("hourly_calories =", hourly_calories.duplicated().sum())
print("hourly_intensities =", hourly_intensities.duplicated().sum())
print("hourly_steps =", hourly_steps.duplicated().sum())
print("sleep_day =", sleep_day.duplicated().sum())

Found 3 duplicates in the `sleep_day` dataframe. 

In [None]:
# Extracting the duplicated rows in sleep_day dataframe
sleep_day.loc[sleep_day.duplicated(), :]

In [None]:
#Dropping the duplicates
# drop_duplicates() returns a new dataframe, so the result is assigned back
sleep_day = sleep_day.drop_duplicates()
sleep_day

Note that we started out with 413 entries in this dataframe (refer to Section 4.3); it now has 410 entries after removing the 3 duplicates. 
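
The entry count only changes if the result of `drop_duplicates()` is kept, because the method returns a new dataframe instead of modifying the original. A small sketch on made-up rows (not the real `sleep_day` data) illustrates the difference:

```python
import pandas as pd

# Toy stand-in for sleep_day: the second and third rows are identical
toy = pd.DataFrame({"Id": [1, 1, 1, 2], "TotalMinutesAsleep": [400, 380, 380, 420]})

toy.drop_duplicates()                # returns a new frame; toy itself is unchanged
rows_without_assignment = len(toy)

deduped = toy.drop_duplicates()      # keeping the result makes the removal stick
rows_with_assignment = len(deduped)

print(rows_without_assignment, rows_with_assignment)
```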

#### 4.4.2 Identifying and dropping nulls
Here we found no nulls within the dataframes, thus the removal of nulls is not needed.

In [None]:
# Total number of missing values
print("daily_activity =", daily_activity.isnull().sum().sum())
print("hourly_calories =", hourly_calories.isnull().sum().sum())
print("hourly_intensities =", hourly_intensities.isnull().sum().sum())
print("hourly_steps =", hourly_steps.isnull().sum().sum())
print("sleep_day =", sleep_day.isnull().sum().sum())

#### 4.4.3 Renaming columns and formatting datatypes
As identified in Section 4.3, the timestamp columns of the respective dataframes are in the 'object' format. We will convert them to the 'date-time' format so that dates are displayed as "yyyy-mm-dd". The `SleepDay` column of the `sleep_day` dataframe will be split into `Date` and `Time` columns so that it can be merged with the `daily_activity` dataframe later.
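
As a rough illustration of the conversion, the sketch below parses two made-up timestamps and splits them into a date and a time. The timestamp layout (`month/day/year hour:minute:second AM/PM`) and the explicit `format` string are assumptions; the notebook's own cell lets pandas infer the format.

```python
import pandas as pd

# Toy stand-in for sleep_day; the timestamp layout below is an assumption
toy_sleep = pd.DataFrame({"Id": [1, 1],
                          "SleepDay": ["4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"]})

# An explicit format avoids ambiguity between month-first and day-first dates
parsed = pd.to_datetime(toy_sleep["SleepDay"], format="%m/%d/%Y %I:%M:%S %p")

toy_sleep["Date"] = parsed.dt.normalize()   # datetime64 at midnight, shown as yyyy-mm-dd
toy_sleep["Time"] = parsed.dt.time          # plain datetime.time objects

print(toy_sleep.dtypes)
```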

In [None]:
# Convert to date-time format
daily_activity['ActivityDate'] = pd.to_datetime(daily_activity['ActivityDate'])

hourly_calories['ActivityHour'] = pd.to_datetime(hourly_calories['ActivityHour'])

hourly_intensities['ActivityHour'] = pd.to_datetime(hourly_intensities['ActivityHour'])

hourly_steps['ActivityHour'] = pd.to_datetime(hourly_steps['ActivityHour'])

sleep_day['Date'] = pd.to_datetime(sleep_day['SleepDay'])
sleep_day['Time'] = pd.to_datetime(sleep_day['SleepDay']).dt.time

# Separating columns into Date and Time
sleep_day = sleep_day[['Id','Date','Time','TotalSleepRecords','TotalMinutesAsleep','TotalTimeInBed']]

In the `daily_activity` dataframe, we will rename the `ActivityDate` column and add a `DayOfWeek` Column to better structure and analyse the data.

In [None]:
# Rename ActivityDate column
daily_activity = daily_activity.rename(columns={'ActivityDate': 'Date'})

In [None]:
#Adding DayOfWeek Column 
daily_activity['DayOfWeek'] = pd.to_datetime(daily_activity['Date']).dt.day_name()

#Shifting DayOfWeek column to the second index of the dataframe
DayOfWeek = daily_activity['DayOfWeek']
daily_activity = daily_activity.drop(columns=['DayOfWeek'])
daily_activity.insert(loc=2, column='DayOfWeek', value=DayOfWeek)

Let's take a look at the transformed dataframes.

In [None]:
print('\033[1m' + 'daily_activity' + '\033[0m')
display(daily_activity.head(5))

print('\033[1m' + 'hourly_calories' + '\033[0m')
display(hourly_calories.head(5))

print('\033[1m' + 'hourly_intensities' + '\033[0m')
display(hourly_intensities.head(5))

print('\033[1m' + 'hourly_steps' + '\033[0m')
display(hourly_steps.head(5))

print('\033[1m' + 'sleep_day' + '\033[0m')
display(sleep_day.head(5))

### 4.5 Merging dataframes
As part of the transformation process, we will merge the `daily_activity` and `sleep_day` dataframes with the `Id` and `Date` column as the primary keys.
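
A left merge silently multiplies rows when a key appears more than once on the right-hand side, which is one reason the duplicates in `sleep_day` matter. The sketch below, on made-up data, uses the `validate` and `indicator` options of `DataFrame.merge` to guard against that and to count how many activity rows found a matching sleep record. Whether the real keys are unique after cleaning is an assumption that `validate` would test.

```python
import pandas as pd

# Toy stand-ins for daily_activity and sleep_day
activity = pd.DataFrame({"Id": [1, 1, 2],
                         "Date": pd.to_datetime(["2016-04-12", "2016-04-13", "2016-04-12"]),
                         "TotalSteps": [5000, 7000, 9000]})
sleep = pd.DataFrame({"Id": [1],
                      "Date": pd.to_datetime(["2016-04-12"]),
                      "TotalMinutesAsleep": [400]})

# validate raises an error if the keys are not unique on either side;
# indicator adds a _merge column showing where each row found a match
merged = activity.merge(sleep, on=["Id", "Date"], how="left",
                        validate="one_to_one", indicator=True)

print(merged["_merge"].value_counts())
```

Rows marked `left_only` are days with activity data but no sleep log, which is why the sleep columns contain missing values after the merge.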

In [None]:
daily_activity_sleep = daily_activity.merge(sleep_day,on=['Id','Date'],how='left')

display(daily_activity_sleep)

We will also merge the `hourly_(Calories, Intensities, Steps)` dataframes using the `Id` and `ActivityHour` columns as primary keys to form a new dataframe.

In [None]:
#Merge hourly dataframes
hourly_metrics = hourly_calories.merge(hourly_intensities,on=['Id','ActivityHour'],how='left')\
.merge(hourly_steps,on=['Id','ActivityHour'],how='left')

#Rename columns
hourly_metrics = hourly_metrics.rename(columns={'ActivityHour': 'DateTime'})
hourly_metrics = hourly_metrics.rename(columns={'StepTotal': 'TotalSteps'})

display(hourly_metrics)

***

# <span style="color:#DD3E02"> 5. Analyze and Share Phase </span> 

### 5.1 Summary Statistics
The `describe()` method provides a holistic overview of the dataframes to draw insights for analysis.

In [None]:
#Exclude Id column (drop() keeps the remaining columns in their original order)
summary_daily_activity = daily_activity_sleep.drop(columns=['Id'])

summary_daily_activity.describe()

In [None]:
#Exclude Id column (drop() keeps the remaining columns in their original order)
summary_hourly_metrics = hourly_metrics.drop(columns=['Id'])

summary_hourly_metrics.describe()

### 5.2 Distribution of the different activity levels
Now we will plot the distribution of the different activity levels by minutes. They are categorized as:

* Lightly Active Minutes
* Fairly Active Minutes
* Very Active Minutes

In [None]:
# Set the style before creating the figure so that it applies to these axes
plt.style.use("seaborn-colorblind")  # named "seaborn-v0_8-colorblind" in Matplotlib 3.6 and later

fig, ax = plt.subplots(1, 3, figsize=(23, 6))
fig.suptitle("Distribution of Activity Types", fontsize=20, fontweight="bold", y=1.03)

# (column, x-axis label, bar colour) for each histogram; None uses the style's default colour
panels = [("LightlyActiveMinutes", "Lightly Active Minutes", None),
          ("FairlyActiveMinutes", "Fairly Active Minutes", "r"),
          ("VeryActiveMinutes", "Very Active Minutes", "g")]

for axis, (col, label, color) in zip(ax, panels):
    mean = daily_activity_sleep[col].mean()
    axis.hist(daily_activity_sleep[col], histtype="bar", color=color, bins=10, edgecolor='k')
    axis.set_xlabel(label, fontsize=15)
    axis.set_ylabel("# of Records", fontsize=15)
    # Mark the mean with a dashed line and label it near the top of the axes
    axis.axvline(mean, color='k', linestyle='dashed', linewidth=1)
    axis.text(mean*1.1, axis.get_ylim()[1]*0.9, 'Mean: {:.2f}'.format(mean))

plt.show()

The histograms above show that the records of 'Lightly Active Minutes' roughly follow a normal distribution, with more occurrences around the mean. Users also spend most of their active time in the Lightly Active category (examples of activities include gardening, cooking, and walking) and less time in the Fairly Active and Very Active categories (for example, high-cardio activities like running). The findings are reasonable given that the average user is probably not an athlete and may be using the device for daily lifestyle activities and to clock occasional mid-to-high intensity activities.

### 5.3 Time Spent (Mins) in each activity level

In [None]:
#Average of activity levels
average_active_minutes = daily_activity_sleep[['VeryActiveMinutes', 'FairlyActiveMinutes',
                                               'LightlyActiveMinutes', 'SedentaryMinutes']].mean()

#Convert into pandas dataframe
activity_level_minutes = pd.DataFrame(average_active_minutes) 
activity_level_minutes.reset_index(inplace=True)
activity_level_minutes = activity_level_minutes.rename(columns = {'index':'ActivityLevel', 0:'AverageMinutes'})

activity_level_minutes.head()

In [None]:
#Plotting the piechart for average time spent in each activity level
fig = px.pie(activity_level_minutes, values='AverageMinutes', names ='ActivityLevel',
             title = "Average total time spent in each activity level")

fig.update_traces(textposition='inside')

fig.show()

From the pie chart, users are seen spending 16.5 hours being sedentary, 3.2 hours of their day being lightly active, 13.6 minutes being fairly active, and 21 minutes being very active daily.

Although users spent 21 minutes on average daily in intense activities, a significant portion of their day is spent being sedentary. This presents a lifestyle concern that has to be addressed, or health conditions could surface in the long run, which defeats the purpose of owning Bellabeat's health and lifestyle devices. 
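
To put the durations above on a common scale, the short calculation below converts them to minutes and expresses each as a share of the total tracked time. It uses only the rounded figures quoted in this section, so the percentages are approximate.

```python
# Average daily durations quoted above, converted to minutes
minutes = {
    "Sedentary": 16.5 * 60,
    "Lightly active": 3.2 * 60,
    "Fairly active": 13.6,
    "Very active": 21,
}

total = sum(minutes.values())
shares = {level: 100 * value / total for level, value in minutes.items()}

for level, share in shares.items():
    print(f"{level}: {share:.1f}%")
```

The sedentary share comes out at roughly 81% of tracked time. The four averages add up to a little over 20 hours instead of 24, presumably because the device is not worn or does not record for the whole day.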

### 5.4 Average calories burned by day of week

In [None]:
days_order = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
calories = daily_activity_sleep.groupby("DayOfWeek")[['Calories']].mean().reindex(days_order)
avg_calories_dow = pd.DataFrame(calories)
avg_calories_dow.reset_index(inplace=True)

display(avg_calories_dow)

In [None]:
sns.set_style("darkgrid")
plt.figure(figsize=(8,6))
sns.set_context("notebook")
ax = sns.barplot(data=avg_calories_dow, x="DayOfWeek", y="Calories", ci=None, palette="RdBu")
plt.title("Average calories burned by day of week", fontsize=15, fontweight="bold")
plt.xlabel("")
plt.ylabel("Average Calories Burned", fontsize=15)
plt.xticks(rotation=45)

ax = plt.gca()

for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()), 
            fontsize=12, color='black', ha='center', va='bottom')

plt.show()

Based on the bar chart above, users burned a consistent amount of calories throughout the week, with the lowest on Thursday. However, the required amount of daily calories burned varies between men and women, and the lack of gender, age, and lifestyle demographics for the sample population means the data does not provide a holistic picture. Nevertheless, according to the [U.S. Department of Health and Human Services](https://health.gov/our-work/nutrition-physical-activity/dietary-guidelines/previous-dietary-guidelines/2015), the average adult woman expends roughly 1,600 to 2,400 calories per day, and the average adult man uses 2,000 to 3,000 calories per day. 

Furthermore, the average sedentary person burns approximately 1800 calories a day. Thus, the **mean of 2307 calories** (refer to Section 5.1) daily is reasonable, as the average user spends most of their time being sedentary while a small subset of very active users could be skewing the mean of the data.

### 5.5 Average calories burned hourly

In [None]:
df = hourly_metrics.groupby(hourly_metrics["DateTime"].dt.hour)["Calories"].mean()
print(df)

In [None]:
fig = px.line(hourly_metrics.groupby(hourly_metrics["DateTime"].dt.hour)["Calories"].mean(),
              title="Average of total calories burned hourly", markers=True, y="Calories")

fig.update_layout(xaxis={'range':[0,23]}, xaxis_title="Time", yaxis_title="Average of total calories Burned")

print("plotly express hovertemplate:", fig.data[0].hovertemplate)
fig.update_traces(hovertemplate='Time of Day: %{x} <br> Average of Total Calories Burned: %{y}') 

fig.show()

According to the [Sleep Foundation](https://www.sleepfoundation.org/how-sleep-works/how-your-body-uses-calories-while-you-sleep), we burn about 50 calories an hour while sleeping, which is reflected in the graph above. There is an obvious trend where users gradually burn more calories from 5am to midday. A slight drop in calories burned from 1-2pm was also observed. This is possibly due to postprandial somnolence (also known as a food coma) or the 'afternoon slump' that usually happens after lunch between 1-3pm, leading to fewer calories burned while being tired. 

Calories burned begin increasing again at 4pm and peak at 6pm, indicating that users could be choosing to work out or commute after work/school hours. A significant decrease is also observed from 7-8pm, followed by a gradual decline, which suggests that users are possibly winding down for bedtime.

### 5.6 Total steps by day of week

In [None]:
steps = daily_activity_sleep.groupby("DayOfWeek")[['TotalSteps']].mean().reindex(days_order)
avg_steps_dow = pd.DataFrame(steps)
avg_steps_dow.reset_index(inplace=True)

display(avg_steps_dow)

In [None]:
sns.set_style("darkgrid")
plt.figure(figsize=(8,6))
sns.set_context("notebook")
sns.boxplot(data=daily_activity_sleep, x="DayOfWeek", y="TotalSteps", palette="tab10", sym="", order=days_order)
plt.title("Total steps clocked by day of week", fontsize=15, fontweight="bold")
plt.xlabel("")
plt.ylabel("Total Steps Clocked", fontsize=15)
plt.xticks(rotation=45)

plt.show()

As observed from the boxplot, users clocked the highest number of steps on Saturdays and the lowest on Sundays, which is likely a rest day for them. The median number of steps taken varies through the week but is rather consistent, hovering in the 6000-7000 range, while the **mean is at 7652 steps**. A mean above the median suggests that the distribution is skewed towards higher values by a subset of very active days.
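
The mean (7652) sits above the 6000-7000 median range quoted above. One way to probe such a gap is to compare the mean, the median, and the sample skewness. The sketch below does this on hypothetical step counts, not on the Fitbit data, to show how a single very active day pulls the mean upwards.

```python
import pandas as pd

# Hypothetical daily step counts with one unusually active day
steps = pd.Series([2000, 5000, 6000, 7000, 7000, 8000, 25000])

print("mean:  ", round(steps.mean()))
print("median:", steps.median())
print("skew:  ", round(steps.skew(), 2))   # a value above 0 indicates a longer right tail
```

Running the same three calls on `daily_activity_sleep["TotalSteps"]` would show whether the real data behaves the same way.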

Based on [MedicineNet](https://www.medicinenet.com/how_many_steps_a_day_is_considered_active/article.htm), here are the classification of activity levels based on the number of steps taken in a day:

* **Sedentary:** Less than 5,000 steps daily
* **Low active:** About 5,000 to 7,499 steps daily
* **Somewhat active:** About 7,500 to 9,999 steps daily
* **Active:** More than 10,000 steps daily
* **Highly active:** More than 12,500 steps daily

The data above signals that the average user in this dataset is classified as somewhat active despite spending a significant amount of their time being sedentary. [MedicineNet](https://www.medicinenet.com/how_many_steps_a_day_is_considered_active/article.htm) also reports that studies have shown improved blood sugar levels, lower blood pressure, and reduced symptoms of depression and anxiety in people who walk between 7,500 and 10,000 steps per day. Nevertheless, there is still work that can be done to encourage users to increase their daily steps and aim for an active lifestyle.

### 5.7 Average steps taken hourly

In [None]:
df = hourly_metrics.groupby(hourly_metrics["DateTime"].dt.hour)["TotalSteps"].mean()
print(df)

In [None]:
fig = px.line(hourly_metrics.groupby(hourly_metrics["DateTime"].dt.hour)["TotalSteps"].mean(), 
              title="Average of total steps clocked hourly", markers=True, y="TotalSteps")

fig.update_layout(xaxis={'range':[0,23]}, xaxis_title="Time of Day", 
yaxis_title="Average of Total Steps Taken")

print("plotly express hovertemplate:", fig.data[0].hovertemplate)
fig.update_traces(hovertemplate='Time of Day: %{x} <br>Average of Total Steps: %{y}') 

fig.show()

This line chart follows a nearly identical pattern to the average calories line chart (Section 5.5) above. Users generally start their day from 5am onwards and reduce the number of steps taken after 7pm. Perhaps we should run a correlation analysis between total calories burned and steps taken to identify a relationship between both variables.

### 5.8 Correlation analysis of calories vs steps

In [None]:
px.defaults.template = "presentation"
px.defaults.color_continuous_scale = px.colors.qualitative.Antique
px.defaults.width = 800
px.defaults.height = 600

fig = px.scatter(x=daily_activity_sleep["TotalSteps"], y=daily_activity_sleep["Calories"],
                 title="Correlation between Total Steps and Calories",
                 labels=dict(x="Total Steps", y="Calories"))

fig.update_layout(xaxis={'range': [0, 32000]})

fig.show()

From the scatter plot, we can observe a positive linear relationship between both variables. This indicates that users burned more calories with higher steps taken. To support this observation, we can use `linregress()` to compute the r value (Pearson's correlation coefficient), which measures the strength of the linear relationship between both variables.

In [None]:
from scipy.stats import linregress
xs = daily_activity_sleep["TotalSteps"]
ys = daily_activity_sleep["Calories"]

res = linregress(xs,ys)
print(res)

As seen from the results, the linear regression has an r value of 0.6, indicating a strong linear relationship between both variables. 

`linregress()` is also a useful function that provides the regression slope, intercept, p value, and standard error of the analysis. For our analysis, the most relevant of these is the regression slope, which measures the steepness of the linear relationship shown by a best fit line. The steeper the line, the larger the effect a change in the x variable has on the y variable. In this case, each additional step users take is associated with an average of 0.08 extra calories expended. The r value of 0.6 should not be taken at face value as proof of a strong relationship between both variables, as the r value only measures the strength of a linear relationship.
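
To make the fields returned by `linregress()` concrete, the sketch below fits a line to synthetic data built with a slope of 0.08 calories per step. The baseline of 1,700 calories and the alternating disturbance are made up for illustration and are not taken from the Fitbit data.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic data: about 0.08 calories per step on top of a 1,700-calorie baseline,
# with a small alternating disturbance (all values are made up)
steps = np.arange(0, 20000, 1000)
calories = 1700 + 0.08 * steps + 50 * (-1.0) ** np.arange(steps.size)

res = linregress(steps, calories)

print(f"slope     = {res.slope:.3f} calories per step")
print(f"intercept = {res.intercept:.0f} calories at zero steps")
print(f"r         = {res.rvalue:.3f}")
print(f"r squared = {res.rvalue ** 2:.3f}")   # share of variance explained by the line
```

Squaring r gives the share of the variance in calories that the fitted line accounts for. For the r value of 0.6 reported above, that is 0.36, so a straight line on steps alone leaves most of the variation in calories unexplained.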

### 5.9 Activity level by distance

In [None]:
# Mean of active distance level
activity_level_dist = daily_activity_sleep[['SedentaryActiveDistance','LightActiveDistance',
                                            'ModeratelyActiveDistance','VeryActiveDistance']].mean()

# Convert into pandas dataframe
active_distance = pd.DataFrame(activity_level_dist) 
active_distance.reset_index(inplace=True)
active_distance = active_distance.rename(columns = {'index':'ActiveDistanceLevel', 0:'AverageActiveDistance'})

active_distance.head()

In [None]:
activity_level_dist = pd.DataFrame(data = daily_activity_sleep, 
columns = ['SedentaryActiveDistance','LightActiveDistance','ModeratelyActiveDistance', 'VeryActiveDistance'])

sns.set_style("darkgrid")
plt.figure(figsize=(10,8))
sns.set_context("notebook")

ax = sns.barplot(x="variable", y="value", data=pd.melt(activity_level_dist), ci=None, palette="dark")
ax.set(xlabel="",ylabel="Average Distance")
plt.title("Average Distance of Activity Levels",fontsize=20)
plt.xticks(rotation=45)

ax = plt.gca()

for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%f' % float(p.get_height()), 
            fontsize=14, color='black', ha='center', va='bottom')
plt.show()

This barchart depicts the average distance users clocked in the respective activity levels:
* Sedentary Active Distance
* Lightly Active Distance
* Moderate Active Distance
* Very Active Distance

The highest distance of 3.35km is clocked in the lightly active level. This further reinforces our assumption in Section 5.2 that users are likely wearing their watches for daily lifestyle activities (e.g. walking, doing chores, gardening). The second highest distance, 1.5km, is clocked in the very active level. The sedentary level clocked the lowest distance, which is almost negligible; this makes sense as users are most likely inactive and not moving.

### 5.10 Average time of sleep activity

In [None]:
daily_activity_sleep['AwakeTimeInbed'] = daily_activity_sleep['TotalTimeInBed'] - daily_activity_sleep['TotalMinutesAsleep']

days_order = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

sleep = daily_activity_sleep.groupby("DayOfWeek")[['TotalTimeInBed', 
'TotalMinutesAsleep', 
'AwakeTimeInbed']].mean().reindex(days_order)

sleep_dow = pd.DataFrame(sleep)
sleep_dow.reset_index(inplace=True)

display(sleep_dow)

In [None]:
sleep_dow.plot(x="DayOfWeek", kind="bar", figsize=(12,6), ylabel="Average of Total Mins")
plt.title("Average Time of Sleep Activity", fontsize=20, fontweight="bold")
plt.xlabel("")
plt.ylabel("Average of Total Mins", fontsize=15)

plt.show()

It is calculated that users sleep for a mean of **419.5 minutes (~7hrs)**, which is consistent across the week and within the healthy range. The highest recorded mean time asleep was on Sundays (~7.5hrs) and the lowest was on Thursdays (~6.7hrs). 

Comparing this chart with section 5.6 (Total steps by day of week), we understand that the lowest total steps on average was also recorded on Sunday. This reinforces our assumption that Sundays are likely a rest day for users. 

### 5.11 Proportion of users with adequate sleep

In [None]:
# Categorizing users based on their amount of sleep
def sleep_grp_if(TotalMinutesAsleep): 
    if (TotalMinutesAsleep > 420) :
        return 'Adequate Sleep'
    else:
        return 'Inadequate Sleep'
    
# .copy() gives an independent dataframe, so adding a column below does not touch sleep_day
sleep_grp = sleep_day[["Id", "Date", "TotalMinutesAsleep"]].copy()
sleep_grp['sleep_grp_if'] = sleep_grp['TotalMinutesAsleep'].apply(sleep_grp_if)
sleep_grp.head()

In [None]:
# Identifying the number of users for each sleep category
sleep_proportion = sleep_grp['sleep_grp_if'].value_counts()
print(sleep_proportion)

In [None]:
#Plotting the piechart 
# Counts and category labels both come from value_counts(), so they always stay in step
fig = px.pie(values=sleep_proportion.values, names=sleep_proportion.index,
             title = "Proportion of users by sleep adequacy")

fig.update_traces(textposition="inside")

fig.show()

The pie chart shows a roughly balanced proportion of users with adequate and inadequate sleep. However, I believe there could be initiatives to encourage more users to get at least 7 hours of sleep.

### 5.12 Distribution of users sleep hours

In [None]:
# Categorizing users based on sleep hours
def sleep_grp_hrs(TotalMinutesAsleep): 
    if (TotalMinutesAsleep <= 420) :
        return 'Less than 7hrs'
    elif (TotalMinutesAsleep <=540):
        return '7hrs to 9hrs'
    else:
        return 'More than 9hrs'

# .copy() gives an independent dataframe, so adding a column below does not touch sleep_day
sleep_distribution = sleep_day[["Id", "Date", "TotalMinutesAsleep"]].copy()
sleep_distribution['sleep_grp_hrs'] = sleep_distribution['TotalMinutesAsleep'].apply(sleep_grp_hrs)
sleep_distribution.head()

In [None]:
sleep_proportion = sleep_distribution['sleep_grp_hrs'].value_counts()
print(sleep_proportion)

In [None]:
X1 = sleep_distribution.loc[sleep_distribution.sleep_grp_hrs == 'Less than 7hrs','TotalMinutesAsleep']
X2 = sleep_distribution.loc[sleep_distribution.sleep_grp_hrs == '7hrs to 9hrs','TotalMinutesAsleep']
X3 = sleep_distribution.loc[sleep_distribution.sleep_grp_hrs == 'More than 9hrs','TotalMinutesAsleep']

kwargs = dict(alpha=0.7, bins=20)

plt.figure(figsize=(14,8))
plt.hist(X1, **kwargs, color='r', label='Less than 7hrs', edgecolor='k')
plt.hist(X2, **kwargs, color='g', label='7hrs to 9hrs', edgecolor='k')
plt.hist(X3, **kwargs, color='b', label='More than 9hrs', edgecolor='k')
plt.title('Distribution of Users Sleep Hours', fontsize=20, fontweight="bold")
plt.xlabel('Sleep Time (Minutes)', fontsize=15)
plt.ylabel('Frequency', fontsize=15)

plt.legend()

plt.show()

Here, we break down sleep duration in a histogram, which shows that the majority of users get approximately **340 - 540 minutes (5.7hrs-9hrs)** of sleep.
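
The same three sleep bands can also be produced without a hand-written function. The sketch below is an alternative to the `apply` approach used above and is not the method of this notebook: it uses `pd.cut` with the 420 and 540 minute thresholds on hypothetical durations.

```python
import numpy as np
import pandas as pd

# Hypothetical sleep durations in minutes
minutes_asleep = pd.Series([300, 420, 421, 540, 541])

# Right-closed bins: up to 420, above 420 up to 540, and above 540
sleep_band = pd.cut(minutes_asleep,
                    bins=[-np.inf, 420, 540, np.inf],
                    labels=["Less than 7hrs", "7hrs to 9hrs", "More than 9hrs"])

print(sleep_band.value_counts(sort=False))
```

Because the bins are closed on the right, a night of exactly 420 minutes falls in the first band, matching the `<= 420` test in the function above.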

### 5.13 Correlation matrix of daily activities

In [None]:
# Creating a dataframe containing correlation coefficients of variables in daily_activity_sleep
total_corr = daily_activity_sleep[["TotalSteps", "TotalDistance", "TrackerDistance", "LoggedActivitiesDistance","VeryActiveDistance", "ModeratelyActiveDistance", "LightActiveDistance", "SedentaryActiveDistance", "VeryActiveMinutes", "FairlyActiveMinutes", "LightlyActiveMinutes", "SedentaryMinutes", "TotalMinutesAsleep", "TotalTimeInBed", "Calories"]].corr()

# plotting the heatmap
fig, ax = plt.subplots(figsize=(15,8))
sns.heatmap(total_corr, annot=True, fmt = '.2f', cmap="viridis")
plt.title("Correlation Heatmap of daily_activity dataset", fontsize = 25)

plt.show()

Finally, we ran a correlation heatmap to provide us an overview on the correlation levels across the variables within the `daily_activity_sleep` dataframe. Some of the relevant variables pairs identified with strong correlation (R > 0.6) are:
* `TotalDistance` and `Calories`
* `VeryActiveDistance` and `VeryActiveMinutes`
* `LightlyActiveMinutes` and `LightActiveDistance`
* `FairlyActiveMinutes` and `ModeratelyActiveDistance`
* (`TotalSteps`, `TotalDistance`, `TrackerDistance`) and `VeryActiveDistance`
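
Pairs like those listed above can also be pulled out of a correlation matrix programmatically rather than read off the heatmap. The sketch below is one possible approach, not the method used in this notebook: `strong_pairs` and the `toy` dataframe are hypothetical, and the toy values are made up. In practice `total_corr` would be passed in instead.

```python
import numpy as np
import pandas as pd

def strong_pairs(corr, threshold=0.6):
    """List each variable pair once whose |r| exceeds the threshold."""
    # Keep only the strict upper triangle so (a, b) and (b, a) are not both reported
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # The comparison also discards any NaN entries left over from the masking
    pairs = pairs[pairs.abs() > threshold]
    return pairs.sort_values(ascending=False)

# Hypothetical toy data (made-up values); column names mirror the notebook's
toy = pd.DataFrame({
    "TotalSteps": [1000, 4000, 7000, 10000, 13000],
    "TotalDistance": [0.7, 2.9, 5.1, 7.2, 9.5],
    "SedentaryMinutes": [900, 1100, 800, 1000, 950],
})
result = strong_pairs(toy.corr())
print(result)
```

Masking to the upper triangle also drops the diagonal, so a variable's perfect correlation with itself is never reported.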

# <span style="color:#DD3E02"> 6. Act Phase </span> 
### 6.1 Key Insights

1. In terms of daily physical activity, users spent the most time (~3.2 hrs) and covered the greatest distance (3.35 km) at the lightly active level.

2. Although users spend 21 minutes a day on average in the Very Active category, 81% of their day is sedentary, which is a concern.

3. The average user burns 2307 calories and clocks 7652 steps per day.

4. Users seem to burn a consistent amount of calories throughout the week, with the highest (2365 calories) on Saturdays and the lowest (2204 calories) on Thursdays.

5. The average user burns the most calories between 5pm and 7pm.

6. The highest number of steps (8125 steps) is clocked on Tuesdays and the lowest (6993 steps) on Sundays.

7. The average user begins their day at 5am and clocks the highest number of steps between 5pm and 7pm. They gradually become less active from 8pm onwards. 

8. There is a strong positive linear relationship between total steps clocked and total calories burned.

9. Users have a consistent sleep schedule, with a mean sleep time of 419.5 minutes (~7 hrs) across the week. The highest mean time asleep was recorded on Sundays (~7.5 hrs) and the lowest on Thursdays (~6.7 hrs).

10. 44.3% of users have inadequate sleep (< 7 hours).

11. At least 5 relevant pairs of variables were found to have a strong correlation (r > 0.6).
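
A share-of-day figure like the sedentary percentage in insight 2 can be derived by dividing the mean minutes at each activity level by the mean of all tracked minutes. The sketch below shows one way this could be done and may differ from the calculation used earlier in this notebook. `sample_activity` is a hypothetical dataframe with made-up values; only the column names come from the dataset.

```python
import pandas as pd

minute_cols = ["VeryActiveMinutes", "FairlyActiveMinutes",
               "LightlyActiveMinutes", "SedentaryMinutes"]

# Hypothetical stand-in for the daily activity data (values are made up)
sample_activity = pd.DataFrame({
    "VeryActiveMinutes": [20, 30, 10],
    "FairlyActiveMinutes": [10, 20, 0],
    "LightlyActiveMinutes": [200, 180, 220],
    "SedentaryMinutes": [700, 800, 900],
})

# Mean minutes per level, then each level's share of all tracked minutes
mean_minutes = sample_activity[minute_cols].mean()
share_of_day = mean_minutes / mean_minutes.sum()
print(share_of_day.round(3))
```

Dividing by tracked minutes rather than a fixed 1440 keeps the shares summing to 1 even on days when the device was not worn around the clock.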

### 6.2 Recommendations
With the insights drawn from our analysis, the following recommendations are proposed to propel Bellabeat's marketing strategy.

### **Drawing In on Market Segments**

First and foremost, Bellabeat should look into further segmenting its products to meet the specific needs of each user through the use of demographic information. Although the company's products are mainly aimed at women, the lack of demographic data limits its ability to target specific products at different market segments. Some of these market segments could consist of: 

1. Teenage Girls

2. Middle-Aged Women

3. Elderly Women

4. Westerners

5. Asians

Data such as age group, ethnicity, region, etc. should be taken into consideration, as lifestyles and habits vary across these demographics. Having this demographic data would enable us to answer questions like: 
1. "What percentage of users are elderly women?"

2. "What is the proportion of westerners using Bellabeat's products?"

This will play a crucial role in identifying how the products are received in different segments, which will ultimately influence how Bellabeat drives its marketing campaigns. 

### **Creating Fun**
Bellabeat could provide rewards and virtual achievements to users within the app when they reach certain milestones. As an Apple Watch user, I personally feel that the virtual rewards feature has encouraged me to clock in extra active minutes to earn various awards within the app, and I believe this concept could be applied to the Bellabeat app as well. 

In addition, Bellabeat could look into gamification, where users redeem vouchers from merchant partners or for Bellabeat's line of products by hitting specific goals and rising through reward tiers. This injects a 'fun' element that spurs users to take charge of their health and lifestyle habits, which could reduce the amount of sedentary time and increase users' daily activity levels.

### **Getting Social**
Bellabeat could incorporate a social sharing feature that allows users to share their daily activity and workouts in the Bellabeat app with their family and friends. In addition, it could include a compete feature that enables users to challenge one another to improve the number of steps taken and calories burned. This way, users would not feel 'alone' in their fitness journey, and Bellabeat could potentially attract new customers into the Bellabeat ecosystem through the influence of existing users.

### **The Friendly Reminder**
To prompt users to adopt a more active lifestyle, Bellabeat could allow users to configure app and device settings that serve as reminders and motivation to achieve their desired lifestyle goals. This could involve a vibration function on the device or push notifications via the app that remind users to get active or keep a consistent bedtime routine. The push notifications could also offer positive reinforcement by showing users the progress they have made throughout the day or week, based on the data collected. 

Examples include: 
1. "Hey there! Another hour has passed, get up and move your muscles!"
2. "Woah! Keep it going! You took 20% more steps than yesterday at this time!"
3. "Ring Ring, it's almost bedtime. Don't forget to practice a consistent sleep schedule."

### **Going Zen**
Finally, Bellabeat could incorporate breathing or mindfulness functions in the app to help users wind down or relieve their anxiety and stress before bedtime. These functions could also be linked with push notifications that remind users to practice breathing and mindfulness activities before their scheduled bedtime. Helpful sleep health articles could also be added to the app to provide tips and insights, such as breathing and relaxation techniques, that would improve one's sleep quality. 
***

##### This case study was completed with some reference to previous projects and Python exploratory data analysis articles. 

If you have found this interesting and insightful, do leave an upvote or a comment. 

I would also be open to constructive feedback on improving my code or analysis. Thank you! :)

In [None]:
# Thank you for the attention :)