<a href="https://colab.research.google.com/github/kimkukhwa/Google-Data-Analytics-Capstone-Project/blob/main/google_data_analytics_capstone_bellabeat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Bellabeat](https://bellabeat.com/): How Can a Wellness Technology Company Play It Smart?

## INTRODUCTION:
**About the company**

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products.
Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around
the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with
knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly
positioned itself as a tech-driven wellness company for women.

By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available
through a growing number of online retailers in addition to their own e-commerce channel on their website. The company
has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses on digital
marketing extensively. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and
consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google
Display Network to support campaigns around key marketing dates.

**Questions**

Sršen asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart
devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation. These questions
will guide your analysis:
1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat marketing strategy?

You will produce a report with the following deliverables:
1. A clear summary of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. A summary of your analysis
5. Supporting visualizations and key findings
6. Your top high-level content recommendations based on your analysis

In [69]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## STEP 1: ASK

1.1 Business Task:

    Analyze the FitBit Fitness Tracker Data to gain insight into user behaviour and trends that can inform Bellabeat's marketing strategy
    
1.2 Key Stakeholders:

    1. Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
    2. Sando Mur: Mathematician, Bellabeat’s cofounder and key member of the Bellabeat executive team
    3. Bellabeat marketing analytics team: A team of data analysts guiding Bellabeat's marketing strategy.
    


## STEP 2: PREPARE

2.1 Description on Data Source:

    1. The data is publicly available on [Kaggle: FitBit Fitness Tracker Data](https://www.kaggle.com/datasets/arashnic/fitbit)
    2. This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016
    3. Thirty eligible Fitbit users consented to the submission of personal tracker data
    4. Individual reports can be parsed by export session ID (column A) or timestamp (column B)

2.2 Data Credibility & Limitations:

    1. Reliable - This data could have sample selection bias because it doesn't reflect the overall population
    2. Original - This data is third party information
    3. Comprehensive - This data includes all important information needed to answer the business question
    4. Current - The data is 6 years old
    5. Cited - Unknown
    
    Overall, the data source is of low quality, but that is acceptable here because the analysis is for a capstone project.

2.3 Data Selection:
    
    dailyActivity_merged.csv



## STEP 3: PROCESS


3.1 Import Packages

In [70]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import plotly.express as px
import datetime as dt 

3.2 Import Data

In [71]:
df=pd.read_csv('/dailyActivity_merged.csv')



3.3 EDA

In [72]:
# Check the first 5 rows of dataframe
df.head()

Unnamed: 0,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
0,1503960366,4/12/2016,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,25,13,328,728,1985
1,1503960366,4/13/2016,10735,6.97,6.97,0.0,1.57,0.69,4.71,0.0,21,19,217,776,1797
2,1503960366,4/14/2016,10460,6.74,6.74,0.0,2.44,0.4,3.91,0.0,30,11,181,1218,1776
3,1503960366,4/15/2016,9762,6.28,6.28,0.0,2.14,1.26,2.83,0.0,29,34,209,726,1745
4,1503960366,4/16/2016,12669,8.16,8.16,0.0,2.71,0.41,5.04,0.0,36,10,221,773,1863


In [73]:
# Check the last 5 rows of dataframe
df.tail()

Unnamed: 0,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
935,8877689391,5/8/2016,10686,8.11,8.11,0.0,1.08,0.2,6.8,0.0,17,4,245,1174,2847
936,8877689391,5/9/2016,20226,18.25,18.25,0.0,11.1,0.8,6.24,0.05,73,19,217,1131,3710
937,8877689391,5/10/2016,10733,8.15,8.15,0.0,1.35,0.46,6.28,0.0,18,11,224,1187,2832
938,8877689391,5/11/2016,21420,19.559999,19.559999,0.0,13.22,0.41,5.89,0.0,88,12,213,1127,3832
939,8877689391,5/12/2016,8064,6.12,6.12,0.0,1.82,0.04,4.25,0.0,23,1,137,770,1849


In [74]:
#  Overview of the dataset, including the index dtype and column dtypes, non-null values, and memory usage
print(df.info())
print('rows x columns:',df.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Id                        940 non-null    int64  
 1   ActivityDate              940 non-null    object 
 2   TotalSteps                940 non-null    int64  
 3   TotalDistance             940 non-null    float64
 4   TrackerDistance           940 non-null    float64
 5   LoggedActivitiesDistance  940 non-null    float64
 6   VeryActiveDistance        940 non-null    float64
 7   ModeratelyActiveDistance  940 non-null    float64
 8   LightActiveDistance       940 non-null    float64
 9   SedentaryActiveDistance   940 non-null    float64
 10  VeryActiveMinutes         940 non-null    int64  
 11  FairlyActiveMinutes       940 non-null    int64  
 12  LightlyActiveMinutes      940 non-null    int64  
 13  SedentaryMinutes          940 non-null    int64  
 14  Calories  

In [75]:
# Statistical summary for numerical columns
df.drop('Id',axis=1).describe()

Unnamed: 0,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
count,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0
mean,7637.910638,5.489702,5.475351,0.108171,1.502681,0.567543,3.340819,0.001606,21.164894,13.564894,192.812766,991.210638,2303.609574
std,5087.150742,3.924606,3.907276,0.619897,2.658941,0.88358,2.040655,0.007346,32.844803,19.987404,109.1747,301.267437,718.166862
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3789.75,2.62,2.62,0.0,0.0,0.0,1.945,0.0,0.0,0.0,127.0,729.75,1828.5
50%,7405.5,5.245,5.245,0.0,0.21,0.24,3.365,0.0,4.0,6.0,199.0,1057.5,2134.0
75%,10727.0,7.7125,7.71,0.0,2.0525,0.8,4.7825,0.0,32.0,19.0,264.0,1229.5,2793.25
max,36019.0,28.030001,28.030001,4.942142,21.92,6.48,10.71,0.11,210.0,143.0,518.0,1440.0,4900.0


In [76]:
# Check any na values
df.isna().sum()

Id                          0
ActivityDate                0
TotalSteps                  0
TotalDistance               0
TrackerDistance             0
LoggedActivitiesDistance    0
VeryActiveDistance          0
ModeratelyActiveDistance    0
LightActiveDistance         0
SedentaryActiveDistance     0
VeryActiveMinutes           0
FairlyActiveMinutes         0
LightlyActiveMinutes        0
SedentaryMinutes            0
Calories                    0
dtype: int64

In [77]:
# Unique entries over columns and rows in the object
df.nunique()

Id                           33
ActivityDate                 31
TotalSteps                  842
TotalDistance               615
TrackerDistance             613
LoggedActivitiesDistance     19
VeryActiveDistance          333
ModeratelyActiveDistance    211
LightActiveDistance         491
SedentaryActiveDistance       9
VeryActiveMinutes           122
FairlyActiveMinutes          81
LightlyActiveMinutes        335
SedentaryMinutes            549
Calories                    734
dtype: int64

In [78]:
# Check whether every user tracked their activity every day
df.groupby('ActivityDate')[['Id']].nunique()

Unnamed: 0_level_0,Id
ActivityDate,Unnamed: 1_level_1
4/12/2016,33
4/13/2016,33
4/14/2016,33
4/15/2016,33
4/16/2016,32
4/17/2016,32
4/18/2016,32
4/19/2016,32
4/20/2016,32
4/21/2016,32


In [79]:
# Check who (Id) missed daily tracking
df.groupby('Id')[['ActivityDate']].nunique()

Unnamed: 0_level_0,ActivityDate
Id,Unnamed: 1_level_1
1503960366,31
1624580081,31
1644430081,30
1844505072,31
1927972279,31
2022484408,31
2026352035,31
2320127002,31
2347167796,18
2873212765,31
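
The per-user counts above can be turned into an explicit list of users with gaps. The following is a minimal sketch that assumes the four counts shown are copied by hand from the output; `EXPECTED_DAYS` and `missing_days` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Days tracked per user, copied from a few rows of the groupby output above.
days_tracked = pd.Series(
    {1503960366: 31, 1624580081: 31, 1644430081: 30, 2347167796: 18},
    name="ActivityDate",
)

# Assumption: a complete record covers all 31 unique ActivityDate values
# reported by df.nunique().
EXPECTED_DAYS = 31

# Keep only users with fewer tracked days than the full window.
incomplete = days_tracked[days_tracked < EXPECTED_DAYS]

# Number of days each of those users is missing.
missing_days = (EXPECTED_DAYS - incomplete).rename("MissingDays")
print(missing_days)
```

On the full dataset the same filter could be applied directly to `df.groupby('Id')['ActivityDate'].nunique()`.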


In [80]:
# Check observation for Id=4057192912
df[df.Id==4057192912]

Unnamed: 0,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
377,4057192912,4/12/2016,5394,4.03,4.03,0.0,0.0,0.0,3.94,0.0,0,0,164,1276,2286
378,4057192912,4/13/2016,5974,4.47,4.47,0.0,0.0,0.0,4.37,0.0,0,0,160,1280,2306
379,4057192912,4/14/2016,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,1440,1776
380,4057192912,4/15/2016,3984,2.95,2.95,0.0,0.21,0.26,2.44,0.0,3,6,88,873,1527


In [81]:
# Check where records are 0
df.loc[(df==0).any(axis=1)]

Unnamed: 0,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
0,1503960366,4/12/2016,13162,8.500000,8.500000,0.0,1.88,0.55,6.06,0.00,25,13,328,728,1985
1,1503960366,4/13/2016,10735,6.970000,6.970000,0.0,1.57,0.69,4.71,0.00,21,19,217,776,1797
2,1503960366,4/14/2016,10460,6.740000,6.740000,0.0,2.44,0.40,3.91,0.00,30,11,181,1218,1776
3,1503960366,4/15/2016,9762,6.280000,6.280000,0.0,2.14,1.26,2.83,0.00,29,34,209,726,1745
4,1503960366,4/16/2016,12669,8.160000,8.160000,0.0,2.71,0.41,5.04,0.00,36,10,221,773,1863
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
935,8877689391,5/8/2016,10686,8.110000,8.110000,0.0,1.08,0.20,6.80,0.00,17,4,245,1174,2847
936,8877689391,5/9/2016,20226,18.250000,18.250000,0.0,11.10,0.80,6.24,0.05,73,19,217,1131,3710
937,8877689391,5/10/2016,10733,8.150000,8.150000,0.0,1.35,0.46,6.28,0.00,18,11,224,1187,2832
938,8877689391,5/11/2016,21420,19.559999,19.559999,0.0,13.22,0.41,5.89,0.00,88,12,213,1127,3832


In [82]:
# Check correlation of all columns (Any missing values, non-numeric are automatically excluded)
df.drop('Id',axis=1).corr( )

Unnamed: 0,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
TotalSteps,1.0,0.985369,0.984822,0.181849,0.740115,0.507105,0.692208,0.070505,0.667079,0.498693,0.5696,-0.327484,0.591568
TotalDistance,0.985369,1.0,0.999505,0.188332,0.794582,0.470758,0.662002,0.082389,0.681297,0.462899,0.5163,-0.288094,0.644962
TrackerDistance,0.984822,0.999505,1.0,0.162585,0.794338,0.470277,0.661365,0.074591,0.680816,0.463154,0.514713,-0.289343,0.645313
LoggedActivitiesDistance,0.181849,0.188332,0.162585,1.0,0.150852,0.076527,0.138302,0.154996,0.234443,0.05386,0.102135,-0.046999,0.207595
VeryActiveDistance,0.740115,0.794582,0.794338,0.150852,1.0,0.192986,0.157669,0.046117,0.826681,0.21173,0.059845,-0.061754,0.491959
ModeratelyActiveDistance,0.507105,0.470758,0.470277,0.076527,0.192986,1.0,0.237847,0.005793,0.225464,0.946934,0.162092,-0.221436,0.21679
LightActiveDistance,0.692208,0.662002,0.661365,0.138302,0.157669,0.237847,1.0,0.099503,0.154966,0.220129,0.885697,-0.413552,0.466917
SedentaryActiveDistance,0.070505,0.082389,0.074591,0.154996,0.046117,0.005793,0.099503,1.0,0.008258,-0.022361,0.124185,0.035475,0.043652
VeryActiveMinutes,0.667079,0.681297,0.680816,0.234443,0.826681,0.225464,0.154966,0.008258,1.0,0.31242,0.051926,-0.164671,0.615838
FairlyActiveMinutes,0.498693,0.462899,0.463154,0.05386,0.21173,0.946934,0.220129,-0.022361,0.31242,1.0,0.14882,-0.237446,0.297623


In [83]:
# Check records by Unique Key (Id+Timestamp)
df.groupby(['Id','ActivityDate']).sum()


Unnamed: 0_level_0,Unnamed: 1_level_0,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
Id,ActivityDate,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
1503960366,4/12/2016,13162,8.50,8.50,0.0,1.88,0.55,6.06,0.00,25,13,328,728,1985
1503960366,4/13/2016,10735,6.97,6.97,0.0,1.57,0.69,4.71,0.00,21,19,217,776,1797
1503960366,4/14/2016,10460,6.74,6.74,0.0,2.44,0.40,3.91,0.00,30,11,181,1218,1776
1503960366,4/15/2016,9762,6.28,6.28,0.0,2.14,1.26,2.83,0.00,29,34,209,726,1745
1503960366,4/16/2016,12669,8.16,8.16,0.0,2.71,0.41,5.04,0.00,36,10,221,773,1863
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8877689391,5/5/2016,14055,10.67,10.67,0.0,5.46,0.82,4.37,0.00,67,15,188,1170,3052
8877689391,5/6/2016,21727,19.34,19.34,0.0,12.79,0.29,6.16,0.00,96,17,232,1095,4015
8877689391,5/7/2016,12332,8.13,8.13,0.0,0.08,0.96,6.99,0.00,105,28,271,1036,4142
8877689391,5/8/2016,10686,8.11,8.11,0.0,1.08,0.20,6.80,0.00,17,4,245,1174,2847


In [84]:
# Data sanity check: TotalDistance should match the sum of the four intensity distances
# .copy() avoids the SettingWithCopyWarning when a column is added to the filtered frame
total=df.filter(regex='Distance').copy()
# Sum only the Very/Moderately/Light/Sedentary columns; LoggedActivitiesDistance is not an intensity bucket
total['sum']=total.iloc[:,3:].sum(axis=1).round(1)
total




Unnamed: 0,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,sum
0,8.500000,8.500000,0.0,1.88,0.55,6.06,0.00,8.5
1,6.970000,6.970000,0.0,1.57,0.69,4.71,0.00,7.0
2,6.740000,6.740000,0.0,2.44,0.40,3.91,0.00,6.8
3,6.280000,6.280000,0.0,2.14,1.26,2.83,0.00,6.2
4,8.160000,8.160000,0.0,2.71,0.41,5.04,0.00,8.2
...,...,...,...,...,...,...,...,...
935,8.110000,8.110000,0.0,1.08,0.20,6.80,0.00,8.1
936,18.250000,18.250000,0.0,11.10,0.80,6.24,0.05,18.2
937,8.150000,8.150000,0.0,1.35,0.46,6.28,0.00,8.1
938,19.559999,19.559999,0.0,13.22,0.41,5.89,0.00,19.5
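
Comparing the columns by eye only works for the handful of rows displayed. A possible way to flag mismatches automatically is sketched below, using three rows copied from outputs in this notebook. The 0.05 tolerance is an assumption chosen for illustration, not a value from the dataset documentation.

```python
import numpy as np
import pandas as pd

# Three rows copied from outputs in this notebook (index = original row label).
sample = pd.DataFrame(
    {
        "TotalDistance": [8.50, 5.27, 10.29],
        "VeryActiveDistance": [1.88, 3.48, 4.50],
        "ModeratelyActiveDistance": [0.55, 0.87, 0.38],
        "LightActiveDistance": [6.06, 0.73, 5.41],
        "SedentaryActiveDistance": [0.00, 0.00, 0.00],
    },
    index=[0, 668, 711],
)

intensity_cols = [
    "VeryActiveDistance",
    "ModeratelyActiveDistance",
    "LightActiveDistance",
    "SedentaryActiveDistance",
]
intensity_sum = sample[intensity_cols].sum(axis=1)

# Assumed tolerance: differences up to 0.05 are treated as rounding noise.
TOLERANCE = 0.05
matches = np.isclose(sample["TotalDistance"], intensity_sum, atol=TOLERANCE, rtol=0)

# Rows where the intensity distances do not add up to TotalDistance.
mismatched = sample.loc[~matches]
print(mismatched)
```

Rows that fail the check are candidates for a closer look rather than automatic removal.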


### EDA Summary:
<font color='red'>
    
    1. There are 33 unique Ids instead of the expected 30
    2. ActivityDate dtype needs to be converted to datetime dtype for accurate sorting/filtering purposes 
    3. There are no null or missing values, but not every user tracked their activity every day, and some records are "0" 
    4. A Total Active Minutes column needs to be added
    
</font>
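
The "0" records in point 3 can be separated from genuine low-activity days with a simple rule. This sketch uses the four rows shown earlier for Id 4057192912 and assumes that a day with zero steps and a full 1440 sedentary minutes means the device was not worn. That rule is an assumption made for illustration, not something stated by the data source.

```python
import pandas as pd

# Rows copied from the Id == 4057192912 output above.
rows = pd.DataFrame(
    {
        "Id": [4057192912] * 4,
        "ActivityDate": ["4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016"],
        "TotalSteps": [5394, 5974, 0, 3984],
        "SedentaryMinutes": [1276, 1280, 1440, 873],
        "Calories": [2286, 2306, 1776, 1527],
    }
)

# Assumed rule: zero steps plus 1440 sedentary minutes = device probably not worn.
not_worn = (rows["TotalSteps"] == 0) & (rows["SedentaryMinutes"] == 1440)

# Days that would remain if likely non-wear days were excluded.
worn = rows.loc[~not_worn]
print(rows.loc[not_worn, ["Id", "ActivityDate"]])
```

Whether to drop such days or keep them depends on the question being asked, so the flag is kept separate from the filtering step.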

3.4 Data Transformation

In [85]:
# ActivityDate to datetime
df['ActivityDate']=pd.to_datetime(df['ActivityDate'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Id                        940 non-null    int64         
 1   ActivityDate              940 non-null    datetime64[ns]
 2   TotalSteps                940 non-null    int64         
 3   TotalDistance             940 non-null    float64       
 4   TrackerDistance           940 non-null    float64       
 5   LoggedActivitiesDistance  940 non-null    float64       
 6   VeryActiveDistance        940 non-null    float64       
 7   ModeratelyActiveDistance  940 non-null    float64       
 8   LightActiveDistance       940 non-null    float64       
 9   SedentaryActiveDistance   940 non-null    float64       
 10  VeryActiveMinutes         940 non-null    int64         
 11  FairlyActiveMinutes       940 non-null    int64         
 12  LightlyActiveMinutes  
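
Because the raw dates are written month-first (for example `5/8/2016`), passing an explicit format removes any ambiguity about day and month order. A minimal sketch on three date strings taken from the data:

```python
import pandas as pd

# Date strings as they appear in the raw ActivityDate column.
raw_dates = pd.Series(["4/12/2016", "5/8/2016", "5/12/2016"])

# Explicit month/day/year format, so "5/8/2016" is read as May 8, not August 5.
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

An explicit format also makes the conversion fail loudly if a value in a different layout ever appears.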

In [86]:
# Add TotalActiveMinutes: the sum of all four minutes columns (Very, Fairly, Lightly Active and Sedentary)
min_col=df.filter(regex='Minutes')
df['TotalActiveMinutes']=min_col.sum(axis=1)
df.head()


Unnamed: 0,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories,TotalActiveMinutes
0,1503960366,2016-04-12,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,25,13,328,728,1985,1094
1,1503960366,2016-04-13,10735,6.97,6.97,0.0,1.57,0.69,4.71,0.0,21,19,217,776,1797,1033
2,1503960366,2016-04-14,10460,6.74,6.74,0.0,2.44,0.4,3.91,0.0,30,11,181,1218,1776,1440
3,1503960366,2016-04-15,9762,6.28,6.28,0.0,2.14,1.26,2.83,0.0,29,34,209,726,1745,998
4,1503960366,2016-04-16,12669,8.16,8.16,0.0,2.71,0.41,5.04,0.0,36,10,221,773,1863,1040
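
Since TotalActiveMinutes includes SedentaryMinutes, it may help to see it next to a version that counts only the three active levels. The sketch below uses two rows from the output above; `ActiveMinutes` and `TrackedMinutes` are hypothetical column names introduced only for this comparison.

```python
import pandas as pd

# Minutes columns for rows 0 and 2 of the output above.
sample = pd.DataFrame(
    {
        "VeryActiveMinutes": [25, 30],
        "FairlyActiveMinutes": [13, 11],
        "LightlyActiveMinutes": [328, 181],
        "SedentaryMinutes": [728, 1218],
    },
    index=[0, 2],
)

active_cols = ["VeryActiveMinutes", "FairlyActiveMinutes", "LightlyActiveMinutes"]

# Minutes spent at any non-sedentary intensity (hypothetical column).
sample["ActiveMinutes"] = sample[active_cols].sum(axis=1)

# All recorded minutes, equal to TotalActiveMinutes in this notebook (hypothetical column).
sample["TrackedMinutes"] = sample["ActiveMinutes"] + sample["SedentaryMinutes"]
print(sample[["ActiveMinutes", "TrackedMinutes"]])
```

For row 2 the tracked total is a full 1440-minute day, of which only a small part is non-sedentary.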


## STEP 4: ANALYZE

4.1 User Activity Trend

In [87]:
df.describe()

Unnamed: 0,Id,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories,TotalActiveMinutes
count,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0
mean,4855407000.0,7637.910638,5.489702,5.475351,0.108171,1.502681,0.567543,3.340819,0.001606,21.164894,13.564894,192.812766,991.210638,2303.609574,1218.753191
std,2424805000.0,5087.150742,3.924606,3.907276,0.619897,2.658941,0.88358,2.040655,0.007346,32.844803,19.987404,109.1747,301.267437,718.166862,265.931767
min,1503960000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
25%,2320127000.0,3789.75,2.62,2.62,0.0,0.0,0.0,1.945,0.0,0.0,0.0,127.0,729.75,1828.5,989.75
50%,4445115000.0,7405.5,5.245,5.245,0.0,0.21,0.24,3.365,0.0,4.0,6.0,199.0,1057.5,2134.0,1440.0
75%,6962181000.0,10727.0,7.7125,7.71,0.0,2.0525,0.8,4.7825,0.0,32.0,19.0,264.0,1229.5,2793.25,1440.0
max,8877689000.0,36019.0,28.030001,28.030001,4.942142,21.92,6.48,10.71,0.11,210.0,143.0,518.0,1440.0,4900.0,1440.0


In [88]:
df['ActivityDay']=df['ActivityDate'].dt.day_name()
df.groupby(['ActivityDay'])['Id'].count()


ActivityDay
Friday       126
Monday       120
Saturday     124
Sunday       121
Thursday     147
Tuesday      152
Wednesday    150
Name: Id, dtype: int64

In [89]:
df.groupby(['ActivityDay'])[['TotalSteps','TotalDistance','TotalActiveMinutes','Calories']].mean().sort_values(['TotalSteps','TotalDistance','TotalActiveMinutes','Calories'])




Unnamed: 0_level_0,TotalSteps,TotalDistance,TotalActiveMinutes,Calories
ActivityDay,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Sunday,6933.231405,5.02719,1198.743802,2263.0
Thursday,7405.836735,5.312245,1178.782313,2199.571429
Friday,7448.230159,5.309921,1236.674603,2331.785714
Wednesday,7559.373333,5.488333,1213.213333,2302.62
Monday,7780.866667,5.552917,1257.108333,2324.208333
Tuesday,8125.006579,5.832237,1241.993421,2356.013158
Saturday,8152.975806,5.854677,1208.548387,2354.967742


**Basic Statistical Summary**:
- On average, users took 7,638 steps, covered 5.5 km, and burned 2,304 calories per day. [The average user is Somewhat active (7,500 to 9,999 steps per day)](https://www.10000steps.org.au/articles/healthy-lifestyles/counting-steps/#:~:text=Low%20active%20is%205%2C000%20to,active%20is%20more%20than%2012%2C500).
- On average, the device recorded about 20 hours (1,218.8 minutes) per day, counting every level from sedentary to very active.
- Users recorded the most entries on Tuesday, but took the most steps and covered the most distance on Saturday.
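
Raw record counts per weekday depend on how many times each weekday occurs in the observation window. One way to check this is to divide the counts by the number of calendar dates per weekday. This sketch assumes the window runs from 2016-04-12 to 2016-05-12, the first and last ActivityDate shown above.

```python
import pandas as pd

# Records per weekday, copied from the groupby output above.
records = pd.Series(
    {
        "Monday": 120, "Tuesday": 152, "Wednesday": 150, "Thursday": 147,
        "Friday": 126, "Saturday": 124, "Sunday": 121,
    }
)

# Assumed observation window: first and last ActivityDate in the data.
window = pd.date_range("2016-04-12", "2016-05-12", freq="D")

# How many calendar dates fall on each weekday within the window.
dates_per_weekday = pd.Series(window.day_name()).value_counts()

# Average number of records per calendar date, by weekday.
records_per_date = (records / dates_per_weekday).round(2)
print(records_per_date)
```

With these numbers, every weekday averages roughly 29 to 32 records per calendar date, so part of the mid-week peak in raw counts may come from the window containing five Tuesdays, Wednesdays, and Thursdays but only four of each other weekday.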

4.2 User Logging Trend

In [90]:
df['LoggedDistince%']=df['LoggedActivitiesDistance']/df['TrackerDistance']
df[df['LoggedActivitiesDistance']>0][['TrackerDistance', 'LoggedActivitiesDistance','LoggedDistince%']].head()

Unnamed: 0,TrackerDistance,LoggedActivitiesDistance,LoggedDistince%
668,5.27,1.959596,0.37184
689,7.88,4.081692,0.517981
693,9.08,2.785175,0.306737
707,8.68,3.167822,0.364956
711,9.48,4.869783,0.51369


In [91]:
no_log=df[df['LoggedActivitiesDistance']==0]
no_log.shape

(908, 18)

In [92]:
logged=df[df['LoggedActivitiesDistance']>0]
logged.describe()

Unnamed: 0,Id,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories,TotalActiveMinutes,LoggedDistince%
count,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0
mean,7681637000.0,12042.5,9.147188,8.725625,3.177521,3.57375,0.869687,4.690937,0.007188,71.1875,17.84375,227.78125,870.59375,3305.0,1187.40625,0.368726
std,709250100.0,3382.868224,2.377527,2.132471,1.253865,1.68858,0.522485,1.24671,0.025932,28.723866,9.615543,85.253601,179.824441,643.527827,215.342766,0.115856
min,6775889000.0,6064.0,4.81,4.81,1.959596,0.63,0.16,0.73,0.0,34.0,3.0,47.0,607.0,2105.0,902.0,0.1876
25%,7007744000.0,9035.25,7.165,7.165,2.092147,2.06,0.5825,3.9875,0.0,53.0,11.5,164.25,722.25,2825.5,1015.5,0.28673
50%,7693154000.0,12633.5,9.69,9.07,2.253081,4.29,0.705,4.5,0.0,63.0,16.0,204.5,812.0,3335.5,1060.5,0.368899
75%,8378563000.0,14177.5,10.6225,9.9425,4.86379,4.6775,1.0375,5.295,0.0,76.0,23.25,304.0,1028.25,3787.25,1440.0,0.476286
max,8378563000.0,20067.0,14.3,13.42,4.942142,6.9,2.12,7.95,0.11,137.0,42.0,382.0,1321.0,4236.0,1440.0,0.564105


In [93]:
logged['Id'].nunique()

4

In [94]:
df['LoggedYN'] = df['LoggedActivitiesDistance'].apply(lambda x: "Y" if x>0 else "N")
df[df['LoggedActivitiesDistance']>0].head()

Unnamed: 0,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories,TotalActiveMinutes,ActivityDay,LoggedDistince%,LoggedYN
668,6775888955,2016-04-26,7091,5.27,5.27,1.959596,3.48,0.87,0.73,0.0,42,30,47,1321,2584,1440,Tuesday,0.37184,Y
689,6962181067,2016-04-21,11835,9.71,7.88,4.081692,3.99,2.1,3.51,0.11,53,27,214,708,2179,1002,Thursday,0.517981,Y
693,6962181067,2016-04-25,13239,9.27,9.08,2.785175,3.02,1.68,4.46,0.1,35,31,282,637,2194,985,Monday,0.306737,Y
707,6962181067,2016-05-09,12342,8.72,8.68,3.167822,3.9,1.18,3.65,0.0,43,21,231,607,2105,902,Monday,0.364956,Y
711,7007744171,2016-04-12,14172,10.29,9.48,4.869783,4.5,0.38,5.41,0.0,53,8,355,1024,2937,1440,Tuesday,0.51369,Y


**Logging Activity Trend:**
- 908 entries have no logged Activities Distance; only 32 entries (from 4 users) include one.
- When logged, Activities Distance is about 37% of TrackerDistance on average.
- Users who logged Activities Distance burned 3,305 calories, took 12,043 steps, and covered 9.15 km per day on average. [Those users are Active to Highly Active](https://www.10000steps.org.au/articles/healthy-lifestyles/counting-steps/#:~:text=Low%20active%20is%205%2C000%20to,active%20is%20more%20than%2012%2C500)
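
The activity levels referenced in the summaries can be assigned programmatically. The sketch below assumes the commonly cited step-count bands (under 5,000 sedentary, 5,000 to 7,499 low active, 7,500 to 9,999 somewhat active, 10,000 to 12,499 active, 12,500 and above highly active). Only the "somewhat active" range is quoted directly in this notebook, so treat the other thresholds as assumptions.

```python
import numpy as np
import pandas as pd

# Average daily steps taken from the two summaries in this notebook.
avg_steps = pd.Series({"all users": 7638, "users who logged": 12043})

# Assumed step-count bands; each bin includes its lower edge (right=False).
bins = [0, 5000, 7500, 10000, 12500, np.inf]
labels = ["Sedentary", "Low active", "Somewhat active", "Active", "Highly active"]

levels = pd.cut(avg_steps, bins=bins, labels=labels, right=False)
print(levels)
```

The same call could be applied to `df['TotalSteps']` to label each individual day rather than only the averages.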


## STEP 5: SHARE

5.1 Visualization


In [95]:
#import matplotlib & plotly
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

a. User Activity Tracking Trend - Histogram

In [96]:
fig = px.histogram(df, x='ActivityDay')
fig.show(renderer="colab")


In [97]:
fig = px.histogram(df, x='ActivityDate')
fig.update_layout(bargap=0.2)
fig.show(renderer="colab")

b. User Activity Tracking - Box plot

In [98]:
avg_steps_byday=df.groupby(['ActivityDay'])[['TotalSteps','TotalDistance','Calories','TotalActiveMinutes']].mean()
avg_steps_byday=avg_steps_byday.reset_index()
avg_steps_byday




Unnamed: 0,ActivityDay,TotalSteps,TotalDistance,Calories,TotalActiveMinutes
0,Friday,7448.230159,5.309921,2331.785714,1236.674603
1,Monday,7780.866667,5.552917,2324.208333,1257.108333
2,Saturday,8152.975806,5.854677,2354.967742,1208.548387
3,Sunday,6933.231405,5.02719,2263.0,1198.743802
4,Thursday,7405.836735,5.312245,2199.571429,1178.782313
5,Tuesday,8125.006579,5.832237,2356.013158,1241.993421
6,Wednesday,7559.373333,5.488333,2302.62,1213.213333
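
The table above is sorted alphabetically by day name, which makes weekly patterns harder to read. A possible fix is an ordered categorical, sketched here with the TotalSteps averages rounded to two decimals; `weekday_order` is an illustrative name.

```python
import pandas as pd

# Average steps per weekday, rounded from the output above (alphabetical order).
avg_steps = pd.DataFrame(
    {
        "ActivityDay": ["Friday", "Monday", "Saturday", "Sunday",
                        "Thursday", "Tuesday", "Wednesday"],
        "TotalSteps": [7448.23, 7780.87, 8152.98, 6933.23,
                       7405.84, 8125.01, 7559.37],
    }
)

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# An ordered categorical makes sort_values follow the calendar, not the alphabet.
avg_steps["ActivityDay"] = pd.Categorical(
    avg_steps["ActivityDay"], categories=weekday_order, ordered=True
)
ordered = avg_steps.sort_values("ActivityDay").reset_index(drop=True)
print(ordered)
```

For the Plotly Express charts, the same list can be supplied through the `category_orders` argument, for example `category_orders={"ActivityDay": weekday_order}`.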


In [106]:

fig = px.box(df, x="ActivityDay", y="TotalSteps",color="LoggedYN", title="Steps taken by Day")
fig.show(renderer="colab")


In [107]:
fig = px.box(df, x="ActivityDay", y="TotalDistance",color="LoggedYN", title="Distance moved by Day")
fig.show(renderer="colab")

In [108]:
fig = px.box(df, x="ActivityDay", y="Calories",color="LoggedYN", title="Calories burned by Day")
fig.show(renderer="colab")

c. User Activity Tracking - Pie Chart

In [109]:
#import Go
import plotly.graph_objects as go
# Calculation for pie chart
very_active_mins = df["VeryActiveMinutes"].mean().round(1)
fairly_active_mins = df["FairlyActiveMinutes"].mean().round(1)
lightly_active_mins = df["LightlyActiveMinutes"].mean().round(1)
sedentary_mins = df["SedentaryMinutes"].mean().round(1)
labels = ['very_active', 'fairly_active', 'lightly_active', 'sedentary']
values = [very_active_mins, fairly_active_mins, lightly_active_mins, sedentary_mins]

# plot pie chart
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
colors = ['lightgreen', 'mediumturquoise', 'darkorange', 'gold']
fig.update_traces(hoverinfo='label+percent', textinfo='percent', textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.show(renderer="colab")

d. User Activity Tracking - Heatmap & Scatter Matrix

In [103]:
heatmap_df=df[["TotalSteps","TotalDistance","TotalActiveMinutes","Calories","LoggedYN"]]
heatmap_df

Unnamed: 0,TotalSteps,TotalDistance,TotalActiveMinutes,Calories,LoggedYN
0,13162,8.500000,1094,1985,N
1,10735,6.970000,1033,1797,N
2,10460,6.740000,1440,1776,N
3,9762,6.280000,998,1745,N
4,12669,8.160000,1040,1863,N
...,...,...,...,...,...
935,10686,8.110000,1440,2847,N
936,20226,18.250000,1440,3710,N
937,10733,8.150000,1440,2832,N
938,21420,19.559999,1440,3832,N


In [110]:
fig = px.scatter_matrix(heatmap_df, color="LoggedYN")
fig.show(renderer="colab")

In [111]:
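# Correlate numeric columns only (Id and the derived LoggedDistince% ratio are excluded)
fig = px.imshow(df.select_dtypes(include='number').drop(columns=['Id','LoggedDistince%']).corr())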
fig.show(renderer="colab")

**Insight from Visualization**
- Usage of the smart device is higher mid-week, from Tuesday to Thursday.
- During the mid-week, users take more steps, cover more distance, and burn more calories. Very active users do not track their activities during the weekend.
- 81% of Total Active Minutes is spent at the Sedentary level. The heatmap also shows no correlation between the duration of smart device usage and a more active lifestyle. Moreover, only four users (out of 33) logged their Activity Distance, which suggests that most people use the smart device in daily life rather than specifically to track workouts or calories burned.
- The scatter plot and heatmap confirm that the more steps users take and the farther they move, the more calories they burn.
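
The 81% figure can be reproduced from the mean minutes reported by `df.describe()`. A short arithmetic sketch, with the four means copied from that output:

```python
# Mean minutes per day by intensity level, copied from the df.describe() output.
mean_minutes = {
    "very_active": 21.164894,
    "fairly_active": 13.564894,
    "lightly_active": 192.812766,
    "sedentary": 991.210638,
}

# Total recorded minutes per day across all four levels.
total = sum(mean_minutes.values())

# Share of each level as a percentage of the total, rounded to one decimal.
shares = {
    level: round(100 * minutes / total, 1)
    for level, minutes in mean_minutes.items()
}
print(shares)
```

These are the same proportions the pie chart displays, so the percentages can be quoted without reading them off the figure.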


## STEP 6: ACT

ANSWER THE BUSINESS QUESTIONS:

1. What are the trends identified?

- Usage of the smart device is higher mid-week, from Tuesday to Thursday.
- The smart device is used in daily life rather than to track specific fitness activities.
- The average user is ‘somewhat active’ based on average daily activity.


2. How could these trends apply to Bellabeat customers?

- Bellabeat customers are likely to be more open to fitness ideas they can practice during the week: at lunch hour, after work, or before going to work.
- A variety of low- to mid-level activities can help customers gradually and gently increase their daily activity levels and active time.
- Notifications prompting customers to log their activities can encourage a habit of tracking their fitness routines.


3. How could these trends help influence Bellabeat marketing strategy?

To target the majority of users:

- In terms of product options and design, Bellabeat customers prefer a wearable they can use daily.
- In terms of app push notifications, customers will be interested in learning about various low- to mid-level daily activities they can try on weekdays.
- In the app, customers want to see more low-level activity tracking options such as meditation, breathing exercises, and stretching.

To target the highly active users:

- In terms of product options, highly active customers will prefer products and services that let them track their progress, take on challenges, and compete with other users.
- In terms of app push notifications, these customers will be interested in learning how to improve their performance, specific techniques, and professional communities in their local area.
- In the app, these customers want to see the quality of their rest (or sleep).