In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# [Bellabeat](https://bellabeat.com/): How Can a Wellness Technology Company Play It Smart?

## INTRODUCTION:
**About the company**

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products.
Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around
the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with
knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly
positioned itself as a tech-driven wellness company for women.

By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available
through a growing number of online retailers in addition to their own e-commerce channel on their website. The company
has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses on digital
marketing extensively. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and
consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google
Display Network to support campaigns around key marketing dates.

**Questions**

Sršen asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart
devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation. These questions
will guide your analysis:
1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat marketing strategy?

You will produce a report with the following deliverables:
1. A clear summary of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. A summary of your analysis
5. Supporting visualizations and key findings
6. Your top high-level content recommendations based on your analysis

## STEP 1: ASK

1.1 Business Task:

    Analyze FitBit Fitness Tracker Data to gain insights into user behaviour and trends that can inform Bellabeat's marketing strategy
    
1.2 Key Stakeholders:

    1. Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
    2. Sando Mur: Mathematician, Bellabeat’s cofounder and key member of the Bellabeat executive team
    3. Bellabeat marketing analytics team: A team of data analysts guiding Bellabeat's marketing strategy.
    


## STEP 2: PREPARE

2.1 Description of Data Source:

    1. The data is publicly available on [Kaggle: FitBit Fitness Tracker Data](https://www.kaggle.com/datasets/arashnic/fitbit)
    2. This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016
    3. Thirty eligible Fitbit users consented to the submission of personal tracker data
    4. Individual reports can be parsed by export session ID (column A) or timestamp (column B)

2.2 Data Credibility & Limitations:

    1. Reliable - This data could have sample selection bias because it doesn't reflect the overall population
    2. Original - This data is third party information
    3. Comprehensive - This data includes all important information needed to answer the business question
    4. Current - The data is 6 years old
    5. Cited - Unknown
    
    Overall, the data source is evaluated as bad data, but this is acceptable for now since the analysis is for a capstone project.

2.3 Data Selection:
    
    dailyActivity_merged.csv



## STEP 3: PROCESS


3.1 Import Packages

In [213]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import plotly.express as px
import datetime as dt 

3.2 Import Data

In [214]:
df=pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')



3.3 EDA

In [215]:
# Check the first 5 rows of dataframe
df.head()

In [216]:
# Check the last 5 rows of dataframe
df.tail()

In [217]:
#  Overview of the dataset, including the index dtype and column dtypes, non-null values, and memory usage
print(df.info())
print('rows x columns:',df.shape)

In [218]:
# Statistical summary for numerical columns
df.drop('Id',axis=1).describe()

In [219]:
# Check any na values
df.isna().sum()

In [220]:
# Unique entries over columns and rows in the object
df.nunique()

In [221]:
# Check if every user tracked their activity everyday
df.groupby('ActivityDate')[['Id']].nunique()

In [222]:
# Check who (Id) missed daily tracking
df.groupby('Id')[['ActivityDate']].nunique()
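
The two group-by checks above can be combined to list the users who tracked fewer days than the full observation period. The following is a minimal sketch on a small made-up frame; the Ids and dates are illustrative rather than taken from the dataset, and `days_tracked`, `period_length`, and `incomplete` are hypothetical names.

```python
import pandas as pd

# Toy data (made up for illustration): user 1 tracked 3 days, user 2 only 2.
toy = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2],
    "ActivityDate": ["4/12/2016", "4/13/2016", "4/14/2016",
                     "4/12/2016", "4/14/2016"],
})

# Number of distinct days each user has a record for
days_tracked = toy.groupby("Id")["ActivityDate"].nunique()

# Users with fewer tracked days than the number of distinct dates in the data
period_length = toy["ActivityDate"].nunique()
incomplete = days_tracked[days_tracked < period_length]
print(incomplete)
```

On the real `df`, the same filter would surface Ids worth inspecting one at a time, as done for a single Id in the next cell.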

In [223]:
# Check observation for Id=4057192912
df[df.Id==4057192912]

In [224]:
# Check where records are 0
df.loc[(df==0).any(axis=1)]

In [225]:
# Check correlation of all numeric columns (numeric_only=True skips non-numeric columns such as ActivityDate)
df.drop('Id',axis=1).corr(numeric_only=True)

In [226]:
# Check records by Unique Key (Id+Timestamp)
df.groupby(['Id','ActivityDate']).sum()
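
A group-by sum shows the records per key but does not directly say whether the key is unique. One possible way to test that is `DataFrame.duplicated` on the key columns. The sketch below uses made-up rows, and `dup_mask` is a hypothetical name.

```python
import pandas as pd

# Toy data (made up): the last two rows share the key (Id=2, 4/12/2016).
toy = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "ActivityDate": ["4/12/2016", "4/13/2016", "4/12/2016", "4/12/2016"],
    "TotalSteps": [1000, 2000, 3000, 3000],
})

# Mark every row whose (Id, ActivityDate) pair occurs more than once
dup_mask = toy.duplicated(subset=["Id", "ActivityDate"], keep=False)
print("rows sharing a key:", int(dup_mask.sum()))
print(toy[dup_mask])
```

If the count is zero on the real data, `Id` plus `ActivityDate` can be treated as a unique key.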


In [227]:
# Data sanity check: TotalDistance
total=df.filter(regex='Distance')
total['sum']=total.iloc[:,2:].sum(axis=1).round(1)
total

### EDA Summary:
<font color='red'>
    
    1. 33 Unique Ids instead of 30 Unique Ids
    2. ActivityDate dtype needs to be converted to datetime dtype for accurate sorting/filtering purposes 
    3. No Null or Missing values, but it is observed that users did not have to track their activity daily and some records are "0" 
    4. Add Total Active Minutes

3.4 Data Transformation

In [228]:
# ActivityDate to datetime
df['ActivityDate']=pd.to_datetime(df['ActivityDate'])
df.info()

In [229]:
# Add TotalActiveMinutes
min_col=df.filter(regex='Minutes')
df['TotalActiveMinutes']=min_col.iloc[:,:].sum(axis=1)
df.head()
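
The two transformations can be checked on a tiny frame before running them on the full data. This is a sketch under assumptions: the rows are made up, and the explicit `format="%m/%d/%Y"` assumes the dates are written month/day/year. Note that `regex='Minutes'` matches all four minute columns, so `TotalActiveMinutes` includes sedentary time.

```python
import pandas as pd

# Toy rows (made up) using the minute column names from the dataset
toy = pd.DataFrame({
    "ActivityDate": ["4/12/2016", "4/13/2016"],
    "VeryActiveMinutes": [25, 0],
    "FairlyActiveMinutes": [13, 5],
    "LightlyActiveMinutes": [328, 100],
    "SedentaryMinutes": [728, 1335],
})

# Assumption: dates are month/day/year; an explicit format avoids ambiguity
toy["ActivityDate"] = pd.to_datetime(toy["ActivityDate"], format="%m/%d/%Y")

# Collect the minute columns first, then sum them row by row
minute_cols = toy.filter(regex="Minutes").columns.tolist()
toy["TotalActiveMinutes"] = toy[minute_cols].sum(axis=1)
print(minute_cols)
print(toy[["ActivityDate", "TotalActiveMinutes"]])
```

Listing `minute_cols` before summing makes it visible which columns feed the total.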


## STEP 4: ANALYZE

4.1 User Activity Trend

In [230]:
df.describe()

In [231]:
df['ActivityDay']=df['ActivityDate'].dt.day_name()
df.groupby(['ActivityDay'])['Id'].count()


In [232]:
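# Select several columns with a list (double brackets); tuple-style selection on a groupby is not supported in recent pandas
df.groupby(['ActivityDay'])[['TotalSteps','TotalDistance','TotalActiveMinutes','Calories']].mean().sort_values(['TotalSteps','TotalDistance','TotalActiveMinutes','Calories'])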

**Basic Statistical Summary**:
- On average, users took 7,638 steps, covered 5.5 km, and burned 2,304 calories per day. [The average user is "somewhat active" (7,500 to 9,999 steps per day)](https://www.10000steps.org.au/articles/healthy-lifestyles/counting-steps/#:~:text=Low%20active%20is%205%2C000%20to,active%20is%20more%20than%2012%2C500).
- On average, the device recorded about 20 hours (1,218.7 minutes) per day across all activity levels, from sedentary to very active.
- Users used the device most on Tuesday, but took the most steps and covered the most distance on Saturday.
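
The weekday findings and the minutes-to-hours conversion (1,218.7 / 60 ≈ 20.3 hours) can be reproduced in one grouped table. The sketch below uses made-up rows, and `by_day` and its column names are hypothetical.

```python
import pandas as pd

# Toy data (made up): two Tuesday rows and one Saturday row
toy = pd.DataFrame({
    "ActivityDate": pd.to_datetime(["2016-04-12", "2016-04-19", "2016-04-16"]),
    "TotalSteps": [6000, 8000, 9000],
    "TotalActiveMinutes": [1200, 1260, 1140],
})
toy["ActivityDay"] = toy["ActivityDate"].dt.day_name()

# Record count and averages per weekday in a single pass
by_day = toy.groupby("ActivityDay").agg(
    records=("TotalSteps", "size"),
    avg_steps=("TotalSteps", "mean"),
    avg_minutes=("TotalActiveMinutes", "mean"),
)
by_day["avg_hours"] = (by_day["avg_minutes"] / 60).round(1)

print(by_day)
print("most records:", by_day["records"].idxmax())
print("most steps:", by_day["avg_steps"].idxmax())
```

Keeping the record count next to the averages shows that the day with the most usage need not be the day with the most steps.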

4.2 User Logging Trend

In [233]:
df['LoggedDistince%']=df['LoggedActivitiesDistance']/df['TrackerDistance']
df[df['LoggedActivitiesDistance']>0][['TrackerDistance', 'LoggedActivitiesDistance','LoggedDistince%']].head()

In [234]:
no_log=df[df['LoggedActivitiesDistance']==0]
no_log.shape

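In [235]:
# Define the logged subset before using it
logged=df[df['LoggedActivitiesDistance']>0]
logged.describe()

In [236]:
# Number of unique users who logged an activity distance
logged['Id'].nunique()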

In [237]:
df['LoggedYN'] = df['LoggedActivitiesDistance'].apply(lambda x: "Y" if x>0 else "N")
df[df['LoggedActivitiesDistance']>0].head()

**Logging Activity Trend:**
- 908 entries have no logged Activities Distance; only 32 entries (from 4 users) do.
- When logged, Activities Distance is about 37% of TrackerDistance on average.
- Users who logged Activities Distance burned 3,305 calories, took 12,043 steps, and covered 9.15 km per day on average. [Those users are Active to Highly Active](https://www.10000steps.org.au/articles/healthy-lifestyles/counting-steps/#:~:text=Low%20active%20is%205%2C000%20to,active%20is%20more%20than%2012%2C500)
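
The logged-share figure is a ratio, so a row with a tracker distance of zero would produce `inf` or `NaN`. A hedged sketch of one way to guard against that, on made-up rows; `LoggedShare` is a hypothetical column name, not the one used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two rows with a logged distance, one without,
# and one row where the tracker recorded no distance at all.
toy = pd.DataFrame({
    "Id": [1, 1, 2, 3],
    "TrackerDistance": [10.0, 8.0, 5.0, 0.0],
    "LoggedActivitiesDistance": [4.0, 2.0, 0.0, 0.0],
})

# Replace a zero denominator with NaN so the ratio is NaN rather than inf
tracker = toy["TrackerDistance"].replace(0, np.nan)
toy["LoggedShare"] = toy["LoggedActivitiesDistance"] / tracker

# Summarize only the rows where something was logged
logged = toy[toy["LoggedActivitiesDistance"] > 0]
print("logged entries:", len(logged))
print("users who logged:", logged["Id"].nunique())
print("mean logged share:", round(logged["LoggedShare"].mean(), 3))
```

Filtering to `LoggedActivitiesDistance > 0` before averaging keeps unlogged rows from pulling the mean share toward zero.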


## STEP 5: SHARE

5.1 Visualization


In [238]:
#import matplotlib & plotly
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

a. User Activity Tracking Trend -Histogram

In [239]:
fig = px.histogram(df, x='ActivityDay')
#fig.show()
py.iplot(fig)

In [240]:
fig = px.histogram(df, x='ActivityDate')
fig.update_layout(bargap=0.2)
fig.show()

b. User Activity Tracking - Box plot

In [241]:
avg_steps_byday=df.groupby(['ActivityDay'])[['TotalSteps','TotalDistance','Calories','TotalActiveMinutes']].mean()
avg_steps_byday=avg_steps_byday.reset_index()
avg_steps_byday

In [242]:

fig = px.box(df, x="ActivityDay", y="TotalSteps",color="LoggedYN", title="Steps taken by Day")
fig.show()


In [243]:
fig = px.box(df, x="ActivityDay", y="TotalDistance",color="LoggedYN", title="Distance moved by Day")
fig.show()

In [244]:
fig = px.box(df, x="ActivityDay", y="Calories",color="LoggedYN", title="Calories burned by Day")
fig.show()

c. User Activity Tracking - Pie Chart

In [245]:
#import Go
import plotly.graph_objects as go
# Calculation for pie chart
very_active_mins = df["VeryActiveMinutes"].mean().round(1)
fairly_active_mins = df["FairlyActiveMinutes"].mean().round(1)
lightly_active_mins = df["LightlyActiveMinutes"].mean().round(1)
sedentary_mins = df["SedentaryMinutes"].mean().round(1)
labels = ['very_active', 'fairly_active', 'lightly_active', 'sedentary']
values = [very_active_mins, fairly_active_mins, lightly_active_mins, sedentary_mins]

# plot pie chart
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
colors = ['lightgreen', 'mediumturquoise', 'darkorange', 'gold']
fig.update_traces(hoverinfo='label+percent', textinfo='percent', textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.show()

d. User Activity Tracking - Heatmap & Scatter Matrix

In [246]:
heatmap_df=df[["TotalSteps","TotalDistance","TotalActiveMinutes","Calories","LoggedYN"]]
heatmap_df

In [247]:
fig = px.scatter_matrix(heatmap_df, color="LoggedYN")
fig.show()

In [248]:
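# numeric_only=True keeps date and text columns (ActivityDate, ActivityDay) out of the correlation
fig = px.imshow(df.iloc[:,1:-2].corr(numeric_only=True))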
fig.show()

**Insight from Visualization**
- Usage of the smart device is higher during mid-week from Tuesday to Friday.
- During the mid-week, users take more steps, cover more distance, and burn more calories. Very active users do not track their activities during the weekend.
- 81% of Total Active Minutes is classified as Sedentary. The heatmap also shows no correlation between the duration of smart device usage and a more active lifestyle. Moreover, only four users (out of 33) logged their Activity Distance, so it can be assumed that most people use the smart device in daily life rather than specifically to track workouts or calories burned.
- The scatter plot and heatmap confirm that the more steps users take and the more distance they cover, the more calories they burn.
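
The sedentary share shown in the pie chart can also be computed directly as each level's mean minutes divided by the sum of the means. A minimal sketch with made-up values, so the result is not the 81% reported above; `mean_minutes` and `share` are hypothetical names.

```python
import pandas as pd

# Toy rows (made up), in minutes per day for each activity level
toy = pd.DataFrame({
    "VeryActiveMinutes": [20, 30],
    "FairlyActiveMinutes": [10, 20],
    "LightlyActiveMinutes": [180, 200],
    "SedentaryMinutes": [990, 950],
})

# Mean minutes per level, then each level's share of the total in percent
mean_minutes = toy.mean()
share = (mean_minutes / mean_minutes.sum() * 100).round(1)
print(share)
```

Working from the means mirrors how the pie chart values are built in the notebook.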


## STEP 6: ACT

ANSWER THE BUSINESS QUESTIONS:

1. What are the trends identified?

- Usage of the smart device is higher during mid-week from Tuesday to Friday.
- The smart device is used in daily life rather than being used to track specific fitness activities.
- The average users are ‘somewhat active’ based on their average daily activity. 


2. How could these trends apply to Bellabeat customers?

- Bellabeat customers are more open to fitness ideas that they can practice during the week, during lunch hour, after work, or before going to work.
- Various low- to mid-level activities can help customers gradually and gently increase their daily activity levels and active time.
- Notifications reminding customers to log their activities can encourage a habit of tracking their fitness activities and routines.


3. How could these trends help influence Bellabeat marketing strategy?

To target the majority of users:

- In terms of product options and design, Bellabeat customers prefer a wearable they can use daily.
- In terms of app push notifications, the customers will be interested in learning various low- to mid-level daily activities they can try during the weekdays.
- In the app, the customers want to see more low-level activity fitness tracking options such as meditation, breathing exercises and stretching.

To target the highly active users:

- In terms of product options, highly active customers will prefer a product or service that lets them track their progress, take on challenges, and compete with other users.
- In terms of app push notifications, the customers will be interested in learning how to improve their performance, specific techniques, and professional communities in their local area.
- In the app, the customers want to see the quality of their rest (or sleep).