# Introduction

What strategies can a wellness technology company use to excel?

# Step 1: Ask

**Background**

Bellabeat is a technology company specializing in advanced health-focused products. They offer a range of smart devices designed to track activity, sleep, stress, and reproductive health, aiming to empower women with insights into their own health and habits.

This case study centers on analyzing the fitness data collected by Bellabeat's smart devices to uncover potential growth opportunities for the company. Our focus will be on one of Bellabeat's key products: the Bellabeat app.

The Bellabeat app provides users with comprehensive health data related to their activity, sleep, stress, menstrual cycle, and mindfulness practices. By connecting with Bellabeat's suite of smart wellness products, the app helps users gain a deeper understanding of their habits and make informed, health-conscious decisions.

**Key Stakeholders**

* Urška Sršen, Bellabeat cofounder and Chief Creative Officer
* Sando Mur, Bellabeat cofounder and key member of the Bellabeat executive team
* Bellabeat Marketing Analytics team

**Business Task**

The business task is to analyze user patterns in the use of Bellabeat's smart devices to uncover insights that can refine marketing strategies. In essence:

"How do users interact with our smart devices? Identify trends in the usage of both Bellabeat and non-Bellabeat smart devices to inform and enhance Bellabeat's marketing approach."

# Step 2: Prepare

**Dataset used**

The data source used for this case study is the FitBit Fitness Tracker Data. The dataset is hosted on Kaggle, was made available through Mobius, and was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016.

# Step 3: Process

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
plt.style.use('ggplot')

In [None]:
df=pd.read_csv('/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')

**Data Exploration**

In [None]:
df.shape

We can see that the dataset has 940 rows and 15 columns.

In [None]:
df.columns

Knowing the names of the columns, let's take a quick look at the rows and the data itself, using the pandas .head() function.

In [None]:
df.head(8)

In [None]:
df.info()

We can see that the Id column is an integer, but here it should be a string (object). Why? Because the Id is only an identifier, and we will not perform mathematical operations (sums, multiplications, etc.) on it.

Also, the ActivityDate column is an object and should be a date.

All the other columns seem to have the correct data type.

In [None]:
# convert id to string
df['Id']= df['Id'].astype(str)
# convert Activity date to date format
df['ActivityDate'] = pd.to_datetime(df['ActivityDate'],format="%m/%d/%Y")

In [None]:
df.info()

In [None]:
df.head(10)

Checking column values:
Next, we can get rid of columns that are not relevant to our analysis. First, note the 'TotalDistance' column and the other distance-tracking columns. At first glance, 'TotalDistance' and 'TrackerDistance' appear to hold similar values, but we cannot be sure. We might also assume that 'TrackerDistance' or 'TotalDistance' is the sum of the different "*ActiveDistance" columns; since that may be wrong, we check first.

In [None]:
# Create a new column adding up the "*ActiveDistance" columns to see whether it equals the 'TotalDistance' or the 'TrackerDistance' column
df['sum_distance'] = df['VeryActiveDistance'] + df['ModeratelyActiveDistance'] + df['LightActiveDistance'] + df['SedentaryActiveDistance']

# 'LoggedActivitiesDistance' is 0.0 in most entries, so we filter for the rows where it is greater than 0
df.loc[(df['LoggedActivitiesDistance'] > 0),['TotalDistance','TrackerDistance','LoggedActivitiesDistance','sum_distance']]
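
The visual check above can also be made explicit with a numeric comparison. The sketch below is a minimal illustration on a small made-up frame: the values are hypothetical, not taken from the dataset, and the tolerance of 0.01 is an assumption chosen to absorb rounding in the recorded distances.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows that mimic the distance columns of dailyActivity_merged.csv
sample = pd.DataFrame({
    'TotalDistance': [8.50, 6.97, 5.25],
    'TrackerDistance': [8.50, 6.97, 3.25],
    'VeryActiveDistance': [1.88, 1.57, 0.00],
    'ModeratelyActiveDistance': [0.55, 0.69, 0.00],
    'LightActiveDistance': [6.07, 4.71, 3.25],
    'SedentaryActiveDistance': [0.00, 0.00, 0.00],
})

active_cols = ['VeryActiveDistance', 'ModeratelyActiveDistance',
               'LightActiveDistance', 'SedentaryActiveDistance']
sample['sum_distance'] = sample[active_cols].sum(axis=1)

# Assumed tolerance: distances are rounded floats, so exact equality is too strict
tol = 0.01
total_equals_tracker = np.isclose(sample['TotalDistance'], sample['TrackerDistance'], atol=tol)
total_equals_sum = np.isclose(sample['TotalDistance'], sample['sum_distance'], atol=tol)

# Share of rows where each assumption holds (1.0 would mean it holds everywhere)
print('TotalDistance matches TrackerDistance:', total_equals_tracker.mean())
print('TotalDistance matches the sum of active distances:', total_equals_sum.mean())
```

Running the same comparison on `df` before the columns are renamed would show how often each assumption actually holds in the real data.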

In [None]:
df['TotalMinutes'] = df['VeryActiveMinutes'] + df['FairlyActiveMinutes'] + df['LightlyActiveMinutes'] + df['SedentaryMinutes']

In [None]:
#rename col for ease of use
df.columns = df.columns.str.lower()
df.rename(columns = {'trackerdistance':'tracker_distance','activitydate':'activity_date','totalsteps':'total_steps','totaldistance':'total_distance',
       'loggedactivitiesdistance':'logged_activities_distance', 'veryactivedistance':'very_active_distance',
       'moderatelyactivedistance':'moderately_active_distance', 'lightactivedistance':'light_active_distance',
       'sedentaryactivedistance':'sedentary_active_distance', 'veryactiveminutes':'very_active_minutes',
       'fairlyactiveminutes':'fairly_active_minutes','lightlyactiveminutes':'lightly_active_minutes',
       'sedentaryminutes':'sedentary_minutes'}
         ,inplace=True)

In [None]:
df.columns

In [None]:
day_of_week = df['activity_date'].dt.day_name()
df['day_of_week'] = day_of_week
df['n_day_of_week'] = df['activity_date'].dt.weekday # 0 represents monday, 6 represents sunday


In [None]:
#checking null values in dataset
df.isna().sum()

In [None]:
#Checking for duplicates in dataset
print('Total number of duplicated rows is: ',df.duplicated().sum())

**Subsetting the data**
Now we can select only the columns we will use for our analysis. 


In [None]:
#subset data
df = df[['id', 'activity_date', 'total_steps', 'total_distance',
       #'tracker_distance', 'logged_activities_distance',
       #'very_active_distance', 'moderately_active_distance',
       #'light_active_distance', 'sedentary_active_distance',
       'very_active_minutes', 'fairly_active_minutes',
       'lightly_active_minutes', 'sedentary_minutes', 'calories',
       #'sum_distance','totalminutes', 
       'day_of_week', 'n_day_of_week'
        ]].copy()

In [None]:
df

**Analysis Phase**

In [None]:
# First group the data by the id
id_grp = df.groupby('id')

# Then I look for the average number of steps per user, sorted in descending order
avg_steps_by_id = id_grp['total_steps'].mean().sort_values(ascending=False)

# Conditions that decide which category each user fits into, depending on the average number of steps.
# They are built on the Series (not on a DataFrame) so np.select returns a one-dimensional result.
conditions = [
    (avg_steps_by_id <= 6000),
    (avg_steps_by_id > 6000) & (avg_steps_by_id < 12000),
    (avg_steps_by_id >= 12000)
]

values = ['sedentary','active','very_active'] # Names of the categories, in the same order as the conditions

# After that, I turn the results into a dataframe
id_avg_step = avg_steps_by_id.to_frame()

# Create a column with the numpy function np.select to assign each id a category.
# A string default keeps the result dtype consistent with the string labels.
id_avg_step['activity_level'] = np.select(conditions, values, default='unknown')

# Store the results in a variable to use in the next step
id_activity_level = id_avg_step['activity_level']

# Map each row's id to its category to create the column in our original dataset
df['activity_level'] = df['id'].map(id_activity_level)
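
The same labelling can be written in a single pass over the rows. The sketch below is one possible alternative, not the notebook's method: `groupby(...).transform('mean')` broadcasts each user's average back onto that user's rows, and `np.select` applies the same three thresholds. The helper name `add_activity_level` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def add_activity_level(frame):
    """Return a copy of frame with an 'activity_level' column based on each id's mean steps."""
    out = frame.copy()
    # Per-row value of the user's average steps
    avg = out.groupby('id')['total_steps'].transform('mean')
    conditions = [avg <= 6000, (avg > 6000) & (avg < 12000), avg >= 12000]
    labels = ['sedentary', 'active', 'very_active']
    out['activity_level'] = np.select(conditions, labels, default='unknown')
    return out

# Hypothetical data: three users with two days each
sample = pd.DataFrame({
    'id': ['a', 'a', 'b', 'b', 'c', 'c'],
    'total_steps': [3000, 5000, 8000, 10000, 13000, 15000],
})
labelled = add_activity_level(sample)
print(labelled)
```

Because the function returns a copy, the original frame is left untouched, which makes the step easy to rerun while experimenting with different thresholds.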

# Step 4: Analyze

In [None]:
print('Number of unique values in id column:',df['id'].nunique())
print()
print('List of id values:',df['id'].unique())

In [None]:
print('How many times each id appear in the dataset?')
print(df['id'].value_counts())

Now let's check the date column: the minimum date, the maximum date, the number of days between them, and the number of unique dates.



In [None]:
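print('The min date is:',df['activity_date'].min())
print('The max date is:',df['activity_date'].max())
print('The number of days between them is:',(df['activity_date'].max() - df['activity_date'].min()).days)
print('The number of unique dates is:',df['activity_date'].nunique())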

In [None]:
# First we use the describe() function to see some statistics
df.describe()

# Step 5: Share


Correlation between steps and calories
What is the correlation between the number of steps taken and the number of calories burned?

In [None]:
ax =sns.scatterplot(x='total_steps', y='calories', data=df,hue='activity_level')

#handles, labels = ax.get_legend_handles_labels()
#plt.legend(handles, day_of_week, fontsize=7)
plt.title('Correlation Calories vs. Steps')

plt.show()

The scatterplot shows a positive correlation: the more steps taken, the more calories burned. The dots are colored by the activity_level category, so we can see which group each point belongs to.
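
The strength of that relationship can also be quantified rather than read off the plot. As a minimal sketch, `Series.corr` computes the Pearson coefficient by default. The sample values below are hypothetical and only illustrate the call, so the coefficient for the real data has to be computed on `df` itself.

```python
import pandas as pd

# Hypothetical daily records, not taken from the dataset
sample = pd.DataFrame({
    'total_steps': [2000, 5000, 8000, 11000, 15000],
    'calories': [1600, 1900, 2300, 2500, 3200],
})

# Pearson correlation: close to 1 means a strong positive linear relationship
r = sample['total_steps'].corr(sample['calories'])
print(f'Pearson r between steps and calories: {r:.2f}')
```

On the notebook's data the equivalent call would be `df['total_steps'].corr(df['calories'])`.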

# Average number of steps per day

In [None]:
day_of_week = ['Monday','Tuesday','Wednesday','Thursday', 'Friday','Saturday','Sunday']
fig, ax = plt.subplots(1,1,figsize=(9,6))

day_grp = df.groupby('day_of_week')
# groupby sorts the day names alphabetically, so reindex to calendar order
# before plotting; otherwise the bars would not match their labels
avg_daily_steps = day_grp['total_steps'].mean().reindex(day_of_week)

ax.bar(avg_daily_steps.index, avg_daily_steps, color='blue')

# Reference line: mean of the seven daily averages
ax.axhline(y=avg_daily_steps.mean(),color='red', label='Average daily steps')
ax.set_ylabel('Number of steps')
ax.set_xlabel('Day of the week')
ax.set_title('Avg Number of steps per day')

ax.legend()
plt.show()

The results show that Monday, Tuesday, and Saturday are the days when users were most physically active, with step counts above the overall average. Wednesday, Thursday, and Friday are below the average, but all three fall in the same range. Sunday is the least active day of the week.

From this we can infer that users tend to be more physically active during the first days of the week and on Saturdays, which gives us a hint of the activities they may do.
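
A numeric summary makes the comparison against the average explicit. The sketch below, on hypothetical values, orders the days with `reindex` and flags those whose mean exceeds the mean of the daily averages, the same reference used for the red line in the chart.

```python
import pandas as pd

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Hypothetical records, one row per user-day (not taken from the dataset)
sample = pd.DataFrame({
    'day_of_week': ['Monday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday',
                    'Friday', 'Saturday', 'Sunday'],
    'total_steps': [9000, 7000, 8500, 7000, 7200, 7100, 9000, 6000],
})

# Mean steps per day, put back into calendar order
avg_by_day = sample.groupby('day_of_week')['total_steps'].mean().reindex(days)

# Reference value: mean of the seven daily averages
reference = avg_by_day.mean()

summary = pd.DataFrame({
    'avg_steps': avg_by_day,
    'above_average': avg_by_day > reference,
})
print(summary)
```

Applied to `df`, the `above_average` column lists the days above the reference line without having to read bar heights from the chart.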

# Percentage of activity in minutes


In [None]:
very_active_mins = df['very_active_minutes'].sum() 
fairly_active_mins = df['fairly_active_minutes'].sum()
lightly_active_mins = df['lightly_active_minutes'].sum()
sedentary_mins = df['sedentary_minutes'].sum()

# Data for the pie chart
slices = [very_active_mins, fairly_active_mins, lightly_active_mins, sedentary_mins]
labels = ['Very Active Minutes', 'Fairly Active Minutes', 'Lightly Active Minutes', 'Sedentary Minutes']
explode = [0, 0, 0, 0.1]

# Custom colors
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99']

# Creating the pie chart
plt.pie(slices, labels=labels, explode=explode, autopct='%1.1f%%', textprops=dict(size=9), shadow=True, colors=colors)

# Setting the title and layout
plt.title('Percentage of Activity in Minutes', fontsize=18)
plt.tight_layout()

# Displaying the chart
plt.show()

This pie chart shows that users are sedentary most of the time, spend about a sixth of the time on light activity, and spend only about 2% of the time on proper exercise.
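
The shares shown in the chart can also be computed directly, which makes them easier to quote. The following is a minimal sketch on hypothetical minute totals; the real shares come from running the same steps on `df`.

```python
import pandas as pd

minute_cols = ['very_active_minutes', 'fairly_active_minutes',
               'lightly_active_minutes', 'sedentary_minutes']

# Hypothetical rows, each summing to 1440 minutes (one full day)
sample = pd.DataFrame({
    'very_active_minutes': [20, 30],
    'fairly_active_minutes': [10, 20],
    'lightly_active_minutes': [210, 190],
    'sedentary_minutes': [1200, 1200],
})

# Total minutes per category, then each category's share of all tracked minutes
totals = sample[minute_cols].sum()
shares = (totals / totals.sum() * 100).round(1)
print(shares)
```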

# Correlation Between activity level minutes and calories

In [None]:
n_day_of_week = [0,1,2,3,4,5,6]

fig, axes = plt.subplots(nrows=2, ncols=2,figsize=(11,15),dpi=70)

sns.scatterplot(data=df,x='calories',y='sedentary_minutes',hue='activity_level',ax=axes[0,0],legend=False)

sns.scatterplot(data=df,x='calories',y='lightly_active_minutes',hue='activity_level',ax=axes[0,1],legend=False)

sns.scatterplot(data=df,x='calories',y='fairly_active_minutes',hue='activity_level',ax=axes[1,0],legend=False)

sns.scatterplot(data=df,x='calories',y='very_active_minutes',hue='activity_level',ax=axes[1,1])


plt.legend(title='Activity level',title_fontsize=20,bbox_to_anchor=(1.8,2.2),fontsize=18,frameon=True,scatterpoints=1)
fig.suptitle('Correlation Between activity level minutes and calories',x=0.5,y=0.92,fontsize=24)
plt.show()

# Step 6: Act


After analyzing the FitBit Fitness Tracker Data, we found some insights that could help shape Bellabeat's marketing strategy.

**A versatile wellness tool**

Bellabeat should communicate that its products are designed for more than just sports or exercise-related activities. The step data shows activity peaking on Saturdays and at the start of the week and dropping midweek and on Sundays, which suggests that some users may perceive the product as suitable mainly for sports or occasional activities like park walks.

Bellabeat can emphasize that their products are intended to accompany users throughout all daily activities, including work, and help track information that enhances overall fitness and health. By highlighting this versatility, Bellabeat can appeal to women from diverse backgrounds and encourage broader use of their products for comprehensive health management.


**Rewards and reminders**

Bellabeat can enhance user engagement by integrating various features into the Bellabeat app or other products. These features could include rewards and incentives, as well as reminders to help users achieve specific goals. For example, users could be encouraged to reach milestones such as 7,500 steps per day, a certain amount of calorie burn for weight loss, or an 8-hour sleep pattern.

Potential rewards might include leaderboards showcasing top users who consistently meet their daily step goals, virtual medals, or tangible prizes like discounts and special offers. To support goal achievement, Bellabeat could send notifications when users fall short of their targets and offer personalized recommendations to help them improve their sleep or meet their fitness objectives.