# Bellabeat Case Study (Google Data Analytics Capstone Project)


**About the company**

Bellabeat is a high-tech manufacturer of health-focused smart products for women. 
The company's mission is to empower women with knowledge about their own health and habits. 
Bellabeat has grown rapidly and positioned itself as a data-driven wellness company. 
Their product portfolio includes devices such as Leaf, Time, and Spring, which collect activity, sleep, stress, and reproductive health data.

**Questions for the analysis**
1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat marketing strategy?

**Purpose of the project** 

To understand user behavior and develop product strategy recommendations for Bellabeat by analyzing publicly available Fitbit smart device data.

To derive meaningful insights from that data.


**Loading packages**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')

**Importing datasets**

In [None]:
df_activity_original = pd.read_csv('/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
df_heartrate_original = pd.read_csv('/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv')
df_sleep_original = pd.read_csv('/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')
df_weight_original = pd.read_csv('/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv')

In [None]:
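# Work on copies so the cleaning steps below do not modify the raw frames
df_activity = df_activity_original.copy()
df_heartrate = df_heartrate_original.copy()
df_sleep = df_sleep_original.copy()
df_weight = df_weight_original.copy()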

## Examining Datasets

### activity dataset

In [None]:
df_activity.head(10)

In [None]:
df_activity.info()

In [None]:
unique_id_count = df_activity['Id'].nunique()
print(unique_id_count)

### weight dataset

In [None]:
df_weight.head(10)

In [None]:
df_weight.info()

### heartrate dataset

In [None]:
df_heartrate.head(10)

In [None]:
df_heartrate.info()

### sleep dataset

In [None]:
df_sleep.head(10)

In [None]:
df_sleep.info()

## Cleaning Datasets

**Data Cleaning Steps:**

1. Id values were converted to string format for consistency across datasets.

2. Date columns were standardized to a common datetime format in all datasets.

3. Unused or irrelevant columns were removed.

4. Outliers were checked using box plots for each numerical column.

5. Maximum and minimum values were manually reviewed and adjusted if necessary.

6. Missing or duplicated records were identified and handled appropriately.
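
As a rough illustration, steps 1, 2 and 6 above can be sketched on a small made-up frame. The column names mirror the Fitbit export, but the values and the helper name `clean_daily_frame` are hypothetical; the cells below apply the same ideas dataset by dataset.

```python
import pandas as pd


def clean_daily_frame(df, date_col):
    """Return a cleaned copy: string Ids, a datetime 'Date' column, no duplicate rows."""
    out = df.copy()  # leave the raw frame untouched
    out['Id'] = out['Id'].astype(str)
    # Parse the date column and drop any time-of-day component
    out['Date'] = pd.to_datetime(out[date_col], format='%m/%d/%Y').dt.normalize()
    if date_col != 'Date':
        out = out.drop(columns=[date_col])
    return out.drop_duplicates().reset_index(drop=True)


# Made-up rows for illustration only (the second row repeats the first)
toy = pd.DataFrame({
    'Id': [1503960366, 1503960366, 1624580081],
    'ActivityDate': ['4/12/2016', '4/12/2016', '4/13/2016'],
    'TotalSteps': [13162, 13162, 8163],
})

cleaned = clean_daily_frame(toy, 'ActivityDate')
print(cleaned.dtypes)
print(cleaned)
```

Keeping `Id` as a string and `Date` as a normalized datetime in every frame is what lets the later merges on `['Id', 'Date']` line up.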


### activity dataset

In [None]:
df_activity['ActivityDate'] = pd.to_datetime(df_activity['ActivityDate'], format='%m/%d/%Y')
df_activity = df_activity.rename(columns={'ActivityDate': 'Date'})


In [None]:
df_activity['Id'] = df_activity['Id'].astype(str)

In [None]:
duplicate_count = df_activity.duplicated().sum()
print(f"Total number of duplicate rows: {duplicate_count}")

In [None]:
df_activity = df_activity.drop(['TrackerDistance', 'LoggedActivitiesDistance','VeryActiveDistance', 'ModeratelyActiveDistance','LightActiveDistance','SedentaryActiveDistance'], axis=1)

In [None]:
df_activity['Total_Active_Minutes'] = df_activity['FairlyActiveMinutes'] + df_activity['VeryActiveMinutes'] + df_activity['LightlyActiveMinutes']


In [None]:
columns_to_check = ['TotalSteps', 'Calories', 'SedentaryMinutes', 'Total_Active_Minutes']

plt.figure(figsize=(15, 5))
for i, column in enumerate(columns_to_check):
    plt.subplot(1, 4, i + 1)
    sns.boxplot(y=df_activity[column])
    plt.title(f'{column} Box Plot')
    plt.ylabel(column)

plt.tight_layout()
plt.show()

### weight dataset

In [None]:
# Parse the log timestamp and keep only the calendar date so it matches df_activity['Date'].
# (The date must come from df_weight itself, not from df_activity.)
df_weight['Date'] = pd.to_datetime(df_weight['Date']).dt.normalize()

print(df_weight['Date'].head())

In [None]:
df_weight['Id'] = df_weight['Id'].astype(str)
df_weight['LogId'] = df_weight['LogId'].astype(str)

In [None]:
duplicate_count = df_weight.duplicated().sum()
print(f"Total number of duplicate rows: {duplicate_count}")

In [None]:
df_weight = df_weight.drop(['Fat','WeightPounds', 'LogId','IsManualReport'], axis=1)

In [None]:
columns_to_check = ['WeightKg', 'BMI']

plt.figure(figsize=(15, 5))
for i, column in enumerate(columns_to_check):
    plt.subplot(1, 4, i + 1)
    sns.boxplot(y=df_weight[column])
    plt.title(f'{column} Box Plot')
    plt.ylabel(column)

plt.tight_layout()
plt.show()

In [None]:
weight_check = df_weight.sort_values(by='WeightKg', ascending=False)[['WeightKg', 'BMI']].head(10)
print("weight check:")
print(weight_check)

### heartrate dataset

In [None]:
df_heartrate['Time'] = pd.to_datetime(df_heartrate['Time'])

df_heartrate['Time'] = df_heartrate['Time'].dt.date

df_heartrate = df_heartrate.rename(columns={'Time': 'Date'})

df_heartrate['Date'].head()

In [None]:
df_heartrate['Date'] = pd.to_datetime(df_heartrate['Date'])
print(df_heartrate['Date'].head())

In [None]:
df_heartrate['Id'] = df_heartrate['Id'].astype(str)

Average heart rate was calculated from the heartrate dataset using mean values per user per day.

In [None]:
df_heartrate_avg = df_heartrate.groupby(['Id', 'Date'])['Value'].mean().reset_index()


In [None]:
# groupby leaves the averaged column named 'Value'; give it a descriptive name
df_heartrate = df_heartrate_avg.rename(columns={'Value': 'AvgHeartRate'})

In [None]:
df_heartrate.head()

In [None]:
duplicate_count = df_heartrate.duplicated().sum()
print(f"Total number of duplicate rows: {duplicate_count}")

In [None]:
min_DailyAvgHeartRate_check = df_heartrate['AvgHeartRate'].min()
print("Min Heart Rate Check:")
print(min_DailyAvgHeartRate_check)

In [None]:
max_DailyAvgHeartRate_check = df_heartrate['AvgHeartRate'].max()
print("Max Heart Rate Check:")
print(max_DailyAvgHeartRate_check)

In [None]:
heart_to_check = ['AvgHeartRate']

plt.figure(figsize=(15, 5))
for i, column in enumerate(heart_to_check):
    plt.subplot(1, 4, i + 1)
    sns.boxplot(y=df_heartrate[column])
    plt.title(f'{column} Box Plot')
    plt.ylabel(column)

plt.tight_layout()
plt.show()

### sleep dataset

In [None]:
df_sleep.info()

In [None]:
df_sleep['Id'] = df_sleep['Id'].astype(str)

In [None]:
df_sleep['SleepDay'] = pd.to_datetime(df_sleep['SleepDay'])

df_sleep['SleepDay'] = df_sleep['SleepDay'].dt.date

df_sleep = df_sleep.rename(columns={'SleepDay': 'Date'})

df_sleep['Date'] = pd.to_datetime(df_sleep['Date'])
print(df_sleep['Date'].head())

In [None]:
duplicate_count = df_sleep.duplicated().sum()
print(f"Total number of duplicate rows: {duplicate_count}")

In [None]:
duplicate_rows = df_sleep[df_sleep.duplicated(keep=False)]
print("Repeated Rows (Total 3 sets):")
print(duplicate_rows)

In [None]:
df_sleep = df_sleep.drop_duplicates()

# Check
duplicate_count_after_drop = df_sleep.duplicated().sum()
print(f"\nNumber of duplicate rows after deletion: {duplicate_count_after_drop}")


In [None]:
min_sleep_check = df_sleep['TotalMinutesAsleep'].min()
print("Min Sleep Check:")
print(min_sleep_check)

In [None]:
max_sleep_check = df_sleep['TotalMinutesAsleep'].max()
print("Max Sleep Check:")
print(max_sleep_check)

In [None]:
sleep_to_check = ['TotalMinutesAsleep']

plt.figure(figsize=(15, 5))
for i, column in enumerate(sleep_to_check):
    plt.subplot(1, 4, i + 1)
    sns.boxplot(y=df_sleep[column])
    plt.title(f'{column} Box Plot')
    plt.ylabel(column)

plt.tight_layout()
plt.show()

### Merging Datasets

In [None]:
df_merged_weight_activity = pd.merge(
    df_activity,
    df_weight,
    on=['Id', 'Date'],
    how='inner'
)

print("\nSummary of merged weight and activity dataset:")
df_merged_weight_activity.info()

In [None]:
df_merged_weight_activity.head()

In [None]:
df_merged_sleep_heart = pd.merge(
    df_heartrate,
    df_sleep,
    on=['Id', 'Date'],
    how='inner'
)

print("\nSummary of merged sleep and heartrate dataset:")
df_merged_sleep_heart.info()

In [None]:
df_merged_sleep_heart.head()

In [None]:
df_merged_sleep_heart_activity = pd.merge(
    df_activity,
    df_merged_sleep_heart,
    on=['Id', 'Date'],
    how='inner'
)

print("\nSummary of merged sleep, heartrate and activity dataset:")
df_merged_sleep_heart_activity.info()

In [None]:
df_merged_sleep_heart_activity.head()

## Analyze

In [None]:
initial_date = df_activity['Date'].min()
final_date = df_activity['Date'].max()
print(f"Dataset initial date: {initial_date}")
print(f"Dataset final date: {final_date}")


### Correlation Matrix: Weight, Total Active Minutes, and Calories

In [None]:
columns_for_correlation = [
    'WeightKg',
    'Total_Active_Minutes',
    'Calories',
]

correlation_matrix = df_merged_weight_activity[columns_for_correlation].corr()

plt.figure(figsize=(8, 6))

sns.heatmap(
    correlation_matrix,
    annot=True,
    cmap='coolwarm',
    fmt=".2f",
    linewidths=.5,
    cbar_kws={'label': 'Correlation Coefficient (r)'}
)

plt.title('WeightKg, Activity, Calorie Correlation Matrix', fontsize=14)

plt.tight_layout()
plt.show()

The heat map shows the relationships between users' weight, total active minutes, and calories burned.



*   A moderately positive correlation between weight and calories (r=0.65) was observed; users with higher weights tend to burn more calories.

*   There is also a positive correlation between active minutes and calories (r=0.53), which is consistent with physical activity increasing energy expenditure.

*   Weight and activity show only a weak negative correlation (r≈–0.10), so no clear relationship between them is visible in this data.
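
For reference, the r values in the heat map are Pearson correlation coefficients, which is the default method of `DataFrame.corr()`. A minimal sketch of that calculation follows; the helper name `pearson_r` and the numbers are made up for illustration and are not taken from the dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values for five users
weight_kg = [52.0, 61.5, 70.0, 85.0, 90.0]
calories = [1800, 2100, 2050, 2600, 2900]

r = pearson_r(weight_kg, calories)
print(f"r = {r:.2f}")
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 a weak one. The coefficient describes association only and says nothing about cause.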


### Step Count Distribution

In [None]:
avg_steps = df_activity['TotalSteps'].mean()
avg_calories = df_activity['Calories'].mean()
avg_active_minutes = (df_activity['VeryActiveMinutes'] +
                      df_activity['FairlyActiveMinutes'] +
                      df_activity['LightlyActiveMinutes']).mean()

print(f"Average number of steps: {avg_steps:.0f}")
print(f"Average calorie burn: {avg_calories:.0f}")
print(f"Average active minutes: {avg_active_minutes:.0f}")

In [None]:
bins = [0, 5000, 10000, 15000, 20000, df_activity['TotalSteps'].max()]
labels = ['<5K', '5K–10K', '10K–15K', '15K–20K', '>20K']
df_activity['step_range'] = pd.cut(df_activity['TotalSteps'], bins=bins, labels=labels)
step_counts = df_activity['step_range'].value_counts(normalize=True) * 100

plt.figure(figsize=(6,6))
plt.pie(step_counts, labels=step_counts.index, autopct='%1.1f%%', startangle=90, pctdistance=0.8, colors=sns.color_palette("pastel"))
plt.title('Distribution of Daily Step Counts (%)')

plt.tight_layout()
plt.show()

The chart shows how the daily records are distributed across step-count ranges. Each observation is one user-day, so the shares describe recorded days rather than individual users.

Most recorded days fall in the moderately active range:


* 38.7% of days had 5,000–10,000 steps
* 27.3% had 10,000–15,000 steps
* 26.2% had under 5,000 steps
* 5.6% had 15,000–20,000 steps
* 2.2% had over 20,000 steps
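
One detail worth checking in the binning above: with its default settings, `pd.cut` builds right-closed intervals, so a value equal to the lowest bin edge (here, a day with exactly 0 steps) is left unlabelled and drops out of the percentages. The sketch below uses made-up step totals to show the difference that `include_lowest=True` makes. Whether the real data contains zero-step days is an assumption to verify, not something shown here.

```python
import pandas as pd

# Hypothetical daily step totals, including one zero-step day
steps = pd.Series([0, 4200, 5000, 12000, 25000])
bins = [0, 5000, 10000, 15000, 20000, steps.max()]
labels = ['<5K', '5K–10K', '10K–15K', '15K–20K', '>20K']

# Default: intervals are (0, 5000], (5000, 10000], ... so 0 gets no label
default_cut = pd.cut(steps, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well
inclusive_cut = pd.cut(steps, bins=bins, labels=labels, include_lowest=True)

print("Unlabelled with defaults:", default_cut.isna().sum())
print("Unlabelled with include_lowest:", inclusive_cut.isna().sum())
```

If zero-step days mostly reflect days when the device was not worn, excluding them may be the intended behavior; the point is to make that choice explicit.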





### Sleep Quality Levels

In [None]:
df_sleep['SleepEfficiency'] = (df_sleep['TotalMinutesAsleep'] / df_sleep['TotalTimeInBed']) * 100
df_sleep['SleepEfficiency'].describe()



In [None]:
bins = [0, 70, 85, 100]
labels = ['Low', 'Medium', 'High']
df_sleep['SleepQuality'] = pd.cut(df_sleep['SleepEfficiency'], bins=bins, labels=labels)

ax = sns.countplot(x='SleepQuality', data=df_sleep, palette='pastel')

plt.title('Sleep Quality Distribution')
plt.xlabel('Sleep Quality')
plt.ylabel('Number of Users')

total = len(df_sleep)
for p in ax.patches:
    count = p.get_height()
    percentage = 100 * count / total
    ax.annotate(f'{percentage:.1f}%',
                (p.get_x() + p.get_width() / 2., count),
                ha='center', va='bottom', fontsize=10)

plt.show()

The chart shows how sleep records are distributed across the three sleep-efficiency bands.
Overall, the large majority of records in this dataset show high sleep efficiency.

*   92% of records fall in the high band (efficiency above 85%)
*   6.3% fall in the low band (efficiency of 70% or below)
*   1.7% fall in the medium band (efficiency between 70% and 85%)


Sleep efficiency was calculated using the following formula:


Sleep efficiency = (Total sleep minutes / Total time in bed) × 100
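
As a worked example of the formula and of the Low/Medium/High bands used above, here is a small sketch in plain Python. The function names and the sample night are hypothetical, and the band edges simply mirror the bins `[0, 70, 85, 100]` from the plotting cell.

```python
def sleep_efficiency(minutes_asleep, minutes_in_bed):
    """Percentage of time in bed that was spent asleep."""
    if minutes_in_bed <= 0:
        raise ValueError("minutes_in_bed must be positive")
    return minutes_asleep / minutes_in_bed * 100


def sleep_quality(efficiency):
    """Map an efficiency percentage to a band (right-closed edges at 70 and 85)."""
    if efficiency <= 70:
        return 'Low'
    if efficiency <= 85:
        return 'Medium'
    return 'High'


# Made-up night: 420 minutes asleep out of 450 minutes in bed
eff = sleep_efficiency(420, 450)
print(f"{eff:.1f}% -> {sleep_quality(eff)}")
```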


### Estimated Stress Levels 

High heart rate, low sleep efficiency, and underactivity or overactivity have all been linked to stress. The stress index below is a rough heuristic built from these signals rather than a validated measure.
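
The index computed in the next cell is a weighted sum of three terms scaled to the 0–1 range: relative heart rate (weight 0.5), lost sleep efficiency (weight 0.3), and a shortfall in very active minutes (weight 0.2). The sketch below restates that formula for a single observation; the function name `stress_index` and the input values are hypothetical, and the weights are the notebook's own heuristic choices.

```python
def stress_index(avg_hr, max_hr, sleep_eff, very_active_min, max_very_active_min):
    """Heuristic 0-1 score: higher heart rate, poorer sleep, less intense activity -> higher."""
    hr_term = avg_hr / max_hr                       # share of the highest observed daily average
    sleep_term = (100 - sleep_eff) / 100            # fraction of time in bed not asleep
    inactivity_term = (max_very_active_min - very_active_min) / max_very_active_min
    return 0.5 * hr_term + 0.3 * sleep_term + 0.2 * inactivity_term


# Made-up observation: 75 bpm against a maximum of 100, 90% sleep efficiency,
# 30 very active minutes against a maximum of 120
score = stress_index(avg_hr=75, max_hr=100, sleep_eff=90,
                     very_active_min=30, max_very_active_min=120)
print(f"Stress index: {score:.3f}")
```

This example works out to 0.375 + 0.030 + 0.150 = 0.555, which falls in the Medium band (0.3 to 0.6). The activity term only rewards more very active minutes, so the index captures underactivity but not overactivity.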



In [None]:
df_merged_sleep_heart_activity['SleepEfficiency'] = (df_merged_sleep_heart_activity['TotalMinutesAsleep'] / df_merged_sleep_heart_activity['TotalTimeInBed']) * 100

df_merged_sleep_heart_activity['StressIndex'] = (
    (df_merged_sleep_heart_activity['AvgHeartRate'] / df_merged_sleep_heart_activity['AvgHeartRate'].max()) * 0.5
    + ((100 - df_merged_sleep_heart_activity['SleepEfficiency']) / 100) * 0.3
    + ((df_merged_sleep_heart_activity['VeryActiveMinutes'].max() - df_merged_sleep_heart_activity['VeryActiveMinutes']) /
       df_merged_sleep_heart_activity['VeryActiveMinutes'].max()) * 0.2
)

In [None]:
bins = [0, 0.3, 0.6, 1]
labels = ['Low', 'Medium', 'High']
df_merged_sleep_heart_activity['StressLevel'] = pd.cut(df_merged_sleep_heart_activity['StressIndex'], bins=bins, labels=labels)

In [None]:
plot = sns.countplot(x='StressLevel', data=df_merged_sleep_heart_activity, palette='pastel')
plt.title('Estimated Stress Level Distribution')
plt.xlabel('Stress Level')
plt.ylabel('Number of Users')

total = len(df_merged_sleep_heart_activity)
for p in plot.patches:
    count = p.get_height()
    percentage = 100 * count / total
    plot.annotate(f'{percentage:.1f}%',
                (p.get_x() + p.get_width() / 2., count),
                ha='center', va='bottom', fontsize=10)

plt.show()

The results show that most observations fall in the moderate stress band, with a small percentage tending toward high stress.



This highlights the importance of features like breathing exercises and mindfulness reminders in the Bellabeat app.

### Weekly Active Day Trends

In [None]:
df_activity['Week'] = df_activity['Date'].dt.isocalendar().week
df_activity['Weekday'] = df_activity['Date'].dt.day_name()



A day counts as active when the user records more than 5,000 steps.


In [None]:
df_activity['ActiveDay'] = df_activity['TotalSteps'] > 5000


In [None]:
user_weekly = df_activity.groupby(['Id', 'Week'])['ActiveDay'].sum().reset_index()
user_weekly.rename(columns={'ActiveDay': 'ActiveDaysPerWeek'}, inplace=True)

In [None]:
distribution = user_weekly['ActiveDaysPerWeek'].value_counts(normalize=True).sort_index() * 100
print(distribution)

In [None]:
avg_activity = df_activity.groupby('Weekday')['TotalSteps'].mean().reset_index()
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
avg_activity['Weekday'] = pd.Categorical(avg_activity['Weekday'], categories=order, ordered=True)
avg_activity = avg_activity.sort_values('Weekday')

fig, axes = plt.subplots(1, 2, figsize=(12,5))

sns.barplot(x=distribution.index, y=distribution.values, palette='pastel', ax=axes[0])
axes[0].set_title('Activity Frequency Per Week')
axes[0].set_xlabel('Number of Weekly Active Days')
axes[0].set_ylabel('User Rate (%)')
for i, val in enumerate(distribution.values):
  axes[0].text(i, val + 1, f'{val:.1f}%', ha='center', fontsize=9)

sns.barplot(x='Weekday', y='TotalSteps', data=avg_activity, palette='pastel', ax=axes[1])
axes[1].set_title('Average Steps by Day of the Week')
axes[1].set_xlabel('Day of Week')
axes[1].set_ylabel('Average Number of Steps')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

Analysis of weekly activity patterns shows that users are not uniformly active throughout the week.


* Activity frequency: In about 51.7% of user-weeks, the user was active on five or more days, while 11.1% of user-weeks contained no active day at all. This indicates that most users in the dataset maintain a regular activity routine, though a portion shows low engagement.


* Day-level activity: The highest average step counts occur on Saturday (8,153) and Tuesday (8,125). Sunday (6,933) has the lowest average, so the slowdown is specific to Sunday rather than to the weekend as a whole.


These findings suggest that Bellabeat could introduce Sunday activity challenges or reminder notifications to increase user engagement on less active days.

## Recommendations

1. Encourage engagement on low-activity days
Introduce step challenges or personalized push notifications on low-activity days, Sunday in particular, to increase user activity.

2. Promote mindfulness features
Integrate short stress-reduction or breathing exercises in the Bellabeat app for users with elevated stress signals.

3. Personalized sleep insights
Offer reminders for bedtime routines and highlight sleep coaching to support users with lower sleep quality.

4. Activity goal tracking
Provide adaptive daily goals or badges to maintain motivation for users.

These findings can guide Bellabeat in improving user engagement, promoting healthy habits, and refining targeted marketing or product strategies.