# Fitbit Bella_B Dataset – Exploratory Data Analysis (EDA)

## 1. Introduction

This notebook explores the Fitbit **bella_b** dataset collected from consumer wearables. We focus on three raw CSV exports produced by Fitabase: daily activity summaries, sleep sessions, and second-level heart rate samples. The goal of this project is to build foundations for **health risk prediction** across sleep quality, cardiovascular strain, stress, and an overall daily health risk level. The analyses here mirror the broader ETL and ML workflow: raw Fitbit CSVs are profiled, key features are engineered, labels are applied with project utilities, and a lightweight model demonstrates how downstream risk prediction works alongside the production pipeline and Streamlit app.

In [None]:
# 2. Imports & Global Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, f1_score

%matplotlib inline
sns.set_theme(style="whitegrid")

In [None]:
# 3. Load Raw Fitbit Data
raw_dir = "data/raw/bella_b/Fitabase Data 4.12.16-5.12.16"
activity_path = f"{raw_dir}/dailyActivity_merged.csv"
sleep_path = f"{raw_dir}/sleepDay_merged.csv"
hr_path = f"{raw_dir}/heartrate_seconds_merged.csv"

# Load datasets
activity_df = pd.read_csv(activity_path)
sleep_df = pd.read_csv(sleep_path)
heartrate_df = pd.read_csv(hr_path)

activity_df.head()

In [None]:
# Inspect activity dataset
# info() prints its summary and returns None, so call it for its side effect
activity_df.info()
activity_df.describe(include='all')

In [None]:
# Inspect sleep dataset
sleep_df.head()

In [None]:
# info() prints directly; describe() is shown as the cell output
sleep_df.info()
sleep_df.describe(include='all')

In [None]:
# Inspect heart rate dataset
heartrate_df.head()

In [None]:
# info() prints directly; describe() is shown as the cell output
heartrate_df.info()
heartrate_df.describe(include='all')

### What each dataset represents
- **dailyActivity_merged.csv**: Daily aggregates per user with steps, distance, activity intensity minutes, sedentary time, and calories.
- **sleepDay_merged.csv**: Nightly sleep sessions with minutes asleep, time in bed, and record counts.
- **heartrate_seconds_merged.csv**: Second-level heart rate time series with user IDs and timestamps.
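
Because the three files differ in grain (one row per user-day versus one row per heart rate sample), it can help to confirm the key of each table before merging. The sketch below runs on invented toy frames; the `ActivityDate` and `SleepDay` column names and the `duplicate_key_count` helper are assumptions for illustration, not confirmed parts of this project.

```python
import pandas as pd


def duplicate_key_count(df, keys):
    """Count rows whose key combination repeats an earlier row (hypothetical helper)."""
    return int(df.duplicated(subset=keys).sum())


# Toy stand-ins for the raw exports; all values are invented.
activity_toy = pd.DataFrame({
    "Id": [1, 1, 2],
    "ActivityDate": ["4/12/2016", "4/13/2016", "4/12/2016"],  # assumed column name
    "TotalSteps": [8000, 10500, 4200],
})
sleep_toy = pd.DataFrame({
    "Id": [1, 1, 2],
    "SleepDay": ["4/12/2016", "4/12/2016", "4/12/2016"],  # assumed column name
    "TotalMinutesAsleep": [400, 400, 380],
})

# A non-zero count means the table is not unique per user-day and
# would multiply rows in a later merge.
activity_dupes = duplicate_key_count(activity_toy, ["Id", "ActivityDate"])
sleep_dupes = duplicate_key_count(sleep_toy, ["Id", "SleepDay"])
print(f"activity duplicates: {activity_dupes}, sleep duplicates: {sleep_dupes}")
```

If duplicates turn up in the real data, `drop_duplicates` on the same key columns is one option, though whether that is appropriate depends on how the ETL job treats repeated records.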

In [None]:
# 4. Visualizations on Raw Data
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(activity_df["TotalSteps"], bins=10, color="#4C72B0")
axes[0].set_title("Daily Steps Distribution")
axes[0].set_xlabel("Steps")
axes[0].set_ylabel("Count")

axes[1].hist(sleep_df["TotalMinutesAsleep"], bins=10, color="#55A868")
axes[1].set_title("Sleep Minutes Distribution")
axes[1].set_xlabel("Total Minutes Asleep")
axes[1].set_ylabel("Count")
plt.tight_layout()
plt.show()


In [None]:
# Heart rate time series example: one user on one day
heartrate_df['Time'] = pd.to_datetime(heartrate_df['Time'])
sample_user = heartrate_df['Id'].iloc[0]
sample_date = heartrate_df['Time'].iloc[0].date()

# Filter by user first so the per-row date extraction runs on a smaller frame
user_hr = heartrate_df[heartrate_df['Id'] == sample_user]
hr_sample = user_hr[user_hr['Time'].dt.date == sample_date].sort_values('Time')

plt.figure(figsize=(10, 4))
# Plain line: per-sample markers clutter a series with thousands of points
plt.plot(hr_sample['Time'], hr_sample['Value'], linestyle='-')
plt.title(f"Heart Rate Trend for User {sample_user} on {sample_date}")
plt.xlabel("Timestamp")
plt.ylabel("Heart Rate (bpm)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


For the selected user and day, heart rate peaks around midday and stays lower in the morning and evening, a pattern consistent with circadian variation and activity-driven changes in heart rate.
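
Sample-level readings like these need to be rolled up to one row per user-day before they can sit next to the daily activity and sleep tables. A minimal sketch of such a roll-up on invented toy data follows; the output names `avg_hr`, `max_hr`, and `min_hr` match the feature names used later in this notebook, but whether the ETL job computes them this way is an assumption.

```python
import pandas as pd

# Toy heart rate samples with the same columns as the raw file (Id, Time, Value).
hr_toy = pd.DataFrame({
    "Id": [1, 1, 1, 1],
    "Time": pd.to_datetime([
        "2016-04-12 07:00:00",
        "2016-04-12 12:00:00",
        "2016-04-12 21:00:00",
        "2016-04-13 08:00:00",
    ]),
    "Value": [60, 110, 70, 65],
})

# Truncate each timestamp to midnight, then aggregate per user and day.
daily_hr = (
    hr_toy.assign(date=hr_toy["Time"].dt.normalize())
    .groupby(["Id", "date"])["Value"]
    .agg(avg_hr="mean", max_hr="max", min_hr="min")
    .reset_index()
)
print(daily_hr)
```

Using `dt.normalize()` keeps the `date` column as a datetime type, which tends to merge more predictably with a `parse_dates` column than Python `date` objects do.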

In [None]:
# 5. Feature Engineering Demonstration
activity_df = activity_df.copy()
sleep_df = sleep_df.copy()
activity_df['active_minutes'] = activity_df['VeryActiveMinutes'] + activity_df['FairlyActiveMinutes']
sleep_df['sleep_efficiency'] = sleep_df['TotalMinutesAsleep'] / sleep_df['TotalTimeInBed']

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].boxplot(activity_df['active_minutes'].dropna())
axes[0].set_title('Active Minutes (Very + Fairly Active)')
axes[0].set_ylabel('Minutes')

axes[1].hist(sleep_df['sleep_efficiency'].dropna(), bins=10, color='#C44E52')
axes[1].set_title('Sleep Efficiency')
axes[1].set_xlabel('Efficiency (ratio)')
axes[1].set_ylabel('Count')
plt.tight_layout()
plt.show()


Higher active_minutes values indicate more time spent in moderate-to-vigorous activity across the day, while sleep_efficiency measures the fraction of time in bed that is actually spent asleep; both are useful inputs for risk profiling.
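
One edge case worth guarding against is a record with zero time in bed, where the plain ratio yields `inf` and can break a histogram. The sketch below shows one way to handle it on invented toy values; treating such rows as missing is an assumption here, not a documented project rule.

```python
import numpy as np
import pandas as pd

# Toy sleep records; the last row has zero time in bed (invented values).
sleep_toy = pd.DataFrame({
    "TotalMinutesAsleep": [420, 300, 30],
    "TotalTimeInBed": [480, 400, 0],
})

# Swap a zero denominator for NaN so the ratio becomes missing instead of inf.
time_in_bed = sleep_toy["TotalTimeInBed"].replace(0, np.nan)
sleep_toy["sleep_efficiency"] = sleep_toy["TotalMinutesAsleep"] / time_in_bed

# Efficiency is a proportion, so values above 1 would point to a data problem.
out_of_range = int(sleep_toy["sleep_efficiency"].dropna().gt(1).sum())
print(sleep_toy)
print(f"rows with efficiency > 1: {out_of_range}")
```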

In [None]:
# 6. Load Processed Daily Metrics Dataset
processed_path = "data/processed/daily_metrics.csv"
metrics_df = pd.read_csv(processed_path, parse_dates=["date"])
metrics_df.head()

In [None]:
# info() prints directly; describe() is shown as the cell output
metrics_df.info()
metrics_df.describe()

In [None]:
# 7. Correlation Heatmap
numeric_cols = metrics_df.select_dtypes(include=[np.number]).columns
corr = metrics_df[numeric_cols].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="Blues", square=True)
plt.title("Correlation Heatmap of Daily Metrics")
plt.show()


In [None]:
# 8. Apply Risk Labeling (using project code)
from src.ml.risk_labeling import add_risk_labels

labeled_df = add_risk_labels(metrics_df)

label_columns = [
    "sleep_quality_risk",
    "cardiovascular_strain_risk",
    "stress_risk",
    "health_risk_level",
]

for col in label_columns:
    display(labeled_df[col].value_counts(dropna=False))


In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()
for ax, col in zip(axes, label_columns):
    labeled_df[col].value_counts().reindex(['low','moderate','high']).plot(kind='bar', ax=ax, color='#4C72B0')
    ax.set_title(f"{col} distribution")
    ax.set_ylabel('Count')
    ax.set_xlabel('Risk level')
plt.tight_layout()
plt.show()


Risk thresholds follow project conventions. Sleep duration of 7–9 hours combined with efficiency ≥85% maps to **low** sleep risk, while very short or very long sleep, or efficiency <75%, yields **high**. Cardiovascular strain escalates with high average or peak heart rates combined with limited active time, and stress risk rises with elevated resting rates, low sleep, and long sedentary periods. The combined **health_risk_level** summarizes the three, and in this sample most days fall into the moderate range with occasional high or low days.
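
To make the sleep rule concrete, here is a minimal sketch of a threshold labeler. It is not the project's `add_risk_labels` implementation: the function name `label_sleep_risk` is hypothetical, and the cutoffs for "very short" (under 6 hours) and "very long" (over 10 hours) are assumptions, since the text above does not give those numbers.

```python
import numpy as np
import pandas as pd


def label_sleep_risk(minutes_asleep, efficiency, short_hours=6.0, long_hours=10.0):
    """Hypothetical labeler: map sleep duration and efficiency to low/moderate/high."""
    hours = minutes_asleep / 60.0
    # High risk: assumed short/long cutoffs, or efficiency below 75%.
    high = (hours < short_hours) | (hours > long_hours) | (efficiency < 0.75)
    # Low risk: 7-9 hours of sleep together with efficiency of at least 85%.
    low = hours.between(7.0, 9.0) & (efficiency >= 0.85)
    # np.select takes the first matching condition, so "high" wins over "low";
    # rows matching neither (including missing values) fall back to "moderate".
    labels = np.select([high, low], ["high", "low"], default="moderate")
    return pd.Series(labels, index=minutes_asleep.index)


# Invented example days.
toy = pd.DataFrame({
    "total_minutes_asleep": [480, 300, 450, 420],
    "sleep_efficiency": [0.90, 0.90, 0.80, 0.70],
})
toy["sleep_quality_risk"] = label_sleep_risk(
    toy["total_minutes_asleep"], toy["sleep_efficiency"]
)
print(toy)
```

Checking the high-risk condition first means a day with a healthy duration but poor efficiency is still flagged, which matches the "or" wording in the rule above.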

In [None]:
# 9. Mini In-Notebook Model Demo
feature_cols = [
    "total_steps",
    "total_distance",
    "very_active_minutes",
    "fairly_active_minutes",
    "lightly_active_minutes",
    "sedentary_minutes",
    "calories",
    "total_minutes_asleep",
    "sleep_efficiency",
    "avg_hr",
    "max_hr",
    "min_hr",
    "active_minutes",
]

model_df = labeled_df.dropna(subset=feature_cols + ["health_risk_level"]).copy()

X = model_df[feature_cols]
y = model_df["health_risk_level"]

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

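# Stratify only when every class has at least two samples;
# otherwise train_test_split raises a ValueError.
class_counts = np.bincount(y_encoded)
stratify = y_encoded if class_counts.min() >= 2 else None

X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42, stratify=stratify
)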

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="macro", zero_division=0)

print(f"Accuracy: {accuracy:.2f}")
print(f"Macro F1: {f1:.2f}")


In [None]:
importances = clf.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 6))
plt.barh(range(len(indices)), importances[indices], color="#DD8452")
plt.yticks(range(len(indices)), [feature_cols[i] for i in indices])
plt.title("Feature Importances (Random Forest)")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()


The miniature model shows which daily metrics contribute most to predicting the health_risk_level label. Features tied to activity volume, sleep efficiency, and heart rate dynamics typically rank highest. Because the labels are themselves derived from these metrics by threshold rules, the ranking largely reflects which inputs the labeling logic relies on, which is useful context when interpreting daily wellness and the visualizations in the production pipeline and Streamlit app.
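
Impurity-based importances from a random forest are computed on the training data and can favor high-cardinality features, so a held-out check is a reasonable complement. The sketch below swaps in scikit-learn's `permutation_importance` on synthetic data where only the first feature carries signal; it is an illustration of the technique, not part of the project pipeline, and all variable names are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on feature 0.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)
toy_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the held-out split and measure the drop in accuracy.
result = permutation_importance(toy_clf, X_te, y_te, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("features ranked by mean importance:", ranking)
```

In this notebook the same call could be applied to `clf`, `X_test`, and `y_test`, with the caveat that a small test split makes the estimates noisy.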

## 10. Final Summary
- Fitbit activity, sleep, and heart-rate exports reveal daily behavior patterns such as step volume, time in various intensity zones, sleep duration, and circadian heart rate trends.
- Derived features like **active_minutes** and **sleep_efficiency** sharpen interpretation by combining key behaviors into single, risk-aware signals.
- Project risk labels translate these signals into actionable categories (sleep quality, cardiovascular strain, stress, overall health risk) and the example model shows how predictors map to outcomes.
- These analyses connect directly to the broader pipeline: the ETL job builds `daily_metrics.csv`, labeling enriches it for modeling, and the Streamlit app surfaces insights for users and stakeholders.