
# Bellabeat Smart Device Market Analysis — Notebook

This notebook mirrors the project steps from the PDF and repo scripts. It focuses on **runnable, text-based EDA** (no charts) and shows how to generate the merged daily dataset and sanity-check it.


In [1]:

import pandas as pd
import numpy as np
from pathlib import Path

pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 120)
print('Pandas version:', pd.__version__)


Pandas version: 2.3.2


In [2]:

# Paths (run this cell from notebooks/)
REPO_ROOT = Path('..').resolve()
DATA_DIR = (REPO_ROOT / 'data').resolve()
print('Repo root:', REPO_ROOT)
print('Data dir :', DATA_DIR)


Repo root: /Users/seunghyunhong/simonhong/Projects/bellabeat-analysis
Data dir : /Users/seunghyunhong/simonhong/Projects/bellabeat-analysis/data



## Generate cleaned daily dataset

The next cell runs the Python ETL script (`scripts/analysis.py`) to create `data/daily_merged.csv`, which adds two derived metrics: `TotalActiveMinutes` and `SleepEfficiency`.
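The ETL script itself is not shown in this notebook, so the definitions of the two derived metrics are not confirmed here. The sketch below shows one plausible way to compute them, assuming `TotalActiveMinutes` is the sum of the three active-minute columns and `SleepEfficiency` is minutes asleep divided by minutes in bed. The function name `add_derived_metrics` and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

def add_derived_metrics(daily: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with TotalActiveMinutes and SleepEfficiency added (assumed definitions)."""
    out = daily.copy()
    # Assumption: active minutes = very + fairly + lightly active minutes.
    out["TotalActiveMinutes"] = (
        out["VeryActiveMinutes"] + out["FairlyActiveMinutes"] + out["LightlyActiveMinutes"]
    )
    # Assumption: efficiency = minutes asleep / minutes in bed.
    # Rows with zero or missing time in bed are left as NaN instead of dividing by zero.
    in_bed = out["TotalTimeInBed"].where(out["TotalTimeInBed"] > 0)
    out["SleepEfficiency"] = out["TotalMinutesAsleep"] / in_bed
    return out

# Hypothetical sample rows: one day with sleep data, one without.
example = pd.DataFrame({
    "VeryActiveMinutes": [25, 0],
    "FairlyActiveMinutes": [13, 0],
    "LightlyActiveMinutes": [328, 100],
    "TotalMinutesAsleep": [327.0, np.nan],
    "TotalTimeInBed": [346.0, np.nan],
})
result = add_derived_metrics(example)
print(result[["TotalActiveMinutes", "SleepEfficiency"]])
```

Days without a sleep record keep `NaN` in `SleepEfficiency`, which is why the later cells guard on the sleep columns before using them.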


In [3]:

import shlex, subprocess
script = REPO_ROOT / 'scripts' / 'analysis.py'
# Build the command as a list so paths containing spaces are passed intact
cmd = ['python', str(script), '--data_dir', str(DATA_DIR), '--out_dir', str(DATA_DIR)]
print('Running:', shlex.join(cmd))
completed = subprocess.run(cmd, capture_output=True, text=True)
print(completed.stdout)
print(completed.stderr)


Running: python /Users/seunghyunhong/simonhong/Projects/bellabeat-analysis/scripts/analysis.py --data_dir /Users/seunghyunhong/simonhong/Projects/bellabeat-analysis/data --out_dir /Users/seunghyunhong/simonhong/Projects/bellabeat-analysis/data

Traceback (most recent call last):
  File "/Users/seunghyunhong/simonhong/Projects/bellabeat-analysis/scripts/analysis.py", line 85, in <module>
    main(args.data_dir, args.out_dir)
    ~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/seunghyunhong/simonhong/Projects/bellabeat-analysis/scripts/analysis.py", line 24, in main
    raise FileNotFoundError('No daily activity CSVs found in data/. Expected e.g. dailyActivity_merged.csv')
FileNotFoundError: No daily activity CSVs found in data/. Expected e.g. dailyActivity_merged.csv




## Load and sanity-check `daily_merged.csv`


In [4]:

merged_path = DATA_DIR / 'daily_merged.csv'
if not merged_path.exists():
    raise FileNotFoundError('daily_merged.csv was not found. Ensure Kaggle CSVs are in data/ and re-run the ETL cell.')
df = pd.read_csv(merged_path)
print(df.shape)
df.head()


FileNotFoundError: daily_merged.csv was not found. Ensure Kaggle CSVs are in data/ and re-run the ETL cell.
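Both errors above trace back to the raw Kaggle CSVs being absent from `data/`. A small pre-flight check can surface that before the ETL runs. The sketch below is illustrative only: the helper name `find_daily_activity_csvs` and the glob pattern `dailyActivity*.csv` are assumptions based on the file name in the error message, not the script's actual lookup logic.

```python
import tempfile
from pathlib import Path

def find_daily_activity_csvs(data_dir):
    """Return sorted CSV paths in data_dir whose names start with 'dailyActivity' (assumed pattern)."""
    return sorted(Path(data_dir).glob("dailyActivity*.csv"))

# Demonstrate on a throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    before = find_daily_activity_csvs(tmp)  # nothing there yet
    (Path(tmp) / "dailyActivity_merged.csv").write_text("Id,ActivityDate,TotalSteps\n")
    after = find_daily_activity_csvs(tmp)   # now the file is found

print("Found before:", [p.name for p in before])
print("Found after :", [p.name for p in after])
```

In the notebook this could be called as `find_daily_activity_csvs(DATA_DIR)` ahead of the ETL cell. An empty list means the source files still need to be copied into `data/`.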


## Text-based EDA (no charts)


In [None]:

# Null counts
nulls = df.isna().sum().sort_values(ascending=False)
print('Top nulls:')
print(nulls.head(10))

# Describe key columns
key_cols = [
    'TotalSteps', 'Calories', 'VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes',
    'SedentaryMinutes', 'TotalMinutesAsleep', 'TotalTimeInBed', 'TotalActiveMinutes', 'SleepEfficiency',
]
cols = [c for c in key_cols if c in df.columns]
df[cols].describe(percentiles=[.25,.5,.75])
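Beyond null counts and summary statistics, a few rule-based checks can flag rows that are physically implausible. The sketch below is one possible set of checks, not part of the repo scripts. The function name `sanity_report` and the toy data are hypothetical, and each check is skipped when its columns are missing.

```python
import pandas as pd

MINUTE_COLS = ["VeryActiveMinutes", "FairlyActiveMinutes", "LightlyActiveMinutes", "SedentaryMinutes"]

def sanity_report(daily: pd.DataFrame) -> dict:
    """Count rows that violate simple physical constraints; skip checks whose columns are absent."""
    report = {"duplicate_rows": int(daily.duplicated().sum())}
    if set(MINUTE_COLS) <= set(daily.columns):
        # A day has 1440 minutes, so the four intensity buckets cannot sum to more.
        report["minutes_over_1440"] = int((daily[MINUTE_COLS].sum(axis=1) > 1440).sum())
    if {"TotalMinutesAsleep", "TotalTimeInBed"} <= set(daily.columns):
        # Time asleep should never exceed time in bed.
        report["asleep_exceeds_in_bed"] = int(
            (daily["TotalMinutesAsleep"] > daily["TotalTimeInBed"]).sum()
        )
    if "TotalSteps" in daily.columns:
        report["negative_steps"] = int((daily["TotalSteps"] < 0).sum())
    return report

# Toy data: row 2 duplicates row 0, and row 1 breaks every other rule.
toy = pd.DataFrame({
    "TotalSteps": [8000, -5, 8000],
    "VeryActiveMinutes": [30, 10, 30],
    "FairlyActiveMinutes": [20, 10, 20],
    "LightlyActiveMinutes": [200, 100, 200],
    "SedentaryMinutes": [700, 1400, 700],
    "TotalMinutesAsleep": [400, 500, 400],
    "TotalTimeInBed": [450, 480, 450],
})
report = sanity_report(toy)
print(report)
```

Running `sanity_report(df)` on the merged dataset returns a dictionary of violation counts. Any non-zero count is worth inspecting before segmenting.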



## Segmentations
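Activity-level buckets by daily steps (High: 12,500 or more; Moderate: 5,000 to 12,499; Low: under 5,000) and sleep-adequacy buckets (Adequate: at least 420 minutes asleep), mirroring the SQL snippets.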


In [None]:

def bucket_activity(steps):
    if pd.isna(steps): return 'Unknown'
    if steps >= 12500: return 'High Active'
    if steps >= 5000: return 'Moderate Active'  # upper bound is implied by the check above, so fractional values just below 12500 are not mislabeled
    return 'Low Active'

def bucket_sleep(mins):
    if pd.isna(mins): return 'Unknown'
    return 'Adequate Sleep' if mins >= 420 else 'Inadequate Sleep'

seg = df.copy()
seg['activity_level'] = seg['TotalSteps'].apply(bucket_activity)
if 'TotalMinutesAsleep' in seg.columns:
    seg['sleep_pattern'] = seg['TotalMinutesAsleep'].apply(bucket_sleep)

print('Activity levels:')
print(seg['activity_level'].value_counts(dropna=False))

if 'sleep_pattern' in seg.columns:
    print('\nSleep patterns:')
    print(seg['sleep_pattern'].value_counts(dropna=False))

# Cross-tab (activity vs sleep)
if 'sleep_pattern' in seg.columns:
    print('\nActivity vs Sleep cross-tab:')
    print(pd.crosstab(seg['activity_level'], seg['sleep_pattern']))
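The row-wise `apply` above is easy to read. On larger frames the same step thresholds could be applied in one vectorized call. The sketch below uses `pd.cut` with left-closed bins as an alternative. The function name `bucket_activity_vectorized` is hypothetical; the thresholds are the ones from `bucket_activity`.

```python
import numpy as np
import pandas as pd

def bucket_activity_vectorized(steps: pd.Series) -> pd.Series:
    """Label steps as Low/Moderate/High Active with pd.cut; missing values become 'Unknown'."""
    labels = ["Low Active", "Moderate Active", "High Active"]
    # right=False makes the bins left-closed: [-inf, 5000), [5000, 12500), [12500, inf).
    cut = pd.cut(steps, bins=[-np.inf, 5000, 12500, np.inf], labels=labels, right=False)
    # 'Unknown' must be registered as a category before it can fill the NaN slots.
    return cut.cat.add_categories("Unknown").fillna("Unknown")

demo = pd.Series([1200, 5000, 12499, 12500, np.nan])
demo_labels = bucket_activity_vectorized(demo).tolist()
print(demo_labels)
# → ['Low Active', 'Moderate Active', 'Moderate Active', 'High Active', 'Unknown']
```

The result is a categorical Series, so `value_counts()` and `pd.crosstab` work on it the same way as on the string labels produced by `apply`.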
