In [9]:
"""
Data Sanity Check — CAPTURE-24 Prepared Data

PURPOSE
This script verifies that the output of the official CAPTURE-24 benchmark
preprocessing pipeline was loaded correctly before any model training or
evaluation is performed.

It is a lightweight sanity check that helps confirm that:
- all required files are present,
- array dimensions are consistent,
- participant identifiers align with windowed data (one identifier per window),
- only the four classes of the Walmsley (2020) activity taxonomy appear as labels.

EXPECTED FILES (relative to repo root)
- capture24/prepared_data/X.npy
    Windowed accelerometer data of shape (N, L, C) or (N, C, L)
- capture24/prepared_data/Y_Walmsley2020.npy
    Window-level activity labels (Walmsley 4-class taxonomy)
- capture24/prepared_data/P.npy
    Participant identifier for each window

CHECKS PERFORMED
- Verifies existence of all required files
- Verifies consistency of sample dimension N across X, Y, and P
- Reports number of unique participants
- Reports unique activity labels present in the dataset

USAGE
Run this script from the `notebooks/` directory, because DATA_DIR is a
relative path resolved against the current working directory:

    python sanity_check_data.py

The relative path assumes the repository structure:
    ann2_capture24/
      ├─ capture24/
      │   └─ prepared_data/
      └─ notebooks/

AI DISCLAIMER
This script was written with AI-assisted code completion and review.
It is used solely for auxiliary data validation and does not affect
model training, evaluation metrics, or conclusions of the study.
"""
import os

import numpy as np

# Relative path (notebooks/ → repo root → capture24/prepared_data)
DATA_DIR = os.path.join("..", "capture24", "prepared_data")

# Existence checks: fail early, before attempting to load anything
assert os.path.isdir(DATA_DIR), f"DATA_DIR does not exist: {os.path.abspath(DATA_DIR)}"

for fname in ["X.npy", "Y_Walmsley2020.npy", "P.npy"]:
    fpath = os.path.join(DATA_DIR, fname)
    assert os.path.isfile(fpath), f"Missing file: {fpath}"

# Load prepared benchmark outputs (X is memory-mapped so the full array is not read into RAM)
X = np.load(os.path.join(DATA_DIR, "X.npy"), mmap_mode="r")
Y = np.load(os.path.join(DATA_DIR, "Y_Walmsley2020.npy"))
P = np.load(os.path.join(DATA_DIR, "P.npy"))

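print("X shape:", X.shape)
print("Y shape:", Y.shape)
print("P shape:", P.shape)

# The sample dimension N must agree across X, Y, and P
assert X.shape[0] == Y.shape[0] == P.shape[0], (
    f"Sample count mismatch: X={X.shape[0]}, Y={Y.shape[0]}, P={P.shape[0]}"
)

print("Unique participants:", len(np.unique(P)))
print("Unique labels:", np.unique(Y))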


X shape: (934762, 1000, 3)
Y shape: (934762,)
P shape: (934762,)
Unique participants: 151
Unique labels: ['light' 'moderate-vigorous' 'sedentary' 'sleep']
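
The script above mostly prints values for manual inspection. As a minimal sketch of how the remaining items in its docstring could become hard checks, the hypothetical helper below validates the sample count, infers whether the layout is (N, L, C) or (N, C, L), and rejects labels outside the four classes printed above. `check_prepared_arrays` is not part of the CAPTURE-24 benchmark code. It assumes tri-axial data (C = 3), and the demo arrays and participant IDs are made up for illustration.

```python
import numpy as np

# Assumed label set, copied from the "Unique labels" output above.
WALMSLEY_LABELS = {"light", "moderate-vigorous", "sedentary", "sleep"}


def check_prepared_arrays(X, Y, P, expected_labels=WALMSLEY_LABELS):
    """Hypothetical helper: validate windowed arrays and return a summary dict."""
    n = X.shape[0]
    if not (n == Y.shape[0] == P.shape[0]):
        raise ValueError(
            f"Sample count mismatch: X={n}, Y={Y.shape[0]}, P={P.shape[0]}"
        )

    if X.ndim != 3:
        raise ValueError(f"Expected a 3-D array for X, got shape {X.shape}")

    # Assumption: the channel axis is whichever trailing axis has length 3.
    # If both trailing axes have length 3, (N, L, C) is reported.
    if X.shape[2] == 3:
        layout = "NLC"
    elif X.shape[1] == 3:
        layout = "NCL"
    else:
        raise ValueError(f"No trailing axis of length 3 in X with shape {X.shape}")

    # Any label outside the expected taxonomy is treated as an error.
    unexpected = set(np.unique(Y).tolist()) - set(expected_labels)
    if unexpected:
        raise ValueError(f"Unexpected labels: {sorted(unexpected)}")

    # Count windows per participant to spot participants with very little data.
    participants, counts = np.unique(P, return_counts=True)
    return {
        "n_windows": n,
        "layout": layout,
        "windows_per_participant": dict(zip(participants.tolist(), counts.tolist())),
    }


# Tiny synthetic example (not real CAPTURE-24 data).
X_demo = np.zeros((6, 10, 3), dtype=np.float32)
Y_demo = np.array(
    ["sleep", "sleep", "light", "sedentary", "light", "moderate-vigorous"]
)
P_demo = np.array(["P001", "P001", "P001", "P002", "P002", "P002"])

summary = check_prepared_arrays(X_demo, Y_demo, P_demo)
print(summary)
```

On the real arrays, the same call would be `check_prepared_arrays(X, Y, P)`. Only `X.shape` is inspected, so a memory-mapped `X` is not read into memory. Raising `ValueError` instead of using `assert` keeps the checks active when Python runs with `-O`.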
