# Data Preprocessing Verification for SisFall Dataset

This notebook verifies that the SisFall data preprocessing pipeline adheres to the `DATA_SPLITTING_GUIDE.md` requirements, especially concerning data splitting and impact-zone-based labeling.

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

# Define paths (adjust if necessary)
PROCESSED_DIR = Path('../data/processed/sisfall/')
RAW_DIR = Path('../data/raw/sisfall/')
SAMPLING_RATE = 200 # Hz, as per configs/data/sisfall.yaml

print(f"Processed data directory: {PROCESSED_DIR.resolve()}")
print(f"Raw data directory: {RAW_DIR.resolve()}")

Processed data directory: /Users/minhphan/src/Fall-TSAD/data/processed/sisfall
Raw data directory: /Users/minhphan/src/Fall-TSAD/data/raw/sisfall


## 1. Verify Split Files and Metadata

In [13]:
# Load metadata and split files
metadata_df = pd.read_csv(PROCESSED_DIR / 'metadata.csv')
train_files_df = pd.read_csv(PROCESSED_DIR / 'splits/train.csv')
val_files_df = pd.read_csv(PROCESSED_DIR / 'splits/val.csv')
test_files_df = pd.read_csv(PROCESSED_DIR / 'splits/test.csv')

print("Metadata Head:")
print(metadata_df.head())
print("\nTrain Files Head:")
print(train_files_df.head())
print("\nValidation Files Head:")
print(val_files_df.head())
print("\nTest Files Head:")
print(test_files_df.head())

# print unique subjects in each split
print(f"\nUnique subjects in Train: {train_files_df['subject'].nunique()}")
print(f"Unique subjects in Validation: {val_files_df['subject'].nunique()}")
print(f"Unique subjects in Test: {test_files_df['subject'].nunique()}")

# print number of classes of groups in each split
print(f"\nClasses in Train: {train_files_df['group'].unique()}")
print(f"Classes in Validation: {val_files_df['group'].unique()}")
print(f"Classes in Test: {test_files_df['group'].unique()}")

Metadata Head:
           filename code subject  group trial  is_fall  \
0  D01_SA01_R01.txt  D01    SA01  young   R01        0   
1  D02_SA01_R01.txt  D02    SA01  young   R01        0   
2  D03_SA01_R01.txt  D03    SA01  young   R01        0   
3  D04_SA01_R01.txt  D04    SA01  young   R01        0   
4  D05_SA01_R01.txt  D05    SA01  young   R01        0   

                                                path  
0  /Users/minhphan/src/Fall-TSAD/data/raw/sisfall...  
1  /Users/minhphan/src/Fall-TSAD/data/raw/sisfall...  
2  /Users/minhphan/src/Fall-TSAD/data/raw/sisfall...  
3  /Users/minhphan/src/Fall-TSAD/data/raw/sisfall...  
4  /Users/minhphan/src/Fall-TSAD/data/raw/sisfall...  

Train Files Head:
           filename code subject  group trial  is_fall  \
0  D08_SA10_R02.txt  D08    SA10  young   R02        0   
1  D15_SA02_R02.txt  D15    SA02  young   R02        0   
2  D15_SA07_R04.txt  D15    SA07  young   R04        0   
3  D06_SA21_R05.txt  D06    SA21  young   R05        0 

### Automated Checks for Split Composition

In [14]:
# Check 1: Train set contains only young ADL
assert all(train_files_df['group'] == 'young'), "Train set contains non-young participants."
assert all(train_files_df['is_fall'] == 0), "Train set contains fall events."
print("Train set composition: OK (young ADL only)")

# Check 2: Validation set contains young ADL and some young FALLs
assert all(val_files_df['group'] == 'young'), "Validation set contains non-young participants."
assert any(val_files_df['is_fall'] == 1), "Validation set does not contain fall events."
assert any(val_files_df['is_fall'] == 0), "Validation set does not contain ADL events."
print("Validation set composition: OK (young ADL + some young FALLs)")

# Check 3: Test set contains young + elderly ADL + FALLs
assert any(test_files_df['group'] == 'young'), "Test set does not contain young participants."
assert any(test_files_df['group'] == 'elderly'), "Test set does not contain elderly participants."
assert any(test_files_df['is_fall'] == 1), "Test set does not contain fall events."
assert any(test_files_df['is_fall'] == 0), "Test set does not contain ADL events."
print("Test set composition: OK (young + elderly ADL + FALLs)")

# Check 4: No file overlap between splits
train_paths = set(train_files_df['path'])
val_paths = set(val_files_df['path'])
test_paths = set(test_files_df['path'])

assert len(train_paths.intersection(val_paths)) == 0, "Overlap between train and validation sets."
assert len(train_paths.intersection(test_paths)) == 0, "Overlap between train and test sets."
assert len(val_paths.intersection(test_paths)) == 0, "Overlap between validation and test sets."
print("File overlap: OK (no overlap between splits)")

Train set composition: OK (young ADL only)
Validation set composition: OK (young ADL + some young FALLs)
Test set composition: OK (young + elderly ADL + FALLs)
File overlap: OK (no overlap between splits)


## 2. Verify Window Labels for Fall Events (Impact Zone)

In [17]:
# Load segmented data and labels for a fall event from the validation set
# Find a fall event in the validation set
fall_val_entry = val_files_df[val_files_df['is_fall'] == 1].iloc[0]
fall_file_path = fall_val_entry['path']
fall_subject = fall_val_entry['subject']
fall_code = fall_val_entry['code']
fall_trial = fall_val_entry['trial']

print(f"Verifying labeling for fall event: {fall_file_path}")

# Load raw signal (assuming it's in the raw_dir and has a .txt extension)
raw_signal_path = RAW_DIR / f"{fall_subject}/{fall_code}_{fall_subject}_{fall_trial}.txt"
raw_data = np.loadtxt(raw_signal_path, delimiter=',')

# Load corresponding segmented labels (this requires finding the index of this file in the val_data.npy)
# This is a bit tricky as we don't have a direct mapping from file to index in the .npy array.
# For simplicity, we'll re-run the segmentation logic for this specific file to get its labels.
# In a real scenario, you might save a mapping or load the full val_labels.npy and find the relevant section.

# For now, let's assume we can load the entire val_data.npy and val_labels.npy
val_data_segmented = np.load(PROCESSED_DIR / 'val_data.npy')
val_labels_segmented = np.load(PROCESSED_DIR / 'val_labels.npy')

# To properly verify, we need to re-segment this specific file and compare.
# This requires the original `segment_data` and `segment_dataset` logic.
# For a quick check, we can just plot the labels from `val_labels_segmented` if we know the range.

# Let's simplify: we'll just plot the raw data and visually inspect where labels '1' should be.
# To get the actual labels for this specific file, we would need to re-run the segmentation
# or have a more sophisticated way to extract them from the concatenated arrays.

# For a more robust check, we'd need to integrate the `segment_dataset` function here
# or modify `serialize` to save individual file labels before concatenation.

# For now, let's just plot the raw data and the magnitude to identify the impact zone.
magnitude = np.sqrt(np.sum(raw_data**2, axis=1))
impact_idx = np.argmax(magnitude)

plt.figure(figsize=(15, 6))
plt.plot(raw_data[:, 0], label='Acc X')
plt.plot(raw_data[:, 1], label='Acc Y')
plt.plot(raw_data[:, 2], label='Acc Z')
plt.plot(magnitude, label='Magnitude', linestyle='--', color='black')
plt.axvline(x=impact_idx, color='red', linestyle=':', label='Impact Peak')
plt.axvspan(impact_idx - 0.5 * SAMPLING_RATE, impact_idx + 1.0 * SAMPLING_RATE, color='red', alpha=0.2, label='Impact Zone (0.5s pre, 1.0s post)')
plt.title(f'Raw Accelerometer Data and Impact Zone for Fall Event: {fall_code}')
plt.xlabel('Sample Index')
plt.ylabel('Acceleration (g)')
plt.legend()
plt.grid(True)
plt.show()

print("Visual inspection required: Check if the impact zone aligns with the expected fall event.")
print("To fully verify window labels, a more complex setup is needed to map segmented labels back to original files.")

Verifying labeling for fall event: /Users/minhphan/src/Fall-TSAD/data/raw/sisfall/SA22/F05_SA22_R02.txt


ValueError: could not convert string ' 158;' to float64 at row 0, column 9.

### Automated Check for Train Labels (All 0s)

In [None]:
train_labels = np.load(PROCESSED_DIR / 'train_labels.npy')
assert np.all(train_labels == 0), "Train labels contain non-zero values (fall events)."
print("Train labels verification: OK (all 0s)")