# SisFall Data Preprocessing Pipeline

This notebook demonstrates the complete data preprocessing pipeline for the SisFall dataset, including:

1.  Metadata extraction from filenames.
2.  Subject-wise data splitting into training, validation, and test sets.
3.  Loading raw sensor data and converting it to physical units (g and deg/s).
4.  Normalization (Standardization) using training set statistics.
5.  Segmentation of time-series data into overlapping windows.
6.  Serialization of the processed data, scaler, and metadata for efficient future use.

In [1]:
# Ensure necessary imports and set up paths
import sys
from pathlib import Path
import numpy as np
import pandas as pd

# Add the src directory to the Python path to import custom modules
if str(Path('../src').resolve()) not in sys.path:
    sys.path.insert(0, str(Path('../src').resolve()))

from data.sisfall_paths import RAW_DATA_DIR, PROCESSED_DATA_DIR
from data.sisfall_metadata import build_metadata
from data.sisfall_split import split_data_by_subject
from data.sisfall_loader import load_sisfall_file
from data.sisfall_normalize import StandardScaler
from data.sisfall_segment import segment_dataset
from data.sisfall_serialize import save_processed_data, load_processed_data

# Create processed data directory if it doesn't exist
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)

print(f"Raw data directory: {RAW_DATA_DIR}")
print(f"Processed data will be saved to: {PROCESSED_DATA_DIR}")

Raw data directory: /Users/minhphan/src/Fall-TSAD/data/SisFall/raw
Processed data will be saved to: /Users/minhphan/src/Fall-TSAD/data/SisFall/processed


## Configuration Parameters

In [2]:
WINDOW_SIZE = 600  # samples per window: 3 seconds at 200 Hz
OVERLAP = 300      # samples shared by consecutive windows (50% overlap)
RANDOM_STATE = 42  # For reproducibility of subject split
TRAIN_RATIO = 0.7
VAL_RATIO = 0.15
# Test ratio will be 1 - TRAIN_RATIO - VAL_RATIO
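
As a quick check of these values, the sketch below derives the window duration, the stride between window starts, and the number of windows one recording would yield. It assumes the 200 Hz sampling rate stated in the comment above and that a trailing partial window is dropped; the helper name `count_windows` is illustrative and not part of the project code.

```python
SAMPLING_RATE_HZ = 200  # assumption, taken from the WINDOW_SIZE comment
WINDOW_SIZE = 600
OVERLAP = 300

stride = WINDOW_SIZE - OVERLAP                   # samples between window starts
window_seconds = WINDOW_SIZE / SAMPLING_RATE_HZ  # duration of one window
overlap_fraction = OVERLAP / WINDOW_SIZE         # share of a window reused by the next


def count_windows(n_samples: int, window: int, stride: int) -> int:
    """Number of full windows; a trailing partial window is dropped (assumption)."""
    if n_samples < window:
        return 0
    return (n_samples - window) // stride + 1


print(stride, window_seconds, overlap_fraction)   # 300 3.0 0.5
print(count_windows(19999, WINDOW_SIZE, stride))  # 65
```

Under these assumptions, a 19,999-sample recording (about 100 seconds) produces 65 windows.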

## Step 1: Metadata Extraction and Subject-wise Splitting

First, we extract metadata from all raw data filenames and then split this metadata into training, validation, and test sets based on unique subjects. This ensures that data from any single subject appears in only one split.

In [3]:
print("\n--- Step 1: Building Metadata and Splitting by Subject ---")
metadata_df = build_metadata(RAW_DATA_DIR)
print(f"Total files found: {len(metadata_df)}")
print(f"Unique subjects: {metadata_df['subject'].nunique()}")

train_meta, val_meta, test_meta = split_data_by_subject(
    metadata_df, 
    train_ratio=TRAIN_RATIO, 
    val_ratio=VAL_RATIO, 
    random_state=RANDOM_STATE
)

print(f"Train subjects: {train_meta['subject'].nunique()} (files: {len(train_meta)})")
print(f"Validation subjects: {val_meta['subject'].nunique()} (files: {len(val_meta)})")
print(f"Test subjects: {test_meta['subject'].nunique()} (files: {len(test_meta)})")

print("Example of training metadata:")
display(train_meta.head())


--- Step 1: Building Metadata and Splitting by Subject ---
Total files found: 4505
Unique subjects: 38
Train subjects: 26 (files: 3042)
Validation subjects: 5 (files: 390)
Test subjects: 7 (files: 1073)
Example of training metadata:


| | filename | code | subject | group | trial | is_fall | path |
|---|---|---|---|---|---|---|---|
| 0 | D01_SA01_R01.txt | D01 | SA01 | Adult | R01 | 0 | /Users/minhphan/src/Fall-TSAD/data/SisFall/raw... |
| 1 | D02_SA01_R01.txt | D02 | SA01 | Adult | R01 | 0 | /Users/minhphan/src/Fall-TSAD/data/SisFall/raw... |
| 2 | D03_SA01_R01.txt | D03 | SA01 | Adult | R01 | 0 | /Users/minhphan/src/Fall-TSAD/data/SisFall/raw... |
| 3 | D04_SA01_R01.txt | D04 | SA01 | Adult | R01 | 0 | /Users/minhphan/src/Fall-TSAD/data/SisFall/raw... |
| 4 | D05_SA01_R01.txt | D05 | SA01 | Adult | R01 | 0 | /Users/minhphan/src/Fall-TSAD/data/SisFall/raw... |
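
The implementation of `split_data_by_subject` is not shown in this notebook. A minimal sketch of a subject-wise split, assuming the subjects are shuffled with a seeded generator and the split sizes are truncated to integers, might look like this. The function name `split_by_subject_sketch` and the toy metadata are hypothetical.

```python
import numpy as np
import pandas as pd


def split_by_subject_sketch(meta, train_ratio, val_ratio, random_state):
    """Assign whole subjects to train/val/test so no subject spans two splits."""
    subjects = np.sort(meta["subject"].unique())
    rng = np.random.default_rng(random_state)
    shuffled = rng.permutation(subjects)

    n_train = int(len(shuffled) * train_ratio)  # truncation is an assumption
    n_val = int(len(shuffled) * val_ratio)
    train_ids = shuffled[:n_train]
    val_ids = shuffled[n_train:n_train + n_val]
    test_ids = shuffled[n_train + n_val:]

    subj = meta["subject"]
    return (meta[subj.isin(train_ids)],
            meta[subj.isin(val_ids)],
            meta[subj.isin(test_ids)])


# Toy metadata: 38 hypothetical subjects with 3 files each.
toy = pd.DataFrame({
    "subject": [f"S{i:02d}" for i in range(38) for _ in range(3)],
    "is_fall": 0,
})
train, val, test = split_by_subject_sketch(toy, 0.7, 0.15, random_state=42)
print(train["subject"].nunique(), val["subject"].nunique(), test["subject"].nunique())
```

With 38 subjects, truncation gives 26, 5, and 7 subjects, the same counts reported above. The actual function may still choose the subjects differently.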


## Step 2: Load Data for Each Split with Unit Conversion

Next, we iterate through the metadata for each split, load the corresponding raw data files, and convert the raw sensor counts to physical units (g for acceleration, deg/s for angular velocity) as defined in `sisfall_loader.py`.
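
The conversion itself lives in `sisfall_loader.py` and is not shown here. As a minimal sketch, assume the loader keeps the first six columns of each file (ADXL345 accelerometer, then ITG3200 gyroscope) and applies the scale factors from the SisFall readme: 13 bits over ±16 g and 16 bits over ±2000 deg/s. The function name `counts_to_physical` is illustrative, and the sample counts were back-computed from the first row printed in the Step 2 output, not read from the raw file.

```python
import numpy as np

ACC_G_PER_COUNT = 2 * 16 / 2**13       # ADXL345: +/-16 g over 13 bits
GYRO_DPS_PER_COUNT = 2 * 2000 / 2**16  # ITG3200: +/-2000 deg/s over 16 bits


def counts_to_physical(raw_counts):
    """Scale the first six columns from ADC counts to g and deg/s."""
    raw = np.asarray(raw_counts, dtype=np.float64)[:, :6]
    scale = np.array([ACC_G_PER_COUNT] * 3 + [GYRO_DPS_PER_COUNT] * 3)
    return raw * scale


# Hypothetical counts, back-computed from the printed Step 2 output.
sample_counts = [[17, -179, -99, -18, -504, -352]]
print(counts_to_physical(sample_counts))
```

Under these assumptions, 17 counts map to 0.06640625 g and -18 counts map to about -1.0986 deg/s, which agrees with the first converted row shown for the first training file.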

In [None]:
print("\n--- Step 2: Loading Raw Data with Unit Conversion ---")

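def load_data_for_split(meta_df: pd.DataFrame) -> list[np.ndarray]:
    """Load every file listed in meta_df, converted to physical units."""
    data_list = []
    for _, row in meta_df.iterrows():
        file_path = Path(row['path'])
        try:
            data_list.append(load_sisfall_file(file_path))
        except ValueError as e:
            # A skipped file shifts data_list relative to meta_df, so labels
            # taken from the metadata in Step 4 would no longer line up.
            print(f"Skipping {file_path} due to error: {e}")
    return data_list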

train_raw_data = load_data_for_split(train_meta)
val_raw_data = load_data_for_split(val_meta)
test_raw_data = load_data_for_split(test_meta)

print(f"Loaded {len(train_raw_data)} training data files.")
print(f"Loaded {len(val_raw_data)} validation data files.")
print(f"Loaded {len(test_raw_data)} test data files.")

if train_raw_data:
    print(f"Example train data shape (first file): {train_raw_data[0].shape}")
    print(f"Example train data (first 5 rows of first file):\n{train_raw_data[0][:5]}")


--- Step 2: Loading Raw Data with Unit Conversion ---
Loaded 3042 training data files.
Loaded 390 validation data files.
Loaded 1073 test data files.
Example train data shape (first file): (19999, 6)
Example train data (first 5 rows of first file):
[[ 6.6406250e-02 -6.9921875e-01 -3.8671875e-01 -1.0986328e+00
  -3.0761719e+01 -2.1484375e+01]
 [ 5.8593750e-02 -6.7968750e-01 -3.5156250e-01 -3.2348633e+00
  -3.4667969e+01 -1.8676758e+01]
 [ 3.9062500e-03 -6.8750000e-01 -3.1640625e-01 -5.1269531e+00
  -3.7414551e+01 -1.6540527e+01]
 [-3.9062500e-02 -7.0312500e-01 -3.0078125e-01 -6.3476562e+00
  -3.9489746e+01 -1.3854980e+01]
 [-8.2031250e-02 -7.4609375e-01 -2.4609375e-01 -7.8125000e+00
  -4.1198730e+01 -1.1657715e+01]]
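
Because `load_data_for_split` skips files that raise `ValueError`, its result can be shorter than the metadata, and labels read from the metadata afterwards would then be misaligned. Every file loaded in this run, so the counts agree. One way to guard against the problem is to collect each label together with its array. The sketch below uses a stand-in loader (`fake_loader`) and hypothetical paths purely for illustration.

```python
import numpy as np
import pandas as pd


def load_with_labels(meta_df, loader):
    """Return (arrays, labels) containing only the files that loaded."""
    arrays, labels = [], []
    for _, row in meta_df.iterrows():
        try:
            arrays.append(loader(row["path"]))
        except ValueError as err:
            print(f"Skipping {row['path']}: {err}")
            continue
        labels.append(int(row["is_fall"]))  # appended only on success
    return arrays, labels


def fake_loader(path):
    # Stand-in for load_sisfall_file: fails on one hypothetical file.
    if "bad" in path:
        raise ValueError("malformed file")
    return np.zeros((10, 6))


meta = pd.DataFrame({"path": ["a.txt", "bad.txt", "c.txt"], "is_fall": [0, 1, 1]})
arrays, labels = load_with_labels(meta, fake_loader)
print(len(arrays), labels)  # 2 [0, 1]
```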


## Step 3: Normalization (Standardization)

We fit the `StandardScaler` *only* on the concatenated training data to calculate its mean and standard deviation. These same parameters are then used to transform the training, validation, and test sets to prevent data leakage.

In [6]:
print("\n--- Step 3: Normalizing Data ---")

# Concatenate all training data files to fit the scaler
full_train_data_for_scaler = np.concatenate(train_raw_data, axis=0)

scaler = StandardScaler()
scaler.fit(full_train_data_for_scaler)

print(f"Scaler fitted. Mean: {scaler.mean}, Std: {scaler.std}")

# Apply the transformation to each individual data file in all splits
train_norm_data = [scaler.transform(d) for d in train_raw_data]
val_norm_data = [scaler.transform(d) for d in val_raw_data]
test_norm_data = [scaler.transform(d) for d in test_raw_data]

if train_norm_data:
    print(f"Example normalized train data (first 5 rows of first file):\n{train_norm_data[0][:5]}")


--- Step 3: Normalizing Data ---
Scaler fitted. Mean: [-0.01273254 -0.7045118  -0.09500277 -0.6085865   1.9256511  -0.28996933], Std: [ 0.40913334  0.59208745  0.48417    39.229504   30.267822   25.110033  ]
Example normalized train data (first 5 rows of first file):
[[ 0.19343032  0.00893968 -0.60250735 -0.01249178 -1.0799379  -0.84406126]
 [ 0.17433508  0.04192678 -0.52989596 -0.06694647 -1.2089942  -0.7322487 ]
 [ 0.04066838  0.02873194 -0.4572846  -0.11517777 -1.2997367  -0.64717394]
 [-0.06435545  0.00234226 -0.4250129  -0.14629474 -1.3682979  -0.54022276]
 [-0.16937926 -0.07022937 -0.31206185 -0.1836351  -1.42476    -0.45271727]]
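
The project's `StandardScaler` is a custom class whose source is not shown. A minimal sketch of per-channel standardization fitted on training data only, assuming `mean` and `std` are computed along the sample axis, could look like this. The class name `SketchScaler` and the random data are illustrative.

```python
import numpy as np


class SketchScaler:
    """Per-channel standardization: (x - mean) / std."""

    def fit(self, data):
        self.mean = data.mean(axis=0)
        std = data.std(axis=0)
        self.std = np.where(std == 0, 1.0, std)  # avoid division by zero
        return self

    def transform(self, data):
        return (data - self.mean) / self.std


rng = np.random.default_rng(0)
train_files = [rng.normal(5.0, 2.0, size=(100, 6)) for _ in range(3)]
test_file = rng.normal(5.0, 2.0, size=(50, 6))

# Fit on the concatenated training files only, then reuse the statistics.
sketch_scaler = SketchScaler().fit(np.concatenate(train_files, axis=0))
train_scaled = np.concatenate([sketch_scaler.transform(f) for f in train_files])
test_scaled = sketch_scaler.transform(test_file)

# Arithmetic check against the printed values (first accelerometer channel).
z = (0.06640625 - (-0.01273254)) / 0.40913334
print(round(z, 4))  # 0.1934
```

The last lines reproduce the first normalized value printed above from the raw value in Step 2 and the fitted mean and standard deviation.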


## Step 4: Segmentation

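The normalized data from each split is then segmented into fixed-size, overlapping windows of `WINDOW_SIZE` samples, with consecutive windows sharing `OVERLAP` samples. Each segment inherits the `is_fall` label from its original trial. The labels are read from the metadata in file order, so they line up with the data lists only because no file was skipped in Step 2.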

In [7]:
print("\n--- Step 4: Segmenting Data ---")

# Extract labels corresponding to the order of data files in each split
train_labels_list = train_meta['is_fall'].tolist()
val_labels_list = val_meta['is_fall'].tolist()
test_labels_list = test_meta['is_fall'].tolist()

train_X, train_y = segment_dataset(train_norm_data, train_labels_list, WINDOW_SIZE, OVERLAP)
val_X, val_y = segment_dataset(val_norm_data, val_labels_list, WINDOW_SIZE, OVERLAP)
test_X, test_y = segment_dataset(test_norm_data, test_labels_list, WINDOW_SIZE, OVERLAP)

print(f"Segmented Training Data Shape: {train_X.shape}, Labels Shape: {train_y.shape}")
print(f"Segmented Validation Data Shape: {val_X.shape}, Labels Shape: {val_y.shape}")
print(f"Segmented Test Data Shape: {test_X.shape}, Labels Shape: {test_y.shape}")

print("Example of a segmented window (first window, first 5 rows):\n", train_X[0, :5, :])
print("Example of a segmented label (first label):", train_y[0])


--- Step 4: Segmenting Data ---
Segmented Training Data Shape: (32106, 600, 6), Labels Shape: (32106,)
Segmented Validation Data Shape: (4646, 600, 6), Labels Shape: (4646,)
Segmented Test Data Shape: (10931, 600, 6), Labels Shape: (10931,)
Example of a segmented window (first window, first 5 rows):
 [[ 0.19343032  0.00893968 -0.60250735 -0.01249178 -1.0799379  -0.84406126]
 [ 0.17433508  0.04192678 -0.52989596 -0.06694647 -1.2089942  -0.7322487 ]
 [ 0.04066838  0.02873194 -0.4572846  -0.11517777 -1.2997367  -0.64717394]
 [-0.06435545  0.00234226 -0.4250129  -0.14629474 -1.3682979  -0.54022276]
 [-0.16937926 -0.07022937 -0.31206185 -0.1836351  -1.42476    -0.45271727]]
Example of a segmented label (first label): 0
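
The source of `segment_dataset` is not included in this notebook. A minimal sliding-window sketch, assuming the stride is `window - overlap` and that trailing samples which do not fill a window are dropped, might look like this. The name `segment_sketch` and the synthetic arrays are illustrative.

```python
import numpy as np


def segment_sketch(files, labels, window, overlap):
    """Cut each (n_samples, n_channels) array into overlapping windows."""
    stride = window - overlap
    windows, window_labels = [], []
    for data, label in zip(files, labels):
        # Trailing samples that do not fill a window are dropped (assumption).
        for start in range(0, len(data) - window + 1, stride):
            windows.append(data[start:start + window])
            window_labels.append(label)  # every window inherits the trial label
    return np.stack(windows), np.array(window_labels)


files = [np.zeros((19999, 6)), np.ones((3000, 6))]
X, y = segment_sketch(files, [0, 1], window=600, overlap=300)
print(X.shape, y.shape)  # (74, 600, 6) (74,)
```

Here the 19,999-sample file contributes 65 windows and the 3,000-sample file contributes 9. Labeling every window of a fall trial as a fall is a trial-level choice: windows taken before or after the impact carry the same label.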


## Step 5: Serialization

Finally, the fully processed data (segmented and normalized), their labels, the fitted `StandardScaler` object, and the complete metadata DataFrame are saved to disk. This allows for quick loading in subsequent model training or evaluation scripts without re-running the entire preprocessing pipeline.

In [8]:
print("\n--- Step 5: Saving Processed Data ---")

save_processed_data(
    output_dir=PROCESSED_DATA_DIR,
    train_data=train_X, train_labels=train_y,
    val_data=val_X, val_labels=val_y,
    test_data=test_X, test_labels=test_y,
    scaler=scaler,
    metadata_df=metadata_df
)

print("Data preprocessing pipeline completed and data saved.")


--- Step 5: Saving Processed Data ---
Processed data, scaler, and metadata saved to /Users/minhphan/src/Fall-TSAD/data/SisFall/processed
Data preprocessing pipeline completed and data saved.
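
How `save_processed_data` lays out files on disk is not shown. One plausible layout, sketched under the assumption that arrays go into a compressed `.npz` archive, the scaler is pickled, and the metadata is written as CSV, is given below. The file names `data.npz`, `scaler.pkl`, and `metadata.csv`, and the functions `save_sketch` and `load_sketch`, are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def save_sketch(output_dir, arrays, scaler, metadata_df):
    """Write arrays, scaler, and metadata to hypothetical file names."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    np.savez_compressed(output_dir / "data.npz", **arrays)
    with open(output_dir / "scaler.pkl", "wb") as fh:
        pickle.dump(scaler, fh)
    metadata_df.to_csv(output_dir / "metadata.csv")


def load_sketch(output_dir):
    """Read everything back into one dictionary."""
    output_dir = Path(output_dir)
    with np.load(output_dir / "data.npz") as npz:
        loaded = {name: npz[name] for name in npz.files}
    with open(output_dir / "scaler.pkl", "rb") as fh:
        loaded["scaler"] = pickle.load(fh)
    loaded["metadata_df"] = pd.read_csv(output_dir / "metadata.csv", index_col=0)
    return loaded


arrays = {"train_data": np.zeros((4, 600, 6), dtype=np.float32),
          "train_labels": np.array([0, 0, 1, 1])}
scaler_state = {"mean": np.zeros(6), "std": np.ones(6)}  # stand-in for the scaler
meta = pd.DataFrame({"filename": ["D01_SA01_R01.txt"], "is_fall": [0]})

with tempfile.TemporaryDirectory() as tmp:
    save_sketch(tmp, arrays, scaler_state, meta)
    restored = load_sketch(tmp)

print(restored["train_data"].shape, restored["metadata_df"].shape)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.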


## Optional: Loading Processed Data

You can load the saved data using the `load_processed_data` function.

In [9]:
print("\n--- Optional: Loading Processed Data ---")
loaded_data = load_processed_data(PROCESSED_DATA_DIR)

print("Loaded training data shape:", loaded_data['train_data'].shape)
print("Loaded scaler mean:", loaded_data['scaler'].mean)
display(loaded_data['metadata_df'].head())


--- Optional: Loading Processed Data ---
Loaded training data shape: (32106, 600, 6)
Loaded scaler mean: [-0.01273254 -0.7045118  -0.09500277 -0.6085865   1.9256511  -0.28996933]


| | filename | code | subject | group | trial | is_fall | path |
|---|---|---|---|---|---|---|---|
| 0 | D01_SA01_R01.txt | D01 | SA01 | Adult | R01 | 0 | /Users/minhphan/src/Fall-TSAD/data/SisFall/raw... |
| 1 | D02_SA01_R01.txt | D02 | SA01 | Adult | R01 | 0 | /Users/minhphan/src/Fall-TSAD/data/SisFall/raw... |
| 2 | D03_SA01_R01.txt | D03 | SA01 | Adult | R01 | 0 | /Users/minhphan/src/Fall-TSAD/data/SisFall/raw... |
| 3 | D04_SA01_R01.txt | D04 | SA01 | Adult | R01 | 0 | /Users/minhphan/src/Fall-TSAD/data/SisFall/raw... |
| 4 | D05_SA01_R01.txt | D05 | SA01 | Adult | R01 | 0 | /Users/minhphan/src/Fall-TSAD/data/SisFall/raw... |
