# PD Gait + Voice — Manifest Builder (Jupyter Notebook)


This notebook builds clean **manifests** for your Parkinson's Disease project:

- Parses **gait** (freezing-of-gait, FOG) folders and metadata to produce a PD-only manifest.
- Auto-detects your **voice** dataset layout and builds a manifest (expects PD/Healthy labels).
- Creates **subject-wise splits** so that no subject appears in more than one of train/val/test.
- (Optional) Creates **label-aligned pairs** for multimodal experiments.



## Prerequisites
- Python 3.9+ (Anaconda recommended)
- Packages: `pandas`, `scikit-learn` (installed below if missing)

## Dataset layout (as you shared)
```
C:\Users\muham\_Projects\PD New\data\
├─ gait\
│  ├─ train\{defog, tdcsfog, notype}\*.csv
│  └─ test\{defog, tdcsfog}\*.csv
└─ voice\  (varies: may contain pd/ healthy/ folders, or a CSV manifest, or flat wavs)
```


## Configure paths

In [1]:

from pathlib import Path

# >>>> EDIT THIS if your folder differs <<<<
ROOT = Path(r"C:\Users\muham\_Projects\PD New\data")

GAIT = ROOT / "gait"
VOICE = ROOT / "voice"
OUT = ROOT.parent / "manifests"
OUT.mkdir(parents=True, exist_ok=True)

print("ROOT:", ROOT)
print("GAIT exists:", GAIT.exists())
print("VOICE exists:", VOICE.exists())
print("Manifests ->", OUT)


ROOT: C:\Users\muham\_Projects\PD New\data
GAIT exists: True
VOICE exists: True
Manifests -> C:\Users\muham\_Projects\PD New\manifests


## Install/verify dependencies

In [2]:

import sys, subprocess, importlib

def ensure(pkg, import_name=None):
    try:
        importlib.import_module(import_name or pkg)
        print(f"{import_name or pkg} OK")
    except ImportError:
        print(f"Installing {pkg}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", pkg])
        importlib.import_module(import_name or pkg)
        print(f"{import_name or pkg} installed")

ensure("pandas")
# Package is named 'scikit-learn' on pip, but imported as 'sklearn'
ensure("scikit-learn", import_name="sklearn")


pandas OK
sklearn OK


## Helper functions — build manifests

In [3]:

import os, re
import pandas as pd

def _read_csv_maybe(p: Path):
    return pd.read_csv(p) if p.exists() else None

def build_gait_manifest():
    """
    Build a gait manifest by mapping each CSV (recording) to a subject using
    defog_metadata.csv / tdcsfog_metadata.csv if available.
    All gait samples are PD patients (label=1).
    """
    def normalize_map(df, id_candidates=("id","recording_id","file_id","series_id"), subj_candidates=("subject","subject_id")):
        id_col = next((c for c in df.columns if c.lower() in id_candidates), None)
        sub_col = next((c for c in df.columns if c.lower() in subj_candidates), None)
        if id_col is None:
            raise ValueError("Could not find an ID column in gait metadata")
        if sub_col is None:
            raise ValueError("Could not find a Subject column in gait metadata")
        df = df.rename(columns={id_col:"Id", sub_col:"Subject"})
        return df[["Id","Subject"]]

    m_rows = []
    d_defog = _read_csv_maybe(GAIT/"defog_metadata.csv")
    if d_defog is not None:
        df = normalize_map(d_defog)
        df["source"] = "defog"
        m_rows.append(df)

    d_tdcs = _read_csv_maybe(GAIT/"tdcsfog_metadata.csv")
    if d_tdcs is not None:
        df = normalize_map(d_tdcs)
        df["source"] = "tdcsfog"
        m_rows.append(df)

    id2sub = pd.concat(m_rows, ignore_index=True) if m_rows else pd.DataFrame(columns=["Id","Subject","source"])

    entries = []
    for split in ["train","test"]:
        for src in ["defog","tdcsfog","notype"]:
            d = GAIT/split/src
            if not d.exists():
                continue
            for f in d.glob("*.csv"):
                rid = f.stem
                # default source from folder; override if mapping says otherwise
                source = src
                subject = None
                if not id2sub.empty:
                    row = id2sub[id2sub["Id"]==rid]
                    if not row.empty:
                        subject = str(row.iloc[0]["Subject"])
                        source  = str(row.iloc[0]["source"])
                entries.append({
                    "dataset":"gait",
                    "path": str(f),
                    "recording_id": rid,
                    "subject_id": subject,
                    "source": source,
                    "split": split,
                    "label": 1,  # PD patients
                })

    gf = pd.DataFrame(entries)
    outp = OUT/"gait_manifest.csv"
    gf.to_csv(outp, index=False)
    print(f"[gait] rows={len(gf)}  subjects(with id)={gf['subject_id'].notna().sum()}  -> {outp}")
    return gf

def build_voice_manifest():
    """
    Accepts any of these layouts:
      A) voice/pd/*.wav, voice/healthy/*.wav (also 'hc'/'control')
      B) voice/*.wav where filenames contain tokens: pd/parkinson/hc/healthy/control
      C) voice/*.csv manifest with columns like path,label[,subject_id]
    """
    entries = []

    # C) CSV manifest (pick the largest csv in VOICE)
    csvs = list(VOICE.glob("*.csv"))
    if csvs:
        csvs.sort(key=lambda p: p.stat().st_size, reverse=True)
        df = pd.read_csv(csvs[0])
        lc = {c.lower(): c for c in df.columns}
        path_col = lc.get("path") or lc.get("filepath") or lc.get("file") or lc.get("wav")
        label_col = lc.get("label") or lc.get("class") or lc.get("y")
        sid_col = lc.get("subject_id") or lc.get("subject") or lc.get("id")
        if path_col is None:
            raise ValueError(f"Voice CSV {csvs[0].name} missing a path column (path/file/wav)")
        if label_col is None:
            # Try to infer from filenames
            tmp = df[path_col].astype(str).str.lower()
            guess = tmp.map(lambda s: 1 if ("pd" in s or "parkinson" in s) else (0 if ("hc" in s or "healthy" in s or "control" in s) else None))
            if guess.isna().any():
                raise ValueError("Voice CSV missing label column and filenames do not encode class")
            df["label"] = guess.astype(int)
            label_col = "label"
        df_out = pd.DataFrame({
            "dataset":"voice",
            "path": df[path_col].apply(lambda p: str((VOICE/str(p)).resolve()) if not os.path.isabs(str(p)) else str(p)),
            "subject_id": df[sid_col] if sid_col else None,
            "label": df[label_col].astype(int)
        })
        entries.append(df_out)

    # A) pd/ healthy/ hc/ control/ subfolders
    for name, y in [("pd",1), ("parkinson",1), ("healthy",0), ("hc",0), ("control",0)]:
        d = VOICE/name
        if d.exists():
            rows = [{
                "dataset":"voice",
                "path": str(f.resolve()),
                "subject_id": None,
                "label": y
            } for f in d.rglob("*.wav")]
            if rows:
                entries.append(pd.DataFrame(rows))

    # B) flat wavs with class token in filename
    if not entries:
        wavs = list(VOICE.glob("*.wav"))
        if wavs:
            pat = re.compile(r"(pd|parkinson|hc|healthy|control)", re.I)
            rows = []
            for f in wavs:
                m = pat.search(f.name)
                if m:
                    token = m.group(1).lower()
                    y = 1 if token in ("pd","parkinson") else 0
                    rows.append({
                        "dataset":"voice",
                        "path": str(f.resolve()),
                        "subject_id": None,
                        "label": y
                    })
            if rows:
                entries.append(pd.DataFrame(rows))

    if not entries:
        raise FileNotFoundError("Could not auto-build voice manifest. Place a CSV in /voice or use pd/ and healthy/ folders.")

    vf = pd.concat(entries, ignore_index=True)
    outp = OUT/"voice_manifest.csv"
    vf.to_csv(outp, index=False)
    print(f"[voice] rows={len(vf)}  PD={(vf['label']==1).sum()}  HC={(vf['label']==0).sum()}  -> {outp}")
    return vf
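
The filename heuristics above use plain substring checks (`"pd" in s`), which can misfire: a name such as `updated_take.wav` contains `pd`, and an absolute path under a folder like `PD New` matches for every file. Below is a minimal sketch of a stricter alternative. `infer_label_from_name` is a hypothetical helper, not part of this notebook; it assumes the class token sits in the file name itself and is delimited by non-letters.

```python
import re
from pathlib import PureWindowsPath

# Hypothetical helper (not part of the notebook): accept a class token only
# when it is not glued to other letters, and look at the file name only.
_PD = re.compile(r"(?<![a-z])(pd|parkinsons?)(?![a-z])", re.I)
_HC = re.compile(r"(?<![a-z])(hc|healthy|controls?)(?![a-z])", re.I)

def infer_label_from_name(path_str):
    """Return 1 (PD), 0 (healthy), or None when the name is ambiguous or has no token."""
    # PureWindowsPath splits on both "\\" and "/", so this works on any OS.
    name = PureWindowsPath(str(path_str)).stem
    is_pd = bool(_PD.search(name))
    is_hc = bool(_HC.search(name))
    if is_pd == is_hc:  # neither token found, or both found
        return None
    return 1 if is_pd else 0

print(infer_label_from_name(r"C:\data\PD New\voice\hc_001.wav"))  # 0
print(infer_label_from_name("voice/pd_12.wav"))                   # 1
print(infer_label_from_name("updated_take.wav"))                  # None
```

Returning `None` for ambiguous names keeps the existing behaviour of raising an error when some rows cannot be labelled, rather than guessing.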


## Subject-wise split & optional label-aligned pairs

In [4]:

from sklearn.model_selection import GroupShuffleSplit
import pandas as pd

def subjectwise_split(df, group_col, test_size=0.2, val_size=0.2, seed=7):
    # Adds a "split2" column (train/val/test); each group lands in exactly one split.
    # Note: val_size is a fraction of the groups left after the test split.
    df = df.copy()
    if group_col not in df or df[group_col].isna().all():
        # Fall back to file-level grouping if subjects are unknown
        df["__group"] = df.index.astype(str)
        group_col = "__group"
    gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    # split() yields positional indices, so map them to index labels before using .loc
    test_pos = next(gss.split(df, groups=df[group_col]))[1]
    df["split2"] = "train"
    df.loc[df.index[test_pos], "split2"] = "test"
    # Now split remaining into train/val
    rest = df[df["split2"]=="train"]
    gss2 = GroupShuffleSplit(n_splits=1, test_size=val_size, random_state=seed)
    val_idx = rest.index[next(gss2.split(rest, groups=rest[group_col]))[1]]
    df.loc[val_idx, "split2"] = "val"
    # Drop the helper column created by the file-level fallback, if any
    return df.drop(columns="__group", errors="ignore")

def make_label_aligned_pairs(gait_df, voice_df, max_pairs_per_gait=1, seed=13):
    """
    Pair each gait sample (PD=1) with a random voice sample of the same label.
    Healthy has no gait; you can keep voice-only rows for HC when training fusion models.
    """
    import numpy as np
    rng = np.random.default_rng(seed)
    rows = []
    for _, g in gait_df.iterrows():
        y = int(g["label"])
        cand = voice_df[voice_df["label"]==y]
        if cand.empty:
            continue
        pick = cand.sample(n=min(max_pairs_per_gait, len(cand)), random_state=rng.integers(0, 2**32-1))
        for _, v in pick.iterrows():
            rows.append({
                "gait_path": g["path"],
                "voice_path": v["path"],
                "label": y,
                "gait_subject": g.get("subject_id"),
                "voice_subject": v.get("subject_id"),
                "source": g.get("source")
            })
    pair_df = pd.DataFrame(rows)
    pair_df.to_csv(OUT/"pairs_label_aligned.csv", index=False)
    print(f"[pairs] rows={len(pair_df)} -> {OUT/'pairs_label_aligned.csv'}")
    return pair_df
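
A quick way to confirm that a split is really subject-wise is to count how many distinct splits each subject appears in. The sketch below uses a hypothetical helper, `find_leaking_groups`, and toy data; the column names `subject_id` and `split2` follow the manifests in this notebook. Since `GroupShuffleSplit` sizes its splits in groups rather than rows, the defaults above should put roughly 20% of subjects in test and about 16% (0.2 of the remaining 80%) in val.

```python
import pandas as pd

def find_leaking_groups(df, group_col="subject_id", split_col="split2"):
    """Return the groups that appear in more than one split (empty if the split is clean)."""
    per_group = df.dropna(subset=[group_col]).groupby(group_col)[split_col].nunique()
    return per_group[per_group > 1]

# Toy data: subject "b" leaks across train and test.
toy = pd.DataFrame({
    "subject_id": ["a", "a", "b", "b", "c"],
    "split2":     ["train", "train", "train", "test", "val"],
})
leaks = find_leaking_groups(toy)
print(list(leaks.index))  # ['b']
```

On the real manifests, `find_leaking_groups(gait_splits)` and `find_leaking_groups(voice_splits)` should both come back empty.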


## Hotfix: voice manifest for feature-only CSVs

In [7]:

# HOTFIX: replace build_voice_manifest to support feature-only CSVs (e.g., pd_speech_features.csv)

def build_voice_manifest():
    import os, re
    import pandas as pd

    entries = []

    # --- Look for a CSV in VOICE ---
    csvs = list(VOICE.glob("*.csv"))
    if csvs:
        csvs.sort(key=lambda p: p.stat().st_size, reverse=True)
        src_csv = csvs[0]  # pick the largest CSV (likely pd_speech_features.csv)
        df = pd.read_csv(src_csv)
        lc = {c.lower(): c for c in df.columns}

        path_col  = lc.get("path") or lc.get("filepath") or lc.get("file") or lc.get("wav")
        label_col = lc.get("label") or lc.get("class") or lc.get("y") or lc.get("status")
        sid_col   = lc.get("subject_id") or lc.get("subject") or lc.get("id") or lc.get("name") or lc.get("filename")

        if path_col is None:
            # Feature-only CSV -> create synthetic paths and save features table
            if label_col is None:
                raise ValueError(
                    f"Voice CSV {src_csv.name} has no file path and no label column. "
                    f"Expected one of: label / class / y / status"
                )

            def to_y(v):
                s = str(v).strip().lower()
                if s in ("1","pd","parkinson","true","yes","patient","parkinson's disease"):
                    return 1
                if s in ("0","hc","healthy","control","false","no"):
                    return 0
                try:
                    return 1 if float(s) > 0 else 0
                except Exception:
                    return None

            y = df[label_col].map(to_y)
            if y.isna().any():
                bad = df[label_col][y.isna()].unique()[:5]
                raise ValueError(f"Could not coerce some labels to 0/1 (examples: {bad}).")

            if sid_col:
                keys = df[sid_col].astype(str)
            else:
                keys = pd.Series([f"row_{i}" for i in range(len(df))])

            # Minimal manifest with synthetic key-paths
            vf = pd.DataFrame({
                "dataset": "voice",
                "path": keys.map(lambda s: f"features://{s}"),
                "subject_id": df[sid_col] if sid_col else None,
                "label": y.astype(int)
            })
            entries.append(vf)

            # Save full features for training later
            features_out = OUT / "voice_features.csv"
            df.to_csv(features_out, index=False)
            print(f"[voice:features] saved full feature table -> {features_out}")
        else:
            # Path-based manifest
            if label_col is None:
                tmp = df[path_col].astype(str).str.lower()
                guess = tmp.map(lambda s: 1 if ("pd" in s or "parkinson" in s)
                                else (0 if ("hc" in s or "healthy" in s or "control" in s) else None))
                if guess.isna().any():
                    raise ValueError(f"{src_csv.name} missing label column and filenames do not encode class")
                df["label"] = guess.astype(int)
                label_col = "label"

            vf = pd.DataFrame({
                "dataset":"voice",
                "path": df[path_col].apply(lambda p: str((VOICE/str(p)).resolve()) if not os.path.isabs(str(p)) else str(p)),
                "subject_id": df[sid_col] if sid_col else None,
                "label": df[label_col].astype(int)
            })
            entries.append(vf)

    # Folders voice/pd and voice/healthy (optional)
    for name, yval in [("pd",1), ("parkinson",1), ("healthy",0), ("hc",0), ("control",0)]:
        d = VOICE/name
        if d.exists():
            rows = [{
                "dataset":"voice",
                "path": str(f.resolve()),
                "subject_id": None,
                "label": yval
            } for f in d.rglob("*.wav")]
            if rows:
                entries.append(pd.DataFrame(rows))

    # Flat wavs with class token (optional)
    if not entries:
        wavs = list(VOICE.glob("*.wav"))
        if wavs:
            pat = re.compile(r"(pd|parkinson|hc|healthy|control)", re.I)
            rows = []
            for f in wavs:
                m = pat.search(f.name)
                if m:
                    token = m.group(1).lower()
                    yval = 1 if token in ("pd","parkinson") else 0
                    rows.append({"dataset":"voice","path":str(f.resolve()),"subject_id":None,"label":yval})
            if rows:
                entries.append(pd.DataFrame(rows))

    if not entries:
        raise FileNotFoundError(
            "Could not build voice manifest. Provide either:\n"
            " - voice/pd and voice/healthy folders of wavs, OR\n"
            " - a CSV with columns path,label[,subject_id], OR\n"
            " - a feature CSV (e.g., pd_speech_features.csv) with label/status and optional subject/name."
        )

    vf = pd.concat(entries, ignore_index=True)
    outp = OUT / "voice_manifest.csv"
    vf.to_csv(outp, index=False)
    print(f"[voice] rows={len(vf)}  PD={(vf['label']==1).sum()}  HC={(vf['label']==0).sum()}  -> {outp}")
    return vf
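
For a feature-only CSV, the synthetic `features://<key>` paths are built from the subject column, so they are not unique when a subject has several recordings and cannot serve as a join key. One option, sketched below, is to copy the split column onto the feature table by row position. This assumes the manifest was built from the feature CSV alone (no extra wav folders appended), so both tables have the same rows in the same order. `attach_splits` and the toy column names `id`, `f1`, and `class` are hypothetical.

```python
import pandas as pd

def attach_splits(features, splits, split_col="split2"):
    """Copy the split column onto the feature table by row position."""
    if len(features) != len(splits):
        raise ValueError(
            f"row count mismatch: {len(features)} feature rows vs {len(splits)} manifest rows"
        )
    out = features.reset_index(drop=True).copy()
    out[split_col] = splits[split_col].to_numpy()
    return out

# Toy stand-ins for voice_features.csv and voice_manifest_splits.csv
features = pd.DataFrame({"id": [0, 0, 1], "f1": [0.1, 0.2, 0.3], "class": [1, 1, 0]})
splits = pd.DataFrame({
    "path": ["features://0", "features://0", "features://1"],
    "subject_id": [0, 0, 1],
    "label": [1, 1, 0],
    "split2": ["train", "train", "test"],
})
train = attach_splits(features, splits).query("split2 == 'train'")
print(len(train))  # 2
```

The length check is a cheap guard; comparing the subject column of both tables row by row would be a stronger one.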


## Run the pipeline

In [8]:

# Build manifests
gait_manifest = build_gait_manifest()
voice_manifest = build_voice_manifest()

# Summaries
print("\n[gait] by split,source")
print(gait_manifest.groupby(["split","source"]).size())

print("\n[voice] class counts")
print(voice_manifest["label"].value_counts())

# Subject-wise splits
gait_splits = subjectwise_split(gait_manifest, "subject_id")
voice_splits = subjectwise_split(voice_manifest, "subject_id")

gait_splits.to_csv(OUT/"gait_manifest_splits.csv", index=False)
voice_splits.to_csv(OUT/"voice_manifest_splits.csv", index=False)

print(f"\nSaved: {OUT/'gait_manifest.csv'}")
print(f"       {OUT/'voice_manifest.csv'}")
print(f"       {OUT/'gait_manifest_splits.csv'}")
print(f"       {OUT/'voice_manifest_splits.csv'}")

# Optional: label-aligned pairs (mostly useful for PD multimodal experiments)
pairs = make_label_aligned_pairs(gait_manifest, voice_manifest)


[gait] rows=972  subjects(with id)=972  -> C:\Users\muham\_Projects\PD New\manifests\gait_manifest.csv
[voice:features] saved full feature table -> C:\Users\muham\_Projects\PD New\manifests\voice_features.csv
[voice] rows=756  PD=564  HC=192  -> C:\Users\muham\_Projects\PD New\manifests\voice_manifest.csv

[gait] by split,source
split  source 
test   defog        1
       tdcsfog      1
train  defog      137
       tdcsfog    833
dtype: int64

[voice] class counts
label
1    564
0    192
Name: count, dtype: int64

Saved: C:\Users\muham\_Projects\PD New\manifests\gait_manifest.csv
       C:\Users\muham\_Projects\PD New\manifests\voice_manifest.csv
       C:\Users\muham\_Projects\PD New\manifests\gait_manifest_splits.csv
       C:\Users\muham\_Projects\PD New\manifests\voice_manifest_splits.csv
[pairs] rows=972 -> C:\Users\muham\_Projects\PD New\manifests\pairs_label_aligned.csv
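
The voice classes above are imbalanced (564 PD vs 192 HC rows). One common remedy is inverse-frequency class weighting, `weight_c = n_rows / (n_classes * n_rows_in_c)`. The sketch below works through that arithmetic for the counts printed above; how the weights are passed to a model depends on the training code, which this notebook does not define.

```python
import numpy as np

# Label vector reconstructed from the printed counts (1 = PD, 0 = HC)
labels = np.array([1] * 564 + [0] * 192)

classes, counts = np.unique(labels, return_counts=True)
weights = len(labels) / (len(classes) * counts)
class_weight = {int(c): round(float(w), 2) for c, w in zip(classes, weights)}
print(class_weight)  # {0: 1.97, 1: 0.67}
```

These weights are computed per recording. Because several rows share a subject, counting per subject instead may be more appropriate, depending on how the model is evaluated.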


## Sanity checks (preview heads)

In [9]:

import pandas as pd
from pathlib import Path

def preview(path, n=5):
    try:
        df = pd.read_csv(path)
        print(f"\n{path.name}  rows={len(df)}")
        display(df.head(n))
    except Exception as e:
        print(f"Could not preview {path}: {e}")

preview(OUT/"gait_manifest.csv")
preview(OUT/"voice_manifest.csv")
preview(OUT/"gait_manifest_splits.csv")
preview(OUT/"voice_manifest_splits.csv")
preview(OUT/"pairs_label_aligned.csv")



gait_manifest.csv  rows=972


| | dataset | path | recording_id | subject_id | source | split | label |
|---|---|---|---|---|---|---|---|
| 0 | gait | `C:\Users\muham\_Projects\PD New\data\gait\trai...` | 02ea782681 | ae2d35 | defog | train | 1 |
| 1 | gait | `C:\Users\muham\_Projects\PD New\data\gait\trai...` | 06414383cf | 8c1f5e | defog | train | 1 |
| 2 | gait | `C:\Users\muham\_Projects\PD New\data\gait\trai...` | 092b4c1819 | 2874c5 | defog | train | 1 |
| 3 | gait | `C:\Users\muham\_Projects\PD New\data\gait\trai...` | 0c55be4384 | 1fb9cd | defog | train | 1 |
| 4 | gait | `C:\Users\muham\_Projects\PD New\data\gait\trai...` | 0d7ab3a9f9 | 8c1f5e | defog | train | 1 |



voice_manifest.csv  rows=756


| | dataset | path | subject_id | label |
|---|---|---|---|---|
| 0 | voice | `features://0` | 0 | 1 |
| 1 | voice | `features://0` | 0 | 1 |
| 2 | voice | `features://0` | 0 | 1 |
| 3 | voice | `features://1` | 1 | 1 |
| 4 | voice | `features://1` | 1 | 1 |



gait_manifest_splits.csv  rows=972


| | dataset | path | recording_id | subject_id | source | split | label | split2 |
|---|---|---|---|---|---|---|---|---|
| 0 | gait | `C:\Users\muham\_Projects\PD New\data\gait\trai...` | 02ea782681 | ae2d35 | defog | train | 1 | test |
| 1 | gait | `C:\Users\muham\_Projects\PD New\data\gait\trai...` | 06414383cf | 8c1f5e | defog | train | 1 | train |
| 2 | gait | `C:\Users\muham\_Projects\PD New\data\gait\trai...` | 092b4c1819 | 2874c5 | defog | train | 1 | test |
| 3 | gait | `C:\Users\muham\_Projects\PD New\data\gait\trai...` | 0c55be4384 | 1fb9cd | defog | train | 1 | val |
| 4 | gait | `C:\Users\muham\_Projects\PD New\data\gait\trai...` | 0d7ab3a9f9 | 8c1f5e | defog | train | 1 | train |



voice_manifest_splits.csv  rows=756


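| | dataset | path | subject_id | label | split2 |
|---|---|---|---|---|---|
| 0 | voice | `features://0` | 0 | 1 | train |
| 1 | voice | `features://0` | 0 | 1 | train |
| 2 | voice | `features://0` | 0 | 1 | train |
| 3 | voice | `features://1` | 1 | 1 | train |
| 4 | voice | `features://1` | 1 | 1 | train |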



pairs_label_aligned.csv  rows=972


| | gait_path | voice_path | label | gait_subject | voice_subject | source |
|---|---|---|---|---|---|---|
| 0 | `C:\Users\muham\_Projects\PD New\data\gait\trai...` | `features://177` | 1 | ae2d35 | 177 | defog |
| 1 | `C:\Users\muham\_Projects\PD New\data\gait\trai...` | `features://4` | 1 | 8c1f5e | 4 | defog |
| 2 | `C:\Users\muham\_Projects\PD New\data\gait\trai...` | `features://220` | 1 | 2874c5 | 220 | defog |
| 3 | `C:\Users\muham\_Projects\PD New\data\gait\trai...` | `features://218` | 1 | 1fb9cd | 218 | defog |
| 4 | `C:\Users\muham\_Projects\PD New\data\gait\trai...` | `features://202` | 1 | 8c1f5e | 202 | defog |



## Next steps
- Train **VoiceNet** (PD vs Healthy) using the `split2` column of `voice_manifest_splits.csv`; for a feature-only CSV, the model inputs live in `voice_features.csv`.
- Train **GaitNet** (FOG/severity or other available labels) on `gait_manifest_splits.csv`.
- Fuse calibrated probabilities later with a simple logistic meta-learner on your validation set.
- If any metadata column names differ, tweak the heuristics in `build_gait_manifest()` or provide a voice CSV with explicit `path,label[,subject_id]`.
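
The fusion step mentioned above could look roughly like the sketch below: a logistic regression fitted on the logits of two per-model probabilities. Everything here is an illustration on synthetic data. `p_voice` and `p_gait` stand in for calibrated validation-set outputs of models that do not exist yet, and because the gait manifest contains PD subjects only, a real gait score for healthy subjects would have to come from another source.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
y_val = rng.integers(0, 2, size=n)

# Synthetic stand-ins for calibrated validation probabilities from each model
p_voice = np.clip(0.5 + 0.3 * (2 * y_val - 1) + rng.normal(0, 0.20, n), 0.01, 0.99)
p_gait = np.clip(0.5 + 0.2 * (2 * y_val - 1) + rng.normal(0, 0.25, n), 0.01, 0.99)

def logit(p):
    """Map probabilities to log-odds so the meta-learner works on an unbounded scale."""
    return np.log(p / (1 - p))

# One row per validation sample, one column per base model
X_val = np.column_stack([logit(p_voice), logit(p_gait)])
meta = LogisticRegression().fit(X_val, y_val)

p_fused = meta.predict_proba(X_val)[:, 1]
print(meta.coef_.shape, p_fused.shape)  # (1, 2) (200,)
```

Fitting the meta-learner on the validation split and reporting results on the held-out test split keeps the fusion weights from being tuned on the data used for the final evaluation.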
