# 8. Volume Aggregation from SynthSeg Outputs

## Objective
Aggregate the brain volumes and QC scores produced by SynthSeg for all available datasets.

## Processed Datasets
1. **OASIS-2**: 150 subjects - Longitudinal MRI data
2. **OASIS-3**: 1,376 subjects - Longitudinal neuroimaging dataset
3. **ADNI**: 520 subjects - Alzheimer's Disease Neuroimaging Initiative
4. **IXI**: 346 subjects - Information eXtraction from Images
5. **PPMI**: 160 subjects - Parkinson's Progression Markers Initiative
6. **SRPBS**: 1,410 subjects - Strategic Research Program for Brain Sciences


## Datasets to Be Processed
1. **OASIS-1**: Conversion problems; image reorientation fails
2. **AABC**: Download problems; they are expected to be resolved by 30 November

## Output
For each dataset, a CSV is created in `/data/volumes/` containing:
- **101 volumetric metrics**: subcortical structures, cortical regions (Desikan-Killiany), global measures
- **8 QC scores**: quality control for different brain regions
- **subject_id**: subject identifier

## SynthSeg Structure
```
dataset/derivatives/synthseg/
└── {subject_id}/
    ├── volumes.csv          # Regional volumes (50+ regions)
    ├── qc_scores.csv        # Quality control scores (8 metrics)
    ├── segmentation.nii.gz  # Segmentation mask
    └── resampled.nii.gz     # Resampled T1w image
```

## 1. Setup

In [None]:
import os 
import pandas as pd

## 2. Define Paths

Paths to the `derivatives/synthseg/` folders for each dataset.

**Notes**:
- ADNI: the correct path is `/mnt/db_ext/ADNI_DB/NIFTI_CN/` (not yet reorganized into BIDS derivatives)
- OASIS-1: not included (conversion and orientation problems)

In [None]:
# Paths to SynthSeg derivatives folders
oasis1_path = "/mnt/db_ext/RAW/oasis/OASIS1_BIDS/"  # Not yet processed
oasis2_path = "/mnt/db_ext/RAW/oasis/OASIS2_BIDS/derivatives/synthseg/"
oasis3_path = "/mnt/db_ext/RAW/oasis/OASIS3_BIDS/derivatives/synthseg/"
adni_path = "/mnt/db_ext/ADNI_DB/NIFTI_CN/"  # Original location with processed/freesurfer8/ structure
ixi_path = "/mnt/db_ext/RAW/IXI/derivatives/synthseg/"
ppmi_path = "/mnt/db_ext/RAW/PPMI/nifti/derivatives/synthseg/"
srpb_path = "/mnt/db_ext/RAW/SRPBS_OPEN/SRPBS_BIDS/derivatives/synthseg/"

## 3. Function: Find Volumes and QC Files

This function:
1. Recursively scans the dataset folder
2. Looks for directories containing both `volumes.csv` and `qc_scores.csv`
3. Extracts the `subject_id` from the directory name
4. Returns a DataFrame with the file paths for each subject

In [None]:
def find_volumes_and_qc(path):
    """
    Find all volumes.csv and qc_scores.csv files in a dataset directory.
    
    Args:
        path (str): Path to dataset derivatives/synthseg folder
        
    Returns:
        pd.DataFrame: DataFrame with columns [subject_id, volumes, qc]
    """
    volumes_list = []
    
    # Walk through all subdirectories
    for root, dirs, files in os.walk(path):
        # Check if both required files exist
        if 'volumes.csv' in files and 'qc_scores.csv' in files:
            subject_id = os.path.basename(root)
            volumes_list.append({
                'subject_id': subject_id,
                'volumes': os.path.join(root, 'volumes.csv'),
                'qc': os.path.join(root, 'qc_scores.csv')
            })
    
    return pd.DataFrame(volumes_list)
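
As a quick sanity check, the directory scan can be exercised on a throwaway tree. The sketch below is only an illustration under stated assumptions: the subject names (`sub-001`, `sub-002`) and the helper `find_pairs` are hypothetical, and `find_pairs` is a compact stand-in for the same `os.walk` logic, not a replacement for the function above.

```python
import os
import tempfile

import pandas as pd


def find_pairs(path):
    """Hypothetical stand-in: one row per directory holding both CSV files."""
    rows = []
    for root, _dirs, files in os.walk(path):
        if 'volumes.csv' in files and 'qc_scores.csv' in files:
            rows.append({
                'subject_id': os.path.basename(root),
                'volumes': os.path.join(root, 'volumes.csv'),
                'qc': os.path.join(root, 'qc_scores.csv'),
            })
    return pd.DataFrame(rows)


with tempfile.TemporaryDirectory() as tmp:
    # sub-001 has both files; sub-002 lacks qc_scores.csv and should be skipped
    layout = {
        'sub-001': ['volumes.csv', 'qc_scores.csv'],
        'sub-002': ['volumes.csv'],
    }
    for subject, names in layout.items():
        subject_dir = os.path.join(tmp, subject)
        os.makedirs(subject_dir)
        for name in names:
            # Create empty placeholder files; only their presence matters here
            open(os.path.join(subject_dir, name), 'w').close()

    found = find_pairs(tmp)

print(found['subject_id'].tolist())
```

Note that `os.walk` yields nothing for a path that does not exist, so a typo in a dataset path shows up as an empty DataFrame rather than as an error.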

## 4. Function: Aggregate Volumes and QC

This function:
1. Loads the `volumes.csv` and `qc_scores.csv` files for each subject
2. Adds the `vol_` and `qc_` prefixes to the columns to tell them apart
3. Merges the two DataFrames on the subject column
4. Verifies that the subject IDs match
5. Adds the `subject_id` to the merged row
6. Concatenates all subjects into a single DataFrame

**Output**: A DataFrame with one row per subject and ~110 columns (101 volumes + 8 QC + subject_id)

In [None]:
def aggregate_volumes_and_qc(volumes_df):
    """
    Aggregate volumes and QC scores for all subjects.
    
    Args:
        volumes_df (pd.DataFrame): DataFrame from find_volumes_and_qc()
        
    Returns:
        pd.DataFrame: Aggregated data with columns:
                      - vol_* (101 volume metrics)
                      - qc_* (8 QC scores)
                      - subject_id
    """
    aggregated_data = []
    
    for idx, row in volumes_df.iterrows():
        # Load volumes and QC files
        vol_df = pd.read_csv(row['volumes'])
        qc_df = pd.read_csv(row['qc'])
        
        # Add prefixes to distinguish column types
        vol_df = vol_df.add_prefix('vol_')
        qc_df = qc_df.add_prefix('qc_')
        
        subject_id = row['subject_id']
        
        # Merge volumes and QC on subject column (inner join by default)
        merged_df = pd.merge(vol_df, qc_df, left_on='vol_subject', right_on='qc_subject')
        
        # Verify subject IDs match: the inner join drops rows without a partner,
        # so a mismatch shows up as missing rows rather than as unequal columns
        if len(merged_df) != len(vol_df):
            raise ValueError(f"Mismatch between vol_subject and qc_subject for subject {subject_id}")
        
        # Drop redundant subject columns, then add subject_id as the last column.
        # It stays a regular column (not the index) so that to_csv(index=False) keeps it.
        merged_df = merged_df.drop(columns=['vol_subject', 'qc_subject'])
        merged_df['subject_id'] = subject_id
        
        aggregated_data.append(merged_df)
    
    # Concatenate all subjects, discarding the per-file row index
    return pd.concat(aggregated_data, ignore_index=True)
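
To make the prefix-and-merge step concrete, here is a minimal sketch on two hand-made one-row tables. The column names `left hippocampus` and `general white matter` follow the naming used elsewhere in this notebook; the numeric values and the subject name `sub-001` are invented for illustration.

```python
import pandas as pd

# Hand-made stand-ins for one subject's volumes.csv and qc_scores.csv
vol_df = pd.DataFrame({'subject': ['sub-001'], 'left hippocampus': [3900.0]})
qc_df = pd.DataFrame({'subject': ['sub-001'], 'general white matter': [0.85]})

# Prefix every column, including the shared 'subject' key
vol_df = vol_df.add_prefix('vol_')
qc_df = qc_df.add_prefix('qc_')

# Inner join on the prefixed keys; both key columns survive the merge
merged = pd.merge(vol_df, qc_df, left_on='vol_subject', right_on='qc_subject')

# Drop the two redundant keys and keep a single identifier column
merged = merged.drop(columns=['vol_subject', 'qc_subject'])
merged['subject_id'] = 'sub-001'

print(merged.columns.tolist())
```

Because `add_prefix` also renames the `subject` key, the merge has to name the key separately on each side (`left_on` / `right_on`).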

## 5. Process All Datasets

### 5.1 Find volumes and QC files

In [None]:
# Find volumes and QC files for each dataset
print("Finding volumes and QC files...\n")

oasis2_volumes = find_volumes_and_qc(oasis2_path)
print(f"OASIS-2: {len(oasis2_volumes)} subjects")

oasis3_volumes = find_volumes_and_qc(oasis3_path)
print(f"OASIS-3: {len(oasis3_volumes)} subjects")

srpb_volumes = find_volumes_and_qc(srpb_path)
print(f"SRPBS: {len(srpb_volumes)} subjects")

adni_volumes = find_volumes_and_qc(adni_path)
print(f"ADNI: {len(adni_volumes)} subjects")

ixi_volumes = find_volumes_and_qc(ixi_path)
print(f"IXI: {len(ixi_volumes)} subjects")

ppmi_volumes = find_volumes_and_qc(ppmi_path)
print(f"PPMI: {len(ppmi_volumes)} subjects")

total_subjects = (len(oasis2_volumes) + len(oasis3_volumes) + len(srpb_volumes) + 
                  len(adni_volumes) + len(ixi_volumes) + len(ppmi_volumes))
print(f"\nTotal subjects with SynthSeg: {total_subjects}")

### 5.2 Aggregate volumes and QC scores

In [None]:
# Aggregate volumes and QC for each dataset
print("\nAggregating volumes and QC scores...\n")

oasis2_aggregated = aggregate_volumes_and_qc(oasis2_volumes)
print(f"OASIS-2: {oasis2_aggregated.shape}")

oasis3_aggregated = aggregate_volumes_and_qc(oasis3_volumes)
print(f"OASIS-3: {oasis3_aggregated.shape}")

adni_aggregated = aggregate_volumes_and_qc(adni_volumes)
print(f"ADNI: {adni_aggregated.shape}")

ixi_aggregated = aggregate_volumes_and_qc(ixi_volumes)
print(f"IXI: {ixi_aggregated.shape}")

ppmi_aggregated = aggregate_volumes_and_qc(ppmi_volumes)
print(f"PPMI: {ppmi_aggregated.shape}")

srpb_aggregated = aggregate_volumes_and_qc(srpb_volumes)
print(f"SRPBS: {srpb_aggregated.shape}")

### 5.3 Inspect column structure

We check the column structure for an example dataset.

In [None]:
# Show column structure for ADNI as example
print("\nColumn structure (ADNI example):")
print(f"Total columns: {len(adni_aggregated.columns)}\n")

# Volume columns
vol_cols = [col for col in adni_aggregated.columns if col.startswith('vol_')]
print(f"Volume metrics: {len(vol_cols)}")
print("First 10 volume columns:")
for col in vol_cols[:10]:
    print(f"  - {col}")

# QC columns
qc_cols = [col for col in adni_aggregated.columns if col.startswith('qc_')]
print(f"\nQC scores: {len(qc_cols)}")
for col in qc_cols:
    print(f"  - {col}")
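
Since all datasets go through the same aggregation, their column sets would be expected to match. A possible check is sketched below on toy frames; the dataset labels (`A`, `B`, `C`) and column names (`vol_x`, `qc_y`) are hypothetical placeholders, not real outputs.

```python
import pandas as pd

# Toy frames standing in for the aggregated datasets (hypothetical columns)
frames = {
    'A': pd.DataFrame(columns=['vol_x', 'qc_y', 'subject_id']),
    'B': pd.DataFrame(columns=['vol_x', 'qc_y', 'subject_id']),
    'C': pd.DataFrame(columns=['vol_x', 'subject_id']),  # qc_y is missing
}

# Use the first dataset as the reference column set
reference_name, reference = next(iter(frames.items()))

mismatches = {}
for name, df in frames.items():
    # Symmetric difference: columns present in one frame but not the other
    diff = set(reference.columns) ^ set(df.columns)
    if diff:
        mismatches[name] = sorted(diff)

print(f"Reference: {reference_name}")
print(mismatches)
```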

## 6. Save Aggregated Data

We save one CSV per dataset in the `/data/volumes/` folder.

**IMPORTANT**: The files are saved with `index=False`, so the DataFrame index is not written. `subject_id` therefore reaches the CSV only if it is a regular column at save time; as the last column added during aggregation, it is then the last column of the final CSV.

In [None]:
# Create output directory
output_dir = "/home/mario/Repository/Normal_Alzeihmer/data/volumes/"
os.makedirs(output_dir, exist_ok=True)

print(f"Saving aggregated data to {output_dir}\n")

# Save each dataset
oasis2_aggregated.to_csv(output_dir + "oasis2.csv", index=False)
print(f"✓ Saved: oasis2.csv ({len(oasis2_aggregated)} subjects)")

oasis3_aggregated.to_csv(output_dir + "oasis3.csv", index=False)
print(f"✓ Saved: oasis3.csv ({len(oasis3_aggregated)} subjects)")

adni_aggregated.to_csv(output_dir + "adni.csv", index=False)
print(f"✓ Saved: adni.csv ({len(adni_aggregated)} subjects)")

ixi_aggregated.to_csv(output_dir + "ixi.csv", index=False)
print(f"✓ Saved: ixi.csv ({len(ixi_aggregated)} subjects)")

ppmi_aggregated.to_csv(output_dir + "ppmi.csv", index=False)
print(f"✓ Saved: ppmi.csv ({len(ppmi_aggregated)} subjects)")

srpb_aggregated.to_csv(output_dir + "srpb.csv", index=False)
print(f"✓ Saved: srpb.csv ({len(srpb_aggregated)} subjects)")

print(f"\n✓ Total files saved: 6")
print(f"✓ Total subjects: {total_subjects}")
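
The interaction between `index=False` and `subject_id` noted at the start of this section can be checked on a toy DataFrame. This is a minimal sketch with invented values; it writes to in-memory buffers instead of files.

```python
import io

import pandas as pd

df = pd.DataFrame({'vol_a': [1.0, 2.0],
                   'subject_id': ['sub-001', 'sub-002']})

# Case 1: subject_id is a regular column, so it is written to the CSV
as_column = io.StringIO()
df.to_csv(as_column, index=False)

# Case 2: subject_id is the index, so index=False leaves it out
as_index = io.StringIO()
df.set_index('subject_id').to_csv(as_index, index=False)

header_with_column = as_column.getvalue().splitlines()[0]
header_with_index = as_index.getvalue().splitlines()[0]

print(header_with_column)
print(header_with_index)
```

Reading back one of the saved files and confirming that `subject_id` is among its columns is a cheap way to catch the second case.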

## 7. Summary Statistics

We compute some basic statistics to sanity-check the data.

In [None]:
# Summary statistics for a key region (hippocampus)
print("\nSummary statistics for hippocampal volume (left + right):\n")

datasets = {
    'OASIS-2': oasis2_aggregated,
    'OASIS-3': oasis3_aggregated,
    'ADNI': adni_aggregated,
    'IXI': ixi_aggregated,
    'PPMI': ppmi_aggregated,
    'SRPBS': srpb_aggregated
}

for name, df in datasets.items():
    left_hipp = df['vol_left hippocampus']
    right_hipp = df['vol_right hippocampus']
    total_hipp = left_hipp + right_hipp
    
    print(f"{name}:")
    print(f"  Mean: {total_hipp.mean():.1f} mm³")
    print(f"  Std:  {total_hipp.std():.1f} mm³")
    print(f"  Range: [{total_hipp.min():.1f}, {total_hipp.max():.1f}] mm³")
    print()
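
The same per-dataset summary can also be produced as a single table by stacking the datasets and grouping by dataset name. The sketch below uses two toy frames with invented volumes; the labels `DS-A` and `DS-B` are hypothetical.

```python
import pandas as pd

# Toy frames with invented hippocampal volumes (mm³)
toy = {
    'DS-A': pd.DataFrame({'vol_left hippocampus': [3800.0, 4000.0],
                          'vol_right hippocampus': [3900.0, 4100.0]}),
    'DS-B': pd.DataFrame({'vol_left hippocampus': [3500.0],
                          'vol_right hippocampus': [3600.0]}),
}

# Stack the datasets; the dict keys become the outer index level 'dataset'
stacked = pd.concat(toy, names=['dataset', 'row'])
total_hipp = stacked['vol_left hippocampus'] + stacked['vol_right hippocampus']

# One row of statistics per dataset
summary = total_hipp.groupby(level='dataset').agg(['count', 'mean', 'std', 'min', 'max'])
print(summary)
```

A dataset with a single subject gets `NaN` for `std`, since the sample standard deviation is undefined for one observation.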

## Output Files

The following CSV files were created in `/data/volumes/`:

1. **oasis2.csv** - 150 subjects × 110 columns
2. **oasis3.csv** - 1,376 subjects × 110 columns
3. **adni.csv** - 520 subjects × 110 columns
4. **ixi.csv** - 346 subjects × 110 columns
5. **ppmi.csv** - 160 subjects × 110 columns
6. **srpb.csv** - 1,410 subjects × 110 columns

**Total: 3,962 subjects**

## Column Structure

Each CSV contains:
- **101 `vol_*` columns**: Brain volumes in mm³
  - Subcortical: hippocampus, amygdala, thalamus, caudate, putamen, pallidum, accumbens
  - Cortical: 68 regions (Desikan-Killiany atlas)
  - Global: TIV, white matter, cortex, cerebellum, ventricles, CSF, brainstem
- **8 `qc_*` columns**: Quality control scores (0-1, higher is better)
  - qc_general white matter
  - qc_general grey matter
  - qc_general csf
  - qc_cerebellum
  - qc_brainstem
  - qc_thalamus
  - qc_putamen+pallidum
  - qc_hippocampus+amygdala
- **1 `subject_id` column**: Subject identifier (last column)
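
Because the QC scores lie in 0-1 with higher meaning better, one plausible downstream use of these columns is to exclude subjects whose segmentation looks unreliable. The sketch below is an assumption, not part of this notebook: the cutoff `QC_THRESHOLD` is a hypothetical value that should be chosen for the study at hand, and the data are invented.

```python
import pandas as pd

QC_THRESHOLD = 0.65  # hypothetical cutoff, not taken from this notebook

# Toy rows mimicking the aggregated CSV layout (values are invented)
df = pd.DataFrame({
    'vol_left hippocampus': [3900.0, 3100.0, 4000.0],
    'qc_general white matter': [0.85, 0.40, 0.80],
    'qc_hippocampus+amygdala': [0.80, 0.75, 0.60],
    'subject_id': ['sub-001', 'sub-002', 'sub-003'],
})

# Select every QC column by its prefix
qc_cols = [col for col in df.columns if col.startswith('qc_')]

# Keep a subject only if all of its QC scores reach the cutoff
passed = (df[qc_cols] >= QC_THRESHOLD).all(axis=1)
kept = df[passed]

print(kept['subject_id'].tolist())
```

Requiring all scores to pass is the strictest choice; an analysis focused on one structure could instead threshold only the relevant `qc_*` column.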

