# 02 EEG Preprocess Raw Data

## Overview
This notebook performs the **initial loading and preprocessing** of raw EEG data files, preparing them for quality control and analysis.

**Purpose:**
- Load raw EEG CSV files containing voltage measurements from 64 channels
- Synchronize EEG timestamps with experimental event timing
- Clean and standardize the data format for downstream processing

**What it does:**
1. Loads the session mapping (from Notebook 01) to match EEG files with experiment files
2. For each matched session:
   - Reads raw EEG CSV data with time and voltage values for all channels
   - Normalizes the LSL timestamps to start from zero, then re-anchors them to the earliest value in the `Time` column
   - Handles timing synchronization between EEG recording and experiment start
   - Snaps timestamps to a uniform 500 Hz grid (0.002 s intervals)
   - Removes duplicate timestamps
   - Reindexes onto the gap-free grid and linearly interpolates missing channel values
3. Saves cleaned data as pickle files for fast loading in subsequent notebooks
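
The snapping, de-duplication, resampling, and interpolation steps listed above can be sketched on a few made-up samples. This is a minimal illustration, not the notebook's own `load_eeg` function: the column name `Fz` and all values are invented, and the sketch uses integer sample indices with timezone-naive timestamps, whereas the notebook works with float seconds and timezone-aware datetimes.

```python
import pandas as pd

PERIOD_S = 0.002  # 500 Hz sampling -> one sample every 2 ms

# Made-up samples: jittered timestamps, two that land on the same grid
# point, and a gap between 0.004 s and 0.010 s.
toy = pd.DataFrame({
    "TimeRaw": [0.0000, 0.0021, 0.0039, 0.0041, 0.0100],
    "Fz": [1.0, 2.0, 3.0, 3.5, 6.0],
})

# Snap each timestamp to the nearest grid point, stored as an integer
# sample index to sidestep floating-point comparisons.
toy["sample"] = (toy["TimeRaw"] / PERIOD_S).round().astype(int)
toy["Time"] = pd.to_datetime(toy["sample"] * 2, unit="ms")

# Keep the first row per grid point, then expand to a gap-free 2 ms grid.
grid = toy.drop_duplicates("sample").set_index("Time")[["Fz"]].asfreq("2ms")
n_missing = int(grid["Fz"].isna().sum())

# Fill the grid points that had no recorded sample.
grid["Fz"] = grid["Fz"].interpolate(method="linear")

print(f"{len(grid)} rows, {n_missing} interpolated")
print(grid["Fz"].tolist())
```

On this toy input the grid ends up with six rows, two of which are filled by interpolation. Counting the missing rows before interpolating, as done here, is one way to see how much of a recording is reconstructed rather than measured.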

**Output:**
- `session_00-EEG-raw.pkl`, `session_01-EEG-raw.pkl`, etc.
- Each file contains time-aligned EEG data ready for quality control
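
A later notebook can read these files back with `pd.read_pickle`. The sketch below round-trips a tiny made-up DataFrame through a temporary directory; the helper `session_pickle_path` is hypothetical and only mirrors the file-name pattern shown above.

```python
import tempfile
from pathlib import Path

import pandas as pd


def session_pickle_path(session_id, out_dir="./preprocessed"):
    """Hypothetical helper: build the pickle path for a 0-based session ID."""
    return Path(out_dir) / f"session_{session_id:02d}-EEG-raw.pkl"


with tempfile.TemporaryDirectory() as tmp:
    # Made-up stand-in for a preprocessed session.
    toy = pd.DataFrame({
        "Time": pd.date_range("2025-01-01", periods=3, freq="2ms"),
        "Fz": [0.1, 0.2, 0.3],
    })
    path = session_pickle_path(0, out_dir=tmp)
    toy.to_pickle(path)

    # Reading the pickle restores dtypes (including datetimes) unchanged.
    restored = pd.read_pickle(path)

print(path.name)
print(restored.dtypes)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.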

**Next step:** Run Notebook 03 to identify bad channels using RANSAC.

**Code Attribution:**
- Original EEG preprocessing code adapted from: Chiossi, F., Mayer, S., & Ou, C. (2024). MobileHCI 2024 Papers - Submission 7226.
- OSF Repository: https://osf.io/fncj4/overview (Created: Sep 11, 2023)
- License: GNU General Public License (GPL) 3.0
- Code has been modified for this study's session-based structure and experimental design.

## 1. Import Libraries

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm
import mne

from mne_icalabel import label_components

from mne.io import concatenate_raws, read_raw_edf
from mne.time_frequency import tfr_multitaper
from mne.stats import permutation_cluster_1samp_test as pcluster_test
import datetime
import pyprep
from autoreject import get_rejection_threshold

from tqdm.notebook import trange, tqdm
import pickle

from multiprocessing import Pool

from collections import Counter
import time

import pytz
tz = pytz.timezone("Europe/Berlin")

# Helper function for loading combined EEG segments
def load_eeg_data_with_combined_segments(eeg_file_spec, data_dir='./Data'):
    """
    Load EEG data, automatically combining multiple segments if needed.
    
    Args:
        eeg_file_spec (str): Either a single filename or multiple filenames joined by ' + '
                            (e.g., 'EEG_data_1.csv' or 'EEG_data_1.csv + EEG_data_2.csv')
        data_dir (str): Directory containing EEG files
    
    Returns:
        pd.DataFrame: Combined EEG data
    """
    eeg_files = [f.strip() for f in eeg_file_spec.split('+')]
    
    if len(eeg_files) == 1:
        # Single file - load normally
        df = pd.read_csv(f'{data_dir}/{eeg_files[0]}')
        print(f"    ✓ Single EEG file: {eeg_files[0]}")
        return df
    else:
        # Multiple files - concatenate them
        print(f"    ⚠️  Combining {len(eeg_files)} EEG segments:")
        dfs = []
        for i, eeg_file in enumerate(eeg_files, 1):
            df_segment = pd.read_csv(f'{data_dir}/{eeg_file}')
            print(f"      [{i}/{len(eeg_files)}] {eeg_file}: {len(df_segment)} rows")
            dfs.append(df_segment)
        
        # Combine all segments
        df_combined = pd.concat(dfs, ignore_index=True)
        print(f"    ✓ Combined: {len(df_combined)} total rows")
        
        return df_combined

print("✓ Helper functions loaded")

✓ Helper functions loaded
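
To make the `' + '` file-spec convention concrete, here is a small self-contained sketch that splits a spec and concatenates two made-up CSV segments. The function name `split_eeg_file_spec` and the file names are illustrative assumptions; only the splitting and `pd.concat(..., ignore_index=True)` logic follows the helper above.

```python
import tempfile
from pathlib import Path

import pandas as pd


def split_eeg_file_spec(eeg_file_spec):
    """Split 'a.csv + b.csv' into ['a.csv', 'b.csv'] (illustrative name)."""
    return [name.strip() for name in eeg_file_spec.split("+")]


with tempfile.TemporaryDirectory() as tmp:
    # Two made-up segments of one recording.
    Path(tmp, "seg_a.csv").write_text("Time,Fz\n0.000,1.0\n0.002,2.0\n")
    Path(tmp, "seg_b.csv").write_text("Time,Fz\n0.004,3.0\n")

    names = split_eeg_file_spec("seg_a.csv + seg_b.csv")
    segments = [pd.read_csv(Path(tmp, name)) for name in names]

    # ignore_index=True renumbers rows 0..n-1 across all segments.
    combined = pd.concat(segments, ignore_index=True)

print(names)
print(combined)
```

Because the spec is split on every `+`, a file name that itself contains a plus sign would be broken apart; the convention assumes no such names occur.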


## 2. Define Channel Names and Load Session Mapping

In [2]:
# All 64 EEG channels (excluding Time, TimeLsl, and accelerometer data)
chan_names = ['Fp1', 'Fz', 'F3', 'F7', 'F9', 'FC5', 'FC1', 'C3', 'T7', 'CP5', 'CP1', 'Pz', 'P3', 'P7', 'P9', 'O1', 'Oz', 'O2', 'P10', 'P8', 'P4', 'CP2', 'CP6', 'T8', 'C4', 'Cz', 'FC2', 'FC6', 'F10', 'F8', 'F4', 'Fp2', 'AF7', 'AF3', 'AFz', 'F1', 'F5', 'FT7', 'FC3', 'C1', 'C5', 'TP7', 'CP3', 'P1', 'P5', 'PO7', 'PO3', 'Iz', 'POz', 'PO4', 'PO8', 'P6', 'P2', 'CPz', 'CP4', 'TP8', 'C6', 'C2', 'FC4', 'FT8', 'F6', 'F2', 'AF4', 'AF8']

# Load session mapping
df_sessions = pd.read_csv('./session_mapping.csv')
df_matched = df_sessions[df_sessions['eeg_file'] != 'NO MATCH'].copy()

print(f"Found sessions: {len(df_matched)}")
print(f"Channel names: {len(chan_names)}")
print("\nSessions to process:")
print(df_matched[['experiment_file', 'eeg_file', 'time_offset_min']])


Found sessions: 13
Channel names: 64

Sessions to process:
                                      experiment_file  \
0   01_human-llm-alignment_2025-11-17_11h36.44.912...   
1   01_human-llm-alignment_2025-11-20_13h16.37.791...   
4   01_human-llm-alignment_2025-11-24_14h13.05.529...   
5   01_human-llm-alignment_2025-11-27_09h44.35.888...   
6   01_human-llm-alignment_2025-11-27_10h18.29.349...   
7   01_human-llm-alignment_2025-11-27_10h49.01.727...   
9   01_human-llm-alignment_2025-11-27_12h50.01.880...   
10  01_human-llm-alignment_2025-12-01_09h17.38.489...   
11  01_human-llm-alignment_2025-12-01_10h17.52.885...   
12  01_human-llm-alignment_2025-12-01_12h45.09.945...   
13  01_human-llm-alignment_2025-12-01_14h10.59.014...   
14  01_human-llm-alignment_2025-12-01_16h21.26.160...   
15  01_human-llm-alignment_2025-12-02_13h26.02.964...   

                                             eeg_file  time_offset_min  
0   EEG_data_1763373596.csv + EEG_data_1763373596t...        38.78003
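
The row labels in the listing above (0, 1, 4, 5, ...) keep gaps where unmatched sessions were filtered out. The following sketch reproduces that filter on a made-up mapping and shows how `enumerate` yields consecutive 0-based positions alongside the original labels. The file names are invented; only the column names and the `'NO MATCH'` marker come from the notebook.

```python
import pandas as pd

# Made-up mapping; the column names follow the notebook, the file names do not.
mapping = pd.DataFrame({
    "experiment_file": ["exp_a.csv", "exp_b.csv", "exp_c.csv", "exp_d.csv"],
    "eeg_file": ["eeg_a.csv", "NO MATCH", "NO MATCH", "eeg_d.csv"],
})

# Keep only sessions that have an EEG recording.
matched = mapping[mapping["eeg_file"] != "NO MATCH"].copy()

# Row labels keep their gaps; enumerate() yields consecutive positions.
position_to_label = {
    position: label
    for position, (label, _row) in enumerate(matched.iterrows())
}

print(matched)
print(position_to_label)
```

If consecutive positions are used as session IDs, keeping a table like `position_to_label` makes it possible to trace an output file back to its row in the mapping.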

## 3. Load Raw EEG Data

In [3]:
def load_eeg(session_id, eeg_file, freq=0.002):
    """Load EEG data for a session. Handles both single and combined EEG files.

    Note: `freq` is the sampling period in seconds (0.002 s = 500 Hz),
    not a frequency in Hz.
    """
    # Use helper function to load single or combined EEG segments
    dfEEG = load_eeg_data_with_combined_segments(eeg_file)
    dfEEG = dfEEG.rename(columns={"Block":"Condition"})

    dfEEG["SessionID"] = session_id  # Use Session-ID instead of PId
    dfEEG["TimeRaw"] = dfEEG.Time
    
    dfEEG.TimeLsl = dfEEG.TimeLsl - dfEEG.TimeLsl.min()
    dfEEG.TimeRaw = dfEEG.TimeRaw.min() + dfEEG.TimeLsl
    dfEEG.drop("TimeLsl", axis=1, inplace=True)
    
    dfEEG.index = list(range(len(dfEEG)))

    offset = dfEEG.TimeRaw.iloc[0] % freq
    dfEEG.TimeRaw = dfEEG.TimeRaw - offset

    dfEEG.TimeRaw = (dfEEG.TimeRaw / freq).round() * freq

    dfEEG = dfEEG.drop_duplicates("TimeRaw")

    dfEEG.Time = dfEEG.TimeRaw.apply(lambda x: datetime.datetime.fromtimestamp(x, tz))
    
    # Build the resampling interval from `freq` so it stays consistent with the snapping above
    dfEEG = dfEEG.set_index("Time").asfreq(freq=f"{freq}s").reset_index()
    dfEEG[chan_names] = dfEEG[chan_names].interpolate(method='linear')

    dfEEG["TimeUnix"] = dfEEG.Time.apply(lambda x: x.timestamp() * 1000)
    
    print(f"  EEG data loaded: {len(dfEEG)} rows")
    
    return dfEEG


## 4. Process and Export Sessions

In [None]:
def process_session(session_id, session_row):
    """Load one session's EEG data and save it as a pickle file."""
    eeg_file = session_row['eeg_file']
    exp_file = session_row['experiment_file']
    
    print(f"Session {session_id}: {exp_file} <-> {eeg_file}")
    
    dfEEG = load_eeg(session_id, eeg_file)
    dfEEG.to_pickle(f"./preprocessed/session_{session_id:02d}-EEG-raw.pkl")
    print(f"  Saved: session_{session_id:02d}-EEG-raw.pkl")

# Process all matched sessions
print(f"Processing {len(df_matched)} sessions...")
for i, (idx, session) in enumerate(df_matched.iterrows(), 1):
    try:
        print(f"\n[{i}/{len(df_matched)}] ", end="")
        process_session(i-1, session)  # Use i-1 as session_id (0-indexed)
        print(f"  ✓ Session {i-1} successful!")
    except Exception as e:
        print(f"  ✗ Error at session {i-1}: {e}")


Processing 13 sessions...

[1/13] Session 0: 01_human-llm-alignment_2025-11-17_11h36.44.912.csv <-> EEG_data_1763373596.csv + EEG_data_1763373596tb.csv
    ⚠️  Combining 2 EEG segments:
      [1/2] EEG_data_1763373596.csv: 1903451 rows
      [2/2] EEG_data_1763373596tb.csv: 1903451 rows
    ✓ Combined: 3806902 total rows
  EEG data loaded: 1904168 rows
  Saved: session_00-EEG-raw.pkl
  ✓ Session 0 successful!

[2/13] Session 1: 01_human-llm-alignment_2025-11-20_13h16.37.791.csv <-> EEG_data_1763640940.csv
    ✓ Single EEG file: EEG_data_1763640940.csv
  EEG data loaded: 1053511 rows
  Save