# 04 EEG Preprocess ICA

## Overview

This notebook performs the final preprocessing step on EEG data using Independent Component Analysis (ICA) to identify and remove artifacts. The workflow includes:

1. **Filtering**: Applies an average reference and a 1-100 Hz bandpass before ICA (the band ICLabel expects), then a final 1-20 Hz bandpass and a 50 Hz notch after artifact removal
2. **ICA Decomposition**: Decomposes the EEG signal into 15 independent components using the extended Infomax algorithm
3. **Artifact Classification**: Uses ICLabel to automatically classify ICA components (brain, muscle artifact, eye blink, heart beat, line noise, channel noise, other)
4. **Artifact Removal**: Excludes every component not labeled brain or other and reconstructs clean EEG data
5. **Visualization**: Plots the preprocessed data for quality inspection
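
The exclusion rule in step 4 can be sketched without MNE. This is a minimal illustration, assuming ICLabel returns one string label per component; the helper name `components_to_exclude` is hypothetical, and the sample labels are the first five entries from the Session 17 log later in this notebook.

```python
# Labels treated as non-artifactual in this notebook; every other label is removed.
KEEP_LABELS = ("brain", "other")

def components_to_exclude(labels, keep=KEEP_LABELS):
    """Return the indices of ICA components whose label is not in `keep`."""
    return [idx for idx, label in enumerate(labels) if label not in keep]

# First five labels from the Session 17 log
labels = ["other", "eye blink", "eye blink", "line noise", "channel noise"]
exclude_idx = components_to_exclude(labels)
print(exclude_idx)
```

Keeping `other` is a conservative choice: components ICLabel cannot attribute to a known artifact class stay in the data rather than being removed.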

**Input**: `session_XX-EEG-raw.pkl` files (from notebook 02) and, where present, `session_XX-bad-channels.pkl` lists of RANSAC bad channels (from notebook 03)

**Output**: `session_XX-EEG-preprocessed.pkl` files containing artifact-removed EEG data ready for ERP analysis

**Prerequisites**: 
- Session mapping CSV must exist
- Raw EEG pickle files must be generated (notebook 02)
- Bad channels identified and removed from channel list (notebook 03)
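
A quick way to confirm these prerequisites before a long run is to list the expected input files that are missing. The sketch below assumes the directory layout used in this notebook (`./session_mapping.csv` and `./preprocessed/session_XX-EEG-raw.pkl`); the helper `missing_prerequisites` is hypothetical, and the demonstration runs against a temporary directory rather than real data.

```python
import tempfile
from pathlib import Path

def missing_prerequisites(base_dir, session_ids):
    """List required input files that are absent under `base_dir`."""
    base = Path(base_dir)
    required = [base / "session_mapping.csv"]
    required += [
        base / "preprocessed" / f"session_{sid:02d}-EEG-raw.pkl"
        for sid in session_ids
    ]
    return [path for path in required if not path.exists()]

# Demonstration in a throwaway directory: session 1 has no raw file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "preprocessed").mkdir()
    (Path(tmp) / "session_mapping.csv").touch()
    (Path(tmp) / "preprocessed" / "session_00-EEG-raw.pkl").touch()
    missing_names = [p.name for p in missing_prerequisites(tmp, [0, 1])]

print(missing_names)
```

Bad-channel files are not checked here because the preprocessing function treats them as optional.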

**Code Attribution:**
- Original EEG preprocessing code adapted from: Chiossi, F., Mayer, S., & Ou, C. (2024). MobileHCI 2024 Papers - Submission 7226.
- OSF Repository: https://osf.io/fncj4/overview (Created: Sep 11, 2023)
- License: GNU General Public License (GPL) 3.0
- Code has been modified for this study's session-based structure and experimental design.


In [1]:
# Install onnxruntime for ICLabel
import sys
!{sys.executable} -m pip install onnxruntime






## 1. Import Libraries

In [2]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm
import mne

import os

# Fix timezone issue on Windows - MUST be before any pickle loading
os.environ['TZ'] = 'UTC'
import time
if hasattr(time, 'tzset'):
    time.tzset()

# Monkey-patch pandas timezone handling for Windows
import pandas._libs.tslibs.timezones as tz_module
_original_get_timezone = tz_module.get_timezone

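def _patched_get_timezone(zone):
    """Handle missing timezones on Windows."""
    if zone == 'Europe/Berlin':
        # Fallback: fixed UTC+1 (CET). This ignores daylight saving time,
        # so summer (CEST, UTC+2) timestamps are off by one hour.
        from datetime import timezone, timedelta
        return timezone(timedelta(hours=1))
    return _original_get_timezone(zone)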

tz_module.get_timezone = _patched_get_timezone

from mne_icalabel import label_components

from mne.io import concatenate_raws, read_raw_edf
from mne.time_frequency import tfr_multitaper
from mne.stats import permutation_cluster_1samp_test as pcluster_test
import datetime
import pyprep
from autoreject import get_rejection_threshold

from tqdm.notebook import trange, tqdm
import pickle

from multiprocessing import Pool

from collections import Counter
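
The timezone patch above substitutes a fixed UTC+1 offset for `Europe/Berlin`. A short standard-library sketch of what that implies: a fixed offset has no daylight saving rule, so a summer timestamp converted with it reads one hour earlier than Central European Summer Time (UTC+2). The example date is arbitrary.

```python
from datetime import datetime, timedelta, timezone

fixed_cet = timezone(timedelta(hours=1))  # what the patch returns
cest = timezone(timedelta(hours=2))       # Berlin's actual offset in summer

# Arbitrary summer instant, expressed in UTC
instant = datetime(2024, 7, 1, 12, 0, tzinfo=timezone.utc)
as_patched = instant.astimezone(fixed_cet)
as_summer = instant.astimezone(cest)

# Same instant, different wall-clock hour
hour_gap = as_summer.hour - as_patched.hour
print(as_patched.hour, as_summer.hour, hour_gap)
```

The underlying instants are unchanged, so the patch only matters if wall-clock times from the recordings are interpreted or compared with other local-time logs.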

## 2. Load Sessions and Select Channels

In [3]:
# Load session mapping
df_sessions = pd.read_csv('./session_mapping.csv')
df_matched = df_sessions[df_sessions['eeg_file'] != 'NO MATCH'].copy()

# Get available sessions by scanning the preprocessed directory
import re
from pathlib import Path

preprocessed_dir = Path('./preprocessed')
available_sessions = []

for pkl_file in sorted(preprocessed_dir.glob('session_*-EEG-raw.pkl')):
    # Extract session number from filename (e.g., session_00-EEG-raw.pkl -> 0)
    match = re.search(r'session_(\d+)-EEG-raw\.pkl', pkl_file.name)
    if match:
        session_id = int(match.group(1))
        available_sessions.append(session_id)

session_ids = sorted(available_sessions)
print(f"Found {len(session_ids)} preprocessed raw files")

# All 64 EEG channels (will be filtered per-session based on RANSAC bad channels)
all_channels = ['Fp1', 'Fz', 'F3', 'F7', 'F9', 'FC5', 'FC1', 'C3', 'T7', 'CP5', 'CP1', 'Pz', 'P3', 'P7', 'P9', 'O1', 'Oz', 'O2', 'P10', 'P8', 'P4', 'CP2', 'CP6', 'T8', 'C4', 'Cz', 'FC2', 'FC6', 'F10', 'F8', 'F4', 'Fp2', 'AF7', 'AF3', 'AFz', 'F1', 'F5', 'FT7', 'FC3', 'C1', 'C5', 'TP7', 'CP3', 'P1', 'P5', 'PO7', 'PO3', 'Iz', 'POz', 'PO4', 'PO8', 'P6', 'P2', 'CPz', 'CP4', 'TP8', 'C6', 'C2', 'FC4', 'FT8', 'F6', 'F2', 'AF4', 'AF8']

print(f"Total channels available: {len(all_channels)}")
print(f"Channels will be filtered per-session based on RANSAC bad channels")
print(f"\nSessions to process: {len(session_ids)}")
print(f"Session IDs: {session_ids}")


Found 21 preprocessed raw files
Total channels available: 64
Channels will be filtered per-session based on RANSAC bad channels

Sessions to process: 21
Session IDs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
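
Per-session channel filtering keeps the order of `all_channels` and drops the names flagged as bad. A minimal sketch, using a hypothetical helper `select_channels` that also reports bad-channel names absent from the master list, since a typo or montage mismatch would otherwise go unnoticed:

```python
def select_channels(all_channels, bad_channels):
    """Return (kept, unknown).

    kept: channels to keep, in their original order.
    unknown: bad-channel names that do not occur in `all_channels`.
    """
    bad = set(bad_channels)
    kept = [ch for ch in all_channels if ch not in bad]
    unknown = sorted(bad - set(all_channels))
    return kept, unknown

# Toy channel list for illustration; the notebook uses the full 64-channel list
kept, unknown = select_channels(["Fp1", "Fz", "F3", "Cz"], ["Fz", "T9"])
print(kept, unknown)
```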


## 3. High-pass Filter, Low-pass Filter, and ICA

In [4]:
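def eeg_preprocessing(session_id, all_channels, l_freq=1, h_freq=None, n_jobs=10, verbose=False):
    """Preprocess one session: re-reference, filter, run ICA, remove artifacts, and save.

    The 1-100 Hz pre-ICA filter is fixed. l_freq and h_freq set the final
    band-pass applied after artifact removal (h_freq=None disables the low-pass).
    """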
    
    dfEEG = pd.read_pickle(f"./preprocessed/session_{session_id:02d}-EEG-raw.pkl")
    
    # Load session-specific bad channels (anything that is not a list is ignored)
    bad_channels_path = preprocessed_dir / f"session_{session_id:02d}-bad-channels.pkl"
    bad_channels = []
    if bad_channels_path.exists():
        loaded = pd.read_pickle(bad_channels_path)
        if isinstance(loaded, list):
            bad_channels = loaded
    
    # Remove bad channels from this session
    chan_names = [ch for ch in all_channels if ch not in bad_channels]
    
    if bad_channels:
        print(f"  Bad channels for session {session_id}: {bad_channels}")
        print(f"  Using {len(chan_names)} channels (removed {len(bad_channels)})")
    
    info = mne.create_info(ch_names=chan_names, sfreq=500, ch_types='eeg', verbose=verbose)
    info.set_montage('standard_1020')
    # Set both fields in one dict; a second assignment would overwrite the first
    info['subject_info'] = {"id": session_id, "his_id": str(session_id)}
    
    # Scale by 1e-6 (assumes the pickled samples are in microvolts; MNE expects volts)
    raw = mne.io.RawArray(dfEEG[chan_names].values.T / 1e6, info, verbose=verbose)
    
    # Set average reference before filtering
    raw = raw.set_eeg_reference('average')
    
    # Apply 1-100 Hz filter for ICA (ICLabel requirement)
    # Use method='iir' for much faster filtering (10-20x speedup)
    print(f"  [1/5] Filtering 1-100 Hz for ICA...")
    raw.filter(l_freq=1, h_freq=100, method='iir', verbose=verbose, n_jobs=n_jobs)
    
    # Fit ICA on 1-100 Hz filtered data
    print(f"  [2/5] Running ICA (this takes 5-7 minutes)...")
    ica = mne.preprocessing.ICA(n_components=15, max_iter='auto', method='infomax', 
                                fit_params=dict(extended=True), verbose=verbose, random_state=42)
    ica.fit(raw)
    
    # Classify ICA components using ICLabel
    print(f"  [3/5] Classifying ICA components with ICLabel...")
    ic_labels = label_components(raw, ica, method="iclabel")
    labels = ic_labels["labels"]
    exclude_idx = [idx for idx, label in enumerate(labels) if label not in ["brain", "other"]]
    print(f"  → Excluding components: {exclude_idx} ({labels})")
    
    # Apply ICA artifact removal
    print(f"  [4/5] Removing artifacts...")
    ica.apply(raw, exclude=exclude_idx)
    
    # Apply the final band-pass for ERP analysis (1-20 Hz with the settings used in the loop below)
    print(f"  [5/5] Final filtering {l_freq}-{h_freq} Hz + 50 Hz notch...")
    raw.filter(l_freq=l_freq, h_freq=h_freq, method='iir', verbose=verbose, n_jobs=n_jobs)
    
    # Apply notch filter to remove 50 Hz line noise
    raw.notch_filter(freqs=50, method='iir', verbose=verbose, n_jobs=n_jobs)

    # Use a context manager so the file handle is closed even if pickling fails
    with open(f"./preprocessed/session_{session_id:02d}-EEG-preprocessed.pkl", 'wb') as f:
        pickle.dump(raw, f)
    
    print(f"  ✓ Session {session_id} saved")
    return True
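
`label_components` also returns a per-component probability for the predicted class (the `y_pred_proba` entry of its result). One possible refinement, not used in this notebook, is to remove a component only when ICLabel is reasonably confident it is an artifact. The sketch below assumes that variation; the helper name `confident_artifacts` and the 0.8 threshold are illustrative choices, and the sample labels and probabilities are made up.

```python
def confident_artifacts(labels, probabilities, keep=("brain", "other"), threshold=0.8):
    """Indices of components labeled as artifacts with probability >= threshold."""
    return [
        idx
        for idx, (label, prob) in enumerate(zip(labels, probabilities))
        if label not in keep and prob >= threshold
    ]

# Made-up labels and probabilities for illustration
labels = ["brain", "eye blink", "muscle artifact", "other", "line noise"]
probabilities = [0.95, 0.99, 0.55, 0.60, 0.85]
print(confident_artifacts(labels, probabilities))
```

Inside `eeg_preprocessing`, this would be called with `ic_labels["labels"]` and `ic_labels["y_pred_proba"]`. A threshold trades fewer false removals against more residual artifact, so it is worth checking against the saved plots.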


In [6]:
%%time

for i, session_id in enumerate(session_ids):
    # Check if already preprocessed
    output_file = f"./preprocessed/session_{session_id:02d}-EEG-preprocessed.pkl"
    if os.path.exists(output_file):
        #print(f"\n[{i+1}/{len(session_ids)}] Skipping Session {session_id} (already preprocessed)")
        continue
    
    print(f"\n[{i+1}/{len(session_ids)}] Processing Session {session_id}...")
    result = eeg_preprocessing(session_id, all_channels, l_freq=1, h_freq=20, n_jobs=10)  # final band-pass 1-20 Hz (lowered from 30 Hz)
    print(f"Session {session_id} completed: {result}")



[18/21] Processing Session 17...
EEG channel type selected for re-referencing
Applying average reference.
Applying a custom ('EEG',) reference.
  [1/5] Filtering 1-100 Hz for ICA...
  [2/5] Running ICA (this takes 5-7 minutes)...
Fitting ICA to data using 64 channels (please be patient, this may take a while)
Selecting by number: 15 components
Computing Extended Infomax ICA
Fitting ICA took 207.1s.
  [3/5] Classifying ICA components with ICLabel...
  → Excluding components: [1, 2, 3, 4, 5, 7, 8, 10, 12] (['other', 'eye blink', 'eye blink', 'line noise', 'channel noise', 'c
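
Because the loop skips any session whose output file already exists, a run interrupted while writing could leave a partial pickle that later runs treat as finished. A minimal sketch of one way to guard against that, assuming it is acceptable to write to a temporary file in the same directory and rename it afterwards; `save_pickle_atomic` is a hypothetical helper, not part of the notebook.

```python
import os
import pickle
import tempfile

def save_pickle_atomic(obj, path):
    """Pickle `obj` to a temporary file, then rename it to `path`.

    The rename happens only after the dump has finished, so `path`
    never holds a partially written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything fails, including interrupts
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Demonstration in a throwaway directory with a small stand-in object
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "session_00-EEG-preprocessed.pkl")
    save_pickle_atomic({"demo": [1, 2, 3]}, target)
    with open(target, "rb") as f:
        restored = pickle.load(f)
    leftovers = [name for name in os.listdir(tmp) if name.endswith(".tmp")]

print(restored, leftovers)
```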

## 4. Plotting Preprocessed EEG Data

In [7]:
os.makedirs('./figures', exist_ok=True)

for session_id in session_ids:
    try:
        # Load preprocessed raw object
        with open(f"./preprocessed/session_{session_id:02d}-EEG-preprocessed.pkl", 'rb') as file:
            raw = pickle.load(file)
        
        print(f"Plotting Session {session_id}")
        print(f"  Shape: {raw.get_data().shape}")
        print(f"  Duration: {raw.times[-1]:.1f} seconds")
        
        # Create figure with smaller time window to avoid memory issues
        fig = raw.plot(show=False, verbose=False, start=0, duration=10.0, n_channels=32)
        
        # Save figure
        # Save via the figure object so the right figure is written even if another one is current
        fig.savefig(f"./figures/raw-filtered-session_{session_id:02d}.png", dpi=150, bbox_inches='tight')
        plt.close(fig)
        print(f"  ✓ Saved to figures/raw-filtered-session_{session_id:02d}.png")
        
    except Exception as e:
        print(f"  ✗ Error plotting session {session_id}: {str(e)[:100]}")

Plotting Session 0
  Shape: (64, 1904168)
  Duration: 3808.3 seconds
  ✓ Saved to figures/raw-filtered-session_00.png
Plotting Session 1
  Shape: (64, 1053511)
  Duration: 2107.0 seconds
  ✓ Saved to figures/raw-filtered-session_01.png
Plotting Session 2
  Shape: (64, 1053511)
  Duration: 2107.0 seconds
  ✓ Saved to figures/raw-filtered-session_02.png
Plotting Session 3
  Shape: (64, 655281)
  Duration: 1310.6 seconds
  ✓ Saved to figures/raw-filtered-session_03.png
Plotting Session 4
  Shape: (64, 1757375)
  Duration: 3514
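
The logged durations follow from the sample counts and the 500 Hz sampling rate set in `eeg_preprocessing`: `raw.times[-1]` is the time of the last sample, i.e. `(n_samples - 1) / sfreq`. A quick arithmetic check against the shapes printed above for Sessions 0, 1, and 3 (the helper name `last_sample_time` is illustrative):

```python
SFREQ = 500  # Hz, as passed to mne.create_info in this notebook

def last_sample_time(n_samples, sfreq=SFREQ):
    """Time in seconds of the final sample when the first sample is at t = 0."""
    return (n_samples - 1) / sfreq

for session, n_samples in [(0, 1904168), (1, 1053511), (3, 655281)]:
    print(f"Session {session}: {last_sample_time(n_samples):.1f} seconds")
```

A mismatch between this value and the logged duration would point to a wrong sampling rate in `create_info`.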