# 01 Map Sessions

## Overview

This notebook matches experiment CSV files with their corresponding EEG recording files based on Unix timestamps. The workflow includes:

1. **File Discovery**: Scans the Data directory for experiment CSV files and EEG CSV files
2. **Timestamp Extraction**: Reads the recording start from each experiment file and estimates its end; reads the first and last timestamps from each EEG file
3. **Time Matching**: Compares timestamps to find the best EEG recording match for each experiment session
4. **Validation**: Calculates time offset and coverage to verify match quality
5. **Mapping Export**: Creates a session mapping CSV linking experiment files to EEG recordings

**Input**: 
- Experiment CSV files: `01_human-llm-alignment_YYYY-MM-DD_HHhMM.SS.mmm.csv`
- EEG CSV files: `EEG_data_*.csv` (in the outputs below the files are named with a Unix timestamp, e.g. `EEG_data_1763373596.csv`)
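
The experiment filename pattern above encodes a wall-clock time, which can serve as a rough cross-check against the Unix timestamp read from inside the file. The following is a minimal sketch under the assumption that filenames follow the stated pattern exactly; `parse_experiment_start` and the example filename are hypothetical and not part of the notebook.

```python
from datetime import datetime

PREFIX = "01_human-llm-alignment_"
SUFFIX = ".csv"


def parse_experiment_start(filename):
    """Parse the YYYY-MM-DD_HHhMM.SS.mmm part of an experiment filename."""
    if not (filename.startswith(PREFIX) and filename.endswith(SUFFIX)):
        raise ValueError(f"Unexpected filename: {filename}")
    stamp = filename[len(PREFIX):-len(SUFFIX)]
    # %f accepts 1-6 digits, so the 3-digit millisecond field parses directly
    return datetime.strptime(stamp, "%Y-%m-%d_%Hh%M.%S.%f")


# Made-up filename that follows the pattern (not one of the real files)
started = parse_experiment_start("01_human-llm-alignment_2025-01-02_09h05.07.250.csv")
print(started.isoformat())
```

The parsed value is a naive local time, so it is only comparable to `datetime.fromtimestamp(...)` output on a machine using the same time zone as the recording computer.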

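**Output**: `session_mapping.csv` with columns:
- `experiment_file`: Name of the experiment CSV file
- `experiment_date`: Local start time of the experiment (`YYYY-MM-DD HH:MM:SS`)
- `n_trials`: Number of trials found in the experiment file
- `exp_duration_min`: Experiment duration in minutes (without the 5-minute end buffer)
- `eeg_file`: Name of the matched EEG CSV file (or 'NO MATCH')
- `time_offset_min`: Experiment start minus EEG start, in minutes (negative = EEG starts after the experiment)
- `coverage`: Percentage of the experiment window (duration plus a 5-minute buffer) covered by the EEG recording, stored as a string such as `60.3%`
- `overlap_min`: Overlap between the experiment window and the EEG recording, in minutes
- `is_complete`: `True` if the EEG recording lasts until the end of the experiment window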

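**Note**: For the EEG files, this notebook reads only the header, the first data row, and the tail of the file (seeking to the last line) instead of loading the entire file, which makes timestamp extraction roughly 10-100x faster. Experiment files are still loaded in full with pandas.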
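
The tail-reading idea can be sketched as a standalone helper. This is an illustrative variant, not the notebook's own function: `read_last_line` and its `chunk_size` parameter are hypothetical names, and the sketch assumes UTF-8 encoded CSV files. It opens the file in binary mode so that byte offsets are well defined, and widens the read window if the last line is longer than the initial chunk.

```python
import os
import tempfile
from pathlib import Path


def read_last_line(path, chunk_size=2048):
    """Return the last non-empty line of a text file without reading all of it."""
    with open(path, "rb") as f:  # binary mode: seek() offsets are plain bytes
        f.seek(0, os.SEEK_END)
        file_size = f.tell()
        window = chunk_size
        while True:
            start = max(0, file_size - window)
            f.seek(start)
            # Strip trailing newlines so a final "\n" does not yield an empty line
            data = f.read(file_size - start).rstrip(b"\r\n")
            newline_pos = data.rfind(b"\n")
            if newline_pos != -1 or start == 0:
                # A full last line is inside the window, or we hold the whole file
                return data[newline_pos + 1:].decode("utf-8")
            window *= 2  # last line is longer than the window: widen and retry


# Demo on a small made-up file with a `Time` column
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "eeg_demo.csv"
    demo.write_text("Time,Ch1\n100.0,0.1\n100.5,0.2\n101.0,0.3\n")
    last_time = float(read_last_line(demo).split(",")[0])
    tiny_window = read_last_line(demo, chunk_size=4)

print(last_time)  # 101.0
```

Only the tail of the file is touched, so the cost depends on the length of the last line rather than on the size of the recording.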

## 1. Import Libraries


In [23]:
import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime
import glob

## 2. Find all Files

In [24]:
data_dir = Path('./Data')

# Find all experiment CSV files (incomplete runs are filtered out in step 3)
exp_files = sorted(data_dir.glob('01_human-llm-alignment_*.csv'))

# Find all EEG and eye-tracking files
eeg_csv_files = sorted(data_dir.glob('EEG_data_*.csv'))
eyetracking_edf_files = sorted(data_dir.glob('*.EDF'))

print("Found:")
print(f"  Experiment CSVs: {len(exp_files)}")
print(f"  EEG CSV Files: {len(eeg_csv_files)}")
print(f"  Eye-Tracking EDF Files: {len(eyetracking_edf_files)}")

Found:
  Experiment CSVs: 88
  EEG CSV Files: 7
  Eye-Tracking EDF Files: 4


## 3. Extract Timestamps from Experiment Files

In [29]:
def get_experiment_timestamps(csv_file):
    """Extract start and end time from experiment CSV."""
    try:
        df = pd.read_csv(csv_file, engine='python', on_bad_lines='skip')
        
        # Get el_recording.started_Unix timestamp
        if 'el_recording.started_Unix' not in df.columns:
            return None
            
        start_unix = df['el_recording.started_Unix'].dropna()
        if len(start_unix) == 0:
            return None
            
        start = start_unix.iloc[0]
        
        # Determine number of trials from ScenarioLoop (more accurate than counting rows)
        n_trials = 0
        if 'ScenarioLoop.thisN' in df.columns:
            max_trial_n = df['ScenarioLoop.thisN'].dropna()
            if len(max_trial_n) > 0:
                # thisN is 0-indexed, so max + 1 = total trials
                n_trials = int(max_trial_n.max()) + 1
        
        # Fall back to counting AI_Response events if ScenarioLoop not available
        if n_trials == 0 and 'AI_Response.started' in df.columns:
            ai_response_times = df['AI_Response.started'].dropna()
            n_trials = len(ai_response_times)
        
        # Skip files with very few trials (likely test runs)
        if n_trials < 10:
            return None
        
        # Calculate duration and end time
        if 'AI_Response.started' in df.columns:
            ai_response_times = df['AI_Response.started'].dropna()
            if len(ai_response_times) > 0:
                duration = ai_response_times.max()
            else:
                duration = 3000  # default 50 min estimate
        else:
            duration = 3000  # default estimate
            
        end = start + duration + 300  # +5 minutes buffer
        
        return {
            'file': csv_file.name,
            'start_unix': start,
            'end_unix': end,
            'duration': duration,
            'n_trials': n_trials,
            'date': datetime.fromtimestamp(start).strftime('%Y-%m-%d %H:%M:%S')
        }
        
    except Exception as e:
        print(f"Error at {csv_file.name}: {e}")
        return None

# Collect info for all experiment files
exp_info = []
for f in exp_files:
    info = get_experiment_timestamps(f)
    if info:
        exp_info.append(info)

df_exp = pd.DataFrame(exp_info)
print(f"\nComplete experiments: {len(df_exp)}")
df_exp.head(10)


Complete experiments: 5


| | file | start_unix | end_unix | duration | n_trials | date |
|---|---|---|---|---|---|---|
| 0 | 01_human-llm-alignment_2025-11-17_11h36.44.912... | 1763376000.0 | 1763378000.0 | 2158.411714 | 50 | 2025-11-17 11:38:44 |
| 1 | 01_human-llm-alignment_2025-11-20_13h16.37.791... | 1763641000.0 | 1763644000.0 | 2247.667509 | 39 | 2025-11-20 13:17:08 |
| 2 | 01_human-llm-alignment_2025-11-20_15h02.07.504... | 1763647000.0 | 1763650000.0 | 2642.800302 | 49 | 2025-11-20 15:04:29 |
| 3 | 01_human-llm-alignment_2025-11-20_16h25.12.833... | 1763652000.0 | 1763655000.0 | 2036.464338 | 46 | 2025-11-20 16:25:37 |
| 4 | 01_human-llm-alignment_2025-11-24_14h13.05.529... | 1763990000.0 | 1763994000.0 | 3471.866908 | 49 | 2025-11-24 14:15:13 |
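
`get_experiment_timestamps` uses only three columns of each experiment file. As a possible refinement, the read can be restricted to those columns with a callable `usecols`, which keeps the "column missing" checks working because absent columns are simply not returned. This is a sketch under assumptions: `read_timing_columns` and the demo data are hypothetical, and whether it is noticeably faster on the real files is untested here.

```python
import io

import pandas as pd

# Columns that the timestamp extraction in this notebook actually reads
WANTED = {"el_recording.started_Unix", "ScenarioLoop.thisN", "AI_Response.started"}


def read_timing_columns(source):
    """Load only the timing-related columns from an experiment CSV."""
    return pd.read_csv(
        source,
        usecols=lambda name: name in WANTED,  # callable: no error if a column is absent
        engine="python",
        on_bad_lines="skip",
    )


# Made-up miniature experiment file with one extra column
demo_csv = io.StringIO(
    "participant,ScenarioLoop.thisN,AI_Response.started,el_recording.started_Unix\n"
    "p01,,,1700000000.0\n"
    "p01,0,12.5,\n"
    "p01,1,40.25,\n"
)
df_demo = read_timing_columns(demo_csv)

start = df_demo["el_recording.started_Unix"].dropna().iloc[0]
n_trials = int(df_demo["ScenarioLoop.thisN"].max()) + 1  # thisN is 0-indexed
duration = df_demo["AI_Response.started"].max()
print(start, n_trials, duration)
```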


## 4. Extract EEG Timestamps

In [30]:
def get_eeg_csv_timestamps(csv_file):
    """Extract start and end time from EEG CSV."""
    try:
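        # Read only the header, the first data row, and the tail of the file
        # (much faster than loading the whole recording). Binary mode keeps
        # the byte offsets used by seek() well defined.
        with open(csv_file, 'rb') as f:
            # First line (header)
            header = f.readline().decode().strip().split(',')
            # Second line (first data row)
            first_line = f.readline().decode().strip().split(',')
            
            # Jump to the end of the file to get its size
            f.seek(0, 2)
            file_size = f.tell()
            
            # Read the last ~2000 bytes (assumed to contain at least one full line)
            offset = min(2000, file_size)
            f.seek(file_size - offset)
            tail = f.read().decode(errors='ignore')
            # Ignore blank lines so a trailing newline does not hide the last row
            lines = [ln for ln in tail.splitlines() if ln.strip()]
            last_line = lines[-1].strip().split(',')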
        
        # Find Time column index
        time_idx = header.index('Time')
        
        start_time = float(first_line[time_idx])
        end_time = float(last_line[time_idx])
        
        return {
            'file': csv_file.name,
            'start_unix': start_time,
            'end_unix': end_time,
            'duration': end_time - start_time,
            'date': datetime.fromtimestamp(start_time).strftime('%Y-%m-%d %H:%M:%S')
        }
    except Exception as e:
        print(f"Error at {csv_file.name}: {e}")
        return None

# Collect EEG info
eeg_info = []
for f in eeg_csv_files:
    info = get_eeg_csv_timestamps(f)
    if info:
        eeg_info.append(info)

df_eeg = pd.DataFrame(eeg_info)
print(f"\nEEG CSV files: {len(df_eeg)}")
df_eeg


EEG CSV files: 7


| | file | start_unix | end_unix | duration | date |
|---|---|---|---|---|---|
| 0 | EEG_data_1763373596.csv | 1763374000.0 | 1763377000.0 | 3808.189084 | 2025-11-17 10:59:57 |
| 1 | EEG_data_1763373596tb.csv | 1763374000.0 | 1763377000.0 | 3808.189084 | 2025-11-17 10:59:57 |
| 2 | EEG_data_1763640940.csv | 1763641000.0 | 1763643000.0 | 2106.895494 | 2025-11-20 13:15:41 |
| 3 | EEG_data_1763647289.csv | 1763647000.0 | 1763648000.0 | 443.182222 | 2025-11-20 15:01:30 |
| 4 | EEG_data_1763652280.csv | 1763652000.0 | 1763652000.0 | 110.268764 | 2025-11-20 16:24:41 |
| 5 | EEG_data_1763989917.csv | 1763990000.0 | 1763990000.0 | 293.910738 | 2025-11-24 14:11:58 |
| 6 | EEG_data_1763990890.csv | 1763991000.0 | 1763993000.0 | 2062.123921 | 2025-11-24 14:28:10 |
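
Rows 0 and 1 report the same start, end, and duration, which suggests (but does not prove) that the same recording is stored under two filenames. The matching step still works, since `max` keeps the first of two equally good candidates, but such rows could be collapsed beforehand. A minimal sketch, assuming identical time ranges mean identical recordings; `drop_duplicate_recordings` and the demo values are hypothetical.

```python
import pandas as pd


def drop_duplicate_recordings(df_eeg, decimals=3):
    """Keep the first file for each (start, end) time range, rounded to milliseconds."""
    keys = df_eeg[["start_unix", "end_unix"]].round(decimals)
    return df_eeg.loc[~keys.duplicated(keep="first")].reset_index(drop=True)


# Made-up EEG info table: the first two rows share the same time range
df_eeg_demo = pd.DataFrame(
    {
        "file": ["eeg_a.csv", "eeg_a_copy.csv", "eeg_b.csv"],
        "start_unix": [1000.0, 1000.0, 5000.0],
        "end_unix": [1600.0, 1600.0, 5300.0],
    }
)
df_unique = drop_duplicate_recordings(df_eeg_demo)
print(df_unique["file"].tolist())  # ['eeg_a.csv', 'eeg_b.csv']
```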


## 5. Match Experiment ↔ EEG Based on Timestamps

In [31]:
def find_matching_eeg(exp_row, df_eeg, min_coverage_percent=15):
    """Find matching EEG file for an experiment.
    
    Args:
        exp_row: Experiment info
        df_eeg: DataFrame with EEG info
        min_coverage_percent: Minimum coverage percentage to consider a match (default: 15%)
    """
    exp_start = exp_row['start_unix']
    exp_end = exp_row['end_unix']
    exp_duration = exp_end - exp_start
    
    # Find EEG files whose time range overlaps with the experiment
    matches = []
    for idx, eeg_row in df_eeg.iterrows():
        eeg_start = eeg_row['start_unix']
        eeg_end = eeg_row['end_unix']
        
        # Calculate overlap
        overlap_start = max(exp_start, eeg_start)
        overlap_end = min(exp_end, eeg_end)
        overlap_duration = max(0, overlap_end - overlap_start)
        
        coverage_percent = (overlap_duration / exp_duration) * 100
        
        # Require minimum coverage
        if coverage_percent >= min_coverage_percent:
            # Calculate time offset (negative = EEG starts after experiment)
            time_offset = exp_start - eeg_start
            
            matches.append({
                'eeg_file': eeg_row['file'],
                'offset_seconds': time_offset,
                'offset_minutes': time_offset / 60,
                'coverage_percent': coverage_percent,
                'coverage': f'{coverage_percent:.1f}%',
                'coverage_seconds': overlap_duration,
                'overlap_minutes': overlap_duration / 60,
                'is_complete': eeg_start <= exp_start and eeg_end >= exp_end  # EEG spans the whole experiment window
            })
    
    if matches:
        # Choose match with best coverage (longest overlap)
        return max(matches, key=lambda x: x['coverage_seconds'])
    return None

# Match all experiments
session_map = []
for idx, exp in df_exp.iterrows():
    match = find_matching_eeg(exp, df_eeg)
    session_map.append({
        'experiment_file': exp['file'],
        'experiment_date': exp['date'],
        'n_trials': exp['n_trials'],
        'exp_duration_min': exp['duration'] / 60,
        'eeg_file': match['eeg_file'] if match else 'NO MATCH',
        'time_offset_min': match['offset_minutes'] if match else None,
        'coverage': match['coverage'] if match else None,
        'overlap_min': match['overlap_minutes'] if match else None,
        'is_complete': match['is_complete'] if match else False
    })

df_sessions = pd.DataFrame(session_map)
print(f"\nSession Mapping (≥15% coverage):")
print(f"  Matched: {df_sessions['eeg_file'].ne('NO MATCH').sum()}")
print(f"  Unmatched: {df_sessions['eeg_file'].eq('NO MATCH').sum()}")
print(f"  Complete coverage: {df_sessions['is_complete'].sum()}")
df_sessions


Session Mapping (≥15% coverage):
  Matched: 3
  Unmatched: 2
  Complete coverage: 0


| | experiment_file | experiment_date | n_trials | exp_duration_min | eeg_file | time_offset_min | coverage | overlap_min | is_complete |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 01_human-llm-alignment_2025-11-17_11h36.44.912... | 2025-11-17 11:38:44 | 50 | 35.973529 | EEG_data_1763373596.csv | 38.780035 | 60.3% | 24.689783 | False |
| 1 | 01_human-llm-alignment_2025-11-20_13h16.37.791... | 2025-11-20 13:17:08 | 39 | 37.461125 | EEG_data_1763640940.csv | 1.459425 | 79.3% | 33.6555 | False |
| 2 | 01_human-llm-alignment_2025-11-20_15h02.07.504... | 2025-11-20 15:04:29 | 49 | 44.046672 | NO MATCH | | | | False |
| 3 | 01_human-llm-alignment_2025-11-20_16h25.12.833... | 2025-11-20 16:25:37 | 46 | 33.941072 | NO MATCH | | | | False |
| 4 | 01_human-llm-alignment_2025-11-24_14h13.05.529... | 2025-11-24 14:15:13 | 49 | 57.864448 | EEG_data_1763990890.csv | -12.952562 | 54.7% | 34.368732 | False |
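
The coverage value can be reproduced by hand from the numbers printed above. The sketch below recomputes it for the last session (index 4), placing the experiment start at t = 0 and using the printed duration, offset, and EEG duration; `interval_overlap` is a hypothetical helper that mirrors the overlap logic in `find_matching_eeg`.

```python
def interval_overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of [a_start, a_end] and [b_start, b_end]."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


# Experiment window: printed duration plus the 5-minute (300 s) buffer
exp_start = 0.0
exp_end = 3471.866908 + 300

# EEG starts 12.952562 minutes after the experiment (offset is negative in the table)
eeg_start = 12.952562 * 60
eeg_end = eeg_start + 2062.123921  # printed EEG duration in seconds

overlap = interval_overlap(exp_start, exp_end, eeg_start, eeg_end)
coverage = overlap / (exp_end - exp_start) * 100

print(f"{overlap / 60:.2f} min overlap, {coverage:.1f}% coverage")
# 34.37 min overlap, 54.7% coverage
```

Because the EEG recording lies entirely inside the experiment window here, the overlap equals the full EEG duration, and the coverage is limited by the recording starting late and ending early.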


## 6. Save Session Mapping

In [32]:
# Save as CSV
df_sessions.to_csv('./session_mapping.csv', index=False)
print("Session mapping saved: ./session_mapping.csv")

# Show only matched sessions
df_matched = df_sessions[df_sessions['eeg_file'] != 'NO MATCH'].copy()
print(f"\n{len(df_matched)} matched sessions for analysis:")
df_matched

Session mapping saved: ./session_mapping.csv

3 matched sessions for analysis:


| | experiment_file | experiment_date | n_trials | exp_duration_min | eeg_file | time_offset_min | coverage | overlap_min | is_complete |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 01_human-llm-alignment_2025-11-17_11h36.44.912... | 2025-11-17 11:38:44 | 50 | 35.973529 | EEG_data_1763373596.csv | 38.780035 | 60.3% | 24.689783 | False |
| 1 | 01_human-llm-alignment_2025-11-20_13h16.37.791... | 2025-11-20 13:17:08 | 39 | 37.461125 | EEG_data_1763640940.csv | 1.459425 | 79.3% | 33.6555 | False |
| 4 | 01_human-llm-alignment_2025-11-24_14h13.05.529... | 2025-11-24 14:15:13 | 49 | 57.864448 | EEG_data_1763990890.csv | -12.952562 | 54.7% | 34.368732 | False |
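
Downstream notebooks that load `session_mapping.csv` need to handle two details of this export: unmatched sessions carry the literal string `NO MATCH`, and `coverage` is stored as text such as `60.3%`. A minimal sketch of reading the file back follows; it uses a small made-up CSV in place of the real file, and the `coverage_pct` column is a hypothetical addition.

```python
import io

import pandas as pd

# Made-up stand-in for ./session_mapping.csv with the same key columns
demo_mapping = io.StringIO(
    "experiment_file,eeg_file,time_offset_min,coverage\n"
    "exp_a.csv,eeg_a.csv,1.5,79.3%\n"
    "exp_b.csv,NO MATCH,,\n"
)

mapping = pd.read_csv(demo_mapping)

# Keep only sessions that have an EEG recording
matched = mapping[mapping["eeg_file"] != "NO MATCH"].copy()

# Convert the formatted percentage string back into a number
matched["coverage_pct"] = matched["coverage"].str.rstrip("%").astype(float)

for row in matched.itertuples(index=False):
    print(row.experiment_file, "->", row.eeg_file, f"({row.coverage_pct:.1f}%)")
```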


## 7. Summary

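Of the 5 experiment files that passed the filters (recording timestamp present, at least 10 trials), 3 were matched to an EEG recording with ≥15% coverage and 2 have no match. None of the matched recordings covers its full experiment window.

Next steps:
1. **Preprocessing**: Run all matched EEG files through the preprocessing pipeline (01-04)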
2. **ERP Analysis**: Calculate ERPs for each session (05_ERP_Analysis)
3. **Grand Average**: Combine all sessions for group ERPs
4. **Statistics**: Condition comparisons across all sessions