# 01 · Data Preprocessing

This notebook loads the raw experiment logs, applies the **pre-registered exclusion rules**, cleans and derives variables, and exports a tidy dataset (`clean_data.csv`) for downstream analysis.

**Inputs (expected):**
- `../data/raw/participants.csv` (demographics, baseline trust)
- `../data/raw/trials.csv` (one row per trial; timestamps, choices)
- `../data/raw/post_task.csv` (post-task trust, NASA-TLX)

**Outputs:**
- `../data/derived/clean_data.csv`
- `../data/derived/exclusions.csv` (list of excluded participants and reasons)


In [None]:
# Imports
import pandas as pd
import numpy as np
from pathlib import Path

DATA_RAW = Path('../data/raw')
DATA_DERIVED = Path('../data/derived')
DATA_DERIVED.mkdir(parents=True, exist_ok=True)


In [None]:
# Load raw data (adjust paths/filenames if needed)
participants = pd.read_csv(DATA_RAW / 'participants.csv')
trials = pd.read_csv(DATA_RAW / 'trials.csv')
post_task = pd.read_csv(DATA_RAW / 'post_task.csv')
print(participants.shape, trials.shape, post_task.shape)
display(participants.head(3))
display(trials.head(3))
display(post_task.head(3))
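
Because the cells below rely on specific column names, it can help to check for them right after loading. The following is a minimal sketch, not part of the pre-registered pipeline: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the column sets are simply the ones referenced later in this notebook, so adjust them to your schema.

```python
import pandas as pd

# Hypothetical schema: column names are those used in later cells of this
# notebook, not a confirmed data dictionary. Adjust to your files.
EXPECTED_COLUMNS = {
    'participants': {'participant_id', 'attention_check_passed', 'trust_baseline'},
    'trials': {'participant_id', 'trial_id', 'condition', 'choice'},
    'post_task': {'participant_id', 'trust_post', 'nasa_tlx'},
}

def missing_columns(df: pd.DataFrame, expected: set) -> list:
    """Return the expected column names that are absent from df, sorted."""
    return sorted(expected - set(df.columns))

# Toy frame standing in for participants.csv (it lacks the attention-check column)
toy = pd.DataFrame({'participant_id': [1, 2], 'trust_baseline': [3.5, 4.0]})
missing = missing_columns(toy, EXPECTED_COLUMNS['participants'])
print(missing)
```

In the notebook itself, the same helper could be called on `participants`, `trials`, and `post_task`, stopping early if any list is non-empty.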

## Apply pre-registered exclusions

- **Attention check failure** → exclude participant
- **Completion ≤ 50% of trials** → exclude participant

Document excluded IDs and the reason in `exclusions.csv`.

In [None]:
# Flag attention check failures (example column names; adjust to your schema)
attn_fail = participants.query('attention_check_passed == 0')['participant_id']

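# Compute completion rate per participant.
# Participants with no rows in `trials` are reindexed in with a count of 0,
# otherwise they would never be flagged as low completion.
# Assumes trial_id values are shared across participants (e.g., 1..K), not globally unique.
trial_counts = trials.groupby('participant_id')['trial_id'].nunique()
all_ids = trial_counts.index.union(pd.Index(participants['participant_id']))
trial_counts = trial_counts.reindex(all_ids, fill_value=0)
total_trials = trials['trial_id'].nunique()
completion_rate = trial_counts / total_trials
low_completion = completion_rate[completion_rate <= 0.5].index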

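exclude_ids = pd.Index(attn_fail).union(low_completion)
# Record which rule(s) each excluded participant triggered
failed_attn = exclude_ids.isin(attn_fail)
low_compl = exclude_ids.isin(low_completion)
reason = np.select([failed_attn & low_compl, failed_attn],
                   ['attention_check;low_completion', 'attention_check'],
                   default='low_completion')
exclusions = pd.DataFrame({'participant_id': exclude_ids, 'reason': reason})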
exclusions.to_csv(DATA_DERIVED / 'exclusions.csv', index=False)
print(f'Excluded N={len(exclude_ids)} participants')
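
The two exclusion rules can be exercised on a toy dataset before running them on real logs. This is a sketch under stated assumptions: `find_exclusions` and `min_completion` are hypothetical names, the column names follow the example schema used above, and `trial_id` values are assumed to repeat across participants.

```python
import pandas as pd

def find_exclusions(participants, trials, min_completion=0.5):
    """Return excluded participant IDs with the rule(s) each one triggered."""
    ids = pd.Index(participants['participant_id'].unique(), name='participant_id')
    failed = participants.loc[participants['attention_check_passed'] == 0, 'participant_id']
    attn_fail = ids.isin(failed)
    # Participants with no trial rows get a count of 0 via reindex
    counts = (trials.groupby('participant_id')['trial_id'].nunique()
              .reindex(ids, fill_value=0))
    # Completion at or below the threshold counts as low, matching the "≤ 50%" rule
    low = (counts / trials['trial_id'].nunique() <= min_completion).to_numpy()
    reasons = []
    for a, l in zip(attn_fail, low):
        parts = (['attention_check'] if a else []) + (['low_completion'] if l else [])
        reasons.append(';'.join(parts))
    out = pd.DataFrame({'participant_id': ids, 'reason': reasons})
    return out[out['reason'] != ''].reset_index(drop=True)

# Toy data: P1 is fine, P2 fails the attention check,
# P3 completes exactly half the trials, P4 has no trial rows at all.
toy_participants = pd.DataFrame({'participant_id': [1, 2, 3, 4],
                                 'attention_check_passed': [1, 0, 1, 1]})
toy_trials = pd.DataFrame({'participant_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
                           'trial_id':       [1, 2, 3, 4, 1, 2, 3, 4, 1, 2]})
result = find_exclusions(toy_participants, toy_trials)
print(result)
```

Participants 3 and 4 are the boundary cases worth checking: one sits exactly on the 50% threshold, the other never appears in the trial log.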

## Merge datasets
Merge `participants`, `trials`, and `post_task` into a single table with one row per participant and condition.
Ensure that **condition** is typed as a category, and compute any derived fields (e.g., decision time in seconds).

In [None]:
# Filter to included participants
incl_participants = participants[~participants['participant_id'].isin(exclude_ids)].copy()
incl_trials = trials[trials['participant_id'].isin(incl_participants['participant_id'])].copy()
incl_post = post_task[post_task['participant_id'].isin(incl_participants['participant_id'])].copy()

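# Example: derive decision_time_s from timestamps (assumes ms columns; adjust as needed)
if {'ts_presented_ms','ts_decision_ms'}.issubset(incl_trials.columns):
    incl_trials['decision_time_s'] = (incl_trials['ts_decision_ms'] - incl_trials['ts_presented_ms']) / 1000.0
elif 'decision_time_s' not in incl_trials.columns:
    # Keep the column present so the aggregation below does not raise a KeyError
    incl_trials['decision_time_s'] = np.nan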

# Example: map numeric condition codes to labels if required
condition_map = {0: 'A_Standard', 1: 'B_Explainable'}
if pd.api.types.is_numeric_dtype(incl_trials['condition']):
    incl_trials['condition_label'] = incl_trials['condition'].map(condition_map)
else:
    # Already labelled (e.g., strings); use as-is
    incl_trials['condition_label'] = incl_trials['condition']

# Aggregate per-participant outcomes (acceptance rate, mean decision time)
agg = (incl_trials
       .assign(accept=lambda d: (d['choice'].str.lower()=='accept').astype(int))
       .groupby(['participant_id','condition_label'], as_index=False)
       .agg(acceptance_rate=('accept','mean'),
            mean_decision_time_s=('decision_time_s','mean')))

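# Merge baseline/post trust.
# validate='many_to_one' raises if participant_id is duplicated in either lookup table,
# which would otherwise silently multiply rows.
merged = (agg
          .merge(incl_participants[['participant_id','trust_baseline']],
                 on='participant_id', how='left', validate='many_to_one')
          .merge(incl_post[['participant_id','trust_post','nasa_tlx']],
                 on='participant_id', how='left', validate='many_to_one'))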

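# Express acceptance rate as a percentage (0-100), rounded to one decimal place
merged['acceptance_rate'] = (merged['acceptance_rate'] * 100).round(1)
# Store the condition label as a categorical
merged['condition_label'] = merged['condition_label'].astype('category')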
display(merged.head(10))
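
After merging, a few integrity counts can catch duplicated keys or participants whose trust scores did not join. The sketch below uses a hypothetical helper, `check_merged`, and a toy table; only the column names come from the cell above.

```python
import pandas as pd

def check_merged(merged, key=('participant_id', 'condition_label')):
    """Return simple integrity counts for the merged table."""
    key = list(key)
    return {
        # Rows that repeat an earlier (participant, condition) pair
        'duplicate_keys': int(merged.duplicated(subset=key).sum()),
        # Rows where the left join found no matching trust score
        'missing_trust_baseline': int(merged['trust_baseline'].isna().sum()),
        'missing_trust_post': int(merged['trust_post'].isna().sum()),
    }

# Toy table with one duplicated key and one missing post-task score
toy_merged = pd.DataFrame({
    'participant_id': [1, 2, 2],
    'condition_label': ['A_Standard', 'B_Explainable', 'B_Explainable'],
    'trust_baseline': [3.0, 4.0, 4.0],
    'trust_post': [3.5, None, 4.5],
})
report = check_merged(toy_merged)
print(report)
```

On real data, all three counts would be expected to be zero unless some participants legitimately lack post-task responses.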

## Save outputs
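Export the tidy dataset used by the downstream analysis, tables, and figures. CSV does not store dtypes, so `condition_label` must be re-cast to category when the file is read back.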

In [None]:
outpath = DATA_DERIVED / 'clean_data.csv'
merged.to_csv(outpath, index=False)
print(f'Saved: {outpath.resolve()}')

> **Note:** If you cannot share real logs, commit a small *synthetic* `clean_data.csv` that follows the same schema to enable downstream notebooks to run.
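
One way to produce such a synthetic file is sketched below. The column list mirrors the `merged` table built above; everything else is an assumption made for illustration: `make_synthetic_clean_data` is a hypothetical helper, and the value ranges (for example 1 to 7 for trust) are placeholders rather than the study's actual scales.

```python
import numpy as np
import pandas as pd

# Column order follows the merged table exported by this notebook
COLUMNS = ['participant_id', 'condition_label', 'acceptance_rate',
           'mean_decision_time_s', 'trust_baseline', 'trust_post', 'nasa_tlx']

def make_synthetic_clean_data(n=20, seed=0):
    """Build a random table with the clean_data.csv columns; values carry no meaning."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        'participant_id': np.arange(1, n + 1),
        'condition_label': rng.choice(['A_Standard', 'B_Explainable'], size=n),
        'acceptance_rate': rng.uniform(0, 100, size=n).round(1),
        'mean_decision_time_s': rng.uniform(1, 10, size=n).round(3),
        # Scale ranges below are placeholders, not the study's instruments
        'trust_baseline': rng.uniform(1, 7, size=n).round(2),
        'trust_post': rng.uniform(1, 7, size=n).round(2),
        'nasa_tlx': rng.uniform(0, 100, size=n).round(1),
    })
    return df[COLUMNS]

synthetic = make_synthetic_clean_data()
# Uncomment to write the file; never overwrite a real clean_data.csv with this.
# synthetic.to_csv('../data/derived/clean_data.csv', index=False)
print(synthetic.head())
```

A fixed seed keeps the committed file reproducible, so downstream notebooks produce the same placeholder output on every run.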