Lang Zhang¹ · JinYi Yoon² · Matthew Corbett³ · Abhijit Sarkar¹ · Bo Ji¹
¹Virginia Tech ²Inha University ³Army Cyber Institute at West Point
Official implementation of the IJCAI 2026 paper: "EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding"
EyeCue fuses egocentric video and gaze sequences through cross-attention to detect cognitive distraction in drivers.
EyeCue is a multimodal framework for detecting cognitive distraction in drivers using egocentric (first-person) video and synchronized gaze data. Unlike prior work that treats video and gaze independently, EyeCue introduces a gaze-guided patch selection mechanism: for each frame, the gaze point is mapped to a specific spatial patch in the video token space, and a cross-attention fusion module allows gaze signals to actively query visual context. This produces a semantically grounded representation that captures what the driver is looking at, not just where.
The model is trained end-to-end for binary classification — distinguishing cognitively distracted from attentive driving states.
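To make the gaze-guided patch selection concrete, here is a minimal sketch of mapping a normalized gaze point to a flat patch index in a 14×14 TimeSformer token grid (196 patches per frame, matching the architecture below). The function name and the normalized-coordinate convention are illustrative assumptions, not the repository's actual API:

```python
import torch

def gaze_to_patch_index(gaze_xy: torch.Tensor, grid_size: int = 14) -> torch.Tensor:
    """Map normalized gaze coordinates to flat patch indices.

    Hypothetical helper for illustration. Assumes gaze_xy holds (x, y)
    in [0, 1] relative to the frame, shape [B, T, 2], and that each
    224x224 frame is tokenized into a 14x14 grid of 16x16 patches
    (196 patches), as in the TimeSformer base backbone.
    """
    # Clamp to [0, 1) so a gaze point on the frame border still maps
    # to a valid patch.
    xy = gaze_xy.clamp(0.0, 1.0 - 1e-6)
    col = (xy[..., 0] * grid_size).long()  # patch column per frame
    row = (xy[..., 1] * grid_size).long()  # patch row per frame
    return row * grid_size + col           # flat index in [0, 195]

# Example: one clip of 8 frames, gaze fixed near the frame center.
gaze = torch.full((1, 8, 2), 0.5)
print(gaze_to_patch_index(gaze))  # tensor of shape [1, 8], all 105
```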
- Gaze-guided patch selection maps per-frame gaze coordinates directly to video patch tokens in the TimeSformer feature space
- Cross-attention semantic fusion lets gaze tokens query video tokens, propagating spatial attention across modalities
- Lightweight gaze encoder with a learnable CLS token encodes raw (x, y) gaze sequences into rich representations
- End-to-end training on paired egocentric video + gaze data, no hand-crafted features required
- Egocentric perspective leverages first-person viewpoint for naturalistic distraction assessment
Requirements:
- Python ≥ 3.9
- CUDA 11.8+ (GPU strongly recommended)
- PyTorch 2.6.0
```bash
git clone https://github.com/langzhang2000/EyeCue.git
cd EyeCue

# Install PyTorch with CUDA 11.8
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
    --index-url https://download.pytorch.org/whl/cu118

# Install remaining dependencies
pip install -r requirements.txt
```

The pretrained TimeSformer backbone (`facebook/timesformer-base-finetuned-k600`) will be downloaded automatically from Hugging Face on first run.
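To sanity-check the environment before training, you can load the backbone once. This is a minimal check assuming the `transformers` library is listed in `requirements.txt`; the repository itself may construct the model differently:

```python
from transformers import TimesformerModel

# Downloads the pretrained backbone from Hugging Face on first call
# and caches it locally; subsequent runs reuse the cache.
model = TimesformerModel.from_pretrained("facebook/timesformer-base-finetuned-k600")
print(model.config.hidden_size)  # 768 for the base model
```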
EyeCue is trained and evaluated on CogDrive, our egocentric driving dataset with synchronized gaze recordings.
Download: CogDrive Dataset (Google Drive)
The dataset contains three folders:
- `all_gaze_coordinate` — gaze point data for all recordings
- `all_video_heatmap` — videos with gaze heatmap overlay
- `all_video_raw_resize` — resized raw videos
```bash
python new_train.py \
    --train_list /path/to/train.txt \
    --val_list /path/to/val.txt \
    --batch_size 4 \
    --epochs 15 \
    --lr 1e-5 \
    --clip_len 8 \
    --save_dir checkpoints
```

Key arguments:
| Argument | Default | Description |
|---|---|---|
| `--train_list` | required | Path to training list file |
| `--val_list` | required | Path to validation list file |
| `--batch_size` | 4 | Batch size |
| `--epochs` | 15 | Number of training epochs |
| `--lr` | 1e-5 | Learning rate |
| `--clip_len` | 8 | Number of frames sampled per clip |
| `--num_workers` | 1 | DataLoader worker threads |
| `--device` | auto | `cuda` or `cpu` |
| `--save_dir` | checkpoints | Directory to save model weights |
Checkpoints are saved as:
- `best_model.pth` — full model at best validation accuracy
- `eye_video_encoder.pth` — video encoder weights only
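To reuse a saved checkpoint for inference, something like the following should work. This is a sketch assuming `best_model.pth` stores a full-model state dict; `build_model()` is a hypothetical stand-in for however the repository constructs the model:

```python
import torch

# Assumes best_model.pth holds a state_dict for the full EyeCue model;
# build_model() is a hypothetical constructor, not the repo's actual API.
state = torch.load("checkpoints/best_model.pth", map_location="cpu")
model = build_model()
model.load_state_dict(state)
model.eval()
```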
To run multiple dataset splits in sequence:

```bash
bash new_test.sh
```

EyeCue consists of four modules:
```
Egocentric Video ──► VideoEncoder (TimeSformer) ──► video_tokens [B, T×196, D]
                                                            │
                                              Gaze-guided patch selection
                                                            │
Gaze Sequence ─────► GazeEncoder (Transformer) ──► gaze_tokens [B, T, D]
                                                            │
                                               Semantic (Cross-Attention)
                                                            │
                                            ClassificationHead ──► {0, 1}
```
| Module | Architecture | Output |
|---|---|---|
| VideoEncoder | TimeSformer (`facebook/timesformer-base-finetuned-k600`), 16-frame support via temporal embedding interpolation | `cls_token [B,1,D]`, `video_tokens [B,L-1,D]` |
| GazeEncoder | Linear projection + multi-layer self-attention with learnable CLS token | `cls_out [B,1,D]`, `gaze_tokens [B,T,D]` |
| Semantic | Stacked CrossAttentionBlocks — gaze tokens as query, selected video patches as key/value | fused features `[B,1,D]` |
| ClassificationHead | Flatten → Linear → ReLU → Dropout → Linear | `logits [B,2]` |
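For intuition, here is a minimal sketch of one cross-attention fusion block with gaze tokens as queries and gaze-selected video patches as keys/values. The head count, MLP layout, and residual structure are illustrative assumptions, not the repository's exact CrossAttentionBlock:

```python
import torch
import torch.nn as nn

class CrossAttentionBlockSketch(nn.Module):
    """Illustrative fusion block: gaze tokens query video patch tokens.

    Assumed shapes: gaze_tokens [B, T, D] as queries, selected video
    patches [B, T, D] (one gazed patch per frame) as keys/values.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, gaze_tokens, patch_tokens):
        q = self.norm_q(gaze_tokens)
        kv = self.norm_kv(patch_tokens)
        fused, _ = self.attn(q, kv, kv)  # gaze queries attend to patches
        x = gaze_tokens + fused          # residual connection
        return x + self.mlp(x)

# Example with a batch of 2 clips, 8 frames, 768-dim tokens.
block = CrossAttentionBlockSketch()
gaze = torch.randn(2, 8, 768)
patches = torch.randn(2, 8, 768)
print(block(gaze, patches).shape)  # torch.Size([2, 8, 768])
```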
If you find this work useful, please cite:
```bibtex
@inproceedings{zhang2026eyecue,
  title     = {EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding},
  author    = {Zhang, Lang and Yoon, JinYi and Corbett, Matthew and Sarkar, Abhijit and Ji, Bo},
  booktitle = {Proceedings of the Thirty-Fifth International Joint Conference on Artificial Intelligence (IJCAI)},
  year      = {2026}
}
```