EyeCue
Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding

IJCAI 2026

Lang Zhang¹ · JinYi Yoon² · Matthew Corbett³ · Abhijit Sarkar¹ · Bo Ji¹

¹Virginia Tech   ²Inha University   ³Army Cyber Institute at West Point

License: MIT

Official implementation of the IJCAI 2026 paper: "EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding"

[Figure: EyeCue framework]
EyeCue fuses egocentric video and gaze sequences through cross-attention to detect cognitive distraction in drivers.

Overview

EyeCue is a multimodal framework for detecting cognitive distraction in drivers using egocentric (first-person) video and synchronized gaze data. Unlike prior work that treats video and gaze independently, EyeCue introduces a gaze-guided patch selection mechanism: for each frame, the gaze point is mapped to a specific spatial patch in the video token space, and a cross-attention fusion module allows gaze signals to actively query visual context. This produces a semantically grounded representation that captures what the driver is looking at, not just where.
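
For intuition, here is a minimal sketch of that patch-selection step, assuming gaze coordinates normalized to [0, 1] and the 14×14 token grid (196 patches per frame) used by the TimeSformer backbone; the function and tensor names are illustrative, not taken from the codebase:

import torch

def gaze_to_patch_index(gaze_xy, grid_size=14):
    # Map normalized (x, y) gaze points to flat patch indices on a grid_size x grid_size grid.
    # gaze_xy: [B, T, 2] tensor with values in [0, 1]
    col = (gaze_xy[..., 0] * grid_size).long().clamp(0, grid_size - 1)
    row = (gaze_xy[..., 1] * grid_size).long().clamp(0, grid_size - 1)
    return row * grid_size + col  # [B, T], an index into the 196 patches of each frame

# Example: select one video token per frame using the gaze-derived index
B, T, D = 2, 8, 768
video_tokens = torch.randn(B, T * 196, D).view(B, T, 196, D)
idx = gaze_to_patch_index(torch.rand(B, T, 2))  # [B, T]
selected = video_tokens.gather(2, idx[..., None, None].expand(B, T, 1, D)).squeeze(2)  # [B, T, D]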

The model is trained end-to-end for binary classification — distinguishing cognitively distracted from attentive driving states.

Highlights

  • Gaze-guided patch selection maps per-frame gaze coordinates directly to video patch tokens in the TimeSformer feature space
  • Cross-attention semantic fusion lets gaze tokens query video tokens, propagating spatial attention across modalities
  • Lightweight gaze encoder with a learnable CLS token encodes raw (x, y) gaze sequences into rich representations (see the sketch after this list)
  • End-to-end training on paired egocentric video + gaze data, no hand-crafted features required
  • Egocentric perspective leverages first-person viewpoint for naturalistic distraction assessment
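
As a concrete illustration of the gaze encoder above, a minimal sketch; dimensions, layer counts, and the class name are assumptions for illustration, not the repository's actual values (positional encodings are omitted for brevity):

import torch
import torch.nn as nn

class GazeEncoderSketch(nn.Module):
    # Illustrative stand-in for the gaze encoder; hyperparameters are assumed, not the paper's.
    def __init__(self, dim=768, num_layers=2, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(2, dim)                    # raw (x, y) -> token embedding
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learnable CLS token
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, gaze_xy):                          # gaze_xy: [B, T, 2]
        tokens = self.proj(gaze_xy)                      # [B, T, dim]
        cls = self.cls.expand(tokens.size(0), -1, -1)    # [B, 1, dim]
        out = self.encoder(torch.cat([cls, tokens], dim=1))
        return out[:, :1], out[:, 1:]                    # cls_out [B, 1, dim], gaze_tokens [B, T, dim]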

Installation

Requirements:

  • Python ≥ 3.9
  • CUDA 11.8+ (GPU strongly recommended)
  • PyTorch 2.6.0

git clone https://github.com/langzhang2000/EyeCue.git
cd EyeCue

# Install PyTorch with CUDA 11.8
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
    --index-url https://download.pytorch.org/whl/cu118

# Install remaining dependencies
pip install -r requirements.txt

The pretrained TimeSformer backbone (facebook/timesformer-base-finetuned-k600) will be downloaded automatically from Hugging Face on first run.
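
To confirm the backbone resolves in your environment, here is a quick check using the Hugging Face transformers API (the values in the comment reflect the typical configuration of this checkpoint):

from transformers import TimesformerModel

model = TimesformerModel.from_pretrained("facebook/timesformer-base-finetuned-k600")
print(model.config.num_frames, model.config.image_size)  # typically 8 frames at 224x224 for this checkpoint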

Dataset

EyeCue is trained and evaluated on CogDrive, our egocentric driving dataset with synchronized gaze recordings.

Download: CogDrive Dataset (Google Drive)

The dataset contains three folders:

  • all_gaze_coordinate — gaze point data for all recordings
  • all_video_heatmap — videos with gaze heatmap overlay
  • all_video_raw_resize — resized raw videos
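
Once extracted, a quick sanity check that the three folders are in place (the root path below is a placeholder for wherever you put the dataset):

from pathlib import Path

root = Path("/path/to/CogDrive")  # placeholder: adjust to your extraction location
for name in ("all_gaze_coordinate", "all_video_heatmap", "all_video_raw_resize"):
    folder = root / name
    status = f"{len(list(folder.iterdir()))} entries" if folder.is_dir() else "missing"
    print(f"{name}: {status}")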

Getting Started

Training

python new_train.py \
    --train_list /path/to/train.txt \
    --val_list   /path/to/val.txt \
    --batch_size 4 \
    --epochs     15 \
    --lr         1e-5 \
    --clip_len   8 \
    --save_dir   checkpoints

Key arguments:

Argument        Default       Description
--train_list    (required)    Path to the training list file
--val_list      (required)    Path to the validation list file
--batch_size    4             Batch size
--epochs        15            Number of training epochs
--lr            1e-5          Learning rate
--clip_len      8             Number of frames sampled per clip
--num_workers   1             DataLoader worker threads
--device        auto          cuda or cpu
--save_dir      checkpoints   Directory to save model weights

Checkpoints are saved as:

  • best_model.pth — full model at best validation accuracy
  • eye_video_encoder.pth — video encoder weights only
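
A sketch of reloading a checkpoint for evaluation; whether best_model.pth stores a state_dict or a pickled module depends on the training script, so inspect it first:

import torch

state = torch.load("checkpoints/best_model.pth", map_location="cpu")
print(type(state))  # dict of tensors for a state_dict; an nn.Module subclass for a pickled model
# For a state_dict, load it into a freshly constructed model (the class name here is a placeholder):
# model = EyeCueModel(); model.load_state_dict(state); model.eval()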

Batch Training

To run multiple dataset splits in sequence:

bash new_test.sh

Model Architecture

EyeCue consists of four modules:

Egocentric Video ──► VideoEncoder (TimeSformer)  ──► video_tokens [B, T×196, D]
                                                          │
                                              Gaze-guided patch selection
                                                          │
Gaze Sequence ───► GazeEncoder (Transformer)  ──► gaze_tokens [B, T, D]
                                                          │
                                              Semantic (Cross-Attention)
                                                          │
                                              ClassificationHead ──► {0, 1}
  • VideoEncoder: TimeSformer (facebook/timesformer-base-finetuned-k600), with 16-frame support via temporal embedding interpolation. Outputs cls_token [B, 1, D] and video_tokens [B, L-1, D].
  • GazeEncoder: linear projection plus multi-layer self-attention with a learnable CLS token. Outputs cls_out [B, 1, D] and gaze_tokens [B, T, D].
  • Semantic: stacked CrossAttentionBlocks, with gaze tokens as query and selected video patches as key/value. Outputs fused features [B, 1, D].
  • ClassificationHead: Flatten → Linear → ReLU → Dropout → Linear. Outputs logits [B, 2].
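
A minimal sketch of one such cross-attention block (gaze tokens as queries, gaze-selected video patches as keys/values); layer sizes and the class name are illustrative assumptions, not the repository's implementation:

import torch
import torch.nn as nn

class CrossAttentionBlockSketch(nn.Module):
    # Illustrative fusion block: gaze queries attend over gaze-selected video patches.
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_out = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, gaze_tokens, video_patches):
        # gaze_tokens: [B, T, D] (queries); video_patches: [B, T, D] (keys/values)
        q, kv = self.norm_q(gaze_tokens), self.norm_kv(video_patches)
        fused, _ = self.attn(q, kv, kv)          # cross-attention across modalities
        x = gaze_tokens + fused                  # residual connection
        return x + self.mlp(self.norm_out(x))    # feed-forward with residual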

Citation

If you find this work useful, please cite:

@inproceedings{zhang2026eyecue,
  title     = {EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding},
  author    = {Zhang, Lang and Yoon, JinYi and Corbett, Matthew and Sarkar, Abhijit and Ji, Bo},
  booktitle = {Proceedings of the Thirty-Fifth International Joint Conference on
               Artificial Intelligence (IJCAI)},
  year      = {2026}
}
