Lang Zhang¹ · JinYi Yoon² · Matthew Corbett³ · Abhijit Sarkar¹ · Bo Ji¹
¹Virginia Tech ²Inha University ³Army Cyber Institute at West Point
Official implementation of the IJCAI 2026 paper: "EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding"
EyeCue fuses egocentric video and gaze sequences through cross-attention to detect cognitive distraction in drivers.
EyeCue is a multimodal framework for detecting cognitive distraction in drivers using egocentric (first-person) video and synchronized gaze data. Unlike prior work that treats video and gaze independently, EyeCue introduces a gaze-guided patch selection mechanism: for each frame, the gaze point is mapped to a specific spatial patch in the video token space, and a cross-attention fusion module allows gaze signals to actively query visual context. This produces a semantically grounded representation that captures what the driver is looking at, not just where.
The model is trained end-to-end for binary classification — distinguishing cognitively distracted from attentive driving states.
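To make the gaze-guided patch selection concrete, here is a minimal sketch of mapping a normalized gaze point to a flat patch index in a 14×14 TimeSformer token grid (196 patches per frame, matching the architecture below). The function name and the normalized-coordinate convention are illustrative assumptions, not the repository's actual API:

```python
import torch

def gaze_to_patch_index(gaze_xy: torch.Tensor, grid_size: int = 14) -> torch.Tensor:
    """Map normalized gaze coordinates to flat patch indices.

    Hypothetical helper for illustration. Assumes gaze_xy holds (x, y)
    in [0, 1] relative to the frame, shape [B, T, 2], and that each
    224x224 frame is tokenized into a 14x14 grid of 16x16 patches
    (196 patches), as in the TimeSformer base backbone.
    """
    # Clamp to [0, 1) so a gaze point on the frame border still maps
    # to a valid patch.
    xy = gaze_xy.clamp(0.0, 1.0 - 1e-6)
    col = (xy[..., 0] * grid_size).long()  # patch column per frame
    row = (xy[..., 1] * grid_size).long()  # patch row per frame
    return row * grid_size + col           # flat index in [0, 195]

# Example: one clip of 8 frames, gaze fixed near the frame center.
gaze = torch.full((1, 8, 2), 0.5)
print(gaze_to_patch_index(gaze))  # tensor of shape [1, 8], all 105
```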
- Gaze-guided patch selection maps per-frame gaze coordinates directly to video patch tokens in the TimeSformer feature space
- Cross-attention semantic fusion lets gaze tokens query video tokens, propagating spatial attention across modalities
- Lightweight gaze encoder with a learnable CLS token encodes raw (x, y) gaze sequences into rich representations
- End-to-end training on paired egocentric video + gaze data, no hand-crafted features required
- Egocentric perspective leverages first-person viewpoint for naturalistic distraction assessment
Requirements:
- Python ≥ 3.9
- CUDA 11.8+ (GPU strongly recommended)
- PyTorch 2.6.0
```bash
git clone https://github.com/langzhang2000/EyeCue.git
cd EyeCue

# Install PyTorch with CUDA 11.8
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
    --index-url https://download.pytorch.org/whl/cu118

# Install remaining dependencies
pip install -r requirements.txt
```

The pretrained TimeSformer backbone (`facebook/timesformer-base-finetuned-k600`) will be downloaded automatically from Hugging Face on first run.
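To sanity-check the environment before training, you can load the backbone once. This is a minimal check assuming the `transformers` library is listed in `requirements.txt`; the repository itself may construct the model differently:

```python
from transformers import TimesformerModel

# Downloads the pretrained backbone from Hugging Face on first call
# and caches it locally; subsequent runs reuse the cache.
model = TimesformerModel.from_pretrained("facebook/timesformer-base-finetuned-k600")
print(model.config.hidden_size)  # 768 for the base model
```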
EyeCue is trained and evaluated on CogDrive, our egocentric driving dataset with synchronized gaze recordings.
Download: CogDrive Dataset (Google Drive)
The dataset contains three folders:
- `all_gaze_coordinate` — gaze point data for all recordings
- `all_video_heatmap` — videos with gaze heatmap overlay
- `all_video_raw_resize` — resized raw videos
```bash
python new_train.py \
    --train_list /path/to/train.txt \
    --val_list /path/to/val.txt \
    --batch_size 4 \
    --epochs 15 \
    --lr 1e-5 \
    --clip_len 8 \
    --save_dir checkpoints
```

Key arguments:
| Argument | Default | Description |
|---|---|---|
| `--train_list` | required | Path to training list file |
| `--val_list` | required | Path to validation list file |
| `--batch_size` | 4 | Batch size |
| `--epochs` | 15 | Number of training epochs |
| `--lr` | 1e-5 | Learning rate |
| `--clip_len` | 8 | Number of frames sampled per clip |
| `--num_workers` | 1 | DataLoader worker threads |
| `--device` | auto | `cuda` or `cpu` |
| `--save_dir` | checkpoints | Directory to save model weights |
Checkpoints are saved as:
- `best_model.pth` — full model at best validation accuracy
- `eye_video_encoder.pth` — video encoder weights only
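To reuse a saved checkpoint for inference, something like the following should work. This is a sketch assuming `best_model.pth` stores a full-model state dict; `build_model()` is a hypothetical stand-in for however the repository constructs the model:

```python
import torch

# Assumes best_model.pth holds a state_dict for the full EyeCue model;
# build_model() is a hypothetical constructor, not the repo's actual API.
state = torch.load("checkpoints/best_model.pth", map_location="cpu")
model = build_model()
model.load_state_dict(state)
model.eval()
```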
To run multiple dataset splits in sequence:

```bash
bash new_test.sh
```

EyeCue consists of four modules:
```
Egocentric Video ──► VideoEncoder (TimeSformer) ──► video_tokens [B, T×196, D]
                                                            │
                                              Gaze-guided patch selection
                                                            │
Gaze Sequence ─────► GazeEncoder (Transformer) ──► gaze_tokens [B, T, D]
                                                            │
                                               Semantic (Cross-Attention)
                                                            │
                                            ClassificationHead ──► {0, 1}
```
| Module | Architecture | Output |
|---|---|---|
| VideoEncoder | TimeSformer (`facebook/timesformer-base-finetuned-k600`), 16-frame support via temporal embedding interpolation | `cls_token [B,1,D]`, `video_tokens [B,L-1,D]` |
| GazeEncoder | Linear projection + multi-layer self-attention with learnable CLS token | `cls_out [B,1,D]`, `gaze_tokens [B,T,D]` |
| Semantic | Stacked CrossAttentionBlocks — gaze tokens as query, selected video patches as key/value | fused features `[B,1,D]` |
| ClassificationHead | Flatten → Linear → ReLU → Dropout → Linear | `logits [B,2]` |
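For intuition, here is a minimal sketch of one cross-attention fusion block with gaze tokens as queries and gaze-selected video patches as keys/values. The head count, MLP layout, and residual structure are illustrative assumptions, not the repository's exact CrossAttentionBlock:

```python
import torch
import torch.nn as nn

class CrossAttentionBlockSketch(nn.Module):
    """Illustrative fusion block: gaze tokens query video patch tokens.

    Assumed shapes: gaze_tokens [B, T, D] as queries, selected video
    patches [B, T, D] (one gazed patch per frame) as keys/values.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, gaze_tokens, patch_tokens):
        q = self.norm_q(gaze_tokens)
        kv = self.norm_kv(patch_tokens)
        fused, _ = self.attn(q, kv, kv)  # gaze queries attend to patches
        x = gaze_tokens + fused          # residual connection
        return x + self.mlp(x)

# Example with a batch of 2 clips, 8 frames, 768-dim tokens.
block = CrossAttentionBlockSketch()
gaze = torch.randn(2, 8, 768)
patches = torch.randn(2, 8, 768)
print(block(gaze, patches).shape)  # torch.Size([2, 8, 768])
```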
If you find this work useful, please cite:
```bibtex
@inproceedings{zhang2026eyecue,
  title     = {EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding},
  author    = {Zhang, Lang and Yoon, JinYi and Corbett, Matthew and Sarkar, Abhijit and Ji, Bo},
  booktitle = {Proceedings of the Thirty-Fifth International Joint Conference on Artificial Intelligence (IJCAI)},
  year      = {2026}
}
```