We propose FixATE, a framework that aligns a frozen VLM's visual attention with each user's characteristic gaze pattern through interpretability-based probing and personalized soft prompt tuning, enabling more faithful user simulation in visual recommendation scenarios.
Existing LLM-based user simulators perceive recommendations through text or structured metadata, missing the visual attention signals that drive real user behavior. FixATE bridges this gap by:
- Probing the VLM's internal visual attention via interpretability operators (Attention Rollout, GLIMPSE, AttnLRP) to obtain slot-level relevance distributions comparable with human fixation.
- Learning personalized soft prompts through a factorized basis decomposition, steering the model's attention toward each user's characteristic fixation pattern.
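
As an illustration of the probing step, the sketch below applies Attention Rollout, one of the three operators listed above, and pools the result into a slot-level distribution. It is a dependency-free sketch under stated assumptions, not the FixATE implementation: it assumes head-averaged attention matrices are already extracted from the VLM, and the names `attention_rollout`, `slot_relevance`, `query_index`, and `slot_tokens` are hypothetical.

```python
"""Attention Rollout with slot pooling (illustrative sketch, not the repo code)."""


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def attention_rollout(layers):
    """Roll attention out across layers.

    layers: list of [tokens x tokens] head-averaged attention matrices,
    ordered from the first layer to the last (assumed input format).
    """
    n = len(layers[0])
    rollout = [[float(i == j) for j in range(n)] for i in range(n)]
    for attn in layers:
        # Mix in the identity to account for the residual connection.
        mixed = [[0.5 * attn[i][j] + 0.5 * float(i == j) for j in range(n)]
                 for i in range(n)]
        # Renormalize each row so it stays a distribution over tokens.
        mixed = [[v / sum(row) for v in row] for row in mixed]
        rollout = matmul(mixed, rollout)
    return rollout


def slot_relevance(rollout, query_index, slot_tokens):
    """Pool the query token's relevance onto slots and normalize to sum to 1."""
    row = rollout[query_index]
    mass = [sum(row[t] for t in tokens) for tokens in slot_tokens]
    total = sum(mass)
    if total == 0:
        return [1.0 / len(mass)] * len(mass)  # fall back to uniform
    return [m / total for m in mass]


# Toy example: tokens 0 and 1 form slot 0, token 2 forms slot 1,
# and token 3 is the text token whose attention is probed.
layer = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.3, 0.3, 0.3, 0.1],
]
relevance = slot_relevance(attention_rollout([layer, layer]),
                           query_index=3, slot_tokens=[[0, 1], [2]])
print(relevance)
```

The resulting vector lives on the same slots as a normalized human dwell vector, which is what makes the two directly comparable.
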
- PyTorch >= 2.1
- Transformers >= 4.40
- Two supported VLM backbones: Qwen3-VL-4B-Instruct and InternVL3_5-4B-Instruct
Download RecGaze and AdSERP into datasets/ (e.g. RecGaze/, Adserp/).
RecGaze
python preprocessing/recgaze/dataset_preprocess_swipes.py
python preprocessing/recgaze/generate_interface_iamge.py 1-35

Raw inputs for the second step: datasets/raw/RecGaze/ (summary_feedback.csv, item_features.csv, poster_cache/). Outputs: datasets/RecGaze/init_interface_user_gaze(swipes).csv and datasets/RecGaze/interface_iamge/*.png.
AdSERP (optional)
python preprocessing/adserp/build_samples.py --mode both --n 5
python preprocessing/adserp/build_click_aoi_dataset.py --split all

Run from the repo root. Hyperparameters are in config/common_config.py and operator-specific files (config/attnlrp_config.py, glimpse_config.py, rollout_config.py, attnlrp_config_adserp.py). Put VLM weights under llm_models/ (paths in common_config.py).
RecGaze
python fixate/fixate_training/train_fixate_attnlrp.py # AttnLRP
python fixate/fixate_training/train_fixate_glimpse.py # GLIMPSE
python fixate/fixate_training/train_fixate_rollout.py # Attention Rollout
AdSERP
python fixate/fixate_training/train_fixate_attnlrp_adserp.py

Training scripts write per-run metrics to JSON under outputs/ and checkpoints/ (paths depend on config/). The metric definitions below match what compute_sample_metrics and the trainers aggregate (sample-level metrics are micro-averaged over the evaluation set and prefixed with micro_ in logs).
These metrics measure how well the normalized model slot-attention vector a matches the normalized human gaze (dwell) vector g over the same slots. choice is the ground-truth clicked slot index.
| Metric | Meaning | Better |
|---|---|---|
| KL divergence (kl_div / micro_kl_div) | KL(g ∥ a): how much human gaze g differs from model mass a | Lower |
| JS divergence (js_div / micro_js_div) | Squared Jensen–Shannon distance between g and a | Lower |
| Cosine similarity (cosine_sim / micro_cosine_sim) | Cosine similarity between vectors g and a | Higher |
| CSH@k (CSH@1, CSH@3, CSH@5) | Whether choice is in the top-k slots when ranked by model attention a | Higher |
| TGO@k (TGO@1, TGO@3, TGO@5) | Overlap between top-k by a and top-k by g (implementation normalizes by k for k>1) | Higher |
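
The alignment metrics in this table can be sketched in a few lines of plain Python. This is an illustrative reading of the definitions, not the code in compute_sample_metrics: the smoothing constant `EPS`, the tie-breaking in `top_k`, the use of natural logarithms, and all function names are assumptions.

```python
import math

EPS = 1e-12  # assumed smoothing constant to avoid log(0)


def normalize(v):
    """Scale a non-negative vector so it sums to 1."""
    total = sum(v)
    return [x / total for x in v]


def kl_div(g, a):
    """KL(g || a), with human gaze g as the reference distribution."""
    return sum(gi * math.log((gi + EPS) / (ai + EPS))
               for gi, ai in zip(g, a) if gi > 0)


def js_div(g, a):
    """Jensen-Shannon divergence, i.e. the squared JS distance (in nats)."""
    m = [(gi + ai) / 2 for gi, ai in zip(g, a)]
    return 0.5 * kl_div(g, m) + 0.5 * kl_div(a, m)


def cosine_sim(g, a):
    """Cosine similarity between the two slot vectors."""
    dot = sum(gi * ai for gi, ai in zip(g, a))
    norm = math.sqrt(sum(x * x for x in g)) * math.sqrt(sum(x * x for x in a))
    return dot / norm if norm else 0.0


def top_k(v, k):
    """Indices of the k largest entries (ties broken by lower index)."""
    return sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k]


def csh_at_k(a, choice, k):
    """1.0 if the clicked slot is among the top-k slots by model attention."""
    return float(choice in top_k(a, k))


def tgo_at_k(a, g, k):
    """Share of the top-k slots that model attention and gaze have in common."""
    return len(set(top_k(a, k)) & set(top_k(g, k))) / k


# Toy example with five slots: dwell times for g, model attention mass for a.
g = normalize([4.0, 1.0, 0.5, 0.0, 2.5])
a = normalize([0.1, 0.4, 0.25, 0.05, 0.2])
choice = 0

print("kl_div", kl_div(g, a))
print("js_div", js_div(g, a))
print("cosine_sim", cosine_sim(g, a))
print("CSH@5", csh_at_k(a, choice, 5))
print("TGO@3", tgo_at_k(a, g, 3))
```

Dividing the overlap by k keeps TGO@k in [0, 1] for every k, and for k = 1 it reduces to a plain hit indicator.
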
These metrics depend on the ground-truth choice (clicked slot) and/or the model’s discrete prediction, not only on the distributional alignment between g and a.
| Metric | Meaning | Better |
|---|---|---|
| Log-Loss | Negative log of the softmax probability, over candidate answer tokens, assigned to the true slot (not mass from the saliency / attention map in RecGaze eval) | Lower |
| AUC | One-vs-rest style rank score: other slots vs. choice under that same logit-based slot distribution | Higher |
| Answer Accuracy | Fraction of samples where the model’s generated choice (letter or index) matches the label | Higher |
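
A minimal sketch of the three choice-based metrics follows. It assumes each sample provides one logit per candidate answer token and computes AUC per sample as the fraction of non-chosen slots ranked below the clicked slot; the trainers may pool or aggregate differently, and all function names here are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over candidate answer-token logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_loss(logits, choice):
    """Negative log of the softmax probability assigned to the true slot."""
    return -math.log(softmax(logits)[choice])


def one_vs_rest_auc(logits, choice):
    """Fraction of other slots scored below the clicked slot (ties count half)."""
    others = [x for i, x in enumerate(logits) if i != choice]
    wins = sum(1.0 if logits[choice] > x else 0.5 if logits[choice] == x else 0.0
               for x in others)
    return wins / len(others)


def answer_accuracy(predictions, labels):
    """Fraction of samples whose generated choice matches the label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Toy example: four candidate slots, the user clicked slot 0.
logits = [2.0, 0.5, 1.0, 3.0]
choice = 0

loss = log_loss(logits, choice)
auc = one_vs_rest_auc(logits, choice)
acc = answer_accuracy(["A", "C", "B"], ["A", "B", "B"])
print(loss, auc, acc)
```

Softmax is monotonic, so ranking slots by logit and ranking them by softmax probability give the same AUC.
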
High-level layout:
├── config/ # Training hyperparameters & paths
├── datasets/ # RecGaze / AdSERP data
├── fixate/ # Core library + training scripts (fixate_training/)
├── preprocessing/ # Dataset-specific preprocessing
├── llm_models/ # Local VLM checkpoints (optional path)
├── outputs/ # Metrics / logs
└── requirements.txt
- RecGaze for the eye-tracking dataset in carousel-based recommendation
- AdSERP for the eye-tracking dataset in sponsored search
- Qwen3-VL-4B-Instruct and InternVL3_5-4B-Instruct for VLM backbones
- GLIMPSE, AttnLRP, and Attention Rollout for interpretability operators