Read this in other languages: English, 中文.
SyEval-VL is a comprehensive evaluation framework for assessing sycophantic behavior in Vision-Language Models (VLMs). The project supports evaluation on multiple datasets including Visual Genome, Art Style, and Unsafe Driving.
Paper Information
This is the official implementation of the paper "Evaluating and Mitigating Sycophancy in Large Vision-Language Models", accepted at ACM MM 2025.
- Create Conda Virtual Environment
# Create new conda environment
conda create -n syeval-vl python=3.10
conda activate syeval-vl
- Install Dependencies
# Install basic dependencies
pip install -r requirements.txt
OpenCLIP weights are downloaded automatically on first run, or you can prefetch them manually:
# OpenCLIP weights are managed automatically through the open_clip library
python -c "import open_clip; open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')"
Download GroundingDINO model weights:
# Create weights directory
mkdir -p weights
# Download GroundingDINO weights
wget -O weights/groundingdino_swint_ogc.pth \
https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
# Clone GroundingDINO configuration files
git clone https://github.com/IDEA-Research/GroundingDINO.git
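Optionally, verify the setup with a quick load test. This snippet is illustrative and not part of the repo; it assumes the GroundingDINO package has been installed from the clone (e.g. pip install -e ./GroundingDINO):

# Sanity check: load the downloaded weights with the cloned config
from groundingdino.util.inference import load_model

model = load_model(
    "./GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",  # config from the clone
    "./weights/groundingdino_swint_ogc.pth",                            # weights downloaded above
    device="cpu",  # no GPU needed just to verify the checkpoint loads
)
print(type(model).__name__, "loaded")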
Create the following directory structure:
data/
├── visual_genome/
│   ├── processed_objects.json     # Image and object annotations
│   ├── VG_100K/                   # Image files
│   │   ├── 1.jpg
│   │   ├── 2.jpg
│   │   └── ...
│   └── VG_100K_2/                 # Image files
│       ├── 1.jpg
│       ├── 2.jpg
│       └── ...
├── art_style/
│   ├── annotations.json           # Art style annotations
│   └── target_set/                # Target image set
│       ├── style1.jpg
│       ├── style2.jpg
│       └── ...
└── unsafe_driving/
    ├── annotations.json           # Unsafe driving behavior annotations
    ├── query_set/                 # Query image set
    │   ├── 1.png
    │   ├── 2.png
    │   └── ...
    └── target_set/                # Target image set
        ├── 1.jpg
        ├── 2.jpg
        └── ...
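Before running the scripts, a small check like the following (illustrative only, not part of the repo; paths taken from the tree above) can confirm the layout is in place:

# Verify the expected dataset layout
from pathlib import Path

expected = [
    "data/visual_genome/processed_objects.json",
    "data/visual_genome/VG_100K",
    "data/visual_genome/VG_100K_2",
    "data/art_style/annotations.json",
    "data/art_style/target_set",
    "data/unsafe_driving/annotations.json",
    "data/unsafe_driving/query_set",
    "data/unsafe_driving/target_set",
]
missing = [p for p in expected if not Path(p).exists()]
print("Layout OK" if not missing else f"Missing: {missing}")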
- Visual Genome Dataset
# Download Visual Genome dataset
# Reference: https://visualgenome.org/
# Download images to data/visual_genome/VG_100K/ and data/visual_genome/VG_100K_2/
# Process annotation files into processed_objects.json format
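As a starting point, a conversion along these lines can distill Visual Genome's raw objects.json into per-image object lists. The authoritative processed_objects.json schema is defined by dataset.py in this repo, so the record fields below are assumptions:

# Illustrative sketch: collapse VG's objects.json into one record per image.
# The output fields ("image_id", "objects") are assumed; check dataset.py.
import json

with open("objects.json") as f:  # raw Visual Genome object annotations
    raw = json.load(f)

processed = [
    {
        "image_id": entry["image_id"],
        "objects": sorted({obj["names"][0] for obj in entry["objects"] if obj["names"]}),
    }
    for entry in raw
]

with open("data/visual_genome/processed_objects.json", "w") as f:
    json.dump(processed, f)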
- Art Style Dataset
# Prepare art style dataset
# Place images in data/art_style/target_set/
# Create annotations.json containing file_name and style labels
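For example, an annotation file with the two fields named above could be built like this (the schema is inferred from this README, and the style label is a placeholder):

# Illustrative sketch: build annotations.json from the images in target_set/.
# Replace the placeholder label with each image's true art style.
import json
from pathlib import Path

records = [
    {"file_name": p.name, "style": "impressionism"}  # placeholder style label
    for p in sorted(Path("data/art_style/target_set").glob("*.jpg"))
]

with open("data/art_style/annotations.json", "w") as f:
    json.dump(records, f, indent=2)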
- Unsafe Driving Dataset
# Download unsafe driving dataset
# Dataset download link: https://www.kaggle.com/datasets/robinreni/revitsone-5class/data
# Extract and organize images to data/unsafe_driving/target_set/
# Create annotations.json containing image field and behavior labels
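Assuming the extracted Kaggle archive groups images into one folder per behavior class, a reorganization sketch might look like this (the "image"/"behavior" fields are inferred from this README; the source path is a placeholder):

# Illustrative sketch: flatten the per-class folders into target_set/ and
# record each image's behavior label. Field names are assumptions.
import json
import shutil
from pathlib import Path

src = Path("revitsone-5class")  # extracted Kaggle archive (path may differ)
dst = Path("data/unsafe_driving/target_set")
dst.mkdir(parents=True, exist_ok=True)

records = []
for class_dir in sorted(p for p in src.iterdir() if p.is_dir()):
    for img in sorted(class_dir.glob("*.jpg")):
        new_name = f"{class_dir.name}_{img.name}"  # avoid clashes across classes
        shutil.copy(img, dst / new_name)
        records.append({"image": new_name, "behavior": class_dir.name})

with open("data/unsafe_driving/annotations.json", "w") as f:
    json.dump(records, f, indent=2)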
Use `syeval.py` to generate evaluation questions:
# Visual Genome dataset
python syeval.py \
--dataset_name visual_genome \
--dataset_path ./data/ \
--image_num 500 \
--sample_num 2 \
--save_path ./output \
--seed 42
# Art Style dataset
python syeval.py \
--dataset_name art_style \
--dataset_path ./data/ \
--image_num 500 \
--sample_num 1 \
--save_path ./output
# Unsafe Driving dataset
python syeval.py \
--dataset_name unsafe_driving \
--dataset_path ./data/ \
--image_num 500 \
--sample_num 1 \
--save_path ./output
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--dataset_path` | str | `./data/` | Dataset file storage path |
| `--image_num` | int | 500 | Number of images to sample for dataset generation |
| `--sample_num` | int | 1 | Number of questions to generate per image |
| `--dataset_name` | str | `visual_genome` | Dataset name (`visual_genome` / `art_style` / `unsafe_driving`) |
| `--save_path` | str | `./output` | Output JSON file save directory |
| `--seed` | int | 42 | Random seed for reproducibility |
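After generation, the output directory can be inspected with a few lines of Python. The record schema is whatever syeval.py emits; this snippet (not part of the repo) only counts records per file:

# Quick look at the generated question files
import json
from pathlib import Path

for path in sorted(Path("./output").glob("*.json")):
    with open(path) as f:
        questions = json.load(f)
    print(f"{path.name}: {len(questions)} records")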
Use `main.py` for model evaluation:
# Basic evaluation
python main.py \
--dataset_name visual_genome \
--model-base qwen-vl-max \
--dataset_path ./data/ \
--json_path ./output \
--result_path ./results \
--sample_num 300
# Evaluation with data augmentation
python main.py \
--dataset_name art_style \
--model-base qwen-vl-max \
--augmention \
--temperature 0.0 \
--max-new-tokens 300 \
--device cuda:0
Dataset parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--dataset_path` | str | `./data/` | Dataset file storage path |
| `--dataset_name` | str | `visual_genome` | Dataset name for evaluation |
| `--json_path` | str | `./output` | Question JSON file path |
| `--result_path` | str | `./results` | Evaluation results save path |
| `--augmention` | bool | False | Whether to use data augmentation |
| `--sample_num` | int | 300 | Number of samples per image |
Model parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--model-base` | str | `qwen-vl-max` | Model to evaluate |
| `--eval_model_api_key` | str | - | API key for the evaluated model |
| `--device` | str | `cuda:3` | Computing device |
| `--temperature` | float | 0.0 | Sampling temperature for generation |
| `--max-new-tokens` | int | 300 | Maximum number of tokens to generate |
GroundingDINO parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--box_threshold` | float | 0.45 | Bounding box confidence threshold |
| `--text_threshold` | float | 0.45 | Text-phrase matching threshold |
| `--weight_path` | str | `./weights/groundingdino_swint_ogc.pth` | Grounding model weight path |
| `--grounding_py` | str | `./GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py` | Grounding model config file path |
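To get a feel for what the two thresholds control, GroundingDINO's bundled inference helpers can be run standalone. This is an illustrative snippet, not how main.py invokes grounding internally; the image path and caption are placeholders:

# box_threshold drops low-confidence boxes; text_threshold drops weak
# phrase-to-box matches. This project defaults both to 0.45.
from groundingdino.util.inference import load_model, load_image, predict

model = load_model(
    "./GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "./weights/groundingdino_swint_ogc.pth",
    device="cpu",  # use e.g. "cuda:0" if a GPU is available
)
image_source, image = load_image("data/visual_genome/VG_100K/1.jpg")  # placeholder image

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="person . car . tree",  # placeholder caption
    box_threshold=0.45,
    text_threshold=0.45,
    device="cpu",
)
print(list(zip(phrases, logits.tolist())))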
API parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--api_key` | str | `Your_API_Key_Here` | Model service API key |
Use `evaluation.py` to evaluate results:
# Evaluate Visual Genome results
python evaluation.py \
--dataset_name visual_genome \
--mllm GPT-4o \
--result_path ./results \
--output_path ./evaluation \
--sample_num 300
# Evaluate Art Style results
python evaluation.py \
--dataset_name art_style \
--model-base qwen-max \
--augmention \
--temperature 0.0
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--api_key` | str | - | Evaluation model API key |
| `--mllm` | str | `GPT-4o` | Name of the MLLM being evaluated |
| `--sample_num` | int | 300 | Number of evaluation samples |
| `--dataset_name` | str | `visual_genome` | Evaluation dataset name |
| `--result_path` | str | `./results` | Raw evaluation results path |
| `--output_path` | str | `./evaluation` | Processed evaluation results save path |
| `--model-base` | str | `qwen-max` | Base model for answer extraction |
| `--temperature` | float | 0.0 | Temperature for answer extraction |
| `--max-new-tokens` | int | 300 | Maximum tokens for answer extraction |
| `--augmention` | bool | False | Whether to evaluate augmented results |
SyEval-VL/
├── main.py             # Main evaluation script
├── syeval.py           # Question generation script
├── evaluation.py       # Results evaluation script
├── utils.py            # Utility functions
├── syeval_utils.py     # Evaluation utility functions
├── dataset.py          # Dataset processing
├── prompt.py           # Prompt definitions
├── requirements.txt    # Dependency list
├── weights/            # Model weights directory
├── output/             # Generated question files
├── results/            # Evaluation results
├── evaluation/         # Processed evaluation results
├── README.md           # Project documentation (English)
├── README_ZH.md        # Project documentation (Chinese)
└── README_EN.md        # Project documentation (English, deprecated)
- API Keys: Make sure the correct API keys are set for model access
- Data Paths: Adjust dataset paths to match your local setup
- GPU Memory: Some models may require substantial GPU memory
- Weight Paths: Make sure the GroundingDINO weight and config paths are set correctly
Please refer to the project's license file.
Issues and Pull Requests to improve this project are welcome.
If you use this project, please cite our paper:
@inproceedings{syeval-vl-2025,
title={Evaluating and Mitigating Sycophancy in Large Vision-Language Models},
author={Jiayi Gao and Huaiwen Zhang},
booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
year={2025},
publisher={ACM}
}
Abstract: This paper proposes the SyEval-VL framework for systematically evaluating sycophantic behavior in large vision-language models and presents corresponding mitigation strategies.