Read this in other languages: English, 中文.
SyEval-VL is a comprehensive evaluation framework for assessing sycophantic behavior in Vision-Language Models (VLMs). The project supports evaluation on multiple datasets including Visual Genome, Art Style, and Unsafe Driving.
Paper Information
This is the official implementation of the paper "Evaluating and Mitigating Sycophancy in Large Vision-Language Models", accepted at ACM MM 2025.
- Create Conda Virtual Environment
# Create new conda environment
conda create -n syeval-vl python=3.10
conda activate syeval-vl
- Install Dependencies
# Install basic dependencies
pip install -r requirements.txt
OpenCLIP weights are downloaded automatically on first run, or you can prefetch them manually:
# OpenCLIP weights are managed automatically through the open_clip library
python -c "import open_clip; open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')"
Download GroundingDINO model weights:
# Create weights directory
mkdir -p weights
# Download GroundingDINO weights
wget -O weights/groundingdino_swint_ogc.pth \
https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
# Clone GroundingDINO configuration files
git clone https://github.com/IDEA-Research/GroundingDINO.git
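Optionally, verify the setup with a quick load test. This snippet is illustrative and not part of the repo; it assumes the GroundingDINO package has been installed from the clone (e.g. pip install -e ./GroundingDINO):

# Sanity check: load the downloaded weights with the cloned config
from groundingdino.util.inference import load_model

model = load_model(
    "./GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",  # config from the clone
    "./weights/groundingdino_swint_ogc.pth",                            # weights downloaded above
    device="cpu",  # no GPU needed just to verify the checkpoint loads
)
print(type(model).__name__, "loaded")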
Create the following directory structure:
data/
├── visual_genome/
│   ├── processed_objects.json     # Image and object annotations
│   ├── VG_100K/                   # Image files
│   │   ├── 1.jpg
│   │   ├── 2.jpg
│   │   └── ...
│   └── VG_100K_2/                 # Image files
│       ├── 1.jpg
│       ├── 2.jpg
│       └── ...
├── art_style/
│   ├── annotations.json           # Art style annotations
│   └── target_set/                # Target image set
│       ├── style1.jpg
│       ├── style2.jpg
│       └── ...
└── unsafe_driving/
    ├── annotations.json           # Unsafe driving behavior annotations
    ├── query_set/                 # Query image set
    │   ├── 1.png
    │   ├── 2.png
    │   └── ...
    └── target_set/                # Target image set
        ├── 1.jpg
        ├── 2.jpg
        └── ...
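Before running the scripts, a small check like the following (illustrative only, not part of the repo; paths taken from the tree above) can confirm the layout is in place:

# Verify the expected dataset layout
from pathlib import Path

expected = [
    "data/visual_genome/processed_objects.json",
    "data/visual_genome/VG_100K",
    "data/visual_genome/VG_100K_2",
    "data/art_style/annotations.json",
    "data/art_style/target_set",
    "data/unsafe_driving/annotations.json",
    "data/unsafe_driving/query_set",
    "data/unsafe_driving/target_set",
]
missing = [p for p in expected if not Path(p).exists()]
print("Layout OK" if not missing else f"Missing: {missing}")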
- Visual Genome Dataset
# Download Visual Genome dataset
# Reference: https://visualgenome.org/
# Download images to data/visual_genome/VG_100K/ and data/visual_genome/VG_100K_2/
# Process annotation files into processed_objects.json format
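As a starting point, a conversion along these lines can distill Visual Genome's raw objects.json into per-image object lists. The authoritative processed_objects.json schema is defined by dataset.py in this repo, so the record fields below are assumptions:

# Illustrative sketch: collapse VG's objects.json into one record per image.
# The output fields ("image_id", "objects") are assumed; check dataset.py.
import json

with open("objects.json") as f:  # raw Visual Genome object annotations
    raw = json.load(f)

processed = [
    {
        "image_id": entry["image_id"],
        "objects": sorted({obj["names"][0] for obj in entry["objects"] if obj["names"]}),
    }
    for entry in raw
]

with open("data/visual_genome/processed_objects.json", "w") as f:
    json.dump(processed, f)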
- Art Style Dataset
# Prepare art style dataset
# Place images in data/art_style/target_set/
# Create annotations.json containing file_name and style labels
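For example, an annotation file with the two fields named above could be built like this (the schema is inferred from this README, and the style label is a placeholder):

# Illustrative sketch: build annotations.json from the images in target_set/.
# Replace the placeholder label with each image's true art style.
import json
from pathlib import Path

records = [
    {"file_name": p.name, "style": "impressionism"}  # placeholder style label
    for p in sorted(Path("data/art_style/target_set").glob("*.jpg"))
]

with open("data/art_style/annotations.json", "w") as f:
    json.dump(records, f, indent=2)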
- Unsafe Driving Dataset
# Download unsafe driving dataset
# Dataset download link: https://www.kaggle.com/datasets/robinreni/revitsone-5class/data
# Extract and organize images to data/unsafe_driving/target_set/
# Create annotations.json containing image field and behavior labels
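Assuming the extracted Kaggle archive groups images into one folder per behavior class, a reorganization sketch might look like this (the "image"/"behavior" fields are inferred from this README; the source path is a placeholder):

# Illustrative sketch: flatten the per-class folders into target_set/ and
# record each image's behavior label. Field names are assumptions.
import json
import shutil
from pathlib import Path

src = Path("revitsone-5class")  # extracted Kaggle archive (path may differ)
dst = Path("data/unsafe_driving/target_set")
dst.mkdir(parents=True, exist_ok=True)

records = []
for class_dir in sorted(p for p in src.iterdir() if p.is_dir()):
    for img in sorted(class_dir.glob("*.jpg")):
        new_name = f"{class_dir.name}_{img.name}"  # avoid clashes across classes
        shutil.copy(img, dst / new_name)
        records.append({"image": new_name, "behavior": class_dir.name})

with open("data/unsafe_driving/annotations.json", "w") as f:
    json.dump(records, f, indent=2)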
Use `syeval.py` to generate evaluation questions:
# Visual Genome dataset
python syeval.py \
--dataset_name visual_genome \
--dataset_path ./data/ \
--image_num 500 \
--sample_num 2 \
--save_path ./output \
--seed 42
# Art Style dataset
python syeval.py \
--dataset_name art_style \
--dataset_path ./data/ \
--image_num 500 \
--sample_num 1 \
--save_path ./output
# Unsafe Driving dataset
python syeval.py \
--dataset_name unsafe_driving \
--dataset_path ./data/ \
--image_num 500 \
--sample_num 1 \
--save_path ./output
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--dataset_path` | str | `./data/` | Dataset file storage path |
| `--image_num` | int | 500 | Number of images to sample for dataset generation |
| `--sample_num` | int | 1 | Number of questions to generate per image |
| `--dataset_name` | str | `visual_genome` | Dataset name (`visual_genome` / `art_style` / `unsafe_driving`) |
| `--save_path` | str | `./output` | Output JSON file save directory |
| `--seed` | int | 42 | Random seed for reproducibility |
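After generation, the output directory can be inspected with a few lines of Python. The record schema is whatever syeval.py emits; this snippet (not part of the repo) only counts records per file:

# Quick look at the generated question files
import json
from pathlib import Path

for path in sorted(Path("./output").glob("*.json")):
    with open(path) as f:
        questions = json.load(f)
    print(f"{path.name}: {len(questions)} records")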
Use `main.py` for model evaluation:
# Basic evaluation
python main.py \
--dataset_name visual_genome \
--model-base qwen-vl-max \
--dataset_path ./data/ \
--json_path ./output \
--result_path ./results \
--sample_num 300
# Evaluation with data augmentation
python main.py \
--dataset_name art_style \
--model-base qwen-vl-max \
--augmention \
--temperature 0.0 \
--max-new-tokens 300 \
--device cuda:0
Dataset parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--dataset_path` | str | `./data/` | Dataset file storage path |
| `--dataset_name` | str | `visual_genome` | Dataset name for evaluation |
| `--json_path` | str | `./output` | Question JSON file path |
| `--result_path` | str | `./results` | Evaluation results save path |
| `--augmention` | bool | False | Whether to use data augmentation |
| `--sample_num` | int | 300 | Number of samples per image |
Model parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--model-base` | str | `qwen-vl-max` | Model to evaluate |
| `--eval_model_api_key` | str | - | API key for the evaluated model |
| `--device` | str | `cuda:3` | Computing device |
| `--temperature` | float | 0.0 | Sampling temperature for generation |
| `--max-new-tokens` | int | 300 | Maximum number of tokens to generate |
GroundingDINO parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--box_threshold` | float | 0.45 | Bounding box confidence threshold |
| `--text_threshold` | float | 0.45 | Text-phrase matching threshold |
| `--weight_path` | str | `./weights/groundingdino_swint_ogc.pth` | Grounding model weight path |
| `--grounding_py` | str | `./GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py` | Grounding model config file path |
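To get a feel for what the two thresholds control, GroundingDINO's bundled inference helpers can be run standalone. This is an illustrative snippet, not how main.py invokes grounding internally; the image path and caption are placeholders:

# box_threshold drops low-confidence boxes; text_threshold drops weak
# phrase-to-box matches. This project defaults both to 0.45.
from groundingdino.util.inference import load_model, load_image, predict

model = load_model(
    "./GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "./weights/groundingdino_swint_ogc.pth",
    device="cpu",  # use e.g. "cuda:0" if a GPU is available
)
image_source, image = load_image("data/visual_genome/VG_100K/1.jpg")  # placeholder image

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="person . car . tree",  # placeholder caption
    box_threshold=0.45,
    text_threshold=0.45,
    device="cpu",
)
print(list(zip(phrases, logits.tolist())))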
API parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--api_key` | str | `Your_API_Key_Here` | Model service API key |
Use `evaluation.py` to evaluate results:
# Evaluate Visual Genome results
python evaluation.py \
--dataset_name visual_genome \
--mllm GPT-4o \
--result_path ./results \
--output_path ./evaluation \
--sample_num 300
# Evaluate Art Style results
python evaluation.py \
--dataset_name art_style \
--model-base qwen-max \
--augmention \
--temperature 0.0
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--api_key` | str | - | Evaluation model API key |
| `--mllm` | str | `GPT-4o` | Name of the MLLM being evaluated |
| `--sample_num` | int | 300 | Number of evaluation samples |
| `--dataset_name` | str | `visual_genome` | Evaluation dataset name |
| `--result_path` | str | `./results` | Raw evaluation results path |
| `--output_path` | str | `./evaluation` | Processed evaluation results save path |
| `--model-base` | str | `qwen-max` | Base model for answer extraction |
| `--temperature` | float | 0.0 | Temperature for answer extraction |
| `--max-new-tokens` | int | 300 | Maximum tokens for answer extraction |
| `--augmention` | bool | False | Whether to evaluate augmented results |
SyEval-VL/
├── main.py             # Main evaluation script
├── syeval.py           # Question generation script
├── evaluation.py       # Results evaluation script
├── utils.py            # Utility functions
├── syeval_utils.py     # Evaluation utility functions
├── dataset.py          # Dataset processing
├── prompt.py           # Prompt definitions
├── requirements.txt    # Dependency list
├── weights/            # Model weights directory
├── output/             # Generated question files
├── results/            # Evaluation results
├── evaluation/         # Processed evaluation results
├── README.md           # Project documentation (English)
├── README_ZH.md        # Project documentation (Chinese)
└── README_EN.md        # Project documentation (English, deprecated)
- API Keys: Make sure the correct API keys are set for model access
- Data Paths: Adjust dataset paths to match your local setup
- GPU Memory: Some models may require substantial GPU memory
- Weight Paths: Make sure the GroundingDINO weight and config paths are set correctly
Please refer to the project's license file.
Issues and Pull Requests to improve this project are welcome.
If you use this project, please cite our paper:
@inproceedings{syeval-vl-2025,
title={Evaluating and Mitigating Sycophancy in Large Vision-Language Models},
author={Jiayi Gao and Huaiwen Zhang},
booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
year={2025},
publisher={ACM}
}
Abstract: This paper proposes the SyEval-VL framework for systematically evaluating sycophantic behavior in large vision-language models and presents corresponding mitigation strategies.