immc-lab/SyEval-VL

SyEval-VL: Sycophantic Evaluation for Vision-Language Models

Read this in other languages: English, δΈ­ζ–‡.

SyEval-VL is a comprehensive evaluation framework for assessing sycophantic behavior in Vision-Language Models (VLMs). The project supports evaluation on multiple datasets including Visual Genome, Art Style, and Unsafe Driving.

πŸ“„ Paper Information
This is the official implementation of the paper "Evaluating and Mitigating Sycophancy in Large Vision-Language Models", accepted at ACM MM 2025.


πŸš€ Quick Start

Environment Setup

  1. Create Conda Virtual Environment
# Create new conda environment
conda create -n syeval-vl python=3.10
conda activate syeval-vl
  2. Install Dependencies
# Install basic dependencies
pip install -r requirements.txt

Weight Downloads

1. OpenCLIP Weights

OpenCLIP weights are downloaded automatically on first run; to pre-fetch them, run:

# OpenCLIP weights are automatically managed through open_clip library
python -c "import open_clip; open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')"

2. GroundingDINO Weights

Download GroundingDINO model weights:

# Create weights directory
mkdir -p weights

# Download GroundingDINO weights
wget -O weights/groundingdino_swint_ogc.pth \
  https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth

# Clone GroundingDINO configuration files
git clone https://github.com/IDEA-Research/GroundingDINO.git

Dataset Preparation

Create the following directory structure:

data/
β”œβ”€β”€ visual_genome/
β”‚   β”œβ”€β”€ processed_objects.json    # Image and object annotations
β”‚   β”œβ”€β”€ VG_100K/                  # Image files
β”‚   β”‚   β”œβ”€β”€ 1.jpg
β”‚   β”‚   β”œβ”€β”€ 2.jpg
β”‚   β”‚   └── ...
β”‚   └── VG_100K_2/                # Image files
β”‚       β”œβ”€β”€ 1.jpg
β”‚       β”œβ”€β”€ 2.jpg
β”‚       └── ...
β”œβ”€β”€ art_style/
β”‚   β”œβ”€β”€ annotations.json          # Art style annotations
β”‚   └── target_set/               # Target image set
β”‚       β”œβ”€β”€ style1.jpg
β”‚       β”œβ”€β”€ style2.jpg
β”‚       └── ...
└── unsafe_driving/
    β”œβ”€β”€ annotations.json          # Unsafe driving behavior annotations
    β”œβ”€β”€ query_set/                # Query image set
    β”‚   β”œβ”€β”€ 1.png
    β”‚   β”œβ”€β”€ 2.png
    β”‚   └── ...
    └── target_set/               # Target image set
        β”œβ”€β”€ 1.jpg
        β”œβ”€β”€ 2.jpg
        └── ...
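As a convenience, the directory skeleton above can be created in one command (a sketch; the annotation files and images themselves still need to be downloaded and placed manually, as described below):

```shell
# Create the expected dataset directory skeleton
mkdir -p data/visual_genome/VG_100K \
         data/visual_genome/VG_100K_2 \
         data/art_style/target_set \
         data/unsafe_driving/query_set \
         data/unsafe_driving/target_set
```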

Dataset Download and Preparation

  1. Visual Genome Dataset
# Download Visual Genome dataset
# Reference: https://visualgenome.org/
# Download images to data/visual_genome/VG_100K/ and data/visual_genome/VG_100K_2/
# Process annotation files into processed_objects.json format
  2. Art Style Dataset
# Prepare art style dataset
# Place images in data/art_style/target_set/
# Create annotations.json containing file_name and style labels
  3. Unsafe Driving Dataset
# Download unsafe driving dataset
# Dataset download link: https://www.kaggle.com/datasets/robinreni/revitsone-5class/data
# Extract and organize images to data/unsafe_driving/target_set/
# Create annotations.json containing image field and behavior labels
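The exact annotation schema is not specified beyond the field names mentioned above, so the sketch below is an assumption: it writes minimal annotations.json files using the documented file_name/image fields plus hypothetical "style" and "behavior" label keys. Adjust the keys to whatever dataset.py actually expects.

```python
import json
import os

# Hypothetical minimal annotation entries; field names "style" and
# "behavior" are assumptions inferred from the labels described above.
art_style_annotations = [
    {"file_name": "style1.jpg", "style": "impressionism"},
    {"file_name": "style2.jpg", "style": "cubism"},
]
unsafe_driving_annotations = [
    {"image": "1.jpg", "behavior": "texting"},
    {"image": "2.jpg", "behavior": "safe_driving"},
]

# Write the files into the expected dataset layout
os.makedirs("data/art_style", exist_ok=True)
os.makedirs("data/unsafe_driving", exist_ok=True)
with open("data/art_style/annotations.json", "w") as f:
    json.dump(art_style_annotations, f, indent=2)
with open("data/unsafe_driving/annotations.json", "w") as f:
    json.dump(unsafe_driving_annotations, f, indent=2)
```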

πŸ“Š Usage

Step 1: Generate Evaluation Questions

Use syeval.py to generate evaluation questions:

# Visual Genome dataset
python syeval.py \
    --dataset_name visual_genome \
    --dataset_path ./data/ \
    --image_num 500 \
    --sample_num 2 \
    --save_path ./output \
    --seed 42

# Art Style dataset
python syeval.py \
    --dataset_name art_style \
    --dataset_path ./data/ \
    --image_num 500 \
    --sample_num 1 \
    --save_path ./output

# Unsafe Driving dataset
python syeval.py \
    --dataset_name unsafe_driving \
    --dataset_path ./data/ \
    --image_num 500 \
    --sample_num 1 \
    --save_path ./output

syeval.py Parameter Description

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --dataset_path | str | ./data/ | Dataset file storage path |
| --image_num | int | 500 | Number of images to sample for dataset generation |
| --sample_num | int | 1 | Number of questions to generate per image |
| --dataset_name | str | visual_genome | Dataset name (visual_genome / art_style / unsafe_driving) |
| --save_path | str | ./output | Output JSON file save directory |
| --seed | int | 42 | Random seed for reproducibility |

Step 2: Execute Model Evaluation

Use main.py for model evaluation:

# Basic evaluation
python main.py \
    --dataset_name visual_genome \
    --model-base qwen-vl-max \
    --dataset_path ./data/ \
    --json_path ./output \
    --result_path ./results \
    --sample_num 300

# Evaluation with data augmentation
python main.py \
    --dataset_name art_style \
    --model-base qwen-vl-max \
    --augmention \
    --temperature 0.0 \
    --max-new-tokens 300 \
    --device cuda:0

main.py Parameter Description

Dataset Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --dataset_path | str | ./data/ | Dataset file storage path |
| --dataset_name | str | visual_genome | Dataset name for evaluation |
| --json_path | str | ./output | Question JSON file path |
| --result_path | str | ./results | Evaluation results save path |
| --augmention | bool | False | Whether to use data augmentation |
| --sample_num | int | 300 | Number of samples per image |

Model Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --model-base | str | qwen-vl-max | Model for evaluation |
| --eval_model_api_key | str | - | API key for the evaluated model |
| --device | str | cuda:3 | Computing device |
| --temperature | float | 0.0 | Sampling temperature for generation |
| --max-new-tokens | int | 300 | Maximum number of tokens to generate |

GroundingDINO Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --box_threshold | float | 0.45 | Bounding box threshold |
| --text_threshold | float | 0.45 | Text threshold |
| --weight_path | str | ./weights/groundingdino_swint_ogc.pth | Grounding model weight path |
| --grounding_py | str | ./GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py | Grounding model config file path |

API Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --api_key | str | Your_API_Key_Here | Model service API key |

Step 3: Evaluate Results

Use evaluation.py to evaluate results:

# Evaluate Visual Genome results
python evaluation.py \
    --dataset_name visual_genome \
    --mllm GPT-4o \
    --result_path ./results \
    --output_path ./evaluation \
    --sample_num 300

# Evaluate Art Style results
python evaluation.py \
    --dataset_name art_style \
    --model-base qwen-max \
    --augmention \
    --temperature 0.0

evaluation.py Parameter Description

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --api_key | str | - | Evaluation model API key |
| --mllm | str | GPT-4o | Evaluated MLLM model name |
| --sample_num | int | 300 | Number of evaluation samples |
| --dataset_name | str | visual_genome | Evaluation dataset name |
| --result_path | str | ./results | Raw evaluation results path |
| --output_path | str | ./evaluation | Processed evaluation results save path |
| --model-base | str | qwen-max | Base model for answer extraction |
| --temperature | float | 0.0 | Temperature for answer extraction |
| --max-new-tokens | int | 300 | Maximum tokens for answer extraction |
| --augmention | bool | False | Whether to evaluate augmented results |

πŸ”§ Project Structure

SyEval-VL/
β”œβ”€β”€ main.py                   # Main evaluation script
β”œβ”€β”€ syeval.py                 # Question generation script
β”œβ”€β”€ evaluation.py             # Results evaluation script
β”œβ”€β”€ utils.py                  # Utility functions
β”œβ”€β”€ syeval_utils.py           # Evaluation utility functions
β”œβ”€β”€ dataset.py                # Dataset processing
β”œβ”€β”€ prompt.py                 # Prompt definitions
β”œβ”€β”€ requirements.txt          # Dependency list
β”œβ”€β”€ weights/                  # Model weights directory
β”œβ”€β”€ output/                   # Generated question files
β”œβ”€β”€ results/                  # Evaluation results
β”œβ”€β”€ evaluation/               # Processed evaluation results
β”œβ”€β”€ README.md                 # Project documentation (English)
β”œβ”€β”€ README_ZH.md              # Project documentation (Chinese)
└── README_EN.md              # Project documentation (English, deprecated)

🚨 Important Notes

  1. API Keys: Ensure correct API keys are set for model access
  2. Data Paths: Adjust dataset paths to match your local setup
  3. GPU Memory: Some models may require substantial GPU memory
  4. Weight Paths: Ensure GroundingDINO weight paths are correctly set

πŸ“„ License

Please refer to the project's license file.

🀝 Contributing

Issues and pull requests to improve this project are welcome.

πŸ“š Citation

If you use this project, please cite our paper:

@inproceedings{syeval-vl-2025,
  title={Evaluating and Mitigating Sycophancy in Large Vision-Language Models},
  author={Jiayi Gao and Huaiwen Zhang},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  year={2025},
  publisher={ACM}
}

Abstract: This paper proposes the SyEval-VL framework for systematically evaluating sycophantic behavior in large vision-language models and presents corresponding mitigation strategies.
