Note: Due to certain policy restrictions, the version of the paper on arXiv is not the final version, whereas the data in this repository is the latest. When using the PhD dataset, it is advisable to follow the instructions provided within. If you have any questions or concerns, feel free to open an issue for discussion.
PhD is a dataset for Visual Hallucination Evaluation (VHE). Depending on what is asked (objects, attributes, ...) and how the question is asked (neutrally or misleadingly), we structure PhD along two dimensions: task and mode. PhD aims to explore the intrinsic causes of visual hallucination. The dataset construction centers around task-specific hallucinatory items (hitems). The overall evaluation data comprises over 30,000 samples.
In particular, we consider five content-based tasks, including object recognition, attribute recognition, sentiment understanding, positional reasoning, and counting.
We provide two modes: neutral and misleading. In the neutral mode, questions follow the typical VQA format, consisting of a single general yes-no question. In the misleading mode, questions are prefixed with a misleading context.
Ten tracks:
Showcases of the two modes and five subtasks for VHE, collectively forming ten evaluation tracks.
Additional track:
The following shows the case of visual context misleading.
Please download the images from the following links: Google Drive or Baidu Drive. Then, place them into the images/ directory.
For your convenience in evaluation, please organize the data in the following format.
images/
authentic/
COCO_val2014_000000000139.jpg
COCO_val2014_000000000164.jpg
....
manipulated/
alpaca/
0.jpg
0_bg.jpg
...
ambulance/
0.jpg
0_bg.jpg
...
...
data.jsonl
# Each line is one evaluation sample and can be read as a JSON dict.
# It includes the following keys:
"""
· image_path: path to the test image.
· hitem_gt: the hallucinatory item (hitem) around which the question is constructed.
· question: the yes-no question posed to the model.
· answer: the ground-truth answer ("Yes" or "No").
· task: one of the five tasks.
· mode: neutral or misleading.
"""
# For example
{"image_path": "images/authentic/COCO_val2014_000000125069.jpg", "hitem_gt": "dresser", "answer": "No",
"question": ..., "task": "object_recognition", "mode": "misleading_textual_indirect"}
To distinguish tracks, it is recommended to combine task and mode:
parts = mode.split('_')
track = task + '_' + '_'.join(parts[:2])  # e.g. object_recognition_misleading_textual
textual_category = parts[2] if len(parts) == 3 else None
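Putting this together, samples can be grouped per evaluation track; a short sketch (`split_by_track` is my own helper name, under the assumption that `mode` is either `neutral` or a `misleading_*` string as in the example above):

```python
from collections import defaultdict

def split_by_track(samples):
    """Group evaluation samples by track, e.g. 'object_recognition_misleading_textual'."""
    tracks = defaultdict(list)
    for s in samples:
        parts = s["mode"].split("_")
        # Track = task plus the first (up to two) mode tokens; the third token,
        # if present, is the textual-misleading category and is not part of the track.
        track = s["task"] + "_" + "_".join(parts[:2])
        tracks[track].append(s)
    return dict(tracks)
```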
The evaluation results of mainstream LVLMs on PhD are shown below.
- Test results on the neutral-mode experiments and the textual-context misleading experiments. All results are reported as the average PhD score (see paper). The neutral-mode results are ranked from top to bottom by average; the rows of the textual-context misleading results correspond to those of the neutral mode.
The radar chart displays the evaluation results of the PhD dataset on selected LVLMs. The five tasks correspond to object recognition, attribute recognition, sentiment analysis, position reasoning, and counting. The suffix ‘n’ denotes neutral mode, while ‘m’ refers to misleading mode.
- Test results for the three categories of textual misleading contexts. As the confidence level of the misleading context increases, its impact becomes more pronounced.
- Test results on the visual-context misleading experiments. "Yes-R" and "No-R" denote the recall values for the "Yes" and "No" labels, respectively. LVLMs are arranged in ascending order of PhD score.
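"Yes-R" and "No-R" are standard per-label recalls; a minimal sketch of how such a value can be computed from model predictions (a hypothetical helper, not the official scoring script, which additionally uses the PhD score defined in the paper):

```python
def label_recall(preds, golds, label):
    """Recall for one label: the fraction of gold-`label` samples predicted as `label`."""
    hits = sum(p == g == label for p, g in zip(preds, golds))
    total = sum(g == label for g in golds)
    return hits / total if total else 0.0
```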
If you find this work useful, consider giving this repository a star and citing our paper as follows:
@misc{liu2024phd,
title={PhD: A Prompted Visual Hallucination Evaluation Dataset},
author={Jiazhen Liu and Yuhan Fu and Ruobing Xie and Runquan Xie and Xingwu Sun and Fengzong Lian and Zhanhui Kang and Xirong Li},
year={2024},
eprint={2403.11116},
archivePrefix={arXiv},
primaryClass={cs.CV}
}