Dependencies
pip install pycocoevalcap tqdm spacy shortuuid openpyxl -i https://pypi.tuna.tsinghua.edu.cn/simple
Prior to conducting evaluations, obtain the Vicuna-7B model and the InfMLLM-7B model from Hugging Face and place them in the pretrained_models directory.
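If the weights are not already on disk, one way to fetch them is with the huggingface_hub CLI; the Vicuna repo id below is the public lmsys one, while <infmllm-repo-id> is a placeholder for whichever repository hosts the InfMLLM-7B weights:

pip install -U "huggingface_hub[cli]"   # provides the huggingface-cli command
huggingface-cli download lmsys/vicuna-7b-v1.5 --local-dir pretrained_models/lmsys/vicuna-7b-v1.5
huggingface-cli download <infmllm-repo-id> --local-dir pretrained_models/infmllm/InfMLLM-7B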
For detailed guidance on preparing the evaluation datasets (okvqa, vqav2, and others), refer to the Qwen-VL repository.
Once prepared, the directory should have the following structure.
|-- rootdir
    |-- pretrained_models
        |-- lmsys/vicuna-7b-v1.5/
        |-- infmllm/InfMLLM-7B
    |-- datasets
        |-- okvqa
        |-- vqav2
        |-- TextVQA
        |-- gqa
        |-- ocr-vqa
        |-- refcoco
        |-- refcoco+
        |-- refcocog
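If you are setting up from scratch, the dataset skeleton can be created up front; the directory names below are exactly those in the layout above, and each dataset is then populated following the Qwen-VL instructions:

mkdir -p datasets/{okvqa,vqav2,TextVQA,gqa,ocr-vqa,refcoco,refcoco+,refcocog}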
To evaluate the VQA benchmarks, run the script evaluate/infmllm/evaluate_vqa.sh (an example invocation is sketched after the result list below). With InfMLLM-7B, the expected performance is:
okvqa: 61.23
textvqa: 67.90
gqa: 63.06
ocr-vqa: 73.51
vqav2-testdev: 81.96
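A minimal invocation sketch, assuming the script resolves the model and dataset paths from the layout above (check the script itself for any dataset-selection arguments or environment variables it expects):

bash evaluate/infmllm/evaluate_vqa.sh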
The vqav2-testdev predictions must be submitted to eval.ai and scored through their online platform.
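If you prefer the command line over the web upload, the evalai CLI can submit a result file; the token, challenge id, phase id, and result file name below are placeholders to be taken from your eval.ai account and the VQA Challenge page:

pip install evalai                       # official eval.ai CLI
evalai set_token <your_evalai_token>     # token from your eval.ai profile
evalai challenge <challenge_id> phase <phase_id> submit --file vqav2_testdev_results.json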
To evaluate the visual grounding benchmarks, run the script evaluate/infmllm/evaluate_grounding.sh (see the sketch after the result list below). With InfMLLM-7B, the expected performance is:
refcoco_testA: 94.59
refcoco_testB: 89.24
refcoco+_testA: 92.33
refcoco+_testB: 81.61
refcocog_test: 89.78
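As with the VQA benchmarks, a minimal sketch of the invocation (again assuming paths are resolved inside the script):

bash evaluate/infmllm/evaluate_grounding.sh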
Prior to conducting evaluations with the chat model, obtain the InfMLLM-7B-Chat model from Hugging Face and place it in the pretrained_models directory.
For detailed guidance on preparing the evaluation datasets (MME, MMBench, and others), refer to the LLaVA repository.
Once prepared, the directory should have the following structure.
|-- rootdir
    |-- pretrained_models
        |-- infmllm/InfMLLM-7B-Chat
    |-- datasets
        |-- MME_Benchmark
        |-- mmbench
        |-- SEED-Bench
        |-- POPE
        |-- mm-vet
        |-- ScienceQA
        |-- TextVQA
        |-- gqa
        |-- VQAv2
All evaluation scripts are located in the evaluate/infmllm_chat/ directory. For example, run evaluate/infmllm_chat/seed.sh to evaluate on the SEED benchmark, as sketched below.
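A minimal sketch, assuming each benchmark has its own script in that directory and the SEED one runs without extra arguments (check the scripts for any paths you may need to adjust):

ls evaluate/infmllm_chat/             # list the available benchmark scripts
bash evaluate/infmllm_chat/seed.sh    # run the SEED-Bench evaluation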
With InfMLLM-7B-Chat, the expected performance is:
MME: 1498.87
MMBench:
MMBench-CN:
SEED: 61.70
POPE-f1: 86.56
MM-Vet: 32.9
ScienceQA-Image: 68.07
TextVQA: 63.91
GQA: 64.97
vqav2-testdev: 82.25