This repository is a curated collection of resources on Large Multimodal Models (LMMs).
Stars, suggestions, and contributions are all welcome.
Papers on surveys, understanding, and analysis of LMMs.
Papers on training LMMs from scratch.
Papers on improving LMMs through instruction tuning.
submit_date | model_name | paper | github_code |
---|---|---|---|
2023-11-21 | ShareGPT4V | ShareGPT4V: Improving Large Multi-Modal Models with Better Captions arXiv 2023 | |
2023-11-20 | LION | LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge arXiv 2023 | |
2023-11-15 | MMCA | MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning arXiv 2023 | |
2023-10-14 | MiniGPT-v2 | MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning arXiv 2023 | |
2023-10-13 | COMM | From CLIP to DINO: Visual Encoders Shout in Multi-Modal Large Language Models arXiv 2023 | |
2023-10-05 | LLaVA-1.5 | Improved Baselines with Visual Instruction Tuning arXiv 2023 | |
2023-09-29 | DeepSpeed-VisualChat | DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention arXiv 2023 | |
2023-04-20 | MiniGPT-4 | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models arXiv 2023 | |
2023-04-17 | LLaVA | Visual Instruction Tuning NeurIPS 2023 | |
Papers on LMM alignment through RLHF.
submit_date | model_name | paper | github_code |
---|---|---|---|
2023-09-25 | LLaVA-RLHF | Aligning Large Multimodal Models with Factually Augmented RLHF arXiv 2023 | |
Papers on expanding the capabilities of LMMs, such as segmentation, detection, and generation.
submit_date | model_name | paper | capability | github_code |
---|---|---|---|---|
2023-11-09 | LLaVA-Plus | LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents arXiv 2023 | tool use | |
2023-11-01 | LLaVA-Interactive | LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing arXiv 2023 | image chat, segmentation, generation, editing | |
Papers on correcting hallucinations of LMMs.
submit_date | model_name | paper | github_code |
---|---|---|---|
2023-10-24 | Woodpecker | Woodpecker: Hallucination Correction for Multimodal Large Language Models arXiv 2023 | |
Datasets for instruction tuning of LMMs.
submit_date | dataset | paper | number | keywords |
---|---|---|---|---|
2023-11-21 | ShareGPT4V | arXiv 2023 | 1.2M | highly descriptive captions, GPT-4V |
2023-11-15 | MMC-Instruction | arXiv 2023 | 600K | chart |
2023-04-20 | cc_sbu_align | arXiv 2023 | 5M | high-quality, well-aligned |
2023-04-17 | LLaVA-Instruct-150K | NeurIPS 2023 | 158K | conversation, description, reasoning |
Datasets for LMM alignment through RLHF.
submit_date | dataset | paper | number | keywords |
---|---|---|---|---|
2023-09-25 | LLaVA-SFT-122K, LLaVA-Human-Preference-10K | arXiv 2023 | 122K, 10K | high-quality |
Benchmarks for evaluating LMMs.
submit_date | dataset | paper | number | keywords |
---|---|---|---|---|
2023-11-23 | MLLM-Bench | arXiv 2023 | 419 | GPT-4V |
2023-11-15 | MMC-Benchmark | arXiv 2023 | 600K | chart |
2023-09-25 | MMHal-Bench | arXiv 2023 | 96 | hallucination |
2023-06-23 | MME | arXiv 2023 | - | perception and cognition |
2023-04-27 | OwlEval | arXiv 2023 | 82 | multi-turn, diverse capabilities |