junkunyuan/Awesome-Large-Multimodal-Models

Awesome Large Multimodal Models

This repository is a collection of resources on Large Multimodal Models (LMMs).

Stars, suggestions, and contributions are all welcome.

Papers

Survey / Understanding / Analysis

Surveys and papers on understanding and analyzing LMMs.

submit_date research_perspective paper github_code
2023-11-23 evaluation of LMMs by GPT-4V MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V    arXiv 2023 Star
2023-11-14 review on instruction tuning Vision-Language Instruction Tuning: A Review and Analysis    arXiv 2023 Star
2023-10-25 evaluation of GPT-4V An Early Evaluation of GPT-4V(ision)    arXiv 2023 -
2023-10-25 OCR of GPT-4V Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation    arXiv 2023 Star
2023-10-17 visual grounding of GPT-4V Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V    arXiv 2023 Star
2023-10-13 visual encoder of LMMs From CLIP to DINO: Visual Encoders Shout in Multi-Modal Large Language Models    arXiv 2023 Star
2023-09-29 evaluation of GPT-4V The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)    arXiv 2023 -
2023-09-18 scaling of instruction tuning An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models    arXiv 2023 -
2023-06-23 survey on LMMs A Survey on Multimodal Large Language Models    arXiv 2023 Star

Foundation Models

Papers on training LMMs from scratch.

submit_date model_name paper github_code
2023-11-10 Florence-2 Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks    arXiv 2023 -
2023-11-07 mPLUG-Owl2 mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration    arXiv 2023 Star
2023-10-13 PaLI-3 PaLI-3 Vision Language Models: Smaller, Faster, Stronger    arXiv 2023 -
2023-09-25 GPT-4V GPT-4V(ision) system card    OpenAI 2023 -
2023-08-24 Qwen-VL Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond    arXiv 2023 Star
2023-08-02 OpenFlamingo OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models    arXiv 2023 Star
2023-05-29 PaLI-X PaLI-X: On Scaling up a Multilingual Vision and Language Model    arXiv 2023 -
2023-04-27 mPLUG-Owl mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality    arXiv 2023 Star
2023-03-15 GPT-4 GPT-4 Technical Report    arXiv 2023 -
2022-09-14 PaLI PaLI: A Jointly-Scaled Multilingual Language-Image Model    ICLR 2023 -
2022-04-29 Flamingo Flamingo: A Visual Language Model for Few-Shot Learning    NeurIPS 2022 -
2022-01-28 BLIP BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation    ICML 2022 Star
2021-11-22 Florence Florence: A New Foundation Model for Computer Vision    arXiv 2021 -

Instruction Tuning

Papers on improving LMMs through instruction tuning.

submit_date model_name paper github_code
2023-11-21 ShareGPT4V ShareGPT4V: Improving Large Multi-Modal Models with Better Captions    arXiv 2023 Star
2023-11-20 LION LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge    arXiv 2023 Star
2023-11-15 MMCA MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning    arXiv 2023 Star
2023-10-14 MiniGPT-v2 MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning    arXiv 2023 Star
2023-10-13 COMM From CLIP to DINO: Visual Encoders Shout in Multi-Modal Large Language Models    arXiv 2023 Star
2023-10-05 LLaVA-1.5 Improved Baselines with Visual Instruction Tuning    arXiv 2023 Star
2023-09-29 DeepSpeed-VisualChat DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention    arXiv 2023 Star
2023-04-20 MiniGPT-4 MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models    arXiv 2023 Star
2023-04-17 LLaVA Visual Instruction Tuning    NeurIPS 2023 Star

Reinforcement Learning from Human Feedback (RLHF)

Papers on LMM alignment through RLHF.

submit_date model_name paper github_code
2023-09-25 LLaVA-RLHF Aligning Large Multimodal Models with Factually Augmented RLHF    arXiv 2023 Star

Capability Expansion

Papers on expanding the capabilities of LMMs, such as segmentation, detection, and generation.

submit_date model_name paper capability github_code
2023-11-09 LLaVA-Plus LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents    arXiv 2023 Star
2023-11-01 LLaVA-Interactive LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing    arXiv 2023 Star

Hallucination Correction

Papers on correcting hallucinations in LMMs.

submit_date model_name paper github_code
2023-10-24 Woodpecker Woodpecker: Hallucination Correction for Multimodal Large Language Models    arXiv 2023 Star

Datasets

Datasets for Instruction Tuning

submit_date dataset paper number keywords
2023-11-21 ShareGPT4V arXiv 2023 1.2M highly descriptive captions, GPT-4V
2023-11-15 MMC-Instruction arXiv 2023 600K chart
2023-04-20 cc_sbu_align arXiv 2023 3.5K high-quality, well-aligned
2023-04-17 LLaVA-Instruct-150K NeurIPS 2023 158K conversation, description, reasoning

Datasets for Reinforcement Learning from Human Feedback (RLHF)

submit_date dataset paper number keywords
2023-09-25 LLaVA-SFT-122K / LLaVA-Human-Preference-10K arXiv 2023 122K / 10K high-quality

Datasets for Evaluation

submit_date dataset paper number keywords
2023-11-23 MLLM-Bench arXiv 2023 419 GPT-4V
2023-11-15 MMC-Benchmark arXiv 2023 600K chart
2023-09-25 MMHal-Bench arXiv 2023 96 hallucination
2023-06-23 MME arXiv 2023 - perception and cognition
2023-04-27 OwlEval arXiv 2023 82 multi-turn, diverse capabilities
