This repository is a curated collection of resources on Large Multimodal Models (LMMs).
Stars, suggestions, and contributions are all welcome.
Papers on surveys, understanding, and analysis of LMMs.
Papers on training LMMs from scratch.
Papers on improving LMMs through instruction tuning.
submit_date | model_name | paper | github_code |
---|---|---|---|
2023-11-21 | ShareGPT4V | ShareGPT4V: Improving Large Multi-Modal Models with Better Captions arXiv 2023 | |
2023-11-20 | LION | LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge arXiv 2023 | |
2023-11-15 | MMCA | MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning arXiv 2023 | |
2023-10-14 | MiniGPT-v2 | MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning arXiv 2023 | |
2023-10-13 | COMM | From CLIP to DINO: Visual Encoders Shout in Multi-Modal Large Language Models arXiv 2023 | |
2023-10-05 | LLaVA-1.5 | Improved Baselines with Visual Instruction Tuning arXiv 2023 | |
2023-09-29 | DeepSpeed-VisualChat | DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention arXiv 2023 | |
2023-04-20 | MiniGPT-4 | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models arXiv 2023 | |
2023-04-17 | LLaVA | Visual Instruction Tuning NeurIPS 2023 | |
Papers on LMM alignment through RLHF.
submit_date | model_name | paper | github_code |
---|---|---|---|
2023-09-25 | LLaVA-RLHF | Aligning Large Multimodal Models with Factually Augmented RLHF arXiv 2023 | |
Papers on expanding the capabilities of LMMs, such as segmentation, detection, and generation.
submit_date | model_name | paper | capability | github_code |
---|---|---|---|---|
2023-11-09 | LLaVA-Plus | LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents arXiv 2023 | tool use | |
2023-11-01 | LLaVA-Interactive | LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing arXiv 2023 | image chat, segmentation, generation, editing | |
Papers on correcting hallucinations of LMMs.
submit_date | model_name | paper | github_code |
---|---|---|---|
2023-10-24 | Woodpecker | Woodpecker: Hallucination Correction for Multimodal Large Language Models arXiv 2023 | |
Datasets for instruction tuning of LMMs.
submit_date | dataset | paper | number | keywords |
---|---|---|---|---|
2023-11-21 | ShareGPT4V | arXiv 2023 | 1.2M | highly descriptive captions, GPT-4V |
2023-11-15 | MMC-Instruction | arXiv 2023 | 600K | chart |
2023-04-20 | cc_sbu_align | arXiv 2023 | 5M | high-quality, well-aligned |
2023-04-17 | LLaVA-Instruct-150K | NeurIPS 2023 | 158K | conversation, description, reasoning |
Datasets for LMM alignment through RLHF.
submit_date | dataset | paper | number | keywords |
---|---|---|---|---|
2023-09-25 | LLaVA-SFT-122K, LLaVA-Human-Preference-10K | arXiv 2023 | 122K, 10K | high-quality |
Benchmarks for evaluating LMMs.
submit_date | dataset | paper | number | keywords |
---|---|---|---|---|
2023-11-23 | MLLM-Bench | arXiv 2023 | 419 | GPT-4V |
2023-11-15 | MMC-Benchmark | arXiv 2023 | 600K | chart |
2023-09-25 | MMHal-Bench | arXiv 2023 | 96 | hallucination |
2023-06-23 | MME | arXiv 2023 | - | perception and cognition |
2023-04-27 | OwlEval | arXiv 2023 | 82 | multi-turn, diverse capabilities |