Curates up-to-date information on the vision-language model (OCR-focused LLM) ecosystem to simplify capability comparison, track open- and closed-source roadmaps, and accelerate experimentation.
- Curates state-of-the-art OCR-centric multimodal LLMs, from academic labs to commercial providers.
- Provides one-click access to official weights, evaluation reports, and papers for each model.
- Documents the in-house benchmark snapshot so results are reproducible and comparable.
- Ideal for research, engineering, and product teams that need to locate OCR LLM solutions quickly.
- `README.md` – the living catalog you are reading; updated as new OCR LLMs are released.
- `img/benchmark.png` – benchmark snapshot referenced below (replace with new figures when re-running evaluations).
| Models | Project / Weights | Paper / Notes | Affiliation |
|---|---|---|---|
| Qwen3-VL | https://github.com/QwenLM/Qwen3-VL | - (technical report coming soon) | Tongyi Team / Alibaba |
| Gemini 2.5 Pro | https://gemini.google.com/app | - (service docs) | Google DeepMind |
| GPT-4o | https://chatgpt.com | - (OpenAI system card) | OpenAI |
Descriptions: All listed models focus on end-to-end OCR, document understanding, or multimodal visual reasoning. Some (e.g., GPT-4o, Gemini 2.5 Pro) are closed-source APIs but remain key baselines.
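For the closed-source baselines, evaluation typically means calling a hosted API with an image and a transcription prompt. As a minimal sketch, the snippet below builds a request body in OpenAI's public chat-completions schema (used by GPT-4o); the prompt wording is illustrative, no network call is made, and other vendors (e.g., Gemini) use a different API shape.

```python
import base64
import json


def build_ocr_request(image_bytes: bytes, model: str = "gpt-4o") -> dict:
    """Build a chat-completions request body asking a vision model to do OCR.

    Follows OpenAI's public chat-completions schema for image inputs
    (base64-encoded data URL). The transcription prompt is an example;
    tune it to your benchmark's output format.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Transcribe all text in this image, preserving layout.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
    }


# Example: inspect the payload built for a (dummy) image.
payload = build_ocr_request(b"\x89PNG...")
print(json.dumps(payload, indent=2)[:80])
```

Sending this body to the vendor's chat-completions endpoint (with an API key) returns the transcription in the assistant message, which can then be scored against the benchmark ground truth.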
This catalog aggregates public information from the respective research groups and vendors. Please cite the original papers or product pages when using the models in academic or commercial projects.
