NOTE: A curated list of awesome image-captioning studies, aimed at annotating and reporting CT / MRI scans
Since the introduction of the self-attention mechanism, the Transformer architecture has become a game-changing approach in machine translation. These advances quickly migrated to other fields of natural language processing (NLP), giving rise to the encoder-based and generative language models we know today. This repository is a collection of systems and findings that may help you advance in developing Multimodal Large Language Models (MLLMs) that support the following modalities:
- 🖼️ Image (photos, scans, even footage / video frames gathered into a single image)
- 📝 Text (caption / report / question)
- Ferret-V2 (11 April, 2024) [report]
- OmniFusion (09 April, 2024) [report] [code]
- MM1 (22 March, 2024) [report]
- CLIP
- DINO-v2
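Vision-language encoders such as CLIP (listed above) underpin many of these MLLMs by aligning image and text embeddings in a shared space and scoring them by contrastive similarity. Below is a minimal sketch of that scoring step only, using hypothetical toy embeddings and plain NumPy; it is not an implementation of any specific model above.

```python
import numpy as np

def clip_style_similarity(image_emb, text_embs, temperature=0.07):
    """CLIP-style scoring sketch: L2-normalize both sides, take dot
    products, then a temperature-scaled softmax over candidate captions."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature          # cosine similarities, scaled
    exp = np.exp(logits - logits.max())       # stable softmax
    return exp / exp.sum()

# Hypothetical 4-dim embeddings, for illustration only.
image = np.array([1.0, 0.1, 0.0, 0.0])
captions = np.array([
    [0.9, 0.2, 0.0, 0.0],   # caption close to the image embedding
    [0.0, 1.0, 0.0, 0.0],   # unrelated caption
])
probs = clip_style_similarity(image, captions)
```

In a real pipeline the embeddings would come from pretrained image and text encoders; the softmax output can then rank candidate reports or captions for a given scan.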