Unit 4, Chapter 1 Fusion of Text and Vision: Draft Outline #54

Closed · snehilsanyal opened this issue Nov 1, 2023 · 8 comments
Labels: Chapter Content (Discuss and track the content of a chapter)

@snehilsanyal (Contributor)

Hey fellow CV Course Contributors and Reviewers 🤗

This issue discusses an initial draft outline for the chapter Fusion of Text and Vision, which is part of Unit 4: Multimodal Models. Since this is an introductory chapter, we will have less code-related content and put more emphasis on conceptual content that sets the stage for the later sections of the unit. We would like this chapter to be short and crisp, at most 3 sections, nothing more than that unless other additions such as Spaces/demos are required.

Thought process: the previous unit is Unit 3 on Vision Transformers, and the next is Unit 5 on Generative Models. The content of this unit will therefore build on Unit 3's transformer models (rather than traditional approaches to these tasks, so we will refrain from adding too many historical details) and will also serve as a precursor to the later sections as well as to Unit 5 on Generative Models.

1. Introduction

  • Why Multimodality?
  • Real-world data is multimodal (it is typically a combination of different modalities)
  • Short example of the human sensory feedback system (humans make decisions based on different sensory inputs and feedback)
  • Multimodal in what sense? Data? Models? Fusion technique? Are spectrograms an example of multimodal data representation? (Input is multimodal, output is multimodal, or input and output are of different modalities; this part lays the foundation for multimodal tasks and models.)
  • Why data is multimodal in many real-life scenarios and how multimodal real-life content is essential for search (examples from Google and Bing)
  • Some cool applications and examples of multimodality (Robotics: Vision-Language-Action models like RT-2, RT-X, PaLM-E)

2. Multimodal Tasks and Models

A brief overview of different tasks and models (more emphasis on the tasks that will be taken up later in the course, in sections like #29 and #28).

Briefly describe the tasks and models (task, input and output, models with links or Spaces). We can also include other tasks such as text-to-speech and speech-to-text, with a one-liner pointing to the HF Audio Course ("For more information on this, refer to the HF Audio Course"). After this, focus on Vision + Text/Audio.

Tasks and Models (each task, its input/output, and around 3-4 model names to go with it); a minimal usage sketch follows the list:

  • Document Visual Question Answering (text + vision). Models: LayoutLM, Nougat, Donut
  • Image-to-Text, Visual Question Answering. Models: DePlot, Pix2Struct, ViLT, TrOCR, BLIP
  • Text-to-Image (synthesis and generation). Models: Stable Diffusion (SD), IF, etc.
  • Image and Video Captioning
  • Text-to-Video. Models: CLIP-VTT, etc.
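
To make the task/model pairing concrete, here is a minimal usage sketch with the transformers `pipeline` API. The checkpoint names and image paths below are illustrative placeholders, not final chapter code; we can swap in whichever models the sections settle on.

```python
from transformers import pipeline

# Image-to-Text / captioning (e.g. BLIP); "photo.jpg" is a placeholder local image
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
print(captioner("photo.jpg"))

# Visual Question Answering (e.g. ViLT)
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
print(vqa(image="photo.jpg", question="How many animals are in the picture?"))

# Document Visual Question Answering (e.g. LayoutLM);
# this checkpoint needs an OCR backend such as pytesseract installed
doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
print(doc_qa(image="invoice.png", question="What is the invoice total?"))
```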

We can also create an infographic (a chart or hierarchy) that divides the models into different categories such as text + vision, text + vision + audio, more than three modalities, etc.

Cover the tasks related to vision + X (audio, text) here, then focus on Vision Language Models (text + vision) in the next section.

3. Vision Language Models

  • Introduction to Vision Language Models (brief overview, mechanism)
  • Cool applications and examples (multimodal chatbots like GILL, LLaVA, Video-ChatGPT, and applications being developed in Multimodal Models - CLIP and relatives #29)
  • Emphasize tasks that involve CLIP and relatives (Multimodal Models - CLIP and relatives #29); a minimal zero-shot CLIP sketch follows this list
  • A brief ending of the introduction section that sets the stage for the next sections, such as CLIP and relatives and fine-tuning
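
Since this section introduces the CLIP mechanism that #29 covers in depth, here is a minimal, hedged sketch of CLIP-style zero-shot image classification with the transformers library. The checkpoint, image path, and labels are illustrative; the actual snippet for the chapter can be coordinated with the #29 authors.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# CLIP embeds the image and the candidate text prompts into a shared space
# and scores them by similarity; softmax turns the scores into probabilities.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```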

References:

  1. Awesome Self-Supervised Multimodal Learning
  2. HF Tasks
  3. Multi Modal Machine Learning Course, CMU
  4. Meta's ImageBind
  5. Multimodal Machine Learning: A Survey and Taxonomy
  6. Recent blog by Chip Huyen

Please feel free to share your views on the outline 🤗 🚀 🔥

@merveenoyan (Collaborator) commented Nov 3, 2023

Hello @snehilsanyal 👋 Overall I think it's very cool. Please note that we also have this issue on Multimodal Models, so it would be nice if you could explain how this outline fits in with it :)
Note: apparently I missed the issue mentions above!

@johko added the Chapter Content (Discuss and track the content of a chapter) label on Nov 4, 2023
@johko (Owner) commented Nov 4, 2023

Hey @snehilsanyal ,

thanks for the detailed outline and all the thought you put into it.

I really like your intro, giving an intuition about what multimodal data is and why it is important 👍

Regarding the tasks I have a few additions you can consider:

  • Visual Grounding / Open-Vocabulary Object Detection (like OWL-ViT; see the sketch after this list)
  • Image-Text Retrieval
  • Referring Expression Comprehension (a rather special one)
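
For reference, a minimal sketch (assumed, not from this thread) of OWL-ViT-style open-vocabulary detection via the transformers zero-shot-object-detection pipeline; the checkpoint, image path, and candidate labels are placeholders.

```python
from transformers import pipeline

# Open-vocabulary detection: the labels are free-form text, not a fixed class list
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")
detections = detector(
    "street_scene.jpg",  # placeholder local image
    candidate_labels=["a traffic light", "a bicycle", "a stop sign"],
)
for det in detections:
    print(det["label"], round(det["score"], 3), det["box"])
```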

As the field is moving at a high pace, there are always new tasks and new names for them, so feel free to include whatever you see fit.

For the models part it would be great to focus on some models that are included in the transformers library, but I also totally understand that you don't want to skip things like LLaVA and GPT-4V. Again, do whatever makes the most sense to you and what people would like to read/learn about 🙂

@johko (Owner) commented Nov 4, 2023

And one paper that I can recommend for a very detailed overview (~100 pages) is this one: https://arxiv.org/pdf/2210.09263.pdf

@ATaylorAerospace (Collaborator) commented Nov 5, 2023

@snehilsanyal One addition to this chapter that might be very useful is text-and-vision use cases. Examples could be:

  • Real Estate Analysis: Analyzing property images and descriptions for categorization
  • Ecommerce Product Recommendation: Recommending products based on image and text reviews
  • Healthcare Diagnosis: Interpreting medical images and patient history for diagnosis
  • Social Media Monitoring: Analyzing social media posts and images for sentiment analysis

@snehilsanyal (Contributor, Author)

Hey @merveenoyan, thanks for your comments 🤗
Yes, sure, and thanks for pointing it out. We have described in the outline how this chapter relates to #29, and we will create the content in line with whatever is done in that issue so that everything has a good flow and stays in sync.

@snehilsanyal (Contributor, Author) commented Nov 6, 2023

Hey @johko, thank you for your comments 🤗, really glad that you liked our outline.
I followed the #29 issue very closely and also read your comments there; many of those comments were summarized and incorporated into this outline so that everything is in sync.
We will look into the recent tasks you suggested and include them in the content 🤗

Regarding models, we plan to mention all types of models since that is educational, but we will stick to those that have ready implementations in the transformers library, for example already available (or developed by us) Spaces, demos, or examples. So yes, it will be a mix where people can read and learn about multimodality in general, but since the course is about CV and by HF, we will include models that are already present in the HF ecosystem.

And thanks for the suggested paper :D We will go through it, check what is interesting, and add it to the content; lol, we might need to divide the pages amongst the group 😄

@charchit7 (Collaborator) commented Nov 6, 2023

Thank you for your comments 🤗 @johko, @merveenoyan :). We'll update accordingly.

@ratan (Collaborator) commented Dec 13, 2023

Very nice, detailed outline and flow captured here.
We may also include speech-to-text scenarios, like the Whisper models.
