Unit 4, Chapter 1 Fusion of Text and Vision: Draft Outline #54

Closed · snehilsanyal opened this issue Nov 1, 2023 · 8 comments
Labels: Chapter Content (Discuss and track the content of a chapter)

@snehilsanyal (Contributor)

Hey fellow CV Course Contributors and Reviewers 🤗

This issue discusses an initial draft outline for the chapter Fusion of Text and Vision, which is part of Unit 4: Multimodal Models. Since this is an introductory chapter, we will have less code-related content and put more emphasis on conceptual content that sets the stage for the later sections of the unit. We would like this chapter to be short and crisp, at most 3 sections, nothing more than that unless other additions such as Spaces/demos are required.

Thought process: the previous unit is Unit 3 on Vision Transformers, and the next is Unit 5 on Generative Models. The content of this unit will therefore build on Unit 3's transformer models (rather than traditional approaches to these tasks, so we will refrain from adding too many historical details) and will also serve as a precursor to the later sections as well as to Unit 5 on Generative Models.

1. Introduction

  • Why Multimodality?
  • Real-world data is multimodal (it is typically a combination of different modalities)
  • Short example of the human sensory feedback system (humans make decisions based on different sensory inputs and feedback)
  • Multimodal in what sense? Data? Models? Fusion technique? Are spectrograms an example of multimodal data representation? (Input is multimodal, output is multimodal, or input and output are of different modalities; this part lays the foundation for multimodal tasks and models.)
  • Why data is multimodal in many real-life scenarios and how multimodal real-life content is essential for search (examples from Google and Bing)
  • Some cool applications and examples of multimodality (Robotics: Vision-Language-Action models like RT-2, RT-X, PaLM-E)

2. Multimodal Tasks and Models

A brief overview of different tasks and models (more emphasis on the tasks that will be taken up later in the course, in sections like #29 and #28).

Briefly describe the tasks and models (task, input and output, models with links or Spaces). We can also include other tasks such as text-to-speech and speech-to-text, with a one-liner pointing to the HF Audio Course ("For more information on this, refer to the HF Audio Course"). After this, focus on Vision + Text/Audio.

Tasks and Models (each task, its input/output, and around 3-4 model names to go with it); a minimal usage sketch follows the list:

  • Document Visual Question Answering (text + vision). Models: LayoutLM, Nougat, Donut
  • Image-to-Text, Visual Question Answering. Models: DePlot, Pix2Struct, ViLT, TrOCR, BLIP
  • Text-to-Image (synthesis and generation). Models: Stable Diffusion (SD), IF, etc.
  • Image and Video Captioning
  • Text-to-Video. Models: CLIP-VTT, etc.
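
To make the task/model pairing concrete, here is a minimal usage sketch with the transformers `pipeline` API. The checkpoint names and image paths below are illustrative placeholders, not final chapter code; we can swap in whichever models the sections settle on.

```python
from transformers import pipeline

# Image-to-Text / captioning (e.g. BLIP); "photo.jpg" is a placeholder local image
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
print(captioner("photo.jpg"))

# Visual Question Answering (e.g. ViLT)
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
print(vqa(image="photo.jpg", question="How many animals are in the picture?"))

# Document Visual Question Answering (e.g. LayoutLM);
# this checkpoint needs an OCR backend such as pytesseract installed
doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
print(doc_qa(image="invoice.png", question="What is the invoice total?"))
```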

We can also create an infographic (a chart or hierarchy) that divides the models into different categories such as text + vision, text + vision + audio, more than three modalities, etc.

Cover the tasks related to vision + X (audio, text) here, then focus on Vision Language Models (text + vision) in the next section.

3. Vision Language Models

  • Introduction to Vision Language Models (brief overview, mechanism)
  • Cool applications and examples (multimodal chatbots like GILL, LLaVA, Video-ChatGPT, and applications being developed in Multimodal Models - CLIP and relatives #29)
  • Emphasize tasks that involve CLIP and relatives (Multimodal Models - CLIP and relatives #29); a minimal zero-shot CLIP sketch follows this list
  • A brief ending of the introduction section that sets the stage for the next sections, such as CLIP and relatives and fine-tuning
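
Since this section introduces the CLIP mechanism that #29 covers in depth, here is a minimal, hedged sketch of CLIP-style zero-shot image classification with the transformers library. The checkpoint, image path, and labels are illustrative; the actual snippet for the chapter can be coordinated with the #29 authors.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# CLIP embeds the image and the candidate text prompts into a shared space
# and scores them by similarity; softmax turns the scores into probabilities.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```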

References:

  1. Awesome Self-Supervised Multimodal Learning
  2. HF Tasks
  3. Multi Modal Machine Learning Course, CMU
  4. Meta's ImageBind
  5. Multimodal Machine Learning: A Survey and Taxonomy
  6. Recent blog by Chip Huyen

Please feel free to share your views on the outline 🤗 🚀 🔥

@merveenoyan (Collaborator) commented Nov 3, 2023

Hello @snehilsanyal 👋 Overall I think it's very cool. Please note that we also have this issue on Multimodal Models, so it would be nice if you could explain how this outline fits in with it :)
Note: apparently I missed the issue mentions above!

@johko added the Chapter Content (Discuss and track the content of a chapter) label on Nov 4, 2023
@johko (Owner) commented Nov 4, 2023

Hey @snehilsanyal ,

thanks for the detailed outline and all the thought you put into it.

I really like your intro, giving an intuition about what multimodal data is and why it is important 👍

Regarding the tasks I have a few additions you can consider:

  • Visual Grounding / Open-Vocabulary Object Detection (like OWL-ViT; see the sketch after this list)
  • Image-Text Retrieval
  • Referring Expression Comprehension (a rather special one)
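
For reference, a minimal sketch (assumed, not from this thread) of OWL-ViT-style open-vocabulary detection via the transformers zero-shot-object-detection pipeline; the checkpoint, image path, and candidate labels are placeholders.

```python
from transformers import pipeline

# Open-vocabulary detection: the labels are free-form text, not a fixed class list
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")
detections = detector(
    "street_scene.jpg",  # placeholder local image
    candidate_labels=["a traffic light", "a bicycle", "a stop sign"],
)
for det in detections:
    print(det["label"], round(det["score"], 3), det["box"])
```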

As the field is moving at a high pace, there are always new tasks and new names for them, so feel free to include whatever you see fit.

For the models part it would be great to focus on some models that are included in the transformers library, but I also totally understand that you don't want to skip things like LLaVA and GPT-4V. Again, do whatever makes the most sense to you and what people would like to read/learn about 🙂

@johko (Owner) commented Nov 4, 2023

And one paper that I can recommend for a very detailed overview (~100 pages) is this one: https://arxiv.org/pdf/2210.09263.pdf

@ATaylorAerospace (Collaborator) commented Nov 5, 2023

@snehilsanyal One addition to this chapter that might be very useful is text-and-vision use cases. Examples could be:

  • Real Estate Analysis: Analyzing property images and descriptions for categorization
  • Ecommerce Product Recommendation: Recommending products based on image and text reviews
  • Healthcare Diagnosis: Interpreting medical images and patient history for diagnosis
  • Social Media Monitoring: Analyzing social media posts and images for sentiment analysis

@snehilsanyal (Contributor, Author)

Hey @merveenoyan, thanks for your comments 🤗
Yes, sure, and thanks for pointing it out. We have described in the outline how this chapter relates to #29, and we will create the content in line with whatever is done in that issue so that everything has a good flow and stays in sync.

@snehilsanyal (Contributor, Author) commented Nov 6, 2023

Hey @johko, thank you for your comments 🤗, really glad that you liked our outline.
I followed the #29 issue very closely and also read your comments there; many of those comments were summarized and incorporated into this outline so that everything is in sync.
We will look into the recent tasks you suggested and include them in the content 🤗

Regarding models, we plan to mention all types of models since that is educational, but we will stick to those that have ready implementations in the transformers library, for example already available (or developed by us) Spaces, demos, or examples. So yes, it will be a mix where people can read and learn about multimodality in general, but since the course is about CV and by HF, we will include models that are already present in the HF ecosystem.

And thanks for the suggested paper :D We will go through it, check what is interesting, and add it to the content; lol, we might need to divide the pages amongst the group 😄

@charchit7 (Collaborator) commented Nov 6, 2023

Thank you for your comments 🤗 @johko, @merveenoyan :). We'll update accordingly.

@ratan (Collaborator) commented Dec 13, 2023

Very nice, detailed outline and flow captured here.
We may also include speech-to-text scenarios, like the Whisper models.
