Unit 4: Fusion Text and Vision - Tasks and Models for Image and Text. #151
Conversation
Added VLM Introduction, pushing to main for reviews!
@@ -0,0 +1,36 @@
# MultiModal Tasks and Models Part - 1

In this section, we will briefly look at the different multimodal tasks involving the image and text modalities, and their corresponding models. Before diving in, let's have a small recap of what is meant by "multimodal", which was covered in previous sections. The human world is a symphony of diverse sensory inputs. We perceive and understand through sight, sound, touch, and more. This multimodality is what separates our rich understanding from the limitations of traditional, unimodal AI models. Multimodal models, drawing inspiration from human cognition, aim to bridge this gap by integrating information from multiple sources, like text, images, audio, and even sensor data. This fusion of modalities leads to a more comprehensive and nuanced understanding of the world, unlocking a vast range of tasks and applications.
Really nice intro 😄
- **Text-to-Image generation:** Imagine a magical paintbrush that interprets your words and brings them to life! Text-to-image generation is like that: it transforms your written descriptions into unique images. It's a blend of language understanding and image creation, where your text unlocks a visual world, from photorealistic landscapes to dreamlike abstractions, all born from the power of your words.
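As a hedged sketch of how text-to-image generation is typically invoked in practice (assuming the Hugging Face `diffusers` library and the `runwayml/stable-diffusion-v1-5` checkpoint, both illustrative choices, not something this section prescribes):

```python
# Sketch: text-to-image generation with a diffusion pipeline.
# Assumes `diffusers` and `torch` are installed; the model id below is an
# illustrative assumption, not part of this section.

def build_prompt(subject: str, style: str) -> str:
    """Compose a simple text prompt; the wording drives the generated image."""
    return f"{subject}, {style}, highly detailed"

if __name__ == "__main__":
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")  # running on CPU is possible but very slow

    prompt = build_prompt("a photorealistic mountain lake", "golden hour")
    image = pipe(prompt).images[0]  # a PIL.Image generated from the text
    image.save("lake.png")
```

Prompt wording matters a great deal here, which is exactly the "your text unlocks a visual world" point above: the same pipeline produces very different images as the description changes.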
## Visual Question Answering (VQA) and Visual Reasoning
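A minimal sketch of what a VQA call can look like, assuming the `transformers` pipeline API and the `dandelin/vilt-b32-finetuned-vqa` checkpoint (an illustrative choice; the image path is hypothetical):

```python
# Sketch: visual question answering with a transformers pipeline.
# Assumes `transformers` and `Pillow` are installed; model id and image
# path are assumptions for illustration only.

def format_answer(question: str, answer: str, score: float) -> str:
    """Render one VQA prediction for display."""
    return f"Q: {question} -> A: {answer} ({score:.2f})"

if __name__ == "__main__":
    from transformers import pipeline

    vqa = pipeline(
        "visual-question-answering",
        model="dandelin/vilt-b32-finetuned-vqa",
    )
    question = "How many cats are in the picture?"
    # Returns a list of {"answer": ..., "score": ...} dicts, best first.
    predictions = vqa(image="photo.jpg", question=question)
    best = predictions[0]
    print(format_answer(question, best["answer"], best["score"]))
```

The model must jointly ground the question in the image and reason over what it sees, which is what distinguishes VQA from plain image classification.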
Excited to read the upcoming sections 😄
nicely connected to real world. :)
## Examples of Tasks
From my previous experience, try to avoid heavy words like "delve" and keep the words as simple as possible for easy reading.
@SuryaKrishna02 amazing read, nice introduction and well written 🤗
Looks good.
Are you going to add models? If so, you can make this a draft PR. I left some wording suggestions only.
@merveenoyan Thanks for your comments. I have made those changes and completed the rest of the section. Looking forward to your review.
LGTM
Thank you!
Well done 👏
just a really small grammar thing from my side
Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com>
@SuryaKrishna02 can you fix merge conflicts and we can merge?
@merveenoyan Fixed the merge conflicts.
Hey everyone!
This PR adds Part 1 of the second section on Fusion of Text and Vision for Unit 4: Multimodal Models, introducing the multimodal tasks and models involving image and text.
Related to Issue: #54
Best,
Fusion of Text and Vision Team.