Unit 4: Fusion Text and Vision - Tasks and Models for Image and Text. #151

Merged (15 commits) on Dec 28, 2023

Conversation

@SuryaKrishna02 (Contributor) commented Dec 18, 2023

Hey everyone!

This PR adds Part 1 of the second section on Fusion of Text and Vision for Unit 4: Multimodal Models, introducing the multimodal tasks and models involving image and text.
Related to Issue: #54

Best,
Fusion of Text and Vision Team.

@@ -0,0 +1,36 @@
# MultiModal Tasks and Models Part - 1

In this section, we will briefly look at the different multimodal tasks involving the image and text modalities, and their corresponding models. Before diving in, let's have a small recap of what is meant by "multimodal", which was covered in previous sections. The human world is a symphony of diverse sensory inputs. We perceive and understand through sight, sound, touch, and more. This multimodality is what separates our rich understanding from the limitations of traditional, unimodal AI models. Multimodal models, drawing inspiration from human cognition, aim to bridge this gap by integrating information from multiple sources, like text, images, audio, and even sensor data. This fusion of modalities leads to a more comprehensive and nuanced understanding of the world, unlocking a vast range of tasks and applications.
@snehilsanyal (Contributor), Dec 19, 2023

Really nice intro 😄


- **Text-to-Image generation:** Imagine a magical paintbrush that interprets your words and brings them to life! Text-to-image generation is like that: it transforms your written descriptions into unique images. It's a blend of language understanding and image creation, where your text unlocks a visual world, from photorealistic landscapes to dreamlike abstractions, all born from the power of your words.

## Visual Question Answering (VQA) and Visual Reasoning
Contributor

Excited to read the upcoming sections 😄

Collaborator

nicely connected to real world. :)


## Examples of Tasks
Contributor

From my previous experience, try to avoid heavy words like delve and keep the words as simple as possible for easy reading.

@snehilsanyal (Contributor) commented Dec 19, 2023

@SuryaKrishna02 amazing read, nice introduction and well written 🤗
Waiting for the next sections.

@ratan (Collaborator) left a comment

Looks good.

@merveenoyan (Collaborator) left a comment

Are you going to add models? If so, you can make this a draft PR. I left some wording suggestions only.

@SuryaKrishna02 marked this pull request as draft on December 21, 2023, 04:25
@SuryaKrishna02 marked this pull request as ready for review on December 21, 2023, 20:50
@SuryaKrishna02 (Contributor, Author)

@merveenoyan Thanks for your comments. I have made those changes and completed the rest of the section. Looking forward to your review.

@ratan (Collaborator) left a comment

LGTM

@merveenoyan (Collaborator) left a comment

Thank you!

@johko (Owner) left a comment

Well done 👏
just a really small grammar thing from my side

Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com>
@merveenoyan (Collaborator)

@SuryaKrishna02 can you fix merge conflicts and we can merge?

@SuryaKrishna02 (Contributor, Author)

@merveenoyan Fixed the merge conflicts.

6 participants