Unit 4: Fusion Text and Vision - Tasks and Models for Image and Text. #151
Conversation
Added VLM Introduction, pushing to main for reviews!
@@ -0,0 +1,36 @@
# MultiModal Tasks and Models Part - 1

In this section, we will briefly look at the different multimodal tasks involving the image and text modalities, and their corresponding models. Before diving in, let's have a small recap of what is meant by "multimodal", which was covered in previous sections. The human world is a symphony of diverse sensory inputs. We perceive and understand through sight, sound, touch, and more. This multimodality is what separates our rich understanding from the limitations of traditional, unimodal AI models. Multimodal models, drawing inspiration from human cognition, aim to bridge this gap by integrating information from multiple sources, like text, images, audio, and even sensor data. This fusion of modalities leads to a more comprehensive and nuanced understanding of the world, unlocking a vast range of tasks and applications.
Really nice intro 😄
- **Text-to-Image generation:** Imagine a magical paintbrush that interprets your words and brings them to life! Text-to-image generation is like that: it transforms your written descriptions into unique images. It's a blend of language understanding and image creation, where your text unlocks a visual world, from photorealistic landscapes to dreamlike abstractions, all born from the power of your words.
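As a hedged sketch of how text-to-image generation is typically invoked in practice (assuming the Hugging Face `diffusers` library and the `runwayml/stable-diffusion-v1-5` checkpoint, both illustrative choices, not something this section prescribes):

```python
# Sketch: text-to-image generation with a diffusion pipeline.
# Assumes `diffusers` and `torch` are installed; the model id below is an
# illustrative assumption, not part of this section.

def build_prompt(subject: str, style: str) -> str:
    """Compose a simple text prompt; the wording drives the generated image."""
    return f"{subject}, {style}, highly detailed"

if __name__ == "__main__":
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")  # running on CPU is possible but very slow

    prompt = build_prompt("a photorealistic mountain lake", "golden hour")
    image = pipe(prompt).images[0]  # a PIL.Image generated from the text
    image.save("lake.png")
```

Prompt wording matters a great deal here, which is exactly the "your text unlocks a visual world" point above: the same pipeline produces very different images as the description changes.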
## Visual Question Answering (VQA) and Visual Reasoning
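A minimal sketch of what a VQA call can look like, assuming the `transformers` pipeline API and the `dandelin/vilt-b32-finetuned-vqa` checkpoint (an illustrative choice; the image path is hypothetical):

```python
# Sketch: visual question answering with a transformers pipeline.
# Assumes `transformers` and `Pillow` are installed; model id and image
# path are assumptions for illustration only.

def format_answer(question: str, answer: str, score: float) -> str:
    """Render one VQA prediction for display."""
    return f"Q: {question} -> A: {answer} ({score:.2f})"

if __name__ == "__main__":
    from transformers import pipeline

    vqa = pipeline(
        "visual-question-answering",
        model="dandelin/vilt-b32-finetuned-vqa",
    )
    question = "How many cats are in the picture?"
    # Returns a list of {"answer": ..., "score": ...} dicts, best first.
    predictions = vqa(image="photo.jpg", question=question)
    best = predictions[0]
    print(format_answer(question, best["answer"], best["score"]))
```

The model must jointly ground the question in the image and reason over what it sees, which is what distinguishes VQA from plain image classification.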
Excited to read the upcoming sections 😄
nicely connected to real world. :)
## Examples of Tasks
From my previous experience, try to avoid heavy words like "delve" and keep the words as simple as possible for easy reading.
@SuryaKrishna02 amazing read, nice introduction and well written 🤗
Looks good.
Are you going to add models? If so, you can make this a draft PR. I left some wording suggestions only.
@merveenoyan Thanks for your comments. I have made those changes and completed the rest of the section. Looking forward to your review.
LGTM
Thank you!
Well done 👏
just a really small grammar thing from my side
Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com>
@SuryaKrishna02 can you fix merge conflicts and we can merge?
@merveenoyan Fixed the merge conflicts.
Hey everyone!
This PR adds Part 1 of the second section on Fusion of Text and Vision for Unit 4: Multimodal Models, introducing the multimodal tasks and models involving image and text.
Related to Issue: #54
Best,
Fusion of Text and Vision Team.