Added Introduction to Visual Language Models(VLM) for Unit 4. Multimodal Models. #147

charchit7 · 2023-12-18T12:07:26Z

Hey everyone 🤗

This PR adds the Introduction to VLM on Fusion of Text and Vision for Unit 4: Multimodal Models.
Related to Issue: #54

Please have a look!
@MKhalusova @merveenoyan

charchit7 · 2023-12-18T12:09:36Z

++ I accidentally added the file changes from @snehilsanyal. I've removed those and kept just mine.

charchit7 · 2023-12-18T12:38:08Z

@snehilsanyal, @SuryaKrishna02

merveenoyan

I've given mostly format-related recommendations, thank you!

chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

merveenoyan · 2023-12-18T16:21:56Z

chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

+
+One more such dataset called **Winoground** was designed to figure out, okay, so how godd is CLIP actually? **Figure Above**
+It designed to let us think more like : the results of models looks really amazing and it's way better than previous models but does it understand compositional relationships
+in the same way humans would understand it rather than just generalizating to the data. For eg. earlier version of Stable Diffusion was


This sentence is a bit too long; can you shorten it?

chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

charchit7 · 2023-12-18T17:56:33Z

> I've given mostly format-related recommendations, thank you!

Thank you, @merveenoyan :) I'll address them. Regarding the content, do you think there's anything more I could add? I had a great time learning about VLMs. I've kept the content brief here.

chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

charchit7 · 2023-12-21T21:18:09Z

Hey @merveenoyan, thank you so much for the suggested fixes. I have addressed them all. Please let me know if anything else is required.

ratan

LGTM

merveenoyan

Thank you! Left some formatting nits.

chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

charchit7 · 2023-12-24T22:24:32Z

@merveenoyan Thanks, will address them. A lot of line-break issues are happening on my end.

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

edited the new-lines issues present in the content

charchit7 · 2023-12-28T06:49:25Z

Hey @merveenoyan, I updated the content and fixed the newline issues you pointed out. Please have a look. Thanks!

merveenoyan

Looks good to me!

merveenoyan · 2023-12-28T11:47:23Z

@charchit7 if you can resolve the merge conflicts, we can merge.

charchit7 · 2023-12-28T14:28:46Z

Hey @merveenoyan, I fixed the merge conflict. Please check.

johko

Hey, sorry for my late review on this, I somehow missed it.

Great content. I left some suggestions, which are mostly grammatical in nature.

johko · 2023-12-30T19:38:53Z

chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

+
+## Our World is Multimodal 
+Humans explore the world through diverse senses: sight, sound, touch, and scent. A complete grasp of our surroundings emerges by harmonizing insights from these varied modalities.
+We think of modality, initially introduced in mathematics as distinct peaks, in a poetic way as: "With each modality, a unique part to play, Together they form our understanding array. A symphony of senses, a harmonious blend, In perception's dance, our world transcends." In pursuit of making a AI capable to understand the world, the field of machine learning seeks to develop models capable of processing and integrating data across multiple modalities. However, several challenges, including representation and alignment, must be addressed.Representation explores techniques to effectively summarize multimodal data, capturing the intricate connections among individual modality elements. Alignment focuses on identifying connections and interactions across all elements.


Suggested change

We think of modality, initially introduced in mathematics as distinct peaks, in a poetic way as: "With each modality, a unique part to play, Together they form our understanding array. A symphony of senses, a harmonious blend, In perception's dance, our world transcends." In pursuit of making a AI capable to understand the world, the field of machine learning seeks to develop models capable of processing and integrating data across multiple modalities. However, several challenges, including representation and alignment, must be addressed.Representation explores techniques to effectively summarize multimodal data, capturing the intricate connections among individual modality elements. Alignment focuses on identifying connections and interactions across all elements.

We think of modality, initially introduced in mathematics as distinct peaks, in a poetic way as: "With each modality, a unique part to play, Together they form our understanding array. A symphony of senses, a harmonious blend, In perception's dance, our world transcends." In pursuit of making AI capable of understanding the world, the field of machine learning seeks to develop models capable of processing and integrating data across multiple modalities. However, several challenges, including representation and alignment, must be addressed. Representation explores techniques to effectively summarize multimodal data, capturing the intricate connections among individual modality elements. Alignment focuses on identifying connections and interactions across all elements.

johko · 2023-12-30T19:51:24Z

chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

+
+
+## Introduction 
+Processing images to generate text, such as image captioning and visual question-answering, has been studied for many years which includes autonomous driving, remote sensing, etc. We also have seen shift from tradional ML/DL to new learning paradigm called pre-training, fine-tuning and prediction which has shown great benefit due since in tradional way we may need to collect huge amount of data, etc. 


Suggested change

Processing images to generate text, such as image captioning and visual question-answering, has been studied for many years which includes autonomous driving, remote sensing, etc. We also have seen shift from tradional ML/DL to new learning paradigm called pre-training, fine-tuning and prediction which has shown great benefit due since in tradional way we may need to collect huge amount of data, etc.

Processing images to generate text, as in image captioning and visual question answering, has been studied for many years, with applications including autonomous driving and remote sensing. We have also seen a shift from traditional ML/DL training from scratch to a new learning paradigm of pre-training, fine-tuning, and prediction, which has shown great benefit because the traditional approach may require collecting a huge amount of data.

johko · 2023-12-30T19:53:22Z

chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

+## Mechanism
+To enable the functionality of Vision Language Models (VLMs), a meaningful combination of both text and images is essential for joint learning. How can we do that? One simple/common way is given image-text pairs:
+- Extract image and text features using text and image encoders. For images it can be **CNN** or **transformer** based architectures.
+- Learns the vision-language correlation with certain pre-training objectives.


Suggested change

- Learns the vision-language correlation with certain pre-training objectives.

- Learn the vision-language correlation with certain pre-training objectives.
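
For anyone following this thread, the two Mechanism steps quoted above (encode each modality, then learn the image-text correlation) can be illustrated with a rough sketch. It assumes a CLIP-style symmetric contrastive loss as the pre-training objective, which the section itself does not name. The function names, the toy feature vectors, and the `temperature` default are illustrative assumptions, not part of the PR.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target_index):
    """Negative log-softmax of the target entry, computed in a numerically stable way."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[target_index]


def contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of matched pairs.

    Assumption: image_features[i] and text_features[i] come from the same
    image-text pair, and both were produced by (hypothetical) encoders.
    """
    images = [l2_normalize(v) for v in image_features]
    texts = [l2_normalize(v) for v in text_features]
    n = len(images)
    # similarity[i][j] = cosine(image i, text j) / temperature
    similarity = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text direction: row i should put most of its mass on column i.
    loss_i2t = sum(cross_entropy(similarity[i], i) for i in range(n)) / n
    # Text-to-image direction: column j should put most of its mass on row j.
    loss_t2i = sum(
        cross_entropy([similarity[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_i2t + loss_t2i) / 2
```

With matched pairs the loss is close to zero, and it grows when the captions are shuffled relative to the images, which is the signal the encoders would be trained on.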

johko · 2023-12-30T19:56:11Z

chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

+VLM pre-training aims to pre-train a VLM to learn image-text correlation, targeting effective zero-shot predictions on visual recognition tasks which can be segmentation, classification, etc. 
+
+## Strategies
+We can categorize [group](https://lilianweng.github.io/posts/2022-06-09-vlm/#no-training) VLMs based on how we leverage the two modes of learning.


Not sure if I get the sentence right, but I think you can remove either "categorize" or "group".

Suggested change

We can categorize [group](https://lilianweng.github.io/posts/2022-06-09-vlm/#no-training) VLMs based on how we leverage the two modes of learning.

We can [group](https://lilianweng.github.io/posts/2022-06-09-vlm/#no-training) VLMs based on how we leverage the two modes of learning.

johko · 2023-12-30T19:57:39Z

chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

+    - In this method we fuse visual information into language models by treating images as normal text tokens and train the model on a sequence of joint representations of both text and images. Precisely, images are divided into multiple smaller patches and each patch is treated as one "token" in the input sequence. e.g. [VisualBERT](https://arxiv.org/abs/1908.03557), [SimVLM](https://arxiv.org/abs/2108.10904).
+
+- Learning good image embeddings that can work as a prefix for a frozen, pre-trained language model.
+    - In this method we don't change the language model parameters when adapting to handle visual signal. Instead we learn such an embedding space for images that it is compatible with the language model’s. e.g. [Frozen](https://arxiv.org/abs/2106.13884), [ClipCap](https://arxiv.org/abs/2111.09734).


Suggested change

- In this method we don't change the language model parameters when adapting to handle visual signal. Instead we learn such an embedding space for images that it is compatible with the language model’s. e.g. [Frozen](https://arxiv.org/abs/2106.13884), [ClipCap](https://arxiv.org/abs/2111.09734).

- In this method we don't change the language model parameters when adapting to handle visual signals. Instead we learn an embedding space for images, such that it is compatible with the language model’s. e.g. [Frozen](https://arxiv.org/abs/2106.13884), [ClipCap](https://arxiv.org/abs/2111.09734).
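
To make the two strategies in this hunk more concrete, here is a minimal sketch of the shared idea: cut an image into patches, project each patch to the language model's embedding width, and place the resulting visual "tokens" in front of the text token embeddings. All names (`patchify`, `project`, `build_input_sequence`) and the toy shapes are assumptions for illustration. In the frozen-language-model variant, only the projection `weights` would be trained; that training loop is not shown.

```python
def patchify(image, patch_size):
    """Split an H x W grid (a list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order; each is a flat list of
    patch_size * patch_size values. H and W must be multiples of patch_size.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


def project(vector, weights):
    """Linear projection; weights holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def build_input_sequence(image, text_token_embeddings, patch_size, weights):
    """Prepend projected patch 'tokens' to the text token embeddings."""
    visual_tokens = [project(p, weights) for p in patchify(image, patch_size)]
    return visual_tokens + text_token_embeddings
```

For a 4 x 4 image with 2 x 2 patches this yields four visual tokens followed by the text tokens, all with the same embedding width, so a single transformer can attend over the joint sequence.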

johko · 2023-12-30T20:10:17Z

chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

+
+
+## Downstream Tasks and Evaluation
+VLMs are getting good at many downstream tasks, including image classification, object detection, semantic segmentation, image-text retrieval, and action recognition while surpassing models trained traditionally. Generally the setup used for evaluation VLMs is zero-shot prediction and linear probing. Zero-shot prediction is the most common way to evaluate the VLMs, where we directly apply pre-trained VLMs to downstream tasks without any task-specific fine-tuning. In linear probing, we freeze the pre-trained VLM and train a linear classifier to classify the VLM-encoded embeddings to measure it's representation.How do we evaluate these models? We can check how they perform on these datasets given an image and a question, the task is to answer the question correctly! We can also check how these models reason answer questions about the visual data. For this, the most common dataset used is [CLEVR](https://cs.stanford.edu/people/jcjohns/clevr/). Standard datasets like MSCOCO might be straightforward for a model to learn due to their distribution, which may not adequately demonstrate a model's capacity to generalize across more challenging or diverse datasets. In response, datasets like [Hateful Memes](https://arxiv.org/abs/2005.04790) are created to address this problem by understanding the models capability to an extreme by adding difficult examples ("benign confounders") to the dataset to make it hard which showed that multimodal pre-training dosen't work and models had huge gap with human performance. 


I think some line breaks would make it easier to read

Suggested change

VLMs are getting good at many downstream tasks, including image classification, object detection, semantic segmentation, image-text retrieval, and action recognition while surpassing models trained traditionally. Generally the setup used for evaluation VLMs is zero-shot prediction and linear probing. Zero-shot prediction is the most common way to evaluate the VLMs, where we directly apply pre-trained VLMs to downstream tasks without any task-specific fine-tuning. In linear probing, we freeze the pre-trained VLM and train a linear classifier to classify the VLM-encoded embeddings to measure it's representation.How do we evaluate these models? We can check how they perform on these datasets given an image and a question, the task is to answer the question correctly! We can also check how these models reason answer questions about the visual data. For this, the most common dataset used is [CLEVR](https://cs.stanford.edu/people/jcjohns/clevr/). Standard datasets like MSCOCO might be straightforward for a model to learn due to their distribution, which may not adequately demonstrate a model's capacity to generalize across more challenging or diverse datasets. In response, datasets like [Hateful Memes](https://arxiv.org/abs/2005.04790) are created to address this problem by understanding the models capability to an extreme by adding difficult examples ("benign confounders") to the dataset to make it hard which showed that multimodal pre-training dosen't work and models had huge gap with human performance.

VLMs are getting good at many downstream tasks, including image classification, object detection, semantic segmentation, image-text retrieval, and action recognition, often surpassing traditionally trained models.

Generally, the setups used for evaluating VLMs are zero-shot prediction and linear probing. Zero-shot prediction is the most common way to evaluate VLMs: we directly apply pre-trained VLMs to downstream tasks without any task-specific fine-tuning.

In linear probing, we freeze the pre-trained VLM and train a linear classifier on the VLM-encoded embeddings to measure the quality of its representations. How do we evaluate these models? We can check how they perform on datasets where, for example, given an image and a question, the task is to answer the question correctly. We can also check how well these models reason about and answer questions on visual data. For this, the most common dataset used is [CLEVR](https://cs.stanford.edu/people/jcjohns/clevr/).

Standard datasets like MSCOCO might be straightforward for a model to learn due to their distribution, so they may not adequately demonstrate a model's capacity to generalize across more challenging or diverse datasets. In response, datasets like [Hateful Memes](https://arxiv.org/abs/2005.04790) were created to push models' capabilities to the extreme by adding difficult examples ("benign confounders"). The results suggested that multimodal pre-training helped less than expected and that models still had a huge gap to human performance.
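
The two evaluation setups named in this suggestion can be sketched roughly as follows. The embeddings are assumed to come from an already pre-trained, frozen VLM. Here they are just toy lists, and every name (`zero_shot_classify`, `train_linear_probe`, `probe_predict`) as well as the hyperparameters are hypothetical choices for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Zero-shot: pick the class whose text embedding is closest to the image.

    class_text_embeddings maps a class name to the embedding of a text prompt
    for that class; no task-specific training happens here.
    """
    return max(
        class_text_embeddings,
        key=lambda name: cosine(image_embedding, class_text_embeddings[name]),
    )


def train_linear_probe(embeddings, labels, epochs=200, learning_rate=0.5):
    """Linear probe: fit binary logistic regression on frozen embeddings (labels 0/1)."""
    weights, bias = [0.0] * len(embeddings[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(w * v for w, v in zip(weights, x)) + bias
            prediction = 1.0 / (1.0 + math.exp(-z))
            error = prediction - y  # gradient of the log loss w.r.t. the logit
            weights = [w - learning_rate * error * v for w, v in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def probe_predict(weights, bias, embedding):
    """Classify one frozen embedding with the trained linear probe."""
    z = sum(w * v for w, v in zip(weights, embedding)) + bias
    return 1 if z > 0 else 0
```

The design difference is where learning happens: zero-shot prediction learns nothing on the downstream task, while the linear probe learns only `weights` and `bias` and leaves the encoder untouched.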

johko · 2023-12-30T20:12:06Z

chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

+
+![Winogrand Idea](https://huggingface.co/datasets/hf-vision/course-assets/resolve/99ac107ade7fb89aae792f3655341528e64e1fbb/winogrand_paper.png) 
+
+One more such dataset called **Winoground** was designed to figure out, okay, so how godd is CLIP actually? **Figure Above** This dataset challenges us to consider if models, despite their impressive results, truly grasp compositional relationships like humans or if they're generalizing data. For example, earlier version of Stable Diffusion was not able to clearly count fingers. So, there's still lot of amazing work to be done to get the VLM's to the next stage! 


Suggested change

One more such dataset called **Winoground** was designed to figure out, okay, so how godd is CLIP actually? **Figure Above** This dataset challenges us to consider if models, despite their impressive results, truly grasp compositional relationships like humans or if they're generalizing data. For example, earlier version of Stable Diffusion was not able to clearly count fingers. So, there's still lot of amazing work to be done to get the VLM's to the next stage!

One more such dataset, **Winoground**, was designed to figure out how good CLIP actually is (see the **figure above**). This dataset challenges us to consider whether models, despite their impressive results, truly grasp compositional relationships like humans do or are merely generalizing from the data. For example, earlier versions of Stable Diffusion and other text-to-image models could not reliably draw the right number of fingers. So, there's still a lot of amazing work to be done to get VLMs to the next stage!

johko · 2023-12-30T20:16:45Z

chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

+
+
+## What's Next?
+The community is moving fast and we can see already lot of amazing work like [FLAVA](https://arxiv.org/abs/2112.04482) which tries to have a single "foundational" model for all the target modalities at once, etc. So, in future there will be modality-agnostic foundation models that can read and generate many modalities! Interesting future ahead. To capture more on these recent advances please follow the HF's [Transformers Library](https://huggingface.co/docs/transformers/index), and [Diffusers Library](https://huggingface.co/docs/diffusers/index) where we try to add these recent advances super fast!


Suggested change

The community is moving fast and we can see already lot of amazing work like [FLAVA](https://arxiv.org/abs/2112.04482) which tries to have a single "foundational" model for all the target modalities at once, etc. So, in future there will be modality-agnostic foundation models that can read and generate many modalities! Interesting future ahead. To capture more on these recent advances please follow the HF's [Transformers Library](https://huggingface.co/docs/transformers/index), and [Diffusers Library](https://huggingface.co/docs/diffusers/index) where we try to add these recent advances super fast!

The community is moving fast, and we can already see a lot of amazing work like [FLAVA](https://arxiv.org/abs/2112.04482), which tries to have a single "foundational" model for all the target modalities at once. This is one possible scenario for the future: modality-agnostic foundation models that can read and generate many modalities! But maybe we will also see other alternatives developing. One thing we can say for sure is that there is an interesting future ahead.

To keep up with these recent advances, feel free to follow HF's [Transformers Library](https://huggingface.co/docs/transformers/index) and [Diffusers Library](https://huggingface.co/docs/diffusers/index), where we try to add recent advances and models as fast as possible! If you feel like we are missing something important, you can also open an issue for these libraries and contribute code yourself.

charchit7 · 2023-12-31T12:36:42Z

> Hey, sorry for my late review on this, I somehow went past it.
>
> Great content. I left some suggestions which are mostly of grammatical nature.

No problem at all @johko
Hope your children are doing well now.

Fixed the changes.

johko

Thanks, LGTM now 🙂

johko · 2024-01-02T10:53:49Z

Merging 🚀

charchit7 · 2024-01-02T14:51:07Z

Thanks @johko :)

minor changes

17458ce

charchit7 requested review from merveenoyan and MKhalusova as code owners December 18, 2023 12:07

charchit7 closed this Dec 18, 2023

charchit7 reopened this Dec 18, 2023

accidently added intro

d9ecde3

charchit7 closed this Dec 18, 2023

changes

762ff9a

charchit7 reopened this Dec 18, 2023

merveenoyan reviewed Dec 18, 2023

snehilsanyal reviewed Dec 19, 2023

snehilsanyal reviewed Dec 19, 2023

ratan reviewed Dec 19, 2023

charchit7 and others added 14 commits December 22, 2023 01:44

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

32f62de

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

d9f15cc

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

24cb520

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

0149a65

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

dd082c6

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

017004e

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

0813e65

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

d61d66e

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

8d5923c

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

c0b855d

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

43674f8

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

e21a881

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

fba27e0

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

5a9e498

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

charchit7 requested a review from merveenoyan December 21, 2023 21:18

charchit7 self-assigned this Dec 21, 2023

ratan approved these changes Dec 22, 2023

merveenoyan reviewed Dec 24, 2023

charchit7 and others added 6 commits December 28, 2023 11:59

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

4e5f8b9

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

f1bcdc8

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

487f493

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

b04f9aa

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx

9283954

Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

Update vlm-intro.mdx

6c1f81f

edited the new-lines issues present in the content

charchit7 requested a review from merveenoyan December 28, 2023 06:48

Update vlm-intro.mdx

ff160ea

merveenoyan approved these changes Dec 28, 2023

charchit7 added 2 commits December 28, 2023 19:20

Merge branch 'main' into charchit7-charchit_vlm

417868f

Merge branch 'main' into charchit_vlm

d755fd2

charchit7 requested a review from merveenoyan December 28, 2023 14:29

johko reviewed Dec 30, 2023

fixed changes suggested

597c4b8

charchit7 requested a review from johko December 31, 2023 12:56

johko approved these changes Jan 2, 2024

johko merged commit fd084b7 into johko:main Jan 2, 2024

snehilsanyal mentioned this pull request Jan 7, 2024

Introduction to Vision Language Models SuryaKrishna02/computer-vision-course#5

Closed

7 tasks




