
Added Introduction to Visual Language Models(VLM) for Unit 4. Multimodal Models. #147

Merged
43 commits merged into johko:main, Jan 2, 2024

Conversation

charchit7
Collaborator

Hey everyone 🤗

This PR adds the Introduction to VLM on Fusion of Text and Vision for Unit 4: Multimodal Models.
Related to Issue: #54

Please have a look!
@MKhalusova @merveenoyan

@charchit7
Collaborator Author

charchit7 commented Dec 18, 2023

++ I accidentally added the file changes from @snehilsanyal. Removed those and kept just mine.

@charchit7 charchit7 reopened this Dec 18, 2023
@charchit7 charchit7 closed this Dec 18, 2023
@charchit7 charchit7 reopened this Dec 18, 2023
@charchit7
Collaborator Author

@snehilsanyal, @SuryaKrishna02

Collaborator

@merveenoyan merveenoyan left a comment

I've given mostly format-related recommendations, thank you!

chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx — 7 review comments (outdated, resolved)

One more such dataset called **Winoground** was designed to figure out, okay, so how godd is CLIP actually? **Figure Above**
It designed to let us think more like : the results of models looks really amazing and it's way better than previous models but does it understand compositional relationships
in the same way humans would understand it rather than just generalizating to the data. For eg. earlier version of Stable Diffusion was
Collaborator

This sentence is a bit way too long, can you shorten?

Collaborator Author

updating.

chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx — review comment (outdated, resolved)
@charchit7
Collaborator Author

charchit7 commented Dec 18, 2023

> I've given mostly format-related recommendations, thank you!

Thank you, @merveenoyan :) I'll address them. Regarding the content, do you think there's anything more I could add? I had a great time learning about VLMs. I've kept the content brief here.

charchit7 and others added 14 commits December 22, 2023 01:44
Co-authored-by: Merve Noyan <merveenoyan@gmail.com> (×14)
@charchit7
Collaborator Author

Hey @merveenoyan, thank you so much for the suggested fixes. I have addressed them all. Please let me know if anything else is required.

@charchit7 charchit7 self-assigned this Dec 21, 2023
Collaborator

@ratan ratan left a comment

LGTM

Collaborator

@merveenoyan merveenoyan left a comment

Thank you! Left formatting nits.

chapters/en/Unit 4 - Mulitmodal Models/vlm-intro.mdx — 9 review comments (outdated, resolved)
@charchit7
Collaborator Author

@merveenoyan Thanks, will address them. A lot of line issues are happening on my end.

charchit7 and others added 6 commits December 28, 2023 11:59
Co-authored-by: Merve Noyan <merveenoyan@gmail.com> (×5)
edited the new-lines issues present in the content
@charchit7
Collaborator Author

Hey @merveenoyan, I updated the content and fixed the new-line issues you pointed out. Please have a look. Thanks!

Collaborator

@merveenoyan merveenoyan left a comment

Looks good to me!

@merveenoyan
Collaborator

@charchit7 if you can solve merge conflicts we can merge.

@charchit7
Collaborator Author

Hey @merveenoyan fixed the merge conflict. Please check.

Owner

@johko johko left a comment

Hey, sorry for my late review on this, I somehow went past it.

Great content. I left some suggestions which are mostly of grammatical nature.


## Our World is Multimodal
Humans explore the world through diverse senses: sight, sound, touch, and scent. A complete grasp of our surroundings emerges by harmonizing insights from these varied modalities.
We think of modality, initially introduced in mathematics as distinct peaks, in a poetic way as: "With each modality, a unique part to play, Together they form our understanding array. A symphony of senses, a harmonious blend, In perception's dance, our world transcends." In pursuit of making a AI capable to understand the world, the field of machine learning seeks to develop models capable of processing and integrating data across multiple modalities. However, several challenges, including representation and alignment, must be addressed.Representation explores techniques to effectively summarize multimodal data, capturing the intricate connections among individual modality elements. Alignment focuses on identifying connections and interactions across all elements.
Owner

Suggested change
We think of modality, initially introduced in mathematics as distinct peaks, in a poetic way as: "With each modality, a unique part to play, Together they form our understanding array. A symphony of senses, a harmonious blend, In perception's dance, our world transcends." In pursuit of making a AI capable to understand the world, the field of machine learning seeks to develop models capable of processing and integrating data across multiple modalities. However, several challenges, including representation and alignment, must be addressed.Representation explores techniques to effectively summarize multimodal data, capturing the intricate connections among individual modality elements. Alignment focuses on identifying connections and interactions across all elements.
We think of modality, initially introduced in mathematics as distinct peaks, in a poetic way as: "With each modality, a unique part to play, Together they form our understanding array. A symphony of senses, a harmonious blend, In perception's dance, our world transcends." In pursuit of making a AI capable to understand the world, the field of machine learning seeks to develop models capable of processing and integrating data across multiple modalities. However, several challenges, including representation and alignment, must be addressed. Representation explores techniques to effectively summarize multimodal data, capturing the intricate connections among individual modality elements. Alignment focuses on identifying connections and interactions across all elements.



## Introduction
Processing images to generate text, such as image captioning and visual question-answering, has been studied for many years which includes autonomous driving, remote sensing, etc. We also have seen shift from tradional ML/DL to new learning paradigm called pre-training, fine-tuning and prediction which has shown great benefit due since in tradional way we may need to collect huge amount of data, etc.
Owner

Suggested change
Processing images to generate text, such as image captioning and visual question-answering, has been studied for many years which includes autonomous driving, remote sensing, etc. We also have seen shift from tradional ML/DL to new learning paradigm called pre-training, fine-tuning and prediction which has shown great benefit due since in tradional way we may need to collect huge amount of data, etc.
Processing images to generate text, such as image captioning and visual question-answering, has been studied for many years which includes autonomous driving, remote sensing, etc. We also have seen a shift from traditional ML/DL training from scratch to a new learning paradigm including pre-training, fine-tuning and prediction, which has shown great benefit since in the traditional way we may need to collect huge amount of data, etc.

## Mechanism
To enable the functionality of Vision Language Models (VLMs), a meaningful combination of both text and images is essential for joint learning. How can we do that? One simple/common way is given image-text pairs:
- Extract image and text features using text and image encoders. For images it can be **CNN** or **transformer** based architectures.
- Learns the vision-language correlation with certain pre-training objectives.
Owner

Suggested change
- Learns the vision-language correlation with certain pre-training objectives.
- Learn the vision-language correlation with certain pre-training objectives.

VLM pre-training aims to pre-train a VLM to learn image-text correlation, targeting effective zero-shot predictions on visual recognition tasks which can be segmentation, classification, etc.
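
As a concrete illustration of the dual-encoder recipe above (encode images and text separately, then learn their correlation with a contrastive pre-training objective), here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss in PyTorch. The batch size, embedding dimension, and temperature are illustrative assumptions rather than the settings of any particular model.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_features, text_features: (batch_size, embed_dim) tensors produced by
    an image encoder (CNN/ViT) and a text encoder (transformer), respectively.
    """
    # Normalize so the dot product becomes a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with caption j.
    logits = image_features @ text_features.t() / temperature

    # Matching image-caption pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Random embeddings stand in for encoder outputs in this toy example.
images = torch.randn(8, 512)
captions = torch.randn(8, 512)
print(clip_style_contrastive_loss(images, captions))
```

In a real pre-training run, the two encoders (and often the temperature) are trained jointly on very large collections of image-text pairs.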

## Strategies
We can categorize [group](https://lilianweng.github.io/posts/2022-06-09-vlm/#no-training) VLMs based on how we leverage the two modes of learning.
Owner

Not sure if I get the sentence right, but I think you can either remove "categorize" or "group"

Suggested change
We can categorize [group](https://lilianweng.github.io/posts/2022-06-09-vlm/#no-training) VLMs based on how we leverage the two modes of learning.
We can [group](https://lilianweng.github.io/posts/2022-06-09-vlm/#no-training) VLMs based on how we leverage the two modes of learning.

- In this method we fuse visual information into language models by treating images as normal text tokens and train the model on a sequence of joint representations of both text and images. Precisely, images are divided into multiple smaller patches and each patch is treated as one "token" in the input sequence. e.g. [VisualBERT](https://arxiv.org/abs/1908.03557), [SimVLM](https://arxiv.org/abs/2108.10904).

- Learning good image embeddings that can work as a prefix for a frozen, pre-trained language model.
- In this method we don't change the language model parameters when adapting to handle visual signal. Instead we learn such an embedding space for images that it is compatible with the language model’s. e.g. [Frozen](https://arxiv.org/abs/2106.13884), [ClipCap](https://arxiv.org/abs/2111.09734).
Owner

Suggested change
- In this method we don't change the language model parameters when adapting to handle visual signal. Instead we learn such an embedding space for images that it is compatible with the language model’s. e.g. [Frozen](https://arxiv.org/abs/2106.13884), [ClipCap](https://arxiv.org/abs/2111.09734).
- In this method we don't change the language model parameters when adapting to handle visual signals. Instead we learn an embedding space for images, such that it is compatible with the language model’s. e.g. [Frozen](https://arxiv.org/abs/2106.13884), [ClipCap](https://arxiv.org/abs/2111.09734).
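
To make the "frozen language model with a learned image prefix" strategy more tangible, the sketch below maps a single image embedding to a short sequence of pseudo-token embeddings that a frozen decoder could consume, broadly in the spirit of Frozen/ClipCap. The class name, dimensions, and prefix length are made-up placeholders; the actual papers use more elaborate mapping networks and training setups.

```python
import torch
import torch.nn as nn

class ImagePrefixMapper(nn.Module):
    """Maps one image embedding to `prefix_len` pseudo-token embeddings that
    can be prepended to a frozen language model's input embeddings."""

    def __init__(self, image_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        self.mlp = nn.Sequential(
            nn.Linear(image_dim, lm_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(lm_dim * prefix_len, lm_dim * prefix_len),
        )

    def forward(self, image_embedding):
        # (batch, image_dim) -> (batch, prefix_len, lm_dim)
        prefix = self.mlp(image_embedding)
        return prefix.view(-1, self.prefix_len, self.lm_dim)

# Only the mapper is trained; the language model's weights stay frozen, and the
# prefix is concatenated with the caption's token embeddings during training.
mapper = ImagePrefixMapper()
image_embedding = torch.randn(4, 512)   # e.g. CLIP image features
prefix_tokens = mapper(image_embedding)
print(prefix_tokens.shape)              # torch.Size([4, 10, 768])
```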



## Downstream Tasks and Evaluation
VLMs are getting good at many downstream tasks, including image classification, object detection, semantic segmentation, image-text retrieval, and action recognition while surpassing models trained traditionally. Generally the setup used for evaluation VLMs is zero-shot prediction and linear probing. Zero-shot prediction is the most common way to evaluate the VLMs, where we directly apply pre-trained VLMs to downstream tasks without any task-specific fine-tuning. In linear probing, we freeze the pre-trained VLM and train a linear classifier to classify the VLM-encoded embeddings to measure it's representation.How do we evaluate these models? We can check how they perform on these datasets given an image and a question, the task is to answer the question correctly! We can also check how these models reason answer questions about the visual data. For this, the most common dataset used is [CLEVR](https://cs.stanford.edu/people/jcjohns/clevr/). Standard datasets like MSCOCO might be straightforward for a model to learn due to their distribution, which may not adequately demonstrate a model's capacity to generalize across more challenging or diverse datasets. In response, datasets like [Hateful Memes](https://arxiv.org/abs/2005.04790) are created to address this problem by understanding the models capability to an extreme by adding difficult examples ("benign confounders") to the dataset to make it hard which showed that multimodal pre-training dosen't work and models had huge gap with human performance.
Owner

I think some line breaks would make it easier to read

Suggested change
VLMs are getting good at many downstream tasks, including image classification, object detection, semantic segmentation, image-text retrieval, and action recognition while surpassing models trained traditionally. Generally the setup used for evaluation VLMs is zero-shot prediction and linear probing. Zero-shot prediction is the most common way to evaluate the VLMs, where we directly apply pre-trained VLMs to downstream tasks without any task-specific fine-tuning. In linear probing, we freeze the pre-trained VLM and train a linear classifier to classify the VLM-encoded embeddings to measure it's representation.How do we evaluate these models? We can check how they perform on these datasets given an image and a question, the task is to answer the question correctly! We can also check how these models reason answer questions about the visual data. For this, the most common dataset used is [CLEVR](https://cs.stanford.edu/people/jcjohns/clevr/). Standard datasets like MSCOCO might be straightforward for a model to learn due to their distribution, which may not adequately demonstrate a model's capacity to generalize across more challenging or diverse datasets. In response, datasets like [Hateful Memes](https://arxiv.org/abs/2005.04790) are created to address this problem by understanding the models capability to an extreme by adding difficult examples ("benign confounders") to the dataset to make it hard which showed that multimodal pre-training dosen't work and models had huge gap with human performance.
VLMs are getting good at many downstream tasks, including image classification, object detection, semantic segmentation, image-text retrieval, and action recognition while surpassing models trained traditionally.
Generally the setup used for evaluating VLMs is zero-shot prediction and linear probing. Zero-shot prediction is the most common way to evaluate the VLMs, where we directly apply pre-trained VLMs to downstream tasks without any task-specific fine-tuning.
In linear probing, we freeze the pre-trained VLM and train a linear classifier to classify the VLM-encoded embeddings to measure its representation. How do we evaluate these models? We can check how they perform on datasets, e.g. given an image and a question, the task is to answer the question correctly! We can also check how these models reason answer questions about the visual data. For this, the most common dataset used is [CLEVR](https://cs.stanford.edu/people/jcjohns/clevr/).
Standard datasets like MSCOCO might be straightforward for a model to learn due to their distribution, which may not adequately demonstrate a model's capacity to generalize across more challenging or diverse datasets. In response, datasets like [Hateful Memes](https://arxiv.org/abs/2005.04790) are created to address this problem by understanding the models capability to an extreme by adding difficult examples ("benign confounders") to the dataset to make it hard which showed that multimodal pre-training doesn't work and models had a huge gap with human performance.
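
For a hands-on feel of zero-shot prediction, the snippet below classifies one image with a pre-trained CLIP checkpoint from the Transformers library. The sample image URL and candidate labels are arbitrary choices for the example, not part of any benchmark setup.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any image works here; this COCO validation sample is just a placeholder.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate classes are phrased as natural-language prompts.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Higher logits mean higher image-text similarity; softmax gives per-label scores.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Linear probing, by contrast, would freeze the same model, extract its image embeddings for a labelled dataset, and train only a small linear classifier on top of them.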


![Winogrand Idea](https://huggingface.co/datasets/hf-vision/course-assets/resolve/99ac107ade7fb89aae792f3655341528e64e1fbb/winogrand_paper.png)

One more such dataset called **Winoground** was designed to figure out, okay, so how godd is CLIP actually? **Figure Above** This dataset challenges us to consider if models, despite their impressive results, truly grasp compositional relationships like humans or if they're generalizing data. For example, earlier version of Stable Diffusion was not able to clearly count fingers. So, there's still lot of amazing work to be done to get the VLM's to the next stage!
Owner

Suggested change
One more such dataset called **Winoground** was designed to figure out, okay, so how godd is CLIP actually? **Figure Above** This dataset challenges us to consider if models, despite their impressive results, truly grasp compositional relationships like humans or if they're generalizing data. For example, earlier version of Stable Diffusion was not able to clearly count fingers. So, there's still lot of amazing work to be done to get the VLM's to the next stage!
One more such dataset called **Winoground** was designed to figure out how good CLIP actually is. **Figure Above** This dataset challenges us to consider if models, despite their impressive results, truly grasp compositional relationships like humans or if they're generalizing data. For example, earlier versions of Stable Diffusion and other text-to-image models, were not able to clearly count fingers. So, there's still lot of amazing work to be done to get the VLM's to the next stage!



## What's Next?
The community is moving fast and we can see already lot of amazing work like [FLAVA](https://arxiv.org/abs/2112.04482) which tries to have a single "foundational" model for all the target modalities at once, etc. So, in future there will be modality-agnostic foundation models that can read and generate many modalities! Interesting future ahead. To capture more on these recent advances please follow the HF's [Transformers Library](https://huggingface.co/docs/transformers/index), and [Diffusers Library](https://huggingface.co/docs/diffusers/index) where we try to add these recent advances super fast!
Owner

Suggested change
The community is moving fast and we can see already lot of amazing work like [FLAVA](https://arxiv.org/abs/2112.04482) which tries to have a single "foundational" model for all the target modalities at once, etc. So, in future there will be modality-agnostic foundation models that can read and generate many modalities! Interesting future ahead. To capture more on these recent advances please follow the HF's [Transformers Library](https://huggingface.co/docs/transformers/index), and [Diffusers Library](https://huggingface.co/docs/diffusers/index) where we try to add these recent advances super fast!
The community is moving fast and we can see already lot of amazing work like [FLAVA](https://arxiv.org/abs/2112.04482) which tries to have a single "foundational" model for all the target modalities at once. This is one possible scenario for the future - modality-agnostic foundation models that can read and generate many modalities! But maybe we also see other alternatives developing, one thing we can say for sure is . there is an interesting future ahead.
To capture more on these recent advances feel free follow the HF's [Transformers Library](https://huggingface.co/docs/transformers/index), and [Diffusers Library](https://huggingface.co/docs/diffusers/index) where we try to add recent advances and models as fast as possible! If you feel like we are missing something important, you can also open an issue for these libraries and contribute code yourself.

@charchit7
Collaborator Author

charchit7 commented Dec 31, 2023

> Hey, sorry for my late review on this, I somehow went past it.
>
> Great content. I left some suggestions which are mostly of grammatical nature.

No problem at all @johko
Hope your children are doing good now.

Fixed the changes.

@charchit7 charchit7 requested a review from johko December 31, 2023 12:56
Owner

@johko johko left a comment

Thanks, LGTM now 🙂

@johko
Owner

johko commented Jan 2, 2024

Merging 🚀

@johko johko merged commit fd084b7 into johko:main Jan 2, 2024
@charchit7
Collaborator Author

Thanks @johko :)
