-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added Introduction to Visual Language Models(VLM) for Unit 4. Multimodal Models. #147
Conversation
++ I accidently added the files changes from @snehilsanyal. Removed that and kept just mine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've given mostly format-related recommendations, thank you!
|
||
One more such dataset called **Winoground** was designed to figure out, okay, so how godd is CLIP actually? **Figure Above** | ||
It designed to let us think more like : the results of models looks really amazing and it's way better than previous models but does it understand compositional relationships | ||
in the same way humans would understand it rather than just generalizating to the data. For eg. earlier version of Stable Diffusion was |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence is a bit way too long, can you shorten?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updating.
Thank you, @merveenoyan :) I'll address them. Regarding the content, do you think there's anything more I could add? I had a great time learning about VLMs. I've kept the content brief here. |
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Hey @merveenoyan, thank you so much for the suggested fixes. I have addressed them all. Please let me know if anything else is required. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! left formatting nits
@merveenoyan Thanks, will adress them. Lot of line issues happening from my end. |
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
edited the new-lines issues present in the content
Hey @merveenoyan updated the content, and fixed the new lines issues suggested by you. Please have a look. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
@charchit7 if you can solve merge conflicts we can merge. |
Hey @merveenoyan fixed the merge conflict. Please check. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, sorry for my late review on this, I somehow went past it.
Great content. I left some suggestions which are mostly of grammatical nature.
|
||
## Our World is Multimodal | ||
Humans explore the world through diverse senses: sight, sound, touch, and scent. A complete grasp of our surroundings emerges by harmonizing insights from these varied modalities. | ||
We think of modality, initially introduced in mathematics as distinct peaks, in a poetic way as: "With each modality, a unique part to play, Together they form our understanding array. A symphony of senses, a harmonious blend, In perception's dance, our world transcends." In pursuit of making a AI capable to understand the world, the field of machine learning seeks to develop models capable of processing and integrating data across multiple modalities. However, several challenges, including representation and alignment, must be addressed.Representation explores techniques to effectively summarize multimodal data, capturing the intricate connections among individual modality elements. Alignment focuses on identifying connections and interactions across all elements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We think of modality, initially introduced in mathematics as distinct peaks, in a poetic way as: "With each modality, a unique part to play, Together they form our understanding array. A symphony of senses, a harmonious blend, In perception's dance, our world transcends." In pursuit of making a AI capable to understand the world, the field of machine learning seeks to develop models capable of processing and integrating data across multiple modalities. However, several challenges, including representation and alignment, must be addressed.Representation explores techniques to effectively summarize multimodal data, capturing the intricate connections among individual modality elements. Alignment focuses on identifying connections and interactions across all elements. | |
We think of modality, initially introduced in mathematics as distinct peaks, in a poetic way as: "With each modality, a unique part to play, Together they form our understanding array. A symphony of senses, a harmonious blend, In perception's dance, our world transcends." In pursuit of making a AI capable to understand the world, the field of machine learning seeks to develop models capable of processing and integrating data across multiple modalities. However, several challenges, including representation and alignment, must be addressed. Representation explores techniques to effectively summarize multimodal data, capturing the intricate connections among individual modality elements. Alignment focuses on identifying connections and interactions across all elements. |
|
||
|
||
## Introduction | ||
Processing images to generate text, such as image captioning and visual question-answering, has been studied for many years which includes autonomous driving, remote sensing, etc. We also have seen shift from tradional ML/DL to new learning paradigm called pre-training, fine-tuning and prediction which has shown great benefit due since in tradional way we may need to collect huge amount of data, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Processing images to generate text, such as image captioning and visual question-answering, has been studied for many years which includes autonomous driving, remote sensing, etc. We also have seen shift from tradional ML/DL to new learning paradigm called pre-training, fine-tuning and prediction which has shown great benefit due since in tradional way we may need to collect huge amount of data, etc. | |
Processing images to generate text, such as image captioning and visual question-answering, has been studied for many years which includes autonomous driving, remote sensing, etc. We also have seen a shift from traditional ML/DL training from scratch to a new learning paradigm including pre-training, fine-tuning and prediction, which has shown great benefit since in the traditional way we may need to collect huge amount of data, etc. |
## Mechanism | ||
To enable the functionality of Vision Language Models (VLMs), a meaningful combination of both text and images is essential for joint learning. How can we do that? One simple/common way is given image-text pairs: | ||
- Extract image and text features using text and image encoders. For images it can be **CNN** or **transformer** based architectures. | ||
- Learns the vision-language correlation with certain pre-training objectives. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Learns the vision-language correlation with certain pre-training objectives. | |
- Learn the vision-language correlation with certain pre-training objectives. |
VLM pre-training aims to pre-train a VLM to learn image-text correlation, targeting effective zero-shot predictions on visual recognition tasks which can be segmentation, classification, etc. | ||
|
||
## Strategies | ||
We can categorize [group](https://lilianweng.github.io/posts/2022-06-09-vlm/#no-training) VLMs based on how we leverage the two modes of learning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I get the sentence right, but I think you can either remove "categorize" or "group"
We can categorize [group](https://lilianweng.github.io/posts/2022-06-09-vlm/#no-training) VLMs based on how we leverage the two modes of learning. | |
We can [group](https://lilianweng.github.io/posts/2022-06-09-vlm/#no-training) VLMs based on how we leverage the two modes of learning. |
- In this method we fuse visual information into language models by treating images as normal text tokens and train the model on a sequence of joint representations of both text and images. Precisely, images are divided into multiple smaller patches and each patch is treated as one "token" in the input sequence. e.g. [VisualBERT](https://arxiv.org/abs/1908.03557), [SimVLM](https://arxiv.org/abs/2108.10904). | ||
|
||
- Learning good image embeddings that can work as a prefix for a frozen, pre-trained language model. | ||
- In this method we don't change the language model parameters when adapting to handle visual signal. Instead we learn such an embedding space for images that it is compatible with the language model’s. e.g. [Frozen](https://arxiv.org/abs/2106.13884), [ClipCap](https://arxiv.org/abs/2111.09734). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- In this method we don't change the language model parameters when adapting to handle visual signal. Instead we learn such an embedding space for images that it is compatible with the language model’s. e.g. [Frozen](https://arxiv.org/abs/2106.13884), [ClipCap](https://arxiv.org/abs/2111.09734). | |
- In this method we don't change the language model parameters when adapting to handle visual signals. Instead we learn an embedding space for images, such that it is compatible with the language model’s. e.g. [Frozen](https://arxiv.org/abs/2106.13884), [ClipCap](https://arxiv.org/abs/2111.09734). |
|
||
|
||
## Downstream Tasks and Evaluation | ||
VLMs are getting good at many downstream tasks, including image classification, object detection, semantic segmentation, image-text retrieval, and action recognition while surpassing models trained traditionally. Generally the setup used for evaluation VLMs is zero-shot prediction and linear probing. Zero-shot prediction is the most common way to evaluate the VLMs, where we directly apply pre-trained VLMs to downstream tasks without any task-specific fine-tuning. In linear probing, we freeze the pre-trained VLM and train a linear classifier to classify the VLM-encoded embeddings to measure it's representation.How do we evaluate these models? We can check how they perform on these datasets given an image and a question, the task is to answer the question correctly! We can also check how these models reason answer questions about the visual data. For this, the most common dataset used is [CLEVR](https://cs.stanford.edu/people/jcjohns/clevr/). Standard datasets like MSCOCO might be straightforward for a model to learn due to their distribution, which may not adequately demonstrate a model's capacity to generalize across more challenging or diverse datasets. In response, datasets like [Hateful Memes](https://arxiv.org/abs/2005.04790) are created to address this problem by understanding the models capability to an extreme by adding difficult examples ("benign confounders") to the dataset to make it hard which showed that multimodal pre-training dosen't work and models had huge gap with human performance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some line breaks would make it easier to read
VLMs are getting good at many downstream tasks, including image classification, object detection, semantic segmentation, image-text retrieval, and action recognition while surpassing models trained traditionally. Generally the setup used for evaluation VLMs is zero-shot prediction and linear probing. Zero-shot prediction is the most common way to evaluate the VLMs, where we directly apply pre-trained VLMs to downstream tasks without any task-specific fine-tuning. In linear probing, we freeze the pre-trained VLM and train a linear classifier to classify the VLM-encoded embeddings to measure it's representation.How do we evaluate these models? We can check how they perform on these datasets given an image and a question, the task is to answer the question correctly! We can also check how these models reason answer questions about the visual data. For this, the most common dataset used is [CLEVR](https://cs.stanford.edu/people/jcjohns/clevr/). Standard datasets like MSCOCO might be straightforward for a model to learn due to their distribution, which may not adequately demonstrate a model's capacity to generalize across more challenging or diverse datasets. In response, datasets like [Hateful Memes](https://arxiv.org/abs/2005.04790) are created to address this problem by understanding the models capability to an extreme by adding difficult examples ("benign confounders") to the dataset to make it hard which showed that multimodal pre-training dosen't work and models had huge gap with human performance. | |
VLMs are getting good at many downstream tasks, including image classification, object detection, semantic segmentation, image-text retrieval, and action recognition while surpassing models trained traditionally. | |
Generally the setup used for evaluating VLMs is zero-shot prediction and linear probing. Zero-shot prediction is the most common way to evaluate the VLMs, where we directly apply pre-trained VLMs to downstream tasks without any task-specific fine-tuning. | |
In linear probing, we freeze the pre-trained VLM and train a linear classifier to classify the VLM-encoded embeddings to measure its representation. How do we evaluate these models? We can check how they perform on datasets, e.g. given an image and a question, the task is to answer the question correctly! We can also check how these models reason answer questions about the visual data. For this, the most common dataset used is [CLEVR](https://cs.stanford.edu/people/jcjohns/clevr/). | |
Standard datasets like MSCOCO might be straightforward for a model to learn due to their distribution, which may not adequately demonstrate a model's capacity to generalize across more challenging or diverse datasets. In response, datasets like [Hateful Memes](https://arxiv.org/abs/2005.04790) are created to address this problem by understanding the models capability to an extreme by adding difficult examples ("benign confounders") to the dataset to make it hard which showed that multimodal pre-training doesn't work and models had a huge gap with human performance. |
|
||
![Winogrand Idea](https://huggingface.co/datasets/hf-vision/course-assets/resolve/99ac107ade7fb89aae792f3655341528e64e1fbb/winogrand_paper.png) | ||
|
||
One more such dataset called **Winoground** was designed to figure out, okay, so how godd is CLIP actually? **Figure Above** This dataset challenges us to consider if models, despite their impressive results, truly grasp compositional relationships like humans or if they're generalizing data. For example, earlier version of Stable Diffusion was not able to clearly count fingers. So, there's still lot of amazing work to be done to get the VLM's to the next stage! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more such dataset called **Winoground** was designed to figure out, okay, so how godd is CLIP actually? **Figure Above** This dataset challenges us to consider if models, despite their impressive results, truly grasp compositional relationships like humans or if they're generalizing data. For example, earlier version of Stable Diffusion was not able to clearly count fingers. So, there's still lot of amazing work to be done to get the VLM's to the next stage! | |
One more such dataset called **Winoground** was designed to figure out how good CLIP actually is. **Figure Above** This dataset challenges us to consider if models, despite their impressive results, truly grasp compositional relationships like humans or if they're generalizing data. For example, earlier versions of Stable Diffusion and other text-to-image models, were not able to clearly count fingers. So, there's still lot of amazing work to be done to get the VLM's to the next stage! |
|
||
|
||
## What's Next? | ||
The community is moving fast and we can see already lot of amazing work like [FLAVA](https://arxiv.org/abs/2112.04482) which tries to have a single "foundational" model for all the target modalities at once, etc. So, in future there will be modality-agnostic foundation models that can read and generate many modalities! Interesting future ahead. To capture more on these recent advances please follow the HF's [Transformers Library](https://huggingface.co/docs/transformers/index), and [Diffusers Library](https://huggingface.co/docs/diffusers/index) where we try to add these recent advances super fast! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The community is moving fast and we can see already lot of amazing work like [FLAVA](https://arxiv.org/abs/2112.04482) which tries to have a single "foundational" model for all the target modalities at once, etc. So, in future there will be modality-agnostic foundation models that can read and generate many modalities! Interesting future ahead. To capture more on these recent advances please follow the HF's [Transformers Library](https://huggingface.co/docs/transformers/index), and [Diffusers Library](https://huggingface.co/docs/diffusers/index) where we try to add these recent advances super fast! | |
The community is moving fast and we can see already lot of amazing work like [FLAVA](https://arxiv.org/abs/2112.04482) which tries to have a single "foundational" model for all the target modalities at once. This is one possible scenario for the future - modality-agnostic foundation models that can read and generate many modalities! But maybe we also see other alternatives developing, one thing we can say for sure is . there is an interesting future ahead. | |
To capture more on these recent advances feel free follow the HF's [Transformers Library](https://huggingface.co/docs/transformers/index), and [Diffusers Library](https://huggingface.co/docs/diffusers/index) where we try to add recent advances and models as fast as possible! If you feel like we are missing something important, you can also open an issue for these libraries and contribute code yourself. |
No problem at all @johko Fixed the changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM now 🙂
Merging 🚀 |
Thanks @johko :) |
Hey everyone 🤗
This PR adds the Introduction to VLM on Fusion of Text and Vision for Unit 4: Multimodal Models.
Related to Issue: #54
Please have a look!
@MKhalusova @merveenoyan