Multimodal Models - CLIP and relatives #29

Closed
pedrogengo opened this issue Oct 10, 2023 · 4 comments
Labels: Chapter Content (Discuss and track the content of a chapter)

Comments

@pedrogengo
Contributor

Hello!

Inspired by #19 and #28, my fellow collaborators and I have also outlined a course curriculum for our section, but we would like some input and feedback from the HF team before we finalize it and start working on it. This is our chosen structure so far.

Introduction

  • Motivation for multimodality
  • History of multimodal models
  • Self-supervised learning enabling multimodality

CLIP

  • Intro to CLIP (ELI5)
  • Theory behind CLIP (contrastive loss, embeddings, etc.)
  • Variations of CLIP backbones
  • How tokenisation and embeddings work in CLIP
  • Applications of CLIP:
    • Search and retrieval
    • Zero-shot classification (see the sketch after this list)
    • CLIP guidance (using CLIP in other models to guide generation, e.g. DALL-E, Stable Diffusion)
  • Fine-tuning CLIP (OpenCLIP and other variants?)
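
To make the zero-shot classification item concrete, here is a minimal sketch using the transformers CLIPModel and CLIPProcessor. The checkpoint, image URL, and candidate labels are just illustrative choices, not part of the outline:

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP checkpoint on the Hub would work similarly
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Example image and candidate labels (assumptions for this sketch)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, turned into probabilities over the candidate labels
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```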

Losses / self-supervised learning

  • Contrastive (see the sketch after this list)
  • Non-contrastive
  • Triplet
  • One or two others
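
As a reference point for the contrastive item above, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss in PyTorch; the function name and temperature value are placeholders for illustration:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    # Normalize so the dot product becomes cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; matching pairs sit on the diagonal
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```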

Relatives

  • ImageBind
  • BLIP
  • OWL-ViT
  • Flamingo (IDEFICS)
  • LLaVA

Practical applications & challenges

  • Applications
    • Image search engine based on textual prompts (see the sketch after this list)
    • Downstream tasks on embeddings, e.g. classification, clustering
    • Visual question answering systems
  • Challenges
    • Data bias / out-of-distribution data
    • Hard to get enough data, which leads to using noisy internet data
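
For the text-to-image search application, a minimal retrieval sketch could look like the following; it assumes a small in-memory list of PIL images and reuses CLIP embeddings with cosine similarity (all names here are illustrative):

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(images):
    # images: a list of PIL.Image objects (assumed to be loaded elsewhere)
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return F.normalize(feats, dim=-1)

def search(query, image_index, top_k=5):
    # image_index: tensor of normalized image embeddings from embed_images()
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_feat = F.normalize(model.get_text_features(**inputs), dim=-1)
    # Cosine similarity between the query and every indexed image
    scores = (text_feat @ image_index.t()).squeeze(0)
    top = scores.topk(min(top_k, scores.numel()))
    return top.indices.tolist(), top.values.tolist()
```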

References:

@mattmdjaga @froestiago

@alperenunlu added the Chapter Content label on Oct 10, 2023
@merveenoyan
Collaborator

Hello 👋
I think I'll comment on every chapter if it's ok.

The introduction seems very fine.

> Introduction
> Motivation for multimodality
> History of multimodal models
> Self supervised learning enabling multi-modality
>
> CLIP

It's very CLIP-focused, so it would be nice to be less specific IMO. I think that thanks to CLIP we have many multimodal models these days, but maybe keep it brief? Not sure, we can decide during the writing process as well.

> Intro to (ELI5)
> Theory behind clip (contrastive loss, embeddings, etc)
> Variations of CLIP backbones
> How tokenisation and embeddings work in clip
> Applications of clip:
> Search and retrieve
> Zero shot classification
> Clip guidance (Using clip in other models to guide generation, DALLE, SD etc)
> Fine-tuning clip (Open-clip, and other variants?)

This section is nice.

> Losses/ self supervised learning
> Contrastive
> Non contrastive
> Triplet
> One or two other ones
>
> Relatives

This section is nice, maybe make sure it doesn't overlap with the section where we talk about existing architectures or foundation models.

> Image-bind
> BLIP
> OWL-VIT
> Flamingo (IDEFICS)
> LLaVa
>
> Practical applications & challenges

Maybe keep this brief and explain more in the Computer Vision in the Wild section, WDYT? Also pinging @johko

> Applications
> Search image engine based on textual prompts
> Downstream tasks on embeddings eg classification, clustering etc
> Visual question answering systems
> Challenges
> Data bias/ out of distribution data
> Hard to get enough data -> leads to using noisy internet data

@johko
Owner

johko commented Oct 11, 2023

Hey,

thanks for the great outline @pedrogengo . Here are my thoughts:

Introduction
I think you can keep the introduction shorter, as we have a chapter "Connecting Text and Vision", which (I suppose) will talk about most things you mentioned.
Maybe your introduction can focus on model history (which you already planned as one point), covering a bit what happened before CLIP.
Of course if you want to make sure, feel free to reach out to someone from the other group to see what they plan on covering.

CLIP
They are all totally valid points to cover, but as Merve also said, try not to get too carried away with it.

Losses/ self supervised learning
Really nice idea of covering that here, love it ❤️

Relatives
The related models seem a bit one-sided to me; BLIP, IDEFICS, and LLaVA basically cover the same task.
Maybe you can also focus on models that are available in transformers (which would rule out ImageBind and LLaVA).
Some alternative suggestions from my side:

  • Donut or Nougat (Document Analysis)
  • GroupViT or OneFormer (Segmentation)
  • ALIGN (as a CLIP alternative)

but those are just some suggestions, feel free to have a look at the transformers docs in the multimodal section:
https://huggingface.co/docs/transformers

Applications
Looks good overall. Keep in mind that we do have a dedicated Zero Shot Computer Vision section, so you don't necessarily need to cover these kinds of applications, plus you might already cover some cases in the section about models above.

Challenges
Looking good 👍

Hope that helps you :)

@ahmadmustafaanis

ahmadmustafaanis commented Oct 16, 2023

Thought:

> Relatives
> Image-bind
> BLIP
> OWL-VIT
> Flamingo (IDEFICS)
> LLaVa

Maybe we can divide it into better sections like the ones below and add models to each.

  1. Foundational Models
  2. VQA
  3. Image Captioning
  4. Video Captioning
  5. Diffusion Models (we already have a chapter for this)

@johko
Owner

johko commented Oct 17, 2023

> Maybe we can divide it into better sections like the ones below and add models to each.
>
>   1. Foundational Models
>   2. VQA
>   3. Image Captioning
>   4. Video Captioning
>   5. Diffusion Models (we already have a chapter for this)

In general a good idea, but the main problem I see with this is that many new models focus on the "foundation" part, so most are able to perform many tasks at once by now.
I don't know of many models that focus only on things like VQA and Image Captioning.

I think the most important part here is to cover models that are good representatives for different common architectures or training strategies, so people taking the course get an overview of what is out there.
