Multimodal Models - CLIP and relatives #29

Closed
pedrogengo opened this issue Oct 10, 2023 · 4 comments
Labels: Chapter Content (Discuss and track the content of a chapter)

Comments

@pedrogengo
Contributor

Hello!

Inspired by #19 and #28, my fellow collaborators and I have also outlined a course curriculum for our section, but we would like some input and feedback from the HF team before we finalize it and start working on it. This is our chosen structure so far.

Introduction

  • Motivation for multimodality
  • History of multimodal models
  • Self-supervised learning enabling multimodality

CLIP

  • Intro to CLIP (ELI5)
  • Theory behind CLIP (contrastive loss, embeddings, etc.)
  • Variations of CLIP backbones
  • How tokenisation and embeddings work in CLIP
  • Applications of CLIP:
    • Search and retrieval
    • Zero-shot classification (see the sketch after this list)
    • CLIP guidance (using CLIP in other models to guide generation, e.g. DALL-E, Stable Diffusion)
  • Fine-tuning CLIP (OpenCLIP and other variants?)
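
To make the zero-shot classification item concrete, here is a minimal sketch using the transformers CLIPModel and CLIPProcessor. The checkpoint, image URL, and candidate labels are just illustrative choices, not part of the outline:

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP checkpoint on the Hub would work similarly
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Example image and candidate labels (assumptions for this sketch)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, turned into probabilities over the candidate labels
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```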

Losses / self-supervised learning

  • Contrastive (see the sketch after this list)
  • Non-contrastive
  • Triplet
  • One or two others
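
As a reference point for the contrastive item above, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss in PyTorch; the function name and temperature value are placeholders for illustration:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    # Normalize so the dot product becomes cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; matching pairs sit on the diagonal
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```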

Relatives

  • ImageBind
  • BLIP
  • OWL-ViT
  • Flamingo (IDEFICS)
  • LLaVA

Practical applications & challenges

  • Applications
    • Image search engine based on textual prompts (see the sketch after this list)
    • Downstream tasks on embeddings, e.g. classification, clustering
    • Visual question answering systems
  • Challenges
    • Data bias / out-of-distribution data
    • Hard to get enough data, which leads to using noisy internet data
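
For the text-to-image search application, a minimal retrieval sketch could look like the following; it assumes a small in-memory list of PIL images and reuses CLIP embeddings with cosine similarity (all names here are illustrative):

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(images):
    # images: a list of PIL.Image objects (assumed to be loaded elsewhere)
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return F.normalize(feats, dim=-1)

def search(query, image_index, top_k=5):
    # image_index: tensor of normalized image embeddings from embed_images()
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_feat = F.normalize(model.get_text_features(**inputs), dim=-1)
    # Cosine similarity between the query and every indexed image
    scores = (text_feat @ image_index.t()).squeeze(0)
    top = scores.topk(min(top_k, scores.numel()))
    return top.indices.tolist(), top.values.tolist()
```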

References:

@mattmdjaga @froestiago

@alperenunlu added the Chapter Content label on Oct 10, 2023
@merveenoyan
Collaborator

Hello 👋
I think I'll comment on every chapter if it's ok.

The introduction seems very fine.

> Introduction
> Motivation for multimodality
> History of multimodal models
> Self supervised learning enabling multi-modality
>
> CLIP

It's very CLIP-focused, so it would be nice to be less specific IMO. I think that thanks to CLIP we have many multimodal models these days, but maybe keep it brief? Not sure, we can decide during the writing process as well.

> Intro to (ELI5)
> Theory behind clip (contrastive loss, embeddings, etc)
> Variations of CLIP backbones
> How tokenisation and embeddings work in clip
> Applications of clip:
> Search and retrieve
> Zero shot classification
> Clip guidance (Using clip in other models to guide generation, DALLE, SD etc)
> Fine-tuning clip (Open-clip, and other variants?)

This section is nice.

> Losses/ self supervised learning
> Contrastive
> Non contrastive
> Triplet
> One or two other ones
>
> Relatives

This section is nice, maybe make sure it doesn't overlap with the section where we talk about existing architectures or foundation models.

> Image-bind
> BLIP
> OWL-VIT
> Flamingo (IDEFICS)
> LLaVa
>
> Practical applications & challenges

Maybe keep this brief and explain more in the Computer Vision in the Wild section, WDYT? Also pinging @johko

> Applications
> Search image engine based on textual prompts
> Downstream tasks on embeddings eg classification, clustering etc
> Visual question answering systems
> Challenges
> Data bias/ out of distribution data
> Hard to get enough data -> leads to using noisy internet data

@johko
Owner

johko commented Oct 11, 2023

Hey,

thanks for the great outline @pedrogengo . Here are my thoughts:

Introduction
I think you can keep the introduction shorter, as we have a chapter "Connecting Text and Vision", which (I suppose) will talk about most things you mentioned.
Maybe your introduction can focus on model history (which you already planned as one point), covering a bit what happened before CLIP.
Of course if you want to make sure, feel free to reach out to someone from the other group to see what they plan on covering.

CLIP
They are all totally valid points to cover, but as Merve also said, try not to get too carried away with it.

Losses/ self supervised learning
Really nice idea of covering that here, love it ❤️

Relatives
The related models seem a bit one-sided to me; BLIP, IDEFICS, and LLaVA basically cover the same task.
Maybe you can also focus on models that are available in transformers (which would rule out ImageBind and LLaVA).
Some alternative suggestions from my side:

  • Donut or Nougat (Document Analysis)
  • GroupViT or OneFormer (Segmentation)
  • ALIGN (as a CLIP alternative)

but those are just some suggestions, feel free to have a look at the transformers docs in the multimodal section:
https://huggingface.co/docs/transformers

Applications
Looks good overall. Keep in mind that we do have a dedicated Zero Shot Computer Vision section, so you don't necessarily need to cover these kinds of applications, plus you might already cover some cases in the section about models above.

Challenges
Looking good 👍

Hope that helps you :)

@ahmadmustafaanis

ahmadmustafaanis commented Oct 16, 2023

Thought:

> Relatives
> Image-bind
> BLIP
> OWL-VIT
> Flamingo (IDEFICS)
> LLaVa

Maybe we can divide it into better sections like the ones below and add models to each.

  1. Foundational Models
  2. VQA
  3. Image Captioning
  4. Video Captioning
  5. Diffusion Models (we already have a chapter for this)

@johko
Owner

johko commented Oct 17, 2023

> Maybe we can divide it into better sections like the ones below and add models to each.
>
>   1. Foundational Models
>   2. VQA
>   3. Image Captioning
>   4. Video Captioning
>   5. Diffusion Models (we already have a chapter for this)

In general a good idea, but the main problem I see with this is that many new models focus on the "foundation" part, so most are able to perform many tasks at once by now.
I don't know of many models that focus only on things like VQA and Image Captioning.

I think the most important part here is to cover models that are good representatives for different common architectures or training strategies, so people taking the course get an overview of what is out there.
