Challenges:
- **Representation**: summarizing multimodal data effectively while capturing the intricate connections among the elements of each modality.
- **Alignment**: identifying connections and interactions across all elements.

Difficulties:
- One modality may dominate others.
- Additional modalities can introduce noise.
- Full coverage over all modalities is not guaranteed.
- Different modalities can have complicated relationships.



# Mechanism

For a Vision Language Model (VLM) to work, text and images must be combined in a meaningful way so that they can be learned jointly. How can we do that? One simple and common approach, given image-text pairs, is:

- Extract image and text features using image and text encoders. The image encoder can be a CNN- or transformer-based architecture.
- Learn the vision-language correlation with certain pre-training objectives, which can be divided into three groups:
  - **Contrastive** objectives train VLMs to learn discriminative representations by pulling paired samples close together and pushing unpaired ones far apart in the embedding space.
  - **Generative** objectives make VLMs learn semantic features by training networks to generate image or text data.
  - **Alignment** objectives align the image-text pair via global image-text matching or local region-word matching in the embedding space.
- With the learned vision-language correlation, VLMs can be evaluated on unseen data in a zero-shot manner by matching the embeddings of any given images and texts.
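
To make the contrastive objective concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired embeddings. It is an illustration under stated assumptions, not the training code of any particular VLM: the function names are hypothetical, the embeddings are plain Python lists standing in for encoder outputs, and the temperature of `0.07` is an assumed default.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_logits(image_embs, text_embs, temperature=0.07):
    """Cosine similarity of every image with every text, divided by the temperature."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image losses.

    Assumes image_embs[k] and text_embs[k] form a matching pair.
    """
    logits = similarity_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))

# Toy batch: pairs that line up give a much lower loss than mismatched pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls each image towards its own caption and pushes it away from the other captions in the batch, which is the "pull paired samples close, push others far apart" behaviour described above.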


![image.png](attachment:image.png)

Existing research predominantly focuses on enhancing VLMs from three key perspectives:

- collecting large-scale informative image-text data.
- designing effective models for learning from big data.
- designing new pre-training methods and objectives for learning effective vision-language correlation.

VLM pre-training aims to learn image-text correlation, targeting effective zero-shot predictions on visual recognition tasks such as segmentation and classification.



# Strategies
We can group VLMs based on how they combine the two modalities:

- Translating images into embedding features that can be jointly trained with token embeddings.
- Learning good image embeddings that can work as a prefix for a frozen, pre-trained language model.
- Using a specially designed cross-attention mechanism to fuse visual information into layers of the language model.
- Combining vision and language models without any training.
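
As a rough illustration of the second strategy (image embeddings used as a prefix), the sketch below projects image features into the token-embedding dimension and prepends them to the text token embeddings. Everything here is hypothetical: the function names, the toy dimensions, and the plain-list projection matrix are assumptions made for the example, and in a real system the projection would be learned while the language model stays frozen.

```python
def project(features, weight):
    """Linear map: features (n x d_in) times weight (d_in x d_out)."""
    return [[sum(f * w for f, w in zip(row, col)) for col in zip(*weight)]
            for row in features]

def build_prefixed_sequence(image_features, text_token_embeddings, projection):
    """Map image features to the token-embedding size and place them before the text."""
    visual_prefix = project(image_features, projection)
    return visual_prefix + text_token_embeddings

# One image feature of size 3, projected to the (assumed) token-embedding size of 2.
image_features = [[1.0, 2.0, 3.0]]
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[0.5, 0.5], [0.1, 0.2]]
sequence = build_prefixed_sequence(image_features, text_tokens, projection)
```

The language model then consumes `sequence` like any other sequence of token embeddings, so only the projection needs to be trained.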

# Evaluation

Generally, VLMs are evaluated in two setups: **zero-shot prediction** and **linear probing**.
- Zero-shot prediction is the most common way to evaluate VLMs: we directly apply the pre-trained VLM to downstream tasks without any task-specific fine-tuning.
- In linear probing, we freeze the pre-trained VLM and train a linear classifier on the VLM-encoded embeddings to measure the quality of its representations.

How do we evaluate these models? We can check how they perform on datasets: for example, given an image and a question, the task is to answer the question correctly. We can also check how well these models reason about and answer questions on visual data.
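
The two evaluation setups can be sketched as follows. This is a toy illustration, not a benchmark harness: the function names are hypothetical, the embeddings are hand-written lists standing in for frozen encoder outputs, and the linear probe is a binary logistic regression trained with plain stochastic gradient descent (the learning rate and epoch count are arbitrary assumptions).

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_predict(image_emb, class_text_embs):
    """Zero-shot: pick the label whose text embedding is closest to the image."""
    return max(class_text_embs, key=lambda label: cosine(image_emb, class_text_embs[label]))

def train_linear_probe(embeddings, labels, lr=0.5, epochs=200):
    """Linear probing: fit a logistic-regression head on frozen embeddings."""
    weights, bias = [0.0] * len(embeddings[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            grad = 1.0 / (1.0 + math.exp(-z)) - y  # d(loss)/dz for the log loss
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias

def probe_accuracy(weights, bias, embeddings, labels):
    """Fraction of embeddings the linear head classifies correctly."""
    hits = 0
    for x, y in zip(embeddings, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        hits += int((z > 0) == bool(y))
    return hits / len(labels)

# Zero-shot: no training, only embedding matching.
prediction = zero_shot_predict([0.9, 0.2], {"cat": [1.0, 0.0], "dog": [0.0, 1.0]})

# Linear probe: the embeddings stay fixed, only the small classifier is trained.
frozen_embs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = [1, 1, 0, 0]
weights, bias = train_linear_probe(frozen_embs, labels)
accuracy = probe_accuracy(weights, bias, frozen_embs, labels)
```

In practice the probe would be scored on a held-out split rather than on its own training data; the toy example reuses the same four points only to stay short.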
