# LXMERT: Learning Cross-Modality Encoder Representations from Transformers
### By Mulla Meharaj

### Overview
LXMERT, short for **Learning Cross-Modality Encoder Representations from Transformers**, is a deep learning framework that jointly models **visual content** from images and **natural language** from text. I especially like it for tasks such as **Visual Question Answering (VQA)**, where the model has to connect the words of a question to specific regions of an image.

### Motivation
Traditional models process vision and language separately, which limits their ability to handle tasks requiring joint understanding—like answering “What is the person doing in the image?” These tasks demand integrated reasoning across both modalities.

LXMERT addresses this by learning joint image-text representations through a Transformer-based architecture. It combines a language encoder, an object-relationship encoder, and a cross-modality encoder to align visual and linguistic features.

Pre-trained on over 9 million image-sentence pairs with masked language modeling, masked object prediction, cross-modality matching, and image question answering, LXMERT captures both intra-modality and cross-modality relationships. At the time of publication (2019), this gave state-of-the-art results on benchmarks such as VQA, GQA, and NLVR2, showcasing its strength in vision-language reasoning.


### Model Architecture
#### Encoders in LXMERT
- **Language Encoder**: Based on a Transformer architecture, it processes tokenized text inputs using WordPiece embeddings and positional encodings. It captures contextual word representations using multiple layers of self-attention and feed-forward sub-layers, following the BERT-style design.
- **Object-Relationship Encoder**:  Instead of raw pixel data, images are represented as a sequence of object features extracted via a pre-trained Faster R-CNN detector. Each object is embedded with both its visual (RoI) features and spatial (bounding box) coordinates, allowing the encoder to model inter-object relationships using self-attention layers.
- **Cross-Modality Encoder**: The core of LXMERT, this encoder fuses visual and textual information through bi-directional cross-attention—from language to vision and vice versa—followed by self-attention within each modality. This enables rich, aligned representations that capture interactions between specific words and image regions.

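- Image → object features → object-relationship encoder
- Text → token features → language encoder
- The two single-modality encoders run in parallel, and their outputs feed the cross-modality encoder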

![Model Diagram](model.png)


### Input Embeddings
#### Text Input (Word-Level)
- WordPiece Tokenizer: As described, LXMERT uses the WordPiece tokenizer, which splits input text into a series of tokens. This is particularly beneficial for handling rare words by breaking them down into more common subword units.
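
To make the subword idea concrete, here is a minimal sketch of greedy longest-match-first splitting, the matching strategy commonly used with WordPiece vocabularies. The tiny `vocab` and the `wordpiece` helper are hypothetical and only for illustration; LXMERT's real vocabulary is the BERT one and is far larger.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first search.

    Pieces that continue a word carry the "##" prefix, as in BERT vocabularies.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No known piece covers this position: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real one.
vocab = {"snow", "##board", "##ing", "play", "the"}

print(wordpiece("snowboarding", vocab))  # ['snow', '##board', '##ing']
```

A rare word like "snowboarding" never needs its own vocabulary entry, because it can be assembled from pieces the model has seen many times.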

#### Embedding Process
- Token Embeddings: After tokenization, each token is transformed into a continuous representation (embedding).
- Positional Embeddings: Positional information is crucial because the model needs to understand the order of words in a sentence. Each token position in the sentence is embedded as a position vector.
- Combination: The token embeddings and positional embeddings are summed to produce the final word-level sentence embeddings. This combination allows the model to interpret both the identity and order of the words.
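
The token-plus-position sum can be sketched in a few lines. This is only an illustration under stated assumptions: the tables are random stand-ins for learned weights, the sizes are toy values, and the token ids are made up. The paper also layer-normalizes the sum, which is left out here to keep the focus on the combination step. Plain Python lists keep the sketch dependency-free; a real implementation would use a tensor library.

```python
import random


def make_table(num_rows, dim, rng):
    """Random stand-in for a learned embedding table (one row per id)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def embed_sentence(token_ids, token_table, position_table):
    """Sum the token embedding and the position embedding at each index."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]


rng = random.Random(0)
VOCAB_SIZE, MAX_LEN, DIM = 10, 8, 4  # toy sizes, not LXMERT's
token_table = make_table(VOCAB_SIZE, DIM, rng)
position_table = make_table(MAX_LEN, DIM, rng)

token_ids = [2, 5, 5, 7]  # hypothetical WordPiece ids; id 5 appears twice
embeddings = embed_sentence(token_ids, token_table, position_table)
```

Note that the repeated token (id 5) gets a different vector at each position, which is exactly what lets the model tell word order apart.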


#### Image Input (Object-Level)
- Feature Extraction with Faster R-CNN: LXMERT uses a pre-trained Faster R-CNN model to detect objects within the image. For each detected object, the model extracts a Region-of-Interest (RoI) feature vector, which serves as a compact representation of the object's appearance.

#### Embedding Process
- RoI Features: These features are direct outputs from the Faster R-CNN and represent the visual characteristics of the objects detected in the image.
- Position Embeddings: In addition to visual features, spatial information from bounding box coordinates (like x, y positions of the bounding box, width, and height) is projected by a learned linear layer into position embeddings.
- Combination: In the paper, the projected RoI features and the position embeddings are each layer-normalized and then averaged to form the final object-level embeddings. This results in embeddings that encapsulate not just what the object looks like, but where it is located within the image.
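
As a rough sketch of the object-level embedding, the snippet below projects an RoI feature and a bounding box to the same size, layer-normalizes each, and averages them, which is how the paper describes the combination. Everything concrete here is an assumption for illustration: the weights are random, the dimensions are toy values, the layer norm has no learned gain or bias, and the `[x, y, width, height]` box layout simply follows the description above.

```python
import math
import random


def linear(x, weight, bias):
    """Affine map y = W x + b, with W stored as a list of output rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def layer_norm(x, eps=1e-12):
    """Normalize to zero mean and unit variance (learned gain/bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def embed_object(roi_feature, box, params):
    """Average the normalized appearance and position projections."""
    feat = layer_norm(linear(roi_feature, params["w_feat"], params["b_feat"]))
    pos = layer_norm(linear(box, params["w_pos"], params["b_pos"]))
    return [(f + p) / 2 for f, p in zip(feat, pos)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
ROI_DIM, BOX_DIM, DIM = 6, 4, 4  # toy sizes; real RoI features are much larger
params = {
    "w_feat": random_matrix(DIM, ROI_DIM, rng),
    "b_feat": [0.0] * DIM,
    "w_pos": random_matrix(DIM, BOX_DIM, rng),
    "b_pos": [0.0] * DIM,
}

roi_feature = [rng.uniform(0, 1) for _ in range(ROI_DIM)]  # stand-in for detector output
box = [0.1, 0.2, 0.5, 0.6]        # hypothetical normalized [x, y, width, height]
moved_box = [0.6, 0.2, 0.5, 0.6]  # same object appearance, different location

obj = embed_object(roi_feature, box, params)
moved = embed_object(roi_feature, moved_box, params)
```

The same appearance feature yields a different embedding once the box moves, so the encoder can reason about where an object is as well as what it is.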


### Attention Layers
- **Self-Attention**: Models internal relationships (within text or image).
- **Cross-Attention**: Models inter-modal relationships (between image and text).

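In every attention layer, each query vector is scored against a set of keys, and the resulting weights are used to take a weighted average of the matching values. In cross-attention the queries come from one modality and the keys and values from the other, in both directions (Text ↔ Image).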
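
The difference between the two attention types comes down to where the keys and values come from. The sketch below uses standard scaled dot-product attention with a single head and no learned projections, which is a simplification I am assuming for clarity, not LXMERT's exact layer. The vectors are made-up toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention where context serves as both keys and values."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in context
        ]
        weights = softmax(scores)
        # Weighted average of the context vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return outputs


text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy word vectors
image = [[0.5, 0.5], [2.0, 0.0]]             # 2 toy object vectors

text_self = attend(text, text)         # self-attention: text attends to text
text_from_image = attend(text, image)  # cross-attention: text queries, image context
image_from_text = attend(image, text)  # cross-attention: image queries, text context
```

Each output row has one vector per query, so `text_from_image` keeps the length of the sentence while its contents are built entirely from image vectors.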

### Pretraining Tasks
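LXMERT is pre-trained on **5 tasks** to build robust cross-modal understanding (Masked Object Prediction below counts as two of them):

1. **Masked Cross-Modality Language Modeling** (like BERT, but a masked word can also be inferred from the image)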
2. **Masked Object Prediction**
   - Feature regression (RoI features)
   - Label classification (object class)
3. **Cross-Modality Matching**
4. **Image Question Answering**

For example, the model predicts a masked word or a masked object using both the visual and the language context.
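
The masking step behind the first two tasks can be sketched as follows. This is a simplified illustration: the helper names are hypothetical, every selected word is replaced by `[MASK]` (BERT-style pipelines sometimes keep or swap the word instead), and masked objects have their features zeroed. The 0.15 masking rate is the one reported in the paper.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob, rng):
    """Replace random tokens with [MASK]; return the masked list and the targets."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[i] = tok  # the model must recover this word
        else:
            masked.append(tok)
    return masked, targets


def mask_objects(features, mask_prob, rng):
    """Zero out random object feature vectors; return masked features and targets."""
    masked, targets = [], {}
    for i, feat in enumerate(features):
        if rng.random() < mask_prob:
            masked.append([0.0] * len(feat))
            targets[i] = feat  # regression target for the masked object
        else:
            masked.append(list(feat))
    return masked, targets


rng = random.Random(0)
tokens = ["what", "is", "the", "person", "doing"]
object_features = [[0.3, 0.9], [0.7, 0.1], [0.5, 0.5]]  # toy RoI features

masked_tokens, word_targets = mask_tokens(tokens, 0.15, rng)
masked_objects, object_targets = mask_objects(object_features, 0.15, rng)
```

Because the unmasked modality stays visible, a hidden word can be recovered from the image and a hidden object from the sentence, which is what pushes the model toward cross-modal alignment.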

![Pretraining Tasks](preTraining.png)


### Datasets Used for Pretraining
- **MS COCO**, **Visual Genome**
- QA datasets: **VQA**, **GQA**, **VG-QA**
- Total: **~9.18 million** image-sentence pairs over **180K** distinct images

### Evaluation Highlights
| Dataset | Metric | LXMERT Performance |
|--------|--------|--------------------|
| VQA    | Accuracy | 72.5% (SOTA)       |
| GQA    | Accuracy | 60.3%              |
| NLVR2  | Accuracy | 76.2%              |

At publication, LXMERT outperformed prior models, most strikingly on the complex reasoning dataset **NLVR2**, where it improved the previous best accuracy by 22% absolute (from 54% to 76%).

### Summary
LXMERT (Learning Cross-Modality Encoder Representations from Transformers) is a foundational model designed to bridge the gap between vision and language in AI. It leverages the power of Transformers, a state-of-the-art architecture in sequence modeling, to build a multi-modal understanding of image-text pairs.