# 📄 Paper Summary: InternVideo2 – Scaling Video Foundation Models for Multimodal Understanding  

**Title**: InternVideo2: Scaling Video Foundation Models for Multimodal Understanding  
**Authors**: Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, et al.  
**Published in**: ECCV 2024  
**Link**: [https://arxiv.org/abs/2403.15377](https://arxiv.org/abs/2403.15377)  

---

## ✅ Day 1 – Abstract & Introduction  

### 📌 Background & Motivation  
- Vision–language foundation models such as **CLIP** and **BLIP-2** align image and text well, and **InternVideo v1** carried the same recipe over to video–text.  
- However, **videos** introduce additional complexity: temporal dynamics, long-range dependencies, and multimodal cues (e.g., sound, speech).  
- Existing models often separate objectives (contrastive, masked modeling, captioning), leading to inefficiency and limited generalization.  
- There is a need for a **scalable, unified video foundation model** that learns from large, multimodal data.

---

### 📌 Core Idea  
InternVideo2 proposes a **progressive multimodal training framework** integrating:
1. **Masked Video Modeling (MVM)** – learn temporal–spatial representation via reconstruction.  
2. **Multimodal Contrastive Learning** – align visual, textual, and auditory modalities.  
3. **Next-Token Prediction** – enable generative reasoning for tasks like captioning and dialogue.  

This unified design builds robust temporal understanding and cross-modal alignment in a single model.
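
The staged ordering above can be pictured as a simple step-based schedule. This is only a sketch: the `Stage` class, the `active_stage` helper, and the step counts are hypothetical placeholders for illustration, not values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str       # label used in these notes
    objective: str  # loss optimized during this stage
    steps: int      # hypothetical stage length, NOT from the paper


# Order follows the notes: reconstruction -> alignment -> generation.
SCHEDULE = [
    Stage("stage1", "masked_video_modeling", steps=1000),
    Stage("stage2", "multimodal_contrastive", steps=1000),
    Stage("stage3", "next_token_prediction", steps=500),
]


def active_stage(global_step, schedule=SCHEDULE):
    """Return the stage that owns `global_step` (0-indexed)."""
    boundary = 0
    for stage in schedule:
        boundary += stage.steps
        if global_step < boundary:
            return stage
    return schedule[-1]  # past the end: stay in the final stage
```

A training loop would call `active_stage(step).objective` to decide which loss to compute at each step.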

---

### 📌 Contributions  
1. A **unified multimodal video–language model** trained on **400M video–text pairs**.  
2. **Progressive training** combining reconstruction, contrastive, and generative objectives.  
3. **Scalable architecture** up to 6B parameters with strong generalization across 60+ benchmarks.  
4. Extensive evaluation on **retrieval**, **captioning**, **action recognition**, and **video QA**.

---

### 📌 Notes  
- Mirrors the “foundation model” trend seen in NLP.  
- Demonstrates that scaling data and parameters consistently improves video understanding.  
- Extends multimodal learning from image-text to **video–text–audio–speech** domains.

---

### 📌 TL;DR  
InternVideo2 = unified video–language–audio model trained progressively on 400M pairs, achieving SOTA in multimodal video understanding.

---

## ✅ Day 2 – Architecture & Method  

### 📌 Background  
To process videos efficiently, InternVideo2 extends ViT-style Transformers into **spatiotemporal and multimodal** architectures, allowing unified representation learning.

---

### 📌 Core Components  

| Module | Role | Key Mechanism |
|---------|------|----------------|
| **Spatial–Temporal Backbone (ST-Backbone)** | Encode spatiotemporal tokens | Factorized attention across space & time |
| **Cross-Modal Fusion Module (CMFM)** | Fuse video, text, audio, speech | Multimodal attention in shared latent space |
| **Hierarchical Temporal Encoder (HTE)** | Handle long-range dynamics | Temporal windowing + hierarchical aggregation |
| **Multitask Heads** | Connect tasks to shared backbone | MVM, contrastive, next-token prediction |
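
To see why factorized attention matters, it helps to count query–key pairs. The sketch below compares joint space–time attention with the factorized form (spatial attention per frame, then temporal attention per spatial location). The tubelet size `(2, 16, 16)` and the function names are assumptions for illustration, not settings taken from the paper.

```python
def num_tokens(frames, height, width, tubelet=(2, 16, 16)):
    """Return (temporal_tokens, spatial_tokens) for a clip cut into 3D patches."""
    t, h, w = tubelet
    if frames % t or height % h or width % w:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    return frames // t, (height // h) * (width // w)


def attention_pairs(frames, height, width, tubelet=(2, 16, 16)):
    """Count query-key pairs for joint vs. factorized attention in one layer."""
    T, N = num_tokens(frames, height, width, tubelet)
    joint = (T * N) ** 2               # every token attends to every token
    factorized = T * N * N + N * T * T  # spatial per frame + temporal per location
    return joint, factorized


joint, factorized = attention_pairs(frames=16, height=224, width=224)
print(joint, factorized, round(joint / factorized, 2))
```

For a 16-frame 224×224 clip this gives 8 temporal × 196 spatial tokens, so 2,458,624 pairs for joint attention versus 319,872 for the factorized form, roughly 7.7× fewer.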

---

### 📌 Method  

1. **Spatial–Temporal Encoding**  
   - Videos are divided into **3D patches** (spatial + temporal cubes).  
   - Apply **spatial attention** within each frame and **temporal attention** across frames.  
   - Factorizing attention cuts the pairwise cost from O((T·N)²) to O(T·N·(T+N)) for T temporal and N spatial tokens, which keeps long clips tractable.

   \[
   x^{(\ell)} = \text{TemporalAttn}\big(\text{SpatialAttn}(x^{(\ell-1)})\big)
   \]
   where ℓ indexes the Transformer block, not the frame.

2. **Cross-Modal Fusion**  
   - Each modality (Video, Text, Audio, Speech) attends to others through **shared embeddings**.  
   - Controlled by **gating mechanisms** to balance influences.  
   \[
   z = \text{Gate}(\text{Attention}(V, T, A, S))
   \]

3. **Temporal Hierarchy**  
   - Frames are chunked into temporal windows → locally encoded → globally aggregated.  
   - Similar to Temporal Pyramid Networks, but integrated into the ViT backbone.

4. **Multitask Heads**  
   - **MVM** for masked reconstruction  
   - **Contrastive** for multimodal alignment  
   - **Next-token prediction** for generation  
   - **Action classification** (optional supervised branch)
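
Two of the steps above, gated fusion and the window-then-aggregate temporal hierarchy, can be sketched in a few lines. This is a toy illustration under stated assumptions: a scalar sigmoid gate stands in for the learned gating, mean pooling stands in for the learned local and global encoders, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(video_feat, other_feat, gate_logit):
    """Blend two same-length feature vectors with a scalar sigmoid gate."""
    g = sigmoid(gate_logit)  # g -> 1 keeps video, g -> 0 keeps the other modality
    return [g * v + (1.0 - g) * o for v, o in zip(video_feat, other_feat)]


def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def hierarchical_pool(frame_feats, window):
    """Summarize each temporal window locally, then aggregate the summaries."""
    windows = [frame_feats[i:i + window] for i in range(0, len(frame_feats), window)]
    local = [mean_pool(w) for w in windows]  # one summary per window
    return local, mean_pool(local)           # global clip-level summary


frames = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [6.0, 8.0]]
local, clip = hierarchical_pool(frames, window=2)
print(local, clip)
```

In a real model both the gate and the pooling would be learned attention layers; the sketch only shows the data flow.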

---

### 📌 Notes  
- All tasks share the same backbone → efficient parameter sharing.  
- Progressive training (MVM → Contrastive → Generation) improves stability.  
- Architecture scales from Base (1B) to Giant (6B) with near-linear gains.

---

### 📌 TL;DR  
InternVideo2 unifies space–time–modality modeling with a shared transformer backbone and progressive multitask training.

---

## ✅ Day 3 – Pretraining Setup & Datasets  

### 📌 Background  
Scaling to **hundreds of millions of video–text pairs** is essential for generalization.  
InternVideo2 leverages **InternVid-400M**, one of the largest multimodal datasets ever assembled.

---

### 📌 Datasets  

| Dataset | Type | Size | Key Feature |
|----------|------|------|--------------|
| **InternVid-400M** | Video–Text | 400M | Core dataset; diverse sources, filtered captions |
| **WebVid2.5M** | Video–Text | 2.5M | Natural human actions |
| **CC3M / CC12M** | Image–Text | 15M | Extra textual grounding |
| **AudioSet** | Video–Audio | 2M | Adds audio modality |
| **HowTo100M** | Instructional Video | 100M | Rich temporal and linguistic alignment |

**Data diversity** → key to multimodal scalability.

---

### 📌 Pretraining Pipeline  

**Stage 1 – Masked Video Modeling (MVM)**  
- Learns robust spatial–temporal representations.  
- Loss:  
  \[
  \mathcal{L}_{MVM} = \| \hat{x}_{mask} - x_{mask} \|_2^2
  \]
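
A minimal sketch of the masked reconstruction loss, assuming each token is a small feature vector and that the squared error is averaged over the masked tokens (the formula above is unnormalized; averaging only changes it by a constant factor). The function name and toy inputs are illustrative.

```python
def masked_mse(pred, target, mask):
    """Squared L2 error on masked tokens only, averaged over the masked count.

    pred, target: lists of equal-length feature vectors, one per token.
    mask: list of bools, True where the token was hidden from the encoder.
    """
    if not (len(pred) == len(target) == len(mask)):
        raise ValueError("pred, target and mask must have the same length")
    errors = [
        sum((p - t) ** 2 for p, t in zip(p_vec, t_vec))
        for p_vec, t_vec, hidden in zip(pred, target, mask)
        if hidden  # visible tokens contribute nothing to the loss
    ]
    if not errors:
        raise ValueError("mask selects no tokens")
    return sum(errors) / len(errors)


pred = [[1.0, 2.0], [0.0, 0.0], [3.0, 3.0]]
target = [[1.0, 0.0], [5.0, 5.0], [3.0, 4.0]]
print(masked_mse(pred, target, mask=[True, False, True]))
```

The middle token is visible, so its large error is ignored; only the two masked tokens are scored.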

**Stage 2 – Multimodal Contrastive Learning**  
- Aligns video ↔ text ↔ audio ↔ speech.  
- CLIP-style loss:  
  \[
  \mathcal{L}_{CL} = -\log \frac{\exp(\text{sim}(v_i, t_i)/\tau)}{\sum_j \exp(\text{sim}(v_i, t_j)/\tau)}
  \]
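
The contrastive loss above can be sketched in plain Python, taking `sim` to be cosine similarity and implementing only the video→text direction as written (CLIP itself averages both directions). The default `tau=0.07` and the function names are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce(video_embs, text_embs, tau=0.07):
    """Average video->text contrastive loss; text i is the positive for video i."""
    losses = []
    for i, v in enumerate(video_embs):
        logits = [cosine(v, t) / tau for t in text_embs]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


videos = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(videos, texts := [[1.0, 0.0], [0.0, 1.0]]))  # matched pairs: near zero
print(info_nce(videos, texts[::-1]))                        # mismatched pairs: large
```

When every embedding is identical, the loss equals log(batch size), which is a handy sanity check for a fresh implementation.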

**Stage 3 – Next-Token Prediction**  
- Causal decoding for generative reasoning (captioning/dialogue).  
  \[
  \mathcal{L}_{GEN} = -\sum_t \log P(x_t \mid x_{<t}, v)
  \]
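
The generative loss is a sum of per-step negative log-likelihoods. The sketch below assumes the decoder has already produced a probability distribution over the vocabulary at each step; `sequence_nll` and the toy two-word vocabulary are illustrative only.

```python
import math


def sequence_nll(step_probs, target_ids):
    """Sum of -log P(target token) over decoding steps.

    step_probs[t] is the model's distribution over the vocabulary at step t,
    already conditioned on the earlier tokens (and any visual context).
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("need one distribution per target token")
    return -sum(math.log(probs[tok]) for probs, tok in zip(step_probs, target_ids))


# Two decoding steps over a vocabulary of size 2.
probs = [[0.5, 0.5], [0.25, 0.75]]
print(sequence_nll(probs, target_ids=[0, 1]))
```

A confident, correct prediction (probability 1.0 on the target) contributes zero loss, while low-probability targets are penalized sharply.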

---

### 📌 Training Configuration  

| Setting | Value |
|----------|--------|
| **Model size** | 1B → 6B parameters |
| **Optimizer** | AdamW |
| **Learning rate** | 1e-4 (cosine decay) |
| **Batch size** | 16K clips |
| **Hardware** | 1024 × A100 (80GB) |
| **Framework** | DeepSpeed + FlashAttention |
| **Training duration** | ≈ 2 months (Giant) |

**Efficiency**: FP16/BF16 mixed precision, gradient checkpointing, distributed contrastive memory bank.
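
The cosine-decay learning rate in the table can be sketched as a function of the step. Only the base rate of 1e-4 comes from the notes; the absence of warmup, the decay to zero, and the `total_steps` value in the example are assumptions.

```python
import math


def cosine_lr(step, total_steps, base_lr=1e-4):
    """Cosine decay from base_lr at step 0 down to 0 at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(1.0, max(0.0, step / total_steps))  # clamp to [0, 1]
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 25, 50, 75, 100):
    print(step, cosine_lr(step, total_steps=100))
```

The rate starts at the base value, passes through half of it at the midpoint, and reaches zero at the final step.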

---

### 📌 Fine-Tuning Strategy  

| Task | Dataset | Objective |
|------|----------|------------|
| Video–Text Retrieval | MSR-VTT, VATEX | Contrastive |
| Action Recognition | Kinetics-400, SSv2 | Cross-entropy |
| Captioning | MSVD, MSR-VTT | Causal LM |
| Video / Audio–Visual QA | NExT-QA, AVQA | Multimodal reasoning |

**Tip:** Freeze early spatial layers, fine-tune higher temporal and fusion modules (LoRA optional).
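
One way to apply the tip is to decide per parameter name whether it stays trainable. The sketch below is framework-agnostic; the parameter names (`spatial.0.`, `temporal.`, `fusion.`) are hypothetical and would need to match the actual checkpoint's naming.

```python
def trainable_flags(param_names, frozen_prefixes):
    """Map each parameter name to True (fine-tune) or False (freeze)."""
    prefixes = tuple(frozen_prefixes)  # str.startswith accepts a tuple of prefixes
    return {name: not name.startswith(prefixes) for name in param_names}


names = [
    "spatial.0.attn.weight",   # early spatial layers: frozen
    "spatial.1.attn.weight",
    "temporal.0.attn.weight",  # temporal and fusion modules: fine-tuned
    "fusion.gate.weight",
]
flags = trainable_flags(names, frozen_prefixes=["spatial.0.", "spatial.1."])
print(flags)
```

In PyTorch, the same flags could be applied by looping over `model.named_parameters()` and calling `param.requires_grad_(flags[name])`. The trailing dot in each prefix keeps `spatial.1.` from also matching `spatial.10.`.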

---

### 📌 Notes  
- Progressive curriculum improves optimization and generalization.  
- Massive, diverse data = strong zero-shot transfer across tasks.  
- Demonstrates scaling law: more data + parameters → better performance.

---

### 📌 TL;DR  
InternVideo2’s 400M multimodal pretraining combines the **MVM + CL + GEN** objectives on a massive unified dataset, enabling superior multimodal generalization.