These lecture notes cover the **Data Efficient Image Transformer (DEiT)**, a model developed by Facebook AI Research (FAIR) in 2021. The lecture builds upon the previously studied **Vision Transformer (ViT)**, addressing its primary weakness: data inefficiency.

<a href="https://miro.com/app/board/uXjVJsY_MIU=/">Read here</a>
## I. Context: The Problem with Vision Transformers (ViT)

While Vision Transformers (ViT) achieved state-of-the-art results, they possessed significant drawbacks compared to Convolutional Neural Networks (CNNs):

1.  **Data Hunger:** The original ViT required a massive dataset (JFT-300M, about 300 million private images) to perform well. The ViT paper itself notes that Transformers "do not generalize well when trained on insufficient amounts of data".
2.  **Computational Cost:** The largest ViT models were enormous (e.g., ViT-Huge with roughly 632 million parameters) and required significant GPU resources and time to train.
3.  **Lack of Inductive Bias:** CNNs possess inherent **Inductive Biases**, specifically **Locality** (adjacent pixels are assumed to be related) and **Translation Equivariance** (a shifted input produces a correspondingly shifted feature map, which, combined with pooling, lets the network recognize an object regardless of where it appears in the image). Transformers lack these biases because they treat images as sequences of patches with global attention. Consequently, ViT requires massive amounts of data to "learn" these spatial relationships from scratch.

**The Goal of DEiT:** To train a high-performance transformer on a standard-sized dataset (ImageNet-1k, ~1.2 million images) without requiring the massive JFT-300M dataset.

## II. The Solution: Knowledge Distillation

To solve the data efficiency problem, DEiT introduces **Knowledge Distillation** (specifically, a Teacher-Student model) into the Transformer training process.

### A. The Concept
*   **Teacher:** A large, pre-trained model (often a CNN like RegNet or ResNet). The teacher’s weights are frozen; it is not trained further.
*   **Student:** The DEiT model (Vision Transformer) which is trained from scratch.
*   **Objective:** The Student tries to replicate the predictions of the Teacher. By doing so, the Transformer (Student) inherits the robust representations and inductive biases of the CNN (Teacher) without changing its own architecture.

### B. Soft vs. Hard Labels
When the Teacher makes a prediction, it outputs logits, which a softmax turns into a probability distribution over the classes. That distribution can be used in two ways:
1.  **Soft Labels:** The full probability distribution (e.g., Cat: 0.8, Dog: 0.1, Background: 0.1). These contain rich information, indicating that the Teacher thinks the image looks *a little bit* like a dog.
2.  **Hard Labels:** Only the Teacher's top choice (the argmax), written as a one-hot vector (e.g., Cat: 1, Dog: 0, Background: 0).

**Surprising Finding:** The DEiT paper found that **Hard Distillation** (using the Teacher's hard decisions) often worked better than Soft Distillation when the Teacher was a CNN and the Student was a Transformer.
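
The difference between the two label types can be sketched in a few lines of plain Python. The helper name `to_hard_label` and the three-class setup are assumptions made for illustration; the probabilities are the ones from the example above.

```python
def to_hard_label(soft_label):
    """Keep only the teacher's top class as a one-hot vector."""
    top = max(range(len(soft_label)), key=lambda i: soft_label[i])
    return [1 if i == top else 0 for i in range(len(soft_label))]


classes = ["cat", "dog", "background"]
soft_label = [0.8, 0.1, 0.1]  # teacher probabilities from the example above
hard_label = to_hard_label(soft_label)

for name, soft, hard in zip(classes, soft_label, hard_label):
    print(f"{name:<10} soft={soft:.1f}  hard={hard}")
```

The hard label discards the teacher's residual belief in "dog", so the student sees only the teacher's decision rather than its uncertainty.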

## III. DEiT Architecture: The Distillation Token

DEiT keeps the standard ViT architecture almost unchanged and introduces one specific addition to facilitate distillation.

### A. The Extra Token
Standard ViT uses a **Class Token (`[CLS]`)** to aggregate information for classification. DEiT adds a second special token called the **Distillation Token (`[DIST]`)**.

1.  **Initialization:** The Distillation token is **randomly initialized**, just like the CLS token. It is *not* initialized using CNN features (which would be cheating).
2.  **Processing:** Both tokens interact with the image patches via Self-Attention layers throughout the network.
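3.  **Convergence:** One might expect the CLS and DIST tokens to converge to the same vector. However, the authors report that the learned input tokens have a low average cosine similarity (0.06). Their representations become more similar layer by layer and reach a cosine similarity of 0.93 at the final layer, which is high but still below 1. The two tokens therefore carry related yet distinct information, because they target different objectives.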

### B. Output Structure
At the final layer, the model produces two separate context vectors:
1.  **CLS Token Output:** Fed into a classification head to predict the **Ground Truth**.
2.  **DIST Token Output:** Fed into a separate distillation head to predict the **Teacher's Label**.
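
As a worked equation, the two-token design can be written in notation borrowed from the ViT paper. The symbols below ($E$ for the patch projection, $E_{pos}$ for the position embeddings, $W_{cls}$ and $W_{dist}$ for the two linear heads) are assumptions chosen for illustration, since these notes do not fix a notation, and any final normalization layer is omitted for brevity.

$$ z_0 = [\,x_{cls};\; x_{dist};\; x_p^1 E;\; \dots;\; x_p^N E\,] + E_{pos}, \qquad E_{pos} \in \mathbb{R}^{(N+2) \times D} $$

$$ \hat{y}_{cls} = W_{cls}\, z_L^{(cls)}, \qquad \hat{y}_{dist} = W_{dist}\, z_L^{(dist)} $$

Here $N$ is the number of patches, $D$ is the embedding width, and $z_L^{(cls)}$ and $z_L^{(dist)}$ are the final-layer outputs at the two token positions. Relative to ViT, the sequence has $N+2$ entries instead of $N+1$, and a second head reads the extra position.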

## IV. The Loss Function

The training objective is a composite loss function that balances learning from the truth and learning from the teacher.

The total loss $\mathcal{L}$ is defined as:
$$ \mathcal{L} = \alpha \times \mathcal{L}_{CE} + (1-\alpha) \times \mathcal{L}_{Teacher} $$

1.  **$\mathcal{L}_{CE}$ (Cross Entropy):** Calculates the error between the **CLS Token's prediction** and the **Ground Truth** (True Labels).
2.  **$\mathcal{L}_{Teacher}$ (Distillation Loss):** Calculates the error between the **DIST Token's prediction** and the **Teacher's prediction**.
    *   For soft distillation this is the **Kullback-Leibler (KL) Divergence**, which measures how one probability distribution differs from another. For hard distillation it is an ordinary cross entropy against the Teacher's hard label (its argmax).
    *   If the Student makes a confident prediction that conflicts with the Teacher (e.g., the Student says "Dog" with near-100% certainty while the Teacher spreads its probability over several classes), the KL term becomes very large, because the Student assigns almost no probability to classes the Teacher considers plausible.
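
The composite loss can be checked numerically with a small plain-Python sketch. It follows the formula above with the soft (KL) variant and works directly on probability vectors. The function names, the three-class probabilities, and `alpha=0.5` are all hypothetical values chosen for illustration. Implementations of soft distillation commonly also scale the KL term by $\tau^2$, which is left out here to match the formula in these notes.

```python
import math


def cross_entropy(probs, target_index):
    """Cross entropy against a one-hot target: -log p(true class)."""
    return -math.log(probs[target_index])


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student); classes with zero teacher mass contribute 0."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


def total_loss(cls_probs, dist_probs, teacher_probs, target_index, alpha=0.5):
    """alpha * L_CE + (1 - alpha) * L_Teacher."""
    l_ce = cross_entropy(cls_probs, target_index)
    l_teacher = kl_divergence(teacher_probs, dist_probs)
    return alpha * l_ce + (1 - alpha) * l_teacher


# Hypothetical 3-class example (cat, dog, background); the true class is cat.
teacher_probs = [0.50, 0.45, 0.05]      # teacher is unsure between cat and dog
cls_probs = [0.70, 0.20, 0.10]          # student's CLS head
aligned_dist = [0.50, 0.45, 0.05]       # DIST head matches the teacher
confident_dist = [0.001, 0.998, 0.001]  # DIST head is sure it is a dog

loss_aligned = total_loss(cls_probs, aligned_dist, teacher_probs, target_index=0)
loss_conflict = total_loss(cls_probs, confident_dist, teacher_probs, target_index=0)
print(f"aligned:  {loss_aligned:.3f}")
print(f"conflict: {loss_conflict:.3f}")
```

When the DIST head matches the teacher, the KL term vanishes and only the cross-entropy part remains. The overconfident "dog" prediction is penalized mainly through the `0.50 * log(0.50 / 0.001)` term.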

**Key Hyperparameter:** The **Temperature ($\tau$)** divides the logits before the Softmax in soft distillation and controls the "sharpness" of the resulting probability distribution. A higher temperature makes the distribution softer (flatter), while a lower temperature makes it sharper (peakier).
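
The effect of the temperature is easy to see on a toy example. The sketch below is plain Python, and the logits and the three temperature values are hypothetical.

```python
import math


def softmax_with_temperature(logits, tau):
    """Softmax over logits / tau; tau > 1 flattens, tau < 1 sharpens."""
    scaled = [z / tau for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 1.0]  # hypothetical teacher logits
for tau in (0.5, 1.0, 4.0):
    probs = softmax_with_temperature(logits, tau)
    print(f"tau={tau}: {[round(p, 3) for p in probs]}")
```

As `tau` grows, the top class loses probability mass to the others, which exposes more of the teacher's relative preferences among the non-winning classes.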

## V. Implementation & Coding Insights

The lecture concludes with a "from scratch" implementation in PyTorch. Key takeaways include:

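1.  **Teacher Model:** A pre-trained ResNet50 is used. The classification layer (last fully connected layer) is replaced to match the number of classes in the current dataset (e.g., 10 for MNIST or CIFAR-10).
2.  **Freezing:** It is crucial to freeze the Teacher's parameters (set `requires_grad = False`) so backpropagation only updates the Student. The Teacher should also be put in evaluation mode (`teacher.eval()`) so that layers such as BatchNorm and Dropout behave deterministically and do not update their statistics.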
3.  **Loss Calculation:**
    *   The model returns two sets of logits: `cls_logits` and `dist_logits`.
    *   `cls_logits` are compared against Ground Truth ($y$).
    *   `dist_logits` are compared against the Teacher's output ($y_{teacher}$).
4.  **Inference Strategy:** At prediction time (inference), you have two heads. You can:
    *   Use only the CLS token.
    *   Use only the DIST token.
    *   Average both (Fusion).
    *   *Result:* Empirical results often show that averaging both or using the DIST token alone yields higher accuracy than the CLS token alone.
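
The three inference options can be sketched in plain Python. The function `predict`, its `mode` argument, and the example logits are hypothetical. Averaging the two heads' softmax outputs is one reasonable reading of "Average both".

```python
import math


def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(cls_logits, dist_logits, mode="fusion"):
    """Return the predicted class index from one head or from both."""
    cls_probs = softmax(cls_logits)
    dist_probs = softmax(dist_logits)
    if mode == "cls":
        scores = cls_probs
    elif mode == "dist":
        scores = dist_probs
    elif mode == "fusion":
        scores = [(c + d) / 2 for c, d in zip(cls_probs, dist_probs)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical outputs of the two heads for one image (3 classes).
cls_logits = [1.2, 1.0, -0.5]   # CLS head narrowly prefers class 0
dist_logits = [0.2, 2.5, -1.0]  # DIST head clearly prefers class 1

for mode in ("cls", "dist", "fusion"):
    print(mode, predict(cls_logits, dist_logits, mode))
```

In this toy case the heads disagree, and the fused prediction follows the more confident DIST head because averaging probabilities weights each head by how peaked its distribution is.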

## VI. Summary of Impact

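DEiT proved that Transformers *can* be data-efficient. By using a CNN teacher to guide the learning process (injecting inductive bias indirectly via the loss function), the distilled DEiT-B reached higher ImageNet-1k accuracy than its CNN teacher (a RegNetY in the paper) and than a ViT trained on ImageNet-1k alone, while staying competitive with EfficientNet. It did so with far fewer parameters than the largest ViT (about 86M for DEiT-B vs. roughly 632M for ViT-Huge) and with much shorter training times.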