# ELECTRA: pretraining text encoders as discriminators instead of generators

# Introduction

## Denoising objectives
- Select a subset of the input tokens, mask them and ask the network to predict the original unmasked input.
- Better than the autoregressive language modeling task because the network can use context from both directions.
- Currently the state-of-the-art pretraining objective in language representation learning.
- **Problem:** The network only learns (computes loss and gradients) from the subset of masked tokens, which is usually around 15% of the tokens.
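
To make the problem concrete, here is a minimal sketch of the masking step. It uses a toy 20-token input, and the helper names (`select_mask_positions`, `apply_mask`) are made up for illustration rather than taken from any library. It only shows how few positions end up contributing to the loss.

```python
import random

MASK = "[MASK]"

def select_mask_positions(num_tokens, mask_prob=0.15, seed=0):
    """Pick the positions to mask (about mask_prob of them, at least one)."""
    rng = random.Random(seed)
    k = max(1, round(num_tokens * mask_prob))
    return sorted(rng.sample(range(num_tokens), k))

def apply_mask(tokens, positions):
    """Return a copy of tokens with the chosen positions replaced by [MASK]."""
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK
    return masked

tokens = ["w%d" % i for i in range(20)]  # toy input, not real text
positions = select_mask_positions(len(tokens))
masked = apply_mask(tokens, positions)

# Only the masked positions contribute to the MLM loss.
loss_fraction = len(positions) / len(tokens)
print(masked)
print("fraction of tokens with a loss term:", loss_fraction)
```

With 20 tokens, 3 positions are masked, so 17 of the 20 predictions produce no training signal.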

## Replaced token detection
- **Solution:** The authors propose a new training objective, *replaced token detection*.
- In this task the network is asked to predict whether each input token is real or has been synthetically replaced, i.e. is a fake one.
- This task allows the network to learn from all input tokens rather than only the masked subset, which the authors show yields better sample efficiency and thus faster training.
- Another advantage is that models pretrained with this objective also achieve better performance when finetuned on downstream tasks.

# Replaced token detection pretraining task
- An additional generator network is added whose task is to predict the masked-out tokens, just as in a standard masked language modeling (MLM) task such as BERT's.
- The output sequence (original tokens and predicted tokens) from the generator is the input to the discriminator which is the model that we're actually interested in.
- The task of the discriminator is then to predict whether each token is real or if it has been synthetically replaced by the generator.
- Both of these networks are trained on their respective tasks simultaneously.
- Both networks are transformer based, but the generator is usually smaller than the discriminator.
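
The data flow described above can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' implementation: the "generator" here just samples uniformly from a tiny made-up vocabulary, whereas the real one is a trained MLM, and the function name `corrupt` is hypothetical.

```python
import random

def corrupt(tokens, vocab, mask_prob=0.15, seed=0):
    """Build the discriminator's input and per-token labels (1 = replaced)."""
    rng = random.Random(seed)
    k = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), k))

    corrupted = list(tokens)
    for i in positions:
        # Stand-in for sampling from the generator's output distribution.
        corrupted[i] = rng.choice(vocab)

    # A token is labeled as replaced only if it differs from the original.
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels, positions

tokens = "the chef cooked the meal".split()
vocab = ["the", "chef", "cooked", "ate", "meal"]  # toy vocabulary
corrupted, labels, positions = corrupt(tokens, vocab)

print(corrupted)
print(labels)  # one label per input token, so every position gets a loss term
```

Note that the discriminator receives a label for every position, not just the ones that were masked.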

## Overview
<img src="figs/electra/electra-fig-2.png"></img>

## Adversarial learning?
- Superficially there are similarities to GAN training, but there are key differences.
- If the generator happens to produce the original input token, that token is considered real, not fake, when computing the discriminator loss.
- The generator is not trained adversarially to fool the discriminator, as it would be in a GAN setup, but via standard maximum likelihood estimation.
- There are challenges with an adversarial training setup in that gradients can't backpropagate through the sampling step of the generator.
- The authors did an experiment where this challenge is circumvented using reinforcement learning but did not get good results.
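
For reference, the non-adversarial setup can be written as a single joint minimization. The notation below is how I recall it from the paper, so treat the symbols as assumptions: $\theta_G$ and $\theta_D$ are the generator and discriminator parameters, $\mathcal{X}$ is the text corpus, and $\lambda$ weights the discriminator loss.

$$
\min_{\theta_G,\, \theta_D} \; \sum_{x \in \mathcal{X}} \mathcal{L}_{\text{MLM}}(x, \theta_G) + \lambda \, \mathcal{L}_{\text{Disc}}(x, \theta_D)
$$

Both terms are minimized, and the discriminator loss is not backpropagated into the generator. In a GAN the generator would instead be trained to maximize the discriminator's loss.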

# Experiments
- After pretraining, only the discriminator is kept and then finetuned on the downstream tasks.
- Finetune and evaluate on SQuAD and GLUE tasks by adding simple linear classifiers on top of ELECTRA.

## Model extension: smaller generator
- The first motivation is to reduce the amount of compute required per step.
- It also affects downstream performance. The authors speculate that this is because a larger generator makes the discriminator's task too difficult.

<img src="figs/electra/electra-fig-3.png"></img>

## Model extension: weight sharing
- If the generator and discriminator are the same size, all transformer layer weights can be shared. However, there is a benefit in having a discriminator larger than the generator, in which case only the token and position embeddings are shared.
- In an experiment with a generator and discriminator of the same size, tying only the token and position embeddings improved performance, and tying all weights improved on that only very slightly.
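
A minimal sketch of what tying only the embeddings means in code. The classes `Embeddings` and `Encoder` are hypothetical placeholders rather than a real library API, and the sizes are arbitrary. The point is only that both networks hold a reference to the same embedding table while keeping separate transformer layers.

```python
class Embeddings:
    """Toy token embedding table (position embeddings omitted for brevity)."""
    def __init__(self, vocab_size, dim):
        self.table = [[0.0] * dim for _ in range(vocab_size)]

class Encoder:
    """Placeholder encoder: shared embeddings plus its own stack of layers."""
    def __init__(self, embeddings, num_layers):
        self.embeddings = embeddings  # same object may be passed to several encoders
        self.layers = ["layer_%d" % i for i in range(num_layers)]  # stand-ins

shared = Embeddings(vocab_size=100, dim=8)
generator = Encoder(shared, num_layers=2)      # smaller network
discriminator = Encoder(shared, num_layers=6)  # larger network

# An update made through one network is visible through the other.
generator.embeddings.table[3][0] += 0.5
print(discriminator.embeddings.table[3][0])
```

Because the layer stacks differ in size, only the embeddings can be tied in this configuration.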

## Model extension: training algorithms
- They experiment with a two-stage training procedure.
- First train generator for $n$ steps.
- Then initialize the weights of the discriminator with the generator's weights (this requires both networks to be the same size).
- Then train discriminator for $n$ steps with generator's weights frozen.
- This procedure did not improve results. The authors speculate that the joint training provides a beneficial curriculum learning procedure.
- As mentioned previously, adversarial learning via reinforcement learning was also attempted without success.
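
The two-stage schedule can be sketched as a plain training loop. The step functions below are dummy stand-ins that only nudge a list of numbers, and all names are made up for illustration. The sketch shows the ordering and the frozen generator, not actual training.

```python
import copy

def train_generator_step(gen):
    # Dummy update standing in for one MLM gradient step.
    gen["weights"] = [w + 1.0 for w in gen["weights"]]
    gen["steps"] += 1

def train_discriminator_step(disc, gen):
    # Dummy update; the generator is only read here, never modified.
    disc["weights"] = [w + 0.5 for w in disc["weights"]]
    disc["steps"] += 1

def two_stage(n):
    # Stage 1: train the generator alone for n steps.
    generator = {"weights": [0.0] * 4, "steps": 0}
    for _ in range(n):
        train_generator_step(generator)

    # Initialize the discriminator from the generator (same size required).
    discriminator = {"weights": copy.deepcopy(generator["weights"]), "steps": 0}

    # Stage 2: train the discriminator for n steps with the generator frozen.
    frozen = copy.deepcopy(generator)
    for _ in range(n):
        train_discriminator_step(discriminator, generator)
    assert generator == frozen  # generator untouched in stage 2

    return generator, discriminator

g, d = two_stage(3)
print(g["steps"], d["steps"])
```

In joint training, by contrast, both step functions would be called inside the same loop.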

<img src="figs/electra/electra-fig-3.png"></img>

## Small models results
<img src="figs/electra/electra-table-1.png"></img>

## Large models results

<img src="figs/electra/electra-table-3.png"></img>

<img src="figs/electra/electra-table-4.png"></img>

## Efficiency analysis
- They run some experiments to get an understanding of where the gains are coming from.
- **ELECTRA 15%**, only compute the discriminator loss on the 15% of tokens that were masked out.
- **Replace MLM**, same as MLM except that the masked-out tokens are replaced with tokens sampled from a generator instead of [MASK].
- **All-tokens MLM**, make predictions and compute losses for all tokens, not just the masked ones.

<img src="figs/electra/electra-fig-4.png"></img>
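
The variants differ along three axes: what the masked positions are filled with, which objective is used, and which positions receive a loss. The table and helper below are my own summary of those differences, with BERT and full ELECTRA added as reference points; the names `VARIANTS` and `loss_positions` are hypothetical.

```python
# variant -> (input at masked positions, objective, positions with a loss)
VARIANTS = {
    "BERT":           ("[MASK]",            "MLM",                      "masked"),
    "Replace MLM":    ("generator samples", "MLM",                      "masked"),
    "ELECTRA 15%":    ("generator samples", "replaced token detection", "masked"),
    "All-tokens MLM": ("generator samples", "MLM",                      "all"),
    "ELECTRA":        ("generator samples", "replaced token detection", "all"),
}

def loss_positions(variant, num_tokens, masked):
    """Return the token positions that contribute to the loss for a variant."""
    scope = VARIANTS[variant][2]
    if scope == "all":
        return list(range(num_tokens))
    return sorted(masked)

masked = [2, 7, 11]  # toy choice of 3 masked positions out of 20
for name in VARIANTS:
    print(name, len(loss_positions(name, 20, masked)))
```

Comparing rows that differ in a single column isolates one factor at a time, for example ELECTRA against ELECTRA 15% for the effect of learning from all tokens.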

# Conclusion
- New pretraining task for language representation learning proposed, *replaced token detection*.
- The idea is to produce hard examples via a generator network.
- The new task is more compute-efficient and gives better performance on downstream tasks.