# "Recent advancements"

The availability of "Gigantic Language Models" (GLMs) like GPT-3, whose weights are very difficult to modify, raises interesting questions about the optimal way of using them for downstream tasks, in particular for supervised problems.

Two paradigms seem to be emerging:

- __Prompt engineering/generation__: Try to find the best GLM prompt (or prompt function) and answer-transformation function for few-shot learning on the task, i.e., solve the problem directly without training or fine-tuning.
- __Try to generate some useful intermediary artifacts__ which then can be used to train/fine-tune other models.

## Prompt-engineering/generation methods

A recently published survey, [Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing (2021)](https://arxiv.org/abs/2107.13586), gives a detailed overview of the prompt engineering approaches that have been developed for using GLMs to solve supervised problems.

Main points:

There are two prompt formats, corresponding to two main types of pretrained LMs:
- __Prefix prompts__, for LMs that __continue__ a prompt when they generate language: these are for "traditional" causal/autoregressive LMs generating continuation probabilities, e.g., GPT-x. Encoder-decoder models with an autoregressive decoder, such as T5, can be prompted this way too.
- __Cloze prompts__, for LMs using masked-objective variants, e.g., BERT, RoBERTa etc.: in this case the prompt contains blanks that the model has to fill in.
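
The difference between the two formats is easiest to see side by side. The following is a minimal sketch for a sentiment task; the template wording and the function names `prefix_prompt` and `cloze_prompt` are illustrative assumptions, not taken from the survey.

```python
def prefix_prompt(x: str) -> str:
    """Prefix prompt: the answer slot comes last, so a left-to-right LM can continue it."""
    return f"Review: {x}\nSentiment:"


def cloze_prompt(x: str, mask_token: str = "[MASK]") -> str:
    """Cloze prompt: the answer slot is a blank inside the text, for a masked LM to fill in."""
    return f"{x} Overall, it was a {mask_token} movie."


review = "I could not stop watching."
print(prefix_prompt(review))
print(cloze_prompt(review))
```

In both cases an answer-transformation step still has to map the model's output (e.g., "great" or "positive") back to a task label.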

Prompt creation methods for a given supervised task:

- __Manual prompt template engineering__ is pretty widespread, based on expertise and an understanding of the model architecture, the pretraining method, and the task.

A concrete example for semantic parsing ([source paper](https://arxiv.org/pdf/2104.08768v1.pdf)):

> Let’s translate what a human user says into what a computer might say.

> Human: when is the weekly standup

> Computer: start time of weekly standup

> Human: what date is the weekly standup

> Computer: date of weekly standup

> ...

> Human: how long is the weekly standup

> Computer:

> ____
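
A prompt of this shape can be assembled mechanically from an instruction, a handful of demonstration pairs, and the new query. The sketch below assumes a hypothetical helper `build_few_shot_prompt`; it only builds the text and does not call any model.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prefix prompt that ends with an open 'Computer:' slot."""
    lines = [instruction, ""]
    for human, computer in examples:
        lines.append(f"Human: {human}")
        lines.append(f"Computer: {computer}")
    lines.append(f"Human: {query}")
    lines.append("Computer:")  # the model is expected to continue from here
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Let's translate what a human user says into what a computer might say.",
    [
        ("when is the weekly standup", "start time of weekly standup"),
        ("what date is the weekly standup", "date of weekly standup"),
    ],
    "how long is the weekly standup",
)
print(prompt)
```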

### Automatic prompt generation

But how do we know that these prompts are optimal? We don't, so let's try to __optimize__ prompt generation! Existing implementations of this idea:

__Discrete prompts:__
- __prompt mining:__ scrape a large corpus and find __middle words or dependency paths__ (!) between the inputs (x) and outputs (y) of the example pairs, and use these to form the prompt.
- __prompt paraphrasing__: start with a seed prompt template, generate paraphrases, and then choose the highest performing variant.
- __gradient-based search__ for optimal textual prompts.
- __template generation__: you can try to train text generators, e.g., T5, to generate prompt templates based on examples...
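
The selection step shared by several of these methods (most directly prompt paraphrasing) can be sketched as: score every candidate template on a small labeled development set and keep the best one. Everything below is an assumption for illustration; in particular, `toy_predict` is a stand-in for a real GLM call, and the templates and examples are made up.

```python
def toy_predict(prompt: str) -> str:
    """Stand-in for a GLM call; it only answers sensibly if the prompt ends with 'Sentiment:'."""
    if not prompt.rstrip().endswith("Sentiment:"):
        return "negative"
    return "positive" if "great" in prompt else "negative"


def accuracy(template, dev_set, predict) -> float:
    """Fraction of dev examples answered correctly when prompted with this template."""
    correct = sum(predict(template.format(x=x)) == y for x, y in dev_set)
    return correct / len(dev_set)


def select_best_template(templates, dev_set, predict):
    """Return the candidate template with the highest dev-set accuracy."""
    return max(templates, key=lambda t: accuracy(t, dev_set, predict))


dev_set = [("great food", "positive"), ("cold soup", "negative"), ("great staff", "positive")]
templates = [
    "{x} Is this good or bad?",
    "Review: {x}\nSentiment:",
    "Text: {x}\nLabel:",
]
best = select_best_template(templates, dev_set, toy_predict)
print(repr(best))
```

With a real model, `predict` would wrap the API call plus the answer-transformation function, and the candidate list would come from a paraphrasing model rather than being written by hand.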

__Prompting directly in the embedding space:__

- The core idea is to use continuous embeddings as prompts. This opens up the possibility of using networks optimized by gradient descent (GD) to produce the continuous prompts from the text input, and even of using non-textual inputs such as image embeddings, e.g., for image captioning.
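
To make the idea concrete, here is a toy NumPy sketch: a few free "prompt" vectors are prepended to the embedded input and are the only parameters updated by gradient descent, while the model stays frozen. The "model" here (mean pooling plus a linear head) and all sizes are assumptions chosen so the gradient can be written by hand; a real setup would backpropagate through a frozen pretrained LM.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, prompt_len = 50, 8, 4

# Frozen stand-ins for a pretrained model: an embedding table and a linear scoring head.
token_embeddings = rng.normal(size=(vocab_size, dim))
w = rng.normal(size=dim)

# The only trainable parameters: prompt_len free vectors, not tied to any vocabulary token.
soft_prompt = rng.normal(scale=0.1, size=(prompt_len, dim))


def build_input(token_ids, prompt):
    """Prepend the continuous prompt to the embedded input tokens."""
    return np.concatenate([prompt, token_embeddings[np.asarray(token_ids)]], axis=0)


def loss_and_grad(token_ids, prompt, target):
    """Squared error of the toy model's score, and its gradient w.r.t. the prompt only."""
    inputs = build_input(token_ids, prompt)
    n = inputs.shape[0]
    err = inputs.mean(axis=0) @ w - target
    # Each prompt row contributes w / n to the score, so every row gets the same gradient.
    grad = np.tile(2.0 * err * w / n, (prompt.shape[0], 1))
    return err ** 2, grad


token_ids, target = [3, 17, 42], 1.0
print(build_input(token_ids, soft_prompt).shape)  # (7, 8): 4 prompt vectors + 3 tokens

first_loss, _ = loss_and_grad(token_ids, soft_prompt, target)
for step in range(100):
    _, grad = loss_and_grad(token_ids, soft_prompt, target)
    soft_prompt = soft_prompt - 0.1 * grad  # the frozen parts are never updated
last_loss, _ = loss_and_grad(token_ids, soft_prompt, target)
print(first_loss, last_loss)
```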

## Zero-label learning with synthetic examples (Google, Sep. 2021)

(The paper can be found [here](https://arxiv.org/pdf/2109.09193.pdf))

__Main motivation:__
- few-shot learning GLM performance is still worse than fine-tuning "small" models, e.g., fine-tuned T5 beats few-shot learning using GPT-3 on...

__SuperGLUE:__ a very challenging, recent (2019) NLU benchmark consisting of the following kinds of tasks:
- word sense disambiguation (WSD),
- coreference resolution,
- question answering, and
- natural language inference.

<img src="https://miro.medium.com/max/667/1*I5m7DxaB7af8d189drFixg.png">

(Figure from the [paper](https://arxiv.org/abs/1905.00537).)

__Performance on SuperGLUE__

Fine-tuned T5 reaches 89.3% average performance on SuperGLUE vs. 71.8% for few-shot learning with GPT-3; human performance is 89.8%.

__Approach__:
- use a GLM for __synthetic datapoint generation__ and then use the synthetically generated dataset to fine-tune a "small" pretrained LM like T5 or BERT for the task.
- with appropriate prompt engineering, you don't need labels(!): you just prompt the GLM to produce X datapoints for given Ys, so the whole procedure is "zero-label".
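
The data flow of this approach can be sketched in a few lines: for each label Y, prompt the generator repeatedly and store every sampled text X together with the label that was asked for. The prompt template and `toy_generate` below are hypothetical placeholders, not the prompts or model used in the paper.

```python
import itertools

# Hypothetical label-conditioned template; the paper's actual prompts differ.
PROMPT_TEMPLATE = "Write a {label} restaurant review:\n"

_counter = itertools.count()


def toy_generate(prompt: str) -> str:
    """Stand-in for sampling one continuation from a GLM."""
    return f"sample #{next(_counter)} for prompt {prompt!r}"


def generate_dataset(labels, n_per_label, generate):
    """Build (text, label) pairs without any human labeling: the label is the one we prompted for."""
    dataset = []
    for label in labels:
        prompt = PROMPT_TEMPLATE.format(label=label)
        for _ in range(n_per_label):
            dataset.append((generate(prompt), label))
    return dataset


synthetic = generate_dataset(["positive", "negative"], n_per_label=3, generate=toy_generate)
for text, label in synthetic:
    print(label, "|", text)
```

The resulting list of pairs is what would then be used to fine-tune the "small" model.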

Figure from the paper:
<img src="https://miro.medium.com/max/933/0*xQI2ysY_loXMTlmb.png" width="100%">

__Experiments and results__:
- Used the Yelp and Amazon sentiment classification datasets (only the Xs!) and DBPedia topic classification
- Generated 10000 synthetic examples per class for the classification problems
- Used top-k sampling for generation
- __T5 fine-tuned with the synthetic data:__ new state-of-the-art results for zero/few-shot learning (without labels!!)
- used for __SuperGLUE__ as __a data augmentation method__, the __fine-tuned T5 model__ achieved, for the first time, better-than-human performance (90.4 percent)
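
For reference, top-k sampling restricts each generation step to the k most likely next tokens and samples among them after renormalizing. A minimal NumPy sketch of one such step follows; the logits are made-up numbers, and the value of k used in the paper is not assumed here.

```python
import numpy as np


def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-scoring entries of a logit vector."""
    logits = np.asarray(logits, dtype=float)
    top = np.argpartition(logits, -k)[-k:]  # indices of the k largest logits (unordered)
    shifted = logits[top] - logits[top].max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()  # softmax over the kept tokens only
    return int(rng.choice(top, p=probs))


rng = np.random.default_rng(0)
logits = [2.0, 0.1, 1.5, -3.0, 0.7]
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

With `k=2`, only the two highest-scoring token ids can ever be drawn; with `k=1` the procedure reduces to greedy decoding.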