# An Overview of Zero Shot Learning in NLP



## Introduction



## History of ZSL and its meaning

## Methods and Uses in NLP

### "Zero-shot" as an Evaluation Technique

### A ready-made, actually useful ZSL classifier

Recently, [Yin et al.](https://arxiv.org/abs/1909.00161) proposed a method which uses a pre-trained MNLI sequence-pair classifier as an out-of-the-box zero-shot text classifier that actually works pretty well.

As some quick background, Natural Language Inference (NLI) considers two sentences: a "premise" and a "hypothesis". The task is to determine whether the hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given the premise.

![example NLI sentences](https://i.ibb.co/gWCjvdP/Screen-Shot-2020-05-26-at-5-10-07-PM.png "NLI Examples from [NLP Progress](http://nlpprogress.com/english/natural_language_inference.html)")

The idea is to take the sequence we're interested in labeling as the "premise" and to turn each candidate label into a "hypothesis." If the model says that the premise "entails" the hypothesis, we take the label to be true. This gives us a ready-made compatibility function that works reasonably well on certain tasks without any task-specific training. The code snippet below shows how easily this can be done with 🤗 Transformers.

```python
# load model pretrained on MNLI
from transformers import BartForSequenceClassification, BartTokenizer
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli')
model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')

# pose sequence as an NLI premise and label (politics) as a hypothesis
premise = 'Who are you voting for in 2020?'
hypothesis = 'This text is about politics.'

# run through model pre-trained on MNLI
input_ids = tokenizer.encode(premise, hypothesis, return_tensors='pt')
logits = model(input_ids)[0]

# the logits are ordered contradiction (0), neutral (1), entailment (2);
# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true
entail_contradiction_logits = logits[:,[0,2]]
probs = entail_contradiction_logits.softmax(dim=1)
true_prob = probs[:,1].item() * 100
print(f'Probability that the label is true: {true_prob:0.2f}%')
```

```
Probability that the label is true: 99.04%
```
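To choose among several candidate labels, the same trick can be repeated once per label. The following is a minimal sketch using only the standard library, not the authors' implementation: `make_hypotheses`, `entailment_prob`, and `score_labels` are hypothetical helper names, the template string simply mirrors the hypothesis used above, and the logits are made-up placeholders standing in for the model's output (assumed to be ordered contradiction, neutral, entailment, as in the snippet above).

```python
import math

def make_hypotheses(labels, template="This text is about {}."):
    """Turn each candidate label into an NLI hypothesis string."""
    return [template.format(label) for label in labels]

def entailment_prob(logits):
    """Drop "neutral" (index 1), then softmax over contradiction (0) and entailment (2)."""
    contradiction, _, entailment = logits
    m = max(contradiction, entailment)  # subtract the max for numerical stability
    e_c = math.exp(contradiction - m)
    e_e = math.exp(entailment - m)
    return e_e / (e_c + e_e)

def score_labels(labels, nli_logits):
    """Give each label an independent probability of being true, best first."""
    scores = {label: entailment_prob(row) for label, row in zip(labels, nli_logits)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

labels = ["politics", "sports", "cooking"]
hypotheses = make_hypotheses(labels)

# Placeholder logits, one row per (premise, hypothesis) pair. In practice these
# would come from running the MNLI model on each pair.
fake_logits = [[-2.0, 0.5, 3.0], [2.5, 0.3, -1.5], [3.0, 0.1, -2.0]]
ranking = score_labels(labels, fake_logits)

print(hypotheses[0])
print(ranking[0][0])
```

Because each label is scored independently, the probabilities need not sum to one, which suits multi-label problems. For a single-label task, one option is to take the top-ranked label.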


In the paper, the authors report an F1 of $37.9$ on Yahoo Answers using the smallest version of BERT fine-tuned only on the Multi-genre NLI (MNLI) corpus. By simply using the larger and more recent BART model pre-trained on MNLI, we were able to bring this number up to $53.7$. For context, Yahoo Answers has 10 classes and [supervised models](https://paperswithcode.com/sota/text-classification-on-yahoo-answers) get an accuracy of just over $70\%$.

Of course, this number can be improved when some data is available for training. In addition to the extreme fully unsupervised setting, the authors consider a setup which corresponds to the traditional _generalized zero-shot learning_ setting where only a subset of the dataset's labels are available during training. The model is then evaluated on all labels together, both seen and unseen, at test time.
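One way to look at results in this kind of setup is to report scores separately for examples whose gold label was seen during training and for those whose label was not. The sketch below is only an illustration with toy data: `split_accuracy` is a hypothetical helper, the labels and predictions are invented, and the paper's exact metric may differ.

```python
def split_accuracy(y_true, y_pred, seen_labels):
    """Accuracy computed separately for gold labels that were seen or unseen in training."""
    seen_labels = set(seen_labels)
    hits = {"seen": 0, "unseen": 0}
    totals = {"seen": 0, "unseen": 0}
    for gold, pred in zip(y_true, y_pred):
        group = "seen" if gold in seen_labels else "unseen"
        totals[group] += 1
        hits[group] += int(gold == pred)
    # Report NaN for a group with no examples instead of dividing by zero.
    return {g: (hits[g] / totals[g] if totals[g] else float("nan")) for g in totals}

# Toy data: the model was trained only on "sports" and "politics",
# but at test time it predicts over all four labels.
seen = {"sports", "politics"}
y_true = ["sports", "politics", "health", "science", "sports", "health"]
y_pred = ["sports", "sports", "health", "politics", "sports", "science"]

acc = split_accuracy(y_true, y_pred, seen)
print(acc)
```

Splitting the score this way makes it easy to spot a model that does well on seen labels while rarely predicting the unseen ones.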

See [our live demo here](http://35.208.71.201:8000/) to try it out for yourself! Enter a sequence you want to classify and any labels of interest and watch BART do its magic in real time.

![live demo](https://i.ibb.co/WB6HsFk/Screen-Shot-2020-05-26-at-5-31-25-PM.png)

#### Sections to put somewhere:
- Zero-shot learning and its relationship to few-shot learning, sample efficiency, domain adaptation
- Some kind of good visualization