# Modern NLP with no (or little) training data: an overview

- comments: true



Natural language processing is an enthrallingly exciting field right now. In recent years, the community has begun to figure out some pretty effective methods of learning from the enormous amounts of unlabeled data on the internet. The success of transfer learning from unsupervised models has allowed us to surpass virtually all existing benchmarks on downstream supervised learning tasks. As we continue to develop new model architectures and unsupervised learning objectives, "state of the art" continues to be a rapidly moving target for many tasks where large amounts of labeled data are available.

In many real-world settings, however, annotated data is either scarse or  unavailable entirely. It seems almost tragic that we could have such success in unsupervised learning as a pre-training step but having focused so little on  alleviating our reliance on labaled data in downstream applications like sequence classification. Recent models like BERT, XLNet, and T5 have been shown to encode a tremendous amount of information in their weights – it seems like we should be able to figure out a way to use that data in traditionally supervised tasks but without such a heavy reliance on task-specific annotated data.

Of course, *some* research has in fact been done in this area. **In this post, I will present a few techniques, both from published research and our own experiments at Hugging Face, for using state-of-the-art NLP models for sequence classification without large annotated training sets.**

#### Background: Natural Language Inference (NLI)

Several of the methods described below use Natural Language Inference as a pre-training step, so here is a quick review. NLI considers two sentences: a "premise" and a "hypothesis". The task is to determine whether the hypothesis is true (entailment) or false (contradiction) given the premise.

![example NLI sentences](https://joeddav.github.io/blog/images/zsl/nli-examples.png "Examples from http://nlpprogress.com/english/natural_language_inference.html")

When using transformer architectures like BERT, NLI datasets are typically modeled via _sequence-pair classification_. That is, we feed both the premise and the hypothesis through the model together as distinct segments and learn a classification head predicting one of `[contradiction, neutral, entailment]`.

## A latent embedding approach

A common approach to zero shot learning in the computer vision setting is to use an existing featurizer to embed an image and any possible class names into their corresponding latent representations. They can then take some training set and use only a subset of the available labels to learn a linear projection to align the image and label embeddings. At test time, this framework allows one to embed any label (seen or unseen) and any image into the same latent space and measure their distance.

In the text domain, we have the advantage that we can trivially use a single model to embed both the data and the class names into the same space, eliminating the need for the data-hungry alignment step. This is not a new technique – researchers and practitioners have used pooled word vectors in similar ways for some time. But recently we have seen a dramatic increase in the quality of sentence embedding models. We therefore decided to run some experiments with Sentence-BERT, a recent technique which fine-tunes the pooled sequence representations for increased semantic richness, as a method for obtaining sequence and label embeddings.

To formalize this, suppose we have a sequence embedding model $\Phi$ and set of possible class names $C$. We classify a given sequence $x$ according to,

$$
\hat{c} = \arg\max_{c \in C} \cos(\Phi(x), \Phi(c))
$$

where $\cos$ is the cosine similarity. Here's an example code snippet showing how this can be done using Sentence-BERT as our embedding model $\Phi$:

In [70]:
# load the sentence-bert model from the HuggingFace model hub
from transformers import AutoTokenizer, AutoModel
from torch.nn import functional as F
tokenizer = AutoTokenizer.from_pretrained('deepset/sentence_bert')
model = AutoModel.from_pretrained('deepset/sentence_bert')

sentence = 'Who are you voting for in 2020?'
labels = ['business', 'art & culture', 'politics']

# run inputs through model and mean-pool over the sequence
# dimension to get sequence-level representations
inputs = tokenizer.batch_encode_plus([sentence] + labels,
                                     return_tensors='pt',
                                     pad_to_max_length=True)
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
output = model(input_ids, attention_mask=attention_mask)[0]
sentence_rep = output[:1].mean(dim=1)
label_reps = output[1:].mean(dim=1)

# now find the labels with the highest cosine similarities to
# the sentence
similarities = F.cosine_similarity(sentence_rep, label_reps)
closest = similarities.argsort(descending=True)
for ind in closest:
    print(f'label: {labels[ind]} \t similarity: {similarities[ind]}')

label: politics 	 similarity: 0.21561521291732788
label: business 	 similarity: 0.004524140153080225
label: art & culture 	 similarity: -0.027396833524107933


> Note: This code snippet uses `deepset/sentence_bert` which is the smallest version of the S-BERT model. Our experiments use larger models which are currently available only in the `sentence-transformers` GitHub repository, which we hope to make available in the Hugging Face model hub soon.

One problem with this method is that Sentence-BERT is designed to learn effective sentence-level, not single- or multi-word representations like our class names. It is therefore reasonable to suppose that our label embeddings may not be as semantically salient as popular word-level embedding methods (i.e. word2vec). This is seen in the t-SNE visualization below where the data seems to cluster together by class (color) reasonably well, but the labels are poorly aligned. If we were to use word vectors as our label representations, however, we would need annotated data to learn an alignment between the S-BERT sequence representations and the word2vec label representations.

![visual of S-BERT label and text embeddings](https://joeddav.github.io/blog/images/zsl/tsne_no_projection.png "t-SNE visualization of Yahoo Answers S-BERT embeddings. Plotted points correpond to data and text boxes to corresponding labels.")


In some of our own internal experiments, we addressed this issue with the following procedure:

1. Take the top $K$ most frequent words $V$ in the vocabulary of a word2vec model
2. Obtain embeddings for each word using word2vec, $\Phi_{\text{word}}(V)$
3. Obtain embeddings for each word using S-BERT, $\Phi_{\text{sent}}(V)$
4. Learn a least-squares linear projection matrix $Z$ with L2 regularization from $\Phi_{\text{sent}}(V)$ to $\Phi_{\text{word}}(V)$

Now we use $Z$ in our classification as an additional transformation to our latent space for both sequence and label embeddings:

$$
\hat{c} = \arg\max_{c \in C} \cos(\Phi_{\text{sent}}(x)Z, \Phi_{\text{sent}}(c)Z)
$$

This procedure can be thought of as a kind of dimensionality reduction. As seen in the t-SNE visual below, this projection makes the label embeddings much better aligned with their corresponding data while maintining the superior performance of S-BERT compared to pooled word vectors. Importantly, this procedure does not require any additional data beyond a word2vec mapping sorted by word frequency.

![visual of S-BERT + projection label and text embeddings](https://joeddav.github.io/blog/images/zsl/tsne_with_projection.png "t-SNE visualization of embeddings with SBERT to Wordvec projection")



On Yahoo Answers, we find an F1 of $46.9$ and $31.2$ with and without this projection step, respectively.

#### When some annotated data is available

This technique is flexible and easily adapted to the case where a limited amount of labeled data is available (few-shot learning) or where we have annotated data for only a subset of the classes we're interested in (traditional zero-shot learning).

To do so, we can simply learn an additional least-squares projection matrix to the embeddings of any available labels from their corresponding data embeddings. We also a add a variant of L2 regularization which regularizes towards the identity matrix. If we define $X_{Tr}, Y_{Tr}$ to be our training data and labels and $\Phi(X) = \Phi_\text{sent}(X)Z$ to be our embedding function as described above, our regularized objective is

$$
W^\ast = \arg\min \dfrac{1}{n} || \Phi(X)^\top W - \Phi(Y) ||^2 + \lambda ||W - \mathbb{I}_d||^2
$$

This is equivalent to Bayesian linear regression with a Gaussian prior on the weights centered at the identity matrix and variance controlled by $\lambda$. By pushing $W$ towards the identity matrix, we're effectively pushing the resulting projected embeddings $\Phi(X)W^\ast$ towards $\Phi(X)$.

## Classification as Natural Language Inference

Recently, [Yin et al. (2019)](https://arxiv.org/abs/1909.00161) proposed a method which uses a pre-trained MNLI sequence-pair classifier as an out-of-the-box zero-shot text classifier that actually works pretty well.

The idea is to take the sequence we're interested in labeling as the "premise" and to turn each candidate label into a "hypothesis." If the NLI model predicts that the premise "entails" the hypothesis, we take the label to be true. Unlike the previous approach which utilizes independent embeddings for the data and labels and requires us to determine their relationship, this method gives us a ready-made compatibility function that works reasonably well without any task-specific training. See the code snippet below which demonstrates how easily this can be done with 🤗 Transformers.

In [0]:
# load model pretrained on MNLI
from transformers import BartForSequenceClassification, BartTokenizer
tokenizer = BartTokenizer.from_pretrained('bart-large-mnli')
model = BartForSequenceClassification.from_pretrained('bart-large-mnli')

# pose sequence as a NLI premise and label (politics) as a hypothesis
premise = 'Who are you voting for in 2020?'
hypothesis = 'This text is about politics.'

# run through model pre-trained on MNLI
input_ids = tokenizer.encode(premise, hypothesis, return_tensors='pt')
logits = model(input_ids)[0]

# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true 
entail_contradiction_logits = logits[:,[0,2]]
probs = entail_contradiction_logits.softmax(dim=1)
true_prob = probs[:,1].item() * 100
print(f'Probability that the label is true: {true_prob:0.2f}%')

Probability that the label is true: 99.04%


In the paper, the authors report a label-weighted F1 of $37.9$ on Yahoo Answers topic classification using the smallest version of BERT fine-tuned only on the Multi-genre NLI (MNLI) corpus. By simply using the larger and more recent Bart model pre-trained on MNLI, we were able to bring this number up to $53.7$. For context, Yahoo Answers has 10 classes and [supervised models](https://paperswithcode.com/sota/text-classification-on-yahoo-answers) get an accuracy of just over $70\%$.

Of course, this number can be improved when some data is available for training. In addition to the extreme fully unsupervised setting, the authors consider a setup which corresponds to the traditional _generalized zero-shot learning_ setting where only a subset of the dataset's labels are available during training. The model is then evaluated on all labels together, both seen and unseen, at test time.

See [our live demo here](http://35.208.71.201:8000/) to try it out for yourself! Enter a sequence you want to classify and any labels of interest and watch Bart do its magic in real time.

![live demo](https://joeddav.github.io/blog/images/zsl/zsl-demo-screenshot.png)

## Classification as a cloze task

One in-the-works approach to keep your eye on is a preprint on Pattern-Exploiting Training (PET) from [Schick et al.](https://arxiv.org/abs/2001.07676). In this paper, the authors reformulate text classification as a cloze task. A cloze question considers a sequence which is partially masked and requires predicting the missing value(s) from the context. PET requires a human practitioner to construct several task-appropriate cloze-style templates which, in the case of topic classification, could look something like the following:

![cloze examples](https://joeddav.github.io/blog/images/zsl/cloze.png "examples of cloze templates for topic classification. a and b are the question and answers in the case of Yahoo Answers and ____ is the class name which the model must predict.")

A pre-trained masked language model is then tasked with choosing the most likely value for the masked (blank) word from among the possible class names.

Rather than simply taking the aggregated values from this procedure as final predictions, the authors introduce a fresh classifier which is trained on the softened proxy labels generated in the previous step. My intuition is that this step is effective because it allows us to do inference over the whole test set collectively, allowing the model to learn from the set over which it is predicting rather than treating each test point independently. Though I'm ignorant of any research testing this hypothesis, I suspect that this step would be particularly helpful when adapting to novel domains which do not resemble the MLM's training corpus.

Though not discussed in the most recent version of their preprint, in their [GitHub repo](https://github.com/timoschick/pet) the authors go one step further and mention an iterative self-training procedure on top of PET which reports an impressive accuracy of $70.7\%$ on Yahoo Answers, which nearly approaches the performance of state-of-the-art supervised classification methods.

Method pros:
- Strong empirical performance
- Distilation step allows learning from unlabeled task-specific data
- Well-tested in few-shot setting
- Any masked language model can be used out of the box

Cons:
- Requires manual construction of multiple task-appropriate cloze templates
- Class names can only consist of one token
- Computational considerations: each instance requires several forward passes through a large MLM and a fresh classifier must then be trained from scratch

## On low-resource languages

Low-resource and cross-lingual learning is a huge research area in NLP right now and much has been written about it, so I'll just link a few great resources:

- Graham Neubig's recently released [Low Resource NLP Bootcamp](https://github.com/neubig/lowresource-nlp-bootcamp-2020) is a GitHub repo containing 8 lectures (plus exercises) focused on NLP in data-scarse languages. 

- Sebastian Ruder's blog post, ["A survey of cross-lingual word embedding models"](https://ruder.io/cross-lingual-embeddings/)

