# Text classification in the BERT era without annotated data


Let's be honest, natural language processing is a particularly exciting field these days. In recent years, the community has begun to figure out some pretty effective methods of learning from the enormous amounts of unlabeled data on the internet. The success of transfer learning from unsupervised models has allowed us to surpass previous results on virtually all existing benchmarks for downstream supervised learning tasks. As we continue to develop new model architectures and unsupervised learning objectives, "state of the art" continues to be a rapidly moving target for many tasks where large amounts of labeled data are available.

In many real-world settings, however, annotated data is either scarce or unavailable entirely. It seems almost tragic that we have had such success with unsupervised learning as a pre-training step yet have focused so little on alleviating our reliance on labeled data in downstream applications like sequence classification. Recent models like BERT, XLNet, and T5 have been shown to encode a tremendous amount of knowledge in their weights, so it seems like we should be able to figure out a way to use that knowledge in traditionally supervised tasks without such a heavy reliance on task-specific annotated data.

Of course, *some* research has in fact been done in this area. **In this post, I will give an overview of a few techniques, both from published research and my own experiments, for using state-of-the-art NLP models for sequence classification in the absence of large annotated datasets.** Specifically, I will cover the following low-resource settings:

1. I have no training data (extreme zero-shot learning)
2. I have sufficient data for some labels, but not for others (traditional zero-shot learning)
3. I have a little bit of annotated data (few-shot learning)
4. I have no annotated data, but lots of unlabeled data (unsupervised classification)

At the end of the post, I also link a few fantastic resources out there for NLP in low-resource languages. While I focus specifically on sequence-level classification, my hope is that some of these methods will be applicable to or inspire ideas for other tasks as well.

## Setting: I have no training data

#### Background: Natural Language Inference (NLI)

Several of the methods described below use Natural Language Inference as a pre-training step, so here is a quick review. NLI considers two sentences: a "premise" and a "hypothesis". The task is to determine whether, given the premise, the hypothesis is true (entailment), false (contradiction), or undetermined (neutral).

![example NLI sentences](https://joeddav.github.io/blog/images/zsl/nli-examples.png "Examples from http://nlpprogress.com/english/natural_language_inference.html")

When using transformer architectures like BERT, NLI datasets are typically modeled via _sequence-pair classification_. That is, we feed both the premise and the hypothesis through the model together as distinct segments and learn a classification head predicting one of `[contradiction, neutral, entailment]`.
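
To make the "distinct segments" idea concrete, here is a minimal sketch of how a premise/hypothesis pair is typically laid out for a BERT-style model. The helper name `build_pair_input` and the whitespace tokenization are assumptions for illustration only; a real tokenizer produces subword tokens and integer ids.

```python
def build_pair_input(premise_tokens, hypothesis_tokens):
    """Lay out a premise/hypothesis pair in the BERT sequence-pair format."""
    tokens = ["[CLS]"] + premise_tokens + ["[SEP]"] + hypothesis_tokens + ["[SEP]"]
    # segment 0 covers [CLS], the premise, and the first [SEP];
    # segment 1 covers the hypothesis and the final [SEP]
    segment_ids = [0] * (len(premise_tokens) + 2) + [1] * (len(hypothesis_tokens) + 1)
    return tokens, segment_ids

# toy whitespace tokenization (an assumption; real models use subwords)
premise = "A man is playing a guitar".split()
hypothesis = "A person makes music".split()
tokens, segment_ids = build_pair_input(premise, hypothesis)
for token, segment in zip(tokens, segment_ids):
    print(f"{segment}  {token}")
```

The classification head then reads the representation of the whole pair and predicts one of the three NLI classes.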

### NLI models as effective, ready-made zero-shot classifiers

Recently, [Yin et al. (2019)](https://arxiv.org/abs/1909.00161) proposed a method which uses a pre-trained MNLI sequence-pair classifier as an out-of-the-box zero-shot text classifier that actually works pretty well.

The idea is to take the sequence we're interested in labeling as the "premise" and to turn each candidate label into a "hypothesis." If the NLI model predicts that the premise "entails" the hypothesis, we take the label to be true. This gives us a ready-made compatibility function that works reasonably well without any task-specific training. The code snippet below shows how easily this can be done with 🤗 Transformers.

```python
# load model pretrained on MNLI
from transformers import BartForSequenceClassification, BartTokenizer
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli')
model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')

# pose sequence as an NLI premise and label (politics) as a hypothesis
premise = 'Who are you voting for in 2020?'
hypothesis = 'This text is about politics.'

# run through model pre-trained on MNLI
input_ids = tokenizer.encode(premise, hypothesis, return_tensors='pt')
logits = model(input_ids)[0]

# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true
entail_contradiction_logits = logits[:,[0,2]]
probs = entail_contradiction_logits.softmax(dim=1)
true_prob = probs[:,1].item() * 100
print(f'Probability that the label is true: {true_prob:0.2f}%')
```

Probability that the label is true: 99.04%
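
The snippet above scores a single candidate label. With several candidate labels, one natural extension is to run the model once per label and then combine the resulting logits. The sketch below shows two ways to do that with NumPy. The function name `score_labels` and the logit values are hypothetical; only the `[contradiction, neutral, entailment]` column order is taken from the snippet above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def score_labels(nli_logits, multi_label=False):
    """nli_logits: shape (num_labels, 3), columns ordered
    [contradiction, neutral, entailment]."""
    nli_logits = np.asarray(nli_logits, dtype=float)
    if multi_label:
        # judge each label independently: entailment vs. contradiction
        return softmax(nli_logits[:, [0, 2]], axis=1)[:, 1]
    # assume exactly one label is true: normalize entailment logits across labels
    return softmax(nli_logits[:, 2], axis=0)

# hypothetical logits, one row per candidate label
labels = ['politics', 'business', 'art & culture']
logits = np.array([[-2.0, 0.5, 3.0],
                   [1.5, 0.3, -1.0],
                   [2.0, 0.1, -2.5]])

single = score_labels(logits)
multi = score_labels(logits, multi_label=True)
best = labels[int(np.argmax(single))]
print(best, single.round(3), multi.round(3))
```

In the single-label case the scores sum to one across labels, while in the multi-label case each score is an independent probability, so several labels (or none) can exceed a chosen threshold.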


In the paper, the authors report an F1 of $37.9$ on Yahoo Answers using the smallest version of BERT fine-tuned only on the Multi-genre NLI (MNLI) corpus. By simply using the larger and more recent BART model pre-trained on MNLI, we were able to bring this number up to $53.7$. For context, Yahoo Answers has 10 classes and [supervised models](https://paperswithcode.com/sota/text-classification-on-yahoo-answers) get an accuracy of just over $70\%$.

Of course, this number can be improved when some data is available for training. In addition to the extreme fully unsupervised setting, the authors consider a setup which corresponds to the traditional _generalized zero-shot learning_ setting where only a subset of the dataset's labels are available during training. The model is then evaluated on all labels together, both seen and unseen, at test time.
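
As a rough illustration of that setup, the sketch below restricts a training set to the seen labels while keeping every label as a candidate at test time. The toy examples, label names, and the helper `restrict_to_seen` are all hypothetical and only meant to show the data split, not the authors' actual pipeline.

```python
def restrict_to_seen(train_examples, seen_labels):
    """Keep only the training examples whose label is in the seen set."""
    seen = set(seen_labels)
    return [(text, label) for text, label in train_examples if label in seen]

# hypothetical toy data: (text, label) pairs
train = [("The team won the final", "sports"),
         ("New vaccine trial begins", "health"),
         ("Stocks rallied today", "business")]
test = [("Coach resigns after loss", "sports"),
        ("Central bank raises rates", "business")]

seen_labels = ["sports", "health"]
train_seen = restrict_to_seen(train, seen_labels)

# at test time the model must choose among all labels, seen and unseen
candidate_labels = sorted({label for _, label in train + test})
unseen_labels = [label for label in candidate_labels if label not in seen_labels]
print(len(train_seen), candidate_labels, unseen_labels)
```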

See [our live demo here](http://35.208.71.201:8000/) to try it out for yourself! Enter a sequence you want to classify and any labels of interest and watch BART do its magic in real time.

![live demo](https://joeddav.github.io/blog/images/zsl/zsl-demo-screenshot.png)

### A latent embedding approach

A common approach to zero-shot learning in the computer vision setting is to use an existing featurizer to embed an image and any possible class names into their corresponding latent representations. One can then take some training set and use only a subset of the available labels to learn a linear projection that aligns the image and label embeddings. At test time, this framework allows one to embed any label (seen or unseen) and any image into the same latent space and measure their distance.

![latent embeddings of images and labels](https://joeddav.github.io/blog/images/zsl/socher.png "t-SNE visualization of projected image & class embeddings from Socher et al.")

In the text domain, we have the advantage that we can use a single model to embed both the sequences to classify and the class names into the same space, eliminating the need for the data-hungry alignment step. This is not a new technique – researchers and practitioners have used pooled word vectors in similar ways for some time. But recently we have seen a dramatic increase in the quality of sentence embedding models. We therefore experiment with Sentence-BERT, a recent technique which fine-tunes pooled sequence representations for increased semantic richness, as a method for obtaining sequence and label embeddings.

To formalize this, suppose we have a sequence embedding model $\Phi$ and set of possible class names $C$. We classify a given sequence $x$ according to,

$$
\hat{c} = \arg\max_{c \in C} \cos(\Phi(x), \Phi(c))
$$

where $\cos$ is the cosine similarity. Here's an example code snippet showing how this can be done using Sentence-BERT as our embedding model $\Phi$:

```python
# load the sentence-bert model from the HuggingFace model hub
from transformers import AutoTokenizer, AutoModel
from torch.nn import functional as F
tokenizer = AutoTokenizer.from_pretrained("deepset/sentence_bert")
model = AutoModel.from_pretrained("deepset/sentence_bert")

sentence = 'Who are you voting for in 2020?'
labels = ['business', 'art & culture', 'politics']

# run inputs through model and mean-pool over the sequence
# dimension to get sequence-level representations
# (padding=True pads every input to the longest one in the batch;
# note that this plain mean also averages over padding positions)
inputs = tokenizer([sentence] + labels,
                   return_tensors='pt',
                   padding=True)
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
output = model(input_ids, attention_mask=attention_mask)[0]
sentence_rep = output[:1].mean(dim=1)
label_reps = output[1:].mean(dim=1)

# now find the labels with the highest cosine similarities to
# the sentence
similarities = F.cosine_similarity(sentence_rep, label_reps)
closest = similarities.argsort(descending=True)
for ind in closest:
    print(f'label: {labels[ind]} \t similarity: {similarities[ind]}')
```

label: politics 	 similarity: 0.21561521291732788
label: business 	 similarity: 0.004524140153080225
label: art & culture 	 similarity: -0.027396833524107933
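
The pooling step above takes a plain mean over the sequence dimension. When inputs in a batch have different lengths, a common variant is to exclude padding positions from the average by using the attention mask. The NumPy sketch below illustrates that variant on toy arrays. The function name `masked_mean_pool` is hypothetical, and whether this matches how a given checkpoint was trained is an assumption worth checking before relying on it.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Mean-pool over real tokens only.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# toy batch: the first sequence has one padding position with junk values
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])
pooled = masked_mean_pool(emb, mask)
print(pooled)
```

Short inputs such as one-word class names are the ones most affected by padding, so this choice can matter for the label embeddings in particular.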


One downside to this method is that Sentence-BERT is designed to learn effective sentence-level representations, not representations of single words or short phrases like our class names. It is therefore reasonable to suppose that our label embeddings may not be as semantically salient as those from popular word-level embedding methods (e.g., word2vec). If we were to use word vectors as our label representations, however, we would need annotated data to learn an alignment between the S-BERT sequence representations and the word2vec label representations.

We addressed this issue with the following procedure:

1. Take the top $K$ most frequent words $V$ in the vocabulary of a word2vec model
2. Obtain embeddings for each word using word2vec, $\Phi_{\text{word}}(V)$
3. Obtain embeddings for each word using S-BERT, $\Phi_{\text{sent}}(V)$
4. Learn a linear projection $W$ with L2 regularization from $\Phi_{\text{sent}}(V)$ to $\Phi_{\text{word}}(V)$
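
Step 4 can be sketched with closed-form ridge regression, $W = (X^\top X + \lambda I)^{-1} X^\top Y$, where $X = \Phi_{\text{sent}}(V)$ and $Y = \Phi_{\text{word}}(V)$. This is only one way to fit an L2-regularized linear map; the solver, the value of $\lambda$, and the function name `fit_ridge_projection` are assumptions, and random arrays stand in for the real embeddings.

```python
import numpy as np

def fit_ridge_projection(sent_emb, word_emb, l2=1.0):
    """Solve min_W ||sent_emb @ W - word_emb||^2 + l2 * ||W||^2 in closed form.

    sent_emb: (K, d_sent) sentence-model embeddings of the K words
    word_emb: (K, d_word) word-vector embeddings of the same K words
    returns W with shape (d_sent, d_word)
    """
    d_sent = sent_emb.shape[1]
    gram = sent_emb.T @ sent_emb + l2 * np.eye(d_sent)
    return np.linalg.solve(gram, sent_emb.T @ word_emb)

# synthetic stand-ins for the two embedding matrices (not real embeddings)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))   # plays the role of the S-BERT embeddings
W_true = rng.normal(size=(8, 5))
Y = X @ W_true                  # plays the role of the word2vec embeddings

W = fit_ridge_projection(X, Y, l2=1e-8)
print(W.shape, np.abs(W - W_true).max())
```

With a larger `l2`, the entries of `W` shrink toward zero, which trades some fit on the $K$ words for a smoother map.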

Now we use $W$ in our classification as an additional transformation to our latent space for both sequence and label embeddings:

$$
\hat{c} = \arg\max_{c \in C} \cos(\Phi_{\text{sent}}(x)W, \Phi_{\text{sent}}(c)W)
$$
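
The classification rule above can be sketched in a few lines of NumPy. The names `cosine_sim` and `classify` are hypothetical, and the toy vectors and projection matrix are placeholders rather than real S-BERT outputs or a learned $W$.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.clip(np.linalg.norm(a, axis=-1, keepdims=True), 1e-12, None)
    b = b / np.clip(np.linalg.norm(b, axis=-1, keepdims=True), 1e-12, None)
    return a @ b.T

def classify(seq_emb, label_embs, W=None):
    """Return (index of best label, similarities), optionally projecting by W first."""
    if W is not None:
        seq_emb, label_embs = seq_emb @ W, label_embs @ W
    sims = cosine_sim(seq_emb[None, :], label_embs)[0]
    return int(np.argmax(sims)), sims

# toy embeddings: one sequence and three candidate labels
seq_emb = np.array([1.0, 0.1, 0.0])
label_embs = np.array([[0.0, 1.0, 0.5],
                       [1.0, 0.0, 0.5],
                       [0.2, 0.2, 1.0]])
W_toy = np.eye(3)[:, :2]  # placeholder projection from 3 to 2 dimensions

idx_plain, sims_plain = classify(seq_emb, label_embs)
idx_proj, sims_proj = classify(seq_emb, label_embs, W=W_toy)
print(idx_plain, idx_proj)
```

Passing `W=None` recovers the unprojected rule, which makes it easy to compare the two variants on the same inputs.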

This procedure can be thought of as a kind of dimensionality reduction. By learning a regularized projection from the S-BERT embeddings to word vectors, the label and sequence representations become better aligned with one another while maintaining the superior performance of S-BERT compared to pooled word vectors. Importantly, this procedure does not require any additional data beyond a word2vec mapping sorted by word frequency.

On Yahoo Answers, we find F1 scores of $46.9$ with this projection step and $31.2$ without it.

## Setting: Low-Resource Languages

Low-resource and cross-lingual learning is a huge research area in NLP right now and much has been written about it, so I'll just link a few great resources:

- Graham Neubig's recently released [Low Resource NLP Bootcamp](https://github.com/neubig/lowresource-nlp-bootcamp-2020) is a GitHub repo containing 8 lectures (plus exercises) focused on NLP in data-scarce languages.

- Sebastian Ruder's blog post, ["A survey of cross-lingual word embedding models"](https://ruder.io/cross-lingual-embeddings/)

