# Modern NLP with Little to No Annotated Data




Natural language processing is a task-rich area and enumerating every type of low-resource learning technique for all likely tasks of interest is impractical. Instead, I will focus primarily on sequence classification but describe techniques applicable to a wide variety of data (in)availability situations. My hope is that these methods will be useful for some and, for most, will inspire creativity in leveraging pre-trained models in low-resource settings.

Settings (focus on classification):
- No annotated data is available
    - NLI Model
    - SBERT2Wordvec
- Data is available for some labels, but missing for others
    - NLI Model
    - Align SBERT2Wordvec
- No annotated data is available, but plenty of unannotated data is
    - Semi-supervised learning
- Some annotated data is available, but not enough to learn a good classifier
    - Few shot, sample efficiency
- Data is available, but not in the language I want
    - Cross-lingual alignment techniques (see Sebastian Ruder's post, linked at the end)

## Setting: No Training Data is Available

#### Background: Natural Language Inference (NLI)

Several of the methods described below use Natural Language Inference (NLI) as a pre-training step, so here is a quick review. NLI considers two sentences: a "premise" and a "hypothesis". The task is to determine whether, given the premise, the hypothesis is true (entailment), false (contradiction), or undetermined (neutral).

![example NLI sentences](https://i.ibb.co/gWCjvdP/Screen-Shot-2020-05-26-at-5-10-07-PM.png "Examples from http://nlpprogress.com/english/natural_language_inference.html")

When using transformer architectures like BERT, NLI datasets are typically modeled via _sequence-pair classification_. That is, we feed both the premise and the hypothesis through the model together as distinct segments and learn a classification head predicting one of `[contradiction, neutral, entailment]`.
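
To make the "distinct segments" idea concrete, here is a minimal sketch of how a BERT-style model lays out a premise/hypothesis pair. The helper `pack_pair` and the whitespace tokenization are hypothetical stand-ins for a real tokenizer, and other architectures (including the BART model used later in this post) use different special tokens.

```python
# Label order assumed to match the list given in the text above.
NLI_LABELS = ["contradiction", "neutral", "entailment"]

def pack_pair(premise_tokens, hypothesis_tokens):
    """Lay out a sentence pair as [CLS] premise [SEP] hypothesis [SEP].

    Returns the token list plus segment ids: 0 for the premise part
    (including [CLS] and the first [SEP]) and 1 for the hypothesis part.
    """
    tokens = ["[CLS]"] + premise_tokens + ["[SEP]"] + hypothesis_tokens + ["[SEP]"]
    segment_ids = [0] * (len(premise_tokens) + 2) + [1] * (len(hypothesis_tokens) + 1)
    return tokens, segment_ids

# Whitespace splitting stands in for a real subword tokenizer.
tokens, segment_ids = pack_pair("a man is sleeping".split(),
                                "a person rests".split())
```

The classification head then reads the representation of this packed sequence and outputs one score per entry in `NLI_LABELS`.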

### A ready-made zero-shot classifier

Recently, [Yin et al. (2019)](https://arxiv.org/abs/1909.00161) proposed a method which uses a pre-trained MNLI sequence-pair classifier as an out-of-the-box zero-shot text classifier that actually works pretty well.

The idea is to take the sequence we're interested in labeling as the "premise" and to turn each candidate label into a "hypothesis." If the model says that the premise "entails" the hypothesis, we take the label to be true. This gives us a ready-made compatibility function that works reasonably well on certain tasks without any task-specific training. The code snippet below shows how easily this can be done with 🤗 Transformers.

In [0]:
# load model pretrained on MNLI
from transformers import BartForSequenceClassification, BartTokenizer
# the checkpoint is published on the model hub as 'facebook/bart-large-mnli'
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli')
model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')

# pose sequence as a NLI premise and label (politics) as a hypothesis
premise = 'Who are you voting for in 2020?'
hypothesis = 'This text is about politics.'

# run through model pre-trained on MNLI
input_ids = tokenizer.encode(premise, hypothesis, return_tensors='pt')
logits = model(input_ids)[0]

# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true 
entail_contradiction_logits = logits[:,[0,2]]
probs = entail_contradiction_logits.softmax(dim=1)
true_prob = probs[:,1].item() * 100
print(f'Probability that the label is true: {true_prob:0.2f}%')

Probability that the label is true: 99.04%
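
The snippet above scores a single candidate label. As a rough sketch of how the same scoring might extend to several candidates, the code below ranks labels by their entailment probability. The logits are made-up numbers standing in for real model outputs, and the helper names (`make_hypothesis`, `entailment_probability`, `rank_labels`) are hypothetical, not part of 🤗 Transformers.

```python
import math

def make_hypothesis(label, template="This text is about {}."):
    """Turn a candidate label into an NLI hypothesis string."""
    return template.format(label)

def entailment_probability(logits):
    """P(entailment) after dropping the neutral logit.

    `logits` is assumed to be ordered [contradiction, neutral, entailment].
    """
    contradiction, _neutral, entailment = logits
    # numerically stable two-way softmax over (contradiction, entailment)
    m = max(contradiction, entailment)
    e_c = math.exp(contradiction - m)
    e_e = math.exp(entailment - m)
    return e_e / (e_c + e_e)

def rank_labels(label_logits):
    """Sort candidate labels by entailment probability, best first."""
    scored = {label: entailment_probability(logits)
              for label, logits in label_logits.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# Made-up logits, one triple per candidate label (NOT real model output).
example_logits = {
    "politics": [-2.0, 0.5, 3.0],
    "business": [1.0, 0.2, -1.0],
    "art & culture": [2.5, 0.1, -2.0],
}
ranking = rank_labels(example_logits)
```

In practice each triple would come from running the premise together with `make_hypothesis(label)` through the MNLI model, exactly as in the single-label snippet.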


In the paper, the authors report an F1 of $37.9$ on Yahoo Answers using the smallest version of BERT fine-tuned only on the Multi-genre NLI (MNLI) corpus. By simply using the larger and more recent Bart model pre-trained on MNLI, we were able to bring this number up to $53.7$. For context, Yahoo Answers has 10 classes and [supervised models](https://paperswithcode.com/sota/text-classification-on-yahoo-answers) get an accuracy of just over $70\%$.

Of course, this number can be improved when some data is available for training. In addition to the extreme fully unsupervised setting, the authors consider a setup which corresponds to the traditional _generalized zero-shot learning_ setting where only a subset of the dataset's labels are available during training. The model is then evaluated on all labels together, both seen and unseen, at test time.

See [our live demo here](http://35.208.71.201:8000/) to try it out for yourself! Enter a sequence you want to classify and any labels of interest and watch Bart do its magic in real time.

![live demo](https://i.ibb.co/WB6HsFk/Screen-Shot-2020-05-26-at-5-31-25-PM.png)

### A Latent Embedding Approach

A slightly less effective but more flexible approach is to embed both the sequence and the class names of interest into the same representation space and then simply select the label closest in latent space.

This is a well-known technique in zero-shot learning for computer vision: take the word vectors for each class name and a latent representation of an image, and project them into the same space. Learning this projection requires data for some labels, but allows you to generalize to unseen labels at test time.

We found that in the text regime, we can follow the same procedure but without the need for any annotated data ahead of time. By simply using a single sentence representation model, we can embed both the sequences to classify and the candidate labels into the same latent space.

$$
\hat{c} = \arg\max_{c \in C} \cos(\Phi(x), \Phi(c))
$$

where $\Phi$ is a sentence-level embedding model, $x$ is a sequence, and $C$ is a set of class labels.

Here's an example code snippet using the Sentence-BERT method:

In [0]:
# load the sentence-bert model from the HuggingFace model hub
from transformers import AutoTokenizer, AutoModel
from torch.nn import functional as F
tokenizer = AutoTokenizer.from_pretrained("deepset/sentence_bert")
model = AutoModel.from_pretrained("deepset/sentence_bert")

sentence = 'Who are you voting for in 2020?'
labels = ['politics', 'business', 'art & culture']

# run inputs through model and mean over the sequence dimension
# to get sentence-level representations
# pad every input to the length of the longest one in the batch
inputs = tokenizer([sentence] + labels,
                   return_tensors='pt',
                   padding=True)
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
output = model(input_ids, attention_mask=attention_mask)[0]
sentence_rep = output[:1].mean(dim=1)
label_reps = output[1:].mean(dim=1)

# now find the labels with the highest cosine similarities to
# the sentence
similarities = F.cosine_similarity(sentence_rep, label_reps)
closest = similarities.argsort(descending=True)
for ind in closest:
    print(f'label: {labels[ind]},\t cos: {similarities[ind]:0.4f}')

label: politics,	 cos: 0.2156
label: business,	 cos: 0.0045
label: art & culture,	 cos: -0.0274
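
Note that the snippet above averages over every position, including the padding added to the shorter label strings. One common variant, which is not what produced the numbers above, weights the average by the attention mask so that only real tokens contribute. Here is a minimal NumPy sketch with toy arrays; `masked_mean` is a hypothetical helper name.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """Average token embeddings over real (non-padding) tokens only.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against all-padding rows
    return summed / counts

# Toy batch: 2 sequences, 3 positions, hidden size 2.
# The last position of the second sequence is padding and should be ignored.
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 1],
                 [1, 1, 0]])
pooled = masked_mean(emb, mask)
```

The same computation carries over to PyTorch tensors by swapping `np.clip` for `clamp` and `axis` for `dim`.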


One downside to this method is that Sentence-BERT is designed to learn sentence-level representations, not the single- or multi-word representations our class names call for. It is therefore reasonable to suppose that our label embeddings may not be as semantically salient as those produced by popular word-level embedding methods. But we can't use word vectors directly, because then we would have to align our S-BERT sequence embeddings with the word-vector label embeddings.

We addressed this issue with the following procedure:

1. Take the $10{,}000$ most frequent words $V$ in skipgram's vocabulary
2. Obtain embeddings for each word using skipgram $\Phi_{\text{word}}(V)$
3. Obtain embeddings for each word using S-BERT $\Phi_{\text{sent}}(V)$
4. Learn a linear projection $W$ with L2 regularization from $\Phi_{\text{sent}}(V)$ to $\Phi_{\text{word}}(V)$

Now we apply $W$ as an additional transformation to our latent space for both sequence and label embeddings:

$$
\hat{c} = \arg\max_{c \in C} \cos(\Phi_{\text{sent}}(x)W, \Phi_{\text{sent}}(c)W)
$$
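
The procedure above can be sketched with a closed-form ridge regression. This is an illustration under stated assumptions rather than the exact code behind the post: the random matrices stand in for $\Phi_{\text{sent}}(V)$ and $\Phi_{\text{word}}(V)$, and `fit_ridge_projection`, `classify`, and the regularization strength `l2` are hypothetical names and values.

```python
import numpy as np

def fit_ridge_projection(sent_embs, word_embs, l2=1.0):
    """Solve min_W ||sent_embs @ W - word_embs||^2 + l2 * ||W||^2.

    Closed form: W = (X^T X + l2 * I)^{-1} X^T Y.
    """
    d = sent_embs.shape[1]
    gram = sent_embs.T @ sent_embs + l2 * np.eye(d)
    return np.linalg.solve(gram, sent_embs.T @ word_embs)

def classify(x_emb, label_embs, W):
    """Return the index of the label whose projected embedding has the
    highest cosine similarity to the projected sequence embedding."""
    x_proj = x_emb @ W
    labels_proj = label_embs @ W
    sims = labels_proj @ x_proj / (
        np.linalg.norm(labels_proj, axis=1) * np.linalg.norm(x_proj))
    return int(np.argmax(sims))

# Synthetic stand-ins for the real embeddings (assumption, not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))    # plays the role of Phi_sent(V)
true_W = rng.normal(size=(8, 5))
Y = X @ true_W                   # plays the role of Phi_word(V)
W = fit_ridge_projection(X, Y, l2=1e-6)

label_embs = rng.normal(size=(3, 8))  # sentence-model embeddings of 3 labels
pred = classify(label_embs[1], label_embs, W)
```

With real embeddings, `l2` would typically be tuned, since it controls how strongly the projection is pulled toward zero when the two spaces do not align cleanly.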



## Setting: Low-Resource Languages

Low-resource and cross-lingual learning is a huge research area in NLP right now and much has been written about it, so I'll just link a few great resources:

- Graham Neubig's recently released [Low Resource NLP Bootcamp](https://github.com/neubig/lowresource-nlp-bootcamp-2020).

> twitter: https://twitter.com/gneubig/status/1265644923153514496

- Sebastian Ruder's blog post, ["A survey of cross-lingual word embedding models"](https://ruder.io/cross-lingual-embeddings/)

