# Train without labels

Almost all available data is unlabeled. Labeling data takes manual review effort, collection time, or both. Zero-shot classification takes an existing large language model and runs a similarity comparison between candidate text and a list of labels, with no task-specific training. This has been shown to perform surprisingly well.
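As a rough illustration of that comparison, the sketch below follows the natural language inference (NLI) formulation commonly used for zero-shot classification, where each label is turned into a hypothesis and scored against the text. The `This example is {}.` template, the sample sentence, and the logits are illustrative assumptions, not output from a real model.

```python
import math

def build_pairs(text, candidates, template="This example is {}."):
    """Pair the text (premise) with one hypothesis per candidate label."""
    return [(text, template.format(label)) for label in candidates]

def rank_labels(entailment_logits):
    """Softmax the per-label entailment logits and sort labels by score."""
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    scores = [(index, value / total) for index, value in enumerate(exps)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

candidates = ["negative", "positive"]
pairs = build_pairs("a warm, funny and well acted film", candidates)

# Hypothetical entailment logits, one per pair; a real NLI model would produce these
ranked = rank_labels([-1.2, 3.4])

print(pairs)
print(candidates[ranked[0][0]])
```

The first element of `ranked` holds the index of the best-matching label and its normalized score.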

The problem with zero-shot classifiers is that they need a large number of parameters (400M+) to perform well on general tasks, which comes with sizable hardware requirements.

This notebook explores using zero-shot classifiers to build training data for smaller models, a simple form of [knowledge distillation](https://en.wikipedia.org/wiki/Knowledge_distillation).

# Install dependencies

Install `txtai` and all dependencies.

In [None]:
%%capture
!pip install git+https://github.com/neuml/txtai datasets pandas

# Apply zero-shot classifier to unlabeled text

The following section takes a small random sample of 1,000 records from the sst2 dataset and applies a zero-shot classifier to the text. The dataset's own labels are ignored. This dataset was chosen only so that accuracy can be evaluated at the end.

In [None]:
import random

from datasets import load_dataset

from txtai.pipeline import Labels

def batch(texts, size):
    return [texts[x : x + size] for x in range(0, len(texts), size)]

# Set random seed for repeatable sampling
random.seed(42)

ds = load_dataset("glue", "sst2")

sentences = random.sample(ds["train"]["sentence"], 1000)

# Load a zero shot classifier - txtai provides this through the Labels pipeline
labels = Labels("microsoft/deberta-large-mnli")

train = []

# Zero-shot prediction using ["negative", "positive"] labels
for chunk in batch(sentences, 32):
    train.extend([{"text": chunk[x], "label": label[0][0]} for x, label in enumerate(labels(chunk, ["negative", "positive"]))])

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
Some weights of the model checkpoint at microsoft/deberta-large-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
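To make the labeling loop above easier to follow, here is a minimal sketch of the same batching and top-label selection on toy data. `fake_labels` is a hypothetical stand-in for the zero-shot pipeline. It assumes the result for each text is a list of `(label index, score)` tuples sorted by descending score, which is what the `label[0][0]` indexing in the cell above implies.

```python
from collections import Counter

def batch(texts, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [texts[x : x + size] for x in range(0, len(texts), size)]

def fake_labels(texts, candidates):
    """Hypothetical stand-in classifier: a keyword rule instead of a model."""
    results = []
    for text in texts:
        positive = 0.9 if "good" in text else 0.1
        scores = [(0, 1 - positive), (1, positive)]
        # Best-scoring label first, mirroring the assumed pipeline output
        results.append(sorted(scores, key=lambda pair: pair[1], reverse=True))
    return results

sentences = ["good fun", "a dull mess", "good cast", "too long", "good score"]

train = []
for chunk in batch(sentences, 2):
    for text, label in zip(chunk, fake_labels(chunk, ["negative", "positive"])):
        # label[0] is the top (index, score) pair; label[0][0] is its index
        train.append({"text": text, "label": label[0][0]})

# Checking the label balance before training is a cheap sanity check
distribution = Counter(row["label"] for row in train)
print(train)
print(distribution)
```

Each training record is a dict with a `text` field and an integer `label`, the same shape the real loop produces.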


Next, we'll use the training set we just built to train a smaller Electra model.

In [None]:
from txtai.pipeline import HFTrainer

trainer = HFTrainer()
model, tokenizer = trainer("google/electra-base-discriminator", train, num_train_epochs=5)

Some weights of the model checkpoint at google/electra-base-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-base-discriminator and are newly initialized: ['classifier.o

Step,Training Loss
500,0.2805


# Evaluating accuracy

Recall that the training set is only 1,000 records. To be clear, an Electra model trained on the full sst2 dataset would perform better than the results below. But for this exercise, we are not using the training labels, in order to simulate labeled data not being available.

First, let's see what the baseline accuracy of the zero-shot model is against the sst2 evaluation set. As a reminder, this model has not seen any of the sst2 training data.


In [None]:
labels = Labels("microsoft/deberta-large-mnli")



In [None]:
results = [row["label"] == labels(row["sentence"], ["negative", "positive"])[0][0] for row in ds["validation"]]
sum(results) / len(ds["validation"])

0.8818807339449541

88.19% accuracy, not bad for a model that has not been trained on the dataset at all! This shows the power of zero-shot classification.

Next, let's test our model trained on the 1000 zero-shot labeled records.

In [None]:
labels = Labels((model, tokenizer), dynamic=False)

results = [row["label"] == labels(row["sentence"])[0][0] for row in ds["validation"]]
sum(results) / len(ds["validation"])

0.8864678899082569

88.65% accuracy! It's best not to get too carried away with the percentages, but this at least matches the accuracy of the zero-shot classifier, if not exceeds it.

This model is now highly tuned for one specific task, but it had the opportunity to learn from all 1,000 records together, whereas the zero-shot classifier views each record independently. It's also much smaller and faster to run.
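To put the gap between the two scores in perspective, the sketch below converts both accuracies into counts of correct predictions and estimates the sampling noise of a single accuracy measurement. It assumes the sst2 validation split has 872 rows, which the notebook does not state but which is consistent with the printed values, and it uses a simple normal-approximation standard error.

```python
import math

n = 872  # assumed size of the sst2 validation split

zero_shot = 0.8818807339449541  # accuracy printed for the zero-shot model
distilled = 0.8864678899082569  # accuracy printed for the trained model

# Convert accuracies back into counts of correct predictions
zero_shot_correct = round(zero_shot * n)
distilled_correct = round(distilled * n)
gap = distilled_correct - zero_shot_correct

# Standard error of a proportion: sqrt(p * (1 - p) / n)
stderr = math.sqrt(zero_shot * (1 - zero_shot) / n)

print(zero_shot_correct, distilled_correct, gap)
print(round(stderr * 100, 2), "percentage points")
```

Under these assumptions the two models differ by only a handful of examples, well inside one standard error, so the safest reading is that they perform about the same on this split.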
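One way to check the speed difference is to time both models on the same inputs. The helper below is a minimal sketch. `time_per_item`, `slow_model` and `fast_model` are hypothetical names, and the two stand-in functions only simulate models of different cost.

```python
import time

def time_per_item(predict, items, repeats=3):
    """Return the best average seconds per item over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for item in items:
            predict(item)
        best = min(best, time.perf_counter() - start)
    return best / len(items)

def slow_model(text):
    """Stand-in for a large model: does more work per call."""
    return sum(len(word) for word in text.split() * 200)

def fast_model(text):
    """Stand-in for a small model: does less work per call."""
    return sum(len(word) for word in text.split())

items = ["a warm, funny and well acted film"] * 50

slow_time = time_per_item(slow_model, items)
fast_time = time_per_item(fast_model, items)
print(slow_time, fast_time)
```

In the notebook, `predict` could be a small wrapper around each of the two `Labels` instances, run on the same sample of validation sentences.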

# Conclusion

This notebook explored a method of building trained text classifiers when no labeled training data is available. Given the amount of resources needed to run large-scale zero-shot classifiers, this method is a simple way to build smaller models tuned for specific tasks. In this example, the zero-shot classifier has 400M parameters and the trained text classifier has 110M.
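As a back-of-the-envelope comparison of the two parameter counts, the sketch below estimates the memory needed for the weights alone, assuming 32-bit floats at 4 bytes per parameter. Activations, optimizer state and framework overhead are not included, so real usage will be higher.

```python
BYTES_PER_FP32 = 4  # assumes weights stored as 32-bit floats

# Parameter counts as stated in the conclusion
models = {
    "zero-shot classifier": 400_000_000,
    "trained classifier": 110_000_000,
}

# Approximate weight size in gigabytes (1 GB = 1e9 bytes)
sizes = {name: params * BYTES_PER_FP32 / 1e9 for name, params in models.items()}

# How many times larger the zero-shot model is
ratio = models["zero-shot classifier"] / models["trained classifier"]

for name, size in sizes.items():
    print(f"{name}: ~{size:.2f} GB of weights")
print(f"size ratio: ~{ratio:.1f}x")
```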