# Training Hyperonym Barba

This notebook shows how to train a Hyperonym Barba model on public and private NLI datasets.

Select the size of the XLM-RoBERTa checkpoint to fine-tune (`base` or `large`):

In [None]:
model_size = "large"

## Install dependencies

Install TensorFlow:

In [None]:
!pip install tensorflow==2.11.1

Install Hugging Face libraries:

In [None]:
!pip install transformers==4.28.1 datasets==2.11.0

## Prepare datasets

In [None]:
from datasets import load_dataset, concatenate_datasets, Features, Value, ClassLabel

Set number of processes to use for parallel operations:

In [None]:
num_proc = 32

A typical NLI model has three output labels: `entailment`, `neutral` and `contradiction`.

Some of the datasets used here (such as RTE and WNLI) only distinguish entailment from non-entailment, so Barba uses two labels, `entailment` and `not_entailment`:

In [None]:
features = Features(
    {
        "hypothesis": Value(dtype="string"),
        "premise": Value(dtype="string"),
        "label": ClassLabel(names=["entailment", "not_entailment"]),
    }
)

Function for removing redundant columns:

In [None]:
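def strip_columns(dataset):
    # `dataset` is a DatasetDict; read the column names from its first split.
    columns = dataset[list(dataset)[0]].column_names
    # Drop every column that is not declared in `features`.
    columns = [col for col in columns if col not in features]
    return dataset.remove_columns(columns)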

Function for squashing `neutral` and `contradiction` into a single label:

In [None]:
def squash_labels(dataset):
    def fn(example):
        if example["label"] == 2:
            example["label"] = 1
        return example
    return dataset.map(fn, features=features, num_proc=num_proc)

All-in-one function for normalizing datasets:

In [None]:
def normalize(dataset):
    dataset = strip_columns(dataset)
    dataset = squash_labels(dataset)
    return dataset

### Load public datasets

#### SNLI

The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE).

In [None]:
snli = load_dataset("snli")

In [None]:
snli = normalize(snli)

#### MNLI

The Multi-Genre Natural Language Inference Corpus is a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts the hypothesis (contradiction), or neither (neutral). The premise sentences are gathered from ten different sources, including transcribed speech, fiction, and government reports. The authors of the benchmark use the standard test set, for which they obtained private labels from the corpus authors, and evaluate on both the matched (in-domain) and mismatched (cross-domain) sections. They also use and recommend the SNLI corpus as 550k examples of auxiliary training data.

In [None]:
mnli = load_dataset("glue", "mnli")

In [None]:
mnli = normalize(mnli)

#### MRPC

The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.

In [None]:
mrpc = load_dataset("glue", "mrpc")

In [None]:
mrpc = mrpc.rename_column("sentence1", "premise")
mrpc = mrpc.rename_column("sentence2", "hypothesis")

In [None]:
mrpc = normalize(mrpc)
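
One thing worth checking before training: assuming GLUE's MRPC encoding (0 = `not_equivalent`, 1 = `equivalent`), paraphrase pairs keep index 1 after the steps above, which is `not_entailment` in Barba's scheme. If equivalent pairs are meant to count as `entailment`, a flip similar to the one applied in the WNLI section would be needed. A minimal sketch on plain dictionaries; `flip_binary_label` is a hypothetical helper and not part of the original notebook:

```python
def flip_binary_label(example):
    """Swap labels 0 and 1; leave any other value (e.g. -1 for unlabeled) untouched."""
    if example["label"] in (0, 1):
        example["label"] = 1 - example["label"]
    return example


# Assumed MRPC encoding: 1 = equivalent, 0 = not_equivalent.
paraphrase = {"premise": "A", "hypothesis": "B", "label": 1}
flipped = flip_binary_label(dict(paraphrase))
print(flipped["label"])  # 0, i.e. entailment in Barba's scheme
```

On the real dataset this could be applied with `mrpc = mrpc.map(flip_binary_label, num_proc=num_proc)`, either before or after `normalize`, since only the integer indices change.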

#### RTE

The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual entailment challenges. The authors of the benchmark combined the data from RTE1 (Dagan et al., 2006), RTE2 (Bar Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5 (Bentivogli et al., 2009). Examples are constructed based on news and Wikipedia text. The authors of the benchmark convert all datasets to a two-class split, where for three-class datasets they collapse neutral and contradiction into not entailment, for consistency.

In [None]:
rte = load_dataset("glue", "rte")

In [None]:
rte = rte.rename_column("sentence1", "premise")
rte = rte.rename_column("sentence2", "hypothesis")

In [None]:
rte = normalize(rte)

#### WNLI

The Winograd Schema Challenge (Levesque et al., 2011) is a reading comprehension task in which a system must read a sentence with a pronoun and select the referent of that pronoun from a list of choices. The examples are manually constructed to foil simple statistical methods: each one is contingent on contextual information provided by a single word or phrase in the sentence. To convert the problem into sentence pair classification, the authors of the benchmark construct sentence pairs by replacing the ambiguous pronoun with each possible referent. The task is to predict whether the sentence with the pronoun substituted is entailed by the original sentence. They use a small evaluation set consisting of new examples derived from fiction books that was shared privately by the authors of the original corpus. While the included training set is balanced between two classes, the test set is imbalanced between them (65% not entailment). Also, due to a data quirk, the development set is adversarial: hypotheses are sometimes shared between training and development examples, so if a model memorizes the training examples, it will predict the wrong label on the corresponding development set example. As with QNLI, each example is evaluated separately, so there is not a systematic correspondence between a model's score on this task and its score on the unconverted original task. The authors of the benchmark call the converted dataset WNLI (Winograd NLI).

In [None]:
wnli = load_dataset("glue", "wnli")

WNLI uses 0 as `not_entailment` and 1 as `entailment`, so we need to adjust the labels:

In [None]:
def wnli_fix(example):
    if example["label"] == 1:
        example["label"] = 0
    elif example["label"] == 0:
        example["label"] = 1
    return example


wnli = wnli.map(wnli_fix, num_proc=num_proc)

In [None]:
wnli = wnli.rename_column("sentence1", "premise")
wnli = wnli.rename_column("sentence2", "hypothesis")

In [None]:
wnli = normalize(wnli)

#### OCNLI

OCNLI stands for Original Chinese Natural Language Inference. It is a corpus for Chinese natural language inference, collected by closely following the procedures of MNLI, but with enhanced strategies aimed at eliciting more challenging inference pairs. Its authors emphasize that no human or machine translation was used in creating the dataset, so the Chinese texts are original rather than translated.

In [None]:
ocnli = load_dataset("clue", "ocnli")

OCNLI uses 0 as `neutral` and 1 as `entailment`, so we need to adjust the labels:

In [None]:
def ocnli_fix(example):
    if example["label"] == 1:
        example["label"] = 0
    elif example["label"] == 0:
        example["label"] = 1
    return example


ocnli = ocnli.map(ocnli_fix, num_proc=num_proc)
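
The swap functions in this notebook hard-code index pairs. An alternative is to derive the mapping from label names, so that each dataset only declares its own label order. A minimal sketch in plain Python; `BARBA_LABELS` and `build_label_map` are hypothetical names, and the OCNLI order is taken from the description above, with `contradiction` assumed to sit at index 2:

```python
BARBA_LABELS = ["entailment", "not_entailment"]


def build_label_map(source_names):
    """Map each source label index to a Barba label index by name."""
    target = {name: i for i, name in enumerate(BARBA_LABELS)}
    mapping = {}
    for index, name in enumerate(source_names):
        # Anything that is not entailment collapses into not_entailment.
        if name == "entailment":
            mapping[index] = target["entailment"]
        else:
            mapping[index] = target["not_entailment"]
    return mapping


ocnli_map = build_label_map(["neutral", "entailment", "contradiction"])
print(ocnli_map)  # {0: 1, 1: 0, 2: 1}
```

Unlike `ocnli_fix`, this performs the swap and the squashing in one step. Looking labels up with `ocnli_map.get(label, label)` would leave unlabeled examples (`-1`) untouched, so the later filtering step can still remove them.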

In [None]:
ocnli = ocnli.rename_column("sentence1", "premise")
ocnli = ocnli.rename_column("sentence2", "hypothesis")

In [None]:
ocnli = normalize(ocnli)

#### JNLI

JNLI is a Japanese version of the NLI (Natural Language Inference) dataset. NLI is a task to recognize the inference relation that a premise sentence has to a hypothesis sentence. The inference relations are entailment, contradiction, and neutral.

In [None]:
jnli = load_dataset("shunk031/JGLUE", "JNLI")

In [None]:
jnli = jnli.rename_column("sentence1", "premise")
jnli = jnli.rename_column("sentence2", "hypothesis")

In [None]:
jnli = normalize(jnli)

#### JaNLI

The JaNLI (Japanese Adversarial NLI) dataset, inspired by the English HANS dataset, is designed to necessitate an understanding of Japanese linguistic phenomena and to illuminate the vulnerabilities of models.

In [None]:
janli = load_dataset("hpprc/janli", "base")

In [None]:
janli = normalize(janli)

#### KLUE-NLI

KLUE is a collection of 8 tasks for evaluating the natural language understanding capability of Korean language models. The 8 tasks are Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking. Only the Natural Language Inference task is used here.

In [None]:
klue_nli = load_dataset("klue", "nli")

In [None]:
klue_nli = normalize(klue_nli)

#### Group public datasets

In [None]:
public_train_datasets = [
    snli["train"],
    mnli["train"],
    mrpc["train"],
    rte["train"],
    wnli["train"],
    ocnli["train"],
    jnli["train"],
    janli["train"],
    klue_nli["train"],
]
public_validation_datasets = [
    snli["validation"],
    mnli["validation_matched"],
    mrpc["validation"],
    rte["validation"],
    wnli["validation"],
    ocnli["validation"],
    jnli["validation"],
    klue_nli["validation"],
]

### Load private datasets

Load private datasets from the `datasets` directory, if it exists:

In [None]:
import os

private_train_datasets = []
private_validation_datasets = []
if os.path.isdir("datasets"):
    try:
        private_dataset = load_dataset("./datasets")
        private_dataset = normalize(private_dataset)
        if "train" in private_dataset:
            private_train_datasets.append(private_dataset["train"])
        if "validation" in private_dataset:
            private_validation_datasets.append(private_dataset["validation"])
    except FileNotFoundError:
        # The directory holds no loadable data files; continue with public data only.
        pass

In [None]:
print(private_train_datasets)
print(private_validation_datasets)
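
The notebook does not spell out the layout of the `datasets` directory. One layout that `load_dataset` should be able to infer splits from is a pair of CSV files named after the splits, with the three columns declared in `features`. The sketch below writes such files into a temporary directory. The file names, the CSV format and the example rows are assumptions for illustration, not the authors' actual private data:

```python
import csv
import os
import tempfile


def write_split(directory, split, rows):
    """Write one split as <split>.csv with the columns used by this notebook."""
    path = os.path.join(directory, f"{split}.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["hypothesis", "premise", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return path


# Labels follow Barba's scheme: 0 = entailment, 1 = not_entailment.
rows = [
    {"hypothesis": "A person is outdoors.", "premise": "A man walks his dog in the park.", "label": 0},
    {"hypothesis": "The man is asleep.", "premise": "A man walks his dog in the park.", "label": 1},
]

example_dir = tempfile.mkdtemp()  # stand-in for ./datasets
train_path = write_split(example_dir, "train", rows)
validation_path = write_split(example_dir, "validation", rows[:1])
print(sorted(os.listdir(example_dir)))  # ['train.csv', 'validation.csv']
```

Because the labels are already in Barba's two-class scheme, `normalize` would only strip extra columns and cast the label column.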

### Concatenate datasets

In [None]:
train_dataset = concatenate_datasets(public_train_datasets + private_train_datasets)
validation_dataset = concatenate_datasets(public_validation_datasets + private_validation_datasets)

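### Filter datasets

Remove examples whose label falls outside Barba's two classes (for example, unlabeled examples marked `-1`) or whose hypothesis or premise is empty: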

In [None]:
def filter_label(dataset):
    def fn(example):
        if example["label"] < 0 or example["label"] > 1:
            return False
        if len(example["hypothesis"]) == 0:
            return False
        if len(example["premise"]) == 0:
            return False
        return True
    return dataset.filter(fn, num_proc=num_proc)

In [None]:
train_dataset = filter_label(train_dataset)
validation_dataset = filter_label(validation_dataset)

### Analyze datasets

Compute class weights to compensate for label imbalance:

In [None]:
class_weight = {0: 0.0, 1: 0.0}
for example in train_dataset:
    class_weight[example["label"]] += 1.0
for i in range(2):
    class_weight[i] = (1 / class_weight[i]) * len(train_dataset) * 0.5

In [None]:
print(class_weight)
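
The loop above amounts to the usual balanced weighting: with `N` training examples and `count[c]` examples of class `c`, the weight is `N / (2 * count[c])`, so the rarer class counts for more in the loss and a perfectly balanced set gives both classes a weight of 1. A small worked sketch on made-up labels; `balanced_class_weight` is a hypothetical helper:

```python
from collections import Counter


def balanced_class_weight(labels, num_classes=2):
    """Return {class: N / (num_classes * count)} for the given label list."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (num_classes * counts[c]) for c in range(num_classes)}


# Toy data: three entailment (0) examples and one not_entailment (1) example.
toy_weights = balanced_class_weight([0, 0, 0, 1])
print(toy_weights)  # {0: 0.6666666666666666, 1: 2.0}
```

As in the original loop, a class with no examples would raise a `ZeroDivisionError`, which is a useful signal that the training data is missing a label.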

### Tokenize datasets

Load pretrained tokenizer for XLM-RoBERTa:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(f"xlm-roberta-{model_size}")

Test tokenization using examples from the [original implementation](https://github.com/facebookresearch/XLM#ii-cross-lingual-language-model-pretraining-xlm):

In [None]:
print(tokenizer("Hello world!"))  # [0, 35378, 8999, 38, 2]
print(tokenizer("你好，世界"))  # [0, 6, 124084, 4, 3221, 2]
print(tokenizer("a", "b", padding="max_length"))  # [0, 10, 2, 2, 876, 2, 1, 1, 1, ..., 1]
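
The third call shows how XLM-RoBERTa lays out a sentence pair: `<s> A </s></s> B </s>`, followed by padding. Reading the special-token ids off the expected outputs above (0 for `<s>`, 2 for `</s>`, 1 for padding), the layout can be reproduced in plain Python. This is an illustrative sketch, not the tokenizer's implementation; `pair_layout` is a hypothetical helper and the short `max_length` is chosen only for display:

```python
BOS, PAD, EOS = 0, 1, 2  # ids read off the expected outputs above


def pair_layout(ids_a, ids_b, max_length):
    """Arrange two token-id lists as <s> A </s></s> B </s> and pad to max_length."""
    ids = [BOS] + ids_a + [EOS, EOS] + ids_b + [EOS]
    if len(ids) > max_length:
        raise ValueError("pair is longer than max_length")
    return ids + [PAD] * (max_length - len(ids))


# 10 and 876 are the ids shown above for "a" and "b".
layout = pair_layout([10], [876], max_length=10)
print(layout)  # [0, 10, 2, 2, 876, 2, 1, 1, 1, 1]
```

In the `tokenize` function below, the hypothesis takes the A slot and the premise takes the B slot.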

In [None]:
def tokenize(dataset):
    def fn(examples):
        return tokenizer(examples["hypothesis"], examples["premise"], truncation="do_not_truncate")
    return dataset.map(fn, batched=True, num_proc=num_proc)

In [None]:
train_dataset = tokenize(train_dataset)
validation_dataset = tokenize(validation_dataset)

In [None]:
def filter_input(dataset):
    def fn(example):
        return len(example["input_ids"]) <= tokenizer.model_max_length
    return dataset.filter(fn, num_proc=num_proc)

In [None]:
train_dataset = filter_input(train_dataset)
validation_dataset = filter_input(validation_dataset)

## Fine-tune model

In [None]:
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

Set hyperparameters based on those reported for [XNLI tasks for XLM-RoBERTa](https://github.com/facebookresearch/fairseq/issues/1367#issuecomment-555609917):

In [None]:
learning_rate = 5e-6
batch_size = 16 * strategy.num_replicas_in_sync
num_epochs = 30
patience = 5

Load pretrained model:

In [None]:
from transformers import TFAutoModelForSequenceClassification

In [None]:
with strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained(f"xlm-roberta-{model_size}", num_labels=2)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate), metrics=["accuracy"])

Convert datasets into TensorFlow format:

In [None]:
tf_train_dataset = model.prepare_tf_dataset(train_dataset, shuffle=True, batch_size=batch_size, tokenizer=tokenizer)
tf_validation_dataset = model.prepare_tf_dataset(validation_dataset, shuffle=False, batch_size=batch_size, tokenizer=tokenizer)

Create callback for early stopping:

In [None]:
callback = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=patience, restore_best_weights=True)
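
With `patience = 5`, training stops once validation accuracy has gone five consecutive epochs without beating its best value, and `restore_best_weights=True` rolls the model back to the weights from that best epoch. A simplified sketch of this rule on a made-up accuracy curve; it mimics only the patience logic and is not Keras's implementation, and `simulate_early_stopping` is a hypothetical helper:

```python
def simulate_early_stopping(val_accuracy, patience):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of per-epoch scores."""
    best = float("-inf")
    best_epoch = -1
    wait = 0
    for epoch, score in enumerate(val_accuracy):
        if score > best:
            # Strict improvement: remember it and reset the patience counter.
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training ends at the last epoch.
    return len(val_accuracy) - 1, best_epoch


curve = [0.80, 0.85, 0.88, 0.87, 0.88, 0.86, 0.87, 0.85, 0.84]
result = simulate_early_stopping(curve, patience=5)
print(result)  # (7, 2)
```

In this toy run, training stops after the eighth epoch (index 7) and the weights from the third epoch (index 2) are kept. Matching the earlier best of 0.88 at index 4 does not reset the counter, because only a strict improvement counts.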

Fine-tune the pretrained model:

In [None]:
model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=num_epochs,
    class_weight=class_weight,
    callbacks=[callback],
)

## Save model

In [None]:
%mkdir -p models

In [None]:
model.save_pretrained(f"models/barba-{model_size}")

In [None]:
tokenizer.save_pretrained(f"models/barba-{model_size}")