# Training Hyperonym Barba

This Colab notebook contains instructions on how to train a Hyperonym Barba model with public and private NLI datasets.

## Mount Google Drive

Mount Google Drive to the local file system:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Change the working directory to a project folder on Google Drive:

In [None]:
%mkdir -p /content/drive/MyDrive/hyperonym/barba
%cd /content/drive/MyDrive/hyperonym/barba

/content/drive/MyDrive/hyperonym/barba


## Install dependencies

Install TensorFlow:

In [None]:
!pip install tensorflow==2.11.0

Install Hugging Face libraries:

In [None]:
!pip install transformers datasets

Check whether a GPU is available:

In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

'/device:GPU:0'

In [None]:
!nvidia-smi

Thu Jan 19 04:03:18 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    52W / 400W |    650MiB / 40536MiB |      7%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               

## Prepare datasets

In [None]:
from datasets import load_dataset, concatenate_datasets, Features, Value, ClassLabel

Set number of processes to use for parallel operations:

In [None]:
num_proc = 12

A typical NLI model has three output labels: `entailment`, `neutral` and `contradiction`.

To support various private datasets, Barba uses only two labels, `entailment` and `not_entailment`:

In [None]:
features = Features({
  'hypothesis': Value(dtype='string'),
  'premise': Value(dtype='string'),
  'label': ClassLabel(names=['entailment', 'not_entailment'])
})

Function for removing redundant columns:

In [None]:
def strip_columns(dataset):
  columns = dataset[list(dataset)[0]].column_names
  columns = [col for col in columns if col not in features]
  return dataset.remove_columns(columns)

Function for squashing `neutral` (label 1) and `contradiction` (label 2) into the single `not_entailment` label (1):

In [None]:
def squash_labels(dataset):
  def fn(example):
    if example['label'] == 2:
      example['label'] = 1
    return example
  return dataset.map(fn, features=features, num_proc=num_proc)
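
To make the mapping concrete, here is a minimal pure-Python sketch of the same rule applied to plain dictionaries rather than a `datasets` object. It assumes the three-way convention 0 = `entailment`, 1 = `neutral`, 2 = `contradiction`; the helper name `squash_label` and the sample rows are hypothetical and only serve as an illustration.

```python
# Two-way label names used by Barba, in index order.
NAMES = ['entailment', 'not_entailment']

def squash_label(label):
    """Map a three-way NLI label index to the two-way scheme.

    Assumes 0 = entailment, 1 = neutral, 2 = contradiction.
    Any other value (e.g. -1 for unlabeled rows) passes through unchanged.
    """
    return 1 if label == 2 else label

# Hypothetical sample rows, for illustration only.
rows = [{'label': 0}, {'label': 1}, {'label': 2}, {'label': -1}]
squashed = [squash_label(row['label']) for row in rows]
print(squashed)  # [0, 1, 1, -1]
```

Rows whose label is still outside 0 and 1 after this step (such as -1) are left for the later filtering step to discard.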

### Load public datasets

#### SNLI (Stanford Natural Language Inference)

The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE).

In [None]:
snli = load_dataset('snli')

In [None]:
snli = strip_columns(snli)

In [None]:
snli = squash_labels(snli)

#### XNLI (Cross-lingual Natural Language Inference)

The Cross-lingual Natural Language Inference (XNLI) corpus is the extension of the Multi-Genre NLI (MultiNLI) corpus to 15 languages. The dataset was created by manually translating the validation and test sets of MultiNLI into each of those 15 languages. The English training set was machine translated for all languages.

In [None]:
xnli_zh = load_dataset('xnli', 'zh')

The Chinese subset of XNLI has whitespace between characters, which we need to strip before tokenization:

In [None]:
def xnli_zh_fix(example):
  example['premise'] = example['premise'].replace(' ', '')
  example['hypothesis'] = example['hypothesis'].replace(' ', '')
  return example
xnli_zh = xnli_zh.map(xnli_zh_fix, num_proc=num_proc)
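
The effect of the fix can be checked on a plain string. The sketch below mirrors the `replace(' ', '')` call; the helper name `strip_spaces` and both sample sentences are hypothetical. It also shows a caveat worth keeping in mind: spaces between embedded Latin-script words are removed as well.

```python
def strip_spaces(text):
    """Remove every ASCII space, mirroring the replace(' ', '') call above."""
    return text.replace(' ', '')

# Hypothetical space-separated sentence, for illustration only.
segmented = '你 好 ， 世 界'
print(strip_spaces(segmented))  # 你好，世界

# Caveat: spaces between embedded Latin-script words disappear too.
mixed = '我 喜欢 New York'
print(strip_spaces(mixed))  # 我喜欢NewYork
```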

In [None]:
xnli_zh = strip_columns(xnli_zh)

In [None]:
xnli_zh = squash_labels(xnli_zh)

#### MultiNLI (Multi-Genre Natural Language Inference)

The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text and supports a distinctive cross-genre generalization evaluation. The corpus served as the basis for the shared task of the RepEval 2017 Workshop at EMNLP in Copenhagen.

In [None]:
mnli = load_dataset('multi_nli')

In [None]:
mnli = strip_columns(mnli)

In [None]:
mnli = squash_labels(mnli)

#### OCNLI (Original Chinese Natural Language Inference)

OCNLI stands for Original Chinese Natural Language Inference. It is a corpus for Chinese natural language inference, collected by closely following the procedures of MNLI but with enhanced strategies aimed at producing more challenging inference pairs. The dataset authors emphasize that no human or machine translation was used in creating it, so the Chinese texts are original rather than translated.

In [None]:
ocnli = load_dataset('clue', 'ocnli')

OCNLI uses 0 as `neutral` and 1 as `entailment`, so we need to adjust the labels:

In [None]:
def ocnli_fix(example):
  if example['label'] == 1:
    example['label'] = 0
  elif example['label'] == 0:
    example['label'] = 1
  return example
ocnli = ocnli.map(ocnli_fix, num_proc=num_proc)
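
The swap can also be expressed as a lookup between two label orders, which makes the intent easier to verify. This is a sketch only: the names `OCNLI_NAMES`, `TARGET_NAMES` and `remap` are hypothetical, and placing `contradiction` at index 2 on the OCNLI side is an assumption inferred from the fix above leaving that index untouched.

```python
# Label order in OCNLI as described above (index 2 is an assumption).
OCNLI_NAMES = ['neutral', 'entailment', 'contradiction']
# Three-way order assumed by the rest of the notebook.
TARGET_NAMES = ['entailment', 'neutral', 'contradiction']

def remap(label):
    """Translate an OCNLI label index into the target label order."""
    if label < 0:
        # Unlabeled rows are passed through untouched.
        return label
    return TARGET_NAMES.index(OCNLI_NAMES[label])

print([remap(i) for i in (0, 1, 2, -1)])  # [1, 0, 2, -1]
```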

In [None]:
ocnli = ocnli.rename_column('sentence1', 'premise')
ocnli = ocnli.rename_column('sentence2', 'hypothesis')

In [None]:
ocnli = strip_columns(ocnli)

In [None]:
ocnli = squash_labels(ocnli)

#### ANLI (Adversarial Natural Language Inference)

Adversarial Natural Language Inference (ANLI) is a large-scale NLI benchmark dataset collected via an iterative, adversarial human-and-model-in-the-loop procedure. ANLI is much more difficult than its predecessors, including SNLI and MNLI. It contains three rounds, each with train/dev/test splits.

In [None]:
anli = load_dataset('anli')

In [None]:
anli = strip_columns(anli)

In [None]:
anli = squash_labels(anli)

#### Group public datasets

In [None]:
public_train_datasets = [
  snli['train'],
  xnli_zh['train'],
  mnli['train'],
  ocnli['train'],
  anli['train_r1']
]
public_validation_datasets = [
  snli['validation'],
  xnli_zh['validation'],
  mnli['validation_matched'],
  ocnli['validation'],
  anli['dev_r1']
]

### Load private datasets

Try to load private datasets in the `datasets` directory:

In [None]:
import os
private_train_datasets = []
private_validation_datasets = []
if os.path.isdir('datasets'):
  try:
    private_dataset = load_dataset('./datasets')
    private_dataset = strip_columns(private_dataset)
    private_dataset = squash_labels(private_dataset)
    if 'train' in private_dataset:
      private_train_datasets.append(private_dataset['train'])
    if 'validation' in private_dataset:
      private_validation_datasets.append(private_dataset['validation'])
  except FileNotFoundError:
    pass
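
The notebook does not specify how the private files should be laid out. As one possible arrangement, the sketch below writes JSON Lines files named after their splits, on the assumption that `load_dataset` infers the `train` and `validation` splits from file names such as `train.jsonl` and `validation.jsonl`. The helper `write_jsonl_splits`, the file names and the sample rows are all hypothetical; labels are integers so that the label squashing applies unchanged.

```python
import json
import os
import tempfile

# Hypothetical examples; labels use 0 = entailment, 1 = not_entailment.
splits = {
    'train': [
        {'premise': 'A man is sleeping.', 'hypothesis': 'A person rests.', 'label': 0},
        {'premise': 'A man is sleeping.', 'hypothesis': 'A man is running.', 'label': 1},
    ],
    'validation': [
        {'premise': 'It is raining.', 'hypothesis': 'The weather is wet.', 'label': 0},
    ],
}

def write_jsonl_splits(directory, splits):
    """Write one <split>.jsonl file per split and return the file paths."""
    os.makedirs(directory, exist_ok=True)
    paths = []
    for split, examples in splits.items():
        path = os.path.join(directory, f'{split}.jsonl')
        with open(path, 'w', encoding='utf-8') as f:
            for example in examples:
                f.write(json.dumps(example, ensure_ascii=False) + '\n')
        paths.append(path)
    return paths

# A temporary directory is used here; in the notebook the target would be 'datasets'.
demo_dir = os.path.join(tempfile.mkdtemp(), 'datasets')
paths = write_jsonl_splits(demo_dir, splits)
print(sorted(os.path.basename(p) for p in paths))  # ['train.jsonl', 'validation.jsonl']
```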

### Concatenate datasets

In [None]:
train_dataset = concatenate_datasets(public_train_datasets + private_train_datasets)
validation_dataset = concatenate_datasets(public_validation_datasets + private_validation_datasets)

### Filter datasets

In [None]:
def filter_examples(dataset):
  # Keep only rows with a valid two-way label (0 or 1) and non-empty texts.
  # Named filter_examples to avoid shadowing Python's built-in filter().
  def fn(example):
    if example['label'] < 0 or example['label'] > 1:
      return False
    if len(example['hypothesis']) == 0:
      return False
    if len(example['premise']) == 0:
      return False
    return True
  return dataset.filter(fn, num_proc=num_proc)

In [None]:
train_dataset = filter_examples(train_dataset)
validation_dataset = filter_examples(validation_dataset)
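
The filtering rule can be exercised on a few plain dictionaries before running it over the full datasets. This is a sketch under stated assumptions: the predicate name `is_valid` and the sample rows are hypothetical, and labels are assumed to be integers.

```python
def is_valid(example):
    """Return True when the row has a two-way label and non-empty texts."""
    if example['label'] not in (0, 1):
        return False
    return len(example['hypothesis']) > 0 and len(example['premise']) > 0

# Hypothetical rows, for illustration only.
rows = [
    {'premise': 'A man is sleeping.', 'hypothesis': 'A person rests.', 'label': 0},
    {'premise': 'A man is sleeping.', 'hypothesis': 'A person rests.', 'label': -1},  # unlabeled
    {'premise': '', 'hypothesis': 'A person rests.', 'label': 1},  # empty premise
]
kept = [row for row in rows if is_valid(row)]
print(len(kept))  # 1
```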

### Tokenize datasets

Load pretrained tokenizer for XLM-RoBERTa:

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')

Test tokenization using examples from the [original implementation](https://github.com/facebookresearch/XLM#ii-cross-lingual-language-model-pretraining-xlm):

In [None]:
print(tokenizer('Hello world!')) # [0, 35378, 8999, 38, 2]
print(tokenizer('你好，世界')) # [0, 6, 124084, 4, 3221, 2]
print(tokenizer('a', 'b', padding='max_length')) # [0, 10, 2, 2, 876, 2, 1, 1, 1, ..., 1]

{'input_ids': [0, 35378, 8999, 38, 2], 'attention_mask': [1, 1, 1, 1, 1]}
{'input_ids': [0, 6, 124084, 4, 3221, 2], 'attention_mask': [1, 1, 1, 1, 1, 1]}
{'input_ids': [0, 10, 2, 2, 876, 2, 1, 1, 1, ... (output truncated)

In [None]:
def tokenize(dataset):
  def fn(examples):
    return tokenizer(examples['hypothesis'], examples['premise'], truncation='only_second')
  return dataset.map(fn, batched=True, num_proc=num_proc)
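
Because the hypothesis is passed first and the premise second, `truncation='only_second'` shortens only the premise when a pair is too long. The sketch below illustrates that idea on lists of token ids. It is a simplification and not the tokenizer's actual implementation: the function `encode_pair` is hypothetical, the special token ids (0, 1, 2) and the pair layout are read off the tokenizer output shown earlier, and the ids 877 and 878 are made up.

```python
# Special token ids as they appear in the tokenizer output above.
BOS, PAD, EOS = 0, 1, 2

def encode_pair(first_ids, second_ids, max_length):
    """Build <s> first </s></s> second </s>, trimming only the second segment."""
    # A pair uses four special tokens; the rest of the budget goes to the text.
    budget = max_length - 4 - len(first_ids)
    if budget < 0:
        raise ValueError('first sequence alone exceeds max_length')
    return [BOS] + first_ids + [EOS, EOS] + second_ids[:budget] + [EOS]

# Enough room: nothing is truncated.
print(encode_pair([10], [876], max_length=512))  # [0, 10, 2, 2, 876, 2]
# Tight limit: only the second segment loses tokens.
print(encode_pair([10], [876, 877, 878], max_length=6))  # [0, 10, 2, 2, 876, 2]
```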

In [None]:
train_dataset = tokenize(train_dataset)
validation_dataset = tokenize(validation_dataset)

## Fine-tune model

Set hyperparameters based on [XNLI tasks for XLM-RoBERTa](https://github.com/facebookresearch/fairseq/issues/1367#issuecomment-555609917):

In [None]:
learning_rate = 7.5e-6
batch_size = 16
num_epochs = 3

Load pretrained model:

In [None]:
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained('xlm-roberta-base', num_labels=2)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate), metrics=['accuracy'])
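
With `num_labels=2`, the classification head produces two logits per input pair, one per label. The sketch below shows how such logits could be turned into probabilities and a label name, assuming the index order matches the `ClassLabel` defined earlier. The logit values are hypothetical and the `softmax` helper is written out in plain Python for illustration.

```python
import math

# Index order assumed to match the ClassLabel defined earlier.
LABELS = ['entailment', 'not_entailment']

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a single hypothesis/premise pair.
logits = [2.0, -1.0]
probs = softmax(logits)
prediction = LABELS[probs.index(max(probs))]
print(prediction)  # entailment
```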

Convert datasets into TensorFlow format:

In [None]:
tf_train_dataset = model.prepare_tf_dataset(train_dataset, shuffle=True, batch_size=batch_size, tokenizer=tokenizer)
tf_validation_dataset = model.prepare_tf_dataset(validation_dataset, shuffle=False, batch_size=batch_size, tokenizer=tokenizer)

Create callback for early stopping:

In [None]:
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=1)
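
With `patience=1`, training stops as soon as one epoch fails to improve on the best validation loss seen so far. The following pure-Python sketch simulates that rule. It is a simplified model of the callback's default behavior, not Keras code; the function `stopping_epoch` and the loss values are hypothetical.

```python
def stopping_epoch(val_losses, patience=1):
    """Return the 1-based epoch after which training would stop, or None."""
    best = float('inf')
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Validation loss rises in epoch 3, so training stops there.
print(stopping_epoch([0.50, 0.45, 0.47]))  # 3
# Validation loss keeps improving, so all epochs run.
print(stopping_epoch([0.50, 0.45, 0.40]))  # None
```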

Fine-tune the pretrained model:

In [None]:
model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=num_epochs, callbacks=[callback])

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fd9c7c31bb0>

## Save model

In [None]:
%mkdir -p models

Save the trained model to Google Drive:

In [None]:
tf.saved_model.save(model, 'models/barba')

Flush and unmount Google Drive:

In [None]:
drive.flush_and_unmount()