In [1]:
!pip install evaluate --quiet

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import os, re, random, datasets, evaluate

import tensorflow_hub as hub
import tensorflow_text as text

pd.set_option('display.max_colwidth', None)
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer



This notebook is an adaptation of a tutorial that I found [here](https://www.kaggle.com/code/rajkumarl/nlp-tutorial-fine-tuning-with-trainer-api/notebook). I've been looking on YouTube and Kaggle for tutorials that fine-tune Hugging Face models to classify text using **transformers**, and I found the one linked above really helpful. This serves as a way to try out and learn **AutoTokenizer**, **AutoModelForSequenceClassification**, **DataCollatorWithPadding**, **TrainingArguments**, and **Trainer**, among many others.

# Import and Preprocess Data Set

In [3]:
train = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/train.csv')
test_df = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/test.csv')

In [13]:
train['text'] = train.premise + " [SEP] " + train.hypothesis
test_df['text'] = test_df.premise + " [SEP] " + test_df.hypothesis

In [14]:
train_df, val_df = np.split(train.sample(frac = 1), [int(0.8 * len(train))])
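The shuffle-and-cut split above draws a different shuffle on every run. A minimal sketch of the same 80/20 split with an explicit seed, so the same rows land in each split every time. The helper name `split_indices` and the seed value are illustrative and not part of the original notebook:

```python
import numpy as np

def split_indices(n_rows, train_frac=0.8, seed=42):
    """Return shuffled (train_idx, val_idx) index arrays for n_rows rows."""
    rng = np.random.default_rng(seed)   # seeded generator for reproducibility
    perm = rng.permutation(n_rows)      # random ordering of 0..n_rows-1
    cut = int(train_frac * n_rows)      # same cut point as int(0.8 * len(train))
    return perm[:cut], perm[cut:]

# 12120 is the combined train + val row count reported in this notebook
train_idx, val_idx = split_indices(12120)
print(len(train_idx), len(val_idx))
```

The index arrays could then be applied with `train.iloc[train_idx]` and `train.iloc[val_idx]`.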

In [15]:
train_dict = datasets.Dataset.from_dict(train_df.to_dict(orient="list"))
val_dict = datasets.Dataset.from_dict(val_df.to_dict(orient="list"))
test_dict = datasets.Dataset.from_dict(test_df.to_dict(orient="list"))

In [16]:
contradiction_ds = datasets.DatasetDict({"train": train_dict, "val": val_dict, "test": test_dict})

In [17]:
contradiction_ds

DatasetDict({
    train: Dataset({
        features: ['id', 'premise', 'hypothesis', 'lang_abv', 'language', 'label', 'text'],
        num_rows: 9696
    })
    val: Dataset({
        features: ['id', 'premise', 'hypothesis', 'lang_abv', 'language', 'label', 'text'],
        num_rows: 2424
    })
    test: Dataset({
        features: ['id', 'premise', 'hypothesis', 'lang_abv', 'language', 'text'],
        num_rows: 5195
    })
})

# Pre-trained Model

We are going to find a model on Hugging Face that is well suited to the classification task at hand. We want a model that (1) works on texts written in multiple languages and (2) is fine-tuned on Natural Language Inference (NLI) tasks. A model that meets both criteria and has been shown to perform moderately well for this task is [symanto/xlm-roberta-base-snli-mnli-anli-xnli](https://huggingface.co/symanto/xlm-roberta-base-snli-mnli-anli-xnli).

In [30]:
model_name = 'symanto/xlm-roberta-base-snli-mnli-anli-xnli'
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)


In [31]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels = 3)


## Tokenize the Data Set

In [32]:
def tokenize_function(dataset):
    return tokenizer(dataset['text'], truncation=True)

tokenized_data = contradiction_ds.map(tokenize_function, batched=True)


In [21]:
tokenized_data

DatasetDict({
    train: Dataset({
        features: ['id', 'premise', 'hypothesis', 'lang_abv', 'language', 'label', 'text', 'input_ids'],
        num_rows: 9696
    })
    val: Dataset({
        features: ['id', 'premise', 'hypothesis', 'lang_abv', 'language', 'label', 'text', 'input_ids'],
        num_rows: 2424
    })
    test: Dataset({
        features: ['id', 'premise', 'hypothesis', 'lang_abv', 'language', 'text', 'input_ids'],
        num_rows: 5195
    })
})
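One thing worth trying: the `text` column joins the two sentences with the literal string `[SEP]`, which is BERT's separator. As far as I know, the XLM-RoBERTa tokenizer uses `</s>` as its separator, so `[SEP]` is treated as ordinary text. A hedged alternative is to hand the tokenizer the premise and hypothesis as a pair and let it add its own separators. The helper name `make_pair_tokenize_fn` is made up for this sketch, and I have not measured whether it changes accuracy here:

```python
def make_pair_tokenize_fn(tokenizer):
    """Build a function for Dataset.map that encodes premise/hypothesis as a pair."""
    def tokenize_pairs(batch):
        # Passing two text arguments lets the tokenizer insert its own
        # separator tokens between premise and hypothesis.
        return tokenizer(batch["premise"], batch["hypothesis"], truncation=True)
    return tokenize_pairs

# Hypothetical usage with the objects defined in this notebook:
# tokenized_pairs = contradiction_ds.map(make_pair_tokenize_fn(tokenizer), batched=True)
```

Wrapping the tokenizer in a closure keeps the mapped function free of global state, which makes it easy to swap in a different tokenizer later.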

In [33]:
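# Drop the raw text columns; the model only needs the token ids and attention mask
tokenized_data = tokenized_data.remove_columns(['premise','hypothesis', 'lang_abv', 'language', 'text'])
# with_format returns a new DatasetDict; here it is only displayed, not assigned
tokenized_data.with_format('pt')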

DatasetDict({
    train: Dataset({
        features: ['id', 'label', 'input_ids', 'attention_mask'],
        num_rows: 9696
    })
    val: Dataset({
        features: ['id', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2424
    })
    test: Dataset({
        features: ['id', 'input_ids', 'attention_mask'],
        num_rows: 5195
    })
})

## Build and Train the Model

Before instantiating the trainer, we are going to create **TrainingArguments**. These are the values we are going to set for each parameter:

- output_dir: **model_name** is the first positional argument, the directory where checkpoints and other outputs are written. We reuse the model name as the directory name;
- evaluation_strategy: **'epoch'** determines when we evaluate the performance of the classifier. We do this at the end of each epoch;
- num_train_epochs: **5** is the number of training epochs;
- learning_rate: **5e-5** is the learning rate;
- weight_decay: **0.005** is the weight decay parameter (for regularization);
- per_device_train_batch_size: **16** is the batch size per GPU/TPU core/CPU for training;
- per_device_eval_batch_size: **16** is the batch size per GPU/TPU core/CPU for evaluation;
- report_to: **'none'** lists the integrations that results and logs are reported to. We are not using any.

In [34]:
training_args = TrainingArguments(model_name,  
                                  evaluation_strategy = 'epoch',
                                  num_train_epochs = 5,
                                  learning_rate = 5e-5,
                                  weight_decay = 0.005,
                                  per_device_train_batch_size = 16,
                                  per_device_eval_batch_size = 16,
                                  report_to = 'none')

I am told that when using **Trainer**, we have to define a function that computes the metric(s) for us (as opposed to simply saying **metrics = ['accuracy']**). This is one example of such a function:

In [35]:
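metric = evaluate.load("accuracy")  # load once instead of on every evaluation

def compute_metrics(eval_pred):
    # eval_pred unpacks into the model's logits and the true labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis = -1)  # most likely class per example
    return metric.compute(predictions=predictions, references=labels)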
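To see what the function above computes, here is a small sketch on made-up logits that uses plain NumPy in place of the `evaluate` metric. The numbers are invented for illustration only:

```python
import numpy as np

# Toy logits for 4 examples and 3 classes (values made up for illustration)
logits = np.array([
    [2.0, 0.1, -1.0],   # highest score at index 0
    [0.2, 0.3, 1.5],    # highest score at index 2
    [-0.5, 1.2, 0.0],   # highest score at index 1
    [0.9, 0.8, 0.7],    # highest score at index 0
])
labels = np.array([0, 2, 1, 1])

predictions = np.argmax(logits, axis=-1)           # class with the largest logit per row
accuracy = float((predictions == labels).mean())   # fraction of rows that match
print(predictions, accuracy)
```

Three of the four predictions match their labels, so the accuracy is 0.75.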

Now, we are going to instantiate a **Trainer** with the following arguments:

- model: **model** is the pre-trained model we loaded above, which we fine-tune and evaluate as the classifier;
- args: **training_args** are the arguments used for training. We set these above;
- train_dataset: **tokenized_data["train"]** is the training dataset;
- eval_dataset: **tokenized_data["val"]** is the validation dataset;
- data_collator: **data_collator** is the function used to form batches. We use DataCollatorWithPadding, which pads each batch to the length of its longest sequence;
- tokenizer: **tokenizer** is the tokenizer that was used to preprocess the data. If no data collator is given, it is used to pad the inputs when batching, and it is saved along with the model;
- compute_metrics: **compute_metrics** is the function that computes the metric(s) at evaluation. We defined this function above.

In [36]:
trainer = Trainer(
    model,
    training_args,
    train_dataset = tokenized_data["train"],
    eval_dataset = tokenized_data["val"],
    data_collator = data_collator,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)
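To make the role of the data collator concrete, here is a minimal pure-Python sketch of dynamic padding, the idea behind DataCollatorWithPadding. It is not the library's implementation. The helper name `pad_batch` and the token ids are made up, and `pad_id=1` is my assumption for the XLM-RoBERTa pad token id:

```python
def pad_batch(sequences, pad_id=1):
    """Pad token-id lists to the longest one in the batch and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # right-pad with pad_id
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy sequences of different lengths (ids invented for illustration)
batch = pad_batch([[0, 11, 12, 2], [0, 21, 2]])
print(batch)
```

Padding per batch rather than to one global maximum keeps short batches short, which is why the tokenize step above does not need to pad.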

In [37]:
trainer.train()

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.361899 | 0.857261 |
| 2 | 0.361900 | 0.389425 | 0.853135 |
| 3 | 0.361900 | 0.512928 | 0.869637 |
| 4 | 0.163000 | 0.653243 | 0.873762 |
| 5 | 0.061100 | 0.763527 | 0.875413 |





TrainOutput(global_step=1515, training_loss=0.1939331634603318, metrics={'train_runtime': 1049.9833, 'train_samples_per_second': 46.172, 'train_steps_per_second': 1.443, 'total_flos': 2427221278540800.0, 'train_loss': 0.1939331634603318, 'epoch': 5.0})
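The `global_step=1515` in the output can be sanity-checked by hand. A sketch of the arithmetic, assuming no gradient accumulation and that training ran on two devices. The device count is my inference from the step count, not something the notebook states:

```python
import math

n_train = 9696          # training rows, from the DatasetDict printed earlier
per_device_batch = 16   # per_device_train_batch_size
n_devices = 2           # ASSUMPTION: two GPUs; not stated in the notebook
epochs = 5              # num_train_epochs

# One optimization step consumes per_device_batch * n_devices examples
steps_per_epoch = math.ceil(n_train / (per_device_batch * n_devices))
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

With a single device the same formula would give 606 steps per epoch and 3030 in total, so the reported 1515 is consistent with an effective batch size of 32.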

# Prepare for Submission

Finally, we are going to use the trainer to predict the labels for the texts in the test data set.

In [38]:
test_predictions = trainer.predict(tokenized_data["test"])
preds = np.argmax(test_predictions.predictions, axis=1)

In [39]:
submission = pd.DataFrame(list(zip(test_df.id, preds)), 
                          columns = ["id", "prediction"])
submission.to_csv("submission.csv", index=False)