# Fine-tuning BERT for GitHub issue label prediction on a local installation

Based on: [Fine-tuning BERT (and friends) for multi-label text classification](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb)
Adapted with: [Multi-Label Classification Model From Scratch: Step-by-Step Tutorial](https://huggingface.co/blog/Valerii-Knowledgator/multi-label-classification)

This is a local exploration of the ML training workflow. The script will later be migrated to run inside the Kubeflow installation. It is mostly copied from the two sources above, with a few adaptations for handling the custom dataset.

*"In this notebook, we are going to fine-tune BERT to predict one or more labels for a given piece of text. Note that this notebook illustrates how to fine-tune a bert-base-uncased model, but you can also fine-tune a RoBERTa, DeBERTa, DistilBERT, CANINE, ... checkpoint in the same way.* 

*All of those work in the same way: they add a linear layer on top of the base model, which is used to produce a tensor of shape (batch_size, num_labels), indicating the unnormalized scores for a number of labels for every example in the batch."*
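As a rough illustration of the quoted description, the classification head boils down to a single matrix multiplication. The sketch below uses NumPy with random stand-in values: `pooled`, `W` and `b` are hypothetical names, the batch size of 4 is arbitrary, and 768 is the hidden size of the BERT base models. It is only meant to show the shape arithmetic, not the library's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, hidden_size, num_labels = 4, 768, 2

# stand-in for the pooled sentence representation produced by the base model
pooled = rng.normal(size=(batch_size, hidden_size))

# the linear head: one weight column and one bias per label
W = rng.normal(scale=0.02, size=(hidden_size, num_labels))
b = np.zeros(num_labels)

# unnormalized scores, one per (example, label) pair
logits = pooled @ W + b
print(logits.shape)
```

Each row of `logits` holds one score per label, which is why the labels later have to be supplied in the same `(batch_size, num_labels)` layout.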

## Set-up environment

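First, we install the libraries we'll use: HuggingFace Transformers and Datasets, plus Evaluate, which is imported further down for the metrics, and scikit-learn, which those metrics rely on.

In [1]:
!pip install -q transformers datasets evaluate scikit-learn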

## Load dataset

Here we load the previously prepared dataset. 

In [None]:
from datasets import load_dataset

dataset = load_dataset("json", data_files="prepared-issues-reduced.json")

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'body', 'enhancement', 'bug'],
        num_rows: 359
    })
})

In [4]:
ds = dataset["train"].train_test_split(test_size=0.2)
ds

DatasetDict({
    train: Dataset({
        features: ['title', 'body', 'enhancement', 'bug'],
        num_rows: 287
    })
    test: Dataset({
        features: ['title', 'body', 'enhancement', 'bug'],
        num_rows: 72
    })
})

In [5]:
example = ds['train'][0]
example

{'title': 'CLI: Create namespace if necessary when installing a namespace-scoped Package',
 'body': "**Is your feature request related to a problem? Please describe.**\r\nInstalling a namespace-scoped package can be more complicated than necessary if the target namespace does not exist yet. In that case, a user has to resort to `kubectl` or other means to create the namespace before being able to install the package. This should not be necessary.\r\n\r\n**Describe the solution you'd like**\r\nWhen installing a namespace-scoped package, the CLI should check if the target namespace exists and, if it does not, create it after the user has confirmed the install operation (if no `--yes` flag is present) and before the `Package` is created. The installation summary should inform the user that the namespace is going to be created. Example output:\r\n\r\n```\r\n$ glasskube install quickwit -n test\r\nSummary:\r\n * The following packages will be installed in your cluster (minikube):\r\n    1. 

Let's create a list that contains the labels, as well as 2 dictionaries that map labels to integers and back.

In [6]:
labels = [label for label in dataset['train'].features.keys() if label not in ['body', 'title']]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels

['enhancement', 'bug']

## Preprocess data

*As models like BERT don't expect text as direct input, but rather `input_ids`, etc., we tokenize the text using the tokenizer. Here I'm using the `AutoTokenizer` API, which will automatically load the appropriate tokenizer based on the checkpoint on the hub.*

*What's a bit tricky is that we also need to provide labels to the model. For multi-label text classification, this is a matrix of shape (batch_size, num_labels). Also important: this should be a tensor of floats rather than integers, otherwise PyTorch's `BCEWithLogitsLoss` (which the model will use) will complain, as explained [here](https://discuss.pytorch.org/t/multi-label-binary-classification-result-type-float-cant-be-cast-to-the-desired-output-type-long/117915/3).*

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_data(example):
  # combine title and body into a single input text
  text = f'{example["title"]}\n{example["body"]}'
  # tokenize; truncation=True cuts inputs that exceed the model's maximum length.
  # padding=True has no effect on a single example; batches get padded at training time
  encoding = tokenizer(text, padding=True, truncation=True)

  # build a multi-hot float vector with one slot per label
  lbls = [0. for _ in labels]
  for label in labels:
    if label in example and example[label] == True:
      lbls[label2id[label]] = 1.

  encoding["labels"] = lbls
  return encoding

In [8]:
encoded_dataset = ds.map(preprocess_data, remove_columns=ds['train'].column_names)
encoded_dataset

Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 287/287 [00:00<00:00, 3302.66 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 72/72 [00:00<00:00, 2996.59 examples/s]


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 287
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 72
    })
})

In [9]:
example = encoded_dataset['train'][0]
print(example.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])


In [10]:
tokenizer.decode(example['input_ids'])

"[CLS] cli : create namespace if necessary when installing a namespace - scoped package * * is your feature request related to a problem? please describe. * * installing a namespace - scoped package can be more complicated than necessary if the target namespace does not exist yet. in that case, a user has to resort to ` kubectl ` or other means to create the namespace before being able to install the package. this should not be necessary. * * describe the solution you ' d like * * when installing a namespace - scoped package, the cli should check if the target namespace exists and, if it does not, create it after the user has confirmed the install operation ( if no ` - - yes ` flag is present ) and before the ` package ` is created. the installation summary should inform the user that the namespace is going to be created. example output : ` ` ` $ glasskube install quickwit - n test summary : * the following packages will be installed in your cluster ( minikube ) : 1. quickwit of type q

In [11]:
example['labels']

[1.0, 0.0]

In [12]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

['enhancement']

Finally, we set the format of our data to PyTorch tensors. This will turn the training and test splits into standard PyTorch [datasets](https://pytorch.org/docs/stable/data.html).

In [13]:
encoded_dataset.set_format("torch")

## Define model

*Here we define a model in which a pre-trained base (i.e. the weights from bert-base-uncased) is loaded, with a randomly initialized classification head (linear layer) on top. One should fine-tune this head, together with the pre-trained base, on a labeled dataset.*

*This is also printed by the warning.*

*We set the `problem_type` to be "multi_label_classification", as this will make sure the appropriate loss function is used (namely [`BCEWithLogitsLoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)). We also make sure the output layer has `len(labels)` output neurons, and we set the id2label and label2id mappings.*

In [14]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Train the model!

We are going to train the model using HuggingFace's Trainer API. This requires us to define 2 things: 

* `TrainingArguments`, which specify training hyperparameters. All options can be found in the [docs](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments). Below, for example, we evaluate and save the model after every epoch, and we set the learning rate, the batch size for training and evaluation, the number of epochs, and so on.
* a `Trainer` object (docs can be found [here](https://huggingface.co/transformers/main_classes/trainer.html#id1)).

In [15]:
batch_size = 3
metric_name = "f1"

In [None]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    "bert-finetuned-trainer-output",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=2,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)

We are also going to compute metrics while training. For this, we need to define a `compute_metrics` function, that returns a dictionary with the desired metric values.

In [17]:
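import evaluate
import numpy as np

# accuracy, F1, precision and recall are computed in one call
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def compute_metrics(eval_pred):
    # `labels` is the reference array here; it shadows the global list of label names
    predictions, labels = eval_pred
    # logits -> probabilities -> 0/1 decisions at a 0.5 threshold
    predictions = sigmoid(predictions)
    predictions = (predictions > 0.5).astype(int).reshape(-1)
    # flatten so that every (example, label) pair counts as one binary prediction
    return clf_metrics.compute(predictions=predictions, references=labels.astype(int).reshape(-1))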
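Because `compute_metrics` flattens predictions and references, the reported accuracy counts individual label decisions rather than whole issues. The sketch below illustrates the difference on made-up logits and references (not taken from the dataset); `flat_accuracy` and `exact_match` are names chosen for this example only.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# made-up logits and references for 3 examples with 2 labels each
logits = np.array([[2.0, -1.0], [0.5, 0.3], [-2.0, 1.5]])
references = np.array([[1, 0], [1, 0], [0, 1]])

# same thresholding as in compute_metrics
predictions = (sigmoid(logits) > 0.5).astype(int)

# flattened view: each of the 6 label decisions is scored on its own
flat_accuracy = (predictions.reshape(-1) == references.reshape(-1)).mean()

# stricter view: an example only counts if all of its labels are right
exact_match = (predictions == references).all(axis=1).mean()

print(flat_accuracy, exact_match)
```

Here the second example gets one of its two labels wrong, so it costs one decision out of six in the flattened view but a whole example out of three in the exact-match view. Which of the two is more useful depends on how the predictions will be consumed.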

Let's verify a batch as well as a forward pass:

In [18]:
encoded_dataset['train'][0]['labels'].type()

'torch.FloatTensor'

In [19]:
encoded_dataset['train']['input_ids'][0]

tensor([  101, 18856,  2072,  1024,  3443,  3415, 15327,  2065,  4072,  2043,
        23658,  1037,  3415, 15327,  1011,  9531,  2094,  7427,  1008,  1008,
         2003,  2115,  3444,  5227,  3141,  2000,  1037,  3291,  1029,  3531,
         6235,  1012,  1008,  1008, 23658,  1037,  3415, 15327,  1011,  9531,
         2094,  7427,  2064,  2022,  2062,  8552,  2084,  4072,  2065,  1996,
         4539,  3415, 15327,  2515,  2025,  4839,  2664,  1012,  1999,  2008,
         2553,  1010,  1037,  5310,  2038,  2000,  7001,  2000,  1036, 13970,
         4783,  6593,  2140,  1036,  2030,  2060,  2965,  2000,  3443,  1996,
         3415, 15327,  2077,  2108,  2583,  2000, 16500,  1996,  7427,  1012,
         2023,  2323,  2025,  2022,  4072,  1012,  1008,  1008,  6235,  1996,
         5576,  2017,  1005,  1040,  2066,  1008,  1008,  2043, 23658,  1037,
         3415, 15327,  1011,  9531,  2094,  7427,  1010,  1996, 18856,  2072,
         2323,  4638,  2065,  1996,  4539,  3415, 15327,  6526, 

In [20]:
#forward pass
outputs = model(input_ids=encoded_dataset['train']['input_ids'][0].unsqueeze(0), labels=encoded_dataset['train'][0]['labels'].unsqueeze(0))
outputs

SequenceClassifierOutput(loss=tensor(0.7555, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[-0.3859, -0.1816]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
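The loss in this output can be checked by hand. The following sketch recomputes binary cross-entropy with logits in plain Python, averaged over the labels, using the logits printed above and the `[1.0, 0.0]` label vector of the first training example. `bce_with_logits` is a helper written for this illustration, not a library function, and the printed logits are rounded, so the result only agrees up to rounding.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all labels, computed from raw logits."""
    losses = []
    for z, y in zip(logits, targets):
        p = 1 / (1 + math.exp(-z))  # sigmoid turns the logit into a probability
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

# logits from the forward pass above, labels of the first training example
loss = bce_with_logits([-0.3859, -0.1816], [1.0, 0.0])
print(round(loss, 4))
```

The value should come out at roughly 0.7555, in line with the loss reported by the model. A loss near ln 2 ≈ 0.69 per label is what one would expect from a freshly initialized head that has not learned anything yet.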

Let's start training!

In [21]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [22]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | No log | 0.362327 | 0.875 | 0.823529 | 0.724138 | 0.954545 |
| 2 | No log | 0.281655 | 0.9375 | 0.905263 | 0.843137 | 0.977273 |


TrainOutput(global_step=192, training_loss=0.4105148712793986, metrics={'train_runtime': 302.937, 'train_samples_per_second': 1.895, 'train_steps_per_second': 0.634, 'total_flos': 80538191267940.0, 'train_loss': 0.4105148712793986, 'epoch': 2.0})
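The `global_step=192` in this output follows from the split size and the batch size. A minimal sketch of the arithmetic, assuming training on a single device without gradient accumulation and with the last, smaller batch kept:

```python
import math

train_examples = 287  # size of the training split
batch_size = 3
epochs = 2

# 95 full batches plus one final batch holding the remaining 2 examples
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```

With only 192 optimizer steps in total, the run most likely never reaches the default logging interval of `TrainingArguments` (500 steps), which would explain the "No log" entries in the training loss column. Setting a smaller `logging_steps` should make the training loss show up.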

## Evaluate

After training, we evaluate our model on the held-out test split, which was passed to the `Trainer` as `eval_dataset`.

In [23]:
trainer.evaluate()

{'eval_loss': 0.28165486454963684,
 'eval_accuracy': 0.9375,
 'eval_f1': 0.9052631578947369,
 'eval_precision': 0.8431372549019608,
 'eval_recall': 0.9772727272727273,
 'eval_runtime': 7.7696,
 'eval_samples_per_second': 9.267,
 'eval_steps_per_second': 3.089,
 'epoch': 2.0}
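To make these numbers more tangible, they can be traced back to counts of individual label decisions. The counts below are not logged anywhere in the notebook. They are inferred here as the integers that reproduce the reported values over the 72 × 2 = 144 flattened decisions, so treat them as an assumption.

```python
# hypothetical confusion counts, inferred from the reported metrics
tp, fp, fn = 43, 8, 1
total = 72 * 2  # 72 test issues, 2 labels each
tn = total - (tp + fp + fn)

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Read this way, the model misses a single positive label but raises eight false alarms, which matches the pattern of high recall and lower precision in the output above.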

In [24]:
# trained model can be saved locally
trainer.save_model("local-bert-reduced")

## Inference

Let's test the model on new issues:

In [30]:
from transformers import pipeline

# load the fine-tuned model saved above; "text-classification" is the general
# name of the task that "sentiment-analysis" is an alias for
nlp = pipeline("text-classification", model="./local-bert-reduced", top_k=5)

bug = """
**Describe the bug**

**To reproduce**

**Cluster Info (please complete the following information):**
"""

enhancement = """
**Is your feature request related to a problem? Please describe.**

**Describe the solution you'd like**

"""


print(nlp(bug))
print(nlp(enhancement))

[[{'label': 'bug', 'score': 0.7489847540855408}, {'label': 'enhancement', 'score': 0.3993261158466339}]]
[[{'label': 'enhancement', 'score': 0.8171513080596924}, {'label': 'bug', 'score': 0.06645093113183975}]]
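The scores for one input do not sum to 1 (0.75 + 0.40 for the bug template), which is consistent with each label being scored independently. To turn them into a set of predicted labels, one option is to apply the same 0.5 threshold used during evaluation. A small sketch, with the scores copied from the output above and `predicted_labels` as a hypothetical helper:

```python
# scores copied from the pipeline output above (inner list of each result)
bug_scores = [
    {'label': 'bug', 'score': 0.7489847540855408},
    {'label': 'enhancement', 'score': 0.3993261158466339},
]
enhancement_scores = [
    {'label': 'enhancement', 'score': 0.8171513080596924},
    {'label': 'bug', 'score': 0.06645093113183975},
]

def predicted_labels(scores, threshold=0.5):
    """Keep every label whose score clears the threshold."""
    return [item['label'] for item in scores if item['score'] >= threshold]

print(predicted_labels(bug_scores))
print(predicted_labels(enhancement_scores))
```

With a live pipeline the inner list would be obtained as `nlp(text)[0]`. The threshold is a tunable choice: lowering it trades precision for recall, and an input can end up with no label or with both.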
