### Learn how to use the Hugging Face Transformers library to fine-tune BERT and other transformer models for text classification in Python.

https://www.thepythoncode.com/article/finetuning-bert-using-huggingface-transformers-python

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    F

In [2]:
import torch
from transformers.utils import is_tf_available, is_torch_available

from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
import random

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

In [3]:
def set_seed(seed: int):
    """
    Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch`` and/or ``tf`` (if
    installed).

    Args:
        seed (:obj:`int`): The seed to set.
    """
    random.seed(seed)
    np.random.seed(seed)
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # ^^ safe to call this function even if cuda is not available
    if is_tf_available():
        import tensorflow as tf

        tf.random.set_seed(seed)

set_seed(1)

In [4]:
model_name = 'bert-base-uncased'
max_length = 512

### Loading the Dataset

In [5]:
tokenizer = BertTokenizerFast.from_pretrained(model_name, 
                                              do_lower_case=True)


In [6]:
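def read_20newsgroups(test_size=0.2):
  # load the 20 newsgroups dataset, removing headers, footers and quoted
  # replies so that only the message body is left
  dataset = fetch_20newsgroups(
      subset='all',
      shuffle=True,
      remove=("headers", "footers", "quotes")
  )

  documents = dataset.data
  labels = dataset.target
  # split into training & validation sets, and also return the label names
  return train_test_split(documents, labels, test_size=test_size), dataset.target_names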
  

In [7]:
# call the function
(train_texts, valid_texts, train_labels, valid_labels), target_names = read_20newsgroups()

In [8]:
# show a training sample together with the name of its own label
train_texts[1], target_names[train_labels[1]]

("Hi ... Recently I found XV for MS-DOS in a subdirectory of GNU-CC (GNUISH). I \nuse frequently XV on a Sun Spark Station 1 and I never had problems, but when I\nstart it on my computer with -h option, it display the help menu and when I\nstart it with a GIF-File my Hard disk turns 2 or 3 seconds and the prompt come\nback.\n\nMy computer is a little 386/25 with copro, 4 Mega rams, Tseng 4000 (1M) running\nMS-DOS 5.0 with HIMEM.SYS and no EMM386.SYS. I had the GO32.EXE too... but no\ndriver who run with it.\n\nDo somenone know the solution to run XV ??? any help would be apprecied..\n\t\t\n\tThanx in advance !!!! \n             \n-- \n---------------------------------------------------------------------\n*\t\t\t\t\t\t\t\t    *\n*  Pascal PERRET     \t\t|\tperret@eicn.etna.ch         *\n*  Ecole d'ingénieur ETS\t|\t(Not Available at this time)*\n*  2400 Le LOCLE\t\t|\t\t\t\t    *\n*  Suisse \t\t\t\t\t\t\t    *\n*\t\t     !!!! Enjoy COMPUTER !!!!\t\t\t    *\n*\t\t\t\t\t\t\t\t    *",
 'co

In [9]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)

In [10]:
class NewsGroupsDataset(torch.utils.data.Dataset):
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

  def __getitem__(self, idx):
    item = {k : torch.tensor(v[idx]) for k, v in self.encodings.items()}
    item['labels'] = torch.tensor([self.labels[idx]])
    return item

  def __len__(self):
    return len(self.labels)

In [11]:
# convert our tokenized data into a torch Dataset
train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)

Since we are going to use the Trainer from the Transformers library, which expects our dataset as a torch.utils.data.Dataset, we made a simple class that implements the __len__() method, which returns the number of samples, and the __getitem__() method, which returns the data sample at a specific index.
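To make the protocol concrete, here is a dependency-free sketch of the same idea. It is only an illustration: plain Python lists stand in for tensors, and `ToyDataset` and the toy encodings are hypothetical names that do not appear elsewhere in this tutorial.

```python
class ToyDataset:
    """Stand-in for NewsGroupsDataset that uses lists instead of torch tensors."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # pick the idx-th row of every encoding field, then attach the label
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)


# two already-padded "documents" with made-up token ids
toy_encodings = {
    "input_ids": [[101, 7592, 102], [101, 2088, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy_dataset = ToyDataset(toy_encodings, labels=[3, 7])

# a data loader only needs len() and integer indexing to build batches
batch = [toy_dataset[i] for i in range(len(toy_dataset))]
print(len(toy_dataset), batch[1])
```

The real class differs only in wrapping each value in `torch.tensor(...)` so that the samples can be stacked into batches.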

### Training the Model

Now that we have our data prepared, let's download and load our BERT model and its pre-trained weights:

In [12]:
# load the model and move it to the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(target_names)).to(device)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [13]:
from sklearn.metrics import accuracy_score

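def compute_metrics(pred):
  labels = pred.label_ids
  # pick the class with the highest logit for each sample
  preds = pred.predictions.argmax(-1)
  # calculate accuracy using sklearn's function
  acc = accuracy_score(labels, preds)
  return {
      "accuracy": acc,
  }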

In [14]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
    logging_steps=400,               # log every 400 steps
    save_steps=400,                  # save a checkpoint every 400 steps
    evaluation_strategy="steps",     # evaluate every `eval_steps`, which defaults to `logging_steps`
)

Each argument is explained in the code comments. I've set the training batch size to 8 because that's the maximum that fits in the memory of a Google Colab environment. If you get a CUDA out-of-memory error, decrease it further. If you have a more powerful GPU in your environment, increasing it will make training significantly faster.
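If the batch size has to shrink to fit in memory, one common option is gradient accumulation, which `TrainingArguments` exposes as `gradient_accumulation_steps`. The sketch below only does the arithmetic behind it; both helper functions are hypothetical names, and a single GPU is assumed.

```python
def effective_batch_size(per_device_batch_size, num_devices=1, gradient_accumulation_steps=1):
    # number of samples that contribute to each optimizer update
    return per_device_batch_size * num_devices * gradient_accumulation_steps


def accumulation_steps_for(target_batch_size, per_device_batch_size, num_devices=1):
    # smallest number of accumulation steps that reaches the target batch size
    per_step = per_device_batch_size * num_devices
    return -(-target_batch_size // per_step)  # ceiling division


# the setup used in this tutorial: batch size 8, no accumulation
print(effective_batch_size(8))  # 8

# halving the per-device batch size while keeping updates at 8 samples
steps = accumulation_steps_for(target_batch_size=8, per_device_batch_size=4)
print(steps, effective_batch_size(4, gradient_accumulation_steps=steps))  # 2 8
```

Accumulating gradients lowers peak memory without changing the number of samples per update, but each update then takes proportionally more forward and backward passes.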

You can also tweak other parameters, such as increasing the number of epochs for better training.
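Among those other parameters, `warmup_steps` is probably the least self-explanatory. Below is a rough sketch of a linear warmup followed by a linear decay. It assumes the Trainer's default linear scheduler and a base learning rate of 5e-5 (neither is set explicitly in this tutorial), and `linear_warmup_lr` is a hypothetical helper, not a library function.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=5655):
    if step < warmup_steps:
        # ramp up linearly from 0 to base_lr
        return base_lr * step / warmup_steps
    # then decay linearly so that the rate reaches 0 at the last step
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 3000, 5655):
    print(step, f"{linear_warmup_lr(step):.2e}")
```

With 500 warmup steps, the learning rate peaks at step 500 and shrinks from there, so the early updates on the freshly initialized classification head stay small.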

I've set logging_steps and save_steps to 400, which means the model is evaluated and saved every 400 steps. Make sure to increase these values if you decrease the batch size below 8: a smaller batch size means more steps per epoch, so many more checkpoints get saved, and they may fill up your environment's disk space.
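To get a feel for the numbers, the following sketch estimates the total number of optimization steps and checkpoints for a given batch size. It assumes that one checkpoint is written every `save_steps` steps and that none are deleted; `training_plan` is a hypothetical helper.

```python
import math


def training_plan(num_examples, batch_size, num_epochs, save_steps):
    # the last, partially filled batch of an epoch still counts as a step
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    # approximate number of checkpoints written during training
    num_checkpoints = total_steps // save_steps
    return total_steps, num_checkpoints


# this tutorial's run: 15076 training examples, 3 epochs, batch size 8
print(training_plan(15076, batch_size=8, num_epochs=3, save_steps=400))  # (5655, 14)
# same data with batch size 2: four times as many checkpoints
print(training_plan(15076, batch_size=2, num_epochs=3, save_steps=400))  # (22614, 56)
```

Besides raising `save_steps`, `TrainingArguments` also accepts `save_total_limit`, which caps how many checkpoints are kept on disk.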

We then pass our training arguments, dataset, and compute_metrics callback to our Trainer:

In [15]:
trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)

In [16]:
# train the model
trainer.train()

***** Running training *****
  Num examples = 15076
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 5655


| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 400  | 2.3832 | 1.393765 | 0.603979 |
| 800  | 1.304  | 1.125011 | 0.66366  |
| 1200 | 1.1135 | 1.016177 | 0.688329 |
| 1600 | 1.0335 | 0.943768 | 0.715385 |
| 2000 | 0.9168 | 0.968447 | 0.723873 |
| 2400 | 0.7181 | 0.968347 | 0.722281 |
| 2800 | 0.7432 | 0.892228 | 0.739523 |
| 3200 | 0.7561 | 0.847947 | 0.748276 |
| 3600 | 0.6449 | 0.847028 | 0.763395 |
| 4000 | 0.4874 | 0.866981 | 0.772944 |


***** Running Evaluation *****
  Num examples = 3770
  Batch size = 20
Saving model checkpoint to ./results/checkpoint-400
Configuration saved in ./results/checkpoint-400/config.json
Model weights saved in ./results/checkpoint-400/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3770
  Batch size = 20
Saving model checkpoint to ./results/checkpoint-800
Configuration saved in ./results/checkpoint-800/config.json
Model weights saved in ./results/checkpoint-800/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3770
  Batch size = 20
Saving model checkpoint to ./results/checkpoint-1200
Configuration saved in ./results/checkpoint-1200/config.json
Model weights saved in ./results/checkpoint-1200/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3770
  Batch size = 20
Saving model checkpoint to ./results/checkpoint-1600
Configuration saved in ./results/checkpoint-1600/config.json
Model weights saved in ./results/checkpoint-1600/pytorch_model.bi

TrainOutput(global_step=5655, training_loss=0.8285261789957682, metrics={'train_runtime': 5802.7939, 'train_samples_per_second': 7.794, 'train_steps_per_second': 0.975, 'total_flos': 1.1901910025060352e+16, 'train_loss': 0.8285261789957682, 'epoch': 3.0})

In [17]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 3770
  Batch size = 20


{'epoch': 3.0,
 'eval_accuracy': 0.763395225464191,
 'eval_loss': 0.8470278382301331,
 'eval_runtime': 112.8182,
 'eval_samples_per_second': 33.417,
 'eval_steps_per_second': 1.675}

In [18]:
# save the fine-tuned model & tokenizer
model_path = "20newsgroups-bert-base-uncased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

Configuration saved in 20newsgroups-bert-base-uncased/config.json
Model weights saved in 20newsgroups-bert-base-uncased/pytorch_model.bin
tokenizer config file saved in 20newsgroups-bert-base-uncased/tokenizer_config.json
Special tokens file saved in 20newsgroups-bert-base-uncased/special_tokens_map.json


('20newsgroups-bert-base-uncased/tokenizer_config.json',
 '20newsgroups-bert-base-uncased/special_tokens_map.json',
 '20newsgroups-bert-base-uncased/vocab.txt',
 '20newsgroups-bert-base-uncased/added_tokens.json',
 '20newsgroups-bert-base-uncased/tokenizer.json')

In [19]:
def get_prediction(text):
    # tokenize the text and move the tensors to the model's device
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to(model.device)
    # run inference without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
    # turn the logits into probabilities with softmax
    probs = outputs.logits.softmax(1)
    # return the name of the most probable label
    return target_names[probs.argmax().item()]
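
The last two steps of get_prediction (softmax, then argmax) can be sketched without any deep learning library. The logits and the three label names below are made up for illustration, and `softmax` is a hand-written helper, not the PyTorch method.

```python
import math


def softmax(logits):
    # subtract the max for numerical stability before exponentiating
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# made-up logits for three of the newsgroup classes
toy_names = ["comp.graphics", "sci.med", "sci.space"]
toy_logits = [0.5, 1.0, 3.0]

probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)  # argmax
print(toy_names[best], round(probs[best], 3))
```

Because softmax preserves the ordering of the logits, taking the argmax of the raw logits would return the same label; the probabilities matter only if you also want a confidence score.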

In [20]:
# Example #1
text = """
The first thing is first. 
If you purchase a Macbook, you should not encounter performance issues that will prevent you from learning to code efficiently.
However, in the off chance that you have to deal with a slow computer, you will need to make some adjustments. 
Having too many background apps running in the background is one of the most common causes. 
The same can be said about a lack of drive storage. 
For that, it helps if you uninstall xcode and other unnecessary applications, as well as temporary system junk like caches and old backups.
"""
print(get_prediction(text))

comp.sys.mac.hardware


In [21]:
# Example #2
text = """
A black hole is a place in space where gravity pulls so much that even light can not get out. 
The gravity is so strong because matter has been squeezed into a tiny space. This can happen when a star is dying.
Because no light can get out, people can't see black holes. 
They are invisible. Space telescopes with special tools can help find black holes. 
The special tools can see how stars that are very close to black holes act differently than other stars.
"""
print(get_prediction(text))

sci.space


In [22]:
# Example #3
text = """
Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus.
Most people infected with the COVID-19 virus will experience mild to moderate respiratory illness and recover without requiring special treatment.  
Older people, and those with underlying medical problems like cardiovascular disease, diabetes, chronic respiratory disease, and cancer are more likely to develop serious illness.
"""
print(get_prediction(text))

sci.med
