<a href="https://colab.research.google.com/github/liadmagen/NLP-Course/blob/master/exercises_notebooks/11_LM_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERTology

## Introduction
BERT was first released in late 2018 and revolutionized the NLP world.
It arrived shortly after similar models such as ELMo and ULMFiT had demonstrated that contextual word embeddings significantly improve model accuracy.

Contextual word embeddings are vectors that change according to how a word is used in a sentence. Unlike earlier embedding techniques (Word2Vec, GloVe, FastText, etc.), which assign each word a single fixed vector, ELMo, ULMFiT and BERT produce a different embedding for the same word depending on its usage.

So for example, the word `play` would have different vectors in these sentences:
* I *play* tennis on Thursday
* I saw that *play* already
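
To make the difference concrete, here is a toy sketch in plain Python. It is not how ELMo or BERT compute embeddings: the two-dimensional vectors in `STATIC` and the neighbour-averaging rule in `contextual_vector` are invented purely for illustration. A static table returns the same vector for `play` in both sentences, while even this crude context-dependent rule yields two different vectors.

```python
# Toy static embedding table (all values are made up for illustration).
STATIC = {
    "i": [0.0, 0.0],
    "play": [1.0, 1.0],
    "tennis": [2.0, 0.0],
    "on": [0.0, 0.0],
    "thursday": [1.0, 0.0],
    "saw": [0.0, 1.0],
    "that": [0.0, 0.0],
    "already": [0.0, 2.0],
}

def static_vector(tokens, index):
    """Static embeddings ignore the sentence: just look the word up."""
    return STATIC[tokens[index]]

def contextual_vector(tokens, index):
    """Crude stand-in for a contextual model: average the word's vector
    with the vectors of its immediate neighbours."""
    window = tokens[max(0, index - 1): index + 2]
    vectors = [STATIC[t] for t in window]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

sentence_a = "i play tennis on thursday".split()
sentence_b = "i saw that play already".split()
idx_a = sentence_a.index("play")
idx_b = sentence_b.index("play")

same_static = static_vector(sentence_a, idx_a) == static_vector(sentence_b, idx_b)
ctx_a = contextual_vector(sentence_a, idx_a)
ctx_b = contextual_vector(sentence_b, idx_b)

print("static vectors equal:    ", same_static)
print("contextual vectors equal:", ctx_a == ctx_b)
```

Real contextual models replace the averaging rule with many learned layers, but the principle is the same: the vector for a word is a function of the whole sentence, not of the word alone.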

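BERT uses a Transformer encoder, whose self-attention builds a matrix of relations between every pair of tokens in the input. This is memory-expensive, because the matrix grows quadratically with the input length. BERT therefore accepts at most 512 tokens in total, and its input format **cannot take more than two sentences (segments) at once**.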
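
A back-of-the-envelope calculation illustrates the memory cost. The sketch below assumes the BERT-base configuration (12 layers, 12 attention heads) and 32-bit floats, and counts only the attention score matrices for a single input, ignoring every other activation. The function name is hypothetical.

```python
def attention_scores_mib(seq_len, num_layers=12, num_heads=12, bytes_per_value=4):
    """Memory (in MiB) of the seq_len x seq_len attention score matrices
    for one input, summed over all layers and heads."""
    entries_per_matrix = seq_len * seq_len
    total_bytes = entries_per_matrix * num_layers * num_heads * bytes_per_value
    return total_bytes / (1024 ** 2)

for n in (128, 512, 4096):
    print(f"{n:>5} tokens -> {attention_scores_mib(n):,.0f} MiB")
```

Under these assumptions, doubling the input length quadruples the memory, so going from 512 to 4,096 tokens multiplies the cost by 64.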

Since then, however, many newer and more efficient models have been released:
- [BigBird](https://huggingface.co/blog/big-bird#bigbird-block-sparse-attention), [Longformer](https://huggingface.co/allenai/longformer-base-4096), [Reformer](https://huggingface.co/google/reformer-crime-and-punishment) and many others can handle classification of long documents.
- [RoBERTa](https://huggingface.co/roberta-base) and [ELECTRA](https://huggingface.co/google/electra-small-discriminator) are memory- and GPU-efficient, and train faster.
- Models such as [LayoutLM](https://huggingface.co/microsoft/layoutlmv2-base-uncased) can handle a combination of text and images, and are useful for classifying visually rich documents such as invoices, IDs and forms.


## RoBERTa

Due to its availability and efficiency, RoBERTa is widely used for text classification. We will use it here to classify patents from the [patents dataset](https://huggingface.co/datasets/ccdv/patent-classification), which is hosted on the [Hugging Face Dataset Hub](https://huggingface.co/datasets).

We will use the [Hugging Face](https://huggingface.co/) implementation of RoBERTa. Hugging Face is a French-founded company that has recently taken a leading role in the field, thanks to its excellent engineering, its open-source software and a very large community that contributes models, datasets and code.

Hugging Face also offers [NLP courses](https://huggingface.co/course/chapter1/1) that guide users through its software.

In [1]:
%%capture
! pip install transformers
! pip install datasets

In [2]:
import numpy as np

from datasets import load_dataset, load_metric
from transformers import RobertaTokenizer, RobertaForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding


In [3]:
dataset = load_dataset('ccdv/patent-classification')


No config specified, defaulting to: patent_classification_dataset/patent
Reusing dataset patent_classification_dataset (/root/.cache/huggingface/datasets/ccdv___patent_classification_dataset/patent/1.0.0/296a870cf0b6aa21c8cbd74f4fcd0dafdf4d7795cc2bba5ee2918ddd85225740)



The dataset is already divided into train, validation and test splits.

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})

In [5]:
dataset['train'][0]

{'label': 6,
 'text': 'turning now to the drawings , there is shown in fig1 an integrated circuit continuity testing system in which a specimen or circuit configuration 16 is mounted on a fixture 18 operable to vibrate the specimen under controlled conditions , e . g . sinusoidally , randomly , or a combination of the two . the specific structure of the fixture and the means for vibrating it are known in the art and thus not further discussed . the specimen and fixture are housed in a closed chamber 20 whereby the specimen under test can be subjected to temperature cycling , either alone or in conjunction with the vibration testing . an environmental control apparatus , indicated at 22 , is provided for selectively heating or cooling the chamber interior . a cable 24 electrically connects fixture 18 , and thus specimen 16 , with a continuity testing board 26 . it is to be understood that cable 24 includes a multiplicity of separate electrical connections between fixture 18 and testing 

In [6]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

We use RoBERTa's tokenizer to tokenize the text of our dataset.

There are two options:
* either create an encoded copy of the dataset, or
* create a transformation function that encodes it with the tokenizer on the fly: https://huggingface.co/docs/datasets/process.html?highlight=map#format-transform
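
The trade-off between the two options can be sketched in plain Python. This is only an illustration of the idea, not the `datasets` API: `toy_tokenize` and the `OnTheFly` class are hypothetical stand-ins. In the `datasets` library, the first option corresponds to `map` and the second to `set_transform` / `with_transform`.

```python
def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: split on whitespace."""
    return text.split()

texts = ["a cable connects the fixture", "the chamber is closed"]

# Option 1: encode everything up front and keep an encoded copy.
encoded_copy = [toy_tokenize(t) for t in texts]

# Option 2: keep the raw text and encode an example only when it is read.
class OnTheFly:
    def __init__(self, raw_texts, transform):
        self.raw_texts = raw_texts
        self.transform = transform
        self.calls = 0  # counts how often the transform actually runs

    def __len__(self):
        return len(self.raw_texts)

    def __getitem__(self, index):
        self.calls += 1
        return self.transform(self.raw_texts[index])

lazy = OnTheFly(texts, toy_tokenize)
first_item = lazy[0]

print(encoded_copy[0] == first_item)  # both options yield the same encoding
print(lazy.calls)                     # the lazy version has run the transform once
```

The encoded copy costs extra storage but is computed only once. The on-the-fly version stores nothing extra but repeats the work every time an example is accessed, for instance once per epoch during training.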

In [7]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

encoded_dataset = dataset.map(preprocess_function, batched=True)

Loading cached processed dataset at /root/.cache/huggingface/datasets/ccdv___patent_classification_dataset/patent/1.0.0/296a870cf0b6aa21c8cbd74f4fcd0dafdf4d7795cc2bba5ee2918ddd85225740/cache-292546e12ab5245b.arrow



First, we instantiate the RoBERTa model, which was already pre-trained on Wikipedia and other resources. Its classification head is newly initialized, with one output per class in our dataset.

In [8]:
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=dataset['train'].features['label'].num_classes)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'roberta.pooler.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classi

Then we fine-tune our RoBERTa model on our dataset.

In [9]:
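# Note: `load_metric` was removed in `datasets` 3.0; on newer versions,
# use `evaluate.load("accuracy")` from the separate `evaluate` package.
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    # eval_pred holds the model's raw logits and the true labels
    logits, labels = eval_pred
    # the predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)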
    return metric.compute(predictions=predictions, references=labels)
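
To see what `compute_metrics` calculates, here is the same argmax-then-accuracy logic on a handful of made-up logits, written with plain lists instead of NumPy and the metric object. The `argmax` helper and all the numbers are illustrative assumptions.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

# Made-up logits for 4 examples and 3 classes.
logits = [
    [2.0, 0.1, -1.0],
    [0.3, 0.2, 1.5],
    [-0.5, 0.9, 0.4],
    [1.1, 1.0, 0.2],
]
labels = [0, 2, 2, 0]

# The predicted class is the position of the largest logit in each row.
predictions = [argmax(row) for row in logits]

# Accuracy is the fraction of predictions that match the labels.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(predictions)
print(accuracy)
```

Only the third example is misclassified here, so three of the four predictions are correct.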

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir='./results',
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()


The following columns in the training set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text.
***** Running training *****
  Num examples = 25000
  Num Epochs = 5
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 31250


Epoch,Training Loss,Validation Loss
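
The `Total optimization steps = 31250` figure in the log above follows directly from the dataset size and the training arguments. A minimal sketch of the arithmetic, assuming a single device as the log indicates:

```python
import math

num_examples = 25000             # size of the training split
batch_size = 4                   # per_device_train_batch_size
gradient_accumulation_steps = 1  # as reported in the log
num_epochs = 5                   # num_train_epochs

# One optimization step consumes batch_size * gradient_accumulation_steps examples.
steps_per_epoch = math.ceil(num_examples / (batch_size * gradient_accumulation_steps))
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

With a batch size this small, each epoch takes 6,250 steps. A larger batch size, if memory allows, reduces the number of steps proportionally.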


In [None]:
trainer.evaluate()

For more information, see the Hugging Face notebooks:
* Fine-tuning a pretrained model: https://github.com/huggingface/notebooks/blob/master/transformers_doc/training.ipynb
* Fine-tuning a model on a text classification task: https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb

More good resources can be found here: https://huggingface.co/docs/transformers/notebooks