## NLP. Week 12. BERT. RoBERTa

## BERT (Bidirectional Encoder Representations from Transformers)
BERT, short for Bidirectional Encoder Representations from Transformers, is a machine learning (ML) model for natural language processing. It was developed in 2018 by researchers at Google AI Language and serves as a Swiss Army knife solution to 11+ of the most common language tasks, such as sentiment analysis and named entity recognition.

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab12/BERTclassification.png" alt="BERT classification" width="1000"/>

- **Bidirectional Training:** Unlike previous models, which read text either left-to-right or right-to-left, BERT reads in both directions. This allows it to understand a word from the context on both sides of it.
- **Transformer Architecture:** BERT uses the Transformer architecture, specifically the encoder part, which helps capture complex relationships in text data.
- **Pre-training and Fine-tuning:** BERT is pre-trained on a large corpus of text using unsupervised learning and then fine-tuned on specific tasks like question answering or sentiment analysis.
- **Masked Language Modeling (MLM):** During pre-training, BERT uses MLM, where some percentage of the input tokens are masked and the model learns to predict them (the key to bidirectional training).


### Architecture

The architecture of BERT is the same as the **encoder** of a Transformer network (since the purpose of BERT is to create a language model, it only needs an encoder). It mainly consists of a stack of self-attention layers (12 in the base model and 24 in the large model) combined with layer normalization and residual connections.

Since BERT is based on the Transformer encoder, you should study or review how Transformers work (you can use [this source](https://habr.com/ru/articles/486358/) or [Google's paper](http://arxiv.org/pdf/1706.03762)). [Here](https://medium.com/@amanatulla1606/transformer-architecture-explained-2c49e2257b4c) you can review it in more detail, but let's refresh your memory now and look at the architecture:

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab12/TransformerArch.png" alt="Transformer" width="600"/>


<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab12/TransformersArch.png" alt="Transformer's encoder" width="800"/>


#### Key Components
- Self-Attention Mechanism: Allows the model to weigh the importance of different words in a sentence when encoding a particular word.
- Feed-Forward Neural Networks: Process the output of the self-attention mechanism to further refine the word representations.
- Layer Normalization and Residual Connections: Help in stabilizing the training process and allow for deeper models.

The encoder in BERT is composed of multiple layers of these components stacked on top of each other.
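
To make the self-attention component more concrete, here is a minimal sketch of scaled dot-product attention for a single head, written in plain Python. It is an illustration under simplifying assumptions: the queries, keys, and values are all set to the raw input vectors, whereas a real encoder layer first applies learned linear projections and uses several heads. The function names are chosen for this example only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    Each argument is a list of vectors, one per token. Every token attends to
    all tokens, to its left and to its right.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: 3 tokens with 2-dimensional vectors; Q = K = V for simplicity
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and tells how strongly one token attends to every token in the sequence, including those that come after it.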

> Unlike traditional models that process text in a single direction (left-to-right or right-to-left), BERT processes text bidirectionally. This means it considers both the left and right context when encoding each word, leading to a more comprehensive understanding of the text.

The picture shows the general structure of a transformer encoder. The input is a sequence of tokens that are first embedded into vectors and then _processed in a neural network_ (read the information below). The output is a sequence of vectors of size H, in which each vector corresponds to an input token with the same index.

#### Input Representation
BERT's input representation is a combination of several embeddings:

- Token Embeddings: Represent individual words or subwords.
- Segment Embeddings: Differentiate between sentences in tasks like sentence pair classification.
- Position Embeddings: Indicate the position of each token in the sequence to maintain order.

The final input representation is the sum of these three embeddings.
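
The summation can be sketched in a few lines of plain Python. Everything here is a toy assumption made for illustration: the tiny vocabulary, the hidden size of 4 (BERT-base uses 768), and the randomly initialised tables, which in the real model are learned during pre-training.

```python
import random

random.seed(0)
HIDDEN = 4  # toy hidden size, an assumption for this sketch

def make_table(rows, dim=HIDDEN):
    """Random embedding table standing in for learned weights."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]

vocab = {"[CLS]": 0, "[SEP]": 1, "my": 2, "dog": 3, "is": 4, "cute": 5}
token_table = make_table(len(vocab))   # one row per vocabulary entry
segment_table = make_table(2)          # one row per segment (A or B)
position_table = make_table(16)        # one row per position

def embed(tokens, segment_ids):
    """Return token + segment + position embedding for every input token."""
    rows = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids)):
        t, s, p = token_table[vocab[tok]], segment_table[seg], position_table[pos]
        rows.append([a + b + c for a, b, c in zip(t, s, p)])
    return rows

tokens = ["[CLS]", "my", "dog", "[SEP]", "is", "cute", "[SEP]"]
segment_ids = [0, 0, 0, 0, 1, 1, 1]
embedded = embed(tokens, segment_ids)
print(len(embedded), "vectors of size", len(embedded[0]))
```

Note that the two `[SEP]` tokens share a token embedding but end up with different input vectors, because their segment and position embeddings differ.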

The model needs a position vector. In the original Transformer this is a set of “fast” and “slow” sinusoids whose values change with each token (BERT itself learns its position embeddings, but the intuition carries over). If two tokens have different values on the fast curves but the same values on the slow ones, they are close together. If the values on the slow curves differ as well, the tokens are far apart.

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab12/Sinusoids.png" alt="Sinusoids" width="600"/>
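
The fast and slow curves can be reproduced with a short sketch. It follows the sinusoidal formula from the original Transformer paper; the vector size of 16 and the positions being compared are arbitrary choices for this example.

```python
import math

def sinusoidal_position(pos, d_model):
    """Position vector built from sin/cos pairs of decreasing frequency."""
    vec = []
    for i in range(0, d_model, 2):
        # Small i gives a "fast" sinusoid, large i gives a "slow" one
        angle = pos / (10000 ** (i / d_model))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec[:d_model]

def distance(a, b):
    """Euclidean distance between two position vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d_model = 16
near = distance(sinusoidal_position(10, d_model), sinusoidal_position(11, d_model))
far = distance(sinusoidal_position(10, d_model), sinusoidal_position(60, d_model))
print(f"distance(10, 11) = {near:.3f}, distance(10, 60) = {far:.3f}")
```

Neighbouring positions differ mostly in the fast components, so their vectors stay close, while distant positions also differ in the slow components.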

BERT is pre-trained on two unsupervised tasks: MLM and NSP.

### Masked Language Modeling (MLM)

Before a sequence is fed into BERT, about 15% of its tokens are selected for prediction. In the original paper, 80% of the selected tokens are replaced with the [MASK] token, 10% are replaced with a random token, and 10% are left unchanged. The model then attempts to predict the original value of each selected token from the context provided by the other tokens in the sequence. From a technical point of view, predicting the output words requires:

- Adding a classification layer on top of the encoder output.
- Multiplying the output vectors by the embedding matrix, converting them to the vocabulary dimension.
- Calculating the probability of each word in the vocabulary using softmax.
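
The masking step itself can be sketched as follows. This is an illustration rather than BERT's actual preprocessing code: it works on whole-word strings instead of WordPiece ids, and it uses the 80/10/10 split between [MASK], a random token, and the unchanged token described in the BERT paper. The function name and the toy vocabulary are assumptions.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (model_inputs, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else (those positions are ignored in the loss).
    """
    rng = rng or random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Never select special tokens; select others with probability mask_prob
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels

vocab = ["my", "dog", "is", "cute", "he", "likes", "playing"]
tokens = ["[CLS]", "my", "dog", "is", "cute", "[SEP]"]
inputs, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(inputs)
print(labels)
```

Only positions with a label contribute to the MLM loss, so the model is never rewarded for simply copying unmasked input tokens.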

### Next Sentence Prediction (NSP)

As part of the training process, the BERT model receives pairs of phrases as input and learns to predict whether the second phrase in the pair follows the first in the source text. During training, 50% of the inputs are pairs in which the second phrase really is the next phrase in the original text, and in the remaining 50% a random phrase from the corpus is selected as the second phrase. It is assumed that a random phrase will not be related in meaning to the first phrase.

To help the model distinguish between two phrases during training, the input data is processed as follows before entering the model:

- The [CLS] token is inserted at the beginning of the first phrase. The [SEP] token is inserted at the end of each phrase.
- A phrase embedding (vector representation) denoting Phrase A or Phrase B is added to each token. Phrase embeddings are similar in concept to token embeddings with a vocabulary of two elements.
- Positional embedding is added to each token to indicate its position in the sequence.
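
The three steps above can be sketched as a small helper that assembles the model input for a phrase pair. The function name is hypothetical and the phrases are already split into tokens; a real tokenizer would also convert the tokens to ids.

```python
def build_pair_input(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], phrase A and its [SEP]; segment 1 covers phrase B and its [SEP]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Positions simply count tokens from the start of the sequence
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_pair_input(
    ["my", "dog", "is", "cute"], ["he", "likes", "playing"]
)
for row in zip(tokens, segment_ids, position_ids):
    print(row)
```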

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab12/NSP.png" alt="NSP" width="1000"/>

To predict whether the second phrase is actually related to the first, the following steps are performed:

- The entire input sequence passes through the transformer model.
- The output of the [CLS] token is converted into a 2x1 vector using a simple classification layer (trained weight and bias matrices).
- The probability of IsNextSequence is calculated using softmax.

During training, MLM and NSP are trained together, with the goal of minimizing the combined loss function of the two strategies.
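
A minimal numeric sketch of the combined objective is shown below. The logits are made-up numbers, the vocabulary has only four entries, and the assumption that index 0 means "is next" is a convention chosen for this example. The combined loss is taken as the plain sum of the two cross-entropy terms.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_index])

# NSP: two logits computed from the [CLS] output (index 0 = is next, by assumption)
nsp_logits = [2.0, 0.5]
nsp_loss = cross_entropy(nsp_logits, 0)

# MLM: one row of vocabulary logits per masked position, with the true token ids
mlm_logits = [[0.1, 3.0, 0.2, 0.0], [1.5, 0.3, 0.2, 2.0]]
mlm_targets = [1, 3]
mlm_loss = sum(cross_entropy(l, t) for l, t in zip(mlm_logits, mlm_targets)) / len(mlm_targets)

total_loss = mlm_loss + nsp_loss
print(f"P(is next) = {softmax(nsp_logits)[0]:.3f}")
print(f"MLM loss = {mlm_loss:.3f}, NSP loss = {nsp_loss:.3f}, total = {total_loss:.3f}")
```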

After BERT is trained on these 2 tasks, the learned model can be then used as a feature extractor for different NLP problems, where we can either keep the learned weights fixed and just learn the newly added task-specific layers or fine-tune the pre-trained layers too.

Some useful links: [click1](https://huggingface.co/blog/bert-101), [click2](http://jalammar.github.io/illustrated-bert/)

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
!pip install datasets

In [None]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
dataset["train"][0]

{'label': 4,
 'text': "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank."}

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

The Yelp reviews dataset consists of reviews from Yelp, extracted from the Yelp Dataset Challenge 2015 data. It can be used for text classification and sentiment classification tasks. As the dataset structure shows, it has train and test parts, and both have 'label' and 'text' fields.

For a better understanding, keep the documentation for AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, and Trainer open. All of them are classes from the 'transformers' library.

In [None]:
from transformers import AutoTokenizer

# Use an AutoTokenizer for BERT
# automatically load the appropriate tokenizer for a given model
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")


def tokenize_function(examples):
    """ Function takes a batch of examples and tokenizes the "text" field using the BERT tokenizer

    padding="max_length" ensures all sequences in the batch are padded to the same length.
    truncation=True ensures sequences longer than the model's maximum input length are truncated.
    The function is passed to the dataset's map() method, which calls it on batches of examples.

    Args:
        examples (LazyBatch): its elements are texts and labels

    Returns:
        transformers.tokenization_utils_base.BatchEncoding or dict:
            input_ids, token_type_ids, and attention_mask
    """
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# preparing small training and evaluation datasets:
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(1000)).map(tokenize_function, batched=True)
small_eval_dataset = dataset["test"].shuffle(seed=42).select(range(1000)).map(tokenize_function, batched=True)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
# Check and update your packages:
!pip install -U transformers[torch] accelerate
!pip install -U torch
# Probably session restart will be required

In [None]:
# load the model for sequence classification tasks
# num_labels=5: 5 labels, corresponding to the five star ratings in the Yelp dataset
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
%%capture
!pip install evaluate

In [None]:
# metric
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

In [None]:
def compute_metrics(eval_pred):
    # Compute accuracy from the model's evaluation output.
    # logits and labels are produced by the model during evaluation.
    # logits has shape (batch_size, n_labels); np.argmax(logits, axis=-1)
    # picks the index of the largest logit in each row (one per batch element).
    # predictions holds the predicted label for each element.
    # labels holds the true class label for each element, shape (batch_size,).
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
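
To see what `compute_metrics` does without loading a model, here is a small stand-alone sketch of the same argmax-then-compare logic in plain Python. The logits are made-up values for three reviews and five star classes, and the helper names are chosen for this example only.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per example and compare with the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

logits = [
    [0.1, 0.2, 2.5, 0.0, 0.3],  # predicted class 2
    [1.9, 0.4, 0.1, 0.2, 0.0],  # predicted class 0
    [0.3, 1.1, 0.2, 0.1, 0.9],  # predicted class 1
]
labels = [2, 0, 4]
result = accuracy_from_logits(logits, labels)
print(result)
```

Two of the three predictions match their labels here, so the reported accuracy is two thirds.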

In [None]:
from transformers import TrainingArguments, Trainer

# TrainingArguments holds various training-related arguments.
# Note: recent transformers releases rename evaluation_strategy to eval_strategy.
training_args = TrainingArguments(output_dir="example_trainer", evaluation_strategy="epoch", report_to='tensorboard')

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

In [None]:
# model training
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.305862,0.442
2,No log,1.123573,0.526
3,No log,1.045434,0.586


TrainOutput(global_step=375, training_loss=0.974308349609375, metrics={'train_runtime': 396.7599, 'train_samples_per_second': 7.561, 'train_steps_per_second': 0.945, 'total_flos': 789354427392000.0, 'train_loss': 0.974308349609375, 'epoch': 3.0})

## RoBERTa (Robustly Optimized BERT Pretraining Approach)

The RoBERTa model was proposed in _RoBERTa: A Robustly Optimized BERT Pretraining Approach_ by Yinhan Liu et al.

It builds upon the BERT architecture, utilizing the same Transformer encoder framework, and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

#### Key Improvements and Changes:
- **Larger Training Corpus:** RoBERTa is trained on a significantly larger dataset. While BERT was trained on the BooksCorpus and English Wikipedia (16GB of text), RoBERTa uses additional data sources like Common Crawl News, OpenWebText, and others, amounting to around 160GB of text.
- **Longer Training Time:** RoBERTa undergoes longer training with more epochs, ensuring that the model learns better representations.
- **Dynamic Masking:** Unlike BERT, which uses static masking (the same tokens are masked every epoch), RoBERTa employs dynamic masking, where the masking pattern changes during each epoch. This provides a richer learning experience as the model sees a variety of masked tokens.
- **Increased Batch Size**: RoBERTa uses larger batch sizes during training, which helps in achieving more stable and effective training.
- **Higher Learning Rates:** Adjustments to learning rates during training contribute to more robust model performance.
- **No `token_type_ids`:** RoBERTa doesn’t have `token_type_ids`, so you don’t need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `</s>`).
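
The difference between static and dynamic masking can be illustrated with a toy sketch. It only tracks which positions get masked in each epoch; the sequence length of 50, the three epochs, and the helper name are assumptions for this example.

```python
import random

def choose_mask_positions(num_tokens, mask_prob, rng):
    """Pick the positions to mask in one pass over a sequence."""
    return [i for i in range(num_tokens) if rng.random() < mask_prob]

NUM_TOKENS, MASK_PROB, EPOCHS = 50, 0.15, 3

# Static masking: positions are chosen once during preprocessing and reused
static_positions = choose_mask_positions(NUM_TOKENS, MASK_PROB, random.Random(0))
static_epochs = [static_positions for _ in range(EPOCHS)]

# Dynamic masking: positions are re-sampled every time the sequence is fed to the model
dynamic_rng = random.Random(0)
dynamic_epochs = [choose_mask_positions(NUM_TOKENS, MASK_PROB, dynamic_rng) for _ in range(EPOCHS)]

for epoch in range(EPOCHS):
    print(epoch, "static:", static_epochs[epoch], "dynamic:", dynamic_epochs[epoch])
```

With dynamic masking the model sees a different set of prediction targets for the same sentence in every epoch.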


#### Removal of Next Sentence Prediction (NSP)
Only Dynamic MLM is used during pretraining. RoBERTa eliminates the NSP task used in BERT. Studies showed that removing this task did not negatively impact performance and allowed for more efficient use of computational resources.


### Train/inference tips
- RoBERTa has the same architecture as BERT, but uses a [byte-level BPE](https://huggingface.co/learn/nlp-course/ru/chapter6/5) as a tokenizer (same as GPT-2) and uses a different pretraining scheme.
- RoBERTa doesn’t have token_type_ids
- Dynamic Masking: tokens are masked differently at each epoch, whereas BERT does it once and for all

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab12/Roberta.png" alt="RoBERTa" width="800"/>

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab12/Roberta2.png" alt="RoBERTa" width="800"/>


[RoBERTa on HuggingFace](https://huggingface.co/docs/transformers/en/model_doc/roberta)

#### NER example

In [None]:
task = "ner"
model_checkpoint = "roberta-base"
batch_size = 16

In [None]:
from datasets import load_dataset, load_metric
dataset = load_dataset("conll2003")
dataset

Downloading builder script:   0%|          | 0.00/9.57k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/12.3k [00:00<?, ?B/s]

The repository for conll2003 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/conll2003.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Downloading data:   0%|          | 0.00/983k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14041 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3250 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3453 [00:00<?, ? examples/s]

For the NER tags, 0 corresponds to 'O', 1 to 'B-PER', and so on. Besides 'O' (which means no named entity), there are four entity types here, each prefixed with 'B-' (for beginning) or 'I-' (for inside) to indicate whether the token is the first one of the current entity or a continuation of it:
- 'PER' for person
- 'ORG' for organization
- 'LOC' for location
- 'MISC' for miscellaneous

In [None]:
dataset['train'][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [None]:
label_list = dataset["train"].features[f"{task}_tags"].feature.names
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
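
To see how the numeric tags map back to entities, here is a small sketch that decodes the first training example shown above. The helper `extract_entities` is written for this illustration and is not part of the datasets library; `label_names` repeats the label list printed above.

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def extract_entities(tokens, tag_ids, names=label_names):
    """Group B-/I- tagged tokens into (text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for token, tag_id in zip(tokens, tag_ids):
        tag = names[tag_id]
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type)
        if starts_new:
            # Close the previous entity, if any, and open a new one
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_words.append(token)  # continuation of the current entity
        else:
            # 'O' tag: close any open entity
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities

tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]
print(extract_entities(tokens, ner_tags))
# [('EU', 'ORG'), ('German', 'MISC'), ('British', 'MISC')]
```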

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [None]:
def tokenize_and_align_labels(examples, label_all_tokens = True):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples[f"{task}_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # we set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # for the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
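
The alignment logic is easier to follow on a tiny hand-made example. The sketch below repeats the inner loop of `tokenize_and_align_labels` without a tokenizer. The `word_ids` list is hypothetical: it pretends that the second word was split into two subword tokens and that the sequence is wrapped in two special tokens.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Expand word-level labels to token level, using -100 for ignored positions."""
    aligned, previous = [], None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(-100)                    # special token
        elif word_idx != previous:
            aligned.append(word_labels[word_idx])   # first token of a word
        else:
            # later tokens of the same word
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous = word_idx
    return aligned

word_ids = [None, 0, 1, 1, 2, None]  # hypothetical tokenizer output for three words
word_labels = [3, 0, 7]              # one label per word

print(align_labels(word_ids, word_labels, label_all_tokens=True))
# [-100, 3, 0, 0, 7, -100]
print(align_labels(word_ids, word_labels, label_all_tokens=False))
# [-100, 3, 0, -100, 7, -100]
```

With `label_all_tokens=False` only the first subword of each word contributes to the loss, which avoids counting long words several times.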

In [None]:
tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/14041 [00:00<?, ? examples/s]

Map:   0%|          | 0/3250 [00:00<?, ? examples/s]

Map:   0%|          | 0/3453 [00:00<?, ? examples/s]

In [None]:
tokenized_datasets['train']

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 14041
})

In [None]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
args = TrainingArguments(
    f"{model_checkpoint}-finetuned-{task}",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to = 'tensorboard'
)

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)

In [None]:
%%capture
!pip install seqeval

In [None]:
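import numpy as np
import evaluate

# load_metric was removed from recent versions of datasets; evaluate.load is its replacement
metric = evaluate.load("seqeval")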

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.0268,0.056862,0.94422,0.951266,0.94773,0.987282
2,0.0173,0.05584,0.954392,0.960521,0.957447,0.989384
3,0.0105,0.057976,0.954456,0.959955,0.957197,0.989031


TrainOutput(global_step=2634, training_loss=0.018928147620474105, metrics={'train_runtime': 607.0306, 'train_samples_per_second': 69.392, 'train_steps_per_second': 4.339, 'total_flos': 1022948654606748.0, 'train_loss': 0.018928147620474105, 'epoch': 3.0})

In [None]:
trainer.evaluate()

{'eval_loss': 0.05797554925084114,
 'eval_precision': 0.9544558174476476,
 'eval_recall': 0.9599546656592368,
 'eval_f1': 0.9571973442576635,
 'eval_accuracy': 0.9890307140007978,
 'eval_runtime': 15.1208,
 'eval_samples_per_second': 214.935,
 'eval_steps_per_second': 13.491,
 'epoch': 3.0}

# Task
Your task is to download a trained DistilBERT model and generate answers to questions. Get familiar with one more [BERT-based model here](https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).

### Task 1. 
1. Download DistilBertForQuestionAnswering model from transformers
2. Download DistilBertTokenizer
3. Use cuda if available

In [2]:
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-distilled-squad')
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad')

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

DistilBertForQuestionAnswering(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
      

### Task 2. 
Download the dataset with questions, context, and answers. **Prepare** it for further use.

In [3]:
!wget https://raw.githubusercontent.com/Dnau15/LabImages/main/data/txt_data/Questions.csv

--2024-07-17 13:19:53--  https://raw.githubusercontent.com/Dnau15/LabImages/main/data/txt_data/Questions.csv
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 547913 (535K) [text/plain]
Saving to: ‘Questions.csv’


2024-07-17 13:19:54 (2,00 MB/s) - ‘Questions.csv’ saved [547913/547913]



In [12]:
import pandas as pd
df = pd.read_csv('Questions.csv')

In [None]:
assert df.shape == (1000, 6)

In [5]:
# Preprocess the dataset if needed

df = df.dropna()
df.head()

(884, 6)


Unnamed: 0,question,distractor3,distractor1,distractor2,correct_answer,support
0,Compounds that are capable of accepting electr...,residues,antioxidants,Oxygen,oxidants,Oxidants and Reductants Compounds that are cap...
1,What term in biotechnology means a genetically...,phenotype,adult,male,clone,But transgenic animals just have one novel gen...
2,Vertebrata are characterized by the presence o...,Thumbs,Bones,Muscles,backbone,Figure 29.7 Vertebrata are characterized by th...
3,What is the height above or below sea level ca...,variation,depth,latitude,elevation,"As you know, the surface of Earth is not flat...."
4,"Ice cores, varves and what else indicate the e...",magma,mountain ranges,fossils,tree rings,"Tree rings, ice cores, and varves indicate the..."


In [None]:
assert df.shape == (884, 6)

### Task 3. 
Define get_prediction function, follow the instructions in the comments. Predict answers and check the results

In [30]:
# Build the function for predictions

def get_prediction(question: str, context: str) -> str:
    """ Function for computing the answer. """

    # processes the question and context into token IDs, use the tokenizer,
    # ensure the output is in PyTorch tensor format, set the max_length to 512, move it to device
    inputs = tokenizer(question, context, return_tensors="pt", max_length=512,  truncation=True).to(device)

    # disable gradient calculation and compute the outputs
    with torch.no_grad():
        outputs = model(**inputs)         

    # find the index of the highest value in the start logits - the most likely start position of the answer
    answer_start_index = outputs.start_logits.argmax()

    # find the index of highest value in the end logits - the most likely end position of the answer
    answer_end_index = outputs.end_logits.argmax()

    # extract the token IDs corresponding to the predicted answer span from the input IDs
    predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]

    # decode using the tokenizer
    prediction = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)
    return prediction
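
Because `get_prediction` takes the argmax of the start and end logits independently, the predicted end can land before the predicted start, in which case the slice is empty. One common alternative, sketched below in plain Python, is to search for the best-scoring span with start <= end. This is an optional refinement and not part of the task solution; the function name, the length limit of 30 tokens, and the logits are assumptions for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed score and start <= end."""
    best, best_score = None, float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best

start_logits = [0.1, 0.2, 3.0, 0.5, 0.1]
end_logits = [0.0, 4.0, 0.3, 2.5, 0.2]

# Independent argmax: start = 2 but end = 1, which gives an empty answer
naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
print("independent argmax:", naive)
print("best valid span:", best_span(start_logits, end_logits))
```

The double loop is quadratic in the sequence length, which is acceptable for 512 tokens with a short maximum answer length.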

In [7]:
from tqdm import tqdm
import logging
logging.disable(logging.WARNING)

predictions = []
# predict answers
for question, context in tqdm(zip(df['question'].values, df['support'].values), total=len(df)):
    predictions.append(get_prediction(question, context))


100%|██████████| 884/884 [00:58<00:00, 15.02it/s]


In [8]:
predictions = [pred if pred!='' else 'none' for pred in predictions]
assert len(predictions) == 884

### Task 4. 
Compute the fraction of correct answers

In [9]:
# count the fraction of correct answers  

def correct_pred(predicted, real):
  correct = 0
  for pred, test in zip(predicted, real):
    if pred == test:
      correct += 1
  return correct / min(len(predicted), len(real))
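
`correct_pred` counts only exact string matches, so a difference in case or punctuation turns an otherwise correct answer into a miss. The sketch below shows a more lenient comparison, loosely modelled on the answer normalization commonly used for SQuAD-style evaluation. It is an illustration only: the function names and sample strings are made up, and the asserts in this notebook expect the strict version above.

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def normalized_accuracy(predicted, real):
    """Fraction of predictions equal to the reference after normalization."""
    pairs = list(zip(predicted, real))
    return sum(normalize(p) == normalize(r) for p, r in pairs) / len(pairs)

predicted = ["oxygen", "the backbone", "tree rings.", "clone"]
real = ["Oxygen", "backbone", "tree rings", "phenotype"]

strict = sum(p == r for p, r in zip(predicted, real)) / len(real)
print("strict:", strict)
print("normalized:", normalized_accuracy(predicted, real))
```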

In [10]:
fraction = correct_pred(predictions, list(df['correct_answer']))
fraction

0.6481900452488688

In [None]:
assert 0.63 < fraction < 0.66

# Conclusion

In this lesson, we've delved deeply into the world of BERT and RoBERTa, two of the most influential models in natural language processing. Let's summarize the key takeaways:
- BERT Overview and Architecture: BERT has revolutionized NLP with its ability to consider the context of a word from both directions, thanks to its Transformer-based architecture. BERT's architecture is composed of multiple layers of bidirectional Transformer encoders, which allow it to capture intricate relationships within the text. This architecture is fundamental to BERT's understanding of language.
- BERT’s Pre-training Tasks:
    - MLM: Predicting masked words in a sentence, helping the model learn context from both directions.
    - NSP: Understanding the relationship between sentence pairs, which enhances tasks like question answering and natural language inference.
- Practical Label Prediction Using BERT: We explored how BERT can be fine-tuned for specific tasks like text classification. 
- RoBERTa and Its Enhancements: RoBERTa builds upon BERT by optimizing the pre-training process. Key improvements include training on more data, removing the NSP task, and using dynamic masking during training. These changes result in a more robust model that often outperforms BERT.
- Training RoBERTa and Practical Example with NER: The training of RoBERTa involves longer training times, larger batch sizes, and more data, leading to better generalization and performance in various NLP tasks. We provided a practical example of training and evaluating a Named Entity Recognition model using RoBERTa. This example demonstrated how to prepare data, fine-tune a model, and evaluate its performance on entity recognition tasks.
- Question Answering with DistilBERT: Lastly, we tackled a practical question answering task using DistilBERT, a smaller and faster version of BERT. This showed how to use a DistilBERT model that was already fine-tuned on SQuAD to extract answers from a given text, illustrating its efficiency and effectiveness for real-world applications.

By understanding the architecture and training processes of BERT and RoBERTa, and by seeing practical implementations in tasks like NER and question answering, you are now equipped with the knowledge to leverage these powerful models in your own NLP projects.