BERT stands for **Bidirectional Encoder Representations from Transformers**. It is a transformer-based pre-training technique for natural language processing (NLP) developed by **Google**. BERT is designed to capture the context in which a word appears, for example in search queries, and it improves performance on NLP tasks such as question answering, sentiment analysis, and named entity recognition. The key features and concepts related to BERT are:

## Key Features of BERT:

1. **Bidirectional Training:**

BERT is trained to consider context from both directions (left-to-right and right-to-left) in all layers, unlike traditional language models that only look at context from one direction.

2. **Transformer Architecture:**

BERT uses the transformer architecture, which relies on attention mechanisms to model the relationships between all words in a sentence regardless of their distance from each other.

3. **Pre-training and Fine-tuning:**

**Pre-training:** BERT is pre-trained on a large corpus of text, including Wikipedia and the BookCorpus, in a self-supervised manner, so no manually labeled data is required. During pre-training, it learns to predict masked words in a sentence (Masked Language Model) and to predict whether the second sentence in a pair actually follows the first (Next Sentence Prediction).

**Fine-tuning:** After pre-training, BERT can be fine-tuned on a specific task with labeled data. Fine-tuning adjusts BERT's parameters for that task, such as sentiment analysis or question answering.

4. **Masked Language Model (MLM):**

During pre-training, about 15% of the input tokens are selected at random and masked, and BERT learns to predict them from the surrounding context on both sides. This forces the model to build a deep understanding of context.

5. **Next Sentence Prediction (NSP):**

BERT also learns to predict whether a given sentence B is the actual next sentence that follows sentence A in the original text, which helps it understand the relationship between sentences.
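
The masking step of the Masked Language Model objective (feature 4) can be illustrated with a small sketch. This is a simplification, not BERT's actual preprocessing: the helper name `mask_tokens` is hypothetical, whitespace-split words stand in for WordPiece tokens, and every selected token is replaced by `[MASK]` (the original BERT recipe replaces only 80% of selected tokens with `[MASK]`, swaps 10% for a random token, and leaves 10% unchanged).

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a simplified MLM example.

    labels[i] holds the original token at masked positions and None elsewhere,
    so a model would only be trained to predict the masked positions.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one token so short sentences still yield a training target.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    masked_positions = set(rng.sample(range(len(tokens)), n_to_mask))

    masked, labels = [], []
    for i, tok in enumerate(tokens):
        if i in masked_positions:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

Because the model sees the unmasked tokens on both sides of each `[MASK]`, predicting the label requires using left and right context together, which is what makes the training bidirectional.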

### Applications:

1. Question Answering: BERT is used in systems that need to understand and answer questions based on a given context.

2. Sentiment Analysis: BERT helps determine the sentiment expressed in a piece of text.

3. Named Entity Recognition (NER): BERT can identify and classify proper names and other entities in text.

4. Text Classification: BERT is used to classify texts into predefined categories.

5. Language Translation: BERT's contextual representations can support translation systems, although BERT itself is an encoder-only model and does not generate text on its own.

By utilizing BERT, many NLP tasks have seen significant performance improvements, making it a cornerstone model in the field of natural language processing.

In [1]:
!pip install bert-score


In [2]:
from bert_score import score

In [3]:
generated_text = ["The quick brown fox jumps over the lazy dog"]
original_text = [" A fast brown fox leaps over the lazy dog"]

In [4]:
# Calculate BERTScore: candidate sentences first, reference sentences second
P, R, F1 = score(generated_text, original_text, lang="en", verbose=True)


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.
computing greedy matching.
done in 1.21 seconds, 0.82 sentences/sec


In [8]:
print(f"Precision: {P.mean().item()}")
print(f"Recall: {R.mean().item()}")
print(f"F1: {F1.mean().item()}")

Precision: 0.9893993735313416
Recall: 0.9893993735313416
F1: 0.9893993735313416


In [11]:
print(f"Precision: {P.item()}")
print(f"Recall: {R.item()}")
print(f"F1: {F1.item()}")

Precision: 0.9893993735313416
Recall: 0.9893993735313416
F1: 0.9893993735313416


### P (Precision Matrix)

In the derivation below, $P$ denotes the matrix of pairwise cosine similarities between the embeddings of the tokens in the candidate sentence and in the reference sentence. Note that the variable `P` returned by `score` is not this matrix: it is a tensor holding one precision value per candidate-reference pair, computed from the matrix as described in the next section.

For a candidate sentence with embeddings $C = \{c_1, c_2, \ldots, c_m\}$ and a reference sentence with embeddings $R = \{r_1, r_2, \ldots, r_n\}$, the $P$ matrix is of size $m \times n$.

Each element $P_{ij}$ in the matrix is the cosine similarity between the $i$-th token embedding in the candidate sentence and the $j$-th token embedding in the reference sentence.

### P.mean() (Mean Precision Score)

The precision score is the average of the maximum cosine similarities for each token in the candidate sentence.

After computing the $P$ matrix, for each token $c_i$ in the candidate sentence, you identify the maximum similarity score with any token in the reference sentence. Formally, $P_i = \max_j (P_{ij})$, where $P_i$ is the maximum similarity score for the $i$-th token in the candidate sentence.

The precision for one candidate-reference pair is then computed as:

$$P_{\text{BERT}} = \frac{1}{m} \sum_{i=1}^{m} P_i$$

`score` returns one such value per pair in the tensor `P`, and `P.mean()` averages these values over all pairs. With a single pair, as in the example above, `P.mean().item()` and `P.item()` give the same number.
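
The greedy matching described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the `bert_score` implementation: the function names `cosine` and `greedy_match_scores` are hypothetical, the 2-D vectors are made-up stand-ins for real contextual embeddings, and optional refinements such as importance weighting of tokens are left out. Recall is computed the same way as precision, but taking the best candidate match for each reference token, and F1 is the harmonic mean of the two.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(cand_emb, ref_emb):
    """Return (precision, recall, f1) for one candidate-reference pair."""
    # sim[i][j] = cosine similarity of candidate token i and reference token j
    sim = [[cosine(c, r) for r in ref_emb] for c in cand_emb]

    # Precision: best reference match for each candidate token, averaged.
    precision = sum(max(row) for row in sim) / len(cand_emb)

    # Recall: best candidate match for each reference token, averaged.
    best_per_ref = [
        max(sim[i][j] for i in range(len(cand_emb)))
        for j in range(len(ref_emb))
    ]
    recall = sum(best_per_ref) / len(ref_emb)

    # F1: harmonic mean of precision and recall (assumes their sum is nonzero).
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy embeddings, invented for illustration only.
cand = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
p, r, f1 = greedy_match_scores(cand, ref)
print(f"Precision: {p:.4f}  Recall: {r:.4f}  F1: {f1:.4f}")
```

In this toy case every candidate token has an exact match in the reference, so precision is 1, while the reference token `[1.0, 1.0]` has no exact counterpart in the candidate, which pulls recall below 1.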


### Disadvantages of BERT Score:

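1. It has high latency: every sentence must be passed through a large transformer model, which makes the metric very compute-intensive.

2. It works well only for short contexts and does not scale well when a large corpus has to be evaluated.

3. It is based on a BERT-style model, so its scores depend on the underlying transformer model that is used.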