# Supervised Text Quality Evaluation

This notebook demonstrates the following metrics for comparing generated text to a reference:

* BLEU (Bilingual Evaluation Understudy)
* BERTScore
* ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

## Install required packages

In [None]:
!pip install nltk rouge bert_score pandas

## Imports

In [None]:
from bert_score import BERTScorer
from nltk.translate.bleu_score import sentence_bleu
import pandas as pd
from rouge import Rouge

# Example Text

We need some examples of text to score. Here we define one to use as our ground truth and two that we will compare against it:

- response_a: The target; a clear and concise explanation.
- response_b: A confusing and convoluted explanation that mixes several ideas.
- response_c: An explanation that doesn't connect well and is very unclear.

In [None]:
response_a = """Blockchain is a decentralized digital ledger that records and
stores data in a secure and transparent manner. It is a chain of blocks, where
each block contains a set of transactions that are verified and added to the
chain through a consensus mechanism. This means that the data stored on the
blockchain cannot be altered or deleted, making it immutable and tamper-proof.
The data is also distributed across a network of computers, making it highly
secure and resistant to hacking. This technology is most commonly associated
with cryptocurrencies, but it has many other potential applications such as
supply chain management, voting systems, and smart contracts. Overall,
blockchain provides a reliable and efficient way to store and transfer data
without the need for intermediaries, making it a revolutionary technology with
the potential to transform various industries."""

response_b = """Blockchain is internet technology which distributes a duplicate
record to all nodes in order to protect the network from fraud or dishonesty.
The digital ledger holds transactions and allows instant verification by the
entire system. All copies or blocks of data connected one after another into
processing lines form blockchain. It was first developed for bitcoin payments
but can be used for other cryptocurrency in major organizations.Finally,
blockchain has a large scalability and stores data in a block structure that is
permanently chained together. Blockchain tech is on internet where same record
gets spread on all nodes concurrently keeping net secure from fraudulence.
Ledger keeps track of transactions through instantaneous confirmation with the
whole setup."""

response_c = """Think of the blockchain as a Jenga tower that is constantly
growing and adding new blocks at the top, while also reinforcing and connecting
each block to the blocks underneath it. This tower only moves forward with
consensus from all the players involved, but anyone can contribute their own
block as long as they follow the predetermined rules and maintain transparency
within the chain. This ensures an ever-evolving and decentralized network that
maintains immutability through shared verification processes. Essentially, it's
like creating a never-ending game of three-dimensional collaborative digital
Tetris while wearing blindfolds made of pseudo-random algorithms."""

# Compute Evaluation Metrics

## BLEU

- **Interpretation**: BLEU measures n-gram **precision**: the share of word sequences in the generated response that also appear in the reference. Higher BLEU scores indicate closer word choice and phrasing to the reference, which is often read as a sign of **coherence**. However, BLEU rewards only exact matches, so it can penalize well-phrased responses that use different wording.
- **Typical Range**: BLEU scores range from 0 to 1 (or 0–100%), with higher scores representing greater overlap with the reference. Scores above 0.5 (50%) often indicate high similarity, but a low score does not by itself mean a poor response: a fluent paraphrase can also score low.
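To make the sensitivity to exact matches concrete, here is a minimal sketch of the clipped ("modified") n-gram precision that BLEU is built on. It uses only the standard library. The helper names and the toy sentences are illustrative assumptions, not part of NLTK.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    # A hypothesis n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0


# Toy sentences (hypothetical), tokenized on whitespace.
reference = "the cat sat on the mat".split()
paraphrase = "a cat was sitting on the mat".split()

p1 = modified_precision(reference, paraphrase, 1)  # unigram precision
p2 = modified_precision(reference, paraphrase, 2)  # bigram precision
print(f"1-gram precision: {p1:.3f}")
print(f"2-gram precision: {p2:.3f}")
```

The paraphrase keeps the meaning, yet only 4 of its 7 words and 2 of its 6 bigrams appear in the reference, so precision drops quickly as n grows.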

In [None]:
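def compute_bleu(reference, hypothesis):
    """Compute sentence-level BLEU of a hypothesis against a single reference."""
    reference_tokens = reference.split()  # Whitespace tokenization
    hypothesis_tokens = hypothesis.split()
    # sentence_bleu expects a list of tokenized references
    return sentence_bleu([reference_tokens], hypothesis_tokens)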

bleu_b = compute_bleu(response_a, response_b)
bleu_c = compute_bleu(response_a, response_c)

In [None]:
print(f"BLEU Score for Response B: {bleu_b:.3f}")
print(f"BLEU Score for Response C: {bleu_c:.3f}")
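If the printed scores come out close to zero, one likely cause is that BLEU combines the 1- to 4-gram precisions with a geometric mean, so a single zero precision (for example, no shared 4-grams) collapses the whole score. The sketch below shows that combination step and the brevity penalty using only the standard library. The function names and the precision values are assumptions chosen for illustration, not NLTK internals.

```python
import math


def brevity_penalty(ref_len, hyp_len):
    """Penalize hypotheses that are shorter than the reference."""
    if hyp_len == 0:
        return 0.0
    if hyp_len > ref_len:
        return 1.0
    return math.exp(1 - ref_len / hyp_len)


def combine_bleu(precisions, ref_len, hyp_len):
    """Uniformly weighted geometric mean of n-gram precisions times the brevity penalty."""
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined, so the unsmoothed score is zero
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(ref_len, hyp_len) * math.exp(log_mean)


# Hypothetical 1- to 4-gram precisions, chosen only for illustration.
with_overlap = combine_bleu([0.5, 0.25, 0.125, 0.0625], ref_len=100, hyp_len=100)
no_4gram_overlap = combine_bleu([0.5, 0.25, 0.125, 0.0], ref_len=100, hyp_len=100)
short_hypothesis = combine_bleu([0.5, 0.5, 0.5, 0.5], ref_len=100, hyp_len=50)

print(f"All orders overlap:  {with_overlap:.3f}")
print(f"No 4-gram overlap:   {no_4gram_overlap:.3f}")
print(f"Short hypothesis:    {short_hypothesis:.3f}")
```

For long free-form texts like the responses here, passing a `smoothing_function` from `nltk.translate.bleu_score.SmoothingFunction` to `sentence_bleu` is one way to avoid the collapse; which smoothing method suits this comparison is a judgment call.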

## BERTScore

- **Interpretation**: A high BERTScore indicates that the generated response is semantically close to the reference, which implies both **coherence** (in terms of aligned ideas) and **fluency** (capturing meaning even with varied wording).
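- **Typical Range**: Raw BERTScore values usually fall between 0 and 1, and scores above 0.85 generally signify strong semantic alignment with the reference, indicating high relevance and conceptual accuracy. The code below sets `rescale_with_baseline=True`, which spreads scores over a wider range: rescaled values are typically lower than raw ones and can be negative, so the 0.85 threshold does not carry over directly.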
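The idea behind the metric can be sketched without a model: each token is matched to its most similar token on the other side by cosine similarity, and the matches are averaged into precision, recall, and F1. The toy 2-D vectors below are hypothetical stand-ins for the contextual embeddings a real BERT model would produce, and the sketch omits options such as IDF weighting.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) from greedy best-match cosine similarities."""
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy "embeddings" (hypothetical): two reference tokens, three candidate tokens.
reference_vecs = [[1.0, 0.0], [0.0, 1.0]]
candidate_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

precision, recall, f1 = greedy_match_scores(candidate_vecs, reference_vecs)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Every reference token has an exact counterpart in the candidate, so recall is 1, while the extra candidate token that only partly matches pulls precision down.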

In [None]:
scorer = BERTScorer(lang="en", rescale_with_baseline=True)
P, R, F1 = scorer.score([response_b], [response_a])
bertscore_b = F1.item()
P, R, F1 = scorer.score([response_c], [response_a])
bertscore_c = F1.item()

In [None]:
print(f"BERTScore for Response B: {bertscore_b:.3f}")
print(f"BERTScore for Response C: {bertscore_c:.3f}")
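Because the scorer was created with `rescale_with_baseline=True`, the printed values are not raw similarities. As far as I understand the `bert_score` package, the rescaling is linear: a precomputed baseline for the model and language is subtracted and the result is divided by one minus that baseline. The sketch below illustrates the mapping. The baseline of 0.8 is a made-up value for illustration, not the one the library uses.

```python
def rescale_with_baseline(raw_score, baseline):
    """Linearly map a raw score so the baseline becomes 0 and a perfect score stays 1."""
    return (raw_score - baseline) / (1 - baseline)


# Hypothetical baseline, chosen only to make the arithmetic easy to follow.
HYPOTHETICAL_BASELINE = 0.8

rescaled = {
    raw: rescale_with_baseline(raw, HYPOTHETICAL_BASELINE)
    for raw in (0.7, 0.8, 0.9, 1.0)
}
for raw, value in rescaled.items():
    print(f"raw={raw:.2f} -> rescaled={value:.2f}")
```

Under this mapping a raw score below the baseline becomes negative, which is why rescaled scores should not be judged against thresholds quoted for raw scores.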

## ROUGE

- **Interpretation**: A higher ROUGE score (typically, ROUGE-1, ROUGE-2, or ROUGE-L) suggests that the generated text covers more relevant phrases or sequences from the reference text. This indicates better **relevance** and **completeness**.
- **Typical Range**: ROUGE scores are generally between 0 and 1 (or 0–100%). Scores closer to 1 mean greater similarity to the reference.
  - ROUGE-1 measures unigram (single-word) overlap.
  - ROUGE-2 measures bigram overlap.
  - ROUGE-L is based on the longest common subsequence, capturing in-order overlap that reflects sentence-level structure.
  - Recall (r) measures how much of the reference text’s content is captured in the generated text.
  - Precision (p) measures how much of the generated text’s content also appears in the reference text.
  - F1-score (f) is the harmonic mean of recall and precision, balancing both to give a single representative score.
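The relationship between r, p, and f for ROUGE-1 and ROUGE-2 can be sketched with n-gram counts. This is a simplified illustration with a hypothetical `rouge_n` helper and toy sentences. The `rouge` package may differ in tokenization and counting details, so its numbers need not match exactly.

```python
from collections import Counter


def rouge_n(reference, hypothesis, n):
    """Return recall, precision, and F1 for n-gram overlap between two token lists."""
    ref_counts = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    hyp_counts = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref_counts & hyp_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(hyp_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"r": recall, "p": precision, "f": f1}


# Toy sentences (hypothetical): the hypothesis is a correct but incomplete prefix.
reference = "the cat sat on the mat".split()
hypothesis = "the cat sat".split()

rouge_1 = rouge_n(reference, hypothesis, 1)
rouge_2 = rouge_n(reference, hypothesis, 2)
print("ROUGE-1:", rouge_1)
print("ROUGE-2:", rouge_2)
```

Everything in the short hypothesis appears in the reference, so precision is 1, but it covers only part of the reference, so recall is lower and F1 lands in between.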

In [None]:
rouge = Rouge()
scores_b = rouge.get_scores(response_b, response_a)[0]
scores_c = rouge.get_scores(response_c, response_a)[0]

In [None]:
print("\nROUGE Score for Response B:\n\n", pd.DataFrame(scores_b))
print("\nROUGE Score for Response C:\n\n", pd.DataFrame(scores_c))
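The `rouge-l` rows in the tables above are based on the longest common subsequence (LCS) between the two texts. A minimal sketch of that idea follows, using dynamic programming and a plain harmonic mean for F1. The helper names and toy sentences are assumptions for illustration, and the `rouge` package may weight or tokenize differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                # Extend the best subsequence that ended before both tokens.
                current.append(previous[j - 1] + 1)
            else:
                # Otherwise carry over the better of skipping either token.
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(reference, hypothesis):
    """Return LCS-based recall, precision, and F1 for two token lists."""
    lcs = lcs_length(reference, hypothesis)
    recall = lcs / len(reference) if reference else 0.0
    precision = lcs / len(hypothesis) if hypothesis else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"r": recall, "p": precision, "f": f1}


# Toy sentences (hypothetical), tokenized on whitespace.
reference = "the cat sat on the mat".split()
hypothesis = "the cat was on a mat".split()

scores = rouge_l(reference, hypothesis)
print("LCS length:", lcs_length(reference, hypothesis))
print("ROUGE-L:", scores)
```

Unlike ROUGE-2, the matched words ("the cat ... on ... mat") do not need to be adjacent, only in the same order, which is why ROUGE-L tolerates small insertions and substitutions.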