## Traditional NLP Metrics

### BLEU (Bilingual Evaluation Understudy)

BLEU scores a generated (candidate) text by its n-gram overlap with one or more reference texts: the more n-grams the candidate shares with the reference, the higher the score.

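- Precision-based: Measures the fraction of n-grams in the candidate that also appear in the reference, with each n-gram's count clipped to its count in the reference.

- Brevity Penalty: Penalizes candidates shorter than the reference, which could otherwise reach high precision by emitting only a few safe words.

- Range: 0 (no overlap) to 1 (perfect match).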

In [1]:
!pip install nltk




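$$ BLEU = BP \times \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $$

Here $p_n$ is the modified (clipped) n-gram precision, $w_n$ is the weight given to n-grams of order $n$ (typically uniform, $w_n = 1/N$ with $N = 4$), and $BP$ is the brevity penalty. With candidate length $c$ and reference length $r$:

$$ BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases} $$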
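The BLEU formula can be made concrete with a minimal from-scratch sketch for a single reference. This is not NLTK's implementation: the helper names (`ngrams`, `modified_precision`, `brevity_penalty`, `simple_bleu`) are illustrative, uniform weights are assumed, and no smoothing is applied.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Clipped n-gram precision p_n against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Credit each candidate n-gram at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(reference, candidate):
    """BP = 1 if the candidate is longer than the reference, else exp(1 - r/c)."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)


def simple_bleu(reference, candidate, max_n=4):
    """Unsmoothed BLEU with uniform weights w_n = 1 / max_n (an assumption)."""
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined, so any zero precision gives a zero score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(reference, candidate) * math.exp(log_mean)


reference = ["the", "cat", "is", "on", "the", "mat"]
candidate = ["the", "cat", "sits", "on", "the", "floor"]

print(modified_precision(reference, candidate, 1))  # 4 of 6 unigrams match
print(modified_precision(reference, candidate, 2))  # 2 of 5 bigrams match
print(simple_bleu(reference, candidate, max_n=2))   # geometric mean of the two
print(brevity_penalty(reference, ["the", "cat"]))   # exp(1 - 6/2) for a 2-word candidate
```

Note that `reference` here is a flat token list, whereas NLTK's `sentence_bleu` expects a list of tokenized references.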


In [5]:
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Example 1
reference = [["the", "cat", "is", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]

bleu_score = sentence_bleu(reference, candidate)
print(f"BLEU Score: {bleu_score:.4f}")  # Perfect match, should be 1.0

BLEU Score: 1.0000


In [6]:
# Example 2
reference = [["the", "cat", "is", "on", "the", "mat"]]
candidate = ["the", "cat", "sits", "on", "the", "floor"]

bleu_score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU Score: {bleu_score:.4f}")

BLEU Score: 0.1221


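The words “sits” and “floor” do not match the reference. This lowers the unigram and bigram precision and leaves no matching 3-grams or 4-grams, so smoothing is what keeps the score above zero.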

In [7]:
# Example 3 - Testing different n-gram weights (reuses reference and candidate from Example 2)
bleu_1gram = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))  # Unigrams only
bleu_2gram = sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0))  # Unigrams and bigrams, equally weighted
bleu_3gram = sentence_bleu(reference, candidate, weights=(1/3, 1/3, 1/3, 0))  # Unigrams through trigrams, equally weighted

print(f"BLEU (1-gram): {bleu_1gram:.4f}")
print(f"BLEU (2-gram): {bleu_2gram:.4f}")
print(f"BLEU (3-gram): {bleu_3gram:.4f}")

BLEU (1-gram): 0.6667
BLEU (2-gram): 0.5164
BLEU (3-gram): 0.0000


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
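The warnings above suggest smoothing, and a dependency-free sketch shows one way it can work. It assumes an add-epsilon scheme in the spirit of `SmoothingFunction().method1`: when an n-gram order has zero matches, the zero numerator is replaced by a small `epsilon` (0.1 is assumed here). The function names are hypothetical and this is not NLTK's code.

```python
import math
from collections import Counter


def clipped_matches(reference, candidate, n):
    """Return (clipped matching n-grams, total candidate n-grams)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matches = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matches, sum(cand.values())


def epsilon_smoothed_bleu(reference, candidate, max_n=4, epsilon=0.1):
    """BLEU with uniform weights where zero match counts are replaced by epsilon."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        matches, total = clipped_matches(reference, candidate, n)
        if total == 0:
            return 0.0  # candidate has fewer than n tokens
        numerator = matches if matches > 0 else epsilon  # the smoothing step
        log_sum += math.log(numerator / total) / max_n
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(log_sum)


reference = ["the", "cat", "is", "on", "the", "mat"]
candidate = ["the", "cat", "sits", "on", "the", "floor"]

print(f"{epsilon_smoothed_bleu(reference, candidate):.4f}")  # 0.1221
```

Under these assumptions the precisions are 4/6, 2/5, 0.1/4, and 0.1/3, and their geometric mean reproduces the smoothed score from Example 2.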


### ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

#### How ROUGE Works

ROUGE measures how much of the reference text is recovered by the generated text, based on word and n-gram overlap. It is used mainly to evaluate summarization.

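#### Types of ROUGE

1. ROUGE-1: Unigram (single-word) overlap.

2. ROUGE-2: Bigram (two consecutive words) overlap.

3. ROUGE-L: Longest Common Subsequence (LCS), the longest sequence of words that appears in both texts in the same order, though not necessarily contiguously.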
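The longest common subsequence behind ROUGE-L can be computed with a short dynamic program. The sketch below is illustrative only: `lcs_length` is a hypothetical helper, tokenization is plain whitespace splitting, and no stemming is applied.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)  # extend the subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # carry the best so far
        prev = curr
    return prev[-1]


reference = "the cat is on the mat".split()
candidate = "the cat sits on the floor".split()

lcs = lcs_length(reference, candidate)
print(lcs)                   # 4 ("the cat on the")
print(lcs / len(reference))  # ROUGE-L recall
print(lcs / len(candidate))  # ROUGE-L precision
```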

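#### ROUGE Formula


$$ ROUGE\text{-}N_{recall} = \frac{|Overlapping\ n\text{-}grams|}{|n\text{-}grams\ in\ reference|} $$

This is the recall form. Precision divides the same overlap count by the number of n-grams in the candidate instead, and the F1-score is the harmonic mean of the two. The `rouge-score` package reports all three.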
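Precision, recall, and F1 for ROUGE-N can be sketched without any library. The function `rouge_n` is a hypothetical helper that assumes lowercase whitespace tokenization and skips stemming, so its results can differ from `rouge-score` on inputs where stemming matters.

```python
from collections import Counter


def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, f1) for n-gram overlap between two strings."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    cand = Counter(tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


reference = "the cat is on the mat"
candidate = "the cat sits on the floor"

for n in (1, 2):
    p, r, f = rouge_n(reference, candidate, n)
    print(f"rouge{n}: Precision={p:.4f}, Recall={r:.4f}, F1-Score={f:.4f}")
```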

In [8]:
!pip install rouge-score



In [9]:
from rouge_score import rouge_scorer

# Example 1
reference = "the cat is on the mat"
candidate = "the cat sits on the floor"

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, candidate)

# Display scores
for key, value in scores.items():
    print(f"{key}: Precision={value.precision:.4f}, Recall={value.recall:.4f}, F1-Score={value.fmeasure:.4f}")

rouge1: Precision=0.6667, Recall=0.6667, F1-Score=0.6667
rouge2: Precision=0.4000, Recall=0.4000, F1-Score=0.4000
rougeL: Precision=0.6667, Recall=0.6667, F1-Score=0.6667


In [10]:
# Example 2 - Perfect match
reference = "the cat is on the mat"
candidate = "the cat is on the mat"

scores = scorer.score(reference, candidate)
for key, value in scores.items():
    print(f"{key}: Precision={value.precision:.4f}, Recall={value.recall:.4f}, F1-Score={value.fmeasure:.4f}")

rouge1: Precision=1.0000, Recall=1.0000, F1-Score=1.0000
rouge2: Precision=1.0000, Recall=1.0000, F1-Score=1.0000
rougeL: Precision=1.0000, Recall=1.0000, F1-Score=1.0000


In [16]:
# Example 3 - Paraphrase with little word overlap
reference = "The quick brown fox jumps over the lazy dog"
candidate = "A fast fox leaps above a sleepy canine"

scores = scorer.score(reference, candidate)
for key, value in scores.items():
    print(f"{key}: Precision={value.precision:.4f}, Recall={value.recall:.4f}, F1-Score={value.fmeasure:.4f}")

rouge1: Precision=0.1250, Recall=0.1111, F1-Score=0.1176
rouge2: Precision=0.0000, Recall=0.0000, F1-Score=0.0000
rougeL: Precision=0.1250, Recall=0.1111, F1-Score=0.1176


### METEOR (Metric for Evaluation of Translation with Explicit ORdering)

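METEOR improves on BLEU by considering:

- Synonym Matching (e.g., “fast” and “quick” are treated as a match)

- Stem Matching (e.g., “running” and “run” are matched)

- Recall & Precision Balance (a harmonic mean weighted toward recall)

- Word Order Consideration (a penalty when matched words are scattered rather than contiguous)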

🔹 Range: 0 (bad) to 1 (perfect)

🔹 Best for: Translation & Summarization
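Once words have been aligned, the METEOR score reduces to a recall-weighted harmonic mean multiplied by a fragmentation penalty. The sketch below covers only that final step; `meteor_from_counts` is a hypothetical helper, the alignment counts are supplied by hand, and the defaults `alpha=0.9`, `beta=3.0`, `gamma=0.5` are assumed to mirror NLTK's defaults.

```python
def meteor_from_counts(matches, chunks, hyp_len, ref_len,
                       alpha=0.9, beta=3.0, gamma=0.5):
    """Combine alignment statistics into a METEOR-style score.

    matches: number of aligned words
    chunks:  number of contiguous, same-order runs among the aligned words
    """
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # Harmonic mean that weights recall more heavily than precision.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # More chunks per match means more scattered word order and a larger penalty.
    penalty = gamma * (chunks / matches) ** beta
    return (1 - penalty) * f_mean


# "the cat sits on the floor" vs "the cat is on the mat":
# 4 aligned words ("the", "cat", "on", "the") in 2 chunks ("the cat", "on the").
print(round(meteor_from_counts(matches=4, chunks=2, hyp_len=6, ref_len=6), 4))  # 0.625
```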

In [18]:
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download('wordnet')

# Example 1 - Basic METEOR Calculation
reference = [["the", "cat", "is", "on", "the", "mat"]]  # List of lists (each reference tokenized)
candidate = ["the", "cat", "sits", "on", "the", "floor"]  # Tokenized candidate sentence

score = meteor_score(reference, candidate)
print(f"METEOR Score: {score:.4f}")


METEOR Score: 0.6250


In [19]:
# Example 2 - Handling Synonyms & Stemming
reference = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]
candidate = ["a", "fast", "fox", "leaps", "above", "a", "sleepy", "canine"]

score = meteor_score(reference, candidate)
print(f"METEOR Score: {score:.4f}")  # Synonym matching gives credit that exact-match metrics such as BLEU miss

METEOR Score: 0.2871
