## Traditional NLP Metrics

### BLEU (Bilingual Evaluation Understudy)

BLEU scores a generated (candidate) text by its n-gram overlap with one or more reference texts. The more word sequences the candidate shares with a reference, the higher the score.

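- Precision-based: Measures how many of the candidate's n-grams also appear in the reference. Each n-gram is credited at most as many times as it occurs in the reference.
- Brevity Penalty: Scales the score down when the candidate is shorter than the reference, so a very short output cannot score well on precision alone.
- Range: 0 (no overlap) to 1 (exact match with a reference).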
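The clipping behind the precision term can be sketched in a few lines. This is a minimal illustration using only the standard library, not NLTK's implementation; the helper names `ngrams` and `clipped_precision` are made up for this example, and it handles a single reference only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Return (clipped matches, total candidate n-grams) for one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Credit each candidate n-gram at most as often as it occurs in the reference.
    matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matches, sum(cand_counts.values())

reference = ["the", "cat", "is", "on", "the", "mat"]
candidate = ["the"] * 7  # degenerate output that repeats one common word

matches, total = clipped_precision(candidate, reference, 1)
print(f"{matches}/{total}")  # 2/7
```

Without clipping, all seven occurrences of "the" would count as matches and the unigram precision would be 7/7. With clipping, only the two occurrences present in the reference are credited.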

In [1]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
Successfully installed nltk-3.9.1



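$$ BLEU = BP \times \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $$

Here $p_n$ is the clipped n-gram precision of order $n$, $w_n$ is its weight (uniform by default, $w_n = 1/N$ with $N = 4$), and $BP$ is the brevity penalty: 1 when the candidate is longer than the reference, and $\exp(1 - r/c)$ otherwise, for candidate length $c$ and reference length $r$.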
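To make the formula concrete, here is a small sketch that evaluates it directly from already-computed precisions. It assumes the standard brevity penalty (1 if the candidate is longer than the reference, otherwise exp(1 - r/c) for candidate length c and reference length r) and uniform weights by default. The function names `brevity_penalty` and `bleu_from_precisions` are illustrative and not part of NLTK.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """Return 1.0 for candidates longer than the reference, else exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

def bleu_from_precisions(precisions, candidate_len, reference_len, weights=None):
    """Combine per-order precisions p_1..p_N into BP * exp(sum(w_n * log p_n))."""
    if weights is None:
        weights = [1 / len(precisions)] * len(precisions)  # uniform weights
    # log(0) is undefined, so a zero precision with non-zero weight gives a score of 0.
    if any(p == 0 for p, w in zip(precisions, weights) if w > 0):
        return 0.0
    log_mean = sum(w * math.log(p) for p, w in zip(precisions, weights) if w > 0)
    return brevity_penalty(candidate_len, reference_len) * math.exp(log_mean)

# Perfect precisions, same length: no penalty.
print(f"{bleu_from_precisions([1, 1, 1, 1], 6, 6):.4f}")  # 1.0000
# Perfect precisions, but only 4 tokens against a 6-token reference.
print(f"{bleu_from_precisions([1, 1, 1, 1], 4, 6):.4f}")  # 0.6065
```

The second call shows the brevity penalty at work: every n-gram in the short candidate matches, yet the score drops to exp(1 - 6/4).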


In [5]:
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Example 1
reference = [["the", "cat", "is", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]

bleu_score = sentence_bleu(reference, candidate)
print(f"BLEU Score: {bleu_score:.4f}")  # Perfect match, should be 1.0

BLEU Score: 1.0000


In [6]:
# Example 2
reference = [["the", "cat", "is", "on", "the", "mat"]]
candidate = ["the", "cat", "sits", "on", "the", "floor"]

bleu_score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU Score: {bleu_score:.4f}")

BLEU Score: 0.1221


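The words “sits” and “floor” do not match the reference, so only 4 of 6 unigrams and 2 of 5 bigrams overlap, and no trigram or 4-gram matches at all. Smoothing (`method1`) replaces those zero counts with a small value so that they lower the score instead of forcing it to 0.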
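The score of 0.1221 can be reproduced by hand. The sketch below assumes that `method1` replaces a zero match count with a small constant epsilon, and that epsilon is 0.1 (believed to be the default of NLTK's `SmoothingFunction`; check your installed version). The brevity penalty is 1 here because the candidate and the reference both have 6 tokens.

```python
import math

epsilon = 0.1  # assumption: default epsilon of NLTK's SmoothingFunction

# Clipped matches and candidate n-gram totals for orders 1 to 4 in Example 2.
matches = [4, 2, 0, 0]
totals = [6, 5, 4, 3]

# Zero match counts are replaced by epsilon before dividing (method1-style smoothing).
precisions = [(m if m > 0 else epsilon) / t for m, t in zip(matches, totals)]

# Uniform weights of 0.25 give the geometric mean of the four precisions.
score = math.exp(sum(0.25 * math.log(p) for p in precisions))
print(f"{score:.4f}")  # 0.1221
```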

In [7]:
# Example 3 - Testing different n-gram weights (reuses reference and candidate from Example 2)
bleu_1gram = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))  # Unigrams only
bleu_2gram = sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0))  # Unigrams and bigrams, equally weighted
bleu_3gram = sentence_bleu(reference, candidate, weights=(1/3, 1/3, 1/3, 0))  # Unigrams through trigrams, equally weighted

print(f"BLEU (1-gram): {bleu_1gram:.4f}")
print(f"BLEU (2-gram): {bleu_2gram:.4f}")
print(f"BLEU (3-gram): {bleu_3gram:.4f}")

BLEU (1-gram): 0.6667
BLEU (2-gram): 0.5164
BLEU (3-gram): 0.0000


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
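The warnings can be checked by counting the overlaps per n-gram order directly. This is a standard-library sketch, independent of NLTK; `ngram_counts` is a hypothetical helper name, and the token lists are copied from Example 2.

```python
from collections import Counter

reference = ["the", "cat", "is", "on", "the", "mat"]
candidate = ["the", "cat", "sits", "on", "the", "floor"]

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

overlaps = {}
for n in range(1, 5):
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlaps[n] = (sum((cand & ref).values()), sum(cand.values()))
    print(f"{n}-gram: {overlaps[n][0]}/{overlaps[n][1]} matched")

# 1-gram: 4/6 matched
# 2-gram: 2/5 matched
# 3-gram: 0/4 matched
# 4-gram: 0/3 matched
```

Because BLEU takes a weighted geometric mean, any order with a non-zero weight and zero matches pulls the whole score to 0. That is why the trigram-weighted score is 0.0000 while the unigram and bigram variants are not.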
