# BLEU (Bilingual Evaluation Understudy)

**BLEU** is a metric for evaluating the quality of text that has been machine-translated from one language to another. It compares the n-grams of the candidate translation (the machine translation output) with the n-grams of one or more reference translations (human translations). The main idea is to measure how much of the candidate overlaps with the references, which makes BLEU a precision-oriented metric.

### Calculation Steps:

1. **Tokenize Text**: Break down both candidate and reference translations into tokens (usually words or sub-words).
2. **Count Matching n-grams**: Count the n-grams in the candidate translation that also appear in the reference translations. Each candidate n-gram is credited at most as many times as it occurs in any single reference (clipping), so repeating a matching word does not inflate the score.
3. **Precision**: Calculate the modified precision for each n-gram size (unigram, bigram, trigram, etc.): the clipped match count divided by the total number of n-grams in the candidate translation.
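4. **Brevity Penalty**: Apply a brevity penalty so that very short candidates, which can reach high precision by saying little, are not favored. If the candidate translation is shorter than the reference translation, the score is scaled down; otherwise the penalty is 1.
5. **Geometric Mean**: Combine the precision scores for the different n-gram sizes (commonly n = 1 to 4) using a geometric mean, then multiply the result by the brevity penalty.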
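
Written out, the steps above are usually combined as follows. The symbols are notation chosen here for illustration: \( p_n \) is the modified n-gram precision, \( w_n \) are the weights (typically \( 1/N \) with \( N = 4 \)), \( c \) is the candidate length, and \( r \) is the reference length.

\[
\text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\]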

![image-3.png](attachment:image-3.png)

### Example:

- Candidate Translation: "The cat is on the mat"
- Reference Translation: "The cat is sitting on the mat"

  Let's calculate BLEU with unigrams (n=1) for simplicity:

  - Unigrams in Candidate: {The, cat, is, on, the, mat}
  - Unigrams in Reference: {The, cat, is, sitting, on, the, mat}

  Matched Unigrams: {The, cat, is, on, the, mat} (6 matches)

  Precision: 
  ![image.png](attachment:image.png)

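  The candidate has 6 tokens and the reference has 7, so the candidate is shorter and the brevity penalty is \( e^{1 - 7/6} \approx 0.846 \).

  BLEU Score: \( 0.846 \times 1.0 \approx 0.846 \), where 1.0 is the unigram precision (6 of 6 candidate unigrams matched).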
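
The unigram example can be checked with a few lines of Python. This is a quick sketch that assumes whitespace tokenization and case-sensitive matching; the variable names are illustrative. Note that the candidate has 6 tokens and the reference has 7, so the brevity penalty comes out below 1.

```python
import math
from collections import Counter

candidate = "The cat is on the mat".split()
reference = "The cat is sitting on the mat".split()

# Counter intersection keeps the minimum count per token, i.e. clipped matches.
matched = sum((Counter(candidate) & Counter(reference)).values())
precision = matched / len(candidate)

# Brevity penalty: 1 if the candidate is longer, exp(1 - r/c) otherwise.
c, r = len(candidate), len(reference)
bp = 1.0 if c > r else math.exp(1 - r / c)

bleu_1 = bp * precision
print(matched, precision, round(bp, 3), round(bleu_1, 3))  # 6 1.0 0.846 0.846
```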

# ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

**ROUGE** is a set of metrics for evaluating automatic summarization and machine translation. Each metric measures the overlap between a candidate and one or more reference summaries or translations. The most common variants are ROUGE-N, ROUGE-L, and ROUGE-W.

- **ROUGE-N**: Measures the overlap of n-grams between the candidate and reference.
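- **ROUGE-L**: Based on the length of the longest common subsequence (LCS) between the candidate and reference, which rewards words that appear in the same order without requiring them to be adjacent.
- **ROUGE-W**: Weighted version of ROUGE-L that gives more credit to consecutive matches within the common subsequence.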
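
To make the ROUGE-L idea concrete, here is a minimal sketch that computes the LCS length with dynamic programming and turns it into precision, recall, and an F-score. The function names and the `beta` parameter are illustrative choices, not a specific library's API, and inputs are assumed to be pre-tokenized lists.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)  # extend the common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a token on one side
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Return (precision, recall, F) based on the LCS length."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    f = (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
    return precision, recall, f


candidate = "The cat sat on the mat".split()
reference = "The cat is sitting on the mat".split()
print(lcs_length(candidate, reference))  # 5 ("The cat on the mat")
print(rouge_l(candidate, reference))
```

With `beta=1.0` the F-score is the plain harmonic mean of precision and recall; larger values of `beta` weight recall more heavily.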

### Calculation Steps for ROUGE-N:

1. **Tokenize Text**: Break down both candidate and reference texts into tokens.
2. **Count Matching n-grams**: Count the n-grams that the candidate and reference share, crediting each n-gram at most as many times as it occurs in both texts.
3. **Calculate Precision, Recall, and F-Score**:
   - Precision: Fraction of matched n-grams to total n-grams in the candidate.
   - Recall: Fraction of matched n-grams to total n-grams in the reference.
   - F-Score: Harmonic mean of precision and recall.
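
The ROUGE-N steps above can be sketched as follows. This is a minimal single-reference version under simple assumptions (pre-tokenized input, case-sensitive matching, no stemming); `ngram_counts` and `rouge_n` are hypothetical helper names.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) for n-gram overlap."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


candidate = "The cat sat on the mat".split()
reference = "The cat is sitting on the mat".split()
print(rouge_n(candidate, reference, n=1))
print(rouge_n(candidate, reference, n=2))
```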

### Example:

- Candidate Summary: "The cat sat on the mat"
- Reference Summary: "The cat is sitting on the mat"

  Let's calculate ROUGE-1 (unigram) for simplicity:

  - Unigrams in Candidate: {The, cat, sat, on, the, mat}
  - Unigrams in Reference: {The, cat, is, sitting, on, the, mat}

  Matched Unigrams: {The, cat, on, the, mat} (5 matches)

  ![image-2.png](attachment:image-2.png)
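
For reference, the arithmetic for this example works out as follows, using the 5 matches, the 6 candidate unigrams, and the 7 reference unigrams listed above (the symbols \( P \), \( R \), and \( F_1 \) are notation chosen here):

\[
P = \frac{5}{6} \approx 0.833,
\qquad
R = \frac{5}{7} \approx 0.714,
\qquad
F_1 = \frac{2PR}{P + R} = \frac{10}{13} \approx 0.769
\]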

These metrics help to assess how well a machine-generated text aligns with human-generated references, guiding improvements in natural language processing tasks.


# Summary
- Scores closer to 1 indicate high overlap with the reference, which usually suggests a good translation or summary.
- Scores closer to 0 indicate little to no overlap with the reference, which usually suggests a poor translation or summary.
- Intermediate scores represent partial overlap and varying degrees of quality.