### BLEU

Calculating the BLEU (Bilingual Evaluation Understudy) score involves comparing a generated text (usually a machine translation) to one or more reference texts (human translations). Here’s a step-by-step example to illustrate the calculation.

### Example

**Reference Sentences:**
1. The cat is on the mat.
2. A cat is sitting on the mat.

**Generated Sentence:**
The cat is sitting on the mat.

### Step 1: Tokenization
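First, split each sentence into word tokens on whitespace. Punctuation stays attached to the preceding word (for example "mat."), and matching is case-sensitive, so "The" and "the" are different tokens. The n-grams are built from these tokens in the next step.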

**Reference 1:**  
- Tokens: ["The", "cat", "is", "on", "the", "mat."]

**Reference 2:**  
- Tokens: ["A", "cat", "is", "sitting", "on", "the", "mat."]

**Generated:**  
- Tokens: ["The", "cat", "is", "sitting", "on", "the", "mat."]
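
As a minimal sketch, the whitespace tokenization used here can be reproduced with Python's `str.split`. The variable names are illustrative, and BLEU tooling in practice usually applies its own tokenizer that separates punctuation.

```python
# Whitespace tokenization: punctuation stays attached to the preceding word.
references = [
    "The cat is on the mat.",
    "A cat is sitting on the mat.",
]
generated = "The cat is sitting on the mat."

reference_tokens = [sentence.split() for sentence in references]
generated_tokens = generated.split()

print(reference_tokens[0])
print(reference_tokens[1])
print(generated_tokens)
```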

### Step 2: Count n-grams
For simplicity, let’s calculate BLEU with unigrams (1-grams) and bigrams (2-grams).
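
One way to build the n-gram counts used in this step is a small helper around `collections.Counter`. The helper name `ngram_counts` is an assumption made for this sketch and is not part of any library.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

generated_tokens = "The cat is sitting on the mat.".split()

unigrams = ngram_counts(generated_tokens, 1)
bigrams = ngram_counts(generated_tokens, 2)

print(sum(unigrams.values()))       # total number of unigrams
print(sum(bigrams.values()))        # total number of bigrams
print(bigrams[("is", "sitting")])   # count of one specific bigram
```

A sentence with k tokens always yields k unigrams and k - 1 bigrams, which is where the denominators of the precision formulas come from.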

#### Unigram Counts
- **Generated Unigrams:**  
  "The", "cat", "is", "sitting", "on", "the", "mat."
  
- **Reference Unigrams:**  
  From Reference 1: "The", "cat", "is", "on", "the", "mat."  
  From Reference 2: "A", "cat", "is", "sitting", "on", "the", "mat."

- **Matched Unigrams** (each generated unigram is credited at most as many times as it occurs in any single reference):
  - "The": 1 (Reference 1)
  - "cat": 1
  - "is": 1
  - "sitting": 1 (Reference 2)
  - "on": 1
  - "the": 1
  - "mat.": 1

Total matched unigrams = 7.

#### Bigram Counts
- **Generated Bigrams:**  
  ("The", "cat"), ("cat", "is"), ("is", "sitting"), ("sitting", "on"), ("on", "the"), ("the", "mat.")

- **Reference Bigrams:**  
  From Reference 1: ("The", "cat"), ("cat", "is"), ("is", "on"), ("on", "the"), ("the", "mat.")  
  From Reference 2: ("A", "cat"), ("cat", "is"), ("is", "sitting"), ("sitting", "on"), ("on", "the"), ("the", "mat.")

- **Matched Bigrams:**
  - ("The", "cat"): 1 (Reference 1)
  - ("cat", "is"): 1
  - ("is", "sitting"): 1 (Reference 2)
  - ("sitting", "on"): 1 (Reference 2)
  - ("on", "the"): 1
  - ("the", "mat."): 1

Total matched bigrams = 6.

### Step 3: Calculate Precision
- Total generated unigrams = 7  
- Total matched unigrams = 7  
  \$
  \text{Unigram Precision} = \frac{\text{Matched Unigrams}}{\text{Total Generated Unigrams}} = \frac{7}{7} = 1.0
  \$

- Total generated bigrams = 6  
- Total matched bigrams = 6  
  \$
  \text{Bigram Precision} = \frac{\text{Matched Bigrams}}{\text{Total Generated Bigrams}} = \frac{6}{6} = 1.0
  \$

### Step 4: Calculate BLEU Score
The BLEU score is the geometric mean of the n-gram precisions multiplied by a brevity penalty (BP), which lowers the score when the generated sentence is shorter than the reference.

\$
\text{BP} = \begin{cases} 1 & \text{if } c > r \\ \exp\left(1 - \frac{r}{c}\right) & \text{if } c \le r \end{cases}
\$

Here \$ c \$ is the length of the generated sentence and \$ r \$ is the length of the reference closest to it in length. In this example \$ c = 7 \$ and \$ r = 7 \$ (Reference 2), so BP = 1 and there is no penalty.

\$
\text{BLEU} = \text{BP} \cdot \exp\left(\frac{1}{N} \sum_{n=1}^{N} \log P_n\right)
\$

Where \$ P_n \$ is the precision for n-grams.

For unigrams (N=1):
\$
\text{BLEU}_{1} = \exp(\log(1.0)) = 1.0
\$

For bigrams (N=2):
\$
\text{BLEU}_{2} = \exp\left(\frac{1}{2}(\log(1.0) + \log(1.0))\right) = \sqrt{1.0 \times 1.0} = 1.0
\$

### Final BLEU Score
For the combined score:
\$
\text{BLEU} = 1.0
\$

### Conclusion
In this example, the generated sentence has a BLEU score of 1.0 even though it is identical to neither reference: every one of its unigrams and bigrams occurs in at least one reference, and its length equals that of the closest reference. With several references, BLEU gives full credit to any combination of reference n-grams. In practice, BLEU is typically calculated with more reference sentences and for n-grams up to 4-grams to get a comprehensive score.
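
The full calculation for this example can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation. It assumes whitespace tokenization, case-sensitive matching, the standard multi-reference clipping rule (each generated n-gram is credited up to its maximum count in any single reference), and a brevity penalty based on the reference closest in length. All function names are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, references, n):
    """Clipped n-gram precision of candidate against one or more references."""
    cand = ngram_counts(candidate, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in references:
        max_ref |= ngram_counts(ref, n)  # Counter union keeps the larger count
    # Counter intersection clips each candidate count to the reference maximum
    return sum((cand & max_ref).values()) / total

def brevity_penalty(candidate, references):
    c = len(candidate)
    if c == 0:
        return 0.0
    # Reference length closest to c; ties go to the shorter reference
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, references, max_n=2):
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # geometric mean is zero when any precision is zero
    geometric_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty(candidate, references) * geometric_mean

references = [s.split() for s in ("The cat is on the mat.",
                                  "A cat is sitting on the mat.")]
candidate = "The cat is sitting on the mat.".split()

for n in (1, 2):
    print(f"P_{n} =", clipped_precision(candidate, references, n))
print("BLEU-2 =", bleu(candidate, references, max_n=2))
```

Passing a single reference, for example `bleu(candidate, references[:1])`, shows how much of the score depends on having the second reference available.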

### ROUGE

Calculating the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score involves comparing a generated text (like a summary) with one or more reference texts. The most commonly used ROUGE metrics are ROUGE-N (for n-grams), ROUGE-L (for longest common subsequence), and ROUGE-W (weighted longest common subsequence).

### Example

**Reference Summaries:**
1. The cat sat on the mat.
2. A cat is resting on the mat.

**Generated Summary:**
The cat is on the mat.

### Step 1: Tokenization
First, tokenize the sentences into words.

**Reference 1:**  
- Tokens: ["The", "cat", "sat", "on", "the", "mat."]

**Reference 2:**  
- Tokens: ["A", "cat", "is", "resting", "on", "the", "mat."]

**Generated:**  
- Tokens: ["The", "cat", "is", "on", "the", "mat."]
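
Tokenization determines the counts that every ROUGE variant is built on. The sketch below contrasts whitespace splitting with a simple regular expression that separates punctuation. The regex is an assumption chosen for illustration and is not the tokenizer of any particular ROUGE package.

```python
import re

generated = "The cat is on the mat."

# Whitespace splitting keeps the final period attached to "mat."
whitespace_tokens = generated.split()

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
punct_tokens = re.findall(r"\w+|[^\w\s]", generated)

print(whitespace_tokens)  # ['The', 'cat', 'is', 'on', 'the', 'mat.']
print(punct_tokens)       # ['The', 'cat', 'is', 'on', 'the', 'mat', '.']
```

With the second tokenizer the sentence has seven tokens instead of six, so the denominators of precision and recall change accordingly.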

### Step 2: Calculate ROUGE-N

#### ROUGE-1 (Unigrams)

1. **Generate Unigrams:**
   - **Reference 1 Unigrams:** {"The", "cat", "sat", "on", "the", "mat."}
   - **Reference 2 Unigrams:** {"A", "cat", "is", "resting", "on", "the", "mat."}
   - **Generated Unigrams:** {"The", "cat", "is", "on", "the", "mat."}

2. **Count Matched Unigrams:**
   - With several references, ROUGE is computed against each reference separately, and the best (or the average) score is reported. The steps below use Reference 1.
   - The generated unigrams that also occur in Reference 1 are: "The", "cat", "on", "the", "mat." ("is" does not occur in Reference 1).
   - Matched Unigrams = 5

3. **Calculate Precision, Recall, and F1-Score:**
   - Total generated unigrams = 6
   - Total reference unigrams = 6 (Reference 1)

   **Precision:**
   \$
   P = \frac{\text{Matched Unigrams}}{\text{Total Generated Unigrams}} = \frac{5}{6} \approx 0.833
   \$

   **Recall:**
   \$
   R = \frac{\text{Matched Unigrams}}{\text{Total Reference Unigrams}} = \frac{5}{6} \approx 0.833
   \$

   **F1-Score:**
   \$
   F1 = 2 \times \frac{P \times R}{P + R} = 2 \times \frac{0.833 \times 0.833}{0.833 + 0.833} \approx 0.833
   \$

   Against Reference 2 (7 unigrams; the matches are "cat", "is", "on", "the", "mat."), the same steps give P = 5/6 ≈ 0.833, R = 5/7 ≈ 0.714, and F1 ≈ 0.769.

#### ROUGE-2 (Bigrams)

1. **Generate Bigrams:**
   - **Reference 1 Bigrams:** {"The cat", "cat sat", "sat on", "on the", "the mat."}
   - **Reference 2 Bigrams:** {"A cat", "cat is", "is resting", "resting on", "on the", "the mat."}
   - **Generated Bigrams:** {"The cat", "cat is", "is on", "on the", "the mat."}

2. **Count Matched Bigrams:**
   - Matched Bigrams against Reference 1 = {"The cat", "on the", "the mat."} (3 matches)

3. **Calculate Precision, Recall, and F1-Score:**
   - Total generated bigrams = 5
   - Total reference bigrams = 5 (Reference 1)

   **Precision:**
   \$
   P = \frac{3}{5} = 0.6
   \$

   **Recall:**
   \$
   R = \frac{3}{5} = 0.6
   \$

   **F1-Score:**
   \$
   F1 = 2 \times \frac{0.6 \times 0.6}{0.6 + 0.6} = 0.6
   \$

   Against Reference 2 (6 bigrams; the matches are "cat is", "on the", "the mat."), the same steps give P = 3/5 = 0.6, R = 3/6 = 0.5, and F1 ≈ 0.545.

### Step 3: Calculate ROUGE-L

ROUGE-L evaluates the longest common subsequence (LCS): the longest sequence of tokens that appears in both texts in the same order, not necessarily contiguously.

1. **Find LCS:**
   - The LCS between "The cat is on the mat." and Reference 1 is "The cat on the mat."

   Length of LCS = 5

2. **Calculate Precision and Recall:**
   - Length of generated summary = 6
   - Length of reference summary = 6 (Reference 1)

   **Precision:**
   \$
   P = \frac{LCS}{\text{Length of Generated}} = \frac{5}{6} \approx 0.833
   \$

   **Recall:**
   \$
   R = \frac{LCS}{\text{Length of Reference}} = \frac{5}{6} \approx 0.833
   \$

   **F1-Score:**
   \$
   F1 = 2 \times \frac{0.833 \times 0.833}{0.833 + 0.833} \approx 0.833
   \$

   Against Reference 2 the LCS is "cat is on the mat." (length 5, reference length 7), which gives P = 5/6 ≈ 0.833, R = 5/7 ≈ 0.714, and F1 ≈ 0.769.

### Summary of Scores
Scores against Reference 1, which is the better-matching reference for ROUGE-1 and ROUGE-L and the better one by F1 for ROUGE-2:

- **ROUGE-1:**
  - Precision: 0.833
  - Recall: 0.833
  - F1: 0.833

- **ROUGE-2:**
  - Precision: 0.6
  - Recall: 0.6
  - F1: 0.6

- **ROUGE-L:**
  - Precision: 0.833
  - Recall: 0.833
  - F1: 0.833
### Conclusion
This example illustrates how to calculate the ROUGE score, which provides insights into the quality of generated text by comparing it to reference summaries. In practice, multiple references and longer texts can be evaluated for a more robust score.
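
When several references are available, one common convention is to score the generated text against each reference and keep the best result; averaging over references is another. The sketch below follows the first convention under the assumptions of whitespace tokenization and case-sensitive matching. The helper names (`prf`, `best_over_references`, and so on) are hypothetical and chosen for this illustration.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def prf(matched, generated_total, reference_total):
    """Return (precision, recall, F1) from raw counts."""
    p = matched / generated_total if generated_total else 0.0
    r = matched / reference_total if reference_total else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def rouge_n(reference, generated, n):
    ref, gen = ngram_counts(reference, n), ngram_counts(generated, n)
    matched = sum((ref & gen).values())  # overlap with clipped counts
    return prf(matched, sum(gen.values()), sum(ref.values()))

def lcs_length(x, y):
    """Length of the longest common subsequence, one DP row at a time."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, generated):
    return prf(lcs_length(reference, generated), len(generated), len(reference))

def best_over_references(metric, references, generated, *args):
    """Score against every reference and keep the triple with the highest F1."""
    return max((metric(ref, generated, *args) for ref in references),
               key=lambda scores: scores[2])

references = [s.split() for s in ("The cat sat on the mat.",
                                  "A cat is resting on the mat.")]
generated = "The cat is on the mat.".split()

print("ROUGE-1:", best_over_references(rouge_n, references, generated, 1))
print("ROUGE-2:", best_over_references(rouge_n, references, generated, 2))
print("ROUGE-L:", best_over_references(rouge_l, references, generated))
```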

In [7]:
import numpy as np
from nltk import ngrams
from collections import Counter
from datasets import load_dataset
from transformers import AutoTokenizer

In [8]:
# Note: after changing the tokenizer, re-run the cells that define the ROUGE functions so they pick up the new one
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-23-8B")

In [9]:
def brevity_penalty(candidate, reference):
    """
    Calculates the brevity penalty given the candidate and reference token lists.
    """
    reference_length = len(reference)
    candidate_length = len(candidate)

    if candidate_length == 0:
        return 0.0  # an empty candidate gets no credit (and avoids dividing by zero)
    if candidate_length > reference_length:
        return 1.0

    # Candidate is not longer than the reference: BP = exp(1 - r / c)
    return float(np.exp(1 - reference_length / candidate_length))


def average_clipped_precision(candidate, reference, n: int):
    """
    Calculates the geometric mean of the clipped n-gram precisions for
    n-gram lengths 1..n, given the candidate and reference token lists.
    """
    clipped_precision_score = []

    # Loop through n-gram lengths 1, 2, ..., n (inclusive)
    for n_gram_length in range(1, n + 1):
        reference_n_gram_counts = Counter(ngrams(reference, n_gram_length))
        candidate_n_gram_counts = Counter(ngrams(candidate, n_gram_length))

        total_candidate_ngrams = sum(candidate_n_gram_counts.values())
        if total_candidate_ngrams == 0:
            return 0.0  # the candidate is shorter than the n-gram length

        # Clip each candidate n-gram count to its count in the reference;
        # n-grams that do not occur in the reference drop out (count 0)
        clipped_counts = candidate_n_gram_counts & reference_n_gram_counts
        clipped_candidate_ngrams = sum(clipped_counts.values())

        clipped_precision_score.append(clipped_candidate_ngrams / total_candidate_ngrams)

    # A zero precision makes the geometric mean zero (and log(0) is undefined)
    if min(clipped_precision_score) == 0:
        return 0.0

    # Calculate the geometric average: take the mean of the elementwise log, then exponentiate
    # This is equivalent to taking the n-th root of the product of the precisions
    return float(np.exp(np.mean(np.log(clipped_precision_score))))


def bleu_score(candidate: str, reference: str, n: int):
    assert n >= 1, "n must be >= 1"
    # Split on whitespace so that n-grams are built from words, not from characters
    candidate_tokens = candidate.split()
    reference_tokens = reference.split()
    BP = brevity_penalty(candidate_tokens, reference_tokens)
    geometric_average_precision = average_clipped_precision(candidate_tokens, reference_tokens, n)
    return BP * geometric_average_precision * 100

In [10]:
round(bleu_score("The cat is on the mat.", "The cat is sitting on the mat.", 2), 2)

75.71

In [11]:
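def calculate_rouge(reference, generated, n=1):
    """Compute ROUGE-N precision, recall, and F1 for one reference/generated pair."""

    # Tokenize the input strings with the loaded tokenizer
    # (swap in .split() for plain whitespace tokens)
    reference_tokens = tokenizer.tokenize(reference) #reference.split()
    generated_tokens = tokenizer.tokenize(generated) #generated.split()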
    
    # Generate n-grams
    reference_ngrams = list(ngrams(reference_tokens, n))
    generated_ngrams = list(ngrams(generated_tokens, n))
    
    # Count n-grams
    reference_count = Counter(reference_ngrams)
    generated_count = Counter(generated_ngrams)

    # Calculate matched n-grams
    matched_ngrams = reference_count & generated_count
    
    # Precision
    precision = (sum(matched_ngrams.values()) / len(generated_ngrams)) if generated_ngrams else 0.0
    
    # Recall
    recall = (sum(matched_ngrams.values()) / len(reference_ngrams)) if reference_ngrams else 0.0
    
    # F1 Score
    if precision + recall > 0:
        f1_score = 2 * (precision * recall) / (precision + recall)
    else:
        f1_score = 0.0
    
    return {
        'precision': precision,
        'recall': recall,
        'f1_score': f1_score
    }

# Example usage
reference_summary = "The cat sat on the mat."
generated_summary = "The cat is on the mat."

# Calculate ROUGE-1
rouge_1 = calculate_rouge(reference_summary, generated_summary, n=1)
print("ROUGE-1:", rouge_1)

# Calculate ROUGE-2
rouge_2 = calculate_rouge(reference_summary, generated_summary, n=2)
print("ROUGE-2:", rouge_2)


def lcs_length(x, y):
    """Calculate the length of the longest common subsequence (LCS)"""
    m, n = len(x), len(y)
    # Create a 2D array to store lengths of longest common subsequence.
    lcs_table = [[0] * (n + 1) for _ in range(m + 1)]

    # Fill the lcs_table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                lcs_table[i][j] = lcs_table[i - 1][j - 1] + 1
            else:
                lcs_table[i][j] = max(lcs_table[i - 1][j], lcs_table[i][j - 1])

    return lcs_table[m][n]

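def calculate_rouge_l(reference, generated):
    """Compute ROUGE-L precision, recall, and F1 for one reference/generated pair."""
    # Tokenize the input strings with the loaded tokenizer
    # (swap in .split() for plain whitespace tokens)
    reference_tokens = tokenizer.tokenize(reference) #reference.split()
    generated_tokens = tokenizer.tokenize(generated) #generated.split()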

    # Calculate the length of the longest common subsequence
    lcs_len = lcs_length(reference_tokens, generated_tokens)

    # Precision
    precision = lcs_len / len(generated_tokens) if generated_tokens else 0.0

    # Recall
    recall = lcs_len / len(reference_tokens) if reference_tokens else 0.0

    # F1 Score
    if precision + recall > 0:
        f1_score = 2 * (precision * recall) / (precision + recall)
    else:
        f1_score = 0.0

    return {
        'precision': precision,
        'recall': recall,
        'f1_score': f1_score
    }

# Example usage
reference_summary = "The cat sat on the mat."
generated_summary = "The cat is on the mat."

# Calculate ROUGE-L
rouge_l = calculate_rouge_l(reference_summary, generated_summary)
print("ROUGE-L:", rouge_l)

ROUGE-1: {'precision': 0.8571428571428571, 'recall': 0.8571428571428571, 'f1_score': 0.8571428571428571}
ROUGE-2: {'precision': 0.6666666666666666, 'recall': 0.6666666666666666, 'f1_score': 0.6666666666666666}
ROUGE-L: {'precision': 0.8571428571428571, 'recall': 0.8571428571428571, 'f1_score': 0.8571428571428571}
