## Testing BART Model Performance After Fine-Tuning

In this notebook, we evaluate the BART model's ability to generate summaries after undergoing fine-tuning on custom user data. The analysis is based on popular text quality metrics, specifically **ROUGE** and **BLEU**.

### Experiment Objective:

1. Compare the quality of summaries generated by:
   - The pre-trained BART model.
   - The BART model fine-tuned on the user's dataset.
2. Use the same test examples for evaluating both models.
3. Calculate and compare ROUGE and BLEU scores to better understand the performance differences.

The notebook begins by preparing the necessary tools and downloading resources required for computing the evaluation metrics.

---

## Preparing the Environment for Metric Computation

The following code imports the NLTK library and downloads its `punkt` tokenizer data, which provides sentence and word tokenization. ROUGE and BLEU scores depend on how the reference and generated summaries are split into tokens, so the tokenization resources are set up before any evaluation runs.

### Detailed explanation:

1. **Importing NLTK**:
   - `nltk` (Natural Language Toolkit) is a widely used library for natural language processing, including tasks such as text tokenization.

2. **Downloading NLTK resources**:
   - `nltk.download('punkt')` downloads the `punkt` resource, which enables sentence and word tokenization.
   - The resource is downloaded once and cached locally, so it is available to any tokenization performed in the later metric cells.

In [5]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kamiljaworski/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Comparing Summaries Generated by the BART Model

This code loads a previously saved BART model, generates a summary for a randomly selected article from the test set, and compares it with both the original article and the reference summary from the dataset.

### Detailed explanation:

1. **Function `load_model`**:
   - Loads a fine-tuned BART model and its tokenizer from the specified directory.
   - **Arguments**:
     - `model_name`: Name of the directory containing the saved model.
     - `base_dir`: Base path where models are stored (default: `./models`).
   - **Returns**:
     - `model`: The loaded BART model.
     - `tokenizer`: The loaded tokenizer.

2. **Function `load_test_data`**:
   - Loads test data from a JSON file.
   - **Arguments**:
     - `file_path`: Path to the test JSON file.
   - **Returns**:
     - A list of articles from the test dataset.

3. **Function `generate_summary`**:
   - Uses the BART model to generate a summary for a given text.
   - **Arguments**:
     - `model`: The fine-tuned BART model.
     - `tokenizer`: The tokenizer associated with the model.
     - `text`: The input article text to summarize.
     - `max_length`: Maximum length of the generated summary (default: 150 tokens).
     - `min_length`: Minimum length of the generated summary (default: 40 tokens).
   - **Returns**:
     - A string containing the generated summary.

4. **Environment setup**:
   - The model and tokenizer are loaded from the `model_v3` directory.
   - The model is moved to the appropriate device (`mps` when PyTorch's MPS backend is available, as on Apple Silicon Macs, or `cpu` otherwise).
   - Test data is loaded from the file located in `./datasets/splits_filtered_with_summary/test.json`.

5. **Generating and comparing summaries**:
   - A random article is selected from the test set.
   - For that article:
     - The original article text is retrieved (`original_text`).
     - The reference summary from the dataset is retrieved (`dataset_summary`).
     - A new summary is generated by the model (`generated_summary`).

6. **Displaying results**:
   - The comparison is printed in the following format:
     - **Original Text** – the full article.
     - **Dataset Summary** – the reference summary from the dataset.
     - **Generated Summary by Model** – the summary produced by the fine-tuned model.
   - Texts are formatted using `textwrap.fill` to limit line width to 80 characters for better readability.

### Result:
- This script enables direct comparison between the model-generated summary and the human-written reference summary, allowing for a subjective evaluation of the model's performance.

### Notes:
- The model and tokenizer must be saved in the directory `./models/{model_name}` prior to running this script.
- The test data must include `text` and `summary` fields for each article.
- The `generate_summary` function uses beam search (`num_beams=4`) to improve the quality of generated text, at the cost of higher computational load.
- To inspect more articles, call `generate_summary` on additional entries of `test_data`; summary length is controlled through the `max_length` and `min_length` arguments.
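
The requirement that every record carries `text` and `summary` fields can be checked before any model is loaded, and the random selection can be made repeatable. The following is a minimal sketch using only the standard library; the helper names `validate_records` and `pick_article` and the sample records are hypothetical and do not appear in the notebook.

```python
import random

REQUIRED_FIELDS = ("text", "summary")  # field names taken from the notes above


def validate_records(records):
    """Return the indices of records lacking a non-empty string in a required field."""
    bad_indices = []
    for index, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                bad_indices.append(index)
                break
    return bad_indices


def pick_article(records, seed=0):
    """Pick one record reproducibly: the same seed always selects the same record."""
    return random.Random(seed).choice(records)


# Made-up records, used only to exercise the helpers
sample = [
    {"text": "First article body.", "summary": "First summary."},
    {"text": "Second article body.", "summary": ""},
    {"text": "Third article body."},
]

bad = validate_records(sample)
print("Incomplete records:", bad)

valid = [record for index, record in enumerate(sample) if index not in bad]
print("Selected summary:", pick_article(valid, seed=0)["summary"])
```

Using a dedicated `random.Random(seed)` instance keeps the selection repeatable without changing the global random state that other cells may rely on.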

In [7]:
import os
import json
import random
import torch
from transformers import BartTokenizer, BartForConditionalGeneration
import textwrap

# Function to load the model and tokenizer
def load_model(model_name, base_dir="./models"):
    """
    Load a trained model and its tokenizer from a specified directory.

    Args:
        model_name (str): Name of the directory containing the model.
        base_dir (str): Base directory where models are stored.

    Returns:
        model: The BART model.
        tokenizer: The BART tokenizer.
    """
    model_path = os.path.join(base_dir, model_name)
    model = BartForConditionalGeneration.from_pretrained(model_path)
    tokenizer = BartTokenizer.from_pretrained(model_path)
    return model, tokenizer

# Function to load test data from a JSON file
def load_test_data(file_path):
    """
    Load test dataset from a JSON file.

    Args:
        file_path (str): Path to the JSON file.

    Returns:
        list: A list of articles from the dataset.
    """
    with open(file_path, "r") as file:
        return json.load(file)

# Function to generate a summary for a given text
def generate_summary(model, tokenizer, text, max_length=150, min_length=40):
    """
    Generate a summary for the input text using the BART model.

    Args:
        model: Trained BART model.
        tokenizer: Tokenizer associated with the model.
        text (str): Text to summarize.
        max_length (int): Maximum length of the generated summary.
        min_length (int): Minimum length of the generated summary.

    Returns:
        str: Generated summary.
    """
    inputs = tokenizer(
        text,
        max_length=512,
        truncation=True,
        return_tensors="pt"
    ).to(model.device)
    
    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=max_length,
        min_length=min_length,
        length_penalty=2.0,
        num_beams=4,
        early_stopping=True
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Load the model and tokenizer
model_name = "model_v3"  # Replace with your model name
model, tokenizer = load_model(model_name)
device = "mps" if torch.backends.mps.is_available() else "cpu"  # Adjust for your hardware
model.to(device)

# Load test dataset
test_file_path = "./datasets/splits_filtered_with_summary/test.json"
test_data = load_test_data(test_file_path)

# Select a random article from the test dataset
random_article = random.choice(test_data)

# Retrieve the original text, dataset summary, and generate a new summary
original_text = random_article["text"]
dataset_summary = random_article["summary"]
generated_summary = generate_summary(model, tokenizer, original_text)

# Display side-by-side comparison
print("=== Comparison ===\n")

print("Original Text:")
print(textwrap.fill(original_text, width=80))

print("\nDataset Summary:")
print(textwrap.fill(dataset_summary, width=80))

print("\nGenerated Summary by Model:")
print(textwrap.fill(generated_summary, width=80))

=== Comparison ===

Original Text:
And we turn now to Lebanon. Lebanon takes stock of the damage done by the war
between Israel and Hezbollah. It finds it's always complicated political
situation has shaken up even more than usual. After a year of pushing for
democratic reforms, Prime Minister Fouad Siniora is on the defensive as
opposition leaders call for his resignation. Hezbollah, meanwhile, is riding a
wave of popularity for its battle with Israel, and the movement is showing no
inclination to turn in its weapons. Analysts say the need to strengthen the
Lebanese state is more urgent than ever but no easier. NPR's Peter Kenyon
reports from Beirut. Since the war started in July, Prime Minister Siniora has
seen his popularity rise as Lebanese rallied around their government under the
strain of the Israeli military bombardment. Siniora is part of the so-called
March 14th Alliance, named for the date of the massive Beirut rally that united
the opposition in the wake of the assassinatio

## BART Model Evaluation: Pre-trained vs Fine-tuned — **ROUGE**

This section compares the pre-trained BART model with a version fine-tuned on custom user data. The comparison uses **ROUGE** metrics: the aggregate scores are shown in a table, followed by the generated summaries for each evaluated example.

---

### Detailed explanation:

1. **Function `load_model`**:
   - Loads the BART model and tokenizer.
   - If `model_name` is specified, it loads a fine-tuned model from `./models/{model_name}`.
   - Otherwise, it loads the pre-trained model (`facebook/bart-base`).

2. **Function `load_test_data`**:
   - Loads the test dataset from a JSON file.

3. **Function `generate_summary`**:
   - Generates a summary for the given input text using the BART model.
   - Generation settings:
     - `max_length=150`: Maximum length of the generated summary.
     - `min_length=40`: Minimum length of the generated summary.
     - `num_beams=2`: Number of beams used in beam search (lowered for speed during testing).

4. **Function `evaluate_rouge`**:
   - Compares model-generated summaries against reference summaries using ROUGE metrics.
   - **Returns**:
     - ROUGE scores computed over the dataset.
     - A DataFrame containing individual results for each article: original text, reference summary, and generated summary.

5. **Evaluation process**:
   - Both the pre-trained and fine-tuned models are evaluated on the same subset of test data (first 10 examples).
   - ROUGE results for each model are collected and displayed in a comparison table.

6. **Comparison of results**:
   - ROUGE scores for both models are printed side-by-side, including:
     - **ROUGE-1**: Unigram overlap.
     - **ROUGE-2**: Bigram overlap.
     - **ROUGE-L** and **ROUGE-Lsum**: Longest common subsequence coverage.
   - Evaluation time is measured for each model. The timer starts before the model is loaded, so the reported figure covers loading as well as generation and scoring.

7. **Presentation of example outputs**:
   - For each article in the evaluated subset, the following are printed:
     - The original article text.
     - The reference summary from the dataset.
     - The summary generated by the pre-trained model.
     - The summary generated by the fine-tuned model.
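
To make the three overlap measures concrete, here is a simplified pure-Python sketch of ROUGE-1, ROUGE-2 and ROUGE-L F1 scores. It assumes lowercased whitespace tokenization and no stemming, and the function names are hypothetical. The `rouge` metric loaded through `evaluate` applies its own tokenization and aggregation, so its numbers will generally differ from this illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference, prediction, n=1):
    """F1 of the clipped n-gram overlap between reference and prediction."""
    ref = ngram_counts(reference.lower().split(), n)
    pred = ngram_counts(prediction.lower().split(), n)
    overlap = sum((ref & pred).values())  # Counter & keeps the minimum counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(reference, prediction):
    """F1 based on the longest common subsequence of tokens."""
    ref, pred = reference.lower().split(), prediction.lower().split()
    lcs = lcs_length(ref, pred)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
print(f"ROUGE-1: {rouge_n_f1(reference, prediction, 1):.4f}")
print(f"ROUGE-2: {rouge_n_f1(reference, prediction, 2):.4f}")
print(f"ROUGE-L: {rouge_l_f1(reference, prediction):.4f}")
```

In this toy pair a single substituted word removes one unigram match but two bigram matches, which is why ROUGE-2 is usually the lowest of the three scores.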

---

### Output:

- A table comparing ROUGE scores between the pre-trained and fine-tuned BART models.
- Detailed textual comparisons for individual examples to qualitatively assess summarization performance.

---

### Notes:

- **Model effectiveness**:
  - The fine-tuned model is expected to achieve better ROUGE scores, as it has been adapted to the domain-specific data.
- **Evaluation time**:
  - Assuming both models share the same architecture, differences in evaluation time mainly reflect how many tokens each model generates and, in this cell, how long each model takes to load.
- **Requirements**:
  - The `evaluate` library is required to compute ROUGE metrics (`pip install evaluate`).
  - The test dataset must contain `text` and `summary` fields for each article.

This evaluation setup offers both quantitative and qualitative insights into how fine-tuning impacts the performance of the BART model on summarization tasks.

In [8]:
import os
import json
from tqdm.auto import tqdm
from transformers import BartTokenizer, BartForConditionalGeneration
from evaluate import load
import time
import pandas as pd
import torch

# Load the ROUGE metric
rouge = load("rouge")

# Function to load the model and tokenizer
def load_model(model_name=None, base_dir="./models", pretrained_model="facebook/bart-base"):
    """
    Load a model and tokenizer. If model_name is None, load the pre-trained BART.
    """
    if model_name:
        model_path = os.path.join(base_dir, model_name)
        model = BartForConditionalGeneration.from_pretrained(model_path)
        tokenizer = BartTokenizer.from_pretrained(model_path)
    else:
        model = BartForConditionalGeneration.from_pretrained(pretrained_model)
        tokenizer = BartTokenizer.from_pretrained(pretrained_model)
    return model, tokenizer

# Function to load the test dataset
def load_test_data(file_path):
    """
    Load test data from a JSON file.
    """
    with open(file_path, "r") as file:
        return json.load(file)

# Function to generate a summary
def generate_summary(model, tokenizer, text, max_length=150, min_length=40):
    """
    Generate a summary for the given text using the model.
    """
    inputs = tokenizer(
        text,
        max_length=512,
        truncation=True,
        return_tensors="pt"
    ).to(model.device)
    
    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=max_length,
        min_length=min_length,
        length_penalty=2.0,
        num_beams=2,  # Reduce beams for faster generation
        early_stopping=True
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Function to evaluate the model using ROUGE
def evaluate_rouge(model, tokenizer, test_data, max_length=150, min_length=40):
    """
    Evaluate the model using ROUGE metrics and return individual results.
    """
    references = []
    predictions = []
    results = []
    
    for article in tqdm(test_data, desc="Evaluating"):
        original_text = article["text"]
        reference_summary = article["summary"]
        
        # Generate the summary
        generated_summary = generate_summary(model, tokenizer, original_text, max_length, min_length)
        
        # Append the generated and reference summaries
        predictions.append(generated_summary)
        references.append(reference_summary)
        
        # Save individual results
        results.append({
            "Original Text": original_text,
            "Reference Summary": reference_summary,
            "Generated Summary": generated_summary
        })
    
    # Compute ROUGE scores
    rouge_scores = rouge.compute(predictions=predictions, references=references)
    return rouge_scores, pd.DataFrame(results)

# Load test data
test_file_path = "./datasets/splits_filtered_with_summary/test.json"  # Update path if necessary
test_data = load_test_data(test_file_path)

# Evaluate the pre-trained BART model
print("\nEvaluating Pre-trained BART...")
start_time = time.time()  # Timer starts before load_model(), so loading time is included
bart_model, bart_tokenizer = load_model()
bart_model.to("mps" if torch.backends.mps.is_available() else "cpu")
bart_rouge_scores, bart_results_df = evaluate_rouge(bart_model, bart_tokenizer, test_data[:10])  # Use 10 examples for quick evaluation
bart_end_time = time.time()

# Evaluate your fine-tuned model
print("\nEvaluating Fine-tuned BART...")
start_time_ft = time.time()  # Timer starts before load_model(), so loading time is included
fine_tuned_model, fine_tuned_tokenizer = load_model(model_name="model_v3")
fine_tuned_model.to("mps" if torch.backends.mps.is_available() else "cpu")
ft_rouge_scores, ft_results_df = evaluate_rouge(fine_tuned_model, fine_tuned_tokenizer, test_data[:10])
ft_end_time = time.time()

# Display ROUGE scores comparison
print("\n=== ROUGE Scores Comparison ===")
rouge_comparison_df = pd.DataFrame({
    "Metric": ["ROUGE-1", "ROUGE-2", "ROUGE-L", "ROUGE-Lsum"],
    "Pre-trained BART": [
        bart_rouge_scores["rouge1"],
        bart_rouge_scores["rouge2"],
        bart_rouge_scores["rougeL"],
        bart_rouge_scores["rougeLsum"]
    ],
    "Fine-tuned BART": [
        ft_rouge_scores["rouge1"],
        ft_rouge_scores["rouge2"],
        ft_rouge_scores["rougeL"],
        ft_rouge_scores["rougeLsum"]
    ]
})
print(rouge_comparison_df.to_string(index=False, float_format="{:.4f}".format))

print(f"\nPre-trained BART Evaluation Time: {bart_end_time - start_time:.2f} seconds")
print(f"Fine-tuned BART Evaluation Time: {ft_end_time - start_time_ft:.2f} seconds")

# Display individual examples
print("\n=== Sample Results Comparison ===")
for idx in range(len(bart_results_df)):
    print(f"\nExample {idx + 1}:")
    print("\nOriginal Text:")
    print(bart_results_df.iloc[idx]["Original Text"])
    print("\nReference Summary:")
    print(bart_results_df.iloc[idx]["Reference Summary"])
    print("\nGenerated Summary (Pre-trained BART):")
    print(bart_results_df.iloc[idx]["Generated Summary"])
    print("\nGenerated Summary (Fine-tuned BART):")
    print(ft_results_df.iloc[idx]["Generated Summary"])
    print("\n" + "-" * 100)


Evaluating Pre-trained BART...


Evaluating: 100%|██████████| 10/10 [00:18<00:00,  1.86s/it]



Evaluating Fine-tuned BART...


Evaluating: 100%|██████████| 10/10 [00:06<00:00,  1.44it/s]


=== ROUGE Scores Comparison ===
    Metric  Pre-trained BART  Fine-tuned BART
   ROUGE-1            0.3202           0.4465
   ROUGE-2            0.1250           0.2370
   ROUGE-L            0.2130           0.3610
ROUGE-Lsum            0.2138           0.3613

Pre-trained BART Evaluation Time: 21.51 seconds
Fine-tuned BART Evaluation Time: 7.99 seconds

=== Sample Results Comparison ===

Example 1:

Original Text:
During the presidential campaign, Donald Trump suggested that he might favor creating a database for Muslims who enter the United States. At other times, he has called for extreme vetting for people from terror-prone countries. The U.S. government actually once had a system that could serve as a model for this. After 9/11, the Bush administration established a registry called NSEERS. That stands for the National Security Entry-Exit Registration System. We're going to talk now with someone who has studied NSEERS. Muzaffar Chishti directs the Migration Policy Institute's off




## Conclusions from ROUGE Evaluation Results

Based on the evaluation of both the pre-trained BART model and the fine-tuned BART model using ROUGE metrics, the following conclusions can be drawn:

### ROUGE Score Comparison:

1. **ROUGE-1** (unigram overlap):
   - The pre-trained BART achieved a score of **0.3202**, while the fine-tuned BART reached **0.4465**.
   - This marks an improvement of approximately **39%**, indicating significantly better word-level alignment with reference summaries after fine-tuning.

2. **ROUGE-2** (bigram overlap):
   - Pre-trained BART: **0.1250**, Fine-tuned BART: **0.2370**.
   - Nearly a **90%** increase, which reflects a stronger ability of the fine-tuned model to capture local word sequences and context.

3. **ROUGE-L** (longest common subsequence):
   - Pre-trained BART: **0.2130**, Fine-tuned BART: **0.3610**.
   - This represents a **69%** improvement, suggesting that the generated summaries follow the word order of the reference summaries more closely.

4. **ROUGE-Lsum** (summary-level variant of ROUGE-L):
   - Pre-trained BART: **0.2138**, Fine-tuned BART: **0.3613**.
   - These values closely mirror the ROUGE-L scores, confirming consistent gains in summary alignment.
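
The percentages quoted above are relative gains, computed as (fine-tuned - pre-trained) / pre-trained. A short sketch that reproduces them from the scores in the comparison table; the helper name `relative_gain` is introduced here for illustration only.

```python
# Scores copied from the ROUGE comparison table: (pre-trained, fine-tuned)
scores = {
    "ROUGE-1": (0.3202, 0.4465),
    "ROUGE-2": (0.1250, 0.2370),
    "ROUGE-L": (0.2130, 0.3610),
    "ROUGE-Lsum": (0.2138, 0.3613),
}


def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100


for metric, (pretrained, fine_tuned) in scores.items():
    absolute = fine_tuned - pretrained
    print(f"{metric:<10} +{absolute:.4f} absolute, +{relative_gain(pretrained, fine_tuned):.1f}% relative")
```

Reporting the absolute difference next to the relative gain helps when the baseline is small, as with ROUGE-2, where a modest absolute change produces a large percentage.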

---

### Evaluation Time:

1. **Pre-trained BART**:
   - Evaluation time: **21.51 seconds**.

2. **Fine-tuned BART**:
   - Evaluation time: **7.99 seconds**, which is **2.7x faster**.

> ⚠️ Note: The shorter time for the fine-tuned model does not necessarily indicate greater computational efficiency. It may reflect shorter generated outputs, and the timed interval for each model also includes loading the model.
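
One way to make such timings easier to interpret is to measure loading and generation separately. The sketch below shows the idea with a small context manager built on `time.perf_counter`; `stopwatch`, `fake_load` and `fake_generate` are hypothetical stand-ins, not functions from this notebook, and the sleeps only simulate work.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(timings, label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def fake_load():
    """Stand-in for loading a model (simulated with a short sleep)."""
    time.sleep(0.02)
    return "model"


def fake_generate(model, texts):
    """Stand-in for summary generation (simulated with a short sleep)."""
    time.sleep(0.01)
    return [f"summary of {text}" for text in texts]


timings = {}
with stopwatch(timings, "load"):
    model = fake_load()
with stopwatch(timings, "generate"):
    summaries = fake_generate(model, ["a", "b"])

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

With the real models, the same pattern would wrap the `load_model` call and the `evaluate_rouge` call in separate blocks, so that a one-off download or disk read does not distort the generation time.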

---

### Key Takeaways:

1. **Improved Summary Generation Quality**:
   - The fine-tuned BART model outperforms the pre-trained model on all four ROUGE metrics for the 10 evaluated examples, with the largest relative gains in ROUGE-2 and ROUGE-L.

2. **Better Adaptation to Custom Data**:
   - Fine-tuning enables the model to learn patterns specific to the user’s dataset, resulting in more relevant and coherent summaries.

3. **Shorter Evaluation Time**:
   - The fine-tuned model completed the evaluation faster in this run. Because the measurement covers only 10 examples and includes model loading, it is an indication rather than a benchmark of inference speed.

4. **Practical Implications**:
   - Fine-tuned BART is more suitable for tasks requiring high-quality, domain-specific summaries, especially when the input data differs significantly from the general corpus used for pre-training.

---

### Summary:

On this 10-example subset, the fine-tuned BART model outperforms the pre-trained version on every ROUGE metric and completes the evaluation in less time. The results indicate that fine-tuning is an effective way to tailor the model to the user's data; a larger test sample would give a more reliable estimate of the size of the gain.

## Evaluation of BART Models: Pre-trained vs Fine-tuned - **BLEU**

This cell compares a pre-trained BART model with a BART model fine-tuned on user data, using the BLEU metric (Bilingual Evaluation Understudy). BLEU scores a generated summary by its n-gram precision against the reference summary, with a penalty for outputs that are shorter than the reference.

---

### Detailed Description:

1. **Function `load_model`**:
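   - Loads the BART model and tokenizer from the specified directory. In this cell it is used only for the fine-tuned model; the pre-trained `facebook/bart-base` model is loaded directly with `from_pretrained`.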

2. **Function `load_test_data`**:
   - Loads test data from a JSON file.

3. **Function `generate_summary`**:
   - Generates a summary for a given input text using the BART model.
   - Generation parameters:
     - `max_length`: Maximum length of the summary (default is 150 tokens).
     - `min_length`: Minimum length of the summary (default is 40 tokens).
     - `num_beams=2`: Number of beams used during beam search generation.

4. **Function `evaluate_bleu`**:
   - Computes the BLEU score for the generated summaries.
   - Steps:
     - For each article in the test set:
       - Generates a summary using the model.
       - Splits both summaries on whitespace and compares them using NLTK's `sentence_bleu` function with `SmoothingFunction().method1` smoothing.
     - Calculates the average BLEU score across all examples.

5. **Evaluation Process**:
   - Both models (pre-trained and fine-tuned) are evaluated on the same subset of test data (first 10 examples).
   - BLEU scores and evaluation times for each model are printed in the console.
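
The two ingredients of BLEU, clipped n-gram precision and the brevity penalty, can be sketched in a few lines of pure Python. This is a simplified, unsmoothed single-reference illustration with hypothetical function names; the notebook itself relies on NLTK's `sentence_bleu` with `method1` smoothing, so its scores will not match this sketch exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    candidate_counts = Counter(ngrams(candidate, n))
    reference_counts = Counter(ngrams(reference, n))
    total = sum(candidate_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, reference_counts[gram])
                  for gram, count in candidate_counts.items())
    return clipped / total


def brevity_penalty(reference_length, candidate_length):
    """Penalize candidates that are not longer than the reference."""
    if candidate_length == 0:
        return 0.0
    if candidate_length > reference_length:
        return 1.0
    return math.exp(1 - reference_length / candidate_length)


def simple_bleu(reference, candidate, max_n=4):
    """Unsmoothed BLEU: geometric mean of 1..max_n precisions times the penalty."""
    precisions = [modified_precision(reference, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(reference), len(candidate)) * math.exp(log_mean)


reference = "the cat sat on the mat".split()
candidate = "the cat sat on the rug".split()
print(f"Unigram precision: {modified_precision(reference, candidate, 1):.4f}")
print(f"BLEU (no smoothing): {simple_bleu(reference, candidate):.4f}")
```

Replacing a single word with a near-synonym ("mat" with "rug") lowers every n-gram precision, which illustrates why BLEU can understate the quality of a summary that paraphrases its reference. The zero-score case in `simple_bleu` is the reason short texts are usually scored with a smoothing function.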

---

### Output:
- The BLEU scores for the pre-trained model (`Pre-trained BART BLEU`) and the fine-tuned model (`Fine-tuned BART BLEU`) are printed to four decimal places.
- The evaluation time for each model is also reported in seconds.

---

### Notes:
- **BLEU score**:
  - A higher BLEU score indicates greater similarity between generated and reference summaries.
  - BLEU does not account for synonyms, so the results may not always perfectly reflect summary quality.
- **Fine-tuned model performance**:
  - The fine-tuned model is expected to achieve higher BLEU scores because it has been adapted to user-specific data.
- **Use of a test subset**:
  - To speed up evaluation, only 10 examples from the test set are used.
- **Requirements**:
  - NLTK must be installed (`pip install nltk`).
  - Test data must include the `text` (original text) and `summary` (reference summary) fields.

This code allows for comparing the quality of generated summaries between both models and assessing the impact of fine-tuning on model performance.

In [9]:
import os
import json
from tqdm.auto import tqdm
from transformers import BartTokenizer, BartForConditionalGeneration
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import time
import torch  # Needed below for device selection (torch.backends.mps)

# Function to load the model and tokenizer
def load_model(model_name, base_dir="./models"):
    """
    Load a saved model and its tokenizer from a directory.
    """
    model_path = os.path.join(base_dir, model_name)
    model = BartForConditionalGeneration.from_pretrained(model_path)
    tokenizer = BartTokenizer.from_pretrained(model_path)
    return model, tokenizer

# Function to load the test dataset
def load_test_data(file_path):
    """
    Load test data from a JSON file.
    """
    with open(file_path, "r") as file:
        return json.load(file)

# Function to generate a summary
def generate_summary(model, tokenizer, text, max_length=150, min_length=40):
    """
    Generate a summary for the given text using the model.
    """
    inputs = tokenizer(
        text,
        max_length=512,
        truncation=True,
        return_tensors="pt"
    ).to(model.device)
    
    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=max_length,
        min_length=min_length,
        length_penalty=2.0,
        num_beams=2,  # Reduce beams for faster generation
        early_stopping=True
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Function to evaluate the model using BLEU
def evaluate_bleu(model, tokenizer, test_data, max_length=150, min_length=40):
    """
    Evaluate the model using BLEU scores.
    """
    smoothing_function = SmoothingFunction().method1
    scores = []
    
    for article in tqdm(test_data, desc="Evaluating BLEU"):
        original_text = article["text"]
        reference_summary = article["summary"]
        
        # Generate the summary
        generated_summary = generate_summary(model, tokenizer, original_text, max_length, min_length)
        
        # Compute BLEU score
        reference_tokens = [reference_summary.split()]
        candidate_tokens = generated_summary.split()
        score = sentence_bleu(reference_tokens, candidate_tokens, smoothing_function=smoothing_function)
        scores.append(score)
    
    # Calculate average BLEU score
    avg_bleu = sum(scores) / len(scores)
    return avg_bleu

# Load the model and tokenizer
model_name = "model_v3"  # Replace with your fine-tuned model name
pretrained_model_name = "facebook/bart-base"  # Pre-trained BART
device = "mps" if torch.backends.mps.is_available() else "cpu"  # Use MPS for Apple Silicon

# Load fine-tuned model
fine_tuned_model, fine_tuned_tokenizer = load_model(model_name)
fine_tuned_model.to(device)

# Load pre-trained model
pretrained_model = BartForConditionalGeneration.from_pretrained(pretrained_model_name).to(device)
pretrained_tokenizer = BartTokenizer.from_pretrained(pretrained_model_name)

# Load test data
test_file_path = "./datasets/splits_filtered_with_summary/test.json"  # Update path if necessary
test_data = load_test_data(test_file_path)
small_test_data = test_data[:10]  # Use only 10 examples for quick evaluation

# Evaluate fine-tuned model
print("Evaluating fine-tuned model...")
start_time = time.time()
fine_tuned_bleu = evaluate_bleu(fine_tuned_model, fine_tuned_tokenizer, small_test_data)
end_time = time.time()
print(f"Fine-tuned BART BLEU: {fine_tuned_bleu:.4f}")
print(f"Evaluation Time: {end_time - start_time:.2f} seconds")

# Evaluate pre-trained model
print("\nEvaluating pre-trained model...")
start_time = time.time()
pretrained_bleu = evaluate_bleu(pretrained_model, pretrained_tokenizer, small_test_data)
end_time = time.time()
print(f"Pre-trained BART BLEU: {pretrained_bleu:.4f}")
print(f"Evaluation Time: {end_time - start_time:.2f} seconds")

Evaluating fine-tuned model...


Evaluating BLEU: 100%|██████████| 10/10 [00:06<00:00,  1.48it/s]


Fine-tuned BART BLEU: 0.1067
Evaluation Time: 6.78 seconds

Evaluating pre-trained model...


Evaluating BLEU: 100%|██████████| 10/10 [00:17<00:00,  1.79s/it]

Pre-trained BART BLEU: 0.0411
Evaluation Time: 17.91 seconds





## Conclusions from BLEU Score Analysis

Based on the evaluation results of the pre-trained and fine-tuned BART models using the BLEU metric, the following conclusions can be drawn:

---

### BLEU Scores:

1. **Fine-tuned BART**:
   - Achieved a BLEU score of **0.1067**.
   - This indicates significantly better alignment between generated summaries and reference summaries compared to the pre-trained BART model.

2. **Pre-trained BART**:
   - Scored **0.0411** on the BLEU metric.
   - This much lower score suggests that the pre-trained model produces summaries that are less accurate in the context of the user's specific dataset.

3. **Difference in Scores**:
   - The BLEU score of the fine-tuned BART is approximately **2.6 times higher** than that of the pre-trained BART.
   - This difference highlights the importance of fine-tuning in improving the quality of generated summaries, especially when test data differs from the generic corpora used to train the base model.

---

### Evaluation Time:

1. **Fine-tuned BART**:
   - The evaluation took **6.78 seconds**, significantly faster than the pre-trained model.

2. **Pre-trained BART**:
   - The evaluation took **17.91 seconds**, which is approximately **2.64 times longer** than the fine-tuned model.

---

### Key Takeaways:

1. **Improved Summary Quality**:
   - Fine-tuned BART generates summaries that are more consistent with reference summaries, as reflected in its higher BLEU score.
   - This suggests the model is better at capturing key linguistic and semantic patterns in the user-specific dataset.

2. **Better Time Efficiency**:
   - In this run, the fine-tuned BART also completed the evaluation in less than half the time. With only 10 examples this is an indication rather than a benchmark, and it may partly reflect shorter generated summaries.

3. **Value of Fine-tuning**:
   - The substantial difference in BLEU scores between the pre-trained and fine-tuned models emphasizes the effectiveness of the fine-tuning process in enhancing model performance for domain-specific tasks.

---

### Summary:

On this test subset, the fine-tuned BART model outperforms the pre-trained BART in both BLEU score and evaluation time. Fine-tuning adapts the model to the user's data, producing summaries that match the references more closely; a larger test sample would give a more reliable estimate of the gain.