### Dataset Exploration and Early-Stage Modeling

In [1]:
## import libraries
import os
import sys
import time
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import Dataset, DatasetDict, load_dataset
import evaluate
from transformers import BartTokenizer, BartForConditionalGeneration, pipeline, DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

import warnings
warnings.filterwarnings("ignore")

In [2]:
## import dataset from huggingface
dataset = load_dataset("cnn_dailymail", "3.0.0")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})
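
Before modeling, it can help to see how long the articles and highlights are, since the summarizer will later truncate its input. The sketch below is one possible way to compute word-count statistics per field. The helper name `length_stats` and the `toy_rows` sample are hypothetical and not part of the original notebook; the toy rows are used only so the example runs without downloading the dataset.

```python
from statistics import mean, median

def length_stats(rows, fields=("article", "highlights")):
    """Return word-count statistics per field for an iterable of dict-like rows."""
    stats = {}
    for field in fields:
        # Whitespace-split word counts; a rough proxy, not model token counts.
        counts = [len(row[field].split()) for row in rows]
        stats[field] = {
            "min": min(counts),
            "max": max(counts),
            "mean": mean(counts),
            "median": median(counts),
        }
    return stats

# Hypothetical stand-in rows (assumption: same field names as the dataset).
toy_rows = [
    {"article": "one two three four five six", "highlights": "one two"},
    {"article": "alpha beta gamma delta", "highlights": "alpha"},
]
toy_stats = length_stats(toy_rows)
print(toy_stats)
```

With the real data, something like `length_stats(dataset["train"].select(range(1000)))` should work, assuming that iterating the subset yields one dict per row. Word counts only approximate the token counts the BART tokenizer will produce.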


In [3]:
sample = dataset['test'][0]
print("ARTICLE:\n", sample['article'])
print("\nHIGHLIGHTS:\n", sample['highlights'])


ARTICLE:
 (CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC's founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians' efforts to join the body. But Palestinian Foreign Minister Riad al-Malki, speaking at Wednesday's ce

### Load Pre-trained BART Model and Tokenizer


In [None]:
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)


In [None]:
## use GPU if available
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

### Summarize a News Article

In [None]:
def summarize_text(text, max_input=1024, max_output=150):
    """
    Summarize the input text using the BART model. 
    Args:
        text (str): The input text to summarize.
        max_input (int): The maximum length of the input text.
        max_output (int): The maximum length of the output summary.
    Returns:
        str: The generated summary.
    """
    inputs = tokenizer.encode(text, return_tensors="pt", max_length=max_input, truncation=True) # tokenizes the input text into numerical IDs, returns a PyTorch tensor.
    inputs = inputs.to(model.device) # move the input tensor to the same device as the model

    # Generate the summary with beam search, which keeps several candidate
    # sequences at each step and returns the highest-scoring one.
    summary_ids = model.generate(
        inputs,
        max_length=max_output,
        min_length=40,
        length_penalty=2.0,
        num_beams=4,
        early_stopping=True,
    )

    # Decode the generated token IDs back into human-readable text,
    # dropping special tokens such as <s> and </s>.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
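
To sanity-check a generated summary against the reference highlights, a rough unigram-overlap score can be useful. The sketch below is a simplified, ROUGE-1-style F1 written in plain Python. The function name `unigram_f1` is hypothetical, and this is not the implementation used by the `evaluate` library: it only lowercases and splits on whitespace, with no stemming or punctuation handling.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """Simplified ROUGE-1-style F1 based on clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Made-up strings for illustration only.
score = unigram_f1(
    "the court opened an inquiry",
    "the court opened a preliminary inquiry",
)
print(round(score, 3))
```

In the notebook this could be called as `unigram_f1(summarize_text(sample['article']), sample['highlights'])`. For reported results, the ROUGE metric loaded through `evaluate` is the more appropriate choice.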
