# **Business Objective**

Text summarization is an active research area within data science. Although summarization techniques have a long history, recent years have seen significant progress driven by advances in natural language processing and deep learning. Many technology companies actively contribute to this research and publish papers; Salesforce, for example, has published significant papers demonstrating state-of-the-art abstractive summarization techniques. In May 2018, a substantial milestone was reached with the release of a sizable summarization dataset, supported by a Google Research grant.

Despite this intense research activity, little literature discusses practical applications of AI-driven summarization. Summarization is a complex task without a universal solution. Factors such as document length and content genre (e.g., technology, sports, finance, travel) significantly influence how a document should be summarized. Summarizing a news article, for instance, is quite different from summarizing a financial earnings report. Consequently, the approach must be tailored to the specific use case.

---


# **BART Summarization Pre-Training Data Description: CNN/DM**

The CNN/DailyMail dataset, as introduced by Hermann et al. in 2015, comprises 93,000 articles from CNN and 220,000 articles from the Daily Mail newspapers. Both publications include concise bullet-point summaries alongside their articles. A non-anonymized variant of this dataset can be found in the work by See et al. in 2017.

To obtain the raw dataset, download and extract the stories directories for both CNN and the Daily Mail. The files can be downloaded from the terminal with the gdown tool, which is installed with `pip install gdown`. The cells below instead load the dataset through the Hugging Face `datasets` library.

---


In [1]:
# Python versions: 3.8.10 and 3.10.12 (recent Colab Python version)
!pip install datasets==2.9.0
!pip install transformers==4.26.1
!pip install pytorch_lightning==1.9.1
!pip install torch==1.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
!pip install scikit-learn==1.0.2
!pip install pandas==1.3.5

Collecting fsspec>=2021.11.1 (from fsspec[http]>=2021.11.1->datasets==2.9.0)
  Downloading fsspec-2024.10.0-py3-none-any.whl.metadata (11 kB)
Downloading fsspec-2024.10.0-py3-none-any.whl (179 kB)
Installing collected packages: fsspec
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2021.6.1
    Uninstalling fsspec-2021.6.1:
      Successfully uninstalled fsspec-2021.6.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 1.27.0 requires pandas>=1.5.3, but you have pandas 1.3.5 which is incompatible.
bigframes 1.27.0 requires scikit-learn>=1.2.2, but you have scikit-learn 1.0.2 which is incompatible.
cudf-cu12 24.10.1 requires pandas<2.2.3dev0,>=2.0, but you have pandas 1.3.5 which is incompatible.
senten

In [2]:
!pip install datasets==1.6.2
!pip install fsspec==2021.6.1


Collecting datasets==1.6.2
  Using cached datasets-1.6.2-py3-none-any.whl.metadata (9.2 kB)
Requested datasets==1.6.2 from https://files.pythonhosted.org/packages/46/1a/b9f9b3bfef624686ae81c070f0a6bb635047b17cdb3698c7ad01281e6f9a/datasets-1.6.2-py3-none-any.whl has invalid metadata: Expected matching RIGHT_PARENTHESIS for LEFT_PARENTHESIS, after version specifier
    pyarrow (>=1.0.0<4.0.0)
            ~~~~~~~~^
Please use pip<24.1 if you need to use this version.
ERROR: Could not find a version that satisfies the requirement datasets==1.6.2 (from versions: 0.0.9, 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.1.2, 1.1.3, 1.2.0, 1.2.1, 1.3.0, 1.4.0, 1.4.1, 1.5.0, 1.6.0, 1.6.1, 1.6.2, 1.7.0, 1.8.0, 1.9.0, 1.10.0, 1.10.1, 1.10.2, 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 1.13.2, 1.13.3, 1.14.0, 1.15.0, 1.15.1, 1.16.0, 1.16.1, 1.17.0, 1.18.0, 1.18.1, 1.18.2, 1.18.3, 1.18.4, 2.0.0, 2.1.0, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2, 2.4.0, 2.5.0, 2.5.1, 2.5.2, 2.6.0, 2.6.1, 2.6.2, 2.7.0

In [3]:
from datasets import load_dataset, list_datasets

datasets = list_datasets()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Permalink: https://huggingface.co/datasets/viewer/?dataset=cnn_dailymail&config=3.0.0



In [4]:
from pprint import pprint

print(f"🤩 Currently {len(datasets)} datasets are available on the hub:")
pprint(datasets, compact=True)

Streaming output truncated to the last 5000 lines.
 'tintnguyen/generated-viwiki-questions-negs-2', 'Dipl0/ASL_tokens',
 'liangzid/robench-eval-Time28-s', 'HashBigBro/gms8k_gemma_qwen2.5',
 'howl-anderson/Game-Oracle_DOTA2-Match-Prediction-Dataset',
 'DrissDo/half_vietnamese_curated_dataset',
 'achnew001/Consolidated-Open-Source-Dataset-for-Global-Wellbeing-and-Sustainability',
 'Kucharek9/Airforce1_project', 'liangzid/robench-eval-Time29-s',
 'ABrain-One/nn-dataset', 'VladislavKaryukin/AmericanEnglishTTS',
 'aysekaya/turkishTextToSql-ds', 'neoneye/simon-arc-combine-v197',
 'richardodliu/leetcode_test',
 'Mateusz1017/company_reports_features_combined_full',
 'aaditya-aub/desktop-element-detection-001',
 'argilla-internal-testing/test_import_dataset_from_hub_with_classlabel_90a4eb58-04d7-41b3-828d-c5bd2dd28d49',
 'liangzid/robench-eval-Time30-s',
 'argilla-internal-testing/test_import_dataset_from_hub_with_classlabel_05840ecd-6f70-4784-ac29-38e013ea4ee7',
 'rssaem/btsdataL

---

In [5]:
from datasets import load_dataset

# Disable split verification by setting ignore_verifications=True
dataset = load_dataset("cnn_dailymail", "3.0.0", ignore_verifications=True)

# Select a small subset for testing
subset = dataset['train'].select(range(15))
print(subset[0])




  0%|          | 0/3 [00:00<?, ?it/s]

  table = cls._concat_blocks(blocks, axis=0)


{'article': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office char

In [6]:
dataset_ = subset
print(dataset_)

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 15
})


In [7]:
print(f"👉Dataset len(dataset): {len(dataset_)}")
print("\n👉First item 'dataset[0]':")
pprint(dataset_[0])

👉Dataset len(dataset): 15

👉First item 'dataset[0]':
{'article': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe '
            'gains access to a reported £20 million ($41.1 million) fortune as '
            "he turns 18 on Monday, but he insists the money won't cast a "
            'spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter '
            'and the Order of the Phoenix" To the disappointment of gossip '
            'columnists around the world, the young actor says he has no plans '
            'to fritter his cash away on fast cars, drink and celebrity '
            'parties. "I don\'t plan to be one of those people who, as soon as '
            'they turn 18, suddenly buy themselves a massive sports car '
            'collection or something similar," he told an Australian '
            'interviewer earlier this month. "I don\'t think I\'ll be '
            'particularly extravagant. "The things I like buying are things '
            'that cost a

---

### **BART Fine-Tuning: Using Transformers**

In [8]:
# Importing libraries
import torch
from torch.nn import functional as F
from torch import nn
import pytorch_lightning as pl

from transformers import BartForConditionalGeneration, BartTokenizer
from sklearn.model_selection import train_test_split
import pandas as pd

from transformers import (
    AdamW,
    get_linear_schedule_with_warmup
)
from torch.utils.data import DataLoader

In [9]:
# Check which GPU we have access to. This output is from the Google Colab version.
!nvidia-smi

Fri Dec  6 21:47:38 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

---

In [10]:
import torch
import pandas as pd
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split

class Dataset(torch.utils.data.Dataset):
    """Custom dataset class for text summarization using PyTorch DataLoader.

    For more information about Dataset and DataLoader, see:
    https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
    """

    def __init__(self, texts, summaries, tokenizer, source_len, summ_len):
        """
        Initialize the Dataset.

        Args:
            texts (list): List of input texts.
            summaries (list): List of target summaries.
            tokenizer: Tokenizer for text encoding.
            source_len (int): Maximum length for input text.
            summ_len (int): Maximum length for target summary.
        """
        self.texts = texts
        self.summaries = summaries
        self.tokenizer = tokenizer
        self.source_len = source_len
        self.summ_len = summ_len

    def __len__(self):
        """
        Get the number of samples in the dataset.
        """
        # Return the full length; subtracting 1 would silently drop the last sample
        return len(self.summaries)

    def __getitem__(self, index):
        """
        Get a single data sample from the dataset.

        Args:
            index (int): Index of the data sample to retrieve.

        Returns:
            Tuple containing:
            - source input IDs
            - source attention mask
            - target input IDs
            - target attention mask
        """
        text = ' '.join(str(self.texts[index]).split())
        summary = ' '.join(str(self.summaries[index]).split())

        # Article text pre-processing: pad or truncate to exactly source_len tokens
        source = self.tokenizer.batch_encode_plus([text],
                                                  max_length=self.source_len,
                                                  padding='max_length',
                                                  truncation=True,
                                                  return_tensors='pt')
        # Summary target pre-processing: pad or truncate to exactly summ_len tokens
        target = self.tokenizer.batch_encode_plus([summary],
                                                  max_length=self.summ_len,
                                                  padding='max_length',
                                                  truncation=True,
                                                  return_tensors='pt')

        return (
            source['input_ids'].squeeze(),
            source['attention_mask'].squeeze(),
            target['input_ids'].squeeze(),
            target['attention_mask'].squeeze()
        )

class BARTDataLoader(pl.LightningDataModule):
    '''Pytorch Lightning Model Dataloader class for BART'''

    def __init__(self, tokenizer, text_len, summarized_len, file_path,
                 corpus_size, columns_name, train_split_size, batch_size):
        """
        Initialize the BARTDataLoader.

        Args:
            tokenizer: Tokenizer for text encoding.
            text_len (int): Maximum length for input text.
            summarized_len (int): Maximum length for target summary.
            file_path (str): Path to the CSV data file.
            corpus_size (int): Number of rows to read from the CSV file.
            columns_name (list): List of column names to use.
            train_split_size (float): Size of the training split (e.g., 0.8 for 80%).
            batch_size (int): Batch size for data loading.
        """
        super().__init__()
        self.tokenizer = tokenizer
        self.text_len = text_len
        self.summarized_len = summarized_len
        self.input_text_length = text_len
        self.file_path = file_path
        self.nrows = corpus_size
        self.columns = columns_name
        self.train_split_size = train_split_size
        self.batch_size = batch_size

    def prepare_data(self):
        """
        Load and preprocess the data from the CSV file.
        """
        data = pd.read_csv(self.file_path, nrows=self.nrows, encoding='latin-1')
        data = data[self.columns]
        # Prepend the task prefix to the source articles (column 0), not to the target summaries
        data.iloc[:, 0] = 'summarize: ' + data.iloc[:, 0]
        self.text = list(data.iloc[:, 0].values)
        self.summary = list(data.iloc[:, 1].values)

    def setup(self, stage=None):
        """
        Split the data into training and validation sets.

        Args:
            stage (str): The current stage ('fit' or 'test').
        """
        # train_test_split returns the splits grouped per input array:
        # (X_train, X_val, y_train, y_val)
        X_train, X_val, y_train, y_val = train_test_split(
            self.text, self.summary, train_size=self.train_split_size
        )

        self.train_dataset = (X_train, y_train)
        self.val_dataset = (X_val, y_val)

    def train_dataloader(self):
        """
        Create a DataLoader for the training dataset.
        """
        train_data = Dataset(texts=self.train_dataset[0],
                             summaries=self.train_dataset[1],
                             tokenizer=self.tokenizer,
                             source_len=self.text_len,
                             summ_len=self.summarized_len)
        return DataLoader(train_data, self.batch_size)

    def val_dataloader(self):
        """
        Create a DataLoader for the validation dataset.
        """
        val_dataset = Dataset(texts=self.val_dataset[0],
                              summaries=self.val_dataset[1],
                              tokenizer=self.tokenizer,
                              source_len=self.text_len,
                              summ_len=self.summarized_len)
        return DataLoader(val_dataset, self.batch_size)
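
The fixed-length encoding that `__getitem__` requests from the tokenizer can be illustrated without loading a tokenizer at all. The following is a minimal pure-Python sketch of padding and truncation to `max_length`; the helper name `pad_or_truncate`, the pad ID of 1, and the token IDs in the example are assumptions chosen for illustration and are not part of the `transformers` API.

```python
def pad_or_truncate(token_ids, max_length, pad_id=1):
    """Return (input_ids, attention_mask), each exactly max_length long.

    pad_id=1 is an assumed padding ID used only for this illustration.
    """
    # Truncate sequences that are longer than max_length
    ids = list(token_ids[:max_length])
    # Real tokens are marked with 1 in the attention mask
    mask = [1] * len(ids)
    # Fill the remaining positions with padding, masked out with 0
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# Arbitrary example IDs: one sequence shorter and one longer than max_length
short_ids, short_mask = pad_or_truncate([0, 713, 16, 2], max_length=6)
long_ids, long_mask = pad_or_truncate([0, 713, 16, 10, 251, 2], max_length=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every sample ends up with the same length, the default `DataLoader` collation can stack samples into a batch tensor. A real tokenizer additionally keeps the closing special token when it truncates, which this sketch leaves out.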


In [11]:
import torch
import pytorch_lightning as pl
from transformers import AdamW

class AbstractiveSummarizationBARTFineTuning(pl.LightningModule):
    """Abstractive summarization model class for fine-tuning BART."""

    def __init__(self, model, tokenizer):
        """
        Initialize the AbstractiveSummarizationBARTFineTuning model.

        Args:
            model: Pre-trained BART model.
            tokenizer: BART tokenizer.
        """
        super().__init__()
        self.model = model
        self.tokenizer = tokenizer

    def forward(self, input_ids, attention_mask, decoder_input_ids,
                decoder_attention_mask=None, lm_labels=None):
        """
        Forward pass for the model.

        Args:
            input_ids: Input token IDs.
            attention_mask: Attention mask for input.
            decoder_input_ids: Target token IDs (unused here; BART derives the
                right-shifted decoder inputs from the labels).
            decoder_attention_mask: Attention mask for target (unused here).
            lm_labels: Language modeling labels, with padding set to -100.

        Returns:
            Model outputs.
        """
        # Pass only `labels`: feeding the unshifted targets as decoder inputs
        # would let the decoder see the very token it is asked to predict.
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=lm_labels
        )

        return outputs

    def preprocess_batch(self, batch):
        """
        Reformat and preprocess the batch for model input.

        Args:
            batch: Batch of data.

        Returns:
            Formatted input and target data.
        """
        input_ids, source_attention_mask, decoder_input_ids, \
        decoder_attention_mask = batch

        # Labels are the target token IDs; padding positions are set to -100
        # so that the loss ignores them.
        y = decoder_input_ids.clone()
        y[y == self.tokenizer.pad_token_id] = -100
        decoder_ids = decoder_input_ids
        source_ids = input_ids
        source_mask = source_attention_mask

        return source_ids, source_mask, decoder_ids, decoder_attention_mask, y

    def training_step(self, batch, batch_idx):
        """
        Training step for the model.

        Args:
            batch: Batch of training data.
            batch_idx: Index of the batch.

        Returns:
            Loss for the training step.
        """
        input_ids, source_attention_mask, decoder_input_ids, \
        decoder_attention_mask, lm_labels = self.preprocess_batch(batch)

        outputs = self.forward(input_ids=input_ids, attention_mask=source_attention_mask,
                               decoder_input_ids=decoder_input_ids,
                               decoder_attention_mask=decoder_attention_mask,
                               lm_labels=lm_labels
                       )
        loss = outputs.loss

        return loss

    def validation_step(self, batch, batch_idx):
        """
        Validation step for the model.

        Args:
            batch: Batch of validation data.
            batch_idx: Index of the batch.

        Returns:
            Loss for the validation step.
        """
        input_ids, source_attention_mask, decoder_input_ids, \
        decoder_attention_mask, lm_labels = self.preprocess_batch(batch)

        outputs = self.forward(input_ids=input_ids, attention_mask=source_attention_mask,
                               decoder_input_ids=decoder_input_ids,
                               decoder_attention_mask=decoder_attention_mask,
                               lm_labels=lm_labels
                       )
        loss = outputs.loss

        return loss

    def training_epoch_end(self, outputs):
        """
        Calculate and log the average training loss for the epoch.

        Args:
            outputs: List of training step outputs.
        """
        avg_loss = torch.stack([x["loss"] for x in outputs]).mean()
        self.log('Epoch', self.trainer.current_epoch)
        self.log('avg_epoch_loss', {'train': avg_loss})

    def validation_epoch_end(self, outputs):
        """
        Calculate and log the average validation loss for the epoch.

        Args:
            outputs: List of validation step losses.
        """
        # Lightning calls this hook by the name `validation_epoch_end`;
        # validation_step returns bare loss tensors, so stack them directly.
        avg_loss = torch.stack(outputs).mean()
        self.log('avg_val_loss', avg_loss)

    def configure_optimizers(self):
        """
        Configure and return the optimizer for the model.

        Returns:
            Optimizer for training.
        """
        model = self.model
        optimizer = AdamW(model.parameters())
        self.opt = optimizer

        return [optimizer]
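
When a sequence-to-sequence model such as BART is trained with teacher forcing, the decoder input at each step is the previous target token, so the decoder inputs are the labels shifted one position to the right. A minimal pure-Python sketch of that shift follows; the function name `shift_right` and the token IDs (start ID 2, pad ID 1) are assumptions made for illustration rather than values read from the model config.

```python
def shift_right(labels, start_id=2, pad_id=1, ignore_index=-100):
    """Build teacher-forcing decoder inputs from a list of label IDs.

    start_id and pad_id are assumed values used only for this illustration.
    """
    # Prepend the start token and drop the final label
    shifted = [start_id] + list(labels[:-1])
    # Positions excluded from the loss are fed to the decoder as padding
    return [pad_id if tok == ignore_index else tok for tok in shifted]


# Arbitrary example: four real target tokens followed by two ignored positions
labels = [0, 11, 12, 2, -100, -100]
decoder_inputs = shift_right(labels)

# At each step the decoder sees `seen` and is trained to predict `target`
for step, (seen, target) in enumerate(zip(decoder_inputs, labels)):
    print(step, seen, target)
```

Passing the unshifted targets as both decoder inputs and labels would instead pair each position with itself, which lets the model copy the answer rather than predict it.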


In [12]:
# Model and tokenizer
# Upload curated_data_subset.csv if using Colab, or change the path to a local file
model_ = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

# Dataloader
# Initialize a DataLoader for processing and loading data
dataloader = BARTDataLoader(tokenizer=tokenizer, text_len=512,
                            summarized_len=150,
                            file_path='/content/curated_data_subset.csv',
                            corpus_size=50, columns_name=['article_content','summary'],
                            train_split_size=0.8, batch_size=2)

# Read and pre-process data
dataloader.prepare_data()

# Train-test Split
# Split the data into training and validation sets
dataloader.setup()




In [13]:
# Main Model class
# Create an instance of the AbstractiveSummarizationBARTFineTuning model
model = AbstractiveSummarizationBARTFineTuning(model=model_, tokenizer=tokenizer)


In [14]:
# Trainer Class
# Initialize a PyTorch Lightning Trainer for training and evaluation
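# Train on one GPU if available (accelerator='gpu', devices=1); remove both arguments to train on CPU.
# The older `gpus=1` argument is deprecated in this Lightning version.
trainer = pl.Trainer(check_val_every_n_epoch=1, max_epochs=5, accelerator='gpu', devices=1)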

# Fit model
# Train the model using the specified trainer and data loader
trainer.fit(model, dataloader)


  rank_zero_deprecation(
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                         | Params
-------------------------------------------------------
0 | model | BartForConditionalGeneration | 139 M 
-------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
557.682   Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
  rank_zero_warn(


Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]



Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.


---

# **BART Abstractive Summarization: Using Pre-Trained Model**

In [15]:
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig

In [16]:
def summarize_article(article):
    # Load BART model and tokenizer
    model_name = 'facebook/bart-large-cnn'
    tokenizer = BartTokenizer.from_pretrained(model_name)
    model = BartForConditionalGeneration.from_pretrained(model_name)

    # Tokenize and encode the article
    inputs = tokenizer.encode(article, return_tensors='pt', max_length=1024, truncation=True)

    # Generate summary
    summary_ids = model.generate(inputs, num_beams=4, max_length=150, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

# Example usage
article = """
My friends are cool but they eat too many carbs.
"""

summary = summarize_article(article)
print("Summary:")
print(summary)


Some weights of BartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-large-cnn and are newly initialized: ['model.shared.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Summary:
My friends are cool but they eat too many carbs. That's what this is all about. I don't want you to think I'm a bad person. I'm not. I just don't like to be around people who eat too much carbs. This is my way of telling you that.


In [17]:
article = '''Iran’s High Council of Cyberspace, one of the main entities deciding the fate of virtual currencies in Iran, has welcomed the idea of Bitcoin and other cryptocurrencies if they are harnessed by clearly-stated regulations. “We [at the HCC] welcome Bitcoin, but we must have regulations for Bitcoin and any other digital currency. Studies are necessary for considering a new currency,” Abolhassan Firouzabadi, HCC secretary, told ILNA. He said HCC and the Central Bank of Iran are currently studying virtual currencies, as they have captured the attention of the world. However, he points out that even though the CBI has yet to devise definitive regulations for Bitcoin and similar currencies, “many in Iran are dealing with Bitcoin, be it purchasing, selling or mining it, and even dealing with it in exchange shops, creating content and establishing startups”. CBI has envisioned six documents on fintechs and cryptocurrencies that will be unveiled by the end of the next fiscal year in March 2019. Two of them, dealing with payment initiators and payment facilitators, have already been published with a third covering micropayments and related technologies in fintech to be announced in the next few weeks. The fifth document exclusively deals with cryptocurrencies and will be unveiled by the time the sixth month of the next fiscal year comes to an end in September 2018. As CBI has announced, it will be a regulatory framework instead of clear-cut regulations. Firouzabadi said Iran’s central bank, like that of many other nations, has not come to a stable and defined stance on Bitcoin, noting that many countries look at it as a potentially dangerous option in light of its violent price fluctuations and investment risk. Less than two weeks ago, the head of CBI’s Innovative Technologies Department, Nasser Hakimi, asked investors and the public to refrain from dealing with virtual currencies without proper knowledge and to remain cautious. 
“Mechanisms of control and supervision over the supply of cryptocurrencies are being implemented through the collaboration of the central bank and related entities, but the people must be aware of their risks and dangers on the demand side,” he added. Around the same days, Bitcoin had experienced its third steep decline of the year to stoop lower than $6,000 after briefly skyrocketing to around $7,800. In the two weeks since, it reached yet another staggering high of more than $8,300. The cryptocurrency is priced just shy of $8,200 as of this writing, while Iranians can purchase one at roughly 350 million rials ($8,400). The HCC secretary said that in light of the ease of use and anonymity of Bitcoin, high investment profits and lack of any transaction fees, it has emerged as a cataclysmic change in banking and online businesses. “Our view regarding Bitcoin is positive, but it does not mean that we will not require regulations in this regard because following the rules is a must,” he said. Firouzabadi said he hopes CBI’s stance toward cryptocurrencies will soon be clarified, adding that HCC will continue to conduct studies and evaluate pros and cons of virtual currencies.'''

In [18]:
summary = summarize_article(article)
print("Summary:")
print(summary)

Some weights of BartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-large-cnn and are newly initialized: ['model.shared.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Summary:
Iran's High Council of Cyberspace has welcomed the idea of Bitcoin and other cryptocurrencies if they are harnessed by clearly-stated regulations. HCC and the Central Bank of Iran are currently studying virtual currencies. CBI has envisioned six documents on fintechs and cryptocurrencies that will be unveiled by the end of the next fiscal year in March 2019.


In [19]:
!pip install rouge-score



In [24]:
from datasets import load_metric

rouge = load_metric("rouge")
ground_truth_summaries = [
    "Iran open to cryptocurrencies, planning new regulations"
]  # Ground truth summaries (highlights)

results = rouge.compute(predictions=[summarize_article(article)], references=ground_truth_summaries)

# Print ROUGE results
print("ROUGE Results:", results)

  rouge = load_metric("rouge")
Some weights of BartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-large-cnn and are newly initialized: ['model.shared.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


ROUGE Results: {'rouge1': AggregateScore(low=Score(precision=0.05084745762711865, recall=0.42857142857142855, fmeasure=0.09090909090909091), mid=Score(precision=0.05084745762711865, recall=0.42857142857142855, fmeasure=0.09090909090909091), high=Score(precision=0.05084745762711865, recall=0.42857142857142855, fmeasure=0.09090909090909091)), 'rouge2': AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.0, recall=0.0, fmeasure=0.0), high=Score(precision=0.0, recall=0.0, fmeasure=0.0)), 'rougeL': AggregateScore(low=Score(precision=0.05084745762711865, recall=0.42857142857142855, fmeasure=0.09090909090909091), mid=Score(precision=0.05084745762711865, recall=0.42857142857142855, fmeasure=0.09090909090909091), high=Score(precision=0.05084745762711865, recall=0.42857142857142855, fmeasure=0.09090909090909091)), 'rougeLsum': AggregateScore(low=Score(precision=0.05084745762711865, recall=0.42857142857142855, fmeasure=0.09090909090909091), mid=Score(precision
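
The ROUGE-1 figures above are unigram-overlap statistics: precision is the share of predicted words that also occur in the reference, recall is the share of reference words that the prediction recovers, and the F-measure is their harmonic mean. A recall of about 0.43 next to a precision of about 0.05 is what a multi-sentence prediction scored against a single short headline tends to produce. The sketch below shows the computation in plain Python; the `rouge1` helper is hypothetical, and its lowercase alphanumeric tokenization is an assumption that may differ from what the `rouge` metric does internally.

```python
import re
from collections import Counter


def rouge1(prediction, reference):
    """Return (precision, recall, f_measure) based on clipped unigram overlap."""
    # Assumed tokenization: lowercase runs of letters and digits
    pred = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    # Counter intersection clips each word to its minimum count in both texts
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


p, r, f = rouge1("the cat sat on the mat", "the cat is on the mat")
print(f"precision={p:.3f} recall={r:.3f} f={f:.3f}")
```

With a single reference per prediction, the low, mid, and high aggregates reported by the metric coincide, as they do in the output above.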

---

# References:

* https://huggingface.co/transformers/model_doc/bart.html
* https://www.pytorchlightning.ai/
* https://www.frase.io/blog/20-applications-of-automatic-summarization-in-the-enterprise/
* https://medium.com/sciforce/towards-automatic-summarization-part-2-abstractive-methods-c424386a65ea
* https://medium.com/swlh/a-simple-overview-of-rnn-lstm-and-attention-mechanism-9e844763d07b
* https://github.com/CurationCorp/curation-corpus

---

## Enhanced Preprocessing

In [25]:

import re

def clean_text(text):
    # Remove special characters (keeping whitespace so words stay separated),
    # then collapse runs of whitespace into a single space
    text = re.sub(r'[^a-zA-Z0-9\s.,?!]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.lower()

# Example usage
sample_text = "Iran’s High Council of Cyberspace, one of the main entities deciding the fate of virtual currencies in Iran, has welcomed the idea of Bitcoin and other cryptocurrencies if they are harnessed by clearly-stated regulations. “We [at the HCC] welcome Bitcoin, but we must have regulations for Bitcoin and any other digital currency. Studies are necessary for considering a new currency,” Abolhassan Firouzabadi, HCC secretary, told ILNA. He said HCC and the Central Bank of Iran are currently studying virtual currencies, as they have captured the attention of the world. However, he points out that even though the CBI has yet to devise definitive regulations for Bitcoin and similar currencies, “many in Iran are dealing with Bitcoin, be it purchasing, selling or mining it, and even dealing with it in exchange shops, creating content and establishing startups”. CBI has envisioned six documents on fintechs and cryptocurrencies that will be unveiled by the end of the next fiscal year in March 2019. Two of them, dealing with payment initiators and payment facilitators, have already been published with a third covering micropayments and related technologies in fintech to be announced in the next few weeks. The fifth document exclusively deals with cryptocurrencies and will be unveiled by the time the sixth month of the next fiscal year comes to an end in September 2018. As CBI has announced, it will be a regulatory framework instead of clear-cut regulations. Firouzabadi said Iran’s central bank, like that of many other nations, has not come to a stable and defined stance on Bitcoin, noting that many countries look at it as a potentially dangerous option in light of its violent price fluctuations and investment risk. Less than two weeks ago, the head of CBI’s Innovative Technologies Department, Nasser Hakimi, asked investors and the public to refrain from dealing with virtual currencies without proper knowledge and to remain cautious. “Mechanisms of control and supervision over the supply of cryptocurrencies are being implemented through the collaboration of the central bank and related entities, but the people must be aware of their risks and dangers on the demand side,” he added. Around the same days, Bitcoin had experienced its third steep decline of the year to stoop lower than $6,000 after briefly skyrocketing to around $7,800. In the two weeks since, it reached yet another staggering high of more than $8,300. The cryptocurrency is priced just shy of $8,200 as of this writing, while Iranians can purchase one at roughly 350 million rials ($8,400). The HCC secretary said that in light of the ease of use and anonymity of Bitcoin, high investment profits and lack of any transaction fees, it has emerged as a cataclysmic change in banking and online businesses. “Our view regarding Bitcoin is positive, but it does not mean that we will not require regulations in this regard because following the rules is a must,” he said. Firouzabadi said he hopes CBI’s stance toward cryptocurrencies will soon be clarified, adding that HCC will continue to conduct studies and evaluate pros and cons of virtual currencies."
print("Original:", sample_text)
print("Cleaned:", clean_text(sample_text))


Original: Iran’s High Council of Cyberspace, one of the main entities deciding the fate of virtual currencies in Iran, has welcomed the idea of Bitcoin and other cryptocurrencies if they are harnessed by clearly-stated regulations. “We [at the HCC] welcome Bitcoin, but we must have regulations for Bitcoin and any other digital currency. Studies are necessary for considering a new currency,” Abolhassan Firouzabadi, HCC secretary, told ILNA. He said HCC and the Central Bank of Iran are currently studying virtual currencies, as they have captured the attention of the world. However, he points out that even though the CBI has yet to devise definitive regulations for Bitcoin and similar currencies, “many in Iran are dealing with Bitcoin, be it purchasing, selling or mining it, and even dealing with it in exchange shops, creating content and establishing startups”. CBI has envisioned six documents on fintechs and cryptocurrencies that will be unveiled by the end of the next fiscal 

## Dataset and Model Loading

In [26]:

from datasets import load_dataset
from transformers import AutoTokenizer, BartForConditionalGeneration

# Load the CNN/DailyMail dataset (version 3.0.0) and the pretrained BART tokenizer and model
dataset = load_dataset("cnn_dailymail", "3.0.0")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

print("Dataset and model loaded successfully.")




Some weights of BartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-large-cnn and are newly initialized: ['model.shared.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Dataset and model loaded successfully.


## Fine-Tuning Setup

In [27]:
from datasets import load_dataset
from transformers import AutoTokenizer, BartForConditionalGeneration, Trainer, TrainingArguments
import torch

# Free up GPU memory
torch.cuda.empty_cache()

# Load and preprocess dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")

# Initialize tokenizer and model with a smaller variant
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")  # Switch to bart-base
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")  # Use bart-base

# Preprocessing function
def preprocess_function(examples):
    inputs = tokenizer(examples['article'], max_length=512, truncation=True, padding="max_length")
    labels = tokenizer(examples['highlights'], max_length=64, truncation=True, padding="max_length")
    inputs["labels"] = labels["input_ids"]
    return inputs

# Use a smaller subset of the dataset for faster iterations
small_train_dataset = dataset['train'].shuffle(seed=42).select(range(1000))
small_val_dataset = dataset['validation'].shuffle(seed=42).select(range(500))

# Apply preprocessing function
tokenized_datasets = small_train_dataset.map(preprocess_function, batched=True)
tokenized_datasets_val = small_val_dataset.map(preprocess_function, batched=True)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,  # Reduce batch size
    gradient_accumulation_steps=4,  # Accumulate gradients to simulate larger batch size
    num_train_epochs=1,  # Only train for 1 epoch for testing
    weight_decay=0.01,
    save_strategy="no",  # No need to save models during quick testing
    fp16=True,  # Mixed precision training
    dataloader_num_workers=4,  # Use 4 CPU workers for faster data loading
    logging_steps=50,  # Log every 50 steps
    report_to="none",  # Disable W&B logging for now
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets_val,
    tokenizer=tokenizer
)

# Train the model
trainer.train()




Using cuda_amp half precision backend
The following columns in the training set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: article, highlights, id. If article, highlights, id are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1000
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 4
  Total optimization steps = 62
  Number of trainable parameters = 139420416
You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 0     | 3.6803        | 2.357327        |


The following columns in the evaluation set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: article, highlights, id. If article, highlights, id are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 500
  Batch size = 8
You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

TrainOutput(global_step=62, training_loss=3.5461487923899004, metrics={'train_runtime': 42.7969, 'train_samples_per_second': 23.366, 'train_steps_per_second': 1.449, 'total_flos': 302429283287040.0, 'train_loss': 3.5461487923899004, 'epoch': 0.99})
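
One detail of the preprocessing in the cell above deserves a second look: the labels are padded to `max_length`, so the padded positions hold the pad token id and are scored by the training loss unless they are masked. A common convention for Hugging Face seq2seq models is to replace those positions with -100, which PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists. `mask_label_padding` is a hypothetical helper, not part of the notebook, and the token ids in the toy batch are made up apart from BART's special ids (0, 1, 2).

```python
from typing import List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_label_padding(label_ids: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in sequence]
        for sequence in label_ids
    ]


# Toy batch: BART uses 0 for <s>, 2 for </s>, and 1 for <pad>
padded_labels = [[0, 713, 16, 2, 1, 1], [0, 100, 200, 300, 400, 2]]
masked_labels = mask_label_padding(padded_labels, pad_token_id=1)
print(masked_labels)
```

Inside `preprocess_function`, this would amount to `inputs["labels"] = mask_label_padding(labels["input_ids"], tokenizer.pad_token_id)`. Applying it would change the loss values reported above, since padded positions would no longer be scored.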

In [28]:
def generate_summary(article_texts, model, tokenizer):
    # Tokenize the batch; padding=True pads shorter articles to the longest one
    inputs = tokenizer(article_texts, max_length=1024, return_tensors="pt", truncation=True, padding=True)
    input_ids = inputs["input_ids"].to(model.device)
    attention_mask = inputs["attention_mask"].to(model.device)

    # Generate summaries; the attention mask keeps padding tokens from being attended to
    summary_ids = model.generate(input_ids, attention_mask=attention_mask, max_length=150, num_beams=4, length_penalty=2.0, early_stopping=True)

    # Decode and return summaries
    summaries = [tokenizer.decode(summary_id, skip_special_tokens=True) for summary_id in summary_ids]
    return summaries

# Reuse the sample article defined in the text-cleaning example above
# (a single hard-coded news article, not a draw from the validation set)
sample_articles = [sample_text]

# Generate summaries for these articles
generated_summaries = generate_summary(sample_articles, model, tokenizer)

# Display the generated summaries
for i, summary in enumerate(generated_summaries):
    print(f"Generated Summary {i+1}: {summary}\n")

Generate config GenerationConfig {
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "eos_token_id": 2,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "no_repeat_ngram_size": 3,
  "num_beams": 4,
  "pad_token_id": 1,
  "transformers_version": "4.26.1"
}



Generated Summary 1: Abolhassan Firouzabadi, HCC secretary, told ILNA that the CBI has yet to devise definitive regulations for Bitcoin and other digital currencies.
The HCC is currently studying the pros and cons of virtual currencies, he said.
It will be a regulatory framework instead of clear-cut regulations, he added.
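
The `length_penalty=2.0` argument passed to `generate` affects how finished beams are ranked. Assuming the beam scorer ranks a hypothesis by its summed log-probability divided by its length raised to `length_penalty`, values above 1.0 tilt the ranking toward longer summaries. The sketch below illustrates that formula with made-up log-probabilities. `beam_score` is a hypothetical helper, not a library function, and the two hypotheses are invented for illustration.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized beam score: sum of log-probs / length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# Illustrative (made-up) hypotheses: (summed log-probability, length in tokens)
short_hyp = (-4.0, 5)
long_hyp = (-14.0, 15)

for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(*short_hyp, penalty)
    long_score = beam_score(*long_hyp, penalty)
    winner = "long" if long_score > short_score else "short"
    print(f"length_penalty={penalty}: short={short_score:.4f}, long={long_score:.4f}, winner={winner}")
```

With these numbers the shorter hypothesis wins at penalties of 0.0 and 1.0, and the longer one wins at 2.0, because log-probabilities are negative and a larger divisor pulls the score toward zero.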



## Evaluation and Metrics

In [29]:
from datasets import load_metric
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

# Load ROUGE metric
rouge = load_metric("rouge")

# Sample input for evaluation: reuse the article defined in the text-cleaning example above
sample_inputs = [sample_text]

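# Load the pretrained bart-large-cnn checkpoint and its tokenizer
# (note: this replaces the bart-base model fine-tuned above)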
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# Move the model to GPU (if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Tokenize input
tokenized_inputs = tokenizer(sample_inputs, max_length=1024, return_tensors="pt", truncation=True)

# Move tokenized inputs to the same device as the model
tokenized_inputs = {key: value.to(device) for key, value in tokenized_inputs.items()}

# Generate summary
summary_ids = model.generate(tokenized_inputs["input_ids"], max_length=50, min_length=10, length_penalty=2.0)

# Decode the summaries
decoded_summaries = [tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids]
print("Generated Summary:", decoded_summaries)

# Ground truth for evaluation
reference_summaries = ["Iran open to cryptocurrencies, planning new regulations"]

# Compute ROUGE score
rouge_score = rouge.compute(predictions=decoded_summaries, references=reference_summaries)
print("ROUGE Score:", rouge_score)



Generated Summary: ["Iran's High Council of Cyberspace has welcomed the idea of Bitcoin and other cryptocurrencies if they are harnessed by clearly-stated regulations. CBI has envisioned six documents on fintechs and cryptocurrencies that will be unveiled by the end"]
ROUGE Score: {'rouge1': AggregateScore(low=Score(precision=0.07692307692307693, recall=0.42857142857142855, fmeasure=0.13043478260869565), mid=Score(precision=0.07692307692307693, recall=0.42857142857142855, fmeasure=0.13043478260869565), high=Score(precision=0.07692307692307693, recall=0.42857142857142855, fmeasure=0.13043478260869565)), 'rouge2': AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.0, recall=0.0, fmeasure=0.0), high=Score(precision=0.0, recall=0.0, fmeasure=0.0)), 'rougeL': AggregateScore(low=Score(precision=0.07692307692307693, recall=0.42857142857142855, fmeasure=0.13043478260869565), mid=Score(precision=0.07692307692307693, recall=0.42857142857142855, fmeasure=0.13
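
The ROUGE-1 numbers above can be reproduced by hand, which helps in reading them. The sketch below counts clipped unigram overlap between the generated summary and the reference. The tokenization (lowercasing and splitting on non-alphanumeric characters, no stemming) is an assumption meant to approximate the metric's default behavior, and `rouge1` is a hypothetical helper rather than part of the `datasets` API.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase and split on runs of non-alphanumeric characters (assumed tokenization)."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def rouge1(prediction: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F-measure."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    overlap = sum((pred_counts & ref_counts).values())  # clipped matches
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    fmeasure = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


prediction = ("Iran's High Council of Cyberspace has welcomed the idea of Bitcoin and other "
              "cryptocurrencies if they are harnessed by clearly-stated regulations. CBI has "
              "envisioned six documents on fintechs and cryptocurrencies that will be unveiled by the end")
reference = "Iran open to cryptocurrencies, planning new regulations"

scores = rouge1(prediction, reference)
print(scores)
```

Three of the seven reference tokens (`iran`, `cryptocurrencies`, `regulations`) appear in the 39-token summary, giving recall 3/7 ≈ 0.43 and precision 3/39 ≈ 0.08, in line with the reported scores. With a single short reference like this one, the low precision mostly reflects the length mismatch between summary and reference.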

## Deployment Preparation

In [30]:

# Save the model and tokenizer
# (at this point `model` is the bart-large-cnn checkpoint loaded in the evaluation cell)
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")

print("Model and tokenizer exported successfully.")


Configuration saved in ./saved_model/config.json
Configuration saved in ./saved_model/generation_config.json
Model weights saved in ./saved_model/pytorch_model.bin
tokenizer config file saved in ./saved_model/tokenizer_config.json
Special tokens file saved in ./saved_model/special_tokens_map.json


Model and tokenizer exported successfully.
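
Before shipping the exported directory, it can be useful to confirm that the files named in the save log are actually present. The sketch below is one way to do that with the standard library. `missing_export_files` is a hypothetical helper, and the expected file list is copied from the log output above; other versions of transformers may write different files (for example, safetensors weights), so treat the list as an assumption.

```python
from pathlib import Path

# File names taken from the save log above; adjust if your version writes others
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "pytorch_model.bin",
    "tokenizer_config.json",
    "special_tokens_map.json",
]


def missing_export_files(model_dir: str, expected=EXPECTED_FILES) -> list:
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_export_files("./saved_model")
if missing:
    print("Missing files:", missing)
else:
    print("All expected files are present.")
```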
