# Overview:

---


This project fine-tunes the bigscience/bloom-1b1 language model to summarize news articles, loading it in 4-bit precision through a BitsAndBytes quantization configuration and training LoRA adapters with PEFT. It begins with data preprocessing to clean and prepare news articles from the CNN/DailyMail dataset. The model is then fine-tuned on prompt-formatted article/summary pairs and used to generate concise summaries of new articles.

In [None]:
!pip install opendatasets datasets transformers peft accelerate bitsandbytes --upgrade --quiet

In [None]:
import os
import re

import pandas as pd
import torch
from datasets import Dataset, load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    GenerationConfig,
    Trainer,
    TrainingArguments,
    logging,
    pipeline,
)

**Data Acquisition:**
Downloaded and prepared the CNN/DailyMail dataset for news article summarization using opendatasets.

In [None]:
import opendatasets as od
od.download("https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail")

**Data Cleaning and Preprocessing:**
Cleaned and preprocessed the dataset by filtering and standardizing text for model input.

In [None]:
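def filter_text(text):
    # Lowercase, then collapse every run of non-alphanumeric characters into one space.
    text = text.lower()
    text = re.sub(r'[^A-Za-z0-9]+', ' ', text)
    return text

# train_df is created by the pd.read_csv cell further down; run that cell before this one.
train_df["article"] = train_df["article"].apply(filter_text)
train_df["highlights"] = train_df["highlights"].apply(filter_text)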
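To make the effect of this cleaning step concrete, the sketch below applies the same `filter_text` logic to a made-up sentence (the sample string is illustrative and not taken from the dataset). Punctuation, apostrophes, and decimal points all become spaces, which is why cleaned text contains fragments such as `pm2 5`.

```python
import re

def filter_text(text):
    # Same logic as the notebook's cleaning function.
    text = text.lower()
    return re.sub(r'[^A-Za-z0-9]+', ' ', text)

sample = "WHO's PM2.5 limit: 5 micrograms!"  # hypothetical input
cleaned = filter_text(sample)
print(repr(cleaned))
```

The trailing punctuation mark also becomes a space, so callers that care about exact whitespace may want to add `.strip()`.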

**Model Selection and Tokenization:**
Selected bigscience/bloom-1b1 as the pre-trained model for news article summarization and loaded it in 4-bit NF4 precision via BitsAndBytesConfig. Utilized AutoTokenizer for tokenization and input preparation.


In [None]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b1")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b1", quantization_config=quant_config)
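As a rough illustration of why 4-bit loading helps, the sketch below estimates the weight storage at 16 and 4 bits per parameter. The base parameter count is derived from the PEFT summary printed later in the notebook (all parameters minus the LoRA parameters). Treating every weight as quantized is a simplifying assumption: in practice some modules typically stay in higher precision and the quantization constants add overhead, so the real 4-bit footprint is larger than this lower bound.

```python
def weight_gigabytes(num_params, bits_per_param):
    # Storage needed for the weights alone, in decimal gigabytes.
    return num_params * bits_per_param / 8 / 1e9

# Base model size: all params minus LoRA params, as reported by print_trainable_parameters().
base_params = 1_067_673_600 - 2_359_296

fp16_gb = weight_gigabytes(base_params, 16)
nf4_gb = weight_gigabytes(base_params, 4)  # idealized: assumes every weight is stored in 4 bits

print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f"4-bit weights (lower bound): {nf4_gb:.2f} GB")
```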


In [None]:
train_df = pd.read_csv("/content/newspaper-text-summarization-cnn-dailymail/cnn_dailymail/train.csv")[["article", "highlights"]]
train_df = train_df.sample(10000)

In [None]:
train_df.head()

| index | article | highlights |
|---|---|---|
| 157885 | lawyers for the alleged teenage sex slave who ... | lawyers for alleged teenage sex slave are tryi... |
| 257177 | jamie carragher believes liverpool have failed... | brendan rodgers side have conceded four goals ... |
| 241660 | by daniel miller published 11 22 est 25 januar... | turkish airlines flight with 114 people on boa... |
| 191585 | an overweight mother of two shed five stone af... | anna lloyd from crewe promised her son isaac s... |
| 72942 | by sam webb published 04 52 est 17 july 2013 u... | natasha jones gave guests a wedding they will ... |


In [None]:
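# Build the training prompt with vectorized string concatenation; assigning to the
# rows yielded by iterrows() is not guaranteed to write back to the DataFrame.
train_df["final_statement"] = (
    "Summarize the following article.\n\n"
    + train_df["article"].astype(str)
    + "\nSummary:\n"
    + train_df["highlights"].astype(str)
)

train_df = train_df[["final_statement"]]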
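The prompt layout matters because the inference cell later splits the generated text on the `Summary:` marker, so training and inference prompts need the same marker. One way to keep them aligned is a single helper used in both places. The sketch below is a suggestion rather than part of the original notebook; `build_prompt`, `PROMPT_HEADER`, and `SUMMARY_MARKER` are hypothetical names.

```python
PROMPT_HEADER = "Summarize the following article.\n\n"
SUMMARY_MARKER = "\nSummary:\n"

def build_prompt(article, summary=""):
    # Training prompts include the reference summary; inference prompts leave it empty.
    return PROMPT_HEADER + article + SUMMARY_MARKER + summary

train_example = build_prompt("some cleaned article text", "a short summary")
inference_prompt = build_prompt("some cleaned article text")

print(train_example)
```

Because the inference prompt is a prefix of the training example, the model sees the same context at generation time that it saw before each reference summary during training.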

In [None]:
print(train_df["final_statement"].iloc[9])

Summarize the following article.

 cnn late last month in aleppo syria civilians who have cell phone subscriptions received a foreboding text message in arabic game over those on prepaid phones including many opposition fighters and activists who tend to throw their devices away after several uses to avoid detection did not receive the text or subsequent messages signed by the syrian arab army telling them to surrender their weapons the government was sending a message to the rebels through people who subscribe says taufiq rahim a dubai based arab affairs analyst an act of psychological warfare carried out by cell phone the texts have increased syria watchers concerns that the embattled government has realized both the full potential of using the internet and mobile carriers to communicate with its leaderless opposition and the importance of the networks as domestic and international lifelines for the rebels defecting syrian propagandist says his job was to fabricate there are growing 

**Model Training:**
Defined training parameters (TrainingArguments) and utilized Trainer to train the model on tokenized datasets.

In [None]:
tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(example):
    # Tokenize once and reuse the ids as labels (causal language modeling objective).
    # Note: max_length=250 truncates long articles, which can cut off the summary part of the prompt.
    tokens = tokenizer(example["final_statement"], padding="max_length", max_length=250, truncation=True)
    example["input_ids"] = tokens["input_ids"]
    example["attention_mask"] = tokens["attention_mask"]
    example["labels"] = [ids.copy() for ids in tokens["input_ids"]]
    return example

# Convert your DataFrame into a Dataset object
train_data = Dataset.from_pandas(train_df)

# Apply the tokenize function to the whole dataset in batches
train_tokenized_datasets = train_data.map(tokenize_function, batched=True, remove_columns=train_data.column_names)
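Since the labels are a copy of the padded input ids, padding positions also contribute to the training loss. A common refinement, shown here as a sketch that the notebook itself does not apply, is to replace labels at padded positions with `-100`, the value ignored by PyTorch's cross-entropy loss. The attention mask is used to find the padding because the pad token is the same as the EOS token here. `mask_padding` and the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def mask_padding(input_ids, attention_mask):
    # Keep the token id where the mask is 1, otherwise mark the position as ignored.
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

input_ids = [12, 7, 99, 2, 2, 2]      # made-up ids; 2 plays the role of pad/EOS
attention_mask = [1, 1, 1, 1, 0, 0]   # the first 2 is a real EOS, the last two are padding

labels = mask_padding(input_ids, attention_mask)
print(labels)
```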


In [None]:
print(tokenizer.decode(train_tokenized_datasets[5]["input_ids"], skip_special_tokens = True))

Summarize the following article.



In [None]:
peft_params = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, peft_params)
peft_model.print_trainable_parameters()

trainable params: 2,359,296 || all params: 1,067,673,600 || trainable%: 0.22097539922313336
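The trainable parameter count can be reproduced by hand. The sketch below assumes that BLOOM-1b1 has 24 transformer blocks with hidden size 1536 and that PEFT attaches LoRA only to the fused query/key/value projection in each block; neither detail is stated in the notebook, but under these assumptions the arithmetic matches the printed figure.

```python
hidden_size = 1536   # assumed BLOOM-1b1 hidden size
num_layers = 24      # assumed number of transformer blocks
rank = 16            # r from LoraConfig

in_features = hidden_size
out_features = 3 * hidden_size  # fused query/key/value projection (assumed LoRA target)

# LoRA adds two low-rank matrices per adapted layer: A (rank x in) and B (out x rank).
params_per_layer = rank * in_features + out_features * rank
trainable = params_per_layer * num_layers

total = 1_067_673_600  # "all params" from the printed summary
pct = 100 * trainable / total

print(f"trainable params: {trainable:,} ({pct:.4f}% of all params)")
```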


**Evaluation and Metrics:**
Trained the model using Trainer and monitored the training loss, which is logged every 500 steps. No held-out evaluation metric is computed in this notebook.

In [None]:
training_args = TrainingArguments(
    output_dir='./model_checkpoints',
    save_total_limit=1,
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=5,
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_tokenized_datasets,
)

trainer.train()

trainer.model.save_pretrained('./final_model')
tokenizer.save_pretrained('./final_model')


| Step | Training Loss |
|---|---|
| 500 | 3.3107 |
| 1000 | 3.2653 |
| 1500 | 3.2023 |
| 2000 | 3.1618 |
| 2500 | 3.1612 |
| 3000 | 3.0383 |
| 3500 | 3.0553 |
| 4000 | 3.0104 |
| 4500 | 2.9597 |
| 5000 | 2.9669 |




('./final_model/tokenizer_config.json',
 './final_model/special_tokens_map.json',
 './final_model/tokenizer.json')
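Training loss is the only metric reported above. One possible way to score generated summaries against the reference highlights is a unigram-overlap F1 in the spirit of ROUGE-1. The sketch below is a simplified illustration, not the official ROUGE implementation (no stemming or special tokenization), and `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    # Count whitespace-separated tokens and measure their clipped overlap.
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(f"ROUGE-1 style F1: {score:.3f}")
```

Since the notebook lowercases and strips punctuation from both articles and highlights, whitespace tokenization is a reasonable fit for text cleaned by `filter_text`.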

**Deployment and Inference:**
Used the fine-tuned model to summarize a new article. The cell below builds the same prompt format used during training, leaves the summary empty, and lets the model generate it.

In [None]:
news_article = """
All but one of the 100 cities with the world’s worst air pollution last year were in Asia, according to a new report, with the climate crisis playing a pivotal role in bad air quality that is risking the health of billions of people worldwide.

The vast majority of these cities — 83 — were in India and all exceeded the World Health Organization’s air quality guidelines by more than 10 times, according to the report by IQAir, which tracks air quality worldwide.

The study looked specifically at fine particulate matter, or PM2.5, which is the tiniest pollutant but also the most dangerous. Only 9% of more than 7,800 cities analyzed globally recorded air quality that met WHO’s standard, which says average annual levels of PM2.5 should not exceed 5 micrograms per cubic meter.

“We see that in every part of our lives that air pollution has an impact,” said IQAir Global CEO Frank Hammes. “And it typically, in some of the most polluted countries, is likely shaving off anywhere between three to six years of people’s lives. And then before that will lead to many years of suffering that are entirely preventable if there’s better air quality.”

"""

# Same prompt format as in training, with the summary left for the model to complete.
filtered_news_article = "Summarize the following article.\n\n" + filter_text(news_article) + "\nSummary:\n"
# No truncation here, so the trailing "Summary:" marker always stays in the prompt.
tokenized_news_article = tokenizer(filtered_news_article, return_tensors="pt").to(peft_model.device)
output = peft_model.generate(**tokenized_news_article, max_new_tokens=100)
summary = tokenizer.decode(output[0], skip_special_tokens=True)


In [None]:
print(summary.split("\nSummary:\n")[1])

the report says the climate crisis is playing a pivotal role in bad air quality that is risking the health of billions of people worldwide the vast majority of these cities 83 were in india and all exceeded the world health organization s air quality guidelines by more than 10 times according to the report by iqair which tracks air quality worldwide the study looked specifically at fine particulate matter or pm2 5 which is the tiniest pollutant but also the most dangerous only 9 of more than 7 800 cities
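Indexing the result of `split` with `[1]` raises an `IndexError` if the marker is missing from the decoded text, for example when a prompt has been truncated. A slightly more defensive extraction, sketched here with a hypothetical `extract_summary` helper and a made-up generated string, could look like this:

```python
SUMMARY_MARKER = "\nSummary:\n"

def extract_summary(generated_text, marker=SUMMARY_MARKER):
    # partition() returns an empty separator when the marker is absent.
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else ""

generated = (
    "Summarize the following article.\n\n"
    "some article text"
    "\nSummary:\n"
    "the generated summary"
)
print(extract_summary(generated))
```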
