# News Summarization - Finetuning using Quantization LoRA
### Datasets : [CNN News Dataset](https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail)

# Installing and importing relevant libraries

In [1]:
!pip install transformers datasets accelerate opendatasets bitsandbytes peft --quiet

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m547.8/547.8 kB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m309.4/309.4 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m119.8/119.8 MB[0m [31m2.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m251.6/251.6 kB[0m [31m15.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m40.8/40.8 MB[0m [31m12.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m64.9/64.9 kB[0m [31m6.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m16.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━

In [2]:
import pandas as pd
import opendatasets as od
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, Trainer, GenerationConfig
from datasets import load_dataset, Dataset
from peft import LoraConfig, get_peft_model
import os
import re
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Loading the dataset

In [3]:
od.download('https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail')

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: hilmiatha
Your Kaggle Key: ··········
Dataset URL: https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail
Downloading newspaper-text-summarization-cnn-dailymail.zip to ./newspaper-text-summarization-cnn-dailymail


100%|██████████| 503M/503M [00:08<00:00, 61.8MB/s]





In [4]:
train = pd.read_csv('newspaper-text-summarization-cnn-dailymail/cnn_dailymail/train.csv')
val = pd.read_csv('newspaper-text-summarization-cnn-dailymail/cnn_dailymail/validation.csv')
test = pd.read_csv('newspaper-text-summarization-cnn-dailymail/cnn_dailymail/test.csv')

In [5]:
train = train.sample(5000, random_state=42)[['article','highlights']]

In [6]:
train

Unnamed: 0,article,highlights
272581,By . Mia De Graaf . Britons flocked to beaches...,People enjoyed temperatures of 17C at Brighton...
772,A couple who weighed a combined 32st were sham...,Couple started piling on pounds after the birt...
171868,Video footage shows the heart stopping moment ...,A 17-year-old boy suffering lacerations to his...
63167,"Istanbul, Turkey (CNN) -- About 250 people rac...",Syrians citizens hightail it to Turkey .\nMost...
68522,By . Daily Mail Reporter . PUBLISHED: . 12:53 ...,The Xue Long had provided the helicopter that ...
...,...,...
271171,"By . Matt Chorley, Mailonline Political Editor...",Major General Jonathan Shaw accuses ministers ...
146080,"ST. POELTEN, Austria (CNN) -- A verdict in th...","Friztl pleads guilty to imprisonment, incest d..."
270020,"By . Hugo Gye . PUBLISHED: . 07:49 EST, 22 Jan...","Ex-footballer, 43, has repeatedly been targete..."
126659,"(CNN) -- It's no Super Bowl. Heck, it's no Mon...",ESPN moves English club soccer game to flagshi...


# Prepocess the data

In [7]:
def filter_text(text):
  text = text.lower()
  text = re.sub(r'[^a-zA-Z0-9]+', ' ', text)
  return text

In [8]:
train['article'] = train['article'].apply(filter_text)
train['highlights'] = train['highlights'].apply(filter_text)

In [9]:
train

Unnamed: 0,article,highlights
272581,by mia de graaf britons flocked to beaches acr...,people enjoyed temperatures of 17c at brighton...
772,a couple who weighed a combined 32st were sham...,couple started piling on pounds after the birt...
171868,video footage shows the heart stopping moment ...,a 17 year old boy suffering lacerations to his...
63167,istanbul turkey cnn about 250 people raced acr...,syrians citizens hightail it to turkey most of...
68522,by daily mail reporter published 12 53 est 3 j...,the xue long had provided the helicopter that ...
...,...,...
271171,by matt chorley mailonline political editor pu...,major general jonathan shaw accuses ministers ...
146080,st poelten austria cnn a verdict in the case o...,friztl pleads guilty to imprisonment incest de...
270020,by hugo gye published 07 49 est 22 january 201...,ex footballer 43 has repeatedly been targeted ...
126659,cnn it s no super bowl heck it s no monday ni...,espn moves english club soccer game to flagshi...


## Combine the article and the summary because we will be using decoder only llm (causalLM)

In [10]:
# Ensure 'final_statement' column exists in the DataFrame
train['final_statement'] = ''

# Iterate over the DataFrame rows
for idx, row in train.iterrows():
    # Update the 'final_statement' column using loc
    train.loc[idx, 'final_statement'] = 'Summarize the following article: \n\n' + str(row['article']) + '\nSummary:' + str(row['highlights'])


In [11]:
train = train[['final_statement']]

In [12]:
train

Unnamed: 0,final_statement
272581,Summarize the following article: \n\nby mia de...
772,Summarize the following article: \n\na couple ...
171868,Summarize the following article: \n\nvideo foo...
63167,Summarize the following article: \n\nistanbul ...
68522,Summarize the following article: \n\nby daily ...
...,...
271171,Summarize the following article: \n\nby matt c...
146080,Summarize the following article: \n\nst poelte...
270020,Summarize the following article: \n\nby hugo g...
126659,Summarize the following article: \n\n cnn it s...


# Tokenization Section



In [13]:
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-1b1')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/222 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

In [14]:
tokenizer.pad_token = tokenizer.eos_token

In [15]:
def tokenize_function(examples):
  examples['input_ids'] = tokenizer(examples['final_statement'], padding='max_length', truncation=True, max_length=512, return_tensors='pt').input_ids
  examples['labels'] = tokenizer(examples['final_statement'], padding='max_length', truncation=True, max_length=512, return_tensors='pt').input_ids
  return examples

In [16]:
train_data = Dataset.from_pandas(train)
train_tokenized_data = train_data.map(tokenize_function, batched=True, remove_columns=train_data.column_names)

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [17]:
train_tokenized_data

Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 5000
})

# QLoRA Implementation

## Quantization

In [18]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
)

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b1", quantization_config=quant_config)

config.json:   0%|          | 0.00/693 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now set to True since model is quantized.


model.safetensors:   0%|          | 0.00/2.13G [00:00<?, ?B/s]

## LoRA

In [19]:
peft_params = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.01,
    r=8,
    bias='none',
    task_type='CAUSAL_LM'
)
peft_model = get_peft_model(model, peft_params)

In [20]:
peft_model.print_trainable_parameters()

trainable params: 1,179,648 || all params: 1,066,493,952 || trainable%: 0.1106


# Training the model

In [None]:
import huggingface_hub

In [24]:
training_args = TrainingArguments(
    output_dir= './model_checkpoints',
    save_total_limit = 1,
    auto_find_batch_size = True,
    learning_rate = 1e-3,
    num_train_epochs = 3,
    resume_from_checkpoint=True,
)

trainer = Trainer(
    model = peft_model,
    train_dataset = train_tokenized_data,
    args = training_args,
)

trainer.train()
trainer.model.save_pretrained('./model_final')
tokenizer.save_pretrained('./model_final')

Step,Training Loss
500,3.1762
1000,3.1355
1500,3.0918
2000,3.0829
2500,3.0615
3000,2.9667
3500,2.9908




('./model_final/tokenizer_config.json',
 './model_final/special_tokens_map.json',
 './model_final/tokenizer.json')

In [37]:
tokenizer_final = AutoTokenizer.from_pretrained('./model_final')
model_final = AutoModelForCausalLM.from_pretrained("./model_final").to('cuda')



In [38]:
#testing

article = '''summarize the following article:\n\n in commemorating the 35th anniversary of asean rok dialogue relations the international conference on asean korea cultural heritage cooperation with the theme the future of asean korea cooperation cultural heritage and socio cultural solidarity was held in seoul rok on 26 june 2024 h e ekkaphab phanthavong deputy secretary general of asean for asean socio cultural community delivered an opening remark at the international conference in attendance were mr choi eung cheon administrator of korea heritage service and mr jeong byung won deputy minister of foreign affairs alongside ambassadors and representatives of asean member states as well as members of the asean rok working committee on cultural heritage cooperation the international conference held back to back with the 4th asean rok working committee on cultural heritage cooperation discussed the opportunities and way forward to enhance cultural heritage cooperation between asean and the rok\nsummary\n'''


# article = filter_text(article)
input_id = tokenizer(article, padding='max_length', truncation=True, max_length=512, return_tensors='pt').input_ids.to('cuda')


In [None]:
input_id

In [29]:
article

'summarize the following article:\n\n in commemorating the 35th anniversary of asean rok dialogue relations the international conference on asean korea cultural heritage cooperation with the theme the future of asean korea cooperation cultural heritage and socio cultural solidarity was held in seoul rok on 26 june 2024 h e ekkaphab phanthavong deputy secretary general of asean for asean socio cultural community delivered an opening remark at the international conference in attendance were mr choi eung cheon administrator of korea heritage service and mr jeong byung won deputy minister of foreign affairs alongside ambassadors and representatives of asean member states as well as members of the asean rok working committee on cultural heritage cooperation the international conference held back to back with the 4th asean rok working committee on cultural heritage cooperation discussed the opportunities and way forward to enhance cultural heritage cooperation between asean and the rok\nsumm

In [34]:
# del peft_model
# torch.cuda.empty_cache()

In [52]:
output = model_final.generate(input_id, max_new_tokens=512)

In [53]:
output_decoded = tokenizer.decode(output[0], skip_special_tokens = True)

In [56]:
print(output_decoded.split('Summary:')[1])

the international conference on asean korea cultural heritage cooperation was held in seoul on 26 june 2024 the conference discussed the opportunities and way forward to enhance cultural heritage cooperation between asean and the rok working committee discussed the opportunities and way forward to enhance cultural heritage cooperation between asean and the rok 



In [57]:
import huggingface_hub
huggingface_hub.login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [58]:
model_final.push_to_hub('bloom-1b1-news-summarizer')
tokenizer_final.push_to_hub('bloom-1b1-news-summarizer')



adapter_model.safetensors:   0%|          | 0.00/4.73M [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/hilmiatha/bloom-1b1-news-summarizer/commit/91cb0a7b29380d9d589e0d85f17392b146776544', commit_message='Upload tokenizer', commit_description='', oid='91cb0a7b29380d9d589e0d85f17392b146776544', pr_url=None, pr_revision=None, pr_num=None)