<a href="https://colab.research.google.com/github/leePhilip23/NLP_News_Summarization/blob/features%2Fadd_colab/finetuning/QLoRA_Article.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Configure Data

We first load raw news articles from the "Cornell Newsroom" dataset and convert them into a form that the later steps can use.

In [None]:
# Install all dependencies
! pip install jsonlines
! pip install transformers
! pip install datasets
! pip install peft
! pip install rouge_score
! pip install -U accelerate

Collecting jsonlines
  Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)
Installing collected packages: jsonlines
Successfully installed jsonlines-4.0.0
Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64

In [None]:
from google.colab import drive

# Access google account for data
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import jsonlines
import pandas as pd
import numpy as np

text = []
summary = []

# Parse through jsonl file to gather data
with jsonlines.open('/content/drive/MyDrive/test.jsonl') as f:
    for line in f.iter():
        text.append(line['text'])
        summary.append(line['summary'])

In [None]:
import random

rand_dict = set()
rand_text = []
rand_summary = []
count = 0

# Random sample 1250 data points
while count < 1250:
  rand_num = random.randint(0, len(text)-1)
  if rand_num not in rand_dict:
    rand_text.append(text[rand_num])
    rand_summary.append(summary[rand_num])
    rand_dict.add(rand_num)
    count += 1
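
The loop above keeps drawing indices until 1,250 distinct ones have been collected. A minimal sketch of the same idea using `random.sample`, which draws without replacement in a single call. The toy lists, the helper name `sample_pairs`, and the fixed seed are assumptions for illustration and are not part of the original notebook.

```python
import random

# Toy stand-ins for the `text` and `summary` lists (made-up data, illustration only)
toy_text = [f"article {i}" for i in range(20)]
toy_summary = [f"summary {i}" for i in range(20)]

def sample_pairs(texts, summaries, k, seed=None):
    """Return k (text, summary) pairs drawn without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    indices = rng.sample(range(len(texts)), k)  # k distinct indices
    return [texts[i] for i in indices], [summaries[i] for i in indices]

sampled_text, sampled_summary = sample_pairs(toy_text, toy_summary, k=5, seed=0)
```

Because the indices are drawn once and reused for both lists, each sampled article stays paired with its own summary.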

In [None]:
from datasets import DatasetDict, Dataset

full_data = {
  'text':text,
  'summary':summary
}

data = {
  'text':rand_text,
  'summary':rand_summary
}

# Turn full and random sampled dataset into dataset object
ds = Dataset.from_pandas(pd.DataFrame(full_data))
rand_ds = Dataset.from_pandas(pd.DataFrame(data))

In [None]:
# Save the full dataset to disk so it can be zipped later
ds.save_to_disk('/content/full_article.hf')

Saving the dataset (0/1 shards):   0%|          | 0/108862 [00:00<?, ? examples/s]

In [None]:
# Save the randomly sampled dataset to disk so it can be zipped later
rand_ds.save_to_disk('/content/article.hf')

Saving the dataset (0/1 shards):   0%|          | 0/1250 [00:00<?, ? examples/s]

In [None]:
# Zip the saved datasets so anyone can download them and upload them to Google Drive manually
! zip -r /content/ds.zip /content/article.hf /content/full_article.hf

In [None]:
from datasets import load_from_disk, load_dataset

# Load the sampled dataset that was uploaded to the Hugging Face Hub
dataset = load_dataset('philTheThill/news-articles')
dataset = dataset["train"].train_test_split(test_size=0.2)

dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'summary'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['text', 'summary'],
        num_rows: 250
    })
})

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id="google/flan-t5-base"

# Load tokenizer and model of FLAN-t5-base
tokenizer = AutoTokenizer.from_pretrained(model_id)
original = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Function To Find ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum

These are the primary metrics we will use to compare Flan-T5 Base under N-shot inference with Flan-T5 Base fine-tuned using QLoRA.
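
To make the precision, recall, and F-measure values reported later easier to read, here is a simplified sketch of ROUGE-N on a single pair of sentences. It assumes lowercased whitespace tokens and no stemming, so it will not reproduce the library's scores exactly; the function name `rouge_n` and the example sentences are made up for illustration.

```python
from collections import Counter

def rouge_n(reference, prediction, n=1):
    """Simplified ROUGE-N: lowercased whitespace tokens, no stemming."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, pred = ngrams(reference), ngrams(prediction)
    overlap = sum((ref & pred).values())  # clipped count of shared n-grams
    precision = overlap / max(sum(pred.values()), 1)  # share of predicted n-grams that match
    recall = overlap / max(sum(ref.values()), 1)  # share of reference n-grams that are covered
    f_measure = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

p1, r1, f1 = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1)
p2, r2, f2 = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2)
```

Here five of the six unigrams match, while only three of the five bigrams do, which is why ROUGE-2 is typically lower than ROUGE-1.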

In [None]:
from datasets import load_metric

# Load the ROUGE metric
rouge_metric = load_metric("rouge")

def get_scores(references, predictions, rouge_metric=rouge_metric):
  # Compute ROUGE scores
  results = rouge_metric.compute(
      predictions=predictions, # Predicted Summaries
      references=references, # Actual summaries
      use_aggregator=True, # Aggregate with bootstrap resampling; returns low/mid/high scores per metric
      # Stemming matches word variants and usually raises ROUGE scores; left disabled here
      # use_stemmer=True
  )

  # Return the scores
  return results


In [None]:
import random

# Draw a random sample of articles and their reference summaries
def random_sample(data, sample_size=50):
   # Pick unique row indices so that no article appears twice
   indices = random.sample(range(len(data)), sample_size)

   sample_text = [data[i]['text'] for i in indices] # Sampled articles
   sample_summary = [data[i]['summary'] for i in indices] # Matching reference summaries

   return sample_text, sample_summary


# Get summary of article
def n_shot_learning(prompts):
   # Tokenize input text
   tokens_input = tokenizer(
      prompts,
      return_tensors='pt',
      max_length=512,
      padding="max_length",
      truncation=True
   )

   # Generate summary
   summary_ids = original.generate(
      input_ids=tokens_input['input_ids'],
      attention_mask=tokens_input['attention_mask'], # Ignore padding tokens
      max_length=200
   )

   # Decode the generated token IDs back into text
   summary = tokenizer.decode(
      summary_ids[0],
      skip_special_tokens=True
   )

   # Return finalized summary
   return summary

# Zero Shot Inference

Below, we use zero-shot inference to measure the baseline performance of the model.

In [None]:
# Store all new strings
start = "Summarize the following article:\n\n"
end = "\n\nSummary:"

samples, summaries = random_sample(dataset['test'])
prompts = [start + dialogue + end for dialogue in samples]

# Store all results of zero shot inferences
zero_shot = []

# Zero shot inference
for p in prompts:
  summary = n_shot_learning(p)
  zero_shot.append(summary)

# Zero Shot Metrics

ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum are calculated for zero-shot inference.

In [None]:
res = get_scores(summaries, zero_shot)

names = ["Rouge 1", "Rouge 2", "Rouge L", "Rouge L-Sum"]
interval = ["Low (2.5th percentile)", "Medium (50th percentile)", "High (97.5th percentile)"]
scores = ["Precision:", "Recall:", "F-Measure:"]
index = 0

for k, v in res.items():
   print(f"{names[index]}")
   index += 1
   for i in range(3):
     precision = f"Precision: {round(v[i][0], 3)}"
     recall = f"Recall: {round(v[i][1], 3)}"
     f_measure = f"F-Measure: {round(v[i][2], 3)}"

     print(f"\t{interval[i]} Confidence Interval:")
     print(f"\t\t" + precision)
     print(f"\t\t" + recall)
     print(f"\t\t" + f_measure)


Rouge 1
	Low (2.5th percentile) Confidence Interval:
		Precision: 0.244
		Recall: 0.236
		F-Measure: 0.212
	Medium (50th percentile) Confidence Interval:
		Precision: 0.324
		Recall: 0.318
		F-Measure: 0.279
	High (97.5th percentile) Confidence Interval:
		Precision: 0.412
		Recall: 0.404
		F-Measure: 0.35
Rouge 2
	Low (2.5th percentile) Confidence Interval:
		Precision: 0.107
		Recall: 0.101
		F-Measure: 0.091
	Medium (50th percentile) Confidence Interval:
		Precision: 0.182
		Recall: 0.18
		F-Measure: 0.158
	High (97.5th percentile) Confidence Interval:
		Precision: 0.273
		Recall: 0.268
		F-Measure: 0.236
Rouge L
	Low (2.5th percentile) Confidence Interval:
		Precision: 0.204
		Recall: 0.202
		F-Measure: 0.181
	Medium (50th percentile) Confidence Interval:
		Precision: 0.279
		Recall: 0.275
		F-Measure: 0.241
	High (97.5th percentile) Confidence Interval:
		Precision: 0.367
		Recall: 0.361
		F-Measure: 0.312
Rouge L-Sum
	Low (2.5th percentile) Confidence Interval:
		Precision: 0.204
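
The low, mid, and high rows printed above are percentile bounds over resampled scores rather than three separate evaluations. A minimal sketch of that bootstrap idea, assuming per-example F-measures are resampled with replacement and the 2.5th, 50th, and 97.5th percentiles of the resampled means are reported. The scores, the helper name `bootstrap_interval`, and the nearest-rank percentile rule are assumptions for illustration, not the library's exact implementation.

```python
import random
import statistics

def bootstrap_interval(scores, n_resamples=1000, seed=0):
    """Return (low, mid, high) percentiles of bootstrap means of `scores`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample the per-example scores with replacement
        resample = [rng.choice(scores) for _ in scores]
        means.append(statistics.fmean(resample))
    means.sort()

    def percentile(q):
        # Nearest-rank percentile on the sorted bootstrap means
        idx = min(len(means) - 1, max(0, round(q / 100 * (len(means) - 1))))
        return means[idx]

    return percentile(2.5), percentile(50), percentile(97.5)

# Made-up per-example F-measures
toy_f_measures = [0.21, 0.35, 0.28, 0.40, 0.19, 0.31, 0.25, 0.33]
low, mid, high = bootstrap_interval(toy_f_measures)
```

With only 50 evaluated articles the interval is wide, so small differences between two runs should be read with caution.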

# One Shot Inference

Below, we use one-shot inference, which adds one worked example to the prompt, to see how the model performs.

In [None]:
import random

# Pick one random training example to serve as the in-context demonstration
rand_num = random.randint(0, len(dataset['train'])-1)
rand_text = dataset['train'][rand_num]['text']
rand_summ = dataset['train'][rand_num]['summary']

one_example = f"""
   Summarize the following article:

   {rand_text}

   Summary:

   {rand_summ}
"""

# Store all new strings
start = "\n\nSummarize the following article:\n\n"
end = "\n\nSummary:"

prompts = [one_example + start + dialogue + end for dialogue in samples]

# Store all results of one shot inferences
one_shot = []

# One shot inference
for p in prompts:
  summary = n_shot_learning(p)
  one_shot.append(summary)

# One Shot Metrics

ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum are calculated for one-shot inference.

In [None]:
res = get_scores(summaries, one_shot)

names = ["Rouge 1", "Rouge 2", "Rouge L", "Rouge L-Sum"]
interval = ["Low (2.5th percentile)", "Medium (50th percentile)", "High (97.5th percentile)"]
scores = ["Precision:", "Recall:", "F-Measure:"]
index = 0

for k, v in res.items():
   print(f"{names[index]}")
   index += 1
   for i in range(3):
     precision = f"Precision: {round(v[i][0], 3)}"
     recall = f"Recall: {round(v[i][1], 3)}"
     f_measure = f"F-Measure: {round(v[i][2], 3)}"

     print(f"\t{interval[i]} Confidence Interval:")
     print(f"\t\t" + precision)
     print(f"\t\t" + recall)
     print(f"\t\t" + f_measure)

Rouge 1
	Low (2.5th percentile) Confidence Interval:
		Precision: 0.153
		Recall: 0.06
		F-Measure: 0.082
	Medium (50th percentile) Confidence Interval:
		Precision: 0.185
		Recall: 0.075
		F-Measure: 0.101
	High (97.5th percentile) Confidence Interval:
		Precision: 0.218
		Recall: 0.091
		F-Measure: 0.119
Rouge 2
	Low (2.5th percentile) Confidence Interval:
		Precision: 0.0
		Recall: 0.0
		F-Measure: 0.0
	Medium (50th percentile) Confidence Interval:
		Precision: 0.002
		Recall: 0.001
		F-Measure: 0.001
	High (97.5th percentile) Confidence Interval:
		Precision: 0.006
		Recall: 0.002
		F-Measure: 0.003
Rouge L
	Low (2.5th percentile) Confidence Interval:
		Precision: 0.136
		Recall: 0.054
		F-Measure: 0.073
	Medium (50th percentile) Confidence Interval:
		Precision: 0.162
		Recall: 0.066
		F-Measure: 0.088
	High (97.5th percentile) Confidence Interval:
		Precision: 0.189
		Recall: 0.08
		F-Measure: 0.103
Rouge L-Sum
	Low (2.5th percentile) Confidence Interval:
		Precision: 0.135
		Rec

# Few (Two) Shot Inference

Below, we use few-shot inference, which adds two worked examples to the prompt, to see how the model performs.

In [None]:
import random

# Pick two distinct random training examples as in-context demonstrations
rand_nums = random.sample(range(len(dataset['train'])), 2)
rand_text = [dataset['train'][i]['text'] for i in rand_nums]
rand_summ = [dataset['train'][i]['summary'] for i in rand_nums]

# Prompt
few_examples = f"""
  Summarize the following article:
  {rand_text[0]}

  Summary:
  {rand_summ[0]}

  Summarize the following article:
  {rand_text[1]}

  Summary:
  {rand_summ[1]}
"""

# Store all new strings
start = "\n\nSummarize the following article:\n\n"
end = "\n\nSummary:"

samples, summaries = random_sample(dataset['test'])
prompts = [few_examples + start + dialogue + end for dialogue in samples]

# Store all results of few shot inferences
few_shot = []

# Few shot inference
for p in prompts:
  summary = n_shot_learning(p)
  few_shot.append(summary)

# Few Shot Metrics

ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum are calculated for few-shot inference.

In [None]:
res = get_scores(summaries, few_shot)

names = ["Rouge 1", "Rouge 2", "Rouge L", "Rouge L-Sum"]
interval = ["Low (2.5th percentile)", "Medium (50th percentile)", "High (97.5th percentile)"]
scores = ["Precision:", "Recall:", "F-Measure:"]
index = 0

for k, v in res.items():
   print(names[index])
   index += 1
   for i in range(3):
     precision = f"Precision: {round(v[i][0], 3)}"
     recall = f"Recall: {round(v[i][1], 3)}"
     f_measure = f"F-Measure: {round(v[i][2], 3)}"

     print(f"\t{interval[i]} Confidence Interval:")
     print(f"\t\t" + precision)
     print(f"\t\t" + recall)
     print(f"\t\t" + f_measure, end='\n')

Rouge 1
	Low (2.5th percentile) Confidence Interval:
		Precision: 0.0
		Recall: 0.0
		F-Measure: 0.0
	Medium (50th percentile) Confidence Interval:
		Precision: 0.001
		Recall: 0.001
		F-Measure: 0.001
	High (97.5th percentile) Confidence Interval:
		Precision: 0.002
		Recall: 0.003
		F-Measure: 0.002
Rouge 2
	Low (2.5th percentile) Confidence Interval:
		Precision: 0.0
		Recall: 0.0
		F-Measure: 0.0
	Medium (50th percentile) Confidence Interval:
		Precision: 0.0
		Recall: 0.0
		F-Measure: 0.0
	High (97.5th percentile) Confidence Interval:
		Precision: 0.0
		Recall: 0.0
		F-Measure: 0.0
Rouge L
	Low (2.5th percentile) Confidence Interval:
		Precision: 0.0
		Recall: 0.0
		F-Measure: 0.0
	Medium (50th percentile) Confidence Interval:
		Precision: 0.001
		Recall: 0.001
		F-Measure: 0.001
	High (97.5th percentile) Confidence Interval:
		Precision: 0.002
		Recall: 0.003
		F-Measure: 0.002
Rouge L-Sum
	Low (2.5th percentile) Confidence Interval:
		Precision: 0.0
		Recall: 0.0
		F-Measure: 0.

# Fine-Tuning Flan-T5 Base

As the metrics above show, zero-shot inference produced better summaries than one-shot and few-shot inference. The likely cause is the 512-token input limit: most articles already exceed 512 tokens on their own, so prepending examples to the prompt pushes the target article out of the truncated input. Below, we fine-tune the model on the same randomly sampled news articles using QLoRA to see whether the results improve.
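
The effect of the 512-token limit on prompts with prepended examples can be sketched with a toy calculation. This assumes whitespace-separated words as a rough stand-in for tokenizer tokens, and the example and article lengths (600 and 800) are made up; the helper names are hypothetical.

```python
MAX_TOKENS = 512  # same limit that n_shot_learning passes to the tokenizer

def truncate(prompt, max_tokens=MAX_TOKENS):
    """Keep only the first max_tokens whitespace tokens (rough stand-in for the real tokenizer)."""
    return prompt.split()[:max_tokens]

def target_tokens_kept(example_len, article_len, max_tokens=MAX_TOKENS):
    """Count how many tokens of the target article survive truncation
    when an example of example_len tokens is placed in front of it."""
    example = " ".join(["ex"] * example_len)
    article = " ".join(["art"] * article_len)
    kept = truncate(example + " " + article, max_tokens)
    return kept.count("art")

zero_shot_kept = target_tokens_kept(example_len=0, article_len=800)
one_shot_kept = target_tokens_kept(example_len=600, article_len=800)
```

In this toy setting the zero-shot prompt keeps 512 tokens of the article, while the one-shot prompt keeps none of it, because the example alone fills the whole window.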

In [None]:
from datasets import load_from_disk, load_dataset

dataset = load_dataset('philTheThill/news-articles')
dataset = dataset["train"].train_test_split(test_size=0.2)

dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'summary'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['text', 'summary'],
        num_rows: 250
    })
})
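
Note that each call to `train_test_split` reshuffles the rows unless a seed is given, so the test split used later for evaluation may overlap with the rows used here for training. `train_test_split` accepts a `seed` argument for this purpose. A minimal sketch of the principle in plain Python follows; the helper `split_indices` and the seed value 42 are assumptions for illustration.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=None):
    """Shuffle row indices and split them into (train, test) lists."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

# The same seed yields the same split on every call
train_a, test_a = split_indices(1250, seed=42)
train_b, test_b = split_indices(1250, seed=42)
```

Fixing the seed keeps the 1,000 training rows and 250 test rows identical across cells, so the fine-tuned model is not evaluated on rows it was trained on.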

# Preprocess Data For Fine-Tuning

The inputs and targets are tokenized with truncation and padding. Padding tokens in the labels are replaced with -100 so that the loss function ignores them during training.

In [None]:
def preprocess(sample, padding="max_length"):
    # add prefix to the input for t5
    inputs = ["Summarize\n\n: " + item for item in sample["text"]]

    # Tokenize input data
    model_inputs = tokenizer(
        inputs,
        max_length=tokenizer.model_max_length,
        padding="max_length",
        truncation=True,
        #return_tensors='pt' # Training is quicker without tensors
    )

    # Tokenize targets with the `text_target` keyword argument
    outputs = tokenizer(
        text_target=sample["summary"],
        max_length=tokenizer.model_max_length,
        padding="max_length",
        truncation=True,
        #return_tensors='pt', # Training is quicker without tensors
    )

    # Replace every tokenizer.pad_token_id in the labels with -100 so that
    # padding is ignored when the loss is computed, since it carries no information
    if padding == "max_length":
        outputs["input_ids"] = [
            [(o if o != tokenizer.pad_token_id else -100) for o in output] for output in outputs["input_ids"]
        ]

    model_inputs["labels"] = outputs["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess, batched=True, remove_columns=["text", "summary"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
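
To illustrate why the labels use -100, here is a toy sketch of how masked positions drop out of the averaged loss. The token IDs and probabilities are made up, and `mean_nll` is a hypothetical helper that imitates a cross-entropy loss with an ignore index; it is not the loss function the trainer actually calls.

```python
import math

IGNORE_INDEX = -100
PAD_TOKEN_ID = 0  # T5 tokenizers use 0 as the pad ID

def mask_padding(label_ids, pad_id=PAD_TOKEN_ID):
    """Replace pad IDs with IGNORE_INDEX, as preprocess() does for the labels."""
    return [tok if tok != pad_id else IGNORE_INDEX for tok in label_ids]

def mean_nll(token_probs, labels):
    """Average negative log-likelihood over positions whose label is not IGNORE_INDEX."""
    losses = [-math.log(p) for p, lab in zip(token_probs, labels) if lab != IGNORE_INDEX]
    return sum(losses) / len(losses)

labels = mask_padding([71, 1527, 1, 0, 0])
# Made-up probability the model assigns to each target token
probs = [0.5, 0.25, 0.5, 0.01, 0.01]
loss = mean_nll(probs, labels)
```

Only the three real tokens contribute to the loss; the two padded positions, despite their low probabilities, have no effect on it.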


In [None]:
# Show tokenized dataset
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 250
    })
})

In [None]:
# Show that Tesla T4 GPU is available
!nvidia-smi

Mon Oct 16 19:00:29 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   67C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
import torch

print(f"CUDA Available: {torch.cuda.is_available()}")
print(f"Current Device Index: {torch.cuda.current_device()}")
print(f"Current GPU Used: {torch.cuda.get_device_name(0)}")

CUDA Available: True
Current Device Index: 0
Current GPU Used: Tesla T4


In [None]:
# To use BitsAndBytesConfig with the model, you have to
# update the accelerate package and restart the runtime. Note: this
# only works with a GPU.
! pip install bitsandbytes
! pip install -U accelerate

Collecting bitsandbytes
  Downloading bitsandbytes-0.41.1-py3-none-any.whl (92.6 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.41.1


In [None]:
from transformers import BitsAndBytesConfig
import torch

# Load base model using 4-bit quantization
bnb_config = BitsAndBytesConfig(
  load_in_4bit=True, # Load the model weights in 4-bit precision
  bnb_4bit_use_double_quant=True, # Also quantize the quantization constants to save more memory
  bnb_4bit_quant_type="nf4", # Use NF4 (4-bit NormalFloat), reported to work better than FP4
  bnb_4bit_compute_dtype=torch.bfloat16, # Computation done in bfloat16
)
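
As a rough intuition for what 4-bit loading does to the weights, here is a toy sketch of absmax quantization on one small block. It uses evenly spaced integer levels, whereas NF4 uses a non-uniform set of levels and bitsandbytes performs the work on the GPU, so this is an illustration of the general idea only. The function names and weight values are made up.

```python
def quantize_block(weights, n_bits=4):
    """Map floats to signed integer codes using one absmax scale for the block."""
    max_code = 2 ** (n_bits - 1) - 1  # 7 for 4 bits
    absmax = max(abs(w) for w in weights)
    scale = absmax / max_code if absmax > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

block = [0.42, -1.4, 0.07, 0.9]  # made-up weights
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(block, restored))
```

Each weight is stored as a small integer plus one shared scale per block, and the rounding error per weight is at most half of that scale.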

In [None]:
from transformers import AutoModelForSeq2SeqLM

model_id="google/flan-t5-base"

# Load model of FLAN-T5-base from the hub and quantize it
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    quantization_config=bnb_config
)

In [None]:
from peft import LoraConfig, get_peft_model, TaskType

# Define LoRA Config
lora_config = LoraConfig(
  r=8,
  lora_alpha=32,
  target_modules=["q", "v"],
  lora_dropout=0.05,
  bias="none",
  task_type=TaskType.SEQ_2_SEQ_LM
)

# Add LoRA adaptor
peft_model = get_peft_model(model, lora_config).to('cuda')
peft_model.print_trainable_parameters()

trainable params: 884,736 || all params: 248,462,592 || trainable%: 0.3560841867092814
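
The trainable parameter count printed above can be reproduced by hand. The sketch below assumes the Flan-T5 Base dimensions from the public model configuration (hidden size 768, 12 encoder layers, 12 decoder layers), which are not stated in this notebook.

```python
# Assumed Flan-T5 Base dimensions (from the public model config, not from this notebook)
D_MODEL = 768
ENCODER_LAYERS = 12
DECODER_LAYERS = 12
RANK = 8  # r in LoraConfig

# Attention blocks: encoder self-attention, decoder self-attention, decoder cross-attention
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_modules = attention_blocks * 2  # the "q" and "v" projections in each block

# Each adapted matrix gets two low-rank factors: A (r x d_in) and B (d_out x r)
params_per_module = RANK * (D_MODEL + D_MODEL)
trainable = adapted_modules * params_per_module

total = 248_462_592  # "all params" from the printed output
percent = 100 * trainable / total
```

Under these assumptions, 72 adapted projections with 12,288 parameters each give 884,736 trainable parameters, matching the printed output.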


In [None]:
from transformers import AutoTokenizer

# Load tokenizer of FLAN-T5-base
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    max_length=512,
    padding="max_length",
    truncation=True
)

In [None]:
from transformers import DataCollatorForSeq2Seq

# Ignore tokenizer pad token in the loss since it's irrelevant info
label_pad_token_id = -100

# Data collator for ignoring pad token
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=peft_model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

In [None]:
from transformers import Trainer, TrainingArguments

output_dir="FLAN-T5-Base/QLoRA-Article"

# Training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_steps=200,
    report_to=None
)

# Trainer Instance
peft_trainer = Trainer(
    model=peft_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test']
)

# Disable the generation key/value cache, which is not needed during training
peft_model.config.use_cache=False

In [None]:
# Train QLoRA model
peft_trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
200,1.6822
400,1.65
600,1.5595
800,1.5296
1000,1.3936
1200,1.4617


TrainOutput(global_step=1250, training_loss=1.5351587890625, metrics={'train_runtime': 1199.6038, 'train_samples_per_second': 4.168, 'train_steps_per_second': 1.042, 'total_flos': 3437376307200000.0, 'train_loss': 1.5351587890625, 'epoch': 5.0})
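
Since `auto_find_batch_size=True` chooses the batch size at run time, the value actually used can be inferred from the step count above. A short worked calculation, assuming a single GPU and no gradient accumulation:

```python
import math

train_rows = 1000    # rows in the training split
epochs = 5           # num_train_epochs
global_step = 1250   # from TrainOutput above

steps_per_epoch = global_step // epochs
# Batch size b that satisfies ceil(train_rows / b) == steps_per_epoch
implied_batch_size = math.ceil(train_rows / steps_per_epoch)
```

This gives 250 steps per epoch and an implied batch size of 4, which agrees with the ratio of `train_samples_per_second` to `train_steps_per_second` (4.168 / 1.042).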

In [None]:
# Evaluate test dataset
peft_trainer.evaluate()

{'eval_loss': 1.4248046875,
 'eval_runtime': 26.8307,
 'eval_samples_per_second': 9.318,
 'eval_steps_per_second': 1.193,
 'epoch': 5.0}

In [None]:
from huggingface_hub import notebook_login

# Log in to Hugging Face so the model can be pushed to the Hugging Face Hub
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
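# Deploy QLoRA Flan-T5 Base model to the Hub. The repository name is derived from
# output_dir ("QLoRA-Article"); the argument below is only used as the commit message.
peft_trainer.push_to_hub("philTheThill/QLoRA-Articles")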

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

adapter_model.bin:   0%|          | 0.00/3.59M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/4.09k [00:00<?, ?B/s]

'https://huggingface.co/philTheThill/QLoRA-Article/tree/main/'

In [None]:
# You can also save model locally
peft_trainer.save_model("/content/QLoRA-Articles")

# Fine-Tuned Model Metrics

ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum are measured once more, this time for the fine-tuned model after it has been deployed to Hugging Face.

In [None]:
from datasets import load_from_disk, load_dataset

# Load dataset and train-test-split
dataset = load_dataset('philTheThill/news-articles')
dataset = dataset["train"].train_test_split(test_size=0.2)

dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'summary'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['text', 'summary'],
        num_rows: 250
    })
})

In [None]:
from transformers import pipeline, AutoTokenizer

# Load tokenizer of FLAN-t5-base
token_id = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(
    token_id,
    max_length=512,
    padding="max_length",
    truncation=True
)

model_id = "philTheThill/QLoRA-Article"
model = pipeline(
    "summarization",
    model=model_id,
    tokenizer=tokenizer,
    device=-1 # device=0 if utilizing GPU for inference
)

In [None]:
# Summaries made from model
fine_tuned_summ = []

samples, summaries = random_sample(dataset['test'])

# Summarize each sampled article with the fine-tuned model
for s in samples:
  summary = model(s)[0]['summary_text'] # Keep only the generated text
  fine_tuned_summ.append(summary)

Token indices sequence length is longer than the specified maximum sequence length for this model (1367 > 512). Running this sequence through the model will result in indexing errors
Your max_length is set to 200, but your input_length is only 183. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=91)
Your max_length is set to 200, but your input_length is only 169. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=84)
Your max_length is set to 200, but your input_length is only 138. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=69)


In [None]:
res = get_scores(summaries, fine_tuned_summ)

names = ["Rouge 1", "Rouge 2", "Rouge L", "Rouge L-Sum"]
interval = ["Low (2.5th percentile)", "Medium (50th percentile)", "High (97.5th percentile)"]
scores = ["Precision:", "Recall:", "F-Measure:"]
index = 0

for k, v in res.items():
   print(names[index])
   index += 1
   for i in range(3):
     precision = f"Precision: {round(v[i][0], 3)}"
     recall = f"Recall: {round(v[i][1], 3)}"
     f_measure = f"F-Measure: {round(v[i][2], 3)}"

     print(f"\t{interval[i]} Confidence Interval:")
     print(f"\t\t" + precision)
     print(f"\t\t" + recall)
     print(f"\t\t" + f_measure, end='\n')

Rouge 1
	Low (2.5th percentile) Confidence Interval:
		Precision: 0.266
		Recall: 0.41
		F-Measure: 0.299
	Medium (50th percentile) Confidence Interval:
		Precision: 0.344
		Recall: 0.493
		F-Measure: 0.371
	High (97.5th percentile) Confidence Interval:
		Precision: 0.432
		Recall: 0.589
		F-Measure: 0.454
Rouge 2
	Low (2.5th percentile) Confidence Interval:
		Precision: 0.162
		Recall: 0.218
		F-Measure: 0.173
	Medium (50th percentile) Confidence Interval:
		Precision: 0.254
		Recall: 0.323
		F-Measure: 0.264
	High (97.5th percentile) Confidence Interval:
		Precision: 0.352
		Recall: 0.423
		F-Measure: 0.357
Rouge L
	Low (2.5th percentile) Confidence Interval:
		Precision: 0.23
		Recall: 0.359
		F-Measure: 0.258
	Medium (50th percentile) Confidence Interval:
		Precision: 0.317
		Recall: 0.448
		F-Measure: 0.341
	High (97.5th percentile) Confidence Interval:
		Precision: 0.404
		Recall: 0.54
		F-Measure: 0.424
Rouge L-Sum
	Low (2.5th percentile) Confidence Interval:
		Precision: 0.231


# Testing Inference

After fine-tuning, we test inference on individual articles to see whether the summaries of news articles have improved noticeably.

In [None]:
import random

# Get a random article to summarize
r = random.randint(0, len(dataset['test'])-1)
text = dataset['test'][r]['text']

In [None]:
# Print sample article
text

'Many of Julie Taymor’s signature touches in Broadway’s “Spider-Man: Turn Off the Dark” would be cut or altered in the producers’ new creative plan, which includes scaling back the villainess Arachne, dropping the “Deeply Furious” number of shoe-wearing spider-ladies, and reshaping the Geek Chorus of narrators, according to three people who work on the show and were briefed Thursday on plans.\n\nThe producers announced Wednesday that Ms. Taymor was stepping aside from the $65 million production because of schedule conflicts, though she will still be billed as its director and a script writer. Taking over to reshape the show will be the theater and circus director Philip William McKinley (Broadway’s “Boy From Oz”) and the playwright Roberto Aguirre-Sacasa.\n\nFriends and colleagues of Ms. Taymor have said she was forced out because she would not make extensive changes that the producers wanted.\n\nThe producers have now decided that they will shut down the show sometime this spring, but

In [None]:
# Print summary from dataset
summary = dataset['test'][r]['summary']
summary

'Producers will cut some of Julie Taymor’s signature touches as they reshape the Broadway show “Spider-Man: Turn Off the Dark.”'

In [None]:
# Get the summary from fine-tuned model
predicted = model(dataset['test'][r]['text'])[0]['summary_text']

In [None]:
# Print the predicted summary
predicted

'Julie Taymor is stepping aside from the $65 million production of “Spider-Man: Turn Off the Dark” because of schedule conflicts, according to three people who work on the show'

In [None]:
# Get another random article to summarize
r = random.randint(0, len(dataset['test'])-1)
text = dataset['test'][r]['text']

In [None]:
# Print sample article again
text

'A Lebanese official says Beirut airport authorities have foiled one of the country’s largest drug smuggling attempts, seizing two tonnes of the amphetamine fenethylline before they were loaded on to the private plane of a Saudi prince.\n\nThe official said the prince and four others had been detained on Monday. He spoke on condition of anonymity because he was not authorised to give official statements.\n\nThe manufacture of fenethylline pills thrives in Lebanon and war-torn Syria, which have become a gateway for the drug to the Middle East and particularly the Gulf.\n\nIn a 2014 report, the United Nations Office of Drugs and Crime says the amphetamine market is on the rise in the Middle East, with Saudi Arabia, Jordan and Syria accounting for more than 55% of amphetamines seized worldwide.'

In [None]:
# Print summary from dataset again
summary = dataset['test'][r]['summary']
summary

'Prince and four others detained after fenethylline pills were confiscated before they were loaded on to private jet in Beirut'

In [None]:
# Get the summary from fine-tuned model
predicted = model(dataset['test'][r]['text'])[0]['summary_text']

Your max_length is set to 200, but your input_length is only 196. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=98)


In [None]:
# Print the predicted summary
predicted

'Lebanese authorities have foiled one of the country’s largest drug smuggling attempts, seizing two tonnes of the amphetamine fenethylline before being loaded on to a Saudi prince’s plane.'

# Conclusion

The fine-tuned QLoRA Flan-T5 Base model performed well on its summaries after training for 5 epochs: in the runs above, its ROUGE scores are higher than those of zero-shot inference. Instead of training on many thousands or millions of examples, about 1,000 data points were enough to get a good model. On a CPU, inference is fairly slow, ranging from 10 to 45 seconds. Still, the model produces good summaries, and sometimes even better ones than the reference summaries, as the results above show.