# Fine-tuning a Model for a Summarization Task

In this task, you will load, preprocess, and fine-tune a T5 model on a dataset of news articles for a summarization task. Follow the steps below carefully.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained checkpoint `UBC-NLP/AraT5-base` for both the model and the tokenizer. If you run into problems with it, you may fall back to `google-t5/t5-small`, but `UBC-NLP/AraT5-base` is the intended choice.
- **Dataset**: You will be using the `CUTD/news_articles_df` dataset. Make sure to load and preprocess the dataset correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either TensorFlow or PyTorch for this task. If you prefer TensorFlow, feel free to use it when working with Hugging Face Transformers.
- The number of data samples you choose to work with is flexible. However, if you select a very low number of samples and the training time is too short, this could affect the evaluation of your work.

## Step 1: Load the Dataset

Load the dataset and split it into training and test sets. Use 20% of the data for testing.

In [None]:
# !pip install transformers datasets torch

In [None]:
!pip install datasets

In [None]:
from datasets import load_dataset
from sklearn.model_selection import train_test_split

In [None]:
from transformers import pipeline

In [None]:
# load_dataset was already imported above; the dataset ships a single 'train' split
ds = load_dataset("CUTD/news_articles_df", split="train")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
ds = ds.train_test_split(test_size=0.2, seed=42)  # 80% train, 20% test; fixed seed for reproducibility

In [None]:
ds['train'][0]

{'Unnamed: 0': 6131,
 'summarizer': '\nقال الأمين العام للاتحاد العام التونسي للشغل اليوم الخميس، إن منظمته مستعدة للتضحية بشأن الأجور لكن بشروط. واستنكر العباسي ما اعتبره تنكرا للاتفاقيات المبرمة مع المنظمة الشغيلة، معلقا بالقول بأنه هناك محاولة لفرض حلول على الاتحاد دون موافقته. وأضاف أن مطالبة هذه الشريحة بدفع الضريبة واجب وليس تضحية كما يروّجونه، بحسب تعبيره.',
 'text': 'قال الامين العام للاتحاد العام التونسي للشغل اليوم الخميس منظمته مستعده للتضحيه بشان الاجور بشروط وقال تصريحات صحفيه هامش ندوه وطنيه لقطاع النقل البري بالحمامات واكبتها مراسلتنا روضه العلاقي مستعدون للتضحيه بشرط تكون التضحيه مشتركه الاطراف بحسب تعبيره واوضح المطلوب حاليا الاجراء التضحيه والتخلي جزء رواتبهم المقابل تضحيه مماثله الاجراء غرار المحامين والاطباء مطالبا بتضحيه متساويه الجميع واضاف مطالبه الشريحه بدفع الضريبه واجب وليس تضحيه بحسب تعبيره واستنكر العباسي اعتبره تنكرا للاتفاقيات المبرمه المنظمه الشغيله معلقا بالقول بانه محاوله لفرض حلول الاتحاد موافقته'}

In [None]:
df = ds['train'].to_pandas()

In [None]:
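# Note: df holds only ds['train'], so this is a second 80/20 split of the training portion.
# test_df is a subset of ds['train'], not a true held-out set; ds['test'] is the held-out 20%.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)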

In [None]:
# Convert back to Hugging Face datasets (preserve_index=False avoids an extra index column)
from datasets import Dataset
train_dataset = Dataset.from_pandas(train_df, preserve_index=False)
test_dataset = Dataset.from_pandas(test_df, preserve_index=False)

In [None]:
ds

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'summarizer', 'text'],
        num_rows: 6702
    })
    test: Dataset({
        features: ['Unnamed: 0', 'summarizer', 'text'],
        num_rows: 1676
    })
})
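The row counts above can be reproduced with a few lines of plain Python. The sketch below is only an illustration of a seeded hold-out split, not the `datasets` implementation: the helper name `split_indices` is hypothetical, and rounding the test share up is an assumption chosen because it matches the 6702/1676 counts shown. It also shows why splitting the training portion a second time does not produce fresh evaluation data.

```python
import math
import random


def split_indices(n_rows, test_size=0.2, seed=42):
    """Return disjoint (train, test) index lists for a shuffled hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # seeded, so the split is repeatable
    n_test = math.ceil(n_rows * test_size)    # assumption: round the test share up
    return indices[n_test:], indices[:n_test]


# First split: the full 8378-row dataset
train_idx, test_idx = split_indices(8378)
print(len(train_idx), len(test_idx))          # 6702 1676
print(set(train_idx).isdisjoint(test_idx))    # True

# Second split: applied to the training portion only
_, second_test = split_indices(len(train_idx))
second_test_rows = {train_idx[i] for i in second_test}
print(len(second_test_rows))                  # 1341
print(second_test_rows <= set(train_idx))     # True: every row is also a training row
```

If the model is trained on the full training split, any subset carved out of that split has already been seen during training, so scores computed on it would be optimistic.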

## Step 2: Load the Pretrained Tokenizer

Initialize a tokenizer from the given model checkpoint.

In [None]:
from transformers import AutoTokenizer

checkpoint = "UBC-NLP/AraT5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


## Step 3: Preprocess the Dataset

Define a preprocessing function that adds a prefix ("summarize:") to each input if needed and tokenizes the text for the model. The labels will be the tokenized summaries.

In [None]:
# Step 3: Preprocess the Dataset
prefix = "summarize: "  # task prefix used by T5-style models

def preprocess_function(examples):
    # Prepend the task prefix to every article in the batch
    inputs = [prefix + article for article in examples["text"]]

    # Tokenize the articles, truncating to at most 512 tokens
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)

    # Tokenize the summaries as targets; `text_target` replaces the
    # deprecated `as_target_tokenizer()` context manager
    labels = tokenizer(text_target=examples["summarizer"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
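To make the shape of the data flowing through `preprocess_function` concrete, here is a toy sketch that needs no model download. Everything in it is a stand-in: `toy_tokenizer` is a hypothetical whitespace tokenizer rather than the AraT5 tokenizer, and the small `max_input`/`max_target` limits replace 512 and 128 so the truncation is visible. With `batched=True`, `map` passes a dict of lists like `batch` below.

```python
PREFIX = "summarize: "
vocab = {}  # toy vocabulary, filled on the fly


def toy_tokenizer(texts, max_length):
    """Stand-in tokenizer: one id per whitespace-separated word, then truncate."""
    ids = []
    for t in texts:
        ids.append([vocab.setdefault(w, len(vocab)) for w in t.split()][:max_length])
    return {"input_ids": ids}


def toy_preprocess(examples, max_input=8, max_target=4):
    # Same structure as preprocess_function: prefix, tokenize inputs, tokenize targets
    inputs = [PREFIX + article for article in examples["text"]]
    model_inputs = toy_tokenizer(inputs, max_length=max_input)
    labels = toy_tokenizer(examples["summarizer"], max_length=max_target)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


batch = {
    "text": ["a b c d e f g h i j", "k l"],
    "summarizer": ["a b c d e", "k"],
}
out = toy_preprocess(batch)
print([len(x) for x in out["input_ids"]])  # [8, 3]
print([len(x) for x in out["labels"]])     # [4, 1]
```

Note that the sequences keep different lengths at this stage; padding is left to the data collator in the next step.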


In [None]:
tokenized_ds = ds.map(preprocess_function, batched=True)


In [None]:
# Apply the same preprocessing to test_dataset
tokenized_test_dataset = test_dataset.map(preprocess_function, batched=True)


## Step 4: Define the Data Collator

Use a data collator designed for sequence-to-sequence models, which dynamically pads inputs and labels.

In [None]:
from transformers import DataCollatorForSeq2Seq

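# `model` expects a model instance, not a checkpoint name. The model is only loaded in
# Step 5, so the argument is omitted here; the model then derives decoder inputs from the labels.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, return_tensors="pt")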
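Dynamic padding means each batch is padded only to the length of its own longest sequence instead of a global maximum. The sketch below imitates that behaviour in plain Python; `pad_batch` is a hypothetical helper, not the collator's code. It assumes a pad token id of 0 (the usual value for T5 tokenizers) and pads labels with -100, the collator's default label padding value, which the PyTorch cross-entropy loss ignores.

```python
LABEL_PAD_ID = -100  # positions with this label are ignored by the loss


def pad_batch(features, pad_token_id=0):
    """Pad every field to the longest sequence in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = len(f["input_ids"])
        n_lab = len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return batch


features = [
    {"input_ids": [5, 6, 7, 8], "labels": [9, 10]},
    {"input_ids": [5, 6], "labels": [9, 10, 11]},
]
batch = pad_batch(features)
print(batch["input_ids"])       # [[5, 6, 7, 8], [5, 6, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
print(batch["labels"])          # [[9, 10, -100], [9, 10, 11]]
```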

## Step 5: Load the Pretrained Model

Load the model for sequence-to-sequence tasks (summarization).

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
# Make every parameter tensor contiguous in memory; saving in the
# safetensors format can reject non-contiguous tensors
for param in model.parameters():
    param.data = param.data.contiguous()

## Step 6: Define Training Arguments

Set up the training configuration with parameters like learning rate, batch size, and number of epochs.

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_model",
    eval_strategy="no",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    num_train_epochs=1,
    #push_to_hub=True,
)

## Step 7: Initialize the Trainer

Use the `Seq2SeqTrainer` class to train the model.

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],  # held-out 20% split; tokenized_test_dataset overlaps the training data
    tokenizer=tokenizer,
    data_collator=data_collator,
    #compute_metrics=compute_metrics,
)


## Step 8: Fine-tune the Model

Train the model using the specified arguments and dataset.

In [None]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 13.7347       |
| 1000 | 8.7450        |
| 1500 | 8.5198        |


TrainOutput(global_step=1676, training_loss=10.135428562938172, metrics={'train_runtime': 497.9443, 'train_samples_per_second': 13.459, 'train_steps_per_second': 3.366, 'total_flos': 1589853463971840.0, 'train_loss': 10.135428562938172, 'epoch': 1.0})
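The `global_step=1676` in the output can be checked by hand. The sketch below is a simplified calculation, assuming a single device and no gradient accumulation (the defaults in the training arguments above); `optimizer_steps` is a hypothetical helper, not a `transformers` function.

```python
import math


def optimizer_steps(num_samples, batch_size, epochs=1):
    """Optimizer steps for a run where the last, smaller batch is kept."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * epochs


steps = optimizer_steps(num_samples=6702, batch_size=4, epochs=1)
print(steps)                        # 1676
print(round(steps / 497.9443, 3))   # 3.366 steps per second, as reported
```

With one step per batch of 4, a single epoch over 6702 examples gives the model only 1676 updates, which is worth keeping in mind when reading the loss values above.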

## Step 9: Inference

Once the model is trained, perform inference on a sample text to generate a summary. Use the tokenizer to process the text, and then feed it into the model to get the generated summary.

In [None]:
model.save_pretrained("/content/my_awesome_model")  # same directory as the tokenizer below
tokenizer.save_pretrained("/content/my_awesome_model")

('/content/my_awesome_model/tokenizer_config.json',
 '/content/my_awesome_model/special_tokens_map.json',
 '/content/my_awesome_model/spiece.model',
 '/content/my_awesome_model/added_tokens.json',
 '/content/my_awesome_model/tokenizer.json')

In [None]:
text = "summarize: ينطلق مهرجان القيروان للشعر العربي ببيت الشعر بالقيروان بالتنسيق بيت الشارقه وذلك اطار الذكرى السنويه لبعث بيت الشعر الثامن الى العاشر بحضور وزير الشؤون الثقافيه وممثلين بيت الشعر بالشارقه الافتتاح سيكون بامسيه شعريه يوم الخميس واحتفال باحدى القاعات بمدينه القيروان تليها سهره موسيقيه باحد النزل وستلتام خلال اليوم الثاني الندوه النقديه بعنوان الشعر وسؤال الهويه بحضور مستشرقه ايطاليه ومحاضرين جامعيين وعديد الشعراء بالاضافه الى امسيه شعريه وتمت برمجه اصبوحه شعريه خلال اليوم الثالث يليها حفل لتكريم المشاركين ويتخلل المهرجان مسابقات شعريه جوائز وتقديم لعرض مجموعه التلاميذ وياتي المهرجان اثر مهرجانات اخرى اطار الاحتفال بسنويه بيت الشعر سبقت القيروان كالاقصر مصر ومدينه المفرق بالاردن يحتضن بيت الشعر بالقيروان يعتبر فضاء متعدد الاختصاصات عديد المعارض الفنيه والسهرات الشعريه والفنيه ويفتح ابوابه الفنون الجميله كالرسم والغناء تشريك عدد التلاميذ القاء الشعر وحفظ المعلقات وورشات العروض ويشارك مهرجان القيروان للشعر العربي عدد كبير الشعراء غرار محمد الخالدي ويوسف الوهيبي وعبدالرحمان الكبلوطي وشريفه البدري وادم فتحي وجهاد المثناني وغيره الشعراء تونس والخارج"

In [None]:
summarizer = pipeline("summarization", model="/content/my_awesome_model")
summarizer(text)

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'summary_text': '، في في في في في في في في في في في في في في .'}]
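The step description mentions processing the text with the tokenizer and feeding it to the model, whereas the cell above uses a `pipeline`. A minimal sketch of the manual route follows. The helper name `summarize` and the decoding settings (`num_beams=4`, `no_repeat_ngram_size=3`) are illustrative assumptions, not values from the notebook, and the function expects the model directory saved earlier in this step.

```python
def summarize(text, model_dir="/content/my_awesome_model",
              max_new_tokens=128, num_beams=4, no_repeat_ngram_size=3):
    """Generate a summary with the tokenizer and model directly (no pipeline)."""
    # Imported inside the function so the helper can be defined
    # before the heavy libraries are loaded
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir).to(device)
    model.eval()

    # Same input truncation as in the preprocessing step
    inputs = tokenizer(text, max_length=512, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            num_beams=num_beams,
            no_repeat_ngram_size=no_repeat_ngram_size,  # block repeated 3-grams
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (requires the saved model directory from the cell above):
# print(summarize(text))
```

Blocking repeated n-grams can hide loops like the one in the output above, but it does not address their cause. The high training loss reported in Step 8 suggests the repetition may stem from under-training rather than from the decoding settings, so more epochs or a different learning rate are probably worth trying first.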