# SumArabic Dataset

SumArabic is an Arabic abstractive text summarization dataset.
The data come from the following two Arabic news websites:
- emaratalyoum.com
- almamlakatv.com

The data are split into training, validation, testing, and out-of-domain sets, totalling 84,764 examples. The number of examples in each split is as follows:

- Training: 75,817

- Validation: 4,121

- Testing: 4,174

- Out-of-domain: 652
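
As a quick sanity check, the four split sizes listed above can be summed to confirm the stated total. The snippet below is a minimal sketch; the name `split_sizes` is introduced here for illustration only and is not part of the dataset.

```python
# Split sizes as listed above; `split_sizes` is an illustrative name.
split_sizes = {
    "train": 75_817,
    "validation": 4_121,
    "test": 4_174,
    "out_of_domain": 652,
}

total = sum(split_sizes.values())

# Share of each split as a percentage of the whole dataset
shares = {name: round(100 * n / total, 2) for name, n in split_sizes.items()}

print(f"Total examples: {total:,}")
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```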

Bani Almarjeh, Mohammad (2022), “SumArabic”, Mendeley Data, V1, doi: 10.17632/7kr75c9h24.1

## Downloading the Dataset

In [None]:
import kagglehub

# Downloading latest version
path = kagglehub.dataset_download("abdelbassetdjamai/sumarabic")
print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/abdelbassetdjamai/sumarabic/versions/1
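
The later cells read the JSONL files from a Google Drive folder rather than from the download location printed above. As a hedged sketch, the helper below shows one way to locate JSONL files under a directory such as `path`. The function name `find_jsonl_files` is hypothetical, and the assumption that the downloaded archive contains `.jsonl` files is not confirmed by the output above; the demonstration therefore runs on a temporary directory.

```python
import tempfile
from pathlib import Path

def find_jsonl_files(root):
    """Map each .jsonl file stem under `root` (searched recursively) to its full path."""
    return {p.stem: str(p) for p in sorted(Path(root).rglob("*.jsonl"))}

# Demonstration on a throwaway directory (stand-in for the kagglehub download path)
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "nested"
    nested.mkdir()
    (nested / "sumarabic-1.0-train.jsonl").write_text("{}\n", encoding="utf-8")
    (Path(tmp) / "readme.txt").write_text("not a data file", encoding="utf-8")

    found = find_jsonl_files(tmp)
    print(sorted(found))  # ['sumarabic-1.0-train']
```

The returned dictionary could then be used to build the `data_files` argument of `load_dataset` instead of hard-coding a Drive path.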


In [None]:
!pip install datasets

In [None]:
#@title load sumArabic
from datasets import load_dataset
data_dir = "/content/drive/MyDrive/content/SumArabic"

sumArabic = load_dataset("json",
            data_files={
                "train": f"{data_dir}/sumarabic-1.0-train.jsonl",
                "validation": f"{data_dir}/sumarabic-1.0-valid.jsonl",
                "test": f"{data_dir}/sumarabic-1.0-test.jsonl"
            })

print(sumArabic.keys())
print(sumArabic["train"].num_rows)
print(sumArabic["validation"].num_rows)
print(sumArabic["test"].num_rows)

# Print the first example in the training set
print(sumArabic["train"][0])
# Display the dataset structure
sumArabic

dict_keys(['train', 'validation', 'test'])
75817
4121
4174
{'dataset': 'train', 'filename': 'crawl-data/CC-MAIN-2019-04/segments/1547583705091.62/warc/CC-MAIN-20190120082608-20190120104608-00309.warc.gz', 'headline': 'المصري فؤاد الطاهر بطل للشطرنج الديناميكي', 'length': 14478, 'offset': 750279704, 'published': datetime.datetime(2008, 2, 2, 20, 0), 'section': 'emaratalyoum.com', 'subdomain': 'emaratalyoum.com', 'text': 'اختتمت مساء أول من أمس نهائيات بطولة الإمارات للشطرنج الديناميكي المفتوحة التي نظمها اتحاد الإمارات للشطرنج بمقر نادي دبي للشطرنج والثقافة حيث توج الأستاذ الدولي المصري فؤاد الطاهر بطلا للمسابقة برصيد 6 نقاط من 7 جولات.', 'url': 'https://www.emaratalyoum.com/local-section/2008-02-03-1.192514'}


DatasetDict({
    train: Dataset({
        features: ['dataset', 'filename', 'headline', 'length', 'offset', 'published', 'section', 'subdomain', 'text', 'url'],
        num_rows: 75817
    })
    validation: Dataset({
        features: ['dataset', 'filename', 'headline', 'length', 'offset', 'published', 'section', 'subdomain', 'text', 'url'],
        num_rows: 4121
    })
    test: Dataset({
        features: ['dataset', 'filename', 'headline', 'length', 'offset', 'published', 'section', 'subdomain', 'text', 'url'],
        num_rows: 4174
    })
})

## Exploratory Data Analysis (EDA)

In [None]:
# Find the longest 'headline' (in characters) across the three loaded splits
max_length = max(
    max(len(example["headline"]) for example in sumArabic["train"]),
    max(len(example["headline"]) for example in sumArabic["validation"]),
    max(len(example["headline"]) for example in sumArabic["test"])
)

print("The maximum length of 'headline' in the dataset is:", max_length)

The maximum length of 'headline' in the dataset is: 111


In [None]:
# Find the shortest 'headline' (in characters) across the three loaded splits
min_length = min(
    min(len(example["headline"]) for example in sumArabic["train"]),
    min(len(example["headline"]) for example in sumArabic["validation"]),
    min(len(example["headline"]) for example in sumArabic["test"])
)

print("The minimum length of 'headline' in the dataset is:", min_length)

The minimum length of 'headline' in the dataset is: 11


In [None]:
# Find the largest value of the 'length' field across the three loaded splits
max_length = max(
    max(example["length"] for example in sumArabic["train"]),
    max(example["length"] for example in sumArabic["validation"]),
    max(example["length"] for example in sumArabic["test"])
)

print("The highest 'length' in the dataset is:", max_length)

The highest 'length' in the dataset is: 764903


In [None]:
# Find the smallest value of the 'length' field across the three loaded splits
min_length = min(
    min(example["length"] for example in sumArabic["train"]),
    min(example["length"] for example in sumArabic["validation"]),
    min(example["length"] for example in sumArabic["test"])
)

print("The minimum 'length' in the dataset is:", min_length)

The minimum 'length' in the dataset is: 8394
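
The four cells above each scan every split separately. A minimal sketch of a single helper that returns both extremes for any per-example measure is shown below. `min_max_over_splits` and the toy `splits` dictionary are hypothetical names with made-up values; in the notebook, `sumArabic` could be passed in their place, since a `DatasetDict` exposes `.values()` and each split yields example dictionaries when iterated.

```python
from typing import Callable, Iterable, Mapping, Tuple

def min_max_over_splits(
    splits: Mapping[str, Iterable[dict]],
    measure: Callable[[dict], int],
) -> Tuple[int, int]:
    """Return (minimum, maximum) of measure(example) over every example in every split."""
    lo, hi = None, None
    for examples in splits.values():
        for example in examples:
            value = measure(example)
            lo = value if lo is None else min(lo, value)
            hi = value if hi is None else max(hi, value)
    if lo is None:
        raise ValueError("no examples found")
    return lo, hi

# Toy stand-in for the loaded splits (values are made up for illustration)
splits = {
    "train": [{"headline": "abc", "length": 10}, {"headline": "abcdef", "length": 7}],
    "validation": [{"headline": "a", "length": 42}],
}

headline_range = min_max_over_splits(splits, lambda ex: len(ex["headline"]))
length_range = min_max_over_splits(splits, lambda ex: ex["length"])
print(headline_range)  # (1, 6)
print(length_range)    # (7, 42)
```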


In [None]:
# Get distinct sections from the training set
distinct_sections = set(example['section'] for example in sumArabic['train'])

# Print the distinct sections
print("Distinct sections in the dataset:")
for section in distinct_sections:
    print(section)

# Print the number of distinct sections
print(f"\nTotal distinct sections: {len(distinct_sections)}")

Distinct sections in the dataset:
technology
sports
business
life
online
politics
hotline
covid19
news
emaratalyoum.com
travel

Total distinct sections: 11


In [None]:
# Initialize the running totals shared by both splits
total_tokens = 0
unique_terms = set()

# Function to count tokens and unique terms in a dataset split
def count_tokens_and_unique_terms(dataset):
    global total_tokens  # Use the global variable for total tokens
    for example in dataset:
        text = example['text']  # Get the text from the example
        tokens = text.split()  # Split the text into tokens (words)
        total_tokens += len(tokens)  # Update the total token count
        unique_terms.update(tokens)  # Add unique terms to the set

# Count for the training and validation datasets
count_tokens_and_unique_terms(sumArabic['train'])
count_tokens_and_unique_terms(sumArabic['validation'])

# Print the results
print(f"The total number of tokens in the training and validation sets is {total_tokens:,}, "
      f"with {len(unique_terms):,} unique terms.")

The total number of tokens in the training and validation sets is 2,688,390, with 196,237 unique terms.
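
The counting cell above relies on a global counter. As an alternative sketch, the same whitespace-based statistics can be gathered with `collections.Counter`, which also keeps per-term frequencies. `token_statistics` is a hypothetical helper and the two sample strings are illustrative; whitespace splitting is only a rough approximation of Arabic tokenization.

```python
from collections import Counter
from typing import Iterable, Tuple

def token_statistics(texts: Iterable[str]) -> Tuple[int, int, Counter]:
    """Return (total tokens, number of unique terms, term frequencies) using whitespace splitting."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    total_tokens = sum(counts.values())
    return total_tokens, len(counts), counts

# Illustrative inputs; in the notebook these would be the 'text' fields of a split
texts = ["اختتمت مساء أول من أمس", "من أمس"]
total, unique, counts = token_statistics(texts)
print(total, unique)  # 7 5
print(counts.most_common(2))
```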


# Experiments

## Installing Required Libraries

In [None]:
!pip install transformers datasets rouge_score

In [None]:
!pip install evaluate

In [None]:
!pip install bert-score

In [None]:
import numpy as np
import evaluate
import random
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
    T5Tokenizer,
    MBartForConditionalGeneration,
    MBartTokenizer,
)
from datasets import load_dataset
from bert_score import score
from tqdm import tqdm

## AraT5v2 Model - 10% of the Dataset

### Training the Model

In [None]:
# Load the model and tokenizer
model_name = "UBC-NLP/AraT5v2-base-1024"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [None]:
# Data collator
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

In [None]:
# Load dataset
dataset = load_dataset("json", data_files={
    "train": "/content/drive/MyDrive/content/SumArabic/sumarabic-1.0-train.jsonl",
    "validation": "/content/drive/MyDrive/content/SumArabic/sumarabic-1.0-valid.jsonl",
    "test": "/content/drive/MyDrive/content/SumArabic/sumarabic-1.0-test.jsonl"
})

In [None]:
prefix = "لخص: "
def tokenize_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, padding="max_length", truncation=True, max_length=1024)

    labels = tokenizer(examples["headline"], padding="max_length", truncation=True, max_length=128)

    # Replace padding token ids in the labels with -100 so the loss ignores them
    model_inputs["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in label]
        for label in labels["input_ids"]
    ]

    return model_inputs

# Apply the tokenization
tokenized_datasets = dataset.map(tokenize_function, batched=True)
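
The label handling in `tokenize_function` replaces padding ids with -100, which is the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions do not contribute to the training loss. The sketch below isolates that step on plain lists and shows the reverse mapping that `compute_metrics` later needs before decoding. The helper names and the pad id of 0 are assumptions made for this toy example.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids, pad_token_id):
    """Replace every pad id with IGNORE_INDEX so it is excluded from the loss."""
    return [[IGNORE_INDEX if tok == pad_token_id else tok for tok in seq] for seq in label_ids]

def unmask_padding(label_ids, pad_token_id):
    """Restore pad ids so the sequences can be decoded by a tokenizer."""
    return [[pad_token_id if tok == IGNORE_INDEX else tok for tok in seq] for seq in label_ids]

pad_id = 0  # assumed pad id for this toy example
labels = [[17, 42, 1, 0, 0], [5, 1, 0, 0, 0]]

masked = mask_padding(labels, pad_id)
print(masked)  # [[17, 42, 1, -100, -100], [5, 1, -100, -100, -100]]

restored = unmask_padding(masked, pad_id)
print(restored == labels)  # True
```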

In [None]:
rouge = evaluate.load("rouge")

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
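
One caveat is worth checking before interpreting the ROUGE numbers. As far as the `rouge_score` source indicates, its default tokenizer lowercases the text and keeps only ASCII letters and digits, and `use_stemmer=True` applies an English Porter stemmer. If that holds for the installed version, Arabic words are discarded before n-grams are matched, and only digits or Latin-script tokens can overlap. The sketch below does not call the library; it mimics an ASCII-only tokenizer with a regular expression and compares it with plain whitespace splitting on a made-up sentence pair. All function names are hypothetical.

```python
import re
from collections import Counter

def ascii_alnum_tokens(text):
    """Mimic an ASCII-only tokenizer: keep runs of a-z and 0-9, drop everything else."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()

def whitespace_tokens(text):
    """Split on whitespace, keeping Arabic words intact."""
    return text.split()

def rouge1_f1(prediction, reference, tokenize):
    """Unigram-overlap F1 between prediction and reference under the given tokenizer."""
    pred, ref = Counter(tokenize(prediction)), Counter(tokenize(reference))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Made-up pair sharing one name and one number
reference = "حمدان بن محمد يقدم 106 سيارات"
prediction = "الشيخ حمدان يمنح 106 جوائز"

print(ascii_alnum_tokens(reference))                          # ['106']
print(rouge1_f1(prediction, reference, ascii_alnum_tokens))   # 1.0
print(rouge1_f1(prediction, reference, whitespace_tokens))
```

If the caveat applies, passing a whitespace tokenizer through the `tokenizer` argument of `rouge.compute` and dropping `use_stemmer=True` may yield scores that reflect Arabic word overlap. This is a suggestion to verify, not something the runs in this notebook did.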

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./output",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    weight_decay=0.001,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True
)



In [None]:
# Split into train and validation subsets
train_size = int(0.1 * len(tokenized_datasets["train"]))
val_size = int(0.1 * len(tokenized_datasets["validation"]))

# Take 10% of each dataset
sampled_train = tokenized_datasets["train"].shuffle(seed=42).select(range(train_size))
sampled_val = tokenized_datasets["validation"].shuffle(seed=42).select(range(val_size))
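
To make the subset sizes concrete, the sketch below reproduces the `int(fraction * len(...))` arithmetic and the shuffle-then-take-a-prefix pattern with the standard library only. `sample_fraction` is a hypothetical helper; the indices it returns will not match those chosen by `Dataset.shuffle(seed=42)`, which uses a different random generator, but the subset sizes are the same.

```python
import random

def sample_fraction(num_rows, fraction, seed=42):
    """Return indices of a reproducible random subset holding int(fraction * num_rows) rows."""
    subset_size = int(fraction * num_rows)  # int() truncates toward zero
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)    # seeded shuffle, independent of global state
    return indices[:subset_size]

# Split sizes taken from the dataset summary printed earlier
train_idx = sample_fraction(75_817, 0.1)
val_idx = sample_fraction(4_121, 0.1)
print(len(train_idx), len(val_idx))  # 7581 412
```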

In [None]:
# Initialize Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=sampled_train,
    eval_dataset=sampled_val,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
# Clear the CUDA cache
torch.cuda.empty_cache()

In [None]:
# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,2.9428,1.981268,0.1698,0.0057,0.1636,0.1651,12.9733
2,2.4669,1.880354,0.1872,0.0138,0.1813,0.1824,12.0243
3,2.2889,1.859542,0.1921,0.0154,0.1867,0.1882,12.0121


TrainOutput(global_step=11373, training_loss=2.7010269104664006, metrics={'train_runtime': 10215.636, 'train_samples_per_second': 2.226, 'train_steps_per_second': 1.113, 'total_flos': 3.952602328124621e+16, 'train_loss': 2.7010269104664006, 'epoch': 3.0})

In [None]:
# Saving the model and tokenizer
trainer.save_model("/content/drive/MyDrive/output/final_model")
tokenizer.save_pretrained("/content/drive/MyDrive/output/final_model_tokenizer")

('/content/drive/MyDrive/output/final_model_tokenizer/tokenizer_config.json',
 '/content/drive/MyDrive/output/final_model_tokenizer/special_tokens_map.json',
 '/content/drive/MyDrive/output/final_model_tokenizer/spiece.model',
 '/content/drive/MyDrive/output/final_model_tokenizer/added_tokens.json',
 '/content/drive/MyDrive/output/final_model_tokenizer/tokenizer.json')

### Evaluating the Model

In [None]:
# Load the model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("/content/drive/MyDrive/output/final_model").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("/content/drive/MyDrive/output/final_model_tokenizer")

# Use the tokenized test split
test_dataset = tokenized_datasets["test"]

# Set a random seed for reproducibility
random.seed(42)

# Select 10% of the test dataset
subset_size = int(0.1 * len(test_dataset))
sample_indices = random.sample(range(len(test_dataset)), subset_size)
sampled_test_dataset = [test_dataset[i] for i in sample_indices]

# Generate predictions
predictions = []
references = []

for sample in sampled_test_dataset:
    input_ids = torch.tensor(sample["input_ids"]).unsqueeze(0).to("cuda")
    with torch.no_grad():
        generated_ids = model.generate(input_ids, max_length=128, num_beams=4, early_stopping=True)
    pred_summary = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    ref_summary = sample["headline"]

    predictions.append(pred_summary)
    references.append(ref_summary)


In [None]:
# Load metrics
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# Compute ROUGE scores
rouge_scores = rouge.compute(predictions=predictions, references=references, use_stemmer=True)
print("ROUGE Scores:", rouge_scores)

# Compute BERTScore
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="ar", rescale_with_baseline=False)
print("BERTScore Precision:", np.mean(bert_scores["precision"]))
print("BERTScore Recall:", np.mean(bert_scores["recall"]))
print("BERTScore F1:", np.mean(bert_scores["f1"]))

ROUGE Scores: {'rouge1': 0.16378896882494004, 'rouge2': 0.011990407673860911, 'rougeL': 0.16342925659472418, 'rougeLsum': 0.1641087130295763}
BERTScore Precision: 0.8200366106822337
BERTScore Recall: 0.8109678410225921
BERTScore F1: 0.8149022928816523


In [None]:
# Loading the model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("/content/drive/MyDrive/output/final_model")
tokenizer = T5Tokenizer.from_pretrained("/content/drive/MyDrive/output/final_model_tokenizer")

In [None]:
# View a generated summary to evaluate if it's a meaningful text
input_text = "لخص: قدم سمو الشيخ حمدان بن محمد بن راشد آل مكتوم ولي عهد دبي، رئيس مجلس دبي الرياضي، 106 سيارات حديثة لكل فائز في أشواط المهرجان الختامي لسباقات الهجن العربية الأصيلة، التي يحتضنها ميدان السوان برأس الخيمة، وشهدها سمو الشيخ سعود بن صقر القاسمي ولي عهد ونائب حاكم رأس الخيمة، وذلك حرصاً من سموه على دعم التراث المحلي، ورياضة الآباء والأجداد."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
output_ids = model.generate(input_ids)

# Decode the output
generated_summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_summary)



حمدان بن محمد يقدم 106 سيارات حديثة لكل فائز في «ختام سباقات الهجن» برأس الخيمة


## AraT5v2 Model - 15% of the Dataset

### Training the Model

In [None]:
# Load the model and tokenizer
model_name = "UBC-NLP/AraT5v2-base-1024"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [None]:
# Data collator
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

In [None]:
# Load dataset
dataset = load_dataset("json", data_files={
    "train": "/content/drive/MyDrive/content/SumArabic/sumarabic-1.0-train.jsonl",
    "validation": "/content/drive/MyDrive/content/SumArabic/sumarabic-1.0-valid.jsonl",
    "test": "/content/drive/MyDrive/content/SumArabic/sumarabic-1.0-test.jsonl"
})

In [None]:
prefix = "لخص: "
def tokenize_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, padding="max_length", truncation=True, max_length=1024)

    labels = tokenizer(examples["headline"], padding="max_length", truncation=True, max_length=128)

    # Replace padding token ids in the labels with -100 so the loss ignores them
    model_inputs["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in label]
        for label in labels["input_ids"]
    ]

    return model_inputs

# Apply the tokenization
tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [None]:
rouge = evaluate.load("rouge")

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 with pad_token_id for labels
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute ROUGE score
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Compute BERTScore; score() already returns three tensors (precision, recall, F1)
    bert_precision, bert_recall, bert_f1 = score(decoded_preds, decoded_labels, lang="ar")
    result["bert_precision"] = bert_precision.mean().item()
    result["bert_recall"] = bert_recall.mean().item()
    result["bert_f1"] = bert_f1.mean().item()

    # Average generation length in whitespace-separated words, skipping empty predictions
    prediction_lens = [len(pred.split()) for pred in decoded_preds if pred]
    result["gen_len"] = float(np.mean(prediction_lens)) if prediction_lens else 0.0

    # Convert every value to a plain float before rounding
    return {k: round(float(v), 4) for k, v in result.items()}

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./output",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,  # Simulate larger batch sizes
    weight_decay=0.001,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=True,  # Enable mixed precision
)



In [None]:
# Split into train and validation subsets
train_size = int(0.15 * len(tokenized_datasets["train"]))
val_size = int(0.15 * len(tokenized_datasets["validation"]))

# Take 15% of each dataset
sampled_train = tokenized_datasets["train"].shuffle(seed=42).select(range(train_size))
sampled_val = tokenized_datasets["validation"].shuffle(seed=42).select(range(val_size))

In [None]:
# Initialize Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=sampled_train,
    eval_dataset=sampled_val,
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
# Clear the CUDA cache
torch.cuda.empty_cache()

In [None]:
# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Bert Precision,Bert Recall,Bert F1,Gen Len
0,3.3514,2.263201,0.1195,0.0027,0.1202,0.1206,0.7884,0.7869,0.7872,7.8382
2,2.4649,1.919946,0.1649,0.014,0.1632,0.1642,0.815,0.8053,0.8095,7.9741
4,2.3444,1.906623,0.1607,0.0124,0.1579,0.1582,0.8179,0.8072,0.8119,7.9612


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Bert Precision,Bert Recall,Bert F1,Gen Len
0,3.3514,2.263201,0.1195,0.0027,0.1202,0.1206,0.7884,0.7869,0.7872,7.8382
2,2.4649,1.919946,0.1649,0.014,0.1632,0.1642,0.815,0.8053,0.8095,7.9741
4,2.2962,1.896561,0.1618,0.014,0.1584,0.1588,0.8179,0.8065,0.8116,7.8932


TrainOutput(global_step=7105, training_loss=2.8161522561273973, metrics={'train_runtime': 17477.3998, 'train_samples_per_second': 3.253, 'train_steps_per_second': 0.407, 'total_flos': 9.878464421167104e+16, 'train_loss': 2.8161522561273973, 'epoch': 4.9982412944073165})
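
The `global_step` and fractional `epoch` values in the `TrainOutput` above follow from the subset size, the batch size, and gradient accumulation. The sketch below is an approximation of that bookkeeping under stated assumptions: a single device and the last incomplete batch being kept. `optimizer_steps` and `epochs_seen` are hypothetical helpers, not part of the Trainer API.

```python
import math

def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps, num_epochs):
    """Estimate the number of optimizer updates for a single-device run."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    # Leftover batches that do not fill a full accumulation window are not counted
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(num_epochs * updates_per_epoch)

def epochs_seen(steps, per_device_batch_size, grad_accum_steps, num_examples):
    """Fraction of passes over the data covered by the given number of updates."""
    return steps * per_device_batch_size * grad_accum_steps / num_examples

# 15% run: batch size 1, 8 accumulation steps (effective batch size 8), 5 epochs
n_15 = int(0.15 * 75_817)
steps_15 = optimizer_steps(n_15, 1, 8, 5)
print(steps_15)                         # 7105
print(epochs_seen(steps_15, 1, 8, n_15))

# 10% run: batch size 2, no accumulation, 3 epochs
n_10 = int(0.1 * 75_817)
steps_10 = optimizer_steps(n_10, 2, 1, 3)
print(steps_10)                         # 11373
```

Both step counts agree with the logged `global_step` values, and the epoch estimate for the 15% run lands just below 5 because four examples per epoch fall outside a complete accumulation window.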

In [None]:
# Saving the model and tokenizer
trainer.save_model("/content/drive/MyDrive/output2/AraT5v2_15_model")
tokenizer.save_pretrained("/content/drive/MyDrive/output2/AraT5v2_15_model_tokenizer")

('/content/drive/MyDrive/output2/AraT5v2_15_model_tokenizer/tokenizer_config.json',
 '/content/drive/MyDrive/output2/AraT5v2_15_model_tokenizer/special_tokens_map.json',
 '/content/drive/MyDrive/output2/AraT5v2_15_model_tokenizer/spiece.model',
 '/content/drive/MyDrive/output2/AraT5v2_15_model_tokenizer/added_tokens.json',
 '/content/drive/MyDrive/output2/AraT5v2_15_model_tokenizer/tokenizer.json')

### Evaluating the Model

In [None]:
# Load the model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("/content/drive/MyDrive/output2/AraT5v2_15_model").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("/content/drive/MyDrive/output2/AraT5v2_15_model_tokenizer")

In [None]:
# Use the tokenized test split
test_dataset = tokenized_datasets["test"]

# Set a random seed for reproducibility
random.seed(42)

# Select 20% of the test dataset
subset_size = int(0.2 * len(test_dataset))
sample_indices = random.sample(range(len(test_dataset)), subset_size)
sampled_test_dataset = [test_dataset[i] for i in sample_indices]

# Generate predictions
predictions = []
references = []

In [None]:
for sample in sampled_test_dataset:
    input_ids = torch.tensor(sample["input_ids"]).unsqueeze(0).to("cuda")
    with torch.no_grad():
        generated_ids = model.generate(input_ids, max_length=128, num_beams=4, early_stopping=True)
    pred_summary = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    ref_summary = sample["headline"]

    predictions.append(pred_summary)
    references.append(ref_summary)

In [None]:
# Load metrics
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# Compute ROUGE scores
rouge_scores = rouge.compute(predictions=predictions, references=references, use_stemmer=True)
print("ROUGE Scores:", rouge_scores)

# Compute BERTScore
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="ar", rescale_with_baseline=False)
print("BERTScore Precision:", np.mean(bert_scores["precision"]))
print("BERTScore Recall:", np.mean(bert_scores["recall"]))
print("BERTScore F1:", np.mean(bert_scores["f1"]))

ROUGE Scores: {'rouge1': 0.16446842525979222, 'rouge2': 0.008393285371702638, 'rougeL': 0.16324940047961634, 'rougeLsum': 0.16326938449240608}
BERTScore Precision: 0.8199248575478149
BERTScore Recall: 0.808980030621842
BERTScore F1: 0.8139196114717342


In [None]:
# Loading the model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("/content/drive/MyDrive/output2/AraT5v2_15_model")
tokenizer = T5Tokenizer.from_pretrained("/content/drive/MyDrive/output2/AraT5v2_15_model_tokenizer")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [None]:
# Generate a summary for a single article to check that the output is meaningful text
input_text = "لخص: قدم سمو الشيخ حمدان بن محمد بن راشد آل مكتوم ولي عهد دبي، رئيس مجلس دبي الرياضي، 106 سيارات حديثة لكل فائز في أشواط المهرجان الختامي لسباقات الهجن العربية الأصيلة، التي يحتضنها ميدان السوان برأس الخيمة، وشهدها سمو الشيخ سعود بن صقر القاسمي ولي عهد ونائب حاكم رأس الخيمة، وذلك حرصاً من سموه على دعم التراث المحلي، ورياضة الآباء والأجداد."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
output_ids = model.generate(input_ids)

# Decode the output
generated_summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_summary)



حمدان بن محمد يقدم 106 سيارات حديثة لكل فائز في المهرجان الختامي لسباقات الهجن برأس الخيمة


## AraT5v2 Model - 20% of the Dataset (**Best Model**)

### Training the Model

In [None]:
# Load the model and tokenizer
model_name = "UBC-NLP/AraT5v2-base-1024"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
# Data collator
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

In [None]:
# Load dataset
dataset = load_dataset("json", data_files={
    "train": "/content/drive/MyDrive/content/SumArabic/sumarabic-1.0-train.jsonl",
    "validation": "/content/drive/MyDrive/content/SumArabic/sumarabic-1.0-valid.jsonl",
    "test": "/content/drive/MyDrive/content/SumArabic/sumarabic-1.0-test.jsonl"
})


In [None]:
prefix = "لخص: "
def tokenize_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, padding="max_length", truncation=True, max_length=1024)

    labels = tokenizer(examples["headline"], padding="max_length", truncation=True, max_length=128)

    # Replace padding token ids in the labels with -100 so the loss ignores them
    model_inputs["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in label]
        for label in labels["input_ids"]
    ]

    return model_inputs

# Apply the tokenization
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/75817 [00:00<?, ? examples/s]

Map:   0%|          | 0/4121 [00:00<?, ? examples/s]

Map:   0%|          | 0/4174 [00:00<?, ? examples/s]

In [None]:
rouge = evaluate.load("rouge")


In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 with pad_token_id for labels
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute ROUGE score
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Compute BERTScore (per-example precision, recall, and F1), then average each
    bert_precision, bert_recall, bert_f1 = score(decoded_preds, decoded_labels, lang="ar")
    result["bert_precision"] = torch.as_tensor(bert_precision, dtype=torch.float32).mean()
    result["bert_recall"] = torch.as_tensor(bert_recall, dtype=torch.float32).mean()
    result["bert_f1"] = torch.as_tensor(bert_f1, dtype=torch.float32).mean()

    # Average generation length in whitespace-separated words, skipping empty predictions
    prediction_lens = [len(decoded_pred.split()) for decoded_pred in decoded_preds if decoded_pred]
    result["gen_len"] = sum(prediction_lens) / len(prediction_lens) if prediction_lens else 0.0

    # float() handles NumPy scalars, 0-dim tensors, and plain floats alike
    return {k: round(float(v), 4) for k, v in result.items()}

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./output",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,  # Simulate larger batch sizes
    weight_decay=0.001,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,  # Enable mixed precision
)



In [None]:
# Compute subset sizes: 20% of the train and validation splits
train_size = int(0.2 * len(tokenized_datasets["train"]))
val_size = int(0.2 * len(tokenized_datasets["validation"]))

# Take 20% of each dataset
sampled_train = tokenized_datasets["train"].shuffle(seed=42).select(range(train_size))
sampled_val = tokenized_datasets["validation"].shuffle(seed=42).select(range(val_size))

In [None]:
# Initialize Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=sampled_train,
    eval_dataset=sampled_val,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
# Clear the CUDA cache
torch.cuda.empty_cache()

In [None]:
# Train the model
trainer.train()

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Bert Precision | Bert Recall | Bert F1 | Gen Len |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.536 | 1.90381 | 0.1594 | 0.0113 | 0.1594 | 0.159 | 0.816 | 0.8037 | 0.8093 | 7.8228 |
| 1 | 2.3277 | 1.860358 | 0.1654 | 0.0174 | 0.1637 | 0.163 | 0.8181 | 0.8077 | 0.8123 | 7.9575 |
| 2 | 2.2188 | 1.846295 | 0.1645 | 0.0174 | 0.1621 | 0.1619 | 0.8169 | 0.8062 | 0.811 | 7.8993 |


Streaming output truncated to the last 5000 lines.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.

TrainOutput(global_step=5685, training_loss=2.384257159740545, metrics={'train_runtime': 14023.7866, 'train_samples_per_second': 3.244, 'train_steps_per_second': 0.405, 'total_flos': 7.904161890828288e+16, 'train_loss': 2.384257159740545, 'epoch': 2.9994064499109676})
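
The `global_step` value in the output above can be cross-checked by hand. The sketch below assumes a single device and that one optimizer step is taken per `gradient_accumulation_steps` batches, with the incomplete remainder at the end of each epoch not counted; `optimizer_steps` is a hypothetical helper written for this check, not part of `transformers`.

```python
def optimizer_steps(num_examples, fraction, per_device_batch, grad_accum, epochs):
    """Estimate the subset size and total optimizer steps (hypothetical helper)."""
    subset = int(fraction * num_examples)               # same rounding as the notebook
    batches_per_epoch = -(-subset // per_device_batch)  # ceiling division
    steps_per_epoch = batches_per_epoch // grad_accum   # assumed: remainder is dropped
    return subset, steps_per_epoch * epochs

# Values taken from this section: 75,817 training examples, 20% subset,
# batch size 1, gradient accumulation 8, 3 epochs.
subset, steps = optimizer_steps(75817, 0.2, 1, 8, 3)
print(subset, steps)  # 15163 5685

# Fraction of the data actually seen, in epochs
print(round(steps * 8 / subset, 4))  # 2.9994
```

Under these assumptions the result agrees with the reported `global_step=5685` and `epoch` of about 2.9994. Calling the helper with `fraction=0.25` gives 7107 steps, which agrees with the MBart run later in the notebook.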

In [None]:
# Saving the model and tokenizer
trainer.save_model("/content/drive/MyDrive/output2/final_model_20")
tokenizer.save_pretrained("/content/drive/MyDrive/output2/final_model_tokenizer_20")

('/content/drive/MyDrive/output2/final_model_tokenizer_20/tokenizer_config.json',
 '/content/drive/MyDrive/output2/final_model_tokenizer_20/special_tokens_map.json',
 '/content/drive/MyDrive/output2/final_model_tokenizer_20/spiece.model',
 '/content/drive/MyDrive/output2/final_model_tokenizer_20/added_tokens.json',
 '/content/drive/MyDrive/output2/final_model_tokenizer_20/tokenizer.json')

### Evaluating the Model

In [None]:
# Load the model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("/content/drive/MyDrive/output2/final_model_20").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("/content/drive/MyDrive/output2/final_model_tokenizer_20")

In [None]:
# Prepare the tokenized test split
test_dataset = tokenized_datasets["test"]

# Set a random seed for reproducibility
random.seed(42)

# Select 20% of the test dataset
subset_size = int(0.2 * len(test_dataset))
sample_indices = random.sample(range(len(test_dataset)), subset_size)
sampled_test_dataset = [test_dataset[i] for i in sample_indices]

# Generate predictions
predictions = []
references = []
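
The sampling step above can be sketched in isolation. This is a minimal illustration, assuming a test split of 4,174 examples (the size reported for SumArabic). It uses a local `random.Random(42)` generator instead of the global `random.seed(42)`; that is a swapped-in choice so that other uses of `random` in the notebook cannot shift the sample. `sample_fraction` is a hypothetical helper name.

```python
import random

def sample_fraction(n_items: int, fraction: float, seed: int = 42) -> list[int]:
    """Return a reproducible random subset of indices (hypothetical helper)."""
    subset_size = int(fraction * n_items)
    rng = random.Random(seed)  # local generator, independent of the global random state
    return rng.sample(range(n_items), subset_size)

indices = sample_fraction(4174, 0.2)
print(len(indices))  # 834

# Calling the helper again with the same seed returns the same indices
assert indices == sample_fraction(4174, 0.2)
```

Because `random.sample` draws without replacement, no test example is evaluated twice.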

In [None]:
for sample in sampled_test_dataset:
    input_ids = torch.tensor(sample["input_ids"]).unsqueeze(0).to("cuda")
    with torch.no_grad():
        generated_ids = model.generate(input_ids, max_length=128, num_beams=4, early_stopping=True)
    pred_summary = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    ref_summary = sample["headline"]

    predictions.append(pred_summary)
    references.append(ref_summary)


In [None]:
# Load metrics
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# Compute ROUGE scores
rouge_scores = rouge.compute(predictions=predictions, references=references, use_stemmer=True)
print("ROUGE Scores:", rouge_scores)

# Compute BERTScore
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="ar", rescale_with_baseline=False)
print("BERTScore Precision:", np.mean(bert_scores["precision"]))
print("BERTScore Recall:", np.mean(bert_scores["recall"]))
print("BERTScore F1:", np.mean(bert_scores["f1"]))

ROUGE Scores: {'rouge1': 0.16790567545963236, 'rouge2': 0.010391686650679457, 'rougeL': 0.16682653876898484, 'rougeLsum': 0.1673461231015188}
BERTScore Precision: 0.8241333144603016
BERTScore Recall: 0.8123160288345328
BERTScore F1: 0.8176661718377678
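
A ROUGE-2 score near 0.01 alongside a BERTScore F1 above 0.81 suggests the ROUGE tokenization is worth checking. The sketch below assumes that the default tokenizer behind the `rouge` metric lowercases text and keeps only ASCII letters and digits; this should be verified against the installed `rouge_score` version. Under that assumption Arabic words are dropped and only tokens such as numerals can match. `ascii_tokens`, `whitespace_tokens`, and `unigram_f1` are hypothetical helpers written for illustration, not the library's code, and the two strings are generated summaries from this notebook, not a real reference pair.

```python
import re
from collections import Counter

def ascii_tokens(text):
    # Assumed default behaviour: lowercase, then replace anything outside a-z0-9 with spaces
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()

def whitespace_tokens(text):
    # Language-agnostic alternative: split on whitespace only
    return text.split()

def unigram_f1(pred_tokens, ref_tokens):
    # Clipped unigram overlap, combined into an F1 score (ROUGE-1 style)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

summary_a = "حمدان بن محمد يقدم 106 سيارات حديثة لكل فائز في المهرجان الختامي لسباقات الهجن برأس الخيمة"
summary_b = "حمدان بن محمد يقدم 106 سيارات حديثة للفائزين في المهرجان الختامي لسباقات الهجن"

# Only the numeral survives the ASCII-only filter
print(ascii_tokens(summary_b))
# Overlap computed on whitespace tokens reflects the Arabic words as well
print(unigram_f1(whitespace_tokens(summary_b), whitespace_tokens(summary_a)))
```

If the assumption holds for the installed version, one option is to supply a whitespace-based tokenizer to the metric, provided the installed `evaluate` release accepts a custom tokenizer in `rouge.compute`.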


In [None]:
# Loading the model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("/content/drive/MyDrive/output2/final_model_20")
tokenizer = T5Tokenizer.from_pretrained("/content/drive/MyDrive/output2/final_model_tokenizer_20")



In [None]:
# Generate a summary for a single article to check that the output is meaningful text
input_text = "لخص: قدم سمو الشيخ حمدان بن محمد بن راشد آل مكتوم ولي عهد دبي، رئيس مجلس دبي الرياضي، 106 سيارات حديثة لكل فائز في أشواط المهرجان الختامي لسباقات الهجن العربية الأصيلة، التي يحتضنها ميدان السوان برأس الخيمة، وشهدها سمو الشيخ سعود بن صقر القاسمي ولي عهد ونائب حاكم رأس الخيمة، وذلك حرصاً من سموه على دعم التراث المحلي، ورياضة الآباء والأجداد."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
output_ids = model.generate(input_ids)

# Decode the output
generated_summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_summary)



حمدان بن محمد يقدم 106 سيارات حديثة للفائزين في المهرجان الختامي لسباقات الهجن


## MBart Model - 25% of the Dataset

### Training the Model

In [None]:
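# Load the pretrained model and tokenizer for MBart
model_name = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(model_name)
# Note: this checkpoint ships an MBart50Tokenizer, so loading it through MBartTokenizer
# triggers the class-mismatch warning shown in the cell output.
tokenizer = MBartTokenizer.from_pretrained(model_name)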

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'MBart50Tokenizer'. 
The class this function is called from is 'MBartTokenizer'.


In [None]:
# Load dataset
dataset = load_dataset("json", data_files={
    "train": "/content/drive/MyDrive/content/SumArabic/sumarabic-1.0-train.jsonl",
    "validation": "/content/drive/MyDrive/content/SumArabic/sumarabic-1.0-valid.jsonl",
    "test": "/content/drive/MyDrive/content/SumArabic/sumarabic-1.0-test.jsonl"
})

In [None]:
# Tokenize the input text and labels
def tokenize_function(examples):
    inputs = examples["text"]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True, padding="max_length")

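    # Note: unlike the AraT5 section, padding token ids in the labels are not replaced with -100,
    # so padded label positions also contribute to the training and validation loss.
    labels = examples["headline"]
    model_inputs["labels"] = tokenizer(labels, max_length=128, truncation=True, padding="max_length")["input_ids"]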
    return model_inputs

tokenized_datasets = dataset.map(tokenize_function, batched=True)
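
The tokenization above keeps padding token ids in `labels`, whereas the AraT5 section replaces them with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. A minimal sketch of that masking, and of the inverse step that `compute_metrics` would then need before decoding, is shown below. The pad id of `1` and the token ids are made-up values for illustration, and `mask_pad` and `unmask_pad` are hypothetical helpers.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def mask_pad(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX so padded positions do not affect the loss."""
    return [[IGNORE_INDEX if tok == pad_token_id else tok for tok in seq] for seq in label_ids]

def unmask_pad(label_ids, pad_token_id):
    """Restore padding ids so the sequences can be passed to tokenizer.batch_decode."""
    return [[pad_token_id if tok == IGNORE_INDEX else tok for tok in seq] for seq in label_ids]

PAD_ID = 1  # illustrative value; in practice read it from tokenizer.pad_token_id
labels = [
    [250001, 57, 902, 2, 1, 1],
    [250001, 14, 2, 1, 1, 1],
]

masked = mask_pad(labels, PAD_ID)
print(masked[0])  # [250001, 57, 902, 2, -100, -100]

restored = unmask_pad(masked, PAD_ID)
assert restored == labels
```

With headlines padded to 128 tokens, most label positions are padding, so loss values computed with and without this masking are not directly comparable.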

In [None]:
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")



In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute ROUGE score
    rouge_result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # The ROUGE metric returns aggregated scores as plain floats, so copy them directly
    result = {
        "rouge1": rouge_result["rouge1"],
        "rouge2": rouge_result["rouge2"],
        "rougeL": rouge_result["rougeL"],
    }

    # Compute BERTScore for the entire epoch
    bert_scores = bertscore.compute(predictions=decoded_preds, references=decoded_labels, lang="ar")
    avg_bert_precision = np.mean(bert_scores["precision"])
    avg_bert_recall = np.mean(bert_scores["recall"])
    avg_bert_f1 = np.mean(bert_scores["f1"])

    result["bert_precision"] = avg_bert_precision
    result["bert_recall"] = avg_bert_recall
    result["bert_f1"] = avg_bert_f1

    return result

In [None]:
# Training configuration
training_args = Seq2SeqTrainingArguments(
    output_dir="./output",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    weight_decay=0.001,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
)



In [None]:
# Compute subset sizes: 25% of the train and validation splits
train_size = int(0.25 * len(tokenized_datasets["train"]))
val_size = int(0.25 * len(tokenized_datasets["validation"]))

# Take 25% of each dataset
sampled_train = tokenized_datasets["train"].shuffle(seed=42).select(range(train_size))
sampled_val = tokenized_datasets["validation"].shuffle(seed=42).select(range(val_size))

In [None]:
# Trainer setup
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=sampled_train,
    eval_dataset=sampled_val,
    compute_metrics=compute_metrics,
)

In [None]:
# Clear the CUDA cache
torch.cuda.empty_cache()

In [None]:
# Train the model
trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mnouf0fahad145[0m ([33mnouf0fahad145-king-saud-university[0m). Use [1m`wandb login --relogin`[0m to force relogin


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Bert Precision | Bert Recall | Bert F1 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.2752 | 0.251949 | 0.161036 | 0.016505 | 0.157783 | 0.797081 | 0.805184 | 0.800575 |
| 1 | 0.2053 | 0.241229 | 0.166343 | 0.018447 | 0.163074 | 0.799929 | 0.810538 | 0.804718 |
| 2 | 0.1662 | 0.243056 | 0.16479 | 0.018123 | 0.162006 | 0.800999 | 0.811427 | 0.805735 |



TrainOutput(global_step=7107, training_loss=0.29478811798374815, metrics={'train_runtime': 17464.5226, 'train_samples_per_second': 3.256, 'train_steps_per_second': 0.407, 'total_flos': 1.2321433008576922e+17, 'train_loss': 0.29478811798374815, 'epoch': 2.9996834441278883})

In [None]:
# Saving the model and tokenizer
trainer.save_model("/content/drive/MyDrive/mbertoutput/mbert_model_20")
tokenizer.save_pretrained("/content/drive/MyDrive/mbertoutput/mbert_model_tokenizer_20")

### Evaluating the Model

In [None]:
# Load the model and tokenizer
model_path = "/content/drive/MyDrive/mbertoutput/mbert_model_20"
tokenizer_path = "/content/drive/MyDrive/mbertoutput/mbert_model_tokenizer_20"

model = MBartForConditionalGeneration.from_pretrained(model_path).to("cuda")
tokenizer = MBartTokenizer.from_pretrained(tokenizer_path)

In [None]:
# Prepare the test dataset
test_dataset = tokenized_datasets["test"]

In [None]:
# Set a random seed for reproducibility
random.seed(42)

In [None]:
# Select 25% of the test dataset
subset_size = int(0.25 * len(test_dataset))
sample_indices = random.sample(range(len(test_dataset)), subset_size)
sampled_test_dataset = [test_dataset[i] for i in sample_indices]

In [None]:
# Initialize lists for predictions and references
predictions = []
references = []

# Generate predictions and collect references
for sample in sampled_test_dataset:
    input_ids = torch.tensor(sample["input_ids"]).unsqueeze(0).to("cuda")
    with torch.no_grad():
        generated_ids = model.generate(input_ids, max_length=128, num_beams=4, early_stopping=True)
    pred_summary = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    ref_summary = sample["headline"]

    predictions.append(pred_summary)
    references.append(ref_summary)

In [None]:
# Compute ROUGE scores
rouge_scores = rouge.compute(predictions=predictions, references=references, use_stemmer=True)
print("\nROUGE Scores:")
for k, v in rouge_scores.items():
    print(f"{k}: {v:.4f}")

# Compute BERTScore
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="ar", rescale_with_baseline=False)
avg_bert_precision = np.mean(bert_scores["precision"])
avg_bert_recall = np.mean(bert_scores["recall"])
avg_bert_f1 = np.mean(bert_scores["f1"])


ROUGE Scores:
rouge1: 0.1691
rouge2: 0.0139
rougeL: 0.1671
rougeLsum: 0.1670


In [None]:
print("\nBERTScore:")
print(f"Precision: {avg_bert_precision:.4f}")
print(f"Recall: {avg_bert_recall:.4f}")
print(f"F1: {avg_bert_f1:.4f}")


BERTScore:
Precision: 0.8007
Recall: 0.8092
F1: 0.8044


In [None]:
# Example: Generating a single summary
input_text = "قدم سمو الشيخ حمدان بن محمد بن راشد آل مكتوم ولي عهد دبي، رئيس مجلس دبي الرياضي، 106 سيارات حديثة لكل فائز في أشواط المهرجان الختامي لسباقات الهجن العربية الأصيلة، التي يحتضنها ميدان السوان برأس الخيمة، وشهدها سمو الشيخ سعود بن صقر القاسمي ولي عهد ونائب حاكم رأس الخيمة، وذلك حرصاً من سموه على دعم التراث المحلي، ورياضة الآباء والأجداد."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
output_ids = model.generate(input_ids, max_length=128, num_beams=4, early_stopping=True)

# Decode the output
generated_summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("\nGenerated Summary:")
print(generated_summary)


Generated Summary:
حمدان بن محمد يقدم 106 سيارات حديثة لسباقات الهجن العربية الأصيلة
