<a href="https://colab.research.google.com/github/lazouine/text_summarization_project/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📘 Welcome to the Arabic Text Summarization Project

## Overview
Welcome, students 👩‍🎓👨‍🎓, to our exciting journey into the world of Natural Language Processing (NLP)! In this project, we'll be delving into the fascinating task of text summarization with a focus on the Arabic language 📚. Our goal is to develop a model that can efficiently summarize Arabic text, making it easier to grasp the essence of large documents quickly 🚀.

## Project Objectives
- **Understanding Text Summarization**: Learn the fundamentals of how text summarization works 📝.
- **Exploring NLP Models**: Get hands-on experience with advanced NLP models like AraGPT2 🤖.
- **Model Fine-Tuning and Training**: Discover how to fine-tune pre-trained models on a custom dataset for specific tasks like summarization 🧠.
- **Practical Application**: Apply your knowledge to build a model that can summarize Arabic texts 🌐.

## Dataset
We'll be using a custom dataset of Arabic texts and their summaries 📖. This dataset will allow us to train our model to understand and generate concise summaries.

We generated this dataset using ChatGPT 😜
If you've read this sentence, send me a message.




## ⚠️ **Important: Use GPU Runtime** ⚠️

To ensure this notebook functions correctly and efficiently, it is **crucial to use a GPU runtime**. Follow these steps to enable GPU acceleration:

1. **Open Runtime settings**: At the top of the page, click on `Runtime` in the menu bar. 🔄

2. **Change the runtime type**: In the dropdown menu, select `Change runtime type`. 🛠️

3. **Select GPU as the hardware accelerator**: In the dialog that appears, under `Hardware accelerator`, choose `GPU T4` from the dropdown menu. 🖥️

4. **Save the settings**: Click `Save` to apply the changes. 💾

By enabling GPU, the computations in this notebook will be significantly faster, especially for tasks like training neural networks, processing large datasets, or performing complex calculations.


## PART1: Load AraGPT2

Using the link below, learn how to load araGPT2 base model.

https://huggingface.co/aubmindlab/aragpt2-base

In [1]:
!pip install arabert

Collecting arabert
  Downloading arabert-1.0.1-py3-none-any.whl (179 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m179.3/179.3 kB[0m [31m3.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting PyArabic (from arabert)
  Downloading PyArabic-0.6.15-py3-none-any.whl (126 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m126.4/126.4 kB[0m [31m10.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting farasapy (from arabert)
  Downloading farasapy-0.0.14-py3-none-any.whl (11 kB)
Collecting emoji==1.4.2 (from arabert)
  Downloading emoji-1.4.2.tar.gz (184 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m185.0/185.0 kB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... [?25l[?25hdone
  Created wheel for emoji: filename=emoji-1.4.2-py3-none-any.whl size=186460 sha256=f93066940e604c0367c376b78859ced234abb

In [2]:
from transformers import GPT2TokenizerFast, pipeline
from transformers import GPT2LMHeadModel
from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel
from arabert.preprocess import ArabertPreprocessor

In [3]:
#TODO: Complete this cell
MODEL_NAME= 'aubmindlab/aragpt2-base'
arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME)

text="الجزائر بلد"
text_clean = arabert_prep.preprocess(text)

model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
tokenizer =GPT2TokenizerFast.from_pretrained(MODEL_NAME)
generation_pipeline =pipeline("text-generation",model=model,tokenizer=tokenizer)

#feel free to try different decoding settings
generation_pipeline(text,
    pad_token_id=tokenizer.eos_token_id,
    num_beams=15,
    max_length=20,
    top_p=0.7,
    repetition_penalty = 3.0,
    no_repeat_ngram_size = 2)[0]['generated_text']

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/843 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/553M [00:00<?, ?B/s]

Some weights of the model checkpoint at aubmindlab/aragpt2-base were not used when initializing GPT2LMHeadModel: ['ln_f.weight', 'ln_f.bias']
- This IS expected if you are initializing GPT2LMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2LMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at aubmindlab/aragpt2-base and are newly initialized: ['emb_norm.bias', 'emb_norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


vocab.json:   0%|          | 0.00/1.94M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.50M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/4.52M [00:00<?, ?B/s]

'الجزائر بلدجب انها ان هي هى و بصورةون الى فيه ، الأسرة بشكلبنرم قيمة كافة'

### Print AraGPT Model and analyze the architecture



# TODO: print AraGPT2
الجزائر بلد هي ، ان بشكلجبرم الىالإ هى و انها فيهون ولا هو الأسرة بصورة


## PART2: Fine-tuning

To fine-tune AraGPT2 for text summarization, we use the file `arabic_texts_summaries.csv`

#### *Fine-tuning Steps:*


1.   Load datasets and split it into train/test
2.   Create Datalaoders of train and val.
3.   Resize model embeddings for new tokenizer length.
4.   Fine-tuning model by passing train data and evaluating it on val data during training.
5.   Store the tokenizer and fine-tuned model.
6.   Generate summaries for test set which is not used during fine tune.



In [4]:
from utils_data import *
from utils_tokenizer import *
from train import *

In [5]:
max_length = 100
sum_length = 150
split_probability = 0.8

In [6]:
train, val, test = process_data("/content/arabic_texts_summaries.csv",max_length , sum_length, split_probability)

train size: 10
val size: 20
test size: 20
test head:
                                                text  \
6  تدور أحداث هذا النص حول حفلة في القرية. يبدأ ا...   

                                             summary  text_len  
6  حفلة تراثية في قرية تظهر العادات والتقاليد الم...        49  


In [7]:
# Add token to AraGPT2 tokenizer
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('aubmindlab/aragpt2-base')

special_tokens = {'bos_token':'<BOS>', 'eos_token':'<EOS>', 'pad_token':'<PAD>', 'additional_special_tokens':['<SUMMARIZE>']}
tokenizer.add_special_tokens(special_tokens)

print('tokenizer len: {}'.format(len(tokenizer)))

ignore_idx = tokenizer.pad_token_id


tokenizer len: 64004


In [9]:
# TODO: apply tokenizer
import os

tokenizer_dir ="tokenizer_path_save"
if not os.path.exists(tokenizer_dir):
  os.makedirs(tokenizer_dir) # Create output directory if needed

max_seq_len = 768
tokenizer.save_pretrained(tokenizer_dir)
tokenizer_len = len(tokenizer)
print('ignore_index: {}'.format(ignore_idx))
print('max_len: {}'.format(max_seq_len))

train_dataset, val_dataset, test_dataset = tokenize_dataset(train, val, test, tokenizer, max_seq_len) # Fix tokenize_dataset function in utils_tokenizer and call it


ignore_index: 64002
max_len: 768


TypeError: 'DataFrame' object is not callable

In [11]:
#Generate train/val/test files
#save tokenized data
out_dir="tokenizer_data"
processed_set= "dataset"
data_dir = os.path.join(out_dir, processed_set)
if not os.path.exists(data_dir):
  os.makedirs(data_dir) # Create output directory if needed
file = os.path.join(data_dir,"train.csv")
train.to_csv(file, index=False)

file = os.path.join(data_dir,"val.csv")
val.to_csv(file, index=False)

file = os.path.join(data_dir,"test.csv")
test.to_csv(file, index=False)

In [21]:
# TODO: Visualize train and explain each column
import pandas as pd

# Read the train.csv file into a DataFrame
train_df = pd.read_csv("tokenizer_data/dataset/train.csv")

# Display the first few rows of the DataFrame
print("First few rows of the train dataset:")
print(train_df.head())

# Display summary statistics of the DataFrame
print("\nSummary statistics of the train dataset:")
print(train_df.describe())

# Display distribution of a specific column (e.g., 'label')
print("\nDistribution of labels:")
print(train_df['category'].value_counts())

# You can also use matplotlib or seaborn for visualization
import matplotlib.pyplot as plt

# Example: Visualize the distribution of labels
train_df['category'].value_counts().plot(kind='bar')
plt.xlabel('Label')
plt.ylabel('Count')
plt.title('Distribution of Labels')
plt.show()


First few rows of the train dataset:
                                                text  \
0  تدور أحداث هذا النص حول مهرجان ثقافي. يبدأ الن...   
1  تدور أحداث هذا النص حول مهرجان ثقافي. يبدأ الن...   
2  تدور أحداث هذا النص حول رحلة بحرية. يبدأ النص ...   
3  تدور أحداث هذا النص حول مغامرة في الجبال. يبدأ...   
4  تدور أحداث هذا النص حول يوم في السوق. يبدأ الن...   

                                           summary  text_len  
0  الاحتفال بمهرجان ثقافي يعرض فنون وثقافات متنوعة        46  
1  الاحتفال بمهرجان ثقافي يعرض فنون وثقافات متنوعة        46  
2                  مغامرة بحرية تستكشف عجائب البحر        46  
3            تجربة تسلق جبال شاهقة واكتشاف الطبيعة        49  
4      يوم حافل بالتسوق واستكشاف الأسواق التقليدية        49  

Summary statistics of the train dataset:
        text_len
count  10.000000
mean   47.500000
std     1.581139
min    46.000000
25%    46.000000
50%    47.500000
75%    49.000000
max    49.000000

Distribution of labels:


KeyError: 'category'

In [20]:
# TODO: Data Loaders
# Fix code in utils_data.py

import torch
train_dataset, val_dataset= get_gpt2_dataset(train_df, val)# call function get_gpt2_dataset

b = train_dataset.__getitem__(0) # check one data row

train_dataloader = DataLoader(train_dataset, sampler = RandomSampler(train_dataset), batch_size = 1)
val_dataloader = DataLoader(val_dataset, sampler = SequentialSampler(val_dataset), batch_size = 1)

train_loader_len =len(train_dataloader)

KeyError: 'encodings'