# About This Project

# BERT Model and Tokenizers

In many cases, the architecture you want can be inferred from the name or path of the pretrained model you pass to the `from_pretrained` method.

The models were pretrained on ~8.2 billion words:

* Arabic version of [OSCAR](https://oscar-corpus.com/) (unshuffled version of the corpus), filtered from [Common Crawl](http://commoncrawl.org/)
* Recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)
* Other Arabic resources which sum up to ~95GB of text

The pretraining procedure follows the original BERT training settings with some changes: the models were trained for 4M steps with a batch size of 128, instead of 1M steps with a batch size of 256.
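
As a quick sanity check on what that change means, the arithmetic below multiplies steps by batch size for both settings. It only uses the figures quoted above; the variable names are illustrative.

```python
# Sequences processed = training steps x batch size.
original_bert = 1_000_000 * 256   # 1M steps, batch size 256
this_model = 4_000_000 * 128      # 4M steps, batch size 128

ratio = this_model / original_bert
print(f"{original_bert:,} vs {this_model:,} sequences ({ratio:.0f}x)")
```

Under these figures the smaller batch size is more than offset by the longer schedule, so roughly twice as many sequences are processed overall.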

|  | BERT-Mini | BERT-Medium   | BERT-Base  | BERT-Large  |
|:---:|:---:|:---:|:---:|:---:|
| Hidden Layers | 4 | 8 | 12 | 24 |
| Attention heads | 4 | 8 | 12 | 16 |
| Hidden size | 256 | 512 | 768 | 1024 |
| Parameters | 11M | 42M | 110M | 340M |
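
The parameter counts in the table can be roughly reproduced from the hidden size and layer count. The sketch below is an estimate, not the authors' calculation: it assumes a feed-forward width of 4x the hidden size, 512 positions, 2 segment types, and a vocabulary of 30,522 tokens (the size used by the original English BERT; the Arabic vocabulary may differ). `bert_param_count` and `ASSUMED_VOCAB` are hypothetical names.

```python
def bert_param_count(hidden, layers, vocab_size, max_positions=512, type_vocab=2):
    """Rough parameter count for a BERT encoder with pooler (no task head)."""
    intermediate = 4 * hidden  # assumed feed-forward width
    # Token, position and segment embeddings, plus the embedding LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Query, key, value and output projections (weights + biases)
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers of the feed-forward block (weights + biases)
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    layer_norms = 2 * (2 * hidden)  # two LayerNorms per encoder layer
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward + layer_norms) + pooler


ASSUMED_VOCAB = 30_522  # assumption: vocabulary size of the original English BERT
sizes = [("Mini", 256, 4), ("Medium", 512, 8), ("Base", 768, 12), ("Large", 1024, 24)]
for name, hidden, layers in sizes:
    estimate = bert_param_count(hidden, layers, ASSUMED_VOCAB)
    print(f"{name:6s} ~{estimate / 1e6:.1f}M parameters")
```

The number of attention heads does not appear in the formula because splitting the hidden dimension across heads does not change the size of the projection matrices.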

* Mini:   *asafaya/bert-mini-arabic*
* Medium: *asafaya/bert-medium-arabic*
* Base:   *asafaya/bert-base-arabic*
* Large:  *asafaya/bert-large-arabic*



Hugging Face exposes both the pretrained tokenizer and the pretrained model through one-line calls:

* `tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")`
* `model = AutoModel.from_pretrained("asafaya/bert-base-arabic")`

# Look inside the dataset files

The dataset files are already split into a training set (`train.csv`) and a development set (`dev.csv`).

In [3]:
import os
import sys

import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pylab import rcParams
from matplotlib import rc
import joblib

from transformers import AutoTokenizer, AutoModel

In [2]:
pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1


In [5]:
pip install pytorch_lightning

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytorch_lightning
  Downloading pytorch_lightning-1.8.6-py3-none-any.whl (800 kB)
Collecting torchmetrics>=0.7.0
  Downloading torchmetrics-0.11.0-py3-none-any.whl (512 kB)
Collecting tensorboardX>=2.2
  Downloading tensorboardX-2.5.1-py2.py3-none-any.whl (125 kB)
Collecting lightning-utilities!=0.4.0,>=0.3.0
  Downloading lightning_utilities-0.5.0-py3-none-any.whl (18 kB)
Installing collected packages: torchmetrics, tensorboardX, lightning-utilities, pytorch-lightning
Successfully installed lightning-utilities-0.5.0 pytorch-lightning-1.8.6 tensorboardX-2.5.1 torchmetrics-0.11.0


In [30]:
!pip install Arabic-Stopwords

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Arabic-Stopwords
  Downloading Arabic_Stopwords-0.3-py3-none-any.whl (353 kB)
Collecting pyarabic>=0.6.2
  Downloading PyArabic-0.6.15-py3-none-any.whl (126 kB)
Installing collected packages: pyarabic, Arabic-Stopwords
Successfully installed Arabic-Stopwords-0.3 pyarabic-0.6.15


In [32]:
!pip install tashaphyne

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tashaphyne
  Downloading Tashaphyne-0.3.6-py3-none-any.whl (251 kB)
Installing collected packages: tashaphyne
Successfully installed tashaphyne-0.3.6


In [7]:
train = pd.read_csv("/content/train.csv")
val = pd.read_csv("/content/dev.csv")
train.head(10)

Unnamed: 0,text,category,stance
0,بيل غيتس يتلقى لقاح #كوفيد19 من غير تصوير الاب...,celebrity,1
1,وزير الصحة لحد اليوم وتحديدا هلأ بمؤتمروا الصح...,info_news,1
2,قولكن رح يكونو اد المسؤولية ب لبنان لما يوصل ...,info_news,1
3,#تركيا.. وزير الصحة فخر الدين قوجة يتلقى أول ج...,celebrity,1
4,وئام وهاب يشتم الدول الخليجية في كل طلة اعلامي...,personal,0
5,"لقاح #كورونا في أميركا.. قلق متزايد من ""التوزي...",info_news,0
6,لبنان اشترى مليونان لقاح امريكي اذا شلنا يلي ع...,info_news,1
7,من عوارض لقاح كورونا<LF>هو تهكير حسابك عتويتر<...,personal,0
8,هناك 1780 مليونيراً في لبنان. ماذا لو فُرضت ال...,unrelated,0
9,دعبول حضرتك منو انت وتطلب من قائد دولة إسلامية...,info_news,1


In [33]:
import re
import arabicstopwords.arabicstopwords as ast
from tashaphyne.stemming import ArabicLightStemmer

punctuations_list = '''^_-`$%&÷×؛<=>()*&^%][،/;:"؟.,'{}~¦+|!”…“'''
#english_punctuations = string.punctuation
#print(english_punctuations)
#punctuations_list = arabic_punctuations

def remove_punctuations(text):
    translator = str.maketrans(punctuations_list, ' '*len(punctuations_list))
    return text.translate(translator)
    
def remove_emoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text) # no emoji

# Remove Arabic stop words; the stop-word set is built once so lookups stay fast
STOP_WORDS = set(ast.stopwords_list())
def remove_stop_words(text):
    return ' '.join(word for word in text.split() if word not in STOP_WORDS)

# Reduce each word to its light stem
ArListem = ArabicLightStemmer()
def lemmatiz_word(text):
    # lemmer = qalsadi.lemmatizer.Lemmatizer()
    # return ' '.join(lemmer.lemmatize(word) for word in text.split()) 
    #---------
    #st = ISRIStemmer()
    #return ' '.join(st.stem(word) for word in text.split())
    return ' '.join(ArListem.light_stem(word) for word in text.split())
    
def processPost(tweet): 

    # Replace the literal <LF> marker with a space
    tweet = re.sub('<LF>', ' ', tweet)
    
    # Replace @username mentions with a space
    tweet = re.sub(r'@[^\s]+', ' ', tweet)
    
    # Remove URLs
    tweet = re.sub(r'(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})',' ',tweet)
    
    # Strip the leading # from hashtags, keeping the tag text
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)

    # remove punctuations
    tweet = remove_punctuations(tweet)
    
    # remove emoji  (not sure with this step)
    tweet = remove_emoji(tweet)
    
    # normalize the tweet
    # tweet= normalize_arabic(tweet)
    
    # collapse any run of a repeated character into a single character
    tweet = re.sub(r'(.)\1+', r'\1', tweet)

    # remove stop words
    tweet = remove_stop_words(tweet)

    # light-stem each remaining word
    tweet = lemmatiz_word(tweet)
    
    return tweet

train["text"] = train['text'].apply(lambda x: processPost(x))
val['text'] = val['text'].apply(lambda x: processPost(x))    
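
To make the effect of the regex steps concrete, here is a minimal standalone sketch that applies only the markup-related substitutions to a made-up tweet. It is not the full `processPost` pipeline: punctuation, emoji, stop-word removal and stemming are left out, the URL pattern is a simplified stand-in, and `strip_tweet_markup` and `sample` are hypothetical names.

```python
import re


def strip_tweet_markup(tweet: str) -> str:
    """Apply the <LF>, mention, URL, hashtag and repeated-character steps only."""
    tweet = tweet.replace("<LF>", " ")                     # literal line-feed marker
    tweet = re.sub(r"@\S+", " ", tweet)                    # @username mentions
    tweet = re.sub(r"https?://\S+|www\.\S+", " ", tweet)   # simplified URL pattern (assumption)
    tweet = re.sub(r"#(\S+)", r"\1", tweet)                # keep hashtag text, drop '#'
    tweet = re.sub(r"(.)\1+", r"\1", tweet)                # collapse repeated characters
    return " ".join(tweet.split())                         # normalise whitespace


sample = "@user شكراااا<LF>#لقاح https://t.co/abc"  # made-up example tweet
print(strip_tweet_markup(sample))  # → شكرا لقاح

# Caveat: the repeated-character rule applies to every character, digits included.
print(re.sub(r"(.)\1+", r"\1", "100"))  # → 10
```

The second print shows why the repeated-character step deserves some care: it shortens elongated words, but it also rewrites numbers and any word that legitimately contains a doubled character.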

In [35]:
print(val["text"].apply(len).mean())

81.327
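
Note that the mean above is measured in characters, whereas `max_len` in the dataset classes counts tokens. One way to pick `max_len` is to look at the distribution of token counts and choose a value that covers most texts. The sketch below does this with whitespace tokens as a cheap proxy; that is an assumption, since WordPiece usually produces more tokens per text, so the real tokenizer should be used for an actual decision. `suggest_max_len` and `example_texts` are hypothetical.

```python
def suggest_max_len(texts, coverage=0.95, special_tokens=2):
    """Return a length that fits roughly `coverage` of the texts.

    Counts whitespace-separated tokens plus `special_tokens`
    (room for [CLS] and [SEP]).
    """
    counts = sorted(len(text.split()) + special_tokens for text in texts)
    index = min(len(counts) - 1, int(coverage * len(counts)))
    return counts[index]


example_texts = ["a b c", "a b", "a b c d e", "a"]  # placeholder data
print(suggest_max_len(example_texts))                # → 7
print(suggest_max_len(example_texts, coverage=0.5))  # → 5
```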


In [None]:
# %matplotlib inline
# %config InlineBackend.figure_format='retina'

# tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-mini-arabic")

# sns.set(style='whitegrid', palette='muted', font_scale=1.2)
# rcParams['figure.figsize'] = 16, 6

# text_token_counts = df['clean_text'].apply(lambda x : len(tokenizer.encode(x)))
# fig, (ax1, ax2) = plt.subplots(1, 2)
# sns.histplot(text_token_counts, ax=ax1)
# sns.boxplot(text_token_counts, ax=ax2)

# Dataset Module

In [6]:
import torch
import pytorch_lightning as pl

from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

Feature engineering and data files

In [8]:
from sklearn.preprocessing import LabelEncoder
import joblib
#in previous cell we read datafiles 
# train = pd.concat((train_pos,train_neg),axis=0).sample(frac=1.0).reset_index(drop=True)
# val = pd.concat((test_pos,test_neg),axis=0).sample(frac=1.0).reset_index(drop=True)
# train = train.rename(columns={0:"label",1:"text"})
# val = val.rename(columns={0:"label",1:"text"})
lbl_enc = LabelEncoder()
train.loc[:,"stance"] = lbl_enc.fit_transform(train["stance"])
val.loc[:,"stance"] = lbl_enc.transform(val["stance"])
joblib.dump(lbl_enc,"label_encoder.pkl")
train.to_csv("train.csv",index=False)
val.to_csv("dev.csv",index=False)

In [11]:
lbl_enc.classes_

array([-1,  0,  1])
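
The encoder maps the three stance values to the indices 0, 1 and 2, which is the form the 3-way classification head and the cross-entropy loss expect. The pure-Python sketch below mirrors that mapping (sorted unique classes mapped to their positions) on made-up labels; it is an illustration of the idea, not the scikit-learn implementation.

```python
raw_labels = [1, 0, -1, 1, 0]  # hypothetical stance values

classes = sorted(set(raw_labels))                  # same order as lbl_enc.classes_
to_index = {label: i for i, label in enumerate(classes)}

encoded = [to_index[label] for label in raw_labels]
decoded = [classes[i] for i in encoded]            # inverse mapping

print(classes)   # → [-1, 0, 1]
print(encoded)   # → [2, 1, 0, 2, 1]
print(decoded)   # → [1, 0, -1, 1, 0]
```

Keeping the fitted encoder (here via `joblib.dump`) matters because predicted indices can only be turned back into stance values with the same class order.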

In [42]:
class ArabicDataset(Dataset):
    def __init__(self,data,max_len,model_type="Mini"):
        super().__init__()
        self.labels = data["stance"].values
        #data["text"] = data['text'].apply(lambda x: processPost(x)) # apply preprocessing
        self.texts = data["text"].values
        self.max_len = max_len
        model = {"Mini": "asafaya/bert-mini-arabic",
                "Medium": "asafaya/bert-medium-arabic",
                "Base": "asafaya/bert-base-arabic",
                "Large": "asafaya/bert-large-arabic"}
        self.tokenizer = AutoTokenizer.from_pretrained(model[model_type])
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self,idx):
        text = " ".join(self.texts[idx].split())
        label = self.labels[idx]
        inputs = self.tokenizer(text,padding='max_length',
                                max_length=self.max_len,truncation=True,return_tensors="pt")
        #input_ids,token_type_ids,attention_mask
        return {
            "inputs":{"input_ids":inputs["input_ids"][0],
                      "token_type_ids":inputs["token_type_ids"][0],
                      "attention_mask":inputs["attention_mask"][0],
                     },
            "labels": torch.tensor(label,dtype=torch.long) 
        }
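
The `padding='max_length'` and `truncation=True` arguments make every example the same length so that examples can be stacked into a batch, and the attention mask marks which positions hold real tokens. A simplified sketch of that idea follows. It assumes a padding id of 0 and plain right-truncation; the real tokenizer also keeps the closing special token when truncating. `pad_or_truncate` and the token ids are hypothetical.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = list(token_ids[:max_len])          # cut off anything past max_len
    mask = [1] * len(ids)                    # 1 = real token
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 = padding


ids, mask = pad_or_truncate([2, 15, 27, 3], max_len=6)
print(ids)   # → [2, 15, 27, 3, 0, 0]
print(mask)  # → [1, 1, 1, 1, 0, 0]

ids, mask = pad_or_truncate([2, 15, 27, 3], max_len=3)
print(ids)   # → [2, 15, 27]
print(mask)  # → [1, 1, 1]
```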
        


Peeking into the dataset module

In [None]:
# arabic_dataset = ArabicDataset(train,100)
# print(next(iter(arabic_dataset)))

In [37]:
class ArabicDataModule(pl.LightningDataModule):
    def __init__(self,train_path,val_path,batch_size=12,max_len=100,model_type="Mini"):
        super().__init__()
        self.train_path,self.val_path= train_path,val_path
        self.batch_size = batch_size
        self.max_len = max_len
        self.model_type = model_type
    
    def setup(self,stage=None):
        train = pd.read_csv(self.train_path)
        val = pd.read_csv(self.val_path)
        self.train_dataset = ArabicDataset(data=train,max_len=self.max_len,model_type=self.model_type)
        self.val_dataset = ArabicDataset(data=val,max_len=self.max_len,model_type=self.model_type)
    
    def train_dataloader(self):
        return DataLoader(self.train_dataset,batch_size=self.batch_size,shuffle=True,num_workers=4)
    
    def val_dataloader(self):
        return DataLoader(self.val_dataset,batch_size=self.batch_size,shuffle=False,num_workers=4)

Peeking into the dataloader module

In [None]:
# load = ArabicDataModule(train_path="./train.csv",
#                            val_path = "./train.csv",
#                 batch_size=12,max_len=20)
# load.setup()
# next(iter(load.train_dataloader()))

# BERT Fine-Tuning Module

In [15]:
import torch
from torch import nn,optim

from transformers import AutoTokenizer, AutoModel

from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split

# Run and Save Checkpoints

In [38]:
class ArabicBertModel(pl.LightningModule):
    def __init__(self,model_type="Mini"):
        super().__init__()
        model = {"Mini": ("asafaya/bert-mini-arabic",256),
                "Medium": ("asafaya/bert-medium-arabic",512),
                "Base": ("asafaya/bert-base-arabic",768),
                "Large": ("asafaya/bert-large-arabic",1024)}
        self.bert_model = AutoModel.from_pretrained(model[model_type][0])
        self.fc = nn.Linear(model[model_type][1],3)
    
    def forward(self,inputs):
        out = self.bert_model(**inputs)#inputs["input_ids"],inputs["token_type_ids"],inputs["attention_mask"])
        pooler = out[1]
        out = self.fc(pooler)
        return out
    
    def configure_optimizers(self):
        return optim.AdamW(self.parameters(), lr=0.0001)
    
    def criterion(self,output,target):
        return nn.CrossEntropyLoss()(output,target)
    
    #TODO: adding metrics
    def training_step(self,batch,batch_idx):
        x,y = batch["inputs"],batch["labels"]
        out = self(x)
        loss = self.criterion(out,y)
        return loss
    
    def validation_step(self,batch,batch_idx):
        x,y = batch["inputs"],batch["labels"]
        out = self(x)
        loss = self.criterion(out,y)
        return loss
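
The `forward` method returns raw logits from the linear layer, with no softmax applied, which is the input `nn.CrossEntropyLoss` expects. For a single example, the loss is the negative log of the softmax probability assigned to the true class. The standalone sketch below computes that by hand with made-up logits; `cross_entropy` is a hypothetical helper, not the PyTorch implementation.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of the target class for one example."""
    largest = max(logits)  # subtract the max for numerical stability
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[target]


uniform = cross_entropy([0.0, 0.0, 0.0], target=2)    # equals ln(3)
confident = cross_entropy([0.1, 0.2, 4.0], target=2)  # much smaller loss

print(f"uniform logits:   {uniform:.4f}")
print(f"confident logits: {confident:.4f}")
```

With three classes, a loss near ln(3) ≈ 1.0986 therefore means the model is doing no better than assigning equal probability to every stance.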

In [39]:
#TODO: getting different models sizes results
MODEL_TYPE = "Mini"
dm = ArabicDataModule(train_path="/content/train.csv",
                val_path = "/content/dev.csv",  # validate on the held-out dev split, not on the training data
                batch_size=128,max_len=80, model_type=MODEL_TYPE)

model = ArabicBertModel(model_type=MODEL_TYPE)
trainer = pl.Trainer(max_epochs=3)
trainer.fit(model,dm)

Some weights of the model checkpoint at asafaya/bert-mini-arabic were not used when initializing BertModel: ['cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_li

Sanity Checking: 0it [00:00, ?it/s]



Training: 0it [00:00, ?it/s]

Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f83dafacdc0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1449, in _shutdown_workers
    if w.is_alive():
  File "/usr/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


# Results and Discussions

In [40]:
from tqdm.auto import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

preds = []
real_values = []

load = ArabicDataModule(train_path="/content/train.csv",
                           val_path = "/content/dev.csv",
                batch_size=512,max_len=60)
load.setup()
test_dataloader = load.val_dataloader()

progress_bar = tqdm(range(len(test_dataloader)))

model.eval()
for batch in test_dataloader:    
    x,y = batch["inputs"],batch["labels"]
    inp = {k: v.to(device) for k, v in x.items()}
    
    with torch.no_grad():
        outputs = model(inp)

    predictions = torch.argmax(outputs, dim=-1)
    
    preds.extend(predictions)
    real_values.extend(y)

    progress_bar.update()
    
preds = torch.stack(preds).cpu()
real_values = torch.stack(real_values).cpu()

  0%|          | 0/2 [00:00<?, ?it/s]

In [43]:
from sklearn.metrics import classification_report

print(classification_report(real_values, preds, target_names=list(map(str,lbl_enc.classes_))))

              precision    recall  f1-score   support

          -1       0.55      0.17      0.26        70
           0       0.43      0.44      0.43       126
           1       0.87      0.92      0.90       804

    accuracy                           0.81      1000
   macro avg       0.61      0.51      0.53      1000
weighted avg       0.79      0.81      0.79      1000
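
To read the summary rows, it helps to recompute them from the per-class rows. The sketch below uses the rounded values printed in the table above, so recomputed figures can differ from the report in the last digit. It also computes the accuracy of always predicting the largest class, as a point of comparison.

```python
# class: (precision, recall, f1, support), copied from the report above
report = {
    "-1": (0.55, 0.17, 0.26, 70),
    "0": (0.43, 0.44, 0.43, 126),
    "1": (0.87, 0.92, 0.90, 804),
}

total = sum(row[3] for row in report.values())

# Macro average: every class counts equally, regardless of its size
macro_f1 = sum(row[2] for row in report.values()) / len(report)

# Weighted average: each class is weighted by its support
weighted_recall = sum(row[1] * row[3] for row in report.values()) / total

# Accuracy obtained by always predicting the most frequent class
majority_baseline = max(row[3] for row in report.values()) / total

print(f"macro F1:          {macro_f1:.2f}")
print(f"weighted recall:   {weighted_recall:.2f}")
print(f"majority baseline: {majority_baseline:.3f}")
```

The baseline comes out at 0.804, very close to the reported accuracy of 0.81, so the macro-averaged scores and the per-class rows for `-1` and `0` say more about this model than the accuracy does.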

