# About This Project

# BERT Model and Tokenizers

In many cases, the architecture you want to use can be guessed from the name or path of the pretrained model you supply to the `from_pretrained` method.

The models were pretrained on ~8.2 billion words:

* Arabic version of [OSCAR](https://oscar-corpus.com/) (unshuffled version of the corpus), filtered from [Common Crawl](http://commoncrawl.org/)
* Recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)
* Other Arabic resources, which sum up to ~95GB of text

The pretraining procedure follows the training settings of BERT with some changes: the models were trained for 4M training steps with a batch size of 128, instead of 1M steps with a batch size of 256.
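To put the two schedules side by side, multiply the number of steps by the batch size to get the number of training sequences processed. This is only an arithmetic sketch; the variable names are illustrative and not part of the original training setup.

```python
# Sequences processed = training steps x batch size.
arabic_bert_sequences = 4_000_000 * 128    # schedule described for these models
original_bert_sequences = 1_000_000 * 256  # schedule quoted for the original BERT

ratio = arabic_bert_sequences / original_bert_sequences
print(arabic_bert_sequences, original_bert_sequences, ratio)
```

Under these numbers the modified schedule processes twice as many sequences, even though each batch is half the size.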

|  | BERT-Mini | BERT-Medium   | BERT-Base  | BERT-Large  |
|:---:|:---:|:---:|:---:|:---:|
| Hidden Layers | 4 | 8 | 12 | 24 |
| Attention heads | 4 | 8 | 12 | 16 |
| Hidden size | 256 | 512 | 768 | 1024 |
| Parameters | 11M | 42M | 110M | 340M |

* Mini:   *asafaya/bert-mini-arabic* 
* Medium: *asafaya/bert-medium-arabic* 
* Base:   *asafaya/bert-base-arabic*
* Large:  *asafaya/bert-large-arabic* 
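The parameter counts in the table above can be roughly reproduced from the number of layers and the hidden size. The sketch below is a back-of-the-envelope estimate, not the authors' calculation. It assumes a vocabulary of 32,000 tokens, 512 positions, and a feed-forward size of four times the hidden size (none of which are stated in this document), and it ignores biases, LayerNorm, and the pooler. `approx_bert_params` is a hypothetical helper.

```python
def approx_bert_params(layers, hidden, vocab_size=32_000, max_positions=512):
    """Rough BERT encoder parameter count, ignoring biases and LayerNorm."""
    # Token, position, and 2 segment-type embeddings (vocab_size is an assumption).
    embeddings = (vocab_size + max_positions + 2) * hidden
    # Per layer: 4*h^2 for attention (Q, K, V, output projection)
    # plus 8*h^2 for a feed-forward block with intermediate size 4*h.
    encoder = layers * 12 * hidden ** 2
    return embeddings + encoder

sizes = {"Mini": (4, 256), "Medium": (8, 512), "Base": (12, 768), "Large": (24, 1024)}
for name, (layers, hidden) in sizes.items():
    print(f"{name}: ~{approx_bert_params(layers, hidden) / 1e6:.1f}M")
```

Under these assumptions each estimate lands within about 5% of the corresponding figure in the table. It also shows that the embedding matrix dominates the Mini model while the encoder layers dominate the Large one.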



Hugging Face lets you load the pretrained tokenizer and model with a single line each:

* `tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")`
* `model = AutoModel.from_pretrained("asafaya/bert-base-arabic")`

# Look inside the dataset files

The dataset files are already divided into train and test sets.

In [45]:
import os
import sys

import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pylab import rcParams
from matplotlib import rc
import joblib

from transformers import AutoTokenizer, AutoModel

In [1]:
pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1


In [3]:
pip install pytorch_lightning

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytorch_lightning
  Downloading pytorch_lightning-1.8.6-py3-none-any.whl (800 kB)
Collecting lightning-utilities!=0.4.0,>=0.3.0
  Downloading lightning_utilities-0.5.0-py3-none-any.whl (18 kB)
Collecting tensorboardX>=2.2
  Downloading tensorboardX-2.5.1-py2.py3-none-any.whl (125 kB)
Collecting torchmetrics>=0.7.0
  Downloading torchmetrics-0.11.0-py3-none-any.whl (512 kB)
Installing collected packages: torchmetrics, tensorboardX, lightning-utilities, pytorch-lightning
Successfully installed lightning-utilities-0.5.0 pytorch-lightning-1.8.6 tensorboardX-2.5.1 torchmetrics-0.11.0


In [58]:
train = pd.read_csv("/content/train.csv")
val = pd.read_csv("/content/dev.csv")
train.head(10)

Unnamed: 0,text,category,stance
0,بيل غيتس يتلقى لقاح #كوفيد19 من غير تصوير الاب...,celebrity,2
1,وزير الصحة لحد اليوم وتحديدا هلأ بمؤتمروا الصح...,info_news,2
2,قولكن رح يكونو اد المسؤولية ب لبنان لما يوصل ...,info_news,2
3,#تركيا.. وزير الصحة فخر الدين قوجة يتلقى أول ج...,celebrity,2
4,وئام وهاب يشتم الدول الخليجية في كل طلة اعلامي...,personal,1
5,"لقاح #كورونا في أميركا.. قلق متزايد من ""التوزي...",info_news,1
6,لبنان اشترى مليونان لقاح امريكي اذا شلنا يلي ع...,info_news,2
7,من عوارض لقاح كورونا<LF>هو تهكير حسابك عتويتر<...,personal,1
8,هناك 1780 مليونيراً في لبنان. ماذا لو فُرضت ال...,unrelated,1
9,دعبول حضرتك منو انت وتطلب من قائد دولة إسلامية...,info_news,2


In [None]:
print(val["text"].apply(len).mean())

81.327


In [None]:
# %matplotlib inline
# %config InlineBackend.figure_format='retina'

# tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-mini-arabic")

# sns.set(style='whitegrid', palette='muted', font_scale=1.2)
# rcParams['figure.figsize'] = 16, 6

# text_token_counts = df['clean_text'].apply(lambda x : len(tokenizer.encode(x)))
# fig, (ax1, ax2) = plt.subplots(1, 2)
# sns.histplot(text_token_counts, ax=ax1)
# sns.boxplot(text_token_counts, ax=ax2)

# Dataset Module

In [22]:
import torch
import pytorch_lightning as pl

from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

Feature engineering and data files

In [59]:
from sklearn.preprocessing import LabelEncoder
import joblib
# The data files were read in a previous cell.
# train = pd.concat((train_pos,train_neg),axis=0).sample(frac=1.0).reset_index(drop=True)
# val = pd.concat((test_pos,test_neg),axis=0).sample(frac=1.0).reset_index(drop=True)
# train = train.rename(columns={0:"label",1:"text"})
# val = val.rename(columns={0:"label",1:"text"})
lbl_enc = LabelEncoder()
# Fit the encoder on the training labels only, then reuse the same mapping for validation.
train.loc[:,"category"] = lbl_enc.fit_transform(train["category"])
val.loc[:,"category"] = lbl_enc.transform(val["category"])
joblib.dump(lbl_enc,"label_encoder.pkl")
train.to_csv("train.csv",index=False)
val.to_csv("dev.csv",index=False)
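One thing to watch with label encoding is that the integer ids depend on which labels the encoder saw when it was fitted. The sketch below uses a small pure-Python stand-in for `LabelEncoder` (which also assigns ids in sorted label order) to show how fitting a second time on a split that lacks some categories would shift the ids. `fit_label_map` and the toy label lists are hypothetical and only for illustration.

```python
def fit_label_map(labels):
    """Map each distinct label to an integer id, in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}

# Toy data: the validation split happens to lack two categories.
train_labels = ["advice", "celebrity", "info_news", "personal"]
val_labels = ["celebrity", "personal"]

train_map = fit_label_map(train_labels)  # mapping the model is trained with
refit_map = fit_label_map(val_labels)    # mapping produced by a second, separate fit

print(train_map["personal"], refit_map["personal"])
```

Here `personal` is id 3 under the training mapping but id 1 after refitting, so the safer pattern is to fit once on the training labels and only transform the other splits.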

In [60]:
lbl_enc.classes_

array(['advice', 'celebrity', 'info_news', 'others', 'personal', 'plan',
       'requests', 'restrictions', 'rumors', 'unrelated'], dtype=object)

In [62]:
class ArabicDataset(Dataset):
    def __init__(self,data,max_len,model_type="Mini"):
        super().__init__()
        self.labels = data["category"].values
        #data["text"] = data['text'].apply(lambda x: processPost(x)) # apply post-processing
        self.texts = data["text"].values
        self.max_len = max_len
        model = {"Mini": "asafaya/bert-mini-arabic",
                "Medium": "asafaya/bert-medium-arabic",
                "Base": "asafaya/bert-base-arabic",
                "Large": "asafaya/bert-large-arabic"}
        self.tokenizer = AutoTokenizer.from_pretrained(model[model_type])
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self,idx):
        text = " ".join(self.texts[idx].split())
        label = self.labels[idx]
        inputs = self.tokenizer(text,padding='max_length',
                                max_length=self.max_len,truncation=True,return_tensors="pt")
        #input_ids,token_type_ids,attention_mask
        return {
            "inputs":{"input_ids":inputs["input_ids"][0],
                      "token_type_ids":inputs["token_type_ids"][0],
                      "attention_mask":inputs["attention_mask"][0],
                     },
            "labels": torch.tensor(label,dtype=torch.long) 
        }
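To make the effect of `padding='max_length'`, `max_length`, and `truncation=True` concrete, here is a simplified pure-Python sketch of what happens to one sequence of token ids. It is not the tokenizer's implementation: the token ids are made up, a padding id of 0 is an assumption, and the real tokenizer also keeps its special tokens in place when it truncates, which this sketch does not model.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = token_ids[:max_len]          # truncation: drop anything past max_len
    mask = [1] * len(ids)              # 1 marks a real token
    padding = max_len - len(ids)
    # 0 in the mask marks padding that attention should ignore.
    return ids + [pad_id] * padding, mask + [0] * padding

short_ids, short_mask = pad_or_truncate([2, 57, 908, 3], max_len=6)
long_ids, long_mask = pad_or_truncate([2, 57, 908, 41, 77, 12, 9, 3], max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every item comes out with the same length, the default `DataLoader` collation can stack the items into one tensor per key without a custom collate function.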
        


Peeking into the dataset module

In [None]:
# arabic_dataset = ArabicDataset(train,100)
# print(next(iter(arabic_dataset)))

In [63]:
class ArabicDataModule(pl.LightningDataModule):
    def __init__(self,train_path,val_path,batch_size=12,max_len=100,model_type="Mini"):
        super().__init__()
        self.train_path,self.val_path= train_path,val_path
        self.batch_size = batch_size
        self.max_len = max_len
        self.model_type = model_type
    
    def setup(self,stage=None):
        train = pd.read_csv(self.train_path)
        val = pd.read_csv(self.val_path)
        self.train_dataset = ArabicDataset(data=train,max_len=self.max_len,model_type=self.model_type)
        self.val_dataset = ArabicDataset(data=val,max_len=self.max_len,model_type=self.model_type)
    
    def train_dataloader(self):
        return DataLoader(self.train_dataset,batch_size=self.batch_size,shuffle=True,num_workers=4)
    
    def val_dataloader(self):
        return DataLoader(self.val_dataset,batch_size=self.batch_size,shuffle=False,num_workers=4)

Peeking into the dataloader module

In [None]:
# load = ArabicDataModule(train_path="./train.csv",
#                            val_path = "./train.csv",
#                 batch_size=12,max_len=20)
# load.setup()
# next(iter(load.train_dataloader()))

# BERT Fine-Tuning Module

In [27]:
import torch
from torch import nn,optim

from transformers import AutoTokenizer, AutoModel

from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split

# Run and Save Checkpoints

In [68]:
class ArabicBertModel(pl.LightningModule):
    def __init__(self,model_type="Mini"):
        super().__init__()
        model = {"Mini": ("asafaya/bert-mini-arabic",256),
                "Medium": ("asafaya/bert-medium-arabic",512),
                "Base": ("asafaya/bert-base-arabic",768),
                "Large": ("asafaya/bert-large-arabic",1024)}
        self.bert_model = AutoModel.from_pretrained(model[model_type][0])
        self.fc = nn.Linear(model[model_type][1],10)
    
    def forward(self,inputs):
        out = self.bert_model(**inputs)#inputs["input_ids"],inputs["token_type_ids"],inputs["attention_mask"])
        pooler = out[1]
        out = self.fc(pooler)
        return out
    
    def configure_optimizers(self):
        return optim.AdamW(self.parameters(), lr=0.0001)
    
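    def criterion(self,output,target):
        # Per-class weights; created on the same device as the logits so the loss also works on GPU.
        weights = torch.tensor([9.0, 2.0, 1.0, 3.0, 1.0, 3.0, 3.0, 9.0, 9.0, 3.0], device=output.device)
        return nn.CrossEntropyLoss(weight=weights)(output,target)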
    
    #TODO: adding metrics
    def training_step(self,batch,batch_idx):
        x,y = batch["inputs"],batch["labels"]
        out = self(x)
        loss = self.criterion(out,y)
        return loss
    
    def validation_step(self,batch,batch_idx):
        x,y = batch["inputs"],batch["labels"]
        out = self(x)
        loss = self.criterion(out,y)
        return loss
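The `weight` argument passed to `nn.CrossEntropyLoss` rescales each sample's loss by the weight of its true class, and with the default mean reduction the total is divided by the sum of those weights. The pure-Python sketch below reimplements that computation on a two-class toy example to show the effect. The function name, logits, and weights are illustrative only and are not the values used in the model above.

```python
import math

def weighted_cross_entropy(logits, targets, class_weights):
    """Weighted mean of per-sample negative log-likelihoods."""
    total, weight_sum = 0.0, 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the logits of one sample.
        m = max(row)
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_norm - row[target]
        w = class_weights[target]
        total += w * nll
        weight_sum += w
    return total / weight_sum

logits = [[2.0, 0.0], [0.0, 0.0]]  # sample 0 is confident, sample 1 is undecided
targets = [0, 1]
uniform = weighted_cross_entropy(logits, targets, [1.0, 1.0])
skewed = weighted_cross_entropy(logits, targets, [1.0, 9.0])
print(uniform, skewed)
```

Raising the weight of class 1 makes the undecided sample dominate the loss, which is the intended effect of giving rare categories a larger weight.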

In [69]:
#TODO: get results for the different model sizes
MODEL_TYPE = "Mini"
# Validate on the held-out dev file rather than on the training file.
dm = ArabicDataModule(train_path="/content/train.csv",
                val_path = "/content/dev.csv",
                batch_size=128,max_len=80, model_type=MODEL_TYPE)

model = ArabicBertModel(model_type=MODEL_TYPE)
trainer = pl.Trainer(max_epochs=3)
trainer.fit(model,dm)

Some weights of the model checkpoint at asafaya/bert-mini-arabic were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_li

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.
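The step counts that Lightning reports follow from the dataset size and the batch size. The sketch below is plain arithmetic under two assumptions: that the `epoch=2-step=165` checkpoint used later in this notebook came from a 3-epoch run like the one above, and that the dev set has 1000 rows, as the total support in the classification report suggests. `batches_per_epoch` is a hypothetical helper.

```python
import math

def batches_per_epoch(num_rows, batch_size):
    """Batches a DataLoader yields per epoch when drop_last is False (the default)."""
    return math.ceil(num_rows / batch_size)

# 1000 evaluation rows at batch_size=128.
eval_batches = batches_per_epoch(1000, 128)

# 165 optimizer steps over 3 epochs means 55 batches per epoch, which at
# batch_size=128 bounds the number of training rows.
steps_per_epoch = 165 // 3
min_rows = (steps_per_epoch - 1) * 128 + 1
max_rows = steps_per_epoch * 128
print(eval_batches, steps_per_epoch, min_rows, max_rows)
```

This kind of check is a quick way to confirm that a checkpoint belongs to the run and batch size you think it does.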


# Results and Discussions

In [71]:
from tqdm.auto import tqdm



# load_from_checkpoint is a classmethod, so call it on the class instead of building an instance first.
modeltest = ArabicBertModel.load_from_checkpoint(
    checkpoint_path="/content/lightning_logs/version_4/checkpoints/epoch=2-step=165.ckpt",
    hparams_file="/content/lightning_logs/version_4/hparams.yaml",
    map_location=None,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
modeltest.to(device)


preds = []
real_values = []

load = ArabicDataModule(train_path="/content/train.csv",
                           val_path = "/content/dev.csv",
                batch_size=128,max_len=80, model_type=MODEL_TYPE)
load.setup()
test_dataloader = load.val_dataloader()

progress_bar = tqdm(range(len(test_dataloader)))

modeltest.eval()
for batch in test_dataloader:    
    x,y = batch["inputs"],batch["labels"]
    inp = {k: v.to(device) for k, v in x.items()}
    
    with torch.no_grad():
        outputs = modeltest(inp)

    predictions = torch.argmax(outputs, dim=-1)
    
    preds.extend(predictions)
    real_values.extend(y)

    progress_bar.update()
    
preds = torch.stack(preds).cpu()
real_values = torch.stack(real_values).cpu()


In [72]:
from sklearn.metrics import classification_report

print(classification_report(real_values, preds, target_names=list(map(str,lbl_enc.classes_))))

              precision    recall  f1-score   support

      advice       0.14      0.50      0.22        10
   celebrity       0.80      0.88      0.84       145
   info_news       0.74      0.66      0.70       545
      others       0.00      0.00      0.00        17
    personal       0.53      0.58      0.55       128
        plan       0.23      0.32      0.27        82
    requests       0.50      0.20      0.29        20
restrictions       0.00      0.00      0.00         2
      rumors       0.12      0.13      0.12        15
   unrelated       0.33      0.42      0.37        36

    accuracy                           0.61      1000
   macro avg       0.34      0.37      0.34      1000
weighted avg       0.63      0.61      0.62      1000
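The gap between the macro and weighted averages comes from how each one combines the per-class scores. The short sketch below recomputes both F1 averages from the rounded per-class values and supports in the report above. It uses only numbers already shown and is meant as an illustration of the two formulas.

```python
# Per-class F1 and support, copied from the report in class order.
f1 = [0.22, 0.84, 0.70, 0.00, 0.55, 0.27, 0.29, 0.00, 0.12, 0.37]
support = [10, 145, 545, 17, 128, 82, 20, 2, 15, 36]

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(f1) / len(f1)
# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(round(macro_f1, 2), round(weighted_f1, 2))
```

This reproduces the 0.34 and 0.62 in the report. The weighted figure is carried by `info_news` and `celebrity`, which together make up 690 of the 1000 rows, while the macro figure is pulled down by small classes such as `others` and `restrictions` that score zero.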





In [57]:
!zip -r /content/lightning_logs.zip /content/lightning_logs

  adding: content/lightning_logs/ (stored 0%)
  adding: content/lightning_logs/version_0/ (stored 0%)
  adding: content/lightning_logs/version_0/hparams.yaml (stored 0%)
  adding: content/lightning_logs/version_0/checkpoints/ (stored 0%)
  adding: content/lightning_logs/version_0/checkpoints/epoch=2-step=165.ckpt (deflated 27%)
  adding: content/lightning_logs/version_3/ (stored 0%)
  adding: content/lightning_logs/version_3/hparams.yaml (stored 0%)
  adding: content/lightning_logs/version_3/checkpoints/ (stored 0%)
  adding: content/lightning_logs/version_3/checkpoints/epoch=4-step=275.ckpt (deflated 27%)
  adding: content/lightning_logs/version_1/ (stored 0%)
  adding: content/lightning_logs/version_1/hparams.yaml (stored 0%)
  adding: content/lightning_logs/version_1/checkpoints/ (stored 0%)
  adding: content/lightning_logs/version_1/checkpoints/epoch=2-step=165.ckpt (deflated 27%)
  adding: content/lightning_logs/version_2/ (stored 0%)
  adding: content/lightning_logs/version_2/hpa