*Copyright (c) Microsoft Corporation. All rights reserved.*

*Licensed under the MIT License.*

# Classification of Arabic News Articles using BERT

In [1]:
import os
import sys

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

sys.path.append("../../")
from utils_nlp.dataset.dac import load_pandas_df
from utils_nlp.common.timer import Timer
from utils_nlp.models.bert.common import Language, Tokenizer
from utils_nlp.models.bert.sequence_classification import BERTSequenceClassifier

## Introduction
In this notebook, we fine-tune and evaluate a pretrained [BERT](https://arxiv.org/abs/1810.04805) model on an Arabic dataset of news articles. The [dataset](https://data.mendeley.com/datasets/v524p5dhpj/2) includes articles from 3 different newspapers, and the articles are categorized into 5 classes: *sports, politics, culture, economy and diverse*. The data is described in more detail in this [paper](http://article.nadiapub.com/IJGDC/vol11_no9/9.pdf).

We use a [sequence classifier](../../utils_nlp/models/bert/sequence_classification.py) that wraps [Hugging Face's PyTorch implementation](https://github.com/huggingface/pytorch-pretrained-BERT) of Google's [BERT](https://github.com/google-research/bert). The classifier loads a pretrained [multilingual BERT model](https://github.com/google-research/bert/blob/master/multilingual.md) that was trained on 104 languages, including Arabic.

In [2]:
DATA_FOLDER = "./temp"
BERT_CACHE_DIR = "./temp"
LANGUAGE = Language.MULTILINGUAL
MAX_LEN = 200
BATCH_SIZE = 32
NUM_GPUS = 2
NUM_EPOCHS = 1
TRAIN_SIZE = 0.7
NUM_ROWS = 10_000

## Read Dataset
We start by loading the data. The following line downloads the file if it doesn't already exist and extracts the CSV file into the specified data folder. To speed up model training, we keep only a random sample of *NUM_ROWS* rows.

In [3]:
df = load_pandas_df(DATA_FOLDER).sample(NUM_ROWS)

In [4]:
df.head()

Unnamed: 0,text,targe
68859,روماو الملعب عائق ولن أختبئ وراءه أكد جوزي روم...,4
90874,قال إنه سعيد لأن اللاعبين لم ينزلوا أيديهم وإن...,4
11604,يقال يخلق من الشبه أربعين غير أن هذا البرازيلي...,0
51273,الملك يقدم وصفة النجاة من طوفان التيار السلفي ...,3
26745,لقيت راعية غنم بجماعة باب مرزوقة حتفها صباح ال...,1


In [7]:
# set the text and label columns
text_col = df.columns[0]
label_col = df.columns[1]

In [8]:
# remove documents with missing text
df = df[df[text_col].notna()]

Inspect the distribution of labels:

In [9]:
df[label_col].value_counts()

4    3928
3    1816
1    1555
2    1238
0    1200
Name: targe, dtype: int64
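
The counts above are noticeably imbalanced. As a quick sanity check, the sketch below recomputes each class's share in plain Python. The `counts` dictionary is copied by hand from the output above and is not part of the notebook's code.

```python
# Label counts copied from the value_counts() output above
counts = {4: 3928, 3: 1816, 1: 1555, 2: 1238, 0: 1200}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in sorted(shares.items()):
    print("label {}: {:.1%}".format(label, share))
```

The largest class accounts for roughly 40% of the sampled rows, so a classifier that always predicts it would reach about that accuracy. This is a useful baseline to keep in mind when reading the evaluation results.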

We compare these counts with those presented in the authors' [paper](http://article.nadiapub.com/IJGDC/vol11_no9/9.pdf) and infer the following label mapping:


In [10]:
# ordered list of labels
labels = ["culture", "diverse", "economy", "politics", "sports"]
num_labels = len(labels)
pd.DataFrame({"label": labels})

Unnamed: 0,label
0,culture
1,diverse
2,economy
3,politics
4,sports
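
Because the list is ordered by label id, a numeric label can be turned back into a class name by indexing. A minimal sketch, using the label values of the five sample rows shown earlier (the `sample_targets` list is typed in by hand for illustration):

```python
labels = ["culture", "diverse", "economy", "politics", "sports"]

# Label ids of the five rows printed by df.head() above
sample_targets = [4, 4, 0, 3, 1]

# Index into the ordered list to recover the class name
sample_names = [labels[i] for i in sample_targets]
print(sample_names)
```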


Next, we split the data for training and testing:

In [11]:
df_train, df_test = train_test_split(df, train_size=TRAIN_SIZE, random_state=0)
print("Number of training examples: {}".format(df_train.shape[0]))
print("Number of testing examples: {}".format(df_test.shape[0]))

Number of training examples: 6815
Number of testing examples: 2922




## Tokenize and Preprocess

Before training, we tokenize the text documents and convert them to lists of tokens. The following steps instantiate a BERT tokenizer given the language, and tokenize the text of the training and testing sets.

In [12]:
tokenizer = Tokenizer(LANGUAGE, cache_dir=BERT_CACHE_DIR)
tokens_train = tokenizer.tokenize(list(df_train[text_col].astype(str)))
tokens_test = tokenizer.tokenize(list(df_test[text_col].astype(str)))

100%|██████████| 6815/6815 [00:34<00:00, 198.25it/s]
100%|██████████| 2922/2922 [00:14<00:00, 195.18it/s]


In addition, we perform the following preprocessing steps in the cell below:
- Convert the tokens into token indices corresponding to the BERT tokenizer's vocabulary
- Add the special tokens [CLS] and [SEP] to mark the beginning and end of a sentence
- Pad or truncate the token lists to the specified max length
- Return mask lists that distinguish real tokens from padding positions

*See the original [implementation](https://github.com/google-research/bert/blob/master/run_classifier.py) for more information on BERT's input format.*
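
The steps listed above can be sketched in plain Python. This is an illustration only, not the code behind `preprocess_classification_tokens`: the function name `preprocess_tokens`, the toy vocabulary, and the convention that padding uses id 0 and mask value 0 are assumptions made for the example.

```python
def preprocess_tokens(token_lists, vocab, max_len, pad_id=0):
    """Hypothetical helper: add special tokens, truncate, index, pad, and mask."""
    all_ids, all_masks = [], []
    for tokens in token_lists:
        # Reserve two positions for [CLS] and [SEP], then truncate
        tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
        # Map tokens to vocabulary indices; unknown tokens map to [UNK]
        ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
        # Mask is 1 for real tokens and 0 for padding
        mask = [1] * len(ids)
        n_pad = max_len - len(ids)
        all_ids.append(ids + [pad_id] * n_pad)
        all_masks.append(mask + [0] * n_pad)
    return all_ids, all_masks


# Toy vocabulary, invented for this example
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "news": 4, "article": 5}

ids, masks = preprocess_tokens([["news", "article"], ["news"] * 10], vocab, max_len=6)
print(ids)
print(masks)
```

The short document is padded up to `max_len`, while the long one is truncated so that [CLS] and [SEP] still fit. Every output list has the same length, which is what allows the examples to be batched.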

In [13]:
tokens_train, mask_train, _ = tokenizer.preprocess_classification_tokens(
    tokens_train, MAX_LEN
)
tokens_test, mask_test, _ = tokenizer.preprocess_classification_tokens(
    tokens_test, MAX_LEN
)

## Create Model
Next, we create a sequence classifier that loads a pretrained BERT model, given the language and the number of labels.

In [14]:
classifier = BERTSequenceClassifier(
    language=LANGUAGE, num_labels=num_labels, cache_dir=BERT_CACHE_DIR
)

100%|██████████| 662804195/662804195 [00:11<00:00, 57860894.21B/s]


## Train
We train the classifier on the training examples. This involves fine-tuning the BERT Transformer and learning a linear classification layer on top of it:

In [15]:
with Timer() as t:
    classifier.fit(
        token_ids=tokens_train,
        input_mask=mask_train,
        labels=list(df_train[label_col]),
        num_gpus=NUM_GPUS,
        num_epochs=NUM_EPOCHS,
        batch_size=BATCH_SIZE,
        verbose=True,
    )
print("[Training time: {:.3f} hrs]".format(t.interval / 3600))

t_total value of -1 results in schedule not being applied
Iteration:   0%|          | 1/213 [00:05<19:11,  5.43s/it]

epoch:1/1; batch:1->22/213; average training loss:1.605419


Iteration:  11%|█         | 23/213 [00:40<05:03,  1.60s/it]

epoch:1/1; batch:23->44/213; average training loss:1.090210


Iteration:  21%|██        | 45/213 [01:15<04:29,  1.60s/it]

epoch:1/1; batch:45->66/213; average training loss:0.904462


Iteration:  31%|███▏      | 67/213 [01:51<03:54,  1.61s/it]

epoch:1/1; batch:67->88/213; average training loss:0.776574


Iteration:  42%|████▏     | 89/213 [02:26<03:19,  1.61s/it]

epoch:1/1; batch:89->110/213; average training loss:0.701259


Iteration:  52%|█████▏    | 111/213 [03:01<02:43,  1.61s/it]

epoch:1/1; batch:111->132/213; average training loss:0.646278


Iteration:  62%|██████▏   | 133/213 [03:37<02:12,  1.65s/it]

epoch:1/1; batch:133->154/213; average training loss:0.609719


Iteration:  73%|███████▎  | 155/213 [04:13<01:34,  1.63s/it]

epoch:1/1; batch:155->176/213; average training loss:0.573199


Iteration:  83%|████████▎ | 177/213 [04:49<00:58,  1.64s/it]

epoch:1/1; batch:177->198/213; average training loss:0.541991


Iteration:  93%|█████████▎| 199/213 [05:27<00:24,  1.76s/it]

epoch:1/1; batch:199->213/213; average training loss:0.522652


Iteration: 100%|██████████| 213/213 [05:52<00:00,  1.74s/it]

[Training time: 0.099 hrs]
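
The iteration count in the progress bar is consistent with one batch per step: the number of steps per epoch is the number of examples divided by the batch size, rounded up. A small check using the sizes printed after the train/test split, assuming the final, smaller batch is kept rather than dropped:

```python
import math

BATCH_SIZE = 32
n_train, n_test = 6815, 2922  # sizes printed after the train/test split

train_steps = math.ceil(n_train / BATCH_SIZE)
test_steps = math.ceil(n_test / BATCH_SIZE)
print(train_steps, test_steps)
```

The training value matches the 213 iterations shown above, and the test value matches the 92 iterations reported when scoring the test set.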





## Score
We score the test set using the trained classifier:

In [16]:
preds = classifier.predict(
    token_ids=tokens_test, input_mask=mask_test, num_gpus=NUM_GPUS, batch_size=BATCH_SIZE
)

Iteration: 100%|██████████| 92/92 [00:48<00:00,  2.20it/s]


## Evaluate Results
Finally, we compute the accuracy, precision, recall, and F1 score of the predictions on the test set.

In [19]:
print("accuracy: {}\n".format(accuracy_score(df_test[label_col], preds)))
print(classification_report(df_test[label_col], preds, target_names=labels))

accuracy: 0.9117043121149897

              precision    recall  f1-score   support

     culture       0.91      0.88      0.90       355
     diverse       0.90      0.95      0.93       466
     economy       0.76      0.82      0.79       369
    politics       0.87      0.83      0.85       583
      sports       0.99      0.98      0.99      1149

   micro avg       0.91      0.91      0.91      2922
   macro avg       0.89      0.89      0.89      2922
weighted avg       0.91      0.91      0.91      2922
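
The macro and weighted averages in the report can be reproduced, up to rounding, from the per-class rows. The sketch below uses the rounded precision values and supports copied by hand from the table above, so the results are approximate:

```python
# Per-class (precision, support) copied from the report above
rows = {
    "culture": (0.91, 355),
    "diverse": (0.90, 466),
    "economy": (0.76, 369),
    "politics": (0.87, 583),
    "sports": (0.99, 1149),
}

total_support = sum(s for _, s in rows.values())

# Macro average: unweighted mean over classes
macro_precision = sum(p for p, _ in rows.values()) / len(rows)

# Weighted average: mean over classes, weighted by support
weighted_precision = sum(p * s for p, s in rows.values()) / total_support

print("macro: {:.2f}, weighted: {:.2f}".format(macro_precision, weighted_precision))
```

The weighted average is higher than the macro average because sports, the largest class, is also the best classified. Economy is the weakest class and pulls the macro average down.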

