*Copyright (c) Microsoft Corporation. All rights reserved.*

*Licensed under the MIT License.*

# Classification of Arabic News Articles using BERT

In [1]:
import json
import os
import sys

import numpy as np
import pandas as pd
import scrapbook as sb
import torch
import torch.nn as nn
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

sys.path.append("../../")
from utils_nlp.common.timer import Timer
from utils_nlp.dataset.dac import load_pandas_df
from utils_nlp.models.bert.common import Language, Tokenizer
from utils_nlp.models.bert.sequence_classification import BERTSequenceClassifier

## Introduction
In this notebook, we fine-tune and evaluate a pretrained [BERT](https://arxiv.org/abs/1810.04805) model on an Arabic dataset of news articles. The [dataset](https://data.mendeley.com/datasets/v524p5dhpj/2) includes articles from 3 different newspapers, and the articles are categorized into 5 classes: *sports, politics, culture, economy and diverse*. The data is described in more detail in this [paper](http://article.nadiapub.com/IJGDC/vol11_no9/9.pdf).

We use a [sequence classifier](../../utils_nlp/bert/sequence_classification.py) that wraps [Hugging Face's PyTorch implementation](https://github.com/huggingface/pytorch-pretrained-BERT) of Google's [BERT](https://github.com/google-research/bert). The classifier loads a pretrained [multilingual BERT model](https://github.com/google-research/bert/blob/master/multilingual.md) that was trained on 104 languages, including Arabic.

In [2]:
DATA_FOLDER = "./temp"
BERT_CACHE_DIR = "./temp"
LANGUAGE = Language.MULTILINGUAL
MAX_LEN = 200
BATCH_SIZE = 32
NUM_GPUS = 2
NUM_EPOCHS = 1
TRAIN_SIZE = 0.8
NUM_ROWS = 15000
RANDOM_STATE = 0

## Read Dataset
We start by loading the data. The following line downloads the file if it doesn't already exist and extracts the CSV file into the specified data folder. To speed up model training, we keep a random sample of *NUM_ROWS* rows.

In [3]:
df = load_pandas_df(DATA_FOLDER).sample(NUM_ROWS, random_state=RANDOM_STATE)

In [4]:
df.head()

| | text | targe |
|---|---|---|
| 80414 | فاز فريق الدفاع الحسني الجديدي على مضيفه الكوك... | 4 |
| 6649 | أمام آلاف مشاهد من لبنان ومصر والمغرب والإمارا... | 0 |
| 3722 | أخبارنا المغربية بعد أن أصدرت المحكمة الإبتداي... | 0 |
| 82317 | الفريق طبق قانونا قبل المصادقة عليه وجدل حول ه... | 4 |
| 5219 | المطرب المصري يخوض حملة إعلامية لترويج ألبومه ... | 0 |


In [5]:
# set the text and label columns
text_col = df.columns[0]
label_col = df.columns[1]

In [6]:
# remove empty documents
df = df[df[text_col].notna()]

Inspect the distribution of labels:

In [7]:
df[label_col].value_counts()

4    5844
3    2796
1    2139
0    1917
2    1900
Name: targe, dtype: int64

We compare the counts with those presented in the author's [paper](http://article.nadiapub.com/IJGDC/vol11_no9/9.pdf), and infer the following label mapping:


In [8]:
# ordered list of labels
labels = ["culture", "diverse", "economy", "politics", "sports"]
num_labels = len(labels)
pd.DataFrame({"label": labels})

| | label |
|---|---|
| 0 | culture |
| 1 | diverse |
| 2 | economy |
| 3 | politics |
| 4 | sports |
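As a small illustration of the inferred mapping, the sketch below attaches the label names to the counts printed by `value_counts()` above and computes each class's share of the sample. The names `counts`, `named_counts`, and `shares` are introduced here for illustration only, and the mapping itself remains an inference from the paper, not something stored in the dataset.

```python
# Ordered list of labels, as inferred in the notebook.
labels = ["culture", "diverse", "economy", "politics", "sports"]

# Counts copied from the value_counts() output above (label id -> rows).
counts = {4: 5844, 3: 2796, 1: 2139, 0: 1917, 2: 1900}

total = sum(counts.values())

# Attach a readable name to each integer label id.
named_counts = {labels[label_id]: n for label_id, n in counts.items()}

# Fraction of the sample that falls in each class.
shares = {name: n / total for name, n in named_counts.items()}

for name in labels:
    print("{:<10}{:>6}  {:.1%}".format(name, named_counts[name], shares[name]))
```

The shares make the class imbalance visible: under this mapping, *sports* is by far the largest class.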


Next, we split the data for training and testing:

In [9]:
df_train, df_test = train_test_split(df, train_size=TRAIN_SIZE, random_state=RANDOM_STATE)
print("Number of training examples: {}".format(df_train.shape[0]))
print("Number of testing examples: {}".format(df_test.shape[0]))

Number of training examples: 11676
Number of testing examples: 2920
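To make the split sizes concrete, here is a minimal pure-Python sketch of a seeded shuffle-and-split over row indices. It is not scikit-learn's implementation, and `shuffle_split` is a hypothetical helper. It selects different rows than `train_test_split` would, but it reproduces the same counts for 14,596 rows and a training fraction of 0.8.

```python
import math
import random


def shuffle_split(n_rows, train_size, seed):
    """Shuffle row indices with a fixed seed and cut them into two parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # The training part gets the floor of the requested fraction;
    # the remaining rows form the test part.
    n_train = math.floor(train_size * n_rows)
    return indices[:n_train], indices[n_train:]


# 14596 rows remain after dropping empty documents (11676 + 2920).
train_idx, test_idx = shuffle_split(14596, 0.8, seed=0)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the split reproducible across runs, which is the role `RANDOM_STATE` plays in the notebook.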




## Tokenize and Preprocess

Before training, we tokenize the text documents and convert them to lists of tokens. The following steps instantiate a BERT tokenizer given the language, and tokenize the text of the training and testing sets.

In [10]:
tokenizer = Tokenizer(LANGUAGE, cache_dir=BERT_CACHE_DIR)
tokens_train = tokenizer.tokenize(list(df_train[text_col].astype(str)))
tokens_test = tokenizer.tokenize(list(df_test[text_col].astype(str)))

100%|██████████| 11676/11676 [00:59<00:00, 196.42it/s]
100%|██████████| 2920/2920 [00:14<00:00, 197.99it/s]
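BERT tokenizers split words into WordPiece subwords. The sketch below shows the greedy longest-match-first idea on a toy English vocabulary. The vocabulary and the `wordpiece` helper are hypothetical, and the `Tokenizer` wrapper used above may differ in details such as text normalization.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, for illustration only.
toy_vocab = {"play", "##ing", "un", "##play", "##able"}

print(wordpiece("playing", toy_vocab))
print(wordpiece("unplayable", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```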


In addition, we perform the following preprocessing steps in the cell below:
- Convert the tokens into token indices corresponding to the BERT tokenizer's vocabulary
- Add the special tokens [CLS] and [SEP] to mark the beginning and end of a sentence
- Pad or truncate the token lists to the specified max length
- Return mask lists that indicate the padding positions

*See the original [implementation](https://github.com/google-research/bert/blob/master/run_classifier.py) for more information on BERT's input format.*
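The preprocessing steps listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code behind `preprocess_classification_tokens`: the toy vocabulary, the `preprocess` helper, and the choice to truncate by keeping the first `max_len - 2` tokens are all hypothetical.

```python
def preprocess(token_lists, vocab, max_len):
    """Add [CLS]/[SEP], map tokens to ids, then pad or truncate to max_len."""
    all_ids, all_masks = [], []
    for tokens in token_lists:
        # Reserve two positions for the special tokens.
        tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
        ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
        # 1 marks a real token, 0 marks padding.
        mask = [1] * len(ids)
        n_pad = max_len - len(ids)
        ids += [vocab["[PAD]"]] * n_pad
        mask += [0] * n_pad
        all_ids.append(ids)
        all_masks.append(mask)
    return all_ids, all_masks


# Toy vocabulary, for illustration only.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "news": 4, "##paper": 5, "sports": 6}

ids, masks = preprocess([["news", "##paper"], ["sports"] * 10], toy_vocab, max_len=6)
print(ids)
print(masks)
```

The first document is shorter than `max_len` and gets padded; the second is longer and gets truncated, so its mask contains no zeros.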

In [11]:
tokens_train, mask_train, _ = tokenizer.preprocess_classification_tokens(
    tokens_train, MAX_LEN
)
tokens_test, mask_test, _ = tokenizer.preprocess_classification_tokens(
    tokens_test, MAX_LEN
)

## Create Model
Next, we create a sequence classifier that loads a pretrained BERT model, given the language and number of labels.

In [12]:
classifier = BERTSequenceClassifier(
    language=LANGUAGE, num_labels=num_labels, cache_dir=BERT_CACHE_DIR
)

## Train
We train the classifier using the training examples. This involves fine-tuning the BERT Transformer and learning a linear classification layer on top of that:

In [13]:
with Timer() as t:
    classifier.fit(
        token_ids=tokens_train,
        input_mask=mask_train,
        labels=list(df_train[label_col]),    
        num_gpus=NUM_GPUS,        
        num_epochs=NUM_EPOCHS,
        batch_size=BATCH_SIZE,    
        verbose=True,
    )    
print("[Training time: {:.3f} hrs]".format(t.interval / 3600))

t_total value of -1 results in schedule not being applied
Iteration:   0%|          | 1/365 [00:03<21:12,  3.49s/it]

epoch:1/1; batch:1->37/365; average training loss:1.591262


Iteration:  10%|█         | 38/365 [01:02<08:45,  1.61s/it]

epoch:1/1; batch:38->74/365; average training loss:0.745935


Iteration:  21%|██        | 75/365 [02:02<07:52,  1.63s/it]

epoch:1/1; batch:75->111/365; average training loss:0.593934


Iteration:  31%|███       | 112/365 [03:03<06:56,  1.65s/it]

epoch:1/1; batch:112->148/365; average training loss:0.530150


Iteration:  41%|████      | 149/365 [04:03<05:54,  1.64s/it]

epoch:1/1; batch:149->185/365; average training loss:0.481620


Iteration:  51%|█████     | 186/365 [05:05<05:02,  1.69s/it]

epoch:1/1; batch:186->222/365; average training loss:0.455032


Iteration:  61%|██████    | 223/365 [06:06<03:59,  1.69s/it]

epoch:1/1; batch:223->259/365; average training loss:0.421702


Iteration:  71%|███████   | 260/365 [07:08<02:56,  1.68s/it]

epoch:1/1; batch:260->296/365; average training loss:0.401165


Iteration:  81%|████████▏ | 297/365 [08:09<01:52,  1.65s/it]

epoch:1/1; batch:297->333/365; average training loss:0.382719


Iteration:  92%|█████████▏| 334/365 [09:12<00:52,  1.71s/it]

epoch:1/1; batch:334->365/365; average training loss:0.372204


Iteration: 100%|██████████| 365/365 [10:04<00:00,  1.63s/it]

[Training time: 0.169 hrs]
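To illustrate the linear classification layer mentioned above, the following sketch computes logits from a single hidden vector, turns them into probabilities with a softmax, and evaluates the cross-entropy loss for one example. It uses only the standard library; the three-dimensional `hidden` vector and the zero-initialized weights are assumptions chosen for readability, not values from the actual model.

```python
import math


def linear(hidden, weights, bias):
    """One logit per class: dot(weights[c], hidden) + bias[c]."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Convert logits to probabilities; subtract the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


num_labels = 5
hidden = [0.5, -1.0, 2.0]  # stand-in for the encoder's output vector
weights = [[0.0] * len(hidden) for _ in range(num_labels)]
bias = [0.0] * num_labels

probs = softmax(linear(hidden, weights, bias))
loss = cross_entropy(probs, label=4)
print(probs, loss)
```

With all-zero weights every class receives probability 1/5, so the loss equals ln(5) ≈ 1.609. Fine-tuning adjusts both the layer and the encoder to push the loss below that uniform baseline.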





## Score
We score the test set using the trained classifier:

In [14]:
preds = classifier.predict(
    token_ids=tokens_test, input_mask=mask_test, num_gpus=NUM_GPUS, batch_size=BATCH_SIZE
)

Iteration: 100%|██████████| 92/92 [00:48<00:00,  2.25it/s]
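The iteration counts in the progress bars follow from the dataset sizes and the batch size. The short sketch below assumes the last, partially filled batch is still processed, which is consistent with the 365 training and 92 scoring iterations logged above. `num_batches` is a helper defined here for illustration.

```python
import math


def num_batches(num_examples, batch_size):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(num_examples / batch_size)


BATCH_SIZE = 32
train_batches = num_batches(11676, BATCH_SIZE)  # training examples
test_batches = num_batches(2920, BATCH_SIZE)    # testing examples
print(train_batches, test_batches)
```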


## Evaluate Results
Finally, we compute accuracy, precision, recall, and F1 on the test set.

In [15]:
report = classification_report(
    df_test[label_col], preds, target_names=labels, output_dict=True
)
accuracy = accuracy_score(df_test[label_col], preds)
print("accuracy: {}".format(accuracy))
print(json.dumps(report, indent=4, sort_keys=True))

accuracy: 0.9277397260273973
{
    "culture": {
        "f1-score": 0.9081761006289307,
        "precision": 0.8848039215686274,
        "recall": 0.9328165374677002,
        "support": 387
    },
    "diverse": {
        "f1-score": 0.9237983587338804,
        "precision": 0.9471153846153846,
        "recall": 0.9016018306636155,
        "support": 437
    },
    "economy": {
        "f1-score": 0.8547418967587034,
        "precision": 0.8221709006928406,
        "recall": 0.89,
        "support": 400
    },
    "macro avg": {
        "f1-score": 0.9099850933798536,
        "precision": 0.9087524907040864,
        "recall": 0.9125256551533433,
        "support": 2920
    },
    "micro avg": {
        "f1-score": 0.9277397260273973,
        "precision": 0.9277397260273973,
        "recall": 0.9277397260273973,
        "support": 2920
    },
    "politics": {
        "f1-score": 0.8734177215189873,
        "precision": 0.8994413407821229,
        "recall": 0.8488576449912126,
        "s

In [16]:
# for testing
sb.glue("accuracy", accuracy)
sb.glue("precision", report["macro avg"]["precision"])
sb.glue("recall", report["macro avg"]["recall"])
sb.glue("f1", report["macro avg"]["f1-score"])
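The macro-averaged values recorded above are unweighted means of the per-class metrics. The sketch below recomputes precision, recall, and F1 from scratch on a small, made-up set of labels and predictions. It is meant to show the definitions, not to replace `classification_report`, and the helper names are hypothetical.

```python
def per_class_metrics(y_true, y_pred, num_classes):
    """Precision, recall, and F1 for each class, computed one-vs-rest."""
    metrics = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


def macro_average(metrics, key):
    """Unweighted mean of one metric across classes."""
    return sum(m[key] for m in metrics) / len(metrics)


# Made-up labels and predictions, for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

metrics = per_class_metrics(y_true, y_pred, num_classes=3)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_average(metrics, "f1"))
```

Because the macro average weights every class equally, it can differ from accuracy when classes are imbalanced, as they are in this dataset.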