*Copyright (c) Microsoft Corporation. All rights reserved.*

*Licensed under the MIT License.*

# Classification of Arabic News Articles using BERT

In [1]:
import os
import sys

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

sys.path.append("../../")
from utils_nlp.dataset.dac import load_pandas_df
from utils_nlp.common.timer import Timer
from utils_nlp.models.bert.common import Language, Tokenizer
from utils_nlp.models.bert.sequence_classification import BERTSequenceClassifier

## Introduction
In this notebook, we fine-tune and evaluate a pretrained [BERT](https://arxiv.org/abs/1810.04805) model on an Arabic dataset of news articles. The [dataset](https://data.mendeley.com/datasets/v524p5dhpj/2) includes articles from 3 different newspapers, and the articles are categorized into 5 classes: *sports, politics, culture, economy and diverse*. The data is described in more detail in this [paper](http://article.nadiapub.com/IJGDC/vol11_no9/9.pdf).

We use a [sequence classifier](../../utils_nlp/bert/sequence_classification.py) that wraps [Hugging Face's PyTorch implementation](https://github.com/huggingface/pytorch-pretrained-BERT) of Google's [BERT](https://github.com/google-research/bert). The classifier loads a pretrained [multilingual BERT model](https://github.com/google-research/bert/blob/master/multilingual.md) that was trained on 104 languages, including Arabic.

In [2]:
DATA_FOLDER = "./temp"
BERT_CACHE_DIR = "./temp"
LANGUAGE = Language.MULTILINGUAL
MAX_LEN = 200
BATCH_SIZE = 32
NUM_GPUS = 2
NUM_EPOCHS = 1
TRAIN_SIZE = 0.7
NUM_ROWS = 40_000

## Read Dataset
We start by loading the data. The `load_pandas_df` call below downloads the file if it does not already exist and extracts the CSV file into the specified data folder. For quicker model training, we keep a random sample of *NUM_ROWS* rows.

In [3]:
df = load_pandas_df(DATA_FOLDER).sample(NUM_ROWS)
df = df.fillna('')

100%|██████████| 80.1k/80.1k [00:05<00:00, 14.4kKB/s]


OSError: File doesn't exist

In [4]:
df.head()

Unnamed: 0,text,targe
42589,أخبارنا المغربية بعد أشهر قليلة من حصوله على ل...,2
51605,باشر عزيز أخنوش أمين عام حزب التجمع الوطني للأ...,3
34439,بعد مرور خمس سنوات على ما يسمى الربيع العربي ا...,2
53249,إبراهيم الجملي من منطلق الوعي العميق بجسامة ال...,3
60263,في ما يلي عرض لأبرز العناوين التي تصدرت صفحات ...,3


In [5]:
# set the text and label columns
text_col = df.columns[0]
label_col = df.columns[1]

Inspect the distribution of labels:

In [6]:
df[label_col].value_counts()

4    16657
3     7363
1     5923
2     5051
0     5006
Name: targe, dtype: int64

We compare these counts with those reported in the authors' [paper](http://article.nadiapub.com/IJGDC/vol11_no9/9.pdf) and infer the following label mapping:


In [7]:
# ordered list of labels
labels = ["culture", "diverse", "economy", "politics", "sports"]
num_labels = len(labels)
pd.DataFrame({"label": labels})

Unnamed: 0,label
0,culture
1,diverse
2,economy
3,politics
4,sports


Next, we split the data for training and testing:

In [8]:
df_train, df_test = train_test_split(df, train_size=TRAIN_SIZE, random_state=0)
print("Number of training examples: {}".format(df_train.shape[0]))
print("Number of testing examples: {}".format(df_test.shape[0]))

Number of training examples: 28000
Number of testing examples: 12000


## Tokenize and Preprocess

Before training, we convert each text document into a list of tokens. The following cell instantiates a BERT tokenizer for the given language and tokenizes the text of the training and testing sets.

In [9]:
tokenizer = Tokenizer(LANGUAGE, cache_dir=BERT_CACHE_DIR)
tokens_train = tokenizer.tokenize(list(df_train[text_col].astype(str)))
tokens_test = tokenizer.tokenize(list(df_test[text_col].astype(str)))

100%|██████████| 28000/28000 [02:18<00:00, 202.29it/s]
100%|██████████| 12000/12000 [00:57<00:00, 207.21it/s]
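BERT tokenizers split each word into WordPiece sub-word units, which is why a single multilingual vocabulary can cover Arabic text. The following is a minimal sketch of the greedy longest-match-first idea, not the code behind `Tokenizer.tokenize`; the function name `wordpiece_tokenize` and the vocabulary `toy_pieces` are hypothetical, and the real tokenizer also applies basic whitespace and punctuation splitting first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_pieces = {"un", "##aff", "##able", "play", "##ing"}

print(wordpiece_tokenize("unaffable", toy_pieces))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", toy_pieces))    # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_pieces))        # ['[UNK]']
```

Because rare words break into several pieces, the token lists are usually longer than the whitespace-separated word count, which matters when choosing `MAX_LEN`.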


In addition, the cell below performs the following preprocessing steps:
- Convert the tokens into token indices from the BERT tokenizer's vocabulary
- Add the special tokens `[CLS]` and `[SEP]` to mark the beginning and end of each sequence
- Pad or truncate the token lists to the specified maximum length
- Return mask lists that distinguish real tokens from padding positions

*See the original [implementation](https://github.com/google-research/bert/blob/master/run_classifier.py) for more information on BERT's input format.*
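The steps listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the body of `preprocess_classification_tokens`: the helper `preprocess_tokens` and the vocabulary `toy_ids_vocab` are hypothetical, and we assume the original BERT convention in which the mask is 1 for real tokens and 0 for padding, with two positions reserved for the special tokens when truncating.

```python
def preprocess_tokens(tokens, vocab, max_len):
    """Add [CLS]/[SEP], map tokens to ids, then pad and build the mask."""
    # Reserve two positions for the special tokens before truncating.
    tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
    # Map each token to its vocabulary index, falling back to [UNK].
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    mask = [1] * len(ids)            # 1 marks a real token
    padding = max_len - len(ids)
    ids += [vocab["[PAD]"]] * padding
    mask += [0] * padding            # 0 marks a padded position
    return ids, mask


# Hypothetical toy vocabulary, for illustration only.
toy_ids_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "news": 4, "##paper": 5}

ids, mask = preprocess_tokens(["news", "##paper"], toy_ids_vocab, max_len=6)
print(ids)   # [2, 4, 5, 3, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Every output list has exactly `max_len` entries, so the examples can be stacked into fixed-size batches.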

In [10]:
tokens_train, mask_train, _ = tokenizer.preprocess_classification_tokens(
    tokens_train, MAX_LEN
)
tokens_test, mask_test, _ = tokenizer.preprocess_classification_tokens(
    tokens_test, MAX_LEN
)

## Create Model
Next, we create a sequence classifier that loads a pretrained BERT model for the given language and number of labels.

In [11]:
classifier = BERTSequenceClassifier(
    language=LANGUAGE, num_labels=num_labels, cache_dir=BERT_CACHE_DIR
)

## Train
We train the classifier on the training examples. This involves fine-tuning the BERT Transformer and learning a linear classification layer on top of it:

In [12]:
with Timer() as t:
    classifier.fit(
        token_ids=tokens_train,
        input_mask=mask_train,
        labels=list(df_train[label_col]),    
        num_gpus=NUM_GPUS,        
        num_epochs=NUM_EPOCHS,
        batch_size=BATCH_SIZE,    
        verbose=True,
    )    
print("[Training time: {:.3f} hrs]".format(t.interval / 3600))

t_total value of -1 results in schedule not being applied
Iteration:   0%|          | 1/875 [00:03<52:24,  3.60s/it]

epoch:1/1; batch:1->88/875; average training loss:1.594673


Iteration:  10%|█         | 89/875 [02:51<25:30,  1.95s/it]

epoch:1/1; batch:89->176/875; average training loss:0.751236


Iteration:  20%|██        | 177/875 [05:41<21:59,  1.89s/it]

epoch:1/1; batch:177->264/875; average training loss:0.585310


Iteration:  30%|███       | 265/875 [08:31<19:25,  1.91s/it]

epoch:1/1; batch:265->352/875; average training loss:0.497757


Iteration:  40%|████      | 353/875 [11:20<16:33,  1.90s/it]

epoch:1/1; batch:353->440/875; average training loss:0.454207


Iteration:  50%|█████     | 441/875 [14:11<14:07,  1.95s/it]

epoch:1/1; batch:441->528/875; average training loss:0.418486


Iteration:  60%|██████    | 529/875 [17:00<11:35,  2.01s/it]

epoch:1/1; batch:529->616/875; average training loss:0.385412


Iteration:  71%|███████   | 617/875 [19:49<08:06,  1.88s/it]

epoch:1/1; batch:617->704/875; average training loss:0.361678


Iteration:  81%|████████  | 705/875 [22:38<05:32,  1.96s/it]

epoch:1/1; batch:705->792/875; average training loss:0.342387


Iteration:  91%|█████████ | 793/875 [25:26<02:36,  1.91s/it]

epoch:1/1; batch:793->875/875; average training loss:0.330646


Iteration: 100%|██████████| 875/875 [28:04<00:00,  1.91s/it]


[Training time: 0.471 hrs]


## Score
We score the test set using the trained classifier:

In [13]:
preds = classifier.predict(
    token_ids=tokens_test, input_mask=mask_test, num_gpus=NUM_GPUS, batch_size=BATCH_SIZE
)

Iteration: 100%|██████████| 375/375 [03:40<00:00,  1.71it/s]
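As a sanity check, the iteration counts in the training and scoring progress bars (875 and 375) follow directly from the settings at the top of the notebook. The short calculation below reproduces them; it assumes one iteration per batch of `BATCH_SIZE` examples, and the lowercase variable names are ours.

```python
import math

# Values copied from the configuration cell of this notebook.
num_rows, train_size, batch_size = 40_000, 0.7, 32

n_train = round(num_rows * train_size)   # examples used for training
n_test = num_rows - n_train              # examples held out for testing

# One iteration per batch; a final partial batch would still count as one.
train_batches = math.ceil(n_train / batch_size)
test_batches = math.ceil(n_test / batch_size)

print(n_train, n_test)              # 28000 12000
print(train_batches, test_batches)  # 875 375
```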


## Evaluate Results
Finally, we report accuracy and per-class precision, recall, and F1 score on the test set.

In [14]:
print(classification_report(df_test[label_col], preds, target_names=labels))

              precision    recall  f1-score   support

     culture       0.90      0.96      0.93      1507
     diverse       0.96      0.93      0.95      1804
     economy       0.83      0.89      0.86      1499
    politics       0.91      0.83      0.87      2143
      sports       0.98      0.99      0.98      5047

    accuracy                           0.94     12000
   macro avg       0.92      0.92      0.92     12000
weighted avg       0.94      0.94      0.94     12000
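To make the columns of the report concrete, the sketch below computes per-class precision, recall, and F1 from scratch, along with the macro and support-weighted averages. It is an illustration on hypothetical toy labels (`toy_true`, `toy_pred`), not a replacement for `classification_report`, and the helper names are ours.

```python
def per_class_metrics(y_true, y_pred, num_classes):
    """Return (precision, recall, f1, support) for each class index."""
    rows = []
    pairs = list(zip(y_true, y_pred))
    for c in range(num_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((precision, recall, f1, tp + fn))
    return rows


def averages(rows):
    """Macro average treats classes equally; weighted average uses support."""
    total = sum(r[3] for r in rows)
    macro = tuple(sum(r[i] for r in rows) / len(rows) for i in range(3))
    weighted = tuple(sum(r[i] * r[3] for r in rows) / total for i in range(3))
    return macro, weighted


# Hypothetical toy labels, for illustration only.
toy_true = [0, 0, 1, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 2, 2]

rows = per_class_metrics(toy_true, toy_pred, num_classes=3)
macro, weighted = averages(rows)
print(rows[0])  # (1.0, 0.5, 0.6666666666666666, 2)
print(round(macro[0], 4), round(weighted[1], 4))  # 0.7222 0.6667
```

The support-weighted recall always equals the overall accuracy. This explains why, in the report above, the weighted average (0.94) is higher than the macro average (0.92): the large, well-classified sports class carries more weight.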

