*Copyright (c) Microsoft Corporation. All rights reserved.*

*Licensed under the MIT License.*

# Classification of Arabic News Articles using BERT

In [1]:
import sys
sys.path.append("../../")
import os
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from utils_nlp.dataset.url_utils import maybe_download, extract_zip
from utils_nlp.eval.classification import eval_classification
from utils_nlp.bert.sequence_classification import BERTSequenceClassifier
from utils_nlp.bert.common import Language, Tokenizer
from utils_nlp.common.timer import Timer
import torch
import torch.nn as nn
import numpy as np

## Introduction
In this notebook, we fine-tune and evaluate a pretrained [BERT](https://arxiv.org/abs/1810.04805) model on an Arabic dataset of news articles. The [dataset](https://data.mendeley.com/datasets/v524p5dhpj/2) includes articles from three newspapers, categorized into five classes: *sports, politics, culture, economy and diverse*. The data is described in more detail in this [paper](http://article.nadiapub.com/IJGDC/vol11_no9/9.pdf).

We use a [sequence classifier](../../utils_nlp/bert/sequence_classification.py) that wraps [Hugging Face's PyTorch implementation](https://github.com/huggingface/pytorch-pretrained-BERT) of Google's [BERT](https://github.com/google-research/bert). The classifier loads a pretrained [multilingual BERT model](https://github.com/google-research/bert/blob/master/multilingual.md) that was trained on 104 languages, including Arabic.

In [9]:
URL = ("https://data.mendeley.com/datasets/v524p5dhpj/2" 
       "/files/91cb8398-9451-43af-88fc-041a0956ae2d/"
       "arabic_dataset_classifiction.csv.zip")
DATA_FOLDER = "../../../temp"
BERT_CACHE_DIR = "../../../temp"
LANGUAGE = Language.MULTILINGUAL
MAX_LEN = 200
BATCH_SIZE = 32
NUM_GPUS = 2
NUM_EPOCHS = 1
TRAIN_SIZE = 0.75

## Read Dataset
We start by loading the data. The following cell downloads the zip archive if it is not already present, extracts the CSV file into the specified data folder, and reads it into a DataFrame.

In [13]:
zip_file = "dac.zip"
csv_file = "arabic_dataset_classifiction.csv"
maybe_download(URL, filename=zip_file, work_directory=DATA_FOLDER)
extract_zip(file_path=os.path.join(DATA_FOLDER, zip_file), dest_path=DATA_FOLDER)
df = pd.read_csv(os.path.join(DATA_FOLDER, csv_file))
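
For readers who do not have the `utils_nlp` helpers at hand, the download-if-missing and extract steps can be approximated with the standard library. This is only a sketch: `fetch_if_missing` and `unzip_to` are hypothetical names introduced here, and they are not the implementations of `maybe_download` and `extract_zip` used above.

```python
import os
import urllib.request
import zipfile


def fetch_if_missing(url, filename, work_directory="."):
    """Download url to work_directory/filename unless the file already exists."""
    os.makedirs(work_directory, exist_ok=True)
    path = os.path.join(work_directory, filename)
    if not os.path.exists(path):
        # Network call; skipped entirely when the file is already cached.
        urllib.request.urlretrieve(url, path)
    return path


def unzip_to(file_path, dest_path="."):
    """Extract every member of the archive into dest_path and return their names."""
    with zipfile.ZipFile(file_path) as archive:
        archive.extractall(dest_path)
        return archive.namelist()
```

Skipping the download when the file exists keeps repeated notebook runs from fetching the archive again.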

In [21]:
df.head()

| | text | targe |
|---|---|---|
| 0 | بين أستوديوهات ورزازات وصحراء مرزوكة وآثار ولي... | 0 |
| 1 | قررت النجمة الأمريكية أوبرا وينفري ألا يقتصر ع... | 0 |
| 2 | أخبارنا المغربية الوزاني تصوير الشملالي ألهب ا... | 0 |
| 3 | اخبارنا المغربية قال ابراهيم الراشدي محامي سعد... | 0 |
| 4 | تزال صناعة الجلود في المغرب تتبع الطريقة التقل... | 0 |


In [None]:
# set the text and label columns
text_col = df.columns[0]
label_col = df.columns[1]

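Inspect the distribution of labels. The cell below uses `df_train`, which is created by the train/test split later in the notebook, so it must be run after that split; the counts shown sum to 83,796, the size of the training set: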

In [22]:
df_train[label_col].value_counts()

4    34899
3    15283
1    12637
2    10718
0    10259
Name: targe, dtype: int64

We compare these counts with those reported in the [paper](http://article.nadiapub.com/IJGDC/vol11_no9/9.pdf), and infer the following label mapping:


In [32]:
# ordered list of labels
labels = ["culture", "diverse", "economy", "politics", "sports"]
num_labels = len(labels)
pd.DataFrame({"label": labels})

| | label |
|---|---|
| 0 | culture |
| 1 | diverse |
| 2 | economy |
| 3 | politics |
| 4 | sports |
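
Because `labels` is ordered, the list index doubles as the integer label id. The short sketch below applies that mapping to the training-split counts printed earlier and shows each class's share. The counts are copied from the notebook output; the mapping itself is the notebook's inference, not something stated in the data file.

```python
# Ordered list of labels: position i is the name for integer label i.
labels = ["culture", "diverse", "economy", "politics", "sports"]

# Training-split counts copied from the value_counts() output in this notebook.
counts = {4: 34899, 3: 15283, 1: 12637, 2: 10718, 0: 10259}

total = sum(counts.values())
for label_id, count in sorted(counts.items()):
    share = count / total
    print("{} -> {:<8} {:>6} ({:.1%})".format(label_id, labels[label_id], count, share))
```

The shares make the class imbalance explicit: the most frequent class has more than three times as many training examples as the least frequent one.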


Next, we split the data for training and testing:

In [33]:
df_train, df_test = train_test_split(df, train_size = TRAIN_SIZE, random_state=0)
print("Number of training examples: {}".format(df_train.shape[0]))
print("Number of testing examples: {}".format(df_test.shape[0]))

Number of training examples: 83796
Number of testing examples: 27932
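
The split sizes follow directly from `TRAIN_SIZE = 0.75` applied to the 111,728 rows (83,796 + 27,932). The sketch below reproduces that arithmetic with a seeded shuffle of row indices. `split_indices` is a hypothetical helper, and it does not reproduce the exact permutation that `train_test_split` generates; only the sizes match.

```python
import random


def split_indices(n_rows, train_size, seed=0):
    """Shuffle row indices with a fixed seed and cut them into train/test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = int(n_rows * train_size)  # floor; the remainder goes to the test set
    return indices[:n_train], indices[n_train:]


# Total row count taken from the two sizes printed above.
train_idx, test_idx = split_indices(n_rows=83796 + 27932, train_size=0.75)
print(len(train_idx), len(test_idx))
```

Fixing the seed, as the notebook does with `random_state=0`, keeps the split identical across runs.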


## Tokenize and Preprocess

Before training, we tokenize the text documents and convert them to lists of tokens. The following steps instantiate a BERT tokenizer given the language, and tokenize the text of the training and testing sets.

In [36]:
tokenizer = Tokenizer(LANGUAGE, cache_dir=BERT_CACHE_DIR)
tokens_train = tokenizer.tokenize(df_train[text_col].astype(str))
tokens_test = tokenizer.tokenize(df_test[text_col].astype(str))

In addition, the cell below performs the following preprocessing steps:
- Convert the tokens into token indices corresponding to the BERT tokenizer's vocabulary
- Add sentence markers (`[CLS]` at the start and `[SEP]` at the end)
- Pad or truncate the token lists to the specified max length
- Return mask lists that indicate which positions are padding

*See the original [implementation](https://github.com/google-research/bert/blob/master/run_classifier.py) for more information on BERT's input format.*
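
The preprocessing steps listed above can be illustrated on a toy example. This is a minimal sketch under stated assumptions: `preprocess_tokens` and `toy_vocab` are hypothetical, the vocabulary ids are made up, and the code is not the `preprocess_classification_tokens` implementation. It follows the general BERT input convention of reserving two positions for `[CLS]` and `[SEP]` and using 1 for real tokens and 0 for padding in the mask.

```python
def preprocess_tokens(tokens, vocab, max_len, pad_id=0):
    """Add markers, map tokens to ids, then pad or truncate to max_len."""
    # Truncate first so that [CLS] and [SEP] always fit within max_len.
    tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
    # Unknown tokens fall back to the [UNK] id.
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    mask = [1] * len(ids)  # 1 marks a real token
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


# Hypothetical six-entry vocabulary, for illustration only.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "news": 4, "article": 5}

ids, mask = preprocess_tokens(["news", "article", "today"], toy_vocab, max_len=8)
short_ids, short_mask = preprocess_tokens(["news"] * 10, toy_vocab, max_len=4)
```

In the first call the unknown token `today` maps to `[UNK]` and three padding positions are appended; in the second call the ten input tokens are cut down to two so that the markers still fit.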

In [37]:
tokens_train, mask_train = tokenizer.preprocess_classification_tokens(
    tokens_train, MAX_LEN
)
tokens_test, mask_test = tokenizer.preprocess_classification_tokens(
    tokens_test, MAX_LEN
)

## Create Model
Next, we create a sequence classifier that loads a pretrained BERT model, given the language and number of labels.

In [38]:
classifier = BERTSequenceClassifier(
    language=LANGUAGE, num_labels=num_labels, cache_dir=BERT_CACHE_DIR
)

## Train
We train the classifier on the training examples. This fine-tunes the BERT Transformer and learns a linear classification layer on top of it:

In [39]:
with Timer() as t:
    classifier.fit(
        token_ids=tokens_train,
        input_mask=mask_train,
        labels=list(df_train[label_col]),    
        num_gpus=NUM_GPUS,        
        num_epochs=NUM_EPOCHS,
        batch_size=BATCH_SIZE,    
        verbose=True,
    )    
print("[Training time: {:.3f} hrs]".format(t.interval / 3600))

t_total value of -1 results in schedule not being applied


epoch:1/1; batch:1->262/2618; loss:1.655931
epoch:1/1; batch:263->524/2618; loss:0.129833
epoch:1/1; batch:525->786/2618; loss:0.295053
epoch:1/1; batch:787->1048/2618; loss:0.043921
epoch:1/1; batch:1049->1310/2618; loss:0.156879
epoch:1/1; batch:1311->1572/2618; loss:0.168521
epoch:1/1; batch:1573->1834/2618; loss:0.217612
epoch:1/1; batch:1835->2096/2618; loss:0.314651
epoch:1/1; batch:2097->2358/2618; loss:0.065314
epoch:1/1; batch:2359->2620/2618; loss:0.088071
[Training time: 1.414 hrs]
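
To make the "linear classification layer" from the Train section concrete, the sketch below computes logits, softmax probabilities, and a cross-entropy loss for one toy example in plain Python. It is illustrative only: the real classifier runs in PyTorch on BERT's sentence representation, and fine-tuning also updates the Transformer weights. The 3-dimensional vector, the 2-class weights, and all function names are assumptions made for this example.

```python
import math


def linear_head(pooled, weights, bias):
    """Return one logit per class: dot(weights[c], pooled) + bias[c]."""
    return [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_label])


# Toy sentence vector (hidden size 3) and a 2-class head; values are made up.
pooled = [1.0, 0.0, -1.0]
weights = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.5]]
bias = [0.0, 0.0]

logits = linear_head(pooled, weights, bias)
probs = softmax(logits)
predicted = probs.index(max(probs))  # class with the highest probability
loss = cross_entropy(probs, true_label=0)
```

Training lowers this loss averaged over batches, which is the quantity reported in the log lines above; prediction keeps only the arg-max step.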


## Score
We score the test set using the trained classifier:

In [40]:
preds = classifier.predict(
    token_ids=tokens_test, input_mask=mask_test, num_gpus=NUM_GPUS, batch_size=BATCH_SIZE
)

27936it [08:36, 53.07it/s]                           


## Evaluate Results
Finally, we compute the overall accuracy and the per-class precision, recall, and F1 scores on the test set.

In [41]:
accuracy = accuracy_score(df_test[label_col], preds)
precision = precision_score(df_test[label_col], preds, average=None)
recall = recall_score(df_test[label_col], preds, average=None)
f1 = f1_score(df_test[label_col], preds, average=None)

print("\n accuracy: {:.6f}".format(accuracy))
pd.DataFrame({"label": labels, "precision": precision, "recall": recall, "f1": f1})


 accuracy: 0.946656


| | label | precision | recall | f1 |
|---|---|---|---|---|
| 0 | culture | 0.911389 | 0.963783 | 0.936854 |
| 1 | diverse | 0.923255 | 0.976289 | 0.949032 |
| 2 | economy | 0.918719 | 0.84845 | 0.882188 |
| 3 | politics | 0.911776 | 0.886633 | 0.899029 |
| 4 | sports | 0.989656 | 0.987783 | 0.988719 |
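
As a sanity check on the table, each F1 score should equal the harmonic mean of the corresponding precision and recall, F1 = 2PR / (P + R). The sketch below verifies this for the reported values and also derives an unweighted (macro) average of the five F1 scores. The macro average is a summary added here, not a number computed in the original notebook, and `f1_from` is a hypothetical helper.

```python
# (precision, recall, f1) per class, copied from the results table above.
reported = {
    "culture": (0.911389, 0.963783, 0.936854),
    "diverse": (0.923255, 0.976289, 0.949032),
    "economy": (0.918719, 0.84845, 0.882188),
    "politics": (0.911776, 0.886633, 0.899029),
    "sports": (0.989656, 0.987783, 0.988719),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Largest gap between the recomputed and the reported F1 (rounding noise only).
max_gap = max(abs(f1_from(p, r) - f1) for p, r, f1 in reported.values())

# Unweighted mean over classes, so small classes count as much as large ones.
macro_f1 = sum(f1 for _, _, f1 in reported.values()) / len(reported)
print("max gap: {:.6f}  macro F1: {:.4f}".format(max_gap, macro_f1))
```

Passing `average="macro"` to `f1_score` would give the same kind of summary directly from the predictions, without the rounding in the table.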
