# AI4D Malawi News Classification Challenge

File name: AI4DClassificationAI.ipynb

Author: kogni7

Date: April/May 2021

## Contents
* 1 Preparation
    * 1.1 GPU
    * 1.2 Time
    * 1.3 Installation
    * 1.4 Libraries and Seed
    * 1.5 Working directory
* 2 Data
    * 2.1 Label encoding and Validation set
    * 2.2 Tokenization
    * 2.3 Datasets
* 3 Training
    * 3.1 Model and Parameters
    * 3.2 Train!
* 4 Prediction and Submission

This notebook uses only the data sets provided by ZINDI. The data sets contain sentences in Chichewa, and these sentences are the only feature used. The task is to assign each sentence to one of the given labels.

The file system for this project is:
* AI4DClassificationAI (root)
    * AI4DClassificationAI.ipynb (this notebook)
    * Data
        * Train.csv
        * Test.csv
        * SampleSubmission.csv
    * Submission
        * 1 - x: Submission directories, named by the version number
            * submission.csv

This Jupyter notebook runs in Google Colab without special configuration, provided the GPU runtime is enabled.

The approach is transformer based: a pretrained multilingual BERT model is fine-tuned for sequence classification.

## 1 Preparation
### 1.1 GPU

Make sure the assigned GPU is the one shown in the output below (Tesla T4); otherwise, restart the runtime.

In [1]:
!nvidia-smi

Sun May  9 18:13:42 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   52C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
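The check above is done by eye. A minimal sketch of an automated version follows, assuming `nvidia-smi` is on the PATH as it is on Colab GPU runtimes. The names `EXPECTED_GPU`, `parse_gpu_names`, and `current_gpu_names` are hypothetical and not part of the original notebook.

```python
import shutil
import subprocess

EXPECTED_GPU = "Tesla T4"  # assumption: the GPU shown in the output above


def parse_gpu_names(query_output):
    """Split the output of the nvidia-smi name query into a list of GPU names."""
    return [line.strip() for line in query_output.splitlines() if line.strip()]


def current_gpu_names():
    """Return the names of all visible GPUs, or an empty list without nvidia-smi."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_names(result.stdout)


names = current_gpu_names()
if EXPECTED_GPU not in names:
    print("Expected {}, found {}; consider restarting the runtime.".format(
        EXPECTED_GPU, names or "no GPU"))
```

The script only prints a hint and does not stop the notebook, so the remaining cells can still be run on a different GPU if desired.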

### 1.2 Time

In [2]:
import time
start_time = time.time()

### 1.3 Installation

In [3]:
!pip install git+https://github.com/huggingface/transformers

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-mx40idfa
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-mx40idfa
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting huggingface-hub==0.0.8
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d

### 1.4 Libraries and Seed

In [4]:
SEED = 42

# Math
import numpy as np
print("Numpy Version: " + str(np.__version__))

import random
import os
os.environ['PYTHONHASHSEED'] = str(SEED)

np.random.seed(SEED + 1)

random.seed(SEED + 2)

# PyTorch
import torch
print("PyTorch Version: "  + str(torch.__version__))
torch.manual_seed(SEED + 3)
torch.cuda.manual_seed_all(SEED + 4)

# Time
import time

# CSV
import pandas as pd
print("Pandas Version: " + str(pd.__version__))

# Machine Learning
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
print("SciKit-Learn Version: " + str(sklearn.__version__))

# Transformers
import transformers
from transformers import BertTokenizerFast, BertForSequenceClassification, Trainer, TrainingArguments
print("Transformers Version: " + str(transformers.__version__))

Numpy Version: 1.19.5
PyTorch Version: 1.8.1+cu101
Pandas Version: 1.1.5
SciKit-Learn Version: 0.22.2.post1
Transformers Version: 4.6.0.dev0


### 1.5 Working directory

In [5]:
# The Version
VERSION = '8'

# for use in Google Colab
from google.colab import drive
drive.mount('/content/drive')

# Working Directory
WD = os.getcwd() + '/drive/My Drive/Colab Notebooks/AI4DClassificationAI'

Mounted at /content/drive


## 2 Data

In [6]:
train_csv = pd.read_csv(WD + '/Data/Train.csv')
test_csv = pd.read_csv(WD + '/Data/Test.csv')
sample_submission_csv = pd.read_csv(WD + '/Data/SampleSubmission.csv')
train_csv.head()

Unnamed: 0,ID,Text,Label
0,ID_AASHwXxg,Mwangonde: Khansala wachinyamata Akamati achi...,POLITICS
1,ID_AGoFySzn,MCP siidakhutire ndi kalembera Chipani cha Ma...,POLITICS
2,ID_AGrrkBGP,Bungwe la MANEPO Lapempha Boma Liganizire Anth...,HEALTH
3,ID_AIJeigeG,Ndale zogawanitsa miyambo zanyanya Si zachile...,POLITICS
4,ID_APMprMbV,Nanga wapolisi ataphofomoka? Masiku ano sichi...,LAW/ORDER


In [7]:
test_csv.head()

Unnamed: 0,ID,Text
0,ID_ADHEtjTi,Abambo odzikhweza akuchuluka Kafukufuku wa ap...
1,ID_AHfJktdQ,Ambuye Ziyaye Ayamikira Aphunzitsi a Tilitonse...
2,ID_AUJIHpZr,Anatcheleza: Akundiopseza a gogo wanga Akundi...
3,ID_AUKYBbIM,Ulova wafika posauzana Adatenga digiri ya uph...
4,ID_AZnsVPEi,"Dzombe kukoma, koma Kuyambira makedzana, pant..."


In [8]:
sample_submission_csv.head()

Unnamed: 0,ID,Label
0,ID_sQaPRMWO,0
1,ID_TanclvfR,0
2,ID_CNbveyvk,0
3,ID_MclKMhyP,0
4,ID_rNrmXOGD,0


### 2.1 Label encoding and Validation set

In [9]:
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(list(train_csv["Label"]))

X_train, X_val, y_train, y_val = train_test_split(list(train_csv["Text"]), labels, test_size=0.2, random_state=SEED)

X_test = list(test_csv["Text"])
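To make the two steps above concrete, here is a small pure-Python sketch of what they amount to. It assumes that `LabelEncoder` numbers the classes in sorted order and that, for a fractional `test_size`, `train_test_split` rounds the validation size up. The helpers `encode_labels` and `split_sizes` are hypothetical stand-ins for illustration, not scikit-learn functions.

```python
import math


def encode_labels(labels):
    """Map each label to the index of its class in sorted order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return classes, [index[name] for name in labels]


def split_sizes(n_samples, test_size):
    """Return (n_train, n_val) for a fractional test_size, rounding n_val up."""
    n_val = math.ceil(n_samples * test_size)
    return n_samples - n_val, n_val


# The five labels shown by train_csv.head() above.
head_labels = ["POLITICS", "POLITICS", "HEALTH", "POLITICS", "LAW/ORDER"]
classes, encoded = encode_labels(head_labels)
decoded = [classes[i] for i in encoded]  # the inverse mapping used for the submission

print(classes)
print(encoded)
print(split_sizes(10, 0.2))
```

The inverse mapping is what `label_encoder.inverse_transform` provides later, when predicted class indices are turned back into label names.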

### 2.2 Tokenization

In [10]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

train_data = tokenizer(X_train, return_tensors="pt", padding=True, truncation=True)
val_data = tokenizer(X_val, return_tensors="pt", padding=True, truncation=True)
test_data = tokenizer(X_test, return_tensors="pt", padding=True, truncation=True)
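The effect of `padding=True` and `truncation=True` can be illustrated with a simplified sketch. It ignores special-token handling (the real tokenizer keeps the closing `[SEP]` token when it truncates), and the token ids and the helper `pad_and_truncate` are made up for illustration.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad all to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy sequences of different length; max_length=4 is only for the demo
# (BERT models accept up to 512 tokens).
toy_batch = [[101, 5, 6, 7, 102], [101, 8, 102]]
encoded = pad_and_truncate(toy_batch, max_length=4)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Because the tokenizer is called once per split here, every text in a split is padded to the length of that split's longest (truncated) text. Padding per batch instead, for example with the `DataCollatorWithPadding` class from `transformers`, would be one way to reduce memory use, but that is not what this notebook does.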


### 2.3 Datasets

In [11]:
class MakeDataSet(torch.utils.data.Dataset):
    """Labelled dataset: tokenizer output plus one integer label per example."""
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __getitem__(self, idx):
        # One row of every tokenizer field (input_ids, attention_mask, ...)
        item = {key: val[idx] for key, val in self.data.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

class MakeTestSet(torch.utils.data.Dataset):
    """Unlabelled dataset for prediction: tokenizer output only."""
    def __init__(self, data):
        self.data = data

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.data.items()}
        return item

    def __len__(self):
        return len(self.data['input_ids'])

train_dataset = MakeDataSet(train_data, y_train)
val_dataset = MakeDataSet(val_data, y_val)
test_dataset = MakeTestSet(test_data)

## 3 Training
### 3.1 Model and Parameters

In [12]:
# Number of target classes; must match the number of classes found by label_encoder
LABELS = 20

model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-cased", num_labels=LABELS)

batch_size = 8

args = TrainingArguments(
        output_dir="output",
        evaluation_strategy="steps",    # evaluate every logging_steps steps
        learning_rate=0.4e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        #weight_decay=0.01,
        save_total_limit=3,             # keep at most three checkpoints on disk
        num_train_epochs=25,
        load_best_model_at_end=True,    # reload the best checkpoint after training
        save_strategy="steps",
        logging_steps=500,
        save_steps=500,
        seed=SEED)


Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model ch

In [13]:
def compute_metrics(eval_preds):
    """
    The accuracy metric.
    """
    preds, labels = eval_preds
    predictions = np.argmax(preds, axis=1)
    return {'accuracy': accuracy_score(labels, predictions)}
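As a worked example of the metric, the following pure-Python sketch performs the same two steps as `compute_metrics`: take the argmax over each row of logits, then compare with the labels. The logits and labels are invented toy values, and `argmax` and `accuracy_from_logits` are hypothetical helpers.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


toy_logits = [
    [0.1, 2.0, -1.0],  # predicted class 1
    [1.5, 0.2, 0.3],   # predicted class 0
    [0.0, 0.1, 0.9],   # predicted class 2
    [0.4, 0.3, 0.2],   # predicted class 0
]
toy_labels = [1, 0, 1, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # 3 of 4 correct
```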

### 3.2 Train!

In [14]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

Step,Training Loss,Validation Loss,Accuracy
500,2.4957,2.206061,0.34375
1000,1.9576,1.974657,0.409722
1500,1.6101,1.845281,0.454861
2000,1.3287,1.834575,0.486111
2500,1.1232,1.833148,0.496528
3000,0.9811,1.818415,0.517361
3500,0.8946,1.827718,0.520833


TrainOutput(global_step=3600, training_loss=1.4681135283576117, metrics={'train_runtime': 3444.1901, 'train_samples_per_second': 1.045, 'total_flos': 434944759971840.0, 'epoch': 25.0, 'init_mem_cpu_alloc_delta': 2088693760, 'init_mem_gpu_alloc_delta': 711757312, 'init_mem_cpu_peaked_delta': 0, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 312045568, 'train_mem_gpu_alloc_delta': 2197129728, 'train_mem_cpu_peaked_delta': 94392320, 'train_mem_gpu_peaked_delta': 6546114560})
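The log above can be read with a little arithmetic. The sketch below uses only numbers copied from the log and from the training arguments. It assumes a single GPU, no gradient accumulation, and that the last partial batch is kept (the `Trainer` defaults). Since `metric_for_best_model` is not set, `load_best_model_at_end=True` should select by validation loss, so the checkpoint with the lowest logged validation loss is presumably the one used for prediction.

```python
# (step, validation loss, accuracy), copied from the training log above.
eval_log = [
    (500, 2.206061, 0.34375),
    (1000, 1.974657, 0.409722),
    (1500, 1.845281, 0.454861),
    (2000, 1.834575, 0.486111),
    (2500, 1.833148, 0.496528),
    (3000, 1.818415, 0.517361),
    (3500, 1.827718, 0.520833),
]

GLOBAL_STEP = 3600  # from TrainOutput
EPOCHS = 25         # num_train_epochs
BATCH_SIZE = 8      # per_device_train_batch_size

steps_per_epoch = GLOBAL_STEP // EPOCHS
# With the last partial batch kept, the training set size lies in this range.
max_train_samples = steps_per_epoch * BATCH_SIZE
min_train_samples = (steps_per_epoch - 1) * BATCH_SIZE + 1

best_by_loss = min(eval_log, key=lambda row: row[1])
best_by_accuracy = max(eval_log, key=lambda row: row[2])

print("steps per epoch:", steps_per_epoch)
print("training samples between", min_train_samples, "and", max_train_samples)
print("lowest validation loss at step", best_by_loss[0])
print("highest accuracy at step", best_by_accuracy[0])
```

The two criteria disagree here: validation loss is lowest at step 3000, while accuracy is still slightly higher at step 3500.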

## 4 Prediction and Submission

In [15]:
prediction = trainer.predict(test_dataset)
prediction = np.argmax(prediction.predictions, axis=1)
prediction = label_encoder.inverse_transform(prediction)

In [16]:
sample_submission_csv.ID = test_csv.ID
sample_submission_csv.Label = list(prediction)
sample_submission_csv.head()

Unnamed: 0,ID,Label
0,ID_ADHEtjTi,LAW/ORDER
1,ID_AHfJktdQ,RELIGION
2,ID_AUJIHpZr,RELATIONSHIPS
3,ID_AUKYBbIM,HEALTH
4,ID_AZnsVPEi,FARMING


In [17]:
# Create the versioned submission directory; no error if it already exists
os.makedirs(WD + '/Submission/' + str(VERSION), exist_ok=True)

In [18]:
sample_submission_csv.to_csv(WD + '/Submission/' + str(VERSION) + '/submission.csv', index=False)
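Before uploading, the written file could be sanity-checked. The following is a minimal sketch using only the standard library. The function `validate_submission` and the tiny inline CSV are hypothetical; in the notebook, the text would come from the saved `submission.csv`, the IDs from `test_csv.ID`, and the labels from `label_encoder.classes_`.

```python
import csv
import io


def validate_submission(csv_text, expected_ids, known_labels):
    """Return a list of problems found in a submission with ID and Label columns."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    ids = [row["ID"] for row in rows]
    if len(ids) != len(set(ids)):
        problems.append("duplicate IDs")
    if set(ids) != set(expected_ids):
        problems.append("ID set differs from the test set")
    unknown = sorted({row["Label"] for row in rows} - set(known_labels))
    if unknown:
        problems.append("unknown labels: " + ", ".join(unknown))
    return problems


# Toy data modelled on the first rows of the submission shown above.
good_csv = "ID,Label\nID_ADHEtjTi,LAW/ORDER\nID_AHfJktdQ,RELIGION\n"
bad_csv = "ID,Label\nID_ADHEtjTi,LAW/ORDER\nID_ADHEtjTi,SPORT\n"
test_ids = ["ID_ADHEtjTi", "ID_AHfJktdQ"]
labels = ["LAW/ORDER", "RELIGION"]

print(validate_submission(good_csv, test_ids, labels))
print(validate_submission(bad_csv, test_ids, labels))
```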

In [19]:
drive.flush_and_unmount()

In [20]:
end_time = time.time()
print("Runtime of the Notebook: {} min".format(np.round((end_time - start_time) / 60, 2)))

Runtime of the Notebook: 59.16 min
