<a href="https://colab.research.google.com/github/gizdatalab/SDG_11_Tracking_Colombia/blob/main/src/Classifier_Training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installations

In [None]:
%%capture
! pip install datasets transformers sentencepiece huggingface_hub
! apt install git-lfs
! pip install -U albumentations
! pip install sentence-transformers
! pip install optuna

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


To be able to share your model with the community there are a few more steps to follow.

First, create an access token on the Hugging Face website (sign up if you haven't already). Then execute the following cells and paste the token when prompted:

In [None]:
!git config --global credential.helper store

In [None]:
from huggingface_hub import notebook_login
notebook_login()

# Load Packages

In [None]:
import torch
from pathlib import Path
import datasets
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification, 
                          PreTrainedModel, BertModel, BertForSequenceClassification,RobertaForSequenceClassification,
                          TrainingArguments, Trainer, TrainerCallback)
from transformers.modeling_outputs import SequenceClassifierOutput
import torch.optim as optim
# # Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
import numpy as np
import pandas as pd
import sklearn.metrics as skm
import os
from sklearn.metrics import mean_absolute_error, accuracy_score,confusion_matrix,f1_score

# Dataset

* https://huggingface.co/docs/datasets/load_hub

Two ways of loading the data are shown below:

1. In memory, via pandas.
2. From files, via the Hugging Face `datasets` library.

## From Pandas

In [None]:
sector_dir = '/content/drive/MyDrive/Colab Notebooks/giz/policyData/sector_data/'

In [None]:
df = pd.read_json(sector_dir + "train_val_1.json")

# renaming the columns because later the tokenizer will need it
df = df.rename(columns = {'sector_label':'labels', 'context':'text'})
# if data has column to identify for train-val
df_train = df[df.split == 'train']
df_val = df[df.split == 'val']

# in case of hyper-parameter search it is good to reduce the training set 
df_train = df_train.sample(frac = 0.85)

In [None]:
# we need datasets format to work with
train_ds = datasets.Dataset.from_pandas(df_train)
val_ds =  datasets.Dataset.from_pandas(df_val)
train_ds = train_ds.shuffle(seed=7)
val_ds = val_ds.shuffle(seed=7)

In [None]:
print("training data size:",train_ds.num_rows)
print("val data size:", val_ds.num_rows)

training data size: 1178
val data size: 606


## Load Dataset

In [None]:
sector_dir = '/content/drive/MyDrive/Colab Notebooks/giz/policyData/sector_data/'

In [None]:
import datasets
from datasets import load_dataset
dataset = load_dataset("json", data_files={"train": sector_dir + "train_val_1.json"})
dataset = dataset.shuffle(seed=7)
dataset = dataset['train'].rename_column("sector_label", "labels")
dataset = dataset.rename_column("context", "text")
dataset = dataset.train_test_split(test_size =0.10)

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-f7dcfce4f8edd77b/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...


Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-f7dcfce4f8edd77b/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.


In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'Document', 'sector_list', 'Document_name', 'country_count', 'doc_count', 'sector_tuple', 'country', 'labels', 'Agriculture', 'Buildings', 'Coastal Zone', 'Disaster Risk Management (DRM)', 'Economy-wide', 'Energy', 'Environment', 'Health', 'Industries', 'LULUCF/Forestry', 'Social Development', 'Transport', 'Urban', 'Waste', 'Water', 'set0', 'set1', 'set2', 'set3', 'INDC', 'First NDC', 'Second NDC', 'Revised First NDC', 'split'],
        num_rows: 7615
    })
    test: Dataset({
        features: ['text', 'Document', 'sector_list', 'Document_name', 'country_count', 'doc_count', 'sector_tuple', 'country', 'labels', 'Agriculture', 'Buildings', 'Coastal Zone', 'Disaster Risk Management (DRM)', 'Economy-wide', 'Energy', 'Environment', 'Health', 'Industries', 'LULUCF/Forestry', 'Social Development', 'Transport', 'Urban', 'Waste', 'Water', 'set0', 'set1', 'set2', 'set3', 'INDC', 'First NDC', 'Second NDC', 'Revised First NDC', 'spli

In [None]:
print("training data size:",dataset['train'].num_rows)
print("val data size:", dataset['test'].num_rows)

training data size: 7615
val data size: 847


In [None]:
sectors = ['Agriculture', 'Buildings','Coastal Zone', 'Disaster Risk Management (DRM)',
       'Economy-wide', 'Energy', 'Environment', 'Health', 'Industries',
       'LULUCF/Forestry', 'Social Development', 'Transport', 'Urban', 'Waste',
       'Water']

From the sector classification data we keep only the rows that have answers: an answer means that a response was detected for that context, so the sector label is more likely to be correct.


# Multi-label Classification

The subsections below walk through the main components of training that we might need.


Resources (Colab notebooks and code):

* https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb
* https://github.com/akaver/nlp2019-final/blob/master/BertForMultiLabelSequenceClassification.py
* https://discuss.huggingface.co/t/fine-tune-for-multiclass-or-multilabel-multiclass/4035/8
* https://colab.research.google.com/drive/1X7l8pM6t4VLqxQVJ23ssIxmrsc4Kpc5q?usp=sharing#scrollTo=i3DMa8tLvzhq
* https://github.com/Dirkster99/PyNotes/blob/master/Transformers/transformers_multi_label_classification.ipynb
* https://colab.research.google.com/drive/1aue7x525rKy6yYLqqt-5Ll96qjQvpqS7#scrollTo=u61lCTUu5606
* https://github.com/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb
* [HF Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)
* [Trainer source code](https://github.com/huggingface/transformers/blob/v4.27.2/src/transformers/trainer_utils.py#L362)

In [None]:
sector_dir = '/content/drive/MyDrive/Colab Notebooks/giz/policyData/sector_data/'

# define the list of labels here, if you are working with already trained model
# better to check the sequence of labels of your classifiers.
sectors = ['Agriculture', 'Buildings','Coastal Zone', 'Disaster Risk Management (DRM)',
       'Economy-wide', 'Energy', 'Environment', 'Health', 'Industries',
       'LULUCF/Forestry', 'Social Development', 'Transport', 'Urban', 'Waste',
       'Water']

In [None]:
label_names = sectors 
id2label = {idx:label for idx, label in enumerate(label_names)}
label2id = {label:idx for idx, label in enumerate(label_names)}
num_labels = len(sectors)
id2label

{0: 'Agriculture',
 1: 'Buildings',
 2: 'Coastal Zone',
 3: 'Disaster Risk Management (DRM)',
 4: 'Economy-wide',
 5: 'Energy',
 6: 'Environment',
 7: 'Health',
 8: 'Industries',
 9: 'LULUCF/Forestry',
 10: 'Social Development',
 11: 'Transport',
 12: 'Urban',
 13: 'Waste',
 14: 'Water'}
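
With the `label2id` mapping above, a list of sector names can be turned into the 15-dimensional multi-hot vector that a multi-label classifier expects as its `labels`. The following is a minimal sketch; the helpers `to_multi_hot` and `from_multi_hot` are illustrative names and are not part of this notebook.

```python
sectors = ['Agriculture', 'Buildings', 'Coastal Zone',
           'Disaster Risk Management (DRM)', 'Economy-wide', 'Energy',
           'Environment', 'Health', 'Industries', 'LULUCF/Forestry',
           'Social Development', 'Transport', 'Urban', 'Waste', 'Water']
label2id = {label: idx for idx, label in enumerate(sectors)}
id2label = {idx: label for label, idx in label2id.items()}


def to_multi_hot(names):
    """Return a list of floats with 1.0 at the position of each named sector."""
    vec = [0.0] * len(label2id)
    for name in names:
        vec[label2id[name]] = 1.0
    return vec


def from_multi_hot(vec):
    """Map the positions set to 1.0 back to sector names."""
    return [id2label[i] for i, value in enumerate(vec) if value == 1.0]


encoded = to_multi_hot(['Energy', 'Transport'])
decoded = from_multi_hot(encoded)
print(encoded)
print(decoded)
```

Because the position of each label is fixed by the order of `sectors`, the same list has to be reused whenever a trained model is loaded again.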

## Metrics

In [None]:
# Define the metrics as required.
# For imbalanced data we need to look at precision, recall and F1.
def get_predictions(y_pred, y_true, thresh=0.5, sigmoid=True):
    y_pred = torch.from_numpy(y_pred)
    y_true = torch.from_numpy(y_true)
    if sigmoid:
        # logits -> probabilities -> binary predictions
        y_pred = y_pred.sigmoid()
        y_pred = (y_pred > thresh)
    report = skm.classification_report(y_true, y_pred, output_dict=True)
    df_report = pd.DataFrame(report).transpose()
    return {"Precision_micro": df_report.loc['micro avg']['precision'],
            "Precision_weighted": df_report.loc['weighted avg']['precision'],
            "Precision_samples": df_report.loc['samples avg']['precision'],
            "Recall_micro": df_report.loc['micro avg']['recall'],
            "Recall_weighted": df_report.loc['weighted avg']['recall'],
            "Recall_samples": df_report.loc['samples avg']['recall'],
            "F1-Score": df_report.loc['samples avg']['f1-score'],
            "accuracy": skm.accuracy_score(y_true, y_pred)}

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return get_predictions(predictions, labels)
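
The `F1-Score` entry above is read from the `samples avg` row of the report, which means F1 is computed per example and then averaged over examples. The sketch below reproduces that averaging without any dependencies. The function `samples_f1` is a hypothetical helper, and it scores an example with no true and no predicted labels as 0, which is our understanding of scikit-learn's default zero-division handling.

```python
def samples_f1(y_true, y_pred):
    """Average of per-example F1 scores for binary multi-label rows."""
    scores = []
    for true_row, pred_row in zip(y_true, y_pred):
        tp = sum(1 for t, p in zip(true_row, pred_row) if t and p)
        denom = sum(true_row) + sum(pred_row)
        # F1 = 2*TP / (number of true labels + number of predicted labels)
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, made up for illustration only
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]
score = samples_f1(y_true, y_pred)
print(round(score, 4))
```

The first example recovers one of its two labels (F1 = 2/3) and the second is exact (F1 = 1), so the average is 5/6. A rare label that is always missed lowers this score only through the examples that carry it, which is why the per-class rows of the report are still worth reading.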

## Sentence Transformer

Prefer Sentence Transformers checkpoints over plain BERT models as the starting point for the classifier (see the references below).

1.   https://huggingface.co/blog/classification-use-cases
2.   https://www.sbert.net/docs/pretrained_models.html



In [None]:
# define the model checkpoint
model_checkpoint = "sentence-transformers/all-mpnet-base-v2"

# problem_type is not needed by the tokenizer; it is passed only for consistency with the model setup
# https://huggingface.co/docs/transformers/main_classes/configuration?highlight=multi_label_classification#transformers.PretrainedConfig 
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint,problem_type="multi_label_classification")

### Tokenization

In [None]:
import datasets

cols = train_ds.column_names
cols.remove("labels")
print('Training data:',train_ds.num_rows)
print('Validation data:',val_ds.num_rows)


Training data: 1178
Validation data: 606


In [None]:
# Need to tokenize the data using the tokenizer of the model
def tokenize_and_encode(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True,
                        max_length=384)    

In [None]:
train_tokenized = train_ds.map(tokenize_and_encode, batched=True, remove_columns= cols)
val_tokenized = val_ds.map(tokenize_and_encode, batched=True, remove_columns= cols)


In [None]:
# need this to avoid error due to type mismatch
# https://discuss.pytorch.org/t/multi-label-binary-classification-result-type-float-cant-be-cast-to-the-desired-output-type-long/117915/3
train_tokenized.set_format("torch")
train_tokenized = (train_tokenized
          .map(lambda x : {"float_labels": x["labels"].to(torch.float)}, remove_columns=["labels"])
          .rename_column("float_labels", "labels"))

In [None]:
val_tokenized.set_format("torch")
val_tokenized = (val_tokenized
          .map(lambda x : {"float_labels": x["labels"].to(torch.float)}, remove_columns=["labels"])
          .rename_column("float_labels", "labels"))



Background reading on weight decay and on the loss used for multi-label training:

1. https://towardsdatascience.com/this-thing-called-weight-decay-a7cd4bcfccab
2. https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html



In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, 
            id2label = id2label,label2id = label2id,num_labels=num_labels,
            problem_type="multi_label_classification").to(device)
batch_size = 8
num_train_epochs = 5


args = TrainingArguments(
    "mpnet-multilabel-sector-classifier_hpsearch",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    logging_strategy="epoch",
    learning_rate=8e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=0.02,
    lr_scheduler_type = "linear",
    # push_to_hub=True,
    # fp16 = True,
    warmup_steps = 200,
)
multi_trainer =  Trainer(
    model,
    args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer)

multi_trainer.evaluate()
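
The combination of `lr_scheduler_type = "linear"` and `warmup_steps = 200` above means the learning rate climbs linearly from 0 to `learning_rate` and then decays linearly to 0. The sketch below shows the shape of that schedule in plain Python. It assumes a single device, no gradient accumulation, and the 1178 training rows printed earlier; `linear_warmup_lr` is an illustrative helper, not a `transformers` function.

```python
import math


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


steps_per_epoch = math.ceil(1178 / 8)   # rows / per-device batch size
total_steps = steps_per_epoch * 5       # num_train_epochs = 5

lr_at_100 = linear_warmup_lr(100, 8e-5, 200, total_steps)
lr_at_200 = linear_warmup_lr(200, 8e-5, 200, total_steps)
lr_at_end = linear_warmup_lr(total_steps, 8e-5, 200, total_steps)
print(total_steps, lr_at_100, lr_at_200, lr_at_end)
```

Under these assumptions the run has 740 optimizer steps, so 200 warmup steps cover roughly a quarter of training. On a larger training set the same setting is a much smaller share, which is worth keeping in mind when comparing runs.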

In [None]:
multi_trainer.train()

In [None]:
predictions= multi_trainer.predict(val_tokenized)
pred,labels,_ = predictions
y_pred = torch.from_numpy(pred)
y_true = torch.from_numpy(labels)
y_prob = y_pred.sigmoid()
thresh = 0.4
y_pred = (y_prob>thresh).bool()


In [None]:
import sklearn.metrics as skm
import pandas as pd

cm = skm.multilabel_confusion_matrix(y_true, y_pred)
print(cm)
print( skm.classification_report(y_true,y_pred))
report = skm.classification_report(y_true, y_pred, output_dict=True)
df_report = pd.DataFrame(report).transpose()
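
The prediction cell above turns sigmoid probabilities into labels with a fixed threshold. Lowering the threshold trades precision for recall. The sketch below makes that trade-off visible on a few made-up logits; the data and the helper `micro_precision_recall` are illustrative only and do not come from the model in this notebook.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def micro_precision_recall(logits, y_true, thresh):
    """Micro-averaged precision and recall after thresholding sigmoid(logit)."""
    tp = fp = fn = 0
    for logit_row, true_row in zip(logits, y_true):
        for logit, truth in zip(logit_row, true_row):
            predicted = sigmoid(logit) > thresh
            if predicted and truth:
                tp += 1
            elif predicted and not truth:
                fp += 1
            elif truth:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy logits and labels, made up for illustration
logits = [[2.0, -0.3, -2.0], [-0.2, 1.5, -1.0]]
y_true = [[1, 1, 0], [0, 1, 0]]
results = {t: micro_precision_recall(logits, y_true, t) for t in (0.4, 0.5)}
for thresh, (precision, recall) in results.items():
    print(thresh, round(precision, 3), round(recall, 3))
```

On this toy input, moving from 0.5 to 0.4 recovers the missed positive but also admits one false positive. Sweeping the threshold on the validation set in the same way is one option for choosing it instead of fixing it by hand.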

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, 
            id2label = id2label,label2id = label2id,num_labels=num_labels,
            problem_type="multi_label_classification").to(device)
batch_size = 8
num_train_epochs = 5


args = TrainingArguments(
    "mpnet-multilabel-sector-classifier_32_cosine",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    logging_strategy="epoch",
    learning_rate=7e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=0.02,
    lr_scheduler_type = "linear",
    # push_to_hub=True,
    # fp16 = True,
    warmup_steps = 200,
)

multi_trainer =  Trainer(
    model,
    args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer)

multi_trainer.evaluate()

In [None]:
multi_trainer.train()

In [None]:
predictions= multi_trainer.predict(val_tokenized)
pred,labels,_ = predictions
y_pred = torch.from_numpy(pred)
y_true = torch.from_numpy(labels)
y_prob = y_pred.sigmoid()
thresh = 0.5
y_pred = (y_prob>thresh).bool()


In [None]:
import sklearn.metrics as skm
import pandas as pd

cm = skm.multilabel_confusion_matrix(y_true, y_pred)
print(cm)
print( skm.classification_report(y_true,y_pred))
report = skm.classification_report(y_true, y_pred, output_dict=True)
df_report = pd.DataFrame(report).transpose()

## Hyper-parameter search

*  https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/34?page=3
*   https://github.com/huggingface/blog/blob/main/ray-tune.md
*  https://github.com/huggingface/transformers/pull/6576




In [None]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, 
                                        return_dict=True, num_labels=num_labels,
            problem_type="multi_label_classification" )


# define the hyper-parameters to search over and the range to explore for each
def my_hp_space_optuna(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 2e-5, 1e-4, log=True),
        # warmup_steps must be an integer, so suggest_int is used here
        "warmup_steps":  trial.suggest_int("warmup_steps", 100, 500, step=100),
        "weight_decay":  trial.suggest_float("weight_decay", 1e-4, 1e-1),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [3, 8, 16]),
        "gradient_accumulation_steps": trial.suggest_categorical("gradient_accumulation_steps" ,[1,2]),
    }


# Define the metrics as required.
# The data is imbalanced, so we use F1. Only this one metric is returned:
# when no compute_objective is passed, the search objective is, as far as we
# know, the sum of the returned evaluation metrics, so returning F1 alone
# makes the search maximize F1.
def get_predictions(y_pred, y_true, thresh=0.5, sigmoid=True):
    y_pred = torch.from_numpy(y_pred)
    y_true = torch.from_numpy(y_true)
    if sigmoid:
        y_pred = y_pred.sigmoid()
        y_pred = (y_pred > thresh)
    report = skm.classification_report(y_true, y_pred, output_dict=True)
    df_report = pd.DataFrame(report).transpose()
    # add precision, recall or accuracy entries only together with a custom
    # compute_objective, otherwise they would be summed into the objective
    return {"F1-Score": df_report.loc['samples avg']['f1-score']}

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return get_predictions(predictions, labels)


class MemorySaverCallback(TrainerCallback):
    "A callback that deletes the folder in which checkpoints are saved, to save disk space"
    def __init__(self, run_name):
        super(MemorySaverCallback, self).__init__()
        self.run_name = run_name

    def on_train_begin(self, args, state, control, **kwargs):
        # called at the start of every trial, so each trial starts from a clean folder
        print("Removing dirs...")
        if os.path.isdir(f'./{self.run_name}'):
            import shutil
            shutil.rmtree(f'./{self.run_name}')
        else:
            print("\n\nDirectory does not exist")

# def my_objective(metrics):
#     return metrics["eval_f1"]
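
The learning rate above is suggested with `log=True`, so the range is explored on a logarithmic scale: values between 2e-5 and about 4.5e-5 are as likely as values between about 4.5e-5 and 1e-4. The sketch below illustrates log-uniform sampling with the standard library only. It is not Optuna's implementation (its samplers adapt to earlier trials), and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(7)
low, high = 2e-5, 1e-4
samples = [sample_log_uniform(low, high, rng) for _ in range(1000)]

# The geometric mean splits a log-uniform range into two equally likely halves
geometric_mid = math.sqrt(low * high)
share_below_mid = sum(s < geometric_mid for s in samples) / len(samples)
print(geometric_mid, share_below_mid)
```

A plain uniform draw over the same interval would place well over half of the trials above 4.5e-5, which is usually not what is wanted for a learning rate.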

In [None]:
batch_size = 8
# this epochs value will be used for all trials
num_train_epochs = 3

# repo/folder name in local directory, where model files will be saved
RUN_NAME = "mpnet-multilabel-sector-classifier_hpsearch"

args = TrainingArguments(
    RUN_NAME,
    evaluation_strategy = "epoch",
    save_strategy="no",
    logging_strategy="steps",
    logging_steps=1,
    overwrite_output_dir=True,
    learning_rate=8e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=0.02,
    lr_scheduler_type = "linear",
    gradient_accumulation_steps = 1,
    warmup_steps = 200,
)


In [None]:
multi_trainer = Trainer(
    model_init=model_init,
    args=args, 
    # remember to keep the training set small to avoid long runtime
    # as we are doing only parameter search
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    compute_metrics=compute_metrics,
    callbacks=[MemorySaverCallback(RUN_NAME)]
)

In [None]:
best_run = multi_trainer.hyperparameter_search(n_trials=10, direction="maximize",
                                               hp_space=my_hp_space_optuna,)
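
Once the search has finished, the selected values still have to be copied into the training arguments before the final run. The sketch below shows one way to do that. It uses stand-in objects so that it runs on its own: `BestRunStub` mimics the `objective` and `hyperparameters` fields we expect on the object returned by `hyperparameter_search`, `args_stub` stands in for `TrainingArguments`, and every number in it is a placeholder, not a result from this notebook.

```python
from collections import namedtuple
from types import SimpleNamespace

# Stand-in for the object returned by the search (field names are an assumption)
BestRunStub = namedtuple("BestRunStub", ["run_id", "objective", "hyperparameters"])
best_run_stub = BestRunStub(
    run_id="0",
    objective=0.0,  # placeholder value
    hyperparameters={"learning_rate": 5e-5, "warmup_steps": 300,
                     "weight_decay": 0.01},  # placeholder values
)

# Stand-in for TrainingArguments with the defaults used in this notebook
args_stub = SimpleNamespace(learning_rate=8e-5, warmup_steps=200,
                            weight_decay=0.02, per_device_train_batch_size=8)


def apply_hyperparameters(args, hyperparameters):
    """Copy searched values onto an arguments object, rejecting unknown names."""
    for name, value in hyperparameters.items():
        if not hasattr(args, name):
            raise AttributeError(f"unknown training argument: {name}")
        setattr(args, name, value)
    return args


apply_hyperparameters(args_stub, best_run_stub.hyperparameters)
print(args_stub)
```

Arguments that were not part of the search space, such as the batch size here, keep their previous values.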

## Class weights

This subsection shows how to use class weights to handle imbalanced data.


*   https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html
*   https://discuss.pytorch.org/t/weights-in-bcewithlogitsloss/27452
*   https://discuss.pytorch.org/t/pos-weight-and-weight-parameters-in-bcewithlogitsloss/130651/5
*   https://discuss.pytorch.org/t/multi-label-multi-class-class-imbalance/37573/10



In [None]:
for i,sector in enumerate(sectors):
    # df_train[sector] = df_train.apply(lambda x: x['sector_label'][i], axis =1)
    print(i,".",sector, ":", sum(df_train[sector]))

positive_weights = {}
negative_weights = {}
for sector in sectors:
    positive_weights[sector] = df_train.shape[0]/(2*np.count_nonzero(df_train[sector]==1))
    negative_weights[sector] = df_train.shape[0]/(2*np.count_nonzero(df_train[sector]==0))
print(positive_weights)
print(negative_weights)

0 . Agriculture : 1748
1 . Buildings : 108
2 . Coastal Zone : 396
3 . Disaster Risk Management (DRM) : 447
4 . Economy-wide : 447
5 . Energy : 1889
6 . Environment : 591
7 . Health : 421
8 . Industries : 270
9 . LULUCF/Forestry : 1135
10 . Social Development : 326
11 . Transport : 651
12 . Urban : 294
13 . Waste : 398
14 . Water : 965
{'Agriculture': 2.2471395881006866, 'Buildings': 36.370370370370374, 'Coastal Zone': 9.919191919191919, 'Disaster Risk Management (DRM)': 8.787472035794183, 'Economy-wide': 8.787472035794183, 'Energy': 2.0794070937003704, 'Environment': 6.646362098138748, 'Health': 9.330166270783849, 'Industries': 14.548148148148147, 'LULUCF/Forestry': 3.4607929515418503, 'Social Development': 12.049079754601227, 'Transport': 6.033794162826421, 'Urban': 13.360544217687075, 'Waste': 9.869346733668342, 'Water': 4.070466321243523}
{'Agriculture': 0.6430910281597905, 'Buildings': 0.5069695405265875, 'Coastal Zone': 0.5265415549597855, 'Disaster Risk Management (DRM)': 0.53016

`pos_weight > 1` increases recall, while `pos_weight < 1` increases precision.
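
To see why, it helps to write the weighted loss out for a single label. The sketch below is a plain-Python version of binary cross-entropy on logits with a positive-class weight, in the form documented for `torch.nn.BCEWithLogitsLoss`. The row count 7856 is inferred from the weights printed above (36.37 × 2 × 108 for Buildings); the helper names are illustrative.

```python
import math


def pos_weight_for(n_rows, n_positive):
    """Same formula as the cell above: rows / (2 * positives)."""
    return n_rows / (2 * n_positive)


def softplus(z):
    # numerically stable log(1 + exp(z))
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def weighted_bce_with_logits(logit, target, pos_weight=1.0):
    """-[pos_weight * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]"""
    return pos_weight * target * softplus(-logit) + (1 - target) * softplus(logit)


w_buildings = pos_weight_for(7856, 108)
missed_plain = weighted_bce_with_logits(-1.0, 1.0)                 # positive scored low
missed_weighted = weighted_bce_with_logits(-1.0, 1.0, w_buildings)
false_alarm = weighted_bce_with_logits(1.0, 0.0, w_buildings)      # negative scored high
print(w_buildings, missed_plain, missed_weighted, false_alarm)
```

Only the positive term is scaled, so a missed Buildings example costs about 36 times more than before while a false alarm costs the same. The model is therefore pushed towards predicting the rare label more often, which raises recall at the expense of precision.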

In [None]:
# we do not want to miss any sector (high recall), so we use the positive weights calculated above
pos_weights = list(positive_weights.values())

# the weights must be on the same device as the model, otherwise the loss computation raises an error
posweights = torch.tensor(pos_weights, dtype=torch.float, device=device)

In [None]:
# for class weights we need to use Custom Multi-Label Trainer
# In multi-label problem we will be using Binary Cross Entropy loss with 
# sigmoid layer on top rather than softmax.
class MultilabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = torch.nn.BCEWithLogitsLoss(pos_weight=posweights)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), 
                        labels.float().view(-1, self.model.config.num_labels))
        return (loss, outputs) if return_outputs else loss
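
The comment in the cell above mentions using a sigmoid rather than a softmax on top of the logits. The difference matters because a softmax forces the labels to compete for a total probability of 1, while a sigmoid scores each label independently. A small sketch with made-up logits for three labels:

```python
import math


def sigmoid_probs(logits):
    """Independent probability per label."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


def softmax_probs(logits):
    """Probabilities that compete and sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.0, 2.5, -4.0]  # toy values: two labels clearly present
sig = sigmoid_probs(logits)
soft = softmax_probs(logits)
labels_sigmoid = [p > 0.5 for p in sig]
labels_softmax = [p > 0.5 for p in soft]
print(labels_sigmoid, labels_softmax)
```

With the sigmoid both strong labels pass the 0.5 threshold. With the softmax only one can, which is why multi-label training pairs per-label sigmoids with binary cross-entropy.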


In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, 
            id2label = id2label,label2id = label2id,num_labels=num_labels,
            problem_type="multi_label_classification").to(device)
batch_size = 8
num_train_epochs = 8


args = TrainingArguments(
    "mpnet-multilabel-sector-classifier_8_cosine",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    logging_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=0.01,
    lr_scheduler_type = "linear",
    # push_to_hub=True,
    # fp16 = True,
    warmup_steps = 200,
)

multi_trainer =  MultilabelTrainer(
    model,
    args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer)

multi_trainer.evaluate()

In [None]:
multi_trainer.train()

In [None]:
predictions= multi_trainer.predict(val_tokenized)
pred,labels,_ = predictions
y_pred = torch.from_numpy(pred)
y_true = torch.from_numpy(labels)
y_prob = y_pred.sigmoid()
thresh = 0.5
y_pred = (y_prob>thresh).bool()


In [None]:
import sklearn.metrics as skm
import pandas as pd

cm = skm.multilabel_confusion_matrix(y_true, y_pred)
print(cm)
print( skm.classification_report(y_true,y_pred))
report = skm.classification_report(y_true, y_pred, output_dict=True)
df_report = pd.DataFrame(report).transpose()

## Upsampling (augmentation)

We use simple sentence shuffling, implemented as a custom transform on top of the albumentations library. In this example (sector_data), the sectors are grouped by how often they occur:

`set0 = ['Agriculture','Energy']`: count ~2000

`set1 = ['LULUCF/Forestry','Water','Environment']`: ~2000 > count > 1000

`set2 = ['Coastal Zone','Disaster Risk Management (DRM)','Economy-wide','Health','Social Development','Transport','Urban','Waste']`: count ~500

`set3 = ['Industries','Buildings']`: count < 500

We need this grouping so that we can upsample the less represented classes.
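
The idea behind sentence shuffling can be sketched without albumentations or NLTK. The version below splits on sentence-ending punctuation with a regular expression, which is a cruder splitter than the one used later in this notebook, and applies the shuffle with probability `p`. The function names and the example text are made up for illustration.

```python
import random
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def shuffle_sentences(text, p=0.5, rng=None):
    """With probability p, return the text with its sentences in random order."""
    rng = rng or random.Random()
    if rng.random() >= p:
        return text  # transform not applied, text is returned unchanged
    sentences = split_sentences(text)
    rng.shuffle(sentences)
    return ' '.join(sentences)


text = ("Solar capacity will double. Buildings will be retrofitted. "
        "Industry will cut emissions.")
augmented = shuffle_sentences(text, p=1.0, rng=random.Random(0))
unchanged = shuffle_sentences(text, p=0.0, rng=random.Random(0))
print(augmented)
```

The labels of a shuffled example stay the same as those of the original, since the sectors mentioned do not depend on sentence order. Texts with a single sentence cannot be changed this way, so they yield no new examples.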


In [None]:
df_train.columns

Index(['text', 'Document', 'sector_list', 'Document_name', 'country_count',
       'doc_count', 'sector_tuple', 'country', 'labels', 'Agriculture',
       'Buildings', 'Coastal Zone', 'Disaster Risk Management (DRM)',
       'Economy-wide', 'Energy', 'Environment', 'Health', 'Industries',
       'LULUCF/Forestry', 'Social Development', 'Transport', 'Urban', 'Waste',
       'Water', 'set0', 'set1', 'set2', 'set3', 'INDC', 'First NDC',
       'Second NDC', 'Revised First NDC', 'split'],
      dtype='object')

In [None]:
for i,sector in enumerate(sectors):
    # df_train[sector] = df_train.apply(lambda x: x['sector_label'][i], axis =1)
    print(i,".",sector, ":", sum(df_train[sector]))

0 . Agriculture : 1748
1 . Buildings : 108
2 . Coastal Zone : 396
3 . Disaster Risk Management (DRM) : 447
4 . Economy-wide : 447
5 . Energy : 1889
6 . Environment : 591
7 . Health : 421
8 . Industries : 270
9 . LULUCF/Forestry : 1135
10 . Social Development : 326
11 . Transport : 651
12 . Urban : 294
13 . Waste : 398
14 . Water : 965


In [None]:
# upsampling_sectors = ['Buildings', 'Coastal Zone', 'Disaster Risk Management (DRM)',
#        'Economy-wide', 'Health', 'Industries','Social Development', 'Urban', 'Waste']

# df_train['upsampling'] = df_train.apply(lambda x: np.sum([True if val in upsampling_sectors 
#                            else False  for val in x['sector_tuple']]), axis = 1)

In [None]:
train_examples  = len(df)
print('Set0 examples:', round(len(df[df.set0 >0])/train_examples,2))
print('Set1 examples:', round(len(df[df.set1 >0])/train_examples,2))
print('Set2 examples:', round(len(df[df.set2 >0])/train_examples,2))
print('Set3 examples:', round(len(df[df.set3 >0])/train_examples,2))

Set0 examples: 0.45
Set1 examples: 0.32
Set2 examples: 0.4
Set3 examples: 0.05


In [None]:
set1 = df_train[df_train.set1 > 0]
print(len(set1))
set2 = df_train[df_train.set2 > 0]
print(len(set2))
set3 = df_train[df_train.set3 > 0]
print(len(set3))
set2not01 = df_train[(df_train.set0 == 0) & (df_train.set1 == 0) & (df_train.set2 !=0)]
print(len(set2not01))

2545
3188
371
2153


In [None]:
import random
import re
import pandas as pd
from nltk import sent_tokenize
import nltk
nltk.download('punkt')
from tqdm import tqdm
from albumentations.core.transforms_interface import DualTransform, BasicTransform

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
class NLPTransform(BasicTransform):
    """ Transform for nlp task."""
    LANGS = {
        'en': 'english',
        'it': 'italian', 
        'fr': 'french', 
        'es': 'spanish',
        'tr': 'turkish', 
        'ru': 'russian',
        'pt': 'portuguese'
    }

    @property
    def targets(self):
        return {"data": self.apply}
    
    def update_params(self, params, **kwargs):
        if hasattr(self, "interpolation"):
            params["interpolation"] = self.interpolation
        if hasattr(self, "fill_value"):
            params["fill_value"] = self.fill_value
        return params

    def get_sentences(self, text, lang='en'):
        return sent_tokenize(text, self.LANGS.get(lang, 'english'))

class ShuffleSentencesTransform(NLPTransform):
    """ Do shuffle by sentence """
    def __init__(self, always_apply=False, p=0.5):
        super(ShuffleSentencesTransform, self).__init__(always_apply, p)

    def apply(self, data, **params):
        text, lang = data
        sentences = self.get_sentences(text, lang)
        random.shuffle(sentences)
        return ' '.join(sentences), lang

In [None]:
def augment_data(p_values, df):
    # p_values: list of floats; each p is the probability that the shuffle
    # transform is applied to a given row. Rows left unchanged are removed
    # further down as duplicates of the original text.
    df = df.copy()  # do not add columns to the caller's dataframe
    lang = 'en'
    placeholder = {}
    for p_val in p_values:
        transform = ShuffleSentencesTransform(p=p_val)
        col = 'augmented_text_{}'.format(p_val)
        df[col] = df.apply(lambda x: transform(data=(x['text'], lang))['data'][0], axis=1)
        placeholder[p_val] = df[[col, 'labels']].rename(columns={col: 'text'})

    augmented_data = pd.concat(list(placeholder.values()))
    print(len(augmented_data))

    # add the original rows and keep each distinct text once
    augmented_data = pd.concat([augmented_data, df[['text','labels']]])
    augmented_data = augmented_data.drop_duplicates(subset = ['text'])
    return augmented_data


In [None]:
set3_augmented  = augment_data([0.7,0.8,0.9,0.95,0.99,1.0],set3)
set3_augmented.info()

In [None]:
set2_augmented  = augment_data([0.8,1.0],set2not01)
set2_augmented.info()

In [None]:
df_train_augmented = pd.concat([df_train[['text','labels']],set2_augmented,set3_augmented])
df_train_augmented  = df_train_augmented.drop_duplicates(subset = ['text'])
print(len(df_train_augmented))
for i,sector in enumerate(sectors):
    df_train_augmented[sector] = df_train_augmented.apply(lambda x: x['labels'][i], axis =1)
    print(i,".",sector, ":", np.sum(df_train_augmented[sector]))

12122
0 . Agriculture : 1823
1 . Buildings : 476
2 . Coastal Zone : 663
3 . Disaster Risk Management (DRM) : 824
4 . Economy-wide : 834
5 . Energy : 2289
6 . Environment : 601
7 . Health : 813
8 . Industries : 1303
9 . LULUCF/Forestry : 1212
10 . Social Development : 530
11 . Transport : 1431
12 . Urban : 521
13 . Waste : 918
14 . Water : 966


In [None]:
train_ds = datasets.Dataset.from_pandas(df_train_augmented)
val_ds =  datasets.Dataset.from_pandas(df_val)
train_ds = train_ds.shuffle(seed=7)
val_ds = val_ds.shuffle(seed=7)

import datasets

cols = train_ds.column_names
cols.remove("labels")
print('Training data:',train_ds.num_rows)
print('Validation data:',val_ds.num_rows)

model_checkpoint = "sentence-transformers/all-mpnet-base-v2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint,problem_type="multi_label_classification")

train_tokenized = train_ds.map(tokenize_and_encode, batched=True, remove_columns= cols)
val_tokenized = val_ds.map(tokenize_and_encode, batched=True, remove_columns= cols)

Training data: 12122
Validation data: 606



In [None]:
# need this to avoid error due to type mismatch
train_tokenized.set_format("torch")
train_tokenized = (train_tokenized
          .map(lambda x : {"float_labels": x["labels"].to(torch.float)}, remove_columns=["labels"])
          .rename_column("float_labels", "labels"))

val_tokenized.set_format("torch")
val_tokenized = (val_tokenized
          .map(lambda x : {"float_labels": x["labels"].to(torch.float)}, remove_columns=["labels"])
          .rename_column("float_labels", "labels"))


In [None]:
for i,sector in enumerate(sectors):
    # df_train[sector] = df_train.apply(lambda x: x['sector_label'][i], axis =1)
    print(i,".",sector, ":", sum(df_train_augmented[sector]))

positive_weights = {}
negative_weights = {}
for sector in sectors:
    positive_weights[sector] = df_train_augmented.shape[0]/(2*np.count_nonzero(df_train_augmented[sector]==1))
    negative_weights[sector] = df_train_augmented.shape[0]/(2*np.count_nonzero(df_train_augmented[sector]==0))
print(positive_weights)
print(negative_weights)

0 . Agriculture : 1834
1 . Buildings : 512
2 . Coastal Zone : 660
3 . Disaster Risk Management (DRM) : 820
4 . Economy-wide : 842
5 . Energy : 2327
6 . Environment : 599
7 . Health : 825
8 . Industries : 1429
9 . LULUCF/Forestry : 1222
10 . Social Development : 525
11 . Transport : 1435
12 . Urban : 533
13 . Waste : 934
14 . Water : 966
{'Agriculture': 3.351145038167939, 'Buildings': 12.00390625, 'Coastal Zone': 9.312121212121212, 'Disaster Risk Management (DRM)': 7.495121951219512, 'Economy-wide': 7.2992874109263655, 'Energy': 2.6411688869789427, 'Environment': 10.26043405676127, 'Health': 7.449696969696969, 'Industries': 4.3009097270818755, 'LULUCF/Forestry': 5.029459901800327, 'Social Development': 11.706666666666667, 'Transport': 4.282926829268293, 'Urban': 11.53095684803002, 'Waste': 6.580299785867238, 'Water': 6.36231884057971}
{'Agriculture': 0.5876840696117804, 'Buildings': 0.5217317487266554, 'Coastal Zone': 0.5283700137551581, 'Disaster Risk Management (DRM)': 0.5357391910739

In [None]:
pos_weights = list(positive_weights.values())
posweights = torch.FloatTensor(pos_weights).to('cuda')

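# Custom Trainer with a multi-label BCE loss. To apply the class weights,
# pass pos_weight=posweights to BCEWithLogitsLoss; the loss below is unweighted.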
class MultilabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = torch.nn.BCEWithLogitsLoss()
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), 
                        labels.float().view(-1, self.model.config.num_labels))
        return (loss, outputs) if return_outputs else loss

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, 
            id2label = id2label,label2id = label2id,num_labels=num_labels,
            problem_type="multi_label_classification").to('cuda')
batch_size = 8
num_train_epochs = 5


args = TrainingArguments(
    "mpnet-multilabel-sector-classifier_8_linear",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    logging_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=0.1,
    lr_scheduler_type = "linear",
    # push_to_hub=True,
    # fp16 = True,
    warmup_steps = 200,
)

multi_trainer =  MultilabelTrainer(
    model,
    args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer)

multi_trainer.evaluate()
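
`torch.nn.BCEWithLogitsLoss` accepts a `pos_weight` tensor with one entry per label, which scales only the positive term of the loss. The NumPy sketch below mirrors that formula on made-up logits to show the effect of a per-label weight; it is an illustration under these assumptions, not the notebook's training code.

```python
import numpy as np

def weighted_bce_with_logits(logits, targets, pos_weight):
    """Mean of -[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]."""
    log_sig = -np.logaddexp(0.0, -logits)        # log(sigmoid(x)), numerically stable
    log_one_minus = -np.logaddexp(0.0, logits)   # log(1 - sigmoid(x))
    loss = -(pos_weight * targets * log_sig + (1.0 - targets) * log_one_minus)
    return loss.mean()

# Two examples, two labels, all logits at 0 (probability 0.5 everywhere).
logits = np.zeros((2, 2))
targets = np.array([[1.0, 0.0],
                    [0.0, 1.0]])

unweighted = weighted_bce_with_logits(logits, targets, np.array([1.0, 1.0]))
weighted = weighted_bce_with_logits(logits, targets, np.array([1.0, 3.0]))
print(unweighted, weighted)  # the positive of the second label now counts 3x
```

In the notebook, the equivalent change would be to build the loss as `torch.nn.BCEWithLogitsLoss(pos_weight=posweights)` inside `compute_loss`.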


In [None]:
multi_trainer.train()

In [None]:
predictions= multi_trainer.predict(val_tokenized)
pred,labels,_ = predictions
y_pred = torch.from_numpy(pred)
y_true = torch.from_numpy(labels)
y_prob = y_pred.sigmoid()
thresh = 0.5
y_pred = (y_prob>thresh).bool()
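
The cell above turns logits into per-label decisions: a sigmoid per label, then a threshold. A small NumPy sketch with made-up logits shows that every label is decided independently, and that a threshold of 0.5 on the probability is the same as checking the sign of the logit:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up logits for 2 examples x 3 labels.
toy_logits = np.array([[2.0, -1.0, 0.3],
                       [-0.5, 0.0, 4.0]])

toy_probs = sigmoid(toy_logits)
toy_preds = toy_probs > 0.5   # an example can get zero, one or several labels

# A logit of exactly 0 gives probability 0.5, which is not above the threshold.
print(toy_preds.astype(int))
print(np.array_equal(toy_preds, toy_logits > 0))
```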

In [None]:
import sklearn.metrics as skm
import pandas as pd

cm = skm.multilabel_confusion_matrix(y_true, y_pred)
print(cm)
print( skm.classification_report(y_true,y_pred))
report = skm.classification_report(y_true, y_pred, output_dict=True)
df_report = pd.DataFrame(report).transpose()
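
`multilabel_confusion_matrix` returns one 2x2 matrix per label, laid out as `[[TN, FP], [FN, TP]]`. A toy example (labels and predictions are made up) makes the layout easier to read than the full 15-sector output:

```python
import numpy as np
import sklearn.metrics as skm

# 4 examples, 2 labels.
toy_true = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
toy_pred = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])

toy_cm = skm.multilabel_confusion_matrix(toy_true, toy_pred)
print(toy_cm)

# Per-label precision and recall can be read off each 2x2 block.
tn, fp, fn, tp = toy_cm[0].ravel()
precision_label0 = tp / (tp + fp)
recall_label0 = tp / (tp + fn)
print(precision_label0, recall_label0)
```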

# Data Augmentation



* https://neptune.ai/blog/data-augmentation-nlp
* https://towardsdatascience.com/nlp-data-augmentation-using-transformers-89a44a993bab
* https://medium.com/the-owl/imbalanced-multilabel-image-classification-using-keras-fbd8c60d7a4b
* https://github.com/google-research/uda




### Albumentations

*   https://www.kaggle.com/code/shonenkov/nlp-albumentations/notebook
*   https://github.com/albumentations-team/albumentations#installation



In [None]:
!pip install -U albumentations

In [None]:
import random
import re
import pandas as pd
from nltk import sent_tokenize
import nltk
nltk.download('punkt')
from tqdm import tqdm
from albumentations.core.transforms_interface import DualTransform, BasicTransform

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
class NLPTransform(BasicTransform):
    """ Transform for nlp task."""
    LANGS = {
        'en': 'english',
        'it': 'italian', 
        'fr': 'french', 
        'es': 'spanish',
        'tr': 'turkish', 
        'ru': 'russian',
        'pt': 'portuguese'
    }

    @property
    def targets(self):
        return {"data": self.apply}
    
    def update_params(self, params, **kwargs):
        if hasattr(self, "interpolation"):
            params["interpolation"] = self.interpolation
        if hasattr(self, "fill_value"):
            params["fill_value"] = self.fill_value
        return params

    def get_sentences(self, text, lang='en'):
        return sent_tokenize(text, self.LANGS.get(lang, 'english'))

In [None]:
class ShuffleSentencesTransform(NLPTransform):
    """ Do shuffle by sentence """
    def __init__(self, always_apply=False, p=0.5):
        super(ShuffleSentencesTransform, self).__init__(always_apply, p)

    def apply(self, data, **params):
        text, lang = data
        sentences = self.get_sentences(text, lang)
        random.shuffle(sentences)
        return ' '.join(sentences), lang
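
The same idea can be sketched without albumentations or nltk. The function below is a stand-in for illustration only: it splits sentences with a naive regular expression (an assumption; `sent_tokenize` handles abbreviations and other edge cases better) and shuffles them with probability `p`.

```python
import random
import re

SENTENCE_BOUNDARY = r"(?<=[.!?])\s+"  # whitespace that follows ., ! or ?

def shuffle_sentences(text, p=0.5, rng=random):
    """With probability p, return the text with its sentences in random order."""
    if rng.random() >= p:
        return text
    sentences = re.split(SENTENCE_BOUNDARY, text.strip())
    rng.shuffle(sentences)
    return " ".join(sentences)

rng = random.Random(0)  # seeded so the run is reproducible
text = "Energy targets are set. Transport is covered. Waste is not."
augmented = shuffle_sentences(text, p=1.0, rng=rng)
untouched = shuffle_sentences(text, p=0.0, rng=rng)
print(augmented)
```

Shuffling keeps every sentence and therefore every label cue, which is why it is a cheap augmentation for multi-sentence paragraphs; it does nothing for single-sentence inputs.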

In [None]:
set1notset0 = sector_data[(sector_data.set0 == 0) & (sector_data.set1 == 1)]
set1notset0.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2947 entries, 3 to 10309
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   context        2947 non-null   object
 1   sector_list    2947 non-null   object
 2   Document_name  2947 non-null   object
 3   Countries      2947 non-null   object
 4   country_count  2947 non-null   int64 
 5   doc_count      2947 non-null   int64 
 6   sector_tuple   2947 non-null   object
 7   set0           2947 non-null   int64 
 8   set1           2947 non-null   int64 
 9   set2           2947 non-null   int64 
 10  set3           2947 non-null   int64 
 11  set4           2947 non-null   int64 
dtypes: int64(7), object(5)
memory usage: 299.3+ KB


In [None]:
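transform = ShuffleSentencesTransform(p=0.3)
lang = 'en'
# Work on a copy so the new column is not assigned to a slice of sector_data.
set1notset0 = set1notset0.copy()
set1notset0['augmented_data'] = set1notset0.apply(
    lambda x: transform(data=(x['context'], lang))['data'][0], axis=1)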

### NLP Aug


*   https://github.com/makcedward/nlpaug/blob/master/example/textual_augmenter.ipynb
*   https://github.com/makcedward/nlpaug



In [None]:
! pip install numpy requests nlpaug
! pip install "torch>=1.6.0" "transformers>=4.11.3" sentencepiece

In [None]:
import os
os.environ["model_dir"] = '../model'

import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

from nlpaug.util import Action

In [None]:
text = 'The quick brown fox jumps over the lazy dog .'
print(text)

The quick brown fox jumps over the lazy dog .


In [None]:
aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased', action="insert")
augmented_text = aug.augment(text)

In [None]:
set2not01 = sector_data[(sector_data.set0 == 0) & (sector_data.set1 == 0) & (sector_data.set2 !=0)]
set2not01.info()

In [None]:
list_of_sentences = list(set2not01.context)

# list_of_aug_text = []

In [None]:
# Note: aug2 (action="substitute") is defined in a later cell; run that cell first.
aug2.augment(list_of_sentences[0])

['" 1 ( con ) : improved environmental standards for vehicles : limitation of emissions of certain polluting gases from vehicle emissions. after 2023, that standard obliges manufacturers shall produce cleaner cars, while respecting, in particular, the emission rates of fine particles and nitrogen oxides... thus, from january 2023, all vehicles new individuals and utility vehicles ( of categories m and n ) placed on the moroccan market must comply with the euro 6 standard. ; 2 ( con ) : bonus - malus products : the bonus - malus system aims to encourage the choice of a vehicle for low co2 emissions and to penalize the purchase of the most polluting models. ; 3 ( con ) : eco driving : the adoption of good eco - driving practices aiming to reduce fuel consumption bills and vehicle maintenance costs, pollute the environment less and help improve road design. ; 4 ( con ) : application of performance standards in co2 emissions for new passenger cars and for new light commercial vehicles : th

In [None]:
aug.augment(list_of_sentences[0])

['" 1 ( con ) : improved environmental standards for vehicles : limitation of emissions limits of certain polluting gases from vehicle emissions. from 2023, the standard obliges manufacturers to produce cleaner cars, while respecting, as in particular, the emission rates of fine particles and nitrogen oxides... on thus, from january 2023, all vehicles new individuals and utility vehicles ( of categories m and n ) placed on the moroccan market must comply with the euro 6 standard. ; 2 ( con ) : bonus - malus system : the bonus - malus system aims to encourage the choice of a vehicle with low co2 dust emissions and to penalize the purchase of the most polluting models. ; 3 ( con ) : successful eco driving : the adoption of good eco - driving best practices aims to reduce fuel consumption bills and vehicle maintenance costs, pollute the environment emissions less and help improve road safety. ; 4 ( con ) : application of performance standards in co2 emissions for new road passenger cars a

In [None]:
aug2 = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased', action="substitute")

# Using K-Fold
https://huggingface.co/docs/datasets/loading#slice-splits



In [None]:
sector_dir = '/content/drive/MyDrive/Colab Notebooks/giz/policyData/sector_data/'
import datasets
from datasets import load_dataset

# creating data slices for K-fold
val_ds  = datasets.load_dataset("json", data_files = {"train": sector_dir + 'train_val.json'},
                              split=[f"train[{k}%:{k+20}%]" for k in range(0, 100, 20)])

train_ds  = datasets.load_dataset("json",data_files = {"train":sector_dir + 'train_val.json'},
                              split=[f"train[:{k}%]+train[{k+20}%:]" for k in range(0, 100, 20)])
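
The two list comprehensions above only build split strings; printing them shows which slice of `train_val.json` each fold validates on and which slices it trains on. Plain Python, no data needed:

```python
val_splits = [f"train[{k}%:{k+20}%]" for k in range(0, 100, 20)]
train_splits = [f"train[:{k}%]+train[{k+20}%:]" for k in range(0, 100, 20)]

for fold, (v, t) in enumerate(zip(val_splits, train_splits)):
    print(f"fold {fold}: val={v:<16} train={t}")
```

Each fold holds out a different 20% window and trains on the remaining 80%, so `load_dataset` returns a list of five datasets for each of `val_ds` and `train_ds`.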

In [None]:
splits = len(train_ds)
print(splits)
model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_and_encode(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True,
                        max_length=384)

In [None]:
print(train_ds[0].column_names)

['context', 'sector_list', 'Document_name', 'country_count', 'doc_count', 'sector_tuple', 'country', 'sector_label', 'Agriculture', 'Buildings', 'Coastal Zone', 'Cross-Cutting Area', 'Disaster Risk Management (DRM)', 'Economy-wide', 'Energy', 'Environment', 'Health', 'Industries', 'LULUCF/Forestry', 'Social Development', 'Transport', 'Urban', 'Waste', 'Water', 'set0', 'set1', 'set2', 'set3', 'INDC', 'First NDC', 'Second NDC', 'Revised First NDC']


https://huggingface.co/docs/transformers/main_classes/trainer#trainer

In [None]:
class MultilabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = torch.nn.BCEWithLogitsLoss()
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), 
                        labels.float().view(-1, self.model.config.num_labels))
        return (loss, outputs) if return_outputs else loss


def get_predictions(y_pred, y_true, thresh=0.5, sigmoid=True): 
    y_pred = torch.from_numpy(y_pred)
    y_true = torch.from_numpy(y_true)
    if sigmoid:
        y_pred = y_pred.sigmoid()
    # binarize the scores; each label is decided independently
    y_pred = (y_pred > thresh)
    # the report must be computed here, df_report below depends on it
    report = skm.classification_report(y_true, y_pred, output_dict=True)
    df_report = pd.DataFrame(report).transpose()
    return {"Precision_micro": df_report.loc['micro avg']['precision'],
            "Precision_weighted": df_report.loc['weighted avg']['precision'],
            "Precision_samples": df_report.loc['samples avg']['precision'],
            "Recall_micro": df_report.loc['micro avg']['recall'],
            "Recall_weighted": df_report.loc['weighted avg']['recall'],
            "Recall_samples": df_report.loc['samples avg']['recall'],
            "accuracy": skm.accuracy_score(y_true, y_pred)}

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return get_predictions(predictions, labels)

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, 
                              num_labels=num_labels,id2label = id2label,
                              label2id = label2id).to('cuda')
batch_size = 16
#  Each fold will be trained for 1 epoch
num_train_epochs = 1


# Keep separate args for the first fold: they differ from the other folds' args in
# warmup_ratio, which matters in the start phase of training.
# lr_scheduler_type is 'cosine' rather than the default 'linear'.
first_epoch_args = TrainingArguments(
    "bert-multilabel-sector-classifier",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    logging_strategy="epoch",
    lr_scheduler_type = 'cosine',
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=0.01,
    push_to_hub=True,
    # fp16 = True,
    warmup_ratio = 0.5,
)

args = TrainingArguments(
    "bert-multilabel-sector-classifier",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    logging_strategy="epoch",
    lr_scheduler_type = 'cosine',
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=0.01,
    push_to_hub=True,
    # fp16 = True,
    # warmup_ratio = 0.01,
)
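
To see what `lr_scheduler_type = 'cosine'` combined with `warmup_ratio = 0.5` does to the learning rate, here is a pure-Python sketch of the usual schedule shape: linear warmup followed by a half-cosine decay to zero. The step count is hypothetical and the function is an illustration of the shape, not the library's own code.

```python
import math

def cosine_with_warmup(step, total_steps, warmup_steps):
    """Learning-rate multiplier: linear warmup, then half-cosine decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

total_steps = 100                      # hypothetical number of optimizer steps in one fold
warmup_steps = int(0.5 * total_steps)  # warmup_ratio = 0.5, as in first_epoch_args
base_lr = 5e-5

for step in (0, 25, 50, 75, 100):
    lr = base_lr * cosine_with_warmup(step, total_steps, warmup_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

Because a new `Trainer` is created for every fold, each fold builds its own optimizer and scheduler, so the schedule restarts at the beginning of each fold.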

In [None]:
from tqdm.autonotebook import tqdm
log_results = []
for split in tqdm(range(splits)):
    train_split = train_ds[split]
    val_split = val_ds[split]
    train_split  = train_split.rename_column("sector_label", "labels")
    val_split = val_split.rename_column("sector_label", "labels")
    train_split  = train_split.rename_column("context", "text")
    val_split = val_split.rename_column("context", "text")
    cols = val_split.column_names
    cols.remove("labels")  

    train_tokenized = train_split.map(tokenize_and_encode, batched=True, remove_columns= cols)
    val_tokenized = val_split.map(tokenize_and_encode, batched=True, remove_columns= cols)

    # only the first fold uses the warmup-heavy arguments
    fold_args = first_epoch_args if split == 0 else args
    multi_trainer = MultilabelTrainer(
        model,
        args=fold_args,
        train_dataset=train_tokenized,
        eval_dataset=val_tokenized,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer)
    multi_trainer.train()
    # save the logs from each fold.
    log_results.append([multi_trainer.state.log_history])

In [None]:
from tqdm.autonotebook import tqdm
for split in tqdm(range(splits)):
    train_split = train_ds[split]
    val_split = val_ds[split]
    train_split  = train_split.rename_column("sector_label", "labels")
    val_split = val_split.rename_column("sector_label", "labels")
    train_split  = train_split.rename_column("context", "text")
    val_split = val_split.rename_column("context", "text")
    cols = val_split.column_names
    cols.remove("labels")  

    train_tokenized = train_split.map(tokenize_and_encode, batched=True, remove_columns= cols)
    val_tokenized = val_split.map(tokenize_and_encode, batched=True, remove_columns= cols)

    # if split  == 0:
    #     multi_trainer =  MultilabelTrainer(
    #         model,
    #         args = first_epoch_args,
    #         train_dataset=train_tokenized,
    #         eval_dataset=val_tokenized,
    #         compute_metrics=compute_metrics,
    #         tokenizer=tokenizer)
    # else:
    multi_trainer =  MultilabelTrainer(
        model,
        args = args,
        train_dataset=train_tokenized,
        eval_dataset=val_tokenized,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer)
    multi_trainer.train()
    # save the logs from each fold.
    log_results.append([multi_trainer.state.log_history])

In [None]:
multi_trainer.push_to_hub()

In [None]:
log_results
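
`log_results` holds one entry per fold, each wrapping that fold's `log_history` (a list of dicts) in an extra list. A small helper can pull out just the evaluation rows. The toy values below are made up; the `eval_` prefix is what `Trainer` adds to the keys returned by `compute_metrics`.

```python
# Toy stand-in for log_results: one entry per fold, each wrapping a log_history list.
toy_log_results = [
    [[{"loss": 0.42, "epoch": 1.0},
      {"eval_loss": 0.31, "eval_accuracy": 0.40, "epoch": 1.0}]],
    [[{"loss": 0.28, "epoch": 1.0},
      {"eval_loss": 0.22, "eval_accuracy": 0.55, "epoch": 1.0}]],
]

def eval_rows(log_results):
    """Collect the evaluation entries of every fold into flat rows."""
    rows = []
    for fold, (history,) in enumerate(log_results):  # unwrap the extra list
        for entry in history:
            if "eval_loss" in entry:
                rows.append({"fold": fold, **entry})
    return rows

rows = eval_rows(toy_log_results)
for row in rows:
    print(row)
```

The resulting rows can be passed to `pd.DataFrame(rows)` for a per-fold comparison table.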

# Prediction
https://discuss.huggingface.co/t/i-have-trained-my-classifier-now-how-do-i-do-predictions/3625/2

## Predictions using Trainer

In [None]:
test_ds = load_dataset("json", data_files={"test": sector_dir + "test.json"})
test_ds  = test_ds['test']
test_ds  = test_ds.rename_column("sector_label", "labels")
test_ds  = test_ds.rename_column("context", "text")
cols = test_ds.column_names
cols.remove("labels")  
test_tokenized = test_ds.map(tokenize_and_encode, batched=True, remove_columns= cols)

In [None]:
test_tokenized.set_format("torch")
test_tokenized = (test_tokenized
          .map(lambda x : {"float_labels": x["labels"].to(torch.float)}, remove_columns=["labels"])
          .rename_column("float_labels", "labels"))
multi_trainer.evaluate(test_tokenized)

In [None]:
multi_trainer.state.log_history

In [None]:
predictions= multi_trainer.predict(val_tokenized)
pred,labels,_ = predictions
y_pred = torch.from_numpy(pred)
y_true = torch.from_numpy(labels)
y_prob = y_pred.sigmoid()
thresh = 0.5
y_pred = (y_prob>thresh).bool()

In [None]:
y_pred = y_pred.tolist()
df_val['pred'] = y_pred
df_val['prob'] = list(np.around(np.array(y_prob.tolist()),3))
df_val['pred_sectors'] = df_val.apply(lambda x: list(np.array(sectors)[x['pred']]), axis =1)
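
The `pred_sectors` column is built by using each row's boolean predictions as a mask over the sector names. A toy version with a shortened, illustrative sector list:

```python
import numpy as np

sector_names = np.array(["Agriculture", "Energy", "Transport"])  # shortened for illustration
mask_rows = [[True, False, True],
             [False, False, False]]

# A list of Python bools is converted to a boolean mask before indexing.
pred_sectors = [sector_names[np.array(row, dtype=bool)].tolist() for row in mask_rows]
print(pred_sectors)  # [['Agriculture', 'Transport'], []]
```

A row with no score above the threshold yields an empty list, which is worth checking for when the predictions are exported.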

In [None]:
df_val

In [None]:
# multi_trainer.push_to_hub()

In [None]:
jsonfile = df_val.to_json(orient="records")
import json
parsed = json.loads(jsonfile)
with open('/content/drive/MyDrive/Colab Notebooks/giz/policyData/climatewatch_ndc/CW_sectorClassification_val_1.json', 'w') as file:
    json.dump(parsed, file, indent=4)

df_val.to_excel('/content/drive/MyDrive/Colab Notebooks/giz/policyData/climatewatch_ndc/CW_sectorClassification_val_1.xlsx')    

In [None]:
import sklearn.metrics as skm
import pandas as pd

cm = skm.multilabel_confusion_matrix(y_true, y_pred)
print(cm)
print( skm.classification_report(y_true,y_pred))
report = skm.classification_report(y_true, y_pred, output_dict=True)
df_report = pd.DataFrame(report).transpose()

In [None]:
df_report['Sector'] = sectors + [None, None, None, None]
df_report = df_report[['Sector','precision','recall','f1-score','support']]
df_report

Unnamed: 0,Sector,precision,recall,f1-score,support
0,Agriculture,0.738255,0.714286,0.726073,154.0
1,Buildings,0.0,0.0,0.0,7.0
2,Coastal Zone,0.651163,0.636364,0.643678,44.0
3,Cross-Cutting Area,0.617886,0.490323,0.546763,155.0
4,Disaster Risk Management (DRM),0.666667,0.588235,0.625,51.0
5,Economy-wide,0.333333,0.02381,0.044444,42.0
6,Energy,0.913265,0.856459,0.883951,209.0
7,Environment,0.568182,0.403226,0.471698,62.0
8,Health,0.833333,0.851064,0.842105,47.0
9,Industries,0.882353,0.652174,0.75,23.0


In [None]:
df_report.to_excel('/content/drive/MyDrive/Colab Notebooks/giz/policyData/climatewatch_ndc/CW_sectorClassification_report_27_03.xlsx')    

## Prediction on Test (using Pipeline)

In [None]:
test_ds = load_dataset("json", data_files={"test": sector_dir + "test.json"})

In [None]:
from transformers import pipeline
model_checkpoint = "/content/bert-multilabel-sector-classifier_8_cosine_restart"
pipe = pipeline("text-classification", model=model_checkpoint, return_all_scores=True)

In [None]:
predictions = pipe(list(test_ds['context']))

In [None]:
pred = []
for prediction in predictions:
    n_classes  = len(prediction)
    placeholder = []
    for i in range(n_classes):
        placeholder.append(prediction[i]['score'])
    pred.append(placeholder)
pred = np.array(pred)
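
The nested loop above flattens the pipeline output into a score matrix. The same step can be written as a nested comprehension. The sketch uses a made-up stand-in for `predictions` and assumes every per-text list reports the labels in the same order:

```python
import numpy as np

# Made-up stand-in for the pipeline output: one list of {label, score} dicts per text.
toy_predictions = [
    [{"label": "Agriculture", "score": 0.91}, {"label": "Energy", "score": 0.08}],
    [{"label": "Agriculture", "score": 0.12}, {"label": "Energy", "score": 0.77}],
]

scores = np.array([[d["score"] for d in per_text] for per_text in toy_predictions])
label_order = [d["label"] for d in toy_predictions[0]]

print(label_order)
print(scores.shape)  # (number of texts, number of labels)
```

Keeping `label_order` next to the matrix makes it possible to check that the columns line up with `sector_label` before computing any metrics.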

In [None]:
labels = np.array(list(test_ds['sector_label']))

In [None]:
print(pred.shape)
print(labels.shape)

(1039, 16)
(1039, 16)


In [None]:
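y_pred = torch.from_numpy(pred)
y_true = torch.from_numpy(labels)
# Note: by default the pipeline already applies sigmoid/softmax, so `pred` holds
# probabilities. Applying sigmoid() again maps them into [0.5, 0.73], so
# thresh = 0.51 here corresponds to a pipeline score of about 0.04.
y_prob = y_pred.sigmoid()
thresh = 0.51
y_pred = (y_prob>thresh).bool()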

In [None]:
report = skm.classification_report(y_true, y_pred, output_dict=True)
df_report = pd.DataFrame(report).transpose()

In [None]:
df_report

Unnamed: 0,precision,recall,f1-score,support
0,0.734266,0.650155,0.689655,323.0
1,0.428571,0.333333,0.375,9.0
2,0.530303,0.402299,0.457516,87.0
3,0.423077,0.277978,0.335512,277.0
4,0.547368,0.412698,0.470588,126.0
5,0.336735,0.3,0.317308,110.0
6,0.674242,0.542683,0.601351,164.0
7,0.668831,0.42562,0.520202,242.0
8,0.797753,0.617391,0.696078,115.0
9,0.409091,0.75,0.529412,12.0


# Codecarbon
*  https://huggingface.co/docs/hub/model-cards-co2
*  https://mlco2.github.io/codecarbon/usage.html#

In [None]:
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()
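tracker.start()
# ... run the training or inference to be measured here ...
emissions = tracker.stop()  # estimated emissions in kg CO2-eq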

In [None]:
# pred,labels,_ = predictions
# y_pred = torch.from_numpy(pred)
# y_true = torch.from_numpy(labels)
# y_pred = y_pred.sigmoid()
# thresh = 0.5
# y_pred = (y_pred>thresh).bool()

# y_pred = y_pred.tolist()
# df_val['pred'] = y_pred
# df_val['pred_sectors'] = df_val.apply(lambda x: list(np.array(sectors)[x['pred']]), axis =1)

In [None]:
# df = pd.read_json(sector_dir + "train_val_1.json")
# df = df.rename(columns={'sector_label': 'labels', 'context': 'text'})
# # dataset = load_dataset("json", data_files={"train": sector_dir + "train_val_1.json"})
# dataset = dataset.shuffle(seed=7)
# dataset = dataset['train'].rename_column("sector_label", "labels")
# dataset = dataset.rename_column("context", "text")
# dataset = dataset.train_test_split(test_size =0.10)