<a href="https://colab.research.google.com/github/namiyousef/argument-mining/blob/develop/examples/OpenModelExperimentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Open Model Experimentation

This notebook allows you to access testing data used for the project, as well as trained models. The original training data is not hosted here.

The document will first take you through steps to set-up the notebook so that you can access the data.

## Colab Set up

Firstly, please access the following [link](https://drive.google.com/drive/folders/1LzSEc25qZHSB5snig1Ro_pgZOnmZ9p2W?usp=sharing) to get access to the drive. This contains a folder named test/ that includes adversarial examples and a folder named tmpdir/ that includes the trained models. You can look into these in greater detail, however the names are not user friendly. These models are identical to those pushed on HuggingFace under https://huggingface.co/ucabqfe.

Once you have done that, please go on your drive under "Shared with me", right click the Desktop/ folder and select "Add shortcut to drive". This will allow you to access the data from within Colab. You can find more information on this here: https://github.com/namiyousef/argument-mining/issues/38. You can ignore the aspects on Authentication as the repository has now made public and published as a package.

Alternatively if that fails, you can download and extract the .zip files.

In [4]:
# %%capture
!pip install colab-dev-tools
!pip install gdown

import gdown
import os
from colabtools.utils import get_gpu_utilization

model_url = "https://drive.google.com/uc?id={}".format("1LWBsEBkskxCvW7OxqWsBuT-Lh93jl0S5")
output_models = "models.zip"


adversarial_url = "https://drive.google.com/uc?id={}".format("19WoXOsgE_eQJEW1NQT4cRQaCAL-tHzce")
output_adversarial = "test.zip"

gdown.download(model_url, output_models, quiet=False)
gdown.download(adversarial_url, output_adversarial, quiet=False)



Downloading...
From: https://drive.google.com/uc?id=1LWBsEBkskxCvW7OxqWsBuT-Lh93jl0S5
To: /content/models.zip
100%|██████████| 5.42G/5.42G [00:29<00:00, 183MB/s]
Downloading...
From: https://drive.google.com/uc?id=19WoXOsgE_eQJEW1NQT4cRQaCAL-tHzce
To: /content/test.zip
100%|██████████| 72.1M/72.1M [00:00<00:00, 220MB/s]


'test.zip'

This takes a while: unzipping models and data folders

In [5]:
!unzip models.zip
!unzip test.zip

Archive:  models.zip
   creating: content/drive/My Drive/Desktop/tmpdir/job/
  inflating: content/drive/My Drive/Desktop/tmpdir/job/.DS_Store  
   creating: content/drive/My Drive/Desktop/tmpdir/job/820986.undefined/
 extracting: content/drive/My Drive/Desktop/tmpdir/job/820986.undefined/.smpd  
  inflating: content/drive/My Drive/Desktop/tmpdir/job/820986.undefined/scores.json  
  inflating: content/drive/My Drive/Desktop/tmpdir/job/820986.undefined/training_scores.json  
   creating: content/drive/My Drive/Desktop/tmpdir/job/820986.undefined/models/
   creating: content/drive/My Drive/Desktop/tmpdir/job/820986.undefined/models/Z29vZ2xlL2JpZ2JpcmQtcm9iZXJ0YS1iYXNlX2ZpbmFs/
  inflating: content/drive/My Drive/Desktop/tmpdir/job/820986.undefined/models/Z29vZ2xlL2JpZ2JpcmQtcm9iZXJ0YS1iYXNlX2ZpbmFs/config.json  
  inflating: content/drive/My Drive/Desktop/tmpdir/job/820986.undefined/models/Z29vZ2xlL2JpZ2JpcmQtcm9iZXJ0YS1iYXNlX2ZpbmFs/pytorch_model.bin  
   creating: content/drive/My Drive

In [6]:
# set paths
PATH_TO_MODELS = "content/drive/MyDrive/Desktop/tmpdir/job"
PATH_TO_ADVERSARIAL = "test"

In [7]:
%%capture
!pip install argminer

# -- public imports

import gc
from transformers import AutoTokenizer, AutoModelForTokenClassification
import pandas as pd
from torch.utils.data import DataLoader
import torch
from pandas.testing import assert_frame_equal
import time

# -- private imports
from colabtools.utils import move_to_device
from colabtools.config import DEVICE

import argminer
from argminer.data import ArgumentMiningDataset, TUDarmstadtProcessor, PersuadeProcessor
from argminer.evaluation import inference, _get_scores_agg
from argminer.utils import encode_model_name
from argminer.config import LABELS_MAP_DICT, MODEL_MAP_DICT

# Testing on Adversarial Examples

In [8]:
BATCH_SIZE = 64
MODEL_NAME = 'ucabqfe/roberta_AAE_io' #   select huggingface model name (this is one of our models, see ucabqfe/ models on HuggingFace)
# alternatively, you should be able to specify to a path on the trained models. You can decode the names of the models using argminer.utils.decode_model_name()

model_metadata = MODEL_MAP_DICT[MODEL_NAME]
dataset = model_metadata['dataset']
max_length = model_metadata['max_length']
tokenizer_name = model_metadata['hugging_face_model_name']
strategy = MODEL_NAME.split('_')[-1]
print(f'Running model: {MODEL_NAME}:')
print('=======================================================================')
PATH_TO_ATTACK = os.path.join(PATH_TO_ADVERSARIAL, strategy)
for adversarial_attack in os.listdir(PATH_TO_ATTACK):
  model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)
  tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, add_prefix_space=True)

  Processor = getattr(argminer.data, f'{dataset}Processor')
  processor = Processor(os.path.join(PATH_TO_ATTACK, adversarial_attack)).from_json(status='postprocessed')
  df_text = processor.dataframe[['text', 'labels']]
  df_label_map = LABELS_MAP_DICT[dataset][strategy]
  dataset = ArgumentMiningDataset(df_label_map, df_text, tokenizer, max_length, f'standard_{strategy}', is_train=False)
  dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False)

  df_metrics, df_scores = inference(model, dataloader)
  metrics, df_scores_agg = _get_scores_agg(df_scores)
  print(f'Completed inference on adversarial attack: {adversarial_attack}. Macro Metrics: {metrics}\n')
  print(df_scores_agg.to_string(), '\n\n')


Running model: ucabqfe/roberta_AAE_io:


Downloading:   0%|          | 0.00/929 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/473M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

Prediction time: 0.223
Agg to word time: 8.56
Get predstring time: 0.47
Evaluate time: 1.74
Batch 1 complete.
Prediction time: 0.0153
Agg to word time: 2.17
Get predstring time: 0.105
Evaluate time: 0.359
Batch 2 complete.
Completed inference on adversarial attack: antonym. Macro Metrics: {'macro_f1': 0.7313651311178377, 'macro_recall': 0.7625363315175516, 'macro_precision': 0.7031963145053108}

        tp   fn   fp        f1    recall  precision
class                                              
0      949  388  415  0.702703  0.709798   0.695748
1      123   28   41  0.780952  0.814570   0.750000
2      196  108  154  0.599388  0.644737   0.560000
3      711   96  170  0.842417  0.881041   0.807037 




AttributeError: ignored

# Training

If you have access to the base data, you can train models with the script below.




In [None]:
# constants (these will be abstracted away by inputs that you give to run)
PATH_TO_DATA_DIR = '' # specify this path


# -- model specific configurations
model_name = 'google/bigbird-roberta-base'
max_length = 1024

# -- training configurations
epochs = 5
batch_size = 2
verbose = 2
save_freq = 2

# -- dataset configurations
dataset_name = 'Persuade'

# -- experiment configurations
strategy = 'standard_bieo'
strat_name, strat_label = strategy.split('_')

# -- inferred configurations
df_label_map = LABELS_MAP_DICT[dataset_name][strat_label]
num_labels = len(set(df_label_map.label))
Processor = eval(f'{dataset_name}Processor')


### Tokenizer, Model and Optimizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=num_labels) 
optimizer = torch.optim.Adam(params=model.parameters())

### Dataset 
Note this will change as the Processor develops. On the cluster you will need to use different options

In [None]:
processor = Processor(PATH_TO_DATA_DIR)
processor = processor.from_json()
df_total = processor.dataframe

df_dict = processor.get_tts(test_size=0.3, val_size=0.1)
df_train = df_dict.get('train')[['text', 'labels']]
df_test = df_dict.get('test')[['text', 'labels']]
df_val = df_dict.get('val')[['text', 'labels']]


In [None]:
train_set = ArgumentMiningDataset(df_label_map, df_train, tokenizer, max_length, strategy)
test_set = ArgumentMiningDataset(df_label_map, df_test, tokenizer, max_length, strategy, is_train=False)

train_loader = DataLoader(train_set, batch_size=batch_size)
test_loader = DataLoader(test_set, batch_size=batch_size)

In [None]:
if not os.path.exists('models'):
  os.makedirs('models')
  print('models directory created!')
model.to(DEVICE)
print(f'Model pushed to device: {DEVICE}')
for epoch in range(epochs):
    model.train()
    start_epoch_message = f'EPOCH {epoch + 1} STARTED'
    print(start_epoch_message)
    print(f'{"-" * len(start_epoch_message)}')
    start_epoch = time.time()

    start_load = time.time()
    training_loss = 0
    for i, (inputs, targets) in enumerate(train_loader):
        start_train = time.time()
        inputs = move_to_device(inputs, DEVICE)
        targets = move_to_device(targets, DEVICE)
        if DEVICE != 'cpu':
            print(f'GPU Utilisation at batch {i+1} after data loading: {get_gpu_utilization()}')

        optimizer.zero_grad()

        loss, outputs = model(
            labels=targets,
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            return_dict=False
        )
        if DEVICE != 'cpu':
            print(f'GPU Utilisation at batch {i+1} after training: {get_gpu_utilization()}')


        training_loss += loss.item()

        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        del targets, inputs, loss, outputs
        gc.collect()
        torch.cuda.empty_cache()

        end_train = time.time()

        if verbose > 1:
            print(
                f'Batch {i + 1} complete. Time taken: load({start_train - start_load:.3g}), '
                f'train({end_train - start_train:.3g}), total({end_train - start_load:.3g}). '
            )
        start_load = time.time()

    print_message = f'Epoch {epoch + 1}/{epochs} complete. ' \
                    f'Time taken: {start_load - start_epoch:.3g}. ' \
                    f'Loss: {training_loss/(i+1): .3g}'

    if verbose:
        print(f'{"-" * len(print_message)}')
        print(print_message)
        print(f'{"-" * len(print_message)}')

    if epoch % save_freq == 0:
        encoded_model_name = encode_model_name(model_name, epoch+1)
        save_path = f'models/{encoded_model_name}'
        model.save_pretrained(save_path)
        print(f'Model saved at epoch {epoch+1} at: {save_path}')

encoded_model_name = encode_model_name(model_name, 'final')
save_path = f'models/{encoded_model_name}'
model.save_pretrained(save_path)
print(f'Model saved at epoch {epoch + 1} at: {save_path}')

In [None]:
# load trained model
path = ''
trained_model = AutoModelForTokenClassification.from_pretrained(path)

In [None]:
df_metrics, df_scores = inference(trained_model, test_loader)

In [None]:
_get_scores_agg(df_scores)