# Paratope Prediction using AntiBERTa

This notebook shows how to fine-tune your own AntiBERTa model for paratope prediction using the HuggingFace framework. As a demo, we have included the tokenizer we used and a minimal model.

## Setup: imports and tokenizer

In [1]:
from transformers import (
    RobertaTokenizer,
    RobertaForTokenClassification,
    Trainer,
    TrainingArguments
)
from datasets import (
    Dataset,
    DatasetDict,
    Sequence,
    ClassLabel
)
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    matthews_corrcoef,
    roc_auc_score,
    average_precision_score
)
import pandas as pd
import torch
import numpy as np
import random
import os

In [2]:
TOKENIZER_DIR = "../antiberta/antibody-tokenizer"

# Initialise a tokenizer
tokenizer = RobertaTokenizer.from_pretrained(TOKENIZER_DIR, max_len=150)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizer'.


## Data pre-processing

Data pre-processing for paratope prediction as a token classification task involves a few steps:
* Detecting the actual paratopes from PDB structures (this has already been done for convenience)
* Splitting non-redundant sequences (this has already been done for convenience)
* Loading them into HuggingFace-compatible `Dataset` objects: shown below
* Tokenizing the sequences: shown below

### Wrangling data into the HF framework

In [3]:
# Read in parquet files
train_df = pd.read_parquet(
    '../antiberta/assets/sabdab_train.parquet'
)
val_df = pd.read_parquet(
    '../antiberta/assets/sabdab_val.parquet'
)
test_df = pd.read_parquet(
    '../antiberta/assets/sabdab_test.parquet'
)

In [4]:
# Get a preview
train_df.head(3)

| index | sequence | paratope_labels | paratope_sequence | v_gene | j_gene | pdb | antibody_chains | antigen_type_max | compound | paratope_count_bin | v_gene_cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1002 | DIVMTQSPDSLAVSLGERATINCKSSQSVLYSSNNKNYLAWYQQKP... | [N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, ... | ------------------------------Y------Y--------... | IGKV4-1 | IGKJ2 | 7k9z | BA | protein | Crystal structure of SARS-CoV-2 receptor bindi... | 1 | VK9 |
| 511 | SALTQPPSVSGAPGQRVTISCTGSSSNIGAGYDVHWYQQLPGTAPK... | [N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, ... | -----------------------------AGYD-------------... | IGLV1-40 | IGLJ7 | 5ush | DE | protein | Structure of vaccinia virus D8 protein bound t... | 0 | VL8 |
| 911 | VLTQPPSASGTPGQRVTISCSGSNSNIATNYVCWYQQYPGTAPKPL... | [N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, ... | ----------------------------TNY---------------... | IGLV1-47 | IGLJ3 | 6xqw | HL | protein | Crystal Structure of MaliM03 Fab in complex wi... | 1 | VL8 |


In [5]:
# Create a new Dataset Dict with the sequence and paratope labels
ab_dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df[['sequence','paratope_labels']]),
    "validation": Dataset.from_pandas(val_df[['sequence','paratope_labels']]),
    "test": Dataset.from_pandas(test_df[['sequence','paratope_labels']])
})

In [6]:
# This is what a DatasetDict looks like: one Dataset per split (train/validation/test)
ab_dataset

DatasetDict({
    train: Dataset({
        features: ['sequence', 'paratope_labels', '__index_level_0__'],
        num_rows: 720
    })
    validation: Dataset({
        features: ['sequence', 'paratope_labels', '__index_level_0__'],
        num_rows: 90
    })
    test: Dataset({
        features: ['sequence', 'paratope_labels', '__index_level_0__'],
        num_rows: 90
    })
})

In [7]:
ab_dataset['train'].select(range(1))['sequence']

['DIVMTQSPDSLAVSLGERATINCKSSQSVLYSSNNKNYLAWYQQKPGQPPKLLIYWASTRESGVPDRFSGSGSGTDFTLTISSLQAEDVAVYYCQQYYSTPPTFGQGTKLEIK']

In [8]:
print(ab_dataset['train'].select(range(1))['paratope_labels'][0][20:35])

['N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'P', 'N', 'N', 'N', 'N']


In [9]:
# Look at the Features of each column in the train dataset within the ab_dataset DatasetDict
ab_dataset['train'].features

{'sequence': Value(dtype='string', id=None),
 'paratope_labels': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 '__index_level_0__': Value(dtype='int64', id=None)}

We need to convert the `paratope_labels` column into a sequence of `ClassLabel`s, which the Trainer will then learn to predict.

In [10]:
# Create a ClassLabel feature which will replace paratope_labels later.
paratope_class_label = ClassLabel(2, names=['N','P'])
new_feature = Sequence(
    paratope_class_label
)

In [11]:
# We iterate through the sequence and labels columns
# Keeping the sequence column as-is, but applying a str2int function, allowing us to cast later
ab_dataset_featurised = ab_dataset.map(
    lambda seq, labels: {
        "sequence": seq,
        "paratope_labels": [paratope_class_label.str2int(sample) for sample in labels]
    }, 
    input_columns=["sequence", "paratope_labels"], batched=True
)

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [12]:
# Get the old Features instance from the previous ab_dataset
# Notice how labels is a Sequence of Value
feature_set_copy = ab_dataset['train'].features.copy()
feature_set_copy

{'sequence': Value(dtype='string', id=None),
 'paratope_labels': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 '__index_level_0__': Value(dtype='int64', id=None)}

In [13]:
# Cast to the `new_feature` that we made earlier
feature_set_copy['paratope_labels'] = new_feature

In [14]:
ab_dataset_featurised = ab_dataset_featurised.cast(feature_set_copy)

Casting the dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

Casting the dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

Casting the dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

In [15]:
ab_dataset_featurised['train'].features 

{'sequence': Value(dtype='string', id=None),
 'paratope_labels': Sequence(feature=ClassLabel(names=['N', 'P'], id=None), length=-1, id=None),
 '__index_level_0__': Value(dtype='int64', id=None)}

In [16]:
# Now the labels are stored as integers, but HuggingFace recognises them as a sequence of ClassLabels
print(ab_dataset_featurised['train'].select(range(1))['paratope_labels'][0][20:35])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
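The string-to-integer conversion above can be pictured with a plain-Python stand-in. This is only a sketch of the mapping for the two-class case: the dictionaries `str2int` and `int2str` are hypothetical helpers and not the `datasets` API, and the toy `batch_labels` are made up for illustration.

```python
# Plain-Python stand-in for ClassLabel(names=['N', 'P']); not the datasets API itself.
names = ["N", "P"]
str2int = {name: idx for idx, name in enumerate(names)}
int2str = {idx: name for name, idx in str2int.items()}

# With batched=True, a column arrives as a list of per-sequence label lists.
batch_labels = [["N", "N", "P", "N"], ["P", "N"]]  # toy data

encoded = [[str2int[label] for label in labels] for labels in batch_labels]
decoded = [[int2str[i] for i in labels] for labels in encoded]

print(encoded)  # [[0, 0, 1, 0], [1, 0]]
```

Because `P` sits at index 1, the positive (paratope) class is the last column of the model output later on.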


### Tokenizing inputs

In [17]:
# Tokenize each batch and build a `labels` column, with -100 on the start/end/pad tokens

def preprocess(batch):
    # The tokenizer takes a LIST of strings, not a PyTorch tensor; see
    # https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb#scrollTo=vc0BSBLIIrJQ
    
    t_inputs = tokenizer(batch['sequence'], 
        padding="max_length")
    batch['input_ids'] = t_inputs.input_ids
    batch['attention_mask'] = t_inputs.attention_mask
    
    # Build a padded label list for every sequence in the batch
    labels_container = []
    for index, labels in enumerate(batch['paratope_labels']):
        
        # This is the length of the sequence + SOS + EOS + PAD
        # (padding="max_length" pads every example to the tokenizer's maximum length)
        tokenized_input_length = len(batch['input_ids'][index])
        paratope_label_length  = len(batch['paratope_labels'][index])
        
        # We subtract 1 for the leading SOS token.
        # There is always at least one trailing -100, because the EOS token is never
        # covered by the paratope_labels column, even for the longest possible sequence
        n_pads_with_eos = max(1, tokenized_input_length - paratope_label_length - 1)
        
        # We have a starting -100 for the SOS
        # and fill the rest of the sequence length with -100 to account for the final EOS token and any extra pads
        # The -100s are ignored in the CE loss function
        labels_padded = [-100] + labels + [-100] * n_pads_with_eos
        
        assert len(labels_padded) == len(batch['input_ids'][index]), \
        f"Lengths don't align, {len(labels_padded)}, {len(batch['input_ids'][index])}, {len(labels)}"
        
        labels_container.append(labels_padded)
    
    # We create a new column called `labels`, which is recognised by the HF trainer object
    batch['labels'] = labels_container
    
    for i,v in enumerate(batch['labels']):
        assert len(batch['input_ids'][i]) == len(v) == len(batch['attention_mask'][i])
    
    return batch

In [18]:
# Apply that function above on the dataset - labels now aligned!
ab_dataset_tokenized = ab_dataset_featurised.map(
    preprocess, 
    batched=True,
    batch_size=8,
    remove_columns=['sequence', 'paratope_labels']
)

  0%|          | 0/90 [00:00<?, ?ba/s]

  0%|          | 0/12 [00:00<?, ?ba/s]

  0%|          | 0/12 [00:00<?, ?ba/s]
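To make the label alignment concrete, here is a minimal stand-alone sketch of the padding rule used in `preprocess`. The helper name `pad_labels`, the constant `IGNORE_INDEX`, and the 5-residue example with a fixed tokenized length of 10 are assumptions for illustration only.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the cross-entropy loss

def pad_labels(labels, tokenized_length):
    """Align per-residue labels with SOS + residues + EOS + padding."""
    # One slot is taken by SOS; everything after the residues (EOS and pads) is ignored.
    n_trailing = max(1, tokenized_length - len(labels) - 1)
    return [IGNORE_INDEX] + list(labels) + [IGNORE_INDEX] * n_trailing

# Hypothetical 5-residue sequence tokenized to a fixed length of 10
labels = [0, 0, 1, 1, 0]
padded = pad_labels(labels, tokenized_length=10)
print(padded)
# [-100, 0, 0, 1, 1, 0, -100, -100, -100, -100]
```

The first `-100` covers the start token, the next five entries are the real labels, and the trailing four cover the end token plus three pad tokens.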

## Model set-up and training

Here we define:
* The callback function to compute some metrics during training (and can be used for evaluation!)
* The training configuration

In [19]:
# Label names, mapping 0->N and 1->P
label_list = paratope_class_label.names

def compute_metrics(p):
    """
    A callback added to the trainer so that we calculate various metrics via sklearn
    """
    predictions, labels = p
    
    # The predictions are logits, so we apply softmax to get the probabilities. We only need
    # the probabilities of the paratope label, which is index 1 (according to the ClassLabel we made earlier),
    # or the last column from the output tensor
    prediction_pr = torch.softmax(torch.from_numpy(predictions), dim=2).detach().numpy()[:,:,-1]
    
    # We run an argmax to get the label
    predictions = np.argmax(predictions, axis=2)

    # Only compute on positions that are not labelled -100
    preds = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    labs = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    
    probs = [
        [prediction_pr[i][pos] for pos, l in enumerate(label) if l != -100]
        for i, label in enumerate(labels)
    ]
            
    # flatten
    preds = sum(preds, [])
    labs = sum(labs, [])
    probs = sum(probs,[])
    
    return {
        "precision": precision_score(labs, preds, pos_label="P"),
        "recall": recall_score(labs, preds, pos_label="P"),
        "f1": f1_score(labs, preds, pos_label="P"),
        "auc": roc_auc_score(labs, probs),
        "aupr": average_precision_score(labs, probs, pos_label="P"),
        "mcc": matthews_corrcoef(labs, preds),
    }
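The masking and scoring logic in `compute_metrics` can be checked by hand on a toy batch. The sketch below is a pure-Python approximation, not the sklearn implementation: the function name `masked_binary_metrics` and the toy arrays are hypothetical, and it works on integer labels (0 = N, 1 = P) rather than the string labels used above.

```python
import math

def masked_binary_metrics(predictions, labels, ignore_index=-100):
    """Precision, recall, F1 and MCC over positions whose label is not ignore_index."""
    pairs = [
        (p, l)
        for pred_row, label_row in zip(predictions, labels)
        for p, l in zip(pred_row, label_row)
        if l != ignore_index
    ]
    tp = sum(1 for p, l in pairs if p == 1 and l == 1)
    fp = sum(1 for p, l in pairs if p == 1 and l == 0)
    fn = sum(1 for p, l in pairs if p == 0 and l == 1)
    tn = sum(1 for p, l in pairs if p == 0 and l == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}

# Toy batch: two sequences, already argmaxed, with -100 on special and pad tokens
predictions = [[0, 1, 0, 1, 0, 0], [0, 0, 1, 0, 0, 0]]
labels = [[-100, 1, 0, 0, 1, -100], [-100, 0, 1, 0, -100, -100]]
metrics = masked_binary_metrics(predictions, labels)
print(metrics)
```

In this sketch, a model that never predicts `P` gets a precision, recall, F1 and MCC of 0.0, which is the pattern an undertrained model tends to show in the first evaluation rounds.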

In [20]:
# define batch size, metric you want etc. 
batch_size = 32
RUN_ID = "paratope-prediction-task"
SEED = 0
LR = 1e-6

args = TrainingArguments(
    f"{RUN_ID}_{SEED}", # this is the name of the checkpoint folder
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=LR, # 1e-6, 5e-6, 1e-5. .... 1e-3
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=10,
    warmup_ratio=0, # 0, 0.05, 0.1 .... 
    load_best_model_at_end=True,
    lr_scheduler_type='linear',
    metric_for_best_model='aupr', # name of the metric here should correspond to metrics defined in compute_metrics
    logging_strategy='epoch',
    seed=SEED
)

In [21]:
def set_seed(seed: int = 42):
    """
    Set all seeds to make results reproducible (deterministic mode).
    """
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

In [24]:
set_seed(SEED)

# Directory of the pre-trained model (the output of your MLM pre-training)
MODEL_DIR = "../antiberta/saved_model"

# We initialise a model using the weights from the pre-trained model
model = RobertaForTokenClassification.from_pretrained(MODEL_DIR, num_labels=2)

trainer = Trainer(
    model,
    args=args,
    tokenizer=tokenizer,
    train_dataset=ab_dataset_tokenized['train'],
    eval_dataset=ab_dataset_tokenized['validation'],
    compute_metrics=compute_metrics
)

Some weights of the model checkpoint at ../antiberta/saved_model were not used when initializing RobertaForTokenClassification: ['lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at ../antiberta/saved_model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able t

In [25]:
# watch stuff fly
trainer.train()

The following columns in the training set don't have a corresponding argument in `RobertaForTokenClassification.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `RobertaForTokenClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 720
  Num Epochs = 10
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 230
  Number of trainable parameters = 85194242


  0%|          | 0/230 [00:00<?, ?it/s]

The following columns in the evaluation set don't have a corresponding argument in `RobertaForTokenClassification.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `RobertaForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 90
  Batch size = 32


{'loss': 0.4354, 'learning_rate': 9e-07, 'epoch': 1.0}


  0%|          | 0/3 [00:00<?, ?it/s]

  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to paratope-prediction-task_0/checkpoint-23
Configuration saved in paratope-prediction-task_0/checkpoint-23/config.json


{'eval_loss': 0.30869412422180176, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_f1': 0.0, 'eval_auc': 0.6220205051672912, 'eval_aupr': 0.17721246188876694, 'eval_mcc': 0.0, 'eval_runtime': 3.2295, 'eval_samples_per_second': 27.868, 'eval_steps_per_second': 0.929, 'epoch': 1.0}


Model weights saved in paratope-prediction-task_0/checkpoint-23/pytorch_model.bin
tokenizer config file saved in paratope-prediction-task_0/checkpoint-23/tokenizer_config.json
Special tokens file saved in paratope-prediction-task_0/checkpoint-23/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForTokenClassification.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `RobertaForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 90
  Batch size = 32


{'loss': 0.3053, 'learning_rate': 8e-07, 'epoch': 2.0}


  0%|          | 0/3 [00:00<?, ?it/s]

  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to paratope-prediction-task_0/checkpoint-46
Configuration saved in paratope-prediction-task_0/checkpoint-46/config.json


{'eval_loss': 0.2835576832294464, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_f1': 0.0, 'eval_auc': 0.7457842502166135, 'eval_aupr': 0.27236866965280115, 'eval_mcc': 0.0, 'eval_runtime': 3.2514, 'eval_samples_per_second': 27.681, 'eval_steps_per_second': 0.923, 'epoch': 2.0}


Model weights saved in paratope-prediction-task_0/checkpoint-46/pytorch_model.bin
tokenizer config file saved in paratope-prediction-task_0/checkpoint-46/tokenizer_config.json
Special tokens file saved in paratope-prediction-task_0/checkpoint-46/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForTokenClassification.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `RobertaForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 90
  Batch size = 32


{'loss': 0.2868, 'learning_rate': 7e-07, 'epoch': 3.0}


  0%|          | 0/3 [00:00<?, ?it/s]

Saving model checkpoint to paratope-prediction-task_0/checkpoint-69
Configuration saved in paratope-prediction-task_0/checkpoint-69/config.json


{'eval_loss': 0.2671515643596649, 'eval_precision': 0.6129032258064516, 'eval_recall': 0.019487179487179488, 'eval_f1': 0.03777335984095428, 'eval_auc': 0.7881085882812777, 'eval_aupr': 0.3128190229803789, 'eval_mcc': 0.0973336925787922, 'eval_runtime': 3.2396, 'eval_samples_per_second': 27.781, 'eval_steps_per_second': 0.926, 'epoch': 3.0}


Model weights saved in paratope-prediction-task_0/checkpoint-69/pytorch_model.bin
tokenizer config file saved in paratope-prediction-task_0/checkpoint-69/tokenizer_config.json
Special tokens file saved in paratope-prediction-task_0/checkpoint-69/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForTokenClassification.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `RobertaForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 90
  Batch size = 32


{'loss': 0.2756, 'learning_rate': 6e-07, 'epoch': 4.0}


  0%|          | 0/3 [00:00<?, ?it/s]

Saving model checkpoint to paratope-prediction-task_0/checkpoint-92
Configuration saved in paratope-prediction-task_0/checkpoint-92/config.json


{'eval_loss': 0.2591077983379364, 'eval_precision': 0.6285714285714286, 'eval_recall': 0.022564102564102566, 'eval_f1': 0.04356435643564357, 'eval_auc': 0.8129165607515463, 'eval_aupr': 0.3456545575264204, 'eval_mcc': 0.10656719338387169, 'eval_runtime': 3.2156, 'eval_samples_per_second': 27.988, 'eval_steps_per_second': 0.933, 'epoch': 4.0}


Model weights saved in paratope-prediction-task_0/checkpoint-92/pytorch_model.bin
tokenizer config file saved in paratope-prediction-task_0/checkpoint-92/tokenizer_config.json
Special tokens file saved in paratope-prediction-task_0/checkpoint-92/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForTokenClassification.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `RobertaForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 90
  Batch size = 32


{'loss': 0.2673, 'learning_rate': 5e-07, 'epoch': 5.0}


  0%|          | 0/3 [00:00<?, ?it/s]

Saving model checkpoint to paratope-prediction-task_0/checkpoint-115
Configuration saved in paratope-prediction-task_0/checkpoint-115/config.json


{'eval_loss': 0.2533701956272125, 'eval_precision': 0.651685393258427, 'eval_recall': 0.059487179487179485, 'eval_f1': 0.10902255639097744, 'eval_auc': 0.8281455970742827, 'eval_aupr': 0.364978905808375, 'eval_mcc': 0.1777510354708629, 'eval_runtime': 3.2693, 'eval_samples_per_second': 27.529, 'eval_steps_per_second': 0.918, 'epoch': 5.0}


Model weights saved in paratope-prediction-task_0/checkpoint-115/pytorch_model.bin
tokenizer config file saved in paratope-prediction-task_0/checkpoint-115/tokenizer_config.json
Special tokens file saved in paratope-prediction-task_0/checkpoint-115/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForTokenClassification.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `RobertaForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 90
  Batch size = 32


{'loss': 0.2607, 'learning_rate': 4e-07, 'epoch': 6.0}


  0%|          | 0/3 [00:00<?, ?it/s]

Saving model checkpoint to paratope-prediction-task_0/checkpoint-138
Configuration saved in paratope-prediction-task_0/checkpoint-138/config.json


{'eval_loss': 0.24631254374980927, 'eval_precision': 0.6206896551724138, 'eval_recall': 0.07384615384615385, 'eval_f1': 0.13198900091659027, 'eval_auc': 0.8375128123129403, 'eval_aupr': 0.37705011384115933, 'eval_mcc': 0.1919007414543019, 'eval_runtime': 3.6481, 'eval_samples_per_second': 24.67, 'eval_steps_per_second': 0.822, 'epoch': 6.0}


Model weights saved in paratope-prediction-task_0/checkpoint-138/pytorch_model.bin
tokenizer config file saved in paratope-prediction-task_0/checkpoint-138/tokenizer_config.json
Special tokens file saved in paratope-prediction-task_0/checkpoint-138/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForTokenClassification.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `RobertaForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 90
  Batch size = 32


{'loss': 0.2576, 'learning_rate': 3e-07, 'epoch': 7.0}


  0%|          | 0/3 [00:00<?, ?it/s]

Saving model checkpoint to paratope-prediction-task_0/checkpoint-161
Configuration saved in paratope-prediction-task_0/checkpoint-161/config.json


{'eval_loss': 0.24151334166526794, 'eval_precision': 0.6394557823129252, 'eval_recall': 0.09641025641025641, 'eval_f1': 0.16755793226381463, 'eval_auc': 0.8431678870821818, 'eval_aupr': 0.38317859963716644, 'eval_mcc': 0.22406481912233273, 'eval_runtime': 3.4895, 'eval_samples_per_second': 25.792, 'eval_steps_per_second': 0.86, 'epoch': 7.0}


Model weights saved in paratope-prediction-task_0/checkpoint-161/pytorch_model.bin
tokenizer config file saved in paratope-prediction-task_0/checkpoint-161/tokenizer_config.json
Special tokens file saved in paratope-prediction-task_0/checkpoint-161/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForTokenClassification.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `RobertaForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 90
  Batch size = 32


{'loss': 0.2545, 'learning_rate': 2e-07, 'epoch': 8.0}


  0%|          | 0/3 [00:00<?, ?it/s]

Saving model checkpoint to paratope-prediction-task_0/checkpoint-184
Configuration saved in paratope-prediction-task_0/checkpoint-184/config.json


{'eval_loss': 0.23990122973918915, 'eval_precision': 0.6319444444444444, 'eval_recall': 0.09333333333333334, 'eval_f1': 0.16264521894548703, 'eval_auc': 0.8477366417116977, 'eval_aupr': 0.38823237175583547, 'eval_mcc': 0.2186798046597585, 'eval_runtime': 3.3852, 'eval_samples_per_second': 26.586, 'eval_steps_per_second': 0.886, 'epoch': 8.0}


Model weights saved in paratope-prediction-task_0/checkpoint-184/pytorch_model.bin
tokenizer config file saved in paratope-prediction-task_0/checkpoint-184/tokenizer_config.json
Special tokens file saved in paratope-prediction-task_0/checkpoint-184/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForTokenClassification.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `RobertaForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 90
  Batch size = 32


{'loss': 0.2532, 'learning_rate': 1e-07, 'epoch': 9.0}


  0%|          | 0/3 [00:00<?, ?it/s]

Saving model checkpoint to paratope-prediction-task_0/checkpoint-207
Configuration saved in paratope-prediction-task_0/checkpoint-207/config.json


{'eval_loss': 0.23839563131332397, 'eval_precision': 0.6363636363636364, 'eval_recall': 0.10051282051282051, 'eval_f1': 0.1736049601417183, 'eval_auc': 0.8498940848796941, 'eval_aupr': 0.3905218212455077, 'eval_mcc': 0.2281154224859304, 'eval_runtime': 3.2812, 'eval_samples_per_second': 27.429, 'eval_steps_per_second': 0.914, 'epoch': 9.0}


Model weights saved in paratope-prediction-task_0/checkpoint-207/pytorch_model.bin
tokenizer config file saved in paratope-prediction-task_0/checkpoint-207/tokenizer_config.json
Special tokens file saved in paratope-prediction-task_0/checkpoint-207/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForTokenClassification.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `RobertaForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 90
  Batch size = 32


{'loss': 0.253, 'learning_rate': 0.0, 'epoch': 10.0}


  0%|          | 0/3 [00:00<?, ?it/s]

Saving model checkpoint to paratope-prediction-task_0/checkpoint-230
Configuration saved in paratope-prediction-task_0/checkpoint-230/config.json


{'eval_loss': 0.23847761750221252, 'eval_precision': 0.6423841059602649, 'eval_recall': 0.09948717948717949, 'eval_f1': 0.17229129662522202, 'eval_auc': 0.8506566737458967, 'eval_aupr': 0.3915655146767724, 'eval_mcc': 0.22835709827028988, 'eval_runtime': 3.3002, 'eval_samples_per_second': 27.271, 'eval_steps_per_second': 0.909, 'epoch': 10.0}


Model weights saved in paratope-prediction-task_0/checkpoint-230/pytorch_model.bin
tokenizer config file saved in paratope-prediction-task_0/checkpoint-230/tokenizer_config.json
Special tokens file saved in paratope-prediction-task_0/checkpoint-230/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from paratope-prediction-task_0/checkpoint-230 (score: 0.3915655146767724).


{'train_runtime': 1016.3631, 'train_samples_per_second': 7.084, 'train_steps_per_second': 0.226, 'train_loss': 0.2849534034729004, 'epoch': 10.0}


TrainOutput(global_step=230, training_loss=0.2849534034729004, metrics={'train_runtime': 1016.3631, 'train_samples_per_second': 7.084, 'train_steps_per_second': 0.226, 'train_loss': 0.2849534034729004, 'epoch': 10.0})
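The step count and the logged learning rates above can be reproduced with a few lines of arithmetic. This is a sketch that assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; `linear_lr` is a hypothetical helper that mimics a linear decay to zero with no warmup, not the scheduler class used by the Trainer.

```python
import math

num_examples = 720  # "Num examples" in the training log
batch_size = 32
num_epochs = 10
base_lr = 1e-6

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

def linear_lr(step, base_lr=base_lr, total_steps=total_steps):
    """Linear decay from base_lr to zero, with no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)

print(steps_per_epoch, total_steps)  # 23 230
for epoch in (1, 5, 10):
    print(epoch, linear_lr(epoch * steps_per_epoch))
```

With 23 steps per epoch, checkpoints land at steps 23, 46, and so on up to 230, and the learning rate after the first epoch is 90% of its starting value, in line with the `9e-07` in the log.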

## Model evaluation

In [26]:
# run prediction on the test set
pred = trainer.predict(
    ab_dataset_tokenized['test']
)

The following columns in the test set don't have a corresponding argument in `RobertaForTokenClassification.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `RobertaForTokenClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 90
  Batch size = 32


  0%|          | 0/3 [00:00<?, ?it/s]

In [27]:
# pred.metrics is a dict of metric values
pred.metrics 

{'test_loss': 0.2323971837759018,
 'test_precision': 0.6686046511627907,
 'test_recall': 0.11734693877551021,
 'test_f1': 0.1996527777777778,
 'test_auc': 0.863730895333623,
 'test_aupr': 0.42898343940829897,
 'test_mcc': 0.2547548299990841,
 'test_runtime': 3.2968,
 'test_samples_per_second': 27.3,
 'test_steps_per_second': 0.91}
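As a quick sanity check on these numbers, F1 is the harmonic mean of precision and recall, so the low recall pulls F1 down even though precision is reasonable. The snippet below only re-derives `test_f1` from the two values reported above.

```python
precision = 0.6686046511627907  # test_precision above
recall = 0.11734693877551021    # test_recall above

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.1997
```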

## Inference: from an input sequence to a predicted paratope sequence

In [28]:
# Input sequence: tralokinumab light chain
input_seq = 'YVLTQPPSVSVAPGKTARITCGGNIIGSKLVHWYQQKPGQAPVLVIYDDGDRPSGIPERFSGSNSGNTATLTISRVEAGDEADYYCQVWDTGSDPVVFGGGTKLTVL'

# The Trainer only wrote checkpoint sub-folders, so the top-level run folder has no config.json.
# We load the best checkpoint reported at the end of training instead.
model = RobertaForTokenClassification.from_pretrained(
    f"{RUN_ID}_{SEED}/checkpoint-230"
)
model.eval()

tokenized_input = tokenizer([input_seq], return_tensors='pt', padding=True)
with torch.no_grad():
    predicted_logits = model(**tokenized_input)

# Simple argmax - no thresholding. The [1:-1] slice drops the SOS and EOS positions.
argmax = predicted_logits[0].argmax(2)[0][1:-1].cpu().numpy()
indices = np.argwhere(argmax).flatten()

predicted_sequence = ''

# Keep predicted paratope residues and replace everything else with '-'
for i, s in enumerate(input_seq):
    if i in indices:
        predicted_sequence += s
    else:
        predicted_sequence += '-'

print(predicted_sequence)
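The cell above uses a plain argmax, which for two classes is the same as a 0.5 cut-off on the paratope probability. Since the test recall is low while the AUC is fairly high, one option worth exploring is a lower threshold. The sketch below shows the idea in pure Python; the helper names, the 5-residue fragment, and the logits are all made up for illustration and do not come from the model.

```python
import math

def paratope_probability(logit_n, logit_p):
    """Softmax probability of the 'P' class from a pair of logits."""
    m = max(logit_n, logit_p)  # subtract the max for numerical stability
    exp_n, exp_p = math.exp(logit_n - m), math.exp(logit_p - m)
    return exp_p / (exp_n + exp_p)

def mask_sequence(sequence, logits, threshold=0.5):
    """Keep residues whose paratope probability reaches the threshold; dash out the rest."""
    return "".join(
        residue if paratope_probability(n, p) >= threshold else "-"
        for residue, (n, p) in zip(sequence, logits)
    )

# Hypothetical per-residue logits (N, P) for a 5-residue fragment
sequence = "GSKLV"
logits = [(2.0, -1.0), (0.5, 0.0), (-1.0, 1.5), (0.2, 0.1), (3.0, -2.0)]

print(mask_sequence(sequence, logits, threshold=0.5))  # --K--
print(mask_sequence(sequence, logits, threshold=0.3))  # -SKL-
```

Lowering the threshold marks more residues as paratope, which would be expected to raise recall at the cost of precision; a suitable value should be chosen on the validation set, not the test set.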