<a href="https://colab.research.google.com/github/katarinagresova/AgoBind/blob/adaptor/experiments/Adaptor_playground.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%pip install adaptor
%pip install git+https://github.com/katarinagresova/AgoBind
%pip install scikit-learn
%pip install comet-ml

Collecting transformers==4.10.2
  Using cached transformers-4.10.2-py3-none-any.whl (2.8 MB)
Collecting tokenizers<0.11,>=0.10.1
  Using cached tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.11.6
    Uninstalling tokenizers-0.11.6:
      Successfully uninstalled tokenizers-0.11.6
  Attempting uninstall: transformers
    Found existing installation: transformers 4.18.0
    Uninstalling transformers-4.18.0:
      Successfully uninstalled transformers-4.18.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
agobind 0.0.1 requires transformers>=4.17.0, but you have transformers 4.10.2 which is incompatible.
Successfully installed tokenizers-0.10.3 

In [2]:
!git clone https://github.com/katarinagresova/AgoBind.git

fatal: destination path 'AgoBind' already exists and is not an empty directory.


In [3]:
%cd AgoBind/experiments

/content/AgoBind/experiments


In [4]:
import comet_ml

In [5]:
# 1. pick the model base
from adaptor.lang_module import LangModule

kmer_len = 6
stride = 1
lang_module = LangModule(f"armheb/DNA_bert_{kmer_len}")

In [6]:
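# 2. Initialize training arguments
# ALL_OBJECTIVES_CONVERGED stops training once every objective has converged; max_steps remains the upper bound.
# The commented-out NUM_STEPS_ALL_OBJECTIVES strategy is for cases where at least one of the objectives does not converge within max_steps.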
from adaptor.utils import AdaptationArguments, StoppingStrategy

training_arguments = AdaptationArguments(output_dir="dnabert_for_clash",
                                         learning_rate=2e-5,
                                         max_steps=100000,
                                         stopping_strategy=StoppingStrategy.ALL_OBJECTIVES_CONVERGED,
                                         # stopping_strategy=StoppingStrategy.NUM_STEPS_ALL_OBJECTIVES,
                                         do_train=True,
                                         do_eval=True,
                                         warmup_steps=10000,
                                         gradient_accumulation_steps=10,
                                         logging_steps=100,
                                         eval_steps=100,
                                         save_steps=100,
                                         num_train_epochs=30,  # has no effect here: max_steps overrides it
                                         evaluation_strategy="steps",
                                         also_log_converged_objectives=True)

In [7]:
import pandas as pd
import numpy as np

def prepare_data(path_to_csv, path_to_txt, path_to_labels):
    dset = pd.read_csv(path_to_csv, sep='\t')
    dset['seq'] = dset.apply(lambda x: x['miRNA'] + 'NNNN' + x['gene'], axis=1)
    dset['seq'] = dset['seq'].apply(lambda x: ' '.join([x[i:i+kmer_len] for i in range(0, len(x)-kmer_len+1, stride)]))
    np.savetxt(path_to_txt, dset['seq'].values, fmt='%s')
    np.savetxt(path_to_labels, dset['label'].values, fmt='%s')

In [8]:
prepare_data('../data/train_set_1_1_CLASH2013_paper.tsv', '../data/train_set_1_1_CLASH2013_paper.txt', '../data/train_set_1_1_CLASH2013_paper_labels.txt')
prepare_data('../data/evaluation_set_1_1_CLASH2013_paper.tsv', '../data/evaluation_set_1_1_CLASH2013_paper.txt', '../data/evaluation_set_1_1_CLASH2013_paper_labels.txt')
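
The k-mer splitting inside `prepare_data` is easier to follow on a toy input. The sketch below repeats the same sliding-window expression in a standalone helper; `to_kmers` and the two short sequences are made up for illustration and are not part of AgoBind or the CLASH data.

```python
def to_kmers(seq: str, k: int = 6, stride: int = 1) -> str:
    """Split a sequence into overlapping k-mers joined by single spaces."""
    return ' '.join(seq[i:i + k] for i in range(0, len(seq) - k + 1, stride))


# Toy sequences, invented for this example
mirna = "ACGTAC"
gene = "TTGCA"

# Same 'NNNN' separator between miRNA and gene as in prepare_data
joined = mirna + "NNNN" + gene
tokens = to_kmers(joined, k=6, stride=1)

print(joined)
print(tokens)
print(len(tokens.split()), "k-mers")
```

With `stride = 1`, a sequence of length `n` yields `n - k + 1` tokens, so the 15-character toy input gives 10 six-mers.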

In [9]:
# 3. pick objectives
# Objectives take either List[str] for in-memory iteration, or a source file path for streamed iteration
from adaptor.objectives.MLM import MaskedLanguageModeling
from adaptor.objectives.classification import SequenceClassification

mlm = MaskedLanguageModeling(lang_module,
                             batch_size=16,
                             texts_or_path='../data/train_set_1_1_CLASH2013_paper.txt',
                             val_texts_or_path='../data/evaluation_set_1_1_CLASH2013_paper.txt')

cls = SequenceClassification(lang_module,
                             batch_size=16,
                             texts_or_path='../data/train_set_1_1_CLASH2013_paper.txt',
                             labels_or_path='../data/train_set_1_1_CLASH2013_paper_labels.txt',
                             val_texts_or_path='../data/evaluation_set_1_1_CLASH2013_paper.txt',
                             val_labels_or_path='../data/evaluation_set_1_1_CLASH2013_paper_labels.txt')

Some weights of the model checkpoint at armheb/DNA_bert_6 were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at armheb/DNA_bert_6 and are n
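
Because the classification objective reads texts and labels from two separate files, it can help to confirm that both have the same number of lines before training. The sketch below assumes that line `i` of the texts file belongs to line `i` of the labels file, which is how `prepare_data` writes them. `count_lines` and `check_aligned` are hypothetical helpers, not part of adaptor, and the demo runs on throwaway files.

```python
import tempfile
from pathlib import Path


def count_lines(path) -> int:
    """Count lines without loading the whole file into memory."""
    with open(path) as f:
        return sum(1 for _ in f)


def check_aligned(texts_path, labels_path) -> int:
    """Return the number of (text, label) pairs, or raise if the files differ in length."""
    n_texts, n_labels = count_lines(texts_path), count_lines(labels_path)
    if n_texts != n_labels:
        raise ValueError(f"{n_texts} texts but {n_labels} labels")
    return n_texts


# Demo on temporary files with made-up content
with tempfile.TemporaryDirectory() as tmp:
    texts = Path(tmp) / "texts.txt"
    labels = Path(tmp) / "labels.txt"
    texts.write_text("ACGTAC CGTACG\nTTGCAA TGCAAN\n")
    labels.write_text("1\n0\n")
    n_pairs = check_aligned(texts, labels)

print(n_pairs)
```

The same call could be pointed at the `.txt` files written in the previous cell.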

In [10]:
# 4. pick a schedule of the selected objectives
# ParallelSchedule trains on all objectives in parallel.
# The commented-out SequentialSchedule would instead fit the first objective until convergence on its eval set, then fit the second one
from adaptor.schedules import ParallelSchedule, SequentialSchedule

schedule = ParallelSchedule([mlm, cls], training_arguments)
#schedule = SequentialSchedule([mlm, cls], training_arguments)
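
To make the difference between the two schedules concrete, here is a toy sketch of the order in which batches would be drawn. It is not adaptor's implementation: `parallel_order` and `sequential_order` are hypothetical helpers, the parallel case is assumed to be a plain round-robin, and the sequential case switches after a fixed number of batches, whereas the real `SequentialSchedule` switches when an objective converges.

```python
from itertools import chain, cycle, islice


def parallel_order(objectives, n_batches):
    """Round-robin: take one batch from each objective in turn."""
    return list(islice(cycle(objectives), n_batches))


def sequential_order(objectives, batches_per_objective):
    """Run each objective for a fixed number of batches before moving to the next."""
    return list(chain.from_iterable([obj] * batches_per_objective for obj in objectives))


names = ["mlm", "cls"]
print(parallel_order(names, 4))    # ['mlm', 'cls', 'mlm', 'cls']
print(sequential_order(names, 2))  # ['mlm', 'mlm', 'cls', 'cls']
```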

In [11]:
import os

# Read the Comet API key from the environment instead of hardcoding it in the notebook
comet_ml.init(project_name='dnabert_for_clash', api_key=os.environ['COMET_API_KEY'])

COMET INFO: Comet API key is valid
COMET INFO: Comet API key saved in /root/.comet.config


In [12]:

import logging
import os
from typing import List, Dict, Tuple, Union, Optional

from transformers import WEIGHTS_NAME
import torch
from transformers import Trainer, BatchEncoding
from transformers.modeling_utils import unwrap_model

from adaptor.lang_module import LangModule
from adaptor.schedules import Schedule
from adaptor.utils import AdaptationArguments

logger = logging.getLogger()


class Adapter(Trainer):
    """
    Adapter instance is a lightweight wrapper of HuggingFace Trainer.
    1. It performs mapping of IterableDatasets constructed in Schedule, to Trainer(*dataset)
    2. For user convenience, it re-evaluates arguments sanity for (multi-)objective adaptation.
    3. It propagates computation of loss to schedule, which distributes them to corresponding Objectives.
    4. It extends training logs (created in events `on_log` and `on_evaluate`) with objective-specific logs.
    5. It extends model persistence on checkpoints and after the training to a separate model for each Objective.
    """

    permitted_args = ["args", "tokenizer", "callbacks", "optimizers"]
    eval_metrics_prefix = "eval"

    def __init__(self, lang_module: LangModule, schedule: Schedule, args: AdaptationArguments, **kwargs):
        """
        Initialises Adapter, used in the same way as HuggingFace Trainer, refer to its documentation for more features.
        :param lang_module: Wrapper of multi-head model with registered heads for each objective of `schedule`.
        :param schedule: Adaptor's Schedule. Determines ordering of applying training Objectives and other.
        :param args: Training arguments (AdaptationArguments) to be passed to HF Trainer.
        :param kwargs: Keyword arguments to be checked and passed to HF Trainer.
        """
        unexpected_args = [k for k in kwargs.keys() if k not in self.permitted_args]
        if unexpected_args:
            raise ValueError("Adapter(**kwargs) got these unexpected kwargs: %s" % unexpected_args)

        self.schedule = schedule

        orig_callbacks = [] if "callbacks" not in kwargs else kwargs.pop("callbacks")
        print(orig_callbacks)

        super().__init__(model=lang_module,
                         args=args,
                         train_dataset=self.schedule.iterable_dataset(split="train"),
                         eval_dataset=self.schedule.iterable_dataset(split="eval"),
                         data_collator=self.flattened_collator,
                         compute_metrics=None,  # would require a static prediction format among objectives
                         callbacks=orig_callbacks + [schedule.should_stop_check_callback()],
                         **kwargs)

    @staticmethod
    def flattened_collator(features: List[BatchEncoding]) -> BatchEncoding:
        """
        Objectives take care of their own data collation, so this collator just flattens the outputs of batch_size=1.
        :param features: A single-item list holding one batch already collated by its objective.
        :return: That batch, unwrapped from the list.
        """
        assert len(features) == 1, "Sorry, for multi-GPU training, we only support DistributedDataParallel for now."

        return features[0]

    def compute_loss(self,
                     model: LangModule,
                     inputs: Dict[str, torch.Tensor],
                     return_outputs: bool = False) -> Union[torch.FloatTensor, Tuple[torch.FloatTensor, None]]:
        labels = inputs["labels"] if "labels" in inputs else inputs["label"]

        outputs = model(**inputs)
        if self.label_smoother is not None:
            raise NotImplementedError()  # objective-dependent label smoothing is custom
            # loss = self.label_smoother(outputs, labels)
        else:
            loss = self.schedule.compute_loss(outputs, labels)

        mock_outputs = torch.tensor([-1, -1])
        return (loss, mock_outputs) if return_outputs else loss

    def log(self, logs: Dict[str, float]) -> None:
        is_eval_log = any(self.eval_metrics_prefix in log_key for log_key in logs)
        extended_logs = self.schedule.objectives_log(split="eval" if is_eval_log else "train")
        return super().log({**logs, **extended_logs})

    def evaluate(self, *args, **kwargs) -> Dict[str, float]:
        logger.warning("Evaluating...")
        out = super(Adapter, self).evaluate(*args, **kwargs)
        if "metric_key_prefix" in kwargs:
            self.eval_metrics_prefix = kwargs["metric_key_prefix"]

        # refresh exhausted evaluation iteration for possible next evaluation
        self.eval_dataset = self.schedule.iterable_dataset("eval")

        return out

    def save_model(self, output_dir: Optional[str] = None) -> None:
        # fall back to the configured output directory, as HF Trainer does
        output_dir = output_dir if output_dir is not None else self.args.output_dir

        # HF native reload compatibility
        objectives_counter = {str(obj): 0 for obj in self.schedule.objectives["train"].values()}

        for objective_id in self.schedule.objectives["train"].keys():
            module = self.model.trainable_models[str(objective_id)]
            objective = self.schedule.objectives["train"][int(objective_id)]
            output_module_path = os.path.join(output_dir, str(objective))

            # if an objective of this name was already persisted, we index the directories of the next ones
            if objectives_counter[str(objective)] != 0:
                output_module_path += "_{}".format(objectives_counter[str(objective)])
            # count every persisted objective, so that repeated names get distinct suffixes
            objectives_counter[str(objective)] += 1

            # we persist a shared tokenizer and training args either way
            self.model.tokenizer.save_pretrained(output_module_path)
            torch.save(self.args, os.path.join(output_dir, "training_args.bin"))

            if hasattr(module, "save_pretrained") or hasattr(unwrap_model(module), "save_pretrained"):
                # if the head module has "save_pretrained" method, it will be called for persistence
                module.save_pretrained(output_module_path, use_diff=True)
            else:
                # otherwise, we persist only a raw pytorch module
                torch.save(module.state_dict(), os.path.join(output_module_path, WEIGHTS_NAME))

            logger.info(f"Model of objective {str(objective)} saved in {output_module_path}")
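
The comment in `save_model` describes indexing the output directories when several objectives share a name. The sketch below shows that intended naming scheme in isolation. `objective_dirs` is a hypothetical helper, and the objective names are assumed to match the class names that appear in the training logs.

```python
import os
from collections import Counter


def objective_dirs(output_dir, objective_names):
    """Return one directory per objective, suffixing repeated names with _1, _2, ..."""
    seen = Counter()
    paths = []
    for name in objective_names:
        path = os.path.join(output_dir, name)
        if seen[name] != 0:
            # a directory of this name is already taken, so append an index
            path += "_{}".format(seen[name])
        seen[name] += 1
        paths.append(path)
    return paths


dirs = objective_dirs("dnabert_for_clash",
                      ["MaskedLanguageModeling", "SequenceClassification", "SequenceClassification"])
for d in dirs:
    print(d)
```

In this notebook there is one objective of each kind, so no suffix would be added.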

In [13]:
# 5. Run the training using Adapter, similarly to running HF.Trainer, only adding `schedule`
#from adaptor.adapter import Adapter
from transformers.integrations import CometCallback

adapter = Adapter(lang_module=lang_module, schedule=schedule, args=training_arguments, callbacks=[CometCallback()])
adapter.train()

[<transformers.integrations.CometCallback object at 0x7fec83268990>]


You are adding a <class 'transformers.integrations.CometCallback'> to the callbacks of this Trainer, but there is already one. The currentlist of callbacks is
:DefaultFlowCallback
CometCallback
TensorBoardCallback
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
  Num examples = 115440
  Num Epochs = 9
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 10
  Gradient Accumulation steps = 10
  Total optimization steps = 100000
COMET ERROR: Failed to calculate active processors count. Fall back to default CPU count 1
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/katarinagresova/dnabert-for-clash/0767e30f0730478faa9fdb70d84e08ee

Automatic Comet.ml online logging enabled
COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:

{'loss': 0.4433, 'learning_rate': 2.0000000000000002e-07, 'train_MaskedLanguageModeling_loss': 0.21739389516413213, 'train_MaskedLanguageModeling_num_batches': 500, 'train_SequenceClassification_loss': 0.6692616649866104, 'train_SequenceClassification_num_batches': 500, 'epoch': 0.01}


MaskedLanguageModeling: 1924batches [05:10,  6.31batches/s, epoch=1, loss=0.195, split=eval]
SequenceClassification: 100%|██████████| 125/125 [00:19<00:00,  6.19batches/s, epoch=1, loss=0.747, split=eval]
Converged objectives: []


{'eval_loss': 0.1942417472600937, 'eval_runtime': 330.8067, 'eval_samples_per_second': 0.756, 'eval_steps_per_second': 0.756, 'eval_MaskedLanguageModeling_loss': 0.1940250468039556, 'eval_MaskedLanguageModeling_num_batches': 1924, 'eval_SequenceClassification_loss': 0.6981173262596131, 'eval_SequenceClassification_num_batches': 125, 'epoch': 0.01}


TypeError: ignored
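
The first training log entry above can be cross-checked against the training arguments. The sketch below copies its numbers from this notebook. It assumes the entry was written at step 100 (since `logging_steps=100`), that the learning rate follows the HF Trainer's default linear warmup, and that the reported `loss` is the mean of the two per-objective losses. These are assumptions consistent with the log, not something the log states.

```python
import math

# Values copied from the training arguments and the Trainer log in this notebook
learning_rate = 2e-5
warmup_steps = 10_000
gradient_accumulation_steps = 10
max_steps = 100_000
num_examples = 115_440
logged_step = 100  # assumption: first log entry, because logging_steps=100

# Linear warmup (assumed): the learning rate grows proportionally to the step
lr_at_step = learning_rate * logged_step / warmup_steps

# Batches consumed up to the logged step; the log reports 500 per objective
batches_seen = logged_step * gradient_accumulation_steps

# Mean of the two per-objective training losses from the log
mlm_loss = 0.21739389516413213
cls_loss = 0.6692616649866104
mean_loss = (mlm_loss + cls_loss) / 2

# Epochs needed to reach max_steps, which is why num_train_epochs is overridden
updates_per_epoch = num_examples // gradient_accumulation_steps
num_epochs = math.ceil(max_steps / updates_per_epoch)

print(lr_at_step, batches_seen, round(mean_loss, 4), num_epochs)
```

The results agree with the logged `learning_rate` of about 2e-07, the 500 + 500 batches, the `loss` of 0.4433 and `Num Epochs = 9`.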