This repository contains the implementation for the work *Towards Discovery: An End-to-End System for Uncovering Novel Biomedical Relations*.
The source code is written in Python, and all dependencies are listed in `requirements.txt`. Data and models are downloaded automatically on the first run of the pipeline.
To install the dependencies, consider creating a virtual environment:
```bash
python -m venv my-virtual-venv
source my-virtual-venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
The main entry point for our system is the `main.py` script, which runs the entire pipeline on your documents or datasets. You can also choose to execute specific modules, such as the tagger, linker, or extractor.
For example, if you wish to process the BC8 BioRED track test set, simply provide the respective test set in BioC JSON format:
```bash
python main.py dataset/bc8_biored_task2_test.json
```
Please note that the `bc8_biored_task2_test.json` file is not included in this repository; it can be easily obtained from the following link: BC8 BioRED Subtask 2 Test Set.
Alternatively, there is a simple API for running our pipeline on any article currently indexed on PubMed: simply prefix the article identifier with the keyword `PMID:`:

```bash
python main.py PMID:36516090
```
You can specify which modules to run using the following flags:

- `-t` or `--tagger`: enables the tagger module
- `-l` or `--linker`: enables the linker module
- `-e` or `--extractor`: enables the extractor module
By default, all modules are run if no flags are specified.
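For example, to run only the tagger and linker on the BioRED test set shown above:

```bash
python main.py dataset/bc8_biored_task2_test.json -t -l
```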
Using Large Language Models (LLMs) for sequence variant detection can be complex, so we've made this component optional to accommodate all users. If you have access to an LLM, you can leverage it by extending the `GenericAPICall` class and implementing the required logic in the `run` method, which sends a prompt to the LLM. The `OllamaAPICall.py` module is an example of how to use the LLMs provided by Ollama.
The integration of LLMs for sequence variant detection is entirely optional. Whether or not you configure the LLM component, sequence variant detection will always occur during linking by using direct matching techniques.
To utilize LLMs in the pipeline:

- **Extend the `GenericAPICall` class**: customize the `GenericAPICall` class by implementing the `run` method to handle the specifics of your LLM (see the sketch after this list). For example, the `OllamaAPICall` class is designed to send the given prompts to an Ollama server, handling details like API endpoints, temperature settings, and GPU allocations.
- **Specify the LLM details on the command line**: when running the main pipeline, specify the LLM API details using command-line arguments. For instance:

```bash
python main.py PMID:36516090 --linker.llm_api.address http://IP:PORT --linker.llm_api.module OllamaAPICall
```
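Since the exact `GenericAPICall` interface is defined in this repository, treat the following as a minimal sketch rather than a drop-in replacement for `OllamaAPICall.py`: the import path, the constructor, and the `run` signature (one prompt string in, the model's text reply out) are assumptions, while the HTTP call uses Ollama's standard `/api/generate` endpoint via the `requests` library.

```python
import requests

# Hypothetical import path; adjust to wherever GenericAPICall lives in this repo.
from generic_api_call import GenericAPICall


class MyOllamaAPICall(GenericAPICall):
    """Sketch of a custom API call that forwards prompts to an Ollama server."""

    def __init__(self, address, model="llama3"):
        self.address = address  # e.g. "http://IP:PORT"
        self.model = model

    def run(self, prompt):
        # Assumed signature: takes a prompt string, returns the model's text reply.
        # Ollama's /api/generate endpoint returns one JSON object when stream=False.
        response = requests.post(
            f"{self.address}/api/generate",
            json={"model": self.model, "prompt": prompt, "stream": False},
            timeout=120,
        )
        response.raise_for_status()
        return response.json()["response"]
```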
These settings relate to the general configuration of the pipeline:

- `source_file`: Path to the input file, either a BioC JSON file or a PubMed ID prefixed with `PMID:`.
- `-t`, `--tagger`: Enable the tagger module.
- `-l`, `--linker`: Enable the linker module.
- `-e`, `--extractor`: Enable the extractor module.
Settings specific to the tagger module:

- `--tagger.checkpoint`: Path or identifier for the tagger model checkpoint (default: `IEETA/BioNExt-Tagger`).
- `--tagger.trained_model_path`: Directory where the tagger model will be downloaded (default: `trained_models/tagger`).
- `--tagger.batch_size`: Batch size for processing (default: 8).
- `--tagger.output_folder`: Directory for saving output from the tagger (default: `outputs/tagger`).
Settings specific to the linker module:

- `--linker.llm_api.module`: Module for the LLM API (default: None).
- `--linker.llm_api.address`: Address for the LLM API (default: None).
- `--linker.kb_folder`: Directory where the knowledge bases will be downloaded (default: `knowledge-bases/`).
- `--linker.dataset_folder`: Directory where the datasets will be downloaded (default: `dataset/`).
- `--linker.output_folder`: Directory for saving output from the linker (default: `outputs/linker`).
Settings specific to the extractor module:

- `--extractor.output_folder`: Directory for saving output from the extractor (default: `outputs/extractor`).
- `--extractor.checkpoint`: Path or identifier for the extractor model checkpoint (default: `IEETA/BioNExt-Extractor`).
- `--extractor.trained_model_path`: Directory where the extractor model will be downloaded (default: `trained_models/extractor`).
- `--extractor.batch_size`: Batch size for processing (default: 128).
These options allow you to customize the execution of the pipeline to suit your specific needs, whether running the full suite of tools or individual components.
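For instance, the options above can be mixed freely in a single invocation; the values below are illustrative:

```bash
# Full pipeline on a PubMed article, with a smaller tagger batch
# and per-module output folders redirected to a custom location
python main.py PMID:36516090 \
    --tagger.batch_size 4 \
    --tagger.output_folder my_run/tagger \
    --linker.output_folder my_run/linker \
    --extractor.output_folder my_run/extractor
```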
The training environment for the Tagger model is set up under the `src/tagger` directory. To begin training, navigate to this directory and use the `hf_training.py` script as the main entry point. This script is built upon the Hugging Face `Trainer` API, allowing for straightforward model training with BERT architectures.
To start the training process, use the following command:
```bash
cd src/tagger
python hf_training.py michiyasunaga/BioLinkBERT-base --augmentation unk --context 64
```
The `hf_training.py` script accepts several arguments to customize the training process:

- **Model Checkpoint**: The first positional argument specifies the pre-trained BERT model checkpoint to use as a starting point; in this example, `michiyasunaga/BioLinkBERT-base`.
- `--augmentation`: The type of data augmentation to apply (default: None; options: `unk`, `random`).
- `--p_augmentation`: Probability of applying augmentation on a per-example basis (default: 0.5).
- `--percentage_tags`: The percentage of tags (>0) to be augmented per sample (default: 0.2).
- `--context`: Length of the context window (default: 64 tokens).
- `--epochs`: Number of training epochs (default: 30).
- `--batch`: Batch size for training (default: 8).
- `--random_seed`: Random seed for reproducibility (default: 42).
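For instance, a run using the alternative `random` augmentation strategy (the values below are illustrative, not recommended settings):

```bash
python hf_training.py michiyasunaga/BioLinkBERT-base \
    --augmentation random \
    --p_augmentation 0.7 \
    --percentage_tags 0.3 \
    --context 64 \
    --epochs 30 \
    --batch 8
```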
By default, we use the datasets under the `dataset` folder. If it is empty, consider running our system in inference mode (see above), since it will automatically download the BioRED dataset used for this work.
The training setup for the Extractor model is similarly straightforward and utilizes the `hf_training.py` script under the `src/extractor` directory. This model is also trained using the Hugging Face `Trainer` API, tailored to handle the specific needs of relation extraction tasks in biomedical texts.
To start training the Extractor model, navigate to the extractor directory and execute the following command:
```bash
cd src/extractor
python hf_training.py michiyasunaga/BioLinkBERT-base --novel
```
The `hf_training.py` script for the Extractor model accepts several arguments for customizing the training process:

- **Model Checkpoint**: The first positional argument specifies the pre-trained BERT model checkpoint to be used; in this case, `michiyasunaga/BioLinkBERT-base`.
- `--arch`: Defines the architecture type for the model (default: `mha`, multi-head attention; the other option is `bilstm`).
- `--index_type`: Specifies the type of indexing used in the model (default: `both`; the other options are `s` and `e`).
- `--name`: Provides a unique name for identifying the training session or model version (default: None).
- `--epochs`: Number of training epochs (default: 30).
- `--batch`: Batch size for training (default: 10).
- `--random_seed`: Random seed to ensure reproducibility (default: 42).
- `--novel`: Enables joint training for relation and novelty detection. Activating this flag integrates novelty detection into the training process, enhancing the model's ability to distinguish between known and novel relations in the dataset.
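For example, a run that swaps in the alternative BiLSTM architecture with start-token indexing (values illustrative):

```bash
python hf_training.py michiyasunaga/BioLinkBERT-base \
    --arch bilstm \
    --index_type s \
    --name biolinkbert-bilstm \
    --novel
```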
Ensure that the `dataset` folder contains the appropriate data before starting training. If the folder is empty, consider running the system in inference mode, which will automatically download the necessary BioRED dataset used for this work.
Our tagger and extractor models are integrated with the Hugging Face library and can be accessed and used in isolation at:
- tagger: https://huggingface.co/IEETA/BioNExt-Tagger
- extractor: https://huggingface.co/IEETA/BioNExt-Extractor
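For quick experimentation, the checkpoints can likely be pulled with the standard `transformers` auto classes; the snippet below is a minimal sketch that assumes the tagger loads as an ordinary token-classification model (consult the model cards for the exact heads and usage):

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Assumption: the tagger checkpoint exposes a standard token-classification head.
tokenizer = AutoTokenizer.from_pretrained("IEETA/BioNExt-Tagger")
model = AutoModelForTokenClassification.from_pretrained("IEETA/BioNExt-Tagger")

inputs = tokenizer("BRCA1 mutations are associated with breast cancer.",
                   return_tensors="pt")
outputs = model(**inputs)  # per-token logits over the entity label set
```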
- Tiago Almeida (ORCID: 0000-0002-4258-3350)
- Richard A A Jonker (ORCID: 0000-0002-3806-6940)
- Rui Antunes (ORCID: 0000-0003-3533-8872)
- João R Almeida (ORCID: 0000-0003-0729-2264)
- Sérgio Matos (ORCID: 0000-0003-1941-3983)