# 🦠 microbELP

[![microbELP](https://img.shields.io/badge/GitHub-microbELP-181717?logo=github)](https://github.com/omicsNLP/microbELP)


The notebook showcases our corpus generation and main annotation modules designed for automatic recognition and normalisation of microbiome-related entities in biomedical literature.

# ⚙️ Installation

MicrobELP has a number of dependencies on other Python packages; it is recommended to install it in an isolated environment.

In [None]:
!git clone https://github.com/omicsNLP/microbELP.git

In [None]:
!pip install ./microbELP

First, a quick look at what is currently in the working directory, so that we can track what changes as we run the code.

In [3]:
!ls

microbELP  sample_data


## 🧾 PMCID retrieval and conversion to BioC

This function automatically retrieves Open Access publications from PubMed Central and converts them into BioC JSON format. You can learn more here: [![microbELP](https://img.shields.io/badge/GitHub-microbELP-181717?logo=github)](https://github.com/omicsNLP/microbELP/blob/main/README.md#-pmcid-retrieval-and-conversion-to-bioc)

To run the corpus generation, you need a text file listing one PMCID per line; such a file can be obtained from the PMC search engine. I recommend adding 'AND open access [filter]' to your query, as the function can only retrieve Open Access publications. For simplicity, I will create the text file here from a list of 3 PMCIDs.

In [4]:
pmcids = [
    'PMC11638674',
    'PMC10031919',
    'PMC7032713'
]

In [5]:
# Write one PMCID per line, with no trailing newline.
# Mode "w" overwrites the file, so re-running this cell does not duplicate entries.
with open("pmcid.txt", "w") as f:
  f.write("\n".join(pmcids))

Running 'ls' again shows that a new file is listed.

In [6]:
!ls

microbELP  pmcid.txt  sample_data


Using 'cat', we can quickly check that the file corresponds to our list.

In [7]:
!cat pmcid.txt

PMC11638674
PMC10031919
PMC7032713
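
Because the retrieval step reads one PMCID per line, a quick sanity check of the file can catch blank lines or malformed entries before any request is sent. The sketch below is not part of microbELP: `read_pmcids` is a hypothetical helper, and the pattern (`PMC` followed by digits) is an assumption about the identifier format, based on the list above.

```python
import re
from pathlib import Path

# Assumed format: the literal prefix "PMC" followed by one or more digits.
PMCID_PATTERN = re.compile(r"^PMC\d+$")


def read_pmcids(path):
    """Return (valid, invalid) entries read from a one-per-line text file."""
    valid, invalid = [], []
    for line in Path(path).read_text().splitlines():
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        if PMCID_PATTERN.match(entry):
            valid.append(entry)
        else:
            invalid.append(entry)
    return valid, invalid
```

Calling `read_pmcids("pmcid.txt")` on the file created above should return the three identifiers as valid and an empty list of invalid entries.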

In [9]:
from microbELP import pmcid_to_microbiome

# Replace `my_email` with a string containing your own email address.
pmcid_to_microbiome('./pmcid.txt', my_email)

Starting the retrieval process.
Retrieving file: 1 out of 3 from the NCBI API.
Retrieving file: 2 out of 3 from the NCBI API.
Retrieving file: 3 out of 3 from the NCBI API.
Starting the conversion process.
Converting file: 1 out of 3 to BioC.
Converting file: 2 out of 3 to BioC.
Converting file: 3 out of 3 to BioC.
Process complete!


After collecting the files, a third check shows the new directory 'microbELP_PMCID_microbiome', which contains the original XML files from PMC and their corresponding BioC conversions.

In [10]:
!ls

microbELP  microbELP_PMCID_microbiome  pmcid.txt  sample_data


In [11]:
!ls ./microbELP_PMCID_microbiome

bioc  PMCID_XML


Here we can read one file to see its structure.

In [12]:
import json

with open('./microbELP_PMCID_microbiome/bioc/PMC11638674_bioc.json') as f:
  data = json.load(f)

data

{'source': 'Auto-CORPus (XML)',
 'date': '20251113',
 'key': 'autocorpus_fulltext.key',
 'infons': {'pmcid': 'PMC11638674',
  'pmid': '39678196',
  'doi': '10.3389/fendo.2024.1416611',
  'link': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11638674/',
  'journal': 'Frontiers in Endocrinology',
  'pub_type': 'Endocrinology',
  'year': '2024',
  'license': 'This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.'},
 'documents': [{'id': 'PMC11638674',
   'infons': {},
   'passages': [{'offset': 0,
     'infons': {'section_title_1': 'document title',
      'iao_name_1': 'document title',
      'iao_id_
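
The full dictionary is long, so a compact overview can be easier to read than the raw dump. The sketch below counts passages per `iao_name_1` label, relying only on the keys visible in the output above (`documents`, `passages`, `infons`). `section_overview` is a hypothetical helper, not a microbELP function, and the example dictionary is hand-made rather than real data.

```python
from collections import Counter


def section_overview(bioc):
    """Count passages per 'iao_name_1' label across all documents in a BioC dict."""
    counts = Counter()
    for document in bioc.get("documents", []):
        for passage in document.get("passages", []):
            # Passages without the label are grouped under "unlabelled".
            label = passage.get("infons", {}).get("iao_name_1", "unlabelled")
            counts[label] += 1
    return counts


# Tiny hand-made structure mirroring the keys shown above (not real data).
example = {
    "documents": [
        {
            "id": "EXAMPLE",
            "passages": [
                {"offset": 0, "infons": {"iao_name_1": "document title"}, "text": "A title"},
                {"offset": 8, "infons": {"iao_name_1": "methods section"}, "text": "Some methods."},
                {"offset": 22, "infons": {"iao_name_1": "methods section"}, "text": "More methods."},
            ],
        }
    ]
}

print(section_overview(example).most_common())
```

On the notebook's data, `section_overview(data)` would give the same kind of summary for the loaded file.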

## 🧰 Main pipeline - non–DL


This example runs the non-DL pipeline on a folder of BioC files whose names end with _bioc.json. In this case I will be using the files we just collected. You can learn more about the parameters here: [![microbELP](https://img.shields.io/badge/GitHub-microbELP-181717?logo=github)](https://github.com/omicsNLP/microbELP/blob/main/README.md#-main-pipeline---nondl)

In [13]:
from microbELP import microbELP

microbELP('./microbELP_PMCID_microbiome/bioc/')

Annotating file:  1 of:  3 PMC11638674_bioc.json
Input file found
|████████████████████████████████████████| 103/103 [100%] in 12:03.8 (0.14/s) 
['Prokaryote', 'Bacteria', 'Actinomyces', 'Catonella', 'Streptococcus anginosus', 'Lautropia mirabilis', 'Veillonella', 'Leptotrichia buccalis', 'Streptococcus', 'Desulfobulbus', 'Prevotella copri', 'Neisseria', 'Porphyromonadaceae', 'Streptococcaceae', 'Proteobacteria', 'Neisseriaceae', 'Bacteroidetes', 'Firmicutes', 'Tannerella forsythia', 'Eubacterium nodatum', 'Leptotrichiaceae', 'Actinobacteria', 'Staphylococcus', 'Fusobacterium nucleatum', 'Filifactor', 'Pasteurellaceae', 'Fusobacteriota', 'Fusobacteria', 'Synergistia', 'Lactobacillus', 'Leptotrichia', 'Aggregatibacter', 'Veillonellaceae', 'Alloprevotella rava', 'Streptococcus agalactiae', 'Tannerella', 'Capnocytophaga', 'Actinobacillus', 'Rothia dentocariosa', 'Haemophilus', 'Fusobacterium', 'Prevotella', 'Actinomyces naeslundii', 'Bulleidia', 'Rothia', 'Treponema', 'Pseudomonas', 'Porp

Let's check if the annotated BioC files have been saved in the newly created directory `'microbELP_result/'`.



In [14]:
!ls ./microbELP_result

PMC10031919_bioc.json  PMC11638674_bioc.json  PMC7032713_bioc.json


Let's look at the difference between the unannotated and annotated BioC.



In [15]:
with open('./microbELP_PMCID_microbiome/bioc/PMC11638674_bioc.json') as f:
  data = json.load(f)

data['documents'][0]['passages'][10]

{'offset': 6212,
 'infons': {'section_title_1': 'Methodology',
  'iao_name_1': 'methods section',
  'iao_id_1': 'IAO:0000317'},
 'text': 'The HMP complements this by employing advanced techniques like 16S rRNA gene sequencing and whole-genome shotgun sequencing to characterize the microbiome across different regions of the oral cavity (11). This project has highlighted the varied abundance and distribution of bacteria such as Streptococcus and Prevotella in specific oral sites, providing crucial context for understanding the microbial landscape in health and disease, including diabetic conditions.',
 'sentences': [],
 'annotations': [],
 'relations': []}

In the output above, 'annotations' is an empty list, whereas the annotated version below contains entries.



In [16]:
with open('./microbELP_result/PMC11638674_bioc.json') as f:
  data = json.load(f)

data['documents'][0]['passages'][10]

{'offset': 6212,
 'infons': {'section_title_1': 'Methodology',
  'iao_name_1': 'methods section',
  'iao_id_1': 'IAO:0000317'},
 'text': 'The HMP complements this by employing advanced techniques like 16S rRNA gene sequencing and whole-genome shotgun sequencing to characterize the microbiome across different regions of the oral cavity (11). This project has highlighted the varied abundance and distribution of bacteria such as Streptococcus and Prevotella in specific oral sites, providing crucial context for understanding the microbial landscape in health and disease, including diabetic conditions.',
 'sentences': [],
 'annotations': [{'text': 'bacteria',
   'infons': {'identifier': 'NCBI:txid2',
    'type': 'bacteria_superkingdom',
    'annotator': 'microbELP@omicsNLP.ic.ac.uk',
    'date': '2025-11-13 20:46:58',
    'parent_taxonomic_id': 'noParentIDinList'},
   'id': '09',
   'locations': {'length': 8, 'offset': 6487}},
  {'text': 'Streptococcus',
   'infons': {'identifier': 'NCBI:tx
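
In the output above, the annotation offset (6487) minus the passage offset (6212) is 275, which is where "bacteria" starts in the passage text. This suggests that annotation offsets are counted from the start of the document rather than from the start of the passage. The sketch below uses that reading to recover the annotated span; `annotation_span` is a hypothetical helper, it assumes `locations` is a single dictionary as in the non-DL output, and the passage is hand-made.

```python
def annotation_span(passage, annotation):
    """Return the passage substring that an annotation's location points to.

    Assumes 'locations' is one dict with 'offset' and 'length' (non-DL output)
    and that offsets are counted from the start of the document.
    """
    location = annotation["locations"]
    start = location["offset"] - passage["offset"]
    return passage["text"][start:start + location["length"]]


# Hand-made passage using the same keys as the output above (not real data).
passage = {"offset": 100, "text": "Abundance of bacteria such as Streptococcus"}
annotation = {"text": "bacteria", "locations": {"offset": 113, "length": 8}}

# The recovered span should match the annotation's own 'text' field.
assert annotation_span(passage, annotation) == annotation["text"]
```

Running the same comparison over every annotation in a result file is one way to confirm that offsets and surface text stay aligned.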

## 🧰 Main pipeline - DL


This example runs the DL pipeline on a folder of BioC files whose names end with _bioc.json. In this case I will be using the files we just collected. You can learn more about the parameters here: [![microbELP](https://img.shields.io/badge/GitHub-microbELP-181717?logo=github)](https://github.com/omicsNLP/microbELP/blob/main/README.md#-main-pipeline---dl)

In [17]:
from microbELP import microbELP_DL

microbELP_DL('./microbELP_PMCID_microbiome/bioc/')

GPU detected, running the code using the GPU.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/638 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/433M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/359 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

sparse_encoder.pk:   0%|          | 0.00/47.8k [00:00<?, ?B/s]

sparse_weight.pt:   0%|          | 0.00/829 [00:00<?, ?B/s]

100%|██████████| 452/452 [00:09<00:00, 47.89it/s]
embedding dictionary: 100%|██████████| 452/452 [09:44<00:00,  1.29s/it]


tokenizer_config.json:   0%|          | 0.00/427 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/809 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/431M [00:00<?, ?B/s]

Processing file 1 out of 3.
Processing file 2 out of 3.
Processing file 3 out of 3.


Let's check if the annotated BioC files have been saved in the newly created directory `'microbELP_DL_result/'`.

In [18]:
!ls ./microbELP_DL_result

PMC10031919_bioc.json  PMC11638674_bioc.json  PMC7032713_bioc.json


Let's look at the difference between the unannotated and annotated BioC.

In [19]:
with open('./microbELP_PMCID_microbiome/bioc/PMC11638674_bioc.json') as f:
  data = json.load(f)

data['documents'][0]['passages'][10]

{'offset': 6212,
 'infons': {'section_title_1': 'Methodology',
  'iao_name_1': 'methods section',
  'iao_id_1': 'IAO:0000317'},
 'text': 'The HMP complements this by employing advanced techniques like 16S rRNA gene sequencing and whole-genome shotgun sequencing to characterize the microbiome across different regions of the oral cavity (11). This project has highlighted the varied abundance and distribution of bacteria such as Streptococcus and Prevotella in specific oral sites, providing crucial context for understanding the microbial landscape in health and disease, including diabetic conditions.',
 'sentences': [],
 'annotations': [],
 'relations': []}

In the output above, 'annotations' is an empty list, whereas the annotated version below contains entries.


In [20]:
with open('./microbELP_DL_result/PMC11638674_bioc.json') as f:
  data = json.load(f)

data['documents'][0]['passages'][10]

{'offset': 6212,
 'infons': {'section_title_1': 'Methodology',
  'iao_name_1': 'methods section',
  'iao_id_1': 'IAO:0000317'},
 'text': 'The HMP complements this by employing advanced techniques like 16S rRNA gene sequencing and whole-genome shotgun sequencing to characterize the microbiome across different regions of the oral cavity (11). This project has highlighted the varied abundance and distribution of bacteria such as Streptococcus and Prevotella in specific oral sites, providing crucial context for understanding the microbial landscape in health and disease, including diabetic conditions.',
 'sentences': [],
 'annotations': [{'id': '10',
   'infons': {'type': 'microbiome',
    'identifier': 'NCBI:txid2',
    'annotator': 'microbELP@omicsNLP.github',
    'updated_at': '2025-11-13T21:29:06Z'},
   'text': 'bacteria',
   'locations': [{'offset': 6487, 'length': 8}]},
  {'id': '11',
   'infons': {'type': 'microbiome',
    'identifier': 'NCBI:txid1301',
    'annotator': 'microbELP@o
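
Comparing the two annotated outputs, the non-DL pipeline stores `locations` as a single dictionary, while the DL pipeline stores a list of dictionaries. Code that consumes both needs to handle each shape. The sketch below flattens either form into the same records; `flatten_annotation` is a hypothetical helper, and the two sample annotations copy only the fields visible in the outputs above.

```python
def flatten_annotation(annotation):
    """Return one record per location, whichever shape 'locations' has."""
    locations = annotation.get("locations", [])
    if isinstance(locations, dict):
        # Non-DL output: a single {'length': ..., 'offset': ...} dict.
        locations = [locations]
    identifier = annotation.get("infons", {}).get("identifier")
    return [
        {
            "text": annotation.get("text"),
            "identifier": identifier,
            "offset": location["offset"],
            "length": location["length"],
        }
        for location in locations
    ]


# Fields copied from the two outputs shown in this notebook.
non_dl = {
    "text": "bacteria",
    "infons": {"identifier": "NCBI:txid2"},
    "locations": {"length": 8, "offset": 6487},
}
dl = {
    "text": "bacteria",
    "infons": {"identifier": "NCBI:txid2"},
    "locations": [{"offset": 6487, "length": 8}],
}

# Both shapes reduce to the same flat record.
assert flatten_annotation(non_dl) == flatten_annotation(dl)
```

With both pipelines reduced to the same record shape, their results can be compared passage by passage.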

## 🐧 Linux / 🍎 macOS / 💠 Cygwin (Linux-like on Windows)

### 🧰 Main pipeline - non–DL (CPU only)

This example runs the non-DL pipeline in parallel on a folder of BioC files whose names end with _bioc.json. In this case I will be using the files we just collected. You can learn more about the parameters here: [![microbELP](https://img.shields.io/badge/GitHub-microbELP-181717?logo=github)](https://github.com/omicsNLP/microbELP/blob/main/README.md#-main-pipeline---nondl-cpu-only)

First we need to check the number of cores available.

In [21]:
import multiprocessing as mp

mp.cpu_count()

2

Now I will use the 2 cores to annotate all of our documents.

In [23]:
from microbELP import parallel_microbELP

parallel_microbELP(
  './microbELP_PMCID_microbiome/bioc/',
  2
)

The number of cores you want to use is equal or greater than the numbers of cores in your machine. We stop the script now
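
Judging from the message above, the requested count appears to need to be strictly lower than the number of cores on the machine. A small helper can derive such a value instead of hard-coding it. This is a sketch under that assumption; `usable_cores` is hypothetical and not part of microbELP.

```python
import multiprocessing as mp


def usable_cores(total=None):
    """Return the largest core count strictly below `total`, but never less than 1."""
    if total is None:
        total = mp.cpu_count()
    return max(1, total - 1)


print(usable_cores())
```

On a single-core machine this still returns 1, which may or may not be accepted by the library given the message above.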


Because of the limited number of cores on Colab, I need to reduce this to 1.

In [24]:
from microbELP import parallel_microbELP

parallel_microbELP(
  './microbELP_PMCID_microbiome/bioc/',
  1
)

No new document to annotate.


This time nothing runs: the library supports incremental updates, and all the documents are already annotated. I will now delete the annotations to start over.

In [25]:
!rm -r ./microbELP_result/
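
To see which inputs would count as new at any point, the input and output folders can be compared by file name. The sketch below assumes that the incremental update matches files by name, which is consistent with the listings above but is not confirmed by the source; `pending_files` is a hypothetical helper.

```python
from pathlib import Path


def pending_files(input_dir, output_dir):
    """List *_bioc.json names present in input_dir but absent from output_dir."""
    output_path = Path(output_dir)
    done = set()
    if output_path.is_dir():
        done = {path.name for path in output_path.glob("*_bioc.json")}
    return sorted(
        path.name
        for path in Path(input_dir).glob("*_bioc.json")
        if path.name not in done
    )
```

After the deletion above, `pending_files('./microbELP_PMCID_microbiome/bioc/', './microbELP_result/')` should list all three input files again.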

In [26]:
from microbELP import parallel_microbELP

parallel_microbELP(
  './microbELP_PMCID_microbiome/bioc/',
  1
)

13/11/2025, 21:34:17 Process starting
Process 1 starts annotating file:  1 of:  3 ./microbELP_PMCID_microbiome/bioc/PMC11638674_bioc.json
Process 1 finished annotating file:  1 of:  3  and found the following list of entities:  ['Prokaryote', 'Bacteria', 'Actinomyces', 'Catonella', 'Streptococcus anginosus', 'Lautropia mirabilis', 'Veillonella', 'Leptotrichia buccalis', 'Streptococcus', 'Desulfobulbus', 'Prevotella copri', 'Neisseria', 'Porphyromonadaceae', 'Streptococcaceae', 'Proteobacteria', 'Neisseriaceae', 'Bacteroidetes', 'Firmicutes', 'Tannerella forsythia', 'Eubacterium nodatum', 'Leptotrichiaceae', 'Actinobacteria', 'Staphylococcus', 'Fusobacterium nucleatum', 'Filifactor', 'Pasteurellaceae', 'Fusobacteriota', 'Fusobacteria', 'Synergistia', 'Lactobacillus', 'Leptotrichia', 'Aggregatibacter', 'Veillonellaceae', 'Alloprevotella rava', 'Streptococcus agalactiae', 'Tannerella', 'Capnocytophaga', 'Actinobacillus', 'Rothia dentocariosa', 'Haemophilus', 'Fusobacterium', 'Prevotella',

In [27]:
with open('./microbELP_result/PMC11638674_bioc.json') as f:
  data = json.load(f)

data['documents'][0]['passages'][10]

{'offset': 6212,
 'infons': {'section_title_1': 'Methodology',
  'iao_name_1': 'methods section',
  'iao_id_1': 'IAO:0000317'},
 'text': 'The HMP complements this by employing advanced techniques like 16S rRNA gene sequencing and whole-genome shotgun sequencing to characterize the microbiome across different regions of the oral cavity (11). This project has highlighted the varied abundance and distribution of bacteria such as Streptococcus and Prevotella in specific oral sites, providing crucial context for understanding the microbial landscape in health and disease, including diabetic conditions.',
 'sentences': [],
 'annotations': [{'text': 'bacteria',
   'infons': {'identifier': 'NCBI:txid2',
    'type': 'bacteria_superkingdom',
    'annotator': 'microbELP@omicsNLP.ic.ac.uk',
    'date': '2025-11-13 21:35:30',
    'parent_taxonomic_id': 'noParentIDinList'},
   'id': '09',
   'locations': {'length': 8, 'offset': 6487}},
  {'text': 'Streptococcus',
   'infons': {'identifier': 'NCBI:tx
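
Once a file is annotated, a frequency table of identifiers gives a quick picture of which taxa dominate a paper. The sketch below counts annotations per `identifier` using only the keys shown in the outputs above; `entity_counts` is a hypothetical helper, and the example dictionary is hand-made rather than real output.

```python
from collections import Counter


def entity_counts(bioc):
    """Count annotations per 'identifier' across all passages of a BioC dict."""
    counts = Counter()
    for document in bioc.get("documents", []):
        for passage in document.get("passages", []):
            for annotation in passage.get("annotations", []):
                identifier = annotation.get("infons", {}).get("identifier")
                counts[identifier] += 1
    return counts


# Hand-made example with the same nesting as the annotated files (not real data).
example = {
    "documents": [
        {
            "passages": [
                {"annotations": [
                    {"text": "bacteria", "infons": {"identifier": "NCBI:txid2"}},
                    {"text": "Streptococcus", "infons": {"identifier": "NCBI:txid1301"}},
                ]},
                {"annotations": [
                    {"text": "Bacteria", "infons": {"identifier": "NCBI:txid2"}},
                ]},
                {"annotations": []},
            ]
        }
    ]
}

print(entity_counts(example).most_common())
```

Applied to the notebook's data, `entity_counts(data)` would summarise the file loaded in the cell above.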