# 🦠 microbELP

[![microbELP](https://img.shields.io/badge/GitHub-microbELP-181717?logo=github)](https://github.com/omicsNLP/microbELP)


This notebook showcases our microbiome entity normalisation module, which automatically links microbial mentions (bacteria, archaea, fungi) to their NCBI Taxonomy identifiers using either the deep learning (DL) or the non-DL method.

# ⚙️ Installation

microbELP depends on a number of other Python packages, so we recommend installing it in an isolated environment.

In [None]:
!git clone https://github.com/omicsNLP/microbELP.git

In [None]:
!pip install ./microbELP
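
After installation, it can be useful to confirm that the package is visible to the current environment before running the cells below. The snippet is a minimal sketch using only the standard library; the helper name `is_importable` is ours and not part of microbELP.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the import system can locate `module_name`."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


# `microbELP` is the import name used in the cells of this notebook.
print("microbELP importable:", is_importable("microbELP"))
```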

## 🔗 Normalisation Utility


The package includes two helper functions for standalone microbial name normalisation: one non–DL and one DL [![microbELP](https://img.shields.io/badge/GitHub-microbELP-181717?logo=github)](https://github.com/omicsNLP/microbELP/tree/main?tab=readme-ov-file#-normalisation-utility).

### 📜 Non–DL Normalisation

This section briefly introduces our non-deep-learning function for microbiome entity normalisation. It runs on CPU only. The first call below takes a single string as input and the second a list of strings. The parameters and output structure are documented here: [![microbELP](https://img.shields.io/badge/GitHub-microbELP-181717?logo=github)](https://github.com/omicsNLP/microbELP/tree/main?tab=readme-ov-file#-nondl-normalisation).

In [3]:
from microbELP import microbiome_normalisation

Using a string:

In [4]:
microbiome_normalisation('Eubacterium rectale')

'NCBI:txid39491'
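
The returned identifier is a plain string, so the numeric taxon ID can be recovered with basic string handling. The sketch below assumes the `NCBI:txid<digits>` format shown in the output above. The helper names and the NCBI Taxonomy Browser URL pattern are illustrative additions, not part of microbELP.

```python
NCBI_PREFIX = "NCBI:txid"
# URL pattern of the NCBI Taxonomy Browser (an assumption of this sketch).
TAXONOMY_URL = "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id={taxid}"


def parse_taxid(identifier: str) -> int:
    """Extract the numeric taxon ID from a string such as 'NCBI:txid39491'."""
    if not identifier.startswith(NCBI_PREFIX):
        raise ValueError(f"Unexpected identifier format: {identifier!r}")
    return int(identifier[len(NCBI_PREFIX):])


def taxonomy_url(identifier: str) -> str:
    """Build a Taxonomy Browser link for a normalised identifier."""
    return TAXONOMY_URL.format(taxid=parse_taxid(identifier))


print(parse_taxid("NCBI:txid39491"))
print(taxonomy_url("NCBI:txid39491"))
```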

Using a list of strings:

In [5]:
microbiome_normalisation(['Eubacterium rectale', 'bacteria'])

[{'Eubacterium rectale': 'NCBI:txid39491'}, {'bacteria': 'NCBI:txid2'}]
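
With a list input, the function returns one single-entry dictionary per mention. For downstream lookups it may be more convenient to merge these into one mapping. A minimal sketch, using the output above as literal sample data (the helper `merge_results` is hypothetical and not part of microbELP):

```python
def merge_results(results: list[dict[str, str]]) -> dict[str, str]:
    """Merge a list of {mention: identifier} dicts into a single dict."""
    merged: dict[str, str] = {}
    for entry in results:
        # A repeated mention overwrites the earlier entry for that mention.
        merged.update(entry)
    return merged


# Sample data copied from the output of the cell above.
results = [{"Eubacterium rectale": "NCBI:txid39491"}, {"bacteria": "NCBI:txid2"}]
lookup = merge_results(results)
print(lookup["bacteria"])
```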

### ⚡ DL Normalisation

This section briefly introduces our deep learning function for microbiome entity normalisation. It runs on both CPU and GPU; this notebook uses a GPU runtime because loading the vocabulary, while possible on CPU, can be very time-consuming. The first call below takes a single string as input and the second a list of strings. The parameters and output structure are documented here: [![microbELP](https://img.shields.io/badge/GitHub-microbELP-181717?logo=github)](https://github.com/omicsNLP/microbELP/tree/main?tab=readme-ov-file#-dl-normalisation).

In [6]:
from microbELP import microbiome_biosyn_normalisation

Using a string:

In [7]:
microbiome_biosyn_normalisation('Helicobacter pylori')

GPU detected, running the code using the GPU.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/638 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/433M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/359 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

sparse_encoder.pk:   0%|          | 0.00/47.8k [00:00<?, ?B/s]

sparse_weight.pt:   0%|          | 0.00/829 [00:00<?, ?B/s]

100%|██████████| 452/452 [00:21<00:00, 21.13it/s]
embedding dictionary: 100%|██████████| 452/452 [09:49<00:00,  1.31s/it]


[{'mention': 'Helicobacter pylori',
  'candidates': [{'NCBI:txid210': 'Helicobacter pylori'},
   {'NCBI:txid210': 'helicobacter pylori'},
   {'NCBI:txid210': 'Campylobacter pylori'},
   {'NCBI:txid210': 'campylobacter pylori'},
   {'NCBI:txid210': 'Campylobacter pyloridis'}]}]

Using a list of strings:

In [8]:
microbiome_biosyn_normalisation(['bacteria', 'Eubacterium rectale', 'Helicobacter pylori'])

GPU detected, running the code using the GPU.


100%|██████████| 452/452 [00:08<00:00, 53.52it/s]
embedding dictionary: 100%|██████████| 452/452 [09:59<00:00,  1.33s/it]


[{'mention': 'bacteria',
  'candidates': [{'NCBI:txid2': 'Bacteria'},
   {'NCBI:txid2': 'bacteria'},
   {'NCBI:txid1869227': 'bacteria bacterium'},
   {'NCBI:txid1869227': 'Bacteria bacterium'},
   {'NCBI:txid1573883': 'bacterium associated'}]},
 {'mention': 'Eubacterium rectale',
  'candidates': [{'NCBI:txid39491': 'Eubacterium rectale'},
   {'NCBI:txid39491': 'eubacterium rectale'},
   {'NCBI:txid39491': 'pseudobacterium rectale'},
   {'NCBI:txid39491': 'Pseudobacterium rectale'},
   {'NCBI:txid39491': 'e. rectale'}]},
 {'mention': 'Helicobacter pylori',
  'candidates': [{'NCBI:txid210': 'Helicobacter pylori'},
   {'NCBI:txid210': 'helicobacter pylori'},
   {'NCBI:txid210': 'Campylobacter pylori'},
   {'NCBI:txid210': 'campylobacter pylori'},
   {'NCBI:txid210': 'Campylobacter pyloridis'}]}]
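
Each result pairs a mention with a list of single-entry `{identifier: name}` candidates. The sketch below uses a trimmed copy of the output above as sample data and pulls out the first candidate and the distinct identifiers per mention. It assumes the candidates are ordered from best to worst match, which the output suggests but this notebook does not state. `summarise` is a hypothetical helper, not part of microbELP.

```python
def summarise(results: list[dict]) -> dict[str, dict]:
    """Map each mention to its first candidate ID and its distinct candidate IDs."""
    summary = {}
    for result in results:
        # Each candidate is a single-entry dict: {identifier: dictionary name}.
        ids = [next(iter(candidate)) for candidate in result["candidates"]]
        summary[result["mention"]] = {
            "top_id": ids[0] if ids else None,
            # dict.fromkeys removes duplicates while preserving order.
            "distinct_ids": list(dict.fromkeys(ids)),
        }
    return summary


# Trimmed sample copied from the output above.
sample = [
    {"mention": "bacteria",
     "candidates": [{"NCBI:txid2": "Bacteria"},
                    {"NCBI:txid2": "bacteria"},
                    {"NCBI:txid1869227": "bacteria bacterium"},
                    {"NCBI:txid1869227": "Bacteria bacterium"},
                    {"NCBI:txid1573883": "bacterium associated"}]},
    {"mention": "Helicobacter pylori",
     "candidates": [{"NCBI:txid210": "Helicobacter pylori"},
                    {"NCBI:txid210": "helicobacter pylori"}]},
]

for mention, info in summarise(sample).items():
    print(mention, info["top_id"], info["distinct_ids"])
```

For `bacteria`, the distinct identifiers make it visible that the candidate list spans more than one taxon, whereas all candidates for `Helicobacter pylori` share a single identifier.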