This repository contains the code and models of a metabolite named entity recognition (NER) project. The main features of this repository are as follows:
- An automatically annotated metabolomics training corpus (TrainingSet.txt, TrainingSetAnnot.tsv) in the Corpus directory;
- A manually annotated metabolomics test corpus (GoldStandard.txt, GoldStandardAnnot.tsv) in the Corpus directory;
- An automatically annotated metabolomics corpus (MetabolomicsCorpus.txt, MetabolomicsCorpusAnnot.tsv) in the Corpus directory, which comprises all TrainingSet data and the automatically annotated results of the test set;
- A rule-based annotation pipeline that automatically annotates (AutoCORPus-processed) publications (generate_corpus.py);
- MetaboListem, a machine learning model that recognises metabolite names (metabolistem_model.py), adapted from a chemical NER model named ChemListem;
- TABoLiSTM, a BERT-based machine learning model that recognises metabolite names (tabolistem_model.py), adapted from MetaboListem.
This project is written in Python 3.7.10 and has been tested on Windows 10. Compatibility with other Python versions and operating systems has not been verified.
To use this package, the following dependencies are required:
transformers == 4.7.0
scikit-learn == 0.24.1
numpy == 1.19.2
pandas == 1.2.3
tensorflow-gpu == 2.4.0
In addition, using generate_corpus requires an extra package:
spacy == 3.0.6
MetaboListem is a machine-learning-based metabolite NER algorithm, adapted from a chemical NER model called ChemListem.
This repository includes a trained MetaboListem model in the Models folder. You can import and load the model as follows:
import metabolistem_model
json_path = 'PATH/TO/metabolistem.json'
model_path = 'PATH/TO/metabolistem.h5'
mm = metabolistem_model.MetaboListem()
mm.load(json_path,model_path)
Then you can process your text with, for example:
text='Glucose, glutamine and lactate are the most frequently mentioned metabolites in cancer studies. '
mm.process(text)
which would return a list:
[(0, 7, 'Glucose'), (9, 18, 'glutamine'), (23, 30, 'lactate')]
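Each item in the output list is a tuple of the form (start_idx, end_idx, metabolite), where
- start_idx is the position of the first character of the metabolite in the text;
- end_idx is the position immediately after its last character, so that text[start_idx:end_idx] gives the metabolite in the example above;
- metabolite is the recognised metabolite name.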
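A minimal sketch of how these offsets can be used, assuming end_idx is exclusive as the example output suggests. The entities list is copied from the output above rather than produced by a loaded model, and extract_spans is a hypothetical helper that is not part of metabolistem_model.

```python
# Example text and output copied from the README; no model is loaded here.
text = ('Glucose, glutamine and lactate are the most frequently '
        'mentioned metabolites in cancer studies. ')
entities = [(0, 7, 'Glucose'), (9, 18, 'glutamine'), (23, 30, 'lactate')]


def extract_spans(text, entities):
    """Slice each (start_idx, end_idx) pair out of the text (hypothetical helper)."""
    return [text[start:end] for start, end, _ in entities]


spans = extract_spans(text, entities)

# Collect any tuple whose slice differs from the stored metabolite string.
mismatches = [
    (start, end, name)
    for (start, end, name), span in zip(entities, spans)
    if span != name
]

print(spans)
print('mismatches:', mismatches)
```

A check of this kind is a cheap way to confirm that the offsets refer to the same string that was passed to the model, for instance after any pre-processing of the text.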
Similarly, you can process a batch of texts by calling
texts=['We found that 1-methyl-6,7-dihydroxy-1,2,3,4-tetrahydroisoquinoline,',
'Morphine and hippuric acid were higher in the disease group.']
mm.batchprocess(texts)
which returns
[[(14, 68, '1-methyl-6,7-dihydroxy-1,2,3,4-tetrahydroisoquinoline')],
[(0, 8, 'Morphine'), (13, 26, 'hippuric acid')]]
that is, a list of lists in which each inner list holds the entity recognition output for the corresponding input text.
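The nested output can be flattened into one row per entity, which is often easier to write to a table. The sketch below works on the batch output copied from the example above; flatten_batch is a hypothetical helper and not part of metabolistem_model.

```python
# Batch output copied from the example above; no model is loaded here.
batch_output = [
    [(14, 68, '1-methyl-6,7-dihydroxy-1,2,3,4-tetrahydroisoquinoline')],
    [(0, 8, 'Morphine'), (13, 26, 'hippuric acid')],
]


def flatten_batch(batch_output):
    """Turn a list of per-text entity lists into (text_idx, start, end, name) rows."""
    rows = []
    for text_idx, entities in enumerate(batch_output):
        for start, end, name in entities:
            rows.append((text_idx, start, end, name))
    return rows


rows = flatten_batch(batch_output)
for row in rows:
    print(row)
```

Keeping the index of the input text in each row preserves the link between an entity and the string it came from once the nesting is removed.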
Before training your own MetaboListem model with our architecture, it is recommended to obtain the 6B 300d pre-trained word embeddings from GloVe by downloading glove.6B.zip and extracting glove.6B.300d.txt. Although recommended, the word embedding file is optional and you can proceed without it. Other files (words.txt) can be found in the ChemListem repository.
The MetaboListem model can be trained as in the following example:
import metabolistem_model
mm = metabolistem_model.MetaboListem()
mm.train(textfile,annotfile,glovefile,runname)
where the arguments of mm.train are:
- textfile: the filename of the training text file, e.g. "TrainingSet.txt"
- annotfile: the filename of the training annotation file, e.g. "TrainingSetAnnot.tsv"
- glovefile: None, or the filename of the GloVe file, e.g. "glove.6B.300d.txt"
- runname: a string that becomes part of the output filenames.
This produces a trained model, which consists of two main files:
- metabolistem_$RUNNAME.h5: the Keras model
- metabolistem_$RUNNAME.json: auxiliary information
Each epoch of the training process also produces its own model file, named:
epoch_$EPOCHNUM_$RUNNAME.h5
Only one json file is produced, as the auxiliary information does not depend on the epoch.
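To make the naming patterns concrete, the sketch below builds the filenames expected for a given run name. expected_output_files is a hypothetical helper. How the epoch number is formatted (its starting value and any zero-padding) is an assumption, so the epoch numbers are passed in explicitly rather than guessed.

```python
def expected_output_files(runname, epoch_numbers):
    """Return (final_files, per_epoch_files) following the naming patterns above.

    Assumption: the epoch number appears in the filename as a plain integer.
    """
    final_files = [
        f'metabolistem_{runname}.h5',    # the Keras model
        f'metabolistem_{runname}.json',  # auxiliary information
    ]
    per_epoch_files = [f'epoch_{n}_{runname}.h5' for n in epoch_numbers]
    return final_files, per_epoch_files


# 'demo' and the epoch numbers below are illustrative values only.
final_files, epoch_files = expected_output_files('demo', [0, 1, 2])
print(final_files)
print(epoch_files)
```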
TABoLiSTM is a model adapted from MetaboListem above; the main difference is that the GloVe word embeddings are replaced with pre-trained BERT embeddings.
Applying TABoLiSTM models follows a process very similar to that of MetaboListem.
The first step is to import and load the model (available on Zenodo):
import tabolistem_model
json_path='PATH/TO/tabolistem.json'
weight_path='PATH/TO/tabolistem_weights' ## no suffix
tm = tabolistem_model.TaboListem()
tm.load(json_path,weight_path)
Then, as with MetaboListem, you can process your text by calling
tm.process(text)
which returns a list of tuples (start_idx, end_idx, entity).
Alternatively, calling
tm.batchprocess(texts)
is recommended when multiple strings are to be processed at once, as it shortens the runtime.
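For a very large list of strings, one option is to pass the texts to the batch call in fixed-size chunks. The sketch below is only an illustration under assumptions: process_in_chunks and the chunk size are hypothetical, fake_batchprocess is a stand-in so the example runs without a trained model, and whether tm.batchprocess already batches internally is not stated here.

```python
def process_in_chunks(batch_fn, texts, chunk_size=32):
    """Call batch_fn on successive slices of texts and concatenate the results."""
    if chunk_size < 1:
        raise ValueError('chunk_size must be at least 1')
    results = []
    for start in range(0, len(texts), chunk_size):
        results.extend(batch_fn(texts[start:start + chunk_size]))
    return results


def fake_batchprocess(texts):
    """Stand-in for tm.batchprocess: tags each whole string as one entity."""
    return [[(0, len(t), t)] for t in texts]


# With a loaded model, tm.batchprocess would be passed instead of the stand-in.
output = process_in_chunks(fake_batchprocess, ['a', 'bb', 'ccc'], chunk_size=2)
print(output)
```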
A TABoLiSTM model can be trained in a similar way:
import tabolistem_model
tm = tabolistem_model.TaboListem()
tm.train(textfile, annotfile, runname)
where the arguments of tm.train are:
- textfile: the filename of the training text file.
- annotfile: the filename of the training annotation file.
- runname: a string that becomes part of the output filenames.
The data used in the study consist of Open Access metabolomics publications (n=1,218) from PubMed Central (PMC). Sentences in the corpus are extracted from the Abstract, Method, Result, and Discussion sections of these publications and processed in a rule-based fashion with generate_corpus.py.
The metabolomics dataset provided here comprises two files, namely
- TrainingSet.txt, containing sentences that mention at least one metabolite;
- TrainingSetAnnot.tsv, containing the positions of the metabolites and the metabolites themselves.
Specifically, the two files are structured as in the following examples, respectively:
PMC2538910 R01008 Citrulline and ornithine, urea cycle intermediates.
PMC2538910 R01008 0 10 Citrulline
PMC2538910 R01008 15 24 ornithine
PMC2538910 R01008 26 30 urea
where PMC2538910 is the PMC identifier of the source article, and R01008 is the sentence identifier. The sentence identifier has three parts: for example, in R01008, R denotes the Result section, 01 the second subsection, and 008 the ninth identified sentence (that is, the numbering starts at zero).
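The two file layouts and the sentence identifier can be read with a few lines of Python. This is a minimal sketch under assumptions: the fields are taken to be separated by whitespace (tabs are used in the sample lines, as the .tsv extension suggests), the identifier is taken to be one letter followed by two and three digits, and the three function names are hypothetical.

```python
def parse_text_line(line):
    """Split a corpus text line into (pmcid, sentence_id, sentence)."""
    pmcid, sent_id, sentence = line.rstrip('\n').split(None, 2)
    return pmcid, sent_id, sentence


def parse_annot_line(line):
    """Split an annotation line into (pmcid, sentence_id, start, end, entity)."""
    pmcid, sent_id, start, end, entity = line.rstrip('\n').split(None, 4)
    return pmcid, sent_id, int(start), int(end), entity


def split_sentence_id(sent_id):
    """Return (section_letter, subsection_index, sentence_index), zero-based."""
    return sent_id[0], int(sent_id[1:3]), int(sent_id[3:])


# Sample lines taken from the examples above, with assumed tab separators.
text_line = 'PMC2538910\tR01008\tCitrulline and ornithine, urea cycle intermediates.'
annot_lines = [
    'PMC2538910\tR01008\t0\t10\tCitrulline',
    'PMC2538910\tR01008\t15\t24\tornithine',
    'PMC2538910\tR01008\t26\t30\turea',
]

pmcid, sent_id, sentence = parse_text_line(text_line)
annotations = [parse_annot_line(line) for line in annot_lines]

for _, _, start, end, entity in annotations:
    # The offsets index directly into the sentence, with an exclusive end.
    print(entity, '->', sentence[start:end])

print(split_sentence_id(sent_id))
```

Limiting the number of splits keeps the sentence and multi-word entity names intact, since only the leading fields are separated.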
The training set files are generated automatically with the rule-based annotation pipeline generate_corpus.py. To use this script, run it from the command line:
python generate_corpus.py -b PATH/TO/PMC -t PATH/TO/SAVEDIR -m PATH/TO/METABOLITEDICT -n OUTPUTNAME -r REGEX -e EXCLUSION
which requires six arguments:
- -b: Directory of PMC json files processed by AutoCORPus
- -t: Output directory
- -n: Output filename
- -m: Path to a file storing a list of metabolite names for exact dictionary matching (e.g. MetaboliteNames.txt)
- -r: Path to a file storing a list of regular expressions for entity recognition (e.g. RegexList_RuleBasedAnnotation.txt)
- -e: Path to a file storing a list of regular expressions to exclude unwanted entities (e.g. ExclusionList_RuleBasedAnnotation.txt)
The metabolite names here (MetaboliteNames.txt) were downloaded from the Human Metabolome Database (HMDB).
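When the pipeline is launched from another Python script, the command can be assembled as an argument list. The sketch below only builds and prints the command. build_generate_corpus_command is a hypothetical helper, the flags are the six documented above, and the paths are the same placeholders used in the command-line example.

```python
import sys


def build_generate_corpus_command(pmc_dir, save_dir, metabolite_dict,
                                  output_name, regex_file, exclusion_file,
                                  python=sys.executable):
    """Assemble the generate_corpus.py call as a list of arguments."""
    return [
        python, 'generate_corpus.py',
        '-b', pmc_dir,          # AutoCORPus-processed PMC json files
        '-t', save_dir,         # output directory
        '-m', metabolite_dict,  # metabolite names for dictionary matching
        '-n', output_name,      # output filename
        '-r', regex_file,       # regular expressions for entity recognition
        '-e', exclusion_file,   # regular expressions for exclusion
    ]


cmd = build_generate_corpus_command(
    'PATH/TO/PMC', 'PATH/TO/SAVEDIR', 'MetaboliteNames.txt',
    'OUTPUTNAME', 'RegexList_RuleBasedAnnotation.txt',
    'ExclusionList_RuleBasedAnnotation.txt',
)
print(' '.join(cmd))
# To run it for real, pass cmd to subprocess.run(cmd, check=True)
# from the directory that contains generate_corpus.py.
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.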