OpenFusion.py is a text mining tool written in Python for creating SQLite biomedical databases on a specific topic, using PubMed (https://pubmed.ncbi.nlm.nih.gov/) as the source. The resulting database can be used locally or on the web.
Install or upgrade Biopython
pip install biopython
pip install --upgrade biopython
All you need is OpenFusion.py. The rest is as simple as 1, 2, 3:
- Create files containing dictionary terms, e.g. genes.txt, diseases.txt, etc. Each line must contain a word or phrase, which can be followed by tab-separated synonyms, e.g.
  Alpers Disease	Progressive Cerebral Poliodystrophy
- Create a project file, e.g. myProject.yml. Simply use the provided template for your projects.
- Run:
  ./OpenFusion.py -p myProject.yml
The program will create an SQLite database as specified in the project file. In addition, it will create a file containing the PubMed articles in MEDLINE format, so you don't have to download them again if you wish to recreate the database.
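The dictionary file format described above (one term per line, optionally followed by tab-separated synonyms) can be checked before running the program. The following is a minimal sketch, not code taken from OpenFusion.py: the helper name `read_dictionary` and the second sample term are illustrative assumptions, and an in-memory string stands in for a file such as diseases.txt.

```python
import io


def read_dictionary(lines):
    """Parse dictionary lines into (term, synonyms) tuples.

    Hypothetical helper: assumes each non-empty line holds a term,
    optionally followed by tab-separated synonyms.
    """
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        term, *synonyms = [part.strip() for part in line.split("\t")]
        entries.append((term, synonyms))
    return entries


# In-memory stand-in for a dictionary file such as diseases.txt
sample = io.StringIO(
    "Alpers Disease\tProgressive Cerebral Poliodystrophy\n"
    "Migraine\n"
)
entries = read_dictionary(sample)
for term, synonyms in entries:
    print(term, synonyms)
```

To check a real file, pass an open file object instead, e.g. `read_dictionary(open("diseases.txt", encoding="utf-8"))`.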
The SQLite database created by the program can be used for further research or displayed on the web as tables or graphs.
Tables:
dictionary: list of dictionaries
- did: dictionary id (primary key)
- dictionary: dictionary name as specified in the .yml file
- color: color in which the dictionary will be displayed (auto-assigned)
- shape: shape in which the dictionary will be represented (auto-assigned)
glossary: list of all dictionary terms
- tid: term id (primary key)
- did: dictionary id the term belongs to (foreign key)
- term: tab-separated term synonyms
corpus: articles downloaded from PubMed
- pmid: PubMed-assigned id (primary key)
- author: list of authors
- article: article text; the title and the abstract are delimited by the pilcrow sign (¶, i.e. \u00B6)
- date: the article date in ISO format
- annotation: article annotation in JSON format; each entry has the form
  [did, tid, start_term_position, end_term_position, term_name]
  For example, the term "visual illusions" (did=2, tid=13) occupies positions 8 to 23 and the term "migraine" (did=2, tid=7) occupies positions 54 to 61 in the article:
  [ [2, 13, 8, 23, "visual illusions"], [2, 7, 54, 61, "migraine"] ]
  This information is useful for highlighting terms, which can be displayed in different colors.
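The annotation column can drive term highlighting directly. The sketch below is illustrative and not part of OpenFusion.py. It assumes that positions are 0-based with an inclusive end, which matches the lengths in the example (8 to 23 spans the 16 characters of "visual illusions"), but the actual indexing should be verified against a real database. The article text, the `highlight` helper and the `dict-N` CSS class names are made up for the demonstration.

```python
import json
from html import escape

PILCROW = "\u00B6"  # delimiter between title and abstract


def highlight(article, annotation_json):
    """Wrap each annotated term in a <span> tagged with its dictionary id.

    Assumption: start/end are 0-based and the end position is inclusive.
    """
    spans = sorted(json.loads(annotation_json), key=lambda a: a[2])
    out, cursor = [], 0
    for did, tid, start, end, _term in spans:
        if start < cursor:
            continue  # skip a span that overlaps the previous one
        out.append(escape(article[cursor:start]))
        matched = escape(article[start:end + 1])
        out.append(f'<span class="dict-{did}" data-tid="{tid}">{matched}</span>')
        cursor = end + 1
    out.append(escape(article[cursor:]))
    return "".join(out)


# Made-up article text laid out so the example offsets line up
article = "Complex visual illusions in a patient" + PILCROW + "A case of acute migraine."
annotation = '[[2, 13, 8, 23, "visual illusions"], [2, 7, 54, 61, "migraine"]]'

title, abstract = article.split(PILCROW, 1)
html = highlight(article, annotation)
print(title)
print(html)
```

Because every span carries its dictionary id, a stylesheet could assign each dictionary the color stored in the dictionary table.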
annotation: contains pmid-tid pairs
- id: table primary key
- pmid: article id assigned by PubMed
- did: dictionary id the term belongs to
- tid: term id
- idx: position of the term within the article
- term: term name
term: contains a list of all terms found
- tid: term id
- did: dictionary id the term belongs to
- term: term name
- pmid_count: number of articles the term has been found in
- pmid_list: comma-separated list of articles containing the term
This table can be used to display various statistics or as a data source for machine learning programs.
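As an illustration of the statistics use case, the term table can be queried with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the column types are guesses, the rows and PMIDs are placeholders rather than real data, and an in-memory database stands in for the file named in your project file.

```python
import sqlite3

# In-memory stand-in; for a real database, connect to the SQLite file
# that OpenFusion.py created instead of ":memory:".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE term ("
    "tid INTEGER, did INTEGER, term TEXT, pmid_count INTEGER, pmid_list TEXT)"
)
conn.executemany(
    "INSERT INTO term VALUES (?, ?, ?, ?, ?)",
    [
        (13, 2, "visual illusions", 2, "101,103"),  # placeholder PMIDs
        (7, 2, "migraine", 3, "101,102,103"),
    ],
)

# Rank terms by the number of articles they occur in
rows = conn.execute(
    "SELECT term, pmid_count, pmid_list FROM term ORDER BY pmid_count DESC"
).fetchall()

# Split the comma-separated pmid_list into individual ids
top = [(term, count, pmid_list.split(",")) for term, count, pmid_list in rows]
for term, count, pmids in top:
    print(f"{term}: {count} article(s) -> {pmids}")
conn.close()
```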
termpair: contains a list of term co-occurrences
- id: table primary key
- tid_1: term 1 id
- did_1: dictionary id term 1 belongs to
- term_1: term 1 name
- tid_2: term 2 id
- did_2: dictionary id term 2 belongs to
- term_2: term 2 name
- pmid_count: number of articles both terms have been found in
- pmid_list: comma-separated list of articles containing both terms
This table can be used to generate complex graphs showing links between terms, e.g. by using Cytoscape.js (https://js.cytoscape.org). For example, terms can be represented as colored blocks (as defined in the dictionary table), and the links between them denote the number of articles in which the linked terms co-occur.
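One way to feed the termpair table to Cytoscape.js is to convert its rows into the elements JSON array that the library accepts, where each node and edge is an object with a `data` field and edges carry `source` and `target`. The sketch below is an assumption-laden illustration, not the code shipped in web_utils: the `to_elements` helper, the `t<tid>` id scheme, the `label`, `did` and `weight` data fields, and the sample row are all hypothetical.

```python
import json
import sqlite3

# In-memory stand-in for the database created by OpenFusion.py
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE termpair (id INTEGER PRIMARY KEY, "
    "tid_1 INTEGER, did_1 INTEGER, term_1 TEXT, "
    "tid_2 INTEGER, did_2 INTEGER, term_2 TEXT, "
    "pmid_count INTEGER, pmid_list TEXT)"
)
conn.execute(
    "INSERT INTO termpair VALUES (1, 13, 2, 'visual illusions', "
    "7, 2, 'migraine', 2, '101,103')"  # placeholder row
)


def to_elements(rows):
    """Turn termpair rows into a flat list of Cytoscape.js elements."""
    nodes, edges = {}, []
    for tid_1, did_1, term_1, tid_2, did_2, term_2, pmid_count in rows:
        for tid, did, term in ((tid_1, did_1, term_1), (tid_2, did_2, term_2)):
            # One node per term; keying by id drops duplicates
            nodes[f"t{tid}"] = {"data": {"id": f"t{tid}", "label": term, "did": did}}
        edges.append({"data": {
            "id": f"t{tid_1}-t{tid_2}",
            "source": f"t{tid_1}",
            "target": f"t{tid_2}",
            "weight": pmid_count,  # number of co-occurrence articles
        }})
    return list(nodes.values()) + edges


rows = conn.execute(
    "SELECT tid_1, did_1, term_1, tid_2, did_2, term_2, pmid_count FROM termpair"
).fetchall()
elements = to_elements(rows)
print(json.dumps(elements, indent=2))
conn.close()
```

On the JavaScript side, a stylesheet could map `did` to the color and shape stored in the dictionary table and `weight` to the edge width.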
Have a look at the example in the Alice_in_Wonderland directory.
The directory web_utils contains a number of utilities written in Python and PHP that you can use to publish your database online.

