GWASkb is a machine-compiled knowledge base of associations between genetic mutations and human traits.
- In our forthcoming paper, we describe our methodology for creating the database and analyze the results.
- In this repository, we walk through the code used to create the database.
- In our online portal at http://gwaskb.stanford.edu/, you can search the resulting database.
GWASkb contains associations in the form of (genetic variant, trait, p-value) tuples. In our paper, we selected and analyzed a set of associations that strikes a good tradeoff between precision and recall. These are found in:

`notebooks/results/associations.tsv`
This is a tab-separated file with 5 columns: `pmid`, `rsid`, high-level phenotype, low-level phenotype, and log p-value. A log p-value of -10000 means that we were not able to extract the p-value.
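For example, the file can be loaded with pandas (a minimal sketch; it assumes the file has no header row, and the column names below are ours, chosen to match the format above):

```python
import pandas as pd

# Load the associations file; we assume no header row, and name
# the columns ourselves following the format described above.
cols = ["pmid", "rsid", "phenotype_high", "phenotype_low", "log_pvalue"]
df = pd.read_csv("notebooks/results/associations.tsv", sep="\t", names=cols)

# -10000 is the sentinel for "p-value could not be extracted".
df = df[df["log_pvalue"] != -10000]
```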
Our knowledge base also contains a large set of other data, which is documented in `results.md`.
This repo is organized as follows:
```
.
├── README.md
├── annotations               # Manually annotated data
│   └── not_in_gwasc.xlsx     # Manually annotated set of 100 relations extracted by GWASkb that were not in GWAS Catalog
├── data                      # Datasets from which the knowledge base was compiled
│   ├── associations          # Human-curated associations against which we compare
│   ├── db                    # Scripts to download and create the input database of publications
│   └── phenotypes            # Scripts to generate the phenotype ontology used by the system
├── notebooks                 # Jupyter notebooks that walk through how the system was used to generate the results
│   ├── bio-analysis          # Notebooks that reproduce the biological analysis performed in the paper
│   ├── lfs.py                # A Python file containing all labeling functions used
│   └── results               # The main set of results produced by the machine curation system
│       ├── nb-output         # Intermediary output generated by each module (each notebook)
│       └── metadata          # Metadata associated with extracted p-values
├── snorkel-tables            # Version of Snorkel used in the project
├── src                       # Source code of the components used on top of Snorkel
│   ├── crawler               # Scripts used to generate a database of papers as well as to crawl human-curated DBs
│   └── extractor             # Modules that extend Snorkel to extract GWAS-specific relations from publications
└── results.md                # File documenting the output of the system
```
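The labeling functions in `notebooks/lfs.py` encode heuristics that vote on candidate associations. Conceptually, a labeling function returns +1 (true), -1 (false), or 0 (abstain) for a candidate; the example below is a self-contained illustration of the pattern, not one of the repo's actual functions:

```python
import re

def lf_mentions_association(sentence_text):
    """Hypothetical labeling function: vote on a candidate relation
    based on the sentence it was extracted from."""
    if re.search(r"no (significant )?associat", sentence_text, re.I):
        return -1  # heuristic evidence against the relation
    if re.search(r"associat(ed|ion)", sentence_text, re.I):
        return 1   # heuristic evidence for the relation
    return 0       # abstain
```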
In addition, the following files are important:
- `notebooks/results/nb-output`: folder containing the output of each system module
- `notebooks/util/phenotype.mapping.annotated.tsv`: manually annotated mapping between GWAS Central and GWASkb phenotypes
- `notebooks/util/phenotype.mapping.gwascat.annotated.tsv`: manually annotated mapping between GWAS Catalog and GWASkb phenotypes
- `notebooks/util/rels.discovered.annotated.txt`: random subset of 100 previously unreported relations, with explanations for why they are correct or not
GWASkb is intended to run on macOS and Unix (no GPU required). It has been tested on macOS 10.14 (Mojave) and Ubuntu 16.04.
It requires Python 2.7 and the following primary libraries:

- lxml, ElementTree
- numpy
- sklearn
- sqlite
- snorkel
These (and their dependencies) will be downloaded during Installation.
If you already have the source code, skip to Step 2. If you are retrieving the source code from this repository, run the following commands:
```
git clone https://github.com/kuleshov/gwaskb.git
cd gwaskb
git submodule init
git submodule update
```
We recommend using Anaconda to set up a virtual environment and working within that, but this is not strictly necessary, so long as Python 2.7 is on your path.
[1] Check Python version
```
python --version
```
You should see Python 2.7.x. If not, you may need to create a virtual environment where Python 2.7 is used (e.g., with Anaconda: `conda create --name gwaskb python=2.7`, then `source activate gwaskb`).
[2] Install dependencies
```
cd ./snorkel-tables
pip install --requirement python-package-requirement.txt
./run.sh           # Install treedlib and the Stanford CoreNLP tools;
                   # this will also open a Jupyter notebook, which we will use for the demo in Step 4.
cd ..              # Return to the root directory
source set_env.sh  # Add environment variables to your path
```
We extract mutation/phenotype relations from the open-access subset of PubMed.
In addition, we use hand-curated databases such as GWAS Catalog and GWAS Central for evaluation, and we use various ontologies (EFO, SNOMED, etc.) for phenotype extraction.
The first step is to download this data onto your machine. The easiest way is to download a zip file with the data that we used in the notebooks:
https://drive.google.com/file/d/1DX17UCztwXtB3PxKLQd2waUBJdSdNJDc/view?usp=sharing
Alternatively, you may use our code to manually recreate this dataset.
This can be done in one step:
```
cd data/db
make
cd ../..
```
Or step-by-step using the instructions below.
```
cd data/db

# We will store part of the dataset in a sqlite database.
make init          # this will initialize an empty database

# Next, we load a database of known phenotypes that might occur in the literature.
# This will load phenotypes from the EFO ontology as well as
# various ontologies collected by the Hazy Research group.
make phenotypes

# Next, we download the contents of the hand-curated GWAS Catalog database.
make gwas-catalog  # loads into the sqlite db (/tmp/gwas.sql by default); this takes a while

# Now, let's download from PubMed all the open-access papers mentioned in the GWAS Catalog.
make dl-papers     # downloads ~600 papers + their supplementary material!

# Finally, we will use the GWAS Central database for validation of the results.
make gwas-central  # this will only download the parts of GWAS Central relevant to our papers
```
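As a quick sanity check that the downloads populated the database, you can list its tables (a minimal sketch; `/tmp/gwas.sql` is the Makefile default mentioned above, and we make no assumptions about table names):

```python
import sqlite3

# Connect to the input database and list the tables it contains.
conn = sqlite3.connect("/tmp/gwas.sql")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
print(tables)
```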
We demo our system in a series of Jupyter notebooks in the `notebooks` subfolder:

- `phenotype-extraction.ipynb` identifies the phenotypes studied in each paper
- `table-pval-extraction.ipynb` extracts mutation ids and their associated p-values
- `table-phenotype-extraction.ipynb` extracts relations between mutations and a specific phenotype (out of the many that can be described in the paper)
- `acronym-extraction.ipynb`: often, phenotypes are mentioned via acronyms, and we need a module to resolve those acronyms
- `evaluation.ipynb`: here, we merge all the results and evaluate our accuracy (see the sketch after this list)
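Conceptually, the merge step in `evaluation.ipynb` joins the per-module outputs on the paper id. The sketch below only illustrates that idea; the file names and columns are placeholders, not the notebooks' actual outputs:

```python
import pandas as pd

# Placeholder file names and columns, for illustration only.
pvals = pd.read_csv("nb-output/pvals.tsv", sep="\t")       # pmid, rsid, log_pvalue
phens = pd.read_csv("nb-output/phenotypes.tsv", sep="\t")  # pmid, phenotype
merged = pvals.merge(phens, on="pmid")                     # one row per extracted fact
```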
The result is a list of TSV files containing facts (e.g. mutation/disease relations) that we have extracted from the literature.
Please send feedback to Volodymyr Kuleshov.