GWASkb is a machine-compiled knowledge base of associations between genetic mutations and human traits.
- In our forthcoming paper, we describe our methodology for creating the database and analyze the results.
- In this repository, we walk through the code used to create the database.
- In our online portal at http://gwaskb.stanford.edu/, you can search the resulting database.
GWASkb contains associations in the form of tuples of (
pvalue). In our paper, we have selected and analyzed a set of associations that strike a good tradeoff between precision and recall.
These are found in:
This is a tab-separated file with 5 columns:
rsid, high-level phenotype, low-level phenotype, log p-value. If the latter is
-10000, it means that we were not able to extract the p-value.
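As a minimal sketch of how such a file could be read, the Python snippet below parses tab-separated rows and turns the `-10000` sentinel into `None`. It assumes that the four named columns are the last four fields of each row, in the order listed above, and keeps anything before them as unnamed leading fields. The function name `parse_associations` and the sample rows are hypothetical and are not taken from the actual data.

```python
# Sentinel used in the log p-value column when no p-value could be extracted.
MISSING_LOG_PVALUE = -10000.0


def parse_associations(lines):
    """Parse tab-separated association rows into dictionaries.

    Assumption: the last four fields are rsid, high-level phenotype,
    low-level phenotype and log p-value, in that order.
    """
    records = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        rsid, high_level, low_level, raw_log_p = fields[-4:]
        try:
            log_p = float(raw_log_p)
        except ValueError:
            continue  # skip rows whose last field is not numeric (e.g. a header)
        records.append({
            "leading_fields": fields[:-4],
            "rsid": rsid,
            "high_level_phenotype": high_level,
            "low_level_phenotype": low_level,
            # None marks a p-value that could not be extracted.
            "log_pvalue": None if log_p == MISSING_LOG_PVALUE else log_p,
        })
    return records


# Made-up rows, for illustration only.
sample_rows = [
    "12345\trs0001\tdisease A\tdisease A subtype\t-8.5\n",
    "12345\trs0002\tdisease B\tdisease B subtype\t-10000\n",
]
sample_records = parse_associations(sample_rows)
```

To read a real file, pass an open file handle instead of the list, for example `parse_associations(open(path))`, where `path` points to the associations file.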
Our knowledge base also contains a large set of other data, which is documented in
This repo is organized as follows:
    .
    ├── README.md
    ├── annotations               # Manually annotated data
    │   └── not_in_gwasc.xlsx     # Manually annotated set of 100 relations extracted by GWASkb that were not in GWAS Catalog
    ├── data                      # Datasets from which the knowledge base was compiled
    │   ├── associations          # Human-curated associations against which we compare
    │   ├── db                    # Scripts to download and create the input database of publications
    │   └── phenotypes            # Scripts to generate phenotype ontology used by the system
    ├── notebooks                 # Jupyter notebooks that walk us through how the system was used to generate the results
    │   ├── bio-analysis          # Notebooks that reproduce the biological analysis performed in the paper
    │   ├── lfs.py                # A Python file containing all labeling functions used
    │   └── results               # The main set of results produced by the machine curation system
    │       ├── nb-output         # Intermediary output generated by each module (each notebook)
    │       └── metadata          # Metadata associated with extracted p-values
    ├── snorkel-tables            # Version of Snorkel used in the project
    ├── src                       # Source code of the components used on top of Snorkel
    │   ├── crawler               # Scripts used to generate a database of papers as well as to crawl human-curated DBs
    │   └── extractor             # Modules that extend Snorkel to extract GWAS-specific information from the publications
    └── results.md                # File documenting the output of the system
In addition, the following files are important:
notebooks/results/nb-output: folder containing the output of each system module
notebooks/util/phenotype.mapping.annotated.tsv: manually annotated mapping between GWAS Central and GWASkb phenotypes
notebooks/util/phenotype.mapping.gwascat.annotated.tsv: manually annotated mapping between GWAS Catalog and GWASkb phenotypes
notebooks/util/rels.discovered.annotated.txt: random subset of 100 previously unreported relations with explanations for why they are correct or not.
GWASkb is intended to run on macOS and Unix (no GPU required). It has been tested on macOS 10.14 (Mojave) and Ubuntu 16.04.
It requires Python 2.7 and the following primary libraries:
These (and their dependencies) will be downloaded during Installation.
Step 1: Download source code
If you already have the source code, skip to Step 2. If you are retrieving the source code from this repository, run the following commands:
    git clone https://github.com/kuleshov/gwaskb.git
    cd gwaskb
    git submodule init
    git submodule update
Step 2: Setup environment
We recommend using Anaconda to set up a virtual environment and working within it. This is not strictly necessary, as long as Python 2.7 is on your path.
 Check Python version
You should see Python 2.7.X. If not, you may need to create a virtual environment where Python 2.7 is used.
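The same check can be done from inside the interpreter. The snippet below is only a sketch; the helper name `is_supported` is hypothetical and not part of this repository.

```python
import sys

# Major and minor version that GWASkb expects.
REQUIRED_VERSION = (2, 7)


def is_supported(version_info=None):
    """Return True when the major.minor version equals 2.7."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == REQUIRED_VERSION


if __name__ == "__main__":
    found = "%d.%d.%d" % tuple(sys.version_info[:3])
    if is_supported():
        print("OK: running Python " + found)
    else:
        print("Found Python " + found + "; GWASkb expects Python 2.7")
```

If the check fails, activate or create an environment that uses Python 2.7 before installing the dependencies.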
 Install dependencies
    cd ./snorkel-tables
    pip install --requirement python-package-requirement.txt
    ./run.sh            # Install treedlib and the Stanford CoreNLP tools
                        # This will also open a jupyter notebook which we will use for the demo in Step 4.
    cd ..               # Return to root directory
    source set_env.sh   # Add environment variables to your path
Step 3: Download data
We extract mutation/phenotype relations from the open-access subset of PubMed.
In addition, we use hand-curated databases such as GWAS Catalog and GWAS Central for evaluation, and we use various ontologies (EFO, SNOMED, etc.) for phenotype extraction.
The first step is to download this data onto your machine. The easiest way to do that is to download a zip file with the data that we used in the notebooks:
Generating the data manually
Alternatively, you may use our code to manually recreate this dataset.
This can be done in one step:
    cd data/db
    make
    cd ../..
Or step-by-step using the instructions below.
    cd data/db
    # we will store part of the dataset in a sqlite database
    make init # this will initialize an empty database
    # next, we load a database of known phenotypes that might occur in the literature
    # this will load phenotypes from the EFO ontology as well as
    # various ontologies collected by the Hazy Research group
    make phenotypes
    # next, we download the contents of the hand-curated GWAS catalog database
    make gwas-catalog # loads into sqlite db (/tmp/gwas.sql by default); this takes a while
    # now, let's download from pubmed all the open-access papers mentioned in the GWAS catalog
    make dl-papers # downloads ~600 papers + their supplementary material!
    # finally, we will use the GWAS central database for validation of the results
    make gwas-central # this will only download the parts of GWAS central relevant to our papers
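To confirm that these steps populated the SQLite database, one option is to list its tables from Python. This is a sketch rather than part of the repository: the helper `list_tables` is hypothetical, and the path `/tmp/gwas.sql` is the default mentioned in the comments above. No table names are assumed, since the schema is not described here.

```python
import os
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite database at db_path."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    default_path = "/tmp/gwas.sql"  # default location according to the Makefile comments
    # sqlite3.connect would create an empty file if the path did not exist,
    # so check for the file first.
    if os.path.exists(default_path):
        print(list_tables(default_path))
    else:
        print("No database found at " + default_path)
```

An empty list means the file exists but no tables have been created yet.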
Step 4: Information Extraction Demo
We demo our system in a series of Jupyter notebooks in the
phenotype-extraction.ipynb: identifies the phenotypes studied in each paper
table-pval-extraction.ipynb: extracts mutation ids and their associated p-values
table-phenotype-extraction.ipynb: extracts relations between mutations and a specific phenotype (out of the many that can be described in the paper)
acronym-extraction.ipynb: often, phenotypes are mentioned via acronyms, and we need a module to resolve those acronyms
evaluation.ipynb: here, we merge all the results and evaluate our accuracy
The result is a list of TSV files containing facts (e.g. mutation/disease relations) that we have extracted from the literature.
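To illustrate the kind of comparison an evaluation against a hand-curated database involves, the sketch below computes precision and recall for a set of extracted relations. It is not the code from `evaluation.ipynb`: the function name `precision_recall`, the representation of a relation as a (paper, mutation) pair, and the sample identifiers are all assumptions made for this example.

```python
def precision_recall(extracted, curated):
    """Compare extracted relations against a curated reference set.

    Both arguments are iterables of hashable relation keys, here
    hypothetical (paper id, mutation id) pairs.
    """
    extracted = set(extracted)
    curated = set(curated)
    true_positives = len(extracted & curated)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / float(len(extracted)) if extracted else 0.0
    recall = true_positives / float(len(curated)) if curated else 0.0
    return precision, recall


# Made-up identifiers, for illustration only.
extracted_pairs = [("p1", "rs1"), ("p1", "rs2"), ("p2", "rs3"), ("p2", "rs4")]
curated_pairs = [("p1", "rs1"), ("p2", "rs3"), ("p3", "rs5")]
precision, recall = precision_recall(extracted_pairs, curated_pairs)
```

Precision measured this way counts every relation missing from the curated database as an error, even when the relation is correct but previously unreported. Manually annotated samples, such as those listed among the important files above, help estimate how many of those relations are actually correct.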
Please send feedback to Volodymyr Kuleshov.