GWASkb is a machine-compiled knowledge base of associations between genetic mutations and human traits.

  • In our forthcoming paper, we describe our methodology for creating the database and analyze the results.
  • In this repository, we walk through the code used to create the database.
  • In our online portal at, you can search the resulting database.

Main Results

GWASkb contains associations in the form of tuples of (genetic variant, trait, pvalue). In our paper, we have selected and analyzed a set of associations that strike a good tradeoff between precision and recall. These are found in:


This is a tab-separated file with 5 columns: pmid, rsid, high-level phenotype, low-level phenotype, and log p-value. A log p-value of -10000 means that we were not able to extract the p-value.
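
As a rough illustration of the format described above, the following sketch parses such a file and turns the -10000 sentinel into a missing value. It is written for Python 3 and is not part of the GWASkb code; the function name, the dictionary keys, and the sample rows are made up for the example, and it assumes that any line whose last field is not numeric (such as a header, if the file has one) can be skipped.

```python
import csv
import io

MISSING_LOG_PVALUE = -10000.0  # sentinel meaning "p-value could not be extracted"

# Hypothetical key names for the five columns described above.
COLUMNS = ["pmid", "rsid", "phenotype_high", "phenotype_low", "log_pvalue"]


def read_associations(handle):
    """Parse the 5-column TSV into a list of dicts; missing p-values become None."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        try:
            value = float(fields[4])
        except ValueError:
            continue  # e.g. a header row, if the file has one
        row = dict(zip(COLUMNS, fields))
        row["log_pvalue"] = None if value == MISSING_LOG_PVALUE else value
        rows.append(row)
    return rows


# Made-up rows for illustration only; not taken from the database.
sample = io.StringIO(
    "12345678\trs0000001\tdiabetes\ttype 2 diabetes\t-8.3\n"
    "12345678\trs0000002\tdiabetes\tfasting glucose\t-10000\n"
)
associations = read_associations(sample)
```

To read a real file, pass an open file handle instead of the in-memory sample.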

Our knowledge base also contains a large set of other data, which is documented in


This repo is organized as follows:

├── annotations           # Manually annotated data
  └── not_in_gwasc.xlsx   # Manually annotated set of 100 relations extracted by GWASkb that were not in GWAS Catalog
├── data                  # Datasets from which the knowledge base was compiled 
  ├── associations        # Human-curated associations against which we compare
  ├── db                  # Scripts to download and create the input database of publications
  └── phenotypes          # Scripts to generate phenotype ontology used by the system
├── notebooks             # Jupyter notebooks that walk us through how the system was used to generate the results
  ├── bio-analysis        # Notebooks that reproduce the biological analysis performed in the paper
  ├──              # A Python file containing all labeling functions used
  └── results             # The main set of results produced by the machine curation system
    ├── nb-output         # Intermediary output generated by each module (each notebook)
    └── metadata          # Metadata associated with extracted p-values
├── snorkel-tables        # Version of Snorkel used in the project
├── src                   # Source code of the components used on top of Snorkel
  ├── crawler             # Scripts used to generate a database of papers as well as to crawl human-curated DBs
  └── extractor           # Modules that extend Snorkel to extract GWAS-specific information from the publications
└──            # File documenting the output of the system

In addition, the following files are important:

  • notebooks/results/nb-output: folder containing the output of each system module
  • notebooks/util/phenotype.mapping.annotated.tsv: manually annotated mapping between GWAS Central and GWASkb phenotypes
  • notebooks/util/phenotype.mapping.gwascat.annotated.tsv: manually annotated mapping between GWAS Catalog and GWASkb phenotypes
  • notebooks/util/rels.discovered.annotated.txt: random subset of 100 previously unreported relations with explanations for why they are correct or not.


GWASkb is intended to run on macOS and Unix (no GPU required). It has been tested on macOS 10.14 (Mojave) and Ubuntu 16.04.

It requires Python 2.7 and the following primary libraries:

  • lxml, ElementTree
  • numpy
  • sklearn
  • sqlite
  • snorkel

These (and their dependencies) will be downloaded during Installation.


Step 1: Download source code

If you already have the source code, skip to Step 2. If you are retrieving the source code from this repository, run the following commands:

git clone
cd gwaskb;
git submodule init;
git submodule update;

Step 2: Set up environment

We recommend using Anaconda to set up a virtual environment and working within it, but this is not strictly necessary, so long as Python 2.7 is on your path.

[1] Check Python version

python --version

You should see Python 2.7.X. If not, you may need to create a virtual environment where Python 2.7 is used.
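
The same check can be made from inside Python, which is handy at the top of a script or notebook. This is only a sketch; the helper name is made up for the example and is not part of GWASkb.

```python
import sys


def is_python27(version_info=sys.version_info):
    """Return True when the interpreter is some Python 2.7.x release."""
    return tuple(version_info[:2]) == (2, 7)


if not is_python27():
    # Report the interpreter that was actually found on the path.
    print("Python %d.%d found; GWASkb expects 2.7" % tuple(sys.version_info[:2]))
```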

[2] Install dependencies

cd ./snorkel-tables
pip install --requirement python-package-requirement.txt
./            # Install treedlib and the Stanford CoreNLP tools
# This will also open a jupyter notebook which we will use for the demo in Step 4.

cd ..               # Return to root directory
source   # Add environment variables to your path

Step 3: Download data

We extract mutation/phenotype relations from the open-access subset of PubMed.

In addition, we use hand-curated databases such as GWAS Catalog and GWAS Central for evaluation, and we use various ontologies (EFO, SNOMED, etc.) for phenotype extraction.

The first step is to download this data onto your machine. The easiest way is to download a zipfile with the data that we used in the notebooks:

Generating the data manually

Alternatively, you may use our code to manually recreate this dataset.

This can be done in one step:

cd data/db
cd ../..

Or step-by-step using the instructions below.

cd data/db

# we will store part of the dataset in a sqlite database
make init # this will initialize an empty database

# next, we load a database of known phenotypes that might occur in the literature
# this will load phenotypes from the EFO ontology as well as 
# various ontologies collected by the Hazy Research group
make phenotypes

# next, we download the contents of the hand-curated GWAS Catalog database
make gwas-catalog # loads into sqlite db (/tmp/gwas.sql by default); this takes a while

# now, let's download from PubMed all the open-access papers mentioned in the GWAS Catalog
make dl-papers # downloads ~600 papers + their supplementary material!

# finally, we will use the GWAS Central database for validation of the results
make gwas-central # this will only download the parts of GWAS Central relevant to our papers
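
After these steps, one way to confirm that the SQLite database was populated is to list its tables. The sketch below uses only the standard `sqlite3` module and the default path mentioned in the comments above. It makes no assumption about the table names that the Makefile creates, and the helper name is made up for the example.

```python
import os
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


db_path = "/tmp/gwas.sql"  # default location mentioned in the comments above
if os.path.exists(db_path):
    for table in list_tables(db_path):
        print(table)
```

An empty list right after `make init` would be expected if that target only creates the file; the later targets should add tables.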

Step 4: Information Extraction Demo

We demo our system in a series of Jupyter notebooks in the notebooks subfolder.

  1. phenotype-extraction.ipynb: identifies the phenotypes studied in each paper
  2. table-pval-extraction.ipynb: extracts mutation ids and their associated p-values
  3. table-phenotype-extraction.ipynb: extracts relations between mutations and a specific phenotype (out of the many that may be described in the paper)
  4. acronym-extraction.ipynb: resolves acronyms, since phenotypes are often mentioned via acronyms
  5. evaluation.ipynb: merges all the results and evaluates our accuracy

The result is a list of TSV files containing facts (e.g. mutation/disease relations) that we have extracted from the literature.
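
To make the idea of evaluating against a hand-curated database concrete, here is a minimal sketch that computes precision and recall over (pmid, rsid) pairs. It is not the evaluation code from evaluation.ipynb; the function name and the identifiers are made up for the example. Note that an extracted pair missing from the curated set is not necessarily wrong, which is why the repository also includes manual annotations of relations that were not in GWAS Catalog.

```python
def precision_recall(extracted, curated):
    """Compare two collections of (pmid, rsid) pairs against each other."""
    extracted, curated = set(extracted), set(curated)
    true_positives = len(extracted & curated)
    # Guard against empty inputs to avoid dividing by zero.
    precision = true_positives / float(len(extracted)) if extracted else 0.0
    recall = true_positives / float(len(curated)) if curated else 0.0
    return precision, recall


# Made-up identifiers for illustration only.
extracted = [("111", "rs1"), ("111", "rs2"), ("222", "rs3")]
curated = [("111", "rs1"), ("222", "rs3"), ("222", "rs4"), ("333", "rs5")]
precision, recall = precision_recall(extracted, curated)
```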


Please send feedback to Volodymyr Kuleshov.


Machine-curated database of genetic disease and genome-wide association studies


