Skip to content


Repository files navigation

Open Targets Genetics colocalisation pipeline

Colocalisation pipeline for Open Targets Genetics. In brief:

  1. Identify overlaps between credible sets output from finemapping pipeline
  2. Optionally, run conditional analysis
  3. Perform colocalisation analysis using coloc

This pipeline runs using Docker.


  • R
  • Spark v2.4.0
  • GCTA (>= v1.91.3) must be available in $PATH
  • conda
  • GNU parallel

Running the pipeline

Make sure to create a machine with a large enough disk to store the number of files/folders needed. GCP disks have some limits that we've run into: Based on the way files are saved, each conditionally independent sumstat signal requires about 50 folders. (Previously we had ~50 folders for each coloc test, but this has been fixed. Use df -T and df -i to check limits and usage.)

Step 1: Prepare docker

# If docker is not installed, then install (on Ubuntu) following steps here.
# (Simplest is to install using the convenience script):

# May need to update some permissions to get docker running, e.g.:

# May need to fix some config issues with Docker
sudo groupadd docker
sudo usermod -aG docker ${USER}
sudo systemctl restart docker
# Exit VM and re-login

# Build the docker environment (takes ~10 min)
docker build --tag otg-coloc .

# Pipeline accesses config files in an external location so that we can monitor
# it from outside the docker instance. Output is also written to ~/output.
mv configs $HOME/

Step 2: Prepare input data

Requires the same input data as the fine-mapping pipeline.

Additionally, it takes the toploci and credibleset outputs from the finemapping pipeline. To avoid re-running coloc tests that were computed previously, it takes the "raw" coloc output file from previous runs.

DATADIR=$HOME/data # for Docker pipeline
mkdir -p $DATADIR
#DATADIR=$HOME/genetics-colocalisation/data # for original pipeline

mkdir -p $DATADIR/ukb_v3_downsampled10k
mkdir -p $DATADIR/filtered/significant_window_2mb/gwas
mkdir -p $DATADIR/filtered/significant_window_2mb/molecular_trait

# If not on the same machine as fine-mapping pipeline, then set up LD panel
gsutil -m rsync gs://open-targets-ukbb/genotypes/ukb_v3_downsampled10k/ $DATADIR/ukb_v3_downsampled10k/

# Copy down significant windows
gsutil -m rsync -r -x '.*_SUCCESS$' gs://genetics-portal-dev-sumstats/filtered/significant_window_2mb/gwas/ $DATADIR/filtered/significant_window_2mb/gwas/
gsutil -m rsync -r -x '.*_SUCCESS$' gs://genetics-portal-dev-sumstats/filtered/significant_window_2mb/molecular_trait/ $DATADIR/filtered/significant_window_2mb/molecular_trait/

# Note, need to delete files named "_SUCCESS" from within all parquet folders,
# since the dask dataframe seems to choke on this when reading the parquet.
# Edit: In the latest run this doesn't seem to be necessary?!
find $DATADIR -name "*_SUCCESS" | wc -l
find $DATADIR -name "*_SUCCESS" -delete

mkdir -p $DATADIR/finemapping
# NOTE: Should update to the latest fine-mapping path before running
gsutil -m cp gs://genetics-portal-dev-staging/finemapping/220228_merged/top_loci.json.gz $DATADIR/finemapping/top_loci.json.gz
gsutil -m cp -r gs://genetics-portal-dev-staging/finemapping/220228_merged/credset $DATADIR/finemapping/

# Update to the path to the previous coloc release before running
# The previous coloc file is used to avoid repeating coloc tests that were already done.
# The "raw" file is best for this, since the "processed" one could have had some tests removed already.
# But either would work.
gsutil -m cp -r gs://genetics-portal-dev-staging/coloc/220127/coloc_raw.parquet $DATADIR/
gsutil -m cp -r gs://genetics-portal-dev-staging/coloc/220127/coloc_processed.parquet $DATADIR/

# If you are filtering out coloc tests that were previously done (not re-running them)
# then make sure that the path to the coloc_raw.parquet file is specified in config.yaml.

After downloading the data, we need to split the top loci by chromosome, and subset the reference panels for optimising performance. These are done within the docker instance.


# Update docker container if any code has changed
docker build --tag otg-coloc .

# Run docker container. Can then run steps in individually.
docker run -it --rm \
    --ulimit nofile=1024000:1024000 \
    -v $HOME/data:/data \
    -v $HOME/configs:/configs \
    -v $HOME/output:/output \
    otg-coloc /bin/bash

python # Script from fine-mapping pipeline

# (Only needed if not on same machine as fine-mapping pipeline and already split.)
# To optimise the GCTA conditioning step, it helps to split the UKB reference panel
# into smaller, overlapping "sub-panels". This is because (I believe) GCTA loads
# in the index for the whole chromosome (*.bim file) to determine which SNPs match.
time python --path /data/ukb_v3_downsampled10k/ukb_v3_chr{chrom}.downsampled10k

Step 3: Run pipeline

Start docker as above, and then either manually run individual commands in, or run the complete script.

# Set number of cores available to use, and pyspark args
# May need to scale memory relative to number of cores
# Last run on full dataset (~4 M colocs) took 28 hrs on a 224-core machine.
export PYSPARK_SUBMIT_ARGS="--driver-memory 100g pyspark-shell --executor-memory 2g pyspark-shell"
#export PYSPARK_SUBMIT_ARGS="--driver-memory 20g --executor-memory 2g pyspark-shell"

# Run the full pipeline (or alternatively, run individual commands from this script)
dt=`date '+%Y_%m_%d.%H_%M'`
time bash $NCORES "$PYSPARK_SUBMIT_ARGS" | tee /output/pipeline_run.$dt.txt

# Exit tmux with Ctrl+b then d

Step 4: Monitor running pipeline

# Check how many jobs have completed
wc -l /output/
wc -l /output/

# Estimate GCTA conditioning time remaining
cat $HOME/output/ | wc -l
JOBS_DONE=`cat $HOME/output/ | wc -l`
JOBS_TOTAL=`zcat $HOME/configs/commands_todo.cond.txt.gz | wc -l`
TIME_START=`head -n 2 $HOME/output/ | cut -f 3 | tail -n 1`
TIME_END=`tail -n 1 $HOME/output/ | cut -f 3`
PCT_DONE=`echo "scale=3; 100 * $JOBS_DONE / $JOBS_TOTAL" | bc`
echo "$PCT_DONE% done"
MIN_LEFT=`echo "scale=3; ($TIME_END - $TIME_START) * ((100.0 / $PCT_DONE) - 1) / 60" | bc`
echo "$MIN_LEFT min left"

# Check how many coloc jobs have completed
cat $HOME/output/ | wc -l
JOBS_DONE=`cat $HOME/output/ | wc -l`
JOBS_TOTAL=`cat $HOME/configs/commands_todo_coloc_opt.txt | wc -l`
TIME_START=`head -n 2 $HOME/output/ | cut -f 3 | tail -n 1`
TIME_END=`tail -n 1 $HOME/output/ | cut -f 3`
PCT_DONE=`echo "scale=3; 100 * $JOBS_DONE / $JOBS_TOTAL" | bc`
echo "$PCT_DONE% done"
MIN_LEFT=`echo "scale=3; ($TIME_END - $TIME_START) * ((100.0 / $PCT_DONE) - 1) / 60" | bc`
echo "$MIN_LEFT min left"

Step 5: Merge previous results


Step 6: Copy to GCS

# Update README text in the following script first

Step 7: Join the summary stats onto the coloc table

This step was added at a later date to join the summary stats onto the coloc results table. This is done in the pipeline rather than in the database as the sumstat database lives on a different machine from the coloc table.

This step requires the summary stats to be concatenated into a single parquet dataset and partitioned by chrom, pos. Script to do this is available here.

To run on google dataproc: (last run took XX hrs)

# Open and specify file arguments

# Start a dataproc cluster
# Note that I had this fail multiple times, and had to try adjusting the number
# of executors, memory, cores, etc. to get it to work. More memory seems to be key.
# Took nearly 5 hrs on last run, n2-highmem-64
# This would probably be dramatically faster on BigQuery.
gcloud beta dataproc clusters create \
    js-coloc-beta-join \
    --image-version=preview \
    --properties=spark:spark.debug.maxToStringFields=100,spark:spark.driver.memory=25g,spark:spark.executor.memory=76g,spark:spark.executor.cores=8,spark:spark.executor.instances=6 \
    --master-machine-type=n2-highmem-64 \
    --master-boot-disk-size=2TB \
    --zone=europe-west1-d \
    --initialization-action-timeout=20m \
    --single-node \
    --project=open-targets-genetics-dev \
    --region=europe-west1 \
# Cluster will automatically shutdown after 10 minutes idle

# Submit to dataproc
gcloud dataproc jobs submit pyspark \
    --cluster=js-coloc-beta-join \
    --async \
    --properties spark.submit.deployMode=cluster \
    --project=open-targets-genetics-dev \
    --region=europe-west1 \

To monitor job

gcloud compute ssh js-coloc-beta-join-m \
  --project=open-targets-genetics-dev \
  --zone=europe-west1-d -- -D 1080 -N

"/Applications/Google Chrome" \
  --proxy-server="socks5://localhost:1080" \
  --user-data-dir="/tmp/js-coloc-beta-join-m" http://js-coloc-beta-join-m:8088


Useful commands
# Count finished
find output/data -name "*.coloc_res.csv" | wc -l

# Check how many conditional sumstats had no variants to condition on
find $HOME/output/logs/extract_sumstats -name "log_file.txt" -exec grep "No variants to condition on" {} \;

# Parse time taken for each run
find $HOME/output/logs -name "log_file.txt" -exec grep "Time taken" {} \;

# Parse time taken to load right sumstats
find $HOME/output/logs -name "log_file.txt" -exec "Loading right" -A 1 {} \;

# Check how many commands still "to do"
time python --quiet
zcat $HOME/configs/commands_todo.txt.gz | wc -l

# Check how many intermediate (conditioned) sumstat files generated
find /output/cache/sqtl -name 'sumstat.tsv.gz' | wc -l

# Grep all log files
find output/logs/extract_sumstats -name 'log_file.txt' -print0 | xargs -r0 grep -iH 'ERROR' | tee cond_errors.txt | wc -l
find /output/logs/extract_sumstats -name 'log_file.txt' -print0 | xargs -r0 grep -iH 'ERROR' | less
find output/logs/coloc -name 'log_file.txt' -print0 | xargs -r0 grep -iH 'ERROR' | tee coloc_errors.txt | wc -l
find $HOME/output/logs/coloc -name 'log_file.txt' -print0 | xargs -r0 grep -iH 'ERROR' | grep -v 'no intersection'

# Cat all log files
find /output/logs/extract_sumstats -name "log_file.txt" -exec cat {} \; | less
find output/logs/coloc -name "log_file.txt" -exec cat {} \; | less

find /output/cache -name "sumstat.tsv.gz" -exec ls -l {} \; | less

find /output/logs/extract_sumstats -name "log_file.txt" -exec cat {} \; | grep -i "error" | less
find /output/logs/coloc -name "log_file.txt" -exec cat {} \; | grep -i "error" | less

find output/logs/coloc -name "coloc_log*.txt" -exec cat {} \; | grep -i "error" | less

gcloud beta dataproc clusters create \
    js-coloc-beta-join \
    --image-version=preview \
    --properties=spark:spark.debug.maxToStringFields=100 \
    --master-machine-type=n2-highmem-8 \
    --master-boot-disk-size=1TB \
    --zone=europe-west1-d \
    --initialization-action-timeout=20m \
    --single-node \
    --project=open-targets-genetics-dev \
    --region=europe-west1 \

gcloud dataproc jobs submit pyspark \
    --cluster=js-coloc-beta-join \
    --async \
    --properties spark.submit.deployMode=cluster \
    --project=open-targets-genetics-dev \
    --region=europe-west1 \

OLD stuff

Setup environment


git clone
cd genetics-colocalisation
# The last time I ran this it would hang at "solving environment..."
# I got around this by creating the env and then manually installing
# each package with conda install <name>
# Offending packages is probably r-coloc. Workaround is to open R
# and manually install coloc.
conda env create -n coloc --file environment.yaml

Run a single study (OLD)

# Activate environment
source activate coloc

# View args
$ python scripts/ --help
usage: [-h] --left_sumstat <file> --left_type <str>
                        --left_study <str> [--left_phenotype <str>]
                        [--left_bio_feature <str>] --left_chrom <str>
                        --left_pos <int> [--left_ref <str>] [--left_alt <str>]
                        [--left_ld <str>] --right_sumstat <file> --right_type
                        <str> --right_study <str> [--right_phenotype <str>]
                        [--right_bio_feature <str>] --right_chrom <str>
                        --right_pos <int> [--right_ref <str>]
                        [--right_alt <str>] [--right_ld <str>] --method
                        {conditional,distance} --window_coloc <int>
                        --window_cond <int> [--min_maf <float>]
                        --r_coloc_script <str> [--delete_tmpdir]
                        [--top_loci <str>] --out <file> [--plot <file>] --log
                        <file> --tmpdir <file>

optional arguments:
  -h, --help            show this help message and exit
  --left_sumstat <file>
                        Input: left summary stats parquet file
  --left_type <str>     Left type
  --left_study <str>    Left study_id
  --left_phenotype <str>
                        Left phenotype_id
  --left_bio_feature <str>
                        Left bio_feature
  --left_chrom <str>    Left chromomsome
  --left_pos <int>      Left position
  --left_ref <str>      Left ref allele
  --left_alt <str>      Left alt allele
  --left_ld <str>       Left LD plink reference
  --right_sumstat <file>
                        Input: Right summary stats parquet file
  --right_type <str>    Right type
  --right_study <str>   Right study_id
  --right_phenotype <str>
                        Right phenotype_id
  --right_bio_feature <str>
                        Right bio_feature
  --right_chrom <str>   Right chromomsome
  --right_pos <int>     Right position
  --right_ref <str>     Right ref allele
  --right_alt <str>     Right alt allele
  --right_ld <str>      Right LD plink reference
  --method {conditional,distance}
                        Which method to run (i) conditional analysis or (ii)
                        distance based without conditional
  --window_coloc <int>  Plus/minus window (kb) to perform coloc on
  --window_cond <int>   Plus/minus window (kb) to perform conditional analysis
  --min_maf <float>     Minimum minor allele frequency to be included
  --r_coloc_script <str>
                        R script that implements coloc
  --delete_tmpdir       Remove temp dir when complete
  --top_loci <str>      Input: Top loci table (required for conditional
  --out <file>          Output: Coloc results
  --plot <file>         Output: Plot of colocalisation
  --log <file>          Output: log file
  --tmpdir <file>       Output: temp dir

Step 2 (OLD): Prepare environment

# Activate environment
source activate coloc

# Set spark paths
export PYSPARK_SUBMIT_ARGS="--driver-memory 80g pyspark-shell"
export SPARK_HOME=/home/ubuntu/software/spark-2.4.0-bin-hadoop2.7

Manifest file details

The manifest file specifies all analyses to be run. The manifest is a JSON lines file with each line containing the following fields:

  "left_sumstats": "/home/ubuntu/data/sumstats/filtered/significant_window_2mb/gwas/NEALE2_30530_raw.parquet",
  "left_ld": "/home/ubuntu/data/genotypes/ukb_v3_downsampled10k_plink/ukb_v3_chr7.downsampled10k",
  "left_study_id": "NEALE2_30530_raw",
  "left_type": "gwas",
  "left_phenotype_id": null,
  "left_bio_feature": null,
  "left_lead_chrom": "7",
  "left_lead_pos": 73627972,
  "left_lead_ref": "GCTTT",
  "left_lead_alt": "G",
  "right_sumstats": "/home/ubuntu/data/sumstats/filtered/significant_window_2mb/molecular_trait/ALASOO_2018.parquet",
  "right_ld": "/home/ubuntu/data/genotypes/ukb_v3_downsampled10k_plink/ukb_v3_chr7.downsampled10k",
  "right_study_id": "ALASOO_2018",
  "right_type": "eqtl",
  "right_phenotype_id": "ENSG00000071462",
  "right_bio_feature": "MACROPHAGE_SALMONELLA",
  "right_lead_chrom": "7",
  "right_lead_pos": 73676792,
  "right_lead_ref": "C",
  "right_lead_alt": "T",
  "method": "conditional",
  "out": "/home/ubuntu/results/coloc/output/left_study=NEALE2_30530_raw/left_phenotype=None/left_bio_feature=None/left_variant=7_73627972_GCTTT_G/right_study=ALASOO_2018/right_phenotype=ENSG00000071462/right_bio_feature=MACROPHAGE_SALMONELLA/right_variant=7_73676792_C_T/coloc_res.json.gz",
  "log": "/home/ubuntu/results/coloc/logs/left_study=NEALE2_30530_raw/left_phenotype=None/left_bio_feature=None/left_variant=7_73627972_GCTTT_G/right_study=ALASOO_2018/right_phenotype=ENSG00000071462/right_bio_feature=MACROPHAGE_SALMONELLA/right_variant=7_73676792_C_T/log_file.txt",
  "tmpdir": "/home/ubuntu/results/coloc/tmp/left_study=NEALE2_30530_raw/left_phenotype=None/left_bio_feature=None/left_variant=7_73627972_GCTTT_G/right_study=ALASOO_2018/right_phenotype=ENSG00000071462/right_bio_feature=MACROPHAGE_SALMONELLA/right_variant=7_73676792_C_T",
  "plot": "/home/ubuntu/results/coloc/plot/NEALE2_30530_raw_None_None_7_73627972_GCTTT_G_ALASOO_2018_ENSG00000071462_MACROPHAGE_SALMONELLA_7_73676792_C_T.png"