CELLECT LDSC Tutorial

This tutorial will take you through running CELLECT-LDSC on two GWAS summary stats and two example expression specificity inputs.

Tl;dr

cd ~/CELLECT
# ---------------------- STEP 1: prep GWAS data ----------------- #
wget https://portals.broadinstitute.org/collaboration/giant/images/c/c8/Meta-analysis_Locke_et_al%2BUKBiobank_2018_UPDATED.txt.gz -P example/
wget https://www.dropbox.com/s/ho58e9jmytmpaf8/GWAS_EA_excl23andMe.txt -P example/

conda env create -f ldsc/environment_munge_ldsc.yml
conda activate munge_ldsc

python ldsc/mtag_munge.py \
--sumstats example/GWAS_EA_excl23andMe.txt \
--merge-alleles data/ldsc/w_hm3.snplist \
--n-value 766345 \
--keep-pval \
--p PVAL \
--out example/EA3_Lee2018


python ldsc/mtag_munge.py \
--sumstats example/Meta-analysis_Locke_et_al+UKBiobank_2018_UPDATED.txt.gz \
--a1 Tested_Allele \
--a2 Other_Allele \
--merge-alleles data/ldsc/w_hm3.snplist \
--keep-pval \
--p PVAL \
--out example/BMI_Yengo2018

# --- STEP 2: Generate cell-type specificity input using CELLEX --- #
(CELLEX specificity files have been pre-generated for this tutorial)

# ---------------------- STEP 3: run CELLECT-LDSC ----------------- #
conda activate <env_with_snakemake>

snakemake --use-conda -j -s cellect-ldsc.snakefile --configfile config.yml

Step 0: Create conda environment for munging

Change to the CELLECT git repository root, e.g.:

cd ~/CELLECT

Create a new conda environment that has the correct package versions for GWAS munging:

conda env create -f ldsc/environment_munge_ldsc.yml # creates 'munge_ldsc' environment 
conda activate munge_ldsc

Step 1: Download and munge GWAS

Download GWAS summary stats. Here we download the BMI GWAS from Yengo (HMG, 2018) and Educational Attainment GWAS from Lee (Nat. Gen., 2018).

wget https://portals.broadinstitute.org/collaboration/giant/images/c/c8/Meta-analysis_Locke_et_al%2BUKBiobank_2018_UPDATED.txt.gz -P example/
wget https://www.dropbox.com/s/ho58e9jmytmpaf8/GWAS_EA_excl23andMe.txt -P example/

Munge the GWAS data. This step ensures that 1) the GWAS summary stats are correctly formatted; 2) we only analyze HapMap3 SNPs. We suggest using the script ldsc/mtag_munge.py for munging as it has more options:

python ldsc/mtag_munge.py \
--sumstats example/GWAS_EA_excl23andMe.txt \
--merge-alleles data/ldsc/w_hm3.snplist \
--n-value 766345 \
--keep-pval \
--p PVAL \
--out example/EA3_Lee2018

Munge BMI GWAS:

python ldsc/mtag_munge.py \
--sumstats example/Meta-analysis_Locke_et_al+UKBiobank_2018_UPDATED.txt.gz \
--a1 Tested_Allele \
--a2 Other_Allele \
--merge-alleles data/ldsc/w_hm3.snplist \
--keep-pval \
--p PVAL \
--out example/BMI_Yengo2018

Bonus info 1: the --keep-pval --p PVAL arguments are only needed to make the munged summary stats compatible with CELLECT-MAGMA. Since you might want to run both CELLECT-LDSC and CELLECT-MAGMA, we recommend you make it a habit to add the arguments.

Bonus info 2: you will essentially get the same results if using ldsc/munge_sumstats.py (just using --N instead of --n-value argument, and you have no option --keep-pval that makes the output compatible with CELLECT-MAGMA):

python ldsc/munge_sumstats.py \
--sumstats example/GWAS_EA_excl23andMe.txt \
--merge-alleles data/ldsc/w_hm3.snplist \
--N 766345 \
--p PVAL \
--out example/EA3_Lee2018

Run python ldsc/munge_sumstats.py -h and python ldsc/mtag_munge.py -h to see how the two programs differ in the options.

Bonus info 2: On the wiki page GWAS-datasets we list several repositories containing publically available GWAS summary statistics. Check it out if you want to 'bulk download' GWAS summary stats.

Step 2: Generate cell-type specificity input using CELLEX

CELLECT uses cell-type specificity matrix files as input. We recommend using CELLEX to generate these files, which can be done in python in a few lines, e.g.:

import cellex
eso = cellex.ESObject(df=mousebrain_sc_rnaseq_data, annotation=celltype_labels)
eso.compute()
eso.results["esmu"].to_csv("mousebrain-test.csv.gz")

Here we have pre-generated two example CELLEX specificity files for you: example/mousebrain-test.csv and example/tabula_muris-test.csv containing two cell-types from the Mousebrain dataset and five cell-types from the Tabula Muris dataset.

Step 3: Run CELLECT-LDSC

Now we will run the workflow. Remember, running CELLECT-LDSC requires having the snakemake library available (e.g. activate an environment with snakemake installed:)

conda activate <env_with_snakemake>

Then run the workflow:

snakemake --use-conda -j -s cellect-ldsc.snakefile --configfile config.yml

The first time you run the workflow, snakemake will download and install local conda environments in ./.snakemake. These environments ensure that all dependencies are correctly installed. CELLECT-LDSC is unlikely to work without the --use-conda flag.

The above command is configured to output results in ./CELLECT-EXAMPLE. To change this open the config.yml file and edit the BASE_OUTPUT_DIR to specify the output directory.

The config file is preconfigured to prioritize the two CELLEX specificity inputs for each of the two GWAS datasets we just downloaded.

Running the workflow should take 5-15 minutes depending on the available number of cores on your system. Here we run the workflow using all available cores on the computer (-j). If you wish to use only 4 cores, just pass the -j 4 flag.

Output files

In ./CELLECT-EXAMPLE/CELLECT-LDSC/results/prioritization.csv you will see the following prioritization output:

gwas,specificity_id,annotation,beta,beta_se,pvalue
BMI_Yengo2018,tabula_muris-test,Brain_Non-Myeloid.Bergmann_glial_cell,9.231153572792368e-09,4.019151323331188e-09,0.010815326414746898
BMI_Yengo2018,tabula_muris-test,Bladder.bladder_cell,-5.004712418748802e-10,3.2673818983229307e-09,0.5608686595651585
BMI_Yengo2018,tabula_muris-test,Brain_Myeloid.microglial_cell,-2.6351066098370078e-09,3.5627583125482456e-09,0.7702363432351039
BMI_Yengo2018,tabula_muris-test,Brain_Myeloid.macrophage,-4.20188757005791e-09,5.126292004971455e-09,0.7937989730524113
BMI_Yengo2018,tabula_muris-test,Bladder.bladder_urothelial_cell,-9.444074983296222e-09,3.099358846619276e-09,0.9988447189868584
EA3_Lee2018,tabula_muris-test,Brain_Non-Myeloid.Bergmann_glial_cell,8.49739420095679e-09,1.911986590915025e-09,4.409437303659968e-06
EA3_Lee2018,tabula_muris-test,Brain_Myeloid.macrophage,2.214934890235212e-09,2.716854415111798e-09,0.20746257567581028
EA3_Lee2018,tabula_muris-test,Brain_Myeloid.microglial_cell,-6.441126119804287e-10,1.753163408008973e-09,0.6433397441548794
EA3_Lee2018,tabula_muris-test,Bladder.bladder_urothelial_cell,-4.589906843053639e-09,1.6159150917314255e-09,0.9977474194521412
EA3_Lee2018,tabula_muris-test,Bladder.bladder_cell,-4.421016110716137e-09,1.3525251453635627e-09,0.9994598102836466
BMI_Yengo2018,mousebrain-test,ACBG,-7.391698263466479e-09,3.5090024712828697e-09,0.9824193375098678
BMI_Yengo2018,mousebrain-test,ABC,-9.899458888243979e-09,3.763677148602214e-09,0.9957340520312591
EA3_Lee2018,mousebrain-test,ACBG,1.3369407818499012e-09,1.9809168750306665e-09,0.2498664455262536
EA3_Lee2018,mousebrain-test,ABC,-7.091136899549197e-09,1.749536904646236e-09,0.9999747337967729

See Input & Output for a full description of the output files.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly