This repository contains the commands and important data files that I used for the analysis presented in this blog post.
The following files are provided:
- The actual reference genome used by Nebula Genomics (download and
  gunzip this file and the .tbi file) in the subfolder reference-genome
- The "alleles VCF" file containing the locations and alleles for all DNA variants used in the
  PRSKB CLI tool (as of 2024-05) in the subfolder PRSKB
The commands to calculate PRS are provided directly in this README.
Disclaimer: This is a proof-of-concept analysis that disregards several cases (InDels) and contains hacky one-liners without any error checking.
You will need to download the following raw data files from Nebula Genomics.
- The DNA variants file. For me, the file was named like
  NG123456M.mm2.sortdup.bqsr.hc.vcf.gz. Let's rename this file to me.vcf.gz.
- The DNA variants index file. For me, the file was named like
  NG123456M.mm2.sortdup.bqsr.hc.vcf.gz.tbi. Let's rename this file to me.vcf.gz.tbi.
- The alignment file. For me, this file was named like
  NG123456M.mm2.sortdup.bqsr.cram. Let's rename this file to me.cram.
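The renaming can be done with mv (NG123456M is a placeholder; substitute your own file names):

```shell
# Rename the Nebula Genomics downloads to the short names used throughout
# this README (replace NG123456M... with your actual file names).
mv NG123456M.mm2.sortdup.bqsr.hc.vcf.gz me.vcf.gz
mv NG123456M.mm2.sortdup.bqsr.hc.vcf.gz.tbi me.vcf.gz.tbi
mv NG123456M.mm2.sortdup.bqsr.cram me.cram
```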
It will be very useful if we can annotate our DNA variants with dbSNP identifiers, for which we need a file with all human SNPs. However, the dbSNP database is growing very fast and the files get very big. I therefore use the following slightly older, preprocessed file, with a size of 3.2 GB.
wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz
wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz.tbi
All commands work on Linux only (for example, GNU Awk is used) and may need 32 GB of memory or more.
I suggest using the conda package manager to install the required tools in specific versions. Please follow the instructions here.
We need access to bioconda:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
Once you have the conda environment activated, install the variant calling tool GATK4. A specific version is needed to work with the CRAM files of Nebula Genomics (see this issue).
conda install gatk4=4.2.6.1
The htslib package is needed for tools like bgzip and tabix.
conda install htslib
Furthermore, we install some Python tools with pip. This is best done in a dedicated virtualenv:
pip install virtualenv
python -m virtualenv venv
source ./venv/bin/activate
At this point, your command prompt should begin with (venv) (base), showing that both environments are active.
Download the PRSKB CLI tool from here and install the required dependencies:
pip install "setuptools<58.0"
pip install filelock requests pyvcf myvariant Bio
Then comment out this line from the file connect_to_server.py.
We also need the jq tool to work with the JSON files. On Ubuntu/Debian it can be installed with sudo apt-get install jq.
Note: If you just want to run the analysis on your variants, you can skip this step and just use the files PRSKB_all_variants.dbsnp.vcf.gz and PRSKB_all_variants.dbsnp.vcf.gz.tbi provided in the subfolder PRSKB of this repository.
The easiest way to obtain the variants used by the PRSKB CLI is to run it, as it will download the required data and keep it in the folder .workingFiles. We can simply use the VCF from Nebula Genomics for this "dummy run":
bash runPrsCLI.sh -f me.vcf.gz -o PRS.from_NG_vcf.tsv -r hg38 -p EUR -c 0.05
We will look at the parameters later.
The file .workingFiles/allAssociations_hg38.txt now contains all associations of SNPs with
study-trait-combinations within one very big JSON file. The jq tool allows us to extract the
required information from this JSON file and store it in the tab-separated text file
allAllelesFromAssociations.tsv.
Next, we must turn these into valid VCF records suitable for variant calling. We must obtain the
correct REF alleles, which we do by joining the TSV file above with the big dbSNP table downloaded
earlier on rsIDs using awk. Effect alleles are stored in the INFO column of the VCF record. In the last step, we iterate over the VCF records, checking if those effect alleles are among the alleles for that rsID. If not, we add them.
cat .workingFiles/allAssociations_hg38.txt | jq -r '.associations | to_entries | .[] | .key as $rsid | .value | objects | .pos as $pos | .traits | .[] | .[] | .[] | [$pos, $rsid, (keys[])] | @tsv' | uniq | sort | uniq >allAllelesFromAssociations.tsv
awk -F'\t' -v OFS='\t' 'FNR==NR {if ($2 in d) d[$2] = d[$2]","$3; else d[$2]=$3; next;} ($3 in d) { print $1,$2,$3,$4,$5,1,".","A="d[$3]," "," "}' allAllelesFromAssociations.tsv <(zcat dbsnp_146.hg38.vcf.gz) >dbSnpWithAllelesFromAssoc.pre.vcf
# There are 1728 cases where the effect allele is not one of the alleles in dbSNP - we need to add them
cat dbSnpWithAllelesFromAssoc.pre.vcf | awk -F '\t' -v OFS='\t' '{i=$8; gsub("A=", "", i); split(i,a,","); for (e in a) if (a[e] != $4 && index($5, a[e]) == 0) $5=$5","a[e]; print $0;}' >dbSnpWithAllelesFromAssoc.pre.fixed.vcf
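To see what this fix-up step does, here is a minimal sketch on a single hand-made record (REF G, ALT A, effect alleles C and T in the INFO column): both effect alleles are missing from ALT, so the awk program appends them.

```shell
# Toy record with REF=G, ALT=A and effect alleles C,T in INFO (A=C,T).
# Neither C nor T equals REF or appears in ALT, so both get appended to ALT.
printf '1\t100\trs1\tG\tA\t1\t.\tA=C,T\t \t \n' | \
  awk -F '\t' -v OFS='\t' '{i=$8; gsub("A=", "", i); split(i,a,","); for (e in a) if (a[e] != $4 && index($5, a[e]) == 0) $5=$5","a[e]; print $0;}'
```

ALT becomes A plus C and T (the iteration order of awk's for-in loop is unspecified, so the order of the appended alleles may vary).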
Lastly, we turn this collection of VCF records into a valid VCF file. A simple option is to just use the VCF header from Nebula Genomics' VCF file, and concatenate it with the records. We also compress and index the resulting VCF file.
bcftools view -h me.vcf.gz >PRSKB_all_variants.dbsnp.vcf
cat dbSnpWithAllelesFromAssoc.pre.fixed.vcf >>PRSKB_all_variants.dbsnp.vcf
bgzip PRSKB_all_variants.dbsnp.vcf
tabix PRSKB_all_variants.dbsnp.vcf.gz
The GATK 4 Haplotype caller can be used to call your variants based on the locations and alleles
from PRSKB. We use a special mode of GATK, activated with the --alleles, -L and --output-mode
options. Important additional input files for this tool are the aligned reads (me.cram, downloaded
from NG) and the reference genome used by NG (provided in this repository in the ReferenceFromNebula
folder). For convenience, the rsIDs are added to the output VCF using the --dbsnp option.
gatk --java-options '-Xms2g -Xmx4g' HaplotypeCaller -R ReferenceFromNebula/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna -L PRSKB_all_variants.dbsnp.vcf.gz --alleles PRSKB_all_variants.dbsnp.vcf.gz -I me.cram --output-mode EMIT_VARIANTS_ONLY --dbsnp PRSKB_all_variants.dbsnp.vcf.gz -O my_wgs.PRSKB_all_variants.vcf.gz
It is time to calculate the scores which can be done with just one command. We have to provide the
DNA variants calculated in the previous step. The verbose option -v is required to get the
percentiles. We need to specify the correct reference genome with -r hg38, the super-population
with -p EUR (adapt if required) and the p-value threshold with -c 0.05. The final option -h 0.1 defines the imputation threshold, an upper limit for how many SNPs are allowed to be imputed:
"We divide the number of imputed SNPs by the total number of SNPs in the calculation and if that
number exceeds the threshold we do not report that study." Setting it to 10% is a safety measure to
ensure that we considered enough SNPs per study.
bash runPrsCLI.sh -f my_wgs.PRSKB_all_variants.vcf.gz -v -o PRS.PRSKB_all_variants.tsv -r hg38 -p EUR -c 0.05 -h 0.1
The output file PRS.PRSKB_all_variants.tsv is a big table that has the percentile in column 15.
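A quick way to inspect the column layout (assuming a tab-separated header row) and extract the percentile column:

```shell
# Number the header fields to see which column is which, then pull out
# column 15 (the percentile) alongside column 1.
head -n1 PRS.PRSKB_all_variants.tsv | tr '\t' '\n' | nl
cut -f1,15 PRS.PRSKB_all_variants.tsv | head
```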
We could compute the PGS in a similar way as we did with PRSKB CLI. That is
- Collect all locations and alleles of all relevant PGS variants in one big VCF file
- Use GATK4 to call DNA variants at these locations
- Transform the resulting variant VCF into a form suitable for pgsc_calc (hint: use
  bcftools norm -m -any to get bi-allelic VCF records)
I tried this, but it turned out that there are several small complications, like a more difficult join with the dbSNP database and more difficult transformations of the variant VCF. So I chose an alternative approach that focuses on the locations of the PGS variants only.
Install Docker as described here.
Install Nextflow:
conda install nextflow
Download the reference database required to compute percentiles:
wget https://ftp.ebi.ac.uk/pub/databases/spot/pgs/resources/pgsc_HGDP+1kGP_v1.tar.zst
We will do this analysis for the six PGS PGS000662, PGS000192, PGS000043, PGS000210, PGS000882, PGS000904. I tried to do the analysis for all scores in one go, but it did not work out. Please replace these PGS with whatever list of scores you are interested in.
mkdir scorefiles
cd scorefiles
pip install pgscatalog-utils
# The following unfortunately does not seem to work - the resulting file does not contain all locations
# wget https://ftp.ebi.ac.uk/pub/databases/spot/pgs/pgs_scores_list.txt
# for pgsid in $(cat pgs_scores_list.txt); do echo $pgsid; pgscatalog-download -i $pgsid -o . -b GRCh38; done;
for pgsid in PGS000662 PGS000192 PGS000043 PGS000210 PGS000882 PGS000904; do echo $pgsid; pgscatalog-download -i $pgsid -o . -b GRCh38; done;
pgscatalog-combine -s PGS*.txt.gz -t GRCh38 -o all_scores_combined.txt.gz
cd ..
We can now collect a list of variant locations, where we will exclude some special cases (chromosomes not included in Nebula Genomics' reference genome, undefined loci).
zcat scorefiles/all_scores_combined.txt.gz | grep -v -e "chr" -e "##" | awk -F'\t' '{hash = $1":"$2; dict[hash];} END { for (h in dict) print h;}' | grep -v -e "::" -e "MT" -e "alt" | sort -t':' -k1,1 -k2,2n | awk '{print "chr"$0;}' >scorefiles_locations.list
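As a sanity check, the same pipeline (minus the header and "chr"-prefix filtering) applied to three hand-made rows shows the intended behavior: the duplicate location collapses to one entry, MT is filtered out, and "chr" is prepended.

```shell
# Toy input: chromosome and position columns only. The duplicate 1:12345
# collapses to one entry, MT is filtered out, and "chr" is prepended.
printf '1\t12345\nMT\t100\n1\t12345\n2\t500\n' | \
  awk -F'\t' '{hash = $1":"$2; dict[hash];} END { for (h in dict) print h;}' | \
  grep -v -e "::" -e "MT" -e "alt" | sort -t':' -k1,1 -k2,2n | awk '{print "chr"$0;}'
# Output:
# chr1:12345
# chr2:500
```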
The GATK HaplotypeCaller will be used to determine DNA variants on all locations from our list. In this case, two steps are required.
gatk --java-options '-Xms2g -Xmx4g' HaplotypeCaller -R ReferenceFromNebula/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna -L scorefiles_locations.list -I me.cram -O my_wgs.scorefile_locations.vcf.gz -ERC BP_RESOLUTION
gatk --java-options '-Xms2g -Xmx4g' GenotypeGVCFs -R ReferenceFromNebula/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna -V my_wgs.scorefile_locations.vcf.gz -O my_wgs.scorefile_locations.gt.vcf.gz --dbsnp dbsnp_146.hg38.vcf.gz --include-non-variant-sites true
We now have to transform this VCF into a form suitable for pgsc_calc (which is geared towards input
data from arrays/imputation servers). In particular, this means adding the effect and other alleles from
the combined scorefile for homozygous reference variants, because otherwise this very large fraction
of variants would not be matched. At this point we will cut some corners, like ignoring entries
where other_allele is not set (an empty ALT is not accepted!).
The general idea is to first read in the combined scorefile and store all locations together with
the effect allele and other allele in memory. Then go through our variant file, and if we find a
homref variant (ALT is not set) then add entries for all possible effect alleles and their other
alleles. A new record is created for each other allele, as pgsc_calc only supports bi-allelic
records. We also deal with the special case where GATK sets the QUAL to "Infinity", which is also
not supported.
awk -F'\t' -v OFS='\t' 'NR == FNR {h="chr"$1":"$2; if ($4) d[h][$3]=$4; next;} $6 == "Infinity" { $6 = 10000} $5 == "." {h=$1":"$2; if (h in d) for (ea in d[h]) { if ($4 != ea) { $5 = ea } else { $5 = d[h][ea] }; if (!seen[h":"$4":"$5]++) print; }; next;} {print;}' <(zcat scorefiles/all_scores_combined.txt.gz) <(zcat my_wgs.scorefile_locations.gt.vcf.gz) | bgzip -c >my_wgs.scorefile_locations.homrefAllelesAdded.vcf.gz
tabix my_wgs.scorefile_locations.homrefAllelesAdded.vcf.gz
Now we prepare the folder structure and files required for pgsc_calc (as detailed in their guide).
mkdir my-wgs
ln -s my_wgs.scorefile_locations.homrefAllelesAdded.vcf.gz my-wgs/me.vcf.gz
ln -s my_wgs.scorefile_locations.homrefAllelesAdded.vcf.gz.tbi my-wgs/me.vcf.gz.tbi
echo -e "sampleset,path_prefix,chrom,format\nmy-wgs,$PWD/my-wgs/me,,vcf">samplesheet.csv
nextflow run pgscatalog/pgsc_calc -profile docker --input samplesheet.csv --target_build GRCh38 --pgs_id PGS000662,PGS000192,PGS000043,PGS000210,PGS000882,PGS000904 --run_ancestry pgsc_HGDP+1kGP_v1.tar.zst --keep_ambiguous
After the analysis, the score will be in the folder results/my-wgs/score/. If you run the pipeline again with changed input, remember to delete the work/ directory.