Genome-wide Polygenic Risk Score (GPRS)

This package aims to generate the PRS model from GWAS summary statistics. It is designed to deal with GWAS summary statistics from the GWAS catalog and GeneATLAS database or similar format datasets.

Understanding the workflow of this package:

Filter GWAS summary statistics files (unify the data format. optional: remove duplicate SNPID and select significant SNPs by P-value)
Generate bfiles by Plink1.9
Do clumping and thresholding by Plink1.9
Generate PRS model by Plink2.0

Environment setup

Setup virtualenv

$ python3 -m venv venv

Activate virtualenv

$ source ./venv/bin/activate

Install this package

$ pip install -r requirements.txt
$ pip install -e .

If you are USC users

$ pip install . --user

Additional requirements

Install both version 1.9 and 2.0 plink. https://zzz.bwh.harvard.edu/plink/download.shtml

If you are USC users: Please load the plink modules by using:

$ module load plink2

$ module load gcc/11.2.0
$ module load plink/1.9-beta6.24

Prepare dataset

Download GWAS summary statistics from GWAS catalog and GeneATLAS. (The GWAS summary statistics should have contains the information: SNPID, ALLELE, BETA, P-value, and StdErr)

Usage guidance

Before starting:

If your GWAS summary statistics data are zipped, please unzip your .gz file first.

Two ways to run the package: Python or Commandline Interface

1. Use Python

For example, if you want to run GeneAtlas, create a python file, e.g: gene_atlas_model.py. Paste the following code in the file and change the configuration to your settings.

from gprs.gene_atlas_model import GeneAtlasModel

if __name__ == '__main__':
    geneatlas = GeneAtlasModel( ref='/home/1000genomes/GRCh38',
                    data_dir='/home/Projects/GPRS/data/2014_GWAS_Height' )

    geneatlas.filter_data( snp_id_header='MarkerName',
                            allele_header='Allele1',
                            beta_header='b',
                            se_header ='SE',
                            pvalue_header='p',
                            output_name='2014height')

    geneatlas.generate_plink_bfiles(snplist_name='2014height', output_name='2014height',extra_commands="--vcf-half-call r" ,symbol='.genotypes')

    geneatlas.clump(output_name='2014height',plink_bfile_name='2014height',
                    qc_file_name='2014height',clump_kb='250',
                    clump_p1='0.02',clump_p2='0.02', clump_r2='0.02')

    geneatlas.select_clump_snps(output_name='2014height',
                                clump_file_name='2014height',
                                qc_file_name='2014height',
                                clump_kb='250',
                                clump_p1='0.02',
                                clump_r2='0.5',clumpfolder_name='2014height')

    geneatlas.build_prs( vcf_input= 'home/1000genomes/hg19',
                         output_name ='2014height', memory='1000',clump_kb='250',
                         clump_p1='0.02', clump_r2='0.02', qc_clump_snplist_foldername='2014height')

    geneatlas.combine_prs(filename="2014height",
                          clump_r2="0.5",clump_kb="250",clump_p1="0.02")

    geneatlas.prs_statistics(output_name='2014height', score_file = "home/GPRS/tmp/2014height_250_0.02_0.1.sscore",
                             pheno_file = "prs/2014height_pheno.csv",
                             r_command='/bin/Rscript',
                             prs_stats_R="/GPRS/gprs/prs_stats_quantitative_phenotype.R", 
                             data_set_name="2014height",
                             clump_kb='250',
                             clump_p1='0.02',
                             clump_r2='0.1')
    
    geneatlas.subset_pop(input_data="home/1000genomes/hg19/integrated_call_samples_v3.20130502.ALL.panel",
                         column_name="super_pop",pop_info = "EUR", output_name = "2014height")

    geneatlas.generate_plink_bfiles_w_individual_info(popfile_name="2014height", output_name="2014height",bfile_name="2014height")
    
    geneatlas.subset_vcf_w_random_sample(fam_dir="/home/result/plink/bfiles", fam_filename="2014height", samplesize=50, vcf_input="1000genomes/hg19", symbol=".")
    
    geneatlas.random_draw_samples_from_fam(fam_dir="/home/result/plink/bfiles",fam_filename="2014height", samplesize= "500", tag="LD_reference")

2. Use Commandline Interface

$ gprs geneatlas-filter-data --ref [str] --data_dir [str] --result_dir [str] --snp_id_header [str] --allele_header [str] --beta_header [str] --se_header [str] --pvalue_header [str] --pvalue [float/scientific notation] --output_name [str]  
$ gprs gwas-filter-data --ref [str] --data_dir [str] --result_dir [str] --snp_id_header [str] --allele_header  [str] --beta_header [str] --se_header [str] --pvalue_header [str] --pvalue [float/scientific notation] --output_name [str]  
$ gprs generate-plink-bfiles --ref [str] --snplist_name [str] --output_name [str] --symbol [str] --extra_commands [str] 
$ gprs clump --plink_bfile_name [str] --output_name [str] --clump_kb [int] --clump_p1 [float/scientific notation] --clump_p2 [float/scientific notation] --clump_r2 [float] --clump_field [str] --qc_file_name [str] --clump_snp_field [str]   
$ gprs select-clump-snps --qc_file_name [str] --clump_file_name [str] --output_name [output name] --clump_kb [int] --clump_p1 [float/scientific notation] --clump_r2 [float] --clumpfolder_name [str]
$ gprs build-prs --vcf_input [str] --output_name [str] --qc_clump_snplist_foldername [str] --memory [int] --clump_kb [int] --clump_p1 [float/scientific notation] --clump_r2 [float] --symbol [str/int] --columns [int] --plink_modifier [str]  
$ gprs combine-prs --filename [str] --clump_kb [int] --clump_p1 [float/scientific notation] --clump_r2 [float]
$ gprs prs-statistics --score_file [str] --pheno_file [str] --output_name [str] --data_set_name [str] --prs_stats_R [str] --r_command [str] --clump_kb [int] --clump_p1 [float/scientific notation] --clump_r2 [float]
$ gprs combine-prs-stat --data_set_name [str] --clump_kb [int] --clump_p1 [float/scientific notation] --clump_r2 [float]

optional function

If alleles are a, t, c, g instead of capital A, T, C, G it might affect the further analysis.

$ gprs transfer_atcg --qc_file_name [str]
$ gprs subset_pop --input_data [str] --column_name [str] --pop_info [str] --output_name [str]
$ gprs generate-plink-bfiles-w-individual-info --popfile_name [str] --bfile_name [str] --output_name [str]
$ gprs subset-vcf-w-random-sample --fam_dir [str] --fam_filename [str] --samplesize [int] --vcf_input [str] --symbol [str/int]
$ gprs random_draw_samples_from_fam --fam_dir [str] --fam_filename [str] --samplesize [int] --tag [str]

Commands in gprs package:

Thirteen commands in gprs:

geneatlas-filter-data
gwas-filter-data
generate-plink-bfiles
clump
select-clump-snps
build-prs
combine-prs
prs-statistics
combine-prs-stat
subset_pop (optional)
generate_plink_bfiles_w_individual_info (optional)
transfer_atcg (optional)
subset_vcf_w_random_sample (optional)

Result folder

In the first step, you need to indicate the path to creating the result folder. Five folders will automatically generate under the result folder by script.

qc folder: ./result/qc/
snplists folder : ./result/snplists/
bfile folder: ./result/plink/bfiles
clump folder: ./result/plink/clump
qc_and_clump_snpslist folder: ./result/plink/qc_and_clump_snpslist
prs folder: ./result/plink/prs
pop folder: ./result/pop/
random_draw_sample folder: ./result/random_draw_sample/

❗ Users have to indicate reference and result directories every time when using the command interface.

❗ Users have to provide output_name every time when they execute the commands. The output_name should be the same in every execution.

Output file format

This package will generate output files below:

*.QC.csv
*.csv
*.bim
*.bed
*.fam
*.clump
*.qc_clump_snpslist.csv
*.sscore
*_stat.txt
*_combined_stat.txt

All output files will be named as: [chrnb]_[name].[extension]. The chrnb will given automatically, users only have to give [name] while using the package. Thus, it is better use the same output name to generate all files.

--output_name
--snplist_name
--qc_file_name
--clump_file_name
--plink_bfile_name

`gprs geneatlas-filter-data`

Filter GeneAtlas csv file by P-value and unify the data format as following order: SNPID, ALLELE, BETA, StdErr, Pvalue

Options:

  --ref                     path to population reference panel  [required]
  --data_dir                The directory of GeneAtlas csv files (all 1-24 chr) [required]
  --result_dir              path to output folder; default:[./result]
  --snp_id_header           SNP ID column name in GeneAtlas original file  [required]
  --allele_header           ALLELE column name in GeneAtlas original file  [required]
  --beta_header             BETA column name in GeneAtlas original file [required]
  --se_header               StdErr column name in GeneAtlas original file  [required]
  --pvalue_header           P-value column name in GeneAtlas original file  [required]
  --output_name             output name; default: "geneatlas"; the output file name is [chrnb]_[output_name].csv and [chrnb]_[output_name].QC.csv
  --pvalue                  P-value threshold
  --help                    Show this message and exit.

Result:

This option generates two types of output in qc and snplists folders:

*.QC.csv (QC files )
*.csv (snplist)

`gprs gwas-filter-data`

Filter GeneAtlas csv file by P-value and unify the data format as following order: SNPID, ALLELE, BETA, StdErr, Pvalue

Options:

  --ref                      path to population reference panel  [required]
  --data_dir                 path to GWAS catalog summary statistic csv file (all 1-24 chr)  [required]
  --result_dir               path to output folder; default:[./result]
  --snp_id_header            SNP ID column name in GWAS catalog original file  [required]
  --allele_header            ALLELE column name in GWAS catalog original file  [required]
  --beta_header              BETA column name in GWAS catalog original file  [required]
  --se_header                StdErr column name in GWAS catalog original file  [required]
  --pvalue_header            P-value column name in GWAS catalog original file  [required]
  --output_name              output name; default: "gwas"; the output file named is [chrnb]_[output_name].csv and [chrnb]_[output_name].QC.csv
  --pvalue                   P-value threshold for filtering SNPs
  --help                     Show this message and exit.

Result:

This option generates two types of output in qc and snplists folders:

*.QC.csv (QC files )
*.csv (snplist)

`gprs generate-plink-bfiles`

This option encodes plink1.9 make-bed function

plink --vcf [ref] --extract [snplists after qc] --make-bed --out [bfile folder/output_name]

snplists and bfiles folders will automatically be filled in the script. Users have to indicate ref and output_name only.

Options:

  --ref                path to population reference panel  [required]
  --output_name        output name
  --symbol             indicate the symbol or text after chrnb in vcf file, default = "." ; i.e. ALL.chr8.vcf.gz, you can put "." or ".vcf.gz" 
  --snplist_name       snplist_name is [output_name] from [chrnb]_[output_name].csv [required]
  --extra_commands     put your extract function if needed, otherwise no need to specify this argument,
  --help               Show this message and exit.

Result:

This option will generate three files in bfiles folder:

*.bim
*.bed
*.fam

`gprs clump`

This option encodes plink1.9 clump function

plink --bfile [bfiles] --clump [qc snpslists] --clump-p1  --clump-p2  --clump-r2  --clump-kb  --clump-field  --clump-snp-field  --out

The plink_bfiles_dir, qc snpslists and clump_output_dir will automatically be filled in the script. Users have to indicate the options below.

Options:

 --clump_kb                 distance(kb) parameter for clumping [required]
 --clump_p1                 first set of P-value for clumping [required]
 --clump_p2                 should equals to p1 reduce the snps [required]
 --qc_file_name             qc_file_name is [output_name] from [chrnb]_[output_name].QC.csv [required]
 --plink_bfile_name         plink_bfile_name is [output_name] from [chrnb]_[output_name].bim/bed/fam [required]
 --output_name              it is better if the output_name remain the same. The clump output: [chrnb]_[output_name]_clumped_snplist.csv [required]
 --clump_r2                 r2 value for clumping, default = 0.1
 --clump_field              P-value column name, default = Pvalue
 --clump_snp_field          SNP ID column name, default = SNPID
 --help                     Show this message and exit.

Result:

This option will generate one file in clump folder:

*.clump

`gprs select-clump-snps`

Options:

  --qc_dir                       path to qc folder, default: "./result/qc", the qc files were generated from gwas_filter_data or geneatlas_filter_data options
  --clump_output_dir             path to clump output folder, default: "./result/plink/clump"
  --qc_clump_snplists_dir        path to snpslist (after qc and clumping), default:"./result/plink/qc_and_clump_snpslist"
  --clump_file_name              clump_file_name is [output_name] from [chrnb]_[output_name].clump [required]
  --output_name                  it is better if the output_name remain the same. output: [chrnb]_[output_name]_clumped_snplist.csv [required]
  --clump_kb                     distance(kb) parameter for clumping [required]
  --clump_p1                     first set of P-value for clumping [required]
  --clump_r2                     r2 value for clumping, default = 0.1
  --help                         Show this message and exit.

Result:

This option will generate one file in qc_and_clump_snpslist folder:

*.qc_clump_snpslist.csv

`gprs build-prs`

This option encodes plink2.0 function

plink2 --vcf [vcf input] dosage=DS --score [snplists afte clumped and qc]  --out

The clumped qc snpslists and prs_output_dir will automatically be filled in the script. Users have to indicate the options below.

Options:

  --symbol                       indicate the symbol or text after chrnb in vcf file, default = "." ; i.e. ALL.chr8.vcf.gz, you can put "." or ".vcf.gz"
  --vcf_input                    path to vcf files  [required]
  --qc_clump_snplist_foldername  the folder name of .qc_clump_snpslist.csv file. The folder name should be 
                                 the same as the output name in select_clump_snps step
  --columns                      a column index indicate the [SNPID] [ALLELE] [BETA] position; column nb starts from 1
  --plink_modifier               no-mean-imputation as default in here, get more info by searching plink2.0 modifier
  --output_name                  it is better if the output_name remain the same. output: [chrnb]_[output_name].sscore
  --clump_kb                     distance(kb) parameter for clumping [required]
  --clump_p1                     first set of P-value for clumping [required]
  --clump_r2                     r2 value for clumping, default = 0.1
  --help                         Show this message and exit.

Result:

This option will generate .sscore files in prs folder:

*.sscore

`gprs combine-prs`

Combine-prs option will combine all .sscore files as one .sscore file.

Options:

  --filename          name of .sscore, i.e.  chr10_geneatlas_500_1e-7_0.05.sscore, The file name here is "geneatlas"
  --clump_kb          distance(kb) parameter for clumping [required]
  --clump_p1          first set of P-value for clumping [required]
  --clump_r2          r2 value for clumping, default = 0.1
  --help              Show this message and exit.

Result:

This option will generate one file in prs folder:

*.sscore

`gprs prs-statistics`

After obtained combined sscore file, prs-statistics calculate BETA, AIC, AUC, PseudoR2 and OR ratio

Options:

  --score_file             the absolute path to combined .sscore file [required]
  --pheno_file             the absolute path to pheno file  [required]
  --output_name            the output name  [required]
  --data_set_name          the name of the data-set i.e. gout_2019_GCST008970
                           [required]
  --clump_kb               distance(kb) parameter for clumping [required]
  --clump_p1               first set of P-value for clumping [required]
  --clump_r2               r2 value for clumping, default = 0.1
  --prs_stats_R            the absolute path to "prs_stats.R"  [required]
  --r_command              use "which R" in linux, and copy the path after
                           --r_command  [required]
  --help                   Show this message and exit.

Result:

This option will generate .txt file in stat folder:

*_stat.txt

`gprs combine-prs-stat`

If you have more than one trained PRS model, combine-prs-stat function is designed to combine statistics results. For instance: the first PRS model was filtered with P < 0.05, the second PRS model was filtered with P < 0.0005. You will have DATA_0.05_stat.txt/DATA_0.0005_stat.txt Combining two statistic tables allows users easy to compare between PRS models

Options:

  --data_set_name        the name of the data-set i.e. gout_2019_GCST008970 [required]
  --clump_kb             distance(kb) parameter for clumping [required]
  --clump_p1             first set of P-value for clumping [required]
  --clump_r2             r2 value for clumping, default = 0.1
  --help                 Show this message and exit.

Result:

This option will generate .txt file in stat folder:

*_combined_stat.txt

Name		Name	Last commit message	Last commit date
Latest commit History 122 Commits
bin		bin
docs		docs
gprs		gprs
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.readthedocs.txt		requirements.readthedocs.txt
requirements.txt		requirements.txt
setup.py		setup.py

License

imyingchulo/gprs

Folders and files

Latest commit

History

Repository files navigation

Genome-wide Polygenic Risk Score (GPRS)

Environment setup

Additional requirements

Prepare dataset

Usage guidance

Before starting:

Two ways to run the package: Python or Commandline Interface

1. Use Python

2. Use Commandline Interface

optional function

Commands in gprs package:

Result folder

Output file format

gprs geneatlas-filter-data

Options:

Result:

gprs gwas-filter-data

Options:

Result:

gprs generate-plink-bfiles

Options:

Result:

gprs clump

Options:

Result:

gprs select-clump-snps

Options:

Result:

gprs build-prs

Options:

Result:

gprs combine-prs

Options:

Result:

gprs prs-statistics

Options:

Result:

gprs combine-prs-stat

Options:

Result:

About

Topics

Resources

License

Stars

Watchers

Forks

Languages

`gprs geneatlas-filter-data`

`gprs gwas-filter-data`

`gprs generate-plink-bfiles`

`gprs clump`

`gprs select-clump-snps`

`gprs build-prs`

`gprs combine-prs`

`gprs prs-statistics`

`gprs combine-prs-stat`