---
title: "Polygenic Score analysis"
author:
    - name: Conor O'Hare
    - name: Samuele Soraggi
      orcid: 0000-0002-1159-5535
      email: samuele@birc.au.dk
    - name: Alba Refoyo Martinez
      orcid: 0000-0002-3674-4007
      email: alba.martinez@sund.ku.dk   
---


---
format:
  html:
   theme: default
   number-sections: true
   code-fold: false
   self-contained: false
   toc: true
   highlight-style: pygments
  ipynb:
    toc: true
    number-sections: false
bibliography: references/references_6.bib
---    

:::{.callout-note title="Important notes for this notebook" icon=false}

## Learning outcomes

- **Discuss and choose** the PRS equation 
- **Discuss** PRS scores and biases


## How to make this notebook work

* In this notebook, we will only use `Bash`, so please choose the Bash kernel. A kernel contains a programming language and the necessary packages to run the course material. To choose a kernel, go to the menu at the top of the page, select `Kernel --> Change Kernel`, and then select `Bash`.
* You can run the code in each cell by clicking the run cell sign in the toolbar, or simply by pressing <kbd>Shift</kbd>+<kbd>Enter</kbd>. When the code is done running, a small green check mark will appear on the left side.
* You need to run the cells in sequential order to execute the analysis. Please do not run a cell until the one above is done running, and do not skip any cells. 
* The code is accompanied by textual descriptions to help you understand what is happening. Please try not to focus on understanding the code itself in too much detail, but rather focus on the explanations and the output of the commands.  
* You can create new code cells by pressing `+` in the Menu bar above or by pressing <kbd>B</kbd> after selecting a cell. 

:::

While most individual associations found in GWAS are of small effect, information about them can be combined across the genome to create a **polygenic score (PGS)**. These scores can be used to make **genome-based predictions about the overall risk of having a particular trait or disease or about the genetic value for continuous traits**. If the prediction is on a discrete phenotype such as a disease, these scores are known as polygenic risk scores (PRS).

## 1. Computing a PRS

Single-variant association analysis has been the primary method in GWAS, but it requires very large sample sizes to detect more than a handful of SNPs for many complex traits. In contrast, PRS analysis does not aim to identify individual SNPs but instead aggregates genetic risk across the genome into a single polygenic score per individual for a trait of interest. One straightforward way to obtain a PGS for a given population is to sum the allele frequencies of statistically significant trait-associated variants, weighted by their effect sizes, after ensuring that these variants are approximately independent (e.g., via LD pruning).

In this approach, a large discovery sample is required to reliably determine how much each SNP is expected to contribute to the polygenic score (“weights”) of a specific trait. Subsequently, in an independent target sample, which can be more modest in size [@dudbridge2013power], polygenic scores can be calculated from the genotype data and these weights (see below for details on the calculations). As a rule of thumb, a target sample of around **2,000 subjects** provides sufficient power to detect a significant proportion of variance explained. Furthermore, the discovery and target samples should have the same number of subjects until the target sample includes 2,000 subjects. If more samples are available, additional subjects should be included in the discovery sample to maximize the accuracy of the estimated effect sizes [@dudbridge2013power].

Although PRS is not powerful enough to predict disease risk on the individual level [@wray2013pitfalls], it has been successfully used to show significant associations both within and across traits. For example, a PRS analysis of schizophrenia showed for the first time that an aggregate measure of the genetic risk of developing schizophrenia, estimated based on the effects of common SNPs (from the discovery sample) that showed nominally significant associations with disease risk, was significantly associated with schizophrenia risk in an independent (target) sample. A significant association was found even though the available sample sizes were too small to detect genome‐wide significant SNPs [@international2009common]. In addition, GWAS for schizophrenia (the discovery sample) has been used to significantly predict the risk in target samples with various phenotypes, such as bipolar disorder, level of creativity, and even risk of immune disorders [@power2015polygenic; @stringer2014genetic].

## 2. Conducting polygenic risk prediction analyses

To conduct a PRS analysis, trait‐specific weights (betas for continuous traits and the log of the odds ratios for binary traits) are obtained from a discovery GWAS. In the target sample, a PRS is calculated for each individual as the sum of the number of risk alleles that they carry at each SNP, weighted by the trait‐specific weights. For many complex traits, SNP effect sizes are publicly available (e.g., see https://www.med.unc.edu/pgc/downloads or https://www.ebi.ac.uk/gwas/).
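
The calculation described above can be written compactly. The notation below is our own, introduced only for illustration: $m$ is the number of SNPs included in the score, $\hat{\beta}_i$ is the weight of SNP $i$ from the discovery GWAS (for a binary trait, $\hat{\beta}_i = \ln \mathrm{OR}_i$), and $G_{ij} \in \{0, 1, 2\}$ is the number of risk alleles that individual $j$ carries at SNP $i$.

$$
\mathrm{PRS}_j = \sum_{i=1}^{m} \hat{\beta}_i \, G_{ij}
$$

Some tools report this sum divided by the number of alleles scored; that rescaling leaves the ranking of individuals unchanged when everyone is genotyped at the same SNPs.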

Although in principle all common SNPs could be used in a PRS analysis, it is customary to first clump the GWAS results before computing risk scores. P-value thresholds are typically used to remove SNPs that show little or no statistical evidence for association (e.g., only keep SNPs with p-values <0.05 or <0.01). Usually, multiple PRS analyses will be performed, with varying thresholds for the p-values of the association test.
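
As a minimal sketch of thresholding followed by scoring, the cell below builds two tiny made-up files and computes one score per individual with `awk`. Everything in it is an assumption made for illustration: the folder `toy_prs`, the file names, the SNP IDs, the odds ratios, the p-values, and the genotype counts. Clumping is skipped, and this is not how PRSice computes its scores internally.

```bash
mkdir -p toy_prs

# Toy discovery summary statistics (made-up values): SNP, odds ratio, p-value
cat > toy_prs/toy_base.txt <<'EOF'
SNP OR P
rs1 1.20 0.001
rs2 0.80 0.030
rs3 1.05 0.400
EOF

# Toy target genotypes (made-up values): risk-allele counts per individual
cat > toy_prs/toy_geno.txt <<'EOF'
IID rs1 rs2 rs3
ind1 2 0 1
ind2 0 2 2
ind3 1 1 0
EOF

threshold=0.05

# First file: store ln(OR) as the weight of every SNP with P below the threshold.
# Second file: read the SNP order from the header, then sum weight * allele count
# over the retained SNPs for each individual.
awk -v thr="$threshold" '
    NR == FNR { if (FNR > 1 && $3 < thr) w[$1] = log($2); next }
    FNR == 1  { for (i = 2; i <= NF; i++) snp[i] = $i; next }
    {
        score = 0
        for (i = 2; i <= NF; i++) if (snp[i] in w) score += w[snp[i]] * $i
        printf "%s %.4f\n", $1, score
    }
' toy_prs/toy_base.txt toy_prs/toy_geno.txt > toy_prs/toy_scores.txt

cat toy_prs/toy_scores.txt
```

Changing `threshold` to `0.5` would also bring `rs3` into the score, which is the kind of comparison across thresholds that a PRS analysis repeats many times.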

Once PRS have been calculated for all subjects in the target sample, the scores can be used in a (logistic) regression analysis to predict any trait that is expected to show genetic overlap with the trait of interest. The prediction accuracy can be expressed with the (pseudo‐) $R^2$ measure of the regression analysis. It is important to include at least a few MDS components as covariates in the regression analysis to control for population stratification. To estimate how much variation is explained by the PRS, the $R^2$ of a model that includes only the covariates (e.g., MDS components) and the $R^2$ of a model that includes covariates + PRS will be compared. The increase in $R^2$ due to the PRS indicates the increase in prediction accuracy explained by genetic risk factors.
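
The model comparison described above amounts to a simple difference. As a sketch in our own notation (not taken from the cited papers), let $R^2_{\text{cov}}$ be the (pseudo‐)$R^2$ of the covariates-only model and $R^2_{\text{cov+PRS}}$ that of the model with the PRS added:

$$
\Delta R^2 = R^2_{\text{cov+PRS}} - R^2_{\text{cov}}
$$

With illustrative numbers, a covariates-only model with $R^2_{\text{cov}} = 0.02$ and a full model with $R^2_{\text{cov+PRS}} = 0.07$ give $\Delta R^2 = 0.05$ attributable to the PRS.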

The prediction accuracy of PRS depends mostly on the (co‐)heritability of the analyzed traits, the number of SNPs, and the size of the discovery sample. The size of the target sample only affects the reliability of $R^2$ and typically a few thousand subjects in the target sample are sufficient to achieve a significant $R^2$ (if the (co‐)heritability of the trait(s) of interest and the sample size of the discovery sample used are sufficiently large).

![Figure 6.1: Prediction of schizophrenia (SCZ) and bipolar disorder (BD) in Iceland using polygenic risk scores derived from independent GWASs of these disorders [@power2015polygenic]](Images/schizofrenia.png)

## 3. Polygenic risk score analysis with PRSice-2
To perform polygenic risk score analysis, one possible tool is [PRSice](https://choishingwan.github.io/PRSice/). In this tutorial, we provide a step-by-step guide to perform a simple polygenic risk score analysis using PRSice and explain how to interpret the results.

We have supplied PRSice-2 with this course material ready to use.

The installed package will include an R script that is straightforward to run. It requires the following information:

- `--prsice`: the binary executable file
- `--base`: the `.assoc` file that contains the GWAS summary statistics of the base sample
- `--target`: the PLINK-formatted dataset

:::{.callout-note}

It would be ideal at this point to apply this method to our HapMap dataset. However, as mentioned above, a PRS analysis requires a target sample of around 2,000 individuals to show meaningful results, whereas our dataset contains only about 150 individuals. Hence, we will use a *toy dataset* for didactic purposes. 

:::

### 3.1 PRSice analysis

<img src="Images/bash.png" alt="Bash" width="40"> Let's create a folder for the output files. Then, perform the PRS analysis on the toy dataset in the following way:

In [1]:
mkdir -p Results/GWAS6

Create two symbolic links to the data and software folders:

In [10]:
ln -sf ../Data
ln -sf ../Software

We run PRSice through the `R` script `PRSice.R`, which prepares the input for the program `./Software/PRSice` and processes its output (including the plots). We provide the results of the association testing with `--base` and give the names of the columns containing the SNP IDs, chromosomes, positions, alleles, effect sizes (here odds ratios, `--stat OR`), and p-values. We then tell `PRSice` that the target data is `TOY_TARGET_DATA` and that the target phenotype is binary (`--binary-target T`), i.e., case/control.

:::{.callout-warning}

If you get an error in the following command, try restarting the kernel from the `Kernel` menu. Sometimes links to folders are not recognized immediately.

:::

In [1]:
Rscript ./Data/PRSice.R --out Results/GWAS6/PRSice \
--prsice ./Software/PRSice \
--base ./Data/TOY_BASE_GWAS.assoc \
--snp SNP --chr CHR --bp BP --A1 A1 --A2 A2 --stat OR --pvalue P \
--target ./Data/TOY_TARGET_DATA \
--binary-target T 

[?25h[?25h[?25h[?25h[?25h[?25h[?25h[?25h[?25h[?25h[?25h[?25h[?25h[?25h[?25h[?25h[?25h[?25hPRSice 2.3.5 (2021-09-20) 
https://github.com/choishingwan/PRSice
(C) 2016-2020 Shing Wan (Sam) Choi and Paul F. O'Reilly
GNU General Public License v3
If you use PRSice in any published work, please cite:
Choi SW, O'Reilly PF.
PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.
GigaScience 8, no. 7 (July 1, 2019)
2024-08-07 12:08:59
./Software/PRSice \
    --a1 A1 \
    --a2 A2 \
    --bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \
    --base ./Data/TOY_BASE_GWAS.assoc \
    --binary-target T \
    --bp BP \
    --chr CHR \
    --clump-kb 250kb \
    --clump-p 1.000000 \
    --clump-r2 0.100000 \
    --interval 5e-05 \
    --lower 5e-08 \
    --num-auto 22 \
    --or  \
    --out Results/GWAS6/PRSice \
    --pvalue P \
    --seed 1252005307 \
    --snp SNP \
    --stat OR \
    --target ./Data/TOY_TARGET_DATA \
    --thread 1 \
    --upper 0.5

Processing 100.00%
There are 1 region(s) with p-value less than 1e-5. Please 
note that these results are inflated due to the overfitting 
inherent in finding the best-fit PRS (but it's still best 
to find the best-fit PRS!). 
You can use the --perm option (see manual) to calculate an 
empirical P-value. 

Begin plotting
Current Rscript version = 4.3.2
Plotting Bar Plot
Plotting the high resolution plot


The `--base` parameter refers to the file with summary statistics from the base sample (also known as the discovery or training sample). These summary statistics contain, for each genetic variant, at least an effect size and a p-value. The `--target` parameter refers to the prefix of the files (without file extension) that contain the genotype data in binary PLINK format (i.e., the `.bed`, `.bim`, and `.fam` files). The target sample is also known as the validation or test sample. It should be completely independent of the base sample that was used to compute the summary statistics, because sample overlap between the discovery and target samples will greatly inflate the association between the polygenic risk score and the disease trait. 

If the effect type (`--stat`) or the target data type (`--binary-target`) is not specified, PRSice will try to determine this information from the header of the base file. 

:::{.callout-note}

Instead of performing a polygenic risk score analysis on all genetic variants, it is customary to clump first. In clumping, within each block of correlated SNPs, the SNP with the lowest p-value in the discovery set is selected and all other SNPs are ignored in downstream analyses. PRSice performs this clumping procedure automatically, but it can be adjusted with several clumping parameters. Many other options exist; we refer to the [PRSice user manual](https://choishingwan.github.io/PRSice/step_by_step/#clumping) for more detailed information about the program. 

:::
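
To make the clumping idea concrete, here is a minimal greedy sketch on made-up data. It is an illustration under our own assumptions, not PRSice's implementation: the folder `toy_prs`, the file names, the SNP IDs, the p-values, and the pairwise $r^2$ values are all invented, and the physical distance window (`--clump-kb` in the log above) is ignored. The $r^2$ cut-off of 0.1 mirrors the `--clump-r2 0.100000` value shown in that log.

```bash
mkdir -p toy_prs

# Toy summary statistics (made-up values): SNP and discovery p-value
cat > toy_prs/clump_pvals.txt <<'EOF'
rs1 0.0001
rs2 0.0100
rs3 0.0300
rs4 0.2000
EOF

# Toy pairwise LD (made-up values): SNP pair and r2; unlisted pairs count as r2 = 0
cat > toy_prs/clump_ld.txt <<'EOF'
rs1 rs2 0.80
rs3 rs4 0.05
EOF

r2_max=0.1

# Visit SNPs from the lowest to the highest p-value. A SNP becomes an index SNP
# unless its r2 with an index SNP kept earlier exceeds r2_max.
sort -k2,2n toy_prs/clump_pvals.txt | awk -v r2max="$r2_max" '
    NR == FNR { ld[$1, $2] = $3; ld[$2, $1] = $3; next }   # load LD pairs (first file)
    {
        keep = 1
        for (k = 1; k <= n; k++)
            if (ld[$1, index_snp[k]] + 0 > r2max + 0) keep = 0
        if (keep) { index_snp[++n] = $1; print $1 }
    }
' toy_prs/clump_ld.txt - > toy_prs/clumped_snps.txt

cat toy_prs/clumped_snps.txt
```

Here `rs2` is dropped because it is strongly correlated with the more significant `rs1`, while `rs3` and `rs4` are both retained because their $r^2$ is below the cut-off.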

For simplicity's sake, we did not include principal components or other covariates in this analysis. However, when conducting your own analyses, we **strongly recommend** including them.

### 3.2 Interpreting the results

By default, PRSice saves two plots and several text files. The first plot is `PRSice_BARPLOT_<date>.png` (**which you need to open from the folder `Results/GWAS6` using the file browser**, since the name depends on the current date; below, you can see a screenshot of the figure). This plot shows the predictive value (Nagelkerke's $R^2$) in the target sample of models based on SNPs with p-values below specific thresholds in the base sample. In addition, a p-value is provided for each model. 

![](Images/PRSice_BARPLOT.png){width=600px fig-align="center"}

As shown in the plot, a model using SNPs with a p-value up to 0.4463 achieves the highest predictive value in the target sample, with a p-value of 4.7e-18. However, as is often the case in polygenic risk score analyses with relatively small samples, the predictive value is relatively low (Nagelkerke's $R^2$ of around 5%). The text files include the exact values for each p-value threshold (check them!).
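
One way to check such a per-threshold table from the command line is sketched below. It runs on a toy table with made-up values, and the column names `Threshold`, `R2`, and `P` are our assumption about the layout of the results file; the folder `toy_prs` and the file names are also hypothetical. Columns are looked up by header name, so the column order does not matter.

```bash
mkdir -p toy_prs

# Toy per-threshold table (made-up values; column names are an assumption)
cat > toy_prs/toy_thresholds.txt <<'EOF'
Set Threshold R2 P Num_SNP
Base 0.001 0.010 3.0e-04 150
Base 0.05 0.030 2.0e-10 2400
Base 0.5 0.050 4.0e-18 16000
Base 1 0.045 9.0e-17 31000
EOF

# Map header names to column numbers, then keep the row with the largest R2.
awk '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    NR == 2 || $(col["R2"]) + 0 > best {
        best = $(col["R2"]) + 0
        thr  = $(col["Threshold"])
        p    = $(col["P"])
    }
    END { printf "best threshold: %s (R2 = %.3f, P = %s)\n", thr, best, p }
' toy_prs/toy_thresholds.txt > toy_prs/best_threshold.txt

cat toy_prs/best_threshold.txt
```

To try this on the real results, point the `awk` command at the per-threshold text file that PRSice wrote to `Results/GWAS6`, after checking the file name and its header first.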

The second plot is `PRSice_HIGH-RES_PLOT_<date>.png` (**which you again need to open manually**, but we also show a screenshot below). It shows the results for many different p-value thresholds. The p-value of the predictive effect is shown in black, together with an aggregated trend line in green. 

![](Images/PRSice_HIGH-RES_PLOT.png){width=600px}

Both figures show that many SNPs that affect the trait in the base sample can be used to predict the trait in the target sample. Note that the two traits can be either the same or different. If the same trait is used, the predictive value is related to the heritability of the trait (as well as the sample size of the base sample). If different traits are analyzed, the predictive value is also related to the genetic overlap between the two traits. Either way, polygenic risk score analysis typically shows that models with lenient p-value thresholds predict better than models with more stringent thresholds, suggesting that many SNPs that are not statistically significant still have predictive value in polygenic traits.

### Conclusion
In this tutorial, we have discussed how to perform a simple polygenic risk score analysis using the PRSice script and how to interpret its results. When PLINK genotype target files are available, PRSice provides a relatively easy way of performing polygenic risk score analysis. As mentioned before, PRSice offers many additional options to adjust the risk score analysis, including adding covariates and principal components and adjusting the clumping parameters. We therefore recommend reading the PRSice user manual to set up the polygenic risk score analysis that best fits the research question at hand.

## Further Reading

There is only so much one can discuss in a beginner's practical guide to GWAS. As such, for those who want to expand their knowledge of GWAS, we have provided a comprehensive list of resources for you to read/try out below.