---
title: "Polygenic scores and inference using LDpred2"
author: "Florian Privé"
date: "June 8, 2023"
output: html_document
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE, width = 75, max.print = 30)
knitr::opts_knit$set(global.par = TRUE, root.dir = "..")
knitr::opts_chunk$set(echo = TRUE, fig.align = 'center', dev = 'png', out.width = "90%")
```
Here I show how to compute polygenic scores using [LDpred2](https://doi.org/10.1093/bioinformatics/btaa1029), as well as how to infer genetic architecture parameters with LDpred2-auto.
If you need to write a standalone LDpred2 command-line script, have a look at [this example written by two LDpred2 users](https://github.com/comorment/containers/tree/main/scripts/pgs/LDpred2).
**Please be careful about the spelling; you should write LDpred2, not LDPred2!**
Here is a (slightly outdated) video of me going through the tutorial and explaining the different steps:
<center><iframe width="560" height="315" src="https://www.youtube.com/embed/aya8WsNAu6U" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></center>
## Installation
In R, run
```r
# install.packages("remotes")
remotes::install_github("privefl/bigsnpr")
```
or for the CRAN version
```r
install.packages("bigsnpr")
```
- If you install {bigsnpr} >= v1.10.4, LDpred2-grid and LDpred2-auto should be much faster for large data.
- If you install {bigsnpr} >= v1.11.4, there is a new version of LDpred2-auto that was validated for inferring parameters of genetic architectures (cf. the end of this tutorial).
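You can check which versions you currently have installed with:
```r
packageVersion("bigsnpr")
packageVersion("bigsparser")  # used internally for the sparse LD matrices
```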
## Which set of variants to use?
To run LDpred2, you need
- GWAS summary statistics with marginal effect sizes, their standard errors, and the corresponding sample size(s),
- an LD (correlation) matrix computed from individuals of the same genetic ancestry as individuals used in the GWAS,
- individual-level data for tuning hyper-parameters (when using LDpred2-grid or lassosum2) and for testing the final models.
You need to restrict to genetic variants in common between all these datasets.
We previously recommended using a set of 1,054,330 HapMap3 variants for LDpred2, because they provide good coverage of the genome and are generally well imputed and available in most studies.
In [this paper](https://doi.org/10.1016/j.ajhg.2023.10.010), we recently proposed an extension of this set, called HapMap3+, which provides an even better coverage of the genome by adding 37% more variants. This is the preferred set of variants to use with LDpred2 when power is sufficient.
You should use these sets of variants only when your data is imputed, so that the overlap is good. If you only have access to a chip of genotyped variants, you should use those genotyped variants instead; in that case, you should compute the LD information yourself (as done for the tutorial data below).
If you use HM3/HM3+ variants with European summary statistics and do not have enough data to use as your own LD reference (i.e. at least 2000 individuals), we provide LD matrices that can be used directly (a loading sketch follows this list):
- for [HapMap3+ variants with independent LD blocks](https://doi.org/10.6084/m9.figshare.21305061) (from [this paper](https://doi.org/10.1016/j.ajhg.2023.10.010))
- for [HapMap3 variants with independent LD blocks](https://doi.org/10.6084/m9.figshare.19213299) (previously recommended in [this paper](https://doi.org/10.1016/j.xhgg.2022.100136))
- along with an [example R script](https://github.com/privefl/paper-infer/blob/main/code/example-with-provided-LD.R) on how to use them
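For illustration, here is a minimal sketch of loading one of these provided matrices and converting it to the on-disk SFBM format used by LDpred2 (the directory and file name below are placeholders; see the example script linked above for the authoritative version):
```r
# Sketch only: the path and file name are placeholders for the downloaded LD files
corr0 <- readRDS("ldref/LD_with_blocks_chr22.rds")           # LD matrix for one chromosome
corr <- as_SFBM(corr0, "tmp-data/LD_chr22", compact = TRUE)  # on-disk sparse format
```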
Note that forming independent LD blocks in LD matrices can be useful for robustness and extra speed gains (see [this paper](https://doi.org/10.1016/j.xhgg.2022.100136)).
Information about the HapMap3+ variants can be retrieved with
```{r}
# $pos is in build GRCh37 / hg19, but we provide positions in 2 other builds
info <- readRDS(runonce::download_file(
"https://figshare.com/ndownloader/files/37802721",
dir = "tmp-data", fname = "map_hm3_plus.rds"))
str(info)
```
Information about the HapMap3 variants can be retrieved with
```{r}
# $pos is in build GRCh37 / hg19, but we provide positions in 3 other builds
info <- readRDS(runonce::download_file(
"https://figshare.com/ndownloader/files/36360900",
dir = "tmp-data", fname = "map_hm3.rds"))
str(info)
```
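For example, a simple way to restrict your own GWAS summary statistics to one of these sets is to filter on rsIDs (a sketch, assuming a data frame `sumstats` with an `rsid` column):
```r
# Sketch: keep only summary statistics for variants in the HapMap3(+) set
sumstats_hm3 <- sumstats[sumstats$rsid %in% info$rsid, ]
```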
## Downloading genotype data and summary statistics for the tutorial
This tutorial uses fake data for educational purposes only.
Another tutorial using another dataset can be found at https://privefl.github.io/bigsnpr-extdoc/polygenic-scores-pgs.html.
You can download [the tutorial data](https://github.com/privefl/bigsnpr/raw/master/data-raw/public-data3.zip) and unzip files in R. We store those files in a directory called `"tmp-data"` here.
```{r}
# install.packages("runonce")
zip <- runonce::download_file(
"https://github.com/privefl/bigsnpr/raw/master/data-raw/public-data3.zip",
dir = "tmp-data")
unzip(zip)
```
First, you need to read genotype data from the PLINK files (or BGEN files) as well as the text file containing summary statistics.
```{r, echo=FALSE}
unlink(paste0("tmp-data/public-data3", c(".bk", ".rds")))
```
```{r}
# Load packages bigsnpr and bigstatsr
library(bigsnpr)
# Read from bed/bim/fam; this generates .bk and .rds files.
snp_readBed("tmp-data/public-data3.bed")
# Attach the "bigSNP" object in R session
obj.bigSNP <- snp_attach("tmp-data/public-data3.rds")
# See what the file looks like
str(obj.bigSNP, max.level = 2, strict.width = "cut")
# Get aliases for useful slots
G <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.pos
y <- obj.bigSNP$fam$affection
(NCORES <- nb_cores())
```
```{r}
# Read external summary statistics
sumstats <- bigreadr::fread2("tmp-data/public-data3-sumstats.txt")
str(sumstats)
```
We split the individuals from the genotype data into a "validation" set (to choose the best-performing hyper-parameters) and a "test" set (to evaluate the final polygenic scores).
Here, 350 individuals are used as the validation set to tune hyper-parameters for LDpred2-grid and lassosum2, and the remaining 153 individuals are used as the test set to evaluate the final models.
```{r}
set.seed(1)
ind.val <- sample(nrow(G), 350)
ind.test <- setdiff(rows_along(G), ind.val)
```
## Matching variants between genotype data and summary statistics
To match variants contained in the genotype data and in the summary statistics, the variables `"chr"` (chromosome number), `"pos"` (physical position in bp), `"a0"` (reference allele), and `"a1"` (alternative allele) should be available in both the summary statistics and the genotype data. These 4 variables are used to match variants between the two data frames.
From the summary statistics, you need to get `"beta"`, `"beta_se"` (standard errors), and `"n_eff"` (the effective sample sizes per variant for a GWAS using logistic regression, and simply the sample size for continuous traits).
```{r, error=TRUE}
# sumstats$n_eff <- 4 / (1 / sumstats$n_case + 1 / sumstats$n_control)
# sumstats$n_case <- sumstats$n_control <- NULL
sumstats$n_eff <- sumstats$N
map <- setNames(obj.bigSNP$map[-3], c("chr", "rsid", "pos", "a1", "a0"))
df_beta <- snp_match(sumstats, map)
```
Here, there is a problem with the matching; this is due to having different genome builds. You can either convert positions between builds with `snp_modifyBuild()` (or directly use the converted positions provided in `info`), or match by rsIDs instead.
```{r}
df_beta <- snp_match(sumstats, map, join_by_pos = FALSE) # use rsid instead of pos
```
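Here is a sketch of the build-conversion alternative (the `liftOver` executable path and the builds below are placeholders for your own setup):
```r
# Sketch only: download the UCSC liftOver executable and adapt the builds
map_converted <- snp_modifyBuild(map, liftOver = "path/to/liftOver",
                                 from = "hg18", to = "hg19")
```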
If no or only a few variants are actually flipped, you might want to disable the strand flipping option (`strand_flip = FALSE`) and perhaps remove the few variants that were flagged as flipped (these are likely errors).
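For instance, a sketch of matching with strand flipping disabled:
```r
# Sketch: same matching as above, but without allowing strand flips
df_beta <- snp_match(sumstats, map, join_by_pos = FALSE, strand_flip = FALSE)
```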
## Quality control of GWAS summary statistics
**Some quality control on summary statistics is highly recommended.**
A refined QC is described in [this new paper](https://doi.org/10.1016/j.xhgg.2022.100136).
See e.g. [the code](https://github.com/privefl/paper-misspec/tree/main/code) that was used to prepare the sumstats there.
Most of the QC recommended for LDpred2 consists of computing
\begin{equation}
\text{sd}(G_j) \approx \dfrac{\text{sd}(y)}{\sqrt{n_j ~ \text{se}(\hat{\gamma}_j)^2 + \hat{\gamma}_j^2}} ~,
\end{equation}
where $\hat{\gamma}_j$ is the marginal GWAS effect size of variant $j$, $n_j$ is the GWAS sample size associated with variant $j$, $y$ is the vector of phenotypes and ${G}_j$ is the vector of genotypes for variant $j$.
Note that, for a binary trait for which logistic regression is used, we have instead
\begin{equation}\label{eq:approx-sd-log}
\text{sd}(G_j) \approx \dfrac{2}{\sqrt{n_j^\text{eff} ~ \text{se}(\hat{\gamma}_j)^2 + \hat{\gamma}_j^2}} ~,
\end{equation}
where $n_j^\text{eff} = \frac{4}{1 / n_j^\text{cases} + 1 / n_j^\text{controls}}$.
Then, these standard deviations of genotypes inferred from GWAS summary statistics can be compared to $\sqrt{2 \cdot f_j \cdot (1 - f_j) \cdot \text{INFO}_j}$, where $f_j$ is the allele frequency of variant $j$ and $\text{INFO}_j$ is used to correct for the reduced variance of imputed dosages.
In the equation for continuous traits, we can estimate $\text{sd}(y)$ by the first percentile of $\sqrt{0.5 \left(n_j ~ \text{se}(\hat{\gamma}_j)^2 + \hat{\gamma}_j^2\right)}$, where the first percentile approximates the minimum and is robust to outliers.
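To make this concrete, here is a minimal sketch of this sd-based QC for a continuous trait (the thresholds below are illustrative choices, not authoritative; allele frequencies are computed here from the tutorial genotypes, with no INFO correction):
```r
# Sketch of the sd-based QC (thresholds are illustrative)
af <- snp_MAF(G, ind.col = df_beta$`_NUM_ID_`, ncores = NCORES)
sd_af <- sqrt(2 * af * (1 - af))  # MAF works here since 2f(1-f) is symmetric in f
sd_y_est <- quantile(sqrt(0.5 * (df_beta$n_eff * df_beta$beta_se^2 + df_beta$beta^2)), 0.01)
sd_ss <- with(df_beta, sd_y_est / sqrt(n_eff * beta_se^2 + beta^2))
is_bad <- sd_ss < (0.5 * sd_af) | sd_ss > (sd_af + 0.1) | sd_ss < 0.05 | sd_af < 0.05
df_beta <- df_beta[!is_bad, ]  # keep only variants passing this QC
```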
In practice, a consistent deviation between these two estimates of standard deviations (from summary statistics and from allele frequencies) can also be explained by using wrong estimates for $n_j$ or $n_j^\text{eff}$.
This is for example the case when computing $N^\text{eff}$ from the total number of cases and controls from a meta-analysis (cf. [this paper](https://doi.org/10.1016/j.biopsych.2022.05.029)).
In that case, you can estimate the overall $N^\text{eff}$ using e.g. `quantile(8 / df_beta$beta_se^2, 0.999)`.
## Computing LDpred2 scores genome-wide
Note that you should **run LDpred2 genome-wide**.
Just build the SFBM (the sparse LD matrix on disk) so that it contains selected variants for all chromosomes at once (see the for-loop below).
### Correlation
First, you need to compute correlations between variants.
We recommend using a window size of 3 cM (see [the LDpred2 paper](https://doi.org/10.1093/bioinformatics/btaa1029)).
```{r}
# To convert physical positions (in bp) to genetic positions (in cM), use
# POS2 <- snp_asGeneticPos(CHR, POS, dir = "tmp-data", ncores = NCORES)
# To avoid downloading "large" files, `POS2` has been precomputed here
POS2 <- obj.bigSNP$map$genetic.dist
# /!\ this is usually just a vector of 0s /!\
# Use `snp_asGeneticPos()` in your analyses
```
Before computing the LD matrices, let us filter out variants with small minor allele frequencies (MAFs):
```{r}
ind.row <- rows_along(G)
maf <- snp_MAF(G, ind.row = ind.row, ind.col = df_beta$`_NUM_ID_`, ncores = NCORES)
maf_thr <- 1 / sqrt(length(ind.row)) # threshold I like to use
df_beta <- df_beta[maf > maf_thr, ]
```
Let us create the on-disk sparse genome-wide correlation matrix on-the-fly:
```{r}
tmp <- tempfile(tmpdir = "tmp-data")
for (chr in 1:22) {
# print(chr)
## indices in 'df_beta'
ind.chr <- which(df_beta$chr == chr)
## indices in 'G'
ind.chr2 <- df_beta$`_NUM_ID_`[ind.chr]
# here we compute LD matrices ourselves, BUT
# we recall that we provide pre-computed LD matrices that are
# usually much better (larger N, LD blocks for robustness, etc)
corr0 <- snp_cor(G, ind.col = ind.chr2, size = 3 / 1000,
infos.pos = POS2[ind.chr2], ncores = NCORES)
if (chr == 1) {
ld <- Matrix::colSums(corr0^2)
corr <- as_SFBM(corr0, tmp, compact = TRUE)
} else {
ld <- c(ld, Matrix::colSums(corr0^2))
corr$add_columns(corr0, nrow(corr))
}
}
```
To use the "compact" format for SFBMs, you need `packageVersion("bigsparser") >= package_version("0.5")`.
Make sure to reinstall {bigsnpr} after updating {bigsparser} to this new version (to avoid crashes).
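You can quickly verify this with:
```r
stopifnot(packageVersion("bigsparser") >= package_version("0.5"))
```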
```{r}
file.size(corr$sbk) / 1024^3 # file size in GB
```
Note that you will need at least as much memory as this file size (to keep the matrix cached for faster processing), plus some additional memory for all the results returned. If you do not have enough memory, processing will be very slow (because the data would have to be read from disk repeatedly). If using the one million HapMap3 variants, having 60 GB of memory should be enough.
### LDpred2-inf: infinitesimal model
```{r}
# Estimate of h2 from LD Score regression
(ldsc <- with(df_beta, snp_ldsc(ld, length(ld), chi2 = (beta / beta_se)^2,
sample_size = n_eff, blocks = NULL)))
ldsc_h2_est <- ldsc[["h2"]]
```
Note that parameter `ld_size` from `snp_ldsc()` is the initial number of variants used to compute the LD scores, and should be equal to `nrow(info)` instead of `length(ld)` when using the pre-computed LD scores `info$ld`.
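For example, with the pre-computed LD scores, the call would look like this (a sketch; it assumes your `df_beta` carries the matched `info$ld` values in a column `ld`):
```r
# Sketch: LD score regression using the pre-computed LD scores from `info`
ldsc <- with(df_beta, snp_ldsc(ld, ld_size = nrow(info), chi2 = (beta / beta_se)^2,
                               sample_size = n_eff, blocks = NULL))
```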
```{r}
beta_inf <- snp_ldpred2_inf(corr, df_beta, h2 = ldsc_h2_est)
```
```{r}
pred_inf <- big_prodVec(G, beta_inf, ind.row = ind.test, ind.col = df_beta[["_NUM_ID_"]])
pcor(pred_inf, y[ind.test], NULL)
```
LDpred2-inf will very likely perform worse than the other models presented hereinafter; we now recommend not using it at all.
### LDpred2(-grid): grid of models
In practice, we recommend testing multiple values for h2 and p.
```{r}
(h2_seq <- round(ldsc_h2_est * c(0.3, 0.7, 1, 1.4), 4))
(p_seq <- signif(seq_log(1e-5, 1, length.out = 21), 2))
(params <- expand.grid(p = p_seq, h2 = h2_seq, sparse = c(FALSE, TRUE)))
```
```{r}
set.seed(1) # to get the same result every time
# takes less than 2 min with 4 cores
beta_grid <- snp_ldpred2_grid(corr, df_beta, params, ncores = NCORES)
```
```{r}
pred_grid <- big_prodMat(G, beta_grid, ind.col = df_beta[["_NUM_ID_"]])
params$score <- apply(pred_grid[ind.val, ], 2, function(x) {
if (all(is.na(x))) return(NA)
summary(lm(y[ind.val] ~ x))$coef["x", 3]
# summary(glm(y[ind.val] ~ x, family = "binomial"))$coef["x", 3]
})
```
Note that missing values represent models that diverged substantially.
```{r, fig.asp=0.5}
library(ggplot2)
ggplot(params, aes(x = p, y = score, color = as.factor(h2))) +
theme_bigstatsr() +
geom_point() +
geom_line() +
scale_x_log10(breaks = 10^(-5:0), minor_breaks = params$p) +
facet_wrap(~ sparse, labeller = label_both) +
labs(y = "GLM Z-Score", color = "h2") +
theme(legend.position = "top", panel.spacing = unit(1, "lines"))
```
```{r, message=FALSE, warning=FALSE}
library(dplyr)
params %>%
mutate(sparsity = colMeans(beta_grid == 0), id = row_number()) %>%
arrange(desc(score)) %>%
mutate_at(c("score", "sparsity"), round, digits = 3) %>%
slice(1:10)
```
You can then choose the best model according to your preferred criterion (e.g. max AUC or $r^2$). Here, we use the Z-score from the (linear or logistic) regression of the phenotype on the PRS, since we have found it more robust than using the correlation or the AUC. It also enables adjusting for covariates in this step, as sketched below.
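For example, a sketch of adjusting for covariates when scoring the models (`covar.val`, `sex`, and `age` are hypothetical and not part of the tutorial data):
```r
# Sketch: `covar.val` is a hypothetical data frame of covariates
# (e.g. sex, age, principal components) for the validation individuals
params$score <- apply(pred_grid[ind.val, ], 2, function(x) {
  if (all(is.na(x))) return(NA)
  summary(lm(y[ind.val] ~ x + sex + age, data = covar.val))$coef["x", 3]
})
```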
Also note that we separate sparse and non-sparse models here to show that their predictive performance is similar (as reported in the original LDpred2 paper). In practice, if you do not really care about sparsity, you can choose the best LDpred2-grid model among all sparse and non-sparse models. If you do care about sparsity, choose the best sparse model (provided its performance is close enough to the best overall).
```{r}
best_beta_grid <- params %>%
mutate(id = row_number()) %>%
# filter(sparse) %>%
arrange(desc(score)) %>%
slice(1) %>%
print() %>%
pull(id) %>%
beta_grid[, .]
```
```{r}
pred <- big_prodVec(G, best_beta_grid, ind.row = ind.test,
ind.col = df_beta[["_NUM_ID_"]])
pcor(pred, y[ind.test], NULL)
```
### LDpred2-auto: automatic model
Contrary to LDpred2-grid, LDpred2-auto does not need a validation set because it can directly infer values for hyper-parameters $h^2$ and $p$.
We recommend running many chains in parallel with different initial values for `p`, which are then used for quality control of the chains afterwards.
In [this paper](https://doi.org/10.1016/j.xhgg.2022.100136), we also introduced two new parameters in LDpred2-auto, `allow_jump_sign` and `shrink_corr`, to improve its robustness; we recommend using them.
```{r}
coef_shrink <- 0.95  # reduce this (to as low as 0.4) if you have some (large) mismatch with the LD reference
set.seed(1) # to get the same result every time
# takes less than 2 min with 4 cores
multi_auto <- snp_ldpred2_auto(
corr, df_beta, h2_init = ldsc_h2_est,
vec_p_init = seq_log(1e-4, 0.2, length.out = 30), ncores = NCORES,
# use_MLE = FALSE, # uncomment if you have convergence issues or when power is low (need v1.11.9)
allow_jump_sign = FALSE, shrink_corr = coef_shrink)
str(multi_auto, max.level = 1)
str(multi_auto[[1]], max.level = 1)
```
You can verify whether the chains "converged" by looking at the path of the chains:
```{r}
library(ggplot2)
auto <- multi_auto[[1]] # first chain
plot_grid(
qplot(y = auto$path_p_est) +
theme_bigstatsr() +
geom_hline(yintercept = auto$p_est, col = "blue") +
scale_y_log10() +
labs(y = "p"),
qplot(y = auto$path_h2_est) +
theme_bigstatsr() +
geom_hline(yintercept = auto$h2_est, col = "blue") +
labs(y = "h2"),
ncol = 1, align = "hv"
)
```
In the LDpred2 paper, we proposed an automatic way of filtering bad chains by comparing the scale of the resulting predictions. Since then, we have tested a simpler, roughly equivalent alternative, which we now recommend:
```{r}
# `range` should be between 0 and 2
(range <- sapply(multi_auto, function(auto) diff(range(auto$corr_est))))
(keep <- which(range > (0.95 * quantile(range, 0.95, na.rm = TRUE))))
```
To get the final effects / predictions, **you should only use chains that pass this filtering**:
```{r}
beta_auto <- rowMeans(sapply(multi_auto[keep], function(auto) auto$beta_est))
pred_auto <- big_prodVec(G, beta_auto, ind.row = ind.test, ind.col = df_beta[["_NUM_ID_"]])
```
```{r}
pcor(pred_auto, y[ind.test], NULL)
```
### lassosum2: grid of models
lassosum2 is a re-implementation of [the lassosum model](https://doi.org/10.1002/gepi.22050) that uses the exact same input parameters as LDpred2 (`corr` and `df_beta`). It should be fast to run, and can be run alongside LDpred2, with the best overall model then chosen using the validation set.
Note that parameter `s` from lassosum has been replaced by a new parameter `delta` in lassosum2, to better reflect that the lassosum model also uses L2-regularization (i.e. elastic-net regularization).
```{r}
beta_lassosum2 <- snp_lassosum2(corr, df_beta, ncores = NCORES)
```
```{r}
(params2 <- attr(beta_lassosum2, "grid_param"))
pred_grid2 <- big_prodMat(G, beta_lassosum2, ind.col = df_beta[["_NUM_ID_"]])
params2$score <- apply(pred_grid2[ind.val, ], 2, function(x) {
if (all(is.na(x))) return(NA)
summary(lm(y[ind.val] ~ x))$coef["x", 3]
# summary(glm(y[ind.val] ~ x, family = "binomial"))$coef["x", 3]
})
```
```{r}
library(ggplot2)
ggplot(params2, aes(x = lambda, y = score, color = as.factor(delta))) +
theme_bigstatsr() +
geom_point() +
geom_line() +
scale_x_log10(breaks = 10^(-5:0)) +
labs(y = "GLM Z-Score", color = "delta") +
theme(legend.position = "top") +
guides(colour = guide_legend(nrow = 1))
```
```{r}
library(dplyr)
best_grid_lassosum2 <- params2 %>%
mutate(id = row_number()) %>%
arrange(desc(score)) %>%
print() %>%
slice(1) %>%
pull(id) %>%
beta_lassosum2[, .]
```
```{r}
# Choose the best among all LDpred2-grid and lassosum2 models
best_grid_overall <-
`if`(max(params2$score, na.rm = TRUE) > max(params$score, na.rm = TRUE),
best_grid_lassosum2, best_beta_grid)
```
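For completeness, you can evaluate this overall best model on the test set, exactly as before:
```{r}
pred_overall <- big_prodVec(G, best_grid_overall, ind.row = ind.test,
                            ind.col = df_beta[["_NUM_ID_"]])
pcor(pred_overall, y[ind.test], NULL)
```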
## Inference with LDpred2-auto
For this, you need `packageVersion("bigsnpr") >= package_version("1.11.4")`. See [the accompanying paper](https://doi.org/10.1016/j.ajhg.2023.10.010).
```{r, echo=FALSE, out.width="80%", fig.cap="Overview of what LDpred2-auto can now provide"}
knitr::include_graphics("https://raw.githubusercontent.com/privefl/paper-infer/main/overview.svg")
```
LDpred2-auto has recently been extended and improved to allow for estimating $h^2$, $p$, and $\alpha$, a third parameter that controls how expected effect sizes relate to minor allele frequencies. The new model assumed by LDpred2-auto is
\begin{equation}
\beta_j = S_j \gamma_j \sim \left\{
\begin{array}{ll}
\mathcal N \big( 0,~\sigma_\beta^2 \cdot (S_j^2)^{(\alpha + 1)} \big) & \mbox{with probability $p$,} \\
0 & \mbox{otherwise,}
\end{array}
\right.
\end{equation}
where $p$ is the proportion of causal variants, $h^2$ the (SNP) heritability, $\boldsymbol{\gamma}$ the effect sizes on the allele scale, $\boldsymbol{S}$ the standard deviations of the genotypes, and $\boldsymbol{\beta}$ the effects of the scaled genotypes.
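Note that, under this model (and assuming independent variants), the SNP heritability follows directly as the expected sum of squared scaled effects (a derivation sketch from the model above, not a quote from the paper):
\begin{equation}
h^2 = \sum_j E\left[\beta_j^2\right] = p ~ \sigma_\beta^2 \sum_j \left(S_j^2\right)^{(\alpha + 1)} ~.
\end{equation}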
Parameters $h^2$, $p$, and $\alpha$ (and 95% CIs) can be estimated using:
```{r, cache=TRUE}
# reduce this (to as low as 0.4) if you have some (large) mismatch with the LD reference
# /!\ but the inference of h2 and p might be biased if you do this (see paper)
coef_shrink <- 0.95
multi_auto <- snp_ldpred2_auto(
corr, df_beta, h2_init = ldsc_h2_est,
vec_p_init = seq_log(1e-4, 0.2, length.out = 50), ncores = NCORES,
burn_in = 500, num_iter = 500, report_step = 20,
# use_MLE = FALSE, # uncomment if you have convergence issues or when power is low (need v1.11.9)
allow_jump_sign = FALSE, shrink_corr = coef_shrink)
```
```{r}
(range <- sapply(multi_auto, function(auto) diff(range(auto$corr_est))))
(keep <- which(range > (0.95 * quantile(range, 0.95, na.rm = TRUE))))
```
```{r}
all_h2 <- sapply(multi_auto[keep], function(auto) tail(auto$path_h2_est, 500))
quantile(all_h2, c(0.5, 0.025, 0.975))
all_p <- sapply(multi_auto[keep], function(auto) tail(auto$path_p_est, 500))
quantile(all_p, c(0.5, 0.025, 0.975))
all_alpha <- sapply(multi_auto[keep], function(auto) tail(auto$path_alpha_est, 500))
quantile(all_alpha, c(0.5, 0.025, 0.975))
```
Predictive performance $r^2$ can also be inferred from the Gibbs sampler:
```{r}
bsamp <- lapply(multi_auto[keep], function(auto) auto$sample_beta)
all_r2 <- do.call("cbind", lapply(seq_along(bsamp), function(ic) {
  b1 <- bsamp[[ic]]
  # compute R %*% b1, accounting for the shrinkage used in the Gibbs sampler
  Rb1 <- apply(b1, 2, function(x)
    coef_shrink * bigsparser::sp_prodVec(corr, x) + (1 - coef_shrink) * x)
  b2 <- do.call("cbind", bsamp[-ic])
  # cross-products between samples from different chains estimate r2
  b2Rb1 <- as.matrix(Matrix::crossprod(b2, Rb1))
}))
quantile(all_r2, c(0.5, 0.025, 0.975))
```
and compared to
```{r}
beta_auto <- rowMeans(sapply(multi_auto[keep], function(auto) auto$beta_est))
pred_auto <- big_prodVec(G, beta_auto, ind.col = df_beta[["_NUM_ID_"]])
pcor(pred_auto, y, NULL)^2
```
These are not exactly the same, which we attribute to the small number of variants used in this tutorial data.
You can also get per-variant probabilities of being causal (e.g. for fine-mapping purposes):
```{r}
postp <- rowMeans(sapply(multi_auto[keep], function(auto) auto$postp_est))
qplot(y = postp, alpha = I(0.2)) + theme_bigstatsr()
```
You can also infer local per-block or per-annotation $h^2$ following the code [here](https://github.com/privefl/paper-infer/blob/3dcd2949f54d7f1fd5c346d1bbb35db4b84a719e/code/run-ldpred2-ukbb.R#L158-L165) and [here](https://github.com/privefl/paper-infer/blob/3dcd2949f54d7f1fd5c346d1bbb35db4b84a719e/code/height-meta-1600k.R#L170-L216).
***
```{r}
# Some cleaning
rm(corr); gc(); file.remove(paste0(tmp, ".sbk"))
```
## References
- Privé, F., Arbel, J., & Vilhjálmsson, B. J. (2020). [LDpred2: better, faster, stronger](https://doi.org/10.1093/bioinformatics/btaa1029). *Bioinformatics*, 36(22-23), 5424-5431.
- Privé, F., Arbel, J., Aschard, H., & Vilhjálmsson, B. J. (2022). [Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores](https://doi.org/10.1016/j.xhgg.2022.100136). *Human Genetics and Genomics Advances*, 3(4), 100136.
- Privé, F., Albiñana, C., Arbel, J., Pasaniuc, B., & Vilhjálmsson, B. J. (2023). [Inferring disease architecture and predictive ability with LDpred2-auto](https://doi.org/10.1016/j.ajhg.2023.10.010). *The American Journal of Human Genetics*, 110(12), 2042-2055.