-
Notifications
You must be signed in to change notification settings - Fork 0
/
asthma_prelim_results.Rmd
433 lines (363 loc) · 22.2 KB
/
asthma_prelim_results.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
---
title: "Asthama and Allergy diseases"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(knitr)
library(dplyr)
library(RColorBrewer)
source("code/make_plots.R")
```
## Overview
The main goal is to identify causal variants, genes and cell types relevant to AAD by integrating omics data of lung samples. We hypothesize that open chromatin regions of lung-resident immune cells can explain broader heritability for AAD than those of blood immune cells. The disease associated variants annotated by these regulatory functions are more likely to contribute to disease risk, so this prior knowledge can be leveraged to prioritize risk variants in GWAS loci.
## Dataset
### ATAC-seq data for Blood immune cells (Caldero2019)
**Dataset**: ATAC-Seq profiles for FACS-sorted cells from the peripheral blood of up to 4 healthy donors
**Output files**:
1. An ATAC-seq count table with a union set of peaks as rows and individual cell as columns
2. A Sample QC table that includes number of peaks and cell type identity for each cell
3. Significant differentially accessible regions when compared to progenitor cells
4. Significant differentially accessible regions under stimulation
**Procedure**: I used the number of peaks after QC for each cell to extract its corresponding ATAC-Seq peaks from the count table.
<details>
<summary>Details of the procedure</summary>
For the cells of interest, I sorted their corresponding columns of the count table and store the top N number of peaks according to the number of peaks from the QC table as cell type resolved peaks.
</details>
### ATAC-seq data for hematopoietic cells (Ulirsch2019)
**Dataset**: ATAC-Seq profiles for FACS-sorted cells from human peripheral blood or bone marrow.
**Output files**:
1.peak files downloaded from: https://github.com/caleblareau/singlecell_bloodtraits/tree/master/data/bulk/ATAC/narrowpeaks
### scATAC-seq data for lung tissues (Wang2020)
**Dataset**: scATAC-Seq and scRNA-seq profiles for small airway region of right middle lobe (RML) lung tissue from 3 donors at different ages
**Output files**:
1. peak files downloaded from web portal: https://www.lungepigenome.org/
### scATAC-seq data for fetal hematopoietic cells (Ronzoni2021)
**Dataset**: scATAC-Seq and scRNA-seq profiles of human immunophenotypic blood cells from fetal liver and bone marrow
**Output files**:
Downloaded from gitlab page: to be added
1. A merged normalized peak table with a union set of peaks as rows and individual cell as columns
2. A meta table that includes number of peaks and predicted cell type identity for each cell
3. A raw count table with peaks as rows and individual cell as columns
**Procedure**: I used the number of peaks for each cell from the meta table to extract its corresponding ATAC-Seq peaks from the merged peak table. The number of peaks is equivalent to the amount of non-zero peaks.
### Summary of peaks called across open chromatin regions
**Peak size** - not shown
```{r echo=F, eval=FALSE}
data.dir<-"~/cluster/data/features/formatted"
get_median_peaks<-function(study){
#get median size for called peaks
medPeaks<-c()
for(i in list.files(paste(data.dir, study, sep='/'), pattern='bed')){
annot<-read.table(paste0(data.dir, sprintf("/%s/%s", study, i)), header=F)
colnames(annot)<-c("chr","start", "end")
medPeaks<-c(medPeaks, median(annot$end-annot$start))
}
print(c(study, mean(medPeaks)))
return(medPeaks)
}
par(mfrow = c(2,2))
hist(get_median_peaks("Wang2020"), main="")
hist(get_median_peaks("Ulirsch2019"), main="")
hist(get_median_peaks("Caldero2019/regroup"), main="")
hist(get_median_peaks("Caldero2019/lineage"), main="")
```
## TORUS run for individual Blood annotation
**Motivation**:
1. perform QC check for the Caldero2019 dataset
2. have a senes of which cell types are potentially relevant to AAD
**Results**:
Overall, the magnitude of enrichment estimates are much smaller than that in figure 5b. The discrepancy can be due to different GWAS datasets and enrichment method. They used LDSC to estimate enrichment coefficients, which have a better control on the overlapping peaks between many annotations.
```{r echo=F, fig.height=8, fig.width=12}
immune_cells<-c("Bulk B", "Naive B", "Mem B","Plasmablasts",
"CD8pos T", "Naive CD8 T", "Central memory CD8pos T",
"Effector memory CD8pos T", "Gamma delta T", "Regulatory T","Naive Tregs",
"Memory Tregs","Effector CD4pos T","Naive Teffs", "Memory Teffs",
"Th1 precursors", "Th2 precursors", "Th17 precursors",
"Follicular T Helper","Immature NK", "Mature NK", "Memory NK",
"Monocytes", "pDCs", "Myeloid DCs")
traits<-c("allergy", "asthma_adult", "asthma_child", "ra_2014", "bmi", "height", "LDL", "HDL")
test<-traits[1:4]
control<-traits[5:8]
tbl<-data.frame()
for(trait in test[c(1,2,4)]){
rest<-read.table(sprintf("output/AAD/%s/torus_enrichment_all_rest.est", trait), header=F)
stimu<-read.table(sprintf("output/AAD/%s/torus_enrichment_all_stimulated.est", trait), header=F)
# name the terms in a simpler form (todo)
rest["condition"]<-"resting"
stimu["condition"]<-"stimulated"
df<-data.frame(rbind(rest, stimu))
colnames(df)<-c("raw_term", "estimate", "low", "high", "condition")
df["term"]<-unlist(lapply(df$raw_term, function(i){paste(strsplit(strsplit(i, "[.]")[[1]][1], "_")[[1]], collapse = " ")}))
tbl<-data.frame(rbind(tbl, cbind(df, trait)))
}
tbl$condition<-factor(df$condition, levels=c("stimulated", "resting"))
snp_enrichment_plot(tbl, split.data = "condition", y.order = rev(immune_cells), trait="", bar.width = 0) + facet_grid(. ~ trait)
```
## TORUS run for individual lung annotation
**Motivation**:
1. perform QC check for this dataset
2. have a senes of which cell types in lungs are potentially relevant to AAD
**Results**:
Data look good as the open chromatin regions of most cell types in lungs are enriched with disease risk variants.
```{r echo=F, fig.width=12, fig.height=8}
tbl<-data.frame()
for(trait in test){
df<-read.table(sprintf("output/AAD/%s/Wang2020_indiv.est", trait), header=F)
colnames(df)<-c("raw_term", "estimate", "low", "high")
df["term"]<-unlist(lapply(df$raw_term, function(i){strsplit(i, "[.]")[[1]][1]}))
tbl<-data.frame(rbind(tbl, cbind(df, trait)))
}
snp_enrichment_plot(tbl, trait="") + facet_grid(. ~ trait)
```
##Joint TORUS run for annotation sets
### major clusters in lung dataset
* Mesenchymal cells
* Immune cells
* epithelial cells
* endothelial cells
* others(neuroendocrine cells, type I pneumocyte, ciliated cells)
**Motivation**: determine how each cluster with similar chromatin accessibility patterns contribute to AAD heritability.
**Results**:
We observed some non-immune cell types such as epithelial cells, endothelia cells do contribute to genetic risks of AAD, though not as much as immune cells.
```{r echo=F, fig.width=12}
tbl<-data.frame()
for(trait in traits){
df<-read.table(sprintf("output/AAD/%s/Wang2020_regroup.est", trait), header=T, skip=1)
colnames(df)<-c("raw_term", "estimate", "low", "high")
df["term"] = unlist(lapply(df$raw_term, function(i){strsplit(i[1], "[.]")[[1]][1]}))
tbl<-data.frame(rbind(tbl, cbind(df, trait)))
}
#tbl["trait_label"]<-factor(tbl$trait, levels=traits,
#labels=c("Allergy", "Asthma(adult)", "Asthma(child)", "BMI", "HDL", "Height", "LDL", "RA"))
snp_enrichment_plot(tbl[tbl$trait %in% test,], trait="",
y.order=c("neuroendocrine_cell", "type_I_pneumocyte",
"ciliated_cell", "epithelial_cells",
"mesenchymal_cells", "endothelial_cells",
"immune_cells")
) + facet_grid(. ~ trait) + theme(strip.text.x = element_text(size=18))
snp_enrichment_plot(tbl[tbl$trait %in% control,], trait="",
y.order=c("neuroendocrine_cell", "type_I_pneumocyte",
"ciliated_cell", "epithelial_cells",
"mesenchymal_cells", "endothelial_cells",
"immune_cells")
) + facet_grid(. ~ trait) + theme(strip.text.x = element_text(size=18))
```
### annotation sets for Caldero2019
#### grouped by lineages
**Motivation**:
Running annotation sets in a joint model enables us to idenity the relevant contribution in open chromatin regions of each immune cell type to traits. LSDC is a better tool to use as we expected many overlaps of acccessible peaks across the sub-clusters of immune cells, which can make the estimation of TORUS's joint model unstable. Here we grouped the cell types and used TORUS to get a quick run of which group of cell types have significant enrichment.
**Procedure**:
For Caldero2019 dataset: Immune cells were grouped into six main categories and merged across two conditions. The peaks in these groups can be overlapped.
For Ronzoni2021 dataset, I took a union set of peaks from all cells that were predicted to be granulocytes progenitors.
**Results**:
Consistent with prior knowledge, we see enrichment of granulocyte progenitors with genetic risk of allergy.
```{r echo=F, fig.width=12}
tbl<-data.frame()
for(trait in traits){
df<-read.table(sprintf("output/AAD/%s/Caldero2019_addGPs.est", trait), header=T)
colnames(df)<-c("raw_term", "estimate", "low", "high")
### remove lines of FDR values for each loci
df["label"] = unlist(lapply(df$raw_term, function(i){length(strsplit(i[1], "[.]")[[1]])>1}))
df<-df[df$label==TRUE,]
df["term"]<-unlist(lapply(df$raw_term, function(i){strsplit(i, "_merge")[[1]][1]}))
tbl<-data.frame(rbind(tbl, cbind(df, trait)))
}
snp_enrichment_plot(tbl[tbl$trait %in% test,], trait="", y.label = c("Myeloid cells","B cells", "NK cells", "CD8_T", "CD4_T","Gamma_delta_T", "Granulocyte-progenitors")) + xlim(-3,5) + facet_grid(. ~ trait) + theme(strip.text.x = element_text(size=18))
#snp_enrichment_plot(tbl[tbl$trait %in% control,], trait="", y.label = c("Myeloid cells","B cells", "NK cells", "CD8_T", "CD4_T","Gamma_delta_T", "Granulocyte-progenitors")) + facet_grid(. ~ trait)
```
**Summary**:
1. Overall, we see the differences in immune cell components that contribute to these three autoimmune diseases.
2. GPs and CD4+ T cels are consistently significant in enrichment of risk variants for all three diseases.
#### grouped by disjoint peaks
*Enrichment of each disjoint set of peaks*
**Motivation**:
Using disjoint groups of peaks from immune cell types to estimate separate contributions of immune components to disease heritability. These annotations are ideal predictors for the linear model used in either LDSC or TORUS to obtain unbiased enrichment estimators.
**Procedure**: The disjoint peaks were directly downloaded from the paper.
```{r echo=F, fig.width=12, fig.height=8}
tbl<-data.frame()
for(trait in test){
df<-read.table(sprintf("output/AAD/%s/Caldero2019_lineage.est", trait), header=T)
colnames(df)<-c("raw_term", "estimate", "low", "high")
### remove lines of FDR values for each loci
df["label"] = unlist(lapply(df$raw_term, function(i){length(strsplit(i[1], "[.]")[[1]])>1}))
df<-df[df$label==TRUE,]
df["term"]<-unlist(lapply(df$raw_term, function(i){strsplit(i, "[.]")[[1]][1]}))
tbl<-data.frame(rbind(tbl, cbind(df, trait)))
}
snpsIn<-read.table("output/AAD/Caldero2019_lineage_snps.sumstats", header=T)
snpsIn["term"]<-unlist(lapply(snpsIn$term, function(i){strsplit(i, "[.]")[[1]][1]}))
cols<-rev(c("B_stimulated", "T_stimulated", "BandT_stimulated",
"B_resting", "T_resting","nk_resting","myeloid_resting", "EPI_resting","progenitor_resting","open"))
cols<-unlist(lapply(cols, function(i){
label<-snpsIn[snpsIn$term==i, "percent"]
return(sprintf("%s(%s)", i, label))
}))
tbl<-inner_join(tbl, snpsIn, by="term")
tbl["term"]<-sprintf("%s(%s)", tbl$term, tbl$percent)
tbl["h2g"]<-round((tbl$percent_val * exp(tbl$estimate))*100, 1)
snp_enrichment_plot(tbl[tbl$raw_term!="thymo_resting.1" & tbl$trait %in% test,], y.order = cols , trait="") + facet_grid(. ~ trait, scales="free_x")
#snp_enrichment_plot(tbl[tbl$term!="thymo_resting" & tbl$trait %in% control,], y.order = cols , trait="") + facet_grid(. ~ trait, scales="free_x")
```
*Percent of causal variants in each peak set*
**Motivation**: To not only test for enrichment, but also estimate the percent of genetic signals explained by each set of peaks
**Procedure**:
For each disjoint set of peaks, I multiplied the percent of genome-wide SNPs that occur in each peak set with the exponentials of Torus enrichment estimates. The number in parenthesis is the percent of SNPs in peaks.
**Results**:
```{r echo=F, fig.width=10}
total_h2g<-unlist(lapply(split(tbl, tbl$trait), function(i){
sprintf("%s(%s)", unique(i$trait),paste0(sum(i$h2g),"%"))
}))
ggplot(tbl[!(tbl$raw_term %in% c("EPI_resting.1", "thymo_resting.1")),], aes(x=term, y=h2g, fill= trait)) +
scale_fill_discrete(name="Trait", type= palette.colors(palette="Set2")) +
geom_bar(stat="identity", position="dodge") + ylab("Percent of h2g") + xlab("") +
theme(axis.text.x=element_text(angle=45, vjust = 0.9, hjust=0.8, size=14),
axis.text.y=element_text(size=14), axis.title.y=element_text(size=14),
legend.title = element_text(size=14), legend.text = element_text(size=14))
```
### Disjoint immune peaks in blood vs. immune groups in lung
**Motivation**:
As the OCRs are largely shared across immune cell types, the use of multiple regression to estimate the effects of the openness of immune cells on disease risks can be hard to interpret. To address this issue, we generated disjoint sets of annotations as independent predictors, which can directly tell us how each immune component explains the genetic associations.
**Procedure**:
I partitioned peaks from lung and blood into three disjoint peak sets - only found in blood, only in lung and the shared. Specifically, I used `bedtools intersect` to call the overlapped peaks and peaks that are unique to either side for lung and blood peaks. Then, the overlapped peaks will be merged.
**Summary of the number of overlapped peaks**
*Ulirsch2019 dataset*
```{r}
dat<-read.table("output/AAD/Ulirsch2019_peaks_overlapped.txt", header=F)
colnames(dat)<-c("numPeaks","file")
dat$file<-gsub("GMP_", "GMP-",dat$file)
dat$file<-gsub("CD4_", "CD4+T_",dat$file)
dat$file<-gsub("CD8_", "CD8+T_",dat$file)
dat[c("celltype", "annot")]<-t(sapply(dat$file, function(i){strsplit(strsplit(i, "[.]")[[1]][1], "_")[[1]][1:2]}))
dat$celltype<-factor(dat$celltype, levels=c("B","CD4+T","CD8+T","T","NK", "Ery","GMP-A","GMP-B","GMP-C","GMP-merge","mDC","Mega","Mono","pDC"),
labels=c("B","CD4+T","CD8+T","T","NK", "Ery","GMP-A","GMP-B","GMP-C","GMP","mDC","Mega","Mono","pDC"))
ggplot(dat[!(dat$celltype %in% c("T", "GMP-A", "GMP-B", "GMP-C", "pDC")),], aes(x=celltype, y=numPeaks, fill=annot)) + geom_bar(stat="identity") + ylab("Number of Peaks") +
scale_fill_discrete(name="Annotation sets", labels=c("Blood Only", "Common", "Lung Only"), type= palette.colors(palette="Set3")[c(1,6,5)]) + xlab("") +
scale_y_continuous(breaks=seq(0, 200000, 50000), labels=paste0(seq(0, 200, 50), "K")) +
theme(axis.text.x=element_text(angle=45, vjust = 0.9, hjust=0.8)) +
theme(strip.text.x = element_text(size=16))
```
*Test the enrichment of disjoint peak sets*
```{r echo=F, fig.width=12}
snpsIn<-read.table("output/AAD/Ulirsch2019_disjoint_snps.sumstats", header=T, row.names = 1)
rownames(snpsIn)<-sub("GMP_", "GMP-", rownames(snpsIn))
snpsIn["term"]<-unlist(lapply(rownames(snpsIn), function(i){paste(strsplit(i, "_")[[1]][1:2], collapse="_")}))
tbl<-data.frame()
for(trait in traits){
df<-read.table(sprintf("output/AAD/%s/Ulirsch2019/Ulirsch2019_compare.txt", trait), header=F)
colnames(df)<-c("raw_term", "estimate", "low", "high")
df["tmp"] = unlist(lapply(df$raw_term, function(i){strsplit(i[1], "[.]")[[1]][1]}))
df["tmp"]<-sub("GMP_", "GMP-", df$tmp)
tbl<-data.frame(rbind(tbl, cbind(df, trait)))
}
tbl["term"]<-unlist(lapply(tbl$tmp, function(i){
return(sprintf("%s(%s)", i, snpsIn[snpsIn$term == i, "percent"]))
}))
tbl["percent"]<-unlist(lapply(tbl$tmp, function(i){
return(snpsIn[snpsIn$term == i, "percent_val"])
}))
tbl["celltype"]<-unlist(lapply(tbl$tmp, function(i){strsplit(i, "[_]")[[1]][1]}))
tbl["annot"]<-unlist(lapply(tbl$tmp, function(i){rev(strsplit(i, "[_]")[[1]])[1]}))
snp_enrichment_plot(tbl[tbl$trait%in%test & tbl$celltype %in% c("B", "CD4", "CD8", "mDC", "GMP-merge", "Mono"),], trait="") + facet_grid(. ~ trait) + theme(strip.text.x = element_text(size=14)) + xlim(-10, 15)
```
*Percent of causal variants in each peak set*
```{r echo=F,fig.width=10}
tbl["h2g"]=round(exp(tbl$estimate)*tbl$percent*100, 1)
ggplot(tbl[tbl$trait %in% test & !(tbl$celltype %in% c("GMP-A", "GMP-B", "GMP-C", "pDC")),], aes(x=celltype, y=h2g, fill= c(annot))) +
geom_bar(stat="identity") + ylab("Percent of h2g") +
scale_fill_discrete(name="Annotation sets", labels=c("Blood Only", "Common", "Lung Only"), type= palette.colors(palette="Set3")[c(1,6,5)]) +
theme(axis.text.x=element_text(angle=45, vjust = 0.9, hjust=0.8)) +
facet_grid(.~trait) + theme(strip.text.x = element_text(size=14))
```
## stratified-LDSC for partitioning h2g
### Joint run of all annotations in Wang2020
```{r}
labels<-read.table("output/AAD/Wang2020_annot.txt", header=F)
immune<-c("macrophage","erythrocyte","B_cell", "T_cell", "natural_killer_cell")
tbl<-c()
for(trait in test[1:3]){
f<-read.table(sprintf("output/AAD/%s/Wang2020_joint.results", trait), header=T)
new.f<-cbind(trait, f[f$Category %in% rev(f$Category)[1:17],])
new.f$Category<-factor(new.f$Category, levels=new.f$Category, labels=labels$V1)
new.f$Category<-factor(new.f$Category, levels=c("B_cell","T_cell","natural_killer_cell",
"erythrocyte","macrophage", "type_II_pneumocyte",
"club_cell","basal_cell","ciliated_cell",
"neuroendocrine_cell","type_I_pneumocyte",
"capillary_endothelial_cell", "endothelial_cell_of_artery",
"endothelial_cell_of_lymphatic_vessel",
"extracellular_matrix_secreting_cell","pericyte_cell"
))
tbl<-data.frame(rbind(tbl, new.f))
}
colnames(tbl)[c(2,6,7)]<-c("term", "estimate", "sd")
tbl["high"]<-tbl$Prop._h2 + tbl$Prop._h2_std_error
tbl["low"]<-tbl$Prop._h2 - tbl$Prop._h2_std_error
ggplot(tbl, aes(x=term, y=Prop._h2*100)) +
geom_bar(stat="identity") + geom_errorbar(aes(ymin=low*100, ymax=high*100), width=0.2) +
ylab("Percent of h2g") + xlab("") + ylab("Percent of h2") +
theme(axis.text.x=element_text(angle=45, vjust = 0.9, hjust=0.8)) + coord_flip() +
facet_wrap(.~trait, nrow=1) + theme(strip.text.x = element_text(size=18)) + theme_bw()
```
### Ulirsch2019
**Enrichment estimates**
```{r echo=F}
feature<-c("B", "CD4","CD8","Ery","GMP_A", "GMP_B", "GMP_C", "mDC", "pDC","Mega","Mono")
h2g<-c()
for(i in feature){
for(trait in test[1:3]){
tbl<-read.table(sprintf("output/AAD/%s/Ulirsch2019/%s_%s.results", trait, trait, i), header=T)
output<-cbind(trait, tbl[tbl$Category %in% rev(tbl$Category)[1:3],])
output$Category<-factor(output$Category, levels = c("L2_1", "L2_2", "L2_3"),
labels=paste(i, c("blood", "common", "lung"), sep="_"))
h2g<-data.frame(rbind(h2g, output))
}
}
h2g$Category<-gsub("GMP_", "GMP-", h2g$Category)
h2g[c("celltype", "annot")]<-t(sapply(h2g$Category, function(i){strsplit(as.character(i), "_")[[1]][1:2]}))
```
```{r}
feature<-c("B", "CD4","CD8","Ery","GMP-A", "GMP-B", "GMP-C", "mDC", "pDC","Mega","Mono")
ylabel<-c("B", "CD4+T","CD8+T","Ery","GMP-A", "GMP-B", "GMP-C", "mDC", "pDC","Mega","Mono")
colnames(h2g)[c(2,6,7)]<-c("term", "estimate", "sd")
h2g["high"]<-h2g$estimate + h2g$sd
h2g["low"]<-h2g$estimate - h2g$sd
h2g$celltype<-factor(h2g$celltype, levels = rev(feature), labels=rev(ylabel))
ggplot(h2g, aes(x=celltype, y=estimate, fill=annot)) +
geom_bar(stat="identity", position="dodge") +
geom_errorbar(aes(ymin=low, ymax=high), width=0.3, position=position_dodge(0.9)) +
xlab("") + ylab("Variant Enrichment") +
scale_fill_discrete(name="Annotation sets", labels=c("Blood Only", "Common", "Lung Only"), type= palette.colors(palette="Set3")[c(1,6,5)]) +
theme(axis.text.x=element_text(angle=45, vjust = 0.9, hjust=0.8)) + coord_flip() +
facet_wrap(.~trait, nrow=1) + theme(strip.text.x = element_text(size=18)) + theme_bw()
```
**h2 explained**
```{r echo=F, fig.width=10}
ggplot(h2g, aes(x=celltype, y=Prop._h2*100, fill= annot)) +
geom_bar(stat="identity") + ylab("Percent of h2") +
scale_fill_discrete(name="Annotation sets", labels=c("Blood Only", "Common", "Lung Only"), type= palette.colors(palette="Set3")[c(1,6,5)]) +
theme(axis.text.x=element_text(angle=45, vjust = 0.9, hjust=0.8, size=12),
axis.text.y=element_text(angle=45, vjust = 0.9, hjust=0.8, size=12),
axis.title.y=element_text(size=14)) + xlab("") +
facet_grid(.~trait) + theme(strip.text.x = element_text(size=14))
```
```{r echo=F, eval=F}
#### Compute for the percent of SNPs in annotations
dataPath<-"/home/jinggu/cluster/data/features/formatted/Ulirsch2019/compare"
tbl<-c()
#build IRange object for all QC SNPs in 1kg
snpRanges<-make_ranges(seqname=bigSNP$map$chromosome,
start = bigSNP$map$physical.pos,
end = bigSNP$map$physical.pos)
snpRanges<-plyranges::mutate(snpRanges, snp=bigSNP$map$marker.ID)
for (f in list.files(dataPath, pattern = "bed")){
curr <- rtracklayer::import(paste(dataPath, f, sep='/'), format='bed')
subdf <- IRanges::subsetByOverlaps(snpRanges, curr)
tbl<-data.frame(rbind(tbl, c(f, length(unique(subdf$snp)),
length(unique(subdf$snp))/dim(bigSNP$map)[1])
))
}
colnames(tbl)<-c("term", "count", "percent_val")
tbl["percent"]<-paste0(round(as.numeric(tbl$percent_val)*100, 1), "%")
write.table(tbl, "~/projects/funcFinemapping/output/AAD/Ulirsch2019_disjoint_snps.sumstats",
sep='\t', row.names = F, quote=F)
```