-
Notifications
You must be signed in to change notification settings - Fork 0
/
homolog_EC50.Rmd
198 lines (147 loc) · 8.43 KB
/
homolog_EC50.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
---
title: "Compute per-homolog EC50"
author: "Tyler Starr"
date: "08/03/2021"
output:
github_document:
toc: true
html_preview: false
editor_options:
chunk_output_type: inline
---
For some antibodies, we have 'binding curves' where we measured binding/escape across a concentration series of antibody. This script complements homolog_escape to compute an "EC50" of each homolog for the antibody.
```{r setup, message=FALSE, warning=FALSE, error=FALSE}
require("knitr")
knitr::opts_chunk$set(echo = T)
knitr::opts_chunk$set(dev.args = list(png = list(type = "cairo")))
#list of packages to install/load
packages = c("yaml","data.table","tidyverse","gridExtra")
#install any packages not already installed
installed_packages <- packages %in% rownames(installed.packages())
if(any(installed_packages == F)){
install.packages(packages[!installed_packages])
}
#load packages
invisible(lapply(packages, library, character.only=T))
#read in config file
config <- read_yaml("config.yaml")
#make output directory
if(!file.exists(config$EC50_scores_dir)){
dir.create(file.path(config$EC50_scores_dir))
}
```
Session info for reproducing environment:
```{r print_sessionInfo}
sessionInfo()
```
## Data input
Read in table of escape scores for barcodes, also annotated with variant counts and expression that we'll use in filtering. Remove barcodes with pre_count of 0 or NA expression effects -- these are likely barcodes that either do not express well, are generally low-frequency, or were actually "mutant" variants (but no mutation in the SSM) which were not part of the sub-pool of the homologs experiments that we used for these antibody selections. Last, subset on the set of antibodies that are actually in the "EC50" style experiment rather than single-concentration escape, and the "extant" sarb RBDs in th elibrary.
```{r input_data}
dt <- data.table(read.csv(file=config$escape_fracs_barcodes,stringsAsFactors = F))
dt <- dt[pre_count>0 & !is.na(expression),]
dt <- dt[antibody %in% c(config$S2K146_EC50,config$S2K146UCA_EC50,config$S2E12_EC50) & target %in% config$targets_ordered,]
```
## Filtering
We will use the per-barcode pre_count and expression scores to filter out escape scores used in computing per-homolog escape.
First, let's look at the distribution of pre-counts across barcodes. The median pre-count is `r median(dt$pre_count)`. Vertical lines on the two plots below indicate a threshold for pre_count of 1/2 that of the median pre-count, which is what we'll apply below. This is a stringent cutoff, but we have such a high degree of over-barcoding that we can do this.
```{r hist_pre_count, fig.width=4, fig.height=4, fig.align="center", dpi=300,dev="png"}
hist(log10(dt$pre_count),col="gray50",main="",xlab="pre-selection sequencing count (log10 scale)");abline(v=log10(0.5*median(dt$pre_count)),col="red",lty=2)
```
Let's also see how escape fraction correlates with pre-count. Theoretically, escape_fraction should not exceed 1, though we see it does, particularly when pre_count is lower.
```{r scatter_pre_count, fig.width=4, fig.height=4, fig.align="center", dpi=300,dev="png"}
plot(dt$pre_count,dt$escape_frac,pch=16,col="#00000060",log="x",xlab="pre-count (log scale)",ylab="escape fraction (shouldn't exceed 1)")
abline(h=1,lty=2,col="red")
abline(v=0.5*median(dt$pre_count),lty=2,col="red")
```
Remove barcode measurements for those where pre-count is less than half the median pre-count. This corresponds to removing variants with less than `r 0.5*median(dt$pre_count)` pre-sort counts.
```{r filter_precount}
dt <- dt[pre_count > 0.5*median(dt$pre_count),]
```
Censor NA measurements for AncSARS1a_alt and HKU3-8 which were generally non-expressing from our prior measurements (and so any barcodes that eke through are probably poorly expressed/artefactual)
```{r filter_nonexpressing_homologs}
dt[target%in%c("AncSARS1a_alt","HKU3-8"),escape_frac:=NA]
dt[target%in%c("AncSARS1a_alt","HKU3-8"),expression:=NA]
```
## Escape fraction per homolog
Next, let's visualize the escape fraction scores per barcode grouped by variant, with violin plots.
```{r escape_frac_vioplot, fig.width=13, fig.height=20, fig.align="center", dpi=300,dev="png"}
#set factor order for homologs to display
dt$target <- factor(dt$target, levels=config$targets_ordered)
ggplot(dt,aes(x=target,y=escape_frac))+
geom_violin(scale="width")+stat_summary(fun=median,geom="point",size=1)+
ggtitle("escape frac by homolog")+xlab("homolog")+theme(axis.text.x=element_text(angle=-90,hjust=0))+
facet_wrap(~antibody,ncol=1)
```
Can see some escape fracs above the theoretical maximum 1 escape frac. Fix these to max = 1.
```{r fix_max_escape_1}
dt[escape_frac>1,escape_frac:=1]
```
Fit per-barcode EC50s
```{r fit_per_bc_EC50}
#separate antibody out into antibody column and concentration column
dt[,c("mAb"):=strsplit(antibody,split="_")[[1]][1],by="antibody"]
dt[,c("concentration"):=strsplit(antibody,split="_")[[1]][2],by="antibody"]
#stupid funciton ot convert concentration string to numeric
numeric_conc <- function(val){
if(!is.na(as.numeric(val))){
return(as.numeric(val))
}else{
return(as.numeric(sub("-",".",val)))
}
}
dt[,conc:=numeric_conc(concentration),by="concentration"]
#define bind_frac as 1-escape_frac
dt[,bind_frac := 1-escape_frac]
#function to fit EC50 to bind_frac versus concentration
fit_EC50 <- function(bind_vec,conc_vec){
fit <- nls(bind_vec ~ (a/(1+(EC50/conc_vec)^n))+b,
start=list(a=1,b=0,EC50=10,n=2),
lower=list(a=0.75,b=0,EC50=0.01,n=1/3),
upper=list(a=1,b=0.25,EC50=10000,n=3),
algorithm="port")
return(summary(fit)$coefficients[c("EC50","a","b","n"),"Estimate"])
}
dt[,c("EC50") := tryCatch(fit_EC50(bind_frac,conc)[1],error=function(e){list(as.numeric(NA))}),by=c("library","barcode","mAb")]
#collapse to one row per barcode
dt <- unique(dt[,.(library,target,barcode,expression,mAb,EC50)])
```
Visualize distribution of EC50 fits for each homolog as violin plots
```{r EC50_vioplot, fig.width=13, fig.height=5, fig.align="center", dpi=300,dev="png"}
#set factor order for homologs to display
ggplot(dt,aes(x=target,y=log10(EC50)))+
geom_violin(scale="width")+stat_summary(fun=median,geom="point",size=1)+
ggtitle("log10(EC50)")+xlab("homolog")+theme(axis.text.x=element_text(angle=-90,hjust=0))+
facet_wrap(~mAb,ncol=1)
```
Collapse each homolog escape fraction to its median EC50 across barcodes.
```{r collapse_median}
dt[,median_EC50:=median(EC50,na.rm=T),by=c("library","target","mAb")] #median EC50 across all barcodes for a homolog
dt[,n_barcodes:=sum(!is.na(EC50)),by=c("library","target","mAb")] #the number of barcodes on which a homolog median EC50 was calculated
dt[,median_expression:=median(expression,na.rm=T),by=c("library","target")] #the average expression of that homolog
dt_collapse <- unique(dt[,.(library,target,median_expression,mAb,median_EC50,n_barcodes)]) #collapse down to homolog-level data instead of barcode-level
dt_collapse[is.na(median_EC50),n_barcodes:=NA]
```
Make histograms showing the typical number of barcodes on which a homolog escape fraction was averaged across. The median number of barcodes across all homolog escape fracs is `r median(dt$n_barcodes,na.rm=T)`.
```{r hist_n_barcode, fig.width=5, fig.height=4, fig.align="center", dpi=300,dev="png"}
hist(dt_collapse$n_barcodes,main="",col="gray50",xlab="number of barcodes",breaks=20)
```
## Heatmaps
Last, make heatmaps illustrating the EC50 of each homolog versus each antibody. These will be what we'll eventually align to the phylogeny as our data display?
```{r heatmap_homolog-EC50, fig.width=12,fig.height=4,fig.align="center", dpi=500,dev="png",echo=T}
dt_collapse$target <- factor(dt_collapse$target,levels=c(config$EurAf_extant,config$SARS2_extant,config$SARS1_extant,config$Clade2_extant))
ggplot(dt_collapse,aes(target,mAb))+geom_tile(aes(fill=log10(median_EC50)),color="black",lwd=0.1)+
scale_fill_gradient(high="white",low="#353D41",limits=c(-2,4),na.value="cyan")+
labs(x="RBD homolog",y="")+theme_classic(base_size=9)+
coord_equal()+theme(axis.text.x=element_text(angle=90,hjust=1,face="bold"))
invisible(dev.print(pdf, paste(config$EC50_scores_dir,"/heatmap_homologs.pdf",sep="")))
```
## Output
Save our final per-homolog and per-bc EC50s
```{r output_data}
dt_collapse %>%
mutate_if(is.numeric, round, digits=5) %>%
write.csv(file=config$EC50_homologs, row.names=F)
dt[,.(library,target,barcode,expression,mAb,EC50)] %>%
mutate_if(is.numeric, round, digits=5) %>%
write.csv(file=config$EC50_barcodes, row.names=F)
```