# Analysis of EGRET networks derived from 119 individuals in three cell types - LCLs, iPSCs and CMs.
## *Des Weighill*

This analysis can be run on the server or locally by setting the following parameter:

In [None]:
runserver=0

We need to set the data paths for the server.

In [None]:
if (runserver==1){
    ppath='/opt/data/egretnet/'
}else if(runserver==0){
    ppath=''
}

## Introduction
This notebook steps through the analysis performed in the EGRET publication<sup>1</sup>. Banovich and colleagues<sup>2</sup> had previously analyzed RNA-seq data derived from three cell types: lymphoblastoid cell lines (LCLs), induced pluripotent stem cells (iPSCs), and cardiomyocytes (CMs; differentiated from the iPSCs). We constructed 357 individual-specific EGRET networks using expression, genotype, and eQTL data from 119 Yoruba individuals for all three cell types used in the Banovich et al. study. We also constructed a baseline GRN for each cell type. The process is outlined briefly below:

Gene expression and eQTL data for a population of lymphoblastoid cell lines (LCLs), induced pluripotent stem cells (iPSCs), and cardiomyocytes (CMs) differentiated from the iPSCs, derived from the studies by Banovich et al.<sup>2</sup> and Li et al.<sup>3</sup>, as well as the corresponding genotypes of 119 Yoruba individuals, were downloaded on 06/17/2020. Expression data and eQTLs for LCLs, as well as eQTLs for iPSCs and iPSC-CMs, were downloaded from http://eqtl.uchicago.edu/, whereas gene expression data for iPSCs and CMs were obtained through the Gene Expression Omnibus (GEO) from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107654. For each cell type, we selected significant eQTLs (p < 1e-5) in which the SNP resided within a TF motif in the promoter region of a gene ([-750,+250] around the TSS). For each cell type, SNPs in the population of 119 Yoruba individuals that were also selected as eQTLs in that cell type were then isolated, and QBiC was run on this set of SNPs.
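The p-value and promoter-window parts of this selection can be sketched on toy data as follows. This is a minimal illustration only: the table layout, column names, and strand-aware window are assumptions, all values are made up, and the TF-motif overlap step is omitted.

```r
# Toy eQTL table; all values are made up for illustration.
eqtl <- data.frame(
  snp     = c("rs1", "rs2", "rs3", "rs4"),
  gene    = c("g1", "g1", "g2", "g2"),
  snp_pos = c(1200, 5000, 9900, 10700),
  pvalue  = c(1e-8, 1e-9, 1e-3, 1e-7),
  stringsAsFactors = FALSE
)
# Toy TSS annotation with strand (hypothetical layout).
tss <- data.frame(
  gene   = c("g1", "g2"),
  tss    = c(1000, 10000),
  strand = c("+", "-"),
  stringsAsFactors = FALSE
)

# Promoter window [-750, +250] relative to the TSS, oriented by strand
# (the strand handling is an assumption of this sketch).
tss$prom_start <- ifelse(tss$strand == "+", tss$tss - 750, tss$tss - 250)
tss$prom_end   <- ifelse(tss$strand == "+", tss$tss + 250, tss$tss + 750)

# Attach the promoter window of each gene to its eQTLs, then filter.
eqtl <- merge(eqtl, tss, by = "gene")
keep <- eqtl$pvalue < 1e-5 &
  eqtl$snp_pos >= eqtl$prom_start &
  eqtl$snp_pos <= eqtl$prom_end
selected <- eqtl[keep, c("snp", "gene", "snp_pos", "pvalue")]
print(selected)
```

In this toy table, `rs2` is dropped because it lies outside the promoter window and `rs3` is dropped because its p-value is above the cutoff.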

LCL and iPSC expression data had already been preprocessed through WASP and normalized by standardizing by gene and quantile-normalizing by individual. The CM expression data had not yet been normalized, so we followed the process detailed in the series matrix files from GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107654) to process them in the same manner. This involved scaling each gene by mean-centering and dividing by the standard deviation, followed by quantile-normalizing the individuals using the normalize.quantiles function in the preprocessCore R package. QBiC<sup>4</sup> was then run on the eQTLs to predict the effect of these SNPs on TF binding, using the full set of TF binding models in QBiC and hg19 as the reference genome.
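As a rough illustration of the two normalization steps, the sketch below standardizes each gene and then quantile-normalizes each individual on a toy matrix. The helper `quantile_normalize` is a hypothetical base-R stand-in written for this note, not the preprocessCore implementation, and it assumes no ties and no missing values.

```r
set.seed(1)
# Toy expression matrix: 6 genes (rows) x 4 individuals (columns).
expr <- matrix(rnorm(24, mean = 5, sd = 2), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("indiv", 1:4)))

# Step 1: standardize each gene (row) to mean 0 and standard deviation 1.
expr_scaled <- t(scale(t(expr), center = TRUE, scale = TRUE))

# Step 2: quantile-normalize individuals (columns). Base-R stand-in for
# preprocessCore::normalize.quantiles; assumes no ties and no NAs.
quantile_normalize <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")
  target <- rowMeans(apply(m, 2, sort))  # mean of each order statistic
  out    <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(m)
  out
}
expr_norm <- quantile_normalize(expr_scaled)

# After normalization every individual has the same sorted values.
print(apply(expr_norm, 2, sort))
```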

EGRET was then run for each genotype in each cell type (a total of $119 \times 3 = 357$ EGRET runs). In addition, message passing was performed using the co-expression network, PPI network, and the reference motif prior (which involves no genotype information) to construct a genotype-agnostic baseline GRN for each cell type. Message passing was performed using the pandaR package<sup>5</sup>.

Since some of these analyses take some time, we have saved preliminary files throughout the analysis which can be loaded to pick up the analysis at a certain point.


## Preliminaries
First we load all of the R libraries we will need in this analysis. We also set up the color palette to use in figures.

In [None]:
library(dplyr)
library(xtable)
library(preprocessCore)
library(ggplot2)
library(tidyr)
library(ggpubr)
library(reshape2)
library(jcolors)
library(ALPACA)
library(gridExtra)
library(extrafont)
library(grid)
library(ggthemes)
library(forcats)

# Color palette for figures: amber, magenta, teal
pallet <- c("#FFBF00",  "#870E75"  ,  "#0BB19F" )


## Data loading and parsing: Make PANDA/EGRET edge table

We now load each of the 357 EGRET networks (one per individual per cell type), together with the baseline network for each cell type, into one large data frame. We do this per cell type, loading the 119 EGRET networks plus 1 PANDA network. To save run time, we have saved the combined tables so that they can be loaded directly.

### Individual-specific networks in LCLs
Load the LCL networks:

In [None]:

load(paste0(ppath,"data/networks/finalEgret_v1_banovich_LCL_allModels_smart1_07032020_1_panda.RData"))
net <- melt(regnet)
colnames(net) = c("tf","gene","score")
net$id <- paste0(net$tf,net$gene)
regnet_edge_table_LCL <- data.frame(net$id)
colnames(regnet_edge_table_LCL) <- c("id")
regnet_edge_table_LCL$tf <- as.vector(net$tf)
regnet_edge_table_LCL$gene <- as.vector(net$gene)
regnet_edge_table_LCL$panda <- as.vector(net$score)
for (g in c(1:119)) {
  filename <- paste0(ppath,"data/networks/finalEgret_v1_banovich_LCL_allModels_smart1_07032020_",g,"_egret.RData")
  load(filename)
  net <- melt(regnet)
  colnames(net) = c("tf","gene","score")
  net$id <- paste0(net$tf,net$gene)
  if(isTRUE(all.equal(as.vector(regnet_edge_table_LCL$id),as.vector(net$id)))){
    print(g)
    regnet_edge_table_LCL$tempID <- as.vector(net$score)
    #print(colnames(regnet_edge_table))
    colnames(regnet_edge_table_LCL)[colnames(regnet_edge_table_LCL) == "tempID"] <- paste0("LCL_indiv_",g)
  }
}
#save(regnet_edge_table_LCL, file = paste0(ppath,"data/networks/LCL_table_EGRET_banovich_allModels.RData"))
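The table-building pattern used here can be illustrated on a toy network. The sketch below is a base-R approximation with made-up values: `to_edges` is a hypothetical helper that mirrors the long (tf, gene, score) layout produced by melting a matrix, and the edge-order check is wrapped in `isTRUE()` because `all.equal()` returns a character description rather than `FALSE` when its arguments differ.

```r
# Toy TF x gene score matrices standing in for the loaded `regnet` objects.
baseline <- matrix(c(0.1, 0.9, 0.4, 0.2), nrow = 2,
                   dimnames = list(c("TF1", "TF2"), c("geneA", "geneB")))
indiv <- baseline
indiv["TF2", "geneA"] <- 1.5

# Hypothetical helper: long (tf, gene, score) table in column-major order.
to_edges <- function(m) {
  data.frame(tf    = rep(rownames(m), times = ncol(m)),
             gene  = rep(colnames(m), each = nrow(m)),
             score = as.vector(m),
             stringsAsFactors = FALSE)
}
edges <- to_edges(baseline)
edges$id <- paste0(edges$tf, edges$gene)
indiv_edges <- to_edges(indiv)
indiv_edges$id <- paste0(indiv_edges$tf, indiv_edges$gene)

# Only append the individual's scores if the edge order is identical.
if (isTRUE(all.equal(edges$id, indiv_edges$id))) {
  edges$indiv_1 <- indiv_edges$score
}
print(edges)
```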

### Individual-specific networks in iPSCs
Load the iPSC networks:

In [None]:
load(paste0(ppath,"data/networks/finalEgret_v1_banovich_iPSC_allModels_smart1_07032020_1_panda.RData"))
net <- melt(regnet)
colnames(net) = c("tf","gene","score")
net$id <- paste0(net$tf,net$gene)
regnet_edge_table_iPSC <- data.frame(net$id)
colnames(regnet_edge_table_iPSC) <- c("id")
regnet_edge_table_iPSC$tf <- as.vector(net$tf)
regnet_edge_table_iPSC$gene <- as.vector(net$gene)
regnet_edge_table_iPSC$panda <- as.vector(net$score)
for (g in c(1:119)) {
  filename <- paste0(ppath,"data/networks/finalEgret_v1_banovich_iPSC_allModels_smart1_07032020_",g,"_egret.RData")
  load(filename)
  net <- melt(regnet)
  colnames(net) = c("tf","gene","score")
  net$id <- paste0(net$tf,net$gene)
  if(isTRUE(all.equal(as.vector(regnet_edge_table_iPSC$id),as.vector(net$id)))){
    print(g)
    regnet_edge_table_iPSC$tempID <- as.vector(net$score)
    #print(colnames(regnet_edge_table))
    colnames(regnet_edge_table_iPSC)[colnames(regnet_edge_table_iPSC) == "tempID"] <- paste0("iPSC_indiv_",g)
  }
}
#save(regnet_edge_table_iPSC, file = paste0(ppath,"data/networks/iPSC_table_EGRET_banovich_allModels.RData"))

### Individual-specific networks in iPSC-CMs
Load the CM networks:

In [None]:
load(paste0(ppath,"data/networks/finalEgret_v1_banovich_iPSC-CM_allModels_smart1_07032020_1_panda.RData"))
net <- melt(regnet)
colnames(net) = c("tf","gene","score")
net$id <- paste0(net$tf,net$gene)
regnet_edge_table_iPSC_CM <- data.frame(net$id)
colnames(regnet_edge_table_iPSC_CM) <- c("id")
regnet_edge_table_iPSC_CM$tf <- as.vector(net$tf)
regnet_edge_table_iPSC_CM$gene <- as.vector(net$gene)
regnet_edge_table_iPSC_CM$panda <- as.vector(net$score)
for (g in c(1:119)) {
  filename <- paste0(ppath,"data/networks/finalEgret_v1_banovich_iPSC-CM_allModels_smart1_07032020_",g,"_egret.RData")
  load(filename)
  net <- melt(regnet)
  colnames(net) = c("tf","gene","score")
  net$id <- paste0(net$tf,net$gene)
  if(isTRUE(all.equal(as.vector(regnet_edge_table_iPSC_CM$id),as.vector(net$id)))){
    print(g)
    regnet_edge_table_iPSC_CM$tempID <- as.vector(net$score)
    #print(colnames(regnet_edge_table))
    colnames(regnet_edge_table_iPSC_CM)[colnames(regnet_edge_table_iPSC_CM) == "tempID"] <- paste0("iPSC-CM_indiv_",g)
  }
}
#save(regnet_edge_table_iPSC_CM, file = paste0(ppath,"data/networks/CM_table_EGRET_banovich_allModels.RData"))

### Merge into a single table with an inner join
This ensures that we are considering the same set of edges for all cell types.

In [None]:
pre_merged_regnet_table <- merge(regnet_edge_table_LCL,regnet_edge_table_iPSC, by.x = c(1,2,3), by.y = c(1,2,3), all = FALSE)
merged_regnet_table <- merge(pre_merged_regnet_table,regnet_edge_table_iPSC_CM,by.x = c(1,2,3), by.y = c(1,2,3), all = FALSE)
colnames(merged_regnet_table)[4] <- "LCL_panda"
colnames(merged_regnet_table)[124] <- "iPSC_panda"
colnames(merged_regnet_table)[244] <- "iPSC-CM_panda"
#save(merged_regnet_table, file = paste0(ppath,"data/networks/merged_regnet_table_EGRET_banovich.RData"))
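To make the inner-join behavior concrete, here is a small sketch with two made-up edge tables. The table contents and the name-based renaming are illustrative assumptions, not part of the original analysis.

```r
lcl <- data.frame(id = c("TF1g1", "TF1g2", "TF2g1"),
                  tf = c("TF1", "TF1", "TF2"),
                  gene = c("g1", "g2", "g1"),
                  panda = c(0.2, 0.5, 0.1),
                  LCL_indiv_1 = c(0.3, 0.5, 0.9),
                  stringsAsFactors = FALSE)
ipsc <- data.frame(id = c("TF1g1", "TF2g1", "TF3g1"),
                   tf = c("TF1", "TF2", "TF3"),
                   gene = c("g1", "g1", "g1"),
                   panda = c(0.4, 0.1, 0.7),
                   iPSC_indiv_1 = c(0.4, 0.2, 0.7),
                   stringsAsFactors = FALSE)

# all = FALSE gives an inner join: only edges present in both tables survive.
merged <- merge(lcl, ipsc, by = c("id", "tf", "gene"), all = FALSE)

# The shared non-key column `panda` comes back as panda.x / panda.y, so
# renaming by name is less fragile than renaming by column position.
names(merged)[names(merged) == "panda.x"] <- "LCL_panda"
names(merged)[names(merged) == "panda.y"] <- "iPSC_panda"
print(merged)
```

Edges `TF1g2` and `TF3g1` each appear in only one table and are therefore dropped.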

### Construct edge differences from PANDA
The first question we asked was *"Which TF-gene edges are impacted by variants for each individual and each cell type?"* We first construct the difference table by subtracting the PANDA edge weight from the EGRET edge weight and taking the absolute value of the difference.


In [None]:
diff_table <- merged_regnet_table
for (i in c(1:119)){
  diff_table[,4+i] <- abs(merged_regnet_table[,4+i]-merged_regnet_table[,4])
  diff_table[,124+i] <- abs(merged_regnet_table[,124+i]-merged_regnet_table[,124])
  diff_table[,244+i] <- abs(merged_regnet_table[,244+i]-merged_regnet_table[,244])
}
#save(diff_table, file = paste0(ppath,"data/networks/diffTable_merged_banovich_edge_table.RData"))
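The same absolute-difference calculation can be written without a loop. The sketch below uses made-up edge weights and hypothetical column names to show the idea for a single cell type.

```r
# Toy edge weights: one baseline (PANDA) column and three individual columns.
edges <- data.frame(panda   = c(0.2, 1.0, -0.5),
                    indiv_1 = c(0.2, 1.6, -0.5),
                    indiv_2 = c(0.9, 1.0, -0.1),
                    indiv_3 = c(0.2, 0.7, -0.5))

indiv_cols <- c("indiv_1", "indiv_2", "indiv_3")
# Subtracting a vector of length nrow() from a data frame applies it to every
# column, so each individual's weight is compared with the baseline weight of
# the same edge.
diffs <- abs(edges[, indiv_cols] - edges$panda)
print(diffs)
```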

### GWAS associations
To investigate enrichments within disease-related genes, we obtained GWAS hits for Crohn's disease and coronary artery disease from the GWAS catalog.

Load the GWAS hits:


In [None]:
# load gene annotation to assign gene id to gene name
nameGeneMap <- read.table(paste0(ppath,"data/annotation/geneID_name_map.txt"), header = FALSE)
colnames(nameGeneMap) <- c("gene","name")

CD <- read.table(paste0(ppath,"data/annotation/CD_gwas_parsed.txt"), header = FALSE,  sep = "\t")
colnames(CD) <- c("CDtrait","name")
CD$gene <- nameGeneMap$gene[match(CD$name, nameGeneMap$name)]
CAD <- read.table(paste0(ppath,"data/annotation/CAD_gwas_parsed.txt"), header = FALSE,  sep = "\t")
colnames(CAD) <- c("CADtrait","name")
CAD$gene <- nameGeneMap$gene[match(CAD$name, nameGeneMap$name)]

Mark the genes and TFs in our EGRET networks that are GWAS hits:

In [None]:
diff_table$GWAS_CD_tf <- ifelse(diff_table$tf %in% CD$name, 1, 0)
diff_table$GWAS_CD_gene <- ifelse(diff_table$gene %in% CD$gene, 1, 0)
diff_table$GWAS_CAD_tf <- ifelse(diff_table$tf %in% CAD$name, 1, 0)
diff_table$GWAS_CAD_gene <- ifelse(diff_table$gene %in% CAD$gene, 1, 0)
#save(diff_table, file = paste0(ppath,"data/networks/diffTable_withGWAS_merged_banovich_edge_table.RData"))
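The name-to-ID mapping and flagging steps can be sketched on toy data. All identifiers below are made up, and the reading that TFs are matched by gene name while target genes are matched by gene ID is inferred from the code above.

```r
# Hypothetical gene ID <-> gene name map and GWAS hit list.
name_map <- data.frame(gene = c("ENSG01", "ENSG02", "ENSG03"),
                       name = c("AAA", "BBB", "CCC"),
                       stringsAsFactors = FALSE)
gwas <- data.frame(trait = c("trait1", "trait1"),
                   name  = c("BBB", "ZZZ"),
                   stringsAsFactors = FALSE)

# match() returns NA for names missing from the map (here "ZZZ").
gwas$gene <- name_map$gene[match(gwas$name, name_map$name)]

# TFs are flagged against gene names, target genes against gene IDs.
edges <- data.frame(tf = c("BBB", "CCC"), gene = c("ENSG01", "ENSG02"),
                    stringsAsFactors = FALSE)
edges$GWAS_tf   <- as.integer(edges$tf %in% gwas$name)
edges$GWAS_gene <- as.integer(edges$gene %in% gwas$gene)
print(edges)
```

GWAS hits whose names are absent from the map get an `NA` gene ID and so never flag a target gene.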

## Investigation of TF disruption Scores
EGRET inferred edge weights can be used to quantitatively estimate the predicted regulatory effects produced by SNPs on a given gene, TF, or TF-gene relationship. A higher edge weight between a TF *i* and a gene *j* is interpreted as a higher confidence that the TF binds the promoter of and regulates the expression of gene *j*. To assess the effects of SNPs on gene regulation, we define and calculate three different regulatory disruption scores for nodes and edges in a given genotype. The *edge disruption score* quantifies the extent to which a TF-gene regulatory relationship is disrupted by genetic variants. The *gene disruption score* assesses the extent to which a gene has disrupted regulation due to genetic variants in its promoter region. The *TF disruption score* represents the cumulative impact of cis-acting variants that disrupt a TF's regulation of its target genes. A TF with a high disruption score would suggest that many of its TF-gene edges have been disrupted by genetic variants and thus the overall regulatory influence of the TF is diminished. These scores are defined per edge/node in each genotype-specific EGRET network by comparing it to a baseline network.

![image](./fig1.pdf)


### Hard threshold edge differences, aggregate by TF and gene
Edge disruption scores were calculated for each edge in each individual network for each cell type as the absolute difference between each edge weight and the corresponding edge in the baseline network for that cell type. A threshold of 0.35 was then applied (edge disruption scores below 0.35 were set to zero), and the thresholded edge disruption scores were summed by TF to calculate a TF disruption score for each TF, and by gene to calculate a gene disruption score for each gene.
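In symbols, one way to write these definitions is given below. The notation is an assumption of this note (chosen to match the $d^{(TF)}$ axis label used in the plots later on) and is not taken from the publication: $w^{(q)}_{ij}$ is the EGRET edge weight between TF $i$ and gene $j$ for individual $q$, and $w^{(0)}_{ij}$ is the corresponding baseline weight.

$$
d^{(E)}_{ij} = \left| w^{(q)}_{ij} - w^{(0)}_{ij} \right|,
\qquad
\tilde{d}^{(E)}_{ij} =
\begin{cases}
d^{(E)}_{ij} & \text{if } d^{(E)}_{ij} \ge 0.35 \\
0 & \text{otherwise}
\end{cases}
$$

$$
d^{(TF)}_{i} = \sum_{j} \tilde{d}^{(E)}_{ij},
\qquad
d^{(G)}_{j} = \sum_{i} \tilde{d}^{(E)}_{ij}
$$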

In [None]:
diff_table_hard_thresh <- diff_table[,c(c(5:123),c(125:243),c(245:363))]
diff_table_hard_thresh[diff_table_hard_thresh < 0.35] <- 0
diff_table_hard_thresh <- cbind(diff_table[,c(2,3)],diff_table_hard_thresh)
tf_degrees_hard_thresh <- aggregate(diff_table_hard_thresh[,c(3:359)], by = list(diff_table_hard_thresh$tf), FUN = sum) 
gene_degrees_hard_thresh <- aggregate(diff_table_hard_thresh[,c(3:359)], by = list(diff_table_hard_thresh$gene), FUN = sum) 
colnames(tf_degrees_hard_thresh)[1] <- "tf"
colnames(gene_degrees_hard_thresh)[1] <- "gene"
head(tf_degrees_hard_thresh)
head(gene_degrees_hard_thresh)
tf_degrees_hard_thresh$GWAS_CD_tf <- ifelse(tf_degrees_hard_thresh$tf %in% CD$name, 1, 0)
gene_degrees_hard_thresh$GWAS_CD_gene <- ifelse(gene_degrees_hard_thresh$gene %in% CD$gene, 1, 0)
tf_degrees_hard_thresh$GWAS_CAD_tf <- ifelse(tf_degrees_hard_thresh$tf %in% CAD$name, 1, 0)
gene_degrees_hard_thresh$GWAS_CAD_gene <- ifelse(gene_degrees_hard_thresh$gene %in% CAD$gene, 1, 0)
save(tf_degrees_hard_thresh,file = paste0(ppath,"data/tf_degrees_hard_thresh_035.RData"))
save(gene_degrees_hard_thresh,file = paste0(ppath,"data/gene_degrees_hard_thresh_035.RData"))
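A toy version of the thresholding and aggregation may help check the logic. The values and column names below are made up for illustration.

```r
# Toy edge disruption scores for two individuals (values are made up).
edge_scores <- data.frame(
  tf      = c("TF1", "TF1", "TF2", "TF2"),
  gene    = c("g1", "g2", "g1", "g2"),
  indiv_1 = c(0.50, 0.10, 0.00, 0.40),
  indiv_2 = c(0.05, 0.36, 0.90, 0.20),
  stringsAsFactors = FALSE
)
score_cols <- c("indiv_1", "indiv_2")

# Hard threshold: scores below 0.35 are set to zero.
thresholded <- edge_scores
vals <- thresholded[, score_cols]
vals[vals < 0.35] <- 0
thresholded[, score_cols] <- vals

# Sum the surviving edge scores per TF and per gene.
tf_scores   <- aggregate(thresholded[, score_cols],
                         by = list(tf = thresholded$tf), FUN = sum)
gene_scores <- aggregate(thresholded[, score_cols],
                         by = list(gene = thresholded$gene), FUN = sum)
print(tf_scores)
print(gene_scores)
```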

### Scale and plot distributions of disruption scores

A scaled TF disruption score for a TF within an individual and cell type was then calculated by subtracting the mean TF disruption score for that individual/cell type and dividing by the standard deviation.

In [None]:
tf_degrees_hard_thresh_scaled <- scale(tf_degrees_hard_thresh[,c(2:358)], center = TRUE, scale = TRUE)
tf_degrees_hard_thresh_scaled <- cbind(tf_degrees_hard_thresh[,c(1)], tf_degrees_hard_thresh_scaled, tf_degrees_hard_thresh[,c(359:360)])
colnames(tf_degrees_hard_thresh_scaled)[1] <-"tf"

tf_degrees <- tf_degrees_hard_thresh
tf_degree_lcl <- melt(tf_degrees[,c(1,c(2:120))], id.vars = c("tf"))
tf_degree_lcl$tissue <- "LCL"
tf_degree_iPSC <- melt(tf_degrees[,c(1,c(121:239))], id.vars = c("tf"))
tf_degree_iPSC$tissue <- "iPSC"
tf_degree_iPSC_CM <- melt(tf_degrees[,c(1,c(240:358))], id.vars = c("tf"))
tf_degree_iPSC_CM$tissue <- "iPSC-CM"
tf_degrees_melted <- rbind(tf_degree_lcl,tf_degree_iPSC,tf_degree_iPSC_CM)
colnames(tf_degrees_melted) <- c("tf","individual","diff_out_degree","Tissue")

ggplot(tf_degrees_melted, aes(x=Tissue, y=diff_out_degree, fill=Tissue, col = Tissue)) + geom_violin() + theme_bw() + scale_fill_manual(values=pallet) + scale_color_manual(values=pallet) + labs(title="Distribution of TF Disruption Scores", x ="Cell type", y = "TF Disruption Score") + theme(legend.position="none")

tf_degrees <- tf_degrees_hard_thresh_scaled
tf_degree_lcl <- melt(tf_degrees[,c(1,c(2:120))], id.vars = c("tf"))
tf_degree_lcl$tissue <- "LCL"
tf_degree_iPSC <- melt(tf_degrees[,c(1,c(121:239))], id.vars = c("tf"))
tf_degree_iPSC$tissue <- "iPSC"
tf_degree_iPSC_CM <- melt(tf_degrees[,c(1,c(240:358))], id.vars = c("tf"))
tf_degree_iPSC_CM$tissue <- "iPSC-CM"
tf_degrees_melted <- rbind(tf_degree_lcl,tf_degree_iPSC,tf_degree_iPSC_CM)
colnames(tf_degrees_melted) <- c("tf","individual","diff_out_degree","Tissue")

ggplot(tf_degrees_melted, aes(x=Tissue, y=diff_out_degree, fill=Tissue, col = Tissue)) + geom_violin() + theme_bw() + scale_fill_manual(values=pallet) + scale_color_manual(values=pallet) + labs(title="Distribution of TF Disruption Scores", x ="Cell type", y = "TF Disruption Score") + theme(legend.position="none")
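The scaling step standardizes each network (column) across TFs. A minimal sketch with made-up scores:

```r
# Toy TF disruption scores: rows are TFs, columns are individual networks.
scores <- data.frame(tf = c("TF1", "TF2", "TF3", "TF4"),
                     LCL_indiv_1 = c(0, 0, 1, 7),
                     LCL_indiv_2 = c(2, 2, 2, 10),
                     stringsAsFactors = FALSE)

# scale() centres and scales each column, so every individual/cell type
# network is standardized across TFs.
scaled <- data.frame(tf = scores$tf, scale(scores[, -1]),
                     stringsAsFactors = FALSE)
print(scaled)
print(colMeans(scaled[, -1]))      # approximately 0 for every network
print(apply(scaled[, -1], 2, sd))  # 1 for every network
```

On this scale, a value of 3 means the TF's disruption score is three standard deviations above the mean for that network.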

### T-tests for disease genes
We then labelled TFs as associated with Crohn's disease (CD) and coronary artery disease (CAD) (Figure \ref{fig:tf_venn}) based on annotation from the NHGRI-EBI GWAS catalog<sup>6</sup>. We tested whether disease-associated TFs were more likely to have high disruption scores in relevant cell types. Using a one-sided t-test, we found that TF disruption scores were significantly higher in cardiomyocytes (CMs) for TFs associated with CAD than for non-CAD-related TFs (p = 4.5256e-6); this CAD enrichment was not observed in LCLs (p = 0.99831). Similarly, we found that TF disruption scores in LCLs, but not CMs, were substantially higher for TFs linked to CD than for non-CD-linked TFs (p = 5.3374e-16 in LCL networks, p = 1 in CM networks). This analysis leads to an important observation: genotype-mediated, disease-related TF disruptions are cell-type specific and can be identified using networks inferred by EGRET.


In [None]:
colnames(tf_degree_iPSC_CM) <-  c("tf","individual","diff_out_degree","tissue")
colnames(tf_degree_lcl) <-  c("tf","individual","diff_out_degree","tissue")
colnames(tf_degree_iPSC) <-  c("tf","individual","diff_out_degree","tissue")

tf_degree_iPSC_CM$CAD <- tf_degrees_hard_thresh$GWAS_CAD_tf[match(tf_degree_iPSC_CM$tf, tf_degrees_hard_thresh$tf)]
tf_degree_iPSC_CM$CD <- tf_degrees_hard_thresh$GWAS_CD_tf[match(tf_degree_iPSC_CM$tf, tf_degrees_hard_thresh$tf)]
tf_degree_lcl$CD <- tf_degrees_hard_thresh$GWAS_CD_tf[match(tf_degree_lcl$tf, tf_degrees_hard_thresh$tf)]
tf_degree_lcl$CAD <- tf_degrees_hard_thresh$GWAS_CAD_tf[match(tf_degree_lcl$tf, tf_degrees_hard_thresh$tf)]

test_cad_in_cm <- t.test(tf_degree_iPSC_CM[which(tf_degree_iPSC_CM$CAD == 1),]$diff_out_degree,tf_degree_iPSC_CM[which(tf_degree_iPSC_CM$CAD == 0),]$diff_out_degree, alternative="greater")

test_cd_in_cm <- t.test(tf_degree_iPSC_CM[which(tf_degree_iPSC_CM$CD == 1),]$diff_out_degree,tf_degree_iPSC_CM[which(tf_degree_iPSC_CM$CD == 0),]$diff_out_degree, alternative="greater")

test_cd_in_lcl <- t.test(tf_degree_lcl[which(tf_degree_lcl$CD == 1),]$diff_out_degree,tf_degree_lcl[which(tf_degree_lcl$CD == 0),]$diff_out_degree, alternative="greater")

test_cad_in_lcl <- t.test(tf_degree_lcl[which(tf_degree_lcl$CAD == 1),]$diff_out_degree,tf_degree_lcl[which(tf_degree_lcl$CAD == 0),]$diff_out_degree, alternative="greater")

test_cad_in_lcl$p.value
test_cad_in_cm$p.value
test_cd_in_lcl$p.value
test_cd_in_cm$p.value
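# Collect the four one-sided t-test p-values in a single table
ttest_df <- data.frame(
  Disease = c("CAD","CAD","CD","CD"),
  CellType = c("LCL","CM","LCL","CM"),
  Pvalue = c(test_cad_in_lcl$p.value, test_cad_in_cm$p.value, test_cd_in_lcl$p.value, test_cd_in_cm$p.value)
)
ttest_df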
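To help interpret the one-sided p-values, the sketch below runs the same kind of test on simulated scores. The data are randomly generated for illustration and have no connection to the real networks.

```r
set.seed(42)
# Simulated scaled disruption scores (not real data): 30 values for
# disease-associated TFs and 300 for all other TFs.
disease_scores <- rnorm(30, mean = 1.5, sd = 1)
other_scores   <- rnorm(300, mean = 0, sd = 1)

# One-sided Welch t-test: is the mean score of disease TFs greater?
res <- t.test(disease_scores, other_scores, alternative = "greater")
print(res$p.value)

# With alternative = "greater", a p-value near 1 means the first group's mean
# lies below the second group's mean, not merely that there is no effect.
res_rev <- t.test(other_scores, disease_scores, alternative = "greater")
print(res_rev$p.value)
```

Swapping the two groups gives the complementary p-value, which is why a one-sided test can return values very close to 1.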


### Individual-level TF disruption scores for CAD and CD genes

Indeed, we find that the highest TF disruption scores for CAD TFs occur in CMs and that the highest TF disruption scores for CD TFs occur in LCLs. Further supporting this observation, the TF disruption signal for CAD is dominated, in a subset of the study population, by a single TF, ERG, which is a member of the erythroblast transformation-specific (ETS) gene family and is known to be involved in angiogenesis. In these individuals, the high TF disruption scores for CAD TFs in CMs are driven by the presence or absence of a mutation on chromosome 1 (Chr1:201476815, an eQTL for CSRP1) that lies in the binding motif for the TF ERG in the promoter region of the gene CSRP1 (ENSG00000159176).

#### TF disruption scores for CAD and CD TFs

In [None]:
data <- tf_degrees_hard_thresh_scaled
cad_tfs_degree_hard_thresh <- data[which(data$GWAS_CAD_tf == 1),c(1:358)]
cad_tfs_degree_hard_thresh_melted <- melt(cad_tfs_degree_hard_thresh)
colnames(cad_tfs_degree_hard_thresh_melted) <- c("tf","sample","tf_diff_degree")
cad_tfs_degree_hard_thresh_melted <- separate(cad_tfs_degree_hard_thresh_melted, sample, c("cellType","tag", "indiv"), "_", remove = TRUE)
cad_tfs_degree_hard_thresh_melted$cellType[which(cad_tfs_degree_hard_thresh_melted$cellType == "iPSC-CM")] <- "CM"

cad_plot <- ggplot(cad_tfs_degree_hard_thresh_melted) +
  geom_point(shape = 19, size = 0.5, aes(x = reorder(indiv, -tf_diff_degree), y = tf_diff_degree, col = cellType)) +
  theme_classic() +
  scale_color_manual(values=pallet[c(2,1,3)]) +
  xlab("Individual") +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) +
  labs(title = "Disruption scores for CAD TFs", col = "Cell type", shape = "TF") +
  facet_grid(. ~ cellType) +
  theme(legend.position="none") +
  ylab(bquote(d^(TF))) +
  theme(strip.text.x = element_text(size = 10, color = "black", face = "bold"), strip.text.y = element_text(size = 10, color = "black", face = "bold"))

disease_test_tfs_degree_hard_thresh <- data[which(data$GWAS_CD_tf == 1),c(1:358)]
disease_test_tfs_degree_hard_thresh_melted <- melt(disease_test_tfs_degree_hard_thresh)
colnames(disease_test_tfs_degree_hard_thresh_melted) <- c("tf","sample","tf_diff_degree")
disease_test_tfs_degree_hard_thresh_melted <- separate(disease_test_tfs_degree_hard_thresh_melted, sample, c("cellType","tag", "indiv"), "_", remove = TRUE)
disease_test_tfs_degree_hard_thresh_melted$cellType[which(disease_test_tfs_degree_hard_thresh_melted$cellType == "iPSC-CM")] <- "CM"

cd_plot <- ggplot(disease_test_tfs_degree_hard_thresh_melted) +
  geom_point(shape = 19, size = 0.5, aes(x = reorder(indiv, -tf_diff_degree), y = tf_diff_degree, col = cellType)) +
  theme_classic() +
  scale_color_manual(values=pallet[c(2,1,3)]) +
  xlab("Individual") +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) +
  labs(title = "Disruption scores for CD TFs", col = "Cell type", shape = "TF") +
  facet_grid(. ~ cellType) +
  theme(legend.position="none") +
  ylab(bquote(d^(TF))) +
  theme(strip.text.x = element_text(size = 10, color = "black", face = "bold"), strip.text.y = element_text(size = 10, color = "black", face = "bold"))

cad_annotated <- annotate_figure(cad_plot,fig.lab = "A", fig.lab.face = "plain", fig.lab.pos = "top.left", fig.lab.size = 14)
cd_annotated <- annotate_figure(cd_plot,fig.lab = "B", fig.lab.face = "plain", fig.lab.pos = "top.left", fig.lab.size = 14)

combined_plot <- grid.arrange(cad_annotated, cd_annotated, nrow=2, heights = c(3,3))
plot(combined_plot)


Question: Which individuals have high disruption scores for ERG in CMs? We select CAD TFs with a scaled disruption score above 3 in CMs:

In [None]:

indivs_high_erg_disrup <- cad_tfs_degree_hard_thresh_melted[which((cad_tfs_degree_hard_thresh_melted$tf_diff_degree>3) & (cad_tfs_degree_hard_thresh_melted$cellType == "CM")),]
indivs_high_erg_disrup



Which individual has the highest CM disruption score for ERG?

In [1]:
# restrict to CMs first so that the maximum is taken over CM scores only
cm_cad_scores <- cad_tfs_degree_hard_thresh_melted[which(cad_tfs_degree_hard_thresh_melted$cellType == "CM"),]
top_indiv <- cm_cad_scores[which(cm_cad_scores$tf_diff_degree == max(cm_cad_scores$tf_diff_degree)),]
top_indiv


# get tf disruption scores for ERG for all individuals in iPSC-CMs
tf_disrup_erg_cm <- cad_tfs_degree_hard_thresh_melted[which((cad_tfs_degree_hard_thresh_melted$tf == "ERG") & (cad_tfs_degree_hard_thresh_melted$cellType == "CM")),c('indiv','tf_diff_degree')]



## Effect of variants on network communities
We also tested the hypothesis that the network effects of genetic variants have the potential to subtly change the modular structure of genotype-specific networks, altering the functional network modules active in an individual. ALPACA<sup>7</sup> is a method that compares the modular structure of two networks and identifies modules that differ between the networks. The resulting gene differential modularity (DM) scores indicate which genes have undergone the greatest change in their modular environment. We used ALPACA to compare the modular structure of the cell-type and individual-specific EGRET GRNs with the baseline GRN for the corresponding cell type, and calculated the DM score for each gene in each network.

##### Load ALPACA scores for LCLs:

In [None]:
net <- read.table(paste0(ppath,"alpaca/LCL_100_ALPACA_scores.txt"), header = FALSE, sep = "\t")
colnames(net) = c("node","score")
net <- separate(net, node, c("node","set"), "_", remove = TRUE)

alpaca_score_table_LCL <- data.frame(net$node[which(net$set == "B")])
colnames(alpaca_score_table_LCL) = c("node")
for (n in c(1:119)) {
  file <- paste0(ppath,"alpaca/LCL_",n,"_ALPACA_scores.txt")
  net <- read.table(file, header = FALSE, sep = "\t")
  colnames(net) = c("node","score")
  net <- separate(net, node, c("node","set"), "_", remove = TRUE)
  print(n)
  alpaca_score_table_LCL$tempID <- net$score[match(alpaca_score_table_LCL$node, net$node)]
  colnames(alpaca_score_table_LCL)[colnames(alpaca_score_table_LCL) == "tempID"] <- paste0("LCL_indiv_",n)
  }
save(alpaca_score_table_LCL, file = paste0(ppath,"alpaca/LCL_alpaca_score_table_genes.RData"))
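The parsing and alignment pattern in this loop can be sketched with base R on toy data. The node labels, the reading of the `_A`/`_B` suffix as TF/gene, and the scores are assumptions made for illustration.

```r
# Hypothetical ALPACA-style score table: node labels carry an "_A" (TF) or
# "_B" (gene) suffix; the suffix meaning is an assumption of this sketch.
net <- data.frame(node = c("ERG_A", "ENSG0001_B", "ENSG0002_B"),
                  score = c(0.02, 0.10, 0.30),
                  stringsAsFactors = FALSE)

# Split on the last underscore so node names containing "_" stay intact.
net$set  <- sub("^.*_", "", net$node)
net$node <- sub("_[^_]*$", "", net$node)

genes <- net[net$set == "B", c("node", "score")]

# match() aligns scores from another network to the reference gene order and
# yields NA where a gene is absent.
other <- data.frame(node = c("ENSG0002", "ENSG0009"), score = c(0.5, 0.7),
                    stringsAsFactors = FALSE)
genes$indiv_1 <- other$score[match(genes$node, other$node)]
print(genes)
```

Because missing genes come back as `NA`, it may be worth checking for `NA` values before the scores are normalized.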

##### Load ALPACA scores for iPSCs:

In [None]:
net <- read.table(paste0(ppath,"alpaca/iPSC_100_ALPACA_scores.txt"), header = FALSE, sep = "\t")
colnames(net) = c("node","score")
net <- separate(net, node, c("node","set"), "_", remove = TRUE)

alpaca_score_table_iPSC <- data.frame(net$node[which(net$set == "B")])
colnames(alpaca_score_table_iPSC) = c("node")
for (n in c(1:119)) {
  file <- paste0(ppath,"alpaca/iPSC_",n,"_ALPACA_scores.txt")
  net <- read.table(file, header = FALSE, sep = "\t")
  colnames(net) = c("node","score")
  net <- separate(net, node, c("node","set"), "_", remove = TRUE)
  print(n)
  alpaca_score_table_iPSC$tempID <- net$score[match(alpaca_score_table_iPSC$node, net$node)]
  colnames(alpaca_score_table_iPSC)[colnames(alpaca_score_table_iPSC) == "tempID"] <- paste0("iPSC_indiv_",n)
  }
save(alpaca_score_table_iPSC, file = paste0(ppath,"alpaca/iPSC_alpaca_score_table_genes.RData"))

##### Load ALPACA scores for iPSC-CMs:

In [None]:
net <- read.table(paste0(ppath,"alpaca/iPSC-CM_100_ALPACA_scores.txt"), header = FALSE, sep = "\t")
colnames(net) = c("node","score")
net <- separate(net, node, c("node","set"), "_", remove = TRUE)

alpaca_score_table_iPSCCM <- data.frame(net$node[which(net$set == "B")])
colnames(alpaca_score_table_iPSCCM) = c("node")
for (n in c(1:119)) {
  file <- paste0(ppath,"alpaca/iPSC-CM_",n,"_ALPACA_scores.txt")
  net <- read.table(file, header = FALSE, sep = "\t")
  colnames(net) = c("node","score")
  net <- separate(net, node, c("node","set"), "_", remove = TRUE)
  print(n)
  alpaca_score_table_iPSCCM$tempID <- net$score[match(alpaca_score_table_iPSCCM$node, net$node)]
  colnames(alpaca_score_table_iPSCCM)[colnames(alpaca_score_table_iPSCCM) == "tempID"] <- paste0("iPSCCM_indiv_",n)
  }
save(alpaca_score_table_iPSCCM, file = paste0(ppath,"alpaca/iPSC-CM_alpaca_score_table_genes.RData"))

##### Merge and scale alpaca scores:

In [None]:
pre_merged_alpaca_table <- merge(alpaca_score_table_LCL,alpaca_score_table_iPSC, by.x = c(1), by.y = c(1), all = FALSE)
merged_alpaca_table <- merge(pre_merged_alpaca_table,alpaca_score_table_iPSCCM,by.x = c(1), by.y = c(1), all = FALSE)
merged_alpaca_table_qq <- cbind(merged_alpaca_table[,c(1)],as.data.frame(normalize.quantiles(as.matrix(merged_alpaca_table[,c(2:358)]))))
colnames(merged_alpaca_table_qq) <- colnames(merged_alpaca_table)
merged_alpaca_table_qq_sc <- cbind(merged_alpaca_table_qq[,c(1)],as.data.frame(t(scale(t(merged_alpaca_table_qq[,c(2:358)]), center = TRUE, scale = TRUE))))
colnames(merged_alpaca_table_qq_sc)[1] <- "node"



##### Plot distribution of alpaca scores:

In [None]:
melted_alpaca <- melt(merged_alpaca_table_qq_sc)
colnames(melted_alpaca) <- c("node","sample","alpaca_score")
melted_alpaca <- separate(melted_alpaca, sample, c("cellType","tag", "indiv"), "_", remove = TRUE)
melted_alpaca$cellType[which(melted_alpaca$cellType == "iPSCCM")] <- "CM"
alpaca_dist <- ggplot(melted_alpaca) + geom_violin(aes(x = cellType, y = (alpaca_score), fill = cellType, col = cellType)) + theme_classic()+ scale_fill_manual(values=pallet[c(2,1,3)]) + scale_color_manual(values=pallet[c(2,1,3)])+ labs(y = "DM score",x = "Cell type", title = "Gene DM scores") + theme(legend.position = "none") 
alpaca_dist
plot(alpaca_dist)

### ALPACA scores for top CD and CAD genes in relevant cell types
Further evidence of cell type-specific alteration of functional modules can be seen by examining the DM scores of disease-associated target genes (as annotated by the NHGRI-EBI GWAS catalog). Coronary artery disease genes with high DM scores in CMs had low DM scores in iPSCs and LCLs. In contrast, genes associated with Crohn's disease, which has a strong immune component, that had high DM scores in LCLs had low DM scores in iPSCs and CMs.


In [None]:
data <- merged_alpaca_table_qq_sc
only_CAD_genes <- as.vector(CAD$gene)[which(!((as.vector(CAD$gene) %in% as.vector(CD$gene))))]
cads <- data[which(data$node %in% CAD$gene),c(1:358)]
cads_melted <- melt(cads)
colnames(cads_melted) <- c("node","sample","alpaca_score")
cads_melted <- separate(cads_melted, sample, c("cellType","tag", "indiv"), "_", remove = TRUE)
cads_melted$cellType[which(cads_melted$cellType == "iPSCCM")] <- "CM"
ggplot(cads_melted) + geom_violin(aes(x = cellType, y = (alpaca_score), fill = cellType, col = cellType)) + theme_classic()+ scale_fill_manual(values=pallet[c(2,1,3)]) + scale_color_manual(values=pallet[c(2,1,3)])+ labs(y = "DM score",x = "Cell type", title = "Top CAD genes in CMs") + theme(legend.position = "none") 

cads_melted_cm <- cads_melted[which(cads_melted$cellType == "CM"),]
top_10_perc_cm_cad <- top_frac(cads_melted_cm,0.1,alpaca_score)
top_10_perc_cm_genes_cad <- unique(top_10_perc_cm_cad$node)

cads <- data[which(data$node %in% top_10_perc_cm_genes_cad),c(1:358)]
cads_melted <- melt(cads)
colnames(cads_melted) <- c("node","sample","alpaca_score")
cads_melted <- separate(cads_melted, sample, c("cellType","tag", "indiv"), "_", remove = TRUE)
cads_melted$cellType[which(cads_melted$cellType == "iPSCCM")] <- "CM"
cad_alpaca_plot <- ggplot(cads_melted) + geom_violin(aes(x = cellType, y = (alpaca_score), fill = cellType, col = cellType)) + theme_classic()+ scale_fill_manual(values=pallet[c(2,1,3)]) + scale_color_manual(values=pallet[c(2,1,3)])+ labs(y = "DM score",x = "Cell type", title = "High-DM-score CAD genes in CMs") + theme(legend.position = "none") 
cad_alpaca_plot <- annotate_figure(cad_alpaca_plot,fig.lab = "A", fig.lab.face = "plain", fig.lab.pos = "top.left", fig.lab.size = 14)

only_CD_genes <- as.vector(CD$gene)[which(!((as.vector(CD$gene) %in% as.vector(CAD$gene))))]
cds <- data[which(data$node %in% CD$gene),c(1:358)]
cds_melted <- melt(cds)
colnames(cds_melted) <- c("node","sample","alpaca_score")
cds_melted <- separate(cds_melted, sample, c("cellType","tag", "indiv"), "_", remove = TRUE)

cds_melted_lcl <- cds_melted[which(cds_melted$cellType == "LCL"),]
top_10_perc_lcl_cd <- top_frac(cds_melted_lcl,0.1,alpaca_score)
top_10_perc_lcl_genes_cd <- unique(top_10_perc_lcl_cd$node)

cds <- data[which(data$node %in% top_10_perc_lcl_genes_cd),c(1:358)]
cds_melted <- melt(cds)
colnames(cds_melted) <- c("node","sample","alpaca_score")
cds_melted <- separate(cds_melted, sample, c("cellType","tag", "indiv"), "_", remove = TRUE)
cds_melted$cellType[which(cds_melted$cellType == "iPSCCM")] <- "CM"
cd_alpaca_plot <- ggplot(cds_melted) + geom_violin(aes(x = cellType, y = (alpaca_score), fill = cellType, col = cellType)) + theme_classic()+ scale_fill_manual(values=pallet[c(2,1,3)]) + scale_color_manual(values=pallet[c(2,1,3)])+ labs(y = "DM score",x = "Cell type", title  = "High-DM-score CD genes in LCLs") + theme(legend.position = "none") 
cd_alpaca_plot <- annotate_figure(cd_alpaca_plot,fig.lab = "B", fig.lab.face = "plain", fig.lab.pos = "top.left", fig.lab.size = 14)

combined_plot <- grid.arrange(cad_alpaca_plot, cd_alpaca_plot, nrow=1, widths = c(3,3))
combined_plot_annotated <- annotate_figure(combined_plot,top = text_grob("Distribution of differential modularity scores", color = "black", face = "bold", size = 14))

plot(combined_plot_annotated)
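The gene selection above keeps every gene that has at least one (gene, individual) score in the top 10% for the relevant cell type. A base-R sketch of that idea on made-up scores follows; it is a stand-in for `dplyr::top_frac()` and does not reproduce its handling of ties.

```r
# Toy long-format DM scores: 4 genes x 5 individuals in one cell type.
dm <- data.frame(node  = rep(c("g1", "g2", "g3", "g4"), each = 5),
                 indiv = rep(1:5, times = 4),
                 alpaca_score = c(0.1, 0.2, 0.1, 0.3, 0.2,   # g1
                                  2.5, 0.4, 0.2, 0.1, 0.3,   # g2
                                  0.0, 0.1, 0.2, 0.1, 0.0,   # g3
                                  0.5, 0.6, 3.0, 0.4, 0.2),  # g4
                 stringsAsFactors = FALSE)

# Keep the top 10% of (gene, individual) rows by score; with 20 rows that is
# the 2 highest-scoring rows.
n_keep   <- ceiling(nrow(dm) / 10)
top_rows <- dm[order(dm$alpaca_score, decreasing = TRUE)[seq_len(n_keep)], ]

# A gene is selected if any of its individual-level scores is in the top 10%.
top_genes <- unique(top_rows$node)
print(top_genes)
```

Note that a gene can be selected on the strength of a single individual, so the resulting violins summarize all individuals for genes that are extreme in at least one of them.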

##### Example - individuals with mutation in ERG binding site affecting expression of CSRP1

In [None]:
geno <- read.table(paste0(ppath,"data/annotation/erg_variant_genotypes"), header = FALSE)
data <- merged_alpaca_table_qq_sc
gene <- "ENSG00000159176"
genes <- data[which(data$node == gene),c(1:358)]
genes_melted <- melt(genes)
colnames(genes_melted) <- c("node","sample","alpaca_score")
genes_melted <- separate(genes_melted, sample, c("cellType","tag", "indiv"), "_", remove = TRUE)
genes_melted$tds <- tf_disrup_erg_cm$tf_diff_degree[match(genes_melted$indiv,tf_disrup_erg_cm$indiv)]
genes_melted$dosage <- as.numeric(as.vector(c(as.vector(geno[1,c(10:128)]),as.vector(geno[1,c(10:128)]),as.vector(geno[1,c(10:128)]))))

p1 <- ggplot(genes_melted[which(genes_melted$cellType == "iPSCCM"),]) + geom_point(shape = 21,aes(x = reorder(indiv, -alpaca_score), y = (alpaca_score), fill = dosage, size = tds)) + theme_classic()+ scale_fill_gradient(low = "white", high = "#870E75") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())+ labs(y = "DM score",fill = "A",size = expression(paste("ERG ", d^{(TF)})), x= "Individual") 

p2 <- annotate_figure(p1,top = text_grob("DM scores for CSRP1 (ENSG00000159176)", color = "black", face = "bold", size = 14))
plot(p2)

### Gene ontology enrichment
GO enrichment was performed on the above ranked lists using GOrilla (http://cbl-gorilla.cs.technion.ac.il/). The code below creates the GO term plots. Many terms are enriched only in CMs, which makes a main-text figure difficult to read. For the CM-only terms, the main-text figure therefore shows just the terms we wish to discuss, and a supplemental figure shows all terms. Both figures (reduced and full) are produced below.
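
As a rough illustration of the reduced-versus-full split described above, the sketch below filters a toy enrichment table down to a curated list of terms and computes the `-log10(P-value)` used on the x-axis. The table `go_toy`, the vector `curated_terms`, and all term names and p-values are hypothetical and serve only to show the pattern; the real inputs are read from the GOrilla output files.

```r
# Toy enrichment table; term names and p-values are made up for illustration.
go_toy <- data.frame(
  Description = c("term A", "term B", "term C", "term A"),
  Pvalue      = c(1e-6, 1e-3, 1e-4, 1e-2),
  celltype    = c("CM", "CM", "CM", "LCL"),
  stringsAsFactors = FALSE
)

# Hypothetical curated list of terms to show in the reduced figure.
curated_terms <- c("term A", "term C")

# Reduced table: keep only rows whose term is in the curated list.
# The full figure would use go_toy unchanged.
go_reduced <- go_toy[go_toy$Description %in% curated_terms, ]

# Transform p-values for plotting: smaller p-values give larger values.
go_reduced$neg_log10_p <- -log10(go_reduced$Pvalue)

print(go_reduced)
```

A curated term is kept for every cell type in which it appears, so the reduced figure still allows comparison across cell types.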

Several GO terms relevant to CMs and to cardiovascular function and development, including "regulation of actomyosin structure organization", "prepulse inhibition", "ephrin receptor signaling pathway", "maintenance of postsynaptic specialization structure", and "actin cytoskeleton reorganization", were enriched in CMs from this individual but not in their LCLs or iPSCs.
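
A term counts as enriched "only in CMs" when it appears in the CM results but in neither the LCL nor the iPSC results. A minimal sketch of that set logic with base R set operations follows; the `*_toy` vectors and term names are hypothetical placeholders, not real enrichment results.

```r
# Hypothetical enriched-term lists for each cell type.
cm_terms_toy   <- c("term A", "term X", "term Y")
lcl_terms_toy  <- c("term X", "term Z")
ipsc_terms_toy <- c("term Y", "term W")

# Terms found in CMs but in neither of the other two cell types.
only_cm_toy <- setdiff(cm_terms_toy, union(lcl_terms_toy, ipsc_terms_toy))

# Equivalent logical-indexing form, matching the style of the notebook code.
only_cm_toy2 <- cm_terms_toy[!(cm_terms_toy %in% lcl_terms_toy) &
                             !(cm_terms_toy %in% ipsc_terms_toy)]

print(only_cm_toy)
```

The two forms agree here. Note that `setdiff` also drops duplicated entries, whereas logical indexing keeps them, which only matters if a term description occurs more than once in a list.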


In [None]:
go_cm <- read.table(paste0(ppath,"data/annotation/formatted_GO_indiv18_cm.txt"), header = FALSE, sep = "\t")
colnames(go_cm) <- c("GOTerm", "Description", "Pvalue", "FDR", "Enrichment", "N", "B", "n", "b", "Genes")
go_cm$celltype <- "CM"

go_lcl <- read.table(paste0(ppath,"data/annotation/formatted_GO_indiv18_lcl.txt"), header = FALSE, sep = "\t")
colnames(go_lcl) <- c("GOTerm", "Description", "Pvalue", "FDR", "Enrichment", "N", "B", "n", "b", "Genes")
go_lcl$celltype <- "LCL"

go_ipsc <- read.table(paste0(ppath,"data/annotation/formatted_GO_indiv18_ipsc.txt"), header = FALSE, sep = "\t")
colnames(go_ipsc) <- c("GOTerm", "Description", "Pvalue", "FDR", "Enrichment", "N", "B", "n", "b", "Genes")
go_ipsc$celltype <- "iPSC"

go_enrich <- rbind(go_ipsc,go_lcl,go_cm)
terms_to_plot <- read.table(paste0(ppath,"data/annotation/plot_terms"), header = FALSE, sep = "\t")

only_cm <- go_cm$Description[which((!(go_cm$Description %in% go_lcl$Description)) & (!(go_cm$Description %in% go_ipsc$Description)))]
only_lcl <- go_lcl$Description[which((!(go_lcl$Description %in% go_cm$Description)) & (!(go_lcl$Description %in% go_ipsc$Description)))]
only_ipsc <- go_ipsc$Description[which((!(go_ipsc$Description %in% go_lcl$Description)) & (!(go_ipsc$Description %in% go_cm$Description)))]

go_enrich_plot <- go_enrich[which(go_enrich$Description %in% terms_to_plot$V1),]
go_enrich_plot$color <- 'black'
go_enrich_plot$color[which(go_enrich_plot$Description %in% only_cm)] <- '#870E75'
go_enrich_plot$color[which(go_enrich_plot$Description %in% only_lcl)] <- '#0BB19F'
go_enrich_plot$color[which(go_enrich_plot$Description %in% only_ipsc)] <- '#FFBF00'

color <- factor(unique(go_enrich_plot[,c("Description", "color")])$color, levels = c("black","#FFBF00","#0BB19F", "#870E75"))
go_enrich_plot$color2 <- color[match(go_enrich_plot$color, color)]
order <- with(unique(go_enrich_plot[,c("Description", "color2")]), order(color2))

Description <- factor(unique(go_enrich_plot[,c("Description", "color2")])$Description, levels = unique(go_enrich_plot[,c("Description", "color2")])[order,"Description"])
go_enrich_plot$Description2 <- Description[match(go_enrich_plot$Description, Description)]
go_enrich_plot$celltype <- factor(go_enrich_plot$celltype, levels = c("CM","LCL","iPSC"))

p <- ggplot(go_enrich_plot, aes(x = -log10(Pvalue), y = Description2, colour = celltype, group = celltype, size = b)) +
    geom_point() +
    expand_limits(x = 0) +
    labs(x = "-log10(P-value)", y = "GO term", colour = "P value", size = "Count") +
    theme_few() +
    scale_color_manual(values = pallet[c(2,3,1)]) +
    facet_grid(~celltype) +
    guides(color = "none")  # guides(color = FALSE) is deprecated in recent ggplot2 releases
y_labs <- ggplot_build(p)$layout$panel_params[[1]]$y$get_labels()
col_vec <- as.vector(go_enrich_plot$color2)[match(y_labs,go_enrich_plot$Description2)]

p <- p + theme(axis.text.y = element_text(colour = col_vec), legend.position = "right") + scale_x_continuous(name="-log10(p-value)",limits=c(1, 8)) +theme(strip.text.x = element_text(size = 16, color = "black", face = "bold"),strip.text.y = element_text(size = 16, color = "black", face = "bold"))
g <- ggplot_gtable(ggplot_build(p))
strip_t <- which(grepl('strip-t', g$layout$name))
fills <- c("#870E75","#0BB19F","#FFBF00")
k <- 1
# Fill each facet strip with the colour of its cell type (CM, LCL, iPSC)
for (i in strip_t) {
    j <- which(grepl('rect', g$grobs[[i]]$grobs[[1]]$childrenOrder))
    g$grobs[[i]]$grobs[[1]]$children[[j]]$gp$fill <- fills[k]
    k <- k+1
}
grid.draw(g)

In [None]:
go_enrich$color <- 'black'
go_enrich$color[which(go_enrich$Description %in% only_cm)] <- '#870E75'
go_enrich$color[which(go_enrich$Description %in% only_lcl)] <- '#0BB19F'
go_enrich$color[which(go_enrich$Description %in% only_ipsc)] <- '#FFBF00'

color <- factor(unique(go_enrich[,c("Description", "color")])$color, levels = c("black","#FFBF00","#0BB19F", "#870E75"))
go_enrich$color2 <- color[match(go_enrich$color, color)]
order <- with(unique(go_enrich[,c("Description", "color2")]), order(color2))

Description <- factor(unique(go_enrich[,c("Description", "color2")])$Description, levels = unique(go_enrich[,c("Description", "color2")])[order,"Description"])
go_enrich$Description2 <- Description[match(go_enrich$Description, Description)]
go_enrich$celltype <- factor(go_enrich$celltype, levels = c("CM","LCL","iPSC"))

p <- ggplot(go_enrich, aes(x = -log10(Pvalue), y = Description2, colour = celltype, group = celltype, size = b)) +
    geom_point() +
    expand_limits(x = 0) +
    labs(x = "-log10(P-value)", y = "GO term", colour = "P value", size = "Count") +
    theme_few() +
    scale_color_manual(values = pallet[c(2,3,1)]) +
    facet_grid(~celltype) +
    guides(color = "none")  # guides(color = FALSE) is deprecated in recent ggplot2 releases
y_labs <- ggplot_build(p)$layout$panel_params[[1]]$y$get_labels()
col_vec <- as.vector(go_enrich$color2)[match(y_labs,go_enrich$Description2)]

p <- p + theme(axis.text.y = element_text(colour = col_vec)) + scale_x_continuous(name="-log10(p-value)") +theme(strip.text.x = element_text(size = 12, color = "black", face = "bold"),strip.text.y = element_text(size = 12, color = "black", face = "bold"))
g <- ggplot_gtable(ggplot_build(p))
strip_t <- which(grepl('strip-t', g$layout$name))
fills <- c("#870E75","#0BB19F","#FFBF00")
k <- 1
# Fill each facet strip with the colour of its cell type (CM, LCL, iPSC)
for (i in strip_t) {
    j <- which(grepl('rect', g$grobs[[i]]$grobs[[1]]$childrenOrder))
    g$grobs[[i]]$grobs[[1]]$children[[j]]$gp$fill <- fills[k]
    k <- k+1
}

grid.draw(g)


# References

[1] Weighill, Deborah, et al. "Predicting genotype-specific gene regulatory networks." Genome Research 32.3 (2022): 524-533.

[2] Banovich, Nicholas E., et al. "Impact of regulatory variation across human iPSCs and differentiated cells." Genome Research 28.1 (2018): 122-131.

[3] Li, Yang I., et al. "RNA splicing is a primary link between genetic variation and disease." Science 352.6285 (2016): 600-604.

[4] Martin, Vincentius, et al. "QBiC-Pred: quantitative predictions of transcription factor binding changes due to sequence variants." Nucleic Acids Research 47.W1 (2019): W127-W135.

[5] Glass, Kimberly, et al. "Passing messages between biological networks to refine predicted interactions." PLoS ONE 8.5 (2013): e64832.

[6] Buniello, Annalisa, et al. "The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019." Nucleic Acids Research 47.D1 (2019): D1005-D1012.

[7] Padi, Megha, and John Quackenbush. "Detecting phenotype-driven transitions in regulatory network structure." npj Systems Biology and Applications 4.1 (2018): 1-12.