# Network analysis of eQTLs with CONDOR
Authors: Deborah Weighill<sup>1</sup>, Maud Fagny<sup>2</sup>

<sup>1</sup> Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill

<sup>2</sup> Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE - Le Moulon, 91190, Gif-sur-Yvette, France

## 1. Introduction
This notebook demonstrates the analysis from the study by Fagny et al.<sup>1</sup>, which explored tissue regulation by eQTLs in several human tissues.

The study explored the structure of expression quantitative trait locus (eQTL) networks in 13 different tissues. Each edge in a given network represents an eQTL relationship: a link connecting a genetic variant to the gene whose expression it is associated with. The collection of these eQTL relationships forms a bipartite network for each tissue. These networks were found to be highly modular, with communities often grouping genes by functional process. Both tissue-specific and tissue-conserved communities were identified.

This notebook takes the reader through large sections of the analysis performed in that study<sup>1</sup>. In addition, a Shiny app developed alongside the paper lets users browse and download the networks at http://networkmedicine.org:3838/eqtl/.

## 2. Load packages
We start by loading the required packages.

In [None]:
.libPaths(c("~/R/x86_64-redhat-linux-gnu-library/","/nas/longleaf/apps/r/3.6.0/lib64/R/library"))
library(netZooR)
library(tidyr)
library(data.table)
library(gprofiler2)

## 3. The Data
This study used eQTLs derived from genetic variant and gene expression data for 13 different tissues from GTEx version 6. Associations between SNPs and gene expression levels were tested while correcting for potentially confounding variables, including sex, age, ethnic background and the first three principal components of the genotype data.
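
As a rough illustration of what testing an association "while correcting for confounding variables" involves, the sketch below fits a linear model for a single SNP-gene pair on simulated data. Everything in it is an assumption made for illustration (the variable names, the simulated effect sizes, the use of `lm`, and the omission of ethnic background); it is not the pipeline used in the study.

```r
# Simulate one SNP, one gene and a few covariates for 200 hypothetical samples
set.seed(1)
n <- 200
genotype <- rbinom(n, size = 2, prob = 0.3)   # 0/1/2 copies of the minor allele
sex <- rbinom(n, size = 1, prob = 0.5)
age <- round(runif(n, 20, 70))
pcs <- matrix(rnorm(n * 3), ncol = 3,
              dimnames = list(NULL, c("PC1", "PC2", "PC3")))

# Expression depends on genotype (the simulated eQTL effect) plus covariates and noise
expression <- 0.8 * genotype + 0.3 * sex + 0.01 * age + rnorm(n)
toy <- data.frame(expression, genotype, sex, age, pcs)

# Test the SNP effect while adjusting for the covariates
fit <- lm(expression ~ genotype + sex + age + PC1 + PC2 + PC3, data = toy)
snp.p.value <- summary(fit)$coefficients["genotype", "Pr(>|t|)"]
snp.p.value
```

In a genome-wide setting, one such test is run per SNP-gene pair and the resulting p-values are corrected for multiple testing, which is where the FDR values used later in this notebook come from.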

## 4. cis- vs. trans-eQTLs

Here we investigate and plot the number of cis- and trans-eQTLs in each tissue, as a function of sample size. First we load data objects containing tissue names, number of samples for each tissue, and eQTLs for each tissue. 

In [None]:
# Load data objects
## names of different tissues
load("/opt/data/netZooR/tissueeqtl/data/tissue_names.RData")
head(Tissues)

In [None]:
## number of samples for each tissue
load("/opt/data/netZooR/tissueeqtl/data/nb_samples.RData")
head(nb.samples)
## eQTLs
load("/opt/data/netZooR/tissueeqtl/data/all_tissues_eqtls_fdr0.20.2_1MB.Rdata")
eqtl$adipose_subcutaneous[c(1:5),]

Next, we count the number of cis- and trans-eQTLs in each tissue, keeping only associations with FDR ≤ 0.05.

In [None]:
# Count number of cis and trans eQTLs for each tissue
size.com <- lapply(eqtl, function(x){
    c("cis"=sum(x$cis.or.trans=="cis" & x$FDR<=0.05),
    "trans"=sum(x$cis.or.trans=="trans" & x$FDR<=0.05))
})
nb.cis <- unlist(lapply(size.com, function(x){x['cis']}))
nb.trans <- unlist(lapply(size.com, function(x){x['trans']}))
names(nb.cis) <- gsub("\\.cis", "", names(nb.cis))
names(nb.trans) <- gsub("\\.trans", "", names(nb.trans))

In [None]:
head(nb.cis)
head(nb.trans)
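
The counting step can be checked on a small hand-made example. The sketch below uses two toy tissues (hypothetical names and values) and `vapply`, which keeps the tissue names as they are, so no `gsub` clean-up is needed afterwards. The helper `count.type` is introduced here for illustration only.

```r
# Two toy tissues with the same columns used above (values are made up)
toy.eqtl <- list(
  tissue_a = data.frame(cis.or.trans = c("cis", "cis", "trans", "cis"),
                        FDR = c(0.01, 0.20, 0.03, 0.04),
                        stringsAsFactors = FALSE),
  tissue_b = data.frame(cis.or.trans = c("trans", "cis"),
                        FDR = c(0.001, 0.049),
                        stringsAsFactors = FALSE)
)

# Count eQTLs of one type that pass the FDR threshold
count.type <- function(x, type, fdr = 0.05) {
  sum(x$cis.or.trans == type & x$FDR <= fdr)
}

toy.cis <- vapply(toy.eqtl, count.type, integer(1), type = "cis")
toy.trans <- vapply(toy.eqtl, count.type, integer(1), type = "trans")
toy.cis
toy.trans
```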

Next we plot the number of cis- and trans-eQTLs against the number of samples as a scatter plot with a logarithmic y-axis. The plot shows that far more cis-eQTLs than trans-eQTLs were identified, and that more eQTLs were identified in tissues with larger sample sizes, as expected given that sample size affects the statistical power for eQTL discovery.

In [None]:
# Plot number of cis- and trans-eQTLs as a function of tissue sample size
par(mar=c(4,5,1,1)+0.1)
plot(nb.samples[names(nb.cis)], (nb.cis), pch=16, col="red", log='y',
     ylim=c(1000,700000), xlab="Nb samples", ylab="Nb eQTLs")
points(nb.samples[names(nb.trans)], (nb.trans), pch=17, col="blue")
legend("topleft", bty='n', legend=c("cis-eQTLs", "trans-eQTLs"), pch=16:17, col=c("red", "blue"))

## 5. Cluster eQTL networks using CONDOR

We next cluster the eQTL network of each tissue into communities using CONDOR<sup>2</sup>, a bipartite network clustering method designed specifically for eQTL networks. For each tissue separately, we load the eQTL object and convert it to a CONDOR object.

In [None]:
# For each tissue:
for (x in rownames(Tissues)){
    ## load the eQTLs for that tissue
    eqtl.file <- paste0("data/",x,"_eqtls.RData")
    load(eqtl.file)
    if(("RS_ID_dbSNP142_CHG37p13" %in% colnames(eqtls)) && ("RS_ID_dbSNP135_original_VCF" %in% colnames(eqtls))){
        eqtls$RSID <- eqtls$RS_ID_dbSNP142_CHG37p13
        eqtls$RSID[eqtls$RS_ID_dbSNP142_CHG37p13=='.'] <- eqtls$RS_ID_dbSNP135_original_VCF[eqtls$RS_ID_dbSNP142_CHG37p13=='.']
    }
    
    ## assign eQTL SNPs to the "red" group and eQTL genes to the "blue" group
    elist<- data.frame("red" = eqtls$RSID, "blue" = eqtls$genes)
    
    ## create a CONDOR object of nodes, edges, and properties
    condor.object <- create.condor.object(elist)
    
    ## compute clusters and node modularity, and save the results
    condor.result <- condor.cluster(condor.object,project=FALSE)
    condor.modularity <- condor.qscore(condor.result)
    clust.file <- paste0("../data/",x,"_clusters.RData")
    mod.file <- paste0("../data/",x,"_mods.RData")
    
    # save the clustering results to file
    save(condor.result, file=clust.file)
    save(condor.modularity, file=mod.file)   
}
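
CONDOR searches for the community assignment that maximizes bipartite modularity. To give a feel for what that score measures, the sketch below computes a Barber-style bipartite modularity for a toy network with a fixed, hand-picked assignment. The toy edges, the community labels and the helper `bipartite.modularity` are illustrative assumptions and are not part of netZooR; the sketch only evaluates the score and does not perform the optimization.

```r
# Toy bipartite network: 4 SNPs (rows) and 3 genes (columns)
A <- matrix(0, nrow = 4, ncol = 3,
            dimnames = list(paste0("s", 1:4), paste0("g", 1:3)))
toy.edges <- rbind(c("s1", "g1"), c("s2", "g1"), c("s2", "g2"),
                   c("s3", "g2"), c("s3", "g3"), c("s4", "g3"))
A[toy.edges] <- 1

# Hand-picked community labels for SNPs and genes
snp.com <- c(s1 = 1, s2 = 1, s3 = 2, s4 = 2)
gene.com <- c(g1 = 1, g2 = 2, g3 = 2)

# Q = (1/m) * sum over same-community pairs of (A_ij - k_i * d_j / m)
bipartite.modularity <- function(A, snp.com, gene.com) {
  m <- sum(A)                        # number of edges
  k <- rowSums(A)                    # SNP degrees
  d <- colSums(A)                    # gene degrees
  B <- A - outer(k, d) / m           # observed minus expected edges
  same <- outer(snp.com, gene.com, "==")
  sum(B[same]) / m
}

Q <- bipartite.modularity(A, snp.com, gene.com)
Q
```

A score near 0 means the assignment captures no more within-community edges than expected by chance given the node degrees; higher scores indicate denser communities.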

## 6. Investigate community modularities

Different network clusterings have different modularities, a measure of how "tight" the clusters are. We will plot the modularity of each tissue's network. First, we combine the individual CONDOR modularity objects into a list and save it for later use.

In [None]:
# the list where we will store the community/modularity objects per tissue
communities <- list()
# for each tissue
for (x in rownames(Tissues)){
    # load the CONDOR modularity object for that tissue
    mods.file <- paste0("/opt/data/netZooR/tissueeqtl/data/",x,"_mods.RData")
    load(mods.file)
    # insert the modularity object into the list at the position named after the tissue.
    communities[[x]] <- condor.modularity
}
# save the communities list to a RData file
save(communities, file="../data/all_tissues_communities.RData")

Create objects containing lists of the SNPs, genes, and edges in each tissue-specific network.

In [None]:
# Extract snps
snps <- lapply(communities, function(x){tapply(as.character(x$red.memb$red.names), x$red.memb$com, function(y){y})})

# Extract genes
genes <- lapply(communities, function(x){tapply(as.character(x$blue.memb$blue.names), x$blue.memb$com, function(y){y})})

# Extract edges
edges <- lapply(communities, function(x){
    g <- x$blue.memb$com
    names(g) <- as.character(x$blue.memb$blue.names)
    s <- x$red.memb$com
    names(s) <- as.character(x$red.memb$red.names)
    e <- x$edges
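    # index by name (as.character guards against factor columns being used as integer codes)
    e$snp.com <- s[as.character(e$red)]
    e$gen.com <- g[as.character(e$blue)]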
    f <- e[e$snp.com == e$gen.com, ]
    tapply(paste(f$red, f$blue, sep="_"), f$snp.com, function(x){x})
})

save(snps, file="../data/all_tissues_snps.RData")
save(genes, file="../data/all_tissues_genes.RData")
save(edges, file="../data/all_tissues_edges.RData")
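
To make the edge-extraction step concrete, the sketch below applies the same logic to a toy object with the same field names (`red.memb`, `blue.memb`, `edges`). The toy values are made up, and `split` is used in place of `tapply` purely for readability.

```r
# Toy community object: 3 SNPs, 2 genes, 4 edges
toy <- list(
  red.memb = data.frame(red.names = c("rs1", "rs2", "rs3"), com = c(1, 1, 2),
                        stringsAsFactors = FALSE),
  blue.memb = data.frame(blue.names = c("gA", "gB"), com = c(1, 2),
                         stringsAsFactors = FALSE),
  edges = data.frame(red = c("rs1", "rs2", "rs2", "rs3"),
                     blue = c("gA", "gA", "gB", "gB"),
                     stringsAsFactors = FALSE)
)

# Named lookup vectors: node name -> community
s <- setNames(toy$red.memb$com, toy$red.memb$red.names)
g <- setNames(toy$blue.memb$com, toy$blue.memb$blue.names)

# Annotate each edge with the communities of its two endpoints
e <- toy$edges
e$snp.com <- s[e$red]
e$gen.com <- g[e$blue]

# Keep intra-community edges only and group them by community
intra <- e[e$snp.com == e$gen.com, ]
toy.by.com <- split(paste(intra$red, intra$blue, sep = "_"), intra$snp.com)
toy.by.com
```

The edge `rs2_gB` links community 1 to community 2, so it is dropped; only edges whose SNP and gene share a community are kept.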

To plot the modularity of each tissue's network, we load the communities data object (if it is not already loaded).

In [None]:
# Load data
communities.file <- "data/all_tissues_communities.RData"
load("/opt/data/netZooR/tissueeqtl/data/tissue_names.RData")
load(communities.file)

# extract network modularities from the communities data list
modularity <- unlist(lapply(communities, function(x){max(x$modularity)}))

# Plot modularity
mod <- sort(modularity, decreasing=TRUE)
n <- as.character(Tissues[names(modularity)[order(modularity, decreasing=TRUE)],2])
barplot(mod, horiz=TRUE, main = 'Modularity',  xlab = 'Modularity',cex.names=0.4, names.arg=n, xlim = c(0,1), col='darkorchid')

We can visually represent a network's community structure as a heatmap. Below we plot the communities for the breast tissue network. Each column represents a SNP, each row represents a gene, and each point within the heatmap represents an eQTL edge between a SNP and a gene. Intracommunity edges are plotted in blue, whereas intercommunity edges are plotted in black. 

In [None]:
# breast tissue
load("/opt/data/netZooR/tissueeqtl/data/breast_mammary_tissue_clusters.RData")
cols = rep("dodgerblue",length(unique(condor.result$red.memb$com)))
condor.plot.communities(condor.result,color_list = cols)

## 7. Distribution of communities across chromosomes
A question that might arise is whether the communities within a network tend to contain SNPs and genes co-localized on the same chromosome. To investigate this, we first count the number of chromosomes represented by the SNPs and genes in each community within each tissue.

In [None]:
load("/opt/data/netZooR/tissueeqtl/data/all_tissues_eqtls_fdr0.20.2_1MB.Rdata")
load("/opt/data/netZooR/tissueeqtl/data/all_tissues_edges.RData")
load("/opt/data/netZooR/tissueeqtl/data/nb_samples.RData")

# Output files
chr.data.file <- paste0('../data/summary_cluster_chr.Rdata')

# Function to count number of chromosomes 
count.chr <- function(edg, qtl){
    a <- unique(data.frame(qtl[edges %in% edg,], stringsAsFactors=F)$Chr)
    b <- unique(data.frame(qtl[edges %in% edg], stringsAsFactors=F)$chr)
    c("SNP"=length(a), "Genes"=length(b), "All"=length(unique(c(a, b))))
}

### For each tissue, for each community, calculate number of chromosomes to which SNPs/Genes map
nb.chr <- list()
for(tissue in names(eqtl)){
    print(tissue)
    qtl <- eqtl[[`tissue`]]
    qtl$edges <- paste(qtl$RSID, qtl$genes, sep='_')
    qtl <- data.table(qtl)
    setkey(qtl, edges)
    nb.chr[[`tissue`]] <- matrix(unlist(lapply(edges[[`tissue`]], count.chr, qtl)), ncol=3, byrow=T)
}
save(nb.chr, file=chr.data.file)
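
The per-community chromosome count can be illustrated on a toy table. In the sketch below the column names `snp.chr` and `gene.chr`, the toy edges and the helper `count.chr.toy` are all hypothetical; the real eQTL objects use their own column names.

```r
# Toy eQTL table: one row per SNP-gene edge, with the chromosome of each endpoint
toy.qtl <- data.frame(
  edge = c("rs1_gA", "rs2_gA", "rs3_gB", "rs4_gC"),
  snp.chr = c("1", "1", "2", "2"),
  gene.chr = c("1", "1", "2", "7"),
  stringsAsFactors = FALSE
)

# Edges belonging to each toy community
toy.com.edges <- list("1" = c("rs1_gA", "rs2_gA"),
                      "2" = c("rs3_gB", "rs4_gC"))

# Count distinct chromosomes among the SNPs, the genes, and both together
count.chr.toy <- function(edg, qtl) {
  sub <- qtl[qtl$edge %in% edg, ]
  a <- unique(sub$snp.chr)
  b <- unique(sub$gene.chr)
  c(SNP = length(a), Genes = length(b), All = length(unique(c(a, b))))
}

# One row per community, one column per count
toy.nb.chr <- t(vapply(toy.com.edges, count.chr.toy, integer(3), qtl = toy.qtl))
toy.nb.chr
```

Community 1 sits entirely on one chromosome, whereas community 2 contains a trans edge and therefore spans two.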

Now create a table summarizing the fraction of communities within each tissue that span 1, 2, ..., up to 22 chromosomes.

In [None]:
## Create table summarizing proportion of communities with element in 1, 2, ..., 22 chromosomes for one tissue.
tab.chr.snps <- data.frame( matrix(0, ncol=22, nrow=length(nb.chr)) )
tab.chr.genes <- data.frame( matrix(0, ncol=22, nrow=length(nb.chr)) )
tab.chr.all <- data.frame( matrix(0, ncol=22, nrow=length(nb.chr)) )
colnames(tab.chr.snps ) <- colnames(tab.chr.genes ) <- colnames(tab.chr.all ) <- as.character(1:22)
rownames(tab.chr.snps ) <- rownames(tab.chr.genes ) <- rownames(tab.chr.all ) <- Tissues[names(nb.chr),2]

for(i in 1:length(nb.chr)){
    tmp.snps <- table(nb.chr[[i]][,1])
    tmp.genes <- table(nb.chr[[i]][,2])
    tmp.all <- table(nb.chr[[i]][,3])
    tab.chr.snps[i,names(tmp.snps)] <- tmp.snps/sum(tmp.snps)
    tab.chr.genes[i,names(tmp.genes)] <- tmp.genes/sum(tmp.genes)
    tab.chr.all[i,names(tmp.all)] <- tmp.all/sum(tmp.all)
}
head(tab.chr.snps)
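
Each row of these tables is built by tabulating the per-community chromosome counts and dividing by the number of communities. A minimal sketch for one hypothetical tissue with five communities (the counts are made up):

```r
# Number of chromosomes spanned by each of five toy communities
toy.counts <- c(1, 1, 2, 1, 3)

# Tabulate and convert to proportions, filling a 22-slot row indexed by name
tmp <- table(toy.counts)
toy.row <- setNames(numeric(22), as.character(1:22))
toy.row[names(tmp)] <- tmp / sum(tmp)
toy.row[1:4]

# Proportion of communities spanning at least two chromosomes
prop.multi.chr <- 1 - toy.row[["1"]]
prop.multi.chr
```

Indexing by `names(tmp)` rather than by position is what keeps the proportions aligned with the right chromosome count when some counts never occur.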

Visualize the table as a bar plot.

In [None]:
## Plot proportion of communities with SNPs and genes from at least 2 chromosomes
b<-cbind(1-tab.chr.snps[,1], 1-tab.chr.genes[,1], 1-tab.chr.all[,1])
rownames(b) <- rownames(tab.chr.snps)
colnames(b) <- c("SNPs", "Genes", "Both")
barplot(t(b[,3:1]), horiz=T, beside=T, cex.names=0.4,
        col=c( "green3", "blue","red"), xlab="Proportion of communities with SNPs and genes in >=2 chr",
        xlim=0:1)
legend("right", legend=colnames(b), fill=c("red", "blue", "green3"), bty='n')

## 8. Functional enrichment in communities

We now investigate whether different communities in aorta tissue are enriched for any biological functions. First we load the communities data object, select our tissue of interest, and then strip the version suffix (for example ".1") from the gene IDs so that we can perform enrichment analysis using the gprofiler2 package.

In [None]:
# load data
load("/opt/data/netZooR/tissueeqtl/data/all_tissues_communities.RData")
names(communities)

# select tissue
tissue="artery_aorta"
condor.modularity <- communities[[`tissue`]]

In [None]:
# parse gene ids
condor.modularity$blue.memb <- separate(data = condor.modularity$blue.memb, blue.names, c("gene", "transcript_num"), sep="\\.")
head(condor.modularity$blue.memb)
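
If only the bare gene ID is needed, the same parsing can be done in base R with a regular expression. The sketch below uses made-up IDs in the Ensembl style and assumes the suffix is always a dot followed by digits at the end of the string.

```r
# Example versioned gene IDs (illustrative values only)
toy.ids <- c("ENSG00000000003.10", "ENSG00000000005.5", "ENSG00000000419.8")

# Drop a trailing ".<digits>" version suffix
toy.genes <- sub("\\.[0-9]+$", "", toy.ids)
toy.genes
```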

Now we order communities by size (number of genes), from largest to smallest.

In [None]:
coms <- unique(condor.modularity$blue.memb$com)
com_size <- aggregate(gene ~ com, data=condor.modularity$blue.memb, FUN = length)
head(com_size[order(-com_size$gene),])
com_member_list <- lapply(com_size$com, FUN = function(x){
  return(as.vector(unique(condor.modularity$blue.memb$gene[which(condor.modularity$blue.memb$com==x)])))
})
com_member_list_ordered <- as.list(com_member_list[order(-com_size$gene)])
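
The ordering logic can be verified on a toy membership table. The sketch below is a compact base-R alternative that uses `split` and `lengths`; the toy genes and community labels are made up.

```r
# Toy membership table: six genes in three communities
toy.memb <- data.frame(gene = c("g1", "g2", "g3", "g4", "g5", "g6"),
                       com = c(2, 1, 2, 3, 2, 3),
                       stringsAsFactors = FALSE)

# One character vector of genes per community
toy.members <- split(toy.memb$gene, toy.memb$com)

# Reorder so the largest community comes first
toy.ordered <- toy.members[order(-lengths(toy.members))]
toy.ordered
```

Because the list is reordered, position 1 holds the largest community rather than the community labelled 1, which matters when reading the query numbers in the enrichment plots.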


In [None]:
# determine enrichment of GO biological processes, KEGG pathways and reactome terms in each community
go_enrich <- gost(query = com_member_list_ordered,
                organism = "hsapiens",
                significant = TRUE, sources = c("GO:BP","KEGG","REAC"),
                domain_scope = "annotated", custom_bg = as.vector(unique(condor.modularity$blue.memb$gene)))
gostplot(go_enrich, capped = FALSE, interactive = TRUE)


Let's make a separate plot for communities "query 15" and "query 9" so that we can see the plots better:

In [None]:
go_enrich <- gost(query = com_member_list_ordered[15],
                organism = "hsapiens",
                significant = TRUE, sources = c("GO:BP","KEGG","REAC"),
                domain_scope = "annotated", custom_bg = as.vector(unique(condor.modularity$blue.memb$gene)))
gostplot(go_enrich, capped = FALSE, interactive = TRUE)


In [None]:
go_enrich <- gost(query = com_member_list_ordered[9],
                organism = "hsapiens",
                significant = TRUE, sources = c("GO:BP","KEGG","REAC"),
                domain_scope = "annotated", custom_bg = as.vector(unique(condor.modularity$blue.memb$gene)))
gostplot(go_enrich, capped = FALSE, interactive = TRUE)

These are interactive plots, and the enriched terms can be seen by hovering over the individual points. The first cluster is enriched for functions related to synaptic signalling and nervous system development, whereas the second cluster is enriched for metal ion response functions.
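
Enrichment analyses of this kind are typically based on an over-representation test. As a sketch of the underlying idea, the code below runs a one-sided hypergeometric test for a single term in a single community using made-up numbers. It does not reproduce the multiple-testing correction that gprofiler2 applies, and all variable names are illustrative.

```r
# Made-up counts for one community and one annotation term
background.size <- 1000   # genes in the custom background
term.size <- 50           # background genes annotated with the term
com.size <- 40            # genes in the community
overlap <- 10             # community genes annotated with the term

# Overlap expected by chance
expected.overlap <- com.size * term.size / background.size

# P(overlap >= observed) under the hypergeometric distribution
p.enrich <- phyper(overlap - 1, term.size, background.size - term.size,
                   com.size, lower.tail = FALSE)
expected.overlap
p.enrich
```

Using all genes in the network as the background, as the `custom_bg` argument does above, asks whether a term is over-represented in a community relative to the other eQTL genes rather than relative to the whole genome.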

# References

[1] Fagny M, Paulson JN, Kuijjer ML, Sonawane AR, Chen C.-Y., Lopes-Ramos CM, Glass K, Quackenbush J, Platig J. (2017) Exploring regulation in tissues with eQTL networks. _PNAS_ __114(37)__:E7841-E7850. [https://doi.org/10.1073/pnas.1707375114](https://doi.org/10.1073/pnas.1707375114)

[2] Platig J, Castaldi PJ, DeMeo D, Quackenbush J. (2016) Bipartite community structure of eQTLs. _PLoS Computational Biology_ __12(9)__:e1005033. [https://doi.org/10.1371/journal.pcbi.1005033](https://doi.org/10.1371/journal.pcbi.1005033)