# Investigating potential regulatory relationships between TFs in breast cancer using DRAGON

Author: Deborah Weighill<sup>1</sup>

<sup>1</sup> Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA.


# Overview
In this tutorial, we investigate potential regulatory relationships between transcription factors (TFs) based on their methylation as well as gene expression. For this analysis we use the [DRAGON](https://netzoo.github.io/zooanimals/dragon/) algorithm **(Determining Regulatory Associations using Graphical models on multi-Omic Networks)** to calculate partial correlations between expression and methylation profiles of TFs. DRAGON calibrates its parameters to achieve an optimal trade-off between the network's complexity and estimation accuracy, while explicitly accounting for the characteristics of each of the assessed omics layers. DRAGON is distributed through the Network Zoo package (netZooPy v0.9; [netzoo.github.io](https://netzoo.github.io)) in Python. However, for this tutorial, we will call the Python function from R because we need to use TCGA2BED software to process the data.

# Load libraries
We start by loading the libraries.

In [None]:
library('dplyr')
library('tidyr')
library('RcppCNPy')
library('ggplot2') # for plotting
library('ggthemes')
library('visNetwork') # to visualize the network
library('RColorBrewer')
source("./dragonScripts/call_dragon.R") #script to load Dragon from netZooPy and bind to Python

# Make input matrices
We construct our input matrices for DRAGON. We obtained TCGA breast cancer expression and methylation data which had already been preprocessed as `.bed` files from TCGA2BED<sup>1</sup> (see http://bioinf.iasi.cnr.it/tcga2bed/) and used the TCGA2BED software to create a combined methylation and gene expression matrix. From this combined matrix, we perform the following preprocessing steps:

1. Create a gene expression matrix, removing genes with consistently low expression.
2. Scale the expression matrix.
3. Create a methylation matrix, removing genes with unknown methylation values.
4. Scale the methylation matrix.
5. Down-select both matrices to TFs.

In [None]:
# Load combined matrix
data <- read.table("/opt/data/netZooPy/dragon/combined_RNASeq_Meth_matrix_BRCA.txt", header = TRUE, row.names = 1)
meth <- t(data[,c(1:20049)])
expr <- t(data[,c(20050:40534)])

# Expression: remove lowly expressed genes (values are RSEM-normalized), then scale.
# A gene is dropped if more than 20% of samples are NA or have a value <= 1.
exprLow <- apply(expr , MARGIN = 1, FUN = function(x) length(x[(is.na(x)| x<=1) ]))
cleanExpr <- expr[exprLow <= (0.2*ncol(expr)),]
scaled_expr <- scale(t(cleanExpr))

# Methylation: remove genes with any unknown ("?") value, then scale
methNA <- apply(meth , MARGIN = 1, FUN = function(x) length(which(x == "?")))
cleanMeth <- meth[methNA == 0,]
# Write the cleaned matrix to disk and read it back so the values are parsed as numeric
write.table(cleanMeth, file = "../data/cleanMeth.txt", sep = "\t", quote = FALSE, row.names = TRUE, col.names = TRUE)
cleanMeth2 <- read.table("../data/cleanMeth.txt", header = TRUE, row.names = 1)
scaled_meth <- scale(t(cleanMeth2))

# select TFs from methylation
tfs_meth <- read.table("/opt/data/netZooPy/dragon/meth_tf_ids.txt", header = FALSE)
methylation_subset <- scaled_meth[,which(colnames(scaled_meth) %in% tfs_meth$V1)]

# select TFs from expression
tfs_expr <- read.table("/opt/data/netZooPy/dragon/rnaseq_tf_ids.txt", header = FALSE)
expression_subset <- scaled_expr[,which(colnames(scaled_expr) %in% tfs_expr$V1)]
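
To make the expression filtering rule concrete, here is a minimal sketch on a toy matrix. The values and the gene and sample names are made up for illustration and are not taken from the TCGA data. A gene is kept when at most 20% of its samples are missing or have a value of 1 or below, and the retained genes are then scaled across samples.

```r
# Toy expression matrix: genes in rows, samples in columns (hypothetical values)
toy_expr <- matrix(
  c(5.0, 8.0, 6.0, 7.0, 9.0,    # geneA: expressed in all samples
    0.0, 0.5, 1.0, 0.2, 3.0,    # geneB: low in 4 of 5 samples
    4.0,  NA, 6.0, 5.0, 7.0),   # geneC: one missing value (1 of 5 = 20%)
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("s", 1:5))
)

# Count, per gene, the samples that are missing or have expression <= 1
n_low <- apply(toy_expr, MARGIN = 1, FUN = function(x) sum(is.na(x) | x <= 1))

# Keep genes that are low in at most 20% of samples
toy_clean <- toy_expr[n_low <= 0.2 * ncol(toy_expr), , drop = FALSE]

# Transpose to samples x genes and scale each gene to mean 0, sd 1
toy_scaled <- scale(t(toy_clean))

print(n_low)
print(rownames(toy_clean))
```

Here `geneB` is removed, while `geneC` sits exactly at the 20% boundary and is retained because the comparison is `<=`.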

# Run DRAGON
Now we run DRAGON, which will calculate partial correlations, p-values and adjusted p-values for pairs of TFs based on their expression/methylation profiles.

In [None]:
XA0 <- expression_subset
XB0 <- methylation_subset

# DRAGON reports the partial correlations, p-values and adjusted p-values
# between each pair of TFs (i,j) in the following structure:
#   resAA: partial correlations between the expression profile of TF i
#          and the expression profile of TF j
#   resBB: partial correlations between the methylation profile of TF i
#          and the methylation profile of TF j
#   resAB: partial correlations between the expression profile of TF i
#          and the methylation profile of TF j
res2 <- call.dragon(XA0, XB0)
head(res2$resAA)
head(res2$resAB)
head(res2$resBB)

# parse feature columns to get clean TF ids
exp_meth_2layer <- res2$resAB
exp_meth_2layer <- separate(exp_meth_2layer, feature1, c("tf_rnaseq",NA), "_", remove = TRUE)
exp_meth_2layer <- separate(exp_meth_2layer, feature2, c("tf_dnameth",NA), "_", remove = TRUE)
head(exp_meth_2layer)

meth_meth_2layer <- res2$resBB
meth_meth_2layer <- separate(meth_meth_2layer, feature1, c("tf_dnameth1",NA), "_", remove = TRUE)
meth_meth_2layer <- separate(meth_meth_2layer, feature2, c("tf_dnameth2",NA), "_", remove = TRUE)
meth_meth_2layer$id <- paste0(meth_meth_2layer$tf_dnameth1,meth_meth_2layer$tf_dnameth2)
head(meth_meth_2layer)

exp_exp_2layer <- res2$resAA
exp_exp_2layer <- separate(exp_exp_2layer, feature1, c("tf_rnaseq1",NA), "_", remove = TRUE)
exp_exp_2layer <- separate(exp_exp_2layer, feature2, c("tf_rnaseq2",NA), "_", remove = TRUE)
exp_exp_2layer$id <- paste0(exp_exp_2layer$tf_rnaseq1,exp_exp_2layer$tf_rnaseq2)
head(exp_exp_2layer)
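
For intuition about what a partial correlation measures, the sketch below computes unregularized partial correlations from the inverse covariance matrix of simulated data. This is not DRAGON's estimator: DRAGON additionally calibrates regularization for each omics layer, and `call.dragon` wraps the netZooPy implementation. The simulated variables and their names are illustrative assumptions.

```r
set.seed(1)

# Simulate a chain x -> y -> z, so x and z are related only through y
n <- 2000
x <- rnorm(n)
y <- x + rnorm(n)
z <- y + rnorm(n)
toy <- cbind(x = x, y = y, z = z)

# Precision matrix = inverse of the sample covariance matrix
prec <- solve(cov(toy))

# Partial correlation: rho_ij = -P_ij / sqrt(P_ii * P_jj)
pcor <- -prec / sqrt(outer(diag(prec), diag(prec)))
diag(pcor) <- 1

print(round(cor(toy), 2))   # marginal correlation of x and z is clearly non-zero
print(round(pcor, 2))       # partial correlation of x and z given y is close to zero
```

The marginal correlation between `x` and `z` is substantial, but it largely vanishes once `y` is conditioned on. This is the same kind of effect the tutorial looks for when a methylation layer is added to an expression-only model.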

# Run standard GGM
Here we run a standard Gaussian graphical model (GGM) to calculate partial correlations within a single layer of expression data (i.e. no methylation information included). This will allow us to investigate whether there are TF-TF relationships that are better explained by including a methylation layer.

In [None]:
XA0 <- expression_subset

res1 <- call.GGM(XA0)
res1$res
saveRDS(res1,"../data/dragon_TFs_results_1layer_03102021.Rds")
exp_exp_1layer <- res1$res
exp_exp_1layer <- separate(exp_exp_1layer, feature1, c("tf_rnaseq1",NA), "_", remove = TRUE)
exp_exp_1layer <- separate(exp_exp_1layer, feature2, c("tf_rnaseq2",NA), "_", remove = TRUE)
exp_exp_1layer$id <- paste0(exp_exp_1layer$tf_rnaseq1,exp_exp_1layer$tf_rnaseq2)

Now we select TF-TF relationships that are significant at an adjusted p-value threshold of 0.05 and investigate whether any relationships that are significant in the GGM are altered in the 2-layer DRAGON results.

In [None]:
# apply adjusted p-value threshold
exp_exp_1layer_0.05 <- exp_exp_1layer[which(exp_exp_1layer$adj_p_vals <= 0.05),]
exp_exp_2layer_0.05 <- exp_exp_2layer[which(exp_exp_2layer$adj_p_vals <= 0.05),]
meth_meth_2layer_0.05 <- meth_meth_2layer[which(meth_meth_2layer$adj_p_vals <= 0.05),]

# Identify relationships (network edges) significant in the expression GGM, non-significant in the DRAGON expression network and significant in the DRAGON methylation network.
exp_exp_1layer_0.05[which(!(exp_exp_1layer_0.05$id %in% exp_exp_2layer_0.05$id) & (exp_exp_1layer_0.05$id %in% meth_meth_2layer_0.05$id)),]

# print the p-values
exp_exp_1layer[which((exp_exp_1layer$tf_rnaseq1 == "ELF4") &(exp_exp_1layer$tf_rnaseq2 == "ZBTB33")),]
exp_exp_2layer[which((exp_exp_2layer$tf_rnaseq1 == "ELF4") &(exp_exp_2layer$tf_rnaseq2 == "ZBTB33")),]
meth_meth_2layer_0.05[which((meth_meth_2layer_0.05$tf_dnameth1 == "ELF4") &(meth_meth_2layer_0.05$tf_dnameth2 == "ZBTB33")),]

These results show that the TFs ELF4 and ZBTB33, which appear to be correlated in terms of expression in the GGM, are not significantly correlated in the DRAGON expression network but are significantly correlated in the DRAGON methylation network. This suggests that the relationship is driven by co-methylation rather than co-expression.
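
The edge comparison above matches pairs through the `id` column built with `paste0`, which relies on the two TFs appearing in the same order in every result table and uses no separator. As a hedged sketch, assuming that ordering is not guaranteed, an order-independent key can be built by sorting each pair before pasting. The helpers `edge_key` and `sig_keys` and the toy tables below are hypothetical and only illustrate the set logic.

```r
# Hypothetical helper: build an order-independent key for a TF pair
edge_key <- function(a, b) {
  paste(pmin(a, b), pmax(a, b), sep = "|")
}

# Toy edge tables with made-up adjusted p-values
ggm_expr <- data.frame(tf1 = c("TFA", "TFB", "TFC"),
                       tf2 = c("TFB", "TFC", "TFD"),
                       adj_p_vals = c(0.01, 0.03, 0.20),
                       stringsAsFactors = FALSE)
dragon_expr <- data.frame(tf1 = c("TFB", "TFC", "TFC"),
                          tf2 = c("TFA", "TFB", "TFD"),
                          adj_p_vals = c(0.40, 0.02, 0.30),
                          stringsAsFactors = FALSE)
dragon_meth <- data.frame(tf1 = c("TFA", "TFB", "TFC"),
                          tf2 = c("TFB", "TFC", "TFD"),
                          adj_p_vals = c(0.001, 0.50, 0.04),
                          stringsAsFactors = FALSE)

# Hypothetical helper: keys of edges passing the adjusted p-value threshold
sig_keys <- function(df, alpha = 0.05) {
  keep <- df[df$adj_p_vals <= alpha, ]
  edge_key(keep$tf1, keep$tf2)
}

ggm_sig  <- sig_keys(ggm_expr)
expr_sig <- sig_keys(dragon_expr)
meth_sig <- sig_keys(dragon_meth)

# Significant in the GGM, not in DRAGON expression, but in DRAGON methylation
candidates <- ggm_sig[!(ggm_sig %in% expr_sig) & (ggm_sig %in% meth_sig)]
print(candidates)
```

In this toy example only the TFA/TFB pair satisfies all three conditions, even though it is listed in reverse order in one of the tables.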

# Investigate proximity of co-methylated genes
Now, we investigate the physical proximity of TF gene regions in the genome and how this relates to co-methylation.

In [None]:
# read in gene annotation file
genes <- read.table("/opt/data/netZooPy/dragon/geneID_name_map.txt", header = FALSE)
colnames(genes) <- c("ensID","name","chrom","start","stop")
# calculate gene midpoint (mean of the start and stop coordinates)
genes$mid <- (genes$start + genes$stop)/2

# identify the chromosome on which the TF's gene resides
meth_meth_2layer$chrom1 <- genes$chrom[match(meth_meth_2layer$tf_dnameth1, genes$name)]
meth_meth_2layer$chrom2 <- genes$chrom[match(meth_meth_2layer$tf_dnameth2, genes$name)]
# if the TFs' respective genes reside on the same chromosome, calculate the distance between the midpoints of the genes.
meth_meth_2layer$distance <- ifelse(meth_meth_2layer$chrom1 == meth_meth_2layer$chrom2, abs((genes$mid[match(meth_meth_2layer$tf_dnameth1, genes$name)]) - (genes$mid[match(meth_meth_2layer$tf_dnameth2, genes$name)])), NA)
# mark each pair of TFs as significantly co-methylated or not, based on DRAGON adjusted p-values.
meth_meth_2layer$signif <- ifelse(meth_meth_2layer$adj_p_vals <= 0.05, "P <= 0.05","P > 0.05")

meth_meth_2layer_0.05$chrom1 <- genes$chrom[match(meth_meth_2layer_0.05$tf_dnameth1, genes$name)]
meth_meth_2layer_0.05$chrom2 <- genes$chrom[match(meth_meth_2layer_0.05$tf_dnameth2, genes$name)]
meth_meth_2layer_0.05$distance <- ifelse(meth_meth_2layer_0.05$chrom1 == meth_meth_2layer_0.05$chrom2, abs((genes$mid[match(meth_meth_2layer_0.05$tf_dnameth1, genes$name)]) - (genes$mid[match(meth_meth_2layer_0.05$tf_dnameth2, genes$name)])), NA)


# plot distance vs significance as a violin plot (for pairs of TFs on the same chromosome)
p <- ggplot(meth_meth_2layer, aes(x=signif, y=distance, fill=signif)) + geom_violin(trim = FALSE)  + theme_bw() + scale_fill_manual(values=c("red", "blue","orange","cyan", "darkgreen", "violet")) + labs(x="P-value", y = "Distance between genes (bp)", title = "Gene proximity distributions") + theme(legend.position = "none")
p

# count the number of significant associations
num_signif <- length(meth_meth_2layer$distance[which(meth_meth_2layer$signif == "P <= 0.05")])
# count the number of non-significant associations
num_non_signif <- length(meth_meth_2layer$distance[which(meth_meth_2layer$signif == "P > 0.05")])
# count the number of significant associations on different chromosomes
num_signif_interChrom <- length(which(is.na(meth_meth_2layer$distance[which(meth_meth_2layer$signif == "P <= 0.05")])))
# count the number of non-significant associations on different chromosomes
num_Nonsignif_interChrom <- length(which(is.na(meth_meth_2layer$distance[which(meth_meth_2layer$signif == "P > 0.05")])))

num_signif
num_non_signif
num_signif_interChrom
num_Nonsignif_interChrom

Note that almost 95 percent of non-significant edges connect TFs on different chromosomes.
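
The proximity calculation can be sketched on a toy annotation table. The coordinates and gene names below are made up. The midpoint is taken as the mean of the start and stop coordinates, and pairs on different chromosomes get a distance of `NA`, mirroring the convention used above.

```r
# Toy annotation (hypothetical coordinates)
toy_genes <- data.frame(
  name  = c("TFA", "TFB", "TFC"),
  chrom = c("chr1", "chr1", "chr2"),
  start = c(1000, 5000, 2000),
  stop  = c(3000, 9000, 4000),
  stringsAsFactors = FALSE
)
toy_genes$mid <- (toy_genes$start + toy_genes$stop) / 2

# Toy TF pairs with made-up adjusted p-values
toy_pairs <- data.frame(
  tf1 = c("TFA", "TFA", "TFB"),
  tf2 = c("TFB", "TFC", "TFC"),
  adj_p_vals = c(0.01, 0.30, 0.60),
  stringsAsFactors = FALSE
)

i1 <- match(toy_pairs$tf1, toy_genes$name)
i2 <- match(toy_pairs$tf2, toy_genes$name)

# Distance between midpoints for same-chromosome pairs, NA otherwise
same_chrom <- toy_genes$chrom[i1] == toy_genes$chrom[i2]
toy_pairs$distance <- ifelse(same_chrom,
                             abs(toy_genes$mid[i1] - toy_genes$mid[i2]),
                             NA)
toy_pairs$signif <- ifelse(toy_pairs$adj_p_vals <= 0.05, "P <= 0.05", "P > 0.05")

# Fraction of inter-chromosomal pairs within each significance group
frac_inter <- tapply(is.na(toy_pairs$distance), toy_pairs$signif, mean)

print(toy_pairs)
print(frac_inter)
```

Applying the same `tapply` call to `meth_meth_2layer` would give the inter-chromosomal fractions directly, instead of deriving them by hand from the four counts.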

# Visualize network
Now we plot the significant DRAGON relationships with expression-expression relationships colored orange, methylation-methylation edges colored green, and expression-methylation relationships colored purple.

In [None]:
mm_edges <- meth_meth_2layer_0.05[,c(4,5)]
colnames(mm_edges) <- c("from","to")
mm_edges$color <- "green"

ee_edges <- exp_exp_2layer_0.05[,c(4,5)]
colnames(ee_edges) <- c("from","to")
ee_edges$color <- "orange"

meth_exp_2layer_0.05 <- exp_meth_2layer[which(exp_meth_2layer$adj_p_vals <= 0.05),]
me_edges <- meth_exp_2layer_0.05[,c(4,5)]
colnames(me_edges) <- c("from","to")
me_edges$color <- "purple"

edges <- rbind(mm_edges, ee_edges, me_edges)
nodes <- data.frame(id = unique(as.vector(as.matrix(edges[,c(1,2)]))))
nodes$label <- nodes$id
#nodes$color <- "darkolivegreen2"

net <- visNetwork(nodes, edges, width = "100%")
net
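
As a small sketch of how the node table follows from the edge list, the example below stacks toy edge tables, derives the unique node ids, and counts each node's degree. The edges and TF names are made up, and the `degree` column is an illustrative addition that is not part of the original analysis. It could be used, for example, to rank TFs by connectivity.

```r
# Toy edge tables (hypothetical TF names), one per relationship type
mm <- data.frame(from = c("TFA", "TFB"), to = c("TFB", "TFC"),
                 stringsAsFactors = FALSE)
ee <- data.frame(from = "TFA", to = "TFC", stringsAsFactors = FALSE)
me <- data.frame(from = "TFC", to = "TFD", stringsAsFactors = FALSE)
mm$color <- "green"
ee$color <- "orange"
me$color <- "purple"

toy_edges <- rbind(mm, ee, me)

# Unique node ids taken from both endpoint columns
toy_nodes <- data.frame(id = unique(c(toy_edges$from, toy_edges$to)),
                        stringsAsFactors = FALSE)
toy_nodes$label <- toy_nodes$id

# Degree = number of edges touching each node
deg <- table(c(toy_edges$from, toy_edges$to))
toy_nodes$degree <- as.integer(deg[toy_nodes$id])

print(toy_nodes)
```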

# References
1. Cumbo, Fabio, et al. "TCGA2BED: extracting, extending, integrating, and querying The Cancer Genome Atlas." BMC Bioinformatics 18.1 (2017): 1-9.