# Finding drug candidates to reverse Lung Adenocarcinoma (LUAD)-induced gene regulation disruption

Author: Marouen Ben Guebila<sup>1</sup>

<sup>1</sup> Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA.

## 1. Introduction

Lung adenocarcinoma (LUAD) is an aggressive form of cancer that represents 12.4% of all new cancer cases. Every year, as many Americans die of LUAD as of prostate, colon, and breast cancer combined<sup>1</sup>.
![](https://www.drugs.com/health-guide/images/355434.jpg)

Outline of the study:

- Download gene expression data from TCGA corresponding to LUAD patients
- Process gene expression data
- Build a coexpression network for LUAD and estimate Transcription Factor (TF) drivers of transition from health to disease
- Build a gene regulatory network for LUAD
- Download a normal lung gene regulatory network from GRAND
- Build LUAD differential network
- Analysis of the disrupted gene and TF edges
- Drug repurposing through reversal
- Estimating the drivers of transition from healthy to LUAD network

### 1.1. Loading the libraries

In [None]:
library('recount') # to process gene expression
library('limma') # to average replicate
library('netZooR') # for network analysis
library('visNetwork') # for network visualization
library('EnsDb.Hsapiens.v79') # to convert gene symbol
library('gplots') # to visualize heatmaps

## 2. Finding gene expression data from LUAD patients

TCGA<sup>2</sup> is an NIH-funded project that collected samples from hundreds of patients with various types of cancer and measured their gene expression. A quick check on their [website](https://portal.gdc.cancer.gov/projects/TCGA-LUAD) tells us that there are indeed LUAD patients enrolled in the project.

However, since we are doing a comparative analysis, we want to minimize all sources of variability inherent to batch effects, so instead of downloading the data directly from the TCGA portal, we will use data from recount2<sup>3</sup>, a project that uniformly processed all of the gene expression studies available.

We can click on the [TCGA tab](https://jhubiostatistics.shinyapps.io/recount/) and select lung or we can download the samples programmatically.

In [None]:
#system('curl -O http://idies.jhu.edu/recount/data/v2/TCGA/rse_gene_lung.Rdata')
load("/opt/data/rse_gene_lung.Rdata")

## 3. Processing gene expression data
First, we need to clean the data to keep only annotated genes, because pseudogenes and other regions that are not relevant here are sometimes sequenced as well. In practice, we keep the entries that have a gene symbol.

In [None]:
rowDataRse=rowData(rse_gene)
geneIds       = c()
geneSymbol    = c()
geneEntrezIds = c()
# First remove pseudogenes
usingEntrezGeneIDs=1
if(usingEntrezGeneIDs==1){
  for(i in 1:dim(rowDataRse)[1]){
    if(!is.na(rowDataRse$symbol[i][[1]][1])){
      geneIds       = c(geneIds,i)
      geneSymbol    = c(geneSymbol,rowDataRse$symbol[i][[1]][1])
      geneEntrezIds = c(geneEntrezIds, substr(rowDataRse$gene_id[i],1,15))
    }
  }
}

Then let's have a look at the samples in our data

In [None]:
colDataRse=colData(rse_gene)
print(unique(colDataRse@listData$gdc_cases.project.name))

We see that there are two lung cancer types: LUAD and LUSC, which stands for Lung Squamous Cell Carcinoma, so we need to keep only the LUAD samples.

In [None]:
print(unique(colDataRse@listData$gdc_cases.samples.sample_type))

In addition, we see that not all samples were taken from the tumor: there are also normal samples and tumor-adjacent normal samples that are used as controls in gene expression analyses. Since we already have a "normal" lung network in our database, we can discard those.

In [None]:
tumorIds=c()
patientSymbol=c()
for(i in 1:length(colDataRse@listData$gdc_cases.samples.sample_type)){
  if(colDataRse@listData$gdc_cases.samples.sample_type[i] %in% c("Primary Tumor","Recurrent Tumor") && colDataRse@listData$gdc_cases.project.name[i] == "Lung Adenocarcinoma"){
    tumorIds = c(tumorIds,i)     
    patientSymbol = c(patientSymbol, colDataRse@listData$gdc_cases.samples.portions.analytes.aliquots.submitter_id[i])
  }
}
#check that samples are LUAD and not normal
colDataRse@listData$gdc_cases.project.name[tumorIds]
colDataRse@listData$gdc_cases.samples.sample_type[tumorIds]

Finally, we need to make sure that there are no duplicates in our data. Gene names can be duplicated if the isoform transcripts of the same gene were measured separately; in that case, we simply average their values. Note that this is a simplification: isoforms are not true duplicates and can contain essential information about gene activity.
![](https://genestack.com/blog/wp-content/uploads/2015/04/isoforms.png)

In [None]:
#average isoforms
countMat=assays(rse_gene)$counts[geneIds,tumorIds]
colnames(countMat)=patientSymbol
rownames(countMat)=geneEntrezIds
countMat=avearrays(t(countMat))
countMat=as.data.frame(t(countMat))
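To make the averaging step concrete, here is a minimal base R sketch on a toy matrix. It does not call `limma`; the matrix `toyCounts` and its gene IDs are made up for illustration, and the sketch only mimics the idea of averaging rows that share the same ID.

```r
# Toy count matrix (hypothetical values): rows are transcripts labeled by gene ID,
# columns are samples. "ENSG_A" appears twice, as two isoforms would.
toyCounts <- matrix(c(10, 20,
                      30, 40,
                       5, 15),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("ENSG_A", "ENSG_A", "ENSG_B"),
                                    c("sample1", "sample2")))

# Sum the rows that share an ID, then divide by the number of rows per ID
geneSums    <- rowsum(toyCounts, group = rownames(toyCounts))
nPerGene    <- as.vector(table(rownames(toyCounts))[rownames(geneSums)])
toyAveraged <- geneSums / nPerGene
print(toyAveraged)
```

After this step each gene ID appears exactly once, which is what the downstream network tools expect.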

Then, we need to standardize the count of the gene expression reads. Here, we will simply scale the counts by the total coverage for each sample.

In [None]:
# 1.2 scale counts
rse="SummarizedExperiment"(countMat, rowData=rownames(countMat), colData=colnames(countMat), metadata=list())
rse=as(rse, "RangedSummarizedExperiment")
colData(rse)$auc = colData(rse_gene)$auc[tumorIds]
rse <- scale_counts(rse)
countMat=SummarizedExperiment::assay(rse, 1)
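The idea behind coverage scaling can be sketched on a toy matrix. This is a conceptual illustration only: it uses the column totals as a stand-in for coverage and a hypothetical `targetSize`, and it is not meant to reproduce the exact formula used by `scale_counts`.

```r
# Toy counts (hypothetical): sample1 was sequenced twice as deeply as sample2
toyCounts <- matrix(c(100, 300, 50, 150), nrow = 2,
                    dimnames = list(c("geneA", "geneB"),
                                    c("sample1", "sample2")))

targetSize <- 1000                 # hypothetical common library size
libSize    <- colSums(toyCounts)   # stand-in for the coverage of each sample

# Divide each column by its total, then rescale to the common target size
toyScaled <- sweep(toyCounts, 2, libSize, "/") * targetSize
print(toyScaled)
```

Once scaled, the two samples become directly comparable even though their sequencing depths differed.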

Finally, we remove the genes that have essentially no counts, keeping only the genes whose total count across all samples is greater than 1.

In [None]:
countMat <- countMat[ rowSums(countMat) > 1, ]

## 4. Building an LUAD coexpression regulatory network
Now, using the processed gene expression data, we will estimate a bipartite gene regulatory network that links Transcription Factors (TFs) to their target genes. There are several tools in the [Network Zoo](https://netzoo.github.io) packages that can infer gene regulatory networks. For this study, we will start by using [MONSTER](https://bmcsystbiol.biomedcentral.com/articles/10.1186/s12918-017-0517-y).

The simplest way of building a TF-gene regulatory network is to compute the correlation between the expression of TFs and their target genes. The edges in the regulatory network are then the Pearson correlation coefficients between each TF and each target gene. The `R` function that computes the correlation between two vectors `x1` and `x2` is [cor](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/cor). When given a matrix, `cor` correlates its columns, so we can call `cor` on the transpose of `countMat` (whose genes are in rows) to obtain a gene-to-gene regulatory network that we call `coexReg`. Then, we can filter the rows to keep only the genes that encode TFs, which results in a TF-by-gene regulatory network. However, the expression of TFs does not necessarily reflect their activity, and correlation does not imply causation. Therefore, we need to add more information into our network.
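As a minimal sketch of this idea, the following toy example builds a coexpression network from random data and keeps only the TF rows. The matrix `toyExpr`, the gene names, and the list `tfNames` are made up for illustration.

```r
set.seed(1)
# Toy expression matrix (hypothetical): 4 genes in rows, 10 samples in columns
toyExpr <- matrix(rnorm(4 * 10), nrow = 4,
                  dimnames = list(c("TF1", "TF2", "geneA", "geneB"),
                                  paste0("s", 1:10)))

# cor() correlates columns, so transpose to correlate genes with each other
coexRegToy <- cor(t(toyExpr), method = "pearson")   # gene-by-gene matrix

# Keep only the rows that correspond to TF-encoding genes
tfNames   <- c("TF1", "TF2")
tfGeneToy <- coexRegToy[tfNames, ]                  # TF-by-gene matrix
print(round(tfGeneToy, 2))
```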

TFs regulate their target genes by binding to their promoter region. Therefore, we need to restrict the TF-gene coexpression network to the TFs that are known to bind to their target genes. Evidence of binding can be obtained by [ChIP-seq](https://en.wikipedia.org/wiki/ChIP_sequencing#:~:text=ChIP%2Dsequencing%2C%20also%20known%20as,sites%20of%20DNA%2Dassociated%20proteins.), or by scanning each TF's DNA binding motif along the promoter region of all genes to find a matching sequence. This way we obtain another TF-by-gene binary regulatory network that we call $bindReg$, where $1$ means that the TF binds to its target gene and $0$ otherwise.

Intersecting $coexReg$ and $bindReg$ gives us a coexpression regulatory network between TFs and their target genes that keeps an edge only if there is evidence of the TF binding to its target gene. We call this network $coexBindReg$.
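One simple way to picture this intersection is an element-wise product of the two matrices. The sketch below uses made-up toy matrices and only illustrates the concept; it is not a description of how MONSTER implements this step internally.

```r
tfs   <- c("TF1", "TF2")
genes <- c("geneA", "geneB")

# Toy TF-by-gene coexpression values (hypothetical)
coexToy <- matrix(c(0.8, -0.3, 0.1, 0.6), nrow = 2,
                  dimnames = list(tfs, genes))
# Toy binary binding evidence: 1 = motif hit in the promoter, 0 = no evidence
bindToy <- matrix(c(1, 0, 0, 1), nrow = 2,
                  dimnames = list(tfs, genes))

# Edges without binding evidence are set to zero
coexBindToy <- coexToy * bindToy
print(coexBindToy)
```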

### 4.1. Using MONSTER to build a bipartite coexpression network

To build such a network, we can use a netZooR tool called MONSTER. MONSTER is a versatile function that can 1) build gene regulatory networks for two conditions such as a case and a control, 2) estimate the regulatory transition between case and control, and 3) identify the TFs that drive the transition.

The first step of MONSTER is the reconstruction of the gene regulatory network. For this step, the tool merges two types of regulatory networks. The first one is called a direct evidence network, which is exactly the $coexBindReg$ that was described earlier. The second one is an indirect evidence network that is built by scoring the TF-gene edges using the TF-gene coexpression as features in a classifier of TF-gene binding events. We can refer to this network as $indirEvReg$. Then both direct and indirect evidence networks are merged using a weight: $alphaw*indirEvReg+(1-alphaw)*coexBindReg$. Therefore, simply invoking MONSTER with the parameter `alphaw=0` will give us $coexBindReg$.
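The weighted merge can be written as a one-line function. The helper `mergeEvidence` and the toy matrices below are hypothetical and only restate the formula above; they are not part of netZooR.

```r
# Hypothetical helper restating alphaw*indirEvReg + (1-alphaw)*coexBindReg
mergeEvidence <- function(indirEvReg, coexBindReg, alphaw) {
  alphaw * indirEvReg + (1 - alphaw) * coexBindReg
}

# Toy TF-by-gene matrices (made-up values)
directToy   <- matrix(c(0.8, 0, 0, 0.6), nrow = 2)
indirectToy <- matrix(c(0.2, 0.4, 0.1, 0.9), nrow = 2)

# With alphaw = 0 only the direct evidence network remains
onlyDirect <- mergeEvidence(indirectToy, directToy, alphaw = 0)
# With alphaw = 0.5 both networks contribute equally
halfHalf   <- mergeEvidence(indirectToy, directToy, alphaw = 0.5)
print(halfHalf)
```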

First, we need to supply MONSTER with $bindReg$, which we computed for a [previous study](https://www.sciencedirect.com/science/article/pii/S2211124720307762). This motif data set was built by scanning TF DNA binding motifs against the sequence of the promoter region of human genes. Before that, we need to set the seed to get reproducible analyses. MONSTER assesses statistical significance by resampling; therefore, we need to set the seed, particularly if we choose a low number of permutations.

In [None]:
set.seed(1619)
#system('curl -O https://granddb.s3.amazonaws.com/tissues/motif/tissues_motif.txt')
motif    <- read.delim("/opt/data/tissues_motif.txt", stringsAsFactors=F, header=F)

Finally, let's convert the TF names from gene symbols to ENSG gene IDs, just like their target genes, otherwise we won't be able to intersect $coexReg$ and $bindReg$ in the first step of the reconstruction process.

In [None]:
# Harmonizing all genes in expression and motif to Ensembl
geneIDs2 <- ensembldb::select(EnsDb.Hsapiens.v79, keys= motif[,1], keytype = "SYMBOL", columns = c("SYMBOL","GENEID"))
geneIDsConv = match(motif[,1],geneIDs2[,1])
motif[,1]   = geneIDs2[geneIDsConv,2]

Then, we can simply call MONSTER. The monsterNI function does the reconstruction step.

In [None]:
LUADcoexBindReg <- monster.monsterNI(motif, countMat, 
                                     method="BERE", regularization="none",
                                     score="none", ni.coefficient.cutoff=NA,
                                     verbose=TRUE, randomize = "none", cpp=FALSE,
                                     alphaw=0)

### 4.2. Using MONSTER to estimate the healthy to disease transition in LUAD
We are interested in the LUAD coexpression network; however, it would be valuable to compare it to the normal lung coexpression network. In particular, we would like to identify the TFs that drive the transition from the healthy to the disease state as modeled by the coexpression networks.

MONSTER estimates the transition drivers by building a case and a control network using the `monster.monsterNI` function, then estimating the transition matrix between the control and case networks. A wrapper function called `monster` does all the steps in one call.
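To give an intuition of what a transition matrix is, the sketch below assumes that the control network `A` and the case network `B` are gene-by-TF matrices and looks for a TF-by-TF matrix `T` such that `B` is approximately `A %*% T`, using ordinary least squares. All names and values are made up, and the actual estimation in `monster` may differ in its details (for example in normalization or in the regression used).

```r
set.seed(1)
nGenesToy <- 50
nTFsToy   <- 3

# Toy control network: genes in rows, TFs in columns (hypothetical values)
A <- matrix(rnorm(nGenesToy * nTFsToy), nGenesToy, nTFsToy)

# Toy "true" transition: TF1 transfers part of its targeting pattern to TF2
Ttrue <- diag(nTFsToy)
Ttrue[1, 2] <- 0.5

# Toy case network generated from the control network
B <- A %*% Ttrue

# Least-squares estimate of the transition matrix (solves A %*% T = B)
That <- qr.solve(A, B)
print(round(That, 3))
```

Off-diagonal entries of the estimated matrix indicate transfers of targeting between TFs, while an identity matrix would mean no change between the two conditions.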

First, we need to download the expression of the normal lung. The [GTEx project](https://gtexportal.org/home/) collected gene expression data across normal, undiseased human tissues. We processed the data and stored it in the [GRAND database](https://grand.networkmedicine.org).

In [None]:
#system('curl -O https://granddb.s3.amazonaws.com/tissues/expression/Lung.csv')
lungExpr=read.table('/opt/data/Lung.csv', header=T,sep=',',row.names = 1)

Then, we need to harmonize the gene expression from the normal lung and the LUAD lung by taking the intersection of the genes covered and merging them into one matrix called `exp`. We also need to supply MONSTER with a `design` vector that tells it which columns of `exp` are normal lung and which ones are LUAD lung.

In [None]:
interGenesExp= intersect(rownames(countMat),rownames(lungExpr))
indExpNormal = match(interGenesExp,rownames(lungExpr))
indExpLUAD   = match(interGenesExp,rownames(countMat))
exp          = cbind(lungExpr[indExpNormal,],countMat[indExpLUAD,])
design <- c(rep(1,dim(lungExpr)[2]),rep(0,dim(countMat)[2]))

Finally, let's call MONSTER with `alphaw=0` to estimate the transition between the normal and LUAD lung coexpression networks. We will do 1000 random permutations of our experiments to be able to compare to a null model and assess the significance of our results. Since running MONSTER with 1000 permutations can take some time, we can load the precomputed results directly.

In [None]:
#monsterRes <- monster(exp, design, motif, nullPerms=1000, numMaxCores=12, alphaw=0)
load('/opt/data/monsterCoexpLUAD.RData')

Since the analysis is TF centric, I find the gene symbols more readable than ENSG gene IDs so let's convert TF genes to gene symbols.

In [None]:
geneIDsConvRev = match(rownames(monsterRes@nullTM[[1]]), geneIDs2[,2])
rownames(monsterRes@nullTM[[1]]) = geneIDs2[geneIDsConvRev,1]
rownames(monsterRes@tm) = geneIDs2[geneIDsConvRev,1]
colnames(monsterRes@tm) = geneIDs2[geneIDsConvRev,1]

First, let's find the TFs that have the most significant involvement in the health to disease transition. If `rescale='magnitude'`, the TFs will be ordered by the observed values of their involvement in the transition matrix, which produces deterministic results between two MONSTER runs. Setting `rescale='significance'` will order the TFs by the statistical significance of their involvement and will standardize the observed values by the null distribution. If the number of permutations is not large enough to sample the null, the scaled results might change between runs, which is why it is important to set the seed at the beginning of the analysis.
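The effect of standardizing by a null distribution can be illustrated with a toy z-score. The observed scores and the null values below are simulated, and whether MONSTER uses exactly this standardization is an assumption of this sketch.

```r
set.seed(1)
# Hypothetical observed involvement scores for two TFs
observed <- c(TF1 = 5.0, TF2 = 1.2)

# Simulated null scores: one row per TF, one column per permutation
nPermToy   <- 100
nullScores <- matrix(rnorm(2 * nPermToy, mean = 1, sd = 0.5), nrow = 2,
                     dimnames = list(names(observed), NULL))

# Standardize each observed score by the mean and sd of its null distribution
zScores <- (observed - rowMeans(nullScores)) / apply(nullScores, 1, sd)
print(round(zScores, 2))

# Order the TFs by decreasing standardized score
rankedTFs <- names(sort(zScores, decreasing = TRUE))
```

With few permutations, the null mean and standard deviation are noisy, so the ordering can change from one run to another unless the seed is fixed.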

In [None]:
monster.plot.monsterAnalysis(monsterRes, rescale='significance', nTFs=20)

We see that TFs like `TBX19`, `PGR`, `FOXA2`, `MEF2C`, and `TFCP2L1` were estimated to be the top 5 drivers of LUAD transition.

`TBX19` was suggested to play a role in cell development and in [cancer](https://www.sciencedirect.com/science/article/abs/pii/S0070215316301739) in particular.

`FOXA2` regulates respiratory epithelial cells and is [involved in hyperplasia](https://pubmed.ncbi.nlm.nih.gov/14757645/).

`MEF2C` was suggested to be a missing link between [inflammation and lung cancer](https://translational-medicine.biomedcentral.com/articles/10.1186/s12967-017-1168-x), with a potential involvement in COPD.

`TFCP2L1` is a [transcriptional suppressor](https://elifesciences.org/articles/24265) that controls cell renewal, embryonic development, and blood pressure. Epithelial cells are known to be rapidly dividing; therefore, `TFCP2L1` can support the transition to a cancer state by acting on the rate of division.

Finally, the disruption of the Progesterone Receptor (`PGR`) can suggest a sex-dimorphic regulation in LUAD, which reflects the epidemiology of LUAD. Although LUAD epidemiology follows the demographics of tobacco use, genetic factors have been suggested to play a role in [the sex-dimorphic distribution of LUAD](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6037963/).

Now, let's have a look at the TF to TF transition matrix from health to disease.

In [None]:
heatmap.2(slot(monsterRes, 'tm'), col = "bluered",
          density.info="none",  
          trace="none", 
          dendrogram='none',     
          Rowv=FALSE,
          Colv=FALSE)

A Principal Component Analysis (PCA) on the transition matrix can show us the groups of TFs that exhibit similar patterns of transition to LUAD.

In [None]:
monster.transitionPCAPlot(monsterRes)

In this case, it seems that most TFs are inactive in the transition from health to disease. A few TFs stand out and it seems that they act on different aspects of the transition, rather than as a cohesive group. We also note the presence of the TFs we identified earlier in the differential involvement analysis.

The transition matrix tells us which TFs gained or lost their regulatory potential in LUAD as compared to normal. The differential involvement plot is a summary of the transition matrix that computes a differential involvement score for each TF. However, the differential involvement score will only show us which TFs consistently lost or gained activity with a large number of other TFs. In other words, there could be some TFs that have a low differential involvement score overall but exhibit a few transfers with other key regulators.

We can plot the 20 most significant transitions from normal to cancer using `monster.transitionNetworkPlot`. Here as well, when `rescale` is set to `'significance'`, the TFs are ordered by their p-values and only the top `numTopTFs` (10 by default) are kept. Then, among those TFs, the top `numEdges` edges are selected by their values in the transition matrix. Here again, when the null distribution is not sampled enough, there could be discrepancies in the significance between two runs, which can influence the selection of the top TFs, so it is important to set the seed. Setting `rescale` to `'none'` will simply take the top `numEdges` edges in the transition matrix and will produce deterministic results.

In [None]:
monster.transitionNetworkPlot(monsterRes, numEdges=20, rescale='significance')

The edges in the directed transition network should be read as "TF A positively or negatively contributes to the targeting of TF B in LUAD". We see that the transition is mainly driven by `TFAP2B`, which contributes to the loss of activity of other TFs such as `TBX19`. We saw previously that an increase of `TBX19` activity has been suggested in cancer. We can also see that `SPIC` contributes positively to the activity of `KMT2A` in LUAD; `KMT2A` activity was demonstrated in oncogenic processes in general and in [Acute Myeloid Leukemia](https://www.sciencedirect.com/science/article/pii/B978012809843100019X) (AML) in particular. Therefore, this `SPIC`-`KMT2A` edge could be a target to reverse LUAD.
This plot is essential to refine our analysis because it tells us about the directionality of the effects and their magnitude. We also see that not all the TFs that we previously found to be differentially involved overall drive the strongest transitions.

## 5. Building an LUAD gene regulatory network

Coexpression does not capture the full spectrum of gene regulation because many TFs are post-transcriptionally regulated; in other words, their transcript levels do not necessarily reflect their activity. [PANDA](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0064832) can help uncover TF-gene edges beyond coexpression by integrating several sources of data.


<img src="https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/PANDA-01.png" alt="drawing" width="700"/>

Briefly, PANDA applies guilt-by-association principles to gene regulation. If TF A regulates gene 1, and if gene 2 is coexpressed with gene 1, then TF A is likely to regulate gene 2. The quantification of the association is made using a continuous Tanimoto similarity.
In a nutshell, PANDA infers a TF-gene regulatory network by averaging the distance between three sources of prior information: TF Protein-Protein Interaction (PPI) network as a TF-by-TF matrix, gene coexpression as gene-by-gene matrix, and TF motif binding site estimation as a TF-by-gene matrix. 
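For intuition, the standard continuous Tanimoto similarity between two numeric vectors can be sketched as follows. The function name `tanimotoSim` and the toy vectors are made up; PANDA's own implementation may use a modified form of this similarity, so this is only an illustration of the general measure.

```r
# Standard continuous Tanimoto similarity: x.y / (|x|^2 + |y|^2 - x.y)
tanimotoSim <- function(x, y) {
  xy <- sum(x * y)
  xy / (sum(x^2) + sum(y^2) - xy)
}

# Toy edge-weight profiles (hypothetical)
profileA <- c(1, 2, 3)
profileB <- c(1, 2, 3)    # identical to profileA
profileC <- c(3, 0, -1)   # orthogonal to profileA

simIdentical  <- tanimotoSim(profileA, profileB)
simOrthogonal <- tanimotoSim(profileA, profileC)
print(c(identical = simIdentical, orthogonal = simOrthogonal))
```

Identical profiles give a similarity of 1, and profiles with no overlap give 0, which is the behavior that makes the measure useful for quantifying agreement between regulatory profiles.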

TF PPI network can be downloaded from [STRINGdb](https://string-db.org/)<sup>4</sup> that defines interactions between a large set of proteins, not only TFs, either through binding, coexpression, or a combined score that aggregates several measures.

Gene coexpression is simply the Pearson correlation matrix of our gene expression matrix.

The motif matrix is an initial estimate of our regulatory network and can be obtained by scanning the promoter regions of genes for TF binding sequences called motifs, where TFs bind to initiate transcription.

![](https://media.addgene.org/data/easy-thumbnails/filer_public/cms/filer_public/05/a2/05a22a94-5efe-4cd1-8d66-98d769ceec8b/eukaryotic_promoters.png__500x300_q85_crop_subsampling-2_upscale.png)

Once again, since we are doing a comparative analysis of a control and a case network, we can use the processed inputs that were used to generate the normal lung regulatory network that we are comparing to. Check the [GRAND database](https://www.grand.networkmedicine.org/tissues/Lung_tissue/) to download the motif and PPI data we used to reconstruct the normal lung network, or download them programmatically.

In [None]:
#system('curl -O https://granddb.s3.amazonaws.com/optPANDA/ppi/ppi_complete.txt')
ppi      <- read.delim("/opt/data/ppi_complete.txt", stringsAsFactors=F, header=F)

Computing the network can take some time, so we will use a precomputed version.

In [None]:
#LUADLung <- panda(motif, countMat, ppi, mode="intersection")
luadLung = read.table('/opt/data/LUAD_PANDA.csv', header=T,sep=',',row.names = 1)

## 5. Downloading the normal lung regulatory network

[GRAND database](https://www.grand.networkmedicine.org/tissues/Lung_tissue/) hosts a large array of gene regulatory networks across human conditions, including networks for normal human tissues. These networks were generated in previous studies<sup>5,6</sup> to investigate the tissue-specific regulation of gene activity. Gene expression data were downloaded from the [GTEx project](https://gtexportal.org/home/)<sup>7</sup>, where a large number of samples of non-diseased tissues were collected from humans, sometimes post-mortem for invasive samples such as brain.

In [None]:
# 2.2. Download normal lung network from GRAND
#system('curl -O https://granddb.s3.amazonaws.com/tissues/networks/Lung.csv')
normalLung = read.table('/opt/data/Lung.csv', header=T,sep=',',row.names = 1)

## 6. Building a differential LUAD network
Now, to find the differences in regulation in LUAD as compared to normal, we first need to align the networks because they do not have the same sets of genes and TFs. Then we will compare them simply by taking the difference over the intersecting sets. The difference of edge weights is the most straightforward method of comparing networks; however, there are other methods of building differential networks.

In [None]:
# Align TFs and genes
interTFs   = intersect(rownames(normalLung), rownames(luadLung))
interGenes = intersect(colnames(normalLung), colnames(luadLung))
indLUADTF  = match(interTFs, rownames(luadLung))
indLUADGene= match(interGenes,colnames(luadLung))
indLungTF  = match(interTFs, rownames(normalLung))
indLungGene= match(interGenes,colnames(normalLung))
# Compute differential network
diffNet = luadLung[indLUADTF,indLUADGene] - normalLung[indLungTF,indLungGene]

Now, we will compute a summary measure called targeting, which is simply the sum of edge weights. Gene targeting refers to the weighted in-degree of each gene, and TF targeting refers to the weighted out-degree of each TF.

In [None]:
# compute targeting
diffTF   = rowSums(diffNet)
diffGene = colSums(diffNet)
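The alignment, difference, and targeting steps can be checked by hand on a toy example. The two small networks below are made up; note that they deliberately have different row and column orders, so that the alignment by name matters.

```r
# Toy TF-by-gene networks (hypothetical values)
normalToy <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2,
                    dimnames = list(c("TF1", "TF2"),
                                    c("geneA", "geneB", "geneC")))
luadToy   <- matrix(c(5, 1, 4, 0), nrow = 2,
                    dimnames = list(c("TF2", "TF1"),
                                    c("geneB", "geneA")))

# Align both networks on their shared TFs and genes, by name
sharedTFs   <- intersect(rownames(normalToy), rownames(luadToy))
sharedGenes <- intersect(colnames(normalToy), colnames(luadToy))

# Differential network: case minus control on the aligned sets
diffToy <- luadToy[sharedTFs, sharedGenes] - normalToy[sharedTFs, sharedGenes]

# Differential targeting: weighted out-degree of TFs, weighted in-degree of genes
diffTFToy   <- rowSums(diffToy)
diffGeneToy <- colSums(diffToy)
print(diffToy)
```

Indexing by name rather than by position gives the same result as the `match` calls used above, and it makes the alignment explicit.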

Finally, to summarize what we did so far, we can visualize the network interactively to get a sense of the connections. We are plotting the top 100 differential edges: the TFs are orange triangles, the genes are dark blue dots, green edges mean increased targeting in LUAD, and red edges mean decreased targeting in LUAD. The size of the TFs and genes is proportional to their targeting score. You can click on the nodes to interact with the graph.

In [None]:
nDiffs  = 100 # number of edges to plot (largest absolute differential weight)
diffMat = as.matrix(diffNet)
# Linear (column-major) indices of the top edges, by decreasing absolute weight
aa = order(abs(diffMat), decreasing = TRUE)[1:nDiffs]
# Convert the linear indices to (TF row, gene column) pairs
topIdx     = arrayInd(aa, dim(diffMat))
tfIdsTop   = topIdx[,1]
geneIdsTop = topIdx[,2]
# Edges data frame
edges = data.frame(from  = rownames(diffMat)[tfIdsTop],
                   to    = colnames(diffMat)[geneIdsTop], # gene IDs as labeled in diffNet
                   value = diffMat[aa],
                   stringsAsFactors = FALSE)
edges$arrows = "to"
edges$color  = ifelse(edges$value > 0, "green", "red")
edges$value  = abs(edges$value)

# Nodes data frame
nodeIds     = unique(c(edges$from, edges$to))
nodes       = data.frame(id = nodeIds, label = nodeIds, stringsAsFactors = FALSE)
nodes$group = ifelse(nodes$id %in% edges$from, "TF", "gene")
# Look up the targeting scores by name so that they stay aligned with the node ids
nodes$value = ifelse(nodes$group == "TF", diffTF[nodes$id], diffGene[nodes$id])

# Plot network
net <- visNetwork(nodes, edges, width = "100%")
net <- visGroups(net, groupname = "TF", shape = "triangle",
                 color = list(background = "orange", border="black"))
net <- visGroups(net, groupname = "gene", shape = "dot",       
                 color = list(background = "darkblue", border="black"))
visLegend(net, main="Legend", position="right", ncol=1) 

We can clearly see the bipartite structure of the network when we switch the layout to hierarchical.

In [None]:
net <- visNetwork(nodes, edges, width = "100%")%>% 
  visHierarchicalLayout()
net <- visGroups(net, groupname = "TF", shape = "triangle",
                 color = list(background = "orange", border="black"))
net <- visGroups(net, groupname = "gene", shape = "dot",       
                 color = list(background = "darkblue", border="black"))
visLegend(net, main="Legend", position="right", ncol=1) 

## 7. Analysis of disrupted gene and TF regulation
Now making sense of a differential gene regulatory network can be challenging, so we are going to analyze the top 50 differential TFs in LUAD using a tool in [GRAND database](https://www.grand.networkmedicine.org/). For example, we can pick the 25 most positively disrupted TFs and the 25 most negatively disrupted TFs.

In [None]:
# Top 50 differential TF list
diffTF = sort(diffTF)
cat(names(diffTF[c(1:25,(length(diffTF)-24):length(diffTF))]), sep="\n")

Then, copy this list, paste it in https://www.grand.networkmedicine.org/disease/, and click `submit`. This will tell us about the biological processes that involve those TFs.

For example, we see that another type of cancer, `prostate cancer`, is present, which highlights common pathological processes. Also, heart physiology-related terms such as `atrial fibrillation` and `stroke ischemic` are enriched, which could suggest a co-morbidity of LUAD and heart disease. Immunity-related terms such as `mean platelet volume` and `systemic sclerosis` are also present and expected for cancer. A small note that can put us on the path for a follow-up study is the presence of male-related terms such as `prostate cancer` and `male-pattern baldness`. Although this can seem anodyne, it may also suggest a sex-dimorphic prevalence of LUAD.

## 8. Finding drugs that reverse LUAD differential network

Now, using the differential network and the TF and Gene differential targeting scores, we will find drugs that allow to reverse the disruption of regulation in LUAD back to normal. To do that, we will use a powerful and simple technique called reverse connectivity. The idea<sup>8</sup>, first formulated for gene expression using the [Connectivity Map](https://clue.io/) project<sup>9</sup>, simply assumes that a given drug or chemical compound is a potential candidate to treat a disease if it induces the opposite gene expression signature to that of the disease.

In other words, if the expression of gene A is increased in cancer, then a drug that reduces the expression of gene A is potentially a treatment. Although the idea might seem simple, the classical ways of designing new drugs do not use gene expression changes as a readout, but rather the binding of a drug to a target protein that is supposedly the driver of the disease, and the inhibition of that protein.
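The reversal idea can be sketched with a toy scoring scheme: drugs whose signature is anti-correlated with the disease signature get the highest score. The signatures, the drug names, and the use of a negative Pearson correlation as the score are all assumptions made for illustration; this is not a description of the scoring used by the Connectivity Map or by CLUEreg.

```r
# Toy disease signature (hypothetical): positive = increased in disease
diseaseSig <- c(geneA = 2, geneB = -1.5, geneC = 0.5)

# Toy drug-induced signatures (hypothetical), one row per drug
drugSigs <- rbind(drugX = c(-2.0,  1.5, -0.5),   # mirrors the disease
                  drugY = c( 2.0, -1.5,  0.5),   # mimics the disease
                  drugZ = c( 0.1,  0.2, -0.1))   # weak, mostly unrelated
colnames(drugSigs) <- names(diseaseSig)

# Reversal score: negative correlation with the disease signature
reversalScore <- apply(drugSigs, 1, function(s) -cor(s, diseaseSig))

# Best reversal candidates first
ranked <- sort(reversalScore, decreasing = TRUE)
print(round(ranked, 2))
```

In this toy example, the drug that mirrors the disease ranks first as a reversal candidate, while the drug that mimics the disease ranks last.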

In our case, since we are dealing with networks, we will look for drugs that reverse the differential targeting of TFs and genes in LUAD. To do that, we will use a tool called [CLUEreg](https://grand.networkmedicine.org/analysis/) that computes the regulatory networks induced by more than 20,000 drugs, and we will look for the ones that reverse LUAD, first by TF targeting then by gene targeting.

<img src="https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/cluereg.png" alt="drawing" width="600"/>

We can start by taking the previous list of 50 differential TFs and splitting it into the top 25 positively differential TFs and the top 25 negatively differential TFs. Perfect drug matches are the ones that induce the top 25 positively regulated TFs in LUAD to be negatively regulated and the top 25 negatively regulated TFs in LUAD to be positively regulated.

In [None]:
# Up 25
cat( names(diffTF[(length(diffTF)-24):length(diffTF)]), sep="\n" )

In [None]:
# Down 25
cat(names(diffTF[1:25]), sep="\n")

Then copy the first list and paste it in the top panel in [CLUEreg](https://grand.networkmedicine.org/analysis/) and paste the second list in the bottom panel. Check the `TF targeting` box and the `remove investigational drugs` box.
We see that among the drugs that reverse regulation in LUAD are [Doxorubicin](https://en.wikipedia.org/wiki/Doxorubicin) and [Vatalanib](https://en.wikipedia.org/wiki/Vatalanib), which are already prescribed in cancer, which supports the validity of our approach.

In the similar section, we found the compounds that have the same patterns as the disease, among those [Aflatoxin-b1](https://en.wikipedia.org/wiki/Aflatoxin_B1), a potent carcinogen.

We can do the same analysis in the gene space:

In [None]:
diffGene = sort(diffGene)
# Up 150
cat( names(diffGene[(length(diffGene)-149):length(diffGene)]), sep="\n" )

In [None]:
# Down 150
cat(names(diffGene[1:150]), sep="\n")

Paste the first list in the top panel and the second list in the bottom panel. Check the `Gene targeting` box and the `remove investigational drugs` box.

We find an interesting hit, LBH-589 ([Panobinostat](https://en.wikipedia.org/wiki/Panobinostat)), which showed interesting activity in Acute Myeloid Leukemia (AML) and chronic myeloid leukemia (CML-BC) cell lines.

## 9. Estimating the drivers of regulatory transition from health to disease in LUAD
We estimated the differences between LUAD and healthy lung by computing a differential regulatory network. This approach allowed us to find drug candidates to reverse the disease network back to normal. Another approach to compute a differential network would be to use MONSTER between the healthy regulatory network and the LUAD regulatory network.

This time we will call MONSTER and supply it with the LUAD and healthy PANDA regulatory networks, which can be achieved by passing the argument `mode='regNet'`. We will generate 1000 null permutations to estimate the significance of the transitions; therefore, we need to set the seed to obtain reproducible results with MONSTER. The precomputed results can be loaded directly.

In [None]:
set.seed(1619)
combinedRegNetworks=as.data.frame(cbind(normalLung[indLungTF,indLungGene],luadLung[indLUADTF,indLUADGene]))
nGenes=length(indLungGene)
design=c(rep(0,nGenes),rep(1,nGenes))
#monsterResRegNet <- monster(combinedRegNetworks, design ,motif=NA, nullPerms=1000, numMaxCores=12, mode='regNet')
#system('curl -O https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/monster/monsterRegNetLUAD.RData')
load('/opt/data/monsterRegNetLUAD.RData')

We can plot the top 20 TFs that drive the transition to LUAD by shifting their targeting patterns.

In [None]:
monster.plot.monsterAnalysis(monsterResRegNet, rescale='significance', nTFs=20)

We see that `FOXA2` and `MEF2C` are shared with the TFs that we found previously in the analysis of gene expression networks. FOX family transcription factors are transcriptional repressors that are particularly responsible for cell growth and proliferation, and some of its members are involved in [some forms of cancer](https://pubmed.ncbi.nlm.nih.gov/25205602/).

`SP1` and `SP3` have anti-death properties following DNA damage or oxidative stress in [neurons](https://www.jneurosci.org/content/23/9/3597).

`ZBTB7B` is involved in immunological processes, and it has been shown that its activity could be lung-specific in protecting against tuberculosis through [modulating immunological processes](https://iai.asm.org/content/88/2/e00845-19).

We find that the drivers of the regulatory network transition share common patterns with the drivers of the expression transition, with additional elements that can further our understanding of the disease.
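
The ranking of TFs plotted above relies on a per-TF involvement score derived from the transition matrix. The following is a minimal sketch, assuming that the score is the sum of squared off-diagonal entries in each TF's column; the matrix `tm` and the TF names are made up, and the exact computation in netZooR (including the significance estimated from null permutations, which is not shown here) may differ.

```r
# Toy illustration only: the transition matrix below is made up.
tfNames <- c("TF1", "TF2", "TF3", "TF4")
tm <- matrix(c(1.0,  0.1, 0.0, 0.0,
               0.4,  0.9, 0.0, 0.1,
               0.0, -0.3, 1.0, 0.0,
               0.0,  0.0, 0.2, 1.1),
             nrow = 4, byrow = TRUE,
             dimnames = list(tfNames, tfNames))

# Ignore the diagonal (self-transitions) and sum the squared entries per column
offDiag <- tm
diag(offDiag) <- 0
dTFI <- colSums(offDiag^2)

# TFs ranked from most to least involved in the transition
print(sort(dTFI, decreasing = TRUE))
```

A TF with a large score has targeting that changed in a way that cannot be explained by its own behavior in the reference condition alone.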

We can plot the transition matrix as a heatmap to get an overall idea of the main drivers.

In [None]:
heatmap.2(slot(monsterResRegNet, 'tm'), col = "bluered",
          density.info="none",  
          trace="none", 
          dendrogram='none',     
          Rowv=FALSE,
          Colv=FALSE)

We can confirm, for example, that the `FOX` family of transcription factors is collectively disrupted in LUAD, which could suggest common pathways. Another way of visualizing the main transition drivers is to project the transition matrix on its first two principal components.

In [None]:
monster.transitionPCAPlot(monsterResRegNet)

The size of each TF is proportional to its scaled transitions with all other TFs. We can see that `ZBTB7B`, `FOXP1`, and `ONECUT2` are consistently disrupted with a large number of their transition counterparts. However, no structure or group of TFs seems to cluster under similar transition patterns. A few TFs stand out, such as `MYOD1` and `NR2C1`; however, their effect size seems small.

Finally, transitions can happen with TFs that do not have consistently disrupted targeting with all other TFs in LUAD. In other words, some TFs promote the transition by engaging a few TF partners. To identify such interactions, we can simply plot the network of the top 20 edges in the transition matrix.
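
The projection itself can be sketched with base R. This is a minimal illustration on a simulated transition matrix, assuming that the diagonal is removed before running PCA on the rows; the names `tm`, `offDiag`, `coords`, and `pointSize` are hypothetical, and the preprocessing used by netZooR may differ.

```r
# Toy illustration only: simulated transition matrix, not MONSTER output.
set.seed(2)
nTFs    <- 6
tfNames <- paste0("TF", seq_len(nTFs))
tm <- diag(nTFs) + matrix(rnorm(nTFs^2, sd = 0.1), nrow = nTFs)
dimnames(tm) <- list(tfNames, tfNames)

# Remove self-transitions so that only TF-to-TF changes drive the projection
offDiag <- tm
diag(offDiag) <- 0

# Principal component analysis of the rows; keep the first two components
pca    <- prcomp(offDiag, center = TRUE, scale. = FALSE)
coords <- pca$x[, 1:2]

# Hypothetical point size: overall magnitude of each TF's off-diagonal row
pointSize <- sqrt(rowSums(offDiag^2))

print(round(cbind(coords, size = pointSize), 3))
```

Passing `coords` to `plot()` and labeling the points with `text()` would give a two-dimensional map in which TFs with similar transition profiles sit close together.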

In [None]:
monster.transitionNetworkPlot(monsterResRegNet, numEdges=20)

We can see that our top differentially involved TFs indeed contribute to some of the largest transitions. The members of the FOX family seem to gain targeting in LUAD from `DLX6`, while `SP1` and `SP3` lose targeting to `CEBPG` and `ALX1`. We can also see that some of the transitions are not driven by the TFs that have the largest involvement. For example, `ELK1`, which has been implicated in [prostate cancer](https://pubmed.ncbi.nlm.nih.gov/31794091/), induces a loss of targeting of `PAX4`, which plays an essential role in [cell survival](https://www.nature.com/articles/1210205) among other roles. However, `CEBPG` positively contributes to targeting of `ZNF148`, whereas `ELK1` negatively contributes to it. Increased activity of `ZNF148` has been associated with a poor prognosis in [colon cancer patients](https://pubmed.ncbi.nlm.nih.gov/23576061/); therefore, a potential approach to reverse LUAD could address the `CEBPG`-`ZNF148` edge in the transition network to favor the suppression of `ZNF148` activity by `ELK1`.

MONSTER analysis allowed us to estimate the transition matrix from healthy lung to LUAD and to identify the TF drivers of this transition by computing their involvement. Therefore, these TFs can be targets of possible interventions. In addition, we identified the specific TF-to-TF edges that had the highest impact on developing LUAD. Manipulating these high-weight transition edges is particularly interesting because it allows a more targeted and precise approach to network perturbation than knocking down a TF, which could be less effective due to the essentiality of TFs in key cellular processes.
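
Selecting the strongest transition edges can be sketched in a few lines of base R. The example below uses a made-up transition matrix and assumes that rows are source TFs, columns are target TFs, and that the sign of an entry indicates a gain or a loss of targeting; the names `tm`, `edges`, `topEdges`, and `numEdges` are hypothetical, and the netZooR plotting function may select edges differently.

```r
# Toy illustration only: the transition matrix below is made up.
tfNames <- c("TF1", "TF2", "TF3", "TF4")
tm <- matrix(c(1.0,  0.1, 0.0, 0.0,
               0.4,  0.9, 0.0, 0.1,
               0.0, -0.3, 1.0, 0.0,
               0.0,  0.0, 0.2, 1.1),
             nrow = 4, byrow = TRUE,
             dimnames = list(tfNames, tfNames))

# Convert the matrix to an edge list (assumption: row = source, column = target)
edges <- as.data.frame(as.table(tm), stringsAsFactors = FALSE)
colnames(edges) <- c("from", "to", "weight")

# Drop self-transitions, then keep the edges with the largest absolute weight
edges    <- edges[edges$from != edges$to, ]
numEdges <- 3
topEdges <- edges[order(abs(edges$weight), decreasing = TRUE), ][seq_len(numEdges), ]

# Label the direction of the change from the sign of the weight
topEdges$direction <- ifelse(topEdges$weight > 0, "gain", "loss")
print(topEdges)
```

Ranking by absolute weight keeps both gains and losses of targeting, so that strong negative transitions are not hidden behind weaker positive ones.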

## 10. Conclusion

Using a network approach, we found drug candidates that are prescribed in other cancers and that could be used for LUAD. However, this approach is mainly useful for generating hypotheses, and the path to bring a candidate to the bench and the bedside requires extensive validation and careful expert curation.

Finally, if you find this resource useful, please support us with a GitHub star in [our repository](https://github.com/netZoo). Thank you!

## References
1- Dela Cruz, Charles S., Lynn T. Tanoue, and Richard A. Matthay. "Lung cancer: epidemiology, etiology, and prevention." Clinics in chest medicine 32.4 (2011): 605-644. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3864624/

2- Gao, Galen F., et al. "Before and after: comparison of legacy and harmonized TCGA genomic data commons’ data." Cell systems 9.1 (2019): 24-34. https://www.cell.com/cell-systems/fulltext/S2405-4712(19)30201-7

3- Collado-Torres, Leonardo, et al. "Reproducible RNA-seq analysis using recount2." Nature biotechnology 35.4 (2017): 319-321. https://www.nature.com/articles/nbt.3838

4- Szklarczyk, Damian, et al. "STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets." Nucleic acids research 47.D1 (2019): D607-D613. https://academic.oup.com/nar/article/47/D1/D607/5198476

5- Sonawane, Abhijeet Rajendra, et al. "Understanding tissue-specific gene regulation." Cell reports 21.4 (2017): 1077-1088. https://www.sciencedirect.com/science/article/pii/S2211124717314183

6- Lopes-Ramos, Camila M., et al. "Sex Differences in Gene Expression and Regulatory Networks across 29 Human Tissues." Cell reports 31.12 (2020): 107795. https://www.sciencedirect.com/science/article/pii/S2211124720307762

7- GTEx Consortium. "Genetic effects on gene expression across human tissues." Nature 550.7675 (2017): 204-213. https://www.nature.com/articles/nature24277

8- Keenan, Alexandra B., et al. "Connectivity mapping: methods and applications." Annual Review of Biomedical Data Science 2 (2019): 69-92. https://www.annualreviews.org/doi/abs/10.1146/annurev-biodatasci-072018-021211

9- Lamb, Justin, et al. "The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease." Science 313.5795 (2006): 1929-1935. https://pubmed.ncbi.nlm.nih.gov/17008526/

## Image credit

<a href='https://www.freepik.com/vectors/heart'>Heart vector created by rawpixel.com - www.freepik.com</a>