# Finding drug candidates to reverse Lung Adenocarcinoma (LUAD)-induced gene regulation disruption

Author: Marouen Ben Guebila<sup>1</sup>

<sup>1</sup> Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA.

## 1. Introduction

Lung adenocarcinoma (LUAD) is an aggressive form of cancer that represents 12.4% of all new cancer cases. Every year, as many Americans die of LUAD as of prostate, colon, and breast cancer combined <sup>1</sup>.
![](https://www.drugs.com/health-guide/images/355434.jpg)

Outline of the study:

- Download gene expression data from TCGA corresponding to LUAD patients
- Process gene expression data
- Build a gene regulatory network for LUAD
- Download a normal lung gene regulatory network from GRAND
- Build LUAD differential network
- Analysis of the disrupted gene and Transcription Factor (TF) edges
- Drug repurposing through reversal

### 1.1. Loading the libraries

In [None]:
library('recount') # to process gene expression
library('limma') # to average duplicated genes (avearrays)
library('netZooR') # for network analysis
library('visNetwork') # for network visualization

## 2. Finding gene expression data from LUAD patients

TCGA<sup>2</sup> is an NIH-funded project that collected and measured gene expression samples from thousands of patients with various types of cancer. A quick check on their [website](https://portal.gdc.cancer.gov/projects/TCGA-LUAD) tells us that there are indeed LUAD patients enrolled in the project.

However, since we are doing a comparative analysis, we want to minimize the variability introduced by batch effects. So instead of downloading the data directly from the TCGA portal, we will use data from recount2<sup>3</sup>, a project that uniformly processed a large collection of publicly available gene expression studies.

We can click on the [TCGA tab](https://jhubiostatistics.shinyapps.io/recount/) and select lung or we can download the samples programmatically.

In [None]:
#system('curl -O http://idies.jhu.edu/recount/data/v2/TCGA/rse_gene_lung.Rdata')
load("/opt/data/rse_gene_lung.Rdata")

## 3. Processing gene expression data
First, we need to clean the data to keep only annotated genes, because pseudogenes and other non-relevant regions are sometimes sequenced as well. In practice, we keep the entries that have a gene symbol.

In [None]:
rowDataRse=rowData(rse_gene)
geneIds       = c()
geneSymbol    = c()
geneEntrezIds = c()
# First remove pseudogenes
usingEntrezGeneIDs=1
if(usingEntrezGeneIDs==1){
  for(i in 1:dim(rowDataRse)[1]){
    if(!is.na(rowDataRse$symbol[i][[1]][1])){
      geneIds       = c(geneIds,i)
      geneSymbol    = c(geneSymbol,rowDataRse$symbol[i][[1]][1])
      geneEntrezIds = c(geneEntrezIds, substr(rowDataRse$gene_id[i],1,15))
    }
  }
}
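The same filter can be sketched without an explicit loop. The snippet below is only an illustration on a small made-up annotation table (`toyAnnot`, with hypothetical gene IDs and symbols), not on the recount2 object: it keeps the rows that have a symbol and strips the version suffix from the gene ID.

```r
# Toy annotation table; IDs and symbols are made up for illustration
toyAnnot <- data.frame(
  gene_id = c("ENSG00000000001.5", "ENSG00000000002.11", "ENSG00000000003.2"),
  symbol  = c("GENEA", NA, "GENEC"),
  stringsAsFactors = FALSE
)
# Keep only the rows that have a gene symbol
keep <- which(!is.na(toyAnnot$symbol))
toySymbol <- toyAnnot$symbol[keep]
# The first 15 characters of the gene ID exclude the version suffix
toyIds <- substr(toyAnnot$gene_id[keep], 1, 15)
print(data.frame(id = toyIds, symbol = toySymbol))
```

In the real object, `rowDataRse$symbol` is a list-like column, which is why the notebook takes the first symbol of each entry inside the loop.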

Then, let's have a look at the samples in our data.

In [None]:
colDataRse=colData(rse_gene)
print(unique(colDataRse@listData$gdc_cases.project.name))

We see that there are two lung cancer types: LUAD and LUSC, which stands for Lung Squamous Cell Carcinoma, so we need to keep only the LUAD samples.

In [None]:
print(unique(colDataRse@listData$gdc_cases.samples.sample_type))

In addition, we see that not all samples were taken from the tumor: there are also normal and tumor-adjacent normal samples, which are used as controls in gene expression analysis. Since we already have a "normal" lung network in our database, we can discard those.

In [None]:
tumorIds=c()
patientSymbol=c()
for(i in 1:length(colDataRse@listData$gdc_cases.samples.sample_type)){
  if(colDataRse@listData$gdc_cases.samples.sample_type[i] %in% c("Primary Tumor","Recurrent Tumor") && colDataRse@listData$gdc_cases.project.name[i] == "Lung Adenocarcinoma"){
    tumorIds = c(tumorIds,i)     
    patientSymbol = c(patientSymbol, colDataRse@listData$gdc_cases.samples.portions.analytes.aliquots.submitter_id[i])
  }
}
#check that samples are LUAD and not normal
colDataRse@listData$gdc_cases.project.name[tumorIds]
colDataRse@listData$gdc_cases.samples.sample_type[tumorIds]

Next, we need to make sure that there are no duplicates in our data. Gene names can be duplicated if the isoform transcripts of the same gene were measured separately; in that case, we simply average their values. Strictly speaking, isoforms are not duplicates and can carry essential information about gene activity, so this is a simplification.
![](https://genestack.com/blog/wp-content/uploads/2015/04/isoforms.png)

In [None]:
#average isoforms
countMat=assays(rse_gene)$counts[geneIds,tumorIds]
colnames(countMat)=patientSymbol
rownames(countMat)=geneEntrezIds
countMat=avearrays(t(countMat))
countMat=as.data.frame(t(countMat))
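To make the averaging step concrete, here is a minimal base R sketch on a toy matrix (`toyCounts`, made-up values). It is not what `avearrays` does internally, only an illustration of averaging the rows that share the same gene ID.

```r
# Toy count matrix: "geneA" appears twice (two isoforms), "geneB" once
toyCounts <- matrix(c(10, 20, 30,
                       2,  4,  6,
                       5,  5,  5),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("geneA", "geneA", "geneB"),
                                    c("s1", "s2", "s3")))
ids <- rownames(toyCounts)
# rowsum() adds up the rows that share an ID
sums <- rowsum(toyCounts, group = ids)
# Dividing each row by its group size turns the sums into means
avgCounts <- sums / as.vector(table(ids)[rownames(sums)])
print(avgCounts)
```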

Finally, we need to standardize the gene expression read counts. Here, we will simply scale the counts by the total coverage of each sample.

In [None]:
# Scale counts by sample coverage
rse=SummarizedExperiment(countMat, rowData=rownames(countMat), colData=colnames(countMat), metadata=list())
rse=as(rse, "RangedSummarizedExperiment")
colData(rse)$auc = colData(rse_gene)$auc[tumorIds]
rse <- scale_counts(rse)
countMat=SummarizedExperiment::assay(rse, 1)
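The idea behind coverage scaling can be sketched on a toy matrix. This is a simplification, not recount's exact computation: `scale_counts` works from the coverage (`auc`) stored in the object, whereas the sketch below uses the column totals as a stand-in, and `targetSize` is an arbitrary value chosen for illustration.

```r
# Toy counts: sample s1 was sequenced twice as deeply as s2
toyCounts <- matrix(c(100, 300, 50, 150), nrow = 2,
                    dimnames = list(c("geneA", "geneB"), c("s1", "s2")))
targetSize <- 1000  # arbitrary common library size (assumption)
# Divide each column by its total, then rescale to the common size
scaled <- sweep(toyCounts, 2, colSums(toyCounts), "/") * targetSize
print(scaled)
```

After scaling, both samples have the same total, so differences between columns reflect relative expression rather than sequencing depth.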

## 4. Building an LUAD gene regulatory network

Now, using the processed gene expression data, we will estimate a bipartite gene regulatory network that links Transcription Factors (TFs) to their target genes. There are several tools in the [Network Zoo](https://netzoo.github.io) packages for inferring gene regulatory networks. For this study, we will start with [PANDA](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0064832).


<img src="https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/PANDA-01.png" alt="drawing" width="700"/>

Briefly, PANDA applies guilt-by-association principles to gene regulation: if TF A regulates gene 1, and gene 2 is coexpressed with gene 1, then TF A is likely to regulate gene 2. The strength of each association is quantified with a continuous Tanimoto similarity.
In a nutshell, PANDA infers a TF-gene regulatory network by iteratively seeking agreement between three sources of information: a TF Protein-Protein Interaction (PPI) network as a TF-by-TF matrix, gene coexpression as a gene-by-gene matrix, and TF motif binding site estimation as a TF-by-gene matrix.
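As an illustration of the similarity measure, the sketch below implements a continuous Tanimoto similarity of the form T(x, y) = x·y / sqrt(|x|² + |y|² − |x·y|), which is how we read the PANDA paper. The function name `tanimoto` and the toy vectors are ours; this is not the netZooR implementation.

```r
# Continuous Tanimoto similarity between two numeric vectors (illustrative)
tanimoto <- function(x, y) {
  xy <- sum(x * y)
  xy / sqrt(sum(x^2) + sum(y^2) - abs(xy))
}
# Toy edge-weight profiles of three TFs over three genes (made-up values)
tfA <- c(1.0, 0.5, 0.0)
tfB <- c(1.0, 0.5, 0.0)  # same targets as tfA
tfC <- c(0.0, 0.0, 1.0)  # no target in common with tfA
print(tanimoto(tfA, tfB))
print(tanimoto(tfA, tfC))
```

TFs with overlapping target profiles get a higher similarity than TFs with disjoint profiles, which is the quantity the guilt-by-association reasoning relies on.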

TF PPI network can be downloaded from [STRINGdb](https://string-db.org/)<sup>4</sup> that defines interactions between a large set of proteins, not only TFs, either through binding, coexpression, or a combined score that aggregates several measures.

Gene coexpression is simply the Pearson correlation matrix of our gene expression matrix.
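For instance, with genes in rows and samples in columns, the coexpression matrix can be computed with base R as sketched below. `toyExpr` is random toy data, used only to show the shape of the result.

```r
set.seed(1)
# Toy expression matrix: 3 genes (rows) measured in 6 samples (columns)
toyExpr <- matrix(rnorm(3 * 6), nrow = 3,
                  dimnames = list(c("geneA", "geneB", "geneC"),
                                  paste0("s", 1:6)))
# cor() correlates columns, so transpose to correlate genes
coexp <- cor(t(toyExpr), method = "pearson")
print(dim(coexp))
```

The result is a symmetric gene-by-gene matrix with ones on the diagonal.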

The motif matrix is an initial estimate of our regulatory network. It can be obtained by scanning the promoter regions of genes, where TFs bind to initiate transcription, for TF binding sequences called motifs.

![](https://media.addgene.org/data/easy-thumbnails/filer_public/cms/filer_public/05/a2/05a22a94-5efe-4cd1-8d66-98d769ceec8b/eukaryotic_promoters.png__500x300_q85_crop_subsampling-2_upscale.png)

Once again, since we are doing a comparative analysis of a control and a case network, we can use the same processed inputs that were used to generate the normal lung regulatory network we are comparing to. Check the [GRAND database](https://www.grand.networkmedicine.org/tissues/Lung_tissue/) to download the motif data we used to reconstruct the normal lung network, or download it programmatically.

In [None]:
#system('curl -O https://granddb.s3.amazonaws.com/optPANDA/ppi/ppi_complete.txt')
#system('curl -O https://granddb.s3.amazonaws.com/tissues/motif/tissues_motif.txt')
ppi      <- read.delim("/opt/data/ppi_complete.txt", stringsAsFactors=F, header=F)
motif    <- read.delim("/opt/data/tissues_motif.txt", stringsAsFactors=F, header=F)

Computing the network can take some time, so we will use a precomputed version.

In [None]:
#LUADLung <- panda(motif, countMat, ppi, mode="intersection")
luadLung = read.table('/opt/data/LUAD_PANDA.csv', header=T,sep=',',row.names = 1)

## 5. Downloading the normal lung regulatory network

The [GRAND database](https://www.grand.networkmedicine.org/tissues/Lung_tissue/) hosts a large array of gene regulatory networks across human conditions, including networks for normal human tissues. These networks were generated in previous studies<sup>5,6</sup> to investigate the tissue-specific regulation of gene activity. Gene expression data was downloaded from the [GTEx project](https://gtexportal.org/home/)<sup>7</sup>, where a large number of samples of non-diseased tissues were collected from humans, sometimes post-mortem for invasive samples such as brain.

In [None]:
# 2.2. Download normal lung network from GRAND
#system('curl -O https://granddb.s3.amazonaws.com/tissues/networks/Lung.csv')
normalLung = read.table('/opt/data/Lung.csv', header=T,sep=',',row.names = 1)

## 6. Building a differential LUAD network
Now, to find the differences in regulation in LUAD compared to normal lung, we first need to align the networks because they do not have the same sets of genes and TFs; then we compare them simply by taking the difference over the intersecting sets. The difference of edge weights is the most straightforward way of comparing networks; however, there are other methods for building differential networks.

In [None]:
# Align TFs and genes
interTFs   = intersect(rownames(normalLung), rownames(luadLung))
interGenes = intersect(colnames(normalLung), colnames(luadLung))
indLUADTF  = match(interTFs, rownames(luadLung))
indLUADGene= match(interGenes,colnames(luadLung))
indLungTF  = match(interTFs, rownames(normalLung))
indLungGene= match(interGenes,colnames(normalLung))
# Compute differential network
diffNet = luadLung[indLUADTF,indLUADGene] - normalLung[indLungTF,indLungGene]
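The alignment logic is easier to follow on a toy example. The sketch below uses two small made-up networks (`netA` standing in for the control, `netB` for the case) with partially overlapping TFs and genes, and indexes by name instead of by position.

```r
# Toy control network: TFs in rows, genes in columns
netA <- matrix(1:6, nrow = 2,
               dimnames = list(c("TF1", "TF2"), c("g1", "g2", "g3")))
# Toy case network with different TFs, genes, and column order
netB <- matrix(c(10, 20, 30, 40), nrow = 2,
               dimnames = list(c("TF2", "TF3"), c("g3", "g1")))
# Keep only the TFs and genes present in both networks
sharedTFs   <- intersect(rownames(netA), rownames(netB))
sharedGenes <- intersect(colnames(netA), colnames(netB))
# Indexing by name puts both networks in the same order before subtracting
toyDiff <- netB[sharedTFs, sharedGenes, drop = FALSE] -
           netA[sharedTFs, sharedGenes, drop = FALSE]
print(toyDiff)
```

Only `TF2`, `g1`, and `g3` are shared, so the differential network has a single row and two columns.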

Now, we will compute a summary measure called targeting, which is simply the sum of edge weights. Gene targeting refers to the weighted in-degree of each gene, and TF targeting refers to the weighted out-degree of each TF.

In [None]:
# compute targeting
diffTF   = rowSums(diffNet)
diffGene = colSums(diffNet)
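On a toy differential network (made-up weights), the two targeting scores look as follows. Positive and negative edges can cancel out, so a score near zero does not necessarily mean that no edge changed.

```r
# Toy differential network: TFs in rows, genes in columns
toyDiff <- matrix(c(0.5, -1.0, 2.0, 0.5), nrow = 2,
                  dimnames = list(c("TF1", "TF2"), c("g1", "g2")))
# TF targeting: weighted out-degree (sum over the genes of each TF)
toyTF <- rowSums(toyDiff)
# Gene targeting: weighted in-degree (sum over the TFs of each gene)
toyGene <- colSums(toyDiff)
print(toyTF)
print(toyGene)
```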

Finally, to summarize what we did so far, we can visualize the network interactively to get a sense of the connections. We plot the top 100 differential edges, ranked by absolute weight: TFs are orange triangles, genes are blue dots, green edges mean increased targeting in LUAD, and red edges mean decreased targeting in LUAD. The size of the TFs and genes is proportional to their targeting score. You can click on the nodes to interact with the graph.

In [None]:
nDiffs = 100 # number of edges to plot (largest absolute differential weight)

# Edges data frame
diffMat = as.matrix(diffNet)
topIdx  = order(abs(diffMat), decreasing = TRUE)[1:nDiffs]
# Convert column-major linear indices to (TF row, gene column) pairs
topRC      = arrayInd(topIdx, dim(diffMat))
tfIdsTop   = topRC[,1]
geneIdsTop = topRC[,2]
# Use the names of the differential network itself so that labels match the edges
edges = data.frame(from  = rownames(diffMat)[tfIdsTop],
                   to    = colnames(diffMat)[geneIdsTop],
                   value = diffMat[topIdx],
                   stringsAsFactors = FALSE)
edges$arrows = "to"
edges$color  = ifelse(edges$value > 0, "green", "red")
edges$value  = abs(edges$value)

# Nodes data frame
nodeIds     = unique(c(edges$from, edges$to))
nodes       = data.frame(id = nodeIds, label = nodeIds, stringsAsFactors = FALSE)
nodes$group = ifelse(nodes$id %in% edges$from, "TF", "gene")
# Node size: look up each node's targeting score by name
nodes$value = ifelse(nodes$group == "TF", diffTF[nodes$id], diffGene[nodes$id])

# Plot network
net <- visNetwork(nodes, edges, width = "100%")
net <- visGroups(net, groupname = "TF", shape = "triangle",
                 color = list(background = "orange", border="black"))
net <- visGroups(net, groupname = "gene", shape = "dot",       
                 color = list(background = "darkblue", border="black"))
visLegend(net, main="Legend", position="right", ncol=1) 

We can clearly see the bipartite structure of the network when we switch the layout to hierarchical.

In [None]:
net <- visNetwork(nodes, edges, width = "100%")%>% 
  visHierarchicalLayout()
net <- visGroups(net, groupname = "TF", shape = "triangle",
                 color = list(background = "orange", border="black"))
net <- visGroups(net, groupname = "gene", shape = "dot",       
                 color = list(background = "darkblue", border="black"))
visLegend(net, main="Legend", position="right", ncol=1) 
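Selecting the top edges requires converting positions in the flattened matrix back to TF and gene names. The sketch below shows one way to do this on a toy matrix with base R's `arrayInd`, which handles the column-major layout, including the case where an edge sits in the last row.

```r
# Toy differential network: TFs in rows, genes in columns
toyDiff <- matrix(c(0.1, -3.0, 0.2, 2.0, -0.5, 1.0), nrow = 2,
                  dimnames = list(c("TF1", "TF2"), c("g1", "g2", "g3")))
# Linear indices of the 3 edges with the largest absolute weight
top <- order(abs(toyDiff), decreasing = TRUE)[1:3]
# Convert linear indices to (row, column) pairs
rc <- arrayInd(top, dim(toyDiff))
topEdges <- data.frame(from  = rownames(toyDiff)[rc[, 1]],
                       to    = colnames(toyDiff)[rc[, 2]],
                       value = toyDiff[top],
                       stringsAsFactors = FALSE)
print(topEdges)
```

Here all three top edges belong to `TF2`, the last row, which is exactly the situation where a plain modulo on the linear index would return a row index of 0.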

## 7. Analysis of disrupted gene and TF regulation
Now making sense of a differential gene regulatory network can be challenging, so we are going to analyze the top 50 differential TFs in LUAD using a tool in [GRAND database](https://www.grand.networkmedicine.org/). For example, we can pick the 25 most positively disrupted TFs and the 25 most negatively disrupted TFs.

In [None]:
# Top 50 differential TF list
diffTF = sort(diffTF)
cat(names(diffTF[c(1:25,(length(diffTF)-24):length(diffTF))]), sep="\n")
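The selection of the most negative and most positive scores can also be wrapped in a small helper. The function `topAndBottom` below is hypothetical (it is not part of netZooR) and is shown on made-up scores.

```r
# Return the names of the n lowest and n highest scores (hypothetical helper)
topAndBottom <- function(scores, n) {
  ranked <- sort(scores)
  list(down = names(head(ranked, n)), up = names(tail(ranked, n)))
}
# Toy differential targeting scores (made-up values)
toyScores <- c(TF1 = 0.3, TF2 = -2.1, TF3 = 1.7, TF4 = -0.4, TF5 = 0.9)
sel <- topAndBottom(toyScores, 2)
cat(sel$down, sep = "\n")
cat(sel$up, sep = "\n")
```

Because the helper sorts internally, it does not depend on the score vector having been sorted beforehand.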

Then, copy this list, paste it into https://www.grand.networkmedicine.org/disease/ and click `submit`. This will tell us about the biological processes that involve those TFs.

For example, we see that another type of cancer, `prostate cancer`, is present, which highlights common pathological processes. Heart physiology-related terms such as `atrial fibrillation` and `stroke ischemic` are also enriched, which could suggest a co-morbidity of LUAD and heart disease. Immunity-related terms such as `mean platelet volume` and `systemic sclerosis` are present as well, as expected for cancer. A small note that could motivate a follow-up study is the presence of male-related terms such as `prostate cancer` and `male-pattern baldness`. Although this may seem anodyne, it may also suggest a sex-dimorphic prevalence of LUAD.

## 8. Finding drugs that reverse LUAD differential network

Now, using the differential network and the TF and gene differential targeting scores, we will look for drugs that could reverse the disruption of regulation in LUAD back to normal. To do that, we will use a simple yet powerful technique called reverse connectivity. The idea<sup>8</sup>, first formulated for gene expression using the [Connectivity Map](https://clue.io/) project<sup>9</sup>, simply assumes that a given drug or chemical compound is a potential candidate to treat a disease if it induces the opposite gene expression signature to that of the disease.

In other words, if the expression of gene A is increased in cancer, then a drug that reduces the expression of gene A is potentially a treatment. Although the idea might seem simple, classical drug design does not use gene expression changes as a readout; it relies instead on the binding of a drug to, and its inhibition of, a target protein that is presumed to drive the disease.
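To make the reversal idea concrete, here is a deliberately simple scoring sketch based on set overlaps. The function `reversalScore`, the drug names, and the gene lists are hypothetical; this is not the scoring used by the Connectivity Map or by the GRAND tools, only an illustration of the principle.

```r
# Score in [-1, 1]: 1 means the drug fully reverses the disease signature,
# -1 means the drug fully mimics it (illustrative scoring, not a published one)
reversalScore <- function(diseaseUp, diseaseDown, drugUp, drugDown) {
  reversed <- length(intersect(diseaseUp, drugDown)) +
              length(intersect(diseaseDown, drugUp))
  mimicked <- length(intersect(diseaseUp, drugUp)) +
              length(intersect(diseaseDown, drugDown))
  (reversed - mimicked) / (length(diseaseUp) + length(diseaseDown))
}
# Toy disease signature (made-up gene names)
diseaseUp   <- c("A", "B")
diseaseDown <- c("C", "D")
# drugX does the opposite of the disease, drugY mostly mimics it
scoreX <- reversalScore(diseaseUp, diseaseDown,
                        drugUp = c("C", "D"), drugDown = c("A", "B"))
scoreY <- reversalScore(diseaseUp, diseaseDown,
                        drugUp = c("A", "B"), drugDown = c("C"))
print(c(drugX = scoreX, drugY = scoreY))
```

Under this toy score, `drugX` would rank as a candidate treatment, while `drugY` would be flagged as similar to the disease.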

In our case, since we are dealing with networks, we will look for drugs that reverse the differential targeting of TFs and genes in LUAD. To do that, we will use a tool called [CLUEreg](https://grand.networkmedicine.org/analysis/), which computes the regulatory networks induced by more than 20,000 drugs, and we will look for the ones that reverse LUAD, first by TF targeting and then by gene targeting.

<img src="https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/cluereg.png" alt="drawing" width="600"/>

We can start from the previous list of 50 differential TFs and split it into the 25 most positively differential TFs and the 25 most negatively differential TFs. Perfect drug matches are the ones that make the top 25 positively regulated TFs in LUAD negatively regulated and the top 25 negatively regulated TFs in LUAD positively regulated.

In [None]:
# Up 25
cat( names(diffTF[(length(diffTF)-24):length(diffTF)]), sep="\n" )

In [None]:
# Down 25
cat(names(diffTF[1:25]), sep="\n")

Then, copy the first list and paste it in the top panel in [CLUEreg](https://grand.networkmedicine.org/analysis/), and paste the second list in the bottom panel. Check the `TF targeting` box and the `remove investigational drugs` box.
We see that among the drugs that reverse regulation in LUAD are [Doxorubicin](https://en.wikipedia.org/wiki/Doxorubicin) and [Vatalanib](https://en.wikipedia.org/wiki/Vatalanib), which are already used or studied as cancer treatments, which supports the validity of our approach.

In the section listing similar compounds, we find the compounds that induce the same patterns as the disease, among them [Aflatoxin-b1](https://en.wikipedia.org/wiki/Aflatoxin_B1), a potent carcinogen.

We can do the same analysis in the gene space:

In [None]:
diffGene = sort(diffGene)
# Up 150
cat( names(diffGene[(length(diffGene)-149):length(diffGene)]), sep="\n" )

In [None]:
# Down 150
cat(names(diffGene[1:150]), sep="\n")

Paste the first list in the top panel and the second list in the bottom panel. Check the `Gene targeting` box and the `remove investigational drugs` box.

We find an interesting hit, LBH-589 ([Panobinostat](https://en.wikipedia.org/wiki/Panobinostat)), which showed interesting activity in Acute Myeloid Leukemia (AML) and chronic myeloid leukemia (CML-BC) cell lines.

## 9. Conclusion

Using a network approach, we found drug candidates that are prescribed in other cancers and could be used for LUAD. However, this approach is meant to generate hypotheses: bringing a candidate to the bench and then to the bedside requires extensive validation and careful expert curation.

Finally, if you find this resource useful, please support us with a GitHub star in [our repository](https://github.com/netZoo). Thank you!

## References
1- Cruz, Charles S. Dela, Lynn T. Tanoue, and Richard A. Matthay. "Lung cancer: epidemiology, etiology, and prevention." Clinics in chest medicine 32.4 (2011): 605-644. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3864624/

2- Gao, Galen F., et al. "Before and after: comparison of legacy and harmonized TCGA genomic data commons’ data." Cell systems 9.1 (2019): 24-34. https://www.cell.com/cell-systems/fulltext/S2405-4712(19)30201-7

3- Collado-Torres, Leonardo, et al. "Reproducible RNA-seq analysis using recount2." Nature biotechnology 35.4 (2017): 319-321. https://www.nature.com/articles/nbt.3838

4- Szklarczyk, Damian, et al. "STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets." Nucleic acids research 47.D1 (2019): D607-D613. https://academic.oup.com/nar/article/47/D1/D607/5198476

5- Sonawane, Abhijeet Rajendra, et al. "Understanding tissue-specific gene regulation." Cell reports 21.4 (2017): 1077-1088. https://www.sciencedirect.com/science/article/pii/S2211124717314183

6- Lopes-Ramos, Camila M., et al. "Sex Differences in Gene Expression and Regulatory Networks across 29 Human Tissues." Cell reports 31.12 (2020): 107795. https://www.sciencedirect.com/science/article/pii/S2211124720307762

7- GTEx Consortium. "Genetic effects on gene expression across human tissues." Nature 550.7675 (2017): 204-213. https://www.nature.com/articles/nature24277

8- Keenan, Alexandra B., et al. "Connectivity mapping: methods and applications." Annual Review of Biomedical Data Science 2 (2019): 69-92. https://www.annualreviews.org/doi/abs/10.1146/annurev-biodatasci-072018-021211

9- Lamb, Justin, et al. "The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease." Science 313.5795 (2006): 1929-1935. https://pubmed.ncbi.nlm.nih.gov/17008526/

## Image credit

<a href='https://www.freepik.com/vectors/heart'>Heart vector created by rawpixel.com - www.freepik.com</a>