# Drug repurposing analysis in colon cancer
Marouen Ben Guebila <sup>1</sup>

<sup>1</sup> Harvard School of Public Health, Harvard University, Boston, MA, USA.

# Introduction
In this case study, we show how to use the [GRAND database](https://grand.networkmedicine.org/) to compute differential gene regulatory networks in colon cancer and to find small-molecule drugs that could reverse the cancer network toward a 'normal' network.

This study corresponds to figure 5 of the GRAND database description<sup>1</sup>.

First, we load the libraries needed for the analysis.

In [None]:
import pandas as pd
import numpy as np
import scipy.sparse # for loading the sparse drug signature matrices
import matplotlib.pyplot as plt
import networkx as nx # for network plotting
from matplotlib import cm # for color palette
from matplotlib.colors import ListedColormap, LinearSegmentedColormap # for color palette
from scipy.stats import zscore # to compute zscore

Here, we define the path to load the data from the server.

In [None]:
ppath = '/opt/data/netZooPy/coloncancer/'

Then, let's define a set of functions to perform the analyses.
`ismember` finds which elements of array `a` are members of array `b`. Unlike `np.in1d`, it also returns their positions in `b`, following the style of MATLAB's `ismember` function.

In [None]:
def ismember(a, b):
    # tf = np.in1d(a,b) # for newer versions of numpy
    tf = np.array([i in b for i in a]) # True where an element of a is found in b
    # position in b of each element of a (last match); 0 is a placeholder when not found
    index = np.array([(np.where(b == i))[0][-1] if t else 0 for i,t in zip(a,tf)])
    return tf, index
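
As a point of comparison, the same membership lookup can be sketched with pandas. The helper name `ismember_indexer` and the toy arrays are hypothetical and not part of the original analysis, and the sketch assumes that the values in `b` are unique.

```python
import numpy as np
import pandas as pd

def ismember_indexer(a, b):
    """Hypothetical alternative to ismember; assumes the values in b are unique."""
    locb = pd.Index(b).get_indexer(a)  # position of each element of a in b, -1 if absent
    tf = locb >= 0                     # membership mask
    return tf, np.where(tf, locb, 0)   # 0 as placeholder for absent elements, like ismember

# Toy arrays (made-up names)
a = np.array(['geneA', 'geneB', 'geneX'])
b = np.array(['geneB', 'geneC', 'geneA'])
tf, index = ismember_indexer(a, b)
print(tf)     # membership of each element of a in b
print(index)  # positions in b (placeholder 0 for 'geneX')
```

Because absent elements get the placeholder position 0, `index` should always be used together with the mask, as in `index[tf]`.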

`plothist` plots a histogram and overlays lines on the tails of the distribution.

In [None]:
def plothist(tar,diffTar,a):
    if tar=='gene':
        plt.hist(diffTar)
        plt.plot([diffTar.iloc[a[-301]],diffTar.iloc[a[-301]]] , [0,3500], 'r')
        plt.plot([diffTar.iloc[a[301]], diffTar.iloc[a[301]]], [0,3500], 'r')
    elif tar=='tf':
        plt.hist(diffTar)
        plt.plot([diffTar.iloc[a[-101]],diffTar.iloc[a[-101]]], [0,250] ,'r')
        plt.plot([diffTar.iloc[a[101]],diffTar.iloc[a[101]]], [0,250] , 'r')

`printTopTar` prints the lists of TFs or genes with the largest and smallest targeting scores.

In [None]:
def printTopTar(tar, a, diffTar):
    # Down-targeted genes/tfs
    print('Largest targeting scores \n')
    if tar=='tf':
        print("\n".join(diffTar.index[a[:100]]))
        diffTar.iloc[a[100]]
    elif tar=='gene':
        print("\n".join(diffTar.index[a[:300]]))
        diffTar.iloc[a[300]]

    print('Smallest targeting scores \n')
    # Up-targeted genes/tfs
    if tar=='tf':
        print("\n".join(diffTar.index[a[-101:-1]]))
        diffTar.iloc[a[-101]]
    elif tar=='gene':
        print("\n".join(diffTar.index[a[-301:-1]]))
        diffTar.iloc[a[-301]]

`computeTargeting` computes targeting scores in a network.

In [None]:
def computeTargeting(colonCancer,colonHealthy,tar):
    if tar=='tf':
        cancerTar = colonCancer.sum(axis=1)
        healthTar = colonHealthy.sum(axis=1)
    else:
        cancerTar = colonCancer.sum(axis=0)
        healthTar = colonHealthy.sum(axis=0)
    return cancerTar,healthTar
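
To make the two cases concrete, here is a minimal sketch on a made-up TF-by-gene matrix (the names and edge weights are illustrative only). With TFs as rows and genes as columns, TF targeting is a row sum and gene targeting is a column sum.

```python
import pandas as pd

# Toy TF-by-gene network with made-up edge weights
net = pd.DataFrame([[1.0,  2.0, 0.5],
                    [0.0, -1.0, 3.0]],
                   index=['TF1', 'TF2'],
                   columns=['geneA', 'geneB', 'geneC'])

tf_targeting = net.sum(axis=1)    # one score per TF: sum over its outgoing edges
gene_targeting = net.sum(axis=0)  # one score per gene: sum over its incoming edges

print(tf_targeting)
print(gene_targeting)
```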

After defining these functions, the first step is to compute a differential colon cancer network. This can be done by comparing a colon cancer network with a normal colon network. The former is a PANDA network built on TCGA gene expression data and the latter is a PANDA network built on GTEx gene expression data collected across normal human tissues.

In [None]:
colonCancer  = pd.read_csv(ppath+'Colon_cancer_TCGA.csv',index_col=0)
colonHealthy = pd.read_csv(ppath+'Colon_Sigmoid.csv',index_col=0)

# 1. Gene targeting analysis
The first analysis consists of computing differences in gene targeting between colon cancer and normal colon networks. Since gene names are given as Ensembl gene IDs, we set the following parameter to convert them to HUGO gene symbols.

In [None]:
conv='2hugo'

Then, we compute gene targeting.

In [None]:
tar='gene'
[cancerTar,healthTar] = computeTargeting(colonCancer,colonHealthy,tar)

And, we convert gene names.

In [None]:
# Read conversion file
convMat = pd.read_csv(ppath+'geneNames.txt',sep='\t')
if tar=='gene':
    if conv=='2ens':
        inter,ix=ismember(np.array(cancerTar.index), np.array(convMat.geneNames) ) #inter,ix is lia,locb
        cancerTar.index=convMat.Ensemble[ix]
    elif conv=='2hugo':
        inter,ix=ismember(np.array(healthTar.index), np.array(convMat.Ensemble) ) #inter,ix is lia,locb
        healthTar.index=convMat.geneNames[ix]
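
The same kind of conversion can be sketched with a pandas lookup on toy data. The column names `Ensemble` and `geneNames` follow the conversion file above, while the identifiers and scores are placeholders and not real Ensembl IDs. Note that in the cell above, identifiers that are absent from the table receive the placeholder position 0 from `ismember`; this sketch drops them instead, which is a design choice and an assumption about what is wanted.

```python
import pandas as pd

# Toy conversion table; identifiers are placeholders, not real Ensembl IDs
conv_table = pd.DataFrame({'Ensemble':  ['ENS_1', 'ENS_2', 'ENS_3'],
                           'geneNames': ['geneA', 'geneB', 'geneC']})
scores = pd.Series([0.3, -1.2, 2.0], index=['ENS_3', 'ENS_1', 'ENS_9'])

# Build an ID -> symbol lookup and map the index through it
id_to_symbol = conv_table.set_index('Ensemble')['geneNames']
symbols = scores.index.map(id_to_symbol)  # NaN where an ID has no symbol

# Keep only the entries that could be converted
found = symbols.notna()
converted = scores[found]
converted.index = symbols[found]
print(converted)
```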

These networks do not have the same size, so we take the intersection of their nodes.

In [None]:
# Intersect both healthy and cancer
inter2,ix2 = ismember(np.array(healthTar.index), np.array(cancerTar.index))

Finally, we compute the differential targeting by simply taking the difference between the cancer and healthy targeting scores, and then z-scoring the result. There are more sophisticated methods to estimate differences between networks such as [ALPACA and MONSTER](https://netzoo.github.io).

In [None]:
# Compute differential targeting
diffTar = cancerTar.iloc[ix2[inter2]] - healthTar[inter2] # ix2 holds integer positions, so index by position
diffTar=(diffTar-np.mean(diffTar)) / (np.std(diffTar))
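
The intersection, difference, and z-scoring steps can be sketched on toy data as follows. The gene names and scores are made up, and `Index.intersection` is used here in place of `ismember` only to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Toy targeting scores (made-up values); the two profiles cover different genes
cancer = pd.Series([5.0, 1.0, 3.0, 2.0], index=['geneA', 'geneB', 'geneC', 'geneD'])
healthy = pd.Series([2.0, 2.0, 3.0], index=['geneC', 'geneA', 'geneB'])

# Keep only the genes present in both profiles
shared = healthy.index.intersection(cancer.index)

# Difference of targeting scores, aligned on gene names
diff = cancer[shared] - healthy[shared]

# z-score using the population standard deviation (ddof=0, the NumPy default)
values = diff.to_numpy()
z = pd.Series((values - values.mean()) / values.std(), index=diff.index)
print(z)
```

After z-scoring, the scores have mean 0 and standard deviation 1, so the tails of the distribution can be compared across analyses.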

For this particular analysis, we take the 300 genes with the largest and the 300 genes with the smallest targeting scores to define a differential targeting profile that can be used later for drug repurposing analysis with [cluereg](https://grand.networkmedicine.org/analysis/).

In [None]:
a=np.argsort(list(diffTar))
printTopTar(tar, a, diffTar)
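
As a small illustration of how slices of an `argsort` result select the two tails, here is a sketch on made-up scores with `k = 2`. The variable names are hypothetical. Since `np.argsort` sorts in ascending order, the first positions hold the lowest scores and the last positions hold the highest ones.

```python
import numpy as np
import pandas as pd

# Toy z-scored targeting profile (made-up values)
scores = pd.Series([0.2, -1.5, 2.1, 0.0, -0.7, 1.3],
                   index=['g1', 'g2', 'g3', 'g4', 'g5', 'g6'])

order = np.argsort(scores.to_numpy())  # positions sorted by ascending score
k = 2

lowest = scores.index[order[:k]]                  # k lowest scores
highest = scores.index[order[-k:]]                # k highest scores, including the maximum
highest_excl = scores.index[order[-k - 1:-1]]     # k scores just below the maximum

print(list(lowest), list(highest), list(highest_excl))
```

A slice of the form `order[-k-1:-1]` has length `k` but stops one element short of the maximum, whereas `order[-k:]` includes it.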

These 300 genes can be seen in the histogram of targeting scores in the differential network.

In [None]:
# Plot histogram of targeting scores
plothist(tar,diffTar,a)
plt.xlabel('Differential targeting scores')
plt.ylabel('Frequency')
plt.title('Differential gene targeting scores')

Finally, we can list the top 10 genes by largest differential targeting scores

In [None]:
print("\n".join(diffTar.index[a[:10]]))

and the top 10 genes by smallest targeting scores among the 300 we selected.

In [None]:
print("\n".join(diffTar.index[a[-11:-1]]))

# 2. TF targeting scores
PANDA networks are bipartite graphs that link Transcription Factors (TFs) to their target genes. We did the previous targeting analysis on only one part of the networks. In this section, we can extend the association to TFs to compute their differential targeting profiles.

In [None]:
tar='tf'
# Compute TF targeting
[cancerTar,healthTar] = computeTargeting(colonCancer,colonHealthy,tar)

In [None]:
# Intersect both healthy and cancer
inter2,ix2 = ismember(np.array(healthTar.index), np.array(cancerTar.index))

In [None]:
# Compute differential targeting
diffTar = cancerTar.iloc[ix2[inter2]] - healthTar[inter2] # ix2 holds integer positions, so index by position
diffTar=(diffTar-np.mean(diffTar)) / (np.std(diffTar))

In [None]:
# Plot histogram of targeting scores
a=np.argsort(list(diffTar))
plothist(tar,diffTar,a)
plt.xlabel('Differential targeting scores')
plt.ylabel('Frequency')
plt.title('Differential TF targeting scores')

List the top 10 TFs by largest differential targeting scores.

In [None]:
print("\n".join(diffTar.index[a[:10]]))

List the top 10 TFs by smallest differential targeting scores.

In [None]:
print("\n".join(diffTar.index[a[-11:-1]]))

In particular, we see that `SP1` and `ZBTB7B` are the TFs that have the smallest differential targeting scores.

In this case, we can consider differentially targeted TFs to be those with the largest and smallest targeting scores, which can be seen in the tails of the previous histogram of targeting scores.

In [None]:
printTopTar(tar, a, diffTar)

In particular, we see that `BHLHA15` and `ARNTL` are the TFs that have the largest differential targeting scores.

# 3. Plot differential network
In this section, we will plot a subset of the differential network, taking the 2 TFs with the largest targeting scores and the 2 TFs with the smallest targeting scores. First, we need to convert gene names to HUGO symbols as we did previously.

In [None]:
# Convert gene names and label networks
inter,ix=ismember(np.array(colonHealthy.columns), np.array(convMat.Ensemble) )
colonHealthy.columns=convMat.geneNames[ix]
diffCols,diffColsInd=ismember(colonHealthy.columns,colonCancer.columns)
colonHealthy=colonHealthy.loc[:,diffCols]
colonCancer =colonCancer.iloc[:,diffColsInd[diffCols]]
diffRows,diffRowsInd=ismember(colonHealthy.index,colonCancer.index)
colonHealthy=colonHealthy.loc[diffRows,:]
colonCancer=colonCancer.iloc[diffRowsInd[diffRows],:]

Then, we need to compute the differential network by taking the difference between adjacency matrices and z-scoring the result.

In [None]:
# build differential network
diffNet=colonCancer-colonHealthy
diffNet=diffNet.transpose().apply(zscore, axis=0).transpose() # z-score by TF
diffNetBip=pd.DataFrame(data=diffNet.values.flatten())
diffNetBip['target']=list(diffNet.columns)*len(diffNet.index)
b=[[tf]*len(diffNet.columns) for tf in diffNet.index]
diffNetBip['source']=sum(b, [])
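
The flattening of a TF-by-gene matrix into an edge list can also be sketched with `np.repeat` and `np.tile`, which avoids building nested Python lists. The toy matrix below is made up, and the column name `weight` is a choice for this sketch (the cell above keeps the default column name `0`).

```python
import numpy as np
import pandas as pd

# Toy differential network with made-up z-scores (TFs as rows, genes as columns)
net = pd.DataFrame([[0.5, -1.0,  2.0],
                    [1.5,  0.0, -0.5]],
                   index=['TF1', 'TF2'],
                   columns=['geneA', 'geneB', 'geneC'])

n_tfs, n_genes = net.shape
edges = pd.DataFrame({
    'source': np.repeat(net.index.to_numpy(), n_genes),  # each TF repeated once per gene
    'target': np.tile(net.columns.to_numpy(), n_tfs),    # gene list repeated once per TF
    'weight': net.to_numpy().ravel(),                    # row-major flattening matches the order above
})
print(edges)
```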

For visualization purposes, we will plot the top differentially targeted TFs on both sides of the distribution. In this case, `BHLHA15`, `ARNTL` have the largest scores, and `ZBTB7B`, `SP1` have the smallest scores.

In [None]:
# select edges
diffNetBip=diffNetBip.sort_values(by=0, ascending=False)
# np.r_ is used directly because the pd.np alias was removed from recent pandas versions
Bip1 = diffNetBip[diffNetBip.source=='BHLHA15'].iloc[np.r_[0:5, -6:-1]]
Bip2 = diffNetBip[diffNetBip.source=='ARNTL'].iloc[np.r_[0:5, -6:-1]]
Bip3 = diffNetBip[diffNetBip.source=='ZBTB7B'].iloc[np.r_[0:5, -6:-1]]
Bip4 = diffNetBip[diffNetBip.source=='SP1'].iloc[np.r_[0:5, -6:-1]]
Bip  = pd.concat([Bip1,Bip2,Bip3,Bip4])

Finally, we can draw the network for these 4 TFs. Genes are colored in blue and TFs in red.

In [None]:
# Draw
g_data=nx.from_pandas_edgelist(Bip, source='source', target='target', edge_attr=True)
color_map = []
for node in g_data:
    if node in ['BHLHA15','ARNTL','ZBTB7B','SP1']:
        color_map.append('red')
    else:
        color_map.append('blue')

pos = nx.spring_layout(g_data)
nx.draw(g_data, with_labels=True, node_color=color_map ,pos=pos,node_size=100)

# 4. Drug repurposing
To find drugs that can potentially reverse the regulatory 'signature' of colon cancer, that is, the differentially targeted TFs and genes, we can use the targeting scores that we computed as input to [cluereg](https://grand.networkmedicine.org/analysis/). This tool uses the connectivity idea<sup>2</sup> to find drug repurposing candidates. The Connectivity Map<sup>3</sup> was the original tool that explored this idea on gene expression ([clue.io](https://clue.io)).
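
To give an intuition for the connectivity idea, here is a deliberately simplified sketch on made-up data. It is not the scoring used by cluereg or the Connectivity Map: the function `reversal_score`, the TF names, and the signature values are all hypothetical. A drug is encoded as +1 (up), -1 (down), or 0 (no effect) per TF, and the score counts the TFs whose disease direction the drug reverses minus those it reinforces.

```python
import pandas as pd

# Hypothetical disease signature: TFs up- and down-targeted in disease
disease_up = ['TF1', 'TF2']
disease_down = ['TF3', 'TF4']

# Hypothetical drug signature: +1 up, -1 down, 0 no effect
drug = pd.Series({'TF1': -1, 'TF2': 0, 'TF3': 1, 'TF4': 1, 'TF5': -1})

def reversal_score(drug_sig, up, down):
    """Toy score: number of reversed TFs minus number of reinforced TFs."""
    on_up = drug_sig.reindex(up)      # drug effect on disease-up TFs (NaN if unknown)
    on_down = drug_sig.reindex(down)  # drug effect on disease-down TFs
    reversed_count = (on_up == -1).sum() + (on_down == 1).sum()
    reinforced_count = (on_up == 1).sum() + (on_down == -1).sum()
    return int(reversed_count - reinforced_count)

score = reversal_score(drug, disease_up, disease_down)
print(score)
```

Under this toy scoring, a larger positive value suggests a drug signature that opposes the disease signature more consistently.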

We can do this analysis on [the webserver](https://grand.networkmedicine.org/analysis/), or we can run cluereg locally as we will do next. We start by loading small molecule 'signatures' from cluereg.

In [None]:
# load gene/tf diff matrix
sparse_matrix   = scipy.sparse.load_npz(ppath+'sparse_cmapreg.npz')
sparse_matrixtf = scipy.sparse.load_npz(ppath+'sparse_cmapregtf.npz')
genNames   = pd.read_csv(ppath+'geneNames.csv',header=None)
drugNames  = pd.read_csv(ppath+'drugNames.csv',header=None)
tfNames    = pd.read_csv(ppath+'tfNames.csv',header=None)
drugGeneDf=pd.DataFrame(data=sparse_matrix.toarray(),columns=drugNames.iloc[:,0],index=pd.concat([genNames,genNames]).iloc[:,0])
drugTfDf=pd.DataFrame(data=sparse_matrixtf.toarray(),columns=drugNames.iloc[:,0],index=pd.concat([tfNames,tfNames]).iloc[:,0])

In figure 5 of our analysis<sup>1</sup>, we found that MK-5108 was a drug repurposing candidate for colon cancer, so let's dig a bit deeper into how this compound affects gene regulation. We first load the signature of this drug.

In [None]:
# Drug signature
print("\n".join(drugTfDf.index[drugTfDf['MK-5108']!=0]))

Then, we plot the number of TFs that are up-regulated and down-regulated following MK-5108 exposure.

In [None]:
# plot drug signature
plt.bar(['up','down'],[len(drugTfDf.index[drugTfDf['MK-5108']==1]),len(drugTfDf.index[drugTfDf['MK-5108']==-1])])

We can also list these TFs.

In [None]:
# find the reversed profiles
print(diffTar.index[a[:100]]) # down targeted TFs
print(diffTar.index[a[-101:-1]]) # up targeted TFs

As an illustration, we can select the 2 largest and smallest TFs by targeting scores and map them back to colon cancer. We see that their profiles are indeed reversed: TFs that are up-regulated by MK-5108 are down-regulated in colon cancer, and vice versa.

In [None]:
# drug signature
ia,locb=ismember(np.array(diffTar.index[a[:100]]),np.array(tfNames.iloc[:,0])) # down
ia2,locb2=ismember(np.array(diffTar.index[a[-101:-1]]),np.array(tfNames.iloc[:,0])) #up

down2up=pd.concat((diffTar.iloc[a[:100]],drugTfDf['MK-5108'].iloc[locb[ia]]),axis=1) # down-targeted by colon cancer becoming up-targeted by drug
up2down=pd.concat((diffTar.iloc[a[-101:-1]],drugTfDf['MK-5108'].iloc[locb2[ia2]+len(tfNames)]),axis=1) # up-targeted by colon cancer becoming down-targeted by drug

#pick two down and two up
down2up.sort_values(0,ascending=True)
up2down.sort_values(0,ascending=False)
down2up[down2up['MK-5108']==1]
up2down[up2down['MK-5108']==-1]

plt.plot(down2up[down2up['MK-5108']==1].iloc[0:2,0],'o')
plt.plot(up2down[up2down['MK-5108']==-1].iloc[-3:-1, 0], 'o')

# References

1 - Guebila, Marouen Ben, et al. "GRAND: A database of gene regulatory network models across human conditions." bioRxiv (2021).

2 - Keenan, Alexandra B., et al. "Connectivity mapping: methods and applications." Annual Review of Biomedical Data Science 2 (2019): 69-92.

3 - Subramanian, Aravind, et al. "A next generation connectivity map: L1000 platform and the first 1,000,000 profiles." Cell 171.6 (2017): 1437-1452.