# Multiomic CCLE analysis using the Network Zoo
Marouen Ben Guebila <sup>1</sup>

<sup>1</sup> Harvard T.H. Chan School of Public Health, Boston, MA, USA.

# Introduction
The Cancer Cell Line Encyclopedia (CCLE) has collected various omic data for more than a thousand cancer cell lines, representative of many lineages and tissue types. In this analysis, we first use DRAGON<sup>2</sup> to find associations between multiomic data types. Second, we use PANDA, LIONESS, and MONSTER to model the transition from primary to metastatic melanoma and to identify drivers of this transition.<sup>1</sup>
# Importing packages
First, we start by loading the packages required for the analysis.

In [None]:
import numpy as np
from scipy.stats import skew
import matplotlib.pyplot as plt # For plotting
import os
import pandas as pd         # To load data
import seaborn as sns       # To plot results
from netZooPy import dragon # To import dragon

We set the strategy for imputing missing values and the plotting font.

In [None]:
imputationMissing='zero'
plt.rcParams["font.family"] = "arial"

Next, we define the data path on the Netbooks server.

In [None]:
ppath = '/opt/data/netZooPy/ccle/'

Then, we define a set of functions to import and process CCLE data.

In [None]:
def estimateDragonValues(ppiMat, expressionMat, pval=False):
    # Estimate the DRAGON penalty parameters (lambdas) for the two layers
    print('computing lambdas')
    lambdas_exp_ppi, lambdas_landscape_exp_ppi = dragon.estimate_penalty_parameters_dragon(ppiMat, expressionMat)
    print('lambdas are ', lambdas_exp_ppi)
    # Compute partial correlations
    print('computing corrs')
    r_exp_ppi = dragon.get_partial_correlation_dragon(ppiMat, expressionMat, lambdas_exp_ppi)
    if pval:
        print('computing pvals')
        # Compute p-values from the sample size (n) and the number of features per layer (p1, p2)
        n_exp_ppi  = ppiMat.shape[0]
        p1_exp_ppi = ppiMat.shape[1]
        p2_exp_ppi = expressionMat.shape[1]
        adj_p_vals_exp_ppi, p_vals_exp_ppi = dragon.estimate_p_values_dragon(r_exp_ppi, n_exp_ppi, p1_exp_ppi, p2_exp_ppi, lambdas_exp_ppi)
    else:
        # p-values were not requested; return empty placeholders
        adj_p_vals_exp_ppi, p_vals_exp_ppi = [], []
    return r_exp_ppi, adj_p_vals_exp_ppi, p_vals_exp_ppi
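
To make the quantity returned by `get_partial_correlation_dragon` more concrete, here is a minimal NumPy sketch of partial correlations computed from a regularized inverse covariance matrix. It is not DRAGON's estimator: the single `ridge` constant is a hypothetical stand-in for the layer-specific penalties that DRAGON estimates, and the function name and toy data are assumptions made for illustration.

```python
import numpy as np

def partial_correlation(X, ridge=1e-3):
    """Partial correlations from a ridge-regularized covariance (illustrative only)."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    # The ridge term is a hypothetical stand-in for DRAGON's estimated penalties
    precision = np.linalg.inv(cov + ridge * np.eye(cov.shape[0]))
    d = np.sqrt(np.diag(precision))
    # Partial correlation between i and j is -P_ij / sqrt(P_ii * P_jj)
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

rng = np.random.default_rng(0)
layer1 = rng.normal(size=(200, 3))                   # 3 toy features of one omic layer
layer2 = layer1[:, :2] + rng.normal(size=(200, 2))   # 2 toy features of a second layer
pcor = partial_correlation(np.hstack([layer1, layer2]))
print(pcor.shape)  # (5, 5)
```

The result is a square matrix over the concatenated features of both layers; entries linking a feature of the first layer to a feature of the second are the cross-omic associations of interest.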

This is a simple scaling function that standardizes each column to zero mean and unit variance.

In [None]:
def Scale(X): 
    X_temp = X
    X_std = np.std(X_temp, axis=0)
    X_mean = np.mean(X_temp, axis=0)
    return (X_temp - X_mean) / X_std
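
As a quick check of the scaling step, the sketch below applies the same column-wise standardization to a toy array (the values are made up for illustration) and confirms that each column ends up with zero mean and unit standard deviation.

```python
import numpy as np

def scale(X):
    # Column-wise standardization, as in Scale above
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])   # toy data: 3 samples, 2 features
Z = scale(X)
print(np.allclose(Z.mean(axis=0), 0.0), np.allclose(Z.std(axis=0), 1.0))
```

A constant column has zero standard deviation and would cause a division by zero, which is why zero-variance columns are removed when the data are aligned.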

Because we will use DRAGON to find associations in pairs of multiomic data, we need to align any two omic data types so that they contain the same samples in the same order. The following function does this by matching their cell line names and then removing columns with zero variance.

In [None]:
def alignDF(expression, methyl):
    interListMerge = np.intersect1d(methyl.index, expression.index, return_indices=True)
    methyl     = methyl.iloc[interListMerge[1], :]
    expression = expression.iloc[interListMerge[2], :]
    # remove zero std columns
    a = np.std(expression, axis=0)
    expression = expression.drop(labels=expression.columns[np.where(a == 0)[0]], axis=1)
    b = np.std(methyl, axis=0)
    methyl = methyl.drop(labels=methyl.columns[np.where(b == 0)[0]], axis=1)
    return expression, methyl
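
The alignment logic can be illustrated on two toy tables. The cell line IDs, column names, and values below are hypothetical; the sketch only shows how `np.intersect1d` with `return_indices=True` yields matching row positions in both tables.

```python
import numpy as np
import pandas as pd

# Toy layers indexed by hypothetical cell line IDs
layer_a = pd.DataFrame({'g1': [1.0, 2.0, 3.0], 'g2': [5.0, 5.0, 5.0]},
                       index=['ACH-3', 'ACH-1', 'ACH-2'])
layer_b = pd.DataFrame({'m1': [0.1, 0.2, 0.3]},
                       index=['ACH-1', 'ACH-2', 'ACH-9'])

# Shared IDs (sorted) and their positions in each table
shared, idx_a, idx_b = np.intersect1d(layer_a.index, layer_b.index, return_indices=True)
a = layer_a.iloc[idx_a, :]
b = layer_b.iloc[idx_b, :]
# Drop columns that are constant across the shared samples
a = a.loc[:, a.std(axis=0) > 0]
print(list(a.index), list(b.index), list(a.columns))
```

After alignment, both tables have identical row labels in the same order, so their rows can be treated as paired observations.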

This function converts cell line names to a standard DepMap ID.

In [None]:
def convertToDepMap(methyl,cellNames):
    # Match column names (CCLE names) to the sample metadata; cell lines
    # that do not exist in DepMap are dropped. intersect1d returns the shared
    # names in sorted order together with their positions in each input.
    interList = np.intersect1d(methyl.columns, cellNames['CCLE_Name'], return_indices=True)
    # Reorder the columns to the intersection order so each column receives its own ID
    methyl = methyl.iloc[:, interList[1]]
    # Rename columns to DepMap IDs
    methyl.columns = cellNames['DepMap_ID'].iloc[interList[2]].values
    return methyl
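
An alternative, order-preserving way to perform this kind of renaming is a label-based lookup. The sketch below is only an illustration: the DepMap IDs are placeholders, not real identifiers, and the `lookup` variable is a name introduced here.

```python
import pandas as pd

# Toy metadata with placeholder (not real) DepMap IDs
sample_info = pd.DataFrame({'CCLE_Name': ['A375_SKIN', 'HT29_LARGE_INTESTINE', 'MCF7_BREAST'],
                            'DepMap_ID': ['ACH-A', 'ACH-B', 'ACH-C']})
data = pd.DataFrame([[1, 2, 3]], columns=['MCF7_BREAST', 'UNKNOWN_LINE', 'A375_SKIN'])

# Map CCLE names to IDs, keep only known cell lines, then rename by label
lookup = sample_info.set_index('CCLE_Name')['DepMap_ID']
kept = data.loc[:, data.columns.isin(lookup.index)]
kept.columns = lookup.loc[kept.columns].values
print(list(kept.columns))  # ['ACH-C', 'ACH-A']
```

Because the lookup is done by label, each column keeps its data regardless of how the columns are ordered.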

This function loads gene dependency data from CRISPR screens and, if requested, imputes missing values with zeros.

In [None]:
def processdepdata(imputationMissing):
    dep = pd.read_csv(ppath+'Achilles_gene_effect.csv', index_col=0)
    if imputationMissing=='zero':
        dep.replace(to_replace=np.nan, value=0, inplace=True)
    return dep

A processing function for miRNA expression data.

In [None]:
def processmirnadata(imputationMissing,cellNames):
    mirna=pd.read_csv(ppath+'CCLE_miRNA_20181103.gct',sep='\t',comment='#',skiprows=2,index_col=1)
    # remove unnecessary columns
    mirna = mirna.iloc[:,1:]
    # convert cell names to depmap IDs
    mirna=convertToDepMap(mirna,cellNames)
    mirna=mirna.transpose()
    return mirna

A processing function for drug viability data.

In [None]:
def processDrugs(imputationMissing):
    drugs=pd.read_csv(ppath+'primary-screen-replicate-collapsed-logfold-change.csv',index_col=0)
    drugMeta=pd.read_csv(ppath+'primary-screen-replicate-collapsed-treatment-info.csv')
    # remove failed drug experiments
    keepind=[]
    for i in range(drugs.shape[0]):
        if len(drugs.index[i].split('_')) == 1:
            keepind.append(i)
    #filter drug df
    drugs=drugs.iloc[keepind,:]
    #change drug name
    xy, x_ind, y_ind = np.intersect1d(drugs.columns,drugMeta.loc[:,'column_name'], return_indices=True)
    #first reorganize df by intersection
    drugs=drugs.iloc[:,x_ind]
    #then map drug names
    drugs.columns=drugMeta.loc[y_ind,'name']
    if imputationMissing=='zero':
        drugs.replace(to_replace=np.nan, value=0, inplace=True)
    return drugs

A processing function for proteomic data.

In [None]:
def processPPI(imputationMissing,cellNames):
    ppi = pd.read_csv(ppath+'Table_S2_Protein_Quant_Normalized.csv')
    # remove extra columns manually
    ppi=ppi.iloc[:,:426]
    # Keep SW948_LARGE_INTESTINE_TenPx20, CAL120_BREAST_TenPx28, and HCT15_LARGE_INTESTINE_TenPx18
    # according to https://www.biorxiv.org/content/10.1101/2020.02.03.932384v1
    swintestine=[i for i,item in enumerate(ppi.columns) if "SW948_LARGE_INTESTINE" in item] #132
    calbreast=[i for i,item in enumerate(ppi.columns) if "CAL120_BREAST" in item] #64
    hctintestine=[i for i,item in enumerate(ppi.columns) if "HCT15_LARGE_INTESTINE" in item] #338
    ppi= ppi.drop(labels=ppi.columns[[132,64,338]],axis=1)
    # remove more metadata columns
    ppiindex=ppi.iloc[:,1]
    ppi=ppi.iloc[:,49:]
    ppi.index=ppiindex
    if imputationMissing=='zero':
        ppi=ppi.fillna(0)
    # rename columns
    newColumns=[]
    for i in range(len(ppi.columns)):
        newColumns.append('_'.join(str.split(ppi.columns[i],'_')[0:2]))
    ppi.columns=newColumns
    # remove nan entries in index
    ppi = ppi.loc[ppi.index.dropna()]
    ppi = convertToDepMap(ppi, cellNames)
    ppi = ppi.transpose()
    return ppi

A processing function for metabolomic data.

In [None]:
def processmetabolism():
    metabolism = pd.read_csv(ppath+'CCLE_metabolomics_20190502.csv', index_col=1)
    # remove extra column in metabolism
    # manually remove nan row
    metabolism = metabolism.iloc[:, 1:]
    metabolism = metabolism.loc[metabolism.index.dropna()]
    return metabolism

# 1. DRAGON multiomic CCLE network
First, we load the metadata that has information about the cell lines and the various omics used.

In [None]:
cellNames=pd.read_csv(ppath+'sample_info.csv')
drugMeta=pd.read_csv(ppath+'primary-screen-replicate-collapsed-treatment-info.csv')

## 1.1. Correlations between miRNA and gene dependency

In the first part, we compute correlations between miRNA levels and gene dependency. Our hypothesis is that strong miRNA repression induces the same effects as a CRISPR gene knockout (KO).

In [None]:
def estimatemirnadep(imputationMissing,cellNames):
    dep = processdepdata(imputationMissing)
    mirna=processmirnadata(imputationMissing,cellNames)
    # align dataframes
    dep,mirna=alignDF(dep,mirna)
    # Call DRAGON
    mirnaMat     = mirna.values
    depMat       = dep.values
    # Scale arrays (samples are already in rows, so no transposition is needed)
    mirnaMat     = Scale(mirnaMat)
    depMat       = Scale(depMat)
    # Estimate lambdas
    r_mir_dep, adj_p_vals_mir_dep, p_vals_mir_dep=estimateDragonValues(mirnaMat, depMat, pval=False)
    # edge format top 5k and bottom 5k edges
    mir_dep_edges = createVisNet(mirna, dep, r_mir_dep, 'mir', 'dep', nedges=0)
    return mir_dep_edges
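
The helper `createVisNet` is used here without its definition. Judging from how its output is consumed later (a table indexed by features of the first layer, with features of the second layer as columns), one plausible reading is that it extracts the cross-layer block of the joint partial correlation matrix. The sketch below illustrates that reading only; the function name `cross_layer_block`, the assumed matrix layout (first-layer features first), and the toy values are all assumptions.

```python
import numpy as np
import pandas as pd

def cross_layer_block(layer1, layer2, pcor):
    """Return the layer1-by-layer2 block of a joint (p1+p2) x (p1+p2) matrix."""
    p1 = layer1.shape[1]
    return pd.DataFrame(pcor[:p1, p1:], index=layer1.columns, columns=layer2.columns)

layer1 = pd.DataFrame(np.zeros((4, 2)), columns=['mirA', 'mirB'])
layer2 = pd.DataFrame(np.zeros((4, 3)), columns=['geneX', 'geneY', 'geneZ'])
pcor = np.arange(25, dtype=float).reshape(5, 5)   # placeholder values, not real correlations
edges = cross_layer_block(layer1, layer2, pcor)
print(edges.shape)  # (2, 3)
```

With such a table, `edges['geneX']` gives the associations of every first-layer feature with `geneX`, which matches the column-wise indexing used in the following sections.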

In [None]:
mir_dep_edges=estimatemirnadep(imputationMissing,cellNames)
sortedarray = np.sort(mir_dep_edges.stack().values)[::-1]
plt.plot(sortedarray,'o',mfc='none', alpha=0.1, color='slategrey')

This plot shows the sorted partial correlations between gene dependency and miRNA expression. A strong correlation may indicate that a miRNA regulates the corresponding gene.

In [None]:
c=np.argsort(mir_dep_edges.values, axis=None)#small to large
tdindices=np.unravel_index(c, mir_dep_edges.shape)
numindex=tdindices[0][2]
numcol=tdindices[1][2]
print(mir_dep_edges.iloc[numindex,numcol])
print(mir_dep_edges.index[numindex])
print(mir_dep_edges.columns[numcol])
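
The lookup above combines `np.argsort` on the flattened table with `np.unravel_index` to recover row and column labels. A small example with made-up values shows the mechanics; the cell above reads position 2 of the sorted order, that is, the third most negative entry.

```python
import numpy as np
import pandas as pd

edges = pd.DataFrame([[0.2, -0.5, 0.1],
                      [-0.9, 0.4, -0.1]],
                     index=['mirA', 'mirB'], columns=['geneX', 'geneY', 'geneZ'])

order = np.argsort(edges.values, axis=None)         # flat positions, smallest value first
rows, cols = np.unravel_index(order, edges.shape)   # back to (row, column) positions
k = 0                                               # 0 selects the most negative entry
print(edges.index[rows[k]], edges.columns[cols[k]], edges.iloc[rows[k], cols[k]])
```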

We find that the pair GSR and miR-664a-3p has a strong negative correlation (negative dependency being associated with decreased cell survival). This pair is reported in [TargetScan](http://www.targetscan.org/cgi-bin/targetscan/vert_71/targetscan.cgi?mirg=hsa-miR-664a-3p) as a possible interaction based on various features.

## 1.2. Correlations between drug cell viability and gene dependency
Now, we compute DRAGON partial correlations between drug cell viability and gene dependency. Our hypothesis is that drugs inhibit their protein targets and therefore induce effects similar to those of a CRISPR gene KO.

In [None]:
def estimatedepdrug(imputationMissing):
    print('Dep-Drug')
    # Read dependencies and drugs
    dep  = processdepdata(imputationMissing)
    drugs= processDrugs(imputationMissing)
    # align dfs
    dep,drugs=alignDF(dep,drugs)
    # Call DRAGON
    depMat       = dep.values
    drugsMat     = drugs.values
    # Scale arrays (samples are already in rows, so no transposition is needed)
    depMat       = Scale(depMat) #replace by dragon.scale
    drugsMat     = Scale(drugsMat)
    # Estimate lambdas
    r_dep_drugs, adj_p_vals_dep_drugs, p_vals_dep_drugs=estimateDragonValues(depMat, drugsMat)
    # edge format top 5k and bottom 5k edges
    dep_drugs_edges=createVisNet(dep, drugs, r_dep_drugs,'dep','drugs',nedges=0)
    return dep_drugs_edges

In [None]:
dep_drugs_edges = estimatedepdrug(imputationMissing)

The following plot shows all partial correlations between gene dependencies and cell viability under Dabrafenib. Dabrafenib is a multikinase inhibitor indicated for melanoma.

In [None]:
oncdep_drugs_edges=dep_drugs_edges[oncdrugindex]
flierprops = dict(markerfacecolor='0.75', markersize=5,
              linestyle='none',marker='o')
sns_plot = sns.boxplot(oncdep_drugs_edges['dabrafenib'], orient='v',width=.6,flierprops=flierprops)

In particular, we find that the gene dependencies correlated with Dabrafenib are BRAF, MAPK1 and MAPK2, which belong to the same kinase signaling pathway targeted by Dabrafenib.

## 1.3. Correlations between LDH protein levels and metabolite levels
We compute correlations between LDH protein levels and metabolite levels. Here, we would like to infer the direction of the biochemical reactions of glycolysis to see whether fermentation (Warburg effect) is prevalent in CCLE cancer cell lines.

In [None]:
def estimateprotmet(cellNames):
    # Protein-metabolome
    print('Prot-met')
    # Read proteins and metabolism
    ppi        = processPPI(imputationMissing,cellNames)
    metabolism = processmetabolism()
    # align dataframes
    metabolism, ppi = alignDF(metabolism, ppi)
    # Call DRAGON
    ppiMat        = ppi.values
    metabolismMat = metabolism.values
    # Scale arrays (samples are already in rows, so no transposition is needed)
    ppiMat        = Scale(ppiMat)  # replace by dragon.scale
    metabolismMat = Scale(metabolismMat)
    # Estimate lambdas
    r_ppi_met, adj_p_vals_ppi_met, p_vals_ppi_met = estimateDragonValues(ppiMat, metabolismMat)
    # edge format top 5k and bottom 5k edges
    ppi_met_edges = createVisNet(ppi, metabolism, r_ppi_met, 'prot', 'met',nedges=0)
    return ppi_met_edges

Now, we plot the correlations between all metabolites and the protein levels of the two LDH isozymes (LDHA and LDHB). LDHA carries out the forward reaction that produces lactate, whereas LDHB preferentially converts lactate to pyruvate.

In [None]:
f = {'LDHA': c.values, 'LDHB': d.values}
dff=pd.DataFrame(data=f)
sns_plot = sns.swarmplot(data=Scale(dff), orient='v')

We find that metabolites such as fumarate/maleate, PEP, and g3p have a negative correlation with LDHA levels, indicating production of lactate. We also see that LDHB levels have a positive partial correlation (3.705e-05) with lactate, which indicates that LDHB works in the same direction as LDHA and further supports lactate production in cancer cells (Warburg effect).

## 1.4. Correlations between TF targeting scores and metabolite levels
We first load TF and gene targeting scores for all CCLE cell lines. These scores were computed by running PANDA<sup>3</sup> on the gene expression of all CCLE cell lines to build an aggregate network, then running LIONESS<sup>4</sup> on the aggregate network to build a single-sample network for each cell line. Gene and TF targeting scores were then computed for each single-sample network.

In [None]:
genetar = pd.read_csv(ppath+'CCLE_genetar.csv',index_col=0)
tftar   = pd.read_csv(ppath+'CCLE_tftar.csv',index_col=0)

We compute correlations between TF targeting scores and 2HG metabolite levels. 2HG is known to induce a hypermethylator phenotype and a cascade of epigenetic effects. However, it is not known which TFs are affected by hypermethylation of their promoters and by the consequent change in their binding and activity.

In [None]:
def estimatetftarmet(cellNames, tftar):
    print('tftar-met')
    tftar=tftar.transpose()
    # Read metabolism (TF targeting scores are passed in as an argument)
    metabolism = processmetabolism()
    # align dataframes
    metabolism, tftar = alignDF(metabolism, tftar)
    # Call DRAGON
    tftarMat        = tftar.values
    metabolismMat   = metabolism.values
    # Scale arrays (samples are already in rows, so no transposition is needed)
    tftarMat        = Scale(tftarMat)  # replace by dragon.scale
    metabolismMat = Scale(metabolismMat)
    # Estimate lambdas
    r_tftar_met, adj_p_vals_tftar_met, p_vals_tftar_met = estimateDragonValues(tftarMat, metabolismMat)
    # edge format top 5k and bottom 5k edges
    tftar_met_edges = createVisNet(tftar, metabolism, r_tftar_met, 'tftar', 'met',nedges=0)
    return tftar_met_edges

In [None]:
tftar_met_edges = estimatetftarmet(cellNames, tftar)
c=tftar_met_edges['2-hydroxyglutarate'].sort_values()
sns_plot = sns.boxplot(data=Scale(c.values))

We find that 2HG disrupts binding of TP73, PPARg, and GLI4. These TFs have various roles in cancer: TP73 is a tumor suppressor, PPARg mediates several oncogenic signaling processes, and GLI4 is a glioma-inducing oncogene. GLI4 is particularly interesting because glioma is the cancer subtype where 2HG induces a hypermethylator phenotype.

# 2. MONSTER transition analysis in melanoma
In the second part of our analysis, following up on the pan-cancer results above, we turn to cancer-specific patterns of regulation, particularly in melanoma. Here, we are interested in the transition from the primary to the metastatic state. To estimate this transition, we use MONSTER<sup>5</sup> with a LIONESS network of a primary melanoma cell line as the initial state and a LIONESS network of a metastatic cell line as the end state. Since MONSTER is implemented in netZooR, this analysis runs in R.
First, we define a function that computes a transition matrix, then builds a null distribution by resampling columns of these matrices and recomputing the transition 1000 times.
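
The idea of a transition matrix and its null distribution can be sketched in NumPy. This is a simplified illustration, not the netZooR implementation: the least-squares fit, the off-diagonal score, the column-reassignment scheme used for the null, and all names and toy data are assumptions chosen to mirror the description above.

```python
import numpy as np

def transition_matrix(A, B):
    """Least-squares T such that A @ T approximates B (rows: genes, columns: TFs)."""
    T, *_ = np.linalg.lstsq(A, B, rcond=None)
    return T

def involvement(T):
    """Share of each column's squared entries that lies off the diagonal."""
    total = (T ** 2).sum(axis=0)
    return (total - np.diag(T) ** 2) / total

rng = np.random.default_rng(0)
n_genes, n_tfs = 50, 4
A = rng.normal(size=(n_genes, n_tfs))              # toy initial-state network
shift = np.eye(n_tfs)
shift[0, 1] = 0.8                                  # TF 1 takes on part of TF 0's targets
B = A @ shift + 0.05 * rng.normal(size=A.shape)    # toy end-state network
observed = involvement(transition_matrix(A, B))

# Null scores: reassign the columns of the combined matrix to the two states at random
combined = np.hstack([A, B])
null = np.empty((100, n_tfs))
for i in range(100):
    perm = rng.permutation(2 * n_tfs)
    null[i] = involvement(transition_matrix(combined[:, perm[:n_tfs]],
                                            combined[:, perm[n_tfs:]]))
print(np.round(observed, 2))
```

In this toy setting only TF 1 changes its targeting between the two states, so it is the only TF with a sizeable observed score. Comparing observed scores with the null scores is what allows TFs to be ranked by their involvement in the transition.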

In [None]:
precomputed=1

Since resampling can take a while to finish, we can set the `precomputed` flag to 1 to load precomputed results instead.

In [None]:
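%%R -i ppath,precomputed
# The %%R cell magic must be the first line of the cell and requires `%load_ext rpy2.ipython`.
# ppath and precomputed are passed in from Python with -i.
if (precomputed == 0) {
    library(netZooR)
    runemt <- function(nnet1,nnet2){
        primary219=read.table(paste0(ppath,nnet1,'.csv'),sep=',',header=TRUE, row.names=1)
        metastasis14=read.table(paste0(ppath,nnet2,'.csv'),sep=',',header=TRUE, row.names=1)
        combinedRegNetworks=as.data.frame(cbind(primary219,metastasis14))
        # length() of a data frame is its number of columns
        nGenes=length(metastasis14)
        design=c(rep(0,nGenes),rep(1,nGenes))
        monsterResRegNet <- monster(combinedRegNetworks, design ,motif=NA, nullPerms=1000, numMaxCores=12, mode='regNet')
        monsterResRegNet
    }
}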

We now run the actual analysis. The cell line that represents the initial state is [ACH-000580](https://depmap.org/portal/cell_line/ACH-000580?tab=mutation). The final metastatic state is represented by cell line [ACH-001569](https://depmap.org/portal/cell_line/ACH-001569?tab=mutation), also called MM415. Both cell lines were sampled from male donors.

In [None]:
%%R -i precomputed
# Run MONSTER only when precomputed results are not used
if (precomputed == 0) {
    primarycell='ACH-000580'
    metastasiscell='ACH-001569'
    monsterResRegNet=runemt(primarycell,metastasiscell)
}

We now sort TFs by their differential involvement scores in the transition to metastasis.

In [None]:
df=pd.read_csv(ppath+'emtrank.csv',index_col=0)
df.columns = ['score']
df = df.sort_values(by='score',ascending=False)
g= sns.jointplot(x=np.array(range(len(df)))+1, y=df["score"], kind='scatter',marker='.')
plt.text(25,110583,'RUNX2')
plt.text(7,146068,'GLI1')
plt.text(7,134995.634187,'CREB3L1')
g.ax_marg_x.set_axis_off()

We find that RUNX2, GLI1, and CREB3L1 are among the top 50 TFs. These TFs were previously implicated in drug resistance (GLI1/CREB3L1) and, most importantly, in epithelial-to-mesenchymal transition (RUNX2).<sup>6</sup>

# References

1- Guebila, Marouen Ben, et al. "The Network Zoo: a multilingual package for the inference and analysis of biological networks." bioRxiv (2022).

2- Weighill, Deborah, et al. "DRAGON: determining regulatory associations using graphical models on multi-omic networks." arXiv preprint arXiv:2104.01690 (2021).

3- Glass, Kimberly, et al. "Passing messages between biological networks to refine predicted interactions." PloS one 8.5 (2013): e64832.

4- Kuijjer, Marieke Lydia, et al. "Estimating sample-specific regulatory networks." Iscience 14 (2019): 226-240.

5- Schlauch, Daniel, et al. "Estimating drivers of cell state transitions using gene regulatory network models." BMC systems biology 11.1 (2017): 1-10.

6- Cohen‐Solal, Karine A., Howard L. Kaufman, and Ahmed Lasfar. "Transcription factors as critical players in melanoma invasiveness, drug resistance, and opportunities for therapeutic drug development." Pigment cell & melanoma research 31.2 (2018): 241-252.