# Building a DRAGON miRNA gene regulatory network using CCLE data
Marouen Ben Guebila<sup>1</sup>

<sup>1</sup> Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA.

# Introduction

MicroRNAs (miRNAs) play an important role in post-transcriptional regulation. For a long time, miRNAs were assumed to act on the translation of mRNA into protein; however, recent evidence suggests that they also modulate mRNA levels<sup>1</sup>. In this notebook, we will build a miRNA-to-mRNA network using [DRAGON](https://netzoo.github.io/zooanimals/dragon/)<sup>2</sup>, which builds multi-omic networks by implementing Gaussian Graphical Models (GGMs) with covariance shrinkage. Compared with a standard GGM, this approach is more accurate, requires fewer samples to estimate partial correlations, and accounts for the different data structures of the omic layers, here the two biological layers miRNA and mRNA.
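To make the idea concrete, here is a minimal sketch of how partial correlations can be obtained from a covariance matrix shrunk toward its diagonal. It is not DRAGON's implementation: the function name `shrunk_partial_correlations` and the single fixed shrinkage weight `lam` are assumptions for illustration, whereas DRAGON estimates its penalty parameters from the data.

```python
import numpy as np

def shrunk_partial_correlations(X, lam=0.1):
    """Partial correlations from a covariance matrix shrunk toward its diagonal.

    X   : samples-by-variables array
    lam : hypothetical fixed shrinkage weight in [0, 1] (an assumption of this sketch)
    """
    S = np.cov(X, rowvar=False)              # empirical covariance
    T = np.diag(np.diag(S))                  # diagonal shrinkage target
    S_shrunk = (1 - lam) * S + lam * T       # shrinkage estimator
    theta = np.linalg.inv(S_shrunk)          # precision matrix
    d = np.sqrt(np.diag(theta))
    pcor = -theta / np.outer(d, d)           # rho_ij = -theta_ij / sqrt(theta_ii * theta_jj)
    np.fill_diagonal(pcor, 1.0)
    return pcor

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                 # toy data: 50 samples, 5 variables
pcor = shrunk_partial_correlations(X, lam=0.2)
print(pcor.shape)  # (5, 5)
```

Shrinkage keeps the covariance matrix invertible even when there are more variables than samples, which is the usual situation for omics data.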

We will use miRNA and mRNA expression data from 938 cell lines in the CCLE database<sup>3</sup>. The network can be visualized in the [GRAND database](https://grand.networkmedicine.org/cell/mirna/).

## Load libraries
First, we start by loading the libraries.

In [None]:
import pandas as pd          # To read dataframes
import os
import numpy as np
from netZooPy import dragon # To load dragon

Next, we define the data path on the Netbooks server.

In [None]:
ppath = '/opt/data/netZooPy/dragonnet/'

In the following, we define a set of functions that allow us to run network inference. First, we define a function that scales the input data before calling DRAGON.

In [None]:
def Scale(X):
    X_temp = X
    X_std = np.std(X_temp, axis=0)
    X_mean = np.mean(X_temp, axis=0)
    return (X_temp - X_mean) / X_std
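As a quick check of what the scaling does, the sketch below applies the same computation to a toy matrix. The function is repeated under the name `scale` so the snippet runs on its own, and the toy values are made up. It also shows why constant columns need to be removed beforehand: their standard deviation is zero, so scaling yields NaN.

```python
import numpy as np

def scale(X):
    # Same computation as Scale above, repeated so the snippet is self-contained
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])
Z = scale(X)
print(np.allclose(Z.mean(axis=0), 0.0))  # True
print(np.allclose(Z.std(axis=0), 1.0))   # True

# A constant column has zero standard deviation, so 0/0 produces NaN
X_const = np.array([[1.0, 5.0],
                    [2.0, 5.0],
                    [3.0, 5.0]])
with np.errstate(invalid='ignore', divide='ignore'):
    Z_const = scale(X_const)
print(np.isnan(Z_const[:, 1]).all())     # True
```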

Then we define a function that filters the network for miRNA-to-gene edges, pruning gene-to-gene and miRNA-to-miRNA edges. This step is important because we are interested in bipartite edges for our analysis.

In [None]:
def createVisNet(methyl,expression,r_methyl_mrna,methylMat,layer1,layer2,nedges=2000):
    # layer1, layer2 and nedges are accepted but not used in this function
    # Build node labels (layer 1 followed by layer 2) from the axes that do not hold the samples
    if methyl.shape[1]==expression.shape[0]:
        pdNames_methyl_mrna = methyl.index.append(expression.columns)
    elif methyl.shape[0]==expression.shape[1]:
        pdNames_methyl_mrna = methyl.columns.append(expression.index)
    elif methyl.shape[0] == expression.shape[0]:
        pdNames_methyl_mrna = methyl.columns.append(expression.columns)
    else:
        raise ValueError('the two dataframes do not share a sample dimension')
    r_methyl_mrna_pd = pd.DataFrame(r_methyl_mrna,index=pdNames_methyl_mrna,columns=pdNames_methyl_mrna)
    # Keep only the layer 1 by layer 2 block (here miRNA by gene)
    r_methyl_mrna_pd = r_methyl_mrna_pd.iloc[:methylMat.shape[1],methylMat.shape[1]:]
    return r_methyl_mrna_pd
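The pruning step amounts to taking the off-diagonal block of a square matrix whose rows and columns list the first layer followed by the second. A minimal sketch on a toy matrix, with made-up node names and random values:

```python
import numpy as np
import pandas as pd

mirna_names = ['miR-a', 'miR-b']       # hypothetical layer 1 nodes
gene_names = ['G1', 'G2', 'G3']        # hypothetical layer 2 nodes
p1 = len(mirna_names)
names = mirna_names + gene_names

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
r = (A + A.T) / 2                      # toy symmetric matrix standing in for partial correlations

full = pd.DataFrame(r, index=names, columns=names)
bipartite = full.iloc[:p1, p1:]        # rows: miRNAs, columns: genes
print(bipartite.shape)  # (2, 3)
```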

The next function is a wrapper that calls DRAGON to estimate partial correlations. The first part estimates the penalty parameters based on the structure of the miRNA and mRNA data. The second part computes the partial correlations between miRNAs and genes using both gene expression and miRNA profiles across the 938 cell line samples.

In [None]:
def estimateDragonValues(ppiMat, expressionMat,pval=False):
    print('computing lambdas')
    lambdas_exp_ppi, lambdas_landscape_exp_ppi = dragon.estimate_penalty_parameters_dragon(ppiMat, expressionMat)
    print('lambdas are ', lambdas_exp_ppi)
    # Compute partial correlations
    print('computing corrs')
    r_exp_ppi = dragon.get_partial_correlation_dragon(ppiMat, expressionMat, lambdas_exp_ppi)
    if pval==True:
        # Optionally compute p-values
        n_exp_ppi =ppiMat.shape[0]
        p1_exp_ppi=ppiMat.shape[1]
        p2_exp_ppi=expressionMat.shape[1]
        adj_p_vals_exp_ppi, p_vals_exp_ppi = dragon.estimate_p_values_dragon(r_exp_ppi, n_exp_ppi, p1_exp_ppi, p2_exp_ppi, lambdas_exp_ppi)
    else:
        adj_p_vals_exp_ppi=[]
    return r_exp_ppi, adj_p_vals_exp_ppi
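The wrapper reads `n`, `p1` and `p2` from the shapes of its inputs, so both matrices are expected to be samples-by-features with the same number of rows. A small check like the following can catch orientation mistakes before a long computation starts. `check_layers` is a hypothetical helper written for this sketch, not part of netZooPy.

```python
import numpy as np

def check_layers(layer1, layer2):
    """Return (n, p1, p2) after basic sanity checks on two samples-by-features arrays."""
    layer1 = np.asarray(layer1, dtype=float)
    layer2 = np.asarray(layer2, dtype=float)
    if layer1.ndim != 2 or layer2.ndim != 2:
        raise ValueError('both layers must be 2-D')
    if layer1.shape[0] != layer2.shape[0]:
        raise ValueError('layers must have the same number of samples (rows)')
    if not (np.isfinite(layer1).all() and np.isfinite(layer2).all()):
        raise ValueError('layers must not contain NaN or infinite values')
    return layer1.shape[0], layer1.shape[1], layer2.shape[1]

rng = np.random.default_rng(0)
toy_mirna = rng.normal(size=(20, 4))   # toy data: 20 samples, 4 miRNAs
toy_genes = rng.normal(size=(20, 6))   # toy data: 20 samples, 6 genes
n, p1, p2 = check_layers(toy_mirna, toy_genes)
print(n, p1, p2)  # 20 4 6
```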

The following function converts CCLE cell line names to DepMap (Dependency Map) IDs, to help with downstream analyses.

In [None]:
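def convertToDepMap(methyl,cellNames):
    # convert cell names to depmap IDs
    interListBool = np.isin(methyl.columns, cellNames['CCLE_Name'])
    # Some cell lines do not exist in depmap so remove them
    methyl = methyl.loc[:, interListBool]
    # intersect1d returns the shared names in sorted order, with their positions in both inputs
    interList = np.intersect1d(methyl.columns, cellNames['CCLE_Name'], return_indices=True)
    # reorder columns to that sorted order so that each column receives its own DepMap ID
    methyl = methyl.iloc[:, interList[1]]
    # rename columns
    methyl.columns = cellNames['DepMap_ID'].iloc[interList[2]].values
    return methyl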
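The same name-to-ID conversion can be illustrated with a dictionary lookup on a toy table. This is an alternative technique to the `np.intersect1d` approach used above, and the cell line names and IDs below are made up.

```python
import pandas as pd

# Hypothetical metadata table with the two columns used by convertToDepMap
cell_names = pd.DataFrame({
    'CCLE_Name': ['A_LUNG', 'B_SKIN', 'C_LIVER'],
    'DepMap_ID': ['ACH-1', 'ACH-2', 'ACH-3'],
})
# Toy miRNA-by-cell table; X_UNKNOWN has no entry in the metadata
data = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=['miR-a', 'miR-b'],
    columns=['C_LIVER', 'X_UNKNOWN', 'A_LUNG'],
)

name_to_id = dict(zip(cell_names['CCLE_Name'], cell_names['DepMap_ID']))
known = [c for c in data.columns if c in name_to_id]   # drop cells missing from the metadata
converted = data[known].rename(columns=name_to_id)     # each column keeps its own values
print(list(converted.columns))  # ['ACH-3', 'ACH-1']
```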

This function aligns two dataframes, matching the columns of one to the rows of the other; here it is used to align the miRNA and mRNA dataframes on the same sample set.

In [None]:
def alignDF(expression, methyl, remove_std=0):
    interListMerge = np.intersect1d(methyl.columns, expression.index, return_indices=True)
    methyl = methyl.iloc[:, interListMerge[1]]
    expression = expression.iloc[interListMerge[2], :]
    if remove_std==1:
        # remove columns with zero std
        a = np.std(expression, axis=0)
        expression = expression.drop(labels=expression.columns[np.where(a == 0)[0]], axis=1)
    elif remove_std==2:
        methyl=methyl.transpose()
        # remove columns with zero std
        a = np.std(methyl, axis=0)
        methyl = methyl.drop(labels=methyl.columns[np.where(a == 0)[0]], axis=1)
    return expression, methyl
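The alignment relies on `np.intersect1d` with `return_indices=True`, which returns the shared labels in sorted order together with their positions in each input. A toy example with made-up sample labels:

```python
import numpy as np

mirna_samples = np.array(['s3', 's1', 's2'])   # columns of a miRNA-by-sample table
expr_samples = np.array(['s2', 's4', 's3'])    # rows of a sample-by-gene table

common, idx_mirna, idx_expr = np.intersect1d(
    mirna_samples, expr_samples, return_indices=True
)
print(common)     # ['s2' 's3']
print(idx_mirna)  # [2 0]
print(idx_expr)   # [0 2]

mirna_vals = np.array([[30, 10, 20]])          # one miRNA; columns s3, s1, s2
expr_vals = np.array([[2], [4], [3]])          # one gene; rows s2, s4, s3
aligned_mirna = mirna_vals[:, idx_mirna]       # columns are now s2, s3
aligned_expr = expr_vals[idx_expr, :]          # rows are now s2, s3
```

After indexing, both tables list the shared samples in the same order, which is what DRAGON needs to pair observations across layers.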

We will also need to read CCLE cell metadata to convert cell line names and IDs.

In [None]:
cellNames=pd.read_csv(ppath+'sample_info.csv')

Finally, we choose to impute missing values with zero, although other approaches can be considered as well.

In [None]:
imputationMissing='zero'
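The cell above only records the choice of imputation method. As a sketch of how such a setting could be applied, the hypothetical helper `impute_missing` below fills missing values with zero or, as one possible alternative, with the column mean. Neither the helper nor the `'mean'` option comes from the original analysis.

```python
import numpy as np
import pandas as pd

def impute_missing(df, method='zero'):
    """Fill missing values in a dataframe (hypothetical helper for illustration)."""
    if method == 'zero':
        return df.fillna(0)
    if method == 'mean':
        return df.fillna(df.mean())   # column-wise mean
    raise ValueError(f'unknown imputation method: {method}')

toy = pd.DataFrame({'G1': [1.0, np.nan, 3.0],
                    'G2': [2.0, 4.0, np.nan]})
filled_zero = impute_missing(toy, 'zero')
filled_mean = impute_missing(toy, 'mean')
print(filled_zero)
print(filled_mean)
```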

# 1. Read miRNA expression data and gene expression data
In this section, we will read and clean the input files.

In [None]:
mirna=pd.read_csv(ppath+'CCLE_miRNA_20181103.gct',sep='\t',comment='#',skiprows=2,index_col=1)

Then we remove unnecessary metadata columns.

In [None]:
mirna = mirna.iloc[:,1:]
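To see what these two steps do, the sketch below parses a tiny in-memory table laid out like a GCT file (a version line, a dimensions line, then a header with `Name` and `Description` columns) using the same `read_csv` arguments. The file content is made up for illustration.

```python
import io
import pandas as pd

# Toy GCT-like content: version line, dimensions line, header, then data rows
gct_text = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tCELL_A\tCELL_B\tCELL_C\n"
    "id1\tmiR-a\t1.0\t2.0\t3.0\n"
    "id2\tmiR-b\t4.0\t5.0\t6.0\n"
)

toy = pd.read_csv(io.StringIO(gct_text), sep='\t', comment='#', skiprows=2, index_col=1)
toy = toy.iloc[:, 1:]          # drop the leftover 'Name' metadata column
print(toy.shape)  # (2, 3)
```

With `index_col=1`, the `Description` column becomes the row index, so the remaining first column is the `Name` identifier that `iloc[:, 1:]` removes.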

Next, we convert cell names to DepMap IDs.

In [None]:
mirna=convertToDepMap(mirna,cellNames)
mirna

The miRNA data has expression measurements for 734 miRNAs across 952 cells.

In [None]:
expression=pd.read_csv(ppath+'CCLE_expression.csv',index_col=0)
expression

The gene expression data has measurements for 19177 genes across 1376 cells. Finally, we align the miRNA and gene expression dataframes on their intersecting cells.

In [None]:
expression,mirna=alignDF(expression,mirna,remove_std=1)
expression

We see that the miRNA and mRNA expression data share 938 cells.

# 2. Scale miRNA and gene expression data

Before calling DRAGON on our two omic layers (miRNA, mRNA), we need to scale the input data, which standardizes the expression of each gene and miRNA across samples to have mean 0 and variance 1.

In [None]:
mirnaMat     = mirna.values
expressionMat= expression.values

The miRNA data is a miRNA-by-sample matrix; therefore, we transpose it so that samples are in rows.

In [None]:
mirnaMat     = Scale(np.transpose(mirnaMat))
expressionMat= Scale(expressionMat)

# 3. Call Dragon

We now call DRAGON on the processed data to estimate the partial correlations. In this specific application, we skip computing the p-values for the associations.

In [None]:
r_mir_exp, adj_p_vals_mir_exp=estimateDragonValues(mirnaMat, expressionMat)

Finally, we prune the edges between the nodes of the same type to create a bipartite network.

In [None]:
mir_exp_edges=createVisNet(mirna,expression,r_mir_exp,mirnaMat,'mir','exp')

The final network links miRNAs to their potential target transcripts. Edge weights represent partial correlations computed across 2 biological layers and 938 cells, correcting for all other variables in the system, which can be useful to infer direct associations and remove spurious correlations. In this network, positive edge weights indicate a positive association, negative edge weights indicate a negative association, and partial correlations of zero indicate conditional independence between the variables given all others. This network can be visualized in the GRAND database: https://grand.networkmedicine.org/cell/mirna/.
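For downstream inspection, a miRNA-by-gene matrix like this one can be reshaped into an edge list and ranked by absolute weight. The sketch below uses a toy matrix with made-up names and values; keeping the top 3 edges is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy miRNA-by-gene matrix of partial correlations (names and values are made up)
toy_net = pd.DataFrame(
    [[0.30, -0.05, 0.00],
     [-0.40, 0.10, 0.20]],
    index=['miR-a', 'miR-b'],
    columns=['G1', 'G2', 'G3'],
)

# One row per (miRNA, gene) pair
edges = toy_net.stack().reset_index()
edges.columns = ['mirna', 'gene', 'weight']

# Rank by absolute weight so strong negative associations are kept too
order = edges['weight'].abs().sort_values(ascending=False).index
top_edges = edges.loc[order].head(3).reset_index(drop=True)
print(top_edges)
```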

# References

1- Catalanotto, Caterina, Carlo Cogoni, and Giuseppe Zardo. "MicroRNA in control of gene expression: an overview of nuclear functions." International journal of molecular sciences 17.10 (2016): 1712.

2- Weighill, Deborah, et al. "DRAGON: Determining Regulatory Associations using Graphical models on multi-Omic Networks." arXiv preprint arXiv:2104.01690 (2021).

3- Ghandi, Mahmoud, et al. "Next-generation characterization of the cancer cell line encyclopedia." Nature 569.7757 (2019): 503-508.