<a href="https://colab.research.google.com/github/pochetlab/CommunityAMARETTO/blob/master/The__AMARETTO_framework_in_GenePattern_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The &#42;AMARETTO framework in GenePattern Notebook

### <i>Multiscale and multimodal inference of regulatory networks to identify cell circuits and their drivers shared and distinct within and across biological systems of human disease</i>

#### Mohsen Nabian<sup>&#35;</sup>, Celine Everaert<sup>&#35;</sup>, Jayendra Shinde<sup>&#35;</sup>, Shaimaa Bakr<sup>&#35;</sup>, Ted Liefeld<sup>&#35;</sup>, Mikel Hernaez, Thomas Baumert, Michael Reich, Jill Mesirov<sup>&#42;</sup>, Vincent Carey<sup>&#42;</sup>, Olivier Gevaert<sup>&#42;</sup>, Nathalie Pochet<sup>&#42;</sup>

## Introduction to the &#42;AMARETTO algorithm and software toolbox

Computational inference of regulatory networks underlying complex human diseases is one of the fundamental goals of systems biology and has shown great promise for deciphering the regulatory cell circuits driving complex disease biology, especially in cancer. The availability of increasing volumes of multimodal data ranging from multi-omics to imaging and clinical data across multiscale systems promises to improve our understanding of the regulatory mechanisms underlying complex human diseases. The main challenges are to integrate the multiple levels of multimodal data within biological systems and to translate them across multiscale biological systems to decipher the underpinnings of human diseases.

Here we introduce the <B>&#42;AMARETTO framework</B> as a toolbox for learning how regulatory networks - cell circuits and their drivers - are shared or distinct within and across biological systems with a broad range of applications, from disease subtyping to driver and drug discovery in studies of human disease such as cancer. The &#42;AMARETTO toolbox currently consists of two algorithmic tools:

(<B>1</B>) The <B>AMARETTO algorithm</B> facilitates multimodal inference of regulatory networks within one biological system via multi-omics data fusion (e.g., genetic, epigenetic, transcriptomic, proteomic) and association with phenotypes derived from clinical (e.g., diagnostic and prognostic) and imaging (e.g., histopathology and radiographic) data.

(<B>2</B>) The <B>Community-AMARETTO algorithm</B> enables multiscale inference to learn how these regulatory networks are shared or distinct across biological systems (e.g., across diseases, across cohorts, across model systems and patient studies, and across <i>in vitro</i> and <i>in vivo</i> systems). 

The &#42;AMARETTO framework is available as user-friendly tools from GitHub, Bioconductor, GenePattern, GenomeSpace, GenePattern Notebook and R Jupyter Notebook (see <B>Resources</B>).

Beyond our recent applications to studies of cancer (see <B>References</B>), the &#42;AMARETTO software toolbox is more generally applicable to studies of human disease, including cancer, infectious, neurologic and immune-mediated diseases.

## &#42;AMARETTO core tools and downstream analytic functionalities

The &#42;AMARETTO framework currently consists of two algorithmic software tools. First, <B>AMARETTO</B> infers regulatory networks via multi-omics data fusion within each biological system. Specifically, AMARETTO identifies potential cancer drivers by identifying genes whose genetic and epigenetic cancer aberrations have a direct functional impact on their own transcriptomic or proteomic expression. These (epi)genetic drivers can be augmented, intersected or replaced with predefined candidate drivers with known regulatory function (e.g., transcription factors from TFutils). AMARETTO then connects these drivers in a regulatory program with modules of co-expressed target genes that they putatively control, defined as regulatory modules or cell circuits, using a regularized regression algorithm (i.e., Elastic Net regression). Second, <B>Community-AMARETTO</B> learns communities or subnetworks by connecting regulatory networks inferred from different systems using an edge betweenness community detection algorithm (i.e., Girvan-Newman) to identify cell circuits and drivers that are shared and distinct across biological systems and diseases.
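To illustrate how a regularized regression can connect a few drivers to a module, the sketch below fits an Elastic Net by cyclic coordinate descent in plain Python. It is not the AMARETTO implementation: the function names, the toy data, the penalty settings, and the assumption that a module is summarized by a single expression profile are all ours, chosen only for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, alpha=0.5, l1_ratio=0.5, n_iter=100):
    """Fit y ~ X w by cyclic coordinate descent.

    Minimizes (1/2n)*||y - Xw||^2 + alpha*l1_ratio*||w||_1
              + 0.5*alpha*(1 - l1_ratio)*||w||^2.
    X is a list of samples; each sample is a list of driver expression values.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of driver j with the residual excluding driver j
            z = 0.0    # mean squared value of driver j
            for i in range(n):
                pred = sum(X[i][k] * w[k] for k in range(p))
                partial_residual = y[i] - pred + X[i][j] * w[j]
                rho += X[i][j] * partial_residual
                z += X[i][j] ** 2
            rho /= n
            z /= n
            w[j] = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
    return w


# Toy data (hypothetical): 4 samples, 2 candidate drivers.
drivers = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
# The module's expression profile depends only on the first driver.
module_expression = [2, -2, 2, -2]
weights = elastic_net(drivers, module_expression)
print(weights)
```

In this toy case the second driver receives a weight of exactly zero, which is the sparsity behavior that makes such a regression useful for selecting a small regulatory program from many candidate drivers.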

The &#42;AMARETTO framework additionally offers tools for <B>downstream analytic functionalities</B> on both module and community levels, including functional annotation of modules and communities (e.g., using known functional categories from MSigDB), stratifying modules and communities for increasingly specific phenotypes (e.g., patient characteristics such as survival, molecular subclasses, known (epi)genetic cancer aberrations, or features derived from histopathology and  radiographic imaging, as well as in-depth studies of etiologies of cancer via spatiotemporal - time course and/or single-cell - studies in model systems), validation of predicted drivers (e.g., using genetic perturbation studies in model systems – knockdown or overexpression experiments of driver genes), discovering drugs targeting drivers and their predicted target genes (e.g., using chemical perturbation studies in model systems), and systematic assessment and benchmarking of the networks for generalized prediction performance of the (sub)networks.

### In this tutorial, we will guide you through the following steps for running the &#42;AMARETTO toolbox in GenePattern Notebook

<B>Before you begin</B>: Log in to GenePattern to access &#42;AMARETTO<br>
<B>Step 1</B>: Access data from TCGA for &#42;AMARETTO<br>
<B>Step 2</B>: Running AMARETTO to infer regulatory networks from functional genomics data or via multi-omics data fusion<br>
<B>Step 3</B>: Viewing AMARETTO results of first AMARETTO analysis<br>
<B>Step 4</B>: Running AMARETTO to infer regulatory networks from multiple data sources<br>
<B>Step 5</B>: Viewing AMARETTO results from multiple data sources<br>
<B>Step 6</B>: Running Community-AMARETTO to identify subnetworks shared/distinct across multiple AMARETTO networks<br>
<B>Step 7</B>:  Viewing Community-AMARETTO results combining multiple AMARETTO analyses<br>

<B>Time complexity</B>, <B>Resources</B>, <B>References</B>, and <B>Funding</B><br>

<B>Questions?</B>

For any questions about the &#42;AMARETTO Notebooks, please contact <B>Nathalie Pochet</B> (<npochet@broadinstitute.org>) and <B>Olivier Gevaert</B> (<olivier.gevaert@stanford.edu>).


# Before you begin, log in to GenePattern to access &#42;AMARETTO

Import libraries required for running the &#42;AMARETTO Notebook by running the next code cell.

Sign in to GenePattern by entering your username and password into the form below.

The AMARETTO and Community-AMARETTO modules are running on the GenePattern Amazon Cloud server (<https://cloud.genepattern.org>).

In [0]:
import csv
import io
import pandas
from gp.data import GCT
import numpy as np
from IPython.display import HTML
from IPython.display import display, Javascript
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import warnings
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

In [0]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.display(genepattern.session.register("https://cloud.genepattern.org/gp", "", ""))


# Step 1. Access data from TCGA for &#42;AMARETTO <small><i>(optional)</i></small>

## Running &#42;AMARETTO on your own data or TCGA data

The &#42;AMARETTO framework can be run either on your own data or on preloaded TCGA data:

(<B>1</B>) <B>your own data</B>: AMARETTO can be run with gene expression data only. If you also have matched DNA copy number and DNA methylation data, these can be added as well. In this case, you can proceed directly to <B>Step 2</B>.

(<B>2</B>) <B>TCGA data</B>: by selecting a cohort from The Cancer Genome Atlas (TCGA) database. In this case, you can continue in this <B>Step 1</B>.


## Access to processed data from TCGA

The processed genetic, epigenetic and transcriptomic data sources from TCGA are directly accessible via this function for analysis by &#42;AMARETTO within this GenePattern Notebook. These TCGA data files are derived from The Cancer Genome Atlas (TCGA) as available from the Broad GDAC FireHose Portal (<https://gdac.broadinstitute.org/#>).

Once you select a cancer site from the drop-down menu, three data files will be loaded: (<B>1</B>) mRNA gene expression data, (<B>2</B>) DNA copy number variation data, and (<B>3</B>) DNA methylation data. These files will then be available for selection in the drop-down menus in the next steps.

The TCGA cancer (sub)types currently available in this &#42;AMARETTO in GenePattern Notebook are:

<table>
    <tr><td>BLCA</td><td>bladder urothelial carcinoma</td></tr>
    <tr><td>BRCA</td><td>breast invasive carcinoma</td></tr>
    <tr><td>CESC</td><td>cervical squamous cell carcinoma and endocervical adenocarcinoma</td></tr>
    <tr><td>CHOL</td><td>cholangiocarcinoma</td></tr>
    <tr><td>COAD</td><td>colon adenocarcinoma</td></tr>
    <tr><td>ESCA</td><td>esophageal carcinoma</td></tr>
    <tr><td>GBM</td><td>glioblastoma multiforme</td></tr>
    <tr><td>HNSC</td><td>head and neck squamous cell carcinoma</td></tr>
    <tr><td>KIRC</td><td>kidney renal clear cell carcinoma</td></tr>
    <tr><td>KIRP</td><td>kidney renal papillary cell carcinoma</td></tr>
    <tr><td>LAML</td><td>acute myeloid leukemia</td></tr>
    <tr><td>LGG</td><td>brain lower grade glioma </td></tr>
    <tr><td>LIHC</td><td>liver hepatocellular carcinoma</td></tr>
    <tr><td>LUAD</td><td>lung adenocarcinoma</td></tr>
    <tr><td>LUSC</td><td>lung squamous cell carcinoma</td></tr>
    <tr><td>OV</td><td>ovarian serous cystadenocarcinoma</td></tr>
    <tr><td>PAAD</td><td>pancreatic adenocarcinoma</td></tr>
    <tr><td>PCPG</td><td>pheochromocytoma and paraganglioma</td></tr>
    <tr><td>READ</td><td>rectum adenocarcinoma</td></tr>
    <tr><td>SARC</td><td>sarcoma</td></tr>
    <tr><td>STAD</td><td>stomach adenocarcinoma</td></tr>
    <tr><td>THCA</td><td>thyroid carcinoma</td></tr>
    <tr><td>THYM</td><td>thymoma</td></tr>
    <tr><td>UCEC</td><td>uterine corpus endometrial carcinoma</td></tr>
</table>

The genetic, epigenetic and transcriptomic data sources for the TCGA cancer (sub)types included in this &#42;AMARETTO in GenePattern Notebook can also be downloaded via <https://datasets.genepattern.org/?prefix=data/module_support_files/Amaretto/>.

In [0]:
@genepattern.build_ui(parameters={
     "cancerType": {
        "default": "COAD",
        "type": "choice",
        "choices": {
            "BLCA": "BLCA",
            "BRCA": "BRCA",
            "CESC": "CESC",
            "CHOL": "CHOL",
            "COAD": "COAD",
            "ESCA": "ESCA",
            "GBM": "GBM",
            "HNSC": "HNSC",
            "KIRC": "KIRC",
            "KIRP": "KIRP",
            "LAML": "LAML",
            "LGG": "LGG",
            "LIHC": "LIHC",
            "LUAD": "LUAD",
            "LUSC": "LUSC",
            "OV": "OV",
            "PAAD": "PAAD",
            "PCPG": "PCPG",
            "READ": "READ",
            "SARC": "SARC",
            "STAD": "STAD",
            "THCA": "THCA",
            "THYM": "THYM",
            "UCEC": "UCEC"
        }
    },
   "output_var": {
        "name": "results",
        "description": "These are the results",
        "hide": True
    }
})
def getExampleTCGAFiles(cancerType):
    # URL or file path relative to the notebook's directory
    exp_path = 'https://datasets.genepattern.org/data/module_support_files/Amaretto/TCGA_'+cancerType+'_Expression.gct'
    cn_path = 'https://datasets.genepattern.org/data/module_support_files/Amaretto/TCGA_'+cancerType+'_CNV.gct'
    meth_path = 'https://datasets.genepattern.org/data/module_support_files/Amaretto/TCGA_'+cancerType+'_Methylation.gct'

    return genepattern.GPUIOutput(files=[exp_path, cn_path, meth_path])
    


# Step 2. Running AMARETTO to infer regulatory networks from functional genomics data or via multi-omics data fusion

## Running AMARETTO on own and TCGA data

The AMARETTO algorithm that infers regulatory networks within one cohort or biological system can be run in two ways:

(<B>1</B>) <B>Your own data</B>: by uploading your own data. In this case, the minimal requirement is to upload a functional genomics (i.e., mRNA or protein gene expression) data file. When available, the user can additionally upload genetic (e.g., DNA copy number variation) and/or epigenetic (e.g., DNA methylation) data files.

(<B>2</B>) <B>TCGA data</B>: by selecting the multi-omics (functional genomics: mRNA gene expression, genetic: DNA copy number variation, and epigenetic: DNA methylation) or only the functional genomics (mRNA gene expression) data files from a previously selected cohort from The Cancer Genome Atlas (TCGA) database. See <B>Step 1</B>.

For any type of multi-omics data (genetic, epigenetic, transcriptomic and proteomic), data files should be formatted as .GCT files (rows represent genes, columns represent samples, see .GCT format <http://software.broadinstitute.org/cancer/software/genepattern/file-formats-guide>).
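As a quick way to check that a file follows this layout before uploading it, the sketch below parses a GCT v1.2 file with the standard library only. It assumes the usual layout (a `#1.2` version line, a dimensions line, a header line starting with `Name` and `Description`, then one row per gene); the function name `parse_gct` and the toy values are ours and are not part of the GenePattern library.

```python
import io


def parse_gct(handle):
    """Parse a GCT v1.2 file into (sample_names, {gene: [values]})."""
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError("expected a '#1.2' version line, got %r" % version)
    # Second line: number of genes (rows) and number of samples (columns).
    n_genes, n_samples = (int(x) for x in handle.readline().split()[:2])
    header = handle.readline().rstrip("\r\n").split("\t")
    samples = header[2:]  # the first two columns are Name and Description
    if len(samples) != n_samples:
        raise ValueError("sample count does not match the dimensions line")
    data = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank lines
        data[fields[0]] = [float(v) for v in fields[2:]]
    if len(data) != n_genes:
        raise ValueError("gene count does not match the dimensions line")
    return samples, data


# Toy file content (hypothetical genes and values).
example = io.StringIO(
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "GENE_A\tna\t1.0\t2.5\t0.0\n"
    "GENE_B\tna\t-1.0\t0.5\t3.0\n"
)
samples, expression = parse_gct(example)
print(samples, expression)
```

For a file on disk, the same function can be called as `parse_gct(open(path))`, where `path` is the location of your own .GCT file.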

In both scenarios, the next step involves choosing the candidate driver definitions. 

## Running AMARETTO with various data and/or candidate driver definitions

In case <B>only functional genomics</B> (i.e., mRNA or protein gene expression) data is available, a predefined list of candidate regulators is required for analysis by the AMARETTO algorithm.

Alternatively, when <B>either genetic</B> (e.g., DNA copy number variation) <B>or epigenetic</B> (e.g., DNA methylation) data <B>or both</B> are available, there are various options for defining candidate drivers for analysis by the AMARETTO algorithm.

The AMARETTO algorithm can take various definitions of candidate drivers:

(<B>1</B>) Select or upload <B>predefined lists of candidate drivers</B>, such as transcription factors (e.g., from TFutils <https://bioconductor.org/packages/release/bioc/html/TFutils.html> <https://f1000research.com/articles/8-152/v1>, MSigDB C3 TFT <http://software.broadinstitute.org/gsea/msigdb/collections.jsp>,...) or cancer drivers (e.g., from MSigDB <http://software.broadinstitute.org/gsea/msigdb/collections.jsp>, COSMIC <https://cancer.sanger.ac.uk/cosmic>,...);

(<B>2</B>) Select <B>computed lists of candidate drivers</B> from genetic (e.g., DNA copy number variation) or epigenetic (e.g., DNA methylation) data sources (if corresponding data files are uploaded);

(<B>3</B>) Take the <B>union or intersection</B> between <B>predefined</B> (<B>1</B>) and <B>computed</B> (<B>2</B>) <B>lists of candidate drivers</B>.
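The three options above amount to simple set operations on gene lists. The following is a minimal sketch; the function name, the gene identifiers, and the mode names `"union"` and `"computed"` are our assumptions for illustration (only `"predefined"` and `"intersect"` appear as values of `driver.gene.list.selection.mode` in this notebook).

```python
def combine_driver_lists(predefined, computed, mode):
    """Combine predefined and computed candidate driver lists.

    mode is one of "predefined", "computed", "union" or "intersect".
    Returns a sorted list of unique gene identifiers.
    """
    predefined, computed = set(predefined), set(computed)
    if mode == "predefined":
        selected = predefined
    elif mode == "computed":
        selected = computed
    elif mode == "union":
        selected = predefined | computed
    elif mode == "intersect":
        selected = predefined & computed
    else:
        raise ValueError("unknown selection mode: %r" % mode)
    return sorted(selected)


# Hypothetical gene identifiers, for illustration only.
predefined_tfs = ["GENE_A", "GENE_B", "GENE_C"]
computed_drivers = ["GENE_B", "GENE_C", "GENE_D"]

print(combine_driver_lists(predefined_tfs, computed_drivers, "union"))
print(combine_driver_lists(predefined_tfs, computed_drivers, "intersect"))
```

Taking the intersection yields a shorter, more conservative list (genes with both known regulatory function and an observed aberration), while the union casts a wider net.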

Computed lists of candidate drivers from genetic (e.g., DNA copy number variation) or epigenetic (e.g., DNA methylation) data sources are precomputed for TCGA data; for your own data, additional algorithms may be required. For TCGA data, the <B>GISTIC algorithm</B> (also available from GenePattern) is used to identify somatic recurrent DNA copy number aberrations (copy number amplifications and deletions) and the <B>MethylMix algorithm</B> (available from Bioconductor) is used to identify recurrent DNA methylation aberrations (hyper- and hypomethylated sites) that have a direct functional impact on their own gene expression levels (positive association for DNA copy number aberrations, negative association for DNA methylation aberrations).
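The sign convention described above can be illustrated with a plain correlation check. This sketch is not how GISTIC or MethylMix work internally (both use their own statistical models); it only shows, under our own simplifying assumption of a Pearson correlation on toy values, what a positive association for copy number and a negative association for methylation look like. All names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def has_expected_association(aberration, expression, data_type):
    """Check the direction of association between an aberration and expression."""
    r = pearson(aberration, expression)
    if data_type == "copy_number":
        return r > 0  # amplification should raise expression
    if data_type == "methylation":
        return r < 0  # hypermethylation should lower expression
    raise ValueError("unknown data type: %r" % data_type)


# Toy values for one gene across four samples (hypothetical).
copy_number = [0, 1, 2, 3]
methylation = [0.1, 0.2, 0.8, 0.9]
rising_expression = [1, 2, 3, 4]
falling_expression = [4, 3, 2, 1]

print(has_expected_association(copy_number, rising_expression, "copy_number"))
print(has_expected_association(methylation, falling_expression, "methylation"))
```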

## Other parameters settings for running AMARETTO

Additional parameters that can be set by the user for running AMARETTO include:

(<B>1</B>) <B>Number of regulatory modules</B> (i.e., cell circuits and their drivers) to be inferred. As a rule of thumb, high-quality regulatory modules comprise ~60-80 genes, so the optimal range for the number of modules can be calculated by dividing the total number of genes in the analysis (determined by the percent of most varying genes, see parameter <B>2</B> below) by these numbers. Depending on the number of genes in the analysis, the suggested range is ~100-200 modules.

(<B>2</B>) <B>Percent of most varying genes</B> across the sample population (% genes) to be included in the analysis. This is based on the functional genomics data, which can be population RNA-Seq, single-cell RNA-Seq, or proteomics data. While genes that do not vary across the population (i.e., stdev zero) are automatically filtered out from the analysis, it is recommended to adjust the % variation filter for each dataset. Depending on the type of data, the ideal suggested range is ~25%-75% genes that vary the most across the population.

(<B>3</B>) Provide a <B>base "file name" for output files</B> generated by the AMARETTO analysis (e.g., myAmarettoAnalysis).

(<B>4</B>) Define <B>collections of known functional categories</B> for functional characterization of the regulatory modules or cell circuits. One or more collections can be selected from the predefined MSigDB drop-down list (see <http://software.broadinstitute.org/gsea/msigdb/collections.jsp>) and/or uploaded by the user.
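The first two parameters interact: the percent of most varying genes determines how many genes enter the analysis, which in turn suggests a range for the number of modules. The sketch below shows one way to reason about both. The function names are ours, the use of the population standard deviation is an assumption, and we also assume the percentage is applied after removing genes with zero standard deviation; the AMARETTO module may apply the filter differently.

```python
from statistics import pstdev


def most_varying_genes(expression, percent):
    """Return the top `percent` of genes ranked by standard deviation.

    expression maps gene -> list of values across samples.
    Genes with zero standard deviation are dropped first.
    """
    spread = {gene: pstdev(values) for gene, values in expression.items()}
    varying = [gene for gene, sd in spread.items() if sd > 0]
    ranked = sorted(varying, key=lambda gene: spread[gene], reverse=True)
    n_keep = max(1, int(len(ranked) * percent / 100))
    return ranked[:n_keep]


def suggested_module_range(n_genes, min_size=60, max_size=80):
    """Rule of thumb: modules of ~60-80 genes each."""
    return n_genes // max_size, n_genes // min_size


# Toy expression matrix (hypothetical genes and values).
toy_expression = {
    "GENE_A": [1, 5, 9, 13],
    "GENE_B": [1, 2, 3, 4],
    "GENE_C": [2, 2, 2, 2],  # constant, so it is filtered out
    "GENE_D": [1, 1, 2, 2],
}
print(most_varying_genes(toy_expression, 75))
print(suggested_module_range(12000))
```

For example, 12,000 genes retained after filtering would suggest roughly 150 to 200 modules, consistent with the range recommended above.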

## Time complexity of AMARETTO

Run time depends on the size of the data. For example, for TCGA cohorts of ~300-500 samples, it can take up to ~2 hours to run the AMARETTO algorithm and generate reports on the GenePattern Amazon Cloud server. Once runs are finished and reports are generated, they can be accessed for viewing in <B>Step 3</B>.

In [0]:
amaretto_task = gp.GPTask(genepattern.session.get(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00378')
amaretto_job_spec = amaretto_task.make_job_spec()
amaretto_job_spec.set_parameter("expression.file", "https://datasets.genepattern.org/data/module_support_files/Amaretto/TCGA_LIHC_Expression.gct")
amaretto_job_spec.set_parameter("copy.number.file", "https://datasets.genepattern.org/data/module_support_files/Amaretto/TCGA_LIHC_CNV.gct")
amaretto_job_spec.set_parameter("methylation.file", "https://datasets.genepattern.org/data/module_support_files/Amaretto/TCGA_LIHC_Methylation.gct")
amaretto_job_spec.set_parameter("number.of.modules", "150")
amaretto_job_spec.set_parameter("percent.genes", "75")
amaretto_job_spec.set_parameter("output.file", "LIHC_test")
amaretto_job_spec.set_parameter("driver.gene.list.file", "")
amaretto_job_spec.set_parameter("driver.gene.list", "TFs_TFutils_union")
amaretto_job_spec.set_parameter("driver.gene.list.selection.mode", "intersect")
amaretto_job_spec.set_parameter("gene.sets.database", ["ftp://gpftp.broadinstitute.org/module_support_files/msigdb/gmt/h.all.v6.2.symbols.gmt", "ftp://gpftp.broadinstitute.org/module_support_files/msigdb/gmt/c2.all.v6.2.symbols.gmt"])
amaretto_job_spec.set_parameter("job.memory", "2 Gb")
amaretto_job_spec.set_parameter("job.walltime", "02:00:00")
amaretto_job_spec.set_parameter("job.cpuCount", "4")
genepattern.display(amaretto_task)

job102535 = gp.GPJob(genepattern.session.get(0), 102535)
genepattern.display(job102535)


# Step 3. Viewing AMARETTO results of first AMARETTO analysis

## Queryable report generated for AMARETTO analysis

The <B>AMARETTO report</B> includes:

(<B>1</B>) An overview <B>index.html</B> page that provides:
- A summary of the analyzed data (# of samples in genetic, epigenetic and functional genomics data, % and # of most varying genes, # of regulatory modules)
- A queryable overview table of inferred regulatory modules
- A queryable overview table of gene assignments (drivers and targets) to regulatory modules
- A queryable overview table with enrichments of known functional categories in regulatory modules

(<B>2</B>) For each regulatory module a <B>module#.html</B> page is generated that provides:
- A heatmap visualization of the functional genomics or multi-omics data of inferred regulatory modules, onto which known phenotypic information can be mapped (e.g., imaging and clinical features)
- A queryable table of inferred regulatory programs, including activator and repressor driver genes
- A queryable table of target genes of the regulatory modules
- A queryable table with enrichments of known functional categories in regulatory modules

Future releases will include:
- Queryable overview and module-specific tables and links to reports for association to phenotypes (e.g., imaging, clinical)
- Queryable overview and module-specific tables of drivers predicted by AMARETTO that are validated by genetic perturbation studies in model systems
- Queryable overview and module-specific tables of drugs predicted by AMARETTO by integrating chemical perturbation studies in model systems
- Gene-level ontology network representations from AMARETTO & Community-AMARETTO results (e.g., <https://monabiyan.shinyapps.io/app_1/>)

## Output formats downloadable from AMARETTO report

The complete AMARETTO results and reports can be selected and downloaded as .ZIP files, which serve as input for the Community-AMARETTO analysis in <B>Steps 6 and 7</B>. Overview and module-specific Tables and Heatmaps can be saved as .CSV, .Excel, .PDF and .PNG files.

In [0]:
from IPython.display import display, HTML

@genepattern.build_ui(
    name="Display AMARETTO report",
    description="Display the AMARETTO HTML report within a cell of the notebook.",
    parameters={"fileUrl": {"name": "Index file URL", "type": "file", "kinds": ["/AMARETTOhtmls/index.html"]},
                "output_var": {"hide": True}}
)
def displayAmarettoReport(fileUrl):
    # Embed the report's index page in an iframe below this cell.
    width = "100%"
    height = 1500
    # Example: fileUrl = "https://cloud.genepattern.org/gp/jobResults/33903/report_html/index.html"
    return HTML('<iframe src="{0}" width="{1}" height="{2}"></iframe>'.format(fileUrl, width, height))


# Step 4. Running AMARETTO to infer regulatory networks from multiple data sources <small>(repeat steps 4 & 5) <i>(optional)</i></small>

## Running AMARETTO on one or more additional datasets

For comparative inference of networks shared or distinct across datasets, cohorts, biological systems, or diseases, previous <B>Steps 2 and 3</B> can be repeated multiple times in <B>Steps 4 and 5</B>.
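When repeating the analysis for several cohorts, it can help to generate the job parameters programmatically rather than editing each cell by hand. The sketch below builds one parameter dictionary per cohort using the parameter names and the TCGA file naming pattern that appear elsewhere in this notebook; the helper name `amaretto_parameters` and its defaults are our own, and it does not contact the GenePattern server.

```python
BASE_URL = "https://datasets.genepattern.org/data/module_support_files/Amaretto/"


def amaretto_parameters(cohort, number_of_modules=150, percent_genes=75):
    """Build AMARETTO job parameters for one preloaded TCGA cohort."""
    prefix = BASE_URL + "TCGA_" + cohort
    return {
        "expression.file": prefix + "_Expression.gct",
        "copy.number.file": prefix + "_CNV.gct",
        "methylation.file": prefix + "_Methylation.gct",
        "number.of.modules": str(number_of_modules),
        "percent.genes": str(percent_genes),
        "output.file": cohort + "_test",
    }


# One parameter set per cohort to be compared later by Community-AMARETTO.
runs = {cohort: amaretto_parameters(cohort) for cohort in ["LIHC", "GBM"]}

for cohort, parameters in runs.items():
    print(cohort)
    for name, value in parameters.items():
        # In the notebook, each pair could be applied with
        # amaretto_job_spec.set_parameter(name, value).
        print("  ", name, "=", value)
```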

In [0]:
amaretto_task = gp.GPTask(genepattern.session.get(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00378')
amaretto_job_spec = amaretto_task.make_job_spec()
amaretto_job_spec.set_parameter("expression.file", "https://datasets.genepattern.org/data/module_support_files/Amaretto/TCGA_GBM_Expression.gct")
amaretto_job_spec.set_parameter("copy.number.file", "https://datasets.genepattern.org/data/module_support_files/Amaretto/TCGA_GBM_CNV.gct")
amaretto_job_spec.set_parameter("methylation.file", "https://datasets.genepattern.org/data/module_support_files/Amaretto/TCGA_GBM_Methylation.gct")
amaretto_job_spec.set_parameter("number.of.modules", "100")
amaretto_job_spec.set_parameter("percent.genes", "75")
amaretto_job_spec.set_parameter("output.file", "GBM_test")
amaretto_job_spec.set_parameter("driver.gene.list.file", "")
amaretto_job_spec.set_parameter("driver.gene.list", "TFs_TFutils_union")
amaretto_job_spec.set_parameter("driver.gene.list.selection.mode", "predefined")
amaretto_job_spec.set_parameter("gene.sets.database", ["ftp://gpftp.broadinstitute.org/module_support_files/msigdb/gmt/h.all.v6.2.symbols.gmt", "ftp://gpftp.broadinstitute.org/module_support_files/msigdb/gmt/c2.all.v6.2.symbols.gmt"])
amaretto_job_spec.set_parameter("job.memory", "2 Gb")
amaretto_job_spec.set_parameter("job.walltime", "02:00:00")
amaretto_job_spec.set_parameter("job.cpuCount", "4")
genepattern.display(amaretto_task)

job102536 = gp.GPJob(genepattern.session.get(0), 102536)
genepattern.display(job102536)


# Step 5. Viewing AMARETTO results from multiple data sources <small>(repeat steps 4 & 5) <i>(optional)</i></small>

## Queryable AMARETTO reports and output formats downloadable from AMARETTO report for one or more additional datasets

The AMARETTO reports for one or more additional datasets facilitate querying and downloading output formats similarly as in <B>Step 3</B>. The complete AMARETTO results and reports can be selected and downloaded as .ZIP files, which serve as input to the Community-AMARETTO analysis in <B>Steps 6 and 7</B>.

In [0]:
from IPython.display import display, HTML

@genepattern.build_ui(
    name="Display AMARETTO report",
    description="Display the AMARETTO HTML report within a cell of the notebook.",
    parameters={"fileUrl": {"name": "Index file URL", "type": "file", "kinds": ["/AMARETTOhtmls/index.html"]},
                "output_var": {"hide": True}}
)
def displayAmarettoReport(fileUrl):
    # Embed the report's index page in an iframe below this cell.
    width = "100%"
    height = 1500
    # Example: fileUrl = "https://cloud.genepattern.org/gp/jobResults/33903/report_html/index.html"
    return HTML('<iframe src="{0}" width="{1}" height="{2}"></iframe>'.format(fileUrl, width, height))


# Step 6. Running Community-AMARETTO to identify subnetworks shared/distinct across multiple AMARETTO networks <small><i>(optional)</i></small>

## Running Community-AMARETTO to identify regulatory networks shared and distinct across multiple systems

The Community-AMARETTO algorithm takes as input results and reports from two or more previous AMARETTO analyses to identify regulatory networks (i.e., cell circuits and their drivers) that are shared and distinct across multiple datasets, cohorts, biological systems and diseases.

## Selecting AMARETTO results and reports for Community-AMARETTO analysis

The user can select the .ZIP files that contain the results and reports from two or more previous AMARETTO analyses (run in <B>Steps 2 through 5</B> above).

## Other parameters settings for running Community-AMARETTO

Additional parameters that can be set by the user for running Community-AMARETTO include:

(<B>1</B>) Provide a <B>base "file name" for output files</B> generated by the Community-AMARETTO analysis (e.g., myCommunityAmarettoAnalysis).

(<B>2</B>) Define <B>collections of known functional categories</B> for functional characterization of the regulatory communities or subnetworks. One or more collections can be selected from the predefined MSigDB drop-down list (see <http://software.broadinstitute.org/gsea/msigdb/collections.jsp>) and/or uploaded by the user.

(<B>3</B>) <B>Filtering edges in the subnetworks or communities based on p-value significance</B>. Edges in the network with p-values larger than the cutoff value will be filtered out. The default cutoff p-value is 0.05.

(<B>4</B>) <B>Filtering edges in the subnetworks or communities based on the minimum number of overlapping genes</B>. Edges in the network with fewer overlapping genes than the cutoff value will be filtered out. The default cutoff value is 5 overlapping genes.

(<B>5</B>) <B>Filtering edges in the subnetworks or communities that do not satisfy all of the following conditions</B>: <B>1.</B> the number of nodes in the community is larger than 1% of the total number of nodes in the network, <B>2.</B> the number of datasets/cohorts represented in the community is larger than 10% of the subnetwork size (and, at a minimum, larger than 2), <B>3.</B> the ratio of edges inside to edges outside the community is larger than 0.5. The user can either apply these criteria, in which case edges in the network that do not satisfy all of them are filtered out, or skip them to retain all edges.
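The filtering rules in parameters (3) to (5) can be written down as two small predicates. This is a sketch of the rules as worded above, read literally, and not the Community-AMARETTO source code; the function names, the module labels, the toy numbers, and the treatment of a community with no outside edges are our assumptions.

```python
def keep_edge(p_value, n_overlap, p_cutoff=0.05, min_overlap=5):
    """Keep an edge if it is significant and has enough overlapping genes."""
    return p_value <= p_cutoff and n_overlap >= min_overlap


def keep_community(n_nodes, total_nodes, n_datasets, edges_inside, edges_outside):
    """Apply the three community-level conditions, read literally from the text."""
    big_enough = n_nodes > 0.01 * total_nodes
    diverse_enough = n_datasets > 0.10 * n_nodes and n_datasets > 2
    if edges_outside == 0:
        # Assumption: a community with no outside edges passes if it has any inside edge.
        cohesive = edges_inside > 0
    else:
        cohesive = edges_inside / edges_outside > 0.5
    return big_enough and diverse_enough and cohesive


# Hypothetical edges: (module 1, module 2, p-value, number of overlapping genes).
edges = [
    ("LIHC_module_1", "GBM_module_7", 0.001, 12),
    ("LIHC_module_2", "GBM_module_3", 0.20, 30),  # fails the p-value cutoff
    ("LIHC_module_5", "GBM_module_9", 0.01, 3),   # fails the overlap cutoff
]
kept = [(a, b) for a, b, p, n in edges if keep_edge(p, n)]
print(kept)
print(keep_community(n_nodes=10, total_nodes=300, n_datasets=3,
                     edges_inside=12, edges_outside=6))
```

Note that under this literal reading a community needs more than two represented datasets, so the community-level filter is only meaningful when three or more AMARETTO networks are compared.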

## Time complexity of Community-AMARETTO

Depending on the number of regulatory networks that are submitted for comparative analysis by Community-AMARETTO, it typically takes ~15 minutes for two networks up to ~45 minutes for more than five networks to run the Community-AMARETTO algorithm and generate the report on the GenePattern Amazon Cloud server. Once the report is generated, it can be accessed for viewing in <B>Step 7</B>.


In [0]:
communityamaretto_task = gp.GPTask(genepattern.session.get(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00381')
communityamaretto_job_spec = communityamaretto_task.make_job_spec()
communityamaretto_job_spec.set_parameter("amaretto.result.files", ["https://cloud.genepattern.org/gp/jobResults/102359/LIHC_test_AMARETTOresults_20190324_025003.zip", "https://cloud.genepattern.org/gp/jobResults/102360/GBM_test_AMARETTOresults_20190324_025414.zip"])
communityamaretto_job_spec.set_parameter("output.file", "CommunityAMARETTOResults_LIHC_test_GBM_test")
communityamaretto_job_spec.set_parameter("amaretto.report.files", ["https://cloud.genepattern.org/gp/jobResults/102359/LIHC_test_report.zip", "https://cloud.genepattern.org/gp/jobResults/102360/GBM_test_report.zip"])
communityamaretto_job_spec.set_parameter("gene.sets.database", ["ftp://gpftp.broadinstitute.org/module_support_files/msigdb/gmt/h.all.v6.2.symbols.gmt", "ftp://gpftp.broadinstitute.org/module_support_files/msigdb/gmt/c2.all.v6.2.symbols.gmt"])
communityamaretto_job_spec.set_parameter("p-value", "0.05")
communityamaretto_job_spec.set_parameter("min.number.overlapping.genes", "5")
communityamaretto_job_spec.set_parameter("filter.communities", "FALSE")
communityamaretto_job_spec.set_parameter("job.memory", "2 Gb")
communityamaretto_job_spec.set_parameter("job.walltime", "02:00:00")
communityamaretto_job_spec.set_parameter("job.cpuCount", "1")
genepattern.display(communityamaretto_task)

job102365 = gp.GPJob(genepattern.session.get(0), 102365)
genepattern.display(job102365)


# Step 7.  Viewing Community-AMARETTO results combining multiple AMARETTO analyses <small><i>(optional)</i></small>

## Queryable report generated for Community-AMARETTO analysis

The <B>Community-AMARETTO report</B> includes:

(<B>1</B>) An overview <B>index.html</B> page that provides:
- A summary of the analyzed networks (including links to the original AMARETTO networks derived from multiple datasets)
- A network graph overview of the subnetworks or communities learned across multiple datasets
- A queryable overview table of shared/distinct communities with assignments of modules to communities across all datasets
- A queryable overview table of shared/distinct drivers of communities with assignments of drivers to communities across all datasets
- A queryable overview table of gene assignments (drivers and targets) to communities
- A queryable overview table with enrichments of known functional categories in communities

(<B>2</B>) For each regulatory subnetwork or community a <B>community#.html</B> page is generated that provides:
- A network graph visualization of the subnetworks or communities learned across multiple datasets
- A queryable table of shared/distinct communities across multiple datasets, including regulatory modules that are shared/distinct across datasets
- A queryable table of gene assignments (drivers and targets) in communities
- A queryable table with enrichments of known functional categories in communities

## Output formats downloadable from Community-AMARETTO report

The complete Community-AMARETTO report can be selected and downloaded as a .ZIP file, which links directly to the results and reports from the original AMARETTO analyses from which it was derived. Overview and community-specific Tables and Network Graphs can be saved as .CSV, .Excel, .PDF and .PNG files.

from IPython.display import display, HTML

@genepattern.build_ui(
    name="Display community-AMARETTO report",
    description="Display the community-AMARETTO HTML report within a cell of the notebook.",
    parameters={
        "fileUrl": {"name": "Index file URL", "type": "file", "kinds": ["/htmls/index.html"]},
        "output_var": {"hide": True},
    },
)
def displayAmarettoReport(fileUrl):
    # Embed the report's index page in an iframe that spans the width of the cell.
    width = "100%"
    height = 1500
    # Example: fileUrl = "https://cloud.genepattern.org/gp/jobResults/33903/report_html/index.html"
    # Attribute values are quoted so that URLs containing "=" or "&" stay intact.
    return HTML('<iframe src="{0}" width="{1}" height="{2}"></iframe>'.format(fileUrl, width, height))

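When a report URL contains characters such as `&` or quotes, escaping it before placing it in the `src` attribute keeps the iframe markup well-formed. The following is a minimal sketch of such a helper using only the standard library. The function name `report_iframe` and the example URL (including its job number) are hypothetical and are not part of the GenePattern API.

```python
from html import escape


def report_iframe(file_url, width="100%", height=1500):
    """Build an <iframe> tag for a report URL with quoted, escaped attributes."""
    return '<iframe src="{0}" width="{1}" height="{2}"></iframe>'.format(
        escape(str(file_url), quote=True),  # escapes &, <, > and quotes
        escape(str(width), quote=True),
        int(height),                        # rejects non-numeric heights
    )


# Hypothetical job number and query string, for illustration only.
example = report_iframe(
    "https://cloud.genepattern.org/gp/jobResults/0/htmls/index.html?a=1&b=2"
)
print(example)
```

Inside a notebook, the returned string can be wrapped in `IPython.display.HTML` to render the report in the output cell.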

# Resources

The source code and user-friendly tools of the current &#42;AMARETTO toolbox and future developments are available from GitHub, Bioconductor, GenePattern, GenomeSpace, GenePattern Notebook and R Jupyter Notebook.

#### &#42;AMARETTO in Bioconductor
- <B>AMARETTO in Bioconductor</B>: <https://www.bioconductor.org/packages/devel/bioc/html/AMARETTO.html><br>
- <B>Community-AMARETTO in Bioconductor</B>: in preparation for submission<br>

#### &#42;AMARETTO in GitHub
- <B>AMARETTO in GitHub</B>: <https://github.com/gevaertlab/AMARETTO><br>
- <B>Community-AMARETTO in GitHub</B>: <https://github.com/broadinstitute/CommunityAMARETTO><br>

#### &#42;AMARETTO in GenePattern
- <B>AMARETTO in GenePattern</B>:<br>
<https://cloud.genepattern.org/gp/pages/index.jsf?lsid=urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00378><br>
- <B>Community-AMARETTO in GenePattern</B>:<br>
<https://cloud.genepattern.org/gp/pages/index.jsf?lsid=urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00381><br>

#### &#42;AMARETTO in GenomeSpace
The AMARETTO and Community-AMARETTO modules in GenePattern are also available within GenomeSpace: <http://www.genomespace.org/>

#### &#42;AMARETTO Notebooks
The <B>&#42;AMARETTO in GenePattern and R Notebooks</B> provide users with a complete analysis pipeline that enables running AMARETTO on one or multiple data cohorts and connecting them using Community-AMARETTO. Each AMARETTO and Community-AMARETTO analysis generates a detailed report of genome-wide networks inferred from one cohort and/or shared/distinct across multiple cohorts. These reports include queryable tables and visualizations (heatmaps and network graphs) of shared/distinct cell circuits and their drivers, as well as their functional and phenotypic characterizations.
- <B>&#42;AMARETTO in GenePattern Notebook</B>: GenePattern Notebook from <https://notebook.genepattern.org/>
- <B>&#42;AMARETTO in R via GitHub and Bioconductor</B>: Jupyter Notebook from <https://colab.research.google.com/>

#### &#42;AMARETTO example reports
<B>Studying hepatitis C & B virus-induced hepatocellular carcinoma using AMARETTO & Community-AMARETTO:</B><br>
- An example report that learns regulatory networks from multi-omics data for hepatocellular carcinoma based on integrating genetic, epigenetic and functional genomics data from TCGA: <a href = "http://portals.broadinstitute.org/pochetlab/example_reports/AMARETTO_results/LIHC_Report_TfUtils/AMARETTOhtmls/index.html">AMARETTO Report</a><br>
- An example report that integrates regulatory networks derived from >6 liver data sources (multi-omics hepatocellular carcinoma patient data from TCGA, ~25 liver cell line models from CCLE, time course hepatitis C virus infection data in Huh7 models, time course hepatitis B virus infection data in HepG2 models, single-cell hepatitis C virus infection data in Huh7 models, single-cell hepatitis B virus infection data in HepG2 models, further augmented with previously published prognostic network models that were derived from hepatocellular carcinoma patient data): <a href = "http://portals.broadinstitute.org/pochetlab/example_reports/Community-AMARETTO_results/cAMARETTO_all6_nonfiltered_SignaturesLiverHoshida/index.html">Community-AMARETTO Report</a><br>
- An example of ongoing work on developing gene-level ontology network representations from AMARETTO modules & Community-AMARETTO communities: <a href = "https://monabiyan.shinyapps.io/app_1/">Shiny App</a>

<B>Multi-omics & imaging data fusion for glioblastoma multiforme using AMARETTO:</B><br>
- An example report that integrates imaging data into the multi-omics regulatory networks for glioblastoma multiforme, based on multi-omics and non-invasive imaging data from TCGA/TCIA. We will later connect these networks with networks learned from integrating RNA-Seq data refined for anatomic structures and stem cells with histopathology imaging data from IvyGAP, and subsequently refine them further based on single-cell RNA-Seq studies: <a href = "http://portals.broadinstitute.org/pochetlab/example_reports/AMARETTO_results/GBM_Report/AMARETTOhtmls/index.html">AMARETTO Report</a>

# References

1. Multiscale and multimodal inference of regulatory networks using &#42;AMARETTO. <i>In preparation for submission.</i>

2. Champion M., Brennan K., Croonenborghs T., Gentles A. J., Pochet N., Gevaert O. (2018). Module Analysis Captures Pancancer Genetically and Epigenetically Deregulated Cancer Driver Genes for Smoking and Antiviral Response. <i>EBioMedicine</i>, 27, 156-166. PMID:29331675 PMCID:PMC5828545

3. Gevaert O., Villalobos V., Sikic B. I., Plevritis S. K. (2013). Identification of ovarian cancer driver genes by using module network integration of multi-omics data. <i>Interface Focus</i>, 3(4), 20130013. PMID:24511378 PMCID:PMC3915833

4. Gevaert O., Tibshirani R., Plevritis S. K. (2015). Pancancer analysis of DNA methylation-driven genes using MethylMix. <i>Genome Biology</i>, 16(1), 17. PMID:25631659 PMCID:PMC4365533

5. Gevaert O. (2015). MethylMix: an R package for identifying DNA methylation-driven genes. <i>Bioinformatics</i>, 31(11), 1839-41. PMID:25609794 PMCID:PMC4443673

6. Stubbs B. J., Gopaulakrishnan S., Glass K., Pochet N., Everaert C., Raby B., Carey V. (2019). TFutils: Data structures for transcription factor bioinformatics. <i>F1000Research</i>, 8:152. (<https://doi.org/10.12688/f1000research.17976.1>)

7. Reich M., Liefeld T., Ocana M., Jang D., Bistline J., Robinson J., Carr P., Hill B., McLaughlin J., Pochet N., Borges-Rivera D., Tabor T., Thorvaldsdottir H., Regev A., Mesirov J. P. (2013). GenomeSpace: an environment for frictionless bioinformatics. <i>F1000Posters</i>, 4:804 (<https://f1000research.com/posters/1093972>)

8. Qu K., Garamszegi S., Wu F., Thorvaldsdottir H., Liefeld T., Ocana M., Borges-Rivera D., Pochet N., Robinson J. T., Demchak B., Hull T., Ben-Artzi G., Blankenberg D., Barber G. P., Lee B. T., Kuhn R. M., Nekrutenko A., Segal E., Ideker T., Reich M., Regev A., Chang H. Y., Mesirov J. P. (2016). Integrative genomic analysis by interoperation of bioinformatics tools in GenomeSpace. <i>Nature Methods</i>, 13(3), 245-247. PMID:26780094 PMCID:PMC4767623

9. Cedoz P. L., Prunello M., Brennan K., Gevaert O. (2018). MethylMix 2.0: an R package for identifying DNA methylation genes. <i>Bioinformatics</i>, 34(17), 3044-3046. PMID:29668835 PMCID:PMC6129298 (<https://doi.org/10.1093/bioinformatics/bty156>)

# Funding

This work was supported by grants from NIH NCI ITCR R21 CA209940 (Pochet), NIH NCI ITCR U01 CA214846 Collaborative Supplement (Carey/Pochet) and NIH NIAID R03 AI131066 (Pochet).

# Questions?

For any questions about the &#42;AMARETTO Notebooks, please contact <B>Nathalie Pochet</B> (<npochet@broadinstitute.org>) and <B>Olivier Gevaert</B> (<olivier.gevaert@stanford.edu>).