<a href="https://colab.research.google.com/github/matthewberry/uiuc_com_dsp/blob/master/DSP_genomics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Using This Notebook

This notebook is an interactive environment that combines explanatory text with executable code. It provides computational tools useful in genomics, and you will familiarize yourself with them by stepping through a series of analyses that examine related data from multiple angles. This suite of tools and examples can then serve as a starting point for a project of your own design.

If you are new to notebooks, you might find this introduction helpful: [Overview of Colaboratory Features](https://colab.research.google.com/notebooks/basic_features_overview.ipynb). You might also want to refer to the [Python 3 documentation](https://docs.python.org/3/).

**Important Note**: After a period of inactivity (Google does not specify exactly how long), Google will disconnect your notebook from the virtual machine that had been running it. When you return, Google will connect to a new virtual machine. Any data files you saved to your Google Drive will remain, but any variables or methods defined in your previous virtual machine will have to be reloaded. (You will know when this happens, because cells that previously ran without error will suddenly stop working, and the notebook will lose its connection to your Google Drive.) To reload the variable and method definitions, and to restore the connection to your Google Drive, you can simply re-run the cells that perform those tasks. This notebook will explain which cells might need to be re-run.
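
One way to check whether a reconnect has wiped your session is to test whether names defined by earlier cells still exist. The sketch below is only an illustration: `undefined_names` is a hypothetical helper that is not part of this notebook, and the names passed to it are ones that the setup cells later in this notebook define.

```python
def undefined_names(names, namespace):
    """Return the entries of `names` that are not defined in `namespace`."""
    return [name for name in names if name not in namespace]

# Hypothetical usage: in a notebook cell, pass globals() as the namespace.
expected = ['NETWORK_DIR_PATH', 'do_clustering', 'GDRIVE_MOUNT_PATH']
missing = undefined_names(expected, globals())
if missing:
    print("Re-run the setup cells; not defined:", ", ".join(missing))
else:
    print("Session state looks intact.")
```

If any names are reported, re-running the cells marked "You **will** need to re-run" should restore them.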

### Before You Proceed

In order to run the notebook, you will need your own copy of it, along with your own copy of the data. Here are the steps to follow:

1. If you have not already enabled Google Apps @ Illinois, which allows you to use Google Drive, Google Docs, and so on with your illinois.edu account, [enable Google Apps @ Illinois](https://answers.uillinois.edu/illinois/page.php?id=55049).
2. Check the currently active Google account on this notebook. Look near the top-right corner of the screen for either a `Sign In` button or a round icon containing either a letter or your profile photo. If you see a `Sign In` button, click it and follow the prompts to sign in with your illinois.edu account. Otherwise, click the icon to open a popup that identifies the currently active Google account. If the account is not your illinois.edu account, switch to your illinois.edu account (you might have to click `Add Account` if it is not already an option in the list).
3. In the `File` menu above, select `Save a copy in Drive...`. This step will create a new browser tab containing your copy of the notebook. At this point, you can close the old browser tab that contained the original copy of the notebook.
4. Open the `File` menu and select `Locate in Drive`, which will show you where you can find the notebook if you need to open it later.
5. [Click here](https://drive.google.com/drive/folders/1vsIBpVR0xi56u0WtihppTXrWscDBO70U?usp=sharing) to open the master copy of the example data in a new tab. In the new tab, again make sure the active Google account is your illinois.edu account. Click on the small triangle that appears after the folder name near the top of the screen. From the menu that appears, select `Add to My Drive`.


## Installation

The cell below installs software required to perform the analyses. Run the cell and wait for it to complete, which will take about two minutes. You'll see lots of text output as the cell runs, but there's no need to read it unless the following cell fails.

You **will** need to re-run this cell whenever your notebook connects to a new virtual machine.
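
Rather than scanning the installation output, you could confirm afterwards that the packages are importable. The following is a minimal sketch, assuming the top-level module names match those imported in the Environment Setup cell; `missing_modules` is a hypothetical helper, not something the installed packages provide.

```python
import importlib.util

def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Top-level modules imported by the Environment Setup cell.
required = [
    'lifelines', 'kndatacleanup', 'knfeatureprioritization',
    'kngeneprioritization', 'kngenesetcharacterization',
    'knsamplesclustering', 'kngeneralclustering',
    'knspreadsheetstransformation',
]
not_found = missing_modules(required)
print("Missing modules:", not_found if not_found else "none")
```

An empty result suggests the installation succeeded; any listed module points to the `pip3` line worth re-checking.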

In [0]:
!pip3 install -I pyyaml==5.1.2
!pip3 install xmlrunner==1.7.7 redis==3.3.8 lifelines==0.22.8
!pip3 install git+https://github.com/KnowEnG/KnowEnG_Pipelines_Library.git@mjberry/update_dependencies
!pip3 install git+https://github.com/KnowEnG/Data_Cleanup_Pipeline.git@mjberry/create_package
!pip3 install git+https://github.com/KnowEnG/General_Clustering_Pipeline.git@mjberry/create_package
!pip3 install git+https://github.com/KnowEnG/Samples_Clustering_Pipeline.git@mjberry/create_package
!pip3 install git+https://github.com/KnowEnG/Feature_Prioritization_Pipeline.git@mjberry/create_package
!pip3 install git+https://github.com/KnowEnG/Gene_Prioritization_Pipeline.git@mjberry/create_package
!pip3 install git+https://github.com/KnowEnG/Geneset_Characterization_Pipeline.git@mjberry/create_package
!pip3 install git+https://github.com/KnowEnG/Spreadsheets_Transformation.git@mjberry/create_package

## Environment Setup

The cell below sets up the environment for running the analyses. Run the cell and wait for it to complete, which will only take a few seconds. You won't see any text output this time.

You probably will not need to call any of the methods defined in this cell, and you probably will not need to edit anything in the cell.

You **will** need to re-run the cell whenever your notebook connects to a new virtual machine.
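
Even if you never call these helpers directly, it can help to know where cleaned files end up. The sketch below mirrors the naming logic of `get_cleaned_file_path` and `get_gene_map_file_path` from the cell below in a single standalone function; `derived_output_path` and the example paths are hypothetical and exist only for illustration.

```python
import os

def derived_output_path(original_file_path, results_dir_path, suffix):
    """Build <results dir>/<input file name without extension><suffix>."""
    root = os.path.splitext(os.path.basename(original_file_path))[0]
    return os.path.join(results_dir_path, root + suffix)

# Hypothetical input and results locations, for illustration only.
omics = '/content/inputs/expression.tsv'
results = '/content/results'
print(derived_output_path(omics, results, '_ETL.tsv'))  # cleaned spreadsheet
print(derived_output_path(omics, results, '_MAP.tsv'))  # gene-name mapping
```

In other words, the cleaned version of an input keeps the input's base name and gains an `_ETL.tsv` suffix, while the gene-name mapping gains `_MAP.tsv`.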


In [0]:
import csv
import os
import shutil
from tempfile import mkdtemp
import urllib.request

from IPython.display import HTML
from lifelines import KaplanMeierFitter
from lifelines.statistics import multivariate_logrank_test
import matplotlib.pyplot as plt
import pandas as pd

from kndatacleanup import data_cleanup
from knfeatureprioritization import feature_prioritization
from kngeneprioritization import gene_prioritization
from kngenesetcharacterization import geneset_characterization
from knsamplesclustering import samples_clustering
from kngeneralclustering import general_clustering
from knspreadsheetstransformation.spreadsheets_transformation_toolbox import \
    get_cluster_binary_dataframe

%matplotlib inline

NETWORK_DIR_PATH = '/network/'

REDIS_PARAMS = {
    'host': 'knowredis.knoweng.org',
    'password': 'KnowEnG',
    'port': 6379
}

NUM_CPUS = 2

def fetch_network(edge_file_path):
    """Given the local path to an edge file, ensures that the edge file exists
    on disk, downloading it from AWS if necessary.

    Arguments:
        edge_file_path (str): The local path to an edge file as found in the
            data returned by `get_interaction_networks` and
            `get_gene_property_networks`.

    Returns:
        None

    """
    if not os.path.isfile(edge_file_path):
        url = "https://s3.amazonaws.com/KnowNets/KN-20rep-1706/" + \
            "userKN-20rep-1706/" + edge_file_path[len(NETWORK_DIR_PATH):]
        os.makedirs(os.path.dirname(edge_file_path), exist_ok=True)
        with urllib.request.urlopen(url) as response:
            with open(edge_file_path, 'wb') as out_file:
                shutil.copyfileobj(response, out_file)

def fetch_network_metadata():
    """Downloads from AWS the network overview metadata files required to
    implement the network utility methods.

    Arguments:
        None

    Returns:
        None

    """
    filenames = ['db_contents.txt', 'species_desc.txt', 'edge_type.txt']
    for filename in filenames:
        out_file_path = os.path.join(NETWORK_DIR_PATH, filename)
        if not os.path.isfile(out_file_path):
            url = "https://s3.amazonaws.com/KnowNets/KN-20rep-1706/" + \
                "userKN-20rep-1706/" + filename
            with urllib.request.urlopen(url) as response:
                with open(out_file_path, 'wb') as out_file:
                    shutil.copyfileobj(response, out_file)

def get_path_to_newest_file_having_prefix(search_dir_path, prefix):
    """Finds all files in `search_dir_path` whose name begins with `prefix`.
    Returns the newest of these matching files, or returns None if no matching
    file exists.

    Arguments:
        search_dir_path (str): The local path to the directory to search.
        prefix (str): The string used to filter the files in `search_dir_path`.

    Returns:
        str: The path to the newest matching file, or None if no matching files
            exist.

    """
    matches = [os.path.join(search_dir_path, name) \
        for name in os.listdir(search_dir_path) \
        if name.startswith(prefix)]
    # ensure they're all files
    matches = [m for m in matches if os.path.isfile(m)]
    # max() with a key picks the newest file without sorting the whole list
    return max(matches, key=os.path.getctime) if matches else None

def get_cleaned_file_path(original_file_path, results_dir_path):
    """Given the name of a file passed to `kndatacleanup.data_cleanup`,
    along with the `results_dir_path` passed to `kndatacleanup.data_cleanup`,
    returns the path at which the cleaned version of the input can be found.

    Arguments:
        original_file_path (str): The path to the file that was passed to
            `kndatacleanup.data_cleanup`.
        results_dir_path (str): The path to the results directory that was
            passed to `kndatacleanup.data_cleanup`.

    Returns:
        str: The path to the cleaned version of `original_file_path` that was
            or would be produced by `kndatacleanup.data_cleanup`.

    """
    original_name = os.path.basename(original_file_path)
    original_name_root = os.path.splitext(original_name)[0]
    return os.path.join(results_dir_path, original_name_root + "_ETL.tsv")

def get_gene_map_file_path(original_file_path, results_dir_path):
    """Given the name of an omics file passed to `kndatacleanup.data_cleanup`,
    along with the `results_dir_path` passed to `kndatacleanup.data_cleanup`,
    returns the path at which the mapping of gene names to gene identifiers
    can be found.

    Arguments:
        original_file_path (str): The path to the omics file that was passed to
            `kndatacleanup.data_cleanup`.
        results_dir_path (str): The path to the results directory that was
            passed to `kndatacleanup.data_cleanup`.

    Returns:
        str: The path to the gene-name mapping file that was or would be
            produced by `kndatacleanup.data_cleanup`.

    """
    original_name = os.path.basename(original_file_path)
    original_name_root = os.path.splitext(original_name)[0]
    return os.path.join(results_dir_path, original_name_root + "_MAP.tsv")

os.makedirs(NETWORK_DIR_PATH, exist_ok=True)
fetch_network_metadata()

!rm -rf /content/sample_data

## Knowledge Network Utility Methods

The cell below defines several utility methods for working with the knowledge network. These methods are used in the example analyses and might be useful to you in your project. Run the cell and wait for it to complete, which will only take a second or so. It won't produce any text output.

You probably will not need to edit anything within the cell.

A later cell shows how to use the knowledge network utility methods.

You **will** need to re-run the cell whenever your notebook connects to a new virtual machine.
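
To make the cell below easier to read, here is a standalone sketch of how the "get" methods turn rows of `db_contents.txt` into edge file paths. The column names (`n1_type`, `taxon`, `et_name`) come from the cell below; the sample rows and the `edge_file_paths` helper are made up for illustration and do not reflect the real file contents.

```python
import csv
import io
import os

NETWORK_DIR_PATH = '/network/'

# Made-up stand-in for db_contents.txt; only the three columns read by the
# utility methods are included.
SAMPLE_CONTENTS = (
    "n1_type\ttaxon\tet_name\n"
    "Gene\t10116\texample_interaction\n"
    "Property\t10116\texample_property\n"
    "Gene\t6239\texample_interaction\n"
)

def edge_file_paths(contents_text, node_type, species_id):
    """Return edge file paths for rows matching `node_type` and `species_id`."""
    species_id = str(species_id)  # accept int or str, as the notebook does
    paths = []
    for row in csv.DictReader(io.StringIO(contents_text), delimiter='\t'):
        if row['n1_type'] == node_type and row['taxon'] == species_id:
            paths.append(os.path.join(
                NETWORK_DIR_PATH, node_type, species_id, row['et_name'],
                species_id + '.' + row['et_name'] + '.edge'))
    return paths

print(edge_file_paths(SAMPLE_CONTENTS, 'Gene', 10116))
```

Rows with `n1_type` equal to `Gene` describe interaction networks, and rows with `Property` describe gene-property networks.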


In [0]:
def get_network_species():
    """Returns information about the species found in the knowledge network.

    Arguments:
        None

    Returns:
        list: A list in which each element is a dictionary. Each dictionary has
            keys 'id' (which can be passed to other methods that require a
            `species_id`), 'short_latin_name', 'latin_name', 'familiar_name',
            and 'group_name'.

    """
    return_val = []
    species_file_path = os.path.join(NETWORK_DIR_PATH, 'species_desc.txt')
    with open(species_file_path) as csvfile:
        for row in csv.reader(csvfile, delimiter='\t'):
            return_val.append({
                'id': row[0],
                'short_latin_name': row[1],
                'latin_name': row[2],
                'familiar_name': row[3],
                'group_name': row[5]
            })
    return return_val

def display_network_species():
    """Displays a table of the species found in the knowledge network.

    Arguments:
        None

    Returns:
        IPython.display.HTML: The rendered table of species.

    """
    html_string = "<table><tr><th>Familiar Name (Latin Name)</th><th>Species Id</th></tr>"
    for species in get_network_species():
        html_string += "<tr><td>" + species['familiar_name'] + " (" + \
            species['latin_name'] + ")</td><td>" + species['id'] + "</td></tr>"
    html_string += "</table>"
    return HTML(html_string)

def get_edge_type_name_to_pretty_name():
    """Returns a dictionary in which the keys are edge type names and the values
    are pretty network names.

    Arguments:
        None

    Returns:
        dict: A dictionary in which the keys are edge type names and the values
            are pretty network names.

    """
    return_val = {}
    file_path = os.path.join(NETWORK_DIR_PATH, 'edge_type.txt')
    with open(file_path) as csvfile:
        for row in csv.DictReader(csvfile, delimiter='\t'):
            return_val[row['et_name']] = row['pretty_name']
    return return_val

def get_interaction_networks(species_id):
    """Given a `species_id`, returns information about the interaction networks
    available in the knowledge network.

    Arguments:
        species_id (int or str): The id for the species of interest, as returned
            by `get_network_species` or displayed by `display_network_species`.

    Returns:
        list: A list in which each element is a dictionary. Each dictionary has
            two keys, 'name' and 'edge_file_path'.

    """
    species_id = str(species_id) # user-friendliness
    return_val = []
    contents_file_path = os.path.join(NETWORK_DIR_PATH, 'db_contents.txt')
    with open(contents_file_path) as csvfile:
        edge_type_name_to_pretty_name = get_edge_type_name_to_pretty_name()
        for row in csv.DictReader(csvfile, delimiter='\t'):
            if row['n1_type'] == 'Gene' and row['taxon'] == species_id:
                return_val.append({
                    'name': edge_type_name_to_pretty_name[row['et_name']],
                    'edge_file_path': os.path.join(\
                        NETWORK_DIR_PATH, 'Gene', species_id, row['et_name'], \
                        species_id + '.' + row['et_name'] + '.edge')
                })
    return return_val

def display_interaction_networks(species_id):
    """Given a `species_id`, displays information about the interaction
    networks available in the knowledge network.

    Arguments:
        species_id (int or str): The id for the species of interest, as returned
            by `get_network_species` or displayed by `display_network_species`.

    Returns:
        IPython.display.HTML: The rendered table of interaction networks.

    """
    html_string = "<table><tr><th>Interaction Network Name</th><th>Edge File Path</th></tr>"
    for network in get_interaction_networks(species_id):
        html_string += "<tr><td>" + network['name'] + "</td><td>" + \
            network['edge_file_path'] + "</td></tr>"
    html_string += "</table>"
    return HTML(html_string)

def get_gene_property_networks(species_id):
    """Given a `species_id`, returns information about the gene-property
    networks available in the knowledge network.

    Arguments:
        species_id (int or str): The id for the species of interest, as returned
            by `get_network_species` or displayed by `display_network_species`.

    Returns:
        list: A list in which each element is a dictionary. Each dictionary has
            two keys, 'name' and 'edge_file_path'.

    """
    species_id = str(species_id) # user-friendliness
    return_val = []
    contents_file_path = os.path.join(NETWORK_DIR_PATH, 'db_contents.txt')
    with open(contents_file_path) as csvfile:
        edge_type_name_to_pretty_name = get_edge_type_name_to_pretty_name()
        for row in csv.DictReader(csvfile, delimiter='\t'):
            if row['n1_type'] == 'Property' and row['taxon'] == species_id:
                return_val.append({
                    'name': edge_type_name_to_pretty_name[row['et_name']],
                    'edge_file_path': os.path.join(\
                        NETWORK_DIR_PATH, 'Property', species_id, row['et_name'], \
                        species_id + '.' + row['et_name'] + '.edge')
                })
    return return_val

def display_gene_property_networks(species_id):
    """Given a `species_id`, displays information about the gene-property
    networks available in the knowledge network.

    Arguments:
        species_id (int or str): The id for the species of interest, as returned
            by `get_network_species` or displayed by `display_network_species`.

    Returns:
        IPython.display.HTML: The rendered table of gene-property networks.

    """
    html_string = "<table><tr><th>Gene-Property Network Name</th><th>Edge File Path</th></tr>"
    for network in get_gene_property_networks(species_id):
        html_string += "<tr><td>" + network['name'] + "</td><td>" + \
            network['edge_file_path'] + "</td></tr>"
    html_string += "</table>"
    return HTML(html_string)

### Using the Knowledge Network Utility Methods

The three cells below show how `display_network_species`, `display_interaction_networks`, and `display_gene_property_networks` can be called to view information about the knowledge network. This information can be useful in configuring analyses, as you'll see later.

These methods are based on three other methods defined in the cell above, `get_network_species`, `get_interaction_networks`, and `get_gene_property_networks`. The "get" versions return the same information as the "display" versions, but the "get" versions return it in a format convenient for use in code instead of a format that's easy to read.
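
As an illustration of why the "get" format is convenient in code, the sketch below looks up an edge file path by network name. The records are made-up stand-ins shaped like the output of `get_interaction_networks` (dictionaries with `name` and `edge_file_path` keys), and `find_edge_file_path` is a hypothetical helper rather than part of the notebook.

```python
def find_edge_file_path(networks, name):
    """Return the 'edge_file_path' of the first network called `name`, else None."""
    for network in networks:
        if network['name'] == name:
            return network['edge_file_path']
    return None

# Made-up records shaped like the output of `get_interaction_networks`.
sample_networks = [
    {'name': 'Example Network A',
     'edge_file_path': '/network/Gene/0000/a/0000.a.edge'},
    {'name': 'Example Network B',
     'edge_file_path': '/network/Gene/0000/b/0000.b.edge'},
]
print(find_edge_file_path(sample_networks, 'Example Network B'))
```

In the notebook itself you would pass the real list, for example the result of `get_interaction_networks('10116')`, in place of `sample_networks`.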

You **will not** need to re-run these three cells whenever your notebook connects to a new virtual machine.

In [0]:
# display all species in the knowledge network
display_network_species()

In [0]:
# display interaction networks for rat (species id 10116)
display_interaction_networks('10116')

In [0]:
# display gene property networks for roundworm (species id 6239)
display_gene_property_networks('6239')

## Analytics Methods

The cell below defines methods for running clustering, prioritization, and gene-set characterization. Run the cell and wait for it to complete, which will take a second or so. It won't produce any output.

You **will** need to re-run the cell whenever your notebook connects to a new virtual machine.
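
As a reading aid for the cell below, this standalone sketch mirrors how `do_clustering` builds the method name it dispatches on: a network edge file selects `net_nmf` over `hclust`, and a positive bootstrap count adds the `cc_` prefix. The function `select_clustering_method` is hypothetical and simply restates that logic.

```python
def select_clustering_method(uses_network, num_bootstraps):
    """Mirror how `do_clustering` builds the method name it dispatches on."""
    prefix = 'cc_' if num_bootstraps > 0 else ''
    base = 'net_nmf' if uses_network else 'hclust'
    return prefix + base

# Print the method name for each combination of the two choices.
for uses_network in (False, True):
    for num_bootstraps in (0, 10):
        print(uses_network, num_bootstraps,
              select_clustering_method(uses_network, num_bootstraps))
```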


In [0]:
def do_clustering(\
    omics_file_path, phenotype_file_path, results_dir_path, num_clusters, \
    species_id, interaction_network_edge_file_path, network_influence, \
    num_bootstraps, bootstrap_sample_fraction):
    """Performs a clustering upon the samples found in `omics_file_path`.

    Arguments:
        omics_file_path (str): The path to the omics file.
        phenotype_file_path (str): The path to a file containing phenotype data
            on the same samples as found in `omics_file_path`, or None if no
            phenotype data are to be analyzed. If analyzed, each phenotype will
            be scored for statistically significant differences between the
            clusters.
        results_dir_path (str): The path to a directory where results files
            should be stored.
        num_clusters (int): The number of clusters to create.
        species_id (int or str): The id for the species of interest, as returned
            by `get_network_species` or displayed by `display_network_species`,
            or None if not using an `interaction_network_edge_file_path`.
        interaction_network_edge_file_path (str): The path to an interaction
            network edge file, to use a knowledge-guided approach to clustering,
            or else None.
        network_influence (float): A number between 0 and 1 that specifies the
            amount to which network data should influence the results, or None
            if not using an `interaction_network_edge_file_path`.
        num_bootstraps (int): A number of bootstrap iterations to run. Use 0 for
            no bootstrapping.
        bootstrap_sample_fraction (float): A number between 0 and 1 that
            specifies what fraction of the data should be used in each bootstrap
            iteration, or None if not using bootstrapping.

    Returns:
        None

    """
    try:
        species_id = str(species_id) # user-friendliness
        os.makedirs(results_dir_path, exist_ok=True)

        if interaction_network_edge_file_path is None:
            pipeline_type = 'general_clustering_pipeline'
        else:
            fetch_network(interaction_network_edge_file_path)
            pipeline_type = 'samples_clustering_pipeline'

        cleanup_parameters = {
            'spreadsheet_name_full_path': omics_file_path,
            'pipeline_type': pipeline_type,
            'results_directory': results_dir_path
        }
        if phenotype_file_path is not None:
            cleanup_parameters['phenotype_name_full_path'] = phenotype_file_path
        if interaction_network_edge_file_path is not None:
            cleanup_parameters.update({
                'gg_network_name_full_path': interaction_network_edge_file_path,
                'taxonid': species_id,
                'source_hint': '',
                'redis_credential': {
                    'host': REDIS_PARAMS['host'],
                    'port': REDIS_PARAMS['port'],
                    'password': REDIS_PARAMS['password']
                }
            })
        data_cleanup.run_pipelines(cleanup_parameters, \
            data_cleanup.SELECT[pipeline_type])

        clustering_parameters = {
            'spreadsheet_name_full_path': get_cleaned_file_path(\
                omics_file_path, results_dir_path),
            'results_directory': results_dir_path,
            'processing_method': 'parallel',
            'parallelism': NUM_CPUS,
            'number_of_clusters': num_clusters,
            'run_directory': results_dir_path,
            'tmp_directory': './tmp'
        }
        if phenotype_file_path is not None:
            clustering_parameters.update({
                'phenotype_name_full_path': get_cleaned_file_path(\
                    phenotype_file_path, results_dir_path),
                'threshold': 15
            })

        method_prefix = ''
        if num_bootstraps > 0:
            clustering_parameters.update({
                'number_of_bootstraps': num_bootstraps,
                'rows_sampling_fraction': 1.0,
                'cols_sampling_fraction': bootstrap_sample_fraction
            })
            method_prefix = 'cc_'

        if interaction_network_edge_file_path is not None:
            clustering_parameters.update({
                'gg_network_name_full_path': interaction_network_edge_file_path,
                'rwr_max_iterations': 100,
                'rwr_convergence_tolerence': 1.0e-4,
                'rwr_restart_probability': network_influence,
                'top_number_of_genes': 100,
                'nmf_conv_check_freq': 50,
                'nmf_max_invariance': 200,
                'nmf_max_iterations': 10000,
                'nmf_penalty_parameter': 1400,
                'method': method_prefix + 'net_nmf'
            })
            samples_clustering.SELECT[clustering_parameters['method']](\
                clustering_parameters)
        else:
            clustering_parameters.update({
                'top_number_of_rows': 100,
                'affinity_metric': 'euclidean',
                'linkage_criterion': 'ward',
                'method': method_prefix + 'hclust'
            })
            general_clustering.SELECT[clustering_parameters['method']](\
                clustering_parameters)
    except:
        print("Something went wrong! Check the debugging information below, " + \
            "and look for log output in " + results_dir_path)
        raise
    else:
        print("Find results in " + results_dir_path)

def do_prioritization(\
    omics_file_path, phenotype_file_path, results_dir_path, \
    correlation_measure, missing_value_strategy, num_exported_features, \
    num_response_correlated_features, species_id, \
    interaction_network_edge_file_path, network_influence):
    """Prioritizes the features (genes or otherwise) found in `omics_file_path`
    for each phenotype found in `phenotype_file_path`.

    Arguments:
        omics_file_path (str): The path to the omics file.
        phenotype_file_path (str): The path to a file containing phenotype data
            on the same samples as found in `omics_file_path`.
        results_dir_path (str): The path to a directory where results files
            should be stored.
        correlation_measure (str): Either 't_test' for binary or categorical
            phenotypes or 'pearson' for numeric phenotypes.
        missing_value_strategy (str): Governs how to handle missing values in
            `omics_file_path`. Options are 'average' to use the average value
            for the feature among the other samples, 'remove' to drop any
            samples with missing values, or 'reject' to fail if any missing
            values are found (perhaps as a sanity check if you believe missing
            values were prevented upstream).
        num_exported_features (int): The number of top features per phenotype to
            include in the matrix that can be passed to `do_characterization`.
        num_response_correlated_features (int): The number of top features to
            retain from the first stage of the analysis if using an
            `interaction_network_edge_file_path`.
        species_id (int or str): The id for the species of interest, as returned
            by `get_network_species` or displayed by `display_network_species`,
            or None if not using an `interaction_network_edge_file_path`.
        interaction_network_edge_file_path (str): The path to an interaction
            network edge file, to use a knowledge-guided approach to
            prioritization, or else None.
        network_influence (float): A number between 0 and 1 that specifies the
            amount to which network data should influence the results, or None
            if not using an `interaction_network_edge_file_path`.

    Returns:
        None

    """
    try:
        species_id = str(species_id) # user-friendliness
        os.makedirs(results_dir_path, exist_ok=True)

        if interaction_network_edge_file_path is None:
            pipeline_type = 'feature_prioritization_pipeline'
        else:
            fetch_network(interaction_network_edge_file_path)
            pipeline_type = 'gene_prioritization_pipeline'

        cleanup_parameters = {
            'spreadsheet_name_full_path': omics_file_path,
            'phenotype_name_full_path': phenotype_file_path,
            'pipeline_type': pipeline_type,
            'correlation_measure': correlation_measure, # t_test, pearson, edgeR
            'impute': missing_value_strategy, # average, remove, reject
            'results_directory': results_dir_path
        }
        if interaction_network_edge_file_path is not None:
            cleanup_parameters.update({
                'taxonid': species_id,
                'source_hint': '',
                'redis_credential': {
                    'host': REDIS_PARAMS['host'],
                    'port': REDIS_PARAMS['port'],
                    'password': REDIS_PARAMS['password']
                }
            })
        data_cleanup.run_pipelines(cleanup_parameters, \
            data_cleanup.SELECT[pipeline_type])

        prioritization_parameters = {
            'correlation_measure': correlation_measure,
            'spreadsheet_name_full_path': get_cleaned_file_path(\
                omics_file_path, results_dir_path),
            'phenotype_name_full_path': get_cleaned_file_path(\
                phenotype_file_path, results_dir_path),
            'results_directory': results_dir_path,
            'top_gamma_of_sort': num_exported_features,
            'max_cpu': NUM_CPUS
        }
        if interaction_network_edge_file_path is not None:
            prioritization_parameters.update({
                'gg_network_name_full_path': interaction_network_edge_file_path,
                'rwr_max_iterations': 100,
                'rwr_convergence_tolerence': 1.0e-4,
                'rwr_restart_probability': network_influence,
                'top_beta_of_sort': num_response_correlated_features,
                'method': 'net_correlation'
            })
            gene_prioritization.net_correlation(prioritization_parameters)
        else:
            prioritization_parameters.update({
                'top_beta_of_sort': num_exported_features,
                'method': 'correlation',
            })
            feature_prioritization.correlation(prioritization_parameters)
    except:
        print("Something went wrong! Check the debugging information below, " + \
            "and look for log output in " + results_dir_path)
        raise
    else:
        print("Find results in " + results_dir_path)

def do_characterization(\
    gene_matrix_file_path, results_dir_path, species_id, \
    gene_property_edge_file_path, interaction_network_edge_file_path, \
    network_influence):
    """Compares user-submitted gene sets to those found in a gene-property
    network from the knowledge network.

    Arguments:
        gene_matrix_file_path (str): The path to the gene matrix that defines
            one or more gene sets.
        results_dir_path (str): The path to a directory where results files
            should be stored.
        species_id (int or str): The id for the species of interest, as returned
            by `get_network_species` or displayed by `display_network_species`.
        gene_property_edge_file_path (str): The path to a gene-property network edge
            file.
        interaction_network_edge_file_path (str): The path to an interaction
            network edge file, to use a knowledge-guided approach to
            characterization, or else None.
        network_influence (float): A number between 0 and 1 that specifies the
            amount to which network data should influence the results, or None
            if not using an `interaction_network_edge_file_path`.

    Returns:
        None

    """
    try:
        species_id = str(species_id) # user-friendliness
        os.makedirs(results_dir_path, exist_ok=True)

        fetch_network(gene_property_edge_file_path)

        cleanup_parameters = {
            'spreadsheet_name_full_path': gene_matrix_file_path,
            'pipeline_type': 'geneset_characterization_pipeline',
            'results_directory': results_dir_path,
            'taxonid': species_id,
            'source_hint': '',
            'redis_credential': {
                'host': REDIS_PARAMS['host'],
                'port': REDIS_PARAMS['port'],
                'password': REDIS_PARAMS['password']
            }
        }
        data_cleanup.run_pipelines(cleanup_parameters, \
            data_cleanup.SELECT['geneset_characterization_pipeline'])

        characterization_parameters = {
            'spreadsheet_name_full_path': get_cleaned_file_path(\
                gene_matrix_file_path, results_dir_path),
            'gene_names_map': get_gene_map_file_path(\
                gene_matrix_file_path, results_dir_path),
            'results_directory': results_dir_path,
            'pg_network_name_full_path': gene_property_edge_file_path,
            'max_cpu': NUM_CPUS
        }
        if interaction_network_edge_file_path is None:
            characterization_parameters.update({
                'method': 'fisher'
            })
            geneset_characterization.fisher(characterization_parameters)
        else:
            fetch_network(interaction_network_edge_file_path)
            characterization_parameters.update({
                'method': 'DRaWR',
                'rwr_max_iterations': 500,
                'rwr_convergence_tolerence': 1.0e-4,
                'rwr_restart_probability': network_influence,
                'gg_network_name_full_path': interaction_network_edge_file_path
            })
            geneset_characterization.DRaWR(characterization_parameters)
    except:
        print("Something went wrong! Check the debugging information below, " + \
            "and look for log output in " + results_dir_path)
        raise
    else:
        print("Find results in " + results_dir_path)
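
For intuition about the `rwr_*` parameters passed to DRaWR above, here is a minimal sketch of a random walk with restart on a toy network. It is not the pipeline's implementation: the function name `random_walk_with_restart`, the dictionary-based graph, and the toy edges are assumptions made for illustration. The sketch uses the textbook convention in which the restart probability weights the restart vector; the pipeline's own convention is not shown here and may differ.

```python
def random_walk_with_restart(adjacency, restart, restart_probability,
                             max_iterations=500, tolerance=1.0e-4):
    """Iterate p <- (1 - c) * W p + c * r until the L1 change is below tolerance.

    adjacency: dict mapping node -> dict of neighbor -> edge weight
    restart: dict mapping node -> restart weight (normalized internally)
    """
    nodes = sorted(adjacency)
    # total edge weight leaving each node, used to normalize the transitions
    out_weight = {n: sum(adjacency[n].values()) for n in nodes}
    total_restart = sum(restart.get(n, 0.0) for n in nodes)
    r = {n: restart.get(n, 0.0) / total_restart for n in nodes}
    p = dict(r)
    for _ in range(max_iterations):
        new_p = {}
        for n in nodes:
            # probability mass flowing into n from its neighbors
            inflow = sum(p[m] * adjacency[m].get(n, 0.0) / out_weight[m]
                         for m in nodes if out_weight[m] > 0)
            new_p[n] = ((1.0 - restart_probability) * inflow
                        + restart_probability * r[n])
        change = sum(abs(new_p[n] - p[n]) for n in nodes)
        p = new_p
        if change < tolerance:
            break
    return p


# toy chain a - b - c, restarting at node a (hypothetical data)
toy_network = {'a': {'b': 1.0}, 'b': {'a': 1.0, 'c': 1.0}, 'c': {'b': 1.0}}
scores = random_walk_with_restart(toy_network, {'a': 1.0}, 0.5)
```

Nodes closer to the restart node end up with higher scores, and the iteration stops as soon as successive score vectors differ by less than the tolerance or the iteration limit is reached.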


## Connect to Google Drive

The cell below enables this notebook to use your Google Drive for file storage. Subsequent cells will use this access to load the example files you copied earlier and to save results of the example analyses. You might also find this helpful in running your own analyses.

Run the cell and click on the link that appears in the output. On the linked page, select your illinois.edu account and grant the requested permissions. The page will then display a code. Copy the code and paste it in the box that appears in the output below. Then press Enter.

You **will** need to re-run the cell whenever your notebook connects to a new virtual machine.
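
If you are unsure whether your Drive is still mounted after a reconnect, a quick check along the following lines can help. This is only a sketch: `is_mounted` is a hypothetical helper that is not part of this notebook, and it assumes that a mounted Drive shows up as a non-empty folder at the mount path.

```python
import os


def is_mounted(mount_path):
    """Return True if mount_path is an existing, non-empty directory."""
    return os.path.isdir(mount_path) and len(os.listdir(mount_path)) > 0


# on a fresh virtual machine this prints False until the mount cell has run
print(is_mounted('/content/gdrive'))
```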


In [0]:
from google.colab import drive
GDRIVE_MOUNT_PATH = '/content/gdrive'
drive.mount(GDRIVE_MOUNT_PATH)

## Setting File Locations

In the cell below, we will tell the notebook where the example files can be found and where the results should be saved.

To confirm the locations, find the arrow symbol (>) near the top left corner of the portion of your screen that shows the notebook content. Click it to reveal a panel with three tabs labeled `Table of contents`, `Code snippets`, and `Files`. Click on the `Files` tab.

In the `Files` tab, you should see one folder named `gdrive`. Click the arrow next to the `gdrive` folder to expand it, and continue navigating through the folders until you find the `Genomics Data Science Project - example analyses inputs` folder copied previously. Right-click on `Genomics Data Science Project - example analyses inputs` and select `Copy path`. Paste the value into the cell below, and compare it to the value assigned to `INPUT_DATA_DIR_PATH`. If the values are different, replace the pre-coded value with the one you pasted.

This notebook is configured to store results in a folder named `Genomics Data Science Project - example analyses outputs` alongside the folder of input data. The folder will be created if it does not already exist. If you would rather store the results elsewhere, you can change the value of `OUTPUT_DATA_DIR_PATH` below.  
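
The cell derives the output location from the parent folder of `INPUT_DATA_DIR_PATH`. One thing to watch for: if the pasted path ends with a slash, `os.path.dirname` returns the input folder itself, so the outputs would land inside it rather than alongside it. The sketch below shows one way to guard against that; `sibling_dir` is a hypothetical helper, not something the notebook defines.

```python
import os


def sibling_dir(dir_path, sibling_name):
    """Return the path of a folder named sibling_name next to dir_path."""
    # normpath drops any trailing separator, so dirname yields the true parent
    parent = os.path.dirname(os.path.normpath(dir_path))
    return os.path.join(parent, sibling_name)


with_slash = os.path.join('data', 'inputs') + os.sep
print(os.path.dirname(with_slash))         # the inputs folder itself
print(sibling_dir(with_slash, 'outputs'))  # a folder next to inputs
```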

Once you have made any changes to `INPUT_DATA_DIR_PATH` and `OUTPUT_DATA_DIR_PATH`, run the cell.

If at any point you open the `Files` tab or click its `REFRESH` button and do not see `gdrive`, you might need to re-run the previous cell.

Note that this cell also defines the output directories used by the individual analyses in the example. They are defined here, in the last quick-running cell before the analyses, because these variables must be redefined if your notebook connects to a new virtual machine.

You **will** need to re-run the cell whenever your notebook connects to a new virtual machine, but the values assigned to `INPUT_DATA_DIR_PATH` and `OUTPUT_DATA_DIR_PATH` will not change unless you move the folders within your Google Drive.


In [0]:
INPUT_DATA_DIR_PATH = '/content/gdrive/My Drive/Genomics Data Science Project - example analyses inputs'
OUTPUT_DATA_DIR_PATH = os.path.join(\
    os.path.dirname(INPUT_DATA_DIR_PATH),
    'Genomics Data Science Project - example analyses outputs')
os.makedirs(OUTPUT_DATA_DIR_PATH, exist_ok=True)

CLUSTERING1_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'clustering1')
CLUSTERING2_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'clustering2')
CLUSTERING3_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'clustering3')
CLUSTERING4_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'clustering4')
CLUSTERING5_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'clustering5')
CLUSTERING6_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'clustering6')

PRIORITIZATION1_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'prioritization1')
PRIORITIZATION2_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'prioritization2')
PRIORITIZATION3_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'prioritization3')

CHARACTERIZATION1_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'characterization1')
CHARACTERIZATION2_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'characterization2')
CHARACTERIZATION3_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'characterization3')


## Loading Pre-Computed Results (optional)

The following sections will guide you through a series of analyses. Some of the steps are computationally intensive and require more than a few minutes to run. To avoid waiting, you may load pre-computed results into your `OUTPUT_DATA_DIR_PATH` at this point, by running the cell below and waiting until it finishes, which will take about 7 minutes. You will then be able to inspect the data, and you will still be able to run any of the remaining cells if you wish.

You **will not** need to re-run this cell when your notebook connects to a new virtual machine.

In [0]:
src_top_dir = os.path.join(INPUT_DATA_DIR_PATH, 'precomputed results')
dst_top_dir = OUTPUT_DATA_DIR_PATH

# shutil.copytree's dirs_exist_ok was not introduced until Python 3.8, but
# Colab uses 3.6, so walk the source tree and copy it manually
for root, dirs, files in os.walk(src_top_dir):
    for src_dir in dirs:
        src_dir_path = os.path.join(root, src_dir)
        dst_dir_path = src_dir_path.replace(src_top_dir, dst_top_dir, 1)
        os.makedirs(dst_dir_path, exist_ok=True)
    for src_file in files:
        src_file_path = os.path.join(root, src_file)
        dst_file_path = src_file_path.replace(src_top_dir, dst_top_dir, 1)
        shutil.copy2(src_file_path, dst_file_path)
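
The copy loop above can also be written as a reusable function. The sketch below is one possible variant, not the notebook's own code: `merge_copy_tree` is a hypothetical name, and it uses `os.path.relpath` instead of string replacement to map source paths onto destination paths. On Python 3.8 or later, `shutil.copytree(src, dst, dirs_exist_ok=True)` offers the same merging behavior directly.

```python
import os
import shutil


def merge_copy_tree(src_top_dir, dst_top_dir):
    """Copy src_top_dir into dst_top_dir, overwriting same-named files and
    leaving any other files already in dst_top_dir untouched."""
    for root, dirs, files in os.walk(src_top_dir):
        # path of the current folder relative to the top of the source tree
        rel_root = os.path.relpath(root, src_top_dir)
        dst_root = os.path.normpath(os.path.join(dst_top_dir, rel_root))
        os.makedirs(dst_root, exist_ok=True)
        for file_name in files:
            shutil.copy2(os.path.join(root, file_name),
                         os.path.join(dst_root, file_name))
```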

## Clustering

The following four cells will use standard clustering techniques to group samples according to different omics data. Run each cell; note each cell contains a comment with an estimated running time. You'll see some output describing the inputs and results.

As each of these cells finishes, it will store the results in your Google Drive. For that reason, you **will not** need to re-run these cells when your notebook connects to a new virtual machine.

In [0]:
# takes about 9 minutes
do_clustering(\
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering1_genecopynumber.tsv'), \
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering_clinical_data.tsv'), \
    CLUSTERING1_DIR_PATH, 8, None, None, None, 0, None)

In [0]:
# takes about 3 minutes
do_clustering(\
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering2_exp_HiSeqV2.tsv'), \
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering_clinical_data.tsv'), \
    CLUSTERING2_DIR_PATH, 13, None, None, None, 0, None)

In [0]:
# takes less than 1 minute
do_clustering(\
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering3_hMethyl.tsv'), \
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering_clinical_data.tsv'), \
    CLUSTERING3_DIR_PATH, 19, None, None, None, 0, None)

In [0]:
# takes less than 1 minute
do_clustering(\
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering4_RPPA_RBN.tsv'), \
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering_clinical_data.tsv'), \
    CLUSTERING4_DIR_PATH, 8, None, None, None, 0, None)

### Network-Based Clustering

This fifth clustering analysis incorporates the knowledge network to improve results on sparse omics data. As with the clustering analyses above, run the cell and wait until it finishes.

As with all of the analysis cells, it will store the results in your Google Drive. For that reason, you **will not** need to re-run this cell when your notebook connects to a new virtual machine.

In [0]:
# takes almost 2 hours
do_clustering(\
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering5_mutation.tsv'), \
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering_clinical_data.tsv'), \
    CLUSTERING5_DIR_PATH, 14, '9606', \
    '/network/Gene/9606/hn_IntNet/9606.hn_IntNet.edge', 0.5, 0, None)

### Cluster-of-Clusters Analysis (COCA)

This sixth clustering analysis operates upon the cluster assignments generated by the previous five clustering analyses, along with the results of an additional clustering based on miRNA data. The additional clustering is not part of this notebook, but its results are provided in the folder of inputs so that this cell can load them. Again, run the cell and wait until it finishes.

As with all of the analysis cells, it will store the results in your Google Drive. For that reason, you **will not** need to re-run this cell when your notebook connects to a new virtual machine.
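
To make the idea concrete, here is a minimal sketch of how several sets of cluster assignments can be assembled into a single binary matrix that can then be clustered again. It is an illustration under assumptions, not the code behind `get_cluster_binary_dataframe`: the function `build_indicator_rows`, the row naming scheme, and the choice to give samples missing from a clustering a value of 0 are all hypothetical.

```python
def build_indicator_rows(clusterings):
    """Turn cluster assignments into 0/1 indicator rows.

    clusterings: dict mapping clustering name -> dict of sample -> cluster label
    Returns (samples, rows), where rows maps '<name>_<label>' to a list of 0/1
    flags in the same order as samples.
    """
    samples = sorted({s for labels in clusterings.values() for s in labels})
    rows = {}
    for name in sorted(clusterings):
        labels = clusterings[name]
        for cluster in sorted(set(labels.values())):
            # 1 where the sample belongs to this cluster, 0 otherwise
            rows['%s_%s' % (name, cluster)] = [
                1 if labels.get(s) == cluster else 0 for s in samples]
    return samples, rows


# hypothetical toy assignments from two clusterings
toy_clusterings = {
    'expr': {'s1': 0, 's2': 1, 's3': 1},
    'cnv': {'s1': 0, 's2': 0},
}
toy_samples, toy_rows = build_indicator_rows(toy_clusterings)
```

Each row of the result plays the role of one "feature" in an omics file, which is why the assembled matrix can be passed to the same clustering routine as the other inputs.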

In [0]:
# takes about 8 minutes

# gather the outputs from the five previous clustering analyses, along with the
# miRNA clusters
raw_coca_inputs = [
    get_path_to_newest_file_having_prefix(CLUSTERING1_DIR_PATH, 'samples_label_by_cluster'),
    get_path_to_newest_file_having_prefix(CLUSTERING2_DIR_PATH, 'samples_label_by_cluster'),
    get_path_to_newest_file_having_prefix(CLUSTERING3_DIR_PATH, 'samples_label_by_cluster'),
    get_path_to_newest_file_having_prefix(CLUSTERING4_DIR_PATH, 'samples_label_by_cluster'),
    get_path_to_newest_file_having_prefix(CLUSTERING5_DIR_PATH, 'samples_label_by_cluster'),
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering6_miRNA.tsv')
]    

# assemble the raw inputs into a single file formatted like an omics file
os.makedirs(CLUSTERING6_DIR_PATH, exist_ok=True)
coca_input_file_path = os.path.join(CLUSTERING6_DIR_PATH, 'input.tsv')
temp_dir_path = mkdtemp()
try:
    # use input_path rather than input, which would shadow the built-in
    for input_path in raw_coca_inputs:
        shutil.copy(input_path, temp_dir_path)
    coca_input_df = get_cluster_binary_dataframe(\
        [os.path.basename(input_path) for input_path in raw_coca_inputs], temp_dir_path).T
    coca_input_df.to_csv(coca_input_file_path, sep='\t')
finally:
    shutil.rmtree(temp_dir_path)

do_clustering(\
    coca_input_file_path, \
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering_clinical_data.tsv'), \
    CLUSTERING6_DIR_PATH, 13, None, None, None, 200, 0.8)

We can also perform a survival analysis on the identified clusters.
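
The cell below relies on the `lifelines` package for the Kaplan-Meier curves and the log-rank test. For intuition, the Kaplan-Meier estimate itself can be sketched in plain Python: at each time with at least one event, the survival probability is multiplied by the fraction of at-risk samples that did not have the event. The function name `kaplan_meier` and the toy data are assumptions for illustration; this is not how `lifelines` is implemented.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] for each time with an event.

    durations: list of follow-up times
    events: list of 1 (event observed) or 0 (censored), one per duration
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # samples with an event or censored at t are no longer at risk
        at_risk -= leaving
    return curve


# hypothetical toy data: the second sample is censored at time 2
toy_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Censored samples never lower the curve themselves, but they shrink the at-risk pool, so later events cause larger drops.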

In [0]:
# takes about 1 second

# load the file containing the clinical data
# survival data are found in columns _OS_IND (boolean representing event) and
# _OS (float indicating time)
phenotype_df = pd.read_csv(\
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering_clinical_data.tsv'), \
    sep='\t', index_col=0, header=0)

# load the cluster assignments
cluster_labels_file_path = get_path_to_newest_file_having_prefix(\
    CLUSTERING6_DIR_PATH, 'samples_label_by_cluster')
cluster_labels_df = pd.read_csv(cluster_labels_file_path, \
    sep='\t', index_col=0, header=None, names=['cluster'])
# combine the survival columns with the cluster labels; pd.concat with axis=1
# aligns rows on the sample index, so no explicit reordering is needed
combined_df = pd.concat([phenotype_df['_OS_IND'], phenotype_df['_OS'], \
        cluster_labels_df], axis=1, sort=True)

# retain only the samples that have a time and a cluster
combined_df.dropna(subset=['_OS', 'cluster'], inplace=True)

# treat missing event indicators as censored (0); assign the result back
# rather than calling fillna in place on a column selection
combined_df['_OS_IND'] = combined_df['_OS_IND'].fillna(0)

# calculate p-value
test_stats = multivariate_logrank_test(combined_df['_OS'].values, \
    combined_df['cluster'].values, combined_df['_OS_IND'].values)

# draw plot
fig = plt.figure()
ax = fig.gca()

kmf = KaplanMeierFitter()

for name, grouped_df in combined_df.groupby('cluster'):
    kmf.fit(grouped_df["_OS"], grouped_df["_OS_IND"], \
        label='Cluster ' + str(int(name)))
    kmf.plot(ax=ax, show_censors=True)

plt.title('P-value = %s' % test_stats.p_value)
plt.xlabel('Time (days)');

## Gene Prioritization

The following three cells will analyze gene expression data to determine the genes most associated with phenotypes of interest.

In the first of the three prioritization cells, the phenotypes are PANCAN disease types, and the method is a standard prioritization technique. Run the cell and wait until it finishes.

As with all of the analysis cells, it will store the results in your Google Drive. For that reason, you **will not** need to re-run this cell when your notebook connects to a new virtual machine.
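
As a rough illustration of t-test-based prioritization, the sketch below scores each gene by how strongly its expression differs between samples inside and outside one phenotype, then ranks genes by the size of that score. It is a simplification under assumptions: `welch_t` and `rank_genes` are hypothetical names, Welch's form of the statistic is chosen here for illustration, and the pipeline's exact statistic and the meaning of its other parameters are not shown in this notebook.

```python
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two samples with possibly unequal variances.

    Each group needs at least two values, and the groups must not both have
    zero variance.
    """
    standard_error = (variance(group_a) / len(group_a)
                      + variance(group_b) / len(group_b)) ** 0.5
    return (mean(group_a) - mean(group_b)) / standard_error


def rank_genes(expression, in_phenotype):
    """Rank genes by decreasing absolute t statistic.

    expression: dict mapping gene -> list of values, one per sample
    in_phenotype: list of booleans, one per sample
    """
    scores = {}
    for gene, values in expression.items():
        inside = [v for v, flag in zip(values, in_phenotype) if flag]
        outside = [v for v, flag in zip(values, in_phenotype) if not flag]
        scores[gene] = welch_t(inside, outside)
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)


# hypothetical toy data: G1 differs clearly between the groups, G2 does not
toy_expression = {
    'G1': [5.0, 5.2, 4.8, 1.0, 1.1, 0.9],
    'G2': [2.0, 3.0, 2.5, 2.4, 2.6, 2.9],
}
toy_ranking = rank_genes(toy_expression, [True, True, True, False, False, False])
```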

In [0]:
# takes about 6 minutes
do_prioritization(\
    os.path.join(INPUT_DATA_DIR_PATH, 'in_prioritization_expr.tsv'), \
    os.path.join(INPUT_DATA_DIR_PATH, 'pancan_disease_types.gXc'), \
    PRIORITIZATION1_DIR_PATH, 't_test', 'average', 100, 50, None, None, None)

In the second of the three prioritization cells, the phenotypes are again the PANCAN disease types, but the method incorporates the knowledge network. Again, run the cell and wait until it finishes.

As with all of the analysis cells, it will store the results in your Google Drive. For that reason, you **will not** need to re-run this cell when your notebook connects to a new virtual machine.

In [0]:
# takes about 14 minutes
do_prioritization(\
    os.path.join(INPUT_DATA_DIR_PATH, 'in_prioritization_expr.tsv'), \
    os.path.join(INPUT_DATA_DIR_PATH, 'pancan_disease_types.gXc'), \
    PRIORITIZATION2_DIR_PATH, 't_test', 'average', 100, 50, '9606', \
    '/network/Gene/9606/hn_IntNet/9606.hn_IntNet.edge', 0.5)

In the third of the three prioritization cells, the phenotypes are the COCA cluster assignments. The standard method is used.

Run the cell and wait until it finishes.

As with all of the analysis cells, it will store the results in your Google Drive. For that reason, you **will not** need to re-run this cell when your notebook connects to a new virtual machine.

In [0]:
# takes about 6 minutes
do_prioritization(\
    os.path.join(INPUT_DATA_DIR_PATH, 'in_prioritization_expr.tsv'), \
    get_path_to_newest_file_having_prefix(CLUSTERING6_DIR_PATH, 'samples_label_by_cluster'), \
    PRIORITIZATION3_DIR_PATH, 't_test', 'average', 100, 50, None, None, None)

## Gene-Set Characterization

The final three cells compare the top genes found by the gene prioritization analyses with gene sets from the Gene Ontology database and a set of known cancer drivers.

Each cell in this sequence corresponds to one of the three gene prioritization analyses above. Run each cell and wait for it to finish.

As with all of the analysis cells, these will store the results in your Google Drive. For that reason, you **will not** need to re-run these cells when your notebook connects to a new virtual machine.
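
Because no interaction network is supplied in these cells, `do_characterization` uses the `fisher` method. The sketch below shows the underlying idea: given the sizes of the gene universe, a gene set, and your list of top genes, it computes the probability of seeing at least the observed overlap by chance. The function `enrichment_p_value` and its argument names are hypothetical, the one-sided form is an assumption, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def enrichment_p_value(universe_size, gene_set_size, query_size, overlap):
    """One-sided Fisher exact p-value: the hypergeometric probability of an
    overlap at least as large as the one observed."""
    total = comb(universe_size, query_size)
    upper = min(gene_set_size, query_size)
    # sum the probabilities of every overlap from the observed one upward
    tail = sum(comb(gene_set_size, k)
               * comb(universe_size - gene_set_size, query_size - k)
               for k in range(overlap, upper + 1))
    return tail / total


# hypothetical toy case: all 3 top genes fall in a 3-gene set, out of 10 genes
toy_p_value = enrichment_p_value(10, 3, 3, 3)
```

A small p-value suggests that the top genes overlap the gene set more than random selection would explain.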

In [0]:
# takes less than 1 minute
for property_edge_file_path in [\
        '/network/Property/9606/gene_ontology/9606.gene_ontology.edge', \
        '/network/Property/9606/cancer_driver_genes/9606.cancer_driver_genes.edge']:
    do_characterization(\
        get_path_to_newest_file_having_prefix(PRIORITIZATION1_DIR_PATH, 'top_'), \
        os.path.join(CHARACTERIZATION1_DIR_PATH, os.path.basename(property_edge_file_path)), \
        '9606', property_edge_file_path, None, None)

In [0]:
# takes less than 1 minute
for property_edge_file_path in [\
        '/network/Property/9606/gene_ontology/9606.gene_ontology.edge', \
        '/network/Property/9606/cancer_driver_genes/9606.cancer_driver_genes.edge']:
    do_characterization(\
        get_path_to_newest_file_having_prefix(PRIORITIZATION2_DIR_PATH, 'top_'), \
        os.path.join(CHARACTERIZATION2_DIR_PATH, os.path.basename(property_edge_file_path)), \
        '9606', property_edge_file_path, None, None)

In [0]:
# takes less than 1 minute
for property_edge_file_path in [\
        '/network/Property/9606/gene_ontology/9606.gene_ontology.edge', \
        '/network/Property/9606/cancer_driver_genes/9606.cancer_driver_genes.edge']:
    do_characterization(\
        get_path_to_newest_file_having_prefix(PRIORITIZATION3_DIR_PATH, 'top_'), \
        os.path.join(CHARACTERIZATION3_DIR_PATH, os.path.basename(property_edge_file_path)), \
        '9606', property_edge_file_path, None, None)