<a href="https://colab.research.google.com/github/matthewberry/uiuc_com_dsp/blob/master/DSP_genomics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Using This Notebook

This notebook is an interactive environment that combines explanatory text with executable code. It provides computational tools useful in genomics, and you will familiarize yourself with them by stepping through a series of analyses that examine related data sets from multiple angles. This suite of tools and examples can then serve as a starting point for a project of your own design.

If you are new to notebooks, you might find this introduction helpful: [Overview of Colaboratory Features](https://colab.research.google.com/notebooks/basic_features_overview.ipynb). You might also want to refer to the [Python 3 documentation](https://docs.python.org/3/).

Note that after a period of inactivity, Google will disconnect your notebook from the virtual machine that had been running it. When you return, Google will connect to a new virtual machine. Any data files you saved to your Google Drive will remain, but any variables or methods defined in your previous virtual machine will have to be reloaded. (You will know when this happens, because cells that previously ran without error will suddenly stop working.) To reload the variable and method definitions, you can simply re-run the cells where they're defined. This notebook will explain which cells might need to be re-run.
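
One way to tell whether the definitions from the setup cells are still loaded is to check the notebook's global namespace. The sketch below is only an illustration: `find_missing_names` is a hypothetical helper that is not part of this notebook, and the listed names are a small sample of the functions defined in later cells.

```python
def find_missing_names(required_names, namespace):
    """Return the names in required_names that are not defined in namespace."""
    return [name for name in required_names if name not in namespace]

# A sample of the names defined by the setup cells later in this notebook.
REQUIRED_NAMES = ['fetch_network', 'get_network_species', 'do_clustering']

missing = find_missing_names(REQUIRED_NAMES, globals())
if missing:
    print("Re-run the setup cells; missing definitions: " + ", ".join(missing))
else:
    print("All listed setup definitions are loaded.")
```

If the check reports missing names, re-running the cells that define them restores the definitions.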

Preparation steps:
* Enable Google Apps.
* Sign in to your Illinois Google account.
* Open this notebook.
* Select File -> Save a copy in Drive. (Then select File -> Locate in Drive to see where the copy is saved; you can return there to open it later if you need to.)
* Open the data directory.
* Save it to your My Drive.


## Installation

The cell below installs software required to perform the analyses. Run the cell and wait for it to complete, which might take several minutes. You'll see lots of text output as the cell runs, but there's no need to read it unless the following cell fails.

You **will** need to re-run this cell whenever your notebook connects to a new virtual machine.

In [1]:
!pip3 install -I pyyaml==5.1.2
!pip3 install xmlrunner==1.7.7 redis==3.3.8 lifelines==0.22.8
!pip3 install git+https://github.com/KnowEnG/KnowEnG_Pipelines_Library.git@mjberry/update_dependencies
!pip3 install git+https://github.com/KnowEnG/Data_Cleanup_Pipeline.git@mjberry/create_package
!pip3 install git+https://github.com/KnowEnG/General_Clustering_Pipeline.git@mjberry/create_package
!pip3 install git+https://github.com/KnowEnG/Samples_Clustering_Pipeline.git@mjberry/create_package
!pip3 install git+https://github.com/KnowEnG/Feature_Prioritization_Pipeline.git@mjberry/create_package
!pip3 install git+https://github.com/KnowEnG/Gene_Prioritization_Pipeline.git@mjberry/create_package
!pip3 install git+https://github.com/KnowEnG/Geneset_Characterization_Pipeline.git@mjberry/create_package
!pip3 install git+https://github.com/KnowEnG/Spreadsheets_Transformation.git@mjberry/create_package

Collecting pyyaml==5.1.2
Installing collected packages: pyyaml
Successfully installed pyyaml-5.1.2
Collecting git+https://github.com/KnowEnG/KnowEnG_Pipelines_Library.git@mjberry/update_dependencies
  Cloning https://github.com/KnowEnG/KnowEnG_Pipelines_Library.git (to revision mjberry/update_dependencies) to /tmp/pip-req-build-5k1lfa0k
  Running command git clone -q https://github.com/KnowEnG/KnowEnG_Pipelines_Library.git /tmp/pip-req-build-5k1lfa0k
  Running command git checkout -b mjberry/update_dependencies --track origin/mjberry/update_dependencies
  Switched to a new branch 'mjberry/update_dependencies'
  Branch 'mjberry/update_dependencies' set up to track remote branch 'mjberry/update_dependencies' from 'origin'.
Building wheels for collected packages: knpackage
  Building wheel for knpackage (setup.py) ... done
  Created wheel for knpackage: filename=knpackage-0.1.27-cp36-none-any.whl size=14881 sha256=d6175f925beb654a74903cb8234c7ce10862a85019ee9b73826c338481d6760
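
If you are unsure whether the installation cell has been run on the current virtual machine, a quick check is to ask Python whether the installed modules can be found. This is a minimal sketch: `find_missing_modules` is a hypothetical helper, and the module names are taken from the import statements in the Environment Setup cell.

```python
import importlib.util

def find_missing_modules(module_names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Top-level modules imported by the Environment Setup cell.
REQUIRED_MODULES = [
    'kndatacleanup',
    'knfeatureprioritization',
    'kngeneprioritization',
    'kngenesetcharacterization',
    'knsamplesclustering',
    'kngeneralclustering',
    'knspreadsheetstransformation',
]

missing_modules = find_missing_modules(REQUIRED_MODULES)
if missing_modules:
    print("Re-run the installation cell; missing: " + ", ".join(missing_modules))
else:
    print("All required modules are installed.")
```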

## Environment Setup

The cell below sets up the environment for running the analyses. Run the cell and wait for it to complete. You won't see any text output this time.

You probably will not need to call any of the methods defined in this cell, and you probably will not need to edit anything in the cell.

You **will** need to re-run the cell whenever your notebook connects to a new virtual machine.


In [0]:
import csv
import os
import shutil
from tempfile import mkdtemp
import urllib.request

from IPython.display import HTML

from kndatacleanup import data_cleanup
from knfeatureprioritization import feature_prioritization
from kngeneprioritization import gene_prioritization
from kngenesetcharacterization import geneset_characterization
from knsamplesclustering import samples_clustering
from kngeneralclustering import general_clustering
from knspreadsheetstransformation.spreadsheets_transformation_toolbox import \
    get_cluster_binary_dataframe

NETWORK_DIR_PATH = '/network/'

REDIS_PARAMS = {
    'host': 'knowredis.knoweng.org',
    'password': 'KnowEnG',
    'port': 6379
}

NUM_CPUS = 2

def fetch_network(edge_file_path):
    """Download the network edge file to edge_file_path unless it already exists."""
    if not os.path.isfile(edge_file_path):
        url = "https://s3.amazonaws.com/KnowNets/KN-20rep-1706/" + \
            "userKN-20rep-1706/" + edge_file_path[len(NETWORK_DIR_PATH):]
        os.makedirs(os.path.dirname(edge_file_path), exist_ok=True)
        with urllib.request.urlopen(url) as response:
            with open(edge_file_path, 'wb') as out_file:
                shutil.copyfileobj(response, out_file)

def fetch_network_metadata():
    """Download the knowledge network metadata files that are not already present."""
    filenames = ['db_contents.txt', 'species_desc.txt', 'edge_type.txt']
    for filename in filenames:
        out_file_path = os.path.join(NETWORK_DIR_PATH, filename)
        if not os.path.isfile(out_file_path):
            url = "https://s3.amazonaws.com/KnowNets/KN-20rep-1706/" + \
                "userKN-20rep-1706/" + filename
            with urllib.request.urlopen(url) as response:
                with open(out_file_path, 'wb') as out_file:
                    shutil.copyfileobj(response, out_file)

def get_path_to_newest_file_having_prefix(search_dir_path, prefix):
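    """Return the path of the newest file (by os.path.getctime) in
    search_dir_path whose name starts with prefix."""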
    matches = [os.path.join(search_dir_path, name) for name \
        in os.listdir(search_dir_path) if name.startswith(prefix)]
    if matches:
        return max(matches, key=os.path.getctime)
    else:
        raise Exception("No file found with prefix " + prefix + " in " + \
            search_dir_path + ".")

def get_cleaned_file_path(original_file_path, results_dir_path):
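    """Return the expected path of the cleaned ("_ETL.tsv") version of
    original_file_path within results_dir_path."""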
    original_name = os.path.basename(original_file_path)
    original_name_root = os.path.splitext(original_name)[0]
    return os.path.join(results_dir_path, original_name_root + "_ETL.tsv")

def get_gene_map_file_path(original_file_path, results_dir_path):
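    """Return the expected path of the gene map ("_MAP.tsv") file for
    original_file_path within results_dir_path."""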
    original_name = os.path.basename(original_file_path)
    original_name_root = os.path.splitext(original_name)[0]
    return os.path.join(results_dir_path, original_name_root + "_MAP.tsv")

os.makedirs(NETWORK_DIR_PATH, exist_ok=True)
fetch_network_metadata()

!rm -rf /content/sample_data
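
The path helpers in the cell above are easiest to understand by example. The sketch below re-creates two of them as stand-alone functions so it can run without the rest of the notebook. The names `cleaned_file_path` and `newest_file_with_prefix` and the `/data` and `/results` paths are made up for illustration.

```python
import os
import tempfile

def cleaned_file_path(original_file_path, results_dir_path):
    """Stand-alone copy of get_cleaned_file_path: <results dir>/<name root>_ETL.tsv."""
    name_root = os.path.splitext(os.path.basename(original_file_path))[0]
    return os.path.join(results_dir_path, name_root + "_ETL.tsv")

def newest_file_with_prefix(search_dir_path, prefix):
    """Stand-alone version of get_path_to_newest_file_having_prefix."""
    matches = [os.path.join(search_dir_path, name)
               for name in os.listdir(search_dir_path)
               if name.startswith(prefix)]
    if not matches:
        raise FileNotFoundError(
            "No file found with prefix " + prefix + " in " + search_dir_path + ".")
    return max(matches, key=os.path.getctime)

# The cleaned file keeps the input's name root and gains an _ETL.tsv suffix.
print(cleaned_file_path('/data/in_clustering1_genecopynumber.tsv',
                        '/results/clustering1'))

# Only files whose names start with the prefix are considered.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ['samples_label_by_cluster_run1.tsv', 'other_output.tsv']:
        open(os.path.join(demo_dir, name), 'w').close()
    newest = newest_file_with_prefix(demo_dir, 'samples_label_by_cluster')
    print(os.path.basename(newest))
```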

## Knowledge Network Utility Methods

The cell below defines several utility methods for working with the knowledge network. These methods are used in the example analyses and might be useful to you in your project. Run the cell and wait for it to complete. It won't produce any text output.

You probably will not need to edit anything within the cell.

A later cell shows how to use the knowledge network utility methods.

You **will** need to re-run the cell whenever your notebook connects to a new virtual machine.


In [0]:
def get_network_species():
    """Return a list of dicts describing the species in the knowledge network."""
    return_val = []
    species_file_path = os.path.join(NETWORK_DIR_PATH, 'species_desc.txt')
    with open(species_file_path) as csvfile:
        for row in csv.reader(csvfile, delimiter='\t'):
            return_val.append({
                'id': row[0],
                'short_latin_name': row[1],
                'latin_name': row[2],
                'familiar_name': row[3],
                'group_name': row[5]
            })
    return return_val

def display_network_species():
    """Return an HTML table of the species in the knowledge network."""
    html_string = "<table><tr><th>Familiar Name (Latin Name)</th><th>Species Id</th></tr>"
    for species in get_network_species():
        html_string += "<tr><td>" + species['familiar_name'] + " (" + \
            species['latin_name'] + ")</td><td>" + species['id'] + "</td></tr>"
    html_string += "</table>"
    return HTML(html_string)

def get_edge_type_name_to_pretty_name():
    """Return a dict mapping edge type names to their human-readable names."""
    return_val = {}
    file_path = os.path.join(NETWORK_DIR_PATH, 'edge_type.txt')
    with open(file_path) as csvfile:
        for row in csv.DictReader(csvfile, delimiter='\t'):
            return_val[row['et_name']] = row['pretty_name']
    return return_val

def get_interaction_networks(species_id):
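    """Return the interaction networks available for species_id as a list of
    dicts with 'name' and 'edge_file_path' keys."""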
    species_id = str(species_id) # user-friendliness
    return_val = []
    contents_file_path = os.path.join(NETWORK_DIR_PATH, 'db_contents.txt')
    with open(contents_file_path) as csvfile:
        edge_type_name_to_pretty_name = get_edge_type_name_to_pretty_name()
        for row in csv.DictReader(csvfile, delimiter='\t'):
            if row['n1_type'] == 'Gene' and row['taxon'] == species_id:
                return_val.append({
                    'name': edge_type_name_to_pretty_name[row['et_name']],
                    'edge_file_path': os.path.join(\
                        NETWORK_DIR_PATH, 'Gene', species_id, row['et_name'], \
                        species_id + '.' + row['et_name'] + '.edge')
                })
    return return_val

def display_interaction_networks(species_id):
    """Return an HTML table of the interaction networks available for species_id."""
    html_string = "<table><tr><th>Interaction Network Name</th><th>Edge File Path</th></tr>"
    for network in get_interaction_networks(species_id):
        html_string += "<tr><td>" + network['name'] + "</td><td>" + \
            network['edge_file_path'] + "</td></tr>"
    html_string += "</table>"
    return HTML(html_string)

def get_gene_property_networks(species_id):
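    """Return the gene property networks available for species_id as a list of
    dicts with 'name' and 'edge_file_path' keys."""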
    species_id = str(species_id) # user-friendliness
    return_val = []
    contents_file_path = os.path.join(NETWORK_DIR_PATH, 'db_contents.txt')
    with open(contents_file_path) as csvfile:
        edge_type_name_to_pretty_name = get_edge_type_name_to_pretty_name()
        for row in csv.DictReader(csvfile, delimiter='\t'):
            if row['n1_type'] == 'Property' and row['taxon'] == species_id:
                return_val.append({
                    'name': edge_type_name_to_pretty_name[row['et_name']],
                    'edge_file_path': os.path.join(\
                        NETWORK_DIR_PATH, 'Property', species_id, row['et_name'], \
                        species_id + '.' + row['et_name'] + '.edge')
                })
    return return_val

def display_gene_property_networks(species_id):
    """Return an HTML table of the gene property networks available for species_id."""
    html_string = "<table><tr><th>Interaction Network Name</th><th>Edge File Path</th></tr>"
    for network in get_gene_property_networks(species_id):
        html_string += "<tr><td>" + network['name'] + "</td><td>" + \
            network['edge_file_path'] + "</td></tr>"
    html_string += "</table>"
    return HTML(html_string)

### Using the Knowledge Network Utility Methods

The three cells below show how `display_network_species`, `display_interaction_networks`, and `display_gene_property_networks` can be called to view information about the knowledge network. This information can be useful in configuring analyses, as you'll see later.

These methods are based on three other methods defined in the cell above, `get_network_species`, `get_interaction_networks`, and `get_gene_property_networks`. The "get" versions return the same information as the "display" versions, but the "get" versions return it in a format convenient for use in code instead of a format that's easy to read.

You **will not** need to re-run these three cells whenever your notebook connects to a new virtual machine.

In [20]:
# display all species in the knowledge network
display_network_species()

Familiar Name (Latin Name),Species Id
Human (Homo sapiens),9606
Chimpanzee (Pan troglodytes),9598
Cow (Bos taurus),9913
Dog (Canis familiaris),9615
Macaque (Macaca mulatta),9544
Mouse (Mus musculus),10090
Pig (Sus scrofa),9823
Rat (Rattus norvegicus),10116
Chicken (Gallus gallus),9031
Clawed frog (Xenopus tropicalis),8364


In [21]:
# display interaction networks for rat (species id 10116)
display_interaction_networks('10116')

Interaction Network Name,Edge File Path
Blastp Protein Sequence Similarity,/network/Gene/10116/blastp_homology/10116.blastp_homology.edge
Pathway Commons Catalysis Precedes,/network/Gene/10116/pathcom_catalysis_precedes/10116.pathcom_catalysis_precedes.edge
Pathway Commons Controls Expression,/network/Gene/10116/pathcom_controls_expression_of/10116.pathcom_controls_expression_of.edge
Pathway Commons Controls Phosphorylation,/network/Gene/10116/pathcom_controls_phosphorylation_of/10116.pathcom_controls_phosphorylation_of.edge
Pathway Commons Controls State Change,/network/Gene/10116/pathcom_controls_state_change_of/10116.pathcom_controls_state_change_of.edge
Pathway Commons In Complex With,/network/Gene/10116/pathcom_in_complex_with/10116.pathcom_in_complex_with.edge
PPI Protein Complex Association,/network/Gene/10116/PPI_association/10116.PPI_association.edge
PPI Colocalization,/network/Gene/10116/PPI_colocalization/10116.PPI_colocalization.edge
PPI Direct Interaction,/network/Gene/10116/PPI_direct_interaction/10116.PPI_direct_interaction.edge
PPI Genetic Interaction,/network/Gene/10116/PPI_genetic_interaction/10116.PPI_genetic_interaction.edge


In [22]:
# display gene property networks for roundworm (species id 6239)
display_gene_property_networks('6239')

Interaction Network Name,Edge File Path
Gene Ontology,/network/Property/6239/gene_ontology/6239.gene_ontology.edge
Pathway Commons Pathways,/network/Property/6239/pathcom_pathway/6239.pathcom_pathway.edge
PFam Prot Domains,/network/Property/6239/pfam_prot/6239.pfam_prot.edge
Reactome Pathways Curated,/network/Property/6239/reactome_annotation/6239.reactome_annotation.edge
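
Because the "get" methods return lists of dictionaries, their results can be searched in code. The sketch below looks up an edge file path by network name. It uses two records copied from the rat output above in place of a live call to `get_interaction_networks('10116')`, and `find_edge_file_path` is a hypothetical helper that is not defined elsewhere in this notebook.

```python
# Sample records copied from the rat (species id 10116) output above.
# In the notebook, get_interaction_networks('10116') would supply the full list.
sample_networks = [
    {'name': 'PPI Direct Interaction',
     'edge_file_path': '/network/Gene/10116/PPI_direct_interaction/'
                       '10116.PPI_direct_interaction.edge'},
    {'name': 'Pathway Commons In Complex With',
     'edge_file_path': '/network/Gene/10116/pathcom_in_complex_with/'
                       '10116.pathcom_in_complex_with.edge'},
]

def find_edge_file_path(networks, network_name):
    """Return the edge file path of the network with the given display name."""
    for network in networks:
        if network['name'] == network_name:
            return network['edge_file_path']
    raise KeyError("No network named " + network_name)

ppi_edge_file_path = find_edge_file_path(sample_networks, 'PPI Direct Interaction')
print(ppi_edge_file_path)
```

A path found this way can be passed as the `interaction_network_edge_file_path` argument of the analysis methods defined later.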


## Analytics Methods

The cell below defines methods for running clustering, prioritization, and gene-set characterization. Run the cell and wait for it to complete. It won't produce any output.

You **will** need to re-run the cell whenever your notebook connects to a new virtual machine.


In [0]:
def do_clustering(\
    omics_file_path, phenotype_file_path, results_dir_path, num_clusters, \
    species_id, interaction_network_edge_file_path, network_influence, \
    num_bootstraps, bootstrap_sample_fraction):
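    """Clean the input files, then cluster the samples.

    Uses network-based clustering when interaction_network_edge_file_path
    is given and standard hierarchical clustering otherwise. Bootstrapping
    is enabled when num_bootstraps is greater than zero.
    """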
    species_id = str(species_id) # user-friendliness
    os.makedirs(results_dir_path, exist_ok=True)

    if interaction_network_edge_file_path is None:
        pipeline_type = 'general_clustering_pipeline'
    else:
        fetch_network(interaction_network_edge_file_path)
        pipeline_type = 'samples_clustering_pipeline'

    cleanup_parameters = {
        'spreadsheet_name_full_path': omics_file_path,
        'pipeline_type': pipeline_type,
        'results_directory': results_dir_path
    }
    if phenotype_file_path is not None:
        cleanup_parameters['phenotype_name_full_path'] = phenotype_file_path
    if interaction_network_edge_file_path is not None:
        cleanup_parameters.update({
            'gg_network_name_full_path': interaction_network_edge_file_path,
            'taxonid': species_id,
            'source_hint': '',
            'redis_credential': {
                'host': REDIS_PARAMS['host'],
                'port': REDIS_PARAMS['port'],
                'password': REDIS_PARAMS['password']
            }
        })
    data_cleanup.run_pipelines(cleanup_parameters, data_cleanup.SELECT[pipeline_type])

    clustering_parameters = {
        'spreadsheet_name_full_path': get_cleaned_file_path(omics_file_path, results_dir_path),
        'results_directory': results_dir_path,
        'processing_method': 'parallel',
        'parallelism': NUM_CPUS,
        'number_of_clusters': num_clusters,
        'run_directory': results_dir_path,
        'tmp_directory': './tmp'
    }
    if phenotype_file_path is not None:
        clustering_parameters.update({
            'phenotype_name_full_path': get_cleaned_file_path(phenotype_file_path, results_dir_path),
            'threshold': 15
        })

    method_prefix = ''
    if num_bootstraps > 0:
        clustering_parameters.update({
            'number_of_bootstraps': num_bootstraps,
            'rows_sampling_fraction': 1.0,
            'cols_sampling_fraction': bootstrap_sample_fraction
        })
        method_prefix = 'cc_'

    if interaction_network_edge_file_path is not None:
        clustering_parameters.update({
            'gg_network_name_full_path': interaction_network_edge_file_path,
            'rwr_max_iterations': 100,
            'rwr_convergence_tolerence': 1.0e-4,
            'rwr_restart_probability': network_influence,
            'top_number_of_genes': 100,
            'nmf_conv_check_freq': 50,
            'nmf_max_invariance': 200,
            'nmf_max_iterations': 10000,
            'nmf_penalty_parameter': 1400,
            'method': method_prefix + 'net_nmf'
        })
        samples_clustering.SELECT[clustering_parameters['method']](clustering_parameters)
    else:
        clustering_parameters.update({
            'top_number_of_rows': 100,
            'affinity_metric': 'euclidean',
            'linkage_criterion': 'ward',
            'method': method_prefix + 'hclust'
        })
        general_clustering.SELECT[clustering_parameters['method']](clustering_parameters)

def do_prioritization(\
    omics_file_path, phenotype_file_path, results_dir_path, \
    correlation_measure, missing_value_strategy, num_exported_features, \
    num_response_correlated_features, species_id, \
    interaction_network_edge_file_path, network_influence):
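    """Clean the input files, then rank features by association with phenotypes.

    Uses network-based gene prioritization when
    interaction_network_edge_file_path is given and standard feature
    prioritization otherwise.
    """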
    species_id = str(species_id) # user-friendliness
    os.makedirs(results_dir_path, exist_ok=True)

    if interaction_network_edge_file_path is None:
        pipeline_type = 'feature_prioritization_pipeline'
    else:
        fetch_network(interaction_network_edge_file_path)
        pipeline_type = 'gene_prioritization_pipeline'

    cleanup_parameters = {
        'spreadsheet_name_full_path': omics_file_path,
        'phenotype_name_full_path': phenotype_file_path,
        'pipeline_type': pipeline_type,
        'correlation_measure': correlation_measure, # t_test, pearson, edgeR
        'impute': missing_value_strategy, # average, remove, reject
        'results_directory': results_dir_path
    }
    if interaction_network_edge_file_path is not None:
        cleanup_parameters.update({
            'taxonid': species_id,
            'source_hint': '',
            'redis_credential': {
                    'host': REDIS_PARAMS['host'],
                    'port': REDIS_PARAMS['port'],
                    'password': REDIS_PARAMS['password']
            }
        })
    data_cleanup.run_pipelines(cleanup_parameters, data_cleanup.SELECT[pipeline_type])

    prioritization_parameters = {
        'correlation_measure': correlation_measure,
        'spreadsheet_name_full_path': get_cleaned_file_path(omics_file_path, results_dir_path),
        'phenotype_name_full_path': get_cleaned_file_path(phenotype_file_path, results_dir_path),
        'results_directory': results_dir_path,
        'top_gamma_of_sort': num_exported_features,
        'max_cpu': NUM_CPUS
    }
    if interaction_network_edge_file_path is not None:
        prioritization_parameters.update({
            'gg_network_name_full_path': interaction_network_edge_file_path,
            'rwr_max_iterations': 100,
            'rwr_convergence_tolerence': 1.0e-4,
            'rwr_restart_probability': network_influence,
            'top_beta_of_sort': num_response_correlated_features,
            'method': 'net_correlation'
        })
        gene_prioritization.net_correlation(prioritization_parameters)
    else:
        prioritization_parameters.update({
            'top_beta_of_sort': num_exported_features,
            'method': 'correlation',
        })
        feature_prioritization.correlation(prioritization_parameters)

def do_characterization(\
    gene_matrix_file_path, results_dir_path, species_id, \
    gene_property_edge_file_path, interaction_network_edge_file_path, \
    network_influence):
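    """Clean the gene matrix, then compare its gene sets with a gene property network.

    Uses the DRaWR method when interaction_network_edge_file_path is given
    and the Fisher method otherwise.
    """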
    species_id = str(species_id) # user-friendliness
    os.makedirs(results_dir_path, exist_ok=True)

    fetch_network(gene_property_edge_file_path)

    cleanup_parameters = {
        'spreadsheet_name_full_path': gene_matrix_file_path,
        'pipeline_type': 'geneset_characterization_pipeline',
        'results_directory': results_dir_path,
        'taxonid': species_id,
        'source_hint': '',
        'redis_credential': {
            'host': REDIS_PARAMS['host'],
            'port': REDIS_PARAMS['port'],
            'password': REDIS_PARAMS['password']
        }
    }
    data_cleanup.run_pipelines(
        cleanup_parameters, data_cleanup.SELECT[cleanup_parameters['pipeline_type']])

    characterization_parameters = {
        'spreadsheet_name_full_path': get_cleaned_file_path(gene_matrix_file_path, results_dir_path),
        'gene_names_map': get_gene_map_file_path(gene_matrix_file_path, results_dir_path),
        'results_directory': results_dir_path,
        'pg_network_name_full_path': gene_property_edge_file_path,
        'max_cpu': NUM_CPUS
    }
    if interaction_network_edge_file_path is None:
        characterization_parameters.update({
            'method': 'fisher'
        })
        geneset_characterization.fisher(characterization_parameters)
    else:
        fetch_network(interaction_network_edge_file_path)
        characterization_parameters.update({
            'method': 'DRaWR',
            'rwr_max_iterations': 500,
            'rwr_convergence_tolerence': 1.0e-4,
            'rwr_restart_probability': network_influence,
            'gg_network_name_full_path': interaction_network_edge_file_path
        })
        geneset_characterization.DRaWR(characterization_parameters)
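
The clustering method that `do_clustering` runs depends on two of its arguments: whether an interaction network is supplied and whether bootstrapping is requested. The sketch below isolates that selection logic. `choose_clustering_method` is a hypothetical helper written for illustration; the method names it returns are the ones used in the cell above.

```python
def choose_clustering_method(uses_network, num_bootstraps):
    """Mirror how do_clustering builds its 'method' parameter."""
    # Bootstrapping adds the 'cc_' prefix.
    method_prefix = 'cc_' if num_bootstraps > 0 else ''
    # A network selects 'net_nmf'; otherwise hierarchical clustering is used.
    base_method = 'net_nmf' if uses_network else 'hclust'
    return method_prefix + base_method

print(choose_clustering_method(False, 0))    # hclust
print(choose_clustering_method(True, 0))     # net_nmf
print(choose_clustering_method(False, 200))  # cc_hclust
print(choose_clustering_method(True, 200))   # cc_net_nmf
```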


## Connect to Google Drive

The cell below enables this notebook to use your Google Drive for file storage. Subsequent cells will use this access to load the example files you copied earlier and to save results of the example analyses. You might also find this helpful in running your own analyses.

Run the cell and click on the link that appears in the output. On the linked page, select your illinois.edu account and grant the requested permissions. The page will then display a code. Copy the code and paste it in the box that appears in the output below. Then press Enter.

You **will** need to re-run the cell whenever your notebook connects to a new virtual machine.


In [6]:
from google.colab import drive
GDRIVE_MOUNT_PATH = '/content/gdrive'
drive.mount(GDRIVE_MOUNT_PATH)

Mounted at /content/gdrive


## Setting File Locations

In the cell below, we will tell the notebook where the example files can be found and where the results should be saved.

To confirm the location, find the arrow symbol (>) near the top left corner of the portion of your screen that shows the notebook content. Click it to reveal a panel with three tabs labeled `Table of contents`, `Code snippets`, and `Files`. Click on the `Files` tab.

In the `Files` tab, you should see one folder named `gdrive`. Click the arrow next to the `gdrive` folder to expand it, and continue navigating through the folders until you find the `example_analyses` folder copied previously. Right-click on `example_analyses` and select `Copy path`. Paste the value into the cell below, and compare it to the value assigned to `INPUT_DATA_DIR_PATH`. If the values are different, replace the pre-coded value with the one you pasted. Once you have done that, run the cell.

If at any point you open the `Files` tab or click its `REFRESH` button and do not see `gdrive`, you might need to re-run the previous cell.

Note this cell also specifies the output directories that will be used for the different analyses in the example. They are defined here, in the last quick-running cell before the analyses below, because the variables will need to be refreshed if your notebook connects to a new virtual machine.

You **will** need to re-run the cell whenever your notebook connects to a new virtual machine, but the value assigned to `INPUT_DATA_DIR_PATH` will not change unless you move the folder within your Google Drive.


In [0]:
INPUT_DATA_DIR_PATH = '/content/gdrive/My Drive/College of Medicine Data Science Project/Genomics/example_analyses'
OUTPUT_DATA_DIR_PATH = os.path.join(INPUT_DATA_DIR_PATH, 'results')

CLUSTERING1_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'clustering1')
CLUSTERING2_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'clustering2')
CLUSTERING3_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'clustering3')
CLUSTERING4_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'clustering4')
CLUSTERING5_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'clustering5')
CLUSTERING6_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'clustering6')

PRIORITIZATION1_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'prioritization1')
PRIORITIZATION2_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'prioritization2')
PRIORITIZATION3_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'prioritization3')

CHARACTERIZATION1_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'characterization1')
CHARACTERIZATION2_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'characterization2')
CHARACTERIZATION3_DIR_PATH = os.path.join(OUTPUT_DATA_DIR_PATH, 'characterization3')
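
Before starting the long-running analyses, it can help to confirm that the input files are where the notebook expects them. The sketch below is one possible check. `find_missing_files` is a hypothetical helper, and the demonstration uses a temporary directory in place of `INPUT_DATA_DIR_PATH` so that it runs anywhere.

```python
import os
import tempfile

def find_missing_files(dir_path, file_names):
    """Return the file names that do not exist as files in dir_path."""
    return [name for name in file_names
            if not os.path.isfile(os.path.join(dir_path, name))]

# Two of the input files used by the clustering cells in this notebook.
EXPECTED_FILES = ['in_clustering1_genecopynumber.tsv',
                  'in_clustering2_exp_HiSeqV2.tsv']

# Demonstration with a temporary directory containing only the first file.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, EXPECTED_FILES[0]), 'w').close()
    print(find_missing_files(demo_dir, EXPECTED_FILES))
```

In the notebook itself, the equivalent call would be `find_missing_files(INPUT_DATA_DIR_PATH, EXPECTED_FILES)`; an empty list means every listed file was found.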


## Clustering

The following four cells will use standard clustering techniques to group samples according to different omics data. Run each cell; note each will take several minutes. You'll see some output describing the inputs and results.

As each of these cells finishes, it will store the results to your Google Drive. For that reason, you **will not** need to re-run these cells whenever your notebook connects to a new virtual machine.

In [0]:
do_clustering(\
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering1_genecopynumber.tsv'), \
    None, CLUSTERING1_DIR_PATH, 8, None, None, None, 0, None)

In [0]:
do_clustering(\
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering2_exp_HiSeqV2.tsv'), \
    None, CLUSTERING2_DIR_PATH, 13, None, None, None, 0, None)

In [0]:
do_clustering(\
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering3_hMethyl.tsv'), \
    None, CLUSTERING3_DIR_PATH, 19, None, None, None, 0, None)

In [0]:
do_clustering(\
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering4_RPPA_RBN.tsv'), \
    None, CLUSTERING4_DIR_PATH, 8, None, None, None, 0, None)

### Network-Based Clustering

This fifth clustering analysis incorporates the knowledge network to improve results on sparse omics data. As with the above clustering analyses, run the cell and wait until it completes.

As with all of the analysis cells, it will store the results to your Google Drive. For that reason, you **will not** need to re-run this cell whenever your notebook connects to a new virtual machine.

In [0]:
do_clustering(\
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering5_mutation.tsv'), \
    None, CLUSTERING5_DIR_PATH, 14, '9606', \
    '/network/Gene/9606/hn_IntNet/9606.hn_IntNet.edge', 0.5, 0, None)
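
The parameter names passed to the network-based pipeline (`rwr_restart_probability`, `rwr_max_iterations`, `rwr_convergence_tolerence`) suggest a random walk with restart over the interaction network. The sketch below shows the general technique on a tiny made-up graph. It is not the pipeline's implementation. It assumes that the restart probability is the chance of jumping back to the starting distribution at each step, which this notebook does not confirm.

```python
def random_walk_with_restart(adjacency, start, restart_probability,
                             max_iterations=100, tolerance=1.0e-4):
    """Spread the starting weights over an undirected graph.

    adjacency maps each node to a list of its neighbors (at least one each).
    start maps each node to its initial weight; the weights should sum to 1.
    """
    nodes = list(adjacency)
    current = dict(start)
    for _ in range(max_iterations):
        # Each node passes its weight to its neighbors in equal shares.
        spread = {node: 0.0 for node in nodes}
        for node in nodes:
            share = current[node] / len(adjacency[node])
            for neighbor in adjacency[node]:
                spread[neighbor] += share
        # Mix the spread weights with the starting weights.
        updated = {node: (1 - restart_probability) * spread[node]
                         + restart_probability * start[node]
                   for node in nodes}
        change = sum(abs(updated[node] - current[node]) for node in nodes)
        current = updated
        if change < tolerance:
            break
    return current

# Hypothetical four-gene chain A - B - C - D with a single mutated gene, A.
adjacency = {'A': ['B'], 'B': ['A', 'C'], 'C': ['B', 'D'], 'D': ['C']}
start = {'A': 1.0, 'B': 0.0, 'C': 0.0, 'D': 0.0}
scores = random_walk_with_restart(adjacency, start, restart_probability=0.5)
for gene in sorted(scores):
    print(gene, round(scores[gene], 3))
```

In this sketch the single nonzero value spreads to neighboring genes, with weight falling off with distance from gene A. That is the sense in which network smoothing can help with sparse data.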


### Cluster-of-Clusters Analysis (COCA)

This sixth clustering analysis operates upon the cluster assignments generated by the previous five clustering analyses. Again, run the cell and wait until it finishes.

As with all of the analysis cells, it will store the results to your Google Drive. For that reason, you **will not** need to re-run this cell whenever your notebook connects to a new virtual machine.

In [15]:
# gather the outputs from the five previous clustering analyses
raw_coca_inputs = [
    get_path_to_newest_file_having_prefix(CLUSTERING1_DIR_PATH, 'samples_label_by_cluster'),
    get_path_to_newest_file_having_prefix(CLUSTERING2_DIR_PATH, 'samples_label_by_cluster'),
    get_path_to_newest_file_having_prefix(CLUSTERING3_DIR_PATH, 'samples_label_by_cluster'),
    get_path_to_newest_file_having_prefix(CLUSTERING4_DIR_PATH, 'samples_label_by_cluster'),
    get_path_to_newest_file_having_prefix(CLUSTERING5_DIR_PATH, 'samples_label_by_cluster')
]

# assemble the raw inputs into a single file formatted like an omics file
os.makedirs(CLUSTERING6_DIR_PATH, exist_ok=True)  # to_csv needs the directory to exist
coca_input_file_path = os.path.join(CLUSTERING6_DIR_PATH, 'input.tsv')
temp_dir_path = mkdtemp()
for input_path in raw_coca_inputs:
    shutil.copy(input_path, temp_dir_path)
coca_input_df = get_cluster_binary_dataframe(\
    [os.path.basename(input_path) for input_path in raw_coca_inputs], temp_dir_path)
coca_input_df.to_csv(coca_input_file_path, sep='\t')
shutil.rmtree(temp_dir_path)

do_clustering(\
    coca_input_file_path, \
    os.path.join(INPUT_DATA_DIR_PATH, 'in_clustering_clinical_data.tsv'), \
    CLUSTERING6_DIR_PATH, 13, None, None, None, 200, 0.8)



Unexpected error during reading input file /content/gdrive/My Drive/College of Medicine Data Science Project/Genomics/example_analyses/results/clustering6/input_ETL.tsv: <class 'FileNotFoundError'>


FileNotFoundError: ignored
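
The idea behind assembling the COCA input can be sketched without the KnowEnG libraries: each cluster from each earlier analysis becomes a binary row marking which samples belong to it. The code below is only a conceptual illustration with made-up sample names and analysis labels. The exact layout and row names produced by `get_cluster_binary_dataframe` are not shown in this notebook and may differ.

```python
def build_cluster_indicator_rows(assignments_by_analysis):
    """Turn per-analysis cluster assignments into binary indicator rows.

    assignments_by_analysis maps an analysis name to a dict of
    sample name -> cluster label. Returns (samples, rows), where rows maps
    '<analysis>_<cluster>' to a list of 0/1 values ordered like samples.
    """
    samples = sorted({sample
                      for assignments in assignments_by_analysis.values()
                      for sample in assignments})
    rows = {}
    for analysis, assignments in assignments_by_analysis.items():
        for cluster in sorted(set(assignments.values())):
            rows[f"{analysis}_{cluster}"] = [
                1 if assignments.get(sample) == cluster else 0
                for sample in samples]
    return samples, rows

# Hypothetical assignments from two earlier clustering analyses.
example_assignments = {
    'copy_number': {'s1': 0, 's2': 1, 's3': 0},
    'expression': {'s1': 1, 's2': 1, 's3': 0},
}
samples, rows = build_cluster_indicator_rows(example_assignments)
print(samples)
for row_name, values in rows.items():
    print(row_name, values)
```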

## Gene Prioritization

The following three cells will analyze gene expression data to determine the genes most associated with phenotypes of interest.

In the first of the three prioritization cells, the phenotypes are PANCAN disease types, and the method is a standard prioritization technique. Run the cell and wait until it finishes.

As with all of the analysis cells, it will store the results to your Google Drive. For that reason, you **will not** need to re-run this cell whenever your notebook connects to a new virtual machine.

In [17]:
do_prioritization(\
    os.path.join(INPUT_DATA_DIR_PATH, 'in_prioritization_expr.tsv'), \
    os.path.join(INPUT_DATA_DIR_PATH, 'pancan_disease_types.gXc'), \
    PRIORITIZATION1_DIR_PATH, 't_test', 'average', 100, 50, None, None, None)

Unexpected error during reading input file /content/gdrive/My Drive/College of Medicine Data Science Project/Genomics/example_analyses/results/prioritization1/pancan_disease_types_ETL.tsv: <class 'FileNotFoundError'>


FileNotFoundError: ignored

In the second of the three prioritization cells, the phenotypes are again the PANCAN disease types, but the method incorporates the knowledge network. Again, run the cell and wait until it finishes.

As with all of the analysis cells, it will store the results to your Google Drive. For that reason, you **will not** need to re-run this cell whenever your notebook connects to a new virtual machine.

In [0]:
do_prioritization(\
    os.path.join(INPUT_DATA_DIR_PATH, 'in_prioritization_expr.tsv'), \
    os.path.join(INPUT_DATA_DIR_PATH, 'pancan_disease_types.gXc'), \
    PRIORITIZATION2_DIR_PATH, 't_test', 'average', 100, 50, '9606', \
    '/network/Gene/9606/hn_IntNet/9606.hn_IntNet.edge', 0.5)

In the third of the three prioritization cells, the phenotypes are the COCA cluster assignments. The standard method is used.

Run the cell and wait until it finishes.

As with all of the analysis cells, it will store the results to your Google Drive. For that reason, you **will not** need to re-run this cell whenever your notebook connects to a new virtual machine.

In [0]:
do_prioritization(\
    os.path.join(INPUT_DATA_DIR_PATH, 'in_prioritization_expr.tsv'), \
    os.path.join(INPUT_DATA_DIR_PATH, 'out_clustering6_assignments.tsv'), \
    PRIORITIZATION3_DIR_PATH, 't_test', 'average', 100, 50, None, None, None)
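
To give a feel for what prioritization does, the sketch below ranks a few made-up genes by the absolute Pearson correlation between their expression values and a phenotype. It is not the pipeline's implementation. The example cells use the `t_test` measure, and `pearson` is used here only because it is listed as an alternative in the code and is short to write in plain Python. The gene names and values are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)

def rank_features(expression_by_gene, phenotype, top_n):
    """Return the top_n gene names, ordered by absolute correlation."""
    scores = {gene: pearson(values, phenotype)
              for gene, values in expression_by_gene.items()}
    return sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)[:top_n]

# Hypothetical expression values for four samples and a binary phenotype.
phenotype = [0, 0, 1, 1]
expression_by_gene = {
    'GENE_A': [1.0, 1.2, 3.1, 2.9],
    'GENE_B': [2.0, 2.1, 1.9, 2.0],
    'GENE_C': [5.0, 4.8, 1.0, 1.2],
}
top_genes = rank_features(expression_by_gene, phenotype, top_n=2)
print(top_genes)
```

Ranking by absolute value keeps genes whose expression falls as the phenotype rises (here `GENE_C`) alongside genes whose expression rises with it.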

## Gene-Set Characterization

The final three cells compare the top genes found by the gene prioritization analyses with gene sets from the Gene Ontology database.

Each cell in this sequence corresponds to one of the three gene prioritization analyses above. Run each cell and wait for it to finish.

As with all of the analysis cells, these will store the results to your Google Drive. For that reason, you **will not** need to re-run these cells whenever your notebook connects to a new virtual machine.

In [0]:
do_characterization(\
    FIXME GENE MATRIX, CHARACTERIZATION1_DIR_PATH, '9606', \
    '/network/Property/9606/gene_ontology/9606.gene_ontology.edge',
    None, None)
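
When no interaction network is given, `do_characterization` selects the `fisher` method. The sketch below shows the kind of overlap test that name suggests: a one-sided Fisher exact test, computed from the hypergeometric distribution, for the overlap between a query gene list and a gene set. It is a generic illustration and not the pipeline's code. The one-sided form and the parameter names are assumptions, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def fisher_enrichment_p_value(universe_size, gene_set_size, query_size, overlap):
    """Probability of at least `overlap` shared genes arising by chance.

    universe_size: number of genes considered overall
    gene_set_size: number of genes in the annotated gene set
    query_size: number of genes in the query list
    overlap: number of query genes that are also in the gene set
    """
    max_overlap = min(gene_set_size, query_size)
    total_draws = comb(universe_size, query_size)
    favorable_draws = sum(
        comb(gene_set_size, k) * comb(universe_size - gene_set_size, query_size - k)
        for k in range(overlap, max_overlap + 1))
    return favorable_draws / total_draws

# Made-up numbers: 10 genes overall, a gene set of 5, a query list of 5.
print(fisher_enrichment_p_value(10, 5, 5, 5))  # all 5 query genes in the set
print(fisher_enrichment_p_value(10, 5, 5, 0))  # any overlap at all
```

A small p-value means the observed overlap would be unlikely if the query genes were drawn at random from the universe.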