<a href="https://colab.research.google.com/github/matthewberry/uiuc_com_dsp/blob/master/DSP_genomics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installation

The cell below installs software required to perform the analyses. Run the cell and wait for it to complete, which might take several minutes. You'll see lots of text output as the cell runs, but there's no need to read it unless the following cell fails.

Once you've run this cell and confirmed that the next cell also succeeds, you shouldn't need to run this cell again.

In [0]:
!pip3 install -I pyyaml==5.1.2 xmlrunner==1.7.7 redis==3.3.8
!pip3 install git+https://github.com/KnowEnG/KnowEnG_Pipelines_Library.git@mjberry/update_dependencies
!pip3 install git+https://github.com/KnowEnG/Data_Cleanup_Pipeline.git@mjberry/create_package
!pip3 install git+https://github.com/KnowEnG/General_Clustering_Pipeline.git@mjberry/create_package
!pip3 install git+https://github.com/KnowEnG/Samples_Clustering_Pipeline.git@mjberry/create_package
!pip3 install git+https://github.com/KnowEnG/Feature_Prioritization_Pipeline.git@mjberry/create_package
!pip3 install git+https://github.com/KnowEnG/Gene_Prioritization_Pipeline.git@mjberry/create_package
!pip3 install git+https://github.com/KnowEnG/Geneset_Characterization_Pipeline.git@mjberry/create_package

Collecting pyyaml==5.1.2
  Downloading https://files.pythonhosted.org/packages/e3/e8/b3212641ee2718d556df0f23f78de8303f068fe29cdaa7a91018849582fe/PyYAML-5.1.2.tar.gz (265kB)
     |████████████████████████████████| 266kB 4.9MB/s 
Collecting xmlrunner==1.7.7
  Downloading https://files.pythonhosted.org/packages/57/c0/a19e29bc6038a56bb690549573af6ea11a9d2a5c07aff2e27ed308c2cab9/xmlrunner-1.7.7.tar.gz
Collecting redis==3.3.8
  Downloading https://files.pythonhosted.org/packages/bd/64/b1e90af9bf0c7f6ef55e46b81ab527b33b785824d65300bb65636534b530/redis-3.3.8-py2.py3-none-any.whl (66kB)
     |████████████████████████████████| 71kB 23.2MB/s 
Building wheels for collected packages: pyyaml, xmlrunner
  Building wheel for pyyaml (setup.py) ... done
  Created wheel for pyyaml: filename=PyYAML-5.1.2-cp36-cp36m-linux_x86_64.whl size=44104 sha256=a0aaca84739e7febd0032bb407631c25f284246389870a773e63f6dfbf131454
  Stored in directory: /root/.cache/pip/wheels/d9/
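
If you would rather check the installation directly than wait for the next cell to fail, a small check like the one below can help. It is only a sketch: the module names are taken from the import statements in the Environment Setup cell, and `find_missing_modules` is a hypothetical helper, not part of any KnowEnG package.

```python
import importlib.util

# Module names copied from the imports in the Environment Setup cell.
KNOWENG_MODULES = [
    'kndatacleanup', 'knfeatureprioritization', 'kngeneprioritization',
    'kngenesetcharacterization', 'knsamplesclustering', 'kngeneralclustering',
]

def find_missing_modules(module_names):
    """Hypothetical helper: return the top-level modules that cannot be located."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

missing = find_missing_modules(KNOWENG_MODULES)
if missing:
    print("Not installed: " + ", ".join(missing))
else:
    print("All KnowEnG pipeline modules are importable.")
```

If any module is reported as not installed, rerun the installation cell and read its output for errors.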

## Environment Setup

This cell sets up the environment for running the analyses. Run the cell and wait for it to complete. You won't see any text output this time.

You won't need to run this cell again, and you probably won't need to call any of the methods it defines.

In [0]:
import csv
import os
import shutil
import urllib.request

from IPython.display import HTML

from kndatacleanup import data_cleanup
from knfeatureprioritization import feature_prioritization
from kngeneprioritization import gene_prioritization
from kngenesetcharacterization import geneset_characterization
from knsamplesclustering import samples_clustering
from kngeneralclustering import general_clustering

NETWORK_DIR_PATH = '/network/'

REDIS_PARAMS = {
    'host': 'knowredis.knoweng.org',
    'password': 'KnowEnG',
    'port': 6379
}

NUM_CPUS = 2

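def fetch_network(edge_file_path):
    """Download a knowledge network edge file unless it already exists locally."""
    if not os.path.isfile(edge_file_path):
        # the path below NETWORK_DIR_PATH mirrors the layout on the server
        relative_path = os.path.relpath(edge_file_path, NETWORK_DIR_PATH)
        url = "https://s3.amazonaws.com/KnowNets/KN-20rep-1706/" + \
            "userKN-20rep-1706/" + relative_path
        # create the parent directories first; open() will not create them
        os.makedirs(os.path.dirname(edge_file_path), exist_ok=True)
        with urllib.request.urlopen(url) as response:
            with open(edge_file_path, 'wb') as out_file:
                shutil.copyfileobj(response, out_file)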

def fetch_network_metadata():
    """Download the knowledge network metadata files that are not already present."""
    filenames = ['db_contents.txt', 'species_desc.txt', 'edge_type.txt']
    for filename in filenames:
        out_file_path = os.path.join(NETWORK_DIR_PATH, filename)
        if not os.path.isfile(out_file_path):
            url = "https://s3.amazonaws.com/KnowNets/KN-20rep-1706/" + \
                "userKN-20rep-1706/" + filename
            with urllib.request.urlopen(url) as response:
                with open(out_file_path, 'wb') as out_file:
                    shutil.copyfileobj(response, out_file)

def get_path_to_newest_file_having_prefix(search_dir_path, prefix):
    """Return the path of the newest file in `search_dir_path` whose name
    starts with `prefix`; raise an Exception if no file matches."""
    matches = [os.path.join(search_dir_path, name) for name
        in os.listdir(search_dir_path) if name.startswith(prefix)]
    if not matches:
        raise Exception("No file found with prefix " + prefix + " in " +
            search_dir_path + ".")
    # newest = largest ctime, as in the original sort
    return max(matches, key=os.path.getctime)

def get_cleaned_file_path(original_file_path, results_dir_path):
    """Return the path in `results_dir_path` of the cleaned ("_ETL.tsv") version of a file."""
    original_name = os.path.basename(original_file_path)
    original_name_root = os.path.splitext(original_name)[0]
    return os.path.join(results_dir_path, original_name_root + "_ETL.tsv")

def get_gene_map_file_path(original_file_path, results_dir_path):
    """Return the path in `results_dir_path` of the gene map ("_MAP.tsv") file for an input file."""
    original_name = os.path.basename(original_file_path)
    original_name_root = os.path.splitext(original_name)[0]
    return os.path.join(results_dir_path, original_name_root + "_MAP.tsv")

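# defined here so that the loop below does not raise a NameError
INPUT_DATA_DIR_PATH = '/original_data/'
OUTPUT_DATA_DIR_PATH = '/results/'
for dir_path in [INPUT_DATA_DIR_PATH, OUTPUT_DATA_DIR_PATH, NETWORK_DIR_PATH]:
    os.makedirs(dir_path, exist_ok=True)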
fetch_network_metadata()
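
The two path helpers in the cell above, `get_cleaned_file_path` and `get_gene_map_file_path`, follow the same naming rule: take the input file's name, drop its extension, append a suffix, and place the result in the results directory. The sketch below restates that rule in one function so you can see what paths to expect. `derived_path` is a hypothetical name, and `expression.csv` is a made-up file used only for illustration.

```python
import os

def derived_path(original_file_path, results_dir_path, suffix):
    """Hypothetical helper mirroring the notebook's naming rule for output files."""
    # strip the directory and the extension from the original path
    name_root = os.path.splitext(os.path.basename(original_file_path))[0]
    # append the suffix and place the file in the results directory
    return os.path.join(results_dir_path, name_root + suffix)

# 'expression.csv' is an assumed example file name, not one from the notebook.
print(derived_path('/original_data/expression.csv', '/results/', '_ETL.tsv'))
print(derived_path('/original_data/expression.csv', '/results/', '_MAP.tsv'))
```

On Colab (a Linux environment) this prints `/results/expression_ETL.tsv` and `/results/expression_MAP.tsv`.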

## Knowledge Network Utility Methods

This cell defines several utility methods for working with the knowledge network. These methods are used in the example analyses and might be useful to you in your project. Run this cell and wait for it to complete. It won't produce any text output.

You won't need to edit anything within this cell or run it more than once. The next cell shows how to use the knowledge network utility methods.

In [0]:
def get_network_species():
    """Return a list of dicts, one per species listed in species_desc.txt."""
    return_val = []
    species_file_path = os.path.join(NETWORK_DIR_PATH, 'species_desc.txt')
    with open(species_file_path) as csvfile:
        for row in csv.reader(csvfile, delimiter='\t'):
            return_val.append({
                'id': row[0],
                'short_latin_name': row[1],
                'latin_name': row[2],
                'familiar_name': row[3],
                'group_name': row[5]
            })
    return return_val

def display_network_species():
    """Return an HTML table of the species in the knowledge network and their ids."""
    html_string = "<table><tr><th>Familiar Name (Latin Name)</th><th>Species Id</th></tr>"
    for species in get_network_species():
        html_string += "<tr><td>" + species['familiar_name'] + " (" + \
            species['latin_name'] + ")</td><td>" + species['id'] + "</td></tr>"
    html_string += "</table>"
    return HTML(html_string)

def get_interaction_networks(species_id):
    """Return a list of dicts (name, edge_file_path), one per interaction network for `species_id`."""
    species_id = str(species_id) # user-friendliness
    return_val = []
    contents_file_path = os.path.join(NETWORK_DIR_PATH, 'db_contents.txt')
    with open(contents_file_path) as csvfile:
        for row in csv.DictReader(csvfile, delimiter='\t'):
            if row['n1_type'] == 'Gene' and row['taxon'] == species_id:
                return_val.append({
                    'name': row['et_name'],
                    'edge_file_path': os.path.join(\
                        NETWORK_DIR_PATH, 'Gene', species_id, row['et_name'], \
                        species_id + '.' + row['et_name'] + '.edge')
                })
    return return_val

def display_interaction_networks(species_id):
    """Return an HTML table of the interaction networks available for `species_id`."""
    html_string = "<table><tr><th>Interaction Network Name</th><th>Edge File Path</th></tr>"
    for network in get_interaction_networks(species_id):
        html_string += "<tr><td>" + network['name'] + "</td><td>" + \
            network['edge_file_path'] + "</td></tr>"
    html_string += "</table>"
    return HTML(html_string)

def get_gene_property_networks(species_id):
    """Return a list of dicts (name, edge_file_path), one per gene property network for `species_id`."""
    species_id = str(species_id) # user-friendliness
    return_val = []
    contents_file_path = os.path.join(NETWORK_DIR_PATH, 'db_contents.txt')
    with open(contents_file_path) as csvfile:
        for row in csv.DictReader(csvfile, delimiter='\t'):
            if row['n1_type'] == 'Property' and row['taxon'] == species_id:
                return_val.append({
                    'name': row['et_name'],
                    'edge_file_path': os.path.join(\
                        NETWORK_DIR_PATH, 'Property', species_id, row['et_name'], \
                        species_id + '.' + row['et_name'] + '.edge')
                })
    return return_val

def display_gene_property_networks(species_id):
    """Return an HTML table of the gene property networks available for `species_id`."""
    html_string = "<table><tr><th>Interaction Network Name</th><th>Edge File Path</th></tr>"
    for network in get_gene_property_networks(species_id):
        html_string += "<tr><td>" + network['name'] + "</td><td>" + \
            network['edge_file_path'] + "</td></tr>"
    html_string += "</table>"
    return HTML(html_string)

### Using the Knowledge Network Utility Methods

The cells below show how `display_network_species`, `display_interaction_networks`, and `display_gene_property_networks` can be called to view information about the knowledge network. This information can be useful in configuring analyses, as you'll see later.

These methods are based on three other methods, `get_network_species`, `get_interaction_networks`, and `get_gene_property_networks`. The "get" versions return the same information as the "display" versions, but the "get" versions return it in a format convenient for use in code instead of a format that's easy to read.
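
As an illustration of using a "get" method in code, the sketch below looks up an edge file path by network name. To keep it self-contained, it uses a stand-in list with the same shape that `get_interaction_networks` returns (dicts with `name` and `edge_file_path` keys); the two entries come from this notebook's listing for rat. `find_edge_file_path` is a hypothetical helper, not something the notebook defines.

```python
# Stand-in for get_interaction_networks('10116'); in the notebook you would
# call the real function instead of building this list by hand.
rat_networks = [
    {'name': 'blastp_homology',
     'edge_file_path': '/network/Gene/10116/blastp_homology/10116.blastp_homology.edge'},
    {'name': 'PPI_association',
     'edge_file_path': '/network/Gene/10116/PPI_association/10116.PPI_association.edge'},
]

def find_edge_file_path(networks, network_name):
    """Hypothetical helper: return the edge file path for network_name, or None."""
    for network in networks:
        if network['name'] == network_name:
            return network['edge_file_path']
    return None

print(find_edge_file_path(rat_networks, 'PPI_association'))
```

A path obtained this way is the kind of value the analysis methods expect for their edge file path arguments.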

In [0]:
# display all species in the knowledge network
display_network_species()

| Familiar Name (Latin Name) | Species Id |
| --- | --- |
| Human (Homo sapiens) | 9606 |
| Chimpanzee (Pan troglodytes) | 9598 |
| Cow (Bos taurus) | 9913 |
| Dog (Canis familiaris) | 9615 |
| Macaque (Macaca mulatta) | 9544 |
| Mouse (Mus musculus) | 10090 |
| Pig (Sus scrofa) | 9823 |
| Rat (Rattus norvegicus) | 10116 |
| Chicken (Gallus gallus) | 9031 |
| Clawed frog (Xenopus tropicalis) | 8364 |


In [0]:
# display interaction networks for rat (species id 10116)
display_interaction_networks('10116')

| Interaction Network Name | Edge File Path |
| --- | --- |
| blastp_homology | /network/Gene/10116/blastp_homology/10116.blastp_homology.edge |
| pathcom_catalysis_precedes | /network/Gene/10116/pathcom_catalysis_precedes/10116.pathcom_catalysis_precedes.edge |
| pathcom_controls_expression_of | /network/Gene/10116/pathcom_controls_expression_of/10116.pathcom_controls_expression_of.edge |
| pathcom_controls_phosphorylation_of | /network/Gene/10116/pathcom_controls_phosphorylation_of/10116.pathcom_controls_phosphorylation_of.edge |
| pathcom_controls_state_change_of | /network/Gene/10116/pathcom_controls_state_change_of/10116.pathcom_controls_state_change_of.edge |
| pathcom_in_complex_with | /network/Gene/10116/pathcom_in_complex_with/10116.pathcom_in_complex_with.edge |
| PPI_association | /network/Gene/10116/PPI_association/10116.PPI_association.edge |
| PPI_colocalization | /network/Gene/10116/PPI_colocalization/10116.PPI_colocalization.edge |
| PPI_direct_interaction | /network/Gene/10116/PPI_direct_interaction/10116.PPI_direct_interaction.edge |
| PPI_genetic_interaction | /network/Gene/10116/PPI_genetic_interaction/10116.PPI_genetic_interaction.edge |


In [0]:
# display gene property networks for roundworm (species id 6239)
display_gene_property_networks('6239')

| Interaction Network Name | Edge File Path |
| --- | --- |
| gene_ontology | /network/Property/6239/gene_ontology/6239.gene_ontology.edge |
| pathcom_pathway | /network/Property/6239/pathcom_pathway/6239.pathcom_pathway.edge |
| pfam_prot | /network/Property/6239/pfam_prot/6239.pfam_prot.edge |
| reactome_annotation | /network/Property/6239/reactome_annotation/6239.reactome_annotation.edge |


## Analytics Methods

The cell below defines methods for running clustering, prioritization, and gene-set characterization. Run the cell and wait for it to complete. It won't produce any output.

You won't need to run this cell more than once unless you later change it as part of your project.

In [0]:
def do_clustering(\
    omics_file_path, phenotype_file_path, results_dir_path, num_clusters, \
    species_id, interaction_network_edge_file_path, network_influence, \
    num_bootstraps, bootstrap_sample_fraction):
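    """Clean the input files, then cluster the samples.

    With an interaction network edge file, runs 'net_nmf'; with None, runs
    'hclust'. A positive `num_bootstraps` selects the 'cc_' variant of either.
    """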
    os.makedirs(results_dir_path, exist_ok=True)

    if interaction_network_edge_file_path is None:
        pipeline_type = 'general_clustering_pipeline'
    else:
        # download the edge file only when a network was actually requested
        fetch_network(interaction_network_edge_file_path)
        pipeline_type = 'samples_clustering_pipeline'

    cleanup_parameters = {
        'spreadsheet_name_full_path': omics_file_path,
        'pipeline_type': pipeline_type,
        'results_directory': results_dir_path
    }
    if phenotype_file_path is not None:
        cleanup_parameters['phenotype_name_full_path'] = phenotype_file_path
    if interaction_network_edge_file_path is not None:
        cleanup_parameters.update({
            'gg_network_name_full_path': interaction_network_edge_file_path,
            'taxonid': species_id,
            'source_hint': '',
            'redis_credential': {
                'host': REDIS_PARAMS['host'],
                'port': REDIS_PARAMS['port'],
                'password': REDIS_PARAMS['password']
            }
        })
    data_cleanup.run_pipelines(cleanup_parameters, data_cleanup.SELECT[pipeline_type])

    clustering_parameters = {
        'spreadsheet_name_full_path': get_cleaned_file_path(omics_file_path, results_dir_path),
        'results_directory': results_dir_path,
        'processing_method': 'parallel',
        'parallelism': NUM_CPUS,
        'number_of_clusters': num_clusters,
        'tmp_directory': './tmp'
    }
    if phenotype_file_path is not None:
        clustering_parameters.update({
            'phenotype_name_full_path': get_cleaned_file_path(phenotype_file_path, results_dir_path),
            'threshold': 15
        })

    method_prefix = ''
    if num_bootstraps > 0:
        clustering_parameters.update({
            'number_of_bootstraps': num_bootstraps,
            'rows_sampling_fraction': 1.0,
            'cols_sampling_fraction': bootstrap_sample_fraction
        })
        method_prefix = 'cc_'

    if interaction_network_edge_file_path is not None:
        clustering_parameters.update({
            'gg_network_name_full_path': interaction_network_edge_file_path,
            'rwr_max_iterations': 100,
            'rwr_convergence_tolerence': 1.0e-4,
            'rwr_restart_probability': network_influence,
            'top_number_of_genes': 100,
            'nmf_conv_check_freq': 50,
            'nmf_max_invariance': 200,
            'nmf_max_iterations': 10000,
            'nmf_penalty_parameter': 1400,
            'method': method_prefix + 'net_nmf'
        })
        samples_clustering.SELECT[clustering_parameters['method']](clustering_parameters)
    else:
        clustering_parameters.update({
            'top_number_of_rows': 100,
            'affinity_metric': 'euclidean',
            'linkage_criterion': 'ward',
            'method': method_prefix + 'hclust'
        })
        general_clustering.SELECT[clustering_parameters['method']](clustering_parameters)

def do_prioritization(\
    omics_file_path, phenotype_file_path, results_dir_path, \
    correlation_measure, missing_value_strategy, num_exported_features, \
    num_response_correlated_features, species_id, \
    interaction_network_edge_file_path, network_influence):
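    """Clean the input files, then rank features by correlation with the phenotype.

    With an interaction network edge file, runs 'net_correlation'; with None,
    runs 'correlation'.
    """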
    os.makedirs(results_dir_path, exist_ok=True)

    if interaction_network_edge_file_path is None:
        pipeline_type = 'feature_prioritization_pipeline'
    else:
        # download the edge file only when a network was actually requested
        fetch_network(interaction_network_edge_file_path)
        pipeline_type = 'gene_prioritization_pipeline'

    cleanup_parameters = {
        'spreadsheet_name_full_path': omics_file_path,
        'phenotype_name_full_path': phenotype_file_path,
        'pipeline_type': pipeline_type,
        'correlation_measure': correlation_measure, # t_test, pearson, edgeR
        'impute': missing_value_strategy, # average, remove, reject
        'results_directory': results_dir_path
    }
    if interaction_network_edge_file_path is not None:
        cleanup_parameters.update({
            'taxonid': species_id,
            'source_hint': '',
            'redis_credential': {
                'host': REDIS_PARAMS['host'],
                'port': REDIS_PARAMS['port'],
                'password': REDIS_PARAMS['password']
            }
        })
    data_cleanup.run_pipelines(cleanup_parameters, data_cleanup.SELECT[pipeline_type])

    prioritization_parameters = {
        'correlation_measure': correlation_measure,
        'spreadsheet_name_full_path': get_cleaned_file_path(omics_file_path, results_dir_path),
        'phenotype_name_full_path': get_cleaned_file_path(phenotype_file_path, results_dir_path),
        'results_directory': results_dir_path,
        'top_gamma_of_sort': num_exported_features,
        'max_cpu': NUM_CPUS
    }
    if interaction_network_edge_file_path is not None:
        prioritization_parameters.update({
            'gg_network_name_full_path': interaction_network_edge_file_path,
            'rwr_max_iterations': 100,
            'rwr_convergence_tolerence': 1.0e-4,
            'rwr_restart_probability': network_influence,
            'top_beta_of_sort': num_response_correlated_features,
            'method': 'net_correlation'
        })
        gene_prioritization.net_correlation(prioritization_parameters)
    else:
        prioritization_parameters.update({
            'top_beta_of_sort': num_exported_features,
            'method': 'correlation',
        })
        feature_prioritization.correlation(prioritization_parameters)

def do_characterization(\
    gene_matrix_file_path, results_dir_path, species_id, \
    gene_property_edge_file_path, interaction_network_edge_file_path, \
    network_influence):
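    """Clean the gene matrix, then characterize its gene sets against a gene property network.

    With an interaction network edge file, uses the 'DRaWR' method; with None,
    uses 'fisher'.
    """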
    os.makedirs(results_dir_path, exist_ok=True)

    fetch_network(gene_property_edge_file_path)

    cleanup_parameters = {
        'spreadsheet_name_full_path': gene_matrix_file_path,
        'pipeline_type': 'geneset_characterization_pipeline',
        'results_directory': results_dir_path,
        'taxonid': species_id,
        'source_hint': '',
        'redis_credential': {
            'host': REDIS_PARAMS['host'],
            'port': REDIS_PARAMS['port'],
            'password': REDIS_PARAMS['password']
        }
    }
    data_cleanup.run_pipelines(cleanup_parameters, data_cleanup.SELECT[cleanup_parameters['pipeline_type']])

    characterization_parameters = {
        'spreadsheet_name_full_path': get_cleaned_file_path(gene_matrix_file_path, results_dir_path),
        'gene_names_map': get_gene_map_file_path(gene_matrix_file_path, results_dir_path),
        'results_directory': results_dir_path,
        'pg_network_name_full_path': gene_property_edge_file_path,
        'max_cpu': NUM_CPUS
    }
    if interaction_network_edge_file_path is None:
        characterization_parameters.update({
            'method': 'fisher'
        })
        geneset_characterization.fisher(characterization_parameters)
    else:
        fetch_network(interaction_network_edge_file_path)
        characterization_parameters.update({
            'method': 'DRaWR',
            'rwr_max_iterations': 500,
            'rwr_convergence_tolerence': 1.0e-4,
            'rwr_restart_probability': network_influence,
            'gg_network_name_full_path': interaction_network_edge_file_path
        })
        geneset_characterization.DRaWR(characterization_parameters)
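
The clustering method that `do_clustering` ends up running depends on two of its arguments: whether an interaction network edge file is supplied, and whether `num_bootstraps` is positive. The sketch below isolates that choice so the four possible method names are easy to see. `choose_clustering_method` is a hypothetical helper that mirrors the logic in the cell above; it is not part of the KnowEnG packages.

```python
def choose_clustering_method(num_bootstraps, interaction_network_edge_file_path):
    """Hypothetical helper: return the method name do_clustering would select."""
    # bootstrapping adds the 'cc_' prefix to the method name
    prefix = 'cc_' if num_bootstraps > 0 else ''
    # a network edge file selects 'net_nmf'; no network selects 'hclust'
    if interaction_network_edge_file_path is not None:
        return prefix + 'net_nmf'
    return prefix + 'hclust'

# The edge file path below is copied from the rat listing earlier in the notebook.
edge_path = '/network/Gene/10116/PPI_association/10116.PPI_association.edge'
for bootstraps in (0, 5):
    for path in (None, edge_path):
        print(bootstraps, path is not None, choose_clustering_method(bootstraps, path))
```

The loop prints one line per combination, covering `hclust`, `net_nmf`, `cc_hclust`, and `cc_net_nmf`.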


In [0]:
INPUT_DATA_DIR_PATH = '/original_data/'
OUTPUT_DATA_DIR_PATH = '/results/'