# Rank probes into a list, save out the top half

We need a ranked list of probes/genes for many reasons. The immediate one is to generate a list of the top 50% of our probes, so we can shorten the computationally expensive process of running permutation tests on huge matrices of gene expression values. If we only run the top half, it can cut the time spent on each run by XXX. This code averages each probe's rank across the REAL optimization runs, so we can then run permutations and shuffles on only the best genes.

Which results are used to generate rankings can be modified by editing the path in the 'context' block below.

In [1]:
""" imports """

import pandas as pd
import os


In [2]:
""" context """

PYGEST_DATA = "/data"
BASE_DIR = "derivatives/sub-all_hem-A_samp-glasser_prob-fornito/parby-wellid_splby-wellid_batch-train{:05d}/tgt-max_algo-smrt"
FILE_NAME = "sub-all_comp-hcpniftismoothgrandmeansim_mask-{}_norm-srs_adj-none.tsv"
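
Before loading anything, it can help to see what one of these templates expands to. The sketch below repeats the constants from the 'context' cell so it runs on its own; `result_path` is a hypothetical helper name, not part of PyGEST.

```python
import os

# Same templates as the 'context' cell, repeated so this sketch is self-contained.
PYGEST_DATA = "/data"
BASE_DIR = "derivatives/sub-all_hem-A_samp-glasser_prob-fornito/parby-wellid_splby-wellid_batch-train{:05d}/tgt-max_algo-smrt"
FILE_NAME = "sub-all_comp-hcpniftismoothgrandmeansim_mask-{}_norm-srs_adj-none.tsv"


def result_path(split, mask):
    """Hypothetical helper: build the path to one result file."""
    # The split number is zero-padded to five digits by the {:05d} field.
    return os.path.join(PYGEST_DATA, BASE_DIR.format(split), FILE_NAME.format(mask))


example = result_path(200, "16")
print(example)
```

Changing `BASE_DIR` or `FILE_NAME` here changes which results feed the rankings, exactly as in the loading cell.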


In [3]:
""" load all the files, ranking probes as we go """

dfs = {}
for mask in ['none', '16', '32', '64', ]:
    dfs[mask] = []
    for split in range(200, 216, 1):
        full_path = os.path.join(PYGEST_DATA, BASE_DIR.format(split), FILE_NAME.format(mask))
        if os.path.isfile(full_path):
            # Keep only probe_id; a probe's row position in the file is its rank for this split.
            df = pd.read_csv(full_path, sep='\t')[['probe_id', ]]
            df['rank_{:05d}'.format(split)] = range(1, len(df) + 1, 1)
            df = df.set_index('probe_id')
            dfs[mask].append(df.sort_index())
        else:
            print("DOES NOT EXIST: {}".format(full_path))
        
    print("Loaded {} dataframes with {} mask".format(len(dfs[mask]), mask))
    dfs['df-' + mask] = pd.concat(dfs[mask], axis=1)


Loaded 16 dataframes with none mask
Loaded 16 dataframes with 16 mask
Loaded 16 dataframes with 32 mask
Loaded 16 dataframes with 64 mask


In [4]:
""" average the ranking for each probe, and create an overall ranked list. """

for mask in ['none', '16', '32', '64', ]:
    dfs['df-' + mask]['rank'] = dfs['df-' + mask].mean(axis=1)
    dfs['df-' + mask] = dfs['df-' + mask].sort_values('rank')
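
As a small illustration of the averaging step, here is a sketch on made-up data: four invented probes ranked in three invented runs. The probe names and ranks are assumptions for demonstration only.

```python
import pandas as pd

# Toy ranks for four made-up probes across three made-up runs (1 = best).
ranks = pd.DataFrame(
    {
        "rank_00200": [1, 2, 3, 4],
        "rank_00201": [2, 1, 4, 3],
        "rank_00202": [1, 3, 2, 4],
    },
    index=pd.Index(["p_a", "p_b", "p_c", "p_d"], name="probe_id"),
)

# Average each probe's rank across runs, then sort so the best probes come first.
ranks["rank"] = ranks.mean(axis=1)
ranked = ranks.sort_values("rank")

# The top half is the first len // 2 entries of the sorted index.
top_half = list(ranked.index[: len(ranked) // 2])
print(top_half)
```

A probe that is consistently near the top in every run ends up ahead of one that is excellent in a single run but mediocre elsewhere.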


In [5]:
""" Save out the top half of the probes to a file. """

for mask in ['none', '16', '32', '64', ]:
    mask_digits = '00' if mask == 'none' else mask
    file_name = "top_half_probes_hcpww{}ss.csv".format(mask_digits)
    with open(file_name, "wt") as f:
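        # Use this mask's rank-sorted dataframe, not the last `df` left over from loading.
        for probe in dfs['df-' + mask].index[:len(dfs['df-' + mask]) // 2]: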
            f.write("{}\n".format(probe))

    print("Top 50% of probes have been written to {}".format(file_name))

Top 50% of probes have been written to top_half_probes_hcpww00ss.csv
Top 50% of probes have been written to top_half_probes_hcpww16ss.csv
Top 50% of probes have been written to top_half_probes_hcpww32ss.csv
Top 50% of probes have been written to top_half_probes_hcpww64ss.csv


In [6]:
""" Save out the top >10% (all probes that ever make the top 10%) of the probes to a file. """

for mask in ['none', '16', '32', '64', ]:
    mask_digits = '00' if mask == 'none' else mask
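    # Only the per-split rank columns feed min/max, not the averaged 'rank' column.
    rank_cols = [c for c in dfs['df-' + mask].columns if c.startswith('rank_')]
    dfs['df-' + mask]['min_rank'] = dfs['df-' + mask][rank_cols].min(axis=1)
    dfs['df-' + mask]['max_rank'] = dfs['df-' + mask][rank_cols].max(axis=1)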
    decile_threshold = len(dfs['df-' + mask]) / 10.0
    df_top_decile = dfs['df-' + mask][dfs['df-' + mask]['min_rank'] < decile_threshold]

    file_name = "top_decile_probes_hcpww{}ss.csv".format(mask_digits)
    with open(file_name, "wt") as f:
        for probe in df_top_decile.index:
            f.write("{}\n".format(probe))

    print("Top decile (ever ranked < {:0,.0f}) probes ({:,}) have been written to {}".format(
        decile_threshold, len(df_top_decile), file_name
    ))

Top decile (ever ranked < 1,574) probes (4,630) have been written to top_decile_probes_hcpww00ss.csv
Top decile (ever ranked < 1,574) probes (4,601) have been written to top_decile_probes_hcpww16ss.csv
Top decile (ever ranked < 1,574) probes (4,708) have been written to top_decile_probes_hcpww32ss.csv
Top decile (ever ranked < 1,574) probes (4,768) have been written to top_decile_probes_hcpww64ss.csv
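
The counts above are larger than one tenth of the probes because a probe is kept if it beats the threshold in any single run. A minimal sketch on made-up data (25 invented probes, two invented runs with opposite orderings) shows the effect:

```python
import pandas as pd

n = 25
probes = ["p{:02d}".format(i) for i in range(n)]

# Two made-up runs that rank the probes in opposite orders.
ranks = pd.DataFrame(
    {"rank_a": list(range(1, n + 1)), "rank_b": list(range(n, 0, -1))},
    index=pd.Index(probes, name="probe_id"),
)

# Same rule as the cell above: keep probes whose best rank is below len / 10.
threshold = len(ranks) / 10.0
min_rank = ranks.min(axis=1)
selected = list(ranks[min_rank < threshold].index)
print(threshold, selected)
```

Each run contributes its own two best probes, so four probes pass even though a single decile of 25 would hold only two.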


# What to do from here

The eight files written above should be copied to wherever PyGEST is being run. I chose to put them into <code>\$PYGEST_DATA/derivatives/sub-all_hem-A_samp-glasser_prob-fornito/</code>. Each list can then be used by calling <code>pygest push ... --only-probes-in \$PYGEST_DATA/derivatives/sub-all_hem-A_samp-glasser_prob-fornito/top_half_probes_hcpww16ss.csv ...</code>. That way, PyGEST will start whack-a-probe with only the probes in the list provided, rather than all 15,745.
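
If the destination is on the same machine, the copy could be scripted along these lines. This is only a sketch: `copy_probe_lists` is a hypothetical helper, and it assumes the eight files sit in a single source directory under the names written above.

```python
import os
import shutil


def copy_probe_lists(src_dir, dest_dir, masks=("00", "16", "32", "64")):
    """Hypothetical helper: copy the probe lists and return the copied paths."""
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for kind in ("top_half", "top_decile"):
        for digits in masks:
            name = "{}_probes_hcpww{}ss.csv".format(kind, digits)
            src = os.path.join(src_dir, name)
            if os.path.isfile(src):
                # shutil.copy returns the path of the newly created file.
                copied.append(shutil.copy(src, dest_dir))
            else:
                print("DOES NOT EXIST: {}".format(src))
    return copied


# Example call, using the destination chosen in this notebook:
# copy_probe_lists(".", "/data/derivatives/sub-all_hem-A_samp-glasser_prob-fornito")
```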

----

# Testing things

Code below here is not necessary, but can be used to test code above.

In [7]:
""" Test reading of the file as a csv """

input_probes = pd.read_csv(file_name, header=None)
input_probes.columns

Int64Index([0], dtype='int64')
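
A slightly stricter check could name the single column and confirm that no probe appears twice. The sketch below writes a tiny stand-in file with made-up probe ids so that it runs on its own; the column name `probe_id` is supplied here and is not stored in the file.

```python
import os
import tempfile

import pandas as pd

# Write a tiny stand-in file with made-up probe ids, one per line.
tmp_path = os.path.join(tempfile.mkdtemp(), "example_probes.csv")
with open(tmp_path, "wt") as f:
    f.write("101\n102\n103\n")

# The probe lists have no header row, so supply the column name on read.
probes = pd.read_csv(tmp_path, header=None, names=["probe_id"])
n_probes = len(probes)
no_duplicates = bool(probes["probe_id"].is_unique)
print(n_probes, no_duplicates)
```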