# Rank probes into a list

We need a ranked list of probes/genes for many reasons. The one top of mind as I write this is selecting the top 50% of our probes, so we can shorten the computationally expensive process of running permutation tests on huge matrices of gene expression values. Running only the top half can cut the time spent on each run by XXX. This code averages each probe's rank across the REAL optimization runs, so that permutations and shuffles can then be run on only the best genes.

Which results are used to generate rankings can be modified by editing the path in the 'context' block below.
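
As a minimal sketch of the averaging idea, here is a toy version with made-up probe ids (`p1` to `p4`) and three hypothetical runs. None of these values come from real results; the column names only mimic the `rank_{:05d}` pattern used in the cells below.

```python
import pandas as pd

# Toy data: three hypothetical runs, each listing probes from best to worst.
runs = {
    "rank_00200": ["p1", "p2", "p3", "p4"],
    "rank_00201": ["p2", "p1", "p4", "p3"],
    "rank_00202": ["p1", "p3", "p2", "p4"],
}

# Row position within a run becomes that probe's rank (1 = best).
ranked = pd.concat(
    [pd.Series(range(1, len(order) + 1), index=order, name=name)
     for name, order in runs.items()],
    axis=1,
)

# Average across runs, then sort so the best average rank comes first.
mean_rank = ranked.mean(axis=1).sort_values()

# Keep the top half of the averaged list.
top_half = list(mean_rank.index[:len(mean_rank) // 2])
print(top_half)
```

A probe that ranks well in most runs ends up near the top even if a single run placed it lower, which is the point of averaging before truncating.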

In [1]:
""" imports """

import pandas as pd
import os


In [2]:
""" context """

PYGEST_DATA = "/data.local"
BASE_DIR = "derivatives/sub-all_hem-A_samp-glasser_prob-fornito/parby-wellid_splby-wellid_batch-train{:05d}/tgt-max_algo-smrt"
FILE_NAME = "sub-all_comp-hcpniftismoothgrandmeansim_mask-16_norm-srs_adj-none.tsv"


In [3]:
""" load all the files, ranking probes as we go """

dfs = []
for split in range(200, 216):
    full_path = os.path.join(PYGEST_DATA, BASE_DIR.format(split), FILE_NAME)
    if os.path.isfile(full_path):
        rank_col = 'rank_{:05d}'.format(split)
        # Only the probe ids are needed; row order in the file is taken as the rank.
        df = pd.read_csv(full_path, sep='\t')[['probe_id']]
        df[rank_col] = range(1, len(df) + 1)
        # Index by probe_id so the per-split ranks align when concatenated.
        dfs.append(df.set_index('probe_id').sort_index())
    else:
        print("DOES NOT EXIST: {}".format(full_path))

print("Loaded {} dataframes".format(len(dfs)))
# One column of ranks per split, one row per probe.
df = pd.concat(dfs, axis=1)


Loaded 16 dataframes
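
Because `pd.concat(..., axis=1)` aligns on the index, a probe that is absent from one split's file gets `NaN` in that split's column, and `mean(axis=1)` skips `NaN` by default. The sketch below uses two tiny made-up frames (not real results) to show how one might check for such probes before averaging; whether the real files ever differ in their probe sets is not something this notebook establishes.

```python
import pandas as pd

# Two hypothetical runs; probe "p3" is missing from the second one.
a = pd.DataFrame(
    {"rank_00200": [1, 2, 3]},
    index=pd.Index(["p1", "p2", "p3"], name="probe_id"),
)
b = pd.DataFrame(
    {"rank_00201": [1, 2]},
    index=pd.Index(["p2", "p1"], name="probe_id"),
)

# Align on probe_id; the missing entry becomes NaN.
combined = pd.concat([a.sort_index(), b.sort_index()], axis=1)

# Count how many runs each probe is missing from.
n_missing = combined.isna().sum(axis=1)
incomplete = n_missing[n_missing > 0]
print("{} probes are missing from at least one run".format(len(incomplete)))

# mean() skips NaN by default, so p3's average uses only the run that has it.
mean_rank = combined.mean(axis=1)
```

If the count is zero, every probe's average is taken over the same number of runs and the ranks are directly comparable.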


In [4]:
""" average the ranking for each probe, and create an overall ranked list. """

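# Average only the per-split columns, so re-running this cell
# does not fold the existing 'rank' column into its own mean.
df['rank'] = df.filter(like='rank_').mean(axis=1)
df = df.sort_values('rank')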


In [5]:
""" Save out the top half of the probes to a file. """

file_name = "top_half_probes_hcpww16ss.csv"
with open(file_name, "wt") as f:
    for probe in df.index[:len(df.index) // 2]:
        f.write("{}\n".format(probe))

print("Top 50% of probes have been written to {}".format(file_name))

Top 50% of probes have been written to top_half_probes_hcpww16ss.csv
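
For later use, the saved list can be read back and used to subset a matrix. This is only a sketch under stated assumptions: the file name, the integer probe ids, and the `expression` frame (probes as rows, samples as columns) are all hypothetical stand-ins, not the real data or the layout of any particular downstream tool.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for the file written above: one probe id per line.
tmp_dir = tempfile.mkdtemp()
list_path = os.path.join(tmp_dir, "top_half_probes_example.csv")
with open(list_path, "wt") as f:
    f.write("1001\n1003\n")

# Read the ids back; header=None because the file has no header row.
top_probes = pd.read_csv(list_path, header=None, names=["probe_id"])["probe_id"]

# Hypothetical expression matrix: probes as rows, samples as columns.
expression = pd.DataFrame(
    {"sample_a": [0.1, 0.2, 0.3, 0.4], "sample_b": [1.1, 1.2, 1.3, 1.4]},
    index=pd.Index([1001, 1002, 1003, 1004], name="probe_id"),
)

# Keep only the rows whose probe id is in the saved list.
subset = expression[expression.index.isin(top_probes)]
print("Kept {} of {} probes".format(len(subset), len(expression)))
```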
