---
> title: "Exploring Overlaps"
> created: 2019-09-23
> last updated: 2019-10-10
> author: "Mike Schmidt"
---

Exploring Overlaps
==================

Calculating overlap, or percent similarity, is one of many ways to assess similarity of ranked gene lists. The overlap of two gene lists that were generated in similar ways can be used as a measure of consistency, as each list agrees on the most important genes to some extent.
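
As a minimal sketch of the idea, assume overlap is the number of genes shared by the top N entries of two ranked lists, divided by N. The pygest implementation is not shown in this notebook and may define it differently. The function name `pct_overlap` and the gene lists below are hypothetical, for illustration only.

```python
def pct_overlap(ranked_a, ranked_b, top):
    """Fraction of the top `top` items shared by two ranked lists.

    Assumption: overlap = |A & B| / top. Illustrative only; not taken from pygest.
    """
    if top <= 0:
        raise ValueError("top must be positive")
    shared = set(ranked_a[:top]) & set(ranked_b[:top])
    return len(shared) / top


# Hypothetical ranked gene lists, most important first.
list_a = ["GENE1", "GENE2", "GENE3", "GENE4", "GENE5"]
list_b = ["GENE3", "GENE1", "GENE6", "GENE7", "GENE2"]

print("{:0.2%}".format(pct_overlap(list_a, list_b, top=3)))
```

Truncating both lists at the same threshold before comparing is what makes the choice of threshold matter: the same pair of lists can overlap heavily at one cutoff and barely at another.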

This file was created to develop different ways to assess overlaps (not algorithmically, but procedurally: within vs. across groups, etc.). We also don't yet know which thresholds are appropriate for treating genes ranked above the threshold as relevant and those below it as irrelevant. The code here will be ported back into ge_data_manager to generate these overlaps for any selected collection of results.

In [1]:
""" Bring in all the libraries we will need in this notebook. """
import pandas as pd
import numpy as np
import pickle
from pygest import algorithms

In [2]:
""" Define some lists we can use later. """
shuffles = ['agno', 'dist', 'edge', ]
thresholds = [320, 160, 80, 40, 20, ]

In [3]:
""" Pick up a dataframe that's just like what I'd normally send off for plotting in ge_data_manager.

    This file was copied from PYGEST_DATA/plots/ and was explicitly written from the ge_data_manager
    tasks.py:assess_performance to allow for debugging and experimenting at this stage. It contains
    784 rows, each representing a result from one of 16 training runs, or 3 * 256 permutation runs. """

with open("overlap_data/pickled_dataframe_tt_post_nkigg00s.df", "br") as f:
    df = pickle.load(f)

In [4]:
# Scrape the seeds used to generate each split half
def seed(path, key):
    """ Scrape the 5-character seed from the path and return it as an integer.
    
    :param path: path to the tsv file containing results
    :param key: substring preceding the seed, "batch-train" for splits, "seed-" for shuffles
    """
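    i = path.find(key)
    if i < 0:
        # Key not found in this path; return 0 rather than parsing unrelated characters
        return 0
    i += len(key)
    try:
        return int(path[i:i + 5])
    except ValueError:
        return 0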

In [5]:
""" Scrape a few pieces from the path and store them as their own columns. """
df['split'] = df['path'].apply(seed, args=("batch-train", ))
df['seed'] = df['path'].apply(seed, args=("seed-", ))

What proportion of genes overlaps within each type of permutation?
------------------------------------------------------------------

At each threshold, and for each type of permutation, generate the average overlap percentage for any given pair of results. In other words, any given shuffled result is likely to contain ${MEAN_OVERLAP}% of the same genes as any other similarly shuffled result.
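
The averaging described above could be sketched roughly as follows, assuming each result's "internal overlap" is its mean overlap with every other result of the same shuffle type, and that overlap between two lists is the shared count divided by the threshold. This may not be how `algorithms.pct_similarity_list` works internally; `internal_overlaps` and the toy lists are hypothetical.

```python
from statistics import mean, pstdev


def internal_overlaps(ranked_lists, top):
    """For each list, the mean overlap with every other list at threshold `top`.

    Assumption: overlap between two lists is |A & B| / top.
    Requires at least two lists, so every list has something to compare against.
    """
    if len(ranked_lists) < 2:
        raise ValueError("need at least two lists to compare")
    top_sets = [set(ranked[:top]) for ranked in ranked_lists]
    means = []
    for i, current in enumerate(top_sets):
        others = [len(current & other) / top
                  for j, other in enumerate(top_sets) if j != i]
        means.append(mean(others))
    return means


# Three hypothetical shuffled results, ranked most important first.
shuffled = [
    ["A", "B", "C", "D"],
    ["A", "B", "E", "F"],
    ["A", "G", "H", "I"],
]
per_result = internal_overlaps(shuffled, top=4)
print("{:0.2%} (n = {}, sd = {:0.4f})".format(
    mean(per_result), len(per_result), pstdev(per_result)
))
```

`pstdev` is the population standard deviation, which matches the default behavior of `np.std` used in the notebook.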

In [6]:
""" Figure out how to determine the percentage of genes that consistently rise to the top at a given threshold. """

results = {}

for threshold in thresholds:
    for shuffle in shuffles:
        mask_shuffle = df['shuffle'] == shuffle

        # Internal_overlap is just each result compared to every other result in the list
        df.loc[mask_shuffle, 'internal_overlap'] = algorithms.pct_similarity_list(list(df.loc[mask_shuffle, 'path']), top=threshold)
        
        results[(threshold, shuffle)] = "{:>6} (n = {:>3}, sd = {:>6})".format(
            "{:0.2%}".format(np.mean(df.loc[mask_shuffle, 'internal_overlap'])),
            sum(mask_shuffle),
            "{:0.4f}".format(np.std(df.loc[mask_shuffle, 'internal_overlap'])),
        )

print("{}{}".format(" " * 18, (" " * 27).join(shuffles)))
for threshold in thresholds:
    print("Threshold: {:>3} - {}".format(
        threshold, "  ".join(results[(threshold, x)] for x in shuffles),
    ))

                  agno                           dist                           edge
Threshold: 320 -  5.45% (n = 256, sd = 0.0068)  18.43% (n = 256, sd = 0.0179)  26.41% (n = 256, sd = 0.0185)
Threshold: 160 -  3.34% (n = 256, sd = 0.0045)  12.70% (n = 256, sd = 0.0158)  22.46% (n = 256, sd = 0.0188)
Threshold:  80 -  1.80% (n = 256, sd = 0.0031)   9.91% (n = 256, sd = 0.0155)  22.14% (n = 256, sd = 0.0194)
Threshold:  40 -  0.97% (n = 256, sd = 0.0024)   8.83% (n = 256, sd = 0.0181)  25.31% (n = 256, sd = 0.0299)
Threshold:  20 -  0.52% (n = 256, sd = 0.0018)   8.11% (n = 256, sd = 0.0228)  28.53% (n = 256, sd = 0.0425)


In [7]:
print("{}{}".format(" " * 20, (" " * 27).join(shuffles)))

                    agno                           dist                           edge


How do those genes survive successive overlap pruning?
------------------------------------------------------

We generate 16 split halves, where each training half is a random sample of 50% of the available wellids or parcels. For each of those 16 training halves, we run an actual greedy Mantel whack-a-probe process. We also run three different types of shuffling, each with 16 random permutations for each of these training sets. This results in 49 (1 + 16 + 16 + 16) results per split training half. We would like to determine the consistency across shuffles and splits. As we check each successive result for its overlap with our remaining genes, how many genes last through to the very end, meaning they survived every overlap check and are listed in every result, no matter how they were split and shuffled?
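
The pruning itself is a running set intersection. The sketch below shows only that step on toy data; `survival_counts` and the probe names are hypothetical, and reading real results through pygest is left out.

```python
from itertools import accumulate


def survival_counts(probe_lists, top):
    """Count the probes that survive a running intersection of top-`top` lists.

    Entry k is the number of probes present in the top `top` of every one
    of the first k + 1 lists.
    """
    top_sets = [set(probes[:top]) for probes in probe_lists]
    survivors = accumulate(top_sets, lambda kept, new: kept & new)
    return [len(s) for s in survivors]


# Hypothetical results from four runs, ranked most important first.
runs = [
    ["p1", "p2", "p3", "p4"],
    ["p2", "p1", "p4", "p9"],
    ["p4", "p2", "p8", "p1"],
    ["p2", "p7", "p6", "p5"],
]
print(survival_counts(runs, top=4))
```

Because each step can only remove probes, the counts never increase, and the final count does not depend on the order in which results are visited; only the intermediate counts do.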

There are a lot of data to report, so the following block saves out html tables to view results. See the overlap_data subfolder.

In [8]:
# Write probe survival counts, by shuffle seed and then by split, to an HTML report.
with open("overlap_data/overlap_survival.html", "w") as f:
    f.write("<html>\n")
    f.write("<head>\n")
    f.write("  <title>Survival analysis by overlap percentages</title>\n")
    f.write("  <style>\n")
    f.write("    table, th { border: 1px solid black; }\n")
    f.write("    td { width: 40px; }\n")
    f.write("    td.first { width: 200px; }\n")
    f.write("  </style>\n")
    f.write("</head>\n")
    f.write("<body>\n")
    f.write("<h1>Probe survival analysis, by overlap</h1>\n")
    for threshold in thresholds:
        print("Threshold == {}".format(threshold))
        f.write("<h2>Threshold == {}</h2>\n".format(threshold))
        for shuffle in shuffles:
            print("  Shuffle == {}".format(shuffle))
            f.write("<h3>{}-shuffled</h3>\n".format(shuffle))
            f.write("  <table>\n")
            f.write("    <tr>\n")
            f.write("      <th>Seed \\ Split</th><th>{}</th>\n".format(
                "</th><th>".join(str(x) for x in range(16))
            ))
            f.write("    </tr>\n")
            mask_shuffle = df['shuffle'] == shuffle
        
            # A list of lists of sets
            remaining_probe_multi_list = []
            
            # Named seed_value so the loop does not shadow the seed() function defined earlier
            for seed_value in sorted(list(set(df['seed']))):
                f.write("    <tr>\n")
                mask_seed = df['seed'] == seed_value
                
                num_probes_remaining = []
                which_probes_remaining = []
                remaining_probes = None
                for result in sorted(df[mask_shuffle & mask_seed]['path']):
                    if remaining_probes is None:
                        remaining_probes = set(algorithms.run_results(result, threshold)['top_probes'])
                    else:
                        remaining_probes = set(algorithms.run_results(result, threshold)['top_probes']).intersection(remaining_probes)
                    num_probes_remaining.append(len(remaining_probes))
                    which_probes_remaining.append(remaining_probes)
                if len(which_probes_remaining) > 0:
                    remaining_probe_multi_list.append(which_probes_remaining)
                    f.write("      <td class=\"first\">Survivors in {} splits with seed={}</td><td>{}</td>\n".format(
                        len(df[mask_shuffle & mask_seed].index),
                        seed_value,
                        "</td><td>".join(str(x) for x in num_probes_remaining)
                    ))
                f.write("    </tr>\n")
            f.write("  </table>\n")
                
                
            # From those across-seed results, combine them across-splits
            final_num_probes_remaining = []
            final_remaining_probes = None
            # loop over batch train-test splits (layer two in the multi_list)
            for i, split in enumerate(sorted(list(set(df['split'])))):
                # loop over shuffle seeds (layer one in the multi_list)
                for j, probe_set in enumerate([set_list[i] for set_list in remaining_probe_multi_list]):
                    if final_remaining_probes is None:
                        final_remaining_probes = probe_set
                    else:
                        final_remaining_probes = probe_set.intersection(final_remaining_probes)
                    # print("  ({},{}) at {} probes".format(j, i, len(final_remaining_probes)))
                final_num_probes_remaining.append(0 if final_remaining_probes is None else len(final_remaining_probes))
            f.write("  <p>Across all splits and seeds, {} probes remained.</p>\n".format(
                "<None>" if final_remaining_probes is None else len(final_remaining_probes)
            ))
            f.write("  <table><tr>\n")
            f.write("      <td class=\"first\">by split</td><td>{}</td>\n".format(
                "</td><td>".join(str(x) for x in final_num_probes_remaining)
            ))
            f.write("  </tr></table>\n")
    f.write("</body>\n")
    f.write("</html>\n")


Threshold == 320
  Shuffle == agno
  Shuffle == dist
  Shuffle == edge
Threshold == 160
  Shuffle == agno
  Shuffle == dist
  Shuffle == edge
Threshold == 80
  Shuffle == agno
  Shuffle == dist
  Shuffle == edge
Threshold == 40
  Shuffle == agno
  Shuffle == dist
  Shuffle == edge
Threshold == 20
  Shuffle == agno
  Shuffle == dist
  Shuffle == edge


Do those overlaps compare favorably to real data?
-------------------------------------------------

Real training halves are not shuffled, and therefore cannot be viewed across shuffle seeds. But we can view them across each of the training sets. In other words, from one random 50% sample to the next, how many probes are in every list so far?

In [9]:
# Only 16 real results (shuffle == 'none'); how do they compare?
mask_shuffle = df['shuffle'] == 'none'

for threshold in thresholds:
    num_probes_remaining = []
    remaining_probes = None
    for result in df[mask_shuffle]['path']:
        if remaining_probes is None:
            remaining_probes = set(algorithms.run_results(result, threshold)['top_probes'])
        else:
            remaining_probes = set(algorithms.run_results(result, threshold)['top_probes']).intersection(remaining_probes)
        num_probes_remaining.append(len(remaining_probes))
    print("Survivors in real. n={}: {}".format(
        len(df[mask_shuffle].index),
        ", ".join(str(x) for x in num_probes_remaining)
    ))

Survivors in real. n=16: 320, 113, 63, 35, 29, 21, 18, 18, 18, 17, 13, 13, 12, 12, 12, 12
Survivors in real. n=16: 160, 59, 32, 19, 16, 10, 8, 8, 7, 7, 6, 6, 5, 5, 5, 5
Survivors in real. n=16: 80, 21, 10, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4
Survivors in real. n=16: 40, 11, 5, 4, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2
Survivors in real. n=16: 20, 8, 4, 4, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 1


Just hacking around
===================

The following code was in another file, also related to overlaps, and is included here so all overlap code experiments are in one place.

In [11]:
files = [
    "/home/mike/scx_ge_data/shuffles/sub-H03512002_hem-L_ctx-scx_prb-fornito/tgt-max_alg-smrt/sub-H03512002_norm-none_cmp-hcpniftismoothgrandmeansim_msk-16_adj-none_seed-0001.tsv",
    "/home/mike/scx_ge_data/shuffles/sub-H03512002_hem-L_ctx-scx_prb-fornito/tgt-max_alg-smrt/sub-H03512002_norm-none_cmp-hcpniftismoothgrandmeansim_msk-none_adj-none_seed-0002.tsv",
    "/home/mike/scx_ge_data/shuffles/sub-H03512002_hem-L_ctx-scx_prb-fornito/tgt-max_alg-smrt/sub-H03512002_norm-none_cmp-hcpniftismoothgrandmeansim_msk-none_adj-none_seed-0001.tsv",
    "/home/mike/scx_ge_data/shuffles/sub-H03512002_hem-L_ctx-scx_prb-fornito/tgt-max_alg-smrt/sub-H03512002_norm-none_cmp-hcpniftismoothgrandmeansim_msk-16_adj-none_seed-0002.tsv",
]
probelists = [algorithms.top_probes(f) for f in files]

m_olap_files = algorithms.pct_similarity_matrix(files[0:2])
m_olap_probelists = algorithms.pct_similarity_matrix_raw(probelists[0:2])

print("files : {:0.3%}".format(np.mean(m_olap_files[np.tril_indices_from(m_olap_files, k=-1)])))
print("probes: {:0.3%}".format(np.mean(m_olap_probelists[np.tril_indices_from(m_olap_probelists, k=-1)])))

m_olap_files = algorithms.pct_similarity_matrix(files[0:3])
m_olap_probelists = algorithms.pct_similarity_matrix_raw(probelists[0:3])

print("files : {:0.3%}".format(np.mean(m_olap_files[np.tril_indices_from(m_olap_files, k=-1)])))
print("probes: {:0.3%}".format(np.mean(m_olap_probelists[np.tril_indices_from(m_olap_probelists, k=-1)])))

m_olap_files = algorithms.pct_similarity_matrix(files[2:4])
m_olap_probelists = algorithms.pct_similarity_matrix_raw(probelists[2:4])

print("files : {:0.3%}".format(np.mean(m_olap_files[np.tril_indices_from(m_olap_files, k=-1)])))
print("probes: {:0.3%}".format(np.mean(m_olap_probelists[np.tril_indices_from(m_olap_probelists, k=-1)])))


files : 9.069%
probes: 9.069%
files : 35.288%
probes: 35.288%
files : 9.434%
probes: 9.434%
