Reproducibility notebook for preliminary analysis
=
This notebook accompanies TF_target_cause_effect.ipynb (in the same directory). I take the analysis I did there and make it as reproducible as possible, which mostly means packaging each step into a function. For example, after producing the heatmap of differential Myc expression, we might ask: "Is something weird going on with the one clone that shows a stripe, indicating a very different Myc profile?" One basic check is to produce the same heatmap for other genes and see whether the same clone, or different clones, show stripes. Because gene counts are normalized within each cell, it seems unlikely that the same clone will have a stripe for many different genes, but it is still helpful to know what is going on. Packaging the analysis this way will also make it easier to rerun when we get new data or a new clustering.

In [1]:
import pandas as pd
import numpy as np
import scanpy
from collections import Counter
from matplotlib import pyplot as plt
from scipy import special
from scipy import stats

In [424]:
ann_data = scanpy.read_h5ad('preprocessed_adata/Weinreb2020.adata.h5ad')

In [2]:
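def filter_for_top_clones(data, n=100):
    """Return a copy of `data` restricted to cells from the n largest clones."""
    # Count cells per clone, ignoring cells with no clone assignment.
    c = Counter(data.obs['clone_idx'].dropna().tolist())
    topn_clones = [clone[0] for clone in c.most_common(n)]
    return data[data.obs['clone_idx'].isin(topn_clones)].copy()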
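
To see what the top-clone filter selects, here is a minimal sketch on a toy table standing in for `data.obs`. The helper name `top_clone_mask` and the toy values are made up for illustration; only the `Counter` / `isin` pattern comes from the function above.

```python
from collections import Counter
import pandas as pd

# Toy stand-in for data.obs: one row per cell, NaN where no clone was assigned.
obs = pd.DataFrame(
    {'clone_idx': [7, 7, 7, 3, 3, None, 5, None]},
    index=[f'cell{i}' for i in range(8)],
)

def top_clone_mask(clone_idx, n):
    """Boolean mask selecting cells that belong to the n largest clones (hypothetical helper)."""
    counts = Counter(clone_idx.dropna().tolist())
    top = [clone for clone, _ in counts.most_common(n)]
    return clone_idx.isin(top)

mask = top_clone_mask(obs['clone_idx'], n=2)
kept = obs[mask]
print(kept)
```

Cells without a clone assignment are dropped before counting, and `isin` evaluates to False for them, so they never appear in the filtered result.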

In [3]:
def make_pairwise_Wasserstein_distances(data, gene_id):
    """Return a long-format DataFrame of Wasserstein distances between all clone pairs for one gene."""
    # Pull out the gene's expression once rather than converting inside the loop.
    expr = data[:, gene_id].to_df()[gene_id]
    clones = data.obs['clone_idx']
    clone_ids = clones.dropna().unique()
    tuples = []
    for clone1 in clone_ids:
        for clone2 in clone_ids:
            wd = stats.wasserstein_distance(expr[clones == clone1], expr[clones == clone2])
            tuples.append([clone1, clone2, wd])

    df = pd.DataFrame(tuples, columns=['clone1', 'clone2', 'wd'])
    return df
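
The long-format table of pairwise distances is easiest to inspect once it is reshaped into a clone-by-clone matrix. The sketch below runs the same pairwise computation on synthetic data and pivots the result. The toy table `cells`, the column name `expr`, and the clone labels are assumptions for illustration, not the Weinreb data.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical toy data: one gene's expression in three clones; clone 'C' is shifted.
cells = pd.DataFrame({
    'clone_idx': ['A'] * 50 + ['B'] * 50 + ['C'] * 50,
    'expr': np.concatenate([
        rng.normal(0.0, 1.0, 50),
        rng.normal(0.0, 1.0, 50),
        rng.normal(3.0, 1.0, 50),
    ]),
})

# Group expression values by clone, then compute every pairwise distance.
groups = {clone: g['expr'].to_numpy() for clone, g in cells.groupby('clone_idx')}
rows = [
    [c1, c2, stats.wasserstein_distance(x1, x2)]
    for c1, x1 in groups.items()
    for c2, x2 in groups.items()
]
long_df = pd.DataFrame(rows, columns=['clone1', 'clone2', 'wd'])

# Reshape to a square matrix, the natural input for a heatmap.
wd_matrix = long_df.pivot(index='clone1', columns='clone2', values='wd')
print(wd_matrix.round(2))
```

A clone whose expression distribution differs from the rest shows up as a row and column of large distances, which is the "stripe" in the heatmap. Passing `wd_matrix` to `plt.imshow` is one way to draw it.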