# Multiplier analysis

The goal of this notebook is to examine why genes were found to be generic. Specifically, it asks: are generic genes found in more MultiPLIER latent variables than specific genes?

The PLIER model performs a matrix factorization of gene expression data to get two matrices: loadings (Z) and latent matrix (B). The loadings (Z) are constrained to align with curated pathways and gene sets specified by prior knowledge [Figure 1B of Taroni et al.](). This ensures that some, but not all, latent variables capture known biology. PLIER does this by applying a penalty so that each latent variable represents only a few gene sets, which makes the latent variables more interpretable. Ideally, each latent variable would be associated unambiguously with a single gene set.
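
For reference, the factorization and penalty described above can be written as the following objective. This is a sketch based on how the PLIER paper states it, not something defined in this notebook: the symbols Y (genes × samples expression matrix), C (genes × gene sets prior-knowledge matrix), U (gene sets × latent variables coefficients), and the λ weights are notation assumed from that paper.

$$
\min_{Z, B, U} \; \lVert Y - ZB \rVert_F^2 + \lambda_1 \lVert Z - CU \rVert_F^2 + \lambda_2 \lVert B \rVert_F^2 + \lambda_3 \lVert U \rVert_{L1} \quad \text{subject to } U \ge 0,\; Z \ge 0
$$

Under this reading, the second term ties the loadings Z to combinations of known gene sets, and the L1 penalty on U is what pushes each latent variable to use only a few gene sets.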

While the PLIER model was trained on specific datasets, MultiPLIER extended this approach to all of recount2, so its latent variables should correspond to specific pathways or gene sets of interest. Therefore, we will compare the coverage of generic genes versus specific genes across these MultiPLIER latent variables.

In [1]:
%load_ext autoreload
%autoreload 2

import os
import re
import pandas as pd

from generic_expression_patterns_modules import process



In [2]:
# Get data directory containing gene summary data
base_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
data_dir = os.path.join(base_dir, "human_general_analysis")

In [None]:
# TO DO: Move all functions to file

In [3]:
# Read in all files of the form "generic_gene_summary_*"
# For each file, return list of generic genes and list of specific genes
# Should it return dictionary of lists?
def get_gene_summary_files(data_dir):
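    # Match file names that start with "generic_gene_summary_" followed by
    # at least one more character
    files = [
        os.path.join(data_dir, f)
        for f in os.listdir(data_dir)
        if re.match(r"generic_gene_summary_.+", f)
    ]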
    return files

ls_data_files = get_gene_summary_files(data_dir)

In [4]:
ls_data_files

['/home/alexandra/Documents/Repos/generic-expression-patterns/human_general_analysis/generic_gene_summary_SRP012656-Copy1.tsv',
 '/home/alexandra/Documents/Repos/generic-expression-patterns/human_general_analysis/generic_gene_summary_SRP012656.tsv']

In [5]:
def get_generic_specific_genes(list_files, z_threshold):
    """Return a [generic_genes, specific_genes] pair for each summary file."""
    ls_genes = []
    for file in list_files:
        print(f"Reading data for {file}")
        data = pd.read_csv(file, sep="\t", index_col=0, header=0)
        print(data.shape)

        # Candidate DEGs: test statistic in the real experiment > 1
        is_deg = data["Test statistic (Real)"] > 1
        abs_z = data["abs(Z score)"]

        # Get predicted specific DEGs using z-score cutoff
        ls_specific_genes = list(
            data.loc[is_deg & (abs_z > z_threshold), "Gene ID"]
        )
        print(f"No. of specific DEGs using z-score: {len(ls_specific_genes)}")

        # Get predicted generic DEGs using z-score cutoff
        ls_generic_genes = list(
            data.loc[is_deg & (abs_z < z_threshold), "Gene ID"]
        )
        print(f"No. of generic DEGs using z-score: {len(ls_generic_genes)}")

        ls_genes.append([ls_generic_genes, ls_specific_genes])

    return ls_genes

# TO DO: add more accurate description here
# The z-score cutoff separates predicted generic DEGs from specific DEGs.
# It was found by calculating the two-tailed cutoff invnorm((0.05/17754)/2).
# To do this in python you can use the following code:
# from scipy.stats import norm
# abs(norm.ppf((0.05/17754)/2))
# Here we are using a p-value = 0.05
# with a Bonferroni correction for 17754 tests, which is
# the number of human genes in this analysis

zscore_threshold = 4.68
ls_genes_out = get_generic_specific_genes(ls_data_files, zscore_threshold)

# TO DO:
# Is this how we want to define the genes? ranking?

Reading data for /home/alexandra/Documents/Repos/generic-expression-patterns/human_general_analysis/generic_gene_summary_SRP012656-Copy1.tsv
(17754, 10)
No. of specific DEGs using z-score: 2
No. of generic DEGs using z-score: 4065
Reading data for /home/alexandra/Documents/Repos/generic-expression-patterns/human_general_analysis/generic_gene_summary_SRP012656.tsv
(17754, 10)
No. of specific DEGs using z-score: 2
No. of generic DEGs using z-score: 4065
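
The cutoff used above can be checked with a few lines of Python. This is a minimal sketch, assuming a two-tailed test with a Bonferroni correction as described in the comments; the helper name `bonferroni_z_cutoff` is hypothetical, and it uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm`.

```python
from statistics import NormalDist


def bonferroni_z_cutoff(alpha, n_tests):
    """Two-tailed z-score cutoff for a Bonferroni-corrected significance level."""
    corrected_alpha = alpha / n_tests
    # The inverse CDF of the lower tail is negative, so take the absolute value
    return abs(NormalDist().inv_cdf(corrected_alpha / 2))


# 17754 is the number of genes (rows) in the summary files read above
zscore_cutoff = bonferroni_z_cutoff(alpha=0.05, n_tests=17754)
print(f"{zscore_cutoff:.2f}")
```

The printed value should land close to the `zscore_threshold = 4.68` used in this notebook.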


In [6]:
# Load MultiPLIER models
# The pickle files produced by
# https://github.com/greenelab/phenoplier/blob/master/nbs/01_preprocessing/005-multiplier_recount2_models.ipynb
# (loaded using the phenoplier environment) were converted into .tsv files
# Raw data was downloaded from https://figshare.com/articles/recount_rpkm_RData/5716033/4
multiplier_model_u = pd.read_csv("multiplier_model_u.tsv", sep="\t", index_col=0, header=0)
multiplier_model_z = pd.read_csv("multiplier_model_z.tsv", sep="\t", index_col=0, header=0)

print(multiplier_model_u.shape)
multiplier_model_u.head()

(628, 987)


Unnamed: 0,LV1,LV2,LV3,LV4,LV5,LV6,LV7,LV8,LV9,LV10,...,LV978,LV979,LV980,LV981,LV982,LV983,LV984,LV985,LV986,LV987
IRIS_Bcell-Memory_IgG_IgA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
IRIS_Bcell-Memory_IgM,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
IRIS_Bcell-naive,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
IRIS_CD4Tcell-N0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
IRIS_CD4Tcell-Th1-restimulated12hour,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
print(multiplier_model_z.shape)
multiplier_model_z.head()

(6750, 987)


Unnamed: 0,LV1,LV2,LV3,LV4,LV5,LV6,LV7,LV8,LV9,LV10,...,LV978,LV979,LV980,LV981,LV982,LV983,LV984,LV985,LV986,LV987
GAS6,0.0,0.0,0.039438,0.0,0.050476,0.0,0.0,0.0,0.590949,0.0,...,0.050125,0.0,0.033407,0.0,0.0,0.005963,0.347362,0.0,0.0,0.0
MMP14,0.0,0.0,0.0,0.0,0.070072,0.0,0.0,0.004904,1.720179,2.423595,...,0.0,0.0,0.001007,0.0,0.035747,0.0,0.0,0.0,0.014978,0.0
DSP,0.0,0.0,0.0,0.0,0.0,0.041697,0.0,0.005718,0.0,0.0,...,0.020853,0.0,0.0,0.0,0.0,0.005774,0.0,0.0,0.0,0.416405
MARCKSL1,0.305212,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.161843,0.149471,...,0.027134,0.05272,0.0,0.030189,0.060884,0.0,0.0,0.0,0.0,0.44848
SPARC,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014014,...,0.0,0.0,0.0,0.0,0.0,0.0,0.067779,0.0,0.122417,0.062665


In [17]:
# Get a rough sense for how many genes contribute to a given LV
# (i.e. how many genes have a value > 0 per LV)
(multiplier_model_z > 0).sum()

LV1      3219
LV2      1886
LV3      2683
LV4      2951
LV5      1668
LV6      2168
LV7      2141
LV8      2425
LV9      3717
LV10     2042
LV11     1832
LV12     1563
LV13     1426
LV14     1603
LV15     3065
LV16     2850
LV17     2284
LV18     2177
LV19     2113
LV20     3593
LV21     2354
LV22     2241
LV23     2704
LV24     2416
LV25     2364
LV26     2064
LV27     2088
LV28     2961
LV29     1632
LV30     2778
         ... 
LV958    2618
LV959    2331
LV960    2043
LV961    2360
LV962    2436
LV963    3056
LV964    3539
LV965    2701
LV966    3725
LV967    1989
LV968    3372
LV969    2925
LV970    3168
LV971    2674
LV972    2503
LV973    2873
LV974    3761
LV975    2136
LV976    4128
LV977    2048
LV978    2888
LV979    3153
LV980    2539
LV981    2654
LV982    2385
LV983    2874
LV984    4605
LV985    3030
LV986    2663
LV987    4285
Length: 987, dtype: int64

In [15]:
# One-off check to get a sense for how many genes are being compared
# Filter genes to only use those shared between our analysis and multiplier
# Check overlap between multiplier genes and our genes
multiplier_genes = list(multiplier_model_z.index)
our_genes = list(pd.read_csv(ls_data_files[0], sep="\t", index_col=0, header=0).index)
shared_genes = set(our_genes).intersection(multiplier_genes)

print(len(our_genes))
print(len(shared_genes))

17754
6374


In [9]:
# Input: list of generic genes, specific genes, LV matrix
# Compare coverage of generic genes vs LV
# Compare coverage of specific genes vs LV
# Find genes that have a nonzero contribution to LV
# Return number of LV with at least one gene
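
One possible shape for the coverage function outlined above is sketched below. It is an assumption about the intended design, not the notebook's final implementation: the name `get_lv_coverage` is hypothetical, "coverage" is taken to mean the number of LVs in which a gene has a nonzero loading, and the toy loading matrix uses made-up values in place of `multiplier_model_z`.

```python
import pandas as pd


def get_lv_coverage(gene_ids, lv_matrix):
    """Count, for each gene, how many LVs it has a nonzero loading in.

    gene_ids: iterable of gene identifiers
    lv_matrix: DataFrame of loadings with genes as rows and LVs as columns
    Returns (per_gene_counts, n_lvs_covered), where n_lvs_covered is the
    number of LVs with at least one of the genes contributing.
    """
    # Keep only genes that are present in the loading matrix
    shared = lv_matrix.index.intersection(list(gene_ids))
    nonzero = lv_matrix.loc[shared] != 0
    per_gene_counts = nonzero.sum(axis=1)
    n_lvs_covered = int(nonzero.any(axis=0).sum())
    return per_gene_counts, n_lvs_covered


# Toy loading matrix (made-up values): 4 genes x 3 LVs
toy_z = pd.DataFrame(
    [[0.5, 0.0, 0.2], [0.0, 0.0, 0.0], [0.1, 0.3, 0.4], [0.0, 0.7, 0.0]],
    index=["geneA", "geneB", "geneC", "geneD"],
    columns=["LV1", "LV2", "LV3"],
)

# "geneX" is absent from the matrix and is dropped
generic_counts, generic_n_lvs = get_lv_coverage(["geneA", "geneC", "geneX"], toy_z)
specific_counts, specific_n_lvs = get_lv_coverage(["geneD"], toy_z)
```

With the real data, the same call would take the generic and specific lists from `ls_genes_out` and `multiplier_model_z` as the loading matrix. Returning per-gene counts keeps the full distribution available for plotting.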

In [10]:
# Plot coverage distribution given list of generic coverage, specific coverage
# save plot
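
The plotting step could look roughly like the sketch below. It assumes matplotlib is available and that each coverage input is a list of per-gene LV counts; the function name `plot_coverage_distribution`, the output file name, and the toy counts are all hypothetical.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt


def plot_coverage_distribution(generic_coverage, specific_coverage, out_filename):
    """Plot overlaid histograms of LV coverage and save the figure to disk."""
    fig, ax = plt.subplots(figsize=(6, 4))
    # density=True makes the two groups comparable when their sizes differ
    ax.hist(
        [list(generic_coverage), list(specific_coverage)],
        bins=20,
        density=True,
        label=["generic", "specific"],
    )
    ax.set_xlabel("Number of LVs with a nonzero loading")
    ax.set_ylabel("Density")
    ax.legend()
    fig.savefig(out_filename, dpi=300, bbox_inches="tight")
    plt.close(fig)
    return out_filename


# Toy per-gene coverage counts (made-up values)
out_path = plot_coverage_distribution(
    generic_coverage=[120, 340, 560, 410, 230],
    specific_coverage=[15, 40],
    out_filename=os.path.join(tempfile.gettempdir(), "lv_coverage_toy.png"),
)
```

Normalizing to a density is a design choice: with thousands of generic genes and only a handful of specific genes, raw counts would hide the specific group.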