# Compare SPELL vs counts correlation

This notebook performs an experiment to determine which correlation matrix we should use:
1. Correlation of the MR counts expression matrix
2. Correlation of the MR counts that have been processed using SPELL

The correlation of the counts matrix measures how similar a pair of genes is based on their expression profiles - **relates genes over samples**.
High correlation means that a pair of genes has similar expression profiles, i.e. similar levels of expression across samples/contexts: both genes have low expression in the same samples and high expression in the same samples.
* Pro: Easy to interpret
* Con: Many gene pairs have a high correlation because many genes that belong to the same pathway have similar expression profiles. This is consistent with [Myers et al.](https://link.springer.com/article/10.1186/1471-2164-7-187), who found that there can be an over-representation of genes associated with the same pathway (i.e. a large fraction of gene pairs represent ribosomal relationships). This very prominent signal makes it difficult to detect other signals. Figure 1C demonstrates that a large fraction of gene pairs are ribosomal relationships - in the top 0.1% most co-expressed genes, 99% belong to the ribosome pathway. Furthermore, the performance of protein function prediction based on co-expression drops dramatically after removing the ribosome pathway (Figure 1A, B).
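
To make "relates genes over samples" concrete, here is a minimal sketch on a toy matrix. The gene names and values are made up for illustration, and the sketch assumes a samples x genes layout with Pearson correlation; the real correlation matrices are loaded from file later in this notebook.

```python
import pandas as pd

# Toy expression matrix (hypothetical values): rows = samples, columns = genes
toy_counts = pd.DataFrame(
    {
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.0, 4.0, 6.0, 8.0],  # same profile as geneA, scaled
        "geneC": [4.0, 3.0, 2.0, 1.0],  # opposite profile to geneA
    },
    index=["s1", "s2", "s3", "s4"],
)

# Pearson correlation between gene columns, computed across samples
toy_corr = toy_counts.corr(method="pearson")
print(toy_corr.round(2))
```

Here geneA and geneB rise and fall together across the samples, so their correlation is 1, while geneA and geneC move in opposite directions and have a correlation of -1.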

To try to remove this very dominant global signal in the data, we apply dimensionality reduction in addition to scaling the data, using a method called SPELL.
The correlation of the SPELL matrix relates genes based on the gene coefficient matrix - **relates genes over their contribution to singular vectors (linear combinations of genes - linear relationships between genes)**.
High correlation means that a pair of genes contributes similarly to the singular vectors (SVs), which are the axes pointing in the directions of the spread of the data and capture how genes are related to each other.
* Pro: Gene contributions are more balanced so that redundant signals (i.e. many genes from the same pathway - genes that vary together) are represented by a few SVs as opposed to many samples. More balanced also means that more subtle signals can be amplified (i.e. genes related by a smaller pathway are also captured by a few SVs)
* Con: Can amplify noise - i.e. an SV that corresponds to some technical source of variability now has a similar weight to other real signals
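
The idea of relating genes over singular vectors can be sketched as follows. This is only an illustration under stated assumptions: it assumes that SPELL-style processing amounts to taking the SVD of a centered samples x genes matrix and correlating genes across their singular-vector coefficients, with the singular values dropped. The data are random, the variable names are hypothetical, and the actual pipeline used to produce the SPELL correlation matrices may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix (random values): rows = samples, columns = genes
X = rng.normal(size=(5, 8))
X = X - X.mean(axis=0)  # center each gene across samples

# Thin SVD: X = U @ diag(s) @ Vt
# Each row of Vt is a singular vector written as a linear combination of genes
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Drop components with (numerically) zero singular values
keep = s > 1e-10 * s.max()

# Gene coefficient matrix: rows = genes, columns = singular vectors
gene_coef = Vt[keep].T

# Correlate genes over their SV coefficients rather than over samples.
# The singular values are not used, so every retained SV has equal weight.
spell_like_corr = np.corrcoef(gene_coef)

# For comparison: correlation of genes over samples
counts_corr = np.corrcoef(X.T)
```

Because the singular values are dropped, a dominant signal no longer outweighs the others, which is the balancing effect described in the Pro above. It is also why an SV driven by technical noise gets the same weight as a real signal, as noted in the Con.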

For more information comparing counts vs SPELL processing, see: https://docs.google.com/presentation/d/18E0boNODJaxP-YYNIlccrh0kASbc7bapQBMovOX62jw/edit#slide=id.gf9d09c6be6_0_0

In [1]:
%load_ext autoreload
%autoreload 2
import os
import pandas as pd
import numpy as np
from scripts import paths

## User params

In [2]:
# Which correlation matrix to use. Choices = ["raw", "spell"]
which_corr = "raw"

# Threshold to use to define edges between genes
# Expressed as a fraction: the top `top_percent` of gene pairs are used (0.1 = top 10%)
top_percent = 0.1

## Load correlation matrix

In [3]:
if which_corr == "raw":
    pao1_corr_filename = paths.PAO1_CORR_RAW
    pa14_corr_filename = paths.PA14_CORR_RAW
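elif which_corr == "spell":
    pao1_corr_filename = paths.PAO1_CORR_LOG_SPELL
    pa14_corr_filename = paths.PA14_CORR_LOG_SPELL
else:
    # Fail early instead of hitting a NameError when loading the files
    raise ValueError(f'`which_corr` must be "raw" or "spell", got {which_corr!r}')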

In [4]:
# Load correlation data
pao1_corr = pd.read_csv(pao1_corr_filename, sep="\t", index_col=0, header=0)
pa14_corr = pd.read_csv(pa14_corr_filename, sep="\t", index_col=0, header=0)

## Make edge matrix

Convert the correlation matrix of continuous values to an adjacency matrix that contains a 1 where the correlation between a pair of genes exceeds the user-defined threshold, indicating that an edge exists between that pair of genes, and a 0 otherwise.

In [5]:
# Get threshold to use based on percentage
def get_corr_threshold(corr_df, top_percent):
    # Since we are using the distribution of scores to determine the threshold
    # we need to remove duplicates and also the diagonal values
    # Here we get lower triangular matrix values only
    # Note: the `np.bool` alias was removed in NumPy 1.24, so use the builtin `bool`
    tril_corr_df = corr_df.where(~np.triu(np.ones(corr_df.shape)).astype(bool))

    # Flatten dataframe
    flat_corr_df = tril_corr_df.stack().reset_index()
    flat_corr_df.columns = ["gene_1", "gene_2", "corr_value"]

    # Get quantile
    # TO DO:Take abs????
    # Select the numeric column first so the gene id columns are not passed to `quantile`
    threshold = flat_corr_df["corr_value"].quantile(1 - top_percent)
    print("correlation threshold: ", threshold)

    # Verify that the fraction of gene pairs above the threshold
    # is approximately equal to `top_percent`
    total_pairs = flat_corr_df.shape[0]
    num_pairs_above = flat_corr_df[flat_corr_df["corr_value"] > threshold].shape[0]
    fraction_pairs_above = num_pairs_above / total_pairs
    print("percent of pairs exceeding threshold: ", fraction_pairs_above)

    return threshold

In [6]:
pao1_corr_threshold = get_corr_threshold(pao1_corr, top_percent)

correlation threshold:  0.23264017569441092
percent of pairs exceeding threshold:  0.10000004524681264


In [7]:
pa14_corr_threshold = get_corr_threshold(pa14_corr, top_percent)

correlation threshold:  0.25345171772732417
percent of pairs exceeding threshold:  0.09999999422814115


In [8]:
# Create adjacency matrix using threshold defined above
# The adjacency matrix indicates whether two genes are connected
# If the correlation is strong enough (i.e. above the threshold), then
# the genes are connected by an edge
# TO DO:abs?????
pao1_adj = (pao1_corr > pao1_corr_threshold).astype(float)
pa14_adj = (pa14_corr > pa14_corr_threshold).astype(float)
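
The TO DO notes above ask whether the absolute value of the correlations should be used. As a hedged illustration of what that choice changes (toy values and a hypothetical cutoff, not the notebook's data): thresholding signed correlations keeps only positively correlated pairs, while thresholding absolute values also connects strongly anti-correlated pairs.

```python
import pandas as pd

genes = ["g1", "g2", "g3"]

# Toy correlation matrix (hypothetical values)
toy_corr = pd.DataFrame(
    [
        [1.0, 0.8, -0.9],
        [0.8, 1.0, 0.1],
        [-0.9, 0.1, 1.0],
    ],
    index=genes,
    columns=genes,
)
threshold = 0.5  # hypothetical cutoff, for illustration only

# Signed version: same rule as used in this notebook
signed_adj = (toy_corr > threshold).astype(float)

# Absolute-value version: also treats strong negative correlation as an edge
abs_adj = (toy_corr.abs() > threshold).astype(float)

print(signed_adj.loc["g1", "g3"])  # 0.0, no edge for the anti-correlated pair
print(abs_adj.loc["g1", "g3"])  # 1.0, edge for the anti-correlated pair
```

In both versions the diagonal is 1 because every gene is perfectly correlated with itself, so self edges need to be excluded when counting edges.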

In [10]:
pao1_adj.head()

Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA1905,PA0195,PA4812,PA0195.1,PA0457.1,PA1552.1,PA1555.1,PA3701,PA4724.1,PA5471.1
PA0001,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
PA0002,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
PA0003,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
PA0004,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,...,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
PA0005,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0


## Examine relationships

Given a set of known gene regulons that we expect to be clustered together, let's check whether that is the case.

Here we calculate, for a given regulon, the proportion of within edges (i.e. edges connecting two genes from the same regulon) versus between edges (i.e. edges connecting a gene in the regulon to a gene outside of it).
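
One possible way to compute these proportions is sketched below on toy data. The function name `edge_proportions`, the toy adjacency matrix and the toy regulon are all hypothetical; the sketch assumes a symmetric adjacency matrix with 1's on the diagonal, as produced by the thresholding step in this notebook, and a gene set with at least two genes present in the matrix.

```python
import pandas as pd

genes = ["g1", "g2", "g3", "g4", "g5"]

# Toy symmetric adjacency matrix (hypothetical edges), 1.0 on the diagonal
toy_adj = pd.DataFrame(
    [
        [1.0, 1.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0, 1.0],
        [1.0, 0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0, 1.0],
        [0.0, 1.0, 0.0, 1.0, 1.0],
    ],
    index=genes,
    columns=genes,
)
toy_regulon = ["g1", "g2", "g3"]


def edge_proportions(adj_df, geneset):
    """Return (within, between) edge proportions for one gene set."""
    # Keep only the genes that are present in the adjacency matrix
    in_set = [g for g in geneset if g in adj_df.index]
    in_lookup = set(in_set)
    out_set = [g for g in adj_df.index if g not in in_lookup]

    # Within edges: drop the diagonal and count each unordered pair once
    within = adj_df.loc[in_set, in_set].to_numpy()
    n = len(in_set)
    within_edges = (within.sum() - within.trace()) / 2
    within_pairs = n * (n - 1) / 2

    # Between edges: every (regulon gene, outside gene) pair appears once
    between = adj_df.loc[in_set, out_set].to_numpy()

    return within_edges / within_pairs, between.sum() / between.size


within_prop, between_prop = edge_proportions(toy_adj, toy_regulon)
print(within_prop, between_prop)
```

In this toy example 2 of the 3 possible within pairs are connected, compared with 1 of the 6 possible between pairs, which is the pattern we would expect for a regulon whose genes cluster together.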

In [17]:
regulon_filename = "gene_sets_refs.csv"

In [18]:
regulon_df = pd.read_csv(regulon_filename, header=0, index_col=0)

In [20]:
regulon_df["Genes"] = regulon_df["Genes"].str.split(";").apply(list)

In [21]:
regulon_df

Regulon,Lengths,Genes
Anr_regulon,72,"[PA5475, PA1673, PA5027, PA3337, PA4348, PA434..."
PhoB_regulon,160,"[PA0050, PA0051, PA0082, PA0102, PA0105, PA016..."
PvdR_regulon,14,"[PA2386, PA2399, PA2397, PA2396, PA2425, PA241..."
PchR_regulon,12,"[PA4231, PA4230, PA4229, PA4228, PA4226, PA422..."
AlgU_regulon,238,"[PA0059, PA0060, PA0061, PA0062, PA0071, PA013..."
LasR_regulon,183,"[PA0007, PA0026, PA0027, PA0028, PA0050, PA005..."
RhlR_regulon,123,"[PA0059, PA0109, PA0132, PA0144, PA0175, PA017..."
PqsR_regulon,133,"[PA0122, PA0355, PA0567, PA0996, PA0997, PA099..."
QscR_regulon,405,"[PA0007, PA0059, PA0105, PA0106, PA0107, PA012..."
VreI_regulon,30,"[PA0149, PA0532, PA0674, PA0675, PA0676, PA067..."


In [28]:
def calc_within_edge(adj_df, regulon_df):
    # Loop through each regulon
    for regulon_name in regulon_df.index:
        geneset = regulon_df.loc[regulon_name, "Genes"]
        print(geneset)
        print(len(geneset))

        # Since there are some gene ids that are from PA14
        # we will take the intersection
        # Convert to a list because recent pandas versions do not accept sets as `.loc` indexers
        geneset_processed = list(set(geneset).intersection(adj_df.index))

        # Get within edges
        within_genes = adj_df.loc[geneset_processed, geneset_processed]
        print(within_genes)

        # Get between edges
        not_geneset = list(set(adj_df.index).difference(geneset_processed))
        between_genes = adj_df.loc[geneset_processed, not_geneset]

        print(between_genes)

        # Get the proportion of 1's looking at within and between genes

        # make df

        break

        # return

In [29]:
calc_within_edge(pao1_adj, regulon_df)

['PA5475', 'PA1673', 'PA5027', 'PA3337', 'PA4348', 'PA4347', 'PA4346', 'PA0527', 'PA2119', 'PA0200', 'PA0519', 'PA0518', 'PA0517', 'PA0516', 'PA0515', 'PA0514', 'PA0513', 'PA0512', 'PA0511', 'PA0510', 'PA0509', 'PA1546', 'PA5232', 'PA5231', 'PA5230', 'PA4577', 'PA0141', 'PA1746', 'PA1557', 'PA1556', 'PA1555', 'PA1554', 'PA4067', 'PA5427', 'PA4352', 'PA2127', 'PA2126', 'PA2125', 'PA3930', 'PA3929', 'PA3928', 'PA4587', 'PA0024', 'PA0459', 'PA0520', 'PA0521', 'PA0522', 'PA0523', 'PA0524', 'PA0525', 'PA0526', 'PA0836', 'PA1561', 'PA1789', 'PA1863', 'PA1862', 'PA1861', 'PA2193', 'PA2194', 'PA2195', 'PA3190', 'PA3309', 'PA3391', 'PA3877', 'PA3876', 'PA3878', 'PA3879', 'PA4236', 'PA4328', 'PA4922', 'PA5170', 'PA5171']
72
        PA3928  PA0524  PA4236  PA5027  PA3877  PA0141  PA1554  PA0509  \
PA3928     1.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
PA0524     0.0     1.0     0.0     0.0     1.0     1.0     0.0     1.0   
PA4236     0.0     0.0     1.0     0.0     0.0     1.0

In [None]:
# Make boxplot for number of edges within vs between genes in gene sets/regulons

In [None]:
# Repeat this for different thresholds and save each one (spell, counts) (0.01, 0.05, 0.1, 0.2)

In [None]:
# Try this for single threshold for one dataset for other gene sets

In [9]:
"""# Save membership dataframe
pao1_membership_filename = os.path.join(
    paths.LOCAL_DATA_DIR, f"pao1_modules_{cluster_method}_{gene_subset}_{processed}.tsv"
)
pa14_membership_filename = os.path.join(
    paths.LOCAL_DATA_DIR, f"pa14_modules_{cluster_method}_{gene_subset}_{processed}.tsv"
)
pao1_membership_df.to_csv(pao1_membership_filename, sep="\t")
pa14_membership_df.to_csv(pa14_membership_filename, sep="\t")"""
