# Mutation correlations
In this notebook we focus on each of the mutations individually.

In [None]:
%matplotlib inline
from scipy.stats import pearsonr
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns

from source import load_avenio_files
from transform import (
    clean_mutation_columns, 
    get_top_correlated, 
    patient_allele_frequencies,
)


RANDOM_STATE = 1234
np.random.seed(RANDOM_STATE)

In [None]:
# Load data from spreadsheet and SPSS files.
mutation_data_frame, phenotypes = load_avenio_files()

In this document the mutation correlations will be calculated between $t_0$ and $t_1$. 
We consider the correlations for the following two quantities:
- allele frequencies $f_0$ and $f_1$ respectively,
- and the mutant concentration (molecules per ml) $c_0$ and $c_1$.

In [None]:
# Vocabulary is the entire dataset, not only training set. Otherwise we run into problems during inference.
gene_vocabulary = mutation_data_frame['Gene'].unique()

# Convert particular columns to numbers and drop rows with missing data.
mutation_data_frame = clean_mutation_columns(mutation_data_frame)

There are several ways to evaluate the development of a mutation. One way is to calculate the relative difference
$$r(x) = \frac{\Delta x}{x_0} \equiv \frac{x_1 - x_0}{x_0} ,$$
where $x$ is $f$ (allele frequency) or $c$ (mutant concentration).

In [None]:
def r(t_0, t_1):
    """Relative difference (t_1 - t_0) / t_0 between two time points."""
    return (t_1 - t_0) / t_0

Apart from the relative difference $r(x)$ there are other ways to evaluate the growth, such as the ratio $x_1/x_0$ or the absolute difference $\Delta x$. These quantities turned out to vary strongly between patients for a given mutation (results not shown here). I will therefore focus only on the relative difference $r(x)$.

Next, carry out the following steps:
1. For each patient: calculate $r(f)$ and $r(c)$ for each gene mutation.
2. If there are multiple mutations in a single gene, sum the results: $r(x) = \sum_i r(x_i)$.
3. Store the result in a column corresponding to that gene.
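
The three steps can be sketched on a toy table. This is only an illustration under stated assumptions, not the implementation of `patient_allele_frequencies`: the column names (`patient`, `gene`, `x_0`, `x_1`), the values, and the choice to fill absent genes with zero are all hypothetical.

```python
import pandas as pd

# Hypothetical long-format records: one row per detected mutation.
records = pd.DataFrame(
    {
        "patient": [1, 1, 1, 2],
        "gene": ["TP53", "TP53", "KRAS", "TP53"],
        "x_0": [0.10, 0.20, 0.05, 0.40],
        "x_1": [0.05, 0.30, 0.10, 0.10],
    }
)

# Step 1: relative difference r(x) for every mutation.
records["r"] = (records["x_1"] - records["x_0"]) / records["x_0"]

# Steps 2 and 3: sum r(x) within each (patient, gene) pair and give
# every gene its own column. Filling absent genes with 0 is an
# assumption of this sketch.
table = records.pivot_table(
    index="patient",
    columns="gene",
    values="r",
    aggfunc="sum",
    fill_value=0.0,
)
print(table)
```

In this toy table the two `TP53` mutations of patient 1 have relative differences of opposite sign and equal size, so their sum is (numerically) zero, which shows how summing can hide opposing changes within one gene.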

In [None]:
# Calculate allele frequencies r(f).
mutant_allele_frequencies = patient_allele_frequencies(
    mutation_data_frame, 
    gene_vocabulary, 
    # Calculate r(t_0, t_1).
    transformation=r,
    # Sum mutation values per gene in each patient.
    handle_duplicates="sum",
    allele_columns=["T0: Allele \nFraction", "T1: Allele Fraction"],
)

# Calculate mutant concentration r(c).
mutant_allele_concentration = patient_allele_frequencies(
    mutation_data_frame, 
    gene_vocabulary, 
    # Calculate r(t_0, t_1).
    transformation=r,
    # Sum mutation values per gene in each patient.
    handle_duplicates="sum",
    allele_columns=[
        "T0: No. Mutant \nMolecules per mL",
        "T1: No. Mutant \nMolecules per mL",
    ],
)

To give an idea of the resulting table, here are the first few patient records for $r(f)$:

In [None]:
mutant_allele_frequencies.head()

# Correlations
Now that we have cleaned the data we can start calculating correlations for both $r(f)$ and $r(c)$. The goal is to see whether $r(f)$ or $r(c)$ leads to more pronounced correlations.
We calculate the Pearson correlation coefficient, which is defined as
$$C_{ij} = \frac{1}{N} \sum_{m=1}^{N} \frac{(X_{mi} - \mu_i)(X_{mj} - \mu_j)}{\sigma_i \sigma_j} \, ,$$
with $\mu_i$ and $\sigma_i$ the mean and standard deviation of column $i$, respectively, and $N$ the number of rows.
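
The definition can be checked numerically. The sketch below computes the correlation matrix by hand, normalising the sum by the number of rows $N$ and using the population standard deviation, and compares the result with `numpy.corrcoef`. The function name `pearson_matrix` and the data are hypothetical.

```python
import numpy as np


def pearson_matrix(X):
    """Pearson correlation between the columns of X (rows are samples)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)  # population standard deviation (ddof=0)
    Z = (X - mu) / sigma   # standardised columns
    return Z.T @ Z / X.shape[0]


# Hypothetical data: 4 samples (rows) and 3 variables (columns).
X = np.array(
    [
        [1.0, 2.0, 0.5],
        [2.0, 1.0, 0.1],
        [3.0, 4.0, -0.3],
        [4.0, 3.0, 0.2],
    ]
)

C = pearson_matrix(X)
matches_numpy = np.allclose(C, np.corrcoef(X, rowvar=False))
print(matches_numpy)
```

A column with zero variance makes $\sigma_i = 0$ and the coefficient undefined, which is why undefined entries have to be handled separately before ranking correlations.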

In [None]:
# Extra function to calculate p-value for given Pearson correlation.
def pearson_pval(x, y):
    return pearsonr(x, y)[1]

## Allele frequency
Let us first focus on the relative difference in allele frequencies $r(f)$ and calculate the corresponding correlations:

In [None]:
corr = mutant_allele_frequencies.corr().fillna(0)
pval_corr = mutant_allele_frequencies.corr(method=pearson_pval).fillna(1)
corr.style.background_gradient(cmap='coolwarm', axis=None)

### Negative correlation
Now, zoom in on the top anti-correlating genes. That is, a relative increase in allele frequency of gene $a$ is associated with a decrease in gene $b$, or vice versa.
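
As a rough illustration of what such a ranking involves, the sketch below lists the most negative off-diagonal entries of a correlation matrix. It is not the implementation of `get_top_correlated`, which also takes p-values and gene counts; the helper `rank_pairs` and the toy matrix are hypothetical.

```python
import numpy as np
import pandas as pd


def rank_pairs(corr, ascending=True, top_count=4):
    """Return the first `top_count` gene pairs of a symmetric correlation matrix."""
    # The strict upper triangle lists every pair once and skips the diagonal.
    rows, cols = np.triu_indices(len(corr), k=1)
    index = pd.MultiIndex.from_arrays([corr.index[rows], corr.columns[cols]])
    pairs = pd.Series(corr.to_numpy()[rows, cols], index=index)
    # ascending=True puts the most negative correlations first.
    return pairs.sort_values(ascending=ascending).head(top_count)


# Hypothetical correlation matrix for three made-up genes.
genes = ["A", "B", "C"]
toy_corr = pd.DataFrame(
    [[1.0, -0.8, 0.3], [-0.8, 1.0, 0.1], [0.3, 0.1, 1.0]],
    index=genes,
    columns=genes,
)

top_negative = rank_pairs(toy_corr, ascending=True, top_count=2)
print(top_negative)
```

Passing `ascending=False` to the same helper would list the strongest positive correlations instead.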

In [None]:
gene_counts = mutation_data_frame['Gene'].value_counts()
get_top_correlated(
    corr, 
    pval_corr,
    gene_counts=gene_counts, 
    ascending=True, 
    top_count=4,
)

The p-values should not be taken too seriously. That they are extremely low is easy to understand:
- Nearly all entries of the two columns are zero,
- except for the entries where the two rare mutations happen to coincide.

This immediately implies that the p-value is near zero.
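
A toy example makes this concrete. The sketch assumes a hypothetical cohort of 50 patients in which two rare mutations occur together in a single patient; the values are made up. Because only one point deviates from zero, all points lie on a straight line and the correlation is $\pm 1$ regardless of the magnitudes involved.

```python
import numpy as np

n_patients = 50  # hypothetical cohort size
gene_a = np.zeros(n_patients)
gene_b = np.zeros(n_patients)

# Both rare mutations are observed in the same single patient.
gene_a[7] = 1.5
gene_b[7] = -0.6

# Pearson correlation between the two sparse columns.
r_sparse = np.corrcoef(gene_a, gene_b)[0, 1]
print(round(r_sparse, 6))
```

A significance test on such a perfect correlation reports a p-value of essentially zero, even though the result rests on a single patient.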

### Positive correlation
Likewise, calculate the top correlating genes. That is, both relative allele frequencies increase or decrease in concert.

In [None]:
pcorr_rf = get_top_correlated(corr, pval_corr, gene_counts=gene_counts, top_count=20, ascending=False)
pcorr_rf

Again, note the concordance of extremely rare mutations. As discussed above, these p-values should be taken with a pinch of salt.

## Mutant concentration
Mutatis mutandis, we will now calculate the correlations for the relative difference in mutant concentration $r(c)$.

In [None]:
corr = mutant_allele_concentration.corr().fillna(0)
pval_corr = mutant_allele_concentration.corr(method=pearson_pval).fillna(1)
corr.style.background_gradient(cmap='coolwarm', axis=None)

### Negative correlation

In [None]:
gene_counts = mutation_data_frame['Gene'].value_counts()
get_top_correlated(
    corr, 
    pval_corr,
    gene_counts=gene_counts, 
    ascending=True, 
    top_count=4,
)

Comparing these results with the negative correlations of $r(f)$, we find that the negative correlations have decreased in absolute size. Importantly, `NFE2L2` with `KDR` remains the top anti-correlating pair, but the remainder of the list has changed altogether.

### Positive correlation
Likewise, the list of the top positively correlating mutations:

In [None]:
pcorr_rc = get_top_correlated(corr, pval_corr, gene_counts=gene_counts, top_count=20, ascending=False)
pcorr_rc

Compared with $r(f)$, the top 20 positive correlations have increased in size.

## Do responders show an increase in mutational allele frequency?

Instead of looking at results per patient, consider now all the mutations and compare them according to the response. Again, we will group the results according to $r(f)$ and $r(c)$.

In [None]:
# Use the columns containing the allele frequencies.
mutation_data_frame["r(f)"] = r(
    mutation_data_frame["T0: Allele \nFraction"], mutation_data_frame["T1: Allele Fraction"]
)

# Use the columns containing the concentration.
mutation_data_frame["r(c)"] = r(
    mutation_data_frame["T0: No. Mutant \nMolecules per mL"], mutation_data_frame["T1: No. Mutant \nMolecules per mL"]
)

Combine records with patient response data.

In [None]:
mutation_data_frame['response'] = mutation_data_frame['Patient ID'].apply(lambda x: phenotypes.loc[x, 'response_grouped'])
mutation_data_frame['progression'] = mutation_data_frame['Patient ID'].apply(lambda x: phenotypes.loc[x, 'progressie'])
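
The same kind of lookup can also be written with `Series.map`, shown here on toy data. The tables `toy_phenotypes` and `toy_mutations` and their values are hypothetical; only the column names `Patient ID` and `response_grouped` are taken from the cell above.

```python
import pandas as pd

# Hypothetical phenotype table, indexed by patient ID.
toy_phenotypes = pd.DataFrame(
    {"response_grouped": ["responder", "non responder"]},
    index=[101, 102],
)

# Hypothetical mutation records: one row per mutation.
toy_mutations = pd.DataFrame(
    {
        "Patient ID": [101, 101, 102],
        "Gene": ["TP53", "KRAS", "TP53"],
    }
)

# Series.map looks up every patient ID in the index of the phenotype column.
toy_mutations["response"] = toy_mutations["Patient ID"].map(
    toy_phenotypes["response_grouped"]
)
print(toy_mutations)
```

One difference in behaviour: `map` yields `NaN` for a patient ID that is missing from the phenotype table, whereas the `.loc` lookup raises a `KeyError`.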

Since most genes occur extremely rarely, only the four most frequently mutated genes are shown:

In [None]:
gene_subset = mutation_data_frame['Gene'].isin(['TP53', 'KRAS', 'PIK3CA', 'NFE2L2'])

g = sns.catplot(
    x='Gene', 
    y='r(f)', 
    hue='response',
    data=mutation_data_frame[gene_subset],
    kind='violin',
)
plt.title('Relative increase allele frequency $r(f)$')
g.fig.set_size_inches(16,8)

On close inspection, responders appear to have a slight decrease in $r(f)$ for `TP53` and `KRAS` compared to non-responders. For `NFE2L2` it appears to be the other way around.

In [None]:
gene_subset = mutation_data_frame['Gene'].isin(['TP53', 'KRAS', 'PIK3CA', 'NFE2L2'])

g = sns.catplot(
    x='Gene', 
    y='r(c)', 
    hue='response',
    data=mutation_data_frame[gene_subset],
    kind='violin',
)
plt.title('Relative increase mutant concentration r(c)')
plt.ylim([-4, 10])
g.fig.set_size_inches(16,8)

Essentially the same conclusion can be drawn from the $r(c)$ data, but now with a larger spread in the distributions.