## Association Analysis Between Protein Types
Since this is a multi-label setting, we might want to know whether certain protein types are associated with one another in the data. This notebook calculates chi-squared and Cramer's V statistics to try to uncover any link between which proteins appear together.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency


import os
print(os.listdir("../input"))

%matplotlib inline

# Any results you write to the current directory are saved as output.

In [None]:
label_names = {
    '0': "Nucleoplasm",
    '1': "Nuclear membrane",
    '2': "Nucleoli",
    '3': "Nucleoli fibrillar center",
    '4': "Nuclear speckles",
    '5': "Nuclear bodies",
    '6': "Endoplasmic reticulum",
    '7': "Golgi apparatus",
    '8': "Peroxisomes",
    '9': "Endosomes",
    '10': "Lysosomes",
    '11': "Intermediate filaments",
    '12': "Actin filaments",
    '13': "Focal adhesion sites",
    '14': "Microtubules",
    '15': "Microtubule ends",
    '16': "Cytokinetic bridge",
    '17': "Mitotic spindle",
    '18': "Microtubule organizing center",
    '19': "Centrosome",
    '20': "Lipid droplets",
    '21': "Plasma membrane",
    '22': "Cell junctions",
    '23': "Mitochondria",
    '24': "Aggresome",
    '25': "Cytosol",
    '26': "Cytoplasmic bodies",
    '27': "Rods & rings"
}

### Formatting the Data
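For our purposes, it'll be easiest to one-hot encode the labels, giving one 0/1 column per protein type (an image can have several 1s, since it can carry several labels).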

In [None]:
data = pd.read_csv('../input/train.csv')
data.Target = data.Target.apply(lambda x: x.split(' '))
one_hot = data.Target.str.join(sep='*').str.get_dummies(sep='*')
one_hot.rename(columns=label_names, inplace=True)
one_hot.index = data.Id
label_count = one_hot.sum()
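
To see what the encoding produces, here is a minimal sketch on a made-up three-row frame (the `Id` values and targets are hypothetical, not taken from `train.csv`). It calls `Series.str.get_dummies` directly on the space-separated string, which should give the same indicator columns as the split-and-rejoin step above.

```python
import pandas as pd

# Hypothetical toy data in the same shape as train.csv: space-separated label ids.
toy = pd.DataFrame({
    "Id": ["a", "b", "c"],
    "Target": ["0 25", "4", "0 4 25"],
})

# One indicator column per label id; a row can have several 1s (multi-label).
toy_one_hot = toy["Target"].str.get_dummies(sep=" ")
toy_one_hot.index = toy["Id"]

# Column sums give the number of images carrying each label.
toy_counts = toy_one_hot.sum()
print(toy_one_hot)
print(toy_counts)
```

Note that the column names come out as strings, which is why `label_names` uses string keys.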

We'll only count the columns that have more than 1000 observations for now, but we might want to come back to this and lower the threshold.

In [None]:
valid_columns = label_count[label_count > 1000].index.values

OK, now that the preprocessing is out of the way we can calculate our statistics. We'll focus on two measures here:
* Chi-Squared
* Cramer's V

Chi-Squared is the go-to statistic for figuring out if there is an association between two categorical variables. However, chi-squared can't tell us _how strongly_ the variables are associated - for that reason, we're going to use [Cramer's V](https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V) to measure the strength of association.

Cramer's V will give us a nice range from 0 to 1 that tells us how strongly two nominal variables are associated, 1 being perfectly associated and 0 being not associated at all (much like the absolute value of a standard correlation coefficient). We'll make a quick helper function for that below. Newer SciPy releases include `scipy.stats.contingency.association`, but a small helper keeps the bias correction explicit.

In [None]:
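def cramersV(phi, n):
    """Calculates the bias-corrected Cramer's V for the binary (2x2) crosstab case."""
    # Corrected phi^2: subtract the expected bias (r-1)(k-1)/(n-1), which is 1/(n-1) for 2x2.
    phi_bar_squared = max(0, phi**2 - 1/(n-1))
    # Corrected table dimension: k_bar = 2 - 1/(n-1), so k_bar - 1 = 1 - 1/(n-1).
    k_bar_minus_1 = 1 - 1/(n-1)

    v = np.sqrt(phi_bar_squared / k_bar_minus_1)
    return v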
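
As a sanity check on the helper above, the following sketch works through two hand-made 2x2 tables using only the standard library. The function names `chi2_2x2` and `corrected_v_2x2` and the counts are illustrative assumptions. It applies the same `phi**2 - 1/(n-1)` adjustment as the helper and also divides by the corrected table dimension, `1 - 1/(n-1)`, which is part of the full bias-corrected formula and is negligible at this dataset's size.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2, n

def corrected_v_2x2(a, b, c, d):
    """Bias-corrected Cramer's V for a 2x2 table (hypothetical helper name)."""
    chi2, n = chi2_2x2(a, b, c, d)
    phi2 = chi2 / n
    phi2_bar = max(0.0, phi2 - 1 / (n - 1))
    return math.sqrt(phi2_bar / (1 - 1 / (n - 1)))

# Two labels that always appear together or not at all: perfect association.
v_perfect = corrected_v_2x2(50, 0, 0, 50)
# Every cell equal: knowing one label tells you nothing about the other.
v_independent = corrected_v_2x2(25, 25, 25, 25)
print(v_perfect, v_independent)
```

The first table sits at the top of the scale and the second at the bottom, which is the behaviour we want from a strength-of-association measure.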

Now we'll just iterate over all the columns and generate our statistics.

In [None]:
def generate_statistics(df, columns):
    p_results = []
    for protein in columns:
        for protein2 in columns:
            crosstab = pd.crosstab(df[protein], df[protein2])
            n = crosstab.sum().sum()
            chi2, p, _, _ = chi2_contingency(crosstab, correction=False)
            phi = np.sqrt(chi2/n)
            v = cramersV(phi, n)

            p_results.append([protein, protein2, p, v])
        
    result_df = pd.DataFrame(p_results, columns=['Protein1', 'Protein2', 'P-Value', 'CramersV'])
    return result_df

p_df = generate_statistics(one_hot, valid_columns)
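
Because the chi-squared test and Cramer's V are symmetric in the two variables, the nested loop above computes each off-diagonal pair twice, plus every protein against itself. If runtime becomes a concern with more columns, one option is to visit each unordered pair once. The sketch below only illustrates the iteration pattern with a hypothetical three-name list; it is not the code used for the heatmaps, which rely on the full square of results.

```python
from itertools import combinations

columns = ["Nucleoplasm", "Cytosol", "Mitochondria"]  # illustrative subset

# Full square, as in the nested loop: n * n pairs, including self-pairs.
all_pairs = [(a, b) for a in columns for b in columns]

# Each unordered pair exactly once: n * (n - 1) / 2 pairs.
unique_pairs = list(combinations(columns, 2))

print(len(all_pairs), len(unique_pairs))
```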

First we'll just check the p-values returned by `chi2_contingency`. The lower the p-value, the stronger the evidence against independence, that is, the more confident we can be that some association exists. By convention, a p-value below 0.05 is considered statistically significant.

In [None]:
fig, ax = plt.subplots(figsize=(15,10))

ax.set_title('Chi-Squared P-Values')
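# Keyword arguments are required by recent pandas versions (positional pivot args were removed).
sns.heatmap(p_df.pivot(index='Protein1', columns='Protein2', values='P-Value'), annot=True, ax=ax)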
plt.show()

Well, it's not very pretty, but it does tell us that virtually every pair is associated in a statistically significant way. Interestingly, the _rare_ case is a lack of association, and the only two protein pairs that aren't significantly associated are:
* Nucleoplasm vs. Centrosome
* Nucleoli vs. Nuclear membrane

I wonder why that is?
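
Reading exceptions off an annotated heatmap is error-prone, so it can help to list them directly. The sketch below assumes a frame shaped like `p_df`; the rows and p-values in it are made up for illustration and are not results from this dataset.

```python
import pandas as pd

# Hypothetical stand-in for p_df with invented p-values.
toy_p = pd.DataFrame(
    [
        ["A", "A", 0.0, 1.0],
        ["A", "B", 0.30, 0.01],
        ["B", "A", 0.30, 0.01],
        ["B", "B", 0.0, 1.0],
        ["A", "C", 0.001, 0.12],
        ["C", "A", 0.001, 0.12],
    ],
    columns=["Protein1", "Protein2", "P-Value", "CramersV"],
)

alpha = 0.05
# Keep each unordered pair once (Protein1 sorts before Protein2), which also drops self-pairs.
upper = toy_p[toy_p["Protein1"] < toy_p["Protein2"]]
not_significant = upper[upper["P-Value"] >= alpha]
print(not_significant[["Protein1", "Protein2", "P-Value"]])
```

Swapping `toy_p` for `p_df` would give the same kind of listing for the real results.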

Anyway, everything else is associated. This might be useful, but probably only if the associations are relatively strong. Let's check Cramer's V.

In [None]:
fig, ax = plt.subplots(figsize=(15,10))

ax.set_title("Cramer's V Scores")
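# Keyword arguments are required by recent pandas versions (positional pivot args were removed).
sns.heatmap(p_df.pivot(index='Protein1', columns='Protein2', values='CramersV'), annot=True, ax=ax)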
plt.show()

None of the associations look very strong. The highest one seems to be nucleoplasm vs. nuclear speckles. Just for fun, we'll check it out in more detail.

In [None]:
protein1 = 'Nuclear speckles'
protein2 = 'Nucleoplasm'

crosstab = pd.crosstab(one_hot[protein1], one_hot[protein2])
_, p, _, expected = chi2_contingency(crosstab)

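# Label the expected counts like the observed crosstab so the two tables line up.
expected = pd.DataFrame(expected, index=crosstab.index, columns=crosstab.columns)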
    
print("Observations")
display(crosstab)

print("Expected if No Association Exists")
display(expected)
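
The "expected" table is what the counts would look like if the two labels were independent: each cell is its row total times its column total, divided by the grand total. A minimal sketch with made-up counts (not the real crosstab) shows the calculation, which is the same margin-based formula behind the expected frequencies that `chi2_contingency` returns.

```python
import numpy as np

# Made-up 2x2 counts: rows = label A absent/present, columns = label B absent/present.
observed = np.array([[60, 20],
                     [10, 10]])

n = observed.sum()
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)

# Expected count under independence: row total * column total / grand total.
expected = np.outer(row_totals, col_totals) / n

# Ratio > 1 means the cell occurs more often than independence would predict.
ratio = observed / expected
print(expected)
print(ratio)
```

In this toy table the both-present cell has a ratio above 1, which is the pattern to look for when deciding whether two labels co-occur more than chance would suggest.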

So it looks like the association is that they co-occur more often than you would expect under independence. The presence of nuclear speckles seems to depend more on nucleoplasm than the other way around. Let's go back and lower the threshold to investigate some of the proteins that appear less often. Let's try 400.

In [None]:
valid_columns2 = label_count[label_count > 400].index.values
p_df2 = generate_statistics(one_hot, valid_columns2)

In [None]:
fig, ax = plt.subplots(figsize=(15,10))

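# Keyword arguments are required by recent pandas versions (positional pivot args were removed).
sns.heatmap(p_df2.pivot(index='Protein1', columns='Protein2', values='CramersV'), annot=True, ax=ax)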
plt.show()

Welp, it doesn't look like we got any association stronger than the one between nuclear speckles and nucleoplasm.

It is sort of strange that there is a significant association between most proteins (even if just a weak one), but a few pairs show no significant association at all. I'd be interested to hear if anyone has a hypothesis for why that might be.