# Get network communities

This notebook gets network communities for the compendia (PAO1 and PA14) using different thresholds.

The output of this notebook is one file per threshold. Each file has the following columns:
gene id | module id
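As a rough illustration of that layout, the sketch below writes a small membership table to a tab-separated buffer and reads it back. The gene ids, labels, and the `example_` names are hypothetical; naming the index `gene id` is also an assumption made here so that the first column gets a header.

```python
import io

import pandas as pd

# Hypothetical gene ids and module labels, for illustration only
example_membership_df = pd.DataFrame(
    data={"module id": [0, 0, 1]},
    index=pd.Index(["gene_a", "gene_b", "gene_c"], name="gene id"),
)

# Write to an in-memory buffer instead of a file on disk
example_buffer = io.StringIO()
example_membership_df.to_csv(example_buffer, sep="\t")

# Read the table back the same way the correlation matrices are loaded
example_buffer.seek(0)
example_loaded_df = pd.read_csv(example_buffer, sep="\t", index_col=0, header=0)
```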

In [1]:
%load_ext autoreload
%autoreload 2
import os
import pandas as pd
from sklearn.cluster import DBSCAN, AgglomerativeClustering, AffinityPropagation
from core_acc_modules import paths

## Set user parameters

For now we will vary the correlation threshold (`corr_threshold`) but keep the other parameters consistent.

We will run this notebook once for each threshold value.

In [2]:
# User params to set

# Clustering method
# Choices: {"dbscan", "hierarchal", "affinity"}
cluster_method = "affinity"

# DBSCAN params
density_threshold = 8

# Hierarchical clustering params
hier_threshold = 8
link_dist = "average"

# Affinity params
affinity_damping = 0.6

# Correlation matrix files
pao1_corr_filename = paths.PAO1_CORR_LOG_SPELL
pa14_corr_filename = paths.PA14_CORR_LOG_SPELL

In [3]:
# Load correlation data
pao1_corr = pd.read_csv(pao1_corr_filename, sep="\t", index_col=0, header=0)
pa14_corr = pd.read_csv(pa14_corr_filename, sep="\t", index_col=0, header=0)

## Module detection
To detect modules, we will use a clustering algorithm

### DBSCAN
[DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN): Density-Based Spatial Clustering of Applications with Noise views clusters as areas of high density separated by areas of low density. The central concept in DBSCAN is that of _core samples_, which are samples in areas of high density. A cluster is therefore a set of _core samples_ that are close to each other (according to some distance measure), plus a set of non-core samples that are close to a core sample (but are not themselves core samples).

A cluster is a set of core samples that can be built by recursively taking a core sample, finding all of its neighbors that are core samples, finding all of their neighbors that are core samples, and so on. A cluster also has a set of non-core samples, which are samples that are neighbors of a core sample in the cluster but are not themselves core samples. Intuitively, these samples are on the fringes of a cluster.

* We define a core sample as being a sample in the dataset such that there exist `min_samples` other samples within a distance of `eps`, which are defined as neighbors of the core sample.
* Here we use `eps=8` based on the observations in the [previous notebook](1_correlation_analysis.ipynb), and leave `min_samples` at the scikit-learn default. In the previous notebook we plotted the distribution of pairwise distances (pdist) per gene and selected 8 based on where the distribution curve drops off on the left side, which marks how similar gene pairs are.
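The core, non-core and noise roles described above can be seen on a small example. This is a minimal sketch on made-up data, not the compendium: the `toy_` names, the grid layout, and the `eps=0.5`, `min_samples=5` values are assumptions chosen only so the behaviour is easy to follow.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (hypothetical): two tight 3x3 grids of points plus one isolated point
grid = np.array([[0.1 * x, 0.1 * y] for x in range(3) for y in range(3)])
toy_data = np.vstack([grid, grid + 5.0, [[20.0, 20.0]]])

toy_clustering = DBSCAN(eps=0.5, min_samples=5).fit(toy_data)
toy_labels = toy_clustering.labels_

# Samples labelled -1 are noise: they are not within `eps` of any core sample
n_toy_clusters = len(set(toy_labels)) - (1 if -1 in toy_labels else 0)
n_toy_core = len(toy_clustering.core_sample_indices_)
n_toy_noise = int(np.sum(toy_labels == -1))
```

Every grid point has at least `min_samples` points within `eps`, so all of them are core samples, while the isolated point is labelled `-1`.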


In [4]:
# Clustering using DBSCAN
if cluster_method == "dbscan":
    pao1_clustering = DBSCAN(eps=density_threshold).fit(pao1_corr)
    pa14_clustering = DBSCAN(eps=density_threshold).fit(pa14_corr)

### Hierarchical clustering
[Hierarchical clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering): Initially, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters (i.e. linkage distance is minimized), continuing until there is just a single cluster.

* `n_clusters`: The number of clusters to find. This must be `None` when `distance_threshold` is set.
* `linkage`: Criterion used to determine the distance between sets of observations. `'average'` uses the average of the distances between each observation in the two sets.
* `distance_threshold`: The linkage distance threshold at or above which clusters will not be merged.
* Here we use `distance_threshold=8` based on the observations in the [previous notebook](1_correlation_analysis.ipynb). In the previous notebook we plotted the distribution of pairwise distances (pdist) per gene and selected 8 based on where the distribution curve drops off on the left side, which marks how similar gene pairs are.

* Note: This method tends to produce one very large cluster. To break this up we will iteratively apply hierarchical clustering to the largest cluster.
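One possible reading of that note is sketched below: take the largest cluster and cluster its members again with a tighter threshold. This is not the notebook's implementation. The helper `recluster_largest`, the toy data, and both threshold values are hypothetical, and a full version would repeat the step until the cluster sizes are acceptable.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def recluster_largest(data, labels, distance_threshold, linkage="average"):
    """Split the largest cluster by clustering its members again.

    Hypothetical helper for illustration; not part of the notebook.
    """
    data = np.asarray(data)
    labels = np.asarray(labels).copy()

    # Find the cluster with the most members
    cluster_ids, cluster_sizes = np.unique(labels, return_counts=True)
    largest_id = cluster_ids[np.argmax(cluster_sizes)]
    in_largest = labels == largest_id

    sub_clustering = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold, linkage=linkage
    ).fit(data[in_largest])

    # Offset the new labels so they do not collide with existing cluster ids
    labels[in_largest] = sub_clustering.labels_ + labels.max() + 1
    return labels


# Toy data (hypothetical): one loose group made of two tight pairs, plus a far pair
toy_data = np.array([[0.0], [1.0], [10.0], [11.0], [100.0], [101.0]])
toy_labels = (
    AgglomerativeClustering(n_clusters=None, distance_threshold=50, linkage="average")
    .fit(toy_data)
    .labels_
)

# Second pass with a tighter (assumed) threshold on the largest cluster only
toy_new_labels = recluster_largest(toy_data, toy_labels, distance_threshold=5)
```

Applying the helper repeatedly, each time with a smaller threshold, is one way to make the procedure iterative.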

In [5]:
# Clustering using hierarchical clustering
# Note: the option string is spelled "hierarchal" to match `cluster_method`
if cluster_method == "hierarchal":
    pao1_clustering = AgglomerativeClustering(
        n_clusters=None, distance_threshold=hier_threshold, linkage=link_dist
    ).fit(pao1_corr)
    pa14_clustering = AgglomerativeClustering(
        n_clusters=None, distance_threshold=hier_threshold, linkage=link_dist
    ).fit(pa14_corr)

### Affinity propagation

[Affinity propagation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation): creates clusters by sending messages between pairs of samples until convergence. The messages sent between points belong to one of two categories. The first is the responsibility $r(k,i)$, which is the accumulated evidence that sample $k$ should be the exemplar for sample $i$ compared to other candidate exemplars. The second is the availability $a(k,i)$, which is the accumulated evidence that sample $i$ should choose sample $k$ to be its exemplar. _Exemplars_ are the members of the input set that are representative of clusters, similar to _centroids_ in k-means. Unlike k-means, this method doesn't require a preset $k$ to be chosen.

* `damping`: The damping factor (between 0.5 and 1) is the extent to which the current value is maintained relative to incoming values (which are weighted `1 - damping`). This avoids numerical oscillations when updating these values. The default is 0.5. With the default, the model didn't converge on the PA14 data, so we increased this to 0.6 for PA14.
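The weighting described in the damping bullet can be written out directly. In this minimal sketch, `damped_update` is a hypothetical helper that mirrors the description and is not a scikit-learn function. The toy data and the `n_iter_ < max_iter` check are likewise assumptions, offered as one simple heuristic for spotting a run that stopped before the iteration cap.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def damped_update(current_value, incoming_value, damping):
    """Blend the current message value with the incoming one (hypothetical helper)."""
    return damping * current_value + (1 - damping) * incoming_value


# A message that would flip from 1.0 to -1.0 moves less with higher damping
update_default = damped_update(1.0, -1.0, damping=0.5)
update_damped = damped_update(1.0, -1.0, damping=0.6)

# Toy data (hypothetical): two well-separated groups of points
rng = np.random.RandomState(0)
toy_data = np.vstack(
    [rng.normal(0.0, 0.5, size=(15, 2)), rng.normal(10.0, 0.5, size=(15, 2))]
)
toy_model = AffinityPropagation(random_state=0, damping=0.6).fit(toy_data)

# Stopping before the iteration cap suggests the convergence criterion was met
toy_converged = toy_model.n_iter_ < toy_model.max_iter
```

scikit-learn also emits a `ConvergenceWarning` when the fit does not converge, so the check above is a supplement to reading the warnings rather than a replacement.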


In [6]:
# Clustering using affinity propagation
if cluster_method == "affinity":
    pao1_clustering = AffinityPropagation(random_state=0).fit(pao1_corr)
    pa14_clustering = AffinityPropagation(random_state=0, damping=affinity_damping).fit(
        pa14_corr
    )

## Membership assignments

In [7]:
# Get module membership for PAO1 genes
# Format output to have columns: gene id | module id
pao1_membership_df = pd.DataFrame(
    data={"module id": pao1_clustering.labels_}, index=pao1_corr.index
)

pao1_membership_df["module id"].value_counts()

562    47
304    35
178    34
72     34
457    29
62     27
159    26
547    26
481    26
9      25
203    24
503    24
380    23
230    23
54     23
415    22
443    22
317    21
332    21
188    21
99     21
337    20
202    20
183    19
535    19
522    19
22     19
379    18
158    18
421    18
       ..
311     5
500     4
520     4
119     4
475     4
187     4
48      4
208     4
524     4
235     4
189     4
315     4
217     4
194     4
518     4
13      4
348     4
106     3
282     3
499     3
521     3
66      3
38      3
146     3
483     3
16      3
358     3
470     3
122     2
548     2
Name: module id, Length: 564, dtype: int64

In [8]:
# Get module membership for PA14 genes
# Format output to have columns: gene id | module id
pa14_membership_df = pd.DataFrame(
    data={"module id": pa14_clustering.labels_}, index=pa14_corr.index
)

pa14_membership_df["module id"].value_counts()

265    39
589    36
524    36
439    35
353    33
159    32
580    29
301    28
410    28
128    27
506    27
508    26
255    26
14     25
387    24
463    24
571    24
570    23
94     23
21     23
434    23
339    21
354    21
393    21
264    21
62     21
91     20
355    20
490    20
93     20
       ..
296     5
35      4
459     4
477     4
445     4
24      4
127     4
579     4
521     4
518     4
430     4
555     4
78      4
29      4
502     4
7       4
572     4
252     4
383     4
488     4
201     4
480     4
258     4
163     3
510     3
563     3
136     3
560     3
466     3
531     3
Name: module id, Length: 593, dtype: int64
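The `value_counts()` output above lists the size of every module. When comparing methods it may help to reduce it to a few summary numbers. The sketch below does this on a made-up membership table; the `toy_` names and values are hypothetical and the choice of summary statistics is only a suggestion.

```python
import pandas as pd

# Hypothetical membership table, for illustration only
toy_sizes_df = pd.DataFrame(
    data={"module id": [0, 0, 0, 1, 1, 2]},
    index=["gene_a", "gene_b", "gene_c", "gene_d", "gene_e", "gene_f"],
)

# Number of genes per module, as in the cells above
toy_module_sizes = toy_sizes_df["module id"].value_counts()

# Condense the size distribution into a few comparable numbers
toy_summary = {
    "n_modules": int(toy_module_sizes.shape[0]),
    "min_size": int(toy_module_sizes.min()),
    "median_size": float(toy_module_sizes.median()),
    "max_size": int(toy_module_sizes.max()),
}
```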

**Final method:**
We will use <Method> because ...

    Thoughts on different methods

In [9]:
# Save membership dataframe
pao1_membership_filename = os.path.join(
    paths.LOCAL_DATA_DIR, f"pao1_modules_{cluster_method}.tsv"
)
pa14_membership_filename = os.path.join(
    paths.LOCAL_DATA_DIR, f"pa14_modules_{cluster_method}.tsv"
)
pao1_membership_df.to_csv(pao1_membership_filename, sep="\t")
pa14_membership_df.to_csv(pa14_membership_filename, sep="\t")