# November 13th, 2021

**Motivation**: Discussing ways of comparing Ca$^{2+}$ and BOLD results and more. <br>

In [1]:
from os.path import join as pjoin
from IPython.display import display, IFrame

width, height = 600, 640
figs_dir = '../../_extras/fig_sketches'

---
---
---

## Definitions

In this section I will define mathematical symbols that I will use throughout this document.

### Number of ...

- Number of nodes (or ROIs):  $N = 996$
    - For now this is only the surface subset of ROIs that appear in the optical space after the 3D --> 2D transformation.
    - I use exactly the same surface ROIs for both Ca and BOLD (call this BOLD-Lite). 
- Number of communities:  $K = 6$
- Number of animals:  $N_{sub} = 10$
- Number of sessions:  $N_{ses} = 3$
- Number of rest runs:  $N_{run} = 4$
- Number of random seeds:  $N_{seed} = 1000$

### Closed simplex in $\mathbb{R}^K$
Let $c$ be a positive number.  The $n-$dimensional closed simplex in $\mathbb{R}^n$ is defined by

$$
\mathbb{T}_n(c) = \Big\{ (x_1, \dots, x_n): \; x_i > 0, \: 1 \leq i \leq n, \: \sum_{i=1}^n x_i = c \Big\}.
$$

Therefore, membership distributions $\pi_i$ are members of $\mathbb{T}_K(1)$.

### Memberships
In this section I define some mathematical notations for membership values, the main quantity in our analysis.

#### For a run
We apply the algorithm $N_{seed} = 1000$ times to each run to get membership values.  This is a tensor that has shape $N_{seed} \times N \times K$.  After aligning and averaging results across seeds, we get the membership matrix for this run, $\pi$, that has shape $N \times K$ where each row $\pi_i \in \mathbb{T}_K(1)$.  Each column of $\pi$ corresponds to an overlapping community and contains membership values for all nodes in that community.

#### All runs
We can take membership matrices for all runs and arrange them in a large tensor that contains membership information for all runs from every session/animal:

$$
\Pi_{i_{sub}, i_{ses}, i_{run}} \quad := \quad \text{membership matrix for run $i_{run}$ from session $i_{ses}$ from animal $i_{sub}$},
$$

where $i_{sub} \in \{1, 2, \dots, N_{sub}\}$ and so on. We see that $\Pi$ is a tensor with shape $N_{sub} \times N_{ses} \times N_{run} \times N \times K$.  To get the membership matrix for each animal, we simply align and average across the run dimension and then average across the session dimension.  Let us use Greek letters to index animals.  We use $\pi^{(\alpha)}$ to denote the averaged membership matrix for animal $\alpha$:

$$
\pi^{(\alpha)} = \frac{1}{N_{ses}}\frac{1}{N_{run}}\sum_{i_{ses} = 1}^{N_{ses}}\sum_{i_{run} = 1}^{N_{run}}\Pi_{\alpha, i_{ses}, i_{run}},
$$
where $\alpha \in \{1, 2, \dots, N_{sub}\}$. To get group results, we average again along the animal dimension:

$$
\pi^{(group)} = \frac{1}{N_{sub}}\sum_{\alpha = 1}^{N_{sub}}\pi^{(\alpha)}.
$$
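
The two averaging steps above can be sketched in NumPy. This is only an illustration: it assumes the alignment step has already been applied, uses a reduced number of nodes, and fills a hypothetical `Pi` array with random Dirichlet draws instead of real memberships.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (the notes use N_sub=10, N_ses=3, N_run=4, N=996, K=6).
n_sub, n_ses, n_run, n_nodes, n_comm = 10, 3, 4, 50, 6

# Hypothetical stand-in for the aligned tensor Pi: each node's row is a
# Dirichlet draw, so it is a valid membership distribution (sums to 1).
Pi = rng.dirichlet(np.ones(n_comm), size=(n_sub, n_ses, n_run, n_nodes))

# pi^(alpha): average over the run axis, then over the session axis.
pi_animal = Pi.mean(axis=2).mean(axis=1)   # shape (n_sub, n_nodes, n_comm)

# pi^(group): average over the animal axis.
pi_group = pi_animal.mean(axis=0)          # shape (n_nodes, n_comm)
```

Because averaging is a convex combination, every row of `pi_animal` and `pi_group` still sums to one.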


#### Entropy
For animal $\alpha$, membership diversity of node $i$ is estimated using Shannon's entropy

$$
h^{(\alpha)}_i = -\sum_{k = 1}^{K}\pi^{(\alpha)}_{ik} \log\pi^{(\alpha)}_{ik}.
$$

Similarly, group entropy estimate of node $i$ is obtained by averaging over all animals

$$
h^{(group)}_i = \frac{1}{N_{sub}}\sum_{\alpha = 1}^{N_{sub}}h^{(\alpha)}_i.
$$

**Question:** An alternative way to calculate entropies would be to use $h^{(group)}_i = -\sum_{k = 1}^{K}\pi^{(group)}_{ik} \log\pi^{(group)}_{ik}$.  This does not sound like the most principled way to go since animals are units of variability in our analysis.  Therefore, I think the other way I described above makes the most sense. Right?
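
To make the two options concrete, here is a small sketch on toy data (random Dirichlet memberships, not real results). The helper `shannon_entropy` is a hypothetical name, and the convention $0 \log 0 = 0$ is an assumption of this sketch.

```python
import numpy as np

def shannon_entropy(pi, axis=-1):
    """Shannon entropy along `axis`, using the convention 0 * log(0) = 0."""
    pi = np.asarray(pi, dtype=float)
    safe = np.where(pi > 0, pi, 1.0)  # log(1) = 0, so zero entries add nothing
    return -(pi * np.log(safe)).sum(axis=axis)

rng = np.random.default_rng(0)
# Toy per-animal membership matrices, shape (n_sub, n_nodes, K).
pi_animal = rng.dirichlet(0.5 * np.ones(6), size=(10, 50))

# Option 1 (described above): entropy per animal, then average over animals.
h_animal = shannon_entropy(pi_animal)        # shape (n_sub, n_nodes)
h_group = h_animal.mean(axis=0)              # shape (n_nodes,)

# Option 2 (the alternative): entropy of the group-averaged memberships.
h_of_mean = shannon_entropy(pi_animal.mean(axis=0))
```

Since entropy is concave, Jensen's inequality gives `h_of_mean >= h_group` for every node, so the alternative can only report equal or larger entropy than the per-animal average.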


#### Data modalities
We have one $\Pi$ tensor per data modality: $\Pi^{(\text{Ca}^{2+})}$ and $\Pi^{(\text{BOLD-Lite})}$. From now on we refer to BOLD-Lite as simply BOLD but remembering that it's only the surface subset of the full ROIs.  In this notation, we use $\pi^{(\alpha, \,\text{Ca}^{2+})}$ to denote membership matrix for animal $\alpha$ obtained using Ca$^{2+}$ data and so on.  Similarly, $\pi^{(group, \,\text{Ca}^{2+})}$ and $\pi^{(group, \,\text{BOLD})}$ represent group results obtained from Ca$^{2+}$ and BOLD data respectively.

---
---

## Aligning results
We use k-means on the membership vectors.  I will add more details later.

---
---

## Comparing results
Next we seek to compare the results across different conditions.  This is useful information to learn in general, but also necessary to explore to a reasonable extent now.  Why do we report both Ca$^{2+}$ and BOLD results?  What do we gain by including both data modalities in our analysis?  I think Eve is very interested in addressing this, and I'm also partially interested. It is useful to explore this question a little bit now because a) reviewers might ask about this, and b) what we learn now will become useful later when building dynamic models that operate at the time series level.

### What kinds of comparisons?
Here I define some conditions that I will frequently refer back to.

1. **Condition # 1 (across sessions):** In this condition, we pool data from one modality (Ca$^{2+}$ or BOLD) and from the same set of animals.  Session label is the only difference between the two groups.
2. **Condition # 2 (across animals):** Data from one modality and same session(s) is used, but for different sets of animals.  This can be a comparison of a single animal to the group average obtained from other animals.  Or it could be a comparison of each individual to all other individuals in the group (as is more commonly done in CPM type of work).
3. **Condition # 3 (across modalities):** Here we use the same animal(s) and session(s) to compare Ca$^{2+}$ vs BOLD results.

Each of these conditions serves a unique purpose.  I will go over what we can gain from such comparisons and motivate each of these, but before that let's discuss what are the necessary ingredients for these comparisons to make sense.

### How to compare
To do this properly, we need the following:

1. **Define distance measure:** we should use interpretable metrics that extract relevant quantities from the results.
2. **Choose unit of variance**: how to properly quantify variability around these numbers and test for significance if necessary.

In the following sections, I will discuss different kinds of comparisons that we can carry out.  In short, we can compare

- at the interaction graph level,
- at overlapping or disjoint community membership level, and
- at the level of the amount of overlap present in overlapping results.

For each item, I will define and motivate the particular distance metric I have in mind.  I also explicitly discuss how to quantify variability.

---

## Comparison at the interaction graph level
First, we compare the interaction graphs. These graphs are the input to the SVINET algorithm, so it makes sense to quantify how similar or different they are in different conditions.  These differences will inevitably leak into the next levels of analysis and will be combined with other sources of variability. Therefore, it makes sense to start here and quantify some of the variability.

### Consistency across sessions: an objective function
I'm gonna argue that a longitudinal dataset recorded in the span of several spread out sessions is actually a fantastic resource.  We can use this to choose some of the hyperparameters in the study without relying on arbitrary metrics.  For instance, I have already established that the **amount of overlap** present in our results depends on the **sparsification threshold** used.  This is problematic, because ideally a biological fact should not depend so strongly on how we extract it from data.  This situation can potentially be remedied by using ***consistency across sessions*** as an objective function.

#### The argument
If thresholded covariance matrices are representative of the true interaction graphs in mice, then these interaction graphs must exhibit some level of consistency across sessions.  That is, if we take interaction graphs from session 1 and compare them to session 2, we should not see significant differences. Any significant and strong dependence on session label here would be concerning (assuming there was enough data per session to sample relevant corners of the phase space).

Constructing the interaction graph depends nontrivially on the choices we make.  These choices include the initial parcellation scheme used (columnar or not?, number of columns, number of layers), data preprocessing steps (GSR or no GSR? bandpass frequency, etc.), edge filtering approach (proportional thresholding, constant threshold, or other?), and the threshold value used.  Rather than relying on heuristics, we should strive to find hyperparameters that maximize the representativeness of interaction graphs across sessions.  This is one reason why Condition # 1 is useful to consider.

#### Examples in the literature
This very nice [paper](https://direct.mit.edu/netn/article/3/1/1/2187/Graph-theory-approaches-to-functional-network) exposes some of the issues.  A recent [paper](https://direct.mit.edu/netn/article/5/1/96/97529/Combining-network-topology-and-information-theory) uses Network Portrait Divergence (I discussed this in my very first presentation) to find parcellation and edge filtering schemes that are maximally representative.  Their goal is respectable; however, their objective is not very well motivated:

> ...We reasoned that the lower a parcellation’s average divergence from the others, the more this is
a representative parcellation. Thus, our measure of representativeness for a given parcellation
was defined as one minus its average divergence from all other parcellations.

In other words, they consider a parcellation (or edge filtering) approach to be better (more representative) if it has smaller divergence from all other approaches.  This argument is inherently problematic since they assume being closer to the consensus (quantified by having smaller average divergence) is the best. We can do better by defining ***consistency across sessions*** as our objective.  How to do this?  Below I define an appropriate distance metric and statistical approach.

#### 1) Distance measure
We use Portrait Divergence between interaction graphs inferred from each run.  I do not like using correlation on the covariance matrix since the distribution of edge weights strongly depends on preprocessing steps.  For instance, GSR mathematically mandates that approximately 50% of correlations between regions will be negative.  Therefore, looking at distances between idealized (binary) interaction graphs is a better choice.  We gain some sort of normalization (or abstraction) at the expense of losing information about correlation strengths.  Another reason to do this at the graph level is simply because SVINET accepts only binary edges.

#### 2) Variability
There are $N_{ses} = 3$ sessions and $N_{run} \times N_{sub} = 4 \times 10 = 40$ runs per session for each data modality. Therefore, we have a total of $3 \times 40 = 120$ runs overall.  Let's say our goal is to find the sparsity threshold that maximizes consistency across sessions for BOLD.  This means we would like to minimize topological distances between runs across sessions.  We compute the pairwise Portrait Divergence between each pair of runs and arrange these numbers in a distance matrix with size $120 \times 120$.  When runs are sorted by session, this matrix has a block structure: the three $40 \times 40$ blocks on the diagonal contain only *within-session* distances, and the off-diagonal blocks contain only *across-session* distances.  Please see figure below.

We can compute the average distances for each label.  Here the null hypothesis can be that these numbers (*within-session* vs. *across-session* distances) come from different distributions.  This is not the usual choice because most people want to find group vs control effects, but here we want to show that our hyperparameter choices are effective as they minimize the differences between sessions. In other words, we want to show that the distributions of within-session and across-session distances are not different in a statistically significant way given our hyperparameter and preprocessing choices. A permutation test can be used to test statistical significance. It is equivalent to shuffling rows and columns of this distance matrix and performing the averaging for each group while pretending the session labels remain unchanged.

Alternatively, we can define a null hypothesis that assumes the samples come from identical distributions.  Then we can report the p-value and show that we could not find significant evidence to reject the null hypothesis, therefore we assume it is true to the best of our knowledge.
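
A minimal sketch of the permutation test described above. Several things here are assumptions for illustration: the distance matrix is filled with random numbers rather than Portrait Divergence values, the test statistic (mean across-session minus mean within-session distance) is one possible choice, runs are treated as exchangeable, and the function names are hypothetical. Shuffling the session labels is equivalent to shuffling rows and columns of the matrix together while keeping the labels fixed.

```python
import numpy as np

def across_minus_within(dist, labels):
    """Mean across-group minus mean within-group distance (upper triangle only)."""
    iu = np.triu_indices(len(labels), k=1)
    same = (labels[:, None] == labels[None, :])[iu]
    d = dist[iu]
    return d[~same].mean() - d[same].mean()

def permutation_test(dist, labels, n_perm=1000, seed=0):
    """Two-sided permutation p-value obtained by shuffling the group labels."""
    rng = np.random.default_rng(seed)
    observed = across_minus_within(dist, labels)
    null = np.array([
        across_minus_within(dist, rng.permutation(labels))
        for _ in range(n_perm)
    ])
    p_value = (1 + np.sum(np.abs(null) >= abs(observed))) / (1 + n_perm)
    return observed, p_value

# Toy stand-in for the 120 x 120 distance matrix (3 sessions x 40 runs).
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(3), 40)
raw = rng.random((120, 120))
dist = (raw + raw.T) / 2          # symmetric, like a real distance matrix
np.fill_diagonal(dist, 0.0)

observed, p_value = permutation_test(dist, labels, n_perm=200)
```

With purely random distances there is no session effect, so the observed difference should be close to zero. The same functions apply to the Ca$^{2+}$ vs BOLD comparison by passing modality labels instead of session labels.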

>**Edit Nov 14th:** I'm currently reading about test-retest reliability, intra-class correlation, ANOVA, and related topics to see if I can come up with more principled statistical tests to achieve this.  But the end goal remains: choose hyperparameters and processing steps to maximize longitudinal consistency.

In [2]:
path = pjoin(figs_dir, 'Comparison-1.pdf')
display(IFrame(path, width=width, height=height))

### How similar/different are these animals from each other?
I'm not really insisting on doing this. I don't care about this particular measure as much.  I know that Eve definitely cares and maybe Todd does too, so I would like to include this as well.  One benefit of doing this analysis is that these numbers can put other numbers into perspective.  For instance we may find that the average distance between the animals is 0.3 but the distance between Ca$^{2+}$ vs BOLD is 0.5 or whatever.  That's it.  We are just looking at some numbers and trying to understand different sources of variability.

#### Examples in the literature
This [paper](https://www.nature.com/articles/s42003-020-01472-5) has similar goals.  They compare adjacency matrices to characterize individual variation.  They do a bunch of CPM style analysis too.

### Ca$^{2+}$ vs BOLD
We compare the interaction graphs constructed from different data modalities.  This is important since it's the starting point for all the other downstream analysis.  How can we compare other things without knowing how the starting graphs compare to each other? Moreover, if necessary, we can include the results from here when choosing the appropriate frequency band for processing Ca$^{2+}$ data. This depends on our end goal, so we should discuss this during my presentation next week to see what Eve thinks.

Here is what I have in mind specifically:

#### 1) Distance measure
Again, we use Portrait Divergence to quantify topological similarity between interaction graphs $G^{(\text{Ca}^{2+})}$ and $G^{(\text{BOLD})}$.
#### 2) Variability
There are a total of $N_{sub} \times N_{ses} \times N_{run} = 10 \times 3 \times 4 = 120$ runs per modality.  We forget about all other labels here and retain only the Ca$^{2+}$ vs BOLD label.  Thus, we assume there are $n = 120$ samples drawn from each group.  We then apply Portrait Divergence between every possible pair of graphs. This yields a distance matrix with size $240 \times 240$, because we have a total of $2 \times 120$ runs.  Please see figure below.  This distance matrix can be used to compute the average *within-group* distances (for each modality), as well as average *across-group* distance (Ca$^{2+}$ vs. BOLD).  Similar to above, a permutation test here would correspond to randomly shuffling rows and columns of this distance matrix but pretending that the labels are the same.

**Question:** I'm not sure if this is the best approach.  This depends on what our goal is.  One possible goal would be to show that interaction graphs have different topologies for Ca$^{2+}$ vs. BOLD.  If that is the case, within-group mean distance will be significantly smaller than across-group mean distance. On the contrary, let's say for a particular analysis we decided to choose a frequency band or number of ROIs that maximizes topological similarity between Ca$^{2+}$ and BOLD.  Here the null hypothesis would be: "distribution of within-group distances is significantly different than the distribution of across-group distances".  If that is the case, the histograms obtained by shuffling modality labels will overlap.  Depending on the goal, we can define proper distributions, statistics, and hypothesis to test.

In [3]:
path = pjoin(figs_dir, 'Comparison-2.pdf')
display(IFrame(path, width=width, height=height))

---

## Differences/similarities in main results

So far, we have used consistency across sessions to choose hyperparameters and have a sense of similarities and differences at the interaction graph level.  We have done this across sessions (condition # 1) and across modalities (condition # 3), and maybe across animals (condition # 2).  In this section we discuss how our main results (membership vectors or some of their derivatives) compare when we use different modalities.

### Similarity of membership vectors
We want to find out if the resulting membership vectors are similar across modalities. One goal here is to compare $\pi^{(\alpha, \,\text{Ca}^{2+})}$ with $\pi^{(\alpha, \,\text{BOLD})}$ for $\alpha \in \{1, 2, \dots, 10, group\}$.  There are multiple types of comparisons we can make, but for now let us assume we have an awesome metric, $d_{awesome}$ that takes the resulting membership vectors and outputs a number in $[0, 1]$.  This represents some sort of distance between the inputs and is informative about some aspect of the results.  For instance, it could be cosine distance between communities or the difference between overlap size and so on.

#### 1) Distance measure
Let's skip details of this for now.  I will discuss possible options below.
#### 2) Variability
There are several options.  First of all, for a comparison between $\pi^{(group, \,\text{Ca}^{2+})}$ and $\pi^{(group, \,\text{BOLD})}$ we can simply use the bootstrapped distribution to get a sense of the variability around the number coming from the real sample.  If we wanted to define a concrete hypothesis and test it, then a permutation test would be a better option.

**Question:** Is it OK sometimes to just look at some quantities without having to define a hypothesis to test for significance?  Simply report a number and quantify variability around it without having a hypothesis in mind.

### Distance metrics
In general, community comparison can be divided into the following categories:

1. Overlapping comparison
2. Disjoint comparison

Furthermore, for each category, I define two complementary approaches to compare results across modalities: a) by comparing communities, and b) by comparing nodes.

#### 1) Overlapping comparison
**Between communities.** For overlapping communities, we compute the cosine similarity of membership vectors across communities (after aligning them).  Specifically, let's use $\pi^{(A)}_{[1:N], k}$ to denote the membership vector over all nodes for community $k$ estimated using data from condition/group $A$.  For instance, we could have $A = (group, \,\text{Ca}^{2+})$ and so on. Here $\pi^{(A)}_{[1:N], k} = (\pi^{(A)}_{1k}, \dots, \pi^{(A)}_{ik}, \dots, \pi^{(A)}_{Nk})^T$ is simply $k-$th column of $\pi^{(A)}$.  We use cosine distance to quantify network similarities at the overlapping community level

$$
\begin{aligned}
d_{\cos}(A, B) &= \frac{1}{K} \sum_{k=1}^K d_{\cos}(A, B; k), \\
d_{\cos}(A, B; k) &= \frac{1}{2}\big[1 - \cos(A, B; k)\big]
= \frac{1}{2}\Bigg[1 - \frac{\sum_{i = 1}^N \pi^{(A)}_{ik} \pi^{(B)}_{ik}}{\lVert\pi^{(A)}_{[1:N], k}\rVert \, \lVert\pi^{(B)}_{[1:N], k}\rVert}\Bigg].
\end{aligned}
$$

If the two groups have identical membership vectors we will have $d_{\cos}(A, B) = 0$ which corresponds to maximum similarity.
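
The cosine distance above can be sketched as follows, assuming the communities (columns) of the two membership matrices have already been aligned. The function name and the random toy inputs are hypothetical.

```python
import numpy as np

def cosine_distance(pi_a, pi_b):
    """Return d_cos(A, B) and the per-community values d_cos(A, B; k).

    Both inputs are (N, K) membership matrices with aligned columns.
    """
    num = (pi_a * pi_b).sum(axis=0)
    den = np.linalg.norm(pi_a, axis=0) * np.linalg.norm(pi_b, axis=0)
    per_comm = 0.5 * (1.0 - num / den)
    return per_comm.mean(), per_comm

# Toy membership matrices (random Dirichlet draws, not real results).
rng = np.random.default_rng(0)
pi_a = rng.dirichlet(np.ones(6), size=100)
pi_b = rng.dirichlet(np.ones(6), size=100)

d_ab, d_ab_per_comm = cosine_distance(pi_a, pi_b)
d_aa, _ = cosine_distance(pi_a, pi_a)   # identical inputs give distance 0
```

Because membership values are non-negative, the cosine lies in $[0, 1]$, so with this definition $d_{\cos}$ ranges over $[0, 1/2]$ rather than the full $[0, 1]$.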

**Between nodes.** We can also quantify similarity between node membership vectors across conditions.  Let $\pi^{(A)}_{i}$ denote the $K$-dimensional membership distribution for node $i$ estimated using data from group $A$.  This is a probability distribution, therefore, we can use standard measures such as Jensen–Shannon divergence to compute the distance between $\pi^{(A)}_{i}$ and $\pi^{(B)}_{i}$.  Jensen–Shannon divergence is simply a symmetrized and smoothed version of Kullback–Leibler divergence:

$$
d_{JS}(A, B) = \frac{1}{N} \sum_{i = 1}^N d_{JS}(A, B; i), \\
d_{JS}(A, B; i) = \frac{1}{2} d_{KL}(\pi^{(A)}_{i} || \pi^{(AB)}_{i}) + \frac{1}{2} d_{KL}(\pi^{(B)}_{i} || \pi^{(AB)}_{i}).
$$
Here we have defined $\pi^{(AB)}_{i} := (\pi^{(A)}_{i} + \pi^{(B)}_{i}) / 2$ and KL divergence is given by

$$
d_{KL}(Q || P) = \mathbb{E}_Q\Big[\log \frac{Q}{P}\Big].
$$

For identical membership distributions $\pi^{(A)}_{i} = \pi^{(B)}_{i}$ we have $d_{JS}(A, B; i) = 0$, therefore, a smaller JS divergence indicates larger agreement between node membership distributions.
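
A minimal NumPy sketch of the node-level Jensen–Shannon divergence defined above. It assumes natural logarithms and the convention $0 \log 0 = 0$; the function names are hypothetical.

```python
import numpy as np

def kl_divergence(q, p):
    """Row-wise d_KL(Q || P) = E_Q[log(Q / P)], treating 0 * log(0) as 0."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0
    # Where q == 0 the ratio is set to 1, so the term contributes log(1) = 0.
    ratio = np.where(mask, q, 1.0) / np.where(mask, p, 1.0)
    return (q * np.log(ratio)).sum(axis=-1)

def js_divergence(pi_a, pi_b):
    """Return d_JS(A, B) and the per-node values d_JS(A, B; i)."""
    pi_a = np.asarray(pi_a, dtype=float)
    pi_b = np.asarray(pi_b, dtype=float)
    mix = 0.5 * (pi_a + pi_b)   # pi^(AB)
    per_node = 0.5 * kl_divergence(pi_a, mix) + 0.5 * kl_divergence(pi_b, mix)
    return per_node.mean(), per_node

# Toy membership matrices (random Dirichlet draws, not real results).
rng = np.random.default_rng(0)
pi_a = rng.dirichlet(np.ones(6), size=100)
pi_b = rng.dirichlet(np.ones(6), size=100)
d_js, d_js_per_node = js_divergence(pi_a, pi_b)
```

With natural logarithms the per-node divergence is bounded above by $\log 2$, which is reached when two nodes put all their membership in different communities.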

#### 2) Disjoint comparison
Our overlapping results can be used to infer disjoint communities for each node as a special case.  We assign a disjoint community index $k_i$ to node $i$ by choosing its maximum membership value.  In other words, we have

$$
k_{i} = \underset{k}{\mathrm{argmax}} \; \pi_{ik}. 
$$

The measures discussed here can also be used to compare our argmax disjoint results with the results of other disjoint algorithms such as hierarchical clustering.

**Between communities.** We use Dice coefficient to quantify similarity between disjoint communities as follows

$$
d_{dice}(A, B) = \frac{1}{K} \sum_{k = 1}^K d_{dice}(A, B; k), \\
d_{dice}(A, B; k) = 1 - \frac{ 2 |A_k \cap B_k|}{|A_k| + |B_k|}.
$$
Here $A_k$ denotes the set of all nodes assigned to community $k$ for group $A$ and similarly for $B$.
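
A small sketch of the argmax assignment followed by the Dice distance, assuming aligned communities. How to treat a community that is empty in both groups is not specified above; skipping it in the average is an assumption of this sketch, and the function name is hypothetical.

```python
import numpy as np

def dice_distance(pi_a, pi_b):
    """Return d_dice(A, B) and the per-community values d_dice(A, B; k)."""
    k_a = pi_a.argmax(axis=1)   # disjoint community index per node, group A
    k_b = pi_b.argmax(axis=1)   # same for group B
    n_comm = pi_a.shape[1]
    per_comm = np.full(n_comm, np.nan)
    for k in range(n_comm):
        in_a, in_b = (k_a == k), (k_b == k)
        size = in_a.sum() + in_b.sum()
        if size > 0:            # communities empty in both groups stay NaN
            per_comm[k] = 1.0 - 2.0 * np.sum(in_a & in_b) / size
    return np.nanmean(per_comm), per_comm

# Four nodes, two communities: argmax gives [0, 0, 1, 1] and [0, 1, 1, 1].
pi_a = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.4, 0.6]])
pi_b = np.array([[0.8, 0.2], [0.3, 0.7], [0.1, 0.9], [0.4, 0.6]])
d_dice, d_dice_per_comm = dice_distance(pi_a, pi_b)
```

In this toy case the per-community distances are $1 - 2/3$ and $1 - 4/5$, giving an average of $4/15$.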

**Between nodes.**  For comparison at node level we use Hamming distance to quantify disjoint community similarities.  The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.  Therefore, it is an appropriate measure to quantify distance between two discrete codes (community indices).  Let us define $V$ as the set of all vertices (nodes).  We have

$$
d_{hamm}(A, B) = \text{# nodes with different community indices} \; / 
\; \text{total # nodes}\\
= \frac{|\{i : k^{(A)}_i \neq k^{(B)}_i \,\,\, \text{for} \,\,\, i \in V\}|}{|V|}.
$$

If all nodes get the same disjoint community assignments in both groups $A$ and $B$, then we will have $d_{hamm}(A, B) = 0$.  This means maximum disjoint similarity.
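
The normalized Hamming distance is a one-liner once the argmax indices are available. This sketch again assumes aligned communities, and the function name is hypothetical.

```python
import numpy as np

def hamming_distance(pi_a, pi_b):
    """Fraction of nodes whose argmax community index differs between A and B."""
    k_a = np.argmax(pi_a, axis=1)
    k_b = np.argmax(pi_b, axis=1)
    return float(np.mean(k_a != k_b))

# Four nodes, two communities: argmax gives [0, 0, 1, 1] and [0, 1, 1, 1].
pi_a = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.4, 0.6]])
pi_b = np.array([[0.8, 0.2], [0.3, 0.7], [0.1, 0.9], [0.4, 0.6]])
d_hamm = hamming_distance(pi_a, pi_b)   # one of four nodes differs
```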

### Comparing the amount of overlap
So far there is strong evidence that $\pi^{(group, \, \text{Ca}^{2+})}$ is substantially more overlapping than $\pi^{(group, \, \text{BOLD})}$.  The best way to quantify this difference is across nodes, since there might be alignment issues between communities.  To do so, let us define a quantity, maximum membership value of a node

$$
x_{i} = \underset{k}{\mathrm{max}} \; \pi_{ik}.
$$

This quantity keeps track of the strongest membership value that a node expresses to some community, any community.  It is bounded from below by $1/K$ and from above by $1$

$$
\frac{1}{K} \leq x_i \leq 1.
$$
A large $x_i$ indicates that node $i$ most likely belongs to disjoint parts of the graph. Therefore, keeping track of this quantity across conditions can reveal whether a node prefers to belong to many communities or just one.  Let us label a node **disjoint** if $x_i \gtrsim 0.8$, and **overlapping** if $x_i \lesssim 0.8$.  We use this criterion to define

$$
\text{set of nodes with disjoint memberships} = S_{disjoint} = \big\{i \in \{1, \dots, N\} : x_{i} \geq 0.8 \big\}\\
\text{set of nodes with mixed memberships} = S_{overlap} = \big\{i \in \{1, \dots, N\} : x_{i} < 0.8 \big\}.
$$


Next, we use this criterion to compute $\theta^{(A)}$, the proportion of nodes that are overlapping in condition $A$.  That is

$$
\theta^{(A)} = \frac{\big| S^{(A)}_{overlap} \big|}{N}.
$$

Comparing the amount of overlap this way has several benefits.  For example, we avoid complications that come from having to match different communities across modalities, because $x_i$ is well defined for any set of communities.
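
The overlap proportion can be sketched as below, using the $0.8$ threshold from the definition above. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def overlap_proportion(pi, threshold=0.8):
    """Return theta (fraction of nodes with max membership < threshold) and x."""
    x = pi.max(axis=1)           # x_i: strongest membership of each node
    overlapping = x < threshold  # nodes in S_overlap
    return float(overlapping.mean()), x

# Four nodes, two communities; x = [0.9, 0.5, 0.8, 0.6].
pi = np.array([[0.9, 0.1], [0.5, 0.5], [0.8, 0.2], [0.6, 0.4]])
theta, x = overlap_proportion(pi)

# Reordering the communities (columns) leaves theta unchanged.
theta_reordered, _ = overlap_proportion(pi[:, ::-1])
```

The last line illustrates the point made above: $\theta$ does not depend on how communities are matched or ordered.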

---

## General questions to discuss

The comparisons discussed so far can be made at different levels.  For instance, we can consider each animal as a sample ($n=10$) and compute averages before comparing, or we can forget about session and animal labels and assume each run is a sample ($n=120$). I think both approaches are useful in different scenarios.

- **When comparing interaction graphs:** I propose using each run as a sample for this.  To get interaction graphs at animal level, we should either concatenate data before computing correlations or average over adjacency matrices. This introduces additional complications due to co-registration artifacts or lack of proper theory to interpret each edge value in a covariance matrix.

- **When comparing the amount of overlap:** I think here we should still compute quantities of interest for each run first. Then we can compute aggregate statistics to get quantities such as last bin density for each animal, and eventually for the group.  The reason to do this is because we want to minimize artifacts due to co-registration and misalignment.  Averaging before binning adds additional but spurious overlap to the results.

- **When comparing the proportion of overlapping nodes:** Similar arguments apply here.  When comparing $\theta^{(\text{Ca}^{2+})}$ and $\theta^{(\text{BOLD})}$, we should calculate this quantity for all $n=120$ runs and plot the histograms.  Then we can compute the means across all runs and perform appropriate statistical tests on these means.

- **When visualizing communities and entropies:** Here it is best to calculate $\pi^{(\alpha)}$ for each animal and calculate entropies from this for each animal separately. We can then compute the mean over all animals to get the group membership matrix and entropy values, which we can then visualize in the main paper.  If necessary, we can visualize each animal's membership and entropy maps somewhere in the supplementary.


Lastly, I wanted to mention that we can consider another type of shuffling to get a sense of community similarities.  For instance, we can shuffle the community dimension when comparing $\pi^{(A)}$ and $\pi^{(B)}$.  This is not exactly a permutation test, but it can still be used to see how results from each distance metric depend on shuffling the results.
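
One possible way to sketch this community shuffling, using the average cosine distance as the example metric. The toy data, the noise level, and the function names are all hypothetical; any of the community-level metrics could be passed in instead.

```python
import numpy as np

def mean_cosine_distance(pi_a, pi_b):
    """Average over communities of 0.5 * (1 - cosine similarity of columns)."""
    num = (pi_a * pi_b).sum(axis=0)
    den = np.linalg.norm(pi_a, axis=0) * np.linalg.norm(pi_b, axis=0)
    return float(np.mean(0.5 * (1.0 - num / den)))

def community_shuffle(pi_a, pi_b, distance, n_shuffle=500, seed=0):
    """Distances obtained after randomly permuting the columns of pi_b."""
    rng = np.random.default_rng(seed)
    n_comm = pi_b.shape[1]
    return np.array([
        distance(pi_a, pi_b[:, rng.permutation(n_comm)])
        for _ in range(n_shuffle)
    ])

# Toy data: pi_b is a noisy, renormalized copy of pi_a (aligned communities).
rng = np.random.default_rng(0)
pi_a = rng.dirichlet(0.3 * np.ones(6), size=200)
noisy = pi_a + 0.05 * rng.random(pi_a.shape)
pi_b = noisy / noisy.sum(axis=1, keepdims=True)

observed = mean_cosine_distance(pi_a, pi_b)
shuffled = community_shuffle(pi_a, pi_b, mean_cosine_distance)
```

Comparing `observed` to the spread of `shuffled` shows how much of the measured similarity depends on the community matching. Node-level quantities such as $x_i$ and $\theta$ are unaffected by this kind of shuffling.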

In [4]:
path = pjoin(figs_dir, 'Comparison-3.pdf')
display(IFrame(path, width=width, height=height))

---
---
---