# November 12th, 2021

**Motivation**: Discussing ways of comparing Ca and BOLD results.  What is the most conceptually sound way of doing this? <br>

---
---
---

## Definitions

In this section I will define mathematical symbols that I will use throughout this document.

### Number of ...

- Number of nodes (or ROIs):  $N = 996$
    - For now this is only the surface subset of ROIs that appear in the optical space after the 3D --> 2D transformation.
    - I use exactly the same surface ROIs for both Ca and BOLD (call this BOLD-Lite). 
- Number of communities:  $K = 6$
- Number of animals:  $N_{sub} = 10$
- Number of sessions:  $N_{ses} = 3$
- Number of rest runs:  $N_{run} = 4$
- Number of random seeds:  $N_{seed} = 1000$

### Closed simplex in $\mathbb{R}^K$
Let $c$ be a positive number.  The $n$-dimensional closed simplex in $\mathbb{R}^n$ is defined by

\begin{equation}
\mathbb{T}_n(c) = \Big\{ (x_1, \dots, x_n)^T: \; x_i > 0, \: 1 \leq i \leq n, \: \sum_{i=1}^n x_i = c \Big\}.
\end{equation}

Therefore, membership distributions $\pi_i$ are members of $\mathbb{T}_K(1)$.
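As a quick sanity check, here is a minimal sketch of how one could test whether a vector lies in $\mathbb{T}_n(c)$ as defined above.  The function name `in_simplex`, the tolerance `atol`, and the example vector are my own illustrative choices, not part of the analysis pipeline.

```python
import numpy as np

def in_simplex(x, c=1.0, atol=1e-8):
    """Return True if all entries of x are strictly positive and sum to c."""
    x = np.asarray(x, dtype=float)
    return bool(np.all(x > 0) and np.isclose(x.sum(), c, atol=atol))

# A hypothetical membership vector for K = 6 communities
pi_i = np.array([0.5, 0.2, 0.1, 0.1, 0.05, 0.05])
is_member = in_simplex(pi_i)
```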

### Memberships
In this section I define some mathematical notations for membership values, the main quantity in our analysis.

#### For a run
We apply the algorithm $N_{seed} = 1000$ times to each run to get membership values.  The result is an array with shape $N_{seed} \times N \times K$.  After aligning and averaging results across seeds, we get the membership matrix for this run, $\pi$, which has shape $N \times K$ and whose rows satisfy $\pi_i \in \mathbb{T}_K(1)$.  Each column of $\pi$ corresponds to an overlapping community and contains the membership values of all nodes in that community.

#### All runs
We can take membership matrices for all runs and arrange them in a large tensor that contains membership information for all runs from every session/animal:

\begin{equation}
\Pi_{i_{sub}, i_{ses}, i_{run}} \quad := \quad \text{membership matrix for run $i_{run}$ from session $i_{ses}$ from animal $i_{sub}$},
\end{equation}

where $i_{sub} \in \{1, 2, \dots, N_{sub}\}$ and so on. We see that $\Pi$ is a tensor with shape $N_{sub} \times N_{ses} \times N_{run} \times N \times K$.  To get the membership matrix for each animal, we simply align and average across the run dimension and then average across the session dimension.  Let us use Greek letters to index animals.  We use $\pi^{(\alpha)}$ to denote the averaged membership matrix for animal $\alpha$:

\begin{equation}
\pi^{(\alpha)} = \frac{1}{N_{ses}}\frac{1}{N_{run}}\sum_{i_{ses} = 1}^{N_{ses}}\sum_{i_{run} = 1}^{N_{run}}\Pi_{\alpha, i_{ses}, i_{run}},
\end{equation}
where $\alpha \in \{1, 2, \dots, N_{sub}\}$.  To get group results, we average again along the animal dimension:

\begin{equation}
\pi^{(group)} = \frac{1}{N_{sub}}\sum_{\alpha = 1}^{N_{sub}}\pi^{(\alpha)}.
\end{equation}
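The two averages above can be sketched in NumPy as follows.  This is only an illustration: it assumes the memberships are already aligned, uses random Dirichlet draws as stand-in data, and uses a small number of nodes so that it runs quickly.  All variable names are hypothetical.

```python
import numpy as np

# Toy sizes; the document uses N_sub = 10, N_ses = 3, N_run = 4, N = 996, K = 6.
n_sub, n_ses, n_run, n_nodes, k = 10, 3, 4, 50, 6

rng = np.random.default_rng(0)
# Stand-in for already-aligned memberships: each row is a Dirichlet draw,
# so it is positive and sums to 1.
Pi = rng.dirichlet(np.ones(k), size=(n_sub, n_ses, n_run, n_nodes))

# Average over sessions and runs to get one matrix per animal
pi_animal = Pi.mean(axis=(1, 2))   # shape (n_sub, n_nodes, k)
# Average over animals to get the group matrix
pi_group = pi_animal.mean(axis=0)  # shape (n_nodes, k)
```

Because the simplex is convex, every row of `pi_animal` and `pi_group` still sums to 1 after averaging.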


#### Data modalities
We have one $\Pi$ tensor per data modality: $\Pi^{(\text{Ca}^{2+})}$ and $\Pi^{(\text{BOLD-Lite})}$. From now on we refer to BOLD-Lite simply as BOLD, keeping in mind that it covers only the surface subset of the full set of ROIs.  In this notation, we use $\pi^{(\alpha, \,\text{Ca}^{2+})}$ to denote the membership matrix for animal $\alpha$ obtained using Ca$^{2+}$ data and so on.  Similarly, $\pi^{(group, \,\text{Ca}^{2+})}$ and $\pi^{(group, \,\text{BOLD})}$ represent group results obtained from Ca$^{2+}$ and BOLD data, respectively.

---
---

## Aligning results
We use k-means on the membership vectors.  I will add more details later.

---
---

## Comparing results
Next we seek to compare the results across different conditions.  This is useful information to learn in general, but also necessary to explore to a reasonable extent now.  Why do we report both Ca$^{2+}$ and BOLD results?  What do we gain by including both data modalities in our analysis?  I think Eve is very interested in addressing this, and I'm also partially interested. It is useful to explore this question a little bit now because a) reviewers might ask about this, and b) what we learn now will become useful later when building dynamic models that operate at the time series level.

### What kinds of comparisons?
Here I define some conditions that I will frequently refer back to.

1. **Condition # 1 (across sessions):** In this condition, we use data from one modality (Ca$^{2+}$ or BOLD) and from the same set of animals.  Session label is the only variable.
2. **Condition # 2 (across animals):** Data from one modality and same session(s) is used, but for different sets of animals.  This can be a comparison of a single animal to the group average obtained from other animals.  Or it could be comparison of each individual to all other individuals in the group (as is typically done in CPM type of work).
3. **Condition # 3 (across modalities):** Here we use the same animal(s) and session(s) to compare Ca$^{2+}$ vs BOLD results.

Each of these conditions serves a unique purpose.  I will go over what we can gain from such comparisons and motivate each of these, but before that let's discuss which ingredients are necessary for these comparisons to make sense.

### How to compare
To do this properly, we need the following:

1. **Define distance measure:** we should use interpretable metrics that extract relevant information from the results.
2. **Choose unit of variance**: how to properly quantify variability around these numbers and test for significance.

In the following sections, I will discuss different kinds of comparisons that we can carry out. For each item, I will define and motivate the particular distance metric I have in mind.  I also explicitly discuss how to quantify variability.

---

## Comparison at the interaction graph level
First, we compare the interaction graphs. These graphs are the input to the SVINET algorithm, so it makes sense to quantify how similar or different they are in different conditions.  These differences will inevitably leak into the next levels of analysis and will be combined with other sources of variability. Therefore, it makes sense to start here.

### Consistency across sessions
I'm gonna argue that a longitudinal dataset recorded over several spread-out sessions is actually a fantastic resource.  We can use this to choose some of the hyperparameters in the study without relying on arbitrary metrics.  For instance, I have already established that the **amount of overlap** present in our results depends on the **sparsification threshold** used.  This is problematic, because ideally a biological fact should not depend so strongly on how we extract it from data.  This situation can potentially be remedied by using ***consistency across sessions*** as an objective function.

#### The argument
If thresholded covariance matrices are representative of the true interaction graphs in mice, then these interaction graphs must exhibit some level of consistency across sessions.  That is, if we take interaction graphs from session 1 and compare them to session 2, we should not see significant differences. Any significant and strong dependence on session label here would be concerning (assuming there was enough data per session to sample relevant corners of the phase space).

Constructing the interaction graph depends nontrivially on the choices we make.  These choices include the initial parcellation scheme used (columnar or not?, number of columns, number of layers), data preprocessing steps (GSR or no GSR? bandpass frequency, etc.), edge filtering approach, and sparsification threshold.  Rather than relying on heuristics, we should strive to find hyperparameters that maximize the representativeness of interaction graphs across sessions.  This is one reason why Condition # 1 is useful to consider.

#### Examples in the literature
A recent [paper](https://direct.mit.edu/netn/article/5/1/96/97529/Combining-network-topology-and-information-theory) uses Network Portrait Divergence (I discussed this in my very first presentation) to find parcellation and edge filtering schemes that are maximally representative.  Their goal is respectable; however, their objective is not very well motivated:

> ...We reasoned that the lower a parcellation’s average divergence from the others, the more this is
a representative parcellation. Thus, our measure of representativeness for a given parcellation
was defined as one minus its average divergence from all other parcellations.

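In other words, they consider a parcellation (or edge filtering) approach to be better (more representative) if it has smaller divergence from all other approaches.  This argument is problematic: agreement with the other candidate approaches says nothing about whether an approach captures the underlying biology, and the outcome depends on which alternatives happen to be included in the comparison.  We can do better by defining ***consistency across sessions*** as our objective.  How to do this?  Below I define an appropriate distance metric and statistical approach.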

#### 1) Distance measure
We use Portrait Divergence between interaction graphs inferred from each run.  I do not like using correlation on the covariance matrix since the distribution of edge weights strongly depends on preprocessing steps.  For instance, GSR mathematically mandates that approximately 50% of correlations between regions will be negative.  Therefore, looking at distances between idealized (binary) interaction graphs is a better choice, even though we lose some information about correlation strengths.  Another reason to do this at the graph level is simply that our algorithm accepts binary edges.
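To make the "idealized (binary) interaction graph" concrete, here is a minimal sketch of one possible sparsification rule: keep a fixed fraction of the strongest edges.  This is an assumption for illustration only.  The notes do not specify how the threshold is defined, the function name `binarize_top_edges` and the `density` parameter are hypothetical, and Portrait Divergence itself is not implemented here.

```python
import numpy as np

def binarize_top_edges(corr, density=0.15):
    """Keep the strongest `density` fraction of edges as a binary, symmetric graph.

    Assumption: "strongest" means largest correlation value, so negative
    correlations are dropped first.
    """
    corr = np.asarray(corr, dtype=float)
    n = corr.shape[0]
    iu = np.triu_indices(n, k=1)          # upper triangle, no diagonal
    weights = corr[iu]
    n_keep = int(round(density * weights.size))
    adj = np.zeros((n, n), dtype=int)
    if n_keep > 0:
        keep = np.argsort(weights)[-n_keep:]
        adj[iu[0][keep], iu[1][keep]] = 1
        adj = adj + adj.T                 # symmetrize
    return adj

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 20))        # toy time series: 200 frames, 20 ROIs
corr = np.corrcoef(x, rowvar=False)
adj = binarize_top_edges(corr, density=0.1)
```

With 20 nodes there are 190 possible edges, so a density of 0.1 keeps 19 of them.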

#### 2) Variability
There are $N_{ses} = 3$ sessions and $N_{run} \times N_{sub} = 4 \times 10 = 40$ runs per session for each data modality. Therefore, we have a total of $3 \times 40 = 120$ runs overall.  Let's say our goal is to find the sparsity threshold that maximizes consistency across sessions for BOLD.  This means we would like to minimize topological distances between runs across sessions.  We compute the pairwise Portrait Divergence between each pair of runs and arrange these numbers in a distance matrix with size $120 \times 120$.  When runs are sorted by session, this matrix has a $3 \times 3$ block structure with blocks of size $40 \times 40$: the diagonal blocks contain *within-session* distances and the off-diagonal blocks contain *across-session* distances.  We can compute the average distance for each label.  Here the null hypothesis would be that these numbers (*within-session* vs. *across-session* distances) come from the same distribution.  A permutation test can be used to test statistical significance. It is equivalent to shuffling the rows and columns of this distance matrix (with the same permutation) and performing the averaging for each group while pretending the session labels remain unchanged.
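Here is a minimal sketch of the permutation test described above.  Shuffling the session labels is equivalent to shuffling rows and columns of the distance matrix while keeping the labels fixed.  Several things here are assumptions: the test statistic (across-session mean minus within-session mean), the one-sided p-value, and the toy data, which uses Euclidean distances between random points as a stand-in for Portrait Divergence.  Function names are hypothetical.

```python
import numpy as np

def mean_within_across(dist, labels):
    """Mean distance over same-label pairs and over different-label pairs."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)   # drop self-distances
    within = dist[same & off_diag].mean()
    across = dist[~same].mean()
    return within, across

def permutation_test(dist, labels, n_perm=1000, seed=0):
    """One-sided test of whether across-label distances exceed within-label ones."""
    rng = np.random.default_rng(seed)
    within, across = mean_within_across(dist, labels)
    observed = across - within
    null = np.empty(n_perm)
    for i in range(n_perm):
        w, a = mean_within_across(dist, rng.permutation(labels))
        null[i] = a - w
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value

# Toy example: 3 sessions with 8 runs each and a deliberate session effect
rng = np.random.default_rng(1)
labels = np.repeat([0, 1, 2], 8)
points = rng.standard_normal((24, 2)) + 5.0 * labels[:, None]
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
observed, p_value = permutation_test(dist, labels, n_perm=500)
```

In the toy data the session effect is built in on purpose, so the test should reject the null.  For the consistency objective we would instead look for hyperparameters where `observed` is close to zero.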

### How similar/different are these animals from each other?
I'm not insisting on doing this. I don't care about this particular measure as much.  I know that Eve definitely cares and maybe Todd does too, so I would like to include this as well.  One benefit of doing this analysis is that the numbers we get can put other numbers into perspective.  For instance we may find that the average similarity between these animals is 0.7 but the similarity between Ca$^{2+}$ vs BOLD is 0.5 or whatever.  That's it.  We are just looking at some numbers and trying to understand different sources of variability.

#### Examples in the literature
This [paper](https://www.nature.com/articles/s42003-020-01472-5) has similar goals.  They compare adjacency matrices to characterize individual variation.  They do a bunch of CPM style analysis too.

### Ca$^{2+}$ vs BOLD
We compare the interaction graphs constructed from different data modalities.  This is important since it's the starting point for all the other downstream analyses.  How can we compare other things without knowing how the starting graphs compare to each other? Moreover, if necessary, we can include the results from here when choosing the appropriate frequency band for processing Ca$^{2+}$ data.  This depends on our end goal, so we should discuss this during my presentation next week to see what Eve thinks.

Here is what I have in mind specifically:

#### 1) Distance measure
Again, we use Portrait Divergence to quantify topological similarity between interaction graphs $G^{(\text{Ca}^{2+})}$ and $G^{(\text{BOLD})}$.
#### 2) Variability
There are a total of $N_{sub} \times N_{ses} \times N_{run} = 10 \times 3 \times 4 = 120$ runs per modality.  We forget about all other labels here and retain only the Ca$^{2+}$ vs BOLD label.  Thus, we assume there are $120$ samples drawn from each group.  We then apply Portrait Divergence between every possible pair of graphs. This yields a distance matrix with size $240 \times 240$, because we have a total of $2 \times 120$ runs.  This distance matrix can be used to compute the average *within-group* distances (for each modality), as well as the average *across-group* distance (Ca$^{2+}$ vs. BOLD).  Similar to above, a permutation test here would correspond to randomly shuffling the rows and columns of this distance matrix (with the same permutation) while pretending that the labels are the same.
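The three averages mentioned above (within Ca$^{2+}$, within BOLD, and across) can be read off the distance matrix with a few boolean masks.  The sketch below is only an illustration: `block_means` is a hypothetical helper, and the toy distances are Euclidean distances between random points rather than Portrait Divergence values.

```python
import numpy as np

def block_means(dist, labels):
    """Mean distance for every pair of groups, excluding self-distances."""
    labels = np.asarray(labels)
    groups = list(dict.fromkeys(labels.tolist()))   # unique labels, order kept
    off_diag = ~np.eye(len(labels), dtype=bool)
    means = {}
    for i, a in enumerate(groups):
        for b in groups[i:]:
            mask = (labels[:, None] == a) & (labels[None, :] == b) & off_diag
            means[(a, b)] = dist[mask].mean()
    return means

# Toy example: 6 "runs" per modality with a deliberate modality offset
rng = np.random.default_rng(2)
labels = np.array(["Ca"] * 6 + ["BOLD"] * 6)
offset = np.where(labels == "Ca", 0.0, 4.0)
points = rng.standard_normal((12, 2)) + offset[:, None]
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
means = block_means(dist, labels)
```

Here `means[("Ca", "Ca")]` and `means[("BOLD", "BOLD")]` are the within-group averages and `means[("Ca", "BOLD")]` is the across-group average.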

**Question:** I'm not sure if this is the best approach.  This depends on what our goal is.  One possible goal would be to show that interaction graphs have different topologies for Ca$^{2+}$ vs. BOLD.  If that is the case, the within-group mean distance will be significantly smaller than the across-group mean distance. Conversely, let's say for a particular analysis we decided to choose a frequency band or number of ROIs that maximizes topological similarity between Ca$^{2+}$ and BOLD.  Here the null hypothesis would be: "the distribution of within-group distances is not significantly different from the distribution of across-group distances".  If that is the case, the histograms obtained by shuffling modality labels will overlap.  Depending on the goal, we can define proper distributions, statistics, and hypotheses to test.

---

## Similarity of membership vectors
So far, we have used consistency across sessions to choose hyperparameters and have a sense of similarities and differences at the interaction graph level.  We have done this across sessions (condition # 1) and across modalities (condition # 3), and maybe across animals (condition # 2).

Next, we want to find out if the resulting membership vectors are similar across modalities. One goal here is to compare $\pi^{(\alpha, \,\text{Ca}^{2+})}$ with $\pi^{(\alpha, \,\text{BOLD})}$ for $\alpha \in \{1, 2, \dots, 10, group\}$.  There are multiple types of comparisons we can make, but for now let us assume we have an awesome metric, $d_{awesome}$, that takes the resulting membership vectors and outputs a number in $[0, 1]$.  This represents some sort of distance between the inputs and is informative about some aspect of the results.  For instance, it could be the cosine distance between communities, the difference in overlap size, and so on.
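As one concrete candidate for $d_{awesome}$, here is a minimal sketch of the cosine-distance option.  It assumes the two membership matrices are already aligned (column $k$ refers to the same community in both) and that no column is all zeros.  Averaging over communities is my own choice for reducing $K$ numbers to one, and the function name is hypothetical.

```python
import numpy as np

def community_cosine_distance(pi_a, pi_b):
    """Mean cosine distance between matched community columns.

    For nonnegative memberships each cosine similarity lies in [0, 1],
    so the returned distance also lies in [0, 1].
    """
    pi_a = np.asarray(pi_a, dtype=float)
    pi_b = np.asarray(pi_b, dtype=float)
    num = np.sum(pi_a * pi_b, axis=0)
    den = np.linalg.norm(pi_a, axis=0) * np.linalg.norm(pi_b, axis=0)
    return float(np.mean(1.0 - num / den))

# Toy membership matrices with N = 50 nodes and K = 6 communities
rng = np.random.default_rng(3)
pi_ca = rng.dirichlet(np.ones(6), size=50)
pi_bold = rng.dirichlet(np.ones(6), size=50)
d_same = community_cosine_distance(pi_ca, pi_ca)
d_diff = community_cosine_distance(pi_ca, pi_bold)
```

A matrix compared with itself gives a distance of zero (up to floating-point error), and two unrelated matrices give something strictly between 0 and 1.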

#### 1) Distance measure
(copy from methods section here)
#### 2) Variability
Once again, permutation testing can be used here.  There are two possibilities.  When are 


---
---
---