# Analysis of the spatial structure of connectivity in the MICrONS dataset

The MICrONS initiative provided a dense reconstruction of around a cubic millimeter of mouse brain tissue.

At OBI, we have converted that data into the SONATA format that is often used to represent biophysically-detailed computational models of neuronal circuitry. We believe that this is a useful resource for the community for the following reasons:
 1. It allows direct comparison of models to the data, as both are in the same format. In the future it may even be possible to simulate the MICrONS circuitry as one simulates the computational models.
 2. There are many useful code libraries for analyzing SONATA-formatted circuits.
 3. It is a reduced representation of the data. While this discards a lot of information, what remains is still very useful for many purposes, and the reduced data can be handled more easily and analyzed faster.
 4. During the conversion to SONATA we added derived data, specifically high-quality morphology skeletons with extracted spines.


Here, we want to expand on point (3) above. 

We use the data to test our hypothesis about the mechanisms that allow individual neuron morphologies to lead to structured, non-random synaptic connectivity.

### Summary of the analysis

It has been demonstrated before that modeled networks based on axo-dendritic touch generate a highly non-random structure that matches biological characteristics better than simpler, connection probability-based models. Examples of such characteristics are degree distributions and motif overexpression patterns.

We formulate a hypothesis of what mechanism leads to this and test it using the MICrONS data. The hypothesis is as follows:

  1. A required (although not sufficient) condition for the formation of a connection is proximity of the axon to the dendrite.
  2. This condition can in principle be formulated as a distance and direction-dependent probability function on the offset between somata of a neuron pair. 
  3. The shape of this function is determined by the overall average shape of the dendrites and axons of the classes of neurons considered.
  4. However, once a connection at a given distance and direction has been confirmed for a given pre-/post-synaptic neuron, this function must be updated for all its future potential connections. This is because presence of the connection demonstrates that the axon / dendrites are more likely to be oriented towards the point where the connection has been formed.
  5. On a theoretical level, this introduces a _statistical dependence_ between connections: Presence or absence of one connection influences the probability that another connection is present. This is something that connection probability-dependent models, even complex ones, cannot capture, as they are based on statistically independent evaluations of connection probabilities.

We consider points 1-3 to be widely accepted. Points 4 and 5 describe a plausible scenario, but we have to demonstrate that the proposed mechanism actually affects connectivity on a measurable level. To do so, we test a prediction derived from the hypothesis.

**Prediction**: If a neuron A innervates / is innervated by a neuron B, then the probability that it also innervates / is innervated by the _nearest spatial neighbor_ of B is increased.
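
The prediction can be written as a pair of inequalities between conditional and unconditional connection probabilities. The notation is introduced here for illustration only: $A \to B$ denotes a connection from $A$ to $B$, and $\mathrm{nn}(B)$ denotes the nearest spatial neighbor of $B$.

$$
\begin{aligned}
P\big(A \to \mathrm{nn}(B) \,\big|\, A \to B\big) &> P\big(A \to \mathrm{nn}(B)\big) \\
P\big(\mathrm{nn}(B) \to A \,\big|\, B \to A\big) &> P\big(\mathrm{nn}(B) \to A\big)
\end{aligned}
$$

The first line covers the case where $A$ innervates $B$, the second the case where $A$ is innervated by $B$.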

## Importing code libraries and loading the data

We import a number of standard packages, as well as _conntility_ and _connalysis_. These two packages provide (as we will see) useful functionality for the analysis of this type of data.

In [None]:
import conntility
import pandas
import numpy

from matplotlib import pyplot as plt

numpy.seterr(all="ignore")


fn = "../../../../shared_data/MICrONS_SONATA/microns_con_mat_multi.h5"

# We load the data that has been serialized into a single hdf5 file into an object.
M = conntility.ConnectivityMatrix.from_h5(fn)


### Side note: data representation

The data, that is, the neurons and their connections, are represented in the object M. 
The representation has a list of _vertices_, i.e. neurons, and _edges_, i.e. synaptic connections. 

We can list the vertices and their properties.
Important properties for this analysis are:
  - layer, a string indicating the cortical layer of each neuron
  - synapse_class, this is either "EXC" or "INH" indicating that a neuron is excitatory or inhibitory
  - x, y, z, the spatial locations of the neurons in um

In [None]:
display(M.vertices)

We can also list the edges and their properties.
For the purpose of this analysis, we do not consider the properties at all - we are only interested in the presence or absence of a connection. Still, other analyses can use the properties.

For example, "spine_id" lists an identifier of the spine that a synapse innervates, or -1 for shaft synapses. Note that we have only identified spines for a subset of postsynaptic neurons, and for the rest all afferent synapses list a value of -1. We are working on extending the number of neurons with identified spines.

In [None]:
display(M.edges.head())

## Select the excitatory sub-graph

It is widely accepted that the connectivity of inhibitory neurons follows rules that are quite different from those of excitatory connectivity. 
Hence, for simplicity, we limit our analysis here to the excitatory subgraph.


In [None]:
# Create a subcircuit using the .index functionality. The following creates the subcircuit of neurons where 
# the values of "synapse_class" is equal to "EXC".
M = M.index("synapse_class").eq("EXC")
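
To make explicit what selecting a sub-graph means, here is a minimal sketch on toy data using only pandas and scipy. It is not how conntility implements `.index(...).eq(...)`; the names `toy_vertices`, `toy_matrix`, `exc_vertices` and `exc_matrix` are hypothetical and used for illustration only. The idea is that both the vertex table and the rows and columns of the adjacency matrix are restricted to the selected neurons.

```python
import numpy
import pandas
from scipy import sparse

# Toy stand-ins for a vertex table and an adjacency matrix (not MICrONS data)
toy_vertices = pandas.DataFrame({
    "synapse_class": ["EXC", "INH", "EXC", "EXC", "INH"],
    "layer": ["L23", "L23", "L4", "L5", "L5"],
})
toy_matrix = sparse.csr_matrix(numpy.array([
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
], dtype=bool))

# Integer positions of the excitatory neurons
exc_idx = numpy.flatnonzero((toy_vertices["synapse_class"] == "EXC").to_numpy())

# Keep only the vertices, and the matrix rows and columns, of excitatory neurons
exc_vertices = toy_vertices.iloc[exc_idx].reset_index(drop=True)
exc_matrix = toy_matrix[exc_idx][:, exc_idx]

print(exc_vertices)
print(exc_matrix.toarray().astype(int))
```

Connections to or from the removed neurons are dropped, so the resulting matrix only describes excitatory-to-excitatory connectivity.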

## Finding the nearest neighbors
To find the nearest neighbors of each neuron, we use a specialized data structure called a KDTree, implemented in scipy. It takes as input the spatial coordinates (i.e., the "x", "y" and "z" properties) of all neurons.

Specifically, we build a numpy.array with one entry per neuron, where the entry at index i yields the index of the nearest neighbor of the ith neuron.

In [None]:
from scipy.spatial import KDTree

xyz = ["x", "y", "z"]
# Build KDTree
tree = KDTree(M.vertices[xyz])

# Query KDTree for nearest neighbor. [2] indicates that we ask for the 2nd nearest neighbor. This is because the neuron itself
# is not excluded from this query and will be the 1st nearest neighbor with a distance of 0
nn_dists, nn_idx = tree.query(M.vertices[xyz], [2])
nn_idx = nn_idx.flatten()
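
The following sketch illustrates the `k=[2]` query on four hypothetical points (the `toy_` names are not part of the analysis). It assumes that no two points share exactly the same coordinates; otherwise the 1st nearest neighbor of a point is not guaranteed to be the point itself.

```python
import numpy
from scipy.spatial import KDTree

# Four toy points: two close pairs that are far away from each other
toy_xyz = numpy.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [5.0, 0.0, 0.0],
    [5.0, 0.5, 0.0],
])
toy_tree = KDTree(toy_xyz)

# k=[1] returns each query point itself, at distance 0
self_dists, self_idx = toy_tree.query(toy_xyz, k=[1])

# k=[2] returns the nearest *other* point
toy_nn_dists, toy_nn_idx = toy_tree.query(toy_xyz, k=[2])
toy_nn_idx = toy_nn_idx.flatten()

print(self_idx.flatten(), toy_nn_idx, toy_nn_dists.flatten())
```

Note that the nearest-neighbor relation is not symmetric in general: B being the nearest neighbor of A does not imply that A is the nearest neighbor of B.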

### Small detour: plot nearest neighbor distances

Just as a quick sanity check and out of curiosity, we plot a histogram of the distances to the nearest neighbor of each neuron.

In [None]:
H = numpy.histogram(nn_dists, bins=15)
plt.bar(H[1][:-1], H[0], width=numpy.mean(numpy.diff(H[1])))
plt.gca().set_xlabel("Nearest neighbor distance (um)")
plt.gca().set_ylabel("Count")

# Main analysis

Now, we perform the main analysis. 
As we have ~50,000 neurons in the dataset, we have ~2,500,000,000 pairs of potentially connected neurons. While it is still feasible to iterate over that number of pairs, it is also highly inefficient. 

Instead, we will implement a more efficient form of the analysis.

We begin by assembling a DataFrame with one row per **connected** pair. Its columns are as follows:
  - "row": Index of the pre-synaptic neuron of the connection
  - "col": Index of the post-synaptic neuron of the connection
  - "nn_row": Index of the nearest neighbor of the pre-synaptic neuron
  - "nn_col": Index of the nearest neighbor of the post-synaptic neuron
  - "dx", "dy", "dz": Offset along the spatial axes of the pair locations.
  - "1d_dist": Distance (1-dimensional, i.e. Euclidean) between the neurons of the pair

In [None]:
# Columns "row" and "col" are already part of the data structure we loaded initially.
edge_df = M._edge_indices.copy().reset_index(drop=True)

# We can then look up "nn_row", "nn_col" from the array we assembled earlier
edge_df["nn_col"] = nn_idx[M._edge_indices["col"]]
edge_df["nn_row"] = nn_idx[M._edge_indices["row"]]

# M provides a function that yields for each connection the "x", "y" or "z" coordinate of the pre- and post-syn. neurons
# We calculate their distances as "dx", "dy", "dz"
for col_name in xyz:
    per_edge_coords = M.edge_associated_vertex_properties(col_name)
    edge_df["d" + col_name] = per_edge_coords["col"] - per_edge_coords["row"]

# 1-dimensional distance is easily calculated
xyz_delta = ["dx", "dy", "dz"]
edge_df["1d_dist"] = numpy.linalg.norm(edge_df[xyz_delta], axis=1)

display(edge_df.head())
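
For readers without access to the data, the same table can be assembled for a toy graph using only numpy, pandas and scipy. This is a sketch under stated assumptions: the three points, the adjacency matrix and the hard-coded nearest-neighbor lookup `toy_nn` are hypothetical, and the coordinates are looked up by plain array indexing rather than through conntility.

```python
import numpy
import pandas
from scipy import sparse

# Toy coordinates (um) of three neurons and a toy adjacency matrix
toy_xyz = numpy.array([
    [0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 2.0],
])
toy_adj = sparse.coo_matrix(numpy.array([
    [0, 1, 1],
    [0, 0, 0],
    [1, 0, 0],
], dtype=bool))

# Nearest neighbor of each neuron, hard-coded for these three points
toy_nn = numpy.array([2, 0, 0])

# One row per connected pair
toy_edges = pandas.DataFrame({"row": toy_adj.row, "col": toy_adj.col})
toy_edges["nn_row"] = toy_nn[toy_edges["row"].to_numpy()]
toy_edges["nn_col"] = toy_nn[toy_edges["col"].to_numpy()]

# Offset from the pre-synaptic to the post-synaptic neuron along each axis
delta = toy_xyz[toy_edges["col"].to_numpy()] - toy_xyz[toy_edges["row"].to_numpy()]
for i, name in enumerate(["dx", "dy", "dz"]):
    toy_edges[name] = delta[:, i]

# Euclidean distance of the pair
toy_edges["1d_dist"] = numpy.linalg.norm(delta, axis=1)

print(toy_edges)
```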


This allows us to easily look up connection probabilities from the adjacency matrix.

In [None]:
# Adjacency matrix. Efficiently represented as a sparse matrix with bool entries
actual_con = M.matrix.tocsc().astype(bool)

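# Each row of the DataFrame represents one connection from a neuron A ("row") to a neuron B ("col").
# Looking up the entries (A, nn(B)) therefore yields the probability that A also innervates the nearest
# neighbor of B. Looking up (nn(A), B) yields the probability that the nearest neighbor of A also innervates B.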
post_nn_con_mean = actual_con[edge_df["row"], edge_df["nn_col"]].mean()
pre_nn_con_mean = actual_con[edge_df["nn_row"], edge_df["col"]].mean()

print(f"""Overall connection probability: {actual_con.mean()},
      If post-synaptic nearest neighbor is connected: {post_nn_con_mean},
      if pre-synaptic nearest neighbor is connected: {pre_nn_con_mean}""")

We see that the connection probability is drastically increased, especially if the post-synaptic nearest neighbor is connected.

However, distance-dependence of connectivity can explain (part of) such an effect: The nearest neighbor being connected makes it likely that we are considering a pair of neurons at a low distance. That also increases the probability that the original neuron is connected.
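
As a reference point, consider a connectome in which every ordered pair is connected independently with the same probability. In that case, conditioning on a neighboring connection should not change the connection probability at all. The sketch below checks this on a toy random graph; the graph size, the probability and the cyclic `toy_nn` lookup (a stand-in for a nearest-neighbor lookup) are hypothetical choices made for illustration.

```python
import numpy
from scipy import sparse

rng = numpy.random.default_rng(0)
n, p = 2000, 0.05

# Every ordered pair is connected independently with probability p
er_con = sparse.csc_matrix(rng.random((n, n)) < p)

# Hypothetical "neighbor" lookup: the neighbor of neuron i is neuron i + 1 (cyclic)
toy_nn = (numpy.arange(n) + 1) % n

# One entry per connected pair, as in the DataFrame of connected pairs
rows, cols = er_con.nonzero()

# Probability that A -> nn(B) exists, given that A -> B exists
cond_prob = numpy.asarray(er_con[rows, toy_nn[cols]]).mean()
overall_prob = er_con.nnz / (n * n)

print(f"Overall: {overall_prob:.4f}, conditional: {cond_prob:.4f}")
```

Any increase of the conditional over the overall probability therefore has to come from structure in the connectome, such as distance dependence or a statistical dependence between connections.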

## Loading a distance-dependent control

To address this, we begin by creating a distance-dependent control connectome. Specifically, for each ordered pair of layers (L_i, L_j), we fit an exponential function that describes the connection probability from neurons in L_i to neurons in L_j as a function of distance. Then we generate a stochastic instance of that connectome.

The fitting would take around 3-5 minutes to run. To speed things up, we have prepared such a control connectome and we simply load it. 
**Importantly**, the control connectome was generated on the same nodes with the same locations as the original connectome. So we do not have to re-generate the lookup for nearest neighbors.
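
To give an idea of what such a control looks like, the sketch below generates one stochastic instance of a distance-dependent connectome on toy data. It is not the procedure used to build the prepared control: it skips the fitting step and the separation by layer pairs, and it uses a single exponential $p(d) = p_0 \, e^{-d / \lambda}$ with hypothetical parameters `p0` and `length_scale`.

```python
import numpy
from scipy import sparse
from scipy.spatial import distance_matrix

rng = numpy.random.default_rng(1)
n = 500

# Toy neuron locations, uniformly distributed in a 500 um cube
toy_xyz = rng.uniform(0.0, 500.0, size=(n, 3))

# Hypothetical parameters of p(d) = p0 * exp(-d / length_scale)
p0, length_scale = 0.2, 100.0

# Connection probability of every ordered pair, based on distance only
dist = distance_matrix(toy_xyz, toy_xyz)
prob = p0 * numpy.exp(-dist / length_scale)
numpy.fill_diagonal(prob, 0.0)  # no connections of a neuron to itself

# One stochastic instance: an independent draw for each ordered pair
toy_control = sparse.csc_matrix(rng.random((n, n)) < prob)

print(f"Toy control connection probability: {toy_control.nnz / (n * n):.4f}")
```

Because each pair is drawn independently, such a control captures distance dependence but, by construction, no statistical dependence between connections.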

In [None]:
fn_ctrl = "../../../../shared_data/MICrONS_SONATA/control_con_mat.h5"
C = conntility.ConnectivityMatrix.from_h5(fn_ctrl)

# Make sure the neuron locations really are identical
assert (M.vertices[xyz] == C.vertices[xyz]).all().all()
# Sparse adjacency matrix
control_con = C.matrix.tocsc().astype(bool)

post_nn_con_mean = control_con[edge_df["row"], edge_df["nn_col"]].mean()
pre_nn_con_mean = control_con[edge_df["nn_row"], edge_df["col"]].mean()

print(f"""Overall connection probability: {control_con.mean()},
      If post-synaptic nearest neighbor is connected: {post_nn_con_mean},
      if pre-synaptic nearest neighbor is connected: {pre_nn_con_mean}""")

We see that there is indeed an effect in the distance-dependent control. However, it is considerably weaker than in the actual data.

## Making the analysis distance-dependent
To control for the effect of distance dependence even further, we repeat the analysis, but separately in 50 um distance bins.

To that end, we first need to generate the bins and calculate how many pairs of neurons there are in each bin. Once again, this is valid for both the original data and the control.
The KDTree again offers functionality to calculate this.

In [None]:
# Set up bins
bin_sz = 50.0
# We create the last bin border at a really large distance to ensure that the last bin captures any remaining pair
bin_borders = numpy.hstack([numpy.arange(0, 1000 + bin_sz, bin_sz), 1E12])
bin_indices = numpy.arange(len(bin_borders) - 1)
# Calculate bin centers. For plotting purposes we consider the really large last bin to be "bin_sz" distance away from the previous.
bin_centers = 0.5 * (bin_borders[:-2] + bin_borders[1:-1])
bin_centers = numpy.hstack([bin_centers, bin_centers[-1] + bin_sz])

# For each connected pair, we calculate which distance bin it belongs to
edge_df["1d_dist_bin"] = numpy.digitize(edge_df["1d_dist"], bins=bin_borders) - 1

# The "count_neighbors" function yields the numbers of pairs at distances up to and including the queried distance. 
# That is, it is a cumulative count. We take the .diff to get the non-cumulative numbers in each bin.
n_pairs_per_bin = numpy.diff(tree.count_neighbors(tree, bin_borders))
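
The sketch below checks the cumulative-count logic on toy data against a brute-force count over the full distance matrix. The point cloud and the `toy_` names are hypothetical. Starting the bin borders at 0 matters: the cumulative count at distance 0 contains each point paired with itself, and taking the difference removes these self-pairs. The counts are over ordered pairs, which matches the directed connections we analyze.

```python
import numpy
from scipy.spatial import KDTree, distance_matrix

rng = numpy.random.default_rng(2)
toy_xyz = rng.uniform(0.0, 200.0, size=(300, 3))
toy_tree = KDTree(toy_xyz)
toy_borders = numpy.array([0.0, 50.0, 100.0, 150.0, 1e12])

# Cumulative number of ordered pairs with distance <= each border
cumulative = toy_tree.count_neighbors(toy_tree, toy_borders)
toy_pairs_per_bin = numpy.diff(cumulative)

# Brute-force count of ordered pairs (excluding self-pairs) in each bin (lo, hi]
dist = distance_matrix(toy_xyz, toy_xyz)
off_diag = dist[~numpy.eye(len(toy_xyz), dtype=bool)]
brute_force = numpy.array([
    ((off_diag > lo) & (off_diag <= hi)).sum()
    for lo, hi in zip(toy_borders[:-1], toy_borders[1:])
])

print(toy_pairs_per_bin, brute_force)
```

One subtlety: with default settings, `numpy.digitize` assigns a value that lies exactly on a border to the bin above it, while the cumulative count assigns it to the bin below. For continuous-valued distances this should not matter in practice.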


Now we can calculate the overall connection probability in each bin. 
We calculate it simply as the count of connected pairs in a bin, divided by the number of potential pairs in the bin.

The connection probabilities conditional on the nearest neighbor being connected we calculate as before. Only this time we perform the analysis separately for each distance bin using the "groupby" functionality.

In [None]:
prior_con_prob = edge_df.groupby("1d_dist_bin")["1d_dist_bin"].count().reindex(bin_indices) / n_pairs_per_bin

con_prob_post_is_connected = edge_df.groupby("1d_dist_bin").apply(lambda _df: actual_con[_df["row"], _df["nn_col"]].mean(),
                                                                  include_groups=False)

con_prob_pre_is_connected = edge_df.groupby("1d_dist_bin").apply(lambda _df: actual_con[_df["nn_row"], _df["col"]].mean(),
                                                                  include_groups=False)


plt.plot(bin_centers[prior_con_prob.index], prior_con_prob, color="black", ls="--", label="Overall probability")

plt.plot(bin_centers[con_prob_post_is_connected.index], con_prob_post_is_connected, color="red", label="If postsyn. NN is connected")
plt.plot(bin_centers[con_prob_pre_is_connected.index], con_prob_pre_is_connected, color="blue", label="If presyn. NN is connected")
plt.legend()
plt.gca().set_xlabel("Distance (um)")
plt.gca().set_ylabel("P")


Once again, we find a very strong effect, especially if the post-synaptic neighbor is connected.

We repeat the same analysis for the distance-dependent control.

In [None]:
ctrl_df = C._edge_indices.copy().reset_index(drop=True)

ctrl_df["nn_col"] = nn_idx[C._edge_indices["col"]]
ctrl_df["nn_row"] = nn_idx[C._edge_indices["row"]]

for col_name in xyz:
    per_edge_coords = C.edge_associated_vertex_properties(col_name)
    ctrl_df["d" + col_name] = per_edge_coords["col"] - per_edge_coords["row"]

ctrl_df["1d_dist"] = numpy.linalg.norm(ctrl_df[xyz_delta], axis=1)
ctrl_df["1d_dist_bin"] = numpy.digitize(ctrl_df["1d_dist"], bins=bin_borders) - 1



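# As for the original data, we reindex to all bins so that the division by n_pairs_per_bin also works if a bin is empty
ctrl_prior_con_prob = ctrl_df.groupby("1d_dist_bin")["1d_dist_bin"].count().reindex(bin_indices) / n_pairs_per_bin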

ctrl_con_prob_post_is_connected = ctrl_df.groupby("1d_dist_bin").apply(lambda _df: control_con[_df["row"], _df["nn_col"]].mean(),
                                                                  include_groups=False)

ctrl_con_prob_pre_is_connected = ctrl_df.groupby("1d_dist_bin").apply(lambda _df: control_con[_df["nn_row"], _df["col"]].mean(),
                                                                  include_groups=False)



plt.plot(bin_centers[ctrl_prior_con_prob.index], ctrl_prior_con_prob, color="black", ls="--", label="Overall probability")

plt.plot(bin_centers[ctrl_con_prob_post_is_connected.index], ctrl_con_prob_post_is_connected, color="red",
         label="If postsyn. NN is connected")
plt.plot(bin_centers[ctrl_con_prob_pre_is_connected.index], ctrl_con_prob_pre_is_connected, color="blue",
         label="If presyn. NN is connected")
plt.legend()
plt.gca().set_xlabel("Distance (um)")
plt.gca().set_ylabel("P")

We find a small to non-existent effect. This indicates that performing the analysis for distance bins separately controls for most of the distance-dependence in the data. 

### Exercise
If you made it this far, the analysis must be of interest to you.
As an exercise, we encourage you to implement a version of the analysis with 2-dimensional distance bins instead of merely 1-dimensional ones: one bin dimension indicating the horizontal ("x"-"z") distance of the pair, the other indicating the vertical ("y") offset of the pair. 

There is some interesting spatial structure to this effect that such an analysis captures.
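
As a possible starting point, the sketch below assigns pairs to 2-dimensional bins on a toy table of offsets. The values and the column names `horizontal_dist`, `horizontal_bin` and `vertical_bin` are hypothetical. Counting the number of potential pairs in each 2-dimensional bin, which is needed to turn counts into probabilities, is deliberately left open.

```python
import numpy
import pandas

# Toy offsets (um) between the neurons of three pairs
toy_df = pandas.DataFrame({
    "dx": [30.0, -80.0, 10.0],
    "dy": [-120.0, 40.0, 5.0],
    "dz": [30.0, 70.0, -10.0],
})
bin_sz = 50.0

# Horizontal distance in the x-z plane. It is unsigned, so bins start at 0
toy_df["horizontal_dist"] = numpy.hypot(toy_df["dx"], toy_df["dz"])
toy_df["horizontal_bin"] = numpy.floor(toy_df["horizontal_dist"] / bin_sz).astype(int)

# The vertical offset keeps its sign, so negative bin indices are possible
toy_df["vertical_bin"] = numpy.floor(toy_df["dy"] / bin_sz).astype(int)

# Number of pairs in each 2-dimensional bin
counts = toy_df.groupby(["horizontal_bin", "vertical_bin"]).size()
print(counts)
```

Keeping the sign of the vertical offset distinguishes connections that run upwards from those that run downwards along the "y" axis.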