# Joint Representation Learning

In many network problems, our network might be more than just its topology (its collection of nodes and edges). If we were investigating a social network, we might have access to extra information about each person -- their gender, for instance, or their age. If we were investigating a brain network, we might have information about the physical location of neurons, or the volume of a brain region. When we embed a network, it seems like we should be able to use these extra bits of information - called the "features" or "covariates" of a network - to somehow improve our analysis. The techniques and tools that we'll explore in this section use both the covariates and the topology of a network to create and learn from new representations of our network. Because these new representations jointly use both the topology of our network and its extra covariate information, these techniques and tools are called joint representation learning.

There are two primary reasons that we might want to explore using node covariates in addition to topological structure. First, they might improve our standard embedding algorithms, like Laplacian and Adjacency Spectral Embedding. For example, if the latent structure of the covariates of a network lines up with the latent structure of its topology, then we might be able to reduce noise when we embed, even if the communities in our network don't overlap perfectly with the communities in our covariates. Second, figuring out what the clusters of an embedding actually mean can sometimes be difficult and covariates create a natural structure in our network that we can explore. Covariate information in brain networks telling us where in the brain each node is, for instance, might let us better understand the types of characteristics that distinguish between different brain regions.

In this section, we'll explore different ways to learn from our data when we have access to these covariates of a network in addition to its topological structure. We'll explore *Covariate-Assisted Spectral Embedding* (CASE), a variation on Spectral Embedding. In CASE, instead of embedding just the adjacency matrix or one of the many versions of its Laplacian, we'll combine the Laplacian and our covariates into a new matrix and embed that.

A good way to illustrate how using covariates might help us is to use a model in which some of our community information is in the covariates and some is in our topology. Using the Stochastic Block Model, we’ll create a simulation using three communities: the first and second community will be indistinguishable in the topological structure of a network, and the second and third community will be indistinguishable in its covariates. By combining the topology and the covariates, we'll get a nice embedding that lets us find three distinct community clusters.

### Stochastic Block Model

Suppose we have a Stochastic Block Model that looks like this.

In [None]:
import warnings
warnings.filterwarnings("ignore")  # TODO: don't do this, fix scatterplot

import numpy as np
np.random.seed(42)

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from graspologic.simulations import sbm
from graspologic.plot import heatmap

# Start with some simple parameters
N = 1500  # Total number of nodes
n = N // 3  # Nodes per community
p, q = .3, .15
B = np.array([[.3, .3, .15],
              [.3, .3, .15],
              [.15, .15, .3]])  # Our block probability matrix

# Make and visualize our Stochastic Block Model
A, labels = sbm([n, n, n], B, return_labels = True)

# make the colorbar look nice
fig, ax = plt.subplots(figsize=(10,10))
cmap = matplotlib.colors.ListedColormap(["white", 'black'])
with sns.plotting_context("talk", font_scale=1):
    ax = sns.heatmap(A, cmap=cmap, ax=ax, square=True,
                     cbar_kws=dict(shrink=0.7), xticklabels=False,
                     yticklabels=False)
    ax.set_title("A Stochastic Block Model")
    cbar = ax.collections[0].colorbar
    cbar.set_ticks([0.25, .75])
    cbar.set_ticklabels(['No Edge', 'Edge'])
    cbar.ax.set_frame_on(True)

There are three communities (we promise), but the first two are impossible to distinguish between using only our adjacency matrix (which only stores the topological structure of a network). The third community is distinct: nodes belonging to it aren't likely to connect to nodes in the first two communities, and are very likely to connect to each other. If we wanted to embed this graph using our Laplacian or Adjacency Spectral Embedding methods, we'd find the first and second communities layered on top of each other (though we wouldn't be able to figure that out from our embedding). The Python code below embeds our network down to two dimensions with a Laplacian Spectral Embedding, and then plots the resulting latent positions, color-coding each node by its true community.

In [None]:
from graspologic.embed import LaplacianSpectralEmbed as LSE
from graspologic.utils import to_laplacian
import matplotlib.pyplot as plt
import seaborn as sns
from graspologic.plot import pairplot


def plot_latents(latent_positions, *, title, labels, ax=None):
    if ax is None:
        ax = plt.gca()
    # Pass x and y as keywords; recent seaborn versions reject positional data arguments
    plot = sns.scatterplot(x=latent_positions[:, 0], y=latent_positions[:, 1], hue=labels, 
                           linewidth=0, s=10, ax=ax, palette="Set1")
    plot.set_title(title, wrap=True);
    ax.axes.xaxis.set_visible(False)
    ax.axes.yaxis.set_visible(False)
    
    return plot

L = to_laplacian(A, form="R-DAD")
lse = LSE(form="R-DAD", n_components=2)
# LSE builds the Laplacian itself, so we hand it the adjacency matrix rather than L
L_latents = lse.fit_transform(A)
plot_latents(L_latents, title="Latent positions when we\n only embed the Laplacian", 
             labels=labels);

We'd like to use extra information to more clearly distinguish between the first and second community. We don't have this information in our network: it needs to come from somewhere else.

### Covariates

But we're in luck - we have a set of covariates for each node! These covariates contain the extra information we need that allows us to separate our first and second community. However, with only these extra covariate features, we can no longer distinguish between the last two communities - they contain the same information.

Below is a visualization of our covariates. Each node is associated with its own group of covariates. We'll organize this information into a matrix, where the $i^{th}$ row contains the covariates associated with node $i$. We'll draw the elements of each row from a Beta distribution, a family of probability distributions on the interval from 0 to 1. The first community is represented by the lighter-colored rows, and the last two are represented by the darker-colored rows.
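As a quick check on why the first community's rows should come out lighter, the sketch below compares the means of the two Beta distributions we use for our covariates. It relies only on `scipy.stats.beta.mean`; the variable names are ours, chosen for illustration.

```python
from scipy.stats import beta

# The mean of a Beta(a, b) distribution is a / (a + b)
mean_first_community = beta.mean(2, 5)    # parameters used for the first community
mean_other_communities = beta.mean(2, 2)  # parameters used for the last two communities

print(mean_first_community, mean_other_communities)
```

The first community's covariates are smaller on average (2/7 versus 1/2), which is why its rows sit at the lighter end of the colormap.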

In [None]:
import numpy as np
from scipy.stats import bernoulli, beta
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import normalize

def gen_covariates(N=1500):
    make_community = lambda a, b: beta.rvs(a, b, size=(N//3, 30))
    c1 = make_community(2, 5)
    c2 = make_community(2, 2)
    c3 = make_community(2, 2)

    covariates = np.vstack((c1, c2, c3))
    return covariates
    

# Generate a covariate matrix
X = gen_covariates(N=N)

# Plot and make the axis look nice
fig, ax = plt.subplots(figsize=(5, 8))
ax = sns.heatmap(X, ax=ax, cmap="rocket_r")
ax.set(title="Visualization of the covariates", xticks=[], 
       ylabel="Nodes (each row is a node)",
       xlabel="Covariates for each node (each column is a covariate)");

We can play almost the same game here as we did with the Laplacian. If we embed the information contained in this matrix of covariates into lower dimensions, we see the reverse of the earlier situation - the first community is separate, but the last two are overlaid on top of each other.

In [None]:
XXt = X@X.T
X_latents = lse.fit_transform(XXt)
plot_latents(X_latents, title="Latent positions when we\n only embed our covariates", 
             labels=labels);

We want full separation between all three communities, so we need some kind of representation of our network that allows us to use both the information in the topology and the information in the covariates. This is where CASE comes in.

## Covariate-Assisted Spectral Embedding

<i>Covariate-Assisted Spectral Embedding</i>, or CASE<sup>1</sup>, is a simple way of combining our network and our covariates into a single model. In the most straightforward version of CASE, we combine the network's regularized Laplacian matrix $L$ with $XX^T$, a function of our covariate matrix $X$, in which row $i$ contains the covariates associated with node $i$. Notice the word "regularized" - this means (from the Laplacian section earlier) that our Laplacian looks like $L = L_{\tau} = D_{\tau}^{-1/2} A D_{\tau}^{-1/2}$, where $D_{\tau} = D + \tau I$ is the degree matrix with a constant $\tau$ added to each diagonal entry.
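To make the formula concrete, here is a minimal NumPy sketch of the regularized Laplacian. The helper name `regularized_laplacian` is hypothetical, and defaulting $\tau$ to the average degree is an assumption made for illustration; in the rest of this section we rely on graspologic's `to_laplacian` instead.

```python
import numpy as np

def regularized_laplacian(A, tau=None):
    """Compute D_tau^{-1/2} A D_tau^{-1/2} for a symmetric adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1)
    if tau is None:
        tau = degrees.mean()  # assumption: use the average degree as the regularizer
    inv_sqrt = 1.0 / np.sqrt(degrees + tau)  # diagonal entries of D_tau^{-1/2}
    # Scaling rows and columns is the same as multiplying by the two diagonal matrices
    return inv_sqrt[:, None] * A * inv_sqrt[None, :]

# A tiny made-up network: node 0 is connected to nodes 1 and 2
A_small = np.array([[0, 1, 1],
                    [1, 0, 0],
                    [1, 0, 0]])
L_small = regularized_laplacian(A_small, tau=1.0)
print(L_small)
```

Each existing edge gets scaled down by the regularized degrees of its two endpoints, and entries with no edge stay at zero.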

```{note}
Suppose that $X$ only contains 0's and 1's. To interpret $XX^T$, notice from linear algebra that we're effectively taking the weighted sum - or, in math parlance, the dot product - of each row of $X$ with each other row, because the transpose operation turns rows into columns. Now, look at what happens below when we take the dot product of two vectors with only 0's and 1's in them:

\begin{align}
\begin{bmatrix}
1 \\
1 \\
1 \\
\end{bmatrix} \cdot 
\begin{bmatrix}
0 \\
1 \\
1 \\
\end{bmatrix} = 1\times 0 + 1\times 1 + 1\times 1 = 2
\end{align}

If there are two overlapping 1's in the same position of the left vector and the right vector, then there will be an additional 1 added to their weighted sum. So, in the case of the binary $XX^T$, when we matrix-multiply a row of $X$ by a column of $X^T$, the resulting value, $(XX^T)_{i, j}$, will be equal to the number of shared locations in which vectors $i$ and $j$ both have ones.
```

A particular value in $XX^T$, $(XX^T)_{i, j}$, can be interpreted as measuring the "agreement" or "similarity" between row $i$ and row $j$ of our covariate matrix. The higher the value, the more the two rows share 1's in the same column. The result is a matrix that looks fairly similar to our Laplacian!  
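We can check this interpretation on a tiny binary covariate matrix. The values in `X_binary` are made up for illustration; its first two rows are the two vectors from the note above.

```python
import numpy as np

# Three toy nodes with three binary covariates each (made-up values)
X_binary = np.array([[1, 1, 1],
                     [0, 1, 1],
                     [1, 0, 0]])

# Entry (i, j) counts the positions where rows i and j both contain a 1
similarity = X_binary @ X_binary.T
print(similarity)
```

The entry `similarity[0, 1]` is 2, matching the dot product worked out in the note, and each diagonal entry counts the number of ones in that node's own row.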

The following Python code defines helpers for generating our SBM and our covariates, and then computes our Laplacian and our covariate similarity matrix $XX^T$. We'll also normalize the columns of our covariate matrix to have unit length using scikit-learn - this is because we want the scale for our covariate matrix to be roughly the same as the scale for our adjacency matrix. Later, we'll use a tuning coefficient to help with this as well.

In [None]:
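import numpy as np
from scipy.stats import beta
from graspologic.utils import to_laplacian
from graspologic.simulations import sbm
from sklearn.preprocessing import normalize

def gen_sbm(p=.3, q=.15, N=1500):
    """
    Generate an adjacency matrix and its community labels.
    The first two communities share the same block probabilities.
    """
    n = N // 3
    B = np.array([[p, p, q],
                  [p, p, q],
                  [q, q, p]])
    # With return_labels=True, sbm returns a (matrix, labels) pair
    A, labels = sbm([n, n, n], B, return_labels=True)

    return A, labels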
    
def gen_covariates(N=1500):
    make_community = lambda a, b: beta.rvs(a, b, size=(N//3, 30))
    c1 = make_community(2, 5)
    c2 = make_community(2, 2)
    c3 = make_community(2, 2)

    covariates = np.vstack((c1, c2, c3))
    return covariates

# Generate a covariate matrix
X = gen_covariates(N=N)
X = normalize(X, axis=0)

L = to_laplacian(A, form="R-DAD")
XXt = X@X.T

Below, you can see what our two matrices look like. Each matrix contains information about our communities that the other doesn't have.

In [None]:
# plot
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10,5), constrained_layout=True)
L_ax = heatmap(L, title=r"Regularized Laplacian", ax=axs[0])
X_ax = heatmap(XXt, title="Covariate matrix times \nits transpose", ax=axs[1]);

CASE is simply a weighted sum of these two matrices, $L + \alpha XX^T$. The weight $\alpha$ multiplies $XX^T$ so that both matrices contribute a comparable amount of useful information to the embedding. Here, we'll just use the ratio of the largest eigenvalues (called the "leading eigenvalues") of our two matrices as the weight $\alpha$. Later on, we'll explore ways to pick a better $\alpha$.

In [None]:
# Find the eigenvalues of L and XX^T (in ascending order)
L_eigvals = np.linalg.eigvalsh(L)
XXt_eigvals = np.linalg.eigvalsh(XXt)

# Find our simple weight - the ratio of the leading eigenvalues of L and XX^T.
alpha = float(L_eigvals[-1] / XXt_eigvals[-1])  # np.float was removed from NumPy

# Using our simple weight, combine our two matrices
L_ = L + alpha * X@X.T

In [None]:
heatmap(L_, title="Our Combined Laplacian and covariates matrix");

As you can see, the combined matrix has some separation between all three groups. Because we used an imperfect weight, the Laplacian is clearly contributing more to the sum -- but it's good enough for now.

Now we can embed this network and see what the results look like. Our embedding works the same as it does in Laplacian Spectral Embedding from here: we decompose our combined matrix using Singular Value Decomposition, truncating the columns, and then we visualize the rows of the result. We'll embed all the way down to two dimensions, just to make visualization simpler.

In [None]:
from sklearn.utils.extmath import randomized_svd


def embed(A, *, dimension):
    # Keep only the top `dimension` left singular vectors as our latent positions
    latents, _, _ = randomized_svd(A, n_components=dimension)
    return latents

latents_ = embed(L_, dimension=2)

Below, you can see three figures: the first is our embedding when we only use our network, the second is our embedding when we only use our covariates, and the third is our embedding when we use both. We've managed to achieve separation between all three communities.

In [None]:
from graspologic.embed import LaplacianSpectralEmbed as LSE


# Plot
fig, axs = plt.subplots(1, 3, figsize=(15,5))
plot_latents(L_latents, title="Latent positions when we only use the Laplacian", 
             labels=labels, ax=axs[0])
plot_latents(X_latents, title="Latent positions when we only use our covariates", 
             labels=labels, ax=axs[1]);
plot_latents(latents_, title="Latent positions when we combine\n our network and its covariates", 
             labels=labels, ax=axs[2])

plt.tight_layout()

### Setting A Better Weight

Our simple choice of the ratio of leading eigenvalues for our weight $\alpha$ is straightforward, but we can probably do better. If our covariate matrix doesn't tell us much about our communities, then we'd want to give it a smaller weight so we use more of the information in our Laplacian when we embed. If our Laplacian is similarly uninformative, we'd like a larger weight to emphasize the covariates.

In general, we'd simply like to embed in a way that makes our clustering better - meaning, if we label our communities, we'd like to be able to correctly retrieve as many labels after the embedding as possible with a clustering algorithm, and for our clusters to be as distinct as possible.

One reasonable way of accomplishing this goal is to simply find a range of possible $\alpha$ values, embed our combined matrix for every value in this range, and then to simply check which values produce the best clustering.
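The procedure just described can be sketched as a plain sweep over candidate weights. This is only an illustration under stated assumptions: the helper `sweep_alphas`, the toy matrices, and the candidate range are all made up here, and we score each clustering with scikit-learn's `silhouette_score`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sweep_alphas(L, XXt, alphas, n_clusters, dimension):
    """Score each candidate weight by the silhouette of a k-means clustering."""
    scores = []
    for alpha in alphas:
        combined = L + alpha * XXt
        # Left singular vectors of the combined matrix, truncated to `dimension` columns
        U, _, _ = np.linalg.svd(combined)
        latents = U[:, :dimension]
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(latents)
        scores.append(silhouette_score(latents, km.labels_))
    best = alphas[int(np.argmax(scores))]
    return best, scores

# Toy inputs (made up for illustration): a two-block matrix and random covariates
rng = np.random.default_rng(0)
L_blocks = np.kron(np.eye(2), np.ones((10, 10))) / 10
X_small = rng.random((20, 3))
candidate_alphas = np.linspace(0.01, 1.0, 5)
best_toy_alpha, toy_scores = sweep_alphas(
    L_blocks, X_small @ X_small.T, candidate_alphas, n_clusters=2, dimension=2
)
```

A sweep like this costs one embedding and one clustering per candidate, which is why it pays to narrow down the range of weights first.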

#### Getting A Good Range

For somewhat complicated linear algebra reasons<sup>1</sup>, it's fairly straightforward to get a good range of possible $\alpha$ values: a good minimum and maximum are described by only two equations. In the equations below, $K$ is the number of communities present in our network, $R$ is the number of covariate values each node has, and $\lambda_i(L)$ is the $i^{th}$ largest eigenvalue of $L$ (so $\lambda_1(L)$ is our Laplacian's highest, or "leading", eigenvalue).

```{admonition} Equations for getting our $\alpha$ range
$\alpha_{min} = \frac{\lambda_K(L) - \lambda_{K+1}(L)}{\lambda_1(XX^T)}$

If the number of covariate dimensions is less than or equal to the number of clusters, then  
$\alpha_{max} = \frac{\lambda_1 (L)}{\lambda_R (XX^T)}$

Otherwise, if the number of covariate dimensions is greater than the number of clusters, then  
$\alpha_{max} = \frac{\lambda_1(L)}{\lambda_K(XX^T) -\lambda_{K+1} (XX^T)}$
```

In [None]:
from scipy.linalg import eigvalsh
from sklearn.utils.extmath import randomized_svd
from myst_nb import glue

def get_eigvals(M, n_eigvals):
    N = M.shape[0]
    top_eigvals = eigvalsh(M, subset_by_index=[N-n_eigvals, N-1])
    return np.flip(top_eigvals)

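# Leading singular values, in descending order. For the positive semidefinite
# XX^T these are its eigenvalues; for L they are eigenvalue magnitudes.
_, X_eigvals, _ = randomized_svd(XXt, n_components=4)
_, L_eigvals, _ = randomized_svd(L, n_components=5)
n_covariates = X.shape[1]  # R in the equations above
n_components = 3  # K in the equations above


amin = (L_eigvals[n_components - 1] - L_eigvals[n_components]) / X_eigvals[0]
if n_covariates > n_components:
    amax = L_eigvals[0] / (
        X_eigvals[n_components - 1] - X_eigvals[n_components]
    )
else:
    amax = L_eigvals[0] / X_eigvals[n_covariates - 1]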

glue("amin", str(amin)[:4], display=False)
glue("amax", str(amax)[:4], display=False)

Using these equations, we get a minimum weight of {glue:}`amin` and a maximum weight of {glue:}`amax`.

#### Searching with K-Means

We have a range of possible weights to search through, but we don't have the best one. To find it, we'll embed with Covariate-Assisted Clustering, using all the tricks described previously, for as many alpha-values in our range as we're willing to test. Then, we'll simply pick the value which best lets us distinguish between the different communities in our network. 

To figure out which $\alpha$ is best, we need to cluster our data using a machine learning algorithm. The algorithm of choice will be scikit-learn's implementation of k-means. K-means is a simple algorithm that clusters many datasets quickly, often in only a few iterations. It works by initially sticking a predetermined number of cluster centers in essentially random places in our data, and then alternating between assigning each point to its nearest center and moving each center to the mean of its assigned points, until the centers stop moving. If you want more information, you can check out the original paper by Stuart Lloyd<sup>2</sup>, or scikit-learn's tutorial describing K-means<sup>3</sup>.
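To get a feel for k-means before we use it in our search, here is a small sketch on made-up data: two tight blobs of points that scikit-learn's `KMeans` should pull apart. The blob locations, the variable names, and the `n_init` and `random_state` settings are our own choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up blobs of 50 points each, centered at (0, 0) and (5, 5)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
points = np.vstack([blob_a, blob_b])

# Ask for two clusters; n_init restarts the algorithm from several random initializations
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
toy_labels = toy_kmeans.labels_
print(toy_kmeans.cluster_centers_)
```

Which blob gets called cluster 0 and which gets called cluster 1 is arbitrary; what matters is that points from the same blob end up sharing a label.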

We also need to define exactly what it means to check which tuning values produce the best clustering. We want a metric that emphasizes clusters that are tight and far apart; that way, our clusters will be distinct and we'll be able to see our community structure better. Scikit-learn has exactly the metric we need: the *silhouette score*. This metric ranges from -1 to 1: it is close to 1 if our clusters are tight and far apart, and close to 0 if our clusters overlap. For more details, see the scikit-learn documentation<sup>4</sup>.
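The sketch below compares the silhouette score of two made-up datasets, one with well-separated clusters and one with heavily overlapping clusters. The data and variable names are hypothetical; only `silhouette_score` comes from scikit-learn.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
cluster_ids = np.repeat([0, 1], 50)  # the first 50 points are cluster 0, the rest cluster 1

# Tight clusters that sit far apart
far_apart = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
                       rng.normal(5.0, 0.1, size=(50, 2))])
# Spread-out clusters whose centers nearly coincide
overlapping = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                         rng.normal(0.5, 1.0, size=(50, 2))])

score_far = silhouette_score(far_apart, cluster_ids)
score_overlapping = silhouette_score(overlapping, cluster_ids)
print(score_far, score_overlapping)
```

The well-separated data scores much higher than the overlapping data, which is the behavior we want from a metric that rewards distinct communities.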

Below is Python code which searches through our range of possible $\alpha$ values, testing a clustering for each value it visits. We'll use a golden-section search<sup>5</sup>, which speeds up the search process by testing far fewer values than an exhaustive sweep would.
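Before running the search on our embedding, here is golden-section search on its own, applied to a made-up one-dimensional function. The function `toy_objective` is hypothetical. We also show SciPy's bounded scalar minimizer as one possible alternative when the search has to stay inside a fixed interval; that alternative is our suggestion, not part of the CASE procedure.

```python
from scipy.optimize import golden, minimize_scalar

def toy_objective(x):
    # A simple one-dimensional function with its minimum at x = 2
    return (x - 2.0) ** 2

# With a two-element bracket, golden() uses the points to start a downhill
# bracket search, so the minimizer is not guaranteed to lie between them.
x_golden = golden(toy_objective, brack=(0.0, 1.0))

# If the search must stay inside an interval, a bounded minimizer is one alternative
bounded_result = minimize_scalar(toy_objective, bounds=(0.0, 5.0), method="bounded")
x_bounded = bounded_result.x
print(x_golden, x_bounded)
```

Both searches land near the true minimum while evaluating the function only a handful of times. Note that the two-point bracket started at 0 and 1, yet the minimizer it found sits outside that pair.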

In [None]:
from sklearn.cluster import KMeans
from scipy.optimize import golden
from sklearn.metrics import silhouette_score
    
# Assume we've already generated alphas using the 
# equations above
def cluster(alpha_, L, XXt):
    L_ = L + alpha_*XXt
    latents = embed(L_, dimension=2)
    kmeans = KMeans(n_clusters=3).fit(latents)
    ss = silhouette_score(latents, labels=kmeans.labels_)
    return -1 * ss

best_alpha = golden(cluster, args=(L, XXt), brack=[amin, amax])
new_latents = embed(L+best_alpha*XXt, dimension=2)

Tuning the weight improved our clustering a bit. Below, you can see the difference between our embedding prior to tuning and our embedding after tuning.

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(10,5))
plot_latents(latents_, title="Our embedding prior to tuning", labels=labels, ax=axs[0]);
plot_latents(new_latents, title="Our embedding after tuning", labels=labels, ax=axs[1]);

### Variations on CASE

There are situations where changing the matrix that you embed is useful. 

*non-assortative*  
If your graph is *non-assortative* - meaning, the between-block probabilities are greater than the within-block probabilities - it's better to square your Laplacian. You end up embedding $LL + \alpha XX^T$.  

*big graphs*  
Since the tuning procedure is computationally expensive, you may not want to spend the time tuning $\alpha$ for larger graphs. There are a few options here: you can use a non-tuned version of $\alpha$, or you can use a variant on canonical correlation analysis<sup>6</sup> and simply embed $LX$.

### Using Graspologic

Graspologic's `CovariateAssistedEmbedding` class implements CASE directly. The following code applies CASE to reduce the dimensionality of $L + \alpha XX^T$ down to two dimensions, and then plots the latent positions to show the clustering. You can also try the above variations on CASE with the `embedding_alg` parameter.

In [None]:
import graspologic

casc = graspologic.embed.CovariateAssistedEmbedding(embedding_alg="assortative", n_components=2, tuning_runs=100)
latents = casc.fit_transform(A, covariates=X)
plot_latents(latents, title="Embedding our model using graspologic", labels=labels);

#### References

[1] Binkiewicz, N., Vogelstein, J. T., & Rohe, K. (2017). Covariate-assisted spectral clustering. Biometrika, 104(2), 361–377. https://doi.org/10.1093/biomet/asx008  
[2] Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.  
[3] https://scikit-learn.org/stable/modules/clustering.html#k-means  
[4] https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html  
[5] https://en.wikipedia.org/wiki/Golden-section_search  
[6] Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–377.