# Adjacency Spectral Embed

This demo shows how to use the `AdjacencySpectralEmbed` (ASE) class. We then use ASE to show how the two communities of a stochastic block model (SBM) graph can be recovered in the embedded space using k-means.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import adjusted_rand_score
from sklearn.cluster import KMeans

from graspologic.embed import AdjacencySpectralEmbed
from graspologic.simulations import sbm
from graspologic.plot import heatmap, pairplot

import warnings
warnings.filterwarnings('ignore')
np.random.seed(8889)
%matplotlib inline

## Data Generation
ASE is a method for estimating the latent positions of a network modeled as a Random Dot Product Graph (RDPG). This embedding is both a form of dimensionality reduction for a graph and a way of fitting a generative model to graph data. We first generate two 2-block SBMs: one directed, and one undirected.

In [None]:
# Define parameters
n_verts = 100
labels_sbm = n_verts * [0] + n_verts * [1]
P = np.array([[0.8, 0.2], 
              [0.2, 0.8]])

# Generate SBMs from parameters
undirected_sbm = sbm(2 * [n_verts], P)
directed_sbm = sbm(2 * [n_verts], P, directed=True)

# Plot both SBMs
fig, axes = plt.subplots(1, 2, figsize=(16, 8))
heatmap(undirected_sbm, title='2-block SBM (undirected)', inner_hier_labels=labels_sbm, ax=axes[0])
heatmap(directed_sbm, title='2-block SBM (directed)', inner_hier_labels=labels_sbm, ax=axes[1]);
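
To make the link between the SBM above and the RDPG model concrete: when the block matrix `P` is positive semidefinite, every vertex in a block can share one latent position, and the connection probability between two vertices is the dot product of their positions. The following is a minimal NumPy sketch, independent of graspologic, that recovers one valid set of block positions from `P` by eigendecomposition. The names `block_positions`, `vertex_positions`, and `expected_adjacency` are illustrative only.

```python
import numpy as np

# Block probability matrix used for the SBMs above
P = np.array([[0.8, 0.2],
              [0.2, 0.8]])

# P is symmetric positive semidefinite, so P = V diag(w) V^T with w >= 0
eigvals, eigvecs = np.linalg.eigh(P)

# One latent position per block (one row each); scaling each eigenvector
# column by sqrt(eigenvalue) gives rows whose dot products reproduce P
block_positions = eigvecs * np.sqrt(eigvals)
print(np.round(block_positions @ block_positions.T, 3))

# Expand to one row per vertex: 100 vertices in each block
n_verts = 100
vertex_positions = np.repeat(block_positions, n_verts, axis=0)

# Matrix of edge probabilities that the sampled graphs are drawn from
expected_adjacency = vertex_positions @ vertex_positions.T
```

Latent positions are only identifiable up to an orthogonal rotation, so the positions that ASE estimates will generally differ from `block_positions` by such a rotation.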

## Embedding: Undirected Case

We now use the AdjacencySpectralEmbed class to embed the adjacency matrix into lower-dimensional space.  
If no parameters are given to the AdjacencySpectralEmbed class, it will automatically choose the number of dimensions to embed into.

In [None]:
# instantiate an ASE object
ase = AdjacencySpectralEmbed()

# call its fit_transform method to generate latent positions
Xhat = ase.fit_transform(undirected_sbm)
pairplot(Xhat, title='SBM adjacency spectral embedding')
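
For intuition about what the embedding computes, here is a minimal NumPy sketch of the textbook ASE definition: take the top singular vectors of the adjacency matrix and scale them by the square roots of the corresponding singular values. This is an illustration under that assumption, not graspologic's implementation, and `ase_sketch` is a hypothetical helper name.

```python
import numpy as np

def ase_sketch(adjacency, n_components):
    """Textbook ASE: top singular vectors scaled by sqrt(singular values)."""
    U, s, _ = np.linalg.svd(adjacency)
    # np.linalg.svd returns singular values in descending order
    return U[:, :n_components] * np.sqrt(s[:n_components])

# Simulate an undirected, loopless 2-block SBM with NumPy only
rng = np.random.default_rng(0)
n_verts = 100
P = np.array([[0.8, 0.2], [0.2, 0.8]])
labels = np.repeat([0, 1], n_verts)
edge_probs = P[labels][:, labels]          # per-pair edge probabilities
upper = np.triu(rng.random(edge_probs.shape) < edge_probs, k=1)
A = (upper + upper.T).astype(float)        # symmetrize the upper triangle

X_sketch = ase_sketch(A, n_components=2)

# The two blocks should form two separate clouds in the embedded space
block_means = np.array([X_sketch[labels == k].mean(axis=0) for k in (0, 1)])
print(np.round(block_means, 2))
```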

## Embedding: Directed Case

If the graph is directed, we will get two outputs roughly corresponding to the "out" and "in" latent positions, since these are no longer the same. 

In [None]:
# Transform in directed case
ase = AdjacencySpectralEmbed()
Xhat, Yhat = ase.fit_transform(directed_sbm)

# Plot both embeddings
pairplot(Xhat, title='SBM adjacency spectral embedding "out"')
pairplot(Yhat, title='SBM adjacency spectral embedding "in"')
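
The reason there are two outputs can be sketched with a plain singular value decomposition: for an asymmetric matrix the left and right singular vectors differ, giving separate "out" and "in" positions. The example below is an illustration on a noise-free matrix of hypothetical edge probabilities, not graspologic's code; `directed_ase_sketch` is a made-up helper name.

```python
import numpy as np

def directed_ase_sketch(adjacency, n_components):
    """Return ("out", "in") positions from the top singular triplets."""
    U, s, Vt = np.linalg.svd(adjacency)
    scale = np.sqrt(s[:n_components])
    out_positions = U[:, :n_components] * scale
    in_positions = Vt[:n_components].T * scale
    return out_positions, in_positions

# Hypothetical rank-2 directed model: P[i, j] = <x_i, y_j>
rng = np.random.default_rng(1)
X_true = rng.uniform(0.1, 0.7, size=(50, 2))
Y_true = rng.uniform(0.1, 0.7, size=(50, 2))
P_directed = X_true @ Y_true.T             # asymmetric, entries in (0, 1)

X_out, Y_in = directed_ase_sketch(P_directed, n_components=2)

# The two sets of positions jointly reproduce the edge probabilities
print(np.allclose(X_out @ Y_in.T, P_directed))
```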

## Dimension specification

One can also specify the parameters for embedding.  
Here, we specify the number of embedded dimensions and change the SVD solver used to compute the embedding.

In [None]:
# fit and transform
ase = AdjacencySpectralEmbed(n_components=2, algorithm='truncated')
Xhat = ase.fit_transform(undirected_sbm)

# plot
pairplot(Xhat, title='2-component embedding', height=4)
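
When choosing `n_components` by hand, a quick look at the singular values of the adjacency matrix is a common guide. The sketch below simulates a 2-block SBM with NumPy alone and prints the leading singular values; for this model two of them should stand out from the rest. It only inspects the spectrum and is not the automatic selection that `AdjacencySpectralEmbed` performs when `n_components` is omitted.

```python
import numpy as np

# Simulate an undirected, loopless 2-block SBM with NumPy only
rng = np.random.default_rng(0)
n_verts = 100
P = np.array([[0.8, 0.2], [0.2, 0.8]])
labels = np.repeat([0, 1], n_verts)
edge_probs = P[labels][:, labels]
upper = np.triu(rng.random(edge_probs.shape) < edge_probs, k=1)
A = (upper + upper.T).astype(float)

# Singular values in descending order; a large gap after the second one
# suggests that two dimensions capture the block structure
singular_values = np.linalg.svd(A, compute_uv=False)
print(np.round(singular_values[:5], 1))
```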

## Out-of-Sample (OOS) Embedding

Given new observations not seen in the original matrix, we sometimes wish to determine their latent positions without re-embedding the entire updated matrix.  

More formally, suppose we have computed the embedding $\hat{X} \in \mathbb{R}^{n \times d}$ from some adjacency matrix $A \in \mathbb{R}^{n \times n}$.  
Suppose we then observe a new vertex, represented by its vector of edges to the original vertices $w \in \mathbb{R}^n$, and we want to estimate its latent position without re-embedding the entire matrix.

We can estimate this out-of-sample (OOS) latent position with ASE's `predict` method, which accepts either a single OOS vertex or several at once.

In [None]:
from graspologic.utils import remove_vertices

# create in-sample adjacency matrix A and an oos array a.
# An oos array with k oos vertices should have shape (n, k).
A, a = remove_vertices(undirected_sbm, indices=[0, -1], return_vertices=True)

# embed
ase = AdjacencySpectralEmbed(n_components=2)
X_hat = ase.fit_transform(A)

# predicted latent positions
w = ase.predict(a)
print(w)
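
One common way to think about out-of-sample embedding is as a least-squares problem: find the position whose dot products with the in-sample positions best match the new vertex's edges. The sketch below illustrates that idea on noise-free edge probabilities. It is an assumption-laden illustration rather than a description of what `predict` does internally; `oos_embed_sketch` is a hypothetical helper, and storing one row per held-out vertex is a convention of this sketch only.

```python
import numpy as np

def oos_embed_sketch(X_hat, oos_rows):
    """Least squares: solve X_hat @ w ~= a for every row a of oos_rows."""
    rhs = np.atleast_2d(oos_rows).T
    solution, *_ = np.linalg.lstsq(X_hat, rhs, rcond=None)
    return solution.T                       # shape (k, d)

# Noise-free edge probabilities of the 2-block SBM
n_verts = 100
P = np.array([[0.8, 0.2], [0.2, 0.8]])
labels = np.repeat([0, 1], n_verts)
edge_probs = P[labels][:, labels]

# Hold out the first and last vertices
held_out = [0, 2 * n_verts - 1]
keep = np.setdiff1d(np.arange(2 * n_verts), held_out)
in_sample = edge_probs[np.ix_(keep, keep)]
oos_rows = edge_probs[np.ix_(held_out, keep)]   # one row per held-out vertex

# Embed the in-sample matrix (textbook ASE via SVD)
U, s, _ = np.linalg.svd(in_sample)
X_hat_sketch = U[:, :2] * np.sqrt(s[:2])

# Each held-out vertex lands on the position of its own block
w_sketch = oos_embed_sketch(X_hat_sketch, oos_rows)
print(np.round(w_sketch, 3))
```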

#### Plotting out-of-sample embedding

Here, we plot the original latent positions as well as the out-of-sample vertices. Note that the out-of-sample vertices are near their expected latent positions despite not having been run through the original embedding.

In [None]:
# Set up data for plotting
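def plot_oos(X_hat, oos_vertices, n_verts, labels, title):
    # Point types are rebuilt here because one vertex was removed from each
    # block, so the passed-in `labels` no longer match the rows of X_hat
    labels = ["Red"]*(n_verts-1) + ["Blue"]*(n_verts-1)
    d = {'Type': labels, 'Dimension 1': X_hat[:, 0], 'Dimension 2': X_hat[:, 1]}
    df = pd.DataFrame(data=d)

    # add out-of-sample points to data
    for pt, label in zip(oos_vertices, ["Red out-of-sample", "Blue out-of-sample"]):
        data = {"Type": label, "Dimension 1": pt[0], 
                "Dimension 2": pt[1]}
        # append one row (DataFrame.append is unavailable in pandas >= 2.0)
        df = pd.concat([df, pd.DataFrame([data])], ignore_index=True)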

    # Create plot of data
    g = sns.PairGrid(df, hue="Type",
                     hue_order=["Red", "Blue", "Red out-of-sample", "Blue out-of-sample"],
                     palette=["tab:red", "tab:blue", "r", "b"],
                     hue_kws={"s": [20, 20, 300, 300],
                              "marker": ["o", "o", "*", "*"],
                              "alpha": [.5, .5, 1, 1],
                             },
                     layout_pad=1)

    # Add data to plot and change figure settings
    g.map_offdiag(plt.scatter, linewidth=.5, edgecolor="w")
    g.map_diag(sns.kdeplot)
    g.add_legend()
    plt.subplots_adjust(top=0.9)
    g.fig.suptitle(title)

plot_oos(X_hat, w, n_verts, labels_sbm, title="Out-of-Sample Embeddings (2-block SBM)")

### Out-of-Sample Embedding with Directed Graphs

Not all graphs are undirected. For a directed graph, the adjacency matrix $A \in \mathbb{R}^{n \times n}$ is not symmetric: $A_{i,j}$ represents the edge from node $i$ to node $j$, whereas $A_{j,i}$ represents the edge from node $j$ to node $i$. An out-of-sample vertex therefore has two edge vectors, one for each direction.

To account for this, we pass the tuple `(left_oos, right_oos)` into the `predict` method. It then outputs a tuple of `(left_latent_prediction, right_latent_prediction)`.

In [None]:
# a is a tuple of (oos_left, oos_right)
A, a = remove_vertices(directed_sbm, indices=[0, -1], return_vertices=True)

# Fit a directed graph
X_hat, Y_hat = ase.fit_transform(A)

# predicted latent positions
w = ase.predict(a)
print(f"output of `ase.predict(a)` is {type(w)}", "\n")
print(f"left latent positions: \n{w[0]}\n")
print(f"right latent positions: \n{w[1]}")
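
The directed case can be sketched with the same least-squares idea, applied once per direction: the new vertex's outgoing edges are matched against the in-sample "in" positions, and its incoming edges against the in-sample "out" positions. This is an illustration on a noise-free matrix of hypothetical edge probabilities; it makes no claim about the order or form of the arguments that `predict` expects.

```python
import numpy as np

# Hypothetical rank-2 directed model: P[i, j] = <x_i, y_j>
rng = np.random.default_rng(2)
X_true = rng.uniform(0.1, 0.7, size=(60, 2))
Y_true = rng.uniform(0.1, 0.7, size=(60, 2))
P_directed = X_true @ Y_true.T

# Hold out the last vertex
in_sample = P_directed[:-1, :-1]
out_edges = P_directed[-1, :-1]    # from the held-out vertex to the others
in_edges = P_directed[:-1, -1]     # from the others to the held-out vertex

# Directed embedding of the in-sample matrix via SVD
U, s, Vt = np.linalg.svd(in_sample)
X_out = U[:, :2] * np.sqrt(s[:2])
Y_in = Vt[:2].T * np.sqrt(s[:2])

# "Out" position of the new vertex: solve Y_in @ x ~= out_edges
x_new, *_ = np.linalg.lstsq(Y_in, out_edges, rcond=None)
# "In" position of the new vertex: solve X_out @ y ~= in_edges
y_new, *_ = np.linalg.lstsq(X_out, in_edges, rcond=None)

print(np.round(x_new, 3), np.round(y_new, 3))
```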

#### Plotting directed latent predictions

In [None]:
plot_oos(X_hat, w[0], n_verts, labels_sbm, title="Left Latent Predictions")
plot_oos(Y_hat, w[1], n_verts, labels_sbm, title="Right Latent Predictions")

## Clustering in the embedded space
Now, we will use the Euclidean representation of the graph to apply a standard clustering algorithm like k-means. We start with an SBM where the 2 blocks have exactly the same connection probabilities (effectively giving us an Erdős-Rényi (ER) graph). In this case, k-means will not be able to distinguish between the two embedded blocks. As within-block connections become more likely than between-block connections, the clustering will improve.

For each graph, we plot its adjacency matrix, the predicted k-means cluster labels in the embedded space, and which vertices were mislabeled relative to the true labels. Adjusted Rand Index (ARI) is a measure of clustering accuracy, where 1 is perfect clustering relative to ground truth. Error rate is simply the proportion of incorrectly labeled nodes.
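
Because k-means assigns arbitrary names to its clusters, both metrics have to ignore label permutations. The toy example below, with made-up labels, shows that ARI does this on its own, while a two-cluster error rate needs an explicit check for the flipped labeling.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

true_labels = np.array([0, 0, 0, 1, 1, 1])
# Hypothetical k-means output: cluster names are swapped and one vertex
# (the third) is assigned to the wrong cluster
predicted = np.array([1, 1, 0, 0, 0, 0])

# ARI is invariant to renaming the clusters
ari = adjusted_rand_score(true_labels, predicted)
ari_flipped_only = adjusted_rand_score(true_labels, 1 - true_labels)

# With two clusters, the error rate is the smaller of the mismatch
# proportions under the two possible namings
mismatch = np.mean(true_labels != predicted)
error_rate = min(mismatch, 1 - mismatch)

print(ari, ari_flipped_only, error_rate)
```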

In [None]:
palette = {'Right':(0,0.7,0.2),
           'Wrong':(0.8,0.1,0.1)}

for insularity in np.linspace(0.5, 0.625, 4):
    P = np.array([[insularity, 1-insularity], [1-insularity, insularity]])
    sampled_sbm = sbm(2 * [n_verts], P)
    Xhat = AdjacencySpectralEmbed(n_components=2).fit_transform(sampled_sbm)
    labels_kmeans = KMeans(n_clusters=2).fit_predict(Xhat)
    ari = adjusted_rand_score(labels_sbm, labels_kmeans)
    error = labels_sbm - labels_kmeans
    error = error != 0
    # sometimes the labels given by kmeans will be the inverse of ours
    if np.sum(error) / (2 * n_verts) > 0.5:
        error = error == 0
    error_rate = np.sum(error) / (2 * n_verts)
    error_label = (2 * n_verts) * ['Right']
    error_label = np.array(error_label)
    error_label[error] = 'Wrong'
    
    heatmap(sampled_sbm, title=f'Insularity: {str(insularity)[:5]}', 
            inner_hier_labels=labels_sbm)
    pairplot(Xhat,
             labels=labels_kmeans,
             title=f'KMeans on embedding, ARI: {str(ari)[:5]}',
             legend_name='Predicted label',
             height=3.5,
             palette='muted',)
    pairplot(Xhat,
             labels=error_label,
             title=f'Error from KMeans, Error rate: {str(error_rate)}',
             legend_name='Error label',
             height=3.5,
             palette=palette,)