# Why embed networks?

Networks by themselves are interesting objects, but a network is not how we traditionally organize data in machine learning. In almost any ML algorithm - whether you're using a neural network, or a decision tree, or whether your goal is to classify datapoints or to predict a value using regression - you'll see data organized into a matrix, where the rows represent observations and the columns represent features, or variables. Each data point, then, is traditionally represented as a single point in d-dimensional space: each column gets its own axis in a plot, and each row is a single datapoint.

For example, the data below is organized traditionally. On the left is the data matrix; you can see that there are two feature columns, one for each axis. The x-column contains the x-coordinates for each datapoint, and the y-column contains the y-coordinates for each datapoint. We can see two clusters of data numerically.

On the right is the same data, but plotted in Euclidean space. Each column of the data matrix gets its own axis of the plot, so that the x- and y-axis location of the $i^{th}$ datapoint in the scatterplot is the same as the x and y values of the $i^{th}$ row of the data matrix. We can see the two clusters of data geometrically.

In [None]:
from sklearn.datasets import make_blobs
import pandas as pd
import numpy as np

# make the data
centers = np.array([[-2, -2], 
                    [2, 2]])
X, y = make_blobs(n_samples=10, cluster_std=0.5,
                  centers=centers, shuffle=False)

# convert data into a DataFrame
data = pd.DataFrame(X, columns=["x", "y"])

In [None]:
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import seaborn as sns

# setup
fig = plt.figure(figsize=(12, 8))
gs = GridSpec(1, 3)
axm = fig.add_subplot(gs[0])
axs = fig.add_subplot(gs[1:])
cmap="flare"

# plot left
hm = sns.heatmap(data, ax=axm, yticklabels=False, 
                 cmap=cmap, annot=True, cbar=True)
hm.hlines(range(len(data)), *hm.get_xlim(), colors='k', alpha=.1)

# plot right
plot = sns.scatterplot(data=data, x='x', y='y', legend=False, ax=axs)

# lines
max_ = int(data.values.max()) + 1
plot.vlines(0, -max_, max_, colors="black", lw=.9, linestyle="dashed", alpha=.2)
plot.hlines(0, -max_, max_, colors="black", lw=.9, linestyle="dashed", alpha=.2)

# ticks
plot.xaxis.set_major_locator(plt.MaxNLocator(3))
plot.yaxis.set_major_locator(plt.MaxNLocator(3))

# set axis bounds
lim = (-max_, max_)
plot.set(xlim=lim, ylim=lim)

# title, etc
plt.suptitle("Euclidean data represented as a data matrix and represented in Euclidean space", fontsize=16)
plt.tight_layout()

Most machine learning methods require our data to be organized like this. For example, with the data above, we could use scikit-learn to perform simple K-Means Clustering to find two groups of data points:

In [None]:
from sklearn.cluster import KMeans

predicted_labels = KMeans(n_clusters=2).fit_predict(X)
predicted_labels

In [None]:
fig, ax = plt.subplots(figsize=(7, 7))

# plot the points, colored by predicted cluster
plot = sns.scatterplot(data=data, x='x', y='y', ax=ax, 
                       hue=predicted_labels, palette=cmap)

# lines
plot.vlines(0, -max_, max_, colors="black", lw=.9, linestyle="dashed", alpha=.2)
plot.hlines(0, -max_, max_, colors="black", lw=.9, linestyle="dashed", alpha=.2)

# ticks
plot.xaxis.set_major_locator(plt.MaxNLocator(3))
plot.yaxis.set_major_locator(plt.MaxNLocator(3))

# title
plot.set_title("Clustered data after K-Means", fontsize=16);

Network-valued data is different. Take the Stochastic Block Model below, shown as both a layout plot and an adjacency matrix. Say your goal is to view the nodes as particular datapoints, and you'd like to cluster the data in the same way you clustered the Euclidean data above. Intuitively, you'd expect to find two groups: one for the first set of heavily connected nodes, and one for the second set. Unfortunately, traditional machine learning algorithms won't work here, because network data doesn't live in the traditional rows-as-observations, columns-as-features format.

In [None]:
import networkx as nx
from graspologic.simulations import sbm

p = np.array([[.6, .05],
              [.05, .6]])
A, labels = sbm([25, 25], p, return_labels=True)

In [None]:
from graspologic.plot import heatmap

fig, axs = plt.subplots(1, 2, figsize=(12, 6))

G = nx.Graph(A)
rgb = np.atleast_2d((0.12156862745098039, 0.4666666666666667, 0.7058823529411765))
colors = np.repeat(rgb, 50, axis=0)

pos = nx.spring_layout(G)

options = {"edgecolors": "tab:gray", "node_size": 100}
nx.draw_networkx_nodes(G, node_color=colors, pos=pos, ax=axs[0], **options)
nx.draw_networkx_edges(G, alpha=.5, pos=pos, width=.3, ax=axs[0])

hm = heatmap(A, ax=axs[1], cbar=False, 
        cmap=[(1, 1, 1), (0.12156862745098039, 0.4666666666666667, 0.7058823529411765)], 
        center=None)
sns.despine(bottom=True, left=True)

You *can*, of course, design methods that work directly on networks: algorithms that traverse edges in various ways, for instance, or that use edge weights directly. But to use the entire toolbox that machine learning offers, without having to design special network-based algorithms, you need a way to *represent* networks in Euclidean space.
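To make the idea of representing a network in Euclidean space a little more concrete, here is a minimal sketch that uses plain NumPy. It is an illustration under stated assumptions, not necessarily how graspologic computes its embeddings: the two-block network is simulated by hand rather than with `sbm`, the embedding simply keeps the two eigenvectors of the adjacency matrix with the largest-magnitude eigenvalues, and the names `A_demo`, `labels_demo`, and `latents` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a two-block network by hand
# (assumption: undirected, unweighted, no self-loops).
n_per_block = 25
labels_demo = np.repeat([0, 1], n_per_block)
B = np.array([[0.6, 0.05],
              [0.05, 0.6]])
P = B[labels_demo][:, labels_demo]  # n x n matrix of edge probabilities
upper = np.triu((rng.random(P.shape) < P).astype(int), k=1)
A_demo = (upper + upper.T).astype(float)  # symmetric adjacency matrix

# Spectral embedding sketch: keep the d eigenvectors whose eigenvalues
# have the largest magnitude, scaled by the square root of that magnitude.
d = 2
eigvals, eigvecs = np.linalg.eigh(A_demo)
top = np.argsort(np.abs(eigvals))[::-1][:d]
latents = eigvecs[:, top] * np.sqrt(np.abs(eigvals[top]))

# latents is now an ordinary data matrix: one row per node, d columns.
print(latents.shape)

# Pairwise Euclidean distances between the embedded nodes.
diffs = latents[:, None, :] - latents[None, :, :]
D = np.linalg.norm(diffs, axis=-1)

# Compare average distances within a block and between blocks.
same_block = labels_demo[:, None] == labels_demo[None, :]
off_diagonal = ~np.eye(len(A_demo), dtype=bool)
within = D[same_block & off_diagonal].mean()
between = D[~same_block].mean()
print(f"mean within-block distance:  {within:.3f}")
print(f"mean between-block distance: {between:.3f}")
```

Each node becomes one row of `latents`, which puts the network in the rows-as-observations, columns-as-features format from earlier. A matrix like this could be handed to `KMeans` in the same way as `X` above.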

## High Dimensionality of Network Data

## ML Methods for Euclidean Data

## Adjacency Spectral Embedding

## Laplacian Spectral Embedding

- calculation of similarity between two objects (e.g., cosine or Euclidean distance)