# Representations of Networks

Now that we know how to represent networks with matrices, let's step back and look at what network representation means in general, and at the different ways you might represent a network to understand different aspects of it.

We already know that the topological structure of a network is just a collection of nodes, with pairs of nodes potentially linked together by edges. Mathematically, this means that a network is defined by two objects: the set of nodes and the set of edges, with each edge defined as a pair of nodes (an unordered pair, for undirected networks). Networks can have additional structure: you might have extra information about each node ("features" or "covariates"), which we'll talk about in the joint representation section in chapter 6. Edges might also have weights, which usually measure the connection strength in some way. We learned in the previous section that network topology can be represented with matrices in a number of ways -- with adjacency matrices, Laplacians, or (less commonly) with incidence matrices.
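As a small reminder of how the two defining objects relate to a matrix representation, here is a minimal sketch that turns a node set and an edge set into an adjacency matrix. The node labels and edges are made up for illustration.

```python
import numpy as np

# A small undirected network, written as the two objects that define it.
# The node labels and edges here are hypothetical.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]

# Map each node label to a row/column index.
index = {node: i for i, node in enumerate(nodes)}

# Build the adjacency matrix: entry (i, j) is 1 if nodes i and j share an edge.
A = np.zeros((len(nodes), len(nodes)), dtype=int)
for u, v in edges:
    A[index[u], index[v]] = 1
    A[index[v], index[u]] = 1  # undirected, so the matrix is symmetric

print(A)
```

Because the network is undirected, every edge shows up twice in the matrix, once on each side of the diagonal.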

One major challenge in working with networks is that a lot of standard mathematical operations and metrics remain undefined. What does it mean to add a network to another network, for instance? How would network multiplication work? How do you define a distance between one node and another node? Without these kinds of basic operations and metrics, we are left in the dark when we try to find analogies to non-network data analysis.

Another major challenge is that the number of possible networks can get obscene fairly quickly. See the figure below, for instance. When you allow for only 50 nodes, there are already more than $10^{350}$ possible networks. Just for reference, if you took all hundred-thousand quadrillion vigintillion atoms in the universe, and then made a new entire universe for each of those atoms... you'd still be nowhere near $10^{350}$ atoms. 

In [None]:
import matplotlib.pyplot as plt
from scipy.special import comb
import numpy as np

# There are 2**comb(n, 2) possible networks on n nodes. Work in log scale
# directly, since 2**comb(50, 2) is far too large to store as a float.
vertices = np.arange(51)
n_graphs = comb(vertices, 2) * np.log10(2)

# plotting code
fig, ax = plt.subplots()
ax.plot(vertices, n_graphs)
ax.set_xlabel("Number of Nodes")
ax.set_ylabel("Number of Networks\n (log scale)");
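If you'd like to check the count itself, Python's arbitrary-precision integers can handle it exactly. The sketch below assumes simple undirected networks on labeled nodes with no self-loops, so each pair of nodes either has an edge or doesn't. The helper name `count_networks` is ours.

```python
from math import comb

def count_networks(n_nodes: int) -> int:
    """Number of simple undirected networks on n_nodes labeled nodes."""
    # Each of the comb(n_nodes, 2) node pairs either has an edge or doesn't.
    return 2 ** comb(n_nodes, 2)

n_possible = count_networks(50)
n_digits = len(str(n_possible))
print(comb(50, 2))  # number of node pairs: 1225
print(n_digits)     # number of decimal digits in the count: 369
```

A 369-digit number is a bit below $10^{369}$, which is comfortably more than $10^{350}$.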

To address these challenges, we can generally group network analysis into four approaches: the bag of features, the bag of edges, the bag of nodes, and the bag of networks. Each is so called because you're essentially throwing data into a bag and treating each thing in it as its own object, and each addresses the challenges above in its own way. Let's get into some details!

## Bag of Features
The first approach is called the bag of features. The idea is that you take networks and compute statistics from them, either for each node or for the entire network. These statistics could be simple things like the edge count or the average path length between two nodes, or more complicated metrics like the modularity, which measures how well a network can be separated into communities. Unfortunately, network statistics like these tend to be correlated: knowing the value of one statistic usually tells you something about the values of the others. This means that it can be difficult to interpret an analysis that works by comparing network statistics. It's also hard to figure out which statistics to compute, since there is no limit to how many you could define.

### You Lose A Lot of Information with the Bag of Features Approach

![anascombe](../Images/anascombe.jpeg)

This figure contains four networks, all of which have the exact same network statistics. They each have 10 nodes and 15 edges. They also all contain the same number of triangles (sets of three nodes which are all connected), and the same global clustering coefficient. The global clustering coefficient, for reference, gives an indication of how clustered the network is: it is the proportion of connected triples (pairs of edges that share a node) that are closed into triangles.

Each of these networks, however, is completely different from the others. The first network, for instance, has two connected components, while the others are all connected. The second network has one community of nodes that are only connected along a path, and a different community whose nodes are tightly connected -- and so on. Modeling these networks by computing features from them would lose a great deal of information.
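You can see the same effect on a much smaller scale. The sketch below uses two hypothetical networks (not the ones in the figure): a ring of six nodes, and two separate triangles. They agree on node count, edge count, density, and degree sequence, yet one is connected and the other is not.

```python
import networkx as nx

# Two hypothetical networks, not the ones in the figure above.
ring = nx.cycle_graph(6)  # one ring of six nodes
two_triangles = nx.disjoint_union(nx.cycle_graph(3), nx.cycle_graph(3))

for name, g in [("ring", ring), ("two triangles", two_triangles)]:
    degrees = sorted(d for _, d in g.degree())
    print(name, g.number_of_nodes(), g.number_of_edges(),
          nx.density(g), degrees, nx.is_connected(g))
```

If density and degree sequence were the only features you computed, these two networks would look identical.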

### Network Features Tend to be Correlated

The other issue with the bag of features approach is that network features tend to be correlated with each other: if you consider all possible networks, knowing the value of any of the network features gives you information about what the value of other network features might be.

Let's play around with this. We'll make 100 random networks, each with 50 nodes, and then we'll compute some of the most common network features that people use on them (we'll explain what each network feature is along the way). Then, we'll look at how correlated these features are. For now, just think of a random network as being a network with each node being connected to each other node with some set probability. Each network will have a different connection probability. These networks will also have communities -- groups of nodes which are more connected with each other than with other nodes -- the strength of which will also be determined randomly. When we generate data later on in this book, we'll get into different types of random network models you can use.

In [None]:
from graspologic.simulations import sbm
import numpy as np
from numpy.random import uniform

n_nodes = 50
n_networks = 100
# within-community (p) and between-community (q) connection probabilities
p = uniform(size=n_networks, low=.5).round(2)
q = uniform(size=n_networks, high=.5).round(2)

networks = []
for i in range(n_networks):
    P = np.array([[p[i], q[i]],
                  [q[i], p[i]]])
    network = sbm(n=[n_nodes//2, n_nodes//2], p=P)
    networks.append(network)

Now, for each of these networks, we'll calculate a set of network features. These will just be a collection of some of the most common metrics you can use to describe networks. We'll use the network density, the clustering coefficient, the path length, and the modularity. We'll go over these features in more depth later, but here are some brief explanations for now just to get a sense for what we're actually calculating.

The **Modularity** measures the fraction of edges in your network that connect nodes in the same community, minus the fraction you would expect if edges were placed at random. It effectively measures how much better a particular assignment of community labels is at defining communities than a completely random assignment.

The **Network Density** is the fraction of all possible edges that a network can have which actually exist. If every node were connected to every other node, the network density would be 1; and if no node is connected to anything, the network density would be 0.

The **Clustering Coefficient** indicates how much nodes tend to cluster together. If you pick out three nodes, and two of them are connected, a high clustering coefficient would mean that the third is probably connected as well.

The **Path Length** indicates how far apart two nodes in your network are on average. If two nodes are directly connected, their path length is one. If two nodes are connected through an intermediate node, their path length is two.
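To make these definitions concrete, here is a sketch on a toy network that is small enough to check by hand: two triangles joined by a single edge. The network and its community split are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Toy network: triangle (0, 1, 2), triangle (3, 4, 5), and a bridge 2-3.
g = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])

# Density: 7 edges out of 15 possible node pairs.
density = nx.density(g)

# Clustering coefficient: 3 * triangles / connected triples = 3 * 2 / 10.
clustering = nx.transitivity(g)

# Average path length: the 15 pairwise distances sum to 27.
avg_path_length = nx.average_shortest_path_length(g)

# Modularity of the obvious split into the two triangles.
# Each community contributes 3/7 - (7/14)**2, for a total of 5/14.
Q = modularity(g, [{0, 1, 2}, {3, 4, 5}])

print(density, clustering, avg_path_length, Q)
```

Working through one of these by hand, such as the clustering coefficient, is a good way to build intuition before trusting the numbers on larger networks.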

The code below defines functions to calculate each of these network features, and then calculates them for each of the networks we created above. Since most of these metrics already exist in `networkx`, we'll just pull from there. You can check the `networkx` documentation for details.

We'll also define a preprocessing decorator, which just converts the network from a numpy array into the format networkx uses.

In [None]:
import functools
import networkx as nx

def preprocess(f):
    @functools.wraps(f)
    def wrapper(network):
        network = nx.from_numpy_array(network)
        return f(network)
    return wrapper

@preprocess
def modularity(network):
    communities = nx.algorithms.community.greedy_modularity_communities(network)
    Q = nx.algorithms.community.quality.modularity(network, communities)
    return Q

@preprocess
def network_density(network):
    return nx.density(network)

@preprocess
def clustering_coefficient(network):
    return nx.transitivity(network)

@preprocess
def path_length(network):
    if nx.number_connected_components(network) != 1:
        # You want to make sure this still works if your network isn't fully connected!
        network = max((network.subgraph(c) for c in nx.connected_components(network)), 
                      key=len)
    return nx.average_shortest_path_length(network)

Now, we'll calculate all of these features for each network, and finally we'll create a heatmap of their correlation.

In [None]:
import pandas as pd

network_features = []
for network in networks:
    modularity_ = modularity(network)
    network_density_ = network_density(network)
    clustering_coefficient_ = clustering_coefficient(network)
    path_length_ = path_length(network)
    features = {"Modularity": modularity_, "Network Density": network_density_, 
                "Clustering Coefficient": clustering_coefficient_, "Average Path Length": path_length_}
    network_features.append(features)
    
df = pd.DataFrame(network_features)
feature_correlation = df.corr()

Below is the heatmap. Numbers close to 1 mean that when the first feature is large, the second tends to be large; numbers close to 0 mean that the features are not very correlated; and numbers close to -1 mean that when the first feature is large, the second tends to be small.

In [None]:
import seaborn as sns
from graphbook_code import cmaps

plot = sns.heatmap(feature_correlation, annot=True, square=True, 
                   cmap=cmaps["divergent"], cbar_kws={"aspect": 10, "ticks": [-1., -.5, 0., 0.5, 1.]})
plot.set_title("Average Correlation \nFor Our Network Features", y=1.05);

If you're familiar with correlation, you'll notice that these correlation numbers generally have a pretty high magnitude: each feature generally tells you a lot about each other feature. Let's explore why this can lead to some issues in practice.

#### Why Network Feature Correlatedness Can Lead To Problems

Let's take a step back and consider the implications of using the bag of features approach to analyze networks, now that we can see how correlated network features usually are. Say you have a bunch of brain networks of mice, where the nodes are neurons and the edges are connections between neurons. You have a group of mice who were raised in total darkness, and another group who were raised normally: let's call the ones who were raised in the darkness the batman mice. You're interested in how the visual parts of the brain are affected in the batman mice. You find the networks for only the visual parts of their brain, and then you calculate some network feature; maybe the density. It turns out that the network density is much lower for batman mice than it is for normal mice, so you conclude that raising mice in the darkness causes lower network density. Seems reasonable.

The problem is that network density is correlated with pretty much every other network feature you could have used. For example, just looking at the heatmap above, we can see that density tends to be lower when modularity is higher, higher when the clustering coefficient is greater, and much lower when the average path length is longer. So, for instance, if you measured modularity instead of density, then you could have just as easily concluded that raising mice in total darkness causes their brains to have more well-defined clusters of neurons, and you'd then write an entire paper on the implications of that.

## Bag of Edges

The second approach is called the bag of edges. Here, you just take all of the edges in your network and treat them all as independent entities. You study each edge individually, ignoring any interactions between edges. This can work in some situations, but you still run into dependence: if two people within a friend group are friends, that can change the dynamic of the friend group and so change the chance that a different set of two people within the group are friends.

More specifically, in the bag of edges approach, you generally assume that every edge in your network will exist with some particular *probability*, which can be different depending on the edge that you're looking at. For example, there might be a 60% chance that the first and second nodes in your network are connected, but only a 20% chance that the third and fourth nodes are. What often will happen here is that you have multiple networks describing the same (or similar) systems. For example, let's use the mouse example again from above. You have your batman mice (who were raised in the dark) and your normal mice. You'll have a network for each batman mouse and a network for each normal mouse, and you assume that, even though there's a bit of variation in what you actually see, the *probability* of an edge existing between the same two nodes is the same for all batman mice. Your goal would be to figure out which edges have a different *probability* of existing with the batman mice compared to the normal mice.

Let's make some example networks to explore this. We'll have two groups of networks, and all of the networks will have only three nodes for simplicity's sake. Each group will contain 20 networks, for a total of 40 networks. In the first group, every edge between every pair of nodes simply has a 50% chance of existing. In the second group, the edge between nodes 0 and 1 will instead have a 90% chance of existing, but every other edge will still just be 50%.

In [None]:
from graspologic.simulations import sample_edges

P1 = np.array([[.5, .5, .5],
               [.5, .5, .5],
               [.5, .5, .5]])

P2 = np.array([[.5, .9, .5],
               [.9, .5, .5],
               [.5, .5, .5]])

# First group
n_networks = 20
first_group = np.empty((n_networks, 3, 3))
for i in range(n_networks):
    network = sample_edges(P1)
    first_group[i] = network
    
# Second group
second_group = np.empty((n_networks, 3, 3))
for i in range(n_networks):
    network = sample_edges(P2)
    second_group[i] = network
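Before running any formal test, a quick sanity check is to compare how often each edge appears in each group. The sketch below uses small hand-written stacks of adjacency matrices as stand-ins (they are hypothetical, not samples from the model above), stored in the same `(n_networks, 3, 3)` layout.

```python
import numpy as np

# Hypothetical stacks of 3-node adjacency matrices, shape (n_networks, 3, 3).
# These are hand-written stand-ins, not samples from the model above.
group_a = np.array([
    [[0, 1, 0], [1, 0, 1], [0, 1, 0]],
    [[0, 0, 1], [0, 0, 1], [1, 1, 0]],
    [[0, 1, 1], [1, 0, 0], [1, 0, 0]],
    [[0, 0, 0], [0, 0, 1], [0, 1, 0]],
])
group_b = np.array([
    [[0, 1, 0], [1, 0, 1], [0, 1, 0]],
    [[0, 1, 1], [1, 0, 0], [1, 0, 0]],
    [[0, 1, 0], [1, 0, 1], [0, 1, 0]],
    [[0, 1, 1], [1, 0, 1], [1, 1, 0]],
])

# Averaging over the first axis gives, for every edge, the fraction of
# networks in the group that contain it.
freq_a = group_a.mean(axis=0)
freq_b = group_b.mean(axis=0)

# Entries far from zero flag edges worth a closer look.
difference = freq_b - freq_a
print(difference)
```

Here only the edge between nodes 0 and 1 differs between the groups. With so few networks a gap like this could easily be noise, which is why a hypothesis test is the next step.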

### Figuring out which edge is the outlier

By design, we know that the edge between nodes 0 and 1 is an outlier - the probability that it's there changes depending on whether your network is in the first or the second group. One common goal when using the bag of edges approach is finding signal edges: an edge whose probability of existing changes depending on which type of network you're looking at. In our case, we're trying to figure out (without using our prior knowledge) that the edge between nodes 0 and 1 is a signal edge.

To find the outlier edge, we'll first get the set of all edges, along with their indices. Since all of our networks are undirected, we'll get the edges and their indices by finding all of the values in the upper-triangular portion of the adjacency matrices.

In [None]:
edge_indices = np.triu_indices(3, k=1)
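If `np.triu_indices` is new to you, the short sketch below shows what it returns for three nodes, and how the result can be used to read one value per edge out of an adjacency matrix. The matrix `A` is a hypothetical example.

```python
import numpy as np

# A hypothetical undirected 3-node adjacency matrix.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])

# k=1 skips the diagonal, so only pairs with row < column are returned.
rows, cols = np.triu_indices(3, k=1)
print(list(zip(rows.tolist(), cols.tolist())))  # [(0, 1), (0, 2), (1, 2)]

# Indexing with the two index arrays reads one value per edge.
print(A[rows, cols].tolist())  # [1, 0, 1]
```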

Now, we'll use a hypothesis test called *dcorr* to find the outlier edge. You don't need to worry too much about what dcorr is; it's essentially a hypothesis test that is useful with networks, because it doesn't make any assumptions about the relationships between edges or the way your networks were randomly generated.

In the code below, we:
1. Loop through the edge indices
2. Get a list of all instances of that edge in the first group, and all instances of that edge in the second group
3. Feed that list into the dcorr test to obtain a p-value for that edge

In [None]:
from hyppo.ksample import KSample

edge_pvals = []
for i, j in zip(*edge_indices):
    samples = [group[:, i, j] for group in [first_group, second_group]]
    _, pvalue = KSample("Dcorr").test(*samples)
    edge_pvals.append(pvalue)

You can see below that the p-value for the first edge, the one that connects nodes 0 and 1, is extremely small, whereas the p-values for the other two edges are relatively large.

In [None]:
np.array(edge_pvals).round(3)
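As a point of comparison, when edges are binary you could also test a single edge with a classical alternative, Fisher's exact test on a 2×2 table of counts. This is a different technique from dcorr, shown only as a sketch, and the counts below are hypothetical rather than taken from the networks above.

```python
from scipy.stats import fisher_exact

# Hypothetical counts for a single edge, 20 networks per group.
present_first, absent_first = 10, 10    # edge seen in 10 of 20 networks
present_second, absent_second = 18, 2   # edge seen in 18 of 20 networks

table = [[present_first, absent_first],
         [present_second, absent_second]]

# Two-sided test of whether the edge is equally likely in both groups.
_, pvalue = fisher_exact(table, alternative="two-sided")
print(round(pvalue, 4))
```

Either way, you end up with one p-value per edge, so the multiple comparisons issue discussed next applies just the same.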

#### Correcting for Multiple Comparisons

Because we are doing multiple tests, we're running into a multiple comparisons problem here. If you're not familiar with the idea of multiple comparisons in statistics, it is as follows. Suppose you have a test that tells you whether you've made a discovery (or, to be more rigorous, whether you should reject the idea that you didn't make one). If you run this test enough times, then even if there's no discovery to be made, random chance will eventually make it *seem* like you've made one. So, the chance that you make at least one false discovery increases with the number of tests that you run. For example, say your test has a 5% false-positive rate, and you run it 100 times on data with no real effect. On average, there will be 5 false positives. If there was also one true positive in your data, and your test finds it, then you'll end up with about 6 positives total, 5 of which are false.
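The arithmetic behind that example is quick to check. The sketch below assumes the 100 tests are independent of one another, which is a simplification.

```python
alpha = 0.05    # false-positive rate of a single test
n_tests = 100   # number of independent tests, none with a real effect

# Expected number of false positives across all tests.
expected_false_positives = alpha * n_tests

# Probability that at least one test comes up positive by chance alone.
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(expected_false_positives)   # 5.0
print(round(p_at_least_one, 3))   # 0.994
```

So with 100 uncorrected tests, you are almost guaranteed to see at least one false positive.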

We need to correct for this here because we're doing a new test for each edge. There are a few standard ways to do this, but we'll use something called the *Holm-Bonferroni correction*. Don't worry about the details of this; all you need to know for now is that it corrects for the multiple comparisons problem by being a bit more conservative with what we classify as a positive result. This correction is implemented in the `statsmodels` library, a popular library for statistical tests and data exploration.

In [None]:
from statsmodels.stats.multitest import multipletests

reject, corrected_pvals, *_ = multipletests(edge_pvals, method="holm", alpha=0.05)

You can see below that the corrected p-value for the edge connecting nodes 0 and 1 is still extremely small. We somewhat arbitrarily chose a value of .05 as the cutoff for determining an outlier, so we can say that any edge with a corrected p-value below .05 is an outlier edge. We've used the bag-of-edges approach to find an edge whose probability of existing changed depending on which group a network belongs to!

In [None]:
corrected_pvals.round(3)
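For the curious, the Holm-Bonferroni adjustment is short enough to sketch by hand. The function below is our own illustrative version, not the `statsmodels` source, and the raw p-values are hypothetical.

```python
def holm_adjust(pvalues):
    """Holm-Bonferroni adjusted p-values, returned in the original order."""
    m = len(pvalues)
    # Visit the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        scaled = min(1.0, (m - rank) * pvalues[i])
        # ... and adjusted values are never allowed to decrease.
        running_max = max(running_max, scaled)
        adjusted[i] = running_max
    return adjusted

# Hypothetical raw p-values for three edges.
raw = [0.001, 0.4, 0.03]
adjusted = holm_adjust(raw)
print([round(p, 3) for p in adjusted])  # [0.003, 0.4, 0.06]
```

Notice that the edge with a raw p-value of 0.03 would have passed a .05 cutoff on its own, but no longer does after the correction.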

## Bag of Nodes

Similarly to the bag of edges, you can treat each node as its own entity and do analysis on a bag of nodes. Much of this book will focus on the bag of nodes approach, because you'll often use edge count, covariate information, and other things when you work with bags of nodes -- and, although there's still dependence between nodes, it generally isn't as big of an issue. Most of the single-network methods we'll use in this book will take the bag of nodes approach. What you'll see repeatedly is that we take the nodes of a network and *embed* them so each node is associated with a point in Euclidean space, which you can draw on a plot (this is called the Euclidean representation of the node). Then, you can use other methods from mainstream machine learning to learn about your network. We'll get into this heavily in future chapters.

We'll also often associate node representation with community investigation. The idea is that sometimes you have groups of nodes which behave similarly -- maybe they have a higher chance of being connected to each other, or maybe they're all connected to certain other groups of nodes. Regardless of how you define communities, a community investigation motif will pop up: you get your node representation, then you assign nearby nodes to the same community. We can then look at the properties of the nodes belonging to a particular community, or look at relationships between communities of nodes.

Since we'll use the bag of nodes approach heavily throughout this book, you'll be getting a much better sense for what you can do with it later. As a sneak preview right now, let's generate a few networks and embed their nodes to get a feel for what bag-of-nodes type analysis might look like.

Don't worry about the specifics, but below we generate a simple network with two communities. Nodes in the same community have an 80% chance of being connected, whereas nodes in separate communities have a 20% chance of being connected. There are 20 nodes per community.

In [None]:
from graspologic.simulations import sbm

# generate network
P = np.array([[.8, .2],
              [.2, .8]])
network, labels = sbm(p=P, n=[20, 20], return_labels=True)

Now, we'll use graspologic to find the points in 2D space that each node is associated with. Again, don't worry about the specifics: this will be heavily explained later in the book. All you have to know right now is that we're moving the nodes of our network from network space, where each node is associated with a set of edges with other nodes, to 2D Euclidean space, where each node is associated with an x-coordinate and a y-coordinate.

In [None]:
from graspologic.embed import AdjacencySpectralEmbed as ASE

ase = ASE(n_components=2)
embedding = ase.fit_transform(network)

Below you can see the result, colored by community. Each of the dots in this plot is one of the nodes of our network. You can see that the nodes cluster into two groups: one group for the first community, and another group for the second community. Using this representation for the nodes of our network, we can open the door to later downstream machine learning tasks.

In [None]:
from graphbook_code import plot_latents

plot = plot_latents(embedding, labels=labels)
plot.set_xlim(-1, 1)
plot.set_ylim(-1, 1)

plot.set_title("Bag of Nodes on a coordinate axis");
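To give a rough feel for what a spectral embedding computes, and for the "embed, then group nearby nodes" motif, here is a numpy-only sketch on a deterministic toy network. It is an illustration of the general idea under our own simplifying choices (eigendecomposition, top two eigenvalues by magnitude, a sign-based split), not graspologic's implementation.

```python
import numpy as np

# A deterministic toy network: two 5-node cliques joined by a single edge.
n = 5
A = np.zeros((2 * n, 2 * n))
A[:n, :n] = 1
A[n:, n:] = 1
np.fill_diagonal(A, 0)          # no self-loops
A[n - 1, n] = A[n, n - 1] = 1   # the bridge between the two cliques

# Eigendecomposition of the symmetric adjacency matrix.
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Keep the two eigenpairs with the largest absolute eigenvalues.
top = np.argsort(np.abs(eigenvalues))[::-1][:2]

# Scale each kept eigenvector by the square root of |eigenvalue|.
embedding = eigenvectors[:, top] * np.sqrt(np.abs(eigenvalues[top]))

# The sign of the second coordinate splits the nodes by clique.
communities = (embedding[:, 1] > 0).astype(int)
print(communities)
```

Which clique gets label 0 and which gets label 1 is arbitrary, since an eigenvector's sign is not determined. What matters is that nodes in the same clique end up with the same label.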

## Bag of Networks

The last approach is the bag of networks, which you'd use when you have more than one network that you're working with. Here, you'd study the networks as a whole and you'd want to test for differences across different networks or classify entire networks into one category or another. You might want to figure out if two networks were drawn from the same probability distribution, or whether you can find a smaller group of nodes that could represent the whole network. The end of this book will cover the bag of networks approach.