### CS4423 - Networks
Prof. Götz Pfeiffer<br />
School of Mathematics, Statistics and Applied Mathematics<br />
NUI Galway

#### 4. Small Worlds

# Lecture 12: Characteristic Path Length and Clustering Coefficient

Many real world networks are **small world networks**,
where most pairs of nodes are only a few steps away from each other,
and where nodes tend to form cliques, i.e., subgraphs having
all nodes connected to each other.

We introduce **three network attributes** that measure these small-world
effects:

* the **characteristic path length** $L$, defined as the
  average length of all shortest paths in the network;
  
* the **transitivity** $T$, defined as the proportion of
  triads that form triangles;
  
* the **clustering coefficient** $C$, defined as the
  average node clustering coefficient.

In terms of these attributes, a network is called a **small world network** if it has 

1. a small **average shortest path length** $L$
(scaling with $\log n$, where $n$ is the number of nodes), and
2. a high **clustering coefficient** $C$.

It turns out that ER random networks do have a small average shortest path length,
but not a high clustering coefficient.
This observation justifies the need for a different model of
random networks, if they are to be used to model the 
clustering behavior of real world networks.

In [None]:
import networkx as nx
opts = { "with_labels" : True, "node_color": 'y' }

## Characteristic Path Length

We have seen how BFS can determine the length of a shortest path from a given node $x$ to any
node $y$ in a **connected network**.  An application to all nodes $x$ yields the shortest distances
between all pairs of nodes.
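
As a reminder of how this works, here is a minimal BFS sketch. The helper name `bfs_distances` and the 6-cycle test graph `H` are assumptions made for illustration only; they are not part of `networkx` or of the earlier lectures.

```python
from collections import deque

import networkx as nx


def bfs_distances(graph, source):
    """Return a dict mapping each node reachable from source to its distance."""
    dist_from_source = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in graph.neighbors(x):
            if y not in dist_from_source:      # first visit is along a shortest path
                dist_from_source[y] = dist_from_source[x] + 1
                queue.append(y)
    return dist_from_source


# Small connected test graph (a 6-cycle), chosen only for illustration.
H = nx.cycle_graph(6)

# One BFS per node gives the distances between all pairs of nodes.
all_pairs = {x: bfs_distances(H, x) for x in H}
print(all_pairs[0])
```

On a connected graph, the resulting nested dictionary should agree with `dict(nx.shortest_path_length(H))`.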

Let $\mathcal{D} = (d_{ij})$ be the **distance matrix** of a connected graph $G = (X, E)$,
whose entry $d_{ij}$ is the length of the shortest path from node $i \in X$ to node $j \in X$.  (Hence $d_{ii} = 0$ for all $i$.)  There are a number of graph (and node) attributes that can be defined in terms of this matrix. 

<div class="alert alert-warning">

**Definition.**  Let $G = (X, E)$ be a connected graph.

* The **eccentricity** $e_i$ of a node $i \in X$ is the maximum distance between $i$
and any other vertex in $G$,
$$
e_i = \max_j d_{ij}.
$$

* The **graph radius** $R$ is the minimum eccentricity,
$$
R = \min_i e_i.
$$

* The **graph diameter** $D$ is the maximum eccentricity,
$$
D = \max_i e_i.
$$

* The **characteristic path length** $L$ of $G$ is the average distance between pairs of distinct nodes,
$$
L = \frac1{n(n-1)} \sum_{i \neq j} d_{ij}.
$$
</div>
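
To see the definitions at work on a graph that is small enough to check by hand, the following sketch uses the path graph $0 - 1 - 2 - 3$ (an example chosen here, not taken from the lecture). By hand, the eccentricities are $3, 2, 2, 3$, so $R = 2$ and $D = 3$, and the distances over ordered pairs sum to $20$, giving $L = 20/12 = 5/3$.

```python
import networkx as nx

# Path graph 0 - 1 - 2 - 3, small enough to verify every value by hand.
P = nx.path_graph(4)
path_nodes = list(P)

# Distance matrix as a list of rows, in the node order 0, 1, 2, 3.
lengths = dict(nx.shortest_path_length(P))
dist_matrix = [[lengths[i][j] for j in path_nodes] for i in path_nodes]

ecc = [max(row) for row in dist_matrix]      # eccentricities e_i
R_path = min(ecc)                            # graph radius
D_path = max(ecc)                            # graph diameter
n_path = len(path_nodes)
L_path = sum(map(sum, dist_matrix)) / (n_path * (n_path - 1))  # characteristic path length

print(ecc, R_path, D_path, L_path)
```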

* **Fact:** The characteristic path length of a random graph $G(n, m)$ or $G(n, p)$ is approximately
$$
L \approx \frac{\ln n}{\ln \langle k \rangle}.
$$

So if $n = 16$ and $m = 32$, then the average node degree in $G(n, m)$ is $\langle k \rangle = \frac{2m}{n} = 4$,
and, approximately, $L = \frac{\log_2 16}{\log_2 4} = 2$.
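
The formula is stated with natural logarithms, while the worked value uses base $2$. The base does not matter, since it cancels in the quotient. A quick numerical check (the variable names are chosen for this sketch only):

```python
from math import isclose, log, log2

n_nodes, m_edges = 16, 32
k_avg = 2 * m_edges / n_nodes                # average degree 2m/n = 4

estimate_ln = log(n_nodes) / log(k_avg)      # natural logarithms, as in the formula
estimate_log2 = log2(n_nodes) / log2(k_avg)  # base 2, as in the worked value

# Both quotients agree (up to floating point rounding) and are close to 2.
print(estimate_ln, estimate_log2, isclose(estimate_ln, estimate_log2))
```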

Let's find a small connected graph. (Loop until it's connected ...)

In [None]:
n, m = 16, 32
while True:
    G = nx.gnm_random_graph(n, m)
    if nx.is_connected(G):
        break
nx.draw(G, **opts)

Compute **all** shortest path lengths with the `shortest_path_length` function provided by `networkx`.

In [None]:
dist = dict(nx.shortest_path_length(G))
dist = [[dist[i][j] for j in range(n)] for i in range(n)]
dist

In [None]:
eccentricity = [max(d) for d in dist]
eccentricity

`networkx` has a function `eccentricity` which computes a dictionary of eccentricities.

In [None]:
print(nx.eccentricity(G))

The extreme values of the eccentricity are the radius and the diameter of the network.

In [None]:
radius = min(eccentricity)
diameter = max(eccentricity)
radius, diameter

The characteristic path length $L$ is the sum of all entries in $\mathcal{D}$, divided by the number of pairs of distinct nodes. 

In [None]:
cpl = sum([sum(d) for d in dist]) / n / (n - 1)
cpl

`networkx` computes $L$ with a function `average_shortest_path_length`.

In [None]:
nx.average_shortest_path_length(G)

In [None]:
from math import log
kavg = sum(dict(G.degree()).values()) / n
log(n) / log(kavg)

In [None]:
kavg

<div class="alert alert-warning">

**Definition (Small-world behavior).**
A network $G = (X, E)$ is said to exhibit **small world behavior** if 
its characteristic path length $L$ grows proportionally to the
logarithm of the number $n$ of nodes of $G$:
$$
L \sim \ln n.
$$
</div>

In this sense, the ensembles $G(n, m)$ and $G(n, p)$ of random graphs do exhibit small
world behavior (as $n \to \infty$).
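
For contrast, here is a sketch of a family of graphs that does not show this behavior. Cycle graphs are an assumption of this sketch, picked only as a counterexample: their characteristic path length grows linearly in $n$, so it roughly doubles whenever $n$ doubles, whereas logarithmic growth would only add a constant.

```python
import networkx as nx

# Cycle graphs of doubling size, used only as a contrast to random graphs.
sizes = [16, 32, 64]
cycle_cpl = [nx.average_shortest_path_length(nx.cycle_graph(size)) for size in sizes]

for size, value in zip(sizes, cycle_cpl):
    print(size, round(value, 3))
```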

## Clustering

In contrast to random graphs, real world networks contain **many triangles**:  
it is not uncommon that a friend of one of my friends is my friend, too.
This **degree of transitivity** can be measured in several different ways.

<div class="alert alert-warning">

**Definition (Graph transitivity).**
A **triad** is a tree of $3$ nodes or, equivalently, a graph consisting of $2$
adjacent edges (and the nodes they connect).  The transitivity $T$ of a graph $G = (X, E)$
is the proportion of **transitive** triads, i.e., triads which are subgraphs of **triangles**:
$$
T = \frac{3 n_{\Delta}}{n_{\wedge}},
$$
where $n_{\Delta}$ is the number of triangles in $G$, and $n_{\wedge}$ is the number of triads.
</div>

By definition, $0 \leq T \leq 1$.

**Example.**

In [None]:
G = nx.Graph(((1,2), (2,3), (3,1), (3,4), (3,5)))
nx.draw(G, **opts)

The function `nx.triangles(G)` returns a `python` dictionary reporting for each node
of the graph `G` the number of triangles it is contained in.

In [None]:
print(nx.triangles(G))

Overall, each triangle in `G` is thus accounted for $3$ times, once for each of its
vertices.  The following sum determines this number $3 n_{\Delta}$.

In [None]:
triple_nr_triangles = sum(nx.triangles(G).values())
print(triple_nr_triangles)

The number $n_{\wedge}$ of triads in `G` can be determined from the graph's degree
sequence, as each node of degree $k$ is the central node of exactly
$\binom{k}{2}$ triads.  (Why?)

In [None]:
print(G.degree())
print({k : v * (v-1) // 2 for k, v in dict(G.degree()).items()})
nr_triads = sum([v * (v-1) // 2 for v in dict(G.degree()).values()])
nr_triads
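
The degree-based count can be cross-checked by brute force. The following sketch (with the example's edge list repeated under the new name `H`, so that it runs on its own) lists, for every node $x$, all pairs of neighbors of $x$. Each such pair is one triad with central node $x$, and the triad is transitive exactly when the two neighbors are adjacent.

```python
from itertools import combinations

import networkx as nx

# Same edge list as in the example above, under a new name.
H = nx.Graph([(1, 2), (2, 3), (3, 1), (3, 4), (3, 5)])

triads = 0          # paths u - x - v with central node x
closed_triads = 0   # triads whose end nodes u, v are adjacent as well

for x in H:
    for u, v in combinations(list(H.neighbors(x)), 2):
        triads += 1
        if H.has_edge(u, v):
            closed_triads += 1

print(triads, closed_triads)
```

Here `closed_triads` counts every triangle three times, once per central node, so it plays the role of $3 n_{\Delta}$.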

The transitivity $T$ of `G` is the quotient of these two quantities, $T = 3 n_{\Delta}/n_{\wedge}$,
which `networkx` computes with a function `transitivity`.

In [None]:
print(triple_nr_triangles / nr_triads )
print(nx.transitivity(G))

* The transitivity of a $G(n, p)$ **random graph** is
$$
T = p,
$$
the probability that the third edge, which closes a given triad into a triangle, is present.
(Or: Compute $3 n_{\Delta}/n_{\wedge}$ using the explicit formulas
from the previous lecture: $n_{\Delta} = \binom{n}{3}p^3$ and $n_{\wedge} = 3 \binom{n}{3}p^2$.)
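
Written out, with the expected counts quoted above substituted for $n_{\Delta}$ and $n_{\wedge}$ (a heuristic step, since it replaces the expectation of a quotient by the quotient of the expectations), the computation reads:
$$
T = \frac{3 n_{\Delta}}{n_{\wedge}}
= \frac{3 \binom{n}{3} p^3}{3 \binom{n}{3} p^2}
= p.
$$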

The concept of **clustering** measures the transitivity of a node, or of an entire graph, in a different way.

<div class="alert alert-warning">

**Definition (Clustering coefficient).**
For a node $i \in X$ of a graph $G = (X, E)$, denote by
$G_i$ the subgraph induced on the neighbours of $i$ in $G$,
and by $m(G_i)$ its number of edges.

The **node clustering coefficient** $c_i$ of node $i$ is defined
as
$$
c_i = \begin{cases}
\binom{k_i}{2}^{-1} m(G_i), & k_i \geq 2, \\
0, & \text{else.}
\end{cases}
$$

The **graph clustering coefficient** $C$ of $G$ is the 
average node clustering coefficient,
$$
C = \langle c\rangle = \frac1n \sum_{i=1}^n c_i.
$$
</div>

By definition, $0 \leq c_i \leq 1$ for all nodes $i \in X$, and $0 \leq C \leq 1$.

**Example.**

In [None]:
G = nx.Graph([(0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (2,3), (3,4)])
nx.draw(G, **opts)

In [None]:
N = nx.neighbors(G, 0)
S = G.subgraph(list(N))
nx.draw(S, **opts)

In [None]:
nS = S.number_of_nodes()
nS_choose_2 = nS * (nS - 1) // 2
mS = S.number_of_edges()
print(nS, mS, mS / nS_choose_2 )

In [None]:
nx.clustering(G)

In [None]:
nx.average_clustering(G)
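
The hand computation for node $0$ extends to all nodes. The sketch below follows the definition directly; the helper name `node_clustering` is hypothetical, and the example's edge list is repeated under the new name `H` so that the block runs on its own.

```python
import networkx as nx

# Same edge list as in the example above, under a new name.
H = nx.Graph([(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (2, 3), (3, 4)])


def node_clustering(graph, i):
    """Clustering coefficient c_i: edge density among the neighbors of i."""
    k = graph.degree(i)
    if k < 2:
        return 0.0
    neighborhood = graph.subgraph(list(graph.neighbors(i)))  # induced subgraph G_i
    return neighborhood.number_of_edges() / (k * (k - 1) / 2)


c_by_hand = {i: node_clustering(H, i) for i in H}
C_by_hand = sum(c_by_hand.values()) / H.number_of_nodes()
print(c_by_hand, C_by_hand)
```

The values should agree with `nx.clustering(H)` and `nx.average_clustering(H)`: nodes $0$ and $3$ have $c_i = 2/3$, the other three nodes have $c_i = 1$, and hence $C = 13/15$.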

* The **node clustering coefficient** of any node $i$ in a $G(n, p)$ **random graph** is
$c_i = p$. (In any selection of potential edges, by construction a proportion of $p$ is
present in the random graph; this is true in particular for the $\binom{k}{2}$ potential
edges between the $k$ neighbors of a node of degree $k$.)

* Thus the **graph clustering coefficient** of a $G(n, p)$ **random graph** is
$$
C = p.
$$

* Note that when $p(n) = \langle k \rangle n^{-1}$ for a fixed expected average degree $\langle k \rangle$
then $C = \langle k \rangle / n \to 0$ for $n \to \infty$: in large random graphs
the number of triangles is negligible.

* In real world networks, one often observes that $C / \langle k \rangle$ does not depend
on $n$ (as $n \to \infty$).

### Clustering vs Transitivity

For a node $i \in X$, denote by $n_i^{\wedge} = \binom{k_i}{2}$ the number of
triads containing $i$ as their central node, and by $n_i^{\Delta}$ the actual
number of triangles containing $i$.

Then the node clustering coefficient is $c_i = n_i^{\Delta}/n_i^{\wedge}$,
or $n_i^{\Delta} = n_i^{\wedge} c_i$.

Moreover $3 n_{\Delta} = \sum_i n_i^{\Delta}$ and $n_{\wedge} = \sum_i n_i^{\wedge}$.

It follows that
$$
T = \frac{3 n_{\Delta}}{n_{\wedge}} = \frac1{n_{\wedge}} \sum_i n_i^{\wedge} c_i
$$
in contrast to
$$
C = \frac1n \sum_i c_i.
$$
Thus $C$ is the (plain) **average** of the node clustering coefficients, whereas $T$ is a
**weighted average** of node clustering coefficients, giving higher weight to
high degree nodes.
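
The weighted-average formula can be checked numerically. The sketch below uses the five-node graph from the transitivity example (repeated under the new name `H`), with the weights $n_i^{\wedge} = \binom{k_i}{2}$ computed from the degrees.

```python
import networkx as nx

# Edge list of the transitivity example, under a new name.
H = nx.Graph([(1, 2), (2, 3), (3, 1), (3, 4), (3, 5)])

c_node = nx.clustering(H)                                # node clustering c_i
weight = {i: k * (k - 1) // 2 for i, k in H.degree()}    # triads centered at i

# Weighted average with weights binom(k_i, 2), versus the plain average.
T_weighted = sum(weight[i] * c_node[i] for i in H) / sum(weight.values())
C_plain = sum(c_node.values()) / H.number_of_nodes()

print(T_weighted, nx.transitivity(H))
print(C_plain, nx.average_clustering(H))
```

Node $3$ has degree $4$ and hence weight $6$, so its low clustering coefficient $1/6$ dominates $T = 3/8$, while in $C$ it counts no more than any other node.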

The following example illustrates how $C$ and $T$ are different measures: as $n \to \infty$ here, $T \to 0$ while $C \to 1$.

In [None]:
n = 20
G = nx.Graph(["AB"])
G.add_edges_from([(x, k) for x in "AB" for k in range(n)])
    
nx.draw(G, **opts)

In [None]:
nx.average_clustering(G), nx.transitivity(G)
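
The limits can be made explicit. The closed forms used below are derived for this sketch and are not stated in the lecture: each of the $n$ numbered nodes has $c_i = 1$, each of the two hubs has $c_i = 2/(n+1)$, and there are $n$ triangles and $n(n+2)$ triads, so that $C = \bigl(n + \tfrac{4}{n+1}\bigr)/(n+2)$ and $T = 3/(n+2)$. The helper name `two_hub_graph` is hypothetical.

```python
import networkx as nx


def two_hub_graph(size):
    """Adjacent hubs 'A' and 'B', each joined to the nodes 0, ..., size - 1."""
    graph = nx.Graph([("A", "B")])
    graph.add_edges_from((hub, k) for hub in "AB" for k in range(size))
    return graph


for size in (5, 20, 80):
    H = two_hub_graph(size)
    C_formula = (size + 4 / (size + 1)) / (size + 2)   # derived closed form for C
    T_formula = 3 / (size + 2)                         # derived closed form for T
    print(size, nx.average_clustering(H), C_formula, nx.transitivity(H), T_formula)
```

As `size` grows, the first pair of values approaches $1$ and the second pair approaches $0$.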

##  Code Corner

### `networkx`

* `shortest_path_length` : [[doc]](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.shortest_paths.generic.shortest_path_length.html)


* `eccentricity`: [[doc]](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.distance_measures.eccentricity.html)


* `triangles`: [[doc]](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.cluster.triangles.html)


* `transitivity`: [[doc]](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.cluster.transitivity.html)


* `clustering`: [[doc]](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.cluster.clustering.html)


* `average_clustering`: [[doc]](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.cluster.average_clustering.html)

##  Exercises

1. What are the characteristic path length $L$, the transitivity $T$, and the clustering coefficient $C$
of the Petersen graph?

1. What are the characteristic path length $L$, the transitivity $T$, and the clustering coefficient $C$
of the Florentine families marital graph?

2. What is the transitivity and what is the clustering coefficient
of a complete graph on $n$ nodes?

3. What is the transitivity and what is the clustering coefficient
of a tree on $n$ nodes?

1. Design an experiment with random graphs to verify the predicted characteristic path length.

1. Design an experiment with random graphs to verify the predicted graph clustering coefficient.