# Centrality Measures

In this notebook we look at *centrality measures*, which quantify how important nodes are in a network. A wide variety of centrality measures exist. The appropriate choice depends on the task, and the interpretation of the resulting values depends on the nature of the network. A range of such measures is implemented in *NetworkX*:

https://networkx.org/documentation/stable/reference/algorithms/centrality.html

In [None]:
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option("display.precision", 3)

First, let's load a small sample network representing connections on the LinkedIn platform. The network is stored in the XML-based GEXF format, which NetworkX can read.

In [None]:
g = nx.read_gexf("linkedin25.gexf")

This is an undirected network:

In [None]:
g.is_directed()

In [None]:
plt.figure(figsize=(10,8))
nx.draw_networkx(g, with_labels=True, node_size=800, node_color="lightblue")
plt.axis("off");

### Node Degree

The **degree** of a node is the number of nodes it is connected to in the network, i.e. its number of neighbours. Calling *degree()* on a graph returns the degree of every node in the network, or of a single specified node.

In [None]:
# get degree score for a single user
g.degree("Tina Davis")

In [None]:
# get a dictionary of degree scores for all nodes
degrees = dict(g.degree())
degrees
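
To make the definition concrete, here is a minimal sketch that checks the degree against the neighbour count. It uses a small hypothetical graph `toy` (not the LinkedIn data) and assumes a simple graph with no self-loops, since NetworkX counts a self-loop twice in the degree.

```python
import networkx as nx

# Toy undirected graph (hypothetical data, not the LinkedIn network)
toy = nx.Graph([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")])

# Count the neighbours of each node explicitly
neighbour_counts = {node: len(list(toy.neighbors(node))) for node in toy.nodes()}

# Degrees as reported by NetworkX
toy_degrees = dict(toy.degree())

print(toy_degrees)
print("Degrees match neighbour counts:", toy_degrees == neighbour_counts)
```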

We could look at various statistics for the degree sequence in this network:

In [None]:
degree_seq = pd.Series(degrees)
degree_seq

In [None]:
print('Degree range: [%d, %d]' % (degree_seq.min(), degree_seq.max() ) )
print('Mean degree: %.2f' % degree_seq.mean() )
# the median can be fractional for an even number of nodes, so avoid truncating with %d
print('Median degree: %.1f' % degree_seq.median() )

We could also generate a plot of the **degree distribution** for this network:

In [None]:
ax = degree_seq.plot.hist(figsize=(8,6), fontsize=14, legend=None, color="darkred")
ax.set_ylabel("Number of Nodes", fontsize=14)
ax.set_xlabel("Degree", fontsize=14);
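
A histogram groups degrees into bins, which can blur small integer-valued distributions. As an alternative sketch, we could count the exact number of nodes for each degree value. The graph `toy` below is hypothetical example data.

```python
from collections import Counter

import networkx as nx

# Toy undirected graph (hypothetical data, not the LinkedIn network)
toy = nx.Graph([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")])

# Count how many nodes have each degree value
degree_counts = Counter(d for _, d in toy.degree())

# nx.degree_histogram returns a list whose i-th entry is the number of nodes with degree i
hist = nx.degree_histogram(toy)

for k, count in enumerate(hist):
    print("degree %d: %d node(s)" % (k, count))
```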

### Measuring Centrality

Centrality analysis allows us to identify the most important nodes in a network. The actual definition of importance depends on the nature of the network, and many different centrality measures exist. NetworkX includes implementations of the most common measures.

The most basic measure of centrality, **degree centrality**, is simply the degree of each node divided by $(n-1)$, where $n$ is the total number of nodes. The output is a dictionary, where the keys are the nodes.

In [None]:
deg = nx.degree_centrality(g)
deg
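
As a quick check of the definition, the following sketch computes degree centrality by hand as degree divided by $(n-1)$ and compares it with the NetworkX result. The graph `toy` is hypothetical example data.

```python
import networkx as nx

# Toy undirected graph (hypothetical data, not the LinkedIn network)
toy = nx.Graph([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")])
n = toy.number_of_nodes()

# Degree centrality by hand: degree divided by (n - 1)
manual_deg = {node: d / (n - 1) for node, d in toy.degree()}

# Degree centrality from NetworkX
library_deg = nx.degree_centrality(toy)

for node in toy.nodes():
    print(node, "%.3f" % manual_deg[node], "%.3f" % library_deg[node])
```

Node `a` is connected to every other node, so its degree centrality is the maximum possible value of 1.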

We can use these scores to populate a Pandas *DataFrame* and display a ranking of the nodes by their degree centrality.

In [None]:
s = pd.Series(deg)
df = pd.DataFrame(s,columns=["degree_centrality"])
# display the DataFrame sorted by degree centrality
df.sort_values(by="degree_centrality",ascending=False).head(10)

Another measure, **betweenness centrality**, can be used to find "brokers" or "bridging" nodes in a network.
Nodes that occur on many shortest paths between other nodes in the graph have a high betweenness centrality score.

In [None]:
bet = nx.betweenness_centrality(g)
bet

In [None]:
df["betweenness"] = pd.Series(bet)
df.sort_values(by="betweenness", ascending=False).head(10)
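
To illustrate the idea of counting shortest paths, here is a brute-force sketch on a hypothetical toy graph of two triangles joined through a node `x`. The helper `manual_betweenness` is our own illustration and assumes a connected, undirected, unweighted graph; it is far slower than the algorithm NetworkX uses and is only meant to show the definition.

```python
import itertools

import networkx as nx

# Two triangles joined through a single bridging node "x" (hypothetical data)
toy = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "x"),
                ("x", "d"), ("d", "e"), ("e", "f"), ("d", "f")])

def manual_betweenness(graph):
    """Brute-force normalised betweenness for a connected undirected graph."""
    n = graph.number_of_nodes()
    scores = dict.fromkeys(graph.nodes(), 0.0)
    for s, t in itertools.combinations(graph.nodes(), 2):
        paths = list(nx.all_shortest_paths(graph, s, t))
        for path in paths:
            # credit each intermediate node with its share of the s-t shortest paths
            for v in path[1:-1]:
                scores[v] += 1.0 / len(paths)
    # normalise by the number of node pairs that exclude the node itself
    pairs = (n - 1) * (n - 2) / 2
    return {v: score / pairs for v, score in scores.items()}

manual_bet = manual_betweenness(toy)
library_bet = nx.betweenness_centrality(toy)

for node in toy.nodes():
    print(node, "%.3f" % manual_bet[node], "%.3f" % library_bet[node])
```

Node `x` lies on the shortest path for all 9 pairs that span the two triangles, out of 15 possible pairs, so its score is 9/15 = 0.6.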

**Closeness centrality** measures the extent to which a node is close to all other nodes in a network, either directly or indirectly. It is based on the shortest-path distances from the node to every other node it can reach.

In [None]:
close = nx.closeness_centrality(g)
close

In [None]:
df["closeness"] = pd.Series(close)
df.sort_values(by="closeness",ascending=False).head(10)
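
For a connected graph, closeness centrality can be computed as $(n-1)$ divided by the sum of shortest-path distances from a node to all other nodes. The sketch below checks this by hand on a hypothetical toy graph; it assumes the graph is connected and unweighted.

```python
import networkx as nx

# Two triangles joined through a single bridging node "x" (hypothetical data)
toy = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "x"),
                ("x", "d"), ("d", "e"), ("e", "f"), ("d", "f")])
n = toy.number_of_nodes()

manual_close = {}
for node in toy.nodes():
    # shortest-path distance (in hops) from this node to every reachable node
    dist = nx.single_source_shortest_path_length(toy, node)
    manual_close[node] = (n - 1) / sum(dist.values())

library_close = nx.closeness_centrality(toy)

for node in toy.nodes():
    print(node, "%.3f" % manual_close[node], "%.3f" % library_close[node])
```

Node `x` is one hop from `c` and `d` and two hops from the other four nodes, so its total distance is 10 and its closeness is 6/10 = 0.6.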

The **eigenvector centrality** of a node is proportional to the sum of the centrality scores of its neighbours. This means that a node is important if it is connected to other important nodes.

In [None]:
eig = nx.eigenvector_centrality(g)
df["eigenvector"] = pd.Series(eig)
df.sort_values(by="eigenvector", ascending=False).head(10)
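
The "proportional to the sum of its neighbours' scores" property can be illustrated with a simple power iteration. This is a minimal sketch on a hypothetical toy graph, not necessarily the exact procedure NetworkX follows; the helper `power_iteration` and its parameters are our own. It assumes a connected, undirected, unweighted graph.

```python
import math

import networkx as nx

# A triangle with one pendant node attached to "c" (hypothetical data)
toy = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])

def power_iteration(graph, max_iter=1000, tol=1e-10):
    """Approximate eigenvector centrality, scaled to unit Euclidean length."""
    x = dict.fromkeys(graph.nodes(), 1.0)
    for _ in range(max_iter):
        # new score = own score + sum of neighbours' scores
        # (keeping the own score does not change the limit, it only aids convergence)
        x_new = {v: x[v] + sum(x[u] for u in graph.neighbors(v)) for v in graph.nodes()}
        norm = math.sqrt(sum(val * val for val in x_new.values()))
        x_new = {v: val / norm for v, val in x_new.items()}
        if sum(abs(x_new[v] - x[v]) for v in graph.nodes()) < tol:
            return x_new
        x = x_new
    raise RuntimeError("power iteration did not converge")

manual_eig = power_iteration(toy)
library_eig = nx.eigenvector_centrality(toy)

# For every node, (sum of neighbours' scores) / (own score) should be the same constant
ratios = {v: sum(manual_eig[u] for u in toy.neighbors(v)) / manual_eig[v] for v in toy.nodes()}

for node in toy.nodes():
    print(node, "%.4f" % manual_eig[node], "%.4f" % library_eig[node], "%.4f" % ratios[node])
```

The last column printed is the same for every node, which is the proportionality constant in the definition.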

Often the **normalised eigenvector centrality** is reported to allow for comparisons across different networks. Normalisation is done relative to the maximum value in the current network.

In [None]:
df["norm_eigenvector"] = df["eigenvector"] / df["eigenvector"].max()
df.sort_values(by="norm_eigenvector",ascending=False).head(10)

As we see from the DataFrame, the rankings produced by the various measures can differ, particularly in the case of betweenness centrality.

We could quantify this by looking at the correlation scores between the different measures (i.e. the columns of the DataFrame):

In [None]:
df.corr()
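
By default *corr()* computes Pearson correlation on the raw scores. Since we are comparing rankings, a rank-based correlation such as Spearman may also be informative. The sketch below is one possible way to do this, using the karate club graph bundled with NetworkX as stand-in data; the names `toy`, `scores`, `pearson` and `spearman` are our own.

```python
import networkx as nx
import pandas as pd

# Example graph bundled with NetworkX, used here as stand-in data
toy = nx.karate_club_graph()

# One column per centrality measure, indexed by node
scores = pd.DataFrame({
    "degree_centrality": nx.degree_centrality(toy),
    "betweenness": nx.betweenness_centrality(toy),
    "closeness": nx.closeness_centrality(toy),
})

# Pearson compares the raw scores, Spearman compares the rank order
pearson = scores.corr()
spearman = scores.corr(method="spearman")

print(spearman)
```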

In our LinkedIn network, Eric Smith has the highest betweenness score, as he acts as a key bridge in the network.

In [None]:
# Get ranking of the nodes, where 1 indicates the highest rank
df["betweenness"].rank(ascending=False)

If we remove Eric from the network, the network is no longer connected. Note that *remove_node()* modifies the graph `g` in place:

In [None]:
g.remove_node("Eric Smith")

In [None]:
nx.is_connected(g)

With this "bridge" node removed, we can see that the network has two distinct components:

In [None]:
plt.figure(figsize=(10,8))
nx.draw_networkx(g, with_labels=True, node_size=800, node_color="lightblue")
plt.axis("off");
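
Because *remove_node()* changes the graph in place, it can be safer to experiment on a copy. The sketch below, on a hypothetical toy graph, removes a bridging node from a copy and lists the resulting components. It also uses *articulation_points()*, which finds the nodes whose removal would disconnect an undirected graph.

```python
import networkx as nx

# Two triangles joined through a single bridging node "x" (hypothetical data)
toy = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "x"),
                ("x", "d"), ("d", "e"), ("e", "f"), ("d", "f")])

# Nodes whose removal would disconnect the graph
cut_nodes = sorted(nx.articulation_points(toy))
print("Articulation points:", cut_nodes)

# Remove the bridging node from a copy, leaving the original graph intact
pruned = toy.copy()
pruned.remove_node("x")

components = sorted(sorted(c) for c in nx.connected_components(pruned))
print("Still connected:", nx.is_connected(pruned))
print("Components:", components)
```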