<div style="float:left;"><img src="logo.png" width="500"/></div>

# Characterising Networks

In this demo we will use the Python [NetworkX](https://networkx.org) package to quantitatively characterise an existing network, looking at aspects of the overall network structure and at the centrality or importance of individual nodes. We will use the US air transport network that we created in the last demo.

In [None]:
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.precision', 3)

## Network Loading

Load the directed weighted network from the GEXF file created in Demo 1:

In [None]:
g = nx.read_gexf("airstats-weighted-directed.gexf")
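
As a quick illustration of what loading a GEXF file gives us, the sketch below writes a tiny toy network to a temporary file and reads it back. The airport codes, city names, and weight are made-up values for illustration only, not data from the real file.

```python
import os
import tempfile

import networkx as nx

# build a tiny directed, weighted toy network (hypothetical values)
toy = nx.DiGraph()
toy.add_node("JFK", city="New York")
toy.add_node("LAX", city="Los Angeles")
toy.add_edge("JFK", "LAX", weight=5)

# write it to a temporary GEXF file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.gexf")
    nx.write_gexf(toy, path)
    loaded = nx.read_gexf(path)

# the direction, node attributes and edge weights should survive the round trip
print(loaded.is_directed())
print(loaded.nodes["JFK"]["city"])
print(loaded["JFK"]["LAX"]["weight"])
```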

## Basic Characterisation

Using this network, we characterise its connectedness (i.e., its density and its number of components).

In [None]:
# how many nodes and edges are in the network?
print("Network has %d nodes and %d edges" % (g.number_of_nodes(), g.number_of_edges()))

In [None]:
# what is the density of the network?
print("Density = %.4f" % nx.density(g))

In [None]:
# how many strongly connected components?
nx.number_strongly_connected_components(g)
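
To make these measures concrete, here is a minimal sketch on a three-node toy network (the nodes `A`, `B`, `C` are invented for illustration). For a directed network with n nodes and m edges, the density is m / (n(n - 1)). The example also contrasts strongly connected components with weakly connected components, which ignore edge direction.

```python
import networkx as nx

# toy directed network: A and B are linked in both directions, C is only reachable from B
toy = nx.DiGraph([("A", "B"), ("B", "A"), ("B", "C")])

n = toy.number_of_nodes()
m = toy.number_of_edges()
# directed density: edges present divided by the n * (n - 1) possible directed edges
manual_density = m / (n * (n - 1))
print(manual_density, nx.density(toy))

# C cannot reach A or B, so it forms its own strongly connected component
num_strong = nx.number_strongly_connected_components(toy)
# ignoring edge direction, all three nodes are joined together
num_weak = nx.number_weakly_connected_components(toy)
print(num_strong, num_weak)
```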

## Measuring Centrality

Centrality analysis allows us to identify the most important nodes in a network. The actual definition of importance depends on the nature of the network, and many different centrality measures exist. NetworkX includes implementations of the most common measures.

The most basic measure of centrality, **degree centrality**, measures the number of connections that a node has. In the case of our airport data, it indicates the number of *routes* that an airport is involved in, counting both incoming and outgoing flights.

In [None]:
# We can use NetworkX to produce a dictionary of degree values, where the keys are the nodes.
deg = dict(nx.degree(g))
deg
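
Note that in a directed network the degree of a node counts incoming plus outgoing edges, so a route served in both directions contributes two. The sketch below, on an invented three-node network, also shows the normalised variant returned by `nx.degree_centrality`, which divides each degree by n - 1 and can therefore exceed 1 for a directed network.

```python
import networkx as nx

# toy directed network where A has return routes to both B and C
toy = nx.DiGraph([("A", "B"), ("B", "A"), ("A", "C"), ("C", "A")])

# raw degree counts incoming plus outgoing edges
raw = dict(toy.degree())
# normalised degree centrality divides each raw degree by (n - 1)
norm = nx.degree_centrality(toy)
print(raw["A"], norm["A"])
```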

We can use these scores to populate a Pandas *Data Frame* and display a ranking of the nodes by their degree scores:

In [None]:
# create a Pandas Data Frame from the degree scores, along with the node city attributes
df = pd.DataFrame({"City": nx.get_node_attributes(g, "city"), "Degree": pd.Series(deg)})
# sort in descending order and get top values
df = df.sort_values(by="Degree", ascending=False)
df.head(10)

We could also use these scores to identify peripheral airports, i.e. those involved in very few routes, whether incoming or outgoing:

In [None]:
# note we want the low values
df.sort_values(by="Degree", ascending=True).head(10)

We often want to look at the distribution of degree scores across all nodes in a network to see how they vary. This is usually plotted as a histogram, which shows us the network's **degree distribution**: 

In [None]:
# produce a histogram of the values
ax = df["Degree"].plot(kind="hist", figsize=(12, 5.5), fontsize=13, legend=None, color="darkred", 
    bins=20, zorder=3, rwidth=0.9)
ax.yaxis.grid()
ax.set_xlim(0)
ax.set_ylabel("Number of Nodes", fontsize=14)
ax.set_xlabel("Degree", fontsize=14);
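
If we want the exact counts behind such a histogram rather than binned values, NetworkX provides `nx.degree_histogram`. The following is a small sketch on an invented star-shaped network, not on the flight data.

```python
import networkx as nx

# toy undirected star: node 0 is the hub, nodes 1 to 4 are leaves
toy = nx.star_graph(4)

# counts[d] is the number of nodes with degree d
counts = nx.degree_histogram(toy)
for degree, count in enumerate(counts):
    print(degree, count)
```

Here four nodes have degree 1 and a single node has degree 4, which is the kind of skewed pattern we often see in hub-based networks.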

## In-Degree and Out-Degree Centrality

For directed networks, we usually distinguish between **in-degree** and **out-degree**:

- In-degree: the number of incoming edges that each node has.
- Out-degree: the number of outgoing edges that each node has.

In the case of our flight network, these measures allow us to identify airports which have many incoming routes or outgoing routes.

Firstly, we look at out-degree:

In [None]:
# calculate out-degree scores
out_degrees = dict(g.out_degree())
# add a column to our Data Frame
df["Out-Degree"] = pd.Series(out_degrees)
# sort in descending order and get top values
df.sort_values(by="Out-Degree", ascending=False).head(10)

We can repeat the process for in-degree:

In [None]:
# calculate in-degree scores
in_degrees = dict(g.in_degree())
# add a column to our Data Frame
df["In-Degree"] = pd.Series(in_degrees)
# sort in descending order and get top values
df.sort_values(by="In-Degree", ascending=False).head(10)

It appears that out-degree and in-degree scores for the nodes are generally very similar for this network. 

We could check this at a network level by creating a scatter plot of in-degree versus out-degree for all nodes.

In [None]:
# produce the scatter plot
ax = df.plot(kind="scatter", x="In-Degree", y="Out-Degree", figsize=(8, 7), 
    fontsize=13, color="teal", s=30)
ax.set_xlim(0)
ax.set_ylim(0)
ax.set_xlabel("In-Degree", fontsize=14)
ax.set_ylabel("Out-Degree", fontsize=14);
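
As a numerical complement to the scatter plot, we could also measure what fraction of nodes have identical in-degree and out-degree, and list the exceptions. The sketch below uses an invented four-node network and a separate `toy_df` Data Frame, so that it does not depend on the flight data.

```python
import networkx as nx
import pandas as pd

# toy directed network: A-B and A-C have return routes, C -> D is one-way
toy = nx.DiGraph([("A", "B"), ("B", "A"), ("A", "C"), ("C", "A"), ("C", "D")])
toy_df = pd.DataFrame({
    "In-Degree": pd.Series(dict(toy.in_degree())),
    "Out-Degree": pd.Series(dict(toy.out_degree())),
})

# True for nodes whose in-degree equals their out-degree
balanced = toy_df["In-Degree"] == toy_df["Out-Degree"]
# fraction of balanced nodes
print(balanced.mean())
# the nodes where the two scores differ
unbalanced = toy_df[~balanced]
print(unbalanced)
```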

## Weighted Centrality Measures

When we have *weights* on our edges, we can take these into account when measuring centrality. In a weighted network, the **weighted degree** is the sum of the weights on the edges connected to each node.

There are analogous weighted equivalents of in-degree and out-degree. We can use these to identify frequent origin and destination airports in the network (i.e., high weighted in-degree / out-degree).

Firstly we will look at **weighted out-degree** (i.e. the sum of the weights on outgoing edges):

In [None]:
# calculate weighted out-degree
wout_degrees = dict(g.out_degree(weight="weight"))
# add a column to our Data Frame
df["W-Out-Degree"] = pd.Series(wout_degrees)
# sort in descending order and get top values
df.sort_values(by="W-Out-Degree", ascending=False).head(10)

Next we look at **weighted in-degree** (i.e. the sum of the weights on incoming edges):

In [None]:
# calculate weighted in-degree
win_degrees = dict(g.in_degree(weight="weight"))
# add a column to our Data Frame
df["W-In-Degree"] = pd.Series(win_degrees)
# sort in descending order and get top values
df.sort_values(by="W-In-Degree", ascending=False).head(10)
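
To see the difference between the unweighted and weighted scores, consider this small worked example. The airport codes and weights are invented for illustration and are not taken from the real data.

```python
import networkx as nx

# toy directed network with hypothetical weights on each route
toy = nx.DiGraph()
toy.add_weighted_edges_from([
    ("JFK", "LAX", 10),
    ("JFK", "ORD", 5),
    ("LAX", "JFK", 8),
])

# unweighted out-degree: the number of outgoing routes from JFK
out_routes = toy.out_degree("JFK")
# weighted out-degree: the sum of the weights on those outgoing routes
out_weighted = toy.out_degree("JFK", weight="weight")
# weighted in-degree: the sum of the weights on routes arriving at JFK
in_weighted = toy.in_degree("JFK", weight="weight")
print(out_routes, out_weighted, in_weighted)
```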

Again we could look at the relationship between these two weighted centrality scores in the network:

In [None]:
# produce a scatter plot
ax = df.plot(kind="scatter", x="W-In-Degree", y="W-Out-Degree", figsize=(8, 7), 
    fontsize=13, color="teal", s=30)
ax.set_xlim(0)
ax.set_ylim(0)
ax.set_xlabel("Weighted In-Degree", fontsize=14)
ax.set_ylabel("Weighted Out-Degree", fontsize=14);

We could also look at the overall distributions for the weighted in-degree and weighted out-degree scores:

In [None]:
# create a figure with two side-by-side subplots that share the y-axis
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(13, 5))
# plot the weighted in-degree distribution on the first subplot
df["W-In-Degree"].plot(kind="hist", ax=ax1, fontsize=13, legend=None, color="darkorange",
    bins=20, zorder=3, rwidth=0.9)
ax1.yaxis.grid()
ax1.set_xlim(0)
ax1.set_ylabel("Number of Nodes", fontsize=14)
ax1.set_xlabel("Weighted In-Degree", fontsize=14)
# plot the weighted out-degree distribution on the second subplot
df["W-Out-Degree"].plot(kind="hist", ax=ax2, fontsize=13, legend=None, color="purple",
    bins=20, zorder=3, rwidth=0.9)
ax2.yaxis.grid()
ax2.set_xlim(0)
ax2.set_ylabel("Number of Nodes", fontsize=14)
ax2.set_xlabel("Weighted Out-Degree", fontsize=14);

## Other Centrality Measures

Going beyond counting edges and weights, we can use **betweenness centrality** to identify bridging nodes in a network. Nodes that occur on many shortest paths between other nodes in the network have high betweenness centrality.

In our flight network, we could use this measure to identify key hub airports, which will have high betweenness:

In [None]:
# calculate betweenness centrality scores
between_scores = nx.betweenness_centrality(g)
# add a column to our Data Frame
df["Between"] = pd.Series(between_scores)
# sort in descending order and get top values
df.sort_values(by="Between", ascending=False).head(10)
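
The idea is easiest to see on a toy example. In the sketch below (node names invented for illustration), every leaf airport has a return route to a single hub, so every shortest path between two leaves passes through the hub. Note also that if a `weight` argument is passed, NetworkX interprets it as a distance when finding shortest paths, so weights that represent traffic volume would need converting first.

```python
import networkx as nx

# toy directed network: every leaf airport has a return route to HUB only
leaves = ["A", "B", "C"]
toy = nx.DiGraph()
for leaf in leaves:
    toy.add_edge(leaf, "HUB")
    toy.add_edge("HUB", leaf)

# normalised betweenness: the share of shortest paths passing through each node
scores = nx.betweenness_centrality(toy)
for node, score in scores.items():
    print(node, score)
```

The hub lies on all shortest paths between the leaves and so receives the maximum normalised score, while the leaves bridge nothing and score zero.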

The **eigenvector centrality** of a node is proportional to the sum of the centrality scores of its neighbours. This means that a node is important if it is connected to other important nodes.

In [None]:
# calculate eigenvector centrality scores
eig_scores = nx.eigenvector_centrality(g)
# add a column to our Data Frame
df["Eigenvector"] = pd.Series(eig_scores)
# sort in descending order and get top values
df.sort_values(by="Eigenvector", ascending=False).head(10)
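
The following sketch illustrates how eigenvector centrality differs from degree. It uses an invented undirected toy network (an assumption made for simplicity, since our flight network is directed), in which nodes `B` and `D` have the same degree but different neighbours. The `max_iter` value is raised here simply to give the iterative computation plenty of room to converge.

```python
import networkx as nx

# toy undirected network: A is the best-connected node, E is a leaf attached to D
toy = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")])

scores = nx.eigenvector_centrality(toy, max_iter=1000)

# B and D have the same number of neighbours...
print(toy.degree("B"), toy.degree("D"))
# ...but B's neighbours (A and C) are better connected than D's (A and E)
print(scores["B"] > scores["D"])
```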

Different measures will be appropriate in different contexts:

- Degree centrality: when the number of connections is important
- Betweenness centrality: when control over transmission is important
- Closeness centrality: when time taken to reach nodes is important
- Eigenvector centrality: when influence of neighbours is important
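
Closeness centrality is listed above but not computed elsewhere in this demo. As a minimal sketch, the example below applies `nx.closeness_centrality` to an invented undirected path of five nodes. A node's score is n - 1 divided by the sum of its shortest-path distances to the other nodes, so the middle node scores highest. For a directed network, NetworkX measures distances along incoming paths by default.

```python
import networkx as nx

# toy undirected path: A - B - C - D - E
toy = nx.path_graph(["A", "B", "C", "D", "E"])

scores = nx.closeness_centrality(toy)

# C is at distance 1, 1, 2, 2 from the other nodes, so its score is 4 / 6
# A is at distance 1, 2, 3, 4 from the other nodes, so its score is 4 / 10
for node, score in scores.items():
    print(node, round(score, 3))
```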

In [None]:
# look at the correlation between the different centrality measures
# (only the numeric columns, since "City" contains text)
df.select_dtypes(include="number").corr()
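
Centrality scores are often heavily skewed, so it can also be worth comparing the rankings that two measures produce rather than their raw values. As a hedged sketch on invented numbers, the example below contrasts the default Pearson correlation with the rank-based Spearman correlation, using the `method` parameter of `corr()`.

```python
import pandas as pd

# hypothetical scores: both columns rank the nodes identically,
# but the relationship between the raw values is far from linear
toy_df = pd.DataFrame({
    "Degree": [1, 2, 3, 4, 5],
    "Between": [0.0, 0.01, 0.02, 0.05, 0.9],
})

pearson = toy_df.corr(method="pearson")
spearman = toy_df.corr(method="spearman")
print(pearson.loc["Degree", "Between"])
print(spearman.loc["Degree", "Between"])
```

The Spearman score reflects the identical rankings, while the Pearson score is pulled down by the single very large value.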