# Network thresholding and spanning trees: visualizing the US air traffic network
In this exercise, we will get familiar with different approaches to thresholding networks and learn how they can be used for efficiently visualizing weighted networks.

We will work with an undirected network describing air traffic in the US between the 14th and 23rd of December 2008 (Data source: Bureau of Transportation Statistics). In the network, each node corresponds to an airport, and the weights of links are equal to the number of flights between the two linked airports during the period.

Source files:
* The network is given in the file `aggregated_US_air_traffic_network_undir.edg`
* The CSV file `US_airport_id_info.csv` contains information about the names and locations of the airports.
* The file `US_air_bg.png` contains a map of the US to be used as a background image for visualizations.


In [None]:
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from PIL import Image

In [None]:
# NO NEED TO MODIFY THIS FUNCTION
def visualize_US_network(ax, net, xycoords, bg_figname, edges=None, alpha=0.3):
    """
    Plot the network on top of a map of the US.

    Parameters
    ----------
    ax : matplotlib axis object
    net : the network to be plotted
    xycoords : dictionary of node_id to coordinates (x,y)
    bg_figname : file name for the background map figure
    edges : list of node index tuples (node_i,node_j),
            if None all network edges are plotted.
    alpha : float between 0 and 1, describing the level of
            transparency
    """
    img = Image.open(bg_figname)
    axis_extent = (-6674391.856090588, 4922626.076444283,
                   -2028869.260519173, 4658558.416671531)
    ax.imshow(img, extent=axis_extent)
    ax.set_xlim((axis_extent[0], axis_extent[1]))
    ax.set_ylim((axis_extent[2], axis_extent[3]))
    ax.set_axis_off()
    nx.draw_networkx(net,
                     pos=xycoords,
                     with_labels=False,
                     node_color='k',
                     node_size=5,
                     edge_color='r',
                     alpha=alpha,
                     edgelist=edges)
    return ax

## Data
Let us load the data from the right directory. If you run this notebook on your own machine, set `course_data_dir` to the directory that contains the source files.

In [None]:
# Select data directory
import os
if os.path.isdir('/coursedata'):
    course_data_dir = '/coursedata'
elif os.path.isdir('../data'):
    course_data_dir = '../data'
else:
    # Specify course_data_dir on your machine
    course_data_dir = '.'

print('The data directory is %s' % course_data_dir)

csv_path = os.path.join(course_data_dir, 'US_airport_id_info.csv')
network_path = os.path.join(course_data_dir, 'aggregated_US_air_traffic_network_undir.edg')
bg_figname = os.path.join(course_data_dir, 'US_air_bg.png')

id_data = np.genfromtxt(csv_path, delimiter=',', dtype=None, names=True, encoding='utf8') 
xycoords = {} # A dictionary of node coordinates for visualization
for row in id_data:
    xycoords[int(row['id'])] = (row['xcoordviz'], row['ycoordviz'])
net = nx.read_weighted_edgelist(network_path, nodetype=int)

### a. Basic properties
Let us first check the basic network properties to get an overview of the network. **Compute** the following quantities
- Number of nodes $N$
- Number of links $L$
- Network density $\rho$
- Average clustering coefficient $C$

**Hints**:
- For the clustering coefficient, consider the undirected and unweighted version of the network, where two airports are linked if there is a flight between them in either direction.
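
As a quick sanity check of these quantities, the following minimal sketch computes them on a small hypothetical toy graph (not the air traffic data). It assumes the usual density definition for an undirected simple graph, $\rho = 2L / (N(N-1))$, and relies on `nx.average_clustering` ignoring link weights when its `weight` argument is left at the default `None`.

```python
import networkx as nx

# Hypothetical toy graph: a triangle 0-1-2 plus a pendant node 3 attached to node 2
toy = nx.Graph()
toy.add_weighted_edges_from([(0, 1, 5.0), (1, 2, 3.0), (0, 2, 1.0), (2, 3, 2.0)])

n_nodes = toy.number_of_nodes()
n_links = toy.number_of_edges()

# Density of an undirected simple graph: rho = 2L / (N (N - 1))
rho_manual = 2 * n_links / (n_nodes * (n_nodes - 1))
rho_nx = nx.density(toy)

# With the default weight=None, the link weights are ignored,
# so this is the unweighted average clustering coefficient
c_avg = nx.average_clustering(toy)

print('N = %d, L = %d' % (n_nodes, n_links))
print('Density (manual): %f, density (networkx): %f' % (rho_manual, rho_nx))
print('Average clustering coefficient: %f' % c_avg)
```

The manual density and `nx.density` should agree, which is a useful check before applying the same calls to the full network.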

In [None]:
# TODO: Calculate basic properties of the network
# Your solution here
# YOUR CODE HERE
N = net.number_of_nodes()
L = net.number_of_edges()
rho = nx.density(net)
# weight=None (the default) gives the unweighted clustering coefficient
C = nx.average_clustering(net)

print('Number of nodes: %d' % N)
print('Number of links: %d' % L)
print('Density: %f' % rho)
print('Average clustering coefficient: %f' % C)

### b. Visualization
**Visualize** the full network with all links on top of the map of the US. The resulting figure is somewhat messy due to the large number of visible links.

**Hint**: Use the above function `visualize_US_network`.

In [None]:
fig_full, ax_full = plt.subplots(figsize=(8, 5))
# TODO: Plot the full network on ax_full
# Your solution here
# YOUR CODE HERE
visualize_US_network(ax_full, net, xycoords, bg_figname)
fig_full.suptitle('Full network, 100% of links')

In [None]:
# Save the figure in the current directory
fig_full.savefig('air_traffic_full.pdf')

### c. Maximum and minimum spanning trees
To reduce the number of plotted links, **compute** both the minimum and the maximum spanning tree of the network and **visualize** them. Then, **answer** the following question:
- If you wanted to understand the overall organization of air traffic in the US, would you use the minimum or the maximum spanning tree? Why?

**Hint**: For computing minimum spanning trees, use `nx.minimum_spanning_tree`. For computing maximum spanning trees, use `nx.maximum_spanning_tree`.
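
To see how the two trees differ, here is a minimal sketch on a small hypothetical weighted graph (the node labels and weights are made up for illustration). Both functions use the `weight` edge attribute by default. Each tree has the same number of links, but they keep opposite ends of the weight distribution.

```python
import networkx as nx

# Hypothetical toy graph with four nodes and five weighted links
toy = nx.Graph()
toy.add_weighted_edges_from([
    ('A', 'B', 10), ('B', 'C', 8), ('A', 'C', 1), ('C', 'D', 2), ('B', 'D', 7),
])

# Both functions read the 'weight' edge attribute by default
min_tree = nx.minimum_spanning_tree(toy)
max_tree = nx.maximum_spanning_tree(toy)

# Graph.size(weight=...) returns the sum of the link weights
min_total = min_tree.size(weight='weight')
max_total = max_tree.size(weight='weight')

print('Minimum spanning tree links:', sorted(min_tree.edges()), 'total weight:', min_total)
print('Maximum spanning tree links:', sorted(max_tree.edges()), 'total weight:', max_total)
```

In the air traffic network the weights are flight counts, so the maximum spanning tree retains the busiest connections while the minimum spanning tree retains the least used ones.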

In [None]:
fig_minst, ax_minst = plt.subplots(figsize=(8, 5))
n_minst = 0 # REPLACE! number of links in minimum spanning tree

# TODO: Obtain the minimum spanning tree and plot it on ax_minst
# Your solution here
# YOUR CODE HERE
minst = nx.minimum_spanning_tree(net)
n_minst = minst.number_of_edges()
visualize_US_network(ax_minst, minst, xycoords, bg_figname)


fig_minst.suptitle('Minimum spanning tree, L={}'.format(n_minst))

In [None]:
# Save the figure in the current directory
fig_minst.savefig('air_traffic_minst.pdf')

In [None]:
fig_maxst, ax_maxst = plt.subplots(figsize=(8, 5))
n_maxst = 0 # REPLACE! number of links in maximum spanning tree

# TODO: Obtain the maximum spanning tree and plot it on ax_maxst
# Your solution here
# YOUR CODE HERE
maxst = nx.maximum_spanning_tree(net)
n_maxst = maxst.number_of_edges()
visualize_US_network(ax_maxst, maxst, xycoords, bg_figname)

fig_maxst.suptitle('Maximum spanning tree, L={}'.format(n_maxst))

In [None]:
# Save the figure in the current directory
fig_maxst.savefig('air_traffic_maxst.pdf')

### d. Thresholded networks
**Threshold** and **visualize** the network so that only the strongest $M$ links remain in it, where $M = N - 1$ is the number of links in the maximum spanning tree. Then, **answer** the following questions:
- How many links does the thresholded network share with the maximum spanning tree?
- Given this number and the visualizations, does simple thresholding yield a network similar to the maximum spanning tree?

**Hint**: Note that the network is undirected, which means that edge $(i, j)$ is the same as edge $(j, i)$. When comparing the set of edges, however, the order of the nodes in the edge tuple matters. Make sure that each edge is represented consistently across the two networks when comparing them.
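
One possible way to make the comparison independent of node order is to represent each edge as a `frozenset` of its two endpoints. The sketch below illustrates this on two hypothetical toy graphs; the helper `undirected_edge_set` is a name introduced here for illustration and is not part of NetworkX.

```python
import networkx as nx


def undirected_edge_set(graph):
    """Return the edges of an undirected graph as a set of frozensets,
    so that (u, v) and (v, u) map to the same element."""
    return {frozenset((u, v)) for u, v in graph.edges()}


# Hypothetical toy graphs that share the link between nodes 1 and 2,
# entered with opposite node order
g1 = nx.Graph([(1, 2), (2, 3)])
g2 = nx.Graph([(2, 1), (3, 4)])

# Comparing raw tuples can miss shared links whose node order differs
naive_common = set(g1.edges()) & set(g2.edges())

# Comparing frozensets treats (1, 2) and (2, 1) as the same link
common_edges = undirected_edge_set(g1) & undirected_edge_set(g2)

print('Common links (raw tuples):', len(naive_common))
print('Common links (order-independent):', len(common_edges))
```

An alternative with the same effect is to sort the endpoints of every tuple, or to add both orientations of each edge to one of the sets before intersecting.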

In [None]:
fig_thr, ax_thr = plt.subplots(figsize=(8, 5))

# TODO: Obtain the thresholded network and visualize it
# Then calculate the number of common links between the thresholded network and the maximum spanning tree
# This can be done, e.g., by storing the links of both in a set and using the intersection method

common=set() # Write your code so that this variable will contain the edges in both networks (thresholded, spanning tree)

# YOUR CODE HERE
M = maxst.number_of_edges()

edges_sorted = sorted(net.edges(data=True), key=lambda x: x[2]['weight'], reverse=True)
thresholded_edges = edges_sorted[:M]

G_thr = nx.Graph()
# Each item is (u, v, data_dict); only the endpoints are needed for the thresholded graph
G_thr.add_edges_from((u, v) for u, v, _ in thresholded_edges)

visualize_US_network(ax_thr, G_thr, xycoords, bg_figname)

# Include both orientations of each tree edge so that node order does not matter
maxst_edges = set(maxst.edges())
maxst_edges |= {(v, u) for u, v in maxst.edges()}
common = set(G_thr.edges()) & maxst_edges

print("Number of common edges with MaxST:", len(common))
print("{:.1f} % of links are the same.".format((len(common) * 100.0) / n_maxst))

fig_thr.suptitle('{} ({:.1f} %) strongest links of the original network'\
    .format(n_maxst, (n_maxst * 100.0) / L))

In [None]:
# Save the figure in the current directory
fig_thr.savefig('air_traffic_thresholded.pdf')