In [None]:
import numpy as np
import networkx as nx
from matplotlib import pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings( 'ignore' )

## ToDo

1. Your Network Summary  
    * Network source and preprocessing
    * Node/Edge attributes
    * Size, Order
    * Gorgeous network layout. Try to show that your network has some structure, play with node sizes and colors, scaling parameters, tools like Gephi may be useful here
    * Degree distribution, Diameter, Clustering Coefficient
2. Structural Analysis  
    * Degree/Closeness/Betweenness centralities. Top nodes interpretation
    * Page-Rank. Comparison with centralities
    * Assortative Mixing according to node attributes
    * Node structural equivalence/similarity
3. Community Detection  
    * Clique search
    * Best results of various community detection algorithms, both in terms of interpretation and some quality criterion. Since Networkx has no community detection algorithms, use additional modules e.g. igraph, communities, graph-tool, etc
    * The results should be visible on the network layout or adjacency matrix picture

# <center>Structural Analysis and Visualization of Networks</center>

## <center>Analysis of facebook graph</center>

### <center>Student: *Nazarov Ivan*</center>

## Summary
### Network source
This graph shows friend relationships among the people in my Facebook friends list. The network was obtained with the Netviz Facebook app. As a purely technical step, prior to loading with the networkx procedure $\text{read_gml}(\cdot)$, the GML file was preprocessed to convert UTF-8 encoded characters into HTML entities. The problem seems to be rooted in the software used to crawl the Facebook network.

### Attributes
The nodes have a short list of attributes which are 
* gender;
* number of posts on the wall;
* locale, which represents the language setting of that node's facebook page.

The network does not have any edge attributes.

In [None]:
G = nx.read_gml( path =
	"./data/ha5/huge_100004196072232_2015_03_24_11_20_1d58b0ecdf7713656ebbf1a177e81fab.gml", relabel = False )

The order of a network $G=(V,E)$ is $|V|$ and the size is $|E|$.

In [None]:
print "The network G is of the order %d. Its size is %d." % ( G.number_of_nodes( ), G.number_of_edges( ) )

### Visualisation

It is always good to have a nice and attractive picture in a study. 

In [None]:
deg = G.degree( )
fig = plt.figure( figsize = (12,8) )
axs = fig.add_subplot( 1,1,1, axisbg = 'black' )
nx.draw_networkx( G, with_labels = False, ax = axs,
     cmap = plt.cm.Purples, node_color = deg.values( ), edge_color = "magenta",
     nodelist = deg.keys( ), node_size = [ 100 * np.log( d + 1 ) for d in deg.values( ) ],
     pos = nx.fruchterman_reingold_layout( G ), )

Let's have a look at the connected components, since the plot suggests that the graph is not connected.

In [None]:
CC = sorted( nx.connected_components( G ), key = len, reverse = True )
for i, c in enumerate( CC, 1 ):
    row = ", ".join( [ G.node[ n ][ 'label' ] for n in c] )
    print "%#2d (%d)\t"%(i, len(c)), ( row )[:100].strip() + (" ..." if len( row ) > 100 else "" )

The largest connected component represents family and my acquaintances at school ($\leq 2003$) and in university ($2003-2009$), and the second largest component consists of people I met at Oxford Royale Summer School in 2012. The one-node components are either old acquaintances, select colleagues from work, instructors, etc.  

Since the largest component is an order of magnitude larger than the next biggest, I decided to focus just on it rather than on the whole network. In fact it covers about $\frac{91}{121}\approx 75\%$ of vertices and $\frac{1030}{1091} \approx 94\%$ of edges.

In [None]:
H = G.subgraph( CC[ 0 ] )
print "The largest component is of the order %d. Its size is %d." % ( H.number_of_nodes( ), H.number_of_edges( ) )

Let's plot the subgraph and study its degree distribution.

In [None]:
deg = H.degree( )
fig = plt.figure( figsize = (16, 6) )
axs = fig.add_subplot( 1,2,1, axisbg = 'black', title = "Master cluster", )
pos = nx.fruchterman_reingold_layout( H )
nx.draw_networkx( H, with_labels = False, ax = axs,
     cmap = plt.cm.Oranges, node_color = deg.values( ), edge_color = "cyan",
     nodelist = deg.keys( ), node_size = [ d * 10 for d in deg.values( ) ],
     pos = pos )
## Degree distribution
v, f = np.unique( nx.degree( H ).values( ), return_counts = True)
axs = fig.add_subplot( 1,2,2, xlabel = "Degree", ylabel = "Frequency", title = "Node degree frequency" )
axs.plot( v, f, "ob" )

### Degree distribution
A useful tool for exploring the tail behaviour of a sample is the **M**ean **E**xcess plot, which displays the mean excess function, defined as the conditional expectation  

$$M(u) = \mathbb{E}\Big(\Big. X-u\,\big.\big\rvert\,X\geq u \Big.\Big)$$
of which the empirical counterpart is  
$$\hat{M}(u) = {\Big(\sum_{i=1}^n 1_{x_i\geq u}\Big)^{-1}}\sum_{i=1}^n (x_i-u) 1_{x_i\geq u}$$
The key properties of $M(u)$ are
 * it steadily increases for power-law tails, and the steeper the slope, the smaller the tail exponent $\alpha$;
 * it levels off for exponential tails (heuristically: the case when $\alpha\to \infty$ is similar to an exponential tail);
 * it decays towards zero for a tail of a compactly supported distribution.
When dealing with the empirical mean excesses one looks for the trend at large thresholds to discern the behaviour, bearing in mind that in that region the variance of $\hat{M}(u)$ grows.
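
As a cross-check of the definition of $\hat{M}(u)$, here is a minimal sketch that evaluates it directly at a given threshold. The helper name `empirical_mean_excess` and the toy degree sequence are assumptions made for illustration only; they are not part of the notebook's own `mean_excess` procedure.

```python
def empirical_mean_excess(sample, u):
    """Average of (x - u) over the observations with x >= u."""
    tail = [x - u for x in sample if x >= u]
    if not tail:
        raise ValueError("no observation reaches the threshold")
    return sum(tail) / float(len(tail))

# Hypothetical toy degree sequence, not taken from the Facebook graph.
toy_degrees = [1, 2, 2, 3, 5, 8]
for u in sorted(set(toy_degrees)):
    print("u = %d, mean excess = %.3f" % (u, empirical_mean_excess(toy_degrees, u)))
```

At the largest observation the excess is zero by construction, and the estimates just below it rest on only a handful of points, which is why the right end of an empirical plot is the least reliable.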

In [None]:
from scipy.stats import rankdata
def mean_excess( data ) :
    data = np.array( sorted( data, reverse = True ) )
    ranks = rankdata( data, method = 'max' )
    excesses = np.array( np.unique( len( data ) - ranks ), dtype = np.int )
    thresholds = data[ excesses ]
    mean_excess = np.cumsum( data )[ excesses ] / ( excesses + 0.0 ) - thresholds
    return thresholds, mean_excess

In [None]:
plt.figure( figsize = ( 8, 6 ) ) 
u, m = mean_excess( H.degree().values() )
plt.plot( u, m, lw = 2 )
plt.title( "Mean Excess plot of node-degree" )
plt.xlabel( "Threshold" )
plt.ylabel( "Expected excess over the threshold")

The Mean Excess plot seems to indicate that the node degree does not follow a scale-free distribution. Indeed, the plot levels off as it approaches the value $50$. The rightmost spike is in the region where the variance of the estimate of the conditional expectation is extremely high, which is why this finite-sample artefact may be ignored.

### Clustering tightness
The average clustering coefficient of a graph $G=(V,E)$ is defined by the following formula :  

$$\bar{c} = \frac{1}{n}\sum_{x\in V}c_x$$
where $n=|V|$ and $c_x$ is the local clustering coefficient of vertex $x\in V$ defined below.

The local (triangular) clustering coefficient of a node $x\in V$ is defined as the ratio of the number of triangles containing $x$ to the largest number of triangles that could contain it, $\binom{\delta_x}{2}$, which is attained when all $\delta_x$ neighbours of $x$ are pairwise adjacent. Here $\delta_x$ is the degree of $x$ in $G$.

The expression for $c_x$ is
$$c_x = \frac{1}{\delta_x (\delta_x-1)} \sum_{u\neq x} \sum_{v\neq x,u} 1_{xu} 1_{uv} 1_{vx} = \frac{1}{\delta_x (\delta_x-1)} \#_{x}$$  

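where $1_{ij}$ is the indicator equal to $1$ if the (undirected) edge $(i,j)\in E$ and $0$ otherwise, and $\#_x$ is the number of ordered pairs of adjacent neighbours of $x$, i.e. twice the number of triangles containing $x$.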
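
To make the definition concrete, the following sketch computes $c_x$ and $\bar{c}$ on a tiny graph stored as a dictionary of neighbour sets. The function name and the toy graph are assumptions for illustration. Nodes with fewer than two neighbours are given a coefficient of zero, which is also the convention networkx follows.

```python
from itertools import combinations

def local_clustering(adj, x):
    """Fraction of pairs of neighbours of x that are themselves adjacent."""
    nbrs = sorted(adj[x])
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Count unordered pairs of neighbours joined by an edge (triangles through x).
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))

# Hypothetical toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 0.
toy_adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
coeffs = {x: local_clustering(toy_adj, x) for x in toy_adj}
average = sum(coeffs.values()) / len(coeffs)
print("local: %s, average: %.3f" % (coeffs, average))
```

Vertex 0 has three neighbours but only one adjacent pair among them, so its coefficient is $1/3$, while vertices 1 and 2 sit in a closed triangle and score $1$.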

In [None]:
print "This subgraph's clustering coefficient is %.3f." % nx.average_clustering( H )
print "This subgraph's average shortest path length is %.3f." % nx.average_shortest_path_length( H )
print "The radius (minimal eccentricity) is %d." % nx.radius( H )

The clustering coefficient is moderately high and any two members in this component are 2 hops away from each other on average. This means that this subgraph has a tightly knit cluster structure, almost like a small world, were it not for the light-tailed degree distribution.

## Structural analysis

### Centrality measures
#### Degree
The degree centrality of a node $v\in V$ in a graph $G=\big(V, E\big)$ is the number of edges incident on it:

$$C_v = \sum_{u\in V} 1_{(v,u)\in E} = \sum_{u\in V} A_{vu} = \delta_v$$

In other words, the more first-tier (nearest, reachable in one hop) neighbours a vertex has, the higher its centrality.

#### Betweenness
This measure assesses how important a node is in terms of the global graph connectivity:

$$C_B(v) = \sum_{s\neq v\neq t\in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

where $\sigma_{st}(v)$ is the number of **shortest** paths from $s$ to $t$ passing through $v$, while $\sigma_{st}$ is the total number of paths of least length connecting $s$ and $t$.
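
The formula can be evaluated by brute force on a small graph, as in the sketch below. It relies on the fact that $v$ lies on a shortest $s$-$t$ path exactly when $d(s,v)+d(v,t)=d(s,t)$, in which case $\sigma_{st}(v)=\sigma_{sv}\sigma_{vt}$. The function names and toy graph are assumptions for illustration. The sum runs over ordered pairs $(s,t)$, so each unordered pair contributes twice, and no normalisation is applied, whereas `nx.betweenness_centrality` rescales its output by default.

```python
from collections import deque

def bfs_counts(adj, s):
    """Distances from s and the number of shortest paths to every reachable node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(adj):
    """Unnormalised betweenness, summed over ordered pairs (s, t)."""
    info = {s: bfs_counts(adj, s) for s in adj}
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        dist_s, sigma_s = info[s]
        for t in adj:
            if t == s or t not in dist_s:
                continue
            for v in adj:
                if v == s or v == t or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v is on a shortest s-t path iff d(s,v) + d(v,t) = d(s,t)
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += sigma_s[v] * sigma_v[t] / float(sigma_s[t])
    return score

# Hypothetical toy graph: the path 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(betweenness(path))
```

On the path the two inner vertices carry all the traffic between the ends, while the end vertices score zero. This cubic-time loop is meant for reading, not for graphs of realistic size.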

High local centrality means that a node is in direct contact with many other nodes, whereas low centrality indicates a peripheral vertex.

Along with these measures, compute the closeness centrality and the PageRank ranking.
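
PageRank is not defined in the text above, so the sketch below assumes its standard formulation: the stationary distribution of a random walk that follows an edge with probability $\alpha$ and jumps to a uniformly chosen node otherwise. The function name, tolerance and toy graph are illustrative assumptions, and the code assumes an undirected graph without isolated nodes.

```python
def pagerank(adj, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank on an undirected graph without isolated nodes."""
    n = len(adj)
    rank = dict.fromkeys(adj, 1.0 / n)
    for _ in range(max_iter):
        # Every node receives the uniform teleportation mass first.
        new = dict.fromkeys(adj, (1.0 - alpha) / n)
        for u in adj:
            # A node splits its damped score evenly among its neighbours.
            share = alpha * rank[u] / len(adj[u])
            for w in adj[u]:
                new[w] += share
        if sum(abs(new[v] - rank[v]) for v in adj) < tol:
            return new
        rank = new
    return rank

# Hypothetical toy graph: a star with centre 0 and three leaves.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
scores = pagerank(star)
print(scores)
```

The damping value $0.85$ matches the one passed to networkx in this notebook. On the star the centre collects the largest score, in agreement with its degree.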

In [None]:
pr = nx.pagerank_numpy( H, alpha = 0.85 )
cb = nx.centrality.betweenness_centrality( H )
cc = nx.centrality.closeness_centrality( H )
cd = nx.centrality.degree_centrality( H )

#### The mixing coefficient

The mixing coefficient for a numerical node attribute $X = \big(x_i\big)$ in an undirected graph $G$, with the adjacency matrix $A$, is defined as

$$\rho(x) = \frac{\text{cov}}{\text{var}} = \frac{\sum_{ij}A_{ij}(x_i-\bar{x})(x_j-\bar{x})}{\sum_{ij}A_{ij}(x_i-\bar{x})^2} $$

where $\bar{x} = \frac{1}{2m}\sum_i \delta_i x_i$ is the mean value of $X$ weighted by vertex degree. Note that $A$ is necessarily symmetric. This coefficient can be represented in the matrix notation as  

$$\rho(x) = \frac{X'AX - 2m \bar{x}^2}{X'\text{diag}(D)X - 2m \bar{x}^2} $$

where the diagonal matrix $\text{diag}(D)$ is the matrix of vertex degrees, and $\bar{x}$ is the degree-weighted mean of the numerical node attribute $X$ defined above.
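
The matrix form follows by expanding the products and using $\sum_j A_{ij} = \delta_i$, $\sum_i \delta_i = 2m$ and $\sum_i \delta_i x_i = 2m\bar{x}$, with $\bar{x}$ the degree-weighted mean. As a worked step for the numerator and the denominator:

$$\sum_{ij}A_{ij}(x_i-\bar{x})(x_j-\bar{x}) = \sum_{ij}A_{ij}x_i x_j - 2\bar{x}\sum_i \delta_i x_i + \bar{x}^2\sum_i \delta_i = X'AX - 2m \bar{x}^2$$

$$\sum_{ij}A_{ij}(x_i-\bar{x})^2 = \sum_i \delta_i x_i^2 - 2\bar{x}\sum_i \delta_i x_i + \bar{x}^2\sum_i \delta_i = X'\text{diag}(D)X - 2m \bar{x}^2$$

In both lines the last two terms combine as $-4m\bar{x}^2 + 2m\bar{x}^2 = -2m\bar{x}^2$.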

In [None]:
def assortativity( G, X ) :
## represent the graph in an adjacency matrix form
    A = nx.to_numpy_matrix( G, dtype = np.float, nodelist = G.nodes( ) )
## Convert x -- dictionary to a numpy vector
    x = np.array( [ X[ n ] for n in G.nodes( ) ] , dtype = np.float )
## Compute the x'Ax part
    xAx = np.dot( x, np.array( A.dot( x ) ).flatten( ) )
##  and the x'\text{diag}(D)x part. Note that left-multiplying a vector
##  by a diagonal matrix is equivalent to element-wise multiplication.
    D = np.array( A.sum( axis = 1 ), dtype = np.float ).flatten( )
    xDx = np.dot( x, np.multiply( D, x ) )
## numpy.average( ) actually normalizes the weights.
    x_bar = np.average( x, weights = D )
    D_sum = np.sum( D, dtype = np.float )
    return ( xAx - D_sum * x_bar * x_bar ) / ( xDx - D_sum * x_bar * x_bar )

Let's compute the assortativity for the centralities, pagerank vector, vertex degrees and node attributes.

In [None]:
print "PageRank assortativity coefficient: %.3f" % assortativity( H, nx.pagerank_numpy( H, alpha = 0.85 ) )
print "Betweenness centrality assortativity coefficient: %.3f" % assortativity( H, nx.centrality.betweenness_centrality( H ) )
print "Closeness centrality assortativity coefficient: %.3f" % assortativity( H, nx.centrality.closeness_centrality( H ) )
print "Degree assortativity coefficient: %.3f" % assortativity( H, nx.centrality.degree_centrality( H ) )

In [None]:
print "Gender assortativity coefficient: %.3f" % nx.assortativity.attribute_assortativity_coefficient( H, 'sex' )
print "Agerank assortativity coefficient: %.3f" % assortativity( H, nx.get_node_attributes( H, 'agerank') )
print "Language assortativity coefficient: %.3f" % nx.assortativity.attribute_assortativity_coefficient( H, 'locale' )
print "Number of posts on the wall assortativity coefficient: %.3f" % nx.assortativity.attribute_assortativity_coefficient( H, 'wallcount' )

This component does not show segregation patterns in connectivity: the computed coefficients indicate neither that "opposites" attach nor that "kindred spirits" do. The noticeably high value for degree centrality is probably due to the component already having a tight cluster structure.

### Node Rankings

It is sometimes interesting to look at a table representation of a symmetric distance matrix. The procedure below prints a matrix in a more straightforward format.

In [None]:
## Print the upper triangle of a symmetric matrix in reverse column order
def show_symmetric_matrix( A, labels, diag = False ) :
    d = 0 if diag else 1
    c = len( labels ) - d
    print "\t", "\t".join( c * [ "%.3s" ] ) % tuple( labels[ d: ][ ::-1 ] )
    for i, l in enumerate( labels if diag else labels[ :-1 ] ) :
        print ( ( "%4s\t" % l ) + "\t".join( ( c - i ) * [ "%.3f" ] ) %
                tuple( A[ i,i+d: ][ ::-1 ] ) )

It is actually interesting to compare the orderings produced by different vertex-ranking algorithms. The most direct way is to analyse pairwise Spearman's $\rho$, since it compares the rank transformation of one vector of observed data to that of another.
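
Spearman's $\rho$ is the Pearson correlation computed on ranks, which the dependency-free sketch below spells out. The helper names are assumptions for illustration; the notebook itself relies on `scipy.stats.spearmanr`. Tied values share the average of their positions.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def pearson(a, b):
    n = float(len(a))
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman(a, b):
    """Pearson correlation of the rank-transformed vectors."""
    return pearson(average_ranks(a), average_ranks(b))

# Any monotone increasing transformation leaves the ranks, and hence rho, unchanged.
print(spearman([1, 2, 3, 4], [10, 100, 1000, 10000]))
```

Because only the ordering matters, two centrality vectors on very different scales can still be compared directly.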

In [None]:
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr as rho
labels = [ 'btw', 'deg', 'cls', 'prk' ]
align = lambda dd : np.array( [ dd[ n ] for n in H.nodes( ) ], dtype = np.float )
rank_dist = squareform( pdist(
    [ align( cb ), align( cd ), align( cc ), align( pr ) ],
        metric = lambda a, b : rho(a,b)[0] ) )
show_symmetric_matrix( rank_dist, labels )

The rankings match each other very closely!

## Community detection

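The $k$-clique community detection method (clique percolation) considers a set of nodes a community if it is the union of $k$-cliques that can be reached from one another through a chain of adjacent $k$-cliques, where two $k$-cliques are adjacent if they share $k-1$ vertices. Every node of a community is therefore part of at least one $k$-clique, and communities may overlap.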
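
A brute-force sketch of clique percolation on a toy graph is given below. It enumerates all $k$-subsets, keeps the cliques, and merges cliques that share $k-1$ vertices with a small union-find. The function name and the toy graph are assumptions for illustration, and the enumeration is only feasible for very small graphs; it is not how `k_clique_communities` is implemented.

```python
from itertools import combinations

def k_clique_communities_naive(adj, k):
    """Brute-force clique percolation; only sensible for very small graphs."""
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(v in adj[u] for u, v in combinations(c, 2))]
    parent = list(range(len(cliques)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge k-cliques that share k - 1 vertices.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, c in enumerate(cliques):
        groups.setdefault(find(i), set()).update(c)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical toy graph: triangles 0-1-2 and 1-2-3 share the edge 1-2,
# and the triangle 3-4-5 touches them only at vertex 3.
toy_adj = {
    0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3},
    3: {1, 2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(k_clique_communities_naive(toy_adj, 3))
```

The first two triangles percolate into one community, while the third stays separate because it shares a single vertex with them. Vertex 3 belongs to both communities, which shows that the method produces overlapping groups.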

In [None]:
kcq = list( nx.community.k_clique_communities( H, 3 ) )

The label propagation algorithm initially assigns a unique label to each node and then relabels the nodes in random order until stabilization.
The new label of a node is the label carried by the largest number of its neighbours.

Code borrowed from **lpa.py** by **Tyler Rush**, which can be found at [networkx-devel](https://groups.google.com/forum/#!topic/networkx-devel/J1HCjv74RGE). The procedure is an implementation of the idea in:
* **Cordasco, G., & Gargano, L. (2012)**. _Label propagation algorithm: a semi-synchronous approach_. International Journal of Social Network Mining, 1(1), 3-26.
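
Independently of the borrowed **lpa.py**, the basic update rule can be sketched in a few lines. This is a simplified asynchronous variant and not the semi-synchronous procedure of the paper cited above. Two choices here are assumptions made so that the result is reproducible: nodes are visited in a fixed sorted order instead of a random one, and ties are broken by taking the largest label. The graph is assumed to have no isolated nodes.

```python
from collections import Counter

def label_propagation(adj, max_sweeps=100):
    """Asynchronous label propagation with a fixed visiting order."""
    labels = {v: v for v in adj}
    for _ in range(max_sweeps):
        changed = False
        for v in sorted(adj):
            counts = Counter(labels[w] for w in adj[v])
            best = max(counts.values())
            # Assumption: ties are broken by taking the largest label.
            new = max(lab for lab, c in counts.items() if c == best)
            if new != labels[v]:
                labels[v] = new
                changed = True
        if not changed:
            break
    return labels

# Hypothetical toy graph: triangles 0-1-2 and 3-4-5 joined by the edge 2-3.
toy_adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
labels = label_propagation(toy_adj)
print(labels)
```

On this graph the labels settle after one sweep, with each triangle carrying its own label. The outcome of label propagation generally depends on the visiting order and the tie-breaking rule, so other choices can give different partitions.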


In [None]:
import lpa
lab = lpa.semisynchronous_prec_max( H )

[Markov Cluster Algorithm](http://micans.org/mcl/) (MCL).
**Input:** Transition matrix $T = D^{-1}A$  
**Output:** Adjacency matrix $M^*$  
1. Set $M = T$
2. **repeat:**
    3. *Expansion Step:* $M = M^p$ (usually $p=2$)
    4. *Inflation Step:* Raise every entry of $M$ to the power $\alpha$ (usually $\alpha=2$)
    5. *Renormalize:* Normalize each row by its sum
    6. *Pruning:* Replace entries that are close to $0$ by pure $0$
7. **until** $M$ converges
8. $M^* = M$
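
The effect of the inflation and renormalisation steps on a single row of transition probabilities can be seen in isolation. The sketch below uses a hypothetical row and helper name; it is not part of the notebook's MCL implementation.

```python
def inflate(row, alpha=2):
    """Raise each probability to the power alpha and renormalise the row."""
    powered = [p ** alpha for p in row]
    total = float(sum(powered))
    return [p / total for p in powered]

# Hypothetical row of a transition matrix.
row = [0.5, 0.3, 0.2]
for _ in range(3):
    row = inflate(row)
    print(["%.3f" % p for p in row])
```

Each pass moves probability mass towards the largest entry, which is how repeated inflation separates strong transitions from weak ones.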

In [None]:
def mcl_iter( A, p = 2, alpha = 2, theta = 1e-8, rel_eps = 1e-4, niter = 10000 ) :
## Convert A into a transition kernel: M_{ij} is the probability of making a transition from i to j.
    M = np.multiply( 1.0 / A.sum( axis = 1, dtype = np.float64 ).reshape(-1,1), A )
    i = 0 ; status = -1
    while i < niter :
        M_prime = M.copy( )
## Expansion step: M_{ij} is the probability of reaching a vertex j from i in p hops.
        M = np.linalg.matrix_power( M, p )
## Pruning: set negligible transition probabilities to exact zero.
        M[ np.abs( M ) < theta ] = 0
## Inflation step: dampen the weak transition probabilities
        M = np.power( M, alpha )
## Renormalisation step: make the matrix into a stochastic transition kernel
        N = M.sum( axis = 1, dtype = np.float64 )
## If a nan is encountered, then abort
        if np.any( np.isnan( N ) ) :
            status = -2
            break
        M = np.multiply( 1.0 / N.reshape(-1,1), M )
## Convergence criterion is the L1 norm of relative divergence of transition probabilities
        if np.sum( np.abs( M - M_prime ) / ( np.abs( M_prime ) + rel_eps ) ) < rel_eps :
            status = 0
            break
## Advance to the next iteration
        i += 1
    return ( M, (status, i) )

def extract_communities( M, lengths = True ) :
## It is expected that the MCL matrix detects communities in columns
    C = list( ) ; i0 = 0
    if np.any( np.isnan( M ) ) :
        return C
## Find all indices of nonzero elements
    r, c = np.where( np.array( M ) )
## Sort them by the column index and find the community sizes
    r = r[ np.argsort( c ) ]
    u = np.unique( c, return_counts = True )
    if np.sum( u[ 1 ] ) > M.shape[ 1 ] :
        return C
    if lengths :
        return u[ 1 ]
## Columns indices of nonzero entries are ordered, so we just need to
##  sweep across the sizes
    for s in u[ 1 ] :
## Row indices for a column with a nonzero element are the indices of
##  nodes in the community.
        C.append( r[ i0:i0+s ] )
        i0 += s
    return C

def make_labels( com, mapper = None ) :
    dd = dict( )
    for i, c in enumerate( com, 1 ) :
        for k in c :
            if mapper is not None :
                dd[ mapper[ k ] ] = i
            else :
                dd[ k ] = i      
    return dd

Let's check how the Markov Clustering Algorithm fares against $k$-clique, and vertex labelling.

In [None]:
fig = plt.figure( figsize = (12, 8) )
axs = fig.add_subplot( 1,1,1, axisbg = 'black', title = "Master cluster", )
A = nx.to_numpy_matrix( H, dtype = np.float, nodelist = nx.spectral_ordering( H ) )
C, _ = mcl_iter( A )
mcl = extract_communities( C, lengths = False)
axs.spy( A, color = "gold", markersize = 15, marker = '.' )
axs.spy( C, color = "magenta", markersize = 10, marker = '.' )
for i, c in enumerate( kcq, 1 ):
    row = ", ".join( [ G.node[ n ][ 'label' ] for n in c] )
    print "%#2d (%d)\t"%(i, len(c)), ( row )[:100].strip() + (" ..." if len( row ) > 100 else "" )

In [None]:
fig = plt.figure( figsize = (12, 8) )
axs = fig.add_subplot( 1,1,1, axisbg = 'black', title = "Master cluster: 5-clique community", )

kcq = list( nx.community.k_clique_communities( H, 5 ) )
deg = make_labels( kcq )
nx.draw_networkx( H, with_labels = False, ax = axs,
     cmap = plt.cm.Reds, node_color = deg.values( ), edge_color = "cyan",
     nodelist = deg.keys( ), node_size = 200, pos = pos )
for i, c in enumerate( kcq, 1 ):
    row = ", ".join( [ G.node[ n ][ 'label' ] for n in c] )
    print "%#2d (%d)\t"%(i, len(c)), ( row )[:100].strip() + (" ..." if len( row ) > 100 else "" )

In [None]:
fig = plt.figure( figsize = (12, 8) )
axs = fig.add_subplot( 1,1,1, axisbg = 'black', title = "Master cluster: 7-clique community", )

kcq = list( nx.community.k_clique_communities( H, 7 ) )
deg = make_labels( kcq )
nx.draw_networkx( H, with_labels = False, ax = axs,
     cmap = plt.cm.Reds, node_color = deg.values( ), edge_color = "cyan",
     nodelist = deg.keys( ), node_size = 200, pos = pos )
for i, c in enumerate( kcq, 1 ):
    row = ", ".join( [ G.node[ n ][ 'label' ] for n in c] )
    print "%#2d (%d)\t"%(i, len(c)), ( row )[:100].strip() + (" ..." if len( row ) > 100 else "" )

In [None]:
fig = plt.figure( figsize = (12, 8) )
axs = fig.add_subplot( 1,1,1, axisbg = 'black', title = "Master cluster: 4-clique communities", )

kcq = list( nx.community.k_clique_communities( H, 4 ) )
deg = make_labels( kcq )
nx.draw_networkx( H, with_labels = False, ax = axs,
     cmap = plt.cm.Reds, node_color = deg.values( ), edge_color = "cyan",
     nodelist = deg.keys( ), node_size = 200, pos = pos )
for i, c in enumerate( kcq, 1 ):
    row = ", ".join( [ G.node[ n ][ 'label' ] for n in c] )
    print "%#2d (%d)\t"%(i, len(c)), ( row )[:100].strip() + (" ..." if len( row ) > 100 else "" )

In [None]:
fig = plt.figure( figsize = (12, 8) )
axs = fig.add_subplot( 1,1,1, axisbg = 'black', title = "Master cluster: label propagation", )

deg = make_labels( lab.values() )
nx.draw_networkx( H, with_labels = False, ax = axs,
     cmap = plt.cm.Reds, node_color = deg.values( ), edge_color = "cyan",
     nodelist = deg.keys( ), node_size = 200, pos = pos )
for i, c in enumerate( lab.values(), 1 ):
    row = ", ".join( [ G.node[ n ][ 'label' ] for n in c] )
    print "%#2d (%d)\t"%(i, len(c)), ( row )[:100].strip() + (" ..." if len( row ) > 100 else "" )

In [None]:
fig = plt.figure( figsize = (12, 8) )
axs = fig.add_subplot( 1,1,1, axisbg = 'black', title = "Master cluster: Markov Clustering", )

mcl = extract_communities( mcl_iter( nx.to_numpy_matrix( H, dtype = np.float ), p = 2, alpha = 2 )[ 0 ], lengths = False)
deg = make_labels( mcl, mapper = H.nodes() )
nx.draw_networkx( H, with_labels = False, ax = axs,
     cmap = plt.cm.Reds, node_color = deg.values( ), edge_color = "cyan",
     nodelist = deg.keys( ), node_size = 200, pos = pos )
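nodes = H.nodes( )
for i, c in enumerate( mcl, 1 ):
## MCL communities hold row indices of the adjacency matrix: map them back to node ids
    row = ", ".join( [ G.node[ nodes[ n ] ][ 'label' ] for n in c] )
    print "%#2d (%d)\t"%(i, len(c)), ( row )[:100].strip() + (" ..." if len( row ) > 100 else "" )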

# <center> Thank You!</center>