# <center>Structural Analysis and Visualization of Networks</center>

## <center>Home Assignment #1: Power law</center>

### <center>Student: *Nazarov Ivan*</center>

#### <hr /> General Information

**Due Date:** 28.01.2015 23:59 <br \>
**Late submission policy:** -0.2 points per day <br \>


Please send your reports to <mailto:leonid.e.zhukov@gmail.com> and <mailto:shestakoffandrey@gmail.com> with message subject of the following structure:<br \> **[HSE Networks 2015] *Nazarov* *Ivan* HA*1***

Support your computations with figures and comments. <br \>
If you are using IPython Notebook you may use this file as a starting point of your report.<br \>
<br \>
<hr \>

# Preamble

Let's start by defining several routines, which will become helpful later on in the assignment.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import numpy.linalg as la
from scipy.stats import rankdata
%matplotlib inline

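## Construct a regression model
def lm_model( X, Y, intercept = True ) :
	T = np.array( Y, dtype = float )
## Regressors are kept row-wise here and transposed on return
	M = np.atleast_2d( np.array( X, dtype = float ) )
	if intercept :
		M = np.vstack( [ np.ones( len( T ) ), M ] )
## Return the model whether or not an intercept was requested
	return ( M.T, T, intercept )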

## Define the OLS regression routine:
def lm_fit( model ) :
	M, T, intercept = model
	MMinv = la.inv( ## implement (X'X)^{-1} (X'Y)
		np.dot( M.T, M ) ) 
	coef = np.dot( MMinv,
		np.dot( M.T, T ) )
## Estimate the residual standard deviation
	resid = T - np.dot(M, coef)
	dof = len( T ) - len( coef )
	RSS = np.dot( resid.T, resid )
	return (coef, RSS, dof, MMinv )

A continuous random variable $X$ is distributed according to the power law (also known as the Pareto distribution) if its probability density function is $$p(x) = \frac{\alpha-1}{u} {\bigg (\frac{x}{u} \bigg )}^{-\alpha} 1_{[u,+\infty)} (x)$$

The maximum likelihood estimate of the exponent $\alpha$ is given by $$\hat{\alpha} = 1 + {\bigg( \frac{\sum_{k=1}^n \ln x_k}{n} - \ln u \bigg) }^{-1}$$ provided all the observed sample values are not less than the threshold $u$. 
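
As a quick sanity check of the closed-form estimate, the sketch below draws a synthetic sample from the density above by inverse-transform sampling and recovers the exponent. The helper names `pareto_sample` and `mle_alpha_sketch`, the seed, and the parameter values are illustrative assumptions and not part of the assignment code.

```python
import numpy as np

def pareto_sample(alpha, u, size, rng):
    # Inverse-transform sampling: the complementary CDF is (x/u)^(1-alpha),
    # so x = u * V^(-1/(alpha-1)) for V uniform on (0, 1].
    v = 1.0 - rng.random_sample(size)
    return u * v ** (-1.0 / (alpha - 1.0))

def mle_alpha_sketch(x, u):
    # Closed-form MLE of the exponent for observations not below u.
    tail = np.asarray(x, dtype=float)
    tail = tail[tail >= u]
    alpha = 1.0 + 1.0 / (np.mean(np.log(tail)) - np.log(u))
    # Asymptotic standard error of the estimate: (alpha - 1) / sqrt(n).
    return alpha, (alpha - 1.0) / np.sqrt(len(tail))

rng = np.random.RandomState(0)
x = pareto_sample(alpha=2.5, u=1.0, size=20000, rng=rng)
alpha_hat, alpha_se = mle_alpha_sketch(x, u=1.0)
print("alpha_hat = %.3f (s.e. %.3f)" % (alpha_hat, alpha_se))
```

With a sample of this size the estimate should land within a few standard errors of the true value 2.5.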

A random variable $N$ with discrete power law distribution exceeding a certain threshold $u$ has the following probabilities: $$\mathbb{P}(N=k) = C \frac{1}{{(k-u+1)}^{\,\gamma}}$$ where $k\geq u$ is the value over the threshold $u$, $\gamma > 1$ and the constant $C$ is given by the reciprocal of Riemann's zeta function $$\zeta(\gamma) = \sum_{n\geq 1} \frac{1}{n^{\,\gamma}}.$$

The MLE of the power parameter of the discrete power law involves the derivative of the zeta function, which rules out a closed algebraic form of the solution to the first order condition for the maximum of the log-likelihood: $$\frac{\partial }{\partial \gamma} \mathcal{L}\quad:\quad \frac{-\zeta'(\gamma)}{\zeta(\gamma)} = \frac{1}{n} \sum_{i=1}^n \ln (k_i-u+1)$$ 
In practice it is necessary to resort to numerical optimization in order to find the MLE under this distributional assumption.
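
A minimal sketch of such a numerical solution is given here, under the assumption that SciPy is available. It minimizes the per-observation minus log-likelihood $\ln\zeta(\gamma) + \gamma\,\overline{\ln(k-u+1)}$ with a bounded scalar search. The function name `mle_gamma_sketch`, the search bounds, and the synthetic Zipf sample are illustrative choices only.

```python
import numpy as np
from scipy.special import zeta
from scipy.optimize import minimize_scalar

def mle_gamma_sketch(k, u=1):
    # Shift the data so that the threshold value maps to 1.
    shifted = np.asarray(k, dtype=float) - u + 1.0
    shifted = shifted[shifted >= 1.0]
    mean_log = np.mean(np.log(shifted))
    # Minus log-likelihood per observation.
    nll = lambda g: np.log(zeta(g, 1.0)) + g * mean_log
    # A bounded search keeps gamma > 1, where the zeta function is finite.
    res = minimize_scalar(nll, bounds=(1.0 + 1e-6, 20.0), method="bounded")
    return res.x

rng = np.random.RandomState(1)
sample = rng.zipf(2.5, size=20000)   # P(N = k) proportional to k^(-2.5)
gamma_hat = mle_gamma_sketch(sample, u=1)
print("gamma_hat = %.3f" % gamma_hat)
```

Bounding the search away from $\gamma = 1$ avoids evaluating the zeta function at its pole, which an unconstrained simplex search could step into.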

The routines below implement the MLE of $\alpha$ and $\gamma$.

In [None]:
## ML estimator of the power law in the "tail" (x≥u):
##  x_k \sim C x^{-\alpha} 1_{[u,+∞)}(x).
def mle_alpha( data, threshold ) :
## Keep the data observations, that we consider to be in the tail
	tail = np.array( [ v for v in data if v >= threshold ] )
## Estimate the mean log of the peaks over threshold
	sum_log = np.sum( np.log( tail ) ) / ( len( tail ) + 0.0 )
## Use the closed form expression for the value of the power at an optimum
	alpha = 1.0 + 1.0 / ( sum_log - np.log( threshold ) )
## Using the delta-method compute the s.e of the estimate.
	return alpha, ( alpha - 1 ) / np.sqrt( len( tail ) )

## The function below implements the same functionality as the previous one
##  but instead of the continuous version it works with the discrete power law.
from scipy.special import zeta
from scipy.optimize import minimize
## The discrete power law gives marginally different results
##  \Pr(N=n) \defn \frac{1}{\zeta(\gamma)} n^{-\gamma}, n -- positive integer
def mle_alpha_d( data, threshold ) :
## Keep the data observations, that we consider to be in the tail
	tail = np.array( [ v for v in data if v >= threshold ] )
## Estimate the mean log of the peaks over threshold
	sum_log = np.sum( np.log( tail - threshold + 1 ) ) / ( len( tail ) + 0.0 )
## Define minus log-likelihood of the discrete power law
	loglik = lambda alpha : np.log( zeta( alpha ) ) + alpha * sum_log
## Use the closed-form estimate of the continuous case as the initial
##  seed for the numerical minimizer for better convergence.
	res = minimize( loglik, ( 1.0 + 1.0 / sum_log, ), method = 'Nelder-Mead', options = { 'disp': False } )
## Return the "optimal" argument, regardless of its quality. Potentially DANGEROUS!
	return res.x[ 0 ], float( 'nan' )


Selecting an optimal threshold, beyond which the power-law like tail behaviour is expected, and which adequately balances the bias and the variance, is very important. As suggested in ... this task is performed well by employing the statistic of the Kolmogorov-Smirnov goodness-of-fit test. The statistic itself is the $L^\infty$ norm of the difference between the hypothesised distribution function and the observed (empirical) CDF.
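
To make the selection rule concrete, here is a small self-contained sketch that computes the KS distance by hand for a grid of candidate thresholds and picks the minimizer. The synthetic data (a uniform body below 5 glued to a power-law tail above 5), the grid, and the name `ks_distance` are assumptions made for illustration.

```python
import numpy as np

def ks_distance(x, u):
    # Keep the tail and fit the exponent by the closed-form MLE.
    tail = np.sort(np.asarray(x, dtype=float))
    tail = tail[tail >= u]
    n = len(tail)
    alpha = 1.0 + 1.0 / (np.mean(np.log(tail)) - np.log(u))
    # Fitted CDF evaluated at the sorted tail observations.
    fitted = 1.0 - (tail / u) ** (1.0 - alpha)
    # The empirical CDF jumps from (i-1)/n to i/n at the i-th order
    # statistic, so the supremum is attained on one side of some jump.
    upper = np.arange(1, n + 1) / float(n) - fitted
    lower = fitted - np.arange(0, n) / float(n)
    return max(upper.max(), lower.max()), alpha

rng = np.random.RandomState(2)
body = rng.uniform(1.0, 5.0, size=3000)
tail = 5.0 * (1.0 - rng.random_sample(3000)) ** (-1.0 / (2.5 - 1.0))
data = np.concatenate([body, tail])

candidates = np.linspace(1.0, 10.0, 19)   # thresholds 1.0, 1.5, ..., 10.0
scan = np.array([ks_distance(data, u) for u in candidates])
best = np.argmin(scan[:, 0])
x_min, alpha_at_min = candidates[best], scan[best, 1]
print("x_min = %.1f, alpha = %.3f, KS = %.4f" % (x_min, alpha_at_min, scan[best, 0]))
```

Thresholds well inside the uniform body give a large distance, while thresholds at or beyond the true break give small distances that grow slowly as the tail sample shrinks.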

Routines below implement this functionality.

In [None]:
## Define a convenience function for estimating the power parameter
##  of the continuous power law
from scipy.stats import kstest
def ks_dist( data, threshold ) :
## Estimate the power given the current threshold
	alpha, sd = mle_alpha( data, threshold )
## Construct the CDF in the current environment. The threshold is cast to
##  float to avoid integer division on integer-valued data.
	cdf = lambda x : 1.0 - ( x / float( threshold ) ) ** ( 1.0 - alpha )
## Return the output of the out-of-the box Kolmogorov-Smirnov test:
##  the infinity norm of the difference between the distribution functions.
	d, pv = kstest( [ v for v in data if v >= threshold ], cdf )
	return (d, pv), (alpha, sd)

def ks_dist_d( data, threshold ) :
## Estimate the power given the current threshold
	alpha, sd = mle_alpha_d( data, threshold )
## Construct the CDF in the current environment: P(N<=k) = 1 - P(N>=k+1),
##  where P(N>=m) = zeta( alpha, m-threshold+1 ) / zeta( alpha ).
	cdf = lambda k : 1.0 - zeta( alpha, k-threshold+2 ) / zeta( alpha )
## Return the output of the out-of-the box Kolmogorov-Smirnov test:
##  the infinity norm of the difference between the distribution functions.
	d, pv = kstest( [ v for v in data if v >= threshold ], cdf )
	return (d, pv), (alpha, sd)


These helper functions invert an array and count the number of occurrences of distinct values in an array.
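
For the counting part, NumPy offers a shortcut: `np.unique` with `return_counts=True` returns the distinct values together with their multiplicities. The toy array below is made up for illustration.

```python
import numpy as np

data = np.array([3, 1, 2, 3, 3, 1])
# Distinct values in ascending order together with their multiplicities.
vals, freq = np.unique(data, return_counts=True)
# Stack into a two-column array: value, number of occurrences.
table = np.column_stack([vals, freq])
print(table)
```

The result is already sorted by value, so no separate sorting step is needed before building a cumulative distribution from it.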

In [None]:
def values( data, frequency = False ) :
	bins = dict( )
## For each value in the given array, add the index of each occurrence
##  into the bin dedicated to the encountered value.
	for i, x in enumerate( sorted( data ) ) :
## Append the current occurrence of a value; if it has never been
##  seen before, start from an empty list of indices.
		bins[ x ] = bins.get( x, [] ) + [ i ]
	return bins

## It was brought to my attention that numpy.unique() does the same trick...
def counts( data ) :
## Count the number of times a value occurs in the array.
	tally = dict( )
	for x in data :
## If the value has not been seen yet, then initialize it to
##  a single occurrence, otherwise increment its counter.
		tally[ x ] = tally.get( x, 0 ) + 1
## Return a list of (value, count) pairs, which converts to a 2-D array
	return list( tally.items( ) )

## Construct the complementary cumulative distribution function for
##  the data exceeding the given tail threshold.
def ccdf( data, threshold ) :
## Count the occurrences of values over some threshold in the array
	freq = np.array( counts(
			[ v for v in data if v >= threshold ] ),
		dtype = float )
## Sort the counts along the growing values they correspond to
	freq = freq[ freq[ :, 0 ].argsort( ), : ]
## ... and compute the fraction of data with values strictly greater than the current
	freq[:,1] = 1.0 - np.cumsum( freq[ :,1 ], dtype = float ) / sum( freq[ :,1 ] )
	return freq

The mean excess plot is a visual tool that helps determine the tail-type behaviour from the sample data. Basically it is just the plot of the sample mean of the excesses over some threshold against that threshold.

If $X$ is some random variable with $\mathbb{E}X^+ < +\infty$, then the function $M(u)$, also known as the mean residual lifetime, or the mean excess over threshold, is defined as
$$ M(u) \overset{\Delta}{=} \mathbb{E}{\Big ( {\big. X-u\,\big \rvert}\, X\geq u \Big)} = \mathbb{E}{\Big ( {\big. X\,\big \rvert}\, X\geq u \Big)} - u$$
Its empirical analog is provided by the following expression:
$$\hat{M}(u) \overset{\Delta}{=} \frac{1}{\sum_{i=1}^n 1_{[u, \infty)}(x_i)} \sum_{i=1}^n (x_i - u)1_{[u, \infty)}(x_i) $$

Heavy-tailed behaviour reveals itself as an upwards trend in the graph above some threshold. A downward trend shows thin-tailed behaviour whereas an almost flat line shows an exponential tail. Mean excesses for higher thresholds are averages of a handful of extreme excesses, which implies that in this region the plot is unstable.

Indeed, if $X\sim \text{Pwr}(\alpha,x_0)$ with $\alpha > 2$, then for $u \geq x_0$
$$M(u) = \frac{ (\alpha-1)\,x_0^{\,\alpha-1}\,\int\limits_u^\infty s^{1-\alpha} ds }{{\big (\tfrac{u}{x_0}\big )}^{1-\alpha}} - u= \frac{\alpha-1}{\alpha-2} u - u = \frac{1}{\alpha-2} u$$

If, however, $X\sim \text{Exp}(\lambda)$ then 
$$M(u) = \frac{ \int\limits_u^\infty s \lambda e^{-\lambda s} ds }{e^{-\lambda u}} - u = \frac{  u e^{-\lambda u}+ e^{-\lambda u}\lambda^{-1} }{e^{-\lambda u}} - u = \frac{1}{\lambda}$$
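
Both closed forms can be checked numerically. The sketch below evaluates the empirical mean excess $\hat{M}(u)$ on simulated samples and compares it with $u/(\alpha-2)$ and $1/\lambda$. The parameter values, the seed, and the helper name `empirical_mean_excess` are assumptions chosen for illustration.

```python
import numpy as np

def empirical_mean_excess(x, u):
    # Average of (x_i - u) over the observations with x_i >= u.
    x = np.asarray(x, dtype=float)
    tail = x[x >= u]
    return np.mean(tail - u)

rng = np.random.RandomState(3)
n = 200000
alpha, x0, lam = 3.5, 1.0, 2.0
# Power-law sample by inverse-transform sampling, exponential sample directly.
pwr = x0 * (1.0 - rng.random_sample(n)) ** (-1.0 / (alpha - 1.0))
expo = rng.exponential(scale=1.0 / lam, size=n)

for u in (1.0, 2.0, 3.0):
    print("u = %.1f: power law %.3f (theory %.3f), exponential %.3f (theory %.3f)" % (
        u, empirical_mean_excess(pwr, u), u / (alpha - 2.0),
        empirical_mean_excess(expo, u), 1.0 / lam))
```

The power-law column grows linearly in $u$ while the exponential column stays near $1/\lambda$, and both become noisier as fewer observations exceed the threshold.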


In [None]:
def mean_excess( data ) :
## Sort the observations in ascending order
	data = np.sort( np.asarray( data, dtype = float ) )
## Get the distinct values -- the thresholds -- and the position of the
##  first occurrence of each one in the sorted array.
	thresholds, first = np.unique( data, return_index = True )
## Since the array is sorted, the number of observations not less than
##  a threshold is the number of elements from its first occurrence onwards.
	n_tail = len( data ) - first
## Get the sum of all values not less than the current threshold
	tail_sum = np.cumsum( data[ ::-1 ] )[ n_tail - 1 ]
	return np.column_stack( ( thresholds, tail_sum / n_tail - thresholds ) )

## Problems

### Task 1.

Load [wordcounts](http://www.leonidzhukov.net/hse/2015/networks/data/wordcounts.txt) dataset. 
1. Check that Zipf's Law holds
2. Assuming that the data is distributed according to the Power Law, find
 * $\alpha$ of the distribution
 * mean sample variance $\sigma^2$
3. Produce summary of the frequencies: min, max, mean, median

In [None]:
#####################################################################
#+ 0. Load the data (yes, it is a milestone!)
## Load the word count dataset
wordcount = np.fromregex(
	'./data/wordcounts.txt', r"(\d+)\s+(.{,32})",
	[ ( 'freq', np.int64 ), ( 'word', 'S32' ) ] )

In [None]:
#####################################################################
##+ 1. Check that Zipf's Law holds
## Pre-sort the frequencies: in ascending order of frequencies
wordcount.sort( order = 'freq' )
freqs = wordcount[ 'freq' ]
## Produce ranks: from 1 up to |W|
ranks = np.arange( 1, len( wordcount ) + 1, dtype = float )[::-1]
## The probability of a word frequency being not less than the
##  frequency of a given word w is exactly the ratio of w's rank
##  to the total number of words.
probs = ranks / len( wordcount )

## estimate f_k\sim C k^{-\gamma} model
mdl = lm_model( np.log( ranks ), np.log( freqs ), True )
coef, rss, dof, XX = lm_fit( mdl )

## Define the fitted Zipf's law
# zipf = lambda r : np.exp( coef.dot( ( 1, np.log( r ) ) ) )
zipf = lambda r : np.exp( coef[0] + coef[1] * np.log( r ) )

## Show how well it was estimated.
plt.loglog( freqs, probs, "xr" )
plt.plot( zipf( ranks ), probs, "-b" )
plt.xlabel( "frequency" ) ; plt.ylabel( "rank / |W|" )
plt.title( "Wordcount data" )
plt.show( )

In [None]:
######################################################################
##+ 2. Assuming that the data is distributed according to the Power Law, find
##  * $\alpha$ of the distribution
##  * mean sample variance $\sigma^2$

## Get the ML estimate
alpha_ml, alpha_ml_sd = mle_alpha( freqs, freqs.min( ) )

## Let's suppose that the rank is proportional to the complementary CDF
##  of a power law: $\bar{F}(x) = {\left(\frac{x}{u}\right)}^{1-\alpha}$
##  Thus the following econometric model is to be estimated:
##  $\log \text{rank} \sim C + (1-\alpha) \log \text{freq} + \epsilon$
mdl = lm_model( np.log( freqs ), np.log( ranks ), True )
beta, rss, dof, XX = lm_fit( mdl )
## Transform the coefficient
alpha_ls = 1 - beta[ 1 ]

## The regression estimate of the power should be close
##  to the ML estimate
print "the OLS estimate of alpha is %f\n" % alpha_ls
print "Whereas the ML estimate is %f (%f) \n" % ( alpha_ml, alpha_ml_sd )
print "Since ML is more theoretically sound, the relative error is %f%%\n" % (
	100 * np.abs( 1.0 - alpha_ls / alpha_ml ), )

## The mean and the sample variance of the sample
##  frequency distribution:
print "The average frequency over the sample is ", freqs.mean(), "\n"
print "The sample variance is ", freqs.var(), "\n"

## Theoretical mean and variance of the power law distribution
##  significantly depend on the power parameter.
## Indeed for $x\sim \frac{\alpha-1}{u} {\left( \frac{x}{u} \right)}^{-\alpha}$ one has the following:
##   $E(x) = \frac{\alpha-1}{\alpha-2} u$ if $\alpha>2$
##   $E(x^2) = \frac{\alpha-1}{\alpha-3} u^2$ if $\alpha>3$
## The estimated parameter is less than 2, implying that the frequency
##  distribution is unlikely to have even a finite mean under the
##  assumed distribution.

In [None]:
#####################################################################
##+ 3. Produce summary of the frequencies: min, max, mean, median
## Does it make sense to compute these summaries? What does the mean frequency tell us?
print "The minimum frequency is ", freqs.min(), "\n"
print "The mean frequency is ", freqs.mean(), "\n"
print "The median frequency is ", np.median( freqs ), "\n"
print "The maximum frequency is ", freqs.max(), "\n"

### <hr /> Task 2.

Find and plot PDF and CDF for the following networks:
* [Routing network](http://www.leonidzhukov.net/hse/2015/networks/data/network.txt)
* [Web graph](http://www.leonidzhukov.net/hse/2015/networks/data/web_Stanford.txt)
* [Facebook network](http://www.leonidzhukov.net/hse/2015/networks/data/fb_Princeton.txt)


1. Are they correspondent to power law?
2. Find max and mean values of incoming and outcoming node degrees
3. Find $\alpha$ via Maximum Likelihood and calculate $\sigma^2$
4. Determine $x_{min}$ via Kolmogorov-Smirnov test

### The routing network graph

In [None]:
#####################################################################
## + 0. Read the graph
## Load the network routing graph first as it is the smallest. It is
##  an undirected graph.
import networkx as nx
G = nx.read_edgelist( "./data/network.txt", create_using = nx.Graph( ) );

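## Wrapping in dict() keeps this working whether G.degree() returns a plain
##  dictionary or a view over (node, degree) pairs.
node_degree = dict( G.degree( ) )
deg = np.array( list( node_degree.values( ) ), dtype = int )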

In [None]:
#####################################################################
##+ 1. Are they correspondent to power law?
## First let's draw the frequency plot of the node degree distribution.
degree_freq = np.array( counts( deg ) )
deg_me = mean_excess( deg )

plt.figure( 1, figsize = ( 10, 5 ) )

plt.subplot(121)
plt.title( "Node degree frequency" )
plt.loglog( degree_freq[:,0], degree_freq[:,1], "bo" )
plt.xlabel( "degree" ) ; plt.ylabel( "frequency" )

plt.subplot(122)
## An upward trend in plot shows heavy-tailed behaviour, but the
##  values for high thresholds are unreliably estimated.
plt.title( "Mean excess plot" )
plt.loglog( deg_me[:,0], deg_me[:,1], "bo-", linewidth = 2 )
plt.ylabel( "mean excess" ) ; plt.xlabel( "threshold" )

plt.show( )

The mean excess plot has an unmistakeable upward trend throughout the whole set of thresholds. This is strong heuristic evidence for a heavy tail in the node degree distribution.

In [None]:
## The empirical degree distribution may not correspond to a power
##  law per se, but it definitely has some heavy tailed behaviour,
##  which exhibits itself when only the data exceeding some truncation
##  threshold is considered.
cc = ccdf( deg, 0 )

plt.title( "Degree cCDF" )
plt.xlabel( "degree" ) ; plt.ylabel( "probability" )
plt.loglog( cc[:,0], cc[:,1], "bo-", linewidth = 2 )
plt.show( )

## Clearly the chances of an extremely high node degree decay roughly
##  linearly in the degree on a log-log scale.

In [None]:
#####################################################################
##+ 2. Find max and mean values of incoming and outcoming node degrees
## Since the network graph is undirected it does not make sense to
##  distinguish in- and out- nodes. Thus let's check the range of the
##  general (two-way) degree.

print "The degrees range from %d to %d" % ( min( deg ), max( deg ) ) #, "\n"
print "The average degree over the sample is %.3f" % ( np.mean( deg ) ) #, "\n"
print "The degree standard deviation is %.3f" % ( np.sqrt( np.var( deg ) ) ) #, "\n"
print "The median degree is %d" % ( np.median( deg ) ) #, "\n"

In [None]:
#####################################################################
##+ 3. Find $\alpha$ via Maximum Likelihood and calculate $\sigma^2$
##+ 4. Determine $x_{min}$ via Kolmogorov-Smirnov test

## We have reasons to believe there are some power law-like effects in
##  the behaviour of the node degree (treated as a random variable).
##  Let's pursue this lead and estimate the exponent in the power law
##  and select the most likely breakpoint, beyond which the degree
##  is heavy tailed.

#####################################################################
## Get the ML estimate of the exponent parameter.
alpha_ml, alpha_ml_se = mle_alpha( deg, min( deg ) )
print "The Maximum likelihood estimate of the exponent of the node degree distribution is %.3f (%.4f)\n" % ( alpha_ml, alpha_ml_se )

#####################################################################
## Run the KS threshold selection routine
thresholds = np.unique( deg )
## The ks_dist() function returns a tuple of the following parameters:
##  * ( KS-distance, PV of the KS-test ), ( MLE of alpha, the standard error of the MLE )
ks_min = np.array( [ ks_dist( deg, u ) for u in thresholds ] )

## Select the x_min that brings the KS metric to its minimum on the given
##  degree data. Note the first threshold is removed, since it is likely
##  to yield very biased estimate.
i_min = np.argmin( ks_min[1:,0,0] )+1
x_min = thresholds[ i_min ]
alpha_ml, alpha_ml_se = ks_min[ i_min, 1, : ]

## Produce a dataset for cCDF plotting.
x = np.arange( x_min, 2 * np.max( deg ) )
deg_ccdf = ccdf( deg, x_min )
pwr_ccdf = lambda x : ( x / ( x_min + 0.0 ) ) ** ( 1.0 - alpha_ml )

Visualize the dependence of $\alpha$ and the KS statistic on the threshold $u$.

In [None]:
## Produce the hill plot: the correspondence between the threshold
##  and the estimated exponent.

plt.figure( 1, figsize = ( 10, 5 ) )

plt.subplot( 121 )
plt.title( 'The Hill plot of the degree distribution' )
plt.ylabel( 'alpha' ) ; plt.xlabel( 'threshold' )
plt.axhline( y = alpha_ml, linewidth = 1, color = 'b' )
plt.axvline( x = x_min, linewidth = 1, color = 'b', linestyle = '--' )
plt.loglog( thresholds, ks_min[:,1,0], "r-")

## In fact the KS-metric is the $L^\infty$ norm on the set of distribution
##  functions.
plt.subplot( 122 )
plt.title( 'KS metric distance' )
plt.ylabel( 'max distance' ) ; plt.xlabel( 'threshold' )
plt.axhline( y = ks_min[ i_min, 0, 0 ], linewidth = 1, color = 'b' )
plt.axvline( x = x_min, linewidth = 1, color = 'b', linestyle = '--' )
plt.loglog( thresholds, ks_min[:,0,0], "r-")

plt.show( )

The estimated theoretical and the empirical CDFs are quite well aligned with each other for the chosen threshold. The hypothesised node degree law appears to decay faster in the far tail, but that is most likely due to the usual severe undersampling of the tails.

In [None]:
print "The Kolmogorov-Smirnov metric yielded %.1f as the optimal threshold\n" % ( x_min)
print "'Optimal' exponent is %.3f (%.3f)\n" % ( alpha_ml, alpha_ml_se )

plt.title( "Degree cCDF" )
plt.xlabel( "degree" ) ; plt.ylabel( "probability" )
plt.plot( x, pwr_ccdf( x ), "b-", linewidth = 2 )
plt.plot( deg_ccdf[:,0], deg_ccdf[:,1], "r-", linewidth = 2 )
plt.axvline( x = x_min, linewidth = 2, color = 'k', linestyle = '-' )
plt.show( )

### Facebook graph

In [None]:
#####################################################################
## + 0. Read the graph
## Load the Facebook graph. The edge list is read as a directed graph.
import networkx as nx
G = nx.read_edgelist( "./data/fb_Princeton.txt", create_using = nx.DiGraph( ) );

node_in_degree = dict( G.in_degree( ) )
node_out_degree = dict( G.out_degree( ) )

in_deg = np.array( list( node_in_degree.values( ) ), dtype = int )
out_deg = np.array( list( node_out_degree.values( ) ), dtype = int )

In [None]:
#####################################################################
##+ 1. Are they correspondent to power law?
## First let's draw the frequency plot of the node degree distribution.
degree_in_freq = np.array( counts( in_deg ) )
degree_out_freq = np.array( counts( out_deg ) )

plt.title( "Node degree frequency" )
plt.xlabel( "degree" ) ; plt.ylabel( "frequency" )
plt.loglog( degree_out_freq[:,0], degree_out_freq[:,1], "bo" )
plt.loglog( degree_in_freq[:,0], degree_in_freq[:,1], "r<" )
plt.show( )

In [None]:
plt.figure( 1, figsize = ( 10, 5 ) )

plt.subplot(121)
plt.title( "Degree cCDF-loglog" )
out_cc = ccdf( out_deg, 0 )
plt.loglog( out_cc[:,0], out_cc[:,1], "bo-", linewidth = 2 )
in_cc = ccdf( in_deg, 0 )
plt.loglog( in_cc[:,0], in_cc[:,1], "r<-", linewidth = 2 )
plt.xlabel( "degree" ) ; plt.ylabel( "probability" )

plt.subplot(122)
## An upward trend in plot shows heavy-tailed behaviour, but the
##  values for high thresholds are unreliably estimated.
plt.title( "Mean excess plot" )
out_me = mean_excess( out_deg )
plt.loglog( out_me[:,0], out_me[:,1], "bo-", linewidth = 2 )
in_me = mean_excess( in_deg )
plt.loglog( in_me[:,0], in_me[:,1], "r<-", linewidth = 2 )
plt.ylabel( "mean excess" ) ; plt.xlabel( "threshold" )

plt.show( )


The ME plots for both the in- and the out-degrees possess significantly long flat regions, and only at the high end of the thresholds do they "explode" into a singularity. Such behaviour hints at the possibility of an exponential tail in the distributions of both inward and outward vertex degrees.

In [None]:
#####################################################################
##+ 2. Find max and mean values of incoming and outcoming node degrees
print "The degrees range from %d to %d for inward direction and from %d to %d for outward edges" % ( min( in_deg ), max( in_deg ), min( out_deg ), max( out_deg ) ) #, "\n"
print "The average degree over the sample is %.3f (IN) and %.3f (OUT)" % ( np.sum( in_deg ) / ( G.order( ) + 0.0 ), np.sum( out_deg ) / ( G.order( ) + 0.0 ) ) #, "\n"
print "The degree standard deviation is %.3f for the in-degree and %.3f -- out-degree" % ( np.sqrt( np.var( in_deg ) ), np.sqrt( np.var( out_deg ) ) ) #, "\n"
print "The median in- and out-degree is %d and %d respectively" % ( np.median( in_deg ), np.median( out_deg ) ) #, "\n"

In [None]:
#####################################################################
##+ 3. Find $\alpha$ via Maximum Likelihood and calculate $\sigma^2$
##+ 4. Determine $x_{min}$ via Kolmogorov-Smirnov test

#####################################################################
## Get the ML estimate of the exponent parameter. There are some isolated
##  nodes in the provided graph, which means that it is necessary
##  to omit these nodes from the analysis using a simple power law.
## One of course could try to fit a model with an explicit atom at zero,
##  but that should wait for a better time.

in_alpha_ml, in_alpha_ml_se = mle_alpha( in_deg, min( in_deg )+1 )
out_alpha_ml, out_alpha_ml_se = mle_alpha( out_deg, min( out_deg )+1 )

#####################################################################
in_thresholds = np.unique( in_deg )
out_thresholds = np.unique( out_deg )

## Run the KS threshold selection routine
in_ks_min = np.array( [ ks_dist( in_deg, u ) for u in in_thresholds ] )
out_ks_min = np.array( [ ks_dist( out_deg, u ) for u in out_thresholds ] )

In [None]:
## Select the x_min that brings the KS metric to its minimum on the given
##  degree data. Note the first threshold is removed, since it is likely
##  to yield very biased estimate.
in_i_min = np.argmin( in_ks_min[1:,0,0] )+1
out_i_min = np.argmin( out_ks_min[1:,0,0] )+1

## Produce a dataset for cCDF plotting.
in_x = np.arange( in_thresholds[ in_i_min ], 2 * np.max( in_deg ) )
out_x = np.arange( out_thresholds[ out_i_min ], 2 * np.max( out_deg ) )

## Get the empirical complementary distribution function.
in_deg_ccdf = ccdf( in_deg, in_thresholds[ in_i_min ] )
out_deg_ccdf = ccdf( out_deg, out_thresholds[ out_i_min ] )

## ... and the fitted power law.
in_pwr_ccdf = lambda x : ( x / ( in_thresholds[ in_i_min ] + 0.0 ) ) ** ( 1.0 - in_ks_min[ in_i_min, 1, 0 ] )
out_pwr_ccdf = lambda x : ( x / ( out_thresholds[ out_i_min ] + 0.0 ) ) ** ( 1.0 - out_ks_min[ out_i_min, 1, 0 ] )

In [None]:
print "The MLE of the exponent of the inward and outward degree distribution is %.3f (%.4f) and %.3f (%.4f) respectively\n" % ( in_alpha_ml, in_alpha_ml_se, out_alpha_ml, out_alpha_ml_se )

## Produce the hill plot: the correspondence between the threshold
##  and the estimated exponent.
plt.figure( 1, figsize = ( 10, 5 ) )

plt.subplot(121)
plt.title( 'The Hill plot of the degree distribution' )

plt.axhline( y = in_ks_min[ in_i_min, 1, 0 ], linewidth = 1, color = 'r' )
plt.axvline( x = in_thresholds[ in_i_min ], linewidth = 1, color = 'r', linestyle = '--' )
plt.loglog( in_thresholds, in_ks_min[:,1,0], "r<-")

plt.axhline( y = out_ks_min[ out_i_min, 1, 0 ], linewidth = 1, color = 'b' )
plt.axvline( x = out_thresholds[ out_i_min ], linewidth = 1, color = 'b', linestyle = '--' )
plt.loglog( out_thresholds, out_ks_min[:,1,0], "bo-")

plt.ylabel( 'alpha' ) ; plt.xlabel( 'threshold' )


## In fact the KS-metric is the $L^\infty$ norm on the set of distribution
##  functions.
plt.subplot(122)
plt.title( 'The KS metric distance' )

plt.axhline( y = in_ks_min[ in_i_min,0, 0 ], linewidth = 1, color = 'r' )
plt.axvline( x = in_thresholds[ in_i_min ], linewidth = 1, color = 'r', linestyle = '--' )
plt.loglog( in_thresholds, in_ks_min[:,0,0], "r<-")

plt.axhline( y = out_ks_min[ out_i_min,0, 0 ], linewidth = 1, color = 'b' )
plt.axvline( x = out_thresholds[ out_i_min ], linewidth = 1, color = 'b', linestyle = '--' )
plt.loglog( out_thresholds, out_ks_min[:,0,0], "bo-")

plt.ylabel( 'max distance' ) ; plt.xlabel( 'threshold' )
plt.show( )

Hill plots (the estimated exponent $\hat{\alpha}_u$ against the employed threshold $u$) have a distinct upward curving trend, which is what one would expect from an exponential-like behaviour in the tail of both the in- and the out-degree distributions.
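
The link between a rising Hill plot and a thin tail can be illustrated on simulated data. In the sketch below the closed-form estimate $\hat{\alpha}_u$ is computed over a few thresholds for an exponential sample and for a genuine power-law sample. The sample sizes, the shift of the exponential sample to the support $[1,\infty)$, and the name `hill_alpha` are illustrative assumptions.

```python
import numpy as np

def hill_alpha(x, u):
    # Closed-form continuous power-law MLE applied to the observations >= u.
    tail = x[x >= u]
    return 1.0 + 1.0 / np.mean(np.log(tail / u))

rng = np.random.RandomState(4)
n = 200000
# Shifted exponential sample with support [1, inf).
expo = 1.0 + rng.exponential(scale=1.0, size=n)
# Power-law sample with alpha = 2.5 and threshold 1.
pwr = (1.0 - rng.random_sample(n)) ** (-1.0 / (2.5 - 1.0))

thresholds = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
alpha_expo = np.array([hill_alpha(expo, u) for u in thresholds])
alpha_pwr = np.array([hill_alpha(pwr, u) for u in thresholds])
print("exponential:", np.round(alpha_expo, 2))
print("power law:  ", np.round(alpha_pwr, 2))
```

For the exponential sample the estimate keeps climbing with the threshold, whereas for the power-law sample it stays close to the true exponent. A steadily rising Hill plot is therefore a symptom of a tail that decays faster than any power law.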

In [None]:
print "OUT-degree: The Kolmogorov-Smirnov metric yielded %.1f as the optimal threshold and %.3f (%.3f) as 'optimal' exponent\n" % ( out_thresholds[ out_i_min ], out_ks_min[ out_i_min, 1, 0 ], out_ks_min[ out_i_min, 1, 1 ] )
print "IN-degree: The Kolmogorov-Smirnov metric yielded %.1f as the optimal threshold and %.3f (%.3f) as 'optimal' exponent\n" % ( in_thresholds[ in_i_min ], in_ks_min[ in_i_min, 1, 0 ], in_ks_min[ in_i_min, 1, 1 ] )

plt.figure( 3, figsize = ( 10, 5 ) )

plt.subplot(121)
plt.title( "Out degree cCDF" )
plt.plot( out_x, out_pwr_ccdf( out_x ), "k-", linewidth = 2 )
plt.plot( out_deg_ccdf[:,0], out_deg_ccdf[:,1], "bo-", linewidth = 2 )
plt.xlabel( "degree" ) ; plt.ylabel( "probability" )

plt.subplot(122)
plt.title( "In degree cCDF" )
plt.plot( in_x, in_pwr_ccdf( in_x ), "k-", linewidth = 2 )
plt.plot( in_deg_ccdf[:,0], in_deg_ccdf[:,1], "r<-", linewidth = 2 )
plt.xlabel( "degree" ) ; plt.ylabel( "probability" )

plt.show( )

Indeed, both complementary CDFs show decay rates faster than the power law.

### WEB Graph

In [None]:
#####################################################################
## + 0. Read the graph
import networkx as nx
G = nx.read_edgelist( "./data/web_Stanford.txt", create_using = nx.DiGraph( ) );

In [None]:
node_in_degree = dict( G.in_degree( ) )
node_out_degree = dict( G.out_degree( ) )

in_deg = np.array( list( node_in_degree.values( ) ), dtype = int )
out_deg = np.array( list( node_out_degree.values( ) ), dtype = int )

In [None]:
#####################################################################
##+ 1. Are they correspondent to power law?
degree_in_freq = np.array( counts( in_deg ) )
degree_out_freq = np.array( counts( out_deg ) )

plt.title( "Node degree frequency" )
plt.xlabel( "degree" ) ; plt.ylabel( "frequency" )
plt.loglog( degree_out_freq[:,0], degree_out_freq[:,1], "bo" )
plt.loglog( degree_in_freq[:,0], degree_in_freq[:,1], "r<" )
plt.show( )

plt.figure( 1, figsize = ( 10, 5 ) )

plt.subplot(121)
plt.title( "Degree cCDF-loglog" )
out_cc = ccdf( out_deg, 0 )
plt.loglog( out_cc[:,0], out_cc[:,1], "bo-", linewidth = 2 )
in_cc = ccdf( in_deg, 0 )
plt.loglog( in_cc[:,0], in_cc[:,1], "r<-", linewidth = 2 )
plt.xlabel( "degree" ) ; plt.ylabel( "probability" )

plt.subplot(122)

plt.title( "Mean excess plot" )
out_me = mean_excess( out_deg )
plt.loglog( out_me[:,0], out_me[:,1], "bo-", linewidth = 2 )
in_me = mean_excess( in_deg )
plt.loglog( in_me[:,0], in_me[:,1], "r<-", linewidth = 2 )
plt.ylabel( "mean excess" ) ; plt.xlabel( "threshold" )

plt.show( )

The ME plot presents a very common case, in which it is not quite clear whether the mean excesses have an upward trend or not. Neglecting the upper thresholds, both distributions behave consistently with a heavy-tailed distribution. However, the tail of a distribution is by definition its asymptotic behaviour for an increasing threshold. This means that one must look at the unstable estimates of the conditional mean at the right end of the threshold range.

In the case of the WEB graph the $\hat{M}(u)$ for extremely high thresholds shows clear oscillations around some constant level. This suggests an exponential tail, but further investigation is still required.

In [None]:
#####################################################################
##+ 2. Find max and mean values of incoming and outcoming node degrees
print "The degrees range from %d to %d for inward direction and from %d to %d for outward edges" % ( min( in_deg ), max( in_deg ), min( out_deg ), max( out_deg ) ) #, "\n"
print "The average degree over the sample is %.3f (IN) and %.3f (OUT)" % ( np.sum( in_deg ) / ( G.order( ) + 0.0 ), np.sum( out_deg ) / ( G.order( ) + 0.0 ) ) #, "\n"
print "The degree standard deviation is %.3f for the in-degree and %.3f -- out-degree" % ( np.sqrt( np.var( in_deg ) ), np.sqrt( np.var( out_deg ) ) ) #, "\n"
print "The median in- and out-degree is %d and %d respectively" % ( np.median( in_deg ), np.median( out_deg ) ) #, "\n"

In [None]:
#####################################################################
##+ 3. Find $\alpha$ via Maximum Likelihood and calculate $\sigma^2$
##+ 4. Determine $x_{min}$ via Kolmogorov-Smirnov test

in_alpha_ml, in_alpha_ml_se = mle_alpha( in_deg, min( in_deg )+1 )
out_alpha_ml, out_alpha_ml_se = mle_alpha( out_deg, min( out_deg )+1 )

in_thresholds = np.unique( in_deg )
out_thresholds = np.unique( out_deg )

## Run the KS threshold selection routine
in_ks_min = np.array( [ ks_dist( in_deg, u ) for u in in_thresholds ] )
out_ks_min = np.array( [ ks_dist( out_deg, u ) for u in out_thresholds ] )

In [None]:
## Select the x_min that brings the KS metric to its minimum on the given
##  degree data.
in_i_min = np.argmin( in_ks_min[1:,0,0] )+1
out_i_min = np.argmin( out_ks_min[1:,0,0] )+1

## Produce a dataset for cCDF plotting.
in_x = np.arange( in_thresholds[ in_i_min ], 2 * np.max( in_deg ) )
out_x = np.arange( out_thresholds[ out_i_min ], 2 * np.max( out_deg ) )

## Get the empirical complementary distribution function.
in_deg_ccdf = ccdf( in_deg, in_thresholds[ in_i_min ] )
out_deg_ccdf = ccdf( out_deg, out_thresholds[ out_i_min ] )

## ... and the fitted power law.
in_pwr_ccdf = lambda x : ( x / ( in_thresholds[ in_i_min ] + 0.0 ) ) ** ( 1.0 - in_ks_min[ in_i_min, 1, 0 ] )
out_pwr_ccdf = lambda x : ( x / ( out_thresholds[ out_i_min ] + 0.0 ) ) ** ( 1.0 - out_ks_min[ out_i_min, 1, 0 ] )

In [None]:
print "The MLE of the exponent of the inward and outward degree distribution is %.3f (%.4f) and %.3f (%.4f) respectively\n" % ( in_alpha_ml, in_alpha_ml_se, out_alpha_ml, out_alpha_ml_se )

## Produce the hill plot: the correspondence between the threshold
##  and the estimated exponent.
plt.figure( 1, figsize = ( 10, 5 ) )

plt.subplot(121)
plt.title( 'The Hill plot of the degree distribution' )

plt.axhline( y = in_ks_min[ in_i_min, 1, 0 ], linewidth = 1, color = 'r' )
plt.axvline( x = in_thresholds[ in_i_min ], linewidth = 1, color = 'r', linestyle = '--' )
plt.loglog( in_thresholds, in_ks_min[:,1,0], "r<-")

plt.axhline( y = out_ks_min[ out_i_min, 1, 0 ], linewidth = 1, color = 'b' )
plt.axvline( x = out_thresholds[ out_i_min ], linewidth = 1, color = 'b', linestyle = '--' )
plt.loglog( out_thresholds, out_ks_min[:,1,0], "bo-")

plt.ylabel( 'alpha' ) ; plt.xlabel( 'threshold' )

plt.subplot(122)
plt.title( 'The KS metric distance' )

plt.axhline( y = in_ks_min[ in_i_min,0, 0 ], linewidth = 1, color = 'r' )
plt.axvline( x = in_thresholds[ in_i_min ], linewidth = 1, color = 'r', linestyle = '--' )
plt.loglog( in_thresholds, in_ks_min[:,0,0], "r<-")

plt.axhline( y = out_ks_min[ out_i_min,0, 0 ], linewidth = 1, color = 'b' )
plt.axvline( x = out_thresholds[ out_i_min ], linewidth = 1, color = 'b', linestyle = '--' )
plt.loglog( out_thresholds, out_ks_min[:,0,0], "bo-")

plt.ylabel( 'max distance' ) ; plt.xlabel( 'threshold' )
plt.show( )

In [None]:
print "OUT-degree: The Kolmogorov-Smirnov metric yielded %.1f as the optimal threshold and %.3f (%.3f) as 'optimal' exponent\n" % ( out_thresholds[ out_i_min ], out_ks_min[ out_i_min, 1, 0 ], out_ks_min[ out_i_min, 1, 1 ] )
print "IN-degree: The Kolmogorov-Smirnov metric yielded %.1f as the optimal threshold and %.3f (%.3f) as 'optimal' exponent\n" % ( in_thresholds[ in_i_min ], in_ks_min[ in_i_min, 1, 0 ], in_ks_min[ in_i_min, 1, 1 ] )

plt.figure( 3, figsize = ( 10, 5 ) )

plt.subplot(121)
plt.title( "Out degree cCDF" )
plt.plot( out_x, out_pwr_ccdf( out_x ), "k-", linewidth = 2 )
plt.plot( out_deg_ccdf[:,0], out_deg_ccdf[:,1], "bo-", linewidth = 2 )
plt.xlabel( "degree" ) ; plt.ylabel( "probability" )

plt.subplot(122)
plt.title( "In degree cCDF" )
plt.plot( in_x, in_pwr_ccdf( in_x ), "k-", linewidth = 2 )
plt.plot( in_deg_ccdf[:,0], in_deg_ccdf[:,1], "r<-", linewidth = 2 )
plt.xlabel( "degree" ) ; plt.ylabel( "probability" )

plt.show( )

It is clear that the fit of the power law is quite poor for the WEB graph, and at the same time its degree distribution shows weak signs of the existence of extremely high node degrees. This means that it is impossible to definitively determine the tail behaviour of this graph; a much richer sample of the WEB graph is needed.