# <center>Structural Analysis and Visualization of Networks</center>

## <center>Course Project #2</center>

### <center>Student: *Nazarov Ivan*</center>

#### <hr /> General Information

**Hard Deadline:** 21.06.2015 23:59 <br />

Please send your reports to <mailto:leonid.e.zhukov@gmail.com> and <mailto:shestakoffandrey@gmail.com> with message subject of the following structure:<br /> **[HSE Networks 2015] *Nazarov* *Ivan* Project*2***

Support your computations with figures and comments. <br />
If you are using IPython Notebook you may use this file as a starting point of your report.<br />
<br />
<hr />

# Description

## Task 1 

You are provided with the [DBLP dataset](https://www.dropbox.com/s/ft4ekv2f3r43u7b/dblp_2000.csv.gz?dl=0) (warning, raw data!). It contains coauthorships recorded during $2000$-$2014$. In particular, the file contains $3$ columns: the first two hold the authors' names and the third holds the year of publication. This data maps naturally onto an undirected graph.

Your task is to construct a supervised link prediction scheme.

### Guidelines:

0. Use *pandas* module to load and manipulate the dataset in Python
1. Initialize your classification set as follows:
    * Determine training and testing intervals on your time domain (for instance, in the DBLP dataset take the period $2000$-$2010$ as the training period and $2011$-$2014$ as the testing period)
    * Pick pairs of authors that **have appeared during the training interval** but **have not published together** during it
    * These pairs form **positive** or **negative** examples depending on whether they have formed coauthorships **during the testing interval**
    * You have arrived at a binary classification problem. PROFIT!
2. Construct the feature space:
    * Most of our features tend to be topological. Examples of the features can be: (weighted) sum of neighbours, shortest distance, etc
3. Choose at least $4$ classification algorithms from [scikit module](http://scikit-learn.org/stable/) (goes with Anaconda) and compare them in terms of Accuracy, Precision, Recall, F-Score (for positive class) and Mean Squared Error. Use k-fold cross-validation and average your results
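
The construction of the classification set in guideline 1 can be sketched on a toy table. This is only an illustration under stated assumptions: the six rows are made-up data, the column names mirror the ones used later in this notebook, and the helper `undirected` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy coauthorship table in the same three-column layout (made-up data)
toy = pd.DataFrame({
    "author1": ["a", "a", "b", "c", "a", "a"],
    "author2": ["b", "c", "d", "d", "d", "b"],
    "year":    [2005, 2008, 2009, 2010, 2012, 2013],
})

def undirected(frame):
    # Represent every coauthorship as an order-independent pair
    return set(frozenset(p) for p in zip(frame["author1"], frame["author2"]))

train, test = toy[toy.year <= 2010], toy[toy.year > 2010]
train_edges, test_edges = undirected(train), undirected(test)

# Authors seen during the training interval
authors = sorted(set(train["author1"]) | set(train["author2"]))

# Candidate pairs: both authors are known, but they have no joint paper in training
candidates = [(u, v) for i, u in enumerate(authors) for v in authors[i + 1:]
              if frozenset((u, v)) not in train_edges]

# Label: 1 if the pair coauthored during the testing interval, 0 otherwise
labels = np.array([int(frozenset(p) in test_edges) for p in candidates])
```

Enumerating all pairs is quadratic in the number of authors, so on the full DBLP graph the negative class has to be sampled rather than listed.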

In [None]:
import pandas as pd, numpy as np, scipy.sparse as sp
import os, gc, regex as re, time as tm

import matplotlib.pyplot as plt
%matplotlib inline

DATADIR = os.path.realpath( os.path.join( ".", "data", "proj02" ) )

raw_dblp_file = os.path.join( DATADIR, "dblp_2000.csv.gz" )
cached_dblp_file = os.path.join( DATADIR, "dblp_2000.ppdf" )
cached_author_index = os.path.join( DATADIR, "dblp_2000_authors.txt" )

Define some helper functions

In [None]:
## Return a mask of elements of v found in a: optimal for numeric arrays
def match( a, v, return_indices = False ) :
	index = np.argsort( a )
## Get insertion indices
	sorted_index = np.searchsorted( a, v, sorter = index )
## Truncate the indices by the length of a
	index = np.take( index, sorted_index, mode = "clip" )
	mask = a[ index ] == v
## return
	if return_indices :
		return mask, index[ mask ]
	return mask
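
A small usage sketch of the lookup above. The function is restated in compact form so that the block runs on its own, and the arrays `a` and `v` are made-up values.

```python
import numpy as np

def match(a, v):
    # Same searchsorted technique as above, restated so the sketch runs alone
    order = np.argsort(a)
    pos = np.searchsorted(a, v, sorter=order)
    # An insertion position equal to len(a) is clipped to the last element
    idx = np.take(order, pos, mode="clip")
    return a[idx] == v, idx

a = np.array([40, 10, 30, 20])
v = np.array([30, 5, 40, 99])

mask, idx = match(a, v)
# mask flags the entries of v that are present in a;
# idx[mask] gives their positions inside a
found_at = idx[mask]
```

The mask agrees with what `np.isin(v, a)` returns; the positions are the extra piece of information that the sorting-based lookup provides.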

A handy procedure for converting an edge list $(i, j)$ into a symmetric sparse matrix.

In [None]:
## Convert the edgelist into sparse matrix
def to_sparse_coo( u, v, shape, dtype = np.int32 ) :
## Create a COOrdinate sparse matrix from the given ij-indices
	assert( len( u ) == len( v ) )
	return sp.coo_matrix( (
			np.ones( len( u ) + len( v ), dtype = dtype ), (
				np.concatenate( ( u, v ) ), np.concatenate( ( v, u ) ) )
		), shape = shape )

## Remember: when converting COO to CSR/CSC the duplicate coordinate
##  entries are summed!
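
The remark about duplicates can be checked on a toy edge list. The sketch below assumes a three-vertex graph with made-up edges and repeats the symmetrisation inline instead of calling `to_sparse_coo`.

```python
import numpy as np
import scipy.sparse as sp

# Toy edge list in which the edge (0, 1) is listed twice
u = np.array([0, 0, 1])
v = np.array([1, 1, 2])

# Symmetrise by stacking (u, v) and (v, u)
rows, cols = np.concatenate((u, v)), np.concatenate((v, u))
coo = sp.coo_matrix((np.ones(len(rows), dtype=np.int32), (rows, cols)), shape=(3, 3))

# Duplicate coordinates are summed during the conversion: entry (0, 1) becomes 2
csr = coo.tocsr()
counts = csr.toarray()

# Overwriting the stored values with ones collapses multiplicities to 0/1
csr.data = np.ones_like(csr.data)
adjacency = csr.toarray()
```

This is why the adjacency matrices below are binarised right after the conversion to CSR.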

Load the DBLP dataset, making a cached copy if required.

In [None]:
## Create cache if necessary
tick = tm.time( )
if not os.path.exists( cached_dblp_file ) :
	## Load the csv file into a dataframe
	dblp = pd.read_csv( raw_dblp_file, # nrows = 10000,
	## On-the-fly decompression
			compression = "gzip", header = None, quoting = 0,
	## Assign column headers
			names = [ 'author1', 'author2', 'year', ], encoding = "utf-8" )
	## Finish
	tock = tm.time( )
	print "Raw DBLP read in %.3f sec." % ( tock - tick, )
## Map author names to ids
	from sklearn.preprocessing import LabelEncoder
	le = LabelEncoder( ).fit( np.concatenate( (
		dblp["author1"].values, dblp["author2"].values, ) ) )
	dblp_author_index = le.classes_
	for col in [ 'author1', 'author2', ] :
		dblp[ col ] = le.transform( dblp[ col ] )
## Cache
	dblp.to_pickle( cached_dblp_file )
	with open( cached_author_index, "w" ) as out :
		for label in le.classes_ :
			out.write( label.strip( ).encode( "utf-8" ) + "\n" )
	## Keep dblp: it is needed by the cells below
	del le
## Finish
	tick = tm.time( )
	print "Preprocessing took %.3f sec." % ( tick - tock, )
else :
## Load the database from pickled format
	dblp = pd.read_pickle( cached_dblp_file )
## Read the dictionary of authors
	with open( cached_author_index, "r" ) as author_index :
		dblp_author_index = [ line.decode( "utf-8" ).rstrip( "\n" ) for line in author_index ]
## Report
tock = tm.time( )
print "DBLP loaded in %.3f sec." % ( tock - tick, )

Now split the DBLP dataset in two periods: pre and post 2010.

First preprocess the pre 2010 data.

In [None]:
pre2010 = dblp[ dblp.year <= 2010 ].copy( )

Reencode the vertices of the pre-2010 graph in a less wasteful format. Use sklearn's LabelEncoder() to this end.

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder( ).fit( np.concatenate( ( pre2010[ "author1" ].values, pre2010[ "author2" ].values ) ) ) 
pre2010_values = le.classes_

## Recode the edge data
for col in [ 'author1', 'author2', ] :
    pre2010[ col ] = le.transform( pre2010[ col ] )      

Convert the edge list data into a sparse matrix

In [None]:
pre2010_adj = to_sparse_coo(
    pre2010[ "author1" ].values, pre2010[ "author2" ].values,
    shape = 2 * [ len( le.classes_ ) ] )

## Eliminate duplicates by converting them into ones
pre2010_adj = pre2010_adj.tocsr( )
pre2010_adj.data = np.ones_like( pre2010_adj.data )

Find the vertices of the pre-2010 period that also appear in the post-2010 data

In [None]:
post2010 = dblp[ dblp.year > 2010 ]
common_vertices = np.intersect1d( pre2010_values,
	np.union1d( post2010[ "author1" ].values, post2010[ "author2" ].values ) )

Remove completely new vertices from post 2010 data

In [None]:
## Take a copy, so that the recoding below does not write into a view of dblp
post2010 = post2010[ (
    match( common_vertices, post2010[ "author1" ].values ) &
    match( common_vertices, post2010[ "author2" ].values ) ) ].copy( )
del common_vertices

Map the post 2010 vertices to pre 2010 vertices and construct the adjacency matrix.

In [None]:
for col in [ 'author1', 'author2', ] :
    post2010[ col ] = le.transform( post2010[ col ] )

## The adjacency matrix
post2010_adj = sp.coo_matrix( ( np.ones( post2010.shape[0], dtype = np.bool ),
        ( post2010[ "author1" ].values, post2010[ "author2" ].values )
    ), shape = pre2010_adj.shape ).tolil( )

Leave only those edges in the post 2010 dataset, which had not existed during 2000-2010.

In [None]:
post2010_adj[ pre2010_adj.nonzero( ) ] = 0

Eliminate duplicate edges and transform into a CSR format

In [None]:
post2010_adj = post2010_adj.tocsr( )
post2010_adj.data = np.ones_like( post2010_adj.data )

Here we have two aligned symmetric adjacency matrices: one for the edges that existed up to 2010 and one for the new edges formed after 2010.

In [None]:
print post2010_adj.__repr__()
print pre2010_adj.__repr__( )

All edges of the post-2010 graph are included and considered to be positive examples

In [None]:
positive = np.append( *( c.reshape((-1, 1)) for c in post2010_adj.nonzero( ) ), axis = 1 )

Now a slightly harder part is to generate an adequate number of negative examples, so that the final training sample would be balanced.

In [None]:
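## Sample random vertex pairs, then drop self-pairs and pairs linked in either period
negative = np.random.choice( pre2010_adj.shape[ 0 ], size = ( 2 * positive.shape[0], positive.shape[1] ) )
linked = ( pre2010_adj + post2010_adj + post2010_adj.T )[ negative[ :, 0 ], negative[ :, 1 ] ].getA1( ) != 0
negative = negative[ ~linked & ( negative[ :, 0 ] != negative[ :, 1 ] ) ]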

Compile the final training dataset.

In [None]:
E = np.vstack( ( positive, negative ) )
y = np.vstack( ( np.ones( ( positive.shape[ 0 ], 1 ), dtype = np.float ),
                np.zeros( ( negative.shape[ 0 ], 1 ), dtype = np.float ) ) )

So, finally, we got ourselves a training set of edges with a roughly 2:1 negative-to-positive ratio.

## Feature construction

The first pair of features is the degrees of the edge endpoints: for $(i,j) \in V\times V$
$$ \phi^1_{ij} = |N_i|\,\text{ and }\,\phi^2_{ij} = |N_j|\,,$$
where $N_v$ is the set of adjacent vertices of a node $v$.

In [None]:
def phi_degree( edges, A ) :
	deg = A.sum( axis = 1 ).astype( np.float )
	return np.append( deg[ edges[ :, 0 ] ], deg[ edges[ :, 1 ] ], axis = 1 )

It turns out that at least two edge features can be constructed via a so called "sandwich" matrix.

In [None]:
def __sparse_sandwich( edges, A, W = None ) :
    AA = A.dot( A.T ) if W is None else A.dot( W ).dot( A.T )
    result = AA[ edges[:,0], edges[:,1] ]
    del AA ; gc.collect( 0 ) ; gc.collect( 1 ) ; gc.collect( 2 )
    return result.reshape(-1, 1)

The next feature is the Adamic/Adar score: for $(i,j)\in V\times V$
$$ \phi^3_{ij} = \sum_{v\in N_i \cap N_j } \frac{1}{\log |N_v|}\,.$$

Another feature is the number of neighbours shared by the endpoints:
$$\phi^4_{ij} = |N_i\cap N_j|\,.$$

In fact both features are special cases of the same formula:
$$ (\phi_{ij}) = A W A'\,, $$
where $W$ is the weight matrix. In the case of common neighbours it is the identity matrix $I$, whereas for Adamic/Adar it is the diagonal matrix of reciprocal degree logarithms:
$$ W = \text{diag}\Bigl( \frac{1}{\log |N_i|} \Bigr)_{i\in V}\,.$$
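
The sandwich formula can be verified on a toy graph. The sketch below assumes a made-up undirected graph on five vertices and, as in the extractor that follows, sets the weight of degree-one vertices to zero, since $\log 1 = 0$.

```python
import numpy as np
import scipy.sparse as sp

# Toy undirected graph: 0-1, 0-2, 0-3, 1-2, 1-3, 2-4
edges = np.array([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 4)])
rows = np.concatenate((edges[:, 0], edges[:, 1]))
cols = np.concatenate((edges[:, 1], edges[:, 0]))
A = sp.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(5, 5)).tocsr()

# Diagonal of W: reciprocal log-degrees, with 1 / log(1) replaced by zero
deg = np.asarray(A.sum(axis=1)).ravel()
with np.errstate(divide="ignore"):
    w = 1.0 / np.log(deg)
w[np.isinf(w)] = 0.0

# W = I gives the common neighbours count, W = diag(w) gives Adamic/Adar
common = A.dot(A.T).toarray()
adamic = A.dot(sp.diags(w, 0)).dot(A.T).toarray()

# Vertices 0 and 1 share the neighbours 2 (degree 3) and 3 (degree 2)
expected_01 = 1.0 / np.log(3) + 1.0 / np.log(2)
```

Off-diagonal entries hold the pairwise scores; the diagonal of $A A'$ is just the vertex degree and is not used as a feature.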

In [None]:
def phi_adamic_adar( edges, A ) :
    inv_log_deg = 1.0 / np.log( A.sum( axis = 1 ).getA1( ) )
    inv_log_deg[ np.isinf( inv_log_deg ) ] = 0
    result = __sparse_sandwich( edges, A, sp.diags( inv_log_deg, 0 ) )
    del inv_log_deg ; gc.collect( 0 ) ; gc.collect( 1 ) ; gc.collect( 2 )
    return result

In [None]:
def phi_common_neighbours( edges, A ) :
    return __sparse_sandwich( edges, A )

Yet another potential feature is the so-called personalized PageRank. It is the same PageRank score, except that the random walk may teleport to a single node only.

In particular, the global PageRank is the stationary distribution of the Markov chain with this transition kernel:
$$ M = \beta P + (1-\beta) \frac{1}{|V|}\mathbb{1} \mathbb{1}'\,, $$
where $\mathbb{1} = (1)_{v\in V}$ and $ P = \bigl(D_{uu}^{-1} A_{uv} + \frac{1}{|V|} 1_{\delta^+_u=0} \bigr)_{u,v\in V}$ is the transition probability matrix $u\leadsto v$, which eliminates the sink (dangling) nodes by connecting them to every other vertex. The normalizing matrix is
$$D=\text{diag}\bigl( \delta^+_v + 1_{\delta^+_v=0} \bigr)_{v\in V}\,,$$
where the out-degree is $\delta^+_v = \sum_{j\in V} A_{vj}$.

The personalized PageRank for some node $w\in V$ is only slightly different: the random walk is still forced to teleport away from a dangling vertex to any other vertex, but the general teleportation is altered so that the walk always restarts from node $w$.

Let $R\in \{0,1\}^{1\times V}$ be the row indicator of the restart node, $R = e_w'$. Then the personalized PageRank is the stationary distribution of a random walk with this transition matrix:
$$ M_w = \beta P + (1-\beta) \frac{1}{\|R\|_0}\mathbb{1} R\,, $$
where $\|R\|_0$ denotes the number of nonzero elements in $R$.

The stationary distribution is the left eigenvector of the transition kernel with eigenvalue $1$: $ \pi = \pi M_w$, $\pi\in [0,1]^{1\times V}$ and $\pi\mathbb{1} = 1$. It is computed using the power iteration method, which converges to the eigenvector of the dominant eigenvalue. For aperiodic, irreducible stochastic matrices the Perron-Frobenius theorem states that such an eigenvector exists and is unique, and that all eigenvalues of the matrix lie within the unit disc.

The basic iteration $\pi_1 = \pi_0 M_w$, with the dangling vertex elimination written out, is
$$ \pi_1 = \beta \pi_0 D^{-1} A + \beta\, (\pi_0 d)\, \frac{1}{|V|} \mathbb{1}' + (1-\beta)\, (\pi_0 \mathbb{1})\, \frac{1}{\|R\|_0} R\,, $$
where $d = (1_{\delta^+_u=0})_{u\in V}$ is the column indicator of the dangling vertices. The `__sparse_pagerank` routine below redistributes the dangling mass $\pi_0 d$ along the restart distribution instead of uniformly; for the global PageRank the two coincide.
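
Before the sparse routine, the power iteration can be illustrated with a dense sketch. It follows the definition of $P$ given above, so dangling rows jump uniformly to any vertex; the four-vertex directed graph is made up and the function name `personalized_pagerank` is hypothetical.

```python
import numpy as np

def personalized_pagerank(A, w, beta=0.85, tol=1e-12, max_iter=1000):
    n = A.shape[0]
    out = A.sum(axis=1)
    dangling = (out == 0)
    # Row-stochastic P: ordinary rows are normalised by the out-degree,
    # rows of dangling vertices are replaced by the uniform distribution
    P = np.where(dangling[:, None], 1.0 / n,
                 A / np.where(dangling, 1.0, out)[:, None])
    # The walk restarts from the single vertex w
    restart = np.zeros(n)
    restart[w] = 1.0
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        pi_next = beta * pi.dot(P) + (1.0 - beta) * pi.sum() * restart
        if np.abs(pi_next - pi).sum() <= tol:
            return pi_next
        pi = pi_next
    return pi

# Toy directed graph: 0 -> 1, 1 -> 2, 2 -> 0, 2 -> 3; vertex 3 is dangling
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)

pi = personalized_pagerank(A, w=0)
```

The resulting vector sums to one, and the restart vertex always receives at least the teleportation mass $1-\beta$.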

In [None]:
def __sparse_pagerank( A, beta = 0.85, one = None, niter = 1000, rel_eps = 1e-6, verbose = True ) :
## Initialize the teleportation distribution(s): one row per restart vector
	one = np.ones( ( 1, A.shape[ 0 ] ), dtype = np.float ) if one is None else np.asarray( one, dtype = np.float )
	one = one / one.sum( axis = 1 ).reshape( ( -1, 1 ) )
## Get the out-degree
	out = np.asarray( A.sum( axis = 1 ).getA1( ), dtype = np.float )
## Obtain the indices of dangling vertices
	dangling = np.where( out == 0.0 )[ 0 ]
## Correct the out-degree for sink nodes
	out[ dangling ] = 1.0
## Start from the uniform distribution
	pi = np.full( ( one.shape[0], A.shape[0] ), 1.0 / A.shape[ 0 ], dtype = np.float )
	kiter, status = 0, -1
## Make a row-stochastic matrix (the rows of dangling vertices remain zero)
	P = sp.diags( 1.0 / out, 0, dtype = np.float ).dot( A ).tocsr( )
	while kiter < niter :
## Make a copy of the current ranking estimates
		pi_last = pi.copy( )
## First the random walk part: pi P is computed as ( P' pi' )'
		pi = beta * P.T.dot( pi_last.T ).T
## Now the teleportation ...
		pi += ( 1 - beta ) * one
##  ... and the dangling vertices part: their mass follows the teleportation distribution
		if len( dangling ) > 0 :
			pi += beta * one * np.sum( pi_last[ :, dangling ], axis = 1 ).reshape( ( -1, 1 ) )
## Normalize each row
		pi /= np.sum( pi, axis = 1 ).reshape( ( -1, 1 ) )
		if np.sum( np.abs( pi - pi_last ) ) <= one.shape[0] * rel_eps * np.sum( np.abs( pi_last ) ) :
			status = 0
			break
## Next iteration
		kiter += 1
		if verbose and kiter % 10 == 0 :
			print kiter
	return pi, status, kiter

Now the feature extractors themselves: the global PageRank and the personalized (rooted) PageRank.

In [None]:
## The global pagerank score
def phi_gpr( edges, A, verbose = True ) :
	pi, s, k = __sparse_pagerank( A, one = None, verbose = verbose )
	return np.concatenate( ( pi[ :, edges[ :, 0 ] ], pi[ :,edges[ :, 1 ] ] ), axis = 0 ).T

In [None]:
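def phi_ppr( edges, A, verbose = True ) :
## NOTE: the rooted pagerank is not implemented yet; as a stand-in this returns
##  the common neighbours score, so the result has a single column.
	return __sparse_sandwich( edges, A )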

## Computing the features

Vertex degrees

In [None]:
tick = tm.time()
phi_12 = phi_degree( E, pre2010_adj )
tock = tm.time()
print "Vertex degree computed in %.3f sec." % ( tock - tick, )

Adamic/Adar metric

In [None]:
tick = tm.time()
phi_3 = phi_adamic_adar( E, pre2010_adj )
tock = tm.time()
print "Adamic/adar computed in %.3f sec." % ( tock - tick, )

Common neighbours

In [None]:
tick = tm.time()
phi_4 = phi_common_neighbours( E, pre2010_adj )
tock = tm.time()
print "Common neighbours computed in %.3f sec." % ( tock - tick, )

Global Pagerank

In [None]:
tick = tm.time()
phi_56 = phi_gpr( E, pre2010_adj, verbose = False )
tock = tm.time()
print "Global pagerank computed in %.3f sec." % ( tock - tick, )

Rooted (personalized) pagerank

In [None]:
tick = tm.time()
phi_78 = phi_ppr( E, pre2010_adj, verbose = False )
tock = tm.time()
print "Personalized pagerank computed in %.3f sec." % ( tock - tick, )

Shortest path distances (this cell is disabled, since `phi_shortest_paths` is not defined above)

In [None]:
# tick = tm.time()
# phi_5 = phi_shortest_paths( E, pre2010_adj )
# tock = tm.time()
# print "Shortest paths computed in %.3f sec." % ( tock - tick, )

Collect all features into a numpy matrix

In [None]:
X = np.hstack( ( phi_12, phi_3, phi_4, phi_56, phi_78 ) )

Having computed all the features, let's take a subsample so that the classification runs faster.

In [None]:
from sklearn.cross_validation import train_test_split

In [None]:
X_modelling, X_main, y_modelling, y_main = train_test_split( X, y.ravel( ), train_size = 0.20 )

Import scikit-learn's grid search and cross-validation modules.

In [None]:
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score

We are going to analyze many classifiers at once.

In [None]:
classifiers = list( )

### Logistic Regression

Logistic regression for binary classification solves the following problem on the training dataset $(x_i,t_i)_{i=1}^n \in \mathbb{R}^{1+p}\times \{0,1\}$:
$$ -\sum_{i=1}^n \Bigl( t_i \log \sigma( \beta'x_i ) + (1-t_i) \log \bigl( 1-\sigma( \beta'x_i ) \bigr) \Bigr) \to \min_{\beta} \,, $$
where $\sigma(z) = \bigl(1+e^{-z}\bigr)^{-1}$ and the first component of every $x_i$ is the constant $1$, so that $\beta\in\mathbb{R}^{1+p}$ includes the intercept. The classification is done using the following rule:
$$ \hat{t}(x) = \mathop{\text{argmax}}_{k\in\{0,1\}}\, \mathbb{P}(T=k|X=x)\,, $$
where $\mathbb{P}(T=1|X=x) = \sigma(\beta'x)$.
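
The quantities in these formulas can be evaluated directly. The sketch below uses a made-up one-feature dataset with an explicit intercept column; the names `log_likelihood` and `classify` and both coefficient vectors are hypothetical, and no fitting is performed.

```python
import numpy as np

def sigma(z):
    # Logistic function
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, t):
    # Sum of t_i log sigma(beta'x_i) + (1 - t_i) log(1 - sigma(beta'x_i))
    p = sigma(X.dot(beta))
    return np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

def classify(beta, X):
    # Pick the more probable class: P(T=1|x) > 1/2 means class 1
    return (sigma(X.dot(beta)) > 0.5).astype(int)

# The first column is the constant 1 (intercept), the second is the feature
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0, 0, 1, 1])

beta_good = np.array([0.0, 2.0])    # separates the two classes
beta_bad = np.array([0.0, -2.0])    # points the wrong way

ll_good = log_likelihood(beta_good, X, t)
ll_bad = log_likelihood(beta_bad, X, t)
predicted = classify(beta_good, X)
```

The coefficient vector that separates the toy classes attains the higher log-likelihood, which is the criterion the fitting procedure optimises.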

In [None]:
from sklearn.linear_model import LogisticRegression

LR_grid = GridSearchCV( LogisticRegression( ), cv = 10, verbose = 1,
        param_grid = { "C" : np.logspace( -2, 2, num = 5 ) }, n_jobs = -1
    ).fit( X_modelling, y_modelling )

classifiers.append( ( "Logistic", LR_grid.best_estimator_ ) )

### Linear and Quadratic Discriminant Analysis

Simple models sometimes beat more complicated ones in terms of accuracy, so let's consider LDA and QDA as well.

In [None]:
from sklearn.lda import LDA
from sklearn.qda import QDA

classifiers.append( ( "LDA", LDA( ) ) )
classifiers.append( ( "QDA", QDA( ) ) )

### Decision tree classifiers

Let's employ the classification tree model. On its own a decision tree is a volatile classifier: the addition of new data can dramatically alter its structure. That is why we also use boosted trees and random forests. These ensemble methods capture the intrinsic nonlinear structure of the data by combining many weak classifiers, each focusing on a different aspect of the dataset.

#### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

RF_grid = GridSearchCV( RandomForestClassifier( n_estimators = 50 ), cv = 10, verbose = 1,
        param_grid = { "max_depth" : [ 3, 5, 15, 30 ] }, n_jobs = -1
    ).fit( X_modelling, y_modelling )

classifiers.append( ( "RandomForest", RF_grid.best_estimator_ ) )

#### Boosted tree (AdaBoost)

In [None]:
from sklearn.ensemble import AdaBoostClassifier

classifiers.append( ( "AdaBoost", AdaBoostClassifier( n_estimators = 50 ) ) )

#### Simple tree

One does not expect a simple tree to do comparably well against ensemble classifiers.

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree_grid = GridSearchCV( DecisionTreeClassifier( criterion = "gini" ), cv = 10, verbose = 1,
        param_grid = { "max_depth" : [ 3, 5, 15, 30 ] }, n_jobs = -1
    ).fit( X_modelling, y_modelling )

classifiers.append( ( "Tree", tree_grid.best_estimator_ ) )

### $k$-Nearest Neighbours

Another, rather engineering-flavoured approach to classification follows a simple rule: if the majority of a point's $k$ nearest neighbours belong to class $c$, then the point is very likely to come from class $c$ as well. Know them by their friends!

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_grid = GridSearchCV( KNeighborsClassifier(  ), cv = 10, verbose = 1,
        param_grid = { "n_neighbors" : [ 3, 5, 15, 30 ] }, n_jobs = -1
    ).fit( X_modelling, y_modelling )

classifiers.append( ( "k-NN", knn_grid.best_estimator_ ) )

### Support Vector Machine classification

In [None]:
from sklearn.svm import SVC

In [None]:
from sklearn.linear_model import SGDClassifier

## Testing

Split the dataset into train and test.

In [None]:
X_train, X_test, y_train, y_test = train_test_split( X_main, y_main, train_size = 0.20 )

Shuffle the train dataset (the cap on the subsample size is commented out)

In [None]:
subsample = np.random.permutation( X_train.shape[ 0 ] )#[ : 50000 ]
X_train_subsample, y_train_subsample = X_train[ subsample ], y_train[ subsample ]

Run tests

In [None]:
results = dict()
for name, clf in classifiers :
    tick = tm.time( )
    results[name] = cross_val_score( clf, X_train_subsample, y_train_subsample, n_jobs = -1, verbose = 1, cv = 10 )
    tock = tm.time( )
    print "k-fold crossvalidation for %s took %.3f sec." % ( name, tock - tick, )

In [None]:
k_fold_frame = pd.DataFrame( results )

In [None]:
# k_fold_frame.append( k_fold_frame.apply( np.average ), ignore_index = True )
k_fold_frame.apply( np.average )
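
The averages above are based on a single score per fold, which for classifiers in `cross_val_score` defaults to accuracy. The task also asks for precision, recall, F-score and mean squared error, so here is a minimal numpy-only sketch of how these could be averaged over folds. The helpers `binary_metrics` and `k_fold_average`, the stand-in classifier `midpoint_rule` and the synthetic data are all hypothetical, not part of the pipeline above.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    # Metrics for the positive class (label 1) plus accuracy and MSE
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / float(tp + fp) if tp + fp > 0 else 0.0
    recall = tp / float(tp + fn) if tp + fn > 0 else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom > 0 else 0.0
    return {"accuracy": np.mean(y_true == y_pred), "precision": precision,
            "recall": recall, "f_score": f_score,
            "mse": np.mean((y_true - y_pred) ** 2)}

def k_fold_average(fit_predict, X, y, k=5, seed=0):
    # Shuffle the indices once and cut them into k folds
    folds = np.array_split(np.random.RandomState(seed).permutation(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        y_hat = fit_predict(X[train], y[train], X[test])
        scores.append(binary_metrics(y[test], y_hat))
    # Average every metric over the folds
    return {name: float(np.mean([s[name] for s in scores])) for name in scores[0]}

def midpoint_rule(X_train, y_train, X_test):
    # Stand-in classifier: threshold the first feature halfway between class means
    cut = 0.5 * (X_train[y_train == 1, 0].mean() + X_train[y_train == 0, 0].mean())
    return (X_test[:, 0] > cut).astype(float)

# Synthetic, well separated one-feature data
rng = np.random.RandomState(1)
y_toy = np.repeat([0.0, 1.0], 50)
X_toy = (y_toy + 0.1 * rng.randn(100)).reshape(-1, 1)

check = binary_metrics([1, 1, 0, 0], [1, 0, 1, 0])
averaged = k_fold_average(midpoint_rule, X_toy, y_toy)
```

Any of the fitted classifiers could take the place of `midpoint_rule` by wrapping its `fit` and `predict` calls in a function with the same three arguments.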

## Task 2

Consider the [flickr dataset](https://www.dropbox.com/s/srsib3hq863drtp/flickr_data.tar.gz?dl=0) (warning, raw data!). <br/>
File "*users.txt*" provides a table of the form *userID*, *enterTimeStamp*, *additionalInfo*... <br/>
File "*contacts.txt*" consists of pairs of *userID*'s and the link establishment timestamp <br/>

Recall *scoring functions* for link prediction. Your task is to compare the performance of each scoring function as follows:
1. TOP-$n$ accuracy
    * Denote the number of links $E_\text{new}$ that appeared during the testing period as $n$
    * Denote the ranked list of node pairs provided by score $s$ as $\hat{E}_s$
    * Take the top-$n$ pairs from $\hat{E}_s$ and intersect them with $E_\text{new}$. Performance is measured as the size of the resulting set
2. ROC and AUC ('star' subtask)
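
The TOP-$n$ accuracy described in point 1 can be sketched as follows. The function name `top_n_accuracy`, the candidate pairs and the scores are made up for illustration, and the list of new links is assumed to contain no duplicates.

```python
import numpy as np

def top_n_accuracy(scores, candidate_pairs, new_edges):
    # n is the number of links that appeared during the testing period
    n = len(new_edges)
    # Rank the candidate pairs by decreasing score (stable for ties), keep the top n
    order = np.argsort(-np.asarray(scores, dtype=float), kind="mergesort")[:n]
    top = set(frozenset(candidate_pairs[i]) for i in order)
    # Size of the intersection with the truly new links, ignoring pair orientation
    return len(top & set(frozenset(e) for e in new_edges))

candidate_pairs = [(0, 1), (0, 2), (1, 3), (2, 3)]
scores = [3.0, 0.0, 2.0, 1.0]
new_edges = [(1, 0), (2, 3)]

hits = top_n_accuracy(scores, candidate_pairs, new_edges)
```

Here the two best-scored pairs are $(0,1)$ and $(1,3)$, of which only the first is among the new links.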

Essentially, for this task you also have to follow guideline points $1$ and $2$ above. The only thing to keep in mind is that the flickr dataset is a growing dataset. Hence, consider only the nodes that are significantly represented in both the training and testing intervals (for instance, nodes with at least $5$ adjacent edges in each interval)