# <center>Structural Analysis and Visualization of Networks</center>

## <center>Course Project #2</center>

### <center>Student: *Nazarov Ivan*</center>

#### <hr /> General Information

**Hard Deadline:** 21.06.2015 23:59 <br />

Please send your reports to <mailto:leonid.e.zhukov@gmail.com> and <mailto:shestakoffandrey@gmail.com> with a message subject of the following structure:<br /> **[HSE Networks 2015] *{LastName}* *{First Name}* Project*{Number}***

Support your computations with figures and comments. <br />
If you are using IPython Notebook you may use this file as a starting point of your report.<br />
<br />
<hr />

# Description

## Task 1 

You are provided with the [DBLP dataset](https://www.dropbox.com/s/ft4ekv2f3r43u7b/dblp_2000.csv.gz?dl=0) (warning, raw data!). It contains coauthorships that were revealed during $2000$-$2014$. Specifically, the file contains $3$ columns: the first two for the authors' names and the third for the year of publication. This data can be naturally mapped to an undirected graph structure.

Your task is to construct a supervised link prediction scheme.

### Guidelines:

0. Use *pandas* module to load and manipulate the dataset in Python
1. Initialize your classification set as follows:
    * Determine training and testing intervals on your time domain (for instance, in DBLP dataset take a period $2000$-$2010$ as training period and $2011$-$2014$ as testing period)
    * Pick pairs of authors that **have appeared during training interval** but **have not published together** during it
    * These pairs form **positive** or **negative** examples depending on whether they have formed coauthorships **during the testing interval**
    * You have arrived at a binary classification problem. PROFIT!
2. Construct feature space:
    * Most of our features tend to be topological. Examples of such features are: (weighted) sum of neighbours, shortest distance, etc.
3. Choose at least $4$ classification algorithms from [scikit module](http://scikit-learn.org/stable/) (goes with Anaconda) and compare them in terms of Accuracy, Precision, Recall, F-Score (for positive class) and Mean Squared Error. Use k-fold cross-validation and average your results
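
To make the comparison criteria in point 3 concrete, here is a minimal sketch of the five metrics for hard 0/1 predictions, plus a helper that averages them over folds. The function names `binary_metrics` and `average_over_folds` are hypothetical, and the fold predictions are assumed to come from whatever classifier is being evaluated.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 (positive class) and MSE for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n = float(len(pairs))
    # Guard against empty denominators when a class is never predicted or present
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mse": sum((t - p) ** 2 for t, p in pairs) / n,
    }


def average_over_folds(fold_results):
    """Average each metric over a list of per-fold metric dictionaries."""
    keys = fold_results[0].keys()
    return {k: sum(r[k] for r in fold_results) / float(len(fold_results)) for k in keys}


toy_metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Note that for hard 0/1 predictions the mean squared error equals one minus the accuracy, so it only carries extra information when computed on predicted probabilities.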

In [1]:
import pandas as pd, networkx as nx, numpy as np, scipy.sparse as sp
import os, regex as re, time as tm

import matplotlib.pyplot as plt
%matplotlib inline

DATADIR = os.path.realpath( os.path.join( ".", "data", "proj02" ) )

Pity that scikit-learn has no out-of-the-box encoder that assigns one shared set of labels across several columns, so here is a small one.

In [2]:
from sklearn.preprocessing import LabelEncoder
class MultiColumnLabelEncoder :
    def __init__( self, columns = None ):
        self.columns = columns
        self.__le = LabelEncoder(  )
    def fit( self, X, y = None ) :
        self.__columns = X.columns if self.columns is None else self.columns
## Initialize the label encoder and make it assign labels to the pooled columns
        self.__le.fit( pd.concat( X[ col ] for col in self.__columns ) )
        self.classes_ = self.__le.classes_
        return self
    def transform( self, X, copy = True ) :
## Copy the input dataframe and figure out what columns to re-code
        __output = X.copy( ) if copy else X
## Iterate over the required columns
        for col in self.__columns :
            __output[ col ] = self.__le.transform( __output[ col ] )
        return __output
    def fit_transform( self, X, y = None ) :
        return self.fit( X, y ).transform( X )
    def inverse_transform( self, y ) :
        return self.__le.inverse_transform( y )
    def set_params( self, **params ) :
        pass
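
The idea behind the encoder, pooling several columns so that each name gets exactly one integer label, can be illustrated without scikit-learn. The sketch below uses `np.unique` with `return_inverse` on hypothetical toy columns; it is an illustration of the principle, not a replacement for the class above.

```python
import numpy as np

# Hypothetical toy data: two author columns that share names
author1 = np.array(["Ann", "Bob", "Ann"])
author2 = np.array(["Bob", "Cid", "Cid"])

# Pool both columns so that every name receives exactly one integer label
classes, codes = np.unique(np.concatenate((author1, author2)), return_inverse=True)

# Split the pooled codes back into the two original columns
codes1, codes2 = codes[:len(author1)], codes[len(author1):]
```

Indexing `classes` with the codes recovers the original names, which plays the role of `inverse_transform`.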

Load the raw DBLP data

In [3]:
if not os.path.exists( os.path.join( DATADIR, "dblp_dataframe.ppdf" ) ) :
## Start
    tick = tm.time( )
## Load the raw DBLP dataset
    dblp_raw = pd.read_csv( os.path.join( DATADIR, "dblp_2000.csv.gz" ), # nrows = 100,
## On-the-fly decompression
                        compression = "gzip", header = None, quoting = 0,
## Assign column headers
                        names = [ 'author1', 'author2', 'year', ], encoding = "utf-8" )
    tock = tm.time( )
    print "DBLP loaded in %.3f sec." % ( tock - tick, )
## Pool the author columns together and let Pandas assign labels.
    le = MultiColumnLabelEncoder( [ 'author1', 'author2', ] )
    dblp = le.fit_transform( dblp_raw )
    authors_index = le.classes_
    del dblp_raw, le
    tick = tm.time( )
    print "DBLP preprocessed in %.3f sec." % ( tick - tock, )
## Cache
    dblp.to_pickle( "./data/proj02/dblp_dataframe.ppdf" )
    with open( "./data/proj02/author_index.dic", "wb" ) as out :
        out.writelines( label + "\n" for label in authors_index )
## Report
    tock = tm.time( )
    print "DBLP cached in %.3f sec." % ( tock - tick, )
else :
## Start
    tick = tm.time( )
## Load the database from pickled format
    dblp = pd.read_pickle( "./data/proj02/dblp_dataframe.ppdf" )
## Read the dictionary of authors
    with open( "./data/proj02/author_index.dic", "rb" ) as out :
        authors_index = out.readlines( )
## Finish
    tock = tm.time( )
    print "DBLP loaded in %.3f sec." % ( tock - tick, )

DBLP loaded in 5.721 sec.


It so happens that authors' names are quoted and their unicode letters escaped.

In [4]:
def get_author( txt ) :
    return re.sub( r"^\s*\"(.*)\"\s*$", r"\1", txt ).decode( "unicode_escape" )
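
The function above relies on Python 2 byte strings. For illustration, a possible Python 3 counterpart is sketched below; the name `get_author_py3` and the sample input are hypothetical, the standard `re` module is used in place of `regex`, and the raw text is assumed to be pure ASCII with all other letters escaped.

```python
import re


def get_author_py3(txt):
    """Strip surrounding quotes and whitespace, then resolve \\uXXXX escapes."""
    unquoted = re.sub(r'^\s*"(.*)"\s*$', r"\1", txt)
    # Assumes the raw text is pure ASCII, i.e. all non-ASCII letters are escaped
    return unquoted.encode("ascii").decode("unicode_escape")


# Hypothetical raw entry, as it might appear in the cached author index
example = get_author_py3(' "J\\u00fcrgen M\\u00fcller"\n')
```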

## Analysis

Let's split the data in two non-overlapping samples by year:
* the coauthorship data from 2000 to 2010 form the training data;
* the collaboration since 2011 is the test sample.

Constructing the train and test samples

In [68]:
dblp_X, dblp_y = dblp[ dblp.year <= 2010 ], dblp[ dblp.year >= 2011 ]

* Pick pairs of authors that **have appeared during training interval** but **have not published together** during it;
* These pairs form **positive** or **negative** examples depending on whether they have formed coauthorships **during the testing interval**;
* You have got yourself a binary classification problem.

from "M. Al Hasan, V. Chaoji, S. Salem, M. Zaki, Link prediction using supervised learning. Proceedings of SDM workshop on link analysis, 2006":

Each article bears, at least, its author information and publication year. To predict a link, we partition the range of publication year into two non-overlapping subranges. The first sub-range is selected as train years and the later one as the test years. Then, we prepare the classification dataset, by choosing those author pairs, that appeared in the train years, but did not publish any papers together in those years. Each such pair either represent a positive example or a negative example, depending on whether those author pairs published at least one paper in the test years or not. Coauthoring a paper in test years by a pair of authors, establishes a link between them, which was not there in the train years. Classification model of link prediction problem needs to predict this link by successfully distinguishing the positive classes from the dataset. Thus, link prediction problem can be posed as a binary classification problem, that can be solved by employing effective features in a supervised learning framework.
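
On a toy graph the construction reads as follows. This is a minimal sketch with hypothetical names (`labelled_pairs`, `toy_train`, `toy_test`); it enumerates all vertex pairs, which is quadratic in the number of authors and therefore only viable for small graphs. For the full DBLP data some form of sampling of candidate pairs would be needed.

```python
from itertools import combinations


def labelled_pairs(train_edges, test_edges):
    """Label non-adjacent training pairs by whether they link in the test period."""
    train = {frozenset(e) for e in train_edges}
    test = {frozenset(e) for e in test_edges}
    # Only authors that appeared during the training interval are considered
    nodes = sorted({u for e in train for u in e})
    examples = {}
    for u, v in combinations(nodes, 2):
        pair = frozenset((u, v))
        # Skip pairs that already coauthored during the training interval
        if pair not in train:
            examples[(u, v)] = int(pair in test)
    return examples


toy_train = [(0, 1), (1, 2), (2, 3)]
toy_test = [(0, 2), (1, 2), (3, 4)]
toy_examples = labelled_pairs(toy_train, toy_test)
```

Here the pair `(0, 2)` is the only positive example; the test edge `(3, 4)` is ignored because author `4` never appeared during training, and `(1, 2)` is ignored because it is not a new edge.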

In [111]:
def to_sparse( df, shape = ( len( authors_index ), len( authors_index ) ), dtype = np.bool ) :
    return sp.coo_matrix( (
            np.ones( 2 * len( df ), dtype = dtype ), (
                np.concatenate( ( df[ "author1" ].values, df[ "author2" ].values ) ),
                np.concatenate( ( df[ "author2" ].values, df[ "author1" ].values ) ) )
        ), shape = shape )

A handy procedure to quickly find which elements of one array are in another. 

In [70]:
def match( a, b ) :
## Get insertion indices (the array a must be sorted in ascending order)
    indices = np.searchsorted( a, b )
## Truncate the indices by the length of a
    mask = indices < len( a )
    result = np.zeros( len( b ), dtype = np.bool )
    result[ mask ] = a[ indices[ mask ] ] == b[ mask ]
    return result
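
A small check of the procedure on toy arrays is sketched below. The function is restated under the hypothetical name `match_sorted` with the builtin `bool` dtype, and its result is compared against `np.isin`, which answers the same membership question without requiring a sorted first argument.

```python
import numpy as np


def match_sorted(a, b):
    """Boolean mask over b: True where the element also occurs in the sorted array a."""
    # Position at which each element of b would be inserted into a
    indices = np.searchsorted(a, b)
    # Elements larger than max(a) get index len(a) and cannot be matches
    mask = indices < len(a)
    result = np.zeros(len(b), dtype=bool)
    result[mask] = a[indices[mask]] == b[mask]
    return result


a = np.array([2, 5, 7, 11])          # must be sorted in ascending order
b = np.array([7, 1, 11, 12, 5, 6])
mask = match_sorted(a, b)
```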

The basic idea is to predict the edges in $G_1 = G[2011,2014]$ using the data in $G_0 = G[2000, 2010]$. The training sample is $\bigl( (u,v), t_{uv} \bigr)_{u,v\in G_0}$ where $t_{uv}$ indicates whether the edge $\langle u,v \rangle$ is in $G_1$ but not in $G_0$.

It is like predicting $y_{t+1}$ given $y_t$ in time-series, but instead of series one has graphs. And we want to predict **new** edges.

In [71]:
## Compactify the predictors' data
le = MultiColumnLabelEncoder( [ 'author1', 'author2', ] )
dblp_X = le.fit_transform( dblp_X )

Now drop from the graph whose edges are to be predicted every vertex that was not seen during the training period.

In [81]:
## Consider the coauthorship that has taken place since 2011
dblp_y = dblp[ dblp.year >= 2011 ]

## Keep all pairs with both vertices found among the predictors
dblp_y_classes_ = np.unique( np.concatenate( ( dblp_y[ "author1" ].values, dblp_y[ "author2" ].values ) ) )
common_classes_ = np.intersect1d( le.classes_, dblp_y_classes_ )
dblp_y_mask = match( common_classes_, dblp_y[ "author1" ].values ) & match( common_classes_, dblp_y[ "author2" ].values )

## Re-label the target data
dblp_y = le.transform( dblp_y[ dblp_y_mask ] )

In [117]:
## Split into sub-samples: lil matrices support advanced indexing
Adj_train, Adj_test = to_sparse( dblp_X ).tocsr( ), to_sparse( dblp_y ).tolil( )
## Locate the edges already present in the predictor graph
nnz_index = Adj_train.nonzero( )
## Remove these edges from the target graph, so that only new edges remain
Adj_test[ nnz_index[ 0 ], nnz_index[ 1 ] ] = False
Adj_test = Adj_test.tocsr( )

Behold in awe the sheer magnitude of the dataset!!!

In [118]:
Adj_train, Adj_test

(<1259124x1259124 sparse matrix of type '<type 'numpy.bool_'>'
 	with 6518435 stored elements in Compressed Sparse Row format>,
 <1259124x1259124 sparse matrix of type '<type 'numpy.bool_'>'
 	with 1500844 stored elements in Compressed Sparse Row format>)

In [119]:
train_degree = Adj_train.sum( axis = 1 ).getA1( )

In [120]:
np.any( train_degree == 0 )

True

## Feature engineering

In [None]:
def PageRank_iter( A, beta = 0.85, x0 = None, rel_eps = 1.0E-8, niter = 10000 ) :
## Create a teleportation vector
    E = np.full( A.shape[ 0 ], 1.0 / A.shape[ 0 ], np.float )
## If the initial ranking is not provided use the uniform distribution over the nodes
    x0 = np.copy( E ) if x0 is None else x0
## Find the normalising constants for each row
    out = np.array( A.sum( axis = 1 ), np.int ).flatten( )
## Locate the dangling vertices
    dan = np.array( out == 0 )
##  ... and reset their normalising constant to avoid NANs
    out[ dan ] = 1.0
## The resulting status of the convergence procedure:
##  0 -- convergence within the set relative tolerance
##  1 -- exceeded the number of iterations.
    status = 1 ; i = 0
## First stopping rule: within the specified number of iterations
    while i < niter :
## The main computational step
        x1 = beta * ( x0 / out ) * A + ( beta * np.sum( x0 * dan ) + 1 - beta ) * E
## Second stopping rule: within the required tolerance. Correction for 
##  possible machine zeros in the denominator.
        if np.sum( np.abs( x1 - x0 ) / ( np.abs( x0 ) + rel_eps ) ) < rel_eps :
            status = 0
            break
## Proceed to the next iteration
        x0 = x1 ; i += 1
## return the stationary distribution and the convergence information
    return ( x1, { 'convergence': status, 'iterations' : i } )
## Some small test cases
# T = sp.csc_matrix( [ [ 0,1,1,0], [0,0,1,0], [1,0,0,1], [0,0,0,0] ] )
# T = sp.csc_matrix( [ [ 0,1,1,0,0], [0,0,1,0,0], [0,0,0,1,0], [0,0,0,0,1], [1,0,0,0,0] ] )
# print PageRank_iter( T, .9, rel_eps = 1e-10 )
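
The update rule in the main computational step can be checked on the first small test graph with a dense sketch. The function name `pagerank_dense` is hypothetical, and the stopping rule here is a plain absolute $L_1$ tolerance rather than the relative criterion used above.

```python
import numpy as np


def pagerank_dense(A, beta=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for PageRank on a small dense adjacency matrix."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    # Uniform teleportation vector
    E = np.full(n, 1.0 / n)
    # Out-degrees; dangling vertices get a dummy constant to avoid division by zero
    out = A.sum(axis=1)
    dangling = out == 0
    out[dangling] = 1.0
    x = E.copy()
    for _ in range(max_iter):
        # Follow links with probability beta; the rank of dangling vertices and
        # the remaining 1 - beta are redistributed uniformly
        x_new = beta * (x / out).dot(A) + (beta * x[dangling].sum() + 1.0 - beta) * E
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x


T = [[0, 1, 1, 0], [0, 0, 1, 0], [1, 0, 0, 1], [0, 0, 0, 0]]
rank = pagerank_dense(T)
```

Each iteration preserves the total mass, so the resulting vector is a probability distribution over the vertices.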

Construct topological features:
* vertex's pagerank score;
* degree;
* number of common neighbours.

In [None]:
train_degree = Adj_train.sum( axis = 1 ).getA1( )

In [None]:
train_common_neighbours = Adj_train.dot( Adj_train )

In [None]:
train_common_neighbours
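
Why the squared adjacency matrix yields common-neighbour counts can be seen on a toy graph: entry $(u,v)$ of $A^2$ counts the paths of length two between $u$ and $v$, and the diagonal holds the degrees. The sketch below uses a hypothetical dense integer matrix. With a boolean matrix, as produced by `to_sparse` by default, the product may only record whether a common neighbour exists rather than how many; this is an assumption about the library's behaviour that is worth checking, and casting to an integer dtype first is one way to guard against it.

```python
import numpy as np

# Hypothetical undirected toy graph on 4 vertices
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=np.int64)

# Entry (u, v) is the number of vertices adjacent to both u and v
common = A.dot(A)
```

Vertices `0` and `3` are not adjacent but share both of their neighbours, so `common[0, 3]` equals 2.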

In [None]:
train_prank = PageRank_iter( Adj_train )

In [None]:
ax = plt.subplot(111)
ax.set_yscale( 'log' ) ; ax.set_xscale( 'log' )
d,f = np.unique( train_degree, return_counts = True )
ax.plot( d, f, "b." )

## Task 2

Consider the [flickr dataset](https://www.dropbox.com/s/srsib3hq863drtp/flickr_data.tar.gz?dl=0) (warning, raw data!). <br/>
File "*users.txt*" provides a table of the form *userID*, *enterTimeStamp*, *additionalInfo*... <br/>
File "*contacts.txt*" consists of pairs of *userID*'s and the link establishment timestamp <br/>

Recall *scoring functions* for link prediction. Your task is to compare the performance of each scoring function as follows:
1. TOP-$n$ accuracy
    * Denote the number of links $E_\text{new}$ that appeared during the testing period as $n$
    * Denote the ranked list of node pairs provided by score $s$ as $\hat{E}_s$
    * Take the top-$n$ pairs from $\hat{E}_s$ and intersect them with $E_\text{new}$. Performance is measured as the size of the resulting set
2. ROC and AUC ('star' subtask)
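
The TOP-$n$ criterion can be sketched as follows, using the number of common neighbours as one possible scoring function. All names (`common_neighbours_score`, `top_n_hits`, the toy data) are hypothetical, and ties in the score are broken by the order of the candidate list, which is an arbitrary choice.

```python
def common_neighbours_score(adj, u, v):
    """Number of neighbours shared by u and v in the training graph."""
    return len(adj[u] & adj[v])


def top_n_hits(adj, candidate_pairs, new_edges, score=common_neighbours_score):
    """Size of the intersection between the top-n ranked pairs and the new edges."""
    new = {frozenset(e) for e in new_edges}
    n = len(new)
    # Rank candidate pairs by decreasing score; the sort is stable, so ties
    # keep the order of the input list
    ranked = sorted(candidate_pairs, key=lambda p: score(adj, p[0], p[1]), reverse=True)
    top = {frozenset(p) for p in ranked[:n]}
    return len(top & new)


# Training graph as a dictionary of neighbour sets
toy_adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
toy_candidates = [(0, 3), (0, 4), (3, 5), (1, 4)]
toy_new = [(0, 3), (1, 4)]
hits = top_n_hits(toy_adj, toy_candidates, toy_new)
```

In this toy case only `(0, 3)` has a positive score, so one of the two new links is recovered among the top two pairs.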

Essentially, for this task you also have to follow guideline points $1$ and $2$ above. The only thing to keep in mind is that the flickr dataset is a growing one. Therefore, consider only nodes that are significantly represented in both the training and testing intervals (for instance, nodes with at least $5$ adjacent edges in each interval).
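
The node filter suggested above can be sketched as follows. The function name `well_represented` and the toy edge lists are hypothetical, and each edge list is assumed to contain no duplicate edges, since duplicates would inflate the degree counts.

```python
from collections import Counter


def well_represented(train_edges, test_edges, min_degree=5):
    """Nodes with at least min_degree adjacent edges in each of the two intervals."""
    def degrees(edges):
        counts = Counter()
        for u, v in edges:
            counts[u] += 1
            counts[v] += 1
        return counts

    train_deg, test_deg = degrees(train_edges), degrees(test_edges)
    # A Counter returns 0 for nodes absent from the testing interval
    return {u for u in train_deg
            if train_deg[u] >= min_degree and test_deg[u] >= min_degree}


# Node 0 is the hub of a star in both intervals; all other nodes are sparse
star_train = [(0, i) for i in range(1, 6)]
star_test = [(0, i) for i in range(6, 11)] + [(1, 6)]
kept = well_represented(star_train, star_test)
```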