# <center>Structural Analysis and Visualization of Networks</center>

## <center>Course Project #2</center>

### <center>Student: *Nazarov Ivan*</center>

#### <hr /> General Information

**Hard Deadline:** 21.06.2015 23:59 <br />

Please send your reports to <mailto:leonid.e.zhukov@gmail.com> and <mailto:shestakoffandrey@gmail.com> with message subject of the following structure:<br /> **[HSE Networks 2015] *{LastName}* *{First Name}* Project*{Number}***

Support your computations with figures and comments. <br />
If you are using IPython Notebook you may use this file as a starting point of your report.<br />
<br />
<hr />

# Description

## Task 1 

You are provided with the [DBLP dataset](https://www.dropbox.com/s/ft4ekv2f3r43u7b/dblp_2000.csv.gz?dl=0) (warning, raw data!). It contains coauthorships recorded during $2000$-$2014$. Specifically, the file contains $3$ columns: the first two hold the authors' names and the third holds the year of publication. This data maps naturally onto an undirected graph.

Your task is to construct a supervised link prediction scheme.

### Guidelines:

0. Use *pandas* module to load and manipulate the dataset in Python
1. Initialize your classification set as follows:
    * Determine training and testing intervals on your time domain (for instance, in the DBLP dataset take the period $2000$-$2010$ as the training period and $2011$-$2014$ as the testing period)
    * Pick pairs of authors that **have appeared during training interval** but **have not published together** during it
    * These pairs form **positive** or **negative** examples depending on whether they have formed coauthorships **during the testing interval**
    * You have arrived at a binary classification problem. PROFIT!
2. Construct feature space:
    * Most of our features tend to be topological. Examples of such features are: (weighted) sum of neighbours, shortest distance, etc.
3. Choose at least $4$ classification algorithms from [scikit module](http://scikit-learn.org/stable/) (goes with Anaconda) and compare them in terms of Accuracy, Precision, Recall, F-Score (for positive class) and Mean Squared Error. Use k-fold cross-validation and average your results
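The evaluation protocol in guideline 3 can be prototyped in plain Python before any classifier is plugged in. The sketch below is only an illustration: the helper names `kfold_indices`, `binary_metrics` and `cross_validate` are hypothetical, labels are assumed to be coded as 0/1, and `fit` is assumed to be any function that takes training data and returns a prediction function. scikit-learn ships its own implementations of k-fold splitting and of these metrics, which would normally be preferred.

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) for k shuffled folds over range(n)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 (positive class) and MSE for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = len(pairs) - tp - fp - fn
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    mse = sum((t - p) ** 2 for t, p in pairs) / float(len(pairs))
    return {"accuracy": (tp + tn) / float(len(pairs)), "precision": precision,
            "recall": recall, "f1": f1, "mse": mse}

def cross_validate(X, y, fit, k=5):
    """Average every metric over k folds; fit(X_train, y_train) returns a predictor."""
    totals = {}
    for train, test in kfold_indices(len(y), k):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        scores = binary_metrics([y[i] for i in test], [predict(X[i]) for i in test])
        for name, value in scores.items():
            totals[name] = totals.get(name, 0.0) + value / k
    return totals
```

For hard 0/1 predictions the mean squared error equals the misclassification rate, so it only carries extra information when the classifier outputs probabilities.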

In [None]:
import pandas as pd, networkx as nx, numpy as np, scipy.sparse as sp
import os, regex as re, time as tm

DATADIR = os.path.realpath( os.path.join( ".", "data", "proj02" ) )

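Unfortunately, scikit-learn has no out-of-the-box encoder that assigns labels from a vocabulary pooled across several columns, so the class below wraps `LabelEncoder` for that purpose.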

In [None]:
from sklearn.preprocessing import LabelEncoder
class MultiColumnLabelEncoder :
    def __init__( self, columns = None ):
        self.columns = columns
        self.__le = LabelEncoder(  )
    def fit( self, X, y = None ) :
        self.__columns = X.columns if self.columns is None else self.columns
## Initialize the label encoder and make it assign labels to the pooled columns
        self.__le.fit( pd.concat( [ X[ col ] for col in self.__columns ] ) )
        self.classes_ = self.__le.classes_
        return self
    def transform( self, X, copy = True ) :
## Copy the input dataframe and figure out what columns to re-code
        __output = X.copy( ) if copy else X
## Iterate over the required columns
        for col in self.__columns :
            __output[ col ] = self.__le.transform( __output[ col ] )
        return __output
    def fit_transform( self, X, y = None ) :
        return self.fit( X, y ).transform( X )
    def inverse_transform( self, y ) :
        return self.__le.inverse_transform( y )
    def set_params( self, **params ) :
        pass
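The point of pooling the columns is that one author must receive the same integer label whether the name appears in `author1` or in `author2`; only then can the labels serve as row and column indices of a single adjacency matrix. A dependency-free sketch of the same idea, with a hypothetical helper `pooled_label_encode` and made-up names:

```python
def pooled_label_encode(*columns):
    """Encode several columns with one shared, sorted vocabulary."""
    classes = sorted(set().union(*columns))
    code = {label: i for i, label in enumerate(classes)}
    return classes, [[code[value] for value in col] for col in columns]

# Made-up author names, for illustration only
author1 = ["Bob", "Alice", "Carol"]
author2 = ["Alice", "Dave", "Bob"]
classes, (a1, a2) = pooled_label_encode(author1, author2)
# classes -> ['Alice', 'Bob', 'Carol', 'Dave']
# a1      -> [1, 0, 2]
# a2      -> [0, 3, 1]
```

Encoding each column separately would instead map "Alice" to different integers in the two columns.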

Load the raw DBLP data

In [None]:
if not os.path.exists( os.path.join( DATADIR, "dblp_dataframe.ppdf" ) ) :
## Start
    tick = tm.time( )
## Load the raw DBLP dataset
    dblp_raw = pd.read_csv( os.path.join( DATADIR, "dblp_2000.csv.gz" ), # nrows = 100,
## On-the-fly decompression
                        compression = "gzip", header = None, quoting = 0,
## Assign column headers
                        names = [ 'author1', 'author2', 'year', ], encoding = "utf-8" )
    tock = tm.time( )
    print( "DBLP loaded in %.3f sec." % ( tock - tick, ) )
## Pool the author columns together and assign one shared set of integer labels.
    le = MultiColumnLabelEncoder( [ 'author1', 'author2', ] )
    dblp = le.fit_transform( dblp_raw )
    authors_index = le.classes_
    del dblp_raw, le
    tick = tm.time( )
    print( "DBLP preprocessed in %.3f sec." % ( tick - tock, ) )
## Cache the encoded dataframe and the author dictionary next to the raw data
    dblp.to_pickle( os.path.join( DATADIR, "dblp_dataframe.ppdf" ) )
    with open( os.path.join( DATADIR, "author_index.dic" ), "wb" ) as out :
        out.writelines( label + "\n" for label in authors_index )
## Report
    tock = tm.time( )
    print( "DBLP cached in %.3f sec." % ( tock - tick, ) )
else :
## Start
    tick = tm.time( )
## Load the database from pickled format
    dblp = pd.read_pickle( os.path.join( DATADIR, "dblp_dataframe.ppdf" ) )
## Read the dictionary of authors, dropping the trailing newline of every entry
    with open( os.path.join( DATADIR, "author_index.dic" ), "rb" ) as inp :
        authors_index = [ line.rstrip( "\n" ) for line in inp ]
## Finish
    tock = tm.time( )
    print( "DBLP loaded in %.3f sec." % ( tock - tick, ) )

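The authors' names in the raw file are wrapped in double quotes and their non-ASCII letters are stored as Unicode escape sequences, so the helper below strips the quotes and decodes the escapes.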

In [None]:
def get_author( txt ) :
    return re.sub( r"^\s*\"(.*)\"\s*$", r"\1", txt ).decode( "unicode_escape" )
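The helper above relies on `str.decode`, which exists only in Python 2. Under Python 3 the same clean-up could be sketched as follows; `get_author_py3` is a hypothetical name, the sample string is made up rather than taken from the dataset, and the standard `re` module stands in for `regex`.

```python
import re

def get_author_py3(txt):
    """Strip surrounding double quotes and decode \\uXXXX escapes (Python 3)."""
    unquoted = re.sub(r'^\s*"(.*)"\s*$', r"\1", txt)
    # Round-trip through bytes: unicode_escape decoding operates on bytes,
    # and backslashreplace keeps any stray non-Latin-1 character as an escape.
    return unquoted.encode("latin-1", "backslashreplace").decode("unicode_escape")

# Made-up record in the quoted, escaped form described above
name = get_author_py3('"Jos\\u00e9 Garc\\u00eda"\n')
# name -> 'José García'
```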

## Analysis

Let's split the data into two non-overlapping samples by year:
* the coauthorship data from 2000 through 2010 is the training data;
* the collaborations from 2011 onwards form the test sample.

In [None]:
dblp_train, dblp_test = dblp[ dblp.year <= 2010 ], dblp[ dblp.year >= 2011 ]

In [None]:
def make_coo( df, shape = ( len( authors_index ), len( authors_index ) ) ) :
    return sp.coo_matrix( (
            np.ones( 2 * len( df ), dtype = bool ), (
                np.concatenate( ( df[ "author1" ].values, df[ "author2" ].values ) ),
                np.concatenate( ( df[ "author2" ].values, df[ "author1" ].values ) ) )
        ), shape = shape ).tocsr( )

## Split into sub-samples
Adj_train, Adj_test = make_coo( dblp_train ), make_coo( dblp_test )
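With the two graphs in place, the classification set from guideline 1 can be assembled. Enumerating every non-adjacent pair of training authors is quadratic in the number of authors, so the sketch below makes a simplifying assumption that is not part of the assignment: only pairs that share at least one training coauthor (distance 2) are kept as candidates. The helper names are hypothetical, and plain edge lists are used instead of the sparse matrices to keep the example self-contained.

```python
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from an iterable of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj

def label_candidate_pairs(train_edges, test_edges):
    """Map each candidate pair (u < v) to 1 if it links in the test period, else 0."""
    adj = build_adjacency(train_edges)
    future = {tuple(sorted(edge)) for edge in test_edges}
    labels = {}
    for nbrs in adj.values():
        # Any two neighbours of the same author are at distance <= 2
        for u, v in combinations(sorted(nbrs), 2):
            if v not in adj[u]:  # not coauthors during training
                labels[(u, v)] = int((u, v) in future)
    return labels

# Toy data: a path 0-1-2-3 in training; (0, 2) forms later, author 4 is new
labels = label_candidate_pairs([(0, 1), (1, 2), (2, 3)], [(0, 2), (3, 4)])
# labels -> {(0, 2): 1, (1, 3): 0}
```

The pair `(3, 4)` is dropped because author 4 never appears in the training interval, which matches the requirement that both authors be present during training.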

## Feature engineering

Construct topological features, starting with the degree of every author in the training graph.

In [None]:
train_degree = Adj_train.sum( axis = 1 ).getA1( )
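Beyond the degree, the usual pairwise topological scores are built from the neighbour sets of the two authors. The sketch below computes four common ones on plain adjacency sets; the function name `pair_features` and the toy graph are hypothetical, and this particular feature set is an assumption rather than the one used in this report.

```python
import math

def pair_features(adj, u, v):
    """Neighbourhood-based scores for a candidate pair (u, v)."""
    nu, nv = adj[u], adj[v]
    common, union = nu & nv, nu | nv
    return {
        "common_neighbours": len(common),
        "jaccard": len(common) / float(len(union)) if union else 0.0,
        # A common neighbour of two distinct nodes has degree >= 2, so log > 0
        "adamic_adar": sum(1.0 / math.log(len(adj[w])) for w in common),
        "preferential_attachment": len(nu) * len(nv),
    }

# Toy 4-cycle 0-1-3-2-0: nodes 0 and 3 are not adjacent but share both neighbours
adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}
features = pair_features(adj, 0, 3)
# common_neighbours -> 2, jaccard -> 1.0, preferential_attachment -> 4
```

On the sparse representation the same quantities are available in bulk: the common-neighbour counts are the entries of the square of the integer-valued adjacency matrix, and preferential attachment is the product of two entries of the degree vector.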

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
ax = plt.subplot(111)
ax.set_yscale( 'log' ) ; ax.set_xscale( 'log' )
## Degree frequencies of the training graph (the array computed above is train_degree)
d,f = np.unique( train_degree, return_counts = True )
ax.plot( d, f, "b." )

## Task 2

Consider the [flickr dataset](https://www.dropbox.com/s/srsib3hq863drtp/flickr_data.tar.gz?dl=0) (warning, raw data!). <br/>
File "*users.txt*" provides a table of the form *userID*, *enterTimeStamp*, *additionalInfo*... <br/>
File "*contacts.txt*" consists of pairs of *userID*s together with the timestamp at which the link was established <br/>

Recall *scoring functions* for link prediction. Your task is to compare the performance of each scoring function as follows:
1. TOP-$n$ accuracy
    * Let $n$ denote the number of links $E_\text{new}$ that appeared during the testing period
    * Denote the ranked list of node pairs provided by score $s$ as $\hat{E}_s$
    * Take the top-$n$ pairs from $\hat{E}_s$ and intersect them with $E_\text{new}$. Performance is measured as the size of the resulting set
2. ROC and AUC ('star' subtask)
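Both evaluation measures can be sketched in a few lines. The code below assumes that a scoring function has already been evaluated on a set of candidate pairs and stored in a dictionary; `top_n_hits` and `auc` are hypothetical names, and the AUC is computed through its rank interpretation (the probability that a random positive pair outscores a random negative one, ties counting one half) rather than by tracing the ROC curve. The quadratic loop is fine for a toy example but would need a rank-based formula on the full data.

```python
def top_n_hits(scores, new_edges):
    """Size of the overlap between the n best-scored pairs and the n new links."""
    n = len(new_edges)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return len(set(ranked[:n]) & new_edges)

def auc(scores, new_edges):
    """P(score of a positive pair > score of a negative pair), ties count 1/2."""
    pos = [s for pair, s in scores.items() if pair in new_edges]
    neg = [s for pair, s in scores.items() if pair not in new_edges]
    # Assumes both classes are non-empty among the candidate pairs
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Made-up scores for four candidate pairs; two of them become links
scores = {(0, 2): 3.0, (1, 3): 1.0, (0, 3): 2.0, (1, 4): 0.5}
new_edges = {(0, 2), (1, 3)}
hits = top_n_hits(scores, new_edges)   # -> 1
area = auc(scores, new_edges)          # -> 0.75
```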

Essentially, for this task you also have to follow guideline points $1$ and $2$ above. The only thing to keep in mind is that the flickr dataset is a growing dataset. Therefore, consider only nodes that are significantly represented in both the training and testing intervals (for instance, nodes that have at least $5$ adjacent edges in the training and testing intervals)
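The node filter suggested above can be sketched as follows. The wording leaves open whether the threshold applies to each interval separately or to both combined; the code assumes the former, and the helper names `edge_degrees` and `active_nodes` are hypothetical.

```python
from collections import Counter

def edge_degrees(edges):
    """Number of incident edges per node in an edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def active_nodes(train_edges, test_edges, min_degree=5):
    """Nodes with at least min_degree incident edges in each interval."""
    d_train, d_test = edge_degrees(train_edges), edge_degrees(test_edges)
    return {u for u in d_train
            if d_train[u] >= min_degree and d_test[u] >= min_degree}

# Toy example with a lower threshold so the effect is visible
kept = active_nodes([(0, 1), (0, 2), (1, 2)],
                    [(0, 3), (0, 4), (1, 5)], min_degree=2)
# kept -> {0}
```

Candidate pairs and the set $E_\text{new}$ would then be restricted to pairs whose endpoints both survive the filter.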