# IAB - Hands-on Tutorial for Link Analysis

Welcome to IAB - hands-on tutorial for link analysis. 
In this tutorial, we will study several techniques for link analysis in graphs. 
This tutorial consists of two sessions and one homework assignment, which cover the following topics:

* **Session 1-1**. Tutorial on PageRank - Part 1 (60 mins)
* <span style="color:blue">**Session 1-2**. Tutorial on PageRank - Part 2 (60 mins)</span>
* **Session 2**. Tutorial on Topic-specific PageRank (120 mins)
* **Homework**. Implementation of HITS

We recommend fully understanding the lecture videos related to link analysis (or ranking) models such as PageRank, Topic-specific PageRank, and HITS before starting this tutorial, since we will **NOT** explain the theoretical background of these techniques during the tutorial. 
We will mainly focus on how to implement the algorithms of those models and how to rank nodes in real-world graphs using those ranking models. 

The main contributors of this material are as follows:
* *Jinhong Jung* (jinhongjung@snu.ac.kr)
* *Jun-gi Jang* (elnino4@snu.ac.kr)
* *U Kang* (ukang@snu.ac.kr)
-----
## Session 1-2. Tutorial on PageRank - Part 2 (60 mins)
In this session, we will explore how to implement PageRank in Python. 
The main goals of this session are summarized as follows:
* **Goal 1.** How to implement PageRank based on sparse matrices using `numpy` and `scipy` in Python
* **Goal 2.** How to handle the deadend issue in PageRank
* **Goal 3.** To perform a qualitative analysis of the ranking result from PageRank in real-world networks

The outline of this session is as follows:
* **Step 1.** Introduction to sparse matrices
* **Step 2.** Implement PageRank - the sparse matrix version
* **Step 3.** Running time comparison between the dense and sparse versions of PageRank
* **Step 4.** Deadend handling and validation of the implementation of PageRank with the deadend handling
* **Step 5.** Qualitative analysis of the ranking result from PageRank

The dense matrix version of PageRank from the previous session stores every zero entry of the adjacency matrix, even though most real-world networks are extremely sparse.
As a result, storing the matrix takes $O(n^2)$ space and each matrix-vector multiplication takes $O(n^2)$ time, where $n$ is the number of nodes in the graph. 
In this session, we will build an efficient implementation of PageRank using sparse matrices, which store only the non-zero values of the matrix.
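
To get a feel for the gap, the following back-of-the-envelope sketch compares the number of entries a dense matrix would store with the number of non-zeros, using the node and edge counts of the `enron` dataset reported in Step 5 of this session. The 8-byte (float64) entry size is an assumption made for illustration.

```python
# Rough storage comparison for an n x n adjacency matrix (illustrative only).
n = 9958    # number of nodes in the enron dataset (from Step 5)
m = 53116   # number of edges, i.e., non-zero entries

dense_entries = n * n     # a dense matrix stores every entry, zero or not
sparse_entries = m        # a sparse matrix stores only the non-zeros

bytes_per_entry = 8       # assumption: each value is a float64
dense_mb = dense_entries * bytes_per_entry / 1e6

print("dense entries : {:,}".format(dense_entries))
print("sparse entries: {:,}".format(sparse_entries))
print("dense storage : {:.0f} MB (values only)".format(dense_mb))
print("fraction of non-zeros: {:.4%}".format(sparse_entries / dense_entries))
```

A sparse format also stores the positions of the non-zeros, so its real footprint is a small multiple of `m` rather than exactly `m`.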

-----
### Step 1. Introduction to sparse matrices

There are various data structures for sparse matrices, e.g., compressed sparse column (CSC), compressed sparse row (CSR), coordinate list (COO), etc. 
Most data structures for sparse matrices aim to store only non-zero entries and their locations. 
The intuition is that zero values in a matrix do not contribute to the result of a matrix operation at all. 
For example, consider the following (sparse) matrix vector multiplication: 

$$
\begin{bmatrix}
0 & 2 & 0 \\
2 & 0 & 0 \\
0 & 0 & 2
\end{bmatrix}
\begin{bmatrix}
2 \\
3 \\
4
\end{bmatrix}
=
\begin{bmatrix}
0 \times 2 + 2 \times 3 + 0 \times 4 \\
2 \times 2 + 0 \times 3 + 0 \times 4 \\
0 \times 2 + 0 \times 3 + 2 \times 4
\end{bmatrix}
=
\begin{bmatrix}
6 \\
4 \\
8
\end{bmatrix}
$$

As the example shows, the zero entries can be skipped in the matrix-vector multiplication, so a data structure for sparse matrices does not need to store them. 
It also means that a matrix-vector multiplication can be done in $O(\text{nnz}(\mathbf{A}))$ time, where $\text{nnz}(\mathbf{A})$ is the number of non-zeros in matrix $\mathbf{A}$.
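
The idea can be sketched in plain Python by storing the example matrix as a list of `(row, column, value)` triples, one per non-zero, and looping over these triples only. The function name `sparse_matvec` is ours, chosen for illustration; `scipy` implements the same operation far more efficiently.

```python
def sparse_matvec(triples, x, num_rows):
    """Multiply a sparse matrix, given as (row, col, value) triples, by vector x."""
    y = [0] * num_rows
    for (i, j, value) in triples:   # one step per non-zero: O(nnz) in total
        y[i] += value * x[j]
    return y

# non-zeros of the 3 x 3 example matrix
triples = [(0, 1, 2), (1, 0, 2), (2, 2, 2)]
x = [2, 3, 4]

y = sparse_matvec(triples, x, num_rows=3)
print(y)  # [6, 4, 8]
```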

In Python, `scipy` provides various data structures for sparse matrices. 
We will use compressed sparse row (CSR, `csr_matrix` in `scipy`) to implement the sparse matrix version of PageRank. 

The details of CSR (e.g., how it stores non-zero values) are out of scope for this tutorial. 
If you are interested in the details, you can refer to the references below:
* Basic CSR data structure: http://netlib.org/linalg/html_templates/node91.html
* Sparse matrix vector multiplication: https://www.it.uu.se/education/phd_studies/phd_courses/pasc/lecture-1

-----
### Step 2. Implement PageRank - the sparse matrix version

Let's implement the sparse matrix version of PageRank in this step.

#### Step 2-1. Set up requirements for this tutorial

First of all, we will use several Python packages such as `numpy`, `scipy`, `pandas`, and `matplotlib`. 
As in the previous session, please check that those packages are installed on your local system. 
If you see an error message, please install the missing package. 
If no message appears, move on to the next step. 

In [None]:
try:
    import numpy
except ImportError:
    print("numpy is not installed, type pip install numpy")

try:
    import scipy
except ImportError:
    print("scipy is not installed, type pip install scipy")
    
%matplotlib inline
try:
    import matplotlib
except ImportError:
    print("matplotlib is not installed, type pip install matplotlib")
    
try:
    import pandas
except ImportError:
    print("pandas is not installed, type pip install pandas")

try:
    from IPython.display import display 
except ImportError:
    print("ipython is not installed, type pip install ipython")

To implement the sparse matrix version of PageRank, we need to import the following packages:

In [None]:
# the below commands restrict the number of computation threads to 1
import os
os.environ["MKL_NUM_THREADS"] = "1" 
os.environ["NUMEXPR_NUM_THREADS"] = "1" 
os.environ["OMP_NUM_THREADS"] = "1" 

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from IPython.display import display

#### Step 2-2. Play with `csr_matrix` of `scipy`

Let's construct a `csr_matrix` with a simple example. 
The following example shows how to build a sparse matrix from an edge list. 
You can refer to the following link to check other examples: 
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html

In [None]:
edges = [ [0, 1, 1],
          [1, 2, 1],
          [2, 3, 1],
          [3, 1, 1] ]
edges = np.asarray(edges)

rows = edges[:, 0]
cols = edges[:, 1]
weights = edges[:, 2]

A = csr_matrix((weights, (rows, cols)), shape=(4, 4))
print("Data stored in A:")
print(A)

print("\nTo dense matrix:")
print(A.toarray())

#### Step 2-3. Implement the phase for loading the graph dataset

In this step, we will implement the graph-loading phase of the sparse matrix version of PageRank. 
We briefly introduce several APIs used in the function below, which constructs the adjacency matrix of a graph.
* `loadtxt`: this loads data from a text file
    - https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html
* `amax`: this returns the maximum of an array
    - https://docs.scipy.org/doc/numpy/reference/generated/numpy.amax.html
* `A.nnz`: this returns the number of non-zeros of matrix `A`
* slice: `edges[:, 0]` will return the first column of matrix `edges`
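
As a quick illustration of these APIs, the sketch below parses a tiny edge list from an in-memory string instead of a file (the three edges are made up for this example) and applies `loadtxt`, `amax`, slicing, and `nnz` to it.

```python
import io
import numpy as np
from scipy.sparse import csr_matrix

# a made-up edge list in the "source target weight" format
text = "0 1 1\n1 2 1\n2 0 1\n"
edges = np.loadtxt(io.StringIO(text), dtype=int)   # loadtxt also accepts file-like objects

sources = edges[:, 0]                     # first column: source node ids
targets = edges[:, 1]                     # second column: target node ids
n = int(np.amax(edges[:, 0:2])) + 1       # largest node id + 1 (ids start from 0)

A = csr_matrix((edges[:, 2], (sources, targets)), shape=(n, n))
print("n = {}, nnz = {}".format(n, A.nnz))
```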

In [None]:
class SparsePageRank:
    def load_graph_dataset(self, data_home, is_undirected=False):
        '''
        Load the graph dataset from the given directory (data_home)

        inputs:
            data_home: string
                directory path containing a dataset (edges.tsv, node_labels.tsv)
            is_undirected: bool
                if the graph is undirected
        '''
        # Step 1. set file paths from data_home
        edge_path = "{}/edges.tsv".format(data_home)
        
        # Step 2. read the list of edges from edge_path
        edges = np.loadtxt(edge_path, dtype=int)
        n = int(np.amax(edges[:, 0:2])) + 1  # max node id + 1 (ids start from 0); the weight column is excluded
        
        # Step 3. convert the edge list to the adjacency matrix
        rows = edges[:, 0]
        cols = edges[:, 1]
        weights = edges[:, 2]
        self.A = csr_matrix((weights, (rows, cols)), shape=(n, n))
        if is_undirected:
            self.A = self.A + self.A.T
        
        # Step 4. set n (# of nodes) and m (# of edges)
        self.n = self.A.shape[0]     # number of nodes
        self.m = self.A.nnz          # number of edges

In [None]:
class SparsePageRank(SparsePageRank):
    def load_node_labels(self, data_home):
        '''
        Load the node labels from the given directory (data_home)

        inputs:
            data_home: string
                directory path containing a dataset
        '''
        label_path = "{}/node_labels.tsv".format(data_home)
        self.node_labels = pd.read_csv(label_path, sep="\t")

Let's check if the function is correctly implemented. 
We will use the same dataset at `./data/small` as in the previous session.
Please run the below cell to check it.

In [None]:
data_home = './data/small'
spr = SparsePageRank()
spr.load_graph_dataset(data_home, is_undirected=False)
spr.load_node_labels(data_home)

# print the number of nodes and edges
print("The number n of nodes: {}".format(spr.n))
print("The number m of edges: {}".format(spr.m))

# print the heads (5) of the node labels
display(spr.node_labels.head(5))

#### Step 2-4. Implement the normalization phase
Next, we need the row-normalized adjacency matrix $\mathbf{\tilde{A}}$ of the adjacency matrix $\mathbf{A}$. 
Note that we are implementing the phase based on sparse matrices. 
For the degree diagonal matrix $\mathbf{D}$, we will use `spdiags` which is for a sparse diagonal matrix.

In [None]:
from scipy.sparse import spdiags

As described in Session 1-1, we aim to implement the following operation in this phase: 

$$\mathbf{\tilde{A}} = \mathbf{D}^{-1}\mathbf{A}$$
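
As a small worked example of this operation, the sketch below row-normalizes a made-up 3-node graph in which node 0 links to nodes 1 and 2, and node 1 links to node 2. Node 2 has no out-going edge, so its out-degree is set to 1 before inverting to avoid a division by zero, and its row stays all-zero.

```python
import numpy as np
from scipy.sparse import csr_matrix, spdiags

# made-up graph: 0 -> 1, 0 -> 2, 1 -> 2
rows = np.array([0, 0, 1])
cols = np.array([1, 2, 2])
vals = np.array([1.0, 1.0, 1.0])
A = csr_matrix((vals, (rows, cols)), shape=(3, 3))

d = np.asarray(A.sum(axis=1)).flatten()   # out-degrees: [2, 1, 0]
d = np.maximum(d, 1.0)                    # avoid dividing by zero for node 2
invD = spdiags(1.0 / d, 0, 3, 3)          # sparse diagonal matrix D^{-1}

nA = invD.dot(A)                          # row-normalized adjacency matrix
print(nA.toarray())
```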

In [None]:
class SparsePageRank(SparsePageRank):
    def normalize(self):
        '''
        Perform the row-normalization of the given adjacency matrix
        '''
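        d = self.A.sum(axis=1)       # since A is a csr_matrix, sum() returns an (n x 1) matrix, not a 1-D vector
        d = np.asarray(d).flatten()  # convert it to a 1-D vector
        
        d = np.maximum(d, np.ones(self.n))       # clamp zero out-degrees to 1 to avoid division by zero
        invd = 1.0 / d                           # entry-wise inverse
        invD = spdiags(invd, 0, self.n, self.n)  # sparse diagonal matrix D^{-1}
        
        self.nA = invD.dot(self.A)   # nA = D^{-1} * A
        self.nAT = self.nA.T         # nAT is the transpose of nA
        
        self.out_degrees = d         # note: zero out-degrees are stored with the clamped value 1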

Let's check if the function is correctly implemented. 
As described in Session 1-1, the sum of each row of the row-normalized matrix $\mathbf{\tilde{A}}$ should be $1$ (except for the rows of nodes without out-going edges, which remain all-zero). 
Hence, let's check if the sum of each row is $1$.

In [None]:
spr = SparsePageRank()
spr.load_graph_dataset('./data/small', is_undirected=False)
spr.normalize()

# check the sum of each row in the row-normalized matrix nA
row_sums = np.asarray(spr.nA.sum(axis=1)).flatten()
for (i, degree, row_sum) in zip(range(spr.n), spr.out_degrees, row_sums):
    print("node: {:2d}, out-degree: {:2d},  row_sum: {:.2f}".format(i, int(degree), row_sum))

Note that `nA` is a `csr_matrix`; hence, `nA.sum` returns a matrix. 
To convert it to a vector, we use the `np.asarray` and `flatten` functions as above.
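
The difference in shapes can be seen on a tiny made-up matrix: summing a `csr_matrix` along an axis yields a two-dimensional result, which `np.asarray(...).flatten()` turns into a one-dimensional vector.

```python
import numpy as np
from scipy.sparse import csr_matrix

M = csr_matrix(np.array([[0.0, 0.5, 0.5],
                         [0.0, 0.0, 1.0],
                         [1.0, 0.0, 0.0]]))

s = M.sum(axis=1)                      # 2-D result with one column
row_sums = np.asarray(s).flatten()     # 1-D vector

print(s.shape)          # (3, 1)
print(row_sums.shape)   # (3,)
```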

#### Step 2-5. Implement the iterative phase
Now, let's implement the iterative phase of PageRank based on sparse matrices. 
After constructing the sparse matrix used in this phase, the implementation is the same as that of the dense matrix version. 
Hence, you do not need to modify the code itself.

For convenience, we provide the iterative algorithm in this cell again.

<img src="./images/iterative-algorithm-pagerank.png" width="400">
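
In equation form, the update that the code in the next cell performs in every iteration can be written as follows, where $b$ is the teleport probability, $\mathbf{q}$ is the uniform query vector, and $\mathbf{p}^{(t)}$ denotes the score vector after $t$ iterations (the superscript notation is ours and may differ from the figure):

$$
\mathbf{p}^{(0)} = \mathbf{q} = \frac{1}{n}\mathbf{1}, \qquad
\mathbf{p}^{(t)} = (1 - b)\,\mathbf{\tilde{A}}^{\top}\mathbf{p}^{(t-1)} + b\,\mathbf{q}
$$

The iteration stops when $\lVert \mathbf{p}^{(t)} - \mathbf{p}^{(t-1)} \rVert_1 < \epsilon$ or after `maxIters` iterations.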

In [None]:
class SparsePageRank(SparsePageRank):
    def iterate_PageRank(self, b=0.15, epsilon=1e-9, maxIters=100):
        '''
        Iterate the PageRank equation to obtain the PageRank score vector
        
        inputs:
            b: float (between 0 and 1)
                the teleport probability
            epsilon : float
                the error tolerance of the iteration
            maxIters : int
                the maximum number of iterations

        outputs:
            p: np.ndarray (n x 1 vector)
                the final PageRank score vector
            residuals: list
                the list of residuals over the iteration

        '''
        q = np.ones(self.n)/self.n
        old_p = q
        residuals = []
        
        for t in range(maxIters):
            p = (1 - b) * (self.nAT.dot(old_p)) + (b * q)
            residual = np.linalg.norm(p - old_p, 1)
            residuals.append(residual)
            old_p = p
            
            if residual < epsilon:
                break
                
        return p, residuals

Let's check the result of the implementation.

In [None]:
spr = SparsePageRank()
spr.load_graph_dataset('./data/small', is_undirected=False)
spr.normalize()

p, residuals = spr.iterate_PageRank(b=0.15, epsilon=1e-9, maxIters=100)

for (i, score) in zip(range(spr.n), p):
    print("node: {:2d}, PageRank score: {:.4f}".format(i, score))

#### Step 2-6. Comparison between the sparse and dense matrix versions. 

Let's compare the sparse matrix version to the dense version. 
The code for the dense matrix version is taken from the previous session.

In [None]:
# !!! DO NOT MODIFY THE CODE BELOW - JUST RUN IT !!!
# This class is copied from the previous session
class DensePageRank:
    def load_graph_dataset(self, data_home, is_undirected=False):
        '''
        Load the graph dataset from the given directory (data_home)

        inputs:
            data_home: string
                directory path containing a dataset
            is_undirected: bool
                if the graph is undirected
        '''
        # Step 1. set file paths from data_home
        edge_path = "{}/edges.tsv".format(data_home)

        # Step 2. read the list of edges from edge_path
        edges = np.loadtxt(edge_path, dtype=int)
        n = int(np.amax(edges[:, 0:2])) + 1 # the current n is the maximum node id (starting from 0)

        # Step 3. convert the edge list to the adjacency matrix
        self.A = np.zeros((n, n))
        for i in range(edges.shape[0]):
            source, target, weight = edges[i, :]
            self.A[(source, target)] = weight
            if is_undirected:
                self.A[(target, source)] = weight

        # Step 4. set n (# of nodes) and m (# of edges)
        self.n = n                         # number of nodes
        self.m = np.count_nonzero(self.A)  # number of edges
    
    def normalize(self):
        '''
        Perform the row-normalization of the given adjacency matrix
        '''
        # Step 1. obtain the out-degree vector d
        d = self.A.sum(axis = 1)           # row-wise summation

        # Step 2. obtain the inverse of the out-degree matrix
        d = np.maximum(d, np.ones(self.n)) # handles zero out-degree nodes, `maximum` performs entry-wise maximum 
        invd = 1.0 / d                # entry-wise division
        invD = np.diag(invd)          # convert invd vector to a diagonal matrix

        # Step 3. compute the row-normalized adjacency matrix
        self.nA = invD.dot(self.A)   # nA = invD * A
        self.nAT = self.nA.T         # nAT is the transpose of nA
        
        self.out_degrees = d
    
    def iterate_PageRank(self, b=0.15, epsilon=1e-9, maxIters=100):
        '''
        Iterate the PageRank equation to obtain the PageRank score vector

        inputs:
            b: float (between 0 and 1)
                the teleport probability
            epsilon : float
                the error tolerance of the iteration
            maxIters : int
                the maximum number of iterations

        outputs:
            p: np.ndarray (n x 1 vector)
                the final PageRank score vector
            residuals: list
                the list of residuals over the iteration
        '''
        q = np.ones(self.n)/self.n     # set the query vector q
        old_p = q                 # set the previous PageRank score vector
        residuals = []            # set the list for residuals over iterations

        for t in range(maxIters):
            p = (1-b)*(self.nAT.dot(old_p)) + b*q
            residual = np.linalg.norm(p - old_p, 1)
            residuals.append(residual)
            old_p = p

            if residual < epsilon:
                break

        return p, residuals

Now, we are able to compare them. 
Let's compute the PageRank score vector from each version, and measure the error between them.

In [None]:
data_home = './data/small'

spr = SparsePageRank()
spr.load_graph_dataset(data_home, is_undirected=False)
spr.normalize()
p_spr, _ = spr.iterate_PageRank(b=0.15, epsilon=1e-9, maxIters=100)

dpr = DensePageRank()
dpr.load_graph_dataset(data_home, is_undirected=False)
dpr.normalize()
p_dpr, _ = dpr.iterate_PageRank(b=0.15, epsilon=1e-9, maxIters=100)

error = np.linalg.norm(p_spr - p_dpr, 1)
print("Error between sparse and dense PageRank scores: {:e}".format(error))

Note that the error between them is very small, indicating they are effectively equivalent.

-----
### Step 3. Running time comparison between the dense and sparse versions of PageRank

The reason why we implemented the sparse matrix version is that the dense version is not efficient. 
Let's empirically check the efficiency of the sparse matrix version. 
First, we import the `time` function to measure wall-clock time.

In [None]:
from time import time

`time` is used as follows:

```python
start_time = time()
# ... your code here ...
run_time = time() - start_time # in seconds
```

Using `time`, let's measure the wall-clock time of the whole procedure of each version on a medium-sized dataset at `./data/medium`.

In [None]:
data_home = './data/medium'

start_time = time()
spr = SparsePageRank()
spr.load_graph_dataset(data_home, is_undirected=False)
spr.normalize()
spr_p, _ = spr.iterate_PageRank(b=0.01, epsilon=1e-9, maxIters=1000)
spr_run_time = time() - start_time
print("Running time of the sparse version: {:.4f} seconds".format(spr_run_time))

start_time = time()
dpr = DensePageRank()
dpr.load_graph_dataset(data_home, is_undirected=False)
dpr.normalize()
dpr_p, _ = dpr.iterate_PageRank(b=0.01, epsilon=1e-9, maxIters=1000)
dpr_run_time = time() - start_time

print("Running time of the dense version : {:.4f} seconds".format(dpr_run_time))

As you can see, the running time of the dense version is much longer than that of the sparse version. 
From now on, we only use the sparse version to analyze large-scale real-world networks.
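
If the `./data/medium` dataset is not at hand, the same effect can be observed on a synthetic matrix. The sketch below is an illustration under our own assumptions, not part of the original experiment: it builds a random sparse matrix with `scipy.sparse.random` (the size `n = 2000` and density `0.001` are arbitrary choices) and times repeated matrix-vector multiplications in both representations. Actual timings depend on the machine, so none are shown here.

```python
from time import time

import numpy as np
from scipy.sparse import random as sparse_random

n = 2000                                   # arbitrary size for illustration
S = sparse_random(n, n, density=0.001, format="csr", random_state=0)
D = S.toarray()                            # the same matrix stored densely
x = np.ones(n) / n

start_time = time()
for _ in range(100):
    y_sparse = S.dot(x)                    # touches only the non-zeros
sparse_time = time() - start_time

start_time = time()
for _ in range(100):
    y_dense = D.dot(x)                     # touches all n * n entries
dense_time = time() - start_time

print("sparse: {:.4f} s, dense: {:.4f} s".format(sparse_time, dense_time))
print("max difference: {:e}".format(np.max(np.abs(y_sparse - y_dense))))
```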

-----
### Step 4. Deadend handling and validation of the implementation of PageRank with the deadend handling

When we compute the PageRank score vector in *directed* networks, one issue remains unresolved from the previous steps. 
This is the deadend issue: a deadend node is a node whose out-degree is zero (i.e., the node has only in-coming links). 
Before describing the deadend issue, let's check how many deadend nodes exist in a directed network. 
We will use the `enron` dataset (at `./data/enron`), which is a directed network (we will describe the details of the dataset later). 

In [None]:
data_home = './data/enron'
spr = SparsePageRank()
spr.load_graph_dataset(data_home, is_undirected=False)
spr.normalize()

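# count deadend nodes, i.e., nodes whose row in A sums to zero
# (spr.out_degrees cannot be used here since normalize() clamps zero out-degrees to 1)
raw_out_degrees = np.asarray(spr.A.sum(axis=1)).flatten()
num_deadends = np.count_nonzero(raw_out_degrees == 0)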
print("The number n of nodes: {}".format(spr.n))
print("The number m of edges: {}".format(spr.m))
print("The number of deadend nodes: {}".format(num_deadends))

As you can see, a large fraction of the nodes are deadend nodes ($5,840/9,958$). 
When a directed network contains deadend nodes, PageRank scores leak out, i.e., the sum of the PageRank score vector becomes less than $1.0$ even though the vector should be a probability distribution. 
As explained in the lecture video, this happens because a random surfer who visits a deadend node cannot escape from the node. 
This problem is called the deadend issue. 
The issue is easily checked by summing the PageRank score vector as follows:

In [None]:
p, _ = spr.iterate_PageRank(b=0.15, epsilon=1e-9, maxIters=300)

print("The sum of the PageRank score vector: {:.2f}".format(np.sum(p)))

Note that the sum of the PageRank score vector is less than $1$, indicating that there is a score leak.
To resolve the deadend issue, you need to implement the following algorithm described in the lecture video. 

<img src="./images/iterative-algorithm-pagerank-deadend.png" width="400">

We will not explain the details of the solution, but it guarantees that the sum of the PageRank scores is $1$.
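
Since the algorithm is given only as a figure, here is a minimal sketch of one common way to handle deadends, which we assume matches the idea of the algorithm above: after each update, measure how much score is missing from the vector and give it back according to the query vector $\mathbf{q}$. The function name `pagerank_with_leak_fix` and the 3-node graph are made up for illustration, and the lecture's algorithm may differ in detail.

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank_with_leak_fix(nAT, b=0.15, epsilon=1e-9, max_iters=100, handles_deadend=True):
    """Power iteration on the transposed row-normalized matrix nAT (illustrative sketch)."""
    n = nAT.shape[0]
    q = np.ones(n) / n
    old_p = q
    for _ in range(max_iters):
        if handles_deadend:
            p = (1 - b) * nAT.dot(old_p)
            gamma = np.sum(p)              # score that survived the random-walk step
            p = p + (1 - gamma) * q        # give the missing score (teleport + leak) back via q
        else:
            p = (1 - b) * nAT.dot(old_p) + b * q
        residual = np.linalg.norm(p - old_p, 1)
        old_p = p
        if residual < epsilon:
            break
    return p

# made-up graph: 0 -> 1, 0 -> 2, 1 -> 2; node 2 is a deadend (all-zero row)
nA = csr_matrix(np.array([[0.0, 0.5, 0.5],
                          [0.0, 0.0, 1.0],
                          [0.0, 0.0, 0.0]]))
nAT = nA.T

p_leaky = pagerank_with_leak_fix(nAT, handles_deadend=False)
p_fixed = pagerank_with_leak_fix(nAT, handles_deadend=True)
print("sum without deadend handling: {:.4f}".format(np.sum(p_leaky)))
print("sum with deadend handling   : {:.4f}".format(np.sum(p_fixed)))
```

Without the correction, the score that reaches node 2 is lost in every iteration, so the sum stays below $1$; with it, the vector sums to $1$ after every update.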

In [None]:
class SparsePageRank(SparsePageRank):
    def iterate_PageRank(self, b=0.15, epsilon=1e-9, maxIters=100, handles_deadend=True):
        '''
        ///Try it yourself!///
        Iterate the PageRank equation to obtain the PageRank score vector
        
        inputs:
            b: float (between 0 and 1)
                the teleport probability
            epsilon: float
                the error tolerance of the iteration
            maxIters: int
                the maximum number of iterations
            handles_deadend: bool
                if it handles the deadend issue

        outputs:
            p: np.ndarray (n x 1 vector)
                the final PageRank score vector
            residuals: list
                the list of residuals over the iteration

        '''
        p = np.zeros(self.n)           # pagerank score vector
        residuals = []                 # set the list for residuals over iterations

        pass # TODO: implement Algorithm 1
    
        return p, residuals

Let's check the result of the modified iterative algorithm. 
Since we need to handle the deadend issues, we should set `handles_deadend` to `True`.

In [None]:
data_home = './data/enron'
spr = SparsePageRank()
spr.load_graph_dataset(data_home, is_undirected=False)
spr.normalize()
p, _ = spr.iterate_PageRank(b=0.15, epsilon=1e-9, maxIters=100, handles_deadend=True)

print("The sum of the PageRank score vector: {:.2f}".format(np.sum(p)))

-----
### Step 5. Qualitative analysis of the ranking result from PageRank

In this step, we will perform a qualitative analysis of the ranking result from PageRank using a real-world graph. 
The dataset is the `enron` dataset. 
This is a communication network of emails where nodes represent email addresses and directed edges represent email communications (e.g., for an edge $u \rightarrow v$, $u$ sent $v$ an email).
The statistics of the dataset are as follows:

| Statistic | Value |
| --- | --- |
| $n$: the number of nodes | 9,958 |
| $m$: the number of edges | 53,116 |

To perform the analysis, we implement a function for ranking nodes in the order of PageRank scores (we already implemented this in the dense matrix version; hence, we copy the code).

In [None]:
class SparsePageRank(SparsePageRank):
    def rank_nodes(self, ranking_scores, topk=-1):
        '''
        Rank nodes in the order of given ranking scores. 
        This function reports top-k rankings. 

        inputs:
            ranking_scores: np.ndarray
                ranking score vector
            topk: int
                top-k ranking parameter, default is -1 indicating report all ranks
        '''
        if topk < 0:
            topk = self.n                                    # a negative topk means reporting all ranks
        sorted_nodes = np.flipud(np.argsort(ranking_scores)) # argsort in the descending order
        sorted_scores = ranking_scores[sorted_nodes]         # sort the ranking scores
        ranks = range(1, self.n+1)                           # ranks 1~n

        result_labels = self.node_labels.iloc[sorted_nodes][0:topk].copy()  # copy so node_labels is left untouched
        result_labels.insert(0, "rank", list(ranks[0:topk]))
        result_labels["score"] = sorted_scores[0:topk]
        result_labels.reset_index(drop = True, inplace = True)
        return result_labels
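
The descending sort used in `rank_nodes` can be seen in isolation on a made-up score vector: `np.argsort` returns node indices in ascending order of score, and `np.flipud` reverses them.

```python
import numpy as np

scores = np.array([0.1, 0.4, 0.2, 0.3])        # made-up scores for nodes 0..3

sorted_nodes = np.flipud(np.argsort(scores))   # node ids, highest score first
sorted_scores = scores[sorted_nodes]           # scores in descending order

print(sorted_nodes)    # [1 3 2 0]
print(sorted_scores)   # [0.4 0.3 0.2 0.1]
```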

Let's rank nodes based on the PageRank score vector. We print only the top-$10$ rankings since there are almost $10,000$ nodes and we cannot visually check all of them in a cell.

In [None]:
data_home = './data/enron'
spr = SparsePageRank()
spr.load_graph_dataset(data_home, is_undirected=False)
spr.load_node_labels(data_home)
spr.normalize()
p, _ = spr.iterate_PageRank(b=0.15, epsilon=1e-9, maxIters=100, handles_deadend=True)

# display top-10 ranking in the order of PageRank scores
display(spr.rank_nodes(p, topk=10))

With only the ranking results, we do not know the owner and position of each e-mail address. 
The raw `enron` data contains no information on the positions of employees. 
Fortunately, the positions of several custodians of Enron have already been surveyed. 
You can check the data at the following link:
* https://github.com/enrondata/enrondata/blob/master/data/misc/edo_enron-custodians-data.tsv


Based on the data, we summarize the ranking result in the following table:

| Rank | E-mail address | Name | Title |
| --- | --- | --- | --- |
| 1 | `jeff.skilling at enron.com` | Jeffrey Skilling | Chief Executive Officer (CEO) |
| 2 | `kenneth.lay at enron.com` | Kenneth Lay | Chief Executive Officer (CEO) |
| 3 | `louise.kitchen at enron.com` | Louise Kitchen | President, Enron Online |
| 4 | `sally.beck at enron.com` | Sally Beck | Chief Operating Officer (COO) |
| 5 | `tana.jones at enron.com` | Tana Jones | No Information (maybe manager) |
| 6 | `john.lavorato at enron.com` | John Lavorato | Chief Executive Officer (CEO), Enron America |
| 7 | `greg.whalley at enron.com` | Lawrence Greg Whalley | President |
| 8 | `vince.kaminski at enron.com` | Vince Kaminski | Manager, Risk Management Head |
| 9 | `sara.shackleton at enron.com` | Sara Shackleton | No Information (maybe manager) |
| 10 | `rod.hayslett at enron.com` | Rod Hayslett | Vice President, Chief Financial Officer (CFO) |

Note that most of the highly ranked nodes are senior officials of the company, such as the CEO or the COO.
There is no title information for `Tana Jones` and `Sara Shackleton`, but they are in the custodian (manager) list, implying that they are also senior officials. 
This result matches our intuition: managers frequently communicate with other employees and, in particular, receive many e-mails from other senior officials as well as from regular employees. 
That is why their PageRank scores are high: according to the mechanism of PageRank, more important nodes are likely to receive more links from other nodes.

-----
## Session 1-2. Summary

In this session, we implemented PageRank (the sparse matrix version) in Python. 
More specifically, we achieved the following goals. 
* **Goal 1.** How to implement PageRank based on sparse matrices using `numpy` and `scipy` in Python
    - We implemented the iterative algorithm for PageRank based on sparse matrices.
* **Goal 2.** How to handle the deadend issue in PageRank
    - We empirically checked the deadend issue in a directed network, and implemented the solution to the deadend issue.
* **Goal 3.** To perform a qualitative analysis of the ranking result from PageRank in real-world networks
    - We performed a qualitative analysis on the `enron` dataset which is a real-world network.