### Preliminaries
To normalize a vector to unit L1-norm or unit L2-norm, use:

In [1]:
#from __future__ import print_function, division
import numpy as np

pr = np.array([1,2,3])

def normalize(vector, norm=2):
    return vector / np.linalg.norm(vector, norm)

print("L1-normalized {0} is {1}".format(pr, normalize(pr, norm=1)))
print("L2-normalized {0} is {1}".format(pr, normalize(pr, norm=2)))

L1-normalized [1 2 3] is [0.16666667 0.33333333 0.5       ]
L2-normalized [1 2 3] is [0.26726124 0.53452248 0.80178373]
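
As a quick sanity check, the normalized vector should have norm 1 under the chosen norm. The sketch below redefines `normalize` so that it runs on its own; `l1_check` and `l2_check` are names introduced here for illustration.

```python
import numpy as np

def normalize(vector, norm=2):
    # Divide by the requested norm (1: sum of absolute values, 2: Euclidean length)
    return vector / np.linalg.norm(vector, norm)

pr = np.array([1, 2, 3])

# After normalization, the corresponding norm of the result should be 1
l1_check = np.linalg.norm(normalize(pr, norm=1), 1)
l2_check = np.linalg.norm(normalize(pr, norm=2), 2)
print(l1_check, l2_check)
```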


# Exercise 3: Link based ranking
## Question 1 - Page Rank (Eigen-vector method)
Consider a tiny Web with three pages A, B and C with no inlinks,
and with initial PageRank = 1. Initially, none of the pages link to
any other pages and none link to them. 
Answer the following questions, and calculate the PageRank for
each question.

1. Link page A to page B.
2. Link all pages to each other.
3. Link page A to both B and C, and link pages B and C to A.
4. Use the previous links and add a link from page C to page B.

Hints: 
+ We are using the theoretical PageRank computation (without source of rank). See the slide "Transition Matrix for Random Walker" in the lecture notes. **Columns of the link matrix are from-vertices, rows of the link matrix are to-vertices**. We take the eigenvector with the largest eigenvalue.
+ We only care about the final ranking given by the probability vector. You can choose any normalization you like (or none).
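
The column convention in the first hint can be sketched with a vectorized helper. `transition_matrix` is a hypothetical name, not part of the exercise, and the sketch assumes a 0/1 link matrix in which pages without out-links keep an all-zero column.

```python
import numpy as np

def transition_matrix(L):
    """Column-normalize a link matrix where L[i, j] = 1 means page j links to page i."""
    L = np.asarray(L, dtype=float)
    out_degree = L.sum(axis=0)       # out-links of each from-vertex (column sums)
    R = np.zeros_like(L)
    has_links = out_degree > 0
    # Divide only the columns that have out-links; the others stay zero
    R[:, has_links] = L[:, has_links] / out_degree[has_links]
    return R

L_example = np.array([
    [0, 1, 1],   # A receives links from B and C
    [1, 0, 0],   # B receives a link from A
    [1, 0, 0],   # C receives a link from A
])
R_example = transition_matrix(L_example)
print(R_example)
```

Each column with at least one link sums to 1, which is what lets the matrix act as a transition matrix for the random walker.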

In [2]:
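# My solution
def create_Rmatrix(L):
    # Work on the transpose so that each row c holds the out-links of one page
    L_t = L.transpose()
    # Each out-link of a page gets weight 1 / (number of out-links);
    # pages without out-links keep an all-zero row
    R_t = np.array([[1/sum(c) if r else 0 for r in c] if sum(c)
        else [0] * L_t.shape[1] for c in L_t])
    # np.spacing(1) # adds a small 'extra'
    return R_t.transpose()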

def pagerank_eigen(L):
    # Construct transition probability matrix from L
    R = create_Rmatrix(L)
    # Compute eigen-vectors and eigen-values of R
    eigenvalues, eigenvectors = np.linalg.eig(R)
    # Take the eigen-vector with maximum eigen-value
    p = eigenvectors[:,np.argmax(eigenvalues)]
    return R, abs(normalize(p, 1))

In [3]:
L = np.array([
    [0,0,0], 
    [1,0,0], 
    [0,0,0]
])

R,p = pagerank_eigen(L)
print("L={}\nR={}\np={}".format(L,R,p))

L=[[0 0 0]
 [1 0 0]
 [0 0 0]]
R=[[0. 0. 0.]
 [1. 0. 0.]
 [0. 0. 0.]]
p=[0. 1. 0.]


#### 1.0 None of the pages link to any other, and none link to them

In [4]:
L = np.array([
    [0,0,0], 
    [0,0,0], 
    [0,0,0]
])
R,p = pagerank_eigen(L)
print("L={}\nR={}\np={}".format(L,R,p))

L=[[0 0 0]
 [0 0 0]
 [0 0 0]]
R=[[0 0 0]
 [0 0 0]
 [0 0 0]]
p=[1. 0. 0.]


#### 1.1 A links to B

In [5]:
L = np.array([
    [0,0,0], 
    [1,0,0], 
    [0,0,0]
])
R,p = pagerank_eigen(L)
print("L={}\nR={}\np={}".format(L,R,p))

L=[[0 0 0]
 [1 0 0]
 [0 0 0]]
R=[[0. 0. 0.]
 [1. 0. 0.]
 [0. 0. 0.]]
p=[0. 1. 0.]


#### 1.2 All pages link to each other

QUESTION! Do we need to normalize so that the sum of probabilities is 1?

In [6]:
L = np.array([
    [0,1,1], 
    [1,0,1], 
    [1,1,0]
])
R,p = pagerank_eigen(L)
print("L={}\nR={}\np={}".format(L,R,p))

L=[[0 1 1]
 [1 0 1]
 [1 1 0]]
R=[[0.  0.5 0.5]
 [0.5 0.  0.5]
 [0.5 0.5 0. ]]
p=[0.33333333 0.33333333 0.33333333]
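
On the normalization question raised above: an eigenvector is only defined up to a scalar factor, so rescaling changes the magnitudes but not the order of the entries, and the hints say that only the ranking matters. A minimal sketch, using an illustrative 3-page transition matrix with distinct scores:

```python
import numpy as np

# Illustrative column-stochastic matrix (columns are from-vertices)
R = np.array([
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.0, 0.0],
])
eigenvalues, eigenvectors = np.linalg.eig(R)
dominant = np.argmax(eigenvalues.real)            # eigenvalue 1 for a stochastic matrix
p_raw = np.abs(eigenvectors[:, dominant].real)    # scaled however np.linalg.eig returns it

p_l1 = p_raw / np.linalg.norm(p_raw, 1)           # entries sum to 1
p_scaled = 100 * p_raw                            # arbitrary rescaling

# Indices sorted from highest to lowest score
ranking_raw = np.argsort(-p_raw).tolist()
ranking_l1 = np.argsort(-p_l1).tolist()
ranking_scaled = np.argsort(-p_scaled).tolist()
print(ranking_raw, ranking_l1, ranking_scaled)
```

L1 normalization is still convenient when the entries should be read as probabilities.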


#### 1.3 A links to B and C; B and C link to A

In [7]:
L = np.array([
    [0,1,1], 
    [1,0,0], 
    [1,0,0]
])
R,p = pagerank_eigen(L)
print("L={}\nR={}\np={}".format(L,R,p))

L=[[0 1 1]
 [1 0 0]
 [1 0 0]]
R=[[0.  1.  1. ]
 [0.5 0.  0. ]
 [0.5 0.  0. ]]
p=[0.5  0.25 0.25]


#### 1.4 Previous links plus a link from C to B

In [8]:
L = np.array([
    [0,1,1], 
    [1,0,1], 
    [1,0,0]
])
R,p = pagerank_eigen(L)
print("L={}\nR={}\np={}".format(L,R,p))

L=[[0 1 1]
 [1 0 1]
 [1 0 0]]
R=[[0.  1.  0.5]
 [0.5 0.  0.5]
 [0.5 0.  0. ]]
p=[0.44444444 0.33333333 0.22222222]
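
The last result above can be checked by hand: a PageRank vector for eigenvalue 1 must satisfy $Rp = p$. A minimal sketch with the values copied from the output, where 4/9, 3/9 and 2/9 are the exact fractions behind the printed decimals:

```python
import numpy as np

R = np.array([
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.0, 0.0],
])
p = np.array([4, 3, 2]) / 9

# One step of the random walk should leave p unchanged
residual = np.linalg.norm(R @ p - p, 1)
print(residual)
```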


## Question 2 - Page Rank (Iterative method)

The eigen-vector method has some numerical issues (when computing the eigen-vector) and does not scale to large datasets.

We will apply the iterative method in the slide "Practical Computation of PageRank" of the lecture.

Dataset for practice: https://snap.stanford.edu/data/ca-GrQc.html

In [9]:
def pagerank_iterative(L, epsilon=0.001, q=0.9):
    N = L.shape[0]
    e_N = 1 / N * np.ones([N,1])
    
    # compute R
    R = create_Rmatrix(L)

    p = e_N # initialize to some vector
    delta = 1
    i = 0 # iteration counter

    while delta > epsilon:
        p_prev = p
        p = q * R.dot(p) + (1 - q) * e_N
        delta = np.linalg.norm(p - p_prev, 1)
        i += 1

    print("Converged after {} iterations".format(i))
    print("Ranking vector: p={}".format(p[:,0]))
    return R, p
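
The update rule can be tried on a 3-page graph before running it on the dataset. The sketch below is a self-contained variant: `power_iteration` and its `max_iter` safeguard are assumptions added for illustration, not part of the lecture's algorithm. With `q=1` on a graph where every page can reach every other page, the result should agree with the eigenvector method.

```python
import numpy as np

def power_iteration(R, q=0.9, epsilon=1e-10, max_iter=1000):
    N = R.shape[0]
    e_N = np.ones(N) / N
    p = e_N.copy()                              # start from the uniform distribution
    iterations = 0
    for iterations in range(1, max_iter + 1):
        p_next = q * (R @ p) + (1 - q) * e_N    # follow a link with prob. q, jump otherwise
        delta = np.linalg.norm(p_next - p, 1)
        p = p_next
        if delta <= epsilon:
            break
    return p, iterations

# A -> B, A -> C, B -> A, C -> A, C -> B (columns are from-vertices)
R_small = np.array([
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.0, 0.0],
])
p_no_jump, n_iter = power_iteration(R_small, q=1.0)
p_damped, _ = power_iteration(R_small, q=0.9)
print(p_no_jump, n_iter)
print(p_damped)
```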

#### Test with the dataset


In [10]:
# Construct link matrix from file
fname = "ca-GrQc.txt"

with open(fname) as f:
    content = f.readlines()
# skip the first 4 lines (header comments), strip trailing whitespace such as `\n`, and split each line into [from, to]
content = [x.strip().split() for x in content[4:]]
nodes = 5242
col_a = list(set([int(x[0]) for x in content]))
col_a.sort() 
assert(len(col_a) == nodes) # we check the length and see that all nodes are here

# dictionary to get original node IDs
dict_nodes = dict(zip(range(nodes), col_a))

# inverse dictionary: original node ID -> matrix index
inv_dict_nodes = dict(zip(col_a, range(nodes)))

L = np.zeros([nodes, nodes])

for f, t in content:
    L[inv_dict_nodes[int(t)], inv_dict_nodes[int(f)]] += 1
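
The two dictionaries above map between original node IDs and contiguous matrix indices. A possible alternative, sketched on a hypothetical toy edge list (the IDs below are made up), is to let `np.unique` build both directions of the mapping at once:

```python
import numpy as np

# Hypothetical edge list with non-contiguous node IDs, one [from, to] pair per row
edges = np.array([
    [42, 7],
    [7, 42],
    [42, 100],
    [100, 42],
])

# node_ids[k] is the original ID of matrix index k;
# the inverse array holds the matrix index of every entry of `edges`
node_ids, inverse = np.unique(edges, return_inverse=True)
index_edges = inverse.reshape(edges.shape)

N = len(node_ids)
L_toy = np.zeros((N, N))
for from_index, to_index in index_edges:
    L_toy[to_index, from_index] = 1   # columns are from-vertices, rows are to-vertices
print(node_ids)
print(L_toy)
```

Unlike the cell above, this collects IDs from both columns, so it would also cover nodes that only ever appear as a link target.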

In [11]:
# Run PageRank
import time
start_time = time.time()

R, p = pagerank_iterative(L)
# pagerank_iterative already prints the ranking vector

print("--- %s seconds ---" % (time.time() - start_time))

Converged after 30 iterations
Ranking vector: p=[4.21393757e-04 1.90766883e-04 2.80866886e-04 ... 1.86141083e-04
 9.90705173e-05 2.25980291e-04]
--- 19.936230659484863 seconds ---
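
A dense link matrix stores $N^2$ entries regardless of how many links exist, which plausibly accounts for part of the running time above. As a sketch of one alternative (not the lecture's method), the same update can be computed directly from the edge list with `np.bincount`. `pagerank_from_edges`, `src`, `dst` and `max_iter` are names introduced here, and the sketch assumes every listed source has at least one out-link and that nodes are already indexed 0..N-1.

```python
import numpy as np

def pagerank_from_edges(src, dst, N, q=0.9, epsilon=0.001, max_iter=1000):
    src = np.asarray(src)
    dst = np.asarray(dst)
    out_degree = np.bincount(src, minlength=N)   # number of out-links per node
    e_N = np.ones(N) / N
    p = e_N.copy()
    for _ in range(max_iter):
        # Each edge carries the rank of its source divided by the source's out-degree
        contribution = p[src] / out_degree[src]
        # Sum the contributions arriving at each target node, then add the random jump
        p_next = q * np.bincount(dst, weights=contribution, minlength=N) + (1 - q) * e_N
        delta = np.linalg.norm(p_next - p, 1)
        p = p_next
        if delta <= epsilon:
            break
    return p

# Toy graph: A -> B, A -> C, B -> A, C -> A, C -> B with A=0, B=1, C=2
src = [0, 0, 1, 2, 2]
dst = [1, 2, 0, 0, 1]
p_edges = pagerank_from_edges(src, dst, N=3, q=1.0, epsilon=1e-12)
print(p_edges)
```

Memory and time per iteration then grow with the number of edges rather than with $N^2$.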


In [12]:
print(max(p))

[0.00144239]


In [13]:
print(dict_nodes[np.argmax(p)])

14265


## Question 3 - Hub and Authority

### a)

Let the adjacency matrix for a graph of four vertices ($n_1$ to $n_4$) be
as follows:

$
A =
  \begin{bmatrix}
	0 & 1 & 1 & 1  \\
	0 & 0 & 1 & 1 \\
	1 & 0 & 0 & 1 \\
	0 & 0 & 0 & 1 \\
  \end{bmatrix}
$

Calculate the authority and hub scores for this graph using the
HITS algorithm with k = 6, and identify the best authority and
hub nodes.


QUESTION! Shouldn't A have the largest Authority score? (since it's the one with more incoming links!)

In [14]:
A = np.array([[0,1,1,1],[0,0,1,1],[1,0,0,1],[0,0,0,1]])

In [15]:
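def hits_iterative(A, k=6, delta=0.001):
    # Note: a_new = A @ h sums hub scores along each row of A, so this code
    # treats rows as to-vertices (A[i, j] = 1 means j links to i). That is the
    # transpose of the convention stated in the hint for this question.
    n = A.shape[0]
    a = 1 / n**2 * np.ones(n)  # initial authority scores
    h = a.copy()               # initial hub scores

    for i in range(k):
        a_new = A @ h  # authority update
        h_new = a @ A  # hub update (same as A.T @ a)
        # L1 change of each vector, measured before normalization
        delta_a = np.linalg.norm(a_new - a, 1)
        delta_h = np.linalg.norm(h_new - h, 1)
        a = normalize(a_new.copy(), 2)
        h = normalize(h_new.copy(), 2)
        # Stop early once both changes are within the tolerance
        if delta_a <= delta and delta_h <= delta:
            break

    return a, h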

In [16]:
hits_iterative(A, delta=1000)

(array([0.70710678, 0.47140452, 0.47140452, 0.23570226]),
 array([0.21320072, 0.21320072, 0.42640143, 0.85280287]))

In [17]:
a, h = hits_iterative(A)

In [18]:
max(a)

0.6535797066992358

In [19]:
np.argmax(a)

0

In [20]:
max(h)

0.8058737503296141

In [21]:
np.argmax(h)

3

### b)
Apply the HITS algorithm to the dataset: https://snap.stanford.edu/data/ca-GrQc.html

In [22]:
# Construct link matrix from file
fname = "ca-GrQc.txt"

with open(fname) as f:
    content = f.readlines()
# skip the first 4 lines (header comments), strip trailing whitespace such as `\n`, and split each line into [from, to]
content = [x.strip().split() for x in content[4:]]
nodes = 5242
col_a = list(set([int(x[0]) for x in content]))
col_a.sort() 
assert(len(col_a) == nodes) # we check the length and see that all nodes are here

# dictionary to get original node IDs
dict_nodes = dict(zip(range(nodes), col_a))

# inverse dictionary: original node ID -> matrix index
inv_dict_nodes = dict(zip(col_a, range(nodes)))

L = np.zeros([nodes, nodes])

for f, t in content:
    L[inv_dict_nodes[int(t)], inv_dict_nodes[int(f)]] += 1

In [23]:
a, h = hits_iterative(L)

In [24]:
max(a)

0.1158500753831797

In [25]:
dict_nodes[np.argmax(a)]

21012

In [26]:
max(h)

0.11585007538317971

In [27]:
dict_nodes[np.argmax(h)]

21012

**Hint:** We follow the slide "HITS algorithm" in the lecture. **Denote $x$ as the authority vector and $y$ as the hub vector**. You can use matrix multiplication for the update steps in the slide "Convergence of HITS". Note that the rows of the adjacency matrix are from-vertices and its columns are to-vertices.
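
Under the convention in this hint (rows are from-vertices, columns are to-vertices), the matrix form of the updates is $x \leftarrow A^T y$ and $y \leftarrow A x$. The sketch below assumes both vectors are updated from the previous iteration's values and L2-normalized each round; the exact variant on the slides may differ, and `hits` is a name introduced here. Note that this is the transpose of the update in `hits_iterative` above, which computes `A @ h` for the authority vector, so on a non-symmetric matrix the two label hubs and authorities the other way round.

```python
import numpy as np

def hits(A, k=6):
    """HITS sketch with A[i, j] = 1 meaning vertex i links to vertex j."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    x = np.ones(n) / np.sqrt(n)   # authority vector
    y = np.ones(n) / np.sqrt(n)   # hub vector
    for _ in range(k):
        x_new = A.T @ y           # authority: hub scores of the pages linking in
        y_new = A @ x             # hub: authority scores of the pages linked to
        x = x_new / np.linalg.norm(x_new, 2)
        y = y_new / np.linalg.norm(y_new, 2)
    return x, y

# Adjacency matrix from part a)
A = np.array([[0, 1, 1, 1], [0, 0, 1, 1], [1, 0, 0, 1], [0, 0, 0, 1]])
x, y = hits(A, k=6)
best_authority = int(np.argmax(x))
best_hub = int(np.argmax(y))
print(best_authority, best_hub)
```

With this reading, the vertex with the most incoming links ($n_4$, index 3) gets the highest authority score and $n_1$ (index 0) the highest hub score.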

## Question 4 - Ranking Methodology (Hard)

1. Give a directed graph, as small as possible, satisfying all the properties mentioned below:
    1. There exists a path from node i to node j for all nodes i, j in the directed graph. Recall that with this property the jump to an arbitrary node in PageRank is not required, so you can set q = 1 (refer to the lecture slides).
    2. The HITS authority ranking and the PageRank ranking of the graph nodes are different.

2. Give the intuition/methodology behind how you constructed a directed graph with the properties described in (1).

3. Are there specific graph structures with arbitrarily large instances where the PageRank ranking and the HITS authority ranking are the same?
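
When experimenting with candidate graphs, the first property (a path between every ordered pair of nodes) can be checked mechanically. The sketch below is a helper for such experiments, not an answer to the question. `is_strongly_connected` is a hypothetical name, and the check works by repeatedly extending the set of reachable nodes by one step.

```python
import numpy as np

def is_strongly_connected(A):
    """Return True if every node can reach every other node in the directed graph A."""
    A = np.asarray(A)
    n = A.shape[0]
    # One step: follow an edge or stay in place
    step = ((A != 0) | np.eye(n, dtype=bool)).astype(int)
    reach = step.copy()
    # Any shortest path between two nodes has at most n - 1 edges
    for _ in range(n - 1):
        reach = ((reach @ step) > 0).astype(int)
    return bool(reach.all())

cycle = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])        # 0 -> 1 -> 2 -> 0
single_link = np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]])  # only 0 -> 1
print(is_strongly_connected(cycle), is_strongly_connected(single_link))
```

The result is the same for a matrix and its transpose, so the row/column convention does not matter for this check.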