### CS4423 - Networks
Prof. Götz Pfeiffer<br />
School of Mathematics, Statistics and Applied Mathematics<br />
NUI Galway

#### 5. Power Laws and Scale-Free Graphs

# Lecture 19:  Hubs and Authorities

In [None]:
import numpy as np
import pandas as pd
import networkx as nx

### In-Degree vs. Out-Degree

Recall **in-degree** and **out-degree centrality**:
$$
c_i^{D^{\text{in}}} = k_i^{\text{in}} = \sum_{j=1}^n a_{ij},
\quad
c_i^{D^{\text{out}}} = k_i^{\text{out}} = \sum_{j=1}^n a_{ji},
$$
where $A = (a_{ij})$ is the adjacency matrix of a directed graph
$G = (X, E)$, with $a_{ij} = 1$ if there is an edge from node $j$ to node $i$ ...

... and the corresponding **eigenvector centralities**:
$$
A c^{E^{\text{in}}} = \lambda c^{E^{\text{in}}},
\quad
A^{T} c^{E^{\text{out}}} = \lambda c^{E^{\text{out}}}.
$$
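
As a quick numerical check of the two degree formulas, here is a minimal sketch on a hypothetical three-node graph (the edge list and the names `toy_edges`, `n_toy` are made up for illustration). It assumes the convention that $a_{ij} = 1$ if there is an edge from node $j$ to node $i$, so that row sums give in-degrees and column sums give out-degrees.

```python
import numpy as np

# Hypothetical toy graph with edges 0 -> 1, 0 -> 2 and 1 -> 2.
toy_edges = [(0, 1), (0, 2), (1, 2)]
n_toy = 3

# Assumed convention: A[i, j] = 1 if there is an edge j -> i.
A = np.zeros((n_toy, n_toy), dtype=int)
for j, i in toy_edges:
    A[i, j] = 1

in_degree = A.sum(axis=1)   # row sums: links received by each node
out_degree = A.sum(axis=0)  # column sums: links sent by each node

print("in-degree: ", in_degree.tolist())
print("out-degree:", out_degree.tolist())
```

With the opposite convention ($a_{ij} = 1$ for an edge from $i$ to $j$) the roles of rows and columns would simply be swapped.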

###  Hub Centrality and Authority Centrality

In a network of nodes connected by directed edges, each node
plays two different roles, one as a receiver of links, and one as
a sender of links.  A first measure of importance, or recognition, of
a node in this network might be the number of
links it receives, i.e., its **in-degree** in the underlying graph.
If in a collection of web pages relating to a search query on the
subject of "networks", say, a particular page receives a high number
of links, this page might count as an **authority** on that subject,
with **authority score** measured by its in-degree.

In turn, the web pages linking to an authority in some sense know
where to find valuable information and thus serve as good "lists" for
the subject.
A high-value list is called a **hub** for this query.
It makes sense to measure the value of a page as a list in
terms of the values of the pages it points at, by assigning to its
**hub score** the sum of the authority scores of the pages it points
at.


![hubs](images/hubs.png)

Now
the authority score of a page  could take the hub scores
of the list pages into account, by using the sum of the hub scores
of the pages that point at it as an updated authority score.

Then again, applying the **Principle of Repeated Improvement**,
the hub scores can be updated on the basis of the new authority scores,
and so on.

This suggests a ranking procedure that tries to estimate, for each page $p$,
its value as an authority and its value as a hub in the form
of numerical scores, $a(p)$ and $h(p)$.

Starting off with values all equal to $1$, the estimates are updated
by applying the following two rules in an alternating fashion.

<div class="alert alert-warning">

**Authority Update Rule:**
For each page $p$, update $a(p)$
to be the sum of the hub scores of all the pages pointing to it.
</div>


<div class="alert alert-warning">
    
**Hub Update Rule:**
For each page $p$,
update $h(p)$
to be the sum of the authority
scores of all the pages
that it points to.
</div>

In order to keep the numbers from growing too large,
score vectors should be **normalized** after each step,
in a way that  replaces $h$ by a scalar multiple $\hat{h} = sh$
so that the entries in $\hat{h}$ add up to $100$, say,
representing relative percentage values,
similarly for $a$.

After a number of iterations, the values $a(p)$ and
$h(p)$ stabilize, in the sense that further applications of
the update rules do not yield essentially better relative estimates.

**Example.**
Continuing the example above ...

In [None]:
nodes = list(range(1,10)) + ["A%s" % (i+1) for i in range(7)]
print(nodes)

In [None]:
edges = [
    (1,"A1"),(2,"A1"),(3,"A1"),(3,"A2"),(4,"A2"),(5,"A3"),
    (5,"A5"),(6,"A2"),(6,"A4"),(7,"A4"),(7,"A5"),(8,"A4"),
    (8,"A5"),(8,"A6"),(8,"A7"),(9,"A5"),(9,"A6"),(9,"A7")
]

In [None]:
G = nx.DiGraph()
G.add_nodes_from(nodes)
G.add_edges_from(edges)

In [None]:
pos = nx.circular_layout(G)
for i in [1,2,3,4]:
    j = 10 - i
    pos[i], pos[j] = pos[j], pos[i]
colors = 9 * ['y'] + 7 * ['w']

In [None]:
nx.draw(G, with_labels=True, node_color=colors, pos=pos)

Let's use dictionaries, with nodes as keys and hub or authority scores as values.
Here's a way to normalize such a record.

In [None]:
def normalized(d):
    s = sum(d.values())
    return { k: 100/s*v for k, v in d.items() }

Initially, all scores are set to $1$ (and then normalized).

In [None]:
hubs = normalized({ x : 1 for x in G })
auth = normalized({ x : 1 for x in G })

The update rules can then be implemented as follows.

In [None]:
def HubsUpdate(G, auth):
    h = { x: 0 for x in G }
    for x in G:
        for y in G.successors(x):
            h[x] += auth[y]
    return normalized(h)

def AuthUpdate(G, hubs):
    a = { x: 0 for x in G }
    for x in G:
        for y in G.successors(x):
            a[y] += hubs[x]
    return normalized(a)

Now we can apply the rules, alternating between the two, say 10 times, and observe how the scores stabilize.

In [None]:
for k in range(10):
    auth = AuthUpdate(G, hubs)
    print("auth = ", auth)
    hubs = HubsUpdate(G, auth)
    print("hubs = ", hubs)

All in one `python` function:

In [None]:
def HubsAuth(G, k):
    hubs = normalized({ x : 1 for x in G })
    auth = normalized({ x : 1 for x in G })
    for i in range(k):
        auth = AuthUpdate(G, hubs)
        hubs = HubsUpdate(G, auth)
    
    return hubs, auth

In [None]:
hubs, auth = HubsAuth(G, 10)
hubs

In [None]:
auth

Finally, let's apply this to a random directed graph.

In [None]:
n, m = 80, 120
G = nx.gnm_random_graph(n, m, directed=True)

In [None]:
hubs, auth = HubsAuth(G, 50)

Let's inspect the top and the bottom 20 scores.

In [None]:
[(k,auth[k]) for k in sorted(auth, key=auth.get, reverse=True)][:20]

In [None]:
[(k,auth[k]) for k in sorted(auth, key=auth.get)][:20]

In [None]:
[(k, hubs[k]) for k in sorted(hubs, key=hubs.get, reverse=True)][:20]

In [None]:
[(k, hubs[k]) for k in sorted(hubs, key=hubs.get)][:20]

In terms of matrix algebra this effect can be described as follows.

##  Spectral Analysis of Hubs and Authorities

Let $M = (m_{ij})$ be the **adjacency matrix** of the directed graph
$G = (X, E)$,
that is, $m_{ij} = 1$ if $x_j \to x_i$ and $m_{ij} = 0$ otherwise,
where $X = \{x_1, \dots, x_n\}$.

We write $h = (h_1, \dots, h_n)$ for a list of hub scores, with $h_i = h(x_i)$,
the hub score of node $x_i$.  Similarly, we write $a = (a_1, \dots, a_n)$ for
a list of authority scores.

The **hub update rule** can now be expressed as
a **matrix multiplication**, using the transpose of the matrix $M$:
$$
h \gets M^T a
$$
and similarly, the **authority update rule**, using $M$ itself:
$$
a \gets M h
$$

Applying two steps of the procedure at once yields update rules
$$
  h \gets M^T M h
$$
and
$$
  a \gets M M^T \, a
$$
for $h$ and $a$, respectively.  
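
To see the matrix form of the update rules at work, the following sketch rebuilds the example graph from earlier in this lecture as a matrix $M$ and applies one authority update and one hub update. The names `ex_nodes`, `ex_edges`, `index`, `percent`, `h_vec`, `a_vec`, `h_new` and `h_two_step` are introduced here for illustration only; the normalization to a total of $100$ follows the convention used above.

```python
import numpy as np

# Nodes and edges of the example graph used earlier in this lecture.
ex_nodes = list(range(1, 10)) + ["A%s" % (i + 1) for i in range(7)]
ex_edges = [
    (1, "A1"), (2, "A1"), (3, "A1"), (3, "A2"), (4, "A2"), (5, "A3"),
    (5, "A5"), (6, "A2"), (6, "A4"), (7, "A4"), (7, "A5"), (8, "A4"),
    (8, "A5"), (8, "A6"), (8, "A7"), (9, "A5"), (9, "A6"), (9, "A7"),
]
index = {x: i for i, x in enumerate(ex_nodes)}

# M[i, j] = 1 if x_j -> x_i, as in the definition above.
M = np.zeros((len(ex_nodes), len(ex_nodes)))
for x, y in ex_edges:
    M[index[y], index[x]] = 1

def percent(v):
    """Rescale a score vector so that its entries add up to 100."""
    return 100 * v / v.sum()

h_vec = percent(np.ones(len(ex_nodes)))
a_vec = percent(M @ h_vec)     # authority update: a <- M h
h_new = percent(M.T @ a_vec)   # hub update:       h <- M^T a

# The two steps combined agree with a single multiplication by M^T M.
h_two_step = percent(M.T @ M @ h_vec)
print(np.allclose(h_new, h_two_step))
print(dict(zip(ex_nodes, np.round(a_vec, 2).tolist())))
```

After the first authority update, the scores are simply the in-degrees, rescaled to percentages; later rounds weight each incoming link by the hub score of its source.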

**In the limit**, one expects
to get vectors $h^{\ast}$ and $a^{\ast}$ whose directions do not change
under the latter rules, i.e.,
$$
  (M^T M) h^{\ast} = c h^{\ast}
$$
and
$$
  (M M^T) a^{\ast} = d a^{\ast}
$$
for certain constants $c$ and $d$, meaning that $h^{\ast}$ and $a^{\ast}$
are **eigenvectors** for the matrices $M^T M$ and $M M^T$,
respectively.

Using the fact that $M^T M$ and $M M^T$ are **symmetric** matrices
($(M^T M)^T = M^T (M^T)^T = M^T M$),
it can indeed be shown that any sequence of hub score vectors
$h$ under repeated application of the above update rule
converges to a real-valued eigenvector $h^{\ast}$ of $M^T M$ for the real eigenvalue $c$.
The argument uses the [Spectral Theorem](https://en.wikipedia.org/wiki/Spectral_theorem)
for [real symmetric matrices](https://en.wikipedia.org/wiki/Symmetric_matrix#Real_symmetric_matrices).


A similar result exists for any sequence of authority score vectors $a$.
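
The eigenvector property can be checked numerically. The sketch below uses a small hypothetical graph (not the example above) and compares the result of repeated hub updates, started from all-ones scores, with the eigenvector of $M^T M$ for its largest eigenvalue as computed by `np.linalg.eigh`. The names `toy_edges`, `size`, `h_iter` and `h_star` are assumptions made for this illustration.

```python
import numpy as np

# Hypothetical toy graph: edges 0 -> 2, 0 -> 3, 1 -> 3, 2 -> 3.
toy_edges = [(0, 2), (0, 3), (1, 3), (2, 3)]
size = 4

# M[i, j] = 1 if x_j -> x_i, as in the definition above.
M = np.zeros((size, size))
for j, i in toy_edges:
    M[i, j] = 1

# Repeated improvement for the hub scores: h <- M^T M h, then normalize.
h_iter = np.ones(size)
for _ in range(50):
    h_iter = M.T @ M @ h_iter
    h_iter = 100 * h_iter / h_iter.sum()

# Eigenvector of the symmetric matrix M^T M for its largest eigenvalue.
# np.linalg.eigh returns the eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(M.T @ M)
c = eigenvalues[-1]
h_star = eigenvectors[:, -1]
h_star = 100 * h_star / h_star.sum()   # same normalization, also fixes the sign

print("largest eigenvalue c:", round(float(c), 4))
print("repeated updates:    ", np.round(h_iter, 4).tolist())
print("eigenvector:         ", np.round(h_star, 4).tolist())
```

In this toy graph node $3$ links to no other page, so its hub score is $0$ in both computations.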

## PageRank

A simpler model of endorsement for web pages involves only
one numerical value $r(p)$ per page $p$, built on the principle that
**a page is as important as the pages linking to it**.
As before, these importance values can be obtained by
repeatedly applying a suitable update rule to a set of current values.

Specifically, PageRank is computed as follows.

* If the network has $n$ nodes, each page $p$ receives an initial PageRank
of $r(p) = 1/n$.

* Choose a number of steps, $k$.

* Perform the following update rule $k$ times.

<div class="alert alert-warning" markdown="1">

**Basic PageRank Update Rule:**
Each page divides its current PageRank by the number of
pages it links to, and passes this value on to those pages.
In this way, each page updates its PageRank to be the sum of
all the shares it receives from the pages linking to it.
</div>

Since the total PageRank of all nodes is maintained in each step
(each node splits its PageRank into equal parts and passes these on,
so nothing is lost or gained overall), there is no need for normalization.
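
The conservation of the total PageRank can be observed on a tiny example. The following sketch applies the basic update rule to a hypothetical network of three pages, using exact fractions so that the total can be compared with $1$ without rounding errors. The names `links`, `basic_update` and `rank` are made up for this illustration.

```python
from fractions import Fraction

# Hypothetical toy network: each page maps to the list of pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

def basic_update(rank, links):
    """One application of the basic PageRank update rule."""
    new_rank = {page: Fraction(0) for page in links}
    for page, targets in links.items():
        share = rank[page] / len(targets)   # equal share per outgoing link
        for target in targets:
            new_rank[target] += share
    return new_rank

rank = {page: Fraction(1, len(links)) for page in links}
for step in range(3):
    rank = basic_update(rank, links)
    values = {page: str(value) for page, value in rank.items()}
    print(step + 1, values, "total =", sum(rank.values()))
```

The individual values change from step to step, but the total stays at exactly $1$.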

After a number of steps, the PageRank values of the individual nodes 
stabilize.  These values form an equilibrium in the sense that
another application of the update rule will produce exactly the same
values.

**Example.**  The following graph represents
a network of $8$ web pages with hyperlinks.

![pagerank](images/pagerank.png)

The following table shows how the initial PageRank
of $\frac18$ of each page is updated under six iterations
of the basic PageRank update rule
and, in the bottom row, the limit values.

![pagerank-p](images/pagerank-p.png)

For a slightly larger example, let's implement this algorithm as a `python` program.

In [None]:
n, m = 80, 120
G = nx.gnm_random_graph(n, m, directed=True)

In [None]:
G.out_degree()

The algorithm doesn't work if there is a node $x$ with no successors in $G$. (Why?)
So for now, let's add some random edges to make sure each node $x$ has out-degree at least $1$.

In [None]:
import random

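# Give every node one extra random out-edge (no self-loops),
# so that no node is left with out-degree 0.
for x in G:
    y = x
    while y == x:
        y = random.randrange(n)
    G.add_edge(x, y)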

In [None]:
def PageRank(G, k):
    n = G.order()
    r = { x: 1/n for x in G }
    for i in range(k):
        s = { x : 0 for x in G }
        for x in G:
            l = G.out_degree(x)
            v = r[x]/l
            for y in G.successors(x):
                s[y] += v
        r = s
    return r

In [None]:
k = 20
pr = PageRank(G, k)
[(k,pr[k]) for k in sorted(pr, key=pr.get)][:20]

In [None]:
[(k,pr[k]) for k in sorted(pr, key=pr.get, reverse=True)][:20]

In terms of matrix algebra this effect can be described as follows.

##  Spectral Analysis of PageRank

Here, we use a **variant of the adjacency matrix** $A$ of the directed graph $G = (X, E)$.

Let $N$ be the $n \times n$ matrix with entries $N_{ij} = 0$
if node $x_j$ is not linked to node $x_i$ (as for the adjacency matrix $A$).
And when $x_j \to x_i$, then set $N_{ij} = 1/l_j$, 
where $l_j$ is the number of links out of $x_j$.

Write $r = (r_1, \dots, r_n)$ for the list of PageRank values of the nodes
$X = \{x_1, \dots, x_n\}$.  Then the **basic PageRank update rule**
can be expressed as **matrix multiplication**:
$$
r \gets N \,r.
$$
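
As an illustration, the matrix $N$ can be obtained from the adjacency matrix by dividing each column by its sum. The sketch below does this for a hypothetical network of three pages, assuming (as required above) that every page has at least one outgoing link. The names `toy_edges`, `size`, `out_links`, `r0` and `r1` are introduced here for illustration.

```python
import numpy as np

# Hypothetical toy network with edges 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
size = 3

# Adjacency matrix with A[i, j] = 1 if x_j -> x_i.
A = np.zeros((size, size))
for j, i in toy_edges:
    A[i, j] = 1

out_links = A.sum(axis=0)   # l_j = number of links out of x_j
N = A / out_links           # broadcasting divides column j by l_j

r0 = np.full(size, 1 / size)
r1 = N @ r0                 # one basic PageRank update: r <- N r

print(N)
print(r1.tolist(), "total =", r1.sum())
```

Each column of $N$ adds up to $1$, which is the matrix version of the statement that every page passes on all of its PageRank.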

It can be shown that repeated application of the basic PageRank update rule
lets the PageRank values converge towards a vector $r^{\ast}$ with the property
$$
N\, r^{\ast} = r^{\ast},
$$
that is, $r^{\ast}$ is an **eigenvector** of $N$ for the eigenvalue $1$.

The argument uses the [Perron-Frobenius Theorem](https://en.wikipedia.org/wiki/Perron%E2%80%93Frobenius_theorem) which we have seen before.  Recall that, for a matrix in which all entries are non-negative (such as the matrix $N$), the theorem guarantees the existence of a **real eigenvalue**
with corresponding **eigenvector having non-negative entries**.
(Not every matrix with real entries has this property.)
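
For a small case, the equilibrium can be compared directly with an eigenvector computed by `np.linalg.eig`. The sketch below uses the matrix $N$ of a hypothetical three-page network (edges $0 \to 1$, $0 \to 2$, $1 \to 2$, $2 \to 0$); the names `r_iter`, `r_star` and `idx` are assumptions made for this illustration, and the observed convergence is a property of this particular toy network.

```python
import numpy as np

# Matrix N of a hypothetical toy network
# with edges 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
N = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])

# Repeated application of the basic PageRank update rule.
r_iter = np.full(3, 1 / 3)
for _ in range(100):
    r_iter = N @ r_iter

# Eigenvector of N for the eigenvalue closest to 1, rescaled to sum to 1.
eigenvalues, eigenvectors = np.linalg.eig(N)
idx = np.argmin(np.abs(eigenvalues - 1))
r_star = np.real(eigenvectors[:, idx])
r_star = r_star / r_star.sum()

print(np.round(r_iter, 6).tolist())
print(np.round(r_star, 6).tolist())
```

Both computations give the PageRank values $\frac25, \frac15, \frac25$ for the three pages.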

## Code Corner

### `python`

* `sorted`: [[doc]](https://docs.python.org/3/library/functions.html#sorted)

### `random`

* `randrange` vs. `randint`: [[doc]](https://docs.python.org/3/library/random.html)

### `networkx`

* `DiGraph`: [[doc]](https://networkx.github.io/documentation/stable/reference/classes/digraph.html)


* `circular_layout`: [[doc]](https://networkx.github.io/documentation/stable/reference/generated/networkx.drawing.layout.circular_layout.html)

## Exercises

1. It might be tempting to combine the hubs and authorities update rules into a single `for` loop.
Why is this not a good idea?

1. Create a random $G(n, m)$ graph (with $n = 80$ and $m = 100$, say) and identify its Bow-Tie components
(Giant SCC, IN, OUT, tendrils and tubes).  Then compute Hubs and Authority centralities to identify
strong hubs and heavy authorities.  Which Bow-Tie components do hubs and authorities prefer?

1. Create a random $G(n, m)$ graph (with $n = 80$ and $m = 100$, say) and 
complete the graph with further random edges so that each node has out-degree at least $1$.
Identify the graph's Bow-Tie components
(Giant SCC, IN, OUT, tendrils and tubes).  Then compute PageRank centralities to identify
important pages.  Which Bow-Tie components do these pages prefer?

