---
title: 8.3 Page Rank and the Google Matrix
subject:  Iteration
subtitle: ranking websites
short_title: 8.3 Page Rank and the Google Matrix
authors:
  - name: Nikolai Matni
    affiliations:
      - Dept. of Electrical and Systems Engineering
      - University of Pennsylvania
    email: nmatni@seas.upenn.edu
license: CC-BY-4.0
keywords: 
math:
  '\vv': '\mathbf{#1}'
  '\bm': '\begin{bmatrix}'
  '\em': '\end{bmatrix}'
  '\R': '\mathbb{R}'
---

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nikolaimatni/ese-2030/HEAD?labpath=/07_Ch_8_Iteration/093-Random_Walks.ipynb)

{doc}`Lecture notes <../lecture_notes/Lecture 15 - Linear Iterative Systems, Matrix Powers, Markov Chains, and Google’s PageRank.pdf>`

## Reading

Material related to this page, as well as additional exercises, can be found in LAA $5^{th}$ edition 10.2.

## Learning Objectives

By the end of this page, you should know:
- what a random walk on a graph is
- how a simple internet is modeled as a random walk on a graph
- how ranking webpages helps internet users
- the primary modifications made by Google to address the non-regularity of transition matrices that model links between webpages
- the sheer size of Google's transition matrix and how it is handled in practice

## Random Walk on the Internet

An internet user's behavior while surfing the web can be modeled as a Markov chain that captures a _random walk on a graph_. We start with a simple example of this, and then explain how it can be used to design a search engine.

Consider the following graph:

:::{figure}../figures/08-graph_google.jpg
:label:graph_google
:alt:Graph
:width: 250px
:align: center
:::

which has seven vertices interconnected by edges. Let's pretend this graph models a very simple internet: each vertex, or node, is a webpage, and each edge is a hyperlink connecting two pages. (For now we assume that if page $i$ links to page $j$, then page $j$ also links to page $i$, but this isn't necessary.) We assume that if a user is on page $i$, they will click on one of its hyperlinks with equal probability. For example, in our simple internet, if a user is on page 5, they will visit page 2 next 50\% of the time and page 6 next 50\% of the time. Similarly, if a user is on page 3, they will visit page 1, 2, 4, or 6 next 25\% of the time each.

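We can model this user behavior using a Markov chain, which is called a _simple random walk on a graph_. The transition matrix for this graph, whose entry $P_{ij}$ is the probability of moving from page $j$ (column label) to page $i$ (row label), is given by: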

$$
& \quad \quad  \begin{matrix}
1 & 2 & \ 3 & 4 & 5 & 6 & 7 \end{matrix} \\
P &= \begin{bmatrix}
0 & \frac{1}{3} & \frac{1}{4} & 0 & 0 & 0 & 0 \\
\frac{1}{2} & 0 & \frac{1}{4} & 0 & \frac{1}{2} & 0 & 0 \\
\frac{1}{2} &  \frac{1}{3} & 0 & 1 & 0 &  \frac{1}{3} & 0 \\
0 & 0 & \frac{1}{4} & 0 & 0 & 0 & 0 \\
0 &  \frac{1}{3} & 0 & 0 & 0 &  \frac{1}{3} & 0 \\
0 & 0 & \frac{1}{4} & 0 & \frac{1}{2} & 0 & 1 \\
0 & 0 & 0 & 0 & 0 &  \frac{1}{3} & 0
\end{bmatrix} 
\begin{matrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7\end{matrix} 
$$

This allows us to answer questions such as the following: suppose 100 users start on page 6. After each user clicks on three hyperlinks, what percentage of users do we expect to find on each webpage? The solution is given by setting our initial user distribution to:

$$
\vv x(0) = \begin{bmatrix}
0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0
\end{bmatrix}
$$

and computing $\vv x(3) = P^3\vv x(0) = \begin{bmatrix}
.0833 \\ .0417 \\ .4028 \\ 0 \\ .2778 \\ 0 \\ .1944
\end{bmatrix}$

This tells us, for example, that 40.28\% of users starting on page 6 end up on page 3 after three hyperlink clicks.
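The computation above can be reproduced numerically. The following is a minimal sketch using NumPy (an assumption, since these notes do not prescribe a library). The edge list `edges` is read off from the nonzero pattern of $P$, and the variable names are illustrative.

```python
import numpy as np

# Undirected edges of the seven-page example, read off from the nonzero
# entries of P (pages are numbered 1..7 as in the figure).
edges = [(1, 2), (1, 3), (2, 3), (2, 5), (3, 4), (3, 6), (5, 6), (6, 7)]
n = 7

# Adjacency matrix: A[i, j] = 1 if pages i+1 and j+1 link to each other.
A = np.zeros((n, n))
for i, j in edges:
    A[i - 1, j - 1] = 1.0
    A[j - 1, i - 1] = 1.0

# Column j of P spreads probability evenly over the neighbors of page j+1,
# so each column of A is divided by its sum (the number of links on that page).
P = A / A.sum(axis=0)

# All users start on page 6.
x0 = np.zeros(n)
x0[5] = 1.0

# Distribution after three clicks: x(3) = P^3 x(0).
x3 = np.linalg.matrix_power(P, 3) @ x0
print(np.round(x3, 4))
```

The printed vector should agree with $\vv x(3)$ above up to rounding, and changing the exponent gives the distribution after any other number of clicks.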

## Page Rank

The founders of Google, Sergey Brin and Lawrence Page, reasoned that important pages have links coming from other "important" pages, and thus that a typical internet user spends more time on more important pages and less time on less important ones. This can be captured by the steady state distribution $\vv x^*$ of the Markov chain we are using to model the internet: in the long run, a typical user will spend a fraction $\vv x^*_i$ of their time on page $i$. This is precisely the observation used to define the Page Rank algorithm, which Google uses to rank the importance of the webpages it catalogs.

:::{important} Key Idea
The importance of a webpage $i$ can be measured by the corresponding entry $\vv x^*_i$ of the steady state vector $\vv x^*$ of the Markov chain describing the behavior of a typical internet user.
:::

Now, if a typical transition matrix $P$ describing the internet were [regular](./092-Markov_Chains.ipynb#regular_defn), we would be done: we would simply solve $\vv x^* = P\vv x^*$ and use $\vv x^*$ to rank websites. Unfortunately, typical models of the web are [directed graphs](https://en.wikipedia.org/wiki/Directed_graph), which lead to non-regular transition matrices $P$. To address this, Google makes two adjustments, which we illustrate with the following slight modification of our previous example:

:::{figure}../figures/08-directed_graph.jpg
:label:directed_graph
:alt:Directed Graph
:width: 250px
:align: center
:::

$$
& \quad \quad  \begin{matrix}
1 & 2 & \ 3 & 4 & 5 & 6 & 7 \end{matrix} \\
P &= \begin{bmatrix}
0 & \frac{1}{2} & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & \frac{1}{3} & 0 & \frac{1}{2} & 0 & 0 \\
1 &  0 & 0 & 0 & 0 &  \frac{1}{3} & 0 \\
0 & 0 & \frac{1}{3} & 1 & 0 & 0 & 0 \\
0 &  \frac{1}{2} & 0 & 0 & 0 &  \frac{1}{3} & 0 \\
0 & 0 & \frac{1}{3} & 0 & \frac{1}{2} & 0 & 0 \\
0 & 0 & 0 & 0 & 0 &  \frac{1}{3} & 1
\end{bmatrix} 
\begin{matrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7\end{matrix} 
$$

The first issue that arises here is that pages 4 and 7 are dangling nodes: if a user ever ends up on page 4 or 7, they never leave as there are no outgoing links. This means that columns 4 and 7 never change as we compute $P^k$, and hence $P$ cannot be regular. To handle dangling nodes, the following adjustment is made:

:::{note}Adjustment 1 
If an internet user reaches a dangling node, they will pick any page on the web with equal probability and move to that page.
:::

The above means that if state $j$ is an absorbing state, we replace column $j$ of $P$ with the vector $\bm \frac{1}{n} & \ldots & \frac{1}{n} \em^\top$, where $n$ is the number of pages. For example, our modified transition matrix is now:

$$
& \quad \quad  \begin{matrix}1 & 2 & \ 3 & 4 & 5 & 6 & 7 \end{matrix} \\
P_{*} &= \begin{bmatrix}
0 & \frac{1}{2} & 0 & \frac{1}{7} & 0 & 0 & \frac{1}{7} \\
0 & 0 & \frac{1}{3} & \frac{1}{7} & \frac{1}{2} & 0 & \frac{1}{7} \\
1 & 0 & 0 & \frac{1}{7} & 0 & \frac{1}{3} & \frac{1}{7} \\
0 & 0 & \frac{1}{3} &\frac{1}{7} & 0 & 0 & \frac{1}{7} \\
0 & \frac{1}{2} & 0 &\frac{1}{7} & 0 & \frac{1}{3} & \frac{1}{7} \\
0 & 0 & \frac{1}{3} & \frac{1}{7} & \frac{1}{2} & 0 & \frac{1}{7} \\
0 & 0 & 0 & \frac{1}{7} & 0 & \frac{1}{3} & \frac{1}{7}
\end{bmatrix}
\begin{matrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7\end{matrix} 
$$
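Adjustment 1 can be sketched in a few lines of NumPy. This is an illustrative sketch, not a prescribed implementation: it assumes a dangling page is recognized by a 1 on the diagonal of $P$ (as for pages 4 and 7 here), and the names `dangling` and `P_star` are hypothetical.

```python
import numpy as np

# Transition matrix of the directed example; column j holds the
# probabilities of moving from page j+1 to every other page.
P = np.array([
    [0, 1/2, 0,   0, 0,   0,   0],
    [0, 0,   1/3, 0, 1/2, 0,   0],
    [1, 0,   0,   0, 0,   1/3, 0],
    [0, 0,   1/3, 1, 0,   0,   0],
    [0, 1/2, 0,   0, 0,   1/3, 0],
    [0, 0,   1/3, 0, 1/2, 0,   0],
    [0, 0,   0,   0, 0,   1/3, 1],
])
n = P.shape[0]

# Assumption: a page is dangling (absorbing) when its column puts all
# probability on the page itself, i.e. the diagonal entry equals 1.
dangling = np.isclose(np.diag(P), 1.0)

# Adjustment 1: replace each dangling column by the uniform vector (1/n, ..., 1/n).
P_star = P.copy()
P_star[:, dangling] = 1.0 / n

print(np.flatnonzero(dangling) + 1)   # page numbers of the dangling nodes
print(P_star.sum(axis=0))             # every column still sums to 1
```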

While this helps eliminate dangling nodes, we may still not have a regular transition matrix, as there might still be **cycles** of pages. For example, if page $i$ only links to page $j$, and page $j$ only links to page $i$, a user entering either page is condemned to moving back and forth between the two pages. This means the corresponding columns of $P_*^k$ will always have zeros in them, and hence $P_*$ would not be regular.
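To see the problem concretely, consider a hypothetical two-page web (not part of the example above) in which page 1 links only to page 2 and page 2 links only to page 1. The sketch below, which assumes NumPy, checks that no power of its transition matrix `C` has all positive entries.

```python
import numpy as np

# Hypothetical two-page web: page 1 links only to page 2 and vice versa.
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])

for k in range(1, 5):
    Ck = np.linalg.matrix_power(C, k)
    # A regular matrix needs some power with all entries strictly positive.
    print(k, Ck.tolist(), bool(np.all(Ck > 0)))
```

The powers alternate between $C$ and the identity matrix, so each of them contains zeros and $C$ is not regular.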


:::{note}Adjustment 2
Pick a number $p$ between 0 and 1. If a user is on page $i$, then a fraction $p$ of the time they will pick one of the hyperlinks on that page with equal probability and follow it. The other $1-p$ fraction of the time, they will pick any page on the web with equal probability and move to that page.
:::

In terms of the modified transition matrix $P_*$, the new transition matrix will be

$$
G = pP_* + (1-p)K,
$$

where $K \in \mathbb{R}^{n \times n}$ and $K_{ij} = \frac{1}{n}$ for $i,j=1,\ldots,n$. The matrix $G$ is called the _Google matrix_. $G$ is easily seen to be regular, as all entries of $G$ are positive.
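To spell out why this holds, here is a short check, assuming $0 \leq p < 1$ and using that each column of $P_*$ and of $K$ sums to one. Every entry of $G$ is bounded below by a positive number,

$$
G_{ij} = p\,(P_*)_{ij} + (1-p)\frac{1}{n} \geq \frac{1-p}{n} > 0,
$$

and every column of $G$ still sums to one,

$$
\sum_{i=1}^n G_{ij} = p\sum_{i=1}^n (P_*)_{ij} + (1-p)\sum_{i=1}^n \frac{1}{n} = p + (1-p) = 1,
$$

so $G$ is a transition matrix whose first power already has all positive entries.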

Although any value of $p$ strictly between 0 and 1 is valid, Google is thought to use $p = 0.85$. For our example, the Google matrix is

$$
G &= .85\begin{bmatrix}
0 & \frac{1}{2} & 0 & \frac{1}{7} & 0 & 0 & \frac{1}{7} \\
0 & 0 & \frac{1}{3} & \frac{1}{7} & \frac{1}{2} & 0 & \frac{1}{7} \\
1 & 0 & 0 & \frac{1}{7} & 0 & \frac{1}{3} & \frac{1}{7} \\
0 & 0 & \frac{1}{3} &\frac{1}{7} & 0 & 0 & \frac{1}{7} \\
0 & \frac{1}{2} & 0 &\frac{1}{7} & 0 & \frac{1}{3} & \frac{1}{7} \\
0 & 0 & \frac{1}{3} & \frac{1}{7} & \frac{1}{2} & 0 & \frac{1}{7} \\
0 & 0 & 0 & \frac{1}{7} & 0 & \frac{1}{3} & \frac{1}{7}
\end{bmatrix} 
+ .15\begin{bmatrix}
\frac{1}{7}  & \frac{1}{7} & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7} \\
\frac{1}{7}  & \frac{1}{7} & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7} \\
\frac{1}{7}  & \frac{1}{7} & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7} \\
\frac{1}{7}  & \frac{1}{7} & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7} \\
\frac{1}{7}  & \frac{1}{7} & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7} \\
\frac{1}{7}  & \frac{1}{7} & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7} \\
\frac{1}{7}  & \frac{1}{7} & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7}  & \frac{1}{7}
\end{bmatrix} \\
&= \begin{bmatrix}
.021429 & .446429 & .021429 & .142857 & .021429 & .021429 & .142857 \\
.021429 & .021429 & .304762 & .142857 & .446429 & .021429 & .142857 \\
.871429 & .021429 & .021429 & .142857 & .021429 & .304762 & .142857 \\
.021429 & .021429 & .304762 & .142857 & .021429 & .021429 & .142857 \\
.021429 & .446429 & .021429 & .142857 & .021429 & .304762 & .142857 \\
.021429 & .021429 & .304762 & .142857 & .446429 & .021429 & .142857 \\
.021429 & .021429 & .021429 & .142857 & .021429 & .304762 & .142857
\end{bmatrix}
$$

We can now compute the steady state vector $\vv x^* = G\vv x^*$, which is found to be:

$$
\vv x^* = \begin{bmatrix}
.116293 \\
.168567 \\
.191263 \\
.098844 \\
.164054 \\
.168567 \\
.092413
\end{bmatrix}
$$

Thus, we can rank the pages in terms of descending importance as: 3, 2, 6, 5, 1, 4, 7 (pages 2 and 6 are tied, since rows 2 and 6 of $G$ are identical).
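For a web this small, the steady state can be checked directly with an eigenvalue solver. The sketch below assumes NumPy and encodes the directed example as a dictionary `links` (an illustrative representation read off from the columns of $P$). It builds $P_*$ and $G$ with $p = 0.85$, then extracts the eigenvector for eigenvalue 1.

```python
import numpy as np

# Outgoing links of the directed example (page -> pages it links to),
# read off from the columns of P; pages 4 and 7 are dangling.
links = {1: [3], 2: [1, 5], 3: [2, 4, 6], 4: [], 5: [2, 6], 6: [3, 5, 7], 7: []}
n = len(links)
p = 0.85

# Build P_* column by column (Adjustment 1 for pages with no links).
P_star = np.zeros((n, n))
for j, targets in links.items():
    if targets:
        for i in targets:
            P_star[i - 1, j - 1] = 1.0 / len(targets)
    else:
        P_star[:, j - 1] = 1.0 / n

# Adjustment 2: Google matrix G = p P_* + (1 - p) K with K_ij = 1/n.
G = p * P_star + (1 - p) * np.full((n, n), 1.0 / n)

# The steady state is the eigenvector of G for eigenvalue 1,
# rescaled so that its entries sum to 1.
eigvals, eigvecs = np.linalg.eig(G)
k = np.argmin(np.abs(eigvals - 1.0))
x_star = np.real(eigvecs[:, k])
x_star = x_star / x_star.sum()

# Rank pages from most to least important (ties broken by page number).
ranking = sorted(range(1, n + 1), key=lambda i: (-round(x_star[i - 1], 9), i))
print(np.round(x_star, 6))
print(ranking)
```

A dense eigenvalue solve like this is only practical for small examples; it does not scale to a matrix with billions of rows.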

:::{note}
The Google matrix for the world wide web has over 8 billion rows and columns, so computing $\vv x^* = G \vv x^*$ is a highly non-trivial task. In practice, an iterative approach known as the _power method_ is used. It typically takes Google several days to compute a new $\vv x^*$, which it does every month.

Some resources for the power method:

1. https://www.mpp.mpg.de/~jingliu/ECPI/PowerMethodProof.pdf
2. https://en.wikipedia.org/wiki/Power_iteration
:::
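The idea behind the power method can be sketched on our seven-page example: start from any distribution and repeatedly apply $G$ until the vector stops changing. The code below is a toy illustration under stated assumptions, not a description of Google's actual implementation. The function `pagerank_power`, its parameters `tol` and `max_iter`, and the dictionary `links` are all hypothetical names. It never forms $G$ explicitly, applying the two adjustments directly to the current distribution instead.

```python
import numpy as np

def pagerank_power(links, p=0.85, tol=1e-12, max_iter=1000):
    """Approximate the steady state of the Google matrix by power iteration.

    `links` maps each page (numbered 1..n) to the list of pages it links to.
    G is never formed explicitly; each step applies G = p P_* + (1 - p) K
    directly to the current distribution.
    """
    n = len(links)
    x = np.full(n, 1.0 / n)          # start from the uniform distribution
    for _ in range(max_iter):
        x_new = np.zeros(n)
        dangling_mass = 0.0
        for j, targets in links.items():
            if targets:
                # page j splits its probability evenly among its links
                for i in targets:
                    x_new[i - 1] += x[j - 1] / len(targets)
            else:
                # Adjustment 1: a dangling page jumps to a uniformly random page
                dangling_mass += x[j - 1]
        x_new += dangling_mass / n
        # Adjustment 2: follow links with probability p, else jump uniformly
        x_new = p * x_new + (1 - p) * x.sum() / n
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

links = {1: [3], 2: [1, 5], 3: [2, 4, 6], 4: [], 5: [2, 6], 6: [3, 5, 7], 7: []}
x_star = pagerank_power(links)
print(np.round(x_star, 6))
```

Each iteration costs work proportional to the number of hyperlinks rather than $n^2$, which is why this style of computation remains feasible when $G$ is far too large to store.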
