---
title: 8.2 Markov Processes
subject:  Iteration
subtitle: 
short_title: 8.2 Markov Processes
authors:
  - name: Nikolai Matni
    affiliations:
      - Dept. of Electrical and Systems Engineering
      - University of Pennsylvania
    email: nmatni@seas.upenn.edu
license: CC-BY-4.0
keywords: 
math:
  '\vv': '\mathbf{#1}'
  '\bm': '\begin{bmatrix}'
  '\em': '\end{bmatrix}'
  '\R': '\mathbb{R}'
---

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nikolaimatni/ese-2030/HEAD?labpath=/07_Ch_8_Iteration/092-Markov_Chains.ipynb)

{doc}`Lecture notes <../lecture_notes/Lecture 15 - Linear Iterative Systems, Matrix Powers, Markov Chains, and Google’s PageRank.pdf>`

## Reading

Material related to this page, as well as additional exercises, can be found in Section 4.9 and Chapter 10 of LAA $5^{th}$ edition, and ALA 9.3.

## Learning Objectives

By the end of this page, you should know:
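- what probability vectors, transition matrices, and Markov chains are,
- how to model a Markov chain as the linear iterative system $\vv x(k+1) = P \vv x(k)$,
- when a transition matrix is regular, and what this implies about convergence,
- how to compute the steady state of a regular Markov chain from the eigenvector of $P$ with eigenvalue 1,
- how random walks on graphs and the Google matrix lead to the PageRank algorithm.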

## Weather Prediction: Introduction

We will spend this section on _Markov chains_, a linear iterative model used to describe a wide variety of situations in biology, business, chemistry, engineering, physics, and elsewhere.

In each case, the model is used to describe an experiment or measurement that is performed many times in the same way. The outcome of an experiment can be one of several known possible outcomes, and importantly, the outcome of one experiment depends only on the outcome of the experiment conducted immediately before it. Before introducing a formal model for Markov chains, let's look at an example.

:::{prf:example} Weather Prediction
:label: weather_eg
Suppose you would like to predict the weather in your city. Looking at local weather records over the past 10 years, you notice that:
1. If today is sunny, tomorrow is sunny 70% of the time and cloudy 30% of the time.
2. If today is cloudy, tomorrow is cloudy 80% of the time and sunny 20% of the time.

Now, suppose today is sunny. What is the _probability_$^{1}$ that the weather 8 days from now will also be sunny?

$^{1}$ You will learn how to properly define probabilities in ESE 3010. For our purposes, you can think of a probability as a confidence or likelihood. So saying that 8 days from now will be sunny with probability $60\%$ is the same as saying that the weather 8 days from now is determined by flipping a biased coin that comes up "sunny" $60\%$ of the time and "cloudy" $40\%$ of the time.

To formulate this problem mathematically, let's use $S(k)$ to denote the probability that day $k$ is sunny and $C(k)$ the probability that it is cloudy. If these are the only two possibilities, then the individual probabilities must sum to 1 (1 represents 100% likely, .5 50% likely, etc.): $S(k) + C(k) = 1$.

According to our historical data, the probability that day $k+1$ is sunny or cloudy can be expressed as:

\begin{equation}
\label{sun_cloud}
S(k+1) = .7 S(k) + .2 C(k),    \quad C(k+1) = .3 S(k) + .8 C(k)
\end{equation}

For example, the [equation](#sun_cloud) says that if day $k$ was sunny, i.e., $S(k)=1$ and $C(k)=0$, there is a 70% chance day $k+1$ is sunny too; similarly, if day $k$ was cloudy, i.e., $S(k)=0$ and $C(k)=1$, there is a 20% chance day $k+1$ is sunny.

We rewrite [](#sun_cloud) as the linear iterative system $\vv x(k+1) = P \vv x(k)$, where

$$
P = \bm .7 & .2 \\ .3  & .8 \em  \ \text{and}\ \ \vv x(k) = \bm S(k) \\ C(k) \em            
$$

We use $P$ instead of $T$ here as this is a typical convention for describing the _transition matrix_ of a Markov chain. The vector $\vv x(k)$ is called the $k^{th}$ _state vector_.

Now, given that today is sunny, i.e., that $S(0) = 1$ and $C(0) = 0$, what is the probability that 8 days from now is sunny? We can answer this easily by iterating the system $\vv x(k+1) = P \vv x(k)$ to compute $\vv x(8)$!

\begin{align*}
\vv x(0) &= \bm 1 \\ 0 \em, \ \vv x(1) = P \vv x(0) = \bm .7 \\ .3 \em, \ \vv x(2) = P \vv x(1) = P^2 \vv x(0) = \bm .55 \\ .45 \em \\
\vv x(3) &= P^3 \vv x(0) = \bm .475 \\ .525 \em, \ \vv x(4) \approx \bm .438 \\ .562 \em, \ \vv x(5) \approx \bm .419 \\ .581 \em, \ \vv x(6) \approx \bm .409 \\ .591 \em \\
\vv x(7) &\approx \bm .405 \\ .595 \em, \ \vv x(8) \approx \bm .402 \\ .598 \em
\end{align*}

So we conclude that 40.2% of the time, if today is sunny, then 8 days from now is also sunny.
:::
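The iteration in the example above can be reproduced in a few lines of plain Python. This is a minimal sketch: the helper `mat_vec` and the list `history` are illustrative names, and the matrix is stored as a list of rows.

```python
def mat_vec(P, x):
    """Return the matrix-vector product P x for a list-of-rows matrix P."""
    return [sum(p_ij * x_j for p_ij, x_j in zip(row, x)) for row in P]

# Transition matrix from the weather example (each column sums to 1).
P = [[0.7, 0.2],
     [0.3, 0.8]]

x = [1.0, 0.0]           # x(0): today is sunny with probability 1
history = [x]
for k in range(8):       # apply x(k+1) = P x(k) eight times
    x = mat_vec(P, x)
    history.append(x)

for k, (sunny, cloudy) in enumerate(history):
    print(f"k={k}: S={sunny:.3f}, C={cloudy:.3f}")
```

The last printed line corresponds to $\vv x(8)$, and every printed pair should sum to 1 up to floating-point round-off.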

:::{note} Observations from [](#weather_eg)
We make a few observations about the state vectors $\vv x(k)$ to motivate some of the new tools we'll introduce:

1) Every state vector $\vv x(k)$ is a _probability vector_, i.e., $x_1(k), x_2(k) \geq 0$ and $x_1(k) + x_2(k) = 1$.
2) The process converges fairly quickly to $\vv x^* = \bm .4 \\ .6 \em$, which is a _fixed point_ of $\vv x(k+1) = P\vv x(k)$, i.e., $\vv x^* = P\vv x^*$.
3) The convergence to $\vv x^*$ actually happens for any initial probability vector $\vv x(0)$. This means that in the long run, 40% of days are sunny and 60% are cloudy.

:::

## Convergence in Markov Chains

Let's try to understand why the convergence in [](#weather_eg) happens, and then we'll look at some interesting applications of Markov chains.

Our starting point is a general definition of a _probability vector_.
:::{prf:definition} Probability Vector
:label: prob_vec_defn
A vector $\vv x \in \mathbb{R}^n$ is called a _probability vector_ if $x_i \geq 0$ for $i=1,\ldots,n$ and $x_1 + \ldots + x_n = 1$. We interpret $x_i$ as the probability (or likelihood) that the system is in state $i$.
:::

In general, a _Markov chain_ is given by the first order linear iterative system
\begin{equation}
\label{MC_eqn}
\vv x(k+1) = P \vv x(k)  \quad (\text{MC})
\end{equation}

whose initial state $\vv x(0)$ is a probability vector. The entries of the _transition matrix_ $P$ must satisfy
\begin{equation}
\label{TM_eqn}
0 \leq p_{ij} \leq 1 \ \text{and} \ p_{1j} + \cdots + p_{nj} = 1 \quad (\text{TM})
\end{equation}

for all $i,j=1,\ldots,n$. The entry $p_{ij}$ is the _transition probability_ that the system will switch from state $j$ to state $i$. Because the entries of column $j$ cover all possible transitions out of state $j$, each column sums to 1. Under these conditions, we can guarantee that if $\vv x(k)$ is a probability vector, so is $\vv x(k+1) = P\vv x(k)$. To see this, let $\vv p_1, \ldots, \vv p_n$ denote the columns of $P$ and $\vv 1$ the all-ones vector, and note that $\vv 1^{\top} P = \bm \vv 1^{\top} \vv p_1 & \cdots & \vv 1^{\top} \vv p_n \em = \bm 1 & \cdots & 1 \em = \vv 1^{\top}$, so that $\vv 1^{\top} \vv x(k+1) = \vv 1^{\top} P\vv x(k) = \vv 1^{\top} \vv x(k) = 1$. That $\vv x(k+1)$ is entrywise non-negative follows from $P$ and $\vv x(k)$ being entrywise non-negative.
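The two conditions on a transition matrix, and the fact that multiplying by $P$ maps probability vectors to probability vectors, are easy to check numerically. The sketch below uses exact fractions from the standard library so that "sums to 1" can be tested without round-off; the function names and the test vector are illustrative assumptions.

```python
from fractions import Fraction as F

def is_transition_matrix(P):
    """Check (TM): entries in [0, 1] and every column sums to exactly 1."""
    n = len(P)
    entries_ok = all(0 <= P[i][j] <= 1 for i in range(n) for j in range(n))
    columns_ok = all(sum(P[i][j] for i in range(n)) == 1 for j in range(n))
    return entries_ok and columns_ok

def is_probability_vector(x):
    """Check that x is entrywise non-negative and sums to exactly 1."""
    return all(x_i >= 0 for x_i in x) and sum(x) == 1

# Weather transition matrix, stored with exact fractions.
P = [[F(7, 10), F(2, 10)],
     [F(3, 10), F(8, 10)]]

x = [F(1, 4), F(3, 4)]   # an arbitrary probability vector
Px = [sum(p * x_j for p, x_j in zip(row, x)) for row in P]

print(is_transition_matrix(P), is_probability_vector(x), is_probability_vector(Px))
```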

Next, let's investigate convergence properties. We first need to impose a very mild technical condition on the transition matrix $P$, namely we assume that it is _regular_.

:::{prf:definition} Regular Transition Matrix
:label: regular_defn
A transition matrix $P$ ([TM](#TM_eqn)) is _regular_ if for some power $k$, $P^k$ contains no zero entries. This means that it is possible to get from any state to any other state in exactly $k$ steps.
:::
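Regularity can be tested directly from the definition by computing successive powers of $P$ and looking for one with no zero entries. The sketch below does this in plain Python; the function names, the cap `max_power`, and the two small test matrices `delayed` and `swap` are assumptions made for illustration.

```python
def mat_mul(A, B):
    """Return the matrix product A B for square list-of-rows matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def first_positive_power(P, max_power=50):
    """Return the smallest k <= max_power with all entries of P^k positive, else None."""
    Pk = P
    for k in range(1, max_power + 1):
        if all(entry > 0 for row in Pk for entry in row):
            return k
        Pk = mat_mul(Pk, P)
    return None

weather = [[0.7, 0.2], [0.3, 0.8]]   # already entrywise positive
delayed = [[0.0, 0.5], [1.0, 0.5]]   # has a zero entry, but its square does not
swap = [[0, 1], [1, 0]]              # alternates between two states forever

print(first_positive_power(weather))  # 1
print(first_positive_power(delayed))  # 2
print(first_positive_power(swap))     # None
```

Returning `None` only says that no positive power was found up to `max_power`; for the `swap` matrix the powers alternate between the swap and the identity, so no power is ever entrywise positive.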

The long-term behavior of a Markov chain with regular transition matrix $P$ is governed by the _Perron-Frobenius_ theorem, which we state next. The proof is quite involved, so we won't cover it, but if you're curious, check out the end of ALA 9.3.

:::{prf:theorem}
:label: regular_thm
If $P$ is a [regular transition matrix](#regular_defn), then it admits a unique probability eigenvector $\vv x^*$ with eigenvalue $\lambda_1=1$, i.e., $P \vv x^* = \vv x^*$. Moreover, a Markov chain with coefficient matrix $P$ will converge to $\vv x^*$: $\vv x(k) \to \vv x^*$ as $k \to \infty$.
:::

This is a very exciting development! It tells us that we can understand the long-term behavior of a regular Markov chain by solving for the eigenvector $\vv x^*$ associated with the eigenvalue $\lambda_1=1$ of $P$.

Returning to our weather prediction example, we compute the steady state probability vector $\vv x^*$ by just solving $(P-I)\vv v=\vv 0$:
$$
(P-I)\vv v = \bm -.3 & .2 \\ .3 & -.2 \em \bm v_1 \\ v_2 \em = \vv 0 \ \Rightarrow \ v_1 = \frac{2}{3} v_2 \ \Rightarrow \ \vv v = \bm \frac{2}{3} \\ 1 \em
$$

and then normalizing $\vv v$ so that its entries add up to 1:
$$
\vv x^* = \frac{1}{1+\frac{2}{3}} \bm \frac{2}{3} \\ 1 \em = \bm \frac{2}{5} \\ \frac{3}{5} \em = \bm 0.4 \\ 0.6 \em
$$

This special eigenvector $\vv x^*$ tells us that _no matter the initial state_ $\vv x(0)$, the long term behavior is that we are in State 1 (sunny) 40% of days and State 2 (cloudy) 60% of days.
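The same computation can be carried out for any regular transition matrix by solving $(P-I)\vv x = \vv 0$ together with the normalization $x_1 + \cdots + x_n = 1$. The following is a minimal sketch using exact fractions and Gauss-Jordan elimination; the function name `steady_state` is an illustrative choice, and the code assumes the steady state is unique (as it is when $P$ is regular).

```python
from fractions import Fraction as F

def steady_state(P):
    """Solve (P - I) x = 0 together with sum(x) = 1 by Gauss-Jordan elimination.

    The rows of P - I add up to the zero row, so one equation is redundant;
    this sketch replaces the last one with x_1 + ... + x_n = 1.
    """
    n = len(P)
    # Augmented matrix [P - I | 0].
    M = [[F(P[i][j]) - (1 if i == j else 0) for j in range(n)] + [F(0)]
         for i in range(n)]
    M[-1] = [F(1)] * n + [F(1)]            # normalization row
    for col in range(n):
        # Find a row with a non-zero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [entry / M[col][col] for entry in M[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

P = [[F(7, 10), F(2, 10)],
     [F(3, 10), F(8, 10)]]
x_star = steady_state(P)
print([str(v) for v in x_star])
```

For the weather matrix this returns the exact fractions $2/5$ and $3/5$, matching the hand computation.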



:::{prf:example} Get out the vote!
:label: vote_eg
Suppose the voting results of a congressional election at a certain voting precinct are represented by a vector $\vv x \in \R^3$:

$$
\vv x = \bm \text{fraction voting Democratic (D)} \\ \text{fraction voting Republican (R)} \\ \text{fraction voting Libertarian (L)} \em
$$

We record the outcome of this election every two years by a vector of this type, and let's assume that the outcome of one election depends only on the results of the previous one. Then the sequence $\vv x(k)$ of vectors that describe the votes in each election forms a Markov chain. Suppose, using historical data, we estimate the following transition matrix $P$, whose columns ("from") and rows ("to") are both ordered D, R, L:

$$
P = \bm .7 & .1 & .3 \\ .2 & .8 & .3 \\ .1 & .1 & .4 \em
$$

The entries in the first column, labeled D, describe what percentage of the people who voted D in the last election will vote D, R, and L in this one: in this example, 70% of prior D voters will vote D again, 20% will vote R, and 10% will vote L.

If we assume that $P$ remains fixed across many elections, we can predict not only the next election's results, but long-term election results as well. For example, if the last election had results

$$
\vv x(0) = \bm .55 \\ .40 \\ .05 \em,
$$

then the next election will have a likely outcome of

$$
\vv x(1) = P \vv x(0) = \bm .44 \\ .445 \\ .115 \em,
$$

and the following election will have a likely outcome of

$$
\vv x(2) = P \vv x(1) = \bm .387 \\ .4785 \\ .1345 \em.
$$

In the long run, we expect the election results to converge to the steady state distribution $\pi^*$ satisfying $\pi^* = P\pi^*$, which we obtain by solving

$$
(P - I)\pi = 0
$$

and setting $\pi^* = \pi / \sum_i \pi_i$. In this case, this works out to:

$$
\pi^* = \bm 9/28 \\ 15/28 \\ 4/28 \em \approx \bm 0.321 \\ 0.536 \\ 0.143 \em
$$

which tells us that, assuming voter patterns do not change, 32.1% of votes will go to D, 53.6% to R, and 14.3% to L in this precinct.
:::
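The election forecasts can be checked with a short script. The sketch below computes $x(1)$ and $x(2)$ exactly with fractions, and then estimates the long-run distribution by simply iterating the chain many times in floating point; the helper `mat_vec` and the choice of 200 iterations are illustrative assumptions.

```python
from fractions import Fraction as F

def mat_vec(P, x):
    """Return the matrix-vector product P x for a list-of-rows matrix P."""
    return [sum(p * x_j for p, x_j in zip(row, x)) for row in P]

# Transition matrix for the voting example (columns: from D, R, L).
P = [[F(7, 10), F(1, 10), F(3, 10)],
     [F(2, 10), F(8, 10), F(3, 10)],
     [F(1, 10), F(1, 10), F(4, 10)]]

x0 = [F(55, 100), F(40, 100), F(5, 100)]
x1 = mat_vec(P, x0)      # next election
x2 = mat_vec(P, x1)      # the election after that
print([float(v) for v in x1])
print([float(v) for v in x2])

# Long-run behavior: iterate in floating point until the state settles.
P_float = [[float(p) for p in row] for row in P]
x = [float(v) for v in x0]
for _ in range(200):
    x = mat_vec(P_float, x)
print([round(v, 3) for v in x])
```

Starting the long-run loop from a different probability vector should give the same limit, which is what the convergence theorem for regular transition matrices predicts.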

## Random Walks, PageRank, and the Google Matrix

An internet user's behavior while surfing the web can be modeled as a Markov chain that captures a random walk on a graph. We start with a simple example of this, and then explain how it can be used to design a search engine.

Consider the following graph, which has seven vertices interconnected by edges:

_[Figure `figure4`: a graph with seven vertices connected by edges]_

Let's pretend this graph models a very simple internet: each vertex (or node) is a webpage, and each edge is a hyperlink connecting two pages to each other. (For now we assume that if page $i$ links to page $j$, then page $j$ also links to page $i$, but this isn't necessary.) We assume that if a user is on page $i$, they will click on one of the hyperlinks on that page uniformly at random. For example, if a user is on page 5, they will visit page 2 next 50% of the time and page 6 next 50% of the time. Similarly, if a user is on page 3, they will visit page 1, 2, 4, or 6 next 25% of the time each.

We can model this user behavior using a Markov chain, which is called a simple random walk on a graph. The transition matrix for this graph is given by:

$$
P = \bm
0 & 1/3 & 1/4 & 0 & 0 & 0 & 0 \\
1/2 & 0 & 1/4 & 0 & 1/2 & 0 & 0 \\
1/2 & 1/3 & 0 & 1 & 0 & 1/3 & 0 \\
0 & 0 & 1/4 & 0 & 0 & 0 & 0 \\
0 & 1/3 & 0 & 0 & 0 & 1/3 & 0 \\
0 & 0 & 1/4 & 0 & 1/2 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1/3 & 0
\em
$$

This allows us to answer questions such as the following: suppose 100 users start on page 6. After each user clicks on three hyperlinks, what percentage of users do we expect to find on each page? The solution is given by setting our initial user distribution to

$$
\vv x(0) = \bm 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \em
$$

and computing

$$
\vv x(3) = P^3 \vv x(0) \approx \bm .0833 \\ .0417 \\ .4028 \\ 0 \\ .2778 \\ 0 \\ .1944 \em
$$

This tells us, for example, that 40.28% of users starting on page 6 end up on page 3 after 3 hyperlink clicks.
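The transition matrix of a simple random walk can be built directly from the list of edges: column $j$ places probability $1/\deg(j)$ on each neighbor of page $j$. The sketch below does this with exact fractions. The edge list is an assumption, since the figure is not reproduced here; it was chosen so that three steps from page 6 reproduce the vector $x(3)$ reported above.

```python
from fractions import Fraction as F

# Assumed undirected edges (each pair is a hyperlink in both directions).
edges = [(1, 2), (1, 3), (2, 3), (2, 5), (3, 4), (3, 6), (5, 6), (6, 7)]
n = 7

neighbors = {page: set() for page in range(1, n + 1)}
for i, j in edges:
    neighbors[i].add(j)
    neighbors[j].add(i)

# Entry (i, j) is the probability of moving from page j to page i.
P = [[F(1, len(neighbors[j])) if i in neighbors[j] else F(0)
      for j in range(1, n + 1)]
     for i in range(1, n + 1)]

x = [F(0)] * n
x[5] = F(1)                  # all users start on page 6 (index 5)
for _ in range(3):           # three clicks: x(3) = P^3 x(0)
    x = [sum(p * x_j for p, x_j in zip(row, x)) for row in P]

print([round(float(v), 4) for v in x])
```

Because the arithmetic is exact, the entries of `x` are fractions such as $29/72 \approx .4028$ for page 3, and they sum to exactly 1.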

### PageRank

The founders of Google, Sergey Brin and Lawrence Page, reasoned that important pages have links coming from other "important" pages, and thus that a typical internet user would spend more time on more important pages and less time on less important pages. This can be captured by the steady state distribution $\pi^*$ of the Markov chain we described above: using this model, a typical user spends a fraction $\pi^*_i$ of their time on page $i$. This is precisely the observation used to define the PageRank algorithm, which Google uses to rank the importance of the webpages it catalogs.

**Key Idea:** The importance of a webpage $i$ can be measured by the corresponding entry $\pi^*_i$ of the steady state vector $\pi^*$ of the Markov chain describing the behavior of a typical internet user.

Now, if a typical transition matrix $P$ describing the internet were regular, we would be done: we would simply solve $\pi^* = P\pi^*$ and use $\pi^*$ to rank webpages. Unfortunately, typical models of the web are directed graphs, which can lead to non-regular transition matrices $P$. To address this, Google makes two adjustments, which we illustrate with the following slight modification of our previous example:

_[Figure `figure1`: A seven-page Web]_

$$
P = \bm
0 & 1/2 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1/3 & 0 & 1/2 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 1/3 & 0 \\
0 & 0 & 1/3 & 1 & 0 & 0 & 0 \\
0 & 1/2 & 0 & 0 & 0 & 1/3 & 0 \\
0 & 0 & 1/3 & 0 & 1/2 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1/3 & 1
\em
$$

The first issue that arises here is that pages 4 and 7 are dangling nodes: if a user ever ends up on page 4 or 7, they never leave as there are no outgoing links. This means that columns 4 and 7 never change as we compute $P^k$, and hence $P$ cannot be regular. To handle dangling nodes, the following adjustment is made:

**Adjustment 1:** If an internet user reaches a dangling node, they will pick any page on the web with equal probability and move to that page.

This means that if state $j$ is an absorbing state, we replace column $j$ of $P$ with the vector $\bm 1/n & \cdots & 1/n \em^{\top}$. For example, our modified transition matrix is now:

$$
P_* = \bm
0 & 1/2 & 0 & 1/7 & 0 & 0 & 1/7 \\
0 & 0 & 1/3 & 1/7 & 1/2 & 0 & 1/7 \\
1 & 0 & 0 & 1/7 & 0 & 1/3 & 1/7 \\
0 & 0 & 1/3 & 1/7 & 0 & 0 & 1/7 \\
0 & 1/2 & 0 & 1/7 & 0 & 1/3 & 1/7 \\
0 & 0 & 1/3 & 1/7 & 1/2 & 0 & 1/7 \\
0 & 0 & 0 & 1/7 & 0 & 1/3 & 1/7
\em
$$

While this helps eliminate dangling nodes, we may still not have a regular transition matrix, as there might still be cycles of pages. For example, if page $i$ only links to page $j$, and page $j$ only links to page $i$, a user entering either page is condemned to moving back and forth between the two pages. This means the corresponding columns of $P_*^k$ will always have zeros in them, and hence $P_*$ would not be regular.


**Adjustment 2:** Pick a number $p$ between 0 and 1. If a user is on page $i$, then a fraction $p$ of the time they will pick one of the hyperlinks on that page with equal probability and follow it. The other $1-p$ fraction of the time, they will pick any page on the web with equal probability and move to that page.

In terms of the modified transition matrix $P_*$, the new transition matrix will be

$$
G = pP_* + (1-p)E,
$$

where $E \in \R^{n \times n}$ and $E_{ij} = 1/n$ for $i,j=1,\ldots,n$. The matrix $G$ is called the _Google matrix_. $G$ is easily seen to be regular, as all entries of $G$ are positive: each entry is at least $(1-p)/n > 0$ when $p < 1$.

Although any such value of $p$ is valid, Google is thought to use $p \approx 0.85$. For our example, the Google matrix is

$$
\begin{aligned}
G &= .85 \bm
0 & 1/2 & 0 & 1/7 & 0 & 0 & 1/7 \\
0 & 0 & 1/3 & 1/7 & 1/2 & 0 & 1/7 \\
1 & 0 & 0 & 1/7 & 0 & 1/3 & 1/7 \\
0 & 0 & 1/3 & 1/7 & 0 & 0 & 1/7 \\
0 & 1/2 & 0 & 1/7 & 0 & 1/3 & 1/7 \\
0 & 0 & 1/3 & 1/7 & 1/2 & 0 & 1/7 \\
0 & 0 & 0 & 1/7 & 0 & 1/3 & 1/7
\em \\
&\quad + .15 \bm
1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7 \\
1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7 \\
1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7 \\
1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7 \\
1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7 \\
1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7 \\
1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7 & 1/7
\em \\
&= \bm
.021429 & .446429 & .021429 & .142857 & .021429 & .021429 & .142857 \\
.021429 & .021429 & .304762 & .142857 & .446429 & .021429 & .142857 \\
.871429 & .021429 & .021429 & .142857 & .021429 & .304762 & .142857 \\
.021429 & .021429 & .304762 & .142857 & .021429 & .021429 & .142857 \\
.021429 & .446429 & .021429 & .142857 & .021429 & .304762 & .142857 \\
.021429 & .021429 & .304762 & .142857 & .446429 & .021429 & .142857 \\
.021429 & .021429 & .021429 & .142857 & .021429 & .304762 & .142857
\em
\end{aligned}
$$
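Both adjustments can be applied in one pass over the outgoing links of each page. The sketch below builds the Google matrix for the seven-page web with exact fractions; the dictionary `links` (outgoing links per page, read off from the columns of $P$) and the function name `google_matrix` are illustrative assumptions.

```python
from fractions import Fraction as F

# Outgoing links of each page; pages 4 and 7 are dangling nodes.
links = {1: [3], 2: [1, 5], 3: [2, 4, 6], 4: [], 5: [2, 6], 6: [3, 5, 7], 7: []}

def google_matrix(links, p):
    """Return G = p P_* + (1 - p) E as a list of rows."""
    pages = sorted(links)
    n = len(pages)
    G = [[F(0)] * n for _ in range(n)]
    for col, j in enumerate(pages):
        # Adjustment 1: a dangling page jumps to any page uniformly at random.
        targets = links[j] if links[j] else pages
        for row, i in enumerate(pages):
            follow = F(1, len(targets)) if i in targets else F(0)
            # Adjustment 2: follow a link with probability p, else jump anywhere.
            G[row][col] = p * follow + (1 - p) * F(1, n)
    return G

G = google_matrix(links, F(85, 100))
for row in G:
    print([round(float(g), 6) for g in row])
```

Every entry of the result is strictly positive and every column sums to 1, so `G` is a regular transition matrix.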

We can now solve for the steady state vector $\pi^*$ satisfying $\pi^* = G\pi^*$, which is found to be:

$$
\pi^* \approx \bm
.116293 \\
.168567 \\
.191263 \\
.098844 \\
.164054 \\
.168567 \\
.092413
\em
$$

Thus, we can rank the pages in terms of descending importance as: 3, 2 and 6 (tied), 5, 1, 4, 7.
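Since $G$ is regular, one way to approximate $\pi^*$ numerically is to start from any probability vector and apply $G$ repeatedly until the state stops changing. The sketch below does this in plain Python for the seven-page web; the dictionary `links`, the stopping tolerance, and the iteration cap are illustrative assumptions, not a description of Google's actual implementation.

```python
# Outgoing links of each page; pages 4 and 7 are dangling nodes.
links = {1: [3], 2: [1, 5], 3: [2, 4, 6], 4: [], 5: [2, 6], 6: [3, 5, 7], 7: []}
n, p = len(links), 0.85

# Google matrix G = p P_* + (1 - p) E, built column by column.
G = [[0.0] * n for _ in range(n)]
for j in range(1, n + 1):
    targets = links[j] or list(range(1, n + 1))   # dangling page: jump anywhere
    for i in range(1, n + 1):
        follow = 1 / len(targets) if i in targets else 0.0
        G[i - 1][j - 1] = p * follow + (1 - p) / n

# Repeated multiplication, starting from the uniform distribution.
pi = [1 / n] * n
for _ in range(1000):
    new_pi = [sum(g * x for g, x in zip(row, pi)) for row in G]
    change = sum(abs(a - b) for a, b in zip(new_pi, pi))
    pi = new_pi
    if change < 1e-12:
        break

# Pages sorted by decreasing steady state probability (ties keep page order).
ranking = sorted(range(1, n + 1), key=lambda page: -pi[page - 1])
print([round(v, 6) for v in pi])
print(ranking)
```

Rows 2 and 6 of $G$ are identical, which is why pages 2 and 6 receive the same score.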

:::{note}
The Google matrix for the world wide web has over 8 billion rows and columns, and solving $\pi^* = G \pi^*$ is a very non-trivial task. An iterative approach known as the power method is used in practice, and it typically takes several days for Google to compute a new $\pi^*$, which it does every month.

:::
