# PageRank Algorithm: Explanation

## Background

The PageRank algorithm was developed by the co-founders of Google, Larry Page and Sergey Brin, as part of a research project on search engine algorithms at Stanford University. Sergey Brin conceptualized the algorithm using the `heuristic` of link popularity: the idea that the popularity of a webpage is directly correlated with the number of webpages that link to it. Hector Garcia-Molina, Sergey's university advisor, as well as Scott Hassan and Alan Steremberg, were among the people critical to its development.

The goal of this project is to construct an algorithm that calculates the PageRank value of each webpage in a given universe of webpages, `corpus`. To achieve this goal, `Markov Chains` are used. Refer to the section on `Markov Chains` for more details.

Although this project aims to model the PageRank algorithm created by Larry Page and Sergey Brin, Markov Chains themselves have applications in many other domains.

## Initialization & Crawling

In [1]:
import os
import random
import re
import sys
import numpy as np

DAMPING = 0.85
SAMPLES = 10000


def main():
    if len(sys.argv) != 2:
        sys.exit("Usage: python pagerank.py corpus")
    # crawl the directory named on the command line
    corpus = crawl(sys.argv[1])
    print(corpus)
    ranks = sample_pagerank(corpus, DAMPING, SAMPLES)
    print(f"PageRank Results from Sampling (n = {SAMPLES})")
    for page in sorted(ranks):
        print(f"  {page}: {ranks[page]:.4f}")
    ranks = iterate_pagerank(corpus, DAMPING)
    print("PageRank Results from Iteration")
    for page in sorted(ranks):
        print(f"  {page}: {ranks[page]:.4f}")


def crawl(directory):
    """
    Parse a directory of HTML pages and check for links to other pages.
    Return a dictionary where each key is a page, and each value is
    the set of all other pages in the corpus that are linked to by the page.
    """
    pages = dict()

    # Extract all links from HTML files
    for filename in os.listdir(directory):
        if not filename.endswith(".html"):
            continue
        with open(os.path.join(directory, filename)) as f:
            contents = f.read()
            links = re.findall(r"<a\s+(?:[^>]*?)href=\"([^\"]*)\"", contents)
            pages[filename] = set(links) - {filename}

    # Only include links to other pages in the corpus
    for filename in pages:
        pages[filename] = set(
            link for link in pages[filename]
            if link in pages
        )

    return pages

# Markov Chains

This project uses Markov Chains to estimate PageRank values. A Markov Chain is a mathematical model of a system that transitions from one state to another according to fixed probabilities. It is a `stochastic process` that fulfills the `Markov Property`: the probability of reaching a future state depends only on the current state (or, in a higher-order chain, on a fixed, `FINITE` number of the most recent states), not on the full history.

This problem will be modelled as a first-order, time-homogeneous Markov Chain. The probability of getting to a future state (the next website) depends only on the model's current state (the website the hypothetical surfer is currently on), and these probabilities do not change over time. This makes the process `memoryless`: given the current website, knowledge of the websites a surfer previously visited has no influence on predicting the next website they visit.
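
As a rough illustration of the memoryless property, the sketch below walks a hypothetical three-page chain. The page names, the probabilities in `TRANSITIONS`, and the helper names `next_state` and `walk` are assumptions made for this example only; they are not part of the project code.

```python
import random

# Hypothetical three-page chain: each row is the distribution over the
# next page given only the current page (the Markov property).
TRANSITIONS = {
    "a.html": {"a.html": 0.0, "b.html": 0.5, "c.html": 0.5},
    "b.html": {"a.html": 1.0, "b.html": 0.0, "c.html": 0.0},
    "c.html": {"a.html": 0.5, "b.html": 0.5, "c.html": 0.0},
}


def next_state(current, rng):
    """Sample the next page using only the current page."""
    options = TRANSITIONS[current]
    return rng.choices(list(options), weights=list(options.values()))[0]


def walk(start, steps, seed=0):
    """Return the list of pages visited, starting from `start`."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        # Only path[-1] is consulted; earlier pages play no role.
        path.append(next_state(path[-1], rng))
    return path


print(walk("a.html", 5))
```

Because `next_state` receives only the current page, two surfers who arrive at `b.html` by different routes face exactly the same distribution over their next page.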

A Markov Chain is probabilistic, which sets it apart from the search problems demonstrated elsewhere in this portfolio, such as Tic-Tac-Toe and Nim, which are deterministic.

## Modelling PageRank as a Markov Chain

To model this problem as a Markov Chain, the state space of this problem will be defined by the variable `corpus`, which contains a universe of webpages.
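
To make the shape of that state space concrete, the sketch below builds a small corpus in memory using the same link pattern as `crawl`. The page contents in `HTML` and the helper name `build_corpus` are hypothetical; the real corpus is read from a directory of HTML files.

```python
import re

# Hypothetical page contents keyed by filename (a real corpus is read from disk).
HTML = {
    "1.html": '<a href="2.html">Two</a> <a href="https://example.com">Out</a>',
    "2.html": '<a href="1.html">One</a> <a href="3.html">Three</a> <a href="2.html">Self</a>',
    "3.html": "<p>No links here.</p>",
}

# Same pattern that crawl() uses to pull href targets out of anchor tags.
LINK_PATTERN = re.compile(r"<a\s+(?:[^>]*?)href=\"([^\"]*)\"")


def build_corpus(html_by_page):
    """Map each page to the set of other corpus pages it links to."""
    corpus = {}
    for page, contents in html_by_page.items():
        # Drop self-links, then drop links that point outside the corpus.
        links = set(LINK_PATTERN.findall(contents)) - {page}
        corpus[page] = {link for link in links if link in html_by_page}
    return corpus


corpus = build_corpus(HTML)
for page in sorted(corpus):
    print(page, sorted(corpus[page]))
```

Each key of the dictionary is one state of the Markov Chain, and the set stored under it lists the states that can be reached by following a link.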

For simplicity, consider a hypothetical universe, `corpus`, with 26 webpages, `a.html` to `z.html`. The function `transition_model` generates a `conditional probability distribution` over the page the surfer visits next, given the page the surfer is currently on.

This function considers two scenarios:
1. Suppose the hypothetical random surfer is at webpage `x.html`, and `x.html` contains zero outgoing links, represented by `len(corpus[page]) == 0`. We then assume that the surfer clicks on any of the 26 webpages with equal probability. The probability of accessing each webpage is therefore 1/26, or in general `1 / len(corpus)`.

2. Suppose the hypothetical random surfer is at webpage `y.html`, and `y.html` contains links to `x.html` and `z.html`.
- With probability `1 - damping_factor`, the random surfer jumps to any of the 26 webpages in the hypothetical universe, chosen uniformly. Each page's transition probability is therefore initialized to `(1 - damping_factor) / len(corpus)`.
- With probability `damping_factor`, the surfer follows one of the links on the current page. The transition probabilities of `x.html` and `z.html` are each increased by `damping_factor / len(corpus[page])`, where `len(corpus[page])` is the number of links on the page; in this case, `y.html` has 2 links.
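
The arithmetic in the two scenarios above can be checked with a short, self-contained sketch. The 26-page corpus is the hypothetical universe from the text, and `transition_probabilities` is an illustrative stand-in written for this example, assuming a damping factor of 0.85.

```python
import string

DAMPING = 0.85

# Hypothetical universe a.html .. z.html: every page starts with no links,
# then y.html is given links to x.html and z.html.
corpus = {f"{letter}.html": set() for letter in string.ascii_lowercase}
corpus["y.html"] = {"x.html", "z.html"}


def transition_probabilities(corpus, page, damping_factor):
    """Distribution over the next page, following the two scenarios above."""
    n = len(corpus)
    links = corpus[page]
    if not links:
        # Scenario 1: no outgoing links, so every page is equally likely.
        return {p: 1 / n for p in corpus}
    # Scenario 2: every page gets an equal share of 1 - damping_factor ...
    probs = {p: (1 - damping_factor) / n for p in corpus}
    # ... and each linked page gets an equal share of damping_factor.
    for linked in links:
        probs[linked] += damping_factor / len(links)
    return probs


from_x = transition_probabilities(corpus, "x.html", DAMPING)
from_y = transition_probabilities(corpus, "y.html", DAMPING)

print(f"from x.html to any page: {from_x['a.html']:.4f}")  # 1/26
print(f"from y.html to a.html:   {from_y['a.html']:.4f}")  # 0.15/26
print(f"from y.html to x.html:   {from_y['x.html']:.4f}")  # 0.15/26 + 0.85/2
print(f"row sum for y.html:      {sum(from_y.values()):.4f}")
```

In both scenarios the probabilities sum to 1, which is what makes the result a valid distribution to sample the next page from.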

## Damping Factor

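The damping factor is the probability that the surfer follows a link on the current page rather than jumping to a random page in the corpus. It is predetermined to be 0.85. A full discussion of the damping factor is beyond the scope of this project.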

In [None]:
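def transition_model(corpus, page, damping_factor):
    """
    Returns transition model: a dictionary mapping each page in the corpus
    to the probability that the surfer visits it next, given the current page.
    """
    tm = {}

    if len(corpus[page]) == 0:
        # no outgoing links: every page is equally likely
        for c in corpus:
            tm[c] = 1 / len(corpus)

    else:
        # random jump: every page receives an equal share of 1 - damping_factor
        for c in corpus:
            tm[c] = (1 - damping_factor) / len(corpus)

        # followed link: the linked pages share damping_factor equally
        for webpage in corpus[page]:
            tm[webpage] += damping_factor / len(corpus[page])

    return tm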

## Calculation of PageRank

Two different algorithms are used to compute the final PageRank value of each webpage: `sample_pagerank` and `iterate_pagerank`.

### Sampling 

Much like rolling a die 10000 times and recording each outcome, the sampling algorithm simulates the process of randomly surfing the internet for n steps and counts the number of times each webpage is visited. Dividing each final count by n gives the estimated PageRank value of that page.
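
The die analogy can be sketched in a few lines: estimate the probability of each face by counting how often it comes up. The function name `estimate_die_probabilities` and the fixed seed are assumptions made for this illustration.

```python
import random
from collections import Counter


def estimate_die_probabilities(n, seed=0):
    """Roll a fair six-sided die n times and return the observed frequencies."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) for _ in range(n))
    # Dividing each count by n turns raw counts into estimated probabilities.
    return {face: counts[face] / n for face in range(1, 7)}


estimates = estimate_die_probabilities(10000)
for face, probability in estimates.items():
    print(f"{face}: {probability:.4f}")
```

With 10000 rolls each estimate should land close to 1/6. The sampling algorithm relies on the same idea, except that each "roll" is drawn from the transition model of the current page.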

In [None]:
def sample_pagerank(corpus, damping_factor, n):
    """
    Returns pagerank values via sampling
    """
    # initialize the visit count of every page
    rank = {page: 0.0 for page in corpus}

    # randomly generate 1st page
    pg = random.choice(list(corpus))

    for _ in range(n):
        # record the visit to the current page
        rank[pg] += 1

        # generate transition model for page and sample the next page from it
        transition = transition_model(corpus, pg, damping_factor)
        pg = random.choices(
            list(transition.keys()), weights=list(transition.values())
        )[0]

    # convert visit counts to proportions
    for item in rank:
        rank[item] = rank[item] / n

    return rank

### Iterative Algorithm & its Convergence Criteria

This algorithm uses the Power Iteration strategy to compute PageRank values. At t = 0, every page is assigned the same initial rank, 1/n, where n refers to the number of pages in the universe.

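At each time step, the algorithm updates the PageRank value of each page in the corpus using the following formula, where $d$ is the damping factor, $N$ is the number of pages in the corpus, $i$ ranges over the pages that link to page $p$, and $\text{NumLinks}(i)$ is the number of links on page $i$. A page with no links is treated as linking to every page in the corpus, matching the first scenario of `transition_model`.

$$PR(p) = \frac{1 - d}{N} + d \sum_{i} \frac{PR(i)}{\text{NumLinks}(i)}$$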

This step continues until a convergence criterion is reached; here, iteration stops once no PageRank value changes by more than 0.001 between two consecutive steps. According to Haveliwala and Kamvar of Stanford University, the greater the eigengap of the PageRank transition matrix, the more resilient the resulting ranks are to perturbations in the Markov chain. A large eigengap also allows the algorithm to converge after relatively few iterations, which may allow it to be generalized to a larger universe and deployed efficiently.

For a full discussion of this phenomenon, please refer to http://www-cs-students.stanford.edu/~taherh/papers/secondeigenvalue.pdf.
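
As a minimal sketch of power iteration in matrix form, the example below builds a column-stochastic link matrix with NumPy and repeatedly applies the damped update. The function name `power_iteration`, the `tolerance` and `max_iterations` parameters, the three-page corpus, and the uniform treatment of pages without links are all assumptions made for illustration; this is not the project's `iterate_pagerank`.

```python
import numpy as np


def power_iteration(corpus, damping_factor=0.85, tolerance=1e-6, max_iterations=1000):
    """Estimate PageRank values by repeatedly applying the damped update."""
    pages = sorted(corpus)
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)

    # M[i, j] is the probability of moving from page j to page i via a link.
    M = np.zeros((n, n))
    for page, links in corpus.items():
        j = index[page]
        if links:
            for linked in links:
                M[index[linked], j] = 1 / len(links)
        else:
            # Assumption: a page without links jumps to any page uniformly.
            M[:, j] = 1 / n

    # Start from the uniform distribution 1/n.
    ranks = np.full(n, 1 / n)
    for _ in range(max_iterations):
        updated = (1 - damping_factor) / n + damping_factor * M @ ranks
        # Stop once the largest change in any rank falls below the tolerance.
        change = np.abs(updated - ranks).max()
        ranks = updated
        if change < tolerance:
            break

    return dict(zip(pages, ranks.tolist()))


# Hypothetical corpus: 2.html is linked to by both other pages.
example = {
    "1.html": {"2.html"},
    "2.html": {"1.html", "3.html"},
    "3.html": {"2.html"},
}
for page, rank in power_iteration(example).items():
    print(f"{page}: {rank:.4f}")
```

Each pass through the loop is one time step of the Markov Chain applied to the whole probability distribution at once, so the ranks always sum to 1. With a damping factor of 0.85, the change between successive steps shrinks by a factor of at most 0.85 per step.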

In [None]:
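def iterate_pagerank(corpus, damping_factor, tolerance=0.001):
    """
    Returns pagerank values via iteration and updates values until convergence
    """
    n = len(corpus)

    # initialize every page with the uniform rank 1/n
    ranked = {page: 1 / n for page in corpus}

    converged = False
    while not converged:
        # keep the ranks from the previous time step
        previous = ranked.copy()

        for page in corpus:
            # probability of arriving via a random jump
            total = (1 - damping_factor) / n

            # probability of arriving by following a link from page k
            for k, links in corpus.items():
                if not links:
                    # a page with no links is treated as linking to every page
                    total += damping_factor * previous[k] / n
                elif page in links:
                    total += damping_factor * previous[k] / len(links)

            ranked[page] = total

        # convergence test: stop once no rank changed by more than tolerance
        converged = all(
            abs(ranked[page] - previous[page]) <= tolerance for page in corpus
        )

    return ranked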

# Conclusion

By using sampling and iterative Markov chain algorithms to model AI decision making under uncertainty (i.e., in a random rather than a deterministic setting), one can develop similar algorithms for other problems with time-series properties. Such problems arise in quantitative finance, investments, economics and game theory.