# Question 335

## Description

PageRank is an algorithm used by Google to rank the importance of different websites. While there have been changes over the years, the central idea is to assign each site a score based on the importance of other pages that link to that page.

More mathematically, suppose there are `N` sites, and each site `i` has a certain count `Ci` of outgoing links. Then the score for a particular site `Sj` is defined as:

`score(Sj) = (1 - d) / N + d * (score(Sx) / Cx + score(Sy) / Cy + ... + score(Sz) / Cz)`

Here, `Sx, Sy, ..., Sz` denote all the other sites that have outgoing links to `Sj`, and `d` is a damping factor, usually set to around 0.85, that models the probability that a user keeps following links rather than stopping (so `1 - d` is the probability that the user stops).
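
The same definition can be written more compactly in summation notation. In this sketch, $\mathrm{In}(j)$ is a label introduced here (it is not part of the original statement) for the set of sites that link to $S_j$:

```latex
\mathrm{score}(S_j) = \frac{1 - d}{N} + d \sum_{S_i \in \mathrm{In}(j)} \frac{\mathrm{score}(S_i)}{C_i}
```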

Given a directed graph of links between various websites, write a program that calculates each site's page rank.


## Solution

First, we must determine the form of our input. Let us suppose we are given a graph of outbound links, such as the following:

```python
outlinks = {
      0: [1, 2, 3],
      1: [0],
      2: [0],
      3: [0]
}
```

This corresponds to the following graph:

```
      2
      ∧
      |
      v
1 <-> 0 <-> 3
```

For our solution, we will need to efficiently find all the sites that link to a given site, so we can first create an additional dictionary to store this information.


```python
def make_inlinks(outlinks):
    inlinks = {site: [] for site in outlinks}

    for site, neighbors in outlinks.items():
        for neighbor in neighbors:
            inlinks[neighbor].append(site)

    return inlinks
```
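
Because the example graph is symmetric, its inbound and outbound dictionaries are identical, which hides what the inversion does. The sketch below repeats the function and applies it to a small asymmetric graph; that graph is a hypothetical one chosen for illustration, not taken from the problem.

```python
def make_inlinks(outlinks):
    """Invert an outbound-link mapping into an inbound-link mapping."""
    inlinks = {site: [] for site in outlinks}
    for site, neighbors in outlinks.items():
        for neighbor in neighbors:
            inlinks[neighbor].append(site)
    return inlinks


# Hypothetical graph (not from the text): 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
example = {0: [1, 2], 1: [2], 2: [0]}
inlinks = make_inlinks(example)
print(inlinks)  # {0: [2], 1: [0], 2: [0, 1]}
```

Note that every link target must also appear as a key in `outlinks`; otherwise the `inlinks[neighbor]` lookup raises a `KeyError`.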

Now we can focus on the page rank function. From the description, we know that each website's score depends on the scores of the other websites. This may seem confusing: won't we end up going in circles? And how should we set the scores in the first place?

One simplifying assumption we can make is that we would like to normalize our page rank scores so that their sum equals one. Therefore, we should initially set each value to be 1 / N, where N is the number of sites.

Calculating each site's page rank is then an iterative process. We follow the formula above for each site, and replace our score dictionary once each value has been fully computed. After a set number of rounds, the page ranks should converge toward a stable solution.
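
To make one round concrete, the following sketch applies the formula by hand to the example graph, starting from the uniform scores. The variable names `new_0` and `new_1` are introduced here purely for illustration.

```python
d = 0.85
outlinks = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
n = len(outlinks)

# Start from the uniform distribution: every site gets 1 / N.
scores = {site: 1.0 / n for site in outlinks}

# Site 0 is linked from sites 1, 2 and 3, each of which has one outgoing link.
new_0 = (1 - d) / n + d * (scores[1] / 1 + scores[2] / 1 + scores[3] / 1)

# Site 1 is linked only from site 0, which has three outgoing links.
new_1 = (1 - d) / n + d * (scores[0] / 3)

print(round(new_0, 4), round(new_1, 4))  # 0.675 0.1083
```

Sites 2 and 3 receive the same value as site 1 by symmetry, so the four new scores still sum to one.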


```python
def update_scores(inlinks, outlinks, scores, d, num_rounds):
    for _ in range(num_rounds):
        new_scores = {}

        for site, neighbors in inlinks.items():
            score = sum(
                [scores[neighbor] / len(outlinks[neighbor]) for neighbor in neighbors]
            )
            new_scores[site] = (1.0 - d) / len(inlinks) + d * score

        scores.update(new_scores)
```
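
A fixed number of rounds is simple, but it leaves open how many rounds are enough. One possible alternative, sketched below under the assumption that a small per-round change is an acceptable stopping signal, is to iterate until no score moves by more than a tolerance. The helper name `rank_until_stable` and the parameters `tol` and `max_rounds` are hypothetical and not part of the original solution.

```python
def rank_until_stable(outlinks, d=0.85, tol=1e-6, max_rounds=1000):
    """Repeat the update until no score changes by more than tol."""
    # Build the inbound-link mapping, as in make_inlinks.
    inlinks = {site: [] for site in outlinks}
    for site, neighbors in outlinks.items():
        for neighbor in neighbors:
            inlinks[neighbor].append(site)

    n = len(outlinks)
    scores = {site: 1.0 / n for site in outlinks}
    for rounds in range(1, max_rounds + 1):
        new_scores = {
            site: (1.0 - d) / n
            + d * sum(scores[src] / len(outlinks[src]) for src in sources)
            for site, sources in inlinks.items()
        }
        # Largest absolute movement of any single score in this round.
        change = max(abs(new_scores[s] - scores[s]) for s in scores)
        scores = new_scores
        if change < tol:
            return scores, rounds
    return scores, max_rounds


scores, rounds = rank_until_stable({0: [1, 2, 3], 1: [0], 2: [0], 3: [0]})
print({site: round(score, 2) for site, score in scores.items()}, rounds)
```

The function returns the number of rounds it used alongside the scores, which gives a rough sense of how quickly a given graph settles.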

For the example above, after a few dozen iterations the scores settle at approximately {0: 0.48, 1: 0.17, 2: 0.17, 3: 0.17}. This makes sense, as the website with three times as many incoming links receives around three times the score.
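
These values can also be checked by hand. Assuming the scores sum to one, and writing $s_0$ for the score of site 0 and $s$ for the common score of sites 1 to 3 (both symbols are introduced here for the check), the stable solution with $d = 0.85$ satisfies:

```latex
\begin{aligned}
s_0 &= \frac{1 - d}{4} + d\,(1 - s_0)
\;\Longrightarrow\;
s_0 = \frac{(1 - d)/4 + d}{1 + d} = \frac{0.8875}{1.85} \approx 0.480, \\
s &= \frac{1 - s_0}{3} \approx 0.173.
\end{aligned}
```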

Finally, we can wrap these functions in a PageRank class, giving us the freedom to easily update the damping factor and the number of rounds. We also include a helper function that returns a more user-friendly display of each site's score.


```python
def make_inlinks(outlinks: dict[int, list[int]]) -> dict[int, list[int]]:
    inlinks = {site: [] for site in outlinks}
    for site, neighbors in outlinks.items():
        for neighbor in neighbors:
            inlinks[neighbor].append(site)
    return inlinks


def update_scores(
    inlinks: dict[int, list[int]],
    outlinks: dict[int, list[int]],
    scores: dict[int, float],
    d: float,
    num_rounds: int,
) -> None:
    for _ in range(num_rounds):
        new_scores = {}
        for site, neighbors in inlinks.items():
            score = sum(
                [scores[neighbor] / len(outlinks[neighbor]) for neighbor in neighbors]
            )
            new_scores[site] = (1.0 - d) / len(inlinks) + d * score
        scores.update(new_scores)


class PageRank:
    def __init__(
        self, links: dict[int, list[int]], d: float = 0.85, num_rounds: int = 10
    ) -> None:
        self.d = d
        self.num_rounds = num_rounds
        self.num_sites = len(links)
        self.outlinks = links
        self.inlinks = self.get_inlinks()
        self.scores = {site: 1.0 / self.num_sites for site in links}

    def get_inlinks(self) -> dict[int, list[int]]:
        inlinks = {site: [] for site in self.outlinks}
        for site, neighbors in self.outlinks.items():
            for neighbor in neighbors:
                inlinks[neighbor].append(site)
        return inlinks

    def update_scores(self) -> None:
        update_scores(self.inlinks, self.outlinks, self.scores, self.d, self.num_rounds)

    def get_ranks(self) -> dict[int, float]:
        return {site: round(score, 2) for site, score in self.scores.items()}


# Example usage:
if __name__ == "__main__":
    outlinks = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
    pr = PageRank(outlinks)
    pr.update_scores()
    print(pr.get_ranks())
```

With the default of ten rounds, the scores have not yet fully converged, so this prints:

```
{0: 0.43, 1: 0.19, 2: 0.19, 3: 0.19}
```
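
To see how the damping factor shapes the result, the sketch below runs the same iteration for two values of `d`. It is a compact, self-contained restatement rather than the class above: the function name `pagerank` is hypothetical, and it pushes each score along outgoing links instead of pulling from an inbound dictionary, which should give the same totals.

```python
def pagerank(outlinks, d=0.85, num_rounds=100):
    """Compact restatement of the iteration (hypothetical helper name)."""
    n = len(outlinks)
    scores = {site: 1.0 / n for site in outlinks}
    for _ in range(num_rounds):
        # Every site starts the round with the uniform (1 - d) / N share.
        new_scores = {site: (1.0 - d) / n for site in outlinks}
        # Push each site's score along its outgoing links.
        for site, neighbors in outlinks.items():
            for neighbor in neighbors:
                new_scores[neighbor] += d * scores[site] / len(neighbors)
        scores = new_scores
    return scores


graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
for d in (0.5, 0.85):
    ranks = pagerank(graph, d=d)
    print(d, {site: round(score, 2) for site, score in ranks.items()})
```

With `d = 0.5`, site 0 should settle near 0.42 and the others near 0.19, a flatter distribution than with `d = 0.85`, because more of each score comes from the uniform `(1 - d) / N` term.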
