# Task 6: Link Analysis with PageRank


## PageRank


Two key challenges arise when searching the World Wide Web:
- The Web is very large and its content is heterogeneous.
- Search engines must also be able to handle inexperienced users and pages engineered to manipulate search engine ranking functions.


The intuition behind PageRank is that a page is important if it is pointed to by other important pages. This can be formulated as follows:
- Page $i$ with importance $r_i$ has $d_i$ out-links, and each out-link carries $\frac{r_i}{d_i}$ votes
- Page $j$'s importance is the sum of the votes on its incoming links

    $$r_j = \sum_{i \rightarrow j} \frac{r_i}{d_i}$$
 
    where $d_i$ is the out-degree of node $i$, and the sum over $i \rightarrow j$ runs over every page $i$ that links to page $j$
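
As a small worked example, consider a hypothetical three-page graph (the pages $y$, $a$, $m$ and their links are assumptions chosen for illustration) with links $y \rightarrow y$, $y \rightarrow a$, $a \rightarrow y$, $a \rightarrow m$ and $m \rightarrow a$, so that $d_y = 2$, $d_a = 2$ and $d_m = 1$. The flow equations then read:

$$r_y = \frac{r_y}{2} + \frac{r_a}{2}, \qquad r_a = \frac{r_y}{2} + r_m, \qquad r_m = \frac{r_a}{2}$$

The first equation gives $r_y = r_a$ and the third gives $r_m = \frac{r_a}{2}$. These equations only fix the scores up to a common scale factor, so adding the constraint $r_y + r_a + r_m = 1$ yields:

$$r_y = \frac{2}{5}, \qquad r_a = \frac{2}{5}, \qquad r_m = \frac{1}{5}$$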
    
    
### Matrix Formulation

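Let's define the **stochastic adjacency matrix** $M$:

$$\mathrm{If} \quad  i \rightarrow j  \mathrm{,\quad then} \quad M_{ij} = \frac{1}{d_i} \mathrm{,\quad otherwise} \quad M_{ij} = 0$$

where $M_{ij}$ is the entry in row $i$ and column $j$ of $M$. With this convention, every row that belongs to a page with at least one out-link sums to 1, and the flow equations can be written compactly as $\mathbf{r} = M^T \mathbf{r}$.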
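
The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `stochastic_adjacency`, the edge-list representation and the three-page example graph are all assumptions made for this sketch, and the edge list is assumed to contain no duplicate links.

```python
def stochastic_adjacency(edges, n):
    """Build M with M[i][j] = 1 / d_i if page i links to page j, else 0.

    `edges` is a list of (source, target) pairs over pages 0..n-1
    (a hypothetical input format, assumed to have no duplicate pairs).
    """
    # d_i: number of out-links of each page
    out_degree = [0] * n
    for src, _ in edges:
        out_degree[src] += 1

    M = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        M[src][dst] = 1.0 / out_degree[src]
    return M


# Hypothetical example graph: 0->0, 0->1, 1->0, 1->2, 2->1
edges = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 1)]
M = stochastic_adjacency(edges, 3)
for row in M:
    print(row)
```

Because entry $M_{ij}$ is indexed by source row and target column here, each row of a page with out-links sums to 1. A page without out-links would produce an all-zero row.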


We further define the **rank vector** $\mathbf{r}$, which holds one importance score per page:

$$\mathbf{r} = [r_1, r_2, \dots, r_i, \dots, r_n]^T$$

where $r_i$ is the importance score of page $i$ and $n$ is the total number of pages. The entries of the vector sum to 1:

$$\sum_{i=1}^{n} r_i = 1$$


### Random Walk


### How to Solve PageRank

Given a graph with $n$ nodes, PageRank can be computed with the following iterative procedure:
- Assign each node an initial PageRank, so that all nodes start out equally important:

    $$\mathbf{r}^{(0)} = [\frac{1}{n}, \dots, \frac{1}{n}]^T$$

- Update the PageRank of each node:

    $$r_j^{(t+1)} = \sum_{i \rightarrow j} \frac{r_i^{(t)}}{d_i}$$

- Repeat until convergence:

    $$\sum_{i=1}^{n}|r_i^{(t+1)} - r_i^{(t)}| < \epsilon$$

    Here the $L_1$ norm is used. Other vector norms, such as the Euclidean norm, also work.
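
The three steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function name `pagerank`, the edge-list input, the default tolerance `eps`, the `max_iter` safety cap and the example graph are all choices made for this sketch. It also assumes every page has at least one out-link and that the iteration converges on the given graph.

```python
def pagerank(edges, n, eps=1e-10, max_iter=1000):
    """Iterate r_j <- sum over links i->j of r_i / d_i until the L1 change < eps.

    `edges` is a list of (source, target) pairs over pages 0..n-1.
    `max_iter` is a safety cap added for this sketch, not part of the procedure.
    """
    # d_i: number of out-links of each page
    out_degree = [0] * n
    for src, _ in edges:
        out_degree[src] += 1

    # Step 1: all pages start out equally important
    r = [1.0 / n] * n

    for _ in range(max_iter):
        # Step 2: every page splits its score evenly over its out-links
        nxt = [0.0] * n
        for src, dst in edges:
            nxt[dst] += r[src] / out_degree[src]

        # Step 3: stop once the L1 distance between iterates is below eps
        diff = sum(abs(new - old) for new, old in zip(nxt, r))
        r = nxt
        if diff < eps:
            break
    return r


# Hypothetical example graph: 0->0, 0->1, 1->0, 1->2, 2->1
ranks = pagerank([(0, 0), (0, 1), (1, 0), (1, 2), (2, 1)], n=3)
print(ranks)
```

On this example graph the scores should approach $[0.4, 0.4, 0.2]$, and their sum stays at 1 throughout because no page lacks out-links.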
    

### Problems with PageRank

There are two problems with PageRank in this basic form:

- **Dead Ends**: some pages have no out-links, so the importance that flows into them leaks out of the graph.

- **Spider Traps**: all out-links of a group of pages stay within the group, so the group eventually absorbs all the importance.
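
Both problems can be seen on tiny graphs. The sketch below runs the basic update for a fixed number of steps on two hypothetical two-page graphs. The helper name `iterate`, the step count and both graphs are assumptions chosen for illustration.

```python
def iterate(edges, n, steps):
    """Apply the basic update r_j <- sum over links i->j of r_i / d_i `steps` times."""
    out_degree = [0] * n
    for src, _ in edges:
        out_degree[src] += 1

    r = [1.0 / n] * n
    for _ in range(steps):
        nxt = [0.0] * n
        for src, dst in edges:
            nxt[dst] += r[src] / out_degree[src]
        r = nxt
    return r


# Dead end: page 0 links to page 1, and page 1 has no out-links.
dead_end = iterate([(0, 1)], n=2, steps=3)

# Spider trap: page 0 links to page 1, and page 1 links only to itself.
spider_trap = iterate([(0, 1), (1, 1)], n=2, steps=3)

print(dead_end, sum(dead_end))  # the total importance shrinks toward 0
print(spider_trap)              # page 1 ends up holding all the importance
```

In the dead-end graph the scores no longer sum to 1 after the first step, because the score that reaches page 1 is never passed on. In the spider-trap graph the sum is preserved, but page 0 is left with no importance at all.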





## Personalized PageRank

pg 41 to 54

## Matrix Factorization and Node Embeddings

pg 55 to 64

## References

\[1\] [The PageRank Citation Ranking: Bringing Order to the Web](http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf)

\[2\] [Page Rank Deck by Martin Klein and Santosh Vuppala](https://homepages.dcc.ufmg.br/~nivio/cursos/ri11/sources/pagerank.pdf)

\[3\] [Link Analysis with PageRank Slides](http://snap.stanford.edu/class/cs224w-2021/slides/04-pagerank.pdf)