## Homework 3
### BIOSTAT 257
#### Joanna Boland
#### May 15, 2020

## Question 1: Problem Structure 

Let $\mathbf{A} \in \{0,1\}^{n \times n}$ be the connectivity matrix of $n$ web pages with entries,
\begin{eqnarray*}
    a_{ij}= \begin{cases}
    1 & \text{if page $i$ links to page $j$} \\
    0 & \text{otherwise}
    \end{cases}.
\end{eqnarray*}
Here $r_i = \sum_j a_{ij}$ is the out-degree of page $i$, that is, the number of links on page $i$. Imagine a random surfer exploring the space of pages according to the following rules:
* From a page $i$ with $r_i>0$,
    * with probability $p$, (s)he randomly chooses a link on page $i$ (uniformly) and follows that link to the next page;
    * with probability $1-p$, (s)he randomly chooses one page from the set of all $n$ pages (uniformly) and proceeds to that page.
* From a page $i$ with $r_i=0$ (a dangling page), (s)he randomly chooses one page from the set of all $n$ pages (uniformly) and proceeds to that page.

The process defines a Markov chain on the space of $n$ pages. Write the transition matrix $\mathbf{P}$ of the Markov chain as a sparse matrix plus a rank-1 matrix.
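
One way to see the requested structure, sketched with notation that is not in the assignment (the symbols $\mathbf{R}^{+}$, $\mathbf{z}$, and $\mathbf{1}$ are introduced here for illustration): reading the surfer rules row by row gives

```latex
\begin{eqnarray*}
    p_{ij} = \begin{cases}
    p \, a_{ij} / r_i + (1-p)/n & \text{if $r_i > 0$} \\
    1/n & \text{if $r_i = 0$}
    \end{cases},
    \qquad
    \mathbf{P} = p \, \mathbf{R}^{+} \mathbf{A} + \mathbf{z} \mathbf{1}^T,
\end{eqnarray*}
```

where $\mathbf{R}^{+}$ is the diagonal matrix with entries $1/r_i$ if $r_i > 0$ and $0$ otherwise, $\mathbf{z}$ has entries $z_i = (1-p)/n$ if $r_i > 0$ and $z_i = 1/n$ otherwise, and $\mathbf{1}$ is the vector of all ones. The first term has the same sparsity pattern as $\mathbf{A}$, and the second term has rank 1.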

## Question 2: Relate to Numerical Linear Algebra

According to standard Markov chain theory, the (random) position of the surfer converges to the stationary distribution $\mathbf{x} = (x_1,\ldots,x_n)^T$ of the Markov chain. $x_i$ has the natural interpretation of the proportion of times the surfer visits page $i$ in the long run. Therefore $\mathbf{x}$ serves as page ranks: a higher $x_i$ means page $i$ is more visited. It is well-known that $\mathbf{x}$ is the left eigenvector corresponding to the top eigenvalue 1 of the transition matrix $\mathbf{P}$. That is, $\mathbf{P}^T \mathbf{x} = \mathbf{x}$. Therefore $\mathbf{x}$ can be solved as an **eigen-problem**. It can also be cast as **solving a linear system** $(\mathbf{I} - \mathbf{P}^T) \mathbf{x} = \mathbf{0}$. Since the row sums of $\mathbf{P}$ are 1, $\mathbf{I} - \mathbf{P}^T$ is rank deficient. We can replace the first equation by the constraint $\sum_{i=1}^n x_i = 1$.

Hint: For iterative solvers, we don't need to replace the 1st equation. We can use the matrix $\mathbf{I} - \mathbf{P}^T$ directly if we start with a vector with all positive entries.
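
As a minimal sketch of the two formulations, the toy example below uses a hypothetical 4-page web (not the assignment data). It builds a dense $\mathbf{P}$ from the surfer rules, replaces the first equation of $(\mathbf{I} - \mathbf{P}^T)\mathbf{x} = \mathbf{0}$ by $\sum_i x_i = 1$, and compares the solution with plain power iteration started from a positive vector. Dense storage is only viable at this toy size.

```julia
using LinearAlgebra

# Toy 4-page web (hypothetical data): A[i, j] = 1 if page i links to page j.
A = [0 1 1 0;
     0 0 1 0;
     1 0 0 0;
     0 0 0 0]                 # page 4 is dangling (no outgoing links)
n = size(A, 1)
p = 0.85                      # probability of following a link
r = vec(sum(A, dims = 2))     # out-degrees (row sums)

# Dense transition matrix, built row by row from the surfer rules.
P = zeros(n, n)
for i in 1:n
    if r[i] > 0
        P[i, :] .= p .* A[i, :] ./ r[i] .+ (1 - p) / n
    else
        P[i, :] .= 1 / n
    end
end

# Linear-system formulation: (I - P^T) x = 0 with the first equation
# replaced by the normalization sum(x) = 1.
M = Matrix{Float64}(I, n, n) - transpose(P)
M[1, :] .= 1.0
b = zeros(n)
b[1] = 1.0
x = M \ b

# Eigen formulation via power iteration, starting from a positive vector.
y = fill(1 / n, n)
for _ in 1:200
    y .= transpose(P) * y
end
```

Both `x` and `y` should agree up to rounding and satisfy $\mathbf{P}^T \mathbf{x} = \mathbf{x}$.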


## Question 3: Explore the Data

In [18]:
using MatrixDepot, LinearAlgebra, BenchmarkTools, UnicodePlots

md = mdopen("SNAP/web-Google")
# display documentation for the SNAP/web-Google data
mdinfo(md)



# SNAP/web-Google

###### MatrixMarket matrix coordinate pattern general

---

  * UF Sparse Matrix Collection, Tim Davis
  * http://www.cise.ufl.edu/research/sparse/matrices/SNAP/web-Google
  * name: SNAP/web-Google
  * [Web graph from Google]
  * id: 2301
  * date: 2002
  * author: Google
  * ed: J. Leskovec
  * fields: name title A id date author ed kind notes
  * kind: directed graph

---

  * notes:
  * Networks from SNAP (Stanford Network Analysis Platform) Network Data Sets,
  * Jure Leskovec http://snap.stanford.edu/data/index.html
  * email jure at cs.stanford.edu
  * 
  * Google web graph
  * 
  * Dataset information
  * 
  * Nodes represent web pages and directed edges represent hyperlinks between them.
  * The data was released in 2002 by Google as a part of Google Programming
  * Contest.
  * 
  * Dataset statistics
  * Nodes   875713
  * Edges   5105039
  * Nodes in largest WCC    855802 (0.977)
  * Edges in largest WCC    5066842 (0.993)
  * Nodes in largest SCC    434818 (0.497)
  * Edges in largest SCC    3419124 (0.670)
  * Average clustering coefficient  0.6047
  * Number of triangles     13391903
  * Fraction of closed triangles    0.05523
  * Diameter (longest shortest path)    22
  * 90-percentile effective diameter    8.1
  * 
  * Source (citation)
  * 
  * J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. Community Structure in Large
  * Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters.
  * arXiv.org:0810.1355, 2008.
  * 
  * Google programming contest, 2002
  * http://www.google.com/programming-contest/
  * 
  * Files
  * File    Description
  * web-Google.txt.gz   Webgraph from the Google programming contest, 2002

---

916428 916428 5105039


In [5]:
# connectivity matrix
A = md.A

916428×916428 SparseArrays.SparseMatrixCSC{Bool,Int64} with 5105039 stored entries:
  [11343 ,      1]  =  1
  [11928 ,      1]  =  1
  [15902 ,      1]  =  1
  [29547 ,      1]  =  1
  [30282 ,      1]  =  1
  [31301 ,      1]  =  1
  [38717 ,      1]  =  1
  [43930 ,      1]  =  1
  [46275 ,      1]  =  1
  [48193 ,      1]  =  1
  [50823 ,      1]  =  1
  [56911 ,      1]  =  1
  ⋮
  [608625, 916427]  =  1
  [618730, 916427]  =  1
  [622998, 916427]  =  1
  [673046, 916427]  =  1
  [716616, 916427]  =  1
  [720325, 916427]  =  1
  [772226, 916427]  =  1
  [785097, 916427]  =  1
  [788476, 916427]  =  1
  [822938, 916427]  =  1
  [833616, 916427]  =  1
  [417498, 916428]  =  1
  [843845, 916428]  =  1

* A `Bool` occupies 1 byte in Julia, so a dense matrix of 916428 x 916428 zero/one entries would take the following number of gigabytes:

In [8]:
GB = 916428^2 * sizeof(Bool) / 1e9

839.840279184
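
For comparison, a rough sketch of the storage needed for a dense `Float64` copy of the matrix versus compressed sparse column (CSC) storage. It assumes the usual CSC layout (one value and one `Int64` row index per stored entry, plus $n + 1$ column pointers); the variable names are illustrative, and the sizes are taken from the output above.

```julia
n = 916_428            # rows/columns reported for A
nstored = 5_105_039    # stored entries reported for A

# Dense Float64 copy: 8 bytes per entry.
dense_float64_gb = n^2 * sizeof(Float64) / 1e9

# CSC storage: value + row index per stored entry, plus n + 1 column pointers.
sparse_bytes = nstored * (sizeof(Bool) + sizeof(Int64)) + (n + 1) * sizeof(Int64)
sparse_gb = sparse_bytes / 1e9

println("dense Float64: ", round(dense_float64_gb, digits = 1), " GB")
println("sparse CSC:    ", round(sparse_gb, digits = 3), " GB")
```

Under these assumptions the sparse representation needs on the order of tens of megabytes, while the dense copy needs several terabytes.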

* The number of web pages is $916,428$, since $\mathbf{A}$ is an $n \times n$ matrix where $n$ is the total number of web pages.
* Only the 1's are stored and all other entries are 0, so the $5,105,039$ stored entries correspond to $5,105,039$ web links.

In [40]:
r = vec(sum(A, dims = 1)) # out-degrees
x = vec(sum(A, dims = 2)) # in-degrees
count(iszero, r)

201883

* The number of dangling nodes is $201,883$.

In [44]:
histogram(x, nbins = 25, closed = :left, title = "Histogram of In-Degrees")

                           Histogram of In-Degrees
                  ┌                                        ┐
   [  0.0,  20.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 891798
   [ 20.0,  40.0) ┤▇ 22628
   [ 40.0,  60.0) ┤ 1329
   [ 60.0,  80.0) ┤ 371
   [ 80.0, 100.0) ┤ 124
   [100.0, 120.0) ┤ 86
   [120.0, 140.0) ┤ 29
   ⋮

In [45]:
histogram(r, nbins = 25, closed = :left, title = "Histogram of Out-Degrees")

                            Histogram of Out-Degrees
                    ┌                                        ┐
   [   0.0,  500.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 916114
   [ 500.0, 1000.0) ┤ 180
   [1000.0, 1500.0) ┤ 32
   [1500.0, 2000.0) ┤ 20
   [2000.0, 2500.0) ┤ 16
   [2500.0, 3000.0) ┤ 20
   [3000.0, 3500.0) ┤ 18
   ⋮

In [47]:
A1 = A[1:10000, 1:10000]
spy(A1, title = "Sparsity of a Submatrix of A")

[Plot: "Sparsity of a Submatrix of A", the nonzero pattern of the leading 10000 × 10000 block of A]

## Question 4: Dense Linear Algebra?

Consider the following methods to obtain the page ranks of the `SNAP/web-Google` data. 

1. A dense linear system solver such as LU decomposition.
2. A dense eigen-solver for an asymmetric matrix.

For the LU approach, estimate (1) the memory usage and (2) how long it will take assuming that the LAPACK functions can achieve the theoretical throughput of your computer.
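
A back-of-the-envelope sketch of such an estimate follows. It assumes a dense `Float64` copy of the $n \times n$ matrix and the leading-order LU cost of $\frac{2}{3} n^3$ flops. The throughput value `gflops` is a placeholder assumption, not a measurement; substitute the theoretical peak of your own machine.

```julia
n = 916_428

# (1) Memory for a dense Float64 matrix, in terabytes.
mem_tb = n^2 * sizeof(Float64) / 1e12

# (2) Leading-order flop count of LU decomposition: (2/3) n^3.
flops = 2 / 3 * float(n)^3

# ASSUMED sustained throughput in GFLOPS (hypothetical; machine dependent).
gflops = 100.0
seconds = flops / (gflops * 1e9)
days = seconds / 86_400

println("memory: ", round(mem_tb, digits = 2), " TB")
println("time:   ", round(days, digits = 1), " days at ", gflops, " GFLOPS")
```

At the assumed rate, both the memory footprint and the run time are far beyond what a single workstation can handle, which motivates the iterative approaches in the next question.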

## Question 5: Iterative Solvers

Set the _teleportation_ parameter at $p = 0.85$. Consider the following methods for solving the PageRank problem. 

1. An iterative linear system solver such as GMRES.
2. An iterative eigen-solver such as the Arnoldi method.




In [None]:
using BenchmarkTools, LinearAlgebra, SparseArrays, Revise

# a type for the matrix M = I - P^T in PageRank problem
struct PageRankImPt{TA <: Number, IA <: Integer, T <: AbstractFloat} <: AbstractMatrix{T}
    A         :: SparseMatrixCSC{TA, IA} # adjacency matrix
    telep     :: T
    # working arrays
    # TODO: whatever intermediate arrays you may want to pre-allocate
end

# constructor
function PageRankImPt(A::SparseMatrixCSC, telep::T) where T <: AbstractFloat
    n = size(A, 1)
    # TODO: initialize and pre-allocate arrays
    PageRankImPt(A, telep)
end

LinearAlgebra.issymmetric(::PageRankImPt) = false
Base.size(M::PageRankImPt) = size(M.A)
# TODO: implement this function for evaluating M[i, j]
Base.getindex(M::PageRankImPt, i, j) = M.telep

# overwrite `out` by `(I - Pt) * v`
function LinearAlgebra.mul!(
        out :: Vector{T}, 
        M   :: PageRankImPt{<:Number, <:Integer, T}, 
        v   :: Vector{T}) where T <: AbstractFloat
    # TODO: implement mul!(out, M, v)
    sleep(1e-2) # placeholder: wait 10 ms as if your code takes 10 ms
    return out
end

# overwrite `out` by `(I - P) * v`
function LinearAlgebra.mul!(
        out :: Vector{T}, 
        Mt  :: Transpose{T, PageRankImPt{TA, IA, T}}, 
        v   :: Vector{T}) where {TA<:Number, IA<:Integer, T <: AbstractFloat}
    M = Mt.parent
    # TODO: implement mul!(out, transpose(M), v)
    sleep(1e-2) # placeholder: wait 10 ms as if your code takes 10 ms
    out
end
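
The product $(\mathbf{I} - \mathbf{P}^T)\mathbf{v}$ can be evaluated without ever forming $\mathbf{P}$. Below is one possible sketch, not the template's intended solution. The function name `imPt_mul!` and its argument list are hypothetical. It assumes `r` holds the row sums of `A` and that every stored entry of `A` is nonzero, and it uses the sparse-plus-rank-1 structure of $\mathbf{P}$ with one pass over the CSC columns of `A`. A toy 4-page web (hypothetical data) checks the result against a dense $\mathbf{P}$.

```julia
using LinearAlgebra, SparseArrays

# Overwrite `out` with (I - P^T) * v, where row i of P is
#   p * A[i, :] / r[i] + (1 - p) / n   if r[i] > 0,
#   1 / n                              if r[i] == 0 (dangling page).
function imPt_mul!(out, A::SparseMatrixCSC, r, p, v)
    n = size(A, 1)
    rows = rowvals(A)
    vals = nonzeros(A)
    # Rank-1 part: the same scalar z' * v is subtracted from every entry.
    zv = 0.0
    for i in 1:n
        zv += (r[i] > 0 ? (1 - p) / n : 1 / n) * v[i]
    end
    # Sparse part: entry j of A^T * (v ./ r) only needs column j of A.
    for j in 1:n
        s = 0.0
        for k in nzrange(A, j)
            i = rows[k]
            s += vals[k] * v[i] / r[i]
        end
        out[j] = v[j] - p * s - zv
    end
    return out
end

# Toy check against a dense P (page 4 is dangling).
A = sparse([0 1 1 0; 0 0 1 0; 1 0 0 0; 0 0 0 0])
n = size(A, 1)
p = 0.85
r = vec(sum(A, dims = 2))
Ad = Matrix(A)
P = zeros(n, n)
for i in 1:n
    if r[i] > 0
        P[i, :] .= p .* Ad[i, :] ./ r[i] .+ (1 - p) / n
    else
        P[i, :] .= 1 / n
    end
end
v = [0.1, 0.2, 0.3, 0.4]
out = similar(v)
imPt_mul!(out, A, r, p, v)
```

Each call costs work proportional to the number of stored entries plus $n$. A real implementation would likely precompute the scaled vector and the dangling-page indicator in the struct's working arrays.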