In [1]:
using Pkg
pkg"activate .."

In [2]:
pkg"add FillArrays"

[32m[1m  Updating[22m[39m registry at `~/.julia/registries/General`
[32m[1m  Updating[22m[39m git-repo `https://github.com/JuliaRegistries/General.git`
[2K[?25h[32m[1m Resolving[22m[39m package versions...
[32m[1m Installed[22m[39m BinaryProvider ─ v0.5.1
[32m[1m Installed[22m[39m Clustering ───── v0.12.1
[32m[1m  Updating[22m[39m `../Project.toml`
 [90m [aaaa29a8][39m[93m ↑ Clustering v0.12.0 ⇒ v0.12.1[39m
[32m[1m  Updating[22m[39m `../Manifest.toml`
 [90m [b99e7846][39m[93m ↑ BinaryProvider v0.5.0 ⇒ v0.5.1[39m
 [90m [aaaa29a8][39m[93m ↑ Clustering v0.12.0 ⇒ v0.12.1[39m


## Node2Vec biases walks:
 
Quoting the [node2vec paper](https://cs.stanford.edu/people/jure/pubs/node2vec-kdd16.pdf), section 3.2.2, "Search bias":
 
The simplest way to bias our random walks would be to sample the next node based on the static edge weights $w_{vx}$, i.e., $\pi_{vx} = w_{vx}$. (In case of unweighted graphs $w_{vx}=1$.) However, this does not allow us to account for the network structure and guide our search procedure to explore different types of network neighborhoods.
Additionally, unlike BFS and DFS which are extreme sampling paradigms suited for structural equivalence and homophily respectively, our random walks should accommodate for the fact that these notions of equivalence are not competing or exclusive, and real-world networks commonly exhibit a mixture of both. 

We define a 2$^{\textrm{nd}}$ order random walk with two parameters $p$ and $q$ which guide the walk: Consider a random walk that just traversed edge $(t,v)$ and now resides at node $v$. The walk now needs to decide on the next step so it evaluates the transition probabilities $\pi_{vx}$ on edges $(v,x)$ leading from $v$. We set the unnormalized transition probability to $\pi_{vx} = \alpha_{pq}(t,x)\cdot w_{vx}$, where

$$
	\alpha_{pq}(t,x) = 
	\begin{cases}
	\frac{1}{p}  & \text{if } d_{tx} = 0\\
	1 & \text{if } d_{tx} = 1\\
	\frac{1}{q} & \text{if } d_{tx} = 2
	\end{cases}
$$

and $d_{tx}$ denotes the shortest path distance between nodes $t$ and $x$. Note that $d_{tx}$ must be one of $\{0,1,2\}$, and hence, the two parameters are necessary and sufficient to guide the walk.
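
As a small worked example of the definition above, the sketch below computes the normalized second-order transition probabilities on a toy graph. The graph, the dense adjacency matrix `A`, and the helper names `alpha_pq` and `transition_probs` are assumptions made for illustration; they are not taken from the paper or from the code later in this notebook.

```julia
# Toy undirected, unweighted graph (hypothetical): edges 1-2, 1-3, 2-3, 2-4
A = [0 1 1 0;
     1 0 1 1;
     1 1 0 0;
     0 1 0 0]

# Search bias α_pq(t, x). The distance d_tx is inferred from adjacency,
# assuming x is a neighbour of v and v is a neighbour of t.
function alpha_pq(A, t, x; p=1, q=1)
    if t == x
        1 / p      # d_tx = 0: the walk steps back to t
    elseif A[t, x] > 0
        1.0        # d_tx = 1: x is also adjacent to t
    else
        1 / q      # d_tx = 2: x is one step further away from t
    end
end

# Normalized probabilities of moving from v to each neighbour x,
# given that the walk arrived at v from t.
function transition_probs(A, t, v; p=1, q=1)
    neighbours = findall(w -> w > 0, A[v, :])
    unnormalized = [alpha_pq(A, t, x; p=p, q=q) * A[v, x] for x in neighbours]
    Dict(zip(neighbours, unnormalized ./ sum(unnormalized)))
end

# The walk came from node 1 and now sits at node 2.
probs = transition_probs(A, 1, 2; p=2, q=0.5)
```

With $p=2$ and $q=0.5$ the unnormalized values for the neighbours 1, 3 and 4 of node 2 are $1/2$, $1$ and $2$, so the walk returns to node 1 with probability $1/7$, moves to node 3 with probability $2/7$, and moves outward to node 4 with probability $4/7$.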


In [3]:
using JSON
using LightGraphs, SimpleWeightedGraphs
using MLDataUtils

using MappedArrays
using LinearAlgebra
using FillArrays
using SparseArrays

In [4]:
const data = JSON.parse(open("jean.json"))

const characters = LabelEnc.NativeLabels([character["id"] for character in data["nodes"]])
char_id(name) = convertlabel(LabelEnc.Indices, name, characters)
char_name(ind) = MLDataUtils.ind2label(ind, characters) # hack around https://github.com/JuliaML/MLLabelUtils.jl/issues/18

const char_group = [character["group"] for character in data["nodes"]];

const les_graph = SimpleWeightedGraph(nlabel(characters))
for link in data["links"]
    add_edge!(les_graph,
        char_id(link["source"]),
        char_id(link["target"]),
        link["value"]
    )
end
les_graph



{77, 254} undirected simple Int64 graph with Float64 weights

## Functional Form

In [56]:
# Unnormalized transition probability of stepping v -> x, having arrived at v from t.
function π_vx(W, t, v, x; p=1, q=1)
    w_vx = @inbounds W[v, x]
    α(W, t, x; p=p, q=q) * w_vx
end

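# Search bias α_pq(t, x); assumes x is a neighbour of v and v is a neighbour of t,
# so d_tx can only be 0, 1 or 2.
function α(W, t, x; p=1, q=1)
    if t == x                     # d_tx = 0: step back to t
        1/p
    elseif @inbounds W[t,x] > 0   # d_tx = 1: x is adjacent to t
        1
    else                          # d_tx = 2: path t -> v, v -> x
        1/q
    end
end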

α (generic function with 1 method)

## Generating the co-occurrence matrix

In [89]:
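function cooccurance_matrix(W, window_size; p=1, q=1)
    X = copy(W)

    πW1 = copy(W) # First step
    for d in 2:window_size
        πW2 = spzeros(size(W)...) # wipe it: fresh accumulator for this step
        # For every previous step t -> v, add the biased weight of each next step v -> x
        for (t, v, val) in zip(findnz(πW1)...)
            for x in SparseArrays.nonzeroinds(@inbounds W[v,:])
                @inbounds πW2[v, x] += π_vx(W, t, v, x; p=p, q=q)
            end
        end
        πW1 *= πW2
        # GloVe proximity weighting. Note that `πW1 /= d` rebinds πW1,
        # so the 1/d factors compound across iterations.
        X .+= πW1 /= d
    end
    X
end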
    

cooccurance_matrix (generic function with 4 methods)
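
To make the idea of distance weighting concrete, here is a minimal dense-matrix sketch for the unbiased case only ($p = q = 1$), where a $d$-step walk is described by the $d$-th power of the row-normalized transition matrix $P$ and each distance $d$ is weighted by $1/d$. The helper `weighted_walk_sum` and the toy matrix `P` are hypothetical and are not the notebook's `cooccurance_matrix`; the sketch ignores the second-order bias entirely.

```julia
using LinearAlgebra

# Hypothetical helper: X = Σ_{d=1}^{window_size} P^d / d for a first-order walk.
function weighted_walk_sum(P, window_size)
    X = zero(P)
    Pd = Matrix{eltype(P)}(I, size(P)...)  # P^0
    for d in 1:window_size
        Pd = Pd * P      # P^d: probabilities of d-step walks
        X += Pd ./ d     # walks of length d are weighted by 1/d
    end
    X
end

# Toy row-stochastic transition matrix (assumption for illustration)
P = [0.0 1.0;
     0.5 0.5]

X = weighted_walk_sum(P, 2)
```

Because every power of a row-stochastic matrix is again row-stochastic, each row of `X` sums to $1 + 1/2 = 1.5$ for a window of 2.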

In [90]:
prob_norm(W) = W./sum(W,dims=2)

prob_norm (generic function with 1 method)
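
A quick check of what `prob_norm` does on a toy matrix (the matrix `W_toy` is an assumption for illustration): each row is divided by its sum, so rows with positive total weight become probability distributions. A row with zero total weight divides zero by zero and comes out as `NaN`, which may matter for graphs with isolated nodes.

```julia
# Same definition as in the cell above, repeated so this block runs on its own
prob_norm(W) = W ./ sum(W, dims=2)

W_toy = [0.0 2.0 2.0;
         1.0 0.0 3.0;
         0.0 0.0 0.0]   # node 3 has no edges

P_toy = prob_norm(W_toy)
```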

In [96]:
W = prob_norm(weights(les_graph))

@time cooccurance_matrix(W,4)

  0.297579 seconds (2.18 M allocations: 107.327 MiB, 8.48% gc time)


77×77 SparseMatrixCSC{Float64,Int64} with 5875 stored entries:
  [1 ,  1]  =  2588.25
  [2 ,  1]  =  1302.84
  [3 ,  1]  =  1653.23
  [4 ,  1]  =  1537.17
  [5 ,  1]  =  1302.84
  [6 ,  1]  =  1302.84
  [7 ,  1]  =  1302.84
  [8 ,  1]  =  1302.84
  [9 ,  1]  =  1302.84
  [10,  1]  =  1302.84
  [11,  1]  =  324.117
  [12,  1]  =  289.983
  ⋮
  [66, 77]  =  101.028
  [67, 77]  =  103.586
  [68, 77]  =  77.966
  [69, 77]  =  18.2692
  [70, 77]  =  18.1322
  [71, 77]  =  18.9874
  [72, 77]  =  22.6258
  [73, 77]  =  21.4899
  [74, 77]  =  80.2723
  [75, 77]  =  80.2723
  [76, 77]  =  20.6658
  [77, 77]  =  104.02

In [75]:
 |> methods