# Node Embeddings and Skip Gram Examples

**Purpose:** To explore node embedding methods that build on techniques such as Word2Vec.

**Introduction:** One of the key methods used in node classification draws its inspiration from natural language processing. One approach to natural language processing treats the ordering of words much like a graph, since each n-gram has a set of words that can follow it. Strategies that treat text this way carry over naturally to domains where we work explicitly with a network structure.

Methods which employ node embeddings have several fundamental steps:
1. Create a "corpus" of node connections using a random walk.
2. Define a transformation on the list of node connections from step **1** that assigns a high score to pairs of nodes that are closely related and a low score to pairs that are only weakly related.
3. Run a standard machine learning method on the new set of features from step **2**.


## Random Walks:

Here we explore the first step in this process: randomly sampling nodes by walking along the edges of the graph. Each walk approximates the connections of its starting node as a list. This carries two advantages:
1. The resulting similarity measure captures both local (direct) connections and higher-order (indirect) connections. This is known as **Expressivity**.
2. Not every pair of nodes needs to be encoded; pairs that never co-occur on a walk (the zero probabilities) are simply left out. This is **Efficiency**.
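To make both points concrete, the sketch below counts the node pairs that co-occur in a handful of short walks and compares that with the number of all possible pairs. The adjacency dictionary `toy_graph` and the helpers `random_walk` and `cooccurring_pairs` are illustrative assumptions, not part of the notebook's data or code.

```python
import random
from itertools import combinations

# Hypothetical toy graph: adjacency lists keyed by node id.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

def random_walk(graph, start, length, rng):
    """Uniform random walk containing `length` nodes, starting at `start`."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

def cooccurring_pairs(walks):
    """Unordered pairs of distinct nodes that appear together in at least one walk."""
    pairs = set()
    for walk in walks:
        for a, b in combinations(set(walk), 2):
            pairs.add((min(a, b), max(a, b)))
    return pairs

rng = random.Random(13)
walks = [random_walk(toy_graph, node, 3, rng) for node in toy_graph]
pairs = cooccurring_pairs(walks)
n = len(toy_graph)
print(f"{len(pairs)} of {n * (n - 1) // 2} possible pairs observed")
```

Pairs two hops apart can show up (higher-order connections), while distant pairs such as `(0, 5)` can never co-occur in a three-node walk and so never need to be stored.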

The sections below discuss some of the methods used for random walks, with reference to the papers that introduced them.

### DeepWalk Method

*DeepWalk: Online Learning of Social Representations* uses short random walks. We denote a random walk starting at vertex $V_i$ as $W_i$. This random walk is a stochastic process composed of random variables $W_i^k$, where $k$ denotes the step in the sequence of each random walk.

For this method, a stream of random walks is created. Short walks are easy to parallelize, and they are less sensitive to changes in the underlying graph than longer random walks.
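A minimal sketch of such a stream, assuming a small adjacency dictionary: several passes are made over the nodes, the start order is shuffled on each pass, and one short walk is emitted per start node. The names `deepwalk_corpus`, `walks_per_node`, and `toy_graph` are hypothetical and chosen for illustration only.

```python
import random

def deepwalk_corpus(graph, walks_per_node, walk_length, seed=0):
    """Return a list of walks: `walks_per_node` passes, one walk per start node per pass."""
    rng = random.Random(seed)
    nodes = list(graph)
    corpus = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit the start nodes in a fresh random order on each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(graph[walk[-1]]))
            corpus.append(walk)
    return corpus

# Hypothetical toy graph used only for this example.
toy_graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
corpus = deepwalk_corpus(toy_graph, walks_per_node=3, walk_length=5)
print(len(corpus))  # 12 walks: 3 passes over 4 start nodes
```

Because each walk depends only on its start node and the adjacency lists, the inner loop could be split across workers without coordination.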

The cells below load the data and implement the DeepWalk random walk:

In [2]:
import pandas as pd, numpy as np, os, random
from IPython.core.debugger import set_trace
np.random.seed(13)
random.seed(13)  # the walk functions below draw from the standard-library random module
dat = pd.read_csv("../Data/soc-sign-bitcoinalpha.csv", names = ["SOURCE", "TARGET", "RATING", "TIME"])

In [3]:
len(pd.unique(dat.SOURCE)) 

3286

In [4]:
len(pd.unique(dat.TARGET) )

3754

In [5]:
#from_vals = pd.unique(dat.SOURCE)

#a = dat.TARGET[dat.SOURCE == from_vals[1]]
# Generate list comprehension using from values as a key; to values are saved as a list.
#node_lists = {x:dat.TARGET[dat.SOURCE == x].values for x in from_vals  }

# Generate a step by selecting one value randomly from the list of "to" nodes:
def gen_step(key_val, dict_vals):
    neighbors = dict_vals[key_val]
    return neighbors[random.randint(0, len(neighbors) - 1)]

# Generate a walk containing `steps` nodes, including the start node key_val:
def gen_walk(key_val, dict_vals, steps):
    walk_vals = [key_val]
    for i in range(steps - 1):
        walk_vals.append(gen_step(walk_vals[-1], dict_vals))
    return walk_vals

# Generate one walk per start node; returns {start node: walk}.
def RW_DeepWalk(orig_nodes, to_vals, walk_length=3):
    from_vals = pd.unique(orig_nodes)
    # Adjacency lists: each "from" node maps to the array of its "to" nodes.
    node_lists = {x: to_vals[orig_nodes == x].values for x in from_vals}
    walks = {x: gen_walk(key_val=x, dict_vals=node_lists, steps=walk_length) for x in node_lists}
    return walks

In [6]:
# To walk over every node, we need a full list of "from" and "to" values. This is built below:
# Identify values in the "to" column that never appear in the "from" column:
f = dat.SOURCE
t = dat.TARGET
from_set = set(pd.unique(f))
unique_t = [x for x in pd.unique(t) if x not in from_set]
x_over = dat[dat['TARGET'].isin(unique_t)]
# Add those "to" nodes to the "from" column, paired with their original "from" nodes. This way, nodes that only appear as targets can also be stepped through in the random walk.
# pd.concat replaces Series.append, which was removed in pandas 2.0.
full_from = pd.concat([f, x_over.TARGET])
full_to = pd.concat([t, x_over.SOURCE])

In [7]:
random_walk = RW_DeepWalk( full_from, full_to, walk_length=10)

An example of one of the arrays obtained using a random walk:

In [8]:
random_walk[1]

[1, 2273, 2202, 1134, 35, 1385, 114, 1202, 605, 230]

Random walks provide a representation of the network that is quick to compute and simple to parallelize. That speed also makes it cheap to update the calculations when the graph structure changes.

### Node2vec Method

The paper *node2vec: Scalable Feature Learning for Networks* uses a different method called a "biased random walk".


The paper discusses the sampling strategies that can be used to approximate the neighborhood around a source node (denoted $N_s$ in the paper). There are two extreme strategies:

* Breadth-first sampling (BFS) - The neighborhood is restricted to nodes which are immediate neighbors of the source node. For this, we define the neighborhood **only** with directly adjacent nodes.
* Depth-first sampling (DFS) - The neighborhood consists of nodes sequentially sampled at increasing distances from the source node. This is represented in the random walk algorithm that was shown in the last section.


A biased random walk as expressed by the authors is an interpolation between the two strategies mentioned above.
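The two extremes can be sketched on a small graph as follows. This is a simplified illustration under stated assumptions: `toy_graph`, `bfs_neighborhood`, and `dfs_neighborhood` are hypothetical names, and the DFS-style sampler simply follows one uniform walk, which tends to move away from the source but does not guarantee strictly increasing distance.

```python
import random

# Hypothetical toy graph; "u" is the source node.
toy_graph = {
    "u": ["a", "b"],
    "a": ["u", "b", "c"],
    "b": ["u", "a"],
    "c": ["a", "d"],
    "d": ["c"],
}

def bfs_neighborhood(graph, source, k, rng):
    """BFS extreme: sample up to k nodes from the immediate neighbors of `source`."""
    neighbors = list(graph[source])
    return rng.sample(neighbors, min(k, len(neighbors)))

def dfs_neighborhood(graph, source, k, rng):
    """DFS-style extreme: follow a single walk from `source` and record the k nodes visited."""
    sampled, current = [], source
    for _ in range(k):
        current = rng.choice(graph[current])
        sampled.append(current)
    return sampled

rng = random.Random(13)
bfs = bfs_neighborhood(toy_graph, "u", 2, rng)
dfs = dfs_neighborhood(toy_graph, "u", 3, rng)
print("BFS neighborhood:", bfs)
print("DFS-style neighborhood:", dfs)
```

The BFS sample never leaves the direct neighbors of `u`, while the walk-based sample can reach nodes such as `c` or `d` that are two or three hops away.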

Let $u$ be the source node, and $l$ be the length of the random walk. Let $c_i$ be the $i$th node in the walk where $c_0 = u$. Then, $c_i$ is generated as follows:

$$ P(c_i = x \mid c_{i-1} = v) =
    \begin{cases}
      \frac{\pi_{v,x}}{Z} & \text{if $(v,x) \in E$}\\
      0 & \text{otherwise}
    \end{cases} $$

where $E$ is the set of edges of the graph, $\pi_{v,x}$ is the unnormalized transition probability between nodes $v$ and $x$, and $Z$ is the normalizing constant. This is very similar to the formulation described earlier for DeepWalk.

The simplest way to introduce bias to the random walks is to sample based on the static edge weights: $\pi_{v,x} = w_{v,x}$. In the case of an unweighted graph like the one used in the example above, $w_{v,x} = 1$.
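A minimal sketch of sampling by static edge weights, assuming a weighted adjacency dictionary of the form `{node: {neighbor: weight}}`. The graph and the helper names `transition_probs` and `weighted_step` are hypothetical.

```python
import random

# Hypothetical weighted graph: node -> {neighbor: edge weight w_vx}.
weighted_graph = {
    "v": {"x1": 1.0, "x2": 3.0},
    "x1": {"v": 1.0},
    "x2": {"v": 3.0},
}

def transition_probs(graph, v):
    """Normalize the edge weights out of v: P(x | v) = w_vx / Z."""
    z = sum(graph[v].values())
    return {x: w / z for x, w in graph[v].items()}

def weighted_step(graph, v, rng):
    """Draw the next node with probability proportional to the edge weight."""
    neighbors = list(graph[v])
    weights = [graph[v][x] for x in neighbors]
    return rng.choices(neighbors, weights=weights, k=1)[0]

rng = random.Random(13)
print(transition_probs(weighted_graph, "v"))  # {'x1': 0.25, 'x2': 0.75}
print(weighted_step(weighted_graph, "v", rng))
```

With all weights equal to 1, the normalized probabilities become uniform over the neighbors, which is the unweighted case described above.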

We will define a 2nd-order random walk with parameters $p$ and $q$. We set the unnormalized transition probability to $\pi_{v,x} = \alpha_{p,q}(t,x) \cdot w_{v,x}$, where $t$ is the node visited immediately before $v$ and $\alpha_{p,q}(t,x)$ is defined as:

\begin{equation}
  \alpha_{p,q}(t,x) =
    \begin{cases}
      \frac{1}{p} & \text{if $d_{t,x}=0$ }\\
     1 & \text{if $d_{t,x}=1$ }\\
      \frac{1}{q}  & \text{if $d_{t,x}=2$ }
    \end{cases}       
\end{equation}

where $d_{t,x}$ is the shortest-path distance between nodes $t$ and $x$. Note that $d_{t,x} \in \{0,1,2\}$, since $x$ is a neighbor of $v$ and $v$ is a neighbor of $t$.
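The definition of $\alpha_{p,q}(t,x)$ can be sketched directly in code. The example below assumes an unweighted, undirected toy graph stored as adjacency lists; `alpha`, `node2vec_probs`, and `toy_graph` are hypothetical names, and the distance $d_{t,x}$ is inferred from whether $x$ equals $t$ or is a neighbor of $t$.

```python
def alpha(p, q, d_tx):
    """Search bias for a candidate x at shortest-path distance d_tx from the previous node t."""
    if d_tx == 0:
        return 1.0 / p
    if d_tx == 1:
        return 1.0
    if d_tx == 2:
        return 1.0 / q
    raise ValueError("d_tx must be 0, 1, or 2")

def node2vec_probs(graph, t, v, p, q):
    """Normalized transition probabilities out of v, given the walk arrived from t."""
    t_neighbors = set(graph[t])
    weights = {}
    for x in graph[v]:
        if x == t:
            d_tx = 0
        elif x in t_neighbors:
            d_tx = 1
        else:
            d_tx = 2
        weights[x] = alpha(p, q, d_tx) * 1.0  # w_vx = 1 on an unweighted graph
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

# Hypothetical toy graph: s is shared by t and v; x1 and x2 are neighbors of v only.
toy_graph = {
    "t": ["v", "s"],
    "v": ["t", "s", "x1", "x2"],
    "s": ["t", "v"],
    "x1": ["v"],
    "x2": ["v"],
}
probs = node2vec_probs(toy_graph, t="t", v="v", p=2, q=0.5)
for node, prob in probs.items():
    print(node, round(prob, 3))
```

Here the unnormalized weights are 0.5 for `t`, 1 for `s`, and 2 for each of `x1` and `x2`, so $Z = 5.5$.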

Changing the parameters $p$ and $q$ changes how quickly the walk leaves the current neighborhood. In the example provided in the paper, the authors consider a walk that has just transitioned to node *v* from node *t*. It has three potential choices for its next step:

* Transition back to *t*, with a bias of $\alpha_{p,q}(t,x) = \frac{1}{p}$ applied.
* Transition to a shared node (a neighbor of both *t* and *v*), with a bias of 1 applied.
* Transition to an unshared node (a neighbor of *v* that is not a neighbor of *t*), with a bias of $\alpha_{p,q}(t,x) = \frac{1}{q}$ applied.

A lower $q$ and a higher $p$ therefore increase the likelihood of leaving the initial neighborhood of *t*. Setting $p = 1$ and $q = 1$ removes the bias and recovers the original random walk described above.
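As a worked example of this effect (the neighborhood and parameter values are illustrative assumptions, not taken from the paper), suppose *v* has four candidate next nodes on an unweighted graph: *t* itself, one shared node, and two unshared nodes. The unnormalized weights are $\frac{1}{p}$, $1$, $\frac{1}{q}$, $\frac{1}{q}$, so $Z = \frac{1}{p} + 1 + \frac{2}{q}$:

$$
\begin{aligned}
p = 4,\ q = 0.25: &\quad Z = 0.25 + 1 + 8 = 9.25, & P(\text{return to } t) &= \tfrac{0.25}{9.25} \approx 0.03, & P(\text{unshared}) &= \tfrac{8}{9.25} \approx 0.86 \\
p = 0.25,\ q = 4: &\quad Z = 4 + 1 + 0.5 = 5.5, & P(\text{return to } t) &= \tfrac{4}{5.5} \approx 0.73, & P(\text{unshared}) &= \tfrac{0.5}{5.5} \approx 0.09 \\
p = 1,\ q = 1: &\quad Z = 1 + 1 + 2 = 4, & P(\text{return to } t) &= \tfrac{1}{4} = 0.25, & P(\text{unshared}) &= \tfrac{2}{4} = 0.5
\end{aligned}
$$

The first setting pushes the walk outward, the second keeps it close to *t*, and the third treats all four candidates equally.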

A higher q value will decrease the likelihood of the current step moving to a node that neig


In [9]:
from_vals = pd.unique(full_from)
node_lists = {x:full_to[full_from == x].values for x in from_vals}
node_lists

{7188: array([1]),
 430: array([   1,   13,   59,  247,  831,  817, 1055, 7595, 7509]),
 3134: array([  1,  22,  27, 617]),
 3026: array([1]),
 3010: array([1]),
 804: array([   1,   25,   26,   85,  204, 7583, 1020]),
 160: array([   1,   18,   57,   89,  294, 7579,  952, 1845,  817,  945]),
 95: array([   1,    3,    4,    6,    7,    8,   11,   19,   24,   25,   26,
          29,   31,   32,   33,   36,   38,   40,   41,   42,   43,   47,
          56,   62,   67,   73,   75,   82,   92,   93,  188,  394, 1829,
         493,  526,  391,  315,  242,  331, 5679,  179,  221,  966,  345,
         411,  278, 2410, 3403,  245,  464, 1065, 2336,  191,  205,  105,
        1889,  154, 2953,  373, 3302, 1370,  666, 5342, 1874,  136, 3246,
         413,  246, 2358,  553, 3179, 1045,  332,  244, 1278,  104,  174,
        2330, 1307,  241, 7432, 7550,  172,  643, 2304,  111,  752,  941,
        1171,  318, 1348,  123,  185,  882,  813,  228,  396,  362,  428,
        7497,  103, 3134, 2257,  177

In [10]:
gen_step(430,node_lists)

1

In [11]:
 
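prev_node = 430
cur_node = gen_step(prev_node, node_lists)
prev_node_list = node_lists[prev_node]
cur_node_list = node_lists[cur_node]

# Neighbors of the current node that the previous node also links to
shared_nodes = list(set(cur_node_list) & set(prev_node_list))
# Neighbors of the current node that the previous node does not link to
unshared_nodes = list(set(cur_node_list) - set(prev_node_list))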

In [12]:
def gen_biased_step(cur_val, prev_val, dict_vals, p = 1, q = 1):
    # Candidates for the next step are the neighbors of the current node,
    # split into three groups by their distance from the previous node.
    cur_nbrs = set(dict_vals[cur_val])
    prev_nbrs = set(dict_vals[prev_val])
    # Distance 0: step back to the previous node (only if that edge exists).
    return_nodes = [prev_val] if prev_val in cur_nbrs else []
    # Distance 1: neighbors shared by the previous and the current node.
    shared_nodes = list((cur_nbrs & prev_nbrs) - {prev_val})
    # Distance 2: neighbors of the current node only.
    unshared_nodes = list(cur_nbrs - prev_nbrs - {prev_val})
    all_nodes = return_nodes + shared_nodes + unshared_nodes
    # Apply weightings to each group to change the likelihood of leaving the neighborhood.
    all_weights = [1/p]*len(return_nodes) + [1]*len(shared_nodes) + [1/q]*len(unshared_nodes)
    # random.choices returns a one-element list; callers take element [0].
    return random.choices(all_nodes, all_weights)

In [13]:
test = gen_biased_step(cur_val = 59, prev_val = 430,dict_vals = node_lists,p = 1, q = 1)
test

[247]

In [14]:
def gen_walk_biased(key_val, dict_vals, steps, p=1, q=1):
    walk_vals = [key_val]
    for i in range(steps - 1):
        cur_val = walk_vals[-1]
        # On the first step there is no previous node, so the start node is reused.
        prev_val = walk_vals[-2] if i > 0 else key_val
        walk_vals.append(
            gen_biased_step(
                cur_val = cur_val, prev_val = prev_val, dict_vals = dict_vals, p = p, q = q)[0])
    return walk_vals

# A biased random walk as described in the node2vec paper. With the default p = q = 1 every candidate step has the same weight, so the walk matches the unbiased walk in RW_DeepWalk above. The walks are returned as a list rather than a dictionary.
def RW_Biased(orig_nodes, to_vals, walk_length=3, p = 1, q = 1):
    from_vals = pd.unique(orig_nodes)
    node_lists = {x: to_vals[orig_nodes == x].values for x in from_vals}
    walks = [gen_walk_biased(key_val = x, dict_vals = node_lists, steps = walk_length, p = p, q = q) for x in node_lists]
    return walks

In [15]:
full_from = full_from.astype(str )
full_to = full_to.astype(str )

In [16]:
test = RW_Biased(full_from, full_to,walk_length =10,p = .5, q = .7)
test

[['7188', '1', '7557', '542', '48', '2114', '48', '270', '4', '131'],
 ['430', '817', '430', '1055', '430', '817', '125', '249', '25', '1031'],
 ['3134', '27', '754', '755', '3', '227', '8', '107', '7565', '18'],
 ['3026', '1', '35', '192', '171', '73', '163', '59', '494', '11'],
 ['3010', '1', '472', '1235', '2588', '1235', '328', '661', '1007', '448'],
 ['804', '7583', '804', '7583', '1020', '51', '40', '51', '1020', '166'],
 ['160', '945', '18', '613', '226', '75', '411', '67', '69', '256'],
 ['95', '179', '88', '461', '88', '15', '80', '698', '64', '133'],
 ['377', '363', '399', '363', '377', '3413', '377', '3413', '377', '399'],
 ['888', '1', '725', '1661', '725', '10', '1040', '10', '2204', '874'],
 ['89', '3774', '1684', '3774', '1247', '102', '1005', '102', '62', '42'],
 ['1901', '1', '296', '203', '70', '279', '177', '52', '2195', '52'],
 ['161', '7', '768', '2', '2446', '2', '51', '107', '32', '24'],
 ['256', '67', '15', '906', '96', '19', '1199', '459', '1884', '151'],
 ['35

## Creating a Node Embedding

Now that we've created a representation of the likelihood of getting to different nodes in the graph, we can turn to the methods used to represent the network as an embedding vector. This is an alternative to methods such as one-hot encoding of the results, which are extremely memory- and computation-intensive. In principle, we want to represent the "context" of each node, meaning its relationship to all other nodes, by mapping each node into a $w$-dimensional vector space. The length of the vector is a free choice: as it increases, the precision rises while the speed of the computation falls. Nodes in the immediate neighborhood of the current node are heavily favored, second-order connections less so, and completely unconnected nodes not at all.

This method was first explored in [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf), which provides two methods for natural language processing:

1. Continuous bag-of-words
2. Skip-gram models.

Both methods are valid and have their strengths and weaknesses, but we will rely on skip-gram models in this discussion. For skip-gram models, the node embedding is generated using a simple neural network. We will step through an independent implementation of this below which leans on TensorFlow, but [StellarGraph](https://www.stellargraph.io/) provides a straightforward interface to it as well.

### Step 1: Identify neighborhood for each node

This is the step that we discussed above by implementing the random walk and biased random walk methods. The choice has a key impact: the longer and more biased the random walk, the greater the range of connections we identify, at the risk of drawing in more tenuous connections.
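In a skip-gram setting, the neighborhood of a node is usually read off each walk with a sliding window. The sketch below assumes that convention; `skipgram_pairs` and the `window` parameter are hypothetical names, and the walk is a made-up example.

```python
def skipgram_pairs(walk, window):
    """Return (target, context) pairs for nodes at most `window` positions apart in the walk."""
    pairs = []
    for i, target in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, walk[j]))
    return pairs

walk = ["a", "b", "c", "d"]
pairs = skipgram_pairs(walk, window=1)
print(pairs)
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b'), ('c', 'd'), ('d', 'c')]
```

A larger `window` plays a role similar to a longer walk: it pulls in more distant, and possibly more tenuous, connections.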


### Step 2: Map neighborhood values to one-hot encodings:

The neighborhoods are used to generate vectors which encode the relationship of nodes. This includes a one-hot vector for the target node, and a set of one-hot vectors for the neighboring nodes.
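A minimal sketch of this encoding, assuming the walks are lists of node labels: each distinct node gets an integer index, and its one-hot vector has a single 1 at that index. The helpers `build_vocab` and `one_hot` are hypothetical names.

```python
import numpy as np

def build_vocab(walks):
    """Assign each distinct node label an integer index (sorted for reproducibility)."""
    nodes = sorted({node for walk in walks for node in walk})
    return {node: idx for idx, node in enumerate(nodes)}

def one_hot(node, vocab):
    """Vector of length len(vocab) with a single 1 at the node's index."""
    vec = np.zeros(len(vocab))
    vec[vocab[node]] = 1.0
    return vec

walks = [["430", "817", "430", "1055"]]
vocab = build_vocab(walks)
print(vocab)                  # {'1055': 0, '430': 1, '817': 2}
print(one_hot("430", vocab))  # [0. 1. 0.]
```

The length of each vector equals the number of distinct nodes, which is why working with one-hot vectors directly becomes expensive on large graphs.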

### Step 3: Perform Optimization:

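The following procedure is used for each one-hot vector, where $N$ is the number of nodes and $w$ is the embedding dimension:

1. The $1 \times N$ one-hot vector is multiplied by an $N \times w$ weight matrix, which selects the row of the matrix that belongs to that node.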
2. 
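The first step above can be checked numerically. In this sketch the weight matrix `W_in` is filled with random values purely for illustration (it is not a trained embedding), and the sizes `N = 5` and `w = 3` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, w = 5, 3                      # number of nodes, embedding dimension
W_in = rng.normal(size=(N, w))   # hypothetical N x w weight matrix

x = np.zeros(N)
x[2] = 1.0                       # one-hot vector for the node with index 2

hidden = x @ W_in                # (1 x N) times (N x w) gives a 1 x w vector
print(hidden.shape)              # (3,)
print(np.allclose(hidden, W_in[2]))  # True: the product is just row 2 of W_in
```

Because the product only picks out one row, implementations typically replace the multiplication with a direct row lookup.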



In [24]:
def gen_auto_encoders(node_lists):
    


In [20]:
# RW_Biased already returns a list of walks, so each of the 10 runs can be kept as is.
biased_rw_for_training = [RW_Biased(full_from, full_to, walk_length=10, p=.5, q=2) for i in range(10)]

In [None]:
# Flatten the 10 runs into a single list of walks (one "sentence" per walk).
final_results = []

for i in biased_rw_for_training:
    final_results.extend(i)

In [None]:
biased_rw_for_training[0][0]

In [None]:
from stellargraph.data import BiasedRandomWalk
from stellargraph import StellarGraph
from stellargraph import datasets
from IPython.display import display, HTML
from gensim.models import Word2Vec
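# Train on all walks. gensim >= 4.0 uses vector_size and epochs; older releases called these size and iter.
model = Word2Vec(final_results, vector_size=128, window=5, min_count=0, sg=1, workers=2, epochs=1)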
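Once each node has an embedding vector, the vectors can be fed to a standard method such as a nearest-neighbor lookup. The sketch below uses cosine similarity on a hypothetical dictionary `embeddings` whose two-dimensional values are invented for illustration; they are not output from the model above.

```python
import numpy as np

def most_similar(node, embeddings, top_n=2):
    """Rank the other nodes by cosine similarity to `node`."""
    target = embeddings[node]
    scores = {}
    for other, vec in embeddings.items():
        if other == node:
            continue
        cosine = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        scores[other] = float(cosine)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Invented vectors, for illustration only.
embeddings = {
    "430": np.array([1.0, 0.0]),
    "817": np.array([0.9, 0.1]),
    "1055": np.array([0.0, 1.0]),
}
print(most_similar("430", embeddings, top_n=1))
```

In this toy example node `817` ranks first because its vector points in almost the same direction as that of node `430`.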

## References:

1. [NRL Tutorial Part 1](http://snap.stanford.edu/proj/embeddings-www/files/nrltutorial-part1-embeddings.pdf)