# HandIn 3 - Embeddings

In this hand-in you will implement and compare two different embedding methods on a classification task. 
We will use the BlogCatalog graph, which can be found in the zip file with the data. 

The purpose is to implement two random-walk-based approaches: (1) one based on Word2Vec, which learns a low-dimensional representation of the data, and (2) a variant of VERSE using Noise Contrastive Estimation. 

For simplicity, we will use the Gensim library, which can be installed with pip using the command: 

```
pip install gensim
```

**Start early, and use the Study Cafe and the Discussion Board. Depending on how the random walks are generated from the given input, and because there may be numerical issues, your results might not be exactly the same as the ones we expect.** 


## Embeddings based on Word2Vec

In this exercise you must implement the main methods for generating random walks on a graph. We have provided starter code in the file **rwemb.py**.

Each random walk starts from an initial node and traverses the graph for a fixed number of steps. A fixed number of walks is generated for each node. 
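The walk generation described above can be sketched as follows. This is a minimal illustration and not the starter code: the function names mirror the methods to complete in rwemb.py, but the signatures, the edge-list input, the `seed` handling, and the choice to count nodes (rather than steps) in `walk_length` are all assumptions.

```python
import random


def build_adjacency_list(edges):
    """Map each node to the list of its neighbours (graph assumed undirected)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def random_walk(adj, start, walk_length, rng):
    """Walk from `start`, moving to a uniformly random neighbour at each step."""
    walk = [start]
    while len(walk) < walk_length:
        neighbours = adj.get(walk[-1], [])
        if not neighbours:  # isolated node: nothing to walk to
            break
        walk.append(rng.choice(neighbours))
    return walk


def build_graph_walks(adj, walks_per_node, walk_length, seed=0):
    """Generate `walks_per_node` walks starting from every node."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a different order each pass
        for node in nodes:
            walks.append(random_walk(adj, node, walk_length, rng))
    return walks


# Toy graph (hypothetical): a triangle 0-1-2 with a pendant node 3.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
adj = build_adjacency_list(edges)
walks = build_graph_walks(adj, walks_per_node=2, walk_length=5)
```

Each walk is a list of node ids; a list of such walks plays the role that a list of sentences plays for Word2Vec.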

Here you should complete the following methods.

* build_adjacency_list 
* random_walk
* build_graph_walks

All needed theory can be found in the slides.

You can test your implementation by running 

```
python test.py /path/to/rwemb.w2v /path/to/verse.npy
```

The script test.py contains the classifier; run it once both models are trained. 

Gradient computations sometimes suffer from numerical issues. Moreover, the methods are randomized, so even though the seed is fixed there may be differences in the results. The classification score for 10% of the labels should be around 0.32 in micro F1.
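For reference, micro F1 pools true positives, false positives, and false negatives over all nodes and labels before computing precision and recall. The sketch below uses the standard definition on label sets; how test.py computes its score is not shown here, so the function name and the input format are assumptions.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1 over per-node label sets."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # labels predicted and correct
        fp += len(pred - truth)   # labels predicted but wrong
        fn += len(truth - pred)   # labels missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical labels for three nodes).
truth = [{0, 2}, {1}, {3}]
pred = [{0}, {1}, {2}]
score = micro_f1(truth, pred)  # tp=2, fp=1, fn=2, so F1 = 4/7
```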

### Report 

Add a section called "PART I: Word2Vec embeddings" with subsections "Code" and "Theory" to your report. In the "Code" subsection you should have the following subsections:

* Summary and Results: Include the table generated by test.py, including the in-sample and test accuracy you achieve. Add at most two lines explaining the results and comment on anything you believe sticks out. Explain if anything does not work.
* Actual code: include in your hand-in code snippets of random_walk and build_graph_walks

Furthermore you must answer the following theoretical questions

### Theoretical questions

1. What is the complexity of the method given the number of nodes in the graph, the number of edges, the length of the random walks, the number of walks per node, and the dimension of the embeddings? 

2. Connected components: Suppose the graph has two connected components (meaning two subgraphs with no edges between them). What do you expect the embeddings to look like? Why? 

3. Assuming your alpha parameter is set to a very low value (e.g. 0.01), what do you expect the embeddings to look like? **HINT: think about how a random walk works and the fact that the embedding represents a similarity. The slides should help you.**


## Embeddings based on sampled VERSE

In this last exercise we will work with sampling techniques, which considerably speed up the computation of VERSE. Sampling is handy because it speeds up the computation with only a small loss in precision. 

The idea is that some similarity measures can be sampled. In the case of PPR (personalized PageRank), this is possible by running random walks on the graph. Every random walk defines a sequence of nodes of a fixed length.
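One common way to draw a sample from the PPR distribution of a source node is sketched below. Note that it uses a walk that stops with probability `1 - alpha` at every step (so its length is random), which is a different variant from the fixed-length walks mentioned above. The function name, the dictionary-based adjacency list, and the meaning of `alpha` as the probability of continuing the walk are assumptions made for illustration.

```python
import random
from collections import Counter


def sample_ppr_neighbour(adj, source, alpha, rng):
    """Return the node at which a walk started from `source` stops.

    At every step the walk continues to a uniformly random neighbour with
    probability `alpha` and stops with probability 1 - alpha.
    """
    node = source
    while rng.random() < alpha and adj[node]:
        node = rng.choice(adj[node])
    return node


# Toy graph (hypothetical): a triangle 0-1-2 with a pendant node 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(0)
n_samples = 20000
counts = Counter(sample_ppr_neighbour(adj, 0, 0.85, rng) for _ in range(n_samples))

# Empirical estimate of the PPR row of node 0.
estimate = {node: counts[node] / n_samples for node in adj}
```

The more samples are drawn, the closer `estimate` gets to the corresponding row of the PPR matrix, without that matrix ever being stored.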

Even simpler, one can define a simplified version of the loss function, which transforms the problem of learning similarity functions row by row into the problem of classifying whether pairs of nodes come from the sampled similarity or from a random distribution. Samples from the noise distribution are called **negative samples**. 

There are a couple of techniques that produce the same result, *Negative Sampling* and *Noise Contrastive Estimation*; the difference is that the latter converges to the gradient of the cross-entropy loss. 

$$
\mathcal{L}_{NCE} =\sum_{u \sim \mathcal{P}, v \sim \operatorname{sim}_{\mathrm{G}}(u, \cdot)}\left[\log \operatorname{Pr}_{W}\left(C=1 | \operatorname{sim}_{\mathrm{E}}(u, v)\right)+  k \mathbb{E}_{\widetilde{v} \sim Q(u)} \log \operatorname{Pr}_{W}\left(C=0 | \operatorname{sim}_{\mathrm{E}}(u, \widetilde{v})\right)\right]
$$

where $\operatorname{sim}_{\mathrm{E}}(v, \cdot)=\frac{\exp \left(Z_{v} Z^{\top}\right)}{\sum_{i=1}^{n} \exp \left(Z_{v} \cdot Z_{i}\right)}$, $k$ is the number of negative samples, $C$ is a binary random variable for classification, and $Q(\cdot)$ is a noise distribution. 
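To make the objective concrete, the sketch below evaluates the term contributed by one positive pair $(u, v)$ and a list of negative samples. It assumes that $\operatorname{Pr}_{W}(C=1 \mid \cdot)$ is modelled as the sigmoid of the dot product of the two embedding vectors, and that the expectation over $Q$ is estimated by summing over the $k$ drawn negatives. The helper names `dot` and `nce_term` are hypothetical and are not part of verseemb.py.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def nce_term(z_u, z_v, z_negatives):
    """Objective term for one positive pair and its negative samples.

    Assumption: Pr(C=1 | u, v) = sigmoid(z_u . z_v), hence
    Pr(C=0 | u, v') = 1 - sigmoid(z_u . z_v') = sigmoid(-z_u . z_v').
    """
    positive = math.log(sigmoid(dot(z_u, z_v)))
    negative = sum(math.log(sigmoid(-dot(z_u, z_n))) for z_n in z_negatives)
    return positive + negative


# Toy embeddings (hypothetical): v is aligned with u, the negative is opposite.
value = nce_term([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
```

The term is largest (closest to zero) when positive pairs have a large dot product and negative pairs a very negative one.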

### Your task

Implement the main methods for VERSE with Noise Contrastive Estimation in the file **verseemb.py**. 
The functions to complete are

* pagerank_matrix
* sigmoid
* verse
* update

The description of the method is in the file. Remember that sampled VERSE uses negative samples, so the embeddings are updated with the gradients at each iteration through the update method. The derivation of the gradients should also be written and explained in the report. Sampled VERSE runs for a fixed number of iterations, sampling pairs of nodes from the distribution obtained by PageRank and from the noise distribution. 
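The sampling step of one iteration could look roughly like the sketch below. It only draws the nodes and does not perform the gradient update. The function name, the list-of-rows representation of the similarity matrix, and the choice of uniform distributions for both the source node and the noise distribution are assumptions, so check the description in verseemb.py for the intended behaviour.

```python
import random


def sample_training_tuple(similarity, num_negatives, rng):
    """Draw a source node, one positive node, and `num_negatives` noise nodes.

    `similarity` is a row-stochastic matrix given as a list of rows, where
    row u holds the similarity distribution of node u (e.g. its PPR row).
    """
    n = len(similarity)
    u = rng.randrange(n)                                       # source, uniform (assumed)
    v = rng.choices(range(n), weights=similarity[u], k=1)[0]   # positive ~ sim_G(u, .)
    negatives = [rng.randrange(n) for _ in range(num_negatives)]  # noise, uniform (assumed)
    return u, v, negatives


# Toy similarity matrix (hypothetical values, each row sums to 1).
similarity = [
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
    [0.0, 0.4, 0.6],
]
rng = random.Random(0)
u, v, negatives = sample_training_tuple(similarity, num_negatives=3, rng=rng)
```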


### Report 

Add a section called "PART II: Sampled VERSE" with subsections "Code" and "Theory" to your report. In the "Code" subsection you should have the following subsections:

* Summary and Results: Include the table generated by test.py, including the in-sample and test accuracy you achieve. Add at most two lines explaining the results and comment on anything you believe sticks out. Explain if anything does not work.
* Actual code: include in your hand-in code snippets of verse and update

Furthermore you must answer the following theoretical questions

1. Compute the gradient of the loss above **for each individual pair of nodes**. 
2. What is the running time of sampled VERSE? 
3. Can you think of a way to skip the generation of the PageRank matrix while still preserving the correct distribution? **HINT: think of what other methods have done.**



## Uploading to Blackboard

Upload **one PDF** with the report to Blackboard, together with a zip file containing the rwemb.py and verseemb.py files.

**Ensure you upload the pdf separately!**

**Remember to put your names and student ids inside the pdf report!**

The PDF should be **at most** 5 pages!

