# Link prediction - Exploration

In this notebook you will get acquainted with the problem of link prediction. Roughly speaking, a link prediction method takes a graph $G = (V, E)$ whose edge set $E$ is *incomplete*, and its task is to point out the edges that are most likely missing from $E$. We will start by looking at a simple graph in which you can probably see exactly where the missing links are.
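To make this setting concrete, here is a minimal sketch of how an incomplete graph can arise. It assumes a toy cycle graph and a random choice of two hidden edges; neither is taken from the files used in the exercises, and the variable names are only illustrative.

```python
import random
import networkx as nx

# Toy "complete" graph: a cycle on 8 vertices (an assumption for illustration only).
G_full = nx.cycle_graph(8)

# Hide two edges to obtain the incomplete graph that a method gets to observe.
rng = random.Random(0)  # arbitrary seed, only for reproducibility
hidden = rng.sample(sorted(G_full.edges()), 2)
G_obs = G_full.copy()
G_obs.remove_edges_from(hidden)

# Every unconnected vertex pair of the observed graph is a candidate missing edge.
candidates = list(nx.non_edges(G_obs))
print("hidden edges:", hidden)
print("number of candidate pairs:", len(candidates))
```

A link prediction method only sees `G_obs` and has to pick out the hidden edges from among all candidate pairs.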

**Exercise 1.** Write code to read the graph in the file ```graph_with_missing_links_1.gz```, and plot it. In this graph some edges are missing. Try to identify, simply by looking at the graph, where edges could be missing. What is the reason behind your choice?

In [None]:
import networkx as nx

#Your answer goes here

**Exercise 2.** Write a function that puts your reasoning from the previous exercise into practice. Ensure it outputs the *one* edge that is most likely missing. Test your code on the previous graph. Does it output an edge you expected? If it did not: can you explain why?

In [None]:
def findMissingEdge(G):
    edge = [0, 0]
    
    # Your answer goes here
    
    return edge

The method you have just found probably works well on the previous graph, because that graph is quite regular. Maybe you can even guess how we generated it. Of course, real-life networks are often less regular, so your method might not be well suited for other networks. Let us try your method on a network that is less regular.

**Exercise 3.** Try your method on the graph in the file ```graph_with_missing_links_2.gz```. We reveal that we have removed the links $\{0, 5\}$, $\{0, 6\}$ and $\{0, 7\}$. Did your method output one of these edges?

In [None]:
#Your answer goes here

We reveal that the graph of the previous exercise was generated with a stochastic block model with block sizes $\vec{n} = (15, 15)$ and edge probability matrix
$$P = \begin{pmatrix}
0.8 & 0.05 \\
0.05 & 0.4
\end{pmatrix}.$$
Note that every removed edge connected two vertices of block 1.
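For reference, a graph with these parameters can be sampled with `nx.stochastic_block_model`. The sketch below is not how the file was necessarily produced: the seed is arbitrary and the names `G_sbm`, `within` and `between` are only illustrative.

```python
import networkx as nx

sizes = [15, 15]        # block sizes, the vector n
P = [[0.8, 0.05],
     [0.05, 0.4]]       # edge probabilities within and between blocks

# The seed is an arbitrary choice for reproducibility (the original seed is unknown).
G_sbm = nx.stochastic_block_model(sizes, P, seed=0)

# networkx stores the block of every vertex in the node attribute "block".
# With default labels, vertices 0..14 form the first block and 15..29 the second.
block_of = nx.get_node_attributes(G_sbm, "block")
within = sum(1 for u, v in G_sbm.edges() if block_of[u] == block_of[v])
between = G_sbm.number_of_edges() - within
print("edges within blocks:", within)
print("edges between blocks:", between)
```

Because the diagonal entries of $P$ are much larger than the off-diagonal ones, you should expect far more edges within blocks than between them.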

**Exercise 4.** Look back at the edge your method output in Exercise 3. Does it connect vertices of the same block? If it does not, can you think of a reason why? Adapt your method so that it is better suited for stochastic block models, and test it on the graph from Exercise 3. Ensure your improved method does *not* take $\vec{n}$ or $P$ as input.

In [None]:
def predictLinkSBM(G):
    edge = [0, 0]
    
    # Your answer goes here
    
    return edge

It is actually pretty weird that your methods only output *one* edge. In the examples before, multiple edges were missing, and in practice you never know how many edges are missing. Moreover, it could be that the first edge your algorithm outputs is wrong while the subsequent ones are correct. Therefore, it makes more sense to give every candidate edge a *score* (called its **reliability**) that measures how "certain" you can be that this edge is missing from the graph.
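Once every vertex pair has a reliability, the candidates can be sorted from most to least likely. The following is a minimal sketch of that ranking step only. It assumes vertices are labelled $0, \dots, n-1$ and that the scores are stored in a symmetric matrix; the helper `rank_candidate_edges` and the toy scores are made up for illustration and say nothing about how good scores should be computed.

```python
import numpy as np
import networkx as nx

def rank_candidate_edges(G, reliability):
    """Return the non-edges of G sorted from highest to lowest reliability.

    Assumes nodes are labelled 0..n-1 and reliability[u, v] scores the pair {u, v}.
    """
    scored = []
    for u, v in nx.non_edges(G):
        edge = (min(u, v), max(u, v))  # normalise the orientation of the pair
        scored.append((edge, float(reliability[u, v])))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored

# Toy graph: a path 0-1-2-3, with hand-picked (hypothetical) scores.
G_toy = nx.path_graph(4)
scores = np.zeros((4, 4))
scores[0, 2] = scores[2, 0] = 0.7
scores[1, 3] = scores[3, 1] = 0.9
scores[0, 3] = scores[3, 0] = 0.1

ranking = rank_candidate_edges(G_toy, scores)
# Position (0 = top) of the pair {1, 3} in the ranking.
position = [edge for edge, _ in ranking].index((1, 3))
print(ranking)
print("position of {1, 3}:", position)
```

In this toy example the pair $\{1, 3\}$ has the highest score and therefore ranks first. The same lookup can be used to see how high a known missing edge ends up in your own ranking.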

**Exercise 5.** Rework both your general method and your method for stochastic block models so that they give each candidate edge a reliability. Test your methods again on the two graphs. How high do the actual missing edges rank?

In [None]:
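import numpy as np

def findMissingEdgeReliable(G):
    # Dense float adjacency matrix, so the reliabilities can be real-valued
    Adj = nx.to_numpy_array(G)
    reliabilityMatrix = np.zeros_like(Adj, dtype=float)
    
    # Your answer goes here
    
    return reliabilityMatrix

def predictLinkSBMReliable(G):
    # Dense float adjacency matrix, so the reliabilities can be real-valued
    Adj = nx.to_numpy_array(G)
    reliabilityMatrix = np.zeros_like(Adj, dtype=float)
    
    # Your answer goes here
    
    return reliabilityMatrix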

We end by testing your new methods on a real-life network. From this network we have removed the edges $\{13, 33\}$ and $\{2, 3\}$.

**Exercise 6.** Load the graph ```graph_with_missing_links_3.gz```, and apply both of your link prediction methods to this graph. Which method performed better? Think of reasons why this method performed better on this real-life network. Can you think of real-life networks where the other method might have been better?

In [None]:
#Your answer goes here.