
### Link prediction: how to execute and evaluate methods on a real-world dataset?
#### Objectives: 
We should address the following points:

- What specific method(s) did you use from the class of methods you have investigated to find ***missing/spurious links***?
- Why did you ***choose*** this specific method?
- Which links were ***missing according to your method***?
- Does it ***seem logical*** that these links were missing, or does it seem like an artifact of the method used?
- How can you be sure that what you found ***is the truth***?
- How well did your method ***perform***?

#### Dataset: 
We are going to study the Jazz musicians dataset. Some statistics about the dataset:

- Number of nodes: 198 
- Edge Volume: 2742 
- Max Degree: 100
- Average Degree: 27.7
- Median Node Pair Distance: 2
- Mean Node Pair Distance: 2.2
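
The statistics above can be recomputed from any undirected edge list. The following is a minimal sketch that uses only the standard library and a toy edge list; the helper names (`build_adjacency`, `pair_distances`, `describe`) are hypothetical and are not part of `lib.utils`.

```python
from collections import defaultdict, deque
from statistics import mean, median


def build_adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops and duplicates."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def pair_distances(adj):
    """Shortest-path length of every connected unordered node pair (BFS)."""
    nodes = sorted(adj)
    position = {node: i for i, node in enumerate(nodes)}
    distances = []
    for source in nodes:
        seen = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbour in adj[current]:
                if neighbour not in seen:
                    seen[neighbour] = seen[current] + 1
                    queue.append(neighbour)
        # Keep each unordered pair once: only targets ordered after the source
        distances.extend(
            d for node, d in seen.items() if position[node] > position[source]
        )
    return distances


def describe(edges):
    """Summarise an edge list with the same quantities listed above."""
    adj = build_adjacency(edges)
    degrees = [len(neighbours) for neighbours in adj.values()]
    distances = pair_distances(adj)
    return {
        "nodes": len(adj),
        "edges": sum(degrees) // 2,
        "max_degree": max(degrees),
        "average_degree": mean(degrees),
        "median_distance": median(distances),
        "mean_distance": mean(distances),
    }


# Toy graph: a 4-cycle with one chord (assumed data, not the Jazz network)
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
stats = describe(toy_edges)
print(stats)
```

As a consistency check on the reported figures, an undirected graph with 198 nodes and 2742 edges has an average degree of 2 × 2742 / 198 ≈ 27.7.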

In [1]:
from lib.utils import jazz_generator

k = 10 
k_folds = jazz_generator(k=k)
print(f"Number of folds: {len(k_folds)}")
for fold in k_folds: print(fold.shape)

Number of folds: 10
(548, 2)
(548, 2)
(548, 2)
(548, 2)
(548, 2)
(548, 2)
(548, 2)
(548, 2)
(548, 2)
(548, 2)
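
For readers without access to `lib.utils`, the idea behind the k-fold split can be sketched as follows. This is an assumption about the general technique, not the actual implementation of `jazz_generator` or `train_probe_split`: the helper names `k_fold_edges` and `split_train_probe` are hypothetical, and dropping the leftover edges to get equally sized folds is a choice made here for illustration.

```python
import random


def k_fold_edges(edges, k, seed=0):
    """Shuffle the edge list and cut it into k equally sized folds.

    Edges left over after the integer division are dropped (an assumption
    made here so that every fold has the same size).
    """
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    fold_size = len(shuffled) // k
    return [shuffled[i * fold_size:(i + 1) * fold_size] for i in range(k)]


def split_train_probe(folds, index):
    """Use fold `index` as the probe set and all remaining folds as training."""
    probe = folds[index]
    train = [edge for i, fold in enumerate(folds) if i != index for edge in fold]
    return train, probe


# Toy edge list with 23 edges (assumed data, not the Jazz network)
toy_edges = [(i, i + 1) for i in range(23)]
toy_folds = k_fold_edges(toy_edges, k=5)
toy_train, toy_probe = split_train_probe(toy_folds, index=0)
print(len(toy_folds), len(toy_train), len(toy_probe))
```

Each fold serves as the probe set exactly once, so every held-out edge is scored by a model that never saw it during training.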


### Similarity Indices:
#### Local Similarity Indices:
The implemented indices and their corresponding functions are:


- Common Neighbours:        ***cn_score()***
- Leicht-Holme-Newman:      ***leight_holme_newman()***
- Preferential Attachment:  ***preferential_attachment_wrapper()***
- Jaccard:                  ***jaccard_wrapper()***
- Adamic Adar:              ***adamic_adar_wrapper()***
- Resource Allocation:      ***resource_allocation_wrapper()***
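
To make the six indices concrete, here is a minimal sketch of their textbook definitions on a plain adjacency map. It is not the code behind the `lib.local_similarity` functions listed above; the function `local_scores` and the toy graph are assumptions made for illustration. For a candidate pair (x, y) with neighbour sets Γ(x) and Γ(y) and degrees k_x and k_y, each index is computed as noted in the comments.

```python
import math
from itertools import combinations


def local_scores(adj):
    """Score every non-adjacent node pair with six local similarity indices."""
    scores = {}
    for x, y in combinations(sorted(adj), 2):
        if y in adj[x]:
            continue  # only pairs that are not yet linked are candidates
        common = adj[x] & adj[y]
        union = adj[x] | adj[y]
        kx, ky = len(adj[x]), len(adj[y])
        scores[(x, y)] = {
            # |Γ(x) ∩ Γ(y)|
            "common_neighbours": len(common),
            # |Γ(x) ∩ Γ(y)| / (k_x * k_y)
            "leicht_holme_newman": len(common) / (kx * ky) if kx * ky else 0.0,
            # k_x * k_y
            "preferential_attachment": kx * ky,
            # |Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)|
            "jaccard": len(common) / len(union) if union else 0.0,
            # sum over common neighbours z of 1 / log(k_z); k_z >= 2 here
            "adamic_adar": sum(1 / math.log(len(adj[z])) for z in common),
            # sum over common neighbours z of 1 / k_z
            "resource_allocation": sum(1 / len(adj[z]) for z in common),
        }
    return scores


# Toy graph: a 4-cycle with one chord (assumed data, not the Jazz network)
toy_adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
toy_scores = local_scores(toy_adj)
print(toy_scores[(1, 3)])
```

In the toy graph the only unlinked pair is (1, 3), which shares both of its neighbours (0 and 2), so Common Neighbours gives 2 and Jaccard gives 1.0.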



In [2]:
from lib.utils import train_probe_split, from_el_to_nx
from lib.utils import accuracy_metric, auc_metric
from lib.local_similarity import cn_score

def execute(similarity_fun):
    """Return the mean accuracy and mean AUC of `similarity_fun` over all folds."""
    accuracy, auc = [], []
    for index in range(len(k_folds)):
        # Get the train and probe split for this fold
        train_edge_list, probe_edge_list = train_probe_split(k_folds, index)

        # Transform the train edge list into a networkx Graph
        train_G, _ = from_el_to_nx(train_edge_list)

        # Compute the similarity scores on the train Graph
        scores = similarity_fun(train_G)

        accuracy.append(accuracy_metric(scores, probe_edge_list))
        auc.append(auc_metric(scores, train_edge_list, probe_edge_list))

    # Average over the folds actually evaluated rather than the global k
    n_folds = len(k_folds)
    mean_accuracy, mean_auc = sum(accuracy) / n_folds, sum(auc) / n_folds
    return mean_accuracy, mean_auc

mean_accuracy, mean_auc = execute(cn_score)

In [3]:
print(mean_accuracy,mean_auc)

0.07664233576642335 0.26999999999999996
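
To help interpret these two numbers, the sketch below shows the definitions of precision and AUC that are commonly used in link prediction. It is an assumption that `accuracy_metric` and `auc_metric` follow the same idea; their actual implementation in `lib.utils` is not shown here, and the names `auc_exact` and `precision_at_l` are hypothetical. AUC is the probability that a randomly chosen probe edge scores higher than a randomly chosen nonexistent edge, with ties counted as one half. Precision is the fraction of the top-L ranked pairs that are probe edges.

```python
def auc_exact(scores, probe_edges, nonexistent_edges):
    """AUC over all (probe edge, nonexistent edge) comparisons; ties count 0.5."""
    higher = ties = 0
    for probe in probe_edges:
        for missing in nonexistent_edges:
            if scores[probe] > scores[missing]:
                higher += 1
            elif scores[probe] == scores[missing]:
                ties += 1
    total = len(probe_edges) * len(nonexistent_edges)
    return (higher + 0.5 * ties) / total


def precision_at_l(scores, probe_edges, l):
    """Fraction of the l highest-scored pairs that are true probe edges."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:l]
    probe = set(probe_edges)
    return sum(1 for pair in ranked if pair in probe) / l


# Toy scores for four candidate pairs (assumed data, not the Jazz network)
toy_scores = {("a", "b"): 3, ("a", "c"): 1, ("b", "d"): 2, ("c", "d"): 0}
toy_probe = [("a", "b"), ("a", "c")]
toy_nonexistent = [("b", "d"), ("c", "d")]

toy_auc = auc_exact(toy_scores, toy_probe, toy_nonexistent)
toy_precision = precision_at_l(toy_scores, toy_probe, l=2)
print(toy_auc, toy_precision)
```

Under this definition a scorer that ranks pairs at random obtains an AUC of about 0.5, which is the usual baseline against which a reported AUC is read.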
