Howuhh/link_pred_spark

similarity between graph nodes based on local information with PySpark

Similarity between nodes

This module allows you to calculate similarity metrics based on local information for the nodes of a graph, rank candidates by keeping only the N most similar ones, and easily add new metrics.

For each node, its friends of friends are selected as candidates. The final dataset with metrics can be used to train machine learning models that predict edges likely to appear between nodes in the future.
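To make the candidate-generation step concrete, here is a minimal plain-Python sketch (illustrative only, not the repository's Spark job): candidates for a node are its friends of friends that are not already direct neighbors.

```python
def fof_candidates(neighbors, node):
    """Return friend-of-friend candidates for `node`.

    `neighbors` is a dict {node_id: [list_of_neighbors]}.
    A candidate must be reachable in two hops, must not be the
    node itself, and must not already be a direct neighbor.
    """
    direct = set(neighbors[node])
    candidates = set()
    for friend in direct:
        for fof in neighbors.get(friend, []):
            if fof != node and fof not in direct:
                candidates.add(fof)
    return candidates

graph = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}
print(fof_candidates(graph, 1))  # {4}
```

In the actual module this step is distributed over Spark, but the candidate set per node is the same.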

For additional information about the metrics, see the paper "Predicting Missing Links via Local Information" (Zhou, Lü and Zhang).

How to use

The input graph data should be in edge-list format. To compute the metrics, you need to specify the path to the source data and the path where the final dataset will be saved.

Usage: extract_features.py [OPTIONS] IN_PATH OUT_PATH

  This module allows you to calculate similarity metrics based on local
  information for the nodes of the graph.

  You should specify IN_PATH for the input graph data in edgelist format and
  OUT_PATH where the final dataset with metrics will be saved.
  Metrics and Spark configuration settings should be specified in config.py.

Options:
  --top_n INTEGER   The number of most similar candidates to leave for final
                    dataset for each node. All other candidates will be
                    filtered.
  --csv_delim TEXT  CSV file separator, should be a string.
  --help            Show this message and exit.
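For example, a hypothetical invocation (the paths and option values below are illustrative, not part of the repository):

```shell
# Compute metrics for the graph in data/edges.csv, keep the 10 most
# similar candidates per node, and save the result to data/features.
python extract_features.py --top_n 10 --csv_delim "," data/edges.csv data/features
```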

How to add new metrics

Each metric function must take a dictionary of the form {node_id: [list_of_neighbors]} and a pair of nodes to compare.

Example:
Let's add a new metric based on resource allocation. To do this, we define a function that matches the argument requirements above.

# linkpred/metrics.py
import numpy as np

def res_allocation_score(neighbors, node1, node2):
    """
    Similarity measure based on resource allocation.
    For a detailed explanation see the paper by Tao Zhou,
    Linyuan Lü and Yi-Cheng Zhang: https://arxiv.org/abs/0901.0553

    Parameters
    ----------
    neighbors: dict
        Dictionary in format {node_id: [list_of_neighbors]}
    node1: int
        Node id
    node2: int
        Node id

    Returns
    -------
    score: float
        Nodes similarity score.
    """
    common = _common_neighbors(neighbors, node1, node2)
    degrees = _common_degree(neighbors, common)

    # Resource allocation sums the reciprocal degrees of the
    # common neighbors, not the degrees themselves.
    score = np.sum(1.0 / degrees)

    return score

We also use the helper functions from the metrics module to compute the common neighbors of the two nodes and the degrees of those common neighbors.

The only thing left to do is to add the metric's name to the config, which in this case is just a list or dictionary.

# config.py
METRICS = [
    "common_neighbors_score",
    "adamic_adar_score",
    "res_allocation_score",
]

That's all there is to it. The order of metrics in the config is preserved in the final dataset, so this metric will be the last column.

Spark config

To change Spark configuration settings, you can also edit the config. For example, let's set the number of cores to two.

# config.py
CONF_PARAMS = {
    "spark.master": "local[2]",
    "spark.app.name": "linkpred_spark",
}
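These parameters are presumably applied to the session builder when the Spark session is created; a minimal sketch of how such a dictionary can be consumed (the variable names besides CONF_PARAMS are illustrative):

```python
from pyspark.sql import SparkSession

CONF_PARAMS = {
    "spark.master": "local[2]",
    "spark.app.name": "linkpred_spark",
}

# Apply every key/value pair from the config dict to the builder.
builder = SparkSession.builder
for key, value in CONF_PARAMS.items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()
```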

Available metrics

By default, these metrics are available:

  • Common Neighbours
  • Salton Index
  • Jaccard Index
  • Sørensen Index
  • Hub Promoted Index
  • Leicht-Holme-Newman Index
  • Preferential Attachment
  • Adamic-Adar Index
  • Resource Allocation

P.S.

A more optimized implementation is possible in which the data does not have to be returned to the driver to calculate metrics for pairs. This can be achieved by generating triplets and computing partial sums.
