# Recommending Peer Reviewers Based on Existing Concepts

The following experiment attempts to recommend peer reviewers for new research based on tagged OmniScience concepts. The underlying technology is Neo4j with the Neosemantics (n10s) plugin. This notebook documents the data-loading process and the experiments, all of which are run through the py2neo library.

In [1]:
from py2neo import Graph
from py2neo.matching import NodeMatcher, RelationshipMatcher
from py2neo.data import Node, Relationship
import numpy as np
import os
import itertools

### Neo4j Connection

We connect to a linked Neo4j Docker instance within our bridged network. Since this is a local demonstration, security is not a concern.

In [2]:
uri = 'http://neo4j:7474/'
graph = Graph(uri, user='neo4j', password='test', name='test')

### Neosemantics Configuration

We call the Neosemantics procedure `n10s.graphconfig.init()` to initialize the graph configuration for dealing with Linked Data, then display the resulting settings.

In [3]:
# If the graph isn't empty, don't initialize the graph config again
if graph.match_one() is None:
    graph.call.n10s.graphconfig.init()
graph.call.n10s.graphconfig.show()

 param           | value     
-----------------|-----------
 handleVocabUris | SHORTEN   
 handleMultival  | OVERWRITE 
 handleRDFTypes  | LABELS    

We preset some common prefixes so that the graph stays readable and Neosemantics uses the appropriate namespace prefixes.

In [4]:
graph.call.apoc.cypher.runFile('/var/lib/neo4j/import/omniscience/queries/prefixSetter.cypher')

 row | result                                                                                            
-----|---------------------------------------------------------------------------------------------------
   0 | {prefix: 'dct', namespace: 'http://purl.org/dc/terms/'}                                           
   1 | {prefix: 'knovelproperties', namespace: 'http://data.elsevier.com/vocabulary/knovel/properties/'} 
   2 | {prefix: 'xsd', namespace: 'http://www.w3.org/2001/XMLSchema#'}                                   

Creating a uniqueness constraint on Resources. Resources are defined as nodes that can be dereferenced via their `uri` property.

In [5]:
if not graph.schema.get_uniqueness_constraints('Resource'):
    graph.schema.create_uniqueness_constraint('Resource', 'uri')

### Data Loading

#### Loading OmniScience

In [6]:
# Make sure the taxonomy isn't already loaded...
# graph.run() always returns a Cursor (never None), so use evaluate() to test for a match.
if graph.evaluate("MATCH (c:skos__Concept) RETURN c LIMIT 1") is None:
    graph.run("CALL n10s.rdf.import.fetch('file:///var/lib/neo4j/import/omniscience/statements.ttl', 'Turtle')")
    print('Successfully loaded OmniScience vocabulary.')
else:
    print('OmniScience taxonomy appears to be loaded.')

OmniScience taxonomy appears to be loaded.


#### Loading Mock C-Graph Data

In this section, we will load some mocked C-Graph data to prove the concept. First we create some constraints:

In [7]:
if not graph.schema.get_uniqueness_constraints('Person'):
    graph.schema.create_uniqueness_constraint('Person', 'ID')
if not graph.schema.get_uniqueness_constraints('Work'):
    graph.schema.create_uniqueness_constraint('Work', 'ID')

Loading a sample file with authors and works around a small set of concepts:

In [8]:
graph.call.apoc.cypher.runFile('/var/lib/neo4j/import/omniscience/queries/peer-review-recommender/setup/small-c-graph-sample.cypher')
print('Loaded sample c-graph data mapped to concepts.')

Loaded sample c-graph data mapped to concepts.


### Experiment: Predicting Peer Reviewers

In this section, we will go over an algorithmic approach to recommend peer reviewers by finding a topic-focused h-index value.

#### Algorithm

#### Implementation

Assume a potential work is tagged with the following four concepts (note that the first URI appears twice):

In [3]:
concepts = ['http://data.elsevier.com/vocabulary/OmniScience/Concept-237619301', 
            'http://data.elsevier.com/vocabulary/OmniScience/Concept-237619045',
            'http://data.elsevier.com/vocabulary/OmniScience/Concept-237619307',
            'http://data.elsevier.com/vocabulary/OmniScience/Concept-237619301']

Let's collect the nodes from the taxonomy graph as py2neo Node objects:

In [7]:
node_list = []
node_matcher = NodeMatcher(graph)
for concept in concepts:
    node_list.append(node_matcher.match("skos__Concept", uri=concept).first())

In [9]:
node_list

[Node('Resource', 'skos__Concept', dct__creator='Chiara Latronico', skos__prefLabel='Ephedra americana', uri='http://data.elsevier.com/vocabulary/OmniScience/Concept-237619301'),
 Node('Resource', 'skos__Concept', dct__creator='Chiara Latronico', skos__prefLabel='Ephedra', uri='http://data.elsevier.com/vocabulary/OmniScience/Concept-237619045'),
 Node('Resource', 'skos__Concept', dct__creator='Chiara Latronico', skos__prefLabel='Ephedra californica', uri='http://data.elsevier.com/vocabulary/OmniScience/Concept-237619307'),
 Node('Resource', 'skos__Concept', dct__creator='Chiara Latronico', skos__prefLabel='Ephedra americana', uri='http://data.elsevier.com/vocabulary/OmniScience/Concept-237619301')]

First, let's filter out duplicate nodes:

In [52]:
unique_nodes = set(node_list)
unique_nodes

{Node('Resource', 'skos__Concept', dct__creator='Chiara Latronico', skos__prefLabel='Ephedra americana', uri='http://data.elsevier.com/vocabulary/OmniScience/Concept-237619301'),
 Node('Resource', 'skos__Concept', dct__creator='Chiara Latronico', skos__prefLabel='Ephedra californica', uri='http://data.elsevier.com/vocabulary/OmniScience/Concept-237619307'),
 Node('Resource', 'skos__Concept', dct__creator='Chiara Latronico', skos__prefLabel='Ephedra', uri='http://data.elsevier.com/vocabulary/OmniScience/Concept-237619045')}
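
A Python `set` has no defined order, so the nodes above may be listed differently from run to run. If the order matters, or to avoid querying the same URI twice, one option is to deduplicate the URI strings before matching them. A minimal sketch, using placeholder URIs (an assumption, not real OmniScience identifiers):

```python
# Hypothetical stand-ins for the concept URIs used in this notebook
concept_uris = [
    'http://example.org/concept/A',
    'http://example.org/concept/B',
    'http://example.org/concept/C',
    'http://example.org/concept/A',  # duplicate of the first entry
]

# dict keys are unique and keep insertion order, so this drops
# duplicates while preserving the first occurrence of each URI
unique_uris = list(dict.fromkeys(concept_uris))
print(unique_uris)
```

The deduplicated list could then be passed to the `NodeMatcher` loop in place of `concepts`.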

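Next, we make sure that no concept in this set is strictly broader than another. We check every ordered pair (permutation of size 2) and flag the broader concept of each matching pair for removal.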

In [60]:
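# This is bad, not worried about performance just yet :)
def invalid(concepts):
    # Collect every concept that is a (transitive) broader concept of another one in the set.
    invalid_concepts = []  # initialized once, outside the loop, so results accumulate
    query = ("MATCH (c1:skos__Concept {uri: $narrow})-[:skos__broader*1..]->"
             "(c2:skos__Concept {uri: $broad}) RETURN c2 LIMIT 1")
    for perm in itertools.permutations(list(concepts), 2):
        # graph.run() always returns a Cursor; evaluate() returns None when nothing matches
        compare = graph.evaluate(query, narrow=perm[0]['uri'], broad=perm[1]['uri'])
        if compare is not None:
            invalid_concepts.append(perm[1])
    return set(invalid_concepts)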

In [62]:
invalid_nodes = invalid(unique_nodes)  # run the pairwise check once, not once per concept
slim_concepts = [concept for concept in unique_nodes if concept not in invalid_nodes]

Now our list contains only concepts that are as narrow as possible, without duplication.

In [63]:
slim_concepts

[Node('Resource', 'skos__Concept', dct__creator='Chiara Latronico', skos__prefLabel='Ephedra californica', uri='http://data.elsevier.com/vocabulary/OmniScience/Concept-237619307'),
 Node('Resource', 'skos__Concept', dct__creator='Chiara Latronico', skos__prefLabel='Ephedra americana', uri='http://data.elsevier.com/vocabulary/OmniScience/Concept-237619301')]
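
The pruning logic can also be checked without a database. The sketch below mirrors the same idea (drop any concept that is a transitive broader concept of another one in the set) on an in-memory dictionary. The `BROADER` mapping and the helper names `ancestors` and `prune_broader` are hypothetical and stand in for the `skos__broader` relationships stored in Neo4j:

```python
from itertools import permutations

# Hypothetical toy taxonomy: each concept maps to its direct broader concepts
BROADER = {
    'Ephedra americana': ['Ephedra'],
    'Ephedra californica': ['Ephedra'],
    'Ephedra': ['Ephedraceae'],
    'Ephedraceae': [],
}

def ancestors(concept, broader=BROADER):
    """Return every concept reachable by following broader links one or more times."""
    seen = set()
    stack = list(broader.get(concept, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(broader.get(current, []))
    return seen

def prune_broader(concepts, broader=BROADER):
    """Drop any concept that is a transitive broader concept of another one in the set."""
    concepts = set(concepts)
    too_broad = {b for a, b in permutations(concepts, 2) if b in ancestors(a, broader)}
    return concepts - too_broad

slim = prune_broader({'Ephedra americana', 'Ephedra', 'Ephedra californica'})
print(sorted(slim))  # ['Ephedra americana', 'Ephedra californica']
```

As in the notebook output above, the genus-level concept is removed because both species point to it through a broader link.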

Let's expand these concepts to get our First Degree Works. First, we define the h-index helper that the topic-focused ranking relies on:

In [4]:
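def hIndex(citation_counts: list):
    # Sort a copy so the caller's list is not mutated
    ranked = sorted(citation_counts, reverse=True)
    for index, citation_count in enumerate(ranked):
        # The (index + 1)-th work needs at least index + 1 citations;
        # a strict `>` here would overcount, e.g. [1, 1] would give 2 instead of 1
        if index >= citation_count:
            return index
    return len(ranked)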

In [7]:
print(hIndex([3,3,4]))

3
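
To illustrate how a topic-focused h-index might be used to rank candidate reviewers, here is a minimal sketch. It assumes that each candidate's citation counts have already been restricted to works tagged with the narrowed concepts. The notebook does not show that step, so the `topic_citations` data and the author identifiers are hypothetical, and `h_index` is a standalone equivalent of the helper above:

```python
def h_index(citation_counts):
    """Largest h such that at least h works have h or more citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical data: citation counts of each author's works on the target concepts
topic_citations = {
    'author-1': [12, 9, 4, 1],
    'author-2': [30, 2],
    'author-3': [5, 5, 5, 5, 5],
}

# Rank candidates by their topic-focused h-index, highest first
scores = {author: h_index(counts) for author, counts in topic_citations.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)   # {'author-1': 3, 'author-2': 2, 'author-3': 5}
print(ranking)  # ['author-3', 'author-1', 'author-2']
```

Note that a single highly cited work (as for `author-2`) does not raise the score much, which is the usual motivation for preferring an h-index over a raw citation total.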
