# Exercise: assigning reviewers to grant submissions

> In progress: updating this to use dimcli, so that it can be published online!

In this notebook we look at the problem of assigning reviewers to grants, using Dimensions as a knowledge base. Let us assume the following:

* 100 grant applications (with title and abstract) are coming in 
* 20 researchers (each identified by a Dimensions Researcher ID) are panel members
* The applications should be distributed among the panel members 
    * The final distribution is done manually
    * We should inform the manual distribution with a **ranked order of researchers per grant application**, ideally with some scoring.
    * Scoring does not have to be normalized 

## Approach

The goal is to produce a list of weighted researchers for each grant, e.g.:
```
{'grant-1': [
    ('researcher-1', 0.9), ('researcher-2', 0.8), ('researcher-3', 0.5)
]}
```

Hence the algorithm should be repeated for each incoming grant and set of researchers. It breaks down as follows:

* [1] for each grant
    * [2] for each reviewer
        * [3] score **relevance (Rel)** of grant against reviewer expertise
        * [4] score **conflict of interest (CoI)** of grant against reviewer network
        * save (reviewer, Rel score, CoI score) into candidates list 
    * sort candidates list by CoI score && Rel score

Steps [1] and [2] are essentially a repetition of [3] and [4], so below we show the process only for 1 grant and 1 researcher.
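
The nested loop above can be sketched as follows. This is only an illustration under stated assumptions: `rank_reviewers` is a hypothetical helper, the two scoring functions are passed in as placeholders for steps [3] and [4], and the sort order (lowest CoI first, then highest relevance) is one possible reading of "sort by CoI score && Rel score".

```python
from typing import Callable, Dict, List, Tuple


def rank_reviewers(
    grants: List[str],
    reviewers: List[str],
    relevance: Callable[[str, str], float],
    conflict: Callable[[str, str], float],
) -> Dict[str, List[Tuple[str, float, float]]]:
    """Return, for each grant, a list of (reviewer, Rel, CoI) tuples, sorted."""
    ranking = {}
    for grant in grants:                       # step [1]
        candidates = []
        for reviewer in reviewers:             # step [2]
            rel = relevance(grant, reviewer)   # step [3]
            coi = conflict(grant, reviewer)    # step [4]
            candidates.append((reviewer, rel, coi))
        # assumption: lowest CoI first, then highest relevance
        candidates.sort(key=lambda c: (c[2], -c[1]))
        ranking[grant] = candidates
    return ranking


# made-up scores, for illustration only
toy_rel = {
    ("grant-1", "researcher-1"): 0.9,
    ("grant-1", "researcher-2"): 0.8,
    ("grant-1", "researcher-3"): 0.5,
}
toy_coi = {("grant-1", "researcher-1"): 1}

ranking = rank_reviewers(
    ["grant-1"],
    ["researcher-1", "researcher-2", "researcher-3"],
    relevance=lambda g, r: toy_rel.get((g, r), 0.0),
    conflict=lambda g, r: toy_coi.get((g, r), 0),
)
print(ranking)
```

With these toy numbers, `researcher-1` has the highest relevance but is ranked last because of its non-zero CoI score.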

### Sample Data

For the purpose of this demo we are considering a sample grant from Dimensions: 
* [grant.2554006](https://app.dimensions.ai/details/grant/grant.2554006?search_text=A%20MULTI%27OMICS%20APPROACH%20TOWARDS%20DECIPHERING%20THE%20INFLUENCE%20OF%20THE%20MICROBIOME%20ON%20PRE&search_type=kws&search_field=full_search)

And one scientist/reviewer who works in an area closely related to that grant (genetics/pediatrics):
* [ur.01316535077.54](https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01316535077.54)

Next, we store this data in variables and set up the Python libraries we will use to query the Dimensions API.

In [2]:
import dimcli
dsl = dimcli.Dsl()

In [3]:
# pick a sample grant in genetics/pediatrics
sample_grant = dsl.query(
    """search grants where id="grant.2554006" return grants[title+abstract]""")
TITLE, ABSTRACT = sample_grant['grants'][0]['title'], sample_grant['grants'][0]['abstract']

# pick a sample reviewer
REVIEWER = "ur.01316535077.54"

# final list we want to obtain
CANDIDATES_LIST = [] 

## 1. Calculating the Relevance of a Grant to a Reviewer 

We do this by 
* extracting key terms [K1] from the grant's title and description
* extracting key terms [K2] from the reviewer's publications in the last N years (divided by title and abstract) 
* getting the overlap between the two groups [K1~K2], and generating a single score from it

NOTE: the 'terms' DSL API accepts only lowercase keywords, so we use a small helper function to deal with that.

> TODO update this after we release 1.18! extract_concepts

In [8]:
# helper for the terms extractor: lowercases the text and replaces single and
# double quotes with spaces before the text is embedded in the query
# (PS Adam: lowercase can be omitted here)
def cleanup(s):
    return s.lower().replace("'", " ").replace("\"", " ")

# concepts extractor function (encoding required to prevent unicode errors with non-standard chars)
def dsl_concepts(p):
    q = """extract_terms("%s")""" % cleanup(p)
    return dsl.query(q.encode("utf-8"))['extracted_terms']

### 1.1 Extract keywords from the grant title/description

In [9]:
GRANT_TITLE_KEYWORDS = dsl_concepts(TITLE)
GRANT_DESC_KEYWORDS = dsl_concepts(ABSTRACT)
print("GRANT_TITLE_KEYWORDS", GRANT_TITLE_KEYWORDS)
print("GRANT_DESC_KEYWORDS", GRANT_DESC_KEYWORDS)

GRANT_TITLE_KEYWORDS ['omics approaches', 'microbiome']
GRANT_DESC_KEYWORDS ['microbiome', 'microbiota', 'host genomics', 'resident microbiome', 'host', 'collective genomes', 'oral microbiome', 'placenta', 'remarkable host', 'microbiome communities', 'human cells', 'sequencing', 'cells', 'sterile intrauterine environment', 'microbial gene catalogue', 'metabolomics', 'genome', 'metagenomics', 'humans', 'microbes', 'intrauterine environment', 'biomarkers', 'infection', 'hypothesis', 'birth', 'omics sciences', 'genomics', 'susceptibility', 'membrane', 'microorganisms', 'tract', 'future large-scale studies', 'community', 'human-associated microorganisms', 'human genome', 'whole-genome shotgun sequencing', 'throughput technologies', 'association', 'Aim 1', 'high-throughput technologies', 'distinct microbial communities', 'study', 'umbilical cord', 'mechanism', 'shotgun sequencing', 'large-scale studies', 'identification', 'species identification', 'mechanistic studies', 'discovery', 'genita

### 1.2 Extract keywords from the reviewer's recent publications 
We extract keywords from the reviewer's publications of the last 5 years. This takes a little while, so we print a progress message as we iterate through the publications. 

In [20]:
pubs = dsl.query(f"""search publications where researchers.id="{REVIEWER}" and year>=2013 return publications[title+abstract] limit 512""")

# NOTE: this fails unless we use a backend that returns abstracts!

REVIEWER_TITLE_KEYWORDS = []
REVIEWER_DESC_KEYWORDS = []
c, tot = 0, len(pubs['publications'])

for x in pubs['publications']:
    c += 1
    print("[%d / %d]" % (c, tot), x['title'])
    REVIEWER_TITLE_KEYWORDS += dsl_concepts(x['title'])
    REVIEWER_DESC_KEYWORDS += dsl_concepts(x['abstract'])

print("=========")
print("TOT REVIEWER_TITLE_KEYWORDS: ", len(REVIEWER_TITLE_KEYWORDS))
print("TOT REVIEWER_DESC_KEYWORDS: ", len(REVIEWER_DESC_KEYWORDS))


[1 / 32] BMDExpress 2: enhanced transcriptomic dose-response analysis workflow.


KeyError: 'abstract'
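
The `KeyError` above is raised because that publication record has no `abstract` key. A minimal sketch of one way to guard against this, assuming it is acceptable to simply skip missing fields. `collect_keywords` and the stand-in extractor are hypothetical names used for illustration; they are not part of the notebook or of dimcli.

```python
def collect_keywords(publications, extract):
    """Gather title and abstract keywords, skipping fields a record lacks."""
    title_keywords, desc_keywords = [], []
    for pub in publications:
        title = pub.get("title")
        abstract = pub.get("abstract")  # may be missing in some records
        if title:
            title_keywords += extract(title)
        if abstract:
            desc_keywords += extract(abstract)
    return title_keywords, desc_keywords


# stand-in extractor, purely for illustration: lowercased words
def fake_extract(text):
    return text.lower().split()


sample_pubs = [
    {"title": "Microbiome study", "abstract": "Placenta microbiome"},
    {"title": "Dose response"},  # record without an abstract
]
titles, descs = collect_keywords(sample_pubs, fake_extract)
print(titles, descs)
```

In the notebook, the same function could be called as `collect_keywords(pubs['publications'], dsl_concepts)`.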

**Note** If we don't want to split up title/desc keywords, we can simply query like this: `search publications where researchers.id="%s" and year>=2013 return publications[terms] limit 1000`

In [16]:
pubs = dsl.query(f"""search publications where researchers.id="{REVIEWER}" and year>=2013 return publications""")

In [18]:
pubs['publications']

[{'pages': '1780-1782',
  'id': 'pub.1107651006',
  'journal': {'id': 'jour.1345383', 'title': 'Bioinformatics'},
  'author_affiliations': [[{'first_name': 'Jason R',
     'last_name': 'Phillips',
     'orcid': '',
     'current_organization_id': '',
     'researcher_id': 'ur.013533065307.47',
     'affiliations': [{'name': 'Sciome LLC, Research Triangle Park, NC, USA.'}]},
    {'first_name': 'Daniel L',
     'last_name': 'Svoboda',
     'orcid': '',
     'current_organization_id': '',
     'researcher_id': 'ur.015305557555.63',
     'affiliations': [{'name': 'Sciome LLC, Research Triangle Park, NC, USA.'}]},
    {'first_name': 'Arpit',
     'last_name': 'Tandon',
     'orcid': '',
     'current_organization_id': 'grid.280664.e',
     'researcher_id': 'ur.01134171264.57',
     'affiliations': [{'name': 'Sciome LLC, Research Triangle Park, NC, USA.'}]},
    {'first_name': 'Shyam',
     'last_name': 'Patel',
     'orcid': '',
     'current_organization_id': '',
     'researcher_id': '',


In [13]:
dsl.query(f"""search publications where researchers.id="{REVIEWER}" and year>=2013 return publications[concepts+id] limit 512""")

<dimcli.Result object #4480684816. Dict keys: '_stats', 'publications'>

#### Grouping by Frequency of keywords

The same keyword can be found in more than one publication, and we take that as an indication that the reviewer is more closely associated with that topic. 
By grouping keywords this way we end up with a dictionary mapping each keyword to its frequency of appearance, which can later be used to weight the relevance score of a candidate. 


In [None]:
from collections import Counter
REVIEWER_TITLE_KEYWORDS_GROUPED = Counter(REVIEWER_TITLE_KEYWORDS)
REVIEWER_DESC_KEYWORDS_GROUPED = Counter(REVIEWER_DESC_KEYWORDS)
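
As a toy illustration of what `Counter` produces, here is the same grouping applied to a made-up keyword list (the keywords below are invented for the example, not taken from the reviewer's publications):

```python
from collections import Counter

# made-up keywords, as if extracted from several publications
toy_keywords = [
    "microbiome", "genomics", "microbiome",
    "sequencing", "microbiome", "genomics",
]
toy_grouped = Counter(toy_keywords)

# the two most frequent keywords with their counts
print(toy_grouped.most_common(2))  # [('microbiome', 3), ('genomics', 2)]
```

The `REVIEWER_TITLE_KEYWORDS_GROUPED` and `REVIEWER_DESC_KEYWORDS_GROUPED` counters have the same shape.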

Now let's print these out as a table, using the Pandas library.

In [None]:
import pandas as pd
# wrapper
def build_df(a_dict):
    if a_dict:
        df = pd.DataFrame.from_dict(a_dict, orient='index').reset_index()
        df = df.rename(columns={'index': 'term', 0: 'count'})
        return df
    else:
        print("No values to show")
        return pd.DataFrame() # return empty DF
print("========= REVIEWER_TITLE_KEYWORDS BY MOST FREQUENT ========")
df = build_df(REVIEWER_TITLE_KEYWORDS_GROUPED)
if not df.empty: 
    display(df.sort_values('count', ascending=False))

Let's do the same for the keywords extracted from the descriptions/abstracts: 

In [None]:
print("========= REVIEWER_DESC_KEYWORDS BY MOST FREQUENT ========")
df = build_df(REVIEWER_DESC_KEYWORDS_GROUPED)
if not df.empty: display(df.sort_values('count', ascending=False))

### 1.3 Calculating the keywords overlap between the grant and the reviewer publications 

This is simply a matter of iterating through the two pairs of keyword lists and highlighting the keywords that appear in both. 

**Note** we still deal with titles and abstracts/descriptions separately, based on the assumption that a match in the title is stronger than a match in the abstract. This is then represented by a different weight multiplier for title matches when calculating the final score.  

In [None]:
print("========= OVERLAP_TITLE_KEYWORDS BY MOST FREQUENT ========")
OVERLAP_TITLE_KEYWORDS = {}
for t in GRANT_TITLE_KEYWORDS:
    if t in REVIEWER_TITLE_KEYWORDS_GROUPED.keys():
        OVERLAP_TITLE_KEYWORDS[t] = REVIEWER_TITLE_KEYWORDS_GROUPED[t]
df = build_df(OVERLAP_TITLE_KEYWORDS)
if not df.empty: display(df.sort_values('count', ascending=False))

**Note** 
In this case there are no title keywords in common between the grant submission and the reviewer!

In [None]:
print("========= OVERLAP_DESC_KEYWORDS BY MOST FREQUENT ========")
OVERLAP_DESC_KEYWORDS = {}
for t in GRANT_DESC_KEYWORDS:
    if t in REVIEWER_DESC_KEYWORDS_GROUPED.keys():
        OVERLAP_DESC_KEYWORDS[t] = REVIEWER_DESC_KEYWORDS_GROUPED[t]
df = build_df(OVERLAP_DESC_KEYWORDS)
if not df.empty: display(df.sort_values('count', ascending=False))
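
The two overlap loops differ only in their inputs, so they could be factored into one helper. The sketch below assumes the reviewer keywords are stored in a `Counter` (as above); `keyword_overlap` is a hypothetical name, and the sample data is made up.

```python
from collections import Counter


def keyword_overlap(grant_keywords, reviewer_counts):
    """Map each grant keyword the reviewer also uses to the reviewer's count."""
    return {
        k: reviewer_counts[k]
        for k in dict.fromkeys(grant_keywords)  # de-duplicate, keep order
        if k in reviewer_counts
    }


# made-up data, for illustration only
toy_grant_keywords = ["microbiome", "placenta", "genomics", "microbiome"]
toy_reviewer_counts = Counter({"microbiome": 3, "genomics": 2, "toxicology": 5})

toy_overlap = keyword_overlap(toy_grant_keywords, toy_reviewer_counts)
print(toy_overlap)  # {'microbiome': 3, 'genomics': 2}
```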

Finally, we want to **reduce all of these results to a single number**, so that it'll be easier to then compare it with scores for other reviewers. 

This can be easily achieved in two steps:

* sum up all counts for description keywords 
* sum up all counts for title keywords, applying a booster coefficient, e.g. 2. This is because matching title keywords can be more indicative of shared expertise.

The sum of the two numbers above is the final score. 

In [None]:
TITLE_WEIGHT = 2  # multiplier that gives title hits more weight

score = (sum(OVERLAP_TITLE_KEYWORDS.values()) * TITLE_WEIGHT) + (sum(
    OVERLAP_DESC_KEYWORDS.values()))

print("The total combined relevance score is: ", score)

CANDIDATES_LIST += [
    [REVIEWER, score], 
    # other reviewers data would go here.. 
]
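
To make the arithmetic concrete, here is the same formula wrapped in a function and applied to made-up overlap counts. `relevance_score` is a hypothetical helper name and the numbers are invented for the example.

```python
def relevance_score(title_overlap, desc_overlap, title_weight=2):
    """Weighted sum of keyword counts; title hits count `title_weight` times."""
    return sum(title_overlap.values()) * title_weight + sum(desc_overlap.values())


# made-up overlaps, for illustration only
toy_title_overlap = {"microbiome": 3}
toy_desc_overlap = {"genomics": 4, "sequencing": 1}

# 3 * 2 + (4 + 1) = 11
toy_score = relevance_score(toy_title_overlap, toy_desc_overlap)
print(toy_score)
```

Because the score is a raw sum of counts, it is not normalized: reviewers with more publications will tend to score higher.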

### 1.4 Wrapping up

We have calculated the relevance of a *single* reviewer/researcher to a submission by treating the keyword overlaps of titles and abstracts as distinct factors. 

We can easily **repeat the same process** for all the remaining N reviewers/researchers, simply by looping over a list of researcher IDs. 

The final scores can be added to the `CANDIDATES_LIST` variable, which can then be sorted before returning it to the end user.

In [None]:
CANDIDATES_LIST = sorted(CANDIDATES_LIST, key=lambda x: x[1],  reverse=True)
print(CANDIDATES_LIST)

### 1.5 Addendum

A much faster but also much less transparent (and possibly less precise) approach would be to use a single query like this:

```
search publications in terms for "..[1].." where researchers.id in [..[2]..] return researchers
```

Where: 

* [1] is a _list of terms_ obtained from extracting terms from the grant submission title+description
* [2] is a _list of researchers_ i.e. the list of reviewers we are trying to rank

This query essentially returns a researchers-facet on a list of publications filtered by terms and reviewers. The facet includes researchers that are not in the reviewers list, which can be skipped. The results are already ordered by relevance of the keywords. 

The main drawback is that the relevance score is not returned, nor is it clear how it is calculated. It would be useful to test these results on real-world data and compare them with the previous approach. 
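
A rough sketch of how such a query string could be assembled. Several things here are assumptions rather than confirmed behaviour: that the terms are combined as OR-joined quoted phrases, that the filter field is `researchers.id` (the name used by the other queries in this notebook), and the helper name `build_terms_query`, which is hypothetical. The sketch only builds the string; it does not call the API.

```python
import json


def build_terms_query(terms, reviewer_ids):
    """Fill the [1] and [2] slots of the query template shown above."""
    # assumption: each term becomes an escaped quoted phrase, joined with OR
    phrases = " OR ".join('\\"%s\\"' % t for t in terms)
    # a JSON list such as ["id1", "id2"] for the `in` filter
    ids = json.dumps(list(reviewer_ids))
    return (
        'search publications in terms for "%s" '
        'where researchers.id in %s return researchers' % (phrases, ids)
    )


query = build_terms_query(
    ["omics approaches", "microbiome"],
    ["ur.01316535077.54"],
)
print(query)
```

The resulting string could then be passed to `dsl.query(...)`, and researchers not in the reviewers list filtered out of the returned facet.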

## 2. Calculating a Conflict of Interest Score of a Reviewer vs. the Grant Authors 

Prerequisite: 

* we must know the researcher ID of the authors of the grant submission

We consider two aspects: 

* overlap between all co-authors of the reviewer and all authors of the grant 
    * extra: this could be extended by including the grant authors' co-authors, second-degree co-authors, etc.
* overlap between all institutions of the reviewer and all institutions of the grant's authors 
    * extra: same considerations about second-level objects
    * extra: expand institutions by using full GRID hierarchy info
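
Both aspects boil down to set intersections. A minimal sketch, assuming the co-author and organization lists have already been retrieved; `conflict_scores` is a hypothetical helper and all IDs below are invented placeholders.

```python
def conflict_scores(reviewer_coauthors, reviewer_orgs, author_ids, author_orgs):
    """Return (co-author overlap size, organization overlap size)."""
    coauthor_overlap = set(reviewer_coauthors) & set(author_ids)
    org_overlap = set(reviewer_orgs) & set(author_orgs)
    return len(coauthor_overlap), len(org_overlap)


# invented placeholder IDs, for illustration only
toy_coi = conflict_scores(
    reviewer_coauthors=["ur.A", "ur.B", "ur.C"],
    reviewer_orgs=["grid.1", "grid.2"],
    author_ids=["ur.B", "ur.Z"],
    author_orgs=["grid.9"],
)
print(toy_coi)  # (1, 0): one shared co-author, no shared organization
```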



In [None]:
RESEARCHERS = ["ur.01331772327.01"]  # the authors of the grant submission, one or more
REVIEWER = "ur.01316535077.54" # the score is calculated against each single reviewer

### 2.1 Co-authors overlap

Note: the following query retrieves co-authors based on a publications list from the last 10 years. This makes the query faster, but of course the time window can be changed. 

In [None]:
REVIEWER_PUBS_RESULTS = dsl.query(
    """search publications where researchers.id="%s" and year>=2008 return researchers limit 1000"""
    % REVIEWER)

REVIEWER_COAUTHORS = [
    x['id'] for x in REVIEWER_PUBS_RESULTS["researchers"]
    if x['id'] != REVIEWER
]

# find if there's an overlap between reviewer's coauthors and the researchers
overlap = [x for x in RESEARCHERS if x in REVIEWER_COAUTHORS]
overlap_score = len(overlap)
print("CO-AUTHORS CONFLICT OF INTEREST SCORE (0=no, 1+=yes): \n=> ", overlap_score)

### 2.2 Organizations overlap

In [None]:
# look up the reviewer's own entry in the researchers facet retrieved above
reviewer_record = next((item for item in REVIEWER_PUBS_RESULTS['researchers']
                        if item['id'] == REVIEWER), None)
# fall back to an empty list if the reviewer or the field is missing
REVIEWER_ORGANIZATIONS = (reviewer_record or {}).get("research_orgs", [])

RESEARCHERS_ORGANIZATIONS = []
for r in RESEARCHERS:
    res = dsl.query(
        """search publications where researchers.id="%s" return researchers limit 1"""
        % r)
    # assumes the top facet entry is the grant author itself
    if res["researchers"]:
        RESEARCHERS_ORGANIZATIONS += res["researchers"][0].get("research_orgs", [])

# find if there's an overlap between reviewer's orgs and the researchers' orgs
overlap_institutions = [
    x for x in RESEARCHERS_ORGANIZATIONS if x in REVIEWER_ORGANIZATIONS
]
overlap_institutions_score = len(overlap_institutions)
print("ORGANIZATIONS CONFLICT OF INTEREST SCORE (0=no, 1+=yes): \n=> ", overlap_institutions_score)

### 2.3 Wrapping up 

The steps above must be repeated for each grant submission - reviewer pair.

Eventually the scores can be combined within the `CANDIDATES_LIST` result (for a given grant submission) as follows:

In [None]:
for c in CANDIDATES_LIST:
    if c[0] == REVIEWER:
        c += [overlap_score, overlap_institutions_score]
print(CANDIDATES_LIST)
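
Once every candidate carries all three scores, the list can be sorted for the final ranking. The sketch below uses made-up rows in the same `[reviewer, relevance, co-author CoI, organization CoI]` layout. The ordering rule (lowest total CoI first, then highest relevance) is an assumption; other combinations are equally possible.

```python
# made-up candidates: [reviewer, relevance, co-author CoI, organization CoI]
toy_candidates = [
    ["reviewer-a", 12, 0, 0],
    ["reviewer-b", 30, 1, 0],
    ["reviewer-c", 7, 0, 0],
    ["reviewer-d", 12, 0, 1],
]

# assumption: candidates without conflicts come first, then by relevance
ranked = sorted(toy_candidates, key=lambda c: (c[2] + c[3], -c[1]))

for reviewer, rel, coi_authors, coi_orgs in ranked:
    print(reviewer, rel, coi_authors + coi_orgs)
```

With these numbers, `reviewer-b` has the highest relevance but is ranked below the conflict-free candidates, leaving the final decision to the person doing the manual distribution.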