# Publication matching
- take a (small) text sample and identify matching publications (i.e., that belong to the same scientific field)
- match based on: \
    (1) text similarity and \
    (2) references similarity

## Outline
- generate target corpus (the publications for which we want to find matches)
    - retrieve XML publications information from PubMed for specific search term
    - restructure information into Article dataclass for easier processing
    - pickle parsed articles (creates a break point in the workflow)
- generate general corpus (the pool of publications from which we later extract matches)
- retrieve reference information for both target corpus and pool corpus
- determine text similarity 
    - term frequency-inverse document frequency (tf-idf) approach on titles and abstracts
- determine reference similarity

### 0.1  Import libraries

In [2]:
import importlib
import sys
from pathlib import Path  # construct file paths
import configparser       # retrieve private credentials from file (which is ignored by git)
import pickle
from Bio import Entrez    # query the NCBI API
from crossref.restful import Works, Etiquette

### 0.2 Import custom functions

In [3]:
# append script folder to the Python path (run once --> adds path for duration of this session)
sys.path.append('./scripts/')
import pubmatch

In [9]:
# reload pubmatch (re-run if library code changed during development)
importlib.reload(pubmatch)

<module 'pubmatch' from './scripts/pubmatch.py'>

### Python library imports

In [None]:
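# libraries used in the text- and reference-similarity sections below
# (NLTK additionally needs its stop-word and tokenizer resources, see nltk.download)
import nltk
# from nltk.tokenize import sent_tokenize, word_tokenize
import numpy as np
import re
from collections import defaultdict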

## 1. Generate target corpus
- As an example, we use the publications of Madlen Vetter
- Retrieve publication info from PubMed
- Build target corpus from PubMed XML results

In [5]:
# search term for PubMed query
search_term =  '(madlen vetter[author])'

In [6]:
# choose a name for result directory (e.g., according to PubMed search_term)
result_dir_name = "my_publications"
# create path object to result folder (adjust if not in current folder)
result_dir = Path("./" , result_dir_name)
# mkdir result directory, if it does not exist
result_dir.mkdir(parents=True, exist_ok=True)
# create path object for clean XML records
file_cleaned = result_dir / 'cleaned.xml'

In [7]:
# read credentials for NCBI API (Entrez) that are stored in text file (not uploaded to git)
config = configparser.ConfigParser()
config.read("../credentials/publication_matching_creds.txt")
pubmed_user = config.get("pubmed", "user")
pubmed_key = config.get("pubmed", "api_key")
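
The credentials file itself is not part of this notebook. As a rough sketch, the `config.get` calls used here and in section 3 imply an INI layout like the one below. The section and option names come from those calls; all values are placeholders, and `EXAMPLE_CREDS` is a hypothetical name used only for this illustration.

```python
import configparser

# Hypothetical file contents; replace the placeholder values with your own.
EXAMPLE_CREDS = """
[pubmed]
user = your.name@example.org
api_key = YOUR_NCBI_API_KEY

[crossref]
url = https://example.org/your-project
email = your.name@example.org
"""

example_config = configparser.ConfigParser()
# read_string parses the text directly, so no file is needed for the demo
example_config.read_string(EXAMPLE_CREDS)

example_user = example_config.get("pubmed", "user")
example_key = example_config.get("pubmed", "api_key")
print(example_user, example_key)
```

Keeping this file outside the repository (or listing it in `.gitignore`) prevents the API key from being committed.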

In [10]:
# provide pubmed search term, pubmed user name, pubmed api key,
# batch size, and path object for the final (cleaned) XML file
pubmatch.get_clean_xml(search_term, pubmed_user, pubmed_key, 5000, file_cleaned)

There are 4 records for (madlen vetter[author])
Going to download record 1 to 4


In [11]:
target_articles = pubmatch.create_corpus(file_cleaned)

## 2. Generate general corpus (pool against which target is matched)

In [15]:
# define search term for general pool of articles
# tutorial on creating good search terms https://www.nlm.nih.gov/bsd/disted/pubmedtutorial/cover.html
search_term =  'plants[MH] AND immunity[MH]'

### 2.1 Set up directory structure for general pool

In [16]:
# choose a name for result directory (e.g., according to PubMed search_term)
result_dir_name = "plant_publications"
# create path object to result folder (adjust if not in current folder)
result_dir = Path("./" , result_dir_name)
# mkdir result directory, if it does not exist
result_dir.mkdir(parents=True, exist_ok=True)
# create path object for clean XML records
file_cleaned = result_dir / 'cleaned.xml'

In [17]:
# read in credentials for NCBI API (Entrez)
config = configparser.ConfigParser()
config.read("../credentials/publication_matching_creds.txt")
pubmed_user = config.get("pubmed", "user")
pubmed_key = config.get("pubmed", "api_key")

### 2.2 Explore PubMed records for search term
- Adjust the search term if it returns too few or too many hits (too many hits may exhaust the available RAM)
- Aim for fewer than 50,000 records (capped at that number)

In [18]:
# before retrieving anything, identify the number of hits in PubMed
Entrez.email = pubmed_user
Entrez.api_key = pubmed_key

handle = Entrez.esearch(db = "pubmed", term = search_term, retmax = 500000, usehistory = "y")
record = Entrez.read(handle)

webenv = record["WebEnv"] 
query_key = record["QueryKey"]

id_list = record["IdList"]
print(len(id_list))

13795


In [19]:
# retrieve info on frequency of individual terms
record['TranslationStack']

[{'Term': '"plants"[MeSH Terms]', 'Field': 'MeSH Terms', 'Count': '775888', 'Explode': 'Y'}, {'Term': '"immunity"[MeSH Terms]', 'Field': 'MeSH Terms', 'Count': '336173', 'Explode': 'Y'}, 'AND']

In [20]:
# Retrieve the titles of some summary records to evaluate topical fit 
# (i.e., does the search term provide meaningful PubMed records?)
numrec = 10 # number of records
pubmatch.get_pubmed_summary(webenv, query_key, pubmed_key, numrec)

Atypical Resistance Protein RPW8/HR Triggers Oligomerization of the NLR Immune Receptor RPP7 and Autoimmunity.
Phenolic Amides with Immunomodulatory Activity from the Nonpolysaccharide Fraction of <i>Lycium barbarum</i> Fruits.
Cell Wall Membrane Fraction of <i>Chlorella sorokiniana</i> Enhances Host Antitumor Immunity and Inhibits Colon Carcinoma Growth in Mice.
Identification of lncRNAs and their regulatory relationships with target genes and corresponding miRNAs in melon response to powdery mildew fungi.
Genetic mapping using a wheat multi-founder population reveals a locus on chromosome 2A controlling resistance to both leaf and glume blotch caused by the necrotrophic fungal pathogen Parastagonospora nodorum.
Identification of a Recessive Gene <i>PmQ</i> Conferring Resistance to Powdery Mildew in Wheat Landrace Qingxinmai Using BSR-Seq Analysis.
PRR Cross-Talk Jump Starts Plant Immunity.
A Rapid Survey of Avirulence Genes in Field Isolates of <i>Magnaporthe oryzae</i>.
Plant metabo

### 2.3 Build the corpus of general publications (i.e., article pool)

In [None]:
# retrieve XML records for general publications
pubmatch.get_clean_xml(search_term, pubmed_user, pubmed_key, 5000, file_cleaned)

In [None]:
# corpus:
#     search term
#     xml file
#     list of article objects
    
# for term, file in corpuses:  

### 2.4 Build corpus of general articles: generate from PubMed XML information or read from pickle

In [None]:
# general_articles = pubmatch.create_corpus(file_to_open_cleaned, file_to_open_parsed)

In [None]:
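# read the pickled corpus of articles
# (path matches the pickle that is read again in the text-similarity section)
file_to_open_parsed = result_dir / 'parsed_articles.pickle'
with file_to_open_parsed.open("rb") as infile:
    general_articles = pickle.load(infile)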
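
The "generate or read from pickle" break point can be wrapped in a small helper. This is only a sketch: `load_or_build` and its arguments are hypothetical names, and the builder here is a stand-in for whatever call produces the corpus.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_build(pickle_path, build_fn):
    """Return the pickled object if the file exists, else build and pickle it."""
    if pickle_path.exists():
        with pickle_path.open("rb") as infile:
            return pickle.load(infile)
    result = build_fn()
    with pickle_path.open("wb") as outfile:
        pickle.dump(result, outfile)
    return result

# Demonstration with a throwaway directory and a stand-in "corpus".
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "parsed_articles.pickle"
    first = load_or_build(demo_path, lambda: ["article A", "article B"])
    # Second call finds the pickle, so its builder is never called.
    second = load_or_build(demo_path, lambda: [])
```

With such a helper, the expensive parsing step runs only when no pickle is present yet.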

## 3. Retrieve reference information using Crossref

In [None]:
# set up crossref etiquette
config = configparser.ConfigParser()
config.read("../credentials/publication_matching_creds.txt")
crossref_url = config.get("crossref", "url")
crossref_email = config.get("crossref", "email")
my_etiquette = Etiquette('Publication Matching', '0.1', crossref_url, crossref_email)

In [None]:
# set up user agent for crossref API calls
works = Works(etiquette=my_etiquette)

In [None]:
len(general_articles)

# TODO: write functions and apply to target_articles and general_articles

### 3.1 Retrieve references for general articles

In [None]:
# read in the pickle
with file_to_open_parsed.open("rb") as infile:
    general_articles = pickle.load(infile)

In [None]:
no_references = not_in_crossref = 0
ref_articles = []
for article in general_articles:
    if article.doi:
        ref_list = []
        record = works.doi(article.doi)
        if record:
            if 'reference' in record:
                for ref in record['reference']:
                    title = ref.get('article-title', None)
                    authors = ref.get('author', None)
                    year = ref.get('year', None)
                    journal = ref.get('journal-title', None)
                    doi = ref.get('DOI', None)
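                    # Article: the dataclass mentioned in the outline (assumed to be importable here, e.g. from pubmatch)
                    ref_list.append(Article(my_id=doi, doi=doi, title=title, authors=authors, year=year, journal=journal))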
                article.references = ref_list
                ref_articles.append(article)
            else:
                no_references += 1
        else: 
            not_in_crossref += 1
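
To see what the loop above extracts without calling the Crossref API, the field lookup can be tried on a hand-made record. This is a sketch under assumptions: `parse_crossref_references` is a hypothetical helper, and `fake_record` is invented data that mimics only the keys read above.

```python
def parse_crossref_references(record):
    """Extract the reference fields used above from a Crossref-style work record."""
    parsed = []
    # a missing record or a record without 'reference' yields an empty list
    for ref in (record or {}).get("reference", []):
        parsed.append({
            "doi": ref.get("DOI"),
            "title": ref.get("article-title"),
            "authors": ref.get("author"),
            "year": ref.get("year"),
            "journal": ref.get("journal-title"),
        })
    return parsed

# Made-up record; the second reference carries no parsed fields at all.
fake_record = {
    "reference": [
        {"DOI": "10.1000/example.1", "article-title": "A first example",
         "author": "Doe", "year": "2015", "journal-title": "Example Journal"},
        {"unstructured": "Doe J. (2010) A reference without parsed fields."},
    ]
}
refs = parse_crossref_references(fake_record)
print(refs[0]["doi"], refs[1]["title"])
```

References that only carry free text end up with `None` in every field, which is why the reference-similarity section later needs a fallback token for them.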

In [None]:
# write out pickle of processed publication information
with file_to_open_parsed.open("wb") as outfile:
    pickle.dump(ref_articles, outfile)

In [None]:
not_in_crossref

In [None]:
no_references

## Text similarity

### Prep data structures

In [None]:
# retrieve general articles with reference data
with open("./plant_publications/parsed_articles.pickle", "rb") as infile:
    general_articles = pickle.load(infile)
print("Read in {} general articles.".format(len(general_articles)))

In [None]:
# retrieve target articles
with open("./my_publications/parsed_articles.pickle", "rb") as infile:
    target_articles = pickle.load(infile)
print("Read in {} target articles.".format(len(target_articles)))

In [None]:
# remove target articles from the general article pool
def remove_targets_from_general(target_articles, general_articles):
    # collect the ids of all target articles for fast membership tests
    target_myids = {article.my_id for article in target_articles}

    # build new lists instead of removing while iterating,
    # which would skip the element following each removal
    removed_targets = [a for a in general_articles if a.my_id in target_myids]
    remaining = [a for a in general_articles if a.my_id not in target_myids]

    for removed in removed_targets:
        print("Removed target from pool: {}".format(removed.title))
    return remaining

In [None]:
general_articles = remove_targets_from_general(target_articles, general_articles)

In [None]:
# list of articles and abstracts from general publications, if abstract is sufficiently long
pool_articles = []
pool_abstracts = []
for article in general_articles:
    abstract = article.abstract or ''
    abstract = abstract.strip()
    if len(abstract) > 50:
        pool_articles.append(article)
        pool_abstracts.append(abstract)
print("Retained {} articles from {} general articles.".format(len(pool_articles), len(general_articles)))

In [None]:
# build a list with all target abstracts (same order as target_articles);
# a missing abstract becomes an empty string so that normalization does not fail
target_abstracts = []
for article in target_articles:
    target_abstracts.append(article.abstract or '')

In [None]:
# build a joint corpus
all_corpus = pool_abstracts + target_abstracts
print("Kept total of {} articles for NLP processing.".format(len(all_corpus)))

In [None]:
# build a dictionary for easier look-up of matched articles
pool_articles_dict = {}
for article in pool_articles:
    pool_articles_dict[article.my_id] = article

In [None]:
# define STOP words
STOP = set(nltk.corpus.stopwords.words("english"))

In [None]:
def normalize_abstract(abstract):
    # remove special characters (flags must be passed by keyword;
    # the fourth positional argument of re.sub is `count`)
    abstract = re.sub(r'[^a-zA-Z0-9\s]', '', abstract, flags=re.I|re.A)
    abstract = abstract.lower()
    abstract = abstract.strip()
    # tokenize
    tokens = nltk.word_tokenize(abstract)
    # filter stop words
    filtered_tokens = [token for token in tokens if token not in STOP]
    # re-create text from filtered tokens
    abstract = ' '.join(filtered_tokens)
    return abstract
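
The effect of the normalization can be checked on a single sentence without downloading NLTK data. The sketch below is a simplified stand-in: `normalize_text` and `DEMO_STOP` are hypothetical names, the stop-word set is a tiny made-up subset, and whitespace splitting replaces `nltk.word_tokenize`.

```python
import re

# Tiny hypothetical stop-word set; the notebook uses NLTK's English list instead.
DEMO_STOP = {"the", "of", "in", "and", "a", "is"}

def normalize_text(text, stop_words=DEMO_STOP):
    # Drop everything except ASCII letters, digits, and whitespace.
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Lower-case and split on whitespace (a stand-in for nltk.word_tokenize).
    tokens = text.lower().split()
    # Re-create the text from the tokens that are not stop words.
    return " ".join(t for t in tokens if t not in stop_words)

normalized = normalize_text("The role of NLR receptors in plant immunity!")
print(normalized)  # role nlr receptors plant immunity
```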

In [None]:
normalize_corpus = np.vectorize(normalize_abstract)
norm_corpus = normalize_corpus(all_corpus)
print("Normalized {} articles.".format(len(norm_corpus)))

## Feature engineering

In [None]:
# set up TF-IDF representation
from sklearn.feature_extraction.text import TfidfVectorizer
# We take uni-gram and bi-grams as our features and remove terms 
# that occur only in one document across the whole corpus.
tf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
tfidf_matrix = tf.fit_transform(norm_corpus)
tfidf_matrix.shape
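
For intuition about the weights, here is a pure-Python sketch of tf-idf as scikit-learn computes it with its default settings (smoothed idf, L2-normalized rows). It is an illustration only: `tfidf_vectors` is a hypothetical helper, it uses uni-grams only, and it applies no `min_df` filter.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in counts.items()}
        # guard against empty documents (norm of 0)
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

demo_docs = ["plant immunity receptor",
             "plant immunity pathogen",
             "wheat powdery mildew"]
demo_vectors = tfidf_vectors(demo_docs)
print(demo_vectors[0])
```

In the first document, `receptor` receives a higher weight than `plant` because it occurs in fewer documents.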

### Similarity comparison (Cosine similarity for pairwise document similarity)

In [None]:
# separate target and pool tfidf
target_tfidf = tfidf_matrix[-len(target_abstracts):]
pool_tfidf = tfidf_matrix[:-len(target_abstracts)]

In [None]:
# run full matrix similarity for pool vs target
sim = pool_tfidf @ target_tfidf.T
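
The plain matrix product can serve as cosine similarity because `TfidfVectorizer` L2-normalizes each row by default, so the dot product of two rows equals their cosine. A small pure-Python check of that identity, with made-up vectors and hypothetical helper names:

```python
import math

def dot(u, v):
    # sparse dot product over {term: weight} dicts
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def l2_normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Made-up weight vectors for two documents.
a = {"plant": 2.0, "immunity": 1.0}
b = {"plant": 1.0, "wheat": 3.0}

cos_ab = cosine(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))
print(cos_ab, dot_normalized)
```

Both values agree up to floating-point error, which is why no explicit division by the vector norms appears in the cell above.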

In [None]:
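# save similarity matrix (natural break point in the workflow);
# sim is a SciPy sparse matrix, so use scipy.sparse.save_npz rather than np.save
# import scipy.sparse
# scipy.sparse.save_npz("doc_sim.npz", sim)

In [None]:
# load the sparse similarity matrix:
# sim = scipy.sparse.load_npz("doc_sim.npz")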

In [None]:
# convert to COO format to access row indices, column indices, and values
coo_sim = sim.tocoo(copy=False)
pool_idx = coo_sim.row
target_idx = coo_sim.col
flat_sim = coo_sim.data
# free up some memory
del tfidf_matrix, target_tfidf, pool_tfidf, sim 

In [None]:
# filter for similarity threshold
useful = np.argwhere(flat_sim > 0.13)
filtered_pool_idx = pool_idx[useful].flatten()
filtered_target_idx = target_idx[useful].flatten()
filtered_flat_sim = flat_sim[useful].flatten()
print("Identified {} pool/target pairs above similarity threshold.".format(len(useful)))

In [None]:
order = np.argsort(filtered_flat_sim)[::-1]

In [None]:
# sorted_matches has all matches in order
filtered_pool_idx = np.array(filtered_pool_idx, dtype=int)
filtered_target_idx = np.array(filtered_target_idx, dtype=int)
sorted_matches = []
for i in order:
    match = (filtered_flat_sim[i], pool_articles[filtered_pool_idx[i]], target_articles[filtered_target_idx[i]])
    sorted_matches.append(match)
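
The filter-and-rank logic of the last three cells can be followed on plain lists. This sketch uses made-up `(row, col, value)` triplets in place of the COO arrays; `rank_matches` is a hypothetical helper, and the threshold of 0.13 is the one used above.

```python
def rank_matches(rows, cols, values, threshold):
    """Keep (value, row, col) triplets above threshold, highest value first."""
    kept = [(v, r, c) for r, c, v in zip(rows, cols, values) if v > threshold]
    return sorted(kept, reverse=True)

# rows index pool articles, cols index target articles
demo_rows = [0, 0, 1, 2]
demo_cols = [0, 1, 0, 1]
demo_vals = [0.05, 0.40, 0.20, 0.13]

ranked = rank_matches(demo_rows, demo_cols, demo_vals, threshold=0.13)
print(ranked)  # [(0.4, 0, 1), (0.2, 1, 0)]
```

The comparison is strict, so a pair with a similarity of exactly 0.13 is dropped.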

In [None]:
# Create frequency table (how many matches does each pool article have)
from collections import Counter
pool_hits = Counter(filtered_pool_idx)

In [None]:
# how many articles have at least X matches?
sum([1 for x in pool_hits.values() if x >= 1])

In [None]:
# filter the counter
{x : pool_hits[x] for x in pool_hits if pool_hits[x] >= 1}

In [None]:
pool_matches = defaultdict(list)  # keys are pool Article.my_id's, values are lists of (similarity, target article) tuples
for sim, pool, target in sorted_matches:
    # the key is created on first access; append a tuple of similarity score and matched target article
    pool_matches[pool.my_id].append((sim, target))
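
Because `sorted_matches` is already ordered by similarity, each per-article list built this way stays sorted as well. A toy run with made-up ids (the string ids stand in for `Article` objects):

```python
from collections import defaultdict

# (similarity, pool id, target id), already sorted by similarity
demo_sorted = [
    (0.40, "pool-1", "target-A"),
    (0.25, "pool-2", "target-A"),
    (0.20, "pool-1", "target-B"),
]

grouped = defaultdict(list)
for score, pool_id, target_id in demo_sorted:
    # defaultdict creates the empty list on first access to a new key
    grouped[pool_id].append((score, target_id))

# pool article with the largest number of matched targets
best_pool = max(grouped, key=lambda k: len(grouped[k]))
print(best_pool, grouped[best_pool])
```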

In [None]:
# write out matches
with open("./abstract_matches.pickle", "wb") as outfile:
    pickle.dump(pool_matches, outfile)

In [None]:
# remove those with fewer than X matches
# for my_id, match_list in list(pool_matches.items()):
#     if len(match_list) < 2:
#         pool_matches.pop(my_id)   

In [None]:
match_iter = iter(pool_matches.items())

In [None]:
my_id, matches = next(match_iter)
print("Pool article:")
print(pool_articles_dict[my_id].title)
print(pool_articles_dict[my_id].my_id)
print(pool_articles_dict[my_id].year)
print(pool_articles_dict[my_id].abstract)

for sim, jm in matches:
    print()
    print(jm.title, jm.my_id, jm.year, sim)
    print(jm.abstract)

## Find matching articles based on reference similarity

In [None]:
# retrieve general articles with reference data
with open("./plant_publications/parsed_articles.pickle", "rb") as infile:
    general_articles = pickle.load(infile)
print("Read in {} general articles.".format(len(general_articles)))
# retrieve target articles
with open("./my_publications/parsed_articles.pickle", "rb") as infile:
    target_articles = pickle.load(infile)
print("Read in {} target articles.".format(len(target_articles)))

In [None]:
# Remove targets from general pool
pool_articles = remove_targets_from_general(target_articles, general_articles)

In [None]:
# pool_articles_dict = {}
# # build a dictionary for easier look-up of matched articles
# for article in pool_articles:
#     pool_articles_dict[article.my_id] = article

In [None]:
target_articles_dict = {}
# build a dictionary for easier look-up of matched articles
for article in target_articles:
    target_articles_dict[article.my_id] = article

### Build data structures

In [None]:
target_temp = []
for article in target_articles:
    # check if the string references have been converted (to article objects)
    if not any(isinstance(r, str) for r in article.references):
        target_temp.append(article)
print("{} of {} target articles have references.".format(len(target_temp), len(target_articles)))
# re-assign target_articles to remove target articles without reference information
target_articles = target_temp

In [None]:
UNIQUE_ID = 100

def get_reference_token(article):
    global UNIQUE_ID
    if article.doi:
        return article.doi
    elif article.title:
        title = article.title.lower()
        return re.sub(r'[^a-z0-9]', '', title)
    else:
        # NOTE: if you want to try matching just on the understandable references,
        # you can instead return "None" here. (Expect more matches, but also more false positives)
        UNIQUE_ID += 1
        return "LOCAL" + str(UNIQUE_ID)

def reference_tokenizer(article):
    tokens = []
    for ref in article.references:
        token = get_reference_token(ref)
        if token:
            tokens.append(token)
    return tokens 
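
To illustrate how these tokens let two reference lists be compared, the sketch below applies the same DOI-else-normalized-title rule to made-up references. `DemoRef` and `demo_token` are hypothetical stand-ins for the `Article` objects and `get_reference_token`; the DOI is invented.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class DemoRef:
    doi: Optional[str] = None
    title: Optional[str] = None

def demo_token(ref):
    # prefer the DOI; otherwise fall back to the title stripped to a-z and 0-9
    if ref.doi:
        return ref.doi
    if ref.title:
        return re.sub(r"[^a-z0-9]", "", ref.title.lower())
    return None

refs_a = [DemoRef(doi="10.1000/x1"),
          DemoRef(title="PRR Cross-Talk Jump Starts Plant Immunity.")]
refs_b = [DemoRef(doi="10.1000/x1"),
          DemoRef(title="prr cross talk jump starts plant immunity")]

tokens_a = {demo_token(r) for r in refs_a}
tokens_b = {demo_token(r) for r in refs_b}
shared = tokens_a & tokens_b
print(sorted(shared))
```

Differences in case and punctuation between titles no longer matter. A limitation of this scheme is that the same paper cited once with a DOI and once only by title produces two different tokens, and DOIs that differ only in letter case are also treated as different.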

In [None]:
n_target = 0 # number of useful target articles (i.e., more than 5 refs)
ref_texts = []
ref_articles = []
for article in pool_articles:
    tokens = reference_tokenizer(article)
    if tokens and len(tokens) > 5:
        ref_texts.append(tokens)
        ref_articles.append(article)
for article in target_articles:
    tokens = reference_tokenizer(article)
    if tokens and len(tokens) > 5:
        n_target += 1
        ref_texts.append(tokens)
        ref_articles.append(article)
print("Identified {} target articles with sufficient reference information.".format(n_target))

### Feature engineering of reference information

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Each reference token (DOI or normalized title) is one feature.
# The identity tokenizer passes the pre-tokenized lists through unchanged;
# no n-grams are built and no min_df filter is applied here.
tf = TfidfVectorizer(tokenizer=lambda x: x, lowercase=False)
tfidf_matrix = tf.fit_transform(ref_texts)

In [None]:
# Get the cosine similarity matrix between pool and target articles
pool_tfidf = tfidf_matrix[:-n_target]
pool_articles = ref_articles[:-n_target]
target_tfidf = tfidf_matrix[-n_target:]
target_articles = ref_articles[-n_target:]
sim = pool_tfidf @ target_tfidf.T

In [None]:
# Extract all matching pairs of articles, in sorted order
coo_sim = sim.tocoo(copy=False)
pool_idx = coo_sim.row
target_idx = coo_sim.col
flat_sim = coo_sim.data

In [None]:
# clear memory and order per similarity
# del tfidf_matrix, target_tfidf, pool_tfidf, sim

In [None]:
# adjust stringency of matches by filtering for flat_sim (similarity) 
useful = np.argwhere(flat_sim > 0.01)
filtered_pool_idx = pool_idx[useful].flatten()
filtered_target_idx = target_idx[useful].flatten()
filtered_flat_sim = flat_sim[useful].flatten()

from collections import Counter
target_hits = Counter(filtered_target_idx)
# for reference: number of target articles with at least one match in the pool
print(sum([1 for x in target_hits.values() if x >= 1]))

In [None]:
order = np.argsort(filtered_flat_sim)[::-1]
filtered_pool_idx = np.array(filtered_pool_idx, dtype=int)
filtered_target_idx = np.array(filtered_target_idx, dtype=int)
sorted_matches = []
for i in order:
    match = (filtered_flat_sim[i], target_articles[filtered_target_idx[i]], pool_articles[filtered_pool_idx[i]])
    sorted_matches.append(match)

In [None]:
target_matches = defaultdict(list) #keys are target Article.my_id's, values are lists of matched pool article obj
for sim, target, pool in sorted_matches:
    # create key; add similarity score; append a tuple that has matched pool article and its sim score
    target_matches[target.my_id].append((sim, pool))

In [None]:
# write out matches
with open("./ref_matches.pickle", "wb") as outfile:
    pickle.dump(target_matches, outfile)

In [None]:
# # if desired, remove those with fewer than X matches
# for my_id, match_list in list(target_matches.items()):
#     if len(match_list) < 4:
#         target_matches.pop(my_id)

In [None]:
match_iter = iter(target_matches.items())

In [None]:
# step through the results by re-running this cell multiple times
my_id, matches = next(match_iter)
print("Target article:")
print(target_articles_dict[my_id].title)
print(target_articles_dict[my_id].my_id)
print(target_articles_dict[my_id].year)
print(target_articles_dict[my_id].abstract)

for sim, pool_match in matches:
    print()
    print("Title: {} \n my_id: {} \n Year: {} \n Similarity score: {} \n Abstract: {}".format(pool_match.title, pool_match.my_id, pool_match.year, sim, pool_match.abstract))
    