In [118]:
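import requests
import pandas as pd

baseline_url = "https://xdddev.chtc.io/api/articles?max=20&dataset=xdd-covid-19&match=true&additional_fields=title,abstract"
ssearch_url = "https://xdddev.chtc.io/api/articles?max=20&dataset=xdd-covid-19&semantic_search=true"
pd.set_option('display.max_colwidth', None)  # show full titles without truncation

def get_docs(url, query, abstract_length=400):
    """Query the xDD articles endpoint and return xddid, DOI, and title for each hit."""
    # abstract_length is currently unused; abstracts are not included in the output.
    # Pass the term via params so that requests URL-encodes multi-word queries.
    resp = requests.get(url, params={"term": query})
    resp.raise_for_status()
    rdata = resp.json()['success']['data']
    data = []
    for d in rdata:
        xddid = d["_gddid"]
        # Take the first identifier of type "doi", or an empty string if there is none.
        doi = next((i["id"] for i in d.get("identifier", []) if i["type"] == "doi"), "")
        data.append((xddid, doi, d.get('title', '')))
    return pd.DataFrame(data, columns=["xddid", "DOI", "Title"])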


# xDD - Semantic Search proof-of-concept

## Overview
The current xDD articles endpoint provides phrase matching (the default), bag-of-words matching (with the `match` parameter), and the ability to extend the search to the title and abstract fields (with boosted scoring for relevance). Scoring is the Elasticsearch default (BM25 as of ES5).

We have prototyped an [MPNet](https://arxiv.org/abs/2004.09297) LM-backed search. Thanks to Brandon, David-Andrew, and Joel for providing an implementation of an MPNet embedder and for providing guidance on the query structure! Each document is embedded, and the resulting vector is stored in Elasticsearch. Queries are embedded the same way, and an exhaustive search retrieves the nearest neighbors.
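
To make the retrieval step concrete, here is a minimal sketch of an exhaustive nearest-neighbor search over stored vectors. It is not the prototype's implementation: the toy three-dimensional vectors stand in for MPNet embeddings, cosine similarity is an assumed metric, and the names `cosine_similarity`, `exhaustive_search`, and `toy_docs` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exhaustive_search(query_vec, doc_vecs, k=2):
    """Score every stored vector against the query and return the top k (id, score) pairs."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy stand-ins for document embeddings (assumption: real vectors come from the embedder).
toy_docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
top = exhaustive_search([1.0, 0.1, 0.0], toy_docs, k=2)
print(top)
```

Because every document is scored for every query, the cost grows linearly with the size of the collection.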

### (Qualitative) Results
Unsurprisingly, the semantic search does well on general free-text queries such as "adding age stratification to a covid-19 model". Equally unsurprisingly, it struggles with very specific out-of-vocabulary queries such as "SVIIVR".

### Takeaways
It's very promising! Establishing quantitative metrics is obviously important for determining whether it *really* improves the search. The question of how to handle out-of-vocabulary searches remains, but overall it was a successful experiment worth pursuing further. And there are a lot of directions we could go with (L)LMs.
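
As one possible starting point for quantitative metrics, the sketch below computes precision@k and reciprocal rank for a ranked result list against a set of documents judged relevant. Nothing here was part of the experiment: the relevance judgments would have to be collected separately, and the function names and example ids are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical ranked output and hand-labelled relevant set for a single query.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
print(precision_at_k(ranked, relevant, k=4), reciprocal_rank(ranked, relevant))
```

Averaging these values over a fixed set of queries would allow the baseline and the semantic search to be compared on the same footing.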

### Notes

- The "exhaustive search" is a potential pain point that can be sidestepped by upgrading to ES8, which includes performance updates.
    - Restricting the search to specific sets of documents can also help until the upgrade.
- Currently only the xdd-covid-19 set has been embedded.
- This approach also lends itself well to "nearby document" explorations
    - And "nearby artifacts," once implemented.
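
The idea of filtering before the exhaustive search can be sketched as an Elasticsearch `script_score` query, in which only documents passing the filter are scored against the query vector. This is an illustration under assumptions, not the prototype's actual query: the field names `embedding` (a `dense_vector` field) and `dataset`, and the helper `build_filtered_vector_query`, are hypothetical.

```python
import json

def build_filtered_vector_query(query_vector, dataset, size=20):
    """Build a script_score request body that scores only documents in one dataset."""
    return {
        "size": size,
        "query": {
            "script_score": {
                # Only documents matching this filter are passed to the script.
                "query": {"bool": {"filter": [{"term": {"dataset": dataset}}]}},
                "script": {
                    # Adding 1.0 keeps scores non-negative, which Elasticsearch requires.
                    "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
    }

# Toy vector (assumption: a real query vector would come from the embedder).
body = build_filtered_vector_query([0.1, 0.2, 0.3], "xdd-covid-19")
print(json.dumps(body, indent=2))
```

The smaller the filtered set, the fewer vectors the script has to score for each query.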



In [153]:
query = "SVIIVR"

In [154]:
baseline = get_docs(baseline_url, query)
ssearch = get_docs(ssearch_url, query)

In [155]:

pd.concat([baseline["Title"], ssearch["Title"]], axis=1).set_axis(["baseline", "semantic_search"], axis="columns")

|   | baseline | semantic_search |
|---|---|---|
| 0 | An algebraic framework for structured epidemic modelling | Weer een warm welkom |
| 1 | A fractional-order mathematical model based on vaccinated and infected compartments of SARS-CoV-2 with a real case study during the last stages of the epidemiological event | Down-regulation of viral replication by lentiviral-mediated expression of short-hairpin RNAs against vesicular stomatitis virus ribonuclear complex genes |
| 2 |  | High-fives for FIV? |
| 3 |  | Vaknieuws |
| 4 |  | Understanding and altering cell tropism of vesicular stomatitis virus |
| 5 |  | Groovy virus tails |
| 6 |  | Gene expression of vesicular stomatitis virus genome RNA |
| 7 |  | KORT |
| 8 |  | Synthesis of VSV RNPs in vitro by cellular VSV RNPs added to uninfected HeLa cell extracts: VSV protein requirements for replication in vitro |
| 9 |  | Consistent conjugation |
