# Introducing paperai

[paperai](https://github.com/neuml/paperai) is a semantic search and workflow application for medical/scientific papers. Applications range from semantic search indexes that find matches for medical/scientific queries to full-fledged reporting applications powered by machine learning.

This notebook gives a brief overview of paperai.

# Install dependencies

Install `paperai` and all dependencies. This step also downloads input data to process.

In [None]:
%%capture
!pip install git+https://github.com/neuml/paperai scipy==1.10.0

# Download NLTK data
!python -c "import nltk; nltk.download('punkt')"

# Download data
!mkdir -p paperai
!wget -N https://github.com/neuml/paperai/releases/download/v1.10.0/tests.tar.gz
!tar -xvzf tests.tar.gz

# Index data

First, we'll index a dataset previously created with [paperetl](https://github.com/neuml/paperetl).

In [None]:
!python -m paperai.index paperai

Building new model
Iterated over 34959 total rows


The index process reads each row from the sections table in an articles database and builds an embeddings index. In this case, 34,959 text sections were indexed.
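
To make "read each row and build an index" more concrete, here is a minimal, stdlib-only sketch. It is not paperai's implementation: the in-memory table layout, the toy rows, the `build_index` helper and the word-count vectors are all assumptions standing in for the real articles database and embeddings model.

```python
import sqlite3
from collections import Counter

# Hypothetical stand-in for an articles database with a sections table
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sections (Id INTEGER PRIMARY KEY, Article TEXT, Text TEXT)")
db.executemany(
    "INSERT INTO sections (Article, Text) VALUES (?, ?)",
    [
        ("a1", "hypertension was a strong risk factor for severe covid-19"),
        ("a1", "antihypertensive medication use and risk of severe covid-19"),
        ("a2", "management of osteoarthritis during the pandemic"),
    ],
)

def build_index(connection):
    """Reads every section row and maps section id -> (article id, word counts)."""
    index = {}
    rows = 0
    for sid, article, text in connection.execute("SELECT Id, Article, Text FROM sections"):
        # A real embeddings index stores dense vectors; word counts keep this sketch simple
        index[sid] = (article, Counter(text.lower().split()))
        rows += 1

    print(f"Iterated over {rows} total rows")
    return index

index = build_index(db)
```

The shape is the same as the real step: one pass over the sections table, one index entry per text section.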

# Query data

Next, we'll run a sample query to find matching articles. The command below runs a similarity query for `COVID-19 and hypertension` and returns the top 2 documents with a score of at least 0.75.

In [None]:
!python -m paperai.query "COVID-19 and hypertension" 2 paperai 0.75

Loading model from paperai
Query: COVID-19 and hypertension

Articles

Title: Associations with covid-19 hospitalisation amongst 406,793 adults: the UK Biobank prospective cohort study
Published: 2020-05-11
Publication: None
Entry: 2020-05-19
Id: 0l10n9n7
Reference: http://medrxiv.org/cgi/content/short/2020.05.06.20092957v1?rss=1
 - (0.8635): As with the reports from case series, 17 we observed that a history of hypertension was a strong risk factor for severe covid-19.
 - (0.8062): This suggests it is more likely that hypertension per se, in particular hypertension severe enough to require multiple medications, that is the main risk factor for severe covid-19.
 - (0.7860): Additionally, we examine whether antihypertensive medication use is associated with risk of severe covid-19.
 - (0.7718): Nevertheless, questions remain as to why hypertension should be such a strong ris

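The role of the two numeric arguments (result limit and minimum score) can be sketched with a toy scorer. This is only an illustration under stated assumptions: `cosine` and `rank_sections` are hypothetical helpers, the vectors are plain word counts rather than learned embeddings, and the limit here counts sections rather than documents.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_sections(sections, query, topn, threshold):
    """Scores each section against the query, drops scores below threshold, keeps the best topn."""
    qvector = Counter(query.lower().split())
    scored = [(cosine(qvector, Counter(text.lower().split())), text) for text in sections]
    matches = [(score, text) for score, text in scored if score >= threshold]
    return sorted(matches, reverse=True)[:topn]

# Toy sections, not taken from the dataset
sections = [
    "covid-19 and hypertension risk",
    "hypertension and covid-19",
    "management of osteoarthritis",
]

results = rank_sections(sections, "COVID-19 and hypertension", 2, 0.75)
for score, text in results:
    print(f"({score:.4f}): {text}")
```

Word overlap only rewards exact token matches, whereas a semantic index can also match related wording, so scores from this sketch are not comparable to the ones shown above.
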
# Query data with Python

The section above queried for matches via a command line program. The section below shows how the same data can be pulled programmatically with Python.

In [None]:
import pandas as pd

from paperai.models import Models
from paperai.query import Query

from IPython.display import display, HTML

# Load model
embeddings, db = Models.load("paperai")
cur = db.cursor()

def search(query, topn, threshold):
  # Query for best matches
  results = Query.search(embeddings, cur, query, topn, threshold)

  # Get results grouped by document
  documents = Query.documents(results, topn)

  articles = []

  # Build a row for each document, sorted by total match score descending
  for uid in sorted(
    documents, key=lambda k: sum(x[0] for x in documents[k]), reverse=True
  ):
    cur.execute(
      "SELECT Title, Published, Publication, Entry, Id, Reference "
      + "FROM articles WHERE id = ?",
      [uid],
    )
    
    article = cur.fetchone()

    matches = "\n\n".join([text for _, text in documents[uid]])

    article = {
      "Title": article[0],
      "Published": Query.date(article[1]),
      "Publication": article[2],
      "Entry": article[3],
      "Id": article[4],
      "Content": matches,
    }

    articles.append(article)

  df = pd.DataFrame(articles)
  display(HTML(df.to_html(index=False).replace("\\n","<br>")))

search("COVID-19 and hypertension", 2, 0.75)

Loading model from paperai


Title,Published,Publication,Entry,Id,Content
"Associations with covid-19 hospitalisation amongst 406,793 adults: the UK Biobank prospective cohort study",2020-05-11,,2020-05-19,0l10n9n7,"As with the reports from case series, 17 we observed that a history of hypertension was a strong risk factor for severe covid-19. This suggests it is more likely that hypertension per se, in particular hypertension severe enough to require multiple medications, that is the main risk factor for severe covid-19. Additionally, we examine whether antihypertensive medication use is associated with risk of severe covid-19. Nevertheless, questions remain as to why hypertension should be such a strong risk factor for covid-19. Though in univariable analyses all classes of antihypertensive medication were associated with increased risk, in detailed multivariable analyses none of these were independently associated with severe covid-19 infection after adjusting for hypertension history, age, sex and ethnicity; rather it was the number of antihypertensive medications in use that was significantly related, which is probably a surrogate for severity of hypertension."
Management of osteoarthritis during COVID‐19 pandemic,2020-05-21,Clin Pharmacol Ther,2020-06-11,uxfk6k3c,"Also, arterial hypertension may be associated with increased risk of mortality in hospitalized COVID-19 infected subjects (38) . A recent meta-analysis showed that the most prevalent COVID-19 comorbidities were hypertension, cardiovascular diseases and diabetes mellitus (21, 32) , and their presence increased life threatening complications."
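
The `search` function above orders documents by the sum of their match scores, so a document with several moderate matches can outrank one with a single strong match. The sketch below compares that choice with ordering by best score. The `group_by_document` and `rank` helpers, the ids and the scores are all made up for illustration and are not paperai's API.

```python
from collections import defaultdict

def group_by_document(results):
    """Groups (uid, score, text) matches into {uid: [(score, text), ...]}."""
    documents = defaultdict(list)
    for uid, score, text in results:
        documents[uid].append((score, text))
    return dict(documents)

def rank(documents, aggregate):
    """Orders document ids by an aggregate of their match scores, best first."""
    return sorted(
        documents,
        key=lambda uid: aggregate(score for score, _ in documents[uid]),
        reverse=True,
    )

# Toy matches with made-up ids and scores
results = [
    ("doc-a", 0.80, "first match in doc-a"),
    ("doc-a", 0.78, "second match in doc-a"),
    ("doc-b", 0.90, "only match in doc-b"),
]

documents = group_by_document(results)
by_total = rank(documents, sum)  # doc-a ranks first on combined score
by_best = rank(documents, max)   # doc-b ranks first on single best score
```

Summing favors articles that discuss the topic repeatedly, while taking the maximum favors the single closest sentence. Which is preferable depends on the use case.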


# Reports

The last item we'll cover is running a simple report. A report runs a series of similarity queries and applies a list of extractive QA questions to the matches. The result is structured output, such as a CSV file, suited to bulk querying large article datasets.

In [None]:
%%writefile report.yml
name: Report

Hypertension:
    query: COVID-19 and hypertension
    columns:
        - name: Date
        - name: Study
        - {name: Sample Size, query: number of people/patients, question: how many people/patients}
        - {name: Comorbidities, query: covid-19 and hypertension, question: what diseases}

Overwriting report.yml


In [None]:
!python -m paperai.report report.yml 5 csv paperai

Loading model from paperai


In [None]:
display(HTML(pd.read_csv("Hypertension.csv").to_html(index=False)))

Date,Study,Sample Size,Comorbidities
2020-07-13,Disproportionate impact of the COVID-19 pandemic on immigrant communities in the United States,60.8 million cases,"obesity, hypertension, and diabetes--comorbidities"
2020-06-08,"Diet Supplementation, Probiotics, and Nutraceuticals in SARS-CoV-2 Infection: A Scoping Review",,systemic inflammation or endothelial damage
2020-05-21,Management of osteoarthritis during COVID‐19 pandemic,,"hypertension, cardiovascular diseases and diabetes mellitus"
2020-05-11,"Associations with covid-19 hospitalisation amongst 406,793 adults: the UK Biobank prospective cohort study",406793,"1, 2 or 3+ antihypertensive medications"
2020,"COVID-19, Renin-angiotensin System and Hematopoiesis",,Renin-angiotensin System and Hematopoiesis


In this case, the report queries for `COVID-19 and hypertension`, then writes a CSV file (`Hypertension.csv`) containing article metadata (date and study title) along with extracted values for sample size and comorbidities.
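
Extracted values are free text, so the Sample Size column mixes clean numbers (`406793`) with phrases (`60.8 million cases`). As a hedged sketch of one possible clean-up step, the snippet below keeps only values that are plain integers. The `to_int` helper is an assumption rather than part of paperai, and the two rows are copied from the report output above.

```python
import csv
import io
import re

# Two rows copied from the report output above
data = io.StringIO(
    "Date,Study,Sample Size,Comorbidities\n"
    "2020-07-13,Disproportionate impact of the COVID-19 pandemic on immigrant "
    'communities in the United States,60.8 million cases,"obesity, hypertension, '
    'and diabetes--comorbidities"\n'
    '2020-05-11,"Associations with covid-19 hospitalisation amongst 406,793 adults: '
    'the UK Biobank prospective cohort study",406793,"1, 2 or 3+ antihypertensive '
    'medications"\n'
)

def to_int(value):
    """Returns value as an int when it is only digits (commas allowed), otherwise None."""
    cleaned = value.replace(",", "").strip()
    return int(cleaned) if re.fullmatch(r"[0-9]+", cleaned) else None

sizes = {row["Study"]: to_int(row["Sample Size"]) for row in csv.DictReader(data)}

for study, size in sizes.items():
    print(size, "-", study)
```

To run the same check against the generated file, pass `open("Hypertension.csv", newline="")` to `csv.DictReader` in place of the in-memory buffer.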

Historical reports built for the CORD-19 Kaggle Challenge are available [here](https://github.com/neuml/cord19q/tree/master/tasks). There is also a [Streamlit example application](https://github.com/neuml/paperai/blob/master/examples/search.py) available.

# Wrapping up

This notebook gave a brief overview of paperai. Applications range from semantic search to more complex reports. More notebooks will be released in the future covering additional aspects of paperai. 
