<a href="https://colab.research.google.com/github/kyunghyuncho/bio-ret-viz/blob/master/pyserini_covid19_huggingface_demoo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pyserini Demo on COVID-19 Dataset (Title + Abstract Index)


This notebook shows how to get started with searching the [COVID-19 Open Research Dataset](https://pages.semanticscholar.org/coronavirus-research) (release of 2020/03/20) from AI2.
We'll be working with the title + abstract index only.
The full text is not indexed here (that'll come later, soon!).


First, install the Python dependencies:

In [0]:
%%capture
!pip install pyserini==0.8.1.0

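import json
import os

# Pyserini calls into Java (Anserini/Lucene), so JAVA_HOME must point at a JDK.
# This path is the Java 11 location on Colab; adjust it on other machines.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"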

Let's grab the pre-built index:

In [0]:
%%capture
!wget https://www.dropbox.com/s/uvjwgy4re2myq5s/lucene-index-covid-2020-03-20.tar.gz
!tar xvfz lucene-index-covid-2020-03-20.tar.gz

Sanity check of index size (should be 1.3G):

In [0]:
!du -h lucene-index-covid-2020-03-20

1.3G	lucene-index-covid-2020-03-20
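
If `du` is not available (for example, outside Colab), the same sanity check can be approximated in Python. This is a minimal sketch, not part of Pyserini: `dir_size_bytes` and `human_readable` are hypothetical helper names, and because the code sums file sizes rather than disk blocks, the result may differ slightly from what `du` reports.

```python
import os

def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so a link is not counted as the file it points to.
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total

def human_readable(num_bytes):
    """Format a byte count with binary units, e.g. 1536 -> '1.5K'."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G"):
        if size < 1024:
            return f"{size:.1f}{unit}"
        size /= 1024
    return f"{size:.1f}T"
```

Calling `human_readable(dir_size_bytes('lucene-index-covid-2020-03-20'))` should then give a value close to the 1.3G shown above.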


You can use `pysearch` to search over an index. Here's the basic usage:

In [0]:
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('lucene-index-covid-2020-03-20/')
hits = searcher.search('integration degradation synthesis host immune')

# Prints the first 10 hits
for i in range(0, 10):
    print(f'{i+1} {hits[i].docid} {hits[i].score} {hits[i].lucene_document.get("title")} {hits[i].score} {hits[i].lucene_document.get("doi")}')

1 35473 8.178999900817871 Severe Acute Respiratory Syndrome Coronavirus nsp1 Suppresses Host Gene Expression, Including That of Type I Interferon, in Infected Cells 8.178999900817871 10.1128/JVI.02472-07
2 36110 8.103400230407715 Severe acute respiratory syndrome coronavirus nsp1 protein suppresses host gene expression by promoting host mRNA degradation 8.103400230407715 10.1073/pnas.0603144103
3 24257 7.972499847412109 The TRIMendous Role of TRIMs in Virus–Host Interactions 7.972499847412109 10.3390/vaccines5030023
4 42237 7.800899982452393 Selective Degradation of Host RNA Polymerase II Transcripts by Influenza A Virus PA-X Host Shutoff Protein 7.800899982452393 10.1371/journal.ppat.1005427
5 39718 7.789599895477295 Enhanced replication of mouse adenovirus type 1 following virus-induced degradation of protein kinase R (PKR) 7.789599895477295 10.1101/584680
6 7087 7.780099868774414 Antigen Presentation and the Ubiquitin‐Proteasome System in Host–Pathogen Interactions 7.780099868774414

From the hits array, use `.lucene_document` to access the underlying indexed Lucene `Document`, and from there, call `.get(field)` to fetch specific fields, like "title", "doi", etc.
The complete list of available fields is [here](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/generator/CovidGenerator.java#L46).
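
To make the field access above easier to reuse, the pattern can be wrapped in a small helper. This is only a sketch: `hit_to_dict` is a hypothetical name, not a Pyserini function, and it assumes each hit exposes `.docid`, `.score`, and `.lucene_document.get(field)` exactly as used in the cell above. Treating an absent or empty field as `None` is a choice made for this example.

```python
def hit_to_dict(hit, fields=("title", "doi")):
    """Collect the docid, score, and selected stored fields of one search hit.

    `hit` is expected to expose `.docid`, `.score`, and
    `.lucene_document.get(name)`, as the hits in this notebook do.
    Fields that are absent or empty are stored as None.
    """
    record = {"docid": hit.docid, "score": hit.score}
    for name in fields:
        value = hit.lucene_document.get(name)
        # Normalize missing or empty values to None so callers can test for them.
        record[name] = value if value else None
    return record
```

For instance, `hit_to_dict(hits[0])` would return a dictionary with the `docid`, `score`, `title`, and `doi` of the top hit.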

For hit #1, we don't have the full text, but we can access the available metadata via `.raw`:

In [0]:
hit1_json = json.loads(hits[0].raw)
print(json.dumps(hit1_json, indent=4))

{
    "abstract": "The severe acute respiratory syndrome coronavirus (SARS-CoV) nsp1 protein has unique biological functions that have not been described in the viral proteins of any RNA viruses; expressed SARS-CoV nsp1 protein has been found to suppress host gene expression by promoting host mRNA degradation and inhibiting translation. We generated an nsp1 mutant (nsp1-mt) that neither promoted host mRNA degradation nor suppressed host protein synthesis in expressing cells. Both a SARS-CoV mutant virus, encoding the nsp1-mt protein (SARS-CoV-mt), and a wild-type virus (SARS-CoV-WT) replicated efficiently and exhibited similar one-step growth kinetics in susceptible cells. Both viruses accumulated similar amounts of virus-specific mRNAs and nsp1 protein in infected cells, whereas the amounts of endogenous host mRNAs were clearly higher in SARS-CoV-mt-infected cells than in SARS-CoV-WT-infected cells, in both the presence and absence of actinomycin D. Further, SARS-CoV-WT replication st

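The records in this index store the abstract in two different shapes: in a metadata-only record such as hit #1 it is a plain string, while in a full-text record (such as hit #8, shown next) it is a list of paragraph objects with a `text` key. The sketch below normalizes both into a single string. `abstract_text` is a hypothetical helper name, and the sample record is made up for illustration.

```python
import json

def abstract_text(record):
    """Return the abstract of a parsed `.raw` record as one string.

    Handles the two shapes seen in this notebook: a plain string, or a
    list of paragraph objects that each carry a "text" key.
    """
    abstract = record.get("abstract") or ""
    if isinstance(abstract, str):
        return abstract
    # Full-text records: join the paragraph texts in order.
    return " ".join(p.get("text", "") for p in abstract)

# Made-up sample in the full-text shape, for illustration only.
sample = json.loads('{"abstract": [{"text": "First paragraph."}, {"text": "Second."}]}')
print(abstract_text(sample))  # First paragraph. Second.
```

With a real hit, the call would be `abstract_text(json.loads(hits[0].raw))`.
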
For hit #8, we have the full text, which we can also fetch via `.raw`:

In [0]:
hit8_json = json.loads(hits[7].raw)
print(json.dumps(hit8_json, indent=4))

{
    "paper_id": "604397da653890b54d4af23b45adab3365e1f042",
    "metadata": {
        "title": "Functional Analysis of Rift Valley Fever Virus NSs Encoding a Partial Truncation",
        "authors": [
            {
                "first": "J",
                "middle": [
                    "A"
                ],
                "last": "Head",
                "suffix": "",
                "affiliation": {},
                "email": ""
            },
            {
                "first": "B",
                "middle": [],
                "last": "Kalveram",
                "suffix": "",
                "affiliation": {},
                "email": ""
            },
            {
                "first": "T",
                "middle": [],
                "last": "Ikegami",
                "suffix": "",
                "affiliation": {},
                "email": ""
            }
        ]
    },
    "abstract": [
        {
            "text": "Rift Valley fever virus (RVFV), belongs to 