# Pyserini Demo on COVID-19 Dataset (Paragraph Index)


This notebook demonstrates how to get started searching the [COVID-19 Open Research Dataset](https://pages.semanticscholar.org/coronavirus-research) (release of 2020/04/10) from AI2.
Here, we'll be working with the paragraph index.
We have [another notebook](https://github.com/castorini/anserini-notebooks/blob/master/pyserini_covid19_default.ipynb) for working with the simpler title + abstract index.

First, install the Python dependencies:

In [None]:
%%capture
!pip install pyserini==0.9.0.0

import json
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

Let's grab the pre-built index:

In [None]:
%%capture
!wget https://www.dropbox.com/s/ivk87journyajw3/lucene-index-covid-paragraph-2020-04-10.tar.gz
!tar xvfz lucene-index-covid-paragraph-2020-04-10.tar.gz

Sanity check of index size (should be 5.8G):

In [None]:
!du -h lucene-index-covid-paragraph-2020-04-10

5.8G	lucene-index-covid-paragraph-2020-04-10


Now, a bit of explanation of how the index is organized.
For each source article, we create paragraph-level entries in the index. For a hypothetical article with id `docid`, the index will contain:

+ `docid`: title + abstract
+ `docid.00001`: title + abstract + 1st paragraph
+ `docid.00002`: title + abstract + 2nd paragraph
+ `docid.00003`: title + abstract + 3rd paragraph
+ ...

That is, each article is chopped up into individual paragraphs.
Each paragraph is indexed as a "document" (with the title and abstract). 
The `.XXXXX` suffix of the `docid` identifies which paragraph is being indexed (numbered sequentially).
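
The naming scheme above can be taken apart with a small helper. This is only a sketch: `split_docid` is a hypothetical name, not part of Pyserini, and it assumes that a paragraph suffix is always a dot followed by digits.

```python
def split_docid(docid):
    """Split an index docid into (base_docid, paragraph_number).

    paragraph_number is None for the base entry (title + abstract only).
    Assumption: a paragraph suffix is a '.' followed only by digits.
    """
    base, sep, suffix = docid.rpartition('.')
    if sep and suffix.isdigit():
        return base, int(suffix)
    return docid, None


print(split_docid('42saxb98.00002'))  # ('42saxb98', 2)
print(split_docid('06z7p7rc'))        # ('06z7p7rc', None)
```

The base id returned here is the one to use when looking up the article itself rather than one of its paragraphs.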

You can use `pysearch` to search over an index. Here's the basic usage:

In [None]:
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('lucene-index-covid-paragraph-2020-04-10/')
hits = searcher.search('nsp1 synthesis degradation', 10)

# Print the hits that were returned (at most the 10 requested above)
for i, hit in enumerate(hits):
    print(f'{i+1:2} {hit.docid:14} {hit.score:.5f} {hit.lucene_document.get("title")} {hit.lucene_document.get("doi")}')

 1 o328y8ax.00004 10.87830 Modulation of type I interferon induction by porcine reproductive and respiratory syndrome virus and degradation of CREB-binding protein by non-structural protein 1 in MARC-145 and HeLa cells 10.1016/j.virol.2010.03.039
 2 06z7p7rc       10.79910 Severe Acute Respiratory Syndrome Coronavirus nsp1 Suppresses Host Gene Expression, Including That of Type I Interferon, in Infected Cells 10.1128/jvi.02472-07
 3 ncufofro.00026 10.77400 Chapter Five Viral and Cellular mRNA Translation in Coronavirus-Infected Cells 10.1016/bs.aivir.2016.08.001
 4 mtj46j82.00029 10.62580 MERS coronavirus nsp1 participates in an efficient propagation through a specific interaction with viral RNA 10.1016/j.virol.2017.08.026
 5 42saxb98.00002 10.61430 A novel two-pronged strategy to suppress host protein synthesis by SARS coronavirus Nsp1 protein 10.1038/nsmb.1680
 6 42saxb98.00001 10.60960 A novel two-pronged strategy to suppress host protein synthesis by SARS coronavirus Nsp1 protein 1

From the hits array, use `.lucene_document` to access the underlying indexed Lucene `Document`, and from there, call `.get(field)` to fetch specific fields, like "title", "doi", etc.
The complete list of available fields is [here](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/generator/CovidGenerator.java#L46).

Note that we retrieve multiple paragraphs from the same article, "A novel two-pronged strategy to suppress host protein synthesis by SARS coronavirus Nsp1 protein" (hits #5 and #6). This is actually a good thing, because a downstream module can integrate evidence across paragraphs.
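
As a first step toward that kind of evidence integration, paragraph hits can be grouped by their source article. The sketch below is one possible approach, not something Pyserini provides: `group_by_article` is a hypothetical helper, and the `sample` list is a hand-copied stand-in for `[(hit.docid, hit.score) for hit in hits]` so that the snippet runs without the index.

```python
from collections import defaultdict


def group_by_article(scored_docids):
    """Group (docid, score) pairs by base article id, keeping rank order."""
    groups = defaultdict(list)
    for docid, score in scored_docids:
        # Everything before the first '.' is taken as the base article id
        # (assumption: base ids themselves contain no '.').
        base = docid.split('.')[0]
        groups[base].append((docid, score))
    return dict(groups)


# Stand-in data copied from a few of the hits printed above.
sample = [
    ('o328y8ax.00004', 10.87830),
    ('06z7p7rc', 10.79910),
    ('42saxb98.00002', 10.61430),
    ('42saxb98.00001', 10.60960),
]

grouped = group_by_article(sample)
for base, paragraphs in grouped.items():
    print(base, [docid for docid, _ in paragraphs])
```

Because Python dictionaries preserve insertion order, articles come out in the order of their best-ranked paragraph.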

Considering hit #5 (`42saxb98.00002`) and hit #6 (`42saxb98.00001`), use `.contents` of the hit to see exactly what was indexed.

For hit #5:

In [None]:
hits[4].contents.split('\n')

['A novel two-pronged strategy to suppress host protein synthesis by SARS coronavirus Nsp1 protein',
 "Severe acute respiratory syndrome coronavirus nsp1 protein suppresses host gene expression, including type I interferon production, by promoting host mRNA degradation and inhibiting host translation, in infected cells. We present evidence that nsp1 uses a novel, two-pronged strategy to inhibit host translation/gene expression. Nsp1 bound to the 40S ribosomal subunit and inactivated the translational activity of the 40S subunits. Furthermore, the nsp1-40S ribosome complex induced the modification of the 5'-region of capped mRNA template and rendered the template RNA translationally incompetent. Nsp1 also induced RNA cleavage in templates carrying the internal ribosome entry site (IRES) from encephalomyocarditis virus, but not in those carrying IRESs from hepatitis C and cricket paralysis viruses, demonstrating that the nsp1-induced RNA modification was template-dependent. We speculate 

For hit #6:

In [None]:
hits[5].contents.split('\n')

['A novel two-pronged strategy to suppress host protein synthesis by SARS coronavirus Nsp1 protein',
 "Severe acute respiratory syndrome coronavirus nsp1 protein suppresses host gene expression, including type I interferon production, by promoting host mRNA degradation and inhibiting host translation, in infected cells. We present evidence that nsp1 uses a novel, two-pronged strategy to inhibit host translation/gene expression. Nsp1 bound to the 40S ribosomal subunit and inactivated the translational activity of the 40S subunits. Furthermore, the nsp1-40S ribosome complex induced the modification of the 5'-region of capped mRNA template and rendered the template RNA translationally incompetent. Nsp1 also induced RNA cleavage in templates carrying the internal ribosome entry site (IRES) from encephalomyocarditis virus, but not in those carrying IRESs from hepatitis C and cricket paralysis viruses, demonstrating that the nsp1-induced RNA modification was template-dependent. We speculate 

The first two lines contain the title and abstract, respectively, and they are exactly the same for both, since they're from the same article.

To access the full text, we need to fetch the "base" document, which is `42saxb98` (without the `.XXXXX` suffix).
The full text is stored only with the base document, to avoid wasting space by repeating it for every paragraph.

We can use the `searcher` to fetch the document, and then fetch the underlying raw article JSON, as follows:

In [None]:
article = json.loads(searcher.doc('42saxb98').raw())

# Uncomment to print the entire article... warning, it's long! :)
#print(json.dumps(article, indent=4))

article['metadata']['title']

'A novel two-pronged strategy to suppress host protein synthesis by SARS coronavirus Nsp1 protein'
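
Once the article JSON is loaded, its body paragraphs can be pulled out in the same way as the title. The sketch below assumes the CORD-19 JSON layout in which `body_text` is a list of objects with `section` and `text` keys; that layout is not shown in this notebook, so treat it as an assumption. `body_paragraphs` is a hypothetical helper, and `toy_article` is a hand-made stand-in so the snippet runs without the index.

```python
def body_paragraphs(article):
    """Return (section, text) pairs for each body paragraph of an article dict.

    Assumption: the article follows the CORD-19 JSON layout, with a
    'body_text' list whose entries carry 'section' and 'text' keys.
    Missing keys fall back to empty values instead of raising.
    """
    return [(p.get('section', ''), p.get('text', ''))
            for p in article.get('body_text', [])]


# Tiny hand-made stand-in; real articles contain many more fields.
toy_article = {
    'metadata': {'title': 'Example title'},
    'body_text': [
        {'section': 'Introduction', 'text': 'First paragraph.'},
        {'section': 'Results', 'text': 'Second paragraph.'},
    ],
}

for section, text in body_paragraphs(toy_article):
    print(f'[{section}] {text}')
```

With the real data, `body_paragraphs(article)` would be called on the `article` dictionary loaded above.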

Finally, if you want to create a DataFrame comprising all the results, here's a snippet of code to do so:

In [None]:
import pandas as pd

# Build one column per attribute, in rank order
ranks = list(range(1, len(hits) + 1))
docids = [hit.docid for hit in hits]
scores = [hit.score for hit in hits]
titles = [hit.lucene_document.get('title') for hit in hits]
dois = [hit.lucene_document.get('doi') for hit in hits]
data = {'rank': ranks, 'docid': docids, 'score': scores, 'title': titles, 'doi': dois}

df = pd.DataFrame(data)
df

|   | rank | docid | score | title | doi |
|---|------|-------|-------|-------|-----|
| 0 | 1 | o328y8ax.00004 | 10.8783 | Modulation of type I interferon induction by p... | 10.1016/j.virol.2010.03.039 |
| 1 | 2 | 06z7p7rc | 10.7991 | Severe Acute Respiratory Syndrome Coronavirus ... | 10.1128/jvi.02472-07 |
| 2 | 3 | ncufofro.00026 | 10.774 | Chapter Five Viral and Cellular mRNA Translati... | 10.1016/bs.aivir.2016.08.001 |
| 3 | 4 | mtj46j82.00029 | 10.6258 | MERS coronavirus nsp1 participates in an effic... | 10.1016/j.virol.2017.08.026 |
| 4 | 5 | 42saxb98.00002 | 10.6143 | A novel two-pronged strategy to suppress host ... | 10.1038/nsmb.1680 |
| 5 | 6 | 42saxb98.00001 | 10.6096 | A novel two-pronged strategy to suppress host ... | 10.1038/nsmb.1680 |
| 6 | 7 | vj3wk150.00035 | 10.6006 | Regulation of Stress Responses and Translation... | 10.3390/v8070184 |
| 7 | 8 | sq9hh50d.00005 | 10.5451 | Unique SARS-CoV protein nsp1: bioinformatics, ... | 10.1016/j.tim.2006.12.005 |
| 8 | 9 | 42saxb98.00020 | 10.5415 | A novel two-pronged strategy to suppress host ... | 10.1038/nsmb.1680 |
| 9 | 10 | pdfs6ojs.00027 | 10.5229 | Coronavirus nonstructural protein 1: Common an... | 10.1016/j.virusres.2014.11.019 |
