<a href="https://colab.research.google.com/github/khushsi/Aggregator/blob/master/Sparse_Retrieval_%2B_Rerank_SIGIR_2021_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, you will learn how to use the following IR packages:

- Pyserini: for indexing and initial retrieval with either sparse (e.g., BM25) or dense (e.g., Faiss + ANCE) methods.

- Pygaggle: for neural reranking (e.g., with monoT5).

## Installing pygaggle and pyserini

We only need to install pygaggle, as its installation also pulls in pyserini.

In [None]:
!pip install git+https://github.com/castorini/pygaggle.git

Collecting git+https://github.com/castorini/pygaggle.git
  Cloning https://github.com/castorini/pygaggle.git to /tmp/pip-req-build-ft3iidz5
  Running command git clone -q https://github.com/castorini/pygaggle.git /tmp/pip-req-build-ft3iidz5
  Running command git submodule update --init --recursive -q
Collecting coloredlogs==14.0
  Downloading https://files.pythonhosted.org/packages/5c/2f/12747be360d6dea432e7b5dfae3419132cb008535cfe614af73b9ce2643b/coloredlogs-14.0-py2.py3-none-any.whl (43kB)
     |████████████████████████████████| 51kB 7.1MB/s 
Collecting pydantic==1.7.4
  Downloading https://files.pythonhosted.org/packages/ca/fa/d43f31874e1f2a9633e4c025be310f2ce7a8350017579e9e837a62630a7e/pydantic-1.7.4-cp37-cp37m-manylinux2014_x86_64.whl (9.1MB)
     |████████████████████████████████| 9.1MB 35.1MB/s 
Collecting pyserini==0.12.0
  Downloading https://files.pythonhosted.org/packages/d2/71/47eff475a39072b82cb1c368200770fd414c2dd0d0ede4a01dde883a90dc/pyserin

In [None]:
!nvidia-smi

Thu Jul  8 18:33:23 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Initializing the MS MARCO Passage index

A prebuilt index for the MS MARCO Passage corpus is available through pyserini; `from_prebuilt_index` downloads it on first use and loads it.

In [None]:
from pyserini.search import SimpleSearcher

# `SimpleSearcher` uses the BM25 scoring function by default.
searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')

index-msmarco-passage-20201117-f87c94.tar.gz: 0.00B [00:00, ?B/s]

Attempting to initialize pre-built index msmarco-passage.
Downloading index at https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-msmarco-passage-20201117-f87c94.tar.gz...


index-msmarco-passage-20201117-f87c94.tar.gz: 2.07GB [00:44, 50.1MB/s]                            


Extracting /root/.cache/pyserini/indexes/index-msmarco-passage-20201117-f87c94.tar.gz into /root/.cache/pyserini/indexes/index-msmarco-passage-20201117-f87c94.1efad4f1ae6a77e235042eff4be1612d...
Initializing msmarco-passage...


## Initializing the reranker (monoT5)

[monoT5](https://aclanthology.org/2020.findings-emnlp.63.pdf) is a T5-based reranking model trained on the MS MARCO Passage dataset.

In [None]:
from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5

# This step loads the weights of model 
# `castorini/monot5-base-msmarco`, available at HuggingFace's model hub.
reranker = MonoT5() 

Downloading:   0%|          | 0.00/1.84k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/892M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.39M [00:00<?, ?B/s]
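
As described in the monoT5 paper, the model reads a prompt of the form `Query: ... Document: ... Relevant:` and the relevance score is the log-probability of the token "true" versus "false". The sketch below illustrates that idea in plain Python. It does not call pygaggle; the helper names `build_monot5_input` and `true_log_prob`, the toy passage, and the logit values are all hypothetical.

```python
import math

def build_monot5_input(query: str, passage: str) -> str:
    """Format a query/passage pair with the prompt template from the monoT5 paper."""
    return f'Query: {query} Document: {passage} Relevant:'

def true_log_prob(true_logit: float, false_logit: float) -> float:
    """Log-softmax over the two target-token logits; returns log P('true')."""
    m = max(true_logit, false_logit)  # subtract the max for numerical stability
    log_z = m + math.log(math.exp(true_logit - m) + math.exp(false_logit - m))
    return true_logit - log_z

# Toy passage, made up for illustration.
prompt = build_monot5_input('who proposed the geocentric theory',
                            'Ptolemy described a geocentric model.')
print(prompt)

# Made-up logits: a confident "true" gives a score just below zero.
example_score = true_log_prob(4.0, -1.0)
print(f'{example_score:.5f}')
```

Under this reading, a score near 0 means the model is almost certain the passage is relevant, and more negative scores mean less confidence.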

## Sparse Retrieval + Reranking on MS MARCO

In [None]:
import json
from pygaggle.rerank.base import hits_to_texts

# Here's our query:
query = Query('who proposed the geocentric theory')

hits = searcher.search(query.text, k=10)

texts = hits_to_texts(hits)
print('BM25:')
# Let's print out the passages prior to reranking:
for i, text in enumerate(texts):
    content = json.loads(text.text)['contents']
    print(f'{i+1:2} {text.metadata["docid"]:15} {text.score:.5f} {content}')

# Finally, rerank:
reranked = reranker.rerank(query, texts)

print('Reranked:')
# Print out reranked results:
for i, text in enumerate(reranked):
    content = json.loads(text.text)['contents']
    print(f'{i+1:2} {text.metadata["docid"]:15} {text.score:.5f} {content}')

BM25:
 1 7744105         14.35620 For Earth-centered it was  Geocentric Theory proposed by greeks under the guidance of Ptolemy and Sun-centered was Heliocentric theory proposed by Nicolas Copernicus in 16th century A.D. In short, Your Answers are: 1st blank - Geo-Centric Theory. 2nd blank - Heliocentric Theory.
 2 2593796         13.93430 Copernicus proposed a heliocentric model of the solar system â a model where everything orbited around the Sun. Today, with advancements in science and technology, the geocentric model seems preposterous.he geocentric model, also known as the Ptolemaic system, is a theory that was developed by philosophers in Ancient Greece and was named after the philosopher Claudius Ptolemy who lived circa 90 to 168 A.D. It was developed to explain how the planets, the Sun, and even the stars orbit around the Earth.
 3 6217200         13.93430 The geocentric model, also known as the Ptolemaic system, is a theory that was developed by philosophers in Ancient Greec
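
The cell above follows a two-stage pattern: a cheap first stage fetches `k` candidates, then a more expensive model re-scores only those candidates. Here is a minimal, library-free sketch of that control flow. `retrieve_then_rerank`, `toy_retrieve`, `toy_score`, and `TOY_DOCS` are hypothetical stand-ins, not pyserini or pygaggle functions.

```python
def retrieve_then_rerank(query, retrieve, score, k=10):
    """Stage 1: fetch up to k candidates. Stage 2: re-score them and sort by score."""
    candidates = retrieve(query, k)
    rescored = [(docid, text, score(query, text)) for docid, text in candidates]
    return sorted(rescored, key=lambda item: item[2], reverse=True)

# Toy collection, made up for illustration.
TOY_DOCS = {
    'd1': 'copernicus proposed the heliocentric theory',
    'd2': 'ptolemy proposed the geocentric theory',
    'd3': 'cats sleep most of the day',
}

def toy_retrieve(query, k):
    """Return the first k documents (in collection order) sharing any query term."""
    query_terms = set(query.split())
    matches = [(docid, text) for docid, text in TOY_DOCS.items()
               if query_terms & set(text.split())]
    return matches[:k]

def toy_score(query, text):
    """Fraction of query terms that appear in the text."""
    query_terms = query.split()
    text_terms = set(text.split())
    return sum(term in text_terms for term in query_terms) / len(query_terms)

results = retrieve_then_rerank('who proposed the geocentric theory',
                               toy_retrieve, toy_score, k=2)
for rank, (docid, text, s) in enumerate(results, start=1):
    print(f'{rank:2} {docid:4} {s:.5f} {text}')
```

Because the second stage only sees the `k` candidates from the first, a relevant passage that the first stage misses cannot be recovered by reranking.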

## Reranking my own texts

In [None]:
passages = [
    ['7744105', 'For Earth-centered it was  Geocentric Theory proposed by greeks under the guidance of Ptolemy and Sun-centered was Heliocentric theory proposed by Nicolas Copernicus in 16th century A.D. In short, Your Answers are: 1st blank - Geo-Centric Theory. 2nd blank - Heliocentric Theory.'],
    ['2593796', 'Copernicus proposed a heliocentric model of the solar system â\x80\x93 a model where everything orbited around the Sun. Today, with advancements in science and technology, the geocentric model seems preposterous.he geocentric model, also known as the Ptolemaic system, is a theory that was developed by philosophers in Ancient Greece and was named after the philosopher Claudius Ptolemy who lived circa 90 to 168 A.D. It was developed to explain how the planets, the Sun, and even the stars orbit around the Earth.'],
    ['6217200', 'The geocentric model, also known as the Ptolemaic system, is a theory that was developed by philosophers in Ancient Greece and was named after the philosopher Claudius Ptolemy who lived circa 90 to 168 A.D. It was developed to explain how the planets, the Sun, and even the stars orbit around the Earth.opernicus proposed a heliocentric model of the solar system â\x80\x93 a model where everything orbited around the Sun. Today, with advancements in science and technology, the geocentric model seems preposterous.'],
    ['3276926', 'The geocentric model, also known as the Ptolemaic system, is a theory that was developed by philosophers in Ancient Greece and was named after the philosopher Claudius Ptolemy who lived circa 90 to 168 A.D. It was developed to explain how the planets, the Sun, and even the stars orbit around the Earth.ou might want to check out one article on the history of the geocentric model and one regarding the geocentric theory. Here are links to two other articles from Universe Today on what the center of the universe is and Galileo one of the advocates of the heliocentric model.'],
    ['3276925', 'Copernicus proposed a heliocentric model of the solar system â\x80\x93 a model where everything orbited around the Sun. Today, with advancements in science and technology, the geocentric model seems preposterous.Simple tools, such as the telescope â\x80\x93 which helped convince Galileo that the Earth was not the center of the universe â\x80\x93 can prove that ancient theory incorrect.ou might want to check out one article on the history of the geocentric model and one regarding the geocentric theory. Here are links to two other articles from Universe Today on what the center of the universe is and Galileo one of the advocates of the heliocentric model.'],
    ['6217208', 'Copernicus proposed a heliocentric model of the solar system â\x80\x93 a model where everything orbited around the Sun. Today, with advancements in science and technology, the geocentric model seems preposterous.Simple tools, such as the telescope â\x80\x93 which helped convince Galileo that the Earth was not the center of the universe â\x80\x93 can prove that ancient theory incorrect.opernicus proposed a heliocentric model of the solar system â\x80\x93 a model where everything orbited around the Sun. Today, with advancements in science and technology, the geocentric model seems preposterous.'],
    ['4280557', 'The geocentric model, also known as the Ptolemaic system, is a theory that was developed by philosophers in Ancient Greece and was named after the philosopher Claudius Ptolemy who lived circa 90 to 168 A.D. It was developed to explain how the planets, the Sun, and even the stars orbit around the Earth.imple tools, such as the telescope â\x80\x93 which helped convince Galileo that the Earth was not the center of the universe â\x80\x93 can prove that ancient theory incorrect. You might want to check out one article on the history of the geocentric model and one regarding the geocentric theory.'],
    ['4280558', 'A Geocentric theory is an astronomical theory which describes the universe as a Geocentric system, i.e., a system which puts the Earth in the center of the universe, and describes other objects from the point of view of the Earth. Geocentric theory is an astronomical theory which describes the universe as a Geocentric system, i.e., a system which puts the Earth in the center of the universe, and describes other objects from the point of view of the Earth.'],
    ['264181', 'Nicolaus Copernicus (b. 1473â\x80\x93d. 1543) was the first modern author to propose a heliocentric theory of the universe. From the time that Ptolemy of Alexandria (c. 150 CE) constructed a mathematically competent version of geocentric astronomy to Copernicusâ\x80\x99s mature heliocentric version (1543), experts knew that the Ptolemaic system diverged from the geocentric concentric-sphere conception of Aristotle.'],
    ['5183032', "After 1,400 years, Copernicus was the first to propose a theory which differed from Ptolemy's geocentric system, according to which the earth is at rest in the center with the rest of the planets revolving around it."]]

# The initial score (0 here) is only a placeholder: monoT5 ignores it when reranking.
texts = [Text(p[1], {'docid': p[0]}, 0) for p in passages]

# Rerank:
reranked = reranker.rerank(query, texts)

# Print out reranked results:
for i, text in enumerate(reranked):
    print(f'{i+1:2} {text.metadata["docid"]:15} {text.score:.5f} {text.text}')

 1 6217200         -0.01113 The geocentric model, also known as the Ptolemaic system, is a theory that was developed by philosophers in Ancient Greece and was named after the philosopher Claudius Ptolemy who lived circa 90 to 168 A.D. It was developed to explain how the planets, the Sun, and even the stars orbit around the Earth.opernicus proposed a heliocentric model of the solar system â a model where everything orbited around the Sun. Today, with advancements in science and technology, the geocentric model seems preposterous.
 2 7744105         -0.01206 For Earth-centered it was  Geocentric Theory proposed by greeks under the guidance of Ptolemy and Sun-centered was Heliocentric theory proposed by Nicolas Copernicus in 16th century A.D. In short, Your Answers are: 1st blank - Geo-Centric Theory. 2nd blank - Heliocentric Theory.
 3 264181          -0.02000 Nicolaus Copernicus (b. 1473âd. 1543) was the first modern author to propose a heliocentric theory of the universe. From the 
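
The reranked scores are negative and close to zero. Assuming they are log-probabilities of the "true" token (which is how monoT5 scoring is usually described), exponentiating them gives a more readable relevance probability. A small sketch using the three scores visible in the output above:

```python
import math

# (docid, score) pairs copied from the reranked output above.
reranked_scores = [('6217200', -0.01113), ('7744105', -0.01206), ('264181', -0.02000)]

# Assumption: each score is log P('true'), so exp(score) is a probability.
probabilities = [(docid, math.exp(log_prob)) for docid, log_prob in reranked_scores]

for docid, prob in probabilities:
    print(f'{docid:15} {prob:.4f}')
```

The ordering is unchanged by this transformation, since `exp` is monotonic.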

## Reranking with a different model

In [None]:
from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained('castorini/monot5-base-msmarco-10k')
reranker = MonoT5(model=model)

Downloading:   0%|          | 0.00/1.33k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/892M [00:00<?, ?B/s]

In [None]:
# Rerank:
reranked = reranker.rerank(query, texts)

# Print out reranked results:
for i, text in enumerate(reranked):
    print(f'{i+1:2} {text.metadata["docid"]:15} {text.score:.5f} {text.text}')

 1 6217200         -0.00838 The geocentric model, also known as the Ptolemaic system, is a theory that was developed by philosophers in Ancient Greece and was named after the philosopher Claudius Ptolemy who lived circa 90 to 168 A.D. It was developed to explain how the planets, the Sun, and even the stars orbit around the Earth.opernicus proposed a heliocentric model of the solar system â a model where everything orbited around the Sun. Today, with advancements in science and technology, the geocentric model seems preposterous.
 2 2593796         -0.01694 Copernicus proposed a heliocentric model of the solar system â a model where everything orbited around the Sun. Today, with advancements in science and technology, the geocentric model seems preposterous.he geocentric model, also known as the Ptolemaic system, is a theory that was developed by philosophers in Ancient Greece and was named after the philosopher Claudius Ptolemy who lived circa 90 to 168 A.D. It was developed to exp

## Searching my own texts



### Creating pyserini files

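We first need to create JSON Lines (jsonl) files, in which each line holds one JSON object with an `id` field and a `contents` field. A single record, pretty-printed here for readability, looks like this: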

```
{
  "id": "doc1",
  "contents": "this is the contents."
}
```
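
Since a JSON Lines file must have exactly one object per line, it can help to check a file before indexing it. The following is a minimal sketch, not part of pyserini: `write_jsonl` and `read_jsonl` are hypothetical helpers, and the file is written to a temporary directory so nothing in the notebook is overwritten.

```python
import json
import os
import tempfile

def write_jsonl(path, docs):
    """Write one JSON object per line."""
    with open(path, 'w', encoding='utf-8') as fout:
        for doc in docs:
            fout.write(json.dumps(doc) + '\n')

def read_jsonl(path):
    """Read a JSON Lines file and check that every record has 'id' and 'contents'."""
    docs = []
    with open(path, encoding='utf-8') as fin:
        for line_no, line in enumerate(fin, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            doc = json.loads(line)
            missing = {'id', 'contents'} - doc.keys()
            if missing:
                raise ValueError(f'line {line_no} is missing fields: {sorted(missing)}')
            docs.append(doc)
    return docs

docs = [{'id': 'doc1', 'contents': 'this is the contents.'}]
path = os.path.join(tempfile.mkdtemp(), 'example.jsonl')
write_jsonl(path, docs)
loaded = read_jsonl(path)
print(loaded)
```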

We will next create two simple documents:

In [None]:
import json

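# `-p` avoids an error if the directory already exists (e.g., when re-running the cell).
!mkdir -p my_collection_jsonl
# JSON Lines: one JSON object per line, each terminated by a newline.
with open('my_collection_jsonl/file1.jsonl', 'w') as fout:
    fout.write(json.dumps({'id': 'doc1', 'contents': 'this is a document about cats.'}) + '\n')
    fout.write(json.dumps({'id': 'doc2', 'contents': 'this is a document about dogs.'}) + '\n')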

### Indexing

In [None]:
!python -m pyserini.index -collection JsonCollection \
                          -generator DefaultLuceneDocumentGenerator \
                          -threads 1 \
                          -input my_collection_jsonl \
                          -index my_index \
                          -storeRaw

2021-07-08 18:36:45,829 INFO  [main] index.IndexCollection (IndexCollection.java:631) - Setting log level to INFO
2021-07-08 18:36:45,831 INFO  [main] index.IndexCollection (IndexCollection.java:634) - Starting indexer...
2021-07-08 18:36:45,832 INFO  [main] index.IndexCollection (IndexCollection.java:636) - DocumentCollection path: my_collection_jsonl
2021-07-08 18:36:45,832 INFO  [main] index.IndexCollection (IndexCollection.java:637) - CollectionClass: JsonCollection
2021-07-08 18:36:45,835 INFO  [main] index.IndexCollection (IndexCollection.java:638) - Generator: DefaultLuceneDocumentGenerator
2021-07-08 18:36:45,836 INFO  [main] index.IndexCollection (IndexCollection.java:639) - Threads: 1
2021-07-08 18:36:45,836 INFO  [main] index.IndexCollection (IndexCollection.java:640) - Stemmer: porter
2021-07-08 18:36:45,836 INFO  [main] index.IndexCollection (IndexCollection.java:641) - Keep stopwords? false
2021-07-08 18:36:45,837 INFO  [main] index.IndexCollection (IndexCollection.java:6

### Retrieving from my collection

In [None]:
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher('./my_index')
hits = searcher.search('dogs', k=10)

for i, hit in enumerate(hits):
    content = json.loads(hit.raw)['contents']
    print(f'{i+1:2} {hit.docid:4} {hit.score:.5f} {content}')

 1 doc2 0.36480 this is a document about dogs.
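
The score above can be roughly reproduced by hand. The sketch below makes several assumptions: the BM25 parameters are `k1=0.9` and `b=0.4` (commonly cited Anserini defaults), and the formula is the variant without the `(k1 + 1)` factor in the numerator, as used by recent Lucene versions. The function name `lucene_style_bm25` is hypothetical and this is not pyserini's code.

```python
import math

def lucene_style_bm25(tf, df, n_docs, doc_len, avg_doc_len, k1=0.9, b=0.4):
    """BM25 contribution of one query term to one document.

    Assumed variant: no (k1 + 1) factor in the numerator.
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf / (tf + length_norm)

# Two indexed documents; 'dogs' occurs once, and only in doc2.
# Both documents have the same length, so doc_len / avg_doc_len is 1
# regardless of tokenization (6 is simply the whitespace token count).
score = lucene_style_bm25(tf=1, df=1, n_docs=2, doc_len=6, avg_doc_len=6)
print(f'{score:.5f}')
```

The result comes out at about 0.3648, close to the score reported for `doc2`, which is consistent with these assumptions but does not confirm them.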


End of the notebook.