# Overview

In this notebook, you will learn how to:
1. Index a collection of documents
2. Use BM25 to retrieve relevant documents for a given query
3. Rerank the documents retrieved in step 2 with monoT5 or MiniLM

We will use the following IR packages:

- Pyserini: for indexing and initial retrieval with either sparse (e.g., BM25) or dense (e.g., Faiss + ANCE) methods.

- Pygaggle: for neural reranking (e.g., with monoT5).

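The two-stage idea behind the steps above (cheap retrieval over the whole collection, then a costlier scorer over a short candidate list) can be sketched without any IR library. The corpus, the function names, and the overlap-based scorers below are made up for illustration; they only stand in for BM25 and monoT5 and do not reflect how Pyserini or Pygaggle work internally.

```python
from typing import Callable, Dict, List, Tuple

# Toy corpus; ids and texts are invented for this sketch.
TOY_CORPUS: Dict[str, str] = {
    "d1": "travel meals and lodging are deductible business expenses",
    "d2": "how to open a savings account",
    "d3": "business trip expenses such as airfare and hotel costs",
}


def toy_first_stage(query: str, corpus: Dict[str, str], k: int) -> List[Tuple[str, float]]:
    """Cheap candidate retrieval: score each document by the number of shared terms."""
    q_terms = set(query.lower().split())
    scored = [(doc_id, float(len(q_terms & set(text.split()))))
              for doc_id, text in corpus.items()]
    scored = [pair for pair in scored if pair[1] > 0]  # drop documents with no overlap
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def toy_scorer(query: str, text: str) -> float:
    """Stand-in for a neural model: fraction of document terms that occur in the query."""
    q_terms = set(query.lower().split())
    d_terms = text.split()
    return sum(term in q_terms for term in d_terms) / len(d_terms)


def toy_rerank(query: str,
               candidates: List[Tuple[str, float]],
               corpus: Dict[str, str],
               scorer: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Re-score only the candidates and sort them by the new score."""
    rescored = [(doc_id, scorer(query, corpus[doc_id])) for doc_id, _ in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


toy_query = "business trip expenses"
toy_candidates = toy_first_stage(toy_query, TOY_CORPUS, k=2)
toy_reranked = toy_rerank(toy_query, toy_candidates, TOY_CORPUS, toy_scorer)
print(toy_candidates)
print(toy_reranked)
```

The second stage only ever sees the `k` candidates from the first, which is why an expensive model remains affordable in this setup.
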
## Installing Java on the notebook

In [None]:
import os
#%%capture
!curl -O https://download.java.net/java/GA/jdk11/9/GPL/openjdk-11.0.2_linux-x64_bin.tar.gz
!mkdir -p /usr/lib/jvm; mv openjdk-11.0.2_linux-x64_bin.tar.gz /usr/lib/jvm/; cd /usr/lib/jvm/; tar -zxvf openjdk-11.0.2_linux-x64_bin.tar.gz
!update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk-11.0.2/bin/java 1
!update-alternatives --set java /usr/lib/jvm/jdk-11.0.2/bin/java
os.environ["JAVA_HOME"] = "/usr/lib/jvm/jdk-11.0.2"

## Installing pygaggle and pyserini

We only need to install pygaggle, as its installation comes with pyserini.

In [None]:
## Install these packages first so that pip's dependency resolver
## takes all installed packages into account.
!pip install h5py
!pip install typing-extensions
!pip install wheel

In [None]:
!pip install git+https://github.com/castorini/pygaggle.git
!pip install faiss-cpu  # Needed by pyserini, but we will not use dense retrieval in this notebook.

In [None]:
## Again, install these so that pip's dependency resolver takes all installed packages into account.
!pip install h5py
!pip install typing-extensions
!pip install wheel

## Then upgrade the datasets package.
!pip install -U datasets

In [None]:
## check if datasets package is correctly installed. Output should include something like: {'id': '5733be284776f41900661182', 'title': 'University_of_Notre_ ....
!python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])"

In [None]:
!nvidia-smi

## Initializing the FinancialQA (FiQA) index

A prebuilt index for the FiQA corpus is available through pyserini and can be loaded with `from_prebuilt_index`.

In [None]:
from pyserini.search import LuceneSearcher

# LuceneSearcher defaults to BM25 scoring function.
searcher = LuceneSearcher.from_prebuilt_index('beir-v1.0.0-fiqa-flat')

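To make the BM25 scoring function mentioned above less of a black box, here is a minimal sketch of the textbook formula. It is not Lucene's implementation: the whitespace tokenizer is a simplification, and the defaults `k1=0.9` and `b=0.4` are assumed to mirror Pyserini's defaults, so the scores will not match what `searcher.search` returns.

```python
import math
from collections import Counter
from typing import Dict


def bm25_scores(query: str, docs: Dict[str, str],
                k1: float = 0.9, b: float = 0.4) -> Dict[str, float]:
    """Score every document in `docs` against `query` with textbook BM25."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq: Counter = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long documents are penalized, controlled by b.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


bm25_demo = bm25_scores("dogs", {
    "doc1": "this is a document about cats",
    "doc2": "this is a document about dogs",
})
print(bm25_demo)
```

Only `doc2` contains the query term, so it is the only document with a non-zero score.
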
## Initializing the reranker (monoT5)

[monoT5](https://aclanthology.org/2020.findings-emnlp.63.pdf) is a reranking model based on T5 and it was trained on the MS MARCO Passage dataset.

In [None]:
## we need to upgrade our sentence-transformers package first
!pip install -U sentence-transformers

In [None]:
from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5

# This step loads the weights of model 
# castorini/monot5-base-msmarco-10k, available at HuggingFace's model hub.
reranker = MonoT5() 

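As described in the monoT5 paper, the model reads a prompt of the form `Query: ... Document: ... Relevant:` and the relevance score is derived from the logits of the tokens "true" and "false". The sketch below reproduces only that arithmetic; the function names are hypothetical and the logits are invented numbers, not model output.

```python
import math


def monot5_prompt(query: str, document: str) -> str:
    """Build the input template described in the monoT5 paper."""
    return f"Query: {query} Document: {document} Relevant:"


def relevance_log_prob(true_logit: float, false_logit: float) -> float:
    """Log-softmax over the two candidate tokens, returning the entry for 'true'."""
    largest = max(true_logit, false_logit)  # subtract the max for numerical stability
    log_norm = largest + math.log(
        math.exp(true_logit - largest) + math.exp(false_logit - largest)
    )
    return true_logit - log_norm


print(monot5_prompt("what is a business expense?", "Meals on a trip may be deductible."))
# Invented logits: a confident "true" yields a log-probability close to 0.
print(relevance_log_prob(true_logit=4.0, false_logit=-2.0))
print(relevance_log_prob(true_logit=0.0, false_logit=0.0))
```

Because the score is a log-probability, it is never positive, and values closer to 0 indicate higher estimated relevance.
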
## Sparse Retrieval then Reranking

In [None]:
import json
from pygaggle.rerank.base import hits_to_texts

# Here's our query:
query = Query("What is considered a business expense on a business trip?")

hits = searcher.search(query.text, k=10)

texts = hits_to_texts(hits)
print('BM25:')
# Let's print out the passages prior to reranking:
for i, text in enumerate(texts):
    content = json.loads(text.text)['text']
    print(f'{i+1:2} {text.metadata["docid"]:10} {text.score:.5f} {content}')

# Finally, rerank:
reranked = reranker.rerank(query, texts)

print(f"\n{100 * '-'}\n")

print('Reranked:')
# Print out reranked results:
for i, text in enumerate(reranked):
    content = json.loads(text.text)['text']
    print(f'{i+1:2} {text.metadata["docid"]:10} {text.score:.5f} {content}')

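Comparing two long printouts by eye is tedious. A small helper can report how far each document moved between the BM25 ranking and the reranked list. The helper name and the docids below are hypothetical and are not real FiQA output.

```python
from typing import Dict, List


def rank_changes(before: List[str], after: List[str]) -> Dict[str, int]:
    """Map each docid to its rank change; a positive value means it moved up."""
    old_rank = {docid: position for position, docid in enumerate(before)}
    return {docid: old_rank[docid] - position for position, docid in enumerate(after)}


# Hypothetical docids standing in for the two rankings.
bm25_order = ["a", "b", "c", "d"]
reranked_order = ["c", "a", "b", "d"]
print(rank_changes(bm25_order, reranked_order))
```

With the objects from the cell above, the two lists could be built as `[t.metadata["docid"] for t in texts]` and `[t.metadata["docid"] for t in reranked]`, assuming both lists contain the same documents.
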
## Reranking with MiniLM (faster)

MiniLM is roughly 10x faster and smaller than T5-base, but it is somewhat less effective on zero-shot tasks.

In [None]:
from pygaggle.rerank.transformer import SentenceTransformersReranker

# This class uses cross-encoder/ms-marco-MiniLM-L-2-v2 by default
reranker = SentenceTransformersReranker()

In [None]:
reranked = reranker.rerank(query, texts)

# Print out reranked results:
for i, text in enumerate(reranked):
    content = json.loads(text.text)['text']
    print(f'{i+1:2} {text.metadata["docid"]:10} {text.score:.5f} {content}')

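To check the speed difference on your own hardware, a generic timing wrapper is enough. This is a minimal sketch: `timed` is a hypothetical helper, and it is demonstrated on `sorted` so that it runs without any model.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


timed_result, timed_seconds = timed(sorted, [3, 1, 2])
print(timed_result, f"{timed_seconds:.6f}s")
```

In this notebook the same wrapper could be applied as `timed(reranker.rerank, query, texts)`, once for each reranker. A single call is a rough measurement; the first call to a model often includes warm-up overhead.
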
## Searching my own texts

### Creating pyserini files

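We first need to create JSON Lines (jsonl) files in which every line is one JSON object with an `id` and a `contents` field. Extra fields such as `Title` and `sections` may be included. The example below is pretty-printed for readability; in the actual file, each document must sit on a single line:

```
{
  "id": "doc1",
  "contents": "Sea turtle. Habitat, taxonomy.",
  "Title": "sea turtle",
  "sections": ["habitat", "taxonomy"]
}
```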

We will create two simple documents:

In [None]:
import json

# -p avoids an error if the directory already exists (e.g., when the cell is re-run).
!mkdir -p my_collection_jsonl
with open('my_collection_jsonl/file1.jsonl', 'w') as fout:
    # One JSON object per line.
    fout.write(json.dumps({'id': 'doc1', 'contents': 'this is a document about cats.'}) + '\n')
    fout.write(json.dumps({'id': 'doc2', 'contents': 'this is a document about dogs.'}) + '\n')

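Before indexing a larger collection, it can help to check that every line parses and carries the two expected fields. The following is a minimal sketch; `validate_jsonl` is a hypothetical helper, not part of Pyserini, and it is demonstrated on a temporary file so that it does not touch `my_collection_jsonl`.

```python
import json
import os
import tempfile
from typing import List, Sequence


def validate_jsonl(path: str, required: Sequence[str] = ("id", "contents")) -> List[str]:
    """Return a list of problems found; an empty list means the file looks fine."""
    problems = []
    with open(path, encoding="utf-8") as fin:
        for line_no, line in enumerate(fin, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append(f"line {line_no}: invalid JSON ({err.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_no}: expected a JSON object")
                continue
            for field in required:
                if field not in record:
                    problems.append(f"line {line_no}: missing field '{field}'")
    return problems


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as fout:
        fout.write(json.dumps({"id": "doc1", "contents": "cats"}) + "\n")
        fout.write(json.dumps({"id": "doc2"}) + "\n")  # deliberately incomplete
    demo_problems = validate_jsonl(demo_path)

print(demo_problems)
```

For the collection created above, the call would be `validate_jsonl('my_collection_jsonl/file1.jsonl')`.
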
### Indexing

In [None]:
!python -m pyserini.index -collection JsonCollection \
                          -generator DefaultLuceneDocumentGenerator \
                          -threads 1 \
                          -input my_collection_jsonl \
                          -index my_index \
                          -storeRaw

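Conceptually, the indexing step builds an inverted index: a map from each term to the documents that contain it. The toy version below shows only that idea. Its regex tokenizer and function name are assumptions for illustration; Lucene's analysis chain (stemming, stopword handling, on-disk format) is far more elaborate.

```python
import re
from collections import defaultdict
from typing import Dict, List


def build_inverted_index(docs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each lowercase term to the sorted list of docids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"\w+", text.lower()):
            postings[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}


toy_index = build_inverted_index({
    "doc1": "this is a document about cats.",
    "doc2": "this is a document about dogs.",
})
print(toy_index["dogs"])
print(toy_index["document"])
```

A query for "dogs" only has to look at the posting list for that term rather than scanning every document.
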
### Retrieving from my collection

In [None]:
from pyserini.search import LuceneSearcher

searcher = LuceneSearcher('./my_index')
hits = searcher.search('dogs', k=10)

for i in range(len(hits)):
    content = json.loads(hits[i].raw)['contents']
    print(f'{i+1:2} {hits[i].docid:4} {hits[i].score:.5f} {content}')

## Reranking my own texts

In [None]:
passages = [
    ['7744105', 'For Earth-centered it was  Geocentric Theory proposed by greeks under the guidance of Ptolemy and Sun-centered was Heliocentric theory proposed by Nicolas Copernicus in 16th century A.D. In short, Your Answers are: 1st blank - Geo-Centric Theory. 2nd blank - Heliocentric Theory.'],
    ['2593796', 'Copernicus proposed a heliocentric model of the solar system â\x80\x93 a model where everything orbited around the Sun. Today, with advancements in science and technology, the geocentric model seems preposterous.he geocentric model, also known as the Ptolemaic system, is a theory that was developed by philosophers in Ancient Greece and was named after the philosopher Claudius Ptolemy who lived circa 90 to 168 A.D. It was developed to explain how the planets, the Sun, and even the stars orbit around the Earth.'],
    ['6217200', 'The geocentric model, also known as the Ptolemaic system, is a theory that was developed by philosophers in Ancient Greece and was named after the philosopher Claudius Ptolemy who lived circa 90 to 168 A.D. It was developed to explain how the planets, the Sun, and even the stars orbit around the Earth.opernicus proposed a heliocentric model of the solar system â\x80\x93 a model where everything orbited around the Sun. Today, with advancements in science and technology, the geocentric model seems preposterous.'],
    ['4280558', 'A Geocentric theory is an astronomical theory which describes the universe as a Geocentric system, i.e., a system which puts the Earth in the center of the universe, and describes other objects from the point of view of the Earth. Geocentric theory is an astronomical theory which describes the universe as a Geocentric system, i.e., a system which puts the Earth in the center of the universe, and describes other objects from the point of view of the Earth.'],
    ['264181', 'Nicolaus Copernicus (b. 1473â\x80\x93d. 1543) was the first modern author to propose a heliocentric theory of the universe. From the time that Ptolemy of Alexandria (c. 150 CE) constructed a mathematically competent version of geocentric astronomy to Copernicusâ\x80\x99s mature heliocentric version (1543), experts knew that the Ptolemaic system diverged from the geocentric concentric-sphere conception of Aristotle.'],
    ['5183032', "After 1,400 years, Copernicus was the first to propose a theory which differed from Ptolemy's geocentric system, according to which the earth is at rest in the center with the rest of the planets revolving around it."]]

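# The initial score (0) is only a placeholder: the reranker ignores it and assigns its own.
texts = [Text(p[1], {'docid': p[0]}, 0) for p in passages]

# A query that matches the topic of these passages:
query = Query('who proposed the geocentric theory')

# Rerank:
reranked = reranker.rerank(query, texts)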

# Print out reranked results:
for i, text in enumerate(reranked):
    print(f'{i+1:2} {text.metadata["docid"]:10} {text.score:.5f} {text.text}')