# Dense retrieval with ANCE

We will use a [PyTerrier plugin for ANCE](https://github.com/terrierteam/pyterrier_ance) for dense passage retrieval. 

[ANCE](https://github.com/microsoft/ANCE) is a dense retrieval system that encodes each document and each query as a single vector representation. ANCE does not need to be combined with sparse retrieval. Its training mechanism constructs negatives from an Approximate Nearest Neighbor (ANN) index of the corpus, which is updated in parallel with the learning process. This yields more realistic negative training instances than those selected by a sparse retrieval mechanism.
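To make the idea of ANN-based negatives concrete, the following is a minimal sketch and not ANCE's actual training code. It assumes inner-product similarity, uses made-up two-dimensional vectors, and replaces the ANN index with an exhaustive search. The function names `rank_documents` and `select_hard_negatives` are hypothetical.

```python
# Toy sketch of dense scoring and hard-negative selection.
# All vectors are invented values for illustration only.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted by descending inner product with the query."""
    scores = {doc_id: dot(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

def select_hard_negatives(query_vec, doc_vecs, relevant_ids, k=2):
    """Pick the k highest-scoring documents that are not relevant.

    An exhaustive search stands in for the ANN index here.
    """
    ranking = rank_documents(query_vec, doc_vecs)
    return [doc_id for doc_id in ranking if doc_id not in relevant_ids][:k]

doc_vecs = {
    "d1": [0.9, 0.1],
    "d2": [0.8, 0.3],
    "d3": [0.1, 0.9],
    "d4": [0.7, 0.2],
}
query_vec = [1.0, 0.0]

ranking = rank_documents(query_vec, doc_vecs)
hard_negatives = select_hard_negatives(query_vec, doc_vecs, relevant_ids={"d1"})
print(ranking)
print(hard_negatives)
```

The negatives chosen this way are the documents the current encoder already scores highly for the query, which is what makes them harder than negatives sampled from a sparse ranking.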

The experiments are run on the [CORD19 corpus](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7251955/) and the [TREC Covid test collection](https://ir.nist.gov/covidSubmit/).

This exercise is based on the [example](https://colab.research.google.com/github/terrierteam/pyterrier_ance/blob/master/pyterrier_ance_vaswani.ipynb) provided by the PyTerrier team and the [CIKM 2021 Tutorial Notebook](https://notebooks.githubusercontent.com/view/ipynb?browser=chrome&color_mode=auto&commit=451eb743b6a9202f20fde3ac85dbe6ad00103506&device=unknown&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f746572726965722d6f72672f63696b6d323032317475746f7269616c2f343531656237343362366139323032663230666465336163383564626536616430303130333530362f6e6f7465626f6f6b732f6e6f7465626f6f6b345f322e6970796e62&logged_in=false&nwo=terrier-org%2Fcikm2021tutorial&path=notebooks%2Fnotebook4_2.ipynb&platform=android&repository_id=416829271&repository_type=Repository&version=96).

In [1]:
%%capture
!pip install -q python-terrier
!apt install -q --upgrade libomp-dev
!pip install -q --upgrade faiss-gpu==1.6.3
!pip install -q git+https://github.com/terrierteam/pyterrier_ance.git
!pip install -q ipytest

In [2]:
%%capture
import pyterrier as pt
if not pt.started():
  pt.init(tqdm='notebook')

In [3]:
%%capture
# Collecting the topics and qrels of the TREC-COVID19 dataset
cord19 = pt.datasets.get_dataset('irds:cord19/trec-covid')
topics = cord19.get_topics(variant='title')
qrels = cord19.get_qrels()

### BM25 inverted index

We use a pre-built Terrier inverted index for the TREC-COVID19 collection.

In [4]:
%%capture
bm25_index = pt.datasets.get_dataset('trec-covid').get_index('terrier_stemmed')

### ANCE dense index


We download a pre-built ANCE FAISS index for the TREC-COVID19 collection. The indexing procedure generates a number of FAISS shards, together with some additional files.


In [5]:
%%capture
ance_index = pt.datasets.get_dataset('trec-covid').get_index('ance_msmarco_psg')
!ls /content/anceindex
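As a rough illustration of how a sharded index can be queried, the sketch below searches each shard separately and merges the per-shard results into one global top-k list. It is not the plugin's implementation: the shards are plain dictionaries with invented vectors, the search is exhaustive rather than FAISS-based, and the names `search_shard` and `search_all_shards` are hypothetical.

```python
import heapq

def search_shard(shard, query_vec, k):
    """Score every vector in one shard and return its top-k (score, doc_id) pairs."""
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), doc_id)
        for doc_id, vec in shard.items()
    ]
    return heapq.nlargest(k, scored)

def search_all_shards(shards, query_vec, k):
    """Merge the per-shard top-k lists into a single global top-k list."""
    candidates = []
    for shard in shards:
        candidates.extend(search_shard(shard, query_vec, k))
    return heapq.nlargest(k, candidates)

# Two toy shards with made-up 2-dimensional document vectors.
shards = [
    {"a": [1.0, 0.0], "b": [0.2, 0.8]},
    {"c": [0.9, 0.1], "d": [0.5, 0.5]},
]
top = search_all_shards(shards, query_vec=[1.0, 0.0], k=2)
print(top)
```

Taking the top k from every shard before merging is sufficient, because no document outside a shard's own top k can appear in the global top k.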

### Retrieval

We create a BM25 baseline transformer and the ANCE retrieval transformer. Since most documents exceed the maximum length supported by ANCE, a sliding window of 150 tokens (stride 75, with the title prepended) is used to construct passages. As such, passage scores need to be aggregated into document scores, e.g., using pt.text.max_passage().



In [20]:
%%capture
from pyterrier_ance import ANCERetrieval

bm25_retriever = pt.BatchRetrieve(bm25_index, wmodel="BM25")
ance_retriever = ANCERetrieval.from_dataset('trec-covid', 'ance_msmarco_psg') >> pt.text.max_passage()
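The passage construction and score aggregation described above can be sketched in plain Python. This is a simplified illustration, not the code used to build the index: tokens are assumed to be whitespace-separated words, the passage identifier format `docno%pN` is an assumption, and `sliding_passages` and `max_passage` are hypothetical helper names.

```python
def sliding_passages(title, body_tokens, window=150, stride=75):
    """Split a tokenised body into overlapping passages, each prefixed by the title."""
    passages = []
    start = 0
    while True:
        chunk = body_tokens[start:start + window]
        passages.append(title + " " + " ".join(chunk))
        # Stop once the current window reaches the end of the document.
        if start + window >= len(body_tokens):
            break
        start += stride
    return passages

def max_passage(passage_scores):
    """Aggregate passage scores into document scores by taking the maximum.

    Keys are passage ids of the assumed form "docno%pN".
    """
    doc_scores = {}
    for passage_id, score in passage_scores.items():
        docno = passage_id.split("%p")[0]
        doc_scores[docno] = max(score, doc_scores.get(docno, float("-inf")))
    return doc_scores

tokens = [f"t{i}" for i in range(300)]
passages = sliding_passages("Title", tokens)
doc_scores = max_passage({"d1%p0": 1.5, "d1%p1": 2.5, "d2%p0": 2.0})
print(len(passages), doc_scores)
```

With a stride of half the window, every token except those at the very start and end of a document appears in two passages, so a relevant span is unlikely to be cut in half in all of its passages.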

We retrieve the top 50 ranked documents for the official topics and compute several effectiveness metrics. Because each ranking is cut off at 50 documents, recall_100 here is effectively recall at 50.

In [21]:
pt.Experiment(
    [bm25_retriever % 50, ance_retriever % 50], 
    topics,
    qrels,
    eval_metrics=["map", "recip_rank", "P_10", "ndcg_cut_3", "recall_100"],
    names=['BM25', 'ANCE'],
)

***** inference of 50 queries *****

Not running in distributed mode

***** faiss search for 50 queries on 1 shards *****

| name | map | recip_rank | P_10 | ndcg_cut_3 | recall_100 |
|---|---|---|---|---|---|
| BM25 | 0.04827 | 0.807094 | 0.678 | 0.631732 | 0.062774 |
| ANCE | 0.024658 | 0.656 | 0.452 | 0.479889 | 0.037527 |


The lower effectiveness of our ANCE retriever compared to BM25 is likely due to the underlying BERT-based model not having been fine-tuned on COVID-19 and medical documents.
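To make some of the reported metrics concrete, here is a minimal sketch of how reciprocal rank, precision at k, and recall at k can be computed for a single query. It treats relevance as binary and uses invented document ids; the reported figures are averages of such per-query values over all topics. The function names are hypothetical and this is not the evaluation code used by pt.Experiment.

```python
def reciprocal_rank(ranked_docnos, relevant):
    """Return 1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked_docnos, relevant, k=10):
    """Fraction of the top-k positions that hold a relevant document."""
    return sum(1 for docno in ranked_docnos[:k] if docno in relevant) / k

def recall_at_k(ranked_docnos, relevant, k=100):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_docnos[:k]) & relevant) / len(relevant)

# Made-up ranking and relevance judgements for one query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}

rr = reciprocal_rank(ranked, relevant)
p_at_2 = precision_at_k(ranked, relevant, k=2)
rec = recall_at_k(ranked, relevant, k=100)
print(rr, p_at_2, rec)
```

Recall at k cannot count relevant documents that were never retrieved, which is why truncating rankings at 50 documents caps the achievable recall_100.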

### Inspecting the retrieved documents

In [17]:
%%capture
def show_res_with_text_labels(retriever, qid):
    """Displays the texts of the retrieved documents.

    Args:
        retriever: The retriever to be used to retrieve documents.
        qid: Query ID.

    Returns:
        DataFrame with the top 10 retrieved documents, their title and DOI
        URL, and the relevance label from the qrels where one exists.
    """
    def make_doi_url(df):
        # Turn each bare DOI into a URL; an empty DOI yields "https://doi.org/".
        df["doi"] = df["doi"].apply(lambda doi: "https://doi.org/" + doi)
        return df

    pipe = (
        (retriever % 10)
        >> pt.text.get_text(cord19, ["title", "doi"])
        >> pt.apply.generic(make_doi_url)
    )
    res = pipe.transform(topics[topics.qid == qid])
    # Left join keeps unjudged documents; their label and iteration are NaN.
    res = res.merge(qrels, how="left", on=["qid", "docno"])
    return res.sort_values("rank", ascending=True)
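The left merge with the qrels in the function above explains why some rows in the result tables have no label: a retrieved document that was never judged has no matching qrels row. The following plain-Python sketch mimics that behaviour with invented data. The name `attach_labels` and the dictionary layout are hypothetical.

```python
def attach_labels(results, judgements):
    """Left-join results with judgements keyed by (qid, docno).

    Unjudged documents keep their row and receive a label of None.
    """
    return [
        dict(row, label=judgements.get((row["qid"], row["docno"])))
        for row in results
    ]

# Made-up results and relevance judgements.
toy_results = [
    {"qid": "1", "docno": "docA", "rank": 0},
    {"qid": "1", "docno": "docB", "rank": 1},
]
toy_judgements = {("1", "docA"): 2}

labelled = attach_labels(toy_results, toy_judgements)
print(labelled)
```

An inner join would silently drop `docB` instead, hiding the fact that part of the ranking was never assessed.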

In [18]:
show_res_with_text_labels(bm25_retriever, "1")

| qid | docid | docno | rank | score | query | title | doi | label | iteration |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 122553 | 75773gwg | 0 | 11.558188 | coronavirus origin | Zoonotic origins of human coronavirus 2019 (HC... | https://doi.org/ | 2 | 5.0 |
| 1 | 122554 | kn2z7lho | 1 | 11.558188 | coronavirus origin | Zoonotic origins of human coronavirus 2019 (HC... | https://doi.org/ | 2 | 3.0 |
| 1 | 122555 | 4fb291hq | 2 | 11.558188 | coronavirus origin | Zoonotic origins of human coronavirus 2019 (HC... | https://doi.org/ | 1 | 3.0 |
| 1 | 135022 | ne5r4d4b | 3 | 11.558188 | coronavirus origin | Origin and evolution of pathogenic coronaviruses | https://doi.org/10.1038/s41579-018-0118-9 | 0 | 1.5 |
| 1 | 186652 | hl967ekh | 4 | 11.558188 | coronavirus origin | Zoonotic origins of human coronavirus 2019 (HC... | https://doi.org/10.24272/j.issn.2095-8137.2020... | 2 | 3.0 |
| 1 | 120776 | kqqantwg | 5 | 11.417061 | coronavirus origin | Possible Bat Origin of Severe Acute Respirator... | https://doi.org/ | 2 | 5.0 |
| 1 | 158983 | 12dcftwt | 6 | 11.417061 | coronavirus origin | Possible Bat Origin of Severe Acute Respirator... | https://doi.org/10.3201/eid2607.200092 | 2 | 5.0 |
| 1 | 81979 | 8ccl9aui | 7 | 11.324364 | coronavirus origin | Mosaic evolution of the severe acute respirato... | https://doi.org/ | 2 | 1.0 |
| 1 | 68472 | 4dtk1kyh | 8 | 11.255136 | coronavirus origin | Origin of Novel Coronavirus (COVID-19): A Comp... | https://doi.org/10.1101/2020.05.12.091397 | 2 | 3.0 |
| 1 | 104060 | pl48ev5o | 9 | 11.230726 | coronavirus origin | Origin and evolution of the 2019 novel coronav... | https://doi.org/ | 1 | 4.0 |


In [19]:
show_res_with_text_labels(ance_retriever, "1")

***** inference of 1 queries *****

Not running in distributed mode

***** faiss search for 1 queries on 1 shards *****

| qid | query | score | docno | rank | title | doi | label | iteration |
|---|---|---|---|---|---|---|---|---|
| 1 | coronavirus origin | 715.294006 | j1cdoxqs | 0 | Coronavirus | https://doi.org/ | | |
| 1 | coronavirus origin | 714.861938 | be0mr85h | 1 | Coronavirus. | https://doi.org/10.1177/0025817220933546 | | |
| 1 | coronavirus origin | 714.552856 | 9pla28n4 | 2 | Coronaviruses: origin and evolution | https://doi.org/ | | |
| 1 | coronavirus origin | 714.552856 | jkejiuf2 | 3 | Coronaviruses: origin and evolution | https://doi.org/ | 2.0 | 3.0 |
| 1 | coronavirus origin | 714.37854 | bp9xz9wk | 4 | Coronavirus? | https://doi.org/ | | |
| 1 | coronavirus origin | 714.107727 | l0sbncnp | 5 | Exploration on mechanism of anti-coronavirus o... | https://doi.org/10.7501/j.issn.0253-2670.2020.... | | |
| 1 | coronavirus origin | 714.046753 | hmvo5b0q | 6 | Understanding Coronavirus | https://doi.org/ | 1.0 | 5.0 |
| 1 | coronavirus origin | 713.970276 | 8l411r1w | 7 | Discovery of a novel coronavirus, China Rattus... | https://doi.org/10.1128/jvi.02420-14 | 0.0 | 1.0 |
| 1 | coronavirus origin | 713.963501 | utsr0zv7 | 8 | The Human Coronavirus Disease COVID-19: Its Or... | https://doi.org/10.3390/pathogens9050331 | 1.0 | 3.0 |
| 1 | coronavirus origin | 713.77887 | 7mfedn03 | 9 | Coronavirus Infections | https://doi.org/10.1016/b978-1-4160-2406-4.500... | | |
