# PAG Retrieval on MS MARCO Sample
This notebook demonstrates how to run the PAG model on a small sample of the MS MARCO dataset. It follows the instructions from the repository and downloads a pre-trained model along with the necessary mapping files.


## 1. Environment Setup
Install dependencies and clone the PAG repository.

In [None]:
!git clone https://github.com/HansiZeng/PAG.git
%cd PAG
!pip install -r requirements.txt
!pip install torch==1.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
!pip install faiss-gpu==1.7.2
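
Before moving on, it can help to confirm that the packages installed above are importable from the notebook kernel. The following is a minimal sketch, not part of the PAG repository: the helper name `missing_packages` is hypothetical, and the module names `torch` and `faiss` are assumed from the install commands above.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names assumed from the pip commands above (faiss-gpu installs `faiss`).
missing = missing_packages(["torch", "faiss"])
if missing:
    print("Not importable:", ", ".join(missing))
else:
    print("torch and faiss are importable.")
```

If anything is reported as missing, restarting the kernel after installation is a common first thing to try.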


## 2. Download Pre-trained Model and Mapping Files
The following commands fetch a pre-trained checkpoint and the document ID mappings required for inference. The files are hosted on Google Drive in the [`PAG-data`](https://drive.google.com/drive/folders/1q8FeHQ6nxPYpl1Thqw8mS-2ndzf7VZ9y) folder.

In [None]:
import os, zipfile
!pip install gdown
os.makedirs('data', exist_ok=True)
%cd data
# Download model checkpoint (replace file id with actual id)
!gdown 1A7VYJZkxxxxxx -O pag_model.zip
with zipfile.ZipFile('pag_model.zip', 'r') as zf:
    zf.extractall()
# Download docid_to_tokenids files (replace file id with actual id)
!gdown 1B8xxxxxxx -O docids.zip
with zipfile.ZipFile('docids.zip', 'r') as zf:
    zf.extractall()
%cd ..
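
A failed or misdirected download can leave a file on disk that is not a valid archive, in which case `zipfile.ZipFile` raises an error. The sketch below checks each downloaded file and lists a few of its members. The helper name `zip_listing` is hypothetical, and the paths assume the file names used in the cell above.

```python
import zipfile


def zip_listing(path, limit=10):
    """Return up to `limit` member names, or None if `path` is not a valid zip."""
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()[:limit]


# File names assumed from the download cell above.
for archive in ["data/pag_model.zip", "data/docids.zip"]:
    names = zip_listing(archive)
    if names is None:
        print(f"{archive}: missing or not a valid zip archive")
    else:
        print(f"{archive}: {names}")
```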


## 3. Prepare a Small MS MARCO Sample
We use the Hugging Face `datasets` library to fetch a small subset of MS MARCO passages and queries, which is then used to illustrate the retrieval pipeline.

In [None]:
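from datasets import load_dataset
import os

sample_size = 1000   # maximum number of passages to write
queries_size = 10    # number of queries to write

# The `ms_marco` v1.1 config exposes train/validation/test splits rather than
# separate 'passages' and 'queries' splits. Each record holds one query plus its
# candidate passages under record['passages']['passage_text'].
records = load_dataset('ms_marco', 'v1.1', split=f'validation[:{sample_size}]')

queries, passages = [], []
for r in records:
    if len(queries) < queries_size:
        queries.append((r['query_id'], r['query']))
    passages.extend(r['passages']['passage_text'])
passages = passages[:sample_size]

# v1.1 carries no passage ids, so sequential integers are assigned here.
# Whitespace is collapsed so tabs or newlines in the text cannot break the TSV.
os.makedirs('data/msmarco-sample', exist_ok=True)
with open('data/msmarco-sample/full_collection', 'w', encoding='utf-8') as f:
    for pid, text in enumerate(passages):
        clean = ' '.join(text.split())
        f.write(f"{pid}\t{clean}\n")

os.makedirs('data/msmarco-sample/sample_queries', exist_ok=True)
with open('data/msmarco-sample/sample_queries/queries', 'w', encoding='utf-8') as f:
    for qid, query in queries:
        clean = ' '.join(query.split())
        f.write(f"{qid}\t{clean}\n")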

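Since the collection and query files are plain `id<TAB>text` rows, a quick format check can catch malformed lines before inference. This is a minimal sketch with a hypothetical helper, `check_tsv`. It assumes exactly two tab-separated, non-empty fields per row, which matches what the cell above writes but is not confirmed as the format the PAG scripts require.

```python
import os


def check_tsv(path):
    """Count rows in an id<TAB>text file; raise ValueError on a malformed row."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2 or not fields[0] or not fields[1]:
                raise ValueError(f"{path}:{lineno}: expected 2 tab-separated fields")
            count += 1
    return count


# Paths assumed from the cell above; files that do not exist yet are skipped.
for path in ["data/msmarco-sample/full_collection",
             "data/msmarco-sample/sample_queries/queries"]:
    if os.path.isfile(path):
        print(path, "rows:", check_tsv(path))
```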

## 4. Run PAG Inference
We modify the evaluation script to point to our sample dataset and execute it. The script will output retrieval results and evaluation metrics.

In [None]:
import pathlib

# Point the evaluation script at the sample data. Note that this edits the file in place.
script_path = pathlib.Path('full_scripts/full_lexical_ripor_evaluate.sh')
text = script_path.read_text()
text = text.replace('data/msmarco-full', 'data/msmarco-sample')
script_path.write_text(text)

# Run the patched script.
!bash full_scripts/full_lexical_ripor_evaluate.sh
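
Because the replacement above rewrites the script in place, it may be worth keeping a copy of the original and confirming that the substitution matched anything at all. The following sketch does both. The helper name `patch_script` and the `.bak` suffix are hypothetical choices, not part of the PAG repository.

```python
import pathlib
import shutil


def patch_script(path, old, new):
    """Replace `old` with `new` in the file at `path`.

    Keeps a one-time `.bak` copy of the original and returns the number of
    occurrences replaced (0 means the file was left untouched).
    """
    path = pathlib.Path(path)
    text = path.read_text()
    hits = text.count(old)
    if hits == 0:
        return 0
    backup = path.with_name(path.name + ".bak")
    if not backup.exists():
        shutil.copy2(path, backup)
    path.write_text(text.replace(old, new))
    return hits


script = pathlib.Path("full_scripts/full_lexical_ripor_evaluate.sh")
if script.exists():
    n = patch_script(script, "data/msmarco-full", "data/msmarco-sample")
    print(f"Replaced {n} occurrence(s) in {script}")
```

A return value of 0 means either that the script was already patched or that it does not reference `data/msmarco-full`, so the script should be inspected by hand in that case.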


## 5. Inspect Retrieval Outputs
The retrieval results are stored under the model directory's output folder. This cell locates the JSON files written there and prints the first few entries of the first one it finds.

In [None]:
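import glob, json

# Search the experiment directory recursively for JSON output files.
run_files = sorted(glob.glob('data/experiments-full-lexical-ripor/**/*.json', recursive=True))
if not run_files:
    print('No JSON files found; check that the evaluation script completed.')
else:
    path = run_files[0]
    print(path)
    with open(path) as f:
        run = json.load(f)
    # The file may hold a list or a dict; preview five entries either way.
    preview = run[:5] if isinstance(run, list) else dict(list(run.items())[:5])
    print(json.dumps(preview, indent=2))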
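
To read a run more easily, the highest-scoring documents per query can be extracted. The sketch below assumes the run is a nested mapping of the form `{query_id: {doc_id: score}}`. That layout is an assumption, not something confirmed for PAG's output files. The helper name `top_k` and the toy values are made up purely for illustration.

```python
def top_k(run, k=3):
    """Return the k highest-scoring (doc_id, score) pairs for each query.

    Assumes `run` has the form {query_id: {doc_id: score}}.
    """
    return {
        qid: sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
        for qid, scores in run.items()
    }


# Toy run with made-up ids and scores, only to show the shape of the result.
toy_run = {"q1": {"d1": 0.2, "d2": 0.9, "d3": 0.5}}
print(top_k(toy_run, k=2))  # {'q1': [('d2', 0.9), ('d3', 0.5)]}
```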
