# Demo for Patapsco

Welcome to the demo notebook for patapsco! 

In this notebook, we walk you through several commonly used CLIR retrieval models using patapsco.

## Getting Started

First, we need to install the packages. We use PyTerrier here as an example of how other monolingual retrieval frameworks can be integrated into patapsco.

In [None]:
!pip install 'python-terrier>=0.7.1' git+https://github.com/hltcoe/patapsco  --upgrade

import stanza
stanza.download('zh')
stanza.download('en')

Then we download the demo data from our GitHub repository into the colab workspace. 

In [None]:
!wget https://raw.githubusercontent.com/hltcoe/patapsco/demo-notebook/samples/notebooks/zho_eng_clean_reduced_pdt.dict
!wget https://raw.githubusercontent.com/hltcoe/patapsco/demo-notebook/samples/data/cc-news-zho-1000.jsonl
!wget https://raw.githubusercontent.com/hltcoe/patapsco/demo-notebook/samples/data/dev.topics.v1-0.jsonl
!wget https://raw.githubusercontent.com/hltcoe/patapsco/demo-notebook/samples/data/zho.toy-dev.qrels.v1-0.txt
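Before configuring a run, it can help to peek at the downloaded files. The sketch below reads the first few records of a JSON Lines file using only the standard library. `read_jsonl` is a hypothetical helper written for this notebook, and it makes no assumption about the field names in the demo data.

```python
import json
from itertools import islice
from pathlib import Path


def read_jsonl(path, limit=None):
    """Return up to `limit` records from a JSON Lines file (one JSON object per line)."""
    with Path(path).open(encoding="utf8") as f:
        # Skip blank lines so a trailing newline does not raise a decode error.
        records = (json.loads(line) for line in f if line.strip())
        return list(islice(records, limit))


demo_path = Path("cc-news-zho-1000.jsonl")
if demo_path.exists():
    for doc in read_jsonl(demo_path, limit=2):
        # Print only the field names; the values can be long news articles.
        print(sorted(doc.keys()))
```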

Finally, we start importing the major packages. 

In [1]:
import patapsco

import pyterrier as pt
if not pt.started():
    pt.init(tqdm='notebook')

PyTerrier 0.7.0 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)


No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/eyang/miniconda3/envs/patapsco/lib/python3.8/site-packages/pyserini/resources/jars/anserini-0.13.1-fatjar.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/eyang/.pyterrier/terrier-assemblies-5.6-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.SimpleLoggerFactory]


In [2]:
import copy
import random
from pathlib import Path

import pandas as pd

The configuration can be specified through a dictionary in Python. 
If you would like to run `patapsco` through our command line interface, the configuration can be specified in a YAML file instead.

A config file can specify a partial run from the start, e.g. from the beginning to indexing, or starting from an existing run, e.g., retrieval based on a pre-built index. 

## A full retrieval pipeline -- from reading documents to producing scores

The following is a configuration dictionary of a full `patapsco` pipeline. 

Here, we use a query translation approach as an example. Patapsco has the ability to select the source of the queries for the topics to support CLIR experiments. 

In [4]:
config_qt = {
    # The identifier of the run -- patapsco prevents the same run from executing twice
    "run": {
        "name": "query translation" 
    },
    
    # Documents for retrieval
    "documents": {
        # The source of the collection
        # To use `ir_datasets`, specify `irds` as the format and the name of the dataset as the path
        "input": {
            "format": "json",
            "lang": "zho",
            "encoding": "utf8",
            "path": "cc-news-zho-1000.jsonl",
        },
        
        # The preprocessing of the documents
        "process": {
            "normalize": {
                "lowercase": True,
            },
            "tokenize": "jieba",
            "strict_check": True,
            "stopwords": "lucene"
        },
        
        # comments are omitted but good for documenting the process
        "comment": "Small CC collection for demo", 
    },
    
    # We store the preprocessed documents in a sqlite database
    "database": {
        "name": "sqlite"
    },
    
    # The index of the collection
    "index": {
        "name": "lucene"
    },
    
    # The format of the topic file. We support json(l), xml, SGML (the TREC style), and ir_datasets
    "topics": {
        "input": {
            "format": "json",
            "lang": "zho",
            "source": "human translation", # selecting the human translation version
            "encoding": "utf8",
            "path": "dev.topics.v1-0.jsonl"
        },
        
        # Here we use the title queries
        "fields": "title"
    },
    
    # Query text preprocessing -- we check to make sure this aligns with the document preprocessing
    "queries": {
        "process": {
            "normalize": {
                "lowercase": True,
            },
            "tokenize": "jieba",
            "stopwords": "lucene"
        }
    },
    
    # Specifying the retrieval model
    "retrieve": {
        "name": "bm25",
        "number": 5
    },
    
    # Evaluating the run
    "score": {
        "input": {
            "path": "zho.toy-dev.qrels.v1-0.txt"
        }
    }
}
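The comment on the `queries` section notes that query preprocessing is checked against document preprocessing. As a rough illustration of what such a check involves, the sketch below compares the two `process` blocks of a configuration dictionary. `preprocessing_mismatches` is a hypothetical helper written for this notebook, not patapsco's own validation code, and the choice to ignore the `strict_check` key is an assumption.

```python
def _flatten(d, prefix=""):
    """Flatten nested dicts into {'a.b': value} pairs for easy comparison."""
    flat = {}
    for key, value in d.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(_flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def preprocessing_mismatches(config, ignore=("strict_check",)):
    """List the settings that differ between document and query preprocessing."""
    docs = _flatten(config["documents"]["process"])
    queries = _flatten(config["queries"]["process"])
    mismatches = {}
    for key in sorted(set(docs) | set(queries)):
        if key in ignore:
            continue  # assumption: this flag controls the check and is not compared
        if docs.get(key) != queries.get(key):
            mismatches[key] = (docs.get(key), queries.get(key))
    return mismatches


# Toy configuration with a deliberate mismatch in lowercasing.
toy_config = {
    "documents": {"process": {"normalize": {"lowercase": True}, "tokenize": "jieba", "stopwords": "lucene"}},
    "queries": {"process": {"normalize": {"lowercase": False}, "tokenize": "jieba", "stopwords": "lucene"}},
}
print(preprocessing_mismatches(toy_config))
```

Called on `config_qt` from the cell above, the helper should return an empty dictionary, since the two `process` blocks differ only in `strict_check`.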

The following cell executes the pipeline. The log is printed to the console and is also stored in the run directory. 

In [5]:
runner = patapsco.Runner(config_qt)
runner.run()

2021-10-20 23:23:58,420 - patapsco.run - INFO - Patapsco version 1.0.0-dev
2021-10-20 23:23:58,425 - patapsco.run - INFO - Writing output to: /Users/eyang/Documents/Repositories/patapsco/samples/notebooks/runs/query-translation
2021-10-20 23:23:58,554 - patapsco.job - INFO - Stage 1 is a streaming pipeline.
2021-10-20 23:23:58,555 - patapsco.job - INFO - Stage 1 pipeline: Hc4JsonDocumentReader | DocumentProcessor | DatabaseWriter | LuceneIndexer
2021-10-20 23:23:58,558 - patapsco.retrieve - INFO - Index location: runs/query-translation/index
2021-10-20 23:23:58,559 - patapsco.job - INFO - Stage 2 is a streaming pipeline.
2021-10-20 23:23:58,560 - patapsco.job - INFO - Stage 2 pipeline: Hc4JsonTopicReader | TopicProcessor | QueryProcessor | QueryWriter | PyseriniRetriever | JsonResultsWriter | TrecResultsWriter
2021-10-20 23:23:58,561 - patapsco.job - INFO - Starting run: query translation
2021-10-20 23:23:58,561 - patapsco.job - INFO - Stage 1: Starting processing of documents
2021-10-

Everything is stored in the run directory.

In [6]:
!ls runs/query-translation

config.yml        patapsco.log      retrieve
database          processed_queries scores.txt
index             results.txt       timing.json


The configuration is dumped into a YAML file. 

By executing `patapsco runs/query-translation/config.yml`, you can reproduce the exact same run. 

In [7]:
!cat runs/query-translation/config.yml

database:
  name: sqlite
  output: database
documents:
  comment: Small CC collection for demo
  input:
    encoding: utf8
    format: json
    lang: zho
    path: /Users/eyang/Documents/Repositories/patapsco/samples/data/cc-news-zho-1000.jsonl
  output: false
  process:
    normalize:
      lowercase: true
      report: false
    stem: false
    stopwords: lucene
    strict_check: true
    tokenize: jieba
index:
  name: lucene
  output: index
queries:
  output: processed_queries
  parse: false
  process:
    normalize:
      lowercase: true
      report: false
    stem: false
    stopwords: lucene
    strict_check: true
    tokenize: jieba
retrieve:
  b: 0.4
  fb_docs: 10
  fb_terms: 10
  input:
    index:
      path: index
  k1: 0.9
  log_explanations: false
  log_explanations_cutoff: 10
  mu: 1000
  name: bm25
  number: 5
  original_query_weight: 0.5
  output: retrieve
  parse: false
  psq: false
  rm3: false
  rm3_logging: false
r

## Running Probabilistic Structured Query (PSQ)

PSQ is one of the most common CLIR approaches. It leverages the translation table from statistical machine translation models. 

Patapsco supports it natively. For this demo, we provide a Chinese-English translation table. 
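To make the idea concrete, here is a minimal sketch of PSQ-style matching. Each English query term is mapped to its Chinese translations, the translation list is truncated once the cumulative probability reaches a threshold, and the frequency of the query term in a document is estimated as the probability-weighted sum of the translations' frequencies. The toy table, its probabilities, and the reading of `threshold` as a cumulative-probability cutoff are all assumptions made for illustration; this is not patapsco's implementation.

```python
from collections import Counter

# Toy translation table: English term -> {Chinese term: P(Chinese | English)}.
# The entries and probabilities are made up for illustration.
TOY_TABLE = {
    "bank": {"银行": 0.7, "河岸": 0.2, "岸": 0.1},
}


def truncate_translations(translations, threshold):
    """Keep the most probable translations until their cumulative probability
    reaches `threshold`, then renormalize the kept probabilities to sum to 1."""
    kept, total = {}, 0.0
    for term, prob in sorted(translations.items(), key=lambda kv: kv[1], reverse=True):
        kept[term] = prob
        total += prob
        if total >= threshold:
            break
    if total == 0:
        return {}  # no usable translations
    return {term: prob / total for term, prob in kept.items()}


def psq_term_frequency(query_term, doc_tokens, table, threshold=0.97):
    """Estimate the frequency of an English query term in a tokenized Chinese document."""
    counts = Counter(doc_tokens)
    translations = truncate_translations(table.get(query_term, {}), threshold)
    return sum(prob * counts[term] for term, prob in translations.items())


doc = ["银行", "宣布", "银行", "河岸"]
print(psq_term_frequency("bank", doc, TOY_TABLE, threshold=0.8))
```

The estimated frequency can then stand in for the ordinary term frequency in a ranking function such as BM25.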

Besides running everything from scratch, patapsco has the ability to use components from existing runs. 

Here, we use the index built in the query translation run and rerun the experiment starting from retrieval. 
This allows better reproducibility and more efficient ablation studies. 
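One way to set up such a rerun is to derive the new configuration from an earlier one. The sketch below copies a full configuration, drops the sections that belong to the indexing stage, and points retrieval at an existing index, mirroring the `retrieve.input.index.path` layout used in this notebook. `retrieval_only_config` is a hypothetical helper, and treating `documents`, `database`, and `index` as the sections to drop is an assumption based on the configurations shown here.

```python
import copy


def retrieval_only_config(full_config, name, index_path):
    """Return a copy of `full_config` that starts from retrieval on an existing index."""
    config = copy.deepcopy(full_config)  # leave the original dictionary untouched
    for section in ("documents", "database", "index"):
        config.pop(section, None)  # indexing-stage sections are not needed
    config["run"] = {"name": name}  # a new name keeps the two runs separate
    config.setdefault("retrieve", {})["input"] = {"index": {"path": index_path}}
    return config


# Toy configuration; the paths are placeholders.
toy_full = {
    "run": {"name": "full"},
    "documents": {"input": {"path": "docs.jsonl"}},
    "index": {"name": "lucene"},
    "retrieve": {"name": "bm25", "number": 5},
}
toy_rerun = retrieval_only_config(toy_full, "rerun", "runs/full/index")
print(toy_rerun)
```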

In [8]:
config_psq = {
    "run": {
        "name": "PSQ" 
    },
        
    # Now we select the original English queries (and work only on topics that are supported for Chinese)
    "topics": {
        "input": {
            "format": "json",
            "lang": "eng",
            "filter_lang": "zho",
            "source": "original",
            "encoding": "utf8",
            "path": "dev.topics.v1-0.jsonl"
        },
        "fields": "title"
    },
    
    # Query text preprocessing for PSQ
    "queries": {
        "output": "processed_queries",
        "parse": False,
        "process": {
            "normalize": {
                "lowercase": True,
                "report": False
            },
            "stem": False,
            "stopwords": "lucene",
            "strict_check": False,
            "tokenize": "moses"
        },
        "psq": {
            "lang": "eng",
            "normalize": {
                "lowercase": True,
                "report": False
            },
            "path": "zho_eng_clean_reduced_pdt.dict",
            "stem": False,
            "stopwords": "lucene",
            "threshold": 0.97
        }
    },
    
    "retrieve": {
        "input": {
            "index": {
                # Use the index from the previous run
                "path": "runs/query-translation/index"
            }
        },
        "name": "bm25",
        "number": 1000,
        "psq": True # Use PSQ
    },
    
    # Evaluating the run
    "score": {
        "input": {
            "path": "zho.toy-dev.qrels.v1-0.txt"
        }
    }
}

runner = patapsco.Runner(config_psq)
runner.run()

2021-10-20 23:24:17,392 - patapsco.run - INFO - Patapsco version 1.0.0-dev
2021-10-20 23:24:17,396 - patapsco.run - INFO - Writing output to: /Users/eyang/Documents/Repositories/patapsco/samples/notebooks/runs/PSQ
2021-10-20 23:24:17,408 - patapsco.retrieve - INFO - Index location: /Users/eyang/Documents/Repositories/patapsco/samples/notebooks/runs/query-translation/index
2021-10-20 23:24:17,409 - patapsco.job - INFO - Stage 2 is a streaming pipeline.
2021-10-20 23:24:17,410 - patapsco.job - INFO - Stage 2 pipeline: Hc4JsonTopicReader | TopicProcessor | QueryProcessor | QueryWriter | PyseriniRetriever | JsonResultsWriter | TrecResultsWriter
2021-10-20 23:24:17,412 - patapsco.job - INFO - Starting run: PSQ
2021-10-20 23:24:17,415 - patapsco.job - INFO - Stage 2: Starting processing of topics
2021-10-20 23:24:19,620 - patapsco.text - INFO - Loading the xx spacy model
2021-10-20 23:24:32,723 - patapsco.retrieve - INFO - Using PSQ
2021-10-20 23:24:32,724 - patapsco.retrieve - INFO - Using 

## Running a T5 Reranker from PyTerrier

We can also rerank the results from the initial retrieval model. 
You can define your own reranker with your favorite reranking framework. 
Here, we use the T5 reranker supported by PyTerrier as an example. 

In [None]:
# install T5 
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_t5

We define our custom reranker by implementing the abstract class `patapsco.Reranker`.  

In [12]:
from pyterrier_t5 import MonoT5ReRanker
pt_t5reranker = MonoT5ReRanker(text_field='text')

class T5Reranker(patapsco.Reranker):
    LOGGER = patapsco.get_logger("reranker")
        
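    def process(self, results):
        # Build a DataFrame with one row per retrieved document, attach the document
        # text from the patapsco database, and pass it through the PyTerrier reranker.
        df = pd.DataFrame(results.results)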
        res = df.assign(text=df.doc_id.apply(lambda k: self.db[k].text), query=results.query, qid=1)\
                .rename(columns={'doc_id': 'docno'})\
                .pipe(pt_t5reranker).sort_values('score', ascending=False)\
                .apply(lambda x: patapsco.results.Result(doc_id=x.docno, rank=x['rank'], score=x.score), axis=1).tolist()
        
        return patapsco.Results(results.query, results.doc_lang, 'T5Rerank', res)

Register it with patapsco. 

In [13]:
patapsco.RerankFactory.register('t5', T5Reranker)
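In the reranker above, `process` receives the retrieved documents for one query and returns them rescored. The stand-alone sketch below shows that logic with stand-in types so that it runs without patapsco or a neural model. `ToyResult`, `rerank`, and the term-overlap scorer are all hypothetical and are not part of patapsco or PyTerrier.

```python
from dataclasses import dataclass


@dataclass
class ToyResult:
    doc_id: str
    rank: int
    score: float


def rerank(query, results, texts, scorer):
    """Re-score each result with `scorer(query, text)` and reassign ranks from 1."""
    rescored = [(scorer(query, texts[r.doc_id]), r.doc_id) for r in results]
    # Sort by descending score; ties are broken by doc_id for a deterministic output.
    rescored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [ToyResult(doc_id, rank, score)
            for rank, (score, doc_id) in enumerate(rescored, start=1)]


def overlap_scorer(query, text):
    """Toy scorer: the number of distinct query words that appear in the text."""
    return float(len(set(query.split()) & set(text.split())))


texts = {"d1": "stock market news", "d2": "weather report", "d3": "market report today"}
first_stage = [ToyResult("d2", 1, 3.0), ToyResult("d1", 2, 2.0), ToyResult("d3", 3, 1.0)]
for result in rerank("market report", first_stage, texts, overlap_scorer):
    print(result)
```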

In [14]:
config_rerank = {
    "run": {
        "name": "T5 Reranking" 
    },
    
    "topics": {
        "input": {
            "format": "json",
            "lang": "zho",
            "source": "20211005-scale21-sockeye2-tm1", # machine query translation
            "encoding": "utf8",
            "path": "dev.topics.v1-0.jsonl"
        },
        "fields": "title"
    },
    
    "queries": {
        "process": {
            "normalize": {
                "lowercase": True,
            },
            "tokenize": "jieba",
            "stopwords": "lucene"
        }
    },
    
    "retrieve": {
        "input": {
            "index": {
                "path": "runs/query-translation/index"
            }
        },
        "name": "bm25",
        "number": 100
    },
    
    # Define the reranker here; the name must match the one registered with patapsco.
    "rerank": {
        "input": {
            "database": {
                "path": "runs/query-translation/database"
            }
        },
        "name": "t5",
        "output": "rerank"
    },
    
    # Evaluating the run
    "score": {
        "input": {
            "path": "zho.toy-dev.qrels.v1-0.txt"
        }
    }
}

runner = patapsco.Runner(config_rerank)
runner.run()

2021-10-20 23:25:43,407 - patapsco.run - INFO - Patapsco version 1.0.0-dev
2021-10-20 23:25:43,409 - patapsco.run - INFO - Writing output to: /Users/eyang/Documents/Repositories/patapsco/samples/notebooks/runs/T5-Reranking
2021-10-20 23:25:43,425 - patapsco.retrieve - INFO - Index location: /Users/eyang/Documents/Repositories/patapsco/samples/notebooks/runs/query-translation/index
2021-10-20 23:25:43,432 - patapsco.job - INFO - Stage 2 is a streaming pipeline.
2021-10-20 23:25:43,433 - patapsco.job - INFO - Stage 2 pipeline: Hc4JsonTopicReader | TopicProcessor | QueryProcessor | QueryWriter | PyseriniRetriever | JsonResultsWriter | T5Reranker | JsonResultsWriter | TrecResultsWriter
2021-10-20 23:25:43,438 - patapsco.job - INFO - Starting run: T5 Reranking
2021-10-20 23:25:43,439 - patapsco.job - INFO - Stage 2: Starting processing of topics
2021-10-20 23:25:43,445 - patapsco.retrieve - INFO - Using BM25 with parameters k1=0.9 and b=0.4


monoT5:   0%|          | 0/25 [00:00<?, ?batches/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (731 > 512). Running this sequence through the model will result in indexing errors


monoT5:   0%|          | 0/25 [00:00<?, ?batches/s]

2021-10-20 23:29:56,386 - patapsco.job - INFO - Stage 2: Processed 2 topics
2021-10-20 23:29:56,387 - patapsco.job - INFO - Stage 2 took 252.9 secs
2021-10-20 23:29:56,411 - patapsco.score - INFO - Average scores over 2 queries: map: 0.051, ndcg: 0.214, ndcg_prime: 0.532, recall_100: 0.625, recall_1000: 0.625
2021-10-20 23:29:56,413 - patapsco.job - INFO - Memory usage: 1.5 GB
2021-10-20 23:29:56,413 - patapsco.job - INFO - Run complete
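The score line in the log reports averaged metrics such as `map` and `recall_100`. For readers unfamiliar with them, the sketch below computes average precision and recall at a cutoff for a single query, using the textbook definitions with binary relevance. It is illustrative only: it is not patapsco's scoring code, and the document ids are made up.

```python
def average_precision(ranking, relevant):
    """Sum of precision@k over the ranks k where a relevant document appears,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def recall_at(ranking, relevant, cutoff):
    """Fraction of the relevant documents found in the top `cutoff` results."""
    if not relevant:
        return 0.0
    return len(set(ranking[:cutoff]) & relevant) / len(relevant)


ranking = ["d3", "d1", "d7", "d2"]  # system output, best first
relevant = {"d1", "d2", "d9"}       # documents judged relevant in the qrels
print(average_precision(ranking, relevant))    # (1/2 + 2/4) / 3
print(recall_at(ranking, relevant, cutoff=2))  # 1 of 3 relevant documents
```

Mean average precision is the mean of the per-query average precision values over all topics in the run.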


In [15]:
!ls runs/T5-Reranking/

config.yml        processed_queries retrieve
config_full.yml   rerank            scores.txt
patapsco.log      results.txt       timing.json
