PyGaggle

PyGaggle provides a gaggle of deep neural architectures for text ranking and question answering. It was designed for tight integration with Pyserini, but can be easily adapted for other sources as well.

Currently, this repo contains implementations of the rerankers for CovidQA on CORD-19, as described in "Rapidly Bootstrapping a Question Answering Dataset for COVID-19".

Installation

Install via PyPI: pip install pygaggle. Requires Python 3.6+.
Install PyTorch 1.4+.
Download the index: sh scripts/update-index.sh.
Make sure Java 11+ is installed; check with javac --version.
Install Anserini.
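
The prerequisites above can be checked from Python before going further. The following is a minimal sketch, not part of PyGaggle: the function name check_prerequisites and the dictionary keys are made up for illustration, and the check only confirms that the packages are importable and that javac runs, not that PyTorch is 1.4+ or that Java is 11+.

```python
import importlib.util
import shutil
import subprocess
import sys


def check_prerequisites():
    """Return a dict mapping each prerequisite name to True or False.

    Hypothetical helper: it only tests for presence, not for minimum
    versions of PyTorch or Java.
    """
    status = {}

    # The installation notes require Python 3.6+.
    status["python>=3.6"] = sys.version_info >= (3, 6)

    # A package counts as present if Python can locate it for import.
    status["torch"] = importlib.util.find_spec("torch") is not None
    status["pygaggle"] = importlib.util.find_spec("pygaggle") is not None

    # javac must be on PATH and `javac --version` must exit cleanly.
    javac = shutil.which("javac")
    if javac is None:
        status["javac"] = False
    else:
        try:
            result = subprocess.run(
                [javac, "--version"],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=30,
            )
            status["javac"] = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            status["javac"] = False

    return status


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

Any entry reported as missing points back to the corresponding installation step.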

Additional Instructions

Clone the repo with git clone --recursive https://github.com/castorini/pygaggle.git
Make sure you have an installation of Python 3.6+. All python commands below refer to this.
With pip, run pip install -r requirements.txt.
- If you prefer Anaconda, use conda env create -f environment.yml && conda activate pygaggle.

A simple reranking example - T5

The code below shows how to score two documents for a given query using a T5 reranker from "Document Ranking with a Pretrained Sequence-to-Sequence Model".

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from pygaggle.model import T5BatchTokenizer
from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import T5Reranker

model_name = 'castorini/monot5-base-msmarco'
tokenizer_name = 't5-base'
batch_size = 8

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = T5ForConditionalGeneration.from_pretrained(model_name)
model = model.to(device).eval()

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
tokenizer = T5BatchTokenizer(tokenizer, batch_size)
reranker = T5Reranker(model, tokenizer)

query = Query('what causes low liver enzymes')

correct_doc = Text('Reduced production of liver enzymes may indicate dysfunction of the liver. This article explains the causes and symptoms of low liver enzymes. Scroll down to know how the production of the enzymes can be accelerated.')

wrong_doc = Text('Elevated liver enzymes often indicate inflammation or damage to cells in the liver. Inflamed or injured liver cells leak higher than normal amounts of certain chemicals, including liver enzymes, into the bloodstream, elevating liver enzymes on blood tests.')

documents = [correct_doc, wrong_doc]

scores = [result.score for result in reranker.rerank(query, documents)]
# scores = [-0.1782158613204956, -0.36637523770332336]
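
Once each document has a score, reranking amounts to sorting by that score, highest first. The sketch below illustrates this without loading a model. Scored is a hypothetical stand-in for the objects returned by reranker.rerank, assumed here to expose only text and score attributes; the score values are the ones from the comment above, and the document strings are shortened.

```python
from collections import namedtuple

# Hypothetical stand-in for a reranked document; only .text and .score
# are assumed, mirroring how the example above reads result.score.
Scored = namedtuple("Scored", ["text", "score"])

# Deliberately listed in the "wrong" order so that sorting has an effect.
results = [
    Scored("Elevated liver enzymes often indicate inflammation", -0.36637523770332336),
    Scored("Reduced production of liver enzymes may indicate dysfunction", -0.1782158613204956),
]


def rank_by_score(items):
    """Return items sorted from highest to lowest score."""
    return sorted(items, key=lambda item: item.score, reverse=True)


ranked = rank_by_score(results)
for position, item in enumerate(ranked, start=1):
    print(position, round(item.score, 4), item.text)
```

The same helper should work on any objects with a numeric score attribute, so it is not tied to a particular reranker.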

A simple reranking example - BERT

You can also try the code below, which uses a BERT reranker from "Passage Re-ranking with BERT". Note that the T5 reranker produces slightly better scores than the BERT reranker.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from pygaggle.model import BatchTokenizer
from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import SequenceClassificationTransformerReranker

model_name = 'castorini/monobert-large-msmarco'
tokenizer_name = 'bert-large-uncased'

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = AutoModelForSequenceClassification.from_pretrained(model_name)
model = model.to(device).eval()

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
reranker = SequenceClassificationTransformerReranker(model, tokenizer)

query = Query('what causes low liver enzymes')

correct_doc = Text('Reduced production of liver enzymes may indicate dysfunction of the liver. This article explains the causes and symptoms of low liver enzymes. Scroll down to know how the production of the enzymes can be accelerated.')

wrong_doc = Text('Elevated liver enzymes often indicate inflammation or damage to cells in the liver. Inflamed or injured liver cells leak higher than normal amounts of certain chemicals, including liver enzymes, into the bloodstream, elevating liver enzymes on blood tests.')

documents = [correct_doc, wrong_doc]

scores = [result.score for result in reranker.rerank(query, documents)]
# scores = [-3.077077865600586, -5.45782470703125]
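
The scores in both examples are negative, which is consistent with log-probabilities, although the text here does not state what the scores represent. Under that assumption, exponentiating them gives values between 0 and 1 that are easier to compare by eye. This is a minimal sketch using the numbers from the two examples; to_probabilities is an illustrative name, not a PyGaggle function.

```python
import math

# Scores copied from the comments in the T5 and BERT examples.
t5_scores = [-0.1782158613204956, -0.36637523770332336]
bert_scores = [-3.077077865600586, -5.45782470703125]


def to_probabilities(log_scores):
    """Exponentiate each score.

    ASSUMPTION: the inputs are log-probabilities. If the reranker emits
    some other kind of score, the outputs are not probabilities.
    """
    return [math.exp(score) for score in log_scores]


t5_probs = to_probabilities(t5_scores)
bert_probs = to_probabilities(bert_scores)

print("T5:  ", [round(p, 4) for p in t5_probs])
print("BERT:", [round(p, 4) for p in bert_probs])
```

Exponentiation preserves order, so the ranking of the two documents is unchanged for either reranker; only the scale differs.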
