SPARe is an approach to speeding up lexical retrievers by running them on the GPU with sparse kernels.
High-level API:
- Sparse Collection API: Manages the backend compressed sparse row (CSR) and compressed sparse column (CSC) matrices.
- Searcher API: Executes searches over the compressed sparse matrices on the GPU (a minimal sketch of how the two layers connect follows this list).
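For orientation, a minimal sketch of how the two layers connect ("my_index" is a placeholder path; the Usage section below walks through every call):

import spare

# Sparse Collection API: build or load the compressed sparse matrices
collection = spare.SparseCollection.load_from_file("my_index")  # placeholder path
# Searcher API: run GPU searches over the loaded matrices
retriever = spare.SparseRetriever(collection, algorithm="dot")
retriever.retrieve([{6254: 1.0}], top_k=10)  # query as a {token_id: weight} dict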
Supported Backends:
- PyTorch
- [WIP] Jax
Compatibility:
- Pyserini: Converts Anserini indices into SPARe collections (a rough sketch of what such a conversion involves follows this list).
- PyTerrier: Converts Terrier indices into SPARe collections. (The implementation is complete but not yet exposed in the current version of the API.)
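SPARe ships its own converter for this; purely as an illustration of what the conversion involves, here is a rough sketch built on Pyserini's IndexReader. The import path varies across Pyserini versions, the index must have been built with document vectors stored, and all glue code here is illustrative rather than SPARe's actual converter:

import spare
from pyserini.index.lucene import IndexReader  # older releases: pyserini.index.IndexReader

index_reader = IndexReader("path/to/anserini_index")  # placeholder path
num_docs = index_reader.stats()["documents"]
# assign each indexed term a dense integer id
vocab = {t.term: i for i, t in enumerate(index_reader.terms())}

def doc_vectors():
    # yield (docid, {term_id: frequency}) pairs, the format from_vec_iterator expects
    for internal_id in range(num_docs):
        docid = index_reader.convert_internal_docid_to_collection_docid(internal_id)
        tf = index_reader.get_document_vector(docid)  # needs -storeDocvectors at index time
        yield docid, {vocab[t]: float(f) for t, f in tf.items()}

collection = spare.SparseCollection.from_vec_iterator(doc_vectors(),
                                                      vec_dim=len(vocab),
                                                      collection_maxsize=num_docs,
                                                      dtype=spare.float32,
                                                      backend="torch")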
Additional Features to Note:
- Parallelism (SPMD, [Planned] Tensor)
- [WIP] Hardware-aware searching: Optimizes retrieval strategies based on the available hardware resources.
- Collection sharding: Allows collections to be divided into smaller shards so they fit within memory constraints (currently undergoing performance validation; a toy sketch of the idea follows this list).
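To illustrate the idea behind sharding (a generic toy sketch, not SPARe's internal implementation): each shard holds a row-slice of the document-term matrix, shards are scored independently, and the per-shard scores are merged into a global top-k.

import torch

def sharded_search(shards, query, k):
    # shards: list of (doc_ids, matrix) pairs, where each matrix is a row-slice
    # of the full document-term matrix small enough to fit in GPU memory
    scores, ids = [], []
    for doc_ids, matrix in shards:
        # score this shard's documents independently
        scores.append((matrix @ query.unsqueeze(1)).squeeze(1))
        ids.extend(doc_ids)
    all_scores = torch.cat(scores)
    top = torch.topk(all_scores, k=min(k, all_scores.numel()))
    return [(ids[i], v.item()) for i, v in zip(top.indices.tolist(), top.values)]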
Installation: SPARe can be installed directly from GitHub:
pip install git+https://github.com/ieeta-pt/SPARe.git
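A quick way to confirm the installation succeeded (nothing beyond the package name is assumed here):

python -c "import spare"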
Usage: The examples below walk through building a collection from text or from bag-of-words vectors, applying a BM25 transformation, and running searches.
import spare
from transformers import AutoTokenizer
# mock documents
docs = [{
"id": "my first document",
"contents": "This is my first document in my document collection"
},{
"id": "my second document",
"contents": "This is another example of a shorter document"
}]
# SPARe always expects to receive a tuple with the doc id and the document text.
collection_mapped = map(lambda doc: (doc["id"], doc["contents"]), docs)
# Simple tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Wraps the tokenizer so that it is compatible with SPARe
bow = spare.BagOfWords(lambda x: tokenizer(x, add_special_tokens=False).input_ids, tokenizer.vocab_size)
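# For instance, bow("my first document") should yield a token-id -> term-frequency
# mapping, e.g. roughly {2026: 1, 2034: 1, 6254: 1} with this vocabulary
# (the exact format is shown in the next example).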
# Creates a default counting sparse collection
collection = spare.SparseCollection.from_text_iterator(collection_mapped,
text_to_vec=bow,
collection_maxsize=len(docs),
dtype=spare.float32,
backend="torch")
# save to an index folder
collection.save_to_file("my_two_document_collection")
import spare
from spare.metadata import MetaDataDocID
# the same collection, but already in the BOW format (token id -> term frequency)
bow_docs = [('my first document',
{2023: 1.0,
2003: 1.0,
2026: 2.0,
2034: 1.0,
6254: 2.0,
1999: 1.0,
3074: 1.0}),
('my second document',
{2023: 1.0,
2003: 1.0,
2178: 1.0,
2742: 1.0,
1997: 1.0,
1037: 1.0,
7820: 1.0,
6254: 1.0})]
collection = spare.SparseCollection.from_vec_iterator(iter(bow_docs),
vec_dim=30522,
collection_maxsize=len(docs),
dtype=spare.float32,
backend="torch",
metadata=MetaDataDocID) # this defines the metadata that is stored; in this case it is only the docID
# by default SPARe uses MetaDataDFandDL, which stores the docID, document frequency, and document length.
# This is useful when loading weight values from learned sparse retrieval (LSR) models.
# save to an index folder
collection.save_to_file("my_two_document_collection")
import spare
# (...) Use any of the previous collections
collection = (...)
# Apply a static BM25 transformation to the collection. Note that this transformation requires metadata of type MetaDataDFandDL.
collection.transform(spare.BM25Transform(k1=1.2, b=0.75))
# save to an index folder
collection.save_to_file("my_two_document_collection_converted_to_BM25")
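For reference, this transformation bakes the BM25 weights directly into the stored matrix, which is why MetaDataDFandDL is needed: the weight of each posting depends on its document frequency and the document's length. A sketch of the per-posting weight under the standard BM25 formula (assuming SPARe uses this common variant):

import math

def bm25_weight(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    # idf from the term's document frequency; tf saturated by k1 and
    # length-normalized by b, matching the k1/b arguments above
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

After the transform, the query-time score of a document is simply the dot product between the query vector and the document's precomputed row.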
import spare
# let's load the previous BM25 collection to show how collections are loaded
collection = spare.SparseCollection.load_from_file("my_two_document_collection_converted_to_BM25")
# if our questions are in text format, we also need to provide the *bow* function to the retriever.
sparse_retriever = spare.SparseRetriever(collection, algorithm="dot")
question = {
7820: 1.0,
6254: 1.0,
}
results = sparse_retriever.retrieve([question], top_k=10, return_scores=True)
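The exact return type is not pinned down here; assuming retrieve yields one ranked list of (doc_id, score) pairs per query when return_scores=True (an assumption, not a documented contract), the output could be consumed like this:

# assumed shape: [[(doc_id, score), ...]], one ranked list per query
for doc_id, score in results[0]:
    print(doc_id, score)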
import spare
# let's load the previous BM25 collection again
collection = spare.SparseCollection.load_from_file("my_two_document_collection_converted_to_BM25")
# same setup as before, but using the "iterative" search algorithm instead of "dot"
sparse_retriever = spare.SparseRetriever(collection, algorithm="iterative")
question = {
7820: 1.0,
6254: 1.0,
}
sparse_retriever.retrieve([question], top_k=10, return_scores=True)
import spare
# let's load the previous BM25 collection once more
collection = spare.SparseCollection.load_from_file("my_two_document_collection_converted_to_BM25")
# The execution datatype is controlled by the *objective* argument, following this map:
# accuracy: spare.float32
# half: spare.float16
# performance: spare.uint8
#
# This will most likely change to an explicit dtype=spare.(...) argument in a future release.
sparse_retriever = spare.SparseRetriever(collection, algorithm="iterative", objective="performance") # runs the accumulation with uint8
question = {
7820: 1.0,
6254: 1.0,
}
sparse_retriever.retrieve([question], top_k=10, return_scores=True)
Contribution guidelines will be made available soon. Please note that this project is currently maintained by a one-person team (me), so any assistance is greatly appreciated, and I am open to collaborations.
This project is currently licensed under the Apache License 2.0.
If you use SPARe in your research, please cite it using the following format:
@InProceedings{10.1007/978-3-031-56063-7_33,
author="Almeida, Tiago
and Matos, S{\'e}rgio",
editor="Goharian, Nazli
and Tonellotto, Nicola
and He, Yulan
and Lipani, Aldo
and McDonald, Graham
and Macdonald, Craig
and Ounis, Iadh",
title="SPARe: Supercharged Lexical Retrievers on GPU with Sparse Kernels",
booktitle="Advances in Information Retrieval",
year="2024",
publisher="Springer Nature Switzerland",
address="Cham",
pages="413--421",
isbn="978-3-031-56063-7"
}