Using huggingface's datasets library as key part of the pipeline #26
Thank you for taking the time to install and look at txtai! I've heard of Hugging Face's datasets library and it does look very nice. But I'm not exactly clear what integration you envisioned. Would you mind expanding on it? Are you thinking of a way to take a dataset and have an easy integration to build a txtai index?
If you look at https://huggingface.co/docs/datasets/faiss_and_ea.html, their library allows you to index memory-mapped columns of the underlying PyArrow dataframe with FAISS and Elasticsearch. So, say I have a large corpus which I want to index; I can then map my vectorizer over it and build a retriever along these lines:

```python
from typing import Callable, Dict, List, Optional, Tuple

import nlp  # the library is now published as `datasets`
import numpy as np
import tqdm

# Assumed type alias: a vectorizer is any callable mapping a batch of
# examples to a list of embedding vectors
Vectorizer = Callable[[Dict[str, List]], List[List[float]]]


class DenseInformationRetriever:
    _EMBEDDING_COL_NAME = "embeddings"

    def __init__(
        self,
        documents: nlp.Dataset,
        doc_vectorizer: Vectorizer,
        query_vectorizer: Vectorizer,
        batch_size: int = 512,
        cache_file: Optional[str] = None,
        string_factory: Optional[str] = "Flat",
        train_size: Optional[int] = None,
        min_train_pct: Optional[float] = 0.4,
    ):
        self.docs = documents
        self.batch_size = batch_size
        self.query_vectorizer = query_vectorizer
        # Map the document vectorizer over the corpus in batches; the result
        # is written to a memory-mapped Arrow cache file, not held in RAM
        self.docs = self.docs.map(
            lambda examples: {self._EMBEDDING_COL_NAME: doc_vectorizer(examples)},
            batched=True,
            batch_size=batch_size,
            cache_file_name=cache_file,
        )
        if train_size:
            # Cap the FAISS training sample at a fraction of the corpus
            train_size = min(train_size, int(min_train_pct * len(self.docs)))
        self.docs.add_faiss_index(
            column=self._EMBEDDING_COL_NAME,
            string_factory=string_factory,
            train_size=train_size,
        )

    def search(
        self, queries: nlp.Dataset, k: int = 10
    ) -> List[Tuple[List[float], Dict[str, List]]]:
        query_vectorizer = self.query_vectorizer
        embedding_col_name = self._EMBEDDING_COL_NAME
        queries = queries.map(
            lambda examples: {embedding_col_name: query_vectorizer(examples)},
            batched=True,
            batch_size=self.batch_size,
        )
        embeddings = queries[self._EMBEDDING_COL_NAME]
        results = []
        # Query the FAISS index in chunks of 32 queries at a time
        for i in tqdm.tqdm(range(0, len(embeddings), 32)):
            batch_scores, batch_retrieved = self.docs.get_nearest_examples_batch(
                self._EMBEDDING_COL_NAME,
                np.array(embeddings[i : i + 32], dtype=np.float32),
                k=k,
            )
            results.extend(list(zip(batch_scores, batch_retrieved)))
        return results

    def vector_search(
        self, vector: List[float], k: int = 10
    ) -> Tuple[List[float], Dict[str, List]]:
        scores, retrieved_examples = self.docs.get_nearest_examples(
            self._EMBEDDING_COL_NAME, np.array(vector, dtype=np.float32), k=k
        )
        return scores, retrieved_examples
```

I thought it could be quite useful if we use this to support processing large datasets which txtai users might have. They could even easily publish their processed datasets (before vectorizing) to Hugging Face's hub. I will try to put a working POC in a repo this week.
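A quick usage sketch of the class above (not from the original comment): it assumes a sentence-transformers encoder as both the document and query vectorizer, and a dataset with a `text` column; the model and dataset names here are purely illustrative.

```python
import nlp  # now published as `datasets`
from sentence_transformers import SentenceTransformer

# Illustrative model; any encoder returning one vector per example works
model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

def vectorize(examples):
    # Batch of examples in, list of embedding vectors out
    return model.encode(examples["text"]).tolist()

docs = nlp.load_dataset("ag_news", split="train[:10000]")
retriever = DenseInformationRetriever(
    documents=docs,
    doc_vectorizer=vectorize,
    query_vectorizer=vectorize,
)

# Search the FAISS index with a raw query vector
scores, examples = retriever.vector_search(model.encode("stock market rally").tolist())
```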
Got it, thank you. I will keep this in mind and consider options to integrate datasets into txtai.
An example notebook has been added to show how to integrate txtai and Hugging Face's Datasets.
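For reference, a minimal sketch of that direction (assuming the txtai `Embeddings` API and an illustrative dataset/model choice; this is not the exact notebook code):

```python
from datasets import load_dataset
from txtai.embeddings import Embeddings

# Illustrative choices; any dataset with a text column works
dataset = load_dataset("ag_news", split="train[:1000]")
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

# txtai indexes (id, text, tags) tuples built from the dataset rows
embeddings.index((i, row["text"], None) for i, row in enumerate(dataset))

# Returns (id, score) tuples for the closest matches
print(embeddings.search("wall street banks", 3))
```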
I implemented a similar customizable indexing + retrieval pipeline. Hugging Face's datasets library (previously named nlp) allows one to vectorize and index huge datasets without having to worry about RAM. It uses Apache Arrow for memory-mapped, zero-deserialization-cost dataframes to do this, and it also supports easy integration with FAISS and Elasticsearch.
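As a concrete illustration of that built-in integration (a sketch following the datasets documentation linked earlier; the dataset, column, and Elasticsearch host/port are placeholders):

```python
from datasets import load_dataset

dataset = load_dataset("crime_and_punish", split="train[:1000]")

# Full-text Elasticsearch index over the raw "line" column
# (assumes an Elasticsearch server running at localhost:9200)
dataset.add_elasticsearch_index("line", host="localhost", port="9200")
scores, examples = dataset.get_nearest_examples("line", "my favourite sentence", k=5)
```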
The key advantages of making this a central part of the pipeline are as follows.
- The datasets library already provides access to tonnes of datasets (refer to https://huggingface.co/datasets/viewer/). It also allows adding new datasets, making it a good choice for distributing the datasets which users of txtai would rely upon.