Skip to content

metarank/esci-playground

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 

Repository files navigation

Metarank ESCI playground

This repo is a complementary set of configs and links for the Haystack US 23 talk on Hybrid Search

Datasets

We use a combination of original Amazon ESCI and community ESCI-S datasets.

The ESCI+ESCI-S small dataset in the Metarank format can be downloaded here: s3://esci-s/metarank-esci-small.jsonl.zst

Models

See huggingface.co/metarank repo with all models used in the final configuration:

You can always translate your own model to ONNX, see translation scripts on each model repos: https://huggingface.co/metarank/all-MiniLM-L6-v2/blob/main/convert.py

Config file

The Metarank config file is stored in this repo: config.yml

To speed-up all the experiments, we used precomputed embeddings for all models:

  • with no caching bootstrapping over CE models takes hours.
  • pre-computed embeddings for all experiments are ~30GB, so we're not sharing them for the sake of saving bandwidth. If you need them, contact us in Slack

BM25

All the BM25 features mention term-frequencies file. You can create it with the following command:

java -jar metarank.jar termfreq --data events.jsonl --out tf-title.json --fields title

Term-freq files should be build per field to match the behavior of Lucene.

Running the experiments

Then take a look into the config.yml file: there is a section with feature definitions, and the actual feature layout over different models. In this example there's only a single model, which includes all the features:

  • you should uncomment the features you need to be included into the ensemble
  • then run metarank standalone -d events.jsonl -c config.yml and write down the NDCG values

License

Licensed under the Apache 2.0.

Releases

No releases published

Packages

No packages published