- Install a Python (3.10+) environment with `requirements.txt` (`conda create -n mteb-lite python=3.10 -y && conda activate mteb-lite && pip install -r requirements.txt`)
- Install Java: `conda install -c conda-forge openjdk -y`
The tasks to downsample are located in tasks_to_downsample.txt and include both the name and the split.
Run `bash run_all.sh NFCorpus intfloat/e5-small-v2 384 test`, swapping in the dataset and model you prefer. The script needs the embedding dimension (`384`) and the split of the dataset (default is `test` if none is passed). If the model's embeddings should not be normalized, pass an additional `false` parameter after `test`.
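
To run the script over every task in `tasks_to_downsample.txt`, the argument order described above can be wrapped in a small helper. This is a minimal sketch, not part of the repository: `parse_tasks` and `build_run_all_command` are hypothetical names, and the parser assumes each line of the task file holds a whitespace-separated `NAME SPLIT` pair, which may not match the actual file layout.

```python
import shlex
from typing import Iterable, List, Tuple


def parse_tasks(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse (dataset, split) pairs.

    ASSUMPTION: one whitespace-separated 'NAME SPLIT' pair per line.
    """
    tasks = []
    for line in lines:
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comment lines
        # Fall back to the script's documented default split.
        split = parts[1] if len(parts) > 1 else "test"
        tasks.append((parts[0], split))
    return tasks


def build_run_all_command(dataset: str, model: str, dim: int,
                          split: str = "test",
                          normalize: bool = True) -> List[str]:
    """Build the argument list: dataset, model, dimension, split, [false]."""
    cmd = ["bash", "run_all.sh", dataset, model, str(dim), split]
    if not normalize:
        # The extra `false` goes after the split, as described above.
        cmd.append("false")
    return cmd


# Dry run on in-memory sample lines: print the commands instead of running them.
sample_lines = ["NFCorpus test", "NQ test"]
for dataset, split in parse_tasks(sample_lines):
    print(shlex.join(build_run_all_command(dataset, "intfloat/e5-small-v2", 384, split)))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the printed commands look right.
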
Then we need to push the shared run files to mteb/mteb-lite-run-files. They will be located locally in artifacts/run_{MODEL_NAME}_{DATASET_NAME}-{DATASET_SPLIT}.tsv
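
One possible way to collect and push the local run files is sketched below. It is not the repository's own upload script: `find_run_files` and `push_run_files` are hypothetical helpers, and the sketch assumes that `mteb/mteb-lite-run-files` is a dataset repository on the Hugging Face Hub, that `huggingface_hub` is installed, and that you are logged in with write access.

```python
from pathlib import Path
from typing import Iterable, List


def find_run_files(artifacts_dir: str = "artifacts") -> List[Path]:
    """Return local run files matching artifacts/run_*.tsv, sorted by name."""
    return sorted(Path(artifacts_dir).glob("run_*.tsv"))


def push_run_files(paths: Iterable[Path],
                   repo_id: str = "mteb/mteb-lite-run-files",
                   repo_type: str = "dataset") -> None:
    """Upload each run file to the Hub under its own file name.

    ASSUMPTION: the target is a dataset repo and credentials are configured.
    """
    # Imported lazily so that listing files works without huggingface_hub.
    from huggingface_hub import HfApi

    api = HfApi()
    for path in paths:
        api.upload_file(
            path_or_fileobj=str(path),
            path_in_repo=path.name,
            repo_id=repo_id,
            repo_type=repo_type,
        )
```

Calling `find_run_files()` first and checking the list before `push_run_files(...)` avoids uploading stale files from earlier runs.
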
- Start by embedding a corpus so we can search for the top results: `python embed_corpus.py --dataset DATASET --model MODEL --split SPLIT_NAME`
- Convert the embeddings to faiss so we can search with pyserini: `bash convert_to_faiss.sh PATH_TO_EMBEDDING.json DIM_SIZE`
- Download the queries file with `python download_queries.py --dataset DATASET_NAME --split SPLIT_NAME`
- Mine the run file with the hard negatives per corpus embedding using
- Subsample the dataset and push the result to the hub with `python subsample_dataset.py --run_files RUN_FILE_LIST --dataset_name DATASET_NAME`
- Add all of those new datasets to MTEB using `python automatically_add_dataset_to_mteb.py --original_repo_name DATASET_NAME --path_to_existing PATH_TO_ORIGINAL_FILE_IN_MTEB_REPO`
- Evaluate all models on those new datasets with `python run_mteb_on_datasets.py --dataset_name DATASET_NAME`
For example, use "NQ" for the dataset name and "intfloat/e5-large-v2" as the model name.
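
The first three manual steps above can be assembled programmatically, as in the sketch below. `manual_step_commands` is a hypothetical helper, and the embedding path is left as a placeholder because the location where `embed_corpus.py` writes its output is not stated here. The mining step is omitted since its command is not given, and the later steps can be appended in the same way.

```python
import shlex
from typing import List


def manual_step_commands(dataset: str, model: str, split: str,
                         embedding_path: str, dim: int) -> List[List[str]]:
    """Return the embed, convert, and download commands as argument lists."""
    return [
        # 1. Embed the corpus.
        ["python", "embed_corpus.py",
         "--dataset", dataset, "--model", model, "--split", split],
        # 2. Convert the embeddings to a faiss index for pyserini.
        ["bash", "convert_to_faiss.sh", embedding_path, str(dim)],
        # 3. Download the queries file.
        ["python", "download_queries.py",
         "--dataset", dataset, "--split", split],
    ]


# Dry run: print the commands for the example names instead of executing them.
# PATH_TO_EMBEDDING.json is a placeholder; 1024 is e5-large-v2's embedding size.
for cmd in manual_step_commands("NQ", "intfloat/e5-large-v2", "test",
                                "PATH_TO_EMBEDDING.json", 1024):
    print(shlex.join(cmd))
```

To execute instead of print, pass each list to `subprocess.run(cmd, check=True)` so the sequence stops at the first failing step.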