orionw/mteb-lite


Downsampling MTEB Retrieval Datasets

Environment Setup

  1. Install a Python (3.10+) environment with requirements.txt: `conda create -n mteb-lite python=3.10 -y && conda activate mteb-lite && pip install -r requirements.txt`
  2. Install Java: `conda install -c conda-forge openjdk -y`

Tasks to Downsample

The tasks to downsample are listed in tasks_to_downsample.txt; each entry includes both the task name and the split.

To Reproduce (two steps)

Run `bash run_all.sh NFCorpus intfloat/e5-small-v2 384 test`, swapping in the dataset and model you prefer. The script needs the model's embedding dimension (here 384) and the dataset split (defaults to `test` if none is passed). If the model's embeddings should not be normalized, pass an additional `false` argument after the split.
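To make the argument order concrete, here is a dry-run wrapper that prints the invocation instead of executing it (SciFact and the 768-dimensional intfloat/e5-base-v2 are illustrative substitutions, not values prescribed by this repo):

```shell
# Dry run: print the run_all.sh command instead of executing it.
# SciFact and intfloat/e5-base-v2 (768-dim) are illustrative substitutions.
downsample() { echo bash run_all.sh "$@"; }

downsample SciFact intfloat/e5-base-v2 768 test        # embeddings normalized (default)
downsample SciFact intfloat/e5-base-v2 768 test false  # embeddings left unnormalized
```

Drop the leading `echo` in the wrapper (or call `bash run_all.sh` directly) once the arguments look right.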

Then the run files need to be pushed to mteb/mteb-lite-run-files so they can be shared. They are written locally to `artifacts/run_{MODEL_NAME}_{DATASET_NAME}-{DATASET_SPLIT}.tsv`.
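As a sketch, expanding that template for the NFCorpus example above (assuming the model name is substituted verbatim into the path):

```shell
# Expand the documented run-file template for one model/dataset pair.
MODEL_NAME="intfloat/e5-small-v2"
DATASET_NAME="NFCorpus"
DATASET_SPLIT="test"
RUN_FILE="artifacts/run_${MODEL_NAME}_${DATASET_NAME}-${DATASET_SPLIT}.tsv"
echo "$RUN_FILE"
# One possible way to push it (hypothetical; assumes the huggingface_hub CLI
# is installed and that mteb/mteb-lite-run-files is a dataset repo):
#   huggingface-cli upload mteb/mteb-lite-run-files "$RUN_FILE" --repo-type dataset
```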

To Reproduce (step by step)

  1. Start by embedding the corpus so we can search for the top results: `python embed_corpus.py --dataset DATASET --model MODEL --split SPLIT_NAME`
  2. Convert the embeddings to a faiss index so we can search with pyserini: `bash convert_to_faiss.sh PATH_TO_EMBEDDING.json DIM_SIZE`
  3. Download the queries file: `python download_queries.py --dataset DATASET_NAME --split SPLIT_NAME`
  4. Mine the run file with the hard negatives per corpus embedding: `bash mine_hard_negatives.sh INDEX_FOLDER/full QUERIES_FILE NORMALIZE_TRUE_OR_FALSE BATCH_SIZE NUM_HITS MODEL_NAME`
  5. Subsample the dataset and push it to the hub: `python subsample_dataset.py --run_files RUN_FILE_LIST --dataset_name DATASET_NAME`
  6. Add each new dataset to MTEB: `python automatically_add_dataset_to_mteb.py --original_repo_name DATASET_NAME --path_to_existing PATH_TO_ORIGINAL_FILE_IN_MTEB_REPO`
  7. Evaluate all models on the new datasets: `python run_mteb_on_datasets.py --dataset_name DATASET_NAME`
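The seven steps above can be sketched as one function (a non-authoritative outline: the uppercase placeholders are kept from the step descriptions and must be replaced with real paths, and the normalize flag, batch size of 128, and 1000 hits are example values, not defaults from this repo):

```shell
#!/usr/bin/env bash
# Sketch of the step-by-step pipeline for one dataset/model pair.
# Placeholders (PATH_TO_EMBEDDING.json, INDEX_FOLDER, QUERIES_FILE,
# RUN_FILE_LIST, PATH_TO_ORIGINAL_FILE_IN_MTEB_REPO) and the
# normalize/batch-size/hits arguments are assumptions to substitute.
set -euo pipefail

downsample_dataset() {
  local dataset=$1 model=$2 dim=$3 split=$4
  python embed_corpus.py --dataset "$dataset" --model "$model" --split "$split"      # 1. embed corpus
  bash convert_to_faiss.sh PATH_TO_EMBEDDING.json "$dim"                             # 2. build faiss index
  python download_queries.py --dataset "$dataset" --split "$split"                   # 3. fetch queries
  bash mine_hard_negatives.sh INDEX_FOLDER/full QUERIES_FILE true 128 1000 "$model"  # 4. mine hard negatives
  python subsample_dataset.py --run_files RUN_FILE_LIST --dataset_name "$dataset"    # 5. subsample + push
  python automatically_add_dataset_to_mteb.py --original_repo_name "$dataset" \
      --path_to_existing PATH_TO_ORIGINAL_FILE_IN_MTEB_REPO                          # 6. register in MTEB
  python run_mteb_on_datasets.py --dataset_name "$dataset"                           # 7. evaluate
}

# e.g.: downsample_dataset NQ intfloat/e5-large-v2 1024 test
```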

For example, use `NQ` for the dataset name and `intfloat/e5-large-v2` as the model name.
