A large-scale multilingual dataset for Information Retrieval. Thorough human-annotations across 18 diverse languages.

project-miracl/miracl

🙌 MIRACL

MIRACL 🌍🙌🌏 (Multilingual Information Retrieval Across a Continuum of Languages) is a WSDM 2023 Cup challenge that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world. The website for the event can be found at miracl.ai. This repo provides pointers to access the actual dataset.

For more details, check out our arXiv paper: Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages.

🙌 Corpora

The Wikipedia corpora used in MIRACL are available as a HuggingFace Dataset. So far, we have released corpora for the 16 "known languages"; the remaining 2 "surprise languages" will be revealed later!

  • 🤗 = direct link to HuggingFace Dataset
  • 🌏 = link to raw wiki dumps

| Language | # of Passages | # of Articles | Links |
|---|---|---|---|
| Arabic (ar) | 2,061,414 | 656,982 | 🤗 🌏 |
| Bengali (bn) | 297,265 | 63,762 | 🤗 🌏 |
| English (en) | 32,893,221 | 5,758,285 | 🤗 🌏 |
| Spanish (es) | 10,373,953 | 1,669,181 | 🤗 🌏 |
| Persian (fa) | 2,207,172 | 857,827 | 🤗 🌏 |
| Finnish (fi) | 1,883,509 | 447,815 | 🤗 🌏 |
| French (fr) | 14,636,953 | 2,325,608 | 🤗 🌏 |
| Hindi (hi) | 506,264 | 148,107 | 🤗 🌏 |
| Indonesian (id) | 1,446,315 | 446,330 | 🤗 🌏 |
| Japanese (ja) | 6,953,614 | 1,133,444 | 🤗 🌏 |
| Korean (ko) | 1,486,752 | 437,373 | 🤗 🌏 |
| Russian (ru) | 9,543,918 | 1,476,045 | 🤗 🌏 |
| Swahili (sw) | 131,924 | 47,793 | 🤗 🌏 |
| Telugu (te) | 518,079 | 66,353 | 🤗 🌏 |
| Thai (th) | 542,166 | 128,179 | 🤗 🌏 |
| Chinese (zh) | 4,934,368 | 1,246,389 | 🤗 🌏 |

The corpus for each language is prepared from a Wikipedia dump, where we keep only the plain text and discard images, tables, etc. Each article is segmented into multiple passages using WikiExtractor based on natural discourse units (e.g., \n\n in the wiki markup). Each of these passages constitutes a "document", or unit of retrieval. We preserve the Wikipedia article title of each passage.

The corpus data files are in JSON lines format, compressed with gzip. Each line in the file corresponds to a passage. Consider an example from the English corpus:

{
    "docid": "39#0",
    "title": "Albedo", 
    "text": "Albedo (meaning 'whiteness') is the measure of the diffuse reflection of solar radiation out of the total solar radiation received by an astronomical body (e.g. a planet like Earth). It is dimensionless and measured on a scale from 0 (corresponding to a black body that absorbs all incident radiation) to 1 (corresponding to a body that reflects all incident radiation)."
}

The docid has the schema X#Y, where all passages with the same X come from the same Wikipedia article, whereas Y denotes the passage within that article, numbered sequentially. The text field contains the text of the passage. The title field contains the name of the article the passage comes from.
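
As a minimal sketch of working with this format, the snippet below reads a gzip-compressed JSON lines file and splits each docid into its article and passage parts. The helper names `parse_docid` and `iter_passages` are hypothetical and are not part of the MIRACL tooling; the path you pass in depends on where you stored the corpus files.

```python
import gzip
import json
from typing import Dict, Iterator, Tuple


def parse_docid(docid: str) -> Tuple[str, int]:
    """Split a docid of the form X#Y into (article id X, passage index Y)."""
    article_id, passage_idx = docid.rsplit("#", 1)
    return article_id, int(passage_idx)


def iter_passages(path: str) -> Iterator[Dict[str, str]]:
    """Yield one passage dict (docid, title, text) per non-empty line."""
    # "rt" opens the gzip stream in text mode so each line is a str.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

For the English example above, `parse_docid("39#0")` gives `("39", 0)`, so passages can be grouped by article using the first element.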

🙌 Topics and Relevance Judgments

Topics (= queries) and relevance judgments (= relevance labels) of the MIRACL training sets and development sets for each of the 16 known languages are available as a HuggingFace Dataset!

🤗 = direct link to HuggingFace Dataset

| Language | Train #Q | Train #J | Dev #Q | Dev #J | Links |
|---|---|---|---|---|---|
| Arabic (ar) | 3,495 | 25,382 | 2,896 | 29,197 | 🤗 |
| Bengali (bn) | 1,631 | 16,754 | 411 | 4,206 | 🤗 |
| English (en) | 2,863 | 29,416 | 799 | 8,350 | 🤗 |
| Spanish (es) | 2,162 | 21,531 | 648 | 6,443 | 🤗 |
| Persian (fa) | 2,107 | 21,844 | 632 | 6,571 | 🤗 |
| Finnish (fi) | 2,897 | 20,350 | 1,271 | 12,008 | 🤗 |
| French (fr) | 1,143 | 11,426 | 343 | 3,429 | 🤗 |
| Hindi (hi) | 1,169 | 11,668 | 350 | 3,494 | 🤗 |
| Indonesian (id) | 4,071 | 41,358 | 960 | 9,668 | 🤗 |
| Japanese (ja) | 3,477 | 34,387 | 860 | 8,354 | 🤗 |
| Korean (ko) | 868 | 12,767 | 213 | 3,057 | 🤗 |
| Russian (ru) | 4,683 | 33,921 | 1,252 | 13,100 | 🤗 |
| Swahili (sw) | 1,901 | 9,359 | 482 | 5,092 | 🤗 |
| Telugu (te) | 3,452 | 18,608 | 828 | 1,606 | 🤗 |
| Thai (th) | 2,972 | 21,293 | 733 | 7,573 | 🤗 |
| Chinese (zh) | 1,312 | 13,113 | 393 | 3,928 | 🤗 |
| Total | 40,203 | 343,177 | 13,071 | 126,076 | |

The above table shows the number of queries (#Q) and the number of judgments (#J) in each (language, split) combination, where the judgments include both positive and negative labels.

The topics are formatted in TSV, with each line organized as follows:

qid\tquery

The relevance judgments are formatted in standard TREC qrels format, as follows:

qid Q0 docid relevance
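
The two formats above can be parsed with a few lines of Python. This is a sketch under stated assumptions, not the loader used by the MIRACL authors: the function names `parse_topics` and `parse_qrels` are hypothetical, and the qrels parser assumes that no field contains whitespace.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_topics(lines: Iterable[str]) -> Dict[str, str]:
    """Map qid -> query from TSV lines of the form qid<TAB>query."""
    topics: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only, in case the query itself contains tabs.
        qid, query = line.split("\t", 1)
        topics[qid] = query
    return topics


def parse_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Map qid -> {docid: relevance} from TREC qrels lines."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        fields = line.split()  # qrels fields are whitespace-separated
        if not fields:
            continue
        qid, _q0, docid, relevance = fields
        qrels[qid][docid] = int(relevance)
    return dict(qrels)
```

Both functions accept any iterable of lines, so an open file handle can be passed directly, e.g. `parse_qrels(open(path, encoding="utf-8"))`.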

🙌 Baselines

Reproduce the results with Pyserini:

We have released baselines using BM25, mDPR, and a hybrid of the two, as described in our arXiv paper. Results of BM25 and mDPR can be reproduced using Pyserini.

To reproduce our baselines:

  1. Install the development version of Pyserini following these instructions. (To run baselines on surprise languages, you'll need to re-build both Anserini and Pyserini)
  2. Manually place all topics and qrels files under tools/topics-and-qrels. The topics and qrels files can be found under miracl-v1.0-${lang}/topics and miracl-v1.0-${lang}/qrels in the HuggingFace dataset.
    git clone https://huggingface.co/datasets/miracl/miracl
    mv miracl/*/*/* $PYSERINI_PATH/tools/topics-and-qrels/
    
  3. Follow the commands on our 2-click-reproduction (2CR) website.

Note that the 2CR above is only for reproducing the search stage, where the indexes are pre-computed and loaded automatically by Pyserini. If you are interested in reproducing the indexing stage, please refer to this documentation:

Checkpoints for dense models:

  • mDPR (w/o fine-tuning on MIRACL): castorini/mdpr-tied-pft-msmarco
  • mContriever (w/o fine-tuning on MIRACL): facebook/mcontriever-msmarco
  • mDPR (fine-tuned on MIRACL): castorini/mdpr-tied-pft-msmarco-ft-miracl-{lang}, where {lang} is the two-letter ISO code (e.g., ar, bn, ...)
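
The per-language checkpoint name in the last bullet follows a fixed pattern, which can be built programmatically. The helper below is a hypothetical convenience function, not part of Pyserini or the MIRACL repository; it only formats the name and does not check that a checkpoint has actually been published for the given language.

```python
import re

# Name pattern copied from the checkpoint list above.
FINETUNED_MDPR_TEMPLATE = "castorini/mdpr-tied-pft-msmarco-ft-miracl-{lang}"


def finetuned_mdpr_checkpoint(lang: str) -> str:
    """Return the fine-tuned mDPR checkpoint name for a two-letter language code."""
    if not re.fullmatch(r"[a-z]{2}", lang):
        raise ValueError(f"expected a two-letter lowercase language code, got {lang!r}")
    return FINETUNED_MDPR_TEMPLATE.format(lang=lang)
```

For example, `finetuned_mdpr_checkpoint("ar")` returns `castorini/mdpr-tied-pft-msmarco-ft-miracl-ar`.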

🙌 Citation

If you find this dataset and repository helpful, please cite MIRACL as follows:

@article{10.1162/tacl_a_00595,
    author = {Zhang, Xinyu and Thakur, Nandan and Ogundepo, Odunayo and Kamalloo, Ehsan and Alfonso-Hermelo, David and Li, Xiaoguang and Liu, Qun and Rezagholizadeh, Mehdi and Lin, Jimmy},
    title = "{MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {11},
    pages = {1114-1131},
    year = {2023},
    month = {09},
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00595},
    url = {https://doi.org/10.1162/tacl\_a\_00595},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00595/2157340/tacl\_a\_00595.pdf},
}

🙌 Contact

If you have any questions, feel free to email us (project.miracl [at] gmail.com) or start a GitHub issue under this repository.
