SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 languages, generated using PaLM 2 and summarize-then-ask prompting.

google-research-datasets/swim-ir

SWIM-IR

Overview

SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training dataset consisting of 28 million query-passage pairs spanning 33 languages. Multilingual passages are sampled from Wikipedia and are paired with queries generated by PaLM-2 using a novel summarize-then-ask prompting (SAP) generation method.

Models trained on SWIM-IR achieve good performance on XOR-Retrieve (cross-lingual) and MIRACL (multilingual). SWIM-IR-based models also achieved a new state of the art on XTREME-UP, a cross-lingual retrieval benchmark for under-represented and scarce-data languages.

Announcements

  • [Nov 2023] SWIM-IR v1.0 currently covers a portion (10 of the 18) of the MIRACL languages in the SWIM-IR monolingual set. The cross-lingual SWIM-IR data contains synthetic training pairs for all of the languages in XOR-Retrieve and XTREME-UP.

Dataset Generation

Figure 1: SWIM-IR dataset generation process. Sampled Wikipedia passages are provided to an LLM (PaLM-2) using the novel summarize-then-ask prompting (SAP) method.

The SWIM-IR dataset is generated by first sampling passages from Wikipedia. The passages are then provided to PaLM-2 along with a prompt that asks the model to summarize the passage. The model is then prompted to ask a question that can be answered by the passage. The end-to-end process is illustrated in Figure 1. Summarize-then-ask prompting (SAP) aids the model in generating good information seeking queries for each specific input passage.
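The two-stage flow described above can be sketched in Python. This is an illustrative outline only: the prompt wording, the `summarize_then_ask` helper, and the `LLM` callable type are assumptions made for this example, not the prompts or API used to build SWIM-IR. Whether the real pipeline issues one model call or two is not stated here; two calls are used for clarity. A stub stands in for the model so the sketch runs without any external service.

```python
from typing import Callable, Dict

# Hypothetical type: any function that maps a prompt string to generated text.
LLM = Callable[[str], str]


def summarize_then_ask(passage: str, query_language: str, llm: LLM) -> Dict[str, str]:
    """Summarize the passage first, then ask a question grounded in it."""
    # Step 1: ask the model for a short summary of the passage.
    summary = llm(f"Summarize the following passage.\n\nPassage: {passage}\n\nSummary:")
    # Step 2: ask for a question, conditioning on both the passage and the summary.
    query = llm(
        f"Passage: {passage}\n\nSummary: {summary}\n\n"
        f"Write a question in {query_language} that the passage answers.\n\nQuestion:"
    )
    return {"summary": summary.strip(), "query": query.strip()}


def stub_llm(prompt: str) -> str:
    """Placeholder model for demonstration; returns canned text."""
    if prompt.rstrip().endswith("Question:"):
        return "Where did Thomas Edison invent the phonograph?"
    return "Thomas Edison invented the phonograph at his Menlo Park laboratory."


result = summarize_then_ask(
    "It was in his Menlo Park laboratory that Thomas Edison invented the phonograph.",
    "English",
    stub_llm,
)
print(result["query"])
```

Replacing `stub_llm` with a wrapper around a real model client would be the only change needed to try the idea on actual passages.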

Download

The SWIM-IR dataset can be downloaded using the links below:

Data Format

SWIM-IR is partitioned into three sections (directories): cross_lingual, cross_lingual_ext, and monolingual. The cross_lingual section contains training data that can be used for evaluation on XOR-Retrieve, while the cross_lingual_ext section can be used for evaluation on XTREME-UP. The monolingual section can be used for MIRACL evaluation.
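As a rough sketch of how the three sections might be enumerated after download, the snippet below walks a root directory and collects the JSONL files under each section. The `.jsonl` extension, the `list_section_files` helper, and the file names in the demonstration are assumptions for illustration; check the extracted archive for the actual layout.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Section (directory) names as listed in this README.
SECTIONS = ("cross_lingual", "cross_lingual_ext", "monolingual")


def list_section_files(root: Path) -> Dict[str, List[Path]]:
    """Map each section name to the sorted JSONL files found beneath it."""
    found: Dict[str, List[Path]] = {}
    for section in SECTIONS:
        section_dir = root / section
        # rglob tolerates nested sub-directories; a missing section yields [].
        found[section] = sorted(section_dir.rglob("*.jsonl")) if section_dir.is_dir() else []
    return found


# Demonstration on a throwaway directory; the file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "monolingual").mkdir()
    (root / "monolingual" / "fr.jsonl").write_text("", encoding="utf-8")
    (root / "cross_lingual").mkdir()
    files = list_section_files(root)
    counts = {section: len(paths) for section, paths in files.items()}

print(counts)
```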

Each section contains language specific JSONL files with the fields: _id, lang, code, query, title and text. Synthetic questions generated by PaLM-2 about the passage are stored in the query field. The text field contains a sampled passage from Wikipedia, while title is the title of the passage's article.
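One way to work with these fields in Python is to parse each JSONL line into a small typed record. The sketch below assumes every line carries all six fields as strings; the `SwimIrRecord` class and its `doc_id` attribute (holding the `_id` field) are hypothetical names chosen for this example, not part of the dataset.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SwimIrRecord:
    doc_id: str  # value of the `_id` field
    lang: str    # language name, e.g. "Chinese"
    code: str    # language code, e.g. "zh"
    query: str   # synthetic query generated by the LLM
    title: str   # title of the Wikipedia article
    text: str    # sampled Wikipedia passage

    @classmethod
    def from_json(cls, line: str) -> "SwimIrRecord":
        """Parse one JSONL line; raises KeyError if a field is missing."""
        obj = json.loads(line)
        return cls(
            doc_id=str(obj["_id"]),
            lang=obj["lang"],
            code=obj["code"],
            query=obj["query"],
            title=obj["title"],
            text=obj["text"],
        )


# A shortened, single-line record for demonstration.
line = json.dumps(
    {
        "_id": "10770836",
        "lang": "Chinese",
        "code": "zh",
        "query": "托马斯·爱迪生在哪里发明了留声机?",
        "title": "Menlo Park, New Jersey",
        "text": "Menlo Park is an unincorporated community ...",
    },
    ensure_ascii=False,
)
record = SwimIrRecord.from_json(line)
print(record.code, record.title)
```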

For the monolingual data, lang is the language of both the query and passage, with the corresponding language code stored in code (e.g., 'fr'). For the cross_lingual and cross_lingual_ext data, the passages are in English, while lang and code indicate the language and language code of the query, as in the Chinese example below.

Below is a JSON example from SWIM-IR for a question in Chinese, "托马斯·爱迪生在哪里发明了留声机?" [Where did Thomas Edison invent the phonograph?].

{'_id': '10770836',
'lang': 'Chinese',
'code': 'zh',
'query': '托马斯·爱迪生在哪里发明了留声机?', 
'title': 'Menlo Park, New Jersey',
'text': 'Menlo Park is an unincorporated community located \
within Edison Township in Middlesex County, New Jersey, United \
States. In 1876, Thomas Edison set up his home and research \
laboratory in Menlo Park, which at the time was the site of an \
unsuccessful real estate development named after the town of \
Menlo Park, California. While there, he earned the nickname \
"the Wizard of Menlo Park". The Menlo Park lab was significant \
in that it was one of the first laboratories to pursue practical \
commercial applications of research. It was in his Menlo Park \
laboratory that Thomas Edison invented the phonograph and developed'
}

Note that within the SWIM-IR dataset, JSON examples are stored as JSONL with one JSON example per line. Multiple line JSON is used above to make the example more readable.
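Because each line is an independent JSON object, a file can be streamed without loading it into memory all at once. The following is a minimal sketch, assuming UTF-8 text; the `iter_jsonl` helper and the in-memory sample records are made up for illustration.

```python
import io
import json
from collections import Counter
from typing import IO, Dict, Iterator


def iter_jsonl(stream: IO[str]) -> Iterator[Dict[str, str]]:
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)


# In-memory stand-in for a JSONL file; the records are placeholders.
sample = io.StringIO(
    '{"_id": "1", "lang": "French", "code": "fr", "query": "q1", "title": "t1", "text": "p1"}\n'
    '{"_id": "2", "lang": "French", "code": "fr", "query": "q2", "title": "t2", "text": "p2"}\n'
    "\n"
    '{"_id": "3", "lang": "Chinese", "code": "zh", "query": "q3", "title": "t3", "text": "p3"}\n'
)

# Count records per language code while streaming.
per_language = Counter(record["code"] for record in iter_jsonl(sample))
print(per_language)
```

For a file on disk, passing `open(path, encoding="utf-8")` in place of `sample` should behave the same way.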

Prompts

Prompts given to PaLM-2 to generate the three parts of our dataset are provided below:

Paper

SWIM-IR is described in detail in the paper Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval by Nandan Thakur, Jianmo Ni, Gustavo Hernández Ábrego, John Wieting, Jimmy Lin and Daniel Cer. Please cite our paper in research work that uses or discusses SWIM-IR.

BibTeX

@article{swim-ir-dataset,
  author    = {Nandan Thakur and
               Jianmo Ni and
               Gustavo Hern{\'a}ndez {\'A}brego and
               John Wieting and
               Jimmy Lin and
               Daniel Cer},
  title     = {Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval},
  journal   = {CoRR},
  volume    = {abs/2311.05800},
  year      = {2023},
  url       = {https://arxiv.org/abs/2311.05800},
  eprinttype = {arXiv},
  primaryClass={cs.IR},
  eprint    = {2311.05800},
}
 

Contact

Questions about the SWIM-IR dataset can be asked by creating an issue on this repository or by emailing swim-ir-dataset@googlegroups.com.

License

The SWIM-IR dataset is licensed under CC BY-SA 4.0
