<a href="https://colab.research.google.com/github/louisbrulenaudet/tax-retrieval-benchmark/blob/main/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Massive Text Embedding Benchmark for French Taxation 🤗

[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg)](https://badge.fury.io/py/tensorflow)
![Maintainer](https://img.shields.io/badge/maintainer-@louisbrulenaudet-blue)

This notebook shows how to add a new task to the Massive Text Embedding Benchmark (MTEB), an open-source framework for evaluating and comparing text embedding models across a diverse set of tasks and languages.

The task we will integrate is TaxRetrievalBenchmark, a retrieval task in which a model must return the relevant tax articles or content for a given query. This kind of task matters in the legal and financial domains, where accurate and efficient retrieval of relevant information is crucial.

To add this task to the MTEB framework, we will follow a structured approach:

- Understanding the task: We will start by analyzing the TaxRetrievalBenchmark task, its data format, and the evaluation metrics used to assess model performance.
- Preparing the data: Next, we will preprocess the data from the HuggingFace Hub, converting it to the MTEB format. This step involves organizing the corpus, queries, and relevant document information into the required data structures.
- Implementing the task class: We will then implement the TaxRetrievalBenchmark class, which inherits from the AbsTaskRetrieval class provided by the MTEB framework. This class will encapsulate the task-specific logic, including data loading, metadata management, and evaluation methods.
- Integrating with MTEB: Finally, we will integrate the TaxRetrievalBenchmark class into the MTEB framework, so that it can be run alongside other tasks when evaluating an embedding model.

Adding the TaxRetrievalBenchmark task extends MTEB's collection of tasks with a French taxation retrieval benchmark, so that embedding models can be compared on this domain. This notebook also serves as a practical guide for anyone interested in extending the MTEB framework with new tasks.
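
The data structures mentioned in the "Preparing the data" step can be sketched with toy values. The snippet below is only an illustration of the nesting described in the class docstring (split name, then IDs, then content); the IDs, the placeholder texts, and the `check_split` helper are invented for this example and are not part of MTEB or of the dataset.

```python
from typing import Dict

# Type aliases mirroring the shapes documented in the task class.
Corpus = Dict[str, Dict[str, Dict[str, str]]]        # split -> doc ID -> {"title", "text"}
Queries = Dict[str, Dict[str, str]]                  # split -> query ID -> query text
RelevantDocs = Dict[str, Dict[str, Dict[str, int]]]  # split -> query ID -> {doc ID: score}

# Hypothetical toy data; the real task fills these from the Hub dataset.
corpus: Corpus = {
    "train": {
        "doc-a": {"title": "", "text": "placeholder text of article A"},
        "doc-b": {"title": "", "text": "placeholder text of article B"},
    }
}
queries: Queries = {"train": {"q-1": "placeholder question about article A"}}
relevant_docs: RelevantDocs = {"train": {"q-1": {"doc-a": 1}}}


def check_split(split: str) -> bool:
    """Return True if every judged query and document exists in the split."""
    for query_id, judged in relevant_docs[split].items():
        if query_id not in queries[split]:
            return False
        if any(doc_id not in corpus[split] for doc_id in judged):
            return False
    return True


print(check_split("train"))
```

A consistency check of this kind is a cheap way to catch ID mismatches before running an evaluation.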

## Citing this project

If you use this code in your research, please use the following BibTeX entry.

```BibTeX
@misc{louisbrulenaudet2024,
  author =       {Louis Brulé Naudet},
  title =        {Massive Text Embedding Benchmark for French Taxation},
  year =         {2024}
}
```

## Feedback

If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com).

In [1]:
!pip install -U sentence-transformers accelerate transformers tqdm datasets mteb

Collecting sentence-transformers
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
Collecting accelerate
  Downloading accelerate-0.30.1-py3-none-any.whl (302 kB)
Collecting transformers
  Downloading transformers-4.41.1-py3-none-any.whl (9.1 MB)
Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting mteb
  Downloading mteb-1.11.13-p

# Configuration

In [2]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
from typing import List, Tuple

import datasets
import numpy as np

from mteb import MTEB
from mteb.abstasks.AbsTaskRetrieval import AbsTaskRetrieval
from mteb.abstasks.TaskMetadata import TaskMetadata
from sentence_transformers import (
    SentenceTransformer,
    InputExample,
    evaluation,
    util
)
from tqdm import tqdm

# Integrating with MTEB

In [8]:
class TaxRetrievalBenchmark(AbsTaskRetrieval):
    """
    TaxRetrievalBenchmark class for retrieving related tax articles or contents based on queries.

    This class inherits from the AbsTaskRetrieval class and is designed to load and process
    a dataset from the HuggingFace Hub for the task of retrieving related tax articles or contents based on queries.

    Parameters
    ----------
    None

    Attributes
    ----------
    metadata : TaskMetadata
        Metadata for the TaxRetrievalBenchmark task.

    corpus : Dict[str, Dict[str, Dict[str, str]]]
        Dictionary containing the corpus data, where keys are split names, and values are dictionaries
        with document IDs as keys and dictionaries containing document titles and texts as values.

    queries : Dict[str, Dict[str, str]]
        Dictionary containing the queries, where keys are split names, and values are dictionaries
        with query IDs as keys and query texts as values.

    relevant_docs : Dict[str, Dict[str, Dict[str, int]]]
        Dictionary containing the relevant documents, where keys are split names, and values are
        dictionaries with query IDs as keys and dictionaries containing document IDs as keys and
        relevance scores as values.

    Methods
    -------
    description
        Property returning the description of the TaxRetrievalBenchmark task.

    load_data(**kwargs)
        Load the dataset from the HuggingFace Hub and convert it to the MTEB format.

    Examples
    --------
    >>> task = TaxRetrievalBenchmark()
    >>> task.load_data()
    >>> corpus = task.corpus['train']
    >>> queries = task.queries['train']
    >>> relevant_docs = task.relevant_docs['train']
    """
    metadata = TaskMetadata(
        name="TaxRetrievalBenchmark",
        description="Retrieve related tax articles or contents based on queries.",
        reference="https://huggingface.co/datasets/louisbrulenaudet/tax-fr",
        type="Retrieval",
        category="s2p",
        eval_splits=["train"],
        eval_langs=["fra-Latn"],
        main_score="ndcg_at_10",
        dataset={
            "path": "louisbrulenaudet/tax-fr",
            "revision": "96593ed",
        },
        date=("2023-01-01", "2024-12-31"),
        form=["written"],
        domains=["Academic", "Non-fiction", "Legal"],
        task_subtypes=["Article retrieval"],
        license="cc-by-4.0",
        socioeconomic_status="high",
        annotations_creators="derived",
        dialect=[],
        text_creation="found",
        avg_character_length={"train": 69.0},
        n_samples={"train": 440},
        bibtex_citation="""
        @misc {louisbrulenaudet2024,
            author       = { {Louis Brulé Naudet} },
            title        = { tax-fr (Revision 96593ed) },
            year         = 2024,
            url          = { https://huggingface.co/datasets/louisbrulenaudet/tax-fr },
            doi          = { 10.57967/hf/1227 },
            publisher    = { Louis Brulé Naudet }
        }
        """,
    )


    @property
    def description(self):
        """
        Get the description of the TaxRetrievalBenchmark task.

        Returns
        -------
        Dict[str, Union[str, List[str]]]
            A dictionary containing the description of the task.

        Examples
        --------
        >>> task = TaxRetrievalBenchmark()
        >>> description = task.description
        >>> print(description['name'])
        TaxRetrievalBenchmark
        """
        return {
            "name": "TaxRetrievalBenchmark",
            "hf_hub_name": "louisbrulenaudet/tax-fr",
            "type": "Retrieval",
            "category": "s2p",
            "eval_splits": ["train"],
            "eval_langs": ["fra"],
            "main_score": "ndcg_at_10",
            "beir_name": "NA",
        }


    def load_data(self, **kwargs):
        """
        Load the dataset from the HuggingFace Hub and convert it to the MTEB format.

        This method loads the dataset from the HuggingFace Hub and processes it to create
        dictionaries for the corpus, queries, and relevant documents. The dataset is assumed
        to have three columns: "instruction", "data", and "output". The "instruction" column
        contains the query, and the "output" column contains the relevant document.

        Parameters
        ----------
        **kwargs
            Additional keyword arguments.

        Returns
        -------
        None

        Examples
        --------
        >>> task = TaxRetrievalBenchmark()
        >>> task.load_data()
        >>> corpus = task.corpus['train']
        >>> queries = task.queries['train']
        >>> relevant_docs = task.relevant_docs['train']
        """
        if self.data_loaded:
            return

        self.corpus, self.queries, self.relevant_docs = {}, {}, {}
        dataset = datasets.load_dataset(
            "louisbrulenaudet/tax-fr"
        )

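        for split, data in dataset.items():
            corpus = {}
            queries = {}
            relevant_docs = {}

            for row in data:
                query = row["instruction"]
                relevant_doc = row["output"]
                # The raw texts double as IDs, so identical documents or
                # queries collapse into a single entry.
                corpus[relevant_doc] = {"title": "", "text": relevant_doc}
                queries[query] = query
                # Each query is paired with exactly one relevant document
                # (binary relevance score of 1).
                relevant_docs[query] = {relevant_doc: 1}

            # Store the per-split dictionaries in the layout MTEB expects.
            self.corpus[split] = corpus
            self.queries[split] = queries
            self.relevant_docs[split] = relevant_docs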

        self.data_loaded = True
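
To make the behavior of the conversion loop in `load_data` easier to inspect, here is a minimal standalone sketch of the same logic applied to a few hand-written rows. The function name `rows_to_mteb` and the toy rows are hypothetical; only the column names `instruction` and `output` come from the class above.

```python
from typing import Dict, Iterable, Tuple


def rows_to_mteb(rows: Iterable[Dict[str, str]]) -> Tuple[dict, dict, dict]:
    """Mirror the load_data loop: raw texts are used as both IDs and content."""
    corpus, queries, relevant_docs = {}, {}, {}
    for row in rows:
        query = row["instruction"]
        doc = row["output"]
        corpus[doc] = {"title": "", "text": doc}
        queries[query] = query
        # Assignment (not update): a repeated query keeps only its last document.
        relevant_docs[query] = {doc: 1}
    return corpus, queries, relevant_docs


# Toy rows, invented for illustration.
rows = [
    {"instruction": "question 1", "output": "answer A"},
    {"instruction": "question 2", "output": "answer A"},  # same document, new query
    {"instruction": "question 1", "output": "answer B"},  # repeated query
]

corpus, queries, relevant_docs = rows_to_mteb(rows)
print(len(corpus), len(queries))    # 2 2
print(relevant_docs["question 1"])  # {'answer B': 1}
```

Because texts serve as dictionary keys, duplicate documents are merged, and a query that appears in several rows retains only the last relevance judgment. If the dataset contains repeated queries, merging the judgments instead of overwriting them may be preferable.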

In [6]:
model = SentenceTransformer(
    "lemoneresearch/lemone-embed-m-512",
    device="cuda"
)
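
Conceptually, an embedding model is used for retrieval by encoding the query and the documents as vectors and ranking the documents by similarity to the query. The sketch below illustrates that idea with hand-written two-dimensional vectors and cosine similarity in plain Python. It is an assumption-laden illustration, not MTEB's actual evaluator: the helper names and the vectors are made up, and the similarity function MTEB applies may depend on its version and configuration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(
    query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (doc ID, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for model embeddings (hypothetical values).
doc_vecs = {"doc-a": [1.0, 0.0], "doc-b": [0.0, 1.0], "doc-c": [1.0, 1.0]}
ranking = rank_documents([1.0, 0.1], doc_vecs)
print([doc_id for doc_id, _ in ranking])  # ['doc-a', 'doc-c', 'doc-b']
```

In a real run, the vectors would come from the loaded model rather than being written by hand.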




# Evaluation

In [11]:
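# Named `benchmark` so it does not shadow the `evaluation` module
# imported from sentence_transformers above.
benchmark = MTEB(
    tasks=[
        TaxRetrievalBenchmark()
    ]
)

benchmark.run(model)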


[MTEBResults(task_name=TaxRetrievalBenchmark, scores=...)]
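
The task's `main_score` is `ndcg_at_10`. As a reading aid, the sketch below computes nDCG@k for a single query using the common linear-gain formulation, DCG = Σ relᵢ / log₂(i + 1) over ranks i = 1..k, normalized by the DCG of an ideal ranking. The function names are hypothetical and the formula is assumed rather than taken from MTEB's code, so small differences from the reported scores are possible.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[int], k: int) -> float:
    """Discounted cumulative gain over the first k positions (rank 1 = index 0)."""
    return sum(gain / math.log2(index + 2) for index, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids: List[str], judgments: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query, given a ranking and its relevance judgments."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(judgments.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg else 0.0


# One relevant document per query, as in the load_data method above.
judgments = {"doc-a": 1}
print(ndcg_at_k(["doc-a", "doc-b", "doc-c"], judgments))  # 1.0 (relevant doc at rank 1)
print(ndcg_at_k(["doc-b", "doc-c", "doc-a"], judgments))  # 0.5 (relevant doc at rank 3)
```

With a single relevant document per query, nDCG@10 reduces to 1 / log₂(rank + 1) when the document appears in the top 10, and 0 otherwise.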