<a href="https://colab.research.google.com/github/louisbrulenaudet/lemone-embed/blob/main/notebooks/lemone_embed_notebook_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://huggingface.co/louisbrulenaudet/lemone-embed-pro/resolve/main/assets/thumbnail.webp" width="800px">

# Lemone-Embed: A Series of Fine-Tuned Embedding Models for French Taxation
[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg)](https://badge.fury.io/py/tensorflow) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Maintainer](https://img.shields.io/badge/maintainer-@louisbrulenaudet-blue)

<div class="not-prose bg-gradient-to-r from-gray-50-to-white text-gray-900 border" style="border-radius: 8px; padding: 0.5rem 1rem;">
    <p>This series is made up of 7 models: 3 base models of different sizes trained for 1 epoch, 3 models trained for 2 epochs that form the Boost series, and a Pro model with a non-RoBERTa architecture.</p>
</div>

This sentence-transformers model is designed specifically for French taxation. It was fine-tuned on a dataset of 43 million tokens that blends semi-synthetic and fully synthetic data generated by GPT-4 Turbo and Llama 3.1 70B, then refined through evol-instruction tuning and manual curation.

The model is tailored to meet the specific demands of information retrieval across large-scale tax-related corpora, supporting the implementation of production-ready Retrieval-Augmented Generation (RAG) applications. Its primary purpose is to enhance the efficiency and accuracy of legal processes in the taxation domain, with an emphasis on delivering consistent performance in real-world settings, while also contributing to advancements in legal natural language processing research.

This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
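
As a rough illustration of how vectors in such a space are compared for semantic similarity or search, the sketch below computes cosine similarity in plain Python. The three-dimensional vectors are made-up stand-ins for the 768-dimensional embeddings, and `cosine_similarity` is a hypothetical helper written for this example, not part of sentence-transformers.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (assumption for illustration).
query = [1.0, 0.0, 1.0]
close_document = [2.0, 0.0, 2.0]      # same direction as the query
unrelated_document = [0.0, 3.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, close_document))      # close to 1
print(cosine_similarity(query, unrelated_document))  # exactly 0
```

A score near 1 means the two texts point in almost the same direction in the embedding space, which is what a semantic search ranks on.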

If you use this code in your research, please cite it with the following BibTeX entry.

```BibTeX
@misc{louisbrulenaudet2024,
  author =       {Louis Brulé Naudet},
  title =        {Lemone-Embed: A Series of Fine-Tuned Embedding Models for French Taxation},
  year =         {2024},
  howpublished = {\url{https://huggingface.co/datasets/louisbrulenaudet/lemone-embed-pro}},
}
```

## Feedback

If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com).

# Collecting and installing dependencies

In [None]:
!pip3 install chromadb polars datasets sentence-transformers huggingface_hub

# Importing packages

## Core Database and Data Processing

- ChromaDB: A specialized vector database that will be used to store and query our embeddings efficiently
- Polars: A modern, high-performance DataFrame library chosen as an alternative to pandas for data manipulation tasks

## Machine Learning Infrastructure

- Datasets: Integration with Hugging Face's dataset library for streamlined data handling
- PyTorch CUDA: `is_available` checks whether a CUDA GPU can be used, so the embedding model runs on the GPU when there is one and on the CPU otherwise

## Utility Components

- Hashlib: Secure hash functions, used here to compute a SHA-256 hash of each document's text
- Datetime: Temporal data handling, used here to convert epoch timestamps into formatted publication dates
- Type Hints: Comprehensive typing imports for enhanced code documentation and maintainability
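
To make the role of `hashlib` concrete, here is a minimal sketch of content hashing with SHA-256. The `content_hash` helper and the sample strings are assumptions made for this example.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


first = content_hash("Article 256 B")
second = content_hash("Article 256 B")
other = content_hash("Article 257")

print(len(first))       # a SHA-256 hex digest has 64 characters
print(first == second)  # identical text gives an identical hash
print(first == other)   # different text gives a different hash
```

Because the digest depends only on the text, it can serve as a stable fingerprint for detecting duplicated or modified documents.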

In [None]:
import hashlib

from datetime import datetime
from typing import (
    IO,
    TYPE_CHECKING,
    Any,
    Dict,
    List,
    Type,
    Tuple,
    Union,
    Mapping,
    TypeVar,
    Callable,
    Optional,
    Sequence,
)

import chromadb
import polars as pl

from chromadb.config import Settings
from chromadb.utils import embedding_functions
from datasets import Dataset
from torch.cuda import is_available

# Datasets registration

This cell loads a Parquet dataset from Hugging Face's repository (lemone-docs-embeded) using Polars' efficient lazy loading method (scan_parquet), filters out any rows with null values in the 'text' column to ensure data quality, and finally materializes the data into memory with .collect() for further processing.

In [None]:
dataframe = pl.scan_parquet(
  "hf://datasets/louisbrulenaudet/lemone-docs-embeded/data/train-00000-of-00001.parquet"
).filter(
    pl.col(
        "text"
    ).is_not_null()
).collect()

To re-create the dataset from the source data, use the following snippet. It calls `sentence_transformer_ef`, which is defined in the "Index creation" section, so run that cell first:

In [None]:
bofip_dataframe = pl.scan_parquet(
    "hf://datasets/louisbrulenaudet/bofip/data/train-00000-of-00001.parquet"
).with_columns(
    [
        (
            pl.lit("Bulletin officiel des finances publiques - impôts").alias(
                "title_main"
            )
        ),
        (
            pl.col("debut_de_validite")
            .str.strptime(pl.Date, format="%Y-%m-%d")
            .dt.strftime("%Y-%m-%d 00:00:00")
        ).alias("date_publication"),
        (
            pl.col("contenu")
            .map_elements(lambda x: hashlib.sha256(str(x).encode()).hexdigest(), return_dtype=pl.Utf8)
            .alias("hash")
        )
    ]
).rename(
    {
        "contenu": "text",
        "permalien": "url_sourcepage",
        "identifiant_juridique": "id_sub",
    }
).select(
    [
        "text",
        "title_main",
        "id_sub",
        "url_sourcepage",
        "date_publication",
        "hash"
    ]
)

books: List[str] = [
    "hf://datasets/louisbrulenaudet/code-douanes/data/train-00000-of-00001.parquet",
    "hf://datasets/louisbrulenaudet/code-impots/data/train-00000-of-00001.parquet",
    "hf://datasets/louisbrulenaudet/code-impots-annexe-i/data/train-00000-of-00001.parquet",
    "hf://datasets/louisbrulenaudet/code-impots-annexe-ii/data/train-00000-of-00001.parquet",
    "hf://datasets/louisbrulenaudet/code-impots-annexe-iii/data/train-00000-of-00001.parquet",
    "hf://datasets/louisbrulenaudet/code-impots-annexe-iv/data/train-00000-of-00001.parquet",
    "hf://datasets/louisbrulenaudet/code-impositions-biens-services/data/train-00000-of-00001.parquet",
    "hf://datasets/louisbrulenaudet/livre-procedures-fiscales/data/train-00000-of-00001.parquet"
]

legi_dataframe = pl.concat(
    [
        pl.scan_parquet(
            book
        ) for book in books
    ]
).with_columns(
    [
        (
            pl.lit("https://www.legifrance.gouv.fr/codes/article_lc/")
            .add(pl.col("id"))
            .alias("url_sourcepage")
        ),
        (
            pl.col("dateDebut")
            .cast(pl.Int64)
            .map_elements(
                lambda x: datetime.fromtimestamp(x / 1000).strftime("%Y-%m-%d %H:%M:%S"),
                return_dtype=pl.Utf8
            )
            .alias("date_publication")
        ),
        (
            pl.col("texte")
            .map_elements(lambda x: hashlib.sha256(str(x).encode()).hexdigest(), return_dtype=pl.Utf8)
            .alias("hash")
        )
    ]
).rename(
    {
        "texte": "text",
        "num": "id_sub",
    }
).select(
    [
        "text",
        "title_main",
        "id_sub",
        "url_sourcepage",
        "date_publication",
        "hash"
    ]
)

print("Starting embeddings production...")

dataframe = pl.concat(
    [
        bofip_dataframe,
        legi_dataframe
    ]
).filter(
    pl.col(
        "text"
    ).is_not_null()
).with_columns(
    pl.col("text").map_elements(
        lambda x: sentence_transformer_ef(
            [x]
        )[0].tolist(),
        return_dtype=pl.List(pl.Float64)
    ).alias("lemone_pro_embeddings")
).collect()
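
One detail of the `dateDebut` conversion is worth noting: `datetime.fromtimestamp` without a `tz` argument interprets the timestamp in the machine's local time zone, so the resulting string can differ between environments. The sketch below is one way to make the conversion reproducible. It assumes the values are milliseconds since the Unix epoch (as the division by 1000 suggests) and that UTC is an acceptable reference, which the source does not confirm. `format_epoch_millis` is a hypothetical helper.

```python
from datetime import datetime, timezone


def format_epoch_millis(milliseconds: int) -> str:
    """Format a millisecond Unix timestamp as 'YYYY-MM-DD HH:MM:SS' in UTC."""
    seconds = milliseconds / 1000
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(format_epoch_millis(0))                  # 1970-01-01 00:00:00
print(format_epoch_millis(1_704_067_200_000))  # 2024-01-01 00:00:00
```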

# Index creation

This cell initializes a ChromaDB client with telemetry disabled, sets up a SentenceTransformer embedding model (using "lemone-embed-pro" with GPU acceleration if available), and creates or retrieves a collection named "tax" that will store the document embeddings using this model configuration.

In [None]:
client = chromadb.Client(
    settings=Settings(anonymized_telemetry=False)
)

sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="louisbrulenaudet/lemone-embed-pro",
    device="cuda" if is_available() else "cpu",
    trust_remote_code=True
)

collection = client.get_or_create_collection(
    name="tax",
    embedding_function=sentence_transformer_ef
)

This cell populates the ChromaDB collection with the precomputed embeddings from the "lemone_pro_embeddings" column, the corresponding text as documents, every remaining column as metadata, and sequential string IDs generated from the row index.


In [None]:
collection.add(
    embeddings=dataframe["lemone_pro_embeddings"].to_list(),
    documents=dataframe["text"].to_list(),
    # Every remaining column becomes metadata: one dict per row.
    metadatas=dataframe.drop(
        [
            "lemone_pro_embeddings",
            "text"
        ]
    ).to_dicts(),
    ids=[
        str(i) for i in range(0, dataframe.shape[0])
    ]
)
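
Depending on the ChromaDB version, a single `add` call may reject inputs above a maximum batch size, which matters for a corpus of this scale. A possible workaround is to insert the records in chunks. The sketch below shows only the chunking logic in plain Python; `batched` is a hypothetical helper and the batch size of 4 is an arbitrary value for the demonstration.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


ids = [str(i) for i in range(10)]
batches = list(batched(ids, batch_size=4))

print([len(batch) for batch in batches])  # [4, 4, 2]
print(batches[-1])                        # ['8', '9']
```

With the real data, the same start and size values could be passed to `dataframe.slice(start, batch_size)` so that embeddings, documents, metadata, and IDs stay aligned within each `collection.add` call.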

# Collection querying

In [None]:
collection.query(
    query_texts=["Les personnes morales de droit public ne sont pas assujetties à la taxe sur la valeur ajoutée pour l'activité de leurs services administratifs, sociaux, éducatifs, culturels et sportifs lorsque leur non-assujettissement n'entraîne pas de distorsions dans les conditions de la concurrence."],
    n_results=10,
)
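
`collection.query` returns a dictionary in which each field holds one inner list per query text. The sketch below shows one way to turn the results of the first query into a flat list of rows. It assumes the default `include` settings, so that `documents`, `metadatas`, and `distances` are present. `flatten_first_query` is a hypothetical helper, and `mock_result` is made-up data that only mimics the shape of a real response.

```python
from typing import Any, Dict, List


def flatten_first_query(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Convert the hits of the first query into one dict per document."""
    rows = []
    for doc_id, document, metadata, distance in zip(
        result["ids"][0],
        result["documents"][0],
        result["metadatas"][0],
        result["distances"][0],
    ):
        rows.append(
            {
                "id": doc_id,
                "document": document,
                "metadata": metadata,
                "distance": distance,
            }
        )
    return rows


# Made-up response with the same nesting as a one-query result.
mock_result = {
    "ids": [["12", "7"]],
    "documents": [["Texte A", "Texte B"]],
    "metadatas": [[{"id_sub": "256 B"}, {"id_sub": "257"}]],
    "distances": [[0.12, 0.35]],
}

for row in flatten_first_query(mock_result):
    print(row["id"], row["distance"], row["metadata"]["id_sub"])
```

Results are ordered from the smallest distance to the largest, so the first row is the closest match to the query.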