<a href="https://colab.research.google.com/github/louisbrulenaudet/ragoon/blob/main/notebooks/RAGoon_EmbeddingsDataLoader_cookbook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAGoon EmbeddingsDataLoader cookbook ⚡
[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg)](https://badge.fury.io/py/tensorflow) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Maintainer](https://img.shields.io/badge/maintainer-@louisbrulenaudet-blue)
[![GitHub](https://img.shields.io/badge/GitHub-Project-blue?logo=github)](https://github.com/louisbrulenaudet/ragoon)

![Plot](https://github.com/louisbrulenaudet/ragoon/blob/main/thumbnail.png?raw=true)

RAGoon is a set of NLP utilities for multi-model embedding production and high-dimensional vector visualization. It aims to improve language model performance by providing contextually relevant information through search-based querying, web scraping, and data augmentation techniques.

In this notebook, you will learn how to generate embeddings with several models in order to process large datasets.

## Quick install
RAGoon is available on PyPI: [RAGoon](https://pypi.org/project/ragoon/).

```shell
pip install ragoon
```

## Citing this project
If you use this code in your research, please use the following BibTeX entry.

```BibTeX
@misc{louisbrulenaudet2024,
	author = {Louis Brulé Naudet},
	title = {RAGoon : High level library for batched embeddings generation, blazingly-fast web-based RAG and quantitized indexes processing},
	howpublished = {\url{https://github.com/louisbrulenaudet/ragoon}},
	year = {2024}
}
```

## Feedback
If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com).



# Installation

The RAGoon project leverages a variety of libraries to provide robust functionality for tasks such as embeddings generation, retrieval-augmented generation (RAG), and web-based processing. Below is an overview of some key dependencies:

- `transformers`: This library from Hugging Face is essential for working with state-of-the-art language models, enabling the project to perform tasks like text generation and model inference.
- `torch`: PyTorch is used for deep learning operations, particularly for model training and inference. It is a fundamental component for handling neural networks and tensor computations.
- `sentence_transformers`: This library simplifies the generation of dense vector representations (embeddings) from text, which is crucial for tasks like semantic search and information retrieval.
- `faiss_cpu`: FAISS is a powerful library for efficient similarity search, used in RAGoon to handle large-scale indexing and retrieval tasks with high performance.
- `httpx` and `beautifulsoup4`: These libraries are used for web scraping and making HTTP requests, enabling the project to fetch and process data from web sources efficiently.
- `openai`: This library connects to OpenAI's APIs, allowing integration with models like GPT for advanced text generation capabilities.
- `huggingface_hub`: Essential for interacting with Hugging Face’s model repository, enabling easy access to pre-trained models and datasets.

Together, these dependencies cover the natural language processing, machine learning, and web data processing tasks that RAGoon supports.
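To check which of these dependencies are present in the current environment, a minimal sketch using only the standard library could look like the following. The distribution names are assumptions based on the usual PyPI spellings (for example `faiss-cpu` and `sentence-transformers`), and `installed_versions` is a hypothetical helper, not part of RAGoon.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names for the dependencies listed above.
DEPENDENCIES = [
    "transformers",
    "torch",
    "sentence-transformers",
    "faiss-cpu",
    "httpx",
    "beautifulsoup4",
    "openai",
    "huggingface-hub",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions


# Print a short report of what is installed in this environment.
for name, found in installed_versions(DEPENDENCIES).items():
    print(f"{name}: {found or 'not installed'}")
```

Running this before and after the install cell is one way to see which packages `pip` pulled in.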

In [1]:
!pip3 install ragoon

Collecting ragoon
  Downloading ragoon-0.0.8-py3-none-any.whl.metadata (7.7 kB)
Collecting datasets==2.20.0 (from ragoon)
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting faiss-cpu==1.8.0 (from ragoon)
  Downloading faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting google-api-python-client==2.126.0 (from ragoon)
  Downloading google_api_python_client-2.126.0-py2.py3-none-any.whl.metadata (6.7 kB)
Collecting groq==0.9.0 (from ragoon)
  Downloading groq-0.9.0-py3-none-any.whl.metadata (13 kB)
Collecting httpx==0.27.0 (from ragoon)
  Downloading httpx-0.27.0-py3-none-any.whl.metadata (7.2 kB)
Collecting huggingface-hub==0.24.2 (from ragoon)
  Downloading huggingface_hub-0.24.2-py3-none-any.whl.metadata (13 kB)
Collecting myst-parser==3.0.1 (from ragoon)
  Downloading myst_parser-3.0.1-py3-none-any.whl.metadata (5.5 kB)
Collecting numpydoc==1.7.0 (from ragoon)
  Downloading numpydoc-1.7.0-py3-none-any.whl.metad

In [2]:
from ragoon import EmbeddingsDataLoader

from datasets import (
    load_dataset,
    Dataset,
    DatasetDict
)

## Process specific column of a dataset

The `EmbeddingsDataLoader` class streamlines the process of embedding textual data. It handles:

- Loading Datasets: From the Hugging Face Hub or directly from a preloaded dataset.
- Loading and Managing Models: Including GPU-accelerated models and inference API models.
- Processing Text Data: By mapping embedding functions over the specified dataset column, allowing for batch processing.
- Handling Output: The loader can convert outputs into tensors or retain them as lists, making it adaptable to various machine learning workflows.
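The idea of mapping embedding functions over one column in batches can be sketched without any model at all. The example below is an illustration under stated assumptions, not RAGoon's implementation: `toy_embed` is a hypothetical stand-in for a real model, `embed_column` is a hypothetical helper, and the `<model>_embeddings` field naming simply mirrors the keys visible in the notebook output.

```python
import hashlib


def toy_embed(text, dim=4):
    """Stand-in for a real model: derive `dim` floats in [0, 1] from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255 for i in range(dim)]


def embed_column(rows, column, models, batch_size=2):
    """Return copies of `rows` with one `<model>_embeddings` field per model."""
    processed = []
    for start in range(0, len(rows), batch_size):
        # Copy each row so the input dataset is left untouched.
        batch = [dict(row) for row in rows[start:start + batch_size]]
        texts = [row[column] for row in batch]
        for name, embed in models.items():
            for row, text in zip(batch, texts):
                row[f"{name}_embeddings"] = embed(text)
        processed.extend(batch)
    return processed


# Hypothetical data and model names, used only for illustration.
rows = [
    {"instruction": "q1", "output": "first answer"},
    {"instruction": "q2", "output": "second answer"},
    {"instruction": "q3", "output": "third answer"},
]
models = {
    "model-a": toy_embed,
    "model-b": lambda text: toy_embed(text, dim=2),
}

processed = embed_column(rows, "output", models)
print(sorted(processed[0].keys()))
```

Each row ends up with one extra field per model, which is the same shape as the processed dataset row shown later in this notebook.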

In [3]:
# Initialize the dataset loader with multiple models
loader = EmbeddingsDataLoader(
    token="your_token",
    dataset=load_dataset("louisbrulenaudet/dac6-instruct", split="train"),  # Use this if the dataset is already loaded.
    # dataset_name="louisbrulenaudet/dac6-instruct",  # Alternatively, let the class load the dataset by name.
    model_configs=[
        {"model": "bert-base-uncased", "query_prefix": "Query:"},
        {"model": "distilbert-base-uncased", "query_prefix": "Query:"}
        # Add more model configurations as needed
    ]
)

# Uncomment this line if you pass dataset_name instead of dataset.
# loader.load_dataset()

# Process the splits with all models loaded
loader.process(
    column="output",
    preload_models=True
)

# To access the processed dataset
processed_dataset = loader.get_dataset()
processed_dataset[0]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/3.73k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/389k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/414 [00:00<?, ? examples/s]

2024-08-09 05:41:19,782 - INFO - Loading model bert-base-uncased with CUDA.
INFO:ragoon._logger:Loading model bert-base-uncased with CUDA.


config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

2024-08-09 05:41:25,493 - INFO - Loading model distilbert-base-uncased with CUDA.
INFO:ragoon._logger:Loading model distilbert-base-uncased with CUDA.


config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

2024-08-09 05:41:28,795 - INFO - Loaded 2 models.
INFO:ragoon._logger:Loaded 2 models.


Map:   0%|          | 0/414 [00:00<?, ? examples/s]

Map:   0%|          | 0/414 [00:00<?, ? examples/s]

{'instruction': "Exposez la classification bipartite des 'marqueurs' établie par la directive DAC 6, en soulignant la différenciation entre les 'marqueurs simples' et les 'marqueurs doubles'",
 'output': "La directive DAC 6 effectue une classification bipartite des 'marqueurs' qui permet d'évaluer les dispositifs transfrontières à la lumière de leur probabilité d'impliquer de l'évasion fiscale. Cette classification distingue les 'marqueurs simples', qui sont des indicateurs autonomes de risque d'évasion fiscale, des 'marqueurs doubles', qui requièrent une corrélation avec le critère de 'l'avantage principal'. Un 'marqueur double' est ainsi un indice qui non seulement représente un risque potentiel d'évasion fiscale en tant que tel mais doit également être combiné à l'objectif de l'obtention d'un avantage fiscal primordial pour être considéré comme indicatif d'une planification fiscale agressive.",
 'input': '',
 'bert-base-uncased_embeddings': [-0.6093569397926331,
  0.1077279672026634

# Process single text using multiple models

This section shows how to obtain several representations of a single text, one per loaded model. This is particularly useful when comparing model performance or when an application requires several embedding strategies.
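One way to compare models is to embed two texts with each model and compute a per-model cosine similarity. The sketch below assumes each result is a dictionary keyed by `<model>_embeddings` holding a list of floats, as in this notebook's output. The vectors are hand-written placeholders rather than real embeddings, and `compare_per_model` is a hypothetical helper, not part of RAGoon.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def compare_per_model(result_a, result_b):
    """Compute one similarity score per key present in both result dictionaries."""
    return {
        key: cosine_similarity(result_a[key], result_b[key])
        for key in result_a
        if key in result_b
    }


# Placeholder results standing in for two embedding dictionaries.
first = {"model-a_embeddings": [1.0, 0.0], "model-b_embeddings": [1.0, 1.0, 0.0]}
second = {"model-a_embeddings": [0.0, 1.0], "model-b_embeddings": [1.0, 1.0, 0.0]}

scores = compare_per_model(first, second)
for key, score in scores.items():
    print(f"{key}: {score:.3f}")
```

Scores are only comparable within a single model, since each model has its own vector space and dimensionality.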

In [6]:
# Embed a single text with all loaded models
text = "This is a single text for embedding."
embedding_result = loader.batch_encode(text)

# Output the embeddings
embedding_result

{'bert-base-uncased_embeddings': [0.013916087336838245,
  -0.15780042111873627,
  0.06411590427160263,
  -0.14048252999782562,
  -0.19750358164310455,
  -0.3583107888698578,
  -0.0022752955555915833,
  0.09157585352659225,
  0.22588276863098145,
  -0.08414846658706665,
  -0.2243877500295639,
  -0.14367635548114777,
  -0.2859238088130951,
  0.14431330561637878,
  -0.19339609146118164,
  0.11249237507581711,
  0.010248166508972645,
  0.08763032406568527,
  0.005379597190767527,
  -0.11435342580080032,
  0.0800805315375328,
  0.005355249624699354,
  -0.527036190032959,
  0.007010380271822214,
  0.4973141849040985,
  -0.11286129802465439,
  0.04615470767021179,
  -0.14579349756240845,
  -0.4155619144439697,
  -0.10315956920385361,
  0.06255430728197098,
  0.3491467237472534,
  -0.13668292760849,
  -0.26666924357414246,
  -0.07865414768457413,
  -0.02722395956516266,
  0.45141157507896423,
  0.04447467625141144,
  0.19031189382076263,
  0.1340564340353012,
  -0.40325841307640076,
  -0.39988