# rag4rag: Using RAG to generate data for fine-tuning retrieval models
## A demo
By Mary Newhauser, MLE @ Weaviate

Using synthetic data to fine-tune retrieval models is a cheap and popular way to increase accuracy for RAG and agentic workflows. But synthetic data has a huge hallucination problem. This problem, however, can be mitigated by using RAG to generate synthetic data for fine-tuning rather than using LLMs in isolation.

### What type of data is needed for fine-tuning a retrieval model?
Retrieval is the process of accessing or recovering stored information or items. To train models to be better at retrieval, we feed them a `context` along with `question` and `answer` pairs.

Here's an example:
```json
{
    "context": "Beyoncé's debut album, Dangerously in Love (2003), established her as a solo artist worldwide.",
    "question": "Which album established Beyoncé as a worldwide artist?",
    "answer": "Dangerously in Love."
}
```

Curating these types of datasets with human annotators is both costly and laborious because it requires humans to read a chunk of text (which can be long), come up with their own questions, and then give the answers to those questions. As a result, many have turned to using LLMs to generate this type of data synthetically.

### The hallucination problem

To generate synthetic data, usually the `context` is passed to the LLM in a prompt, along with instructions to use it to generate a `question` and an `answer`. The problem is that when generating the `answer`, the LLM can hallucinate, producing an incorrect `answer` based on information in its parametric memory (aka the training data) rather than an `answer` grounded in the `context`. 
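
As a rough illustration of this zero-shot setup, the sketch below assembles such a prompt. The template wording and the `build_zero_shot_prompt` helper are hypothetical and are not the prompt used to build the dataset in this notebook. Here the question is supplied and only the answer is left to the model.

```python
def build_zero_shot_prompt(context: str, question: str) -> str:
    """Assemble a zero-shot answer-generation prompt (hypothetical template)."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_zero_shot_prompt(
    context=(
        "Beyoncé's debut album, Dangerously in Love (2003), "
        "established her as a solo artist worldwide."
    ),
    question="Which album established Beyoncé as a worldwide artist?",
)
print(prompt)
```

Nothing in this setup forces the model to stay within the context; the instruction is only a request, which is where hallucinations can creep in.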

Here's a real example I obtained during my research:
```json
{
    "context": "Beyoncé's debut album, Dangerously in Love (2003), established her as a solo artist worldwide.",
    "question": "Which album established Beyoncé as a worldwide artist?",
    "synthetic_answer": "1989."
}
```
Although the answer is so obviously *Dangerously in Love*, the model gave the wrong answer. But the wrong answer it gave is VERY interesting... because *1989* was the album that established Taylor Swift as a worldwide artist. This suggests that the model did not ground its answer in the provided context and instead relied on its training data to produce an incorrect answer. In short, it hallucinated.

### RAG as the solution
In this notebook, we examine the hallucination problem in synthetic RAG data and propose and test a solution: using RAG to generate data to fine-tune retrievers. We find that using RAG, rather than simply instructing an LLM to go back into the context to generate an answer, dramatically improves accuracy.

We also accomplish the following:
* Use [LettuceDetect](https://krlabs.eu/LettuceDetect/) to detect hallucinations in synthetically generated data
* Examine 5 examples of hallucinations in the synthetic data
* Use RAG (with LangChain and FAISS) to generate more accurate data
* Evaluate RAG-generated data with LettuceDetect to check for hallucinations

In [1]:
import os
import openai
import textwrap

import pandas as pd
import plotly.express as px

from datasets import load_dataset
from lettucedetect.models.inference import HallucinationDetector

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.docstore.document import Document
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

In [2]:
import warnings
import logging

# Suppress warnings
warnings.filterwarnings("ignore")

# Suppress all logging below ERROR level for the root logger
logging.getLogger().setLevel(logging.ERROR)

## Load the synthetically-generated RAG dataset

This dataset consists of two parts: a human-generated dataset of context, question, and answer triplets and LLM-generated positive (correct) and negative (incorrect) answers based on those same contexts and questions. We then compare the human-generated answers to the LLM-generated (positive) answers.

More specifically, this dataset contains roughly 5,000 examples (4,989 rows) from the [rajpurkar/squad](https://huggingface.co/datasets/rajpurkar/squad) dataset, which consists of human-generated questions and answers based on a set of Wikipedia articles. Both positive and negative synthetically-generated answers to the same context and question pairs were obtained using `gpt-4o`.

Here's a breakdown of the different fields in the dataset:

| Column name    | Definition |
| -------------- | ---------- |
| `context`  | Part of a scraped Wikipedia page (from SQuAD dataset) |
| `anchor` | Human-generated question about the `context` (from SQuAD dataset)     |
| `human_positive`    | Human-generated answer to the `anchor` (from SQuAD dataset)   |
| `synthetic_positive`  | Synthetically-generated correct answer to the `anchor` (gpt-4o) |
| `synthetic_negative`  |  Synthetically-generated incorrect answer to the `anchor` (gpt-4o)   |

In [3]:
# Load synthetic RAG dataset
ds = load_dataset("m-newhauser/rag-synthetic-distilabel")
ds

DatasetDict({
    train: Dataset({
        features: ['context', 'anchor', 'human_positive', 'synthetic_positive', 'synthetic_negative'],
        num_rows: 4989
    })
})

In [4]:
# Transform dataset to dataframe
synthetic_distilabel_df = ds["train"].to_pandas()

# Preview the dataset
synthetic_distilabel_df.head()

|  | context | anchor | human_positive | synthetic_positive | synthetic_negative |
| --- | --- | --- | --- | --- | --- |
| 0 | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | Saint Bernadette Soubirous | The Virgin Mary allegedly appeared to Bernadet... | The Virgin Mary appeared in the sky as the sun... |
| 1 | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | a copper statue of Christ | In front of the Notre Dame Main Building, you'... | The main building's roof is painted in bright ... |
| 2 | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | the Main Building | The Basilica of the Sacred Heart at Notre Dame... | The basilica's heart-shaped design was inspire... |
| 3 | Architecturally, the school has a Catholic cha... | What is the Grotto at Notre Dame? | a Marian place of prayer and reflection | The Grotto at Notre Dame is a sacred replica o... | The grotto was filled with colorful lights and... |
| 4 | Architecturally, the school has a Catholic cha... | What sits on top of the Main Building at Notre... | a golden statue of the Virgin Mary | The iconic Golden Dome sits on top of the Main... | The main course sits on top of the dining tabl... |


## Use LettuceDetect to detect hallucinations in synthetic data

Next, we use `LettuceDetect` to compare the synthetically-generated positive answers (`synthetic_positive`) with the `context`, which contains the real answer to the question (`anchor`).

`LettuceDetect` is a robust open source hallucination detection framework designed specifically for RAG. Built on ModernBERT and hosted on the HuggingFace Model Hub, it identifies hallucinated spans of text in LLM-generated answers.
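
To make "hallucinated spans" concrete: the code in this notebook treats the detector's output as a list of dictionaries with `text` and `confidence` keys. The sketch below assumes that shape (the example values are invented) and picks the most confident span, which is one possible way to summarize an answer with several flagged spans. The `top_span` helper is hypothetical and is not part of LettuceDetect.

```python
from typing import Optional


def top_span(spans: list[dict]) -> Optional[dict]:
    """Return the span with the highest confidence, or None if there are none."""
    if not spans:
        return None
    return max(spans, key=lambda span: span.get("confidence", 0.0))


# Hypothetical detector output for one answer (values are made up)
example_spans = [
    {"text": "released next month", "confidence": 0.62},
    {"text": "her new album", "confidence": 0.97},
]

best = top_span(example_spans)
print(best["text"], best["confidence"])
```

The notebook itself keeps the first span returned for each answer; taking the maximum is an alternative worth considering when an answer has more than one flagged span.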

In [5]:
# Load the hallucination detector model
detector = HallucinationDetector(
    method="transformer", model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1"
)

*Note: This cell can take 30+ minutes to execute.*

*The results from this cell have been saved as a dataset on the HuggingFace Hub [here](https://huggingface.co/datasets/m-newhauser/rag-synthetic-distilabel-hallucinations).*

In [6]:
# Run over the RAG dataset
def predict_hallucinations(row):
    predictions = detector.predict(
        context=[row['context']],
        question=row['anchor'],
        answer=row['synthetic_positive'],
        output_format="spans"
    )
    # Assuming predictions is a list of dictionaries
    if predictions:
        return predictions[0].get('text', ''), predictions[0].get('confidence', 0.0)
    return '', ''

# Apply the function to each row of the DataFrame
synthetic_distilabel_df[['hallucinated_span', 'confidence']] = synthetic_distilabel_df.apply(predict_hallucinations, axis=1, result_type='expand')

If you don't want to run the code in the cell above, you can download the dataset with all the hallucination information below:

In [6]:
# Optionally load the dataset with the hallucination data from the Hub
synthetic_distilabel_hallucinations_df = load_dataset("m-newhauser/rag-synthetic-distilabel-hallucinations")["train"].to_pandas()

In [7]:
# Preview the dataframe
synthetic_distilabel_hallucinations_df.head()

|  | context | anchor | human_positive | synthetic_positive | synthetic_negative | hallucinated_span | confidence |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | Saint Bernadette Soubirous | The Virgin Mary allegedly appeared to Bernadet... | The Virgin Mary appeared in the sky as the sun... |  |  |
| 1 | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | a copper statue of Christ | In front of the Notre Dame Main Building, you'... | The main building's roof is painted in bright ... |  |  |
| 2 | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | the Main Building | The Basilica of the Sacred Heart at Notre Dame... | The basilica's heart-shaped design was inspire... |  |  |
| 3 | Architecturally, the school has a Catholic cha... | What is the Grotto at Notre Dame? | a Marian place of prayer and reflection | The Grotto at Notre Dame is a sacred replica o... | The grotto was filled with colorful lights and... |  |  |
| 4 | Architecturally, the school has a Catholic cha... | What sits on top of the Main Building at Notre... | a golden statue of the Virgin Mary | The iconic Golden Dome sits on top of the Main... | The main course sits on top of the dining tabl... |  |  |


### Analyze hallucinations

#### Confidence threshold analysis

Not all rows in the `hallucinated_span` column are actually hallucinations, which is why LettuceDetect includes a confidence score. Below, we plot the distribution of scores to help us select a threshold.

In [8]:
# Plot the distribution of confidence scores
fig = px.histogram(synthetic_distilabel_hallucinations_df, x="confidence", title="Distribution of Confidence Scores", nbins=50)

fig.update_layout(
    xaxis_title="Confidence Score",
    yaxis_title="Count",
    bargap=0.2,
    title_x=0.5,
    title_y=0.95,
    # Adjust width
    width=800,
)

# Show the plot
fig.show()

Let's set our threshold for classifying a response as a hallucination at `0.9`.

In [9]:
# Manually set threshold
threshold = 0.9

# Filter the DataFrame for hallucinations based on the threshold
hallucinations_df = (
    synthetic_distilabel_hallucinations_df
    .query("confidence != ''")
    .query(f"confidence >= {threshold}")
)
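
The choice of threshold directly drives the reported hallucination rate. The sketch below uses invented confidence values (not the notebook's data) and a hypothetical `flagged_rate` helper to show how the flagged share moves as the threshold changes. Rows with an empty confidence count as not flagged, mirroring the filter above. It is written in plain Python rather than pandas so it runs without dependencies.

```python
def flagged_rate(confidences, threshold):
    """Share of rows whose confidence is numeric and at or above the threshold."""
    flagged = [
        c for c in confidences
        if isinstance(c, (int, float)) and c >= threshold
    ]
    return len(flagged) / len(confidences)


# Toy confidence column: '' marks rows where no span was returned
toy_confidences = [0.99, 0.95, 0.91, 0.72, 0.55, "", "", ""]

for t in (0.5, 0.7, 0.9):
    print(f"threshold={t}: {flagged_rate(toy_confidences, t):.1%} flagged")
```

A stricter threshold trades recall for precision: fewer rows are flagged, but those that remain are the ones the detector is most sure about.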

Now calculate the number of hallucinations detected in the entire dataset and the hallucination rate.

In [10]:
# Print the number of hallucinations
print(f"Total hallucinations detected: {hallucinations_df.shape[0]} ({hallucinations_df.shape[0]/synthetic_distilabel_df.shape[0] * 100:.2f}%)")

Total hallucinations detected: 2152 (43.13%)


The number of detected hallucinations at our threshold is staggeringly high!

#### Inspect some hallucinations
Let's take a look at some specific examples of hallucinations regarding a popular topic: Beyonce. 👑 🐝

In [11]:
# Create a df of hallucinations with Beyonce as the topic
bey_df = hallucinations_df.query("context.str.contains('Beyonce')").query("hallucinated_span != ''")
bey_df

|  | context | anchor | human_positive | synthetic_positive | synthetic_negative | hallucinated_span | confidence |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 610 | In August, the couple attended the 2011 MTV Vi... | Beyonce confirmed what after performing one of... | her pregnancy | Beyonce confirmed that her new album would be ... | Beyonce was spotted shopping for new outfits i... | Beyonce confirmed that her new album would be ... | 0.999672 |
| 612 | In August, the couple attended the 2011 MTV Vi... | Where did she announce her pregnancy? | 2011 MTV Video Music Awards | She announced her pregnancy at a family gather... | She announced a new job opportunity during the... | She announced her pregnancy at a family gather... | 0.992555 |
| 613 | In August, the couple attended the 2011 MTV Vi... | Why was the broadcast the most-watched in hist... | Her appearance | The broadcast was the most-watched in history ... | The most-watched broadcast in history happened... | highly anticipated event that captivated audi... | 0.974278 |
| 614 | In August, the couple attended the 2011 MTV Vi... | What even was recorded in the Guinness World R... | most tweets per second | The tallest man ever recorded was Robert Wadlo... | The book on the shelf was covered in dust and ... | The tallest man ever recorded was Robert Wadlo... | 0.999291 |
| 615 | In August, the couple attended the 2011 MTV Vi... | What was the most searched term in week of Aug... | Beyonce pregnant | The most searched term in the week of August 2... | The most searched term in the week of August 2... | The most searched term in the week of August 2... | 0.997824 |
| 616 | In August, the couple attended the 2011 MTV Vi... | What song did she perform at the MTV Awards? | Love on Top | She performed her hit single "Midnight Dreams"... | The song has a catchy beat and was released la... | She performed her hit single "Midnight Dreams"... | 0.999407 |


Now, let's view it in a more readable form.

In [12]:
# Print Beyonce hallucinations, adding line breaks for readability
wrapped_text = textwrap.fill(bey_df['context'].values[0], width=80)  # Adjust width as needed
print(f"Context: {wrapped_text}")
print("---------")

for index, row in bey_df.head(4).iterrows():
    print(f"Anchor: {row['anchor']}")
    print(f"Human Positive: {row['human_positive']}")
    print(f"Hallucinated Span: {row['hallucinated_span']}")
    print("---------")

Context: In August, the couple attended the 2011 MTV Video Music Awards, at which Beyoncé
performed "Love on Top" and started the performance saying "Tonight I want you
to stand up on your feet, I want you to feel the love that's growing inside of
me". At the end of the performance, she dropped her microphone, unbuttoned her
blazer and rubbed her stomach, confirming her pregnancy she had alluded to
earlier in the evening. Her appearance helped that year's MTV Video Music Awards
become the most-watched broadcast in MTV history, pulling in 12.4 million
viewers; the announcement was listed in Guinness World Records for "most tweets
per second recorded for a single event" on Twitter, receiving 8,868 tweets per
second and "Beyonce pregnant" was the most Googled term the week of August 29,
2011.
---------
Anchor: Beyonce confirmed what after performing one of her songs?
Human Positive: her pregnancy
Hallucinated Span: Beyonce confirmed that her new album would be released next month after pe

From these hallucinations, we can clearly tell that the model is not actually using the provided `context` to generate its answers. Instead, it's hallucinating and generating synthetic answers based on its parametric memory (the data the model was trained on).

We might expect this type of behavior for a very niche topic that's not well-represented in the training dataset... but Beyonce is SO POPULAR! And this makes the hallucinations especially concerning.

This is a problem! For which we have a solution.

## Use RAG to generate more accurate synthetic data

We can use RAG to reduce hallucinations in synthetically-generated data. Instead of giving the LLM the context and the question and **hoping** that it generates the right answer, we retrieve the relevant context for each question and have the LLM generate its answer from that retrieved context.

### Vanilla RAG pipeline (LangChain + FAISS + OpenAI)

This LangChain RAG pipeline first constructs a [FAISS](https://github.com/facebookresearch/faiss) vector store from the provided text contexts by embedding them with OpenAI's embeddings. Then, it uses LangChain's `RetrievalQA` chain, configured with an OpenAI LLM and a retriever based on the vector store, to generate answers for a list of input questions.
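
To show the retrieve-then-answer flow without any API calls, here is a toy sketch. It swaps in word-count cosine similarity for OpenAI embeddings and a plain Python list for FAISS, so it illustrates only the shape of the pipeline, not the notebook's actual retrieval quality. The helper names and the two example contexts are made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase word counts, a crude stand-in for an embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_top1(question: str, contexts: list[str]) -> str:
    """Return the context most similar to the question (k=1)."""
    q = bag_of_words(question)
    return max(contexts, key=lambda ctx: cosine(q, bag_of_words(ctx)))


# Toy corpus standing in for the vector store
contexts = [
    "Beyoncé performed Love on Top at the 2011 MTV Video Music Awards.",
    "The College of Engineering at Notre Dame was established in 1920.",
]
question = "What song did she perform at the MTV Awards?"

# Retrieve first, then build the prompt that would be sent to the LLM
retrieved = retrieve_top1(question, contexts)
prompt = f"Context: {retrieved}\nQuestion: {question}\nAnswer:"
print(prompt)
```

With `k=1`, the answer can only be as good as the single passage that comes back, so retrieval quality matters as much as the generator.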

The block below runs the RAG pipeline over the entire dataset of hallucinated answers from Distilabel. To avoid unnecessary inference costs, we save the output from this code to the HuggingFace Hub [here].

In [16]:
# Read the OpenAI API key from the environment
openai.api_key = os.environ.get("OPENAI_API_KEY")

if not openai.api_key:
    raise EnvironmentError(
        "The OPENAI_API_KEY environment variable is not set. "
        "Please define it before running this script."
    )

# Prepare the inputs for the RAG pipeline.
# Each unique context is indexed once; questions and human answers stay row-aligned.
contexts = hallucinations_df["context"].drop_duplicates().tolist()
anchors = hallucinations_df["anchor"].tolist()
human_positives = hallucinations_df["human_positive"].tolist()

# Create a vectorstore from the contexts
docs = [Document(page_content=ctx) for ctx in contexts]
embedding = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embedding)

# Create a retriever + QA chain
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})
qa = RetrievalQA.from_chain_type(llm=OpenAI(), retriever=retriever)

# Run the RAG pipeline
rag_answers = [qa.run(a) for a in anchors]

# Put the answers into a dataframe, pairing each question with its source context
rag_df = pd.DataFrame({
    "context": hallucinations_df["context"].tolist(),
    "question": anchors,
    "human_positive": human_positives,
    "rag_positive": rag_answers,
})

### Analyze hallucinations
Now, let's check them for hallucinations using LettuceDetect.

This code will also take a while to execute, so we can optionally load its output, which was saved to the Hub.

In [None]:
# Run over the RAG dataset
def predict_hallucinations(row):
    predictions = detector.predict(
        context=[row['context']],
        question=row['question'],
        answer=row['rag_positive'],
        output_format="spans"
    )
    # Assuming predictions is a list of dictionaries
    if predictions:
        return predictions[0].get('text', ''), predictions[0].get('confidence', 0.0)
    return '', ''

# Apply the function to each row of the DataFrame
rag_df[['hallucinated_span', 'confidence']] = rag_df.apply(predict_hallucinations, axis=1, result_type='expand')
rag_df

In [14]:
# Optionally load the dataset with the hallucination data from the Hub
rag_df = load_dataset("m-newhauser/rag4rag-synthetic-hallucinations")["train"].to_pandas()

In [15]:
# Preview the data
rag_df.head()

|  | context | question | human_positive | rag_positive | hallucinated_span | confidence |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | As at most other universities, Notre Dame's st... | When did the Scholastic Magazine of Notre dame... | September 1876 | September 1876 |  |  |
| 1 | The university is the major seat of the Congre... | What is the primary seminary of the Congregati... | Moreau Seminary | Moreau Seminary |  |  |
| 2 | The university is the major seat of the Congre... | What is the oldest structure at Notre Dame? | Old College | The oldest structure at Notre Dame is the Old ... |  |  |
| 3 | The university is the major seat of the Congre... | Which prize did Frederick Buechner create? | Buechner Prize for Preaching | The National Book Award. | The National Book Award. | 0.989379 |
| 4 | The College of Engineering was established in ... | In what year was the College of Engineering at... | 1920 | 1921 | 1921 | 0.983793 |


#### Inspect some hallucinations
Let's see how RAG did generating answers for the Beyonce questions.

In [16]:
# Create a df of hallucinations with Beyonce as the topic
bey_df = rag_df.query("context.str.contains('Beyonce')").query("hallucinated_span != ''")
bey_df

|  | context | question | human_positive | rag_positive | hallucinated_span | confidence |
| --- | --- | --- | --- | --- | --- | --- |
| 194 | In August, the couple attended the 2011 MTV Vi... | Beyonce confirmed what after performing one of... | her pregnancy | She confirmed that she would perform alongside... | She confirmed that she would perform alongside... | 0.999331 |
| 197 | In August, the couple attended the 2011 MTV Vi... | What even was recorded in the Guinness World R... | most tweets per second | The earliest recording of Chopin's works was a... | The earliest recording of Chopin's works was a... | 0.999022 |


The number of hallucinated answers drops from 6 to 2! This is a significant improvement, but it is also slightly concerning that hallucinations still occur even when using RAG.
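
One possible contributor to the remaining hallucinations (a hypothesis, not something this notebook verifies) is that the retriever returns a passage other than the one the question was written about; the Chopin answer above would be consistent with that. The sketch below shows one way to measure this with a top-1 "hit rate". The `retrieval_hit_rate` helper and the lookup-table retriever are hypothetical stand-ins; in practice `retrieve` would wrap whatever retriever the pipeline uses.

```python
from typing import Callable


def retrieval_hit_rate(
    questions: list[str],
    source_contexts: list[str],
    retrieve: Callable[[str], str],
) -> float:
    """Fraction of questions whose top-1 retrieved text equals its source context."""
    hits = sum(
        retrieve(question) == source
        for question, source in zip(questions, source_contexts)
    )
    return hits / len(questions)


# Toy stand-in retriever: a fixed lookup table instead of a vector store
fake_index = {
    "q1": "context A",
    "q2": "context B",
    "q3": "context A",  # wrong: q3 was written about context C
}

rate = retrieval_hit_rate(
    questions=["q1", "q2", "q3"],
    source_contexts=["context A", "context B", "context C"],
    retrieve=fake_index.get,
)
print(f"Top-1 retrieval hit rate: {rate:.2f}")
```

A low hit rate would point to retrieval, rather than generation, as the place to look first.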

## Compare approaches

In [17]:
# Subset remaining hallucinations from RAG generated answers
rag_hallucinations_df = rag_df.query("confidence != ''").query(f"confidence >= {threshold}")

In [18]:
print("Synthetic Distilabel Dataset")
print(f"Dataset size: {synthetic_distilabel_df.shape[0]}")
print(f"Total hallucinations detected: {hallucinations_df.shape[0]} ({hallucinations_df.shape[0]/synthetic_distilabel_df.shape[0] * 100:.2f}%)")

Synthetic Distilabel Dataset
Dataset size: 4989
Total hallucinations detected: 2152 (43.13%)


In [19]:
print("rag4rag Dataset")
print(f"Dataset size: {rag_df.shape[0]}")
print(f"Total hallucinations detected: {rag_hallucinations_df.shape[0]} ({rag_hallucinations_df.shape[0]/rag_df.shape[0] * 100:.2f}%)")

rag4rag Dataset
Dataset size: 2152
Total hallucinations detected: 452 (21.00%)


In [20]:
print(f"Percent decrease in hallucinations: {((hallucinations_df.shape[0]/synthetic_distilabel_df.shape[0]) - (rag_hallucinations_df.shape[0]/rag_df.shape[0]))/(hallucinations_df.shape[0]/synthetic_distilabel_df.shape[0]) * 100:.2f}%")

Percent decrease in hallucinations: 51.31%
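
For readers who want to check the arithmetic, the snippet below recomputes the figures from the counts printed above. Note that the two rates use different denominators: the RAG pipeline was run only on the 2,152 rows previously flagged as hallucinations, so the rag4rag rate describes that subset rather than the full dataset.

```python
# Counts reported in the cells above
distilabel_rows, distilabel_flagged = 4989, 2152
rag_rows, rag_flagged = 2152, 452

# Hallucination rate of each approach over its own dataset
distilabel_rate = distilabel_flagged / distilabel_rows
rag_rate = rag_flagged / rag_rows

# Relative (percent) decrease between the two rates
relative_decrease = (distilabel_rate - rag_rate) / distilabel_rate

print(f"Zero-shot rate: {distilabel_rate:.2%}")
print(f"RAG rate: {rag_rate:.2%}")
print(f"Relative decrease: {relative_decrease:.2%}")
```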


## Conclusion

The results of this notebook suggest a few things:

1. Using all-purpose LLMs for zero-shot retrieval can be unreliable.
2. Using a zero-shot approach to generate synthetic QA may significantly taint retrieval datasets with hallucinations.
3. Using a basic RAG pipeline to generate synthetic answers significantly decreases hallucination rates.
4. Hallucinations are still prevalent even after using RAG.

Fine-tuning retrieval models on datasets with synthetically generated data should be further investigated. While the true scope of the problem is likely unknown, results from this notebook suggest it may be a serious problem. Furthermore, if models fine-tuned on tainted data are evaluated solely using LLMs, the problem may be even bigger.