# Using LettuceDetect to detect hallucinations in a synthetic RAG dataset (created with Distilabel)

In this notebook, we accomplish the following:
* Load our dataset of synthetically generated RAG data from notebook #1
* Run LettuceDetect over the synthetically generated answers
* Save the resulting dataset with detected spans and confidence scores to the HuggingFace Hub

In [1]:
import pandas as pd

from datasets import Dataset, load_dataset

from lettucedetect.models.inference import HallucinationDetector

In [2]:
import warnings
import logging

# Suppress all Python warnings
warnings.filterwarnings("ignore")

# Suppress all logging below ERROR level for the root logger
logging.getLogger().setLevel(logging.ERROR)

In [None]:
# Load synthetic RAG dataset
ds = load_dataset("m-newhauser/rag-synthetic-distilabel")
ds

In [4]:
# Transform dataset to dataframe
df = ds["train"].to_pandas()

# Preview the dataset
df.head()

Unnamed: 0,context,anchor,human_positive,synthetic_positive,synthetic_negative
0,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,Saint Bernadette Soubirous,The Virgin Mary allegedly appeared to Bernadet...,The Virgin Mary appeared in the sky as the sun...
1,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,a copper statue of Christ,"In front of the Notre Dame Main Building, you'...",The main building's roof is painted in bright ...
2,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,the Main Building,The Basilica of the Sacred Heart at Notre Dame...,The basilica's heart-shaped design was inspire...
3,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,a Marian place of prayer and reflection,The Grotto at Notre Dame is a sacred replica o...,The grotto was filled with colorful lights and...
4,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,a golden statue of the Virgin Mary,The iconic Golden Dome sits on top of the Main...,The main course sits on top of the dining tabl...


## Use LettuceDetect to detect hallucinations in synthetic data

Next, we use `LettuceDetect` to compare the synthetically generated positive answers (`synthetic_positive`) with the `context`, which contains the real answer to the question stored in the `anchor` column.

`LettuceDetect` is an open-source hallucination detection framework designed specifically for RAG. Its models are built on ModernBERT and hosted on the HuggingFace Model Hub. Given a context, a question, and an answer, it identifies the spans of the LLM-generated answer that it judges to be hallucinated.

In [5]:
# Load the hallucination detector model
detector = HallucinationDetector(
    method="transformer", model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1"
)

*Note: The next cell can take 30+ minutes to execute.*

In [6]:
# Run over the RAG dataset
def predict_hallucinations(row):
    predictions = detector.predict(
        context=[row['context']],
        question=row['anchor'],
        answer=row['synthetic_positive'],
        output_format="spans"
    )
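    # `predictions` is expected to be a list of span dictionaries.
    # Only the first detected span (its text and confidence) is kept per row.
    if predictions:
        return predictions[0].get('text', ''), predictions[0].get('confidence', 0.0)
    # No span detected: return blank placeholders (the blank confidence is converted to NaN in the next cell)
    return '', ''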

# Apply the function to each row of the DataFrame
df[['hallucinated_span', 'confidence']] = df.apply(predict_hallucinations, axis=1, result_type='expand')
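
The function above keeps only the first span returned for each answer. If an answer contains several flagged spans, a small helper can collapse them into a single row instead. The sketch below is a minimal, detector-free illustration: `summarize_spans` is a hypothetical helper, the sample list is made up for demonstration, and it assumes each span is a dictionary with `text` and `confidence` keys, as the code above does.

```python
def summarize_spans(predictions):
    """Collapse a list of span dicts into (joined_text, max_confidence, n_spans)."""
    if not predictions:
        # No spans: empty text, missing confidence, zero spans
        return "", float("nan"), 0
    texts = [p.get("text", "") for p in predictions]
    confidences = [p.get("confidence", 0.0) for p in predictions]
    return " | ".join(texts), max(confidences), len(predictions)


# Hypothetical detector output, for illustration only
sample = [
    {"text": "in 1858", "confidence": 0.91},
    {"text": "to three children", "confidence": 0.64},
]

summary = summarize_spans(sample)
empty_summary = summarize_spans([])
```

Reporting the maximum confidence is one design choice among several. The mean, or the full list of scores, may suit other downstream uses better.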

In [7]:
# Replace blank strings with NaN
df['confidence'] = df['confidence'].replace('', pd.NA)

# Convert the column to numeric (float or int)
df['confidence'] = pd.to_numeric(df['confidence'], errors='raise')
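
Once `confidence` is numeric, a quick summary shows how many answers were flagged. The sketch below uses a small hand-made DataFrame with the same two columns rather than real detector output, so the values are illustrative only. The names `toy`, `flagged`, `flag_rate`, and `mean_conf` are introduced here for the example.

```python
import pandas as pd

# Toy stand-in for the real DataFrame (values are made up for illustration)
toy = pd.DataFrame(
    {
        "hallucinated_span": ["", "in 1858", ""],
        "confidence": [float("nan"), 0.9, float("nan")],
    }
)

# A row counts as flagged when a non-empty span was recorded
flagged = toy["hallucinated_span"].ne("")

# Share of answers with at least one flagged span
flag_rate = flagged.mean()

# Mean confidence over flagged rows only
mean_conf = toy.loc[flagged, "confidence"].mean()
```

The same three lines can be run against `df` in place of `toy` to summarize the real results.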

In [8]:
# Convert the DataFrame to a Dataset object
ds = Dataset.from_pandas(df)

# Upload dataset to HuggingFace Hub
ds.push_to_hub("m-newhauser/rag-synthetic-distilabel-hallucinations")