1. Use Embeddings to Check for Semantic Consistency
Convert both the original conversation and the summary into embeddings.
Calculate the similarity score between the summary and the original conversation.
If the similarity score is below a threshold, flag the summary as potentially inaccurate.

In [13]:
from langchain.embeddings import HuggingFaceEmbeddings
import numpy as np

# Use a local embedding model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Original conversation
conversation = "Doctor: Your test results are normal. Patient: Great, so no issues? Doctor: Correct, nothing concerning."
summary = "Doctor confirmed that the test results are fine."

# Compute embeddings
conversation_embedding = embeddings.embed_query(conversation)
summary_embedding = embeddings.embed_query(summary)

# Compute similarity (Cosine similarity)
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

similarity_score = cosine_similarity(conversation_embedding, summary_embedding)
print(f"Similarity Score: {similarity_score}")

# Set a threshold (e.g., 0.85)
if similarity_score < 0.85:
    print("Warning: Possible hallucination or missing details detected!")




Similarity Score: 0.7987412562837334
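
To make the score above easier to read, here is a minimal sketch of what cosine similarity measures, using small hand-picked vectors instead of real embeddings. The toy vectors are assumptions chosen for illustration and are not produced by the embedding model.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    v1 = np.asarray(vec1, dtype=float)
    v2 = np.asarray(vec2, dtype=float)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Toy vectors (illustrative only, not real embeddings)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated directions
partial = cosine_similarity([1.0, 0.0], [1.0, 1.0])         # 45 degrees apart

print(round(same_direction, 4), round(orthogonal, 4), round(partial, 4))
# prints: 1.0 0.0 0.7071
```

The score depends only on direction, not vector length, so a short summary and a long conversation can still score high if they point the same way in embedding space.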


2. Use Textual Comparison for Missing or Added Content
Run Named Entity Recognition (NER) to check if any key entities (e.g., names, dates, medications) are missing.
Use n-gram overlap to compare word sequences between the original conversation and the summary.
Use BLEU or ROUGE scores to measure how much of the original conversation is retained in the summary.
Example using ROUGE:

In [15]:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(conversation, summary)
print(f"ROUGE Score: {scores['rougeL'].fmeasure}")

if scores['rougeL'].fmeasure < 0.75:
    print("Warning: Possible missing or altered information in summary!")


ROUGE Score: 0.3478260869565218
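
The n-gram overlap idea listed above can be sketched without any extra library. The helper names `ngrams` and `ngram_precision` are hypothetical, and the simple regex tokenizer is an assumption; a real pipeline would likely use a proper tokenizer. The sketch reports precision (how much of the summary is found in the conversation), which is one reasonable choice for spotting added content.

```python
import re

def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_precision(source, summary, n=2):
    """Fraction of the summary's n-grams that also occur in the source."""
    summary_grams = ngrams(summary, n)
    if not summary_grams:
        return 0.0
    return len(summary_grams & ngrams(source, n)) / len(summary_grams)

conversation = "Doctor: Your test results are normal. Patient: Great, so no issues? Doctor: Correct, nothing concerning."
summary = "Doctor confirmed that the test results are fine."

unigram_precision = ngram_precision(conversation, summary, n=1)
bigram_precision = ngram_precision(conversation, summary, n=2)
print(f"Unigram precision: {unigram_precision:.2f}")  # prints: Unigram precision: 0.50
print(f"Bigram precision: {bigram_precision:.2f}")    # prints: Bigram precision: 0.29
```

Like ROUGE, this is a purely lexical measure: a paraphrase such as "fine" for "normal" counts as a mismatch even though the meaning is preserved.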


3. Retrieve Relevant Passages for Fact-Checking
If you have access to a retrieval-based system (e.g., RAG with LangChain), you can:

Retrieve key sections of the conversation.
Compare them against the LLM-generated summary.
Flag any inconsistencies.
Example using LangChain Retrieval:

In [17]:
from langchain.vectorstores import FAISS

# Store conversation in vector database
vector_db = FAISS.from_texts([conversation], embeddings)

# Retrieve top matches to validate the summary
retrieved_docs = vector_db.similarity_search(summary, k=1)
print("Most relevant retrieved text:", retrieved_docs[0].page_content)


Most relevant retrieved text: Doctor: Your test results are normal. Patient: Great, so no issues? Doctor: Correct, nothing concerning.
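
Because the whole conversation is stored as a single text, the search above can only return that one text. A possible refinement is to index each speaker turn separately so retrieval points at the specific turn that supports the summary. The sketch below illustrates this with a word-count vector as a hypothetical stand-in for real embeddings (so it runs without downloading a model); `bow_vector`, `bow_cosine`, and `top_k_turns` are illustrative names, not LangChain APIs.

```python
import math
import re
from collections import Counter

def bow_vector(text):
    """Hypothetical stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def bow_cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k_turns(turns, query, k=1):
    """Rank conversation turns by similarity to the query and return the best k."""
    q = bow_vector(query)
    ranked = sorted(turns, key=lambda t: bow_cosine(bow_vector(t), q), reverse=True)
    return ranked[:k]

# One entry per speaker turn instead of one entry for the whole conversation
turns = [
    "Doctor: Your test results are normal.",
    "Patient: Great, so no issues?",
    "Doctor: Correct, nothing concerning.",
]
summary = "Doctor confirmed that the test results are fine."

best_turn = top_k_turns(turns, summary, k=1)[0]
print("Best supporting turn:", best_turn)
# prints: Best supporting turn: Doctor: Your test results are normal.
```

In the notebook itself, the same idea would amount to passing the list of turns to `FAISS.from_texts` instead of a one-element list.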


4. Use Another LLM to Validate the Summary
Ask another LLM to verify whether the summary is faithful to the conversation.
Provide both the conversation and the summary and ask if any critical details are missing or altered.
Example:

In [19]:
from langchain.chat_models import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage

llm = ChatOpenAI(model_name="gpt-4")

messages = [
    SystemMessage(content="Check if the summary is accurate without adding or omitting details."),
    HumanMessage(content=f"Conversation: {conversation}\n\nSummary: {summary}\n\nIs the summary fully accurate?")
]

verification = llm(messages)
print(verification.content)


No, the summary is not fully accurate. 

Here's why:

* **Missing Context:**  The summary lacks the crucial detail that the doctor stated the results were "normal" and there were "no issues." This simple phrase adds important context to the conversation.
* **Incomplete Understanding:** While the summary conveys that the test results are "fine", it doesn't capture the full meaning of the doctor's response. The patient's question implies they want confirmation that everything is okay and not just a statement about the test results.

**A more accurate summary:**

> The doctor confirmed that the patient's test results were normal and there were no issues. 
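
A free-text answer like the one above is hard to act on in code. One option is to ask the model for a fixed verdict line and parse it. The sketch below is a minimal illustration under that assumption: the prompt wording, the `VERDICT:` format, and the helper names are all hypothetical, and a model may not always follow the requested format, which is why the parser can return `None`.

```python
def build_verification_prompt(conversation, summary):
    """Ask for a one-line verdict first, then a short justification."""
    return (
        "Compare the summary with the conversation.\n"
        "Reply with 'VERDICT: ACCURATE' or 'VERDICT: INACCURATE' on the first line, "
        "then list any added, omitted, or altered details.\n\n"
        f"Conversation: {conversation}\n\nSummary: {summary}"
    )

def parse_verdict(response_text):
    """Return True/False for a recognized verdict line, or None if the format was not followed."""
    lines = response_text.strip().splitlines()
    first_line = lines[0].strip().upper() if lines else ""
    if first_line.startswith("VERDICT: INACCURATE"):
        return False
    if first_line.startswith("VERDICT: ACCURATE"):
        return True
    return None

# Example with a hand-written response (not real model output)
example_response = "VERDICT: INACCURATE\nThe summary adds a medication that was never mentioned."
print(parse_verdict(example_response))  # prints: False
```

The prompt string would be sent as the `HumanMessage` content, and `parse_verdict` would be applied to `verification.content`; an unrecognized reply (`None`) can be routed to manual review.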



The following Python script combines several hallucination detection techniques to check whether an LLM-generated summary accurately represents the original conversation. It includes:

Embedding Similarity Check (using a local Hugging Face embedding model)
ROUGE Score for Content Overlap
Named Entity Recognition (NER) Check (using spaCy)
LLM-Based Fact Checking

Installation Requirements
Before running the script, ensure you have the required libraries installed:

pip install langchain openai faiss-cpu numpy rouge-score spacy sentence-transformers
python -m spacy download en_core_web_sm


# Complete Python Script

In [21]:
import numpy as np
import spacy
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage
from rouge_score import rouge_scorer

# Load SpaCy model for Named Entity Recognition (NER)
nlp = spacy.load("en_core_web_sm")

# Use a local embedding model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Define Original Conversation and LLM Summary
conversation = """
Doctor: Your test results are normal.
Patient: Great, so no issues?
Doctor: Correct, nothing concerning.
"""
summary = "Doctor confirmed that the test results are fine."

### **1. Embedding Similarity Check**
def compute_embedding_similarity(text1, text2):
    vec1 = embeddings.embed_query(text1)
    vec2 = embeddings.embed_query(text2)
    cosine_sim = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
    return cosine_sim

similarity_score = compute_embedding_similarity(conversation, summary)
print(f"\n🔍 Embedding Similarity Score: {similarity_score:.2f}")

if similarity_score < 0.85:
    print("⚠️ Warning: Possible hallucination detected (Low Similarity Score)!\n")

### **2. ROUGE Score for Content Overlap**
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_score = scorer.score(conversation, summary)["rougeL"].fmeasure
print(f"📏 ROUGE Score: {rouge_score:.2f}")

if rouge_score < 0.75:
    print("⚠️ Warning: Possible missing or altered information (Low ROUGE Score)!\n")

### **3. Named Entity Recognition (NER) Check**
def extract_named_entities(text):
    doc = nlp(text)
    return {ent.text for ent in doc.ents}

original_entities = extract_named_entities(conversation)
summary_entities = extract_named_entities(summary)

missing_entities = original_entities - summary_entities
added_entities = summary_entities - original_entities

if missing_entities:
    print(f"⚠️ Missing Entities in Summary: {missing_entities}")
if added_entities:
    print(f"⚠️ Added Entities in Summary: {added_entities}")

if not missing_entities and not added_entities:
    print("✅ No missing or added named entities detected.\n")

### **4. LLM-Based Fact Checking**
llm = ChatOpenAI(model_name="gpt-4")

messages = [
    SystemMessage(content="Check if the summary is accurate without adding, missing, or altering details."),
    HumanMessage(content=f"Conversation:\n{conversation}\n\nSummary:\n{summary}\n\nIs the summary fully accurate?")
]

verification = llm(messages)
print(f"\n🤖 LLM Verification Response:\n{verification.content}\n")



🔍 Embedding Similarity Score: 0.80

📏 ROUGE Score: 0.35

✅ No missing or added named entities detected.


🤖 LLM Verification Response:
Yes, the summary is fully accurate. It captures the essential information from the conversation:

* **The doctor confirmed the test results are fine.** This is directly stated in the conversation.
* **It avoids adding, missing, or altering details.**  The summary is a concise and accurate representation of the conversation. 


Let me know if you'd like me to analyze any other summaries! 😊 




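How This Works
Embedding Similarity: Compares the semantic similarity between the conversation and the summary.
ROUGE Score: Measures lexical overlap between the conversation and the summary (ROUGE-L, based on the longest common subsequence) to detect missing information.
Named Entity Check: Identifies whether any critical named entities (e.g., names, dates, medications) were added or removed.
LLM Verification: Asks GPT-4 whether the summary is fully accurate.

Interpreting the Results
If embedding similarity < 0.85, the summary may differ in meaning from the conversation. The 0.85 cutoff is only an example: the sample summary scores 0.80 and triggers the warning even though it reads as faithful, so calibrate the threshold on your own data.
If ROUGE score < 0.75, key information might be missing or altered. A short, paraphrased summary of a longer conversation scores low by construction (0.35 here), so treat this as a weak signal on its own.
If named entities were added or removed, it suggests potential hallucination or omission.
If the LLM detects changes, it provides a natural-language explanation of what's wrong. LLM judgments can vary between runs: the two verification outputs above reach opposite conclusions about the same summary.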
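
The interpretation rules above can be gathered into one helper that returns every warning that applies. This is a minimal sketch: `collect_warnings` is a hypothetical name, and the default thresholds are simply the illustrative 0.85 and 0.75 values used in this notebook, not recommended settings.

```python
def collect_warnings(similarity, rouge_l, missing_entities, added_entities,
                     similarity_threshold=0.85, rouge_threshold=0.75):
    """Return a list of human-readable warnings for the signals that fall outside the thresholds."""
    warnings = []
    if similarity < similarity_threshold:
        warnings.append(f"Low embedding similarity ({similarity:.2f} < {similarity_threshold})")
    if rouge_l < rouge_threshold:
        warnings.append(f"Low ROUGE-L score ({rouge_l:.2f} < {rouge_threshold})")
    if missing_entities:
        warnings.append(f"Entities missing from summary: {sorted(missing_entities)}")
    if added_entities:
        warnings.append(f"Entities added in summary: {sorted(added_entities)}")
    return warnings

# Values taken from the script output above: 0.80 similarity, 0.35 ROUGE, no entity differences
report = collect_warnings(0.80, 0.35, set(), set())
for line in report:
    print("Warning:", line)
# prints:
# Warning: Low embedding similarity (0.80 < 0.85)
# Warning: Low ROUGE-L score (0.35 < 0.75)
```

Returning a list rather than printing inside each check makes it easier to log the result or to require more than one signal before flagging a summary.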

Next Steps
A natural next step is to integrate these checks into a LangChain pipeline, for example with a retrieval-based approach (RAG) that fetches trusted documents for comparison.