In [2]:
original_text = """ Hi, Mr. Jones. How are you?  I'm good, Dr. Smith. Nice to see you.  Nice to see you again. What brings you back?  Well, my back's been hurting again.  Oh, I see. I've seen you a number of times for this, haven't I?  Well, ever since I got hurt on the job three years ago, it's something that just keeps coming back.  It'll be fine for a while and then I'll bend down or I'll move in a weird way and then, boom, it'll just go out again.  Unfortunately, that can happen and I do have quite a few patients who get reoccurring episodes of back pain.  Have you been keeping up with the therapy that we had you on before?  Which, the pills?  Actually, I was talking about the physical therapy that we had you doing.  The pills are only meant for short term because they don't actually prevent the back pain from coming back.  Once my back started feeling better, I was happy not to go to the therapist anymore.  Why was that?  Well, it started to become kind of a hassle with my work schedule and the cost was an issue,  but I was able to get back to work and I could use the money.  Do you think the physical therapy was helping?  Yeah, well, it was slow going at first.  I see. Physical therapy is a bit slower than medications, but the point is to build up the core muscles in your back and your abdomen.  Physical therapy is also less invasive than medications, so that's why we had you doing the therapy.  But you mentioned that cost was getting to be a real issue for you. Can you tell me more about that?  Well, the insurance I had only covered a certain number of sessions  and then they moved my therapy office because they were trying to work out my schedule at work,  but that was really far away and then I had to deal with parking and it just started to get really expensive.  Got it. I understand.  So, for now, I'd like you to try using a heating pad for your back pain, so that should help in the short term.  
Our goal is to get your back pain under better control without creating additional problems for you like cost.  Let's talk about some different options and the pros and cons of each.  So, the physical therapy is actually really good for your back pain, but there are other things we can be doing to help.  Yes, I definitely don't need to lose any more time at work and just lie around the house all day.  Well, there are some alternative therapies like yoga or tai chi classes or meditation therapies that might be able to help.  And they might also be closer to you and be less expensive. Would that be something you'd be interested in?  Sure, that'd be great.  Good. Let's talk about some of the other costs of your care.  In the past, we had you on some tramadol because the physical therapy alone wasn't working.  Yeah, that medicine was working really well, but again, the cost of it got really expensive.  Yeah, yeah. So, that is something in the future we could order something like a generic medication.  And then there are also resources for people to look up the cheapest cost of their medications.  But for now, I'd like to stick with the non-prescription medications.  And if we can have you go to yoga or tai chi classes, like I mentioned, that could alleviate the need for ordering prescriptions.  Okay, yeah, that sounds good.  Okay, great, great. Are there any other costs that are a problem for you in your care?  Well, my insurance isn't going down, but that seems to be the case for everybody that I talk to.  But I should be able to make it work.  And fortunately, that is an issue for a lot of people.  But I would encourage you during open season to look at your different insurance options to see which plan is more cost effective for you.  Okay. Yeah, that sounds great.  Great, great.  Well, I appreciate you talking to me today.  Yeah, I'm glad you were able to come in.  
What I'll do is I'll have my office team research the different things that you and I talked about today.  And then let's set a time early next week, say Tuesday, where we can talk over the phone about what we were able to come up with for you and see if those would work for you.  Okay, great.  Great. """

In [3]:
summary = """## SOAP Note

**Subjective:** 
Mr. Jones reports his back pain has been recurring since an injury at work three years ago. He describes it as intermittent, with episodes triggered by bending or unusual movements.  He previously found relief through physical therapy but stopped due to cost and scheduling issues. He acknowledges the effectiveness of tramadol in managing his pain but notes its high cost. Mr. Jones is seeking alternative therapies like yoga or tai chi classes to alleviate his pain and reduce reliance on medications. 

**Objective:**
None.

**Assessment:**  
Mr. Jones presents with chronic back pain likely related to a previous work injury. He reports intermittent episodes of pain triggered by movement, suggesting potential muscle strain or nerve irritation. His history of seeking physical therapy for this condition highlights the need for ongoing management strategies. 

**Plan:**
1. **Non-prescription medication:**  Recommend continued use of non-prescription medications as a short-term solution while exploring cost-effective options.
2. **Alternative therapies:** Encourage Mr. Jones to explore yoga, tai chi, or meditation classes as potential long-term pain management strategies. 
3. **Insurance review:** Discuss the importance of reviewing insurance plans during open enrollment for more cost-effective coverage options.
4. **Phone consultation:** Schedule a follow-up phone call next week to discuss available resources and treatment options in detail.  


**Note:** This SOAP note is based solely on the provided conversation and does not include any medical data, vital signs, or lab values. 
"""

# NLP

## Simple tokenization

In [3]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

def identify_potential_hallucinations(original_text, summary):
    """
    Identifies words that appear in the summary but not in the original text,
    which could potentially indicate hallucinations.
    
    Args:
        original_text (str): The original document text
        summary (str): The summary to check for hallucinations
        
    Returns:
        list: Words that appear in the summary but not in the original text
    """
    # Download required NLTK resources (uncomment if not already downloaded)
    # nltk.download('punkt')
    # nltk.download('stopwords')
    # nltk.download('wordnet')
    
    # Initialize lemmatizer and get English stopwords
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    
    # Function to preprocess text
    def preprocess(text):
        # Tokenize
        tokens = word_tokenize(text.lower())
        
        # Remove punctuation and stopwords, and lemmatize
        processed_tokens = []
        for token in tokens:
            if token not in string.punctuation and token not in stop_words:
                # Lemmatize to get base form of words
                lemma = lemmatizer.lemmatize(token)
                processed_tokens.append(lemma)
                
        return processed_tokens
    
    # Process both texts
    original_tokens = preprocess(original_text)
    summary_tokens = preprocess(summary)
    
    # Find unique tokens in each
    original_set = set(original_tokens)
    summary_set = set(summary_tokens)
    
    # Find tokens in summary that aren't in original
    potential_hallucinations = summary_set - original_set
    
    # Count how often each flagged token appears in the summary
    summary_freq = {}
    for token in summary_tokens:
        if token in potential_hallucinations:
            summary_freq[token] = summary_freq.get(token, 0) + 1
    
    # Sort by frequency
    sorted_hallucinations = sorted(
        [(word, freq) for word, freq in summary_freq.items()],
        key=lambda x: x[1],
        reverse=True
    )
    
    return sorted_hallucinations

In [4]:
new_words_in_summary = identify_potential_hallucinations(original_text, summary)
print("Potential hallucinated terms (with frequency):")
for word, freq in new_words_in_summary:
    print(f"- {word}: {freq}")

Potential hallucinated terms (with frequency):
- note: 4
- soap: 2
- report: 2
- injury: 2
- intermittent: 2
- triggered: 2
- movement: 2
- seeking: 2
- potential: 2
- management: 2
- strategy: 2
- cost-effective: 2
- discus: 2
- subjective: 1
- recurring: 1
- describes: 1
- bending: 1
- unusual: 1
- previously: 1
- found: 1
- relief: 1
- stopped: 1
- due: 1
- scheduling: 1
- acknowledges: 1
- effectiveness: 1
- managing: 1
- high: 1
- reduce: 1
- reliance: 1
- objective: 1
- none: 1
- assessment: 1
- present: 1
- chronic: 1
- likely: 1
- related: 1
- previous: 1
- suggesting: 1
- strain: 1
- nerve: 1
- irritation: 1
- history: 1
- condition: 1
- highlight: 1
- ongoing: 1
- 1: 1
- recommend: 1
- continued: 1
- short-term: 1
- solution: 1
- exploring: 1
- 2: 1
- explore: 1
- long-term: 1
- 3: 1
- review: 1
- importance: 1
- reviewing: 1
- enrollment: 1
- coverage: 1
- 4: 1
- consultation: 1
- follow-up: 1
- call: 1
- available: 1
- treatment: 1
- detail: 1
- based: 1
- solely: 1
- provi
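
Much of the list above is noise rather than hallucination: list numbers (`1` to `4`), SOAP template words (`soap`, `subjective`), and hyphenated compounds such as `cost-effective` that the transcript spells as two words. Below is a minimal sketch of a post-filter. The hand-picked `TEMPLATE_WORDS` set and the helper name `filter_flagged_terms` are assumptions for illustration and are not part of NLTK.

```python
import re

# Assumption: words that belong to the SOAP template itself, chosen by hand
TEMPLATE_WORDS = {"soap", "note", "subjective", "objective", "assessment", "plan"}

def filter_flagged_terms(flagged, source_tokens, template_words=TEMPLATE_WORDS):
    """Drop flagged (word, freq) pairs that are formatting noise.

    flagged: list of (word, freq) pairs, as returned by identify_potential_hallucinations
    source_tokens: set of preprocessed tokens from the original text
    """
    kept = []
    for word, freq in flagged:
        if word.isdigit():              # list numbering such as "1", "2"
            continue
        if word in template_words:      # section headings of the note
            continue
        parts = [p for p in re.split(r"[-/]", word) if p]
        # "cost-effective" counts as supported if "cost" and "effective" both occur in the source
        if len(parts) > 1 and all(p in source_tokens for p in parts):
            continue
        kept.append((word, freq))
    return kept

# Toy inputs, made up for illustration
flagged = [("note", 4), ("injury", 2), ("cost-effective", 2), ("1", 1), ("nerve", 1)]
source_tokens = {"cost", "effective", "back", "pain", "hurt"}
print(filter_flagged_terms(flagged, source_tokens))
```

What survives the filter (here `injury` and `nerve`) still needs a human look: a word can be absent from the transcript and still be a fair paraphrase.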

## Keyword extraction

In [5]:
from rake_nltk import Rake

rake = Rake()
rake.extract_keywords_from_text(original_text)
original_keywords = rake.get_ranked_phrases()

rake.extract_keywords_from_text(summary)
summary_keywords = rake.get_ranked_phrases()


extra_keywords = set(summary_keywords) - set(original_keywords)
if extra_keywords:
    print(f"{extra_keywords=}")
# Print the extra phrases one per line, shortest first
extra_keywords = sorted(extra_keywords, key=len)
for w in extra_keywords:
    print(w)
    

extra_keywords={'ongoing management strategies', 'discuss available resources', 'intermittent', 'history', '4', 'open enrollment', 'provided conversation', '** alternative therapies :** encourage mr', 'short', 'potential long', 'work three years ago', 'stopped due', 'explore yoga', 'scheduling issues', 'prescription medication :** recommend continued use', 'reports intermittent episodes', 'detail', '** objective :** none', 'based solely', 'jones reports', 'condition highlights', '3', 'high cost', 'previous work injury', 'meditation classes', 'lab values', 'reduce reliance', 'acknowledges', 'vital signs', 'seeking alternative therapies like yoga', 'previously found relief', 'reviewing insurance plans', 'seeking physical therapy', 'effectiveness', 'effective coverage options', 'treatment options', 'jones presents', 'alleviate', '** plan :** 1', '** assessment :** mr', '## soap note ** subjective :** mr', '2', 'effective options', 'bending', 'unusual movements', 'exploring cost', 'term pa

## NER (Named Entity Recognition) 

In [6]:
import spacy

# Load SpaCy model
nlp = spacy.load("en_core_web_md")

# Process the original text and summary
original_doc = nlp(original_text)
summary_doc = nlp(summary)

# Extract entities
original_entities = set((ent.text, ent.label_) for ent in original_doc.ents)
summary_entities = set((ent.text, ent.label_) for ent in summary_doc.ents)

# Find new entities in the summary
new_entities = summary_entities - original_entities
if new_entities:
    print("New entities in summary:", new_entities)
else:
    print("No new entities detected.")



New entities in summary: {('3', 'CARDINAL'), ('next week', 'DATE'), ('4', 'CARDINAL'), ('tai chi', 'PERSON'), ('1', 'CARDINAL'), ('2', 'CARDINAL')}
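
Most of these hits are not hallucinations: the `CARDINAL` entities are the numbered Plan items, and both `next week` and `tai chi` occur verbatim in the transcript but were presumably tagged with a different span or label there. Below is a minimal sketch of a stricter filter that skips numeric labels and falls back to a case-insensitive substring check against the source. The `IGNORED_LABELS` set and the helper name `unsupported_entities` are assumptions for illustration.

```python
# Assumption: numeric labels are treated as formatting noise for this check
IGNORED_LABELS = {"CARDINAL", "ORDINAL"}

def unsupported_entities(new_entities, source_text, ignored_labels=IGNORED_LABELS):
    """Keep only (text, label) pairs whose text does not occur in the source."""
    source_lower = source_text.lower()
    remaining = []
    for text, label in sorted(new_entities):
        if label in ignored_labels:
            continue
        if text.lower() in source_lower:   # entity text is present in the source
            continue
        remaining.append((text, label))
    return remaining

# Toy inputs; ("ibuprofen", "PRODUCT") is a made-up entity used only for illustration
new_entities = {("3", "CARDINAL"), ("next week", "DATE"), ("tai chi", "PERSON"),
                ("ibuprofen", "PRODUCT")}
source = "Let's set a time early next week. There are yoga or tai chi classes."
print(unsupported_entities(new_entities, source))
```

In the notebook, `new_entities` and `original_text` from the cell above could be passed in directly.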


## NER (Named Entity Recognition) with medical model

In [3]:
# Environment: transformers 4.18.0, spaCy 3.4.4
# import medspacy
import spacy

# Load the Med7 clinical NER model (transformer variant)
nlp = spacy.load("en_core_med7_trf")

# Process the text
doc = nlp(original_text)

# Extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)



tramadol DRUG


In [4]:
# import medspacy
import spacy

# Load the Med7 clinical NER model (transformer variant)
nlp = spacy.load("en_core_med7_trf")

original_doc = nlp(original_text)
summary_doc = nlp(summary)

original_entities = set((ent.text, ent.label_) for ent in original_doc.ents)
summary_entities = set((ent.text, ent.label_) for ent in summary_doc.ents)

new_entities = summary_entities - original_entities
if new_entities:
    print("New entities in summary:", new_entities)




In [5]:
new_texts = set(ent.text for ent in summary_doc.ents) - set(ent.text for ent in original_doc.ents)
print(new_texts)

set()


# Metrics

## BERTScore

In [6]:
import bert_score


# Compute BERTScore
P, R, F1 = bert_score.score([summary], [original_text], lang="en", model_type="microsoft/deberta-xlarge-mnli")

print("Precision:", P.item())
print("Recall:", R.item())
print("F1-score:", F1.item())



Precision: 0.5352656841278076
Recall: 0.5219074487686157
F1-score: 0.5285022258758545


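🔹 How to Interpret BERTScore:

F1-score close to 1.0 → High token-level semantic overlap with the source. ✅
F1-score < 0.7 → Low overlap; possible hallucination, worth a manual check. ⚠️

These cutoffs are rough rules of thumb, not calibrated thresholds. BERTScore measures similarity rather than factual consistency, so a structured SOAP note scored against a conversational transcript can score low (here F1 ≈ 0.53) without containing false statements.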
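
To make the three numbers less opaque, the sketch below reproduces BERTScore's greedy matching on toy vectors: precision averages, over summary tokens, the best cosine similarity to any source token; recall does the same from the source side; F1 is their harmonic mean. The 2-D vectors and the function name `greedy_bertscore` are illustrative assumptions. The real library uses contextual embeddings plus optional IDF weighting and baseline rescaling, all omitted here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_bertscore(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall and F1 over token vectors."""
    # Precision: each candidate token is matched to its most similar reference token
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy token vectors, made up for illustration
summary_vecs = [(1.0, 0.0), (0.0, 1.0)]
source_vecs = [(1.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
p, r, f1 = greedy_bertscore(summary_vecs, source_vecs)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

In the toy case every summary token has an exact match in the source, so precision is 1, while recall is lower because one source token has no close counterpart in the summary.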

## BLEURT


In [7]:
# pip install git+https://github.com/google-research/bleurt.git

# # Download the BLEURT-20 checkpoint.
# wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip
# unzip BLEURT-20.zip
from bleurt import score

# Load BLEURT model (requires download)
bleurt_scorer = score.BleurtScorer("BLEURT-20")  # Use a pre-trained checkpoint

# Compute BLEURT score
bleurt_score = bleurt_scorer.score(references=[original_text], candidates=[summary])
print("BLEURT Score:", bleurt_score[0])



BLEURT Score: 0.42667120695114136


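🔹 How to Interpret BLEURT Scores:

BLEURT score > 0.8 → The summary is semantically close to the reference ✅
BLEURT score < 0.5 → The summary may contain hallucinations ⚠️
BLEURT score < 0.2 → The summary likely introduces false information 🚨

These cutoffs are rough rules of thumb, not calibrated thresholds. BLEURT was trained to rate a candidate against a reference of similar length, and the BLEURT-20 checkpoint is configured with a maximum sequence length of 512, so a transcript this long is likely truncated before scoring.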

# NLI (Natural Language Inference) Verification

In [8]:
from transformers import pipeline

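# MNLI model: the transcript is the premise, the summary is the hypothesis
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

# Premise and hypothesis are joined with a literal "[SEP]"; recent transformers
# versions also accept the more explicit {"text": ..., "text_pair": ...} input
result = nli(f"{original_text} [SEP] {summary}")
print(result)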


[{'label': 'ENTAILMENT', 'score': 0.6042574048042297}]
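
A single entailment score for the whole note hides which statement is unsupported. One option, sketched below, is to break the markdown summary into plain claim sentences so that each can serve as its own NLI hypothesis. The regexes and the helper name `summary_to_claims` are assumptions tailored to the SOAP note format above, not a general markdown parser; the sentence splitter is deliberately naive and only protects the abbreviations `Mr.` and `Dr.`.

```python
import re

# Split after ., ! or ? followed by whitespace, but not after "Mr." or "Dr."
SENTENCE_BOUNDARY = re.compile(r"(?<!Mr\.)(?<!Dr\.)(?<=[.!?])\s+")

def summary_to_claims(summary_md):
    """Strip markdown scaffolding and return a list of claim sentences."""
    claims = []
    for line in summary_md.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):         # skip blank lines and headings
            continue
        line = re.sub(r"^\d+\.\s*", "", line)         # drop list numbering such as "1. "
        line = re.sub(r"\*\*[^*]*:\*\*", "", line)    # drop bold labels such as "**Plan:**"
        line = line.replace("**", "").strip()
        for sentence in SENTENCE_BOUNDARY.split(line):
            sentence = sentence.strip()
            if len(sentence.split()) >= 3:            # ignore fragments such as "None."
                claims.append(sentence)
    return claims

example = """## SOAP Note

**Subjective:**
Mr. Jones reports recurring back pain. He stopped therapy due to cost.

**Objective:**
None.

**Plan:**
1. **Phone consultation:** Schedule a follow-up phone call next week.
"""
for claim in summary_to_claims(example):
    print(claim)
```

Each returned claim could then be scored against the transcript as premise, which makes it possible to report the lowest-scoring claims instead of one aggregate label.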


# Cosine Similarity

In [4]:
# Note: `from transformers import is_torch_tpu_available` is deprecated and was removed in 4.41.0
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import nltk

# Load sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

def chunk_and_compare(source_text, summary_text):
    # Split source into sentences
    sentences = nltk.sent_tokenize(source_text)
    
    # Encode summary
    summary_embedding = model.encode([summary_text])
    
    # Compute similarity with each sentence
    sentence_embeddings = model.encode(sentences)
    similarities = cosine_similarity(sentence_embeddings, summary_embedding)

    # Take the highest similarity score across all source sentences
    max_similarity = similarities.max()
    return max_similarity

similarity = chunk_and_compare(original_text, summary)
print("Max Sentence-Level Similarity:", similarity)

if similarity < 0.7:
    print("⚠️ Possible hallucination detected!")


Max Sentence-Level Similarity: 0.57772017
⚠️ Possible hallucination detected!


In [5]:
import numpy as np

def avg_top_k_similarity(source_text, summary_text, k=3):
    sentences = nltk.sent_tokenize(source_text)
    summary_embedding = model.encode([summary_text])
    sentence_embeddings = model.encode(sentences)
    similarities = cosine_similarity(sentence_embeddings, summary_embedding).flatten()

    # Take average of top-k most similar sentences
    top_k_similarities = np.sort(similarities)[-k:]  # Top k values
    avg_similarity = np.mean(top_k_similarities)
    
    return avg_similarity

similarity = avg_top_k_similarity(original_text, summary, k=3)
print("Avg Top-3 Sentence Similarity:", similarity)


Avg Top-3 Sentence Similarity: 0.54556143
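
Both scores above compare the entire summary against single source sentences, so one number has to stand in for the whole note. A complementary view, sketched below, scores each summary sentence by its best-matching source sentence and reports the weakest one. To keep the example runnable without a model download it uses a toy bag-of-words vector (`bow_vector`, an assumption for illustration); in the notebook, `model.encode` and `cosine_similarity` would take the place of `bow_vector` and `cosine`.

```python
import math
import re
from collections import Counter

def bow_vector(sentence):
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

def cosine(u, v):
    """Cosine similarity between two word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def least_supported(source_sentences, summary_sentences):
    """Return (sentence, score) for the summary sentence with the weakest best match."""
    source_vecs = [bow_vector(s) for s in source_sentences]
    scored = []
    for sentence in summary_sentences:
        vec = bow_vector(sentence)
        best = max(cosine(vec, sv) for sv in source_vecs)   # best-matching source sentence
        scored.append((sentence, best))
    return min(scored, key=lambda pair: pair[1])

# Toy sentences, loosely modelled on the transcript and the summary
source_sentences = [
    "My back has been hurting again.",
    "I stopped going to physical therapy because of the cost.",
]
summary_sentences = [
    "The patient stopped physical therapy because of cost.",
    "The pain suggests nerve irritation.",
]
weakest = least_supported(source_sentences, summary_sentences)
print(weakest)
```

Reporting the weakest sentence points a reviewer at a concrete claim, which a single maximum or top-k average cannot do.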


# FactCC

In [9]:
# https://arxiv.org/abs/1910.12840
from transformers import BertForSequenceClassification, BertTokenizer
import torch

model_path = 'manueldeprada/FactCC'

tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)



# input_dict = tokenizer(original_text, summary, max_length=512, padding='max_length', truncation='only_first', return_tensors='pt')
# logits = model(**input_dict).logits
# pred = logits.argmax(dim=1)
# model.config.id2label[pred.item()]

def process_long_text(original_text, summary, model, tokenizer, chunk_size=512):
    # Split the original text into fixed-size chunks (chunk_size counts characters, not tokens)
    original_chunks = [original_text[i:i+chunk_size] for i in range(0, len(original_text), chunk_size)]

    results = []
    for chunk in original_chunks:
        inputs = tokenizer(chunk, summary, max_length=512, padding='max_length', truncation=True, return_tensors='pt')
        with torch.no_grad():  # inference only, no gradients needed
            logits = model(**inputs).logits
        pred = logits.argmax(dim=1)
        label = model.config.id2label[pred.item()]
        print(label)
        results.append(label)

    # One label per chunk, in source order
    return results


chunk_labels = process_long_text(original_text, summary, model, tokenizer)

CORRECT
CORRECT
CORRECT
CORRECT
CORRECT
INCORRECT
INCORRECT
INCORRECT
CORRECT


# DnDScore

https://arxiv.org/html/2412.13175v1

https://www.youtube.com/watch?v=ry3R7k6x1Pg
