# **SelfCheckGPT**
The repository provides multiple approaches to evaluate text consistency:
- BERTScore: Measures similarity between sentences using BERT embeddings.
- Multiple-choice Question Answering and Generation (MQAG): Generates multiple-choice questions from each sentence and checks whether the answers stay consistent across sampled passages.
- N-gram Analysis: Scores each sentence by the probability of its tokens under an n-gram model.
- Natural Language Inference (NLI): Uses an NLI model to determine whether a sampled passage entails or contradicts a sentence.
- LLM Prompting: Employs LLMs to assess information consistency in a zero-shot setup.

In [1]:
!pip install selfcheckgpt



In [2]:
import torch
import spacy
from selfcheckgpt.modeling_selfcheck import SelfCheckMQAG, SelfCheckBERTScore, SelfCheckNgram

In [3]:
torch.manual_seed(28)

<torch._C.Generator at 0x7dc7db143110>

In [8]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


# SelfCheckGPT Usage: BERTScore, QA, n-gram

This package provides three variants of SelfCheck scores, as described in the paper: SelfCheckBERTScore(), SelfCheckMQAG(), and SelfCheckNgram(). Each variant has a predict() method that returns sentence-level scores with respect to the sampled passages. You can use a package such as spaCy to split a passage into sentences.

In [4]:
# Include necessary packages (torch, spacy, ...)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_mqag = SelfCheckMQAG(device=device) # set device to 'cuda' if GPU is available
selfcheck_bertscore = SelfCheckBERTScore(rescale_with_baseline=True)
selfcheck_ngram = SelfCheckNgram(n=1) # n=1 means Unigram, n=2 means Bigram, etc.



SelfCheck-MQAG initialized to device cuda
SelfCheck-BERTScore initialized
SelfCheck-1gram initialized


In [5]:
nlp = spacy.load("en_core_web_sm")

# LLM's text (e.g. a GPT-3 response) to be evaluated at the sentence level, split into sentences
passage = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Nation."
sentences = [sent.text.strip() for sent in nlp(passage).sents] # spacy sentence tokenization
print(sentences)

# Other samples generated by the same LLM to perform self-check for consistency
sample1 = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country."
sample2 = "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times."
sample3 = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."
#sample4 = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Nation."

# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-MQAG: Score for each sentence where value is in [0.0, 1.0] and high value means non-factual
# Additional params for each scoring_method:
# -> counting: AT (answerability threshold, i.e. questions with answerability_score < AT are rejected)
# -> bayes: AT, beta1, beta2
# -> bayes_with_alpha: beta1, beta2
sent_scores_mqag = selfcheck_mqag.predict(
    sentences = sentences,               # list of sentences
    passage = passage,                   # passage (before sentence-split)
    sampled_passages = [sample1, sample2, sample3], # list of sampled passages
    num_questions_per_sent = 5,          # number of questions to be drawn
    scoring_method = 'bayes_with_alpha', # options = 'counting', 'bayes', 'bayes_with_alpha'
    beta1 = 0.8, beta2 = 0.8,            # additional params depending on scoring_method
)
print(sent_scores_mqag)



['Michael Alan Weiner (born March 31, 1942) is an American radio host.', 'He is the host of The Savage Nation.']
[0.33732049 0.31906788]
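
To make the `counting` option in the comments above more concrete, here is a minimal sketch of one plausible reading of it: for each question, compare the answer chosen on the original passage with the answers chosen on each sample, skip pairs whose answerability falls below the threshold, and average the mismatch ratio over questions. The function `counting_score`, its input layout, the default threshold, and the 0.5 fallback are all hypothetical and are not part of the selfcheckgpt API; the package's own implementation may differ.

```python
def counting_score(ref_answers, sample_answers, answerability, at=0.5):
    """Toy 'counting' score for one sentence (illustrative, not the library code).

    ref_answers:    answer index chosen on the original passage, one per question
    sample_answers: sample_answers[q][s] is the answer index chosen on sample s
    answerability:  answerability[q][s] is the answerability of question q on sample s
    at:             answerability threshold (assumed default)
    """
    per_question = []
    for q, ref in enumerate(ref_answers):
        match = mismatch = 0
        for ans, score in zip(sample_answers[q], answerability[q]):
            if score < at:
                continue  # reject this (question, sample) pair
            if ans == ref:
                match += 1
            else:
                mismatch += 1
        if match + mismatch > 0:
            per_question.append(mismatch / (match + mismatch))
    # Assumption: fall back to a neutral 0.5 when every question was rejected.
    return sum(per_question) / len(per_question) if per_question else 0.5


# Two questions, three samples; one pair is rejected for low answerability.
score = counting_score(
    ref_answers=[0, 1],
    sample_answers=[[0, 1, 0], [1, 1, 2]],
    answerability=[[0.9, 0.9, 0.2], [0.9, 0.9, 0.9]],
)
print(score)
```

A higher value means the samples disagree with the original passage more often, which matches the "high value means non-factual" convention used above.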


In [6]:
# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-BERTScore: Score for each sentence where value is in [0.0, 1.0] and high value means non-factual
sent_scores_bertscore = selfcheck_bertscore.predict(
    sentences = sentences,                          # list of sentences
    sampled_passages = [sample1, sample2, sample3], # list of sampled passages
)
print(sent_scores_bertscore)

[0.05884961 0.53198766]
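
The aggregation behind a score like this can be sketched as follows: for each sample, take the best similarity between the sentence and any sentence of that sample, average over samples, and subtract from one. The sketch below uses a simple word-overlap similarity (`jaccard`) as a stand-in so that it runs without downloading a model; it is not BERTScore, and both function names are hypothetical rather than part of the package.

```python
def selfcheck_similarity_score(sentence, sampled_sentences, similarity):
    """Return 1 - mean over samples of the best sentence-level similarity.

    sampled_sentences: one list of sentences per sampled passage.
    similarity:        any function mapping two strings to a value in [0, 1].
    """
    best_per_sample = [
        max(similarity(sentence, s) for s in sample)
        for sample in sampled_sentences
    ]
    return 1.0 - sum(best_per_sample) / len(best_per_sample)


def jaccard(a, b):
    """Stand-in similarity: overlap of lowercase word sets (NOT BERTScore)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


samples = [
    ["He is the host of The Savage Country."],
    ["He works at The New York Times."],
]
score = selfcheck_similarity_score(
    "He is the host of The Savage Nation.", samples, jaccard
)
print(score)
```

Swapping `jaccard` for an embedding-based similarity changes the numbers but not the shape of the computation: a sentence that no sample supports ends up with a score close to one.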


In [7]:
# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-Ngram: Score at sentence- and document-level where value is in [0.0, +inf) and high value means non-factual
# as opposed to SelfCheck-MQAG and SelfCheck-BERTScore, SelfCheck-Ngram's score is not bounded
sent_scores_ngram = selfcheck_ngram.predict(
    sentences = sentences,
    passage = passage,
    sampled_passages = [sample1, sample2, sample3],
)
print(sent_scores_ngram)
# {'sent_level': { # sentence-level score similar to MQAG and BERTScore variant
#     'avg_neg_logprob': [3.184312, 3.279774],
#     'max_neg_logprob': [3.476098, 4.574710]
#     },
#  'doc_level': {  # document-level score such that avg_neg_logprob is computed over all tokens
#     'avg_neg_logprob': 3.218678904916201,
#     'avg_max_neg_logprob': 4.025404834169327
#     }
# }



{'sent_level': {'avg_neg_logprob': [3.184312427726157, 3.279774864365169], 'max_neg_logprob': [3.4760986898352733, 4.574710978503383]}, 'doc_level': {'avg_neg_logprob': 3.218678904916201, 'avg_max_neg_logprob': 4.025404834169328}}
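
As a rough illustration of where `avg_neg_logprob` and `max_neg_logprob` come from, the sketch below fits a unigram model with add-one smoothing and scores the tokens of one sentence. Whitespace tokenisation, the smoothing scheme, and the function name `unigram_scores` are assumptions made for this example; the package's tokenisation and smoothing may differ.

```python
import math
from collections import Counter


def unigram_scores(sentence_tokens, corpus_tokens):
    """Average and maximum negative log-probability under a smoothed unigram model."""
    counts = Counter(corpus_tokens)
    vocab = set(corpus_tokens) | set(sentence_tokens)
    total = len(corpus_tokens) + len(vocab)  # add-one smoothing denominator
    neg_logprobs = [-math.log((counts[t] + 1) / total) for t in sentence_tokens]
    return sum(neg_logprobs) / len(neg_logprobs), max(neg_logprobs)


# Tokens from the sampled passages form the corpus (toy data).
corpus = "a a b".split()
avg_nlp, max_nlp = unigram_scores(["a", "c"], corpus)
print(avg_nlp, max_nlp)
```

Tokens that rarely or never appear in the samples receive a low probability, so their negative log-probability is large. This is why the score is unbounded above.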


# SelfCheckGPT Usage: NLI (recommended)
The entailment (or contradiction) score of an NLI model, given a sentence and a sampled passage as input, can be used as the SelfCheck score. We use DeBERTa-v3-large fine-tuned on MultiNLI, normalize the probabilities over the "entailment" and "contradiction" classes, and take Prob(contradiction) as the score.
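
The normalization step can be sketched in a few lines. Given the entailment and contradiction logits for one (sentence, sample) pair, a softmax over just those two values gives Prob(contradiction). Averaging over samples is assumed here by analogy with the other variants. The function names and the toy logits are hypothetical and are not taken from the package.

```python
import math


def contradiction_prob(entail_logit, contra_logit):
    """Softmax over the entailment and contradiction logits; returns P(contradiction)."""
    m = max(entail_logit, contra_logit)  # subtract the max for numerical stability
    e = math.exp(entail_logit - m)
    c = math.exp(contra_logit - m)
    return c / (e + c)


def nli_sentence_score(logit_pairs):
    """Average P(contradiction) over samples (averaging is an assumption)."""
    probs = [contradiction_prob(e, c) for e, c in logit_pairs]
    return sum(probs) / len(probs)


# Toy logits for one sentence against three samples: (entailment, contradiction).
print(nli_sentence_score([(2.0, -2.0), (0.0, 0.0), (-1.5, 3.0)]))
```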

In [9]:
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_nli = SelfCheckNLI(device=device) # set device to 'cuda' if GPU is available

SelfCheck-NLI initialized to device cuda


In [10]:
sent_scores_nli = selfcheck_nli.predict(
    sentences = sentences,                          # list of sentences
    sampled_passages = [sample1, sample2, sample3], # list of sampled passages
)
print(sent_scores_nli)
# [0.334014 0.975106 ] -- based on the example above

[0.33401403 0.9751058 ]


# SelfCheckGPT Usage: LLM Prompt
This variant prompts an LLM (Llama 2, Mistral, or OpenAI's GPT) to assess information consistency in a zero-shot setup. We query the LLM to assess whether the i-th sentence is supported by a sample (used as the context). As with the other methods, a higher score indicates a higher chance of hallucination. The example below generates a response and samples with Phi-2 and uses Mistral as the checker.
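
To show how per-sample answers can become a sentence-level score, here is a minimal sketch. It assumes a Yes/No style prompt and a mapping of "Yes" to 0.0, "No" to 1.0, and anything else to 0.5, averaged over samples. The template text, the mapping values, and the function names are assumptions for illustration and may not match what the package sends to the model.

```python
# Assumed prompt template; the package's actual wording may differ.
PROMPT_TEMPLATE = (
    "Context: {context}\n\n"
    "Sentence: {sentence}\n\n"
    "Is the sentence supported by the context above? Answer Yes or No.\n\n"
    "Answer: "
)


def answer_to_score(answer):
    """Map a free-text answer to a score: yes -> 0.0, no -> 1.0, otherwise 0.5."""
    words = answer.strip().lower().split()
    first = words[0].strip(".,!:;") if words else ""
    if first == "yes":
        return 0.0
    if first == "no":
        return 1.0
    return 0.5  # unclear answer: treat as neutral


def prompt_sentence_score(answers):
    """Average the mapped scores over all sampled passages."""
    return sum(answer_to_score(a) for a in answers) / len(answers)


prompt = PROMPT_TEMPLATE.format(
    context="He works at The New York Times.",
    sentence="He is the host of The Savage Nation.",
)
print(prompt_sentence_score(["Yes", "No.", "maybe", " yes"]))
```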

In [12]:
# Option 1: open-source model
from transformers import pipeline
from selfcheckgpt.modeling_selfcheck import SelfCheckLLMPrompt

# Use the Phi-2 (2.7B) small language model to generate the response and the samples
pipe = pipeline("text-generation", model="microsoft/phi-2", device_map="auto")



In [15]:
prompt = "Give me the professional journey of Ashish Vaswani in detail.Answer:"

In [None]:
# As in the original paper, the main response is generated with greedy decoding
Response = pipe(prompt, do_sample=False, max_new_tokens=128, return_full_text=False)
Response
# took 4 minutes

In [None]:
# The samples are generated from the same prompt with sampling at temperature 1.0.
N = 20
Samples = pipe(
    [prompt] * N,
    temperature=1.0,
    do_sample=True,
    max_new_tokens=128,
    return_full_text=False,
)
print(Samples[0])

In [None]:
Response = Response[0]["generated_text"]
Samples = [sample[0]["generated_text"] for sample in Samples]

In [None]:
# Mistral 7B is now a gated repository, so a Hugging Face login is required to access it.
from huggingface_hub import login

HUGGINGFACE_TOKEN = "..."
login(token=HUGGINGFACE_TOKEN)


In [None]:
# Use Mistral 7B to detect whether the response generated by Phi-2 is hallucinated, using the LLM prompting technique.
llm_model = "mistralai/Mistral-7B-Instruct-v0.2"
selfcheck_prompt = SelfCheckLLMPrompt(llm_model, device)

# Option 2: API access (currently only supports client_type="openai")
# from selfcheckgpt.modeling_selfcheck_apiprompt import SelfCheckAPIPrompt
# selfcheck_prompt = SelfCheckAPIPrompt(client_type="openai", model="gpt-3.5-turbo")

In [None]:
import numpy as np

nlp = spacy.load("en_core_web_sm")
sentences = [
    sent.text.strip() for sent in nlp(Response).sents
]  # spacy sentence tokenization
print(sentences)

sent_scores_prompt = selfcheck_prompt.predict(
    sentences=sentences,  # list of sentences
    sampled_passages=Samples,  # list of sampled passages
    verbose=True,  # whether to show a progress bar
)

print(sent_scores_prompt)
print("Hallucination Score:", np.mean(sent_scores_prompt))
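
Beyond the passage-level mean, the per-sentence scores can be used to point at the sentences most likely to be hallucinated. The helper below is a small sketch of that idea; `flag_sentences` and the threshold of 0.5 are hypothetical choices, not something the package prescribes.

```python
def flag_sentences(sentences, scores, threshold=0.5):
    """Return (sentence, score) pairs at or above the threshold, highest score first."""
    ranked = sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)
    return [(s, float(x)) for s, x in ranked if x >= threshold]


# Toy data standing in for `sentences` and `sent_scores_prompt`.
flagged = flag_sentences(["a", "b", "c"], [0.2, 0.9, 0.5])
for sentence, value in flagged:
    print(f"{value:.2f}  {sentence}")
```

The threshold is best tuned on data with known labels, since the score ranges differ between the SelfCheck variants.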