# Detecting Hallucinations in a Question-Answering System

In this notebook I use the 'DeepPavlov/rubert-base-cased' BERT model for sequence classification, pre-trained on a Russian language corpus, and fine-tune it on my dataset for the binary classification task of hallucination detection.

Hallucinations, in this context, refer to incorrect or misleading information produced by models when generating responses to given queries. 

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import torch
import numpy as np
import pandas as pd
from tqdm import tqdm

from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification
# AdamW is taken from torch.optim below, so only the scheduler is imported here.
from transformers import get_linear_schedule_with_warmup
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, precision_recall_fscore_support

## Load Dataset

The dataset is a Russian-language question-answering dataset, structured to support the task of detecting hallucinations in text. It consists of the following columns:

- summary: This column contains a summary or context related to the question and answer. It provides background information that may help in understanding whether the answer is relevant and correct.
- question: This column contains the question that is posed based on the summary.
- answer: This column contains the answer provided for the corresponding question.
- is_hallucination: This column is a binary label indicating whether the answer is a hallucination (1) or not (0). A hallucination in this context refers to an incorrect or misleading answer that does not correctly respond to the question based on the provided summary.

In [2]:
data = pd.read_csv('train.csv', index_col = 0 )
data.head()

| line_id | summary | question | answer | is_hallucination |
|---|---|---|---|---|
| 0 | Херманус Питер (Дик) Логгере (нидерл. Hermanus... | В каком городе проходил чемпионат мира по хокк... | В Хилверсюме. | 1 |
| 1 | Ходуткинские горячие источники (Худутские горя... | Как называется район в который входят источники? | Елизовским районом | 0 |
| 2 | Чёрная вдова (лат. Latrodectus mactans) — вид ... | Для кого опасны пауки-бокоходы? | Для рыб. | 1 |
| 3 | Рысь — река в России, протекает по территориям... | Какова длина реки Рысь? | 5 км. | 1 |
| 4 | И́се (яп. 伊勢市), ранее Удзиямада — город в Япон... | Что такое Исе? | Исе — это небольшой город в Японии, который не... | 1 |


## Data Inspection

Let's print several samples of hallucinated and non-hallucinated data to get a sense of the text content and format.

In [3]:
for i in range(5):
    line = data[data['is_hallucination']==1].iloc[i,:]
    print(f"Q {line['question']}")
    print(f"A {line['answer']}")
    print(f"S {line['summary']}")
    print()

Q В каком городе проходил чемпионат мира по хоккею с шайбой в 1936 году?
A В Хилверсюме.
S Херманус Питер (Дик) Логгере (нидерл. Hermanus Pieter (Dick) Loggere, 6 мая 1921, Амстердам, Нидерланды — 30 декабря 2014, Хилверсюм, Нидерланды) — нидерландский хоккеист (хоккей на траве), полузащитник. Бронзовый призёр летних Олимпийских игр 1948 года.

Q Для кого опасны пауки-бокоходы?
A Для рыб.
S Чёрная вдова (лат. Latrodectus mactans) — вид пауков, распространённый в Северной и Южной Америке. Опасен для человека.

Q Какова длина реки Рысь?
A 5 км.
S Рысь — река в России, протекает по территориям Муезерского городского поселения и Ледмозерского сельского поселения Муезерского района Карелии. Устье реки находится в 6,6 км по правому берегу реки Кайдодеги. Длина реки — 10 км. Рысь течёт преимущественно в северном направлении по заболоченной территории. Впадает в реку Кайдодеги возле озера Кайдодеги. Населённые пункты на реке отсутствуют.

Q Что такое Исе?
A Исе — это небольшой город в Японии, 

In [4]:
for i in range(5):
    line = data[data['is_hallucination']==0].iloc[i,:]
    print(f"Q {line['question']}")
    print(f"A {line['answer']}")
    print(f"S {line['summary']}")
    print()

Q Как называется район в который входят источники?
A Елизовским районом
S Ходуткинские горячие источники (Худутские горячие источники, Термальные источники вулкана Ходутка, Ходуткинское геотермальное месторождение) — пресные геотермальные источники на юге полуострова Камчатка в Елизовском районе Камчатского края. Относятся к Южно-Камчатской геотермальной провинции.

Q Как точно переводится название кабаре Мулен Руж?
A Красная мельница
S «Муле́н Ру́ж» (фр. Moulin Rouge, буквально «Красная мельница») — классическое кабаре в Париже, построенное в 1889 году, одна из достопримечательностей французской столицы. Расположено в 18 муниципальном округе, на бульваре Клиши, в квартале красных фонарей около площади Пигаль. Ближайшая станция метро — линия 2, станция «Бланш».

Q Как называлась кинолента, вышедшая на экраны в 1961 году?
A Дуэль
S «Дуэль» (в советском прокате шёл под названием «Комиссар полиции и Малыш», рум. Duelul) — первый по хронологии и предпоследний по времени съёмок фильм из цик

## Tokenization

For tokenization, I use `BertTokenizer`, which matches the BERT model pre-trained on a Russian-language corpus. The tokenization process is as follows:
- The summary, question, and answer fields are concatenated into one string so that the model sees all of the information at once. The special tokens [CLS] and [SEP] are added:
    - The [CLS] token is added at the beginning of the entire sequence.
    - The [SEP] token is added after each segment (summary, question, and answer) to separate them.
- `token_type_ids` mark which segment each token belongs to (here, summary and question versus answer).
  
I do not lowercase the text, because the `rubert-base-cased` checkpoint is cased, that is, trained on text with its original capitalization. I also keep special characters and punctuation, as they help the model to better understand the context.

The maximum input length for this BERT model is 512 tokens. There are several ways to handle longer texts:
- Text summarization to extract key points from a long text. 
- Splitting text into several parts within length limit, processing each part separately, and combining the results.
- Processing long text in overlapping windows to account for contextual transitions between parts of the text.
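
For illustration, the overlapping-window option can be sketched in plain Python. This is only a sketch of an alternative that this notebook does not use; the helper name `sliding_windows` and its `window` and `stride` parameters are hypothetical and do not come from any library.

```python
def sliding_windows(tokens, window=512, stride=256):
    """Split a token list into overlapping windows (hypothetical helper).

    Consecutive windows share `window - stride` tokens, so text near a
    window boundary also appears with context in the neighbouring window.
    """
    if stride <= 0 or stride > window:
        raise ValueError("stride must be in the range 1..window")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the text
        start += stride
    return windows


# Toy example: 10 "tokens", windows of 4 with a step of 2.
toy = list(range(10))
chunks = sliding_windows(toy, window=4, stride=2)
print(chunks)
```

Each window would then be classified separately and the per-window predictions combined, for example by averaging. How best to combine them is a design choice that is not explored here.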
   
I chose simple cropping to the maximum token length, as nearly all examples in the dataset fit within 512 tokens (the token counts computed below show a single example, at 823 tokens, above the limit). Each segment (summary, question, and answer) of the text is cropped proportionally, so that the relative proportions of each part are maintained.

In [5]:
tokenizer = BertTokenizer.from_pretrained('DeepPavlov/rubert-base-cased')
max_length = 512  

In [6]:
data['text'] = data.apply(lambda row: f"[CLS] {row['summary']} [SEP] {row['question']} [SEP] {row['answer']} [SEP]", axis=1)
token_len = [len(tokenizer.tokenize(t)) for t in data['text']]
token_len.sort(reverse=True)

In [7]:
token_len[:10]

[823, 436, 435, 425, 419, 417, 416, 412, 409, 408]

The dataset is randomly split into training and testing sets, with the test set size being 20%.

In [8]:
train_texts, test_texts, train_labels, test_labels = train_test_split(data['text'], data['is_hallucination'], test_size=0.2, random_state=42)

len(train_texts), len(test_texts), len(train_labels), len(test_labels)

(840, 210, 840, 210)

The custom function `tokenize_segments` tokenizes and encodes a single input text into the format required by BERT. It consists of the following steps:
- The input text is split back into its summary, question, and answer parts.
- Each part is tokenized separately using the BERT tokenizer.
- Length constraints are handled: if the combined number of tokens exceeds `max_length` minus the 4 special tokens, the tokens of each segment (summary, question, and answer) are truncated proportionally to fit within the limit.
- The tokens are combined again and the special tokens are added.
- The tokens are converted to IDs.
- Token type IDs are created to distinguish between segments. Since the pre-trained model's token-type embedding has only two entries, it cannot represent three separate segments, so the summary and question are combined into one segment. Here, 0 indicates the first segment (summary and question), and 1 indicates the second segment (answer).
- The attention mask is created to indicate which tokens should be attended to (1 for real tokens, 0 for padding).
- Sequences shorter than the maximum length are padded.
  
The function returns a dictionary containing the input_ids, attention_mask, and token_type_ids as tensors.
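
The cropping arithmetic and the segment layout described above can be checked on toy numbers. The sketch below works on token counts only, with no tokenizer involved; the helper names `crop_proportionally` and `segment_ids` are hypothetical and exist only for this illustration.

```python
def crop_proportionally(n_summary, n_question, n_answer, max_length=512):
    """Return cropped token counts per segment (hypothetical helper).

    4 positions are reserved for [CLS] and the three [SEP] tokens; the
    remaining budget is shared in proportion to each segment's length.
    """
    budget = max_length - 4
    total = n_summary + n_question + n_answer
    if total <= budget:
        return n_summary, n_question, n_answer
    n = int(budget * n_summary / total)
    m = int(budget * n_question / total)
    # The answer takes what is left after rounding down, but never more
    # tokens than it actually has.
    k = min(budget - n - m, n_answer)
    return n, m, k


def segment_ids(n_summary, n_question, n_answer, max_length=512):
    """Build token_type_ids and attention_mask for the cropped counts."""
    n, m, k = crop_proportionally(n_summary, n_question, n_answer, max_length)
    # [CLS] summary [SEP] question [SEP] -> segment 0; answer [SEP] -> segment 1
    token_type_ids = [0] * (n + 2) + [0] * (m + 1) + [1] * (k + 1)
    attention_mask = [1] * len(token_type_ids)
    padding = max_length - len(token_type_ids)
    token_type_ids += [0] * padding
    attention_mask += [0] * padding
    return token_type_ids, attention_mask


# Toy numbers: 20 + 6 + 4 tokens squeezed into a 16-token input.
print(crop_proportionally(20, 6, 4, max_length=16))  # (8, 2, 2)
types, mask = segment_ids(20, 6, 4, max_length=16)
print(types)
print(mask)
```

Because the summary and question shares are rounded down, the answer receives the remainder, so rounding never pushes the sequence past `max_length`.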
  
The function `encode_data` applies `tokenize_segments` to each text and stacks the results into tensors with one row per text. It returns a dictionary of these tensors, ready for input into a BERT model.

In [9]:
def tokenize_segments(text, tokenizer, max_length=512):
    
    text_parts = text.replace('[CLS]','').split('[SEP]')

    summary, question, answer = text_parts[0].strip(), text_parts[1].strip(), text_parts[2].strip()
    
    summary_tokens = tokenizer.tokenize(summary)
    question_tokens = tokenizer.tokenize(question)
    answer_tokens = tokenizer.tokenize(answer)
    
    # Reserve 4 positions for the special tokens: [CLS] and three [SEP].
    budget = max_length - 4
    total = len(summary_tokens) + len(question_tokens) + len(answer_tokens)
    if total > budget:
        # Crop each segment in proportion to its share of the total length.
        n = int(budget * len(summary_tokens) / total)
        m = int(budget * len(question_tokens) / total)
        k = budget - n - m  # the answer gets what is left after rounding down
        summary_tokens = summary_tokens[:n]
        question_tokens = question_tokens[:m]
        answer_tokens = answer_tokens[:k]
    
    tokens = ['[CLS]'] + summary_tokens + ['[SEP]'] + question_tokens + ['[SEP]'] + answer_tokens + ['[SEP]']
    
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    
    token_type_ids = [0] * (len(summary_tokens) + 2) + [0] * (len(question_tokens) + 1) + [1] * (len(answer_tokens) + 1)
    
    attention_mask = [1] * len(input_ids)
    
    padding_length = max_length - len(input_ids)
    if padding_length > 0:
        input_ids = input_ids + ([0] * padding_length)
        attention_mask = attention_mask + ([0] * padding_length)
        token_type_ids = token_type_ids + ([0] * padding_length)
    else:
        input_ids = input_ids[:max_length]
        attention_mask = attention_mask[:max_length]
        token_type_ids = token_type_ids[:max_length]
    
    return {
        'input_ids': torch.tensor(input_ids),
        'attention_mask': torch.tensor(attention_mask),
        'token_type_ids': torch.tensor(token_type_ids)
    }

def encode_data(texts, tokenizer):
    encodings = {
        'input_ids': [],
        'attention_mask': [],
        'token_type_ids': []
    }

    for text in texts:
        text_encoding = tokenize_segments(text, tokenizer)
        for key in encodings:
            encodings[key].append(text_encoding[key])

    for key in encodings:
        encodings[key] = torch.stack(encodings[key])
        
    return encodings

In [10]:
train_encodings = encode_data(train_texts, tokenizer)
test_encodings = encode_data(test_texts, tokenizer)

In [11]:
train_encodings['input_ids'].shape, train_encodings['attention_mask'].shape, train_encodings['token_type_ids'].shape

(torch.Size([840, 512]), torch.Size([840, 512]), torch.Size([840, 512]))

In [12]:
test_encodings['input_ids'].shape, test_encodings['attention_mask'].shape, test_encodings['token_type_ids'].shape

(torch.Size([210, 512]), torch.Size([210, 512]), torch.Size([210, 512]))

In [13]:
train_labels = train_labels.reset_index(drop=True)
test_labels = test_labels.reset_index(drop=True)

## Dataset and DataLoader

A custom Dataset class and DataLoader are defined for the training and testing data, facilitating batch processing.
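
As a rough picture of what batching does with the dataset sizes from this notebook (840 training and 210 test rows, batch size 8), here is a plain-Python stand-in. It is not how `DataLoader` is implemented; the generator `iter_batches` and its `seed` parameter are hypothetical.

```python
import random


def iter_batches(n_items, batch_size, shuffle=False, seed=0):
    """Yield lists of indices, one list per batch (plain-Python stand-in)."""
    indices = list(range(n_items))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, n_items, batch_size):
        yield indices[start:start + batch_size]


train_batches = list(iter_batches(840, 8, shuffle=True))
test_batches = list(iter_batches(210, 8))
print(len(train_batches), len(test_batches))  # 105 27
print(len(test_batches[-1]))                  # 2
```

The last test batch is smaller than the others because 210 is not a multiple of 8.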

In [14]:
class TextDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item


train_dataset = TextDataset(train_encodings, train_labels)
test_dataset = TextDataset(test_encodings, test_labels)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

## Train model

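I load the BERT model for sequence classification, pre-trained on a Russian-language corpus, and fine-tune it on my dataset for the binary classification task of hallucination detection.

I use the AdamW optimizer with weight decay for regularization and a scheduler that adjusts the learning rate during training, with a linear warm-up followed by a linear decay. Note that the run below sets `num_warmup_steps=500` while the total number of training steps is only 105 × 3 = 315, so the learning rate is still in the warm-up phase when training ends and the decay phase is never reached.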
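
The shape of a linear warm-up and decay schedule can be sketched in plain Python. The function below is a hand-written illustration that is assumed to follow the same shape as `get_linear_schedule_with_warmup`; the name `linear_schedule_multiplier` is hypothetical, and the warm-up of 30 steps is an arbitrary value chosen only for comparison.

```python
import math


def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear warm-up, then linear decay to zero.

    Hand-written for illustration; assumed to mirror the shape produced by
    transformers' get_linear_schedule_with_warmup.
    """
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Values taken from the notebook: 840 training rows, batch size 8, 3 epochs.
steps_per_epoch = math.ceil(840 / 8)
total_steps = steps_per_epoch * 3
base_lr = 2e-5

# 500 is the notebook's setting; 30 is a hypothetical shorter warm-up.
for warmup in (500, 30):
    lrs = [base_lr * linear_schedule_multiplier(s, warmup, total_steps)
           for s in range(total_steps)]
    print(f"warmup={warmup}: peak lr={max(lrs):.2e}, final lr={lrs[-1]:.2e}")
```

Comparing the two settings shows how the warm-up length changes both the peak and the final learning rate over the same number of training steps.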

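I train the model for only 3 epochs to limit overfitting. With `num_labels=2` and integer labels, `BertForSequenceClassification` computes the cross-entropy loss internally when `labels` are passed to the forward call, so no separate loss function is defined.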
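
For reference, the cross-entropy of a single example is the negative log of the softmax probability assigned to the true class. A minimal plain-Python sketch follows; the function name `cross_entropy` is hypothetical, and the model itself relies on PyTorch's implementation.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one example: -log softmax(logits)[label]."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[label]


# An undecided two-class classifier (equal logits) scores ln(2) on any label.
print(round(cross_entropy([0.0, 0.0], 1), 3))  # 0.693
# A confident correct prediction gives a small loss,
# a confident wrong prediction a large one.
print(round(cross_entropy([-2.0, 2.0], 1), 3))
print(round(cross_entropy([-2.0, 2.0], 0), 3))
```

The value ln(2) ≈ 0.69 is a useful baseline: the training loss of 0.73 reported for the first epoch below is close to it, which matches the near-chance accuracy of that epoch.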

During each epoch, the training loss, accuracy, and F1-score are computed and printed. The model is also evaluated on the test set after each epoch.

In [15]:
def train_model(model, data_loader, optimizer, device, scheduler):
    model.train()

    total_loss = 0
    predictions = []
    true_labels = []

    for batch in tqdm(data_loader, desc='Train'):
        optimizer.zero_grad()
        
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        labels = batch['labels'].to(device)
        
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()

        preds = torch.argmax(outputs.logits, dim=1).cpu().numpy()
        labels = labels.cpu().numpy()

        predictions.extend(preds)
        true_labels.extend(labels)

        loss.backward()
        optimizer.step()
        scheduler.step()
        
    total_loss /= len(data_loader)
    accuracy = accuracy_score(true_labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(true_labels, predictions, average='binary')
    
    return total_loss, accuracy, precision, recall, f1


@torch.inference_mode()
def eval_model(model, data_loader, device):
    model.eval()
    total_loss = 0
    predictions = []
    true_labels = []


    for batch in tqdm(data_loader, desc='Evaluation'):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        labels = batch['labels'].to(device)
        
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()
        
        preds = torch.argmax(outputs.logits, dim=1).cpu().numpy()
        labels = labels.cpu().numpy()

        predictions.extend(preds)
        true_labels.extend(labels)

    total_loss /= len(data_loader)
    accuracy = accuracy_score(true_labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(true_labels, predictions, average='binary')
    
    return total_loss, accuracy, precision, recall, f1

In [16]:
model = BertForSequenceClassification.from_pretrained('DeepPavlov/rubert-base-cased', num_labels=2)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

num_epochs = 3
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.02)
total_steps = len(train_loader) * num_epochs  
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=500, num_training_steps=total_steps)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at DeepPavlov/rubert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1

In [17]:
for epoch in range(num_epochs): 
    print(f'Epoch {epoch + 1}')
    train_loss, train_accuracy, __, __, train_f1 = train_model(model, train_loader, optimizer, device, scheduler)
    valid_loss, valid_accuracy, __, __, valid_f1 = eval_model(model, test_loader, device)
    print(f'Train loss: {train_loss:.2f} | Valid loss: {valid_loss:.2f} ')
    print(f'Train accuracy: {train_accuracy:.2f} | Valid accuracy: {valid_accuracy:.2f}')
    print(f'Train F1: {train_f1:.2f} | Valid F1: {valid_f1:.2f}')
    
model_save_path = './bert_crop.bin'
torch.save(model.state_dict(), model_save_path)

Epoch 1


Train: 100%|██████████| 105/105 [01:22<00:00,  1.27it/s]
Evaluation: 100%|██████████| 27/27 [00:06<00:00,  3.93it/s]


Train loss: 0.73 | Valid loss: 0.65 
Train accuracy: 0.49 | Valid accuracy: 0.61
Train F1: 0.28 | Valid F1: 0.38
Epoch 2


Train: 100%|██████████| 105/105 [01:22<00:00,  1.27it/s]
Evaluation: 100%|██████████| 27/27 [00:06<00:00,  3.92it/s]


Train loss: 0.45 | Valid loss: 0.38 
Train accuracy: 0.85 | Valid accuracy: 0.87
Train F1: 0.84 | Valid F1: 0.86
Epoch 3


Train: 100%|██████████| 105/105 [01:22<00:00,  1.27it/s]
Evaluation: 100%|██████████| 27/27 [00:06<00:00,  3.91it/s]


Train loss: 0.20 | Valid loss: 0.37 
Train accuracy: 0.94 | Valid accuracy: 0.91
Train F1: 0.94 | Valid F1: 0.91


In [18]:
from IPython.display import FileLink

FileLink('bert_crop.bin')

## Prediction and Evaluation

Finally, I reload the saved weights, make predictions on the training and test datasets, and print a classification report and confusion matrix for each to evaluate the model's performance.

In [19]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = BertForSequenceClassification.from_pretrained('DeepPavlov/rubert-base-cased', num_labels=2)
model.load_state_dict(torch.load('bert_crop.bin', map_location=torch.device('cpu')))
model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at DeepPavlov/rubert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


<All keys matched successfully>


In [20]:
@torch.inference_mode()
def pred(model, data_loader, device):
    model.eval()
    predictions = []
    true_labels = []


    for batch in data_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        labels = batch['labels'].to(device)
        
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    
        preds = torch.argmax(outputs.logits, dim=1).cpu().numpy()
        labels = labels.cpu().numpy()

        predictions.extend(preds)
        true_labels.extend(labels)

    return predictions, true_labels 


In [21]:
yhat_train, y_train = pred(model, train_loader, device)
yhat_test, y_test = pred(model, test_loader, device)

In [22]:
print(classification_report(y_train, yhat_train))
print(confusion_matrix(y_train, yhat_train))

              precision    recall  f1-score   support

           0       0.95      0.98      0.96       411
           1       0.98      0.95      0.96       429

    accuracy                           0.96       840
   macro avg       0.96      0.96      0.96       840
weighted avg       0.96      0.96      0.96       840

[[402   9]
 [ 21 408]]


In [23]:
print(classification_report(y_test, yhat_test))
print(confusion_matrix(y_test, yhat_test))

              precision    recall  f1-score   support

           0       0.89      0.93      0.91       107
           1       0.93      0.88      0.91       103

    accuracy                           0.91       210
   macro avg       0.91      0.91      0.91       210
weighted avg       0.91      0.91      0.91       210

[[100   7]
 [ 12  91]]


The model demonstrates strong performance, with precision, recall, and F1-scores of at least 0.88 for both classes on the training and test datasets.

The metrics on the test set are slightly lower than on the training set, which is expected for a small dataset. The F1-score for class 1 is 0.91 on the test set and 0.96 on the training set. This gap is small, suggesting that the model generalizes well.
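
The class 1 figures can be re-derived by hand from the confusion matrices printed above. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predicted labels); the helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(confusion):
    """Precision, recall, F1 (positive class = 1) and accuracy.

    `confusion` uses the scikit-learn layout: rows are true labels,
    columns are predicted labels, i.e. [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy


# Confusion matrices copied from the outputs above.
train_cm = [[402, 9], [21, 408]]
test_cm = [[100, 7], [12, 91]]

for name, cm in (("train", train_cm), ("test", test_cm)):
    p, r, f1, acc = binary_metrics(cm)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} accuracy={acc:.2f}")
```

On the test set, 12 hallucinated answers are missed and 7 correct answers are flagged, which is where the class 1 recall of 0.88 and precision of 0.93 come from.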