Towards Reliable and Factual Response Generation: Detecting Unanswerable Questions in Information-seeking Conversations

Code style: black

This repository provides resources developed within the following article [PDF]:

W. Łajewska and K. Balog. Towards Reliable and Factual Response Generation: Detecting Unanswerable Questions in Information-seeking Conversations. In: Proceedings of the 46th European Conference on Information Retrieval (ECIR ’24). Glasgow, United Kingdom. March 2024. DOI: 10.1007/978-3-031-56063-7_25

Summary

Generative AI models face the challenge of hallucinations that can undermine users' trust in such systems. We propose to approach the problem of conversational information seeking as a two-step process, where relevant passages in a corpus are identified first and then summarized into a final system response. This way we can automatically assess if the answer to the user's question is present in the corpus. Specifically, our proposed method employs a sentence-level classifier to detect if the answer is present, then aggregates these predictions on the passage level, and eventually across the top-ranked passages to arrive at a final answerability estimate. For training and evaluation, we develop a dataset based on the TREC CAsT benchmark that includes answerability labels on the sentence, passage, and ranking levels. We demonstrate that our proposed method represents a strong baseline and outperforms a state-of-the-art LLM on the answerability prediction task.
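A minimal sketch of this two-step flow, assuming that retrieval, answerability estimation, and summarization are available as separate components; all names and the 0.5 threshold below are illustrative placeholders, not this repository's API:

```python
from typing import Callable, List

def answer_or_refuse(
    question: str,
    retrieve: Callable[[str], List[str]],               # step 1: top-n passage retrieval
    answerability: Callable[[str, List[str]], float],   # aggregated answerability score in [0, 1]
    summarize: Callable[[str, List[str]], str],         # step 2: response generation
    threshold: float = 0.5,                             # illustrative decision threshold
) -> str:
    """Generate a response only if the question appears answerable from the corpus."""
    passages = retrieve(question)
    if answerability(question, passages) < threshold:
        return "The corpus does not appear to contain an answer to this question."
    return summarize(question, passages)
```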

Data

The data used for training and evaluating the answer-in-the-sentence classifier, as well as for evaluating the aggregation of answerability scores on the passage and ranking levels, is covered in detail here.

Answerability Detection

The challenge of answerability in conversational information seeking arises from the fact that the answer is typically not confined to a single entity or text snippet, but rather spans across multiple sentences or even multiple paragraphs.

At the core of our approach is a sentence-level classifier (more details here) that can distinguish sentences that contribute to the answer from ones that do not. These sentence-level estimates are first aggregated on the passage level and then further on the ranking level (i.e., over the set of top-n passages) (more details here) to determine whether the question is answerable; see the sketch after the figure below.

(Figure: overview of the answerability detection pipeline with sentence-, passage-, and ranking-level aggregation.)
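The sketch below illustrates this aggregation scheme, assuming a fine-tuned binary sentence classifier served through Hugging Face `transformers`. The checkpoint path, the question/sentence pairing format, the `LABEL_1` label map, the use of NLTK for sentence splitting, and the 0.5 threshold are all assumptions for illustration, not the repository's exact setup:

```python
from statistics import mean
from typing import Callable, List

import nltk
from transformers import pipeline

# Hypothetical checkpoint path for the fine-tuned answer-in-the-sentence
# classifier; substitute the model produced by the training scripts in this repo.
classifier = pipeline("text-classification", model="path/to/answer-in-sentence-classifier")

def sentence_scores(question: str, passage: str) -> List[float]:
    """Probability that each sentence in the passage contains (part of) the answer."""
    # Requires the NLTK punkt tokenizer: nltk.download("punkt")
    sentences = nltk.sent_tokenize(passage)
    outputs = classifier([{"text": question, "text_pair": s} for s in sentences])
    # Assumes the positive ("answer present") class is "LABEL_1"; adjust to the actual label map.
    return [o["score"] if o["label"] == "LABEL_1" else 1.0 - o["score"] for o in outputs]

def passage_score(question: str, passage: str, agg: Callable = max) -> float:
    """Aggregate sentence-level scores into a passage-level answerability score."""
    return agg(sentence_scores(question, passage))

def ranking_score(question: str, passages: List[str],
                  passage_agg: Callable = max, ranking_agg: Callable = mean) -> float:
    """Aggregate passage-level scores over the set of top-n ranked passages."""
    return ranking_agg([passage_score(question, p, agg=passage_agg) for p in passages])

def is_answerable(question: str, passages: List[str], threshold: float = 0.5) -> bool:
    """Final answerability decision; the 0.5 threshold is an illustrative choice."""
    return ranking_score(question, passages) >= threshold
```

Max and mean are the two aggregation functions compared in the results below; they can be swapped independently at the passage and ranking levels.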

ChatGPT

For reference, we compare against a state-of-the-art large language model (LLM), using the most recent snapshot of GPT-3.5 (gpt-3.5-turbo-0301) via the ChatGPT API. More details about the setup and implementation can be found here. Data generated with ChatGPT are covered in detail here.
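A hedged sketch of how a zero-shot answerability prompt of this kind could be issued with the OpenAI Python client (v1 interface). The prompt wording, the yes/no parsing, and the client usage below are illustrative assumptions, not the exact prompts or scripts used in the experiments (see the linked details):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_in_sentence(question: str, sentence: str) -> bool:
    """Zero-shot prompt asking whether a sentence contains (part of) the answer."""
    prompt = (
        f"Question: {question}\n"
        f"Sentence: {sentence}\n"
        "Does the sentence contain at least part of the answer to the question? "
        "Answer with 'yes' or 'no' only."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0301",  # snapshot reported above; it may no longer be served
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```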

Results

Results for answerability detection at the sentence, passage, and ranking levels in terms of classification accuracy.

| Sentence classifier | Sentence accuracy | Passage aggregation | Passage accuracy | Ranking aggregation | Ranking accuracy |
|---|---|---|---|---|---|
| CAsT-answerability | 0.752 | Max | 0.634 | Max | 0.790 |
| | | | | Mean | 0.891 |
| | | Mean | 0.589 | Max | 0.332 |
| | | | | Mean | 0.829 |
| CAsT-answerability + SQuAD 2.0 | 0.779 | Max | 0.676 | Max | 0.810 |
| | | | | Mean | 0.848 |
| | | Mean | 0.639 | Max | 0.468 |
| | | | | Mean | 0.672 |
| ChatGPT (zero-shot) | 0.787 | T=0.33 | 0.839 | | |
| | | T=0.66 | 0.623 | | |
| ChatGPT (zero-shot) | | | | | 0.669 |
| ChatGPT (two-shot) | | | | | 0.601 |

Citation

If you use the resources presented in this repository, please cite:

@inproceedings{Lajewska:2024:ECIR,
	author = {{\L}ajewska, Weronika and Balog, Krisztian},
	title = {Towards Reliable and Factual Response Generation: Detecting Unanswerable Questions in Information-Seeking Conversations},
	year = {2024},
	doi = {10.1007/978-3-031-56063-7_25},
	url = {https://doi.org/10.1007/978-3-031-56063-7_25},
	booktitle = {Proceedings of the 46th European Conference on Information Retrieval},
	pages = {336--344},
	series = {ECIR '24}
}

Contact

Should you have any questions, please contact Weronika Łajewska at weronika.lajewska[AT]uis.no (with [AT] replaced by @).
