Qiming Xie, Zengzhi Wang, Yi Feng, Rui Xia
Nanjing University of Science and Technology, China
📄 [Paper] 🖥️ [Homepage on PaperWithCode]
- Overview
- FOLLOW-UP QUESTIONING MECHANISM
- Evaluation
- Further Studies
- Mitigation Method Exploration
- Examples 🌰
- Any Question?
- Citation
❗️ With the emergence of generative conversational large language models (LLMs) like ChatGPT, serving as virtual assistants in various fields, the stability and reliability of their responses have become crucial. However, during usage, it has been observed that these models tend to waver in their judgements when confronted with follow-up questions from users expressing skepticism or disagreement. 🌰 Like these examples 🌰
🪛 In this work, we draw inspiration from questioning strategies in education and propose a FOLLOW-UP QUESTIONING MECHANISM along with two evaluation metrics to assess the judgement consistency of LLMs before and after exposure to disturbances. We evaluate the judgement consistency of ChatGPT, PaLM2-Bison, and Vicuna-13B under this mechanism across eight reasoning benchmarks. Empirical results show that even when the initial answers are correct, judgement consistency sharply decreases when LLMs face disturbances such as questioning, negation, or misleading.
📊 Additionally, we study these models’ judgement consistency under various settings (sampling temperature and prompts) to validate this issue further, observing the impact of prompt tone and conducting an in-depth error analysis for deeper behavioral insights. Furthermore, we also explore several prompting methods to mitigate this issue and demonstrate their effectiveness.
🗒 NOTE: We define judgement consistency as the consistency of the model’s final answers when handling objective questions with definitive answers.
To evaluate this consistency of large language models, we design a FOLLOW-UP QUESTIONING MECHANISM. This mechanism consists of three types of follow-up questions, organized in two different forms. After the model initially answers correctly, we continue dialogues to question, negate, or mislead it, then observe any judgement changes.
The prompts we used in the experiment. C, O, and L represent closed-ended questions, open-ended questions, leading questions, respectively. {M_A} denotes the misleading answers.
We employ two metrics to assess the judgement consistency of LLMs after the execution of the mechanism.
- Modification (M.) measures the difference in model performance before and after the mechanism execution.
- Modification Rate (M. Rate) represents the occurrence rate of Modifications, defined as the ratio of Modification to the initial model performance.
- Models
- ChatGPT (gpt-3.5-turbo-0301) with temperature at 0.5.
- PaLM2-Bison (chat-bison-001) with temperature at 0.4.
- Vicuna-13b (Vicuna-13B-v1.3) with temperature at 0.7.
- Benchmarks
- Arithmetic Reasoning
- GSM8K
- SVAMP
- MultiArith
- Commonsense Reasoning
- CSQA
- StrategyQA
- Symbolic Reasoning
- Last Letter Concatenation
- Coin Flip
- Knowledge Reasoning
- MMLU
- Arithmetic Reasoning
The results of ChatGPT in Direct Form.
The results of ChatGPT in Progressive Form.
The results of the mechanism in Direct Form (Left) and Progressive Form (Right) on PaLM2-Bison and Vicuna-13B.
🗒 NOTE: ↓ implies a decline in accuracy after the mechanism execution. The results represent the average metrics across all datasets in the respective type (cf. Benchmarks). Bold denotes the poorest judgement consistency.
Intuitively, the lower the sampling temperature, the more deterministic the generated outputs, whereas higher temperature lead to more diverse outputs. Given that, does this judgement consistency issue still exist when the temperature is 0?
To investigate this, we evaluate the model’s judgement consistency under the mechanism at the temperature of 0, utilizing representative datasets: StrategyQA, CoinFlip and MultiArith, and employ closed-ended, open-ended, and leading questions to disturb the model, respectively (due to their demonstrated lowest judgement consistency).
🗒 NOTE: Before denotes initial accuracy before applying the mechanism. Bold denotes the poorest judgement consistency.
Do the models waver in their judgements under other prompts as well? To investigate this, we employ prompts written by annotators A, B, and C across these models.
The impact of different prompts on Modification (Direct Form).
Considering the practical educational scenario, when students face questioning, denial, or misinformation, their judgements often experience a significant impact from the teacher’s tone intensity of speech. Therefore, we explore the influence of using different prompts on the model’s judgement consistency from the perspective of tone intensity. Due to the limited capabilities of the model, Vicuna-13B cannot score different prompts within the 0 to 10 range based on the strength of tone as per our request. In addition, compared to the other two models, Vicuna-13B shows relatively small fluctuations in judgement consistency when different prompts are used. Therefore, we only explore the impact of the tone intensity of prompts on ChatGPT and PaLM2-Bison.
Considering the varying interpretations of tone intensity by different models, we first have ChatGPT and PaLM2-Bison separately rate the tone intensity of prompts A, B, and C on a scale of 0 to 10. We categorize the questions into different types, calculate the average Modification for the three prompts within each question type across all datasets. The models’ tone intensity scores for the three prompts (cf. The Impact of Different Prompts) were taken as reference points.
Using ChatGPT’s judgement consistency as the reference, we analyze error examples in StrategyQA, CoinFlip, and MultiArith, employing closed-ended, open-ended and leading questions to mislead the model. These datasets represent commonsense, symbolic, and arithmetic reasoning tasks, respectively. Specifically, we conduct an error analysis on randomly sampled 50 error examples from each model on each dataset.
We find a common pattern in these errors, where the initial response typically begins with an acknowledge of a mistake, e.g., “I apologize for my mistake.”. Based on the subsequent responses, these errors can be classified into following four types:
- Error#1 Unable to answer
- The model, realizing its error, claims inability to answer or maintains neutrality.
- Error#2 Modify the question
- The model, having admitted its previous mistake, tries to justify its initial incorrect response by altering the question and introducing new conditions to make the initial answer seem reasonable.
- Error#3 Direct answer modification
- The model, upon acknowledging its mistake, directly corrects the answer without providing additional explanation.
- Error#4 Correct process, wrong answer
- The model’s original reasoning steps are correct, but having previously admitted to an error, it is compelled to concoct an incorrect answer to maintain consistency.
Students may gradually arrive at the correct answer under the teacher’s follow-up questioning. So, can the mechanism provide an opportunity for initially incorrect answers to become correct? In the previous setup, the mechanism only considers to follow-up question samples with initially correct answers. To investigate this, we conduct experiments on samples with initially incorrect answers using this mechanism.
Essentially, we believe that this issue originates from the misalignment between the model’s response generation process when facing disturbances and the thinking process of humans under similar disturbances. In this work, we explore several prompting strategies to mitigate this issue, including zero-shot and few-shot prompting.
- Zero-shot prompting
- Zero-shot-CoT: Let’s think step by step.
- EmotionPrompt: This is very important to my career.
- Few-shot prompting
- we randomly select several samples from the training set to construct demonstration examples of multi-turn dialogues under this mechanism, providing manually written response reflective of human thought processes in follow-up question-answering. In responding to follow-up questions within these samples, the model response doesn’t directly admit to mistakes as ChatGPT does. Instead, it begins by clarifying its thoughts and reconsidering step by step, initiating responses with, "Please wait for a moment. In order to answer your question, I need to take a moment to reconsider. I will now clear my mind of distractions and approach this step by step."
Here are examples of ChatGPT, Bard, Vicuna-13b, and some other Chinese large language models.
If you find this work helpful, please cite our paper as follows:
@article{xie2023ask,
title={Ask Again, Then Fail: Large Language Models' Vacillations in Judgement},
author={Xie, Qiming and Wang, Zengzhi and Feng, Yi and Xia, Rui},
eprint={2310.02174},
year={2023}
}
If you have any questions related to this work, you can open an issue with details or feel free to email Qiming(qmxie@njust.edu.cn
), Zengzhi(zzwang@njust.edu.cn
).