## Inference to the Best Explanation (IBE) in Large Language Models (LLMs)

IBE-Eval estimates the plausibility of natural language explanations through a combination of explicit logical and linguistic features. It operates on top of natural language explanations generated by Large Language Models using a combination of hard and soft critique models as a proxy to assess consistency, parsimony, coherence, and uncertainty.

<img src="figures/ibe.png" height="400" class="center">

## IBE Evaluation Criteria

- *Consistency (Hard Critique).* Verifies whether the explanation is logically valid. Given a hypothesis composed of a premise $p_i$ and a conclusion $c_i$, and an explanation consisting of a set of statements $E = \{s_1, \dots, s_n\}$, we define $E$ to be logically consistent if $p_i \cup E \models c_i$. In other words, an explanation is logically consistent if it is possible to build a deductive proof linking the premise to the conclusion.

- *Parsimony (Soft Critique).* The parsimony principle, also known as Ockham’s razor, favors the simplest explanation, i.e., the one with the fewest elements and assumptions. We adopt two metrics as proxies for parsimony: proof depth and concept drift. Concept drift, denoted Drift, is the number of additional concepts and entities, beyond those appearing in the hypothesis (i.e., the premise and conclusion), that the LLM introduces to support the entailment. With $Noun_{p}$, $Noun_{c}$, and $Noun_{E}$ denoting the sets of nouns found in the premise, conclusion, and explanation, respectively:

\begin{equation}
\text{Drift}(h) = |Noun_{E} \setminus (Noun_{p} \cup Noun_{c})|
\end{equation}
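The drift formula above reduces to a set difference followed by a count. The following is a minimal sketch, assuming the nouns have already been extracted from each text; the function name `concept_drift` and the hand-picked noun sets are hypothetical and only illustrate the arithmetic, not how the nouns are obtained in practice.

```python
from typing import Set


def concept_drift(premise_nouns: Set[str], conclusion_nouns: Set[str], explanation_nouns: Set[str]) -> int:
    """Count the nouns in the explanation that appear in neither premise nor conclusion."""
    return len(explanation_nouns - (premise_nouns | conclusion_nouns))


# Hand-picked noun sets (an assumption; a real pipeline would extract them automatically)
premise_nouns = {"balloon"}
conclusion_nouns = {"balloon"}
explanation_nouns = {"air", "balloon", "pressure", "material"}

print(concept_drift(premise_nouns, conclusion_nouns, explanation_nouns))  # 3
```

Here the explanation introduces three concepts ("air", "pressure", "material") that are absent from the hypothesis, so the drift is 3. A lower value indicates a more parsimonious explanation.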

- *Coherence (Soft Critique).* Coherence measures the logical relations within individual explanatory statements and implications. An explanation can be formally consistent on the surface while still including implausible or ungrounded intermediate assumptions. Coherence evaluates the quality of each intermediate If-Then implication by measuring the entailment strength between its If and Then clauses. To this end, we employ a fine-tuned natural language inference (NLI) model. Let $S$ be a set of explanation steps, where each step $s_i$ is an If-Then statement, $s_i = (If_{s_i}, Then_{s_i})$. For a given step $s_i$, let $\text{ES}(s_i)$ denote the entailment score that the NLI model assigns between the $If_{s_i}$ and $Then_{s_i}$ clauses. The step-wise entailment score $\text{SWE}(S)$ is then the average of the entailment scores over all $|S|$ explanation steps.

\begin{equation}
\text{SWE}(S) = \frac{1}{|S|}\sum_{i=1}^{|S|} \text{ES}(s_i)
\end{equation}
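The averaging in the SWE formula can be sketched in a few lines. This is only an illustration under stated assumptions: `step_wise_entailment` and `toy_entailment_score` are hypothetical names, the scores in `toy_scores` are made up to stand in for the output of an NLI model, and returning NaN for an empty step list is a design choice of this sketch.

```python
import math
from statistics import fmean
from typing import Callable, Sequence, Tuple

Step = Tuple[str, str]  # (If clause, Then clause)


def step_wise_entailment(steps: Sequence[Step], entailment_score: Callable[[str, str], float]) -> float:
    """Average the entailment score ES(s_i) over all steps; NaN if there are no steps."""
    if not steps:
        return math.nan
    return fmean([entailment_score(if_clause, then_clause) for if_clause, then_clause in steps])


# Made-up scores standing in for the output of an NLI model (an assumption)
toy_scores = {
    ("air is blown into a balloon", "the balloon fills with air"): 0.9,
    ("a balloon fills with air", "the pressure inside increases"): 0.8,
    ("the pressure inside increases", "the balloon expands"): 0.7,
}


def toy_entailment_score(if_clause: str, then_clause: str) -> float:
    """Look up the made-up score for an (If, Then) pair."""
    return toy_scores[(if_clause, then_clause)]


steps = list(toy_scores)
print(round(step_wise_entailment(steps, toy_entailment_score), 3))  # 0.8
```

Passing the scorer as a callable keeps the averaging logic separate from whichever NLI model produces the per-step scores.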

- *Uncertainty (Soft Critique).* Finally, we consider the linguistic certainty expressed in the generated explanation as a proxy for plausibility. Hedging words such as *probably*, *might be*, and *could be* typically signal ambiguity and are often used when the truth condition of a statement is unknown or improbable. Pei and Jurgens (2021) found that the strength of scientific claims in research papers is strongly correlated with the use of direct language. In contrast, they found that hedging language suggested that the veracity of the claim was weaker or highly contextualized. To measure linguistic certainty, we use a fine-tuned sentence-level RoBERTa model.
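Returning to the consistency criterion listed first, the idea of "building a deductive proof linking premise and conclusion" can be pictured as forward chaining over If-Then rules. The sketch below rests on strong assumptions: the premise, conclusion, and explanation steps are hand-normalised into atomic strings, and the function name `is_logically_consistent` and all atom labels are hypothetical. A real hard critique would first need to formalise the natural language statements, which this sketch does not attempt.

```python
from typing import Iterable, Set, Tuple

Rule = Tuple[str, str]  # (antecedent, consequent), both hand-normalised atoms


def is_logically_consistent(premise: str, rules: Iterable[Rule], conclusion: str) -> bool:
    """Return True if the conclusion can be derived from the premise via the rules."""
    rule_list = list(rules)
    known: Set[str] = {premise}
    changed = True
    # Keep applying rules until no new fact can be derived
    while changed:
        changed = False
        for antecedent, consequent in rule_list:
            if antecedent in known and consequent not in known:
                known.add(consequent)
                changed = True
    return conclusion in known


# Hand-written rules mirroring a three-step and a one-step explanation (assumptions)
blow_rules = [
    ("blow_into_balloon", "balloon_fills_with_air"),
    ("balloon_fills_with_air", "pressure_increases"),
    ("pressure_increases", "balloon_expands"),
]
prick_rules = [("prick_balloon", "balloon_deflates")]

print(is_logically_consistent("blow_into_balloon", blow_rules, "balloon_expands"))  # True
print(is_logically_consistent("prick_balloon", prick_rules, "balloon_expands"))  # False
```

The first explanation chains from premise to conclusion, so it passes; the second never derives the conclusion, so it fails the hard critique.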

## Results

<div>
<img src="figures/ibe_results.png" height="265">
<img src="figures/ibe_results_1.png" height="265">
</div>

## Let's try with GPT-4o

Generate Explanations for the hypotheses

In [1]:
# Import the critique models
from critique import CoherenceCritique
from critique import ParsimonyCritique
from critique import UncertaintyCritique

from transformers.utils import logging
logging.set_verbosity(logging.CRITICAL)

# Import the generative GPT model
from generation.gpt import GPT
import yaml


# Initialise the generative model (i.e. GPT-4o)
with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)

# Read the API key stored under the model's entry in the config file
api_key = config.get('gpt-4o', {}).get('api_key')

llm = GPT('gpt-4o', api_key)


# First hypothesis (premise) and its conclusion
hypothesis_1 = "I blew into the balloon."
conclusion_1 = "The balloon expanded."

# Second hypothesis (premise), competing for the same conclusion
hypothesis_2 = "I pricked the balloon."
conclusion_2 = "The balloon expanded."

# Prompt the model to generate the explanation for the first hypothesis
explanation_1 = llm.generate(
    model_prompt_dir='ibe',
    prompt_name="generate_explanation_prompt.txt",
    hypothesis=hypothesis_1,
    conclusion=conclusion_1
)
print(f"\nExplanation 1:\n\nHypothesis {hypothesis_1}\nConclusion {conclusion_1}\n\n{explanation_1}")


# Prompt the model to generate the explanation for the second hypothesis
explanation_2 = llm.generate(
    model_prompt_dir='ibe',
    prompt_name="generate_explanation_prompt.txt",
    hypothesis=hypothesis_2,
    conclusion=conclusion_2
)
print(f"\nExplanation 2:\n\nHypothesis {hypothesis_2}\nConclusion {conclusion_2}\n\n{explanation_2}")


Explanation 1:

Hypothesis I blew into the balloon.
Conclusion The balloon expanded.

Step 1: IF someone blows air into a balloon, THEN the balloon will fill with air.
Assumption: Blowing air into a balloon introduces air into the balloon's interior.

Step 2: IF a balloon fills with air, THEN the pressure inside the balloon increases.
Assumption: Adding air to a confined space like a balloon increases the internal pressure.

Step 3: IF the pressure inside the balloon increases, THEN the balloon will expand.
Assumption: Balloons are made of elastic material that stretches when internal pressure increases.

Step 4: Therefore, since you blew into the balloon, air was introduced, increasing the internal pressure and causing the balloon to expand.

Explanation 2:

Hypothesis I pricked the balloon.
Conclusion The balloon expanded.

Step 1: IF a balloon is pricked, THEN it will not expand; instead, it will deflate.
Assumption: Pricking a balloon creates a hole, causing the air inside to escape,
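
The generated explanations follow a recognisable "Step N: IF ..., THEN ..." layout, so the If and Then clauses needed for step-wise scoring can be pulled out with a regular expression. The following is a rough sketch, assuming that layout holds for every step; `extract_if_then_steps` and `IF_THEN_PATTERN` are hypothetical names, and this is not necessarily how the `critique` package parses explanations.

```python
import re
from typing import List, Tuple

# Matches lines such as "Step 1: IF <antecedent>, THEN <consequent>."
IF_THEN_PATTERN = re.compile(r"\bIF\s+(?P<antecedent>.+?),\s*THEN\s+(?P<consequent>.+)$")


def extract_if_then_steps(explanation: str) -> List[Tuple[str, str]]:
    """Collect (If clause, Then clause) pairs from lines containing an IF ... THEN ... step."""
    steps = []
    for line in explanation.splitlines():
        match = IF_THEN_PATTERN.search(line.strip())
        if match:
            antecedent = match.group("antecedent").strip()
            consequent = match.group("consequent").strip().rstrip(".")
            steps.append((antecedent, consequent))
    return steps


sample_explanation = """Step 1: IF someone blows air into a balloon, THEN the balloon will fill with air.
Assumption: Blowing air into a balloon introduces air into the balloon's interior.

Step 2: IF a balloon fills with air, THEN the pressure inside the balloon increases.
Step 3: Therefore, the balloon expanded."""

for antecedent, consequent in extract_if_then_steps(sample_explanation):
    print(f"IF: {antecedent} | THEN: {consequent}")
```

Lines without an IF ... THEN ... pattern, such as the assumptions and the final "Therefore" step, are skipped, so only the implications themselves would be passed on for scoring.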

Evaluate explanations via soft critique models

In [2]:

# Initialise the soft critique models
coherence = CoherenceCritique()
parsimony = ParsimonyCritique()
uncertainty = UncertaintyCritique()

print("Soft Critique Evaluation")
# Calculate and display soft critique scores

# Coherence metrics (a higher score means a more coherent explanation)
exp1_coherence = coherence.critique(explanation=explanation_1)
exp2_coherence = coherence.critique(explanation=explanation_2)

print("\n================ Coherence ================\n")

print("Explanation 1: ", exp1_coherence)
print("Explanation 2: ", exp2_coherence)

print(f"Coherence comparison: Explanation 1: {exp1_coherence['coherence']} vs. Explanation 2: {exp2_coherence['coherence']}")

if exp1_coherence['coherence'] > exp2_coherence['coherence']:
    print("Explanation 1 is therefore more coherent than Explanation 2.")
elif exp1_coherence['coherence'] < exp2_coherence['coherence']:
    print("Explanation 2 is therefore more coherent than Explanation 1.")
else:
    # Scores are equal, or at least one is NaN (comparisons with NaN are always False)
    print("Neither explanation is more coherent: the scores are tied or undefined (NaN).")

# Parsimony metrics (a lower score means a more parsimonious explanation)
exp1_parsimony = parsimony.critique(hypothesis_1, conclusion_1, explanation_1)
exp2_parsimony = parsimony.critique(hypothesis_2, conclusion_2, explanation_2)

print("\n================ Parsimony ================\n")

print("Explanation 1: ", exp1_parsimony)
print("Explanation 2: ", exp2_parsimony)

print(f"\nParsimony comparison: Explanation 1: {exp1_parsimony['parsimony']} vs. Explanation 2: {exp2_parsimony['parsimony']}")

if exp1_parsimony['parsimony'] < exp2_parsimony['parsimony']:
    print("Explanation 1 is therefore more parsimonious than Explanation 2.")
elif exp1_parsimony['parsimony'] > exp2_parsimony['parsimony']:
    print("Explanation 2 is therefore more parsimonious than Explanation 1.")
else:
    # Scores are equal, or at least one is NaN
    print("Neither explanation is more parsimonious: the scores are tied or undefined (NaN).")

# Uncertainty metrics (a higher score means a more uncertain explanation)
exp1_uncertainty = uncertainty.critique(explanation=explanation_1)
exp2_uncertainty = uncertainty.critique(explanation=explanation_2)

print("\n================ Uncertainty ================\n")

print("Explanation 1: ", exp1_uncertainty)
print("Explanation 2: ", exp2_uncertainty)

print(f"\nUncertainty comparison: Explanation 1: {exp1_uncertainty['uncertainty']} vs. Explanation 2: {exp2_uncertainty['uncertainty']}")

if exp1_uncertainty['uncertainty'] > exp2_uncertainty['uncertainty']:
    print("Explanation 1 is therefore more uncertain than Explanation 2.")
elif exp1_uncertainty['uncertainty'] < exp2_uncertainty['uncertainty']:
    print("Explanation 2 is therefore more uncertain than Explanation 1.")
else:
    # Scores are equal, or at least one is NaN
    print("Neither explanation is more uncertain: the scores are tied or undefined (NaN).")


Soft Critique Evaluation


Explanation 1:  {'coherence': 0.7369064018130302}
Explanation 2:  {'coherence': 0.46557168662548065}
Coherence comparison: Explanation 1: 0.7369064018130302 vs. Explanation 2: 0.46557168662548065
Explanation 1 is therefore more coherent than Explanation 2.


Explanation 1:  {'parsimony': 3}
Explanation 2:  {'parsimony': 3}

Parsimony comparison: Explanation 1: 3 vs. Explanation 2: 3
Neither explanation is more parsimonious: the scores are tied or undefined (NaN).


Explanation 1:  {'uncertainty': 0.9982295831044515}
Explanation 2:  {'uncertainty': 1.0043059190114338}

Uncertainty comparison: Explanation 1: 0.9982295831044515 vs. Explanation 2: 1.0043059190114338
Explanation 2 is therefore more uncertain than Explanation 1.


## Let's try with GPT-3.5-Turbo

In [3]:
# Import the critique models
from critique import CoherenceCritique
from critique import ParsimonyCritique
from critique import UncertaintyCritique

from transformers.utils import logging
logging.set_verbosity(logging.CRITICAL)

# Import the generative GPT model
from generation.gpt import GPT
import yaml


# Initialise the generative model (i.e. GPT-3.5-Turbo)
with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)

# Read the API key stored under the model's entry in the config file
api_key = config.get('gpt-3.5-turbo', {}).get('api_key')

llm = GPT('gpt-3.5-turbo', api_key)


# First hypothesis (premise) and its conclusion
hypothesis_1 = "I blew into the balloon."
conclusion_1 = "The balloon expanded."

# Second hypothesis (premise), competing for the same conclusion
hypothesis_2 = "I pricked the balloon."
conclusion_2 = "The balloon expanded."

# Prompt the model to generate the explanation for the first hypothesis
explanation_1 = llm.generate(
    model_prompt_dir='ibe',
    prompt_name="generate_explanation_prompt.txt",
    hypothesis=hypothesis_1,
    conclusion=conclusion_1
)
print(f"\nExplanation 1:\n\nHypothesis {hypothesis_1}\nConclusion {conclusion_1}\n\n{explanation_1}")


# Prompt the model to generate the explanation for the second hypothesis
explanation_2 = llm.generate(
    model_prompt_dir='ibe',
    prompt_name="generate_explanation_prompt.txt",
    hypothesis=hypothesis_2,
    conclusion=conclusion_2
)
print(f"\nExplanation 2:\n\nHypothesis {hypothesis_2}\nConclusion {conclusion_2}\n\n{explanation_2}")


Explanation 1:

Hypothesis I blew into the balloon.
Conclusion The balloon expanded.

Step 1: IF air is blown into a balloon, THEN the balloon can expand.
Assumption: Blowing air into a balloon increases the air pressure inside, causing it to expand.

Step 2: Therefore, since you blew into the balloon, the balloon expanded as a result of the increased air pressure inside.

Explanation 2:

Hypothesis I pricked the balloon.
Conclusion The balloon expanded.

Step 1: IF a balloon is pricked, THEN the air inside the balloon can escape.
Assumption: When a balloon is pricked, it creates a hole through which the air inside can exit.

Step 2: IF the air inside a balloon escapes, THEN the pressure inside the balloon decreases.
Assumption: As the air escapes from the balloon, the pressure inside the balloon decreases.

Step 3: IF the pressure inside a balloon decreases, THEN the balloon can expand.
Assumption: A decrease in pressure inside a balloon can cause the balloon to expand as the external p

Evaluate explanations via soft critique models

In [4]:

# Initialise the soft critique models
coherence = CoherenceCritique()
parsimony = ParsimonyCritique()
uncertainty = UncertaintyCritique()

print("Soft Critique Evaluation")
# Calculate and display soft critique scores

# Coherence metrics (a higher score means a more coherent explanation)
exp1_coherence = coherence.critique(explanation=explanation_1)
exp2_coherence = coherence.critique(explanation=explanation_2)

print("\n================ Coherence ================\n")

print("Explanation 1: ", exp1_coherence)
print("Explanation 2: ", exp2_coherence)

print(f"Coherence comparison: Explanation 1: {exp1_coherence['coherence']} vs. Explanation 2: {exp2_coherence['coherence']}")

if exp1_coherence['coherence'] > exp2_coherence['coherence']:
    print("Explanation 1 is therefore more coherent than Explanation 2.")
elif exp1_coherence['coherence'] < exp2_coherence['coherence']:
    print("Explanation 2 is therefore more coherent than Explanation 1.")
else:
    # Scores are equal, or at least one is NaN (comparisons with NaN are always False)
    print("Neither explanation is more coherent: the scores are tied or undefined (NaN).")

# Parsimony metrics (a lower score means a more parsimonious explanation)
exp1_parsimony = parsimony.critique(hypothesis_1, conclusion_1, explanation_1)
exp2_parsimony = parsimony.critique(hypothesis_2, conclusion_2, explanation_2)

print("\n================ Parsimony ================\n")

print("Explanation 1: ", exp1_parsimony)
print("Explanation 2: ", exp2_parsimony)

print(f"\nParsimony comparison: Explanation 1: {exp1_parsimony['parsimony']} vs. Explanation 2: {exp2_parsimony['parsimony']}")

if exp1_parsimony['parsimony'] < exp2_parsimony['parsimony']:
    print("Explanation 1 is therefore more parsimonious than Explanation 2.")
elif exp1_parsimony['parsimony'] > exp2_parsimony['parsimony']:
    print("Explanation 2 is therefore more parsimonious than Explanation 1.")
else:
    # Scores are equal, or at least one is NaN
    print("Neither explanation is more parsimonious: the scores are tied or undefined (NaN).")

# Uncertainty metrics (a higher score means a more uncertain explanation)
exp1_uncertainty = uncertainty.critique(explanation=explanation_1)
exp2_uncertainty = uncertainty.critique(explanation=explanation_2)

print("\n================ Uncertainty ================\n")

print("Explanation 1: ", exp1_uncertainty)
print("Explanation 2: ", exp2_uncertainty)

print(f"\nUncertainty comparison: Explanation 1: {exp1_uncertainty['uncertainty']} vs. Explanation 2: {exp2_uncertainty['uncertainty']}")

if exp1_uncertainty['uncertainty'] > exp2_uncertainty['uncertainty']:
    print("Explanation 1 is therefore more uncertain than Explanation 2.")
elif exp1_uncertainty['uncertainty'] < exp2_uncertainty['uncertainty']:
    print("Explanation 2 is therefore more uncertain than Explanation 1.")
else:
    # Scores are equal, or at least one is NaN
    print("Neither explanation is more uncertain: the scores are tied or undefined (NaN).")

Soft Critique Evaluation


Explanation 1:  {'coherence': nan}
Explanation 2:  {'coherence': 0.39371824637055397}
Coherence comparison: Explanation 1: nan vs. Explanation 2: 0.39371824637055397
Neither explanation is more coherent: the scores are tied or undefined (NaN).


Explanation 1:  {'parsimony': 1}
Explanation 2:  {'parsimony': 3}

Parsimony comparison: Explanation 1: 1 vs. Explanation 2: 3
Explanation 1 is therefore more parsimonious than Explanation 2.


Explanation 1:  {'uncertainty': 1.010850429534912}
Explanation 2:  {'uncertainty': 1.0850576559702554}

Uncertainty comparison: Explanation 1: 1.010850429534912 vs. Explanation 2: 1.0850576559702554
Explanation 2 is therefore more uncertain than Explanation 1.


## Comparison 

The explanations generated by GPT-4o for this example have a better "separation" than the ones generated by GPT-3.5-Turbo.

GPT-4o:

- Coherence comparison: Explanation 1: 0.7090832414105535 vs. Explanation 2: 0.29190194606781006
- Parsimony comparison: Explanation 1: 1 vs. Explanation 2: 12
- Uncertainty comparison: Explanation 1: 0.999363899230957 vs. Explanation 2: 1.5216406186421714


GPT-3.5-Turbo:

- Coherence comparison: Explanation 1: 0.871249190531671 vs. Explanation 2: 0.06694453046657145
- Parsimony comparison: Explanation 1: 3 vs. Explanation 2: 6
- Uncertainty comparison: Explanation 1: 1.0230472882588706 vs. Explanation 2: 2.125597596168518