# Self-consistency and multiple paths of reasoning
In this notebook, we explore a strategy for improving the reliability and quality of responses from LLMs: generating multiple reasoning paths and using self-consistency as a way to evaluate and refine the final answer.

While large language models like GPT are powerful, they may not always give the correct or most consistent answer on the first try—especially for tasks that involve complex reasoning, multiple steps, or some ambiguity. One way to make their responses more dependable is to ask them to solve the same problem in multiple ways. By generating several independent reasoning paths and comparing them, we can better identify patterns, avoid mistakes, and surface the most consistent and trustworthy answer.

This technique has several advantages. It helps reduce variability and randomness in outputs, exposes possible reasoning flaws, and makes the model’s thought process more transparent. When combined with a method for aggregating results and evaluating their internal consistency, it becomes a practical way to boost both the accuracy and interpretability of model responses.

This technique is especially helpful for quantitative problem solving, conceptual explanations, and open-ended reasoning tasks.

In [1]:
import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from dotenv import load_dotenv
import random
from collections import Counter

# Load environment variables
load_dotenv()

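# Set up OpenAI API key. Values in os.environ must be strings, so assigning
# a missing key (None) would raise a TypeError; fail early with a clear message.
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your environment or .env file.")
os.environ["OPENAI_API_KEY"] = api_key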

### Initialize the language model
We instantiate a lightweight GPT model from OpenAI using LangChain.

In [2]:
# Initialize the language model
llm = ChatOpenAI(model="gpt-4o-mini-2024-07-18")

## Generating multiple reasoning paths
LLMs are inherently probabilistic: each response is sampled from a distribution over possible continuations, so even for the same prompt the model may produce different outputs depending on the sampling temperature and other decoding settings.
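
To make the role of temperature concrete, here is a minimal sketch of temperature-scaled sampling over a toy set of three candidate continuations. The logits and the helper names (`sample_with_temperature`, `outcome_counts`) are illustrative assumptions, not part of the OpenAI or LangChain APIs. Real models sample over a vocabulary of many thousands of tokens, but the effect is similar in spirit.

```python
import math
import random
from collections import Counter

def sample_with_temperature(logits, temperature, rng):
    """Sample one index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def outcome_counts(logits, temperature, n_samples=200, seed=0):
    """Count how often each candidate is chosen over repeated sampling."""
    rng = random.Random(seed)
    return Counter(
        sample_with_temperature(logits, temperature, rng) for _ in range(n_samples)
    )

# Hypothetical scores for three candidate continuations
toy_logits = [2.0, 1.0, 0.5]
for t in (0.05, 1.0, 2.0):
    print(f"temperature={t}: {dict(outcome_counts(toy_logits, t))}")
```

At a very low temperature almost every sample lands on the highest-scoring candidate, while higher temperatures spread the samples across all candidates. That spread is the variability that multi-path prompting tries to exploit.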

When solving a problem, there is often more than one way to approach it, just as humans might use different mental models or strategies depending on their background and perspective. This variability can be turned into a strength. We can encourage LLMs to do the same by prompting them to solve the same problem multiple times, each time nudging them toward a distinct "path" of reasoning. This helps us:
- Surface diverse reasoning strategies (e.g., symbolic vs. conceptual thinking)
- Catch inconsistencies or incorrect logic in individual answers
- Aggregate multiple paths into a more reliable solution
- Apply ensemble-like thinking to model reasoning (akin to bagging or voting ensembles in ML)

We will now define a function that generates several different reasoning paths for the same problem, encouraging the model to "think differently" each time.

In [3]:
def generate_multiple_paths(problem, num_paths=3):
    """
    Generate multiple reasoning paths for a given problem.

    Args:
    problem (str): The problem statement.
    num_paths (int): Number of reasoning paths to generate.

    Returns:
    list: A list of generated reasoning paths.
    """
    # We define a prompt template that instructs the model to use a unique approach.
    # The `path_number` is used to encourage diversity in reasoning.
    prompt_template = PromptTemplate(
        input_variables=["problem", "path_number"],
        template="""Solve the following problem using a unique approach. This is reasoning path {path_number}.
        Problem: {problem}
        Reasoning path {path_number}:"""
    )

    paths = []

    # Loop through and generate multiple reasoning paths using the template
    for i in range(num_paths):
        chain = prompt_template | llm # Create a new prompt-chain for each path
        # Invoke the model to get a reasoning output for this path
        response = chain.invoke({"problem": problem, "path_number": i+1}).content
        # Collect the model's response
        paths.append(response)

    return paths

# Define the physics problem
problem = "A ball is thrown upwards with an initial velocity of 20 m/s. How high will it go?"
# Generate three reasoning paths
paths = generate_multiple_paths(problem, num_paths=3)

# Display each generated reasoning path
for i, path in enumerate(paths, 1):
    print(f"Path {i}:\n{path}\n")

Path 1:
To solve the problem of how high a ball thrown upwards with an initial velocity of 20 m/s will go, we can use the principles of physics, specifically the equations of motion under uniform acceleration due to gravity.

1. **Identify the Variables**: 
   - Initial velocity (u) = 20 m/s (upwards)
   - Final velocity (v) at the highest point = 0 m/s (the ball stops rising momentarily at its peak)
   - Acceleration (a) = -9.81 m/s² (the negative sign indicates that gravity is acting downwards).

2. **Use the Kinematic Equation**: 
   We can use the kinematic equation that relates initial velocity, final velocity, acceleration, and displacement (height in this case):
   \[
   v^2 = u^2 + 2a s
   \]
   where \(s\) is the displacement (maximum height).

3. **Plug in the Known Values**: 
   Substituting the known values into the equation:
   \[
   0^2 = (20)^2 + 2(-9.81)s
   \]
   This simplifies to:
   \[
   0 = 400 - 19.62s
   \]

4. **Solve for s**: 
   Rearranging the equation to so

1. Prompt variation with structure: The prompt is identical across calls except for the path label (`path_number`), so each reasoning path gets a custom prompt containing the problem and a unique label (`Reasoning path 1`, `Reasoning path 2`, etc.). Together with the instruction to use "a unique approach", this label nudges the model to approach the problem differently without breaking prompt consistency.
2. Independent invocation per path: Each reasoning path is generated by a separate LangChain call, so the model samples a fresh solution from its output distribution each time. This mimics ensemble reasoning.
3. Randomness without chaos: Since we keep the problem fixed and vary only the framing, we encourage diversity in the responses without losing coherence.
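
To see exactly what the model receives for each path, the following sketch renders the prompts with plain Python string formatting. It is a stand-in for the `PromptTemplate` used in `generate_multiple_paths`, not a replacement: `TEMPLATE` and `render_prompts` are hypothetical names introduced here, and `textwrap.dedent` is used only to strip the indentation that a triple-quoted template would otherwise carry.

```python
from textwrap import dedent

# Same wording as the notebook's template, without the leading indentation
TEMPLATE = dedent("""\
    Solve the following problem using a unique approach. This is reasoning path {path_number}.
    Problem: {problem}
    Reasoning path {path_number}:""")

def render_prompts(problem, num_paths=3):
    """Return the prompt text for each reasoning path (labels start at 1)."""
    return [
        TEMPLATE.format(problem=problem, path_number=i)
        for i in range(1, num_paths + 1)
    ]

problem = "A ball is thrown upwards with an initial velocity of 20 m/s. How high will it go?"
for prompt in render_prompts(problem):
    print(prompt, end="\n\n")
```

The rendered prompts differ only in the path number, so most of the diversity between paths comes from sampling and from the "unique approach" instruction.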

By generating several different solutions, we open up the possibility of better judgment and cross-checking — especially in situations where we want the most robust, logically sound answer possible. Next, we will look at how to make sense of these responses collectively.


## Aggregating results from multiple reasoning paths
Once we have collected multiple reasoning paths for a given problem, the next step is to synthesize them into a single, consistent final answer. This aggregation process helps mitigate errors or hallucinations that might appear in any single reasoning attempt. Hallucinations refer to outputs generated by a language model that sound plausible but are factually incorrect, logically invalid, or entirely made up.

We aggregate because:
- Some reasoning paths may be flawed, inconsistent, or make incorrect assumptions.
- Others may converge on a similar conclusion but phrase it differently.
- Aggregating them allows the model to "look at its own work" and resolve ambiguity.
- It is similar to ensemble methods in ML, where multiple weak learners combine into a stronger predictor.

This technique essentially turns the LLM into its own evaluator, allowing it to review its earlier thoughts and pick the most coherent answer.

In [4]:
def aggregate_results(paths):
    """
    Aggregate results from multiple reasoning paths.

    Args:
    paths (list): List of reasoning paths.

    Returns:
    str: The most consistent answer.
    """
    # Define a prompt template that presents the list of reasoning paths to the model and asks it to analyze them collectively.
    prompt_template = PromptTemplate(
        input_variables=["paths"],
        template="""Analyze the following reasoning paths and determine the most consistent answer. If there are discrepancies, explain why and provide the most likely correct answer.
        Reasoning paths:
        {paths}

        Most consistent answer:"""
    )

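    # Join all reasoning paths into a single string, labeling each one so the
    # model can tell where a path starts and refer to it by number
    joined_paths = "\n\n".join(f"Path {i}:\n{path}" for i, path in enumerate(paths, 1))
    # Pipe the prompt template into the language model to form a chain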
    chain = prompt_template | llm
    # Invoke the model using the constructed prompt
    response = chain.invoke({"paths": joined_paths}).content
    return response

# Aggregate the reasoning paths for the earlier physics problem
aggregated_result = aggregate_results(paths)
# Display the model's selected answer after synthesizing all paths
print("Aggregated Result:\n", aggregated_result)

Aggregated Result:
 After analyzing the reasoning paths provided, the most consistent answer regarding the maximum height a ball thrown upwards with an initial velocity of 20 m/s will reach is approximately **20.39 meters**. 

### Explanation of Consistency:
1. **Kinematic Approach**: The first reasoning path uses the kinematic equation to derive the maximum height. It correctly identifies the variables and applies the equation \(v^2 = u^2 + 2as\) to find the height. The calculations are accurate, leading to the conclusion of 20.39 meters.

2. **Energy Conservation Approach**: The second reasoning path leverages the principle of conservation of energy, equating the initial kinetic energy to the potential energy at the maximum height. The calculations again arrive at the same result of approximately 20.39 meters. This method is valid and provides an alternative perspective on the problem.

3. **Reiterated in Reasoning Path 3**: The final reasoning path reiterates the use of energy conse

- Prompt composition: All reasoning paths are concatenated and passed into a single structured prompt that tells the model to evaluate them.
- Role shift: This shift in prompt context nudges the model to switch from open-ended response generation to meta-reasoning, comparing different chains of thought it previously created.
- Implicit voting: Although we don’t explicitly ask the model to "vote", the aggregation prompt encourages a form of rational comparison — choosing the most consistent and well-supported logic.
- Error correction opportunity: If a reasoning path contains an error (e.g., math mistake, incorrect unit conversion), the model has a chance to detect and correct it during aggregation — especially if other paths got it right.
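
As a complement to LLM-based aggregation, an explicit vote can be sketched in a few lines for problems with a numeric answer. This is an illustration under stated assumptions, not the method used in this notebook: `extract_final_number` and `majority_vote` are hypothetical helpers, the "last number in the text" heuristic is crude, and the sample paths are invented strings, not real model output.

```python
import re
from collections import Counter

def extract_final_number(text):
    """Return the last unsigned decimal number in `text`, or None if absent."""
    matches = re.findall(r"[0-9]+(?:\.[0-9]+)?", text)
    return float(matches[-1]) if matches else None

def majority_vote(paths, ndigits=1):
    """Vote over the final numbers of each path.

    Answers are rounded to `ndigits` so near-identical values count as one vote.
    Returns (winning_value, share_of_paths_that_agree), or (None, 0.0) if no
    path contains a number.
    """
    answers = [extract_final_number(p) for p in paths]
    rounded = [round(a, ndigits) for a in answers if a is not None]
    if not rounded:
        return None, 0.0
    value, count = Counter(rounded).most_common(1)[0]
    return value, count / len(paths)

# Invented example paths (assumption: each ends with its final answer)
sample_paths = [
    "Using v^2 = u^2 + 2as, the maximum height is about 20.39 m",
    "Energy conservation gives h = u^2 / (2g), about 20.4 m",
    "Time to peak is 2.04 s, so the height is roughly 20.41 m",
    "Forgetting the factor of two gives 40.8 m",
]
answer, agreement = majority_vote(sample_paths)
print(f"Voted answer: {answer} (agreement: {agreement:.0%})")
```

The agreement share gives a rough numeric signal of consistency that the free-text aggregation does not. Rounding is a blunt way to group answers, so values that straddle a rounding boundary may be counted separately.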

Aggregation is a crucial part of making multi-path prompting practical and scalable:
- It gives us a *single interpretable answer* after sampling multiple possibilities.
- It balances between creative exploration (via diverse paths) and disciplined conclusion (via review).
- It mimics how humans often double-check work: think a few ways, then evaluate which is most reliable.

Next, we will go one level deeper: evaluating the consistency of the aggregated result itself, to ensure it is not only plausible but trustworthy.


## Self-consistency check
Even after aggregating multiple reasoning paths, the final answer may still contain subtle inconsistencies or flawed assumptions, or it may simply sound plausible without being correct.

To strengthen the reliability of our result, we will add a final verification step: a self-consistency check. This is where we ask the model to critically evaluate its own (aggregated) answer in the context of the original problem — not to solve the problem again, but to assess the correctness and internal coherence of the solution it has proposed.

This helps catch logical contradictions, overgeneralizations, incorrect assumptions and miscalculations that slipped through aggregation.

This step acts like a QA filter before deploying or trusting a model’s response. It simulates a second opinion or internal peer review, which is crucial in high-stakes or technical applications like scientific problem solving, financial modeling, legal reasoning or complex diagnostics. It also gives human users more transparency, allowing them to see whether the model itself "believes" in its final answer — and why.

In [5]:
def self_consistency_check(problem, aggregated_result):
    """
    Perform a self-consistency check on the aggregated result.

    Args:
    problem (str): The original problem statement.
    aggregated_result (str): The aggregated result to check.

    Returns:
    str: An evaluation of the result's consistency and reliability.
    """
    # Define a structured evaluation prompt
    prompt_template = PromptTemplate(
        input_variables=["problem", "result"],
        template="""Evaluate the consistency and reliability of the following result for the given problem.
        Problem: {problem}
        Result: {result}

        Evaluation (consider factors like logical consistency, adherence to known facts, and potential biases):"""
    )

    # Combine the prompt and the language model
    chain = prompt_template | llm
    # Pass both the original problem and the final answer into the chain
    response = chain.invoke({"problem": problem, "result": aggregated_result}).content
    return response

# Perform consistency check on the aggregated answer
consistency_evaluation = self_consistency_check(problem, aggregated_result)
# Display the evaluation result
print("Self-Consistency Evaluation:\n", consistency_evaluation)

Self-Consistency Evaluation:
 ### Evaluation of Consistency and Reliability

1. **Logical Consistency**:
   - The result of approximately **20.39 meters** is logically consistent across different reasoning paths. Both the kinematic approach and the energy conservation approach yield the same outcome, which reinforces the reliability of the result. The use of established kinematic equations and principles of energy conservation indicates a solid understanding of the physics involved.

2. **Adherence to Known Facts**:
   - The kinematic equation used, \(v^2 = u^2 + 2as\), is a well-known relation in physics for uniformly accelerated motion. In this case, the final velocity \(v\) at the maximum height is 0 m/s, the initial velocity \(u\) is 20 m/s, and the acceleration \(a\) is -9.81 m/s² (due to gravity). The calculations based on these values should yield a height of approximately 20.39 meters. 
   - Similarly, the energy conservation approach correctly equates kinetic energy (\(\frac{1

Technically speaking, this prompt reframes the model’s role again: it now acts not as a generator or aggregator, but as a verifier. Here's a breakdown of how it works:
- Inputs: The model is given the original problem + the final (aggregated) result.
- Task framing: Instead of solving, the model is asked to critically examine whether the answer holds up. The prompt explicitly asks it to consider:
  - Logical consistency
  - Adherence to known facts
  - Potential biases
- Behavioral signal: By explicitly asking for an "evaluation," we prime the model to analyze rather than regenerate the answer, which can surface errors that slipped through the earlier steps.
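
For quantitative problems, an LLM-based verifier can be complemented by a deterministic check. The sketch below recomputes the maximum height of the ball from the earlier physics problem in two ways, under the same assumptions the model outputs use (g = 9.81 m/s², no air resistance). The function names are illustrative.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2, as used in the model outputs

def max_height_kinematics(u, g=G):
    """From v^2 = u^2 + 2as with v = 0 and a = -g: s = u^2 / (2g)."""
    return u ** 2 / (2 * g)

def max_height_energy(u, g=G, mass=1.0):
    """From (1/2) m u^2 = m g h: h = kinetic energy / (m g)."""
    kinetic_energy = 0.5 * mass * u ** 2
    return kinetic_energy / (mass * g)

u = 20.0  # initial velocity in m/s
h_kin = max_height_kinematics(u)
h_en = max_height_energy(u)

# Both derivations should agree up to floating-point error
assert math.isclose(h_kin, h_en)
print(f"Maximum height: {h_kin:.2f} m")  # prints "Maximum height: 20.39 m"
```

This matches the 20.39 meters reported in the aggregated result, which provides a check that does not depend on the model evaluating its own answer.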

Next, we will generalize this into a reusable solver that combines all three steps: multiple path generation, result aggregation, and self-consistency evaluation.

## Applying to different problem types
Now that we have developed a method for generating multiple reasoning paths, aggregating the results, and evaluating their consistency, it is time to put it all together and test it on various kinds of problems.

This step demonstrates the flexibility of the **multi-path + aggregation + self-consistency** approach across different domains — from factual questions and conceptual explanations to basic quantitative reasoning.

This kind of structured multi-pass reasoning is particularly valuable when dealing with ambiguous, high-stakes, or error-prone queries.

In [6]:
def solve_problem(problem):
    """
    Solve a problem using multiple reasoning paths, aggregation, and self-consistency check.

    Args:
    problem (str): The problem statement.

    Returns:
    tuple: (aggregated_result, consistency_evaluation)
    """
    # Step 1: Generate multiple distinct reasoning paths
    paths = generate_multiple_paths(problem)

    # Step 2: Aggregate these reasoning paths into a single answer
    aggregated_result = aggregate_results(paths)

    # Step 3: Evaluate the aggregated result for logical consistency and reliability
    consistency_evaluation = self_consistency_check(problem, aggregated_result)
    return aggregated_result, consistency_evaluation

# Example problems spanning factual, conceptual, and numerical reasoning
problems = [
    "What is the capital of France?",
    "Explain the concept of supply and demand in economics.",
    "If a train travels at 60 km/h, how long will it take to cover 180 km?"
]

for problem in problems:
    print(f"Problem: {problem}")
    # Apply the reasoning pipeline
    result, evaluation = solve_problem(problem)
    # Output the aggregated answer and the model’s self-evaluation
    print("Aggregated Result:\n", result)
    print("\nConsistency Evaluation:\n", evaluation)
    print("\n" + "-"*50 + "\n")


Problem: What is the capital of France?
Aggregated Result:
 The most consistent answer across all three reasoning paths is that the capital of France is **Paris**. Each reasoning path effectively highlights different aspects that lead to the same conclusion, emphasizing the city's historical, cultural, and political significance.

1. **Reasoning Path 1** focuses on the recognition of Paris through its iconic landmarks and its role as the political and administrative center, mentioning specific institutions like the presidential palace and the National Assembly.

2. **Reasoning Path 2** takes a broader approach, discussing geographical and cultural contexts, and highlights historical events, further supporting the idea of Paris as the capital through its identity as the "City of Light" and its central role in French history.

3. **Reasoning Path 3** dives into historical context and linguistic connections, reinforcing Paris's importance not only as a capital but also as a symbol of Fren

- We define a list of diverse problems.
- For each problem:
  - We call the `solve_problem()` function, which internally runs:
    - Generation of multiple reasoning paths
    - Aggregation of those paths
    - Self-consistency check
  - Then we print both the final answer and a diagnostic of how consistent or reliable that answer seems.