#### Self-Consistency Improves Chain of Thought Reasoning in Language Models
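- Self-Consistency is a method designed to enhance the reasoning abilities of large language models (LLMs) by sampling several reasoning paths for the same question and keeping the answer they agree on most, which reduces the effect of variability in any single output.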
- It builds on the concept of Chain of Thought (CoT) reasoning, which involves decomposing a problem into logical, step-by-step reasoning paths. 

#### Example: Math Problem Using Self-Consistency

**Query**: "What is 27 times 36?"

- **Reasoning Paths** (Generated Independently):
  - **Path 1**: "27 × 36 = (27 × 30) + (27 × 6) = 810 + 162 = 972."
  - **Path 2**: "27 × 36 = (30 × 36) - (3 × 36) = 1080 - 108 = 972."
  - **Path 3**: "27 × 36 = (20 × 36) + (7 × 36) = 720 + 252 = 972."

**Final Answer**: The majority answer is `972`, which is consistent across all paths.
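
The vote in this example can be sketched in a few lines of Python. This is a minimal illustration, assuming the final answer is the last integer in each path; the helper name `extract_final_number` is made up for this sketch.

```python
import re
from collections import Counter

paths = [
    "27 × 36 = (27 × 30) + (27 × 6) = 810 + 162 = 972.",
    "27 × 36 = (30 × 36) - (3 × 36) = 1080 - 108 = 972.",
    "27 × 36 = (20 × 36) + (7 × 36) = 720 + 252 = 972.",
]

def extract_final_number(path):
    """Return the last integer that appears in a reasoning path, or None."""
    numbers = re.findall(r"\d+", path)
    return int(numbers[-1]) if numbers else None

# One vote per path; paths without a parsable answer are ignored.
answers = [extract_final_number(p) for p in paths]
votes = Counter(a for a in answers if a is not None)
final_answer, count = votes.most_common(1)[0]
print(final_answer, count)  # 972 3
```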

---

#### Example: Medical Diagnosis Using Self-Consistency

**Query**: "What is the likely diagnosis for a patient experiencing fatigue, weight loss, and frequent urination?"

- **Reasoning Paths** (Generated Independently):
  - **Path 1**: "Fatigue and weight loss are common symptoms of diabetes. Frequent urination further supports this diagnosis."
  - **Path 2**: "Fatigue and weight loss might indicate thyroid issues, but frequent urination points more strongly to diabetes."
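  - **Path 3**: "While other conditions like kidney problems could cause fatigue, frequent urination and weight loss strongly suggest diabetes."

**Final Answer**: All three paths converge on `diabetes`, so it is selected as the majority answer.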

---

#### Example: Weather Prediction Using Self-Consistency

**Query**: "What is the expected weather tomorrow based on the following context?"

**Context**:
- The temperature today is 28°C.
- Forecast predicts 30°C tomorrow.
- It has been raining for two days.
- Humidity is high.

- **Reasoning Paths** (Generated Independently):
  - **Path 1**: "The temperature increase and high humidity suggest warmer and muggy conditions, possibly with scattered rain."
  - **Path 2**: "Continued rain might moderate the temperature rise, but humidity will remain high."
  - **Path 3**: "The forecasted temperature rise indicates warmer weather, but residual rain might keep conditions unstable."

**Final Answer**: The consistent prediction is "Warmer and humid conditions, potentially with scattered rain."


#### Medical Diagnosis Using Self-Consistency: How Voting Works

**Scenario**: 
In a medical diagnosis example, self-consistency samples multiple reasoning paths and derives the most likely outcome from them. The "voting" process counts the final diagnosis each path reaches and selects the one that appears most often.

---

#### How the Voting Happens

1. **Multiple Reasoning Paths**:  
   The LLM generates several independent reasoning paths or chains of thought based on the query and the patient's medical context.

2. **Aggregation of Answers**:  
   - The outputs from all reasoning paths are collected.  
   - If most paths converge on a specific diagnosis (e.g., "diabetes"), that diagnosis is selected as the most consistent and likely correct answer.

3. **Consistency as a Metric**:  
   - The "winning" diagnosis is chosen based on the highest frequency of agreement among all reasoning paths.
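
The three steps above can be sketched as a plurality vote over the final answers. This is an illustrative sketch, not a reference implementation: the function name `plurality_vote` is hypothetical, and it assumes each path has already been reduced to a short label such as "diabetes". Ties are returned explicitly rather than broken silently.

```python
from collections import Counter

def plurality_vote(answers):
    """Return (winners, counts): every answer tied for the most votes, plus the tally."""
    counts = Counter(answers)
    if not counts:
        return [], counts
    top = max(counts.values())
    winners = sorted(a for a, c in counts.items() if c == top)
    return winners, counts

winners, counts = plurality_vote(
    ["diabetes", "diabetes", "hyperthyroidism", "diabetes", "kidney disease"]
)
print(winners)             # ['diabetes']
print(counts["diabetes"])  # 3
```

Returning a list of winners makes a tie visible to the caller, who can then sample more paths or escalate to review.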

---

#### Who Votes

- The "voters" are the individual reasoning paths generated by the LLM.  
- These paths are produced by prompting the LLM multiple times with sampling enabled (e.g., a nonzero temperature, optionally with different random seeds), so each run can follow a different chain of thought.
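
One way to obtain several "voters" in a single request is the `n` parameter of the Chat Completions API, which returns several sampled completions for the same prompt. The sketch below assumes a `client` object like the one returned by `openai.OpenAI()`; the function name `sample_paths`, the prompt wording, and the default values are illustrative choices.

```python
def sample_paths(client, question, num_paths=5, temperature=0.7, model="gpt-4o-mini"):
    """Sample num_paths chains of thought for one prompt in a single request."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{question}\nLet's think step by step."}],
        temperature=temperature,  # should be > 0, otherwise the samples tend to be near-identical
        n=num_paths,              # ask for several sampled completions at once
    )
    return [choice.message.content.strip() for choice in response.choices]

# Usage (requires the openai package and an API key):
# import openai
# paths = sample_paths(openai.OpenAI(), "What is 27 times 36?")
```

Compared with looping over separate requests, a single call with `n` sends the prompt only once, although every returned completion still counts toward output tokens.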

---

#### Evaluation Steps

1. **Conflict Resolution**:  
   - If reasoning paths produce conflicting diagnoses, the votes are tallied and, where needed, the intermediate steps of each path are reviewed.  
   - The final output reflects the diagnosis reached by the majority of paths.

2. **Human Validation** (if needed):  
   - Medical experts can review the LLM’s reasoning for accuracy and plausibility.

3. **External Validation Layers**:  
   - Systems may implement structured rules or knowledge bases to cross-check the LLM's output.

---

#### Benefits of Self-Consistency for Medical Diagnosis

- **Robustness**: Helps ensure the diagnosis is not a result of a single flawed reasoning path.  
- **Transparency**: Provides multiple perspectives on the problem.  
- **Scalability**: Automates consistency checks across large datasets.  

This approach relies on generating diverse reasoning paths and keeping the majority consensus, which reduces the chance that a single flawed chain of thought determines the answer.


In [1]:
import openai
import random

In [2]:
client = openai.OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    # api_key = api_key
)

In [3]:
def generate_reasoning_paths(query, context, num_paths=5, temperature=0.7):
    """
    Generate multiple reasoning paths for a medical diagnosis.
    """
    responses = []
    for _ in range(num_paths):
        # Generate reasoning path
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a highly knowledgeable medical assistant."},
                {"role": "user", "content": f"""
                Question: {query}

                Relevant context:
                {context}

                Please reason step by step and provide the most likely diagnosis based on the symptoms and history.
                """}
            ],
            temperature=temperature,
        )
        responses.append(response.choices[0].message.content.strip())
    return responses

In [4]:
def analyze_consistency(responses):
    """
    Analyze consistency across reasoning paths and find the most frequent diagnosis.
    """
    diagnosis_counts = {}
    for response in responses:
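        # Extract the diagnosis (simple assumption: last line is the conclusion).
        # Note: free-text last lines rarely match exactly, so each often gets one vote.
        lines = response.splitlines()
        if not lines:
            continue  # skip empty responses instead of raising IndexError
        diagnosis = lines[-1].strip()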
        diagnosis_counts[diagnosis] = diagnosis_counts.get(diagnosis, 0) + 1
    
    # Find the most common diagnosis
    most_common_diagnosis = max(diagnosis_counts, key=diagnosis_counts.get)
    return most_common_diagnosis, diagnosis_counts

In [5]:
# Example context
context = """
Patient reports fatigue, weight loss, and frequent urination over the last month.
The patient has a history of high blood sugar and a family history of diabetes.
Recent lab results indicate elevated blood glucose levels and the presence of ketones in the urine.
"""

query = "What is the likely diagnosis for this patient?"

In [6]:
# Generate reasoning paths
reasoning_paths = generate_reasoning_paths(query, context, num_paths=5)

In [10]:
# Print all reasoning paths
print("Reasoning Paths:")
for i, path in enumerate(reasoning_paths, 1):
    print(f"\nPath {i}:\n{path}")

Reasoning Paths:

Path 1:
Based on the provided context, let's analyze the symptoms and history step by step:

1. **Symptoms**:
   - **Fatigue**: Common in many conditions, but particularly in diabetes due to poor glucose utilization.
   - **Weight Loss**: Unintentional weight loss can occur in uncontrolled diabetes as the body starts breaking down fat and muscle for energy when it cannot utilize glucose properly.
   - **Frequent Urination**: Known as polyuria, it occurs when excess glucose in the blood spills over into the urine, pulling water with it and leading to increased urination.

2. **Medical History**:
   - **History of High Blood Sugar**: This suggests that the patient may have had undiagnosed or poorly managed diabetes in the past.
   - **Family History of Diabetes**: A strong risk factor for developing diabetes, especially type 2 diabetes or even type 1 diabetes in some cases.

3. **Recent Lab Results**:
   - **Elevated Blood Glucose Levels**: Indicates hyperglycemia, whic

In [11]:
# Analyze consistency
most_common_diagnosis, diagnosis_counts = analyze_consistency(reasoning_paths)

In [13]:
diagnosis_counts

{'**Conclusion**: Given the symptoms, history, and lab findings, the most likely diagnosis for this patient is **Diabetes Mellitus, likely Type 1 Diabetes with Diabetic Ketoacidosis (DKA)**.': 1,
 "Given the combination of symptoms (fatigue, weight loss, frequent urination), the patient's history (high blood sugar, family history of diabetes), and the lab results (elevated blood glucose and ketones in urine), the most likely diagnosis is **Type 1 Diabetes Mellitus**, potentially presenting with **Diabetic Ketoacidosis (DKA)**. This diagnosis is supported by the acute presentation of symptoms and the presence of ketones, which indicates significant metabolic derangement often associated with Type 1 diabetes.": 1,
 "Immediate medical evaluation and intervention would be warranted to manage the patient's condition effectively.": 1,
 'Thus, the most likely diagnosis for this patient is **Diabetes Mellitus with Diabetic Ketoacidosis**.': 1,
 'In summary, the most likely diagnosis for this p

In [12]:
# Output the results
print("\nDiagnosis Frequency:")
for diagnosis, count in diagnosis_counts.items():
    print(f"{diagnosis}: {count} votes")

print(f"\nMost Likely Diagnosis: {most_common_diagnosis}")


Diagnosis Frequency:
**Conclusion**: Given the symptoms, history, and lab findings, the most likely diagnosis for this patient is **Diabetes Mellitus, likely Type 1 Diabetes with Diabetic Ketoacidosis (DKA)**.: 1 votes
Given the combination of symptoms (fatigue, weight loss, frequent urination), the patient's history (high blood sugar, family history of diabetes), and the lab results (elevated blood glucose and ketones in urine), the most likely diagnosis is **Type 1 Diabetes Mellitus**, potentially presenting with **Diabetic Ketoacidosis (DKA)**. This diagnosis is supported by the acute presentation of symptoms and the presence of ketones, which indicates significant metabolic derangement often associated with Type 1 diabetes.: 1 votes
Immediate medical evaluation and intervention would be warranted to manage the patient's condition effectively.: 1 votes
Thus, the most likely diagnosis for this patient is **Diabetes Mellitus with Diabetic Ketoacidosis**.: 1 votes
In summary, the most

In [14]:
diagnosis_counts = {}

for response in reasoning_paths:
    # Extract the diagnosis (simple assumption: last line is the conclusion)
    diagnosis = response.splitlines()[-1].strip()
    diagnosis_counts[diagnosis] = diagnosis_counts.get(diagnosis, 0) + 1

In [15]:
diagnosis_counts

{'**Conclusion**: Given the symptoms, history, and lab findings, the most likely diagnosis for this patient is **Diabetes Mellitus, likely Type 1 Diabetes with Diabetic Ketoacidosis (DKA)**.': 1,
 "Given the combination of symptoms (fatigue, weight loss, frequent urination), the patient's history (high blood sugar, family history of diabetes), and the lab results (elevated blood glucose and ketones in urine), the most likely diagnosis is **Type 1 Diabetes Mellitus**, potentially presenting with **Diabetic Ketoacidosis (DKA)**. This diagnosis is supported by the acute presentation of symptoms and the presence of ketones, which indicates significant metabolic derangement often associated with Type 1 diabetes.": 1,
 "Immediate medical evaluation and intervention would be warranted to manage the patient's condition effectively.": 1,
 'Thus, the most likely diagnosis for this patient is **Diabetes Mellitus with Diabetic Ketoacidosis**.': 1,
 'In summary, the most likely diagnosis for this p
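
The tallies above give every conclusion exactly one vote, because free-text last lines almost never match character for character. One possible remedy is to map each conclusion to a canonical label before counting. The sketch below is an assumption-laden illustration: the table `LABEL_PATTERNS` and the function `normalize_diagnosis` are hypothetical, keyword matching is a deliberate simplification, and the sample strings are three of the conclusions printed above.

```python
import re
from collections import Counter

# Hypothetical label table: canonical diagnosis -> pattern that signals it.
LABEL_PATTERNS = {
    "diabetes mellitus": re.compile(r"diabet", re.IGNORECASE),
    "hyperthyroidism": re.compile(r"hyperthyroid", re.IGNORECASE),
}

def normalize_diagnosis(conclusion):
    """Map a free-text conclusion to a canonical label, or None if nothing matches."""
    for label, pattern in LABEL_PATTERNS.items():
        if pattern.search(conclusion):
            return label
    return None

conclusions = [
    "**Conclusion**: Given the symptoms, history, and lab findings, the most likely "
    "diagnosis for this patient is **Diabetes Mellitus, likely Type 1 Diabetes with "
    "Diabetic Ketoacidosis (DKA)**.",
    "Immediate medical evaluation and intervention would be warranted to manage the "
    "patient's condition effectively.",
    "Thus, the most likely diagnosis for this patient is **Diabetes Mellitus with "
    "Diabetic Ketoacidosis**.",
]

labels = [normalize_diagnosis(c) for c in conclusions]
# Conclusions that name no known diagnosis do not vote.
votes = Counter(label for label in labels if label is not None)
print(votes.most_common(1))  # [('diabetes mellitus', 2)]
```

Another option is to ask the model to finish with a fixed-format line (for example `Final diagnosis: <label>`) and parse only that line, which avoids maintaining a keyword table.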

#### What About Different Temperature Settings?

In [16]:
def generate_responses(prompt, model="gpt-4", n=5, temperatures=(0.1, 0.5, 0.7, 1.0)):
    """
    Generate candidate responses using different temperature settings.

    Parameters:
    - prompt (str): The question or context to feed into the LLM.
    - model (str): The OpenAI model to use (default is "gpt-4").
    - n (int): Number of responses to generate per temperature.
    - temperatures (sequence of float): Temperature values to explore.

    Returns:
    - responses (list of str): A list of all generated responses.
    """
    responses = []

    for temp in temperatures:
        for _ in range(n):
            # Use the client created in In [2]; openai.ChatCompletion was removed in openai>=1.0.
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "system", "content": "You are a medical assistant helping with diagnoses."},
                          {"role": "user", "content": prompt}],
                temperature=temp,
                max_tokens=300,  # a low cap can cut off the final conclusion line
            )
            responses.append(response.choices[0].message.content.strip())

    return responses

In [17]:
def analyze_consistency(responses):
    """
    Analyze consistency across reasoning paths and find the most frequent diagnosis.

    Parameters:
    - responses (list of str): List of reasoning outputs from the LLM.

    Returns:
    - most_common_diagnosis (str): The diagnosis that appears most frequently.
    - diagnosis_counts (dict): A dictionary with diagnoses as keys and their counts as values.
    """
    diagnosis_counts = {}
    for response in responses:
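        # Extract diagnosis (simple assumption: last line is the conclusion).
        # Note: free-text last lines rarely match exactly, so each often gets one vote.
        lines = response.splitlines()
        if not lines:
            continue  # skip empty responses instead of raising IndexError
        diagnosis = lines[-1].strip()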
        diagnosis_counts[diagnosis] = diagnosis_counts.get(diagnosis, 0) + 1

    # Find the most common diagnosis
    most_common_diagnosis = max(diagnosis_counts, key=diagnosis_counts.get)
    return most_common_diagnosis, diagnosis_counts

#### Key Design Considerations for Building Self-Consistency (SC) Prompt-Based Applications with LLMs

#### 1. Purpose and Scope of the Application
- Clearly define the goals of the application (e.g., medical diagnosis, legal analysis, customer support).
- Ensure the problem domain is well-suited for reasoning and deliberation, benefiting from SC methods.

---

#### 2. Prompt Engineering
- **Design a clear and structured prompt**: Include step-by-step reasoning instructions to guide the model effectively.
  - Example: `"Summarize the symptoms, correlate them with possible causes, and provide the most likely diagnosis."`
- Use **domain-specific language** to increase precision.
  - Example: Include terms like "ketones," "elevated glucose levels," or "Type 1 diabetes" for medical applications.
- Encourage **exploration of alternatives** in the prompt.
  - Example: `"Consider all possible conditions that fit these symptoms and evaluate them before concluding."`

---

#### 3. Diversity of Reasoning Paths
- Generate diverse outputs by experimenting with:
  - **Temperature Settings**: Higher `T` (e.g., 0.7–1.0) yields more varied reasoning paths; lower `T` (e.g., 0.1–0.3) yields near-deterministic output, which leaves little for the vote to compare.
  - **Top-p Sampling**: Adjust `p` (e.g., 0.9) to control the diversity of generated tokens.
- Create multiple prompts or prompt variations for the same query to elicit different reasoning paths.
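
To see why temperature changes diversity, the toy example below applies the standard temperature-scaled softmax to a few made-up logits. It assumes the usual formulation (logits divided by `T` before the softmax); the scores are invented for illustration and the function name is hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low `T` nearly all probability mass sits on the top-scoring token, so repeated samples look alike; at higher `T` the alternatives are drawn more often, which is what produces distinct reasoning paths.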

---

#### 4. Ensuring Adequate Context
- **Chunk large datasets**: For tasks like medical diagnosis with an EMR/EHR system, divide patient records into smaller, relevant chunks (e.g., medications, lab results, imaging reports).
- Use **retrieval-augmented generation (RAG)** to fetch and pass only the relevant context for the query.
  - Example: Use a vector store such as FAISS or Chroma for indexing and retrieval.
- Ensure proper **contextual grounding** by explicitly citing sources or references in prompts.

---

#### 5. Generation of Responses
- **Number of Candidates**:
  - Experiment with generating 5–10 candidate responses for robust SC analysis.
- **Balance Output Length**:
  - Avoid overly verbose reasoning that might dilute key points.
- Consider **multi-turn dialogues**:
  - Example: Allow iterative clarifications to refine responses.

---

#### 6. Evaluation Metrics for Self-Consistency
- **Frequency Voting**:
  - Identify the most frequent outcome across all reasoning paths.
- **Semantic Similarity**:
  - Use tools like `spaCy` or `SentenceTransformers` to measure the similarity of conclusions, identifying consistent outputs with semantic overlap.
- **Domain-Specific Metrics**:
  - Evaluate based on specific KPIs, such as diagnostic accuracy for medical use or legal correctness in legal applications.
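
Grouping conclusions by similarity before voting can be sketched as follows. To stay dependency-free, this sketch swaps in token-level Jaccard overlap as a crude stand-in for embedding similarity from a library such as `SentenceTransformers`; the function names and the 0.5 threshold are assumptions chosen for illustration.

```python
import re

def tokens(text):
    """Lowercase word and number tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Share of tokens the two strings have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def group_similar(conclusions, threshold=0.5):
    """Greedily put each conclusion in the first group whose first member is similar enough."""
    groups = []
    for text in conclusions:
        for group in groups:
            if jaccard(text, group[0]) >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return sorted(groups, key=len, reverse=True)

conclusions = [
    "Most likely diagnosis: type 1 diabetes mellitus",
    "The most likely diagnosis is type 1 diabetes mellitus",
    "Likely hyperthyroidism",
]
largest = group_similar(conclusions)[0]
print(len(largest))  # 2
```

The largest group plays the role of the majority answer. Replacing `jaccard` with cosine similarity over sentence embeddings would keep the same structure.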

---

#### 7. Handling Ambiguities
- **Encourage uncertainty quantification**:
  - Example: Include prompts asking for confidence levels or listing assumptions.
- Create fallbacks for **low-consensus cases**:
  - Notify users or require manual review when consistency is below a threshold.
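
A low-consensus fallback can be as simple as comparing the winning share of votes with a threshold. The sketch below is illustrative: the function name `decide`, the returned field names, and the 0.6 threshold are assumptions that would need tuning for a real application.

```python
from collections import Counter

def decide(answers, min_agreement=0.6):
    """Return the top answer, its share of the votes, and a manual-review flag."""
    if not answers:
        return {"answer": None, "agreement": 0.0, "needs_review": True}
    answer, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    return {
        "answer": answer,
        "agreement": agreement,
        "needs_review": agreement < min_agreement,
    }

print(decide(["diabetes"] * 4 + ["hyperthyroidism"]))
# {'answer': 'diabetes', 'agreement': 0.8, 'needs_review': False}
print(decide(["diabetes", "hyperthyroidism", "kidney disease", "diabetes", "anemia"]))
# {'answer': 'diabetes', 'agreement': 0.4, 'needs_review': True}
```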

---

#### 8. Integration with External Systems
- Use APIs to **fetch data dynamically** from databases like EMR/EHR systems or other knowledge bases.
- Implement **validation layers** for domain-specific checks (e.g., cross-referencing diagnoses with lab findings).

---

#### 9. Optimizing for Cost and Latency
- Optimize the number of SC iterations and candidate responses to balance quality with compute resources.
- Use **cheaper models for initial reasoning** and premium models (e.g., GPT-4) for critical paths.

---

#### 10. Iterative Testing and Refinement
- Test prompts rigorously with domain experts to fine-tune reasoning steps and accuracy.
- Conduct regular **failure case analyses**:
  - Investigate why SC outputs fail or provide inconsistent answers and refine accordingly.

---

#### 11. Explainability and Transparency
- Provide clear explanations for final outputs:
  - Example: `"The diagnosis is Type 1 Diabetes Mellitus because it aligns with the symptoms and lab results, and it appeared most frequently in generated responses."`
- Use **traceable SC logs** to audit reasoning paths for critical domains like healthcare and law.
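
A traceable log entry might bundle the query, every reasoning path, and the vote tally into one record. The sketch below shows one possible shape; the function name `build_audit_record` and all field names are hypothetical, and it assumes at least one answer is present.

```python
import json
from collections import Counter
from datetime import datetime, timezone

def build_audit_record(query, paths, answers):
    """Bundle the query, each reasoning path, and the vote tally into a JSON-serializable dict."""
    tally = Counter(answers)
    winner, count = tally.most_common(1)[0]  # assumes answers is non-empty
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "paths": [
            {"index": i, "reasoning": p, "answer": a}
            for i, (p, a) in enumerate(zip(paths, answers), 1)
        ],
        "tally": dict(tally),
        "final_answer": winner,
        "agreement": count / len(answers),
    }

record = build_audit_record(
    "What is the likely diagnosis?",
    ["path one text", "path two text", "path three text"],
    ["diabetes", "diabetes", "hyperthyroidism"],
)
print(json.dumps(record, indent=2))
```

Storing the full text of every path, not just the winner, lets a reviewer later inspect the minority reasoning as well.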

---

#### 12. Ethical Considerations
- Ensure compliance with domain-specific regulations (e.g., HIPAA for healthcare).
- Clearly disclose the limitations of SC-based reasoning to users:
  - Example: `"This diagnosis is based on probabilistic reasoning and should be validated by a healthcare professional."`

---

By addressing these considerations, practitioners can effectively design robust and practical SC-based applications for diverse use cases while maintaining reliability, accuracy, and user trust.


- Chain of Thought (CoT) Prompting: Involves asking an LLM to generate a series of steps to solve a problem, improving performance on reasoning and arithmetic tasks. Typically uses the phrase "let's think step by step."
- Self-Consistency Improvement: Authors propose generating multiple chains of thought and selecting the most consistent answer, similar to how humans approach problems.
- Self-Consistency Advantage: Simple and unsupervised technique, no fine-tuning or additional training required.
- Diversity of Answers: Achieved by adjusting the temperature parameter, which controls the randomness of answers generated by the LLM. Higher temperature = more diverse responses.
- Benchmarking Results: Self-consistency method significantly improves accuracy across various reasoning benchmarks, especially on arithmetic tasks.
- Accuracy Improvement: In one reported case accuracy rose from about 95% to 99.3%, and gains reached up to 18% depending on the LLM and task.
- Sample Requirements: Accuracy improvements plateau after about 10 samples.
- Temperature Sensitivity: Results are robust across different temperature settings, with increasing samples leading to higher accuracy, but diminishing returns after a certain point.
- Computational Cost: Generating more samples increases computational cost, but the improved accuracy may justify it.
- Simple Technique: Self-consistency is an easy-to-understand method for boosting the effectiveness of Chain of Thought prompting.

#### Self-Consistency in LLMs

**Self-Consistency** is a technique where multiple responses (or chains of thought) are generated by the same LLM for a given query. These responses are then compared to find the most consistent answer. 

While we use a **separate evaluator** to compare the outputs, the term **"self-consistency"** still holds for the following reasons:

#### 1. Self-Assessment
The LLM generates multiple possible solutions independently, without human intervention. Comparing these reasoning paths to see whether they lead to the same or similar answers acts as a form of self-assessment, because the model's own outputs are checked against each other.

#### 2. Model-Driven Comparison
Although an external evaluator is used to compare the outputs, the core idea of self-consistency is that the comparison is based solely on the **model's own outputs**. The model's reasoning is still central to the process, and the evaluator only helps identify the most consistent result.

#### 3. Algorithmic Consistency
The evaluator algorithm simply compares the different chains of thought generated by the model. It helps to highlight which responses are consistent, but the key point is that the model generates multiple answers based on its own reasoning. The external evaluator does not influence the reasoning process; it merely aids in finding the consistent answer.

#### Conclusion
So, while the evaluation of consistency is external, the **self-consistency** concept emphasizes that the reasoning is generated internally by the model, and the evaluator’s role is to identify the most consistent output from those generated by the model itself.
