# CS25: Common Sense Reasoning - Comprehensive Notes

These notes give a detailed overview of the Stanford CS25 lecture on Common Sense Reasoning and aim to be a self-sufficient resource. They cover the key topics, examples, and research directions discussed, with Python code to illustrate the concepts interactively.

**Table of Contents:**
1.  Introduction: The Illusion of Solved Common Sense
2.  Maieutic Prompting: Socratic Reasoning for LMs
    *   Maieutic Inference Tree
    *   Logical Consistency
    *   Belief and Pairwise Consistency
    *   Constraint Optimization
    *   Empirical Results
3.  Symbolic Knowledge Distillation: From Language Models to Causal Commonsense Models
    *   Systematic Generalization Problem
    *   Commonsense Definition
    *   ATOMIC Knowledge Graph
    *   Symbolic Knowledge Distillation Process
    *   Critique Model
    *   Results: Machine-Authored vs. Human-Authored Knowledge
4.  Commonsense Morality: Aligning AI with Human Values
    *   Moral Implications of Language Models
    *   Delphi: A Commonsense Moral Model
    *   Commonsense Norm Bank
    *   Social Chemistry and Social Bias Frames
    *   Delphi Hybrid: Neural-Symbolic Reasoning
    *   AI Safety, Equity, and Morality
5.  Open Questions and Future Directions


## 1. Introduction: The Illusion of Solved Common Sense

The lecture begins by addressing the common question of whether large language models (LLMs) like ChatGPT have solved common sense reasoning.  While LLMs can perform impressively on certain tasks, they often exhibit unreliability and inconsistencies when challenged with slightly modified or adversarial examples. The speaker uses the Winograd Schema Challenge as an example, showing that ChatGPT can answer a question correctly but fail when the wording is subtly altered.

**Example:**
*   Original: "The trophy doesn't fit in the brown suitcase because it's too big. What's too big?"
*   ChatGPT (Correct): "The trophy"
*   Modified: "The trophy doesn't fit in the brown suitcase because it's too small. What's too small?"
*   ChatGPT (Incorrect): "The trophy"

This highlights that while LLMs may appear to possess common sense, their reasoning is often superficial and easily disrupted.
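
One way to make this kind of brittleness measurable is to score a model on minimally different pairs and count a pair as solved only when both variants are answered correctly. The sketch below is illustrative only: `pair_accuracy` and the stand-in model `always_trophy` are hypothetical names, and the stand-in simply mimics the failure shown above rather than calling a real LLM.

```python
from typing import Callable, List, Tuple

# (sentence, question, expected answer)
Item = Tuple[str, str, str]


def pair_accuracy(model: Callable[[str, str], str],
                  pairs: List[Tuple[Item, Item]]) -> float:
    """Fraction of pairs for which the model answers BOTH variants correctly."""
    if not pairs:
        return 0.0
    solved = 0
    for first, second in pairs:
        if all(model(sentence, question).strip().lower() == gold.lower()
               for sentence, question, gold in (first, second)):
            solved += 1
    return solved / len(pairs)


def always_trophy(sentence: str, question: str) -> str:
    """Hypothetical stand-in for an LLM that ignores the wording change."""
    return "the trophy"


pairs = [(
    ("The trophy doesn't fit in the brown suitcase because it's too big.",
     "What's too big?", "the trophy"),
    ("The trophy doesn't fit in the brown suitcase because it's too small.",
     "What's too small?", "the suitcase"),
)]

print(pair_accuracy(always_trophy, pairs))  # 0.0
```

Scoring per pair rather than per question means a model that gets only the original wording right receives no credit.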

The lecture emphasizes the importance of rigorous evaluation, focusing on new tasks and leaderboards, and innovating algorithms and data to improve the true common sense capabilities of AI models. Key themes are that smaller models can be better with better data and algorithms, and that *knowledge is power*.

## 2. Maieutic Prompting: Socratic Reasoning for LMs

Maieutic prompting is a technique inspired by Socrates' method of questioning to improve the logical consistency of LLMs. The central idea is to build a *Maieutic inference tree* by recursively prompting the LLM to explain its reasoning and then evaluating the consistency of those explanations.

### Maieutic Inference Tree

The tree is constructed as follows:

1.  **Start with a question and a proposed answer (True or False).**
2.  **Prompt the LLM to explain *why* the answer is True (Explanation of True, E(T)).**
3.  **Prompt the LLM to explain *why* the answer is False (Explanation of False, E(F)).**
4.  **Check the logical consistency of the explanations.** For E(T), ask the LLM whether it agrees with the explanation.  For E(F), ask whether it disagrees with (negates) the explanation.
5.  **Recursively repeat steps 2-4 on the explanations themselves, building a tree of reasoning.**

**Example:**

*   Question: "If you travel West far enough from the West Coast, you will reach the East Coast or not?"
*   Proposed Answer: True
*   E(T): "The world is round, so eventually you will reach the East Coast."
*   E(F): "You cannot reach the East Coast because the world is flat."
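
The recursive construction described above can be sketched as a small tree builder. This is a minimal sketch under stated assumptions: `Node`, `build_tree`, `count_nodes`, and the stand-in `toy_explain` are hypothetical names, the explanation function is a placeholder for an LLM call, and a fixed depth limit replaces whatever stopping rule the actual method uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    statement: str   # the question or explanation held at this node
    label: str       # "Q" for the root, "T" for an E(T) node, "F" for an E(F) node
    children: List["Node"] = field(default_factory=list)


def build_tree(statement: str,
               explain: Callable[[str, bool], str],
               max_depth: int,
               label: str = "Q") -> Node:
    """Recursively attach an E(T) and an E(F) child until max_depth is reached."""
    node = Node(statement, label)
    if max_depth <= 0:
        return node
    for supports_true, tag in ((True, "T"), (False, "F")):
        explanation = explain(statement, supports_true)
        if explanation:  # skip the branch if no explanation came back
            node.children.append(
                build_tree(explanation, explain, max_depth - 1, tag))
    return node


def count_nodes(node: Node) -> int:
    """Total number of nodes in the tree rooted at `node`."""
    return 1 + sum(count_nodes(child) for child in node.children)


def toy_explain(statement: str, supports_true: bool) -> str:
    """Hypothetical stand-in for prompting an LLM for an explanation."""
    stance = "true" if supports_true else "false"
    return f"Reason it is {stance}: [{statement[:30]}]"


tree = build_tree(
    "Traveling west from the West Coast eventually reaches the East Coast.",
    toy_explain,
    max_depth=2,
)
print(count_nodes(tree))  # 7 (1 root + 2 children + 4 grandchildren)
```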

### Logical Consistency

The goal is to identify and prune branches of the tree where the LLM exhibits logical inconsistencies.  If the LLM contradicts its own explanations, that branch is considered unreliable and discarded. The speaker emphasizes how language models can be inconsistent with their own statements.
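
One way to operationalize this check, offered here as an assumption rather than the lecture's exact procedure, is to ask the model about a statement and about its negation, and keep the node only when the two verdicts are opposite. In the sketch below, `consistent_verdict` is a hypothetical helper and the `beliefs` table stands in for real LLM answers.

```python
from typing import Callable, Optional


def consistent_verdict(statement: str,
                       negation: str,
                       believes: Callable[[str], bool]) -> Optional[bool]:
    """Return the model's verdict on `statement` if it is self-consistent.

    The model is consistent when it accepts exactly one of the statement
    and its negation. If it accepts both or rejects both, return None so
    the caller can prune that branch.
    """
    on_statement = believes(statement)
    on_negation = believes(negation)
    if on_statement != on_negation:
        return on_statement
    return None


# Hypothetical model answers: True means the model says the text is correct.
beliefs = {
    "The world is round.": True,
    "The world is not round.": False,
    "Butterflies have four wings.": True,
    "Butterflies do not have four wings.": True,  # contradicts the line above
}


def believes(text: str) -> bool:
    return beliefs[text]


print(consistent_verdict("The world is round.",
                         "The world is not round.", believes))  # True
print(consistent_verdict("Butterflies have four wings.",
                         "Butterflies do not have four wings.", believes))  # None
```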

### Belief and Pairwise Consistency

Even after pruning inconsistent branches, the remaining tree may still contain inconsistencies. To address this, the following steps are taken:

1.  **Compute Node-wise Confidence (Belief):** Calculate a confidence score for each node from the LLM's conditional probabilities. As the speaker describes it, the score is a ratio of conditional probabilities that indicates how confident the model is in that particular node.
2.  **Compute Edge-wise Consistency:** Use a Natural Language Inference (NLI) model to determine whether pairs of nodes are contradictory. This generates pairwise weights.
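
The notes do not reproduce the exact belief equation, so the following is only one plausible reading of "a ratio of conditional probabilities": normalize the probability the model assigns to a statement being true against the probability it assigns to the negation being true. The function name `belief` and its inputs are assumptions made for illustration.

```python
def belief(p_statement_true: float, p_negation_true: float) -> float:
    """Normalized confidence that a statement is true.

    Both inputs are probabilities the model assigns to the answer "True",
    once for the statement and once for its negation (assumed inputs).
    A result near 1.0 means the model clearly prefers the statement,
    near 0.0 means it prefers the negation, and 0.5 means it is undecided.
    """
    total = p_statement_true + p_negation_true
    if total == 0:
        return 0.5  # no signal either way
    return p_statement_true / total


print(round(belief(0.9, 0.1), 2))  # 0.9
print(round(belief(0.6, 0.6), 2))  # 0.5
```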

### Constraint Optimization

Finally, a constraint optimization problem is formulated to assign labels (True or False) to each node in the tree such that the overall consistency and confidence are maximized. This is solved using a Max-SAT solver. This process might flip the original label if it enhances graph-level consistency.
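
To make the optimization step concrete, here is a toy version that replaces the Max-SAT solver with exhaustive search, which is only feasible for a handful of nodes. The function `best_assignment`, the node names, and all weights are hypothetical and chosen for illustration.

```python
from itertools import product
from typing import Dict, List, Tuple


def best_assignment(
    beliefs: Dict[str, float],
    relations: List[Tuple[str, str, str, float]],
) -> Tuple[Dict[str, bool], float]:
    """Find the True/False labelling with the highest total satisfied weight.

    beliefs:   node -> confidence in [0, 1] that the node is True.
    relations: (a, b, kind, weight), where kind is "entail" (a True implies
               b True) or "contradict" (a and b cannot both be True).
    """
    nodes = sorted(beliefs)
    best: Dict[str, bool] = {}
    best_score = float("-inf")
    for values in product([True, False], repeat=len(nodes)):
        labels = dict(zip(nodes, values))
        # Unary terms: reward agreeing with each node's belief.
        score = sum(beliefs[n] if labels[n] else 1.0 - beliefs[n] for n in nodes)
        # Pairwise terms: reward every satisfied relation.
        for a, b, kind, weight in relations:
            if kind == "entail":
                satisfied = (not labels[a]) or labels[b]
            else:
                satisfied = not (labels[a] and labels[b])
            if satisfied:
                score += weight
        if score > best_score:
            best, best_score = labels, score
    return best, best_score


# Q is the root claim; the model initially leans towards False (belief 0.4).
beliefs = {"Q": 0.4, "E_T": 0.95, "E_F": 0.05}
relations = [
    ("E_T", "Q", "entail", 1.0),        # "the world is round" supports Q
    ("E_T", "E_F", "contradict", 1.0),  # round and flat cannot both hold
]

labels, score = best_assignment(beliefs, relations)
print(labels["Q"], round(score, 2))  # True 4.3
```

Here the root label flips from a weak False to True because the strongly believed explanation E_T entails it, which mirrors the label flipping described above.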

### Empirical Results

The speaker reports improved performance on the CommonsenseQA 2.0, CREAK, and Com2Sense benchmarks when using Maieutic prompting compared to other prompting methods such as chain-of-thought and self-consistency. Maieutic prompting even outperforms supervised models built on T5, suggesting the power of this inference-time algorithm.

In [1]:
# Example: Implementing a simplified Maieutic Prompting (Conceptual)

def get_llm_response(question, model="GPT-3"):
    """Return a canned answer standing in for a real LLM call (`model` is unused here)."""
    if question.startswith("Why is it True that"):
        if "travel West" in question:
            return "Because the Earth is round."
        elif "butterflies fly" in question:
            return "Because they have four wings."
    elif question.startswith("Why is it False that"):
        if "travel West" in question:
            return "Because you'll fall off the edge."
        elif "butterflies fly" in question:
            return "Because butterflies can't fly."
    elif question.startswith("Does the following support True:"):
        # Only the "Earth is round" explanation is treated as supporting True
        if "Earth is round" in question:
            return "Yes"
        else:
            return "No"
    elif question.startswith("Does the following support False:"):
        # The simulated model never endorses an explanation for False
        return "No"
    return ""  # No canned answer for this prompt

def maieutic_prompting(question, answer):
    """Run one non-recursive round: generate E(T) and E(F), then check each.

    `answer` is the proposed label; this simplified version does not use it yet.
    """
    # Drop the trailing "?" so the prompts below do not end in "??"
    statement = question.rstrip("?")
    # Step 2: E(T)
    explanation_true = get_llm_response(f"Why is it True that {statement}?", model="GPT-3")
    print(f"E(T): {explanation_true}")
    # Step 3: E(F)
    explanation_false = get_llm_response(f"Why is it False that {statement}?", model="GPT-3")
    print(f"E(F): {explanation_false}")
    # Step 4: Check Logical Consistency (simplified)
    consistency_true = get_llm_response(f"Does the following support True: {explanation_true}?", model="GPT-3")
    consistency_false = get_llm_response(f"Does the following support False: {explanation_false}?", model="GPT-3")
    print(f"Consistency True: {consistency_true}")
    print(f"Consistency False: {consistency_false}")
    # Further steps (recursion, confidence, etc.) would be implemented here

# Example Usage
question = "If you travel West far enough from the West Coast, you will reach the East Coast or not?"
maieutic_prompting(question, "True")

E(T): Because the Earth is round.
E(F): Because you'll fall off the edge.
Consistency True: Yes
Consistency False: No


This simplified Python code provides a conceptual outline of Maieutic Prompting. It showcases how the initial explanations are generated and how logical consistency can be verified. Further development can extend this framework to realize a complete Maieutic Inference Tree with sophisticated confidence and consistency checks.

## 3. Symbolic Knowledge Distillation: From Language Models to Causal Commonsense Models

The lecture then transitions to the topic of symbolic knowledge distillation, a technique to convert general language models into causal commonsense models. The motivation behind this is that despite impressive performance on leaderboards, state-of-the-art models are often brittle and make strange mistakes when faced with adversarial or out-of-domain examples.  They tend to learn surface patterns rather than true understanding.

### Systematic Generalization Problem

This brittleness is referred to as the *systematic generalization problem*. Unlike humans who conceptually understand how the world works, transformers learn surface patterns that are powerful but not robust.

### Commonsense Definition

The lecture defines commonsense as the basic level of practical knowledge and reasoning concerning everyday situations and events that are *commonly shared* among most people.  It emphasizes that commonsense is not universal knowledge but is shared across a large population, acknowledging that additional context can change what is considered commonsensical.

**Example:**
*   "It's OK to keep the closet door open."
*   "It's NOT OK to keep the fridge door open (because the food might go bad)."

However, the speaker notes exceptions to the rules (e.g., fridge in a store, friend's house rules).

### ATOMIC Knowledge Graph

The speaker introduces ATOMIC, a symbolic commonsense knowledge graph, which was initially fully crowdsourced.  ATOMIC contains *if-then* knowledge about social interactions and physical entities.  For example, "If X gets their car repaired," then:

*   Effect: "X might call Uber/Lyft."
*   Need: "X needs a mechanic and money."

ATOMIC also captures physical entity-centric knowledge and counterfactual conditions.
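
If-then knowledge of this kind is naturally stored as (head, relation, tail) triples. The sketch below shows one possible in-memory layout; `Triple` and `index_by_head` are hypothetical names, and the relation labels follow these notes ("Effect", "Need") rather than the dataset's exact schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Triple:
    head: str      # the "if" event
    relation: str  # the inference type, e.g. "Effect" or "Need"
    tail: str      # the "then" inference


def index_by_head(triples: Iterable[Triple]) -> Dict[str, Dict[str, List[str]]]:
    """Group tails by head event and relation for quick lookup."""
    index: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for triple in triples:
        index[triple.head][triple.relation].append(triple.tail)
    return index


triples = [
    Triple("X gets their car repaired", "Effect", "X might call Uber/Lyft"),
    Triple("X gets their car repaired", "Need", "X needs a mechanic and money"),
]

graph = index_by_head(triples)
print(graph["X gets their car repaired"]["Need"])
# ['X needs a mechanic and money']
```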

### Symbolic Knowledge Distillation Process

The core idea of symbolic knowledge distillation is to distill knowledge from a large language model (teacher) into a smaller, more focused commonsense model (student). The goal is to create a student model that is *better* than the teacher, even though standard knowledge distillation often results in a worse model.

The speaker introduces a *funnel* concept in the process. 

### Critique Model

The *critic* plays a crucial role in symbolic knowledge distillation. A RoBERTa model is used as a critic to evaluate the quality of the knowledge generated by the teacher (GPT-3). The critic is trained on a labeled dataset to identify whether machine-generated knowledge is correct or not. The critic aggressively filters out bad knowledge, even if it means discarding some good knowledge. By filtering aggressively, the overall accuracy and diversity of the distilled knowledge are improved.
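
The trade-off behind aggressive filtering can be illustrated with a score threshold: raising it removes more bad statements but also discards some good ones. The critic scores and correctness labels below are made up for illustration, and `evaluate` is a hypothetical helper, not part of any library.

```python
from typing import List, Tuple

# (statement, critic score in [0, 1], whether a human would call it correct)
Candidate = Tuple[str, float, bool]


def evaluate(candidates: List[Candidate], threshold: float) -> Tuple[int, float, float]:
    """Return (number kept, precision, recall) at the given critic threshold."""
    kept = [c for c in candidates if c[1] >= threshold]
    correct_kept = sum(1 for c in kept if c[2])
    total_correct = sum(1 for c in candidates if c[2])
    precision = correct_kept / len(kept) if kept else 0.0
    recall = correct_kept / total_correct if total_correct else 0.0
    return len(kept), precision, recall


candidates = [
    ("X pays the mechanic", 0.97, True),
    ("X drives the car home", 0.92, True),
    ("X calls a taxi", 0.70, True),
    ("X eats a sandwich", 0.55, False),
    ("X folds the car into origami", 0.10, False),
]

for threshold in (0.5, 0.9):
    kept, precision, recall = evaluate(candidates, threshold)
    print(f"threshold={threshold}: kept={kept}, "
          f"precision={precision:.2f}, recall={recall:.2f}")
# threshold=0.5: kept=4, precision=0.75, recall=1.00
# threshold=0.9: kept=2, precision=1.00, recall=0.67
```

At the stricter threshold every kept statement is correct, at the cost of losing one correct statement.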

### Results: Machine-Authored vs. Human-Authored Knowledge

The surprising finding is that machine-authored knowledge, when filtered by the critic, can be *better* than human-authored knowledge in terms of scale, accuracy, and diversity. This is attributed to the combination of GPT-3's broad knowledge and RoBERTa's ability to filter out inconsistencies. The distilled knowledge is then used to train a downstream neural commonsense model, achieving performance close to 90% on causal reasoning tasks.



In [2]:
# Simplified example of Knowledge Distillation with a Critic (Conceptual)

import random

def teacher_generate(prompt):
    """Simulates a Teacher Language Model generating knowledge."""
    if "car repaired" in prompt:
        options = [
            "Person calls a taxi.",
            "Person pays the bill.",
            "Person folds the car into origami.",  # Incorrect
            "Person eats a sandwich.",  # Irrelevant
        ]
        return random.choice(options)
    return ""

def critic_evaluate(statement):
    """Simulates a Critic model evaluating the generated knowledge."""
    if "origami" in statement:
        return False  # Flag as incorrect
    if "sandwich" in statement:
        return False  # Flag as irrelevant
    return True

def knowledge_distillation(prompt, num_samples=10):
    """Simulates Knowledge Distillation with a Teacher and Critic."""
    knowledge = []
    for _ in range(num_samples):
        generated = teacher_generate(prompt)
        if generated and critic_evaluate(generated):
            knowledge.append(generated)
    return knowledge

# Example usage
prompt = "If a person gets their car repaired..."
distilled_knowledge = knowledge_distillation(prompt, num_samples=10)
print("Distilled Knowledge:")
for k in distilled_knowledge:
    print(f"- {k}")

Distilled Knowledge:
- Person calls a taxi.
- Person calls a taxi.
- Person pays the bill.
- Person pays the bill.
- Person calls a taxi.
- Person pays the bill.


This code offers a simplified illustration of knowledge distillation enhanced with a critic. The critic makes the distillation more robust by removing irrelevant or incorrect statements, so fewer than `num_samples` statements usually remain. Because the teacher samples at random, the exact output varies from run to run. The key insight is that a discerning filter (the critic) results in a higher-quality distilled knowledge base.

## 4. Commonsense Morality: Aligning AI with Human Values

The final section of the lecture addresses the critical topic of commonsense morality, focusing on how to align AI systems with human values. The speaker emphasizes that language models are already making judgments with moral implications, even if they are not explicitly designed to do so. The widespread deployment of LLMs necessitates careful consideration of these moral implications.

### Moral Implications of Language Models

The lecture uses the example of a system making moral decisions about whether an action is "OK" or "wrong." Moral decision-making requires weighing different values that are often at odds.  The system needs to understand context and nuances to make appropriate judgments.

**Example:**
*   "Killing a bear is wrong."
*   "Killing a bear to save your child is OK."
*   "Killing a bear to please your child is wrong."
*   "Exploding a nuclear bomb to save your child?"  Delphi incorrectly answers "OK" (demonstrates potential flaws).

A knife and cheeseburger example is also used to highlight the importance of physical reasoning. "Stabbing someone *with* a cheeseburger" is less harmful than "Stabbing someone *over* a cheeseburger", because the former implies the cheeseburger is the weapon, which is less dangerous.

### Delphi: A Commonsense Moral Model

The speaker introduces Delphi, a commonsense moral model built to make ethical judgments. Delphi is trained on a dataset of ethical judgments on everyday situations. While performance is strong on the training data, the speaker emphasizes that this is just a small step toward aligning AI with human values.

### Commonsense Norm Bank

Delphi is trained on the *Commonsense Norm Bank*, which includes 1.7 million ethical judgments compiled from five existing datasets.  The key datasets are Social Chemistry and Social Bias Frames.

### Social Chemistry and Social Bias Frames

*Social Chemistry* addresses the challenge of ensuring that AI models adhere to social norms and expectations.  The goal is to prevent models from generating text that is rude, offensive, or socially inappropriate.

*Social Bias Frames* teaches the model to avoid racism and sexism. This is essential to prevent AI systems from perpetuating harmful stereotypes and biases.

**Example of GPT-3's dubious morality (before training):**
*   "Running a blender at 5:00 AM is rude... unless you're making a smoothie."
*   "It's OK to post fake news if it's in the interest of the people." (Demonstrates picking up questionable moral stances).

### Delphi Hybrid: Neural-Symbolic Reasoning

The speaker introduces a new version of Delphi that incorporates neural-symbolic reasoning to address major mistakes. This involves parsing queries into smaller events, checking commonsense inferences, and using a graph of reasoning to make more informed decisions. This is reminiscent of the Maieutic graph.

**Example:**
*   The original system incorrectly answered, "Genocide is OK if you're creating jobs." The hybrid system uses commonsense reasoning to realize that genocide is inherently wrong, regardless of the potential benefits.

### AI Safety, Equity, and Morality

The speaker emphasizes that AI safety, equity, and morality are interconnected challenges, and advocates value pluralism: respecting different cultures and individual preferences rather than imposing a single moral framework. Collaboration across AI, humanities, philosophy, psychology, and policymaking is critical. The system should learn the *distribution* of opinions rather than a single 'correct' answer, and should recognize that disagreement on some questions is acceptable.
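
Learning a distribution of opinions, as opposed to a single label, can be illustrated by keeping the share of annotators behind each judgment instead of collapsing them to a majority vote. The annotator judgments below are invented for illustration and `opinion_distribution` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List


def opinion_distribution(judgments: List[str]) -> Dict[str, float]:
    """Return the share of annotators behind each judgment label."""
    if not judgments:
        return {}
    counts = Counter(judgments)
    total = len(judgments)
    return {label: count / total for label, count in counts.items()}


# Five hypothetical annotators judging the same situation.
judgments = ["OK", "OK", "wrong", "OK", "wrong"]

print(opinion_distribution(judgments))  # {'OK': 0.6, 'wrong': 0.4}
```

A majority vote would report only "OK" here and hide the fact that two of five annotators disagreed.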

In [3]:
# A simplified model of ethical reasoning

def is_generally_ok(action):
    """Placeholder for a commonsense check on actions."""
    # Example: certain actions are almost always wrong
    if "genocide" in action.lower():
        return False
    if "torture" in action.lower():
        return False
    return True  # Defaults to 'OK'


def assess_situation(action, context):
    """Simulates ethical assessment considering context."""
    if not is_generally_ok(action):
        return "Wrong (Generally Unacceptable)"

    # Prioritize saving your child
    if "save your child" in context.lower() and "killing a bear" in action.lower():
        return "OK (Justified to protect life)"
    if "please your child" in context.lower() and "killing a bear" in action.lower():
        return "Wrong (Not Justified)"
    return ""


def delphi_lite(action, context=""):
    """Simplified ethical advisor model."""
    result = assess_situation(action, context)
    if result:
        return result
    # Defaults:
    if "killing" in action.lower():
        return "Wrong"
    return "OK"

# Example usage
action1 = "Killing a bear"
context1 = ""
print(f"{action1}: {delphi_lite(action1, context1)}")

action2 = "Killing a bear"
context2 = "to save your child"
print(f"{action2} {context2}: {delphi_lite(action2, context2)}")

action3 = "Genocide"
context3 = "to create jobs"
print(f"{action3} {context3}: {delphi_lite(action3, context3)}")

action4 = "Posting fake news"
context4 = "to help my friend win election"
print(f"{action4} {context4}: {delphi_lite(action4, context4)}")


Killing a bear: Wrong
Killing a bear to save your child: OK (Justified to protect life)
Genocide to create jobs: Wrong (Generally Unacceptable)
Posting fake news to help my friend win election: OK


This Python code is a highly simplified model of ethical reasoning. It first rejects actions that are almost always wrong, then applies the context-specific rules, and finally falls back to defaults. Note the last result: because the fallback is "OK", the model approves posting fake news, the same kind of questionable judgment discussed above. Testing scenarios with competing values shows how those conflicts affect outcomes and where hand-written rules fall short.

## 5. Open Questions and Future Directions

The lecture concludes with a discussion of open questions and future directions. Key topics include:

*   **The Role of Legal Recourse:** The potential for using legal case law as training data for moral reasoning.
*   **Data Quality vs. Model Size:** The importance of data quality and algorithms in achieving better performance, particularly for smaller models.
*   **The Value of Critique Models:** The need for more investment in critique models to improve the quality and safety of generated content.
*   **Balancing Pluralism with Ethical Boundaries:** How to balance the value of pluralism with the need to establish ethical boundaries and prevent the endorsement of harmful viewpoints.
*   **The Importance of Interdisciplinary Collaboration:** The need for collaboration across AI, humanities, philosophy, psychology, and policymaking to address the complex challenges of AI safety, equity, and morality.

The speaker also mentions the need for data sharing, especially regarding human feedback on toxicity and morality concerns, to enable meaningful progress as a community.

This comprehensive notebook summarizes the key concepts and examples from the Stanford CS25 lecture on Common Sense Reasoning, providing an in-depth understanding of the challenges and opportunities in this critical field.