# Conditional Acceptability in Large Language Models and Humans - LLM Prompting
This notebook tests how Large Language Models (LLMs) handle **conditional statements**. We test performance across two tasks:

## Part 1: Conditional Probability — “Suppose A. How probable is B?”
Conditionals from Skovgaard-Olsen et al. (2016) are split into:

* **A**: supposition

* **B**: outcome

Prompt format:
**“Suppose A. How probable is B?”**

This assesses how LLMs estimate raw conditional probability. We'll compare these responses to those of Part 2 to explore whether probability alone drives the model's reasoning about conditionals.

## Part 2: Conditional Statements — “If A, then B”
Here, the model is prompted to evaluate full conditionals:

* **“How probable is 'If A, then B'?”**

* **“How acceptable is 'If A, then B'?”**

By comparing these outputs to those from Part 1, we assess whether the model treats full conditionals differently or similarly, helping us infer whether LLMs rely on conditional probability, evidentiality, or a blend of both.
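The three question types above can be sketched as plain string templates. This is only an illustration: the helper name `build_questions` is hypothetical, the letters A and B stand in for real sentences, and the notebook's actual prompts are assembled from template files rather than by this function.

```python
def build_questions(a: str, b: str) -> dict:
    """Return the three question types for antecedent `a` and consequent `b`.

    The keys match the metric names used in the experiment settings.
    """
    conditional = f"If {a}, then {b}"
    return {
        "cprob": f"Suppose {a}. How probable is {b}?",    # Part 1
        "ifprob": f"How probable is '{conditional}'?",     # Part 2
        "ifacc": f"How acceptable is '{conditional}'?",    # Part 2
    }

# Placeholder letters stand in for real sentences
questions = build_questions("A", "B")
for metric, question in questions.items():
    print(f"{metric}: {question}")
```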

## Preparation

### Setup
We begin by installing the necessary packages and logging in to the Hugging Face Hub.

In [None]:
# Install necessary packages (accelerate is required for device_map="auto")
!pip install transformers torch huggingface_hub accelerate

# Import libraries
from google.colab import files
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import notebook_login
import torch
import json

# Log in to HuggingFace to access models
notebook_login()

### Loading a Model
Next we load the language model under test: Meta's **Llama 3.1 8B Instruct**.

In [None]:
# Empty the GPU cache
torch.cuda.empty_cache()

# Define the model and load it
model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Load the tokenizer
model_tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

## Experiment Setup

We define key parameters to configure the experiment:

* `part`: ***task type***
  * `1`: *“Suppose A. How probable is B?”*
  * `2`: *“If A, then B”*

* `metric`: ***evaluation target***
  * Automatically set to `"cprob"` for Part 1
  * For Part 2, manually select:
    * `"ifprob"`: probability
    * `"ifacc"`: acceptability

* `context`: ***context inclusion***
  * `True`: include background context
  * `False`: omit context

* `style`: ***prompt style***
  * `"vanilla"`: basic
  * `"fewshot"`: with examples
  * `"cot"`: chain-of-thought

* `n_samples`: ***number of generations per prompt***

---

> **Filenames are automatically generated** based on the selected parameters: one for fetching the prompt template, one for saving model outputs.
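To make the naming scheme concrete, here is a minimal sketch that derives the metric and both filenames from the settings. The function `build_filenames` and its `part2_metric` argument are hypothetical names introduced for illustration; the filename patterns follow the ones used in this notebook.

```python
def build_filenames(part: int, context: bool, style: str, part2_metric: str = "ifacc"):
    """Derive the metric and both filenames from the experimental settings."""
    if part not in (1, 2):
        raise ValueError("part must be 1 or 2")
    # Part 1 always uses "cprob"; Part 2 uses the manually chosen metric
    metric = "cprob" if part == 1 else part2_metric
    context_tag = "context" if context else "no_context"
    prompt_file = f"{metric}_{style}_prompt.txt"
    results_file = f"results_{metric}_{context_tag}_{style}.jsonl"
    return metric, prompt_file, results_file

print(build_filenames(part=1, context=True, style="vanilla"))
print(build_filenames(part=2, context=False, style="cot", part2_metric="ifprob"))
```

Note that the prompt template name depends only on the metric and the style, while the results filename also records whether context was included.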




In [None]:
# Define experimental settings
part = 1                     # Options: 1 = "Suppose A. How probable is B?", 2 = "If A, then B"
user_choice = "ifacc"        # Part 2 metric. Options: "ifacc", "ifprob" (ignored for Part 1)
metric = "cprob" if part == 1 else user_choice   # user_choice must be defined before this line
context = True              # Options: True = include context, False = exclude context
style = "vanilla"           # Options: "vanilla", "fewshot", "cot"
n_samples = 5               # Number of model samples per prompt

# Necessary filenames
prompt_file = f"{metric}_{style}_prompt.txt"
results_file = f"results_{metric}_context_{style}.jsonl" if context else f"results_{metric}_no_context_{style}.jsonl"

## Loading the Prompts

This function constructs the full set of prompts by combining:

* A **base prompt template** (from a `.txt` file)

* Task-specific elements from `statements.jsonl`:
  * **Part 1**: separate fields for `supposition` (A) and `outcome` (B)
  * **Part 2**: full conditional statements

* Optional **context fields**, if `context=True`
* An **example context**, used only if `style` is `"fewshot"` or `"cot"` and `context=True`

The function returns:
* `prompts`: a list of formatted prompts ready for model input
* `sample_numbers`: statement indices
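The assembly of a single prompt can be sketched as a small pure function. This is an illustrative reduction rather than the notebook's loader: `build_prompt` and its parameter names are hypothetical, and the record values are placeholders, not dataset entries. The field labels (`Context`, `Assumption`, `Sentence`, `Statement`) follow the ones used in this notebook.

```python
def build_prompt(template: str, record: dict, part: int,
                 use_context: bool, example_context: str = "") -> str:
    """Fill the template and append the fields of one `statements.jsonl` record."""
    prompt = template.replace("{example_context}", example_context)
    fields = []
    if use_context:
        fields.append(f"Context: {record['context']}")
    if part == 1:
        fields.append(f"Assumption: {record['supposition']}")
        fields.append(f"Sentence: {record['outcome']}")
    elif part == 2:
        fields.append(f"Statement: {record['statement']}")
    else:
        raise ValueError("part must be 1 or 2")
    # The data fields follow the template after one empty line
    return prompt + "\n\n" + "\n".join(fields)

# Placeholder record; the field values are not taken from the dataset
record = {"sample_number": 1, "context": "<context>", "supposition": "<A>",
          "outcome": "<B>", "statement": "If <A>, then <B>"}
print(build_prompt("<template text>", record, part=1, use_context=True))
```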

In [None]:
def load_prompts():
  # Load the base prompt structure from textfile
  with open(prompt_file, "r", encoding = "utf8") as file:
    prompt_template = file.read()

  # List to store sample indices
  sample_numbers = []

  # Initialize input fields depending on the selected part
  if part == 1:
    suppositions = []   # "A" in "Suppose A. How probable is B?"
    outcomes = []       # "B" in the same structure
  elif part == 2:
    statements = []     # Full conditional statements: "If A, then B"
  else:
    raise ValueError("Variable 'part' must be either 1 or 2.")

  # Initialize context list only if context is used
  if context:
    contexts = []

  # Load the data entries (sentences and optional context) from file
  with open("statements.jsonl", "r", encoding = "utf8") as file:
    for line in file:
      obj = json.loads(line)

      # Collect sample index
      sample_numbers.append(obj["sample_number"]) # index

      # Add relevant data based on the selected task part
      if part == 1:
        suppositions.append(obj["supposition"])
        outcomes.append(obj["outcome"])
      else:
        statements.append(obj["statement"])

      # Optionally load contextual information
      if context:
        contexts.append(obj["context"])


  # Load the example context if both the prompt style and the context setting call for it; otherwise use an empty string
  if style in ("fewshot", "cot") and context:
    with open("example_context.txt", "r", encoding = "utf8") as file:
      example_context = file.read()
  else:
    example_context = ""


  # Generate full prompts using the loaded data
  prompts = []

  for i in range(len(sample_numbers)):
    full_prompt = prompt_template
    full_prompt = full_prompt.replace("{example_context}", example_context)       # Replace example context placeholder

    if context:
      # Add data fields according to selected part
      if part == 1:
        full_prompt += f"\n\nContext: {contexts[i]}\nAssumption: {suppositions[i]}\nSentence: {outcomes[i]}"
      else:
        full_prompt += f"\n\nContext: {contexts[i]}\nStatement: {statements[i]}"

    else:
      # Add data fields according to selected part
      if part == 1:
        full_prompt += f"\n\nAssumption: {suppositions[i]}\nSentence: {outcomes[i]}"
      else:
        full_prompt += f"\n\nStatement: {statements[i]}"

    # Store the finalized prompt
    prompts.append(full_prompt)

  # Return both the prompts and their sample indices
  return prompts, sample_numbers

## Safety Checks (Optional)
Before running the model on the entire dataset, we can manually inspect a few instances to ensure everything is functioning as expected.

This helps verify:
1. **Prompt Formatting:** Are the prompts structured correctly?
2. **Model Behaviour:** Does the model respond appropriately to the prompts?
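Besides inspecting prompts by eye, a structural check of the input records can catch missing fields before any prompt is built. The sketch below is an optional addition, not part of the original workflow: `required_fields` and `find_incomplete_records` are hypothetical helpers, and the field names are the ones the loader reads from `statements.jsonl`.

```python
import json

def required_fields(part: int, use_context: bool) -> set:
    """Field names the loader reads for the given settings."""
    fields = {"sample_number"}
    fields |= {"supposition", "outcome"} if part == 1 else {"statement"}
    if use_context:
        fields.add("context")
    return fields

def find_incomplete_records(lines, part: int, use_context: bool) -> list:
    """Return (line_number, missing_fields) for every incomplete JSONL line."""
    needed = required_fields(part, use_context)
    problems = []
    for line_number, line in enumerate(lines, 1):
        missing = needed - set(json.loads(line))
        if missing:
            problems.append((line_number, sorted(missing)))
    return problems

# Placeholder line that lacks the "context" field
sample_lines = ['{"sample_number": 1, "supposition": "<A>", "outcome": "<B>"}']
print(find_incomplete_records(sample_lines, part=1, use_context=True))
```

In practice, the `lines` argument could be the open file handle for `statements.jsonl`, since iterating over a text file yields one line at a time.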


In [None]:
def safety_check(level: str):      # Which check to run: "format" (prompt formatting) or "behaviour" (model behaviour)

    prompts, _ = load_prompts()

    if level == "format":     # Print the first few prompts to confirm they are assembled correctly
        for i, prompt in enumerate(prompts[:3], 1):
            print(f"[{i}] {prompt}\n")

    elif level == "behaviour":      # Run the model on the first few prompts to check that it responds sensibly
        for i, prompt in enumerate(prompts[:3], 1):
            messages = [{"role": "user", "content": prompt}]
            # apply_chat_template returns a dict of tensors (input_ids, attention_mask); move it to the model's device
            inputs = model_tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True, add_generation_prompt=True).to(model.device)
            outputs = model.generate(**inputs, max_new_tokens=50)
            print(f"[{i}] {model_tokenizer.decode(outputs[0], skip_special_tokens=True)}\n")
    else:
        raise ValueError("Choose 'format' (prompt format) or 'behaviour' (model behaviour) as safety check level.")

In [None]:
# Use: "format" for prompt formatting or "behaviour" for model behaviour

safety_check("format")
safety_check("behaviour")

## Running the Model

This function runs the model on the complete list of prompts. This step will:

* generate a response for each prompt in `prompts`
  * sample `n_samples` responses per prompt (five by default) to reduce the impact of outliers
* save the prompt and its outputs to a `.jsonl` file for later evaluation

> *A safeguard is included to ensure the prompt format (task type: part 1 or 2) is correct before starting the process.*
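The layout of the results file can be sketched without a GPU by swapping the model call for a stand-in. Everything named here (`write_results`, `read_results`, `fake_generate`, the demo filename) is hypothetical; only the record keys `sample_number`, `prompt`, and `output` are taken from this notebook.

```python
import json
import os
import tempfile

def write_results(path, prompts, sample_numbers, generate, n_samples):
    """Write one JSON object per prompt, holding all of its sampled outputs."""
    with open(path, "w", encoding="utf8") as file:
        for number, prompt in zip(sample_numbers, prompts):
            outputs = [generate(prompt) for _ in range(n_samples)]
            record = {"sample_number": number, "prompt": prompt, "output": outputs}
            file.write(json.dumps(record) + "\n")

def read_results(path):
    """Read the file back into a list of dictionaries, skipping empty lines."""
    with open(path, "r", encoding="utf8") as file:
        return [json.loads(line) for line in file if line.strip()]

# Stand-in for the model call so the sketch runs without a GPU
def fake_generate(prompt):
    return f"<model output for: {prompt}>"

path = os.path.join(tempfile.mkdtemp(), "results_demo.jsonl")
write_results(path, ["<prompt 1>", "<prompt 2>"], [1, 2], fake_generate, n_samples=3)
records = read_results(path)
print(len(records), len(records[0]["output"]))
```

Keeping all samples for a prompt in one list under `output` means a later evaluation step can aggregate per prompt without regrouping lines.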

In [None]:
# Safeguard to make sure we are working with the correct set of prompts
def correct_prompts_safeguard(prompts):
  if part == 1:                                 # if we want to prompt the model on raw conditional probability:
    if "Suppose" not in prompts[0]:         # make sure that first prompt has the format "Suppose..." somewhere in it
      raise ValueError("These are not the right prompts for part 1 (expected 'Suppose').")

  elif part == 2:                                 # if we want to prompt the model on If A, then B statements:
    if "If" not in prompts[0]:                # make sure that first prompt has the format "If..." somewhere in it
      raise ValueError("These are not the right prompts for part 2 (expected 'If').")

  else:
    raise ValueError("Invalid part selection. Part variable must be either 1 or 2.")


# Run the model on each prompt and save the results
def run_model(prompts, sample_numbers):

  with open(results_file, "w", encoding = "utf8") as file:
    for i, prompt in enumerate(prompts):
      messages = [{"role": "user", "content": prompt}]

      # Tokenize once per prompt; the same inputs are reused for every sample
      inputs = model_tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True, add_generation_prompt=True).to(model.device)
      responses = []

      for _ in range(n_samples):
        outputs = model.generate(**inputs, max_new_tokens=50)
        # outputs[0] holds the prompt tokens followed by the generated tokens, so the decoded text includes both
        output_text = model_tokenizer.decode(outputs[0], skip_special_tokens=True)
        responses.append(output_text)

      # One JSON object per line: sample index, prompt, and all sampled outputs
      json.dump({"sample_number": sample_numbers[i], "prompt": prompt, "output": responses}, file)
      file.write("\n")

  files.download(results_file) # automatically download the output file after completion

## Execution Block

1. Load the prompts and sample numbers
2. Validate the prompts using the safeguard
3. If the prompts pass the validation, run the model on them

In [None]:
prompts, sample_numbers = load_prompts()  # Load prompts and sample numbers

correct_prompts_safeguard(prompts)  # Check if prompts are valid
run_model(prompts, sample_numbers)  # If so, run the model