
# CoT Factuality Debug Notebook

This notebook helps you:
1. **Debug** your extraction pipeline (`01_extract_vectors.py`) in a cell-by-cell manner.
2. **Run** optional evaluation scripts (`02_steer_eval.py`, `03_patch_eval.py`) if present.
3. **Compute a factual consistency score** on **fact-oriented benchmarks** (FEVER or TruthfulQA).

> Notes
> - The notebook is designed to run both with and without internet access. If you're offline, place datasets locally; otherwise Hugging Face `datasets` will fetch them.
> - If your tokenizer has no PAD token, we set `pad_token = eos_token` to suppress warnings.
> - You can **limit to a single sample** for debugging with `DEBUG_LIMIT = 1`.


In [None]:
import torch

path = "/common/users/sl2148/Public/yang_ouyang/projects/fact-enhancement/vectors/Qwen_Qwen2.5-7B-Instruct/layer_28.pt"

obj = torch.load(path, map_location="cpu")
print(type(obj))
if isinstance(obj, torch.Tensor):
    t = obj
elif isinstance(obj, dict):
    # 取第一个 tensor
    t = next(v for v in obj.values() if isinstance(v, torch.Tensor))
else:
    raise ValueError("Unsupported pt format")

print("shape:", t.shape)
print("dtype:", t.dtype)


<class 'torch.Tensor'>


AttributeError: 'Tensor' object has no attribute 'keys'

In [1]:

# --- 0) Environment & versions ---
import sys, os, json, math, random, time
from pathlib import Path

import torch
print("PyTorch:", torch.__version__)

try:
    import transformers
    print("Transformers:", transformers.__version__)
except Exception as e:
    print("Transformers not available:", e)

try:
    import datasets
    print("Datasets:", datasets.__version__)
except Exception as e:
    print("Datasets not available:", e)

try:
    import evaluate
    print("Evaluate:", evaluate.__version__)
except Exception as e:
    print("Evaluate not available:", e)

try:
    from tqdm.auto import tqdm
    TQDM_OK = True
except Exception as e:
    print("tqdm not available:", e)
    TQDM_OK = False


PyTorch: 2.9.0+cu128


  from .autonotebook import tqdm as notebook_tqdm


Transformers: 4.55.2
Datasets: 4.2.0
Evaluate not available: No module named 'evaluate'


In [2]:

# --- 1) Config ---
import yaml

CONFIG_PATH = "config.yaml"  # change if needed
DATA_PAIRS_PATH = "data/pairs.jsonl"  # change if needed
OUT_DIR = Path("out/debug_vectors")
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Debug controls
DEBUG_LIMIT = None  # set to an int like 1 to only run one sample

# Load config with sensible fallbacks
if Path(CONFIG_PATH).exists():
    cfg = yaml.safe_load(open(CONFIG_PATH, "r"))
else:
    # Minimal fallback config
    cfg = {
        "model_name": "meta-llama/Llama-2-7b-chat-hf",  # change to your local model
        "layers_to_probe": [0, 1, 2, 3],
        "token_span_first_step": 10,
        "prompt_method": "cot0shot"
    }

MODEL_NAME = cfg.get("model_name")
LAYERS_TO_PROBE = list(cfg.get("layers_to_probe", []))
TOKEN_SPAN_FIRST_STEP = int(cfg.get("token_span_first_step", 10))
PROMPT_METHOD = cfg.get("prompt_method", "cot0shot")

print("MODEL_NAME:", MODEL_NAME)
print("LAYERS_TO_PROBE:", LAYERS_TO_PROBE)
print("TOKEN_SPAN_FIRST_STEP:", TOKEN_SPAN_FIRST_STEP)
print("PROMPT_METHOD:", PROMPT_METHOD)


MODEL_NAME: meta-llama/Llama-3.1-8B-Instruct
LAYERS_TO_PROBE: [8, 12, 16, 20, 24]
TOKEN_SPAN_FIRST_STEP: 48
PROMPT_METHOD: cot0shot


In [3]:

# --- 2) Load model & tokenizer (with PAD fix) ---
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
if tok.pad_token_id is None and tok.eos_token_id is not None:
    tok.pad_token = tok.eos_token  # avoid padding warnings

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16 if torch.cuda.is_available() else None,
    device_map="auto",
)
model.eval()

print("Pad token id:", tok.pad_token_id, "| EOS token id:", tok.eos_token_id)
print("Model device:", model.device if hasattr(model, "device") else "sharded/accelerated")


Fetching 4 files: 100%|██████████| 4/4 [00:55<00:00, 13.84s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:05<00:00,  1.26s/it]

Pad token id: 128009 | EOS token id: 128009
Model device: cuda:0





In [4]:

# --- 3) Hidden-state extractor ---
import torch

@torch.inference_mode()
def get_hidden(prompt: str):
    """Return hidden states for each transformer layer (excluding embedding layer).
    Uses standard model **kwargs call; no manual inputs_embeds."
    """
    inputs = tok(prompt, return_tensors="pt", padding=True, truncation=True).to(model.device)
    outputs = model(**inputs, output_hidden_states=True)
    # outputs.hidden_states: tuple of length n_layers+1 [embeddings + each layer]
    hs = torch.stack(outputs.hidden_states)  # [n_layers+1, B, T, d]
    return hs[1:]  # drop embedding layer -> [n_layers, B, T, d]


In [5]:

# --- 4) Load pairs & preview ---
pairs = []
if Path(DATA_PAIRS_PATH).exists():
    with open(DATA_PAIRS_PATH, "r") as f:
        for i, line in enumerate(f):
            try:
                obj = json.loads(line)
                pairs.append(obj)
            except Exception as e:
                print(f"Bad JSON on line {i}: {e}")

if DEBUG_LIMIT is not None:
    pairs = pairs[:DEBUG_LIMIT]

print(f"Loaded {len(pairs)} pair(s). Example:")
if pairs:
    print(json.dumps(pairs[0], ensure_ascii=False, indent=2)[:800])
else:
    print("No pairs found. Put your data at", DATA_PAIRS_PATH)


Loaded 1319 pair(s). Example:
{
  "question": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
  "cot_step1_true": "Janet sells 16 - 3 - 4 = 9 duck eggs a day.",
  "cot_step1_false": "Janet sells 16 - 3 = 13duck eggs a day. The",
  "error_type": "overlook"
}


In [7]:

# --- 5) Extract vectors with progress bar & per-layer slicing ---
from collections import defaultdict

def select_token_span(tensor_3d, span_len):
    """tensor_3d: [B, T, d], return the last `span_len` tokens for B=1.

    Adjust as needed for 'first-step' token span definition.

    """
    B, T, d = tensor_3d.shape
    if T == 0:
        return tensor_3d
    if span_len is None or span_len >= T:
        return tensor_3d[:, :T, :]
    return tensor_3d[:, :span_len, :]  # take first span_len tokens

layer_vectors = []  # collect dicts per sample
pairs = pairs[:10]
iterator = tqdm(pairs, desc="Extracting", total=len(pairs)) if 'tqdm' in globals() and TQDM_OK else pairs
for item in iterator:
    # Expect item like: {"question": "...", "answer": "..."}; adapt if your schema differs.
    prompt = item.get("question") or item.get("prompt") or json.dumps(item, ensure_ascii=False)
    hs = get_hidden(prompt)  # [n_layers, B, T, d]
    n_layers, B, T, d = hs.shape

    # slice per requested layers
    entry = {
        "meta": {"question": item.get("question"), "answer": item.get("answer")},
        "layers": {},
    }
    for li in LAYERS_TO_PROBE:
        if li < 0 or li >= n_layers:
            continue
        # [B, T, d] -> select token span for "first step"
        span_h = select_token_span(hs[li], TOKEN_SPAN_FIRST_STEP)  # [1, span, d]
        entry["layers"][li] = {
            "shape": list(span_h.shape),
            "mean": span_h.mean(dim=(0,1)).float().cpu().tolist(),  # [d]
            # Save small summary to keep JSON lean; store .pt separately if needed
        }
    layer_vectors.append(entry)

# Write JSONL summary
out_jsonl = OUT_DIR / "vectors_summary.jsonl"
with open(out_jsonl, "w") as f:
    for e in layer_vectors:
        f.write(json.dumps(e) + "\n")
print("Saved:", out_jsonl)


Extracting: 100%|██████████| 10/10 [00:00<00:00, 10.06it/s]


Saved: out/debug_vectors/vectors_summary.jsonl


In [None]:

# --- 6) (Optional) Save raw span tensors for deeper analysis ---
SAVE_RAW_TENSORS = True
if SAVE_RAW_TENSORS and len(pairs) > 0:
    RAW_DIR = OUT_DIR / "raw_tensors"
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    # Example: save only for the first item to keep size reasonable
    prompt = pairs[0].get("question") or pairs[0].get("prompt") or json.dumps(pairs[0], ensure_ascii=False)
    hs = get_hidden(prompt)
    torch.save(hs.cpu(), RAW_DIR / "hidden_states_first_sample.pt")
    print("Raw hidden states saved to", RAW_DIR / "hidden_states_first_sample.pt")
else:
    print("Skipping raw tensor save. Toggle SAVE_RAW_TENSORS=True to enable.")


Skipping raw tensor save. Toggle SAVE_RAW_TENSORS=True to enable.



## Factual Consistency Score (FEVER / TruthfulQA)

We provide two options:

- **FEVER**: Given `(claim, evidence)` pairs, we score *entailment* using an NLI model (e.g., `roberta-large-mnli`).  
  The factual consistency score is the mean entailment probability.

- **TruthfulQA**: Compare the model's answer to the ground-truth via **exact match** and an **NLI-based** check.


In [10]:

# --- 8) NLI-based factual consistency (entailment prob) ---
from typing import List, Tuple, Dict

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def load_nli(model_name: str = "roberta-large-mnli"):
    nli_tok = AutoTokenizer.from_pretrained(model_name)
    nli_model = AutoModelForSequenceClassification.from_pretrained(model_name, device_map="auto")
    nli_model.eval()
    return nli_tok, nli_model

@torch.inference_mode()
def entailment_probability(nli_tok, nli_model, premise: str, hypothesis: str) -> float:
    enc = nli_tok(premise, hypothesis, return_tensors="pt", truncation=True, padding=True).to(nli_model.device)
    out = nli_model(**enc)
    # MNLI label order: [contradiction, neutral, entailment]
    probs = out.logits.softmax(dim=-1).squeeze(0).tolist()
    return float(probs[-1])  # entailment prob


In [None]:

# --- 10) TruthfulQA loader + simple scoring ---
def load_truthfulqa(split="validation"):
    import datasets as ds
    # truthful_qa has multiple configs; we'll use 'generation' version
    dset = ds.load_dataset("truthful_qa", "generation", split=split)
    # fields include: 'question', 'best_answer', 'correct_answers', 'incorrect_answers'
    return dset

def truthfulqa_score(nli_tok, nli_model, model_answers: List[str], gt_answers: List[str]):
    """Compute a mixed score:
      - Exact match rate (case-insensitive) against any correct answer
      - NLI entailment prob of model answer w.r.t. the best_answer (as hypothesis true given premise best_answer)
    Returns dict with 'exact_match' and 'entailment_mean'.
    """
    exact = 0
    entail_scores = []
    for pred, gt in zip(model_answers, gt_answers):
        # exact match vs a single ground truth string (could extend to set membership)
        if pred.strip().lower() == gt.strip().lower():
            exact += 1
        # NLI: premise=gt, hypothesis=pred
        ent = entailment_probability(nli_tok, nli_model, premise=gt, hypothesis=pred)
        entail_scores.append(ent)
    n = max(1, len(model_answers))
    return {
        "exact_match": exact / n,
        "entailment_mean": (sum(entail_scores) / len(entail_scores)) if entail_scores else float("nan")
    }

print("TruthfulQA helpers ready.")


TruthfulQA helpers ready.


In [18]:

# --- 12) Run TruthfulQA scoring (small demo with your model) ---
RUN_TRUTHFULQA = True  # set True to run
MAX_TQA = 100

def simple_generate(model, tok, question: str, max_new_tokens=64):
    inp = tok(question, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        out_ids = model.generate(**inp, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out_ids[0], skip_special_tokens=True)

if RUN_TRUTHFULQA:
    nli_tok, nli_model = load_nli("roberta-large-mnli")
    dset = load_truthfulqa("validation")
    qs = [dset[i]["question"] for i in range(min(MAX_TQA, len(dset)))]
    gts = [dset[i]["best_answer"] for i in range(min(MAX_TQA, len(dset)))]
    preds = [simple_generate(model, tok, q) for q in (tqdm(qs) if TQDM_OK else qs)]
    tqa_scores = truthfulqa_score(nli_tok, nli_model, preds, gts)
    print("[TruthfulQA] exact_match:", tqa_scores["exact_match"])
    print("[TruthfulQA] entailment_mean:", tqa_scores["entailment_mean"])
else:
    print("Skip TruthfulQA demo. Set RUN_TRUTHFULQA=True to execute.")


Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Generating validation split: 100%|██████████| 817/817 [00:00<00:00, 2127.18 examples/s]
  0%|          | 0/100 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:1280

[TruthfulQA] exact_match: 0.0
[TruthfulQA] entailment_mean: 0.1334468884533271



## CoT First-Step Extraction & Corruption (GSM8K-style)

Given GSM8K-like entries with an `answer` containing CoT, we extract the **first step** and **create a corrupted version** to probe factual consistency and accuracy.


In [19]:
pairs

[{'question': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
  'cot_step1_true': 'Janet sells 16 - 3 - 4 = 9 duck eggs a day.',
  'cot_step1_false': 'Janet sells 16 - 3 = 13duck eggs a day. The',
  'error_type': 'overlook'},
 {'question': 'A robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?',
  'cot_step1_true': 'It takes 2/2=1 bolt of white fiber. Ther',
  'cot_step1_false': 'It takes 1/2bolt of white fiber. Therefo',
  'error_type': 'hallucinate'},
 {'question': 'Josh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make?',
  'cot_step1_true': 'The cost of the house and repairs 