# Epistemic Probing: Cross-Model Analysis

This notebook analyzes epistemic transparency across 8 models (4 families × base/instruct).

**Key Question:** Probes on hidden states suggest that language models encode what they don't know. Does that knowledge show up in output entropy, and how does it change after fine-tuning and across architectures?

In [14]:
import sys
import io
from contextlib import redirect_stdout
import pandas as pd
import numpy as np

sys.path.insert(0, '.')
from analysis.loader import load_model_data
from analysis.effects import compute_roc_auc
from analysis.core import failure_mode_analysis

In [15]:
# Model metadata
MODELS = [
    'qwen_base', 'qwen_instruct',
    'mistral_base', 'mistral_instruct',
    'yi_base', 'yi_instruct',
    'llama_base', 'llama_instruct'
]

# family -> (architecture, training-data origin)
META = {
    'qwen': ('Custom', 'Chinese'),
    'mistral': ('Custom', 'English'),
    'yi': ('LLaMA-derived', 'Chinese'),
    'llama': ('LLaMA', 'English'),
}

## 1. Load All Models

In [16]:
# Load all model data
models_data = {}
for model in MODELS:
    f = io.StringIO()
    with redirect_stdout(f):
        models_data[model] = load_model_data(model, re_evaluate=True)
    print(f"Loaded {model}: {len(models_data[model].df)} samples")

Loaded qwen_base: 589 samples
Loaded qwen_instruct: 589 samples
Loaded mistral_base: 589 samples
Loaded mistral_instruct: 589 samples
Loaded yi_base: 589 samples
Loaded yi_instruct: 589 samples
Loaded llama_base: 589 samples
Loaded llama_instruct: 589 samples


## 2. Core Metrics Table

In [17]:
# Compute core metrics for all models
results = []
for model in MODELS:
    data = models_data[model]
    family = model.split('_')[0]
    variant = model.split('_')[1]
    
    # ROC/AUC
    f = io.StringIO()
    with redirect_stdout(f):
        roc = compute_roc_auc(data, print_output=False)
    
    # Hallucination detection
    ci = data.df[data.df['category'] == 'confident_incorrect']
    
    results.append({
        'model': model,
        'family': family.capitalize(),
        'variant': variant,
        'arch': META[family][0],
        'training': META[family][1],
        'entropy_auc': roc['entropy']['auc'],
        'probe_auc': roc['best_layer']['auc'],
        'hidden_info': roc['best_layer']['auc'] - roc['entropy']['auc'],
        'hall_det': ci['correct'].mean(),
        'mean_entropy': data.df['entropy'].mean(),
        'std_entropy': data.df['entropy'].std(),
        'overall_acc': data.df['correct'].mean(),
    })

core_df = pd.DataFrame(results)
core_df

,model,family,variant,arch,training,entropy_auc,probe_auc,hidden_info,hall_det,mean_entropy,std_entropy,overall_acc
0,qwen_base,Qwen,base,Custom,Chinese,0.763777,0.946349,0.182572,0.010101,4.056761,1.119518,0.344652
1,qwen_instruct,Qwen,instruct,Custom,Chinese,0.641253,0.945682,0.304428,0.585859,0.646161,0.458352,0.526316
2,mistral_base,Mistral,base,Custom,English,0.923253,0.970427,0.047174,0.060606,2.803168,1.794821,0.395586
3,mistral_instruct,Mistral,instruct,Custom,English,0.788595,0.9449,0.156305,0.282828,1.890673,0.846999,0.443124
4,yi_base,Yi,base,LLaMA-derived,Chinese,0.84522,0.94285,0.09763,0.010101,4.03922,1.385021,0.353141
5,yi_instruct,Yi,instruct,LLaMA-derived,Chinese,0.695343,0.929878,0.234535,0.191919,1.153487,0.307822,0.409168
6,llama_base,Llama,base,LLaMA,English,0.9352,0.958191,0.02299,0.070707,2.917209,1.850657,0.395586
7,llama_instruct,Llama,instruct,LLaMA,English,0.738617,0.943124,0.204507,0.686869,2.143775,0.772969,0.534805
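
The `hidden_info` column is the gap between two AUCs: how well a hidden-state probe separates the two outcome classes versus how well raw output entropy does. The internals of `compute_roc_auc` are not shown in this notebook, so the following is only a minimal sketch of the idea using the rank (Mann-Whitney) formulation of AUC. The function name `roc_auc`, the toy scores, and the choice of "incorrect answer" as the positive class are all assumptions for illustration.

```python
def roc_auc(scores, labels):
    """AUC via the rank (Mann-Whitney U) formulation.

    scores: higher means "more likely positive".
    labels: 1 for positive, 0 for negative.
    A tie between a positive and a negative counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration (assumption: label 1 = incorrect answer).
entropy = [0.4, 0.9, 1.1, 2.5, 3.0, 3.2]
probe_score = [0.1, 0.2, 0.7, 0.3, 0.8, 0.9]
incorrect = [0, 0, 1, 0, 1, 1]

entropy_auc = roc_auc(entropy, incorrect)
probe_auc = roc_auc(probe_score, incorrect)
hidden_info = probe_auc - entropy_auc  # same definition as the table column

print(f"entropy_auc={entropy_auc:.3f} probe_auc={probe_auc:.3f} hidden_info={hidden_info:.3f}")
```

A positive gap means the probe ranks examples better than entropy does, which is how this notebook reads "information present internally but not visible in entropy".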


## 3. Key Comparison: Training Data vs Architecture

In [18]:
# Base models only - the clean comparison
base_df = core_df[core_df['variant'] == 'base'][['family', 'arch', 'training', 'entropy_auc', 'probe_auc', 'hidden_info']]
base_df = base_df.sort_values('hidden_info')
base_df

,family,arch,training,entropy_auc,probe_auc,hidden_info
6,Llama,LLaMA,English,0.9352,0.958191,0.02299
2,Mistral,Custom,English,0.923253,0.970427,0.047174
4,Yi,LLaMA-derived,Chinese,0.84522,0.94285,0.09763
0,Qwen,Custom,Chinese,0.763777,0.946349,0.182572


In [19]:
# The critical test: Yi vs Llama (same architecture, different training)
yi_llama = base_df[base_df['family'].isin(['Yi', 'Llama'])]
print("Same LLaMA architecture, different training:")
print(yi_llama.to_string(index=False))
hidden = yi_llama.set_index('family')['hidden_info']
print(f"\nHidden info ratio: {hidden['Yi'] / hidden['Llama']:.1f}x")

Same LLaMA architecture, different training:
family          arch training  entropy_auc  probe_auc  hidden_info
 Llama         LLaMA  English      0.93520   0.958191      0.02299
    Yi LLaMA-derived  Chinese      0.84522   0.942850      0.09763

Hidden info ratio: 4.2x


In [20]:
# Group by training origin
print("Mean hidden info by training origin (base models):")
print(base_df.groupby('training')['hidden_info'].mean())

Mean hidden info by training origin (base models):
training
Chinese    0.140101
English    0.035082
Name: hidden_info, dtype: float64


## 4. Instruct Tuning Effects

In [21]:
# Compute deltas for each family
deltas = []
for family in ['Qwen', 'Mistral', 'Yi', 'Llama']:
    base = core_df[(core_df['family'] == family) & (core_df['variant'] == 'base')].iloc[0]
    inst = core_df[(core_df['family'] == family) & (core_df['variant'] == 'instruct')].iloc[0]
    
    deltas.append({
        'family': family,
        'training': base['training'],
        'entropy_auc_delta': inst['entropy_auc'] - base['entropy_auc'],
        'probe_auc_delta': inst['probe_auc'] - base['probe_auc'],
        'hidden_info_delta': inst['hidden_info'] - base['hidden_info'],
        'hall_det_delta': inst['hall_det'] - base['hall_det'],
        'mean_entropy_delta': inst['mean_entropy'] - base['mean_entropy'],
    })

delta_df = pd.DataFrame(deltas)
delta_df

,family,training,entropy_auc_delta,probe_auc_delta,hidden_info_delta,hall_det_delta,mean_entropy_delta
0,Qwen,Chinese,-0.122523,-0.000667,0.121856,0.575758,-3.4106
1,Mistral,English,-0.134659,-0.025527,0.109131,0.222222,-0.912495
2,Yi,Chinese,-0.149877,-0.012972,0.136905,0.181818,-2.885733
3,Llama,English,-0.196584,-0.015067,0.181517,0.616162,-0.773433


In [22]:
# Summary of instruct tuning effects
print("Mean effect of instruct tuning across all models:")
print(f"  Entropy AUC:  {delta_df['entropy_auc_delta'].mean():+.3f}")
print(f"  Probe AUC:    {delta_df['probe_auc_delta'].mean():+.3f}")
print(f"  Hidden Info:  {delta_df['hidden_info_delta'].mean():+.1%}")
print(f"  Hall. Det:    {delta_df['hall_det_delta'].mean():+.1%}")
print(f"  Mean Entropy: {delta_df['mean_entropy_delta'].mean():+.3f}")

Mean effect of instruct tuning across all models:
  Entropy AUC:  -0.151
  Probe AUC:    -0.014
  Hidden Info:  +13.7%
  Hall. Det:    +39.9%
  Mean Entropy: -1.996
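
Because `hidden_info` is defined as `probe_auc - entropy_auc`, its change under instruct tuning decomposes as the probe AUC delta minus the entropy AUC delta. The sketch below recomputes that decomposition from the values in the core metrics table (copied by hand, so treat the dictionary as an illustration rather than the pipeline's own data structure) to show which term dominates.

```python
# (base, instruct) values copied from the core metrics table above.
core = {
    "Qwen":    {"entropy_auc": (0.763777, 0.641253), "probe_auc": (0.946349, 0.945682)},
    "Mistral": {"entropy_auc": (0.923253, 0.788595), "probe_auc": (0.970427, 0.944900)},
    "Yi":      {"entropy_auc": (0.845220, 0.695343), "probe_auc": (0.942850, 0.929878)},
    "Llama":   {"entropy_auc": (0.935200, 0.738617), "probe_auc": (0.958191, 0.943124)},
}

decomposition = {}
for family, metrics in core.items():
    base_e, inst_e = metrics["entropy_auc"]
    base_p, inst_p = metrics["probe_auc"]
    d_entropy = inst_e - base_e
    d_probe = inst_p - base_p
    decomposition[family] = {
        "d_entropy": d_entropy,
        "d_probe": d_probe,
        # delta(hidden_info) = delta(probe_auc) - delta(entropy_auc)
        "d_hidden": d_probe - d_entropy,
    }

for family, d in decomposition.items():
    print(f"{family:8s} d_probe={d['d_probe']:+.3f}  "
          f"d_entropy={d['d_entropy']:+.3f}  d_hidden={d['d_hidden']:+.3f}")
```

On these numbers the entropy AUC drop is several times larger than the probe AUC drop in every family, so the growth in hidden info comes almost entirely from entropy becoming less informative rather than from the probe changing.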


## 5. Entropy Distribution

In [23]:
# Entropy stats by model
entropy_df = core_df[['model', 'variant', 'training', 'mean_entropy', 'std_entropy']].copy()
entropy_df

,model,variant,training,mean_entropy,std_entropy
0,qwen_base,base,Chinese,4.056761,1.119518
1,qwen_instruct,instruct,Chinese,0.646161,0.458352
2,mistral_base,base,English,2.803168,1.794821
3,mistral_instruct,instruct,English,1.890673,0.846999
4,yi_base,base,Chinese,4.03922,1.385021
5,yi_instruct,instruct,Chinese,1.153487,0.307822
6,llama_base,base,English,2.917209,1.850657
7,llama_instruct,instruct,English,2.143775,0.772969
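
The notebook does not define how the `entropy` column is computed. Assuming it is the Shannon entropy of the model's next-token distribution measured in nats (both the definition and the unit are assumptions here), a minimal sketch looks like this. The helper name `shannon_entropy` and the two toy distributions are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats of a discrete distribution.

    The input is renormalized so that it sums to 1; zero-probability
    entries contribute nothing to the sum.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    h = 0.0
    for p in probs:
        q = p / total
        if q > 0:
            h -= q * math.log(q)
    return h


# Toy next-token distributions over a 4-token vocabulary (made up).
peaked = [0.97, 0.01, 0.01, 0.01]   # model strongly prefers one token
flat = [0.25, 0.25, 0.25, 0.25]     # model is maximally unsure

print(f"peaked: {shannon_entropy(peaked):.3f} nats")
print(f"flat:   {shannon_entropy(flat):.3f} nats")
```

Under the nats assumption, a mean entropy near 4 corresponds to spreading probability roughly evenly over about e^4 ≈ 55 tokens, while a mean below 1 indicates a distribution concentrated on a handful of tokens.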


In [24]:
# Entropy by category for each model
cat_entropy = []
for model, data in models_data.items():
    for cat in data.df['category'].unique():
        cat_df = data.df[data.df['category'] == cat]
        cat_entropy.append({
            'model': model,
            'category': cat,
            'mean_entropy': cat_df['entropy'].mean(),
            'accuracy': cat_df['correct'].mean(),
            'n': len(cat_df)
        })

cat_entropy_df = pd.DataFrame(cat_entropy)
cat_entropy_df.pivot(index='category', columns='model', values='mean_entropy').round(2)

category,llama_base,llama_instruct,mistral_base,mistral_instruct,qwen_base,qwen_instruct,yi_base,yi_instruct
ambiguous,4.61,2.6,4.14,2.63,4.9,1.04,5.26,1.24
confident_correct,1.0,1.51,1.05,1.11,3.29,0.42,2.73,0.96
confident_incorrect,4.3,2.92,4.43,2.76,4.88,0.8,4.51,1.28
nonsensical,4.5,2.57,4.48,2.27,5.08,0.58,4.95,1.38
uncertain_correct,1.17,1.57,1.05,1.39,3.75,0.39,3.21,1.03
uncertain_incorrect,3.59,2.19,3.23,1.79,3.25,0.8,4.7,1.21


## 6. Hallucination Analysis

In [25]:
# Hallucination detection rates
hall_df = core_df[['model', 'variant', 'training', 'hall_det']].copy()
hall_df['hall_det_pct'] = (hall_df['hall_det'] * 100).round(1).astype(str) + '%'
# Each training origin covers two families, so (training, variant) is not unique and
# pivot() raises "Index contains duplicate entries". pivot_table() averages the duplicates.
hall_df.pivot_table(index='training', columns='variant', values='hall_det', aggfunc='mean').round(3)

In [None]:
# Best model overall and worst instruct model by hallucination detection rate
best_model = core_df.loc[core_df['hall_det'].idxmax(), 'model']
instruct_df = core_df[core_df['variant'] == 'instruct']
worst_model = instruct_df.loc[instruct_df['hall_det'].idxmin(), 'model']

print(f"Best hallucination detection: {best_model}")
print(f"Worst instruct hallucination detection: {worst_model}")

## 7. Summary Statistics

In [None]:
# Final summary table for paper/presentation
summary = core_df[['model', 'training', 'entropy_auc', 'probe_auc', 'hidden_info', 'hall_det', 'mean_entropy']].copy()
summary.columns = ['Model', 'Training', 'Entropy AUC', 'Probe AUC', 'Hidden Info', 'Hall. Det', 'Mean Entropy']
summary = summary.round(3)
summary

In [None]:
# Export to CSV if needed
# summary.to_csv('epistemic_summary.csv', index=False)

## Key Findings

1. **Training data drives epistemic transparency, not architecture**
   - Yi (LLaMA-derived arch, Chinese): hidden info ≈ 0.10 (probe AUC minus entropy AUC)
   - Llama (LLaMA arch, English): hidden info ≈ 0.02
   - Same architecture family, 4.2x difference between the base models

2. **Instruct tuning degrades entropy informativeness universally**
   - All four families gain 0.11 to 0.18 in hidden info after instruct tuning, driven by a 0.12 to 0.20 drop in entropy AUC
   - Entropy becomes compressed (lower mean and SD)

3. **Probe accuracy remains stable**
   - Probe AUC stays at ~0.93-0.97 across all models
   - The information exists internally but is not reflected in output entropy

4. **Hallucination detection improves with instruct tuning**
   - But the rate for instruct models varies widely (19-69%)
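
The entropy compression claim in finding 2 can be checked directly against the Section 5 table. The sketch below copies the mean and standard deviation values by hand (the dictionary layout is just an illustration) and reports the instruct-to-base ratio for each family; a ratio below 1 means the statistic shrank after instruct tuning.

```python
# (mean_entropy, std_entropy) copied from the Section 5 table.
entropy_stats = {
    "qwen":    {"base": (4.056761, 1.119518), "instruct": (0.646161, 0.458352)},
    "mistral": {"base": (2.803168, 1.794821), "instruct": (1.890673, 0.846999)},
    "yi":      {"base": (4.039220, 1.385021), "instruct": (1.153487, 0.307822)},
    "llama":   {"base": (2.917209, 1.850657), "instruct": (2.143775, 0.772969)},
}

compression = {}
for family, stats in entropy_stats.items():
    base_mean, base_std = stats["base"]
    inst_mean, inst_std = stats["instruct"]
    compression[family] = {
        "mean_ratio": inst_mean / base_mean,  # < 1: mean entropy dropped
        "std_ratio": inst_std / base_std,     # < 1: spread narrowed
    }

for family, c in compression.items():
    print(f"{family:8s} mean ratio={c['mean_ratio']:.2f}  std ratio={c['std_ratio']:.2f}")
```

Both ratios fall below 1 for all four families. On these numbers Qwen and Yi show the largest drop in mean entropy, while the spread shrinks to less than half of its base value in every family.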