# Multi‑Lab Reasoning Analysis Notebook

This notebook analyzes the merged dataset containing:
- Panel metadata
- CDS outputs
- LLM outputs
- Scoring results

It produces:
- Summary statistics
- Score distributions
- CDS vs LLM comparisons
- Drift and correlation‑detection patterns
- Outlier detection
- Relationship heatmaps

This is the primary analysis notebook for the project.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
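import textwrap
from IPython.display import display  # explicit import so inspect_panel also runs outside a notebook kernel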

pd.set_option('display.max_colwidth', None)

DATA_PATH = "../outputs/dataset/merged_dataset.csv"
df = pd.read_csv(DATA_PATH)

df.head()

## Score Summary

Basic descriptive statistics across all scoring dimensions.

In [None]:
score_cols = [
    'correctness_score', 'completeness_score', 'relationship_detection_score',
    'relationship_accuracy_score', 'narrative_drift_score', 'certainty_score',
    'mechanistic_score', 'structure_score', 'total_score'
]

df[score_cols].describe()

## Score Distributions

Histograms with KDE overlays showing how the model's scores are distributed on each scoring dimension.

In [None]:
plt.figure(figsize=(14, 10))
for i, col in enumerate(score_cols):
    plt.subplot(3, 3, i+1)
    sns.histplot(df[col], kde=True, bins=10)
    plt.title(col)
plt.tight_layout()
plt.show()

## Correlation Matrix of Scores

Pairwise Pearson correlations between the score columns, showing how the reasoning dimensions relate to each other.

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(df[score_cols].corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Score Correlation Matrix")
plt.show()
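
A heatmap with nine columns can be hard to read at a glance, so it may help to list the strongest pairs as a table. The following is a minimal sketch, not part of the original notebook: `top_correlated_pairs` is a hypothetical helper, and the toy frame with columns `a`, `b`, `c` stands in for `df[score_cols]`.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Rank by magnitude so strong negative correlations are not buried
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data for illustration only; in the notebook this would be df[score_cols]
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})
top = top_correlated_pairs(toy)
print(top)
```

In the notebook, `top_correlated_pairs(df[score_cols], n=5)` would give the five strongest relationships. Note that correlations involving `total_score` may be inflated if the total is derived from the other columns.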

## CDS vs LLM Output Length

Output length in characters, used as a rough proxy for reasoning breadth and verbosity.

In [None]:
# Character counts; missing outputs are treated as empty strings
df['llm_length'] = df['llm_output'].fillna('').astype(str).str.len()
df['cds_length'] = df['cds_output'].fillna('').astype(str).str.len()

sns.scatterplot(data=df, x='cds_length', y='llm_length')
plt.xlabel("CDS Output Length")
plt.ylabel("LLM Output Length")
plt.title("CDS vs LLM Output Length")
plt.show()

## Drift vs Total Score

Scatter plot of narrative drift score against total score, to check visually whether drift tracks overall performance.

In [None]:
sns.scatterplot(data=df, x='narrative_drift_score', y='total_score')
plt.title("Narrative Drift vs Total Score")
plt.show()
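
A scatter plot suggests a relationship but does not quantify it. The sketch below computes Pearson and Spearman coefficients for the two columns. It is illustrative only: `drift_total_correlation` is a hypothetical helper, the toy values are made up, and Spearman is computed as Pearson on ranks to avoid an extra dependency.

```python
import pandas as pd

def drift_total_correlation(frame, x="narrative_drift_score", y="total_score"):
    """Pearson and Spearman correlation between two score columns."""
    pair = frame[[x, y]].dropna()  # use only rows where both scores are present
    pearson = pair[x].corr(pair[y])
    # Spearman's rho is the Pearson correlation of the ranked values
    spearman = pair[x].rank().corr(pair[y].rank())
    return {"n": len(pair), "pearson": pearson, "spearman": spearman}

# Toy data for illustration only; in the notebook, pass df instead
toy = pd.DataFrame({
    "narrative_drift_score": [1, 2, 3, 4, 5, None],
    "total_score": [10, 12, 15, 19, 30, 22],
})
result = drift_total_correlation(toy)
print(result)
```

If `total_score` is computed from the component scores (the notebook does not say), some correlation with `narrative_drift_score` is built in and should be read with that in mind.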

## Relationship Detection vs Mechanistic Depth

This shows whether panels where the model detects more relationships also score higher on mechanistic (physiologic) reasoning.

In [None]:
sns.scatterplot(data=df, x='relationship_detection_score', y='mechanistic_score')
plt.title("Relationship Detection vs Mechanistic Reasoning Depth")
plt.show()

## Outlier Panels

The five lowest-scoring and five highest-scoring panels by total score.

In [None]:
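# Five lowest total scores (rows with a missing total_score are excluded)
df.dropna(subset=['total_score']).sort_values('total_score').head(5)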

In [None]:
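# Five highest total scores; dropping NaN first matters because sort_values places NaN last
df.dropna(subset=['total_score']).sort_values('total_score', ascending=False).head(5)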
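
Sorting surfaces the extremes but does not say whether they are actually unusual. One common option is the interquartile-range rule (Tukey fences), sketched below. This is an assumption about how outliers could be flagged, not the project's method: `flag_iqr_outliers` and the multiplier `k=1.5` are illustrative choices, and the toy scores are made up.

```python
import pandas as pd

def flag_iqr_outliers(scores: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
    iqr = q3 - q1
    return (scores < q1 - k * iqr) | (scores > q3 + k * iqr)

# Toy data for illustration only; in the notebook, pass df['total_score']
toy = pd.DataFrame({
    "panel_id": ["P001", "P002", "P003", "P004", "P005", "P006"],
    "total_score": [20, 21, 22, 23, 24, 60],
})
toy["is_outlier"] = flag_iqr_outliers(toy["total_score"])
outlier_ids = toy.loc[toy["is_outlier"], "panel_id"].tolist()
print(outlier_ids)
```

In the notebook, `df[flag_iqr_outliers(df['total_score'])]` would return the flagged panels. Rows with a missing score compare as `False` and are never flagged.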

## Panel Inspection Utility

A helper function to inspect a specific panel’s:
- metadata
- CDS output
- LLM output
- scoring breakdown

In [None]:
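def inspect_panel(panel_id):
    """Print metadata, CDS output, LLM output, and scores for one panel."""
    matches = df[df['panel_id'] == panel_id]
    if matches.empty:
        # Fail with a clear message instead of an IndexError from .iloc[0]
        raise KeyError(f"No panel with panel_id == {panel_id!r}")
    row = matches.iloc[0]

    def as_text(value):
        # NaN is truthy, so `value or ''` would pass NaN through to textwrap.fill
        return '' if pd.isna(value) else str(value)

    print("=== PANEL METADATA ===")
    display(row.filter(regex='lab|panel'))

    print("\n=== CDS OUTPUT ===")
    print(textwrap.fill(as_text(row['cds_output']), width=100))

    print("\n=== LLM OUTPUT ===")
    print(textwrap.fill(as_text(row['llm_output']), width=100))

    print("\n=== SCORING ===")
    display(row[score_cols])

inspect_panel("P001")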