# Statistical Testing, Clustering, and PCA

This notebook extends the descriptive analysis by applying:
- t-tests and ANOVA
- correlation and covariance analysis
- PCA (Principal Component Analysis)
- k-means clustering
- hierarchical clustering

The goal is to identify deeper structure in LLM reasoning behavior across panels.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage

pd.set_option('display.max_colwidth', None)

DATA_PATH = "../outputs/dataset/merged_dataset.csv"
df = pd.read_csv(DATA_PATH)

df.head()

## Score Columns
Subset the dataset to include only scoring dimensions.

In [None]:
score_cols = [
    'correctness_score', 'completeness_score', 'relationship_detection_score',
    'relationship_accuracy_score', 'narrative_drift_score', 'certainty_score',
    'mechanistic_score', 'structure_score', 'total_score'
]

df_scores = df[score_cols].copy()
df_scores.describe()
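
The introduction lists correlation and covariance analysis; a minimal sketch of that step follows. The helper name `correlation_summary` and the toy frame are hypothetical stand-ins; in this notebook the function would be called on `df_scores`.

```python
import numpy as np
import pandas as pd


def correlation_summary(scores: pd.DataFrame):
    """Return (Pearson correlation, Spearman correlation, covariance) matrices."""
    pearson = scores.corr(method="pearson")
    spearman = scores.corr(method="spearman")  # rank-based, less sensitive to outliers
    covariance = scores.cov()                  # sample covariance (ddof=1)
    return pearson, spearman, covariance


# Toy data standing in for df_scores (hypothetical values, not real results).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with "a"
    "c": [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with "a"
})
pearson, spearman, covariance = correlation_summary(toy)
```

With the real data, `sns.heatmap(pearson, annot=True, cmap='coolwarm', center=0)` is one way to view the matrix at a glance.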

## t-Tests

Example: compare total score between low-drift and high-drift panels, split at the median narrative drift score, using Welch's t-test (`equal_var=False`).
This is illustrative: with small datasets, statistical power is limited.

In [None]:
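# Split panels at the median drift score; ties at the median fall in the low-drift group.
median_drift = df['narrative_drift_score'].median()

low_drift = df.loc[df['narrative_drift_score'] <= median_drift, 'total_score']
high_drift = df.loc[df['narrative_drift_score'] > median_drift, 'total_score']

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_val = stats.ttest_ind(low_drift, high_drift, equal_var=False)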

t_stat, p_val
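
Because a p-value alone says little when power is limited, an effect size is a useful companion to the test. The sketch below computes Cohen's d with a pooled standard deviation. `cohens_d` is a hypothetical helper, not part of SciPy, and the pooled form is a simplification given that the test above does not assume equal variances.

```python
import numpy as np


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Drop missing values so they do not propagate into the means and variances.
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Toy example (hypothetical values): means differ by 1, pooled SD is 2.
d_example = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

In this notebook the call would be `cohens_d(low_drift, high_drift)`.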

## ANOVA

Example: Does correctness score differ across tertiles of mechanistic reasoning?

In [None]:
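# Split panels into tertiles of mechanistic score.
# Note: pd.qcut raises a ValueError if heavy ties make two bin edges equal.
df['mech_group'] = pd.qcut(df['mechanistic_score'], 3, labels=['low','mid','high'])

# Correctness scores for each tertile, in low/mid/high order.
groups = [
    df.loc[df['mech_group'] == level, 'correctness_score']
    for level in ['low', 'mid', 'high']
]

# One-way ANOVA across the three groups.
anova_stat, anova_p = stats.f_oneway(*groups)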

anova_stat, anova_p
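
The F statistic can be complemented with eta squared, the share of total variance explained by group membership. The sketch below is a plain NumPy version; `eta_squared` is a hypothetical helper and assumes the groups contain no missing values.

```python
import numpy as np


def eta_squared(*groups):
    """Eta squared: between-group sum of squares divided by total sum of squares."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    # Weighted squared distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total


# Toy example (hypothetical values): group means 2, 3, 4 around a grand mean of 3.
eta_example = eta_squared([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0])
```

In this notebook the call would be `eta_squared(*[g.dropna() for g in groups])`.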

## PCA

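Standardize the scores, then project them onto the first two principal components to look for latent structure in the scoring dimensions. Note that `score_cols` includes `total_score`; if that column is derived from the other scores, it is collinear with them and could be dropped before fitting.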

In [None]:
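# Standardize each score to zero mean and unit variance so no dimension dominates.
# PCA does not accept missing values, so df_scores must be free of NaNs here.
scaler = StandardScaler()
scores_scaled = scaler.fit_transform(df_scores)

# Keep the first two principal components for plotting.
pca = PCA(n_components=2)
pca_components = pca.fit_transform(scores_scaled)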

df['PC1'] = pca_components[:,0]
df['PC2'] = pca_components[:,1]

plt.figure(figsize=(8,6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='total_score', palette='viridis')
plt.title('PCA of Scoring Dimensions')
plt.show()
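
A two-component scatter plot is only as informative as the variance those components capture. The sketch below reports per-component and cumulative explained variance; `explained_variance_table` and the toy matrix are hypothetical, and in this notebook the fitted `pca.explained_variance_ratio_` gives the same information for PC1 and PC2.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def explained_variance_table(X, n_components=None):
    """Fit PCA on standardized data; return per-component and cumulative variance ratios."""
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components).fit(X_scaled)
    ratios = pca.explained_variance_ratio_
    return ratios, np.cumsum(ratios)


# Toy data (hypothetical): the second column is a noisy copy of the first.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = np.column_stack([x, x + 0.1 * rng.normal(size=200), rng.normal(size=200)])
ratios, cumulative = explained_variance_table(toy)
```

If the first two ratios sum to a small fraction of the total, the 2-D plot hides most of the structure and more components are worth inspecting.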

## PCA Loadings
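Rows of `pca.components_` are the principal axes; transposed, each column gives the weight of every scoring dimension on that component. Larger absolute weights mean a stronger contribution.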

In [None]:
loadings = pd.DataFrame(pca.components_.T, columns=['PC1','PC2'], index=score_cols)
loadings

## k-Means Clustering

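Cluster panels based on scoring patterns. The elbow method plots inertia (the within-cluster sum of squared distances) against k; the bend where further increases in k stop paying off suggests a value for k.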

In [None]:
inertias = []
K = range(2, 8)

for k in K:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)  # explicit n_init: the default differs across scikit-learn versions
    km.fit(scores_scaled)
    inertias.append(km.inertia_)

plt.plot(K, inertias, marker='o')
plt.title('Elbow Method for k-Means')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.show()
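
The elbow is often ambiguous, and `silhouette_score` is imported at the top of the notebook but not used. The sketch below shows one way it could complement the elbow plot: compute the mean silhouette for each candidate k and prefer the highest. `silhouette_by_k` and the toy blobs are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(X, k_values, random_state=42):
    """Return {k: mean silhouette score} for k-means fits on X."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


# Toy data (hypothetical): three well-separated blobs in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])
sil = silhouette_by_k(toy, range(2, 6))
best_k = max(sil, key=sil.get)
```

In this notebook the call would be `silhouette_by_k(scores_scaled, K)`; silhouette values range from -1 to 1, with higher values indicating tighter, better-separated clusters.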

### Fit Final k-Means Model (k=3)
This is illustrative: k=3 is fixed here, and with small datasets clusters may be unstable or nearly empty.

In [None]:
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(scores_scaled)

sns.scatterplot(data=df, x='PC1', y='PC2', hue='cluster', palette='Set2')
plt.title('k-Means Clusters in PCA Space')
plt.show()

## Hierarchical Clustering

Ward linkage on the standardized scores, drawn as a dendrogram to visualize hierarchical relationships between panels. Leaves are labeled by `panel_id`.

In [None]:
linked = linkage(scores_scaled, method='ward')

plt.figure(figsize=(12, 6))
dendrogram(linked, labels=df['panel_id'].values, leaf_rotation=90)
plt.title('Hierarchical Clustering Dendrogram')
plt.show()
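
A dendrogram shows the merge order but does not assign cluster labels. As a sketch of one way to get them, SciPy's `fcluster` can cut the tree into a chosen number of flat clusters. `cut_tree_labels` and the toy points are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cut_tree_labels(X, n_clusters, method="ward"):
    """Build a linkage matrix and cut it into at most n_clusters flat clusters."""
    Z = linkage(X, method=method)
    return fcluster(Z, t=n_clusters, criterion="maxclust")  # labels start at 1


# Toy data (hypothetical): two tight groups of points.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = cut_tree_labels(toy, n_clusters=2)
```

In this notebook, `fcluster(linked, t=3, criterion='maxclust')` would reuse the existing linkage, and `pd.crosstab` of those labels against `df['cluster']` would show how far the hierarchical and k-means groupings agree.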