# Model Selection for Reliability Testing

### Method

- The deductive coding table was turned into a prompt, which the AI uses to code the dataset. See `INTERACTION_PROMPT_TEMPLATEv3`
- Random sample of 50 chat sessions taken from the dataset
- Human (that's me) coded the 50 chat sessions
- Each model will code the same 50 chat sessions three times to test for internal consistency
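
The sampling step in the list above can be sketched as follows. This is only an illustration: the `sessions` frame is a hypothetical stand-in for the full dataset, and the seed is an assumption, not the value used in the study.

```python
import pandas as pd

# Hypothetical stand-in for the full session dataset (N=1024), one row per session.
sessions = pd.DataFrame({"sessionid": [f"session-{i:04d}" for i in range(1024)]})

# Draw 50 sessions without replacement. The fixed seed (an assumption) makes the
# draw reproducible, so the human and every model code exactly the same sessions.
sample_50 = sessions.sample(n=50, random_state=42)

print(len(sample_50), sample_50["sessionid"].is_unique)
```

Fixing the seed matters here because every rater, human or LLM, has to label the identical set of sessions for the agreement statistics to be comparable.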

### Which model to select?

- Models compared, as rater / annotator:

    - Human
    - "x-ai/grok-4-fast"
    - "openai/gpt-5-mini"
    - "google/gemini-2.5-flash"
    - "anthropic/claude-sonnet-4"
    
- Krippendorff's alpha, used to measure:

    - human / AI model agreement: which model agreed most with the human coder?
    - internal consistency among multiple runs of the same LLM (3 passes on the same model): which model was most consistent?


The LLM selected to code the entire session dataset (N=1024) is the one with the highest agreement with human coding and the highest internal consistency.

- Target for LLM / human agreement: Krippendorff's alpha >= 0.85
- Target for internal consistency: Krippendorff's alpha = 1.0
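
To make the statistic behind these targets concrete, here is a minimal from-scratch sketch of Krippendorff's alpha for nominal labels, built from the coincidence matrix. It is not the `simpledorff` implementation used in this notebook; the function name `nominal_alpha` and the toy ratings are made up for illustration.

```python
import math
from collections import Counter
from itertools import permutations

def nominal_alpha(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists, each inner list holding the labels one item received.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # an item with a single rating says nothing about agreement
        # Every ordered pair of ratings on the same item, weighted by 1 / (m - 1)
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return math.nan  # only one label in use: alpha is undefined
    return 1.0 - observed / expected

# Toy example: two raters, four sessions, one disagreement.
toy = [
    ["Learning", "Learning"],
    ["Learning", "Learning"],
    ["Task Completion", "Task Completion"],
    ["Learning", "Task Completion"],
]
print(nominal_alpha(toy))
```

Alpha equals 1.0 only when no item receives two different labels, which is why the internal consistency target of 1.0 amounts to requiring identical labels on all three runs.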

### Findings

Best model:

- "anthropic/claude-sonnet-4"
    - Agreement with human: 0.873131
    - Internal consistency: 1.0

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import TABLEAU_COLORS
import seaborn as sns
from scipy import stats
import simpledorff
import json
import import_ipynb
from llm_content_analysis import invoke_openrouter_ai, INTERACTION_PROMPT_TEMPLATEv3, classify_interaction
from helper_functions import independent_ttest

NUM_ITERATIONS = 3
SAMPLES = [
    {"model": "x-ai/grok-4-fast" }, 
    {"model": "openai/gpt-5-mini"},
    {"model": "google/gemini-2.5-flash" },
    {"model": "anthropic/claude-sonnet-4" }
]

model_comparison_df = pd.DataFrame( { 'model': [sample['model'] for sample in SAMPLES]})

### Dataset for Analysis:

- `50_sample_4_llm_3_iterations_labels.csv` - 50 samples coded 3 times by 4 different LLMs for a total of 600 rows
- `50_sample_human_labels.csv` - 50 samples coded by human for a total of 50 rows
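
A quick design check can confirm that the label file really has the shape described above (50 sessions × 4 models × 3 runs = 600 rows). The sketch below runs on synthetic data; `check_design` is a hypothetical helper, and only the column names `sessionid`, `model`, and `sample` are taken from the dataset.

```python
import itertools
import pandas as pd

def check_design(labels, n_sessions=50, n_models=4, n_runs=3):
    """Return True when every (model, run) pair labels each session exactly once."""
    if len(labels) != n_sessions * n_models * n_runs:
        return False
    if labels.duplicated(["model", "sample", "sessionid"]).any():
        return False
    per_cell = labels.groupby(["model", "sample"])["sessionid"].nunique()
    return bool(len(per_cell) == n_models * n_runs and (per_cell == n_sessions).all())

# Synthetic frame with the same layout as the LLM label file.
models = ["x-ai/grok-4-fast", "openai/gpt-5-mini",
          "google/gemini-2.5-flash", "anthropic/claude-sonnet-4"]
rows = itertools.product(models, [1, 2, 3], [f"s{i}" for i in range(50)])
toy_labels = pd.DataFrame(rows, columns=["model", "sample", "sessionid"])

print(check_design(toy_labels))
```

Running the same function on the loaded `llm_labels` frame would flag a missing or duplicated run before any alpha is computed.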

In [3]:
llm_labels = pd.read_csv("../datasets/50_sample_4_llm_3_iterations_labels.csv")
human_labels = pd.read_csv("../datasets/50_sample_human_labels.csv")
llm_labels.head()


| Unnamed: 0 | participant_number | participant | sessionid | message_count | transcript_filename | transcript | classification | justification | model | sample | behavior_code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 17 | Participant_17 | 4151274b-0515-4308-984e-7af035baefa4 | 2 | Participant_17_session_4151274b-0515-4308-984e... | `PARTICIPANT_17:\nwhat files need to be in the ...` | Task Completion | `{'behavior_code': 'Seeking Answers', 'rational...` | x-ai/grok-4-fast | 1 | Seeking Answers |
| 1 | 64 | Participant_64 | ebde0db7-fa59-4e38-9b9f-a1c15db8f474 | 2 | Participant_64_session_ebde0db7-fa59-4e38-9b9f... | `PARTICIPANT_64:\ncan you explain parse to me\n...` | Learning | `{'behavior_code': 'Questioning', 'rationale': ...` | x-ai/grok-4-fast | 1 | Questioning |
| 2 | 41 | Participant_41 | 6b8bc12d-8642-494c-b278-04b79ef1f968 | 9 | Participant_41_session_6b8bc12d-8642-494c-b278... | `PARTICIPANT_41:\nis my 1.3 code correct?\n\nAI...` | Learning | `{'behavior_code': 'Scaffolding', 'rationale': ...` | x-ai/grok-4-fast | 1 | Scaffolding |
| 3 | 49 | Participant_49 | 32ea3ad9-099a-4dac-891c-22d2db740102 | 2 | Participant_49_session_32ea3ad9-099a-4dac-891c... | `PARTICIPANT_49:\nwhat does sort() do?\n\nAI_AS...` | Learning | `{'behavior_code': 'Questioning', 'rationale': ...` | x-ai/grok-4-fast | 1 | Questioning |
| 4 | 36 | Participant_36 | 82bcaae3-22da-4174-b181-c56edffdb305 | 4 | Participant_36_session_82bcaae3-22da-4174-b181... | `PARTICIPANT_36:\nwhat does x.upper or x.lower ...` | Learning | `{'behavior_code': 'Questioning', 'rationale': ...` | x-ai/grok-4-fast | 1 | Questioning |


## Inter-rater reliability of the LLM vs Human

Let's compare the human coding (my coding) with the LLM coding using Krippendorff's alpha to find the most "Mike-like" LLM, that is, the model whose labels agree most closely with mine.



In [20]:
HUMAN_AGREEMENT_TARGET = 0.85
print(f"Inter-rater reliability (Krippendorff's alpha) of LLMs vs Human Threshold: {HUMAN_AGREEMENT_TARGET}")
# For each iteration (the 'sample' column is numbered 1..NUM_ITERATIONS)
for i in range(NUM_ITERATIONS):
    # Combine the human labels with the LLM labels from this iteration.
    # human_labels is expected to carry 'human' in its 'model' column.
    krip = llm_labels.loc[llm_labels['sample'] == i + 1, ['sessionid', 'model', 'classification']]
    krip = pd.concat([krip, human_labels], ignore_index=True)
    print(f"Iteration {i+1}")
    # Compute alpha for each LLM paired with the human coder
    for m in krip['model'].unique():
        if m == 'human':
            continue
        comp = krip[krip['model'].isin([m, 'human'])]
        alpha = simpledorff.calculate_krippendorffs_alpha_for_df(
            comp,
            experiment_col='sessionid',
            annotator_col='model',
            class_col='classification'
        )
        status = "✅" if alpha >= HUMAN_AGREEMENT_TARGET else "❌"
        print(f"\t{status}{m} vs human ==> {alpha}")
        # Overwritten on every pass, so only the last iteration's alpha is kept
        model_comparison_df.loc[model_comparison_df['model'] == m, 'agreement_with_human'] = alpha


Inter-rater reliability (Krippendorff's alpha) of LLMs vs Human Threshold: 0.85
Iteration 1
	✅x-ai/grok-4-fast vs human ==> 0.8731311405382315
	❌openai/gpt-5-mini vs human ==> 0.7885519008970525
	❌google/gemini-2.5-flash vs human ==> 0.8330522765598651
	✅anthropic/claude-sonnet-4 vs human ==> 0.8731311405382315
Iteration 2
	✅x-ai/grok-4-fast vs human ==> 0.8693356797184337
	❌openai/gpt-5-mini vs human ==> 0.7495784148397977
	❌google/gemini-2.5-flash vs human ==> 0.8330522765598651
	✅anthropic/claude-sonnet-4 vs human ==> 0.8731311405382315
Iteration 3
	✅x-ai/grok-4-fast vs human ==> 0.8764044943820225
	❌openai/gpt-5-mini vs human ==> 0.7885519008970525
	❌google/gemini-2.5-flash vs human ==> 0.8330522765598651
	✅anthropic/claude-sonnet-4 vs human ==> 0.8731311405382315
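
Since agreement varies slightly between iterations for some models, it can help to summarize the three passes per model. The sketch below does this for the alpha values printed above; the `alphas` frame and the choice of mean, min, and max as summary statistics are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Alpha values copied from the printed output above, one row per iteration.
alphas = pd.DataFrame(
    {
        "x-ai/grok-4-fast": [0.8731311405382315, 0.8693356797184337, 0.8764044943820225],
        "openai/gpt-5-mini": [0.7885519008970525, 0.7495784148397977, 0.7885519008970525],
        "google/gemini-2.5-flash": [0.8330522765598651, 0.8330522765598651, 0.8330522765598651],
        "anthropic/claude-sonnet-4": [0.8731311405382315, 0.8731311405382315, 0.8731311405382315],
    },
    index=[1, 2, 3],
)

# One row per model: mean, worst, and best agreement across the three passes.
summary = alphas.agg(["mean", "min", "max"]).T
summary["meets_target_every_pass"] = summary["min"] >= 0.85
print(summary)
```

Checking the minimum rather than a single pass guards against a model that clears the 0.85 target on one run only by chance.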


## Internal consistency of the LLMs

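Each model's three runs over the same 50 sessions are treated as three separate raters, and Krippendorff's alpha is computed across them.

The target is Krippendorff's alpha = 1.0, meaning all three runs of a model assign identical labels.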
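
Alpha gives a single number per model. To see which sessions drive any inconsistency, the runs can be pivoted side by side. This is a sketch on toy data; only the column names `sessionid`, `sample`, and `classification` come from the dataset.

```python
import pandas as pd

# Toy labels for one model: session s1 is stable, s2 changes on the second run.
runs = pd.DataFrame({
    "sessionid": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "sample": [1, 2, 3, 1, 2, 3],
    "classification": ["Learning", "Learning", "Learning",
                       "Learning", "Task Completion", "Learning"],
})

# One row per session, one column per run.
wide = runs.pivot(index="sessionid", columns="sample", values="classification")

# Sessions where the runs did not all produce the same label.
unstable = wide[wide.nunique(axis=1) > 1]
print(unstable)
```

An empty `unstable` frame corresponds to alpha = 1.0 for that model, provided more than one label occurs in the data.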

In [22]:
print("\nInternal consistency (Krippendorff's alpha) of LLMs Threshold: 1.0")
for model in llm_labels['model'].unique():
    # Treat each of the three runs ('sample') of one model as a separate annotator
    krip = llm_labels[llm_labels['model'] == model]
    alpha = simpledorff.calculate_krippendorffs_alpha_for_df(
        krip,
        experiment_col='sessionid',
        annotator_col='sample',
        class_col='classification'
    )
    # Compare with a tolerance rather than exact float equality
    status = "✅" if np.isclose(alpha, 1.0) else "❌"
    print(f"\t{status} {model} Krippendorff's Alpha (n=3) ==> {alpha}")
    model_comparison_df.loc[model_comparison_df['model'] == model, 'internal_consistency'] = alpha


Internal consistency (Krippendorff's alpha) of LLMs Threshold: 1.0
	❌ x-ai/grok-4-fast Krippendorff's Alpha (n=3) ==> 0.8850308641975309
	❌ openai/gpt-5-mini Krippendorff's Alpha (n=3) ==> 0.9147434674804501
	✅ google/gemini-2.5-flash Krippendorff's Alpha (n=3) ==> 1.0
	✅ anthropic/claude-sonnet-4 Krippendorff's Alpha (n=3) ==> 1.0
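
With both measures available, the selection rule stated earlier (perfect internal consistency first, then highest agreement with the human coder) can be sketched as below. The `comparison` frame is rebuilt by hand from the printed outputs, using the iteration 3 agreement values that the loop leaves in `model_comparison_df`; treat it as an illustration, not the notebook's own cell.

```python
import numpy as np
import pandas as pd

comparison = pd.DataFrame({
    "model": ["x-ai/grok-4-fast", "openai/gpt-5-mini",
              "google/gemini-2.5-flash", "anthropic/claude-sonnet-4"],
    # Iteration 3 agreement with the human coder (values from the output above)
    "agreement_with_human": [0.8764044943820225, 0.7885519008970525,
                             0.8330522765598651, 0.8731311405382315],
    "internal_consistency": [0.8850308641975309, 0.9147434674804501, 1.0, 1.0],
})

# Step 1: keep only the models whose three runs agree perfectly.
consistent = comparison[np.isclose(comparison["internal_consistency"], 1.0)]

# Step 2: among those, pick the model closest to the human coding.
best = consistent.loc[consistent["agreement_with_human"].idxmax()]
print(best["model"], best["agreement_with_human"])
```

Applying the consistency filter first is what excludes "x-ai/grok-4-fast", whose agreement with the human is comparable but whose runs do not reproduce each other.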


## Summary

Among the models with perfect internal consistency (alpha = 1.0),  
the model with the highest agreement with human coding was:

"anthropic/claude-sonnet-4" (0.873131)