# Statistical Testing of System Configurations and Claim Rephrasing Techniques

**Henry Zelenak | Last updated: 05/12/2025**

This notebook contains the statistical testing of the system configurations and claim rephrasing techniques. The results are based on the evaluation of the system configurations and claim rephrasing techniques on the paper_test.jsonl dataset (Thorne et al., 2018, April). 

## Setup

In [2]:
!pip install kaleido --quiet

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m79.9/79.9 MB[0m [31m12.7 MB/s[0m eta [36m0:00:00[0m
[?25h

In [3]:
"""
Merged script for loading, processing, and analyzing FEVER test results.

Handles both single-repetition tests and multi-repetition tests (averaging results).
Performs statistical comparisons and generates visualizations.
"""

# --- Core Libraries ---
import pandas as pd
import numpy as np
import os
import json
import ast
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import warnings # To potentially suppress warnings if needed

# --- Plotting Libraries ---
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import matplotlib.pyplot as plt # statsmodels qqplot uses matplotlib
import kaleido # Required for static image export with Plotly (often implicit)

# --- Statistical Models & Plots ---
from statsmodels.graphics.gofplots import qqplot # More direct qqplot generation

# --- Google Colab Integration (Optional, keep if running in Colab) ---

from google.colab import drive, userdata
drive.mount('/content/drive')
# Define BASE_DIR using Colab path
BASE_DIR = '/content/drive/My Drive/SUNY_Poly_DSA598/'


# --- Constants: Directory Paths ---
# Ensure BASE_DIR is defined before using it
FT_BASE_DIR = os.path.join(BASE_DIR, 'datasets/FEVER/paper_test_results/fine-tuning_baseline/')
SINGLE_DIR = os.path.join(BASE_DIR, 'datasets/FEVER/paper_test_results/single_tuned/')
REPHRS_DIR = os.path.join(BASE_DIR, 'datasets/FEVER/paper_test_results/claim_rephrasing/')
OUTPUT_DIR = os.path.join(BASE_DIR, 'datasets/FEVER/paper_test_results/') # For saving combined results
COMBINED_CSV_PATH = os.path.join(OUTPUT_DIR, 'all_10_results_merged.csv') # Specific path for the combined file
# Define specific results directories
SAVE_DIR = os.path.join(BASE_DIR, 'presentation/figures/')
DATA_DIR = os.path.join(BASE_DIR, 'datasets/FEVER/paper_test_results/')
# Corrected paths: Remove leading '/' if joining with another path
FT_RESULTS_SAVE_PATH = os.path.join(SAVE_DIR, 'fine-tuning') # (Group 1 Plots)
CR_RESULTS_SAVE_PATH = os.path.join(SAVE_DIR, 'claim_rephrasing') # (Group 2 Plots)

# Create save directories if they don't exist
os.makedirs(FT_RESULTS_SAVE_PATH, exist_ok=True)
os.makedirs(CR_RESULTS_SAVE_PATH, exist_ok=True)

# --- Configuration & Paths ---
pio.templates.default = "plotly_white" # Set a clean default theme for plots
# Plot saving configuration
SAVE_SCALE = 4 # Increase scale for higher resolution PNGs

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

# --- Load Data ---
all_results_df = pd.read_csv(os.path.join(DATA_DIR, 'all_10_results.csv'))

Mounted at /content/drive


In [3]:
# Ensure fever-scorer is installed correctly (assuming previous steps worked)
!git clone -b release-v2.0 https://github.com/sheffieldnlp/fever-scorer.git
%cd fever-scorer
!pip install -r requirements.txt

# Open /setup.py and add 'license="MIT"' on line 12, then overwrite the file
import os
with open('setup.py', 'r') as f:
    lines = f.readlines()
    lines[11] = 'license="MIT"\n'
with open('setup.py', 'w') as f:
    f.writelines(lines)
    f.close()
    print("setup.py updated")
!pip install .
%cd ..

Cloning into 'fever-scorer'...
remote: Enumerating objects: 224, done.[K
remote: Counting objects: 100% (5/5), done.[K
remote: Compressing objects: 100% (5/5), done.[K
remote: Total 224 (delta 0), reused 0 (delta 0), pack-reused 219 (from 1)[K
Receiving objects: 100% (224/224), 1.13 MiB | 5.69 MiB/s, done.
Resolving deltas: 100% (110/110), done.
/content/fever-scorer
setup.py updated
Processing /content/fever-scorer
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: fever-scorer
  Building wheel for fever-scorer (setup.py) ... [?25l[?25hdone
  Created wheel for fever-scorer: filename=fever_scorer-0.0.0-py3-none-any.whl size=8288 sha256=4bfb53b15ea6bc01065f36e6ebf8a3b07763fca240671f0eaf911cbbd132b9e6
  Stored in directory: /root/.cache/pip/wheels/f7/a5/f9/dffaef703ff054c8aa2ea4534130aae0e1ff9450753d0d7556
Successfully built fever-scorer
Installing collected packages: fever-scorer
Successfully installed fever-scorer-0.0.0
/content


In [4]:
from fever.scorer import fever_score # Import the FEVER scorer

## Results Data Preprocessing

In [None]:
# Load the gold standard data for per-claim FEVER score, label F1 and label accuracy
FEVER_GOLD_STANDARD_PATH = os.path.join(BASE_DIR, 'datasets/FEVER/paper_test.jsonl')
with open(FEVER_GOLD_STANDARD_PATH, 'r') as f:
    gold_data = [json.loads(line) for line in f]
gold_df = pd.DataFrame(gold_data)
# Take the first 30 rows
gold_df = gold_df.iloc[:30]
print(gold_df.head())

# Load the CSV files from the fine-tuning baseline directory

def load_dfs(dir):
    """
    Load all CSV files from the DATA_DIR into a list of DataFrames.
    """
    dfs = []
    for file in os.listdir(dir):
        if file.endswith('.csv'):
            df = pd.read_csv(os.path.join(dir, file))
            # Add a column for the system state when tested (the same value in all rows)
            df['system_config'] = file.split('_')[0] + '_' + file.split('_')[1]
            dfs.append(df)
    return dfs

### Single-repetition tests (e.g., all_base-Tx, tuned_GPT-sentEx, tuned_GPT-clf, tuned_GPT-query, tuned_sBERT-n1024)
def process_dfs(dfs):
    """
    Process the loaded DataFrames to extract relevant information.
    """
    print(f"Processing {len(dfs)} DataFrames for single-repetition tests")
    processed_dfs = []
    for df in dfs:
        # Unpack the module 1 report into new columns
        module_1_report = df['module1_report_details'].apply(json.loads)
        df['number_of_pages_retrieved'] = module_1_report.apply(lambda x: x['mod_1_total_documents'])
        df['total_document_tokens'] = module_1_report.apply(lambda x: x['total_document_tokens'])
        df['potential_titles'] = module_1_report.apply(lambda x: x['potential_titles'])
        df['retrieved_pages'] = module_1_report.apply(lambda x: x['retrieved_titles'])

        # Unpack the module 2 report into new columns
        module_2_report = df['module2_report_details'].apply(json.loads)
        for key in module_2_report.iloc[0]:
            df[key] = module_2_report.apply(lambda x: x[key])

        # Rename 'llm_total_tokens' as GPT_total_tokens
        df.rename(columns={'llm_total_tokens': 'gpt_total_tokens'}, inplace=True)
        # Rename 'llm_total_sentences' as GPT_total_sentences
        df.rename(columns={'llm_total_sentences': 'gpt_total_sentences'}, inplace=True)

        # Create a 'number_of_evidence_sentences' column from the length of final_evidence_ids
        df['number_of_evidence_sentences'] = df['final_evidence_ids'].apply(len)
        '''
                instances = [
          {
              "label": "REFUTES",
              "predicted_label": "REFUTES",
              "predicted_evidence": [["Page1", 1], ["Page3", 2]],
              "evidence": [
                  [
                      [None, None, "Page1", 1],
                      [None, None, "Page3", 1],
                      [None, None, "Page3", 2],
                  ],
              ],
          },
          {
              "label": "SUPPORTS",
              "predicted_label": "SUPPORTS",
              "predicted_evidence": [["Page3", 3]],
              "evidence": [
                  [
                      [None, None, "Page3", 3]
                  ]
              ],
          },
          {
              "label": "NOT ENOUGH INFO",
              "predicted_label": "NOT ENOUGH INFO",
              "predicted_evidence": [],
              "evidence": [],
          },
        ]
        '''
        # Unpack the 'predicted_evidence_ids' column into a list of dictionaries (predicted_evidence_ids is a list of lists, where each inner list is a list of [page, sentence])
        for row in df.iterrows():
            # Get the corresponding row in the gold_df using the id
            id = row[1]['id']
            gold_row = gold_df[gold_df['id'] == id]
            if not gold_row.empty:
                # rename strict_score,label_accuracy,precision,recall,f1 to have "avg_"
                for key in ['strict_score', 'label_accuracy', 'precision', 'recall', 'f1']:
                    df.rename(columns={key: 'avg_' + key}, inplace=True)

                # Get the label and predicted label
                label = gold_row['label'].values[0]
                predicted_label = row[1]['module3_result']
                # Get the predicted evidence ids
                   # Get the evidence ids
                if predicted_label == 'NOT ENOUGH INFO':
                    predicted_evidence_ids = []
                else:
                  predicted_evidence_ids = row[1]['predicted_evidence_ids']
                  # Use ast to convert the string to a list
                  predicted_evidence_ids = ast.literal_eval(predicted_evidence_ids)

                # Get the evidence
                evidence = gold_row['evidence'].values[0]

                # Create a dictionary for the instance
                instance = {
                    'label': label,
                    'predicted_label': predicted_label,
                    'predicted_evidence': predicted_evidence_ids,
                    'evidence': evidence
                }

                strict_score, label_accuracy, precision, recall, f1 = fever_score([instance])
                print(f"Strict score: {strict_score}, Label accuracy: {label_accuracy}, Precision: {precision}, Recall: {recall}, F1: {f1}")

                # Add new columns for each per-claim score
                df.loc[row[0], 'pc_strict_score'] = strict_score
                df.loc[row[0], 'pc_label_accuracy'] = label_accuracy
                df.loc[row[0], 'pc_precision'] = precision
                df.loc[row[0], 'pc_recall'] = recall
                df.loc[row[0], 'pc_f1'] = f1

        processed_dfs.append(df)
        # Concatenate the DataFrames
    df = pd.concat(processed_dfs, ignore_index=True)
    return df

single_rep_dfs = load_dfs(SINGLE_DIR)
single_rep_df = process_dfs(single_rep_dfs)

### Quadruple-repetition tests

### Use the same load_dfs function as above
def process_reps_dfs(dfs):
    """
    Process the loaded DataFrames to extract relevant information.
    """
    print(f"Processing {len(dfs)} DataFrames for quadruple-repetition tests")
    to_average_dfs = []
    averaged_df = pd.DataFrame()
    unique_configs = []

    for df in dfs:
        # Unpack the module 1 report into new columns
        module_1_report = df['module1_report_details'].apply(json.loads)
        df['number_of_pages_retrieved'] = module_1_report.apply(lambda x: x['mod_1_total_documents'])
        df['total_document_tokens'] = module_1_report.apply(lambda x: x['total_document_tokens'])
        df['potential_titles'] = module_1_report.apply(lambda x: x['potential_titles'])
        df['retrieved_pages'] = module_1_report.apply(lambda x: x['retrieved_titles'])

        # Unpack the module 2 report into new columns
        module_2_report = df['module2_report_details'].apply(json.loads)
        for key in module_2_report.iloc[0]:
            df[key] = module_2_report.apply(lambda x: x[key])

        # Rename 'llm_total_tokens' as GPT_total_tokens
        df.rename(columns={'llm_total_tokens': 'gpt_total_tokens'}, inplace=True)
        # Rename 'llm_total_sentences' as GPT_total_sentences
        df.rename(columns={'llm_total_sentences': 'gpt_total_sentences'}, inplace=True)

        # Create a 'number_of_evidence_sentences' column from the length of final_evidence_ids
        df['number_of_evidence_sentences'] = df['final_evidence_ids'].apply(len)


        # Split on '-' and take everything except the last item
        trimmed_sys_config = df['system_config'].apply(lambda x: x.split('-')[:-1]).apply('-'.join)
        df['system_config'] = trimmed_sys_config
        unique_configs.extend(trimmed_sys_config)
        unique_configs = list(set(unique_configs))

        to_average_dfs.append(df)

    for config in unique_configs:
        # Get the DataFrames for this config
        print(f"Processing config: {config}")
        config_dfs = [df for df in to_average_dfs if df['system_config'].iloc[0] == config]
        print(f"Number of DataFrames for config {config}: {len(config_dfs)}")
        # Concatenate them into a single DataFrame
        config_df = pd.concat(config_dfs)
        # Drop non-numeric columns
        config_df_num = config_df.select_dtypes(include=[np.number])

        # for each row that shares an id in config_df_num, average the other columns and create a new row for that id and add it to averaged_df
        averaged_rows = []
        for id in config_df_num['id'].unique():
            # Get the rows for this id
            id_rows = config_df_num[config_df_num['id'] == id]

            # rename strict_score,label_accuracy,precision,recall,f1 to have "avg_"
            for key in ['strict_score', 'label_accuracy', 'precision', 'recall', 'f1']:
                config_df_num.rename(columns={key: 'avg_' + key}, inplace=True)

            # Average the numeric columns
            averaged_row = id_rows.mean()

            # Add the id to the averaged row
            averaged_row['id'] = id
            # Add the system config to the averaged row
            averaged_row['system_config'] = config
            # Add the module3_result back to the averaged row
            averaged_row['module3_result'] = config_df[config_df['id'] == id]['module3_result'].iloc[0]
            # Add the "predicted_evidence_ids" back
            averaged_row['predicted_evidence_ids'] = config_df[config_df['id'] == id]['predicted_evidence_ids'].iloc[0]

            # Unpack the 'predicted_evidence_ids' column into a list of dictionaries (predicted_evidence_ids is a list of lists, where each inner list is a list of [page, sentence])
            # Get the corresponding row in the gold_df using the id
            gold_row = gold_df[gold_df['id'] == id]
            if not gold_row.empty:
                # Get the label and predicted label
                label = gold_row['label'].values[0]
                predicted_label = averaged_row['module3_result']
                # Get the predicted evidence ids
                if predicted_label == 'NOT ENOUGH INFO':
                    predicted_evidence_ids = []
                else:
                    predicted_evidence_ids = averaged_row['predicted_evidence_ids']
                    # Use ast to convert the string to a list
                    predicted_evidence_ids = ast.literal_eval(predicted_evidence_ids)

                # Get the evidence
                evidence = gold_row['evidence'].values[0]

                # Create a dictionary for the instance
                instance = {
                    'label': label,
                    'predicted_label': predicted_label,
                    'predicted_evidence': predicted_evidence_ids,
                    'evidence': evidence
                }

                strict_score, label_accuracy, precision, recall, f1 = fever_score([instance])

                # Add new columns for each per-claim score
                averaged_row['pc_strict_score'] = strict_score
                averaged_row['pc_label_accuracy'] = label_accuracy
                averaged_row['pc_precision'] = precision
                averaged_row['pc_recall'] = recall
                averaged_row['pc_f1'] = f1

            # Append the averaged row to the list
            averaged_rows.append(averaged_row)
        # Create a DataFrame from the averaged rows
        averaged_df = pd.concat([averaged_df, pd.DataFrame(averaged_rows)], ignore_index=True)

    return averaged_df


In [None]:
reps_ft_base_dfs = load_dfs(FT_BASE_DIR)
reps_ft_base_df = process_reps_dfs(reps_ft_base_dfs)
reps_dfs = load_dfs(REPHRS_DIR)
reps_df = process_reps_dfs(reps_dfs)

# Concatenate the DataFrames
reps_df = pd.concat([reps_ft_base_df, reps_df], ignore_index=True)

# Now we have two DataFrames: single_rep_df and reps_df, the first with the single-repetition tests (GPT-sentEx, tuned_GPT-clf, tuned_GPT-query, tuned_sBERT-n1024) and the second with the quadruple-repetition tests (all_base, tuned_GPT-sBERTn1024-sentEx-naiveRephrs, tuned_GPT-sBERTn1024-sentEx-rephsHist, tuned_GPT-sBERTn1024-sentEx+GPT-query, tuned_GPT-sBERTn1024-sentEx+PT).

# We concatenate them into a single DataFrame
all_results_df = pd.concat([single_rep_df, reps_df], ignore_index=True)

# We can now select the relevant columns for our analysis
         # - number_of_pages_retrieved
         # - gpt_total_tokens
         # - number_of_evidence_sentences
         # - iterations_run
         # - time_to_check
         # - label_accuracy
         # - strict_score
         # - f1

relevant_columns = [
    'system_config',
    'number_of_pages_retrieved',
    'gpt_total_tokens',
    'number_of_evidence_sentences',
    'iterations_run',
    'time_to_check',
    'avg_label_accuracy',
    'avg_strict_score',
    'avg_f1',
    'pc_strict_score',
    'pc_label_accuracy',
    'pc_precision',
    'pc_recall',
    'pc_f1'
]
"""
We end up with 10 system configurations:
- all_base (fine-tuning baseline, an average of 4 repetitions)
- tuned_GPT-sentEx
- tuned_GPT-clf
- tuned_GPT-query
- tuned_sBERT-n1024
- tuned_GPT-sBERTn1024-sentEx-noRephrs (fine-tuning baseline for claim rephrasing, an average of 4 repetitions)
- tuned_GPT-sBERTn1024-sentEx-naiveRephrs (naive rephrasing, an average of 4 repetitions)
- tuned_GPT-sBERTn1024-sentEx-rephsHist (rephrase history, an average of 4 repetitions)
- tuned_GPT-sBERTn1024-sentEx+GPT-query
- tuned_GPT-sBERTn1024-sentEx+PT
"""
# Select only the relevant columns
all_results_df = all_results_df[relevant_columns]

# We can now save the DataFrame to a CSV file
all_results_df.to_csv(os.path.join(BASE_DIR, 'datasets/FEVER/paper_test_results/all_10_results.csv'), index=False)

# we now want to test the 4 single-model configurations (tuned_GPT-sentEx, tuned_GPT-clf, tuned_GPT-query, tuned_sBERT-n1024) relative to the fine-tuning baseline (all_base) and the 2 claim rephrasing configurations (tuned_GPT-sBERTn1024-sentEx-naiveRephrs, tuned_GPT-sBERTn1024-sentEx-rephsHist) relative the claim rephrasing baseline (no rephrasing) (tuned_GPT-sBERTn1024-sentEx-noRephrs) using a t-test or ANOVA, depending on the number of groups we are comparing.

In [7]:
all_results_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   system_config                 240 non-null    object 
 1   number_of_pages_retrieved     240 non-null    float64
 2   gpt_total_tokens              240 non-null    float64
 3   number_of_evidence_sentences  240 non-null    float64
 4   iterations_run                240 non-null    float64
 5   time_to_check                 240 non-null    float64
 6   avg_label_accuracy            236 non-null    float64
 7   avg_strict_score              236 non-null    float64
 8   avg_f1                        236 non-null    float64
 9   pc_strict_score               240 non-null    float64
 10  pc_label_accuracy             240 non-null    float64
 11  pc_precision                  240 non-null    float64
 12  pc_recall                     240 non-null    float64
 13  pc_f1

## Statistical Testing and Description

In [None]:
# --- Create Color Mapping ---
unique_system_configs = sorted(all_results_df['system_config'].unique())
color_mapping = {}
palette = px.colors.qualitative.Safe_r
for i, system_config in enumerate(unique_system_configs):
    color_mapping[system_config] = palette[i % len(palette)]
print("\nColor Mapping:", json.dumps(color_mapping, indent=2))

# Define metric groups
overall_perf_metrics = ['avg_label_accuracy', 'avg_strict_score', 'avg_f1'] # These are overall run averages
pc_binary_metrics = ['pc_strict_score', 'pc_label_accuracy']
pc_continuous_metrics = ['pc_precision', 'pc_recall', 'pc_f1']
other_operational_metrics = [
    'number_of_pages_retrieved',
    'gpt_total_tokens',
    'number_of_evidence_sentences',
    'iterations_run',
    'time_to_check'
]

# Metrics for Group 2 statistical tests (per-claim level)
metrics_to_test_group2_pc = pc_binary_metrics + pc_continuous_metrics + other_operational_metrics

# --- Group 1 Comparison ---
group1_configs_tuned = [
    'tuned_GPT-sentEx', 'tuned_GPT-clf', 'tuned_GPT-query', 'tuned_sBERT-n1024'
]
group1_baseline = 'all_base'
group1_all_configs = sorted(group1_configs_tuned + [group1_baseline])
group1_df = all_results_df[all_results_df['system_config'].isin(group1_all_configs)].copy()

print("\n--- Group 1 Visualizations (vs Baseline) ---")
print("NOTE: Comparing single-rep results to a quad-rep baseline's per-claim scores.")

# Plot 1.1: Grouped Bar Chart for OVERALL Performance Metrics (avg_ columns)
group1_agg_df = group1_df.groupby('system_config')[overall_perf_metrics].mean().reset_index()
group1_perf_melted = pd.melt(group1_agg_df, id_vars='system_config', value_vars=overall_perf_metrics,
                             var_name='Metric', value_name='Average Score (Overall Run)')
plot_name_perf1 = 'group1_overall_avg_performance_bar'
fig_perf1 = px.bar(group1_perf_melted, x='Metric', y='Average Score (Overall Run)', color='system_config',
                   barmode='group', title='Group 1: Overall Average Performance Scores',
                   category_orders={"system_config": group1_all_configs}, color_discrete_map=color_mapping, height=500)
fig_perf1.update_layout(xaxis_tickangle=0)
try:
    fig_perf1.show()
    fig_perf1.write_image(os.path.join(FT_RESULTS_SAVE_PATH, f'{plot_name_perf1}.png'), scale=SAVE_SCALE)
    #print(f"Saved Group 1 overall performance bar chart to: {FT_RESULTS_SAVE_PATH}/{plot_name_perf1}.png")
except Exception as e: print(f"ERROR saving/showing {plot_name_perf1}: {e}")
print(f"Group 1: Overall Average Performance Scores:")
print(group1_perf_melted)

# Plot 1.2: Bar Charts for PER-CLAIM BINARY Metrics (pc_strict_score, pc_label_accuracy)
for metric in pc_binary_metrics:
    plot_name_bar1_pc = f'group1_{metric}_proportion_bar'
    # Calculate proportion of 1s (successes)
    prop_df = group1_df.groupby('system_config')[metric].agg(ProportionSuccess='mean', Count='size').reset_index()
    fig_bar1_pc = px.bar(prop_df, x='system_config', y='ProportionSuccess', color='system_config',
                         title=f'Group 1: Proportion of Claims with {metric.replace("pc_", "").replace("_", " ").title()} = 1',
                         labels={'ProportionSuccess': f'Proportion of Claims ({metric})', 'system_config': 'System Configuration'},
                         category_orders={"system_config": group1_all_configs}, color_discrete_map=color_mapping, height=500)
    fig_bar1_pc.update_yaxes(range=[0, 1]) # Ensure y-axis is 0 to 1 for proportions
    try:
        fig_bar1_pc.show()
        fig_bar1_pc.write_image(os.path.join(FT_RESULTS_SAVE_PATH, f'{plot_name_bar1_pc}.png'), scale=SAVE_SCALE)
        #print(f"Saved {plot_name_bar1_pc} to: {FT_RESULTS_SAVE_PATH}/{plot_name_bar1_pc}.png")
    except Exception as e: print(f"ERROR saving/showing {plot_name_bar1_pc}: {e}")

# Plot 1.3: Box Plots for PER-CLAIM CONTINUOUS & OTHER Metrics
metrics_for_box_group1 = pc_continuous_metrics + other_operational_metrics
for metric in metrics_for_box_group1:
    plot_name_box1 = f'group1_{metric}_boxplot'
    try:
        fig_box1 = px.box(group1_df, x='system_config', y=metric,
                         title=f'Group 1: Distribution of {metric.replace("pc_", "").replace("_", " ").title()} Per-Claim ',
                         points="all", category_orders={"system_config": group1_all_configs},
                         color="system_config", color_discrete_map=color_mapping)
        fig_box1.update_layout(showlegend=False)
        #fig_box1.show()
        fig_box1.write_image(os.path.join(FT_RESULTS_SAVE_PATH, f'{plot_name_box1}.png'), scale=SAVE_SCALE)
        #print(f"Saved {plot_name_box1} to: {FT_RESULTS_SAVE_PATH}/{plot_name_box1}.png")
    except Exception as e: print(f"Could not generate/save box plot for {metric} (Group 1): {e}")
    # Print the dataframe description
    print(f"\nGroup 1: Description of {metric} per system configuration:")
    print(group1_df.groupby('system_config')[metric].describe())

# --- Group 2 Comparison (Claim Rephrasing - Multi-Rep) ---
group2_baseline = 'tuned_GPT-sBERTn1024-sentEx-noRephrs'
group2_configs_tuned = [
    'tuned_GPT-sBERTn1024-sentEx-naiveRephrs', 'tuned_GPT-sBERTn1024-sentEx-rephsHist'
]
group2_all_configs = sorted(group2_configs_tuned + [group2_baseline])
group2_df = all_results_df[all_results_df['system_config'].isin(group2_all_configs)].copy()

print("\n--- Group 2 Analysis: Claim Rephrasing (Quad-Rep) ---")

# Plot 2.1: Grouped Bar Chart for OVERALL Performance Metrics (avg_ columns)
group2_agg_df = group2_df.groupby('system_config')[overall_perf_metrics].mean().reset_index()
group2_perf_melted = pd.melt(group2_agg_df, id_vars='system_config', value_vars=overall_perf_metrics,
                             var_name='Metric', value_name='Average Score (Overall Run)')
plot_name_perf2 = 'group2_overall_avg_performance_bar'
fig_perf2 = px.bar(group2_perf_melted, x='Metric', y='Average Score (Overall Run)', color='system_config',
                   barmode='group', title='Group 2: Overall Avg Performance by Rephrasing',
                   category_orders={"system_config": group2_all_configs}, color_discrete_map=color_mapping, height=500)
fig_perf2.update_layout(xaxis_tickangle=0)
try:
    fig_perf2.show()
    fig_perf2.write_image(os.path.join(CR_RESULTS_SAVE_PATH, f'{plot_name_perf2}.png'), scale=SAVE_SCALE)
    #print(f"Saved Group 2 overall performance bar chart to: {CR_RESULTS_SAVE_PATH}/{plot_name_perf2}.png")
except Exception as e: print(f"ERROR saving/showing {plot_name_perf2}: {e}")
print(f"Group 2: Overall Average Performance Scores:")
print(group2_perf_melted)

# Plot 2.2: Bar Charts for PER-CLAIM BINARY Metrics
for metric in pc_binary_metrics:
    plot_name_bar2_pc = f'group2_{metric}_proportion_bar'
    prop_df = group2_df.groupby('system_config')[metric].agg(ProportionSuccess='mean', Count='size').reset_index()
    fig_bar2_pc = px.bar(prop_df, x='system_config', y='ProportionSuccess', color='system_config',
                         title=f'Group 2: Proportion of Claims with {metric.replace("pc_", "").replace("_", " ").title()} = 1',
                         labels={'ProportionSuccess': f'Proportion of Claims ({metric})', 'system_config': 'System Configuration'},
                         category_orders={"system_config": group2_all_configs}, color_discrete_map=color_mapping, height=500)
    fig_bar2_pc.update_yaxes(range=[0, 1])
    try:
        fig_bar2_pc.show()
        fig_bar2_pc.write_image(os.path.join(CR_RESULTS_SAVE_PATH, f'{plot_name_bar2_pc}.png'), scale=SAVE_SCALE)
        #print(f"Saved {plot_name_bar2_pc} to: {CR_RESULTS_SAVE_PATH}/{plot_name_bar2_pc}.png")
    except Exception as e: print(f"ERROR saving/showing {plot_name_bar2_pc}: {e}")

alpha = 0.05

for metric in metrics_to_test_group2_pc:
    print(f"\n--- Analyzing Per-Claim Metric: {metric.replace('_', ' ').title()} (Group 2) ---")
    data_groups = [group2_df[group2_df['system_config'] == config][metric].dropna() for config in group2_all_configs]
    group_names = group2_all_configs

    if metric in pc_binary_metrics:
        print(f"Statistical Test for Binary Metric: {metric}")
        # --- Chi-Squared Test ---
        # Create a contingency table: counts of 0s and 1s for each system_config
        contingency_table_data = []
        for config_name, data_series in zip(group_names, data_groups):
            successes = data_series.sum() # Count of 1s
            failures = len(data_series) - successes # Count of 0s
            contingency_table_data.append([successes, failures])
            print(f"  {config_name}: Successes={successes}, Failures={failures} (Total={len(data_series)})")

        # Ensure all groups have data
        if all(len(data_series) > 0 for data_series in data_groups) and len(contingency_table_data) > 1:
            try:
                # Convert to numpy array for chi2_contingency
                observed = np.array(contingency_table_data)
                chi2, p_val, dof, expected = stats.chi2_contingency(observed)
                print(f"  Chi-Squared Test: chi2 = {chi2:.4f}, p = {p_val:.4f}, df = {dof}")
                if p_val < alpha:
                    print(f"  Significant difference in proportions found between groups (p < {alpha}).")
                    # Optional: Post-hoc pairwise comparisons (e.g., multiple 2x2 Chi-squared tests with Bonferroni)
                else:
                    print(f"  No significant difference in proportions found between groups (p >= {alpha}).")
            except ValueError as ve: # Catches errors like "The internally computed table of expected frequencies has a zero element at..."
                 print(f"  Chi-Squared Test could not be performed: {ve}. Likely due to low counts in some cells.")
                 print(f"  Consider Fisher's Exact Test for small N or if expected frequencies are low.")
            except Exception as e_chi2:
                print(f"  Error during Chi-Squared test for {metric}: {e_chi2}")
        else:
            print("  Skipping Chi-Squared test (insufficient data in one or more groups).")

    else: # Continuous or other operational metrics
        # --- Visualizations (Box Plot & Q-Q) ---
        plot_name_box2_pc = f'group2_{metric}_boxplot'
        try:
            fig_box2_pc = px.box(group2_df, x='system_config', y=metric, points="all",
                              title=f'Group 2: Distribution of Per-Claim {metric.replace("pc_","").replace("_"," ").title()}',
                              category_orders={"system_config": group2_all_configs},
                              color="system_config", color_discrete_map=color_mapping)
            fig_box2_pc.update_layout(showlegend=False)
            fig_box2_pc.show()
            fig_box2_pc.write_image(os.path.join(CR_RESULTS_SAVE_PATH, f'{plot_name_box2_pc}.png'), scale=SAVE_SCALE)
            #print(f"Saved {plot_name_box2_pc} to: {CR_RESULTS_SAVE_PATH}/{plot_name_box2_pc}.png")
        except Exception as e: print(f"Could not generate/save box plot for {metric} (Group 2): {e}")

        print(f"Normality Check (Q-Q Plots & Shapiro-Wilk) for {metric}:")
        normality_passed = True
        plot_name_qq = f'group2_{metric}_qqplots'
        fig_qq_plt, axes = plt.subplots(1, len(data_groups), figsize=(5 * len(data_groups), 4), squeeze=False)
        fig_qq_plt.suptitle(f'Group 2: Q-Q Plots for Per-Claim {metric.replace("pc_","").replace("_"," ").title()}', fontsize=14)
        shapiro_results = {}
        for i, (data, name) in enumerate(zip(data_groups, group_names)):
            if len(data) >= 3:
                try:
                    qqplot(data, line='s', ax=axes[0, i])
                    axes[0, i].set_title(f'{name}\n(N={len(data)})')
                    shapiro_test = stats.shapiro(data)
                    shapiro_results[name] = shapiro_test.pvalue
                    print(f"  Shapiro-Wilk p-value for {name}: {shapiro_test.pvalue:.4f}", end="")
                    if shapiro_test.pvalue < alpha:
                        print(f" -> Potential non-normality (p < {alpha})"); normality_passed = False
                    else:
                        print(f" -> Normality plausible (p >= {alpha})")
                except Exception as e_qq:
                     print(f"  Error generating Q-Q plot or Shapiro for {name}: {e_qq}"); axes[0,i].set_title(f'{name}\n(Error)'); shapiro_results[name] = np.nan
            else:
                print(f"  Skipping normality check for {name} (N={len(data)} < 3)"); shapiro_results[name] = np.nan; axes[0,i].set_title(f'{name}\n(N={len(data)} - Too Small)')
        plt.tight_layout(rect=[0, 0.03, 1, 0.95])
        try:
            plt.savefig(os.path.join(CR_RESULTS_SAVE_PATH, f'{plot_name_qq}.png'), dpi=300)
            #print(f"Saved {plot_name_qq} to: {CR_RESULTS_SAVE_PATH}/{plot_name_qq}.png")
            plt.show()
        except Exception as e_save_qq: print(f"ERROR saving Q-Q plot {plot_name_qq}: {e_save_qq}"); plt.show()
        plt.close(fig_qq_plt)

        print(f"\nVariance Check (Levene's Test) for {metric}:")
        if all(len(data) > 1 for data in data_groups):
            try:
                levene_test = stats.levene(*data_groups, center='median')
                print(f"  Levene's test p-value: {levene_test.pvalue:.4f}", end="")
                equal_variance = levene_test.pvalue >= alpha
                if equal_variance: print(f" -> Equal variances plausible (p >= {alpha})")
                else: print(f" -> Equal variances unlikely (p < {alpha})")
            except Exception as e_levene:
                print(f" Levene's test failed: {e_levene}"); equal_variance = None
        else:
            print("  Skipping Levene's test (insufficient data)"); equal_variance = None

        print(f"\nGroup Comparison for {metric}:")
        total_valid_samples = sum(len(d) for d in data_groups)
        if total_valid_samples < len(data_groups) * 2 or len(data_groups) < 2:
             print("  Skipping comparison test (insufficient data)"); continue
        try:
            if normality_passed and equal_variance:
                print("  Using ANOVA (assumptions met).")
                f_val, p_val = stats.f_oneway(*data_groups)
                print(f"  ANOVA Result: F = {f_val:.4f}, p = {p_val:.4f}")
                if p_val < alpha:
                    print(f"  Significant difference found between groups (p < {alpha}).")
                    print("  Pairwise t-tests (uncorrected):")
                    baseline_data = data_groups[group_names.index(group2_baseline)]
                    for i, config_name in enumerate(group2_configs_tuned): # Iterate over tuned configs for comparison
                        comp_data = data_groups[group_names.index(config_name)]
                        if len(baseline_data) > 1 and len(comp_data) > 1:
                             t_stat, p_t_pair = stats.ttest_ind(baseline_data, comp_data, equal_var=True)
                             print(f"    {group2_baseline} vs {config_name}: t={t_stat:.3f}, p={p_t_pair:.4f}{' *' if p_t_pair < alpha else ''}")
                else:
                    print(f"  No significant difference found between groups (p >= {alpha}).")
            else:
                print("  Using Kruskal-Wallis (non-parametric test due to potential assumption violation).")
                h_val, p_val = stats.kruskal(*data_groups)
                print(f"  Kruskal-Wallis Result: H = {h_val:.4f}, p = {p_val:.4f}")
                if p_val < alpha:
                    print(f"  Significant difference found between groups (p < {alpha}).")
                else:
                    print(f"  No significant difference found between groups (p >= {alpha}).")
        except Exception as e_test: print(f"  Statistical test failed for {metric}: {e_test}")

    print("-" * 40)

print("\n--- Analysis Complete ---")


Color Mapping: {
  "all_base": "rgb(136, 136, 136)",
  "tuned_GPT-clf": "rgb(102, 17, 0)",
  "tuned_GPT-query": "rgb(136, 34, 85)",
  "tuned_GPT-sBERTn1024-sentEx-naiveRephrs": "rgb(153, 153, 51)",
  "tuned_GPT-sBERTn1024-sentEx-noRephrs": "rgb(68, 170, 153)",
  "tuned_GPT-sBERTn1024-sentEx-rephsHist": "rgb(170, 68, 153)",
  "tuned_GPT-sentEx": "rgb(51, 34, 136)",
  "tuned_sBERT-n1024": "rgb(17, 119, 51)"
}

--- Group 1 Visualizations (vs Baseline) ---
NOTE: Comparing single-rep results to a quad-rep baseline's per-claim scores.
Group 1: Overall Average Performance Scores:
        system_config              Metric  Average Score (Overall Run)
0            all_base  avg_label_accuracy                     0.633333
1       tuned_GPT-clf  avg_label_accuracy                     0.566667
2     tuned_GPT-query  avg_label_accuracy                     0.600000
3    tuned_GPT-sentEx  avg_label_accuracy                     0.733333
4   tuned_sBERT-n1024  avg_label_accuracy                     0.

In [5]:
# List all the files in the figure save dirs

print(f"Figures available for Group 1:")
for file in os.listdir(FT_RESULTS_SAVE_PATH):
    print(file)
print('\n')
print(f"Figures available for Group 2:")
for file in os.listdir(CR_RESULTS_SAVE_PATH):
    print(file)

Figures available for Group 1:
group1_pc_strict_score_proportion_bar.png
group1_pc_precision_boxplot.png
group1_overall_avg_performance_bar.png
group1_pc_recall_boxplot.png
group1_number_of_pages_retrieved_boxplot.png
group1_pc_label_accuracy_proportion_bar.png
group1_gpt_total_tokens_boxplot.png
group1_iterations_run_boxplot.png
group1_pc_f1_boxplot.png
group1_number_of_evidence_sentences_boxplot.png
group1_time_to_check_boxplot.png


Figures available for Group 2:
group2_pc_precision_boxplot.png
group2_pc_strict_score_proportion_bar.png
group2_overall_avg_performance_bar.png
group2_pc_precision_qqplots.png
group2_pc_label_accuracy_proportion_bar.png
group2_pc_recall_boxplot.png
group2_pc_f1_boxplot.png
group2_pc_recall_qqplots.png
group2_pc_f1_qqplots.png
group2_number_of_pages_retrieved_boxplot.png
group2_number_of_pages_retrieved_qqplots.png
group2_gpt_total_tokens_boxplot.png
group2_number_of_evidence_sentences_qqplots.png
group2_number_of_evidence_sentences_boxplot.png
group2_gpt

## References


Sheffieldnlp. (2021). FEVER-scorer. SHEFFIELDNLP/Fever-scorer at release-v2.0. https://github.com/sheffieldnlp/fever-scorer/tree/release-v2.0 

Thorne, J., Vlachos, A., Christodoulopoulos, C., & Mittal, A. (2018, April). Fever dataset. Fact Extraction and VERification. https://fever.ai/dataset/fever.html 

Thorne, J., Vlachos, A., Christodoulopoulos, C., & Mittal, A. (2018, June). FEVER: A large-scale dataset for fact extraction and VERification. In M. Walker, H. Ji, & A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (pp. 809–819). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1074