## tldr:

We compare experiments pairwise to check whether F1 differs significantly when the covariates (e.g. Prompt Type, Model) are varied.

Parametric tests such as ANOVA/Tukey (which compare means) cannot be used, because the assumption of normally distributed residuals failed. We therefore use non-parametric tests instead: Kruskal-Wallis followed by Dunn's test (rank-based, commonly read as comparing medians).

Results:
*   Significant differences in median (not mean) F1 between levels within ~Prompt Type~, Model, Sample Size, Num Features, and Class 1 Proportion. Great!
*   Model: significant differences between all levels, except 7B vs. 7B-8bit. Great!
*   Sample Size: significant differences between nearly all levels, except 32 vs. each of the others. Great!
*   Num Features: significant differences between nearly all levels, except 40 vs. each of the others. Great!
*   Class 1 Proportion: significant differences between all levels. Great!
*   tradML vs best GTL: pending unaveraged tradML results



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:


import pandas as pd
import glob
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    average_precision_score, roc_auc_score
)
import ast

# Current directory
# %cd '/content/drive/MyDrive/Capstone Project'
%cd '/content/drive/MyDrive/'

/content/drive/MyDrive


# 1. Read csv files, cleaning and put together

In [3]:
# Following Varun's code
test_ohe = pd.read_csv('/content/drive/MyDrive/Indian_bank_data/test_tree.csv')
test_ohe = test_ohe[~test_ohe["Approved_Flag"].isna()]
test_ohe_100 = test_ohe.sample(n=100, random_state=42).round(4)
true_labels = test_ohe_100["Approved_Flag"].tolist()

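# Helpers for handling 'invalid' predictions; used later when processing the dataframes.
def update_invalid(lst1, lst2):
    # Replace each 'invalid' prediction in lst2 with the opposite of the true
    # label in lst1, so it always counts as a misclassification. Mutates lst2 in place.
    for i in range(len(lst2)):
        if lst2[i] == 'invalid':
            lst2[i] = float(~int(lst1[i]) & 1)
    return lst2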

def calculate_new_auc(x):
    y_pred = ast.literal_eval(x)
    cnt_preds = len(y_pred)
    true_labels_new = true_labels[:cnt_preds]
    y_pred_new = update_invalid(true_labels_new,y_pred)
    roc_auc = roc_auc_score(true_labels_new, y_pred_new)
    return roc_auc

def calculate_new_prauc(x):
    y_pred = ast.literal_eval(x)
    cnt_preds = len(y_pred)
    true_labels_new = true_labels[:cnt_preds]
    y_pred_new = update_invalid(true_labels_new,y_pred)
    pr_auc = average_precision_score(true_labels_new, y_pred_new)
    return pr_auc

def calculate_new_f1(x):
    y_pred = ast.literal_eval(x)
    cnt_preds = len(y_pred)
    true_labels_new = true_labels[:cnt_preds]
    y_pred_new = update_invalid(true_labels_new,y_pred)
    f1 = f1_score(true_labels_new, y_pred_new, zero_division=1)
    return f1

def calculate_new_precision(x):
    y_pred = ast.literal_eval(x)
    cnt_preds = len(y_pred)
    true_labels_new = true_labels[:cnt_preds]
    y_pred_new = update_invalid(true_labels_new,y_pred)
    precision = precision_score(true_labels_new, y_pred_new, zero_division=1)
    return precision

def calculate_new_recall(x):
    y_pred = ast.literal_eval(x)
    cnt_preds = len(y_pred)
    true_labels_new = true_labels[:cnt_preds]
    y_pred_new = update_invalid(true_labels_new,y_pred)
    recall = recall_score(true_labels_new, y_pred_new, zero_division=1)
    return recall

def generate_new_metrics(df):
    df["new_AUCROC"] = df['Prediction'].apply(lambda x: calculate_new_auc(x))
    df["new_PRAUC"] = df['Prediction'].apply(lambda x: calculate_new_prauc(x))
    df["new_F1 Score"] = df['Prediction'].apply(lambda x: calculate_new_f1(x))
    df["new_Precision"] = df['Prediction'].apply(lambda x: calculate_new_precision(x))
    df["new_Recall"] = df['Prediction'].apply(lambda x: calculate_new_recall(x))
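
To make the invalid-handling rule concrete, here is a minimal standalone sketch of the same idea as `update_invalid`: an `'invalid'` prediction is replaced by the opposite of the true label, so it is always scored as an error. The helper name `flip_invalid` and the toy lists are hypothetical, and unlike `update_invalid` this version returns a new list instead of mutating its input.

```python
def flip_invalid(true_labels, predictions):
    """Return a copy of predictions where each 'invalid' entry is replaced
    by the opposite of the true label, so it always counts as an error."""
    fixed = []
    for truth, pred in zip(true_labels, predictions):
        if pred == 'invalid':
            fixed.append(float(1 - int(truth)))  # 1 -> 0.0, 0 -> 1.0
        else:
            fixed.append(float(pred))
    return fixed

# Hypothetical toy data: two of the four predictions are 'invalid'
toy_truth = [1, 0, 1, 0]
toy_preds = [1, 'invalid', 'invalid', 0]
toy_fixed = flip_invalid(toy_truth, toy_preds)
print(toy_fixed)  # [1.0, 1.0, 0.0, 0.0]
```

Scoring invalids as errors is a pessimistic choice: it can only lower the recomputed metrics relative to dropping those rows.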

In [4]:
# List of patterns to match multiple groups of files
file_patterns = [
    './traditional_ML_res.csv',  # Traditional ML results
    './Indian_bank_data_results/experiments_result_t_table_7B-GTL-unquant_*.csv',  # Pattern for 7B unquant -t_table- files
    './Indian_bank_data_results/experiments_result_t_table_7B-GTL-8bit_*.csv',  # Pattern for 7B 8bit -t_table- files
    './Indian_bank_data_results/experiments_result_t_table_13B-GTL-8bit_*.csv',  # Pattern for 13B 8bit -t_table- files
    # './Indian_bank_data_results/experiments_result_t_table_gpt4omini_*.csv',  # Pattern for GPT4 -t_table- files
    './Indian_bank_data_results/experiments_result_t_annony_7B-GTL-unquant_*.csv',  # Pattern for 7B unquant -t_annony- files
    './Indian_bank_data_results/experiments_result_t_annony_7B-GTL-8bit_*.csv',  # Pattern for 7B 8bit -t_annony- files
    './Indian_bank_data_results/experiments_result_t_annony_13B-GTL-8bit_*.csv',  # Pattern for 13B 8bit -t_annony- files
    # './Indian_bank_data_results/experiments_result_t_annony_gpt4omini_*.csv'  # Pattern for GPT4 -t_annony- files
]

# Initialize an empty list to store DataFrames
dfs = []

# Loop over each pattern and get all matching file names
for pattern in file_patterns:
    file_names = glob.glob(pattern)  # Get files matching the pattern

    # Check the model (7B-unquant, 7B-8bit, 13B-8bit or GPT-4) of the current pattern (we'll use it to add a column with the model)
    if '7B-GTL-unquant' in pattern:
        model = '7B-GTL-unquant'
    elif '7B-GTL-8bit' in pattern:
        model = '7B-GTL-8bit'
    elif '13B-GTL-8bit' in pattern:
        model = '13B-GTL-8bit'
    elif 'gpt4omini' in pattern:
        model = 'GPT4'
    else:
        model = 'unknown'  # Default case if none of the conditions match (serves as check)

    # Check if the current pattern belongs to t_table or t_annony (we'll use it to add a column indicating if it corresponds to t_table or t_annony)
    if 't_table' in pattern:
        prompt_type = 't_table'
    elif 't_annony' in pattern:
        prompt_type = 't_anony'
    else:
        prompt_type = 'unknown' # Default case if none of the conditions match (serves as check)

    # Read each file and append the DataFrame to the list
    for file in file_names:
        df = pd.read_csv(file)

        # Add the 'model' column to the DataFrame
        df['Model'] = model

        # Add the 'prompt_type' column to the DataFrame
        df['Prompt Type'] = prompt_type

        # Handle special cases:

        # Case 1: Check if the second column (instead of the first one) is called "Num Features"
        # This happens in experiments_result_t_table_7B-GTL-8bit_48 and experiments_result_t_table_7B-GTL-8bit_64 csv files
        if df.columns[1] == "Num Features" and df.columns[0] == "Unnamed: 0":
          # Drop the first column (index 0) ... it is a column with an index
          df = df.drop(df.columns[0], axis=1)

        # Case 2: If both "F1 Score" and "F1_Score" columns are present, drop the "F1_Score" column
        # This happens in some csv files, eg the 13B ones
        if "F1 Score" in df.columns and "F1_Score" in df.columns:
            df = df.drop("F1_Score", axis=1)

        # If only "F1_Score" column is present, rename to "F1 Score"
        if "F1_Score" in df.columns:
            df = df.rename(columns={'F1_Score': 'F1 Score'})

        # Case 3: If both "Num Features" and "feature_set_size" columns are present, drop the "feature_set_size" column
        # This happens in some csv files, eg experiments_result_t_annony_7B-GTL_128_10
        if "Num Features" in df.columns and "feature_set_size" in df.columns:
            df = df.drop("feature_set_size", axis=1)

        # Case 4: Match the AUC columns names between the different csv files
        # This happens in some files, eg: experiments_result_t_annony_7B-GTL-8bit_8 or experiments_result_t_annony_7B-GTL-8bit_0
        if "PR-AUC" in df.columns:
          df = df.rename(columns={'PR-AUC': 'PR_AUC'})
        if "ROC-AUC" in df.columns:
          df = df.rename(columns={'ROC-AUC': 'ROC_AUC'})
        if "AUCROC" in df.columns:
          df = df.rename(columns={'AUCROC': 'ROC_AUC'})

        # Case 5: Drop rows where 'Num Features' column has NaN values
        # This happens in the file 'experiments_result_t_table_7B-GTL_32' that contains empty rows
        df = df.dropna(subset=['Num Features'])

        # Case 6: Filter out rows where 'Sample Size' is equal to 48
        df = df[df['Sample Size'] != 48]

        # Case 7: Filter out rows where 'Sample Size' is equal to 128 (because many invalids for some experiments)
        df = df[df['Sample Size'] != 128]


        ### Handling invalids: new columns
        df["new_AUCROC"] = df['Prediction'].apply(lambda x: calculate_new_auc(x))
        df["new_PRAUC"] = df['Prediction'].apply(lambda x: calculate_new_prauc(x))
        df["new_F1 Score"] = df['Prediction'].apply(lambda x: calculate_new_f1(x))
        df["new_Precision"] = df['Prediction'].apply(lambda x: calculate_new_precision(x))
        df["new_Recall"] = df['Prediction'].apply(lambda x: calculate_new_recall(x))

        # Reorder columns to make 'model' and 'prompt_type' the first two columns
        columns = ['Model', 'Prompt Type'] + [col for col in df.columns if col not in ['Model', 'Prompt Type']]
        df = df[columns]

        dfs.append(df)

# Concatenate all DataFrames into a single DataFrame
results = pd.concat(dfs, ignore_index=True)

# Drop some columns that appear on just a few dataframes: Unnamed: 0 & invalid_prop
#results = results.drop("Unnamed: 0", axis=1)
# results = results.drop("invalid_prop", axis=1)
# results = results.drop(["PR_AUC", "ROC_AUC"], axis=1)

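
The model and prompt type above are inferred with chained `if`/`elif` substring checks on each glob pattern. As a possible alternative, the same information could be parsed from each file name with one regular expression. This is only a sketch: `FILENAME_RE` and `parse_experiment_filename` are hypothetical names, and the pattern assumes file names of the form `experiments_result_<prompt>_<model>_<suffix>.csv`, as in the patterns listed above.

```python
import re

# Assumed layout: experiments_result_<t_table|t_annony>_<model>_<suffix>.csv
FILENAME_RE = re.compile(
    r"experiments_result_(?P<prompt>t_table|t_annony)_(?P<model>[^_]+)_"
)

def parse_experiment_filename(path):
    """Return (model, prompt_type) parsed from a result file name,
    or ('unknown', 'unknown') when the name does not match."""
    match = FILENAME_RE.search(path)
    if match is None:
        return 'unknown', 'unknown'
    # The files spell the prompt type 't_annony'; the notebook labels it 't_anony'.
    prompt_type = 't_anony' if match.group('prompt') == 't_annony' else 't_table'
    return match.group('model'), prompt_type

parsed = parse_experiment_filename(
    './Indian_bank_data_results/experiments_result_t_annony_7B-GTL-8bit_8.csv'
)
print(parsed)  # ('7B-GTL-8bit', 't_anony')
```

Files that do not follow the layout, such as the traditional ML results, fall through to `'unknown'`, which keeps the same sanity check as the `else` branches above.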


In [5]:
df = results
results.head()

Unnamed: 0,Model,Prompt Type,Num Features,Sample Size,Class 1 Proportion,Set ID,Run Number,Accuracy,Precision,Recall,F1 Score,Prediction,new_AUCROC,new_PRAUC,new_F1 Score,new_Precision,new_Recall,PR_AUC,ROC_AUC,invalid_prop
0,7B-GTL-unquant,t_table,5.0,32.0,0.1,Set_1_Prop_0.1,1.0,0.86,0.0,0.0,0.0,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.494253,0.13,0.0,0.0,0.0,,,
1,7B-GTL-unquant,t_table,5.0,32.0,0.1,Set_2_Prop_0.1,1.0,0.87,0.0,0.0,0.0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.5,0.13,0.0,1.0,0.0,,,
2,7B-GTL-unquant,t_table,5.0,32.0,0.1,Set_3_Prop_0.1,1.0,0.84,0.0,0.0,0.0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.482759,0.13,0.0,0.0,0.0,,,
3,7B-GTL-unquant,t_table,5.0,32.0,0.1,Set_4_Prop_0.1,1.0,0.81,0.125,0.076923,0.095238,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, ...",0.498232,0.129615,0.095238,0.125,0.076923,,,
4,7B-GTL-unquant,t_table,5.0,32.0,0.1,Set_5_Prop_0.1,1.0,0.83,0.0,0.0,0.0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.477011,0.13,0.0,0.0,0.0,,,


In [6]:
results = df
# The traditional ML results are stored in a separate CSV file
new_data_file = './Indian_bank_data_results/traditional_ML_res.csv'

# Load the new data into a DataFrame
new_data_df = pd.read_csv(new_data_file)
new_data_df = new_data_df[new_data_df['Sample Size'] != 128]
new_data_df = new_data_df[new_data_df['Model'] != 'RandomPredictor']
# print(new_data_df)

# Drop the specified columns
columns_to_drop = ['Hyperparameter_Tuning', 'Used_Hyperparameters']
new_data_df = new_data_df.drop(columns=columns_to_drop, errors='ignore')
new_data_df.rename(columns={'F1_Score': 'F1 Score'}, inplace=True)

# Ensure column consistency with the existing results DataFrame
# (Add any new columns from `results` if they are not already present in `new_data_df`)
missing_columns = set(results.columns) - set(new_data_df.columns)
for col in missing_columns:
    new_data_df[col] = 'null'  # Fill missing columns with the placeholder string 'null'

# Reorder columns to match the `results` DataFrame
new_data_df = new_data_df[results.columns]

### Handling invalids: new columns
new_data_df["new_AUCROC"] = new_data_df['Prediction'].apply(lambda x: calculate_new_auc(x))
new_data_df["new_PRAUC"] = new_data_df['Prediction'].apply(lambda x: calculate_new_prauc(x))
new_data_df["new_F1 Score"] = new_data_df['Prediction'].apply(lambda x: calculate_new_f1(x))
new_data_df["new_Precision"] = new_data_df['Prediction'].apply(lambda x: calculate_new_precision(x))
new_data_df["new_Recall"] = new_data_df['Prediction'].apply(lambda x: calculate_new_recall(x))

# Append the new data to the existing `results` DataFrame
results = pd.concat([results, new_data_df], ignore_index=True)

# Optional: Save the updated results DataFrame to a new file
# results.to_csv('updated_results.csv', index=False)
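
Filling the missing columns with the string `'null'` turns numeric columns such as `Num Features` into mixed-type object columns (visible in the unique-value checks further down). A possible alternative, shown here only as a sketch and not what this notebook does, is to align the columns with `DataFrame.reindex`, which fills the gaps with `NaN` and keeps numeric columns numeric. The two toy frames are hypothetical stand-ins for `results` and `new_data_df`.

```python
import pandas as pd

# Toy stand-ins for the notebook's `results` and `new_data_df` (hypothetical values)
llm_results = pd.DataFrame({
    'Model': ['7B-GTL-8bit'],
    'Prompt Type': ['t_table'],
    'Num Features': [5.0],
    'F1 Score': [0.10],
})
trad_results = pd.DataFrame({
    'Model': ['XGBoost'],
    'F1 Score': [0.16],
})

# reindex adds the missing columns filled with NaN and matches the column order
trad_aligned = trad_results.reindex(columns=llm_results.columns)
combined = pd.concat([llm_results, trad_aligned], ignore_index=True)

print(combined['Num Features'].dtype)            # float64: the column stays numeric
print(combined['Num Features'].isna().tolist())  # [False, True]
```

The trade-off is that later `groupby` calls would need `dropna=False` to keep the traditional ML rows, and the group labels in the outputs below would show `NaN` instead of `null`.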


In [7]:
results.tail()

Unnamed: 0,Model,Prompt Type,Num Features,Sample Size,Class 1 Proportion,Set ID,Run Number,Accuracy,Precision,Recall,F1 Score,Prediction,new_AUCROC,new_PRAUC,new_F1 Score,new_Precision,new_Recall,PR_AUC,ROC_AUC,invalid_prop
4491,LogisticRegression,,,64.0,0.5,,,,,,0.194444,"[0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, ...",0.47038,0.123885,0.194444,0.118644,0.538462,0.113596,0.415561,
4492,RandomForest,,,64.0,0.5,,,,,,0.222222,"[1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, ...",0.52962,0.137542,0.222222,0.146341,0.461538,0.184821,0.579134,
4493,DecisionTree,,,64.0,0.5,,,,,,0.230088,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...",0.5,0.13,0.230088,0.13,1.0,0.13,0.5,
4494,XGBoost,,,64.0,0.5,,,,,,0.197183,"[1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, ...",0.476127,0.124987,0.197183,0.12069,0.538462,0.192822,0.56145,
4495,LogisticRegression,,,64.0,0.5,,,,,,0.194444,"[0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, ...",0.47038,0.123885,0.194444,0.118644,0.538462,0.113596,0.415561,


In [8]:
# Additional checks:
print('Additional checks:...\n')

## Model
print('Unique values results for Model')
print(results['Model'].unique())

print('')

## Prompt Type
print('Unique values results for Prompt Type')
print(results['Prompt Type'].unique())

print('')

## Num Features
print('Unique values results for Num Features')
print(results['Num Features'].unique())

print('')

## Sample Size
print('Unique values results for Sample Size')
print(results['Sample Size'].unique())

print('')

## F1 Score
# Check if all elements in 'F1 Score' are numeric
print('Are all elements of "F1 Score" numeric?')
print(pd.to_numeric(results['F1 Score'], errors='coerce').notna().all())

print('')

## PR_AUC
print('Are all elements of "PR_AUC" numeric?')
print(pd.to_numeric(results['PR_AUC'], errors='coerce').notna().all())

# Which are the non-numeric values?
# Convert column to numeric, non-numeric values become NaN
results['PR_AUC_is_numeric'] = pd.to_numeric(results['PR_AUC'], errors='coerce')
# Identify rows where the value is NaN (non-numeric)
print(f"Non_numeric_values = {results[results['PR_AUC_is_numeric'].isna()]['PR_AUC'].unique()}")
results = results.drop("PR_AUC_is_numeric", axis=1)

print('')

## ROC_AUC
print('Are all elements of "ROC_AUC" numeric?')
print(pd.to_numeric(results['ROC_AUC'], errors='coerce').notna().all())

# Which are the non-numeric values?
# Convert column to numeric, non-numeric values become NaN
results['ROC_AUC_is_numeric'] = pd.to_numeric(results['ROC_AUC'], errors='coerce')
# Identify rows where the value is NaN (non-numeric)
print(f"Non_numeric_values = {results[results['ROC_AUC_is_numeric'].isna()]['ROC_AUC'].unique()}")
results = results.drop("ROC_AUC_is_numeric", axis=1)

print('')
print('---------------')

print('The new columns do not suffer from this problem:')
print('Are all elements of "new_F1 Score" numeric?')
print(pd.to_numeric(results['new_F1 Score'], errors='coerce').notna().all())

print('')

print('Are all elements of "new_PRAUC" numeric?')
print(pd.to_numeric(results['new_PRAUC'], errors='coerce').notna().all())

print('')

print('Are all elements of "new_AUCROC" numeric?')
print(pd.to_numeric(results['new_AUCROC'], errors='coerce').notna().all())

Additional checks:...

Unique values results for Model
['7B-GTL-unquant' '7B-GTL-8bit' '13B-GTL-8bit' 'RandomForest'
 'DecisionTree' 'XGBoost' 'LogisticRegression']

Unique values results for Prompt Type
['t_table' 't_anony' 'null']

Unique values results for Num Features
[5.0 10.0 20.0 30.0 40.0 'null']

Unique values results for Sample Size
[32. 64.  8. 16.  0.]

Are all elements of "F1 Score" numeric?
True

Are all elements of "PR_AUC" numeric?
False
Non_numeric_values = [nan 'invalid']

Are all elements of "ROC_AUC" numeric?
False
Non_numeric_values = [nan 'invalid']

---------------
The new columns do not suffer from this problem:
Are all elements of "new_F1 Score" numeric?
True

Are all elements of "new_PRAUC" numeric?
True

Are all elements of "new_AUCROC" numeric?
True


In [9]:
# Now that the invalid-value problem is shown to be solved, replace the original metric columns with the recomputed ones
results = results.drop("F1 Score", axis=1)
results = results.drop("Recall", axis=1)
results = results.drop("Precision", axis=1)
results = results.drop("PR_AUC", axis=1)
results = results.drop("ROC_AUC", axis=1)
results = results.rename(columns={'new_Precision': 'Precision', 'new_Recall': 'Recall','new_F1 Score': 'F1 Score', 'new_PRAUC':'PR_AUC', 'new_AUCROC':'ROC_AUC'})

# Define the new order of columns
new_column_order = ['Model', 'Prompt Type', 'Num Features', 'Sample Size', 'Class 1 Proportion', 'Set ID', 'Run Number', 'Accuracy', 'Precision', 'Recall', 'F1 Score', 'Prediction', 'PR_AUC', 'ROC_AUC']

# Reorder columns
results = results[new_column_order]

In [10]:
results.to_csv('./Indian_bank_data_results/aggregated_table_2024_12_01_6pm.csv', index=False)

## 1.1 Summary table

In [11]:
# # results = results[results['Class 1 Proportion'] == 0.5]
# results = df
# results.head()

In [12]:
pd.set_option('display.width', 180)

# Grouping by Model, Prompt Type, and Sample Size, then calculating the mean for F1 Score and ROC_AUC
# Replace 'Invalid' values with NaN for ROC_AUC
results['ROC_AUC'] = pd.to_numeric(results['ROC_AUC'], errors='coerce')

results_summary = results.groupby(['Model', 'Prompt Type', 'Sample Size'], dropna=False).agg(
    F1_Score_mean=('F1 Score', 'mean'),
    ROC_AUC_mean=('ROC_AUC', 'mean')
).reset_index()

# Pivoting the results to create the desired table format
f1_table = results_summary.pivot_table(index=['Prompt Type', 'Model'],
                                       columns='Sample Size',
                                       values='F1_Score_mean',
                                       aggfunc='mean')

roc_auc_table = results_summary.pivot_table(index=['Prompt Type', 'Model'],
                                            columns='Sample Size',
                                            values='ROC_AUC_mean',
                                            aggfunc='mean')

# Displaying the tables
print("F1 Score means:")
print(f1_table)
print("\nROC AUC means:")
print(roc_auc_table)

F1 Score means:
Sample Size                         0.0       8.0       16.0      32.0      64.0
Prompt Type Model                                                               
null        DecisionTree             NaN  0.043491  0.079389  0.118867  0.133856
            LogisticRegression       NaN  0.347454  0.215288  0.155047  0.113595
            RandomForest             NaN  0.036094  0.053404  0.075824  0.092732
            XGBoost                  NaN  0.065304  0.100108  0.150293  0.162467
t_anony     13B-GTL-8bit        0.119103  0.092935  0.110993  0.097328  0.112854
            7B-GTL-8bit         0.130229  0.072619  0.083966  0.097300  0.100534
            7B-GTL-unquant      0.111618  0.068523  0.101922  0.078351  0.110749
t_table     13B-GTL-8bit        0.173315  0.130500  0.124377  0.130099  0.131732
            7B-GTL-8bit         0.145590  0.084342  0.098623  0.114485  0.142783
            7B-GTL-unquant      0.201892  0.095752  0.094629  0.108355  0.123731

ROC AUC mea

In [13]:
print("F1 Score means:")
f1_table

F1 Score means:


Prompt Type,Model,Sample Size=0.0,Sample Size=8.0,Sample Size=16.0,Sample Size=32.0,Sample Size=64.0
,DecisionTree,,0.043491,0.079389,0.118867,0.133856
,LogisticRegression,,0.347454,0.215288,0.155047,0.113595
,RandomForest,,0.036094,0.053404,0.075824,0.092732
,XGBoost,,0.065304,0.100108,0.150293,0.162467
t_anony,13B-GTL-8bit,0.119103,0.092935,0.110993,0.097328,0.112854
t_anony,7B-GTL-8bit,0.130229,0.072619,0.083966,0.0973,0.100534
t_anony,7B-GTL-unquant,0.111618,0.068523,0.101922,0.078351,0.110749
t_table,13B-GTL-8bit,0.173315,0.1305,0.124377,0.130099,0.131732
t_table,7B-GTL-8bit,0.14559,0.084342,0.098623,0.114485,0.142783
t_table,7B-GTL-unquant,0.201892,0.095752,0.094629,0.108355,0.123731


In [14]:
print("ROC AUC means:")
roc_auc_table

ROC AUC means:


Prompt Type,Model,Sample Size=0.0,Sample Size=8.0,Sample Size=16.0,Sample Size=32.0,Sample Size=64.0
,DecisionTree,,0.490824,0.494469,0.519285,0.503551
,LogisticRegression,,0.649131,0.5,0.475538,0.46537
,RandomForest,,0.495334,0.503664,0.511357,0.519403
,XGBoost,,0.494164,0.479757,0.513886,0.515542
t_anony,13B-GTL-8bit,0.347541,0.503478,0.515444,0.49999,0.494193
t_anony,7B-GTL-8bit,0.426628,0.490528,0.499251,0.494,0.485382
t_anony,7B-GTL-unquant,0.297643,0.486083,0.502352,0.448773,0.453404
t_table,13B-GTL-8bit,0.500175,0.505091,0.501926,0.502705,0.50449
t_table,7B-GTL-8bit,0.500295,0.502265,0.49913,0.502184,0.515646
t_table,7B-GTL-unquant,0.532311,0.503869,0.441153,0.447265,0.498497


In [15]:
# Calculate the row-wise mean for each Model in the f1_table DataFrame
f1_table['Row Mean'] = f1_table.mean(axis=1)
print("F1 Score Means:")
f1_table.round(3)

F1 Score Means:


Prompt Type,Model,Sample Size=0.0,Sample Size=8.0,Sample Size=16.0,Sample Size=32.0,Sample Size=64.0,Row Mean
,DecisionTree,,0.043,0.079,0.119,0.134,0.094
,LogisticRegression,,0.347,0.215,0.155,0.114,0.208
,RandomForest,,0.036,0.053,0.076,0.093,0.065
,XGBoost,,0.065,0.1,0.15,0.162,0.12
t_anony,13B-GTL-8bit,0.119,0.093,0.111,0.097,0.113,0.107
t_anony,7B-GTL-8bit,0.13,0.073,0.084,0.097,0.101,0.097
t_anony,7B-GTL-unquant,0.112,0.069,0.102,0.078,0.111,0.094
t_table,13B-GTL-8bit,0.173,0.131,0.124,0.13,0.132,0.138
t_table,7B-GTL-8bit,0.146,0.084,0.099,0.114,0.143,0.117
t_table,7B-GTL-unquant,0.202,0.096,0.095,0.108,0.124,0.125
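
One caveat when reading the Row Mean column: `DataFrame.mean(axis=1)` skips `NaN` cells by default, so the traditional ML rows (no zero-shot column) average four sample sizes while the GTL rows average five. The sketch below reproduces this on two rows copied from the F1 table and shows one possible way to average only over the sample sizes that every row has. The names `toy_f1`, `row_mean_default`, and `row_mean_shared` are hypothetical.

```python
import numpy as np
import pandas as pd

# Two rows copied from the F1 table above; columns are Sample Size values
toy_f1 = pd.DataFrame(
    {0.0: [np.nan, 0.173315], 8.0: [0.347454, 0.130500],
     16.0: [0.215288, 0.124377], 32.0: [0.155047, 0.130099],
     64.0: [0.113595, 0.131732]},
    index=['LogisticRegression', '13B-GTL-8bit (t_table)'],
)

# Default behaviour: NaN cells are skipped, so row 1 averages 4 values and row 2 averages 5
row_mean_default = toy_f1.mean(axis=1)

# Restrict to the sample sizes that every row has before averaging
shared_columns = toy_f1.columns[toy_f1.notna().all(axis=0)]
row_mean_shared = toy_f1[shared_columns].mean(axis=1)

print(row_mean_default.round(3).tolist())
print(row_mean_shared.round(3).tolist())
```

Whether the zero-shot column belongs in the comparison is a judgment call; the point is only that the two kinds of rows are not averaged over the same columns by default.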


# 2. Statistical Analysis

Is there a statistically significant difference in mean F1 score between the levels of each covariate (e.g. Prompt Type, Model, etc.)?



## 2.1 Multi-way ANOVA

First, try ANOVA. It is valid only if the residuals are normally distributed, so test this assumption first (using the Shapiro-Wilk test).

https://business-science.github.io/modeltime/reference/modeltime_residuals_test.html#:~:text=The%20Shapiro%2DWilk%20tests%20the,we%20can%20assume%20the%20normality.
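
As a quick illustration of how the Shapiro-Wilk check reads, the sketch below runs `scipy.stats.shapiro` on two synthetic samples: one drawn from a normal distribution and one from a clearly skewed (exponential) distribution. The samples and variable names are hypothetical; in the notebook the test is applied to the ANOVA residuals instead.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)

# H0: the sample comes from a normal distribution
for name, sample in [('normal', normal_sample), ('skewed', skewed_sample)]:
    stat, p_value = shapiro(sample)
    verdict = 'fail to reject H0' if p_value > 0.05 else 'reject H0'
    print(f"{name}: W = {stat:.4f}, p = {p_value:.3g} ({verdict})")

skewed_stat, skewed_p = shapiro(skewed_sample)
```

With several thousand residuals, as in this notebook, the test can reject on fairly small departures from normality, so a Q-Q plot of the residuals is a useful companion check.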

In [16]:
import numpy as np
from scipy.stats import boxcox
results['Log F1 Score'] = np.log(results['F1 Score'] + 1)
results['Sqrt F1 Score'] = np.sqrt(results['F1 Score'])
results['Boxcox F1 Score'], _ = boxcox(results['F1 Score'] + 1)


In [17]:
import pandas as pd
import numpy as np
from scipy.stats import shapiro
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import matplotlib.pyplot as plt


In [18]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Perform ANOVA
formula = 'Q("F1 Score") ~ C(Q("Prompt Type")) + C(Q("Model")) + C(Q("Sample Size")) + C(Q("Num Features")) + C(Q("Class 1 Proportion"))'
# formula = 'Q("Log F1 Score") ~ C(Q("Prompt Type")) + C(Q("Model")) + C(Q("Sample Size")) + C(Q("Num Features")) + C(Q("Class 1 Proportion"))'
# formula = 'Q("Sqrt F1 Score") ~ C(Q("Prompt Type")) + C(Q("Model")) + C(Q("Sample Size")) + C(Q("Num Features")) + C(Q("Class 1 Proportion"))'
# formula = 'Q("Boxcox F1 Score") ~ C(Q("Prompt Type")) + C(Q("Model")) + C(Q("Sample Size")) + C(Q("Num Features")) + C(Q("Class 1 Proportion"))'
model = ols(formula, data=results).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

residuals = model.resid

from scipy.stats import shapiro

stat, p_value = shapiro(residuals)
print(f"Shapiro-Wilk Test Statistic: {stat}, p-value: {p_value}")

if p_value > 0.05:
    print("Residuals appear to be normally distributed (fail to reject H0).")
else:
    print("With F1 Score: Residuals do not appear to be normally distributed (reject H0).")

                               sum_sq      df           F         PR(>F)
C(Q("Prompt Type"))          0.001569     2.0    0.110165   7.399722e-01
C(Q("Model"))                0.004708     6.0    0.110165   7.399722e-01
C(Q("Sample Size"))          0.939711     4.0   32.983575   3.786209e-27
C(Q("Num Features"))         0.003923     5.0    0.110165   7.399722e-01
C(Q("Class 1 Proportion"))  11.736818     2.0  823.917527  2.061447e-305
Residual                    31.894861  4478.0         NaN            NaN
Shapiro-Wilk Test Statistic: 0.9859987098901093, p-value: 1.0172104671505768e-20
With F1 Score: Residuals do not appear to be normally distributed (reject H0).




Since the untransformed F1 scores yield residuals that are not normally distributed, we try transforming the F1 scores (log, square root, Box-Cox) and refit the model on each transformed response.

https://www.researchgate.net/post/Shapiro-Wilk-normality-test-failed-What-should-I-do
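
For reference, here is a small sketch of the three transformations on a toy array. The values in `toy_f1_scores` are hypothetical. The `+ 1` shift matters because F1 can be exactly 0: the log is undefined there, and `scipy.stats.boxcox` requires strictly positive input.

```python
import numpy as np
from scipy.stats import boxcox

toy_f1_scores = np.array([0.0, 0.05, 0.1, 0.2, 0.35, 0.6])  # hypothetical F1 values

log_f1 = np.log1p(toy_f1_scores)   # same as np.log(x + 1), defined at x = 0
sqrt_f1 = np.sqrt(toy_f1_scores)   # defined at 0, no shift needed
# Box-Cox needs strictly positive input; lambda is fitted by maximum likelihood
boxcox_f1, fitted_lambda = boxcox(toy_f1_scores + 1)

print(np.allclose(log_f1, np.log(toy_f1_scores + 1)))  # True
print(fitted_lambda)
```

All three are monotone transformations, so they change the shape of the distribution but not the ordering of the scores.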

In [19]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Perform ANOVA
# formula = 'Q("F1 Score") ~ C(Q("Prompt Type")) + C(Q("Model")) + C(Q("Sample Size")) + C(Q("Num Features")) + C(Q("Class 1 Proportion"))'
formula = 'Q("Log F1 Score") ~ C(Q("Prompt Type")) + C(Q("Model")) + C(Q("Sample Size")) + C(Q("Num Features")) + C(Q("Class 1 Proportion"))'
# formula = 'Q("Sqrt F1 Score") ~ C(Q("Prompt Type")) + C(Q("Model")) + C(Q("Sample Size")) + C(Q("Num Features")) + C(Q("Class 1 Proportion"))'
# formula = 'Q("Boxcox F1 Score") ~ C(Q("Prompt Type")) + C(Q("Model")) + C(Q("Sample Size")) + C(Q("Num Features")) + C(Q("Class 1 Proportion"))'
model = ols(formula, data=results).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

residuals = model.resid

from scipy.stats import shapiro

stat, p_value = shapiro(residuals)
print(f"Shapiro-Wilk Test Statistic: {stat}, p-value: {p_value}")

if p_value > 0.05:
    print("Residuals appear to be normally distributed (fail to reject H0).")
else:
    print("With Log F1 Score: Residuals do not appear to be normally distributed (reject H0).")

                               sum_sq      df           F         PR(>F)
C(Q("Prompt Type"))          0.001321     2.0    0.119854   7.292096e-01
C(Q("Model"))                0.003964     6.0    0.119854   7.292096e-01
C(Q("Sample Size"))          0.819921     4.0   37.187767   1.222481e-30
C(Q("Num Features"))         0.003303     5.0    0.119854   7.292096e-01
C(Q("Class 1 Proportion"))   9.595032     2.0  870.370976  4.756370e-320
Residual                    24.682897  4478.0         NaN            NaN
Shapiro-Wilk Test Statistic: 0.9888962295231244, p-value: 2.482746315325354e-18
With Log F1 Score: Residuals do not appear to be normally distributed (reject H0).




In [20]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Perform ANOVA
# formula = 'Q("F1 Score") ~ C(Q("Prompt Type")) + C(Q("Model")) + C(Q("Sample Size")) + C(Q("Num Features")) + C(Q("Class 1 Proportion"))'
# formula = 'Q("Log F1 Score") ~ C(Q("Prompt Type")) + C(Q("Model")) + C(Q("Sample Size")) + C(Q("Num Features")) + C(Q("Class 1 Proportion"))'
formula = 'Q("Sqrt F1 Score") ~ C(Q("Prompt Type")) + C(Q("Model")) + C(Q("Sample Size")) + C(Q("Num Features")) + C(Q("Class 1 Proportion"))'
# formula = 'Q("Boxcox F1 Score") ~ C(Q("Prompt Type")) + C(Q("Model")) + C(Q("Sample Size")) + C(Q("Num Features")) + C(Q("Class 1 Proportion"))'
model = ols(formula, data=results).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

residuals = model.resid

from scipy.stats import shapiro

stat, p_value = shapiro(residuals)
print(f"Shapiro-Wilk Test Statistic: {stat}, p-value: {p_value}")

if p_value > 0.05:
    print("Residuals appear to be normally distributed (fail to reject H0).")
else:
    print("With Sqrt F1 Score: Residuals do not appear to be normally distributed (reject H0).")

                                sum_sq      df            F        PR(>F)
C(Q("Prompt Type"))           0.008820     2.0     0.167224  6.826106e-01
C(Q("Model"))                 0.026460     6.0     0.167224  6.826106e-01
C(Q("Sample Size"))           6.820267     4.0    64.654701  3.120572e-53
C(Q("Num Features"))          0.022050     5.0     0.167224  6.826106e-01
C(Q("Class 1 Proportion"))   57.533686     2.0  1090.814561  0.000000e+00
Residual                    118.093330  4478.0          NaN           NaN
Shapiro-Wilk Test Statistic: 0.994301305290452, p-value: 2.427653997090939e-12
With Sqrt F1 Score: Residuals do not appear to be normally distributed (reject H0).




In [21]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Perform ANOVA
# formula = 'Q("F1 Score") ~ C(Q("Prompt Type")) + C(Q("Model")) + C(Q("Sample Size")) + C(Q("Num Features")) + C(Q("Class 1 Proportion"))'
# formula = 'Q("Log F1 Score") ~ C(Q("Prompt Type")) + C(Q("Model")) + C(Q("Sample Size")) + C(Q("Num Features")) + C(Q("Class 1 Proportion"))'
# formula = 'Q("Sqrt F1 Score") ~ C(Q("Prompt Type")) + C(Q("Model")) + C(Q("Sample Size")) + C(Q("Num Features")) + C(Q("Class 1 Proportion"))'
formula = 'Q("Boxcox F1 Score") ~ C(Q("Prompt Type")) + C(Q("Model")) + C(Q("Sample Size")) + C(Q("Num Features")) + C(Q("Class 1 Proportion"))'
model = ols(formula, data=results).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

residuals = model.resid

from scipy.stats import shapiro

stat, p_value = shapiro(residuals)
print(f"Shapiro-Wilk Test Statistic: {stat}, p-value: {p_value}")

if p_value > 0.05:
    print("Residuals appear to be normally distributed (fail to reject H0).")
else:
    print("With Boxcox F1 Score: Residuals do not appear to be normally distributed (reject H0).")

                               sum_sq      df           F        PR(>F)
C(Q("Prompt Type"))          0.001001     2.0    0.134977  7.133443e-01
C(Q("Model"))                0.003004     6.0    0.134977  7.133443e-01
C(Q("Sample Size"))          0.659094     4.0   44.416769  1.266185e-36
C(Q("Num Features"))         0.002504     5.0    0.134977  7.133443e-01
C(Q("Class 1 Proportion"))   6.970722     2.0  939.523441  0.000000e+00
Residual                    16.612088  4478.0         NaN           NaN
Shapiro-Wilk Test Statistic: 0.9911519734566653, p-value: 3.8056789082053296e-16
With Boxcox F1 Score: Residuals do not appear to be normally distributed (reject H0).




## 2.2 Normality assumption failed: use non-parametric tests

Since the normality assumption failed, we use non-parametric tests rather than parametric tests like ANOVA/Tukey. These check for significant differences in median (not mean) F1 between the levels of a covariate.

## 2.3 Multi-way Kruskal-Wallis Test
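
Kruskal-Wallis is a one-way, rank-based test, so "multi-way" here means running one separate test per covariate; each test ignores the other covariates. The sketch below shows a single such test on hypothetical toy data (the `toy` frame and its values are made up for illustration).

```python
import pandas as pd
from scipy.stats import kruskal

# Hypothetical toy data: F1 scores for three levels of one covariate
toy = pd.DataFrame({
    'Sample Size': [8] * 5 + [16] * 5 + [64] * 5,
    'F1 Score': [0.05, 0.07, 0.06, 0.08, 0.04,
                 0.09, 0.11, 0.10, 0.12, 0.08,
                 0.20, 0.22, 0.19, 0.25, 0.21],
})

# One array of F1 scores per level, then a single one-way test across the levels
groups = [g['F1 Score'].to_numpy() for _, g in toy.groupby('Sample Size')]
kw_stat, kw_p = kruskal(*groups)
print(f"H = {kw_stat:.3f}, p = {kw_p:.4f}")
```

A small p-value only says that at least one level differs from the others; it does not say which one, which is why a post-hoc test follows.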

In [22]:
from scipy.stats import kruskal

# Models to keep; the t_anony runs are excluded below because that prompt type performed worse
models_to_keep = ['13B-GTL-8bit', 'LogisticRegression', 'DecisionTree', 'RandomForest', 'XGBoost']
# Filter the DataFrame
results_kwt = results[(results['Model'].isin(models_to_keep)) & (results['Prompt Type'] != 't_anony')]

# # List of models to exclude
# models_to_keep = ['LogisticRegression', 'DecisionTree', 'RandomForest', 'XGBoost', 'RandomPredictor']
# # Filter the DataFrame
# results_kwt = results[~results['Model'].isin(models_to_keep)]

# Run a separate one-way Kruskal-Wallis test for each covariate
covariates = ['Prompt Type', 'Model', 'Sample Size', 'Num Features', 'Class 1 Proportion']

for covariate in covariates:
    groups = [group['F1 Score'].values for _, group in results_kwt.groupby(covariate)]
    stat, p_value = kruskal(*groups)
    print(f"Kruskal-Wallis Test for {covariate}:")
    print(f"  Statistic = {stat}, p-value = {p_value}")
    if p_value < 0.05:
        print(f"  Significant differences exist in {covariate}\n")
    else:
        print(f"  No significant differences in {covariate}\n")


Kruskal-Wallis Test for Prompt Type:
  Statistic = 19.882868684511795, p-value = 8.233483741333632e-06
  Significant differences exist in Prompt Type

Kruskal-Wallis Test for Model:
  Statistic = 285.2688735331242, p-value = 1.628957518965856e-60
  Significant differences exist in Model

Kruskal-Wallis Test for Sample Size:
  Statistic = 44.03083093938333, p-value = 6.3218671299883786e-09
  Significant differences exist in Sample Size

Kruskal-Wallis Test for Num Features:
  Statistic = 34.27986845720759, p-value = 2.0942321379412995e-06
  Significant differences exist in Num Features

Kruskal-Wallis Test for Class 1 Proportion:
  Statistic = 466.9274366520697, p-value = 4.055042463332642e-102
  Significant differences exist in Class 1 Proportion
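
A significant Kruskal-Wallis result says that at least one level differs. It does not say which level, or in which direction. A per-level summary of medians makes the direction visible. The sketch below uses a made-up DataFrame that only mimics the `Model` and `F1 Score` column names of `results_kwt`; the values are invented for illustration.

```python
import pandas as pd

# Toy data with invented values; only the column names follow results_kwt
toy = pd.DataFrame({
    "Model": ["A", "A", "A", "B", "B", "B"],
    "F1 Score": [0.50, 0.60, 0.70, 0.80, 0.85, 0.90],
})

# Count, median and mean of F1 per level, ordered from best to worst median
summary = (
    toy.groupby("Model")["F1 Score"]
       .agg(n="count", median="median", mean="mean")
       .sort_values("median", ascending=False)
)
print(summary)
```

Swapping `toy` for `results_kwt` and `"Model"` for any entry of `covariates` would give the same kind of table for the real data.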



## 2.4 Dunn's test for pairwise comparison of differences in F1 median values, between levels in a covariate

In [23]:
!pip install scikit-posthocs
import scikit_posthocs as sp
from scipy.stats import kruskal

pd.set_option('display.width', 180)  # widen the output so the p-value matrices do not wrap

# Covariates to test; Dunn's test is run only when Kruskal-Wallis is significant
covariates = ['Prompt Type', 'Model', 'Sample Size', 'Num Features', 'Class 1 Proportion']

for covariate in covariates:
    # Group F1 Scores by the current covariate
    groups = [group['F1 Score'].values for _, group in results_kwt.groupby(covariate)]
    stat, p_value = kruskal(*groups)

    print(f"Kruskal-Wallis Test for {covariate}:")
    print(f"  Statistic = {stat}, p-value = {p_value}")

    if p_value < 0.05:
        print(f"  Significant differences exist in {covariate}. Performing Dunn's Test for post-hoc analysis...\n")

        # Perform Dunn's Test
        dunn_results = sp.posthoc_dunn(
            results_kwt,
            val_col='F1 Score',
            group_col=covariate,
            p_adjust='bonferroni'  # Adjust for multiple comparisons
        )
        print(f"Dunn's Test Results for {covariate}:\n")
        print(dunn_results)
    else:
        print(f"  No significant differences in {covariate}\n")

Collecting scikit-posthocs
  Downloading scikit_posthocs-0.11.2-py3-none-any.whl.metadata (5.8 kB)
Downloading scikit_posthocs-0.11.2-py3-none-any.whl (33 kB)
Installing collected packages: scikit-posthocs
Successfully installed scikit-posthocs-0.11.2
Kruskal-Wallis Test for Prompt Type:
  Statistic = 19.882868684511795, p-value = 8.233483741333632e-06
  Significant differences exist in Prompt Type. Performing Dunn's Test for post-hoc analysis...

Dunn's Test Results for Prompt Type:

             null   t_table
null     1.000000  0.000008
t_table  0.000008  1.000000
Kruskal-Wallis Test for Model:
  Statistic = 285.2688735331242, p-value = 1.628957518965856e-60
  Significant differences exist in Model. Performing Dunn's Test for post-hoc analysis...

Dunn's Test Results for Model:

                    13B-GTL-8bit  DecisionTree  LogisticRegression  RandomForest       XGBoost
13B-GTL-8bit        1.000000e+00  2.522103e-10        3.331282e-13  2.308539e-25  7.909017e-02
DecisionTree     
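
The matrix returned by `sp.posthoc_dunn` is symmetric with ones on the diagonal, so each pair of levels appears twice. A small sketch for flattening such a matrix into one row per pair follows. The helper name `significant_pairs` and the 3x3 matrix values are hypothetical and are not the results above.

```python
import pandas as pd

def significant_pairs(p_matrix, alpha=0.05):
    """Flatten a symmetric p-value matrix into one row per unordered pair."""
    labels = list(p_matrix.index)
    rows = []
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:  # upper triangle only, skipping the diagonal
            p = p_matrix.loc[a, b]
            rows.append({"level_a": a, "level_b": b, "p_adj": p, "significant": p < alpha})
    return pd.DataFrame(rows)

# Hypothetical adjusted p-values, shaped like a posthoc_dunn result
levels = ["A", "B", "C"]
toy_p = pd.DataFrame(
    [[1.0, 0.001, 0.20],
     [0.001, 1.0, 0.03],
     [0.20, 0.03, 1.0]],
    index=levels, columns=levels,
)
pairs = significant_pairs(toy_p)
print(pairs)
```

Passing `dunn_results` instead of `toy_p` would list the pairs for a real covariate, with the Bonferroni-adjusted p-values already in place.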

## 2.5 Permutation Test for multiple covariates (more thorough/complex test of the combined effect of the covariates; the model below has no interaction terms)

Can be future work

In [24]:
from sklearn.model_selection import permutation_test_score
from sklearn.linear_model import LinearRegression
import time

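# Prepare data for permutation test (uses the full `results` table, not the filtered `results_kwt`)
# Note: get_dummies only one-hot encodes object/category columns; covariates stored as numbers pass through unchanged
X = pd.get_dummies(results[['Prompt Type', 'Model', 'Sample Size', 'Num Features', 'Class 1 Proportion']], drop_first=True)
y = results['F1 Score']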

# Fit a regression model for comparison
model = LinearRegression()

t0 = time.time()

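# Perform permutation test: compare the cross-validated R^2 (default 5-fold CV) on the real
# targets against the R^2 obtained after shuffling y, repeated n_permutations times
score, permutation_scores, p_value = permutation_test_score(
    model, X, y, scoring='r2', n_permutations=1000, random_state=42
)

print("Permutation Test:")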
print(f"  Observed R^2 = {score}")
print(f"  p-value = {p_value}")
if p_value < 0.05:
    print("  Significant combined effects of covariates.")
else:
    print("  No significant combined effects of covariates.")

print('time taken (s):', time.time()-t0)

Permutation Test:
  Observed R^2 = 0.006160379410725514
  p-value = 0.000999000999000999
  Significant combined effects of covariates.
time taken (s): 176.6058692932129
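
scikit-learn documents the permutation p-value as (C + 1) / (n_permutations + 1), where C is the number of permutations whose score is at least the observed score. With 1000 permutations the smallest attainable value is therefore 1/1001, about 0.000999, which is exactly the p-value printed above: no permuted score reached the observed R^2. The sketch below reproduces that arithmetic. The helper name `permutation_p_value` and the permuted scores are made up for illustration.

```python
import numpy as np

def permutation_p_value(observed, permuted_scores):
    """p-value as (C + 1) / (n_permutations + 1), C = count of permuted scores >= observed."""
    permuted_scores = np.asarray(permuted_scores)
    c = int(np.sum(permuted_scores >= observed))
    return (c + 1) / (len(permuted_scores) + 1)

# Hypothetical case: all 1000 permuted scores fall below the observed score
p_min = permutation_p_value(0.006, np.full(1000, -0.01))
print(p_min)

# Hypothetical case: 49 of 1000 permuted scores reach the observed score
mixed = np.concatenate([np.full(49, 0.01), np.full(951, -0.01)])
p_mixed = permutation_p_value(0.006, mixed)
print(p_mixed)
```

Because the p-value is bounded below by 1 / (n_permutations + 1), a result at that floor only shows that the effect is detectable. The observed R^2 is still the number that describes how much variance the covariates explain.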
