This notebook evaluates the results from the user study. Set `FILL_DATABASE` to `True` to create a mock database. Set it to `False` to use the results from an existing database.

If mocking data, this notebook expects the server to be running at `server_url` with an empty database, as set up by running `node setup.mjs`.
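
Before mocking, it can help to confirm that the response tables really are empty. The following is a minimal sketch, assuming the table names `responses_phase_2` and `responses_phase_4` that this notebook queries; the helper `count_rows` is hypothetical, and the demo runs against an in-memory database with an invented schema rather than the real `db.db`:

```python
import sqlite3


def count_rows(connection, table):
    """Return the number of rows in `table` (hypothetical helper, not part of the study code)."""
    # Table names cannot be bound as SQL parameters, so only pass trusted names here.
    return connection.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Demo on an in-memory database; the real schema is created by setup.mjs and may differ.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE responses_phase_2 (user_id INTEGER, document_nr INTEGER, label INTEGER)")
demo.execute("CREATE TABLE responses_phase_4 (user_id INTEGER, document_nr INTEGER, label INTEGER)")

empty = all(count_rows(demo, t) == 0 for t in ("responses_phase_2", "responses_phase_4"))
print("database is empty:", empty)
```

For the real study database, the same helper could be called on a connection opened with `sqlite3.connect(SQLITE_DB)`.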

In [1]:
USER_STUDY_CSV = "./results/user_study/selection.csv"
SQLITE_DB = "../survey/db.db"
FILL_DATABASE = True # if True, data is mocked, THIS CALLS THE APIs

server_url = "http://localhost:3002" # server to mock data on


In [2]:
100000 == 1000*100

True

In [3]:

import krippendorff
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
from scipy.stats import ttest_1samp, ttest_rel

In [4]:
explainers = ["LIME", "SHAP", "Anchor"]

Adapted from Hase et al. 2020. Main points:

This is a within-subject design with 4 phases: (1) Predictions only, (2) Pre-learn test, (3) Teaching: Predictions + Explanations, (4) Eval.

Phases 1 and 3 share a set of documents, as do phases 2 and 4.

Result reported by Hase et al.: the **average change** in user accuracy per explanation method (phase 2 vs. phase 4), with confidence intervals and p-values for the mean.

Additional details by Hase et al.:
- Balance data "by model correctness" so random guessing can't succeed: *"we ensure that true positives, false positives, true negatives, and false negatives are equally represented in the inputs. [...] We confirm user understanding of the data balancing in our screening test"*
- Forced choice, to not "favor overly niche explanations" (like in Ribeiro et al.)
- Separate teach and test phases
- Pre-prediction phase to obtain a baseline
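
The reported quantity, the average per-user change in accuracy between phase 2 and phase 4 together with a confidence interval, can be sketched as follows. This is a minimal illustration with made-up accuracies and a percentile bootstrap; the function name `mean_change_ci` and all numbers are assumptions, and Hase et al. may compute their intervals differently:

```python
import numpy as np


def mean_change_ci(acc_pre, acc_post, n_boot=10_000, alpha=0.05, seed=0):
    """Mean paired change in accuracy with a percentile-bootstrap confidence interval."""
    diff = np.asarray(acc_post, dtype=float) - np.asarray(acc_pre, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample users (not documents) with replacement, since users are the paired units.
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    boot_means = diff[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diff.mean(), lower, upper


# Made-up per-user accuracies for illustration only (phase 2 vs. phase 4).
acc_phase_2 = [0.50, 0.50, 0.50, 0.50]
acc_phase_4 = [0.60, 0.70, 0.50, 0.60]

mean_change, ci_low, ci_high = mean_change_ci(acc_phase_2, acc_phase_4)
print(f"mean change: {mean_change:.2f}, 95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
```

Working on per-user differences keeps the within-subject pairing: each participant serves as their own baseline.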



In [5]:
import sqlite3
import pandas as pd
connection = sqlite3.connect(SQLITE_DB)

user_df = pd.read_sql_query("SELECT * FROM users", connection)

In [6]:
df_user_study = pd.read_csv(USER_STUDY_CSV)

## Fill Database
If `FILL_DATABASE` is True, the cells below simulate participants by calling the server's API.

In [7]:
%%writefile run_user.py

import json

import numpy as np
import requests


def run_user(idx, user, url, df_user_study):
    """Simulate one participant by driving the survey server's HTTP API."""
    # Parameters of the simulated participant
    mu_got_it_right_pre = 0.5  # mean probability of agreeing with the detector before teaching
    sigma_got_it_right_pre = 0.05
    mu_gain = 0.1  # mean change in that probability after teaching
    sigma_gain = 0.1

    def guess(detector_label, p):
        # Return the detector's label with probability p, otherwise the opposite label
        return detector_label if bool(np.random.choice([0, 1], p=[1 - p, p])) else not detector_label

    user_dist_without = lambda: np.clip(np.random.normal(mu_got_it_right_pre, sigma_got_it_right_pre, 1)[0], 0, 1)
    user_dist_gain = lambda: np.clip(np.random.normal(mu_gain, sigma_gain, 1)[0], -1, 1)

    res = requests.get(url + "/auth/" + user["access_token"])

    print(res.text)
    auth_token = json.loads(res.text)
    headers = {"Content-Type": "application/json", "Authorization": "Bearer " + auth_token}

    requests.post(url+"/api/submitParticipantInfo", json={
    "has_seen_explanation_methods_before": "yes",
    "has_seen_OTHERS_before": "yes",
    "level_of_expertise": "is-researcher-explainability",
    "familiarity_with_chatgpt": "occasional-use",
    "prefers_monochromatic_methods": "yes" if idx % 20 == 0 else "no"
    }, headers=headers)
    # go to phase 2
    requests.post(url+"/api/completeCurrentPhase", json={"expected": 0}, headers=headers)
    # if idx % 8 == 0:
    #     return
    requests.post(url+"/api/completeCurrentPhase", json={"expected": 1}, headers=headers)
    requests.post(url+"/api/completeCurrentPhase", json={"expected": 2}, headers=headers)

    res = requests.get(url+"/api/state", headers=headers)
    state = json.loads(res.text)

    # user_df = pd.read_sql_query("SELECT * FROM users", connection) # update as group is assigned now
    # user = user_df.iloc[idx]
   # print(user[["detector", "explainer"]])
    # return state
    df_user_documents = df_user_study.loc[df_user_study.groupby("Detector").groups[state["detector"]],:].reset_index(drop=True)
    for doc_nr, row in df_user_documents.iterrows():
        p_without = user_dist_without()
        requests.post(url+"/api/submitPhase2", json={"ID": doc_nr, "label": guess(row["f(b)"], p_without)}, headers=headers)
    requests.post(url+"/api/completeCurrentPhase", json={"expected": 3}, headers=headers)

    for doc_nr, row in df_user_documents.iterrows():
        json_ = {"lickert-q{}-{}".format(question_nr, doc_nr): str(np.random.choice([1,2,3,4,5], p=[0.1,0.2,0.1,0.4,0.2])) for question_nr in range(1,4)}
        json_["document_nr"] = doc_nr
        requests.post(url+"/api/submitPhase3", json=json_, headers=headers)

    requests.post(url+"/api/completeCurrentPhase", json={"expected": 4}, headers=headers)
    for doc_nr, row in df_user_documents.iterrows():
        # NOTE: p_without still holds the value drawn for the last phase-2 document
        p_with = np.clip(p_without + user_dist_gain(), 0,1)
        requests.post(url+"/api/submitPhase4", json={"ID": doc_nr, "label": guess(row["f(b)"], p_with)}, headers=headers)
    requests.post(url+"/api/completeCurrentPhase", json={"expected": 5}, headers=headers)


Overwriting run_user.py


In [8]:
from tqdm import tqdm
from multiprocess import Pool
from run_user import run_user


if FILL_DATABASE:
    max_pool = 10
    mock_user_data = [(idx, user, server_url, df_user_study) for idx, user in user_df.iterrows() if idx < 27]

    with Pool(max_pool) as p:
        pool_outputs = list(tqdm(p.starmap(run_user,mock_user_data),total=len(mock_user_data)))    
    print(pool_outputs)


100%|██████████| 27/27 [00:00<?, ?it/s]

[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]





In [9]:
user_df = pd.read_sql_query("SELECT * FROM users where current_phase = 5", connection) # update df from database

In [10]:
user_df.groupby(["detector", "explainer"])["ID"].count()

detector           explainer       
DetectorDetectGPT  Anchor_Explainer    3
                   LIME_Explainer      3
                   SHAP_Explainer      3
DetectorGuo        Anchor_Explainer    3
                   LIME_Explainer      3
                   SHAP_Explainer      3
DetectorRadford    Anchor_Explainer    3
                   LIME_Explainer      3
                   SHAP_Explainer      3
Name: ID, dtype: int64

## Evaluation

In [11]:
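def get_most_recent_responses(df_user_responses, df_user_study):
    """Compare one user's most recent response per document with the detector's predictions.

    Returns a one-row DataFrame with the counts TP, TN, FP, FN and the user accuracy,
    where the detector's prediction is the reference label.
    """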
    detector = df_user_responses.iloc[0]["detector"]

    df_user_documents = df_user_study.loc[df_user_study.groupby("Detector").groups[detector],:].reset_index(drop=True)
    detector_predictions = df_user_documents["f(b)"].astype(bool)

    user_responses = df_user_responses.loc[df_user_responses.groupby("document_nr")["timestamp"].idxmax()].set_index("document_nr")["label"].astype(bool) # only keep most recent response
    #display(user_responses.join(detector_predictions))
    TP = ((detector_predictions) & (user_responses)).sum()
    FP = ((~detector_predictions) & (user_responses)).sum()

    TN = ((~detector_predictions) & (~user_responses)).sum()
    FN = ((detector_predictions) & (~user_responses)).sum()

    acc = (TP+TN) / (TP+FP+TN+FN)
    print("acc", acc)
    print("TP", TP)
    print("FP", FP)
    print("TN", TN)
    print("FN", FN)


    assert sum([TP,FP,TN,FN]) == len(detector_predictions), "Check that input is bool"
    assert (acc ==(user_responses == detector_predictions).sum() / len(detector_predictions)), "Check that input is bool: acc"

    return pd.DataFrame([(TP,TN,FP,FN, acc)], columns=["TP","TN","FP","FN", "User Accuracy"])
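
The "most recent response" selection above relies on `groupby(...).idxmax()`. A small sketch with a made-up response table (the column names follow the ones used above; the values are invented) shows that a participant's earlier answer for a document is discarded:

```python
import pandas as pd

# Invented responses: document 0 was answered twice, document 1 once.
responses = pd.DataFrame({
    "document_nr": [0, 0, 1],
    "timestamp": [10, 20, 15],
    "label": [1, 0, 1],
})

# idxmax returns, per document, the row index of the latest timestamp.
latest_rows = responses.groupby("document_nr")["timestamp"].idxmax()

# Keep only those rows and index the labels by document number.
latest = responses.loc[latest_rows].set_index("document_nr")["label"].astype(bool)
print(latest)
```

For document 0 only the answer with timestamp 20 survives, so a participant who revises an answer is scored on the revision.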

In [12]:
from sklearn.metrics import confusion_matrix

In [13]:
def get_is_correct_phase(df_user_responses, df_user_study):
    """Per document: True where the user's most recent label and the detector's prediction are both True.

    Note: documents where both labels are False also yield False here.
    """
    detector = df_user_responses.iloc[0]["detector"]
    df_user_documents = df_user_study.loc[df_user_study.groupby("Detector").groups[detector],:].reset_index(drop=True)
    detector_predictions = df_user_documents["f(b)"].astype(bool)
    user_responses = df_user_responses.loc[df_user_responses.groupby("document_nr")["timestamp"].idxmax()].set_index("document_nr")["label"].astype(bool) # only keep most recent response

    return user_responses & detector_predictions
   

In [14]:
u = user_df.set_index("ID").rename_axis("user_id")[["explainer", "detector"]]


In [15]:
df_phase_2 = pd.read_sql_query("SELECT responses_phase_2.*, users.detector, users.explainer FROM responses_phase_2 INNER JOIN users ON responses_phase_2.user_id = users.ID", connection)
df_phase_4 = pd.read_sql_query("SELECT responses_phase_4.*, users.detector, users.explainer FROM responses_phase_4 INNER JOIN users ON responses_phase_4.user_id = users.ID", connection)

is_correct_phase_4 = df_phase_4.groupby(["user_id"]).apply(lambda df_user_responses : get_is_correct_phase(df_user_responses,df_user_study))
is_correct_phase_2 = df_phase_2.groupby(["user_id"]).apply(lambda df_user_responses : get_is_correct_phase(df_user_responses,df_user_study))

# TN, FP
# FN, TP



In [16]:
is_correct_phase_2

user_id \ document_nr,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
1,False,True,True,False,False,False,False,False,False,False,False,False,True,True,False,False,False,False
2,True,True,False,False,False,False,False,True,False,False,False,False,True,False,True,False,False,False
3,True,False,True,False,False,False,False,True,False,False,False,False,True,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
5,True,True,True,False,False,False,True,False,True,False,False,False,True,False,True,False,False,False
6,True,False,False,False,False,False,True,False,True,False,False,False,True,False,False,False,False,False
7,False,False,False,False,False,False,True,True,True,False,False,False,True,False,True,False,False,False
8,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False,False,False
10,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False


In [17]:
a = ((is_correct_phase_2 == 0) & (is_correct_phase_4 == 0)).sum(axis=1).rename("a")
b = ((is_correct_phase_2 == 0) & (is_correct_phase_4 == 1)).sum(axis=1).rename("b")
c = ((is_correct_phase_2 == 1) & (is_correct_phase_4 == 0)).sum(axis=1).rename("c")
d = ((is_correct_phase_2 == 1) & (is_correct_phase_4 == 1)).sum(axis=1).rename("d")
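
The four counts are the cells of a paired 2×2 table; `b` and `c` are the discordant pairs that McNemar's test compares. The following is a minimal sketch with invented boolean outcomes, using only the standard library. `paired_counts` and `mcnemar_exact` are hypothetical helpers, and the exact test is written here as a two-sided binomial test on the discordant pairs:

```python
from math import comb


def paired_counts(before, after):
    """Tally the paired outcomes: a (0,0), b (0,1), c (1,0), d (1,1)."""
    a = sum((not x) and (not y) for x, y in zip(before, after))
    b = sum((not x) and y for x, y in zip(before, after))
    c = sum(x and (not y) for x, y in zip(before, after))
    d = sum(x and y for x, y in zip(before, after))
    return a, b, c, d


def mcnemar_exact(b, c):
    """Two-sided exact p-value: under H0 each discordant pair is b or c with probability 0.5."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented outcomes for six documents (phase 2 vs. phase 4).
before = [False, False, True, True, False, False]
after = [False, True, False, True, True, True]

counts = paired_counts(before, after)
p_value = mcnemar_exact(counts[1], counts[2])
print("a, b, c, d =", counts, " exact p =", p_value)
```

The concordant cells `a` and `d` do not enter the test at all; only the imbalance between `b` and `c` matters.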


In [18]:
from statsmodels.stats.contingency_tables import mcnemar 


In [40]:
results = []
for explainer, _ in u.groupby("explainer"):
    matrix = np.array([u.join(x).loc[(u.join(x)["explainer"] == explainer), x.name].sum() for x in [a,b,c,d]]).reshape(2,2)
    #display(u.join(a).loc[(u.join(a)["explainer"] == explainer)])
 #   Y_0_2 = u.join(~is_correct_phase_2).loc[(u["explainer"] == explainer) ,0:].sum(axis=1).sum()
    Y_1_2 = u.join(is_correct_phase_2).loc[(u["explainer"] == explainer), 0:].sum(axis=1).sum()
 #   Y_0_4 = u.join(~is_correct_phase_4).loc[(u["explainer"] == explainer) , 0:].sum(axis=1).sum()
    Y_1_4 = u.join(is_correct_phase_4).loc[(u["explainer"] == explainer) , 0:].sum(axis=1).sum()
 #   marginal_frequencies = np.array([[Y_0_2, Y_1_2],[Y_0_4, Y_1_4]])
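    relative_increase = Y_1_4 / Y_1_2  # ratio of the summed True entries, phase 4 vs. phase 2
    (a_,b_),(c_,d_) = matrix
    # NOTE: despite its name, this is the standard error (square root of the variance) of log(relative_increase)
    variance_log_increase = np.sqrt((b_ + c_) / ((b_ + d_)*(c_ + d_)))
    # 95% confidence interval, constructed on the log scale
    ci_lower_rTPF = relative_increase * np.exp(-1.96 * variance_log_increase)
    ci_upper_rTPF = relative_increase * np.exp(+1.96 * variance_log_increase)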

    agreement = (d_*matrix.sum()) / ((b_+d_)*(c_+d_))

    m = mcnemar(matrix)

    results.append((explainer,  #matrix, marginal_frequencies, 
                    (relative_increase-1)*100, variance_log_increase, ci_lower_rTPF,ci_upper_rTPF, (relative_increase > ci_lower_rTPF) and (relative_increase < ci_upper_rTPF),
                    agreement, m.pvalue, m.statistic
                    ))
df = pd.DataFrame(results, columns=["Explainer", "increase [%]", "variance_log_increase", "CI[","]", "within", "agreement", "pvalue", "mcnemar_statistic"]) 
df

,Explainer,increase [%],variance_log_increase,CI[,],within,agreement,pvalue,mcnemar_statistic
0,Anchor_Explainer,66.666667,0.148454,1.245897,2.22954,True,2.142149,0.00068,9.0
1,LIME_Explainer,2.272727,0.157313,0.751367,1.392091,True,1.636364,1.0,24.0
2,SHAP_Explainer,57.575758,0.154573,1.163896,2.133362,True,2.076923,0.004324,11.0
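
The interval in the table follows the usual log-scale construction for a ratio of paired proportions: the ratio is `(b + d) / (c + d)`, and the quantity stored as `variance_log_increase` is the standard error of its logarithm. The following is a minimal sketch using only the standard library. The counts are not read from the database; they were back-calculated so that they reproduce the Anchor row above, and `ratio_with_ci` is a hypothetical helper:

```python
from math import exp, sqrt


def ratio_with_ci(b, c, d, z=1.96):
    """Ratio of paired positive counts with a log-scale confidence interval."""
    ratio = (b + d) / (c + d)
    # Standard error of log(ratio) for paired data
    se_log = sqrt((b + c) / ((b + d) * (c + d)))
    return ratio, se_log, ratio * exp(-z * se_log), ratio * exp(z * se_log)


# Back-calculated counts (an assumption): b = 31, c = 9, d = 24.
ratio, se_log, lower, upper = ratio_with_ci(b=31, c=9, d=24)
print(f"increase: {(ratio - 1) * 100:.2f}%  SE(log ratio): {se_log:.4f}  95% CI: [{lower:.3f}, {upper:.3f}]")
```

Because the interval is built around the ratio itself, a check that the ratio lies inside it (the `within` column) is always true; the informative question is whether the interval excludes 1.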


In [41]:
latex_output = []

In [42]:
df["Explainer"] = df["Explainer"].str.replace("_Explainer","")
result = df.set_index("Explainer").style\
    .map_index(lambda v: "rotatebox:{45}--rwrap;", level=0, axis=1).format_index(escape="latex", axis=1).format(precision=2).format_index(escape="latex", axis=0)
latex_output.append(result.to_latex(environment="longtable", 
                                    convert_css=True, 
                                    clines="all;data", 
                                    hrules=True, 
                                    caption="Results per group. New evaluation method", 
                                    label="user-study-per-group-leonardo"))
result

Explainer,increase [\%],variance\_log\_increase,CI[,],within,agreement,pvalue,mcnemar\_statistic
Anchor,66.67,0.15,1.25,2.23,True,2.14,0.0,9.0
LIME,2.27,0.16,0.75,1.39,True,1.64,1.0,24.0
SHAP,57.58,0.15,1.16,2.13,True,2.08,0.0,11.0


In [43]:
with open("figures/tables_user_study_leonardo.tex", "w", encoding="UTF-8") as text_file:
    text_file.write("\n".join(latex_output))