In [1]:
import pandas as pd
import scipy.stats as stats
import os

In [2]:
from constants import meta_cols, question_dict, type_dict, pu_cols, peou_cols, se_cols, load_cols

Define a few helper functions that extract the construct-specific columns (plus the `group` label) from a data frame passed as an argument.

In [3]:
def get_pu_df(df: pd.DataFrame) -> pd.DataFrame:
    return df[pu_cols + ["group"]]

def get_peou_df(df: pd.DataFrame) -> pd.DataFrame:
    return df[peou_cols + ["group"]]

def get_self_efficacy_df(df: pd.DataFrame) -> pd.DataFrame:
    return df[se_cols + ["group"]]

def get_load_df(df: pd.DataFrame) -> pd.DataFrame:
    return df[load_cols + ["group"]]
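
All four helpers follow the same pattern: select a list of item columns and keep the `group` label alongside them. A minimal sketch of that pattern on a toy frame is shown below. The function name `select_construct` and the column names `q1`, `q2`, and `other` are hypothetical and only serve as illustration.

```python
import pandas as pd


def select_construct(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Return the given item columns plus the group label."""
    return df[list(cols) + ["group"]]


# Toy data: two item columns, one unrelated column, and the group label
toy = pd.DataFrame(
    {"q1": [1, 2], "q2": [3, 4], "other": [9, 9], "group": ["B", "E"]}
)
toy_cols = ["q1", "q2"]  # hypothetical stand-in for a list such as pu_cols

sliced = select_construct(toy, toy_cols)
print(sliced.columns.tolist())
```

The unrelated column is dropped, while the item columns and `group` are kept in the order given.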

Read the raw data and fix the scale of the one question that was asked with an inverted scale due to limitations in LimeSurvey (the "Performance" item of the NASA TLX questionnaire).

In [4]:
raw_res = pd.read_csv(
    "./data/survey_results.csv",
    index_col="id",
    dtype=type_dict,
    parse_dates=["submitdate", "startdate", "datestamp"],
    date_format="%Y-%m-%dT%H:%M:%S%z",
)

# fix scale for load question 4 (needs to be reversed)
raw_res["load[SQ004]"] = 20 - raw_res["load[SQ004]"]
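
The reversal can be checked on a few toy values. This is a small sketch that assumes responses are coded from 0 to 20, which is what `20 - x` implies. The constant `SCALE_MAX` and the toy values are illustrative and not taken from the survey.

```python
import pandas as pd

SCALE_MAX = 20  # assumed upper bound of the response scale

# Toy responses on the inverted scale (not real survey data)
raw = pd.Series([0, 5, 20], name="load[SQ004]")

# Reverse the scale so that it points in the same direction as the other items
reversed_scores = SCALE_MAX - raw
print(reversed_scores.tolist())
```

Applying the reversal a second time restores the original values, so the cell that fixes the scale should run only once per load of the raw data.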

Remove some metadata columns that LimeSurvey adds automatically but that are not needed for the evaluation. Additionally, the remaining metadata (see `meta_cols` for more information) is separated into its own data frame for easier data handling.

In [5]:
working_df = raw_res.drop(columns=["lastpage", "startlanguage", "gender_other", "seed", "token", "refurl"])

meta_df = working_df[meta_cols]
data_df = working_df.drop(columns=meta_cols)

# sanity check df format
# data_df

**Start cleanup based on the final cell from here. This will require some reorganization of the cells and analysis steps.**

### Data Preparation

The analysis is separated into one section for each questionnaire used in the study (TAM, Self-Efficacy, NASA TLX). To prepare the data, separate data frames are created for each of the constructs measured by the questionnaires. The following steps are performed below:

1. Slice the relevant columns from the main data frame for each construct
2. Perform any necessary combination of columns to create overall scores for the constructs for each participant (row)
3. Store a main result CSV for all participants and constructs (`./out/participant_constructs.csv`)

After the results have been prepared, statistical tests are performed for each construct to determine whether the two groups (baseline vs. explanation) differ significantly. The results of these tests are stored in a CSV file (`./out/statistical_test_results.csv`).
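
The three preparation steps listed above can be sketched on a toy frame. The column names (`pu1`, `pu2`, `se1`), the values, and the output file name `scores.csv` are hypothetical. The sketch writes to a temporary directory rather than `./out` so that it has no side effects on the real results.

```python
import os
import tempfile

import pandas as pd

# Toy data indexed by participant id (not real survey data)
toy = pd.DataFrame(
    {
        "pu1": [6, 4, 7],
        "pu2": [5, 5, 7],
        "se1": [9, 8, 10],
        "group": ["E", "B", "E"],
    },
    index=pd.Index([1, 2, 3], name="id"),
)

# Step 1: map each construct to its item columns
constructs = {
    "Perceived Usefulness": ["pu1", "pu2"],
    "Self-Efficacy": ["se1"],
}

# Step 2: average the items row-wise to get one score per participant
scores = pd.DataFrame(
    {name: toy[cols].mean(axis=1) for name, cols in constructs.items()}
)
scores["group"] = toy["group"]

# Step 3: store the participant-level scores as CSV
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "scores.csv")
scores.to_csv(out_path)
```

Because the row-wise mean preserves the index, the scores stay aligned with the participant ids and the `group` column can be attached directly.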

In [6]:
# produce separate output csv
# filter on the frame's own group column rather than on a mask from raw_res
expl_df = data_df[data_df["group"] == "E"]
base_df = data_df[data_df["group"] == "B"]

os.makedirs("./out", exist_ok=True)

expl_df.to_csv("./out/survey_results_expl.csv")
base_df.to_csv("./out/survey_results_base.csv")

### Statistical Evaluation

The evaluation uses simple statistical tests to compare the two groups. It is split by questionnaire (TAM, Self-Efficacy, NASA TLX): TAM is split into its two subscales (Perceived Usefulness and Perceived Ease of Use), Computer Self-Efficacy (CSE) is evaluated as a whole, and NASA TLX is evaluated for four of its six subscales (Mental Demand, Performance, Effort, Frustration).

All comparisons are run with both a Mann-Whitney U test and a Welch's t-test. Since the small sample size does not allow for a reliable assessment (or assumption) of normality, the non-parametric Mann-Whitney U test is preferred. For completeness, the t-test results are also reported.

Statistical tests are performed using the `scipy.stats` library, and visualizations are created using `seaborn` and `matplotlib`.
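
Running both tests for one comparison can be sketched as a small helper. The name `compare_groups` and the toy values are hypothetical. The result keys mirror the ones used in this notebook. Note that `alternative="less"` tests whether the first sample (baseline) tends to be smaller than the second (explanation), while `"greater"` tests the opposite direction.

```python
import pandas as pd
import scipy.stats as stats


def compare_groups(base: pd.Series, expl: pd.Series, alternative: str) -> dict:
    """Run a one-sided Mann-Whitney U test and Welch's t-test on two samples."""
    u_res = stats.mannwhitneyu(base, expl, alternative=alternative)
    # equal_var=False selects Welch's t-test (no equal-variance assumption)
    t_res = stats.ttest_ind(base, expl, alternative=alternative, equal_var=False)
    return {
        "b_mean": base.mean(),
        "e_mean": expl.mean(),
        "mannwhitneyu_stat": u_res.statistic,
        "mannwhitneyu_p": u_res.pvalue,
        "ttest_stat": t_res.statistic,
        "ttest_p": t_res.pvalue,
    }


# Toy scores (not real survey data): every baseline value is below every
# explanation value, so both one-sided tests should favour "less"
toy_base = pd.Series([1.0, 2.0, 3.0, 4.0])
toy_expl = pd.Series([5.0, 6.0, 7.0, 8.0])

result = compare_groups(toy_base, toy_expl, alternative="less")
print(result)
```

A helper of this kind would also remove the repetition between the per-construct cells, since only the two input series, the construct name, and the direction of the alternative change from one comparison to the next.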

In [7]:
# split data into constructs
pu_df = get_pu_df(data_df)
peou_df = get_peou_df(data_df)
se_df = get_self_efficacy_df(data_df)
load_df = get_load_df(data_df)

stat_results = []

TAM is evaluated using the Perceived Usefulness (PU) and Perceived Ease of Use (PEOU) subscales. Each subscale consists of multiple items, which are averaged to obtain a single score for each participant.

In [8]:
pu_base_df = pu_df[pu_df["group"] == "B"].drop(columns=["group"])
pu_expl_df = pu_df[pu_df["group"] == "E"].drop(columns=["group"])

pu_b = pu_base_df.mean(axis=1)
pu_e = pu_expl_df.mean(axis=1)

u_res = stats.mannwhitneyu(pu_b, pu_e, alternative="less")
u_pval = u_res.pvalue
u_stat = u_res.statistic

t_res = stats.ttest_ind(pu_b, pu_e, alternative="less", equal_var=False)
t_pval = t_res.pvalue
t_stat = t_res.statistic

stat_results.append(
    {
        "const": "Perceived Usefulness",
        "b_mean": pu_b.mean(),
        "e_mean": pu_e.mean(),
        "mannwhitneyu_stat": u_stat,
        "mannwhitneyu_p": u_pval,
        "ttest_stat": t_stat,
        "ttest_p": t_pval,
    }
)

In [9]:
peou_base_df = peou_df[peou_df["group"] == "B"].drop(columns=["group"])
peou_expl_df = peou_df[peou_df["group"] == "E"].drop(columns=["group"])

peou_b = peou_base_df.mean(axis=1)
peou_e = peou_expl_df.mean(axis=1)

u_res = stats.mannwhitneyu(peou_b, peou_e, alternative='less')
u_pval = u_res.pvalue
u_stat = u_res.statistic

t_res = stats.ttest_ind(peou_b, peou_e, alternative='less', equal_var=False)
t_pval = t_res.pvalue
t_stat = t_res.statistic

stat_results.append(
    {
        "const": "Perceived Ease of Use",
        "b_mean": peou_b.mean(),
        "e_mean": peou_e.mean(),
        "mannwhitneyu_stat": u_stat,
        "mannwhitneyu_p": u_pval,
        "ttest_stat": t_stat,
        "ttest_p": t_pval,
    }
)

Computer Self-Efficacy (CSE) is evaluated as a whole, using all items from the CSE questionnaire. Similar to TAM, the items are averaged to obtain a single score for each participant.

In [10]:
se_base_df = se_df[se_df["group"] == "B"].drop(columns=["group"])
se_expl_df = se_df[se_df["group"] == "E"].drop(columns=["group"])

se_b = se_base_df.mean(axis=1)
se_e = se_expl_df.mean(axis=1)

u_res = stats.mannwhitneyu(se_b, se_e, alternative='less')
u_pval = u_res.pvalue
u_stat = u_res.statistic

t_res = stats.ttest_ind(se_b, se_e, alternative='less', equal_var=False)
t_pval = t_res.pvalue
t_stat = t_res.statistic

stat_results.append(
    {
        "const": "Self-Efficacy",
        "b_mean": se_b.mean(),
        "e_mean": se_e.mean(),
        "mannwhitneyu_stat": u_stat,
        "mannwhitneyu_p": u_pval,
        "ttest_stat": t_stat,
        "ttest_p": t_pval,
    }
)

The NASA TLX results are evaluated as “raw TLX”, i.e. without the pairwise subscale weighting, as per https://doi.org/10.1177/154193120605000909. Since the given task was not time-constrained and did not involve physical effort, the corresponding subscales (Temporal Demand and Physical Demand) are ignored in the evaluation.

In [11]:
# columns for mental demand, performance, effort, frustration
cols = [question_dict["load[SQ001]"], question_dict["load[SQ004]"], question_dict["load[SQ005]"], question_dict["load[SQ006]"]]

# rename columns according to question_dict for easier interpretation
load_df = load_df.rename(columns=question_dict)

# split groups
load_base_group = load_df[load_df["group"] == "B"].drop(columns=["group"])
load_expl_group = load_df[load_df["group"] == "E"].drop(columns=["group"])

for col in cols:
    b = load_base_group[col]
    e = load_expl_group[col]

    u_res = stats.mannwhitneyu(b, e, alternative="greater")
    u_pval = u_res.pvalue
    u_stat = u_res.statistic

    t_res = stats.ttest_ind(b, e, alternative="greater", equal_var=False)
    t_pval = t_res.pvalue
    t_stat = t_res.statistic

    stat_results.append(
        {
            "const": col,
            "b_mean": b.mean(),
            "e_mean": e.mean(),
            "mannwhitneyu_stat": u_stat,
            "mannwhitneyu_p": u_pval,
            "ttest_stat": t_stat,
            "ttest_p": t_pval,
        }
    )

In [12]:
stat_results_df = pd.DataFrame(stat_results)
stat_results_df = stat_results_df.rename(columns={
    "const": "Construct",
    "b_mean": "Base Mean",
    "e_mean": "Explanation Mean",
    "mannwhitneyu_stat": "Mann-Whitney U Stat",
    "mannwhitneyu_p": "Mann-Whitney U p-value",
    "ttest_stat": "t-test Stat",
    "ttest_p": "t-test p-value"
})

stat_results_df.to_csv("./out/statistical_test_results.csv", index=False)

stat_results_df

| | Construct | Base Mean | Explanation Mean | Mann-Whitney U Stat | Mann-Whitney U p-value | t-test Stat | t-test p-value |
|---|---|---|---|---|---|---|---|
| 0 | Perceived Usefulness | 5.25 | 5.904762 | 11.0 | 0.084082 | -1.726694 | 0.05614 |
| 1 | Perceived Ease of Use | 6.166667 | 5.928571 | 24.0 | 0.693676 | 0.566684 | 0.707922 |
| 2 | Self-Efficacy | 9.35 | 8.214286 | 26.0 | 0.785899 | 1.542875 | 0.91548 |
| 3 | Mental Demand | 12.333333 | 10.714286 | 24.5 | 0.331933 | 0.506887 | 0.31112 |
| 4 | Performance | 7.0 | 3.857143 | 27.0 | 0.213129 | 1.137086 | 0.147984 |
| 5 | Effort | 10.666667 | 9.428571 | 21.5 | 0.5 | 0.381143 | 0.35567 |
| 6 | Frustration | 10.5 | 2.571429 | 37.5 | 0.010499 | 2.917351 | 0.010525 |
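
To read the table, the constructs whose preferred test falls below a significance threshold can be picked out as follows. This is a sketch only: the threshold `ALPHA = 0.05` is an assumption (the notebook does not state one), and no correction for multiple comparisons is applied. The p-values are copied from the Mann-Whitney U column above.

```python
import pandas as pd

ALPHA = 0.05  # assumed significance level; not stated in the notebook

# Mann-Whitney U p-values copied from the result table
results = pd.DataFrame(
    {
        "Construct": [
            "Perceived Usefulness",
            "Perceived Ease of Use",
            "Self-Efficacy",
            "Mental Demand",
            "Performance",
            "Effort",
            "Frustration",
        ],
        "Mann-Whitney U p-value": [
            0.084082,
            0.693676,
            0.785899,
            0.331933,
            0.213129,
            0.5,
            0.010499,
        ],
    }
)

# Keep only the rows whose p-value is below the assumed threshold
significant = results[results["Mann-Whitney U p-value"] < ALPHA]
print(significant["Construct"].tolist())
```

Under this assumed threshold, only Frustration falls below it.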


In [13]:
# Build participant-level constructs DataFrame
# Compute construct scores for all participants (preserves original index = participant id)
pu_all = pu_df.drop(columns=['group']).mean(axis=1)
peou_all = peou_df.drop(columns=['group']).mean(axis=1)
se_all = se_df.drop(columns=['group']).mean(axis=1)
# Ensure load columns are named for readability
load_named = load_df.rename(columns=question_dict)
md_col = question_dict['load[SQ001]']
perf_col = question_dict['load[SQ004]']
eff_col = question_dict['load[SQ005]']
fr_col = question_dict['load[SQ006]']
mental = load_named[md_col]
performance = load_named[perf_col]
effort = load_named[eff_col]
frustration = load_named[fr_col]
# Assemble into a single DataFrame indexed by participant id
participant_df = pd.DataFrame({
    'Perceived Usefulness': pu_all,
    'Perceived Ease of Use': peou_all,
    'Self-Efficacy': se_all,
    'Mental Demand': mental,
    'Performance': performance,
    'Effort': effort,
    'Frustration': frustration,
})
# Add group and metadata columns (if present)
participant_df['group'] = data_df['group']
for c in meta_cols:
    if c in meta_df.columns:
        participant_df[c] = meta_df[c]
# Save to CSV and show a quick preview
os.makedirs('./out', exist_ok=True)
participant_df.to_csv('./out/participant_constructs.csv')
participant_df.head()

| id | Perceived Usefulness | Perceived Ease of Use | Self-Efficacy | Mental Demand | Performance | Effort | Frustration | group | submitdate | startdate | datestamp | tasks | age | gender_mf | education | UUID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 6.166667 | 6.5 | 5.8 | 10 | 5 | 10 | 4 | E | 2025-07-25 18:50:55+02:00 | 2025-07-25 18:26:03+02:00 | 2025-07-25 18:50:55+02:00 | 1 | 30 - 39 | Männlich | Master | f971d5ae-a9ff-4a23-a92b-7650ad697dfc |
| 2 | 5.333333 | 5.666667 | 9.1 | 3 | 14 | 12 | 12 | B | 2025-07-29 18:26:34+02:00 | 2025-07-29 18:18:52+02:00 | 2025-07-29 18:26:34+02:00 | 1 | 20 - 29 | Männlich | Allgemeine Hochschulreife (Abitur) | 7d8c9331-dd8c-4df6-8f6c-9e7a16d008d9 |
| 3 | 4.166667 | 6.5 | 9.1 | 18 | 1 | 15 | 15 | B | 2025-07-31 12:27:55+02:00 | 2025-07-31 12:21:54+02:00 | 2025-07-31 12:27:55+02:00 | 2 | 20 - 29 | Männlich | Bachelor | fa69270e-20e0-4849-bae1-473de60495a4 |
| 4 | 7.0 | 6.666667 | 9.9 | 2 | 2 | 0 | 0 | E | 2025-08-01 13:26:23+02:00 | 2025-08-01 13:18:45+02:00 | 2025-08-01 13:26:23+02:00 | 1 | 20 - 29 | Männlich | Allgemeine Hochschulreife (Abitur) | 8d7e51b5-77bd-4474-8bcb-127fa69cd98c |
| 5 | 5.5 | 6.666667 | 9.8 | 13 | 2 | 15 | 15 | B | 2025-08-05 17:20:45+02:00 | 2025-08-05 17:09:23+02:00 | 2025-08-05 17:20:45+02:00 | 1 | 20 - 29 | Männlich | Bachelor | f41a9ecb-7000-4ae1-8b8e-1b39ea598476 |
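
A quick way to sanity-check the participant-level frame is to aggregate it by group. The sketch below does this for two constructs, using only the five preview rows shown above. Since these are not all participants, the resulting means will differ from the ones in the statistical result table. The variable names `preview` and `group_means` are illustrative.

```python
import pandas as pd

# Two constructs and the group label, copied from the five preview rows
preview = pd.DataFrame(
    {
        "Perceived Usefulness": [6.166667, 5.333333, 4.166667, 7.0, 5.5],
        "Frustration": [4, 12, 15, 0, 15],
        "group": ["E", "B", "B", "E", "B"],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="id"),
)

# Mean score per group for each construct
group_means = preview.groupby("group").mean()
print(group_means)
```

Run on the full `participant_df` (restricted to its numeric construct columns), the same aggregation should reproduce the "Base Mean" and "Explanation Mean" columns of the result table.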
