## Introduction

**Author: Izzat Arroyyan**

This usability test was conducted at the SGLC FT UGM as part of a UX course, focusing on evaluating the user experience of a food ordering and chair reservation system in a campus canteen setting. The aim was to gather data on how users interact with two UI design variations, assessing their preferences, task completion time, and error rates in real-world usage.

The project revolves around creating a prototype that simplifies the process of ordering food and reserving seats. By analyzing task performance and user satisfaction, the goal is to identify the more effective design and refine it for better usability. This testing will guide improvements to the prototype, ensuring a smoother and more intuitive experience for users.

In [1]:
# Import necessary libraries
import pandas as pd
from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar
import numpy as np

## Dataset Description
The dataset covers two UI design scenarios (A and B), with 12 paired observations for each metric: task success, time on task, errors, and user satisfaction (Single Ease Question scores, SEQa and SEQb). We will run statistical tests to compare the effectiveness of the two designs and users' preference between them.

In [2]:
# Creating the DataFrame with data from two UI designs (A and B)
data = {
    'Task Success A': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'Time on Task A': [18.17, 56.11, 21.66, 19.5, 28.36, 61.36, 31.46, 36.03, 19.69, 18.36, 15.47, 30.31],
    'Errors A': [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0],
    'SEQa': [7, 6, 7, 7, 7, 4, 7, 6, 6, 7, 7, 7],

    'Task Success B': [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    'Time on Task B': [16.16, 53.75, 36.78, 37.69, 34.91, 36.03, 38.84, 95.61, 39.23, 24.04, 29.68, 56.82],
    'Errors B': [2,0,3,1,1,1,1,1,0,0,6,1],
    'SEQb': [7,6,5,3,5,6,7,7,5,7,7,6]
}


df = pd.DataFrame(data)

In [3]:
print("Data")
df

Data


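| | Task Success A | Time on Task A | Errors A | SEQa | Task Success B | Time on Task B | Errors B | SEQb |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 18.17 | 1 | 7 | 0 | 16.16 | 2 | 7 |
| 1 | 1 | 56.11 | 1 | 6 | 0 | 53.75 | 0 | 6 |
| 2 | 1 | 21.66 | 0 | 7 | 0 | 36.78 | 3 | 5 |
| 3 | 1 | 19.50 | 0 | 7 | 0 | 37.69 | 1 | 3 |
| 4 | 1 | 28.36 | 0 | 7 | 1 | 34.91 | 1 | 5 |
| 5 | 1 | 61.36 | 0 | 4 | 1 | 36.03 | 1 | 6 |
| 6 | 1 | 31.46 | 0 | 7 | 0 | 38.84 | 1 | 7 |
| 7 | 1 | 36.03 | 0 | 6 | 0 | 95.61 | 1 | 7 |
| 8 | 1 | 19.69 | 0 | 6 | 1 | 39.23 | 0 | 5 |
| 9 | 1 | 18.36 | 0 | 7 | 1 | 24.04 | 0 | 7 |
| 10 | 1 | 15.47 | 3 | 7 | 0 | 29.68 | 6 | 7 |
| 11 | 1 | 30.31 | 0 | 7 | 1 | 56.82 | 1 | 6 |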


## Descriptive Statistics
Before performing statistical tests, we calculate basic descriptive statistics for each metric. This includes the mean and standard deviation for task success, time on task, errors, and user satisfaction (SEQ).

In [4]:
# Calculating summary statistics
descriptive_stats = {
    'Metric': ['Task Success A', 'Time on Task A', 'Errors A', 'SEQa',
               'Task Success B', 'Time on Task B', 'Errors B',
               'SEQb'],
    'Mean': [
        df['Task Success A'].mean(),
        df['Time on Task A'].mean(),
        df['Errors A'].mean(),
        df['SEQa'].mean(),
        df['Task Success B'].mean(),
        df['Time on Task B'].mean(),
        df['Errors B'].mean(),
        df['SEQb'].mean(),
    ],
    'Standard Deviation': [
        df['Task Success A'].std(),
        df['Time on Task A'].std(),
        df['Errors A'].std(),
        df['SEQa'].std(),
        df['Task Success B'].std(),
        df['Time on Task B'].std(),
        df['Errors B'].std(),
        df['SEQb'].std(),
    ]
}

descriptive_df = pd.DataFrame(descriptive_stats)

# Displaying the descriptive statistics
print("Descriptive Statistics:")
print(descriptive_df)

Descriptive Statistics:
           Metric       Mean  Standard Deviation
0  Task Success A   1.000000            0.000000
1  Time on Task A  29.706667           15.009356
2        Errors A   0.416667            0.900337
3            SEQa   6.500000            0.904534
4  Task Success B   0.416667            0.514929
5  Time on Task B  41.628333           20.274281
6        Errors B   1.416667            1.676486
7            SEQb   5.916667            1.240112
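
The same kind of summary can be produced more compactly with `DataFrame.agg`. The following is a minimal sketch rather than a replacement for the cell above; `df_errors` and `summary` are illustrative names, and only the two error columns are copied over to keep the example short. Note that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default, which is what the table above reports.

```python
import pandas as pd

# Two columns copied from the dataset above (illustrative subset)
df_errors = pd.DataFrame({
    'Errors A': [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0],
    'Errors B': [2, 0, 3, 1, 1, 1, 1, 1, 0, 0, 6, 1],
})

# agg() computes each statistic per column; .T puts one metric per row
summary = df_errors.agg(['mean', 'std']).T
print(summary)
```

Applied to the full `df`, the same call would return one row per metric with `mean` and `std` columns.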


## Hypothesis Testing
We'll conduct several statistical tests to evaluate the differences between the two UI designs based on the metrics:

- **McNemar's Test**: This test compares the paired task success outcomes (success or failure) of designs A and B.

- **Paired T-tests**: These tests compare the time on task, errors, and user satisfaction for the two designs.
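
To make the paired t-test concrete: it is equivalent to a one-sample t-test on the per-participant differences (A minus B). The sketch below recomputes the statistic by hand for time on task and checks it against `scipy.stats.ttest_rel`. The variable names (`time_a`, `time_b`, `t_manual`, and so on) are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import stats

# Time on task for each design, copied from the dataset
time_a = np.array([18.17, 56.11, 21.66, 19.5, 28.36, 61.36,
                   31.46, 36.03, 19.69, 18.36, 15.47, 30.31])
time_b = np.array([16.16, 53.75, 36.78, 37.69, 34.91, 36.03,
                   38.84, 95.61, 39.23, 24.04, 29.68, 56.82])

# Paired differences: negative values mean Design A was faster
diff = time_a - time_b
n = len(diff)

# t = mean difference divided by its standard error (sample std, ddof=1)
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_rel(time_a, time_b)
print(f"manual: t={t_manual:.4f}, p={p_manual:.4f}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.4f}")
```

Because the differences are taken as A minus B, a negative t-statistic indicates a lower mean for Design A on that metric.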

In [5]:
# 1. McNemar Test for Task Success Comparison
b = len(df[(df['Task Success A'] == 1) & (df['Task Success B'] == 0)])  # A Success, B Failure
c = len(df[(df['Task Success A'] == 0) & (df['Task Success B'] == 1)])  # A Failure, B Success
a = len(df[(df['Task Success A'] == 1) & (df['Task Success B'] == 1)])  # A Success, B Success
d = len(df[(df['Task Success A'] == 0) & (df['Task Success B'] == 0)])  # A Failure, B Failure

# McNemar Contingency Table
contingency_table = [[a, b], [c, d]]
print(f"McNemar Contingency Table:\n{contingency_table}\n")

# Perform McNemar's Test
mcnemar_test = mcnemar(contingency_table, exact=True)  # exact binomial test, suited to small discordant counts
mcnemar_statistic = mcnemar_test.statistic
mcnemar_pvalue = mcnemar_test.pvalue

# Perform Paired T-tests for each metric
results = {}
results['Task Success'] = (mcnemar_statistic, mcnemar_pvalue)  # Adding McNemar result to the summary
results['Time on Task'] = stats.ttest_rel(df['Time on Task A'], df['Time on Task B'])
results['Errors'] = stats.ttest_rel(df['Errors A'], df['Errors B'])
results['SEQa and SEQb'] = stats.ttest_rel(df['SEQa'], df['SEQb'])

McNemar Contingency Table:
[[5, 7], [0, 0]]
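
The exact McNemar test depends only on the discordant pairs: `b` (A succeeded, B failed) and `c` (A failed, B succeeded). Under the null hypothesis each discordant pair is equally likely to fall either way, so the two-sided p-value comes from a Binomial(b + c, 0.5) distribution. As a rough sketch, with the counts taken from the table above and `p_exact` as an illustrative name:

```python
from scipy.stats import binom

# Discordant counts read from the contingency table [[5, 7], [0, 0]]
b, c = 7, 0          # b: A success / B failure, c: A failure / B success
n_discordant = b + c

# Two-sided exact p-value: twice the lower binomial tail, capped at 1
p_exact = min(1.0, 2 * binom.cdf(min(b, c), n_discordant, 0.5))
print(f"Exact McNemar p-value: {p_exact:.4f}")  # prints 0.0156
```

The statistic reported by the exact test is `min(b, c)`, so it does not indicate which design did better; the direction has to be read from `b` and `c` themselves.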



## Statistical Results
Now, we present the results of the hypothesis tests for each metric. We check if the p-values are below the significance threshold (0.05) to determine if there are significant differences between the two designs.

In [6]:
# Displaying the results of statistical tests
print("\nHypothesis Test Results (Paired T-test and McNemar):")
for metric, result in results.items():
    if metric == 'Task Success':
        p_value = result[1]
        if p_value < 0.05:
            significance = "significant"
            # The exact McNemar statistic is min(b, c) and carries no direction,
            # so compare the discordant counts b and c from the previous cell.
            if c > b:
                comparison = "Design B has a higher task success rate compared to Design A."
            else:
                comparison = "Design A has a higher task success rate compared to Design B."
        else:
            significance = "not significant"
            comparison = "No significant difference in task success rates between Design A and Design B."
        print(f"{metric} - McNemar test - p-value: {p_value:.4f} (This result is {significance}. {comparison})")
    else:
        t_statistic = result.statistic
        p_value = result.pvalue
        if p_value < 0.05:
            significance = "significant"
            if metric == 'Time on Task':
                comparison = "Design B is faster than Design A." if t_statistic > 0 else "Design A is faster."
            elif metric == 'Errors':
                comparison = "Design B has fewer errors than Design A." if t_statistic > 0 else "Design A has fewer errors."
            elif metric == 'SEQa and SEQb':
                comparison = "Design A is preferred over Design B." if t_statistic > 0 else "Design B is preferred."
        else:
            significance = "not significant"
            comparison = "No significant difference."

        print(f"{metric} - t-statistic: {t_statistic:.4f}, p-value: {p_value:.4f} (This result is {significance}. {comparison})")


Hypothesis Test Results (Paired T-test and McNemar):
Task Success - McNemar test - p-value: 0.0156 (This result is significant. Design A has a higher task success rate compared to Design B.)
Time on Task - t-statistic: -2.0484, p-value: 0.0652 (This result is not significant. No significant difference.)
Errors - t-statistic: -3.0706, p-value: 0.0106 (This result is significant. Design A has fewer errors.)
SEQa and SEQb - t-statistic: 1.2918, p-value: 0.2229 (This result is not significant. No significant difference.)


## Conclusion
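This analysis compared two UI designs (A and B) on four metrics, using McNemar's test for task success and paired t-tests for time on task, errors, and user satisfaction. Design A performed significantly better on task success (all 12 attempts succeeded, against 5 of 12 for Design B; p = 0.0156) and on errors (mean 0.42 against 1.42; p = 0.0106). Design A also had a lower mean time on task (29.71 against 41.63) and a higher mean SEQ score (6.50 against 5.92), but neither difference reached significance at the 0.05 level (p = 0.0652 and p = 0.2229, respectively). On these measures, Design A is the stronger candidate for further refinement.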

