<a href="https://colab.research.google.com/github/mille-s/ReproHum_072904_DCU25/blob/main/QRA%2B_SM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TODO - Introduction

In [1]:
#@title Setup repo
import os
import sys
import pandas as pd
from google.colab import drive

# Mount Google Drive.
drive.mount('/content/drive', force_remount=True)

# Make sure the files folder exists.
ROOT_DIR = "/content/drive/Shareddrives/human_evaluation_tutorial/code/qra"
!mkdir -p $ROOT_DIR

DATA_DIR = f"{ROOT_DIR}/data"
!mkdir -p $DATA_DIR

RESULTS_DIR = f"{DATA_DIR}/reproduction_study_results"
!mkdir -p $RESULTS_DIR

# Add the source folder to the import path so the external Python files can be loaded.
SRC_DIR = f"{ROOT_DIR}/src"
!mkdir -p $SRC_DIR
sys.path.append(SRC_DIR)

# Load the QRA functions
# TODO - format QRAUtils to have Google style and comments.
from qra_utils import QRAUtils

Mounted at /content/drive

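The next cell loads its input with `QRAUtils.load_dataframe_from_parallel_sources`, which expects several manually entered copies of the same results file. The sketch below illustrates that double-entry idea using only the standard library. `load_identical_csvs` is a hypothetical helper written for this illustration, not the `QRAUtils` implementation, and raising an error on a mismatch is an assumption about the desired behaviour.

```python
import csv
import glob


def load_identical_csvs(pattern):
    """Load every CSV matching `pattern` and return one copy if all agree.

    Hypothetical helper for illustration; not the QRAUtils implementation.
    """
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"No files match {pattern!r}")

    # Read each file into a list of rows (each row is a list of strings).
    tables = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            tables.append(list(csv.reader(handle)))

    # Compare every copy against the first one; any difference points to a
    # data-entry mistake in at least one of the files.
    reference = tables[0]
    for path, table in zip(paths[1:], tables[1:]):
        if table != reference:
            raise ValueError(f"{path} differs from {paths[0]}")
    return reference
```

Failing loudly on a mismatch keeps a typo in one spreadsheet from silently reaching the results tables.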

In [11]:
#@title Create a dataframe for each type of QRA+ results.
# This function takes a glob pattern as input and loads each matching file.
# The files are expected to be identical: the data is entered manually into
# several spreadsheets, so if the copies match, a data-entry mistake is
# unlikely. The function then loads a single copy of the data.
input_df = QRAUtils.load_dataframe_from_parallel_sources(
    # f"{RESULTS_DIR}/p_and_l_2021*.csv"
    '/content/gu-etal-2022_*.csv'
)

# print(f'DF: {input_df}')

# We need to specify these in order, as the first study is always considered
# to be the original, regardless of what we name it.
STUDIES = [
    'Original',       # https://aclanthology.org/2021.tacl-1.31
    'Reproduction 1'  # https://aclanthology.org/2023.humeval-1.7
]

# This sets the order in which criteria are shown in our dataframes,
# regardless of the order in which they appear in the input data.
QUALITY_CRITERIA = [
    'Overall', 'Overall_agg'
]

# This sets the order in which systems are shown in our dataframes,
# regardless of the order in which they appear in the input data.
SYSTEMS = [
    'MemSum',
    'NeuSum'
]

# CV* requires scales that start at zero, so set this to the lowest value of
# the rating instrument; it is used to adjust the scores for CV*.
# - For example, best-worst scaling goes from -100 to 100.
INSTRUMENT_SCALE_STARTS_AT = 1

# This loads the data, with the results from each study in their own columns.
base_df = QRAUtils.get_base_df(
    input_df,
    STUDIES
)

# Type I
type_i_df = QRAUtils.get_type_i_df(
    base_df,
    STUDIES,
    INSTRUMENT_SCALE_STARTS_AT
)

# Type II
type_ii_df = QRAUtils.get_type_ii_df(
    base_df,
    STUDIES,
    QUALITY_CRITERIA
)

# Type III - We do not have responses from the original study with which to calculate this.
# - It is calculated using Krippendorff's alpha as a measure of inter-study agreement.
# - Note that some Python implementations of Krippendorff's alpha are wrong.

# TODO - implement this anyway with dummy responses.

# Type IV
type_iv_df = QRAUtils.get_type_iv_df(
    input_df,
    STUDIES,
    QUALITY_CRITERIA
)

print(f'Selected start of instrument scale: {INSTRUMENT_SCALE_STARTS_AT}.')
print("Type I Results:")
print(type_i_df)

print("\nType II Results:")
print(type_ii_df)

print("\nType IV Results:")
print(type_iv_df)

Selected start of instrument scale: 1.
Type I Results:
       Key  System    Criterion  Original  Reproduction 1  CV_STAR_2_O_A
0  0729-04  MemSum  Overall_agg      1.38            1.27      33.744794
1  0729-04  NeuSum  Overall_agg      1.57            1.33      53.173616
2  0729-04  MemSum      Overall      1.38            1.47      21.113053
3  0729-04  NeuSum      Overall      1.57            1.53       7.250948

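The `CV_STAR_2_O_A` values above appear consistent with a small-sample coefficient of variation computed on scores shifted so that the scale starts at zero. The sketch below is a reconstruction from the printed numbers, not the `QRAUtils` source: the function name `cv_star` and its `scale_start` parameter are hypothetical, and the formula (unbiased standard deviation via the `c4` constant, times the `1 + 1/(4n)` correction) is an assumption that happens to reproduce the four rows shown.

```python
import math
import statistics


def cv_star(measurements, scale_start=0.0):
    """Small-sample coefficient of variation, as a percentage.

    Assumed formula, reconstructed from the printed results.
    """
    # Shift the scores so that the lowest point of the scale is zero.
    shifted = [m - scale_start for m in measurements]
    n = len(shifted)
    mean = statistics.fmean(shifted)
    # Sample standard deviation (n - 1 in the denominator).
    s = statistics.stdev(shifted)
    # c4 corrects the bias of s as an estimator of the population
    # standard deviation under normality.
    c4 = math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)
    # (1 + 1 / (4n)) is a small-sample correction for the coefficient of variation.
    return (1 + 1 / (4 * n)) * (s / c4) / abs(mean) * 100


# MemSum, criterion "Overall": original 1.38, reproduction 1.47, scale starts at 1.
print(round(cv_star([1.38, 1.47], scale_start=1), 3))  # 21.113
```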
Type II Results:
     Criterion  PEARSON_O_A  SPEARMAN_O_A
0      Overall          1.0           1.0
1  Overall_agg          1.0           1.0

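Both correlations above are exactly 1.0 because each criterion compares only two systems: with two distinct points per study, Pearson's r and Spearman's rho can only be +1 or -1. The following pure-Python sketch shows the two coefficients on the `Overall` scores. The helper names are hypothetical and this is not necessarily how `QRAUtils` computes them.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# "Overall" scores for (MemSum, NeuSum) in each study.
original = [1.38, 1.57]
reproduction = [1.47, 1.53]
print(round(pearson(original, reproduction), 6), spearman(original, reproduction))  # 1.0 1.0
```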
Type IV Results:
     Criterion   Study A         Study B  Numerator  Denominator
0      Overall  Original  Reproduction 1          1            1
1  Overall_agg  Original  Reproduction 1          1            1

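One plausible reading of the Type IV table is that `Denominator` counts the pairs of systems that can be compared and `Numerator` counts the pairs ranked the same way in both studies. This interpretation is an assumption based on the printed output (two systems give one pair), and `ranking_agreement` is a hypothetical helper, not part of `QRAUtils`.

```python
from itertools import combinations


def ranking_agreement(scores_a, scores_b):
    """Count system pairs ordered the same way in both studies.

    Both arguments map system name -> mean score.
    Returns (agreeing_pairs, total_pairs).
    """
    def sign(value):
        return (value > 0) - (value < 0)

    agreeing = 0
    total = 0
    for first, second in combinations(sorted(scores_a), 2):
        total += 1
        diff_a = scores_a[first] - scores_a[second]
        diff_b = scores_b[first] - scores_b[second]
        # A pair agrees when both studies order the two systems the same way.
        if sign(diff_a) == sign(diff_b):
            agreeing += 1
    return agreeing, total


# "Overall" scores from the Type I table.
original = {"MemSum": 1.38, "NeuSum": 1.57}
reproduction = {"MemSum": 1.47, "NeuSum": 1.53}
print(ranking_agreement(original, reproduction))  # (1, 1)
```

In both studies NeuSum scores higher than MemSum on `Overall`, so the single pairwise ranking agrees, which matches the 1 and 1 printed above.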