# Hypothesis A
Null hypothesis: There is no increase in the level of reproducibility of AGILE papers between the pre-intervention (2017, 2018, 2019) and post-intervention (2021, 2022, 2023) periods. The year 2020 is excluded because it was a transition year.

Alternative hypothesis: The level of reproducibility has increased after the intervention.

Note: What is actually assessed is the level of potential reproducibility according to the criteria defined in the assessment protocol.

In [40]:
# all imports
import pandas as pd
import numpy as np
from scipy.stats import mannwhitneyu

## Data preparation
The available data is a CSV file. The relevant columns for this hypothesis are:

- conf: "agile" or "giscience"
- year: year of the conference
- consolidated_cp: conceptual paper true/false
- consolidated_data: data dimension in UDAO (undocumented, documented, available, open) scheme
- consolidated_methods: methods dimension in UDAO scheme
- consolidated_results: results dimension in UDAO scheme
- consolidated_ce: computational environment true/false

The necessary steps are:

1. Filter conf for "agile"
2. Convert the UDAO scheme into ranks 1 to 4 and drop conceptual papers (their "Not applicable" values become NaN and the rows are removed)
3. Create two groups: first group contains years 2017, 2018, and 2019, second group contains years 2021, 2022, 2023

The other variables are needed later in the analysis.
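The three steps can be sketched on a small made-up table. The rows below are invented for illustration and are not taken from the dataset, and `prepare_groups` is a hypothetical helper rather than part of this notebook. Unlike the cells that follow, this sketch handles one column at a time and uses `Series.map`, which turns every value outside the UDAO letters into NaN.

```python
import pandas as pd

# Toy rows, invented for illustration only (not from all-data.csv)
toy = pd.DataFrame({
    "conf": ["agile", "agile", "agile", "giscience", "agile"],
    "year": [2017, 2019, 2021, 2021, 2022],
    "consolidated_data": ["U", "Not applicable", "O", "A", "A"],
})

UDAO_RANKS = {"U": 1, "D": 2, "A": 3, "O": 4}


def prepare_groups(df, column, pre_years, post_years):
    """Hypothetical helper: filter, rank, and split one UDAO column."""
    # Step 1: keep only AGILE papers; copy so later assignments are safe
    agile = df[df["conf"] == "agile"].copy()
    # Step 2: map UDAO letters to ranks; other values (e.g. "Not applicable") become NaN
    agile[column] = agile[column].map(UDAO_RANKS)
    agile = agile.dropna(subset=[column])
    # Step 3: split into pre- and post-intervention groups by year
    pre = agile.loc[agile["year"].isin(pre_years), column]
    post = agile.loc[agile["year"].isin(post_years), column]
    return pre, post


pre, post = prepare_groups(
    toy, "consolidated_data", [2017, 2018, 2019], [2021, 2022, 2023]
)
print(pre.tolist(), post.tolist())
```

The explicit `.copy()` after filtering is a design choice: it makes the filtered table independent of the original, so assigning to its columns does not trigger pandas' chained-assignment warning.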

In [67]:
# change variables as needed
INPUT_CSV = '../data-clean/all-data.csv'
CONF_COL = 'conf'
CONF_FILTER = 'agile'
GROUP_COL = 'year'
GROUP_A = [2017, 2018, 2019]
GROUP_B = [2021, 2022, 2023]
TEST_COL = ['consolidated_data', 'consolidated_methods', 'consolidated_results']
REPLACE = {
    'Not applicable': np.nan,
    'U': 1,
    'D': 2,
    'A': 3,
    'O': 4
}
P_VALUE = 0.01  # significance level (alpha) against which each test's p-value is compared

# Load the CSV file
df = pd.read_csv(INPUT_CSV)

# Filter the DataFrame on conference
filtered_df = df[df[CONF_COL] == CONF_FILTER]

filtered_df.head()

Unnamed: 0,conf,paper,year,title,link,oa,dasa,rev1,rev2,rev1_cp,...,consolidated_cp,consolidated_data,consolidated_methods,consolidated_results,consolidated_ce,consolidated_notes,disagr_type,agile_badge,agile_reproreport,disagr_id
0,agile,agile_2017_006,2017,Follow the Signs—Countering Disengagement from...,https://link.springer.com/chapter/10.1007/978-...,no,no,FO,AG,False,...,False,D,D,D,False,no disagreement,no disagreement,,,0
1,agile,agile_2017_014,2017,The Effect of Regional Variation and Resolutio...,https://link.springer.com/chapter/10.1007/978-...,no,no,FO,AG,False,...,False,D,D,D,False,no disagreement,no disagreement,,,0
2,agile,agile_2019_003,2019,Evaluating the Effectiveness of Embeddings in ...,https://link.springer.com/chapter/10.1007/978-...,no,no,FO,AG,False,...,False,A,D,D,False,rev1 correct; broken link to Twitter dataset; ...,uncertain assessment,,,3
3,agile,agile_2019_011,2019,Enhancing the Use of Population Statistics Der...,https://link.springer.com/chapter/10.1007/978-...,no,no,FO,AG,False,...,False,U,D,D,False,Rev2 correct,uncertain assessment,,,3
4,agile,agile_2021_007,2021,A Comparative Study of Typing and Speech For M...,https://agile-giss.copernicus.org/articles/2/7...,yes,yes,FO,AG,False,...,False,O,O,O,False,"Rev1 correct, everything available in on Figsh...",uncertain assessment,True,https://osf.io/p473e,3


In [68]:
# convert values (work on an explicit copy to avoid pandas' SettingWithCopyWarning)
filtered_df = filtered_df.copy()
for column in TEST_COL:
    filtered_df[column] = filtered_df[column].replace(REPLACE)

filtered_df.head()

Unnamed: 0,conf,paper,year,title,link,oa,dasa,rev1,rev2,rev1_cp,...,consolidated_cp,consolidated_data,consolidated_methods,consolidated_results,consolidated_ce,consolidated_notes,disagr_type,agile_badge,agile_reproreport,disagr_id
0,agile,agile_2017_006,2017,Follow the Signs—Countering Disengagement from...,https://link.springer.com/chapter/10.1007/978-...,no,no,FO,AG,False,...,False,2.0,2.0,2.0,False,no disagreement,no disagreement,,,0
1,agile,agile_2017_014,2017,The Effect of Regional Variation and Resolutio...,https://link.springer.com/chapter/10.1007/978-...,no,no,FO,AG,False,...,False,2.0,2.0,2.0,False,no disagreement,no disagreement,,,0
2,agile,agile_2019_003,2019,Evaluating the Effectiveness of Embeddings in ...,https://link.springer.com/chapter/10.1007/978-...,no,no,FO,AG,False,...,False,3.0,2.0,2.0,False,rev1 correct; broken link to Twitter dataset; ...,uncertain assessment,,,3
3,agile,agile_2019_011,2019,Enhancing the Use of Population Statistics Der...,https://link.springer.com/chapter/10.1007/978-...,no,no,FO,AG,False,...,False,1.0,2.0,2.0,False,Rev2 correct,uncertain assessment,,,3
4,agile,agile_2021_007,2021,A Comparative Study of Typing and Speech For M...,https://agile-giss.copernicus.org/articles/2/7...,yes,yes,FO,AG,False,...,False,4.0,4.0,4.0,False,"Rev1 correct, everything available in on Figsh...",uncertain assessment,True,https://osf.io/p473e,3


In [69]:
# Create the two groups
group_a = filtered_df[filtered_df[GROUP_COL].isin(GROUP_A)][TEST_COL].dropna()
group_b = filtered_df[filtered_df[GROUP_COL].isin(GROUP_B)][TEST_COL].dropna()

group_b.head()

Unnamed: 0,consolidated_data,consolidated_methods,consolidated_results
4,4.0,4.0,4.0
12,3.0,3.0,3.0
13,4.0,3.0,4.0
14,3.0,3.0,3.0
15,3.0,3.0,3.0


## Data analysis
We have

- two independent groups (pre- and post-intervention) with unequal sample sizes
- ordinal data (i.e., a value of 4 is better than a value of 2, but not twice as good)
- the question of whether one group has significantly higher or lower ranks than the other

A suitable test is the Mann-Whitney U test (also known as the Wilcoxon rank-sum test), which

- is non-parametric
- compares the sums of ranks between groups
- is suited to ordinal data
- does not require equal sample sizes
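To make "compares the sums of ranks" concrete, the following sketch computes the U statistic by hand and compares it with SciPy. The scores are invented for illustration, and the comparison assumes a recent SciPy version (1.7 or later), where the reported statistic belongs to the first sample passed in.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

# Invented ordinal scores (1 = U ... 4 = O), for illustration only
pre = np.array([1, 2, 2, 3])
post = np.array([2, 3, 4, 4, 3])

# Rank the pooled sample; tied values share the average of their ranks
ranks = rankdata(np.concatenate([pre, post]))
r_pre = ranks[: len(pre)].sum()  # rank sum of the first group

# U for the first group: its rank sum minus the smallest possible rank sum
u_pre = r_pre - len(pre) * (len(pre) + 1) / 2

# SciPy's statistic for the same pair of samples
res = mannwhitneyu(pre, post, alternative="two-sided")
print(u_pre, res.statistic)
```

A small U for the first group means its values mostly sit below those of the second group; the largest possible value is `len(pre) * len(post)`.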

In [73]:
for test in TEST_COL:
    # Perform the Mann–Whitney U test
    stat, p_value = mannwhitneyu(group_a[test], group_b[test], alternative='two-sided')

    # Print the result
    print(f"Testing: {test}")
    print(f"Mann–Whitney U statistic: {stat}")
    print(f"P-value: {p_value}")
    print(f"Mean of ranks for groups: {group_a[test].mean(), group_b[test].mean()}")

    # Interpretation
    if p_value < P_VALUE:
        print(f"Significant difference between groups (p < {P_VALUE})")
    else:
        print(f"No significant difference between groups (p ≥ {P_VALUE})")

Testing: consolidated_data
Mann–Whitney U statistic: 605.5
P-value: 1.8153522788765465e-05
Mean of ranks for groups: (np.float64(1.7450980392156863), np.float64(2.608695652173913))
Significant difference between groups (p < 0.01)
Testing: consolidated_methods
Mann–Whitney U statistic: 452.5
P-value: 7.738848344167941e-09
Mean of ranks for groups: (np.float64(2.019607843137255), np.float64(2.891304347826087))
Significant difference between groups (p < 0.01)
Testing: consolidated_results
Mann–Whitney U statistic: 578.0
P-value: 2.3137519910790535e-07
Mean of ranks for groups: (np.float64(2.019607843137255), np.float64(2.739130434782609))
Significant difference between groups (p < 0.01)
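
The alternative hypothesis stated at the top is directional (reproducibility has increased), while the test above is two-sided. As a hedged sketch only, a one-sided variant could be called as shown below. The data are invented and the output is not a result of this study; whether a one-sided test is appropriate is a design decision for the analysis.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Invented ordinal scores, for illustration only
pre = np.array([1, 2, 2, 3, 1, 2])
post = np.array([2, 3, 4, 4, 3, 3])

two_sided = mannwhitneyu(pre, post, alternative="two-sided")
# alternative="less": the first sample (pre) tends to have smaller values than the second (post)
one_sided = mannwhitneyu(pre, post, alternative="less")

print(f"U = {one_sided.statistic}")
print(f"two-sided p = {two_sided.pvalue:.4f}, one-sided p = {one_sided.pvalue:.4f}")
```

The U statistic is the same in both calls; only the p-value changes, because the one-sided test considers a single tail of the distribution.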
