# Hypothesis C
Null hypothesis: There is no difference in the development of reproducibility over time between the AGILE and GIScience conferences.

Alternative hypothesis: The development differs between the two conferences.

Note: What is actually assessed is the level of potential reproducibility along the criteria mentioned in the assessment protocol. 

In [22]:
# all imports
import pandas as pd
import numpy as np
from scipy.stats import mannwhitneyu

## Data preparation
The available data is a CSV file. The relevant columns for this hypothesis are:

- conf: "agile" or "giscience"
- year: year of the conference
- consolidated_cp: conceptual paper true/false
- consolidated_data: data dimension in UDAO (undocumented, documented, available, open) scheme
- consolidated_methods: methods dimension in UDAO scheme
- consolidated_results: results dimension in UDAO scheme
- consolidated_ce: computational environment true/false

The necessary steps are:

1. Convert the UDAO scheme into ordinal codes 0 to 3 (U = 0, D = 1, A = 2, O = 3) and drop conceptual papers
2. Create two groups: AGILE and GIScience papers
3. Split each group into before and after the intervention

The other variables are needed later in the analysis.
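
The three steps can be illustrated on a small toy table before applying them to the real CSV. This is a minimal sketch, not the notebook's own code: the rows, the year lists, and the helper `split` are hypothetical, and `Series.map` is used here instead of `replace` purely for brevity. As in the `REPLACE` dictionary used in the cells that follow, "Not applicable" is treated as missing.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the study data
toy = pd.DataFrame({
    "conf": ["agile", "agile", "agile", "giscience", "giscience", "giscience"],
    "year": [2017, 2021, 2022, 2016, 2021, 2023],
    "consolidated_data": ["U", "O", "Not applicable", "D", "A", "U"],
})

# Step 1: map the UDAO letters to ordinal codes; anything not in the
# mapping (here "Not applicable") becomes a missing value
codes = {"U": 0, "D": 1, "A": 2, "O": 3}
toy["consolidated_data"] = toy["consolidated_data"].map(codes).astype("Int64")


def split(frame, conf, before_years, after_years):
    """Steps 2 and 3: select one conference, then split it by year lists."""
    group = frame[frame["conf"] == conf]
    before = group[group["year"].isin(before_years)]["consolidated_data"].dropna()
    after = group[group["year"].isin(after_years)]["consolidated_data"].dropna()
    return before, after


agile_before, agile_after = split(toy, "agile", [2017], [2021, 2022])
gis_before, gis_after = split(toy, "giscience", [2016], [2021, 2023])

print(len(agile_before), len(agile_after), len(gis_before), len(gis_after))
```

The 2022 toy row is removed by `dropna()` because its value is "Not applicable", which mirrors how rows without an assessment leave the analysis.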

In [23]:
# change variables as needed
INPUT_CSV = '../data-clean/all-data.csv'
CONF_COL = 'conf'
CONF_A = 'agile'
CONF_B = 'giscience'
TIME_COL = 'year'
CONF_A_BEFORE = [2017, 2018, 2019]
CONF_A_AFTER = [2021, 2022, 2023]
CONF_B_BEFORE = [2016, 2018]
CONF_B_AFTER = [2021, 2023]
TEST_COL = ['consolidated_data', 'consolidated_methods', 'consolidated_results']
REPLACE = {
    'Not applicable': np.nan,
    'U': 0,
    'D': 1,
    'A': 2,
    'O': 3
}
P_VALUE = 0.01  # significance level (alpha) used to interpret the p-values

# Load the CSV file
df = pd.read_csv(INPUT_CSV)

In [24]:
# convert values
# pd.set_option('future.no_silent_downcasting', True)
for column in TEST_COL:
    df[column] = df[column].replace(REPLACE).astype('Int64')

df.head()


Unnamed: 0,conf,paper,year,title,link,oa,dasa,rev1,rev2,rev1_cp,...,consolidated_cp,consolidated_data,consolidated_methods,consolidated_results,consolidated_ce,consolidated_notes,disagr_type,agile_badge,agile_reproreport,disagr_id
0,agile,agile_2017_006,2017,Follow the Signs—Countering Disengagement from...,https://link.springer.com/chapter/10.1007/978-...,no,no,FO,AG,False,...,False,1,1,1,False,no disagreement,no disagreement,,,0
1,agile,agile_2017_014,2017,The Effect of Regional Variation and Resolutio...,https://link.springer.com/chapter/10.1007/978-...,no,no,FO,AG,False,...,False,1,1,1,False,no disagreement,no disagreement,,,0
2,agile,agile_2019_003,2019,Evaluating the Effectiveness of Embeddings in ...,https://link.springer.com/chapter/10.1007/978-...,no,no,FO,AG,False,...,False,2,1,1,False,rev1 correct; broken link to Twitter dataset; ...,uncertain assessment,,,3
3,agile,agile_2019_011,2019,Enhancing the Use of Population Statistics Der...,https://link.springer.com/chapter/10.1007/978-...,no,no,FO,AG,False,...,False,0,1,1,False,Rev2 correct,uncertain assessment,,,3
4,agile,agile_2021_007,2021,A Comparative Study of Typing and Speech For M...,https://agile-giss.copernicus.org/articles/2/7...,yes,yes,FO,AG,False,...,False,3,3,3,False,"Rev1 correct, everything available in on Figsh...",uncertain assessment,True,https://osf.io/p473e,3


## Data analysis
We have

- two independent groups (conferences)
- within each conference, two independent subgroups (pre- and post-intervention) with unequal sample sizes
- the data is ordinal (i.e., a value of 2 is better than a value of 1, but not twice as good)
- the question is whether the ranking differs before and after, and whether this change differs between the two conferences.

A frequently suggested approach for comparing the change over time between two groups with ordinal data is the Aligned Rank Transform Analysis of Variance (e.g., the ARTool package in R). However, serious concerns have been raised about its reliability (https://statransform.github.io/jovi/).

An alternative would be to run another Mann–Whitney U test on the deltas between the two conference groups, but that would require quantitatively meaningful differences between ranks (e.g., a rank of 2 being twice as good as a rank of 1), which we don't have (see above).
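
One way to see why arithmetic on ordinal codes is fragile is to recode the categories with a different, order-preserving set of numbers. The sketch below uses made-up values and an arbitrary alternative coding (both are assumptions for illustration only): the Mann–Whitney U statistic depends only on the ordering and stays the same, while a difference of means changes with the coding.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical ordinal codes (0 = U, 1 = D, 2 = A, 3 = O)
before = np.array([0, 0, 1, 1, 2])
after = np.array([1, 2, 2, 3, 3, 3])

# Arbitrary alternative coding that keeps the order U < D < A < O
recode = {0: 0, 1: 1, 2: 5, 3: 20}
before_alt = np.array([recode[v] for v in before])
after_alt = np.array([recode[v] for v in after])

# Rank-based statistic: unaffected by the recoding
u_orig, _ = mannwhitneyu(before, after, alternative="two-sided")
u_alt, _ = mannwhitneyu(before_alt, after_alt, alternative="two-sided")

# Arithmetic on the codes: depends on the chosen numbers
delta_orig = after.mean() - before.mean()
delta_alt = after_alt.mean() - before_alt.mean()

print(f"U: {u_orig} vs {u_alt}")
print(f"Mean difference: {delta_orig:.2f} vs {delta_alt:.2f}")
```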

That leaves a descriptive comparison of effect sizes, e.g., by computing the rank-biserial correlation for both conference groups.
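
To make the effect size easier to read, the following sketch checks on made-up values that `1 - 2U / (n1 * n2)` equals the share of (before, after) pairs in which the "after" value is higher minus the share in which it is lower. It assumes SciPy 1.7 or later, where `mannwhitneyu` returns the U statistic of the first sample; the arrays are hypothetical.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical ordinal codes for one conference
before = np.array([0, 0, 1, 1, 2])
after = np.array([1, 2, 2, 3, 3, 3])

# Rank-biserial correlation from the U statistic of the first sample
u_before, _ = mannwhitneyu(before, after, alternative="two-sided")
r_from_u = 1 - (2 * u_before) / (len(before) * len(after))

# Same quantity by brute force over all (after, before) pairs
diff = after[:, None] - before[None, :]
wins = (diff > 0).sum()    # pairs where "after" is higher
losses = (diff < 0).sum()  # pairs where "before" is higher
r_from_pairs = (wins - losses) / diff.size

print(f"r from U: {r_from_u:.3f}, r from pairs: {r_from_pairs:.3f}")
```

With this sign convention, a positive r means that the "after" group tends to have the higher ranks, and tied pairs contribute nothing.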

In [25]:
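def rank_biserial(group1, group2):
    """Return the Mann–Whitney U statistic, p-value, and rank-biserial correlation."""
    # Perform the Mann–Whitney U test (SciPy reports U for group1)
    U, p = mannwhitneyu(group1, group2, alternative='two-sided')

    # Sample sizes
    n1 = len(group1)
    n2 = len(group2)

    # Compute rank-biserial correlation; with U taken for group1,
    # r > 0 means that group2 tends to have the higher ranks
    r = 1 - (2 * U) / (n1 * n2)

    return U, p, r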

In [26]:
def analyze_conferences(conf_a_before, conf_a_after, conf_b_before, conf_b_after):
    # Each entry: (conference name, before group, after group, separator printed afterwards)
    comparisons = [
        (CONF_A, conf_a_before, conf_a_after, "---"),
        (CONF_B, conf_b_before, conf_b_after, "------"),
    ]

    for test in TEST_COL:
        for name, before, after, separator in comparisons:
            # Perform the Mann–Whitney U test and rank_biserial for this conference
            stat, p_value, r = rank_biserial(before[test], after[test])

            # Print the result
            print(f"Testing: {test} for {name}")
            print(f"Mann–Whitney U statistic: {stat}")
            print(f"P-value: {p_value}")

            # Interpretation
            if p_value < P_VALUE:
                print(f"Significant difference between groups (p < {P_VALUE})")
            else:
                print(f"No significant difference between groups (p ≥ {P_VALUE})")

            print(f"Effect size (rank-biserial r): {r:.3f}")

            print(separator)

In [27]:
# Create the groups
conf_a = df[df[CONF_COL] == CONF_A]
conf_a_before = conf_a[conf_a[TIME_COL].isin(CONF_A_BEFORE)][TEST_COL].dropna()
conf_a_after = conf_a[conf_a[TIME_COL].isin(CONF_A_AFTER)][TEST_COL].dropna()

conf_b = df[df[CONF_COL] == CONF_B]
conf_b_before = conf_b[conf_b[TIME_COL].isin(CONF_B_BEFORE)][TEST_COL].dropna()
conf_b_after = conf_b[conf_b[TIME_COL].isin(CONF_B_AFTER)][TEST_COL].dropna()

conf_b_after.head()

Unnamed: 0,consolidated_data,consolidated_methods,consolidated_results
26,0,2,1
27,1,2,2
52,1,2,1
53,0,1,1
54,0,1,1


In [28]:
analyze_conferences(conf_a_before, conf_a_after, conf_b_before, conf_b_after)

Testing: consolidated_data for agile
Mann–Whitney U statistic: 605.5
P-value: 1.8153522788765465e-05
Significant difference between groups (p < 0.01)
Effect size (rank-biserial r): 0.484
---
Testing: consolidated_data for giscience
Mann–Whitney U statistic: 389.0
P-value: 0.747457444959889
No significant difference between groups (p ≥ 0.01)
Effect size (rank-biserial r): 0.047
------
Testing: consolidated_methods for agile
Mann–Whitney U statistic: 452.5
P-value: 7.738848344167941e-09
Significant difference between groups (p < 0.01)
Effect size (rank-biserial r): 0.614
---
Testing: consolidated_methods for giscience
Mann–Whitney U statistic: 242.0
P-value: 0.00047106935239097617
Significant difference between groups (p < 0.01)
Effect size (rank-biserial r): 0.407
------
Testing: consolidated_results for agile
Mann–Whitney U statistic: 578.0
P-value: 2.3137519910790535e-07
Significant difference between groups (p < 0.01)
Effect size (rank-biserial r): 0.507
---
Testing: consolidated_res

For every tested dimension, the effect size for conference A (AGILE) is larger than for conference B (GIScience). This supports the argument that

- there has been a general increase in potential reproducibility in the research domain over the study period
- the increase is larger in relative terms and absolute ranks for AGILE
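
The descriptive comparison can be tabulated directly from the effect sizes printed above. This sketch only restates those reported values; the results dimension is left out because its GIScience output is cut off in this export, and the variable names are illustrative.

```python
# Rank-biserial effect sizes copied from the printed output above
reported = {
    "consolidated_data": {"agile": 0.484, "giscience": 0.047},
    "consolidated_methods": {"agile": 0.614, "giscience": 0.407},
}

# Gap between the two conferences per dimension (descriptive only, no test)
gaps = {dim: r["agile"] - r["giscience"] for dim, r in reported.items()}

for dim, gap in gaps.items():
    r = reported[dim]
    print(f"{dim}: AGILE r = {r['agile']:.3f}, "
          f"GIScience r = {r['giscience']:.3f}, gap = {gap:.3f}")
```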

Whether this is due to the introduction of the guidelines and the reproducibility reviews is open for interpretation. It should also be noted that there is an overlap in the community (in terms of authors, reviewers, and scientific program committee), so one could expect a certain type of informal spill-over effect. 

We also checked that the filtering and grouping worked as intended by comparing the Mann–Whitney U test results against those in the Hypothesis A/B notebooks. They are identical, as expected.

### Added 2024 for AGILE and 2020 for GIScience

In [29]:
CONF_A_BEFORE = [2017, 2018, 2019]
CONF_A_AFTER = [2021, 2022, 2023, 2024]
CONF_B_BEFORE = [2016, 2018]
CONF_B_AFTER = [2020, 2021, 2023]

# Create the groups
conf_a = df[df[CONF_COL] == CONF_A]
conf_a_before = conf_a[conf_a[TIME_COL].isin(CONF_A_BEFORE)][TEST_COL].dropna()
conf_a_after = conf_a[conf_a[TIME_COL].isin(CONF_A_AFTER)][TEST_COL].dropna()

conf_b = df[df[CONF_COL] == CONF_B]
conf_b_before = conf_b[conf_b[TIME_COL].isin(CONF_B_BEFORE)][TEST_COL].dropna()
conf_b_after = conf_b[conf_b[TIME_COL].isin(CONF_B_AFTER)][TEST_COL].dropna()

analyze_conferences(conf_a_before, conf_a_after, conf_b_before, conf_b_after)

Testing: consolidated_data for agile
Mann–Whitney U statistic: 804.0
P-value: 1.9517140587696757e-05
Significant difference between groups (p < 0.01)
Effect size (rank-biserial r): 0.456
---
Testing: consolidated_data for giscience
Mann–Whitney U statistic: 594.0
P-value: 0.40952949890888923
No significant difference between groups (p ≥ 0.01)
Effect size (rank-biserial r): 0.104
------
Testing: consolidated_methods for agile
Mann–Whitney U statistic: 565.0
P-value: 1.5353670324339366e-09
Significant difference between groups (p < 0.01)
Effect size (rank-biserial r): 0.618
---
Testing: consolidated_methods for giscience
Mann–Whitney U statistic: 447.0
P-value: 0.0019107013695415555
Significant difference between groups (p < 0.01)
Effect size (rank-biserial r): 0.326
------
Testing: consolidated_results for agile
Mann–Whitney U statistic: 740.0
P-value: 1.0870452046378186e-07
Significant difference between groups (p < 0.01)
Effect size (rank-biserial r): 0.500
---
Testing: consolidated_r