# Hypothesis B
There is an increase in the level of reproducibility of GIScience papers between the AGILE pre-intervention (2016 and 2018) and post-intervention (2021 and 2023) periods.

Null hypothesis: The level of reproducibility has not increased after the intervention.

Note: What is actually assessed is the level of potential reproducibility along the criteria mentioned in the assessment protocol. 

In [1]:
# all imports
import pandas as pd
import numpy as np
from scipy.stats import mannwhitneyu

## Data preparation
The available data is a CSV file. The relevant columns for this hypothesis are:

- conf: "agile" or "giscience"
- year: year of the conference
- consolidated_cp: conceptual paper true/false
- consolidated_data: data dimension in UDAO (undocumented, documented, available, open) scheme
- consolidated_methods: methods dimension in UDAO scheme
- consolidated_results: results dimension in UDAO scheme
- consolidated_ce: computational environment true/false

The necessary steps are:

1. Filter conf for "giscience"
2. Convert the UDAO scheme into ranks 0 to 3 (U = 0, D = 1, A = 2, O = 3) and drop conceptual papers (rows without a rank)
3. Create two groups: first group contains years 2016 and 2018, second group contains years 2021 and 2023

The other variables are needed later in the analysis.
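
The three steps can be tried out on a small in-memory table before running them on the real CSV. The following is a minimal sketch with made-up rows: the toy values and the names `toy`, `ranks`, `pre`, and `post` are illustrative assumptions, not part of the dataset. It uses `Series.map` for the conversion, which turns every label missing from the mapping (here "Not applicable") into a missing value.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the real dataset
toy = pd.DataFrame({
    "conf": ["giscience", "giscience", "agile", "giscience", "giscience"],
    "year": [2016, 2018, 2021, 2021, 2023],
    "consolidated_data": ["U", "Not applicable", "O", "A", "D"],
})

# Step 1: keep one conference (explicit copy, so later assignments are safe)
toy = toy[toy["conf"] == "giscience"].copy()

# Step 2: map UDAO levels to ordinal ranks; unmapped labels become missing
ranks = {"U": 0, "D": 1, "A": 2, "O": 3}
toy["consolidated_data"] = toy["consolidated_data"].map(ranks).astype("Int64")

# Step 3: split into pre- and post-intervention groups, dropping missing ranks
pre = toy.loc[toy["year"].isin([2016, 2018]), "consolidated_data"].dropna()
post = toy.loc[toy["year"].isin([2021, 2023]), "consolidated_data"].dropna()

print(pre.tolist(), post.tolist())
```

The paper rated "Not applicable" disappears from the pre-intervention group, which is the effect the `dropna()` call is meant to have.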

In [2]:
# change variables as needed
INPUT_CSV = '../data-clean/all-data.csv'
CONF_COL = 'conf'
CONF_FILTER = 'giscience'
GROUP_COL = 'year'
GROUP_A = [2016, 2018]
GROUP_B = [2021, 2023]
TEST_COL = ['consolidated_data', 'consolidated_methods', 'consolidated_results']
REPLACE = {
    'Not applicable': np.nan,
    'U': 0,
    'D': 1,
    'A': 2,
    'O': 3
}
P_VALUE = 0.01  # significance level (alpha) used as the decision threshold

# Load the CSV file
df = pd.read_csv(INPUT_CSV)

In [3]:

# Filter the DataFrame on conference
filtered_df = df[df[CONF_COL] == CONF_FILTER]

filtered_df.head()

Unnamed: 0,conf,paper,year,title,link,oa,dasa,rev1,rev2,rev1_cp,...,consolidated_cp,consolidated_data,consolidated_methods,consolidated_results,consolidated_ce,consolidated_notes,disagr_type,agile_badge,agile_reproreport,disagr_id
19,giscience,giscience_2016_001,2016,Computing River Floods Using Massive Terrain Data,https://link.springer.com/chapter/10.1007/978-...,no,no,CG,AG,False,...,False,A,D,D,False,Rev1 correct,uncertain assessment,,,3
20,giscience,giscience_2016_009,2016,Representing the Spatial Extent of Places Base...,https://link.springer.com/chapter/10.1007/978-...,no,no,CG,AG,False,...,False,D,D,D,False,no disagreement,no disagreement,,,0
21,giscience,giscience_2016_017,2016,Exploring the Notion of Spatial Lenses,https://link.springer.com/chapter/10.1007/978-...,no,no,CG,AG,False,...,False,U,D,D,False,Needs a REV3 --> Consolidated score based on R...,borderline conceptual paper,,,1
22,giscience,giscience_2018_008,2018,Labeling Points of Interest in Dynamic Maps us...,https://drops.dagstuhl.de/storage/00lipics/lip...,yes,no,CG,AG,False,...,False,U,A,D,False,Rev1 correct,uncertain assessment,,,3
23,giscience,giscience_2018_016,2018,FUTURES-AMR: Towards an Adaptive Mesh Refineme...,https://drops.dagstuhl.de/storage/00lipics/lip...,yes,no,CG,AG,False,...,False,U,D,D,False,Rev2 correct,uncertain assessment,,,3


In [4]:
# convert UDAO values to integer ranks
# pd.set_option('future.no_silent_downcasting', True)
# work on an explicit copy so the assignment below does not trigger
# pandas' SettingWithCopyWarning
filtered_df = filtered_df.copy()
for column in TEST_COL:
    filtered_df[column] = filtered_df[column].replace(REPLACE).astype('Int64')

filtered_df.head()


Unnamed: 0,conf,paper,year,title,link,oa,dasa,rev1,rev2,rev1_cp,...,consolidated_cp,consolidated_data,consolidated_methods,consolidated_results,consolidated_ce,consolidated_notes,disagr_type,agile_badge,agile_reproreport,disagr_id
19,giscience,giscience_2016_001,2016,Computing River Floods Using Massive Terrain Data,https://link.springer.com/chapter/10.1007/978-...,no,no,CG,AG,False,...,False,2,1,1,False,Rev1 correct,uncertain assessment,,,3
20,giscience,giscience_2016_009,2016,Representing the Spatial Extent of Places Base...,https://link.springer.com/chapter/10.1007/978-...,no,no,CG,AG,False,...,False,1,1,1,False,no disagreement,no disagreement,,,0
21,giscience,giscience_2016_017,2016,Exploring the Notion of Spatial Lenses,https://link.springer.com/chapter/10.1007/978-...,no,no,CG,AG,False,...,False,0,1,1,False,Needs a REV3 --> Consolidated score based on R...,borderline conceptual paper,,,1
22,giscience,giscience_2018_008,2018,Labeling Points of Interest in Dynamic Maps us...,https://drops.dagstuhl.de/storage/00lipics/lip...,yes,no,CG,AG,False,...,False,0,2,1,False,Rev1 correct,uncertain assessment,,,3
23,giscience,giscience_2018_016,2018,FUTURES-AMR: Towards an Adaptive Mesh Refineme...,https://drops.dagstuhl.de/storage/00lipics/lip...,yes,no,CG,AG,False,...,False,0,1,1,False,Rev2 correct,uncertain assessment,,,3


In [5]:
# Create the two groups
group_a = filtered_df[filtered_df[GROUP_COL].isin(GROUP_A)][TEST_COL].dropna()
group_b = filtered_df[filtered_df[GROUP_COL].isin(GROUP_B)][TEST_COL].dropna()

group_b.head()

Unnamed: 0,consolidated_data,consolidated_methods,consolidated_results
26,0,2,1
27,1,2,2
52,1,2,1
53,0,1,1
54,0,1,1


## Data analysis
We have

- two independent groups (pre- and post-intervention) with unequal sample sizes
- ranked (ordinal) data, i.e., a value of 2 is better than a value of 1, but not twice as good
- the question of whether one group has significantly higher or lower ranks than the other

A suitable test is the Mann-Whitney U test (Wilcoxon rank-sum test), which

- is non-parametric
- compares the sums of ranks between the groups
- is suited for ordinal data
- is robust to different sample sizes
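
To make "compares the sums of ranks" concrete, the sketch below computes the U statistic by hand and checks it against SciPy. The two arrays are hypothetical values chosen for illustration, not taken from the dataset, and the variable names are assumptions of this example.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

# Hypothetical ordinal ranks (0 = U ... 3 = O), not taken from the dataset
pre = np.array([0, 1, 1, 1, 2])
post = np.array([1, 1, 2, 2, 3, 3])

# Rank all observations jointly; tied values share their average rank
joint_ranks = rankdata(np.concatenate([pre, post]))
rank_sum_pre = joint_ranks[: len(pre)].sum()

# U for the first group: its rank sum minus the smallest possible rank sum
u_manual = rank_sum_pre - len(pre) * (len(pre) + 1) / 2

# SciPy's U statistic for the same two samples
u_scipy, p_value = mannwhitneyu(pre, post, alternative="two-sided")

print(u_manual, u_scipy)
```

With many tied values, as is typical for a four-level scale, a large share of the observations receive the same average rank, so the test has less information to work with than with continuous data.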

In [6]:
def analyze_groups(group_a, group_b, column):
    # Perform the Mann–Whitney U test
    stat, p_value = mannwhitneyu(group_a[column], group_b[column], alternative='two-sided')

    # Print the results
    print(f"Testing (before/after): {column}")
    print(f"Mann–Whitney U statistic: {stat}")
    print(f"P-value: {p_value}")

    # Interpretation
    if p_value < P_VALUE:
        print(f"Significant difference between groups (p < {P_VALUE})")
    else:
        print(f"No significant difference between groups (p ≥ {P_VALUE})")

    # Descriptive stats
    grp_a_mean, grp_b_mean = group_a[column].mean(), group_b[column].mean()
    grp_a_mode, grp_b_mode = group_a[column].mode(), group_b[column].mode()
    grp_a_median, grp_b_median = group_a[column].median(), group_b[column].median()

    # Print the results
    print(f"Mean of ranks for groups: {float(grp_a_mean), float(grp_b_mean)}")
    print(f"Mode of ranks for groups: {int(grp_a_mode.iloc[0]), int(grp_b_mode.iloc[0])}")
    print(f"Median of ranks for groups: {int(grp_a_median), int(grp_b_median)}")

    print("---")

In [7]:
for test in TEST_COL:
    analyze_groups(group_a, group_b, test)

Testing (before/after): consolidated_data
Mann–Whitney U statistic: 389.0
P-value: 0.747457444959889
No significant difference between groups (p ≥ 0.01)
Mean of ranks for groups: (0.6470588235294118, 0.75)
Mode of ranks for groups: (0, 0)
Median of ranks for groups: (0, 0)
---
Testing (before/after): consolidated_methods
Mann–Whitney U statistic: 242.0
P-value: 0.00047106935239097617
Significant difference between groups (p < 0.01)
Mean of ranks for groups: (1.0, 1.5)
Mode of ranks for groups: (1, 1)
Median of ranks for groups: (1, 1)
---
Testing (before/after): consolidated_results
Mann–Whitney U statistic: 288.0
P-value: 0.0016405817255228663
Significant difference between groups (p < 0.01)
Mean of ranks for groups: (0.9411764705882353, 1.3333333333333333)
Mode of ranks for groups: (1, 1)
Median of ranks for groups: (1, 1)
---
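
The test above is two-sided, whereas the hypothesis is directional (an increase). As a hedged illustration only, the sketch below shows how a one-sided variant could be requested through the `alternative` parameter of `mannwhitneyu`. The data are hypothetical and the variable names are assumptions. The direction of a one-sided test should be fixed before looking at the data.

```python
from scipy.stats import mannwhitneyu

# Hypothetical ordinal ranks (0 = U ... 3 = O), not taken from the dataset
pre = [0, 1, 1, 1, 1, 2]
post = [1, 1, 2, 2, 2, 3, 3]

# Two-sided: do the groups differ in either direction?
_, p_two_sided = mannwhitneyu(pre, post, alternative="two-sided")

# One-sided: 'less' asks whether the first sample (pre) tends to have
# lower ranks than the second sample (post), i.e. an increase over time
_, p_increase = mannwhitneyu(pre, post, alternative="less")

print(f"two-sided p = {p_two_sided:.4f}, one-sided p = {p_increase:.4f}")
```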


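For data, there was no statistically significant difference in potential reproducibility between the pre- and post-intervention groups at the chosen significance level of 0.01.

For methods and results, the differences were statistically significant at the chosen significance level of 0.01. Because the test is two-sided, the direction of the change is read from the descriptive statistics: the mean ranks are higher in the post-intervention group, which indicates an increase in potential reproducibility.

We therefore reject the original hypothesis only for the data dimension; for the methods and results dimensions, the data support it.

While the mean ranks have increased for all dimensions, the mode and median stayed the same.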
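
A mean that rises while the median and mode stay put is common on a four-level scale: it happens when some papers move into the upper ranks while most remain at the most frequent rank. The sketch below illustrates this with hypothetical values (not taken from the dataset) by tabulating the share of papers per rank. The names `pre`, `post`, and `shares` are assumptions of this example.

```python
import pandas as pd

# Hypothetical ordinal ranks (0 = U ... 3 = O), not taken from the dataset
pre = pd.Series([1, 1, 1, 1, 1, 0, 1, 2], name="pre")
post = pd.Series([1, 1, 1, 1, 1, 2, 2, 3], name="post")

# Share of papers at each rank, per group (ranks that do not occur get 0.0)
shares = pd.DataFrame({
    "pre": pre.value_counts(normalize=True),
    "post": post.value_counts(normalize=True),
}).reindex([0, 1, 2, 3]).fillna(0.0)
print(shares)

# The mean moves, while the median and mode do not
for s in (pre, post):
    print(s.name, s.mean(), s.median(), s.mode().iloc[0])
```

Reporting such a frequency table next to the summary statistics shows where in the distribution the shift happened.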