# Combined hypothesis testing

This notebook combines the entire analysis workflow. It tests whether
- there was an increase in potential reproducibility in AGILE papers after the introduction of the guidelines in 2020 (A)
- there was an increase in potential reproducibility in GIScience papers after that introduction (B)
- there was a difference in increase between the two conference series (C)

## Preliminaries: preparing data and analysis functions

This section loads and prepares the input data and provides the functions necessary for the hypothesis testing. It only needs to be run once. All later sections depend on this one. The sections for Hypotheses A and B are independent of one another, so you can explore different variable settings for one without affecting the other; Hypothesis C reuses the groups built for A and B.

### Setup and prepare data
First the necessary libraries are imported and the global variables are set (change them as needed); then the data is loaded and preprocessed.

The script assumes that the available data is a CSV file with the following relevant columns and variable values (for detailed explanations of the UDAO scheme, see the corresponding paper). 

- conf: "agile" or "giscience"
- year: year of the conference proceedings
- consolidated_cp: conceptual paper true/false
- consolidated_data: data dimension in UDAO (undocumented, documented, available, open) scheme
- consolidated_methods: methods dimension in UDAO scheme
- consolidated_results: results dimension in UDAO scheme
- consolidated_ce: computational environment true/false

The preprocessing requires the same two steps for all hypotheses:

1. We need to remove all conceptual papers.
2. For basic calculations on ranks (see discussion on that in a later section), we need to convert the UDAO scheme into ordinal ranks 0 to 3 (U = 0, D = 1, A = 2, O = 3).
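
As a minimal sketch of these two steps, the following toy example uses plain Python and made-up records. The field names mirror the CSV columns and the letter-to-number mapping follows the `REPLACE` dictionary in the setup cell; the names `papers`, `UDAO_RANK`, `non_conceptual`, and `ranked` are hypothetical and only serve the illustration.

```python
# Made-up records; only the field names are taken from the CSV description.
papers = [
    {"paper": "p1", "consolidated_cp": False, "consolidated_data": "U"},
    {"paper": "p2", "consolidated_cp": True, "consolidated_data": "Not applicable"},
    {"paper": "p3", "consolidated_cp": False, "consolidated_data": "A"},
    {"paper": "p4", "consolidated_cp": False, "consolidated_data": "O"},
]

# Ordinal coding of the UDAO levels; "Not applicable" becomes a missing value.
UDAO_RANK = {"U": 0, "D": 1, "A": 2, "O": 3, "Not applicable": None}

# Step 1: remove conceptual papers.
non_conceptual = [p for p in papers if not p["consolidated_cp"]]

# Step 2: replace the UDAO letters with their ordinal rank.
ranked = [
    {**p, "consolidated_data": UDAO_RANK[p["consolidated_data"]]}
    for p in non_conceptual
]

for p in ranked:
    print(p["paper"], p["consolidated_data"])
```

The ranks only encode an order: "O" is better than "A", but the numeric distance between the levels has no meaning.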

In [36]:
# all imports
import pandas as pd
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.contingency_tables import Table2x2
from IPython.display import display

pd.set_option('future.no_silent_downcasting', True)

# change variables as needed
INPUT_CSV = '../data-clean/all-data.csv'
CONF_COL = 'conf'
YEAR_COL = 'year'
TEST_COL = ['consolidated_data', 'consolidated_methods', 'consolidated_results']
REPLACE = {
    'Not applicable': np.nan,
    'U': 0,
    'D': 1,
    'A': 2,
    'O': 3
}
P_VALUE = 0.01  # significance level

In [4]:
# Load the CSV file
df = pd.read_csv(INPUT_CSV)

# remove conceptual papers (copy, so later column assignments act on an independent frame)
mask = df['consolidated_cp'] == True
df_nocp = df[~mask].copy()

print(len(df), len(df_nocp))

224 203


In [5]:
# convert values
for column in TEST_COL:
    df_nocp[column] = df_nocp[column].replace(REPLACE).astype('Int64')

df_nocp.head(3)



Unnamed: 0,conf,paper,year,title,link,oa,dasa,rev1,rev2,rev1_cp,...,consolidated_cp,consolidated_data,consolidated_methods,consolidated_results,consolidated_ce,consolidated_notes,disagr_type,agile_badge,agile_reproreport,disagr_id
0,agile,agile_2017_006,2017,Follow the Signs—Countering Disengagement from...,https://link.springer.com/chapter/10.1007/978-...,no,no,FO,AG,False,...,False,1,1,1,False,no disagreement,no disagreement,,,0
1,agile,agile_2017_014,2017,The Effect of Regional Variation and Resolutio...,https://link.springer.com/chapter/10.1007/978-...,no,no,FO,AG,False,...,False,1,1,1,False,no disagreement,no disagreement,,,0
2,agile,agile_2019_003,2019,Evaluating the Effectiveness of Embeddings in ...,https://link.springer.com/chapter/10.1007/978-...,no,no,FO,AG,False,...,False,2,1,1,False,rev1 correct; broken link to Twitter dataset; ...,uncertain assessment,,,3


### Setup analysis functions

For testing hypotheses A and B, we have

- two independent groups (pre- and post-intervention) with unequal sample sizes
- the data is ranked (i.e., a value of 2 is better than a value of 1, but not twice as good)

The question is whether any one group has significantly higher or lower ranks than the other.

The suitable test is the Mann-Whitney U test (also known as the Wilcoxon rank-sum test), which

- is non-parametric
- compares the sums of ranks between groups
- is suited for ordinal data
- is robust to different sample sizes
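
To illustrate what the U statistic measures, here is a minimal sketch in plain Python on made-up ranks. It counts, over all pairs of one observation per group, how often the first group has the higher value, with ties counted as one half. The function name `mann_whitney_u_by_pairs` and the sample values are hypothetical; the p-value that `mannwhitneyu` also returns is not reproduced here.

```python
def mann_whitney_u_by_pairs(x, y):
    """Return (U_x, U_y): pairwise wins of each sample, ties counted as 0.5."""
    u_x = 0.0
    for a in x:
        for b in y:
            if a > b:
                u_x += 1.0
            elif a == b:
                u_x += 0.5
    # Every pair contributes exactly 1 to U_x + U_y.
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Made-up ordinal ranks (0 to 3) for two groups of unequal size.
pre = [0, 0, 1, 1, 2]
post = [1, 2, 2, 3, 3, 3, 1]

u_pre, u_post = mann_whitney_u_by_pairs(pre, post)
print("U (pre):", u_pre, "U (post):", u_post)
```

A small U for the first group relative to the number of pairs means that its values are mostly lower than those of the second group. Only the order of the values enters the computation, which is why the test suits ordinal data.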

The analysis function takes two groups and a column to test on and returns a dataframe. A secondary analysis function for descriptive statistics takes the same input and also returns a dataframe.

In [66]:
def mann_whitney_u(group_1, group_2, column):
    # Perform the two-sided Mann–Whitney U test; in current SciPy versions
    # the returned statistic is the U value of the first sample (group_1)
    stat, p_value = mannwhitneyu(group_1[column], group_2[column], alternative='two-sided')

    # Collect the results in a one-row data frame
    u_stats = pd.DataFrame([
        {
            "Criterion": column,
            "U-Statistic": stat,
            "P-value": p_value
        }
    ])

    return u_stats


def descr_stats(group_1, group_2, column):
    # Collect results in a data frame
    # Notes: mode().iloc[0] picks the smallest value if several modes exist;
    # int() truncates a fractional median (possible with an even group size)
    d_stats = pd.DataFrame([
        {
            "Group": "Pre-Intervention",
            "criterion": column,
            "mean": group_1[column].mean(),
            "mode": group_1[column].mode().iloc[0],
            "median": int(group_1[column].median())
        },
        {
            "Group": "Post-Intervention",
            "criterion": column,
            "mean": group_2[column].mean(),
            "mode": group_2[column].mode().iloc[0],
            "median": int(group_2[column].median())
        }
    ])

    return d_stats

For testing Hypothesis C, we have

- two independent groups (conferences)
- for each group, we have two independent sub-groups (pre- and post-intervention) with unequal sample sizes
- the data is ranked (i.e., a value of 2 is better than a value of 1, but not twice as good)

The question is whether any difference in ranking before and after is different between the two conferences.

An often suggested approach to compare the change over time between two groups with ordinal data is the Aligned Rank Transform Analysis of Variance (e.g., the ARTool package in R). However, serious concerns have been raised about its reliability (https://statransform.github.io/jovi/).

An alternative would be to run another Mann-Whitney U test on the deltas between the two conference groups, but that would require quantitatively meaningful differences between ranks (e.g., a rank of 2 being twice that of 1), which we don't have (see above).

That leaves a descriptive comparison between effect sizes, e.g., by computing the rank-biserial correlation for both conference groups.
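
To make the effect size concrete, the following sketch on made-up ranks shows two equivalent ways to obtain the rank-biserial correlation: as the share of pairs in which the second group ranks higher minus the share in which the first group ranks higher, and from the U statistic of the first group via r = 1 - 2U / (n1 * n2). The helper names and sample values are hypothetical.

```python
def count_pairs(pre, post):
    """Count pairs where post wins, pre wins, or both values are tied."""
    wins_post = wins_pre = ties = 0
    for a in pre:
        for b in post:
            if b > a:
                wins_post += 1
            elif a > b:
                wins_pre += 1
            else:
                ties += 1
    return wins_post, wins_pre, ties


# Made-up ordinal ranks (0 to 3) for two groups of unequal size.
pre = [0, 0, 1, 1, 2]
post = [1, 2, 2, 3, 3, 3, 1]
n_pairs = len(pre) * len(post)

wins_post, wins_pre, ties = count_pairs(pre, post)

# Variant 1: difference between favourable and unfavourable pair shares.
r_pairs = (wins_post - wins_pre) / n_pairs

# Variant 2: from the U statistic of the first group (ties count as 0.5).
u_pre = wins_pre + 0.5 * ties
r_from_u = 1 - (2 * u_pre) / n_pairs

print(r_pairs, r_from_u)
```

Both variants give the same value. It ranges from -1 to 1; a positive value means the second (post-intervention) group tends to have higher ranks, and 0 means neither group dominates.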

In [73]:
def rank_biserial(group_1, group_2, column):
    # Perform the Mann–Whitney U test (stat is the U value of group_1)
    stat, p_value = mannwhitneyu(group_1[column], group_2[column], alternative='two-sided')

    # Sample sizes (assumes the groups contain no missing values, e.g. after dropna())
    n1 = len(group_1)
    n2 = len(group_2)

    # Compute rank-biserial correlation; positive values mean that
    # group_2 tends to have higher ranks than group_1
    r = 1 - (2 * stat) / (n1 * n2)

    # Add the results to data frame
    effect_stats = pd.DataFrame([
        {
            "Criterion": column,
            "U-Statistic": stat,
            "P-value": p_value,
            "Rank-Biserial": r
        }
    ])

    return effect_stats

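Another option is to compare odds ratios. For this, we need to collapse the ranks into a binary outcome. With the 0 to 3 coding used here, the threshold `>= 3` in the function below treats only rank 3 ("O") as "high" and ranks 0 to 2 ("U", "D", "A") as "low".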

In [8]:
def odds_ratios(group_1, group_2, column):
    # Collapse to a binary outcome without modifying the input frames:
    # with the 0-3 coding from REPLACE, only rank 3 ("O") counts as high
    high_1 = (group_1[column] >= 3).astype(int)
    high_2 = (group_2[column] >= 3).astype(int)

    # create 2x2 contingency table (rows: before/after, columns: low/high)
    table = pd.DataFrame(
        [[(high_1 == 0).sum(), (high_1 == 1).sum()],
         [(high_2 == 0).sum(), (high_2 == 1).sum()]],
        index=["before", "after"],
        columns=["low", "high"]
    )
    table2x2 = Table2x2(table.values)

    # calculate and print; test_nominal_association() is a chi-square test
    # of independence, not Fisher's exact test
    print("\n", table)
    print("Odds ratio:", table2x2.oddsratio)
    print("95% CI:", table2x2.oddsratio_confint())
    print("P-value (chi-square):", table2x2.test_nominal_association().pvalue)

## Hypothesis A

First, let's select the conference and then check the distribution over the years. Then you can specify the years to create the groups, before the dataframe is filtered and the test carried out.

In [9]:
df_A = df_nocp[df_nocp[CONF_COL] == 'agile']

print(df_A['year'].value_counts().sort_index(), len(df_A))

year
2017    16
2018    17
2019    18
2020    21
2021    13
2022    20
2023    13
2024    12
Name: count, dtype: int64 130


In [10]:
# specify the years for each group
GROUP_A1 = [2017, 2018, 2019]
GROUP_A2 = [2021, 2022, 2023, 2024]

# Create the two groups
group_a1 = df_A[df_A[YEAR_COL].isin(GROUP_A1)][TEST_COL].dropna()
group_a2 = df_A[df_A[YEAR_COL].isin(GROUP_A2)][TEST_COL].dropna()

print(len(group_a1), len(group_a2))

51 58


In [72]:
d_stats = pd.DataFrame()
u_stats = pd.DataFrame()

for column in TEST_COL:
    d_stats = pd.concat([d_stats, descr_stats(group_a1, group_a2, column)], ignore_index=True)
    u_stats = pd.concat([u_stats, mann_whitney_u(group_a1, group_a2, column)], ignore_index=True)

display(d_stats.style.hide(axis="index"))
display(u_stats.style.hide(axis="index"))

| Group | criterion | mean | mode | median |
|---|---|---|---|---|
| Pre-Intervention | consolidated_data | 0.745098 | 0 | 1 |
| Post-Intervention | consolidated_data | 1.586207 | 2 | 2 |
| Pre-Intervention | consolidated_methods | 1.019608 | 1 | 1 |
| Post-Intervention | consolidated_methods | 1.931034 | 2 | 2 |
| Pre-Intervention | consolidated_results | 1.019608 | 1 | 1 |
| Post-Intervention | consolidated_results | 1.741379 | 1 | 2 |


| Criterion | U-Statistic | P-value |
|---|---|---|
| consolidated_data | 804.0 | 2e-05 |
| consolidated_methods | 565.0 | 0.0 |
| consolidated_results | 740.0 | 0.0 |


For all three dimensions, the two-sided test shows a statistically significant difference at the chosen significance level of 0.01, and the descriptive statistics show that this difference is an increase in potential reproducibility.

We therefore reject the null hypothesis (no difference between the pre- and post-intervention groups) for the data, methods, and results dimensions.

Further, the descriptive statistics show that the ranks have improved for all dimension/statistic combinations except the mode of results. The increase in mode and median for data and methods is especially meaningful, because improving to rank 2 ("Available") has the largest practical impact.

## Hypothesis B

This follows the pattern of Hypothesis A: First, let's select the conference and then check the distribution over the years. Then you can specify the years to create the groups, before the dataframe is filtered and the test carried out.

In [69]:
df_B = df_nocp[df_nocp[CONF_COL] == 'giscience']

print(df_B['year'].value_counts().sort_index(), len(df_B))

year
2016    17
2018    17
2020    15
2021    13
2023    11
Name: count, dtype: int64 73


In [70]:
# specify the years for each group
GROUP_B1 = [2016, 2018]
GROUP_B2 = [2021, 2023]

# Create the two groups
group_b1 = df_B[df_B[YEAR_COL].isin(GROUP_B1)][TEST_COL].dropna()
group_b2 = df_B[df_B[YEAR_COL].isin(GROUP_B2)][TEST_COL].dropna()

print(len(group_b1), len(group_b2))

34 24


In [71]:
d_stats = pd.DataFrame()
u_stats = pd.DataFrame()

for column in TEST_COL:
    d_stats = pd.concat([d_stats, descr_stats(group_b1, group_b2, column)], ignore_index=True)
    u_stats = pd.concat([u_stats, mann_whitney_u(group_b1, group_b2, column)], ignore_index=True)

display(d_stats.style.hide(axis="index"))
display(u_stats.style.hide(axis="index"))

| Group | criterion | mean | mode | median |
|---|---|---|---|---|
| Pre-Intervention | consolidated_data | 0.647059 | 0 | 0 |
| Post-Intervention | consolidated_data | 0.75 | 0 | 0 |
| Pre-Intervention | consolidated_methods | 1.0 | 1 | 1 |
| Post-Intervention | consolidated_methods | 1.5 | 1 | 1 |
| Pre-Intervention | consolidated_results | 0.941176 | 1 | 1 |
| Post-Intervention | consolidated_results | 1.333333 | 1 | 1 |


| Criterion | U-Statistic | P-value |
|---|---|---|
| consolidated_data | 389.0 | 0.747457 |
| consolidated_methods | 242.0 | 0.000471 |
| consolidated_results | 288.0 | 0.001641 |


For data, there was no statistically significant difference in potential reproducibility at the chosen significance level of 0.01.

For methods and results, there were statistically significant increases in potential reproducibility at the chosen significance level of 0.01.

We therefore reject the null hypothesis (no difference between the pre- and post-intervention groups) for the methods and results dimensions, but not for the data dimension.

While the mean ranks have increased for all dimensions, the mode and median stayed the same.
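
The decision rule behind these statements can be written out explicitly. The sketch below compares the p-values reported in the table above with the significance level of 0.01; the dictionary names `p_values_b` and `decisions` are hypothetical and not part of the analysis functions.

```python
P_VALUE = 0.01  # significance level chosen in the setup cell

# Two-sided p-values as reported in the Hypothesis B results table.
p_values_b = {
    "consolidated_data": 0.747457,
    "consolidated_methods": 0.000471,
    "consolidated_results": 0.001641,
}

# Reject the null hypothesis (no difference) when p is below the level.
decisions = {criterion: p < P_VALUE for criterion, p in p_values_b.items()}

for criterion, reject in decisions.items():
    print(criterion, "-> reject H0" if reject else "-> fail to reject H0")
```

Because the test is two-sided, a rejection only indicates a difference; the direction (an increase) follows from the descriptive statistics.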

## Hypothesis C

This step relies on the previous steps to build the groups for the testing. 

In [76]:
effect_stats_a = pd.DataFrame()

print("\nFirst the effect sizes for the pre-/post-intervention groups from Hypothesis A:")

for column in TEST_COL:
    effect_stats_a = pd.concat([effect_stats_a, rank_biserial(group_a1, group_a2, column)], ignore_index=True)

display(effect_stats_a.style.hide(axis="index"))

print("\nThen the effect sizes for the pre-/post-intervention groups from Hypothesis B:")

effect_stats_b = pd.DataFrame()

for column in TEST_COL:
    effect_stats_b = pd.concat([effect_stats_b, rank_biserial(group_b1, group_b2, column)], ignore_index=True)

display(effect_stats_b.style.hide(axis="index"))


First the effect sizes for the pre-/post-intervention groups from Hypothesis A:


| Criterion | U-Statistic | P-value | Rank-Biserial |
|---|---|---|---|
| consolidated_data | 804.0 | 2e-05 | 0.456389 |
| consolidated_methods | 565.0 | 0.0 | 0.617985 |
| consolidated_results | 740.0 | 0.0 | 0.499662 |



Then the effect sizes for the pre-/post-intervention groups from Hypothesis B:


| Criterion | U-Statistic | P-value | Rank-Biserial |
|---|---|---|---|
| consolidated_data | 389.0 | 0.747457 | 0.046569 |
| consolidated_methods | 242.0 | 0.000471 | 0.406863 |
| consolidated_results | 288.0 | 0.001641 | 0.294118 |


For every tested dimension, the effect size for conference A (AGILE) is larger than for conference B (GIScience). This supports the argument that

- there has been a general increase in potential reproducibility in the research domain over the study period
- the increase is larger in relative terms and absolute ranks for AGILE

Whether this is due to the introduction of the guidelines and the reproducibility reviews is open for interpretation. It should also be noted that there is an overlap in the community (in terms of authors, reviewers, and scientific program committee), so one could expect a certain type of informal spill-over effect. 

Finally, we can also compare the odds ratios:

In [18]:
for column in TEST_COL:
    print("\nColumn: " + column)
    print("\nGroup A:")
    odds_ratios(group_a1, group_a2, column)
    print("\nGroup B:")
    odds_ratios(group_b1, group_b2, column)


Column: consolidated_data

Group A:

         low  high
before   51     0
after    46    12
Odds ratio: 26.608695652173914
95% CI: (np.float64(1.5285797936024483), np.float64(463.1898755127469))
P-value (chi-square): 0.0012009731256340528

Group B:

         low  high
before   34     0
after    22     2
Odds ratio: 6.181818181818182
95% CI: (np.float64(0.2662226714666457), np.float64(143.54478460654204))
P-value (chi-square): 0.2003755421490082

Column: consolidated_methods

Group A:

         low  high
before   51     0
after    43    15
Odds ratio: 35.58139534883721
95% CI: (np.float64(2.0649946685469067), np.float64(613.0939291291904))
P-value (chi-square): 0.0001919252787526693

Group B:

         low  high
before   34     0
after    22     2
Odds ratio: 6.181818181818182
95% CI: (np.float64(0.2662226714666457), np.float64(143.54478460654204))
P-value (chi-square): 0.2003755421490082

Column: consolidated_results

Group A:

         low  high
before   51     0
after    46 

For all reproducibility dimensions, the odds ratios for an improvement are higher for AGILE than for GIScience, and all have a p-value below our chosen significance level of 0.01 for AGILE, while none do for GIScience. This is another indication that the improvement for AGILE has been stronger than for GIScience.
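
As a cross-check, the figures for the data dimension can be recomputed by hand from the printed 2x2 tables. The sketch below uses only the standard library and rests on three assumptions: zero cells are replaced by 0.5, the confidence interval is a Wald interval on the log odds ratio, and the p-value comes from a Pearson chi-square test with one degree of freedom and no continuity correction. These assumptions reproduce the recorded values but are not taken from the library documentation; the function name `odds_ratio_summary` is hypothetical.

```python
import math


def odds_ratio_summary(table, z=1.959963984540054):
    """Odds ratio, Wald 95% CI, and chi-square p-value for a 2x2 table.

    Assumption: zero cells are replaced by 0.5 before any computation.
    """
    (a, b), (c, d) = [
        [0.5 if cell == 0 else float(cell) for cell in row] for row in table
    ]

    # Odds of "high" after the intervention relative to before.
    odds_ratio = (a * d) / (b * c)

    # Wald interval on the log scale.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    ci = (math.exp(log_or - z * se_log_or), math.exp(log_or + z * se_log_or))

    # Pearson chi-square statistic and its p-value for 1 degree of freedom.
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))

    return odds_ratio, ci, p_value


# Rows: before/after; columns: low/high (data dimension, as printed above).
or_a, ci_a, p_a = odds_ratio_summary([[51, 0], [46, 12]])  # AGILE
or_b, ci_b, p_b = odds_ratio_summary([[34, 0], [22, 2]])   # GIScience

print("AGILE:", or_a, ci_a, p_a)
print("GIScience:", or_b, ci_b, p_b)
```

The zero count of "high" papers before the intervention is why the confidence intervals are so wide: the odds ratio depends heavily on the value substituted for the empty cell, so the intervals carry more information than the point estimates.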