# Creating the output from `cwl-eval`

In `compatibility/evals/*`, we have the output from `cwl-eval` for three retrieval models on the TREC AP collection: BM25, a Language Model (LM), and a Divergence from Randomness model (PL2).

- ap_bm25.eval
- ap_lmd.eval
- ap_pl2.eval

The result files for each retrieval model and the qrels file used are also in the `compatibility` folder. The eval files were created with the following commands:

    cwl-eval qrels/trec_ap_51-200.qrels results/ap_bm25.res -n > evals/ap_bm25.eval
    cwl-eval qrels/trec_ap_51-200.qrels results/ap_lmd.res -n > evals/ap_lmd.eval 
    cwl-eval qrels/trec_ap_51-200.qrels results/ap_pl2.res -n > evals/ap_pl2.eval 
    
Note that we have included the `-n` argument, which adds the column headings to the output.

Now let's load these eval files into dataframes.
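If you prefer to regenerate all three files from Python, the same commands can be assembled in a loop. This is a minimal sketch, assuming `cwl-eval` is on your `PATH` and that you run it from the `compatibility` folder. The helper name `build_cwl_eval_command` is hypothetical, and only the `-n` flag shown in the commands is used.

```python
import shlex

MODELS = ["bm25", "lmd", "pl2"]
QRELS = "qrels/trec_ap_51-200.qrels"

def build_cwl_eval_command(model):
    """Return (argv, output_path) for one retrieval model (hypothetical helper)."""
    argv = ["cwl-eval", QRELS, f"results/ap_{model}.res", "-n"]
    return argv, f"evals/ap_{model}.eval"

commands = [build_cwl_eval_command(m) for m in MODELS]

for argv, out_path in commands:
    # Print the equivalent shell line. To actually run it, one option is:
    #   with open(out_path, "w") as fh:
    #       subprocess.run(argv, stdout=fh, check=True)
    print(shlex.join(argv), ">", out_path)
```

Printing the commands first makes it easy to confirm the paths before anything is overwritten.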

# Importing the `cwl-eval` output into a DataFrame

In [39]:
import pandas as pd

dfbm25 = pd.read_csv('../compatibility/evals/ap_bm25.eval',sep='\t')
dflmd = pd.read_csv('../compatibility/evals/ap_lmd.eval',sep='\t')
dfpl2 = pd.read_csv('../compatibility/evals/ap_pl2.eval',sep='\t')


## Listing the Columns in `cwl-eval` output

By default `cwl-eval` outputs seven columns (listed below): `Topic` and `Metric`, followed by five measurements, where EU = Expected Utility per Item, ETU = Expected Total Utility, EC = Expected Cost per Item, ETC = Expected Total Cost, and ED = Expected Depth (the Expected Number of Items Examined).

In [36]:
fields = dfbm25.columns
for f in fields:
    print(f)

Topic
Metric
EU
ETU
EC
ETC
ED


## Listing the Metrics in `cwl-eval` output

By default `cwl-eval` outputs a subset of metrics, but you can specify the metrics you want with the `-m` argument.

In [17]:
metrics = dfbm25['Metric'].unique()
for m in metrics:
    print(m)

P@1
P@2
P@3
P@4
P@5
P@10
RBP@0.2
RBP@0.4
RBP@0.8
NDCG-k@5
NDCG-k@10
RR
AP
INST-T=1.0
INST-T=2.0
INST-T=3.0


## Reporting the Mean and Standard Error

Below is an example where we group by metric and report the mean and standard error of EU, ETU and ED for BM25. Note the double brackets: selecting several columns from a groupby requires a list of column names.

In [31]:
dfbm25.groupby('Metric')[['EU','ETU','ED']].agg(['mean','sem']).round(decimals=3)

| Metric | EU mean | EU sem | ETU mean | ETU sem | ED mean | ED sem |
|---|---|---|---|---|---|---|
| AP | 0.315 | 0.019 | 10.182 | 0.889 | 61.243 | 7.925 |
| INST-T=1.0 | 0.498 | 0.033 | 0.741 | 0.043 | 1.878 | 0.04 |
| INST-T=2.0 | 0.45 | 0.029 | 1.225 | 0.066 | 3.326 | 0.065 |
| INST-T=3.0 | 0.426 | 0.027 | 1.701 | 0.09 | 4.824 | 0.088 |
| NDCG-k@10 | 0.419 | 0.027 | 1.903 | 0.122 | 4.544 | 0.0 |
| NDCG-k@5 | 0.439 | 0.029 | 1.294 | 0.086 | 2.949 | 0.0 |
| P@1 | 0.487 | 0.041 | 0.487 | 0.041 | 1.0 | 0.0 |
| P@10 | 0.403 | 0.026 | 4.027 | 0.264 | 10.0 | 0.0 |
| P@2 | 0.477 | 0.035 | 0.953 | 0.069 | 2.0 | 0.0 |
| P@3 | 0.438 | 0.031 | 1.313 | 0.093 | 3.0 | 0.0 |
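
The `sem` aggregate is the standard error of the mean: the sample standard deviation (with `ddof=1`, the pandas default) divided by the square root of the number of topics. A minimal sketch on made-up EU values (the numbers are illustrative, not taken from the eval files):

```python
import math
import pandas as pd

# Hypothetical per-topic EU scores for a single metric
eu = pd.Series([0.2, 0.4, 0.4, 0.6, 0.9])

# Standard error computed by hand: sample std (ddof=1) / sqrt(n)
manual_sem = eu.std(ddof=1) / math.sqrt(len(eu))

# The same quantity as reported by .agg(['mean', 'sem'])
pandas_sem = eu.sem()

print(round(manual_sem, 6), round(pandas_sem, 6))
```

The two printed values should agree, which is a quick way to confirm what the `sem` columns in the table represent.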


# Comparing Runs


First we need to give each result list a name, which is done with the `insert` method on the dataframe.

Then we concatenate all the results together.

To perform the statistical testing we will use the Pingouin Python package.

We focus on testing whether the Expected Utility (EU) for Precision at 10 (P@10) differs between the runs.



In [91]:
if 'Name' not in dfbm25.columns:
    dfbm25.insert(0,'Name','bm25')
    dflmd.insert(0,'Name','lmd')
    dfpl2.insert(0,'Name','pl2')

dfall = pd.concat([dfbm25, dflmd, dfpl2])

import pingouin as pg

metric = 'P@10'
measurement = 'EU'

# select the metric we are interested in doing the comparison over
dftest = dfall.loc[dfall['Metric'] == metric]

# show a table of the different runs for the expected utility
dftest.groupby(['Name','Metric'])['EU'].agg(['mean','sem']).round(decimals=3)

| Name | Metric | mean | sem |
|---|---|---|---|
| bm25 | P@10 | 0.403 | 0.026 |
| lmd | P@10 | 0.423 | 0.028 |
| pl2 | P@10 | 0.432 | 0.027 |


# Is this Significantly Different?

Which one is better? Our table shows that PL2 has the highest EU for P@10 (0.432). But is this really better than BM25 and LMD?



### Perform a Repeated Measures ANOVA

To find out if there is potentially a significant difference between the runs, we first perform a Repeated Measures ANOVA. This is appropriate because each topic provides a different measurement, and each run provides a measurement for every topic. So our within-subject variable is the `Name` of the run, and the `Topic` variable identifies each subject.
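
A repeated measures design requires every topic to have a score from every run. One way to check this before running the ANOVA is to pivot the long-format data into a topic-by-run table and look for gaps. The sketch below uses a small made-up frame with the same `Name`, `Topic`, and `EU` columns as `dftest`; the values are illustrative only.

```python
import pandas as pd

# Toy long-format data: one EU score per (run, topic) pair
toy = pd.DataFrame({
    "Name":  ["bm25", "bm25", "lmd", "lmd", "pl2", "pl2"],
    "Topic": [51, 52, 51, 52, 51, 52],
    "EU":    [0.3, 0.5, 0.4, 0.5, 0.4, 0.6],
})

# Rows are subjects (topics), columns are the within-subject factor (runs)
wide = toy.pivot(index="Topic", columns="Name", values="EU")

# True only if every topic has a measurement for every run
is_complete = not wide.isna().any().any()
print(wide)
print("complete design:", is_complete)
```

On the real data, the same pivot applied to `dftest` would show one row per topic and one column per run.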



In [92]:
aov = pg.rm_anova(data=dftest, dv=measurement , within=['Name'], subject='Topic', detailed=True)
print(aov)

  Source     SS   DF     MS     F       p-unc  p-GG-corr    np2    eps  \
0   Name  0.068    2  0.034  4.98  0.00745722  0.0122786  0.032  0.807   
1  Error  2.039  298  0.007     -           -          -      -      -   

  sphericity W-spher      p-spher  
0      False   0.761  1.62009e-09  
1          -       -            -  


### Perform follow up significance testing using Pairwise T-Tests with Bonferroni Correction
Since the ANOVA shows that the corrected p-value `p-GG-corr` is less than 0.05, it motivates a follow-up test to find out which pairs of runs differ. It is tempting to use `p-unc`, but this is only valid if only two runs are being compared. Note that `np2` is the partial eta-squared effect size, where <0.06 is a small effect, 0.06-0.14 is a medium effect, and >0.14 is a large effect (see J. Cohen. 1973. Eta-squared and partial eta-squared in fixed factor ANOVA designs. Educational and Psychological Measurement 33, 1 (1973), 107–112).
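
The `np2` value can be reproduced from the sums of squares in the ANOVA table, since partial eta-squared is the effect sum of squares divided by the sum of the effect and error sums of squares. A quick check using the rounded values printed in the ANOVA output:

```python
# Sums of squares copied from the ANOVA output (rounded to 3 decimals there)
ss_name = 0.068   # SS for the Name effect
ss_error = 2.039  # SS for the Error term

# Partial eta-squared: SS_effect / (SS_effect + SS_error)
np2 = ss_name / (ss_name + ss_error)
print(round(np2, 3))  # 0.032, matching the np2 column
```

By the thresholds given here, 0.032 counts as a small effect.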


So to find out which pairs are different, we need to perform pairwise T-Tests with Bonferroni Correction.
Note: when interpreting the follow-up T-Tests we need to use the corrected p-values (`p-corr`), not the uncorrected p-values (`p-unc`).

If you are only comparing two systems/runs, then no correction is needed, and so only `p-unc` is reported.


In [89]:
pt = pg.pairwise_ttests(dv=measurement, within=['Name'], subject='Topic', data=dftest, padjust='bonf')
print(pt)

  Contrast     A    B  Paired  Parametric      T    dof       Tail     p-unc  \
0     Name  bm25  lmd    True        True -1.812  149.0  two-sided  0.071930   
1     Name  bm25  pl2    True        True -3.077  149.0  two-sided  0.002488   
2     Name   lmd  pl2    True        True -1.193  149.0  two-sided  0.234613   

     p-corr p-adjust   BF10  hedges  
0  0.215789     bonf  0.448  -0.062  
1  0.007463     bonf  8.252  -0.089  
2  0.703838     bonf  0.182  -0.026  
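
The Bonferroni correction itself is simple: each uncorrected p-value is multiplied by the number of comparisons and capped at 1. The sketch below applies it to the `p-unc` values from the pairwise output and recovers the `p-corr` column up to rounding of the inputs.

```python
# Uncorrected p-values copied from the pairwise T-Test output
p_unc = {
    ("bm25", "lmd"): 0.071930,
    ("bm25", "pl2"): 0.002488,
    ("lmd", "pl2"): 0.234613,
}

n_comparisons = len(p_unc)

# Bonferroni: multiply by the number of comparisons, cap at 1.0
p_corr = {pair: min(1.0, p * n_comparisons) for pair, p in p_unc.items()}

alpha = 0.05
for (a, b), p in p_corr.items():
    verdict = "significant" if p < alpha else "not significant"
    print(f"{a} vs {b}: p-corr = {p:.4f} ({verdict})")
```

With three comparisons, an uncorrected p-value has to fall below roughly 0.0167 to remain significant at the 0.05 level after correction.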


## Who wins?

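From the pairwise comparison, we can see that only BM25 vs PL2 shows a significant difference, with a corrected p-value of p=0.007463. So PL2's EU for P@10 (0.432) is significantly higher than BM25's (0.403), while LMD is not significantly different from either.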
