# Statistical evaluation experiments

In this notebook we follow the methodolody presented in section 4 of [1] to evaluate the performance of the Anguilla implementation of the (100+1)-MO-CMA-ES-I, (100+1)-MO-CMA-ES-P, (100+100)-MO-CMA-ES-I, (100+100)-MO-CMA-ES-P optimizers. It consists in:

- Comparing the four aforementioned optimizers plus a fifth one (Shark's NSGA-II + HV indicator) on a specific problem **f** (benchmark objective function) after **g** function evaluations.
  - The 2-D are: ZDT{1-4, 6}, IHR{1-4, 6}, ELLI{1,2}, CIGTAB{1,2}.
  - The 3-D are: DTLZ{1-7}. Note: the GELLI function from [1] is not included in this evaluation, but it is considered in the reference paper.
- Using the HV indicator as the performance measure, taking a common reference point across algorithms for each **f**. In this case, the union of the 5 * t populations after **g** function evaluations (25K and 50K), where **t** denotes the number of trials (25 in this case). We compute the mean value for the **t** trials.
- We use the Friedman Aligned Ranks test and afterwards, if the test rejects the null hypothesis, one of the available posthoc tests provided by STAC [2]. The reference paper [1] uses Bergmann-Homel's posthoc method instead, but a Python implementation is not available. The significance level is fixed (p = 0.001).

The experimental data for the Anguilla and Shark implementations was gathered in a separate GitHub repository (https://github.com/pocs-anguilla/evaluation-data) as a collection of CSV files.

For conducting the statistical tests we use the software by [2] in Python.
## References

- [1] T. Voß, N. Hansen, and C. Igel. Improved Step Size Adaptation for the MO-CMA-ES. In Genetic And Evolutionary Computation Conference, 487–494. Portland, United States, July 2010. ACM. URL: https://hal.archives-ouvertes.fr/hal-00503251, doi:10.1145/1830483.1830573.

- [2] I. Rodriguez-Fdez, A. Canosa, M. Mucientes, & A. Bugarin (2015). STAC: a web platform for the comparison of algorithms using statistical tests. In Proceedings of the 2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). URL: https://git.io/Jtapw

- [3] Terpilowski, M. (2019). scikit-posthocs: Pairwise multiple comparison tests in Python. The Journal of Open Source Software, 4(36), 1169.

- [4] J. Derrac, S. García, D. Molina, & F. Herrera (2011). A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithmsSwarm and Evolutionary Computation, 1(1), 3-18.

In [1]:
#!pip install -i https://test.pypi.org/simple/ anguilla

In [None]:
#!pip install scikit-posthocs

In [None]:
#!pip install tabulate

In [6]:
import pathlib
import dataclasses
import tabulate

from itertools import product
from typing import Optional, List

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from IPython.display import display, clear_output

import stac  # don't install from Pip

import scipy.stats as ss
import statsmodels.api as sa
import scikit_posthocs as sp

import anguilla
import anguilla.hypervolume as hv

from anguilla.dominance import NonDominatedSet2D, NonDominatedSetKD
from anguilla.fitness import benchmark
from anguilla.evaluation import load_logs

pd.set_option('display.float_format', '{:.5E}'.format)
print(anguilla.__version__)

0.0.15


In [7]:
FNS_2D = ['ZDT1', 'ZDT2', 'ZDT3', 'ZDT4', 'ZDT6', 'IHR1', 'IHR2', 'IHR3', 'IHR4', 'IHR6', 'ELLI1', 'ELLI2', 'CIGTAB1', 'CIGTAB2']
FNS_3D = ['DTLZ1', 'DTLZ2', 'DTLZ3', 'DTLZ4', 'DTLZ5', 'DTLZ6', 'DTLZ7']
OPTS = ['(100+1)-MO-CMA-ES-I', '(100+1)-MO-CMA-ES-P', '(100+100)-MO-CMA-ES-I', '(100+100)-MO-CMA-ES-P']
OPTS_EXT = OPTS + ['NSGAII']

In [8]:
class SummaryGenerator:
    def __init__(self, paths: List[str], optsl: List[List[str]], n_objectives: int, fns: List[str], 
                 n_evaluations: int, control: Optional[str] = None, search_subdirs: bool = True,
                 use_median: bool = True):
        self.n_objectives = n_objectives
        self.fns = fns
        self.n_evaluations = n_evaluations
        self.control = control
        self.use_median = use_median
        self.opts = []
        for l in optsl:
            for opt in l:
                if opt not in self.opts:
                    self.opts.append(opt)

        self._logs = {}
        for path, opts in zip(paths, optsl):
            for log in load_logs(path, fns=fns, opts=opts, n_evaluations=[n_evaluations], observations=["fitness"], search_subdirs=search_subdirs):
                if log.fn not in self._logs:
                    self._logs[log.fn] = {}
                if log.optimizer not in self._logs[log.fn]:
                    self._logs[log.fn][log.optimizer] = []
                self._logs[log.fn][log.optimizer].append(log)

        self._mean_summary = None
        self._median_summary = None
        self._db = None

    def _compute_reference_point(self, fn: str):
        point_set = NonDominatedSet2D() if self.n_objectives == 2 else NonDominatedSetKD()
        for logs in self._logs[fn].values():
            for log in logs:
                point_set.insert(log.data)
        reference = point_set.upper_bound + 1.0
        return reference

    def _compute_indicators(self, fn:str , opt: str, reference: np.ndarray):
        indicators = []
        for log in self._logs[fn][opt]:
            indicator = hv.calculate(log.data, reference, ignore_dominated=True)
            indicators.append(indicator)
        return np.array(indicators)

    def _compute_summary(self):
        db = {}
        mean_rows = []
        median_rows = []
        for fn in self.fns:
            display(f'Computing reference point for: {fn}')
            reference = self._compute_reference_point(fn)
            display(f'Result: {reference}')
            db[fn] = []
            mean_row = []
            median_row = []
            for opt in self.opts:
                display(f'Computing HV indicators for: {opt}')
                indicators = self._compute_indicators(fn, opt, reference)  
                assert(len(indicators) == 25)
                median_ind = np.median(indicators)
                mean_ind = np.mean(indicators)
                db[fn].append({'opt': opt,
                               'mean_indicator' : mean_ind,
                               'median_indicator': median_ind,
                               'indicators': indicators,
                               'reference': reference})
                display(f'Result: mean {mean_ind}, median {median_ind}')
                mean_row.append(mean_ind)
                median_row.append(median_ind)
            mean_rows.append(mean_row)
            median_rows.append(median_row)
            clear_output()
        mean_df = pd.DataFrame(mean_rows, columns=self.opts, index=self.fns)
        median_df = pd.DataFrame(median_rows, columns=self.opts, index=self.fns)
        return mean_df, median_df, db
    
    def summary(self, useMedian=True):
        if self._mean_summary is None:
            self._mean_summary, self._median_summary, self._db = self._compute_summary()

        if useMedian:
            return self._median_summary
        return self._mean_summary
        
    def db(self):
        if self._db is None:
            self._mean_summary, self._median_summary, self._db = self._compute_summary()
        return self._db
    
    def to_csv(self, useMedian=True, name=None):
        if name is None:
            name = f'results_{self.n_evaluations}.csv'
        summary = self.summary(useMedian=useMedian)
        summary.to_csv(name)

In [9]:
s_shark_2d = SummaryGenerator(['data/shark'], [OPTS_EXT], 2, FNS_2D, 50000, search_subdirs=False)
df_shark_2d = s_shark_2d.summary()

# Augment Anguilla dataset with Shark's NSGA-II.
s_anguilla_2d = SummaryGenerator(['data/anguilla','data/shark'], [OPTS, ['NSGAII']], 2, FNS_2D, 50000, search_subdirs=True)
df_anguilla_2d = s_anguilla_2d.summary()

In [10]:
s_shark_3d = SummaryGenerator(['data/shark'], [OPTS_EXT], 3, FNS_3D, 50000, search_subdirs=False)
df_shark_3d = s_shark_3d.summary()

s_anguilla_3d = SummaryGenerator(['data/anguilla','data/shark'], [OPTS, ['NSGAII']], 2, FNS_3D, 50000, search_subdirs=True)
df_anguilla_3d = s_anguilla_3d.summary()

## Friedman aligned ranks test

In [11]:
def example_4_testcase():
    """Taken from page 9 of [4]."""

    data = np.array([
        [2.711, 3.147, 2.515, 2.612],
        [7.832, 9.828, 7.832, 7.921],
        [0.012, 0.532, 0.122, 0.005],
        [3.431, 4.111, 3.401, 3.401]
    ])

    # The third one in the paper should be 1.75 instead of 1.250.
    _, _, rankings_avg, _ = stac.friedman_test(*data.T)
    assert np.allclose(rankings_avg, [2.375, 4., 1.75, 1.875])

    _, _, rankings_avg, _ = stac.friedman_aligned_ranks_test(*data.T)
    assert np.allclose(rankings_avg, [7.625, 14.5, 5.5, 6.375])

    _, _, rankings_avg, _ = stac.quade_test(*data.T)
    assert np.allclose(rankings_avg, [2.3, 4.0, 1.55, 2.15])

example_4_testcase()

In [59]:
def compute_statistical_tests(sg1, sg2):
    df = sg1.summary(useMedian=False)
    tmp = sg2.summary(useMedian=False)
    df.append(tmp)
    
    row_labels = sg1.opts + ['Statistic', 'p-value']
    
    # Friedman's
    S, p, rankings_avg, rankings_cmp = stac.friedman_test(*df.T.to_numpy())
    out_df = pd.DataFrame(rankings_cmp + [S, p], index=row_labels, columns=['Friedman'])
    
    # Aligned Friedman's
    S, p, rankings_avg, rankings_cmp = stac.friedman_aligned_ranks_test(*df.T.to_numpy())
    out_df['Friedman Aligned'] = rankings_cmp + [S, p]
    
    # Quand's
    S, p, rankings_avg, rankings_cmp = stac.quade_test(*df.T.to_numpy())
    out_df['Quand'] = rankings_cmp + [S, p]

    return out_df

In [62]:
table0 = compute_statistical_tests(s_shark_2d, s_shark_3d)
with open('table0.tex', 'w') as f:
    latex_code = tabulate.tabulate(table0, tablefmt="latex_booktabs")
    f.write(latex_code)
table0

Unnamed: 0,Friedman,Friedman Aligned,Quand
(100+1)-MO-CMA-ES-I,3.16736,2.04759,1.83132
(100+1)-MO-CMA-ES-P,6.63352,5.37201,3.96537
(100+100)-MO-CMA-ES-I,6.0359,6.19847,4.02493
(100+100)-MO-CMA-ES-P,6.0359,5.72488,3.64774
NSGAII,3.22712,3.73301,2.16383
Statistic,8.81507,20.995,6.49621
p-value,1.64356e-05,0.000317398,0.000256418


In [63]:
table1 = compute_statistical_tests(s_anguilla_2d, s_anguilla_3d)
with open('table1.tex', 'w') as f:
    latex_code = tabulate.tabulate(table1, tablefmt="latex_booktabs")
    f.write(latex_code)
table1

Unnamed: 0,Friedman,Friedman Aligned,Quand
(100+1)-MO-CMA-ES-I,6.27495,5.06557,3.80656
(100+1)-MO-CMA-ES-P,6.63352,5.65059,3.86611
(100+100)-MO-CMA-ES-I,4.48211,4.74055,2.87353
(100+100)-MO-CMA-ES-P,4.48211,4.75913,2.89338
NSGAII,3.22712,2.86012,2.19361
Statistic,5.17404,8.97805,2.32603
p-value,0.00138412,0.0616504,0.0684877


In [55]:
def compute_posthoc_tests(sg1, sg2):
    df = sg1.summary(useMedian=False)
    tmp = sg2.summary(useMedian=False)
    df.append(tmp)

    # Aligned Friedman's
    _, _, _, rankings_cmp = stac.friedman_aligned_ranks_test(*df.T.to_numpy())
    tmp = { opt:val for opt,val in zip(sg1.opts, rankings_cmp) }

    cols = ['p-value', 'adjusted p-value', 'z-value']
    
    comparisons, z_values, p_values, adj_p_values = stac.holm_multitest(tmp)
    out_df = pd.DataFrame(p_values, index=comparisons, columns=['unadjusted p-value'])
    out_df['Holm'] = adj_p_values

    _, _, _, adj_p_values = stac.hochberg_multitest(tmp)
    out_df['Hochberg'] = adj_p_values
    
    _, _, _, adj_p_values = stac.shaffer_multitest(tmp)
    out_df['Shaffer'] = adj_p_values
    
    return out_df

In [53]:
table2 = compute_posthoc_tests(s_shark_2d, s_shark_3d)
with open('table2.tex', 'w') as f:
    # caption: * - p < 0.05, ** - p < 0.01, *** - p < 0.001
    tmp = sp.sign_table(table2)
    latex_code = tabulate.tabulate(tmp, tablefmt="latex_booktabs")
    f.write(latex_code)
table2

5


Unnamed: 0,unadjusted p-value,Holm,Hochberg,Shaffer
(100+1)-MO-CMA-ES-I vs (100+100)-MO-CMA-ES-I,3.31189e-05,0.000331189,1.27158,0.000331189
(100+1)-MO-CMA-ES-I vs (100+100)-MO-CMA-ES-P,0.000235719,0.00212147,1.27158,0.00141432
(100+1)-MO-CMA-ES-I vs (100+1)-MO-CMA-ES-P,0.000886013,0.0070881,1.27158,0.00531608
(100+100)-MO-CMA-ES-I vs NSGAII,0.0136838,0.0957863,1.27158,0.0821025
(100+100)-MO-CMA-ES-P vs NSGAII,0.0463855,0.278313,1.27158,0.278313
(100+1)-MO-CMA-ES-I vs NSGAII,0.0919061,0.459531,1.27158,0.367625
(100+1)-MO-CMA-ES-P vs NSGAII,0.101214,0.459531,1.27158,0.404856
(100+1)-MO-CMA-ES-P vs (100+100)-MO-CMA-ES-I,0.408541,1.0,1.27158,1.0
(100+100)-MO-CMA-ES-I vs (100+100)-MO-CMA-ES-P,0.635791,1.0,1.27158,1.0
(100+1)-MO-CMA-ES-P vs (100+100)-MO-CMA-ES-P,0.724185,1.0,0.724185,1.0


In [54]:
# Null hypothesis was not rejected.
table3 = compute_posthoc_tests(s_anguilla_2d, s_anguilla_3d)
with open('table3.tex', 'w') as f:
    tmp = sp.sign_table(table3)
    latex_code = tabulate.tabulate(tmp, tablefmt="latex_booktabs")
    f.write(latex_code)
table3

5


Unnamed: 0,unadjusted p-value,Holm,Hochberg,Shaffer
(100+1)-MO-CMA-ES-P vs NSGAII,0.0052631,0.052631,2.23551,0.052631
(100+1)-MO-CMA-ES-I vs NSGAII,0.0274226,0.246803,2.23551,0.164536
(100+100)-MO-CMA-ES-P vs NSGAII,0.0575635,0.460508,2.23551,0.345381
(100+100)-MO-CMA-ES-I vs NSGAII,0.0600487,0.460508,2.23551,0.360292
(100+1)-MO-CMA-ES-P vs (100+100)-MO-CMA-ES-I,0.362803,1.0,2.23551,1.0
(100+1)-MO-CMA-ES-P vs (100+100)-MO-CMA-ES-P,0.372679,1.0,2.23551,1.0
(100+1)-MO-CMA-ES-I vs (100+1)-MO-CMA-ES-P,0.558531,1.0,2.23551,1.0
(100+1)-MO-CMA-ES-I vs (100+100)-MO-CMA-ES-I,0.745171,1.0,2.23551,1.0
(100+1)-MO-CMA-ES-I vs (100+100)-MO-CMA-ES-P,0.759269,1.0,1.51854,1.0
(100+100)-MO-CMA-ES-I vs (100+100)-MO-CMA-ES-P,0.985182,1.0,0.985182,1.0
