# Error Inspection
In this notebook we look at the errors encountered while executing the benchmark.
We do this for each task (regression, classification) and each budget (1h, and 4h when it is run).
In particular, we look at which evaluations have no result recorded (and why) and which evaluations have an error recorded (and why) to determine whether the issue is caused by us (and thus warrants a redo) or by the AutoML framework (in which case no further action is required).

### Setup

In [9]:
import itertools
import re

import pandas as pd
#import seaborn as sns

In [10]:
n_regression_jobs = 330
n_classification_jobs = 710

def missing_jobs_for(framework, results, all_jobs):
    fw = results[results["framework"] == framework]
    completed = set(fw[["task", "fold"]].itertuples(index=False, name=None))
    return all_jobs - completed

In [11]:
def is_timeout_error(info: str) -> bool:
    """ Test if the INFO message indicates it is a framework timeout. """
    return re.match(r"TimeoutError: Interrupting thread MainThread \[ident=\d+\] after \d+s timeout.", info) is not None

def is_memory_error(info: str) -> bool:
    """ Test for one of the error messages that point to a memory issue. """
    return re.match(r"NoResultError: could not allocate \d+ bytes", info) is not None

def errors_for_framework(framework, results):
    error_dataframe = results[~results["info"].isna()][["framework", "task", "fold", "info"]]
    fw = error_dataframe[error_dataframe["framework"] == framework]
    timeout_errors = len(fw[fw["info"].apply(is_timeout_error)])
    if timeout_errors > 0:
        print("Of the errors below,", timeout_errors, "are timeout errors.")
    memory_errors = len(fw[fw["info"].apply(is_memory_error)])
    if memory_errors > 0:
        print("Of the errors below,", memory_errors, "are memory errors.")
    return fw.groupby(["framework", "task"]).count()
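
As a quick sanity check of the two matchers, the sketch below restates their patterns in a self-contained form and runs them on made-up `info` strings. The sample messages and the `classify` helper are illustrative assumptions and are not taken from the results file. Note that `re.match` only matches at the start of the string, so a message with any prefix is labelled `other`.

```python
import re

# Same patterns as is_timeout_error / is_memory_error above.
TIMEOUT = re.compile(r"TimeoutError: Interrupting thread MainThread \[ident=\d+\] after \d+s timeout.")
MEMORY = re.compile(r"NoResultError: could not allocate \d+ bytes")

def classify(info: str) -> str:
    """Hypothetical helper: label an info message as timeout, memory, or other."""
    if TIMEOUT.match(info):
        return "timeout"
    if MEMORY.match(info):
        return "memory"
    return "other"

# Made-up messages that follow the patterns above.
samples = [
    "TimeoutError: Interrupting thread MainThread [ident=140000000000000] after 3600s timeout.",
    "NoResultError: could not allocate 1024 bytes",
    "NoResultError: Object of type float32 is not JSON serializable",
]
labels = [classify(s) for s in samples]
print(labels)
```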

The raw documents we load contain *all* experiments run for the task type and budget.
Some of those experiments failed to produce a result, e.g. due to CPU inactivity or due to a bug in the benchmarking framework.
When we find that we failed to produce a result because of a non-framework error, we run it again.
So, in the raw file, you can find two (or more) entries for a `(framework, task, fold)`-tuple, but only one should have a valid result.
Before we investigate incomplete results, we must first make sure we don't investigate results which have already been superseded by a valid result:

In [12]:
def filter_for_latest_results(results):
    results = results.sort_values(by="utc", na_position="first")
    # There was a mistake in the old KDDCup09-Upselling task, so it was replaced with a new task.
    results = results[results["id"] != "openml.org/t/360947"]
    # Use only the latest results (earlier failures don't count, only justified reruns are done)
    results = results.drop_duplicates(["framework", "task", "fold"], keep="last")
    return results
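
To illustrate the deduplication step, here is a minimal sketch on a toy frame. The rows, timestamps, and framework names `A` and `B` are invented for illustration; only the `sort_values` / `drop_duplicates(keep="last")` combination mirrors the function above.

```python
import pandas as pd

# Toy data: framework "A" has a failed first attempt and a later successful rerun.
results = pd.DataFrame({
    "framework": ["A", "A", "B"],
    "task": ["t", "t", "t"],
    "fold": [0, 0, 0],
    "utc": ["2021-11-16T23:41:05", "2021-11-17T10:59:55", "2021-11-16T23:41:05"],
    "result": [None, 0.9, 0.8],
})

# Sort chronologically, then keep only the most recent row per (framework, task, fold).
latest = (
    results.sort_values(by="utc", na_position="first")
    .drop_duplicates(["framework", "task", "fold"], keep="last")
)
print(latest)
```

Only the rerun of `A` survives, so the earlier failure is no longer counted as an incomplete result.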

**Table of Contents**:

*Subsections marked with an asterisk (*\**) are included for completeness, but the frameworks reported no failure for the given benchmark.*

[0. Setup](#Setup)

[1. Regression 1H8C](#1.-Regression-1h8c)

- [1.1 Missing Results](#1.1-Missing-Results)

- [1.2 Failed Results](#1.2-Failed-Results)

  - [1.2.1 AutoGluon\*](#1.2.1-AutoGluon)

  - [1.2.2 auto-sklearn](#1.2.2-Autosklearn)

  - [1.2.3 autoxgboost](#1.2.3-Autoxgboost)

  - [1.2.4 FLAML\*](#1.2.4-FLAML)
    
  - [1.2.5 GAMA\*](#1.2.5-GAMA)

  - [1.2.6 H2O\*](#1.2.6-H2O)

  - [1.2.7 LightAutoML](#1.2.7-LightAutoML)

  - [1.2.8 mljar-supervised](#1.2.8-MLJar)

  - [1.2.9 ML-Plan](#1.2.9-MLPlan)

  - [1.2.10 mlr3automl](#1.2.10-MLR3AutoML)

  - [1.2.11 TPOT](#1.2.11-TPOT)

  - [1.2.12 RandomForest\*](#1.2.12-RandomForest)

  - [1.2.13 TunedRandomForest\*](#1.2.13-TunedRandomForest)

[2. Classification 1H8C](#2.-Classification-1h8c)

- [2.1 Missing Results](#2.1-Missing-Results)

- [2.2 Failed Results](#2.2-Failed-Results)

  - [2.2.1 AutoGluon\*](#2.2.1-AutoGluon)

  - [2.2.2 auto-sklearn](#2.2.2-Autosklearn)

  - [2.2.3 autoxgboost](#2.2.3-Autoxgboost)

  - [2.2.4 FLAML](#2.2.4-FLAML)
    
  - [2.2.5 GAMA](#2.2.5-GAMA)

  - [2.2.6 H2O](#2.2.6-H2O)

  - [2.2.7 LightAutoML](#2.2.7-LightAutoML)

  - [2.2.8 mljar-supervised](#2.2.8-MLJar)

  - [2.2.9 ML-Plan](#2.2.9-MLPlan)

  - [2.2.10 mlr3automl](#2.2.10-MLR3AutoML)

  - [2.2.11 TPOT](#2.2.11-TPOT)

  - [2.2.12 RandomForest\*](#2.2.12-RandomForest)

  - [2.2.13 TunedRandomForest](#2.2.13-TunedRandomForest)

[3. Regression 4H8C](#3.-Regression-4h8c)

[4. Classification 4H8C](#4.-Classification-4h8c)

[5. Remarks](#5.-Remarks)


# 1. Regression 1h8c

In [8]:
regression = pd.read_csv(r"http://openml-test.win.tue.nl/amlb/latest/regression_1h8c.csv")
regression_jobs = set(itertools.product(regression["task"].unique(), regression["fold"].unique()))
regression.sample(3)

Unnamed: 0,id,task,framework,constraint,fold,type,result,metric,mode,version,...,duration,training_duration,predict_duration,models_count,seed,info,mae,r2,rmse,models_ensemble_count
2606,openml.org/t/359946,pol,H2OAutoML,1h8c_gp3,9,regression,-3.32883,neg_rmse,aws.docker,3.34.0.1,...,3610.6,3599.6,0.3,394.0,1831556068,,1.46114,0.9938,3.32883,
3771,openml.org/t/359948,SAT11-HAND-runtime-regression,AutoGluon_benchmark,1h8c_gp3,9,regression,-813.376,neg_rmse,aws.docker,0.3.1,...,3199.2,3184.7,11.7,22.0,1680684272,,375.428,0.865693,813.376,14.0
421,openml.org/t/359942,colleges,MLPlanWEKA,1h8c_gp3,2,regression,,neg_rmse,aws.docker,0.2.4,...,111.9,,,,1144218747,NoResultError: Command 'java -jar -Xmx29790M /...,,,,


In [9]:
regression = filter_for_latest_results(regression)

## 1.1 Missing Results
Missing results are those experiments which have no entry in the file at all.
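
The idea can be sketched without pandas as a set difference between the expected `(task, fold)` pairs and the pairs that have an entry. The task names are borrowed from the benchmark, but the folds and the recorded set are made up for illustration.

```python
import itertools

tasks = ["pol", "Yolanda"]
folds = range(3)  # toy example with 3 folds

# Every (task, fold) combination that should have been run.
all_jobs = set(itertools.product(tasks, folds))

# Pairs that actually have an entry in the (hypothetical) results file.
recorded = {("pol", 0), ("pol", 1), ("pol", 2), ("Yolanda", 0), ("Yolanda", 2)}

# Missing jobs are expected jobs without any entry.
missing = all_jobs - recorded
print(missing)
```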

In [10]:
missing_by_framework = (n_regression_jobs - regression.groupby("framework").count()["fold"])
missing_by_framework

framework
AutoGluon_benchmark          0
GAMA                         0
GAMA_benchmark               0
H2OAutoML                    0
MLPlanWEKA                   2
RandomForest                 0
TPOT                         0
TunedRandomForest            0
autosklearn                  0
autoxgboost                  0
flaml                        0
lightautoml                  0
mljarsupervised_benchmark    0
mlr3automl                   0
Name: fold, dtype: int64

AutoGluon_benchmark            1
GAMA_benchmark                 0
H2OAutoML                      0
TPOT                           2
TunedRandomForest              0
autosklearn                    0
flaml                          0
lightautoml                    0
mljarsupervised_benchmark    264
mlr3automl                     0
Name: fold, dtype: int64

In [15]:
missing_jobs_for("MLPlanWEKA", regression, regression_jobs)

Both jobs are missing from `mlplanweka.openml_s_269.1h8c_gp3.aws.20211203T113617`; the log says:
```
[WARNING] [amlb.runners.aws:16:38:54.716] WARN: Instance i-0325b7e530bcae84f (aws.openml_s_269.1h8c_gp3.airlines_depdelay_10m.8.mlplanweka) has no CPU activity in the last 30 minutes.
[WARNING] [amlb.runners.aws:19:02:54.769] WARN: Instance i-073e67b9ffe73d515 (aws.openml_s_269.1h8c_gp3.nyc-taxi-green-dec-2016.7.mlplanweka) has no CPU activity in the last 30 minutes.
```
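
Warnings like these can be turned into `(task, fold)` pairs mechanically. The sketch below assumes the job name always has the shape `aws.<benchmark>.<constraint>.<task>.<fold>.<framework>`, which holds for the two lines above but is not guaranteed in general; `parse_idle_warning` is a hypothetical helper.

```python
import re

IDLE = re.compile(r"Instance (i-[0-9a-f]+) \(([^)]+)\) has no CPU activity")

def parse_idle_warning(line):
    """Return (instance, task, fold, framework), or None if the line is no idle warning."""
    m = IDLE.search(line)
    if m is None:
        return None
    instance, job = m.groups()
    # The last two dot-separated fields are the fold and the framework.
    head, fold, framework = job.rsplit(".", 2)
    # The first three fields are 'aws', the benchmark, and the constraint.
    task = head.split(".", 3)[3]
    return instance, task, int(fold), framework

log = [
    "[WARNING] [amlb.runners.aws:16:38:54.716] WARN: Instance i-0325b7e530bcae84f (aws.openml_s_269.1h8c_gp3.airlines_depdelay_10m.8.mlplanweka) has no CPU activity in the last 30 minutes.",
    "[WARNING] [amlb.runners.aws:19:02:54.769] WARN: Instance i-073e67b9ffe73d515 (aws.openml_s_269.1h8c_gp3.nyc-taxi-green-dec-2016.7.mlplanweka) has no CPU activity in the last 30 minutes.",
]
idle_jobs = [parse_idle_warning(line) for line in log]
print(idle_jobs)
```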

## 1.2 Failed Results
The `info` field is only populated for jobs that failed.

In [11]:
regression_errors = regression[~regression["info"].isna()][["framework", "task", "fold", "info"]]

In [12]:
regression_errors.groupby(["framework"]).nunique()

Unnamed: 0_level_0,task,fold,info
framework,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MLPlanWEKA,12,10,61
TPOT,3,5,2
TunedRandomForest,2,10,1
autosklearn,1,1,1
autoxgboost,5,10,23
lightautoml,1,1,1
mljarsupervised_benchmark,1,8,1
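
A coarse first pass over the failures is to group the `info` messages by the exception name that precedes the first colon. The sketch below does this with the standard library only; `error_type` is a hypothetical helper and the messages are abbreviated examples of the kinds seen in this notebook.

```python
from collections import Counter

def error_type(info: str) -> str:
    """Return the text before the first colon, or 'Unknown' if there is none."""
    head, sep, _ = info.partition(":")
    return head.strip() if sep else "Unknown"

messages = [
    "NoResultError: Input contains infinity or a value too large for dtype('float32').",
    "CalledProcessError: Command 'Rscript --vanilla ...",
    "NoResultError: Object of type float32 is not JSON serializable",
]
counts = Counter(error_type(m) for m in messages)
print(counts)
```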


### 1.2.1 AutoGluon
No failures.

### 1.2.2 Autosklearn
**Failures**: 1

**Reruns required**: 0


In [18]:
regression_errors[(regression_errors["framework"] == "autosklearn") & (~regression_errors["info"].isnull()) & (~regression_errors["info"].apply(is_timeout_error))]

Unnamed: 0,framework,task,fold,info
1342,autosklearn,OnlineNewsPopularity,7,NoResultError: Input contains infinity or a va...


In [19]:
regression_errors[(regression_errors["framework"] == "autosklearn") & (~regression_errors["info"].isnull()) & (~regression_errors["info"].apply(is_timeout_error))]["info"].unique()

array(["NoResultError: Input contains infinity or a value too large for dtype('float32')."],
      dtype=object)

We find it's part of some validation during `predict`:
``` 
Traceback (most recent call last):
  File "/bench/frameworks/shared/callee.py", line 70, in call_run
    result = run_fn(ds, config)
  File "/bench/frameworks/autosklearn/exec.py", line 138, in run
    predictions = auto_sklearn.predict(X_test)
  File "/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/autosklearn/estimators.py", line 1108, in predict
    return super().predict(X, batch_size=batch_size, n_jobs=n_jobs)
  File "/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/autosklearn/estimators.py", line 485, in predict
    return self.automl_.predict(X, batch_size=batch_size, n_jobs=n_jobs)
  File "/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/autosklearn/automl.py", line 1430, in predict
    for identifier in self.ensemble_.get_selected_model_identifiers()
``` 

### 1.2.3 Autoxgboost
**Failures**: 29

**Reruns required**: 0

A mix of timeout errors and resource errors during execution.

In [13]:
errors_for_framework("autoxgboost", regression)

Of the errors below, 18 are timeout errors.


Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
autoxgboost,Airlines_DepDelay_10M,10,10
autoxgboost,Brazilian_houses,4,4
autoxgboost,Yolanda,10,10
autoxgboost,house_sales,1,1
autoxgboost,pol,4,4


The following errors were not timeout related:

In [14]:
regression_errors[(regression_errors["framework"] == "autoxgboost") & (~regression_errors["info"].isnull()) & (~regression_errors["info"].apply(is_timeout_error))]

Unnamed: 0,framework,task,fold,info
1913,autoxgboost,Brazilian_houses,0,CalledProcessError: Command 'Rscript --vanilla...
2297,autoxgboost,Brazilian_houses,1,CalledProcessError: Command 'Rscript --vanilla...
2252,autoxgboost,Airlines_DepDelay_10M,8,CalledProcessError: Command 'Rscript --vanilla...
2326,autoxgboost,Brazilian_houses,2,CalledProcessError: Command 'Rscript --vanilla...
2266,autoxgboost,house_sales,4,CalledProcessError: Command 'Rscript --vanilla...
2339,autoxgboost,Brazilian_houses,8,CalledProcessError: Command 'Rscript --vanilla...
2274,autoxgboost,pol,6,CalledProcessError: Command 'Rscript --vanilla...
2318,autoxgboost,pol,3,CalledProcessError: Command 'Rscript --vanilla...
4113,autoxgboost,pol,9,CalledProcessError: Command 'Rscript --vanilla...
4083,autoxgboost,pol,5,CalledProcessError: Command 'Rscript --vanilla...


In [16]:
regression_errors[(regression_errors["framework"] == "autoxgboost") & (~regression_errors["info"].isnull()) & (~regression_errors["info"].apply(is_timeout_error))]["info"].unique()

array(['CalledProcessError: Command \'Rscript --vanilla -e ".libPaths(\'/bench/frameworks/autoxgboost/lib\'); source(\'/bench/frameworks/autoxgboost/exec.R\'); run(\'/input/org/openml/www/datasets/42688/dataset_tr…',
       'CalledProcessError: Command \'Rscript --vanilla -e ".libPaths(\'/bench/frameworks/autoxgboost/lib\'); source(\'/bench/frameworks/autoxgboost/exec.R\'); run(\'/input/org/openml/www/datasets/42728/dataset_tr…',
       'CalledProcessError: Command \'Rscript --vanilla -e ".libPaths(\'/bench/frameworks/autoxgboost/lib\'); source(\'/bench/frameworks/autoxgboost/exec.R\'); run(\'/input/org/openml/www/datasets/42731/dataset_tr…',
       'CalledProcessError: Command \'Rscript --vanilla -e ".libPaths(\'/bench/frameworks/autoxgboost/lib\'); source(\'/bench/frameworks/autoxgboost/exec.R\'); run(\'/input/org/openml/www/datasets/201/dataset_trai…',
       'CalledProcessError: Command \'Rscript --vanilla -e ".libPaths(\'/bench/frameworks/autoxgboost/lib\'); source(\'/bench/framew

The truncated error messages are not very informative (there was some error while running the R script). We find:

 - `Airlines_DepDelay_10M.8` (`autoxgboost.openml_s_269.1h8c_gp3.aws.20211117T105955`): ran out of memory, exit status 1.
 - `Brazilian_houses.0` (`autoxgboost.openml_s_269.1h8c_gp3.aws.20211116T234105`): `killed` with exit status 137.
 - `Brazilian_houses.1` (`autoxgboost.openml_s_269.1h8c_gp3.aws.20211117T105955`): `killed` with exit status 137.
 - `Brazilian_houses.2` (`autoxgboost.openml_s_269.1h8c_gp3.aws.20211117T105955`): ran out of memory, exit status 1.
 - `Brazilian_houses.8` (`autoxgboost.openml_s_269.1h8c_gp3.aws.20211117T105955`): ran out of memory, exit status 1.
 - `house_sales.4` (`autoxgboost.openml_s_269.1h8c_gp3.aws.20211117T105955`): ran out of memory, exit status 1.
 - `pol.3` (`autoxgboost.openml_s_269.1h8c_gp3.aws.20211117T105955`): `killed` with exit status 137.
 - `pol.5` (`autoxgboost.openml_s_269.1h8c_gp3.aws.20211117T141648`): `killed` with exit status 137.
 - `pol.6` (`autoxgboost.openml_s_269.1h8c_gp3.aws.20211117T141648`): ran out of memory, exit status 1.
 - `pol.9` (`autoxgboost.openml_s_269.1h8c_gp3.aws.20211117T141648`): `killed` with exit status 137.
 - `Yolanda.2` (`autoxgboost.openml_s_269.1h8c_gp3.aws.20211117T141648`): ran out of memory, exit status 1.

E.g. `Brazilian_houses.0` was `killed`:
```
[mbo] 37: eta=0.193; gamma=9.22; max_depth=3; colsample_bytree=0.996; colsample_bylevel=0.671; lambda=0.00411; alpha=238; subsample=0.597 : y = 4.56e+05 : 0.2 secs : infill_cb
[mbo] 38: eta=0.197; gamma=0.0468; max_depth=20; colsample_bytree=0.674; colsample_bylevel=0.736; lambda=312; alpha=49.1; subsample=0.985 : y = 6.44e+05 : 58.6 secs : infill_cb
Killed

[ERROR] [amlb.benchmark:00:17:44.754] Command 'Rscript --vanilla -e ".libPaths('/bench/frameworks/autoxgboost/lib'); source('/bench/frameworks/autoxgboost/exec.R'); run('/input/org/openml/www/datasets/42688/dataset_train_0.arff', '/input/org/openml/www/datasets/42688/dataset_test_0.arff', target.index = 13, 'regression', '/output/predictions/Brazilian_houses/0/predictions.csv', 8, time.budget = 3600, meta_results_file='/output/meta_results.csv')"' returned non-zero exit status 137.
Traceback (most recent call last):
  File "/bench/amlb/benchmark.py", line 542, in run
    meta_result = self.benchmark.framework_module.run(self._dataset, task_config)
  File "/bench/frameworks/autoxgboost/__init__.py", line 10, in run
    return run(*args, **kwargs)
  File "/bench/frameworks/autoxgboost/exec.py", line 36, in run
    ), _live_output_=True)
  File "/bench/amlb/utils/process.py", line 245, in run_cmd
    raise e
  File "/bench/amlb/utils/process.py", line 232, in run_cmd
    preexec_fn=params.preexec_fn)
  File "/bench/amlb/utils/process.py", line 77, in run_subprocess
    raise subprocess.CalledProcessError(retcode, process.args, output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'Rscript --vanilla -e ".libPaths('/bench/frameworks/autoxgboost/lib'); source('/bench/frameworks/autoxgboost/exec.R'); run('/input/org/openml/www/datasets/42688/dataset_train_0.arff', '/input/org/openml/www/datasets/42688/dataset_test_0.arff', target.index = 13, 'regression', '/output/predictions/Brazilian_houses/0/predictions.csv', 8, time.budget = 3600, meta_results_file='/output/meta_results.csv')"' returned non-zero exit status 137.
```
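
By shell convention, an exit status above 128 means the process was terminated by signal `status - 128`, so 137 corresponds to signal 9 (`SIGKILL`). That is consistent with the operating system killing the process for using too much memory, although the exit status alone does not prove it. The hypothetical helper below decodes a status under that convention.

```python
import signal

def describe_exit_status(status: int) -> str:
    """Interpret an exit status using the 128 + signal-number convention."""
    if status == 0:
        return "success"
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with error code {status}"

print(describe_exit_status(137))
print(describe_exit_status(1))
```

Under this reading, the jobs with exit status 1 are cases where the R script itself reported the failure, and the jobs with exit status 137 were killed from the outside.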

### 1.2.4 FLAML
No failures.

### 1.2.5 GAMA
No failures.

### 1.2.6 H2O
No failures.

### 1.2.7 LightAutoML

**Failures**: 1

**Reruns required**: 0

Looks like an error caused by an edge case in the AutoML system.

In [17]:
errors_for_framework("lightautoml", regression)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
lightautoml,Santander_transaction_value,1,1


In [18]:
regression_errors[(regression_errors["framework"] == "lightautoml") & ~(regression_errors["info"]).isnull()]

Unnamed: 0,framework,task,fold,info
4247,lightautoml,Santander_transaction_value,5,NoResultError: Pipeline finished with 0 models...


The run seems to fail after a model crashes with `Input contains NaN, infinity or a value too large for dtype('float32').`, well before the 3600s budget is used up, which again suggests an edge case in the AutoML system:
```
===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====

Linear model: C = 1e-05 score = -102768654963623.77
...
ERROR:frameworks.shared.callee:Pipeline finished with 0 models for some reason.
Probably one or more models failed
Traceback (most recent call last):
  File "/bench/frameworks/shared/callee.py", line 70, in call_run
    result = run_fn(ds, config)
  File "/bench/frameworks/lightautoml/exec.py", line 38, in run
    automl.fit_predict(train_data=df_train, roles={'target': label})
  File "/bench/frameworks/lightautoml/venv/lib/python3.7/site-packages/lightautoml/addons/utilization/utilization.py", line 275, in fit_predict
    valid_data, valid_features)
  File "/bench/frameworks/lightautoml/venv/lib/python3.7/site-packages/lightautoml/automl/presets/tabular_presets.py", line 413, in fit_predict
    oof_pred = super().fit_predict(train, roles=roles, cv_iter=cv_iter, valid_data=valid_data)
  File "/bench/frameworks/lightautoml/venv/lib/python3.7/site-packages/lightautoml/automl/presets/base.py", line 173, in fit_predict
    result = super().fit_predict(train_data, roles, train_features, cv_iter, valid_data, valid_features)
  File "/bench/frameworks/lightautoml/venv/lib/python3.7/site-packages/lightautoml/automl/base.py", line 189, in fit_predict
    pipe_pred = ml_pipe.fit_predict(train_valid)
  File "/bench/frameworks/lightautoml/venv/lib/python3.7/site-packages/lightautoml/pipelines/ml/base.py", line 129, in fit_predict
    assert len(predictions) > 0, 'Pipeline finished with 0 models for some reason.\nProbably one or more models failed'
AssertionError: Pipeline finished with 0 models for some reason.
Probably one or more models failed
WARNING:amlb.utils.process:Terminating process psutil.Process(pid=79, name='python', status='sleeping', started='20:15:09').
...
Input contains NaN, infinity or a value too large for dtype('float32').
```

### 1.2.8 MLJar
**Failures**: 8

**Reruns required**: 0

One error type, raised by the framework itself, on one task but not every fold.

In [19]:
errors_for_framework("mljarsupervised_benchmark", regression)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
mljarsupervised_benchmark,QSAR-TID-11,8,8


In [21]:
regression_errors[regression_errors["framework"] == "mljarsupervised_benchmark"]["info"].unique()

array(['NoResultError: Object of type float32 is not JSON serializable'],
      dtype=object)

All errors for the task were identical. We do see that two folds for the same task ran to completion:

In [22]:
regression[(regression.framework == "mljarsupervised_benchmark") & (regression["info"].isnull()) & (regression["task"] == "QSAR-TID-11")]

Unnamed: 0,id,task,framework,constraint,fold,type,result,metric,mode,version,...,duration,training_duration,predict_duration,models_count,seed,info,mae,r2,rmse,models_ensemble_count
3461,openml.org/t/360932,QSAR-TID-11,mljarsupervised_benchmark,1h8c_gp3,3,regression,-0.750082,neg_rmse,aws.docker,0.11.0,...,3737.2,3615.5,111.1,59.0,1044029151,,0.527794,0.750313,0.750082,
3581,openml.org/t/360932,QSAR-TID-11,mljarsupervised_benchmark,1h8c_gp3,6,regression,-0.677109,neg_rmse,aws.docker,0.11.0,...,3735.3,3617.4,107.2,50.0,1044029154,,0.485104,0.808629,0.677109,


Seems like the error was indeed raised by mljar during the `fit` call:
```
13_Xgboost rmse 0.785149 trained in 56.99 seconds
There was an error during 13_Xgboost training.
Please check /output/results/QSAR-TID-11/1/errors.md for details.

[DEBUG] [amlb.utils.process:22:43:45.161] 2021-11-12 22:12:23,338 exec.py INFO
**** mljar-supervised [v0.11.0] ****

2021-11-12 22:43:44,308 frameworks.shared.callee ERROR Object of type float32 is not JSON serializable
Traceback (most recent call last):
  File "/bench/frameworks/shared/callee.py", line 70, in call_run
    result = run_fn(ds, config)
  File "/bench/frameworks/mljarsupervised/exec.py", line 57, in run
    automl.fit(X_train, y_train)
  File "/bench/frameworks/mljarsupervised/venv/lib/python3.7/site-packages/supervised/automl.py", line 337, in fit
    return self._fit(X, y, sample_weight, cv)
  File "/bench/frameworks/mljarsupervised/venv/lib/python3.7/site-packages/supervised/base_automl.py", line 1117, in _fit
    raise e
  File "/bench/frameworks/mljarsupervised/venv/lib/python3.7/site-packages/supervised/base_automl.py", line 1103, in _fit
    self.save_progress(step, generated_params)
  File "/bench/frameworks/mljarsupervised/venv/lib/python3.7/site-packages/supervised/base_automl.py", line 649, in save_progress
    fout.write(json.dumps(state, indent=4))
  File "/usr/lib/python3.7/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib/python3.7/json/encoder.py", line 201, in encode
    chunks = list(chunks)
  File "/usr/lib/python3.7/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.7/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.7/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.7/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/usr/lib/python3.7/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.7/json/encoder.py", line 438, in _iterencode
```
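
The message itself is easy to reproduce outside mljar-supervised: the standard `json` encoder does not know how to serialize NumPy's `float32` scalars. The sketch below shows the failure on a made-up `state` dictionary, together with one possible workaround (`default=float`). The workaround is an assumption for illustration only and says nothing about how mljar-supervised handles or should handle this.

```python
import json

import numpy as np

# Made-up stand-in for the progress state that is written during fit.
state = {"score": np.float32(0.785149)}

message = None
try:
    json.dumps(state)
except TypeError as exc:
    message = str(exc)
print(message)

# Possible workaround: convert unknown objects to a built-in float while encoding.
encoded = json.dumps(state, default=float)
print(encoded)
```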

### 1.2.9 MLPlan
Experiments suspended.

### 1.2.10 MLR3AutoML
**Failures**: 0

**Reruns required**: 0


In [25]:
errors_for_framework("mlr3automl", regression)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1


### 1.2.11 TPOT
**Failures**: 5

**Reruns required**: 0

Apart from one timeout, the errors are unusual, but ultimately most likely framework errors.

In [23]:
errors_for_framework("TPOT", regression)

Of the errors below, 1 are timeout errors.


Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
TPOT,Airlines_DepDelay_10M,3,3
TPOT,Buzzinsocialmedia_Twitter,1,1
TPOT,Yolanda,1,1


In [24]:
regression_errors[(regression_errors["framework"] == "TPOT") & (~regression_errors["info"].isnull()) & (~regression_errors["info"].apply(is_timeout_error))]

Unnamed: 0,framework,task,fold,info
1798,TPOT,Buzzinsocialmedia_Twitter,0,NoResultError: A pipeline has not yet been opt...
2049,TPOT,Airlines_DepDelay_10M,4,NoResultError: A pipeline has not yet been opt...
2147,TPOT,Airlines_DepDelay_10M,7,NoResultError: A pipeline has not yet been opt...
2142,TPOT,Yolanda,2,NoResultError: A pipeline has not yet been opt...


The log for `Yolanda.2` points to excessive memory usage, although the failure seems to occur almost instantly. The job exited prematurely (2200sec).
The same holds for `Airlines_DepDelay_10M.4` (260sec):

```
[INFO] [amlb.print:20:30:21.667] INFO:__main__:
[INFO] [amlb.print:20:31:24.979] **** TPOT [v0.11.7]****
[INFO] [amlb.print:20:31:24.981]
[INFO] [amlb.print:20:31:24.981] INFO:__main__:Running TPOT with a maximum time of 3600s on 8 cores, optimizing neg_mean_squared_error.
[INFO] [amlb.print:20:31:24.981] ERROR:frameworks.shared.callee:A pipeline has not yet been optimized. Please call fit() first.
[INFO] [amlb.print:20:31:30.007] Traceback (most recent call last):
[INFO] [amlb.print:20:31:30.007]   File "/bench/frameworks/TPOT/venv/lib/python3.7/site-packages/tpot/base.py", line 828, in fit
[INFO] [amlb.print:20:31:30.007]     log_file=self.log_file_,
[INFO] [amlb.print:20:31:30.007]   File "/bench/frameworks/TPOT/venv/lib/python3.7/site-packages/tpot/gp_deap.py", line 228, in eaMuPlusLambda
[INFO] [amlb.print:20:31:30.007]     population[:] = toolbox.evaluate(population)
[INFO] [amlb.print:20:31:30.007]   File "/bench/frameworks/TPOT/venv/lib/python3.7/site-packages/tpot/base.py", line 1572, in _evaluate_individuals
[INFO] [amlb.print:20:31:30.007]     chunk_idx : chunk_idx + chunk_size
[INFO] [amlb.print:20:31:30.007]   File "/bench/frameworks/TPOT/venv/lib/python3.7/site-packages/joblib/parallel.py", line 1056, in __call__
[INFO] [amlb.print:20:31:30.007]     self.retrieve()
[INFO] [amlb.print:20:31:30.008]   File "/bench/frameworks/TPOT/venv/lib/python3.7/site-packages/joblib/parallel.py", line 935, in retrieve
[INFO] [amlb.print:20:31:30.008]     self._output.extend(job.get(timeout=self.timeout))
[INFO] [amlb.print:20:31:30.008]   File "/bench/frameworks/TPOT/venv/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
[INFO] [amlb.print:20:31:30.008]     return future.result(timeout=timeout)
[INFO] [amlb.print:20:31:30.008]   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
[INFO] [amlb.print:20:31:30.008]     return self.__get_result()
[INFO] [amlb.print:20:31:30.008]   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
[INFO] [amlb.print:20:31:30.008]     raise self._exception
[INFO] [amlb.print:20:31:30.008] joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
[INFO] [amlb.print:20:31:30.008]
[INFO] [amlb.print:20:31:30.009] The exit codes of the workers are {SIGKILL(-9)}
[INFO] [amlb.print:20:31:30.009]
[INFO] [amlb.print:20:31:30.009] During handling of the above exception, another exception occurred:
[INFO] [amlb.print:20:31:30.009]
[INFO] [amlb.print:20:31:30.009] Traceback (most recent call last):
[INFO] [amlb.print:20:31:30.009]   File "/bench/frameworks/shared/callee.py", line 70, in call_run
[INFO] [amlb.print:20:31:30.009]     result = run_fn(ds, config)
[INFO] [amlb.print:20:31:30.009]   File "/bench/frameworks/TPOT/exec.py", line 63, in run
[INFO] [amlb.print:20:31:30.009]     tpot.fit(X_train, y_train)
[INFO] [amlb.print:20:31:30.009]   File "/bench/frameworks/TPOT/venv/lib/python3.7/site-packages/tpot/base.py", line 863, in fit
[INFO] [amlb.print:20:31:30.009]     raise e
[INFO] [amlb.print:20:31:30.009]   File "/bench/frameworks/TPOT/venv/lib/python3.7/site-packages/tpot/base.py", line 854, in fit
[INFO] [amlb.print:20:31:30.009]     self._update_top_pipeline()
[INFO] [amlb.print:20:31:30.009]   File "/bench/frameworks/TPOT/venv/lib/python3.7/site-packages/tpot/base.py", line 962, in _update_top_pipeline
[INFO] [amlb.print:20:31:30.010]     "A pipeline has not yet been optimized. Please call fit() first."
[INFO] [amlb.print:20:31:30.010] RuntimeError: A pipeline has not yet been optimized. Please call fit() first.
[INFO] [amlb.print:20:31:30.010] WARNING:amlb.utils.process:Terminating process psutil.Process(pid=157, name='python', status='sleeping', started='20:30:29').
[INFO] [amlb.print:20:31:30.010] WARNING:amlb.utils.process:Killing process psutil.Process(pid=157, name='python', status='sleeping', started='20:30:29').
[INFO] [amlb.print:20:31:31.045]
[INFO] [amlb.print:20:31:31.045]
[INFO] [amlb.print:20:31:31.045]
[DEBUG] [amlb.utils.process:20:31:31.046] INFO:__main__:
**** TPOT [v0.11.7]****

INFO:__main__:Running TPOT with a maximum time of 3600s on 8 cores, optimizing neg_mean_squared_error.
ERROR:frameworks.shared.callee:A pipeline has not yet been optimized. Please call fit() first.
Traceback (most recent call last):
  File "/bench/frameworks/TPOT/venv/lib/python3.7/site-packages/tpot/base.py", line 828, in fit
    log_file=self.log_file_,
  File "/bench/frameworks/TPOT/venv/lib/python3.7/site-packages/tpot/gp_deap.py", line 228, in eaMuPlusLambda
    population[:] = toolbox.evaluate(population)
  File "/bench/frameworks/TPOT/venv/lib/python3.7/site-packages/tpot/base.py", line 1572, in _evaluate_individuals
    chunk_idx : chunk_idx + chunk_size
  File "/bench/frameworks/TPOT/venv/lib/python3.7/site-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File "/bench/frameworks/TPOT/venv/lib/python3.7/site-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/bench/frameworks/TPOT/venv/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
```

There is no open issue on the TPOT issue tracker ([`TerminatedWorkerError`](https://github.com/EpistasisLab/tpot/issues?q=is%3Aissue+is%3Aopen+TerminatedWorkerError+)). It looks like joblib subprocesses are terminated unexpectedly and TPOT does not account for this possibility. It may then try to finish the search and fit the best pipeline, which fails because no pipeline has finished evaluation. That is only a guess based on the stack trace.

### 1.2.12 RandomForest
No failures.

### 1.2.13 TunedRandomForest

No failures.

# 2. Classification 1h8c

In [40]:
classification = pd.read_csv(r"http://openml-test.win.tue.nl/amlb/latest/classification_1h8c.csv")
classification_jobs = set(itertools.product(classification["task"].unique(), classification["fold"].unique()))
classification.sample(3)

Unnamed: 0,id,task,framework,constraint,fold,type,result,metric,mode,version,...,training_duration,predict_duration,models_count,seed,info,acc,balacc,logloss,auc,models_ensemble_count
6696,openml.org/t/359987,shuttle,RandomForest,1h8c_gp3,3,multiclass,-0.000628,neg_logloss,aws.docker,0.24.2,...,10.0,0.7,2000.0,815597336,,0.999828,0.857143,0.000628,,
7275,openml.org/t/3945,KDDCup09_appetency,mljarsupervised_benchmark,1h8c_gp3,7,binary,0.85226,auc,aws.docker,0.11.0,...,3647.2,451.6,40.0,1835214307,,0.9814,0.527174,0.072323,0.85226,
6754,openml.org/t/168868,APSFailure,RandomForest,1h8c_gp3,9,binary,0.992451,auc,aws.docker,0.24.2,...,277.8,0.8,2000.0,815597342,,0.995132,0.894391,0.021782,0.992451,


In [41]:
classification = filter_for_latest_results(classification)
grouped_by_fw_task = classification.groupby(["framework", "task"]).count()["fold"]

## 2.1 Missing Results

In [42]:
missing_by_framework = (n_classification_jobs - classification.groupby("framework").count()["fold"])
missing_by_framework

framework
AutoGluon_benchmark           0
GAMA_benchmark                0
H2OAutoML                     0
MLPlanWEKA                   24
RandomForest                  0
TPOT                          0
TunedRandomForest             0
autosklearn                   0
autosklearn2                  0
autoxgboost                  10
flaml                         0
lightautoml                   0
mljarsupervised_benchmark     1
mlr3automl                    0
Name: fold, dtype: int64

Why are results missing?

 - **autoxgboost (10):** Forgot to run upselling. **redo**
 - **mljar (1):** Results not uploaded to bucket **?**
 


### autosklearn

In [13]:
grouped_by_fw_task["autosklearn"][grouped_by_fw_task["autosklearn"] != 10]

Series([], Name: fold, dtype: int64)

### autosklearn 2

In [17]:
grouped_by_fw_task["autosklearn2"][grouped_by_fw_task["autosklearn2"] != 10]

Series([], Name: fold, dtype: int64)

Reran some `Upselling` folds as they were terminated by our CPU monitor.
```
0   0 days 01:31:39
6   0 days 01:05:56
9   0 days 00:55:38
```

### autoxgboost

In [108]:
grouped_by_fw_task["autoxgboost"][grouped_by_fw_task["autoxgboost"] != 10]

Series([], Name: fold, dtype: int64)

Forgot to run `Upselling`.

### GAMA

In [18]:
grouped_by_fw_task["GAMA_benchmark"][grouped_by_fw_task["GAMA_benchmark"] != 10]

Series([], Name: fold, dtype: int64)

Reran some `Upselling` folds as they were terminated by our CPU monitor.
Old message: `Upselling` failed on folds `0, 1, 2, 4, 5, 8` because the instance was terminated after 30 minutes without CPU activity, less than 3600 seconds after control was handed to GAMA. The latest termination was at ~4300 seconds, while control is handed to GAMA ~1500 seconds in, so GAMA had used only ~2800 seconds of its 3600-second budget. While the CPU inactivity is most likely caused by a bug in GAMA, it should be unrelated to GAMA's ability to terminate "successfully". I fully expect it to finish `train` within 3600 seconds and to try to produce predictions (which will either work or fail quickly).

 

### MLJar

In [110]:
grouped_by_fw_task["mljarsupervised_benchmark"][grouped_by_fw_task["mljarsupervised_benchmark"] != 10]

task
dionis    9
Name: fold, dtype: int64

Exceeded the time limit three times (terminated after 7500sec). Memory usage 100%.

### MLPlanWEKA
To be revisited after consultation with the authors.

### TPOT
Instances aborted due to CPU inactivity *after* the time budget has expired.

In [19]:
grouped_by_fw_task["TPOT"][grouped_by_fw_task["TPOT"] != 10]

Series([], Name: fold, dtype: int64)

Reran some jobs as they were terminated by our CPU monitor.

Let's first look at the problem cases, starting with `amazon-commerce-reviews.5`:
```
[INFO] [amlb.runners.aws:21:34:42.691] [2021-11-26T20:34:42] checking job aws.openml_s_271.1h8c_gp3.amazon-commerce-reviews.5.TPOT on instance i-0c81d19d8ed8dc27b: running [16].
[WARNING] [amlb.runners.aws:21:34:43.650] WARN: Instance i-0c81d19d8ed8dc27b (aws.openml_s_271.1h8c_gp3.amazon-commerce-reviews.5.tpot) has no CPU activity in the last 30 minutes.
[ERROR] [amlb.runners.aws:21:35:13.085] Job aws.openml_s_271.1h8c_gp3.amazon-commerce-reviews.5.TPOT failed with: Aborting instance i-0c81d19d8ed8dc27b for job aws.openml_s_271.1h8c_gp3.amazon-commerce-reviews.5.TPOT.
          'aws.openml_s_271.1h8c_gp3.amazon-commerce-reviews.5.TPOT.',
[INFO] [amlb.job:21:35:13.107] Job `aws.openml_s_271.1h8c_gp3.amazon-commerce-reviews.5.TPOT` executed in 6876.356 seconds.
```
It seems the job was already exceeding its runtime budget by a large margin (6876 seconds in total) before it froze.

`kddcup99.7` is another case of inactivity after exceeding the time limit:
```
pieternwo@openml:/openml_db/automlbenchmark/classification_1h8c/tpot.openml_s_271.1h8c_gp3.aws.20211126T183739$ cat logs/runbenchmark.20211126T183739.log | grep kddcup99.7
[WARNING] [amlb.runners.aws:05:13:34.476] WARN: Instance i-0d8be38b159e7c5a4 (aws.openml_s_271.1h8c_gp3.kddcup99.7.tpot) has no CPU activity in the last 30 minutes.
[INFO] [amlb.runners.aws:05:13:18.737] [2021-11-27T04:13:18] checking job aws.openml_s_271.1h8c_gp3.KDDCup99.7.TPOT on instance i-0d8be38b159e7c5a4: running [16].
[ERROR] [amlb.runners.aws:05:13:48.961] Job aws.openml_s_271.1h8c_gp3.KDDCup99.7.TPOT failed with: Aborting instance i-0d8be38b159e7c5a4 for job aws.openml_s_271.1h8c_gp3.KDDCup99.7.TPOT.
          'aws.openml_s_271.1h8c_gp3.KDDCup99.7.TPOT.',
[INFO] [amlb.job:05:13:48.979] Job `aws.openml_s_271.1h8c_gp3.KDDCup99.7.TPOT` executed in 5044.440 seconds.
```
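
To compare the durations in such log lines against the budget without reading them by hand, the `executed in` lines can be parsed. This is a rough sketch: the helper name `overrun_seconds` is made up for illustration, and the default budget of 3600 seconds is an assumption based on the `1h` constraint. The reported duration presumably covers the whole job rather than only training, so the overrun is only an indication.

```python
import re

# Pattern for the "executed in" lines shown in the logs above.
EXECUTED = re.compile(
    r"Job `(?P<job>[^`]+)` executed in (?P<seconds>\d+(?:\.\d+)?) seconds"
)

def overrun_seconds(log_line: str, budget: float = 3600.0):
    """Return (job, seconds over budget), or None if the line has no duration."""
    match = EXECUTED.search(log_line)
    if match is None:
        return None
    return match["job"], float(match["seconds"]) - budget

line = (
    "[INFO] [amlb.job:05:13:48.979] Job `aws.openml_s_271.1h8c_gp3.KDDCup99.7.TPOT` "
    "executed in 5044.440 seconds."
)
job, over = overrun_seconds(line)
print(job, round(over, 1))
```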

## 2.2 Failed Results

In [32]:
classification_errors = classification[~classification["info"].isna()][["framework", "task", "fold", "info"]]

### 2.2.1 AutoGluon

In [39]:
errors_for_framework("AutoGluon_benchmark", classification_errors)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1


### 2.2.2 Autosklearn
**Failures**: 1

**Reruns required**: 0

In [40]:
errors_for_framework("autosklearn", classification_errors)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
autosklearn,KDDCup09-Upselling,1,1


In [35]:
askl = classification_errors[(classification_errors["framework"] == "autosklearn") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])
askl

Unnamed: 0,framework,task,fold,info
3567,autosklearn,KDDCup09-Upselling,4,NoResultError:


Memory Error

### Autosklearn 2

In [41]:
errors_for_framework("autosklearn2", classification_errors)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1


### 2.2.3 AutoXGBoost

**Failures**: 98

**Reruns required**: 20 ? 

A lot of memory errors, some errors that arise due to numerical issues (?), and two datasets affected by what seems like a bug in data loading and/or processing of the train/test sets. Asked Janek to look into this.

In [42]:
errors_for_framework("autoxgboost", classification_errors)

Of the errors below, 67 are timeout errors.


Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
autoxgboost,Click_prediction_small,1,1
autoxgboost,Fashion-MNIST,8,8
autoxgboost,KDDCup09_appetency,8,8
autoxgboost,KDDCup99,10,10
autoxgboost,albert,10,10
autoxgboost,covertype,10,10
autoxgboost,dionis,10,10
autoxgboost,helena,10,10
autoxgboost,okcupid-stem,10,10
autoxgboost,porto-seguro,1,1


In [119]:
axgb = classification_errors[(classification_errors["framework"] == "autoxgboost") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])
axgb

Unnamed: 0,framework,task,fold,info
2234,autoxgboost,Click_prediction_small,8,CalledProcessError: Command 'Rscript --vanilla...
2469,autoxgboost,KDDCup09_appetency,0,CalledProcessError: Command 'Rscript --vanilla...
2235,autoxgboost,KDDCup09_appetency,1,CalledProcessError: Command 'Rscript --vanilla...
2625,autoxgboost,KDDCup09_appetency,3,CalledProcessError: Command 'Rscript --vanilla...
2461,autoxgboost,KDDCup09_appetency,4,CalledProcessError: Command 'Rscript --vanilla...
2546,autoxgboost,KDDCup09_appetency,5,CalledProcessError: Command 'Rscript --vanilla...
2591,autoxgboost,KDDCup09_appetency,6,CalledProcessError: Command 'Rscript --vanilla...
2294,autoxgboost,KDDCup09_appetency,7,CalledProcessError: Command 'Rscript --vanilla...
2454,autoxgboost,KDDCup09_appetency,9,CalledProcessError: Command 'Rscript --vanilla...
2540,autoxgboost,KDDCup99,9,CalledProcessError: Command 'Rscript --vanilla...


In the regression we saw that it encountered similar errors (exit code 1 and 137), both of which seemed to stem from memory issues. For this reason, I will only sample one result per dataset:
 - `Click-prediction.8`: Not a clear sign of memory issues:
 ```
 Error in chol.default(R) :
  the leading minor of order 1 is not positive definite
Calls: run ... km1Nugget.init -> apply -> FUN -> chol -> chol.default
In addition: Warning message:
In runif(n = ninit, min = 1/2 * angle.init, max = min(3/2 * angle.init,  :
  NAs produced
Timing stopped at: 258.6 27.09 267
Execution halted
[ERROR] [amlb.benchmark:17:19:47.879] Command 'Rscript --vanilla -e ".libPaths('/bench/frameworks/autoxgboost/lib'); source('/bench/frameworks/autoxgboost/exec.R'); run('/input/org/openml/www/datasets/42733/dataset_train_8.arff', '/input/org/openml/www/datasets/42733/dataset_test_8.arff', target.index = 1, 'classification', '/output/predictions/Click_prediction_small/8/predictions.csv', 8, time.budget = 3600, meta_results_file='/output/meta_results.csv')"' returned non-zero exit status 1.
 ```
 - `KDDCup09-Appetency.0`: Same error.
 - `KDDCup09-Appetency.1`: Another internal error.
 ```
 Error in chol.default(R) :
  the leading minor of order 1 is not positive definite
Calls: run ... km1Nugget.init -> apply -> FUN -> chol -> chol.default
In addition: Warning message:
In runif(n = ninit, min = 1/2 * angle.init, max = min(3/2 * angle.init,  :
  NAs produced
Timing stopped at: 258.6 27.09 267
Execution halted
[ERROR] [amlb.benchmark:17:19:47.879] Command 'Rscript --vanilla -e ".libPaths('/bench/frameworks/autoxgboost/lib'); source('/bench/frameworks/autoxgboost/exec.R'); run('/input/org/openml/www/datasets/42733/dataset_train_8.arff', '/input/org/openml/www/datasets/42733/dataset_test_8.arff', target.index = 1, 'classification', '/output/predictions/Click_prediction_small/8/predictions.csv', 8, time.budget = 3600, meta_results_file='/output/meta_results.csv')"' returned non-zero exit status 1.
 ```
 - `KDDCup09-Appetency.3`: Same `not positive definite` error.
 - Skipping remainder of `KDDCup09-Appetency`.
 - `KDDCup99.9`: Internal error.
 ```
 Parse with reader=readr : /input/org/openml/www/datasets/42746/dataset_train_9.arff
Loading required package: readr
header: 0.034000; preproc: 1.847000; data: 14.380000; postproc: 0.761000; total: 17.022000
Warning in makeTask(type = type, data = data, weights = weights, blocking = blocking,  :
  Empty factor levels were dropped for columns: service
Error in assertPropertiesOk(present.properties, allowed.properties, whichfun,  :
  Data returned by CPO trafo has property missings that impact.encode.classif did not declare in .properties.needed.
Calls: run ... checkOutputProperties -> assertPropertiesOk -> stopf
Timing stopped at: 106.1 7.325 113.4
Execution halted
[ERROR] [amlb.benchmark:16:26:03.447] Command 'Rscript --vanilla -e ".libPaths('/bench/frameworks/autoxgboost/lib'); source('/bench/frameworks/autoxgboost/exec.R'); run('/input/org/openml/www/datasets/42746/data
set_train_9.arff', '/input/org/openml/www/datasets/42746/dataset_test_9.arff', target.index = 42, 'classification', '/output/predictions/KDDCup99/9/predictions.csv', 8, time.budget = 3600, meta_results_file='/ou
tput/meta_results.csv')"' returned non-zero exit status 1.
 ```
 - `okcupid-stem.0`: Internal error. Since it seems to be a data issue (same as `sf-police-incidents`), skipping the remainder of `okcupid`.
 - `portoseguro.5`: Same `not positive definite` error.
 - `sf-police-incidents.0`: Looks like the same data issue as `okcupid`:
 ```
 Error in names(x) <- value :
  'names' attribute [9] must be the same length as the vector [1]
Calls: run -> <Anonymous> -> colnames<-
In addition: Warning message:
One or more parsing issues, see `problems()` for details
Execution halted
[ERROR] [amlb.benchmark:16:04:13.220] Command 'Rscript --vanilla -e ".libPaths('/bench/frameworks/autoxgboost/lib'); source('/bench/frameworks/autoxgboost/exec.R'); run('/input/org/openml/www/datasets/42732/data
set_train_0.arff', '/input/org/openml/www/datasets/42732/dataset_test_0.arff', target.index = 9, 'classification', '/output/predictions/sf-police-incidents/0/predictions.csv', 8, time.budget = 3600, meta_results
_file='/output/meta_results.csv')"' returned non-zero exit status 1.
 ```

### 2.2.4 FLAML

**Failures**: 18

**Reruns required**: 0

Framework errors (memory)

In [43]:
errors_for_framework("flaml", classification_errors)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
flaml,KDDCup09-Upselling,10,10
flaml,dionis,1,1
flaml,guillermo,6,6
flaml,riccardo,1,1


In [44]:
flaml = classification_errors[(classification_errors["framework"] == "flaml") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])
flaml

Unnamed: 0,framework,task,fold,info
2140,flaml,KDDCup09-Upselling,0,NoResultError: Unable to allocate 4.44 GiB for...
2135,flaml,KDDCup09-Upselling,1,NoResultError: Unable to allocate 4.44 GiB for...
2136,flaml,KDDCup09-Upselling,2,NoResultError: Unable to allocate 4.44 GiB for...
2138,flaml,KDDCup09-Upselling,3,NoResultError: Unable to allocate 4.44 GiB for...
2137,flaml,KDDCup09-Upselling,4,NoResultError: Unable to allocate 4.44 GiB for...
2139,flaml,KDDCup09-Upselling,5,NoResultError: Unable to allocate 4.44 GiB for...
2142,flaml,KDDCup09-Upselling,6,NoResultError: Unable to allocate 4.44 GiB for...
2134,flaml,KDDCup09-Upselling,7,NoResultError: Unable to allocate 4.44 GiB for...
2141,flaml,KDDCup09-Upselling,8,NoResultError: Unable to allocate 4.44 GiB for...
2133,flaml,KDDCup09-Upselling,9,NoResultError: Unable to allocate 4.44 GiB for...


In [45]:
flaml["info"].unique()

array(['NoResultError: Unable to allocate 4.44 GiB for an array with shape (45000, 13237) and data type int64',
       'NoResultError: Unable to allocate 4.44 GiB for an array with shape (45000, 13251) and data type int64',
       'NoResultError: Unable to allocate 4.44 GiB for an array with shape (45000, 13236) and data type int64',
       'NoResultError: Unable to allocate 4.44 GiB for an array with shape (45000, 13242) and data type int64',
       'NoResultError: Unable to allocate 4.44 GiB for an array with shape (45000, 13238) and data type int64',
       'NoResultError: Unable to allocate 4.44 GiB for an array with shape (45000, 13235) and data type int64',
       'NoResultError: Unable to allocate 4.44 GiB for an array with shape (45000, 13229) and data type int64',
       'NoResultError: Unable to allocate 4.44 GiB for an array with shape (45000, 13231) and data type int64',
       'NoResultError: Unable to allocate 4.44 GiB for an array with shape (45000, 13228) and data type 
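
The size reported in these messages follows from the array shape and the data type. They are also worded differently from the `could not allocate ... bytes` pattern that `is_memory_error` in the setup looks for. As a quick sanity check (a sketch, not part of the original analysis), the hypothetical helper `requested_gib` below parses such a message and recomputes the size, assuming 8 bytes per `int64` or `float64` element.

```python
import math
import re

# Bytes per element for the dtypes that appear in the messages in this notebook.
ITEMSIZE = {"int64": 8, "float64": 8}

ALLOC = re.compile(r"with shape \((?P<shape>[\d, ]+)\) and data type (?P<dtype>\w+)")

def requested_gib(info: str) -> float:
    """Recompute the requested allocation size in GiB from the message text."""
    match = ALLOC.search(info)
    if match is None:
        raise ValueError("not an 'Unable to allocate' message")
    dims = [int(d) for d in match["shape"].split(",") if d.strip()]
    n_bytes = math.prod(dims) * ITEMSIZE[match["dtype"]]
    return n_bytes / 2**30

msg = (
    "NoResultError: Unable to allocate 4.44 GiB for an array with shape "
    "(45000, 13237) and data type int64"
)
# 45000 * 13237 elements at 8 bytes each, expressed in GiB.
print(round(requested_gib(msg), 2))
```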

 - `dionis.5`: `killed` and exit code 137.
 - `guillermo`: `killed` (137) and `segfault` (139).
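
Exit statuses above 128 conventionally mean that the process was ended by a signal, with the signal number being the status minus 128. A status of 137 therefore corresponds to `SIGKILL`, which is often what the operating system sends when memory runs out, and 139 to `SIGSEGV`. The sketch below decodes the statuses that appear in this notebook; `describe_exit_status` is a hypothetical helper name, and the signal names assume a POSIX system.

```python
import signal

def describe_exit_status(status: int) -> str:
    """Map a shell-style exit status above 128 to the signal that ended the process."""
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            return f"unknown signal {status - 128}"
    return f"exit code {status}"

for status in (134, 137, 139):
    print(status, describe_exit_status(status))
```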

### 2.2.5 GAMA
**Failures**: 37

**Reruns required**: 0

Errors due to a bug processing target labels (helena, wine, yeast) and memory (KDDCup). 

In [46]:
errors_for_framework("GAMA_benchmark", classification_errors)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
GAMA_benchmark,KDDCup09-Upselling,10,10
GAMA_benchmark,KDDCup99,10,10
GAMA_benchmark,helena,7,7
GAMA_benchmark,wine-quality-white,5,5
GAMA_benchmark,yeast,5,5


In [116]:
gama = classification_errors[(classification_errors["framework"] == "GAMA_benchmark") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])
gama

Unnamed: 0,framework,task,fold,info
4995,GAMA_benchmark,KDDCup09-Upselling,3,NoResultError: 'NoneType' object is not iterable
4994,GAMA_benchmark,KDDCup09-Upselling,6,NoResultError: 'NoneType' object is not iterable
4997,GAMA_benchmark,KDDCup09-Upselling,7,NoResultError: 'NoneType' object is not iterable
4996,GAMA_benchmark,KDDCup09-Upselling,9,NoResultError: 'NoneType' object is not iterable
3702,GAMA_benchmark,KDDCup99,0,NoResultError: 'NoneType' object is not iterable
4247,GAMA_benchmark,KDDCup99,1,NoResultError: 'NoneType' object is not iterable
3765,GAMA_benchmark,KDDCup99,2,NoResultError: 'NoneType' object is not iterable
3635,GAMA_benchmark,KDDCup99,3,NoResultError: 'NoneType' object is not iterable
3749,GAMA_benchmark,KDDCup99,4,NoResultError: The least populated class in y ...
4075,GAMA_benchmark,KDDCup99,5,NoResultError: The least populated class in y ...


Most errors are `NoResultError: 'NoneType' object is not iterable.`, which normally points to an error during the ensembling process (though previously it did not occur this often). Let's have a closer look to verify:
 - `KDDCup99.0`: Unable to evaluate pipelines due to timeout, memory, and a mismatch in predicted vs actual labels. 
 - `helena.1`: The cache was also saved. Was able to store some pipelines. A `MemoryError` occurred while retrieving an evaluation from a subprocess.
 - `wine-quality-white.3`: No results due to the label mismatch error.
 - `yeast`: No results due to label mismatch error.

### 2.2.6 H2O
All timeout errors

In [47]:
errors_for_framework("H2OAutoML", classification)

Of the errors below, 9 are timeout errors.


Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
H2OAutoML,KDDCup99,10,10


In [130]:
h2o = classification_errors[(classification_errors["framework"] == "H2OAutoML") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])
h2o

Unnamed: 0,framework,task,fold,info
9043,H2OAutoML,KDDCup99,4,NoResultError: Interrupting thread MainThread ...


In [71]:
h2o.iloc[0]["info"]

'NoResultError: Interrupting thread MainThread [ident=139723361433408] after 7200s timeout.'
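
This message is a timeout as well, but it starts with `NoResultError` instead of `TimeoutError`, which is presumably why `is_timeout_error` from the setup does not count it. A broader matcher could accept both prefixes. The sketch below is a hypothetical variant; the name `is_timeout_like` does not exist in the notebook.

```python
import re

# Accept either prefix seen in this notebook before the shared message body.
TIMEOUT = re.compile(
    r"(?:TimeoutError|NoResultError): Interrupting thread MainThread "
    r"\[ident=\d+\] after \d+s timeout\."
)

def is_timeout_like(info: str) -> bool:
    """Hypothetical broader variant of `is_timeout_error` from the setup."""
    return TIMEOUT.match(info) is not None

message = (
    "NoResultError: Interrupting thread MainThread "
    "[ident=139723361433408] after 7200s timeout."
)
print(is_timeout_like(message))
```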

### 2.2.7 LightAutoML

**Failures**: 43

**Reruns required**: 0

These seem to be framework errors: mostly memory issues, plus some label mismatch errors.

In [48]:
errors_for_framework("lightautoml", classification)

Of the errors below, 2 are timeout errors.


Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
lightautoml,APSFailure,5,5
lightautoml,KDDCup09_appetency,6,6
lightautoml,KDDCup99,10,10
lightautoml,bank-marketing,1,1
lightautoml,dionis,10,10
lightautoml,nomao,1,1
lightautoml,wine-quality-white,5,5
lightautoml,yeast,5,5


In [121]:
lama = classification_errors[(classification_errors["framework"] == "lightautoml") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])
lama

Unnamed: 0,framework,task,fold,info
543,lightautoml,APSFailure,2,CalledProcessError: Command '/bench/frameworks...
379,lightautoml,APSFailure,3,CalledProcessError: Command '/bench/frameworks...
670,lightautoml,APSFailure,7,CalledProcessError: Command '/bench/frameworks...
138,lightautoml,APSFailure,9,CalledProcessError: Command '/bench/frameworks...
13,lightautoml,KDDCup09_appetency,0,CalledProcessError: Command '/bench/frameworks...
430,lightautoml,KDDCup09_appetency,1,CalledProcessError: Command '/bench/frameworks...
577,lightautoml,KDDCup09_appetency,3,CalledProcessError: Command '/bench/frameworks...
408,lightautoml,KDDCup09_appetency,4,CalledProcessError: Command '/bench/frameworks...
91,lightautoml,KDDCup09_appetency,5,CalledProcessError: Command '/bench/frameworks...
0,lightautoml,KDDCup99,0,NoResultError: Pipeline finished with 0 models...


In [54]:
lama["info"].unique()

array(["CalledProcessError: Command '/bench/frameworks/lightautoml/venv/bin/python -W ignore /bench/frameworks/lightautoml/exec.py' returned non-zero exit status 139.",
       "CalledProcessError: Command '/bench/frameworks/lightautoml/venv/bin/python -W ignore /bench/frameworks/lightautoml/exec.py' returned non-zero exit status 134.",
       'NoResultError: Pipeline finished with 0 models for some reason.\nProbably one or more models failed',
       'NoResultError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing th…',
       'NoResultError: Unable to allocate 2.12 GiB for an array with shape (100000, 355, 8) and data type float64'],
      dtype=object)

In more detail:
 - `APSFailure`: Exit status 139 through `Segmentation fault (core dumped)`.
 - `KDDCup09_Appetency`: 0 and 5 had a `segfault`, 1, 3 and 4 had the following error:
 ```
 FAIL (2021-11-21T14:02:02.126543Z): This should be unreachable
  catboost/private/libs/algo/tensor_search_helpers.cpp:99
  GetSplit() failed
??+0 (0x7F50DA105759)
...
??+0 (0x7F517B36A6DB)
clone+63 (0x7F517A49F71F)
Aborted (core dumped)
[ERROR] [amlb.benchmark:14:02:02.324] Command '/bench/frameworks/lightautoml/venv/bin/python -W ignore /bench/frameworks/lightautoml/exec.py' returned non-zero exit status 134.
 ```
 - `KDDCup99`:
 ```
 [DEBUG] [amlb.utils.process:18:35:27.299] Model Lvl_0_Pipe_0_Mod_0_LinearL2 failed during ml_algo.fit_predict call.
y_true and y_pred contain different number of classes 22, 23. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21]
Traceback (most recent call last):
 ``` 
 - `bank-marketing.7`: segfault.
 - `dionis`: Message points to memory issues:
 ```
 joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGKILL(-9)}
 ```
 - `nomao.6`: segfault.
 - `wine-quality-white`: `y_true and y_pred contain different number of classes 6, 7.`
 - `yeast`: `y_true and y_pred contain different number of classes 9, 10.`

### 2.2.8 MLJar

**Failures**: 80

**Reruns required**: 0

Framework errors, mostly `catboost` errors.

In [49]:
errors_for_framework("mljarsupervised_benchmark", classification)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
mljarsupervised_benchmark,APSFailure,1,1
mljarsupervised_benchmark,Internet-Advertisements,2,2
mljarsupervised_benchmark,KDDCup09_appetency,2,2
mljarsupervised_benchmark,PhishingWebsites,5,5
mljarsupervised_benchmark,Satellite,2,2
mljarsupervised_benchmark,adult,10,10
mljarsupervised_benchmark,bank-marketing,6,6
mljarsupervised_benchmark,blood-transfusion-service-center,10,10
mljarsupervised_benchmark,cnae-9,1,1
mljarsupervised_benchmark,credit-g,10,10


In [125]:
mljar = classification_errors[(classification_errors["framework"] == "mljarsupervised_benchmark") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])
mljar

Unnamed: 0,framework,task,fold,info
6676,mljarsupervised_benchmark,APSFailure,3,"NoResultError: ""['Ensemble_prediction_0_for_ne..."
7779,mljarsupervised_benchmark,Internet-Advertisements,0,"NoResultError: ""['Ensemble_prediction_0_for_ad..."
6815,mljarsupervised_benchmark,Internet-Advertisements,8,NoResultError: catboost/libs/data/model_datase...
6814,mljarsupervised_benchmark,KDDCup09_appetency,1,NoResultError: catboost/libs/data/model_datase...
6554,mljarsupervised_benchmark,KDDCup09_appetency,5,NoResultError: catboost/libs/data/model_datase...
...,...,...,...,...
6673,mljarsupervised_benchmark,qsar-biodeg,8,NoResultError: catboost/libs/data/model_datase...
6429,mljarsupervised_benchmark,wilt,0,NoResultError: catboost/libs/data/model_datase...
6432,mljarsupervised_benchmark,wilt,5,NoResultError: catboost/libs/data/model_datase...
6818,mljarsupervised_benchmark,wilt,8,NoResultError: catboost/libs/data/model_datase...


### 2.2.9 MLPlan

To be redone

### 2.2.10 MLR3AutoML


**Failures**: 13

**Reruns required**: 0

In [50]:
errors_for_framework("mlr3automl", classification)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
mlr3automl,KDDCup09-Upselling,3,3
mlr3automl,KDDCup99,10,10


In [27]:
mlr = classification_errors[(classification_errors["framework"] == "mlr3automl") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])
mlr

Unnamed: 0,framework,task,fold,info
3085,mlr3automl,KDDCup09-Upselling,0,CalledProcessError: Command 'Rscript --vanilla...
3119,mlr3automl,KDDCup09-Upselling,1,CalledProcessError: Command 'Rscript --vanilla...
2987,mlr3automl,KDDCup09-Upselling,3,CalledProcessError: Command 'Rscript --vanilla...
3211,mlr3automl,KDDCup99,0,CalledProcessError: Command 'Rscript --vanilla...
3429,mlr3automl,KDDCup99,1,CalledProcessError: Command 'Rscript --vanilla...
3394,mlr3automl,KDDCup99,2,CalledProcessError: Command 'Rscript --vanilla...
3014,mlr3automl,KDDCup99,3,CalledProcessError: Command 'Rscript --vanilla...
2974,mlr3automl,KDDCup99,4,CalledProcessError: Command 'Rscript --vanilla...
2892,mlr3automl,KDDCup99,5,CalledProcessError: Command 'Rscript --vanilla...
3128,mlr3automl,KDDCup99,6,CalledProcessError: Command 'Rscript --vanilla...


 - `KDDCup99`: 
 
 `0, 9`: `system call failed: Cannot allocate memory`
 
 `1, 2, 3, 4, 5, 6, 7, 8`: 
 ```
 Error: Failed to retrieve the result of MulticoreFuture (future_mapply-1) from the forked worker (on localhost; PID 611). Post-mortem diagnostic: No process exists with this PID, i.e. the forked localhost worker is no longer alive. Detected a non-exportable reference (‘externalptr’) in one of the globals (<unknown>) used in the future expression
In addition: Warning message:
In mccollect(jobs = jobs, wait = TRUE) :
  1 parallel job did not deliver a result
Timing stopped at: 2473 188.3 2370
Execution halted
 ```
 - `KDDCup09-Upselling`:
 
   `0, 1, 3`: `system call failed: Cannot allocate memory` (when trying to fork)
   

### 2.2.11 TPOT

**Failures**: 50

**Reruns required**: 0


In [51]:
errors_for_framework("TPOT", classification)

Of the errors below, 27 are timeout errors.


Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
TPOT,KDDCup09-Upselling,10,10
TPOT,KDDCup09_appetency,1,1
TPOT,KDDCup99,10,10
TPOT,amazon-commerce-reviews,3,3
TPOT,arcene,4,4
TPOT,christine,2,2
TPOT,dionis,9,9
TPOT,philippine,1,1
TPOT,sf-police-incidents,1,1
TPOT,wine-quality-white,5,5


In [52]:
tpot = classification_errors[(classification_errors["framework"] == "TPOT") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])
tpot

Unnamed: 0,framework,task,fold,info
756,TPOT,KDDCup99,0,NoResultError: y_true and y_pred contain diffe...
819,TPOT,KDDCup99,1,NoResultError: y_true and y_pred contain diffe...
853,TPOT,amazon-commerce-reviews,2,NoResultError: A pipeline has not yet been opt...
1129,TPOT,amazon-commerce-reviews,8,NoResultError: A pipeline has not yet been opt...
1324,TPOT,amazon-commerce-reviews,9,NoResultError: A pipeline has not yet been opt...
8477,TPOT,arcene,1,NoResultError: A pipeline has not yet been opt...
831,TPOT,arcene,4,NoResultError: A pipeline has not yet been opt...
8476,TPOT,arcene,8,NoResultError: A pipeline has not yet been opt...
974,TPOT,arcene,9,NoResultError: A pipeline has not yet been opt...
1259,TPOT,christine,3,NoResultError: A pipeline has not yet been opt...


 - `Amazon commerce`: memory error while forking the process as part of the joblib backend.
 - Other datasets have shared errors (not checked explicitly).
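
One way to check whether errors are shared across datasets without reading every message is to replace the numbers in each `info` string with a placeholder and count the resulting templates. This is a minimal sketch; `normalize` is a hypothetical helper, and the example messages are copied from outputs elsewhere in this notebook.

```python
import re
from collections import Counter

def normalize(info: str) -> str:
    """Replace every number so messages that differ only in numbers coincide."""
    return re.sub(r"\d+(?:\.\d+)?", "<n>", info)

# Example messages copied from outputs in this notebook.
infos = [
    "NoResultError: Unable to allocate 4.44 GiB for an array with shape (45000, 13237) and data type int64",
    "NoResultError: Unable to allocate 4.44 GiB for an array with shape (45000, 13251) and data type int64",
    "NoResultError: std::bad_alloc",
]

counts = Counter(normalize(info) for info in infos)
for template, count in counts.most_common():
    print(count, template)
```

On a DataFrame such as `tpot`, the same tally could be obtained with `tpot["info"].apply(normalize).value_counts()`.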

### 2.2.12 RandomForest

**Failures**: 0

**Reruns required**: 0

In [53]:
errors_for_framework("RandomForest", classification)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1


### 2.2.13 TunedRandomForest

**Failures**: 20

**Reruns required**: 0

Memory issues, will impute with RandomForest.

In [54]:
errors_for_framework("TunedRandomForest", classification)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
TunedRandomForest,KDDCup99,10,10
TunedRandomForest,dionis,10,10


# 3. Regression 4h8c

In [23]:
regression[(regression.framework == "flaml") & (regression.task == "Yolanda")]

Unnamed: 0,id,task,framework,constraint,fold,type,result,metric,mode,version,...,duration,training_duration,predict_duration,models_count,seed,info,mae,r2,rmse,models_ensemble_count
387,openml.org/t/317614,Yolanda,flaml,4h8c_gp3,4,regression,-8.51322,neg_rmse,aws.docker,0.6.2,...,16075.6,15995.1,40.4,18.0,148140843,,5.9105,0.384188,8.51322,
280,openml.org/t/317614,Yolanda,flaml,4h8c_gp3,5,regression,-8.65369,neg_rmse,aws.docker,0.6.2,...,16739.2,16671.8,28.4,16.0,148140844,,5.99619,0.386168,8.65369,
349,openml.org/t/317614,Yolanda,flaml,4h8c_gp3,6,regression,-8.51598,neg_rmse,aws.docker,0.6.2,...,16906.3,16804.8,63.5,21.0,148140845,,5.9249,0.395716,8.51598,
473,openml.org/t/317614,Yolanda,flaml,4h8c_gp3,9,regression,-8.55864,neg_rmse,aws.docker,0.6.2,...,17192.8,17093.5,60.3,18.0,148140848,,5.96088,0.385477,8.55864,
302,openml.org/t/317614,Yolanda,flaml,4h8c_gp3,1,regression,-8.5182,neg_rmse,aws.docker,0.6.2,...,17396.5,17296.5,61.7,21.0,148140840,,5.91997,0.394144,8.5182,
537,openml.org/t/317614,Yolanda,flaml,4h8c_gp3,0,regression,,neg_rmse,aws.docker,0.6.2,...,18038.7,,,,148140839,NoResultError: Interrupting thread MainThread ...,,,,
375,openml.org/t/317614,Yolanda,flaml,4h8c_gp3,2,regression,,neg_rmse,aws.docker,0.6.2,...,18038.7,,,,148140841,NoResultError: Interrupting thread MainThread ...,,,,
383,openml.org/t/317614,Yolanda,flaml,4h8c_gp3,3,regression,,neg_rmse,aws.docker,0.6.2,...,18041.6,,,,148140842,NoResultError: Interrupting thread MainThread ...,,,,
585,openml.org/t/317614,Yolanda,flaml,4h8c_gp3,8,regression,,neg_rmse,aws.docker,0.6.2,...,18048.2,,,,148140847,CalledProcessError: Command '/bench/frameworks...,,,,
493,openml.org/t/317614,Yolanda,flaml,4h8c_gp3,7,regression,,neg_rmse,aws.docker,0.6.2,...,18049.2,,,,148140846,CalledProcessError: Command '/bench/frameworks...,,,,


In [14]:
regression = pd.read_csv(r"http://openml-test.win.tue.nl/amlb/latest/regression_4h8c.csv")
regression_jobs = set(itertools.product(regression["task"].unique(), regression["fold"].unique()))
regression.sample(3)

Unnamed: 0,id,task,framework,constraint,fold,type,result,metric,mode,version,...,duration,training_duration,predict_duration,models_count,seed,info,mae,r2,rmse,models_ensemble_count
615,openml.org/t/167210,Moneyball,autosklearn,4h8c_gp3,4,regression,-21.7082,neg_rmse,aws.docker,0.14.0,...,14443.7,14398.9,0.02,5.0,776481210,,17.3904,0.948567,21.7082,
3033,openml.org/t/360932,QSAR-TID-11,GAMA_benchmark,4h8c_gp3,8,regression,-0.700791,neg_rmse,aws.docker,21.0.1,...,10120.5,10090.6,2.5,50.0,792051154,,0.52203,0.780872,0.700791,
2556,openml.org/t/233212,Allstate_Claims_Severity,lightautoml,4h8c_gp3,4,regression,-1836.26,neg_rmse,aws.docker,0.2.16,...,13531.8,13472.1,23.4,1.0,881698434,,1164.87,0.5873,1836.26,


In [15]:
regression = filter_for_latest_results(regression)

## 3.1 Missing Results
Missing results are those experiments which have no entry in the file at all.

In [16]:
missing_by_framework = (n_regression_jobs - regression.groupby("framework").count()["fold"])
missing_by_framework

framework
AutoGluon_benchmark            0
GAMA_benchmark                 0
H2OAutoML                      0
RandomForest                 300
TPOT                           0
TunedRandomForest              0
autosklearn                    0
constantpredictor              0
flaml                          0
lightautoml                    0
mljarsupervised_benchmark      0
mlr3automl                     0
Name: fold, dtype: int64

## 3.2 Failed Results
The `info` field is only populated for jobs that failed.

In [17]:
regression_errors = regression[~regression["info"].isna()][["framework", "task", "fold", "info"]]

In [18]:
regression_errors.groupby(["framework"]).nunique()

Unnamed: 0_level_0,task,fold,info
framework,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
GAMA_benchmark,2,6,2
TPOT,2,5,2
TunedRandomForest,2,10,1
autosklearn,1,1,1
flaml,6,5,7
lightautoml,1,5,1
mljarsupervised_benchmark,2,6,2


### 3.2.1 AutoGluon
**Failures**: 0

**Reruns required**: 0


### 3.2.2 Autosklearn
**Failures**: 1

**Reruns required**: 0

In [31]:
regression_errors[(regression_errors["framework"] == "autosklearn") & (~regression_errors["info"].isnull()) & (~regression_errors["info"].apply(is_timeout_error))]

Unnamed: 0,framework,task,fold,info
793,autosklearn,OnlineNewsPopularity,7,NoResultError: Input contains infinity or a va...


In [32]:
regression_errors[(regression_errors["framework"] == "autosklearn") & (~regression_errors["info"].isnull()) & (~regression_errors["info"].apply(is_timeout_error))]["info"].unique()

array(["NoResultError: Input contains infinity or a value too large for dtype('float32')."],
      dtype=object)

Curiously, this is the same failure as in the 1h run. The error is raised when checking `X` during `predict`. Manually downloading the task and checking the input verifies that the entire test (and train) set can be coerced into `float32`, so I am confident that it is an auto-sklearn issue.
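
A check of this kind can be sketched as follows. This is not the code used for the manual verification; `fits_float32` is a hypothetical helper that only tests whether every value is finite and within the single-precision range.

```python
import math

# Largest finite value representable as an IEEE 754 single-precision float.
FLOAT32_MAX = (2 - 2**-23) * 2.0**127

def fits_float32(values) -> bool:
    """True if every value is finite and within the float32 range."""
    return all(math.isfinite(v) and abs(v) <= FLOAT32_MAX for v in values)

print(fits_float32([0.0, -1.5e10, 3.0e38]))  # all values within range
print(fits_float32([1.0, 1.0e39]))           # 1e39 is too large for float32
print(fits_float32([1.0, float("inf")]))     # infinity is rejected
```

Running it over each numeric column of the train and test split would flag any value that triggers the `dtype('float32')` complaint on its own.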

### 3.2.3 Autoxgboost
**Failures**: ?

**Reruns required**: ?


### 3.2.4 FLAML
**Failures**: 10

**Reruns required**: 0

In [19]:
regression_errors[(regression_errors["framework"] == "flaml") & (~regression_errors["info"].isnull()) & (~regression_errors["info"].apply(is_timeout_error))]

Unnamed: 0,framework,task,fold,info
371,flaml,Airlines_DepDelay_10M,0,CalledProcessError: Command '/bench/frameworks...
341,flaml,diamonds,2,CalledProcessError: Command '/bench/frameworks...
537,flaml,Yolanda,0,NoResultError: Interrupting thread MainThread ...
375,flaml,Yolanda,2,NoResultError: Interrupting thread MainThread ...
383,flaml,Yolanda,3,NoResultError: Interrupting thread MainThread ...
585,flaml,Yolanda,8,CalledProcessError: Command '/bench/frameworks...
493,flaml,Yolanda,7,CalledProcessError: Command '/bench/frameworks...
457,flaml,nyc-taxi-green-dec-2016,7,CalledProcessError: Command '/bench/frameworks...
359,flaml,OnlineNewsPopularity,3,CalledProcessError: Command '/bench/frameworks...
381,flaml,black_friday,0,NoResultError: std::bad_alloc


 - `airlines.0`: memory: `Killed` with exit `137`.
 - `diamonds.2`: memory: `Killed` with exit `137`.
 - `black friday.0`: `bad_alloc` reported by subprocess.
 - `yolanda` `0, 2, 3`: Timeout.
 - `yolanda` `7, 8`:
   ```
   [flaml.automl: 12-06 14:30:18] {1380} WARNING - Time taken to find the best model is 90% of the provided time budget and not all estimators' hyperparameter search converged. Consider increasing the time budget.
   Terminated
   ```
 - `nyc-taxi-green-dec-2016.7`: `Segmentation fault (core dumped)` (exit 139)
 - `onlinenewspopularity.3`: memory: `Killed` with exit `137`.
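The exit codes in the list above follow the common shell convention of `128 + signal number`, so `137` corresponds to `SIGKILL` (typically the out-of-memory killer), `139` to `SIGSEGV`, and `134` to `SIGABRT`. A minimal sketch for extracting this from the `info` message is shown below; the helper name `signal_from_info` and the small lookup table are my own additions, and the signal numbers assume the usual Linux values.

```python
import re

# Assumption: usual Linux signal numbers; limited to the codes seen in this notebook.
SIGNAL_NAMES = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV"}

def signal_from_info(info: str):
    """Hypothetical helper: name of the signal behind a CalledProcessError message, or None."""
    match = re.search(r"returned non-zero exit status (\d+)", info)
    if match is None:
        return None
    status = int(match.group(1))
    # Exit statuses above 128 conventionally encode 128 + signal number.
    return SIGNAL_NAMES.get(status - 128)
```

It could be applied to the error table with `regression_errors["info"].apply(signal_from_info)` to separate kills from segmentation faults without opening each log.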

### 3.2.5 GAMA
**Failures**: 7

**Reruns required**: 0

In [34]:
regression_errors[(regression_errors["framework"] == "GAMA_benchmark") & (~regression_errors["info"].isnull()) & (~regression_errors["info"].apply(is_timeout_error))]

Unnamed: 0,framework,task,fold,info
2885,GAMA_benchmark,black_friday,3,NoResultError:
2899,GAMA_benchmark,nyc-taxi-green-dec-2016,1,NoResultError: [Errno 12] Cannot allocate memory
2965,GAMA_benchmark,nyc-taxi-green-dec-2016,6,NoResultError: [Errno 12] Cannot allocate memory
2953,GAMA_benchmark,nyc-taxi-green-dec-2016,2,NoResultError: [Errno 12] Cannot allocate memory
2824,GAMA_benchmark,nyc-taxi-green-dec-2016,3,NoResultError: [Errno 12] Cannot allocate memory
2974,GAMA_benchmark,nyc-taxi-green-dec-2016,7,NoResultError: [Errno 12] Cannot allocate memory
2853,GAMA_benchmark,nyc-taxi-green-dec-2016,4,NoResultError:


- `blackfriday.3`: `MemoryError` while communicating with evaluation process (`completed_future=self._output.get(block=False)`)
- `nyc.4`: idem.

### 3.2.6 H2O
**Failures**: 0

**Reruns required**: 0

In [35]:
regression_errors[(regression_errors["framework"] == "H2OAutoML") & (~regression_errors["info"].isnull()) & (~regression_errors["info"].apply(is_timeout_error))]

Unnamed: 0,framework,task,fold,info


### 3.2.7 LightAutoML

**Failures**: 5

**Reruns required**: 0


In [36]:
errors_for_framework("lightautoml", regression)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
lightautoml,Santander_transaction_value,5,5


In [37]:
regression_errors[(regression_errors["framework"] == "lightautoml") & ~(regression_errors["info"]).isnull()]

Unnamed: 0,framework,task,fold,info
1125,lightautoml,Santander_transaction_value,0,NoResultError: Pipeline finished with 0 models...
2702,lightautoml,Santander_transaction_value,8,NoResultError: Pipeline finished with 0 models...
2795,lightautoml,Santander_transaction_value,6,NoResultError: Pipeline finished with 0 models...
2704,lightautoml,Santander_transaction_value,3,NoResultError: Pipeline finished with 0 models...
2724,lightautoml,Santander_transaction_value,2,NoResultError: Pipeline finished with 0 models...


Folds `3, 6`: the log gives no clear reason why the pipeline finished with 0 models; the run stopped prematurely.

### 3.2.8 MLJar
**Failures**: 6

**Reruns required**: 0


In [38]:
errors_for_framework("mljarsupervised_benchmark", regression)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
mljarsupervised_benchmark,Airlines_DepDelay_10M,1,1
mljarsupervised_benchmark,QSAR-TID-11,5,5


In [39]:
regression_errors[regression_errors["framework"] == "mljarsupervised_benchmark"]["info"].unique()

array(['NoResultError: Object of type float32 is not JSON serializable',
       'NoResultError: Interrupting thread MainThread [ident=139655222089536] after 18000s timeout.'],
      dtype=object)

In [40]:
regression_errors[(regression_errors["framework"] == "mljarsupervised_benchmark") & (~regression_errors["info"].isnull()) & (~regression_errors["info"].apply(is_timeout_error))]

Unnamed: 0,framework,task,fold,info
2168,mljarsupervised_benchmark,QSAR-TID-11,0,NoResultError: Object of type float32 is not J...
3311,mljarsupervised_benchmark,Airlines_DepDelay_10M,6,NoResultError: Interrupting thread MainThread ...
3201,mljarsupervised_benchmark,QSAR-TID-11,7,NoResultError: Object of type float32 is not J...
3178,mljarsupervised_benchmark,QSAR-TID-11,5,NoResultError: Object of type float32 is not J...
3366,mljarsupervised_benchmark,QSAR-TID-11,9,NoResultError: Object of type float32 is not J...
3179,mljarsupervised_benchmark,QSAR-TID-11,4,NoResultError: Object of type float32 is not J...


The `NoResultError: Interrupting thread ...` entry was a timeout interrupt (after 18000s) raised during serialization of data, but only *after* the `predict` call had already been interrupted by a timeout, so no results would have been produced anyway.

### 3.2.9 MLPlan
Experiments suspended.

### 3.2.10 MLR3AutoML

**Failures**: 0

**Reruns required**: -


In [41]:
errors_for_framework("mlr3automl", regression)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1


### 3.2.11 TPOT
**Failures**: 5

**Reruns required**: 0


In [42]:
errors_for_framework("TPOT", regression)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
TPOT,Airlines_DepDelay_10M,4,4
TPOT,tecator,1,1


In [43]:
regression_errors[(regression_errors["framework"] == "TPOT") & (~regression_errors["info"].isnull()) & (~regression_errors["info"].apply(is_timeout_error))]

Unnamed: 0,framework,task,fold,info
2488,TPOT,Airlines_DepDelay_10M,0,NoResultError: A pipeline has not yet been opt...
1990,TPOT,Airlines_DepDelay_10M,4,NoResultError: A pipeline has not yet been opt...
1945,TPOT,Airlines_DepDelay_10M,6,NoResultError: A pipeline has not yet been opt...
1977,TPOT,Airlines_DepDelay_10M,5,NoResultError: A pipeline has not yet been opt...
2047,TPOT,tecator,3,CalledProcessError: Command '/bench/frameworks...


`tecator` has a `Segmentation fault (core dumped)` after ~2.5 hours.

### 3.2.12 RandomForest
No failures.

In [44]:
errors_for_framework("RandomForest", regression)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1


### 3.2.13 TunedRandomForest

**Failures**: 5

**Reruns required**: 0

In [39]:
errors_for_framework("TunedRandomForest", regression)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
TunedRandomForest,Airlines_DepDelay_10M,10,10
TunedRandomForest,Buzzinsocialmedia_Twitter,4,4


In [46]:
regression_errors[(regression_errors["framework"] == "TunedRandomForest") & ~(regression_errors["info"]).isnull()]

Unnamed: 0,framework,task,fold,info
1564,TunedRandomForest,Airlines_DepDelay_10M,1,CalledProcessError: Command '/bench/frameworks...
1584,TunedRandomForest,Airlines_DepDelay_10M,0,CalledProcessError: Command '/bench/frameworks...
1602,TunedRandomForest,Buzzinsocialmedia_Twitter,1,CalledProcessError: Command '/bench/frameworks...
1420,TunedRandomForest,Buzzinsocialmedia_Twitter,2,CalledProcessError: Command '/bench/frameworks...
1426,TunedRandomForest,Buzzinsocialmedia_Twitter,8,CalledProcessError: Command '/bench/frameworks...
1548,TunedRandomForest,Buzzinsocialmedia_Twitter,9,CalledProcessError: Command '/bench/frameworks...
1465,TunedRandomForest,Airlines_DepDelay_10M,9,CalledProcessError: Command '/bench/frameworks...
1435,TunedRandomForest,Airlines_DepDelay_10M,2,CalledProcessError: Command '/bench/frameworks...
1316,TunedRandomForest,Airlines_DepDelay_10M,3,CalledProcessError: Command '/bench/frameworks...
1303,TunedRandomForest,Airlines_DepDelay_10M,4,CalledProcessError: Command '/bench/frameworks...


`airlines` and `buzz.1` were killed with exit code `137`. `buzz` actually managed to stop evaluations prematurely for multiple `max_features` values because of memory constraints, but ultimately failed.
I assume the other folds have the same type of error; they should be imputed with `RandomForest` performance.
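The imputation mentioned above is not implemented in this notebook. One way it could be done is sketched below, assuming the results frame has the `framework`, `task`, `fold` and `result` columns shown earlier and that the baseline has exactly one row per `(task, fold)`. The function name `impute_with_baseline` is hypothetical.

```python
import pandas as pd

def impute_with_baseline(results, framework="TunedRandomForest", baseline="RandomForest"):
    """Hypothetical helper: fill missing `result` values of `framework`
    with the `baseline` result for the same (task, fold)."""
    # Baseline scores indexed by (task, fold); assumes one row per pair.
    base = results[results["framework"] == baseline].set_index(["task", "fold"])["result"]
    imputed = results.copy()
    needs_fill = (imputed["framework"] == framework) & imputed["result"].isna()
    keys = pd.MultiIndex.from_frame(imputed.loc[needs_fill, ["task", "fold"]])
    # Look up the baseline score for every failed job; a missing baseline stays NaN.
    imputed.loc[needs_fill, "result"] = base.reindex(keys).to_numpy()
    return imputed
```

The input frame is left untouched, so the original failures remain available for inspection next to the imputed copy.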


# 4. Classification 4h8c

In [29]:
classification[classification["info"].apply(lambda s: isinstance(s, str) and "UnicodeDecodeError" in s)]

Unnamed: 0,id,task,framework,constraint,fold,type,result,metric,mode,version,...,training_duration,predict_duration,models_count,seed,info,acc,auc,balacc,logloss,models_ensemble_count
3907,openml.org/t/360114,Higgs,flaml,4h8c_gp3,5,binary,,auc,aws.docker,0.6.2,...,,,,1697533426,UnicodeDecodeError: 'utf-8' codec can't decode...,,,,,
8270,openml.org/t/360114,Higgs,flaml,4h8c_gp3,4,binary,,auc,aws.docker,0.6.2,...,,,,1632365474,UnicodeDecodeError: 'utf-8' codec can't decode...,,,,,
8315,openml.org/t/360114,Higgs,flaml,4h8c_gp3,5,binary,,auc,aws.docker,0.6.2,...,,,,1632365475,UnicodeDecodeError: 'utf-8' codec can't decode...,,,,,


In [24]:
classification = pd.read_csv(r"http://openml-test.win.tue.nl/amlb/latest/classification_4h8c.csv")
classification_jobs = set(itertools.product(classification["task"].unique(), classification["fold"].unique()))
classification.sample(3)

Unnamed: 0,id,task,framework,constraint,fold,type,result,metric,mode,version,...,training_duration,predict_duration,models_count,seed,info,acc,auc,balacc,logloss,models_ensemble_count
2665,openml.org/t/359953,micro-mass,H2OAutoML,4h8c_gp3,4,multiclass,-0.421218,neg_logloss,aws.docker,3.34.0.1,...,14525.3,0.2,277.0,825036435,,0.912281,,0.891667,0.421218,
6416,openml.org/t/359957,cnae-9,constantpredictor,4h8c,6,multiclass,-2.19722,neg_logloss,local,0.24.2,...,0.0004,5e-05,1.0,1435978,,0.111111,,0.111111,2.19722,
3067,openml.org/t/359975,Satellite,AutoGluon_benchmark,4h8c_gp3,1,binary,0.998012,auc,aws.docker,0.3.1,...,6428.7,29.0,26.0,1129160200,,0.996078,0.998012,0.857143,0.017399,3.0


In [31]:
classification = filter_for_latest_results(classification)
grouped_by_fw_task = classification.groupby(["framework", "task"]).count()["fold"]

## 4.1 Missing Results

In [32]:
missing_by_framework = (n_classification_jobs - classification.groupby("framework").count()["fold"])
missing_by_framework

framework
AutoGluon_benchmark          0
GAMA_benchmark               0
H2OAutoML                    0
RandomForest                 0
TPOT                         0
TunedRandomForest            0
autosklearn                  0
autosklearn2                 0
flaml                        0
lightautoml                  0
mljarsupervised_benchmark    0
mlr3automl                   0
Name: fold, dtype: int64

### autosklearn

In [33]:
grouped_by_fw_task["autosklearn"][grouped_by_fw_task["autosklearn"] != 10]

Series([], Name: fold, dtype: int64)

### autosklearn2

In [14]:
grouped_by_fw_task["autosklearn2"][grouped_by_fw_task["autosklearn2"] != 10]

Series([], Name: fold, dtype: int64)

Redid a task because the disk was full (as best I can tell, the AWS disk, meaning it is a framework failure).

### FLAML

Redid `guillermo.4` because of a connection issue.

### H2O

In [62]:
grouped_by_fw_task["H2OAutoML"][grouped_by_fw_task["H2OAutoML"] != 10]

Series([], Name: fold, dtype: int64)

`shuttle.9` failed due to a connection error when retrieving the docker image. **redone**

`volkert.1` was aborted due to CPU inactivity. **redone**

Redoing ran into a problem because of log4j.

### MLJar

I don't understand why `dionis.0` failed - there is some activity: `[i-0d9c2f1c57abba452]>[17390.980453] cloud-init[1754]: Job local.openml_s_271.4h8c_gp3.dionis.0.mljarsupervised_benchmark executed in 17185.877 seconds` but afterwards the instance doesn't terminate and files are not transferred.

`click_prediction_small.1` also never terminates, but there seems to be a kernel panic while transferring results.

Redid both successfully.

### mlr3automl

Redid `albert.7` because of a connection issue during setup.

## 4.2 Failed Results

In [34]:
classification_errors = classification[~classification["info"].isna()][["framework", "task", "fold", "info"]]

### 4.2.1 AutoGluon
**Failures**: 3

**Reruns required**: 0

Subprocesses killed while training lightgbm/nn (based on error code due to memory).

In [9]:
errors_for_framework("AutoGluon_benchmark", classification)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
AutoGluon_benchmark,KDDCup09-Upselling,2,2
AutoGluon_benchmark,KDDCup99,1,1


In [10]:
classification_errors[(classification_errors["framework"] == "AutoGluon_benchmark") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])

Unnamed: 0,framework,task,fold,info
1607,AutoGluon_benchmark,KDDCup09-Upselling,5,CalledProcessError: Command '/bench/frameworks...
1311,AutoGluon_benchmark,KDDCup09-Upselling,7,CalledProcessError: Command '/bench/frameworks...
1241,AutoGluon_benchmark,KDDCup99,8,CalledProcessError: Command '/bench/frameworks...


In [11]:
classification_errors[(classification_errors["framework"] == "AutoGluon_benchmark") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])["info"].unique()

array(["CalledProcessError: Command '/bench/frameworks/AutoGluon/venv/bin/python -W ignore /bench/frameworks/AutoGluon/exec.py' returned non-zero exit status 137."],
      dtype=object)

### 4.2.2 Autosklearn
**Failures**: 0

**Reruns required**: 0

In [12]:
errors_for_framework("autosklearn", classification)

Of the errors below, 5 are timeout errors.


Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
autosklearn,KDDCup09-Upselling,5,5


### Autosklearn 2
**Failures**: 10

**Reruns required**: 0

In [13]:
errors_for_framework("autosklearn2", classification)

Of the errors below, 10 are timeout errors.


Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
autosklearn2,KDDCup09-Upselling,10,10


### 4.2.3 AutoXGBoost

**Failures**: ?

**Reruns required**: 20 ?

In [21]:
#errors_for_framework("autoxgboost", classification_errors)

In [22]:
#axgb = classification_errors[(classification_errors["framework"] == "autoxgboost") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])
#axgb

### 4.2.4 FLAML

**Failures**: 59

**Reruns required**: 1

Redo decoding errors (`Higgs.4-5`).

`NoResultError: Interrupting` means that the initial search/predict was interrupted due to time constraints, and the next-level interrupt was then issued during serialization. Both decoding-error jobs have been redone once, but one of them had the same error again.


In [14]:
errors_for_framework("flaml", classification)

Of the errors below, 1 are timeout errors.
Of the errors below, 1 are memory errors.


Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
flaml,Fashion-MNIST,2,2
flaml,Higgs,7,7
flaml,KDDCup09-Upselling,10,10
flaml,KDDCup99,4,4
flaml,MiniBooNE,5,5
flaml,airlines,1,1
flaml,covertype,2,2
flaml,dionis,10,10
flaml,guillermo,10,10
flaml,riccardo,5,5


In [15]:
flaml = classification_errors[(classification_errors["framework"] == "flaml") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])
flaml

Unnamed: 0,framework,task,fold,info
7475,flaml,Fashion-MNIST,5,NoResultError: Interrupting thread MainThread ...
7581,flaml,Fashion-MNIST,8,NoResultError: Interrupting thread MainThread ...
6075,flaml,Higgs,0,CalledProcessError: Command '/bench/frameworks...
6064,flaml,Higgs,1,NoResultError: Interrupting thread MainThread ...
7467,flaml,Higgs,2,NoResultError: Interrupting thread MainThread ...
3906,flaml,Higgs,4,CalledProcessError: Command '/bench/frameworks...
3907,flaml,Higgs,5,UnicodeDecodeError: 'utf-8' codec can't decode...
7674,flaml,Higgs,6,CalledProcessError: Command '/bench/frameworks...
7250,flaml,Higgs,9,NoResultError: Interrupting thread MainThread ...
6033,flaml,KDDCup09-Upselling,0,NoResultError: Unable to allocate 4.44 GiB for...


`upselling.3` fails in automl.

In [19]:
flaml["info"].unique()

array(['NoResultError: Interrupting thread MainThread [ident=140291176265536] after 18000s timeout.',
       'NoResultError: Interrupting thread MainThread [ident=139641969686336] after 18000s timeout.',
       "CalledProcessError: Command '/bench/frameworks/flaml/venv/bin/python -W ignore /bench/frameworks/flaml/exec.py' returned non-zero exit status 139.",
       'NoResultError: Interrupting thread MainThread [ident=140128358463296] after 18000s timeout.',
       'NoResultError: Interrupting thread MainThread [ident=140282879719232] after 18000s timeout.',
       "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 63: invalid start byte",
       "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xab in position 64: invalid start byte",
       "CalledProcessError: Command '/bench/frameworks/flaml/venv/bin/python -W ignore /bench/frameworks/flaml/exec.py' returned non-zero exit status 137.",
       'NoResultError: Interrupting thread MainThread [ident=1398779133151

### 4.2.5 GAMA
**Failures**: 54

**Reruns required**: 0

I did not inspect the logs, but based on the error messages here I am fairly confident that they are all GAMA failures.

In [16]:
errors_for_framework("GAMA_benchmark", classification_errors)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
GAMA_benchmark,Higgs,1,1
GAMA_benchmark,KDDCup09-Upselling,10,10
GAMA_benchmark,KDDCup99,10,10
GAMA_benchmark,airlines,10,10
GAMA_benchmark,covertype,4,4
GAMA_benchmark,helena,9,9
GAMA_benchmark,numerai28_6,1,1
GAMA_benchmark,wine-quality-white,5,5
GAMA_benchmark,yeast,5,5


In [13]:
gama = classification_errors[(classification_errors["framework"] == "GAMA_benchmark") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])
gama

Unnamed: 0,framework,task,fold,info
3464,GAMA_benchmark,Higgs,1,NoResultError: [Errno 12] Cannot allocate memory
3445,GAMA_benchmark,KDDCup09-Upselling,0,NoResultError: 'NoneType' object is not iterable
1796,GAMA_benchmark,KDDCup09-Upselling,2,NoResultError: 'NoneType' object is not iterable
1823,GAMA_benchmark,KDDCup09-Upselling,3,NoResultError: 'NoneType' object is not iterable
1583,GAMA_benchmark,KDDCup09-Upselling,4,NoResultError: 'NoneType' object is not iterable
2043,GAMA_benchmark,KDDCup09-Upselling,5,NoResultError: 'NoneType' object is not iterable
2070,GAMA_benchmark,KDDCup09-Upselling,6,NoResultError: 'NoneType' object is not iterable
1566,GAMA_benchmark,KDDCup09-Upselling,7,NoResultError: 'NoneType' object is not iterable
2124,GAMA_benchmark,KDDCup09-Upselling,8,NoResultError: 'NoneType' object is not iterable
1724,GAMA_benchmark,KDDCup09-Upselling,9,NoResultError: 'NoneType' object is not iterable


These look like GAMA errors to me (mostly memory).
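To make the "mostly memory" judgement a little less manual, the memory-related messages that appear across this notebook could be matched with a single pattern. The sketch below is an assumption on my part: the pattern list is collected from the messages shown in this notebook only, and `looks_like_memory_error` is a hypothetical name, not an existing helper.

```python
import re

# Substrings taken from error messages seen in this notebook; not exhaustive.
MEMORY_PATTERN = re.compile(
    r"Cannot allocate memory|Unable to allocate|could not allocate|std::bad_alloc"
)

def looks_like_memory_error(info) -> bool:
    """Hypothetical helper: True if the INFO message mentions a failed allocation."""
    # Successful jobs have NaN in `info`, so guard against non-string input.
    return isinstance(info, str) and MEMORY_PATTERN.search(info) is not None
```

Unlike `is_memory_error`, this uses `search` rather than `match`, so the position of the message within the string does not matter. Messages such as `'NoneType' object is not iterable` are not flagged and still need a look at the logs.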

### 4.2.6 H2O
**Failures**: 10

**Reruns required**: 0

All timeout errors

In [17]:
errors_for_framework("H2OAutoML", classification)

Of the errors below, 9 are timeout errors.


Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
H2OAutoML,KDDCup99,10,10


In [130]:
h2o = classification_errors[(classification_errors["framework"] == "H2OAutoML") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])
h2o

Unnamed: 0,framework,task,fold,info
9043,H2OAutoML,KDDCup99,4,NoResultError: Interrupting thread MainThread ...


In [71]:
h2o.iloc[0]["info"]

'NoResultError: Interrupting thread MainThread [ident=139723361433408] after 7200s timeout.'

The timeout is not captured by the regex because the message is prefixed with `NoResultError` rather than `TimeoutError`: the second interrupt happened while saving the results.
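A variant of `is_timeout_error` that also recognises this second-level interrupt could look like the sketch below. The name `is_any_timeout` is hypothetical, and whether such jobs should be counted as plain timeouts is a judgement call that this sketch does not settle.

```python
import re

# Accept both prefixes seen in this notebook for the interrupt message.
ANY_TIMEOUT = re.compile(
    r"(?:TimeoutError|NoResultError): Interrupting thread MainThread "
    r"\[ident=\d+\] after \d+s timeout\."
)

def is_any_timeout(info: str) -> bool:
    """Hypothetical helper: True for first- and second-level timeout interrupts."""
    return ANY_TIMEOUT.match(info) is not None
```

With this check, the `KDDCup99.4` message shown above would be grouped with the other timeout errors.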

### 4.2.7 LightAutoML

**Failures**: 71

**Reruns required**: 0

Seem to be memory errors.

In [18]:
errors_for_framework("lightautoml", classification)

Of the errors below, 2 are timeout errors.


Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
lightautoml,APSFailure,9,9
lightautoml,KDDCup09-Upselling,7,7
lightautoml,KDDCup09_appetency,9,9
lightautoml,KDDCup99,10,10
lightautoml,bank-marketing,9,9
lightautoml,dionis,10,10
lightautoml,nomao,4,4
lightautoml,porto-seguro,3,3
lightautoml,wine-quality-white,5,5
lightautoml,yeast,5,5


In [24]:
lama = classification_errors[(classification_errors["framework"] == "lightautoml") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])
lama

Unnamed: 0,framework,task,fold,info
5270,lightautoml,APSFailure,1,CalledProcessError: Command '/bench/frameworks...
90,lightautoml,APSFailure,2,CalledProcessError: Command '/bench/frameworks...
254,lightautoml,APSFailure,3,CalledProcessError: Command '/bench/frameworks...
394,lightautoml,APSFailure,4,CalledProcessError: Command '/bench/frameworks...
147,lightautoml,APSFailure,5,CalledProcessError: Command '/bench/frameworks...
...,...,...,...,...
5370,lightautoml,yeast,0,NoResultError: Pipeline finished with 0 models...
5265,lightautoml,yeast,1,NoResultError: Pipeline finished with 0 models...
294,lightautoml,yeast,2,NoResultError: Pipeline finished with 0 models...
355,lightautoml,yeast,3,NoResultError: Pipeline finished with 0 models...


In [25]:
lama["info"].unique()

array(["CalledProcessError: Command '/bench/frameworks/lightautoml/venv/bin/python -W ignore /bench/frameworks/lightautoml/exec.py' returned non-zero exit status 139.",
       "CalledProcessError: Command '/bench/frameworks/lightautoml/venv/bin/python -W ignore /bench/frameworks/lightautoml/exec.py' returned non-zero exit status 134.",
       'NoResultError: Pipeline finished with 0 models for some reason.\nProbably one or more models failed',
       'NoResultError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing th…',
       'NoResultError: Unable to allocate 2.12 GiB for an array with shape (100000, 355, 8) and data type float64'],
      dtype=object)

In [26]:
lama[lama["info"] == 'NoResultError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing th…']

Unnamed: 0,framework,task,fold,info
5372,lightautoml,dionis,0,NoResultError: A worker process managed by the...
85,lightautoml,dionis,4,NoResultError: A worker process managed by the...
212,lightautoml,dionis,5,NoResultError: A worker process managed by the...
424,lightautoml,dionis,6,NoResultError: A worker process managed by the...
59,lightautoml,dionis,7,NoResultError: A worker process managed by the...
438,lightautoml,dionis,8,NoResultError: A worker process managed by the...
5226,lightautoml,dionis,9,NoResultError: A worker process managed by the...


Thrown from within lightautoml.

### 4.2.8 MLJar

**Failures**: 140

**Reruns required**: 0


In [19]:
errors_for_framework("mljarsupervised_benchmark", classification)

Of the errors below, 2 are timeout errors.


Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
mljarsupervised_benchmark,APSFailure,6,6
mljarsupervised_benchmark,Click_prediction_small,10,10
mljarsupervised_benchmark,Internet-Advertisements,9,9
mljarsupervised_benchmark,KDDCup09_appetency,5,5
mljarsupervised_benchmark,PhishingWebsites,10,10
mljarsupervised_benchmark,Satellite,7,7
mljarsupervised_benchmark,adult,10,10
mljarsupervised_benchmark,bank-marketing,6,6
mljarsupervised_benchmark,blood-transfusion-service-center,10,10
mljarsupervised_benchmark,credit-g,10,10


In [28]:
mljar = classification_errors[(classification_errors["framework"] == "mljarsupervised_benchmark") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])
mljar

Unnamed: 0,framework,task,fold,info
6747,mljarsupervised_benchmark,APSFailure,0,NoResultError: catboost/libs/data/model_datase...
6785,mljarsupervised_benchmark,APSFailure,1,NoResultError: catboost/libs/data/model_datase...
4354,mljarsupervised_benchmark,APSFailure,3,NoResultError: catboost/libs/data/model_datase...
3983,mljarsupervised_benchmark,APSFailure,5,NoResultError: catboost/libs/data/model_datase...
4003,mljarsupervised_benchmark,APSFailure,7,NoResultError: catboost/libs/data/model_datase...
...,...,...,...,...
4063,mljarsupervised_benchmark,wilt,2,NoResultError: catboost/libs/data/model_datase...
4071,mljarsupervised_benchmark,wilt,3,NoResultError: catboost/libs/data/model_datase...
4318,mljarsupervised_benchmark,wilt,7,NoResultError: catboost/libs/data/model_datase...
4179,mljarsupervised_benchmark,wilt,8,NoResultError: catboost/libs/data/model_datase...


In [29]:
mljar["info"].unique()

array(['NoResultError: catboost/libs/data/model_dataset_compatibility.cpp:81: At position 169 should be feature with name Ensemble_prediction_0_for_neg_1_for_pos (found Ensemble_prediction).',
       'NoResultError: "[\'Ensemble_prediction_0_for_neg_1_for_pos\', \'59_CatBoost_prediction_0_for_neg_1_for_pos\', \'62_CatBoost_prediction_0_for_neg_1_for_pos\', \'59_CatBoost_BoostOnErrors_prediction_0_for_neg…',
       'NoResultError: Interrupting thread MainThread [ident=139842026686272] after 18000s timeout.',
       'NoResultError: Interrupting thread MainThread [ident=140364635006784] after 18000s timeout.',
       'NoResultError: Interrupting thread MainThread [ident=139789014480704] after 18000s timeout.',
       'NoResultError: Interrupting thread MainThread [ident=140604626982720] after 18000s timeout.',
       'NoResultError: Interrupting thread MainThread [ident=139934569097024] after 18000s timeout.',
       'NoResultError: Interrupting thread MainThread [ident=139699174479680] a

### 4.2.9 MLPlan

To be redone

### 4.2.10 MLR3AutoML


**Failures**: 20

**Reruns required**: 0

Based on a sample, seem to be framework issues (children dying because of memory issues).

In [20]:
errors_for_framework("mlr3automl", classification)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
mlr3automl,KDDCup09-Upselling,10,10
mlr3automl,KDDCup99,10,10


In [32]:
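mlr = classification_errors[(classification_errors["framework"] == "mlr3automl") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])
mlr["info"].unique()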

array(['CalledProcessError: Command \'Rscript --vanilla -e ".libPaths(\'/bench/frameworks/mlr3automl/lib\'); source(\'/bench/frameworks/mlr3automl/exec.R\'); run(\'/input/org/openml/www/datasets/43072/dataset_trai…',
       'CalledProcessError: Command \'Rscript --vanilla -e ".libPaths(\'/bench/frameworks/mlr3automl/lib\'); source(\'/bench/frameworks/mlr3automl/exec.R\'); run(\'/input/org/openml/www/datasets/42746/dataset_trai…'],
      dtype=object)

`Kddcup99.3`, `Kddcup99.6`:
```
INFO  [01:30:11.965] [bbotk] Starting to optimize 22 parameter(s) with '<OptimizerHyperband>' and '<TerminatorCombo> [any=TRUE]'
INFO  [01:30:12.072] [bbotk] Evaluating 9 configuration(s)
INFO  [01:30:14.042] [mlr3] Running benchmark with 9 resampling iterations

[ERROR] [amlb.utils.process:01:31:25.541]
Attaching package: ‘mlr3extralearners’

The following objects are masked from ‘package:mlr3’:

    lrn, lrns

Loading required package: paradox
Error: Failed to retrieve the result of MulticoreFuture (future_mapply-2) from the forked worker (on localhost; PID 636). Post-mortem diagnostic: No process exists with this PID, i.e. the forked localhost worker is no longer alive. Detected a non-exportable reference (‘externalptr’) in one of the globals (<unknown>) used in the future expression
In addition: Warning message:
In mccollect(jobs = jobs, wait = TRUE) :
  1 parallel job did not deliver a result
Timing stopped at: 3342 216.6 3200
Execution halted
```

`kddcup09-upselling`: `Error in mcfork(detached):  unable to fork, possible reason: Cannot allocate memory`

### 4.2.11 TPOT

**Failures**: 53

**Reruns required**: 0


In [21]:
errors_for_framework("TPOT", classification)

Of the errors below, 29 are timeout errors.


Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
TPOT,APSFailure,1,1
TPOT,Bioresponse,1,1
TPOT,KDDCup09-Upselling,10,10
TPOT,KDDCup99,10,10
TPOT,amazon-commerce-reviews,2,2
TPOT,arcene,4,4
TPOT,christine,1,1
TPOT,dionis,10,10
TPOT,gina,1,1
TPOT,philippine,1,1


In [22]:
tpot = classification_errors[(classification_errors["framework"] == "TPOT") & (~classification_errors["info"].isnull()) & (~classification_errors["info"].apply(is_timeout_error))].sort_values(by=["task", "fold"])
tpot

Unnamed: 0,framework,task,fold,info
512,TPOT,Bioresponse,0,NoResultError: A pipeline has not yet been opt...
604,TPOT,KDDCup99,0,NoResultError: y_true and y_pred contain diffe...
5452,TPOT,KDDCup99,4,NoResultError: A pipeline has not yet been opt...
5937,TPOT,KDDCup99,5,NoResultError: A pipeline has not yet been opt...
5729,TPOT,KDDCup99,6,NoResultError: 'LinearSVC' object has no attri...
5581,TPOT,KDDCup99,7,NoResultError: A pipeline has not yet been opt...
5830,TPOT,KDDCup99,8,NoResultError: probability estimates are not a...
5777,TPOT,KDDCup99,9,NoResultError: y_true and y_pred contain diffe...
635,TPOT,amazon-commerce-reviews,0,NoResultError: A pipeline has not yet been opt...
5923,TPOT,amazon-commerce-reviews,7,NoResultError: A pipeline has not yet been opt...


 - `Amazon commerce`: memory error while forking the process as part of joblib backend.
 - `AttributeError: 'LinearSVC' object has no attribute 'predict_proba'` was internal to `fit` - it was **not** during our `predict` fallback.

### 4.2.12 RandomForest

**Failures**: 0

**Reruns required**: 0

In [35]:
errors_for_framework("RandomForest", classification)

Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1


### 4.2.13 TunedRandomForest

**Failures**: 24

**Reruns required**: 0

Memory issues, will impute with RandomForest.

In [23]:
errors_for_framework("TunedRandomForest", classification)

Of the errors below, 3 are timeout errors.


Unnamed: 0_level_0,Unnamed: 1_level_0,fold,info
framework,task,Unnamed: 2_level_1,Unnamed: 3_level_1
TunedRandomForest,KDDCup09-Upselling,4,4
TunedRandomForest,KDDCup99,10,10
TunedRandomForest,dionis,10,10


# 5. Remarks
To keep a clear overview, this notebook is maintained in a somewhat destructive manner: it relies on an unversioned results file, and the cells are updated based on the latest information. When a bug is identified which requires runs to be re-evaluated, this information is only stored as long as the new results are not in yet. Version control could of course be achieved through GitHub, but it would require a non-public repository or our results would be leaked prematurely. Instead, simple file sharing and renaming is used to preserve the history, to some extent. It is always possible to deduce which jobs have been rerun, as *all* results are included in the final result files. We hope this is still a satisfactory middle ground: for any jobs that have missing results, it is immediately clear why. The motivation for rerun jobs can still be found (although with more effort).
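As an illustration of how rerun jobs can be deduced from the raw result files, the sketch below counts how often each `(framework, task, fold)` combination occurs before any filtering for latest results. The function name `rerun_jobs` and the `n_runs` column are hypothetical; only the `framework`, `task` and `fold` columns are taken from the result files.

```python
import pandas as pd

def rerun_jobs(raw_results: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: jobs that occur more than once in the raw results."""
    counts = raw_results.groupby(["framework", "task", "fold"]).size()
    # A job recorded more than once has been executed again at some point.
    return counts[counts > 1].rename("n_runs").reset_index()
```

For example, the raw classification file lists `flaml` on `Higgs` fold 5 twice with different seeds, so that job would show up here with `n_runs` equal to 2.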

For future work, we will consider using a private repository from the start, which we will make public upon publication, to make the process (even) more transparent.