# AutoML Benchmarking: AutoGluon, AutoSklearn, and MLSea (AssistML)

This notebook compares three AutoML frameworks: **AutoGluon**, **AutoSklearn**, and **MLSea (AssistML)**.  
The goal is to assess their performance and usability on a single shared dataset under a consistent experimental setup.  

The notebook is designed to accompany the corresponding scientific publication and includes code, preprocessing steps, and evaluation procedures that ensure full reproducibility of the experiments.


In [1]:
data_location = './tmp/data/pda_2023-04-18_10-13-22.csv'
label_location = './tmp/data/labels_030723.csv'

In [2]:
!curl --create-dirs -O --output-dir \
./tmp/data \
https://gitlab.com/mibbels/automlwrapperdata/-/raw/main/tabular-regression/labels_030723.csv 

!curl --create-dirs -O --output-dir \
./tmp/data \
https://gitlab.com/mibbels/automlwrapperdata/-/raw/main/tabular-regression/pda_2023-04-18_10-13-22.csv 

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 14098  100 14098    0     0  27737      0 --:--:-- --:--:-- --:--:-- 27697
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 67.7M  100 67.7M    0     0  57.3M      0  0:00:01  0:00:01 --:--:-- 57.3M


# Data preparation (taken without changes)

The preprocessing below is adopted unchanged from: Hoseini, S., et al.: Coatings intelligence: Data-driven automation for chemistry 4.0. In: 2024 IEEE 7th (ICPS). pp. 1–8 (2024)

This transformation involves:
1. Extracting selected features (e.g., `x_force`, `y_force`, `z_force`).
2. Padding sequences to ensure consistent dimensionality.
3. Splitting the dataset into **training**, **validation**, and **test sets**.
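
The padding step (step 2) can be illustrated in isolation. The following is a minimal sketch, assuming that padding means repeating the last row of a sequence (NumPy's `mode='edge'`); the helper name `pad_to_length` and the toy sequences are hypothetical and not part of the original pipeline.

```python
import numpy as np
import pandas as pd


def pad_to_length(df: pd.DataFrame, target_length: int) -> pd.DataFrame:
    """Pad df with copies of its last row until it has target_length rows."""
    padding_size = target_length - len(df)
    if padding_size < 0:
        raise ValueError("sequence is longer than target_length")
    # Pad only along the row axis (after the last row); columns stay untouched
    padded = np.pad(df.to_numpy(), ((0, padding_size), (0, 0)), mode="edge")
    return pd.DataFrame(padded, columns=df.columns)


# Hypothetical toy sequences of different lengths
short_seq = pd.DataFrame({"x_force": [0.1, 0.2, 0.3]})
long_seq = pd.DataFrame({"x_force": [1.0, 2.0, 3.0, 4.0, 5.0]})

target = max(len(short_seq), len(long_seq))
padded_short = pad_to_length(short_seq, target)
print(padded_short["x_force"].tolist())
```

After padding, all sequences have the same number of rows, so they can be stacked into a single array or flattened into one feature vector per sample.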

In [2]:
import pandas as pd
import numpy as np
csv = pd.read_csv(data_location, skiprows = 0)
csv['Zeit'] =  pd.to_datetime(csv['Zeit'])
csv.sort_values(by='Zeit', inplace = True)

TIME_LIMIT = 60 * 60

labels = pd.read_csv(label_location)
labels = labels[labels['row'] != 'None']
labels = labels[labels['row'] != 'Aussortieren']
print(len(labels))

df = csv[['Zeit','product_id', 'run_id', 'experiment_id', 'trial_id', 'set_force_begin',
       'x_position', 'y_position', 'z1_position', 'z2_position', 'x_velocity',
       'y_velocity', 'z1_velocity', 'z2_velocity', 'x_force', 'y_force',
       'z_force']]
df = df[df["product_id"] == 304]
good_experiment_ids = [{"run_id": 0, "experiment_ids": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]},
                  {"run_id": 1, "experiment_ids": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]},
                  {"run_id": 2, "experiment_ids": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]},
                  {"run_id": 3, "experiment_ids": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
                ]
peak_dfs = []
i = 0
for item in good_experiment_ids:
    df_temp = df[df["run_id"] == item["run_id"]]
    for item2 in item["experiment_ids"]:
        df_temp2 = df_temp[df_temp["experiment_id"] == item2]
        liste_temp = df_temp2["trial_id"].unique()
        for item3 in liste_temp:
            i += 1
            #print(item["run_id"], item2, item3 )
            if labels[(labels['trial_id'] == item3) & (labels['experiment_id'] == item2) & (labels['run_id'] == item["run_id"])].shape[0] > 0:
                peak_df = df_temp2[df_temp2["trial_id"] == item3]
                peak_dfs.append(peak_df[["run_id", "trial_id", "experiment_id", 'x_position', 'x_force', 'y_force', 'z_force']])
                
print(len(peak_dfs))

filtered_peak_dfs = []

for i, item in enumerate(peak_dfs):
    filtered_df_temp = item[item['x_position'] > 20.0001].reset_index(drop=True)
    
    peak_row_temp = filtered_df_temp['x_position'].idxmax()
    
    peak_row_data_temp = filtered_df_temp.loc[:peak_row_temp-1]
    
    filtered_df_temp2 = filtered_df_temp.loc[peak_row_temp:]
    
    condition = filtered_df_temp2['x_force'] >= 0
    
    extracted_rows = filtered_df_temp2.loc[:condition.idxmax()]
        
    if (extracted_rows['x_position'] >= 99.9).all():
        filtered_peak_dfs.append(pd.concat([peak_row_data_temp, extracted_rows]))
    else:
        filtered_peak_dfs.append(peak_row_data_temp)
print(len(filtered_peak_dfs))

max_length = max(len(df) for df in filtered_peak_dfs)

padded_dataframes = []
for df in filtered_peak_dfs:
    # Pad every sequence to a fixed length of 519 rows by repeating its last row (mode='edge')
    padding_size = 519 - len(df)
    padded_df = pd.DataFrame(np.pad(df.values, ((0, padding_size), (0, 0)), mode='edge'), columns=df.columns)
    padded_df['index'] = padded_df.index
    padded_dataframes.append(padded_df)
print(len(padded_dataframes))

lengths = set()
polke_padded_dataframes_with_labels = []
for item in padded_dataframes:
    lengths.add(len(item))

    # Each padded dataframe belongs to exactly one (run, experiment, trial) triple
    run_id = item["run_id"].unique()[0]
    trial_id = item["trial_id"].unique()[0]
    experiment_id = item["experiment_id"].unique()[0]

    #print("RUN_ID:", run_id,"experiment_id:",  experiment_id,"trial_id:", trial_id)

    # Look up the label entry for this triple
    individual = labels[labels["run_id"] == run_id]
    individual = individual[individual["experiment_id"] == experiment_id]
    individual = individual[individual["trial_id"] == trial_id]

    try:
        # Keep only samples whose label is a numeric string
        row_label = individual['row'].iloc[0]
        if row_label.isnumeric():
            polke_padded_dataframes_with_labels.append((item, int(row_label)))
    except AttributeError:
        # Non-string labels have no .isnumeric(); skip these samples
        continue
print(lengths)

398
398
398
398


{519}


In [3]:
from sklearn.model_selection import train_test_split

padded_dataframes_with_labels_combined = polke_padded_dataframes_with_labels



## Exporting Processed Data

For transparency and reproducibility, the labeled dataframe is exported as a CSV file.  
Because each framework implements its pipeline differently, this step ensures that all tested frameworks operate on exactly the same data representation.  

This CSV file serves as the standardized input format for the subsequent LLM-based experiments.


In [4]:
import os
import pandas as pd

rows = []
for df, label in padded_dataframes_with_labels_combined:
    temp = df.copy()
    temp['label'] = label
    rows.append(temp)

final_df = pd.concat(rows, ignore_index=True)

# Export as CSV; create the target directory first if it does not exist yet
os.makedirs("./tmp/transformed_data", exist_ok=True)
final_df.to_csv("./tmp/transformed_data/scratchtest_transformed.csv", index=False)
print("Exported to ./tmp/transformed_data/scratchtest_transformed.csv")

Exported to ./tmp/transformed_data/scratchtest_transformed.csv


In [5]:
tensor_X = []
tensor_y = []
for item in padded_dataframes_with_labels_combined:
    #df_temp = item[0][['x_force', 'y_force', 'z_force']].copy()
    #df_temp = item[0][['x_force', 'z_force']].copy()
    df_temp = item[0][['x_force']].copy()
    a = df_temp.to_numpy().astype(np.float32)
    tensor_X.append(a)
    tensor_y.append(item[1])
print(len(tensor_X))
print(len(tensor_y))

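# Two-stage split: 80% training data, then the remaining 20% is divided into 20% validation and 80% test
# (no random_state is set, so the split changes between runs)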
X_train, X_temp, y_train, y_temp = train_test_split(np.array(tensor_X), np.array(tensor_y), test_size=0.2, shuffle=True)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.8, shuffle=True)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape, X_val.shape , y_val.shape)

255
255
(204, 519, 1) (41, 519, 1) (204,) (41,) (10, 519, 1) (10,)
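
The split above is not seeded, so each run produces a different partition. The following is a minimal sketch of the same two-stage split with a fixed seed; the value of `SEED` and the placeholder arrays are assumptions for illustration, and the results reported in this notebook were not produced with this seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42  # assumption: any fixed integer works; the original run did not set one

# Placeholder arrays with the same shapes as tensor_X / tensor_y in the cell above
X = np.zeros((255, 519, 1), dtype=np.float32)
y = np.arange(255)

# Stage 1: hold out 20% of the samples
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=SEED
)
# Stage 2: divide the held-out part into 20% validation and 80% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.8, shuffle=True, random_state=SEED
)

print(X_train.shape, X_val.shape, X_test.shape)
```

With a fixed `random_state`, repeated runs assign the same samples to each subset, which makes metric differences between frameworks easier to attribute to the frameworks rather than to the split.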


In [6]:
train = pd.DataFrame(X_train.reshape(X_train.shape[:2]))
train['label'] = y_train

test = pd.DataFrame(X_test.reshape(X_test.shape[:2]))
test['label'] = y_test 

val = pd.DataFrame(X_val.reshape(X_val.shape[:2]))
val['label'] = y_val

In [7]:
train.shape

(204, 520)

In [8]:
val.shape

(10, 520)

In [9]:
test.shape

(41, 520)

# Optimizing a range of different predictors using AutoGluon

In [21]:
%%time
from automlwrapper import AutoMLWrapper
import sedarapi

def wrapper_medium(train_data, val_data, eval_metric):

    wrapper = AutoMLWrapper('autogluon')
    wrapper.Train(
        train_data=train_data,
        validation_data=val_data,
        target_column='label',
        task_type='regression',
        data_type='tabular',
        problem_type='regression',
        hyperparameters={
            'time_limit': TIME_LIMIT,
            'preset': 'medium_quality',
            'eval_metric': eval_metric,
        },
    )

    return wrapper

CPU times: user 36 μs, sys: 0 ns, total: 36 μs
Wall time: 56.3 μs


In [22]:
%%time
w = wrapper_medium(train, val, 'mean_squared_error')

Presets specified: ['medium_quality']
Beginning AutoGluon training ... Time limit = 3600s
AutoGluon will save models to "/home/jovyan/vhermann/AutoMLOutput/autogluon1758883405.581418"
AutoGluon Version:  1.0.0
Python Version:     3.11.7
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #236-Ubuntu SMP Fri Apr 11 19:53:21 UTC 2025
CPU Count:          104
Memory Avail:       718.20 GB / 754.52 GB (95.2%)
Disk Space Avail:   1073.42 GB / 3665.44 GB (29.3%)
Train Data Rows:    204
Train Data Columns: 519
Tuning Data Rows:    10
Tuning Data Columns: 519
Label Column:       label
Problem Type:       regression
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    735438.60 MB
	Train Data (Original)  Memory Usage: 0.42 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of t

CPU times: user 10min 22s, sys: 15.2 s, total: 10min 37s
Wall time: 20.8 s


In [23]:
%%time

extra_kwargs_for_auto_gluon = {'auxiliary_metrics' : True, 'detailed_report' : True}

# The target column is 'label', as in training
re = w.Evaluate(test, target_column='label', **extra_kwargs_for_auto_gluon)
print(re)

{'mean_squared_error': -154.76163652268517, 'root_mean_squared_error': -12.44032300716847, 'mean_absolute_error': -9.286921245295828, 'r2': 0.9852065189753383, 'pearsonr': 0.9929851051167379, 'median_absolute_error': -6.12017822265625}
CPU times: user 2.34 s, sys: 61.5 ms, total: 2.4 s
Wall time: 249 ms
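
AutoGluon reports metrics in a "higher is better" form, which is why the error metrics in the output above carry a negative sign. The following is a minimal sketch of converting them to conventional positive error values, for example before comparing them with the output of another framework; the helper `to_conventional` and the set `ERROR_METRICS` are hypothetical names introduced for illustration.

```python
import math

# Values copied from the Evaluate output above
autogluon_scores = {
    "mean_squared_error": -154.76163652268517,
    "root_mean_squared_error": -12.44032300716847,
    "mean_absolute_error": -9.286921245295828,
    "r2": 0.9852065189753383,
    "pearsonr": 0.9929851051167379,
    "median_absolute_error": -6.12017822265625,
}

# Error metrics (lower is better) that appear negated in the output
ERROR_METRICS = {
    "mean_squared_error",
    "root_mean_squared_error",
    "mean_absolute_error",
    "median_absolute_error",
}


def to_conventional(scores):
    """Flip the sign of error metrics; leave score metrics such as r2 unchanged."""
    return {
        name: (-value if name in ERROR_METRICS else value)
        for name, value in scores.items()
    }


conventional = to_conventional(autogluon_scores)
for name, value in conventional.items():
    print(f"{name}: {value:.4f}")
```

As a sanity check, the square root of the converted mean squared error matches the reported root mean squared error.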


# Optimizing a range of different predictors using AutoSklearn

### Note: the kernel needs to be changed from 'automl' to 'AutoSklearn' before running this section

In [10]:
from autosklearn.regression import AutoSklearnRegressor

In [11]:
from automlwrapper import AutoMLWrapper

 No module named 'autogluon'.
 No module named 'autokeras'.


In [12]:
# Original Code
def wrapper_sk(train_data):

    wrapper = AutoMLWrapper('autosklearn')
    wrapper.Train(
        train_data=train_data,
        target_column='label',
        task_type='regression',
        data_type='tabular',
        problem_type='regression',
        hyperparameters={
            'time_limit': 3600,
            'memory_limit': 102400,
            'evaluation_metric': 'mean_absolute_error',
        },
    )

    return wrapper

In [13]:
%%time
print("AutoSklearn")
w = wrapper_sk(train)

AutoSklearn
CPU times: user 6min 34s, sys: 10.2 s, total: 6min 44s
Wall time: 59min 59s


In [14]:
%%time

ev = w.Evaluate(test, target_column='label',detailed_report=True)
print(ev)

9.39143378560136
CPU times: user 119 ms, sys: 50 µs, total: 119 ms
Wall time: 116 ms


-------------
# Evaluating the capability of GPT-4o to generate regression code

To complement the direct execution of AutoML frameworks, we additionally evaluate the capability of large language models (LLMs) to translate AssistML recommendations into executable code. 

For this purpose, we selected the top-ranked configuration suggested by the AssistML dashboard for the dataset *scratchtest_transformed.csv*. 

The dashboard output contained a structured report with model families, hyperparameters, and preprocessing steps, which we used as the basis for generating natural language prompts.

__AssistML Recommendation (Top-ranked)__

![AssistML Recommendation top-ranked](./images/assistmloutput1.png)

__AssistML Recommendation (Second-ranked)__

![AssistML Recommendation second-ranked](./images/assistmloutput2.png)

In [None]:
%env OPENAI_API_KEY=sk...

In [2]:
%load_ext jupyter_ai_magics

In [19]:
%%ai chatgpt
Create a complete Python script for a regression task using scikit-learn and dabl, utilizing the following pipeline:

Pipeline:

Preprocessing: dabl.preprocessing.EasyPreprocessor

Regressor: dabl.models.SimpleRegressor

Use the default hyperparameters as shown in the following AutoML-recommended example:

sklearn.pipeline.Pipeline(
    steps=[
        ('easypreprocessor', dabl.preprocessing.EasyPreprocessor()),
        ('simpleregressor', dabl.models.SimpleRegressor())
    ],
    verbose=False
)

Make sure the following Python dependencies are installed:
scikit-learn, dabl, pandas

Use the dataset scratchtest_transformed.csv with the following structure (first rows):

run_id,trial_id,experiment_id,x_position,x_force,y_force,z_force,index,label  
0.0,0.0,0.0,20.000118762255,-0.0035452656447887,-0.029228178784251,1.6382834911346,0,344  
0.0,0.0,0.0,20.000118762255,0.0043591051362455,-0.026274267584085,2.8769311904907,1,344  
0.0,0.0,0.0,20.00012807548,0.0059745712205768,-0.046690504997969,3.2621204853058,2,344  

The target variable is the label column. Use all other numerical columns as features.

Load the data as a Pandas DataFrame, split it into training and test sets, train the model, and output the following evaluation metrics on the test set:

Mean Absolute Error (MAE)

Relative Absolute Error (RAE)

Root Mean Squared Error (RMSE)

Root Relative Squared Error (RRSE)

Training time

Comment the most important steps in the code.

```python
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from dabl import preprocessing, models
from sklearn.pipeline import Pipeline
from time import time

# Load the dataset
data = pd.read_csv('scratchtest_transformed.csv')

# Split the data into features and target variable
X = data.drop(['label'], axis=1)
y = data['label']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the pipeline consisting of EasyPreprocessor and Simple Regressor
pipeline = Pipeline([
    ('easypreprocessor', preprocessing.EasyPreprocessor()),
    ('simpleregressor', models.SimpleRegressor())
])

# Train the model
start_time = time()
pipeline.fit(X_train, y_train)
training_time = time() - start_time

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred, squared=False)
rae = mae / y_test.mean()
rrse = rmse / y_test.var()

# Output evaluation metrics and training time
print(f"Mean Absolute Error: {mae}")
print(f"Root Mean Squared Error: {rmse}")
print(f"Relative Absolute Error: {rae}")
print(f"Root Relative Squared Error: {rrse}")
print(f"Training time: {training_time} seconds")
```
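
The generated script computes `rae` as `mae / y_test.mean()` and `rrse` as `rmse / y_test.var()`. These expressions differ from the commonly used definitions, which normalize the model's error by the error of a predictor that always outputs the mean of the true values. The following is a minimal sketch of those common definitions, assuming this is the intended meaning of RAE and RRSE; the function names and toy values are hypothetical.

```python
import numpy as np


def relative_absolute_error(y_true, y_pred):
    """RAE: sum |y - y_hat| divided by sum |y - mean(y)|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).sum() / np.abs(y_true - y_true.mean()).sum()


def root_relative_squared_error(y_true, y_pred):
    """RRSE: sqrt of sum (y - y_hat)^2 divided by sum (y - mean(y))^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    numerator = ((y_true - y_pred) ** 2).sum()
    denominator = ((y_true - y_true.mean()) ** 2).sum()
    return np.sqrt(numerator / denominator)


# Hypothetical toy values, not taken from the dataset
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

rae = relative_absolute_error(y_true, y_pred)
rrse = root_relative_squared_error(y_true, y_pred)
print(f"RAE: {rae:.4f}, RRSE: {rrse:.4f}")
```

Under these definitions, a model that always predicts the mean of the true values scores 1.0 on both metrics, and values below 1.0 indicate an improvement over that baseline.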

In [10]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from dabl import preprocessing, models
from sklearn.pipeline import Pipeline
from time import time

# Load the dataset
data = pd.read_csv('./tmp/transformed_data/scratchtest_transformed.csv')

# Split the data into features and target variable
X = data.drop(['label'], axis=1)
y = data['label']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the pipeline consisting of EasyPreprocessor and Simple Regressor
pipeline = Pipeline([
    ('easypreprocessor', preprocessing.EasyPreprocessor()),
    ('simpleregressor', models.SimpleRegressor())
])

# Train the model
start_time = time()
pipeline.fit(X_train, y_train)
training_time = time() - start_time

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred, squared=False)
rae = mae / y_test.mean()
rrse = rmse / y_test.var()

# Output evaluation metrics and training time
print(f"Mean Absolute Error: {mae}")
print(f"Root Mean Squared Error: {rmse}")
print(f"Relative Absolute Error: {rae}")
print(f"Root Relative Squared Error: {rrse}")
print(f"Training time: {training_time} seconds")

Running DummyRegressor()
r2: -0.000 neg_mean_squared_error: -10603.879
=== new best DummyRegressor() (using r2):
r2: -0.000 neg_mean_squared_error: -10603.879

Running DecisionTreeRegressor(max_depth=1)
r2: 0.104 neg_mean_squared_error: -9496.160
=== new best DecisionTreeRegressor(max_depth=1) (using r2):
r2: 0.104 neg_mean_squared_error: -9496.160

Running DecisionTreeRegressor(max_leaf_nodes=8)
r2: 0.715 neg_mean_squared_error: -3019.874
=== new best DecisionTreeRegressor(max_leaf_nodes=8) (using r2):
r2: 0.715 neg_mean_squared_error: -3019.874

Running DecisionTreeRegressor(max_leaf_nodes=16)
r2: 0.890 neg_mean_squared_error: -1163.691
=== new best DecisionTreeRegressor(max_leaf_nodes=16) (using r2):
r2: 0.890 neg_mean_squared_error: -1163.691

Running DecisionTreeRegressor(max_leaf_nodes=32)
r2: 0.959 neg_mean_squared_error: -430.453
=== new best DecisionTreeRegressor(max_leaf_nodes=32) (using r2):
r2: 0.959 neg_mean_squared_error: -430.453

Running DecisionTreeRegressor(max_depth=