**What is TabBench?**
*TabBench* is a benchmark suite for tabular data focused on real-world business use cases like product categorization, deduplication, and pricing. Unlike academic benchmarks, it evaluates models on industrial datasets from sectors such as retail, banking, and insurance. Built on top of [Neuralk Foundry-CE](https://github.com/Neuralk-AI/NeuralkFoundry-CE), TabBench structures each task as a modular workflow, making it easy to test and compare different approaches. It’s designed to help identify the best models for practical, industry-driven challenges.

# Using a Custom Model

This tutorial demonstrates how to integrate a new model into the Neuralk benchmark and use it as part of a workflow.

We begin by explaining how Neuralk Foundry handles models and how to define a custom one. In this example, we wrap the `HistGradientBoostingClassifier` from scikit-learn. This allows the model to be seamlessly inserted into workflows alongside other components.

## Creating a Custom Model for Neuralk Foundry

To integrate a new model into a Neuralk workflow, you define a **model** by extending the `BaseModel` class. This component encapsulates how the model is initialized, trained, and used for inference, while ensuring compatibility with Neuralk's workflow system.

Below is a template for implementing a custom classifier. It defines the core methods required to:

* **Initialize** the model from a config (`init_model`)
* **Train** the model using input features, labels, and fold definitions (`train`)
* **Perform inference** on unseen data (`forward`)
* **Set hyperparameters that stay fixed** during automated hyperparameter optimization (`get_fixed_params`)
* **Expose tunable hyperparameters** for automated search (`get_model_params`)

This design allows for flexible integration of any framework—whether scikit-learn, PyTorch, XGBoost, or others—within a unified workflow interface.

Note the use of the `with_masked_split` decorator. It lets `train` and `forward` be written against a scikit-learn-style interface (`X`, `y`). In contrast, our native interface avoids splitting the `X` DataFrame into multiple copies, both for memory efficiency and to prevent data leakage.
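
As a rough illustration of the idea, the sketch below shows how a decorator could let a method written against a scikit-learn-style signature receive only the rows selected by a boolean mask, while the caller passes the full table once. This is not the Neuralk Foundry implementation: `masked_rows`, `MeanModel`, and the `(X, mask, y)` calling convention are hypothetical names chosen for this example, and the real `with_masked_split` may work differently.

```python
import functools

import numpy as np


def masked_rows(method):
    """Hypothetical decorator: keep only the masked rows, then call `method`."""

    @functools.wraps(method)
    def wrapper(self, X, mask, y=None):
        X_sel = X[mask]  # keep only the rows flagged by the boolean mask
        if y is None:
            return method(self, X_sel)
        return method(self, X_sel, y[mask])

    return wrapper


class MeanModel:
    """Toy model (assumption for illustration): predicts the mean training label."""

    @masked_rows
    def train(self, X, y):
        self.n_train_ = len(X)
        self.mean_ = float(np.mean(y))

    @masked_rows
    def forward(self, X):
        return np.full(len(X), self.mean_)


X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])
train_mask = np.array([True, True, False, False])

model = MeanModel()
model.train(X, train_mask, y)          # sees only the first two rows
preds = model.forward(X, ~train_mask)  # predicts for the last two rows
```

The model code itself never handles masks, which is what makes it look like an ordinary scikit-learn `fit`/`predict` pair.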

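```python
from neuralk_foundry_ce.models import ClassifierModel
from neuralk_foundry_ce.utils.splitting import Split, with_masked_split


class MyClassifier(ClassifierModel):
    # Name under which the model is registered and referenced in workflows
    name = "my-classifier"

    def __init__(self):
        super().__init__()

    def init_model(self, config):
        # Build the underlying estimator from the hyperparameter config
        self.config = config
        self.model = ...

    @with_masked_split
    def train(self, X, y):
        # Fit the model on the training rows
        ...

    @with_masked_split
    def forward(self, X):
        # Return predictions for the given rows
        ...

    def get_fixed_params(self, tags):
        # Hyperparameters that are not tuned
        return {}

    def get_model_params(self, trial, tags):
        # Hyperparameters sampled during automated search
        return {}
```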

## Implementing the model wrapper

We begin by defining a custom model class. In this example, we use the `HistGradientBoostingClassifier` from scikit-learn, wrapped to be compatible with Neuralk Foundry’s workflow system. This model is well-suited for tabular data and supports both numerical and categorical features natively.

### Key Hyperparameters

| Hyperparameter         | Description                                                                           | Default  | Recommended Range         |
| ---------------------- | ------------------------------------------------------------------------------------- | -------- | ------------------------- |
| `learning_rate`        | Shrinks the contribution of each tree. Lower values improve generalization.           | `0.1`    | `0.01` to `0.2`           |
| `max_iter`             | Number of boosting iterations (trees).                                                | `100`    | `100` to `500`            |
| `max_leaf_nodes`       | Maximum number of leaf nodes in each tree. Controls tree complexity.                  | `31`     | `16` to `256`             |
| `max_depth`            | Maximum depth of each tree. Applied together with `max_leaf_nodes`.                   | `None`   | `3` to `10` (if used)     |
| `min_samples_leaf`     | Minimum number of samples in each leaf. Prevents overfitting.                         | `20`     | `5` to `100`              |
| `l2_regularization`    | L2 penalty applied to leaf values to prevent overfitting.                             | `0.0`    | `0.0` to `1.0`            |
| `early_stopping`       | Whether to enable early stopping; `'auto'` enables it when n_samples > 10000.         | `'auto'` | `True`                    |
| `scoring`              | Metric to use for early stopping (e.g., `'loss'`, `'accuracy'`).                      | `'loss'` | Task-dependent            |
| `validation_fraction`  | Fraction of training data to set aside for validation when early stopping is enabled. | `0.1`    | `0.1` to `0.3`            |
| `n_iter_no_change`     | Number of iterations with no improvement before stopping early.                       | `10`     | `10` to `50`              |
| `random_state`         | Controls reproducibility.                                                             | `None`   | Any integer               |
| `categorical_features` | Categorical features: boolean mask, indices, names, or `"from_dtype"`.                | `"from_dtype"` (scikit-learn 1.6+) | `"from_dtype"` or explicit list |
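
To make the search ranges concrete, the sketch below shows how a trial object with Optuna-style `suggest_float` and `suggest_int` methods can turn ranges into a single configuration dictionary. `MidpointTrial` is a hypothetical stand-in written for this example (it always returns the midpoint of the range instead of sampling), and merging the fixed and tuned dictionaries with `{**a, **b}` is an assumption about how the resulting config might be assembled, not a description of Neuralk Foundry internals.

```python
import math


class MidpointTrial:
    """Hypothetical trial stub: returns the midpoint of each range."""

    def suggest_float(self, name, low, high, log=False):
        if log:
            # Midpoint on a log scale is the geometric mean of the bounds
            return math.exp((math.log(low) + math.log(high)) / 2)
        return (low + high) / 2

    def suggest_int(self, name, low, high):
        return (low + high) // 2


def fixed_params():
    # Values kept constant across all trials
    return {"early_stopping": True, "scoring": "loss", "random_state": 42}


def tuned_params(trial):
    # Values proposed by the search for one trial
    return {
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "max_iter": trial.suggest_int("max_iter", 100, 500),
        "max_leaf_nodes": trial.suggest_int("max_leaf_nodes", 16, 256),
        "l2_regularization": trial.suggest_float("l2_regularization", 0.0, 1.0),
    }


# One complete configuration, ready to pass to an estimator constructor
config = {**fixed_params(), **tuned_params(MidpointTrial())}
```

Searching `learning_rate` on a log scale gives small values such as `0.01` to `0.02` as much attention as `0.1` to `0.2`, which is usually what one wants for a multiplicative parameter.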

As with datasets, inheriting from `BaseTaskHead` automatically adds the class to our registry. We use this to check that the integration succeeded.

```python
from neuralk_foundry_ce.models import ClassifierModel
from neuralk_foundry_ce.utils.splitting import with_masked_split
from neuralk_foundry_ce.workflow import get_step_class
from sklearn.ensemble import HistGradientBoostingClassifier


class HistGradientClassifier(ClassifierModel):
    # Name used to look the model up in the registry
    name = "hist-gb-classifier"

    def __init__(self):
        super().__init__()

    def init_model(self, config):
        # config holds the fixed and tuned hyperparameters
        self.config = config
        self.model = HistGradientBoostingClassifier(**config)

    @with_masked_split
    def train(self, X, y):
        self.model.fit(X, y)

    @with_masked_split
    def forward(self, X):
        # Store class probabilities alongside the hard predictions
        self.extras['y_score'] = self.model.predict_proba(X)
        return self.model.predict(X)

    def get_fixed_params(self, inputs):
        # Hyperparameters kept constant during the search
        return {
            "early_stopping": True,
            "scoring": "loss",
            "validation_fraction": 0.1,
            "n_iter_no_change": 20,
            "random_state": 42,
        }

    def get_model_params(self, trial, inputs):
        # Hyperparameters explored by the automated search
        return {
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
            "max_iter": trial.suggest_int("max_iter", 100, 500),
            "max_leaf_nodes": trial.suggest_int("max_leaf_nodes", 16, 256),
            "min_samples_leaf": trial.suggest_int("min_samples_leaf", 5, 100),
            "l2_regularization": trial.suggest_float("l2_regularization", 0.0, 1.0),
        }


# Check that the class was added to the registry on definition
try:
    get_step_class("hist-gb-classifier")
    print('Model registered successfully!')
except ValueError:
    print('Failed to register model.')
```

Model registered successfully!


If you read about the structure of our benchmark in this [notebook](./1%20-%20Getting%20Started%20with%20TabBench.ipynb), you will recall that any class implemented in the Neuralk benchmark is automatically registered and made available to all workflows. This is why, right after defining our class, we check that it is registered.
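
For readers curious how this kind of automatic registration can work in Python, here is a minimal sketch based on `__init_subclass__`. It is an assumption-laden illustration, not the Neuralk Foundry code: `Step`, `get_step`, and `ToyClassifier` are hypothetical names, and the real registry may be implemented differently.

```python
class Step:
    """Hypothetical base class that records every named subclass."""

    registry = {}
    name = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Register only classes that declare their own `name`
        if cls.__dict__.get("name"):
            Step.registry[cls.name] = cls


def get_step(name):
    """Look up a registered class, raising ValueError for unknown names."""
    try:
        return Step.registry[name]
    except KeyError:
        raise ValueError(f"Unknown step: {name!r}") from None


class ToyClassifier(Step):
    name = "toy-classifier"


# Defining the class was enough; no explicit registration call is needed
found = get_step("toy-classifier")

try:
    get_step("does-not-exist")
    missing_raises = False
except ValueError:
    missing_raises = True
```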

# Use it in a workflow
Your model is ready to go: you can now use it in any Neuralk benchmark workflow.

```python
from tabbench.workflow.use_cases import Classification


use_case = Classification('best_buy_simple_categ')
use_case.set_classifier(HistGradientClassifier())
use_case.notebook_display()
```

```python
data, metrics = use_case.run()
print(f'Final test ROC AUC {metrics["hist-gb-classifier"]["test_roc_auc"]}')
```

Final test ROC AUC 0.9740347536896564
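
ROC AUC is computed from scores rather than hard labels, which is presumably why `forward` stores the `predict_proba` output in `extras['y_score']`. As a reminder of what the metric measures, the sketch below computes it for a tiny binary example as the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The function name and the data are made up for illustration; how the benchmark aggregates the metric for multi-class tasks is not shown here.

```python
def roc_auc_binary(y_true, y_score):
    """Pairwise ROC AUC for binary labels (1 = positive, 0 = negative)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present.")
    # A correctly ordered pair counts 1, a tie counts 0.5
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_binary(y_true, y_score)  # 3 of 4 pairs ordered correctly -> 0.75
```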


## Conclusion

Neuralk Foundry currently supports three types of dataset sources: **local files**, **downloadable resources**, and **OpenML datasets**. These options cover most common use cases in both research and industry.

If you wish to support an additional data source or contribute new tasks, contributions are welcome. Feel free to open a pull request!

Check the other tutorials:

* [1 - Getting Started with TabBench.ipynb](./1%20-%20Getting%20Started%20with%20TabBench.ipynb)

* [2 - Adding a local or internet dataset.ipynb](./2%20-%20Adding%20a%20local%20or%20internet%20dataset.ipynb)