
Conversation

@fkiraly (Collaborator) commented May 3, 2025

In-progress V5 API rework suggestions.

  • BaseExperiment and BaseOptimizer interfaces
  • a Jupyter notebook explaining usage - experiment, optimizer, and sklearn tuner (a condensed usage sketch follows below)
  • documented extension contracts for both, in extension_templates
  • an sklearn experiment, SklearnCvExperiment, inheriting from the BaseExperiment class for integration
  • three common optimization test functions, also inheriting from BaseExperiment
  • optimizers inheriting from BaseOptimizer
    • example for the gfo backend: HillClimbing
    • GridSearch using sklearn's ParameterGrid, mostly equivalent to the grid search logic used in sklearn (minus parallelization - for now)
  • the existing HyperactiveSearchCV is refactored to use SklearnCvExperiment internally, instead of the custom adapter from the previous stable version
  • a new OptCV for sklearn, which allows tuning using any optimizer in the hyperactive API - e.g., random search, grid search, Bayesian optimization, tree of Parzen estimators, etc.
  • a test framework skeleton using scikit-base for consistent API contracts is added, for optimizers and experiments - this is extensible with more tests
  • a registry and retrieval utility is added, currently in a private module _registry for use by the test system, but this could be made public to the user
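
To give a feel for how these pieces fit together, here is a condensed usage sketch. The class names are taken from the list above; the import paths, constructor arguments, and the name of the "execute the search" method are illustrative assumptions, not the final API.

```python
# Rough usage sketch of the API pieces listed above - import paths, argument
# names, and method names are assumptions for illustration only.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

from hyperactive.experiment.integrations import SklearnCvExperiment  # assumed path
from hyperactive.opt import HillClimbing  # assumed path

X, y = load_iris(return_X_y=True)

# an experiment encapsulates "score this parameter setting", here via CV of an SVC
experiment = SklearnCvExperiment(estimator=SVC(), X=X, y=y)

# an optimizer is configured with a search space and the experiment to optimize
optimizer = HillClimbing(
    search_space={"C": [0.01, 0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]},
    n_iter=50,
    experiment=experiment,
)
best_params = optimizer.run()  # assumed name of the "execute the search" method
```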

Note: compared to #101, I have reverted most changes to the optimizers and restored all tests that had been deleted.
Those changes could be added in a separate PR, and I would consider them mostly orthogonal.

This is to preserve backward compatibility and allow a gentler refactor - or, alternatively, a merge with #101.

@fkiraly (Collaborator, Author) commented May 5, 2025

@SimonBlanke, I think this is complete and ready now

`MyClass(**params)` or `MyClass(**params[i])` creates a valid test instance.
`create_test_instance` uses the first (or only) dictionary in `params`
"""
import numpy as np
Collaborator

This is just hard-coded as an example for now, right? Will this change in future PRs?

Collaborator Author

get_test_params provides test examples - a pure user should never come into contact with those, except if they want a quick instance for testing themselves.

The test examples typically do not change, though we might add more to improve test coverage.
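
For illustration, this is roughly how the scikit-base test machinery consumes these, with HillClimbing as a stand-in for any optimizer or experiment class:

```python
# what the test framework does with get_test_params - a user would only ever
# do this to grab a quick throwaway instance for their own experiments
params_list = HillClimbing.get_test_params()  # list of parameter dicts
instance = HillClimbing(**params_list[0])     # any of the dicts gives a valid instance

# scikit-base shorthand for the same thing, using the first (or only) dict
instance = HillClimbing.create_test_instance()
```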


def __init__(
self,
search_space=None,
@SimonBlanke (Collaborator) commented May 11, 2025

I like the first steps towards an API that makes different optimizer packages available!
But I would find it confusing as a user to be able to pass the search space either here (init) or in the add_search method. There should be one correct way to do this.
What advantages would it bring to allow this?

Collaborator Author

What advantages would it bring to allow this?

I would be happy passing everything in init, especially since set_params can be used to change some (and any) of the parameters later.

I have simply left add_search as an alternative option since I thought you needed it in your downstream APIs? So, for backward compatibility.

I would not mind removing it, since I do not know of first-principles arguments for having it in. A very weak argument is to separate different types of parameters - but since we also pass them to __init__, it is not a strong or compelling one.
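
For context, a minimal sketch of the "everything in init, adjust later via set_params" pattern - the parameter names here are illustrative, the pattern itself is the standard sklearn/scikit-base one:

```python
# all configuration is passed in __init__ ...
opt = HillClimbing(
    search_space={"C": [0.1, 1, 10]},
    n_iter=100,
    experiment=experiment,  # an experiment instance created beforehand
)

# ... and any subset of parameters can be changed later via sklearn-style set_params
opt.set_params(n_iter=500, search_space={"C": [0.01, 0.1, 1, 10, 100]})
```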

@SimonBlanke (Collaborator) commented

The Jupyter notebook fails at cell 17: "OptCV tuning via GridSearch". I guess this is not implemented yet. Is it just there to show how this could work in the future?

"paramc": 42,
"experiment": AnotherExperiment("another_experiment_params"),
}
return [paramset1, paramset2]
Collaborator

Why would a user create a custom optimizer class, and why hard-code the experiment? Maybe I do not understand the purpose of a custom optimizer. Could you provide a simple example file (a *.py file) where a custom optimizer is created by a user and then used on an experiment?

Collaborator Author

Why would a user create a custom optimizer class, and why hard-code the experiment?

Sorry if there was a misunderstanding - I think there is one. Let me know your thoughts on where to document this better.

The experiment is hard-coded only for the test - get_test_params is for testing purposes only.

In all test cases, we hard-code the experiment so we are also able to hard-code things like the search space, which will depend on the experiment.
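
A rough sketch of what such a test parameter set could look like - the class and parameter names here are illustrative placeholders, not the exact fixtures in this PR:

```python
@classmethod
def get_test_params(cls, parameter_set="default"):
    """Return testing parameter settings for the optimizer."""
    # the experiment is hard-coded so that the search space below can be
    # hard-coded to match it - both exist purely for the test framework
    from hyperactive.experiment.toy import Ackley  # assumed toy experiment

    experiment = Ackley()
    paramset1 = {
        "experiment": experiment,
        "search_space": {"x0": [-5, 0, 5], "x1": [-5, 0, 5]},
        "n_iter": 10,
    }
    paramset2 = {
        "experiment": experiment,
        "search_space": {"x0": [-1, 1], "x1": [-1, 1]},
        "n_iter": 5,
    }
    return [paramset1, paramset2]
```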

@SimonBlanke (Collaborator) left a comment

Looks intriguing so far. There are some pain points for me:

  • some hard-coded stuff that I do not understand. But maybe these are just examples for now
  • the ambiguity of passing parameters to init or to add_search.

Those points should be clarified for the next review.

Please provide runnable *.py example files (in the future). Those are much easier to review and to understand. A hill-climbing, a grid-search, and a custom-optimizer example would be great. Optionally, a non-runnable Optuna example.

@fkiraly (Collaborator, Author) commented May 11, 2025

The Jupyter notebook fails at cell 17: "OptCV tuning via GridSearch". I guess this is not implemented yet. Is it just there to show how this could work in the future?

Hm, no, it should work - I re-executed the notebook and it runs for me, locally.

Could you perhaps report your environment and the nature of the failure?

@fkiraly (Collaborator, Author) commented May 11, 2025

some hard-coded stuff that I do not understand. But maybe these are just examples for now

You mean the hard-coded example parameters in get_test_params? These parameters are used in testing only - the instances created there are the ones passed to the tests in TestAllOptimizers etc. For the tests, we need valid instances of optimizers and experiments, so we need to hard-code valid test instances somewhere.

the ambiguity of passing parameters to init or to add_search.

Yes, I am also not 100% happy with this - as said above, I included it mostly for backward compatibility. We can remove the add_search method, though it does not cost much to maintain it either.

What is your preference?

@fkiraly (Collaborator, Author) commented May 11, 2025

Please provide runnable *.py example files (in the future). Those are much easier to review and to understand.

As compared to Jupyter? I disagree with this statement, but perhaps it is a matter of taste. You can also convert Jupyter notebooks to .py via nbconvert if you prefer .py files.

Most users in the data science space imo strongly prefer example code in Jupyter notebooks rather than in .py files - therefore, irrespective of personal preference, I would suggest going with Jupyter, simply since that is the strong user-base preference (afaik).

A hill-climbing, a grid-search and a custom-optimizer example would be great.

The first two are both available in the notebook hyperactive_intro.ipynb? Could you be more concrete, if you think something is missing, as to what kind of example you would like?

@SimonBlanke (Collaborator) commented

The Jupyter notebook fails at cell 17: "OptCV tuning via GridSearch". I guess this is not implemented yet. Is it just there to show how this could work in the future?

Hm, no, it should work - I re-executed the notebook and it runs for me, locally.

Could you perhaps report your environment and the nature of the failure?

I fixed it by upgrading to the newest version of sklearn

@SimonBlanke (Collaborator) commented

the ambiguity of passing parameters to init or to add_search.

Yes, I am also not 100% happy with this - as said above, I included it mostly for backward compatibility. We can remove the add_search method, though it does not cost much to maintain it either.

What is your preference?

So we pass those parameters to init because of the sklearn dataclass-like structure, right? But if the parameters belong to a method (like add_search) "semantically", it would still be okay, right? One example we discussed was the fit method, which accepts the training data of an estimator.

Does that mean that you think the search space does belong in init, because it fits better semantically?
I think you explained this before, but I cannot remember the idea here (or I need a different explanation).

However, if we are really forced to pass all those parameters to init, then we should (maybe) remove add_search some time in the future. This method is mainly for parallel computing. Maybe we can find a better way to support this in the future.

@fkiraly (Collaborator, Author) commented May 11, 2025

I fixed it by upgrading to the newest version of sklearn

Interesting - we should try to be compatible with a wider range of versions. What exactly was the failure, and with which version?

@fkiraly (Collaborator, Author) commented May 11, 2025

So we pass those parameters to init because of the sklearn dataclass-like structure, right? But if the parameters belong to a method (like add_search) "semantically", it would still be okay, right?

I suppose it is a bit of a stretch from the API, but it would be ok in the sense that it does not introduce an incompatibility.

Does that mean that you think the search space does belong in init, because it fits better semantically?

I am not sure about this - the search space being one of the fuzzier points of the design - but init is the "if in doubt" location, for now at least.

This method is mainly for parallel computing. Maybe we find a better way to support this in the future.

Ok - I did not yet understand the parallel computing case entirely. Is it simply running multiple "runs" in parallel?

@SimonBlanke (Collaborator) commented

I fixed it by upgrading to the newest version of sklearn

Interesting - we should try to be compatible with a wider range of versions. What exactly was the failure, and with which version?

The version of sklearn was 1.5. The error looked like this:
"received ImportError: cannot import name '_deprecate_Xt_in_inverse_transform' from 'sklearn.utils.deprecation'"

Ok - I did not yet understand the parallel computing case entirely. Is it simply running multiple "runs" in parallel?

Correct, they run independently from each other, but they can share a memory dictionary if the objective functions are the same.
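
To illustrate the kind of sharing meant here, a toy sketch (not the actual gfo or hyperactive implementation): several searches over the same objective reuse one evaluation cache, so a point scored by one run is not re-evaluated by another.

```python
# toy illustration of a shared memory dictionary between parallel searches
shared_memory = {}  # maps parameter settings to already-computed scores

def cached_objective(params, objective):
    key = tuple(sorted(params.items()))
    if key in shared_memory:       # another search already evaluated this point
        return shared_memory[key]
    score = objective(params)      # the (expensive) real evaluation
    shared_memory[key] = score
    return score

# each parallel search calls cached_objective(...) independently; the runs only
# interact through shared_memory, which is why sharing only makes sense when
# the objective function is the same across all of them
```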

I am not sure about this - the search space being one of the fuzzier points of the design - but init is the "if in doubt" location, for now at least.

A search space is a general parameter for optimization. I expect all optimization packages to have this parameter.

As I see it, we have two alternatives to avoid the init/add_search parameter ambiguity:

  • remove add_search and therefore the ability to do parallel computing in the optimizer. If we want to re-enable parallel computing, we could maybe do it in a separate "scheduler" class?
  • keep add_search and pass the experiment and search_space to add_search (and n_iter?).

If we go the first way, we should at least have a concept for parallel computing before we go forward.

@fkiraly (Collaborator, Author) commented May 13, 2025

The version of sklearn was 1.5. The error looked like this:
"received ImportError: cannot import name '_deprecate_Xt_in_inverse_transform' from 'sklearn.utils.deprecation'"

Would be useful to have the full traceback. This looks like a big problem, something is using private methods that it should not. The obvious question is whether the problem is in one of our packages or somewhere external.

Correct, they run independently from each other, but they can share a memory dictionary if the objective functions are the same.

I see - how do you avoid race conditions?

I am not sure about this - the search space being one of the fuzzier points of the design - but init is the "if in doubt" location, for now at least.

A search space is a general parameter for optimization. I expect all optimization packages to have this parameter.

Yes, but the Python representations will almost certainly vary widely, and often there is a mix of search space and search configuration (e.g., distributions).

As I see it, we have two alternatives to avoid the init/add_search parameter ambiguity:

Plus the third alternative, which is the current state - via add_search we can support parallelism with data sharing, whereas __init__ supports the dataclass-like specification syntax.

If we go the first way, we should at least have a concept for parallel computing before we go forward.

If I understand correctly, this is not just parallel computing but parallelism with shared search space, right?

Can you outline the simplest case in which non-trivial data sharing happens? If the search problems are completely different, it makes no sense to do that. So what is the simplest non-trivial case?

@fkiraly (Collaborator, Author) commented May 15, 2025

We should at least have a concept for parallel computing before we go forward.

This seems like the only open point - could we move to close it?

I do not understand the use case here - can you provide at least one non-trivial example where it is not just running two instances of run in parallel, ideally with currently working code?

@SimonBlanke (Collaborator) left a comment

fix error(s)

@SimonBlanke (Collaborator) left a comment

Let's merge this! :-)

@SimonBlanke merged commit 731808f into hyperactive-project:master on May 18, 2025 - 16 checks passed.