
Use vectorization for categorical distance #5147

Conversation

nabenabe0928
Collaborator

Motivation

The current implementation of the categorical distance is not vectorized, so I vectorized it for a faster runtime.

I also reuse some intermediate results to reduce the computational overhead.

Description of the changes

The changes are to:

  1. vectorize the computation, and
  2. avoid re-computing some coefficients and distances.

Benchmarking Results

Because the time complexity of the new algorithm is O(n_trials + n_choices * n_unique_occurrences) while that of the old one is O(n_trials * n_choices), the new algorithm is faster on every instance listed in the table below.
All instances also pass `assert np.allclose(res_old, res_new)`.

```python
from __future__ import annotations

import itertools
import string
import time

import numpy as np


def old_version(
    observations: np.ndarray,
    choices: list[str],
    consider_prior: bool,
    prior_weight: float,
    dist_func,
) -> np.ndarray:
    # O(n_trials * n_choices): the distance row is recomputed for every trial.
    n_samples = observations.size + (consider_prior or observations.size == 0)
    n_choices = len(choices)
    weights = np.full(
        shape=(n_samples, n_choices),
        fill_value=prior_weight / n_samples,
    )

    for i, observation in enumerate(observations.astype(int)):
        dists = [
            dist_func(choices[observation], choices[j])
            for j in range(len(choices))
        ]
        exponent = -(
            (np.array(dists) / max(dists)) ** 2
            * np.log((len(observations) + consider_prior) / prior_weight)
            * (np.log(len(choices)) / np.log(6))
        )
        weights[i] = np.exp(exponent)

    return weights


def new_version(
    observations: np.ndarray,
    choices: list[str],
    consider_prior: bool,
    prior_weight: float,
    dist_func,
) -> np.ndarray:
    # Distance rows are computed once per *unique* observed index, then
    # broadcast back to all occurrences via the inverse index from np.unique.
    n_samples = observations.size + (consider_prior or observations.size == 0)
    n_choices = len(choices)
    weights = np.full(
        shape=(n_samples, n_choices),
        fill_value=prior_weight / n_samples,
    )

    observed_indices = observations.astype(int)
    used_indices, rev_indices = np.unique(observed_indices, return_inverse=True)
    dists = np.array([[dist_func(choices[i], c) for c in choices] for i in used_indices])
    max_dists = np.max(dists, axis=1)
    coef = np.log(n_samples / prior_weight) * np.log(n_choices) / np.log(6)
    categorical_weights = np.exp(-((dists / max_dists[:, np.newaxis]) ** 2) * coef)
    weights[: observed_indices.size] = categorical_weights[rev_indices]
    return weights


def dist_func(s1: str, s2: str) -> float:
    # Hamming distance between two equal-length strings.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


alphabets = list(string.ascii_lowercase)

for n_chars in range(1, 4):
    choices = ["".join(it) for it in itertools.product(*[alphabets] * n_chars)]
    for size in [100, 1000, 10000]:
        print(f"{n_chars=},{size=}")
        rng = np.random.RandomState(42)
        observations = rng.choice(len(choices), size=size)
        start = time.time()
        res_old = old_version(observations, choices, consider_prior=True, prior_weight=1.0, dist_func=dist_func)
        print(f"Old version took {(time.time() - start) * 1000:.3f}[ms]")
        start = time.time()
        res_new = new_version(observations, choices, consider_prior=True, prior_weight=1.0, dist_func=dist_func)
        print(f"New version took {(time.time() - start) * 1000:.3f}[ms]")
        assert np.allclose(res_old, res_new)
```
| Instance | Old [ms] | New [ms] |
|---|---:|---:|
| n_chars=1, size=100 | 2.1 | 0.4 |
| n_chars=1, size=1000 | 12.4 | 0.4 |
| n_chars=1, size=10000 | 128.6 | 1.8 |
| n_chars=2, size=100 | 22.2 | 20.3 |
| n_chars=2, size=1000 | 229.4 | 107.5 |
| n_chars=2, size=10000 | 2320.8 | 162.2 |
| n_chars=3, size=100 | 656.8 | 618.2 |
| n_chars=3, size=1000 | 6699.5 | 6193.8 |
| n_chars=3, size=10000 | 77744.1 | 56137.2 |
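
The core of the speedup is computing each distance row only once per unique observed index and scattering it back to all occurrences with the inverse index from `np.unique`. A minimal standalone sketch of that trick (my own illustration with made-up data, not code from the PR):

```python
import numpy as np

# Six observations drawn from three distinct category indices.
observed = np.array([2, 0, 2, 2, 1, 0])

# unique_vals holds the distinct values; inverse maps each observation
# back to its row in any per-unique-value table we build.
unique_vals, inverse = np.unique(observed, return_inverse=True)

# Stand-in for the expensive per-unique-index distance rows:
# computed 3 times here instead of 6.
rows = np.array([[v, v * 10] for v in unique_vals])

# Broadcast back: one row per observation, shape (6, 2).
expanded = rows[inverse]
assert np.array_equal(expanded[:, 0], observed)
```

In `new_version` above, `rows` corresponds to `dists` (one row of `dist_func` values per unique observed index) and `expanded` to `categorical_weights[rev_indices]`.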

@github-actions github-actions bot added the optuna.samplers Related to the `optuna.samplers` submodule. This is automatically labeled by github-actions. label Dec 12, 2023

codecov bot commented Dec 12, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (6079b79) 89.42% compared to head (f600cd8) 89.43%.
Report is 22 commits behind head on master.

Additional details and impacted files
```
@@           Coverage Diff           @@
##           master    #5147   +/-   ##
=======================================
  Coverage   89.42%   89.43%
=======================================
  Files         205      205
  Lines       15160    15170   +10
=======================================
+ Hits        13557    13567   +10
  Misses       1603     1603
```

@nabenabe0928
Collaborator Author

Could you review this PR?
@knshnb @contramundum53

contramundum53

This comment was marked as resolved.

consider_prior = parameters.consider_prior or len(observations) == 0
choices = search_space.choices
n_choices = len(choices)
n_samples = observations.size + (parameters.consider_prior or observations.size == 0)
Member

Nit: the name n_samples might not be appropriate, because it includes the prior. Maybe n_mixture would be better?

@contramundum53
Copy link
Member

Nit: Is there any reason you prefer observed_indices.size over len(observations)?
This is just a matter of taste, but Optuna developers seem to have previously preferred the latter.

@nabenabe0928
Copy link
Collaborator Author

> Nit: Is there any reason you prefer observed_indices.size over len(observations)? This is just a matter of taste, but Optuna developers seem to have previously preferred the latter.

The reason I use np.size rather than len is that np.size imposes a stricter constraint on the input.
However, it does not really matter in this PR, so I will revert it.

Member

@knshnb knshnb left a comment
Thanks for the great PR with benchmarking results. I added a small comment.

(Review comment on optuna/samplers/_tpe/parzen_estimator.py, since resolved)
Member

@knshnb knshnb left a comment
LGTM.

@contramundum53
Copy link
Member

> The reason I use np.size rather than len is that np.size imposes a stricter constraint on the input. However, it does not really matter in this PR, so I will revert it.

np.size actually doesn't impose any constraint on the input: it returns the total number of elements, regardless of the dimension. https://numpy.org/doc/stable/reference/generated/numpy.ndarray.size.html
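
For reference, a small sketch of where `len` and `np.ndarray.size` actually differ (my own illustration; the arrays are made up, not from the PR):

```python
import numpy as np

a = np.arange(6)
assert len(a) == 6 and a.size == 6  # identical for 1-D arrays

m = np.arange(6).reshape(2, 3)
assert len(m) == 2   # len() is the length of the first axis only
assert m.size == 6   # .size counts all elements

s = np.array(3.0)    # 0-d array
assert s.size == 1
try:
    len(s)           # len() of a 0-d array raises TypeError
except TypeError:
    pass
```

For the 1-D `observations` array in this PR, the two are interchangeable, which is why the revert is harmless.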

Member

@contramundum53 contramundum53 left a comment
LGTM!

Contributor

github-actions bot commented Jan 3, 2024

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Jan 3, 2024
@not522 not522 merged commit 2156f54 into optuna:master Jan 9, 2024
22 of 23 checks passed
@not522 not522 added the enhancement Change that does not break compatibility and not affect public interfaces, but improves performance. label Jan 9, 2024
@not522 not522 added this to the v3.6.0 milestone Jan 9, 2024
@nabenabe0928 nabenabe0928 deleted the code-fix/use-vectorization-for-categorical-distance branch February 19, 2024 04:07
@nabenabe0928 nabenabe0928 removed the stale Exempt from stale bot labeling. label Apr 15, 2024