
Use vectorization for categorical distance #5147

Conversation

nabenabe0928
Collaborator

Motivation

The current implementation of the categorical distance is not vectorized, so I vectorized it for a faster runtime.

I also reuse some intermediate results to reduce the computational overhead.

Description of the changes

The changes are to:

  1. vectorize the computation, and
  2. avoid re-computing some coefficients and distances.

Benchmarking Results

Because the time complexity of the new algorithm is O(n_trials + n_choices * n_unique_occurrences) while that of the old one is O(n_trials * n_choices), the new algorithm is faster on every instance listed in the table below.
All instances also pass `assert np.allclose(res_old, res_new)`.

```python
from __future__ import annotations

import itertools
import string
import time

import numpy as np


def old_version(
    observations: np.ndarray,
    choices: list[str],
    consider_prior: bool,
    prior_weight: float,
    dist_func,
) -> np.ndarray:
    # O(n_trials * n_choices): the distance row is recomputed for every trial.
    n_samples = observations.size + (consider_prior or observations.size == 0)
    n_choices = len(choices)
    weights = np.full(
        shape=(n_samples, n_choices),
        fill_value=prior_weight / n_samples,
    )

    for i, observation in enumerate(observations.astype(int)):
        dists = [
            dist_func(choices[observation], choices[j])
            for j in range(len(choices))
        ]
        exponent = -(
            (np.array(dists) / max(dists)) ** 2
            * np.log((len(observations) + consider_prior) / prior_weight)
            * (np.log(len(choices)) / np.log(6))
        )
        weights[i] = np.exp(exponent)

    return weights


def new_version(
    observations: np.ndarray,
    choices: list[str],
    consider_prior: bool,
    prior_weight: float,
    dist_func,
) -> np.ndarray:
    # Distance rows are computed once per *unique* observed index, then
    # broadcast back to all occurrences via the inverse index from np.unique.
    n_samples = observations.size + (consider_prior or observations.size == 0)
    n_choices = len(choices)
    weights = np.full(
        shape=(n_samples, n_choices),
        fill_value=prior_weight / n_samples,
    )

    observed_indices = observations.astype(int)
    used_indices, rev_indices = np.unique(observed_indices, return_inverse=True)
    dists = np.array([[dist_func(choices[i], c) for c in choices] for i in used_indices])
    max_dists = np.max(dists, axis=1)
    coef = np.log(n_samples / prior_weight) * np.log(n_choices) / np.log(6)
    categorical_weights = np.exp(-((dists / max_dists[:, np.newaxis]) ** 2) * coef)
    weights[: observed_indices.size] = categorical_weights[rev_indices]
    return weights


def dist_func(s1: str, s2: str) -> float:
    # Hamming distance between two equal-length strings.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


alphabets = list(string.ascii_lowercase)

for n_chars in range(1, 4):
    choices = ["".join(it) for it in itertools.product(*[alphabets] * n_chars)]
    for size in [100, 1000, 10000]:
        print(f"{n_chars=},{size=}")
        rng = np.random.RandomState(42)
        observations = rng.choice(len(choices), size=size)
        start = time.time()
        res_old = old_version(observations, choices, consider_prior=True, prior_weight=1.0, dist_func=dist_func)
        print(f"Old version took {(time.time() - start) * 1000:.3f}[ms]")
        start = time.time()
        res_new = new_version(observations, choices, consider_prior=True, prior_weight=1.0, dist_func=dist_func)
        print(f"New version took {(time.time() - start) * 1000:.3f}[ms]")
        assert np.allclose(res_old, res_new)
```
| Instance | Old [ms] | New [ms] |
|---|---:|---:|
| n_chars=1, size=100 | 2.1 | 0.4 |
| n_chars=1, size=1000 | 12.4 | 0.4 |
| n_chars=1, size=10000 | 128.6 | 1.8 |
| n_chars=2, size=100 | 22.2 | 20.3 |
| n_chars=2, size=1000 | 229.4 | 107.5 |
| n_chars=2, size=10000 | 2320.8 | 162.2 |
| n_chars=3, size=100 | 656.8 | 618.2 |
| n_chars=3, size=1000 | 6699.5 | 6193.8 |
| n_chars=3, size=10000 | 77744.1 | 56137.2 |
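
The core of the speedup is computing each distance row only once per unique observed index and scattering it back to all occurrences with the inverse index from `np.unique`. A minimal standalone sketch of that trick (my own illustration with made-up data, not code from the PR):

```python
import numpy as np

# Six observations drawn from three distinct category indices.
observed = np.array([2, 0, 2, 2, 1, 0])

# unique_vals holds the distinct values; inverse maps each observation
# back to its row in any per-unique-value table we build.
unique_vals, inverse = np.unique(observed, return_inverse=True)

# Stand-in for the expensive per-unique-index distance rows:
# computed 3 times here instead of 6.
rows = np.array([[v, v * 10] for v in unique_vals])

# Broadcast back: one row per observation, shape (6, 2).
expanded = rows[inverse]
assert np.array_equal(expanded[:, 0], observed)
```

In `new_version` above, `rows` corresponds to `dists` (one row of `dist_func` values per unique observed index) and `expanded` to `categorical_weights[rev_indices]`.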

@github-actions github-actions bot added the optuna.samplers Related to the `optuna.samplers` submodule. This is automatically labeled by github-actions. label Dec 12, 2023

codecov bot commented Dec 12, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (6079b79) 89.42% compared to head (f600cd8) 89.43%.
Report is 22 commits behind head on master.

Additional details and impacted files
```
@@           Coverage Diff           @@
##           master    #5147   +/-   ##
=======================================
  Coverage   89.42%   89.43%
=======================================
  Files         205      205
  Lines       15160    15170   +10
=======================================
+ Hits        13557    13567   +10
  Misses       1603     1603
```

@nabenabe0928
Collaborator Author

Could you review this PR?
@knshnb @contramundum53

contramundum53

This comment was marked as resolved.

consider_prior = parameters.consider_prior or len(observations) == 0
choices = search_space.choices
n_choices = len(choices)
n_samples = observations.size + (parameters.consider_prior or observations.size == 0)
Member

Nit: the name n_samples might not be appropriate, because it includes the prior. Maybe n_mixture would be better?

@contramundum53
Copy link
Member

Nit: Is there any reason you prefer observed_indices.size over len(observations)?
This is just a matter of taste, but Optuna developers seem to have previously preferred the latter.

@nabenabe0928
Copy link
Collaborator Author

> Nit: Is there any reason you prefer observed_indices.size over len(observations)? This is just a matter of taste, but Optuna developers seem to have previously preferred the latter.

The reason I use np.size rather than len is that np.size imposes a stricter constraint on the input.
However, it does not really matter in this PR, so I will revert it.

Member

@knshnb knshnb left a comment
Thanks for the great PR with benchmarking results. I added a small comment.

(Review comment on optuna/samplers/_tpe/parzen_estimator.py, since resolved)
Member

@knshnb knshnb left a comment
LGTM.

@contramundum53
Copy link
Member

> The reason I use np.size rather than len is that np.size imposes a stricter constraint on the input. However, it does not really matter in this PR, so I will revert it.

np.size actually doesn't impose any constraint on the input: it returns the total number of elements, regardless of the dimension. https://numpy.org/doc/stable/reference/generated/numpy.ndarray.size.html
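
For reference, a small sketch of where `len` and `np.ndarray.size` actually differ (my own illustration; the arrays are made up, not from the PR):

```python
import numpy as np

a = np.arange(6)
assert len(a) == 6 and a.size == 6  # identical for 1-D arrays

m = np.arange(6).reshape(2, 3)
assert len(m) == 2   # len() is the length of the first axis only
assert m.size == 6   # .size counts all elements

s = np.array(3.0)    # 0-d array
assert s.size == 1
try:
    len(s)           # len() of a 0-d array raises TypeError
except TypeError:
    pass
```

For the 1-D `observations` array in this PR, the two are interchangeable, which is why the revert is harmless.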

Member

@contramundum53 contramundum53 left a comment
LGTM!

Contributor

github-actions bot commented Jan 3, 2024

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Jan 3, 2024
@not522 not522 merged commit 2156f54 into optuna:master Jan 9, 2024
22 of 23 checks passed
@not522 not522 added the enhancement Change that does not break compatibility and not affect public interfaces, but improves performance. label Jan 9, 2024
@not522 not522 added this to the v3.6.0 milestone Jan 9, 2024
@nabenabe0928 nabenabe0928 deleted the code-fix/use-vectorization-for-categorical-distance branch February 19, 2024 04:07
@nabenabe0928 nabenabe0928 removed the stale Exempt from stale bot labeling. label Apr 15, 2024