
feat - add index study_id column on trials table #4449

Merged
14 commits merged into optuna:master from feat/add-index-to-trials-study-id on Mar 24, 2023

Conversation

Ilevk
Contributor

@Ilevk Ilevk commented Feb 21, 2023

Motivation

I have experienced a bottleneck when using RDBStorage with more than 100 sessions accessing it simultaneously.
Contributes to #4444

Description of the changes

  • Add an index on the study_id column of the trials table (a minimal sketch follows).
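
For context, here is a minimal sketch of how such an index can be declared on a SQLAlchemy model. It is a simplified stand-in for illustration only, not the exact code in optuna/storages/_rdb/models.py (the column set is trimmed):

from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class TrialModel(Base):
    __tablename__ = "trials"

    trial_id = Column(Integer, primary_key=True)
    # index=True makes SQLAlchemy emit
    # CREATE INDEX ix_trials_study_id ON trials (study_id)
    # so lookups filtered by study_id no longer require a full table scan.
    study_id = Column(Integer, ForeignKey("studies.study_id"), index=True)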

@github-actions github-actions bot added the optuna.storages label (Related to the `optuna.storages` submodule. This is automatically labeled by github-actions.) Feb 21, 2023
@c-bata
Member

c-bata commented Feb 21, 2023

Thank you for your pull request!
I have some questions and suggestions.

  • Could you add a schema migration file using alembic? See wiki for details. (A minimal sketch follows this list.)
  • I guess a composite index of study_id and state columns might be more efficient. Is it possible to benchmark in your environment and share the result with us?
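
A minimal sketch of what such an Alembic migration could look like; the revision identifiers below are placeholders, and the index name assumes SQLAlchemy's default ix_trials_study_id naming convention:

"""Add an index on trials.study_id (illustrative sketch)."""
from alembic import op

revision = "placeholder_revision_id"
down_revision = "placeholder_previous_revision_id"


def upgrade():
    # A composite variant would pass ["study_id", "state"] instead.
    op.create_index("ix_trials_study_id", "trials", ["study_id"])


def downgrade():
    op.drop_index("ix_trials_study_id", table_name="trials")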

@c-bata c-bata self-assigned this Feb 21, 2023
@Ilevk
Contributor Author

Ilevk commented Feb 22, 2023

Okay, I'll try it and share it with you.

@Ilevk
Contributor Author

Ilevk commented Feb 25, 2023

I tried a composite index (study_id, state) for 3 days and didn't see any dramatic performance change.
The earlier range in the charts below corresponds to the single-column index on study_id.

[Screenshot: 2023-02-25 1:45:33 PM]

[Screenshot: 2023-02-25 1:45:21 PM]

@Ilevk Ilevk force-pushed the feat/add-index-to-trials-study-id branch from 7d64fa7 to 3660246 Compare February 25, 2023 07:06
@codecov-commenter

codecov-commenter commented Feb 25, 2023

Codecov Report

Merging #4449 (352a88d) into master (ac169ea) will increase coverage by 0.65%.
The diff coverage is 90.00%.


@@            Coverage Diff             @@
##           master    #4449      +/-   ##
==========================================
+ Coverage   89.68%   90.33%   +0.65%     
==========================================
  Files         178      184       +6     
  Lines       13974    14099     +125     
==========================================
+ Hits        12532    12736     +204     
+ Misses       1442     1363      -79     
Impacted Files Coverage Δ
optuna/storages/_rdb/alembic/versions/v3.2.0.a_.py 88.88% <88.88%> (ø)
optuna/storages/_rdb/models.py 97.69% <100.00%> (+<0.01%) ⬆️

... and 24 files with indirect coverage changes


@Ilevk
Contributor Author

Ilevk commented Feb 28, 2023

If I were to run a large number of experiments per HPO, I would expect the (study_id, state) composite index to be more efficient.
In my case, I run 10 experiments per HPO, so it seems to make only a small difference in performance.
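
For reference, a minimal timing sketch from the Optuna side; the connection URL, objective, and trial count are hypothetical, and the same script would be run against a database with each index variant to compare:

import time

import optuna

storage = optuna.storages.RDBStorage("postgresql://user:pass@localhost/optuna")
study = optuna.create_study(storage=storage)
# A toy objective just to populate the trials table.
study.optimize(lambda t: t.suggest_float("x", -10, 10) ** 2, n_trials=100)

start = time.perf_counter()
trials = study.get_trials(deepcopy=False)
print(f"Fetched {len(trials)} trials in {time.perf_counter() - start:.4f}s")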

@github-actions
Contributor

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale label (Exempt from stale bot labeling.) Mar 12, 2023
Member

@c-bata c-bata left a comment


Sorry for the late review. I'll review this PR today. Let me leave some early feedback for now.

optuna/storages/_rdb/alembic/versions/v3.1.0.a_.py — outdated review comment (resolved)
@c-bata c-bata removed the stale label (Exempt from stale bot labeling.) Mar 13, 2023
Member

@c-bata c-bata left a comment


Thank you for the quick update! I left a minor suggestion.

Regarding the migration script, it looks good to me. The backward migration does not work in MySQL, but that is acceptable since Optuna does not actually provide a backward-migration API to users.

I am checking for performance gains with this change using the following script.
https://gist.github.com/c-bata/c08fb89a583adbcdc3eddcf8cf192c1a

I will verify the performance on PostgreSQL tomorrow with more study and trial records.

optuna/storages/_rdb/alembic/versions/v3.2.0.a_.py — outdated review comment (resolved)
@c-bata
Member

c-bata commented Mar 14, 2023

I have confirmed with PostgreSQL that this change improves performance. I will approve this PR after my suggestion is reflected 👍

Benchmarking on PostgreSQL

Here is a benchmark script and its result.

Benchmark script:

https://gist.github.com/c-bata/98532a60609a8a5f9e1e4dd162d45886

Before (master):

$ python profile_get_all_trials.py
Elapsed: 24.4298s (n_trials=500 n_params=10)

Sort by Total:
Total Time(s)	Query Count	Statement
0.5289	100	SELECT trial_params.trial_id AS trial_params_trial_id, trial_params.param_id AS trial_params_param_id, trial_params.param_name AS trial_params_param_name, trial_params.param_value AS trial_params_param_value, trial_params.distribution_json AS trial_params_distribution_json
FROM trial_params
WHERE trial_params.trial_id IN (%(primary_keys_1)s, %(primary_keys_2)s, %(primary_keys_3)s, %(primary_keys_4)s, %(primary_keys_5)s, ...)
0.2383	100	SELECT trials.trial_id AS trials_trial_id
FROM trials
WHERE trials.study_id = %(study_id_1)s

After (This PR):

# python profile_all_trials.py
Elapsed: 21.5016s (n_trials=500 n_params=10)

Sort by Total:
Total Time(s)	Query Count	Statement
0.4890	100	SELECT trial_params.trial_id AS trial_params_trial_id, trial_params.param_id AS trial_params_param_id, trial_params.param_name AS trial_params_param_name, trial_params.param_value AS trial_params_param_value, trial_params.distribution_json AS trial_params_distribution_json
FROM trial_params
WHERE trial_params.trial_id IN (%(primary_keys_1)s, %(primary_keys_2)s, %(primary_keys_3)s, %(primary_keys_4)s, ...)
0.2103	100	SELECT trials.trial_id AS trials_trial_id, trials.number AS trials_number, trials.study_id AS trials_study_id, trials.state AS trials_state, trials.datetime_start AS trials_datetime_start, trials.datetime_complete AS trials_datetime_complete
FROM trials
WHERE trials.trial_id IN (%(trial_id_1_1)s, %(trial_id_1_2)s, %(trial_id_1_3)s, %(trial_id_1_4)s, ...) AND trials.study_id = %(study_id_1)s ORDER BY trials.trial_id

According to the slow queries, a composite index of study_id and state does not improve performance, since state is not included in the query.
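
One way to double-check this is to inspect the query plan for the study_id filter. A minimal sketch using SQLAlchemy, with a hypothetical connection URL:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/optuna")
with engine.connect() as conn:
    rows = conn.execute(
        text("EXPLAIN SELECT trial_id FROM trials WHERE study_id = :sid"),
        {"sid": 1},
    )
    for row in rows:
        # Expect an index scan on ix_trials_study_id; since state never appears
        # in this query, a (study_id, state) composite index would not be used
        # any differently here.
        print(row[0])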

optuna-e2e scripts

For the second reviewer, let me share the optuna-e2e branch I used to check the migration script.
https://github.com/c-bata/optuna-e2e/tree/optuna-4449

$ docker compose up -d --build
$ docker compose run --rm optuna-300 python src/init.py
$ docker compose run --rm optuna-dev bash src/upgrade.sh
mysql
[I 2023-03-13 12:16:15,177] Upgrading the storage schema to the latest version.
[I 2023-03-13 12:16:15,417] Completed to upgrade the storage.
postgresql
[I 2023-03-13 12:16:16,268] Upgrading the storage schema to the latest version.
[I 2023-03-13 12:16:16,500] Completed to upgrade the storage.
sqlite
[I 2023-03-13 12:16:17,261] Upgrading the storage schema to the latest version.
[I 2023-03-13 12:16:17,581] Completed to upgrade the storage.

@c-bata c-bata added the enhancement label (Change that does not break compatibility and not affect public interfaces, but improves performance.) Mar 14, 2023
@toshihikoyanase
Member

@Alnusjaponica Could you review this PR, please?
Let me walk you through how to confirm the schema migration with the optuna-e2e tool.

Co-authored-by: Masashi Shibata <c-bata@users.noreply.github.com>
@Ilevk Ilevk force-pushed the feat/add-index-to-trials-study-id branch 2 times, most recently from 3385564 to f551a89 Compare March 14, 2023 14:06
@Ilevk Ilevk force-pushed the feat/add-index-to-trials-study-id branch from f551a89 to 352a88d Compare March 14, 2023 14:12
@Alnusjaponica
Collaborator

Alnusjaponica commented Mar 17, 2023

@Ilevk Thank you for your contribution.
I also confirmed that the migration code works properly and the change itself looks good to me.

As we discussed offline, @c-bata found that this change might not have a large effect on performance (he'll share the data presently) and wonders what made it improve performance so drastically in your environment. Could you give us some information about the kind of jobs you're running, or any code to reproduce the issue?

@Ilevk
Contributor Author

Ilevk commented Mar 17, 2023

@Alnusjaponica We train about 5,000 models every day and run 5-7 HPOs per model, across about 100 training instances. If you have many connections happening at the same time, you will likely run into the same problem we did.

@c-bata
Member

c-bata commented Mar 17, 2023

@Ilevk Thank you for your swift response. Could you also share which sampler you used and the number of trials per study?

@Ilevk
Contributor Author

Ilevk commented Mar 17, 2023

@c-bata We use TPESampler with 5-7 trials per study, most often 5.

@Ilevk
Contributor Author

Ilevk commented Mar 17, 2023

There is one more bottleneck in our environment. Currently, RDBStorage creates engines internally with create_engine, and when many instances access the storage at the same time, previously created engines and connections are not cleaned up, which delays the next experiment.

Member

@c-bata c-bata left a comment


LGTM! I could see a clear performance improvement in the following scenario.

https://gist.github.com/c-bata/98532a60609a8a5f9e1e4dd162d45886

Before

optuna=# \d trials;
                                               Table "public.trials"
      Column       |            Type             | Collation | Nullable |                 Default
-------------------+-----------------------------+-----------+----------+------------------------------------------
 trial_id          | integer                     |           | not null | nextval('trials_trial_id_seq'::regclass)
 number            | integer                     |           |          |
 study_id          | integer                     |           |          |
 state             | trialstate                  |           | not null |
 datetime_start    | timestamp without time zone |           |          |
 datetime_complete | timestamp without time zone |           |          |
Indexes:
    "trials_pkey" PRIMARY KEY, btree (trial_id)
Foreign-key constraints:
    "trials_study_id_fkey" FOREIGN KEY (study_id) REFERENCES studies(study_id)
Referenced by:
    TABLE "trial_heartbeats" CONSTRAINT "trial_heartbeats_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_intermediate_values" CONSTRAINT "trial_intermediate_values_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_params" CONSTRAINT "trial_params_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_system_attributes" CONSTRAINT "trial_system_attributes_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_user_attributes" CONSTRAINT "trial_user_attributes_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_values" CONSTRAINT "trial_values_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
# python profiler.py
Elapsed: 157.2042s (n_trials=500 n_params=10)

Sort by Total:
Total Time(s)	Query Count	Statement
43.5951	11000	SELECT trials.trial_id AS trials_trial_id
FROM trials
WHERE trials.study_id = %(study_id_1)s
40.9885	10000	SELECT trial_params.param_id AS trial_params_param_id, trial_params.trial_id AS trial_params_trial_id, trial_params.param_name AS trial_params_param_name, trial_params.param_value AS trial_params_param_value, trial_params.distribution_json AS trial_params_distribution_json
FROM trial_params JOIN trials ON trials.trial_id = trial_params.trial_id
WHERE trials.study_id = %(study_id_1)s AND trial_params.param_name = %(param_name_1)s
 LIMIT %(param_1)s
4.7336	21000	SELECT trials.trial_id AS trials_trial_id, trials.number AS trials_number, trials.study_id AS trials_study_id, trials.state AS trials_state, trials.datetime_start AS trials_datetime_start, trials.datetime_complete AS trials_datetime_complete
FROM trials
WHERE trials.trial_id = %(trial_id_1)s
4.5883	1000	SELECT count(trials.trial_id) AS count_1
FROM trials
WHERE trials.study_id = %(study_id_1)s AND trials.trial_id < %(trial_id_1)s
3.5560	11100	SELECT studies.study_id AS studies_study_id, studies.study_name AS studies_study_name
FROM studies
WHERE studies.study_id = %(study_id_1)s

After

optuna=# create index ix_trials_study_id on trials(study_id);
CREATE INDEX
optuna=# \d trials;
                                               Table "public.trials"
      Column       |            Type             | Collation | Nullable |                 Default
-------------------+-----------------------------+-----------+----------+------------------------------------------
 trial_id          | integer                     |           | not null | nextval('trials_trial_id_seq'::regclass)
 number            | integer                     |           |          |
 study_id          | integer                     |           |          |
 state             | trialstate                  |           | not null |
 datetime_start    | timestamp without time zone |           |          |
 datetime_complete | timestamp without time zone |           |          |
Indexes:
    "trials_pkey" PRIMARY KEY, btree (trial_id)
    "ix_trials_study_id" btree (study_id)
Foreign-key constraints:
    "trials_study_id_fkey" FOREIGN KEY (study_id) REFERENCES studies(study_id)
Referenced by:
    TABLE "trial_heartbeats" CONSTRAINT "trial_heartbeats_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_intermediate_values" CONSTRAINT "trial_intermediate_values_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_params" CONSTRAINT "trial_params_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_system_attributes" CONSTRAINT "trial_system_attributes_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_user_attributes" CONSTRAINT "trial_user_attributes_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_values" CONSTRAINT "trial_values_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
# python profiler.py
Elapsed: 66.0051s (n_trials=500 n_params=10)

Sort by Total:
Total Time(s)	Query Count	Statement
4.3927	21000	SELECT trials.trial_id AS trials_trial_id, trials.number AS trials_number, trials.study_id AS trials_study_id, trials.state AS trials_state, trials.datetime_start AS trials_datetime_start, trials.datetime_complete AS trials_datetime_complete
FROM trials
WHERE trials.trial_id = %(trial_id_1)s
3.0978	10000	SELECT trial_params.param_id AS trial_params_param_id, trial_params.trial_id AS trial_params_trial_id, trial_params.param_name AS trial_params_param_name, trial_params.param_value AS trial_params_param_value, trial_params.distribution_json AS trial_params_distribution_json
FROM trial_params JOIN trials ON trials.trial_id = trial_params.trial_id
WHERE trials.study_id = %(study_id_1)s AND trial_params.param_name = %(param_name_1)s
 LIMIT %(param_1)s
2.9455	11100	SELECT studies.study_id AS studies_study_id, studies.study_name AS studies_study_name
FROM studies
WHERE studies.study_id = %(study_id_1)s
2.3931	10000	INSERT INTO trial_params (trial_id, param_name, param_value, distribution_json) VALUES (%(trial_id)s, %(param_name)s, %(param_value)s, %(distribution_json)s) RETURNING trial_params.param_id
1.8892	11000	SELECT trials.trial_id AS trials_trial_id, trials.number AS trials_number, trials.study_id AS trials_study_id, trials.state AS trials_state, trials.datetime_start AS trials_datetime_start, trials.datetime_complete AS trials_datetime_complete
FROM trials
WHERE trials.trial_id IN (NULL) AND (1 != 1) AND trials.study_id = %(study_id_1)s ORDER BY trials.trial_id

@c-bata c-bata assigned Alnusjaponica and unassigned Alnusjaponica and c-bata Mar 22, 2023
@c-bata
Copy link
Member

c-bata commented Mar 22, 2023

There is one more bottleneck in our environment. Currently, RDBStorage creates engines internally with create_engine, and when many instances access the storage at the same time, previously created engines and connections are not cleaned up, which delays the next experiment.

@Ilevk Thank you for sharing. The connection objects are basically cleaned up when the reference count of RDBStorage reaches zero, but if there are any connections left, they can be explicitly cleaned up as follows.

import optuna

# `storage_url` and `objective` are assumed to be defined elsewhere.
storage = optuna.storages.RDBStorage(storage_url)
study = optuna.create_study(storage=storage)
study.optimize(objective, ...)

# Explicitly clean up the engine's connection pool.
storage.engine.dispose()

If you find any problems with the handling of connection objects in Optuna, please report them to us.

Collaborator

@Alnusjaponica Alnusjaponica left a comment


Sorry for my delayed reply. I also ran optimizations in the same scenario with two different versions and confirmed that the newer one is about twice as fast as the older one. LGTM.

@Alnusjaponica Alnusjaponica removed their assignment Mar 24, 2023
@c-bata c-bata added this to the v3.2.0 milestone Mar 24, 2023
@c-bata c-bata merged commit 3ebb0db into optuna:master Mar 24, 2023