
feat - add index study_id column on trials table #4449

Merged
14 commits merged into optuna:master from feat/add-index-to-trials-study-id on Mar 24, 2023

Conversation

Ilevk
Contributor

@Ilevk Ilevk commented Feb 21, 2023

Motivation

I have experienced a bottleneck when using RDBStorage with more than 100 sessions accessing it simultaneously.
Contributes to #4444

Description of the changes

  • Add an index on the study_id column of the trials table (a minimal sketch follows).
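
For context, here is a minimal sketch of how such an index can be declared on a SQLAlchemy model. It is a simplified stand-in for illustration only, not the exact code in optuna/storages/_rdb/models.py (the column set is trimmed):

from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class TrialModel(Base):
    __tablename__ = "trials"

    trial_id = Column(Integer, primary_key=True)
    # index=True makes SQLAlchemy emit
    # CREATE INDEX ix_trials_study_id ON trials (study_id)
    # so lookups filtered by study_id no longer require a full table scan.
    study_id = Column(Integer, ForeignKey("studies.study_id"), index=True)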

@github-actions github-actions bot added the optuna.storages label (Related to the `optuna.storages` submodule. This is automatically labeled by github-actions.) Feb 21, 2023
@c-bata
Member

c-bata commented Feb 21, 2023

Thank you for your pull request!
I have some questions and suggestions.

  • Could you add a schema migration file using alembic? See wiki for details. (A minimal sketch follows this list.)
  • I guess a composite index of study_id and state columns might be more efficient. Is it possible to benchmark in your environment and share the result with us?
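
A minimal sketch of what such an Alembic migration could look like; the revision identifiers below are placeholders, and the index name assumes SQLAlchemy's default ix_trials_study_id naming convention:

"""Add an index on trials.study_id (illustrative sketch)."""
from alembic import op

revision = "placeholder_revision_id"
down_revision = "placeholder_previous_revision_id"


def upgrade():
    # A composite variant would pass ["study_id", "state"] instead.
    op.create_index("ix_trials_study_id", "trials", ["study_id"])


def downgrade():
    op.drop_index("ix_trials_study_id", table_name="trials")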

@c-bata c-bata self-assigned this Feb 21, 2023
@Ilevk
Contributor Author

Ilevk commented Feb 22, 2023

Okay, I'll try it and share it with you.

@Ilevk
Contributor Author

Ilevk commented Feb 25, 2023

I tried a composite index (study_id, state) for 3 days and didn't see any dramatic performance change.
The earlier range in the charts below corresponds to the single-column index on study_id.

[Screenshot: 2023-02-25 1:45:33 PM]

[Screenshot: 2023-02-25 1:45:21 PM]

@Ilevk Ilevk force-pushed the feat/add-index-to-trials-study-id branch from 7d64fa7 to 3660246 Compare February 25, 2023 07:06
@codecov-commenter

codecov-commenter commented Feb 25, 2023

Codecov Report

Merging #4449 (352a88d) into master (ac169ea) will increase coverage by 0.65%.
The diff coverage is 90.00%.


@@            Coverage Diff             @@
##           master    #4449      +/-   ##
==========================================
+ Coverage   89.68%   90.33%   +0.65%     
==========================================
  Files         178      184       +6     
  Lines       13974    14099     +125     
==========================================
+ Hits        12532    12736     +204     
+ Misses       1442     1363      -79     
Impacted Files Coverage Δ
optuna/storages/_rdb/alembic/versions/v3.2.0.a_.py 88.88% <88.88%> (ø)
optuna/storages/_rdb/models.py 97.69% <100.00%> (+<0.01%) ⬆️

... and 24 files with indirect coverage changes


@Ilevk
Contributor Author

Ilevk commented Feb 28, 2023

If I were to run a large number of experiments per HPO, I would expect the (study_id, state) composite index to be more efficient.
In my case, I run 10 experiments per HPO, so it seems to make only a small difference in performance.
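
For reference, a minimal timing sketch from the Optuna side; the connection URL, objective, and trial count are hypothetical, and the same script would be run against a database with each index variant to compare:

import time

import optuna

storage = optuna.storages.RDBStorage("postgresql://user:pass@localhost/optuna")
study = optuna.create_study(storage=storage)
# A toy objective just to populate the trials table.
study.optimize(lambda t: t.suggest_float("x", -10, 10) ** 2, n_trials=100)

start = time.perf_counter()
trials = study.get_trials(deepcopy=False)
print(f"Fetched {len(trials)} trials in {time.perf_counter() - start:.4f}s")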

@github-actions
Contributor

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale label (Exempt from stale bot labeling.) Mar 12, 2023
Member

@c-bata c-bata left a comment


Sorry for the late review. I'll review this PR today. Let me leave some early feedback for now.

optuna/storages/_rdb/alembic/versions/v3.1.0.a_.py — outdated review comment (resolved)
@c-bata c-bata removed the stale label (Exempt from stale bot labeling.) Mar 13, 2023
Member

@c-bata c-bata left a comment


Thank you for the quick update! I left a minor suggestion.

Regarding the migration script, it looks good to me. The backward migration does not work in MySQL, but that is acceptable since Optuna does not actually provide a backward-migration API to users.

I am checking for performance gains with this change using the following script.
https://gist.github.com/c-bata/c08fb89a583adbcdc3eddcf8cf192c1a

I will verify the performance on PostgreSQL tomorrow with more study and trial records.

optuna/storages/_rdb/alembic/versions/v3.2.0.a_.py — outdated review comment (resolved)
@c-bata
Member

c-bata commented Mar 14, 2023

I have confirmed with PostgreSQL that this change improves performance. I will approve this PR after my suggestion is reflected 👍

Benchmarking on PostgreSQL

Here is a benchmark script and its result.

Benchmark script:

https://gist.github.com/c-bata/98532a60609a8a5f9e1e4dd162d45886

Before (master):

$ python profile_get_all_trials.py
Elapsed: 24.4298s (n_trials=500 n_params=10)

Sort by Total:
Total Time(s)	Query Count	Statement
0.5289	100	SELECT trial_params.trial_id AS trial_params_trial_id, trial_params.param_id AS trial_params_param_id, trial_params.param_name AS trial_params_param_name, trial_params.param_value AS trial_params_param_value, trial_params.distribution_json AS trial_params_distribution_json
FROM trial_params
WHERE trial_params.trial_id IN (%(primary_keys_1)s, %(primary_keys_2)s, %(primary_keys_3)s, %(primary_keys_4)s, %(primary_keys_5)s, ...)
0.2383	100	SELECT trials.trial_id AS trials_trial_id
FROM trials
WHERE trials.study_id = %(study_id_1)s

After (This PR):

# python profile_all_trials.py
Elapsed: 21.5016s (n_trials=500 n_params=10)

Sort by Total:
Total Time(s)	Query Count	Statement
0.4890	100	SELECT trial_params.trial_id AS trial_params_trial_id, trial_params.param_id AS trial_params_param_id, trial_params.param_name AS trial_params_param_name, trial_params.param_value AS trial_params_param_value, trial_params.distribution_json AS trial_params_distribution_json
FROM trial_params
WHERE trial_params.trial_id IN (%(primary_keys_1)s, %(primary_keys_2)s, %(primary_keys_3)s, %(primary_keys_4)s, ...)
0.2103	100	SELECT trials.trial_id AS trials_trial_id, trials.number AS trials_number, trials.study_id AS trials_study_id, trials.state AS trials_state, trials.datetime_start AS trials_datetime_start, trials.datetime_complete AS trials_datetime_complete
FROM trials
WHERE trials.trial_id IN (%(trial_id_1_1)s, %(trial_id_1_2)s, %(trial_id_1_3)s, %(trial_id_1_4)s, ...) AND trials.study_id = %(study_id_1)s ORDER BY trials.trial_id

According to the slow queries, a composite index of study_id and state does not improve performance, since state is not included in the query.
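
One way to double-check this is to inspect the query plan for the study_id filter. A minimal sketch using SQLAlchemy, with a hypothetical connection URL:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/optuna")
with engine.connect() as conn:
    rows = conn.execute(
        text("EXPLAIN SELECT trial_id FROM trials WHERE study_id = :sid"),
        {"sid": 1},
    )
    for row in rows:
        # Expect an index scan on ix_trials_study_id; since state never appears
        # in this query, a (study_id, state) composite index would not be used
        # any differently here.
        print(row[0])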

optuna-e2e scripts

For the second reviewer, let me share the optuna-e2e branch I used to check the migration script.
https://github.com/c-bata/optuna-e2e/tree/optuna-4449

$ docker compose up -d --build
$ docker compose run --rm optuna-300 python src/init.py
$ docker compose run --rm optuna-dev bash src/upgrade.sh
mysql
[I 2023-03-13 12:16:15,177] Upgrading the storage schema to the latest version.
[I 2023-03-13 12:16:15,417] Completed to upgrade the storage.
postgresql
[I 2023-03-13 12:16:16,268] Upgrading the storage schema to the latest version.
[I 2023-03-13 12:16:16,500] Completed to upgrade the storage.
sqlite
[I 2023-03-13 12:16:17,261] Upgrading the storage schema to the latest version.
[I 2023-03-13 12:16:17,581] Completed to upgrade the storage.

@c-bata c-bata added the enhancement label (Change that does not break compatibility and not affect public interfaces, but improves performance.) Mar 14, 2023
@toshihikoyanase
Member

@Alnusjaponica Could you review this PR, please?
Let me walk you through how to confirm the schema migration with the optuna-e2e tool.

Co-authored-by: Masashi Shibata <c-bata@users.noreply.github.com>
@Ilevk Ilevk force-pushed the feat/add-index-to-trials-study-id branch 2 times, most recently from 3385564 to f551a89 Compare March 14, 2023 14:06
@Ilevk Ilevk force-pushed the feat/add-index-to-trials-study-id branch from f551a89 to 352a88d Compare March 14, 2023 14:12
@Alnusjaponica
Collaborator

Alnusjaponica commented Mar 17, 2023

@Ilevk Thank you for your contribution.
I also confirmed that the migration code works properly and the change itself looks good to me.

As we discussed offline, @c-bata found that this change might not have a large effect on performance (he'll share the data presently) and wonders what made it improve performance so drastically in your environment. Could you give us some information about the kind of jobs you're running, or any code to reproduce the issue?

@Ilevk
Contributor Author

Ilevk commented Mar 17, 2023

@Alnusjaponica We train about 5,000 models every day and run 5-7 HPOs per model, across about 100 training instances. If you have many connections happening at the same time, you will likely run into the same problem we did.

@c-bata
Member

c-bata commented Mar 17, 2023

@Ilevk Thank you for your swift response. Could you also share which sampler you used and the number of trials per study?

@Ilevk
Contributor Author

Ilevk commented Mar 17, 2023

@c-bata We use TPESampler with 5-7 trials per study, most often 5.

@Ilevk
Contributor Author

Ilevk commented Mar 17, 2023

There is one more bottleneck in our environment. Currently, RDBStorage creates engines internally with create_engine, and when many instances access the storage at the same time, previously created engines and connections are not cleaned up, which delays the next experiment.

Member

@c-bata c-bata left a comment


LGTM! I could see a clear performance improvement in the following scenario.

https://gist.github.com/c-bata/98532a60609a8a5f9e1e4dd162d45886

Before

optuna=# \d trials;
                                               Table "public.trials"
      Column       |            Type             | Collation | Nullable |                 Default
-------------------+-----------------------------+-----------+----------+------------------------------------------
 trial_id          | integer                     |           | not null | nextval('trials_trial_id_seq'::regclass)
 number            | integer                     |           |          |
 study_id          | integer                     |           |          |
 state             | trialstate                  |           | not null |
 datetime_start    | timestamp without time zone |           |          |
 datetime_complete | timestamp without time zone |           |          |
Indexes:
    "trials_pkey" PRIMARY KEY, btree (trial_id)
Foreign-key constraints:
    "trials_study_id_fkey" FOREIGN KEY (study_id) REFERENCES studies(study_id)
Referenced by:
    TABLE "trial_heartbeats" CONSTRAINT "trial_heartbeats_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_intermediate_values" CONSTRAINT "trial_intermediate_values_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_params" CONSTRAINT "trial_params_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_system_attributes" CONSTRAINT "trial_system_attributes_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_user_attributes" CONSTRAINT "trial_user_attributes_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_values" CONSTRAINT "trial_values_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
# python profiler.py
Elapsed: 157.2042s (n_trials=500 n_params=10)

Sort by Total:
Total Time(s)	Query Count	Statement
43.5951	11000	SELECT trials.trial_id AS trials_trial_id
FROM trials
WHERE trials.study_id = %(study_id_1)s
40.9885	10000	SELECT trial_params.param_id AS trial_params_param_id, trial_params.trial_id AS trial_params_trial_id, trial_params.param_name AS trial_params_param_name, trial_params.param_value AS trial_params_param_value, trial_params.distribution_json AS trial_params_distribution_json
FROM trial_params JOIN trials ON trials.trial_id = trial_params.trial_id
WHERE trials.study_id = %(study_id_1)s AND trial_params.param_name = %(param_name_1)s
 LIMIT %(param_1)s
4.7336	21000	SELECT trials.trial_id AS trials_trial_id, trials.number AS trials_number, trials.study_id AS trials_study_id, trials.state AS trials_state, trials.datetime_start AS trials_datetime_start, trials.datetime_complete AS trials_datetime_complete
FROM trials
WHERE trials.trial_id = %(trial_id_1)s
4.5883	1000	SELECT count(trials.trial_id) AS count_1
FROM trials
WHERE trials.study_id = %(study_id_1)s AND trials.trial_id < %(trial_id_1)s
3.5560	11100	SELECT studies.study_id AS studies_study_id, studies.study_name AS studies_study_name
FROM studies
WHERE studies.study_id = %(study_id_1)s

After

optuna=# create index ix_trials_study_id on trials(study_id);
CREATE INDEX
optuna=# \d trials;
                                               Table "public.trials"
      Column       |            Type             | Collation | Nullable |                 Default
-------------------+-----------------------------+-----------+----------+------------------------------------------
 trial_id          | integer                     |           | not null | nextval('trials_trial_id_seq'::regclass)
 number            | integer                     |           |          |
 study_id          | integer                     |           |          |
 state             | trialstate                  |           | not null |
 datetime_start    | timestamp without time zone |           |          |
 datetime_complete | timestamp without time zone |           |          |
Indexes:
    "trials_pkey" PRIMARY KEY, btree (trial_id)
    "ix_trials_study_id" btree (study_id)
Foreign-key constraints:
    "trials_study_id_fkey" FOREIGN KEY (study_id) REFERENCES studies(study_id)
Referenced by:
    TABLE "trial_heartbeats" CONSTRAINT "trial_heartbeats_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_intermediate_values" CONSTRAINT "trial_intermediate_values_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_params" CONSTRAINT "trial_params_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_system_attributes" CONSTRAINT "trial_system_attributes_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_user_attributes" CONSTRAINT "trial_user_attributes_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
    TABLE "trial_values" CONSTRAINT "trial_values_trial_id_fkey" FOREIGN KEY (trial_id) REFERENCES trials(trial_id)
# python profiler.py
Elapsed: 66.0051s (n_trials=500 n_params=10)

Sort by Total:
Total Time(s)	Query Count	Statement
4.3927	21000	SELECT trials.trial_id AS trials_trial_id, trials.number AS trials_number, trials.study_id AS trials_study_id, trials.state AS trials_state, trials.datetime_start AS trials_datetime_start, trials.datetime_complete AS trials_datetime_complete
FROM trials
WHERE trials.trial_id = %(trial_id_1)s
3.0978	10000	SELECT trial_params.param_id AS trial_params_param_id, trial_params.trial_id AS trial_params_trial_id, trial_params.param_name AS trial_params_param_name, trial_params.param_value AS trial_params_param_value, trial_params.distribution_json AS trial_params_distribution_json
FROM trial_params JOIN trials ON trials.trial_id = trial_params.trial_id
WHERE trials.study_id = %(study_id_1)s AND trial_params.param_name = %(param_name_1)s
 LIMIT %(param_1)s
2.9455	11100	SELECT studies.study_id AS studies_study_id, studies.study_name AS studies_study_name
FROM studies
WHERE studies.study_id = %(study_id_1)s
2.3931	10000	INSERT INTO trial_params (trial_id, param_name, param_value, distribution_json) VALUES (%(trial_id)s, %(param_name)s, %(param_value)s, %(distribution_json)s) RETURNING trial_params.param_id
1.8892	11000	SELECT trials.trial_id AS trials_trial_id, trials.number AS trials_number, trials.study_id AS trials_study_id, trials.state AS trials_state, trials.datetime_start AS trials_datetime_start, trials.datetime_complete AS trials_datetime_complete
FROM trials
WHERE trials.trial_id IN (NULL) AND (1 != 1) AND trials.study_id = %(study_id_1)s ORDER BY trials.trial_id

@c-bata c-bata assigned Alnusjaponica and unassigned Alnusjaponica and c-bata Mar 22, 2023
@c-bata
Copy link
Member

c-bata commented Mar 22, 2023

There is one more bottleneck in our environment. Currently, RDBStorage creates engines internally with create_engine, and when many instances access the storage at the same time, previously created engines and connections are not cleaned up, which delays the next experiment.

@Ilevk Thank you for sharing. The connection objects are basically cleaned up when the reference count of RDBStorage reaches zero, but if there are any connections left, they can be explicitly cleaned up as follows.

import optuna

# `storage_url` and `objective` are assumed to be defined elsewhere.
storage = optuna.storages.RDBStorage(storage_url)
study = optuna.create_study(storage=storage)
study.optimize(objective, ...)

# Explicitly clean up the engine's connection pool.
storage.engine.dispose()

If you find any problems with the handling of connection objects in Optuna, please report them to us.

Collaborator

@Alnusjaponica Alnusjaponica left a comment


Sorry for my delayed reply. I also ran optimizations in the same scenario with two different versions and confirmed that the newer one is about twice as fast as the older one. LGTM.

@Alnusjaponica Alnusjaponica removed their assignment Mar 24, 2023
@c-bata c-bata added this to the v3.2.0 milestone Mar 24, 2023
@c-bata c-bata merged commit 3ebb0db into optuna:master Mar 24, 2023