
Mitigate string-length limitation of RDBStorage #2395

Merged (5 commits, Mar 4, 2021)

Conversation

toshihikoyanase (Member) commented Feb 26, 2021

Motivation

This PR relates to #1860.

SQLAlchemy has multiple types for string data. Currently, sqlalchemy.types.String is used for variable-length strings such as Trial.system_attr and Trial.user_attr, and it imposes a strict length limit of 2048 characters.
On the other hand, sqlalchemy.types.Text may be more suitable for such variable-length strings. It corresponds to the TEXT type of each database system, which is designed for large text objects.

Although TEXT is not a standard type of SQL, many database systems implement it.

| Database | Corresponding type | Length limitation |
| --- | --- | --- |
| SQLite | TEXT | SQLITE_MAX_LENGTH |
| PostgreSQL | text | unlimited |
| MySQL | text | 2^16 + 2 bytes |
| SQL Server | text | 2^31 - 1 bytes |
| Oracle | CLOB | up to 128 TB, depending on database block size |

Description of the changes

This PR replaces the String type with the Text type.
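As a rough sketch of the kind of model change involved — the class and table names below are illustrative stand-ins, not Optuna's actual schema — switching a column from `String(2048)` to `Text` changes the emitted DDL:

```python
# Sketch of the String -> Text column change, assuming a simplified model.
# The class and table names are illustrative, not Optuna's real schema.
from sqlalchemy import Column, Integer, Text, create_engine
from sqlalchemy.schema import CreateTable

try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy >= 1.4
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base  # SQLAlchemy 1.3

Base = declarative_base()


class TrialUserAttributeSketch(Base):
    __tablename__ = "trial_user_attributes_sketch"
    id = Column(Integer, primary_key=True)
    # Before this PR: Column(String(2048)) -> VARCHAR(2048), which MySQL
    # enforces strictly. After: Text maps to each backend's TEXT/CLOB type.
    value_json = Column(Text)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
# The generated DDL now declares the column as TEXT, not VARCHAR(2048).
print(str(CreateTable(TrialUserAttributeSketch.__table__)))
```

On MySQL, for example, this maps the column to `text` instead of `varchar(2048)`, lifting the 2048-character limit.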

TODO

  • schema migration script for MySQL
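A hand-written migration for this could look roughly like the following sketch. The revision identifiers are placeholders, and the table names are my guesses based on the column diffs and error messages in this thread, not verified against Optuna's actual schema:

```python
"""Sketch of a hand-written Alembic migration for the String -> Text change.

Revision IDs are placeholders; table names are guesses based on this thread.
"""
import sqlalchemy as sa
from alembic import op

revision = "xxxx"       # placeholder
down_revision = "yyyy"  # placeholder


def upgrade():
    # MySQL and PostgreSQL need an explicit type change; SQLite ignores
    # declared VARCHAR lengths, so the ALTER is effectively a no-op there.
    for table in (
        "study_user_attributes",
        "study_system_attributes",
        "trial_user_attributes",
        "trial_system_attributes",
    ):
        op.alter_column(
            table, "value_json", existing_type=sa.String(2048), type_=sa.Text
        )
    op.alter_column(
        "trial_params",
        "distribution_json",
        existing_type=sa.String(2048),
        type_=sa.Text,
    )


def downgrade():
    # Reversing would truncate data longer than 2048 characters; left as a
    # deliberate no-op in this sketch.
    pass
```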

Example script

```python
import sys

import optuna


def objective(trial):
    trial.set_user_attr("a", "a" * 10000)
    trial.set_system_attr("b", "b" * 10000)
    trial.suggest_categorical("x", [i for i in range(10000)])
    return 1


study = optuna.create_study(storage=sys.argv[1])
study.set_user_attr("c", "c" * 10000)
study.set_system_attr("d", "d" * 10000)
study.optimize(objective, n_trials=1)
```

```shell
$ MYSQL_HOST=localhost
$ docker run --name mysql -e MYSQL_ROOT_PASSWORD=test -p 3306:3306 -p 33060:33060 -d mysql:8
$ docker run --network host -it --rm mysql:8 mysql -h ${MYSQL_HOST} -uroot -ptest -e "create database test_optuna;"
$ python long-string.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna
```

Output

master

```
$ python long-string.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna
sqlalchemy.exc.DataError: (pymysql.err.DataError) (1406, "Data too long for column 'value_json' at row 1")
[SQL: INSERT INTO study_user_attributes (study_id, `key`, value_json) VALUES (%(study_id)s, %(key)s, %(value_json)s)]
[parameters: {'study_id': 1, 'key': 'c', 'value_json': '"cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc ... (9704 characters truncated) ... cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc"'}]
(Background on this error at: http://sqlalche.me/e/13/9h9h)
```

This PR

```
$ python long-string.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna
[I 2021-02-26 14:48:12,410] A new study created in RDB with name: no-name-18194847-bf32-4258-b6f3-65919c2fefaa
[I 2021-02-26 14:48:12,563] Trial 0 finished with value: 1.0 and parameters: {'x': 8993}. Best is trial 0 with value: 1.0.
```

@toshihikoyanase toshihikoyanase added enhancement Change that does not break compatibility and not affect public interfaces, but improves performance. compatibility Change that breaks compatibility. labels Feb 26, 2021
@github-actions github-actions bot added the optuna.storages Related to the `optuna.storages` submodule. This is automatically labeled by github-actions. label Feb 26, 2021
codecov-io commented Feb 26, 2021

Codecov Report

Merging #2395 (ce81e20) into master (30f58b5) will decrease coverage by 0.06%.
The diff coverage is 71.42%.


```
@@            Coverage Diff             @@
##           master    #2395      +/-   ##
==========================================
- Coverage   91.44%   91.38%   -0.07%
==========================================
  Files         134      135       +1
  Lines       11270    11300      +30
==========================================
+ Hits        10306    10326      +20
- Misses        964      974      +10
```

| Impacted Files | Coverage Δ |
| --- | --- |
| optuna/storages/_rdb/alembic/versions/v2.6.0.a_.py | 65.51% <65.51%> (ø) |
| optuna/storages/_rdb/models.py | 99.59% <100.00%> (+<0.01%) ⬆️ |


keisuke-umezawa (Member)

@hvy
Could you also review it?

hvy (Member) commented Mar 1, 2021

Thanks a lot for this simple fix. I haven't gone over it in detail yet, but here are some quick questions:

  • Does this require a DB migration (ALTER TABLE)? If so, we might want to provide a migration script.
  • Does this have any performance impacts that we should be aware of?
  • Should we document/add unit tests? (this could be a separate issue since that'd widen the scope)

toshihikoyanase (Member, Author)

I checked the changes to the tables using SQLite, MySQL, and PostgreSQL.
We need to add a migration script for these changes. However, the migration code is not generated automatically by Alembic, so I'll add it manually.

SQLite3

Diff of SQLite tables:

```diff
32c32
<       value_json VARCHAR(2048),
---
>       value_json TEXT,
42c42
<       value_json VARCHAR(2048),
---
>       value_json TEXT,
64c64
<       value_json VARCHAR(2048),
---
>       value_json TEXT,
74c74
<       value_json VARCHAR(2048),
---
>       value_json TEXT,
85c85
<       distribution_json VARCHAR(2048),
---
>       distribution_json TEXT,
```

MySQL

Diff of MySQL tables:

```diff
105c105
<   `value_json` varchar(2048) DEFAULT NULL,
---
>   `value_json` text,
133c133
<   `value_json` varchar(2048) DEFAULT NULL,
---
>   `value_json` text,
215c215
<   `distribution_json` varchar(2048) DEFAULT NULL,
---
>   `distribution_json` text,
243c243
<   `value_json` varchar(2048) DEFAULT NULL,
---
>   `value_json` text,
271c271
<   `value_json` varchar(2048) DEFAULT NULL,
---
>   `value_json` text,
```

PostgreSQL

Diff of PostgreSQL tables:

```diff
140c140
<     value_json character varying(2048)
---
>     value_json text
176c176
<     value_json character varying(2048)
---
>     value_json text
284c284
<     distribution_json character varying(2048)
---
>     distribution_json text
320c320
<     value_json character varying(2048)
---
>     value_json text
356c356
<     value_json character varying(2048)
---
>     value_json text
```

c-bata (Member) commented Mar 1, 2021

How about using JSON column instead of TEXT?
https://docs.sqlalchemy.org/en/13/core/type_basics.html#sqlalchemy.types.JSON

toshihikoyanase (Member, Author)

> How about using JSON column instead of TEXT?

Thank you for the offline discussion. In summary, we'll use TEXT mainly for users running old versions of DB servers. For example, HPC users and private-cloud users may have no choice but to keep their old DB servers.
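A minimal stdlib sketch of the portability argument: JSON-encoded attributes stored in a plain TEXT column work on any backend, with no native JSON type required (SQLite here stands in for an old DB server):

```python
# Storing JSON-encoded attributes in a plain TEXT column works on any
# backend, including old DB servers without a native JSON type.
# Table and column names here are illustrative.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_attrs (key TEXT, value_json TEXT)")

value = {"scores": list(range(100)), "note": "n" * 5000}  # well over 2048 chars
conn.execute("INSERT INTO user_attrs VALUES (?, ?)", ("a", json.dumps(value)))

row = conn.execute("SELECT value_json FROM user_attrs WHERE key = 'a'").fetchone()
restored = json.loads(row[0])
assert restored == value  # round-trips without truncation
```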

c-bata (Member) left a comment

The changes basically LGTM. Can we remove this constant?

```python
MAX_STRING_LENGTH = 2048
```

toshihikoyanase (Member, Author) commented Mar 1, 2021

> Does this have any performance impacts that we should be aware of?

I executed a simple benchmark using SQLite3, MySQL 8.0, and PostgreSQL 12. I ran the following script ten times and averaged the wall-clock time. We can see performance degradation in SQLite3: it took about 30% more time. On the other hand, the execution times for MySQL and PostgreSQL are comparable.

I think SQLite3 is intended for casual experiments, while DB servers like MySQL and PostgreSQL are for large-scale experiments, so the performance degradation of SQLite3 may be acceptable, but I'm not entirely sure. What do you think?

benchmark-2395.py
```python
import sys

import optuna


def objective(trial):
    trial.set_user_attr("a", "a" * 2000)
    trial.set_system_attr("b", "b" * 2000)
    trial.suggest_categorical("x", [f"categorical_variable_{i:02}" for i in range(60)])
    return 1


optuna.logging.set_verbosity(optuna.logging.ERROR)
study = optuna.create_study(storage=sys.argv[1])
study.set_user_attr("c", "c" * 2000)
study.set_system_attr("d", "d" * 2000)
study.optimize(objective, n_trials=200)
```
Execution commands

The execution time was measured with the `time` command.

SQLite

```shell
# master
$ python benchmark-2395.py sqlite:///master.db  # Discard the first execution since it includes DB creation time.
$ for i in $(seq 0 9); do echo $i; time (python benchmark-2395.py sqlite:///master.db) >> master.sqlite.log2 2>&1; done
$ grep real master.sqlite.log2 | cut -d 'm' -f 2 | cut -d 's' -f 1

# This PR
$ python benchmark-2395.py sqlite:///feature.db  # Discard the first execution since it includes DB creation time.
$ for i in $(seq 0 9); do echo $i; time (python benchmark-2395.py sqlite:///feature.db) >> feature.sqlite.log2 2>&1; done
$ grep real feature.sqlite.log2 | cut -d 'm' -f 2 | cut -d 's' -f 1
```

MySQL

```shell
# master
$ docker run --name mysql -e MYSQL_ROOT_PASSWORD=test -p 3306:3306 -p 33060:33060 -d mysql:8
$ docker run --network host -it --rm mysql:8 mysql -h ${MYSQL_HOST} -uroot -ptest -e "create database test_optuna;"
$ python benchmark-2395.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna
$ for i in $(seq 0 9); do
  echo $i;
  time (python benchmark-2395.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna) >> master.mysql.log 2>&1
done
$ cat master.mysql.log | grep real | cut -d 'm' -f 2 | cut -d 's' -f 1

# This PR
$ docker run --name mysql -e MYSQL_ROOT_PASSWORD=test -p 3306:3306 -p 33060:33060 -d mysql:8
$ docker run --network host -it --rm mysql:8 mysql -h ${MYSQL_HOST} -uroot -ptest -e "create database test_optuna;"
$ python benchmark-2395.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna
$ for i in $(seq 0 9); do
  echo $i;
  time (python benchmark-2395.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna) >> feature.mysql.log 2>&1
done
$ cat feature.mysql.log | grep real | cut -d 'm' -f 2 | cut -d 's' -f 1
```

PostgreSQL

```shell
# master
$ docker run -it --rm --name postgres-test -e POSTGRES_PASSWORD=test -p 15432:5432 -d postgres:12
$ python benchmark-2395.py postgres+psycopg2://postgres:test@$DB_HOST:15432/postgres
$ for i in $(seq 0 9); do
  echo $i;
  time (python benchmark-2395.py postgres+psycopg2://postgres:test@$DB_HOST:15432/postgres) >> master.postgres.log 2>&1
done
$ cat master.postgres.log | grep real | cut -d 'm' -f 2 | cut -d 's' -f 1

# This PR
$ docker run -it --rm --name postgres-test -e POSTGRES_PASSWORD=test -p 15432:5432 -d postgres:12
$ python benchmark-2395.py postgres+psycopg2://postgres:test@$DB_HOST:15432/postgres
$ for i in $(seq 0 9); do
  echo $i;
  time (python benchmark-2395.py postgres+psycopg2://postgres:test@$DB_HOST:15432/postgres) >> feature.postgres.log 2>&1
done
$ cat feature.postgres.log | grep real | cut -d 'm' -f 2 | cut -d 's' -f 1
```
| DBMS | master (s) | This PR (s) | This PR / master |
| --- | --- | --- | --- |
| SQLite3 | 5.894 | 7.827 | 1.327 |
| MySQL | 7.552 | 7.531 | 0.997 |
| PostgreSQL | 4.5235 | 4.6395 | 1.025 |
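The aggregation step described above (averaging the `real` times extracted by the `grep`/`cut` pipeline and taking the ratio) can be sketched as follows; the numbers below are illustrative samples, not the measured values from the logs:

```python
# Sketch of the aggregation step: average the per-run wall-clock times and
# compute the this-PR/master ratio. Sample values are illustrative only.
from statistics import mean

master_times = [5.89, 5.91, 5.88, 5.90, 5.92]   # illustrative "real" times (s)
feature_times = [7.82, 7.83, 7.81, 7.84, 7.85]  # illustrative "real" times (s)

ratio = mean(feature_times) / mean(master_times)
print(f"master={mean(master_times):.3f}s  this PR={mean(feature_times):.3f}s  ratio={ratio:.3f}")
```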

toshihikoyanase (Member, Author)

I took profiles with cProfile.

Study.set_system_attr and Study.set_user_attr

I measured the execution time of study.set_user_attr and study.set_system_attr.

Profile of `Study.set_user_attr` and `Study.set_system_attr`:

```python
import sys
import cProfile as profile

import optuna


def main():
    for i in range(100):
        study.set_user_attr(f"{i:03}", "c" * 2000)
        study.set_system_attr(f"{i:03}", "d" * 2000)


optuna.logging.set_verbosity(optuna.logging.ERROR)
study = optuna.create_study(storage=sys.argv[1])

profile.run("main()", sys.argv[2])
```

The following graphs are the visualization results. The total execution times were 0.912 s (master) and 1.03 s (this branch); the feature branch was about 12% slower than master. The major part of the increase came from session.commit() (0.655 s to 0.726 s, a difference of about 0.07 s).

The master branch: [profile graph omitted]

This branch: [profile graph omitted]
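For reference, the dump written by `profile.run("main()", sys.argv[2])` above can be inspected with `pstats`. A self-contained sketch, with a dummy workload standing in for the Optuna run:

```python
# Sketch: reading back a cProfile dump with pstats and listing the most
# expensive calls by cumulative time, as in the graphs above.
# busy_workload() is a stand-in for the actual Optuna run.
import cProfile
import pstats
import tempfile


def busy_workload():
    return sum(i * i for i in range(50_000))


with tempfile.NamedTemporaryFile(suffix=".prof", delete=False) as f:
    dump_path = f.name

profiler = cProfile.Profile()
profiler.enable()
busy_workload()
profiler.disable()
profiler.dump_stats(dump_path)  # same file format as profile.run(cmd, path)

stats = pstats.Stats(dump_path)
stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time
```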

Study.suggest_categorical

I also profiled trial.suggest_categorical with 60 items in choices.

Profile of `Study.optimize`:

```python
import sys
import cProfile as profile

import optuna


def objective(trial):
    trial.suggest_categorical("x", [f"categorical_variable_{i:02}" for i in range(60)])
    return 1


optuna.logging.set_verbosity(optuna.logging.ERROR)
study = optuna.create_study(storage=sys.argv[1])


def main():
    study.optimize(objective, n_trials=100)


profile.run("main()", sys.argv[2])
```

Large portions of the execution time are occupied by suggest_categorical, create_new_trial, and set_trial_state. These methods include session.commit(), and session.commit() on this branch was about 30% slower than on the master branch (1.21 s to 1.55 s).

The master branch: [profile graph omitted]

This branch: [profile graph omitted]

Study.suggest_float

I also tested with a script that uses none of suggest_categorical, set_system_attr, or set_user_attr. The execution time of this PR is still slower.

Profile of `Study.optimize` without `suggest_categorical`:

```python
import sys
import cProfile as profile

import optuna


def objective(trial):
    trial.suggest_float("x", 0, 1)
    return 1


optuna.logging.set_verbosity(optuna.logging.ERROR)
study = optuna.create_study(storage=sys.argv[1])


def main():
    study.optimize(objective, n_trials=100)


profile.run("main()", sys.argv[2])
```

The master branch: [profile graph omitted]

The feature branch: [profile graph omitted]

toshihikoyanase (Member, Author)

I executed the same experiment using SQLite, but the execution time was not stable. I suspected the time differences mainly came from external noise sources such as other processes writing to the same disk and security software.
So, I used another clean machine to rerun the experiment (#2395 (comment)).

The results are shown below:

| DBMS | master | This PR | This PR / master |
| --- | --- | --- | --- |
| SQLite3 | 15.4975 (0.40) | 15.2633 (0.34) | 0.98 |
| MySQL | 23.6462 (0.81) | 23.4407 (0.22) | 0.99 |
| PostgreSQL | 16.7208 (0.45) | 16.7819 (0.42) | 1.00 |

Thus, we cannot see significant differences. This result is consistent with the SQLite3 implementation: according to the official documentation, SQLite internally uses the same TEXT type for both VARCHAR and TEXT.
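This SQLite behavior is easy to confirm from the standard library: because of SQLite's type affinity, a `VARCHAR(2048)` column accepts strings of any length, just like `TEXT`:

```python
# Demonstrating SQLite type affinity: VARCHAR(2048) and TEXT columns are
# stored identically, and the declared length is not enforced.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a VARCHAR(2048), b TEXT)")
long_value = "x" * 10_000  # far beyond the declared VARCHAR length
conn.execute("INSERT INTO t VALUES (?, ?)", (long_value, long_value))

a, b = conn.execute("SELECT a, b FROM t").fetchone()
print(len(a), len(b))  # both stored without truncation
```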

hvy (Member) left a comment

Thank you for the benchmarks and the detailed investigation. The numbers look promising, and the changes LGTM besides the comment by c-bata san regarding the obsolete MAX_STRING_LENGTH constant.

For the record, I also verified the upgrade script with SQLite, MySQL, and PostgreSQL.

toshihikoyanase (Member, Author)

@c-bata @hvy Thank you for your careful review. I removed the MAX_STRING_LENGTH constant in commit 6eba4be and confirmed that it is no longer used.

keisuke-umezawa (Member) left a comment

LGTM! Thank you for checking the performance and migration scripts!

ytsmiling (Member) left a comment

Thank you for creating this PR. Let me report that I confirmed that:

  • the migration script worked without problems in my environment (MySQL, Optuna v2.5 -> this branch)
  • this PR fixed a problem that occurred when the number of choices in CategoricalDistribution is large.
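To illustrate the second point: the JSON encoding of a categorical distribution with many choices easily exceeds the old 2048-character limit. The serialization format below is illustrative, not Optuna's exact one:

```python
# The JSON encoding of a categorical distribution with many choices quickly
# grows past 2048 characters, overflowing the old VARCHAR(2048) column.
# The dict layout here is illustrative, not Optuna's exact serialization.
import json

choices = [f"categorical_variable_{i:02}" for i in range(200)]
distribution_json = json.dumps(
    {"name": "CategoricalDistribution", "attributes": {"choices": choices}}
)
print(len(distribution_json))  # well over 2048 characters
```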

@hvy hvy added this to the v2.6.0 milestone Mar 3, 2021
hvy (Member) commented Mar 4, 2021

Let me merge this PR since it has four approvals. It requires a migration, so we should be careful to highlight it in the release notes, but it will improve usability for users of the RDB storage.

@hvy hvy merged commit ca04078 into optuna:master Mar 4, 2021
@toshihikoyanase toshihikoyanase deleted the rdb-storage-string-to-text branch March 4, 2021 04:39
PhilipMay (Contributor)

Should #1860 also be closed now?

hvy (Member) commented Mar 29, 2021

Thanks for the heads up, let me do that.
