
Mitigate string-length limitation of RDBStorage #2395

Merged (5 commits, Mar 4, 2021)

Conversation

toshihikoyanase (Member) commented Feb 26, 2021

Motivation

This PR relates to #1860.

SQLAlchemy has multiple types for string data. Currently, sqlalchemy.types.String is used for variable-length strings such as Trial.system_attr and Trial.user_attr, and it imposes a strict length limit of 2048 characters.
On the other hand, sqlalchemy.types.Text may be more suitable for such variable-length strings. It corresponds to the TEXT type of each database system, which is designed for large text objects.

Although TEXT is not a standard type of SQL, many database systems implement it.

| Database | Corresponding type | Length limitation |
| --- | --- | --- |
| SQLite | TEXT | SQLITE_MAX_LENGTH |
| PostgreSQL | text | unlimited |
| MySQL | text | 2^16 + 2 bytes |
| SQL Server | text | 2^31 - 1 bytes |
| Oracle | CLOB | up to 128 TB, depending on database block size |

Description of the changes

This PR replaces the String type with the Text type.
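As a rough sketch of the kind of model change involved — the class and table names below are illustrative stand-ins, not Optuna's actual schema — switching a column from `String(2048)` to `Text` changes the emitted DDL:

```python
# Sketch of the String -> Text column change, assuming a simplified model.
# The class and table names are illustrative, not Optuna's real schema.
from sqlalchemy import Column, Integer, Text, create_engine
from sqlalchemy.schema import CreateTable

try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy >= 1.4
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base  # SQLAlchemy 1.3

Base = declarative_base()


class TrialUserAttributeSketch(Base):
    __tablename__ = "trial_user_attributes_sketch"
    id = Column(Integer, primary_key=True)
    # Before this PR: Column(String(2048)) -> VARCHAR(2048), which MySQL
    # enforces strictly. After: Text maps to each backend's TEXT/CLOB type.
    value_json = Column(Text)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
# The generated DDL now declares the column as TEXT, not VARCHAR(2048).
print(str(CreateTable(TrialUserAttributeSketch.__table__)))
```

On MySQL, for example, this maps the column to `text` instead of `varchar(2048)`, lifting the 2048-character limit.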

TODO

  • schema migration script for MySQL
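A hand-written migration for this could look roughly like the following sketch. The revision identifiers are placeholders, and the table names are my guesses based on the column diffs and error messages in this thread, not verified against Optuna's actual schema:

```python
"""Sketch of a hand-written Alembic migration for the String -> Text change.

Revision IDs are placeholders; table names are guesses based on this thread.
"""
import sqlalchemy as sa
from alembic import op

revision = "xxxx"       # placeholder
down_revision = "yyyy"  # placeholder


def upgrade():
    # MySQL and PostgreSQL need an explicit type change; SQLite ignores
    # declared VARCHAR lengths, so the ALTER is effectively a no-op there.
    for table in (
        "study_user_attributes",
        "study_system_attributes",
        "trial_user_attributes",
        "trial_system_attributes",
    ):
        op.alter_column(
            table, "value_json", existing_type=sa.String(2048), type_=sa.Text
        )
    op.alter_column(
        "trial_params",
        "distribution_json",
        existing_type=sa.String(2048),
        type_=sa.Text,
    )


def downgrade():
    # Reversing would truncate data longer than 2048 characters; left as a
    # deliberate no-op in this sketch.
    pass
```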

Example script

```python
import sys

import optuna


def objective(trial):
    trial.set_user_attr("a", "a" * 10000)
    trial.set_system_attr("b", "b" * 10000)
    trial.suggest_categorical("x", [i for i in range(10000)])
    return 1


study = optuna.create_study(storage=sys.argv[1])
study.set_user_attr("c", "c" * 10000)
study.set_system_attr("d", "d" * 10000)
study.optimize(objective, n_trials=1)
```

```shell
$ MYSQL_HOST=localhost
$ docker run --name mysql -e MYSQL_ROOT_PASSWORD=test -p 3306:3306 -p 33060:33060 -d mysql:8
$ docker run --network host -it --rm mysql:8 mysql -h ${MYSQL_HOST} -uroot -ptest -e "create database test_optuna;"
$ python long-string.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna
```

Output

master

```
$ python long-string.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna
sqlalchemy.exc.DataError: (pymysql.err.DataError) (1406, "Data too long for column 'value_json' at row 1")
[SQL: INSERT INTO study_user_attributes (study_id, `key`, value_json) VALUES (%(study_id)s, %(key)s, %(value_json)s)]
[parameters: {'study_id': 1, 'key': 'c', 'value_json': '"cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc ... (9704 characters truncated) ... cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc"'}]
(Background on this error at: http://sqlalche.me/e/13/9h9h)
```

This PR

```
$ python long-string.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna
[I 2021-02-26 14:48:12,410] A new study created in RDB with name: no-name-18194847-bf32-4258-b6f3-65919c2fefaa
[I 2021-02-26 14:48:12,563] Trial 0 finished with value: 1.0 and parameters: {'x': 8993}. Best is trial 0 with value: 1.0.
```

@toshihikoyanase toshihikoyanase added enhancement Change that does not break compatibility and not affect public interfaces, but improves performance. compatibility Change that breaks compatibility. labels Feb 26, 2021
@github-actions github-actions bot added the optuna.storages Related to the `optuna.storages` submodule. This is automatically labeled by github-actions. label Feb 26, 2021
codecov-io commented Feb 26, 2021

Codecov Report

Merging #2395 (ce81e20) into master (30f58b5) will decrease coverage by 0.06%.
The diff coverage is 71.42%.


```
@@            Coverage Diff             @@
##           master    #2395      +/-   ##
==========================================
- Coverage   91.44%   91.38%   -0.07%
==========================================
  Files         134      135       +1
  Lines       11270    11300      +30
==========================================
+ Hits        10306    10326      +20
- Misses        964      974      +10
```

| Impacted Files | Coverage Δ |
| --- | --- |
| optuna/storages/_rdb/alembic/versions/v2.6.0.a_.py | 65.51% <65.51%> (ø) |
| optuna/storages/_rdb/models.py | 99.59% <100.00%> (+<0.01%) ⬆️ |


keisuke-umezawa (Member)

@hvy
Could you also review it?

hvy (Member) commented Mar 1, 2021

Thanks a lot for this simple fix. I haven't gone over it in detail yet, but here are some quick questions:

  • Does this require a DB migration (ALTER TABLE)? If so, we might want to provide a migration script.
  • Does this have any performance impacts that we should be aware of?
  • Should we document/add unit tests? (this could be a separate issue since that'd widen the scope)

toshihikoyanase (Member, Author)

I checked the changes to the tables using SQLite, MySQL, and PostgreSQL.
We need to add a migration script for these changes. However, the migration code is not generated automatically by Alembic, so I'll add it manually.

SQLite3

Diff of SQLite tables:

```diff
32c32
<       value_json VARCHAR(2048),
---
>       value_json TEXT,
42c42
<       value_json VARCHAR(2048),
---
>       value_json TEXT,
64c64
<       value_json VARCHAR(2048),
---
>       value_json TEXT,
74c74
<       value_json VARCHAR(2048),
---
>       value_json TEXT,
85c85
<       distribution_json VARCHAR(2048),
---
>       distribution_json TEXT,
```

MySQL

Diff of MySQL tables:

```diff
105c105
<   `value_json` varchar(2048) DEFAULT NULL,
---
>   `value_json` text,
133c133
<   `value_json` varchar(2048) DEFAULT NULL,
---
>   `value_json` text,
215c215
<   `distribution_json` varchar(2048) DEFAULT NULL,
---
>   `distribution_json` text,
243c243
<   `value_json` varchar(2048) DEFAULT NULL,
---
>   `value_json` text,
271c271
<   `value_json` varchar(2048) DEFAULT NULL,
---
>   `value_json` text,
```

PostgreSQL

Diff of PostgreSQL tables:

```diff
140c140
<     value_json character varying(2048)
---
>     value_json text
176c176
<     value_json character varying(2048)
---
>     value_json text
284c284
<     distribution_json character varying(2048)
---
>     distribution_json text
320c320
<     value_json character varying(2048)
---
>     value_json text
356c356
<     value_json character varying(2048)
---
>     value_json text
```

c-bata (Member) commented Mar 1, 2021

How about using JSON column instead of TEXT?
https://docs.sqlalchemy.org/en/13/core/type_basics.html#sqlalchemy.types.JSON

toshihikoyanase (Member, Author)

> How about using JSON column instead of TEXT?

Thank you for the offline discussion. In summary, we'll use TEXT mainly for users running old versions of DB servers. For example, HPC users and private-cloud users may have no choice but to keep their old DB servers.
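A minimal stdlib sketch of the portability argument: JSON-encoded attributes stored in a plain TEXT column work on any backend, with no native JSON type required (SQLite here stands in for an old DB server):

```python
# Storing JSON-encoded attributes in a plain TEXT column works on any
# backend, including old DB servers without a native JSON type.
# Table and column names here are illustrative.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_attrs (key TEXT, value_json TEXT)")

value = {"scores": list(range(100)), "note": "n" * 5000}  # well over 2048 chars
conn.execute("INSERT INTO user_attrs VALUES (?, ?)", ("a", json.dumps(value)))

row = conn.execute("SELECT value_json FROM user_attrs WHERE key = 'a'").fetchone()
restored = json.loads(row[0])
assert restored == value  # round-trips without truncation
```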

c-bata (Member) left a comment

The changes basically LGTM. Can we remove this constant?

```python
MAX_STRING_LENGTH = 2048
```

toshihikoyanase (Member, Author) commented Mar 1, 2021

> Does this have any performance impacts that we should be aware of?

I executed a simple benchmark using SQLite3, MySQL 8.0, and PostgreSQL 12. I ran the following script ten times and averaged the wall-clock time. We can see performance degradation in SQLite3: it took about 30% more time. On the other hand, the execution times for MySQL and PostgreSQL are comparable.

I think SQLite3 is intended for casual experiments, while DB servers like MySQL and PostgreSQL are for large-scale experiments, so the performance degradation of SQLite3 may be acceptable, but I'm not entirely sure. What do you think?

benchmark-2395.py
```python
import sys

import optuna


def objective(trial):
    trial.set_user_attr("a", "a" * 2000)
    trial.set_system_attr("b", "b" * 2000)
    trial.suggest_categorical("x", [f"categorical_variable_{i:02}" for i in range(60)])
    return 1


optuna.logging.set_verbosity(optuna.logging.ERROR)
study = optuna.create_study(storage=sys.argv[1])
study.set_user_attr("c", "c" * 2000)
study.set_system_attr("d", "d" * 2000)
study.optimize(objective, n_trials=200)
```
Execution commands

The execution time was measured with the `time` command.

SQLite

```shell
# master
$ python benchmark-2395.py sqlite:///master.db  # Discard the first execution since it includes DB creation time.
$ for i in $(seq 0 9); do echo $i; time (python benchmark-2395.py sqlite:///master.db) >> master.sqlite.log2 2>&1; done
$ grep real master.sqlite.log2 | cut -d 'm' -f 2 | cut -d 's' -f 1

# This PR
$ python benchmark-2395.py sqlite:///feature.db  # Discard the first execution since it includes DB creation time.
$ for i in $(seq 0 9); do echo $i; time (python benchmark-2395.py sqlite:///feature.db) >> feature.sqlite.log2 2>&1; done
$ grep real feature.sqlite.log2 | cut -d 'm' -f 2 | cut -d 's' -f 1
```

MySQL

```shell
# master
$ docker run --name mysql -e MYSQL_ROOT_PASSWORD=test -p 3306:3306 -p 33060:33060 -d mysql:8
$ docker run --network host -it --rm mysql:8 mysql -h ${MYSQL_HOST} -uroot -ptest -e "create database test_optuna;"
$ python benchmark-2395.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna
$ for i in $(seq 0 9); do
  echo $i;
  time (python benchmark-2395.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna) >> master.mysql.log 2>&1
done
$ cat master.mysql.log | grep real | cut -d 'm' -f 2 | cut -d 's' -f 1

# This PR
$ docker run --name mysql -e MYSQL_ROOT_PASSWORD=test -p 3306:3306 -p 33060:33060 -d mysql:8
$ docker run --network host -it --rm mysql:8 mysql -h ${MYSQL_HOST} -uroot -ptest -e "create database test_optuna;"
$ python benchmark-2395.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna
$ for i in $(seq 0 9); do
  echo $i;
  time (python benchmark-2395.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna) >> feature.mysql.log 2>&1
done
$ cat feature.mysql.log | grep real | cut -d 'm' -f 2 | cut -d 's' -f 1
```

PostgreSQL

```shell
# master
$ docker run -it --rm --name postgres-test -e POSTGRES_PASSWORD=test -p 15432:5432 -d postgres:12
$ python benchmark-2395.py postgres+psycopg2://postgres:test@$DB_HOST:15432/postgres
$ for i in $(seq 0 9); do
  echo $i;
  time (python benchmark-2395.py postgres+psycopg2://postgres:test@$DB_HOST:15432/postgres) >> master.postgres.log 2>&1
done
$ cat master.postgres.log | grep real | cut -d 'm' -f 2 | cut -d 's' -f 1

# This PR
$ docker run -it --rm --name postgres-test -e POSTGRES_PASSWORD=test -p 15432:5432 -d postgres:12
$ python benchmark-2395.py postgres+psycopg2://postgres:test@$DB_HOST:15432/postgres
$ for i in $(seq 0 9); do
  echo $i;
  time (python benchmark-2395.py postgres+psycopg2://postgres:test@$DB_HOST:15432/postgres) >> feature.postgres.log 2>&1
done
$ cat feature.postgres.log | grep real | cut -d 'm' -f 2 | cut -d 's' -f 1
```
| DBMS | master (s) | This PR (s) | This PR / master |
| --- | --- | --- | --- |
| SQLite3 | 5.894 | 7.827 | 1.327 |
| MySQL | 7.552 | 7.531 | 0.997 |
| PostgreSQL | 4.5235 | 4.6395 | 1.025 |
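The aggregation step described above (averaging the `real` times extracted by the `grep`/`cut` pipeline and taking the ratio) can be sketched as follows; the numbers below are illustrative samples, not the measured values from the logs:

```python
# Sketch of the aggregation step: average the per-run wall-clock times and
# compute the this-PR/master ratio. Sample values are illustrative only.
from statistics import mean

master_times = [5.89, 5.91, 5.88, 5.90, 5.92]   # illustrative "real" times (s)
feature_times = [7.82, 7.83, 7.81, 7.84, 7.85]  # illustrative "real" times (s)

ratio = mean(feature_times) / mean(master_times)
print(f"master={mean(master_times):.3f}s  this PR={mean(feature_times):.3f}s  ratio={ratio:.3f}")
```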

toshihikoyanase (Member, Author)

I took profiles with cProfile.

Study.set_system_attr and Study.set_user_attr

I measured the execution time of study.set_user_attr and study.set_system_attr.

Profile of `Study.set_user_attr` and `Study.set_system_attr`:

```python
import sys
import cProfile as profile

import optuna


def main():
    for i in range(100):
        study.set_user_attr(f"{i:03}", "c" * 2000)
        study.set_system_attr(f"{i:03}", "d" * 2000)


optuna.logging.set_verbosity(optuna.logging.ERROR)
study = optuna.create_study(storage=sys.argv[1])

profile.run("main()", sys.argv[2])
```

The following graphs are the visualization results. The total execution times were 0.912 s (master) and 1.03 s (this branch); the feature branch was about 12% slower than master. The major part of the increase came from session.commit() (0.655 s to 0.726 s, a difference of about 0.07 s).

The master branch: [profile graph omitted]

This branch: [profile graph omitted]
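For reference, the dump written by `profile.run("main()", sys.argv[2])` above can be inspected with `pstats`. A self-contained sketch, with a dummy workload standing in for the Optuna run:

```python
# Sketch: reading back a cProfile dump with pstats and listing the most
# expensive calls by cumulative time, as in the graphs above.
# busy_workload() is a stand-in for the actual Optuna run.
import cProfile
import pstats
import tempfile


def busy_workload():
    return sum(i * i for i in range(50_000))


with tempfile.NamedTemporaryFile(suffix=".prof", delete=False) as f:
    dump_path = f.name

profiler = cProfile.Profile()
profiler.enable()
busy_workload()
profiler.disable()
profiler.dump_stats(dump_path)  # same file format as profile.run(cmd, path)

stats = pstats.Stats(dump_path)
stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time
```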

Study.suggest_categorical

I also profiled trial.suggest_categorical with 60 items in choices.

Profile of `Study.optimize`:

```python
import sys
import cProfile as profile

import optuna


def objective(trial):
    trial.suggest_categorical("x", [f"categorical_variable_{i:02}" for i in range(60)])
    return 1


optuna.logging.set_verbosity(optuna.logging.ERROR)
study = optuna.create_study(storage=sys.argv[1])


def main():
    study.optimize(objective, n_trials=100)


profile.run("main()", sys.argv[2])
```

Large portions of the execution time are occupied by suggest_categorical, create_new_trial, and set_trial_state. These methods include session.commit(), and session.commit() on this branch was about 30% slower than on the master branch (1.21 s to 1.55 s).

The master branch: [profile graph omitted]

This branch: [profile graph omitted]

Study.suggest_float

I also tested with a script that uses none of suggest_categorical, set_system_attr, or set_user_attr. The execution time of this PR is still slower.

Profile of `Study.optimize` without `suggest_categorical`:

```python
import sys
import cProfile as profile

import optuna


def objective(trial):
    trial.suggest_float("x", 0, 1)
    return 1


optuna.logging.set_verbosity(optuna.logging.ERROR)
study = optuna.create_study(storage=sys.argv[1])


def main():
    study.optimize(objective, n_trials=100)


profile.run("main()", sys.argv[2])
```

The master branch: [profile graph omitted]

The feature branch: [profile graph omitted]

toshihikoyanase (Member, Author)

I executed the same experiment using SQLite, but the execution time was not stable. I suspected the time differences mainly came from external noise sources such as other processes writing to the same disk and security software.
So, I used another clean machine to rerun the experiment (#2395 (comment)).

The results are shown below:

| DBMS | master | This PR | This PR / master |
| --- | --- | --- | --- |
| SQLite3 | 15.4975 (0.40) | 15.2633 (0.34) | 0.98 |
| MySQL | 23.6462 (0.81) | 23.4407 (0.22) | 0.99 |
| PostgreSQL | 16.7208 (0.45) | 16.7819 (0.42) | 1.00 |

Thus, we cannot see significant differences. This result is consistent with the SQLite3 implementation: according to the official documentation, SQLite internally uses the same TEXT type for both VARCHAR and TEXT.
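This SQLite behavior is easy to confirm from the standard library: because of SQLite's type affinity, a `VARCHAR(2048)` column accepts strings of any length, just like `TEXT`:

```python
# Demonstrating SQLite type affinity: VARCHAR(2048) and TEXT columns are
# stored identically, and the declared length is not enforced.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a VARCHAR(2048), b TEXT)")
long_value = "x" * 10_000  # far beyond the declared VARCHAR length
conn.execute("INSERT INTO t VALUES (?, ?)", (long_value, long_value))

a, b = conn.execute("SELECT a, b FROM t").fetchone()
print(len(a), len(b))  # both stored without truncation
```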

hvy (Member) left a comment

Thank you for the benchmarks and the detailed investigation. The numbers look promising, and the changes LGTM besides the comment by c-bata san regarding the obsolete MAX_STRING_LENGTH constant.

For the record, I also verified the upgrade script with SQLite, MySQL, and PostgreSQL.

toshihikoyanase (Member, Author)

@c-bata @hvy Thank you for your careful review. I removed the MAX_STRING_LENGTH constant in commit 6eba4be and confirmed that it is no longer used.

keisuke-umezawa (Member) left a comment

LGTM! Thank you for checking the performance and migration scripts!

ytsmiling (Member) left a comment

Thank you for creating this PR. Let me report that I confirmed that:

  • the migration script worked without problems in my environment (MySQL, Optuna v2.5 -> this branch)
  • this PR fixed a problem that occurred when the number of choices in CategoricalDistribution is large.
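To illustrate the second point: the JSON encoding of a categorical distribution with many choices easily exceeds the old 2048-character limit. The serialization format below is illustrative, not Optuna's exact one:

```python
# The JSON encoding of a categorical distribution with many choices quickly
# grows past 2048 characters, overflowing the old VARCHAR(2048) column.
# The dict layout here is illustrative, not Optuna's exact serialization.
import json

choices = [f"categorical_variable_{i:02}" for i in range(200)]
distribution_json = json.dumps(
    {"name": "CategoricalDistribution", "attributes": {"choices": choices}}
)
print(len(distribution_json))  # well over 2048 characters
```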

@hvy hvy added this to the v2.6.0 milestone Mar 3, 2021
hvy (Member) commented Mar 4, 2021

Let me merge this PR since it has four approvals. It requires a migration, so we should be careful to highlight it in the release notes, but it will improve usability for users of the RDB storage.

@hvy hvy merged commit ca04078 into optuna:master Mar 4, 2021
@toshihikoyanase toshihikoyanase deleted the rdb-storage-string-to-text branch March 4, 2021 04:39
PhilipMay (Contributor)

Should #1860 also be closed now?

hvy (Member) commented Mar 29, 2021

Thanks for the heads up, let me do that.
