Mitigate string-length limitation of RDBStorage
#2395
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2395 +/- ##
==========================================
- Coverage 91.44% 91.38% -0.07%
==========================================
Files 134 135 +1
Lines 11270 11300 +30
==========================================
+ Hits 10306 10326 +20
- Misses 964 974 +10
Continue to review full report at Codecov.
@hvy
Thanks a lot for this simple fix. I haven't gone over it in detail yet, but some quick questions:
I checked the changes in the tables using SQLite, MySQL and PostgreSQL.
SQLite
Diff of SQLite3 tables
2c2
< value_json VARCHAR(2048),
---
> value_json TEXT,
42c42
< value_json VARCHAR(2048),
---
> value_json TEXT,
64c64
< value_json VARCHAR(2048),
---
> value_json TEXT,
74c74
< value_json VARCHAR(2048),
---
> value_json TEXT,
85c85
< distribution_json VARCHAR(2048),
---
> distribution_json TEXT,
MySQL
Diff of MySQL tables
105c105
< `value_json` varchar(2048) DEFAULT NULL,
---
> `value_json` text,
133c133
< `value_json` varchar(2048) DEFAULT NULL,
---
> `value_json` text,
215c215
< `distribution_json` varchar(2048) DEFAULT NULL,
---
> `distribution_json` text,
243c243
< `value_json` varchar(2048) DEFAULT NULL,
---
> `value_json` text,
271c271
< `value_json` varchar(2048) DEFAULT NULL,
---
> `value_json` text,
PostgreSQL
Diff of PostgreSQL tables
140c140
< value_json character varying(2048)
---
> value_json text
176c176
< value_json character varying(2048)
---
> value_json text
284c284
< distribution_json character varying(2048)
---
> distribution_json text
320c320
< value_json character varying(2048)
---
> value_json text
356c356
< value_json character varying(2048)
---
> value_json text
How about using
Thank you for the offline discussion. In summary, we'll use
The changes basically LGTM. Can we remove this constant here?
optuna/optuna/storages/_rdb/models.py, line 31 in ce81e20:
MAX_STRING_LENGTH = 2048
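For context, here is a hypothetical sketch (not Optuna's actual code) of the kind of length guard a VARCHAR(2048) column forces onto the storage layer; moving the columns to TEXT is what makes a constant like MAX_STRING_LENGTH obsolete. The function fits_in_column is made up for illustration.

```python
import json

MAX_STRING_LENGTH = 2048  # the constant the review suggests removing


def fits_in_column(value) -> bool:
    # Attribute values are JSON-serialized before storage; with a
    # VARCHAR(2048) column the serialized form must stay within the
    # declared length, while a TEXT column imposes no such limit.
    return len(json.dumps(value)) <= MAX_STRING_LENGTH


print(fits_in_column({"a": "a" * 100}))   # True
print(fits_in_column({"a": "a" * 3000}))  # False
```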
I executed a simple benchmark using SQLite3, MySQL 8.0 and PostgreSQL 12. I ran the following script ten times and averaged the wall-clock time. We can see a performance degradation in SQLite3: it took 30% more time. On the other hand, the execution times of MySQL and PostgreSQL are comparable. I think we expect SQLite3 to be used for casual experiments, and DB servers like MySQL and PostgreSQL for large-scale experiments, so the performance degradation of SQLite3 may be acceptable, but I'm not fully sure. What do you think?
benchmark-2395.py

import sys

import optuna


def objective(trial):
    trial.set_user_attr("a", "a" * 2000)
    trial.set_system_attr("b", "b" * 2000)
    trial.suggest_categorical("x", [f"categorical_variable_{i:02}" for i in range(60)])
    return 1


optuna.logging.set_verbosity(optuna.logging.ERROR)
study = optuna.create_study(storage=sys.argv[1])
study.set_user_attr("c", "c" * 2000)
study.set_system_attr("d", "d" * 2000)
study.optimize(objective, n_trials=200)

Execution command
The execution time was measured with the shell's time keyword.
SQLite
# master
$ python benchmark-2395.py sqlite:///master.db # Discard the first execution since it has DB creation time.
$ for i in $(seq 0 9); do echo $i; time (python benchmark-2395.py sqlite:///master.db) >> master.sqlite.log2 2>&1; done
$ grep real master.sqlite.log2 | cut -d 'm' -f 2 | cut -d 's' -f 1
# This PR
$ python benchmark-2395.py sqlite:///feature.db # Discard the first execution since it has DB creation time.
$ for i in $(seq 0 9); do echo $i; time (python benchmark-2395.py sqlite:///feature.db) >> feature.sqlite.log2 2>&1; done
$ grep real feature.sqlite.log2 | cut -d 'm' -f 2 | cut -d 's' -f 1
MySQL
# master
$ docker run --name mysql -e MYSQL_ROOT_PASSWORD=test -p 3306:3306 -p 33060:33060 -d mysql:8
$ docker run --network host -it --rm mysql:8 mysql -h ${MYSQL_HOST} -uroot -ptest -e "create database test_optuna;"
$ python benchmark-2395.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna
$ for i in $(seq 0 9); do
echo $i;
time (python benchmark-2395.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna) >> master.mysql.log 2>&1
done
$ cat master.mysql.log | grep real | cut -d 'm' -f 2 | cut -d 's' -f 1
# This PR
$ docker run --name mysql -e MYSQL_ROOT_PASSWORD=test -p 3306:3306 -p 33060:33060 -d mysql:8
$ docker run --network host -it --rm mysql:8 mysql -h ${MYSQL_HOST} -uroot -ptest -e "create database test_optuna;"
$ python benchmark-2395.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna
$ for i in $(seq 0 9); do
echo $i;
time (python benchmark-2395.py mysql+pymysql://root:test@${MYSQL_HOST}:3306/test_optuna) >> feature.mysql.log 2>&1
done
$ cat feature.mysql.log | grep real | cut -d 'm' -f 2 | cut -d 's' -f 1
PostgreSQL
# master
$ docker run -it --rm --name postgres-test -e POSTGRES_PASSWORD=test -p 15432:5432 -d postgres:12
$ python benchmark-2395.py postgres+psycopg2://postgres:test@$DB_HOST:15432/postgres
$ for i in $(seq 0 9); do
echo $i;
time (python benchmark-2395.py postgres+psycopg2://postgres:test@$DB_HOST:15432/postgres) >> master.postgres.log 2>&1
done
$ cat master.postgres.log | grep real | cut -d 'm' -f 2 | cut -d 's' -f 1
# This PR
$ docker run -it --rm --name postgres-test -e POSTGRES_PASSWORD=test -p 15432:5432 -d postgres:12
$ python benchmark-2395.py postgres+psycopg2://postgres:test@$DB_HOST:15432/postgres
$ for i in $(seq 0 9); do
echo $i;
time (python benchmark-2395.py postgres+psycopg2://postgres:test@$DB_HOST:15432/postgres) >> feature.postgres.log 2>&1
done
$ cat feature.postgres.log | grep real | cut -d 'm' -f 2 | cut -d 's' -f 1
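As an aside, the grep/cut pipelines above only extract the seconds field from the time output; averaging the runs can be sketched in Python, assuming the log lines have the usual "real\t0m12.345s" format produced by the shell's time keyword:

```python
import re


def mean_real_seconds(log_text: str) -> float:
    # Parse "real\t<M>m<S>s" lines and return the mean wall-clock seconds.
    times = [
        int(m.group(1)) * 60 + float(m.group(2))
        for m in re.finditer(r"real\s+(\d+)m([\d.]+)s", log_text)
    ]
    return sum(times) / len(times)


sample = "real\t0m10.000s\nreal\t1m2.000s\n"
print(mean_real_seconds(sample))  # 36.0
```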
I took profiles with cProfile.
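For reference, a profiling run like the one mentioned can be sketched with the stdlib cProfile and pstats modules; the work function below is a stand-in for the actual benchmark workload, not the code that was profiled here:

```python
import cProfile
import io
import pstats


def work():
    # Stand-in for the real workload (e.g. study.optimize).
    return sum(i * i for i in range(10_000))


profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Print the five most expensive entries by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print("work" in stream.getvalue())  # True
```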
Force-pushed from ce81e20 to ab39c62.
I executed the same experiment using SQLite, but the execution time was not stable. I suspect the time differences mainly came from external noise sources such as other processes writing to the same disk and security software. The results are shown as follows:
Thus, we cannot see significant differences. This result is consistent with the SQLite3 implementation, because SQLite gives both VARCHAR and TEXT column declarations the same TEXT type affinity and stores them identically.
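This behavior can be checked directly with the stdlib sqlite3 module (an illustrative check, not part of the PR): SQLite assigns TEXT affinity to both a VARCHAR(2048) column and a TEXT column, and enforces no length limit on either, so the schema change should not alter how SQLite stores the values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE old_style (value_json VARCHAR(2048))")
conn.execute("CREATE TABLE new_style (value_json TEXT)")

long_json = '"' + "x" * 5000 + '"'  # longer than the old 2048-character limit
conn.execute("INSERT INTO old_style VALUES (?)", (long_json,))
conn.execute("INSERT INTO new_style VALUES (?)", (long_json,))

# Both inserts succeed: SQLite does not enforce the declared VARCHAR length.
old = conn.execute("SELECT value_json FROM old_style").fetchone()[0]
new = conn.execute("SELECT value_json FROM new_style").fetchone()[0]
print(len(old), len(new))  # 5002 5002
```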
Thank you for the benchmarks and the detailed investigation. The numbers look promising and the changes LGTM besides the comment by c-bata san regarding the obsolete MAX_STRING_LENGTH constant.
For the record, also verified the upgrade script with SQLite, MySQL and PostgreSQL.
LGTM! Thank you for checking performance and migration scripts!
Thank you for creating this PR. Let me report that I confirmed that
- the migration script worked without problems in my environment (MySQL, optuna v2.5 -> this branch)
- this PR fixed a problem that occurred when the number of choices in CategoricalDistribution is large.
Let me merge this PR since it has four approvals. It requires a migration, so we should be careful and highlight it in the release note, but it will improve usability for users of the RDB storage.
Should #1860 also be closed now?
Thanks for the heads up, let me do that.
Motivation
This PR relates to #1860.
SQLAlchemy has multiple types for string data. Currently, sqlalchemy.types.String is used for variable-length strings such as Trial.system_attr and Trial.user_attr. It has a strict length limitation of 2048 characters. On the other hand, sqlalchemy.types.Text may be more suitable for such variable-length strings. It corresponds to the TEXT type of each database system, which is designed for large text objects. Although TEXT is not a standard SQL type, many database systems implement it.
Description of the changes
This PR replaces the String type with the Text type.
TODO
Example script
Output
master
This PR
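For illustration, here is a minimal SQLAlchemy sketch of the kind of column-type change this PR describes. The table and column names are simplified stand-ins, not Optuna's actual model definitions, and SQLAlchemy 1.4+ is assumed:

```python
import sqlalchemy as sa

metadata = sa.MetaData()

# Hypothetical table mirroring the change: value_json moves from
# sa.String(2048) (VARCHAR) to sa.Text, removing the length limitation.
attrs = sa.Table(
    "trial_user_attributes",
    metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("key", sa.String(512)),
    sa.Column("value_json", sa.Text),  # was: sa.Column("value_json", sa.String(2048))
)

engine = sa.create_engine("sqlite://")
metadata.create_all(engine)

long_value = '"' + "x" * 5000 + '"'  # JSON payload longer than the old limit
with engine.connect() as conn:
    conn.execute(attrs.insert().values(key="a", value_json=long_value))
    stored = conn.execute(sa.select(attrs.c.value_json)).scalar_one()

print(len(stored))  # 5002
```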