Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error on GCP cluster using sqlite (sqlite3.OperationalError: too many SQL variables) #1457

Closed
salvadord opened this issue Jun 30, 2020 · 16 comments
Assignees
Labels
bug Issue/PR about behavior that is broken. Not for typos/examples/CI/test but for Optuna itself. question Question about Optuna.

Comments

@salvadord
Copy link

I am running optuna to optimize large-scale detailed models of cortical circuits on Google Cloud supercomputers. I integrated optuna within our modeling tool NetPyNE: https://github.com/Neurosim-lab/netpyne/blob/optuna/netpyne/batch/optuna_parallel.py

I was running 50 parallel processes on the controller node of a Google Cloud Platform (GCP) Slurm-based cluster. Each process is an optuna instance (run via 'screen') where the objective function submits a Slurm job to run a cortical simulation on 96-core compute job. Optuna was set up so the 50 processes exchange info via an sqlite db file:

    study = optuna.create_study(study_name=self.batchLabel, storage='sqlite:///%s/%s_storage.db' % (self.saveFolder, self.batchLabel), load_if_exists=True, direction=args['direction'])
    study.optimize(lambda trial: objective(trial, args), n_trials=args['maxiters'], timeout=args['maxtime'])

At around trial 1000 I got the error sqlite3.OperationalError: too many SQL variables on all processes (see detailed error below).

I know that the doc suggests using PostgreSQL or MySQL for large distributed optimization, particularly if using NFS drives (as do the compute nodes in the cluster). Unfortunately, I have no idea how to replace SQLite with PostgresSQL or MySQL. My question is, do you have any examples of how to set that up? Or have any other suggestions on how to fix this error?

Thanks!

Full error:
`[I 2020-06-30 00:53:40,502] Finished trial#996 with value: 409.0361137631617 with parameters: {'EEGain': 1.2190444451070857, 'EIGain': 1.810497756045793, "('IELayerGain', '1-3')": 1.087901880175065, "('IELayerGain', '4')": 1.7309720923711636, "('IELayerGain', '5')": 0.5960648412842655, "('IELayerGain', '6')": 0.7978043150339824, "('IILayerGain', '1-3')": 1.1688903007202631, "('IILayerGain', '4')": 0.672447638702098, "('IILayerGain', '5')": 0.8073360248421282, "('IILayerGain', '6')": 1.8855740649714237, 'thalamoCorticalGain': 1.08285373691784, 'intraThalamicGain': 1.7758770967297166, 'corticoThalamicGain': 1.6566204337609691}. Best is trial#375 with value: 294.2097451307643.
[W 2020-06-30 00:53:40,703] Setting status of trial#1046 as TrialState.FAIL because of the following error: OperationalError('(sqlite3.OperationalError) too many SQL variables',)
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1278, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 593, in do_execute
cursor.execute(statement, parameters)
sqlite3.OperationalError: too many SQL variables

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/ext_salvadordura_gmail_com/.local/lib/python3.6/site-packages/optuna/study.py", line 734, in _run_trial
result = func(trial)
File "/home/ext_salvadordura_gmail_com/netpyne/netpyne/batch/optuna_parallel.py", line 387, in
study.optimize(lambda trial: objective(trial, args), n_trials=args['maxiters'], timeout=args['maxtime'])
File "/home/ext_salvadordura_gmail_com/netpyne/netpyne/batch/optuna_parallel.py", line 130, in objective
candidate.append(trial.suggest_uniform(str(paramLabel), minVal, maxVal))
File "/home/ext_salvadordura_gmail_com/.local/lib/python3.6/site-packages/optuna/trial/_trial.py", line 221, in suggest_uniform
return self._suggest(name, distribution)
File "/home/ext_salvadordura_gmail_com/.local/lib/python3.6/site-packages/optuna/trial/_trial.py", line 650, in _suggest
param_value = self.study.sampler.sample_independent(study, trial, name, distribution)
File "/home/ext_salvadordura_gmail_com/.local/lib/python3.6/site-packages/optuna/samplers/tpe/sampler.py", line 174, in sample_independent
values, scores = _get_observation_pairs(study, param_name, trial)
File "/home/ext_salvadordura_gmail_com/.local/lib/python3.6/site-packages/optuna/samplers/tpe/sampler.py", line 618, in _get_observation_pairs
for trial in study.get_trials(deepcopy=False):
File "/home/ext_salvadordura_gmail_com/.local/lib/python3.6/site-packages/optuna/study.py", line 145, in get_trials
return self._storage.get_all_trials(self._study_id, deepcopy=deepcopy)
File "/home/ext_salvadordura_gmail_com/.local/lib/python3.6/site-packages/optuna/storages/cached_storage.py", line 363, in get_all_trials
study_id, excluded_trial_ids=study.owned_or_finished_trial_ids
File "/home/ext_salvadordura_gmail_com/.local/lib/python3.6/site-packages/optuna/storages/rdb/storage.py", line 951, in _get_trials
models.TrialModel.study_id == study_id,
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3341, in all
return list(self)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3503, in iter
return self._execute_and_instances(context)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3528, in _execute_and_instances
result = conn.execute(querycontext.statement, self._params)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1014, in execute
return meth(self, multiparams, params)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1133, in _execute_clauseelement
distilled_params,
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1318, in execute_context
e, statement, parameters, cursor, context
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1512, in handle_dbapi_exception
sqlalchemy_exception, with_traceback=exc_info[2], from
=e
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 178, in raise

raise exception
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1278, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 593, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) too many SQL variables
[SQL: SELECT trials.trial_id AS trials_trial_id, trials.number AS trials_number, trials.study_id AS trials_study_id, trials.state AS trials_state, trials.value AS trials_value, trials.datetime_start AS trials_datetime_start, trials.datetime_complete AS trials_datetime_complete
FROM trials
WHERE trials.trial_id NOT IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) AND trials.study_id = ?]
[parameters: (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636, 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727, 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753, 754, 755`

@salvadord salvadord added the question Question about Optuna. label Jun 30, 2020
@HideakiImamura HideakiImamura added the bug Issue/PR about behavior that is broken. Not for typos/examples/CI/test but for Optuna itself. label Jul 1, 2020
@hvy
Copy link
Member

hvy commented Jul 1, 2020

Thanks for reporting this issue. I guess this could happen for any dialect and not just SQLite and should be fixed by Optuna. Digging into the cause for SQLite, it's likely related to the following, also explaining why you observe this at around 1000 trials.

SQLite allocates space to hold all host parameters between 1 and the largest host parameter number used. Hence, an SQL statement that contains a host parameter like ?1000000000 would require gigabytes of storage. This could easily overwhelm the resources of the host machine. To prevent excessive memory allocations, the maximum value of a host parameter number is SQLITE_MAX_VARIABLE_NUMBER, which defaults to 999 for SQLite versions prior to 3.32.0 (2020-05-22) or 32766 for SQLite versions after 3.32.0.

https://www.sqlite.org/limits.html

Until it's properly fixed by Optuna. There are a couple of workaround you could try.

  • Upgrade your SQLite installation
  • Use PostreSQL of MySQL (which might have higher tolerances)
  • Do not use the "cached storage" (_CachedStorage) which is creating these is long query. This'll slow down trials but should mitigate this issue. After creating the study, call the following.
    • study._storage = study._storage._backend

@hvy
Copy link
Member

hvy commented Jul 1, 2020

For the record, #1426 should properly fix this issue but as noted in the PR, it has some issues.

@hvy
Copy link
Member

hvy commented Jul 1, 2020

I somehow fail to reproduce your error. Can you provide more details on the failure such as your environment including the version of sqlite3, sqlalchemy, optuna (I assume it's v1.5.0) and maybe your script for running the code.

I'd imagine that a hot-fix could look something like the following, #1467. Do you happen to be able to test it?

@salvadord
Copy link
Author

Thanks I will try the solutions posted above (upgrade sqlite, mysql, cached storage).

My current setup is quite computationally expensive (each objective function requires 15 min on 96 cores), but I can try to set up a simpler example to try out the hot-fix.

Here's some info on the environment: This is in CentOS 7 virtual machines with a controller node where script runs and jobs are submitted; jobs run on compute nodes with NFS shared home folder with controller.

Python 3.6.8

sqlite3.version
'2.6.0'

sqlite3.sqlite_version
'3.7.17'

sqlalchemy.version
'1.3.18'

optuna.version
'1.5.0'

the script is not in a public repo, but the optuna calls are here: https://github.com/Neurosim-lab/netpyne/blob/optuna/netpyne/batch/optuna_parallel.py .

I use a bash script to submit multiple instances with screen:

#!/bin/bash
numproc=$1; if [ -z $numproc ]; then numproc=4; fi # Number of processes to use

echo $numproc

for (( i=1; i<=$numproc; i++ ))
    do
        echo "Running batch process $i ..."
        screen -Ldm python3 batch.py @# Run the models
        sleep 1
    done

@salvadord
Copy link
Author

FYI, I am unable to reproduce the error on my laptop (Mac OS) using the same setup (50 processes and >1000 trials) but with fast objective func. So might be due to the HPC setup (CentOS, etc) or maybe because it's using NFS, ...

I did reproduce the same issue twice now on the HPC after 1000 trials.

@salvadord
Copy link
Author

Switched to using mysql (after installing required libraries) but getting the error below... any idea how to fix? thx

File "batch.py", line 77, in <module>
  batchOptuna()  # 'simple' or 'complex' 
File "batch.py", line 73, in batchOptuna
  b.run()
File "/home/ext_salvadordura_gmail_com/netpyne/netpyne/batch/batch.py", line 219, in run
  optunaOptim(self, pc)
File "/home/ext_salvadordura_gmail_com/netpyne/netpyne/batch/optuna_parallel.py", line 386, in optunaOptim
  load_if_exists=True, direction=args['direction'])
File "/home/ext_salvadordura_gmail_com/.local/lib/python3.6/site-packages/optuna/study.py", line 866, in create_study
  storage = storages.get_storage(storage)
File "/home/ext_salvadordura_gmail_com/.local/lib/python3.6/site-packages/optuna/storages/__init__.py", line 19, in get_storage
  return _CachedStorage(RDBStorage(storage))
File "/home/ext_salvadordura_gmail_com/.local/lib/python3.6/site-packages/optuna/storages/rdb/storage.py", line 113, in __init__
  models.BaseModel.metadata.create_all(self.engine)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/sql/schema.py", line 4547, in create_all
  ddl.SchemaGenerator, self, checkfirst=checkfirst, tables=tables
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2097, in _run_visitor
  with self._optional_conn_ctx_manager(connection) as conn:
File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
  return next(self.gen)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2089, in _optional_conn_ctx_manager
  with self._contextual_connect() as conn:
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2305, in _contextual_connect
  self._wrap_pool_connect(self.pool.connect, None),
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2343, in _wrap_pool_connect
  e, dialect, self
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1585, in _handle_dbapi_exception_noconnection
  sqlalchemy_exception, with_traceback=exc_info[2], from_=e
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
  raise exception
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 2339, in _wrap_pool_connect
  return fn()
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 364, in connect
  return _ConnectionFairy._checkout(self)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
  fairy = _ConnectionRecord.checkout(pool)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
  rec = pool._do_get()
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/pool/impl.py", line 140, in _do_get
  self._dec_overflow()
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 69, in __exit__
  exc_value, with_traceback=exc_tb,
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
  raise exception
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/pool/impl.py", line 137, in _do_get
  return self._create_connection()
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
  return _ConnectionRecord(self)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
  self.__connect(first_connect_check=True)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
  pool.logger.debug("Error on connect(): %s", e)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 69, in __exit__
  exc_value, with_traceback=exc_tb,
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
  raise exception
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
  connection = pool._invoke_creator(self)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
  return dialect.connect(*cargs, **cparams)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 493, in connect
  return self.dbapi.connect(*cargs, **cparams)
File "/home/ext_salvadordura_gmail_com/.local/lib/python3.6/site-packages/MySQLdb/__init__.py", line 84, in Connect
  return Connection(*args, **kwargs)
File "/home/ext_salvadordura_gmail_com/.local/lib/python3.6/site-packages/MySQLdb/connections.py", line 179, in __init__
  super(Connection, self).__init__(*args, **kwargs2)
sqlalchemy.exc.OperationalError: (MySQLdb._exceptions.OperationalError) (1044, "Access denied for user ''@'localhost' to database './simple/simple_storage.db'")
(Background on this error at: http://sqlalche.me/e/13/e3q8)

@salvadord
Copy link
Author

I'm trying your hot fix on the HPC with the full model.

@salvadord
Copy link
Author

The hot fix seems to be working as it's already on trial 4490... thanks!

@hvy
Copy link
Member

hvy commented Jul 3, 2020

Thanks for the detailed report, and great, maybe we can proceed with the hot fix.

I also reproduced the issue/verified the fix by building sqlite3 from source with a lower SQLITE_MAX_VARIABLE_NUMBER (as seen below) (https://gist.github.com/hvy/3bf60580d810597fcf15bda3f5e6447a) but you are also right about the issue being Linux distribution dependent, c.f. https://bugzilla.redhat.com/show_bug.cgi?id=1798134. You can actually reproduce it on e.g. a mac by running 500000 trials (or creating a query with that many variables in some other way). So the cause was rather simple and it really just boiled down to how sqlite3 was compiled. I'll see if I can come up with decent unit tests for the fix.

[I 2020-07-03 14:53:07,043] Trial 98 finished with value: 0.4749849575921493 and parameters: {'x': 0.23749247879607466}. Best is trial 60 with value: 0.00016943976968276887.
[I 2020-07-03 14:53:07,072] Trial 99 finished with value: 0.0015521006110028536 and parameters: {'x': 0.0007760503055014268}. Best is trial 60 with value: 0.00016943976968276887.
[W 2020-07-03 14:53:07,076] Caught an error from sqlalchemy: (sqlite3.OperationalError) too many SQL variables
[SQL: SELECT trials.trial_id AS trials_trial_id, trials.number AS trials_number, trials.study_id AS trials_study_id, trials.state AS trials_state, trials.value AS trials_value, trials.datetime_start AS trials_datetime_start, trials.datetime_complete AS trials_datetime_complete
FROM trials
WHERE trials.trial_id NOT IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) AND trials.study_id = ?]
[parameters: (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 1)]
(Background on this error at: http://sqlalche.me/e/13/e3q8). Falling back to a slower alternative.
[I 2020-07-03 14:53:07,104] Trial 100 finished with value: 0.15328577383303693 and parameters: {'x': 0.07664288691651847}. Best is trial 60 with value: 0.00016943976968276887.
[I 2020-07-03 14:53:07,131] Trial 101 finished with value: 0.3023945727933951 and parameters: {'x': 0.15119728639669755}. Best is trial 60 with value: 0.00016943976968276887.
[I 2020-07-03 14:53:07,157] Trial 102 finished with value: 0.12291831681480901 and parameters: {'x': 0.061459158407404506}. Best is trial 60 with value: 0.00016943976968276887.
[I 2020-07-03 14:53:07,183] Trial 103 finished with value: 0.00535384280555289 and parameters: {'x': 0.002676921402776445}. Best is trial 60 with value: 0.00016943976968276887.

Note: sqlite3 is part of the Python standard library, but it dynamically links to a sqlite3 shared object when being imported which can be swapped out to any other version.

@hvy
Copy link
Member

hvy commented Jul 7, 2020

Closing this issue as the fix #1467 has been merged. @salvadord, feel free to reopen as needed.

@hvy hvy closed this as completed Jul 7, 2020
@salvadord
Copy link
Author

salvadord commented Sep 19, 2020

I'm actually still getting this issue after around trial 1000; same setup as above (CentOS), but with optuna updated to v2.1:

In [4]: sqlite3.version
Out[4]: '2.6.0'

In [5]: sqlite3.sqlite_version
Out[5]: '3.7.17'

In [10]: sqlalchemy.version
Out[10]: '1.3.19'

In [14]: optuna.version
Out[14]: '2.1.0'

Error:
22, 1223, 1224, 1225, 1226, 1227, 1228, 1229, 1230, 1231, 1232, 1233, 1234, 1235, 1236, 1)] (Background on this error at: http://sqlalche.me/e/13/e3q8). Falling back to a slower alternative. [W 2020-09-15 21:20:20,019] Caught an error from sqlalchemy: (sqlite3.OperationalError) too many SQL variables [SQL: SELECT trials.trial_id AS trials_trial_id, trials.number AS trials_number, trials.study_id AS trials_study_id, trials.state AS trials_state, trials.value AS trials_value, trials.datetime_start AS trials_datetime_start, trials.datetime_complete AS trials_datetime_complete FROM trials WHERE trials.trial_id IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,

@hvy can you please confirm the hot fix is working ok in the main branch? thanks

@salvadord
Copy link
Author

@hvy I tried reopening this issue, but don't think I have permissions to do this.

@hvy hvy reopened this Sep 23, 2020
@hvy
Copy link
Member

hvy commented Sep 23, 2020

Wondering why it's not caught in https://github.com/optuna/optuna/blob/master/optuna/storages/_rdb/storage.py#L979? Could you give repro steps, more context and possibly the full stack trace?

@salvadord
Copy link
Author

The setup is the same described above (#1457 (comment)).

I'll try to reproduce with simpler steps if possible. thx

@salvadord
Copy link
Author

I was able to reproduce the issue with simpler code and figured out it was due to having the following line (one of the solutions suggested above a while ago):
study._storage = study._storage._backend # avoid using cached storage

@hvy
Copy link
Member

hvy commented Sep 29, 2020

Thanks for the update. Also, I was mistaking your previous log as if an error was raised to the user code but taking a proper look I now see that the error is caught inside _get_trials and it's falling back to the slower alternative as expected. I think we can close this issue once again. Feel free to ping or to reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue/PR about behavior that is broken. Not for typos/examples/CI/test but for Optuna itself. question Question about Optuna.
Projects
None yet
Development

No branches or pull requests

3 participants