WIP: Update `parameter.add_data()` functionality #112
Thanks @glatterf42! I didn't have a chance to look at the code in detail, but if pulling all data and doing the comparison in memory is faster, that sounds like a good approach to me.
Force-pushed from acc783f to a9e5ac8, with the commits:

* Covers: run__id, data, name, uniqueness of name together with run__id
* Adapts tests since default order of columns changes
* Make Column generic enough for multiple parents
* Introduce optimization.Parameter
* Add tests for add_data
* Enable remaining parameter tests (#86)
* Include optimization parameter api layer (#89)
* Bump several dependency versions
* Let api/column handle both tables and parameters
* Make api-layer tests pass
* Include optimization parameter core layer (#90)
* Enable parameter core layer and test it
* Fix things after rebase
* Ensure all intended changes survive the rebase
* Adapt data validation function for parameters
* Allow tests to pass again
A while ago, Volker asked me to also benchmark upserting a parameter with roughly ten indexsets, but a million values. This gives a similar picture to before, though it may be a little worrying (I still have to look into it further): the only test that finished within an hour was the one using pandas and sqlite:

```
----------------------------------------------- benchmark: 1 tests -----------------------------------------------
Name (time in s)           Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------
test_add_data[sqlite]  96.3460  96.3460  96.3460  0.0000  96.3460  0.0000       0;0  0.0104       1           1
-------------------------------------------------------------------------------------------------------------------
```

I'm not sure if 96 seconds is fine for us for this kind of operation; I'll benchmark the same operation with ixmp tomorrow.

When running on postgres, the benchmark produced this error:

```pytb
self = <psycopg.Cursor [closed] [IDLE] (host=localhost user=postgres database=test) at 0x7f000bcd8290>
query = 'UPDATE optimization_parameter SET data=%(data)s::JSONB WHERE optimization_parameter.id = %(optimization_parameter_id)s::INTEGER'
params = {'data': Jsonb({'Indexset 0': [0, 0, 0, 0, 0, 0, 0 ... (128780213 chars)), 'optimization_parameter_id': 1}

    def execute(
        self,
        query: Query,
        params: Optional[Params] = None,
        *,
        prepare: Optional[bool] = None,
        binary: Optional[bool] = None,
    ) -> Self:
        """
        Execute a query or command to the database.
        """
        try:
            with self._conn.lock:
                self._conn.wait(
                    self._execute_gen(query, params, prepare=prepare, binary=binary)
                )
        except e._NO_TRACEBACK as ex:
>           raise ex.with_traceback(None)
E           sqlalchemy.exc.OperationalError: (psycopg.errors.ProgramLimitExceeded) total size of jsonb object elements exceeds the maximum of 268435455 bytes
E           CONTEXT: unnamed portal parameter $1 = '...'
E           [SQL: UPDATE optimization_parameter SET data=%(data)s::JSONB WHERE optimization_parameter.id = %(optimization_parameter_id)s::INTEGER]
E           [parameters: {'data': Jsonb({'Indexset 0': [0, 0, 0, 0, 0, 0, 0 ... (128780213 chars)), 'optimization_parameter_id': 1}]
E           (Background on this error at: https://sqlalche.me/e/20/e3q8)

.venv/lib/python3.12/site-packages/psycopg/cursor.py:732: OperationalError
```

Meanwhile, when using the in-DB …
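For reference, the failing statement corresponds to an ORM-level update along these lines (a minimal sketch reconstructed from the logged SQL; the model definition and session handling are assumptions, not the actual ixmp4 code):

```python
# Sketch of the kind of update that hits the jsonb size limit: the whole
# JSON document is serialized into a single bind parameter. `Parameter` is
# a hypothetical stand-in for whatever maps `optimization_parameter` in ixmp4.
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class Parameter(Base):
    __tablename__ = "optimization_parameter"

    id: Mapped[int] = mapped_column(primary_key=True)
    data: Mapped[dict] = mapped_column(JSONB)


def overwrite_data(session: Session, parameter_id: int, data: dict) -> None:
    # For a ~128 MB payload, postgres rejects the resulting jsonb value
    # ("total size of jsonb object elements exceeds the maximum of
    # 268435455 bytes"), since that cap applies to the stored value itself.
    session.execute(
        sa.update(Parameter).where(Parameter.id == parameter_id).values(data=data)
    )
    session.commit()
```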
Running the same benchmark (upserting a million values to one parameter with 12 indexsets) in ixmp takes just 53.48 seconds, so the current pandas implementation in ixmp4 would actually constitute a downgrade, it seems.
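For context, benchmarks like the ones quoted in this thread can be written with pytest-benchmark roughly as follows (a sketch under assumptions: the `platform` fixture and the exact ixmp4 method names and signatures are illustrative, not the actual test code):

```python
# Sketch of a pytest-benchmark test in the spirit of the runs quoted above.
# Assumes `platform` is a fixture yielding an ixmp4 Platform backed by the
# DB under test.
import pandas as pd
import pytest

N_INDEXSETS = 10
N_VALUES = 1_000_000


@pytest.fixture
def big_data() -> pd.DataFrame:
    # One row per value: each indexset column holds the element picked for
    # that row, plus the required "values" and "units" columns.
    data = {f"Indexset {i}": range(N_VALUES) for i in range(N_INDEXSETS)}
    df = pd.DataFrame(data)
    df["values"] = 1.0
    df["units"] = "unit"
    return df


def test_add_data(platform, big_data, benchmark):
    platform.units.create("unit")  # units referenced by the data must exist
    run = platform.runs.create("Model", "Scenario")
    index_names = [f"Indexset {i}" for i in range(N_INDEXSETS)]
    for name in index_names:
        indexset = run.optimization.indexsets.create(name)
        indexset.add(big_data[name].unique().tolist())
    parameter = run.optimization.parameters.create(
        "Parameter", constrained_to_indexsets=index_names
    )
    # pytest-benchmark times the upsert itself.
    benchmark(parameter.add_data, big_data)
```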
Force-pushed from 460f63c to c6b29b9, with the commits listed above plus:

* Disable those until their PRs are merged
* Update to pandas workflow
* New in-DB functionality for testing
As discovered in #108, we need a different implementation of `parameter.add_data()`. Ideally, we would like to make use of the `JSONB` type in-DB to avoid parsing the JSON field. However, with my current implementation, even that (which is only possible on postgres in the first place) is not at all faster than the pandas implementation (i.e. parsing the field to Python, using pandas, and writing it back).

I've tested this with a parameter with 1 indexset with 1000 values and with 60 indexsets with 1000 values (though the parameter was limited to 10000 values and units there): pandas is faster. Some numbers from my most recent test run:
You'll notice that `test_add_data_json[postgres]` is missing: this test currently fails because it struggles to read in the `column.name`s at one point. Hardcoding them does work, so it's doable, and you can test locally: it's still slower than pandas.

In general, `_json` indicates the variants working directly in-DB, while those without the suffix are the pandas versions.

We might want to consider something like this if we want to use temporary tables in postgres.
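To make the comparison concrete, the pandas variant boils down to something like the following round trip (a minimal sketch; the function and its signature are illustrative, not the actual ixmp4 code):

```python
# Sketch of the pandas round trip: parse the stored JSON field into a
# DataFrame, upsert the new values in memory, and serialize the whole
# document back for the write.
import pandas as pd


def upsert_parameter_data(
    stored: dict[str, list], new: dict[str, list], index_columns: list[str]
) -> dict[str, list]:
    existing_df = pd.DataFrame(stored)
    new_df = pd.DataFrame(new)
    # Upsert semantics: rows sharing the same indexset combination are
    # overwritten by the incoming values (keep the last occurrence).
    combined = pd.concat([existing_df, new_df], ignore_index=True).drop_duplicates(
        subset=index_columns, keep="last"
    )
    # Back to the column-oriented dict stored in the JSON field.
    return combined.to_dict(orient="list")
```

The whole document travels between server and client twice here (read, then write back), which is exactly what the in-DB `_json` variants try to avoid.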
Finally, for the 10000-values case, the in-DB workflow was not just slower: it didn't finish at all. My system itself kept working, but the test case didn't complete within 30 minutes. Pandas, on the other hand, took 2.8 seconds at most (with a mean of 2.5).
So please, @danielhuppmann, let me know if using pandas is okay or if we should find a way to utilize JSONB (which would be nice for sure).
And if we go for JSONB, @meksor, please help me figure out the proper way to use it, i.e. one that's actually faster than pandas.
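If we do go for JSONB, one possible direction (purely a sketch; it handles appends only, not the upsert semantics of `add_data`, and whether it actually beats pandas is exactly the open question) would be per-key array concatenation in-DB, so that only the new values are sent to the server:

```python
# Sketch: append new values in-DB by concatenating the per-key jsonb arrays,
# so only the increment travels to the server. Caveats (all assumptions):
# - pure append; duplicates/upserts are not handled,
# - keys present in the new data but absent from the stored document are ignored,
# - postgres' jsonb size cap (268435455 bytes, see the traceback above)
#   still applies to the resulting document.
import sqlalchemy as sa

APPEND_DATA = sa.text(
    """
    UPDATE optimization_parameter
    SET data = (
        SELECT jsonb_object_agg(
            each.key,
            each.value || coalesce(CAST(:new_data AS jsonb) -> each.key, '[]'::jsonb)
        )
        FROM jsonb_each(optimization_parameter.data) AS each
    )
    WHERE optimization_parameter.id = :parameter_id
    """
)

# Usage sketch: pass the new values as a JSON string, e.g.
# session.execute(APPEND_DATA, {"new_data": json.dumps(new), "parameter_id": 1})
```

Given the size cap hit above, temporary tables (as linked) may still be the more robust direction at this scale.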