WIP: Update `parameter.add_data()` functionality #112
Thanks @glatterf42! I didn't have a chance to look at the code in detail, but if pulling all data and doing the comparison in memory is faster, that sounds like a good approach to me.
Force-pushed from acc783f to a9e5ac8, with the commits:

* Covers: run__id, data, name, uniqueness of name together with run__id
* Adapts tests since default order of columns changes
* Make Column generic enough for multiple parents
* Introduce optimization.Parameter
* Add tests for add_data
* Enable remaining parameter tests (#86)
* Include optimization parameter api layer (#89)
* Bump several dependency versions
* Let api/column handle both tables and parameters
* Make api-layer tests pass
* Include optimization parameter core layer (#90)
* Enable parameter core layer and test it
* Fix things after rebase
* Ensure all intended changes survive the rebase
* Adapt data validation function for parameters
* Allow tests to pass again
A while ago, Volker asked me to also benchmark upserting a parameter with roughly ten indexsets, but a million values. This gives a similar picture to before, though it may be a little worrying (I still have to look into it further): the only test that finished within an hour was the one using pandas and sqlite:

```
----------------------------------------------- benchmark: 1 tests -----------------------------------------------
Name (time in s)           Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------
test_add_data[sqlite]  96.3460  96.3460  96.3460  0.0000  96.3460  0.0000       0;0  0.0104       1           1
-------------------------------------------------------------------------------------------------------------------
```

I'm not sure if 96 seconds is fine for us for this kind of operation; I'll benchmark the same operation with ixmp tomorrow.

When running on postgres, the benchmark produced this error:

```pytb
self = <psycopg.Cursor [closed] [IDLE] (host=localhost user=postgres database=test) at 0x7f000bcd8290>
query = 'UPDATE optimization_parameter SET data=%(data)s::JSONB WHERE optimization_parameter.id = %(optimization_parameter_id)s::INTEGER'
params = {'data': Jsonb({'Indexset 0': [0, 0, 0, 0, 0, 0, 0 ... (128780213 chars)), 'optimization_parameter_id': 1}

    def execute(
        self,
        query: Query,
        params: Optional[Params] = None,
        *,
        prepare: Optional[bool] = None,
        binary: Optional[bool] = None,
    ) -> Self:
        """
        Execute a query or command to the database.
        """
        try:
            with self._conn.lock:
                self._conn.wait(
                    self._execute_gen(query, params, prepare=prepare, binary=binary)
                )
        except e._NO_TRACEBACK as ex:
>           raise ex.with_traceback(None)
E           sqlalchemy.exc.OperationalError: (psycopg.errors.ProgramLimitExceeded) total size of jsonb object elements exceeds the maximum of 268435455 bytes
E           CONTEXT: unnamed portal parameter $1 = '...'
E           [SQL: UPDATE optimization_parameter SET data=%(data)s::JSONB WHERE optimization_parameter.id = %(optimization_parameter_id)s::INTEGER]
E           [parameters: {'data': Jsonb({'Indexset 0': [0, 0, 0, 0, 0, 0, 0 ... (128780213 chars)), 'optimization_parameter_id': 1}]
E           (Background on this error at: https://sqlalche.me/e/20/e3q8)

.venv/lib/python3.12/site-packages/psycopg/cursor.py:732: OperationalError
```

Meanwhile, when using the in-DB …
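For reference, the failing statement corresponds to an ORM-level update along these lines (a minimal sketch reconstructed from the logged SQL; the model definition and session handling are assumptions, not the actual ixmp4 code):

```python
# Sketch of the kind of update that hits the jsonb size limit: the whole
# JSON document is serialized into a single bind parameter. `Parameter` is
# a hypothetical stand-in for whatever maps `optimization_parameter` in ixmp4.
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class Parameter(Base):
    __tablename__ = "optimization_parameter"

    id: Mapped[int] = mapped_column(primary_key=True)
    data: Mapped[dict] = mapped_column(JSONB)


def overwrite_data(session: Session, parameter_id: int, data: dict) -> None:
    # For a ~128 MB payload, postgres rejects the resulting jsonb value
    # ("total size of jsonb object elements exceeds the maximum of
    # 268435455 bytes"), since that cap applies to the stored value itself.
    session.execute(
        sa.update(Parameter).where(Parameter.id == parameter_id).values(data=data)
    )
    session.commit()
```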
Running the same benchmark (upserting a million values to one parameter with 12 indexsets) in ixmp takes just 53.48 seconds, so the current pandas implementation in ixmp4 would actually constitute a downgrade, it seems.
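For context, benchmarks like the ones quoted in this thread can be written with pytest-benchmark roughly as follows (a sketch under assumptions: the `platform` fixture and the exact ixmp4 method names and signatures are illustrative, not the actual test code):

```python
# Sketch of a pytest-benchmark test in the spirit of the runs quoted above.
# Assumes `platform` is a fixture yielding an ixmp4 Platform backed by the
# DB under test.
import pandas as pd
import pytest

N_INDEXSETS = 10
N_VALUES = 1_000_000


@pytest.fixture
def big_data() -> pd.DataFrame:
    # One row per value: each indexset column holds the element picked for
    # that row, plus the required "values" and "units" columns.
    data = {f"Indexset {i}": range(N_VALUES) for i in range(N_INDEXSETS)}
    df = pd.DataFrame(data)
    df["values"] = 1.0
    df["units"] = "unit"
    return df


def test_add_data(platform, big_data, benchmark):
    platform.units.create("unit")  # units referenced by the data must exist
    run = platform.runs.create("Model", "Scenario")
    index_names = [f"Indexset {i}" for i in range(N_INDEXSETS)]
    for name in index_names:
        indexset = run.optimization.indexsets.create(name)
        indexset.add(big_data[name].unique().tolist())
    parameter = run.optimization.parameters.create(
        "Parameter", constrained_to_indexsets=index_names
    )
    # pytest-benchmark times the upsert itself.
    benchmark(parameter.add_data, big_data)
```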
Force-pushed from 460f63c to c6b29b9, with the commits listed above plus:

* Disable those until their PRs are merged
* Update to pandas workflow
* New in-DB functionality for testing
As discovered in #108, we need a different implementation of `parameter.add_data()`. Ideally, we would like to make use of the `JSONB` type in-DB to avoid parsing the JSON field. However, with my current implementation, even that (which is only possible on postgres in the first place) is not at all faster than the pandas implementation (i.e. parsing the field to Python, using pandas, and writing it back).

I've tested this with a parameter with 1 indexset with 1000 values and with 60 indexsets with 1000 values (though the parameter was limited to 10000 values and units there): pandas is faster. Some numbers from my most recent test run:
You'll notice that `test_add_data_json[postgres]` is missing: this test currently fails because it struggles to read in the `column.name`s at one point. Hardcoding them does work, so it's doable, and you can test locally: it's still slower than pandas.

In general, `_json` indicates the variants working directly in-DB, while those without the suffix are the pandas versions.

We might want to consider something like this if we want to use temporary tables in postgres.
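To make the comparison concrete, the pandas variant boils down to something like the following round trip (a minimal sketch; the function and its signature are illustrative, not the actual ixmp4 code):

```python
# Sketch of the pandas round trip: parse the stored JSON field into a
# DataFrame, upsert the new values in memory, and serialize the whole
# document back for the write.
import pandas as pd


def upsert_parameter_data(
    stored: dict[str, list], new: dict[str, list], index_columns: list[str]
) -> dict[str, list]:
    existing_df = pd.DataFrame(stored)
    new_df = pd.DataFrame(new)
    # Upsert semantics: rows sharing the same indexset combination are
    # overwritten by the incoming values (keep the last occurrence).
    combined = pd.concat([existing_df, new_df], ignore_index=True).drop_duplicates(
        subset=index_columns, keep="last"
    )
    # Back to the column-oriented dict stored in the JSON field.
    return combined.to_dict(orient="list")
```

The whole document travels between server and client twice here (read, then write back), which is exactly what the in-DB `_json` variants try to avoid.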
Finally, for the 10000-values case, the in-DB workflow was not just slower: it didn't finish at all. My system itself kept working, but the test case didn't complete within 30 minutes. Pandas, on the other hand, took 2.8 seconds at most (with a mean of 2.5).
So please, @danielhuppmann, let me know if using pandas is okay or if we should find a way to utilize JSONB (which would be nice for sure).
And if we go for JSONB, @meksor, please help me figure out the proper way to use it, i.e. one that's actually faster than pandas.
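If we do go for JSONB, one possible direction (purely a sketch; it handles appends only, not the upsert semantics of `add_data`, and whether it actually beats pandas is exactly the open question) would be per-key array concatenation in-DB, so that only the new values are sent to the server:

```python
# Sketch: append new values in-DB by concatenating the per-key jsonb arrays,
# so only the increment travels to the server. Caveats (all assumptions):
# - pure append; duplicates/upserts are not handled,
# - keys present in the new data but absent from the stored document are ignored,
# - postgres' jsonb size cap (268435455 bytes, see the traceback above)
#   still applies to the resulting document.
import sqlalchemy as sa

APPEND_DATA = sa.text(
    """
    UPDATE optimization_parameter
    SET data = (
        SELECT jsonb_object_agg(
            each.key,
            each.value || coalesce(CAST(:new_data AS jsonb) -> each.key, '[]'::jsonb)
        )
        FROM jsonb_each(optimization_parameter.data) AS each
    )
    WHERE optimization_parameter.id = :parameter_id
    """
)

# Usage sketch: pass the new values as a JSON string, e.g.
# session.execute(APPEND_DATA, {"new_data": json.dumps(new), "parameter_id": 1})
```

Given the size cap hit above, temporary tables (as linked) may still be the more robust direction at this scale.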