
Include optimization.table #40

Merged · 39 commits into main from include/optimization-table · Apr 12, 2024

Conversation

glatterf42
Member

@glatterf42 glatterf42 commented Jan 26, 2024

The next part of adding the message_ix/ixmp data model.

Still to be done:

  • Add list, enumerate, tabulate, docs functionality in DB layer
  • Add API layer
  • Add Core layer

For now, still a WIP.

@glatterf42 glatterf42 added the enhancement label Jan 26, 2024
@glatterf42 glatterf42 self-assigned this Jan 26, 2024
@glatterf42
Member Author

glatterf42 commented Jan 26, 2024

While this PR is still a WIP, you can already take a look at what is arguably the most interesting part: adding data in the DB layer to a Table, which requires lots of validation. The following behaviour and possibly more can be inferred from the existing tests and files in the data/abstract and data/db directories.

Current details on adding and validating data

How does it work?

Currently, data can be provided as either a dict or a pd.DataFrame. Note that when creating a dict with values of len() == 1, you still need to provide them as lists, like data = {"key": [value]}. This is because we use pandas and its existing functionality for validation, which means converting the dict to a DataFrame, and that conversion requires list-like values.
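This constraint can be seen directly in pandas (a minimal illustration, not ixmp4 code): a dict of bare scalars cannot be turned into a DataFrame, while list-wrapped values can.

```python
import pandas as pd

# A dict of bare scalars cannot be converted: pandas has no row index to infer.
try:
    pd.DataFrame.from_dict({"key": "value"})
except ValueError:
    print("scalar values need an index")

# Wrapping each value in a list works, even for a single row:
df = pd.DataFrame.from_dict({"key": ["value"]})
print(df.shape)  # (1, 1)
```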

Before validation, though, we use the dict form of the data to merge them with possibly pre-existing Table.data. This is done via the union operator dict1 | dict2 for dicts, which overwrites keys in dict1 if dict2 also provides them, so this already goes in the direction of updating data and probably warrants logging information at least.
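A minimal illustration of the union operator's overwrite semantics (plain Python, no ixmp4 involved):

```python
existing = {"Column 1": ["foo"], "Column 2": [1]}
incoming = {"Column 2": [2], "Column 3": ["baz"]}

# dict1 | dict2 keeps all keys, but values from dict2 win on collisions:
merged = existing | incoming
print(merged)  # {'Column 1': ['foo'], 'Column 2': [2], 'Column 3': ['baz']}
```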

Whenever table.data is set (note that table.data |= data does not count as setting, while table.data = table.data | data does; similar to adding elements to IndexSets), the validate_data() function is triggered and checks that

  • no values are missing (each column contains the same number of values);
  • there are no duplicate rows in the data;
  • all columns only contain data allowed as per the linked Indexset.elements

At the moment, all of these cases raise ValueErrors (though with differing messages); I don't know if we want to raise custom errors instead.
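The three checks map onto standard pandas operations; here is a sketch of each on a toy frame, where `allowed` stands in for the collected IndexSet elements (an assumption for illustration):

```python
import pandas as pd

# Toy data and a stand-in for the linked IndexSet elements per column:
allowed = {"Column 1": ["foo", "bar"], "Column 2": [1, 2, 3]}
df = pd.DataFrame({"Column 1": ["foo", "bar"], "Column 2": [1, 2]})

assert not df.isna().any(axis=None)     # 1: no missing values
assert df.value_counts().max() <= 1     # 2: no duplicate rows
assert df.isin(allowed).all(axis=None)  # 3: only allowed values per column
```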

IndexSets are constrained when the Table is created. By default, the IndexSet names become the column names, but this can be overridden by specifying column_names=[...]. In that case, column_names needs to contain exactly one unique name for each column.

Some more edge case considerations

len(data.keys()) != len(constrained_to_indexsets)

I've tried to ensure that several edge cases are covered already. For example, if you provide a data dict that doesn't contain data for all specified columns, nothing happens. We should introduce a check that all required data is present before the Table is used elsewhere, but it's fine to add the data piecemeal. If you provide duplicate keys in the same dict, the latter will overwrite the former, but tools like ruff will warn you before that happens.
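The duplicate-key behaviour is plain Python semantics: in a dict literal, the last occurrence of a key wins.

```python
# Python silently keeps only the last value for a repeated literal key
# (linters such as ruff flag this pattern before it bites):
data = {"Column 1": ["foo"], "Column 1": ["bar"]}
print(data)  # {'Column 1': ['bar']}
```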

Using an IndexSet twice in constrained_to_indexsets

For now, this is fine. I think something like a model_years indexset might be used for something like year_active and year_vintage, but @danielhuppmann, please let me know if this should be constrained as well.

Duplicates in column_names

Will raise a ValueError.

Different data types

As one test demonstrates (with table_5, I think), a column can contain data of different types. At one point, I noticed Column.dtype being a strange object, but I've not studied this further for now since I don't know what we'll be using the dtype for and it's not currently in use, anyway. @meksor, maybe you could tell me again what the dtype and unique parameters from the requirements were supposed to achieve? I might be missing more than I think.
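The mixed-type behaviour comes straight from pandas: a column holding several Python types falls back to the catch-all object dtype, which may be what made Column.dtype look strange.

```python
import pandas as pd

# A column mixing strings, ints, and floats gets dtype 'object':
df = pd.DataFrame({"mixed": ["foo", 1, 3.14]})
print(df["mixed"].dtype)  # object
```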

@glatterf42 glatterf42 marked this pull request as ready for review February 16, 2024 12:00
@glatterf42
Member Author

glatterf42 commented Feb 19, 2024

Questions and possible ToDos

@meksor and @danielhuppmann, sorry for the long post. While cleaning up this PR, I came across the following questions/notes I took during its implementation. I'd like to clarify them before merging the PR even if our solution is to open a new issue and include the fix/expansion in another PR.

Adding data to a Table in the data layer

We currently have the following syntax for that in the DB layer (very similar to how elements are added to an Indexset):

        table = test_mp.backend.optimization.tables.create(
            run_id=run.id,
            name="Table",
            constrained_to_indexsets=[indexset_1.name, indexset_2.name],
        )
        test_mp.backend.optimization.tables.add_data(
            table_id=table.id, data=test_data_1
        )
        table = test_mp.backend.optimization.tables.get(run_id=run.id, name="Table")
        assert table.data == test_data_1

I'm wondering if we want to keep it that way. In the core layer, we have

        table = run.optimization.tables.create(
            "Table",
            constrained_to_indexsets=[indexset.name, indexset_2.name],
        )
        table.add(data=test_data_1)
        assert table.data == test_data_1

So the object is updated without another get() call. Note that this only works because table.add() both adds the data and calls get() to return the updated Table.

Linking each Column to a unique Indexset

I'm wondering if different Columns of a Table can be constrained to the same Indexset. Or in other words, I'm wondering if the following should raise an error:

        with pytest.raises(ValueError):
            _ = test_mp.backend.optimization.tables.create(
                run_id=run.id,
                name="Table 2",
                constrained_to_indexsets=[indexset_1.name, indexset_1.name],
                column_names=["Column 1", "Column 2"]
            )

Raising distinct error messages when validating data

At the moment, we have numerous validation checks for data that is being added to a Table, but all of these checks only raise ValueErrors (though with unique messages):

    def validate_data(self, key, data: dict[str, Any]):
        # if isinstance(data, dict):
        data_frame: pd.DataFrame = pd.DataFrame.from_dict(data)
        # TODO: for all of the following, we might want to create unique exceptions
        # TODO: we could make this more specific, e.g. by pointing to the missing values
        if data_frame.isna().any(axis=None):
            raise ValueError(
                "Table.data is missing values, please make sure it does "
                "not contain None or NaN, either!"
            )
        # TODO: we can make this more specific, e.g. highlighting all duplicate
        # rows via pd.DataFrame.duplicated(keep=False)
        if data_frame.value_counts().max() > 1:
            raise ValueError("Table.data contains duplicate rows!")
        # TODO: can we make this more specific? Iterate over columns; if any is
        # False, return its name or something?
        limited_to_indexsets = self.collect_indexsets_to_check()
        if not data_frame.isin(limited_to_indexsets).all(axis=None):
            raise ValueError(
                "Table.data contains keys and/or values that are not allowed as per "
                "the IndexSets and Columns it is constrained to!"
            )

And

    def create(
        self,
        run_id: int,
        name: str,
        constrained_to_indexsets: list[str],  # TODO: try passing a str to this
        column_names: list[str] | None = None,
        **kwargs,
    ) -> Table:
        if column_names and len(column_names) != len(constrained_to_indexsets):
            raise ValueError(
                "`constrained_to_indexsets` and `column_names` not equal in length! "
                "Please provide the same number of entries for both!"
            )
        # TODO: activate something like this if each column must be indexed by a
        # unique indexset
        # if len(constrained_to_indexsets) != len(set(constrained_to_indexsets)):
        #     raise ValueError("Each dimension must be constrained to a unique indexset!")  # noqa
        if column_names and len(column_names) != len(set(column_names)):
            raise ValueError("The given `column_names` are not unique!")

So my question is: are we fine with that or would we want to have distinct custom errors for all these checks?

Allowing Table.data to be added piecemeal

Currently, we allow adding data for the various Columns piecemeal like so:

        table_3 = run.optimization.tables.create(
            name="Table 3",
            constrained_to_indexsets=[indexset.name, indexset_2.name],
            column_names=["Column 1", "Column 2"],
        )
        table_3.add(data={"Column 1": ["bar"]})
        assert table_3.data == {"Column 1": ["bar"]}

        table_3.add(data={"Column 2": [2]})
        assert table_3.data == {"Column 1": ["bar"], "Column 2": [2]}

        table_3.add(
            data=pd.DataFrame({"Column 1": ["foo"], "Column 2": [3]}),
        )
        assert table_3.data == {"Column 1": ["foo"], "Column 2": [3]}

If the data to be added contains data for a Column that already has data, that Column's data is overwritten.
If we are fine with this behavior, we should probably still

  • Add functionality to check which data is already present/still needed (we might also want to somehow include information on how to see which data are permitted by an Indexset)
  • Log information if a Column's data is overwritten
  • Add a check that all data is present before the Table is used elsewhere (though I'm not sure this is possible since there's no clear endpoint of "this is where we stop adding data to the Table", so it's not clear to me when this kind of test should occur)
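For the first bullet, a hypothetical helper (missing_columns is an illustrative name, not existing API) could report which Columns still lack data:

```python
# Hypothetical helper, not part of ixmp4: given the Table's column names and
# its current data dict, list the columns that still need values.
def missing_columns(column_names: list[str], data: dict[str, list]) -> list[str]:
    return [name for name in column_names if not data.get(name)]

print(missing_columns(["Column 1", "Column 2"], {"Column 1": ["bar"]}))
# ['Column 2']
```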

Specifying the type of constrained_to_indexsets

In the core layer, we currently support tables.create() to receive the parameter constrained_to_indexsets as a list of strings:

def create(
        self,
        name: str,
        constrained_to_indexsets: list[str],
        column_names: list[str] | None = None,
    ) -> Table:

However, we return the property Table.constrained_to_indexsets as a list of integers:

    def constrained_to_indexsets(self) -> list[int]:
        return [column.constrained_to_indexset for column in self._model.columns]

Which is the preferred form? Or should we accept list[str | int] for create() and still only return int?
In the DB layer, such a property does not exist. A Table has Columns and each Column has constrained_to_indexset, which is of type int. I only added the property in the core layer on a whim because I thought it might be convenient.

@glatterf42
Member Author

Handling Columns of Tables

At the moment, Columns are only added to a Table during the creation of the Table. Do we want to keep it that way?
We could alternatively think about allowing Columns to be added to Tables manually after their creation and also then allow removing Columns from Tables (which might entail removing data).

@danielhuppmann
Member

danielhuppmann commented Feb 19, 2024

Thanks @glatterf42, see my responses below:

Adding data to a Table in the data layer

Yes, this approach makes sense to me.

You could improve performance by setting table to None after an add(), and only get() when table is explicitly accessed afterwards. This way, a user wouldn't have to load the entire table multiple times.

    def add(self):
        test_mp.backend.optimization.tables.add_data(
            table_id=table.id, data=test_data_1
        )
        self._table = None

    @property
    def table(self):
        if self._table is None:
            self._table = test_mp.backend.optimization.tables.get(
                run_id=run.id, name="Table"
            )
        return self._table

Linking each Column to a unique Indexset

Yes, it is absolutely crucial that multiple columns can be foreign-keyed to the same index set. Only the name has to be unique.

Raising distinct error messages when validating data

Good enough for now, but maybe to be improved later.

Allowing Table.data to be added piecemeal

Fine in principle to add data step by step similar to the current behavior, but data should be added row-wise. So if you have a table with two columns, the following line should raise an error.

table_3.add(data={"Column 1": ["bar"]})
> ValueError: Missing entry for 'Column 2'
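A sketch of the requested row-wise check, assuming access to the Table's column names (check_row_wise is a hypothetical name, not existing API):

```python
# Hypothetical sketch: reject add() calls that do not cover every column.
def check_row_wise(column_names: list[str], data: dict[str, list]) -> None:
    missing = sorted(set(column_names) - set(data))
    if missing:
        raise ValueError(f"Missing entry for {missing}")

# Covers both columns, so no error is raised:
check_row_wise(["Column 1", "Column 2"], {"Column 1": ["bar"], "Column 2": [2]})
```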

Specifying the type of constrained_to_indexsets

I assume that the int is the unique id of an indexset? In terms of usability, I would only show the (unique) name of an indexset to a user.

Handling Columns of Tables

It should not be possible to add columns to a table after creation.

* Remove some outdated TODOs
* Make _add_column() a private function
* Return Indexset.names as constrained_to_indexsets
* Enforce that Table.data be added row-wise
@danielhuppmann
Member

Thanks @glatterf42

To be clear about the current workflow for adding data to a table: you'd create the table and then need to call add_data() with all the data you want to add. If you add something like {"test": 2, "nexttest": 3} and later decide you also (or instead) want to add {"test": 4, "nexttest": 5}, we currently take that as an "instead" and overwrite the existing keys.

No, this is not the expected use case, given current behavior of ixmp and usual modelling workflows.

Imagine a "table" being the investment costs for different power plants in several regions. You often have situations where a modeler has a script e.g. computing parameters for only one type of power plant (or one region).

Similar to the current MESSAGE tutorials, I see the need for a workflow like

inv_cost_wind = pd.DataFrame()  # get a dataframe of investment cost parameters for a certain technology
run.parameters.get("inv_cost").add_data(inv_cost_wind)

adding (or replacing existing) datapoints but not removing existing parameter datapoints.
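One way to get those semantics with pandas (a sketch with illustrative column names, not the ixmp4 implementation): concatenate the update onto the existing data and keep the last row per index-column combination, so matching rows are replaced and everything else survives.

```python
import pandas as pd

existing = pd.DataFrame(
    {"technology": ["wind", "wind", "water"],
     "region": ["EEU", "WEU", "EEU"],
     "value": [1.0, 2.0, 3.0]}
)
update = pd.DataFrame({"technology": ["wind"], "region": ["EEU"], "value": [9.0]})

# Rows with the same index columns are replaced; all others are kept.
merged = (
    pd.concat([existing, update])
    .drop_duplicates(subset=["technology", "region"], keep="last")
    .reset_index(drop=True)
)
print(len(merged))  # 3
```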

@glatterf42
Member Author

Sorry, I don't quite follow: in your example, would inv_cost_wind contain all the data currently in the parameter plus the updated values or would it only contain the values that need updating?

And mainly for me, to clarify: I'm imagining a table like this:

| Power plant type | Region | Investment cost |
|---|---|---|
| wind | EEU | x |
| wind | WEU | y |
| water | EEU | z |

And a modeler might now want to update wind in EEU to w, so they would only provide the w value (plus other columns required for this one row) and expect this to overwrite the existing value for that row, while leaving the rest untouched?

@danielhuppmann
Member

danielhuppmann commented Mar 8, 2024

The usual workflow is to only provide the data that should be changed.

For example, IEA publishes new wind turbine cost estimates and a modeller updates the data like

run.parameters.get("inv_cost").add_data(
    {"technology": "wind", "region": "EEU", "value": w, "unit": "EUR/GW"}
)

@glatterf42
Member Author

Okay, I've updated the behaviour to this:

        table_3 = run.optimization.tables.create(
            name="Table 3",
            constrained_to_indexsets=[indexset.name, indexset_2.name],
            column_names=["Column 1", "Column 2"],
        )

        table_3.add(data={"Column 1": ["bar"], "Column 2": [2]})
        assert table_3.data == {"Column 1": ["bar"], "Column 2": [2]}

        # Test data is expanded when Column.name is already present
        table_3.add(
            data=pd.DataFrame({"Column 1": ["foo"], "Column 2": [3]}),
        )
        assert table_3.data == {"Column 1": ["bar", "foo"], "Column 2": [2, 3]}

If this is now how it should be, I'll add a DB migration and hopefully we can merge this PR soon :)

@danielhuppmann

This comment was marked as resolved.

@danielhuppmann

This comment was marked as duplicate.

Resolved review threads on:
ixmp4/server/rest/optimization/table.py
ixmp4/server/rest/optimization/table.py
ixmp4/data/api/optimization/column.py
ixmp4/data/db/optimization/column/filter.py
ixmp4/data/db/optimization/column/model.py
ixmp4/data/db/optimization/table/filter.py
tests/data/test_docs.py
@glatterf42

This comment was marked as resolved.

@danielhuppmann

This comment was marked as resolved.

@glatterf42

This comment was marked as outdated.

Contributor

@meksor meksor left a comment

Hiya, I looked at the current state of this PR code-wise and it looks good.

Member

@danielhuppmann danielhuppmann left a comment

Thank you @glatterf42, this looks very nice, almost ready to be merged. A few observations inline, and one more general observation (which I made a while ago already): I believe that the order of arguments table-name -> table-columns -> table-columns-constraints-to-indexsets would be more intuitive...

Resolved review threads on:
ixmp4/data/abstract/optimization/column.py
ixmp4/core/optimization/table.py
ixmp4/core/optimization/table.py
ixmp4/data/api/optimization/column.py
tests/core/test_table.py
Member

@danielhuppmann danielhuppmann left a comment

Looks good to me, many thanks!

@glatterf42 glatterf42 merged commit 593ffe7 into main Apr 12, 2024
7 checks passed
@glatterf42 glatterf42 deleted the include/optimization-table branch April 12, 2024 08:55