
Include optimization.table #40

Merged · 39 commits into main from include/optimization-table · Apr 12, 2024

Conversation

glatterf42
Member

@glatterf42 glatterf42 commented Jan 26, 2024

The next part of adding the message_ix/ixmp data model.

Still to be done:

  • Add list, enumerate, tabulate, docs functionality in DB layer
  • Add API layer
  • Add Core layer

For now, still a WIP.

@glatterf42 glatterf42 added the enhancement label Jan 26, 2024
@glatterf42 glatterf42 self-assigned this Jan 26, 2024
@glatterf42
Member Author

glatterf42 commented Jan 26, 2024

While this PR is still a WIP, you can already take a look at what is arguably the most interesting part: adding data in the DB layer to a Table, which requires lots of validation. The following behaviour and possibly more can be inferred from the existing tests and files in the data/abstract and data/db directories.

Current details on adding and validating data

How does it work?

Currently, data can be provided as either a dict or a pd.DataFrame. Note that when creating a dict with values of len() == 1, you still need to provide them as lists, like data = {"key": [value]}. This is because we use pandas and its existing functionality for validation, which means converting the dict to a DataFrame, and that conversion requires list-like values.
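This constraint can be seen directly in pandas (a minimal illustration, not ixmp4 code): a dict of bare scalars cannot be turned into a DataFrame, while list-wrapped values can.

```python
import pandas as pd

# A dict of bare scalars cannot be converted: pandas has no row index to infer.
try:
    pd.DataFrame.from_dict({"key": "value"})
except ValueError:
    print("scalar values need an index")

# Wrapping each value in a list works, even for a single row:
df = pd.DataFrame.from_dict({"key": ["value"]})
print(df.shape)  # (1, 1)
```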

Before validation, though, we use the dict form of the data to merge them with possibly pre-existing Table.data. This is done via the union operator dict1 | dict2 for dicts, which overwrites keys in dict1 if dict2 also provides them, so this already goes in the direction of updating data and probably warrants logging information at least.
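A minimal illustration of the union operator's overwrite semantics (plain Python, no ixmp4 involved):

```python
existing = {"Column 1": ["foo"], "Column 2": [1]}
incoming = {"Column 2": [2], "Column 3": ["baz"]}

# dict1 | dict2 keeps all keys, but values from dict2 win on collisions:
merged = existing | incoming
print(merged)  # {'Column 1': ['foo'], 'Column 2': [2], 'Column 3': ['baz']}
```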

Whenever table.data is set (note that table.data |= data does not count as setting, while table.data = table.data | data does; similar to adding elements to IndexSets), the validate_data() function is triggered and checks that

  • no values are missing (each column contains the same number of values);
  • there are no duplicate rows in the data;
  • all columns only contain data allowed as per the linked Indexset.elements

At the moment, all of these cases raise ValueErrors (though with differing messages); I don't know if we want to raise custom errors instead.
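The three checks map onto standard pandas operations; here is a sketch of each on a toy frame, where `allowed` stands in for the collected IndexSet elements (an assumption for illustration):

```python
import pandas as pd

# Toy data and a stand-in for the linked IndexSet elements per column:
allowed = {"Column 1": ["foo", "bar"], "Column 2": [1, 2, 3]}
df = pd.DataFrame({"Column 1": ["foo", "bar"], "Column 2": [1, 2]})

assert not df.isna().any(axis=None)     # 1: no missing values
assert df.value_counts().max() <= 1     # 2: no duplicate rows
assert df.isin(allowed).all(axis=None)  # 3: only allowed values per column
```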

IndexSets are constrained when the Table is created. By default, the IndexSet names become the column names, but this can be overridden by specifying column_names=[...]. In that case, column_names needs to contain exactly one unique name for each column.

Some more edge case considerations

len(data.keys()) != len(constrained_to_indexsets)

I've tried to ensure that several edge cases are covered already. For example, if you provide a data dict that doesn't contain data for all specified columns, nothing happens. We should introduce a check that all required data is present before the Table is used elsewhere, but it's fine to add the data piecemeal. If you provide duplicate keys in the same dict, the latter will overwrite the former, but tools like ruff will warn you before that happens.
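The duplicate-key behaviour is plain Python semantics: in a dict literal, the last occurrence of a key wins.

```python
# Python silently keeps only the last value for a repeated literal key
# (linters such as ruff flag this pattern before it bites):
data = {"Column 1": ["foo"], "Column 1": ["bar"]}
print(data)  # {'Column 1': ['bar']}
```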

Using an IndexSet twice in constrained_to_indexsets

For now, this is fine. I think something like a model_years indexset might be used for something like year_active and year_vintage, but @danielhuppmann, please let me know if this should be constrained as well.

Duplicates in column_names

Will raise a ValueError.

Different data types

As one test demonstrates (with table_5, I think), a column can contain data of different types. At one point, I noticed Column.dtype being a strange object, but I've not studied this further for now since I don't know what we'll be using the dtype for and it's not currently in use, anyway. @meksor, maybe you could tell me again what the dtype and unique parameters from the requirements were supposed to achieve? I might be missing more than I think.
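The mixed-type behaviour comes straight from pandas: a column holding several Python types falls back to the catch-all object dtype, which may be what made Column.dtype look strange.

```python
import pandas as pd

# A column mixing strings, ints, and floats gets dtype 'object':
df = pd.DataFrame({"mixed": ["foo", 1, 3.14]})
print(df["mixed"].dtype)  # object
```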

@glatterf42 glatterf42 marked this pull request as ready for review February 16, 2024 12:00
@glatterf42
Member Author

glatterf42 commented Feb 19, 2024

Questions and possible ToDos

@meksor and @danielhuppmann, sorry for the long post. While cleaning up this PR, I came across the following questions/notes I took during its implementation. I'd like to clarify them before merging the PR even if our solution is to open a new issue and include the fix/expansion in another PR.

Adding data to a Table in the data layer

We currently have the following syntax for that in the DB layer (very similar to how elements are added to an Indexset):

        table = test_mp.backend.optimization.tables.create(
            run_id=run.id,
            name="Table",
            constrained_to_indexsets=[indexset_1.name, indexset_2.name],
        )
        test_mp.backend.optimization.tables.add_data(
            table_id=table.id, data=test_data_1
        )
        table = test_mp.backend.optimization.tables.get(run_id=run.id, name="Table")
        assert table.data == test_data_1

I'm wondering if we want to keep it that way. In the core layer, we have

        table = run.optimization.tables.create(
            "Table",
            constrained_to_indexsets=[indexset.name, indexset_2.name],
        )
        table.add(data=test_data_1)
        assert table.data == test_data_1

So the object is updated without another get() call. Note that this only works because table.add() both adds the data and calls get() to return the updated Table.

Linking each Column to a unique Indexset

I'm wondering if different Columns of a Table can be constrained to the same Indexset. Or in other words, I'm wondering if the following should raise an error:

        with pytest.raises(ValueError):
            _ = test_mp.backend.optimization.tables.create(
                run_id=run.id,
                name="Table 2",
                constrained_to_indexsets=[indexset_1.name, indexset_1.name],
                column_names=["Column 1", "Column 2"]
            )

Raising distinct error messages when validating data

At the moment, we have numerous validation checks for data that is being added to a Table, but all of these checks only raise ValueErrors (though with unique messages):

    def validate_data(self, key, data: dict[str, Any]):
        # if isinstance(data, dict):
        data_frame: pd.DataFrame = pd.DataFrame.from_dict(data)
        # TODO: for all of the following, we might want to create unique exceptions
        # TODO: we could make this more specific, e.g. by pointing to the missing values
        if data_frame.isna().any(axis=None):
            raise ValueError(
                "Table.data is missing values, please make sure it does "
                "not contain None or NaN, either!"
            )
        # TODO: we can make this more specific, e.g. highlighting all duplicate
        # rows via pd.DataFrame.duplicated(keep=False)
        if data_frame.value_counts().max() > 1:
            raise ValueError("Table.data contains duplicate rows!")
        # TODO: can we make this more specific? Iterate over columns; if any is
        # False, return its name or something?
        limited_to_indexsets = self.collect_indexsets_to_check()
        if not data_frame.isin(limited_to_indexsets).all(axis=None):
            raise ValueError(
                "Table.data contains keys and/or values that are not allowed as per "
                "the IndexSets and Columns it is constrained to!"
            )

And

    def create(
        self,
        run_id: int,
        name: str,
        constrained_to_indexsets: list[str],  # TODO: try passing a str to this
        column_names: list[str] | None = None,
        **kwargs,
    ) -> Table:
        if column_names and len(column_names) != len(constrained_to_indexsets):
            raise ValueError(
                "`constrained_to_indexsets` and `column_names` not equal in length! "
                "Please provide the same number of entries for both!"
            )
        # TODO: activate something like this if each column must be indexed by a
        # unique indexset
        # if len(constrained_to_indexsets) != len(set(constrained_to_indexsets)):
        #     raise ValueError("Each dimension must be constrained to a unique indexset!")  # noqa
        if column_names and len(column_names) != len(set(column_names)):
            raise ValueError("The given `column_names` are not unique!")

So my question is: are we fine with that or would we want to have distinct custom errors for all these checks?

Allowing Table.data to be added piecemeal

Currently, we allow adding data for the various Columns piecemeal like so:

        table_3 = run.optimization.tables.create(
            name="Table 3",
            constrained_to_indexsets=[indexset.name, indexset_2.name],
            column_names=["Column 1", "Column 2"],
        )
        table_3.add(data={"Column 1": ["bar"]})
        assert table_3.data == {"Column 1": ["bar"]}

        table_3.add(data={"Column 2": [2]})
        assert table_3.data == {"Column 1": ["bar"], "Column 2": [2]}

        table_3.add(
            data=pd.DataFrame({"Column 1": ["foo"], "Column 2": [3]}),
        )
        assert table_3.data == {"Column 1": ["foo"], "Column 2": [3]}

If the data to be added contains data for a Column that already has data, that Column's data is overwritten.
If we are fine with this behavior, we should probably still

  • Add functionality to check which data is already present/still needed (we might also want to somehow include information on how to see which data are permitted by an Indexset)
  • Log information if a Column's data is overwritten
  • Add a check that all data is present before the Table is used elsewhere (though I'm not sure this is possible since there's no clear endpoint of "this is where we stop adding data to the Table", so it's not clear to me when this kind of test should occur)
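For the first bullet, a hypothetical helper (missing_columns is an illustrative name, not existing API) could report which Columns still lack data:

```python
# Hypothetical helper, not part of ixmp4: given the Table's column names and
# its current data dict, list the columns that still need values.
def missing_columns(column_names: list[str], data: dict[str, list]) -> list[str]:
    return [name for name in column_names if not data.get(name)]

print(missing_columns(["Column 1", "Column 2"], {"Column 1": ["bar"]}))
# ['Column 2']
```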

Specifying the type of constrained_to_indexsets

In the core layer, we currently support tables.create() to receive the parameter constrained_to_indexsets as a list of strings:

def create(
        self,
        name: str,
        constrained_to_indexsets: list[str],
        column_names: list[str] | None = None,
    ) -> Table:

However, we return the property Table.constrained_to_indexsets as a list of integers:

    def constrained_to_indexsets(self) -> list[int]:
        return [column.constrained_to_indexset for column in self._model.columns]

Which is the preferred form? Or should we accept list[str | int] for create() and still only return int?
In the DB layer, such a property does not exist. A Table has Columns and each Column has constrained_to_indexset, which is of type int. I only added the property in the core layer on a whim because I thought it might be convenient.

@glatterf42
Member Author

Handling Columns of Tables

At the moment, Columns are only added to a Table during the creation of the Table. Do we want to keep it that way?
We could alternatively think about allowing Columns to be added to Tables manually after their creation and also then allow removing Columns from Tables (which might entail removing data).

@danielhuppmann
Member

danielhuppmann commented Feb 19, 2024

Thanks @glatterf42, see my responses below:

Adding data to a Table in the data layer

Yes, this approach makes sense to me.

You could improve performance by setting table to None after an add(), and only get() when table is explicitly accessed afterwards. This way, a user wouldn't have to load the entire table multiple times.

    def add(self):
        test_mp.backend.optimization.tables.add_data(
            table_id=table.id, data=test_data_1
        )
        self._table = None

    @property
    def table(self):
        if self._table is None:
            self._table = test_mp.backend.optimization.tables.get(
                run_id=run.id, name="Table"
            )
        return self._table

Linking each Column to a unique Indexset

Yes, it is absolutely crucial that multiple columns can be foreign-keyed to the same index set. Only the name has to be unique.

Raising distinct error messages when validating data

Good enough for now, but maybe to be improved later.

Allowing Table.data to be added piecemeal

Fine in principle to add data step by step similar to the current behavior, but data should be added row-wise. So if you have a table with two columns, the following line should raise an error.

table_3.add(data={"Column 1": ["bar"]})
> ValueError: Missing entry for 'Column 2'
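A sketch of the requested row-wise check, assuming access to the Table's column names (check_row_wise is a hypothetical name, not existing API):

```python
# Hypothetical sketch: reject add() calls that do not cover every column.
def check_row_wise(column_names: list[str], data: dict[str, list]) -> None:
    missing = sorted(set(column_names) - set(data))
    if missing:
        raise ValueError(f"Missing entry for {missing}")

# Covers both columns, so no error is raised:
check_row_wise(["Column 1", "Column 2"], {"Column 1": ["bar"], "Column 2": [2]})
```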

Specifying the type of constrained_to_indexsets

I assume that the int is the unique id of an indexset? In terms of usability, I would only show the (unique) name of an indexset to a user.

Handling Columns of Tables

It should not be possible to add columns to a table after creation.

* Remove some outdated TODOs
* Make _add_column() a private function
* Return Indexset.names as constrained_to_indexsets
* Enforce that Table.data be added row-wise
@danielhuppmann
Member

Thanks @glatterf42

To be clear about the current workflow for adding data to a table: you'd create the table and then need to call add_data() with all the data you want to add. If you add something like {"test": 2, "nexttest": 3} and later decide you also (or instead) want to add {"test": 4, "nexttest": 5}, we currently take that as an "instead" and overwrite the existing keys.

No, this is not the expected use case, given current behavior of ixmp and usual modelling workflows.

Imagine a "table" being the investment costs for different power plants in several regions. You often have situations where a modeler has a script e.g. computing parameters for only one type of power plant (or one region).

Similar to the current MESSAGE tutorials, I see the need for a workflow like

inv_cost_wind = pd.DataFrame()  # get a dataframe of investment cost parameters for a certain technology
run.parameters.get("inv_cost").add_data(inv_cost_wind)

adding (or replacing existing) datapoints but not removing existing parameter datapoints.
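One way to get those semantics with pandas (a sketch with illustrative column names, not the ixmp4 implementation): concatenate the update onto the existing data and keep the last row per index-column combination, so matching rows are replaced and everything else survives.

```python
import pandas as pd

existing = pd.DataFrame(
    {"technology": ["wind", "wind", "water"],
     "region": ["EEU", "WEU", "EEU"],
     "value": [1.0, 2.0, 3.0]}
)
update = pd.DataFrame({"technology": ["wind"], "region": ["EEU"], "value": [9.0]})

# Rows with the same index columns are replaced; all others are kept.
merged = (
    pd.concat([existing, update])
    .drop_duplicates(subset=["technology", "region"], keep="last")
    .reset_index(drop=True)
)
print(len(merged))  # 3
```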

@glatterf42
Member Author

Sorry, I don't quite follow: in your example, would inv_cost_wind contain all the data currently in the parameter plus the updated values or would it only contain the values that need updating?

And mainly for me, to clarify: I'm imagining a table like this:

| Power plant type | Region | Investment cost |
|---|---|---|
| wind | EEU | x |
| wind | WEU | y |
| water | EEU | z |

And a modeler might now want to update wind in EEU to w, so they would only provide the w value (plus other columns required for this one row) and expect this to overwrite the existing value for that row, while leaving the rest untouched?

@danielhuppmann
Member

danielhuppmann commented Mar 8, 2024

The usual workflow is to only provide the data that should be changed.

For example, IEA publishes new wind turbine cost estimates and a modeller updates the data like

run.parameters.get("inv_cost").add_data(
    {"technology": "wind", "region": "EEU", "value": w, "unit": "EUR/GW"}
)

@glatterf42
Member Author

Okay, I've updated the behaviour to this:

        table_3 = run.optimization.tables.create(
            name="Table 3",
            constrained_to_indexsets=[indexset.name, indexset_2.name],
            column_names=["Column 1", "Column 2"],
        )

        table_3.add(data={"Column 1": ["bar"], "Column 2": [2]})
        assert table_3.data == {"Column 1": ["bar"], "Column 2": [2]}

        # Test data is expanded when Column.name is already present
        table_3.add(
            data=pd.DataFrame({"Column 1": ["foo"], "Column 2": [3]}),
        )
        assert table_3.data == {"Column 1": ["bar", "foo"], "Column 2": [2, 3]}

If this is now how it should be, I'll add a DB migration and hopefully we can merge this PR soon :)

@danielhuppmann

This comment was marked as resolved.

@danielhuppmann

This comment was marked as duplicate.

Resolved review threads on:
ixmp4/server/rest/optimization/table.py
ixmp4/server/rest/optimization/table.py
ixmp4/data/api/optimization/column.py
ixmp4/data/db/optimization/column/filter.py
ixmp4/data/db/optimization/column/model.py
ixmp4/data/db/optimization/table/filter.py
tests/data/test_docs.py
@glatterf42

This comment was marked as resolved.

@danielhuppmann

This comment was marked as resolved.

@glatterf42

This comment was marked as outdated.

Contributor

@meksor meksor left a comment

Hiya, I looked at the current state of this PR code-wise and it looks good.

Member

@danielhuppmann danielhuppmann left a comment

Thank you @glatterf42, this looks very nice, almost ready to be merged. A few observations inline, and one more general observation (which I made a while ago already): I believe that the order of arguments table-name -> table-columns -> table-columns-constraints-to-indexsets would be more intuitive...

Resolved review threads on:
ixmp4/data/abstract/optimization/column.py
ixmp4/core/optimization/table.py
ixmp4/core/optimization/table.py
ixmp4/data/api/optimization/column.py
tests/core/test_table.py
Member

@danielhuppmann danielhuppmann left a comment

Looks good to me, many thanks!

@glatterf42 glatterf42 merged commit 593ffe7 into main Apr 12, 2024
7 checks passed
@glatterf42 glatterf42 deleted the include/optimization-table branch April 12, 2024 08:55