Use dags package for implementation of DAG backend. #399

ChristianZimpelmann · 2022-06-21T07:03:14Z

What problem do you want to solve?

The DAG backend is currently implemented in dag.py.
The dags package provides most of the functionality and will be used to avoid repetition in code across different projects.

Todo

Use dags for compute_taxes_and_transfers
Find out whether and how dags can be used for plot_dag
Decide whether dags should be used for aggregation functions, too
Put contribution in CHANGES.rst.

ChristianZimpelmann · 2022-06-21T08:46:24Z

In GETTSIM, we currently proceed in 2 steps:

Build the DAG (as networkx.DiGraph object)
Execute the DAG.

dags, currently, does not return the full DAG, but a concatenated function where "The arguments of the combined function are all arguments of relevant functions that are not themselves function names, in alphabetical order."

I think it would be beneficial to keep the functionality to have the DAG as object within GETTSIM.

Most important reason: allow for the visualisation of the DAG.
Also, some operations, e.g. rounding, are currently applied to the functions in the DAG after it is created, but before it is executed. But we probably can and should do these operations before building the DAG anyway.

I, hence, suggest that I adjust dags such that it allows to return the DAG. This would likely just split the concatenate_functions function in two parts which then could be called separately.

@janosg @hmgaudecker, what do you think?

janosg · 2022-06-21T09:01:40Z

I would hope that gettsim developers contribute their visualization code to dags. This would then use the internal dags functions to build the dag object and plot it. Then plotting would not be a reason to have a dag object.
I would also not do anything outside the dag. dags provides a few decorators/higher order functions to rename arguments of functions. When using these it usually becomes very easy to do everything inside the dag.

Having said that: I see no problem in adding a return_type argument to concatenate_functions or having a separate function to create a DAG object.

hmgaudecker · 2022-06-21T10:09:24Z

I would prefer to refactor concatenate_functions so that the first part (up to the pruning of the dag, currently line 54) is in a separate function set_up_dag or so.

I'd prefer that to changing the return types. Calling the set_up_dag part twice will have close to zero overhead.

Once that is in place, it will be trivial to have the plotting code in either location, once we have something in GETTSIM we can decide whether it is general enough to port it to dags or not.

Finally, the above sounds like a might good idea to swap the order of applying decorators and creating the DAG right away, before doing anything else.

ChristianZimpelmann · 2022-06-21T10:43:39Z

I would prefer to refactor concatenate_functions so that the first part (up to the pruning of the dag, currently line 54) is in a separate function set_up_dag or so.

That's what I have in mind. Will do! Sounds all very good!

However, I am not sure whether I understand correctly what you mean by decorators/higher order functions. The most important higher order changes we are currently doing are:

create aggregation functions based on aggregations_specs (or automatically) -> needs to be done before DAG is build
partial policy parameters to functions -> needs to be done before DAG is build
apply rounding to some functions (as wrapper of the raw function) -> is currently done inside the DAG, but could be done before creating the DAG (which I think is cleaner).
take care of columns_overriding_functions -> currently done while creating the DAG, but AFAICT could be done before

janosg · 2022-06-21T12:44:52Z

Also, some operations, e.g. rounding, are currently applied to the functions in the DAG after it is created, but before it is executed. But we probably can and should do these operations before building the DAG anyway.

What I meant is that I would not modify the dag after it is created. Instead I would always modify a built-in or user provided function dict and let the dag do the rest. E.g. if a function, say kindergeld needs rounding and the original function dict is {"kindergeld": unrounded_implementation}, I would rename the key "kindergeld" to "kindergeld_unrounded" internally, then add a function that takes kindergeld_unrounded as input and does the rounding under the key "kindergeld".

This second step often requires renaming arguments of functions. This is where the decorators are helpful.

hmgaudecker · 2022-06-21T12:58:39Z

Thanks! 2 questions

create aggregation functions based on aggregations_specs (or automatically) -> needs to be done before DAG is build

Based on aggregation_specs (known) seems clear. But the automatic thing I am not sure about -- basically that should happpen during creation of the DAG on demand? I.e., if bruttolohn_m exists and bruttolohn_m_hh is requested somewhere, create that function? Do you see a clever way on how to do that other than building a DAG first and then looking for such patterns, setting up a new dict with the required functions?

take care of columns_overriding_functions -> currently done while creating the DAG, but AFAICT could be done before

Yes, I guess we could just remove functions that are part of the columns? Maybe easiest if _harmonize_functions was exposed?

janosg · 2022-06-21T13:20:49Z

basically that should happpen during creation of the DAG on demand? I.e., if bruttolohn_m exists and bruttolohn_m_hh is requested somewhere, create that function?

I would have simply created all those functions and let the dag figure out which of them are needed. Creating such functions is almost cost free compared to calculating the taxes and transfers.

ChristianZimpelmann · 2022-06-21T13:28:07Z

What I meant is that I would not modify the dag after it is created.

Ok, that is in line with what I meant. 👍

Based on aggregation_specs (known) seems clear. But the automatic thing I am not sure about -- basically that should happpen during creation of the DAG on demand? I.e., if bruttolohn_m exists and bruttolohn_m_hh is requested somewhere, create that function? Do you see a clever way on how to do that other than building a DAG first and then looking for such patterns, setting up a new dict with the required functions?

We currently just look at the set of all functions and all arguments of those functions to determine the automatic aggregation specs.

Janos' solution to create all potential aggregation function seems also fine, but might not be necessary.

Yes, I guess we could just remove functions that are part of the columns?

Yes.

hmgaudecker · 2022-06-21T13:30:40Z

We currently just look at the set of all functions and all arguments of those functions to determine the automatic aggregation specs.

Yes, but that is the core functionality of dags and we would just be repeating it.

ChristianZimpelmann · 2022-06-23T08:30:19Z

We currently just look at the set of all functions and all arguments of those functions to determine the automatic aggregation specs.

Yes, but that is the core functionality of dags and we would just be repeating it.

To find out the set of all functions and all arguments, yes. But determining the automatic aggregation specs based on the naming convention seems very specific to GETTSIM. I would rather not add this functionality to dags unless another project would also use it.

hmgaudecker · 2022-06-23T08:42:49Z

To find out the set of all functions and all arguments, yes.

This is what I meant. How could we do the second step without repeating this first step? I'd think just creating all functions would require much less code on GETTSIM's side and certainly much less sophisticated code.

ChristianZimpelmann · 2022-06-23T13:34:40Z

This is what I meant. How could we do the second step without repeating this first step? I'd think just creating all functions would require much less code on GETTSIM's side and certainly much less sophisticated code.

I have problems understanding this comment. What exactly do you mean by second step?

And are you proposing to create aggregation functions for all functions (for all supported groupings) independent of whether the aggregated function is requested somewhere and of whether it is defined somewhere explicitely?

hmgaudecker · 2022-06-23T13:49:23Z

First step:

We currently just look at the set of all functions and all arguments of those functions

Second step:

determine the automatic aggregation specs.

And are you proposing to create aggregation functions for all functions (for all supported groupings) independent of whether the aggregated function is requested somewhere and of whether it is defined somewhere explicitly?

Yes, anything that is not required will be weeded out by the DAG. According to @janosg, should be very quick!

codecov · 2022-10-09T13:53:57Z

Codecov Report

Base: 92.88% // Head: 92.76% // Decreases project coverage by -0.11% ⚠️

Coverage data is based on head (49fbc2e) compared to base (8140ff6).
Patch coverage: 93.44% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #399      +/-   ##
==========================================
- Coverage   92.88%   92.76%   -0.12%     
==========================================
  Files          76       74       -2     
  Lines        3905     3896       -9     
==========================================
- Hits         3627     3614      -13     
- Misses        278      282       +4

Impacted Files	Coverage Δ
dashboard/pre_processing_data.py	`0.00% <0.00%> (ø)`
gettsim/functions_loader.py	`86.87% <88.82%> (+1.15%)`	⬆️
gettsim/tests/test_rounding.py	`96.00% <90.90%> (-1.02%)`	⬇️
gettsim/tests/test_interface.py	`95.42% <92.18%> (-2.01%)`	⬇️
gettsim/shared.py	`96.07% <96.66%> (-0.08%)`	⬇️
gettsim/config.py	`100.00% <100.00%> (ø)`
gettsim/interface.py	`100.00% <100.00%> (+7.60%)`	⬆️
gettsim/policy_environment.py	`100.00% <100.00%> (ø)`
gettsim/tests/test_docs.py	`95.77% <100.00%> (ø)`
gettsim/tests/test_full_taxes_and_transfers.py	`96.07% <100.00%> (ø)`
... and 5 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

…s/gettsim into use_dags

ChristianZimpelmann · 2022-11-06T09:54:22Z

Thanks, good idea!

I moved the respective functions.

I also adjusted the comments in compute_taxes_and_transfers (with slight modifications as I think that way the structure in the function is the clearest).

hmgaudecker · 2022-11-07T09:26:57Z

In the process of changing the docstring, I also changed

load_reforms_for_date -> load_functions_for_date

in 14afe03. Hope that is uncontroversial?

hmgaudecker

Almost there, I think, excellent!!!

Please double-check my commits, I think they all should be fairly uncontroversial, adding some clarity / consistency (I hope).

I did not go through interface.py in detail yet. Two things, first minor, second important imho.

Can we get by without that KeyErrorMessage class? I find it distracting and I'd hope that by re-using format_errors_and_warnings. Could you check, please?
The module is very large and likely to grow. This might be a good point to think about the order of the functions (reproduced just below). I do not see any particular structure in that order right now, which makes the module difficult to navigate. Could you make a suggestion here? We can discuss and then implement later. Thanks!

Functions in interface.py:

compute_taxes_and_transfers
set_up_dag
_process_and_check_data
_round_and_partial_parameters_to_functions
_create_input_data
_prepare_results
_fail_if_columns_overriding_functions_are_not_in_dag
_convert_data_to_correct_types
_fail_if_group_variables_not_constant_within_groups
_fail_if_columns_overriding_functions_are_not_in_data
_reorder_columns
_fail_if_root_nodes_are_missing
_reduce_to_necessary_data
_fail_if_pid_is_non_unique
_fail_if_duplicates_in_columns
_root_nodes
_add_rounding_to_one_function
_add_rounding_to_functions

gettsim/visualization.py

gettsim/tests/test_data/grundrente_proxy_rente.csv

ChristianZimpelmann · 2022-11-14T07:57:59Z

Can we get by without that KeyErrorMessage class? I find it distracting and I'd hope that by re-using format_errors_and_warnings. Could you check, please?

Apparently, KeyError is always wrapped by quotes and all line breaks, therefore, are ignored. See this post.

An alternative to using the KeyErrorMessage class would be to raise the error as follows:

try:
    raise KeyError("This is a \n Line break")
except KeyError as e:
    print(e.args[0])

My prefered alternative would be to keep KeyErrorMessage, but moved it to the shared module.

ChristianZimpelmann · 2022-11-14T08:17:07Z

Proposal for order of functions is below.

public functions

compute_taxes_and_transfers
set_up_dag

data checks and data processing before calculations

(_process_and_check_data, _convert_data_to_correct_types, and _create_input_data and all functions called therein in order of appearance)

_process_and_check_data
_convert_data_to_correct_types
_create_input_data
_fail_if_duplicates_in_columns
_fail_if_group_variables_not_constant_within_groups
_fail_if_columns_overriding_functions_are_not_in_data
_fail_if_pid_is_non_unique
_root_nodes
_fail_if_root_nodes_are_missing
_reduce_to_necessary_data

function processing

(_round_and_partial_parameters_to_functions and and all functions called therein in order of appearance)
(also add _fail_if_columns_overriding_functions_are_not_in_dag, which is called in set_up_dag, at the end of this block)

_round_and_partial_parameters_to_functions
_add_rounding_to_functions
_add_rounding_to_one_function
_fail_if_columns_overriding_functions_are_not_in_dag

data processing after calculations

_prepare_results
_reorder_columns

hmgaudecker · 2022-11-15T17:45:31Z

Looks great, thanks! Can you implement that? I think I'll want to inline _root_nodes eventually (feels out of place, only called there, and should be possible to do in a one-liner), but I am too tired/distracted right now.

…rd breaks.

…s/gettsim into use_dags

hmgaudecker

Great! Just a couple of little notes left.

Small comment on the commit message in 52f97cb: It should be "Merge branch 'main'." or so -- whether that required resolving merge conflicts is irrelevant. This way I needed to open it in a GUI to understand what was happening.

gettsim/shared.py

…s/gettsim into use_dags

ChristianZimpelmann · 2022-11-17T12:56:40Z

Small comment on the commit message in 52f97cb: It should be "Merge branch 'main'." or so -- whether that required resolving merge conflicts is irrelevant. This way I needed to open it in a GUI to understand what was happening.

Thanks! Will consider next time.

I added four new test cases and opened the issue. Will merge as soon as tests have run through.

add dags to environment.

7fffd7e

ChristianZimpelmann closed this Jun 21, 2022

ChristianZimpelmann reopened this Jun 21, 2022

ChristianZimpelmann added 11 commits June 29, 2022 11:20

first changes. not running through yet.

9716cae

merge main and fix merge conflicts.

70c5832

add dags to sphinx environment.

a8572c1

adjust table notes.

2d37495

build dag twice to filter out functions.

074c2b7

visualization also running through now.

44cccce

add contribution to CHANGES.rst.

4b7e4ae

improve code documentation.

3413280

restructure visualization. Delete old internal dag files.

1c04b00

minor adjustment to dags.

4a6c77b

Merge branch 'main' into use_dags

85192dd

ChristianZimpelmann added 2 commits October 9, 2022 18:14

improve documentation. Add three tests for checks of inputs.

d4c19d6

Merge branch 'use_dags' of github.com:iza-institute-of-labor-economic…

d8ffe5f

…s/gettsim into use_dags

adjust comments.

f91f4f3

Fix misnomer (reforms -> functions.

14afe03

hmgaudecker added 5 commits November 7, 2022 10:41

Another better name for a helper function.

d5b65e8

Another round of renaming for consistency.

3bf4ff0

Spelling.

dee3da7

Typo, internal consistency in naming.

908226a

Spelling, consistent naming, consistent order in functions_loader.

6630625

hmgaudecker reviewed Nov 7, 2022

View reviewed changes

gettsim/visualization.py Outdated Show resolved Hide resolved

gettsim/visualization.py Outdated Show resolved Hide resolved

gettsim/tests/test_data/grundrente_proxy_rente.csv Outdated Show resolved Hide resolved

adjust inline comments. Revert change in test data.

741a4b1

move KeyErrorMessage to module shared.

ba84173

hmgaudecker and others added 6 commits November 16, 2022 09:05

flake8 moved.

7f39d71

Use bokeh < 3.0, seems like interface has changed so that our dashboa…

ad73013

…rd breaks.

Replace '_root_nodes' function by simple set comprehension.

72813cd

Merge branch 'use_dags' of github.com:iza-institute-of-labor-economic…

64f63cd

…s/gettsim into use_dags

fix merge conflicts.

52f97cb

reorder functions in interface.py.

a046c44

ChristianZimpelmann requested a review from hmgaudecker November 16, 2022 18:40

ChristianZimpelmann and others added 2 commits November 17, 2022 08:31

fix typo

0a18e2a

Remove unneccessary comment.

109aac5

hmgaudecker approved these changes Nov 17, 2022

View reviewed changes

gettsim/shared.py Show resolved Hide resolved

gettsim/shared.py Show resolved Hide resolved

ChristianZimpelmann added 2 commits November 17, 2022 13:55

add test case for KeyErrorMessage. Also fix error in docstring.

b232f70

Merge branch 'use_dags' of github.com:iza-institute-of-labor-economic…

49fbc2e

…s/gettsim into use_dags

ChristianZimpelmann merged commit bd67ace into main Nov 17, 2022

ChristianZimpelmann deleted the use_dags branch November 17, 2022 13:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use dags package for implementation of DAG backend. #399

Use dags package for implementation of DAG backend. #399

ChristianZimpelmann commented Jun 21, 2022 •

edited

ChristianZimpelmann commented Jun 21, 2022 •

edited

janosg commented Jun 21, 2022

hmgaudecker commented Jun 21, 2022

ChristianZimpelmann commented Jun 21, 2022 •

edited

janosg commented Jun 21, 2022

hmgaudecker commented Jun 21, 2022

janosg commented Jun 21, 2022 •

edited

ChristianZimpelmann commented Jun 21, 2022

hmgaudecker commented Jun 21, 2022

ChristianZimpelmann commented Jun 23, 2022

hmgaudecker commented Jun 23, 2022

ChristianZimpelmann commented Jun 23, 2022

hmgaudecker commented Jun 23, 2022

codecov bot commented Oct 9, 2022 •

edited

ChristianZimpelmann commented Nov 6, 2022 •

edited

hmgaudecker commented Nov 7, 2022

hmgaudecker left a comment

ChristianZimpelmann commented Nov 14, 2022

ChristianZimpelmann commented Nov 14, 2022 •

edited

hmgaudecker commented Nov 15, 2022

hmgaudecker left a comment

ChristianZimpelmann commented Nov 17, 2022

Use dags package for implementation of DAG backend. #399

Use dags package for implementation of DAG backend. #399

Conversation

ChristianZimpelmann commented Jun 21, 2022 • edited

What problem do you want to solve?

Todo

ChristianZimpelmann commented Jun 21, 2022 • edited

janosg commented Jun 21, 2022

hmgaudecker commented Jun 21, 2022

ChristianZimpelmann commented Jun 21, 2022 • edited

janosg commented Jun 21, 2022

hmgaudecker commented Jun 21, 2022

janosg commented Jun 21, 2022 • edited

ChristianZimpelmann commented Jun 21, 2022

hmgaudecker commented Jun 21, 2022

ChristianZimpelmann commented Jun 23, 2022

hmgaudecker commented Jun 23, 2022

ChristianZimpelmann commented Jun 23, 2022

hmgaudecker commented Jun 23, 2022

codecov bot commented Oct 9, 2022 • edited

Codecov Report

ChristianZimpelmann commented Nov 6, 2022 • edited

hmgaudecker commented Nov 7, 2022

hmgaudecker left a comment

Choose a reason for hiding this comment

ChristianZimpelmann commented Nov 14, 2022

ChristianZimpelmann commented Nov 14, 2022 • edited

public functions

data checks and data processing before calculations

function processing

data processing after calculations

hmgaudecker commented Nov 15, 2022

hmgaudecker left a comment

Choose a reason for hiding this comment

ChristianZimpelmann commented Nov 17, 2022

ChristianZimpelmann commented Jun 21, 2022 •

edited

ChristianZimpelmann commented Jun 21, 2022 •

edited

ChristianZimpelmann commented Jun 21, 2022 •

edited

janosg commented Jun 21, 2022 •

edited

codecov bot commented Oct 9, 2022 •

edited

ChristianZimpelmann commented Nov 6, 2022 •

edited

ChristianZimpelmann commented Nov 14, 2022 •

edited