Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use dags package for implementation of DAG backend. #399

Merged
merged 45 commits into from Nov 17, 2022
Merged

Conversation

ChristianZimpelmann
Copy link
Member

@ChristianZimpelmann ChristianZimpelmann commented Jun 21, 2022

What problem do you want to solve?

  • The DAG backend is currently implemented in dag.py.
  • The dags package provides most of the functionality and will be used to avoid repetition in code across different projects.

Todo

  • Use dags for compute_taxes_and_transfers
  • Find out whether and how dags can be used for plot_dag
  • Decide whether dags should be used for aggregation functions, too
  • Put contribution in CHANGES.rst.

@ChristianZimpelmann
Copy link
Member Author

ChristianZimpelmann commented Jun 21, 2022

In GETTSIM, we currently proceed in 2 steps:

  1. Build the DAG (as networkx.DiGraph object)
  2. Execute the DAG.
  • dags, currently, does not return the full DAG, but a concatenated function where "The arguments of the combined function are all arguments of relevant functions that are not themselves function names, in alphabetical order."

I think it would be beneficial to keep the functionality to have the DAG as object within GETTSIM.

  • Most important reason: allow for the visualisation of the DAG.
  • Also, some operations, e.g. rounding, are currently applied to the functions in the DAG after it is created, but before it is executed. But we probably can and should do these operations before building the DAG anyway.

I, hence, suggest that I adjust dags such that it allows to return the DAG. This would likely just split the concatenate_functions function in two parts which then could be called separately.

@janosg @hmgaudecker, what do you think?

@janosg
Copy link
Collaborator

janosg commented Jun 21, 2022

  • I would hope that gettsim developers contribute their visualization code to dags. This would then use the internal dags functions to build the dag object and plot it. Then plotting would not be a reason to have a dag object.
  • I would also not do anything outside the dag. dags provides a few decorators/higher order functions to rename arguments of functions. When using these it usually becomes very easy to do everything inside the dag.

Having said that: I see no problem in adding a return_type argument to concatenate_functions or having a separate function to create a DAG object.

@hmgaudecker
Copy link
Collaborator

I would prefer to refactor concatenate_functions so that the first part (up to the pruning of the dag, currently line 54) is in a separate function set_up_dag or so.

I'd prefer that to changing the return types. Calling the set_up_dag part twice will have close to zero overhead.

Once that is in place, it will be trivial to have the plotting code in either location, once we have something in GETTSIM we can decide whether it is general enough to port it to dags or not.

Finally, the above sounds like a might good idea to swap the order of applying decorators and creating the DAG right away, before doing anything else.

@ChristianZimpelmann
Copy link
Member Author

ChristianZimpelmann commented Jun 21, 2022

I would prefer to refactor concatenate_functions so that the first part (up to the pruning of the dag, currently line 54) is in a separate function set_up_dag or so.

That's what I have in mind. Will do! Sounds all very good!

However, I am not sure whether I understand correctly what you mean by decorators/higher order functions. The most important higher order changes we are currently doing are:

  • create aggregation functions based on aggregations_specs (or automatically) -> needs to be done before DAG is build
  • partial policy parameters to functions -> needs to be done before DAG is build
  • apply rounding to some functions (as wrapper of the raw function) -> is currently done inside the DAG, but could be done before creating the DAG (which I think is cleaner).
  • take care of columns_overriding_functions -> currently done while creating the DAG, but AFAICT could be done before

@janosg
Copy link
Collaborator

janosg commented Jun 21, 2022

Also, some operations, e.g. rounding, are currently applied to the functions in the DAG after it is created, but before it is executed. But we probably can and should do these operations before building the DAG anyway.

What I meant is that I would not modify the dag after it is created. Instead I would always modify a built-in or user provided function dict and let the dag do the rest. E.g. if a function, say kindergeld needs rounding and the original function dict is {"kindergeld": unrounded_implementation}, I would rename the key "kindergeld" to "kindergeld_unrounded" internally, then add a function that takes kindergeld_unrounded as input and does the rounding under the key "kindergeld".

This second step often requires renaming arguments of functions. This is where the decorators are helpful.

@hmgaudecker
Copy link
Collaborator

Thanks! 2 questions

  • create aggregation functions based on aggregations_specs (or automatically) -> needs to be done before DAG is build

Based on aggregation_specs (known) seems clear. But the automatic thing I am not sure about -- basically that should happpen during creation of the DAG on demand? I.e., if bruttolohn_m exists and bruttolohn_m_hh is requested somewhere, create that function? Do you see a clever way on how to do that other than building a DAG first and then looking for such patterns, setting up a new dict with the required functions?

  • take care of columns_overriding_functions -> currently done while creating the DAG, but AFAICT could be done before

Yes, I guess we could just remove functions that are part of the columns? Maybe easiest if _harmonize_functions was exposed?

@janosg
Copy link
Collaborator

janosg commented Jun 21, 2022

basically that should happpen during creation of the DAG on demand? I.e., if bruttolohn_m exists and bruttolohn_m_hh is requested somewhere, create that function?

I would have simply created all those functions and let the dag figure out which of them are needed. Creating such functions is almost cost free compared to calculating the taxes and transfers.

@ChristianZimpelmann
Copy link
Member Author

What I meant is that I would not modify the dag after it is created.

Ok, that is in line with what I meant. 👍

Based on aggregation_specs (known) seems clear. But the automatic thing I am not sure about -- basically that should happpen during creation of the DAG on demand? I.e., if bruttolohn_m exists and bruttolohn_m_hh is requested somewhere, create that function? Do you see a clever way on how to do that other than building a DAG first and then looking for such patterns, setting up a new dict with the required functions?

We currently just look at the set of all functions and all arguments of those functions to determine the automatic aggregation specs.

Janos' solution to create all potential aggregation function seems also fine, but might not be necessary.

Yes, I guess we could just remove functions that are part of the columns?

Yes.

@hmgaudecker
Copy link
Collaborator

We currently just look at the set of all functions and all arguments of those functions to determine the automatic aggregation specs.

Yes, but that is the core functionality of dags and we would just be repeating it.

@ChristianZimpelmann
Copy link
Member Author

We currently just look at the set of all functions and all arguments of those functions to determine the automatic aggregation specs.

Yes, but that is the core functionality of dags and we would just be repeating it.

To find out the set of all functions and all arguments, yes. But determining the automatic aggregation specs based on the naming convention seems very specific to GETTSIM. I would rather not add this functionality to dags unless another project would also use it.

@hmgaudecker
Copy link
Collaborator

To find out the set of all functions and all arguments, yes.

This is what I meant. How could we do the second step without repeating this first step? I'd think just creating all functions would require much less code on GETTSIM's side and certainly much less sophisticated code.

@ChristianZimpelmann
Copy link
Member Author

This is what I meant. How could we do the second step without repeating this first step? I'd think just creating all functions would require much less code on GETTSIM's side and certainly much less sophisticated code.

I have problems understanding this comment. What exactly do you mean by second step?

And are you proposing to create aggregation functions for all functions (for all supported groupings) independent of whether the aggregated function is requested somewhere and of whether it is defined somewhere explicitely?

@hmgaudecker
Copy link
Collaborator

First step:

We currently just look at the set of all functions and all arguments of those functions

Second step:

determine the automatic aggregation specs.

And are you proposing to create aggregation functions for all functions (for all supported groupings) independent of whether the aggregated function is requested somewhere and of whether it is defined somewhere explicitly?

Yes, anything that is not required will be weeded out by the DAG. According to @janosg, should be very quick!

@codecov
Copy link

codecov bot commented Oct 9, 2022

Codecov Report

Base: 92.88% // Head: 92.76% // Decreases project coverage by -0.11% ⚠️

Coverage data is based on head (49fbc2e) compared to base (8140ff6).
Patch coverage: 93.44% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #399      +/-   ##
==========================================
- Coverage   92.88%   92.76%   -0.12%     
==========================================
  Files          76       74       -2     
  Lines        3905     3896       -9     
==========================================
- Hits         3627     3614      -13     
- Misses        278      282       +4     
Impacted Files Coverage Δ
dashboard/pre_processing_data.py 0.00% <0.00%> (ø)
gettsim/functions_loader.py 86.87% <88.82%> (+1.15%) ⬆️
gettsim/tests/test_rounding.py 96.00% <90.90%> (-1.02%) ⬇️
gettsim/tests/test_interface.py 95.42% <92.18%> (-2.01%) ⬇️
gettsim/shared.py 96.07% <96.66%> (-0.08%) ⬇️
gettsim/config.py 100.00% <100.00%> (ø)
gettsim/interface.py 100.00% <100.00%> (+7.60%) ⬆️
gettsim/policy_environment.py 100.00% <100.00%> (ø)
gettsim/tests/test_docs.py 95.77% <100.00%> (ø)
gettsim/tests/test_full_taxes_and_transfers.py 96.07% <100.00%> (ø)
... and 5 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@ChristianZimpelmann
Copy link
Member Author

ChristianZimpelmann commented Nov 6, 2022

Thanks, good idea!

I moved the respective functions.

I also adjusted the comments in compute_taxes_and_transfers (with slight modifications as I think that way the structure in the function is the clearest).

@hmgaudecker
Copy link
Collaborator

In the process of changing the docstring, I also changed

load_reforms_for_date -> load_functions_for_date

in 14afe03. Hope that is uncontroversial?

Copy link
Collaborator

@hmgaudecker hmgaudecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there, I think, excellent!!!

Please double-check my commits, I think they all should be fairly uncontroversial, adding some clarity / consistency (I hope).

I did not go through interface.py in detail yet. Two things, first minor, second important imho.

  • Can we get by without that KeyErrorMessage class? I find it distracting and I'd hope that by re-using format_errors_and_warnings. Could you check, please?
  • The module is very large and likely to grow. This might be a good point to think about the order of the functions (reproduced just below). I do not see any particular structure in that order right now, which makes the module difficult to navigate. Could you make a suggestion here? We can discuss and then implement later. Thanks!

Functions in interface.py:

  • compute_taxes_and_transfers
  • set_up_dag
  • _process_and_check_data
  • _round_and_partial_parameters_to_functions
  • _create_input_data
  • _prepare_results
  • _fail_if_columns_overriding_functions_are_not_in_dag
  • _convert_data_to_correct_types
  • _fail_if_group_variables_not_constant_within_groups
  • _fail_if_columns_overriding_functions_are_not_in_data
  • _reorder_columns
  • _fail_if_root_nodes_are_missing
  • _reduce_to_necessary_data
  • _fail_if_pid_is_non_unique
  • _fail_if_duplicates_in_columns
  • _root_nodes
  • _add_rounding_to_one_function
  • _add_rounding_to_functions

gettsim/visualization.py Outdated Show resolved Hide resolved
gettsim/visualization.py Outdated Show resolved Hide resolved
gettsim/tests/test_data/grundrente_proxy_rente.csv Outdated Show resolved Hide resolved
@ChristianZimpelmann
Copy link
Member Author

  • Can we get by without that KeyErrorMessage class? I find it distracting and I'd hope that by re-using format_errors_and_warnings. Could you check, please?

Apparently, KeyError is always wrapped by quotes and all line breaks, therefore, are ignored. See this post.

An alternative to using the KeyErrorMessage class would be to raise the error as follows:

try:
    raise KeyError("This is a \n Line break")
except KeyError as e:
    print(e.args[0])

My prefered alternative would be to keep KeyErrorMessage, but moved it to the shared module.

@ChristianZimpelmann
Copy link
Member Author

ChristianZimpelmann commented Nov 14, 2022

Proposal for order of functions is below.

public functions

  • compute_taxes_and_transfers
  • set_up_dag

data checks and data processing before calculations

(_process_and_check_data, _convert_data_to_correct_types, and _create_input_data and all functions called therein in order of appearance)

  • _process_and_check_data
  • _convert_data_to_correct_types
  • _create_input_data
  • _fail_if_duplicates_in_columns
  • _fail_if_group_variables_not_constant_within_groups
  • _fail_if_columns_overriding_functions_are_not_in_data
  • _fail_if_pid_is_non_unique
  • _root_nodes
  • _fail_if_root_nodes_are_missing
  • _reduce_to_necessary_data

function processing

(_round_and_partial_parameters_to_functions and and all functions called therein in order of appearance)
(also add _fail_if_columns_overriding_functions_are_not_in_dag, which is called in set_up_dag, at the end of this block)

  • _round_and_partial_parameters_to_functions
  • _add_rounding_to_functions
  • _add_rounding_to_one_function
  • _fail_if_columns_overriding_functions_are_not_in_dag

data processing after calculations

  • _prepare_results
  • _reorder_columns

@hmgaudecker
Copy link
Collaborator

Looks great, thanks! Can you implement that? I think I'll want to inline _root_nodes eventually (feels out of place, only called there, and should be possible to do in a one-liner), but I am too tired/distracted right now.

Copy link
Collaborator

@hmgaudecker hmgaudecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Just a couple of little notes left.

Small comment on the commit message in 52f97cb: It should be "Merge branch 'main'." or so -- whether that required resolving merge conflicts is irrelevant. This way I needed to open it in a GUI to understand what was happening.

gettsim/shared.py Show resolved Hide resolved
gettsim/shared.py Show resolved Hide resolved
@ChristianZimpelmann
Copy link
Member Author

Small comment on the commit message in 52f97cb: It should be "Merge branch 'main'." or so -- whether that required resolving merge conflicts is irrelevant. This way I needed to open it in a GUI to understand what was happening.

Thanks! Will consider next time.

I added four new test cases and opened the issue. Will merge as soon as tests have run through.

@ChristianZimpelmann ChristianZimpelmann merged commit bd67ace into main Nov 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants