Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite create_synthetic_data #514

Merged
merged 25 commits into from May 5, 2023
Merged

Rewrite create_synthetic_data #514

merged 25 commits into from May 5, 2023

Conversation

ChristianZimpelmann
Copy link
Member

@ChristianZimpelmann ChristianZimpelmann commented Jan 30, 2023

What problem do you want to solve?

create_synthetic_data is currently overly complicated and slow (especially when many households are created at the same time). Main problem seems to be that a lot of pandas objects are created and concatenated.

To fix the problems, I rewrote most of the code. Remarks:

  • all variables that do not vary across households are now collected in a dict of lists: specs_constant_over_households
  • all variables that should vary across households are now collected in a dict of lists of lists: specs_heterogeneous
  • At the moment, it is no longer possible to create single and couple households or households with varying numbers of children with one function call. I would deliberately drop this feature, but this is up for discussion and could be implemented, as well.

Todo

  • Pick an appropriate title.
  • Rewrite code.
  • Improve documentation.
  • Adjust test cases
  • Adjust tutorials
  • Document PR in CHANGES.md.

Collection of ideas

  • allow for more than 2 children by either extending official data on Bedarfsgemeinschaften or by dropping use of official data
  • allow for easier specification of variables which are constant within households via broadcasting (i.e. allow for mietstufe: 2 instead of mietstufe: [2, 2, 2, 2]

@hmgaudecker
Copy link
Collaborator

Thanks, sounds great!

+1 for removing that feature

@hmgaudecker
Copy link
Collaborator

Any chance to get this merged within a couple of days?

@ChristianZimpelmann
Copy link
Member Author

I can have a look at it on Monday.

@codecov
Copy link

codecov bot commented Apr 18, 2023

Codecov Report

Patch coverage: 98.24% and project coverage change: -0.13 ⚠️

Comparison is base (993fc26) 94.20% compared to head (a4b870a) 94.07%.

❗ Current head a4b870a differs from pull request most recent head 51bd1c1. Consider uploading reports for the commit 51bd1c1 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #514      +/-   ##
==========================================
- Coverage   94.20%   94.07%   -0.13%     
==========================================
  Files          48       48              
  Lines        2983     2903      -80     
==========================================
- Hits         2810     2731      -79     
+ Misses        173      172       -1     
Impacted Files Coverage Δ
src/_gettsim/synthetic.py 98.46% <98.24%> (+3.28%) ⬆️

... and 3 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@ChristianZimpelmann ChristianZimpelmann marked this pull request as ready for review April 18, 2023 09:30
@ChristianZimpelmann
Copy link
Member Author

@ChristianZimpelmann
Copy link
Member Author

ChristianZimpelmann commented Apr 18, 2023

Somehow, the last commit 4bd9432a4aaaa8538367fef9dbaa3f01b74cb0f9 shows up on the branch, but not in this PR yet.

@hmgaudecker, please check again before your review.

Copy link
Collaborator

@hmgaudecker hmgaudecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, much clearer than before! Just a smaller things.

src/_gettsim/synthetic.py Outdated Show resolved Hide resolved
src/_gettsim/synthetic.py Outdated Show resolved Hide resolved
src/_gettsim/synthetic.py Outdated Show resolved Hide resolved
src/_gettsim_tests/test_synthetic.py Outdated Show resolved Hide resolved
src/_gettsim_tests/test_synthetic.py Show resolved Hide resolved
@ChristianZimpelmann
Copy link
Member Author

Thanks for the feedback! I addressed all the comments now.

Copy link
Collaborator

@hmgaudecker hmgaudecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, thank you!

@hmgaudecker
Copy link
Collaborator

Only realised ex post that docs are failing because some notebook seems to be using the old interface of create_synthetic_data, please fix that before merging, should be trivial.

@ChristianZimpelmann
Copy link
Member Author

ChristianZimpelmann commented Apr 25, 2023

Thanks!

See my comment above:

The documentation is currently incomplete. Easiest to fix is probably to revert to the state before the change of pre-commit hooks. So we shouldn't make any changes to the current version on main.

So I think the options are:

@hmgaudecker
Copy link
Collaborator

Hope that last commit will do, this brought all print-statements back in and changed the hooks.

@ChristianZimpelmann
Copy link
Member Author

Thanks for fixing the tutorials! I updated them now using the new interface of create_synthetic

Copy link
Collaborator

@hmgaudecker hmgaudecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, thanks! Just a couple of typos / language details, will implement myself.

docs/tutorials/advanced_usage.ipynb Outdated Show resolved Hide resolved
docs/tutorials/advanced_usage.ipynb Outdated Show resolved Hide resolved
@hmgaudecker hmgaudecker merged commit 8204788 into main May 5, 2023
1 of 6 checks passed
@hmgaudecker hmgaudecker deleted the synthetic branch May 5, 2023 12:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants