Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hail] add weight to counter and reference group_by in docs #6856

Merged
merged 13 commits into from Aug 26, 2019

Conversation

@danking
Copy link
Collaborator

@danking danking commented Aug 12, 2019

I think this may help users discover group_by and also help users
who are comfortable with the idea of a counter but not with group_by.

I added a new dataset for doctests and I realized a couple things:

  • doctest_write_data.py is not deterministic
  • if I add/change one dataset, my commit explodes with changes to all the datasets (see above)
  • doctest_write_data.py has to be run by me, it's not run by CI

I also noticed that when you specify no row_key to import_matrix_table you get a row key called row_id, which is annoying.

Anyway now when someone asks how to count the mutations in each gene by consequence type we can point them to the counter docs.


Adding a dataset caused a bunch of docs failure that lead me to change how we do doctesting. The changes are summarized below.

  • ignore python/.eggs
  • make PARALLELISM configurable in Makefile
  • fix make pytest (it referenced a non-extant target)
  • add make doctest (this and pytest use setup.py to replicate the environment the user would have after installation, I prefer this approach because I need not manually install any dependencies, setup.py handles that, it also configures spark correctly without environment variables)
  • harmonize doctest and pytest parameters in build.gradle and Makefile
  • clean up import order in conftest.py to match pylint's desired ordering
  • use a temple.TemporaryDirectory for all doctest and test output, which is automatically cleaned up (if you want to interrogate it you can ctrl-z a running doctest); this allows us to not copy the entire python directory into a build directory before running pytest
  • important: re-generate all input datasets on every run of the tests. Previously, there was a file doctest_write_data.py which you were supposed to run when you changed the datasets, but if Hail changes then the random datasets generated by doctest_write_data.py might change. This means when I came along to add a new dataset, I had to address all the test failures introduced since the last time doctest_write_data.py's results were checked in. (the doctests still only take about 2 minutes)
  • I fixed several latent doc bugs caused by the aforementioned situation
  • I changed "Using Variants as Covariates" in guides/genetics.rst because it seemed complicated and was broken by the aforementioned situation
  • I removed NOTEST which was a custom pytest annotation that duplicates the functionality of SKIP (I changed all NOTEST annotations to SKIP)
Copy link
Collaborator

@chrisvittal chrisvittal left a comment

fix tests, looks great otherwise

Loading

@danking
Copy link
Collaborator Author

@danking danking commented Aug 13, 2019

ugh. the doctest data is not regenerated on each commit so it's been stale for a long time.

Loading

Copy link
Collaborator Author

@danking danking left a comment

OK, This spiraled into a huge PITA. The fixes:

  • ignore python/.eggs
  • make PARALLELISM configurable in Makefile
  • fix make pytest (it referenced a non-extant target)
  • add make doctest (this and pytest use setup.py to replicate the environment the user would have after installation, I prefer this approach because I need not manually install any dependencies, setup.py handles that, it also configures spark correctly without environment variables)
  • harmonize doctest and pytest parameters in build.gradle and Makefile
  • clean up import order in conftest.py to match pylint's desired ordering
  • use a temple.TemporaryDirectory for all doctest and test output, which is automatically cleaned up (if you want to interrogate it you can ctrl-z a running doctest); this allows us to not copy the entire python directory into a build directory before running pytest
  • important: re-generate all input datasets on every run of the tests. Previously, there was a file doctest_write_data.py which you were supposed to run when you changed the datasets, but if Hail changes then the random datasets generated by doctest_write_data.py might change. This means when I came along to add a new dataset, I had to address all the test failures introduced since the last time doctest_write_data.py's results were checked in. (the doctests still only take about 2 minutes)
  • I fixed several latent doc bugs caused by the aforementioned situation
  • I changed "Using Variants as Covariates" in guides/genetics.rst because it seemed complicated and was broken by the aforementioned situation
  • I removed NOTEST which was a custom pytest annotation that duplicates the functionality of SKIP (I changed all NOTEST annotations to SKIP)

Loading

Copy link
Collaborator

@tpoterba tpoterba left a comment

want to understand these changes tomorrow before this goes in

Loading

Copy link
Collaborator

@chrisvittal chrisvittal left a comment

Did not realize there was a difference between NOTEST and SKIP, approval revoked until that's resolved.

Loading

Copy link
Collaborator

@tpoterba tpoterba left a comment

love the doctest changes and I like the new feature. Need to revert SKIP to NOTEST, though.

Loading

@typecheck(expr=expr_any)
def counter(expr) -> DictExpression:
@typecheck(expr=expr_any, weight=nullable(expr_any))
def counter(expr, weight=None) -> DictExpression:
Copy link
Collaborator

@tpoterba tpoterba Aug 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if weight is a keyword-only argument, so it's totally clear what's going on when this is used (this makes it easier to maintain back-compatibility in interfaces too).

The typecheck should also be expr_int64

Loading

Copy link
Collaborator Author

@danking danking Aug 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is now keyword-only. The type check is now expr_numeric. Floats are fine. Added a test demonstrating this.

Loading

{'M': 2L, 'F': 2L}
For each sample and gene, count the number of alternate alleles in each
Copy link
Collaborator

@tpoterba tpoterba Aug 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this example seems like it belongs in a how-to, not the counter docs. I'd prefer an example using weight= here generically.

Weight also needs to be documented.

Loading

Copy link
Collaborator Author

@danking danking Aug 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gene consequence example removed.

Loading

@@ -285,7 +285,25 @@ task testPython(type: Exec, dependsOn: shadowJar) {
}

task doctest(type: Exec, dependsOn: shadowJar) {
Copy link
Collaborator

@tpoterba tpoterba Aug 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just delete this I think

Loading

Copy link
Collaborator Author

@danking danking Aug 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my makefile PR addresses this.

Loading

@danking danking dismissed stale reviews from tpoterba and chrisvittal Aug 23, 2019

x

Copy link
Collaborator

@chrisvittal chrisvittal left a comment

Merge these conflicts

Loading

danking added 10 commits Aug 26, 2019
I think this may help users discover `group_by` and also help users
who are comfortable with the idea of a counter but not with `group_by`.
@danking danking merged commit 4c91c36 into hail-is:master Aug 26, 2019
1 check passed
Loading
@danking danking deleted the counter-docs branch Dec 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants