Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hail] add weight to counter and reference group_by in docs #6856

Merged
merged 13 commits into from Aug 26, 2019

Conversation

@danking
Copy link
Collaborator

commented Aug 12, 2019

I think this may help users discover group_by and also help users
who are comfortable with the idea of a counter but not with group_by.

I added a new dataset for doctests and I realized a couple things:

  • doctest_write_data.py is not deterministic
  • if I add/change one dataset, my commit explodes with changes to all the datasets (see above)
  • doctest_write_data.py has to be run by me, it's not run by CI

I also noticed that when you specify no row_key to import_matrix_table you get a row key called row_id, which is annoying.

Anyway now when someone asks how to count the mutations in each gene by consequence type we can point them to the counter docs.


Adding a dataset caused a bunch of docs failure that lead me to change how we do doctesting. The changes are summarized below.

  • ignore python/.eggs
  • make PARALLELISM configurable in Makefile
  • fix make pytest (it referenced a non-extant target)
  • add make doctest (this and pytest use setup.py to replicate the environment the user would have after installation, I prefer this approach because I need not manually install any dependencies, setup.py handles that, it also configures spark correctly without environment variables)
  • harmonize doctest and pytest parameters in build.gradle and Makefile
  • clean up import order in conftest.py to match pylint's desired ordering
  • use a temple.TemporaryDirectory for all doctest and test output, which is automatically cleaned up (if you want to interrogate it you can ctrl-z a running doctest); this allows us to not copy the entire python directory into a build directory before running pytest
  • important: re-generate all input datasets on every run of the tests. Previously, there was a file doctest_write_data.py which you were supposed to run when you changed the datasets, but if Hail changes then the random datasets generated by doctest_write_data.py might change. This means when I came along to add a new dataset, I had to address all the test failures introduced since the last time doctest_write_data.py's results were checked in. (the doctests still only take about 2 minutes)
  • I fixed several latent doc bugs caused by the aforementioned situation
  • I changed "Using Variants as Covariates" in guides/genetics.rst because it seemed complicated and was broken by the aforementioned situation
  • I removed NOTEST which was a custom pytest annotation that duplicates the functionality of SKIP (I changed all NOTEST annotations to SKIP)
@chrisvittal
Copy link
Collaborator

left a comment

fix tests, looks great otherwise

@danking

This comment has been minimized.

Copy link
Collaborator Author

commented Aug 13, 2019

ugh. the doctest data is not regenerated on each commit so it's been stale for a long time.

@danking
Copy link
Collaborator Author

left a comment

OK, This spiraled into a huge PITA. The fixes:

  • ignore python/.eggs
  • make PARALLELISM configurable in Makefile
  • fix make pytest (it referenced a non-extant target)
  • add make doctest (this and pytest use setup.py to replicate the environment the user would have after installation, I prefer this approach because I need not manually install any dependencies, setup.py handles that, it also configures spark correctly without environment variables)
  • harmonize doctest and pytest parameters in build.gradle and Makefile
  • clean up import order in conftest.py to match pylint's desired ordering
  • use a temple.TemporaryDirectory for all doctest and test output, which is automatically cleaned up (if you want to interrogate it you can ctrl-z a running doctest); this allows us to not copy the entire python directory into a build directory before running pytest
  • important: re-generate all input datasets on every run of the tests. Previously, there was a file doctest_write_data.py which you were supposed to run when you changed the datasets, but if Hail changes then the random datasets generated by doctest_write_data.py might change. This means when I came along to add a new dataset, I had to address all the test failures introduced since the last time doctest_write_data.py's results were checked in. (the doctests still only take about 2 minutes)
  • I fixed several latent doc bugs caused by the aforementioned situation
  • I changed "Using Variants as Covariates" in guides/genetics.rst because it seemed complicated and was broken by the aforementioned situation
  • I removed NOTEST which was a custom pytest annotation that duplicates the functionality of SKIP (I changed all NOTEST annotations to SKIP)

@danking danking force-pushed the danking:counter-docs branch from 5c742ae to d2ee51a Aug 14, 2019

@tpoterba
Copy link
Collaborator

left a comment

want to understand these changes tomorrow before this goes in

@chrisvittal
Copy link
Collaborator

left a comment

Did not realize there was a difference between NOTEST and SKIP, approval revoked until that's resolved.

@tpoterba
Copy link
Collaborator

left a comment

love the doctest changes and I like the new feature. Need to revert SKIP to NOTEST, though.

@typecheck(expr=expr_any)
def counter(expr) -> DictExpression:
@typecheck(expr=expr_any, weight=nullable(expr_any))
def counter(expr, weight=None) -> DictExpression:

This comment has been minimized.

Copy link
@tpoterba

tpoterba Aug 16, 2019

Collaborator

I would prefer if weight is a keyword-only argument, so it's totally clear what's going on when this is used (this makes it easier to maintain back-compatibility in interfaces too).

The typecheck should also be expr_int64

This comment has been minimized.

Copy link
@danking

danking Aug 23, 2019

Author Collaborator

It is now keyword-only. The type check is now expr_numeric. Floats are fine. Added a test demonstrating this.

{'M': 2L, 'F': 2L}
For each sample and gene, count the number of alternate alleles in each

This comment has been minimized.

Copy link
@tpoterba

tpoterba Aug 16, 2019

Collaborator

this example seems like it belongs in a how-to, not the counter docs. I'd prefer an example using weight= here generically.

Weight also needs to be documented.

This comment has been minimized.

Copy link
@danking

danking Aug 23, 2019

Author Collaborator

gene consequence example removed.

@@ -285,7 +285,25 @@ task testPython(type: Exec, dependsOn: shadowJar) {
}

task doctest(type: Exec, dependsOn: shadowJar) {

This comment has been minimized.

Copy link
@tpoterba

tpoterba Aug 16, 2019

Collaborator

you can just delete this I think

This comment has been minimized.

Copy link
@danking

danking Aug 23, 2019

Author Collaborator

my makefile PR addresses this.

@danking danking force-pushed the danking:counter-docs branch from d2ee51a to b7941e7 Aug 23, 2019

@danking danking dismissed stale reviews from tpoterba and chrisvittal Aug 23, 2019

x

@chrisvittal
Copy link
Collaborator

left a comment

Merge these conflicts

danking added 10 commits Aug 12, 2019
[hail] add weight to counter and reference group_by in docs
I think this may help users discover `group_by` and also help users
who are comfortable with the idea of a counter but not with `group_by`.

@danking danking force-pushed the danking:counter-docs branch from 7712616 to 082bb10 Aug 26, 2019

danking added 3 commits Aug 26, 2019
fix

@danking danking merged commit 4c91c36 into hail-is:master Aug 26, 2019

1 check passed

ci-test success
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.