Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hail] add weight to counter and reference group_by in docs #6856

Merged
merged 13 commits into from Aug 26, 2019

Conversation

danking
Copy link
Collaborator

@danking danking commented Aug 12, 2019

I think this may help users discover group_by and also help users
who are comfortable with the idea of a counter but not with group_by.

I added a new dataset for doctests and I realized a couple things:

  • doctest_write_data.py is not deterministic
  • if I add/change one dataset, my commit explodes with changes to all the datasets (see above)
  • doctest_write_data.py has to be run by me, it's not run by CI

I also noticed that when you specify no row_key to import_matrix_table you get a row key called row_id, which is annoying.

Anyway now when someone asks how to count the mutations in each gene by consequence type we can point them to the counter docs.


Adding a dataset caused a bunch of docs failure that lead me to change how we do doctesting. The changes are summarized below.

  • ignore python/.eggs
  • make PARALLELISM configurable in Makefile
  • fix make pytest (it referenced a non-extant target)
  • add make doctest (this and pytest use setup.py to replicate the environment the user would have after installation, I prefer this approach because I need not manually install any dependencies, setup.py handles that, it also configures spark correctly without environment variables)
  • harmonize doctest and pytest parameters in build.gradle and Makefile
  • clean up import order in conftest.py to match pylint's desired ordering
  • use a temple.TemporaryDirectory for all doctest and test output, which is automatically cleaned up (if you want to interrogate it you can ctrl-z a running doctest); this allows us to not copy the entire python directory into a build directory before running pytest
  • important: re-generate all input datasets on every run of the tests. Previously, there was a file doctest_write_data.py which you were supposed to run when you changed the datasets, but if Hail changes then the random datasets generated by doctest_write_data.py might change. This means when I came along to add a new dataset, I had to address all the test failures introduced since the last time doctest_write_data.py's results were checked in. (the doctests still only take about 2 minutes)
  • I fixed several latent doc bugs caused by the aforementioned situation
  • I changed "Using Variants as Covariates" in guides/genetics.rst because it seemed complicated and was broken by the aforementioned situation
  • I removed NOTEST which was a custom pytest annotation that duplicates the functionality of SKIP (I changed all NOTEST annotations to SKIP)

Copy link
Collaborator

@chrisvittal chrisvittal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix tests, looks great otherwise

@danking
Copy link
Collaborator Author

danking commented Aug 13, 2019

ugh. the doctest data is not regenerated on each commit so it's been stale for a long time.

Copy link
Collaborator Author

@danking danking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, This spiraled into a huge PITA. The fixes:

  • ignore python/.eggs
  • make PARALLELISM configurable in Makefile
  • fix make pytest (it referenced a non-extant target)
  • add make doctest (this and pytest use setup.py to replicate the environment the user would have after installation, I prefer this approach because I need not manually install any dependencies, setup.py handles that, it also configures spark correctly without environment variables)
  • harmonize doctest and pytest parameters in build.gradle and Makefile
  • clean up import order in conftest.py to match pylint's desired ordering
  • use a temple.TemporaryDirectory for all doctest and test output, which is automatically cleaned up (if you want to interrogate it you can ctrl-z a running doctest); this allows us to not copy the entire python directory into a build directory before running pytest
  • important: re-generate all input datasets on every run of the tests. Previously, there was a file doctest_write_data.py which you were supposed to run when you changed the datasets, but if Hail changes then the random datasets generated by doctest_write_data.py might change. This means when I came along to add a new dataset, I had to address all the test failures introduced since the last time doctest_write_data.py's results were checked in. (the doctests still only take about 2 minutes)
  • I fixed several latent doc bugs caused by the aforementioned situation
  • I changed "Using Variants as Covariates" in guides/genetics.rst because it seemed complicated and was broken by the aforementioned situation
  • I removed NOTEST which was a custom pytest annotation that duplicates the functionality of SKIP (I changed all NOTEST annotations to SKIP)

Copy link
Contributor

@tpoterba tpoterba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

want to understand these changes tomorrow before this goes in

Copy link
Collaborator

@chrisvittal chrisvittal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not realize there was a difference between NOTEST and SKIP, approval revoked until that's resolved.

Copy link
Contributor

@tpoterba tpoterba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love the doctest changes and I like the new feature. Need to revert SKIP to NOTEST, though.

@typecheck(expr=expr_any)
def counter(expr) -> DictExpression:
@typecheck(expr=expr_any, weight=nullable(expr_any))
def counter(expr, weight=None) -> DictExpression:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if weight is a keyword-only argument, so it's totally clear what's going on when this is used (this makes it easier to maintain back-compatibility in interfaces too).

The typecheck should also be expr_int64

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is now keyword-only. The type check is now expr_numeric. Floats are fine. Added a test demonstrating this.

{'M': 2L, 'F': 2L}

For each sample and gene, count the number of alternate alleles in each
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this example seems like it belongs in a how-to, not the counter docs. I'd prefer an example using weight= here generically.

Weight also needs to be documented.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gene consequence example removed.

@@ -285,7 +285,25 @@ task testPython(type: Exec, dependsOn: shadowJar) {
}

task doctest(type: Exec, dependsOn: shadowJar) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just delete this I think

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my makefile PR addresses this.

Copy link
Collaborator

@chrisvittal chrisvittal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge these conflicts

I think this may help users discover `group_by` and also help users
who are comfortable with the idea of a counter but not with `group_by`.
@danking danking merged commit 4c91c36 into hail-is:master Aug 26, 2019
@danking danking deleted the counter-docs branch December 18, 2019 01:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants