-
Notifications
You must be signed in to change notification settings - Fork 250
[hail] add weight to counter and reference group_by in docs #6856
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix tests, looks great otherwise
ugh. the doctest data is not regenerated on each commit so it's been stale for a long time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, This spiraled into a huge PITA. The fixes:
- ignore
python/.eggs
- make
PARALLELISM
configurable inMakefile
- fix
make pytest
(it referenced a non-extant target) - add
make doctest
(this andpytest
use setup.py to replicate the environment the user would have after installation, I prefer this approach because I need not manually install any dependencies, setup.py handles that, it also configures spark correctly without environment variables) - harmonize
doctest
andpytest
parameters inbuild.gradle
andMakefile
- clean up import order in
conftest.py
to match pylint's desired ordering - use a
temple.TemporaryDirectory
for all doctest and test output, which is automatically cleaned up (if you want to interrogate it you canctrl-z
a running doctest); this allows us to not copy the entire python directory into a build directory before running pytest - important: re-generate all input datasets on every run of the tests. Previously, there was a file
doctest_write_data.py
which you were supposed to run when you changed the datasets, but if Hail changes then the random datasets generated bydoctest_write_data.py
might change. This means when I came along to add a new dataset, I had to address all the test failures introduced since the last timedoctest_write_data.py
's results were checked in. (the doctests still only take about 2 minutes) - I fixed several latent doc bugs caused by the aforementioned situation
- I changed "Using Variants as Covariates" in
guides/genetics.rst
because it seemed complicated and was broken by the aforementioned situation - I removed
NOTEST
which was a custom pytest annotation that duplicates the functionality ofSKIP
(I changed allNOTEST
annotations toSKIP
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
want to understand these changes tomorrow before this goes in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did not realize there was a difference between NOTEST
and SKIP
, approval revoked until that's resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love the doctest changes and I like the new feature. Need to revert SKIP to NOTEST, though.
@typecheck(expr=expr_any) | ||
def counter(expr) -> DictExpression: | ||
@typecheck(expr=expr_any, weight=nullable(expr_any)) | ||
def counter(expr, weight=None) -> DictExpression: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer if weight is a keyword-only argument, so it's totally clear what's going on when this is used (this makes it easier to maintain back-compatibility in interfaces too).
The typecheck should also be expr_int64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is now keyword-only. The type check is now expr_numeric
. Floats are fine. Added a test demonstrating this.
{'M': 2L, 'F': 2L} | ||
|
||
For each sample and gene, count the number of alternate alleles in each |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this example seems like it belongs in a how-to, not the counter docs. I'd prefer an example using weight=
here generically.
Weight also needs to be documented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gene consequence example removed.
hail/build.gradle
Outdated
@@ -285,7 +285,25 @@ task testPython(type: Exec, dependsOn: shadowJar) { | |||
} | |||
|
|||
task doctest(type: Exec, dependsOn: shadowJar) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just delete this I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my makefile PR addresses this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merge these conflicts
I think this may help users discover `group_by` and also help users who are comfortable with the idea of a counter but not with `group_by`.
I think this may help users discover
group_by
and also help userswho are comfortable with the idea of a counter but not with
group_by
.I added a new dataset for doctests and I realized a couple things:
I also noticed that when you specify no
row_key
toimport_matrix_table
you get a row key calledrow_id
, which is annoying.Anyway now when someone asks how to count the mutations in each gene by consequence type we can point them to the
counter
docs.Adding a dataset caused a bunch of docs failure that lead me to change how we do doctesting. The changes are summarized below.
python/.eggs
PARALLELISM
configurable inMakefile
make pytest
(it referenced a non-extant target)make doctest
(this andpytest
use setup.py to replicate the environment the user would have after installation, I prefer this approach because I need not manually install any dependencies, setup.py handles that, it also configures spark correctly without environment variables)doctest
andpytest
parameters inbuild.gradle
andMakefile
conftest.py
to match pylint's desired orderingtemple.TemporaryDirectory
for all doctest and test output, which is automatically cleaned up (if you want to interrogate it you canctrl-z
a running doctest); this allows us to not copy the entire python directory into a build directory before running pytestdoctest_write_data.py
which you were supposed to run when you changed the datasets, but if Hail changes then the random datasets generated bydoctest_write_data.py
might change. This means when I came along to add a new dataset, I had to address all the test failures introduced since the last timedoctest_write_data.py
's results were checked in. (the doctests still only take about 2 minutes)guides/genetics.rst
because it seemed complicated and was broken by the aforementioned situationNOTEST
which was a custom pytest annotation that duplicates the functionality ofSKIP
(I changed allNOTEST
annotations toSKIP
)