Bad error message when indexing a matrix table with itself #9121
Comments
I ran across this when I was trying to place phenotypes (stored in entries) in globals without a collect, although I think I achieved this with something like annotate_globals(Y = mt.Y) (will need to double check). Relatedly, I found the index() method documentation confusing at first glance. |
I don't think there's a way to put entries into globals without a collect |
I think you're right. I tried a number of things, but I need something to key the column by, and a global has no concept of a key (which is why it is a global).

I found this very confusing. Let's say mt.C contains phenotypes for samples 1..n. This is, in my mind, a distributed array with somewhat fancy (non-integer) indexing support. Great, but I don't care about that; I just want a distributed array. I want to localize_entries, but this creates a Hail Table, which drops my phenotypes, because that's now a table and not a matrix table (why! all I wanted was to create a new field in my MT with the result of a column aggregation per row).

So the natural thing I reach for is storing my phenotypes elsewhere. I think: "I want to continue benefiting from Hail's query planner", so I try not to materialize the phenotypes in memory. If I say mt.annotate_globals(Y = mt.C), I expect that to just work, because in my mind I took something that was a distributed array with more powerful indexing support and converted it to something even more array-like, which I'm going to need to understand how to index myself (which I'm fine with, since I'm moving the thing to globals). Alternatively, I could also expect that globals now contains a reference to a new table that contains only the column index and value (phenotype), which seems fine.

Neither of these options happens. Instead, I need to realize the array in memory on my master, which seems like a potentially bad idea. The bigger problem, though, is that I want 1 change (simplify indexing or make a reference to the array), and I seem to need 3 (that + memory + loss of distribution). In short: I want to be able to choose whether I realize the values in memory, not be forced into it.

Let me know if there's something I missed! |
You can put them in globals, there are params for the names of the resulting entries and resulting col fields. I think this will solve your problem, for the most part? |
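For reference, a minimal sketch of what that call might look like. The helper name `localize_with_named_fields` and the field names `entries` and `cols` are made up for illustration; `entries_array_field_name` and `columns_array_field_name` are the two naming parameters of `MatrixTable.localize_entries`. Assuming `mt` is a Hail MatrixTable, the result should be a Table whose rows carry an array of entry structs and whose globals carry an array of column structs, with no explicit `collect()` call.

```python
def localize_with_named_fields(mt, entries_name="entries", cols_name="cols"):
    """Convert a MatrixTable-like object into a Table, keeping entries and columns.

    Assumption: `mt` exposes Hail's `localize_entries` method. The default
    field names used here are illustrative, not required by Hail.
    """
    return mt.localize_entries(
        # Name of the new row field holding one struct per column.
        entries_array_field_name=entries_name,
        # Name of the new global field holding the column structs
        # (column key plus column fields, e.g. phenotypes).
        columns_array_field_name=cols_name,
    )
```

With a real dataset this would be used as `ht = localize_with_named_fields(mt)`, after which the column fields should be reachable through the `cols` global of `ht`.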
Right, so I placed my phenotypes in globals, but I had to collect(), which realized them in memory. Here it wasn't a problem because phenotypes are typically small. My issue was the surprise factor. I expected a smarter compiler. |
sorry, I think I wasn't clear -- you can put them in globals when doing |
I agree localize_entries is super confusing, and I think if end-users are using it, we probably need to step up the MatrixTable API or push on DNDArray so that they never have to utter the words "localize entries". |
|
Ah, well. disregard first point. Second point stands. |
Ah, I noticed the columns_array_field parameter, but hadn't tried it. That may do what I need. I think the result of this may be a PR or blog post on how to write Regenie in Hail, so that others can see somewhat more advanced uses of the MT/T APIs than are currently present in the docs. Would that be of interest? |
If we feel confident the APIs make sense, then that's a great idea! I'm a bit worried that localize_entries was an API mistake (my fault :/) that we use because we lack better tools. I'm especially curious to see how it's used and whether there's a better API for that kind of work. |
Currently I'm only using it to place allele counts (GT.n_allele_counts()) into a single field, first as an array of structs{ac: }, then as [ac1, ac2, ..., acN]. There are actually a number of steps to get this to work, and what I really want is fancy NumPy-like indexing, along the lines of: Where X_blocked is now an n_blocks x b_rows x n 3D array |
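A sketch of the kind of NumPy-style blocking described here, in plain NumPy rather than Hail. The helper name `block_rows` and the toy matrix are assumptions for illustration, and the sketch assumes the number of rows divides evenly by `b_rows`.

```python
import numpy as np


def block_rows(X, b_rows):
    """Split an (m, n) matrix into an (n_blocks, b_rows, n) array of row blocks."""
    m, n = X.shape
    if m % b_rows != 0:
        raise ValueError(f"{m} rows cannot be split into blocks of {b_rows}")
    # C-order reshape keeps consecutive rows together in the same block.
    return X.reshape(m // b_rows, b_rows, n)


# Toy matrix: 6 rows (e.g. variants) by 4 columns (e.g. samples).
X = np.arange(24).reshape(6, 4)
X_blocked = block_rows(X, b_rows=2)

print(X_blocked.shape)  # (3, 2, 4): n_blocks x b_rows x n
```

Block `i` is then just `X_blocked[i]`, which holds the same values as `X[i * b_rows:(i + 1) * b_rows]`.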
Absolutely |
For that purpose, I think I'd write the blog post to be:
(assuming you mean n_alt_alleles for n_allele_counts) |
* [query] Add expression source checking to annotate methods

  Fixes #9121
* fix annotate errors
* broadcast
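To illustrate the general idea of source checking, here is a toy model in plain Python. It is not Hail's implementation: `ToyExpr`, `ToyDataset`, and the error wording are all hypothetical. Each expression remembers the dataset it was built from, and `annotate` rejects expressions from a different dataset with a short, direct message instead of failing later with a confusing one.

```python
class ToyExpr:
    """An expression that remembers which dataset it was built from."""

    def __init__(self, source, name):
        self.source = source
        self.name = name


class ToyDataset:
    """Hypothetical stand-in for a table or matrix table."""

    def __init__(self, name, fields):
        self.name = name
        self.fields = set(fields)

    def __getitem__(self, field):
        if field not in self.fields:
            raise KeyError(field)
        return ToyExpr(self, field)

    def annotate(self, **named_exprs):
        # Check every expression's source before doing any work.
        for new_name, expr in named_exprs.items():
            if expr.source is not self:
                raise ValueError(
                    f"annotate: {new_name!r} is built from {expr.source.name!r}, "
                    f"but is being used on {self.name!r}"
                )
        return ToyDataset(self.name, self.fields | set(named_exprs))


mt = ToyDataset("mt", ["GT"])
other = ToyDataset("other", ["Y"])

mt2 = mt.annotate(GT2=mt["GT"])  # same source: accepted
try:
    mt.annotate(Y=other["Y"])    # different source: rejected up front
except ValueError as err:
    message = str(err)
```

The point of the design is that the check runs when the expression is used, so the error can name both datasets involved.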
Note that
mt.cols()[mt.col_key]
is obviously wrong, but instead of a clear error we get a big error message that is ultimately quite confusing. A good error message would be "cannot index matrix table with itself".

(randomly assigning someone)