Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad error message when indexing a matrix table with itself #9121

Closed
danking opened this issue Jul 22, 2020 · 14 comments · Fixed by #9207
Closed

Bad error message when indexing a matrix table with itself #9121

danking opened this issue Jul 22, 2020 · 14 comments · Fixed by #9207
Assignees

Comments

@danking
Copy link
Contributor

danking commented Jul 22, 2020

Note that mt.cols()[mt.col_key] is obviously wrong but instead we get a big error message that is ultimately really quite confusing. A good error message would be "cannot index matrix table with itself".

(randomly assigning someone)

ExpressionException                       Traceback (most recent call last)
<ipython-input-47-76acaa85d728> in <module>
      9 #combined.show()
     10
---> 11 combined = combined.annotate_rows (N_Aa1 = mt.cols()[mt.col_key].N_Aa)
     12
     13 combined.cols().show()

<decorator-gen-1171> in annotate_rows(self, **named_exprs)

~/opt/miniconda3/lib/python3.7/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    612     def wrapper(__original_func, *args, **kwargs):
    613         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 614         return __original_func(*args_, **kwargs_)
    615
    616     return wrapper

~/opt/miniconda3/lib/python3.7/site-packages/hail/matrixtable.py in annotate_rows(self, **named_exprs)
    955         caller = "MatrixTable.annotate_rows"
    956         check_annotate_exprs(caller, named_exprs, self._row_indices)
--> 957         return self._select_rows(caller, self._rvrow.annotate(**named_exprs))
    958
    959     @typecheck_method(named_exprs=expr_any)

<decorator-gen-651> in annotate(self, **named_exprs)

~/opt/miniconda3/lib/python3.7/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    612     def wrapper(__original_func, *args, **kwargs):
    613         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 614         return __original_func(*args_, **kwargs_)
    615
    616     return wrapper

~/opt/miniconda3/lib/python3.7/site-packages/hail/expr/expressions/typed_expressions.py in annotate(self, **named_exprs)
   1624
   1625         result_type = tstruct(**new_types)
-> 1626         indices, aggregations = unify_all(self, *[x for (f, x) in named_exprs.items()])
   1627
   1628         return construct_expr(ir.InsertFields.construct_with_deduplication(

~/opt/miniconda3/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py in unify_all(*exprs)
    351                         n=len(sources),
    352                         fields=''.join("\n        {}: {}".format(src, fds) for src, fds in sources.items())
--> 353                     )) from None
    354     first, rest = exprs[0], exprs[1:]
    355     aggregations = first._aggregations

ExpressionException: Cannot combine expressions from different source objects.
    Found fields from 1 objects:
        <hail.matrixtable.MatrixTable object at 0x7ff0ce0d3c50>: ['s']
@akotlar
Copy link
Contributor

akotlar commented Jul 22, 2020

I ran across this when I was trying to place phenotypes (stored in entries), in globals without a collect, although I think I achieved this by something like annotate_globals(Y = mt.Y) (will need to double check). Relatedly, I found the index() method documentation confusing at first glance.

@tpoterba
Copy link
Contributor

I don't think there's a way to put entries into globals without a collect

@akotlar
Copy link
Contributor

akotlar commented Jul 22, 2020

I think you're right. I tried a number of things, but I need something to key the column by, and a global has no concept a key (which is why it is a global). I found this very confusing.

Let's say mt.C contains phenotypes for samples 1..n. This is, in my mind, a distributed array, with someone fancy (non-integer) indexing support. Great, but I don't care about that, I just want a distributed array

I want to localize_entries, but this creates a hail Table, which drops my phenotypes, because that's now a table and not a matrix table (why! all I wanted was to create a new field in my MT with the result of a column aggregation per row). So the natural thing I reach to is storing my phenotypes elsewhere. I think: "I want to continue benefitting from Hail query planner), so I try not to materialize the phenotypes in memory. If I say mt.annotate_globals(Y = mt.C) I expect that to just work, because in my mind, I took something that was a a distributed array, but with more powerful indexing support, and converted it to something that is even more array like, that I'm going to need to understand how to index myself (which I'm fine with since I'm moving the thing to globals). Alternatively, I could also expect that globals now contains a reference to a new table, that contains only the column index, and value (phenotype), which seems fine.

Neither of these options happens. Instead, I need to realize the array in memory on my master, which seems like a potentially bad idea. The bigger problem though is that I want 1 change (simplify indexing or make a reference to the array), and I seem to need 3 (that + memory + loss of distribution).

In short: I want to be able to choose whether I realize the values in memory, not be forced into it. Let me know if there's something I missed!

@tpoterba
Copy link
Contributor

I want to localize_entries, but this creates a hail Table, which drops my phenotypes

You can put them in globals, there are params for the names of the resulting entries and resulting col fields. I think this will solve your problem, for the most part?

@akotlar
Copy link
Contributor

akotlar commented Jul 22, 2020

Right, so I placed my phenotypes in globals, but I had to collect() which realized in memory. Here it wasn't a problem because phenotypes are typically small. My issue was surprise factor. I expected a smarter compiler.

@tpoterba
Copy link
Contributor

sorry, I think I wasn't clear -- you can put them in globals when doing mt.localize_entries by passing both the entries_field_name and columns_array_field_name params.

@danking
Copy link
Contributor Author

danking commented Jul 22, 2020

If columns_array_field_name is None in localize_entries, the column fields are given a unique identifier, not dropped. There should be a global fields with a name like __uid_9 which is array<struct{...}> where the struct fields correspond to the column fields of the MT.

I agree localize_entries is super confusing and I think if end-users are using it, we probably need to step up the MatrixTable API or push on DNDArray so that they never have to utter the words "localize entries".

@tpoterba
Copy link
Contributor

        entries = entries_array_field_name or Env.get_uid()
        cols = columns_array_field_name or Env.get_uid()
        t = self._localize_entries(entries, cols)
        if entries_array_field_name is None:
            t = t.drop(entries)
        if columns_array_field_name is None:
            t = t.drop(cols)
        return t

@danking
Copy link
Contributor Author

danking commented Jul 22, 2020

Ah, well. disregard first point. Second point stands.

@akotlar
Copy link
Contributor

akotlar commented Jul 22, 2020

you can put them in globals when doing mt.localize_entries by passing both the entries_field_name and columns_array_field_name params.

Ah, I noticed the columns_array_field parameter, but hadn't tried it. That may do what I need.

I think the result of this may be a PR or blog post on how to get write Regenie in Hail, so that others can see somewhat more advanced uses of the MT/T apis than is currently present in docs. Would that be of interest?

@danking
Copy link
Contributor Author

danking commented Jul 22, 2020

If we feel confident the APIs make sense, then that's a great idea!

I'm a bit worried that localize_entries was an API mistake (my fault :/) that we use because we lack better tools. I'm especially curious to see how its used and whether there's a better API for that kind of work.

@akotlar
Copy link
Contributor

akotlar commented Jul 22, 2020

Currently I'm only using it to place allele counts (GT.n_allele_counts()) into a single field, first as array of structs{ac: }, then [ac1,ac2, ..acN].

There are actually a number of steps to get this to work, and what I really want is fancy Numpy-like indexing, along the lines of:
X_blocked = range(1, n_blocks).map(lambda block: np.array( mt.c[ (block - 1) * rows_per_block : block *rows-per_block , : ])
Y = mt.y
B = (X_blocked.T @ X_blocked + lambda * I) @ X_blocked @ Y

Where X_blocked is now a n_blocks x b_rows x n 3d array

@akotlar
Copy link
Contributor

akotlar commented Jul 22, 2020

If we feel confident the APIs make sense

Absolutely

@danking
Copy link
Contributor Author

danking commented Jul 24, 2020

I think for that purpose, I think I'd write the blog post to be:

mt = mt.annotate_rows(acs = hl.agg.collect(mt.GT.n_alt_alleles())
rows = mt.rows()

(assuming you mean n_alt_alleles for n_allele_counts)

@tpoterba tpoterba assigned tpoterba and unassigned johnc1231 Aug 3, 2020
tpoterba added a commit to tpoterba/hail that referenced this issue Aug 3, 2020
danking pushed a commit that referenced this issue Aug 5, 2020
* [query] Add expression source checking to annotate methods

Fixes #9121

* fix annotate errors

* broadcast
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants