Skip to content

[query] Add query_matrix_table an analogue to query_table#14806

Merged
hail-ci-robot merged 13 commits into
hail-is:mainfrom
chrisvittal:query/query-matrix-table
Mar 4, 2025
Merged

[query] Add query_matrix_table an analogue to query_table#14806
hail-ci-robot merged 13 commits into
hail-is:mainfrom
chrisvittal:query/query-matrix-table

Conversation

@chrisvittal

Copy link
Copy Markdown
Collaborator

CHANGELOG: Add query_matrix_table an analogue to query_table

Part of #14499.

Security Assessment

Delete all except the correct answer:

  • This change has no security impact

Impact Description

Increases the query python API surface, but is only a query change.

private val zippedReader = PartitionZippedNativeReader(rowsReader, entriesReader)

def contextType = rowsReader.contextType
def fullRowType = zippedReader.fullRowType

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@patrick-schultz I'm getting an NPE in initialization, I think it's because this assertion in PartitionReader is running before zippedReader is initialized. Thoughts on resolving that?

abstract class PartitionReader {
assert(fullRowType.hasField(uidFieldName))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's annoying. My first thought is to make the sub-readers lazy vals. Maybe try that for now, and we can think more if there's any more satisfying fix.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. It worked. I guess lazy val can force initialization, but val is only truly valid after full initialization of the class.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, lazy val foo = init is basically private var _foo = null and def foo = { if (_foo == null) _foo = init; _foo }. So it will be initialized the first time it's accessed, rather than when the class initializer is run. Usually that means it's initialized later than the class, but in this case it happens earlier.

@chrisvittal chrisvittal self-assigned this Feb 3, 2025
@chrisvittal chrisvittal force-pushed the query/query-matrix-table branch 4 times, most recently from e065c6d to 4e0dbe5 Compare February 4, 2025 18:11
@chrisvittal

Copy link
Copy Markdown
Collaborator Author

I think this is ready for review.

@chrisvittal chrisvittal force-pushed the query/query-matrix-table branch 2 times, most recently from b04b3f5 to cd1d3a0 Compare February 5, 2025 06:21
Comment on lines +1499 to +1533
private[this] class PartitionEntriesNativeIntervalReader(
sm: HailStateManager,
entriesPath: String,
entriesSpec: AbstractTableSpec,
uidFieldName: String,
rowsTableSpec: AbstractTableSpec,
) extends PartitionNativeIntervalReader(sm, entriesPath, entriesSpec, uidFieldName) {
override lazy val partitioner = rowsTableSpec.rowsSpec.partitioner(sm)
}

case class PartitionZippedNativeIntervalReader(
sm: HailStateManager,
mtPath: String,
mtSpec: AbstractMatrixTableSpec,
uidFieldName: String,
) extends PartitionReader {
require(mtSpec.indexed)

// XXX: rows and entries paths are hardcoded, see MatrixTableSpec
private lazy val rowsReader =
PartitionNativeIntervalReader(sm, mtPath + "/rows", mtSpec.rowsSpec, "__dummy")

private lazy val entriesReader =
new PartitionEntriesNativeIntervalReader(
sm,
mtPath + "/entries",
mtSpec.entriesSpec,
uidFieldName,
rowsReader.tableSpec,
)

private lazy val zippedReader = PartitionZippedNativeReader(rowsReader, entriesReader)

def contextType = rowsReader.contextType
def fullRowType = zippedReader.fullRowType

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgive me Chris, I'm not really following what's mandating all these private lazy vals.
Since you're not using PartitionZippedNativeIntervalReader for pattern matching, can you make this a smart constructor that returns an anonymous instance of PartitionReader? Perhaps that might solve the initialisation order issue?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. So I'm using these values as components for the code generation. It's much easier and more understandable to reuse what we already have. They're private because they're pure implementation detail and they're lazy because without it, I run this assertion failing due to an NPE.

assert(fullRowType.hasField(uidFieldName))

I don't fully understand what we mean by a smart constructor.

@ehigham ehigham Feb 18, 2025

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining. A smart constructor is one that builds and validates the state and returns a plain old data object rather than the pattern of subclassing and doing a whole bunch in the constructor. In this case, it might look something like:

def PartitionZippedNativeIntervalReader(
  sm: HailStateManager,
  mtPath: String,
  mtSpec: AbstractMatrixTableSpec,
  uidFieldname: String,
): PartitionReader = {

  // build and validate your state here
  val rowsReader = 
    new PartitionNativeIntervalReader(sm, mtPath + "/rows", mtSpec.rowsSpec, "__dummy")

  object entriesReader 
    extends PartitionNativeIntervalReader(
      sm,
      mtPath + "/entries",
      mySpec.entriesSpec,
      uidFieldname
    ) {
      override lazy val partitioner = 
        rowsTableSpec.rowsSpec.partitioner(sm)
    }
 
  val zippedReader = 
    PartitionZippedNativeReader(rowsReader, entriesReader)

  ...

  // return a dumb data object 
  new ParitionReader {
    override def contextType = rowsReader.contextType
    override def fullRowType = zippedReader.fullRowType
    override val uidFieldName = uidFieldname
    override ..

  }
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a useful pattern. I'll keep it in mind for the future.

Comment thread hail/python/hail/expr/functions.py Outdated
Comment thread hail/python/hail/expr/functions.py Outdated

@ehigham ehigham left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. I have a few comments on readability + test parameterisation. Thanks!

Comment thread hail/python/hail/expr/functions.py Outdated
Comment thread hail/python/test/hail/matrixtable/test_matrix_table.py Outdated
@chrisvittal chrisvittal force-pushed the query/query-matrix-table branch 2 times, most recently from 08bfa1a to 4999e0f Compare February 7, 2025 18:37
@chrisvittal chrisvittal force-pushed the query/query-matrix-table branch from 6bc01af to c1840f8 Compare February 10, 2025 17:26
@chrisvittal chrisvittal requested a review from ehigham February 18, 2025 18:36

@ehigham ehigham left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your answers!

@patrick-schultz patrick-schultz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Chris! A little hacky, but I'm fine with it. Just needs a test for the randomness uids.

Comment on lines +2457 to +2459
@pytest.mark.parametrize("query,expected", query_matrix_table_rows_test_parameters())
def test_query_matrix_table_rows(query_mt_mt, query, expected):
assert hl.eval(hl.query_matrix_table_rows(query_mt_mt, query, 'e')) == expected

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like there's any test coverage for the randomness machinary. Following the strategy used by the test_table_randomness_* tests, I think something like this would do it:

def test_query_matrix_table_rows(query_mt_mt):
    i1 = hl.interval(27, 45)
    i2 = hl.interval(45, 80, includes_end=True)
    rows = hl.query_matrix_table_rows(query_mt_mt, i1, 'e').extend(hl.query_matrix_table_rows(query_mt_mt, i2, 'e'))
    x = hl.eval(rows.aggregate(hl.struct(r=hl.agg.collect_as_set(ht.r), n=hl.agg.count())))
    assert len(x.r) == x.n

I don't see any randomness tests for query_table either. Would you mind adding one? (Should be identical to the matrix one.) If they uncover any issues, I'm happy to help debug.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see if these new tests pass.

@chrisvittal chrisvittal force-pushed the query/query-matrix-table branch from 219ee7f to f2a497a Compare March 3, 2025 20:36

@patrick-schultz patrick-schultz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, thanks!

@hail-ci-robot hail-ci-robot merged commit fc85a38 into hail-is:main Mar 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants