Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-15049: Add support for DataUnitJoin tables #57

Merged
merged 2 commits into from Jul 23, 2018
Merged

Conversation

andy-slac
Copy link
Contributor

Not quite complete, region-based joins should work OK, but MultiCameraExposureJoin is not implemented yet.
DataUnits with regions and DataUnitRegions now have explicit declaration for which table column contains region data. This also adds DataUnitRegion class so that region column declaration can be used by client code.

Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I've made a few stylistic comments, and I have one big question about some code that either doesn't work at all or doesn't make sense to me.

As for testing, I'm happy with deferring unit tests to another ticket as long as you've at least done some one-off testing on the ci_hsc dataset to make sure it isn't completely broken. It sounds like we should prioritize getting a dump of that Registry test data into daf_butler so we can use it in unit tests, though.

def getRegionTable(self, *dataUnitNames):
"""Return the region table that holds regions for the given combination
def getRegionDataUnit(self, *dataUnitNames):
"""Return the DataUnit that holds regions for the given combination
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The returned thing isn't (necessarily) a DataUnit, so I feel like the documentation and the name of the method aren't quite right (especially as they're part of a DataUnitRegistry, so "DataUnit" is kind of redundant.

What we're really returning is an abstraction for a database table - we just don't want to call it that (I think) because we've reserve "table" to mean sqlalchemy.Table, which is at a lower level of abstraction.

I'm not super fond of this suggestion, but how about we call this getRegionHolder, and just change the docs to say "Return the DataUnit or DataUnitRegion..."?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not happy about the name either, and I like getRegionHolder() better, will rename it.



def _filterRegions(rowIter, firstRegionIndex):
"""Fiters result rows that have non-overlapping regions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: Fiters.

matching some regions may not overlap, this generator method filters
rows that have disjoint regions. If result row contains more than two
regions (this should not happen with our current schema) then row is
filtered if any of two regions are disjoint.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to understand the reason why there should not be more than two regions per row - it's because there are fundamentally only VisitSensor and Patch regions, because you filter out Visit and Tract regions when they summarize those?

If so, I think the statement is actually that we can't get multiple regions per row as long as the query covers at most one SkyMap and at most one Camera. That's an entirely reasonable limitation for the expression system for now (after all, that will bring us to more than feature parity with Gen2), and perhaps forever (multiple Cameras and SkyMaps may be rare enough that we always write custom SQL to support them), but it's really about the expression system, not the schema itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, because coarser join tables summarize finer join tables you effectively end up with only one join table used in in SQL expression. True, this only applies for the case of single SkyMap/Camera, but this is all that pre-flight supports currently. I think that schema should more or less reflect expression system assumptions (with some extra things like summarizes defined in schema.yaml) so to me they are almost one and the same thing.

Copy link
Member

@TallJimbo TallJimbo Jul 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema is supposed to be general enough to handle all of our use cases, but the expression system only needs to be general enough to cover the common ones, as long as we have a fall-back to direct SQL. It'd be nice if the expression system can eventually become as general as the schema, but that's a question of whether the complexity is worthwhile.

continue

for connection in (dataUnitJoin.lhs, dataUnitJoin.rhs):
regionDataUnit = self._schema.dataUnits.getRegionDataUnit(*connection)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how this works - there is no guarantee that there is a DataUnitJoin with a lhs and rhs that together match every entity that has a region - in fact, most regions are on single DataUnits, so calling getRegionDataUnit with two arguments is never going to find those. And the only multi-DataUnit region we have, VisitSensorRegion, isn't matched by any DataUnitJoin because there's no DataUnitJoin between Visit and Sensor.

I think the right way to include regions would be to have a way to iterate over multi-table region holders in DataUnitRegistry, much like how you iterated over DataUnitJoins on line 206, and then filter out region holders whose participants are not a subset of allUnitNames.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a loop which first uses lhs and then rhs, so there are two calls here, getRegionDataUnit(*lhs) and getRegionDataUnit(*rhs), not getRegionDataUnit(*(lhs, rhs)). I can add a comment before loop to clarify if that looks confusing.

Iterator for rows returned by the query on registry
firstRegionIndex : `int` or ``None``
If not ``None`` then this is the starting position of the regions
in the row, all regions are encoded as bytes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume all other regions have consecutive integer indices after this one? Might be good to say so.

Also, better to end the sentence with "in the row." and make "All regions" the beginning of a new one.

# We also have to include regions from each side of the join into
# resultset so that we can filter-out non-overlapping regions.
firstRegionIndex = len(header)
selectColumns.append(regionDataUnit.regionColumn)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice (not necessarily for this ticket) to make the extra filtering based on region configurable via an option in schema.yaml, for databases that can do that inside the query (and define the spatial join tables accordingly) - I think this is actually doable for SQLite (see DM-14830), though I don't know if it will be efficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll have to experiment with that. For SQLite I do not think we'll see big difference as everything is still done on client side. It may make difference for Oracle but I have no idea how to implement server-side functions that know about sphgeom regions.

@andy-slac andy-slac force-pushed the tickets/DM-15049 branch 2 times, most recently from fa3c391 to 86eae59 Compare July 23, 2018 16:45
Not quite complete, region-based joins should work OK, but
MultiCameraExposureJoin is not implemented yet.
DataUnits with regions and DataUnitRegions now have explicit declaration
for which table column contains region data. This also adds
DataUnitRegion class so that region column declaration can be used by
client code.
@andy-slac andy-slac merged commit 5760287 into master Jul 23, 2018
@ktlim ktlim deleted the tickets/DM-15049 branch August 25, 2018 06:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants