
DM-27152: Allow alternate values in dimensions and also unadorned dimension record columns #440

Merged
merged 5 commits into master from tickets/DM-27152 on Nov 24, 2020

Conversation

@timj (Member) commented Nov 20, 2020

This means that:

  • exposure="AT_C_20201210_00055" will be converted to the exposure ID
  • detector="R11_S10" will be converted to the detector ID
  • day_obs=20201214, seq_num=55 will be associated with the exposure dimension
  • raft=R11, name_in_raft=S10 will be associated with the detector dimension

If a dataId key is assignable to multiple dimensions (e.g. day_obs is in both exposure and visit), it is assigned to the most popular dimension (so if seq_num is also specified, that would be exposure), with preference given to a required dimension over an inferred temporal dimension.
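The "most popular dimension" idea can be illustrated with a minimal, self-contained sketch. This is not the actual butler implementation: the column-to-dimension table and the vote counting are invented for illustration, and the required-versus-temporal tie-breaking is omitted.

```python
from collections import Counter

# Hypothetical mapping from an unadorned record column to the dimensions
# that define it; note "day_obs" appears in both exposure and visit.
COLUMN_TO_DIMENSIONS = {
    "day_obs": {"exposure", "visit"},
    "seq_num": {"exposure"},
    "raft": {"detector"},
    "name_in_raft": {"detector"},
}


def guess_dimensions(data_id):
    """Assign each ambiguous column to the dimension matched by the most
    columns overall (the "most popular" dimension)."""
    votes = Counter()
    for column in data_id:
        for dim in COLUMN_TO_DIMENSIONS.get(column, ()):
            votes[dim] += 1
    assignment = {}
    for column in data_id:
        candidates = COLUMN_TO_DIMENSIONS.get(column)
        if candidates:
            assignment[column] = max(candidates, key=lambda d: votes[d])
    return assignment
```

Here day_obs plus seq_num gives exposure two votes to visit's one, so both keys are assigned to exposure.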

@TallJimbo (Member) left a comment


Looks great; the decision-making logic you came up with seems like a good solution to the problem.

Two big-picture comments:

  • It'd be good to short-circuit as much of this logic as possible when the given data ID is already a DataCoordinate instance, at least when there are no kwargs.
  • I am still looking forward to integrating this with DataCoordinate.standardize somehow. I have some thoughts on how to do that, but the problems that came up when we considered it previously suggest it will take some experimentation to work out how to refactor both this new code and the code it ought to fit into. So I don't think that's a job for this ticket, but in case you want to give it a shot before I get to it myself, I'm thinking vaguely of:
@dataclass
class DataIdPartition:

    @classmethod
    def partition(
        cls,
        mapping: Optional[NameLookupMapping[Dimension, DataIdValue]] = None,
        *,
        graph: Optional[DimensionGraph] = None,
        universe: Optional[DimensionUniverse] = None,
        temporal: bool = False,
        **kwargs: Any,
    ) -> DataIdPartition:
        """Partition an input mapping into dimension key-value pairs and other
        constraints.

        Most parameters are the same as `DataCoordinate.standardize`; if
        ``temporal`` is `True` and ``graph`` is provided, ``graph`` is
        extended so that any dimensions with ``Dimension.temporal is not
        None`` can also be constrained.
        """
        # TODO

    def resolve(self, registry: Optional[Registry] = None) -> DataCoordinate:
        """Resolve this partition into a `DataCoordinate` via registry query
        (if necessary).

        If ``self.constraints`` is `None`, ``registry`` is ignored (and hence
        may be `None`).
        """
        # TODO

    keys: NamedKeyMapping[Dimension, DataIdValue]
    """Dimensions and the associated primary key values given directly.
    """

    constraints: NamedKeyMapping[Dimension, Mapping[str, DataIdValue]]
    """Additional constraints on dimension table columns.
    """

    graph: DimensionGraph
    """Dimensions identified by the `DataCoordinate` instances returned by
    `resolve`.
    """

That may not hold up to closer scrutiny, but maybe it's a start.
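As a rough, self-contained toy of the partitioning idea sketched above (the dimension metadata, column names, and ToyPartition class are all invented for this illustration; no registry or resolve step is involved):

```python
from dataclasses import dataclass

# Toy stand-in for dimension metadata: each dimension's other record
# columns. All names here are invented for the sketch.
DIMENSION_COLUMNS = {
    "exposure": {"day_obs", "seq_num"},
    "detector": {"raft", "name_in_raft"},
}


@dataclass
class ToyPartition:
    keys: dict         # dimension -> primary key value given directly
    constraints: dict  # dimension -> {column: value} needing a lookup

    @classmethod
    def partition(cls, mapping):
        """Split an input mapping into direct dimension key-value pairs
        and additional constraints on dimension record columns."""
        keys, constraints = {}, {}
        for name, value in mapping.items():
            if name in DIMENSION_COLUMNS:
                keys[name] = value
                continue
            for dim, columns in DIMENSION_COLUMNS.items():
                if name in columns:
                    constraints.setdefault(dim, {})[name] = value
                    break
        return cls(keys, constraints)
```

A real version would also carry the DimensionGraph and resolve the constraints into a DataCoordinate via a registry query, as in the sketch above.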

del dataIdDict[dimensionName]
continue

if not isinstance(value, dimension.primaryKey.getPythonType()):
Member


Any idea if this will prevent us from getting to the NumPy integer conversion we have downstream? It'd probably be good to add a test for this, at least.

Member Author


Good spot. It still worked, but it issued a warning; since 2 and numpy.int64(2) were both still the "right" answer for the later check, it made no difference. I now force integral types to an int here.
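A small self-contained sketch of that coercion (not the butler code itself; FakeInt64 stands in for numpy.int64 so the example has no NumPy dependency):

```python
import numbers


class FakeInt64:
    """Stand-in for numpy.int64: an Integral that is not a Python int."""

    def __init__(self, value):
        self.value = value

    def __int__(self):
        return self.value

    def __index__(self):
        return self.value


# numpy.int64 is registered as a numbers.Integral; mirror that here.
numbers.Integral.register(FakeInt64)


def coerce_primary_key(value):
    """Force any integral type (e.g. numpy.int64) to a plain int so a later
    isinstance check against the primary key's Python type succeeds."""
    if isinstance(value, numbers.Integral) and not isinstance(value, int):
        return int(value)
    return value
```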

candidateDimensions.add(str(dim))

# Look up table for the first association with a dimension
guessedAssociation: dict[Any, dict[str, Any]] = defaultdict(dict)
Member


🤯 I thought these PEP 585 annotations weren't allowed until Python 3.9! Glad it didn't take me any longer to learn I was wrong (though I also see you were already using this in the last ticket I reviewed on this topic, in code just above, so I suppose I'm just blind).

Also, in case they'd be useful here, there are DataIdKey and DataIdValue type aliases (in dimensions/_coordinate.py, exported all the way up to daf.butler) that could be used instead of these Anys, I think.

Member


Update: these annotations still don't work with my version of mypy, and they don't work with yours either, because the mypy issue is still open (python/mypy#7907) and it turns out we're not type-checking this file at all! We've only enabled mypy for a bunch of specific daf.butler subpackages, but not _butler.py (and we're probably missing a few other modules, too).
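For what it's worth, a small illustration of why the annotation at issue is harmless at runtime even on older interpreters (the function and values are invented for the sketch):

```python
from collections import defaultdict


def guess_associations() -> "dict[str, dict[str, int]]":
    # A local-variable annotation is never evaluated at runtime, so the
    # PEP 585 generic below parses on any Python 3 interpreter; only a
    # static checker such as mypy needs to understand it. Quoting the
    # return annotation (or using `from __future__ import annotations`)
    # gives the signature the same protection before Python 3.9.
    guessed: dict[str, dict[str, int]] = defaultdict(dict)
    guessed["exposure"]["seq_num"] = 55
    return dict(guessed)
```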

Member


I'm going to fix the mypy config and type annotations in at least this file on https://jira.lsstcorp.org/browse/DM-27689.

Member Author


Good spot. I'll leave it for you to fix then and just merge this.

@timj (Member Author) commented Nov 23, 2020

I will not try to do the refactor on this ticket. Whilst updating the tests I realized that instrument.class_name does not work, because there is a standardization phase on the dataId inside queryDimensionRecords. It is possible that governor dimensions won't work with this inferred scheme for the moment.

Commit messages:

  • If the type of the value for the dimension does not match the Python type of the primary key, look at the alternate keys for a match. This commit does not support keyword args.
  • This allows raft and name_in_raft to be used for detector, or seq_num and day_obs to be used for exposure.
  • Also includes disambiguation logic to work out that day_obs can be in multiple dimensions; picks the most popular dimension.
  • This again traps for numpy.int64 and might even make the later trap moot.
@timj timj merged commit 3f99bad into master Nov 24, 2020
@timj timj deleted the tickets/DM-27152 branch November 24, 2020 03:13