Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-15214: extend pre-flight with output datasets information #78

Merged
merged 9 commits into from Aug 30, 2018

Conversation

andy-slac
Copy link
Contributor

Pre-flight API has changed to support extra information about existing output (and input) datasets, it now returns pre-made DatasetRefs instead of raw columns (via new class PreFlightUnitsRow).

Implemented support for specifying multiple input collections and per-dataset collections overrides (new PreFlightCollections class).

Pre-flight ginormous query had been extended to support priority-based search of input datasets in multiple collections (see docstring in sqlPreFlight).


Returns
-------
`list` of `str`, names of input collections.
Copy link
Member

@timj timj Aug 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns are meant to be written as:

Returns
-------
collections : `list` of `str`
    Names of input collections

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to give a name to returned item or can I just write

Returns
-------
`list` of `str`
    Names of input collections

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataId : `dict`
Maps DataUnit link name to its corresponding value.
datasetRefs : `dict`
Maps `DatasetType` to its corresponding `DatasetRef`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the keys actually DatasetType instances, or the str names associated with them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are DatasetType instances, this is probably not a hard requirement, can switch to names if that is more natural.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If DatasetType instances are more convenient for you, that's fine. I hadn't realized they were usable as dict keys, but it looks like they are.

The `list` of `DatasetTypes <DatasetType>` whose DataUnits will
be included in the returned column set. Output is limited by the
the Datasets of these DatasetTypes which already exist in the
registry.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"limited by" -> "limited to"

The `list` of `DatasetTypes <DatasetType>` whose DataUnits will
be included in the returned column set. It is expected that
Datasets for these DatasetTypes do not exist in the registry,
but presently this is not checked.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think selectDataUnits would be more generally useful outside of preflight (e.g. for selecting subsets for export/transfer) if we never actually check for futureDatasetTypes not existing in the registry.
In fact, it might be most useful to have this take just take a list of tuples of DataUnits, rather than a list of DatasetTypes. I think that's all preflight uses this for, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not checking futureDatasetTypes can be added as an option indeed.

For preflight we still need to associate units with DatasetType names, and the query that we run also uses DatasetType names to find existing input Datasets, so we need some DatasetType information there, maybe just a name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like it would make the most sense to have another argument that just adds more explicit DataUnits to the query, while leaving this argument as it is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, and I'll leave that for the future tickets 🙂

rows that have disjoint regions. If result row contains more than two
regions (this should not happen with our current schema) then row is
filtered if any of two regions are disjoint.
Oredering is based on dependencies, units with no dependencies
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "Oredering"

Also, FYI, I've added a note to DM-15034 that it should obsolete this code by unifying the topological sorting done both here and in DataUnitRegistry. Right now I don't see any easy way to combine them.

dataUnit : `DataUnit`
DataUnit to join with ``fromClause``.
otherDataUnits : iterable of `DataUnit`
DataUnits whose tableshave PKs for ``dataUnit`` table's FK. They all
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "tableshave"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just read this as "table shave" 😃

(here ``Dataset.DataId`` means ``Dataset.link1, Dataset.link2, etc.``)

Filtering is complicated, it is simpler to use CTE but not all
databases support CTEs so we will have to do with the repeating
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's "CTE" here?

Copy link
Contributor Author

@andy-slac andy-slac Aug 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Common Table Expression - special SQL syntax. I'll update docstring.

----------
dsType : `DatasetType`
collections : `PreFlightCollections`
Object which provides names of the input/output collections.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find using "collections" for the name of this argument a bit confusing; that suggests that it would be the same list of strings that's passed to selectDataUnits. I think the best solution is to also come up with a new name for the PreFlightCollections class as well: how about calling the class something like DatasetOriginInfo, and this argument originInfo?

As I side note, I would eventually like to drop "Preflight" from the names of all of your classes in daf_butler, because I think they will eventually be much more generally useful. But I'm okay with doing that on another ticket, as I think there will also be some refactoring involved, unless you want to do it on this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 to DatasetOriginInfo and re-factoring on separate ticket.

ON Dataset.dataset_id = DatasetCollection.dataset_id
WHERE Dataset.dataset_type_name = :dsType_name
AND DatasetCollection.collection = :collection_name

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably good to add a reminder note here that output Dataset subqueries always have this form (if they exist at all).

Yields
------
row : `PreFlightUnitsRow`
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly about the future, but I think we'll eventually want to make this functionality a public API, and that probably means making it into a class in order to make it easier for callers to construct consistent arguments to it.
That's because I think this is the natural place where we could accept custom SQL to drive preflight: while the query you're constructing is quite complex, the form of its outputs really aren't, and it's quite conceivable that someone who wanted to do something complex that's not supported by the query generator could write their own query that generates consistent outputs. We could then pass the outputs of that custom query through this function and use them in the rest of preflight.
Another option would be to start using custom queries after this function, i.e. make the alternate interface some other iterator that yields PreFlightUnitsRow. That would already be an easier-to-use interface, though it would be nice to provide a general way to filter out non-overlapping regions.

The `list` of `DatasetType`\ s whose instances should be included
in the graph and limit its extent.
The `list` of `DatasetTypes <DatasetType>` whose DataUnits will
be included in the returned column set. Output is limited by the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"limited to", as above.

@andy-slac andy-slac force-pushed the tickets/DM-15214 branch 2 times, most recently from c7ae02b to 3eccb8a Compare August 29, 2018 21:28
@andy-slac andy-slac merged commit 5f6f719 into master Aug 30, 2018
@timj timj deleted the tickets/DM-15214 branch April 22, 2020 22:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants