New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-15214: extend pre-flight with output datasets information #78
Conversation
|
||
Returns | ||
------- | ||
`list` of `str`, names of input collections. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returns are meant to be written as:
Returns
-------
collections : `list` of `str`
Names of input collections
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have to give a name to returned item or can I just write
Returns
-------
`list` of `str`
Names of input collections
❔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://developer.lsst.io/python/numpydoc.html?highlight=numpydoc#py-docstring-returns suggests a dummy name is needed.
c06bb74
to
be016ee
Compare
dataId : `dict` | ||
Maps DataUnit link name to its corresponding value. | ||
datasetRefs : `dict` | ||
Maps `DatasetType` to its corresponding `DatasetRef`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the keys actually DatasetType
instances, or the str
names associated with them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are DatasetType
instances, this is probably not a hard requirement, can switch to names if that is more natural.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If DatasetType
instances are more convenient for you, that's fine. I hadn't realized they were usable as dict
keys, but it looks like they are.
The `list` of `DatasetTypes <DatasetType>` whose DataUnits will | ||
be included in the returned column set. Output is limited by the | ||
the Datasets of these DatasetTypes which already exist in the | ||
registry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"limited by" -> "limited to"
The `list` of `DatasetTypes <DatasetType>` whose DataUnits will | ||
be included in the returned column set. It is expected that | ||
Datasets for these DatasetTypes do not exist in the registry, | ||
but presently this is not checked. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think selectDataUnits
would be more generally useful outside of preflight (e.g. for selecting subsets for export/transfer) if we never actually check for futureDatasetTypes
not existing in the registry.
In fact, it might be most useful to have this take just take a list of tuples of DataUnits, rather than a list of DatasetTypes
. I think that's all preflight uses this for, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not checking futureDatasetTypes
can be added as an option indeed.
For preflight we still need to associate units with DatasetType names, and the query that we run also uses DatasetType names to find existing input Datasets, so we need some DatasetType information there, maybe just a name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like it would make the most sense to have another argument that just adds more explicit DataUnits to the query, while leaving this argument as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, and I'll leave that for the future tickets 🙂
rows that have disjoint regions. If result row contains more than two | ||
regions (this should not happen with our current schema) then row is | ||
filtered if any of two regions are disjoint. | ||
Oredering is based on dependencies, units with no dependencies |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: "Oredering"
Also, FYI, I've added a note to DM-15034 that it should obsolete this code by unifying the topological sorting done both here and in DataUnitRegistry. Right now I don't see any easy way to combine them.
dataUnit : `DataUnit` | ||
DataUnit to join with ``fromClause``. | ||
otherDataUnits : iterable of `DataUnit` | ||
DataUnits whose tableshave PKs for ``dataUnit`` table's FK. They all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: "tableshave"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just read this as "table shave" 😃
(here ``Dataset.DataId`` means ``Dataset.link1, Dataset.link2, etc.``) | ||
|
||
Filtering is complicated, it is simpler to use CTE but not all | ||
databases support CTEs so we will have to do with the repeating |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's "CTE" here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Common Table Expression - special SQL syntax. I'll update docstring.
---------- | ||
dsType : `DatasetType` | ||
collections : `PreFlightCollections` | ||
Object which provides names of the input/output collections. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find using "collections" for the name of this argument a bit confusing; that suggests that it would be the same list of strings that's passed to selectDataUnits
. I think the best solution is to also come up with a new name for the PreFlightCollections
class as well: how about calling the class something like DatasetOriginInfo
, and this argument originInfo
?
As I side note, I would eventually like to drop "Preflight" from the names of all of your classes in daf_butler, because I think they will eventually be much more generally useful. But I'm okay with doing that on another ticket, as I think there will also be some refactoring involved, unless you want to do it on this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 to DatasetOriginInfo and re-factoring on separate ticket.
ON Dataset.dataset_id = DatasetCollection.dataset_id | ||
WHERE Dataset.dataset_type_name = :dsType_name | ||
AND DatasetCollection.collection = :collection_name | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably good to add a reminder note here that output Dataset subqueries always have this form (if they exist at all).
Yields | ||
------ | ||
row : `PreFlightUnitsRow` | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly about the future, but I think we'll eventually want to make this functionality a public API, and that probably means making it into a class in order to make it easier for callers to construct consistent arguments to it.
That's because I think this is the natural place where we could accept custom SQL to drive preflight: while the query you're constructing is quite complex, the form of its outputs really aren't, and it's quite conceivable that someone who wanted to do something complex that's not supported by the query generator could write their own query that generates consistent outputs. We could then pass the outputs of that custom query through this function and use them in the rest of preflight.
Another option would be to start using custom queries after this function, i.e. make the alternate interface some other iterator that yields PreFlightUnitsRow
. That would already be an easier-to-use interface, though it would be nice to provide a general way to filter out non-overlapping regions.
The `list` of `DatasetType`\ s whose instances should be included | ||
in the graph and limit its extent. | ||
The `list` of `DatasetTypes <DatasetType>` whose DataUnits will | ||
be included in the returned column set. Output is limited by the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"limited to", as above.
c7ae02b
to
3eccb8a
Compare
3eccb8a
to
b9e7bf2
Compare
Pre-flight API has changed to support extra information about existing output (and input) datasets, it now returns pre-made
DatasetRefs
instead of raw columns (via new classPreFlightUnitsRow
).Implemented support for specifying multiple input collections and per-dataset collections overrides (new
PreFlightCollections
class).Pre-flight ginormous query had been extended to support priority-based search of input datasets in multiple collections (see docstring in sqlPreFlight).