DM-25826: lsst.alert.packet reader should iterate over alerts #21
Merged
jdswinbank approved these changes on Jul 7, 2020
Alert files can be quite large, even terabytes. That's too big to hold in memory so we need to be able to emit them message-by-message.
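A minimal sketch of the message-by-message idea. JSON lines stand in for Avro here so the example needs no third-party packages; `iter_records` and the sample data are hypothetical and are not part of lsst.alert.packet.

```python
import io
import json


def iter_records(fp):
    """Yield one decoded record at a time from a newline-delimited stream.

    JSON lines stand in for Avro here; only the streaming shape matters.
    """
    for line in fp:
        if line.strip():
            yield json.loads(line)


# Hypothetical data: three tiny "alerts".
stream = io.StringIO('{"alertId": 1}\n{"alertId": 2}\n{"alertId": 3}\n')
alerts = iter_records(stream)
first = next(alerts)  # only the first line has been consumed so far
remaining = [a["alertId"] for a in alerts]
```

Because the generator holds a single record at a time, memory use stays roughly constant no matter how large the input file is.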
spenczar force-pushed the tickets/DM-25826 branch from 3beac6c to dc6f608 on July 7, 2020 23:25
Numpy style is to have a single newline after the commentary. Do that.
spenczar added a commit to lsst/ap_association that referenced this pull request on Jul 9, 2020
In lsst/alert_packet#21, we would like to change lsst.alert.packet's behavior to stream alerts out from disk rather than loading them entirely into memory, since alert files might be very large. Technically, this breaks no APIs, since the documented return value is an "iterable." Making that change requires this one: change the test code so that it loads data into a list explicitly.
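To illustrate why the test code has to change, here is a small sketch. `stream_alerts` is a hypothetical stand-in for a reader that yields alerts lazily; it is not the real function.

```python
def stream_alerts():
    """Hypothetical stand-in for a reader that yields alerts lazily."""
    for alert_id in range(3):
        yield {"alertId": alert_id}


alerts = stream_alerts()
# A generator is a valid "iterable", but it has no len() and cannot be indexed.
try:
    len(alerts)
    has_len = True
except TypeError:
    has_len = False

# Test code that needs a length or random access materializes it explicitly.
alert_list = list(stream_alerts())
```

Callers that only loop over the result keep working unchanged; only code that assumed a list needs the explicit `list(...)`.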
While the mock schema was simple, it didn't really check all the corners of our code. In particular, it didn't test the behavior of the SCHEMA_DEFS cache, which is internal to fastavro. That cache gets used when fastavro reads a complex record which uses named types, like lsst.diaSource, by reference. We need to be extra careful in how we handle modifications to that cache while iterating over records. Such modifications are plausible: instantiating a lsst.alert.packet.Schema wrapper around the fastavro.reader.writer_schema in retrieve_alerts clears the SCHEMA_DEFS cache. This commit only adds a test case that reveals the problem; it doesn't fix it.
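The failure mode can be sketched with a toy model. The `NAMED_TYPES` dict, `register`, and `iter_decoded` below are hypothetical and only loosely mimic a global named-type cache; this is not fastavro's code.

```python
# Toy model only: a module-level registry of named types, loosely analogous
# to the SCHEMA_DEFS cache described above. This is not fastavro's code.
NAMED_TYPES = {}


def register(name, fields):
    NAMED_TYPES[name] = fields


def iter_decoded(raw_records, type_name):
    """Lazily decode records, resolving the named type on every record."""
    for raw in raw_records:
        fields = NAMED_TYPES[type_name]  # looked up at read time, by reference
        yield dict(zip(fields, raw))


register("lsst.diaSource", ["diaSourceId", "ra"])
records = iter_decoded([(1, 10.0), (2, 20.0)], "lsst.diaSource")
first = next(records)

NAMED_TYPES.clear()  # e.g. a side effect of constructing a schema wrapper
try:
    next(records)
    failed = False
except KeyError:
    failed = True
```

When all records are read eagerly, the clear happens after decoding and is harmless; with lazy iteration it can land between two records, which is the situation the new test case is meant to expose.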
The way to hold multiple versions of an Avro schema at once is through namespaces since fastavro uses a global map, SCHEMA_DEFS, to keep track of types it has already seen and parsed. Previous versions of lsst.alert.packet reused names and namespaces, which made it very difficult to interoperate with fastavro. This is the first commit in a small series. It deletes all the old schemas and renames v3.0's schemas, and changes their internal references. We could rename v3.0 to v1.0.0, but that's more intrusive without clear benefit. We could remove all of the directory structure (since version information is right in the filename, so they won't collide), but that would be more intrusive too. Tests don't pass with this commit.
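A toy sketch of why unique namespaces matter when types live in one process-wide map. `REGISTRY`, `register`, and the `lsst.v2_0` namespace are hypothetical illustrations, not fastavro's implementation or an actual schema in this repository.

```python
# Toy model only: a process-wide map from fully qualified type name to
# definition. It is not fastavro's implementation.
REGISTRY = {}


def register(namespace, name, definition):
    """Store a definition under its fully qualified name; return that name."""
    full_name = f"{namespace}.{name}"
    REGISTRY[full_name] = definition
    return full_name


# Reusing one namespace across versions: the second registration wins.
register("lsst", "alert", {"version": "2.0"})
register("lsst", "alert", {"version": "3.0"})
shared = REGISTRY["lsst.alert"]["version"]

# Version-specific namespaces: both definitions coexist.
REGISTRY.clear()
register("lsst.v2_0", "alert", {"version": "2.0"})  # hypothetical namespace
register("lsst.v3_0", "alert", {"version": "3.0"})
coexisting = sorted(REGISTRY)
```

With a shared name, whichever version was registered last silently replaces the other; with versioned namespaces, the map can hold both at once.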
Now that Avro types have unique namespaces, even across versions, the SCHEMA_DEFS fastavro registry should operate as expected. We don't need to explicitly clear it. Clearing it caused bugs if a user loaded a schema while reading Avro data.
spenczar commented on Jul 10, 2020
@@ -278,7 +268,7 @@ def __eq__(self, other):
         return self.definition == other.definition

     @classmethod
-    def from_file(cls, filename=None, root_name="lsst.alert"):
+    def from_file(cls, filename=None, root_name="lsst.v3_0.alert"):
This is a little evil, and hardcodes a v3.0 expectation. That's okay while there's only one version available, though, I think.
Ticket: DM-25826
This PR changes the retrieve_alerts function to stream alerts out, rather than loading them all into memory.
This takes a bit of care since fastavro lazily loads the writer schema header of the alert file - it only loads it when the first record is read from the file.
The tests added here pass for me locally, but until #20 is fixed, you'd still see test failures. I have a local branch with both tickets/DM-25208 and tickets/DM-25826 merged together, and all tests pass there.
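One possible way to return a schema alongside a still-lazy record stream, given a reader whose schema only becomes available after the first read. This is a sketch under assumptions, not necessarily what this PR does: `LazyReader` and `retrieve` are hypothetical stand-ins and do not use fastavro.

```python
import itertools


class LazyReader:
    """Hypothetical stand-in: the schema is unknown until the first read."""

    def __init__(self, schema, records):
        self._schema = schema
        self._records = iter(records)
        self.writer_schema = None

    def __iter__(self):
        return self

    def __next__(self):
        record = next(self._records)
        self.writer_schema = self._schema  # "header" becomes available here
        return record


def retrieve(reader):
    """Return (schema, iterator) without reading the whole file."""
    try:
        first = next(reader)  # forces the header to load
    except StopIteration:
        return reader.writer_schema, iter(())
    # Put the first record back in front of the still-lazy remainder.
    return reader.writer_schema, itertools.chain([first], reader)


schema, alerts = retrieve(
    LazyReader({"name": "alert"}, [{"alertId": 1}, {"alertId": 2}])
)
ids = [a["alertId"] for a in alerts]
```

Only one record is read up front, so the caller still gets a streaming iterator while the schema is available immediately.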