Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-25826: lsst.alert.packet reader should iterate over alerts #21

Merged
merged 16 commits into from Jul 10, 2020

Conversation

spenczar
Copy link
Contributor

@spenczar spenczar commented Jul 7, 2020

Ticket:

DM-25826

This PR changes the retrieve_alerts function to stream alerts out, rather than loading them all into memory.

This takes a bit of care since fastavro lazily loads the writer schema header of the alert file - it only loads it when the first record is read from the file.

The tests added here pass for me locally, but until #20 is fixed, you'd still see test failures. I have a local branch with both tickets/DM-25208 and tickets/DM-25826 merged together, and all tests pass there.

@spenczar spenczar requested a review from jdswinbank July 7, 2020 22:06
@spenczar spenczar marked this pull request as ready for review July 7, 2020 22:06
test/test_io.py Outdated Show resolved Hide resolved
test/test_io.py Outdated Show resolved Hide resolved
test/test_io.py Show resolved Hide resolved
test/test_io.py Outdated Show resolved Hide resolved
python/lsst/alert/packet/schema.py Outdated Show resolved Hide resolved
Numpy style is to have a single newline after the commentary. Do that.
spenczar added a commit to lsst/ap_association that referenced this pull request Jul 9, 2020
In lsst/alert_packet#21, we would like to
change lsst.alert.packet's behavior to stream alerts out from disk
rather than loading them entirely into memory, since alert files might
be very large. Technically, this breaks no APIs, since the documented
return value is an "iterable."

Making that change requires this one: change the test code so that it
loads data into a list explicitly.
While the mock schema was simple, it didn't really check all the
corners of our code. In particular, it didn't test the behavior of the
SCHEMA_DEFS cache which is internal to fastavro. That cache gets used
when fastavro reads a complex record which uses named types, like
lsst.diaSource, by reference. We need to be extra-careful in how we
handle modifications to that cache while iterating over records, and
modifications are plausible because we modify it when we instantiate a
lsst.alert.packet.Schema wrapper around the
fastavro.reader.writer_schema in retrieve_alerts - that instantiation
causes the SCHEMA_DEFS cache to be cleared.

This commit just adds a test case that reveals the above problem, it
doesn't fix it.
The way to hold multiple versions of an Avro schema at once is through
namespaces since fastavro uses a global map, SCHEMA_DEFS, to keep
track of types it has already seen and parsed. Previous versions of
lsst.alert.packet reused names and namespaces, which made it very
difficult to interoperate with fastavro.

This is the first commit in a small series. It deletes all the old
schemas and renames v3.0's schemas, and changes their internal
references.

We could rename v3.0 to v1.0.0, but that's more intrusive without
clear benefit. We could remove all of the directory structure (since
version information is right in the filename, so they won't collide),
but that would be more intrusive too.

Tests don't pass with this commit.
Now that Avro types have unique namespaces, even across versions, the
SCHEMA_DEFS fastavro registry should operate as expected. We don't
need to explicitly clear it. Clearing it caused bugs if a user loaded
a schema while reading Avro data.
@@ -278,7 +268,7 @@ def __eq__(self, other):
return self.definition == other.definition

@classmethod
def from_file(cls, filename=None, root_name="lsst.alert"):
def from_file(cls, filename=None, root_name="lsst.v3_0.alert"):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little evil, and hardcodes a v3.0 expectation. That's okay while there's only one version available, though, I think.

@ebellm ebellm merged commit e2bc8cb into master Jul 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants