DM-25826: lsst.alert.packet reader should iterate over alerts #21

spenczar · 2020-07-07T22:06:32Z

Ticket:

This PR changes the retrieve_alerts function to stream alerts out, rather than loading them all into memory.

This takes a bit of care since fastavro lazily loads the writer schema header of the alert file - it only loads it when the first record is read from the file.
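One way to handle a lazily loaded schema while still streaming is to pull the first record eagerly and then chain it back onto the rest of the stream. The sketch below is only an illustration of that pattern: `LazyReader` is a hypothetical stand-in for a reader whose `writer_schema` is populated on the first read (it is not fastavro's API), and `retrieve_alerts_sketch` is not necessarily how this PR implements `retrieve_alerts`.

```python
import itertools


class LazyReader:
    """Hypothetical stand-in: the schema is unknown until a record is read."""

    def __init__(self, schema, records):
        self._schema = schema
        self._records = iter(records)
        self.writer_schema = None  # populated lazily, on the first read

    def __iter__(self):
        return self

    def __next__(self):
        record = next(self._records)
        self.writer_schema = self._schema  # the "header" is loaded here
        return record


def retrieve_alerts_sketch(reader):
    """Return (schema, iterator of records) without buffering every record."""
    try:
        first = next(reader)  # forces the lazy schema load
    except StopIteration:
        # Empty input: no record was read, so the schema was never loaded.
        return reader.writer_schema, iter(())
    # Put the first record back in front of the remaining stream.
    return reader.writer_schema, itertools.chain([first], reader)
```

Only one record is held in memory ahead of the consumer, so the memory cost stays constant regardless of file size.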

The tests added here pass for me locally, but until #20 is fixed, you'd still see test failures. I have a local branch with both tickets/DM-25208 and tickets/DM-25826 merged together, and all tests pass there.

test/test_io.py

python/lsst/alert/packet/schema.py

Alert files can be quite large, even terabytes. That's too big to hold in memory, so we need to be able to emit them message-by-message.

NumPy docstring style is to have a single newline after the commentary. Do that.

In lsst/alert_packet#21, we would like to change lsst.alert.packet's behavior to stream alerts out from disk rather than loading them entirely into memory, since alert files might be very large. Technically, this breaks no APIs, since the documented return value is an "iterable." Making that change requires this one: change the test code so that it loads data into a list explicitly.
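The reason downstream tests need an explicit `list(...)` can be shown with plain Python. In this minimal sketch, `stream_alerts` is a hypothetical generator standing in for a streaming reader; it is not the package's API.

```python
def stream_alerts(n):
    """Hypothetical streaming source: yields alert-like dicts one at a time."""
    for alert_id in range(n):
        yield {"alertId": alert_id}


alerts = stream_alerts(3)

# A generator is iterable, but it has no len() and cannot be indexed.
try:
    len(alerts)
    has_len = True
except TypeError:
    has_len = False

# Materialize explicitly when a test needs random access or repeated passes.
alert_list = list(alerts)

# The generator is now exhausted; a second pass yields nothing.
leftover = list(alerts)
```

Code that only ever looped over the return value keeps working unchanged; code that relied on list behavior has to opt in to holding everything in memory.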

While the mock schema was simple, it didn't really check all the corners of our code. In particular, it didn't test the behavior of the SCHEMA_DEFS cache, which is internal to fastavro. That cache is used when fastavro reads a complex record that uses named types, like lsst.diaSource, by reference. We need to be extra careful about modifications to that cache while iterating over records. Such modifications are plausible: instantiating an lsst.alert.packet.Schema wrapper around the fastavro.reader.writer_schema in retrieve_alerts clears the SCHEMA_DEFS cache. This commit only adds a test case that reveals the problem; it doesn't fix it.
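The failure mode described in this commit message can be reproduced with a toy model. The sketch below assumes nothing about fastavro's internals: `NAMED_TYPES`, `register_schema`, and `read_records` are hypothetical names that only mimic a global registry being cleared while a lazy reader still depends on it.

```python
# Toy model only: a module-level registry of named types, loosely analogous
# to a global schema cache. This is NOT fastavro's implementation.
NAMED_TYPES = {}


def register_schema(types):
    """Simulate loading a schema: clear the registry, then add named types."""
    NAMED_TYPES.clear()
    NAMED_TYPES.update(types)


def read_records(raw_records):
    """Lazily decode records, resolving the named type at read time."""
    for type_name, payload in raw_records:
        decoder = NAMED_TYPES[type_name]  # looked up lazily, per record
        yield decoder(payload)


register_schema({"lsst.diaSource": int})
records = read_records([("lsst.diaSource", "1"), ("lsst.diaSource", "2")])

first = next(records)  # works: the named type is registered

# Something else loads a different schema while iteration is in progress.
register_schema({"other.type": str})

try:
    next(records)
    failed = False
except KeyError:
    failed = True  # the reference to lsst.diaSource can no longer be resolved
```

With eager loading the problem stays hidden, because every record is decoded before anything else can touch the registry. Streaming exposes it.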

The way to hold multiple versions of an Avro schema at once is through namespaces, since fastavro uses a global map, SCHEMA_DEFS, to keep track of types it has already seen and parsed. Previous versions of lsst.alert.packet reused names and namespaces, which made it very difficult to interoperate with fastavro. This is the first commit in a small series. It deletes all the old schemas, renames v3.0's schemas, and changes their internal references. We could rename v3.0 to v1.0.0, but that's more intrusive without clear benefit. We could remove all of the directory structure (version information is right in the filename, so the files won't collide), but that would be more intrusive too. Tests don't pass with this commit.
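To see why versioned namespaces help, consider a toy map keyed by Avro full name (namespace plus name). This is a sketch for illustration only: `register_all` is a hypothetical helper, not fastavro's SCHEMA_DEFS, and `lsst.v3_1` is an invented later version used purely for contrast with the `lsst.v3_0` namespace from this PR.

```python
def full_name(schema):
    """Avro full name: namespace and name joined by a dot."""
    return f"{schema['namespace']}.{schema['name']}"


def register_all(schemas):
    """Toy global map keyed by full name (illustration only)."""
    registry = {}
    for schema in schemas:
        registry[full_name(schema)] = schema
    return registry


# Reused namespace: the two versions collide on "lsst.alert".
old_style = [
    {"namespace": "lsst", "name": "alert", "doc": "version A"},
    {"namespace": "lsst", "name": "alert", "doc": "version B"},
]

# Versioned namespaces: each version gets its own key.
# "lsst.v3_0" matches the PR; "lsst.v3_1" is a hypothetical later version.
versioned = [
    {"namespace": "lsst.v3_0", "name": "alert", "doc": "version A"},
    {"namespace": "lsst.v3_1", "name": "alert", "doc": "version B"},
]

collided = register_all(old_style)
coexisting = register_all(versioned)
```

In the reused-namespace case the second definition silently replaces the first, which is the kind of collision that previously had to be worked around by clearing the map.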

Now that Avro types have unique namespaces, even across versions, the SCHEMA_DEFS fastavro registry should operate as expected. We don't need to explicitly clear it. Clearing it caused bugs if a user loaded a schema while reading Avro data.

spenczar · 2020-07-10T00:09:16Z

python/lsst/alert/packet/schema.py

@@ -278,7 +268,7 @@ def __eq__(self, other):
        return self.definition == other.definition

    @classmethod
-    def from_file(cls, filename=None, root_name="lsst.alert"):
+    def from_file(cls, filename=None, root_name="lsst.v3_0.alert"):


This is a little evil, and hardcodes a v3.0 expectation. That's okay while there's only one version available, though, I think.

spenczar requested a review from jdswinbank July 7, 2020 22:06

spenczar marked this pull request as ready for review July 7, 2020 22:06

jdswinbank approved these changes Jul 7, 2020

spenczar added 5 commits July 7, 2020 16:17

Return an iterable instead of a list of records

0a588f5

Stream out alerts in an iterator instead of a list

648e44a

Do not cast alert iterable records to a list, even from inside schema

04f3178

Obey docstring style guidelines

207f135

Call the io retrieve_alerts rather than its Schema method shadow

dc6f608

spenczar force-pushed the tickets/DM-25826 branch from 3beac6c to dc6f608 Compare July 7, 2020 23:25

More fiddling with docstring style

6a16051

spenczar mentioned this pull request Jul 9, 2020

DM-25826: Explicitly cast alert data to a list in tests lsst/ap_association#93

Merged

spenczar added 8 commits July 9, 2020 09:54

Fix a spelling mistake in a comment

69e0f1c

Manually compare alert lists for test equality

ac75e83

Document why we need a custom test of alert lists

9693d14

Do not mess with SCHEMA_DEFS

18f4d27

hack: read the v3_0 alert file from disk

fa4879f

Update validateAvroRoundTrip.py to match new file location

189d780

spenczar commented Jul 10, 2020

ebellm and others added 2 commits July 10, 2020 15:16

add get_path_to_latest_schema()

11dca77

Remove redundant whitespace

7bb6f1b

ebellm merged commit e2bc8cb into master Jul 10, 2020
