DM-49670: Add option to use a service for Butler database writes #330
@@ -0,0 +1,93 @@

```python
# This file is part of prompt_processing.
#
# Developed for the LSST Data Management System.
# This product includes software developed by the LSST Project
# (https://www.lsst.org).
# See the COPYRIGHT file at the top-level directory of this distribution
# for details of code ownership.
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.

from __future__ import annotations
__all__ = ("KafkaButlerWriter",)

from datetime import date
from typing import Literal
from uuid import uuid4

from confluent_kafka import Producer
import pydantic

from lsst.daf.butler import (
    Butler,
    DatasetRef,
    SerializedDimensionRecord,
    SerializedFileDataset,
)
from lsst.resources import ResourcePath

from .middleware_interface import ButlerWriter, GroupedDimensionRecords


class KafkaButlerWriter(ButlerWriter):
    def __init__(self, producer: Producer, *, output_topic: str, file_output_path: str) -> None:
        self._producer = producer
        self._output_topic = output_topic
        self._file_output_path = ResourcePath(file_output_path, forceDirectory=True)

    def transfer_outputs(
        self, local_butler: Butler, dimension_records: GroupedDimensionRecords, datasets: list[DatasetRef]
    ) -> list[DatasetRef]:
        # Create a subdirectory in the output root distinct to this processing
        # run.
        date_string = date.today().strftime("%Y-%m-%d")
        subdirectory = f"{date_string}/{uuid4()}/"
        output_directory = self._file_output_path.join(subdirectory, forceDirectory=True)

        # There is no such thing as a directory in S3, but the Butler complains
        # if there is not an object at the prefix of the export path.
        output_directory.mkdir()

        # Copy files to the output directory, and retrieve metadata required to
        # ingest them into the central Butler.
        file_datasets = local_butler._datastore.export(datasets, directory=output_directory, transfer="copy")
```
**Reviewer:** The use of `_datastore.export` …

**Author:** Yes -- in the upcoming PR where the outputs get written in-place instead of in a separate tree, this will be changed to use a different public function. (…)
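For context, a minimal sketch of what a public-API alternative could look like, assuming the long-standing `Butler.export` context manager; the thread does not name the replacement function, so this is an illustration rather than the author's actual plan:

```python
# Sketch only: assumes local_butler is a direct Butler and that
# Butler.export(directory=..., transfer=...) is available here.
# Unlike Datastore.export, this also writes an export manifest and
# does not return the FileDataset records that transfer_outputs
# needs, so it is not a drop-in replacement.
with local_butler.export(directory=output_directory, transfer="copy") as export:
    export.saveDatasets(datasets)
```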
```python
        # Serialize Butler data as a JSON string.
        event = PromptProcessingOutputEvent(
            type="pp-output",
            dimension_records=_serialize_dimension_records(dimension_records),
            datasets=[dataset.to_simple() for dataset in file_datasets],
            root_directory=subdirectory,
        )
        message = event.model_dump_json()

        self._producer.produce(self._output_topic, message)
        self._producer.flush()

        return datasets


class PromptProcessingOutputEvent(pydantic.BaseModel):
    type: Literal["pp-output"]
```
**Reviewer:** I'm not familiar with pydantic, but I find this construct particularly baroque. You have to declare a field as having only one value, and then initialize it to that value anyway? More generally, the responsibility for the serialized form (e.g., representing all collections as lists) seems to be split between this class, …

**Author:** Yeah, it's an annoying quirk of Pydantic. You can say … You're not really supposed to have methods with behavior on Pydantic models -- it's more of a schema definition than an actual class. I could add a separate helper function to do the serialization, but the main point of …

**Reviewer:** Agreed that having it all be in the same module means it's not too big a problem, I was more worried about the ease of modifying the code when it has these redundant-but-matching lines. (I'll point out that your argument assumes you have to use Pydantic for conversion to JSON. Personally, this kind of awkwardness is exactly why I don't like it.)
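For reference, the quirk under discussion: Pydantic lets a `Literal` field carry its single value as a default, which removes the redundant initialization the reviewer objects to while keeping the fixed discriminator in the schema. A minimal standalone sketch, not code from this PR:

```python
from typing import Literal

import pydantic


class Event(pydantic.BaseModel):
    # A single-valued field with a default: callers no longer repeat
    # the only legal value, but it still serializes as a fixed
    # discriminator and is validated on input.
    type: Literal["pp-output"] = "pp-output"


event = Event()  # no need to pass type="pp-output"
assert event.model_dump() == {"type": "pp-output"}
round_trip = Event.model_validate_json(event.model_dump_json())
assert round_trip.type == "pp-output"
```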
```python
# PromptProcessingOutputEvent (continued)
    root_directory: str
    dimension_records: list[SerializedDimensionRecord]
    datasets: list[SerializedFileDataset]


def _serialize_dimension_records(grouped_records: GroupedDimensionRecords) -> list[SerializedDimensionRecord]:
    output = []
    for records in grouped_records.values():
        for item in records:
            output.append(item.to_simple())
    return output
```
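Finally, a hedged sketch of how this writer might be wired up and consumed; the broker address, topic, bucket path, and import path are hypothetical, and the `Producer` configuration uses standard librdkafka keys:

```python
from confluent_kafka import Producer

# Hypothetical import path; KafkaButlerWriter and PromptProcessingOutputEvent
# are the classes defined in this file.
from activator.kafka_butler_writer import KafkaButlerWriter, PromptProcessingOutputEvent

# Hypothetical broker, topic, and output root.
producer = Producer({"bootstrap.servers": "kafka.example.org:9092"})
writer = KafkaButlerWriter(
    producer,
    output_topic="prompt-processing-output",
    file_output_path="s3://example-bucket/pp-output/",
)

# In the activator, after local processing completes:
#     transferred = writer.transfer_outputs(local_butler, dimension_records, datasets)
#
# On the consuming service, each Kafka message decodes back into the
# Pydantic model, giving typed access to the serialized Butler data:
#     event = PromptProcessingOutputEvent.model_validate_json(message.value())
```

Publishing one self-describing event per processing run keeps the consumer stateless: everything it needs to ingest the outputs into the central Butler (dimension records, dataset metadata, and the relative root directory) travels in the message itself.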