Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Schedule] Synchronize labels on upgrade #6389

Merged
merged 3 commits into from
Sep 23, 2024
Merged

Conversation

rokatyy
Copy link
Member

@rokatyy rokatyy commented Sep 23, 2024

After implementing ML-7349, which aligns the labels in schedules and schedule objects, we need to account for objects created prior to this change. To address this, a new data migration (version 8) has been added. This migration retrieves all schedules from the database and aligns their labels accordingly.

Jira - https://iguazio.atlassian.net/browse/ML-7914

@liranbg liranbg requested a review from alonmr September 23, 2024 13:48
Copy link
Member

@quaark quaark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Got a couple of comments on the alignment implementation

Comment on lines 906 to 931
def _align_schedule_labels(db, db_session):
db_schedules: list[mlrun.common.schemas.ScheduleRecord] = db.list_schedules(
session=db_session
)
schedules_update = []
for db_schedule in db_schedules:
# convert list[LabelRecord] to dict
db_schedule_labels = {label.name: label.value for label in db_schedule.labels}
# merging labels
merged_labels = server.api.utils.scheduler.Scheduler()._merge_schedule_and_schedule_object_labels(
labels=db_schedule_labels,
scheduled_object=db_schedule.scheduled_object,
)

# get a Schedule object (not a ScheduleRecord) and update it
schedule = db._get_schedule_record(
db_session, db_schedule.project, db_schedule.name
)
db._update_schedule_body(
schedule=schedule,
scheduled_object=db_schedule.scheduled_object,
labels=merged_labels,
)
schedules_update.append(schedule)

db._upsert(db_session, schedules_update)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This method is very expensive. It is going to preform O(n) queries for each schedule in the DB. I suggest adding a as_records: bool = False, param to list_schedules (similar to list_artifacts). You can still extract the schedule struct with _transform_schedule_record_to_scheme but then you don't need to re-query for each record after getting them all.
  2. This method is using loads of the db's private methods, I don't think we should make them public, but maybe this method should be part of the sqldb interface?

Copy link
Member

@alonmr alonmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very well done! A few improvement suggestions

# convert list[LabelRecord] to dict
db_schedule_labels = {label.name: label.value for label in db_schedule.labels}
# merging labels
merged_labels = server.api.utils.scheduler.Scheduler()._merge_schedule_and_schedule_object_labels(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps use ensure_scheduler and get_scheduler just for consistency

Comment on lines 907 to 911
db_schedules: list[mlrun.common.schemas.ScheduleRecord] = db.list_schedules(
session=db_session
)
schedules_update = []
for db_schedule in db_schedules:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking that we need to optimize this for scale systems. I would take the following steps:

  1. Add as_records flag similar to list_artifacts.
  2. Use sqlalchemy lazy load - return the query and then for loop on the query instead of using query.all() and loading all schedules in memory.
  3. Transform the DB schedule to schema schedule then you have both and you don't need to get the DB one (saves us O(n) queries to the DB/cache).
Suggested change
db_schedules: list[mlrun.common.schemas.ScheduleRecord] = db.list_schedules(
session=db_session
)
schedules_update = []
for db_schedule in db_schedules:
schedules_update = []
for db_schedule in db.list_schedules(session=db_session, as_records=True):
schedule_record = db._transform_schedule_record_to_scheme(db_schedule)

@@ -228,6 +237,39 @@ def test_create_project_summaries():
assert migrated_project_summary.name == project.metadata.name


def test_align_schedule_labels():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add some more schedules with edge cases. we need to make sure this will not break in field

@rokatyy
Copy link
Member Author

rokatyy commented Sep 23, 2024

@quaark @alonmr Great minds think alike :)

Thanks for suggestions! Have fixed them + also moved code from scheduler to helpers to enable access from both scheduler and sql avoiding circular imports.

Copy link
Member

@alonmr alonmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very well done! Minor suggestion

server/api/utils/helpers.py Show resolved Hide resolved
@liranbg liranbg merged commit 74af972 into mlrun:development Sep 23, 2024
11 checks passed
rokatyy added a commit to rokatyy/mlrun that referenced this pull request Oct 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants