[Project] Retain producers of exported artifacts #5283

Merged
merged 8 commits into from
Mar 19, 2024
3 changes: 2 additions & 1 deletion mlrun/artifacts/base.py
@@ -88,9 +88,10 @@ class ArtifactSpec(ModelObj):
"db_key",
"extra_data",
"unpackaging_instructions",
"producer",
Member

Why was this change needed? This means the producer will now be in the result of base_dict() for artifacts - is this intended?

Member Author

Setting an artifact on the project uses base_dict(), so the producer property was not exported because it was part of _extra_fields.
To retain producers when loading a project, the producer must actually be present in the exported spec.
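
To make the distinction concrete, here is a minimal sketch of how a field listed only among the "extra" fields can drop out of a base export. It is not mlrun's actual `ModelObj` code: the class `ToySpec`, its `to_dict()` method, and the field lists are hypothetical stand-ins chosen for illustration.

```python
class ToySpec:
    """Hypothetical stand-in for a spec object; not the real mlrun ArtifactSpec."""

    # fields included in every export (assumed names, for illustration only)
    _dict_fields = ["key", "db_key"]
    # fields included only in the full export
    _extra_fields = ["producer", "license"]

    def __init__(self, key, db_key=None, producer=None, license=None):
        self.key = key
        self.db_key = db_key
        self.producer = producer
        self.license = license

    def base_dict(self):
        # export only the base fields, skipping unset values
        return {
            name: getattr(self, name)
            for name in self._dict_fields
            if getattr(self, name) is not None
        }

    def to_dict(self):
        # export the base fields plus the extra fields
        fields = self._dict_fields + self._extra_fields
        return {
            name: getattr(self, name)
            for name in fields
            if getattr(self, name) is not None
        }


spec = ToySpec(key="x", producer={"kind": "run", "name": "my-run"})
base_export = spec.base_dict()  # has no "producer" entry
full_export = spec.to_dict()    # includes "producer"
```

In this toy, moving `"producer"` from `_extra_fields` to `_dict_fields` would make it show up in `base_export`, which is the effect the change to `ArtifactSpec` is after.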

Member

So this means that this fix will not work for any project exported prior to this PR, right?
Also, we need to make sure there are no backward-compatibility (BC) issues there (I don't think there are, but it needs to be verified).

Member Author

It obviously won't fix projects that were exported before this fix, so I'm not sure what BC issues we should check here.

Member

It's not obvious that projects exported before the fix won't be fixed. That is the case only because the older export doesn't contain the information needed for the fix.
As for BC: now that the producer field is part of the base export fields, we just need to make sure that artifacts exported before this change won't fail to import because of it (again, I'm pretty sure they won't fail, it's just something to pay attention to).
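
The backward-compatibility concern can be sketched as follows, assuming an exported artifact spec is a plain dict that may or may not carry a `producer` entry. The helper `read_producer` and both sample dicts are hypothetical and are not part of mlrun.

```python
def read_producer(exported_spec: dict) -> dict:
    """Return the producer stored in an exported spec, or an empty dict.

    Hypothetical helper: exports written before the change have no
    "producer" key, so the lookup must tolerate its absence.
    """
    producer = exported_spec.get("producer")
    return dict(producer) if isinstance(producer, dict) else {}


# an export written before the change: no producer information
old_export = {"key": "x", "db_key": "x"}
# an export written after the change: the producer is part of the spec
new_export = {
    "key": "x",
    "db_key": "my-run_x",
    "producer": {"kind": "run", "name": "my-run", "tag": "some-tag"},
}

old_producer = read_producer(old_export)
new_producer = read_producer(new_export)
```

An importer written this way cannot recover the original producer from an old export, but it does not fail on one either, which is the property the reviewer asks to verify.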

]

- _extra_fields = ["annotations", "producer", "sources", "license", "encoding"]
+ _extra_fields = ["annotations", "sources", "license", "encoding"]
_exclude_fields_from_uid_hash = [
# if the artifact is first created, it will not have a db_key,
# exclude it so further updates of the artifacts will have the same hash
121 changes: 106 additions & 15 deletions mlrun/projects/project.py
@@ -1375,14 +1375,7 @@ def register_artifacts(self):
artifact_path = mlrun.utils.helpers.template_artifact_path(
self.spec.artifact_path or mlrun.mlconf.artifact_path, self.metadata.name
)
- # TODO: To correctly maintain the list of artifacts from an exported project,
- # we need to maintain the different trees that generated them
- producer = ArtifactProducer(
- "project",
- self.metadata.name,
- self.metadata.name,
- tag=self._get_hexsha() or str(uuid.uuid4()),
- )
+ project_tag = self._get_project_tag()
for artifact_dict in self.spec.artifacts:
if _is_imported_artifact(artifact_dict):
import_from = artifact_dict["import_from"]
@@ -1402,6 +1395,14 @@ def register_artifacts(self):
artifact.src_path = path.join(
self.spec.get_code_path(), artifact.src_path
)
producer = self._resolve_artifact_producer(artifact, project_tag)
if (
producer.name != self.metadata.name
and self._resolve_existing_artifact(
artifact,
)
):
continue
artifact_manager.log_artifact(
producer, artifact, artifact_path=artifact_path
)
@@ -1498,12 +1499,20 @@ def log_artifact(
artifact_path = mlrun.utils.helpers.template_artifact_path(
artifact_path, self.metadata.name
)
- producer = ArtifactProducer(
- "project",
- self.metadata.name,
- self.metadata.name,
- tag=self._get_hexsha() or str(uuid.uuid4()),
- )
+ producer = self._resolve_artifact_producer(item)
if producer.name != self.metadata.name:
# the artifact producer is retained, log it only if it doesn't already exist
if existing_artifact := self._resolve_existing_artifact(
Member

What if the artifact changed and we want to update it?

Member Author

That's a good question actually.
As per @theSaarco, we shouldn't re-register artifacts if we retain their producer, but indeed the spec etc. can change.

Member

Good question indeed.
Of course, this is only expected if we're re-importing on the same env, since it's a run-id that already exists in the system. Even then, I tend to think it would be strange for the same run-id to have generated a different version of the same artifact. It seems more likely that the artifact was modified after the run was done, in which case it would probably be better to keep the manual changes made to the artifact.
I guess if the user wants to "reset" the artifact, it will need to be deleted before the import.

Member

OK, that makes sense. I suggest adding a log message saying that the artifact is already registered, so we're skipping it.

item,
tag,
):
artifact_key = item if isinstance(item, str) else item.key
logger.info(
"Artifact already exists, skipping logging",
key=artifact_key,
tag=tag,
)
return existing_artifact
item = am.log_artifact(
producer,
item,
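
The behavior agreed on in the thread above (skip logging when the producer is retained and the artifact already exists, and emit a log message when doing so) can be sketched with an in-memory store. The function `log_if_new`, the `store` dict, and the dict-shaped producers are hypothetical simplifications, not mlrun's API.

```python
import logging

logger = logging.getLogger("producer-sketch")


def log_if_new(
    store: dict, project_name: str, producer: dict, key: str, artifact: dict
) -> dict:
    """Log an artifact unless its producer is retained and it already exists.

    Hypothetical helper: a producer counts as "retained" when its name
    differs from the project name, mirroring the condition in the diff.
    """
    retained = producer.get("name") != project_name
    if retained and key in store:
        logger.info("Artifact already exists, skipping logging: key=%s", key)
        return store[key]
    store[key] = artifact
    return artifact


# the store already holds a manually edited copy of artifact "x"
store = {"x": {"key": "x", "body": "edited by hand"}}
run_producer = {"kind": "run", "name": "my-run"}
project_producer = {"kind": "project", "name": "project-1"}

# retained run producer: the existing artifact is kept as-is
kept = log_if_new(
    store, "project-1", run_producer, "x", {"key": "x", "body": "from export"}
)
# project producer: the artifact is logged again and replaces the old one
replaced = log_if_new(
    store, "project-1", project_producer, "x", {"key": "x", "body": "from export"}
)
```

Under this sketch, resetting a run-produced artifact means deleting it from the store before importing, as suggested in the thread.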
@@ -3333,7 +3342,12 @@ def get_artifact(self, key, tag=None, iter=None, tree=None):
artifact = db.read_artifact(
key, tag, iter=iter, project=self.metadata.name, tree=tree
)
return dict_to_artifact(artifact)

# in tests, if an artifact is not found, the db returns None
# in real usage, the db should raise an exception
if artifact:
return dict_to_artifact(artifact)
return None

def list_artifacts(
self,
@@ -3761,6 +3775,83 @@ def _validate_file_path(self, file_path: str, param_name: str):
f"<project.spec.get_code_path()>/<{param_name}>)."
)

def _resolve_artifact_producer(
self,
artifact: typing.Union[str, Artifact],
project_producer_tag: str = None,
) -> typing.Optional[ArtifactProducer]:
"""
Resolve the artifact producer of the given artifact.
If the artifact's producer is a run, the artifact is registered with the original producer.
Otherwise, the artifact is registered with the current project as the producer.

:param artifact: The artifact to resolve its producer.
:param project_producer_tag: The tag to use for the project as the producer. If not provided, a tag will be
generated for the project.
:return: The resolved artifact producer.
"""

if not isinstance(artifact, str) and artifact.producer:
# if the artifact was imported from a yaml file, the producer can be a dict
if isinstance(artifact.spec.producer, ArtifactProducer):
producer_dict = artifact.spec.producer.get_meta()
else:
producer_dict = artifact.spec.producer

if producer_dict.get("kind", "") == "run":
return ArtifactProducer(
name=producer_dict.get("name", ""),
kind=producer_dict.get("kind", ""),
project=producer_dict.get("project", ""),
tag=producer_dict.get("tag", ""),
)

# do not retain the artifact's producer, replace it with the project as the producer
project_producer_tag = project_producer_tag or self._get_project_tag()
return ArtifactProducer(
kind="project",
name=self.metadata.name,
project=self.metadata.name,
tag=project_producer_tag,
)

def _resolve_existing_artifact(
self,
item: typing.Union[str, Artifact],
tag: str = None,
) -> typing.Optional[Artifact]:
"""
Check if there is an existing artifact with the given item and tag.
If there is, return the existing artifact. Otherwise, return None.

:param item: The item (or key) to check if there is an existing artifact for.
:param tag: The tag to check if there is an existing artifact for.
:return: The existing artifact if there is one, otherwise None.
"""
try:
if isinstance(item, str):
existing_artifact = self.get_artifact(key=item, tag=tag)
else:
existing_artifact = self.get_artifact(
key=item.key,
tag=item.tag,
iter=item.iter,
tree=item.tree,
)
if existing_artifact is not None:
return existing_artifact.from_dict(existing_artifact)
except mlrun.errors.MLRunNotFoundError:
logger.debug(
"No existing artifact was found",
key=item if isinstance(item, str) else item.key,
tag=tag if isinstance(item, str) else item.tag,
tree=None if isinstance(item, str) else item.tree,
)
return None

def _get_project_tag(self):
return self._get_hexsha() or str(uuid.uuid4())
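
The resolution rule in `_resolve_artifact_producer` (keep a `run` producer, otherwise substitute the project) can be shown in isolation. This is a simplified sketch under stated assumptions: the `Producer` dataclass and `resolve_producer` function are hypothetical stand-ins for `ArtifactProducer` and the method above, and the fallback tag is always a random UUID rather than a git commit hash.

```python
import uuid
from dataclasses import asdict, dataclass
from typing import Optional, Union


@dataclass
class Producer:
    """Hypothetical stand-in for ArtifactProducer."""

    kind: str
    name: str
    project: str
    tag: str


def resolve_producer(
    stored: Union[Producer, dict, None],
    project_name: str,
    project_tag: Optional[str] = None,
) -> Producer:
    """Keep a stored 'run' producer; otherwise use the project as producer."""
    if stored:
        # a producer loaded from a yaml file may arrive as a plain dict
        meta = asdict(stored) if isinstance(stored, Producer) else stored
        if meta.get("kind", "") == "run":
            return Producer(
                kind="run",
                name=meta.get("name", ""),
                project=meta.get("project", ""),
                tag=meta.get("tag", ""),
            )
    # no run producer to retain: the current project becomes the producer
    return Producer(
        kind="project",
        name=project_name,
        project=project_name,
        tag=project_tag or str(uuid.uuid4()),
    )


from_run = resolve_producer(
    {"kind": "run", "name": "my-run", "project": "project-1", "tag": "some-tag"},
    "project-3",
)
from_project = resolve_producer(
    {"kind": "project", "name": "project-1"}, "project-3", project_tag="abc123"
)
```

Here `from_run` keeps the original run's name and tag, while `from_project` is rewritten to name `project-3`, which matches what the two tests in this PR assert for run and project producers.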


def _set_as_current_default_project(project: MlrunProject):
mlrun.mlconf.default_project = project.metadata.name
89 changes: 89 additions & 0 deletions tests/projects/test_project.py
@@ -27,6 +27,7 @@
import pytest

import mlrun
import mlrun.artifacts
import mlrun.common.schemas
import mlrun.errors
import mlrun.projects.project
@@ -900,6 +901,94 @@ def test_import_artifact_using_relative_path():
assert artifact.spec.db_key == "y"


def test_import_artifact_retain_producer(rundb_mock):
base_path = tests.conftest.results
project_1 = mlrun.new_project(
name="project-1", context=f"{base_path}/project_1", save=False
)
project_2 = mlrun.new_project(
name="project-2", context=f"{base_path}/project_2", save=False
)

# create an artifact with a 'run' producer
artifact = mlrun.artifacts.Artifact(key="x", body="123", is_inline=True)
run_name = "my-run"
run_tag = "some-tag"

# we set the producer as dict so the export will work
artifact.producer = mlrun.artifacts.ArtifactProducer(
kind="run",
project=project_1.name,
name=run_name,
tag=run_tag,
).get_meta()

# export the artifact
artifact_path = f"{base_path}/my-artifact.yaml"
artifact.export(artifact_path)

# import the artifact to another project
new_key = "y"
imported_artifact = project_2.import_artifact(artifact_path, new_key)
assert imported_artifact.producer == artifact.producer

# set the artifact on the first project
project_1.set_artifact(artifact.key, artifact)
project_1.save()

# load a new project from the first project's context
project_3 = mlrun.load_project(name="project-3", context=project_1.context)

# make sure the artifact was registered with the original producer
# the db key should include the run since it's a run artifact
db_key = f"{run_name}_{new_key}"
loaded_artifact = project_3.get_artifact(db_key)
assert loaded_artifact.producer == artifact.producer


def test_replace_exported_artifact_producer(rundb_mock):
base_path = tests.conftest.results
project_1 = mlrun.new_project(
name="project-1", context=f"{base_path}/project_1", save=False
)
project_2 = mlrun.new_project(
name="project-2", context=f"{base_path}/project_2", save=False
)

# create an artifact with a 'project' producer
key = "x"
artifact = mlrun.artifacts.Artifact(key=key, body="123", is_inline=True)

# we set the producer as dict so the export will work
artifact.producer = mlrun.artifacts.ArtifactProducer(
kind="project",
project=project_1.name,
name=project_1.name,
).get_meta()

# export the artifact
artifact_path = f"{base_path}/my-artifact.yaml"
artifact.export(artifact_path)

# import the artifact to another project
new_key = "y"
imported_artifact = project_2.import_artifact(artifact_path, new_key)
assert imported_artifact.producer != artifact.producer
assert imported_artifact.producer["name"] == project_2.name

# set the artifact on the first project
project_1.set_artifact(artifact.key, artifact)
project_1.save()

# load a new project from the first project's context
project_3 = mlrun.load_project(name="project-3", context=project_1.context)

# make sure the artifact was registered with the new project producer
loaded_artifact = project_3.get_artifact(key)
assert loaded_artifact.producer != artifact.producer
assert loaded_artifact.producer["name"] == project_3.name


@pytest.mark.parametrize(
"relative_artifact_path,project_context,expected_path,expected_in_context",
[