Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/multi model artifact handler #869

Merged

Conversation

YashPandit4u
Copy link
Member

@YashPandit4u YashPandit4u commented May 27, 2024

Adding a new class to easily handle multi-model artifact creation in ADS.

Signed-off-by: Yash Pandit <yash.pandit@oracle.com>
Signed-off-by: Yash Pandit <yash.pandit@oracle.com>
Copy link

Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

  • PR author: YashPandit4u

To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.

When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.

If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Required At least one contributor does not have an approved Oracle Contributor Agreement. label May 27, 2024
darenr
darenr previously requested changes May 29, 2024
ads/model/model_description.py Outdated Show resolved Hide resolved
ads/model/model_description.py Outdated Show resolved Hide resolved
ads/model/model_description.py Outdated Show resolved Hide resolved
ads/model/model_description.py Outdated Show resolved Hide resolved
ads/model/model_description.py Outdated Show resolved Hide resolved
ads/model/model_description.py Outdated Show resolved Hide resolved
@oracle-contributor-agreement oracle-contributor-agreement bot added OCA Verified All contributors have signed the Oracle Contributor Agreement. and removed OCA Required At least one contributor does not have an approved Oracle Contributor Agreement. labels Jun 1, 2024
Copy link
Member

@VipulMascarenhas VipulMascarenhas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a few comments, I think we should be reusing most of the implementation from DataScienceModel class instead of recreating it in here. A high level design would look like:

class DataScienceModelCollection(DataScienceModel):

    def __init__():
        super().__init__(**kwargs)
        # no need to add model self.modeldescriptionjson since this attribute is already present in DataScienceModel
    
    def add():
        ...
        
    def remove():
        ...
        
    def show():
        ...     

The usage would look like:

    >>> model = (DataScienceModelCollection()
    ...    .with_compartment_id()
    ...    .with_project_id()
    ...    .with_display_name()
    ...    .with_description()
    ...    .with_freeform_tags()
    ...    .with_artifact("model1_path", "model2_path")

    >>> model.add(...)
    >>> model.remove(...)
    >>> model.create(model_by_reference=True)

On second thought, one other option is to have add(), remove() within DataScienceModel itself, then we don't need the collection class. The add/remove functions are only manipulating the model_file_description property that represents multiple models, so having a separate subclass may not be required.

any thoughts on this?

# from ads.common import logger
import logging

logger = logging.getLogger("ads.model_description")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we reuse the logger from ads.common?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I used logger from ads.common I don't see any info messages and for some methods in this class I want to show info messages.

# Remove if the model already exists
self.remove(namespace, bucket, prefix)

def checkIfFileExists(fileName):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we reuse is_path_exists from ads.common.utils instead?

return isExists

# Function to un-paginate the api call with while loop
def listObjectVersionsUnpaginated():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ads.common.object_storage_details.list_object_versions can be used instead?

ads/model/model_description.py Outdated Show resolved Hide resolved
ads/model/model_description.py Outdated Show resolved Hide resolved
logger.error(f"An unexpected error occurred: {e}")
raise e

def add(self, namespace, bucket, prefix=None, files=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recommend using type hinting for all the functions, for example:
def add(self, namespace: str, bucket: str, ... ): -> None

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added type hints.

ads/model/model_description.py Outdated Show resolved Hide resolved
display_name = (
"Created by MMS SDK on "
+ datetime.datetime.now(pytz.utc).strftime("%Y-%m-%d %H:%M:%S %Z")
if (display_name == None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use if display_name: instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggested statement will trigger even when we have blank string.

ads/model/model_description.py Outdated Show resolved Hide resolved
ads/model/model_description.py Outdated Show resolved Hide resolved
logger.error(f"An unexpected error occurred: {e}")
raise e

def add(self, namespace, bucket, prefix=None, files=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

ads/model/model_description.py Outdated Show resolved Hide resolved
ads/model/model_description.py Outdated Show resolved Hide resolved
ads/model/model_description.py Outdated Show resolved Hide resolved
ads/model/model_description.py Outdated Show resolved Hide resolved
Copy link
Member

@mrDzurb mrDzurb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have misunderstood what we're trying to achieve in this PR, but wouldn't a user experience like the following be more streamlined?

model_description = ModelDescription().add("oci://bucket_name@namespace/prefix1", "oci://bucket_name@namespace/person_prefix2")

Here, ModelDescription acts as a builder for the JSON configuration needed.
It still can contain all the introduced methods in this PR: add, remove, build, print...

I would prefer instead of .build to use .to_json(uri="path/to/the/json") where if user don't provide uri then JSON should be returned as a result of the function. Also adding .to_dict() might be useful as well.
In this case we don't need .show() method.

We could then enhance the DatascienceModel class with:

DatascienceModel().with_model_description(model_description).create()

This approach leverages the existing DatascienceModel class, which already supports CRUD operations for models, so we wouldn't need to replicate these in the ModelDescription class. The purpose of ModelDescription would be to simplify the preparation of the correct JSON configuration for users.

Perhaps we could consider a clearer name for clarity, as "ModelDescription" might be somewhat confusing. How about using "ModelArtifactConfigBuilder" to more accurately reflect its purpose as a tool for constructing model artifact configurations?

Then user experience would be

model_artifact_config = ModelArtifactConfigBuilder().add().add()

DatascienceModel().with_model_artifact(model_artifact_config).create()

cc: @mayoor @VipulMascarenhas

@YashPandit4u
Copy link
Member Author

YashPandit4u commented Jun 5, 2024

Hi @VipulMascarenhas ,

I have now extended from the class already present in ads and the usage of the utility looks like this now:

import sys
import ads.common
from ads.model.model_collection import DataScienceModelCollection
import ads
import json

ads.set_auth(auth="security_token")

mdes = DataScienceModelCollection()
.with_ref_model_id(model_ocid="ocid1.datasciencemodel.oc1.iad.amaaaaaav66vvnialtuoq5y4ltdz26bxhlow4uh5w544ullowe3jl4vwuayq")
.with_display_name("New Design 8")
.with_compartment_id("ocid1.tenancy.oc1..aaaaaaaahzy3x4boh7ipxyft2rowu2xeglvanlfewudbnueugsieyuojkldq")
.with_project_id("ocid1.datascienceproject.oc1.iad.amaaaaaav66vvniaqsyu2nufljutkn4rzth2nz4q3zqslirct7eayl5ojpma")

mdes.add(namespace="ociodscdev", bucket="unzip-multi-model", prefix="model-linear-1")
mdes.remove(namespace="ociodscdev", bucket="unzip-multi-model", prefix="model-linear-1")

mdes.create()

sys.exit(0)

Please take a look when free and give your comments.

Thanks and Regards
Yash Pandit

ads/model/model_collection.py Outdated Show resolved Hide resolved
from typing import List, Optional
import logging

logger = logging.getLogger("ads.model_description")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this be better - logger = logging.getLogger(__name__)?

# Remove if the model already exists
self.remove(namespace, bucket, prefix)

def check_if_file_exists(fileName):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please double check if the existing utils methods are already implemented such functionality.

Copy link
Member

@VipulMascarenhas VipulMascarenhas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YashPandit4u suggested a few changes, just needs to be re-organized to make it consistent with ads usage.

ads/model/model_collection.py Outdated Show resolved Hide resolved
ads/model/model_collection.py Outdated Show resolved Hide resolved
ads/model/model_collection.py Outdated Show resolved Hide resolved
raise e
return self

def add(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's rename this to add_artifact(uri:str, files=Optional[List[str]]) which takes in uri and files list as input to be consistent with how paths are referred to within ads.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user won't have uri handy, hence we have parameters like namespace, and bucket to make it easy.
Will it be possible to keep these parameters? The function signature would look like:

def add_artifact( self, namespace: str, bucket: str, prefix: Optional[str] = None, files: Optional[List[str]] = None, )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't the uri same as f"oci://{bucketName}@{namespace}/{prefix}", all of which are added as input to this function. My concern is that this input makes it incosistent with all other DataScienceModel or GenericModel operations in ads and we wouldn't be able to change it later on.
@mrDzurb what do you think?

"objects": objects,
}
)
self.set_spec(self.CONST_MODEL_FILE_DESCRIPTION, tmp_model_file_description)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we modify _prepare_file_description_artifact with optional params for files list and content json? This replaces the above functions check_if_file_exists and list_obj_versions_unpaginated altogether.

If content is passed, then no need to create the dict within that function.
for example:

if not content:
    content = dict()
    content["version"] = MODEL_BY_REFERENCE_VERSION
    ...

In this case _prepare_file_description_artifact would take in inputs like def _prepare_file_description_artifact(bucket_uri: list, content: Optional[Dict] = None, files: Optional[List[List[str]]] = None).

The inputs for files isn't ideal, it might be good to implement a dataclass for this and pass a List[ModelFileDescriptionArtifact] instead of bucket_uri list.

@dataclass
class ModelFileDescriptionArtifact:
    bucket_uri: str
    files: List[str] = None

This isn't urgent though, we can update this later on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this suggestion.
Will it be okay if we do this in the next PR due to customer deadlines?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, let's mark a todo somewhere to make sure of this.

)
self.set_spec(self.CONST_MODEL_FILE_DESCRIPTION, tmp_model_file_description)

def remove(self, namespace: str, bucket: str, prefix: Optional[str] = None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as add_artifact - rename this to remove_artifact(uri:str) which takes in uri to be consistent with how paths are referred to within ads.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes done.

ads/model/model_collection.py Outdated Show resolved Hide resolved
# model found case
self.model_file_description["models"].pop(modelSearchIdx)

def create(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't needed since we can just the existing create method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes removed this.

ads/model/model_collection.py Outdated Show resolved Hide resolved
logger.info("Model Artifact stored successfully.")
return os.path.abspath(file_path)

def show(self) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is covered by model.model_file_description property.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes removing this.

@VipulMascarenhas
Copy link
Member

@YashPandit4u thanks for making all the changes here, looks good overall. One concern I have is around the path inputs, it would be good to have the oci path like I mentioned in the comments above as user can construct the path easily with the params (bucket, namespace and prefix). Also, could you please add a couple unit tests in test_datascience_model.py for this change?

cc: @mrDzurb

ads/model/model_collection.py Outdated Show resolved Hide resolved
ads/model/model_collection.py Outdated Show resolved Hide resolved

def add_artifact(
self,
namespace: str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency it would be better to accept OCIFS "oci://" path.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have changed both functions add_artifact and remove_artifact with this uri type input.

ads/model/datascience_model.py Outdated Show resolved Hide resolved
self.set_spec(self.CONST_MODEL_FILE_DESCRIPTION, self.empty_json)

# Get object storage client
authData = default_signer()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically the auth object is already constructed and accessible by

self.dsc_model.auth

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, using this now.

Copy link
Member

@VipulMascarenhas VipulMascarenhas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YashPandit4u thanks for the update. There are a few improvements to make for the next update:

  • check_if_file_exists and list_obj_versions_unpaginated should reuse existing methods.
  • _extract_oci_uri_components wasn't required, as mentioned in previous comments we could just use ObjectStorageDetails.from_path(uri) to get bucket name, namespace and prefix. If we want to add validation ObjectStorageDetails has a few static methods like is_valid_uri to do this.

@YashPandit4u
Copy link
Member Author

Hi @VipulMascarenhas , thanks for telling the improvement items.

  • I have now removed the "_extract_oci_uri_components" method and used the already present ObjectStorageDetails.from_path(uri) to get the namespace, prefix, and filepath. Thanks for the suggestion.

Copy link
Member

@VipulMascarenhas VipulMascarenhas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 👍

@YashPandit4u
Copy link
Member Author

Hi @VipulMascarenhas ,
Based on the management decision, I have reverted the method signatures to the ones suggested by the customer.

@YashPandit4u
Copy link
Member Author

Hi @VipulMascarenhas ,

I have now added options for both uri and (namespace, bucket, prefix) in add_artifact, remove_artifact.

@@ -1466,3 +1467,226 @@ def _download_file_description_artifact(self) -> Tuple[Union[str, List[str]], in
bucket_uri.append(uri)

return bucket_uri[0] if len(bucket_uri) == 1 else bucket_uri, artifact_size

def add_artifact(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Could you please add an examples section for the add and delete artifacts feature?

if object_storage_details.filepath == ""
else object_storage_details.filepath
)
if (not namespace) or (not bucket):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: if not (namespace and bucket):

else object_storage_details.filepath
)
if (not namespace) or (not bucket):
raise ValueError("Both 'namespace' and 'bucket' must be provided.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Wouldn't it be better to say something like - Artifacts cannot be added. Both 'namespace' and 'bucket' parameters must be provided.. otherwise it would be hard for user to understand where they need to provide the namespace and bucket.

- If no objects are found to add to the model description, a ValueError is raised.
"""

if uri and (namespace or bucket):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can do something like this?

if uri and (namespace or bucket):
    raise ValueError("Artifacts cannot be added. Please provide either a 'uri' alone, or 'namespace' and 'bucket' together, but not 'uri' with 'namespace' or 'bucket'.")

elif not (uri or (namespace and bucket)):
    raise ValueError("Artifacts cannot be added. You must provide either a 'uri' or both 'namespace' and 'bucket'.")

# Remove if the model already exists
self.remove_artifact(namespace=namespace, bucket=bucket, prefix=prefix)

def check_if_file_exists(fileName):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: I would not recommend to create nested functions, it would be hard to write unit tests for such methods. I would rather recommend to move such methods into utils or outside the class.

]

if len(objects) == 0:
error_message = (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT:

error_message = (
        f"No files found to add in the bucket '{bucket}' within the namespace '{namespace}' "
        f"and prefix '{prefix}'. Expected file names: {files}"
    )

)
self.set_spec(self.CONST_MODEL_FILE_DESCRIPTION, tmp_model_file_description)

def remove_artifact(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the examples section.

- If the model description JSON is None.
"""

if uri and (namespace or bucket):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comments for the add_artifact method.

@VipulMascarenhas VipulMascarenhas dismissed darenr’s stale review June 18, 2024 16:14

changes requested were implemented, pending items will be covered in subsequent update.

@VipulMascarenhas VipulMascarenhas merged commit 30534f7 into oracle:main Jun 18, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCA Verified All contributors have signed the Oracle Contributor Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants