Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/multi model artifact handler #869

Merged
Merged
Changes from 12 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
6d9fa4a
Initial commit
YashPandit4u May 27, 2024
6a930f1
rename in main module file
YashPandit4u May 27, 2024
5f3b316
Logger used for prints, error handling improved, one extra file creat…
YashPandit4u Jun 1, 2024
802e438
Merge branch 'oracle:main' into feature/multi-model-artifact-handler
YashPandit4u Jun 1, 2024
35e8464
Reformatted using black.
YashPandit4u Jun 1, 2024
e625107
Merge branch 'feature/multi-model-artifact-handler' of https://github…
YashPandit4u Jun 1, 2024
74a235e
Separate logger used.
YashPandit4u Jun 1, 2024
848972e
Added python docs for all methods.
YashPandit4u Jun 1, 2024
d35b917
Added class DataScienceModelCollection that extends from DataScienceM…
YashPandit4u Jun 5, 2024
3ac5338
removed old model description class
YashPandit4u Jun 5, 2024
822c7e8
formatted using black
YashPandit4u Jun 5, 2024
3d4d950
black formatter used and one return type added.
YashPandit4u Jun 5, 2024
d4ef023
Added add_artifact and remove_artifact method in main DataScienceMode…
YashPandit4u Jun 11, 2024
c40ea8b
Removed new added class.
YashPandit4u Jun 11, 2024
d0309f4
Added uri based approach
YashPandit4u Jun 13, 2024
19ec921
Added unit tests.
YashPandit4u Jun 13, 2024
9089028
Changed the pydocs according to ads specifications
YashPandit4u Jun 13, 2024
464cb37
Merge branch 'main' into feature/multi-model-artifact-handler
YashPandit4u Jun 13, 2024
49be8f8
replaces regex with normal splitting for uri
YashPandit4u Jun 13, 2024
a2894a5
Merge branch 'feature/multi-model-artifact-handler' of https://github…
YashPandit4u Jun 13, 2024
5d327cd
removed default_signer
YashPandit4u Jun 13, 2024
b7e7d74
Used ObjectStorageDetails.from_path(uri) for url decoding.
YashPandit4u Jun 13, 2024
4418bf8
namespace, bucket, prefix way added again.
YashPandit4u Jun 14, 2024
e2e7099
Merge branch 'main' into feature/multi-model-artifact-handler
YashPandit4u Jun 14, 2024
4cc8823
Given options for both uri and (namespace, bucket) in add_artifact, r…
YashPandit4u Jun 15, 2024
7f3cc7f
Updated python docs
YashPandit4u Jun 15, 2024
32f4826
Ran black formatter.
YashPandit4u Jun 15, 2024
9f69e35
Ran black formatter of UTs.
YashPandit4u Jun 15, 2024
ad0ce2e
Added uri UTs.
YashPandit4u Jun 15, 2024
eec2b4d
Prefix null check added.
YashPandit4u Jun 15, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
288 changes: 288 additions & 0 deletions ads/model/model_collection.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,288 @@
import json
import ads.common
import oci
import os
import ads
from ads.model.datascience_model import DataScienceModel
from typing import List, Optional
import logging

logger = logging.getLogger("ads.model_description")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this be better - logger = logging.getLogger(__name__)?

logger.setLevel(logging.INFO)


class DataScienceModelCollection(DataScienceModel):
VipulMascarenhas marked this conversation as resolved.
Show resolved Hide resolved

def _auth(self):
VipulMascarenhas marked this conversation as resolved.
Show resolved Hide resolved
"""
Internal method that authenticates the model description instance by initializing OCI clients.

Parameters:
- None

Returns:
- None

Note:
- This method retrieves authentication data using default signer from the `ads.common.auth` module.
- The region information is extracted from the authentication data.
"""
authData = ads.common.auth.default_signer()
signer = authData["signer"]
self.region = authData["config"]["region"]

# data science client
self.data_science_client = oci.data_science.DataScienceClient(
{"region": self.region}, signer=signer
)
# oss client
self.object_storage_client = oci.object_storage.ObjectStorageClient(
{"region": self.region}, signer=signer
)

def __init__(self, spec: ads.Dict = None, **kwargs) -> None:
super().__init__(spec, **kwargs)

self.empty_json = {
"version": "1.0",
"type": "modelOSSReferenceDescription",
"models": [],
}
self.region = ""
self._auth()

self.set_spec(self.CONST_MODEL_FILE_DESCRIPTION, self.empty_json)

def with_ref_model_id(self, model_ocid: str):
VipulMascarenhas marked this conversation as resolved.
Show resolved Hide resolved

# if model given then get that as the starting reference point
logger.info("Getting model details from backend")
try:
get_model_artifact_content_response = (
self.data_science_client.get_model_artifact_content(
model_id=model_ocid,
)
)
content = get_model_artifact_content_response.data.content
self.set_spec(self.CONST_MODEL_FILE_DESCRIPTION, json.loads(content))
except json.JSONDecodeError as e:
logger.error(f"Error decoding JSON: {e}")
raise e
except Exception as e:
logger.error(f"An unexpected error occurred: {e}")
raise e
return self

def add(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's rename this to add_artifact(uri:str, files=Optional[List[str]]) which takes in uri and files list as input to be consistent with how paths are referred to within ads.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user won't have uri handy, hence we have parameters like namespace, and bucket to make it easy.
Will it be possible to keep these parameters? The function signature would look like:

def add_artifact( self, namespace: str, bucket: str, prefix: Optional[str] = None, files: Optional[List[str]] = None, )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't the uri same as f"oci://{bucketName}@{namespace}/{prefix}", all of which are added as input to this function. My concern is that this input makes it incosistent with all other DataScienceModel or GenericModel operations in ads and we wouldn't be able to change it later on.
@mrDzurb what do you think?

self,
namespace: str,
bucket: str,
prefix: Optional[str] = None,
files: Optional[List[str]] = None,
):
"""
Adds information about objects in a specified bucket to the model description JSON.

Parameters:
- namespace (str): The namespace of the object storage.
- bucket (str): The name of the bucket containing the objects.
- prefix (str, optional): The prefix used to filter objects within the bucket. Defaults to None.
- files (list of str, optional): A list of file names to include in the model description.
If provided, only objects with matching file names will be included. Defaults to None.

Returns:
- None

Raises:
- ValueError: If no files are found to add to the model description.

Note:
- If `files` is not provided, it retrieves information about all objects in the bucket.
If `files` is provided, it only retrieves information about objects with matching file names.
- If no objects are found to add to the model description, a ValueError is raised.
"""

# Remove if the model already exists
self.remove(namespace, bucket, prefix)

def check_if_file_exists(fileName):
VipulMascarenhas marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please double check if the existing utils methods are already implemented such functionality.

isExists = False
try:
headResponse = self.object_storage_client.head_object(
namespace, bucket, object_name=fileName
)
if headResponse.status == 200:
isExists = True
except Exception as e:
if hasattr(e, "status") and e.status == 404:
logger.error(f"File not found in bucket: {fileName}")
else:
logger.error(f"An error occured: {e}")
return isExists

# Function to un-paginate the api call with while loop
def list_obj_versions_unpaginated():
objectStorageList = []
has_next_page, opc_next_page = True, None
while has_next_page:
response = self.object_storage_client.list_object_versions(
namespace_name=namespace,
bucket_name=bucket,
prefix=prefix,
fields="name,size",
page=opc_next_page,
)
objectStorageList.extend(response.data.items)
has_next_page = response.has_next_page
opc_next_page = response.next_page
return objectStorageList

# Fetch object details and put it into the objects variable
objectStorageList = []
if files == None:
objectStorageList = list_obj_versions_unpaginated()
else:
for fileName in files:
if check_if_file_exists(fileName=fileName):
objectStorageList.append(
self.object_storage_client.list_object_versions(
namespace_name=namespace,
bucket_name=bucket,
prefix=fileName,
fields="name,size",
).data.items[0]
)

objects = [
{"name": obj.name, "version": obj.version_id, "sizeInBytes": obj.size}
for obj in objectStorageList
if obj.size > 0
]

if len(objects) == 0:
error_message = (
f"No files to add in the bucket: {bucket} with namespace: {namespace} "
f"and prefix: {prefix}. File names: {files}"
)
logger.error(error_message)
raise ValueError(error_message)

tmp_model_file_description = self.model_file_description
tmp_model_file_description["models"].append(
{
"namespace": namespace,
"bucketName": bucket,
"prefix": prefix,
"objects": objects,
}
)
self.set_spec(self.CONST_MODEL_FILE_DESCRIPTION, tmp_model_file_description)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we modify _prepare_file_description_artifact with optional params for files list and content json? This replaces the above functions check_if_file_exists and list_obj_versions_unpaginated altogether.

If content is passed, then no need to create the dict within that function.
for example:

if not content:
    content = dict()
    content["version"] = MODEL_BY_REFERENCE_VERSION
    ...

In this case _prepare_file_description_artifact would take in inputs like def _prepare_file_description_artifact(bucket_uri: list, content: Optional[Dict] = None, files: Optional[List[List[str]]] = None).

The inputs for files isn't ideal, it might be good to implement a dataclass for this and pass a List[ModelFileDescriptionArtifact] instead of bucket_uri list.

@dataclass
class ModelFileDescriptionArtifact:
    bucket_uri: str
    files: List[str] = None

This isn't urgent though, we can update this later on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this suggestion.
Will it be okay if we do this in the next PR due to customer deadlines?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, let's mark a todo somewhere to make sure of this.


def remove(self, namespace: str, bucket: str, prefix: Optional[str] = None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as add_artifact - rename this to remove_artifact(uri:str) which takes in uri to be consistent with how paths are referred to within ads.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes done.

"""
Removes information about objects in a specified bucket from the model description JSON.

Parameters:
- namespace (str): The namespace of the object storage.
- bucket (str): The name of the bucket containing the objects.
- prefix (str, optional): The prefix used to filter objects within the bucket. Defaults to None.

Returns:
- None

Note:
- This method removes information about objects in the specified bucket from the
instance of the ModelDescription.
- If a matching model (with the specified namespace, bucket name, and prefix) is found
in the model description JSON, it is removed.
- If no matching model is found, the method returns without making any changes.
"""

def findModelIdx():
for idx, model in enumerate(self.model_file_description["models"]):
VipulMascarenhas marked this conversation as resolved.
Show resolved Hide resolved
if (
model["namespace"],
model["bucketName"],
(model["prefix"] if ("prefix" in model) else None),
) == (namespace, bucket, prefix):
return idx
return -1

modelSearchIdx = findModelIdx()
if modelSearchIdx == -1:
return
else:
# model found case
self.model_file_description["models"].pop(modelSearchIdx)

def create(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't needed since we can just the existing create method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes removed this.

"""
Saves the model to the Model Catalog of Oracle Cloud Infrastructure (OCI) Data Science service.

Parameters:
- project_ocid (str): The OCID (Oracle Cloud Identifier) of the OCI Data Science project.
- compartment_ocid (str): The OCID of the compartment in which the model will be created.
- display_name (str, optional): The display name for the created model. If not provided,
a default display name indicating the creation timestamp is used. Defaults to None.

Returns:
- str: The OCID of the created model.

Note:
- The display name defaults to a string indicating the creation timestamp if not provided.
"""
tmp_file_path = self.build()
self = self.with_artifact(uri=tmp_file_path)
created_model = super().create()
try:
os.remove(tmp_file_path)
except Exception as e:
logger.error(f"Error occurred while cleaning file: {e}")
raise e
mrDzurb marked this conversation as resolved.
Show resolved Hide resolved
return created_model.id

def build(self) -> str:
VipulMascarenhas marked this conversation as resolved.
Show resolved Hide resolved
"""
Builds the model description JSON and writes it to a file.

Parameters:
- None

Returns:
- str: The absolute file path where the model description JSON is stored.

Note:
- This method serializes the current model description attribute to a JSON file named 'resultModelDescription.json' with an indentation of 2 spaces.
"""
logger.info("Building...")
file_path = "resultModelDescription.json"
mrDzurb marked this conversation as resolved.
Show resolved Hide resolved
try:
with open(file_path, "w") as json_file:
json.dump(self.model_file_description, json_file, indent=2)
except IOError as e:
logger.error(
f"Error writing to file '{file_path}': {e}"
) # Handle the exception accordingly, e.g., log the error, retry writing, etc.
except Exception as e:
logger.error(
f"An unexpected error occurred: {e}"
) # Handle other unexpected exceptions
logger.info("Model Artifact stored successfully.")
return os.path.abspath(file_path)

def show(self) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is covered by model.model_file_description property.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes removing this.

"""
Displays the current model description JSON in a human-readable format.

Parameters:
- None

Returns:
- str: The json representation of current model artifact

Note:
- The JSON representation of the model description is formatted with an indentation
of 4 spaces.
"""
logger.info(json.dumps(self.model_file_description, indent=4))
return json.dumps(self.model_file_description, indent=4)
Loading