
BYOM Model support #812

Merged
merged 7 commits into feature/aquav1.0.2 on May 3, 2024
Conversation

@mayoor (Member) commented Apr 29, 2024

Bring Your Own Model support

This change adds support for "any" Hugging Face model on AI Quick Actions. The following scenarios need to be addressed to facilitate this feature:

  1. The user chooses a Hugging Face model that has already been tested and certified on AI Quick Actions by the service. For such models, the user need not specify which runtime container is required for inference or fine-tuning.
  2. The user chooses a model that has not been certified, but picks from the list of ready-made, service-managed runtimes. For such models, the user specifies the service-managed container name as listed in the documentation. The documentation lists the key libraries and their versions across the different service-managed containers so that the user can choose the right image.
  3. The user chooses a model that has no supporting container in the service-managed container list. The user builds the inference/fine-tuning container in their own tenancy and provides the container URI while registering the model.

The PR also introduces the notion of "registering" a model: the user imports the Hugging Face model into the model catalog within their own tenancy.

Assumptions

  1. Model registration requires an internet connection. It is assumed that the surface from which the user "registers" the model has internet access.
  2. For gated models, the user must have an authorized Hugging Face token.

Hugging Face token setup

Run the `huggingface-cli login` command and follow the on-screen instructions.
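
As an alternative to the CLI, a minimal sketch of logging in programmatically with the huggingface_hub package (the HF_TOKEN environment variable is only an illustrative place to keep the token):

    # Sketch only: programmatic equivalent of `huggingface-cli login`.
    import os
    from huggingface_hub import login

    login(token=os.environ["HF_TOKEN"])  # token created at https://huggingface.co/settings/tokens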

Usage

  1. Scenario 1 [Verified Model] - ads aqua model register --model meta-llama/Meta-Llama-3-8B --os_path oci://mayoor-dev-versioned@namespace/cached-models --local_dir `pwd`/cache-models
  2. Scenario 2 [Unverified Model with SMC] - ads aqua model register --model meta-llama/Meta-Llama-3-8B --os_path oci://mayoor-dev-versioned@namespace/cached-models --local_dir `pwd`/cache-models --inference_container odsc-vllm-container --inference_container_type_smc --finetuning_container odsc-finetuning-llm-container --finetuning_container_type_smc
  3. Scenario 3 [Unverified Model with Custom Container] - ads aqua model register --model meta-llama/Meta-Llama-3-8B --os_path oci://mayoor-dev-versioned@namespace/cached-models --local_dir `pwd`/cache-models --inference_container iad.ocir.io/my/custom:1.0 --finetuning_container iad.ocir.io/my/custom-ft:1.0

TODO

  • Test cases.

@oracle-contributor-agreement bot added the OCA Verified label (All contributors have signed the Oracle Contributor Agreement) on Apr 29, 2024
github-actions bot

⚠️ This PR changed pyproject.toml file. ⚠️

  • PR Creator must update 📃 THIRD_PARTY_LICENSES.txt, if any 📚 library added/removed in pyproject.toml.
  • PR Approver must confirm 📃 THIRD_PARTY_LICENSES.txt updated, if any 📚 library added/removed in pyproject.toml.

else:
break
os.makedirs(local_dir, exist_ok=True)
snapshot_download(
Member

Could you add some details here? Why do we need to download the snapshot again?

Member

The first download saves the model to the local HF cache (not local_dir). This is resumable in case something goes wrong with the internet connection. The second download copies from the HF cache to local_dir. Downloading directly to local_dir is not resumable, based on how it works, but copying is unlikely to have errors.
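
For illustration, a minimal sketch of the two-step approach described above, mirroring the arguments shown in the diff (model and local_dir are the variables from the surrounding code):

    # Sketch only: the first call populates the resumable HF cache; the second
    # call materializes the files in local_dir, reusing what is already cached.
    import os
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id=model)  # resumable download into the local HF cache
    os.makedirs(local_dir, exist_ok=True)
    snapshot_download(
        repo_id=model, local_dir=local_dir, local_dir_use_symlinks=False
    )  # copies the cached files into local_dir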

@mrDzurb (Member) Apr 30, 2024

Got it, I missed this part - local_dir=local_dir.

From HF:

    If `local_dir` is provided, the file structure from the repo will be replicated in this location. When using this
    option, the `cache_dir` will not be used and a `.huggingface/` folder will be created at the root of `local_dir`
    to store some metadata related to the downloaded files. While this mechanism is not as robust as the main
    cache-system, it's optimized for regularly pulling the latest version of a repository.

Also, it looks like local_dir_use_symlinks is a deprecated argument?

if local_dir_use_symlinks != "auto":
            warnings.warn(
                "`local_dir_use_symlinks` parameter is deprecated and will be ignored. "
                "The process to download files to a local folder has been updated and do "
                "not rely on symlinks anymore. You only need to pass a destination folder "
                "as`local_dir`.\n"
                "For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder."
            )

Member Author

Looks like this deprecation came in only 2 days ago - huggingface/huggingface_hub#2223
It will depend on which version of the hub API carries this code.
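
For reference, a minimal sketch of what the quoted warning recommends once the deprecated argument is dropped (model and local_dir again stand in for the variables from the surrounding code):

    # Sketch only: newer huggingface_hub versions no longer need
    # local_dir_use_symlinks; passing local_dir alone is enough.
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id=model, local_dir=local_dir)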

f"Error uploading the object. Exit code: {e.returncode} with error {e.stdout}"
)

print(os_details)
Member

Probably used for debug purposes?

Member Author

fixed

@@ -116,6 +116,7 @@ opctl = [
"rich",
"fire",
"cachetools",
"huggingface_hub"
Member

Will we need to get approval for this?

Member Author

We need to.

@mrDzurb (Member) commented Apr 30, 2024

I've been thinking about whether we should allow users to specify the VLLM/TGI version as an alternative to the container name, and have our system automatically select the appropriate container based on that input. If a user specifies just the VLLM/TGI interface without a specific version, we could default to the latest container that supports VLLM/TGI.

ads aqua model register --model meta-llama/Meta-Llama-3-8B  --os_path oci://bucket@namespace/cached-models --local_dir `pwd`/cache-models --interface vllm

Additionally, I think it's important to offer users the ability to specify different containers for inference and fine-tuning when creating deployments and fine-tuning jobs. When users register models and specify containers, these can be set as the defaults for deployment and fine-tuning; however, given that containers may become obsolete quickly, particularly with regular security updates, users should still have the option to override them.

container_type=container_type_key,
)
if not is_custom_container
else container_type_key
Member

Clarifying: for an SMC container, container_type_key is odsc-vllm-serving, whereas for a BYOC container it will be something like <region>.ocir.io/<namespace>/user-provided-container:1.0.0.0?

Member Author

That is right

os_path,
model_name: str,
inference_container,
finetuning_contianer,
Member

nit: replace finetuning_contianer with finetuning_container

Member Author

done

str: Display name of th model (This should be model ocid)
"""
api = HfApi()
model_info = api.model_info(model_name)
Member

This will raise RepositoryNotFoundError if model_name isn't available.
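
For illustration, a minimal sketch of guarding that call (the ValueError raised on failure is just an assumption; the actual handling in the PR may differ):

    # Sketch only: model_info() raises RepositoryNotFoundError when the repo does
    # not exist or the caller lacks access (e.g. a gated model without a token).
    from huggingface_hub import HfApi
    from huggingface_hub.utils import RepositoryNotFoundError

    try:
        model_info = HfApi().model_info(model_name)
    except RepositoryNotFoundError as err:
        raise ValueError(f"Model {model_name} was not found on Hugging Face.") from err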

filter_tag = Tags.AQUA_FINE_TUNED_MODEL_TAG.value
elif model_type == MODEL_TYPE.BASE.value:
filter_tag = Tags.BASE_MODEL_CUSTOM.value
print(filter_tag)
Member

use logger.debug instead?

Member Author

removed, was an unintentional commit

@@ -59,6 +67,11 @@ class FineTuningMetricCategories(Enum):
TRAINING = "training"


class MODEL_TYPE(Enum):
Member

nit: use camel case, i.e. ModelType, to stick with the convention?

os_path=os_path, local_dir=local_dir, model_name=model
)
# Create Model catalog entry with pass by reference
return self._create_model_catalog_entry(
Member

After registering, can we return an AquaModel object instead of just returning the display name? The user can refer to info within the returned result to proceed with the next steps (deploy, FT, etc.).

Member Author

I was thinking of returning the model id; maybe AquaModel is better.

break
os.makedirs(local_dir, exist_ok=True)
snapshot_download(
repo_id=model, local_dir=local_dir, local_dir_use_symlinks=False
Member

If the model size exceeds the space available for local_dir, the download can be interrupted. Can we check the repo metadata first before downloading?

Member Author

Will add this to the TODO. This can be done in a separate PR.
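
For illustration, a minimal sketch of the pre-check being suggested (not part of this PR; it estimates the repo size from Hub file metadata and compares it with the free space at the download location, with model and local_dir as in the surrounding code):

    # Sketch only: fail fast if the repo will not fit on disk.
    import os
    import shutil
    from huggingface_hub import HfApi

    info = HfApi().model_info(model, files_metadata=True)
    repo_size = sum(f.size or 0 for f in info.siblings)  # total bytes across repo files
    free_space = shutil.disk_usage(os.path.dirname(local_dir) or ".").free
    if repo_size > free_space:
        raise RuntimeError(
            f"Model needs ~{repo_size} bytes but only {free_space} bytes are free."
        )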


@VipulMascarenhas (Member) left a comment

Approving this PR; we can address the TODOs in subsequent updates.


Externalize container configuration for deployment

@mayoor merged commit 10d9393 into feature/aquav1.0.2 on May 3, 2024
3 checks passed
mayoor added a commit that referenced this pull request May 6, 2024