
feat(components): Support dynamic machine type parameters in CustomTrainingJobOp #10883

Merged: 12 commits merged into master from dynamic-customtrainingjobop on Jun 13, 2024

Conversation

@KevinGrantLee (Contributor) commented Jun 10, 2024

Description of your changes:

Enables setting the machine_type, accelerator_type, and accelerator_count dynamically from task outputs and pipeline inputs in CustomTrainingJobOp.

Example:

# Imports needed for this example:
from kfp import dsl
from google_cloud_pipeline_components.v1 import custom_job


@dsl.component
def machine_type() -> str:
    return 'n1-standard-4'


@dsl.component
def accelerator_type() -> str:
    return 'NVIDIA_TESLA_P4'


@dsl.component
def accelerator_count() -> int:
    # This can either be int or int string
    return 1



@dsl.pipeline
def pipeline(
    project: str,
    location: str,
    encryption_spec_key_name: str = '',
):
    machine_type_task = machine_type()
    accelerator_type_task = accelerator_type()
    accelerator_count_task = accelerator_count()

    custom_job.CustomTrainingJobOp(
        display_name='add-numbers',
        worker_pool_specs=[{
            'container_spec': {
                # doesn't need to be the container under test
                # just need an image within the VPC-SC perimeter
                'image_uri':
                    ('gcr.io/ml-pipeline/google-cloud-pipeline-components:2.5.0'
                    ),
                'command': ['echo'],
                'args': ['foo'],
            },
            'machine_spec': {
                'machine_type': machine_type_task.output,
                'accelerator_type': accelerator_type_task.output,
                'accelerator_count': accelerator_count_task.output,
            },
            'replica_count': 1,
        }],
        project=project,
        location=location,
        encryption_spec_key_name=encryption_spec_key_name,
    )
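
For reference, here is a minimal sketch of compiling the example above with the KFP SDK compiler; the output file name is an arbitrary choice for illustration:

from kfp import compiler

# Compile the pipeline to a pipeline spec (IR YAML) that can be submitted to
# Vertex AI Pipelines.
compiler.Compiler().compile(
    pipeline_func=pipeline,
    package_path='dynamic_machine_spec_pipeline.yaml',
)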

This PR also enables the following behavior:

@dsl.pipeline
def pipeline(
    project: str,
    location: str,
    machine_type: str,
    accelerator_type: str,
    accelerator_count: int,
    encryption_spec_key_name: str = '',
):

    custom_job.CustomTrainingJobOp(
        display_name='add-numbers',
        worker_pool_specs=[{
            'container_spec': {
                # doesn't need to be the container under test
                # just need an image within the VPC-SC perimeter
                'image_uri':
                    ('gcr.io/ml-pipeline/google-cloud-pipeline-components:2.5.0'
                    ),
                'command': ['echo'],
                'args': ['foo'],
            },
            'machine_spec': {
                'machine_type': machine_type,
                'accelerator_type': accelerator_type,
                'accelerator_count': accelerator_count,
            },
            'replica_count': 1,
        }],
        project=project,
        location=location,
        encryption_spec_key_name=encryption_spec_key_name,
    )
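
With the pipeline-input variant, the machine parameters can be supplied at submission time. Below is a hedged sketch using the Vertex AI SDK; the project, region, and template path are placeholders rather than values from this PR:

from google.cloud import aiplatform

aiplatform.init(project='my-project', location='us-central1')

# The runtime parameter values flow into machine_spec through the pipeline
# inputs declared above.
job = aiplatform.PipelineJob(
    display_name='dynamic-machine-spec',
    template_path='dynamic_machine_spec_pipeline.yaml',
    parameter_values={
        'project': 'my-project',
        'location': 'us-central1',
        'machine_type': 'n1-standard-4',
        'accelerator_type': 'NVIDIA_TESLA_P4',
        'accelerator_count': 1,
    },
)
job.submit()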

Checklist:

@KevinGrantLee (Contributor Author)

/retest

1 similar comment
@KevinGrantLee (Contributor Author)

/retest

@connor-mccarthy (Member)

@KevinGrantLee, can you please address the required presubmit checks DCO and kubeflow-pipelines-sdk-yapf?

@connor-mccarthy (Member)

/assign @chensun

@KevinGrantLee (Contributor Author)

/retest

@KevinGrantLee (Contributor Author)

/retest


@dsl.component
def accelerator_count() -> int:
    # This can either be int or int string
Member:

Is this true? The type hint doesn't indicate the same.

Contributor Author (@KevinGrantLee), Jun 11, 2024:

Yes, I also tried compiling and submitting a pipeline with def accelerator_count() -> str: returning '1', and that pipeline succeeded.

Contributor Author:

I'll remove the comment to avoid confusion, because leaving the return annotation as int and returning '1' causes errors.

        worker_pool_specs=[{
            'container_spec': {
                # doesn't need to be the container under test
                # just need an image within the VPC-SC perimeter
Member:

I don't quite follow this comment. Is it necessary?

Contributor Author:

I copied this comment from another test pipeline, will remove.

            'machine_spec': {
                'machine_type': machine_type_task.output,
                'accelerator_type': accelerator_type_task.output,
                'accelerator_count': accelerator_count_task.output,
Member:

Can you make a case where the dynamic value comes from a pipeline input parameter? That would make sure we cover all the dynamic value paths.

Contributor Author (@KevinGrantLee), Jun 11, 2024:

Done. Tested compiling and submitting the pipeline.

    elif isinstance(data, list):
        return [recursive_replace(i, old_value, new_value) for i in data]
    else:
        if isinstance(data, pipeline_channel.PipelineChannel):
Member:

This method seems to be explicitly replacing placeholders from one representation to another. It's not for replacing arbitrary values, so the method name should reflect its purpose.

Contributor Author:

Although here I'm just using this method for replacing placeholders, it can be used for arbitrary values as well, so I left the method name general.

Contributor Author:

Renamed the method to recursive_replace_placeholders.
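
For readers following the thread, a rough sketch of what a helper like recursive_replace_placeholders could look like, inferred from the snippets quoted here rather than the exact code merged in this PR:

def recursive_replace_placeholders(data, old_value, new_value):
    """Recursively swaps old_value for new_value in a nested dict/list."""
    if isinstance(data, dict):
        return {
            key: recursive_replace_placeholders(value, old_value, new_value)
            for key, value in data.items()
        }
    elif isinstance(data, list):
        return [
            recursive_replace_placeholders(item, old_value, new_value)
            for item in data
        ]
    else:
        # Leaf values: swap the PipelineChannel (old_value) for the
        # input-parameter placeholder string (new_value) when they match.
        return new_value if data == old_value else data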

3: ['d']
}],
'old_value': 'd',
'new_value': 'dd',
Member:

The old and new values don't seem to be testing a real-case scenario?

Contributor Author (@KevinGrantLee), Jun 11, 2024:

That test case used some simple dummy values to verify the behavior of recursive_replace(); I can remove it if you think it's redundant.

Contributor Author:

Removed the test case.

@@ -239,70 +327,18 @@ def build_task_spec_for_task(
component_input_parameter)

elif isinstance(input_value, str):
# Handle extra input due to string concat
Member:

IIRC, this chunk of code is only applicable to string-typed inputs; why merge the code and expand it to other input types? Also, it's a bit hard to read the diff between the deleted code and the extracted code. Can you try making the changes in place without refactoring, and see if it's actually necessary to expand the logic to non-string-typed inputs?

Contributor Author:

I found that this block of code could be reused for handling PipelineChannels inside worker_pool_specs, in addition to handling string-typed inputs. Instead of copying the ~50 lines of code, I thought it'd be better to refactor the logic into a separate function, replace_and_inject_placeholders().

I could un-refactor and duplicate the logic; I do have a slight preference for this refactoring but can go either way. WDYT?

Member:

The name of the extracted method isn't accurate: the code does more than placeholder manipulation; it also performs component input expansion.
The branch logic now reads like this:

if isinstance(input_value, str):
    # shared code
    pipeline_task_spec.inputs.parameters[
                input_name].runtime_value.constant.string_value = input_value
elif isinstance(input_value, (int, float, bool, dict, list)):
    if isinstance(input_value, (dict, list)):
          # shared code
    pipeline_task_spec.inputs.parameters[
                input_name].runtime_value.constant.CopyFrom(
                    to_protobuf_value(input_value))
else:
     raise

You can achieve the same goal, and even more code reuse, without extracting a shared method by:

if not isinstance(input_value, (str, dict, list, int, float, bool)):
    raise

if isinstance(input_value, (str, dict, list)):
    # shared code

pipeline_task_spec.inputs.parameters[
            input_name].runtime_value.constant.CopyFrom(
                to_protobuf_value(input_value))

Member:

Aside from the refactoring part, I wonder what's the use case for dict and list? In case CustomTrainingJobOp is used, what's the input_value here?

Contributor Author (@KevinGrantLee), Jun 12, 2024:

If CustomTrainingJobOp is used, then worker_pool_specs is passed in as input_value.

It looks like this with PipelineChannel objects:

input_value = [{'container_spec': {'image_uri': 'gcr.io/ml-pipeline/google-cloud-pipeline-components:2.5.0', 'command': ['echo'], 'args': ['foo']}, 'machine_spec': {'machine_type': {{channel:task=machine-type;name=Output;type=String;}}, 'accelerator_type': {{channel:task=accelerator-type;name=Output;type=String;}}, 'accelerator_count': {{channel:task=accelerator-count;name=Output;type=Integer;}}}, 'replica_count': 1}]

Member (@chensun), Jun 12, 2024:

Thanks for the explanation. So input_value would be of type list in this case. Including dict in the same code path is just for future use cases, not entirely necessary at this moment, right? I'm fine to include dict now.

@KevinGrantLee (Contributor Author)

@chensun I verified that nested DAGs with dsl.Condition() and custom jobs with dynamic machine parameters (from task outputs and pipeline inputs) compile and run successfully.


"""Recursively replaces values in a nested dict/list object.

This method is used to replace PipelineChannel objects with pipeine channel
placeholders in a nested object like worker_pool_specs for custom jobs.
Member:

pipeline channel placeholders -> input parameter placeholder

Contributor Author:

Done.


@chensun (Member) commented Jun 12, 2024

> @chensun I verified that nested DAGs with dsl.Condition() and custom jobs with dynamic machine parameters (from task outputs and pipeline inputs) compile and run successfully.

Can you add a test case?

@KevinGrantLee (Contributor Author)

/retest

@KevinGrantLee removed the request for review from connor-mccarthy, June 12, 2024 18:30
additional_input_name].task_output_parameter.output_parameter_key = (
channel.name)
elif isinstance(input_value, (str, int, float, bool, dict, list)):
if isinstance(input_value, (str, dict, list)):
Member:

You can remove this if isinstance(input_value, (str, dict, list)): check; extract_pipeline_channels_from_any would return an empty list in the float, int, and bool cases.

Contributor Author:

Is it fine to remove the inner if check and expand the type annotations?

It would simplify the code, but I'm not sure if it makes sense to update the type annotations for extract_pipeline_channels_from_any, since ints, floats, and bools can't contain pipeline channels.

payload: Union[PipelineChannel, str, list, tuple, dict]

Member:

Yes, you can update the payload annotation type.

Contributor Author:

Done.


if isinstance(input_value, str):
input_value = input_value.replace(
channel.pattern, additional_input_placeholder)
Member:

This would be covered by recursive_replace_placeholders, right?

Contributor Author:

The string case is not covered by recursive_replace_placeholders as is; I would need to embed string.replace() logic in recursive_replace_placeholders if we wanted to get rid of this if/else block.

I suppose the question is whether we want to expose this logic in pipeline_spec_builder:build_task_spec_for_task or compiler_utils:recursive_replace_placeholders.

Member:

I see. It's up to you whether you want to keep it as-is or remove the if/else block. I don't have a strong preference.

I'm not sure how this is related to your question about exposing the logic.

Contributor Author:

Alright, kept as-is.
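
As an illustration of the string branch discussed above, the snippet below shows the kind of in-place replacement being kept; the placeholder strings are illustrative examples, not compiler output:

# A string input that embeds a pipeline channel pattern, e.g. built through
# string concatenation inside the pipeline definition.
input_value = 'gs://{{channel:task=bucket;name=Output;type=String;}}/data.csv'

channel_pattern = '{{channel:task=bucket;name=Output;type=String;}}'
additional_input_placeholder = (
    "{{$.inputs.parameters['pipelinechannel--bucket-Output']}}")

# The channel pattern is rewritten to the additional-input placeholder, which
# recursive_replace_placeholders does not handle for plain strings as-is.
input_value = input_value.replace(channel_pattern, additional_input_placeholder)
print(input_value)
# gs://{{$.inputs.parameters['pipelinechannel--bucket-Output']}}/data.csv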

@KevinGrantLee: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                      | Commit  | Details | Required | Rerun command
kfp-kubernetes-execution-tests | 1c70801 | link    | false    | /test kfp-kubernetes-execution-tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Member (@chensun) left a comment

/lgtm
/approve

Thanks, @KevinGrantLee!

google-oss-prow bot added the lgtm label on Jun 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chensun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot merged commit b57f9e8 into master on Jun 13, 2024
28 of 29 checks passed
google-oss-prow bot deleted the dynamic-customtrainingjobop branch on June 13, 2024 07:09
@pthieu commented Jun 14, 2024

@KevinGrantLee is it possible to do the same for worker_pool_specs.container_spec.env? I have a use case where I'm trying to keep a consistent timestamp for file naming across my custom training op and the testing op that runs afterwards (the testing op needs the freshly trained model), so I pass it in as an environment variable.

#10902

@KevinGrantLee (Contributor Author)

Hi @pthieu, this PR should also enable that use case. I did some local tests, but can you confirm on your end?

@pthieu commented Jun 17, 2024

@KevinGrantLee I think I'll need to wait for the next release, as we use this in our CI/CD pipeline. Do you know what the release schedule is? I can test and confirm once it's out.

@KevinGrantLee (Contributor Author)

@pthieu, we're planning to do a release later this week. cc @chensun

@alexredplanet

I believe the same issue is occurring for ModelBatchPredictOp, @pthieu @KevinGrantLee; here is a minimal example:

# Imports assumed for this example:
from kfp import dsl
from google_cloud_pipeline_components.v1.batch_predict_job import ModelBatchPredictOp
from google_cloud_pipeline_components.v1.model import ModelGetOp

@dsl.component
def gcs_jsonl_uri(bucket_name: str, file_name: str) -> str:
    return f"gs://{bucket_name}/{file_name}"

@dsl.pipeline
def pipeline(project: str, location: str, model_name: str):

    model = ModelGetOp(project=project, model_name=model_name, location=location)
    gcs_jsonl = gcs_jsonl_uri(bucket_name="example", file_name="example.jsonl")

    batch_predict_job = ModelBatchPredictOp(
        model=model.outputs["model"],
        job_display_name="example",
        gcs_source_uris=[gcs_jsonl.output],
        location=location
    )

producing the same error as your issue, @pthieu:

ValueError: Value must be one of the following types: str, int, float, bool, dict, and list. Got: "{{channel:task=gcs-jsonl-uri;name=Output;type=String;}}" of type "<class 'kfp.dsl.pipeline_channel.PipelineParameterChannel'>"

@KevinGrantLee (Contributor Author)

Hi @alexredplanet, I'm reasonably confident that this PR should also fix your case. Once the next KFP release is out, can you retry?

@alexredplanet

Thanks @KevinGrantLee, it did indeed fix that issue!

@KevinGrantLee (Contributor Author) commented Jun 25, 2024

Hi @pthieu, KFP SDK 2.8.0 has been released; you should be able to test #10902 again.

@pthieu commented Jun 26, 2024

@KevinGrantLee looks like getting the latest version worked (at least no errors thrown), thanks.

Just waiting on a dependency requirement change in the google-cloud-pipeline-components package:

The conflict is caused by:
    The user requested kfp==2.8.0
    google-cloud-pipeline-components 2.14.1 depends on kfp<=2.7.0 and >=2.6.0
