
GCS Path as input artifact #3548

Closed
sadeel opened this issue Apr 19, 2020 · 16 comments
Labels: kind/bug, lifecycle/stale, status/triaged

sadeel commented Apr 19, 2020

What steps did you take:

I have a pipeline that accepts a GCS path pointing to the input data used to train the model. I would like the GCS path to be tracked as an input artifact. Here is my workflow:

from kfp import dsl

@dsl.pipeline(
    name='Trainer Pipeline', description='Trainer.')
def trainer(train_path: str = 'gs://path-to-my-data/record.tfrecord'):
  op = dsl.ContainerOp(
      name='Train',
      image=container_path,  # the trainer container image
      command=['python3', 'main.py'],
      arguments=[
          '--train_path',
          train_path,
      ])

What happened:

The input GCS path (train_path) is not tracked by the artifact feature.

What did you expect to happen:

The input GCS path should be tracked.

Environment:

Kubeflow on GCP

How did you deploy Kubeflow Pipelines (KFP)?
Hosted Kubeflow on GCP

/kind bug

numerology commented:

/assign @Ark-kun
/assign @numerology

numerology commented:

Hey @sadeel, this is because under the hood train_path is not interpreted as an artifact, but only passed as a primitive value.

You can try to declare it as an InputPath as in this example.
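For illustration, a minimal sketch of that approach (the function body, data handling, and base image below are placeholders, not from this thread):

from kfp.components import InputPath, create_component_from_func

def train(train_path: InputPath()):
    # KFP materializes the upstream data and passes a local file path here.
    with open(train_path, 'rb') as f:
        data = f.read()

train_op = create_component_from_func(train, base_image='python:3.7')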

Ark-kun (Contributor) commented Apr 19, 2020

> Hey @sadeel, this is because under the hood train_path is not interpreted as an artifact, but only passed as a primitive value.
>
> You can try to declare it as an InputPath as in [this example]

Unfortunately, even in this case it won't be considered an artifact.

> I would like the GCS path to be tracked as an input artifact.

Artifacts are intermediate data. They must be outputs of some components. (And TFX has the same behavior - artifacts are only created when they are outputs.)

To make this data tracked, you need to import it into the system.

import kfp
from kfp import components

import_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/2dac60c/components/google-cloud/storage/download_blob/component.yaml')

train_op = components.load_component_from_text('''
name: Train
inputs:
  - {name: training_data}
outputs:
  - {name: model}
implementation:
  container:
    image: <your image>
    command: [python3, main.py]
    args: [
      --train_path, {inputPath: training_data},
      --model_path, {outputPath: model},
    ]
''')

def trainer_pipeline(training_data_uri: 'URI' = 'gs://path-to-my-data/record.tfrecord'):
    import_task = import_op(training_data_uri)
    train_task = train_op(
        training_data=import_task.output,
    )

kfp.Client(host=...).create_run_from_pipeline_func(trainer_pipeline, arguments={})

P.S. You might find the Creating components from command-line programs tutorial useful.

sadeel (Author) commented Apr 19, 2020 via email

Ark-kun (Contributor) commented Apr 20, 2020

> Thank you for the clarification. Does import mean that I have to copy it from GCS to a local file, or can I feed the GCS path directly to my binary by using a component that just echoes the GCS path?

If you want the data to become an artifact, you have to import it: the system downloads the data file, puts it into its own storage, generates a URI for it, and records it as an artifact.
If you're fine with the initial data not being an artifact, then you can just pass the URI to your trainer as you did initially.

That's the current situation.

sadeel (Author) commented Apr 21, 2020

Thanks, a follow-up question: my trainer needs to run on a GPU. With ContainerOp I would just use:
op.set_gpu_limit(1)

Do you know how to add a GPU using the component definition you mentioned above?

numerology commented:

Perhaps you can try

train_task = train_op(
    training_data=import_task.output
).set_gpu_limit(1)
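On GKE you may additionally need to land the task on an accelerator node; a sketch (the accelerator type is an assumption, adjust it to your cluster):

train_task = train_op(
    training_data=import_task.output
).set_gpu_limit(1).add_node_selector_constraint(
    'cloud.google.com/gke-accelerator', 'nvidia-tesla-k80')  # assumed GPU type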

sadeel closed this as completed on Apr 21, 2020
sadeel (Author) commented Apr 21, 2020

Thanks @numerology, that works.

It looks like Minio is used for artifact storage by default. Is there any way to use GCS instead? My workflow looks like the one @Ark-kun mentioned.

numerology commented:

@Ark-kun

Does that mean we need to revive ArtifactLocation?

sadeel reopened this on Apr 21, 2020
Ark-kun (Contributor) commented Apr 26, 2020

> Thanks @numerology, that works.
>
> It looks like Minio is used for artifact storage by default. Is there any way to use GCS instead? My workflow looks like the one @Ark-kun mentioned.

This simple guide shows how to switch from Minio to GCS: https://github.com/argoproj/argo/blob/master/docs/configure-artifact-repository.md#configuring-gcs-google-cloud-storage
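Per that guide, the switch amounts to pointing the workflow controller's artifact repository at GCS; a sketch of the relevant ConfigMap fragment (the bucket and secret names are placeholders):

data:
  artifactRepository: |
    gcs:
      bucket: my-kfp-artifacts
      keyFormat: prefix/in/bucket  # optional
      serviceAccountKeySecret:
        name: my-gcs-credentials
        key: serviceAccountKey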

Ark-kun (Contributor) commented Apr 26, 2020

> Does that mean we need to revive ArtifactLocation?

I think it's better to have this configuration in the cluster. Once we use GCS by default in the Marketplace deployment (and perhaps add an option/guide for the standalone and Kubeflow deployments), we should start getting fewer of these questions.

Bobgy (Contributor) commented May 7, 2020

Hi @sadeel, @rmgogogo just recently wrote a README for how to install Kubeflow standalone using GCS: https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize#option-3-install-it-to-gcp-with-cloudsql--gcs-minio-managed-storage. Does it work for you?

Bobgy added the status/triaged label on May 7, 2020
timothyjlaurent commented:

I agree this would be great. We have several hand-written pipelines that accept input artifacts, and we can't express those with the SDK.

Ark-kun (Contributor) commented May 13, 2020

> I agree this would be great. We have several hand-written pipelines that accept input artifacts, and we can't express those with the SDK.

Can you please explain your scenario? I'm trying to understand why the current SDK prevents you from reaching your goal.

The SDK already allows you to pass arbitrary data, including BigQuery table names, GCS paths, HDFS URIs, HTTPS URLs, Git URIs, etc. It does not restrict the type of data you pass.
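For example, a plain string parameter carries a GCS URI through unchanged (trainer_op here is a hypothetical component):

def my_pipeline(train_uri: str = 'gs://path-to-my-data/record.tfrecord'):
    # The URI is passed as-is; the system does not download anything.
    trainer_op(train_uri=train_uri)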

KFP can also, optionally, move data automatically between the container and the configured S3-compatible artifact storage (https://github.com/argoproj/argo/blob/master/docs/configure-artifact-repository.md).

We also make it easy to import the data from external data store to the system data store.
See #3548 (comment)

What are we missing?

stale bot commented Aug 12, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the lifecycle/stale label on Aug 12, 2020
stale bot commented Aug 19, 2020

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

stale bot closed this as completed on Aug 19, 2020