
GCS Path as input artifact #3548

Closed
sadeel opened this issue Apr 19, 2020 · 16 comments
Labels: kind/bug, lifecycle/stale, status/triaged

sadeel commented Apr 19, 2020

What steps did you take:

I have a pipeline that accepts a GCS path pointing to the input data used to train the model. I would like the GCS path to be tracked as an input artifact. Here is my workflow:

from kfp import dsl

@dsl.pipeline(
    name='Trainer Pipeline', description='Trainer.')
def trainer(train_path: str = 'gs://path-to-my-data/record.tfrecord'):
  op = dsl.ContainerOp(
      name='Train',
      image=container_path,  # the trainer container image
      command=['python3', 'main.py'],
      arguments=[
          '--train_path',
          train_path,
      ])

What happened:

The input GCS path (train_path) is not tracked by the artifact feature.

What did you expect to happen:

The input GCS path should be tracked.

Environment:

Kubeflow on GCP

How did you deploy Kubeflow Pipelines (KFP)?
Hosted Kubeflow on GCP

/kind bug

numerology commented:

/assign @Ark-kun
/assign @numerology

numerology commented:

Hey @sadeel, this is because under the hood train_path is not interpreted as an artifact, but only passed as a primitive value.

You can try to declare it as an InputPath as in this example.
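For illustration, a minimal sketch of that approach (the function body, data handling, and base image below are placeholders, not from this thread):

from kfp.components import InputPath, create_component_from_func

def train(train_path: InputPath()):
    # KFP materializes the upstream data and passes a local file path here.
    with open(train_path, 'rb') as f:
        data = f.read()

train_op = create_component_from_func(train, base_image='python:3.7')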

Ark-kun (Contributor) commented Apr 19, 2020

> Hey @sadeel, this is because under the hood train_path is not interpreted as an artifact, but only passed as a primitive value.
>
> You can try to declare it as an InputPath as in [this example]

Unfortunately, even in this case it won't be considered an artifact.

> I would like the GCS path to be tracked as an input artifact.

Artifacts are intermediate data. They must be outputs of some components. (And TFX has the same behavior - artifacts are only created when they are outputs.)

To make this data tracked, you need to import it into the system.

import kfp
from kfp import components

import_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/2dac60c/components/google-cloud/storage/download_blob/component.yaml')

train_op = components.load_component_from_text('''
name: Train
inputs:
  - {name: training_data}
outputs:
  - {name: model}
implementation:
  container:
    image: <your image>
    command: [python3, main.py]
    args: [
      --train_path, {inputPath: training_data},
      --model_path, {outputPath: model},
    ]
''')

def trainer_pipeline(training_data_uri: 'URI' = 'gs://path-to-my-data/record.tfrecord'):
    import_task = import_op(training_data_uri)
    train_task = train_op(
        training_data=import_task.output,
    )

kfp.Client(host=...).create_run_from_pipeline_func(trainer_pipeline, arguments={})

P.S. You might find the Creating components from command-line programs tutorial useful.

sadeel (Author) commented Apr 19, 2020 via email

Ark-kun (Contributor) commented Apr 20, 2020

> Thank you for the clarification. Does import mean that I have to copy it from GCS to a local file, or can I feed the GCS path directly to my binary by using a component that just echoes the GCS path?

If you want the data to become an artifact, you have to import it: the system downloads the data file, puts it into its own storage, generates a URI for it, and records it as an artifact.
If you're fine with the initial data not being an artifact, then you can just pass the URI to your trainer as you did initially.

That's the current situation.

sadeel (Author) commented Apr 21, 2020

Thanks, a follow-up question: my trainer needs to run on a GPU. With ContainerOp I would just use:
op.set_gpu_limit(1)

Do you know how to add a GPU using the component definition you mentioned above?

numerology commented:

Perhaps you can try

train_task = train_op(
    training_data=import_task.output
).set_gpu_limit(1)
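On GKE you may additionally need to land the task on an accelerator node; a sketch (the accelerator type is an assumption, adjust it to your cluster):

train_task = train_op(
    training_data=import_task.output
).set_gpu_limit(1).add_node_selector_constraint(
    'cloud.google.com/gke-accelerator', 'nvidia-tesla-k80')  # assumed GPU type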

sadeel closed this as completed on Apr 21, 2020
sadeel (Author) commented Apr 21, 2020

Thanks @numerology, that works.

It looks like Minio is used for artifact storage by default. Is there any way to use GCS instead? My workflow looks like the one @Ark-kun mentioned.

numerology commented:

@Ark-kun

Does that mean we need to revive ArtifactLocation?

sadeel reopened this on Apr 21, 2020
Ark-kun (Contributor) commented Apr 26, 2020

> Thanks @numerology, that works.
>
> It looks like Minio is used for artifact storage by default. Is there any way to use GCS instead? My workflow looks like the one @Ark-kun mentioned.

This simple guide shows how to switch from Minio to GCS: https://github.com/argoproj/argo/blob/master/docs/configure-artifact-repository.md#configuring-gcs-google-cloud-storage
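Per that guide, the switch amounts to pointing the workflow controller's artifact repository at GCS; a sketch of the relevant ConfigMap fragment (the bucket and secret names are placeholders):

data:
  artifactRepository: |
    gcs:
      bucket: my-kfp-artifacts
      keyFormat: prefix/in/bucket  # optional
      serviceAccountKeySecret:
        name: my-gcs-credentials
        key: serviceAccountKey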

Ark-kun (Contributor) commented Apr 26, 2020

> Does that mean we need to revive ArtifactLocation?

I think it's better to have this configuration in the cluster. Once we use GCS by default in the Marketplace deployment (and perhaps add an option/guide for the standalone and Kubeflow deployments), we should start getting fewer of these questions.

Bobgy (Contributor) commented May 7, 2020

Hi @sadeel, @rmgogogo just recently wrote a README for how to install Kubeflow standalone using GCS: https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize#option-3-install-it-to-gcp-with-cloudsql--gcs-minio-managed-storage. Does it work for you?

Bobgy added the status/triaged label on May 7, 2020
timothyjlaurent commented:

I agree this would be great. We have several hand-written pipelines that accept input artifacts, and we can't express those with the SDK.

Ark-kun (Contributor) commented May 13, 2020

> I agree this would be great. We have several hand-written pipelines that accept input artifacts, and we can't express those with the SDK.

Can you please explain your scenario? I'm trying to understand why the current SDK prevents you from reaching your goal.

The SDK already allows you to pass arbitrary data, including BigQuery table names, GCS paths, HDFS URIs, HTTPS URLs, Git URIs, etc. It does not restrict the type of data you pass.
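For example, a plain string parameter carries a GCS URI through unchanged (trainer_op here is a hypothetical component):

def my_pipeline(train_uri: str = 'gs://path-to-my-data/record.tfrecord'):
    # The URI is passed as-is; the system does not download anything.
    trainer_op(train_uri=train_uri)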

KFP can also, optionally, move data automatically between the container and the configured S3-compatible artifact storage (https://github.com/argoproj/argo/blob/master/docs/configure-artifact-repository.md).

We also make it easy to import the data from external data store to the system data store.
See #3548 (comment)

What are we missing?

stale bot commented Aug 12, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the lifecycle/stale label on Aug 12, 2020
stale bot commented Aug 19, 2020

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

stale bot closed this as completed on Aug 19, 2020