GCS Path as input artifact #3548
Comments
/assign @Ark-kun
Hey @sadeel, this is because under the hood the train_path is not interpreted as an artifact, but only passed as a primitive value. You can try to declare it as an InputPath, as in this example.
Unfortunately, even in this case it won't be considered an artifact.
Artifacts are intermediate data. They must be outputs of some components. (And TFX has the same behavior - artifacts are only created when they are outputs.) To make this data tracked, you need to import it into the system.
P.S. You might find the Creating components from command-line programs tutorial useful.
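For reference, a minimal sketch of what declaring an InputPath looks like in a Python function-based component (the function body, type name, and base image here are illustrative assumptions, not the linked example itself):

from kfp.components import InputPath, func_to_container_op

def train(training_data_path: InputPath('TFRecord')):
    # The system stages the upstream data at this local path before the code runs.
    with open(training_data_path, 'rb') as f:
        data = f.read()
    print('Read %d bytes of training data' % len(data))

train_op = func_to_container_op(train, base_image='python:3.7')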
Thank you for the clarification. Does import mean that I have to copy it from GCS to a local file, or can I feed the GCS path directly to my binary by using a component that just echoes the GCS path?
On Sun, Apr 19, 2020, 1:57 AM Alexey Volkov wrote:
Hey @sadeel, this is because under the hood the train_path is not interpreted as an artifact, but only passed as a primitive value.
You can try to declare it as an InputPath, as in this example.
Unfortunately, even in this case it won't be considered an artifact.
I would like the GCS path to be tracked as an input artifact.
Artifacts are intermediate data. They must be outputs of some components. (And TFX has the same behavior - artifacts are only created when they are outputs.)
To make this data tracked, you need to import it into the system.
import kfp
from kfp import components

import_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/2dac60c/components/google-cloud/storage/download_blob/component.yaml')

train_op = components.load_component_from_text('''
name: Train
inputs:
- {name: training_data}
outputs:
- {name: model}
implementation:
  container:
    image: <your image>
    command: ['python3', 'main.py']
    args: [
      '--train_path', {inputPath: training_data},
      '--model_path', {outputPath: model},
    ]
''')

def trainer_pipeline(training_data_uri: 'URI' = 'gs://path-to-my-data/record.tfrecord'):
    import_task = import_op(training_data_uri)
    train_task = train_op(
        training_data=import_task.output,
    )

kfp.Client(host=...).create_run_from_pipeline_func(trainer_pipeline, arguments={})
P.S. You might find the Creating components from command-line programs tutorial useful:
https://github.com/Ark-kun/kfp_samples/blob/master/2019-10%20Kubeflow%20summit/106%20-%20Creating%20components%20from%20command-line%20programs/106%20-%20Creating%20components%20from%20command-line%20programs.ipynb
If you want the data to become an artifact, you have to import it (download it so that the system takes the data file, puts it into its storage, generates a URI for it, and adds an Artifact record). That's the current situation.
Thanks, a follow-up question. My trainer needs to run on a GPU; when using ContainerOp directly I know how to request one, but do you know how to add a GPU using the component definition you mentioned above?
Perhaps you can try:
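The snippet itself is not reproduced above; a sketch of one common approach, reusing import_op and train_op from the earlier example and treating the returned task as a regular ContainerOp (the GKE accelerator label and value are assumptions):

def trainer_pipeline(training_data_uri: 'URI' = 'gs://path-to-my-data/record.tfrecord'):
    import_task = import_op(training_data_uri)
    train_task = train_op(training_data=import_task.output)
    # Request a GPU on the task, just as with a hand-built ContainerOp.
    train_task.set_gpu_limit(1)
    # On GKE, also pin the node pool's accelerator type (assumed label/value).
    train_task.add_node_selector_constraint(
        'cloud.google.com/gke-accelerator', 'nvidia-tesla-t4')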
Thanks @numerology, that works. It looks like MinIO is used for artifact storage by default; is there any way to use GCS instead? My workflow looks like the one @Ark-kun mentioned.
Does that mean we need to revive ArtifactLocation? |
This simple guide shows how to switch from MinIO to GCS: https://github.com/argoproj/argo/blob/master/docs/configure-artifact-repository.md#configuring-gcs-google-cloud-storage
I think it's better to have this configuration in the cluster. Once we start using GCS by default in the Marketplace deployment (and maybe add an option/guide for standalone and Kubeflow deployments), we should start getting fewer of these questions.
Hi @sadeel, @rmgogogo just recently wrote a README for how to install Kubeflow Pipelines standalone using GCS: https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize#option-3-install-it-to-gcp-with-cloudsql--gcs-minio-managed-storage. Does it work for you?
I agree this would be great. We have several pipelines created by hand that accept input artifacts that we can't write with the SDK.
Can you please explain your scenario? I'm trying to understand why the current SDK prevents you from reaching your goal.
The SDK already allows you to pass arbitrary data, including BigQuery table names, GCS paths, HDFS URIs, HTTPS URLs, Git URIs, etc. The SDK does not restrict the type of data you're passing.
KFP can also optionally move data automatically from the system data storage (the configured S3-compatible artifact storage: https://github.com/argoproj/argo/blob/master/docs/configure-artifact-repository.md) to the container and back. We also make it easy to import data from an external data store into the system data store.
What are we missing?
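As a small illustration of passing an arbitrary reference (here a GCS URI) as a plain value, a sketch in which the component name, image, and command are assumptions:

from kfp import components

list_blobs_op = components.load_component_from_text('''
name: List GCS blobs
inputs:
- {name: data_uri, type: String}
implementation:
  container:
    image: google/cloud-sdk:slim
    command: [gsutil, ls, {inputValue: data_uri}]
''')

def uri_pipeline(data_uri: str = 'gs://my-bucket/data/'):
    # The URI is passed verbatim as a string; the system does not download
    # anything, and no input artifact is recorded for it.
    list_blobs_op(data_uri=data_uri)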
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it. |
What steps did you take:
I have a pipeline that accepts a GCS path pointing to the input data used to train the model. I would like the GCS path to be tracked as an input artifact. Here is my workflow:
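(The workflow snippet itself is not shown here; the following is a hypothetical reconstruction of the pattern being described, with a made-up image and argument names.)

import kfp
from kfp import dsl

@dsl.pipeline(name='trainer')
def trainer_pipeline(train_path: str = 'gs://my-bucket/data/train.tfrecord'):
    # The GCS path is passed through as a plain string argument,
    # so it never shows up as an input artifact.
    dsl.ContainerOp(
        name='train',
        image='gcr.io/my-project/trainer:latest',
        command=['python3', 'main.py'],
        arguments=['--train_path', train_path],
    )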
What happened:
The input GCS path (train_path) is not tracked by the artifact feature.
What did you expect to happen:
The input GCS path should be tracked.
Environment:
Kubeflow on GCP
How did you deploy Kubeflow Pipelines (KFP)?
Hosted Kubeflow on GCP
/kind bug