
pipeline step failed with exit status code 2: failed to save outputs #750

Closed

Svendegroote91 opened this issue Jan 29, 2019 · 8 comments

Svendegroote91 commented Jan 29, 2019

Hi,

I am trying to run a very basic kubeflow pipeline with 2 components:
1/ preprocess
2/ train

However, when I try to run the pipeline, the Pipeline UI shows This step is in Error state with this message: failed to save outputs: exit status 2.

When I check the pod logs I see the following error (I also attached the successful logs from before the error).

level=fatal msg="exit status 2\ngithub.com/argoproj/argo/errors.Wrap\n\t/root/go/src/github.com/argoproj/argo/errors/errors.go:87\ngithub.com/argoproj/argo/errors.InternalWrapError\n\t/root/go/src/github.com/argoproj/argo/errors/errors.go:70\ngithub.com/argoproj/argo/workflow/executor/docker.(*DockerExecutor).GetFileContents\n\t/root/go/src/github.com/argoproj/argo/workflow/executor/docker/docker.go:40\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).SaveParameters\n\t/root/go/src/github.com/argoproj/argo/workflow/executor/executor.go:343\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/root/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:49\ngithub.com/argoproj/argo/cmd/argoexec/commands.glob..func4\n\t/root/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:19\ngithub.com/argoproj/argo/vendor/github.com/spf13/cobra.(*Command).execute\n\t/root/go/src/github.com/argoproj/argo/vendor/github.com/spf13/cobra/command.go:766\ngithub.com/argoproj/argo/vendor/github.com/spf13/cobra.(*Command).ExecuteC\n\t/root/go/src/github.com/argoproj/argo/vendor/github.com/spf13/cobra/command.go:852\ngithub.com/argoproj/argo/vendor/github.com/spf13/cobra.(*Command).Execute\n\t/root/go/src/github.com/argoproj/argo/vendor/github.com/spf13/cobra/command.go:800\nmain.main\n\t/root/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:15\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:198\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:2361"

(screenshot of the error state in the Pipelines UI)

My actual pipeline script is as follows:

import kfp.dsl as dsl


class Preprocess(dsl.ContainerOp):

  def __init__(self, name, data_bucket, cutoff_year):
    super(Preprocess, self).__init__(
      name=name,
      # image needs to be a compile-time string
      image='gcr.io/sven-sandbox/kubeflow/cpu:v3',
      command=['python3', 'run_preprocess.py'],
      arguments=[
        '--data_bucket', data_bucket,
        '--cutoff_year', cutoff_year
      ],
      file_outputs={'blob-path': 'data/data_stocks.csv'}
    )

class Train(dsl.ContainerOp):

  def __init__(self, name, blob_path, version, data_bucket, model_bucket):
    super(Train, self).__init__(
      name=name,
      # image needs to be a compile-time string
      image='gcr.io/sven-sandbox/kubeflow/cpu:v3',
      command=['python3', 'run_train.py'],
      arguments=[
        '--version', version,
        '--blob_path', blob_path,
        '--data_bucket', data_bucket,
        '--model_bucket', model_bucket
      ]
    )


@dsl.pipeline(
  name='financial time series',
  description='Train Financial Time Series'
)
def train_and_deploy(
        data_bucket=dsl.PipelineParam('data-bucket', value='kf-data'),
        cutoff_year=dsl.PipelineParam('cutoff-year', value='2010'),
        model_bucket=dsl.PipelineParam('model-bucket', value='kf-finance'),
        version=dsl.PipelineParam('version', value='8')
):
  """Pipeline to train financial time series model"""
  preprocess_op = Preprocess('preprocess', data_bucket, cutoff_year)
  train_op = Train('train and deploy', preprocess_op.output, version, data_bucket, model_bucket)


if __name__ == '__main__':
  import kfp.compiler as compiler
  compiler.Compiler().compile(train_and_deploy, __file__ + '.tar.gz')

The preprocess container is actually executed, since the files were stored on storage, but it looks like something is going wrong in the communication between the container and the orchestration.

As the error message is quite cryptic, can anyone tell me where to look to fix this issue?

FYI: this is my run_preprocess.py:

"""Module for running the data retrieval and preprocessing.

Script that performs all the steps to retrieve the training data and perform preprocessing.
"""
import logging
import argparse
import shutil
import os

from helpers import preprocess
from helpers import storage as storage_helper


def run_preprocess(args):
  """Runs the retrieval and preprocessing of the data.

  Args:
    args: args that are passed when submitting the training

  Returns:

  """
  tickers = ['snp', 'nyse', 'djia', 'nikkei', 'hangseng', 'ftse', 'dax', 'aord']
  closing_data = preprocess.load_data(tickers, args.cutoff_year)
  time_series = preprocess.preprocess_data(closing_data)
  temp_folder = 'data'
  if not os.path.exists(temp_folder):
    os.mkdir(temp_folder)
  file_path = os.path.join(temp_folder, 'data_{}.csv'.format(args.cutoff_year))
  time_series.to_csv(file_path, index=False)
  storage_helper.upload_to_storage(args.data_bucket, temp_folder)
  shutil.rmtree('data')


def main():
  parser = argparse.ArgumentParser(description='Preprocessing')

  parser.add_argument('--data_bucket',
                      type=str,
                      help='GCS bucket where preprocessed data is saved',
                      default='<your-bucket-name>')

  parser.add_argument('--cutoff_year',
                      type=str,
                      help='Cutoff year for the stock data',
                      default='2010')

  args = parser.parse_args()
  run_preprocess(args)


if __name__ == '__main__':
  logging.basicConfig(level=logging.INFO)
  main()
Svendegroote91 changed the title from "pipeline step failed with exit status code 1" to "pipeline step failed with exit status code 2" on Jan 29, 2019
Svendegroote91 changed the title from "pipeline step failed with exit status code 2" to "pipeline step failed with exit status code 2: failed to save outputs" on Jan 29, 2019

ssbagalkar commented Feb 27, 2019

I am getting exactly the same error.
Here is my pipeline code:

#!/usr/bin/env python3

import kfp.dsl as dsl
import kfp.gcp as gcp

class ObjectDict(dict):
    def __getattr__(self, name):
        if name in self:
            return self[name]
        else:
            raise AttributeError("No such attribute: " + name)


class DataAcquisitionOP(dsl.ContainerOp):

  def __init__(self, name, project, region, cluster_name):
    super(DataAcquisitionOP, self).__init__(
      name=name,
      image='gcr.io/blah_project/data-acquisition:latest',
      arguments=[
          '--project', project,
          '--region', region,
          '--cluster', cluster_name
     ],
     file_outputs={'train_img': '/tmp/data_acquisition/train_img.txt',
                   'train_labels':'tmp/data_acquisition/train_labels.txt',
                   'valid_img': '/tmp/data_acquisition/valid_img.txt',
                   'valid_labels':'/tmp/data_acquisition/valid_labels.txt',
                   'test_img': '/tmp/data_acquisition/test_img.txt',
                   'test_labels': '/tmp/data_acquisition/test_labels.txt',
                   })


class PreprocessOP(dsl.ContainerOp):

  def __init__(self, name, project, region, cluster_name, train_img, train_labels, valid_img, valid_labels, test_img, test_labels):
    super(PreprocessOP, self).__init__(
      name=name,
      image='gcr.io/blah_project/preprocess:latest',
      arguments=[
          '--project', project,
          '--region', region,
          '--cluster', cluster_name,
          '--train_img',train_img,
          '--train_labels',train_labels,
          '--valid_img',valid_img,
          '--valid_labels',valid_labels,
          '--test_img',test_img,
          '--test_labels',test_labels,
     ],
     file_outputs={'train_records': '/tmp/preprocess/locations/output_train.txt',
                   'test_records': '/tmp/preprocess/locations/output_test.txt',
                   'valid_records': '/tmp/preprocess/locations/output_valid.txt',
                   })

# =======================================================================

@dsl.pipeline(
  name='MNIST trainer',
  description='A trainer that does end-to-end distributed training for mnist models.'
)
def mnist_train_pipeline(
    project=dsl.PipelineParam('project',value='blah_project'),
    train_bucket=dsl.PipelineParam('train_bucket',value='gs://kfp-mnist/train'),
    valid_bucket=dsl.PipelineParam('valid_bucket',value='gs://kfp-mnist/valid'),
    test_bucket=dsl.PipelineParam('test_bucket', value='gs://kfp-mnist/test'),
    region=dsl.PipelineParam('region',value = 'us-central1'),
    cluster_name=dsl.PipelineParam('cluster_name',value='mnisttwo'),
):
    dataacquisition_op = DataAcquisitionOP('dataacquisition', project, region, cluster_name).apply(gcp.use_gcp_secret('user-gcp-sa'))
    preprocess_op = PreprocessOP('preprocess', project, region, cluster_name,
                                 dataacquisition_op.outputs['train_img'],
                                 dataacquisition_op.outputs['train_labels'],
                                 dataacquisition_op.outputs['valid_img'],
                                 dataacquisition_op.outputs['valid_labels'],
                                 dataacquisition_op.outputs['test_img'],
                                 dataacquisition_op.outputs['test_labels']).apply(gcp.use_gcp_secret('user-gcp-sa'))



if __name__ == '__main__':
    import kfp.compiler as compiler
    import sys

    if len(sys.argv) != 2:
        sys.exit(-1)

    filename = sys.argv[1]
    compiler.Compiler().compile(mnist_train_pipeline,'mnist_main'+ '.tar.gz')

@ssbagalkar

So my issue was that I had specified the wrong path: tmp/preprocess/locations/output_train.txt was declared, but I saved the file to tmp/preprocess/output_train.txt inside my Docker container. Changing it solved the problem.
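For anyone hitting the same mismatch, a minimal sketch of the rule that fixed it: the absolute path the container writes to and the path declared in file_outputs must be identical, leading slash included. The image name is taken from the pipeline above; everything else is a hypothetical illustration, not the actual code (in practice the write happens inside the container image and the ContainerOp lives in the pipeline file, but both sides have to name the same path).

import os

import kfp.dsl as dsl

# Single source of truth for where the output lands. The Argo wait sidecar
# only collects this file after the main container exits; it never creates it.
OUTPUT_PATH = '/tmp/preprocess/locations/output_train.txt'


def save_output(value):
  """Runs inside the container image: write the value to the declared path."""
  os.makedirs(os.path.dirname(OUTPUT_PATH), exist_ok=True)
  with open(OUTPUT_PATH, 'w') as f:
    f.write(value)


class PreprocessOP(dsl.ContainerOp):

  def __init__(self, name):
    super(PreprocessOP, self).__init__(
      name=name,
      image='gcr.io/blah_project/preprocess:latest',
      # Must match the path written inside the container exactly.
      file_outputs={'train_records': OUTPUT_PATH})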


DurivetMatthias commented Mar 4, 2019

Hey,

I fixed the pipeline and it now runs run_preprocess.py >> run_train.py.

The problem was in the run_preprocess.py file. You have to write a file to a local path inside your Docker container; in this case I wrote the file_path variable to /blob_path.txt. In the next step, run_train.py, this value is used to find where the preprocessed data is stored.

with open("/blob_path.txt", "w") as output_file:
  output_file.write(file_path)


Kubeflow Pipelines manages its workflow by using file_outputs={'blob-path': '/blob_path.txt'} together with preprocess_op.output. If you specify file_outputs, you basically sign a "contract" to write that file to the given local path inside the Docker container. If a second image then uses <your_previous_dsl.ContainerOp>.output, you say that this image only starts once the previous "contract" has been fulfilled, i.e. in this example once the blob_path variable has been written to /blob_path.txt.

A small pitfall to keep in mind is that inside the code of the second container, run_train.py in this case, the variable <your_previous_dsl.ContainerOp>.output holds the content of the file instead of the filepath.
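As a hedged sketch of that pitfall, using the --blob_path flag that the Train step in the example above passes on the command line: what arrives in the train container is the string that was written into /blob_path.txt, not the path itself.

import argparse

parser = argparse.ArgumentParser(description='Training')
# preprocess_op.output is injected as a command-line value, so --blob_path
# receives the *contents* of /blob_path.txt (e.g. 'data/data_2010.csv'),
# not the string '/blob_path.txt'.
parser.add_argument('--blob_path',
                    type=str,
                    help='Location of the preprocessed data inside the data bucket')
args = parser.parse_args()

# Use the value directly; there is no /blob_path.txt inside the train container.
print('Preprocessed data blob: {}'.format(args.blob_path))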

I really only added the lines that write /blob_path.txt; if you want my code, it's at https://github.com/DurivetMatthias/examples/blob/add_pipelines/financial_time_series/tensorflow_model

I wonder whether this way of chaining output to input is intended to pass parameters or small values to the next container OR if kubeflow wants you to pass the entire preprocessed dataset.

@vincent-pli

@DurivetMatthias Could you share your workflow YAML file?
I hit the same issue: the "wait" container says the path of my output file is not found.
I noticed that in the workflow YAML file the output file (/blob_path.txt in your case) is set as a parameter, so I guess something is going wrong there.

@DurivetMatthias

@vincent-pli the code is at https://github.com/DurivetMatthias/examples/blob/add_pipelines/financial_time_series/tensorflow_model/ml_pipeline.py

Make sure that when the preprocessing script ends, there is a file at the path specified in file_outputs
(and also make sure you are not removing that file just before the script ends, with something like shutil.rmtree('data')).
Kubeflow only checks for the promised file_outputs files once the container is done, but before it is cleaned up/removed. A sketch of the fixed tail of run_preprocess() is below.
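For the original script above, a minimal sketch of what the end of run_preprocess() could look like with that fix applied (names such as time_series, storage_helper, args and temp_folder are the ones from the posted script; the pipeline's Preprocess step would then declare file_outputs={'blob-path': '/blob_path.txt'} instead of the relative 'data/data_stocks.csv'):

  # ...end of run_preprocess(args), with the fix applied
  file_path = os.path.join(temp_folder, 'data_{}.csv'.format(args.cutoff_year))
  time_series.to_csv(file_path, index=False)
  storage_helper.upload_to_storage(args.data_bucket, temp_folder)

  # Write the promised output *outside* the temporary folder so it is still
  # there when the wait sidecar collects file_outputs after the container exits.
  with open('/blob_path.txt', 'w') as output_file:
    output_file.write(file_path)

  # Safe now: /blob_path.txt does not live inside 'data'.
  shutil.rmtree(temp_folder)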

@vincent-pli

@DurivetMatthias thanks, it helped

@Svendegroote91

Thanks @DurivetMatthias for the fix, closing the issue.

