[backend] Cannot get MLMD objects from Metadata store when running v2 pipeline #8733

Closed
fstetic opened this issue Jan 19, 2023 · 21 comments

Comments

@fstetic

fstetic commented Jan 19, 2023

Environment

  • How did you deploy Kubeflow Pipelines (KFP)?
    Local Canonical Kubeflow using this guide
  • KFP version:
    Bottom of KFP UI left sidenav says build version dev_local and the guide states 1.6
  • KFP SDK version:
    kfp 2.0.0b10
    kfp-pipeline-spec 0.1.17
    kfp-server-api 2.0.0a6

Steps to reproduce

Install Kubeflow using the aforementioned guide. Copy the addition pipeline below, compile it, and run it either by uploading it through the UI or from code. Neither works.

Expected result

Pipeline shouldn't fail.

Materials and Reference

The error details say: Cannot find context with {"typeName":"system.PipelineRun","contextName":"a5e7085e-ef10-48b2-a0a5-1ced3b93e2e5"}: Unknown Content-type received.

Addition pipeline from documentation

from kfp import compiler
from kfp import dsl


@dsl.component
def addition_component(num1: int, num2: int) -> int:
    return num1 + num2


@dsl.pipeline(name="addition-pipeline")
def my_pipeline(a: int, b: int, c: int):
    add_task_1 = addition_component(num1=a, num2=b)
    add_task_2 = addition_component(num1=add_task_1.output, num2=c)


cmplr = compiler.Compiler()
cmplr.compile(my_pipeline, package_path="my_pipeline.yaml")

Impacted by this bug? Give it a 👍.

@gkcalat
Member

gkcalat commented Jan 19, 2023

Hi @fstetic!
Thank you for reporting this. Could you confirm whether the problem is persistent or if it goes away after the run completes?

@gkcalat gkcalat self-assigned this Jan 19, 2023
@gkcalat gkcalat added this to Needs triage in KFP Runtime Triage via automation Jan 19, 2023
@gkcalat gkcalat moved this from Needs triage to Needs More Info in KFP Runtime Triage Jan 19, 2023
@fstetic
Author

fstetic commented Jan 20, 2023

Hi @gkcalat! Thanks for the quick response.

The run doesn't complete. That error happens at the start of the run.

I tried a tutorial pipeline with the v1 YAML spec, and that one behaves as expected. I inspected the MinIO bucket and found that v1 pipelines create a directory named <workflow name> in mlpipelines/artifacts, but v2 pipelines don't. The "contextName" in the error message quoted in the issue corresponds to the run ID of the pipeline, not the workflow name.

I also noticed a difference in the network requests. When a run is opened in the UI, it sends a POST request to /ml_metadata.MetadataStoreService/GetContextByTypeAndName. V1 pipelines send pipeline_run in the request body and v2 pipelines send system.PipelineRun. I don't know if that means anything, because in both cases the request fails with a 400 error and the message Cannot POST /ml_metadata.MetadataStoreService/GetContextByTypeAndName
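
For anyone comparing runs, the lookup key can be pulled out of the error details with a few lines of Python, which makes it easy to check the contextName against a run ID. This is only a sketch: the helper name parse_mlmd_error is made up for illustration, and it assumes the message always has the shape quoted in this issue.

```python
import json


def parse_mlmd_error(message: str) -> dict:
    """Extract the JSON lookup key from a 'Cannot find context with {...}' message.

    Hypothetical helper: assumes the message format quoted in this issue.
    """
    prefix = "Cannot find context with "
    if not message.startswith(prefix):
        raise ValueError("unexpected message format")
    # raw_decode parses the leading JSON object and ignores the trailing text.
    key, _ = json.JSONDecoder().raw_decode(message[len(prefix):])
    return key


msg = (
    'Cannot find context with {"typeName":"system.PipelineRun",'
    '"contextName":"a5e7085e-ef10-48b2-a0a5-1ced3b93e2e5"}: '
    "Unknown Content-type received."
)
key = parse_mlmd_error(msg)
print(key["typeName"], key["contextName"])
```

The typeName field is the part that differs between v1 (pipeline_run) and v2 (system.PipelineRun) requests.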

I also raised this issue in Slack and a person responded that it might be related to a namespace/profile instantiation issue so I'll look into that next.

@tleewongjaro-agoda

Hello @fstetic

I am also having the same problem.
Have you figured out what is wrong?

Testing on 2.0.0-beta.1 for both API Server and UI, and kfp==2.0.0beta14

@fstetic
Author

fstetic commented Apr 20, 2023

Hi @tleewongjaro-agoda. Unfortunately no, I gave up and downgraded to v1 pipelines.

@gkcalat
Member

gkcalat commented Apr 20, 2023

/cc @chensun

@gkcalat gkcalat assigned chensun and jlyaoyuli and unassigned gkcalat May 4, 2023
@Enochlove

> Hello @fstetic
>
> I am also having the same problem. Have you figured out what is wrong?
>
> Testing on 2.0.0-beta.1 for both API Server and UI, and kfp==2.0.0beta14

Have you figured it out now? Or any ideas?

@LordWaif

LordWaif commented Sep 5, 2023

Is using v1 pipelines still viable? I have the same problem reported above.

In the UI, I keep running into this error, which makes it unusable:
Error: failed to retrieve list of pipelines. Click Details for more information.

The proxy-agent pod is also in CrashLoopBackOff. Its logs are below:

+++ dirname /opt/proxy/attempt-register-vm-on-proxy.sh
++ cd /opt/proxy
++ pwd
+ DIR=/opt/proxy
++ jq -r '.data.Hostname // empty'
++ kubectl get configmap inverse-proxy-config -o json
+ HOSTNAME=
+ [[ -n '' ]]
+ [[ ! -z '' ]]
++ curl http://metadata.google.internal/computeMetadata/v1/instance/zone -H 'Metadata-Flavor: Google'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: metadata.google.internal
+ INSTANCE_ZONE=/

@Enochlove

Enochlove commented Sep 6, 2023 via email

@DnPlas

DnPlas commented Oct 5, 2023

Tagging @Linchin for a bit more visibility. This was mentioned a couple of days ago in the 1.8 tracking issue, and one of our customers is also running exactly into this (they are using 2.0-alpha.7):

"Cannot get MLMD objects from Metadata store." and when clicking the "details" button on the error I get this:
Cannot find context with {"typeName":"system.PipelineRun","contextName":"496bc83e-d8be-491b-988f-5ff3b98736c5"}: Unknown Content-type received.

Could you please confirm this is an issue? Also, do you think this is potentially blocking 1.8?

@chensun
Member

chensun commented Oct 5, 2023

> Tagging @Linchin for a bit more visibility. This was mentioned a couple of days ago in the 1.8 tracking issue, and one of our customers is also running exactly into this (they are using 2.0-alpha.7):
>
> "Cannot get MLMD objects from Metadata store." and when clicking the "details" button on the error I get this:
> Cannot find context with {"typeName":"system.PipelineRun","contextName":"496bc83e-d8be-491b-988f-5ff3b98736c5"}: Unknown Content-type received.
>
> Could you please confirm this is an issue? Also, do you think this is potentially blocking 1.8?

I don't think this would be a blocker: we tested pipelines like this in a KFP 2.0 standalone deployment. I do recall seeing similar error messages sometimes, but they shouldn't fail the pipeline execution.
That being said, I will test this again with Kubeflow 1.8 rc shortly.

@chensun
Member

chensun commented Oct 9, 2023

Confirming this doesn't reproduce on Kubeflow 1.8.0-rc.1

The error message about not being able to get the MLMD context does sometimes show up in the UI. This is expected before a run starts (we should consider a UI improvement to make it less confusing), but it should be gone once the run starts (the root driver pod will create the MLMD context).
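
The behaviour described here (the context is missing until the root driver pod creates it) suggests treating a missing context as "pending" for a while rather than as an immediate error. Below is a minimal Python sketch of that idea, not KFP code: lookup stands in for whatever call fetches the context, and the function name, timeout values, and fake lookup are all assumptions for illustration.

```python
import time
from typing import Callable, Optional


def wait_for_context(
    lookup: Callable[[], Optional[dict]],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
) -> Optional[dict]:
    """Poll lookup() until it returns a context or the timeout expires.

    Returns the context, or None if it never appeared.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        context = lookup()
        if context is not None:
            return context
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval_s)


# Fake lookup: the context "appears" on the third call, as if the
# root driver pod had just created it.
calls = {"n": 0}


def fake_lookup() -> Optional[dict]:
    calls["n"] += 1
    if calls["n"] < 3:
        return None
    return {"typeName": "system.PipelineRun"}


result = wait_for_context(fake_lookup, timeout_s=5.0, interval_s=0.01)
print(result)
```

If the context never appears even after the run should have started, as several people in this thread report, the timeout path returns None and the cause is probably something other than timing.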

@chensun chensun closed this as completed Oct 9, 2023
KFP Runtime Triage automation moved this from Needs More Info to Closed Oct 9, 2023
@venkatesh-chinni

I'm facing the same issue. I get this error message and the run doesn't start. I see the issue is closed, but I don't see a solution other than downgrading. Is there a workable solution without downgrading?

@ZeynepRuveyda

ZeynepRuveyda commented May 16, 2024

Hi, we faced the same problem on Kubeflow 1.8. What is the solution? We could not solve this problem. @venkatesh-chinni, did you find something? @chensun, can you explain a little bit more?

@venkatesh-chinni

venkatesh-chinni commented May 17, 2024

> Hi, we faced the same problem on Kubeflow 1.8. What is the solution? We could not solve this problem. @venkatesh-chinni, did you find something? @chensun, can you explain a little bit more?

Still trying to figure it out; no resolution yet.

@ZeynepRuveyda

> > Hi, we faced the same problem on Kubeflow 1.8. What is the solution? We could not solve this problem. @venkatesh-chinni, did you find something? @chensun, can you explain a little bit more?
>
> Still trying to figure it out; no resolution yet.

I found a solution: downgrading to Kubeflow 1.7 with Kubernetes 1.24, and using kfp 2.0.1.

I hope it helps!

@photonbit

I had this issue, and it stopped happening after I created a volume. Runs start working even if I create a volume and then delete it, but the error keeps happening if I create a pipeline run on a newly installed Kubeflow 1.8 from the manifests.

@sapphire008

@photonbit Do you mind elaborating on what volume is being created? Is this a Docker volume? Is there a specific name I need to use? Thanks.

@photonbit

> @photonbit Do you mind elaborating on what volume is being created? Is this a Docker volume? Is there a specific name I need to use? Thanks.

I created a volume from the Kubeflow dashboard. For the name, I tried both a random name and the name of the volume I had configured for the pipeline, and both resolved the issue.

@sapphire008

@photonbit Thanks!

@thesuperzapper
Member

Thanks to the investigation done by @orfeas-k in canonical/bundle-kubeflow#966, it seems like this issue might be some kind of irrecoverable race condition in the Deployment/metadata-envoy-deployment Pods.

As a temporary workaround, it seems like you can simply restart that deployment and it should fix it:

kubectl rollout restart deployment/metadata-envoy-deployment --namespace kubeflow
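
The restart can also be scripted, for example from a maintenance job. The following is a rough Python sketch, not an official tool: the function names are made up for illustration, and the extra kubectl rollout status step (which waits for the new Pods to become ready) is an addition that the workaround above does not mention.

```python
import subprocess
from typing import List

DEPLOYMENT = "deployment/metadata-envoy-deployment"


def rollout_commands(namespace: str = "kubeflow") -> List[List[str]]:
    """Build the kubectl commands: restart, then wait for the rollout."""
    return [
        ["kubectl", "rollout", "restart", DEPLOYMENT, "--namespace", namespace],
        ["kubectl", "rollout", "status", DEPLOYMENT, "--namespace", namespace],
    ]


def restart_metadata_envoy(namespace: str = "kubeflow") -> None:
    """Run the commands in order, stopping at the first failure."""
    for cmd in rollout_commands(namespace):
        subprocess.run(cmd, check=True)


# Print the commands instead of running them; call
# restart_metadata_envoy() to actually execute them against a cluster.
for cmd in rollout_commands():
    print(" ".join(cmd))
```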

As a longer-term solution (assuming a restart is all that is required), we can add a livenessProbe on the PodSpec of the manifests. For example, this Kustomize patch may work:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: metadata-envoy-deployment
spec:
  template:
    spec:
      containers:
        - name: container
          livenessProbe:
            failureThreshold: 3
            initialDelaySeconds: 5
            periodSeconds: 15
            successThreshold: 1
            timeoutSeconds: 5
            httpGet:
              path: "/"
              port: md-envoy
              httpHeaders:
                - name: Content-Type
                  value: application/grpc-web-text
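
To check by hand whether the endpoint would pass a probe like this, something along these lines could be run against a port-forwarded Pod. It is only a sketch: the helper names are hypothetical, and the port number 9090 is an assumption about what the md-envoy named port maps to, so check the Deployment spec first.

```python
import http.client
import urllib.request


def build_probe_request(host: str = "localhost", port: int = 9090) -> urllib.request.Request:
    """Build a GET request shaped like the livenessProbe in the patch.

    The default port 9090 is an assumption, not taken from the patch.
    """
    return urllib.request.Request(
        f"http://{host}:{port}/",
        headers={"Content-Type": "application/grpc-web-text"},
        method="GET",
    )


def probe_ok(req: urllib.request.Request, timeout_s: float = 5.0) -> bool:
    """Return True if the endpoint answers with a status in [200, 400)."""
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # Connection errors, timeouts, and 4xx/5xx responses count as failures.
        return False


req = build_probe_request()
print(req.get_method(), req.full_url, req.header_items())
```

Kubernetes treats an httpGet probe as successful for any status code from 200 up to (but not including) 400, which is what probe_ok mirrors. Whether a healthy Envoy actually answers this request with such a code is not confirmed in this thread.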
