
component push throws error of waited 4m0s but couldn't find running pod matching selector #2877

Closed
prietyc123 opened this issue Apr 14, 2020 · 40 comments
Labels
area/testing Issues or PRs related to testing, Quality Assurance or Quality Engineering flake Categorizes issue or PR as related to a flaky test. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/High Important issue; should be worked on before any other issues (except priority/Critical issue(s)).

Comments

@prietyc123
Contributor

prietyc123 commented Apr 14, 2020

/kind flake

What versions of software are you using?

Operating System:
All Supported

Output of odo version:
master

How did you run odo exactly?

odo push --context context on OpenShift ci.

Actual behavior

Throwing error as:

[odo] Please use `odo push` command to create the component with source deployed
Running odo with args [odo push --context /tmp/947847413]
[odo] Validation
[odo]  •  Checking component  ...
[odo] 
 ✓  Checking component [60ms]
[odo] 
[odo] Configuration changes
[odo]  ✓  Initializing component
[odo]  •  Creating component  ...
[odo] 
 ✓  Creating component [200ms]
[odo] 
[odo] Pushing to component dotnet-app of type local
[odo]  •  Checking files for pushing  ...
[odo] 
 ✓  Checking files for pushing [553410ns]
[odo]  •  Waiting for component to start  ...
[odo]  ✗  Waiting for component to start [4m]
[odo]  ✗  waited 4m0s but couldn't find running pod matching selector: 'deploymentconfig=dotnet-app-app'
Deleting project: bkbwthrcvq
Running odo with args [odo project delete bkbwthrcvq -f]

Expected behavior

It should push the component successfully into the deployment.

Any logs, error output, etc?


For more details: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_odo/2875/pull-ci-openshift-odo-master-v4.1-integration-e2e-benchmark/1778#1:build-log.txt%3A710
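For anyone investigating this wait failure, a quick way to see whether any pod for the component ever comes up is to watch the same selector odo waits on (the selector below is taken from the error above; <project> is whatever project the test created, and the DeploymentConfig name is assumed to match the selector value):

oc get pods -l deploymentconfig=dotnet-app-app -n <project> -w
oc get dc dotnet-app-app -n <project>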

@openshift-ci-robot openshift-ci-robot added the flake Categorizes issue or PR as related to a flaky test. label Apr 14, 2020
@kadel kadel added the area/testing Issues or PRs related to testing, Quality Assurance or Quality Engineering label Apr 15, 2020
@amitkrout
Contributor

Similar error with a slight twist: #2942 (comment)

@amitkrout amitkrout self-assigned this Apr 23, 2020
@amitkrout amitkrout added the priority/High Important issue; should be worked on before any other issues (except priority/Critical issue(s)). label Apr 23, 2020
@amitkrout
Contributor

@mik-dass This issue appears more frequently now than it did when the test node count was 2. I am also assigning you to this issue along with me, since you have fixed a similar kind of issue before.

@mik-dass
Contributor

@mik-dass This issue appears more frequently now than it did when the test node count was 2. I am also assigning you to this issue along with me, since you have fixed a similar kind of issue before.

Let's reduce the number of nodes back to 2.
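In case it helps whoever picks this up: the node count discussed here is the Ginkgo parallel-node setting the test runs use. Reducing it back to 2 would presumably look something like the following (the exact Makefile variable or wrapper in this repo may differ, so treat this as a sketch):

ginkgo -nodes=2 -randomizeAllSpecs tests/integration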

@prietyc123
Contributor Author

prietyc123 commented Apr 23, 2020

@mik-dass This issue appears more frequently now than it did when the test node count was 2. I am also assigning you to this issue along with me, since you have fixed a similar kind of issue before.

Let's reduce the number of nodes back to 2.

Not a bad idea, though I suspect the failure might be due to insufficient resources on Travis CI. Is there any way to increase the resources on Travis while doing oc cluster up? cc @amitkrout

@prietyc123
Contributor Author

We are hitting this issue more frequently when running tests on the xenial distribution of Travis CI: https://travis-ci.com/github/openshift/odo/jobs/324857465#L473 .

The xenial distribution of Travis is required to run the latest Kubernetes cluster. We could instead use an older version of minikube, but odo push does not support it, as elaborated in issue #2928 .

Using an older minikube has other consequences as well, such as lagging behind the latest features, and I suspect that could be one of the reasons odo push is not supported on it. IMO we should move towards the latest minikube version, and the xenial distribution of Travis CI could be one way to get there.

Right now we run our tests on Travis CI with the trusty distribution (Ubuntu 14.04). Running the latest minikube needs systemd, which was added in Ubuntu 16.04, so we need to bump the Ubuntu version to 16.04+ by using the xenial distribution of Travis CI.
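If we go that route, the change on the CI side would presumably be just the distribution key in .travis.yml; a minimal sketch, assuming nothing else in our Travis config pins the distro:

# .travis.yml (fragment)
dist: xenial   # Ubuntu 16.04; the current default here is trusty (14.04)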

@kadel @girishramnani WDYT?

@prietyc123
Contributor Author

@mik-dass This issue appears more frequently now than it did when the test node count was 2. I am also assigning you to this issue along with me, since you have fixed a similar kind of issue before.

Let's reduce the number of nodes back to 2.

Not a bad idea, though I suspect the failure might be due to insufficient resources on Travis CI. Is there any way to increase the resources on Travis while doing oc cluster up? cc @amitkrout

I have raised a ticket asking for more resources on Travis CI. Let's see what they reply; I will update here once they respond.

@prietyc123
Contributor Author

prietyc123 commented Apr 29, 2020

I have raised a ticket asking for more resources on Travis CI. Let's see what they reply; I will update here once they respond.

Got the response from the Travis CI team (see the attached screenshot).

The provided default memory of 7.5 GB is enough for running 4 component pushes in parallel, so I think we should investigate the component push failure from the odo end. WDYT @mik-dass?

@mik-dass
Contributor

mik-dass commented Apr 29, 2020

The provided default memory of 7.5 GB is enough for running 4 component pushes in parallel, so I think we should investigate the component push failure from the odo end. WDYT @mik-dass?

But decreasing the test nodes to 2 has indeed reduced the frequency of this failure. Also, 7.5 GB may not be enough, since we are running a cluster in the background in most of our tests, which can be an expensive operation too. The pod initialization step can also consume a lot of time. I would suggest increasing the push timeout value with odo preference set pushtimeout <some higher value> -f in most of our tests, increasing the number of test nodes to 4, and verifying.

@amitkrout
Contributor

The provided default memory of 7.5 GB is enough for running 4 component pushes in parallel, so I think we should investigate the component push failure from the odo end. WDYT @mik-dass?

But decreasing the test nodes to 2 has indeed reduced the frequency of this failure. Also, 7.5 GB may not be enough, since we are running a cluster in the background in most of our tests, which can be an expensive operation too. The pod initialization step can also consume a lot of time. I would suggest increasing the push timeout value with odo preference set pushtimeout <some higher value> -f in most of our tests, increasing the number of test nodes to 4, and verifying.

@mik-dass Maybe you are right; however, IIRC we had a similar issue even on a single test node. Anyway, we can try your suggestion to narrow down the reason for the failure.

@amitkrout
Contributor

@prietyc123 Can you please apply @mik-dass's suggestion in one of the PRs you mentioned in the comment #2877 (comment)?

You just need to overwrite the pushtimeout through the global config. Let's make the wait time twice the current value, i.e. 8 minutes.
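For reference, a minimal sketch of that override as it could be run in the test setup, assuming the pushtimeout preference value is interpreted in seconds (so 480 would correspond to the 8-minute wait suggested above):

odo preference set pushtimeout 480 -f
odo preference view

The -f flag skips the confirmation prompt, and odo preference view can be used to check that the new value took effect.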

@prietyc123
Contributor Author

@prietyc123 Can you please apply @mik-dass's suggestion in one of the PRs you mentioned in the comment #2877 (comment)?

You just need to overwrite the pushtimeout through the global config. Let's make the wait time twice the current value, i.e. 8 minutes.

Sure, I will definitely try it out and update with the result.

@prietyc123
Contributor Author

The provided default memory of 7.5 GB is enough for running 4 component pushes in parallel, so I think we should investigate the component push failure from the odo end. WDYT @mik-dass?

But decreasing the test nodes to 2 has indeed reduced the frequency of this failure. Also, 7.5 GB may not be enough, since we are running a cluster in the background in most of our tests, which can be an expensive operation too. The pod initialization step can also consume a lot of time. I would suggest increasing the push timeout value with odo preference set pushtimeout <some higher value> -f in most of our tests, increasing the number of test nodes to 4, and verifying.

@mik-dass I have set the pushtimeout to 8 minutes but am still getting the same failure.

Running odo with args [odo push --context /tmp/280910682]
[odo] Validation
[odo]  •  Checking component  ...
 ✓  Checking component [11ms]
[odo] 
[odo] Configuration changes
[odo]  ✓  Initializing component
[odo]  •  Creating component  ...
 ✓  Creating component [122ms]
[odo] 
[odo] Applying URL changes
[odo]  ✓  URL warfile: http://warfile-app-yweencucjw.127.0.0.1.nip.io created
[odo] 
[odo] Pushing to component javaee-war-test of type binary
[odo]  •  Checking files for pushing  ...
 ✓  Checking files for pushing [302877ns]
[odo]  •  Waiting for component to start  ...
 ✗  waited 8m0s but was unable to find a running pod matching selector: 'deploymentconfig=javaee-war-test-app'
[odo] For more information to help determine the cause of the error, re-run with '-v'.
[odo] See below for a list of failed events that occured more than 5 times during deployment:
[odo] 
[odo]  NAME                                          COUNT  REASON  MESSAGE                 
[odo] 
[odo]  javaee-war-test-app-1-cljpt.160a8760de1a2fe9  20     Failed  Error: ImagePullBackOff 
[odo] 
[odo] 
[odo]  ✗  Waiting for component to start [8m] [WARNING x20: Failed]
Deleting project: yweencucjw
Running odo with args [odo project delete yweencucjw -f]
[odo] This project contains the following applications, which will be deleted
[odo] Application app
[odo] This application has following components that will be deleted
[odo] component named javaee-war-test
[odo] This component has following urls that will be deleted with component
[odo] URL named warfile with host warfile-app-yweencucjw.127.0.0.1.nip.io having protocol http at port 8080
[odo] No services / could not get services
[odo]  •  Deleting project yweencucjw  ...
 ✓  Deleting project yweencucjw [6s]
[odo]  ✓  Deleted project : yweencucjw
Deleting dir: /tmp/280910682
• Failure [488.196 seconds]
odo java e2e tests
/home/travis/gopath/src/github.com/openshift/odo/tests/e2escenarios/e2e_java_test.go:13
  odo component creation
  /home/travis/gopath/src/github.com/openshift/odo/tests/e2escenarios/e2e_java_test.go:42
    Should be able to deploy a .war file using wildfly [It]
    /home/travis/gopath/src/github.com/openshift/odo/tests/e2escenarios/e2e_java_test.go:59
    No future change is possible.  Bailing out early after 480.422s.
    Running odo with args [odo push --context /tmp/280910682]
    Expected
        <int>: 1
    to match exit code:
        <int>: 0

More details : https://travis-ci.com/github/openshift/odo/jobs/325397186#L1861

@mik-dass
Contributor

mik-dass commented Apr 30, 2020

@mik-dass I have set the pushtimeout to 8 minutes but am still getting the same failure.

There seems to be some issue with the network on Travis, and the ImagePullBackOff most probably happened because of that:

https://travis-ci.com/github/openshift/odo/jobs/325397186#L1693
https://travis-ci.com/github/openshift/odo/jobs/325397186#L625

Also, TBH I haven't seen this error on most PRs since we switched back to 2 nodes. Even on 4 nodes it happened in only 2-4 test scripts.

But for your PR #2913 it's happening for all the test scripts. In fact, all the jobs on Travis that run on oc cluster up are failing with almost the same error (either ImagePullBackOff or an HTTP connection error): https://travis-ci.com/github/openshift/odo/builds/162887080. I would advise you to

  • revert all the changes in your PR and test on Travis
  • switch the distro to xenial and test
  • implement all the changes step by step in your PR, test on Travis, and report back

Maybe there is some compatibility issue with xenial or some problem on the Travis side regarding xenial.
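For what it's worth, when this reproduces, a couple of standard oc commands should confirm whether it really is an image pull / network problem rather than just a slow start (the pod name and project below are taken from the log above):

oc describe pod javaee-war-test-app-1-cljpt -n yweencucjw
oc get events -n yweencucjw --sort-by=.lastTimestamp | grep -i pull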

@openshift-ci-robot openshift-ci-robot added priority/Medium Nice to have issue. Getting it done before priority changes would be great. and removed priority/High Important issue; should be worked on before any other issues (except priority/Critical issue(s)). labels Dec 4, 2020
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 4, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 3, 2021
@mohammedzee1000
Contributor

/remove-lifecycle rotten

@openshift-ci-robot openshift-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 27, 2021
@prietyc123 prietyc123 added priority/High Important issue; should be worked on before any other issues (except priority/Critical issue(s)). and removed priority/Medium Nice to have issue. Getting it done before priority changes would be great. labels May 24, 2021
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 22, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 21, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci

openshift-ci bot commented Oct 21, 2021

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this as completed Oct 21, 2021
This issue was closed.