
High latency with resource="imagestreamimports" and verb="POST" #21508

Closed
Reamer opened this issue Nov 16, 2018 · 22 comments
Labels
component/image lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@Reamer

Reamer commented Nov 16, 2018

With the default monitoring from cluster-monitoring-operator, the KubeAPILatencyHigh alert periodically reports high latency on the API server.
The "imagestreamimports" resource is hit periodically because I have image streams with "importPolicy.scheduled" = true.

Version
oc v3.11.0+0cbc58b
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://s-openshift.mycompany.com:443
openshift v3.11.0+6c2b013-59
kubernetes v1.11.0+d4cacc0
Steps To Reproduce
  1. Install OpenShift 3.11 with Ansible
  2. Verify that openshift-monitoring is installed with the default configuration
  3. Create some image streams with importPolicy.scheduled = true (roughly as in the example below)
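
For illustration, an image stream that triggers these scheduled imports looks roughly like this (stream name, namespace, and upstream image are placeholders):

apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: example                            # placeholder stream name
  namespace: my-project                    # placeholder namespace
spec:
  tags:
  - name: latest
    from:
      kind: DockerImage
      name: docker.io/library/nginx:latest # placeholder remote image
    importPolicy:
      scheduled: true                      # re-import this tag periodically
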
Current Result

API call latency is quite high, 1-4 seconds.
[Screenshot: Prometheus graph of API request latency]

Expected Result

Fast API calls with low latency.

Additional Info

Possibly related to #14264.
Noticed these messages in the API server log.

I1116 10:15:42.621803       1 trace.go:76] Trace[2046733054]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:15:41.486368579 +0000 UTC m=+80018.770131016) (total time: 1.135397696s):
Trace[2046733054]: [1.13489078s] [1.134427655s] Object stored in database
I1116 10:19:28.517513       1 trace.go:76] Trace[1052075296]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:19:26.48801308 +0000 UTC m=+80243.771775530) (total time: 2.029461307s):
Trace[1052075296]: [2.028804306s] [2.028362867s] Object stored in database
I1116 10:26:59.096277       1 trace.go:76] Trace[1356838264]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:26:56.48712211 +0000 UTC m=+80693.770884485) (total time: 2.609073903s):
Trace[1356838264]: [2.608495955s] [2.607996341s] Object stored in database
I1116 10:34:28.377486       1 trace.go:76] Trace[1656023571]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:34:26.485752391 +0000 UTC m=+81143.769514793) (total time: 1.891700389s):
Trace[1656023571]: [1.891163987s] [1.890827358s] Object stored in database
I1116 10:42:01.529790       1 trace.go:76] Trace[40799593]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:41:56.677180492 +0000 UTC m=+81593.960942873) (total time: 4.852573087s):
Trace[40799593]: [4.851734298s] [4.851280253s] Object stored in database
I1116 10:45:42.642385       1 trace.go:76] Trace[556365896]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:45:41.48695107 +0000 UTC m=+81818.770713475) (total time: 1.155388409s):
Trace[556365896]: [1.154500425s] [1.154029628s] Object stored in database
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 14, 2019
@Reamer
Author

Reamer commented Feb 14, 2019

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 14, 2019
@dmage
Member

dmage commented Feb 21, 2019

This API call goes out to remote registries, so in general we can't guarantee that it will be fast.

@StoneIsle

+1

@Reamer
Author

Reamer commented Mar 25, 2019

@dmage Can you explain a bit more?
Some thoughts:
With the image stream setting importPolicy.scheduled: true, we enable a periodic sync with remote image registries. I assume some kind of scheduler checks for this configuration flag. The scheduler should then ask the remote registry for updated information, and once all the information is in memory, it should push it to the API server in one request, which then updates the API objects.

@dmage
Member

dmage commented Mar 25, 2019

@Reamer we have the controller "openshift.io/image-import" that goes through all image streams that have "importPolicy.scheduled: true" and creates ImageStreamImport objects. The image API has a special handler for creating ImageStreamImport objects: it gets the list of remote registries from the ImageStreamImport object and from the image stream, fetches manifests from the remote registries, and writes information about new images back to the image stream.
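
Roughly sketched, the object that this controller POSTs for each scheduled stream looks something like the following (names are illustrative; the real controller includes every scheduled tag of the stream):

apiVersion: image.openshift.io/v1
kind: ImageStreamImport
metadata:
  name: example                            # same name as the image stream being refreshed
  namespace: my-project                    # placeholder namespace
spec:
  import: true                             # actually perform the import, not a dry run
  images:
  - from:
      kind: DockerImage
      name: docker.io/library/nginx:latest # placeholder remote image
    to:
      name: latest                         # tag on the image stream to update
    importPolicy:
      scheduled: true

The POST only returns after the manifests have been fetched from the remote registry, so the latency the alert measures is essentially the remote registry's response time.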

@Reamer
Author

Reamer commented Mar 25, 2019

I disabled the alert rule in the Alertmanager configuration because it's too noisy (see the sketch below). Maybe we should disable the alert rule for this specific API endpoint in general, because I think nearly everyone will hit this problem if they use Docker Hub as a remote registry.
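
One way to mute it at the Alertmanager level is a route that sends just these alerts to an empty receiver. A minimal sketch, assuming the alert carries the resource and verb labels from the underlying recording rule (adjust the matchers if your labels differ):

route:
  receiver: default                        # placeholder for the existing default receiver
  routes:
  - match:
      alertname: KubeAPILatencyHigh
      resource: imagestreamimports         # assumption: these labels are present on the alert
      verb: POST
    receiver: "null"
receivers:
- name: default                            # placeholder, keep your real receiver here
- name: "null"                             # no notification configs, so matching alerts are dropped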

@clcollins
Member

Just adding a voice here - the alarm began to trigger for us when we configured the image streams for scheduled updates for the official OpenShift images in the openshift namespace, on two different clusters.

@yocum137

yocum137 commented Jul 1, 2019

Just adding a voice here - the alarm began to trigger for us when we configured the image streams for scheduled updates for the official OpenShift images in the openshift namespace, on two different clusters.

Are you using registry.access.redhat.com or registry.redhat.io?

@nate-duke

@yocum137 I work on the same cluster @clcollins was referring to; we're not using either of those registries. The registries in our openshift namespace are mostly on docker.io or registry.centos.org.

We're OKD, not OCP.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 24, 2019
@Reamer
Author

Reamer commented Oct 25, 2019

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 25, 2019
@trumbaut

We have this issue too:

  • Using OCP v3.11.146 (Kubernetes v1.11.0+d4cacc0).
  • After enabling importPolicy.scheduled: true for some imagestreams.
  • These imagestreams are pointing to registry.redhat.io.
  • They are correctly being synchronized.
  • They produce a KubeAPILatencyHigh alert in Prometheus (The API server has a 99th percentile latency of 7.826666666666666 seconds for POST imagestreamimports.)

I will contact Red Hat Support about how to handle this.

@trumbaut

For your reference: the same issue is being described at Red Hat Bugzilla - Bug 1670380 - Alertmanager triggers error when updating the image.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 18, 2020
@trumbaut

The issue should have been fixed in Red Hat OpenShift Container Platform 3.11.170, but I haven't tested it myself. Can someone confirm?

Related:

@nate-duke

nate-duke commented Feb 20, 2020

@trumbaut that's great that it's fixed downstream! Any idea where we could find that in openshift/origin? The last official release we have here is v3.11.0, from October 2018.

Did a bit of digging based on that "Port ... to 3.11" MR; maybe this is in the quay.io/openshift/origin-cluster-monitoring-operator images, possibly on the v3.11.0 tag, but I have a hard time finding anything concrete about the history of images on Quay. We could probably find a CI process log somewhere that connects that MR to an image being pushed to the origin-cluster-monitoring-operator image repository. I'm willing to test in our dev cluster if we can find at least a shred of evidence that the change is actually available outside of OCP.

@trumbaut

trumbaut commented Mar 4, 2020

@nate-duke :

Any idea where we could find that in openshift/origin? The last official release we have here is v3.11.0, from October 2018.

The changes can be found at

But no, I have no idea if new releases for OKD 3.11 will still be delivered. OKD 4 can be found at https://github.com/openshift/okd.

@nate-duke

But no, I have no idea if new releases for OKD 3.11 will still be delivered. OKD 4 can be found at https://github.com/openshift/okd.

We've since silenced this alert, but I did do the legwork of updating our images, so there's a chance we picked up these fixes if they were ever included in anything built for OKD 3.11.

I'm glad to see that development of OKD 4 is proceeding; I've been following that repo for the last few months. I think we'll likely wait until FCOS and support for it in OKD 4.0 have matured a bit more before we build a new cluster to actually try to run it.

@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 3, 2020
@openshift-ci-robot openshift-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 3, 2020
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
