Update monitoring manifests to be deployed in waves #44

Closed · hemajv opened this issue Dec 2, 2020 · 15 comments · Fixed by #45 or #80
Labels: lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)

Comments
@hemajv (Member) commented Dec 2, 2020

When deploying the monitoring setup in quicklab (via ArgoCD), I ran into the following Application Sync Error:

[Screenshot from 2020-12-02 15-35-06: ArgoCD Application Sync Error]

The monitoring manifests should be updated to be deployed in waves (https://argoproj.github.io/argo-cd/user-guide/sync-waves/), so that the kfdef is deployed first, followed by the Grafana/Prometheus custom resources.
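For context, ArgoCD orders resources within a sync via the `argocd.argoproj.io/sync-wave` annotation. A minimal sketch of the proposed ordering; the names, namespaces, and apiVersions here are illustrative, not taken from the actual manifests:

```yaml
# Wave 0: apply the KfDef first (name/namespace are placeholders).
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: opf-monitoring
  namespace: odh
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
# Wave 1: Grafana/Prometheus custom resources sync only after wave 0 has applied.
apiVersion: integreatly.org/v1alpha1
kind: Grafana
metadata:
  name: grafana
  namespace: odh
  annotations:
    argocd.argoproj.io/sync-wave: "1"
```

(As discussed below, this alone doesn't solve the problem, because the wave only orders when ArgoCD applies the kfdef, not when the ODH operator reconciles it.)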

cc @HumairAK @4n4nd

@HumairAK (Member) commented Dec 2, 2020

The current workaround is to sync the kfdef manually and then click the overall sync button.

@tumido (Member) commented Dec 8, 2020

@hemajv waves won't help, since you have no control over when the ODH operator picks up the kfdef and actually does the deployment. It may be long after the kfdef resource is up in the cluster. We can do this differently via overrides, though. I'll solve this as part of #49.

@HumairAK (Member) commented Dec 8, 2020

Ah, that's a good point, thanks @tumido!

@anishasthana (Member) commented:

Hey folks, @hemajv was running into some issues with overrides for Grafana dashboard deployment that made me think a little further.

We probably don't want all dashboards to go into ODH overrides either, as there are some cases in which we will have dashboards that are user-specific (say some MOC user wants to visualize time series data). I think these types of dashboards should be deployed via ArgoCD. So to summarize:

  1. Dashboards that are relevant to upstream -> overrides -> deployed by the ODH operator
  2. Dashboards that are not relevant to upstream -> Kustomize overlays -> deployed by ArgoCD (see the sketch below)
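A minimal sketch of the second case: a user-specific GrafanaDashboard kept in a Kustomize overlay that ArgoCD syncs directly. The file path, names, and dashboard contents are hypothetical:

```yaml
# overlays/moc/user-dashboard.yaml (hypothetical path), referenced from the
# overlay's kustomization.yaml and synced by ArgoCD rather than the ODH operator.
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: moc-user-timeseries
  namespace: odh
  labels:
    app: grafana
spec:
  json: |
    {
      "title": "MOC user time series",
      "panels": []
    }
```

This is exactly the kind of resource that fails to sync while the GrafanaDashboard CRD is still missing, which is the sticking point discussed below.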

@anishasthana reopened this Jan 13, 2021
@tumido (Member) commented Jan 15, 2021

We don't have the GrafanaDashboard resource kind available until the ODH operator deploys it for us. We can't deploy it side by side, since it depends on the ODH deployment. The solution you're proposing for the 2nd case would reintroduce all the sync issues.

@4n4nd (Contributor) commented Jan 15, 2021

I like @anishasthana's suggestion. In the 2nd case, if ODH has not deployed the GrafanaDashboard CRD yet, ArgoCD will fail and try again, since we have autosync enabled for all our apps. I'd rather have the ArgoCD sync fail a couple of times than have the ODH operator deploy more resources, because the ODH operator can be a bit difficult to monitor, and with ArgoCD we have more visibility.

@tumido (Member) commented Jan 18, 2021

With the increasing number of dashboards we'd probably like to colocate them with the applications they are monitoring, right? So putting everything within ODH may not be the best idea when it comes to repo sanity and KISS. I agree with that point.

However, we currently don't have a solution for the failed syncs. The auto sync you're suggesting, @4n4nd, unfortunately won't work. That's actually the core reason why this issue was created. Auto sync retries only OutOfSync deployments; it doesn't retry Errored apps.

https://argoproj.github.io/argo-cd/user-guide/auto_sync/#automated-sync-semantics
[Screenshot: excerpt from the automated sync semantics documentation]

Another aspect is that this error state would happen only on fresh cluster deployments, because CRDs need to be deployed only once. However, it still means that somebody must manually initiate the sync for all the apps with a dashboard in them on every fresh cluster deployment. I don't really want to introduce manual steps into the workflows.

@4n4nd (Contributor) commented Jan 18, 2021

@tumido it also says you can enable selfHeal and it will try again:

[Screenshot from 2021-01-18 16-05-15: excerpt from the docs on the selfHeal option]
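For reference, selfHeal is set on the Application's sync policy. A minimal sketch of what enabling it could look like; the application name, repo URL, and paths are placeholders, not the actual operate-first manifests:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring            # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/monitoring-manifests  # placeholder repo
    path: overlays/quicklab                                    # placeholder path
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: odh
  syncPolicy:
    automated:
      prune: true
      selfHeal: true   # re-sync automatically when live state drifts from git
```

Whether this actually retries a sync that Errored (as opposed to one that is merely OutOfSync) is exactly what the next comments dig into.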

@tumido (Member) commented Jan 19, 2021

Yes, that is true. And it's a bit confusing as well. If I understand it properly, selfHeal also checks whether there's a previous successful sync that the manifests can be "restored" to. But it's very confusing here... (the next paragraph after your quote)

[Screenshot: the following paragraph of the automated sync documentation]

I'll try spinning up an instance later on, to see how it would behave, before we draw any conclusions here. I'm sorry about this, but I'm starting to be super cautious about these things.

https://github.com/argoproj/argo-cd/blob/27a609fb1a24f3ca81ae7798c43e18a66fe8e36a/controller/appcontroller.go#L1410-L1435

btw, this way we'll screenshot all their docs in here. 😄 Paragraph by paragraph. 😄 We have to stop at some point.

@tumido (Member) commented Jan 19, 2021

Actually, a clever way to work around this completely would be to leverage operate-first/blueprint#19 and just include the CRDs in the cluster-scope application. That way we have all the CRDs available even before the operator is installed by ODH, which lets us deploy the resources just fine via ArgoCD while ODH deploys the operator itself whenever it wants. (The CRDs would be consumed by the operator only once the operator is available, but the custom resources can already be present ahead of time.) And that would allow us to move the dashboards and whatever else from the ODH app to wherever they need to go. What would you say to that? I think that would solve the problem nicely.
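A minimal sketch of what that could look like in the cluster-scope application, assuming the grafana-operator CRDs are vendored into that repo; the file layout and file names are hypothetical:

```yaml
# cluster-scope/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # CRDs copied from the grafana-operator release, applied before the operator
  # itself exists, so GrafanaDashboard resources can already sync via ArgoCD.
  - crds/grafanas.integreatly.org.yaml
  - crds/grafanadashboards.integreatly.org.yaml
  - crds/grafanadatasources.integreatly.org.yaml
```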

@HumairAK (Member) commented Jan 19, 2021

We would have to ensure we're pulling the appropriate CRDs, i.e. the same ones that ODH would be intending to install. I'm also wondering how ODH handles its diffs when running reconciliations. ArgoCD adds a label to the manifests it deploys, so in this case the CRDs would already carry at least one change between the two. When ODH is deployed, would it try to fight with ArgoCD over these CRDs?
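For reference, the label in question is ArgoCD's default resource-tracking label. A sketch of what an ArgoCD-managed CRD would carry (the instance value is illustrative, and the CRD spec is omitted):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: grafanadashboards.integreatly.org
  labels:
    # Added by ArgoCD for resource tracking; the value is the Application name.
    app.kubernetes.io/instance: cluster-scope   # illustrative value
# (spec omitted for brevity; the point is the extra label another reconciler would see as a diff)
```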

@sesheta (Member) commented Oct 11, 2021

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta added the lifecycle/stale label on Oct 11, 2021
@sesheta (Member) commented Nov 10, 2021

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@sesheta added the lifecycle/rotten label and removed the lifecycle/stale label on Nov 10, 2021
@sesheta (Member) commented Dec 10, 2021

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

@sesheta closed this as completed Dec 10, 2021
@sesheta (Member) commented Dec 10, 2021

@sesheta: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
