Update monitoring manifests to be deployed in waves #44

Closed · hemajv opened this issue Dec 2, 2020 · 15 comments · Fixed by #45 or #80
Labels: lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)

Comments
@hemajv (Member) commented Dec 2, 2020

When deploying the monitoring setup in quicklab (via ArgoCD), I ran into the following Application Sync Error:

[Screenshot from 2020-12-02 15-35-06: ArgoCD Application Sync Error]

The monitoring manifests should be updated to be deployed in waves (https://argoproj.github.io/argo-cd/user-guide/sync-waves/), so that the kfdef is deployed first, followed by the Grafana/Prometheus custom resources.
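For context, ArgoCD orders resources within a sync via the `argocd.argoproj.io/sync-wave` annotation. A minimal sketch of the proposed ordering; the names, namespaces, and apiVersions here are illustrative, not taken from the actual manifests:

```yaml
# Wave 0: apply the KfDef first (name/namespace are placeholders).
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: opf-monitoring
  namespace: odh
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
# Wave 1: Grafana/Prometheus custom resources sync only after wave 0 has applied.
apiVersion: integreatly.org/v1alpha1
kind: Grafana
metadata:
  name: grafana
  namespace: odh
  annotations:
    argocd.argoproj.io/sync-wave: "1"
```

(As discussed below, this alone doesn't solve the problem, because the wave only orders when ArgoCD applies the kfdef, not when the ODH operator reconciles it.)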

cc @HumairAK @4n4nd

@HumairAK (Member) commented Dec 2, 2020

The current workaround is to sync the kfdef manually and then click the overall sync button.

@tumido (Member) commented Dec 8, 2020

@hemajv waves won't help, since you have no control over when the ODH operator picks up the kfdef and actually does the deployment. It may be long after the kfdef resource is up in the cluster. We can do this differently via overrides, though. I'll solve this as part of #49.

@HumairAK (Member) commented Dec 8, 2020

Ah, that's a good point, thanks @tumido!

@anishasthana (Member) commented:

Hey folks, @hemajv was running into some issues with overrides for Grafana dashboard deployment that made me think a little further.

We probably don't want all dashboards to go into ODH overrides either, as there are some cases in which we will have dashboards that are user-specific (say some MOC user wants to visualize time series data). I think these types of dashboards should be deployed via ArgoCD. So to summarize:

  1. Dashboards that are relevant to upstream -> overrides -> deployed by the ODH operator
  2. Dashboards that are not relevant to upstream -> Kustomize overlays -> deployed by ArgoCD (see the sketch below)
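A minimal sketch of the second case: a user-specific GrafanaDashboard kept in a Kustomize overlay that ArgoCD syncs directly. The file path, names, and dashboard contents are hypothetical:

```yaml
# overlays/moc/user-dashboard.yaml (hypothetical path), referenced from the
# overlay's kustomization.yaml and synced by ArgoCD rather than the ODH operator.
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: moc-user-timeseries
  namespace: odh
  labels:
    app: grafana
spec:
  json: |
    {
      "title": "MOC user time series",
      "panels": []
    }
```

This is exactly the kind of resource that fails to sync while the GrafanaDashboard CRD is still missing, which is the sticking point discussed below.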

@anishasthana reopened this Jan 13, 2021
@tumido (Member) commented Jan 15, 2021

We don't have the GrafanaDashboard resource kind available until the ODH operator deploys it for us. We can't deploy it side by side, since it depends on the ODH deployment. The solution you're proposing for the 2nd case would reintroduce all the sync issues.

@4n4nd (Contributor) commented Jan 15, 2021

I like @anishasthana's suggestion. In the 2nd case, if ODH has not deployed the GrafanaDashboard CRD yet, ArgoCD will fail and try again, since we have autosync enabled for all our apps. I'd rather have the ArgoCD sync fail a couple of times than have the ODH operator deploy more resources, because the ODH operator can be a bit difficult to monitor, and with ArgoCD we have more visibility.

@tumido (Member) commented Jan 18, 2021

With the increasing number of dashboards we'd probably like to colocate them with the applications they are monitoring, right? So putting everything within ODH may not be the best idea when it comes to repo sanity and KISS. I agree with that point.

However, we currently don't have a solution for the failed syncs. The auto sync you're suggesting, @4n4nd, unfortunately won't work. That's actually the core reason why this issue was created. Auto sync retries only OutOfSync deployments; it doesn't retry Errored apps.

https://argoproj.github.io/argo-cd/user-guide/auto_sync/#automated-sync-semantics
[Screenshot: excerpt from the automated sync semantics documentation]

Another aspect is that this error state would happen only on fresh cluster deployments, because CRDs need to be deployed only once. However, it still means that somebody must manually initiate the sync for all the apps with a dashboard in them on every fresh cluster deployment. I don't really want to introduce manual steps into the workflows.

@4n4nd (Contributor) commented Jan 18, 2021

@tumido it also says you can enable selfHeal and it will try again:

[Screenshot from 2021-01-18 16-05-15: excerpt from the docs on the selfHeal option]
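For reference, selfHeal is set on the Application's sync policy. A minimal sketch of what enabling it could look like; the application name, repo URL, and paths are placeholders, not the actual operate-first manifests:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring            # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/monitoring-manifests  # placeholder repo
    path: overlays/quicklab                                    # placeholder path
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: odh
  syncPolicy:
    automated:
      prune: true
      selfHeal: true   # re-sync automatically when live state drifts from git
```

Whether this actually retries a sync that Errored (as opposed to one that is merely OutOfSync) is exactly what the next comments dig into.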

@tumido (Member) commented Jan 19, 2021

Yes, that is true. And it's a bit confusing as well. If I understand it properly, selfHeal also checks whether there's a previous successful sync that the manifests can be "restored" to. But it's very confusing here... (the next paragraph after your quote)

[Screenshot: the following paragraph of the automated sync documentation]

I'll try spinning up an instance later on, to see how it would behave, before we draw any conclusions here. I'm sorry about this, but I'm starting to be super cautious about these things.

https://github.com/argoproj/argo-cd/blob/27a609fb1a24f3ca81ae7798c43e18a66fe8e36a/controller/appcontroller.go#L1410-L1435

btw, this way we'll screenshot all their docs in here. 😄 Paragraph by paragraph. 😄 We have to stop at some point.

@tumido (Member) commented Jan 19, 2021

Actually, a clever way to work around this completely would be to leverage operate-first/blueprint#19 and just include the CRDs in the cluster-scope application. That way we have all the CRDs available even before the operator is installed by ODH, which lets us deploy the resources just fine via ArgoCD while ODH deploys the operator itself whenever it wants. (The CRDs would be consumed by the operator only once the operator is available, but the custom resources can already be present ahead of time.) And that would allow us to move the dashboards and whatever else from the ODH app to wherever they need to go. What would you say to that? I think that would solve the problem nicely.
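A minimal sketch of what that could look like in the cluster-scope application, assuming the grafana-operator CRDs are vendored into that repo; the file layout and file names are hypothetical:

```yaml
# cluster-scope/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # CRDs copied from the grafana-operator release, applied before the operator
  # itself exists, so GrafanaDashboard resources can already sync via ArgoCD.
  - crds/grafanas.integreatly.org.yaml
  - crds/grafanadashboards.integreatly.org.yaml
  - crds/grafanadatasources.integreatly.org.yaml
```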

@HumairAK (Member) commented Jan 19, 2021

We would have to ensure we're pulling the appropriate CRDs, i.e. the same ones that ODH would be intending to install. I'm also wondering how ODH handles its diffs when running reconciliations. ArgoCD adds a label to the manifests it deploys, so in this case the CRDs would already carry at least one change between the two. When ODH is deployed, would it try to fight with ArgoCD over these CRDs?
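For reference, the label in question is ArgoCD's default resource-tracking label. A sketch of what an ArgoCD-managed CRD would carry (the instance value is illustrative, and the CRD spec is omitted):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: grafanadashboards.integreatly.org
  labels:
    # Added by ArgoCD for resource tracking; the value is the Application name.
    app.kubernetes.io/instance: cluster-scope   # illustrative value
# (spec omitted for brevity; the point is the extra label another reconciler would see as a diff)
```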

@sesheta (Member) commented Oct 11, 2021

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta added the lifecycle/stale label on Oct 11, 2021
@sesheta (Member) commented Nov 10, 2021

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@sesheta added the lifecycle/rotten label and removed the lifecycle/stale label on Nov 10, 2021
@sesheta (Member) commented Dec 10, 2021

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

@sesheta closed this as completed Dec 10, 2021
@sesheta (Member) commented Dec 10, 2021

@sesheta: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
