fix: remove duplicate metrics service, correct certificate DNS name, and add tolerations docs #44
- Remove auth_proxy_service.yaml, which duplicated metrics-service.yaml
- Add troubleshooting docs for pod scheduling failures due to taints
Claude finished @ian-flores's task
PR Review
Review submitted. The PR correctly removes a duplicate metrics service and adds well-structured troubleshooting documentation for tolerations that aligns with the existing Helm chart configuration.
Review Summary
Deleted file (dist/chart/templates/rbac/auth_proxy_service.yaml): ✅ Correct removal
Verified that this was indeed a duplicate of dist/chart/templates/metrics/metrics-service.yaml. Both created team-operator-controller-manager-metrics-service with identical specs. The only difference was the conditional guard (.Values.rbac.enable vs .Values.metrics.enable). Removing the duplicate resolves the Helm conflict.
Documentation (docs/guides/troubleshooting.md): ✅ Well-structured
The new section follows the existing troubleshooting pattern with clear:
- Symptoms
- Diagnosis commands
- Cause explanation
- Solution with examples
- Common patterns table
Verified that the Helm values (values.yaml lines 52-67) and template (manager.yaml lines 68-75) support the documented tolerations and nodeSelector configuration.
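For readers checking whether a given tolerations value would actually let the pod schedule, the matching rule can be sketched as below. This is an illustrative model of the Kubernetes toleration semantics, not code from the chart or the operator. The class and function names are hypothetical, and only `NoSchedule` and `NoExecute` taints are treated as blocking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator Exists matches every key
    operator: str = "Equal"  # Kubernetes defaults to Equal when unset
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    if tol.operator in ("", "Equal"):
        return tol.value == taint.value
    return False


def untolerated(taints, tolerations):
    """List the taints that would still keep the pod off the node."""
    blocking = ("NoSchedule", "NoExecute")
    return [
        t for t in taints
        if t.effect in blocking and not any(tolerates(tol, t) for tol in tolerations)
    ]
```

An empty result from `untolerated` means taints alone would not prevent scheduling on that node. A nodeSelector mismatch or insufficient resources can still do so.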
Minor observation: In dist/chart/templates/certmanager/certificate.yaml:54, the service name is team-operator-metrics-service while elsewhere it's team-operator-controller-manager-metrics-service. Unrelated to this PR but might be worth checking separately.
Replace hardcoded service name with .Values.controllerManager.serviceAccountName
The certificate referenced 'team-operator-metrics-service' but the actual service is named 'team-operator-controller-manager-metrics-service'. This mismatch would cause TLS validation failures for metrics scraping.
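To see why the mismatch breaks TLS validation, it helps to spell out the in-cluster DNS names a client uses to reach a Service and compare them with the certificate's DNS names. The sketch below assumes the default cluster domain `cluster.local`. The function names and the namespace used in the usage note are hypothetical, not taken from the chart.

```python
def service_dns_names(service: str, namespace: str, cluster_domain: str = "cluster.local"):
    """In-cluster DNS names under which a Service is normally reachable."""
    return [
        f"{service}.{namespace}.svc",
        f"{service}.{namespace}.svc.{cluster_domain}",
    ]


def missing_dns_names(cert_dns_names, service: str, namespace: str):
    """Service DNS names that the certificate does not cover."""
    covered = set(cert_dns_names)
    return [n for n in service_dns_names(service, namespace) if n not in covered]
```

With a certificate issued for `team-operator-metrics-service` and a scrape target of `team-operator-controller-manager-metrics-service` in some namespace such as `team-operator-system` (an assumed name), `missing_dns_names` returns both service names. A client that verifies the server certificate would reject the connection in that case.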
Summary
- Remove dist/chart/templates/rbac/auth_proxy_service.yaml, which duplicated the metrics service
- Fix the certificate DNS name (team-operator-metrics-service → team-operator-controller-manager-metrics-service)
- Add troubleshooting docs for pod scheduling failures due to taints
Why
Two templates were creating the same team-operator-controller-manager-metrics-service, causing Helm conflicts during installation. This was a contributing factor in migration failures.
The certificate DNS name fix ensures TLS validation works correctly when Prometheus scrapes metrics over HTTPS with cert-manager enabled.
The tolerations documentation helps operators diagnose and resolve scheduling failures on clusters with tainted nodes.
Related
Bean: ptd-fvuq